Well, gosh. What a busy few months. What started as a bit of spit and polish to a website has turned into a huge rewrite and a great opportunity to learn and play with MVC(2), the Spark view engine, Lucene.net and NHibernate. But more on those some other time. Today, I’m interested in a performance problem that wasn’t what I was expecting it to be.
One core scenario of this website is that users can watch video files. We don’t want to allow them to view just anything, so we grant them access to a subset of the videos. Which means we have to answer the following question: which programmes does the user have access to watch?
This is calculated like so: get the video access tokens for the user. Map each token to an episode. Strip out episodes which don’t have videos available. Retrieve the programmes for all viewable episodes.
For better or worse, this touches 5 tables across 3 separate databases.
We wrote a cache.
And you know what? It works great. The site is fast, the data is readily available, everybody is happy.
Except for one user.
We have a demo user that is granted access to a large number of episodes in a large number of programmes, ready to be used at conferences and other events.
This user killed the site.
Once the data was in the cache, everything was great. But it took 5 minutes to get into the cache. And killed the site while it was doing so.
I was expecting some trouble with this user – they didn’t work well on the existing site, and it’s obvious that we’re doing too much database activity. So I fired up the wonderful NHProf to see exactly what was going wrong, and immediately proved myself wrong. All the data was being retrieved with a minimum of fuss, SQL statements and round trips to the database. It took 5 seconds.
Now, we are definitely doing too much with the database here. We really could do with denormalising some of the data, using fewer databases and pulling back ids instead of unused entities. But it still only took 5 seconds. As an edge case – populating a cache for a user only used by staff at conferences – 5 seconds is pretty much OK.
So what was happening for the remaining 4 minutes and 55 seconds?
The title of the post might help. The heart of this little process was a LINQ to objects query that joined up the lists from the various databases to create the final list of programmes, episodes and tokens that we could cache and present to the user. It went something like this:
var tokens = GetTokens();
var episodes = GetEpisodes(tokens);
var availableEpisodes = GetAvailableEpisodes(episodes);
from p in GetProgrammes(availableEpisodes)
select new ProgrammesWithTokens(p.Id, p.Title, GetTokensForProgramme(p, episodes, tokens))
where GetTokensForProgramme was defined like:
from t in tokens.OrderByDescending(x => x.GrantedDate)
from e in episodes
where e.OpaqueStringId == t.OpaqueStringId
   && e.EpisodeNumber == t.EpisodeNumber
   && p.Episodes.Contains(e)
select t
Now, there are two glaring issues going on here. We’re re-ordering the tokens list on every call – we could do that once, before anything else. Similarly, we could do something about the p.Episodes.Contains(e) check, which traverses a list and isn’t going to be very quick.
But that’s small fry. Where’s the real problem?
It took me a little while to realise that what looks deceptively like nice, declarative, friendly LINQ, is in fact (of course) nested loops. Three deep. This simple bit of data collation is O(n * m * p).
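To make that concrete, here’s a minimal sketch (with simplified stand-in Token and Episode types, not our real entities) showing what the compiler actually does with two from clauses – they become a SelectMany, i.e. a loop inside a loop:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tokens = new List<Token> { new("a", 1), new("b", 2) };
var episodes = new List<Episode> { new("a", 1), new("c", 3) };

// Two 'from' clauses read declaratively...
var pairs = (from t in tokens
             from e in episodes
             where e.OpaqueStringId == t.OpaqueStringId
                && e.EpisodeNumber == t.EpisodeNumber
             select (t, e)).ToList();

// ...but compile to SelectMany – for every token, scan every episode.
// Morally: foreach (var t in tokens) foreach (var e in episodes) if (...) yield;
Console.WriteLine(pairs.Count); // every token/episode combination was compared to find this

// Simplified stand-ins for the real entities (illustrative only).
record Token(string OpaqueStringId, int EpisodeNumber);
record Episode(string OpaqueStringId, int EpisodeNumber);
```

Wrap that in a third loop over programmes – which is exactly what calling GetTokensForProgramme per programme does – and you’re three deep.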
I didn’t mention data size before. Turns out it’s not really that large. The demo user has about 1600 tokens, for 1500 available episodes for about 300 programmes.
Not much, actually.
But when you count the looping, this gives us 300 * 1600 * 1500 = 720,000,000 iterations.
Seven hundred and twenty MILLION iterations. That’s 720 MILLION string comparisons, and 720 MILLION list traversals for the Contains calls. It’s also 299 unnecessary orderings of the tokens list.
That’s where my 5 minutes was going.
Rewriting this as a bunch of lookups has reduced this bottleneck to essentially the time it takes to read from the database – about 5 seconds.
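The shape of that rewrite, roughly – the field names come from the snippets above, the rest is an illustrative sketch, not our production code. Sort once, build an ILookup keyed on the join fields once, and then each probe is a hash lookup instead of a scan:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tokens = new List<Token>
{
    new("a", 1, new DateTime(2010, 1, 2)),
    new("a", 1, new DateTime(2010, 1, 1)),
    new("b", 2, new DateTime(2010, 1, 3)),
};
var episodes = new List<Episode> { new("a", 1), new("b", 2) };

// Sort once, up front – not once per programme.
var sortedTokens = tokens.OrderByDescending(t => t.GrantedDate).ToList();

// Build a lookup keyed on the join fields once: a single pass over the
// tokens, instead of rescanning the whole list for every episode of
// every programme. ToLookup preserves the sorted order within each key.
var tokensByEpisode = sortedTokens.ToLookup(t => (t.OpaqueStringId, t.EpisodeNumber));

// Each episode's tokens are now one hash probe, not a nested loop.
foreach (var e in episodes)
    foreach (var t in tokensByEpisode[(e.OpaqueStringId, e.EpisodeNumber)])
        Console.WriteLine($"{e.OpaqueStringId}/{e.EpisodeNumber}: granted {t.GrantedDate:yyyy-MM-dd}");

// Simplified stand-ins for the real entities (illustrative only).
record Token(string OpaqueStringId, int EpisodeNumber, DateTime GrantedDate);
record Episode(string OpaqueStringId, int EpisodeNumber);
```

A handy property of ILookup is that probing a missing key returns an empty sequence rather than throwing, so episodes with no tokens just fall out naturally.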
So the moral of the story is simple. LINQ is nice. It’s lovely. But when you write multiple from clauses, don’t forget you’re really writing nested loops.
That can really hurt.