LINQ to Objects Join Query Optimization

06 October 2014

I still remember first time running .NET 3.5 and learning about Language-Integrated Query (LINQ). To this day C# is my favorite language mainly because of LINQ. But LINQ has a dark side. It allows people to write “looking nice – performing horribly” code.

Everyone is aware that LINQ to SQL can generate insane SQL query and choke the server, but not too many are thinking twice when using LINQ to objects. It is important to understand how LINQ is implemented to avoid errors from example below.

Let’s assume we want to join two sequences: small and big.

var english = new[] {

new { N = 1, T = "One" }, new { N = 2, T = "Two" },

new { N = 3, T = "Three" }, new { N = 4, T = "Four" },

new { N = 5, T = "Five" }, new { N = 6, T = "Six" },

new { N = 7, T = "Seven" }, new { N = 8, T = "Eight" },

new { N = 9, T = "Nine" }, new { N = 10, T = "Ten" }

};

var smallEnglish = english.Select(x => x);

var french = new[] {

new { N = 1, T = "Un" }, new { N = 2, T = "Deux" },

new { N = 3, T = "Trois" }, new { N = 4, T = "Quatre" },

new { N = 5, T = "Cinq" }, new { N = 6, T = "Six" },

new { N = 7, T = "Sept" }, new { N = 8, T = "Huit" },

new { N = 9, T = "Neuf" }, new { N = 10, T = "Dix" }

};

var bigFrench = Enumerable.Range(0, 50000000).Select(x => french[x % 10]);

Small sequence smallEnglish has only 10 elements and big sequence bigFrench has 50 millions. Let’s join them on N property. N stands for Number, T stands for Text. I never do this in production code, but for reading this blog on a small device having shorter names is important. Output should give us array of pairs containing text on English and text on French for each number, total 50 million records.

Let’s create join using small array on the left side and big one on the right:

Stopwatch sw = new Stopwatch();

sw.Start();

// Query syntax:

var joinOne = from e in smallEnglish

join f in bigFrench

on e.N equals f.N

select new { E = e.T, F = f.T };

// Method syntax:

// var joinOne = smallEnglish.Join(bigFrench, e => e.N, f => f.N, (e, f) => new { E = e.T, F = f.T });

int items = joinOne.Count();

sw.Stop();

Console.WriteLine("Items: " + items);

Console.WriteLine(string.Format("Time: {0}", sw.ElapsedMilliseconds));

Using query or method syntax is irrelevant. Result is the same. As we know, execution is delayed. Only when we ask for Count() on joinOne, query will be executed in order to count elements.

Let’s see how expensive this code was:

We have a problem. Memory usage jumped to 371 MB !

Let’s think what just happened here. Instead of thinking like query syntax, let’s think of join observing method syntax:

var joinOne =

smallEnglish.Join(bigFrench, e => e.N, f => f.N, (e, f) => new { E = e.T, F = f.T });

Join is extension method on IEnumerable<>. This IEnumerable (in our case smallEnglish) is “streamed” => Reading one by one element as needed to build result. On the other side, right side is loaded immediately (buffered) and for each element from the left sequence we are searching thru the right one.

What if we “stream” bigger sequence? Let’s try to swap join sides. Result should be the same, except in different order:

var joinTwo = from f in bigFrench

join e in smallEnglish

on f.N equals e.N

select new { E = e.T, F = f.T };

int items = joinTwo.Count();

Bingo! Not only code runs much faster (7.6 seconds vs. 12.4), but memory usage is only 6.3MB. That is 58 times less:

If you are wondering how different results are, let’s cut big sequence to only 20 elements. First query result is on the left, and second query result is on the right:

As you see, difference is only in order. Also, this could help you to understand and confirm that join works as is described. One element from first sequence is loaded and second (right) sequence is searched for matching elements.

I spent hundreds of hours fixing LINQ to SQL queries using TRACE or SQL Profiler and for more complex queries I ended up analyzing query execution paths. At the same time, I didn’t think that LINQ to objects could have optimization techniques too. I was obviously wrong.