Site Meter
 
 

Monthly Archives: December 2007

Outsmarted by LINQ-to-SQL

(and how to read a SQL execution plan)

I’m using LINQ-to-SQL on a current project, which is mostly a pretty positive experience (ignoring the odd frustrating limitation) – it’s incredibly easy to set up.

When using any object-relational mapper (ORM), LINQ-to-SQL, NHibernate or whatever, you can’t just blindly trust the SQL queries they’re generating. Hopefully, the queries will be as finely tuned as if you lovingly hand-crafted them yourself, but what if they’re not? Do you care if your production database server melts down? Sensibly, you’ll keep SQL Profiler open and scrutinise each new type of query.

Shock and horror

Following that advice, when I started with LINQ-to-SQL I used SQL Profiler to see what it was getting up to. For example, I had everyone’s favourite Customer-Orders relationship:

schema

… and I was doing a query to find the customers who have never ordered anything:

ExampleDBDataContext context = new ExampleDBDataContext();
 
// The query
var customers = from c in context.Customers
		where c.Orders.FirstOrDefault() == null
		select c;
 
foreach (Customer c in customers)
	Console.WriteLine(c.Name);

It worked, but I was appalled to see it generate the following SQL:

SELECT [t0].[CustomerID], [t0].[Name], [t0].[CreatedDate]
FROM [dbo].[Customers] AS [t0]
WHERE NOT (EXISTS(
    SELECT TOP (1) NULL AS [EMPTY]
    FROM [dbo].[Orders] AS [t1]
    WHERE [t1].[CustomerID] = [t0].[CustomerID]
    ))

Smell that dirty subselect! As someone who’s been brought up on the mantra "subselects are bad; always use joins!", I wanted to write to Microsoft and educate them that the correct SQL would be:

SELECT [t0].[CustomerID], [t0].[Name], [t0].[CreatedDate]
FROM [dbo].[Customers] AS [t0]
LEFT OUTER JOIN [Orders] o ON t0.[CustomerID] = o.[CustomerID]
WHERE o.[OrderID] IS NULL

Mmm, that’s much cleaner and nicer. Lovely elegant joins. Stupid LINQ-to-SQL…

I think you can guess what’s coming

Let’s put my beliefs to the test. When there are 100 customers and 1000 orders, the two methods’ execution plans look like this:

ExecutionPlans-smalldata-annotated

Notice the "query cost" values (smaller is better).

LINQ-to-SQL’s method does a "stream aggregation" to get a list of distinct CustomerIDs from the Order table, then left-anti-semi join excludes any Customer rows which match one of those IDs. My "left outer join" method, on the other hand, joins all Customer-Order pairs, then has a filter to exclude any joined rows that have a CustomerID.

LINQ-to-SQL’s method wins slightly, but only very slightly. The near-identical performance is not surprising since they both scan all 100 customers and all 1,000 orders.

More data, more data

Repeating the experiment, but now with 100 customers and 1,000,000 orders, the execution plans change to:

ExecutionPlans-largedata-annotated

Agh! My elegant method is about 200 times slower than LINQ-to-SQL’s clumsy subselect! But why?

The query plan for my method remains identical, so now it has to scan all 1,000,000 order rows, joining them to customer records, and filtering out any customers with orders.

The query plan for LINQ-to-SQL’s method has changed, so now for every Customer record, it just subselects the TOP 1 matching Order record and does a left-anti-semi join (so Customers are included only if no matching Order was found). This means it doesn’t have to look through all 1,000,000 rows – it can bail out as soon as the first matching Order is found. Assuming you have a CustomerID index on the Order table, this is very fast indeed.

Conclusions

Obviously, the conclusion isn’t "LINQ to SQL is always right", or "subselects are always better than JOINs". I am merely admitting that LINQ to SQL isn’t as dumb as I thought, nor am I as clever as I thought. Oh, and scrutinising LINQ to SQL using SQL Profiler isn’t always enough; you sometimes need to inspect those execution plans too.

First ASP.NET MVC CTP is released

Posted 3 hours ago by ScottGu:

Earlier today we released the first CTP preview of an “ASP.NET 3.5 Extensions” release that we’ve been working on (click here to read my previous post about our product roadmap).  This release brings additional runtime functionality to ASP.NET and .NET 3.5.  You can download it here (if you are using the MVC features also click here to download the MVC Toolkit extras).

Looking forwards to checking it out in more detail!

LINQ to SQL: Lazy and Eager Loading Hiccups

In a recent post, I covered a few strategies for using LINQ to SQL in a traditional multi-tier architecture. The winning strategy for me, so far anyway, is DataContext-per-unit-of-work – it’s both convenient and safe. Or so I thought.

By default, child collections in LINQ to SQL are lazy-loaded. That is, they’re not loaded until they’re required, then the child entities are populated on a one-by-one basis. In the worst case this causes the “Select N+1” problem. If we’re not going to use the child entities, great, we win, but if we know we are, it’s better to load them all in advance in a single query.

RIP : “Including” operator

Before the March 2007 CTP of LINQ to SQL, we saw plenty of demos and code snippets using the “Including” operator. It looked like this:

var query = (from customer in MyDataContext.Customers select customer)
                .Including(customer => customer.Orders);

That was a nice, concise syntax to invoke eager loading selectively (i.e. on a per-query basis). It’s quite similar to NHibernate’s ICriteria.SetFetchMode(“Orders”, FetchMode.Eager) mechanism for controlling per-query loading strategies.

Unfortunately, LINQ to SQL’s “Including” syntax was killed in the March CTP. Random forum comments and blog posts imply it was removed because it was either too buggy or just didn’t fit with their general architecture. Unfortunately we’re now a bit stuck if we want eager-loading on a per-query basis.

The official solution: DataLoadOptions

First it was called DataShape, then it was renamed to DataLoadOptions. You use it like this:

MyDataContext context = new MyDataContext();
DataLoadOptions options = new DataLoadOptions();
options.LoadWith<customer>(c => c.Orders);
context.LoadOptions = options;
var query = from c in context.Customers // ... etc

Not exactly elegant, but it can be a powerful way to control the “shape” of the data that’s returned. Note the shape.AssociateWith() method which lets you control the ordering and filtering of related entities too.

Unfortunately, changes to DataContext.LoadOptions affect the entire DataContext (obviously), and once assigned, cannot be changed during the lifetime of the DataContext (for consistency). This is completely at odds with the DataContext-per-unit-of-work strategy, as different subtasks within the same unit of work now have unwanted dependencies on each other.

Trying to use DataLoadOptions safely, and failing

I’m not yet convinced that there is a good way to control fetching strategies in LINQ to SQL on a per-query basis, since DataLoadOptions is the only control we have and that only works on a per-DataContext basis. If you don’t need deferred execution in a particular case, you can do this:

public static IList<customer> CustomersWithOrders
{
    get
    {
        return Customers
                .Select(c => new { Customer = c, Orders = c.Orders })
		.ToList()
                .Select(pair => pair.Customer)
		.ToList();
    }
}

Here we load pairs (Customer, Orders), then from that project back to simple Customer objects. Since the relevant Order records are already loaded, no further queries are needed to evaluate customer.Orders for these customers.

If anybody can help me out with a better solution, i.e. that retains the deferred IQueryable semantics, I would be really glad to hear it. Until then, this is pretty frustrating. It may be that I’m on completely the wrong lines, but the LINQ-to-SQL multi-tier story doesn’t yet appear to have a happy ending.

SQL Server in the future

One limitation when eagerly loading object graphs is the dimensionality of the result set. To get customers and their orders, we have to JOIN the two tables together, and waste a lot of space to represent the parent-child relationship in two dimensions, like this:

CustomerID CustomerName OrderID ProductID ProductName Price
1 Steve 53 77 Ferarri $50,000
1 Steve 53 91 Hamster $4.99
2 Frank 54 35 Chips $0.50
2 Frank 54 91 Hamster $4.99
2 Frank 54 102 Submarine $3,500,000

The other problem, besides wasting space, is that it’s hard to page over data like this. How many rows must you retrieve to ensure you get the first 10 complete Customer records? LINQ to SQL, like other ORMs, does some clever trickery to hide you from this problem, but it’s still wasteful.

What I’d really like to see from SQL Server is multi-dimensional result sets. You can only have 2-dimensional results sets today, but why stop there? Of course, it’s hard to visualise 6-dimensional results in Query Analyzer, but I don’t really care about that. If aggregation operations weren’t forced to collapse lists onto scalars, we could GROUP BY customer, and get the list of Orders as a child table.

CustomerID CustomerName Orders
1 Steve
OrderID Lines
53
ProductID ProductName Price
77 Ferarri $50,000
91 Hamster $4.99
2 Frank
OrderID Lines
54
ProductID ProductName Price
35 Chips $0.50
91 Hamster $4.99
102 Submarine $3,500,000

This would map trivially onto an object graph and is very easy to page over. I guess there are lots of reasons why SQL Server doesn’t already do this. Perhaps someone with more knowledge of its architecture can explain.

By the way, this is exactly what LINQ-to-objects does when you use its “groupby” keyword, and it’s very slick.