LINQ to SQL: Lazy and Eager Loading Hiccups
In a recent post, I covered a few strategies for using LINQ to SQL in a traditional multi-tier architecture. The winning strategy for me, so far anyway, is DataContext-per-unit-of-work – it’s both convenient and safe. Or so I thought.
By default, child collections in LINQ to SQL are lazy-loaded. That is, they’re not loaded until they’re required, then the child entities are populated on a one-by-one basis. In the worst case this causes the “Select N+1” problem. If we’re not going to use the child entities, great, we win, but if we know we are, it’s better to load them all in advance in a single query.
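To make the problem concrete, here's a minimal sketch (assuming the usual designer-generated Customer and Order entities, with an Orders association on Customer):

using (MyDataContext context = new MyDataContext())
{
    // 1 query: SELECT ... FROM Customers
    var customers = context.Customers.ToList();

    foreach (var customer in customers)
    {
        // Touching the lazy collection issues one further
        // SELECT ... FROM Orders WHERE CustomerID = @p0 per customer,
        // so we end up with 1 + N queries in total.
        Console.WriteLine("{0}: {1} orders", customer.CustomerName, customer.Orders.Count);
    }
}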
RIP: the “Including” operator
Before the March 2007 CTP of LINQ to SQL, we saw plenty of demos and code snippets using the “Including” operator. It looked like this:
var query = (from customer in MyDataContext.Customers
             select customer)
            .Including(customer => customer.Orders);
That was a nice, concise syntax to invoke eager loading selectively (i.e. on a per-query basis). It’s quite similar to NHibernate’s ICriteria.SetFetchMode("Orders", FetchMode.Eager) mechanism for controlling per-query loading strategies.
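For comparison, the NHibernate version looks something like this (a sketch, assuming an open ISession and a mapped Customer class with an Orders collection):

IList<Customer> customers = session.CreateCriteria(typeof(Customer))
    .SetFetchMode("Orders", FetchMode.Eager)   // eager-load Orders for this query only
    .List<Customer>();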
Unfortunately, LINQ to SQL’s “Including” syntax was killed in the March CTP. Random forum comments and blog posts imply it was removed because it was either too buggy or just didn’t fit with the general architecture. Either way, we’re now left a bit stuck if we want eager loading on a per-query basis.
The official solution: DataLoadOptions
First it was called DataShape, then it was renamed to DataLoadOptions. You use it like this:
MyDataContext context = new MyDataContext();
DataLoadOptions options = new DataLoadOptions();
options.LoadWith<Customer>(c => c.Orders);
context.LoadOptions = options;
var query = from c in context.Customers
            // ... etc
Not exactly elegant, but it can be a powerful way to control the “shape” of the data that’s returned. Note also the DataLoadOptions.AssociateWith() method, which lets you control the ordering and filtering of the related entities too.
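For example, something along these lines eagerly loads each customer’s orders but restricts and sorts them first (a sketch; the ShippedDate and OrderDate columns are made up purely for illustration):

DataLoadOptions options = new DataLoadOptions();
options.LoadWith<Customer>(c => c.Orders);
// AssociateWith controls which related rows come back, and in what order.
// ShippedDate and OrderDate are hypothetical columns, just for illustration.
options.AssociateWith<Customer>(
    c => c.Orders.Where(o => o.ShippedDate == null)
                 .OrderByDescending(o => o.OrderDate));
context.LoadOptions = options;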
Unfortunately, changes to DataContext.LoadOptions affect the entire DataContext (obviously), and once assigned, cannot be changed during the lifetime of the DataContext (for consistency). This is completely at odds with the DataContext-per-unit-of-work strategy, as different subtasks within the same unit of work now have unwanted dependencies on each other.
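To see how it bites, consider one shared DataContext serving two subtasks (a sketch):

// One DataContext for the whole unit of work, shared by two subtasks
MyDataContext context = new MyDataContext();

// Subtask A wants customers with their orders eagerly loaded
DataLoadOptions options = new DataLoadOptions();
options.LoadWith<Customer>(c => c.Orders);
context.LoadOptions = options;
var customersWithOrders = context.Customers.ToList();

// Subtask B only wants bare customers, but it now gets the orders too,
// and it can't switch strategies: the DataContext rejects any attempt to
// assign new LoadOptions once results have been returned.
var bareCustomers = context.Customers.ToList();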
Trying to use DataLoadOptions safely, and failing
I’m not yet convinced that there is a good way to control fetching strategies in LINQ to SQL on a per-query basis, since DataLoadOptions is the only control we have and that only works on a per-DataContext basis. If you don’t need deferred execution in a particular case, you can do this:
public static IList<Customer> CustomersWithOrders
{
    get
    {
        return Customers
            .Select(c => new { Customer = c, Orders = c.Orders })
            .ToList()
            .Select(pair => pair.Customer)
            .ToList();
    }
}
Here we load pairs (Customer, Orders), then from that project back to simple Customer objects. Since the relevant Order records are already loaded, no further queries are needed to evaluate customer.Orders for these customers.
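So the consuming code can iterate the result without triggering any further round-trips, along these lines:

foreach (Customer customer in CustomersWithOrders)
{
    // customer.Orders is already populated, so this stays in memory
    Console.WriteLine("{0} has {1} orders", customer.CustomerName, customer.Orders.Count);
}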
If anybody can help me out with a better solution, i.e. one that retains the deferred IQueryable semantics, I would be really glad to hear it. Until then, this is pretty frustrating. It may be that I’m on completely the wrong lines, but the LINQ to SQL multi-tier story doesn’t yet appear to have a happy ending.
SQL Server in the future
One limitation when eagerly loading object graphs is the dimensionality of the result set. To get customers and their orders, we have to JOIN the two tables together, and waste a lot of space to represent the parent-child relationship in two dimensions, like this:
CustomerID | CustomerName | OrderID | ProductID | ProductName | Price
1          | Steve        | 53      | 77        | Ferrari     | $50,000
1          | Steve        | 53      | 91        | Hamster     | $4.99
2          | Frank        | 54      | 35        | Chips       | $0.50
2          | Frank        | 54      | 91        | Hamster     | $4.99
2          | Frank        | 54      | 102       | Submarine   | $3,500,000
The other problem, besides the wasted space, is that it’s hard to page over data like this. How many rows must you retrieve to be sure of getting the first 10 complete Customer records? LINQ to SQL, like other ORMs, does some clever trickery to shield you from this problem, but it’s still wasteful.
What I’d really like to see from SQL Server is multi-dimensional result sets. You can only have two-dimensional result sets today, but why stop there? Of course, it’s hard to visualise 6-dimensional results in Query Analyzer, but I don’t really care about that. If aggregation operations weren’t forced to collapse lists onto scalars, we could GROUP BY customer and get the list of Orders as a child table:
CustomerID | CustomerName | Orders
1          | Steve        |   OrderID | ProductID | ProductName | Price
           |              |   53      | 77        | Ferrari     | $50,000
           |              |   53      | 91        | Hamster     | $4.99
2          | Frank        |   OrderID | ProductID | ProductName | Price
           |              |   54      | 35        | Chips       | $0.50
           |              |   54      | 91        | Hamster     | $4.99
           |              |   54      | 102       | Submarine   | $3,500,000
This would map trivially onto an object graph and is very easy to page over. I guess there are lots of reasons why SQL Server doesn’t already do this. Perhaps someone with more knowledge of its architecture can explain.
By the way, this is exactly what LINQ to Objects does when you use its “group … by” clause, and it’s very slick.
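For instance (an in-memory sketch; the data and property names are made up purely for illustration):

var lines = new[]
{
    new { CustomerName = "Steve", ProductName = "Ferrari", Price = 50000m },
    new { CustomerName = "Steve", ProductName = "Hamster", Price = 4.99m },
    new { CustomerName = "Frank", ProductName = "Chips",   Price = 0.50m },
};

// "group ... by" yields a genuinely hierarchical result: one group per
// customer, each carrying its own child list of order lines.
var byCustomer = from line in lines
                 group line by line.CustomerName into g
                 select new { Customer = g.Key, Lines = g.ToList() };

foreach (var customer in byCustomer)
    Console.WriteLine("{0}: {1} lines", customer.Customer, customer.Lines.Count);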