March 09, 2007
Problems of Data Access Layers

Britt Ekland, who played John Cassavetes' wife in “Machine Gun McCain” and was later a Bond girl.

A nice little summary of the issues involved in writing and using data access layers (DALs) on top of databases, by Will Hartung (I think I saw this on del.icio.us):

In Days Of Yore, with ISAM style databases, this was the typical paradigm for processing.

use OrderCustomerIndex
Find Order 'CUST001'
Fetch Order
while(order.custcode = 'CUST001')
    use OrderDetailOrderIndex
    Find OrderDetail order.orderno
    Fetch OrderDetail
    while(orderdetail.orderno = order.orderno)
        use ItemItemCodeIndex
        Find Item orderdetail.itemcode
        Fetch Item
        ...
        Fetch Next OrderDetail
    end while
    Fetch Next Order
end while

So, if you had 10 orders with 5 line items each, this code would "hit" the DB library about 100 times (10 per order: one fetch for each OrderDetail row and one for each Item).

When DB calls are "cheap", this kind of thing isn't that expensive; since every DBMS on the planet, relational or object, essentially has an ISAM (B-tree) core, they do this kind of thing all day long.

So, if you wrote your code with a DAL, you could easily come up with something like this:

(defun process-orders (cust-code)
  (let ((orders (find-orders cust-code)))
    (dolist (order orders)
      (let ((details (find-order-details (order-no order))))
        (dolist (detail details)
          (let ((item (find-item (detail-item-code detail))))
            (do-stuff order detail item)))))))

But this is a trap, and this is where THE problem occurs with almost any relational mapping process. Naively implemented with SQL, again for 10 orders with 5 line items each, you end up with SIXTY-ONE(!!!) SQL hits to the database: 1 query for the initial orders, 1 query per order for its details, and 1 query per detail for the item data. This is Bad. Round trips to the DB are SLLOOWWW, and this will KILL your performance.
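(Will doesn't show the DAL functions themselves, but here's a rough sketch of what such naive per-entity helpers might look like, assuming a hypothetical run-query function that issues one SQL statement per call; the 1 + 10 + 50 round trips fall straight out of them.)

;; Hypothetical sketch, not Will's code: each DAL helper issues its own
;; SQL statement through a hypothetical RUN-QUERY function.
(defun find-orders (cust-code)
  ;; 1 query for the customer's orders.
  (run-query "SELECT * FROM orders WHERE custcode = ?" cust-code))

(defun find-order-details (order-no)
  ;; 1 query per order.
  (run-query "SELECT * FROM orderdetails WHERE orderno = ?" order-no))

(defun find-item (item-code)
  ;; 1 query per order detail.
  (run-query "SELECT * FROM items WHERE itemcode = ?" item-code))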

Instead, you want to suck in the entire graph with a single query:

SELECT orders.*, orderdetails.*, items.*
FROM orders, orderdetails, items
WHERE orders.custcode = ?
AND orderdetails.orderno = orders.orderno
AND items.itemcode = orderdetails.itemcode

Then, you use that to build your graph so you can do this:

(defun process-orders (cust-code)
  (let ((orders (find-orders cust-code)))
    (dolist (order orders)
      (let ((details (order-details order)))
        (dolist (detail details)
          (let ((item (detail-item detail)))
            (do-stuff order detail item)))))))
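(The graph-building step isn't shown in Will's post either. A minimal sketch, assuming a hypothetical row-field accessor on the joined result rows and illustrative structs whose accessors happen to match the code above, might look like this.)

(defstruct order no status total (details '()))
(defstruct detail item-code item)
(defstruct item code description)

;; Hypothetical sketch: group the flat rows from the single join query
;; into an order -> detail -> item graph.  ROW-FIELD is assumed to pull
;; a named column out of a result row.
(defun build-order-graph (rows)
  (let ((orders (make-hash-table :test #'equal)))
    (dolist (row rows)
      (let* ((order-no (row-field row "orderno"))
             (order (or (gethash order-no orders)
                        (setf (gethash order-no orders)
                              (make-order :no order-no
                                          :status (row-field row "status")
                                          :total (row-field row "total"))))))
        (push (make-detail
               :item-code (row-field row "itemcode")
               :item (make-item :code (row-field row "itemcode")
                                :description (row-field row "description")))
              (order-details order))))
    ;; Return the assembled orders as a list.
    (loop for order being the hash-values of orders collect order)))

In this version, find-orders runs the single join, hands the rows to something like build-order-graph, and every later accessor call is just slot access with no further trips to the database.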

And THAT's all well and good, but say you wanted to get the total of all canceled orders:

(defun canceled-orders-total (cust-code)
  (let ((orders (find-orders cust-code))
        (total 0))
    (dolist (order orders)
      (when (equal (order-status order) "C")
        (setf total (+ total (order-total order)))))
    total))

This is fine, but you just sucked in ALL of the orders, including their detail and item information, just to get at a simple header field.

Now, certainly this could have been done with a simple query as well:

SELECT sum(total) FROM orders WHERE status = 'C' AND custcode=?
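(In DAL terms, that might just mean giving the layer a narrower, purpose-built entry point. A sketch, using the same hypothetical run-query helper as above:)

;; Hypothetical sketch: push the aggregation into the database instead
;; of materializing the whole order graph.  One round trip, one number.
(defun canceled-orders-total (cust-code)
  (run-query "SELECT sum(total) FROM orders WHERE status = 'C' AND custcode = ?"
             cust-code))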

The point is that balancing the granularity of what your DAL does, and how it does it, needs to be determined up front, at least in a rough-edged way. This is what I mean by being aware that an RDBMS is going to be your eventual persistent store. If you don't assume that, you may very well end up with a lot of code like the first example, and it will immediately suck as soon as you upgrade to a SQL database.

In my own days of yore (1997ish), I wrote a pretty complicated data access layer for a big Win32 application—in Visual Basic. The basic idea, which was Jim Firby's, was pretty neat, in that it made databases look almost like associative arrays.

Data was addressed with tuples, which we called “IGLs” (for Info Group Locator, or something like that). So ["books_by_isbn" "0140125051" "title"] would be the address of the "title" field of the record whose key field was “0140125051” in the "books_by_isbn" table. ["books_by_isbn" "0140125051" "*"] would be the collection of fields for book 0140125051, and ["books_by_isbn" "*" "*"] would be all books in the table.
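The real thing was Visual Basic, but to give a flavor of the addressing idea, here's a hypothetical Lisp-flavored sketch (none of these names are the actual Datalink API, it reuses the same hypothetical run-query helper, and it pretends every table has a single column literally named key, which is a simplification):

;; Hypothetical sketch of IGL-style addressing, not the real Datalink API.
;; An IGL is a (table key field) tuple; "*" acts as a wildcard.
(defun igl-get (table key field)
  (cond ((and (string= key "*") (string= field "*"))
         ;; ["table" "*" "*"]: every field of every record.
         (run-query (format nil "SELECT * FROM ~A" table)))
        ((string= field "*")
         ;; ["table" key "*"]: every field of one record.
         (run-query (format nil "SELECT * FROM ~A WHERE key = ?" table) key))
        (t
         ;; ["table" key "field"]: a single field of a single record.
         (run-query (format nil "SELECT ~A FROM ~A WHERE key = ?" field table)
                    key))))

;; e.g. (igl-get "books_by_isbn" "0140125051" "title")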

There was a virtual translation layer, so if you wanted you could have a single "books" table that contained both ISBNs and ASINs, and then define virtual "books_by_isbn" and "books_by_asin" tables that would let you use IGLs that indexed by ISBN or ASIN respectively. It seems a little insane now, but actually every column of every table went through the virtual translation map, which was just another table in the database. What can I say, we were Lispers with an AI background, indirection was in our blood.
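A rough sketch of that translation step (again hypothetical; in the real Datalink the map was just another table in the database, not a Lisp list):

;; Hypothetical sketch: map virtual table names to a physical table and
;; the column used as its key.
(defparameter *virtual-tables*
  '(("books_by_isbn" . ("books" . "isbn"))
    ("books_by_asin" . ("books" . "asin"))))

(defun resolve-virtual-table (name)
  ;; Returns (values physical-table key-column), or (values nil nil)
  ;; if the name isn't mapped.
  (let ((entry (cdr (assoc name *virtual-tables* :test #'string=))))
    (values (car entry) (cdr entry))))

An IGL against "books_by_asin" would then end up as a lookup on the books table keyed by the asin column.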

There was so much to the Datalink, as my DAL was called: 14,000 lines of shared caches, events, rules engines and schema migration support. But the result for users of the code was that complex database access patterns seemed simple and clean.

But of course we ran into the problem that some combinations of simple, clean Datalink operations would be fast and others would be slow, and unless you knew a lot about the details of the Datalink implementation you couldn't easily predict which way it would go. Which is exactly the sort of trap Will mentions.

I ended up writing a series of tech notes on how to structure common operations with the Datalink to keep things fast. It was effective, in that users of the code could then get good performance, but it didn't feel like the most satisfying solution. Then again, nobody really seems to know how to come up with a satisfying solution to this problem.

Posted by jjwiseman at March 09, 2007 02:01 PM
Comments

The principle of locality would suggest that table joins and filtering be performed using SQL. Remarks regarding SQL databases as a persistence layer, and table designs suggestive of pointers and associative arrays, reveal a failure to embrace the concept of a database as a system to manage predicates and true propositions. App developers should be writing queries, not DALs. Business rules should live in the database (including entity and referential integrity), not in some middle or client tier.

Posted by: Kevin Davis on March 10, 2007 03:04 PM

I ran into this problem, too. I created a query DSL for my specific data + an interpreter to translate the queries into decent SQL on the fly. It's an ordinary eDSL implemented as a combinator (or macro) library. I believe it solves the problem described here.

Posted by: surana on March 10, 2007 07:34 PM

I have to agree with Kevin. A database isn't just a persistence layer. Just as you want people to use map and fold (work on lists) instead of writing low-level loops (work on individual items), you should use the query language for aggregate operations (work on sets) instead of looking things up one at a time (work on individual items). The easiest approach is to use SQL directly but there are probably other approaches.

Posted by: Amit Patel on March 10, 2007 09:56 PM

Yeah, I feel like I learned my lesson about using a database as little more than a persistence layer, or layering an object model on top of it. Though it's apparently a common temptation.

Posted by: John Wiseman on March 12, 2007 09:49 AM