One of the consequences of the mismatches between databases and ORMs is that iterating over objects can generate explosions of database queries. Often, these explosions go unnoticed because the individual queries are fast and the number of concurrent users is small, especially during development. The explosions are sometimes hard to detect, either because developers don’t look for them or the tools that can expose the explosions don’t report everything (e.g. the Django debug toolbar doesn’t report the queries generated by ajax calls).
Exploding
Django does a good job of lazily evaluating what it can, but the simple act of using the convenient dot-notation inside a loop can turn a single database query into a cascade of supporting queries.
For example, given this book model:
class Book(models.Model):
title = models.CharField(max_length=100)
author = models.ForeignKey(Author)
jacket = models.ForeignKey(Jacket)
shelf = models.ForeignKey(Shelf)
and this books query:
books = Book.objects.filter(shelf__position__lte = 10)
which returns 10 rows. Then this template:
{% for book in books %}
{{ book.jacket.image }} {{ book.title }}
by {{ book.author.name }}
{% endfor %}
would issue a single query to get all 10 rows, but would then generate 2 additional queries per iteration to get the jacket and the author for each book. So a total of 21 queries. You really need to use the Django debug toolbar to realise this – install it now and keep your eye on it until the very end of the project: it only takes a late-breaking, well-intentioned dot to cause an explosion.
A good way to detect when explosions happen is to check the number of queries in your unit tests.
self.assertNumQueries(self.BOOKS_QUERY_COUNT, self.client.get, reverse('books'))
Add these at the start of your project. It means you have to keep increasing the counts while developing, but it does make it obvious when something has exploded.
Remember, every one of those queries involves the following steps:
- create a connection to the database (tcp/ip – handshake – username, password) (set
CONN_MAX_AGE
in Django 1.6 to help reduce this overhead) - send the SQL query over the network
- parse the SQL query
- check the table and column read/write permissions
- optimise the query plan based on the latest statistics
- start a transaction
- perform the query (checking for other users’ locks)
- return the resulting metadata and rows over the network
- commit the transaction
- close the connection
Joining
A far better approach is to get the database to do the work. It can join together the jacket and author information much more efficiently (orders of magnitude more efficiently – but that’s another topic).
Django added select_related()
to try to address this problem. Given a field name, it passes the required table to the database to be joined with the original queryset, e.g.
books = Book.objects.filter(shelf__position__lte = 10
).select_related('jacket',
'author')
It is limited though, since it only works with some relationships in some directions, and it pulls in all the columns which is usually very wasteful, especially if large text fields are present. Adding defer()
or only()
to work around this can help, but isn’t pretty, e.g.
books = Book.objects.filter(shelf__position__lte = 10
).select_related('jacket', 'author'
).defer('author__biography',
'author__summary',
'author__resume')
prefetch_related()
was added later to help with more complex relationships but still issues multiple queries and does the ‘joining’ in Python. Besides that, it also has a number of caveats and special interactions with other methods. Trying to combine these with annotations for aggregate counts and such would lead to some fragile, hard-to-construct and, I think, hard-to-read syntax that can fail even for slightly complicated queries.
So instead, we could issue the following query:
books = Book.objects.raw("""
SELECT
book.id, book.title,
jacket.image AS jacket_image,
author.name AS author_name
FROM book
NATURAL JOIN shelf NATURAL JOIN author NATURAL JOIN jacket
WHERE shelf.position <= 10
""")
(note: I’ll use natural join throughout for simplicity, and short table names for that matter, but they don’t actually work as-is with Django’s table/column naming)
And then, a slightly modified template:
{% for book in books %}
{{ book.jacket_image }} {{ book.title }}
by {{ book.author_name }}
{% endfor %}
would only issue 1 query, regardless of the number of books.
Adding further related information to the original template will dramatically increase the number of queries, whereas joining another table to the raw SQL version will add none. Similarly, returning 20 books instead of 10 would just about double the number of queries for the original version (so 41 for our example), whereas the raw one would still only issue a single query.
As well as avoiding query explosions, there are other benefits to using raw SQL. A full explanation needs another blog post, but some of the major ones are:
- it’s the only way to do anything but the simplest of aggregation queries.
- it only returns what is required (assuming you avoid
SELECT *
). Pulling every column, including large blobs, over the network when they’re not referenced wastes CPU, temporary sort/cache space and network bandwidth. - you can use easy-to-read comparisons (e.g.
position <= 10
instead ofshelf__position__lte = 10
). - if you need/want, you can use database-specific features, such as function calls, range queries, NULL ordering, etc.
- checking the query plan is easier – there’s no need to set a breakpoint and then print the queryset.query.
- it returns consistent results, whereas multiple queries, even in the same transaction, can see different data depending on what’s been committed by concurrent users (unless we used the serializable isolation level, which we don’t).
Formatting
I’ve tried a number of ways to embed SQL queries into Django modules:
...raw("""
SELECT ...
FROM ...
WHERE ...
""")
is jarring.
...raw("""SELECT ...
FROM ...
WHERE ...
""")
is better, but adds too much whitespace and makes it harder to edit.
...raw("""SELECT ..."""
"""FROM ..."""
"""WHERE ...""")
avoids the extra whitespace but doesn’t ensure enough whitespace unless you’re careful to add a space at the end of each line.
None are great and none of them provide an easy way to copy and paste the query, which is necessary to tune it and check its plan in a database client.
Elevating
For a better approach, I suggest creating an sql directory (alongside models.py – it is that important) and create python modules in there to hold the named query declarations, e.g. sql/books.py:
BOOK_DETAILS = """
SELECT
book.id,
book.title,
jacket.image AS jacket_image,
author.name AS author_name
FROM
book
NATURAL JOIN shelf
NATURAL JOIN author
NATURAL JOIN jacket
WHERE
shelf.position <= 10
"""
then it can be called like this:
from sql.books import BOOK_DETAILS
books = Book.objects.raw(BOOK_DETAILS)
Clicking the BOOK_DETAILS
reference in any decent IDE will take you to the query. Now that the query has a bit more respect it can more easily be copied and pasted into other tools, and formatted as you’d like, and re-used without repetition.
One more note about formatting. Although you could put Python comments before the declaration, I suggest keeping the query comments inside the query because they may be useful on the server, e.g. when viewing the database logs. So instead of:
#Get top 10 book summaries
#(including author and jacket details)
BOOK_DETAILS = """
SELECT
...
"""
use:
BOOK_DETAILS = """
/*
Get top 10 book summaries
(including author and jacket details)
*/
SELECT
...
"""
Downsides
There are some downsides with raw()
, however.
- You need some SQL knowledge of course, and an understanding of how Django links tables together, but I think you should have that anyway.
- If your SQL does introduce some non-standard syntax (which shouldn’t be the case for straightforward joins) it could make your application less portable. Although I’d argue that some features are worth that sacrifice, and would hope the likelihood of switching to a different database would be low if you’re already using a good one.
- Even though
raw()
ensures only 1 query is used, that query must still be sent over the network and parsed every time it’s run. - The real problem is that
raw()
returns a RawQuerySet, not a QuerySet, so no further refinement methods can be applied (e.g. filter, order_by) – though arguably they’re better off being added to the SQL.
We can do better, as I’ll explain in part 2…
About us: Isotoma is a bespoke software development company based in York and London specialising in web apps, mobile apps and product design. If you’d like to know more you can review our work or get in touch.
Typo: 21 queries should be 210, I think?
1 for the books query and then 2 more per book for 10 books. I’ve re-phrased it to hopefully remove the each/every ambiguity.
Interesting post, thank you.
Instead of
from sql/books import BOOK_DETAILS
probably should be
from sql.books import BOOK_DETAILS
Yes – thank you.
Those who use SQLAlchemy may findthis link useful.
Excellent article to keep at hand 🙂