In the previous post, we saw how we could use a single, raw query in Django to combat ORMs’ tendency to generate query explosions within loops. We used raw()
with joins and some column renaming to ensure all the data we needed came back in one go. We had to modify the property names in the template slightly (e.g. book.jacket.image
became book.jacket_image
) and the result was a RawQuerySet which had a fair number of limitations but, we avoided the typical order-of-magnitude increase in query count.
Super models
There is a way we can use a raw query, to get all the benefits described in the previous post, and return a real QuerySet. We need a Django model and so need an underlying table. A much underused feature of SQL databases is the view. Views are virtual tables derived from queries. Most are currently implemented as read-only, but they provide many benefits – again another blog post would be needed to explain them all – but they include:
- adding a level of abstraction, effectively an API (which is also available to other applications)
- protecting the application from underlying table and column changes
- giving fine-grained control of permissions, especially row-level
- providing a way to efficiently add calculated columns such as
line_total = quantity * price
- avoiding de-normalising things which should remain derived from other data
- avoiding the not-insubstantial cost of repeatedly sending the same complicated queries over the network
- enabling the database to pre-compile and cache the privilege-checks and query plans
So we could define a database view (with a few extra foreign-key columns for use later):
CREATE VIEW vbook AS
SELECT
book.id,
book.id as book_id,
book.title,
jacket.id AS jacket_id,
jacket.image AS jacket_image,
author.id AS author_id
author.name AS author_name,
shelf."position"
FROM
book
NATURAL JOIN shelf
NATURAL JOIN author
NATURAL JOIN jacket
and then map a Django model to it, using Meta.db_table='vbook'
, and set the Meta.managed = False
to tell Django not to maintain the table definition.
(note again the use of natural join for simplicity (and short table names) and that they don’t actually work as-is with Django’s table/column naming)
We’ve done this in the past for some of our projects, and shoe-horning the view-creation script into Django/buildout was always the fiddly bit.
Recently, we used django-postgres which makes the model-view mapping a bit simpler. You subclass pg.View
instead of models.Model
and set the sql
property to your query, e.g.
class vBook(pg.View):
title = models.CharField(max_length=100)
jacket_image = models.ImageField(null=True)
author_name = models.CharField(max_length=100)
shelf_position = models.IntegerField()
sql = BOOK_DETAILS
manage.py sync_pgviews
then creates the views for the models in the specified application. It’s quite a lightweight tool, and simply builds and issues the SQL CREATE VIEW
statements needed to wrap each view-based model’s sql query.
(There is a Kickstarter project under way which should also add view handling to Django)
We now have an almost fully-fledged Django model. We can query the objects against an effectively de-normalised table, while still retaining an essentially normalised schema. We can chain refinement methods to the resulting QuerySet to further filter and order the objects, including
select_related()
if necessary (although it would always be preferable, given time, to put the extra join into the view). We can have custom object managers, Meta options, introspection and all the other features of a Django model, even some read-only admin which would be good for reporting.
Note that we removed the
WHERE shelf.position <= 10
from the view definition. We can define the view without such filters and leave it to the Django application to apply them, e.g.
vbooks = vBook.objects.filter(shelf__position__lte = 10)
This gives a more flexible model, more akin to the original Book model, although you could keep the WHERE clauses in if you wanted to logically partition a table into multiple views/models.
Agile
Another great reason for using such view-based models is to postpone decisions about optimising bits of the system before we know anything about it. As an example, let’s say that a book needs to have an average rating that should be calculated from the entries in a ratings table. This could accurately be calculated using SQL’s AVG function, but that might be expensive and so we might need to pre-calculate it and possibly store it outside of the database. That would require us to design, build, test, document, and maintain:
- an external data source capable of handling the unknown load and volume
- a way to link a book to the external data source, possibly using new drivers
- a way of updating the external data source for a book, duplicating the rating table entries
- a way to synchronise the two data storage systems, possibly involving cron jobs and queues
- a way to build and sustain the data source and links in whatever production environment we use, e.g. AWS
All of this would add complexity, up-front R&D and ongoing maintenance. The results would not be as fresh or reliable as the database results, and not necessarily any faster. Using AVG in a view leaves us scope to replace how it is calculated at a later date, once we know more about the performance.
Often, caching the book results using Django’s cache is enough and we can stick with the basic approach. Premature optimisation could well be a huge waste of effort, and the view-based models let us defer those decisions and get started very quickly.
Abstract
Defining views early in a project could also be a way to postpone building complex parts of the system. Let’s say a book has a complicated ‘cost’ attribute and we’re not sure yet how it will be calculated or where all the data will come from but we need to start to display it during the initial iterations. We could add the column to the book view on day one and worry about where it comes from later. e.g.
CREATE VIEW vbook AS
SELECT
book.id,
book.id as book_id,
book.title,
6.283185 AS cost, --todo: use a cost function here instead
...
And then vbook.cost
can be used in the knowledge that the reference won’t change. Also, if the cost calculation is defined within the view and it needs to change post-production, the view can be recreated while the application is running with no migration or down-time.
More modelling
We can further enhance the view-based model and add relationships to give us back the all-too-convenient dot-notation – still useful if we’re careful to use it outside of large loops. We should make sure any relationships don’t give Django the impression that it needs to try to maintain integrity – it doesn’t need to since the underlying table only has virtual rows. We can do this using on_delete=models.DO_NOTHING
, e.g.
class vBook(pg.View):
title = models.CharField(max_length=100)
jacket_image = models.ImageField(null=True)
author_name = models.CharField(max_length=100)
shelf_position = models.IntegerField()
author = models.ForeignKey(Author,
on_delete=models.DO_NOTHING)
#needs author_id to be returned by the sql
sql = BOOK_DETAILS
These are, of course, complete Django relationships and so we can access them from any direction, e.g.
my_author.vbook_set.all()
would return an author’s books as vBook objects using a single extra query. You could even go further and use select_related('vbook')
when getting the author. Django treats these models just like any other.
We can’t save or create using such view-based models – the database will reject that for anything but the simplest views, so we still use the underlying table-based models to do that. But we can link the models together, with a one-to-one relationship, to make things easier, e.g.
class vBook(pg.View):
book = models.OneToOneField(Book,
on_delete=models.DO_NOTHING)
#needs book_id to be returned by the sql
title = models.CharField(max_length=100)
jacket_image = models.ImageField(null=True)
author_name = models.CharField(max_length=100)
shelf_position = models.IntegerField()
author = models.ForeignKey(Author,
on_delete=models.DO_NOTHING)
#needs author_id to be returned by the sql
sql = BOOK_DETAILS
Now, given a vbook instance, we can update the underlying book, for example:
vbook.book.title='A Tale of Two Cities'
vbook.book.title.save()
And also, given a book instance, we can ask for all the pre-joined book information:
book.vbook.author_name
There are still some questions about how best to use these models, e.g. how to share methods with underlying table-based models and how best to name fields from other models.
Deferred Optimisation
Once we’ve defined view-based models to bring together the core concepts of the system, we can build the application to use them for reading and the base-table models for writing. We then have a variety of options to greatly optimise the system at a later date.
Here are some ways we can optimise database views. The important points are:
- just by using views to join tables, we’ve already made great savings in the number of database queries needed
- the ideas below can be applied later, once we know if and where further optimisation is needed
- none of these would require any changes to the application code using the data: they’re true implementation changes, and database views give us such a clear separation of interface and implementation that the application could stay running!
Pull from other tables
For example, at a later point, we could modify the view to join to a table containing pre-calculated ratings and return that instead of the AVG sub-select.
Pull from external systems
We could use PostgreSQL’s foreign data wrappers to join to external data sources. It might be useful in itself to map a Django model to such a foreign table without an intermediate view, if all the data is foreign.
Use database functions
Database functions can be written in a number of languages including PL/pgSQL, Python, Ruby and Javascript. They are pre-compiled and are very efficient not least because they can access the data without it needing to be returned first and are in situ so can filter out unwanted results.
Load-balance across read-only replicas
PostgreSQL has the concept of standby/slave servers that can be kept synchronised with a master server. These can be made available for read-only queries to spread the load. Since we can be sure that view-based models are only used for reading, we could use Django’s database router to route all queries against those models to the slave servers.
Materialise
Views are virtual, except when they’re not. PostgreSQL 9.3 added materialised views.
CREATE MATERIALIZED VIEW vbook AS
SELECT ...
These are still declared as derived tables but the data in them is physically stored for much faster access: the results are effectively cached in the database. These are useful if the application can handle some staleness, and the only change necessary is in the way the view is declared – the application needn’t know or care. Such views can be refreshed periodically (although before 9.4 the refresh takes out a table lock). Indexes can be added to materialised views, potentially giving massive performance increases, especially for calculated columns.
Although I think materialisation is a last resort, it’s a very powerful one. I’ve added a MaterializedView class to isotoma/django-postgres which should help create them, though it could do with some more testing and options to control refreshing.
This really gives the best of both worlds: a single query can provide Django objects with the speed expected from de-normalised storage, but derived by the database from a normalised schema.
Summary
Django’s ORM makes accessing related objects simple and convenient. However, when using an ORM, accessing related objects in loops often leads to an explosion of supporting queries, which can go unnoticed during development but which can lead to poor performance. Django’s ORM has some methods that try to alleviate the problem but they have limitations. We can use raw SQL to efficiently join related information in the database to avoid these query explosions, as well as giving us more powerful ways to group and summarise data. I think these SQL queries should be given more prominence in our projects.
We can go further and push the raw SQL into the database by declaring virtual tables (database views). We can then map Django models onto these virtual tables to also give an extra layer of abstraction on top of the base models. This lets us defer implementation decisions and provides lots of ways to optimise the system at a later stage, once we know more about its performance, without affecting the application code.
About us: Isotoma is a bespoke software development company based in York and London specialising in web apps, mobile apps and product design. If you’d like to know more you can review our work or get in touch.
i totally that view are very good for reducing the number of queries.
but there are a problem. i found hard to use them with testing and with south. so it will be the best if you write a part 3 about testing these things …
A follow up post would be good since we’re collecting a list of implementation tips.
Meanwhile, in your unit tests you need to make sure the views are created, so either a custom test runner or something like:
from django.core.management import call_command
class BaseTestCase(TestCase):
def setUp(self):
call_command('sync_pgviews', force=True) #needs to run after migrate to see latest tables
Your approach seems very interesting yet not vastly used. Could you think of a few reasons why or ‘implementation details’ you have omitted that make this much harder in reality?
Ok, what exactly does Proper Scotch Egg by Alex Urban come from or refer to? I did a google search in hope of a new SF book and found nothing. Is this random text?! 🙂
We’re building a book site at the moment – sadly it’s just random text 🙂
Excellent part 2 🙂
How do you manage the postgres function (written in python of course) ? For example the function to calculate the final price of a product in a view.
We’ve tried installing them on the back of post_syncdb signals in the past. More recently, I’ve created an empty South migration (using
--empty
) and put theCREATE OR REPLACE
function code in the forwards method. It’s a bit better because you have more control over rolling back, though I think there should be a better method.