For people wanting to write custom SQL queries for use in Django and are wondering how to make them "fit in" somehow with the rest of your code, a short tutorial follows. Requires a little bit of familiarity with Django to be useful, I suspect. Also, if you are interested in learning more about some of the more advanced corners of Django, there are a few other people writing instructional pieces at the moment. In general, have a look at the Django aggregator for recent updates, but, in particular, I would like to draw attention to Ian Holsman's recent writeups of content types and signals as they are used in Django. Similarly, James Bennett's excellent series of posts about various useful topics are well worth your time. And that will certainly be enough link pimping for today. A note before going any further: all of the code from this post can be downloaded if you want to experiment. The tarball contains the models and a small test file. Assume we have models representing people and groups of people. For example from django.db import modelsclass Person(models.Model): name = models.CharField(maxlength = 50)class Group(models.Model): name = models.CharField(maxlength = 50) people = models.ManyToManyField(Person, related_name = 'groups', null = True, blank = True)We want to be able to answer this question from time to time: given a list of people (Person instances), find all the groups that contain all of the people. We already know what the eventual SQL query will look like for this question. The fiddly bit is how to fit it into the models in a nice way. How are we going to use this?Let's stop and think for a moment about how we are going to use this in our code. Working out the API for an addition like this usually helps avoid crafting a solution that is unusable. We are going to pass in a collection of Person instances to some function and we hope to get back some Group instances that meet the condition. Ideally, we would like to be able to pass in a QuerySet describing the people we want and get a QuerySet back from the function without ever hitting the database. However, that is not possible at the moment in Django, because it is not easy to extract the SQL from the input QuerySet in a way that can be used by our function. It will be possible to pass a QuerySet into our function, but we will have to iterate over the results in order to work with them. We will, however, be able to return a QuerySet on the Group model that can be further filtered or counted or iterated over as required. Since our function is something that essentially acts on the Group model, we also have a good guide as to where to put it. It can be a custom manager method for the Group model. In this fashion, we will be able to use it as follows: # Extract the people we want (the class clowns, in this case)people = Person.objects.filter( name__in = ('Abbott', 'Costello'))# Get the groups containing both these peopleall_groups = Group.objects.get_all_groups(people)# Since all_groups is a QuerySet, we can use it in a few ways:print all_groups.count()print [str(obj) for obj in all_groups.order_by('name')]The advantage of returning a QuerySet is clear in cases like this where the result might need to be used in a few ways that are not known to us yet. Writing the codeImplementing the get_all_groups() function is not too hard. The only thing I want to draw attention to here is where the following code tries to remain database- and application-neutral by extracting the requisite table names from the models themselves, rather than hard-coding them in. Let me show the code first and then discuss it later. from django.db import connection, backend, modelsfrom django.core.exceptions import ObjectDoesNotExistclass GroupManager(models.Manager): def get_all_groups(self, people): people_ids = [obj.id for obj in people] if not people_ids: raise ObjectDoesNotExist reln = self.model._meta.get_field('people') join_table = backend.quote_name(reln.m2m_db_table()) group_id = backend.quote_name(reln.m2m_column_name()) person_id = backend.quote_name(reln.m2m_reverse_name()) query = 'select %s from %s where %s in %s ' \ 'group by %s having count(*) = %%s' \ % (group_id, join_table, person_id, tuple(people_ids), group_id) cursor = connection.cursor() cursor.execute(query, (len(people_ids),)) group_idents = [item[0] for item in cursor.fetchall()] return self.filter(id__in = group_idents)Most of this code should be clear, except possibly how we get the database table and column names. That, admittedly, does require some poking around in the internals of Django and, if you don't want to go to all that effort in your code and are unlikely to be moving it around much (in particular, not renaming the application), then just looking up the appropriate database table names in your database should suffice. If you do want to remain very portable, as the above code attempts to do, let's have a quick tour through this part of the code. Some people may complain that this is all a bit unfair, since it is not documented. But, come on! The code is in Python. Fifteen minutes and a good search function in your editor will help locate the right bits after a bit of practice. The code is the documentation for some of these things; rewriting it in English would not really add a lot of value and would increase maintenance costs. If you want to do deep and complex things, you may need to spend a few minutes on preparation. As documented, every Manager class has access to the model it is managing through the self.model attribute. A bit of exploration in the Django code (in particular, in django/db/models/base.py) reveals that the field classes are stored in the _meta attribute of the model and they can be retrieved by name with the get_field() method. In fact, I discovered most of that one day by just importing a model in at the Python prompt and running dir(...) on anything that looked interesting to see what attributes and methods it had. I didn't even have to open a source file until I wanted to confirm that my discoveries were correct. Now that we have the many-to-many field object in hand, we need to extract the database table name and the names of the columns that refer to the Group and People class. A quick browse through the code in django/db/models/fields/related.py shows that m2m_db_table, m2m_column_name and m2m_reverse_column_name give us what we want. The backend.quote_name() function is useful to remember when constructing these sorts of queries. It takes care of any database-specific quoting that may be required to ensure the resulting names will be valid SQL. After fetching all the requisite strings, it is just a matter of putting all this information together into the SQL query. The query we construct has two parameters: the input list of people identifiers and then length of that list. Passing in the list of identifiers to the query is the only slightly awkward bit in this code. I am relying on the fact that str(tuple(...)) in Python for a list of integers will give me something that is legal in an SQL query (and I know the list of integers is non-empty, because that was handled much earlier in the function). Also, it is not possible to pass in the tuple(people_ids) part as a parameter to the query, because the database backend will quote it as a string, whereas I need it unquoted here. So, all in all, the construction of the IN clause is a little uglier than I might like it. [ Updated (July 21): In a post to django-users, Neilen Marais pointed out that there is indeed a problem with this tuple construction: str(tuple([10])), for example leads to the creation of bad SQL. You need to special-case the situation when the people_ids list contains only a single item.] Finally, after retrieving all the group identifiers, we suck them into a list and use that list to construct our final QuerySet for the Group model. Normally, you would write something like Group.objects.filter(...), but since we are in the Manager class here already (the Group.objects bit), we can just use self.filter(...) to construct our result. Adding this into the Group class is simple and documented here. We just make GroupManager the default manager on the Group class: class Group(models.Model) # original fields as before ... # and add in the new manager (the 'objects' attribute is the # default manager) objects = GroupManager()For advanced creditHow efficient is this? Well, assuming the people parameter passed into get_all_groups() is a QuerySet, we will execute (roughly -- it depends on the QuerySet and models involved) one query to compute the people_ids list, one to retrieve all of the group_idents and return a QuerySet that will eventually result in another query being executed. Three queries is not that bad, since they will use database indexes for most of the heavy work. Can we do better? As alluded to earlier, we cannot reduce it to a single query without rewriting the internals of Django's QuerySets (i.e. we need to evaluate the incoming QuerySet to get the people_ids list). With a lot more work, we could create a class derived from QuerySet that essentially provides a QuerySet interface wrapped around the SQL query we create in this function. That would collapse the internal query in get_all_groups() and the returned QuerySet query into a single event, which would reduce the ultimate query count to two (the second query would be a join between the m2m_db_table and the Group table). But that is more work than I'm willing to do at this moment and would definitely be premature (or unnecessary) optimisation in the couple of situations where I have seen this pattern in practice so far in Django applications. |