I know Djangocon has been over for a week, but I didn’t get a chance to talk about day three and specifically Paul McMillan’s excellent security talk. I also think it’s interesting that Djangocon seems to correlate with security releases (2011, 2010).
Paul demonstrated a timing attack against password reset: a method that mails a user a one-time link to use to reset their password. This timing attack could guess that link with fewer requests than would be needed to guess that link via brute force — that is, fewer than having to guess all possible combinations. It did so by measuring the difference in the times requests took between requests with more vs. fewer correct characters in the URL. I spoke with Paul and he said that this attack works best locally and would be hard to execute remotely because variability in network latency would be significant enough to make measuring the differences in timing difficult. While this attack is not completely practical, a lot of people use shared or cloud hosting which allow attackers to somewhat mitigate this by setting up attack servers in the same network.
Paul also demonstrated a timing attack which leaked some information about whether a username was valid in the system.
Securing Django in production
Even if Django is completely secure (which nothing truly is), mistakes can be made in deployment. Paul recommended an app called django-secure which checks for common misconfigurations. In addition, he said that the login URL should always be throttled to prevent password guessing. The Django security docs which your humble blogger helped write also recommend that among a number of other things. They are worth a read.
I posted a primer about Django passwords last month. Paul had some more things to say about it. Firstly, database dumps/backups and initial data which contain hashed passwords should not be public (for example, on github). As I mentioned in the primer, eight character passwords using Django’s current hashing algorithm (sha1) can be brute forced in a matter of hours in the worst case. So if you accidentally leaked a backup — and a number of high profile sites have done things like this — then consider those passwords broken.
The fix for the password problem is to use a “slower” hashing algorithm designed for hashing passwords. I spoke with Paul after the talk and one of the road blocks to using something like bcrypt is its reliance on C extensions and the Django core team is reluctant to introduce them. However, they are really trying to get something better into the Django core for 1.4.
There were a number of other recommendations including:
- Be careful where you store pickled data (cache, /tmp, etc.). Pickled objects can contain executable code.
- Use the proper cryptographic functions available in Django and Python including: random.SystemRandom, django.utils.crypto.constant_time_compare, and django.utils.crypto.salted_hmac
- Be careful when deploying HTTPS to make sure it is done properly
It’s good to hear that security people are going over Django with a fine-toothed comb.
I enjoyed some brief time traveling when Jacob showed what Django looked like in 2005 or so. It has come a long way.
I attended two talks on APIs today: Isaac Kelly’s talk on Tastypie and Tareque Hossain’s talk on the Promises & Lies in REST. Tareque’s talk involved PBS’ use of Piston and the changes that they had to make (presumably because the core has not been updated). It seems like a number of new projects in the Django/REST space have cropped up (on top of Tastypie) such as Django REST framework and dj-webmachine. At last year’s Djangocon, Eric Holscher (I think) mentioned that it seemed like there was agreement on Piston for Django REST interfaces and now the REST community is fragmenting a little and using a variety of different tools and methodologies.
Tareque recommended a number of methodologies in his talk that I would say are not very RESTful such as including the status code in the data (as opposed to just using the HTTP status code), putting the API version in the URL (a good idea but maybe that should be in the header) or putting the desired output format in the URL (.xml, .json, etc) as opposed to in the HTTP header. Perhaps thinking about “not very RESTful” though is not the right way to think about it. In his talk, Isaac said that “Restish is enough” and maybe that is the answer. If you’re doing most of the RESTful things, you’re Doing Things Right. On the other hand, once you say “Restish is enough” you’re basically admitting that everybody does REST differently and that divergence in REST interfaces is going to continue for at least the foreseeable future.
I really enjoyed Ben Slavin’s talk on Real-time Django. He shed some good insight on what to cache and when. Essentially, I would summarize it as to cache many things at every level that makes sense. On top of perhaps view level caching, you should cache partial results or really anything that prevents you from hitting your database more than you need. I have been playing with an approach that uses this to cache data from multiple databases in one fast cache. I liked the concept of “continuous caching” where essentially some out-of-band process is caching views or data so that actual requests for views don’t hit the DB.
I chose to attend Alex Gaynor’s talk on Pypy at Quora rather than Frank Wiles talk on Postgres performance tuning but it was a tough choice. Alex thinks one big strength for Django (from his time at Quora not using it) was that picking up a foreign Django codebase is easy because of all the conventions that virtually all Django apps follow. If you know Django, you can easily tell all the URLs for any Django app (urls.py) or all the forms (forms.py). Unfortunately, the Django admin doesn’t use these conventions. In passing, he also mentioned a project called Johnny Cache which I have to try. I followed some live-blogging on Frank’s talk and it looked like there were some good tidbits.
If you’re at Djangocon, say hi!
More live-ish blogging…
Eric Florenzano’s talk Why Django Sucks, and How We Can Fix It (slides) builds on top of the 2008 keynote by Cal Henderson entitled Why I Hate Django by pointing out instances where Django can improve. Like Cal, Eric complained about some things — some of which may not be solvable — and hopefully like some of the things Cal complained about they’ll get fixed. The note about needing more contributors again came up. Becoming a core developer is pretty much impossible. He complained about the reusability of apps citing django-avatar as an example. By rigidly defining a model, a “reusable app” becomes somewhat locked and cannot easily store new metadata. I really liked the concept of lazy foreign keys.
I’m somewhat torn on the idea of just switching Django’s source to github. I don’t fully buy Russell’s argument that you can just checkin to github and it will trigger to subversion. While that is a true statement, by having mirrors in bitbucket, launchpad and github, the Django core has fractioned the social aspects of those services. User comments are going to be split between between those services rather than being concentrated. However, similar to how Django used to allow comments in the documentation, these comments may not be that useful. I also think removing the Django admin from Django would be a travesty.
Adam Baldwin’s Pony Pwning (notes, slides) was a decent hundred foot overview of Django security and web security in general. I would have really liked to see more details although it looked like some security vulnerability he found was redacted from his slides to give the Django developers time to fix it. Other than that, he said that the Django community as a whole “gets” security (I generally agree) and that while Django is fairly secure by default developers still manage to make mistakes. He did point out that there is no clickjacking protection for the admin and using X-FRAME-OPTIONS would be a good addition. Also, it seems that Django’s escaping could be improved. I liked that he pushed pen testing with w3af and running a web app firewall like mod_security. While frameworks can buy a certain level of security cheaply since it’s unlikely that a single developer or group will get everything right and it’s more likely that a well thought through framework will be more secure. However, then problems in the framework are discovered and basically all sites using it are suddenly vulnerable. I think some security researcher really just needs to spend some time with Django and really push the limits of its security model. I’ve talked about doing it at work, but buyin will be tough.
Andrew Godwin’s (the developer of South) talk Step Away From That Database (slides – pdf) was the hundred foot overview of the various data stores for Django. One interesting trend that I’m picking up on is that Django developers seem to dislike MySQL a lot and Postgres is preferred by far. This might have something to do with MySQL development stalling and getting forked into Drizzle.
I enjoyed, but don’t entirely agree with Malcolm Tredinnick’s Modeling challenges (data) which seemed to be more about how to model complex data into Django’s models but in reality could have very well left out Django entirely and focused on databases and data modeling. I really liked the part about modeling dates that have different precisions. However, I am less sold on his implementation of modeling sports teams and players from data retrieved from retrosheet.
class Membership(models.Model): """ A specification of belonging to something for a period of time. Concrete base classes with supply the "somethings". """ joined = models.DateField() departed = models.DateField(null=True, blank=True) class Meta: abstract = True def _to_string(self, lhs, rhs): pairing = u"%s - %s" % (lhs, rhs) if self.departed: return u"%s (%s - %s)" % (pairing, self.joined.strftime("%d %b %Y"), self.departed.strftime("%d %b %Y")) return u"%s (%s - )" % (pairing, self.joined.strftime("%d %b %Y")) class TeamMember(Membership): team = models.ForeignKey(Team, related_name="members") person = models.ForeignKey(Person) role = models.CharField(max_length=2, choices=((COACH, "coach"), (PLAYER, "player"))) def __unicode__(self): return self._to_string(self.team, self.person)
While this will result in fairly normalized data form data that looks like “keara001″,”Austin Kearns” where 001 is Kearns’ first season and gets incremented yearly. Basically, this easily allows you to join data easily and find all the history of where a player was and is, but it doesn’t take into account seasons very well. I think retrosheet does it the way they do so that you can easily get a player’s stats for a given year. It’s a complicated problem but I’ve seen retrosheet’s solution in many places.
I think the best talk of the day so far (it JUST finished) was Eric Holscher’s Large Problems in Django, Mostly Solved (slides) which basically gives the hundred foot overview of the add-ons available and the external packages you should be using. Some stuff is pretty clear like pip, south, celery, fabric and sphinx, but on top of that there were packages that I hadn’t heard of or knew relatively little about like Haystack for search and Gunicorn for easy or simple deployments (it also might be the right solution for an easy Pythonic Hudson replacement). I was interested that Eric sees that Piston is in competition with Tastypie and that django-tagging is being overtaken by django-taggit. At work, we setup django-tagging but since then it seems that django-taggit has emerged since then. I also loved Eric’s metric of lines of docs and lines of tests as a metric of how good a project is.
- djangopackages.com is the new place to go for Django add-ons.
- 1.3 will probably have some good logging stuff built-in
As I promised, here’s some (semi-live) blogging from Djangocon.
The first talk of the day was Scaling the World’s Largest Django Application given by the guys at Disqus (slides). The basic gist is how they scaled up Django to handle a bajillion requests per month. One thing that I noticed is that to reach this scale, they are effectively throwing away referential integrity in their database. When asked about it, one of the guy mentioned that they have some scripts to verify that things have some level of integrity which I thought was a little scary. I was happy to see that they are basically measuring everything. On top of the usual continuous integration stuff like unit tests, lint (Pyflakes, actually) they are logging every query and every traceback using a package they opensourced named Sentry. In addition, the brief mention of the save() method’s concurrency issues was interesting.
Jeff Balogh’s talk on Switching addons.mozilla.org [AMO] from CakePHP to Django (slides – pdf) was on the similar subject of switching the high traffic site AMO to Django from PHP. My favorite point from this talk is how they handle the classic stale DB replication problem of a user submitting new data (to the master) and then not seeing their data (from the slave). Basically, they use a custom Django router that detects a POST and switches that user’s session to always read and write from the master which is pretty damn clever. Mozilla also de-normalized their database in order store a reference to the latest add-on (a classic problem). However, Jeff did mention that they might switch that to storing that info in cache. The main gist of this talk was cache everything.
Russell Keith-Magee’s So you want to be a core developer? and James Bennett’s Topics of Interest (going on now — see live blogging) are both on the future of Django and how to get involved and what needs to change in the Django community. The Django people need more people to get involved. James said there are only 14 committing developers and Russell said that to get Django 1.2 out the door he had to review tickets for 5 hours a night for 2 weeks because nobody else was doing it. As James said, their bus numbers — the number of people who get hit by a bus and then you’re screwed — is frighteningly small. There’s only 2 people who know Django Oracle support.
Themes from the con
- Git seems to be winning in the DVCS space. It really seems like git and github are taking over and Bazaar and Mercurial are being left at the wayside. I’ve seen tons of links to github and zero to bitbucket or launchpad. James Bennett just now briefly mentioned bitbucket and launchpad.
- Django developers think that database referential integrity is overrated. Multiple people mentioned that integrity is a farce when things scale insanely.
- Deploy early and often. The Mozilla guys deploy at least weekly. The Disqus guys deploy daily or more often. Although Jeff Balogh from Mozilla didn’t say it, I wonder if this means that like Disqus they are running out of trunk rather than branching and releasing.
- Celery (see my previous post) is awesome and everybody seems to be adopting it.
- If you aren’t using Pip and Virtualenv, you should (previous post). However, you probably shouldn’t deploy a production box from Pip and Pypi.