When I first showed Pip, the Python package installer, to a coworker a few years ago his first reaction was that he didn’t think it was a good idea to directly run code he downloaded from the Internet as root without looking at it first. He’s got a point. Paul McMillan dedicated part of his PyCon talk to this subject.
Python package management vs. Linux package management
To illustrate the security concerns, it is good to contrast how Python modules are usually installed with how Apt or Yum do it for Linux distributions. Debian and Redhat distros usually pre-provision the PGP keys for their packages with the distribution. Provided you installed a legitimate Linux distribution, you get the right PGP keys and every package downloaded through Apt/Yum is PGP checked. This means that the package is signed using private key for that distribution and you can verify that the exact package was signed and has not been modified. The package manager checks this and warns you when it does not match.
Pip and Easy Install don’t do any of that. They download packages in plaintext (which would be fine if every package was PGP signed and checked) and they download the checksums of the package in plaintext. If you manually tell Pip to point to a PyPI repository over HTTPS (say crate.io), it does not check the certificate. If you are on an untrusted network, it would not be tough to simply intercept requests to PyPI, download the package, add malicious code to setup.py and recalculate the checksum before returning the new malicious package on to be downloaded.
I think the big users of Python like the Mozillas of the world run their own PyPI servers and only load a subset of packages into it. I’ve heard of other shops making RPMs or DEBs out of Python packages. That’s what I often do. It lets you leverage the infrastructure of your distribution and the signing and checking infrastructure is already there. However, if you don’t want to do that, you can always PGP sign and verify your packages which is what the rest of this post is about.
Verifying a package
There are relatively few packages on the cheeseshop (PyPI) that are PGP signed. For this example, I’ll use rpc4django, a package I release, and Gnu Privacy Guard (GPG), a PGP implementation. The PGP signature of the package (rpc4django-0.1.12.tar.gz.asc) can be downloaded along with the package (rpc4django-0.1.12.tar.gz). If you simply attempt to verify it, you’ll probably get a message like this:
% gpg --verify rpc4django-0.1.12.tar.gz.asc rpc4django-0.1.12.tar.gz gpg: Signature made Mon Mar 12 15:14:28 2012 PDT using RSA key ID A737AB60 gpg: Can't check signature: public key not found
This message lets you know that the signature was made using PGP at the given date, but without the public key there is no way to verify that this package has not been modified since the author (me) signed it. So the next step is to get the public key for the package:
% gpg --search-keys A737AB60 gpg: searching for "A737AB60" from hkp server keys.gnupg.net (1) David Fischer <firstname.lastname@example.org> 2048 bit RSA key A737AB60, created: 2011-11-20 Keys 1-1 of 1 for "0xA737AB60". Enter number(s), N)ext, or Q)uit > q
If you hit “1″, you will import the key. Re-running the verify command will now properly verify the package:
% gpg --verify rpc4django-0.1.12.tar.gz.asc rpc4django-0.1.12.tar.gz gpg: Signature made Mon Mar 12 15:14:28 2012 PDT using RSA key ID A737AB60 gpg: Good signature from "David Fischer <email@example.com>"
The fact that ten different Python modules will probably be signed by ten different PGP keys is a problem and I’m not sure there’s a way to make that easier. In addition, my key is probably not in your web of trust; nobody who you trust has signed my public key. So when you verify the signature, you will probably also see a message like this.
gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner.
This means that I need to get my key signed by more people and you need to expand your web of trust.
Signing a package
Signing a package is easy and it is done as part of the upload process to PyPI. This assumes you have PGP all setup already. I haven’t done this in about a month so I hope the command is right.
% python setup.py sdist upload --sign
There are additional options like the correct key to sign the package, but the signing part is easy.
However, how many people actually verify the signature? Almost nobody. The package managers (Pip/EasyInstall) don’t and you probably just use one of them.
The future of Python packaging
So what can we do? I tried to work on this at the PythonSD meetup but I didn’t get very far partially because it is a tough problem and partly because there was more chatting than coding. As a concrete proposal, I think we need to get PGP verification into Pip and solve issue #425. This probably means making Python-gnupg a prerequisite for Pip (at least for PGP verification). Step two is to add certificate verification. Python3 already supports certificate checking through OpenSSL. Python2 might have to use something like the Requests library. Step three is to get a proper certificate on PyPI.
Edit: Updated command to upload signed package
As I promised, here’s some (semi-live) blogging from Djangocon.
The first talk of the day was Scaling the World’s Largest Django Application given by the guys at Disqus (slides). The basic gist is how they scaled up Django to handle a bajillion requests per month. One thing that I noticed is that to reach this scale, they are effectively throwing away referential integrity in their database. When asked about it, one of the guy mentioned that they have some scripts to verify that things have some level of integrity which I thought was a little scary. I was happy to see that they are basically measuring everything. On top of the usual continuous integration stuff like unit tests, lint (Pyflakes, actually) they are logging every query and every traceback using a package they opensourced named Sentry. In addition, the brief mention of the save() method’s concurrency issues was interesting.
Jeff Balogh’s talk on Switching addons.mozilla.org [AMO] from CakePHP to Django (slides – pdf) was on the similar subject of switching the high traffic site AMO to Django from PHP. My favorite point from this talk is how they handle the classic stale DB replication problem of a user submitting new data (to the master) and then not seeing their data (from the slave). Basically, they use a custom Django router that detects a POST and switches that user’s session to always read and write from the master which is pretty damn clever. Mozilla also de-normalized their database in order store a reference to the latest add-on (a classic problem). However, Jeff did mention that they might switch that to storing that info in cache. The main gist of this talk was cache everything.
Russell Keith-Magee’s So you want to be a core developer? and James Bennett’s Topics of Interest (going on now — see live blogging) are both on the future of Django and how to get involved and what needs to change in the Django community. The Django people need more people to get involved. James said there are only 14 committing developers and Russell said that to get Django 1.2 out the door he had to review tickets for 5 hours a night for 2 weeks because nobody else was doing it. As James said, their bus numbers — the number of people who get hit by a bus and then you’re screwed — is frighteningly small. There’s only 2 people who know Django Oracle support.
Themes from the con
- Git seems to be winning in the DVCS space. It really seems like git and github are taking over and Bazaar and Mercurial are being left at the wayside. I’ve seen tons of links to github and zero to bitbucket or launchpad. James Bennett just now briefly mentioned bitbucket and launchpad.
- Django developers think that database referential integrity is overrated. Multiple people mentioned that integrity is a farce when things scale insanely.
- Deploy early and often. The Mozilla guys deploy at least weekly. The Disqus guys deploy daily or more often. Although Jeff Balogh from Mozilla didn’t say it, I wonder if this means that like Disqus they are running out of trunk rather than branching and releasing.
- Celery (see my previous post) is awesome and everybody seems to be adopting it.
- If you aren’t using Pip and Virtualenv, you should (previous post). However, you probably shouldn’t deploy a production box from Pip and Pypi.
In a previous post, I promised to write about Pip and Virtualenv and I’m now finally making good. Others have done this before, but I think I have a little to add. If you develop a Python module and you don’t test it with virtualenv, don’t make your next release until you do.
Configuring the environment
Virtualenv creates a Python environment that is segregated from your system wide Python installation. In this way, you can test your module without any external packages mucking up the result, add different versions of dependency packages and generally verify the exact set of requirements for your package.
To create the virtual environment:
% virtualenv --no-site-packages testarea
This creates a directory testarea/ that contains directories for installing modules and a Python executable. Using the virtual environment:
% cd testarea % source bin/activate
Sourcing activate will set environment variables so that only modules installed under testarea/ are used. After setting up the environment, any desired packages can be installed (from pypi):
(testarea) % pip install rpc4django
Packages can also be uninstalled, specific versions can be installed or packages can be installed from the file system, URLs or directly from source control:
(testarea) % pip uninstall rpc4django (testarea) % pip install rpc4django==0.1.6
Pip is worth using over easy_install for its uninstall capabilities alone, but I should mention that pip is actively maintained while setuptools is mostly dead.
When you’re done with the virtual environment, simply deactivate it:
(testarea) % deactivate
Do it for the tests
While the segregated environment that virtualenv provides is extremely well suited to getting the correct environment up and running, it is just as well suited to testing your application under a variety of different package configurations. With pip and virtualenv, testing your application under three different versions of Django is a snap and it doesn’t affect your system environment in the slightest.
Dependencies made easy
My favorite feature of pip is the ability to create a requirements file based on a set of packages installed in your virtual environment (or your global site-packages). Creating a requirements file can be done automatically using the freeze command for pip:
(testarea) % pip freeze > requirements.txt (testarea) % more requirements.txt Django==1.1.1 rpc4django==0.1.7 wsgiref==0.1.2
% pip install -r requirements.txt
The requirements file can be version controlled both to aid in installation and to capture the exact versions of your dependencies directly where they are used rather than after the fact in documentation that can easily become out of date. The requirements file can be used to rebuild a virtual environment or to deploy a virtual environment into the machine’s site-packages. Pip and virtualenv are exceptionally easy to use and there’s really no excuse for a Python packager not to use them.
Note: I’m working on a fairly large sized application for work. When it is finished, I will release a post-mortem that will also function as an update to my post about packaging and distributing.