Versioning Software Packages

There are lots of different versioning schemes and versioning is definitely not a solved problem. Some projects use dates, some use an ever-increasing number (perhaps generated by their version control system), some adopt the de facto standard of major.minor.patch and some converge to pi. Fundamentally though, all of these versioning schemes convey some additional information. With any of these schemes, it is easy to compare two versions and say which was released later. Some have even more semantic meaning. At the risk of sounding too pedantic, there is some formal discussion of this under the name configuration management.

Versioning conveys information

Semantic versioning is a fairly well-documented versioning scheme that conveys information about API stability and compatibility, especially in relation to dependency management. Without repeating too many of semantic versioning’s details: essentially, it allows for easy identification of backwards-compatible vs. incompatible changes, as well as changes that do not modify the API. It is definitely a step forward, but as evidenced by the fact that a 2.0.0 version of it is still a release candidate, versioning in the abstract cannot be considered complete.

A number of systems developers use every day rely on the semantics of versioning. Take Python as a perfect example. Code written for Python 2.6 will almost always run without modification on 2.7. The reverse is sometimes true, but far less reliably so. When I use Travis-CI, I do not specify that my program must be tested under Python 2.7.0, 2.7.1, etc. Travis assumes that if it works in Python 2.7, it will work in any of those versions. Patch versions are rarely necessary for dependency management, and with semantic versioning they certainly are not. Semantic versioning really shines when testing a package with multiple or recursive dependencies. Instead of having to test X versions of Package1 against all Y versions of Package2, basic API compatibility can be assumed and testing becomes easier. With that said, verifying a subset of those combinations is always nice too.
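
To make that concrete, here is a minimal sketch of a compatibility check under semantic-versioning assumptions (same major version, anything at least as new in minor/patch is acceptable). The function names are mine, it ignores pre-release tags, and distutils.version.LooseVersion, mentioned in the notes below, gives similar comparisons out of the box.

    # Sketch: is an installed version API-compatible with a required one,
    # assuming the package follows semantic versioning? Names are illustrative.
    def parse(version):
        """Split 'major.minor.patch' into a tuple of ints for comparison."""
        return tuple(int(part) for part in version.split("."))

    def compatible(installed, required):
        installed, required = parse(installed), parse(required)
        # Same major version and at least as new as the required release.
        return installed[0] == required[0] and installed >= required

    print(compatible("2.7.3", "2.6.0"))  # True: backwards-compatible upgrade
    print(compatible("3.0.1", "2.6.0"))  # False: major version changed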

Additional versioning metadata

API compatibility and version comparison lend themselves well to being discovered from the version numbering scheme, but what about other information developers might want to know about a particular package? One interesting piece of metadata would be a classification of the changes a new version contains. Did any security vulnerabilities get fixed? Was this a bugfix release? Was this a re-release because something went wrong in the previous release process? You can imagine a versioning scheme that looks something like 1.0.5-sb, where s stands for security and b stands for bug. If that metadata were machine discoverable, it would be much easier to identify dependencies that require updating.
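
As a sketch of how machine discoverable that could be, parsing the hypothetical 1.0.5-sb scheme above might look like this; the flag letters and their meanings are made up purely for illustration.

    # Parse the hypothetical "1.0.5-sb" scheme: trailing letters classify the
    # release (s = security fix, b = bugfix). Purely illustrative.
    import re

    FLAGS = {"s": "security fix", "b": "bugfix"}

    def classify(version):
        match = re.match(r"^(\d+)\.(\d+)\.(\d+)(?:-([a-z]+))?$", version)
        if not match:
            raise ValueError("unrecognized version: %r" % version)
        flags = match.group(4) or ""
        return [FLAGS.get(flag, "unknown") for flag in flags]

    print(classify("1.0.5-sb"))  # ['security fix', 'bugfix']
    print(classify("1.0.6"))     # []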

Perhaps a better solution is to attach this metadata to version control tags. Newer version control systems have annotated tags (I’m using the git term, but Mercurial has something similar). I can imagine a tag 1.0.5 with the annotation [security fix]. Like raising awareness of semantic versioning, though, this requires changing the software world by getting everyone to adopt your methods, which is a tall order. In addition, some things, such as deprecating releases, do not lend themselves well to version control tagging. It would be nice to discover that a version is deprecated and therefore no longer receiving security fixes, but that happens long after the release is tagged.
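
For example, creating such a tag with git and reading the annotation back might look like this (the annotation text is only an illustration):

    $ git tag -a 1.0.5 -m "[security fix] details of the vulnerability here"
    $ git tag -n1 -l 1.0.5
    1.0.5           [security fix] details of the vulnerability here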

Conclusion

I’m not ready to make any sort of concrete proposal. However, I think this is a really interesting space, and information such as which versions have security vulnerabilities becomes much more valuable now that many software products capture their dependencies in requirements files or Gemfiles. There are definitely projects that have taken on parts of this, such as the Open Source Vulnerability Database. Over the next few years, this is going to become somewhat solved, rather than developers upgrading dependencies in the ad hoc fashion they do now. There will be something better than subscribing to a bunch of mailing lists and feeds to keep up with security fixes for dependencies.

Additional notes
  • distutils.version.LooseVersion allows for version comparison pretty close to semantic versioning
  • There are marketing reasons to version things too.
  • I got the ideas for this post while reading the 2.0.0-rc1 version of the semantic versioning docs, where I noticed that the tagging specification section had been removed since I last read it.

Signing and Verifying Python Packages with PGP

When I first showed Pip, the Python package installer, to a coworker a few years ago, his first reaction was that he didn’t think it was a good idea to directly run code downloaded from the Internet as root without looking at it first. He’s got a point. Paul McMillan dedicated part of his PyCon talk to this subject.

Python package management vs. Linux package management

To illustrate the security concerns, it is good to contrast how Python modules are usually installed with how Apt or Yum does it for Linux distributions. Debian- and Red Hat-based distros usually pre-provision the PGP keys for their packages with the distribution. Provided you installed a legitimate Linux distribution, you get the right PGP keys, and every package downloaded through Apt/Yum is PGP checked. This means that each package is signed using the private key for that distribution, and you can verify that the exact package was signed and has not been modified. The package manager checks this and warns you when it does not match.

Pip and Easy Install don’t do any of that. They download packages in plaintext (which would be fine if every package were PGP signed and checked) and they download the checksums of the packages in plaintext too. If you manually tell Pip to point to a PyPI repository over HTTPS (say, crate.io), it does not check the certificate. If you are on an untrusted network, it would not be tough to simply intercept a request to PyPI, download the package, add malicious code to setup.py and recalculate the checksum before passing the new malicious package on to be installed.

I think the big users of Python, like the Mozillas of the world, run their own PyPI servers and only load a subset of packages into them. I’ve heard of other shops making RPMs or DEBs out of Python packages, and that’s what I often do. It lets you leverage your distribution’s infrastructure, where the signing and checking machinery is already in place. However, if you don’t want to do that, you can always PGP sign and verify your packages, which is what the rest of this post is about.

Verifying a package

There are relatively few packages on the cheeseshop (PyPI) that are PGP signed. For this example, I’ll use rpc4django, a package I release, and GNU Privacy Guard (GPG), a PGP implementation. The PGP signature of the package (rpc4django-0.1.12.tar.gz.asc) can be downloaded along with the package (rpc4django-0.1.12.tar.gz). If you simply attempt to verify it, you’ll probably get a message like this:
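
Roughly, the transcript looks like this (the date, key type and key ID below are placeholders; yours will differ):

    $ gpg --verify rpc4django-0.1.12.tar.gz.asc rpc4django-0.1.12.tar.gz
    gpg: Signature made <date> using <key type> key ID <KEYID>
    gpg: Can't check signature: No public key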

This message lets you know that the signature was made using PGP at the given date, but without the public key there is no way to verify that this package has not been modified since the author (me) signed it. So the next step is to get the public key for the package:
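
One way to do that is to search a keyserver for the key ID from the verify output; the transcript below is representative, with the keyserver, key details and user ID as placeholders:

    $ gpg --keyserver hkp://pgp.mit.edu --search-keys <KEYID>
    gpg: searching for "<KEYID>" from hkp server pgp.mit.edu
    (1) <author's user ID>
          <key size and type> key <KEYID>, created: <date>
    Keys 1-1 of 1 for "<KEYID>".  Enter number(s), N)ext, or Q)uit >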

If you hit “1”, you will import the key. Re-running the verify command will now properly verify the package:
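
The successful verification looks something like this (again with placeholders for the specifics):

    $ gpg --verify rpc4django-0.1.12.tar.gz.asc rpc4django-0.1.12.tar.gz
    gpg: Signature made <date> using <key type> key ID <KEYID>
    gpg: Good signature from "<author's user ID>"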

The fact that ten different Python modules will probably be signed by ten different PGP keys is a problem, and I’m not sure there’s a way to make that easier. In addition, my key is probably not in your web of trust; nobody you trust has signed my public key. So when you verify the signature, you will probably also see a message like this:
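
It typically reads something like this:

    gpg: WARNING: This key is not certified with a trusted signature!
    gpg:          There is no indication that the signature belongs to the owner.
    Primary key fingerprint: <fingerprint>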

This means that I need to get my key signed by more people and you need to expand your web of trust.

Signing a package

Signing a package is easy and it is done as part of the upload process to PyPI. This assumes you have PGP all set up already. I haven’t done this in about a month, so I hope the command is right.
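
At the time, signing was done by passing --sign to the distutils upload command, so the invocation was probably along these lines:

    $ python setup.py sdist upload --sign
    # The --identity option selects which key signs the package:
    $ python setup.py sdist upload --sign --identity="<key id or user name>"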

There are additional options, such as specifying which key signs the package, but the signing part is easy.

However, how many people actually verify the signature? Almost nobody. The package managers (Pip/EasyInstall) don’t, and you probably just use one of them.

The future of Python packaging

So what can we do? I tried to work on this at the PythonSD meetup, but I didn’t get very far, partly because it is a tough problem and partly because there was more chatting than coding. As a concrete proposal, I think we need to get PGP verification into Pip and solve issue #425. This probably means making python-gnupg a prerequisite for Pip (at least for PGP verification). Step two is to add certificate verification. Python 3 already supports certificate checking through OpenSSL; Python 2 might have to use something like the Requests library. Step three is to get a proper certificate on PyPI.
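
As a rough sketch of what that first step could look like with python-gnupg (this is not Pip’s code; the file names and messages are just examples):

    # Verify a downloaded package against its detached PGP signature using
    # python-gnupg. Assumes both files are already in the working directory.
    import gnupg

    gpg = gnupg.GPG()  # uses the default keyring in ~/.gnupg

    package = "rpc4django-0.1.12.tar.gz"
    signature = package + ".asc"

    with open(signature, "rb") as sig:
        verified = gpg.verify_file(sig, package)

    if verified:
        print("Good signature from %s (%s)" % (verified.username, verified.fingerprint))
    else:
        raise SystemExit("Signature check failed: %s" % verified.status)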

Edit: Updated command to upload signed package

Edit (January 2018): This five-year-old post is massively outdated. I recommend taking a look at the Python packaging and distributing docs, which are much better now. The commands I typically run to distribute a package are:
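
These are likely along the lines of building an sdist and a wheel and then uploading with twine (twine also supported a --sign flag for PGP signatures at the time):

    $ python setup.py sdist bdist_wheel
    $ twine upload dist/*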

Djangocon 2011 Day Three

I know Djangocon has been over for a week, but I didn’t get a chance to talk about day three and specifically Paul McMillan’s excellent security talk. I also think it’s interesting that Djangocon seems to correlate with security releases (2011, 2010).

Timing attacks

Paul demonstrated a timing attack against password reset: a method that mails a user a one-time link to use to reset their password. This timing attack could guess that link with fewer requests than brute force would need, that is, fewer than guessing all possible combinations. It did so by measuring the difference in response times between requests with more vs. fewer correct characters in the URL. I spoke with Paul and he said that this attack works best locally and would be hard to execute remotely, because variability in network latency is significant enough to make measuring the timing differences difficult. While the attack is not completely practical, a lot of people use shared or cloud hosting, which lets attackers work around this by setting up attack servers on the same network.
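
As a toy illustration of the principle (not Paul’s actual attack), a naive string comparison returns as soon as a character differs, so guesses that share a longer correct prefix take measurably longer:

    # Time a naive, early-exit comparison against guesses with increasingly
    # long correct prefixes. The per-call gap is tiny but adds up over requests.
    import timeit

    SECRET = "3f9a1c77d2b84e05"  # stand-in for a one-time reset token

    def naive_equal(a, b):
        if len(a) != len(b):
            return False
        for x, y in zip(a, b):
            if x != y:
                return False  # early exit leaks how far the match got
        return True

    for guess in ("0000000000000000", "3f9a000000000000", "3f9a1c77d2b84000"):
        elapsed = timeit.timeit(lambda: naive_equal(SECRET, guess), number=200000)
        print(guess, round(elapsed, 4))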

Paul also demonstrated a timing attack which leaked some information about whether a username was valid in the system.

Securing Django in production

Even if Django is completely secure (which nothing truly is), mistakes can be made in deployment. Paul recommended an app called django-secure which checks for common misconfigurations. In addition, he said that the login URL should always be throttled to prevent password guessing. The Django security docs, which your humble blogger helped write, also recommend that, among a number of other things. They are worth a read.

Password issues

I posted a primer about Django passwords last month. Paul had some more things to say on the subject. Firstly, database dumps/backups and initial data which contain hashed passwords should not be public (for example, on GitHub). As I mentioned in the primer, eight character passwords using Django’s current hashing algorithm (sha1) can be brute forced in a matter of hours in the worst case. So if you accidentally leaked a backup (and a number of high profile sites have done things like this), consider those passwords broken.

The fix for the password problem is to use a “slower” hashing algorithm designed for hashing passwords. I spoke with Paul after the talk, and one of the roadblocks to using something like bcrypt is its reliance on C extensions, which the Django core team is reluctant to introduce. However, they are really trying to get something better into Django core for 1.4.
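
For a sense of what the “slower” option looks like, here is a minimal bcrypt sketch using the hashpw/gensalt API that both py-bcrypt and the later bcrypt module expose; the work factor makes each guess far more expensive than a single sha1 hash:

    # Hash and verify a password with bcrypt. The salt and cost factor are
    # embedded in the stored hash, so verification re-hashes the candidate.
    import bcrypt

    password = b"correct horse battery staple"
    hashed = bcrypt.hashpw(password, bcrypt.gensalt())  # default cost factor of 12

    print(bcrypt.hashpw(password, hashed) == hashed)  # True for the right password
    print(bcrypt.hashpw(b"guess", hashed) == hashed)  # False otherwise
    # (The modern bcrypt module also provides checkpw for a constant-time check.)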

Miscellaneous

There were a number of other recommendations including:

  • Be careful where you store pickled data (cache, /tmp, etc.). Pickled objects can contain executable code.
  • Use the proper cryptographic functions available in Django and Python including: random.SystemRandom, django.utils.crypto.constant_time_compare, and django.utils.crypto.salted_hmac (see the sketch after this list)
  • Be careful when deploying HTTPS to make sure it is done properly
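
A short sketch of the utilities from the second bullet, using a hypothetical password-reset token as the value being protected (the salt string and token format are made up; in a real project salted_hmac would normally default to settings.SECRET_KEY):

    # Generate a token with a cryptographically secure RNG, derive an HMAC for
    # it, and compare candidate values in constant time.
    import random
    import string
    from django.utils.crypto import constant_time_compare, salted_hmac

    rng = random.SystemRandom()
    token = "".join(rng.choice(string.ascii_letters + string.digits) for _ in range(32))

    # Passing secret= keeps the snippet runnable without configured settings;
    # inside a Django project you would normally omit it.
    expected = salted_hmac("password-reset", token, secret="not-a-real-secret").hexdigest()

    def check(candidate):
        return constant_time_compare(candidate, expected)

    print(check(expected))             # True
    print(check("0" * len(expected)))  # False, in the same amount of time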

It’s good to hear that security people are going over Django with a fine-toothed comb.