There are lots of different versioning schemes, and versioning is definitely not a solved problem. Some projects use dates, some use an ever-increasing number (perhaps generated by their version control system), some adopt the de facto standard of major.minor.patch and some converge to pi. Fundamentally though, all of these versioning schemes convey some additional information. With any of these schemes, it is easy to compare two versions and say which was released later. Some carry even more semantic meaning. Without sounding too pedantic, there is some formal discussion of this under the name configuration management.
Versioning conveys information
Semantic versioning is a fairly well-documented versioning scheme that conveys information about API stability and compatibility, especially in relation to dependency management. Without repeating too many of semantic versioning’s details, essentially it allows for easy identification of backwards-compatible changes, incompatible changes, and changes that do not modify the API. It is definitely a step forward, but as evidenced by the fact that a 2.0.0 version of it is still a release candidate, versioning in the abstract cannot be considered complete.
A number of systems developers use every day rely on the semantics of versioning. Take Python as a perfect example. Code from Python 2.6 will almost always run without modification on 2.7. The reverse is sometimes true but certainly less likely to be true. When I use Travis-CI, I do not specify that my program must be tested under Python 2.7.0, 2.7.1, etc. Travis assumes that if it works in Python 2.7, it will work in any of those versions. These patch versions are rarely necessary for dependency management and certainly not with semantic versioning. Semantic versioning really shines when testing a package with multiple dependencies or recursive dependencies. Instead of having to test X versions of Package1 against all Y versions of Package2, basics of API compatibility can be assumed and testing becomes easier. With that said, verifying a subset of that is always nice to do too.
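The compatibility assumption described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names Version, parse and is_compatible are hypothetical, versions are plain major.minor.patch strings with no pre-release tags, and "compatible" means the same major version and a version that is not older than the required one.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse(text: str) -> Version:
    """Parse a plain 'major.minor.patch' string (no pre-release tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def is_compatible(installed: str, required: str) -> bool:
    """Return True if `installed` should satisfy code written for `required`.

    Assumption: a shared major version means a compatible API, and the
    installed version must not be older than the required one.
    """
    have, need = parse(installed), parse(required)
    # NamedTuples compare field by field, so 1.10.0 sorts after 1.9.5.
    return have.major == need.major and have >= need
```

Under those assumptions, is_compatible("2.7.3", "2.6.0") holds while is_compatible("2.6.0", "2.7.0") does not, which mirrors the Python 2.6 and 2.7 example.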
Additional versioning metadata
API compatibility and version comparison lend themselves well to being discovered from the version numbering scheme, but what about the other things developers might want to know about a particular package? One interesting piece of metadata would be a classification of the changes a new version contains. Did any security vulnerabilities get fixed? Was this a bugfix release? Was this a re-release because something went wrong in the release process of the previous release? You can imagine a versioning scheme that looks something like this:
1.0.5-sb where s stands for security and b stands for bug. It would be nice for that metadata to be machine discoverable, and it would be much easier to identify dependencies that require updating.
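To show what "machine discoverable" could mean for a string like 1.0.5-sb, here is a minimal sketch. The function names and the FLAGS table are hypothetical; only the letters s and b come from the scheme imagined above, and unknown letters are rejected rather than guessed at.

```python
# Hypothetical flag table: only "s" and "b" appear in the imagined scheme.
FLAGS = {"s": "security", "b": "bug"}


def parse_flagged_version(text):
    """Split '1.0.5-sb' into ((1, 0, 5), {'security', 'bug'})."""
    number, _, letters = text.partition("-")
    version = tuple(int(part) for part in number.split("."))
    unknown = [letter for letter in letters if letter not in FLAGS]
    if unknown:
        raise ValueError("unknown change flags: %r" % unknown)
    return version, {FLAGS[letter] for letter in letters}


def has_security_fix(text):
    """True if the release is flagged as containing a security fix."""
    _, labels = parse_flagged_version(text)
    return "security" in labels
```

A dependency checker could then walk a requirements file and flag any pinned package for which a newer release passes has_security_fix.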
Perhaps a better solution is to attach this metadata to version control tags. Newer version control systems have annotated tags (I’m using the git term, but Mercurial has something similar). I can imagine a tag 1.0.5 with the annotation [security fix]. As with raising awareness of semantic versioning, this requires changing the software world by getting everyone to adopt your methods, which is a tall order. In addition, some things, such as deprecating releases, do not lend themselves well to version control tagging. It would be nice to discover when a version is deprecated and therefore no longer receiving security fixes, but that happens long after a release is tagged.
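For the annotated tag idea, a rough sketch of how a tool might read such labels from git follows. The bracketed label convention is just the hypothetical one from the example tag, and read_tag_annotation and parse_labels are made-up names; the git invocation assumes git is on the PATH and the tag exists in the given repository.

```python
import re
import subprocess


def read_tag_annotation(tag, repo_dir="."):
    """Return the annotation message of a git tag as text."""
    result = subprocess.run(
        ["git", "for-each-ref", "--format=%(contents)", "refs/tags/" + tag],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_labels(annotation):
    """Extract bracketed labels such as '[security fix]' from a message."""
    return re.findall(r"\[([^\]]+)\]", annotation)
```

Something like parse_labels(read_tag_annotation("1.0.5")) would then yield the labels a maintainer attached at release time. It does nothing for the deprecation case, since that information does not exist yet when the tag is created.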
I’m not ready to make any sort of concrete proposal. However, I think this is a really interesting space, and I think that information such as which versions have security vulnerabilities becomes much more valuable now that many software products capture their dependencies in requirement files or gemfiles. There are definitely projects that have taken on parts of this, such as the open source vulnerability database. Over the next few years, this is going to become somewhat solved, replacing the ad hoc fashion in which developers upgrade dependencies now. There will be something better than subscribing to a bunch of mailing lists and feeds to keep up with security fixes for dependencies.
- distutils.version.LooseVersion allows for version comparison pretty close to semantic versioning
- There are marketing reasons to version things too.
- I got ideas for this post when I was reading the 2.0.0-rc1 version of the semantic versioning docs and I noticed that the tagging specification section was removed since I read it last time.
The Github public timeline is up on bigquery and I decided I’d play around with it. I created this visualization which is a first (alright, like eighth) try at measuring “how social” a project really is. The colors correspond to different programming languages and the size of the arc is based on the number of distinct collaborators on a project.
I thought about looking only at pull requests. That does uncover some interesting projects which have a lot of pull requests, but I think it penalizes projects like Linux which doesn’t really have pull requests. Also, I wasn’t sure code submissions alone were exactly what I was looking for. I also briefly looked at only merged pull requests. I ended up filtering out projects with no stated language, partially because there were a number of projects that just had common names (e.g. “test”).
While doing this, I figured I would have heard of the most social projects, but that there would be a lot of projects I’d never heard of for a variety of reasons. Some projects can get a lot of watchers, forks, comments or issues from an entirely separate group of people from the people I follow.
Most social projects by language
The visualization has the full dataset, but here’s a taste of the data:
- C – php-src, linux, mruby
- C++ – mosh, mysql, fr_public
- PHP – symfony, codeigniter, foundation
- Python – django, legit, flask
- Ruby – sample_app, rails, first_app
When I first showed Pip, the Python package installer, to a coworker a few years ago, his first reaction was that he didn’t think it was a good idea to directly run code he downloaded from the Internet as root without looking at it first. He’s got a point. Paul McMillan dedicated part of his PyCon talk to this subject.
Python package management vs. Linux package management
To illustrate the security concerns, it is good to contrast how Python modules are usually installed with how Apt or Yum do it for Linux distributions. Debian and Red Hat distros usually pre-provision the PGP keys for their packages with the distribution. Provided you installed a legitimate Linux distribution, you get the right PGP keys and every package downloaded through Apt/Yum is PGP checked. This means that the package is signed using the private key for that distribution, and you can verify that the exact package was signed and has not been modified. The package manager checks this and warns you when it does not match.
Pip and Easy Install don’t do any of that. They download packages in plaintext (which would be fine if every package were PGP signed and checked) and they download the checksums of the package in plaintext. If you manually tell Pip to point to a PyPI repository over HTTPS (say crate.io), it does not check the certificate. If you are on an untrusted network, it would not be tough for an attacker to intercept requests to PyPI, download the package, add malicious code to setup.py and recalculate the checksum before passing the malicious package along to be downloaded.
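A small sketch makes the checksum problem concrete. It assumes SHA-256 purely for illustration (the hash algorithm is not the point), and digest and checksum_matches are hypothetical names. A checksum only proves the package matches whatever checksum you were handed, so an attacker who controls the connection can replace both.

```python
import hashlib
import hmac


def digest(package_bytes):
    """Hex checksum of a package (SHA-256 is an assumption here)."""
    return hashlib.sha256(package_bytes).hexdigest()


def checksum_matches(package_bytes, published_checksum):
    """Compare the downloaded package against the published checksum."""
    return hmac.compare_digest(digest(package_bytes), published_checksum)


original = b"print('hello from setup.py')\n"
tampered = original + b"import os  # attacker-added code would go here\n"

# Honest checksum, tampered package: the check catches the change.
caught = not checksum_matches(tampered, digest(original))

# Attacker also recalculates the checksum: the check passes.
fooled = checksum_matches(tampered, digest(tampered))
```

Both caught and fooled end up true, which is why a checksum fetched over the same untrusted channel as the package adds integrity against accidents but not against an attacker.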
I think the big users of Python like the Mozillas of the world run their own PyPI servers and only load a subset of packages into it. I’ve heard of other shops making RPMs or DEBs out of Python packages. That’s what I often do. It lets you leverage the infrastructure of your distribution and the signing and checking infrastructure is already there. However, if you don’t want to do that, you can always PGP sign and verify your packages which is what the rest of this post is about.
Verifying a package
There are relatively few packages on the cheeseshop (PyPI) that are PGP signed. For this example, I’ll use rpc4django, a package I release, and Gnu Privacy Guard (GPG), a PGP implementation. The PGP signature of the package (rpc4django-0.1.12.tar.gz.asc) can be downloaded along with the package (rpc4django-0.1.12.tar.gz). If you simply attempt to verify it, you’ll probably get a message like this:
% gpg --verify rpc4django-0.1.12.tar.gz.asc rpc4django-0.1.12.tar.gz
gpg: Signature made Mon Mar 12 15:14:28 2012 PDT using RSA key ID A737AB60
gpg: Can't check signature: public key not found
This message lets you know that the signature was made using PGP at the given date, but without the public key there is no way to verify that this package has not been modified since the author (me) signed it. So the next step is to get the public key for the package:
% gpg --search-keys A737AB60
gpg: searching for "A737AB60" from hkp server keys.gnupg.net
(1) David Fischer <firstname.lastname@example.org>
      2048 bit RSA key A737AB60, created: 2011-11-20
Keys 1-1 of 1 for "0xA737AB60". Enter number(s), N)ext, or Q)uit > q
If you hit “1”, you will import the key. Re-running the verify command will now properly verify the package:
% gpg --verify rpc4django-0.1.12.tar.gz.asc rpc4django-0.1.12.tar.gz
gpg: Signature made Mon Mar 12 15:14:28 2012 PDT using RSA key ID A737AB60
gpg: Good signature from "David Fischer <email@example.com>"
The fact that ten different Python modules will probably be signed by ten different PGP keys is a problem and I’m not sure there’s a way to make that easier. In addition, my key is probably not in your web of trust; nobody who you trust has signed my public key. So when you verify the signature, you will probably also see a message like this.
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
This means that I need to get my key signed by more people and you need to expand your web of trust.
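If you wanted to script the manual check above, a minimal sketch could shell out to the same gpg command. It assumes gpg is installed and the signer’s public key has already been imported; build_verify_command and signature_is_good are hypothetical helper names, not part of any packaging tool.

```python
import subprocess


def build_verify_command(signature_path, package_path):
    """Build the same 'gpg --verify' invocation used by hand above."""
    return ["gpg", "--verify", signature_path, package_path]


def signature_is_good(signature_path, package_path):
    """Run gpg and report whether it accepted the detached signature.

    gpg exits with status 0 for a good signature and non-zero when the
    signature is bad or the public key is missing.
    """
    result = subprocess.run(
        build_verify_command(signature_path, package_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Note that a zero exit status only says the signature matches a key in your keyring. Whether that key really belongs to the package author is the web of trust question, and this sketch does not answer it.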
Signing a package
Signing a package is easy and it is done as part of the upload process to PyPI. This assumes you have PGP all set up already. I haven’t done this in about a month so I hope the command is right.
% python setup.py sdist upload --sign
There are additional options, such as choosing which key signs the package, but the signing part is easy.
However, how many people actually verify the signature? Almost nobody. The package managers (Pip/EasyInstall) don’t and you probably just use one of them.
The future of Python packaging
So what can we do? I tried to work on this at the PythonSD meetup but I didn’t get very far, partly because it is a tough problem and partly because there was more chatting than coding. As a concrete proposal, I think we need to get PGP verification into Pip and solve issue #425. This probably means making Python-gnupg a prerequisite for Pip (at least for PGP verification). Step two is to add certificate verification. Python3 already supports certificate checking through OpenSSL. Python2 might have to use something like the Requests library. Step three is to get a proper certificate on PyPI.
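As a rough idea of what step two could look like, here is a sketch using only the Python 3 standard library. It assumes a reasonably recent Python 3 where ssl.create_default_context is available; make_verifying_context and fetch are hypothetical names, and this is not how Pip itself is implemented.

```python
import ssl
import urllib.request


def make_verifying_context():
    """Create a TLS context that checks certificates and hostnames.

    The default context loads the system's trusted CA certificates and
    requires the server certificate to validate against them.
    """
    return ssl.create_default_context()


def fetch(url, context=None):
    """Download a URL, failing if the server certificate is not valid."""
    if context is None:
        context = make_verifying_context()
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()
```

With a context like this, a download from a server presenting a certificate that does not validate raises an error instead of silently succeeding, which only helps once the index actually has a proper certificate (step three).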
Edit: Updated command to upload signed package
I released v0.1.10 of RPC4Django. I fixed an issue so that setup.py has no requirements on anything outside of the standard library and I set the project up such that python setup.py test runs the unit tests.
The bigger change is that I moved the project from Launchpad to Github. I’ve already been using Github quite a bit and I thought that I’d bite the bullet and do the move. While I liked Launchpad, I think it is better suited to larger projects that will use the features like Blueprints and Translations. For a small project like RPC4Django, Github’s code-centric approach works better.