Github Timeline and Social Coding

The Github public timeline is up on bigquery and I decided I’d play around with it. I created this visualization which is a first (alright, like eighth) try at measuring “how social” a project really is. The colors correspond to different programming languages and the size of the arc is based on the number of distinct collaborators on a project.

Other attempts

I thought about looking only at Pull Requests. It does uncover some interesting projects which have a lot of pull requests. I think this penalizes projects like Linux which doesn’t really have pull requests. Also, I wasn’t sure code submissions alone were exactly what I was looking for. I also briefly looked at only merged pull requests. I ended up filtering out projects with no stated language partially because there were a number of projects that just had common names (eg. “test).

While doing this, I figured I would have heard of the most social projects, but that there would be a lot of projects I’d never heard of for a variety of reasons. Some projects can get a lot of watchers, forks, comments or issues from an entirely separate group of people from the people I follow.

Most social projects by language

The visualization has the full dataset, but here’s a taste of the data:

  • C – php-src, linux, mruby
  • C++ – mosh, mysql, fr_public
  • JavaScript – bootstrap, meteor, jquery-file-upload
  • PHP – symfony, codeigniter, foundation
  • Python – django, legit, flask
  • Ruby – sample_app, rails, first_app

Signing and Verifying Python Packages with PGP

When I first showed Pip, the Python package installer, to a coworker a few years ago his first reaction was that he didn’t think it was a good idea to directly run code he downloaded from the Internet as root without looking at it first. He’s got a point. Paul McMillan dedicated part of his PyCon talk to this subject.

Python package management vs. Linux package management

To illustrate the security concerns, it is good to contrast how Python modules are usually installed with how Apt or Yum do it for Linux distributions. Debian and Redhat distros usually pre-provision the PGP keys for their packages with the distribution. Provided you installed a legitimate Linux distribution, you get the right PGP keys and every package downloaded through Apt/Yum is PGP checked. This means that the package is signed using private key for that distribution and you can verify that the exact package was signed and has not been modified. The package manager checks this and warns you when it does not match.

Pip and Easy Install don’t do any of that. They download packages in plaintext (which would be fine if every package was PGP signed and checked) and they download the checksums of the package in plaintext. If you manually tell Pip to point to a PyPI repository over HTTPS (say, it does not check the certificate. If you are on an untrusted network, it would not be tough to simply intercept requests to PyPI, download the package, add malicious code to and recalculate the checksum before returning the new malicious package on to be downloaded.

I think the big users of Python like the Mozillas of the world run their own PyPI servers and only load a subset of packages into it. I’ve heard of other shops making RPMs or DEBs out of Python packages. That’s what I often do. It lets you leverage the infrastructure of your distribution and the signing and checking infrastructure is already there. However, if you don’t want to do that, you can always PGP sign and verify your packages which is what the rest of this post is about.

Verifying a package

There are relatively few packages on the cheeseshop (PyPI) that are PGP signed. For this example, I’ll use rpc4django, a package I release, and Gnu Privacy Guard (GPG), a PGP implementation. The PGP signature of the package (rpc4django-0.1.12.tar.gz.asc) can be downloaded along with the package (rpc4django-0.1.12.tar.gz). If you simply attempt to verify it, you’ll probably get a message like this:

This message lets you know that the signature was made using PGP at the given date, but without the public key there is no way to verify that this package has not been modified since the author (me) signed it. So the next step is to get the public key for the package:

If you hit “1”, you will import the key. Re-running the verify command will now properly verify the package:

The fact that ten different Python modules will probably be signed by ten different PGP keys is a problem and I’m not sure there’s a way to make that easier. In addition, my key is probably not in your web of trust; nobody who you trust has signed my public key. So when you verify the signature, you will probably also see a message like this.

This means that I need to get my key signed by more people and you need to expand your web of trust.

Signing a package

Signing a package is easy and it is done as part of the upload process to PyPI. This assumes you have PGP all setup already. I haven’t done this in about a month so I hope the command is right.

There are additional options like the correct key to sign the package, but the signing part is easy.

However, how many people actually verify the signature? Almost nobody. The package managers (Pip/EasyInstall) don’t and you probably just use one of them.

The future of Python packaging

So what can we do? I tried to work on this at the PythonSD meetup but I didn’t get very far partially because it is a tough problem and partly because there was more chatting than coding. As a concrete proposal, I think we need to get PGP verification into Pip and solve issue #425. This probably means making Python-gnupg a prerequisite for Pip (at least for PGP verification). Step two is to add certificate verification. Python3 already supports certificate checking through OpenSSL. Python2 might have to use something like the Requests library. Step three is to get a proper certificate on PyPI.

Edit: Updated command to upload signed package

Edit (January 2018): This 5 year old post is massively outdated. I recommend taking a look at the Python packaging and distributing docs which are much better now. The commands I typically run to distribute a package are:

MIT Sloan Sports Analytics Conference 2012

This will be a slight divergence from the usual programming…

MIT Sloan Sports Analytics Conference Attendees List
I’m at the MIT Sloan Sports Analytics Conference this weekend and I’m having a blast. Jeff van Gundy is hilarious. Also, I was feeling a little snarky when I registerred (like 4 months ago) and I was also at work doing security stuff. So I put a joke in my company name. The attendees list is sorted by company and now I’m the top of the list.

However, getting down to San Diego sports, I took a look at the attendees list and nobody is here from the Chargers or Padres. I’m hoping they’re just incognito, but I’m guessing nobody came. Hopefully the home teams don’t get left behind…