Getting started with pygit2

I should preface this post by saying that I’m not a Git expert so this is based on my experimentation rather than any deep knowledge. I bet I’m not the only one who merely skimmed the internals chapter in Pro Git. This post is the result of me setting out to learn the Git internals a little better and help anybody else who is trying to use pygit2 for something awesome. With that said, corrections are welcome.

Installation

While this is obvious to some, I think it’s worth pointing out that pygit2 versions track libgit2 versions. If you have libgit2 v0.18.0 installed, then you need to use pygit2 v0.18.x; compiler errors will flow if you don’t. The docs don’t exactly mention this (pull request coming). Other than that, just follow the docs and you should be set. On macOS, you can brew install libgit2 (make sure you brew update first) to get the latest libgit2, followed by pip install pygit2.

The repository

The first class almost any user of pygit2 will interact with is Repository. I’ll be traversing and introspecting the Twitter bootstrap repository in my examples.
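Opening one is straightforward. Here’s a minimal sketch, assuming bootstrap has been cloned into the current directory (the path is just my example):

    import pygit2

    # discover_repository locates the .git directory from any path
    # inside the working tree.
    repo_path = pygit2.discover_repository('bootstrap')
    repo = pygit2.Repository(repo_path)

    print(repo.path)     # absolute path to the .git directory
    print(repo.is_bare)  # False for a normal clone with a working copy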

There’s quite a bit more to the repository object and I’ll show it after I introduce some other git primitives and terminology.

Git objects

There are four fundamental git objects — commits, tags, blobs and trees — which represent snapshots of the git working directory (commits), potentially annotated named references to commits (tags), chunks of data (blobs) and an organization of blobs or directory structure (trees). I’ll save blobs and trees for a later post, but here are some examples of using commits and tags in pygit2.

Since commits and tags are user facing, most git users should be familiar with them. Essentially, commits point to a version of the repository’s working copy at a point in time. They also carry various interesting bits of metadata.
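For example, here’s a quick look at that metadata, reusing the repo object from above:

    # Resolve HEAD to the commit it points to and poke at it.
    commit = repo.revparse_single('HEAD')

    print(commit.hex)          # hex-encoded SHA-1 hash
    print(commit.message)      # the full commit message
    print(commit.author.name)  # author is a Signature (.name, .email, .time)
    print(commit.commit_time)  # Unix timestamp
    print([parent.hex for parent in commit.parents])  # usually one parent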

One tip/issue with the repository bracket notation (repo[key]) is that the key’s type matters: a unicode string is treated as the hex-encoded object hash, while a byte string is assumed to be the binary version of the hash, called the oid.
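A small sketch of the distinction (this is under Python 2, where the unicode/byte string split exists):

    # A unicode key is read as a hex hash; a byte string as the raw oid.
    by_hex = repo[unicode(commit.hex)]  # look up by hex-encoded hash
    by_oid = repo[commit.oid]           # look up by the binary oid form
    assert by_hex.hex == by_oid.hex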

Tags are essentially named pointers to commits but they can contain additional metadata.
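For instance, looking up an annotated tag (v3.0.0 is a hypothetical tag name; substitute one the repository actually has):

    # revparse_single on an annotated tag returns the Tag object itself;
    # a lightweight tag would resolve straight to the commit instead.
    tag = repo.revparse_single('v3.0.0')

    print(tag.name)              # the tag's name
    print(tag.message)           # the annotation (or PGP signature block)
    print(repo[tag.target].hex)  # the commit the tag points to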

You can read all about the types of parameters that revparse_single handles in man gitrevisions or in the Git documentation under “specifying revisions”.
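A few quick examples of specs it accepts (the branch and tag names are hypothetical):

    repo.revparse_single('HEAD~3')     # three commits before HEAD
    repo.revparse_single('master')     # a branch name
    repo.revparse_single('v3.0.0^{}')  # peel an annotated tag to its commit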

Typically, you won’t ever need to convert between hex encoded hashes and oids, but in case you do, the conversion is trivial:
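Something like this does it (binascii does the heavy lifting; newer pygit2 versions also grow an Oid type that wraps this up):

    import binascii

    hex_hash = commit.hex                   # 40-character hex string
    raw_oid = binascii.unhexlify(hex_hash)  # 20-byte binary string (the oid)
    assert binascii.hexlify(raw_oid) == hex_hash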

Walking commits

The Repository object makes available a walk method for iterating over commits. This script walks the commit log and writes it out to JSON.
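A condensed sketch of what such a script looks like (the fields I dump are my own choice):

    import json

    import pygit2

    # Walk from HEAD; GIT_SORT_TIME orders commits by commit time.
    head = repo.revparse_single('HEAD')
    log = []
    for commit in repo.walk(head.oid, pygit2.GIT_SORT_TIME):
        log.append({
            'hash': commit.hex,
            'author': commit.author.name,
            'commit_time': commit.commit_time,
            'message': commit.message,
        })

    print(json.dumps(log, indent=2))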

Dump repository objects

This script dumps all tags and commits in a repository to JSON. It shows how repositories are iterable and sort of puts the whole tutorial together.
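In miniature, it looks something like this:

    import json

    import pygit2

    dump = {'commits': [], 'tags': []}
    for oid in repo:  # iterating a Repository yields every object id
        obj = repo[oid]
        if obj.type == pygit2.GIT_OBJ_COMMIT:
            dump['commits'].append({'hash': obj.hex, 'message': obj.message})
        elif obj.type == pygit2.GIT_OBJ_TAG:
            dump['tags'].append({'name': obj.name,
                                 'target': repo[obj.target].hex})

    print(json.dumps(dump, indent=2))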

Notes
  • There’s talk of changing the pygit2 API to be more Pythonic. I was using v0.18.x for this and significant things may change in the future.
  • It helps to think of a Git repository as a tree (or directed acyclic graph if you’re into that sort of thing) where the root is the latest commit. I used to think of the first commit as the root, but it’s actually a leaf!
  • If your repository uses annotated or signed tags, there will be longer messages or PGP signatures in the tag message.
  • I’ve glossed over huge chunks of pygit2 — pretty much anything that writes to the repository — but if I don’t leave something for later my loyal readers won’t come back to read more. =)

GitHub Data Challenge II

GitHub’s public timeline contains a wealth of knowledge about contributions to open source software from all over the world. It’s pretty typical to see over ten thousand contributions of some sort every hour! I decided to focus on the top 200 repositories (by forks) in order to have a more manageable set of data. Each comment, pull request or commit is tied to a repository, which in turn usually has a primary language associated with it. Contributions from folks who didn’t provide a location were ignored, and OpenStreetMap’s Nominatim service was used to geocode locations into latitude and longitude for those who did say where they were coding from.

If you aren’t from New York City or San Francisco and you contributed to a top 200 repository, you can probably find your own commits if you zoom in enough.

Contributions

Not all events are created equal. Watching a repository is not the same as committing code or opening issues. In general, I tried to calculate contributions based on the same criteria GitHub uses, though I don’t think I’m introspecting commits and pull requests as deeply as they are. Typically, for larger repositories, users commit to their own forks (which I ignore) and later send pull requests (which I count). However, this means a pull request that merges many commits is worth the same as a one-line pull request. The person who actually merges the pull request gets the same credit as the author, which actually makes sense when I gave it a second thought.
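To make the counting concrete, here’s a hypothetical sketch of that rule applied to one hour of githubarchive.org data (the file name and field names are assumptions based on the timeline schema of the time, so adjust to taste):

    import gzip
    import json

    counts = {}
    with gzip.open('2012-09-01-12.json.gz') as fh:
        for line in fh:
            event = json.loads(line)
            # Count pull requests; pushes to forks are ignored entirely.
            if event.get('type') != 'PullRequestEvent':
                continue
            repo_name = event.get('repository', {}).get('name')
            if repo_name:
                counts[repo_name] = counts.get(repo_name, 0) + 1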

One way to improve my accounting of contributions would be to look at the actual repositories to see which commits to forks ended up in the “main line”. For a repository that actively uses GitHub, virtually all commits end up in the main repository through pull requests or via somebody with permission to push directly, both of which appear in the githubarchive.org data. For a repository like Linux, which only stores code on GitHub and doesn’t accept pull requests there, it would be nice to actually analyze the commit history. I bet most of the Linux contributors have GitHub accounts to attribute their work to.

Geocoding

Geocoding messy data is, well… messy. The location field for users on GitHub is simply a fill-in-the-blank field and users can type anything in there, from their city to their university to an IRC channel. Sometimes people just type in a country name, which is fine for Singapore but doesn’t really narrow it down much for Canada. The locations listed for contributors on the top 200 repositories were surprisingly clean, however. They weren’t without somewhat humorous errors, though.
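For the curious, the geocoding boils down to roughly this (a sketch with a hypothetical User-Agent; respect Nominatim’s usage policy and rate limit yourself):

    import json
    import urllib
    import urllib2

    def geocode(location):
        # Nominatim asks that you identify your application.
        params = urllib.urlencode({'q': location, 'format': 'json', 'limit': 1})
        url = 'http://nominatim.openstreetmap.org/search?' + params
        request = urllib2.Request(url, headers={'User-Agent': 'geo-example'})
        results = json.load(urllib2.urlopen(request))
        if results:
            return float(results[0]['lat']), float(results[0]['lon'])
        return None  # Nominatim drew a blank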


DjangoCon 2011 Day Three

I know DjangoCon has been over for a week, but I didn’t get a chance to talk about day three and specifically Paul McMillan’s excellent security talk. I also think it’s interesting that DjangoCon seems to correlate with security releases (2011, 2010).

Timing attacks

Paul demonstrated a timing attack against password reset: a mechanism that mails a user a one-time link to use to reset their password. This timing attack could guess that link with fewer requests than brute force would need — that is, fewer than having to guess all possible combinations. It did so by measuring the difference in response time between requests with more correct characters in the URL and requests with fewer. I spoke with Paul and he said that this attack works best locally and would be hard to execute remotely, because variability in network latency is significant enough to make measuring the timing differences difficult. While this attack is not completely practical, a lot of people use shared or cloud hosting, which lets attackers reduce that latency variability by setting up attack servers on the same network.
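To see why response times leak information, compare a naive string comparison with a constant-time one (a simplified sketch; Django ships the real thing as django.utils.crypto.constant_time_compare):

    def naive_compare(secret, guess):
        # Returns as soon as a character differs, so a guess sharing a
        # longer prefix with the secret takes measurably longer to reject.
        if len(secret) != len(guess):
            return False
        for a, b in zip(secret, guess):
            if a != b:
                return False
        return True

    def constant_time_compare(val1, val2):
        # Examines every character regardless of mismatches, so timing
        # reveals nothing about how much of the guess was correct.
        if len(val1) != len(val2):
            return False
        result = 0
        for a, b in zip(val1, val2):
            result |= ord(a) ^ ord(b)
        return result == 0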

Paul also demonstrated a timing attack which leaked some information about whether a username was valid in the system.

Securing Django in production

Even if Django itself is completely secure (which nothing truly is), mistakes can be made in deployment. Paul recommended an app called django-secure, which checks for common misconfigurations. In addition, he said that the login URL should always be throttled to prevent password guessing. The Django security docs, which your humble blogger helped write, also recommend that among a number of other things. They are worth a read.
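As I recall django-secure’s setup (these names are from memory, so check the project’s docs), it’s a settings tweak plus a management command:

    # settings.py -- enable django-secure's middleware and checks.
    INSTALLED_APPS += ('djangosecure',)
    MIDDLEWARE_CLASSES += ('djangosecure.middleware.SecurityMiddleware',)

After that, python manage.py checksecure reports anything it finds misconfigured.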

Password issues

I posted a primer about Django passwords last month. Paul had some more things to say about it. Firstly, database dumps/backups and initial data which contain hashed passwords should not be public (for example, on GitHub). As I mentioned in the primer, eight character passwords using Django’s current hashing algorithm (sha1) can be brute forced in a matter of hours in the worst case. So if you accidentally leaked a backup — and a number of high profile sites have done things like this — then consider those passwords broken.

The fix for the password problem is to use a “slower” hashing algorithm designed for hashing passwords. I spoke with Paul after the talk, and one of the roadblocks to using something like bcrypt is its reliance on C extensions, which the Django core team is reluctant to introduce. However, they are really trying to get something better into Django core for 1.4.
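For a sense of what that looks like, here’s a sketch using the py-bcrypt package (pip install py-bcrypt); the work factor makes each guess cost milliseconds instead of microseconds:

    import bcrypt

    # gensalt's log_rounds sets the work factor: each +1 doubles the cost.
    hashed = bcrypt.hashpw('s3cret passphrase', bcrypt.gensalt(log_rounds=12))

    # Verification re-hashes the candidate with the salt stored in `hashed`.
    assert bcrypt.hashpw('s3cret passphrase', hashed) == hashed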

Miscellaneous

There were a number of other recommendations including:

  • Be careful where you store pickled data (cache, /tmp, etc.). Pickled objects can contain executable code; see the sketch after this list.
  • Use the proper cryptographic functions available in Django and Python, including random.SystemRandom, django.utils.crypto.constant_time_compare, and django.utils.crypto.salted_hmac.
  • Be careful when deploying HTTPS to make sure it is done properly.
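To make the pickle point concrete, here’s the classic demonstration of why untrusted pickles are dangerous:

    import os
    import pickle

    class Evil(object):
        # Unpickling calls the callable __reduce__ returns, with these args.
        def __reduce__(self):
            return (os.system, ('echo owned',))

    payload = pickle.dumps(Evil())
    pickle.loads(payload)  # runs `echo owned`: arbitrary code execution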

It’s good to hear that security people are going over Django with a fine-toothed comb.