Getting started with pygit2

I should preface this post by saying that I’m not a Git expert so this is based on my experimentation rather than any deep knowledge. I bet I’m not the only one who merely skimmed the internals chapter in Pro Git. This post is the result of me setting out to learn the Git internals a little better and help anybody else who is trying to use pygit2 for something awesome. With that said, corrections are welcome.


While this is obvious to some, I think it’s worth pointing out that pygit2 versions track libgit2 versions. If you have libgit2 v0.18.0 installed, then you need to use pygit2 v0.18.x. Compiler errors will flow if you don’t. The docs don’t exactly mention this (pull request coming). Other than that, just follow the docs and you should be set. On MacOS, you can brew install libgit2 (make sure you brew update first) to get the latest libgit2 followed by pip install pygit2.

The repository

The first class almost any user of pygit2 will interact with is Repository. I’ll be traversing and introspecting the Twitter bootstrap repository in my examples.

There’s quite a bit more to the repository object and I’ll show it after I introduce some other git primitives and terminology.

Git objects

There are four fundamental git objects — commits, tags, blobs and trees — which reference snapshots of the git working directory (commits), potentially annotated named references to commits (tags), chunks of data (blobs) and an organization of blobs or directory structure (trees). I’ll save blobs and trees for a later post, but here’s some examples of using commits and tags in pygit2.

Since commits and tags are user facing, most git users should be familiar with them. Essentially commits point to a version of the working copy of the repository in time. They also have various interesting bits of metadata.

One tip/issue with using the repository bracket notation (repo[hex | oid]) is that the key MUST be either a unicode string if specifying the object hash or if it is a byte string pygit2 assumes that it points to the binary version of the hash called the oid.

Tags are essentially named pointers to commits but they can contain additional metadata.

You can read all about the types of parameters that revparse_single handles at man gitrevisions or in the Git documentation under specifying revisions.

Typically, you won’t need to ever convert between hex encoded hashes and oids, but in case you do the the conversion is trivial:

Walking commits

The Repository object makes available a walk method for iterating over commits. This script walks the commit log and writes it out to JSON.

Dump repository objects

This script dumps all tags and commits in a repository to JSON. It shows how repositories are iterable and sort of puts the whole tutorial together.

  • There’s talk of changing the pygit2 API to be more Pythonic. I was using v0.18.x for this and significant things may change in the future.
  • It helps to think of a Git repository as a tree (or directed acyclic graph if you’re into that sort of thing) where the root is the latest commit. I used to think about version control where the first commit is the root, but instead it is a leaf!
  • If your repository uses annotated or signed tags, there will be longer messages or PGP signatures in the tag message.
  • I’ve glossed over huge chunks of pygit2 — pretty much anything that writes to the repository — but if I don’t leave something for later my loyal readers won’t come back to read more. =)

Djangocon Day One

As I promised, here’s some (semi-live) blogging from Djangocon.

The first talk of the day was Scaling the World’s Largest Django Application given by the guys at Disqus (slides). The basic gist is how they scaled up Django to handle a bajillion requests per month. One thing that I noticed is that to reach this scale, they are effectively throwing away referential integrity in their database. When asked about it, one of the guy mentioned that they have some scripts to verify that things have some level of integrity which I thought was a little scary. I was happy to see that they are basically measuring everything. On top of the usual continuous integration stuff like unit tests, lint (Pyflakes, actually) they are logging every query and every traceback using a package they opensourced named Sentry. In addition, the brief mention of the save() method’s concurrency issues was interesting.

Jeff Balogh’s talk on Switching [AMO] from CakePHP to Django (slides – pdf) was on the similar subject of switching the high traffic site AMO to Django from PHP. My favorite point from this talk is how they handle the classic stale DB replication problem of a user submitting new data (to the master) and then not seeing their data (from the slave). Basically, they use a custom Django router that detects a POST and switches that user’s session to always read and write from the master which is pretty damn clever. Mozilla also de-normalized their database in order store a reference to the latest add-on (a classic problem). However, Jeff did mention that they might switch that to storing that info in cache. The main gist of this talk was cache everything.

Russell Keith-Magee’s So you want to be a core developer? and James Bennett’s Topics of Interest (going on now — see live blogging) are both on the future of Django and how to get involved and what needs to change in the Django community. The Django people need more people to get involved. James said there are only 14 committing developers and Russell said that to get Django 1.2 out the door he had to review tickets for 5 hours a night for 2 weeks because nobody else was doing it. As James said, their bus numbers — the number of people who get hit by a bus and then you’re screwed — is frighteningly small. There’s only 2 people who know Django Oracle support.

Themes from the con
  • Git seems to be winning in the DVCS space. It really seems like git and github are taking over and Bazaar and Mercurial are being left at the wayside. I’ve seen tons of links to github and zero to bitbucket or launchpad. James Bennett just now briefly mentioned bitbucket and launchpad.
  • Django developers think that database referential integrity is overrated. Multiple people mentioned that integrity is a farce when things scale insanely.
  • Deploy early and often. The Mozilla guys deploy at least weekly. The Disqus guys deploy daily or more often. Although Jeff Balogh from Mozilla didn’t say it, I wonder if this means that like Disqus they are running out of trunk rather than branching and releasing.
  • Celery (see my previous post) is awesome and everybody seems to be adopting it.
  • If you aren’t using Pip and Virtualenv, you should (previous post). However, you probably shouldn’t deploy a production box from Pip and Pypi.