Skip to content

Recent Articles

13
Jun

Getting started with pygit2

I should preface this post by saying that I’m not a Git expert so this is based on my experimentation rather than any deep knowledge. I bet I’m not the only one who merely skimmed the internals chapter in Pro Git. This post is the result of me setting out to learn the Git internals a little better and help anybody else who is trying to use pygit2 for something awesome. With that said, corrections are welcome.

Installation

While this is obvious to some, I think it’s worth pointing out that pygit2 versions track libgit2 versions. If you have libgit2 v0.18.0 installed, then you need to use pygit2 v0.18.x. Compiler errors will flow if you don’t. The docs don’t exactly mention this (pull request coming). Other than that, just follow the docs and you should be set. On MacOS, you can brew install libgit2 (make sure you brew update first) to get the latest libgit2 followed by pip install pygit2.

The repository

The first class almost any user of pygit2 will interact with is Repository. I’ll be traversing and introspecting the Twitter bootstrap repository in my examples.

>>> import pygit2
>>> repo = pygit2.Repository('/Users/dfischer/Projects/bootstrap')
>>> repo.head.hex    # sha1 hex hash of the commit pointed to by HEAD
u'd9b502dfb876c40b0735008bac18049c7ee7b6d2'
>>> repo.path
'/Users/dfischer/Projects/bootstrap/.git/'
>>> repo.workdir
'/Users/dfischer/Projects/bootstrap/'

There’s quite a bit more to the repository object and I’ll show it after I introduce some other git primitives and terminology.

Git objects

There are four fundamental git objects — commits, tags, blobs and trees — which reference snapshots of the git working directory (commits), potentially annotated named references to commits (tags), chunks of data (blobs) and an organization of blobs or directory structure (trees). I’ll save blobs and trees for a later post, but here’s some examples of using commits and tags in pygit2.

Since commits and tags are user facing, most git users should be familiar with them. Essentially commits point to a version of the working copy of the repository in time. They also have various interesting bits of metadata.

>>> commit = repo.revparse_single('042bb9b5')   # equivalent to repo[u'042bb9b5']
>>> commit.hex
u'042bb9b51510573a9a1db6bc66cb16311d0d580b'
>>> commit.message
u'Merge pull request #6780 from ...'
>>> commit.author.email    # clearly I'm editing out author info
u'xxx@gmail.com'
>>> commit.author.name
u'xxx'
>>> commit.author.offset
-480
>>> commit.author.time    # epoch time
1360128599

One tip/issue with using the repository bracket notation (repo[hex | oid]) is that the key MUST be either a unicode string if specifying the object hash or if it is a byte string pygit2 assumes that it points to the binary version of the hash called the oid.

Tags are essentially named pointers to commits but they can contain additional metadata.

>>> tag = repo.revparse_single('v2.3.1')
>>> tag.tagger.email
u'xxx@gmail.com'
>>> tag.tagger.name
u'xxx'
>>> tag.message
u'v2.3.1\n'
>>> tag.target    # binary version of the hex commit hash called an "oid"
'\xeb$q\x8a\xddM\xd3o\xe9/\xdb\xdby\xe6\xffL\xe5\x91\x93\x00'
>>> commit = repo[tag.target]
>>> commit.hex
u'eb24718add4dd36fe92fdbdb79e6ff4ce5919300'
>>> repo[tag.target].hex == repo.revparse_single('v2.3.1^0').hex
True

You can read all about the types of parameters that revparse_single handles at man gitrevisions or in the Git documentation under specifying revisions.

Typically, you won’t need to ever convert between hex encoded hashes and oids, but in case you do the the conversion is trivial:

>>> import base64
>>> base64.b16encode(tag.oid).lower() == tag.hex
True
Walking commits

The Repository object makes available a walk method for iterating over commits. This script walks the commit log and writes it out to JSON.

Dump repository objects

This script dumps all tags and commits in a repository to JSON. It shows how repositories are iterable and sort of puts the whole tutorial together.

Notes
  • There’s talk of changing the pygit2 API to be more Pythonic. I was using v0.18.x for this and significant things may change in the future.
  • It helps to think of a Git repository as a tree (or directed acyclic graph if you’re into that sort of thing) where the root is the latest commit. I used to think about version control where the first commit is the root, but instead it is a leaf!
  • If your repository uses annotated or signed tags, there will be longer messages or PGP signatures in the tag message.
  • I’ve glossed over huge chunks of pygit2 — pretty much anything that writes to the repository — but if I don’t leave something for later my loyal readers won’t come back to read more. =)
4
May

GitHub Data Challenge II

GitHub’s public timeline contains a wealth of knowledge about contributions to open source software from all over the world. It’s pretty typical to see over ten thousand contributions of some sort every hour! I decided to focus on the top 200 repositories (by forks) only in order to have a more manageable set of data. Each comment, pull request or commit is tied to a repository which in turn usually has a primary language associated with it. Contributions from folks who didn’t provide a location were ignored and OpenStreetMap’s Nominatim service was used to geocode locations into latitude and longitude for those who did say where they coding from.

If you aren’t from New York City or San Francisco and you contributed to a top 200 repository, you can probably find your own commits if you zoom in enough.

Contributions

Not all events are created equal. Watching a repository is not the same as committing code or opening issues. In general, I tried to calculate contributions based on the same criteria GitHub uses but I think I’m not introspecting commits and pull requests as deeply as they are. Typically, for larger repositories, users commit to their own forks — which I ignore — and later send pull requests which I’m counting. However, this discounts a large fork which merged many commits to be worth the same as a one line pull request. The person who actually merges the pull request gets the same credit as the author which actually makes sense when I gave it a second thought.

One way to improve my accounting of contributions would be to look at the actual repositories to see which commits to forks ended up in the “main line”. For a repository that actually uses GitHub virtually all commits end up in the main repository through pull requests or via somebody with permission to push directly which appear in the githubarchive.org data. For a repository like Linux which only stores code on GitHub and doesn’t accept pull requests it would be nice to actually analyze the commit history. I bet most of the Linux contributors have GitHub accounts to attribute their work to.

Geocoding

Geocoding messy data is well… messy. The location field for users on GitHub is simply a fill-in-the-blank field and users can type anything in there from their city to their university to an IRC channel. Sometimes people just type in a country name which is fine for Singapore but doesn’t really narrow it down too much for Canada. The locations listed for contributors on the top 200 repositories was surprisingly clean, however. It wasn’t without somewhat humorous errors though.

Links
25
Mar

Photos from Asia 2013

Here’s some more photos and there’s more to come.

29
Jan

FOSDEM 2013 Lineup

If you’re going to be at FOSDEM, don’t forget to say hi! Here is a list of talks that struck my fancy at first glance.

Saturday

Sunday

20
Dec

Quick Note on Pelican & Github Pages

Originally, I built the San Diego Python users group website as static HTML hosted on Github pages. However, as time progressed, the group wanted to have posts for our events with links to presentations and whatnot. I looked at Jekyll but what would a Python users group be generating its website with Ruby. Normally, I’m all about using the best tool for the job, but all of the group’s leads/members know Python. I settled on Pelican which seems to be the most fully featured of the Python static website generators out there.

One repository

Under ideal conditions I would have one repository that hosted all our markdown/reST files and also contained the static HTML output that Github serves. Project pages are a great way to do that and Pelican already has some integration. Your markdown/reST goes in the master branch and your HTML output goes in the gh-pages branch. ghp-import facilitates this quite nicely. There is already a make target for it in Pelican!

Custom domain

I already had pythonsd.org and normally on Github pages, pointing to a custom domain is as simple as adding a CNAME file. One tricky part that I didn’t realize is the CNAME file needs to be in the gh-pages branch. I added an extra line to the Makefile to copy it to the output which automatically gets put into the gh-pages branch by ghp-import. If you don’t do this, your CNAME file will be overwritten by the next “make github”.

I considered having two repositories because I had some doubts about how clean a one repo solution would be. In the end, I just needed to figure out exactly what ghp-import was doing. It worked fine and accomplishes exactly what I want.