I should preface this post by saying that I’m not a Git expert so this is based on my experimentation rather than any deep knowledge. I bet I’m not the only one who merely skimmed the internals chapter in Pro Git. This post is the result of me setting out to learn the Git internals a little better and help anybody else who is trying to use pygit2 for something awesome. With that said, corrections are welcome.
While this is obvious to some, I think it’s worth pointing out that pygit2 versions track libgit2 versions. If you have libgit2 v0.18.0 installed, then you need to use pygit2 v0.18.x. Compiler errors will flow if you don’t. The docs don’t exactly mention this (pull request coming). Other than that, just follow the docs and you should be set. On MacOS, you can brew install libgit2 (make sure you brew update first) to get the latest libgit2, followed by pip install pygit2.
>>> import pygit2
>>> repo = pygit2.Repository('/Users/dfischer/Projects/bootstrap')
>>> repo.head.hex  # sha1 hex hash of the commit pointed to by HEAD
u'd9b502dfb876c40b0735008bac18049c7ee7b6d2'
>>> repo.path
'/Users/dfischer/Projects/bootstrap/.git/'
>>> repo.workdir
'/Users/dfischer/Projects/bootstrap/'
There’s quite a bit more to the repository object and I’ll show it after I introduce some other git primitives and terminology.
There are four fundamental git objects: commits, tags, blobs and trees. Commits reference snapshots of the git working directory, tags are potentially annotated named references to commits, blobs are chunks of data, and trees organize blobs into a directory structure. I’ll save blobs and trees for a later post, but here are some examples of using commits and tags in pygit2.
Since commits and tags are user facing, most git users should be familiar with them. Essentially, a commit points to a version of the repository’s working copy at a point in time. Commits also carry various interesting bits of metadata.
>>> commit = repo.revparse_single('042bb9b5')  # equivalent to repo[u'042bb9b5']
>>> commit.hex
u'042bb9b51510573a9a1db6bc66cb16311d0d580b'
>>> commit.message
u'Merge pull request #6780 from ...'
>>> commit.author.email  # clearly I'm editing out author info
u'email@example.com'
>>> commit.author.name
u'xxx'
>>> commit.author.offset
-480
>>> commit.author.time  # epoch time
1360128599
One tip/issue with the repository bracket notation (repo[hex | oid]): the key must be a unicode string if you’re specifying the object hash in hex; if it’s a byte string, pygit2 assumes it’s the binary version of the hash, called the oid.
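To make the distinction concrete, here’s a small sketch (the helper name is mine) that fetches the same object both ways:

```python
import binascii

def lookup_both_ways(repo, hex_hash):
    """Fetch the same object via its text hex hash and via its binary oid."""
    by_hex = repo[hex_hash]                      # text key: the 40-char hex hash
    by_oid = repo[binascii.unhexlify(hex_hash)]  # 20 raw bytes: the "oid"
    assert by_hex.hex == by_oid.hex
    return by_hex
```

The two keys name the same object; only their encoding differs.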
Tags are essentially named pointers to commits but they can contain additional metadata.
>>> tag = repo.revparse_single('v2.3.1')
>>> tag.tagger.email
u'firstname.lastname@example.org'
>>> tag.tagger.name
u'xxx'
>>> tag.message
u'v2.3.1\n'
>>> tag.target  # binary version of the hex commit hash called an "oid"
'\xeb$q\x8a\xddM\xd3o\xe9/\xdb\xdby\xe6\xffL\xe5\x91\x93\x00'
>>> commit = repo[tag.target]
>>> commit.hex
u'eb24718add4dd36fe92fdbdb79e6ff4ce5919300'
>>> repo[tag.target].hex == repo.revparse_single('v2.3.1^0').hex
True
You can read all about the revision specifications that revparse_single handles in man gitrevisions or in the Git documentation under “specifying revisions”.
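For instance, here are a few of the specs it understands (a sketch; the helper name is mine, and 'v2.3.1' assumes a repository with that tag, like the Bootstrap checkout above):

```python
def resolve_examples(repo):
    """Resolve a few gitrevisions-style specs (see man gitrevisions)."""
    return {
        'HEAD': repo.revparse_single('HEAD'),          # the current commit
        'HEAD^': repo.revparse_single('HEAD^'),        # its first parent
        'HEAD~3': repo.revparse_single('HEAD~3'),      # three first parents back
        'v2.3.1^0': repo.revparse_single('v2.3.1^0'),  # peel the tag to a commit
    }
```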
Typically, you won’t ever need to convert between hex-encoded hashes and oids, but in case you do, the conversion is trivial:
>>> import base64
>>> base64.b16encode(tag.oid).lower() == tag.hex
True
The Repository object makes a walk method available for iterating over commits. This script walks the commit log and writes it out to JSON.
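The script isn’t reproduced inline, but a minimal sketch of the idea might look like this (function and field choices are mine, and I’m assuming pygit2 v0.18.x, where repo.head is the commit HEAD points to):

```python
import json

def commit_to_dict(commit):
    """Flatten the interesting commit metadata into a plain dict."""
    return {
        'hex': commit.hex,
        'message': commit.message,
        'author_name': commit.author.name,
        'author_email': commit.author.email,
        'commit_time': commit.commit_time,  # epoch seconds
    }

def dump_log(repo_path, out_path):
    # Deferred import so commit_to_dict stays usable without libgit2 installed.
    import pygit2
    repo = pygit2.Repository(repo_path)
    # Walk backwards from HEAD in commit-time order, like `git log`.
    commits = [commit_to_dict(c)
               for c in repo.walk(repo.head.oid, pygit2.GIT_SORT_TIME)]
    with open(out_path, 'w') as fp:
        json.dump(commits, fp, indent=2)
```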
Dump repository objects
This script dumps all tags and commits in a repository to JSON. It shows how repositories are iterable and sort of puts the whole tutorial together.
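Again, a rough sketch of what that script does (against pygit2 v0.18.x; GIT_OBJ_COMMIT and GIT_OBJ_TAG are that era’s object-type constants, and the helper name is mine):

```python
def collect_objects(repo_path):
    """Group every tag and commit in the repository into plain dicts."""
    # Deferred import so the function can be read and stubbed without libgit2.
    import pygit2
    repo = pygit2.Repository(repo_path)
    tags, commits = [], []
    for oid in repo:  # iterating a repository yields every object id
        obj = repo[oid]
        if obj.type == pygit2.GIT_OBJ_COMMIT:
            commits.append({'hex': obj.hex, 'message': obj.message})
        elif obj.type == pygit2.GIT_OBJ_TAG:
            tags.append({'name': obj.name, 'message': obj.message})
    return {'tags': tags, 'commits': commits}
```

Note that this visits every object in the object database, including ones not reachable from HEAD.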
- There’s talk of changing the pygit2 API to be more Pythonic. I was using v0.18.x for this and significant things may change in the future.
- It helps to think of a Git repository as a tree (or a directed acyclic graph, if you’re into that sort of thing) whose root is the latest commit. I used to picture version control with the first commit as the root, but the first commit is actually a leaf!
- If your repository uses annotated or signed tags, there will be longer messages or PGP signatures in the tag message.
- I’ve glossed over huge chunks of pygit2 — pretty much anything that writes to the repository — but if I don’t leave something for later my loyal readers won’t come back to read more. =)
GitHub’s public timeline contains a wealth of knowledge about contributions to open source software from all over the world. It’s pretty typical to see over ten thousand contributions of some sort every hour! I decided to focus only on the top 200 repositories (by forks) in order to have a more manageable set of data. Each comment, pull request or commit is tied to a repository, which in turn usually has a primary language associated with it. Contributions from folks who didn’t provide a location were ignored, and OpenStreetMap’s Nominatim service was used to geocode locations into latitude and longitude for those who did say where they were coding from.
If you aren’t from New York City or San Francisco and you contributed to a top 200 repository, you can probably find your own commits if you zoom in enough.
Not all events are created equal. Watching a repository is not the same as committing code or opening issues. In general, I tried to calculate contributions based on the same criteria GitHub uses, but I don’t think I’m introspecting commits and pull requests as deeply as they are. Typically, for larger repositories, users commit to their own forks (which I ignore) and later send pull requests (which I count). However, this means a fork that merged many commits counts the same as a one-line pull request. The person who actually merges the pull request gets the same credit as the author, which actually makes sense when I gave it a second thought.
One way to improve my accounting of contributions would be to look at the actual repositories to see which commits to forks ended up in the “main line”. For a repository that actually uses GitHub, virtually all commits end up in the main repository through pull requests or via somebody with permission to push directly, both of which appear in the githubarchive.org data. For a repository like Linux, which only stores code on GitHub and doesn’t accept pull requests, it would be nice to actually analyze the commit history. I bet most of the Linux contributors have GitHub accounts to attribute their work to.
Geocoding messy data is, well… messy. The location field for users on GitHub is simply a fill-in-the-blank field, and users can type anything in there, from their city to their university to an IRC channel. Sometimes people just type in a country name, which is fine for Singapore but doesn’t really narrow it down much for Canada. The locations listed for contributors to the top 200 repositories were surprisingly clean, however. It wasn’t without somewhat humorous errors, though.
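For the curious, the geocoding step boils down to something like this sketch (the endpoint and format parameter are Nominatim’s public search API; the function names and User-Agent string are placeholders of mine):

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

SEARCH_URL = 'https://nominatim.openstreetmap.org/search?format=json&q='

def nominatim_url(location):
    """Build a Nominatim search URL for a free-form location string."""
    return SEARCH_URL + quote(location)

def geocode(location):
    """Return (lat, lon) for a location string, or None if nothing matched."""
    # Nominatim's usage policy asks for an identifying User-Agent.
    req = Request(nominatim_url(location),
                  headers={'User-Agent': 'gh-map-demo'})
    results = json.load(urlopen(req))
    if results:
        return float(results[0]['lat']), float(results[0]['lon'])
    return None
```

Nominatim returns a ranked list of matches, so taking the first result is a crude but workable choice for bulk geocoding.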
If you’re going to be at FOSDEM, don’t forget to say hi! Here is a list of talks that struck my fancy at first glance.
- 11:00-11:50 How we made the Jenkins community
- 11:00-11:50 How Google builds web services (in PHP?)
- 14:00-14:50 Practical security for developers
- 16:00-16:50 Trends in open source security
- 16:00-16:15 The LLVMLinux project
- 17:00-17:50 PDF.js Firefox’s PDF viewer
- 09:30-09:55 State of the OpenStack union 2013
- 10:00-10:25 OpenStack: 21st century app architecture and cloud operations
- 11:00-11:50 Firefox OS
- 11:40-11:55 Do you want to measure your project?
- 14:00-14:50 PostgreSQL as a schemaless database
- 14:00-14:25 Vaurien the chaos TCP proxy
- 14:30-14:55 Python for Humans
- 15:00-15:50 PostgreSQL: implementing high availability
- 15:00-15:25 Security priorities for cloud developers
Originally, I built the San Diego Python users group website as static HTML hosted on GitHub Pages. However, as time progressed, the group wanted to have posts for our events with links to presentations and whatnot. I looked at Jekyll, but what would a Python users group be doing generating its website with Ruby? Normally, I’m all about using the best tool for the job, but all of the group’s leads/members know Python. I settled on Pelican, which seems to be the most fully featured of the Python static site generators out there.
Under ideal conditions, I would have one repository that hosts all our markdown/reST files and also contains the static HTML output that GitHub serves. Project pages are a great way to do that, and Pelican already has some integration: your markdown/reST goes in the master branch and your HTML output goes in the gh-pages branch. ghp-import facilitates this quite nicely, and there is already a make target for it in Pelican!
I already had pythonsd.org, and normally on GitHub Pages, pointing to a custom domain is as simple as adding a CNAME file. One tricky part that I didn’t realize is that the CNAME file needs to be in the gh-pages branch. I added an extra line to the Makefile to copy it into the output, which ghp-import then automatically puts into the gh-pages branch. If you don’t do this, your CNAME file will be overwritten by the next “make github”.
I considered having two repositories because I had some doubts about how clean a one-repo solution would be. In the end, I just needed to figure out exactly what ghp-import was doing. It works fine and accomplishes exactly what I want.