GitHub Data Challenge II

GitHub’s public timeline contains a wealth of knowledge about contributions to open source software from all over the world. It’s pretty typical to see over ten thousand contributions of some sort every hour! To keep the data set manageable, I focused only on the top 200 repositories by forks. Each comment, pull request or commit is tied to a repository, which in turn usually has a primary language associated with it. Contributions from folks who didn’t provide a location were ignored, and OpenStreetMap’s Nominatim service was used to geocode locations into latitude and longitude for those who did say where they were coding from.

If you aren’t from New York City or San Francisco and you contributed to a top 200 repository, you can probably find your own commits if you zoom in enough.

Contributions

Not all events are created equal. Watching a repository is not the same as committing code or opening issues. In general, I tried to calculate contributions based on the same criteria GitHub uses, but I suspect I’m not introspecting commits and pull requests as deeply as they are. Typically, for larger repositories, users commit to their own forks, which I ignore, and later send pull requests, which I count. However, this means a pull request from a large fork that merges many commits is worth the same as a one-line pull request. The person who actually merges the pull request gets the same credit as the author, which made sense once I gave it a second thought.
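The counting rule described above can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the visualization: the event dictionaries and their keys (`type`, `actor`, `repo_is_fork`, `merged_by`) are hypothetical simplifications of the githubarchive.org records, and the set of event types treated as contributions is an assumption.

```python
from collections import Counter

# Event types treated as contributions in this sketch (an assumption).
CONTRIBUTION_TYPES = {
    "PushEvent",
    "PullRequestEvent",
    "IssuesEvent",
    "IssueCommentEvent",
}


def count_contributions(events):
    """Return a Counter mapping user -> number of contributions."""
    totals = Counter()
    for event in events:
        if event["type"] not in CONTRIBUTION_TYPES:
            continue  # e.g. watching a repository is not a contribution
        if event["type"] == "PushEvent" and event.get("repo_is_fork"):
            continue  # commits pushed to a user's own fork are ignored
        totals[event["actor"]] += 1
        # Whoever merges a pull request gets the same credit as its author.
        merger = event.get("merged_by")
        if event["type"] == "PullRequestEvent" and merger:
            totals[merger] += 1
    return totals
```

Because every pull request adds exactly one to the author's total, a one-line fix and a merge of hundreds of commits score the same, which is the limitation noted above.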

One way to improve my accounting of contributions would be to look at the actual repositories to see which commits to forks ended up in the “main line”. For a repository that actually uses GitHub, virtually all commits end up in the main repository through pull requests or through somebody with permission to push directly, and both kinds of events appear in the githubarchive.org data. For a repository like Linux, which only stores code on GitHub and doesn’t accept pull requests, it would be nice to actually analyze the commit history. I bet most of the Linux contributors have GitHub accounts to attribute their work to.

Geocoding

Geocoding messy data is, well… messy. The location field for users on GitHub is simply a fill-in-the-blank field, and users can type anything in there, from their city to their university to an IRC channel. Sometimes people just type in a country name, which is fine for Singapore but doesn’t really narrow it down much for Canada. The locations listed for contributors on the top 200 repositories were surprisingly clean, however. They weren’t without somewhat humorous errors, though.
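For a sense of what geocoding a free-form location involves, here is a minimal sketch against Nominatim's public search endpoint. It is not the code used for the map: the function names and the `example-geocoder` User-Agent are made up for illustration, and it assumes the endpoint's `q`, `format` and `limit` query parameters and a JSON list response with string `lat` and `lon` fields.

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def build_query_url(location):
    """Build a Nominatim search URL for a free-form location string."""
    params = urllib.parse.urlencode(
        {"q": location, "format": "json", "limit": 1}
    )
    return f"{NOMINATIM_URL}?{params}"


def parse_first_result(body):
    """Return (lat, lon) from a Nominatim JSON response, or None if empty."""
    results = json.loads(body)
    if not results:
        return None  # nothing matched, e.g. an IRC channel name
    return float(results[0]["lat"]), float(results[0]["lon"])


def geocode(location, user_agent="example-geocoder"):
    """Geocode one location string; blank locations are skipped."""
    if not location or not location.strip():
        return None
    request = urllib.request.Request(
        build_query_url(location.strip()),
        headers={"User-Agent": user_agent},  # hypothetical identifier
    )
    with urllib.request.urlopen(request) as response:
        return parse_first_result(response.read().decode("utf-8"))
```

Taking only the first result is what makes a bare country name so coarse: “Canada” comes back as a single point for the whole country.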

Quick Note on Pelican & GitHub Pages

Originally, I built the San Diego Python users group website as static HTML hosted on GitHub Pages. However, as time progressed, the group wanted to have posts for our events with links to presentations and whatnot. I looked at Jekyll, but why would a Python users group generate its website with Ruby? Normally, I’m all about using the best tool for the job, but all of the group’s leads and members know Python. I settled on Pelican, which seems to be the most fully featured of the Python static website generators out there.

One repository

Under ideal conditions, I would have one repository that hosts all our Markdown/reST files and also contains the static HTML output that GitHub serves. Project pages are a great way to do that, and Pelican already has some integration. Your Markdown/reST goes in the master branch and your HTML output goes in the gh-pages branch. ghp-import facilitates this quite nicely. There is already a make target for it in Pelican!

Custom domain

I already had pythonsd.org, and normally on GitHub Pages, pointing to a custom domain is as simple as adding a CNAME file. One tricky part that I didn’t realize at first is that the CNAME file needs to be in the gh-pages branch. I added an extra line to the Makefile to copy it to the output directory, which ghp-import automatically puts into the gh-pages branch. If you don’t do this, your CNAME file will be overwritten by the next “make github”.
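The extra Makefile step amounts to copying one file into the generated site before publishing. A rough Python equivalent is sketched here, assuming the CNAME file lives at the repository root and Pelican writes to a directory called `output`; both paths and the helper name `copy_cname` are assumptions, and the actual publish is still left to ghp-import.

```python
import shutil
from pathlib import Path


def copy_cname(repo_root=".", output_dir="output"):
    """Copy CNAME into the generated site so it ends up in gh-pages.

    Returns the destination path, or None if there is no CNAME file.
    """
    source = Path(repo_root) / "CNAME"
    if not source.is_file():
        return None
    destination = Path(output_dir) / "CNAME"
    # Create the output directory if the site has not been built yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    return destination
```

The copy has to happen after the site is generated and before ghp-import runs, since ghp-import publishes whatever is in the output directory at that moment.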

I considered having two repositories because I had some doubts about how clean a one-repo solution would be. In the end, I just needed to figure out exactly what ghp-import was doing. It works fine and accomplishes exactly what I want.

GitHub Timeline and Social Coding


The GitHub public timeline is up on BigQuery, and I decided I’d play around with it. I created this visualization, which is a first (alright, more like eighth) try at measuring “how social” a project really is. The colors correspond to different programming languages, and the size of each arc is based on the number of distinct collaborators on a project.
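Measuring a project by its distinct collaborators boils down to counting unique users per repository. The sketch below shows the idea in plain Python rather than as a BigQuery query; the record keys (`repo`, `language`, `actor`) are hypothetical stand-ins for the timeline's columns, and ranking the top three per language is an illustrative choice.

```python
from collections import defaultdict


def distinct_collaborators(events):
    """Map (language, repo) -> number of distinct users seen in events."""
    users = defaultdict(set)
    for event in events:
        if not event.get("language"):
            continue  # skip projects with no stated language
        users[(event["language"], event["repo"])].add(event["actor"])
    return {key: len(actors) for key, actors in users.items()}


def most_social(events, top=3):
    """Return up to `top` repos per language, most collaborators first."""
    by_language = defaultdict(list)
    for (language, repo), count in distinct_collaborators(events).items():
        by_language[language].append((count, repo))
    return {
        language: [repo for _, repo in sorted(repos, reverse=True)[:top]]
        for language, repos in by_language.items()
    }
```

Using a set per repository means a user who comments fifty times still counts once, so the measure rewards breadth of participation rather than volume.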

Other attempts

I thought about looking only at pull requests. That does uncover some interesting projects which have a lot of pull requests, but I think it penalizes projects like Linux, which doesn’t really use pull requests. Also, I wasn’t sure code submissions alone were exactly what I was looking for. I also briefly looked at only merged pull requests. I ended up filtering out projects with no stated language, partly because there were a number of projects that just had common names (e.g. “test”).

While doing this, I figured I would have heard of the most social projects, but that there would be a lot of projects I’d never heard of for a variety of reasons. Some projects can get a lot of watchers, forks, comments or issues from an entirely separate group of people from the people I follow.

Most social projects by language

The visualization has the full dataset, but here’s a taste of the data:

  • C – php-src, linux, mruby
  • C++ – mosh, mysql, fr_public
  • JavaScript – bootstrap, meteor, jquery-file-upload
  • PHP – symfony, codeigniter, foundation
  • Python – django, legit, flask
  • Ruby – sample_app, rails, first_app

RPC4Django updates November 2011 edition

I released v0.1.10 of RPC4Django. I fixed an issue so that setup.py no longer depends on anything outside of the standard library, and I set the project up so that python setup.py test runs the unit tests.

The bigger change is that I moved the project from Launchpad to GitHub. I’ve already been using GitHub quite a bit, and I thought that I’d bite the bullet and make the move. While I liked Launchpad, I think it is better suited to larger projects that will use features like Blueprints and Translations. For a small project like RPC4Django, GitHub’s code-centric approach works better.

RPC4Django Updates March 2011 Edition

I’ve been ignoring RPC4Django for a while, and I figured it was time to revisit it. There have been a couple bug reports as well as a bug reported against South. On a slight tangent, South works amazingly well. Getting back to RPC4Django, there is also a merge request on Launchpad to “allow specific methods to be available at specific URLs”. It sounds like it might be useful. What do you — the nebulous community — think? You can take a look at the code here.

I’ll be out of town for the next month on vacation through South-East Asia. I’ll be sure to post a picture or two.