GitHub’s public timeline contains a wealth of knowledge about contributions to open source software from all over the world. It’s pretty typical to see over ten thousand contributions of some sort every hour! I decided to focus only on the top 200 repositories (by forks) in order to have a more manageable set of data. Each comment, pull request, or commit is tied to a repository, which in turn usually has a primary language associated with it. Contributions from folks who didn’t provide a location were ignored, and OpenStreetMap’s Nominatim service was used to geocode locations into latitude and longitude for those who did say where they were coding from.
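The lookup itself is simple enough to sketch. Here is roughly what a single geocoding call against Nominatim’s public search API looks like, assuming the requests library; a real run over thousands of users would need caching and rate limiting to stay polite:

```python
import requests

def geocode(location):
    """Resolve a free-form location string to (lat, lon) via Nominatim."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": location, "format": "json", "limit": 1},
        headers={"User-Agent": "github-contribution-map"},  # Nominatim asks for a UA
        timeout=10,
    )
    results = resp.json()
    if not results:
        return None  # unrecognizable location, e.g. an IRC channel
    return float(results[0]["lat"]), float(results[0]["lon"])

print(geocode("San Diego, CA"))
```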
If you aren’t from New York City or San Francisco and you contributed to a top 200 repository, you can probably find your own commits if you zoom in enough.
Not all events are created equal. Watching a repository is not the same as committing code or opening issues. In general, I tried to calculate contributions based on the same criteria GitHub uses, but I don’t think I’m introspecting commits and pull requests as deeply as they are. Typically, for larger repositories, users commit to their own forks — which I ignore — and later send pull requests, which I do count. However, this means a pull request that merges many commits from a large fork counts the same as a one-line pull request. The person who actually merges the pull request gets the same credit as the author, which makes sense once I gave it a second thought.
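To make those rules concrete, here is a heavily simplified sketch of the tallying. The field names are illustrative rather than the exact githubarchive.org payload structure:

```python
from collections import Counter

TOP_REPOS = {"torvalds/linux", "rails/rails"}  # stand-in for the top-200-by-forks set
counts = Counter()

def tally(event):
    """Credit an event to its actor, loosely following the rules above.

    Pushes to personal forks never match TOP_REPOS, so they are ignored,
    and a merged pull request credits the merger just like the author.
    """
    if event.get("repo") not in TOP_REPOS:
        return
    if event["type"] in ("PushEvent", "PullRequestEvent", "IssuesEvent"):
        counts[event["actor"]] += 1
```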
One way to improve my accounting of contributions would be to look at the actual repositories to see which commits to forks ended up in the “main line”. For a repository that actually uses GitHub, virtually all commits end up in the main repository through pull requests or through somebody with permission to push directly, both of which appear in the githubarchive.org data. For a repository like Linux, which only stores code on GitHub and doesn’t accept pull requests there, it would be nice to analyze the commit history itself. I bet most of the Linux contributors have GitHub accounts to attribute their work to.
Geocoding messy data is, well… messy. The location field for users on GitHub is simply a fill-in-the-blank field, and users can type anything in there, from their city to their university to an IRC channel. Sometimes people just type in a country name, which is fine for Singapore but doesn’t narrow things down much for Canada. The locations listed for contributors to the top 200 repositories were surprisingly clean, however. They weren’t without some humorous errors, though.
If you’re going to be at FOSDEM, don’t forget to say hi! Here is a list of talks that struck my fancy at first glance.
- 11:00-11:50 How we made the Jenkins community
- 11:00-11:50 How Google builds web services (in PHP?)
- 14:00-14:50 Practical security for developers
- 16:00-16:50 Trends in open source security
- 16:00-16:15 The LLVMLinux project
- 17:00-17:50 PDF.js Firefox’s PDF viewer
- 09:30-09:55 State of the OpenStack union 2013
- 10:00-10:25 OpenStack: 21st century app architecture and cloud operations
- 11:00-11:50 Firefox OS
- 11:40-11:55 Do you want to measure your project?
- 14:00-14:50 PostgreSQL as a schemaless database
- 14:00-14:25 Vaurien the chaos TCP proxy
- 14:30-14:55 Python for Humans
- 15:00-15:50 PostgreSQL: implementing high availability
- 15:00-15:25 Security priorities for cloud developers
Originally, I built the San Diego Python users group website as static HTML hosted on GitHub Pages. However, as time progressed, the group wanted to have posts for our events with links to presentations and whatnot. I looked at Jekyll, but what would a Python users group be doing generating its website with Ruby? Normally, I’m all about using the best tool for the job, but all of the group’s leads/members know Python. I settled on Pelican, which seems to be the most fully featured of the Python static website generators out there.
Under ideal conditions I would have one repository that hosted all our markdown/reST files and also contained the static HTML output that GitHub serves. Project pages are a great way to do that, and Pelican already has some integration. Your markdown/reST goes in the master branch and your HTML output goes in the gh-pages branch. ghp-import facilitates this quite nicely. There is already a make target for it in Pelican!
I already had pythonsd.org, and normally on GitHub Pages, pointing to a custom domain is as simple as adding a CNAME file. One tricky part I didn’t realize at first is that the CNAME file needs to be in the gh-pages branch. I added an extra line to the Makefile to copy it to the output, which automatically gets put into the gh-pages branch by ghp-import. If you don’t do this, your CNAME file will be overwritten by the next “make github”.
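For reference, the relevant target ends up looking something like this (a sketch based on Pelican’s stock Makefile, where OUTPUTDIR is the directory Pelican writes generated HTML into):

```make
github: publish
	# keep the custom domain file in the generated output
	cp CNAME $(OUTPUTDIR)/CNAME
	# commit the output tree to the gh-pages branch and publish it
	ghp-import $(OUTPUTDIR)
	git push origin gh-pages
```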
I considered having two repositories because I had some doubts about how clean a one-repo solution would be. In the end, I just needed to figure out exactly what ghp-import was doing. It works fine and accomplishes exactly what I want.
There are lots of different versioning schemes, and versioning is definitely not a solved problem. Some projects use dates, some use an ever-increasing number (perhaps generated by their version control system), some adopt the de facto standard of major.minor.patch, and some converge to pi. Fundamentally though, all of these versioning schemes convey some additional information. With any of these schemes, it is easy to compare two versions and say which was released later. Some have even more semantic meaning. Without getting too pedantic, there is some formal discussion of this as configuration management.
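As a quick illustration of that comparison property, Python’s standard library can already order most of these schemes (see the note on LooseVersion at the end of this post):

```python
from distutils.version import LooseVersion

# any sane scheme lets you ask "which release came later?"
assert LooseVersion("2.7.1") > LooseVersion("2.7")
assert LooseVersion("1.10") > LooseVersion("1.9")  # naive string comparison gets this wrong
assert LooseVersion("2013.04") > LooseVersion("2012.12")  # date-based schemes order too
```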
Versioning conveys information
Semantic versioning is a well-documented versioning scheme that conveys information about API stability and compatibility, especially in relation to dependency management. Without repeating too many of semantic versioning’s details: essentially, it makes it easy to distinguish backwards-compatible changes, incompatible changes, and changes that do not modify the API at all. It is definitely a step forward, but as evidenced by the fact that a 2.0.0 version of it is still a release candidate, versioning in the abstract cannot be considered complete.
A number of systems developers use every day rely on the semantics of versioning. Take Python as a perfect example. Code from Python 2.6 will almost always run without modification on 2.7. The reverse is sometimes true but certainly less likely. When I use Travis-CI, I do not specify that my program must be tested under Python 2.7.0, 2.7.1, etc. Travis assumes that if it works in Python 2.7, it will work in any of those versions. These patch versions are rarely necessary for dependency management, and certainly not with semantic versioning. Semantic versioning really shines when testing a package with multiple dependencies or recursive dependencies. Instead of having to test X versions of Package1 against all Y versions of Package2, basic API compatibility can be assumed and testing becomes easier. With that said, actually verifying a subset of those combinations is always nice too.
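In other words, a dependency checker only needs a rule like the following sketch rather than an exhaustive compatibility matrix (assuming strict semantic versioning, which real packages only approximate):

```python
def compatible(installed, required):
    """Under strict semantic versioning, the same major version plus at
    least the required minor/patch implies backwards compatibility."""
    inst = tuple(int(part) for part in installed.split("."))
    req = tuple(int(part) for part in required.split("."))
    return inst[0] == req[0] and inst >= req

assert compatible("2.7.1", "2.6.0")
assert not compatible("3.0.0", "2.6.0")  # a major bump may break the API
```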
Additional versioning metadata
API compatibility and version comparison lend themselves well to being discovered from the version numbering scheme, but what about other things developers might want to know about a particular package? One piece of metadata that would be interesting to know is a classification of what changes a new version contains. Did any security vulnerabilities get fixed? Was this a bugfix release? Was this a re-release because something went wrong in the release process for the previous release? You can imagine a versioning scheme that looks something like 1.0.5-sb, where s stands for security and b stands for bug. It would be nice for that metadata to be machine discoverable, and it would make it much easier to identify dependencies that require updating.
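Such a scheme would be trivial to parse. A quick sketch, using the hypothetical flags above:

```python
import re

# the hypothetical "1.0.5-sb" style: a major.minor.patch triple plus
# single-letter flags ("s" = security fix, "b" = bug fix)
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([a-z]+))?$")

def parse(version):
    major, minor, patch, flags = VERSION_RE.match(version).groups()
    return (int(major), int(minor), int(patch)), set(flags or "")

print(parse("1.0.5-sb"))  # ((1, 0, 5), {'s', 'b'})
```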
Perhaps a better solution is to attach this metadata to version control tags. Newer version control systems have annotated tags (I’m using the git term, but Mercurial has something similar). I can imagine a tag 1.0.5 with the annotation [security fix]. Like raising awareness for semantic versioning, this requires changing the software world by getting everyone to adopt your methods, which is a tall order. In addition, some things do not lend themselves very well to version control tagging, such as deprecating releases. It would be nice to discover when a version is deprecated and therefore no longer receiving security fixes, but that happens long after the release is tagged.
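Still, the tag half of this is machine discoverable today. A sketch of scanning a repository’s annotated tags for the hypothetical [security fix] marker:

```python
import subprocess

def tag_annotations(repo_path="."):
    """Map each tag to its annotation subject using git plumbing."""
    out = subprocess.check_output(
        ["git", "for-each-ref", "refs/tags",
         "--format=%(refname:short)\t%(contents:subject)"],
        cwd=repo_path, text=True,
    )
    return dict(line.split("\t", 1) for line in out.splitlines() if "\t" in line)

for tag, message in tag_annotations().items():
    if "[security fix]" in message:
        print(tag, "fixes a security vulnerability")
```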
I’m not ready to make any sort of concrete proposal. However, I think this is a really interesting space, and I think that information such as which versions have security vulnerabilities becomes much more valuable now that many software products capture their dependencies in requirements files or Gemfiles. There are definitely projects that have taken on parts of this, such as the Open Source Vulnerability Database. Over the next few years, this problem is going to become somewhat solved, rather than developers upgrading dependencies in the ad hoc fashion they do now. There will be something better than subscribing to a bunch of mailing lists and feeds to keep up with security fixes for dependencies.
- distutils.version.LooseVersion allows for version comparison pretty close to semantic versioning.
- There are marketing reasons to version things too.
- I got ideas for this post when I was reading the 2.0.0-rc1 version of the semantic versioning docs and noticed that the tagging specification section had been removed since I last read it.