Although I generally stick to Python, I am going to go off on a tangent about statistics, data sets and R. You’ve been warned.
Getting the data
Last week, the World Bank released some of its underlying data that it uses as development indicators. The data is fairly clean and easy to work with. I grabbed the USA data in Excel format and transposed it (using “paste special”) so that each year was a row instead of having the years as columns. Then I saved it as a CSV file on my desktop.
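If you would rather not do the transpose by hand in Excel, the same reshaping can be scripted. The following is only a minimal sketch in Python, assuming a small rectangular CSV; the function name `transpose_csv` and the sample values are made up for illustration and are not World Bank figures.

```python
import csv
import io


def transpose_csv(text):
    """Swap the rows and columns of CSV text and return new CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    # zip(*rows) groups the i-th cell of every row together, so each
    # original column becomes a row. Assumes every row has the same length.
    transposed = zip(*rows)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(transposed)
    return out.getvalue()


# Made-up placeholder values, only to show the shape of the data.
wide = "Indicator,1960,1961\nPopulation,10,11\nEnergy,20,21\n"
print(transpose_csv(wide))
```

After the transpose, each year is a row and the first column keeps the header Indicator, which is why the year column shows up under that name later on.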
Working with the data in R
R is a programming language that focuses on statistics and data visualization. Unlike Python, R has a number of useful statistics functions built into the language. These features allow you to easily find means, minimums, maximums and standard deviations, summarize data sets, plot graphs and more. Working with the data is very interesting and it provides a good way to learn R.
First off, you can easily read in the saved CSV file.
usa = read.csv('~/Desktop/worldbank_us.csv')
The variable usa contains all columns of data and the columns can be accessed easily:
# Show the population data
usa$Population..total
# Show urban population as a percentage of the total
usa$Urban.population....of.total.
# Show all available columns
names(usa)
# Summarize all the columns
summary(usa)
Plotting with R
Visualizing the data is the really interesting part, and this is where R shines. First we need to get the columns we want to graph.
year = usa$Indicator
energy = as.integer(as.matrix(usa$Energy.use..kt.of.oil.equivalent.))
pop = as.integer(as.matrix(usa$Population..total))
energy_per_capita = energy/pop * 1000
Both the population and energy use columns are missing data points for the most recent years, probably because that data hasn’t yet been collected and verified. Coercing each column into an integer vector converts any value that cannot be parsed as a number, such as an empty cell, into R’s NA. While similar to null or Python’s None, NA indicates that the data is not available, and those points are skipped when plotting. Once the data is ready, it can be plotted easily.
# Plot the data onto a graphical chart
plot(year, energy_per_capita,
     xlab="Year", ylab="Energy Use (tons of oil per person)",
     main="US Energy Use", type="o",
     sub="http://data.worldbank.org", col="blue")
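For readers coming from Python, the coercion and per-capita arithmetic above can be sketched roughly as follows. This is an analogy rather than a translation: the helper names are hypothetical, the cell values are made up, and Python’s int() is stricter than R’s as.integer (it rejects a string like "12.5" instead of truncating it).

```python
def to_int_or_none(value):
    """Return int(value), or None when the cell is blank or not an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def tons_per_person(energy_kt, population):
    """Convert kilotons of oil equivalent and a head count to tons per person."""
    if energy_kt is None or population is None:
        return None  # propagate missing data, much as NA does in R arithmetic
    # 1 kiloton = 1000 tons, hence the factor of 1000.
    return energy_kt / population * 1000


# Made-up placeholder cells, not World Bank figures.
energy = [to_int_or_none(v) for v in ["2000", "", ".."]]
pop = [to_int_or_none(v) for v in ["1000", "1100", "1200"]]
per_capita = [tons_per_person(e, p) for e, p in zip(energy, pop)]
print(per_capita)
```

The main difference in feel is that R handles the missing values for you across the whole vector, while in plain Python the None check has to be written out.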
When I saw the resulting graph I thought to myself: WOW, that’s a lot of energy. I don’t think I use multiple tons of oil per year, but I assume this also includes industrial, commercial and military usage. Still, that’s a lot of energy. It’s interesting to note that the peak of US energy usage was 1978 and then there’s the subsequent decline due to the energy crisis. The next thing I thought about was how energy usage has leveled off while population has continued to grow. So I decided to put population on the same chart.
# Allow a 2nd line on the same plot
par(new=T)
plot(year, pop, xlab="", ylab="", axes=F, type="o", col="red")
mtext("Population", side=4, col="red")
The leveling of energy usage may not be as amazing as I first thought, since a significant percentage of it must be industrial use, which is probably declining. It is still interesting and fairly impressive: while the population has continued to grow fairly linearly, energy usage per person is flat or slightly lower than it was 35 years ago. I guess those slightly more efficient water heaters and refrigerators are paying off.
In a previous post, I promised to write about Pip and Virtualenv and I’m now finally making good. Others have done this before, but I think I have a little to add. If you develop a Python module and you don’t test it with virtualenv, don’t make your next release until you do.
Configuring the environment
Virtualenv creates a Python environment that is segregated from your system-wide Python installation. In this way, you can test your module without any external packages mucking up the result, try different versions of dependency packages and generally verify the exact set of requirements for your package.
To create the virtual environment:
% virtualenv --no-site-packages testarea
This creates a directory testarea/ that contains directories for installing modules and a Python executable. Using the virtual environment:
% cd testarea
% source bin/activate
Sourcing activate will set environment variables so that only modules installed under testarea/ are used. After setting up the environment, any desired packages can be installed (from pypi):
(testarea) % pip install rpc4django
Packages can also be uninstalled, specific versions can be installed or packages can be installed from the file system, URLs or directly from source control:
(testarea) % pip uninstall rpc4django
(testarea) % pip install rpc4django==0.1.6
Pip is worth using over easy_install for its uninstall capabilities alone, and I should mention that, at the time of writing, pip is actively maintained while setuptools is mostly dead.
When you’re done with the virtual environment, simply deactivate it:
(testarea) % deactivate
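If you are ever unsure whether an environment is active, you can ask the interpreter itself. The following is a best-effort sketch, assuming the usual conventions: older virtualenv releases set sys.real_prefix, while newer ones (and the standard library venv module) make sys.base_prefix differ from sys.prefix. The function name `in_virtualenv` is my own.

```python
import sys


def in_virtualenv():
    """Best-effort check for whether this interpreter runs in a virtual environment."""
    # Older virtualenv releases add a sys.real_prefix attribute.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv and venv leave sys.base_prefix pointing at the
    # system installation while sys.prefix points at the environment.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("virtual environment active:", in_virtualenv())
print("packages are installed under:", sys.prefix)
```

Run it once before sourcing activate and once after, and the second line should switch from the system prefix to the testarea/ directory.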
Do it for the tests
While the segregated environment that virtualenv provides is extremely well suited to getting the correct environment up and running, it is just as well suited to testing your application under a variety of different package configurations. With pip and virtualenv, testing your application under three different versions of Django is a snap and it doesn’t affect your system environment in the slightest.
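To make that concrete, here is a small dry-run sketch that only prints the commands for such a test matrix rather than running them. Everything specific in it is an assumption for illustration: the version numbers, the envs/ directory, the POSIX bin/ layout, and the idea that your project runs its tests with "setup.py test".

```python
import os


def matrix_commands(versions, package="Django", base_dir="envs"):
    """Return, per version, the commands that would build and test one environment."""
    plan = {}
    for version in versions:
        env_dir = os.path.join(base_dir, "%s-%s" % (package.lower(), version))
        pip = os.path.join(env_dir, "bin", "pip")
        python = os.path.join(env_dir, "bin", "python")
        plan[version] = [
            # Flag as used in this post; newer virtualenv releases isolate
            # by default and may not accept it.
            ["virtualenv", "--no-site-packages", env_dir],
            [pip, "install", "%s==%s" % (package, version)],
            # Assumption: the project runs its tests via "setup.py test".
            [python, "setup.py", "test"],
        ]
    return plan


# Illustrative version numbers; substitute the releases you support.
for version, commands in matrix_commands(["1.0.4", "1.1.1", "1.2"]).items():
    for command in commands:
        print(" ".join(command))
```

Because each environment gets its own directory, the three installs never see each other or your system site-packages.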
Dependencies made easy
My favorite feature of pip is the ability to create a requirements file based on a set of packages installed in your virtual environment (or your global site-packages). Creating a requirements file can be done automatically using the freeze command for pip:
(testarea) % pip freeze > requirements.txt
(testarea) % more requirements.txt
Django==1.1.1
rpc4django==0.1.7
wsgiref==0.1.2
Installing everything listed in the requirements file is then a single command:
% pip install -r requirements.txt
The requirements file can be version controlled both to aid in installation and to capture the exact versions of your dependencies directly where they are used rather than after the fact in documentation that can easily become out of date. The requirements file can be used to rebuild a virtual environment or to deploy a virtual environment into the machine’s site-packages. Pip and virtualenv are exceptionally easy to use and there’s really no excuse for a Python packager not to use them.
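Since the frozen file is just plain text with one name==version pin per line, it is also easy to inspect from a script, for example to compare what two environments have installed. The following is only a sketch under that assumption: the function name `parse_pins` is hypothetical, and it deliberately ignores anything that is not an exact pin (pip itself accepts a richer format).

```python
def parse_pins(text):
    """Map package name to pinned version for 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "==" not in line:
            continue  # this sketch ignores anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


# The same three pins shown in the freeze output above.
frozen = "Django==1.1.1\nrpc4django==0.1.7\nwsgiref==0.1.2\n"
print(parse_pins(frozen))
```

Comparing the dictionaries from two freeze outputs is a quick way to spot which dependency changed between a working and a broken environment.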
Note: I’m working on a fairly large application for work. When it is finished, I will release a post-mortem that will also function as an update to my post about packaging and distributing.