Generating PDFs With (and Without) Python

Programmatically creating PDFs is fairly common among server side and desktop applications. In this post, I’ll share the results of my experiences with PDFs and what to think about when making a decision about libraries to use and the trade-offs with them. To understand the trade-offs though, we’ll start from the basics.

What’s a PDF?

Just about everybody has interacted with a PDF file or two, but it is worth going over some details. The PDF format is a document presentation format. HTML which probably more readers of this blog are familiar with is also a sort of document presentation format but the two have different goals. PDF is designed for fixed layouts on specific sizes of paper while HTML uses a variable layout that can differ across screen sizes, screen types and potentially across browsers. With relatively few exceptions, the same document should look exactly the same on two different PDF viewers even on different operating systems, screen sizes or form factors.

PDF was originally developed as a proprietary format by Adobe but is now an open standard (ISO 32000-1:2008). The standard document weighs in at over 700 pages and that doesn’t even cover every aspect of PDFs. As you might guess from the previous sentence, the format itself is extremely complicated and so usually tools are used when creating PDFs. The PDF format is fairly close in capability to Postscript, a format and programming language used by printers and PDFs are used most commonly to convey some data intended for printing. PDFs print fairly precisely and uniformly even across printers and OSs and so it’s a good idea to create a PDF for any document that is intended for printing.

Generating a PDF

If you only need to generate a single PDF, then you probably should stop reading. Creating a PDF from a word document for example can be done from the print dialog in MacOS and many Linux distros and a tool like Acrobat can accomplish the same thing on Windows. However, when it comes to generating a lot of PDFs — say a different PDF for every invoice in your freelancing job — you don’t want to spend hours in the print dialog saving files as PDFs. You’d rather automate that. There are lots of tools to accomplish this but some are better than others and more suited to specific tasks. In general though, most of these tools work along the same lines. You have a fixed layout in your document but some of the content is variable because it comes from a database or some external source. All invoices look more or less the same, but what the invoice is for and the amounts, dates and whatnot differ from invoice to invoice.

ReportLab

If you use Python regularly, you may have already heard of ReportLab. ReportLab is probably the most common Python library for creating PDFs and it is backed by a company of the same name. It is very robust albeit a bit complicated for some. ReportLab comes with a dual license of sorts. The Python library is open source but the company produces a markup language (RML, an XML-based markup) and an optimized version of the library for faster PDF generation which they sell. They claim that their proprietary improvements make it around a factor of five faster. Using markup languages like RML comes with the nice advantage that you can use your favorite template library (like Jinja2 or Django’s built-in one) to separate content from layout cleanly.

While the above example only writes paragraphs to a document, ReportLab can draw lines or shapes on a document, or support using images with your PDF.

RestructuredText and Sphinx

While not a PDF generator by itself, if you’ve ever created a Python module, you’ve probably heard of Sphinx, a module used to create documentation. It was created for the Python documentation itself but has been used by Django, Requests and many other big projects. It is the documentation format that powers Read the Docs. In general, you write a document in RestructuredText (reST), a wiki-like markup format and then run the Sphinx commands to put the whole thing together. Frequently, Sphinx is used to generate HTML documentation but it can be used to generate PDFs. It does this by creating LaTeX documents from reST and then creating PDFs from those. This can be a good solution if you’re looking for documents similar to what Sphinx already outputs.

I have a post from long ago on Sphinx and documentation in general if you desire further reading.

LaTeX

LaTeX is a document preparation system that is used widely in scientific and mathematical publishing as well as some other technical spheres. It can produce very high quality documents that are suitable for being published as books, presentations (see an example at the end of the post), handouts or articles. LaTeX has a reputation for being fairly complicated and it involves a fairly large installation to get going. LaTeX is also not a Python solution although there are Python modules to help somewhat with LaTeX. However, because it is a plain text macro language, it can be used with a template library to generate PDFs. The big advantage is that it can very precisely typeset documents and has a very extensive user base (it’s own stackexchange) and a large amount of modules to do everything from tables, figures, charts, syntax highlighting, bibliographies and more. It takes more to get going but you can generate just about anything you can think of with LaTeX. Compared with all the other solutions, it also offers the best performance in my testing for non-trivial documents.

Alternatives

There are lots of other alternatives from using Java PDF libraries via Jython to translators that translate another format into PDFs. This post is not meant to be exhaustive and new modules to generate PDFs pop up all the time. I’ll briefly touch on a few.

XHTML2PDF
There are quite a few HTML to PDF converters and they all suffer from problems that PDF has lots of concepts (like paper size) that HTML doesn’t have. XHTML2PDF works around this by adding some special additional markup tags. This allows doing things like having headers and footers on each page which HTML doesn’t do very well. XHTML2PDF uses ReportLab to actually generate the PDFs so it is strictly less powerful than ReportLab itself. It also only supports a subset of HTML and CSS (for example, “float” isn’t supported) and so it’s best to just consider it another ReportLab markup language. With that said, it’s fairly easy to get off the ground and reasonably powerful. The library itself isn’t maintained very much though. Update: see the update at the end of the post.
wkhtmltopdf
I haven’t used this one myself although I looked into when choosing a PDF library. It has many of the same issues as XHTML2PDF. It uses a headless Webkit implementation to actually layout the HTML rather than translating it to ReportLab as XHTML2PDF does. It has a C library with bindings to it. It has some support for headers and footers as well. People I’ve talked to say it is fairly easy to get started but you run into problems quickly when trying to precisely layout complex documents.
PhantomJS
I haven’t looked into this one at all and it is not Python whatsoever but it is worth mentioning. It also translates HTML to PDF.
Inkscape
Inkscape is a program used to create SVGs, an XML based document for vector graphics. However, Inkscape can be used headlessly to translate an SVG into a PDF fairly accurately. SVGs have a concept of their size which is important in translating to PDF. However, SVG isn’t a format designed for multiple pages and so it may not be the best fit for multi-page documents. As an aside, SVGs can be used by LaTeX and ReportLab as imported graphics so Inkscape could also be part of a larger solution.
Resources

If you’re just looking for a simple recommendation, I’d say ReportLab or LaTeX are the best choices although it can depend on your use case and requirements. They are a little tougher to get off the ground but give you a higher ceiling in terms of capability. Where you don’t need to create particularly complicated documents, Sphinx or an HTML to PDF translator could work.

This post is a longer version of a talk (pdf) I gave at San Diego Python on August 27, 2015. The repository for the talk is on github.

Update

As of October 2016, XHTML2PDF is no longer supported in favor of WEasyPrint. I have never personally used WEasyPrint but it looks significantly more capable than when I first saw it a few years ago. It still likely suffers from some similar problems to XHTML2PDF in that HTML is not designed for the same reasons as PDF. With that said, it has much better support. It does not rely on Reportlab and instead looks like it relies on Cairo, a Python library for converting SVGs to PDFs.

More updates

It may also be worth mentioning that there are paid cloud services in this space depending on your budget and use case. As of January 2017, I have not integrated with these so I can’t comment on them too much. DocRaptor and HyPDF are both services which integrate with Heroku and convert HTML/CSS to PDFs. There are a number of other players in this space as well.

Quick Note on Pelican & Github Pages

Originally, I built the San Diego Python users group website as static HTML hosted on Github pages. However, as time progressed, the group wanted to have posts for our events with links to presentations and whatnot. I looked at Jekyll but what would a Python users group be generating its website with Ruby. Normally, I’m all about using the best tool for the job, but all of the group’s leads/members know Python. I settled on Pelican which seems to be the most fully featured of the Python static website generators out there.

One repository

Under ideal conditions I would have one repository that hosted all our markdown/reST files and also contained the static HTML output that Github serves. Project pages are a great way to do that and Pelican already has some integration. Your markdown/reST goes in the master branch and your HTML output goes in the gh-pages branch. ghp-import facilitates this quite nicely. There is already a make target for it in Pelican!

Custom domain

I already had pythonsd.org and normally on Github pages, pointing to a custom domain is as simple as adding a CNAME file. One tricky part that I didn’t realize is the CNAME file needs to be in the gh-pages branch. I added an extra line to the Makefile to copy it to the output which automatically gets put into the gh-pages branch by ghp-import. If you don’t do this, your CNAME file will be overwritten by the next “make github”.

I considered having two repositories because I had some doubts about how clean a one repo solution would be. In the end, I just needed to figure out exactly what ghp-import was doing. It worked fine and accomplishes exactly what I want.

Intro to Django Workshop November 2012

Next weekend the San Diego Python Users Group (we) will be hosting an Introduction to Django workshop. This post is going to try to document the planning process for having a workshop like this one in your area.

Sponsorship

One thing that cannot be stressed enough is getting a sponsor. Brightscope is sponsoring this Intro to Django workshop. Our Intro to Python workshop was sponsored by the PSF and that worked extremely well. This doesn’t seem important on the surface but without planned meals it is tough to get people to stay in a room together for seven hours. Otherwise people would be constantly leaving to get drinks or coffee and the workshop wouldn’t flow as well. Free food is also a pretty good motivator and it gets people to show up.

Planning

We sent out a little survey last week to find out about people’s skill levels and what they already know about Python, HTML, and Javascript — pretty much the prerequisites of Django. Just about everybody knows HTML and some Javascript but that’s where things diverge. We have a pretty good mix of operating systems and the Python experience runs the full gamut. However, this step is key to planning out the presentation and schedule. We are having a brief newbie session on the Friday before the workshop for people to install Python and get their development environments setup. You can’t start from Django tutorial #1 if you don’t have Python installed! The goal is to have everyone on the same page so that we can have a simple working web app by noonish on Saturday. The afternoon will be spent somewhat independently adding a feature or two to that app.

Workshop materials

Preparing for the Intro to Python workshop earlier this month was easy since we liberally used the Boston Python group’s workshop materials. However, for this one we did not have as much to go off of. We are leaning heavily on the Django tutorials, the Django book (to a lesser extent) and Chander Ganesan’s PyCon talk from this year. At Djangocon this year, Russell Keith-Magee mentioned that the DSF was going to put together some training/workshop materials for just this sort of thing, but it is not ready yet. Hopefully what we create is useful and can bootstrap that process. Our materials are very early stage right now, but feel free to make suggestions or fork them for your own purposes.

Ok. Enough procrastinating. Time to work on the actual workshop materials.