San Diego Python

Programmatically creating PDFs is fairly common among server side and desktop applications. In this post, I’ll share the results of my experiences with PDFs and what to think about when making a decision about libraries to use and the trade-offs with them. To understand the trade-offs though, we’ll start from the basics.

What’s a PDF?

Just about everybody has interacted with a PDF file or two, but it is worth going over some details. The PDF format is a document presentation format. HTML which probably more readers of this blog are familiar with is also a sort of document presentation format but the two have different goals. PDF is designed for fixed layouts on specific sizes of paper while HTML uses a variable layout that can differ across screen sizes, screen types and potentially across browsers. With relatively few exceptions, the same document should look exactly the same on two different PDF viewers even on different operating systems, screen sizes or form factors.

PDF was originally developed as a proprietary format by Adobe but is now an open standard (ISO 32000-1:2008). The standard document weighs in at over 700 pages and that doesn’t even cover every aspect of PDFs. As you might guess from the previous sentence, the format itself is extremely complicated and so usually tools are used when creating PDFs. The PDF format is fairly close in capability to Postscript, a format and programming language used by printers and PDFs are used most commonly to convey some data intended for printing. PDFs print fairly precisely and uniformly even across printers and OSs and so it’s a good idea to create a PDF for any document that is intended for printing.

Generating a PDF

If you only need to generate a single PDF, then you probably should stop reading. Creating a PDF from a word document for example can be done from the print dialog in MacOS and many Linux distros and a tool like Acrobat can accomplish the same thing on Windows. However, when it comes to generating a lot of PDFs — say a different PDF for every invoice in your freelancing job — you don’t want to spend hours in the print dialog saving files as PDFs. You’d rather automate that. There are lots of tools to accomplish this but some are better than others and more suited to specific tasks. In general though, most of these tools work along the same lines. You have a fixed layout in your document but some of the content is variable because it comes from a database or some external source. All invoices look more or less the same, but what the invoice is for and the amounts, dates and whatnot differ from invoice to invoice.

ReportLab

If you use Python regularly, you may have already heard of ReportLab. ReportLab is probably the most common Python library for creating PDFs and it is backed by a company of the same name. It is very robust albeit a bit complicated for some. ReportLab comes with a dual license of sorts. The Python library is open source but the company produces a markup language (RML, an XML-based markup) and an optimized version of the library for faster PDF generation which they sell. They claim that their proprietary improvements make it around a factor of five faster. Using markup languages like RML comes with the nice advantage that you can use your favorite template library (like Jinja2 or Django’s built-in one) to separate content from layout cleanly.

from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.lib.pagesizes import letter

import io


buf = io.BytesIO()

# Setup the document with paper size and margins
doc = SimpleDocTemplate(
    buf,
    rightMargin=inch/2,
    leftMargin=inch/2,
    topMargin=inch/2,
    bottomMargin=inch/2,
    pagesize=letter,
)

# Styling paragraphs
styles = getSampleStyleSheet()

# Write things on the document
paragraphs = []
paragraphs.append(
    Paragraph('This is a paragraph', styles['Normal']))
paragraphs.append(
    Paragraph('This is another paragraph', styles['Normal']))
doc.build(paragraphs)

# Write the PDF to a file
with open('test.pdf', 'w') as fd:
    fd.write(buf.getvalue())

from reportlab.platypus import SimpleDocTemplate, Paragraph

from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle

from reportlab.lib.units import inch

from reportlab.lib.pagesizes import letter

import io

buf = io.BytesIO()

# Setup the document with paper size and margins

doc = SimpleDocTemplate(

buf,

rightMargin=inch/2,

leftMargin=inch/2,

topMargin=inch/2,

bottomMargin=inch/2,

pagesize=letter,

)

# Styling paragraphs

styles = getSampleStyleSheet()

# Write things on the document

paragraphs = []

paragraphs.append(

Paragraph('This is a paragraph', styles['Normal']))

paragraphs.append(

Paragraph('This is another paragraph', styles['Normal']))

doc.build(paragraphs)

# Write the PDF to a file

with open('test.pdf', 'w') as fd:

fd.write(buf.getvalue())

While the above example only writes paragraphs to a document, ReportLab can draw lines or shapes on a document, or support using images with your PDF.

RestructuredText and Sphinx

While not a PDF generator by itself, if you’ve ever created a Python module, you’ve probably heard of Sphinx, a module used to create documentation. It was created for the Python documentation itself but has been used by Django, Requests and many other big projects. It is the documentation format that powers Read the Docs. In general, you write a document in RestructuredText (reST), a wiki-like markup format and then run the Sphinx commands to put the whole thing together. Frequently, Sphinx is used to generate HTML documentation but it can be used to generate PDFs. It does this by creating LaTeX documents from reST and then creating PDFs from those. This can be a good solution if you’re looking for documents similar to what Sphinx already outputs.

I have a post from long ago on Sphinx and documentation in general if you desire further reading.

LaTeX

LaTeX is a document preparation system that is used widely in scientific and mathematical publishing as well as some other technical spheres. It can produce very high quality documents that are suitable for being published as books, presentations (see an example at the end of the post), handouts or articles. LaTeX has a reputation for being fairly complicated and it involves a fairly large installation to get going. LaTeX is also not a Python solution although there are Python modules to help somewhat with LaTeX. However, because it is a plain text macro language, it can be used with a template library to generate PDFs. The big advantage is that it can very precisely typeset documents and has a very extensive user base (it’s own stackexchange) and a large amount of modules to do everything from tables, figures, charts, syntax highlighting, bibliographies and more. It takes more to get going but you can generate just about anything you can think of with LaTeX. Compared with all the other solutions, it also offers the best performance in my testing for non-trivial documents.

Alternatives

There are lots of other alternatives from using Java PDF libraries via Jython to translators that translate another format into PDFs. This post is not meant to be exhaustive and new modules to generate PDFs pop up all the time. I’ll briefly touch on a few.

XHTML2PDF: There are quite a few HTML to PDF converters and they all suffer from problems that PDF has lots of concepts (like paper size) that HTML doesn’t have. XHTML2PDF works around this by adding some special additional markup tags. This allows doing things like having headers and footers on each page which HTML doesn’t do very well. XHTML2PDF uses ReportLab to actually generate the PDFs so it is strictly less powerful than ReportLab itself. It also only supports a subset of HTML and CSS (for example, “float” isn’t supported) and so it’s best to just consider it another ReportLab markup language. With that said, it’s fairly easy to get off the ground and reasonably powerful. The library itself isn’t maintained very much though. Update: see the update at the end of the post.
wkhtmltopdf: I haven’t used this one myself although I looked into when choosing a PDF library. It has many of the same issues as XHTML2PDF. It uses a headless Webkit implementation to actually layout the HTML rather than translating it to ReportLab as XHTML2PDF does. It has a C library with bindings to it. It has some support for headers and footers as well. People I’ve talked to say it is fairly easy to get started but you run into problems quickly when trying to precisely layout complex documents.

PhantomJS

I haven’t looked into this one at all and it is not Python whatsoever but it is worth mentioning. It also translates HTML to PDF.

Inkscape

Inkscape is a program used to create SVGs, an XML based document for vector graphics. However, Inkscape can be used headlessly to translate an SVG into a PDF fairly accurately. SVGs have a concept of their size which is important in translating to PDF. However, SVG isn’t a format designed for multiple pages and so it may not be the best fit for multi-page documents. As an aside, SVGs can be used by LaTeX and ReportLab as imported graphics so Inkscape could also be part of a larger solution.

Resources

If you’re just looking for a simple recommendation, I’d say ReportLab or LaTeX are the best choices although it can depend on your use case and requirements. They are a little tougher to get off the ground but give you a higher ceiling in terms of capability. Where you don’t need to create particularly complicated documents, Sphinx or an HTML to PDF translator could work.

This post is a longer version of a talk (pdf) I gave at San Diego Python on August 27, 2015. The repository for the talk is on github.

Outputting PDFs with Django
Reportlab & Django (useful even if you don’t use Django)

Update

As of October 2016, XHTML2PDF is no longer supported in favor of WEasyPrint. I have never personally used WEasyPrint but it looks significantly more capable than when I first saw it a few years ago. It still likely suffers from some similar problems to XHTML2PDF in that HTML is not designed for the same reasons as PDF. With that said, it has much better support. It does not rely on Reportlab and instead looks like it relies on Cairo, a Python library for converting SVGs to PDFs.

More updates

It may also be worth mentioning that there are paid cloud services in this space depending on your budget and use case. As of January 2017, I have not integrated with these so I can’t comment on them too much. DocRaptor and HyPDF are both services which integrate with Heroku and convert HTML/CSS to PDFs. There are a number of other players in this space as well.

Next weekend the San Diego Python Users Group (we) will be hosting an Introduction to Django workshop. This post is going to try to document the planning process for having a workshop like this one in your area.

Sponsorship

One thing that cannot be stressed enough is getting a sponsor. Brightscope is sponsoring this Intro to Django workshop. Our Intro to Python workshop was sponsored by the PSF and that worked extremely well. This doesn’t seem important on the surface but without planned meals it is tough to get people to stay in a room together for seven hours. Otherwise people would be constantly leaving to get drinks or coffee and the workshop wouldn’t flow as well. Free food is also a pretty good motivator and it gets people to show up.

Planning

We sent out a little survey last week to find out about people’s skill levels and what they already know about Python, HTML, and Javascript — pretty much the prerequisites of Django. Just about everybody knows HTML and some Javascript but that’s where things diverge. We have a pretty good mix of operating systems and the Python experience runs the full gamut. However, this step is key to planning out the presentation and schedule. We are having a brief newbie session on the Friday before the workshop for people to install Python and get their development environments setup. You can’t start from Django tutorial #1 if you don’t have Python installed! The goal is to have everyone on the same page so that we can have a simple working web app by noonish on Saturday. The afternoon will be spent somewhat independently adding a feature or two to that app.

Workshop materials

Preparing for the Intro to Python workshop earlier this month was easy since we liberally used the Boston Python group’s workshop materials. However, for this one we did not have as much to go off of. We are leaning heavily on the Django tutorials, the Django book (to a lesser extent) and Chander Ganesan’s PyCon talk from this year. At Djangocon this year, Russell Keith-Magee mentioned that the DSF was going to put together some training/workshop materials for just this sort of thing, but it is not ready yet. Hopefully what we create is useful and can bootstrap that process. Our materials are very early stage right now, but feel free to make suggestions or fork them for your own purposes.

Ok. Enough procrastinating. Time to work on the actual workshop materials.

David Fischer dot Name

Some Things to Some People

Generating PDFs With (and Without) Python