Daniel Beer Atom | RSS | About

Converting a PDF document to HTML

7 Jan 2013

G.H. Hardy’s autobiography, titled “A Mathematician’s Apology” has been made available in PDF format by the University of Alberta’s Mathematical Science Society:

Unfortunately, no HTML version is available. Attempts to convert the document to HTML using tools like pdftotext or pdftohtml yield poor results – out-of-order text, missing paragraph breaks, unwanted page numbers and inconveniently placed footnotes.

The reason for this is that tools such as pdftotext extract only the text content, which may not tell the whole story since it usually lacks anything resembling semantic markup. Page 18 from the original PDF illustrates the problem:

Page 18 from “A Mathematician’s Apology”.

Here is the result of pdftotext for this page (long lines have been wrapped):

reciprocity. And on the other hand my examples should be drawn from the
‘pukka’ mathematics, the mathematics of the working professional
mathematician; and this condition excludes a good deal which it would be
comparatively easy to make intelligible but which trespasses on logic
and mathematical philosophy. I can hardly do better than go back to the
Greeks. I will state and prove two of the famous theorems of Greek
mathematics. They are ‘simple’ theorems, simple both in idea and in
execution, but there is no doubt at all about their being theorems of
the highest class. Each is as fresh and significant as when it has
discovered—two thousand years have not written a wrinkle on either of
them.  Finally, both the statemen ts and the proofs can be mastered in
an hour by any intelligent reader, however slender his mathematical
equipment. 1. The first is Euclid’s3 proof of the existence of an
infinity of prime numbers. The prime numbers or primes are the numbers
2, 3, 5, 7, 11, 13, 17, 19, 23, 29,… (A) which cannot be resolved into
smaller factors4. Thus 37 and 317 are prime. The pr imes are the
material out of which all numbers are built up by multiplication: thus
666 = 2 ⋅ 3 ⋅ 3 ⋅ 37 . Every number which is not prime itself is
divisible by at least one prime ( usually, of course, by several). We
have to prove that there are infinitely many primes, i.e. that the
series (A) never comes to an end. Let us suppose that it does, and that
2, 3, 5, … , P is the complete series (so that P is the largest prime);
and let us, on this hypothesis, consider the number Q defined by the
formula

Q = (2 ⋅ 3 ⋅ 5 ⋅

⋅ P) + 1 .

3

Elements IX 20. The real origin of many theorems in the Elements is
obscure, and there seems to be no particular reason for supposing that
this one is not Euclid’s own. 4 There are t echnical reasons for not
counting 1 as a prime.

18

Because of the limitations of PDF format, these tools have no way of differentiating headings, formulas, page numbers, footnotes, or even successive paragraphs from one another. These are only clear to a reader of the printed document.

By extracting and examining all available typesetting information in a PDF, we can do a better job than this.

Conversion strategy

In order to extract information from the PDF, we use the script pdf2txt.py, included in the popular Python library PDFMiner.

We can get a detailed dump of the PDF by invoking it with:

pdf2txt.py -t xml "A Mathematician's Apology.pdf"

This produces a large (8.7 MB) XML file containing, among other things, detail on every glyph in the document, arranged into pages, like so:

<?xml version="1.0" encoding="utf-8" ?>
<pages>
...

<page id="22" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="120.599,253.060,495.014,740.450">
<textline bbox="120.600,722.080,494.928,740.450">
<text font="TimesNewRoman" bbox="120.600,722.080,125.255,740.450" size="18.370">r</text>
<text font="TimesNewRoman" bbox="125.268,722.080,131.475,740.450" size="18.370">e</text>
<text font="TimesNewRoman" bbox="131.488,722.080,137.695,740.450" size="18.370">c</text>
<text font="TimesNewRoman" bbox="137.707,722.080,141.594,740.450" size="18.370">i</text>
<text font="TimesNewRoman" bbox="141.606,722.080,148.596,740.450" size="18.370">p</text>
<text font="TimesNewRoman" bbox="148.609,722.080,153.264,740.450" size="18.370">r</text>
<text font="TimesNewRoman" bbox="153.277,722.080,160.267,740.450" size="18.370">o</text>
<text font="TimesNewRoman" bbox="160.279,722.080,166.487,740.450" size="18.370">c</text>
<text font="TimesNewRoman" bbox="166.499,722.080,170.386,740.450" size="18.370">i</text>
<text font="TimesNewRoman" bbox="170.398,722.080,174.285,740.450" size="18.370">t</text>
<text font="TimesNewRoman" bbox="174.297,722.080,181.287,740.450" size="18.370">y</text>
<text font="TimesNewRoman" bbox="181.300,722.080,184.795,740.450" size="18.370">.</text>
<text font="TimesNewRoman" bbox="184.807,722.080,188.302,740.450" size="18.370"> </text>
...

Our conversion program will operate by reading this XML data as input and producing an XHTML document as output. The program consists of a series of filters, each of which consumes one type of document and produces another type.

Each document has an informally-specified schema. The document to be consumed is fed to each filter by invoking filter methods for each node, in pre-order. This is just like a SAX parser, except that we don’t have any need to represent “closing” tags, because our schemas are not recursive.

Each filter is implemented with an example printer which can be used to inspect the immediate output of the filter. This printer dumps a representation of the intermediate document by using indentation to indicate nesting.

Filter chain

Our filter chain consists of the following filters:

GlyphExtractor

This is a SAX parser which simply converts the input data into the form described above. It produces a series of pages, each containing an unordered set of glyphs.

LineGrouper

Takes sets of glyphs and groups them into lines. Produces a series of pages, each containing a sequence of lines, which in turn contain an unordered set of glyphs.

ChunkJoiner

Within each line, sort the glyphs and join into string chunks those with identical typeface and size. Output is a series of pages, containing a series of lines, containing a series of chunks.

Depaginator

Classify and reorders lines in the document. This is the first filter to not be page-oriented. Its output is a series of “sections”. Each section has a name, and contains the series of lines of chunks described above. The purpose of this filter is to uninterleave footnotes (moving them into their own section), discard page numbers, and to identify “special” pages.

Unwrapper

This filter takes the series of lines within each section and produces instead a series of paragraphs.

HTMLGenerator

The sections from the unwrapper and marked up in HTML. The generator identifies footnotes and headings based on typeface and inserts links. The output of the generator is valid HTML, but divided into named blocks.

OutputWriter

The final filter in the chain performs miscellaneous fix-ups (via a table of regular expressions), and adds section and document preambles/postambles.

The program implementing this filter chain is available here:

Descriptions of the filter chain elements follow.

GlyphExtractor

This filter is a straight-forward SAX parser. We look only for page and text elements, and the characters contained in each text element. A section of the output from the filter’s debug printer corresponding to our example page is:

...
page
  glyph: bbox=(120.60, 722.08, 125.25, 740.45), font=(TimesNewRoman, None, 18), char=u'r'
  glyph: bbox=(125.27, 722.08, 131.47, 740.45), font=(TimesNewRoman, None, 18), char=u'e'
  glyph: bbox=(131.49, 722.08, 137.69, 740.45), font=(TimesNewRoman, None, 18), char=u'c'
  glyph: bbox=(137.71, 722.08, 141.59, 740.45), font=(TimesNewRoman, None, 18), char=u'i'
  glyph: bbox=(141.61, 722.08, 148.60, 740.45), font=(TimesNewRoman, None, 18), char=u'p'
  glyph: bbox=(148.61, 722.08, 153.26, 740.45), font=(TimesNewRoman, None, 18), char=u'r'
  glyph: bbox=(153.28, 722.08, 160.27, 740.45), font=(TimesNewRoman, None, 18), char=u'o'
  glyph: bbox=(160.28, 722.08, 166.49, 740.45), font=(TimesNewRoman, None, 18), char=u'c'
  glyph: bbox=(166.50, 722.08, 170.39, 740.45), font=(TimesNewRoman, None, 18), char=u'i'
  glyph: bbox=(170.40, 722.08, 174.28, 740.45), font=(TimesNewRoman, None, 18), char=u't'
  glyph: bbox=(174.30, 722.08, 181.29, 740.45), font=(TimesNewRoman, None, 18), char=u'y'
  glyph: bbox=(181.30, 722.08, 184.79, 740.45), font=(TimesNewRoman, None, 18), char=u'.'
  glyph: bbox=(184.81, 722.08, 188.30, 740.45), font=(TimesNewRoman, None, 18), char=u' '
...

LineGrouper

Glyphs in the XML output don’t necessarily appear in reading order, and are not required to. The first step in making sense of the input is to group the glyphs into lines. We can see from the above output that the first few glyphs on page 18 share a common baseline (see the bbox attribute). We can’t rely solely on this, because we want to correctly group superscripts and formulas (which may use a slightly different or raised font).

We discretize the Y coordinates of the bounding box (in ama_generate.py, I use one box per integer point), and then classify each horizontal line in this semi-discretized page space into a line. The procedure, using equivalence classes, is:

  1. Initially, each horizontal line belongs to its own equivalence class.

  2. For each glyph, scan over the set of horizontal lines occupied by the lower 2/3rd of its bounding box, and join the equivalence classes. As equivalence classes are joined, keep track of the lowest horizontal line which is a member of the class.

  3. Tag each glyph with the equivalence class to which the lowest line of its bounding box belongs.

  4. Group the glyphs by class, sort the classes by lowest baseline, and produce them one at a time. Augment the glyph’s style information by tagging it with the number of points between the glyph’s baseline and the lowest line in the equivalence class (this will be used later to identify superscripts).

The debug printer from this filter yields, for the page above:

...
page
  line: base=722.08
    glyph: span=(120.60, 125.25), font=(TimesNewRoman, None, 18, 0), char=u'r'
    glyph: span=(125.27, 131.47), font=(TimesNewRoman, None, 18, 0), char=u'e'
    glyph: span=(131.49, 137.69), font=(TimesNewRoman, None, 18, 0), char=u'c'
    glyph: span=(137.71, 141.59), font=(TimesNewRoman, None, 18, 0), char=u'i'
    glyph: span=(141.61, 148.60), font=(TimesNewRoman, None, 18, 0), char=u'p'
    glyph: span=(148.61, 153.26), font=(TimesNewRoman, None, 18, 0), char=u'r'
    glyph: span=(153.28, 160.27), font=(TimesNewRoman, None, 18, 0), char=u'o'
    glyph: span=(160.28, 166.49), font=(TimesNewRoman, None, 18, 0), char=u'c'
    glyph: span=(166.50, 170.39), font=(TimesNewRoman, None, 18, 0), char=u'i'
    glyph: span=(170.40, 174.28), font=(TimesNewRoman, None, 18, 0), char=u't'
    glyph: span=(174.30, 181.29), font=(TimesNewRoman, None, 18, 0), char=u'y'
    glyph: span=(181.30, 184.79), font=(TimesNewRoman, None, 18, 0), char=u'.'
    glyph: span=(184.81, 188.30), font=(TimesNewRoman, None, 18, 0), char=u' '
...
  line: base=706.00
    glyph: span=(120.60, 125.25), font=(TimesNewRoman, None, 18, 0), char=u'f'
    glyph: span=(125.27, 129.92), font=(TimesNewRoman, None, 18, 0), char=u'r'
    glyph: span=(129.94, 136.93), font=(TimesNewRoman, None, 18, 0), char=u'o'
    glyph: span=(136.94, 147.81), font=(TimesNewRoman, None, 18, 0), char=u'm'
    glyph: span=(147.74, 151.24), font=(TimesNewRoman, None, 18, 0), char=u' '
...

ChunkJoiner

This filter exists mainly to improve the speed of further processing, and to make it easier to inspect intermediate results. For each line, this filter sorts the glyphs by left edge, and joins consecutive glyphs with matching fonts which aren’t spaced too far apart. Output for page 18 (long chunk lines have been wrapped):

...
page
  line: base=722.08
    chunk: span=(120.60, 494.93), font=(TimesNewRoman, None, 18, 0),
           text=u'reciprocity. And on the other hand my examples should be drawn '
  line: base=706.00
    chunk: span=(120.60, 494.86), font=(TimesNewRoman, None, 18, 0),
           text=u'from the \u2018pukka\u2019 mathematics, the mathematics of the working '
  line: base=689.86
    chunk: span=(120.60, 494.90), font=(TimesNewRoman, None, 18, 0),
           text=u'professional mathematician; and this condition excludes a good '
...

Depaginator

The depaginator’s output groups lines by “section” rather than by page, but the lines within each section remainin in the same order. By applying a few heuristics, we can identify the following types of document line:

Page numbers are discarded, and the output document consists of sections of lines:

...
section: ident='main'
...
  line
    chunk: span=(120.60, 494.91), font=(TimesNewRoman, None, 18, 0),
           text=u'such as Fermat\u2019s \u2018two square\u2019 theorem on the law of quadratic '
  line
    chunk: span=(120.60, 494.93), font=(TimesNewRoman, None, 18, 0),
           text=u'reciprocity. And on the other hand my examples should be drawn '
  line
    chunk: span=(120.60, 494.86), font=(TimesNewRoman, None, 18, 0),
           text=u'from the \u2018pukka\u2019 mathematics, the mathematics of the working '
...
section: ident='footnotes'
  line
    chunk: span=(120.60, 123.84), font=(TimesNewRoman, None, 9, 5), text=u'1'
    chunk: span=(123.84, 214.69), font=(TimesNewRoman, None, 13, 0),
           text=u' Pascal seems the best '
  line
    chunk: span=(120.60, 123.84), font=(TimesNewRoman, None, 9, 5), text=u'2'
    chunk: span=(123.84, 253.94), font=(TimesNewRoman, None, 13, 0),
           text=u' See his letter on the \u2018Hexlet\u2019 in '
    chunk: span=(253.86, 281.65), font=(TimesNewRoman, Italic, 13, 0), text=u'Nature'
    chunk: span=(281.64, 374.06), font=(TimesNewRoman, None, 13, 0),
           text=u', vols. 127-9 (1936-7). '
  line
    chunk: span=(120.60, 123.84), font=(TimesNewRoman, None, 9, 5), text=u'3'
    chunk: span=(123.84, 126.34), font=(TimesNewRoman, None, 13, 0), text=u' '
    chunk: span=(126.36, 163.01), font=(TimesNewRoman, Italic, 13, 0), text=u'Elements'
    chunk: span=(163.02, 355.18), font=(TimesNewRoman, None, 13, 0),
           text=u' IX 20. The real origin of many theorems in the '
    chunk: span=(355.14, 391.85), font=(TimesNewRoman, Italic, 13, 0), text=u'Elements'
    chunk: span=(391.86, 479.06), font=(TimesNewRoman, None, 13, 0),
           text=u' is obscure, and there '
...

Unwrapper

This filter’s purpose is to identify successive lines which belong to the same paragraph. It does this by applying a few heuristics after examining the left and right edges of each line – for example, most paragraphs begin with a slightly indented left edge.

The output adds a new level of grouping between sections and lines:

...
section: ident='main'
...
  paragraph: edge=120.60
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'It will be clear by now that, if we are to have any chance of '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'making progress, I must produce example of \u2018real\u2019 mathematical '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'theorems, theorems which every mathematician will admit to be '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'first-rate. And here I am very handicapped by the restrictions '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'under which I am writing. On the one hand my examples must be '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'very simple, and intelligible to a reader who has no specialized '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'mathematical knowledge; no elaborate preliminary explanations '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'must be needs; and a reader must be able to follow the proofs as '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'well as the enunciations. These conditions exclude, for instance, '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'many of the most beautiful theorems of the theory of numbers, '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'such as Fermat\u2019s \u2018two square\u2019 theorem on the law of quadratic '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'reciprocity. And on the other hand my examples should be drawn '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'from the \u2018pukka\u2019 mathematics, the mathematics of the working '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'professional mathematician; and this condition excludes a good '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'deal which it would be comparatively easy to make intelligible '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'but which trespasses on logic and mathematical philosophy.'
  paragraph: edge=135.00
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'I can hardly do better than go back to the Greeks. I will state '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'and prove two of the famous theorems of Greek mathematics. '
    chunk: font=(TimesNewRoman, None, 18, 0),
           text=u'They are \u2018simple\u2019 theorems, simple both in idea and in execution, '
...

HTMLGenerator

This stage produces, as the name suggests, HTML output. However, the HTML output is grouped into blocks.

The filter processes a paragraph at a time, wrapping the paragraph in appropriate tags, and then grouping and wrapping the chunks within it. Many heuristics are applied to identify headings, superscripts, etc., by font and size.

The HTML is output as a series of strings, but the series of strings is grouped by block. A new block is created for each heading or section break encountered in the document. This helps to narrow the scope of regular expressions applied in the next filter.

OutputWriter

Despite our best efforts, some quirks remain in the generated output, and we’re also missing some markup on special pages. This filter fixes that by applying:

HTML is grouped and the regular expressions are applied to whole blocks. This strategy is preferable to hand-editing the generated output, because it allows us to introduce new heuristics in intermediate stages without having to redo hand-edits.

The final stage, after automated editing, is to convert the text to the output encoding (UTF8) and print it on standard output. The following command ties it all together and converts PDF to HTML in one pipelined action:

pdf2txt.py -t xml "A Mathematician's Apology.pdf" | \
    python ama_generate.py | tidy -utf8 > g.h._hardy_-_a_mathematicians_apology.html

Our example page forms part of section 12 in the finished product.