Simple HTML decluttering
24 Oct 2014
Some time ago, I implemented a very simple Perl script for the removal of HTML boilerplate (advertising, comments and other clutter surrounding the content on commercial websites). This is often useful if you want to reprocess text for offline reading, especially on devices like eReaders which don't handle complex layouts well.
The script is less than 300 lines long, and uses very simple a series of tree-processing algorithms which can be customized and tuned. The script may be downloaded here:
It requires the module
HTML::TreeBuilder (available from CPAN). The script itself is public domain -- you can use it any way you like. Invoke it as follows:
Decluttered output will be written to standard output. This appears to work reasonably well on most news websites that I've tried.
The decluttering process works on a parsed HTML tree. It consists of three processing phases, and the result of each is another HTML tree. The three phases are called "prune", "shake" and "purify". They are described below.
This is a simple top-down subtree pruning filter. The tree is walked in pre-order. Each node is examined on its own, without regard for its position in the tree, ancestry, or content. Nodes are fed to a decision-making function, which decides whether to keep or prune the node. If the node is to be pruned, the entire subtree is deleted. Otherwise, we recursively process the children of the node.
The decision-making function looks for low-hanging fruit, such as
<script> tags, or tags with class names containing poison words like "comment", "promo", or "popup".
This is the most complex filter in the tree. Decisions are made in post-order: the fate of child nodes is decided before that of their parents.
At the point of decision for each node, we have access to the following information:
Link: whether this is a hyperlinked subtree (true if either this node or any of its ancestors are
LinkWords: the total number of words in this subtree which are contained in hyperlinked subtrees.
Words: the total number of words in this subtree which are in non-hyperlinked subtrees.
Keep: whether any of the descendents of this node have been explicitly kept. We need to ensure that if we decide to keep a node, we must not also decide to prune any of its ancestors!
Tags: the total size of this subtree (i.e. the total number of opening tags if the subtree were to be printed).
These statistics are computed over every node in the tree. However, not every node constitutes a decision point. Decisions are made only at what we call "block tags". These are tags like
<section>, or any tag which we expect to be used to enclose a logical section of a document.
At each decision point, we calculate a score for the node:
Score = Words - 2*LinkWords - Tags
This is a simple linear combination whose weights were chosen by trial and error. Better weights probably exist.
If the score comes out non-negative, we set the
Keep flag for the node, which propagates upwards. On the other hand, if the score is negative, we delete this subtree (but still propagate the word and tag counts upwards, as if the node had not been removed -- this ensures that each decision is independent).
Shake: heuristic overrides
The shake algorithm makes mistakes, and we sometimes want to override them. False negatives, where boilerplate is preserved, isn't such a big deal. However, we want to be able to override classes of false positives.
In order to do this, we introduce an extra decision function, which acts on nodes independently, in the style of the function used in the "prune" phase. This function returns true if it decides that the shake algorithm should be overridden in favour of keeping a given node. This is used to detect blocks such as "author" and "byline" blocks, which often fail to yield high node scores.
This function is applied at every node in the tree, not just at block tags. If the function returns true, we set a context variable called
Blessed, which propagates downwards and suppresses the removal of descendents. We also set the
Keep flag, which propagates upwards and prevents the removal of ancestors.
This is a simple flattening filter which is applied before output. We maintain a list of approved tags, and for each one, a list of approved attributes.
All text is kept, but tags and attributes which aren't in the approved list are removed, thus flattening the tree (similar to limited-HTML filters used to filter comments on blog sites).
The "shake" procedure described above is influenced by a slightly more complex (and better-performing) algorithm described in the paper:
- Boilerplate Detection using Shallow Text Features, by Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl.
The authors of this paper have implemented their algorithm in a Java library called boilerpipe.
There also exists a heuristic boilerplate remover called justext. This is implemented in Python, and works by looking for text containing mainly full sentences.