Extracting text (and annotations) from HTML with Python

10 minute read

Approaches

Python offers a number of options for extracting text from HTML documents.

Specialized Python libraries such as Inscriptis and HTML2Text provide good conversion quality and speed, although you might prefer to settle for lxml or BeautifulSoup, particularly if you already use these libraries in your program.

Libraries

The snippets below demonstrate the code required for converting HTML to text with Inscriptis, HTML2Text, BeautifulSoup and lxml:

# inscriptis
from inscriptis import get_text
text = get_text(html_content)

# html2text
from html2text import HTML2Text
h = HTML2Text()
text = h.handle(html_content)

# beautifulsoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text()

# lxml
from lxml.html import fromstring
from lxml.html.clean import clean_html
doc = fromstring(html_content)
text = clean_html(doc).text_content()

Console-based web browsers

Another popular option is calling a console-based web browser such as Lynx or w3m to perform the conversion, although this approach requires that these programs be installed on the user's system.

import subprocess

# call lynx to perform the conversion
text = subprocess.check_output(['lynx', '-dump', url])
text = text.decode('utf8')

# use w3m instead
text = subprocess.check_output(['w3m', '-dump', url])
text = text.decode('utf8')
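
Since this approach depends on external binaries, it may be worth guarding the call; a minimal sketch:

# guard against the browser not being installed on the user's system
try:
    text = subprocess.check_output(['lynx', '-dump', url]).decode('utf8')
except FileNotFoundError:
    print('Please install lynx to use this conversion backend.')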

Choosing the best approach for you

There are some criteria you should consider when selecting a conversion approach:

  • How complex is the HTML to parse, and what are your requirements with respect to conversion quality?
  • Are you interested in the complete page, or only in parts of the content (e.g., the article text, forum posts, or tables)?
  • Would the semantics and/or structure of the HTML file provide valuable information for your problem (e.g., emphasized text for the automatic generation of text summaries)?

Conversion quality

Conversion quality becomes a factor once you need to move beyond simple HTML snippets. Non-specialized approaches do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.

BeautifulSoup and lxml, for example, convert the following HTML enumeration to the string "firstsecond".

<ul>
  <li>first</li>
  <li>second</li>
</ul>

HTML2Text, Inscriptis and the console-based browsers, in contrast, return the correct output:

  * first
  * second

But even specialized libraries can yield inaccurate conversions at some point. HTML2Text, for example, does quite well at interpreting HTML but fails once the HTML document becomes too complex. More complicated HTML tables, for instance, which are commonly used on Wikipedia, yield text representations that no longer reflect the correct spatial relations between text snippets, as outlined in the example below:

Wikipedia snippet converted with Inscriptis. Please note that Inscriptis only wraps input lines if this is required by the HTML document's semantics.
Chur has an oceanic climate in spite of its inland position. Summers are warm and sometimes hot, normally averaging around 25 °C (77 °F) during the day, whilst winter means are around freezing, with daytime temperatures being about 5 °C (41 °F). Between 1981 and 2010 Chur had an average of 104.6 days of rain per year and on average received 849 mm (33.4 in) of precipitation. The wettest month was August during which time Chur received an average of 112 mm (4.4 in) of precipitation. During this month there was precipitation for an average of 11.2 days. The driest month of the year was February with an average of 47 mm (1.9 in) of precipitation over 6.6 days.[19]


Climate data for Chur (1981-2010)    
Month                                  Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec     Year  
Average high °C (°F)                   4.8     6.4     11.2    15.1    20.0    22.7    24.9    24.1    20.0    16.1    9.5     5.3     15.0  
                                       (40.6)  (43.5)  (52.2)  (59.2)  (68.0)  (72.9)  (76.8)  (75.4)  (68.0)  (61.0)  (49.1)  (41.5)  (59.0)
Daily mean °C (°F)                     0.7     1.8     5.9     9.7     14.3    17.1    19.1    18.5    14.8    10.8    5.2     1.7     10.0  
                                       (33.3)  (35.2)  (42.6)  (49.5)  (57.7)  (62.8)  (66.4)  (65.3)  (58.6)  (51.4)  (41.4)  (35.1)  (50.0)
Average low °C (°F)                    −2.6    −2.0    1.6     4.6     8.9     11.8    13.8    13.7    10.3    6.6     1.7     −1.4    5.6   
                                       (27.3)  (28.4)  (34.9)  (40.3)  (48.0)  (53.2)  (56.8)  (56.7)  (50.5)  (43.9)  (35.1)  (29.5)  (42.1)
Average precipitation mm (inches)      51      47      55      49      71      93      109     112     81      56      70      55      849   
                                       (2.0)   (1.9)   (2.2)   (1.9)   (2.8)   (3.7)   (4.3)   (4.4)   (3.2)   (2.2)   (2.8)   (2.2)   (33.4)
Average snowfall cm (inches)           34.0    24.7    10.3    1.5     0.4     0.0     0.0     0.0     0.1     0.1     10.0    20.6    101.7 
                                       (13.4)  (9.7)   (4.1)   (0.6)   (0.2)   (0.0)   (0.0)   (0.0)   (0.0)   (0.0)   (3.9)   (8.1)   (40.0)
Average precipitation days (≥ 1.0 mm)  7.3     6.6     8.1     7.5     9.9     11.2    11.0    11.2    8.4     7.0     8.5     7.9     104.6 
Average snowy days (≥ 1.0 cm)          4.8     3.9     2.5     0.4     0.1     0.0     0.0     0.0     0.0     0.0     1.6     4.1     17.4  
Average relative humidity (%)          73      70      65      63      64      67      68      71      73      73      74      75      70    
Mean monthly sunshine hours            97      112     139     147     169     177     203     185     155     135     93      81      1,692 
Source: MeteoSwiss[19]               

The same snippet converted with HTML2Text using the default settings:

Chur has an [oceanic climate](/wiki/Oceanic_climate "Oceanic climate") in
spite of its inland position. Summers are warm and sometimes hot, normally
averaging around 25 °C (77 °F) during the day, whilst winter means are around
freezing, with daytime temperatures being about 5 °C (41 °F). Between 1981 and
2010 Chur had an average of 104.6 days of rain per year and on average
received 849 mm (33.4 in) of
[precipitation](/wiki/Precipitation_\(meteorology\) "Precipitation
\(meteorology\)").

The wettest month was August during which time Chur
received an average of 112 mm (4.4 in) of precipitation. During this month
there was precipitation for an average of 11.2 days. The driest month of the
year was February with an average of 47 mm (1.9 in) of precipitation over 6.6
days.[19]

Climate data for Chur (1981-2010)  
---  
Month  | Jan  | Feb  | Mar  | Apr  | May  | Jun  | Jul  | Aug  | Sep  | Oct  |
Nov  | Dec  | Year  
Average high °C (°F)  | 4.8  
(40.6)  | 6.4  
(43.5)  | 11.2  
(52.2)  | 15.1  
(59.2)  | 20.0  
(68.0)  | 22.7  
(72.9)  | 24.9  
(76.8)  | 24.1  
(75.4)  | 20.0  
(68.0)  | 16.1  
(61.0)  | 9.5  
(49.1)  | 5.3  
(41.5)  | 15.0  
(59.0)  
Daily mean °C (°F)  | 0.7  
(33.3)  | 1.8  
(35.2)  | 5.9  
(42.6)  | 9.7  
(49.5)  | 14.3  
(57.7)  | 17.1  
(62.8)  | 19.1  
(66.4)  | 18.5  
(65.3)  | 14.8  
(58.6)  | 10.8  
(51.4)  | 5.2  
(41.4)  | 1.7  
(35.1)  | 10.0  
(50.0)  

HTML2Text does not correctly interpret the alignment of the temperature values within the table and, therefore, fails to preserve the spatial positioning of the text elements.

Inscriptis, in contrast, has been optimized towards providing accurate text representations, and even handles cascaded elements (e.g., cascaded tables, itemizations within tables, etc.) and a number of CSS attributes that are relevant to the content's alignment. When it comes to parsing such constructs, it frequently provides even more accurate conversions than the text-based Lynx browser.
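
For layout-heavy pages such as the Wikipedia example above, the rendering can also be tuned via a CSS profile. A minimal sketch, assuming the CSS_PROFILES mapping that Inscriptis 2.x exposes in inscriptis.css_profiles:

from inscriptis import get_text, ParserConfig
from inscriptis.css_profiles import CSS_PROFILES  # assumed import path

# the 'strict' profile follows the HTML standard more closely than the
# default 'relaxed' profile when computing margins around block elements
config = ParserConfig(css=CSS_PROFILES['strict'])
text = get_text(html_content, config)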

If you need to interpret really complex Web pages and JavaScript, you might consider using Selenium, which allows you to remote-control standard web browsers such as Google Chrome and Firefox from Python. Please be aware that this solution has considerable drawbacks in terms of complexity, resource requirements, scalability and stability.
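
A minimal sketch of this setup, which renders a page in headless Chrome and feeds the JavaScript-generated DOM to Inscriptis (the exact headless flag depends on your Chrome and Selenium versions):

from selenium import webdriver
from inscriptis import get_text

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # flag for recent Chrome versions

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.example.com')
    # page_source contains the DOM after JavaScript has been executed
    text = get_text(driver.page_source)
finally:
    driver.quit()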

Extracting relevant content only

The removal of noise elements from Web pages (often also denoted as boilerplate) is another common problem. A typical news page, for instance, contains navigation elements, information on related articles, advertisements, etc. that are usually not relevant to knowledge extraction tasks.

For such applications, specialized tools such as jusText, dragnet and boilerpy3 exist which aim at extracting only the relevant content. Adrien Barbaresi has written an excellent article on this topic which also evaluates some of the most commonly used text extraction approaches. In addition to general content extraction approaches, there are also specialized libraries that handle certain kinds of Web pages. The Harvest toolkit, for instance, has been optimized towards extracting posts and post metadata from Web forums and outperforms non-specialized approaches for this task.
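
To illustrate, a short boilerplate-removal sketch using jusText, which classifies each paragraph as content or boilerplate based on a language-specific stop list:

import justext

paragraphs = justext.justext(html_content, justext.get_stoplist('English'))
# keep only the paragraphs that jusText classified as content
text = '\n'.join(p.text for p in paragraphs if not p.is_boilerplate)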

Converting tables to Pandas Dataframes

If you need to operate on the data within HTML tables, you might consider pandas' read_html function, which returns a list of DataFrames for all tables found within the HTML content.

from pandas import read_html

# parse all tables in the document into a list of DataFrames
tables = read_html(html_content)

if tables:
    print(len(tables), 'tables found.')
    first_table = tables[0]

Preserving HTML structure and semantics with annotations

In the past, I often stumbled upon applications where some of the structure and semantics encoded within the original HTML document would have been helpful for downstream tasks. With the release of Inscriptis 2.0, Inscriptis supports so-called annotation rules, which enable the extraction of additional metadata from the HTML file.

The example below shows how these annotations work when parsing the following HTML snippet stored in the file chur.html:

   <h1>Chur</h1>
   <b>Chur</b> is the capital and largest town of the Swiss canton of the
   Grisons and lies in the Grisonian Rhine Valley.

The dictionary annotation_rules in the code below maps HTML tags, attributes and values to user-specified metadata which will be attached to matching text snippets:

from inscriptis import get_annotated_text, ParserConfig

annotation_rules = {
    'h1': ['heading', 'h1'],
    'h2': ['heading', 'h2'],
    'b': ['emphasis', 'bold'],
    'i': ['emphasis', 'italic'],
    'div#class=toc': ['table-of-contents'],
    '#class=FactBox': ['fact-box'],
    'table': ['table']
}

output = get_annotated_text(html_content, ParserConfig(annotation_rules=annotation_rules))
print("Text:", output['text'])
print("Annotations:", output['label'])

The annotation rules are used in Inscriptis’ get_annotated_text method which returns a dictionary of the extracted text and a list of the corresponding annotations.

  {
    'text': 'Chur\n\nChur is the capital and largest town of the Swiss canton
             of the Grisons and lies in the Grisonian Rhine Valley.',
    'label': [(0, 4, 'heading'), (0, 4, 'h1'), (6, 10, 'emphasis')]
  }

Each annotation is described by a tuple of start and end positions within the extracted text, plus the corresponding metadata. In the example above, for instance, the first four letters of the converted text (which refer to the term Chur) contain content originally marked by an h1 tag, which is annotated with heading and h1. These annotations can later be used within your application or by third-party software such as doccano, which is able to import and visualize JSONL-annotated content (please note that doccano currently does not support overlapping annotations).
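
The following sketch shows how the annotated output could be serialized to JSONL for such an import; note that the expected field names (text and labels below) depend on the doccano version and project type and are an assumption here:

import json

# convert Inscriptis' output dictionary into a single JSONL record
record = {'text': output['text'],
          'labels': [list(annotation) for annotation in output['label']]}
with open('chur.jsonl', 'w', encoding='utf8') as f:
    f.write(json.dumps(record) + '\n')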

Inscriptis ships with the inscript command line client, which is able to postprocess annotated content and convert it into (i) XML, (ii) a list of surface forms and metadata (i.e., the text that has been annotated), and (iii) an HTML document that visualizes the converted and annotated content.

  • Extracting the surface forms using inscript.py chur.html --postprocessor surface for the examples above yields the following list which maps metadata to the corresponding surface forms:
    [
        ['heading', 'Chur'],
        ['h1', 'Chur'],
        ['emphasis', 'Chur']
    ]
    
  • the XML conversion (inscript.py chur.html --postprocessor xml) returns the following output:
    <?xml version="1.0" encoding="UTF-8" ?>
      <heading>Chur</heading>
    
      <emphasis>Chur</emphasis> is the capital and largest town of the Swiss
      canton of the Grisons and lies in the Grisonian Rhine Valley.
    
  • the HTML conversion yields an HTML file that contains the extracted text and the corresponding annotations. The following examples illustrate this visualization for two more complex use cases:

Stackoverflow

HTML export of an annotated Stackoverflow page

The HTML export of the annotated Stackoverflow page uses the following annotation rules which annotate headings, emphasized content, code and information on users and comments.


{
    "h1": ["heading"],
    "h2": ["heading"],
    "h3": ["heading"],
    "b": ["emphasis"],
    "code": ["code"],
    "#itemprop=dateCreated": ["creation-date"],
    "#class=lang-py": ["code"],
    "#class=user-details": ["user"],
    "#class=reputation-score": ["reputation"],
    "#class=comment-user": ["comment-user"],
    "#class=comment-date": ["comment-date"],
    "#class=comment-copy": ["comment-comment"]
}

The corresponding HTML file has been generated with the inscript command line client and the following command line parameters:

inscript.py --annotation-rules ./stackoverflow.json \
            --postprocessor html \
            --output /tmp/stackoverflow.html \
            https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python

Wikipedia

The second example shows a snippet of a Wikipedia page that has been annotated with the rules below:

{
    "h1": ["heading"],
    "h2": ["heading"],
    "h3": ["subheading"],
    "h4": ["subheading"],
    "h5": ["subheading"],
    "i": ["emphasis"],
    "b": ["bold"],
    "table": ["table"],
    "th": ["tableheading"],
    "a": ["link"]
}

HTML export of an annotated Wikipedia page

Some final notes

Inscriptis has been optimized towards providing accurate text representations of HTML documents, and its output is often on par with, or even surpasses, the quality of console-based Web browsers such as Lynx and w3m. If this is not sufficient for your application (e.g., because you also need JavaScript support), you might consider using Selenium, which uses Chrome or Firefox to perform the conversion. Obviously, this option requires considerably more resources, scales less well, and is considered less stable than the lightweight approaches.

Please note that I am the author of Inscriptis, so this article has naturally focused on the features it provides. Nevertheless, I have also successfully used HTML2Text, lxml, BeautifulSoup, Lynx and w3m in my work, and all of these are very capable tools that address many real-world application scenarios.

Resources

Text Web browsers

  • Lynx - a highly configurable text-based Web browser.
  • w3m - a text-based Web browser and pager.

Python Libraries

  • Inscriptis - converts HTML to text while preserving the semantics and layout of constructs such as itemizations and tables.
  • HTML2Text - converts a page of HTML into clean, easy-to-read plain ASCII text.
  • lxml - a binding for the libxml2 and libxslt libraries which provides access to these libraries using the ElementTree API.
  • BeautifulSoup - a Python library for pulling data out of HTML and XML files.
