<h1>Extracting text (and annotations) from HTML with Python</h1>
<h2 id="approaches">Approaches</h2>
<p>Python offers a number of options for extracting text from HTML documents.</p>
<p>Specialized Python libraries such as <a href="https://github.com/weblyzard/inscriptis">Inscriptis</a> and <a href="https://pypi.org/project/html2text/">HTML2Text</a> provide good conversion quality and speed, although you might prefer to settle for <a href="https://lxml.de/">lxml</a> or <a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a>, particularly if you already use these libraries in your program.</p>
<h3 id="libraries">Libraries</h3>
<p>The snippets below demonstrate the code required for converting HTML to text with Inscriptis, HTML2Text, BeautifulSoup and lxml:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># inscriptis
</span><span class="kn">from</span> <span class="nn">inscripits</span> <span class="kn">import</span> <span class="n">get_text</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">get_text</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="c1"># html2text
</span><span class="kn">from</span> <span class="nn">html2text</span> <span class="kn">import</span> <span class="n">HTML2Text</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">HTML2Text</span><span class="p">()</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">h</span><span class="p">.</span><span class="n">handle</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="c1"># beautifulsoup
</span><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">get_text</span><span class="p">()</span>
<span class="c1"># lxml
</span><span class="kn">import</span> <span class="nn">lxml.html</span> <span class="kn">import</span> <span class="nn">fromstring</span>
<span class="kn">from</span> <span class="nn">lxml.html.clean</span> <span class="kn">import</span> <span class="n">clean_html</span>
<span class="n">doc</span> <span class="o">=</span> <span class="n">fromstring</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">clean_html</span><span class="p">(</span><span class="n">doc</span><span class="p">).</span><span class="n">text_content</span><span class="p">()</span>
</code></pre></div></div>
<h3 id="console-based-web-browsers">Console-based web browsers</h3>
<p>Another popular option is calling a console-based web browser such as lynx or w3m to perform the conversion, although this approach requires these programs to be installed on the user’s system.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">subprocess</span>
<span class="c1"># call lynx to perform the conversion
</span><span class="n">text</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">check_output</span><span class="p">([</span><span class="s">'lynx'</span><span class="p">,</span> <span class="s">'-dump'</span><span class="p">,</span> <span class="n">url</span><span class="p">])</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf8'</span><span class="p">)</span>
<span class="c1"># use w3m instead
</span><span class="n">text</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">check_output</span><span class="p">([</span><span class="s">'w3m'</span><span class="p">,</span> <span class="s">'-dump'</span><span class="p">,</span> <span class="n">url</span><span class="p">])</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf8'</span><span class="p">)</span>
</code></pre></div></div>
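<p>Since these binaries may not be available on every system, it can help to check for them before shelling out. The sketch below (the <code class="language-plaintext highlighter-rouge">browser_dump</code> helper and its fallback order are my own naming, not part of any library) picks the first console browser found on the <code class="language-plaintext highlighter-rouge">PATH</code>:</p>

```python
import shutil
import subprocess


def browser_dump(url, browsers=('lynx', 'w3m')):
    """Convert a page to text with the first console browser found on PATH."""
    for name in browsers:
        if shutil.which(name):  # only invoke binaries that actually exist
            return subprocess.check_output([name, '-dump', url]).decode('utf8')
    raise RuntimeError('No console browser found; please install lynx or w3m.')
```

<p>Calling <code class="language-plaintext highlighter-rouge">browser_dump('https://www.example.org')</code> then returns the text dump produced by whichever browser is available, and fails with a clear error message otherwise.</p>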
<h2 id="choosing-the-best-approach-for-you">Choosing the best approach for you.</h2>
<p>There are some criteria you should consider when selecting a conversion approach:</p>
<ul>
<li>how complex is the HTML to parse, and what requirements do you have with respect to conversion quality?</li>
<li>are you interested in the complete page, or only in portions of the content (e.g., the article text, forum posts, or tables)?</li>
<li>would semantics and/or the structure of the HTML file provide valuable information for your problem (e.g., emphasized text for the automatic generation of text summaries)?</li>
</ul>
<h3 id="conversion-quality">Conversion quality</h3>
<p>Conversion quality becomes a factor once you need to move beyond simple HTML snippets.
Non-specialized approaches do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.</p>
<p>BeautifulSoup and lxml, for example, convert the following HTML enumeration to the string <code class="language-plaintext highlighter-rouge">firstsecond</code>.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><ul></span>
<span class="nt"><li></span>first<span class="nt"></li></span>
<span class="nt"><li></span>second<span class="nt"></li></span>
<span class="nt"><ul></span>
</code></pre></div></div>
<p>HTML2Text, Inscriptis and the console-based browsers, in contrast, return the correct output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> * first
* second
</code></pre></div></div>
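<p>The tag-agnostic behaviour behind the <code class="language-plaintext highlighter-rouge">firstsecond</code> result can be reproduced with nothing but Python's built-in <code class="language-plaintext highlighter-rouge">html.parser</code>: a handler that merely collects text nodes (the <code class="language-plaintext highlighter-rouge">NaiveTextExtractor</code> class below is a hypothetical name used for illustration) concatenates the list items in exactly the same way:</p>

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects text nodes without interpreting any HTML semantics."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        # every text node is appended verbatim - list semantics are ignored
        self.fragments.append(data)


extractor = NaiveTextExtractor()
extractor.feed('<ul><li>first</li><li>second</li></ul>')
print(''.join(extractor.fragments))  # firstsecond
```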
<p>But even specialized libraries might provide inaccurate conversions at some point. HTML2Text, for example, does pretty well in interpreting HTML but fails once the HTML document becomes too complex. More complicated HTML tables, such as those commonly used on Wikipedia, yield text representations that no longer reflect the correct spatial relations between text snippets, as the example below shows:</p>
<figcaption>Wikipedia snippet converted with Inscriptis. Please note that Inscriptis only wraps input lines if this is required by the HTML document's semantics.</figcaption>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chur has an oceanic climate in spite of its inland position. Summers are warm and sometimes hot, normally averaging around 25 °C (77 °F) during the day, whilst winter means are around freezing, with daytime temperatures being about 5 °C (41 °F). Between 1981 and 2010 Chur had an average of 104.6 days of rain per year and on average received 849 mm (33.4 in) of precipitation. The wettest month was August during which time Chur received an average of 112 mm (4.4 in) of precipitation. During this month there was precipitation for an average of 11.2 days. The driest month of the year was February with an average of 47 mm (1.9 in) of precipitation over 6.6 days.[19]
Climate data for Chur (1981-2010)
Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Year
Average high °C (°F) 4.8 6.4 11.2 15.1 20.0 22.7 24.9 24.1 20.0 16.1 9.5 5.3 15.0
(40.6) (43.5) (52.2) (59.2) (68.0) (72.9) (76.8) (75.4) (68.0) (61.0) (49.1) (41.5) (59.0)
Daily mean °C (°F) 0.7 1.8 5.9 9.7 14.3 17.1 19.1 18.5 14.8 10.8 5.2 1.7 10.0
(33.3) (35.2) (42.6) (49.5) (57.7) (62.8) (66.4) (65.3) (58.6) (51.4) (41.4) (35.1) (50.0)
Average low °C (°F) −2.6 −2.0 1.6 4.6 8.9 11.8 13.8 13.7 10.3 6.6 1.7 −1.4 5.6
(27.3) (28.4) (34.9) (40.3) (48.0) (53.2) (56.8) (56.7) (50.5) (43.9) (35.1) (29.5) (42.1)
Average precipitation mm (inches) 51 47 55 49 71 93 109 112 81 56 70 55 849
(2.0) (1.9) (2.2) (1.9) (2.8) (3.7) (4.3) (4.4) (3.2) (2.2) (2.8) (2.2) (33.4)
Average snowfall cm (inches) 34.0 24.7 10.3 1.5 0.4 0.0 0.0 0.0 0.1 0.1 10.0 20.6 101.7
(13.4) (9.7) (4.1) (0.6) (0.2) (0.0) (0.0) (0.0) (0.0) (0.0) (3.9) (8.1) (40.0)
Average precipitation days (≥ 1.0 mm) 7.3 6.6 8.1 7.5 9.9 11.2 11.0 11.2 8.4 7.0 8.5 7.9 104.6
Average snowy days (≥ 1.0 cm) 4.8 3.9 2.5 0.4 0.1 0.0 0.0 0.0 0.0 0.0 1.6 4.1 17.4
Average relative humidity (%) 73 70 65 63 64 67 68 71 73 73 74 75 70
Mean monthly sunshine hours 97 112 139 147 169 177 203 185 155 135 93 81 1,692
Source: MeteoSwiss[19]
</code></pre></div></div>
<p>The same snippet converted with HTML2Text using the default settings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chur has an [oceanic climate](/wiki/Oceanic_climate "Oceanic climate") in
spite of its inland position. Summers are warm and sometimes hot, normally
averaging around 25 °C (77 °F) during the day, whilst winter means are around
freezing, with daytime temperatures being about 5 °C (41 °F). Between 1981 and
2010 Chur had an average of 104.6 days of rain per year and on average
received 849 mm (33.4 in) of
[precipitation](/wiki/Precipitation_\(meteorology\) "Precipitation
\(meteorology\)").
The wettest month was August during which time Chur
received an average of 112 mm (4.4 in) of precipitation. During this month
there was precipitation for an average of 11.2 days. The driest month of the
year was February with an average of 47 mm (1.9 in) of precipitation over 6.6
days.[19]
Climate data for Chur (1981-2010)
---
Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct |
Nov | Dec | Year
Average high °C (°F) | 4.8
(40.6) | 6.4
(43.5) | 11.2
(52.2) | 15.1
(59.2) | 20.0
(68.0) | 22.7
(72.9) | 24.9
(76.8) | 24.1
(75.4) | 20.0
(68.0) | 16.1
(61.0) | 9.5
(49.1) | 5.3
(41.5) | 15.0
(59.0)
Daily mean °C (°F) | 0.7
(33.3) | 1.8
(35.2) | 5.9
(42.6) | 9.7
(49.5) | 14.3
(57.7) | 17.1
(62.8) | 19.1
(66.4) | 18.5
(65.3) | 14.8
(58.6) | 10.8
(51.4) | 5.2
(41.4) | 1.7
(35.1) | 10.0
(50.0)
</code></pre></div></div>
<p>HTML2Text does not correctly interpret the alignment of the temperature values within the table and, therefore, fails to preserve the spatial positioning of the text elements.</p>
<p>Inscriptis, in contrast, has been optimized towards providing accurate text representations, and even handles cascaded elements (e.g., cascaded tables, itemizations within tables, etc.) and a number of CSS attributes that are relevant to the content’s alignment. When it comes to parsing such constructs, it frequently provides even more accurate conversions than the text-based lynx browser.</p>
<p>If you need to interpret <em>really</em> complex Web pages and JavaScript, you might consider using <a href="https://selenium-python.readthedocs.io/">Selenium</a> which allows you to remote-control standard Web Browsers such as Google Chrome and Firefox from Python. Please be aware that this solution has considerable drawbacks in terms of complexity, resource requirements, scalability and stability.</p>
<h3 id="extracting-relevant-content-only">Extracting relevant content only</h3>
<p>The removal of noise elements within Web pages (often referred to as boilerplate) is another common problem. A typical news page, for instance, contains navigation elements, information on related articles, advertisements, etc. that are usually not relevant to knowledge extraction tasks.</p>
<p>For such applications, specialized tools such as jusText, dragnet and boilerpy3 exist that aim at extracting the relevant content only. Adrien Barbaresi has written an excellent <a href="https://adrien.barbaresi.eu/blog/evaluating-text-extraction-python.html">article</a> on this topic which also evaluates some of the most commonly used text extraction approaches. In addition to general content extraction approaches, there are also specialized libraries that handle certain kinds of Web pages. The <a href="https://github.com/fhgr/harvest">Harvest</a> toolkit, for instance, has been optimized towards extracting posts and post metadata from Web forums and outperforms non-specialized approaches for this task.</p>
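<p>A very crude approximation of boilerplate removal — dropping the content of tags that rarely carry article text — can be sketched with the standard library alone. The <code class="language-plaintext highlighter-rouge">BoilerplateStripper</code> class below is my own illustrative example; real tools such as jusText use far more sophisticated, content-aware heuristics:</p>

```python
from html.parser import HTMLParser


class BoilerplateStripper(HTMLParser):
    """Skips text inside tags that rarely carry article content."""

    SKIP = {'script', 'style', 'nav', 'header', 'footer', 'aside'}

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting level of currently skipped subtrees
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:    # only keep text outside skipped subtrees
            self.fragments.append(data)


stripper = BoilerplateStripper()
stripper.feed('<nav>Home | About</nav><p>Article text.</p>'
              '<footer>(c) 2021</footer>')
print(''.join(stripper.fragments))  # Article text.
```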
<h3 id="converting-tables-to-pandas-dataframes">Converting tables to Pandas Dataframes</h3>
<p>If you need to operate on the data within HTML tables, you might consider Pandas’ <code class="language-plaintext highlighter-rouge">read_html</code> function, which returns a list of dataframes for all tables within the HTML content.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pandas</span> <span class="kn">import</span> <span class="n">read_html</span>
<span class="n">tables</span> <span class="o">=</span> <span class="n">read_html</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="k">if</span> <span class="n">tables</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tables</span><span class="p">),</span> <span class="s">'tables found.'</span><span class="p">)</span>
<span class="n">first_table</span> <span class="o">=</span> <span class="n">tables</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<h2 id="preserving-html-structure-and-semantics-with-annotations">Preserving HTML structure and semantics with annotations</h2>
<p>In the past, I often stumbled upon applications where <em>some</em> of the structure and semantics encoded within the original HTML document would have been helpful for downstream tasks. With the release of Inscriptis 2.0, Inscriptis supports so-called annotation rules, which enable the extraction of additional metadata from the HTML file.</p>
<p>The example below shows how these annotations work when parsing the following HTML snippet stored in the file <code class="language-plaintext highlighter-rouge">chur.html</code>:</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nt"><h1></span>Chur<span class="nt"></h1></span>
<span class="nt"><b></span>Chur<span class="nt"></b></span> is the capital and largest town of the Swiss canton of the
Grisons and lies in the Grisonian Rhine Valley.
</code></pre></div></div>
<p>The dictionary <code class="language-plaintext highlighter-rouge">annotation_rules</code> in the code below maps HTML tags, attributes and values to user-specified metadata which will be attached to matching text snippets:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">inscriptis</span> <span class="kn">import</span> <span class="n">get_annotated_text</span><span class="p">,</span> <span class="n">ParserConfig</span>
<span class="n">annotation_rules</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'h1'</span><span class="p">:</span> <span class="p">[</span><span class="s">'heading'</span><span class="p">,</span> <span class="s">'h1'</span><span class="p">],</span>
<span class="s">'h2'</span><span class="p">:</span> <span class="p">[</span><span class="s">'heading'</span><span class="p">,</span> <span class="s">'h2'</span><span class="p">],</span>
<span class="s">'b'</span><span class="p">:</span> <span class="p">[</span><span class="s">'emphasis'</span><span class="p">,</span> <span class="s">'bold'</span><span class="p">],</span>
<span class="s">'i'</span><span class="p">:</span> <span class="p">[</span><span class="s">'emphasis'</span><span class="p">,</span> <span class="s">'italic'</span><span class="p">],</span>
<span class="s">'div#class=toc'</span><span class="p">:</span> <span class="p">[</span><span class="s">'table-of-contents'</span><span class="p">],</span>
<span class="s">'#class=FactBox'</span><span class="p">:</span> <span class="p">[</span><span class="s">'fact-box'</span><span class="p">],</span>
<span class="s">'table'</span><span class="p">:</span> <span class="p">[</span><span class="s">'table'</span><span class="p">]</span>
<span class="p">}</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">get_annotated_text</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="n">ParserConfig</span><span class="p">(</span><span class="n">annotation_rules</span><span class="o">=</span><span class="n">rules</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Text:"</span><span class="p">,</span> <span class="n">output</span><span class="p">[</span><span class="s">'text'</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Annotations:"</span><span class="p">,</span> <span class="n">output</span><span class="p">[</span><span class="s">'label'</span><span class="p">])</span>
</code></pre></div></div>
<p>The annotation rules are used in Inscriptis’ <code class="language-plaintext highlighter-rouge">get_annotated_text</code> method which returns
a dictionary of the extracted text and a list of the corresponding annotations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="p">{</span>
<span class="s">'text'</span><span class="p">:</span> <span class="s">'Chur</span><span class="se">\n\n</span><span class="s">Chur is the capital and largest town of the Swiss canton
of the Grisons and lies in the Grisonian Rhine Valley.'</span><span class="p">,</span>
<span class="s">'label'</span><span class="p">:</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="s">'heading'</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="s">'h1'</span><span class="p">),</span> <span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="s">'emphasis'</span><span class="p">)]</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Each annotation is described by a tuple of its start and end position within the extracted text, and the corresponding metadata. In the example above, for instance, the first four letters of the converted text (which refer to the term <code class="language-plaintext highlighter-rouge">Chur</code>) contain content originally marked by an <code class="language-plaintext highlighter-rouge">h1</code> tag, which is annotated with <code class="language-plaintext highlighter-rouge">heading</code> and <code class="language-plaintext highlighter-rouge">h1</code>.
These annotations can be used later on within your application or by third-party software such as <a href="https://github.com/doccano/doccano">doccano</a> which is able to import and visualize JSONL annotated content (please note that doccano currently does not support overlapping annotations).</p>
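<p>To illustrate how these offsets work, the snippet below starts from the example output shown above, recovers the annotated surface forms via string slicing, and serializes the result to a JSONL line. Please note that the exact JSONL schema expected by doccano may vary between versions:</p>

```python
import json

# Example output as returned by Inscriptis' get_annotated_text (see above).
output = {
    'text': ('Chur\n\nChur is the capital and largest town of the Swiss canton '
             'of the Grisons and lies in the Grisonian Rhine Valley.'),
    'label': [(0, 4, 'heading'), (0, 4, 'h1'), (6, 10, 'emphasis')]
}

# Each annotation is (start, end, label); slicing recovers the surface form.
surface_forms = [(label, output['text'][start:end])
                 for start, end, label in output['label']]
print(surface_forms)  # [('heading', 'Chur'), ('h1', 'Chur'), ('emphasis', 'Chur')]

# Serialize to a JSONL line for import into annotation tools such as doccano.
jsonl_line = json.dumps({'text': output['text'],
                         'label': [list(a) for a in output['label']]})
```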
<p><code class="language-plaintext highlighter-rouge">Inscriptis</code> ships with the <code class="language-plaintext highlighter-rouge">inscript</code> command line client which is able to postprocess annotated content and to convert it into (i) XML, (ii) a list of surface forms and metadata (i.e., the text that has been annotated), and (iii) to visualize the converted and annotated content in an HTML document.</p>
<ul>
<li>Extracting the surface forms using <code class="language-plaintext highlighter-rouge">inscript.py chur.html --postprocessor surface</code> for the examples above yields the following list which maps metadata to the corresponding surface forms:
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span>
<span class="p">[</span><span class="s">'heading'</span><span class="p">,</span> <span class="s">'Chur'</span><span class="p">],</span>
<span class="p">[</span><span class="s">'h1'</span><span class="p">:</span> <span class="s">'Chur'</span><span class="p">],</span>
<span class="p">[</span><span class="s">'emphasis'</span><span class="p">:</span> <span class="s">'Chur'</span><span class="p">]</span>
<span class="p">]</span>
</code></pre></div> </div>
</li>
<li>the XML conversion (<code class="language-plaintext highlighter-rouge">inscript.py chur.html --postprocessor xml</code>) returns the following output:
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><?xml version="1.0" encoding="UTF-8" ?></span>
<span class="nt"><heading></span>Chur<span class="nt"></heading></span>
<span class="nt"><emphasis></span>Chur<span class="nt"></emphasis></span> is the capital and largest town of the Swiss
canton of the Grisons and lies in the Grisonian Rhine Valley.
</code></pre></div> </div>
</li>
<li>the HTML conversion yields an HTML file that contains the extracted text and the corresponding annotations. The following examples illustrate this visualization for two more complex use cases:</li>
</ul>
<h3 id="stackoverflow">Stackoverflow</h3>
<p class="full"><img src="/assets/images/2021/inscriptis/stackoverflow-annotated.png" alt="HTML export of an annotated Stackoverflow page" /></p>
<p>The HTML export of the annotated Stackoverflow page uses the following annotation rules which annotate headings, emphasized content, code and information on users and comments.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"h1"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h2"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h3"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"b"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"emphasis"</span><span class="p">],</span><span class="w">
</span><span class="nl">"code"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"code"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#itemprop=dateCreated"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"creation-date"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=lang-py"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"code"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=user-details"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"user"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=reputation-score"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"reputation"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=comment-user"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"comment-user"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=comment-date"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"comment-date"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=comment-copy"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"comment-comment"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The corresponding HTML file has been generated with the <code class="language-plaintext highlighter-rouge">inscript</code> command line client and the following command line parameters:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inscript.py <span class="nt">--annotation-rules</span> ./stackoverflow.json
<span class="nt">--postprocessor</span> html <span class="se">\</span>
<span class="nt">--output</span> /tmp/stackoverflow.html <span class="se">\</span>
https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python
</code></pre></div></div>
<h3 id="wikipedia">Wikipedia</h3>
<p>The second example shows a snippet of a Wikipedia page that has been annotated with the rules below:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"h1"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h2"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h3"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"subheading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h4"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"subheading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h5"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"subheading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"i"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"emphasis"</span><span class="p">],</span><span class="w">
</span><span class="nl">"b"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"bold"</span><span class="p">],</span><span class="w">
</span><span class="nl">"table"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"table"</span><span class="p">],</span><span class="w">
</span><span class="nl">"th"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"tableheading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"a"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"link"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p class="full"><img src="/assets/images/2021/inscriptis/wikipedia-annotated.png" alt="HTML export of an annotated Wikipedia page" /></p>
<h2 id="some-final-notes">Some final notes</h2>
<p>Inscriptis has been optimized towards providing accurate text representations of HTML documents, which often match or even surpass the quality of console-based Web browsers such as Lynx and w3m. If this is not sufficient for your applications (e.g., since you also need JavaScript support) you might consider using Selenium, which uses Chrome or Firefox to perform the conversion. Obviously, this option requires considerably more resources, scales less well and is considered less stable than the lightweight approaches.</p>
<p>Please note that I am the author of Inscriptis, and naturally this article has focused more on the features it provides. Nevertheless, I have also successfully used HTML2Text, lxml, BeautifulSoup, Lynx and w3m in my work, and all of these are very capable tools which address many real-world application scenarios.</p>
<h2 id="resources">Resources</h2>
<ul>
<li>An article on <a href="https://adrien.barbaresi.eu/blog/evaluating-text-extraction-python.html">evaluating scraping and text extraction tools for Python </a> by Adrien Barbaresi</li>
<li><a href="https://github.com/fhgr/harvest">Harvest</a> - A toolkit for extracting posts and post metadata from web forums</li>
<li><a href="https://pandas.pydata.org/">Pandas</a> - A fast, powerful data analysis and manipulation tool.</li>
<li><a href="https://selenium-python.readthedocs.io/">Selenium Python documentation</a> - Selenium allows remote control of Web browsers</li>
<li><a href="https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python">Stackoverflow on extracting text from HTML</a></li>
</ul>
<h3 id="text-web-browsers">Text Web browsers</h3>
<ul>
<li><a href="https://lynx.invisible-island.net/">Lynx</a></li>
<li><a href="http://w3m.sourceforge.net/">w3m</a></li>
</ul>
<h3 id="python-libraries">Python Libraries</h3>
<ul>
<li><a href="https://pypi.org/project/html2text/">HTML2Text</a> converts a page of HTML into clean, easy-to-read plain ASCII text.</li>
<li><a href="https://lxml.de/">lxml</a> - binding for the libxml2 and libxslt libraries which provides access to these libraries using the ElementTree API.
<a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a> - Python library for pulling data out of HTML and XML files.</li>
</ul>
<h1>Setup and automatic renewal of wildcard SSL certificates for Kubernetes with Certbot and NSD</h1>
<p>Wildcard SSL certificates cover all subdomains under a certain domain - e.g., <code class="language-plaintext highlighter-rouge">*.k8s.example.net</code> covers <code class="language-plaintext highlighter-rouge">recognyze.k8s.example.net</code>, <code class="language-plaintext highlighter-rouge">inscriptis.k8s.example.net</code>, etc., which is very useful if Kubernetes is used to deploy such services.</p>
<h2 id="prerequisites">Prerequisites</h2>
<p>The following guide assumes that you</p>
<ul>
<li>delegate DNS for the prefix domain (in the example above <code class="language-plaintext highlighter-rouge">k8s.example.net</code>) to a separate zone file, and</li>
<li>manage that zone with NSD (depending on your setup you might use the same NSD server, a separate instance, or even a server on another host).</li>
</ul>
<h2 id="steps">Steps</h2>
<ol>
<li>add a name server (NS) entry to your domain configuration that delegates DNS for the prefix domain to a given NSD server.
<pre><code class="language-dns">k8s 3600 IN NS k8s-server.example.net.
</code></pre>
</li>
<li>set up the NSD configuration and zone file for the prefix domain. The <code class="language-plaintext highlighter-rouge">_acme-challenge</code> entry will be overwritten by Certbot during the DNS-01 challenge verification process.
<ul>
<li><code class="language-plaintext highlighter-rouge">/etc/nsd/nsd.conf</code>:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>zone:
name: k8s.example.net
zonefile: /etc/nsd/zones/k8s.example.net.zone
</code></pre></div> </div>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">/etc/nsd/zones/k8s.example.net.zone</code>:</p>
<pre><code class="language-dns">@ 3660 IN SOA nameserver.example.net. hostmaster.example.net. 2014111364 28800 7200 604800 3660
@ 86400 IN NS k8s-server.example.net.
@ 3600 IN A 1.2.3.4
* 3600 IN A 1.2.3.4
_acme-challenge 60 IN TXT "--temporary-dummy--"
</code></pre>
</li>
</ul>
</li>
<li>install the <code class="language-plaintext highlighter-rouge">certbot-nsd-hook</code> script to <code class="language-plaintext highlighter-rouge">/opt</code>:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /opt
git clone https://github.com/AlbertWeichselbraun/certbot-nsd-hook.git
</code></pre></div> </div>
</li>
<li>create the SSL wildcard certificate with
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>certbot certonly <span class="se">\</span>
<span class="nt">-d</span> <span class="s1">'*.k8s.example.net'</span> <span class="se">\</span>
<span class="nt">--manual</span> <span class="se">\</span>
<span class="nt">--manual-auth-hook</span><span class="o">=</span><span class="s2">"/opt/certbot-nsd-hook/nsd-update-dns.py"</span> <span class="se">\</span>
<span class="nt">--post-hook</span><span class="o">=</span><span class="s2">"systemctl reload apache2"</span>
</code></pre></div> </div>
</li>
<li>adapt your apache2 configuration to use the wildcard certificate
<div class="language-apache highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">SSLEngine</span> <span class="ss">on</span>
<span class="nc">SSLCertificateKeyFile</span> /etc/letsencrypt/live/k8s.example.net/privkey.pem
<span class="nc">SSLCertificateFile</span> /etc/letsencrypt/live/k8s.example.net/fullchain.pem
</code></pre></div> </div>
</li>
<li>add Certbot to <code class="language-plaintext highlighter-rouge">/etc/crontab</code> to ensure that the certificate gets automatically renewed
<pre><code class="language-crontab">17 5 * * * root certbot renew --cert-name k8s.example.net
</code></pre>
<p><strong>Note:</strong> the option <code class="language-plaintext highlighter-rouge">--cert-name</code> allows you to specify the certificate to renew. This is relevant if your server uses wildcard and conventional certificates at the same time, since the <code class="language-plaintext highlighter-rouge">certbot renew</code> command does not allow mixing of renewal strategies yet.</p>
</li>
</ol>
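<p>The auth hook from step 4 essentially rewrites the <code class="language-plaintext highlighter-rouge">_acme-challenge</code> record from step 2 and bumps the zone serial so that the change propagates. The snippet below is a simplified Python illustration of that idea - not the actual <code class="language-plaintext highlighter-rouge">certbot-nsd-hook</code> code, which additionally writes the zone file to disk and reloads NSD:</p>

```python
import re


def set_acme_challenge(zone_text: str, validation_token: str) -> str:
    """Replace the _acme-challenge TXT record and bump the SOA serial.

    Simplified sketch of what a certbot manual-auth hook for NSD does;
    certbot passes the token via the CERTBOT_VALIDATION environment
    variable, and the real hook also reloads NSD afterwards.
    """
    # point the challenge record at the validation token
    zone_text = re.sub(
        r'(_acme-challenge\s+\d+\s+IN\s+TXT\s+)"[^"]*"',
        rf'\g<1>"{validation_token}"',
        zone_text,
    )

    # bump the SOA serial so secondary servers notice the change
    def bump(match: "re.Match") -> str:
        return match.group(1) + str(int(match.group(2)) + 1)

    return re.sub(r"(IN\s+SOA\s+\S+\s+\S+\s+)(\d+)", bump, zone_text, count=1)
```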
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://github.com/AlbertWeichselbraun/certbot-nsd-hook">certbot-nsd-hook project</a> - Scripts required for using the certbot DNS challenge in conjunction with NSD</li>
</ul>Albert WeichselbraunWildcard SSL certificates cover all subdomains under a certain domain - e.g. *.k8s.example.net will cover recognyze.k8s.example.net, inscripits.k8s.example.net, etc. which is very useful, if Kubernetes is used to deploy such services.Managing DavMail with systemd and preventing service timeouts after network reconnects.2020-10-17T00:00:00+02:002020-10-17T00:00:00+02:00https://semanticlab.net/desktop/e-mail/linux/sysadmin/Managing-DavMail-with-systemd-and-preventing-service-timeouts-after-network-reconnects<p><a href="https://davmail.sourceforge.net">DavMail</a> enables access to Exchange servers over standard protocols such as IMAP, SMTP and Caldav.
It, therefore, allows you to check your company e-mail from popular mail clients such as Mailspring, Thunderbird and Geary.</p>
<p>The following sections outline how to (i) automatically start DavMail via systemd, and (ii) ensure that the service stays operable, even after network reconnects.</p>
<h1 id="starting-davmail-via-systemd">Starting DavMail via systemd</h1>
<p>If your distribution does not provide a systemd configuration file for DavMail, you can paste the following snippet into <code class="language-plaintext highlighter-rouge">/etc/systemd/system/davmail.service</code>.</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[Unit]</span>
<span class="py">Description</span><span class="p">=</span><span class="s">Davmail Exchange gateway</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">man:davmail</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">https://davmail.sourceforge.net/serversetup.html</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">https://davmail.sourceforge.net/advanced.html</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">https://davmail.sourceforge.net/sslsetup.html</span>
<span class="py">After</span><span class="p">=</span><span class="s">network.target</span>
<span class="nn">[Service]</span>
<span class="py">Type</span><span class="p">=</span><span class="s">simple</span>
<span class="py">User</span><span class="p">=</span><span class="s">davmail</span>
<span class="py">PermissionsStartOnly</span><span class="p">=</span><span class="s">true</span>
<span class="py">ExecStartPre</span><span class="p">=</span><span class="s">/usr/bin/touch /var/log/davmail.log</span>
<span class="py">ExecStartPre</span><span class="p">=</span><span class="s">/bin/chown davmail:adm /var/log/davmail.log</span>
<span class="py">ExecStart</span><span class="p">=</span><span class="s">/usr/bin/davmail -server /etc/davmail.properties</span>
<span class="py">SuccessExitStatus</span><span class="p">=</span><span class="s">143</span>
<span class="py">PrivateTmp</span><span class="p">=</span><span class="s">yes</span>
<span class="nn">[Install]</span>
<span class="py">WantedBy</span><span class="p">=</span><span class="s">multi-user.target</span>
</code></pre></div></div>
<p>Afterwards, you need to add the DavMail user and enable the script with</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>adduser <span class="nt">--system</span> davmail
systemctl daemon-reload
systemctl <span class="nb">enable </span>davmail
systemctl start davmail
</code></pre></div></div>
<h1 id="coping-with-network-reconnects">Coping with network reconnects</h1>
<p>One major problem with DavMail is network reconnects (e.g., if you change the network or move between VPNs), since they require a restart of the service to prevent timeouts when accessing your e-mail. One way of solving this issue is the use of the <code class="language-plaintext highlighter-rouge">NetworkManager-dispatcher</code> service, which can be enabled with</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl <span class="nb">enable </span>NetworkManager-dispatcher
systemctl start NetworkManager-dispatcher
</code></pre></div></div>
<p>Once enabled, the dispatcher service allows you to specify scripts that are executed if network connectivity is lost or becomes available again. The following script stops DavMail if networking becomes unavailable and restarts the service after the network is up again.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="c"># stop davmail, if no network connectivity is available and restart it once</span>
<span class="c"># the network becomes available.</span>
<span class="nv">interface</span><span class="o">=</span><span class="nv">$1</span> <span class="nv">status</span><span class="o">=</span><span class="nv">$2</span>
<span class="k">case</span> <span class="nv">$status</span> <span class="k">in
</span>up<span class="p">)</span>
systemctl restart davmail
<span class="p">;;</span>
down<span class="p">)</span>
systemctl stop davmail
<span class="p">;;</span>
<span class="k">esac</span>
</code></pre></div></div>
<p>You can enable automatic restarts of the DavMail service by copying the script to <code class="language-plaintext highlighter-rouge">/etc/NetworkManager/dispatcher.d/50-davmail</code> and making it executable with <code class="language-plaintext highlighter-rouge">chmod a+x /etc/NetworkManager/dispatcher.d/50-davmail</code>.</p>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://davmail.sourceforge.net">DavMail</a> - DavMail POP/IMAP/SMTP/Caldav/Carddav/LDAP Exchange and Office 365 Gateway</li>
<li><a href="https://github.com/mguessan/davmail">DavMail GitHub repository</a></li>
<li><a href="https://wiki.archlinux.org/index.php/NetworkManager#Network_services_with_NetworkManager_dispatcher">ArchWiki on managing network services with NetworkManager dispatcher</a></li>
</ul>Albert WeichselbraunDavMail enables access to Exchange servers over standard protocols such as IMAP, SMTP and Caldav. It, therefore, allows you to check your company e-mail from popular mail clients such as Mailspring, Thunderbird and Geary.Setting up Gnome CalDAV and CardDAV support with Radicale2020-10-12T00:00:00+02:002020-10-12T00:00:00+02:00https://semanticlab.net/sysadmin/linux/Gnome-Todo-and-CalDAV-servers<p>Although Gnome supports CalDAV and CardDAV, it currently only allows configuring them for Nextcloud servers. There is a long-standing <a href="https://bugzilla.gnome.org/show_bug.cgi?id=720519">Bug Report</a> which describes this issue but hasn’t yet (as of October 2020) been properly addressed.</p>
<p>Florian Apolloner has, therefore, developed a <a href="https://gist.github.com/apollo13/f4fc8f33a2700dffb9e11c1b056c53ba">webapp</a> which uses redirects to map requests meant for Nextcloud servers to other CalDAV/CardDAV servers.</p>
<p>If you run an Apache Web server you can instead use <code class="language-plaintext highlighter-rouge">mod_rewrite</code> to replicate his solution:</p>
<div class="language-apache highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c"># redirect used for caldav and carddav compatibility with owncloud & nextcloud</span>
<span class="nc">RewriteEngine</span> <span class="ss">on</span>
<span class="nc">RewriteRule</span> "^/.well-known/caldav" "/dav/caldav/" [R]
<span class="nc">RewriteRule</span> "^/.well-known/carddav" "/dav/carddav/" [R]
<span class="nc">RewriteRule</span> "^/remote.php/webdav/" "/dav" [R]
<span class="nc">RewriteRule</span> "^/remote.php/caldav" "/dav/caldav/" [R]
<span class="nc">RewriteRule</span> "^/remote.php/carddav" "/dav/carddav/" [R]
</code></pre></div></div>
<p>The redirects’ targets need to point to the path or URL of your caldav and carddav servers (I use <a href="https://radicale.org">Radicale</a> so in my case the proper URLs are <code class="language-plaintext highlighter-rouge">/dav/caldav</code> and <code class="language-plaintext highlighter-rouge">/dav/carddav</code>). The <code class="language-plaintext highlighter-rouge">/webdav</code> redirect can either point to your WebDAV server (if you plan on using WebDAV remote storage) or to a simple Web page on your system.</p>
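<p>Ignoring the redirect semantics, the rewrite rules above boil down to a prefix-to-target mapping. The following Python sketch (an illustration, not part of the Apache setup) reproduces that mapping; since all patterns are anchored at the start of the path, a simple prefix match approximates the regular expressions:</p>

```python
from typing import Optional

# rewrite table mirroring the Apache configuration above
REDIRECTS = [
    ("/.well-known/caldav", "/dav/caldav/"),
    ("/.well-known/carddav", "/dav/carddav/"),
    ("/remote.php/webdav/", "/dav"),
    ("/remote.php/caldav", "/dav/caldav/"),
    ("/remote.php/carddav", "/dav/carddav/"),
]


def redirect_target(path: str) -> Optional[str]:
    """Return the redirect target for a request path, or None if no rule matches."""
    for prefix, target in REDIRECTS:
        if path.startswith(prefix):
            return target
    return None
```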
<p>Once the redirects are set up, you can configure your CalDAV/CardDAV server as a <em>Nextcloud</em> server in <em>Gnome Online Accounts</em>. If your server does not support WebDAV you need to disable the <code class="language-plaintext highlighter-rouge">Documents</code> and <code class="language-plaintext highlighter-rouge">Files</code> sharing settings as outlined below.</p>
<p><img src="/assets/images/2020/nextcloud-settings.png" alt="Gnome Nextcloud Settings" title="Gnome Nextcloud Settings" /></p>
<p>Once you have completed this setup applications such as <em>Gnome To Do</em> and <em>Gnome Calendar</em> will be able to synchronize with your CalDAV server.</p>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://bugzilla.gnome.org/show_bug.cgi?id=720519">Gnome Bug Report #720519</a> - Add separate components for CalDAV and CardDAV accounts</li>
<li><a href="https://gist.github.com/apollo13/f4fc8f33a2700dffb9e11c1b056c53ba">OwnCloud/Nextcloud Emulator by Florian Apolloner</a></li>
<li><a href="https://radicale.org">Radicale CalDAV/WebDAV Server</a></li>
</ul>Albert WeichselbraunAlthough Gnome supports CalDAV and CardDAV, it currently only allows configuring them for Nextcloud servers. There is a long-standing Bug Report which describes this issue but hasn’t yet (as of October 2020) been properly addressed.How to resize a LUKS encrypted root partition2020-08-26T00:00:00+02:002020-08-26T00:00:00+02:00https://semanticlab.net/sysadmin/encryption/How-to-resize-a-LUKS-encrypted-root-partition<p>The Ubuntu standard setup for an encrypted root file system is quite complex, as the following output shows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ephiphany~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 1T 0 disk
├─vda1 252:1 0 1M 0 part
├─vda2 252:2 0 1G 0 part /boot
└─vda3 252:3 0 1024G 0 part
└─dm_crypt-3 253:0 0 1024G 0 crypt
└─epiphany-root 253:1 0 1024G 0 lvm /
</code></pre></div></div>
<p>Basically we have a disk (<code class="language-plaintext highlighter-rouge">vda</code>) with the root file system on the <code class="language-plaintext highlighter-rouge">vda3</code> partition, which holds the encrypted LUKS device that is decrypted as <code class="language-plaintext highlighter-rouge">dm_crypt-3</code>. On top of <code class="language-plaintext highlighter-rouge">dm_crypt-3</code> we have a physical LVM volume with volume group <code class="language-plaintext highlighter-rouge">epiphany</code> and the logical volume <code class="language-plaintext highlighter-rouge">root</code>.</p>
<p>Consequently, growing the root filesystem requires:</p>
<ol>
<li>extending the vda3 partition using fdisk (please refer to the following <a href="https://access.redhat.com/articles/1190213">guideline</a> for more information)</li>
<li>resizing the LUKS partition</li>
<li>resizing the physical device,</li>
<li>resizing the logical device, and finally</li>
<li>growing the file system</li>
</ol>
<p>as outlined below:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># resize the LUKS partition (dm_crypt-3)</span>
cryptsetup resize dm_crypt-3
<span class="c"># resize the physical device on top of it</span>
pvresize /dev/mapper/dm_crypt-3
<span class="c"># resize the logical device (epiphany-root)</span>
lvextend <span class="nt">-l</span> +100%FREE /dev/mapper/epiphany-root
<span class="c"># grow the file system accordingly</span>
resize2fs /dev/mapper/epiphany-root
</code></pre></div></div>Albert WeichselbraunThe Ubuntu standard setup for an encrypted root file system is quite complex as the following output shows:Network-bound disk encryption in Ubuntu 20.04 (Focal Fossa) - Booting servers with an encrypted root file system without user interaction.2020-08-26T00:00:00+02:002020-08-26T00:00:00+02:00https://semanticlab.net/sysadmin/encryption/Network-bound-disk-encryption-in-ubuntu-20.04<p>Network-bound disk encryption allows unlocking LUKS devices (e.g. the encrypted root file system of an Ubuntu server) without entering the password. Instead a Tang server is queried for a key that can be used in conjunction with a private secret to compute the decryption key. As long as the Tang server is available, the disk can be decrypted without the need to manually enter a password.</p>
<p>Ubuntu 20.04 requires the following components for implementing network-bound disk encryption:</p>
<ol>
<li>the LUKS encrypted device(s) that should be automatically unlocked.</li>
<li>a Tang server that provides the public key required by the client for deriving its LUKS decryption key.</li>
<li>Clevis which provides clients that can use a Tang server for unlocking LUKS partitions.</li>
<li>For unlocking a boot device adjustments to initramfs (automatically provided by the <code class="language-plaintext highlighter-rouge">clevis-initramfs</code> package) are necessary.</li>
</ol>
<h1 id="how-does-network-bound-disk-encryption-work">How does network-bound disk encryption work?</h1>
<p>The figures below outline how network-bound encryption works. In the first step, we use Clevis to bind a LUKS encrypted device to a Tang server, generating a secret JSON Web Key (<code class="language-plaintext highlighter-rouge">cJWK</code>) on the client, which is then combined with the server’s public key (<code class="language-plaintext highlighter-rouge">sJWK*</code>) to generate the key (<code class="language-plaintext highlighter-rouge">dJWK</code>) that is then added to the LUKS device as a decryption key.</p>
<p><img src="/assets/images/2020/clevis-bind-to-tang-server.svg" alt="Bind the LUKS device to the Tang server" /></p>
<p>Once the device has been bound to the Tang server, it can compute its decryption key with the server’s help. The client first generates an ephemeral key (<code class="language-plaintext highlighter-rouge">eJWK</code>) that is then combined with its secret (<code class="language-plaintext highlighter-rouge">cJWK</code>) to generate a message (<code class="language-plaintext highlighter-rouge">xJWK</code>) that is sent to the server. The server combines <code class="language-plaintext highlighter-rouge">xJWK</code> with its private key <code class="language-plaintext highlighter-rouge">sJWK</code> to generate the response <code class="language-plaintext highlighter-rouge">yJWK</code>. Clevis then combines <code class="language-plaintext highlighter-rouge">yJWK</code> with the server’s public key <code class="language-plaintext highlighter-rouge">sJWK*</code> and <code class="language-plaintext highlighter-rouge">eJWK</code> to recover the decryption key <code class="language-plaintext highlighter-rouge">dJWK</code>.</p>
<p><img src="/assets/images/2020/clevis-recover-encryption-key.svg" alt="Recover the decryption key with the help of the Tang server" /></p>
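<p>To make the figures above concrete, the exchange can be sketched with a toy implementation that substitutes exponentiation in a small prime group for the elliptic-curve operations Tang and Clevis actually use. The variable names mirror the figures; everything else is a deliberate simplification:</p>

```python
import secrets

# toy multiplicative group; Tang/Clevis use elliptic-curve keys,
# but the blinding algebra is the same
P = 2**127 - 1                          # modulus (a Mersenne prime)
G = 3                                   # generator of the toy group

# --- binding (clevis luks bind) ---
sJWK = secrets.randbelow(P - 2) + 1     # Tang's private key
sJWK_pub = pow(G, sJWK, P)              # sJWK*, advertised by Tang
cJWK = secrets.randbelow(P - 2) + 1     # client secret created at bind time
dJWK = pow(sJWK_pub, cJWK, P)           # key added to the LUKS header

# --- recovery (e.g. at boot) ---
eJWK = secrets.randbelow(P - 2) + 1     # fresh ephemeral secret
xJWK = (pow(G, cJWK, P) * pow(G, eJWK, P)) % P   # blinded message to Tang
yJWK = pow(xJWK, sJWK, P)               # Tang's answer: xJWK^sJWK
# unblind: divide by (sJWK*)^eJWK, leaving g^(sJWK * cJWK) = dJWK
recovered = (yJWK * pow(pow(sJWK_pub, eJWK, P), -1, P)) % P
assert recovered == dJWK
```

<p>The key property is visible in the last lines: the ephemeral blinding hides <code class="language-plaintext highlighter-rouge">cJWK</code> from the server, so Tang can help recover <code class="language-plaintext highlighter-rouge">dJWK</code> without ever learning it.</p>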
<h1 id="setup">Setup</h1>
<p>Ubuntu 20.04 provides packages for Tang and Clevis, which makes installing them straightforward.</p>
<h2 id="setup-and-start-the-tang-server">Setup and start the Tang server</h2>
<p>Install Tang and José (an implementation of the JavaScript Object Signing and Encryption standards used by Tang) on the Tang server.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt <span class="nb">install </span>tang jose
systemctl <span class="nb">enable </span>tangd.socket
systemctl start tangd.socket
</code></pre></div></div>
<p>If you install Tang on Ubuntu 18.04, you need to manually generate the Tang keys with <code class="language-plaintext highlighter-rouge">/usr/lib/x86_64-linux-gnu/tangd-keygen /var/db/tang</code> before starting the server.</p>
<p>Execute <code class="language-plaintext highlighter-rouge">tang-show-keys</code> to determine the signing key’s fingerprint.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tang-show-keys
TieDkMgbVKzmXl-uyOfIa0U30lo
</code></pre></div></div>
<h2 id="host-with-the-encrypted-luks-devices">Host with the encrypted LUKS device(s)</h2>
<p>Install Clevis on the host system and then use <code class="language-plaintext highlighter-rouge">clevis luks bind</code> for binding the device to the Tang server. Clevis will ask you to verify the signing key’s fingerprint. Afterwards, Clevis can be used to unlock the device.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install clevis</span>
apt <span class="nb">install </span>clevis clevis-luks
<span class="c"># ensure that the device (e.g. vda1) is encrypted and that the tang server is working</span>
cryptsetup luksDump /dev/vda1 <span class="c"># just to be sure that we encrypt the right disk ;)</span>
curl http://192.168.122.1/adv <span class="c"># verify that the tang server yields a response</span>
<span class="c"># enable clevis tang decryption for the given LUKS device</span>
clevis luks <span class="nb">bind</span> <span class="nt">-d</span> /dev/vda1 tang <span class="s1">'{"url": "http://192.168.122.1"}'</span>
</code></pre></div></div>
<p>Clevis provides plugins for initramfs, dracut, systemd and udisks2 to automate the unlocking process.</p>
<h3 id="automatically-unlocking-a-root-device-with-clevis">Automatically unlocking a root device with Clevis</h3>
<p>Once Clevis support has been enabled for an encrypted root file system, it can be automatically unlocked by installing the corresponding clevis plugin and rebuilding initramfs.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install the necessary clevis plugin</span>
apt <span class="nb">install </span>clevis-initramfs
<span class="c"># reinitialize initramfs to support automatic unlocking of the root device.</span>
update-initramfs <span class="nt">-u</span> <span class="nt">-k</span> <span class="s1">'all'</span>
</code></pre></div></div>
<h2 id="automatic-unlocking-of-non-root-devices-with-clevis">Automatic unlocking of non-root devices with Clevis</h2>
<p>Automatic unlocking of non-root devices via systemd is supported by the <code class="language-plaintext highlighter-rouge">clevis-systemd</code> plugin.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt <span class="nb">install </span>clevis-systemd
</code></pre></div></div>
<p>Afterwards the encrypted non-root devices need to be added to <code class="language-plaintext highlighter-rouge">/etc/crypttab</code> with the <code class="language-plaintext highlighter-rouge">_netdev</code> option. Crypttab entries consist of the following four columns:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">target</code>: the name to be used for the mapped (i.e. decrypted) device</li>
<li><code class="language-plaintext highlighter-rouge">source device</code>: the name of the corresponding encrypted source device</li>
<li><code class="language-plaintext highlighter-rouge">key file</code>: <code class="language-plaintext highlighter-rouge">none</code>, since we do not specify a key</li>
<li><code class="language-plaintext highlighter-rouge">options</code>: the column must be set to <code class="language-plaintext highlighter-rouge">_netdev</code> so that systemd is able to automatically mount the device using the <code class="language-plaintext highlighter-rouge">clevis-systemd</code> plugin.</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>encrypted_home /dev/vdb none _netdev
encrypted_opt /dev/vdc none _netdev
</code></pre></div></div>
<p>Afterwards, the devices can be added to <code class="language-plaintext highlighter-rouge">/etc/fstab</code> for automatic mounting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/dev/mapper/encrypted_home /home xfs defaults,_netdev 0 0
/dev/mapper/encrypted_opt /opt xfs defaults,_netdev 0 0
</code></pre></div></div>
<p>Again it is important to add the <code class="language-plaintext highlighter-rouge">_netdev</code> option to ensure that systemd is able to recognize and automatically mount the encrypted device.</p>
<p class="notice--danger"><strong>Warning:</strong> To the best of my knowledge it is not possible to mount an encrypted <code class="language-plaintext highlighter-rouge">/var</code> partition using this method, since systemd relies on <code class="language-plaintext highlighter-rouge">/var</code> for its networking configuration.</p>
<h1 id="resources">Resources</h1>
<ul>
<li>Github pages
<ul>
<li><a href="https://github.com/latchset/tang">Tang</a></li>
<li><a href="https://github.com/latchset/clevis">Clevis</a></li>
</ul>
</li>
<li><a href="https://www.admin-magazine.com/Archive/2018/43/Automatic-data-encryption-and-decryption-with-Clevis-and-Tang">ADMIN Magazine article on Clevis and Tang</a></li>
<li><a href="https://www.youtube.com/watch?v=Dk6ZuydQt9I">Youtube video by Fraser Tweedale on Clevis and Tang</a></li>
</ul>Albert WeichselbraunNetwork-bound disk encryption allows unlocking LUKS devices (e.g. the encrypted root file system of an Ubuntu server) without entering the password. Instead, a Tang server is queried for a key that can be used in conjunction with a private secret to compute the decryption key. As long as the Tang server is available, the disk can be decrypted without the need to manually enter a password.Record Temperature, Humidity and Pressure with an ESP32, a Bosch BME280 sensor and InfluxDB2020-02-02T00:00:00+01:002020-02-02T00:00:00+01:00https://semanticlab.net/linux/iot/esp32/bme280/sensor/influxdb/Record-Temperature-Humidity-Pressure-Monitoring-with-an-ESP32-a-BME280-and-InfluxDB<h2 id="install-influxdb-and-grafana-at-your-server-and-create-a-database">Install InfluxDB and Grafana at your server and create a database</h2>
<ol>
<li>
<p>On Debian-based systems the installation of InfluxDB is straightforward:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt <span class="nb">install </span>influxdb influxdb-client
</code></pre></div> </div>
<p>Afterwards a new database (with name <code class="language-plaintext highlighter-rouge">sensors</code>) can be set up by connecting to InfluxDB with the <code class="language-plaintext highlighter-rouge">influx</code> command and then running <code class="language-plaintext highlighter-rouge">CREATE DATABASE sensors</code>.</p>
</li>
<li>
<p>Grafana should be installed based on the instructions on the <a href="https://grafana.com/docs/grafana/latest/installation/debian/">Grafana Web Site</a>.</p>
</li>
</ol>
<h3 id="connect-the-bosh-bme280-sensor-to-the-esp32">Connect the Bosch BME280 sensor to the ESP32</h3>
<ol>
<li>I prefer using an RJ45 cable for the connection with the following pin layout which minimizes interference:
<ul>
<li>VIN: White-Orange</li>
<li>GND: Brown</li>
<li>SCL: White-Brown</li>
<li>SDA: Orange</li>
</ul>
</li>
<li>Optional: Change the BME280’s I2C bus address: If you plan to use two sensors at once, you need to ensure that they have different I2C bus addresses.
<ul>
<li>The default bus address is 0x76.</li>
<li>If your breakout board has an SDO pin, you can change the bus address to 0x77 by connecting the SDO pin to GND.
<img src="/assets/images/sensors/BME280.jpg" alt="BME280 breakout board with SDO pin" /></li>
</ul>
</li>
<li>
<p>The sensor is then connected to the ESP32 as outlined in the picture below:</p>
<p><img src="/assets/images/sensors/Wiring-ESP32-BME280.png" alt="Connecting a BME280 to an ESP32 (Source: Last Minute Engineers)" /></p>
</li>
</ol>
<h2 id="upload-humidity-probe-influxdb-to-the-esp32">Upload humidity-probe-influxdb to the ESP32</h2>
<p>Download the <a href="https://github.com/AlbertWeichselbraun/humidity-probe-influxdb">humidity-probe-influxdb project</a> from GitHub and update <code class="language-plaintext highlighter-rouge">custom.h-example</code> to reflect your WiFi setup and InfluxDB URL. Once you transfer the sketch to your ESP32, it will</p>
<ul>
<li>read temperature, humidity and pressure and</li>
<li>transfer the measurements to your InfluxDB server once ten measurements have been collected
(the first transfer will therefore start after approximately ten minutes).</li>
</ul>
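<p>Conceptually, each batch ends up as a set of InfluxDB line-protocol records in the <code class="language-plaintext highlighter-rouge">sensors</code> database. The sketch below illustrates the encoding; the measurement, tag and field names are assumptions for illustration and not necessarily the ones used by humidity-probe-influxdb:</p>

```python
def to_line_protocol(sensor: str, temperature: float, humidity: float,
                     pressure: float, timestamp_ns: int) -> str:
    """Encode one measurement in InfluxDB 1.x line protocol.

    Hypothetical identifiers: 'climate' (measurement), 'sensor' (tag)
    and the three field names are placeholders for illustration.
    """
    return (f"climate,sensor={sensor} "
            f"temperature={temperature},humidity={humidity},pressure={pressure} "
            f"{timestamp_ns}")
```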
<h2 id="log-into-grafana-and-configure-your-dashboards">Log into Grafana and configure your dashboards</h2>
<p>By default, Grafana listens on port 3000 and should be available at <code class="language-plaintext highlighter-rouge">http://your-grafana-server-ip:3000</code>. You can log into Grafana to set up queries, graphs and dashboards as illustrated in the example below.</p>
<p><img src="/assets/images/sensors/Grafana-Screenshot.png" alt="Example Grafana Screenshot" /></p>
<h3 id="references">References</h3>
<ol>
<li><a href="https://lastminuteengineers.com/bme280-esp32-weather-station/">Creating A Simple ESP32 Weather Station With BME280</a></li>
<li><a href="https://grafana.com/docs/grafana/latest/installation/debian/">Installing Grafana on Debian or Ubuntu</a></li>
</ol>Albert WeichselbraunInstall InfluxDB and Grafana at your server and create a databaseOptimizing Apache Storm Topologies2018-07-14T00:00:00+02:002018-07-14T00:00:00+02:00https://semanticlab.net/linux/storm/java/Optimizing-Storm-Deployments<p>This article summarizes hints for optimizing and deploying Apache Storm topologies.</p>
<h3 id="setup-your-storm-cluster">Setup your storm cluster</h3>
<ol>
<li>I/O is zookeeper’s main bottleneck - ensure that the <code class="language-plaintext highlighter-rouge">/data</code> partition of the zookeeper machines resides on fast storage (ramdisk ;)</li>
<li>Determine the number of parallelism units using the following rule of thumb:
<ul>
<li><em>number of available CPU cores</em> on all machines minus one core per machine that is used for the <em>Acker</em></li>
<li>Example: 2 machines with 48 and 1 machine with 32 cores; parallelism units = 2x(48-1) + (32-1) = 125</li>
</ul>
</li>
<li>Using multiple workers per machine allows deploying multiple topologies at once (the number of workers is determined by the number of ports configured in the <code class="language-plaintext highlighter-rouge">supervisor.slots.ports</code> setting in <code class="language-plaintext highlighter-rouge">storm.yaml</code>)</li>
</ol>
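<p>The rule of thumb from step 2 translates directly into code; using the example cluster from above:</p>

```python
def parallelism_units(cores_per_machine):
    """Rule of thumb: total cores minus one Acker core per machine."""
    return sum(cores - 1 for cores in cores_per_machine)

# two machines with 48 cores and one machine with 32 cores
units = parallelism_units([48, 48, 32])   # 2x(48-1) + (32-1) = 125
```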
<h3 id="topology-configuration-suggestions">Topology configuration suggestions</h3>
<ol>
<li>Use one worker per machine and topology (intra-worker transports are more efficient)</li>
<li>The number of executors depends on whether your bolt is I/O or CPU bound
<ul>
<li>CPU bound: configure one executor per available parallelism unit</li>
<li>I/O bound: use 10-100 executors per parallelism unit, depending on the expected I/O delay</li>
</ul>
</li>
<li>The total number of parallelism units in your topology should equal the number of available parallelism units</li>
</ol>
<h3 id="profiling-the-topology">Profiling the topology</h3>
<ol>
<li>Storm UI: use the capacity metric to identify bolts which require a higher parallelism</li>
<li>your <code class="language-plaintext highlighter-rouge">nextTuple</code> and <code class="language-plaintext highlighter-rouge">execute</code> methods determine the spout’s/bolt’s runtime - optimize these methods</li>
<li>use queues for I/O in spouts or terminal bolts (i.e. write final results to a queue and use a writer thread that performs batch inserts to serialize the queue to disk)</li>
</ol>
<h3 id="glossary">Glossary</h3>
<ul>
<li>worker process - responsible for executing the topology on a particular machine</li>
<li>executor - thread spawned by the worker for a particular component (bolt or spout); the number of executors is configured by setting the <code class="language-plaintext highlighter-rouge">parallelism hint</code> parameter in the <code class="language-plaintext highlighter-rouge">setSpout</code> or <code class="language-plaintext highlighter-rouge">setBolt</code> method.</li>
<li>task - number of instances of a particular bolt/spout to deploy; configuring more than one task using <code class="language-plaintext highlighter-rouge">setNumTasks(n)</code> allows to later increase the number of executors for that particular spout/bolt without redeploying the topology.</li>
</ul>
<h3 id="references">References</h3>
<ol>
<li><a href="https://www.slideshare.net/ptgoetz/scaling-apache-storm-strata-hadoopworld-2014?qid=19b9de2b-175b-415e-94c8-7a537d8c2a9a&v=qf1&b=&from_search=2">Scaling Apache Storm</a></li>
<li><a href="http://storm.apache.org/releases/1.2.2/Understanding-the-parallelism-of-a-Storm-topology.html">Understanding the Parallelism of an Apache Storm Topology</a></li>
<li><a href="https://stackoverflow.com/questions/17257448/what-is-the-task-in-storm-parallelism">Stack Overflow - What is a “Task” in Storm Parallelism</a></li>
<li><a href="https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_storm-component-guide/content/storm-parallelism.html">Hortonworks - Storm Parallelism</a></li>
</ol>Albert WeichselbraunThis article summarizes hints for optimizing and deploying Apache Storm topologies.Headless Seafile server on a Raspberry Pi 2 with dynamic DNS2018-02-17T00:00:00+01:002018-02-17T00:00:00+01:00https://semanticlab.net/linux/seafile/raspberry%20pi/Raspberry-Pi-Home-Server-Configuration<p>The Raspberry Pi is operated from at home keeping noise and power consumption in mind.</p>
<h3 id="install-raspbian-on-pi">Install Raspbian on Pi</h3>
<ol>
<li>Download and install <a href="https://www.raspberrypi.org/downloads/raspbian/">Raspbian</a> on the SD card. Before rebooting the device, mount the <code class="language-plaintext highlighter-rouge">boot</code> partition and create an empty file named <code class="language-plaintext highlighter-rouge">ssh</code> on the partition.</li>
<li>Put the SD card into the Raspberry Pi, boot the system and determine its IP address with
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nmap <span class="nt">-sn</span> 192.168.1.0/24
</code></pre></div> </div>
</li>
<li>Log into the Pi (<code class="language-plaintext highlighter-rouge">user</code>: <code class="language-plaintext highlighter-rouge">pi</code>, <code class="language-plaintext highlighter-rouge">password</code>: <code class="language-plaintext highlighter-rouge">raspberry</code>) and run <code class="language-plaintext highlighter-rouge">sudo raspi-config</code> to
<ul>
<li>Change the login password</li>
<li>Maximize the rootfs with <code class="language-plaintext highlighter-rouge">expand_rootfs</code>.</li>
</ul>
</li>
</ol>
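<p>The headless SSH trick from step 1 boils down to a single command; the mount point below is an assumption and depends on how your distribution auto-mounts the SD card.</p>

```shell
# Enable SSH on first boot: the presence of an (empty) file named "ssh"
# on the boot partition tells Raspbian to start the SSH daemon.
touch /media/$USER/boot/ssh
```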
<h3 id="change-the-root-file-system-to-f2fs">Change the root file system to F2FS</h3>
<ul>
<li>mount the SD card and copy the content of the root filesystem to a temporary directory</li>
<li>unmount the rootfs file system and format its partition (e.g. <code class="language-plaintext highlighter-rouge">/dev/mmcblk0p2</code>)</li>
<li>restore the root partitions content</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nb">mkdir</span> /tmp/rpi
<span class="nb">cp</span> <span class="nt">-a</span> /media/<span class="o">{</span>user<span class="o">}</span>/root_fs /tmp/rpi
umount /media/<span class="o">{</span>user<span class="o">}</span>/root_fs
mkfs.f2fs /dev/mmcblk0p2
mount /dev/mmcblk0p2 /mnt
<span class="nb">cp</span> <span class="nt">-a</span> /tmp/rpi/root_fs/. /mnt/
</code></pre></div></div>
<ul>
<li>adapt <code class="language-plaintext highlighter-rouge">cmdline.txt</code> and <code class="language-plaintext highlighter-rouge">fstab</code> to reflect the changed file system type:
<ul>
<li><code class="language-plaintext highlighter-rouge">/media/{user}/boot_fs/cmdline.txt</code>: change <code class="language-plaintext highlighter-rouge">rootfstype=ext4</code> to <code class="language-plaintext highlighter-rouge">rootfstype=f2fs</code></li>
<li><code class="language-plaintext highlighter-rouge">/mnt/etc/fstab</code>: change the file system type for <code class="language-plaintext highlighter-rouge">/dev/mmcblk0p2</code> to <code class="language-plaintext highlighter-rouge">f2fs</code> and add the <code class="language-plaintext highlighter-rouge">discard</code> option
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/dev/mmcblk0p1 /boot vfat defaults 0 2
/dev/mmcblk0p2 / f2fs defaults,noatime,discard 0 1
</code></pre></div> </div>
</li>
</ul>
</li>
</ul>
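<p>After rebooting, it is worth verifying that the root file system really runs on F2FS:</p>

```shell
# Print the file system type of the root mount; the output should be "f2fs".
findmnt -n -o FSTYPE /
```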
<h3 id="remove-unnecessary-components-and-reduce-power-consumption">Remove unnecessary components and reduce power consumption</h3>
<ol>
<li>remove unnecessary services
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get remove <span class="nt">--purge</span> avahi-daemon triggerhappy
</code></pre></div> </div>
</li>
<li>disable HDMI (-25 mA) and LEDs (-5 mA per LED) by adding the following commands to <code class="language-plaintext highlighter-rouge">/etc/rc.local</code>:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># disable hdmi (25 mA)
/usr/bin/tvservice -o
# disable leds (5 mA per LED)
echo 0 |tee /sys/class/leds/led0/brightness
echo 0 |tee /sys/class/leds/led1/brightness
</code></pre></div> </div>
</li>
</ol>
<h3 id="dynamic-dns-with-dynucom">Dynamic DNS with dynu.com</h3>
<p><a href="https://www.dynu.com/">Dynu.com</a> offers a dynamic DNS service which lets you (optionally) use your own domain name for dynamic DNS. The following steps refer to this case.</p>
<ul>
<li>set the <code class="language-plaintext highlighter-rouge">NS</code> entries for the chosen name to the dynu name servers:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myname.semanticlab.net 3600 IN NS ns1.dynu.com.
myname.semanticlab.net 3600 IN NS ns2.dynu.com.
myname.semanticlab.net 3600 IN NS ns3.dynu.com.
myname.semanticlab.net 3600 IN NS ns4.dynu.com.
myname.semanticlab.net 3600 IN NS ns5.dynu.com.
myname.semanticlab.net 3600 IN NS ns6.dynu.com.
</code></pre></div> </div>
</li>
<li>set up a dynu account and configure it for dynamic DNS.</li>
<li>use the dynu dynamic DNS client or configure your router to update the dynamic DNS record when required.</li>
</ul>
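<p>Once the NS records have propagated, the delegation and the current dynamic A record can be verified with <code>dig</code>; the host name is the example name used above.</p>

```shell
# Check the NS delegation and query the dynamic A record directly
# from one of the dynu name servers.
dig +short NS myname.semanticlab.net
dig +short A myname.semanticlab.net @ns1.dynu.com
```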
<h3 id="install-seafile-and-nginx">Install Seafile and nginx</h3>
<ul>
<li>Prerequisites: install the necessary dependencies for running seafile and nginx
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get <span class="nb">install</span> <span class="nt">-y</span> nginx mysql-server python-requests python-mysqldb python-pil
</code></pre></div> </div>
</li>
<li>Download the <a href="https://www.seafile.com/en/download/">Seafile server for Raspberry Pi</a> and follow the provided install instructions.</li>
<li>Optional: to enable WebDAV with nginx, change <code class="language-plaintext highlighter-rouge">./seafile/conf/seafdav.conf</code> to
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[WEBDAV]
enabled = true
port = 8080
fastcgi = false
share_name = /seafdav
</code></pre></div> </div>
<p>and add the following section to your nginx configuration</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># webdav
location /seafdav {
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Host $server_name;
client_max_body_size 0;
proxy_connect_timeout 36000s;
proxy_read_timeout 36000s;
proxy_send_timeout 36000s;
send_timeout 36000s;
proxy_request_buffering off;
}
</code></pre></div> </div>
</li>
</ul>
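<p>A quick way to check the WebDAV endpoint is a directory listing with <code>curl</code>; the user name and host below are placeholders.</p>

```shell
# List the top-level WebDAV collection; an HTTP 207 multi-status response
# enumerating your libraries indicates that seafdav works behind nginx.
curl -u 'user@example.com' -X PROPFIND -H 'Depth: 1' \
     https://myname.semanticlab.net/seafdav/
```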
<h3 id="port-forwarding-and-split-dns">Port forwarding and split DNS</h3>
<ul>
<li>Log into the configuration interface of your router and
<ol>
<li>setup a fixed IP address for your Raspberry Pi</li>
<li>enable port forwarding to forward the following ports to the Raspberry Pi:
<ul>
<li>external 80 to Raspberry 80 (http)</li>
<li>external 443 to Raspberry 443 (https)</li>
</ul>
</li>
<li>optional: if you can access the Raspberry’s web service from the Internet but not from within your network, your router does not support NAT loopback. In this case we need to set up split DNS to ensure that the Raspberry is reachable under the same DNS name from inside your network as well.
<ul>
<li>install unbound with <code class="language-plaintext highlighter-rouge">apt-get install unbound</code></li>
<li>add the following changes to <code class="language-plaintext highlighter-rouge">/etc/unbound/unbound.conf</code> to enable network-wide access to the name server as well as split dns:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># network wide access
interface: 0.0.0.0
# overwrite dns responses
local-zone: myname.semanticlab.net transparent
local-data: "myname.semanticlab.net A {your-pi-ip}"
</code></pre></div> </div>
</li>
<li>restart unbound with <code class="language-plaintext highlighter-rouge">service unbound restart</code></li>
<li>change the DNS server on your router to the IP address of your Pi
</li>
</ul>
</li>
</ol>
</li>
</ul>
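<p>Split DNS can be verified by comparing the answer of the local resolver with that of a public one; internally the name should resolve to the Pi's private address.</p>

```shell
# Internal view (unbound on the Pi) vs. external view (public resolver):
dig +short myname.semanticlab.net @{your-pi-ip}
dig +short myname.semanticlab.net @9.9.9.9
```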
<h3 id="https-with-letsencryt">HTTPS with Let’s Encrypt</h3>
<p>Install certbot with <code class="language-plaintext highlighter-rouge">apt-get install python-certbot-nginx</code> and then follow the instructions on the <a href="https://certbot.eff.org/">EFF Certbot page</a>.</p>
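<p>In the simplest case, obtaining and installing a certificate for the dynamic DNS name is a one-liner; certbot also configures automatic renewal, which can be checked with a dry run. The domain below is the example name from above.</p>

```shell
# Obtain a certificate and let certbot adapt the nginx configuration.
certbot --nginx -d myname.semanticlab.net
# Verify that automatic renewal will work.
certbot renew --dry-run
```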
<h3 id="references">References</h3>
<ol>
<li><a href="https://hackernoon.com/raspberry-pi-headless-install-462ccabd75d0">Headless Raspberry Pi Setup</a></li>
<li><a href="http://whitehorseplanet.org/gate/topics/documentation/public/howto_ext4_to_f2fs_root_partition_raspi.html">Howto: Replace the micro SD card’s ext4 partition with f2fs</a></li>
</ol>Albert WeichselbraunThe Raspberry Pi is operated from at home keeping noise and power consumption in mind.Deploying third-party artifacts to a local repository with WebDAV2018-02-13T00:00:00+01:002018-02-13T00:00:00+01:00https://semanticlab.net/maven/java/Maven-upload-third-party-resources-to-repository<p>This guide outlines how to deploy third party jars to a local repository over WebDAV. Using WebDAV requires
(i) setting up the login data of the WebDAV repository and
(ii) providing a <strong>current</strong> WebDAV wagon extension to Maven.</p>
<ol>
<li>Configure the WebDAV repository in <code class="language-plaintext highlighter-rouge">~/.m2/settings.xml</code>.
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><settings</span> <span class="na">xmlns=</span><span class="s">"http://maven.apache.org/SETTINGS/1.0.0"</span>
<span class="na">xmlns:xsi=</span><span class="s">"http://www.w3.org/2001/XMLSchema-instance"</span>
<span class="na">xsi:schemaLocation=</span><span class="s">"http://maven.apache.org/SETTINGS/1.0.0
http://maven.apache.org/xsd/settings-1.0.0.xsd"</span><span class="nt">></span>
<span class="nt"><servers></span>
<span class="nt"><server></span>
<span class="nt"><id></span>mywebdavserver<span class="nt"></id></span>
<span class="nt"><username></span>user<span class="nt"></username></span>
<span class="nt"><password></span>***<span class="nt"></password></span>
<span class="nt"></server></span>
<span class="nt"></servers></span>
<span class="nt"></settings></span>
</code></pre></div> </div>
</li>
<li>Create a <em>dummy</em> pom file which provides Maven with information on the required WebDAV wagon:
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><project></span>
<span class="nt"><modelVersion></span>4.0.0<span class="nt"></modelVersion></span>
<span class="nt"><groupId></span>com.example<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>webdav-deploy<span class="nt"></artifactId></span>
<span class="nt"><packaging></span>pom<span class="nt"></packaging></span>
<span class="nt"><version></span>1<span class="nt"></version></span>
<span class="nt"><name></span>Webdav Deploy<span class="nt"></name></span>
<span class="nt"><build></span>
<span class="nt"><extensions></span>
<span class="nt"><extension></span>
<span class="nt"><groupId></span>org.apache.maven.wagon<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>wagon-webdav-jackrabbit<span class="nt"></artifactId></span>
<span class="nt"><version></span>3.0.0<span class="nt"></version></span>
<span class="nt"></extension></span>
<span class="nt"></extensions></span>
<span class="nt"></build></span>
<span class="nt"></project></span>
</code></pre></div> </div>
</li>
<li>Upload the artifact to the repository with <code class="language-plaintext highlighter-rouge">mvn</code>:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn deploy:deploy-file <span class="nt">-Dfile</span><span class="o">=</span><path-to-file> <span class="se">\</span>
<span class="nt">-DgroupId</span><span class="o">=</span><group-id> <span class="se">\</span>
<span class="nt">-DartifactId</span><span class="o">=</span><artifact-id> <span class="se">\</span>
<span class="nt">-Dversion</span><span class="o">=</span><version> <span class="se">\</span>
<span class="nt">-Dpackaging</span><span class="o">=</span><packaging> <span class="se">\</span>
<span class="nt">-DrepositoryId</span><span class="o">=</span>mywebdavserver <span class="se">\</span>
<span class="nt">-Durl</span><span class="o">=</span>dav:<url-to-the-webdav-server>
</code></pre></div> </div>
<p><strong>Example:</strong> deploy the latest libsvm version to our local repository.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn deploy:deploy-file<span class="se">\</span>
<span class="nt">-Dfile</span><span class="o">=</span>libsvm.jar <span class="se">\</span>
<span class="nt">-DgroupId</span><span class="o">=</span>tw.edu.ntu.csie <span class="se">\</span>
<span class="nt">-DartifactId</span><span class="o">=</span>libsvm <span class="se">\</span>
<span class="nt">-Dversion</span><span class="o">=</span>3.22 <span class="se">\</span>
<span class="nt">-Dpackaging</span><span class="o">=</span>jar <span class="se">\</span>
<span class="nt">-DrepositoryId</span><span class="o">=</span>mywebdavserver <span class="se">\</span>
<span class="nt">-Durl</span><span class="o">=</span>dav:http://semanticlab.net/deploy/
</code></pre></div> </div>
</li>
</ol>
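<p>Whether the upload worked can be checked by resolving the freshly deployed artifact from the repository; the coordinates and repository URL below mirror the libsvm example above.</p>

```shell
# Fetch the deployed artifact into the local Maven repository.
mvn dependency:get -Dartifact=tw.edu.ntu.csie:libsvm:3.22 \
    -DremoteRepositories=http://semanticlab.net/deploy/
```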
<h3 id="literature">Literature</h3>
<ul>
<li><a href="https://maven.apache.org/guides/mini/guide-3rd-party-jars-remote.html">Guide to deploying 3rd party JARs to remote repository</a></li>
<li><a href="https://www.chrissearle.org/2008/02/10/Deploying_jars_to_third_party_maven_repository_via_WebDAV/">Deploying jars to third party maven repository via WebDAV</a></li>
</ul>Albert WeichselbraunThis guide outlines how to deploy third party jars to a local repository over WebDAV. Using WebDAV requires (i) setting up the login data of the WebDAV repository and (ii) providing a current Webdav wagon extension to maven.