<h1>Extracting text (and annotations) from HTML with Python</h1>
<h2 id="approaches">Approaches</h2>
<p>Python offers a number of options for extracting text from HTML documents.</p>
<p>Specialized Python libraries such as <a href="https://github.com/weblyzard/inscriptis">Inscriptis</a> and <a href="https://pypi.org/project/html2text/">HTML2Text</a> provide good conversion quality and speed, although you might prefer to settle for <a href="https://lxml.de/">lxml</a> or <a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a>, particularly if you already use these libraries in your program.</p>
<h3 id="libraries">Libraries</h3>
<p>The snippets below demonstrate the code required for converting HTML to text with Inscriptis, HTML2Text, BeautifulSoup and lxml:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># inscriptis
</span><span class="kn">from</span> <span class="nn">inscripits</span> <span class="kn">import</span> <span class="n">get_text</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">get_text</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="c1"># html2text
</span><span class="kn">from</span> <span class="nn">html2text</span> <span class="kn">import</span> <span class="n">HTML2Text</span>
<span class="n">h</span> <span class="o">=</span> <span class="n">HTML2Text</span><span class="p">()</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">h</span><span class="p">.</span><span class="n">handle</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="c1"># beautifulsoup
</span><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">get_text</span><span class="p">()</span>
<span class="c1"># lxml
</span><span class="kn">import</span> <span class="nn">lxml.html</span> <span class="kn">import</span> <span class="nn">fromstring</span>
<span class="kn">from</span> <span class="nn">lxml.html.clean</span> <span class="kn">import</span> <span class="n">clean_html</span>
<span class="n">doc</span> <span class="o">=</span> <span class="n">fromstring</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">clean_html</span><span class="p">(</span><span class="n">doc</span><span class="p">).</span><span class="n">text_content</span><span class="p">()</span>
</code></pre></div></div>
<h3 id="console-based-web-browsers">Console-based web browsers</h3>
<p>Another popular option is calling a console-based web browser such as lynx or w3m to perform the conversion, although this approach requires these programs to be installed on the user’s system.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">subprocess</span>
<span class="c1"># call lynx to perform the conversion
</span><span class="n">text</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">check_output</span><span class="p">([</span><span class="s">'lynx'</span><span class="p">,</span> <span class="s">'-dump'</span><span class="p">,</span> <span class="n">url</span><span class="p">])</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf8'</span><span class="p">)</span>
<span class="c1"># use w3m instead
</span><span class="n">text</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">check_output</span><span class="p">([</span><span class="s">'w3m'</span><span class="p">,</span> <span class="s">'-dump'</span><span class="p">,</span> <span class="n">url</span><span class="p">])</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf8'</span><span class="p">)</span>
</code></pre></div></div>
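<p>Since these binaries may not be available on every system, it can help to check for them before shelling out. The sketch below (the <code class="language-plaintext highlighter-rouge">browser_dump</code> helper and its fallback order are my own naming, not part of any library) picks the first console browser found on the <code class="language-plaintext highlighter-rouge">PATH</code>:</p>

```python
import shutil
import subprocess


def browser_dump(url, browsers=('lynx', 'w3m')):
    """Convert a page to text with the first console browser found on PATH."""
    for name in browsers:
        if shutil.which(name):  # only invoke binaries that actually exist
            return subprocess.check_output([name, '-dump', url]).decode('utf8')
    raise RuntimeError('No console browser found; please install lynx or w3m.')
```

<p>Calling <code class="language-plaintext highlighter-rouge">browser_dump('https://www.example.org')</code> then returns the text dump produced by whichever browser is available, and fails with a clear error message otherwise.</p>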
<h2 id="choosing-the-best-approach-for-you">Choosing the best approach for you.</h2>
<p>There are some criteria you should consider when selecting a conversion approach:</p>
<ul>
<li>how complex is the HTML to parse, and what requirements do you have with respect to conversion quality?</li>
<li>are you interested in the complete page, or only in portions of the content (e.g., the article text, forum posts, or tables)?</li>
<li>would semantics and/or the structure of the HTML file provide valuable information for your problem (e.g., emphasized text for the automatic generation of text summaries)?</li>
</ul>
<h3 id="conversion-quality">Conversion quality</h3>
<p>Conversion quality becomes a factor once you need to move beyond simple HTML snippets.
Non-specialized approaches do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.</p>
<p>BeautifulSoup and lxml, for example, convert the following HTML enumeration to the string <code class="language-plaintext highlighter-rouge">firstsecond</code>.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><ul></span>
<span class="nt"><li></span>first<span class="nt"></li></span>
<span class="nt"><li></span>second<span class="nt"></li></span>
<span class="nt"><ul></span>
</code></pre></div></div>
<p>HTML2Text, Inscriptis and the console-based browsers, in contrast, return the correct output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> * first
* second
</code></pre></div></div>
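<p>The tag-agnostic behaviour behind the <code class="language-plaintext highlighter-rouge">firstsecond</code> result can be reproduced with nothing but Python's built-in <code class="language-plaintext highlighter-rouge">html.parser</code>: a handler that merely collects text nodes (the <code class="language-plaintext highlighter-rouge">NaiveTextExtractor</code> class below is a hypothetical name used for illustration) concatenates the list items in exactly the same way:</p>

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects text nodes without interpreting any HTML semantics."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        # every text node is appended verbatim - list semantics are ignored
        self.fragments.append(data)


extractor = NaiveTextExtractor()
extractor.feed('<ul><li>first</li><li>second</li></ul>')
print(''.join(extractor.fragments))  # firstsecond
```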
<p>But even specialized libraries might provide inaccurate conversions at some point. HTML2Text, for example, does pretty well in interpreting HTML but fails once the HTML document becomes too complex. More complicated HTML tables, such as those commonly used on Wikipedia, yield text representations that no longer reflect the correct spatial relations between text snippets, as the example below shows:</p>
<figcaption>Wikipedia snippet converted with Inscriptis. Please note that Inscriptis only wraps input lines if this is required by the HTML document's semantics.</figcaption>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chur has an oceanic climate in spite of its inland position. Summers are warm and sometimes hot, normally averaging around 25 °C (77 °F) during the day, whilst winter means are around freezing, with daytime temperatures being about 5 °C (41 °F). Between 1981 and 2010 Chur had an average of 104.6 days of rain per year and on average received 849 mm (33.4 in) of precipitation. The wettest month was August during which time Chur received an average of 112 mm (4.4 in) of precipitation. During this month there was precipitation for an average of 11.2 days. The driest month of the year was February with an average of 47 mm (1.9 in) of precipitation over 6.6 days.[19]
Climate data for Chur (1981-2010)
Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Year
Average high °C (°F) 4.8 6.4 11.2 15.1 20.0 22.7 24.9 24.1 20.0 16.1 9.5 5.3 15.0
(40.6) (43.5) (52.2) (59.2) (68.0) (72.9) (76.8) (75.4) (68.0) (61.0) (49.1) (41.5) (59.0)
Daily mean °C (°F) 0.7 1.8 5.9 9.7 14.3 17.1 19.1 18.5 14.8 10.8 5.2 1.7 10.0
(33.3) (35.2) (42.6) (49.5) (57.7) (62.8) (66.4) (65.3) (58.6) (51.4) (41.4) (35.1) (50.0)
Average low °C (°F) −2.6 −2.0 1.6 4.6 8.9 11.8 13.8 13.7 10.3 6.6 1.7 −1.4 5.6
(27.3) (28.4) (34.9) (40.3) (48.0) (53.2) (56.8) (56.7) (50.5) (43.9) (35.1) (29.5) (42.1)
Average precipitation mm (inches) 51 47 55 49 71 93 109 112 81 56 70 55 849
(2.0) (1.9) (2.2) (1.9) (2.8) (3.7) (4.3) (4.4) (3.2) (2.2) (2.8) (2.2) (33.4)
Average snowfall cm (inches) 34.0 24.7 10.3 1.5 0.4 0.0 0.0 0.0 0.1 0.1 10.0 20.6 101.7
(13.4) (9.7) (4.1) (0.6) (0.2) (0.0) (0.0) (0.0) (0.0) (0.0) (3.9) (8.1) (40.0)
Average precipitation days (≥ 1.0 mm) 7.3 6.6 8.1 7.5 9.9 11.2 11.0 11.2 8.4 7.0 8.5 7.9 104.6
Average snowy days (≥ 1.0 cm) 4.8 3.9 2.5 0.4 0.1 0.0 0.0 0.0 0.0 0.0 1.6 4.1 17.4
Average relative humidity (%) 73 70 65 63 64 67 68 71 73 73 74 75 70
Mean monthly sunshine hours 97 112 139 147 169 177 203 185 155 135 93 81 1,692
Source: MeteoSwiss[19]
</code></pre></div></div>
<p>The same snippet converted with HTML2Text using the default settings:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chur has an [oceanic climate](/wiki/Oceanic_climate "Oceanic climate") in
spite of its inland position. Summers are warm and sometimes hot, normally
averaging around 25 °C (77 °F) during the day, whilst winter means are around
freezing, with daytime temperatures being about 5 °C (41 °F). Between 1981 and
2010 Chur had an average of 104.6 days of rain per year and on average
received 849 mm (33.4 in) of
[precipitation](/wiki/Precipitation_\(meteorology\) "Precipitation
\(meteorology\)").
The wettest month was August during which time Chur
received an average of 112 mm (4.4 in) of precipitation. During this month
there was precipitation for an average of 11.2 days. The driest month of the
year was February with an average of 47 mm (1.9 in) of precipitation over 6.6
days.[19]
Climate data for Chur (1981-2010)
---
Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct |
Nov | Dec | Year
Average high °C (°F) | 4.8
(40.6) | 6.4
(43.5) | 11.2
(52.2) | 15.1
(59.2) | 20.0
(68.0) | 22.7
(72.9) | 24.9
(76.8) | 24.1
(75.4) | 20.0
(68.0) | 16.1
(61.0) | 9.5
(49.1) | 5.3
(41.5) | 15.0
(59.0)
Daily mean °C (°F) | 0.7
(33.3) | 1.8
(35.2) | 5.9
(42.6) | 9.7
(49.5) | 14.3
(57.7) | 17.1
(62.8) | 19.1
(66.4) | 18.5
(65.3) | 14.8
(58.6) | 10.8
(51.4) | 5.2
(41.4) | 1.7
(35.1) | 10.0
(50.0)
</code></pre></div></div>
<p>HTML2Text does not correctly interpret the alignment of the temperature values within the table and, therefore, fails to preserve the spatial positioning of the text elements.</p>
<p>Inscriptis, in contrast, has been optimized towards providing accurate text representations, and even handles cascaded elements (e.g., cascaded tables, itemizations within tables, etc.) and a number of CSS attributes that are relevant to the content’s alignment. When it comes to parsing such constructs, it frequently provides even more accurate conversions than the text-based lynx browser.</p>
<p>If you need to interpret <em>really</em> complex Web pages and JavaScript, you might consider using <a href="https://selenium-python.readthedocs.io/">Selenium</a> which allows you to remote-control standard Web Browsers such as Google Chrome and Firefox from Python. Please be aware that this solution has considerable drawbacks in terms of complexity, resource requirements, scalability and stability.</p>
<h3 id="extracting-relevant-content-only">Extracting relevant content only</h3>
<p>The removal of noise elements within Web pages (often referred to as boilerplate) is another common problem. A typical news page, for instance, contains navigation elements, information on related articles, advertisements, etc. that are usually not relevant to knowledge extraction tasks.</p>
<p>For such applications, specialized tools such as jusText, dragnet and boilerpy3 exist that aim at extracting the relevant content only. Adrien Barbaresi has written an excellent <a href="https://adrien.barbaresi.eu/blog/evaluating-text-extraction-python.html">article</a> on this topic which also evaluates some of the most commonly used text extraction approaches. In addition to general content extraction approaches, there are also specialized libraries that handle certain kinds of Web pages. The <a href="https://github.com/fhgr/harvest">Harvest</a> toolkit, for instance, has been optimized towards extracting posts and post metadata from Web forums and outperforms non-specialized approaches for this task.</p>
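<p>A very crude approximation of boilerplate removal — dropping the content of tags that rarely carry article text — can be sketched with the standard library alone. The <code class="language-plaintext highlighter-rouge">BoilerplateStripper</code> class below is my own illustrative example; real tools such as jusText use far more sophisticated, content-aware heuristics:</p>

```python
from html.parser import HTMLParser


class BoilerplateStripper(HTMLParser):
    """Skips text inside tags that rarely carry article content."""

    SKIP = {'script', 'style', 'nav', 'header', 'footer', 'aside'}

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting level of currently skipped subtrees
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:    # only keep text outside skipped subtrees
            self.fragments.append(data)


stripper = BoilerplateStripper()
stripper.feed('<nav>Home | About</nav><p>Article text.</p>'
              '<footer>(c) 2021</footer>')
print(''.join(stripper.fragments))  # Article text.
```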
<h3 id="converting-tables-to-pandas-dataframes">Converting tables to Pandas Dataframes</h3>
<p>If you need to operate on the data within HTML tables, you might consider Pandas’ <code class="language-plaintext highlighter-rouge">read_html</code> function, which returns a list of dataframes for all tables within the HTML content.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pandas</span> <span class="kn">import</span> <span class="n">read_html</span>
<span class="n">tables</span> <span class="o">=</span> <span class="n">read_html</span><span class="p">(</span><span class="n">html_content</span><span class="p">)</span>
<span class="k">if</span> <span class="n">tables</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tables</span><span class="p">),</span> <span class="s">'tables found.'</span><span class="p">)</span>
<span class="n">first_table</span> <span class="o">=</span> <span class="n">tables</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<h2 id="preserving-html-structure-and-semantics-with-annotations">Preserving HTML structure and semantics with annotations</h2>
<p>In the past, I often stumbled upon applications where <em>some</em> of the structure and semantics encoded within the original HTML document would have been helpful for downstream tasks. With the release of Inscriptis 2.0, Inscriptis supports so-called annotation rules, which enable the extraction of additional metadata from the HTML file.</p>
<p>The example below shows how these annotations work when parsing the following HTML snippet stored in the file <code class="language-plaintext highlighter-rouge">chur.html</code>:</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nt"><h1></span>Chur<span class="nt"></h1></span>
<span class="nt"><b></span>Chur<span class="nt"></b></span> is the capital and largest town of the Swiss canton of the
Grisons and lies in the Grisonian Rhine Valley.
</code></pre></div></div>
<p>The dictionary <code class="language-plaintext highlighter-rouge">annotation_rules</code> in the code below maps HTML tags, attributes and values to user-specified metadata which will be attached to matching text snippets:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">inscriptis</span> <span class="kn">import</span> <span class="n">get_annotated_text</span><span class="p">,</span> <span class="n">ParserConfig</span>
<span class="n">annotation_rules</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'h1'</span><span class="p">:</span> <span class="p">[</span><span class="s">'heading'</span><span class="p">,</span> <span class="s">'h1'</span><span class="p">],</span>
<span class="s">'h2'</span><span class="p">:</span> <span class="p">[</span><span class="s">'heading'</span><span class="p">,</span> <span class="s">'h2'</span><span class="p">],</span>
<span class="s">'b'</span><span class="p">:</span> <span class="p">[</span><span class="s">'emphasis'</span><span class="p">,</span> <span class="s">'bold'</span><span class="p">],</span>
<span class="s">'i'</span><span class="p">:</span> <span class="p">[</span><span class="s">'emphasis'</span><span class="p">,</span> <span class="s">'italic'</span><span class="p">],</span>
<span class="s">'div#class=toc'</span><span class="p">:</span> <span class="p">[</span><span class="s">'table-of-contents'</span><span class="p">],</span>
<span class="s">'#class=FactBox'</span><span class="p">:</span> <span class="p">[</span><span class="s">'fact-box'</span><span class="p">],</span>
<span class="s">'table'</span><span class="p">:</span> <span class="p">[</span><span class="s">'table'</span><span class="p">]</span>
<span class="p">}</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">get_annotated_text</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="n">ParserConfig</span><span class="p">(</span><span class="n">annotation_rules</span><span class="o">=</span><span class="n">rules</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Text:"</span><span class="p">,</span> <span class="n">output</span><span class="p">[</span><span class="s">'text'</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Annotations:"</span><span class="p">,</span> <span class="n">output</span><span class="p">[</span><span class="s">'label'</span><span class="p">])</span>
</code></pre></div></div>
<p>The annotation rules are used in Inscriptis’ <code class="language-plaintext highlighter-rouge">get_annotated_text</code> method which returns
a dictionary of the extracted text and a list of the corresponding annotations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="p">{</span>
<span class="s">'text'</span><span class="p">:</span> <span class="s">'Chur</span><span class="se">\n\n</span><span class="s">Chur is the capital and largest town of the Swiss canton
of the Grisons and lies in the Grisonian Rhine Valley.'</span><span class="p">,</span>
<span class="s">'label'</span><span class="p">:</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="s">'heading'</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="s">'h1'</span><span class="p">),</span> <span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="s">'emphasis'</span><span class="p">)]</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Each annotation is described by a tuple of its start and end position within the extracted text, and the corresponding metadata. In the example above, for instance, the first four letters of the converted text (which refer to the term <code class="language-plaintext highlighter-rouge">Chur</code>) contain content originally marked by an <code class="language-plaintext highlighter-rouge">h1</code> tag, which is annotated with <code class="language-plaintext highlighter-rouge">heading</code> and <code class="language-plaintext highlighter-rouge">h1</code>.
These annotations can be used later on within your application or by third-party software such as <a href="https://github.com/doccano/doccano">doccano</a> which is able to import and visualize JSONL annotated content (please note that doccano currently does not support overlapping annotations).</p>
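<p>To illustrate how these offsets work, the snippet below starts from the example output shown above, recovers the annotated surface forms via string slicing, and serializes the result to a JSONL line. Please note that the exact JSONL schema expected by doccano may vary between versions:</p>

```python
import json

# Example output as returned by Inscriptis' get_annotated_text (see above).
output = {
    'text': ('Chur\n\nChur is the capital and largest town of the Swiss canton '
             'of the Grisons and lies in the Grisonian Rhine Valley.'),
    'label': [(0, 4, 'heading'), (0, 4, 'h1'), (6, 10, 'emphasis')]
}

# Each annotation is (start, end, label); slicing recovers the surface form.
surface_forms = [(label, output['text'][start:end])
                 for start, end, label in output['label']]
print(surface_forms)  # [('heading', 'Chur'), ('h1', 'Chur'), ('emphasis', 'Chur')]

# Serialize to a JSONL line for import into annotation tools such as doccano.
jsonl_line = json.dumps({'text': output['text'],
                         'label': [list(a) for a in output['label']]})
```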
<p><code class="language-plaintext highlighter-rouge">Inscriptis</code> ships with the <code class="language-plaintext highlighter-rouge">inscript</code> command line client which is able to postprocess annotated content and to convert it into (i) XML, (ii) a list of surface forms and metadata (i.e., the text that has been annotated), and (iii) to visualize the converted and annotated content in an HTML document.</p>
<ul>
<li>Extracting the surface forms using <code class="language-plaintext highlighter-rouge">inscript.py chur.html --postprocessor surface</code> for the examples above yields the following list which maps metadata to the corresponding surface forms:
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span>
<span class="p">[</span><span class="s">'heading'</span><span class="p">,</span> <span class="s">'Chur'</span><span class="p">],</span>
<span class="p">[</span><span class="s">'h1'</span><span class="p">:</span> <span class="s">'Chur'</span><span class="p">],</span>
<span class="p">[</span><span class="s">'emphasis'</span><span class="p">:</span> <span class="s">'Chur'</span><span class="p">]</span>
<span class="p">]</span>
</code></pre></div> </div>
</li>
<li>the XML conversion (<code class="language-plaintext highlighter-rouge">inscript.py chur.html --postprocessor xml</code>) returns the following output:
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><?xml version="1.0" encoding="UTF-8" ?></span>
<span class="nt"><heading></span>Chur<span class="nt"></heading></span>
<span class="nt"><emphasis></span>Chur<span class="nt"></emphasis></span> is the capital and largest town of the Swiss
canton of the Grisons and lies in the Grisonian Rhine Valley.
</code></pre></div> </div>
</li>
<li>the HTML conversion yields an HTML file that contains the extracted text and the corresponding annotations. The following examples illustrate this visualization for two more complex use cases:</li>
</ul>
<h3 id="stackoverflow">Stackoverflow</h3>
<p class="full"><img src="/assets/images/2021/inscriptis/stackoverflow-annotated.png" alt="HTML export of an annotated Stackoverflow page" /></p>
<p>The HTML export of the annotated Stackoverflow page uses the following annotation rules which annotate headings, emphasized content, code and information on users and comments.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"h1"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h2"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h3"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"b"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"emphasis"</span><span class="p">],</span><span class="w">
</span><span class="nl">"code"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"code"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#itemprop=dateCreated"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"creation-date"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=lang-py"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"code"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=user-details"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"user"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=reputation-score"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"reputation"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=comment-user"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"comment-user"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=comment-date"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"comment-date"</span><span class="p">],</span><span class="w">
</span><span class="nl">"#class=comment-copy"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"comment-comment"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The corresponding HTML file has been generated with the <code class="language-plaintext highlighter-rouge">inscript</code> command line client and the following command line parameters:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inscript.py <span class="nt">--annotation-rules</span> ./stackoverflow.json
<span class="nt">--postprocessor</span> html <span class="se">\</span>
<span class="nt">--output</span> /tmp/stackoverflow.html <span class="se">\</span>
https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python
</code></pre></div></div>
<h3 id="wikipedia">Wikipedia</h3>
<p>The second example shows a snippet of a Wikipedia page that has been annotated with the rules below:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"h1"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h2"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"heading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h3"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"subheading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h4"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"subheading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"h5"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"subheading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"i"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"emphasis"</span><span class="p">],</span><span class="w">
</span><span class="nl">"b"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"bold"</span><span class="p">],</span><span class="w">
</span><span class="nl">"table"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"table"</span><span class="p">],</span><span class="w">
</span><span class="nl">"th"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"tableheading"</span><span class="p">],</span><span class="w">
</span><span class="nl">"a"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"link"</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p class="full"><img src="/assets/images/2021/inscriptis/wikipedia-annotated.png" alt="HTML export of an annotated Wikipedia page" /></p>
<h2 id="some-final-notes">Some final notes</h2>
<p>Inscriptis has been optimized towards providing accurate text representations of HTML documents, which often match or even surpass the quality of console-based Web browsers such as Lynx and w3m. If this is not sufficient for your applications (e.g., since you also need JavaScript support) you might consider using Selenium, which uses Chrome or Firefox to perform the conversion. Obviously, this option requires considerably more resources, scales less well and is considered less stable than the lightweight approaches.</p>
<p>Please note that I am the author of Inscriptis, and naturally this article has focused more on the features it provides. Nevertheless, I have also successfully used HTML2Text, lxml, BeautifulSoup, Lynx and w3m in my work, and all of these are very capable tools which address many real-world application scenarios.</p>
<h2 id="resources">Resources</h2>
<ul>
<li>An article on <a href="https://adrien.barbaresi.eu/blog/evaluating-text-extraction-python.html">evaluating scraping and text extraction tools for Python </a> by Adrien Barbaresi</li>
<li><a href="https://github.com/fhgr/harvest">Harvest</a> - A toolkit for extracting posts and post metadata from web forums</li>
<li><a href="https://pandas.pydata.org/">Pandas</a> - A fast, powerful data analysis and manipulation tool.</li>
<li><a href="https://selenium-python.readthedocs.io/">Selenium Python documentation</a> - Selenium allows remote control of Web browsers</li>
<li><a href="https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python">Stackoverflow on extracting text from HTML</a></li>
</ul>
<h3 id="text-web-browsers">Text Web browsers</h3>
<ul>
<li><a href="https://lynx.invisible-island.net/">Lynx</a></li>
<li><a href="http://w3m.sourceforge.net/">w3m</a></li>
</ul>
<h3 id="python-libraries">Python Libraries</h3>
<ul>
<li><a href="https://pypi.org/project/html2text/">HTML2Text</a> converts a page of HTML into clean, easy-to-read plain ASCII text.</li>
<li><a href="https://lxml.de/">lxml</a> - binding for the libxml2 and libxslt libraries which provides access to these libraries using the ElementTree API.
<a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a> - Python library for pulling data out of HTML and XML files.</li>
</ul>
<h1>Setup and automatic renewal of wildcard SSL certificates for Kubernetes with Certbot and NSD</h1>
<p>Wildcard SSL certificates cover all subdomains under a certain domain - e.g., <code class="language-plaintext highlighter-rouge">*.k8s.example.net</code> covers <code class="language-plaintext highlighter-rouge">recognyze.k8s.example.net</code>, <code class="language-plaintext highlighter-rouge">inscriptis.k8s.example.net</code>, etc., which is very useful if Kubernetes is used to deploy such services.</p>
<h2 id="prerequisites">Prerequisites</h2>
<p>The following guide assumes that you</p>
<ul>
<li>delegate DNS for the prefix domain (in the example above <code class="language-plaintext highlighter-rouge">k8s.example.net</code>) to a separate zone file, and</li>
<li>manage that zone with NSD (depending on your setup you might use the same NSD server, a separate instance, or even a server on another host).</li>
</ul>
<h2 id="steps">Steps</h2>
<ol>
<li>add a name server (NS) entry to your domain configuration that delegates DNS for the prefix domain to a given NSD server.
<pre><code class="language-dns">k8s 3600 IN NS k8s-server.example.net.
</code></pre>
</li>
<li>set up the NSD configuration and zone file for the prefix domain. The <code class="language-plaintext highlighter-rouge">_acme-challenge</code> entry will be overwritten by Certbot during the DNS-01 challenge verification process.
<ul>
<li><code class="language-plaintext highlighter-rouge">/etc/nsd/nsd.conf</code>:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>zone:
name: k8s.example.net
zonefile: /etc/nsd/zones/k8s.example.net.zone
</code></pre></div> </div>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">/etc/nsd/zones/k8s.example.net.zone</code>:</p>
<pre><code class="language-dns">@ 3660 IN SOA nameserver.example.net. hostmaster.example.net. 2014111364 28800 7200 604800 3660
@ 86400 IN NS k8s-server.example.net.
@ 3600 IN A 1.2.3.4
* 3600 IN A 1.2.3.4
_acme-challenge 60 IN TXT "--temporary-dummy--"
</code></pre>
</li>
</ul>
</li>
<li>install the <code class="language-plaintext highlighter-rouge">certbot-nsd-hook</code> script to <code class="language-plaintext highlighter-rouge">/opt</code>:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /opt
git clone https://github.com/AlbertWeichselbraun/certbot-nsd-hook.git
</code></pre></div> </div>
</li>
<li>create the SSL wildcard certificate with
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>certbot certonly <span class="se">\</span>
<span class="nt">-d</span> <span class="s1">'*.k8s.example.net'</span> <span class="se">\</span>
<span class="nt">--manual</span> <span class="se">\</span>
<span class="nt">--manual-auth-hook</span><span class="o">=</span><span class="s2">"/opt/certbot-nsd-hook/nsd-update-dns.py"</span> <span class="se">\</span>
<span class="nt">--post-hook</span><span class="o">=</span><span class="s2">"systemctl reload apache2"</span>
</code></pre></div> </div>
</li>
<li>adapt your apache2 configuration to use the wildcard certificate
<div class="language-apache highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">SSLEngine</span> <span class="ss">on</span>
<span class="nc">SSLCertificateKeyFile</span> /etc/letsencrypt/live/k8s.example.net/privkey.pem
<span class="nc">SSLCertificateFile</span> /etc/letsencrypt/live/k8s.example.net/fullchain.pem
</code></pre></div> </div>
</li>
<li>add Certbot to <code class="language-plaintext highlighter-rouge">/etc/crontab</code> to ensure that the certificate gets automatically renewed
<pre><code class="language-crontab">17 5 * * * root certbot renew --cert-name k8s.example.net
</code></pre>
<p><strong>Note:</strong> the option <code class="language-plaintext highlighter-rouge">--cert-name</code> allows you to specify the certificate to renew. This is relevant if your server uses wildcard and conventional certificates at the same time, since the <code class="language-plaintext highlighter-rouge">certbot renew</code> command does not allow mixing of renewal strategies yet.</p>
</li>
</ol>
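<p>The auth hook from step 4 essentially rewrites the <code class="language-plaintext highlighter-rouge">_acme-challenge</code> record from step 2 and bumps the zone serial so that the change propagates. The snippet below is a simplified Python illustration of that idea - not the actual <code class="language-plaintext highlighter-rouge">certbot-nsd-hook</code> code, which additionally writes the zone file to disk and reloads NSD:</p>

```python
import re


def set_acme_challenge(zone_text: str, validation_token: str) -> str:
    """Replace the _acme-challenge TXT record and bump the SOA serial.

    Simplified sketch of what a certbot manual-auth hook for NSD does;
    certbot passes the token via the CERTBOT_VALIDATION environment
    variable, and the real hook also reloads NSD afterwards.
    """
    # point the challenge record at the validation token
    zone_text = re.sub(
        r'(_acme-challenge\s+\d+\s+IN\s+TXT\s+)"[^"]*"',
        rf'\g<1>"{validation_token}"',
        zone_text,
    )

    # bump the SOA serial so secondary servers notice the change
    def bump(match: "re.Match") -> str:
        return match.group(1) + str(int(match.group(2)) + 1)

    return re.sub(r"(IN\s+SOA\s+\S+\s+\S+\s+)(\d+)", bump, zone_text, count=1)
```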
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://github.com/AlbertWeichselbraun/certbot-nsd-hook">certbot-nsd-hook project</a> - Scripts required for using the certbot DNS challenge in conjunction with NSD</li>
</ul>Albert WeichselbraunWildcard SSL certificates cover all subdomains under a certain domain - e.g. *.k8s.example.net will cover recognyze.k8s.example.net, inscripits.k8s.example.net, etc. which is very useful, if Kubernetes is used to deploy such services.Managing DavMail with systemd and preventing service timeouts after network reconnects.2020-10-17T00:00:00+02:002020-10-17T00:00:00+02:00https://semanticlab.net/desktop/e-mail/linux/sysadmin/Managing-DavMail-with-systemd-and-preventing-service-timeouts-after-network-reconnects<p><a href="https://davmail.sourceforge.net">DavMail</a> enables access to Exchange servers over standard protocols such as IMAP, SMTP and Caldav.
It, therefore, allows you to check your company e-mail from popular mail clients such as Mailspring, Thunderbird and Geary.</p>
<p>The following sections outline how to (i) automatically start DavMail via systemd, and (ii) ensure that the service stays operable, even after network reconnects.</p>
<h1 id="starting-davmail-via-systemd">Starting DavMail via systemd</h1>
<p>If your distribution does not provide a systemd configuration file for DavMail, you can paste the following snippet into <code class="language-plaintext highlighter-rouge">/etc/systemd/system/davmail.service</code>.</p>
<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[Unit]</span>
<span class="py">Description</span><span class="p">=</span><span class="s">Davmail Exchange gateway</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">man:davmail</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">https://davmail.sourceforge.net/serversetup.html</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">https://davmail.sourceforge.net/advanced.html</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">https://davmail.sourceforge.net/sslsetup.html</span>
<span class="py">After</span><span class="p">=</span><span class="s">network.target</span>
<span class="nn">[Service]</span>
<span class="py">Type</span><span class="p">=</span><span class="s">simple</span>
<span class="py">User</span><span class="p">=</span><span class="s">davmail</span>
<span class="py">PermissionsStartOnly</span><span class="p">=</span><span class="s">true</span>
<span class="py">ExecStartPre</span><span class="p">=</span><span class="s">/usr/bin/touch /var/log/davmail.log</span>
<span class="py">ExecStartPre</span><span class="p">=</span><span class="s">/bin/chown davmail:adm /var/log/davmail.log</span>
<span class="py">ExecStart</span><span class="p">=</span><span class="s">/usr/bin/davmail -server /etc/davmail.properties</span>
<span class="py">SuccessExitStatus</span><span class="p">=</span><span class="s">143</span>
<span class="py">PrivateTmp</span><span class="p">=</span><span class="s">yes</span>
<span class="nn">[Install]</span>
<span class="py">WantedBy</span><span class="p">=</span><span class="s">multi-user.target</span>
</code></pre></div></div>
<p>Afterwards, you need to add the DavMail user and enable the script with</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>adduser <span class="nt">--system</span> davmail
systemctl daemon-reload
systemctl <span class="nb">enable </span>davmail
systemctl start davmail
</code></pre></div></div>
<h1 id="coping-with-network-reconnects">Coping with network reconnects</h1>
<p>One major problem with DavMail is network reconnects (e.g., if you change the network or move between VPNs), since they require a restart of the service to prevent timeouts when accessing your e-mail. One way of solving this issue is the use of the <code class="language-plaintext highlighter-rouge">NetworkManager-dispatcher</code> service, which can be enabled with</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl <span class="nb">enable </span>NetworkManager-dispatcher
systemctl start NetworkManager-dispatcher
</code></pre></div></div>
<p>Once enabled, the dispatcher service allows you to specify scripts that are executed if network connectivity is lost or becomes available again. The following script stops DavMail if networking becomes unavailable and restarts the service after the network is up again.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="c"># stop davmail, if no network connectivity is available and restart it once</span>
<span class="c"># the network becomes available.</span>
<span class="nv">interface</span><span class="o">=</span><span class="nv">$1</span> <span class="nv">status</span><span class="o">=</span><span class="nv">$2</span>
<span class="k">case</span> <span class="nv">$status</span> <span class="k">in
</span>up<span class="p">)</span>
systemctl restart davmail
<span class="p">;;</span>
down<span class="p">)</span>
systemctl stop davmail
<span class="p">;;</span>
<span class="k">esac</span>
</code></pre></div></div>
<p>You can enable automatic restarts of the DavMail service by copying the script to <code class="language-plaintext highlighter-rouge">/etc/NetworkManager/dispatcher.d/50-davmail</code> and making it executable with <code class="language-plaintext highlighter-rouge">chmod a+x /etc/NetworkManager/dispatcher.d/50-davmail</code>.</p>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://davmail.sourceforge.net">DavMail</a> - DavMail POP/IMAP/SMTP/Caldav/Carddav/LDAP Exchange and Office 365 Gateway</li>
<li><a href="https://github.com/mguessan/davmail">DavMail GitHub repository</a></li>
<li><a href="https://wiki.archlinux.org/index.php/NetworkManager#Network_services_with_NetworkManager_dispatcher">ArchWiki on managing network services with NetworkManager dispatcher</a></li>
</ul>Albert WeichselbraunDavMail enables access to Exchange servers over standard protocols such as IMAP, SMTP and Caldav. It, therefore, allows you to check your company e-mail from popular mail clients such as Mailspring, Thunderbird and Geary.Setting up Gnome CalDAV and CardDAV support with Radicale2020-10-12T00:00:00+02:002020-10-12T00:00:00+02:00https://semanticlab.net/sysadmin/linux/Gnome-Todo-and-CalDAV-servers<p>Although Gnome supports CalDAV and CardDAV, it currently only allows configuring them for Nextcloud servers. There is a long-standing <a href="https://bugzilla.gnome.org/show_bug.cgi?id=720519">Bug Report</a> which describes this issue but hasn’t yet (as of October 2020) been properly addressed.</p>
<p>Florian Apolloner has, therefore, developed a <a href="https://gist.github.com/apollo13/f4fc8f33a2700dffb9e11c1b056c53ba">webapp</a> which uses redirects to map requests meant for Nextcloud servers to other CalDAV/CardDAV servers.</p>
<p>If you run an Apache Web server you can instead use <code class="language-plaintext highlighter-rouge">mod_rewrite</code> to replicate his solution:</p>
<div class="language-apache highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c"># redirect used for caldav and carddav compatibility with owncloud & nextcloud</span>
<span class="nc">RewriteEngine</span> <span class="ss">on</span>
<span class="nc">RewriteRule</span> "^/.well-known/caldav" "/dav/caldav/" [R]
<span class="nc">RewriteRule</span> "^/.well-known/carddav" "/dav/carddav/" [R]
<span class="nc">RewriteRule</span> "^/remote.php/webdav/" "/dav" [R]
<span class="nc">RewriteRule</span> "^/remote.php/caldav" "/dav/caldav/" [R]
<span class="nc">RewriteRule</span> "^/remote.php/carddav" "/dav/carddav/" [R]
</code></pre></div></div>
<p>The redirects’ targets need to point to the path or URL of your caldav and carddav servers (I use <a href="https://radicale.org">Radicale</a> so in my case the proper URLs are <code class="language-plaintext highlighter-rouge">/dav/caldav</code> and <code class="language-plaintext highlighter-rouge">/dav/carddav</code>). The <code class="language-plaintext highlighter-rouge">/webdav</code> redirect can either point to your WebDAV server (if you plan on using WebDAV remote storage) or to a simple Web page on your system.</p>
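<p>Ignoring the redirect semantics, the rewrite rules above boil down to a prefix-to-target mapping. The following Python sketch (an illustration, not part of the Apache setup) reproduces that mapping; since all patterns are anchored at the start of the path, a simple prefix match approximates the regular expressions:</p>

```python
from typing import Optional

# rewrite table mirroring the Apache configuration above
REDIRECTS = [
    ("/.well-known/caldav", "/dav/caldav/"),
    ("/.well-known/carddav", "/dav/carddav/"),
    ("/remote.php/webdav/", "/dav"),
    ("/remote.php/caldav", "/dav/caldav/"),
    ("/remote.php/carddav", "/dav/carddav/"),
]


def redirect_target(path: str) -> Optional[str]:
    """Return the redirect target for a request path, or None if no rule matches."""
    for prefix, target in REDIRECTS:
        if path.startswith(prefix):
            return target
    return None
```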
<p>Once the redirects are set up, you can configure your CalDAV/CardDAV server as a <em>Nextcloud</em> server in <em>Gnome Online Accounts</em>. If your server does not support WebDAV you need to disable the <code class="language-plaintext highlighter-rouge">Documents</code> and <code class="language-plaintext highlighter-rouge">Files</code> sharing settings as outlined below.</p>
<p><img src="/assets/images/2020/nextcloud-settings.png" alt="Gnome Nextcloud Settings" title="Gnome Nextcloud Settings" /></p>
<p>Once you have completed this setup applications such as <em>Gnome To Do</em> and <em>Gnome Calendar</em> will be able to synchronize with your CalDAV server.</p>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://bugzilla.gnome.org/show_bug.cgi?id=720519">Gnome Bug Report #720519</a> - Add separate components for CalDAV and CardDAV accounts</li>
<li><a href="https://gist.github.com/apollo13/f4fc8f33a2700dffb9e11c1b056c53ba">OwnCloud/Nextcloud Emulator by Florian Apolloner</a></li>
<li><a href="https://radicale.org">Radicale CalDAV/WebDAV Server</a></li>
</ul>Albert WeichselbraunAlthough Gnome supports CalDAV and CardDAV, it currently only allows configuring them for Nextcloud servers. There is a long-standing Bug Report which describes this issue but hasn’t yet (as of October 2020) been properly addressed.How to resize a LUKS encrypted root partition2020-08-26T00:00:00+02:002020-08-26T00:00:00+02:00https://semanticlab.net/sysadmin/encryption/How-to-resize-a-LUKS-encrypted-root-partition<p>The Ubuntu standard setup for an encrypted root file system is quite complex, as the following output shows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ephiphany~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 1T 0 disk
├─vda1 252:1 0 1M 0 part
├─vda2 252:2 0 1G 0 part /boot
└─vda3 252:3 0 1024G 0 part
└─dm_crypt-3 253:0 0 1024G 0 crypt
└─epiphany-root 253:1 0 1024G 0 lvm /
</code></pre></div></div>
<p>Basically we have a disk (<code class="language-plaintext highlighter-rouge">vda</code>) with the root file system on the <code class="language-plaintext highlighter-rouge">vda3</code> partition, which holds the encrypted LUKS device that is decrypted as <code class="language-plaintext highlighter-rouge">dm_crypt-3</code>. On top of <code class="language-plaintext highlighter-rouge">dm_crypt-3</code> we have a physical LVM volume with volume group <code class="language-plaintext highlighter-rouge">epiphany</code> and the logical volume <code class="language-plaintext highlighter-rouge">root</code>.</p>
<p>Consequently, growing the root filesystem requires:</p>
<ol>
<li>extending the vda3 partition using fdisk (please refer to the following <a href="https://access.redhat.com/articles/1190213">guideline</a> for more information)</li>
<li>resizing the LUKS partition</li>
<li>resizing the physical device,</li>
<li>resizing the logical device, and finally</li>
<li>growing the file system</li>
</ol>
<p>as outlined below:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># resize the LUKS partition (dm_crypt-3)</span>
cryptsetup resize dm_crypt-3
<span class="c"># resize the physical device on top of it</span>
pvresize /dev/mapper/dm_crypt-3
<span class="c"># resize the logical device (epiphany-root)</span>
lvextend <span class="nt">-l</span> +100%FREE /dev/mapper/epiphany-root
<span class="c"># grow the file system accordingly</span>
resize2fs /dev/mapper/epiphany-root
</code></pre></div></div>Albert WeichselbraunThe Ubuntu standard setup for an encrypted root file system is quite complex as the following output shows:Network-bound disk encryption in Ubuntu 20.04 (Focal Fossa) - Booting servers with an encrypted root file system without user interaction.2020-08-26T00:00:00+02:002020-08-26T00:00:00+02:00https://semanticlab.net/sysadmin/encryption/Network-bound-disk-encryption-in-ubuntu-20.04<p>Network-bound disk encryption allows unlocking LUKS devices (e.g. the encrypted root file system of an Ubuntu server) without entering the password. Instead a Tang server is queried for a key that can be used in conjunction with a private secret to compute the decryption key. As long as the Tang server is available, the disk can be decrypted without the need to manually enter a password.</p>
<p>Ubuntu 20.04 requires the following components for implementing network-bound disk encryption:</p>
<ol>
<li>the LUKS encrypted device(s) that should be automatically unlocked.</li>
<li>a Tang server that provides the public key required by the client for deriving its LUKS decryption key.</li>
<li>Clevis which provides clients that can use a Tang server for unlocking LUKS partitions.</li>
<li>For unlocking a boot device adjustments to initramfs (automatically provided by the <code class="language-plaintext highlighter-rouge">clevis-initramfs</code> package) are necessary.</li>
</ol>
<h1 id="how-does-network-bound-disk-encryption-work">How does network-bound disk encryption work?</h1>
<p>The figures below outline how network-bound encryption works. In the first step, we use Clevis to bind a LUKS encrypted device to a Tang server, generating a secret JSON Web Key (<code class="language-plaintext highlighter-rouge">cJWK</code>) on the client, which is then combined with the server’s public key (<code class="language-plaintext highlighter-rouge">sJWK*</code>) to generate the key (<code class="language-plaintext highlighter-rouge">dJWK</code>) that is then added to the LUKS device as a decryption key.</p>
<p><img src="/assets/images/2020/clevis-bind-to-tang-server.svg" alt="Bind the LUKS device to the Tang server" /></p>
<p>Once the device has been bound to the Tang server, it can compute its decryption key with the server’s help. The client first generates an ephemeral key (<code class="language-plaintext highlighter-rouge">eJWK</code>) that is then combined with its secret (<code class="language-plaintext highlighter-rouge">cJWK</code>) to generate a message (<code class="language-plaintext highlighter-rouge">xJWK</code>) that is sent to the server. The server combines <code class="language-plaintext highlighter-rouge">xJWK</code> with its private key <code class="language-plaintext highlighter-rouge">sJWK</code> to generate the response <code class="language-plaintext highlighter-rouge">yJWK</code>. Clevis then combines <code class="language-plaintext highlighter-rouge">yJWK</code> with the server’s public key <code class="language-plaintext highlighter-rouge">sJWK*</code> and <code class="language-plaintext highlighter-rouge">eJWK</code> to recover the decryption key <code class="language-plaintext highlighter-rouge">dJWK</code>.</p>
<p><img src="/assets/images/2020/clevis-recover-encryption-key.svg" alt="Recover the decryption key with the help of the Tang server" /></p>
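<p>To make the figures above concrete, the exchange can be sketched with a toy implementation that substitutes exponentiation in a small prime group for the elliptic-curve operations Tang and Clevis actually use. The variable names mirror the figures; everything else is a deliberate simplification:</p>

```python
import secrets

# toy multiplicative group; Tang/Clevis use elliptic-curve keys,
# but the blinding algebra is the same
P = 2**127 - 1                          # modulus (a Mersenne prime)
G = 3                                   # generator of the toy group

# --- binding (clevis luks bind) ---
sJWK = secrets.randbelow(P - 2) + 1     # Tang's private key
sJWK_pub = pow(G, sJWK, P)              # sJWK*, advertised by Tang
cJWK = secrets.randbelow(P - 2) + 1     # client secret created at bind time
dJWK = pow(sJWK_pub, cJWK, P)           # key added to the LUKS header

# --- recovery (e.g. at boot) ---
eJWK = secrets.randbelow(P - 2) + 1     # fresh ephemeral secret
xJWK = (pow(G, cJWK, P) * pow(G, eJWK, P)) % P   # blinded message to Tang
yJWK = pow(xJWK, sJWK, P)               # Tang's answer: xJWK^sJWK
# unblind: divide by (sJWK*)^eJWK, leaving g^(sJWK * cJWK) = dJWK
recovered = (yJWK * pow(pow(sJWK_pub, eJWK, P), -1, P)) % P
assert recovered == dJWK
```

<p>The key property is visible in the last lines: the ephemeral blinding hides <code class="language-plaintext highlighter-rouge">cJWK</code> from the server, so Tang can help recover <code class="language-plaintext highlighter-rouge">dJWK</code> without ever learning it.</p>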
<h1 id="setup">Setup</h1>
<p>Ubuntu 20.04 provides packages for Tang and Clevis, which makes installing them straightforward.</p>
<h2 id="setup-and-start-the-tang-server">Setup and start the Tang server</h2>
<p>Install Tang and José (an implementation of the JavaScript Object Signing and Encryption standards used by Tang) on the Tang server.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt <span class="nb">install </span>tang jose
systemctl <span class="nb">enable </span>tangd.socket
systemctl start tangd.socket
</code></pre></div></div>
<p>If you install Tang on Ubuntu 18.04, you need to manually generate the Tang keys with <code class="language-plaintext highlighter-rouge">/usr/lib/x86_64-linux-gnu/tangd-keygen /var/db/tang</code> before starting the server.</p>
<p>Execute <code class="language-plaintext highlighter-rouge">tang-show-keys</code> to determine the signing key’s fingerprint.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tang-show-keys
TieDkMgbVKzmXl-uyOfIa0U30lo
</code></pre></div></div>
<h2 id="host-with-the-encrypted-luks-devices">Host with the encrypted LUKS device(s)</h2>
<p>Install Clevis on the host system and then use <code class="language-plaintext highlighter-rouge">clevis luks bind</code> for binding the device to the Tang server. Clevis will ask you to verify the signing key’s fingerprint. Afterwards, Clevis can be used to unlock the device.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install clevis</span>
apt <span class="nb">install </span>clevis clevis-luks
<span class="c"># ensure that the device (e.g. vda1) is encrypted and that the tang server is working</span>
cryptsetup luksDump /dev/vda1 <span class="c"># just to be sure that we encrypt the right disk ;)</span>
curl http://192.168.122.1/adv <span class="c"># verify that the tang server yields a response</span>
<span class="c"># enable clevis tang decryption for the given LUKS device</span>
clevis luks <span class="nb">bind</span> <span class="nt">-d</span> /dev/vda1 tang <span class="s1">'{"url": "http://192.168.122.1"}'</span>
</code></pre></div></div>
<p>Clevis provides plugins for initramfs, dracut, systemd and udisks2 to automate the unlocking process.</p>
<h3 id="automatically-unlocking-a-root-device-with-clevis">Automatically unlocking a root device with Clevis</h3>
<p>Once Clevis support has been enabled for an encrypted root file system, it can be automatically unlocked by installing the corresponding clevis plugin and rebuilding initramfs.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install the necessary clevis plugin</span>
apt <span class="nb">install </span>clevis-initramfs
<span class="c"># reinitialize initramfs to support automatic unlocking of the root device.</span>
update-initramfs <span class="nt">-u</span> <span class="nt">-k</span> <span class="s1">'all'</span>
</code></pre></div></div>
<h2 id="automatic-unlocking-of-non-root-devices-with-clevis">Automatic unlocking of non-root devices with Clevis</h2>
<p>Automatic unlocking of non-root devices via systemd is supported by the <code class="language-plaintext highlighter-rouge">clevis-systemd</code> plugin.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt <span class="nb">install </span>clevis-systemd
</code></pre></div></div>
<p>Afterwards the encrypted non-root devices need to be added to <code class="language-plaintext highlighter-rouge">/etc/crypttab</code> with the <code class="language-plaintext highlighter-rouge">_netdev</code> option. Crypttab entries consist of the following four columns:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">target</code>: the name to be used for the mapped (i.e. decrypted) device</li>
<li><code class="language-plaintext highlighter-rouge">source device</code>: the name of the corresponding encrypted source device</li>
<li><code class="language-plaintext highlighter-rouge">key file</code>: <code class="language-plaintext highlighter-rouge">none</code>, since we do not specify a key</li>
<li><code class="language-plaintext highlighter-rouge">options</code>: the column must be set to <code class="language-plaintext highlighter-rouge">_netdev</code> so that systemd is able to automatically mount the device using the <code class="language-plaintext highlighter-rouge">clevis-systemd</code> plugin.</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>encrypted_home /dev/vdb none _netdev
encrypted_opt /dev/vdc none _netdev
</code></pre></div></div>
<p>Afterwards, the devices can be added to <code class="language-plaintext highlighter-rouge">/etc/fstab</code> for automatic mounting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/dev/mapper/encrypted_home /home xfs defaults,_netdev 0 0
/dev/mapper/encrypted_opt /opt xfs defaults,_netdev 0 0
</code></pre></div></div>
<p>Again it is important to add the <code class="language-plaintext highlighter-rouge">_netdev</code> option to ensure that systemd is able to recognize and automatically mount the encrypted device.</p>
<p class="notice--danger"><strong>Warning:</strong> To the best of my knowledge it is not possible to mount an encrypted <code class="language-plaintext highlighter-rouge">/var</code> partition using this method, since systemd relies on <code class="language-plaintext highlighter-rouge">/var</code> for its networking configuration.</p>
<h1 id="resources">Resources</h1>
<ul>
<li>Github pages
<ul>
<li><a href="https://github.com/latchset/tang">Tang</a></li>
<li><a href="https://github.com/latchset/clevis">Clevis</a></li>
</ul>
</li>
<li><a href="https://www.admin-magazine.com/Archive/2018/43/Automatic-data-encryption-and-decryption-with-Clevis-and-Tang">ADMIN Magazine article on Clevis and Tang</a></li>
<li><a href="https://www.youtube.com/watch?v=Dk6ZuydQt9I">Youtube video by Fraser Tweedale on Clevis and Tang</a></li>
</ul>Albert WeichselbraunNetwork-bound disk encryption allows unlocking LUKS devices (e.g. the encrypted root file system of an Ubuntu server) without entering the password. Instead, a Tang server is queried for a key that can be used in conjunction with a private secret to compute the decryption key. As long as the Tang server is available, the disk can be decrypted without the need to manually enter a password.Record Temperature, Humidity and Pressure with an ESP32, a Bosch BME280 sensor and InfluxDB2020-02-02T00:00:00+01:002020-02-02T00:00:00+01:00https://semanticlab.net/linux/iot/esp32/bme280/sensor/influxdb/Record-Temperature-Humidity-Pressure-Monitoring-with-an-ESP32-a-BME280-and-InfluxDB<h2 id="install-influxdb-and-grafana-at-your-server-and-create-a-database">Install InfluxDB and Grafana at your server and create a database</h2>
<ol>
<li>
<p>On Debian-based systems the installation of InfluxDB is straightforward:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt <span class="nb">install </span>influxdb influxdb-client
</code></pre></div> </div>
<p>Afterwards a new database (with name <code class="language-plaintext highlighter-rouge">sensors</code>) can be set up by connecting to InfluxDB with the <code class="language-plaintext highlighter-rouge">influx</code> command and then running <code class="language-plaintext highlighter-rouge">CREATE DATABASE sensors</code>.</p>
</li>
<li>
<p>Grafana should be installed based on the instructions on the <a href="https://grafana.com/docs/grafana/latest/installation/debian/">Grafana Web Site</a>.</p>
</li>
</ol>
<h3 id="connect-the-bosh-bme280-sensor-to-the-esp32">Connect the Bosch BME280 sensor to the ESP32</h3>
<ol>
<li>I prefer using an RJ45 cable for the connection with the following pin layout which minimizes interference:
<ul>
<li>VIN: White-Orange</li>
<li>GND: Brown</li>
<li>SCL: White-Brown</li>
<li>SDA: Orange</li>
</ul>
</li>
<li>Optional: Change the BME280’s I2C bus address: If you plan to use two sensors at once, you need to ensure that they have different I2C bus addresses.
<ul>
<li>The default bus address is 0x76.</li>
<li>If your breakout board has an SDO pin, you can change the bus address to 0x77 by connecting the SDO pin to GND.
<img src="/assets/images/sensors/BME280.jpg" alt="BME280 breakout board with SDO pin" /></li>
</ul>
</li>
<li>
<p>The sensor is then connected to the ESP32 as outlined in the picture below:</p>
<p><img src="/assets/images/sensors/Wiring-ESP32-BME280.png" alt="Connecting a BME280 to an ESP32 (Source: Last Minute Engineers)" /></p>
</li>
</ol>
<h2 id="upload-humidity-probe-influxdb-to-the-esp32">Upload humidity-probe-influxdb to the ESP32</h2>
<p>Download the <a href="https://github.com/AlbertWeichselbraun/humidity-probe-influxdb">humidity-probe-influxdb project</a> from GitHub and update <code class="language-plaintext highlighter-rouge">custom.h-example</code> to reflect your WiFi setup and InfluxDB URL. Once you transfer the sketch to your ESP32, it will</p>
<ul>
<li>read temperature, humidity and pressure and</li>
<li>transfer the measurements to your InfluxDB server once ten measurements have been collected
(the first transfer will therefore start after approximately ten minutes).</li>
</ul>
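<p>Conceptually, each batch ends up as a set of InfluxDB line-protocol records in the <code class="language-plaintext highlighter-rouge">sensors</code> database. The sketch below illustrates the encoding; the measurement, tag and field names are assumptions for illustration and not necessarily the ones used by humidity-probe-influxdb:</p>

```python
def to_line_protocol(sensor: str, temperature: float, humidity: float,
                     pressure: float, timestamp_ns: int) -> str:
    """Encode one measurement in InfluxDB 1.x line protocol.

    Hypothetical identifiers: 'climate' (measurement), 'sensor' (tag)
    and the three field names are placeholders for illustration.
    """
    return (f"climate,sensor={sensor} "
            f"temperature={temperature},humidity={humidity},pressure={pressure} "
            f"{timestamp_ns}")
```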
<h2 id="log-into-grafana-and-configure-your-dashboards">Log into Grafana and configure your dashboards</h2>
<p>By default, Grafana listens on port 3000 and should be available at <code class="language-plaintext highlighter-rouge">http://your-grafana-server-ip:3000</code>. You can log into Grafana to set up queries, graphs and dashboards as illustrated in the example below.</p>
<p><img src="/assets/images/sensors/Grafana-Screenshot.png" alt="Example Grafana Screenshot" /></p>
<h3 id="references">References</h3>
<ol>
<li><a href="https://lastminuteengineers.com/bme280-esp32-weather-station/">Creating A Simple ESP32 Weather Station With BME280</a></li>
<li><a href="https://grafana.com/docs/grafana/latest/installation/debian/">Installing Grafana on Debian or Ubuntu</a></li>
</ol>Albert WeichselbraunInstall InfluxDB and Grafana at your server and create a databaseOptimizing Apache Storm Topologies2018-07-14T00:00:00+02:002018-07-14T00:00:00+02:00https://semanticlab.net/linux/storm/java/Optimizing-Storm-Deployments<p>This article summarizes hints for optimizing and deploying Apache Storm topologies.</p>
<h3 id="setup-your-storm-cluster">Setup your storm cluster</h3>
<ol>
<li>I/O is zookeeper’s main bottleneck - ensure that the <code class="language-plaintext highlighter-rouge">/data</code> partition of the zookeeper machines resides on fast storage (ramdisk ;)</li>
<li>Determine the number of parallelism units using the following rule of thumb:
<ul>
<li><em>number of available CPU cores</em> on all machines minus one core per machine that is used for the <em>Acker</em></li>
<li>Example: 2 machines with 48 and 1 machine with 32 cores; parallelism units = 2x(48-1) + (32-1) = 125</li>
</ul>
</li>
<li>Using multiple workers per machine allows deploying multiple topologies at once (the number of workers is determined by the number of ports configured in the <code class="language-plaintext highlighter-rouge">supervisor.slots.ports</code> setting in <code class="language-plaintext highlighter-rouge">storm.yaml</code>)</li>
</ol>
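<p>The rule of thumb from step 2 translates directly into code; using the example cluster from above:</p>

```python
def parallelism_units(cores_per_machine):
    """Rule of thumb: total cores minus one Acker core per machine."""
    return sum(cores - 1 for cores in cores_per_machine)

# two machines with 48 cores and one machine with 32 cores
units = parallelism_units([48, 48, 32])   # 2x(48-1) + (32-1) = 125
```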
<h3 id="topology-configuration-suggestions">Topology configuration suggestions</h3>
<ol>
<li>Use one worker per machine and topology (intra-worker transports are more efficient)</li>
<li>The number of executors depends on whether your bolt is I/O or CPU bound
<ul>
<li>CPU bound: configure one executor per available parallelism unit</li>
<li>I/O bound: use 10-100 executors per parallelism unit, depending on the expected I/O delay</li>
</ul>
</li>
<li>The total number of parallelism units in your topology should equal the number of available parallelism units</li>
</ol>
<h3 id="profiling-the-topology">Profiling the topology</h3>
<ol>
<li>Storm UI: use the capacity metric to identify bolts which require a higher parallelism</li>
<li>your <code class="language-plaintext highlighter-rouge">nextTuple</code> and <code class="language-plaintext highlighter-rouge">execute</code> methods determine the spout’s/bolt’s runtime - optimize these methods</li>
<li>use queues for I/O in spouts or terminal bolts (i.e. write final results to a queue and use a writer thread that performs batch inserts to serialize the queue to disk)</li>
</ol>
<h3 id="glossary">Glossary</h3>
<ul>
<li>worker process - responsible for executing the topology on a particular machine</li>
<li>executor - thread spawned by the worker for a particular component (bolt or spout); the number of executors is configured by setting the <code class="language-plaintext highlighter-rouge">parallelism hint</code> parameter in the <code class="language-plaintext highlighter-rouge">setSpout</code> or <code class="language-plaintext highlighter-rouge">setBolt</code> method.</li>
<li>task - number of instances of a particular bolt/spout to deploy; configuring more than one task using <code class="language-plaintext highlighter-rouge">setNumTasks(n)</code> allows to later increase the number of executors for that particular spout/bolt without redeploying the topology.</li>
</ul>
<h3 id="references">References</h3>
<ol>
<li><a href="https://www.slideshare.net/ptgoetz/scaling-apache-storm-strata-hadoopworld-2014?qid=19b9de2b-175b-415e-94c8-7a537d8c2a9a&v=qf1&b=&from_search=2">Scaling Apache Storm</a></li>
<li><a href="http://storm.apache.org/releases/1.2.2/Understanding-the-parallelism-of-a-Storm-topology.html">Understanding the Parallelism of an Apache Storm Topology</a></li>
<li><a href="https://stackoverflow.com/questions/17257448/what-is-the-task-in-storm-parallelism">Stack Overflow - What is a “Task” in Storm Parallelism</a></li>
<li><a href="https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_storm-component-guide/content/storm-parallelism.html">Hortonworks - Storm Parallelism</a></li>
</ol>Albert WeichselbraunThis article summarizes hints for optimizing and deploying Apache Storm topologies.Headless Seafile server on a Raspberry Pi 2 with dynamic DNS2018-02-17T00:00:00+01:002018-02-17T00:00:00+01:00https://semanticlab.net/linux/seafile/raspberry%20pi/Raspberry-Pi-Home-Server-Configuration<p>The Raspberry Pi is operated from at home keeping noise and power consumption in mind.</p>
<h3 id="install-raspbian-on-pi">Install Raspbian on Pi</h3>
<ol>
<li>Download and install <a href="https://www.raspberrypi.org/downloads/raspbian/">Raspbian</a> on the SD card. Before rebooting the device, mount the <code class="language-plaintext highlighter-rouge">boot</code> partition and create an empty file named <code class="language-plaintext highlighter-rouge">ssh</code> on the partition.</li>
<li>Put the SD card into the Raspberry Pi, boot the system and determine its IP address with
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nmap <span class="nt">-sn</span> 192.168.1.0/24
</code></pre></div> </div>
</li>
<li>Log into the Pi (<code class="language-plaintext highlighter-rouge">user</code>: <code class="language-plaintext highlighter-rouge">pi</code>, <code class="language-plaintext highlighter-rouge">password</code>: <code class="language-plaintext highlighter-rouge">raspberry</code>) and run <code class="language-plaintext highlighter-rouge">sudo raspi-config</code> to
<ul>
<li>Change the login password</li>
<li>Maximize the rootfs with <code class="language-plaintext highlighter-rouge">expand_rootfs</code>.</li>
</ul>
</li>
</ol>
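<p>The headless SSH trick from step 1 boils down to a single command; the mount point below is an assumption and depends on how your distribution auto-mounts the SD card.</p>

```shell
# Enable SSH on first boot: the presence of an (empty) file named "ssh"
# on the boot partition tells Raspbian to start the SSH daemon.
touch /media/$USER/boot/ssh
```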
<h3 id="change-the-root-file-system-to-f2fs">Change the root file system to F2FS</h3>
<ul>
<li>mount the SD card and copy the content of the root filesystem to a temporary directory</li>
<li>unmount the rootfs file system and format its partition (e.g. <code class="language-plaintext highlighter-rouge">/dev/mmcblk0p2</code>)</li>
<li>restore the root partitions content</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nb">mkdir</span> /tmp/rpi
<span class="nb">cp</span> <span class="nt">-a</span> /media/<span class="o">{</span>user<span class="o">}</span>/root_fs /tmp/rpi
umount /media/<span class="o">{</span>user<span class="o">}</span>/root_fs
mkfs.f2fs /dev/mmcblk0p2
mount /dev/mmcblk0p2 /mnt
<span class="nb">cp</span> <span class="nt">-a</span> /tmp/rpi/root_fs/. /mnt/
</code></pre></div></div>
<ul>
<li>adapt <code class="language-plaintext highlighter-rouge">cmdline.txt</code> and <code class="language-plaintext highlighter-rouge">fstab</code> to reflect the changed file system type:
<ul>
<li><code class="language-plaintext highlighter-rouge">/media/{user}/boot_fs/cmdline.txt</code>: change <code class="language-plaintext highlighter-rouge">rootfstype=ext4</code> to <code class="language-plaintext highlighter-rouge">rootfstype=f2fs</code></li>
<li><code class="language-plaintext highlighter-rouge">/mnt/etc/fstab</code>: change the file system type for <code class="language-plaintext highlighter-rouge">/dev/mmcblk0p2</code> to <code class="language-plaintext highlighter-rouge">f2fs</code> and add the <code class="language-plaintext highlighter-rouge">discard</code> option
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/dev/mmcblk0p1 /boot vfat defaults 0 2
/dev/mmcblk0p2 / f2fs defaults,noatime,discard 0 1
</code></pre></div> </div>
</li>
</ul>
</li>
</ul>
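<p>After rebooting, it is worth verifying that the root file system really runs on F2FS:</p>

```shell
# Print the file system type of the root mount; the output should be "f2fs".
findmnt -n -o FSTYPE /
```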
<h3 id="remove-unnecessary-components-and-reduce-power-consumption">Remove unnecessary components and reduce power consumption</h3>
<ol>
<li>remove unnecessary services
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get remove <span class="nt">--purge</span> avahi-daemon triggerhappy
</code></pre></div> </div>
</li>
<li>disable HDMI (-25 mA) and LEDs (-5 mA per LED) by adding the following commands to <code class="language-plaintext highlighter-rouge">/etc/rc.local</code>:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># disable hdmi (25 mA)
/usr/bin/tvservice -o
# disable leds (5 mA per LED)
echo 0 |tee /sys/class/leds/led0/brightness
echo 0 |tee /sys/class/leds/led1/brightness
</code></pre></div> </div>
</li>
</ol>
<h3 id="dynamic-dns-with-dynucom">Dynamic DNS with dynu.com</h3>
<p><a href="https://www.dynu.com/">Dynu.com</a> offers a dynamic DNS service which lets you (optionally) use your own domain name for dynamic DNS. The following steps refer to this case.</p>
<ul>
<li>set the <code class="language-plaintext highlighter-rouge">NS</code> entries for the chosen name to the dynu name servers:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>myname.semanticlab.net 3600 IN NS ns1.dynu.com.
myname.semanticlab.net 3600 IN NS ns2.dynu.com.
myname.semanticlab.net 3600 IN NS ns3.dynu.com.
myname.semanticlab.net 3600 IN NS ns4.dynu.com.
myname.semanticlab.net 3600 IN NS ns5.dynu.com.
myname.semanticlab.net 3600 IN NS ns6.dynu.com.
</code></pre></div> </div>
</li>
<li>set up a dynu account and configure it for dynamic DNS.</li>
<li>use the dynu dynamic DNS client or configure your router to update the dynamic DNS record when required.</li>
</ul>
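<p>Once the NS records have propagated, the delegation and the current dynamic A record can be verified with <code>dig</code>; the host name is the example name used above.</p>

```shell
# Check the NS delegation and query the dynamic A record directly
# from one of the dynu name servers.
dig +short NS myname.semanticlab.net
dig +short A myname.semanticlab.net @ns1.dynu.com
```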
<h3 id="install-seafile-and-nginx">Install Seafile and nginx</h3>
<ul>
<li>Prerequisites: install the necessary dependencies for running seafile and nginx
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get <span class="nb">install</span> <span class="nt">-y</span> nginx mysql-server python-requests python-mysqldb python-pil
</code></pre></div> </div>
</li>
<li>Download the <a href="https://www.seafile.com/en/download/">Seafile server for Raspberry Pi</a> and follow the provided install instructions.</li>
<li>Optional: to enable WebDAV with nginx, change <code class="language-plaintext highlighter-rouge">./seafile/conf/seafdav.conf</code> to
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[WEBDAV]
enabled = true
port = 8080
fastcgi = false
share_name = /seafdav
</code></pre></div> </div>
<p>and add the following section to your nginx configuration</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># webdav
location /seafdav {
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Host $server_name;
client_max_body_size 0;
proxy_connect_timeout 36000s;
proxy_read_timeout 36000s;
proxy_send_timeout 36000s;
send_timeout 36000s;
proxy_request_buffering off;
}
</code></pre></div> </div>
</li>
</ul>
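<p>A quick way to check the WebDAV endpoint is a directory listing with <code>curl</code>; the user name and host below are placeholders.</p>

```shell
# List the top-level WebDAV collection; an HTTP 207 multi-status response
# enumerating your libraries indicates that seafdav works behind nginx.
curl -u 'user@example.com' -X PROPFIND -H 'Depth: 1' \
     https://myname.semanticlab.net/seafdav/
```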
<h3 id="port-forwarding-and-split-dns">Port forwarding and split DNS</h3>
<ul>
<li>Log into the configuration interface of your router and
<ol>
<li>setup a fixed IP address for your Raspberry Pi</li>
<li>enable port forwarding to forward the following ports to the Raspberry Pi:
<ul>
<li>external 80 to Raspberry 80 (http)</li>
<li>external 443 to Raspberry 443 (https)</li>
</ul>
</li>
<li>optional: if you can access the Raspberry’s web service from the Internet but not from within your network, your router does not support NAT loopback. In this case we need to set up split DNS to ensure that the Raspberry is reachable under the same DNS name from inside your network as well.
<ul>
<li>install unbound with <code class="language-plaintext highlighter-rouge">apt-get install unbound</code></li>
<li>add the following changes to <code class="language-plaintext highlighter-rouge">/etc/unbound/unbound.conf</code> to enable network-wide access to the name server as well as split dns:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># network wide access
interface: 0.0.0.0
# overwrite dns responses
local-zone: myname.semanticlab.net transparent
local-data: "myname.semanticlab.net A {your-pi-ip}"
</code></pre></div> </div>
</li>
<li>restart unbound with <code class="language-plaintext highlighter-rouge">service unbound restart</code></li>
<li>change the DNS server on your router to the IP address of your Pi
</li>
</ul>
</li>
</ol>
</li>
</ul>
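<p>Split DNS can be verified by comparing the answer of the local resolver with that of a public one; internally the name should resolve to the Pi's private address.</p>

```shell
# Internal view (unbound on the Pi) vs. external view (public resolver):
dig +short myname.semanticlab.net @{your-pi-ip}
dig +short myname.semanticlab.net @9.9.9.9
```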
<h3 id="https-with-letsencryt">HTTPS with Let’s Encrypt</h3>
<p>Install certbot with <code class="language-plaintext highlighter-rouge">apt-get install python-certbot-nginx</code> and then follow the instructions on the <a href="https://certbot.eff.org/">EFF Certbot page</a>.</p>
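<p>In the simplest case, obtaining and installing a certificate for the dynamic DNS name is a one-liner; certbot also configures automatic renewal, which can be checked with a dry run. The domain below is the example name from above.</p>

```shell
# Obtain a certificate and let certbot adapt the nginx configuration.
certbot --nginx -d myname.semanticlab.net
# Verify that automatic renewal will work.
certbot renew --dry-run
```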
<h3 id="references">References</h3>
<ol>
<li><a href="https://hackernoon.com/raspberry-pi-headless-install-462ccabd75d0">Headless Raspberry Pi Setup</a></li>
<li><a href="http://whitehorseplanet.org/gate/topics/documentation/public/howto_ext4_to_f2fs_root_partition_raspi.html">Howto: Replace the micro SD card’s ext4 partition with f2fs</a></li>
</ol>Albert WeichselbraunThe Raspberry Pi is operated from at home keeping noise and power consumption in mind.Deploying third-party artifacts to a local repository with WebDAV2018-02-13T00:00:00+01:002018-02-13T00:00:00+01:00https://semanticlab.net/maven/java/Maven-upload-third-party-resources-to-repository<p>This guide outlines how to deploy third party jars to a local repository over WebDAV. Using WebDAV requires
(i) setting up the login data of the WebDAV repository and
(ii) providing a <strong>current</strong> WebDAV wagon extension to Maven.</p>
<ol>
<li>Configure the WebDAV repository in <code class="language-plaintext highlighter-rouge">~/.m2/settings.xml</code>.
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><settings</span> <span class="na">xmlns=</span><span class="s">"http://maven.apache.org/SETTINGS/1.0.0"</span>
<span class="na">xmlns:xsi=</span><span class="s">"http://www.w3.org/2001/XMLSchema-instance"</span>
<span class="na">xsi:schemaLocation=</span><span class="s">"http://maven.apache.org/SETTINGS/1.0.0
http://maven.apache.org/xsd/settings-1.0.0.xsd"</span><span class="nt">></span>
<span class="nt"><servers></span>
<span class="nt"><server></span>
<span class="nt"><id></span>mywebdavserver<span class="nt"></id></span>
<span class="nt"><username></span>user<span class="nt"></username></span>
<span class="nt"><password></span>***<span class="nt"></password></span>
<span class="nt"></server></span>
<span class="nt"></servers></span>
<span class="nt"></settings></span>
</code></pre></div> </div>
</li>
<li>Create a <em>dummy</em> pom file which provides Maven with information on the required WebDAV wagon:
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><project></span>
<span class="nt"><modelVersion></span>4.0.0<span class="nt"></modelVersion></span>
<span class="nt"><groupId></span>com.example<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>webdav-deploy<span class="nt"></artifactId></span>
<span class="nt"><packaging></span>pom<span class="nt"></packaging></span>
<span class="nt"><version></span>1<span class="nt"></version></span>
<span class="nt"><name></span>Webdav Deploy<span class="nt"></name></span>
<span class="nt"><build></span>
<span class="nt"><extensions></span>
<span class="nt"><extension></span>
<span class="nt"><groupId></span>org.apache.maven.wagon<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>wagon-webdav-jackrabbit<span class="nt"></artifactId></span>
<span class="nt"><version></span>3.0.0<span class="nt"></version></span>
<span class="nt"></extension></span>
<span class="nt"></extensions></span>
<span class="nt"></build></span>
<span class="nt"></project></span>
</code></pre></div> </div>
</li>
<li>Upload the artifact to the repository with <code class="language-plaintext highlighter-rouge">mvn</code>:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn deploy:deploy-file <span class="nt">-Dfile</span><span class="o">=</span><path-to-file> <span class="se">\</span>
<span class="nt">-DgroupId</span><span class="o">=</span><group-id> <span class="se">\</span>
<span class="nt">-DartifactId</span><span class="o">=</span><artifact-id> <span class="se">\</span>
<span class="nt">-Dversion</span><span class="o">=</span><version> <span class="se">\</span>
<span class="nt">-Dpackaging</span><span class="o">=</span><packaging> <span class="se">\</span>
<span class="nt">-DrepositoryId</span><span class="o">=</span>mywebdavserver <span class="se">\</span>
<span class="nt">-Durl</span><span class="o">=</span>dav:<url-to-the-webdav-server>
</code></pre></div> </div>
<p><strong>Example:</strong> deploy the latest libsvm version to our local repository.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn deploy:deploy-file<span class="se">\</span>
<span class="nt">-Dfile</span><span class="o">=</span>libsvm.jar <span class="se">\</span>
<span class="nt">-DgroupId</span><span class="o">=</span>tw.edu.ntu.csie <span class="se">\</span>
<span class="nt">-DartifactId</span><span class="o">=</span>libsvm <span class="se">\</span>
<span class="nt">-Dversion</span><span class="o">=</span>3.22 <span class="se">\</span>
<span class="nt">-Dpackaging</span><span class="o">=</span>jar <span class="se">\</span>
<span class="nt">-DrepositoryId</span><span class="o">=</span>mywebdavserver <span class="se">\</span>
<span class="nt">-Durl</span><span class="o">=</span>dav:http://semanticlab.net/deploy/
</code></pre></div> </div>
</li>
</ol>
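<p>Whether the upload worked can be checked by resolving the freshly deployed artifact from the repository; the coordinates and repository URL below mirror the libsvm example above.</p>

```shell
# Fetch the deployed artifact into the local Maven repository.
mvn dependency:get -Dartifact=tw.edu.ntu.csie:libsvm:3.22 \
    -DremoteRepositories=http://semanticlab.net/deploy/
```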
<h3 id="literature">Literature</h3>
<ul>
<li><a href="https://maven.apache.org/guides/mini/guide-3rd-party-jars-remote.html">Guide to deploying 3rd party JARs to remote repository</a></li>
<li><a href="https://www.chrissearle.org/2008/02/10/Deploying_jars_to_third_party_maven_repository_via_WebDAV/">Deploying jars to third party maven repository via WebDAV</a></li>
</ul>Albert WeichselbraunThis guide outlines how to deploy third party jars to a local repository over WebDAV. Using WebDAV requires (i) setting up the login data of the WebDAV repository and (ii) providing a current Webdav wagon extension to maven.