<p>Python Ultimate Regular Expression to Catch <span class="caps">HTML</span> Tags</p>
<p>Kevin Deldycke, 2008-07-08</p>
<p><strong>Disclaimer</strong>: this is a dirty hack! To parse <span class="caps">HTML</span> or <span class="caps">XML</span>, use a dedicated library like the good old <a href="https://pypi.python.org/pypi/beautifulsoup4"><code>BeautifulSoup</code></a> or <a href="https://lxml.de/lxmlhtml.html"><code>lxml.html</code></a>.</p>
<p>One year and three months ago I came up with a <a href="https://kevin.deldycke.com/2007/03/ultimate-regular-expression-for-html-tag-parsing-with-php/"><span class="caps">PHP</span> regexp to parse <span class="caps">HTML</span> tag soup</a>.</p>
<p>Here is an improved version, in Python (my favorite language so far), that should be much more likely to catch strange <span class="caps">HTML</span> tags. 
It also supports attributes without a value, so it&rsquo;s closer to the <a href="https://www.w3.org/TR/REC-html40/"><span class="caps">HTML</span> specification</a>, but it doesn&rsquo;t stick strictly to it, in order to catch <a href="https://en.wikipedia.org/wiki/Tag_soup">tag soup</a> and malformed&nbsp;tags.</p>
<div class="highlight"><pre><span></span><span class="n">ultimate_regexp</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&quot;(?i)&lt;\/?\w+((\s+\w+(\s*=\s*(?:</span><span class="se">\&quot;</span><span class="s2">.*?</span><span class="se">\&quot;</span><span class="s2">|&#39;.*?&#39;|[^&#39;</span><span class="se">\&quot;</span><span class="s2">&gt;\s]+))?)+\s*|\s*)\/?&gt;&quot;</span>
</pre></div>
<p>And here it is applied to a trivial example (in a Python&nbsp;shell):</p>
<div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">re</span>
<span class="go">&gt;&gt;&gt;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">content</span> <span class="o">=</span> <span class="s2">&quot;&quot;&quot;This is the &lt;strong&gt;content&lt;/strong&gt; in which we want to</span>
<span class="go">&lt;em&gt;find&lt;/em&gt; &lt;a href=&quot;https://en.wikipedia.org/wiki/Html&quot;&gt;HTML&lt;/a&gt; tags.&quot;&quot;&quot;</span>
<span class="go">&gt;&gt;&gt;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">ultimate_regexp</span> <span class="o">=</span> <span class="sa">r</span><span class="s2">&quot;(?i)&lt;\/?\w+((\s+\w+(\s*=\s*(?:</span><span class="se">\&quot;</span><span class="s2">.*?</span><span class="se">\&quot;</span><span class="s2">|&#39;.*?&#39;|[^&#39;</span><span class="se">\&quot;</span><span class="s2">&gt;\s]+))?)+\s*|\s*)\/?&gt;&quot;</span>
<span class="go">&gt;&gt;&gt;</span>
<span class="gp">&gt;&gt;&gt; </span><span class="k">for</span> <span class="n">match</span> <span class="ow">in</span> <span class="n">re</span><span class="o">.</span><span class="n">finditer</span><span class="p">(</span><span class="n">ultimate_regexp</span><span class="p">,</span> <span class="n">content</span><span class="p">):</span>
<span class="gp">... </span>    <span class="nb">print</span><span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">match</span><span class="o">.</span><span class="n">group</span><span class="p">()))</span>
<span class="gp">...</span>
<span class="go">&#39;&lt;strong&gt;&#39;</span>
<span class="go">&#39;&lt;/strong&gt;&#39;</span>
<span class="go">&#39;&lt;em&gt;&#39;</span>
<span class="go">&#39;&lt;/em&gt;&#39;</span>
<span class="go">&#39;&lt;a href=&quot;https://en.wikipedia.org/wiki/Html&quot;&gt;&#39;</span>
<span class="go">&#39;&lt;/a&gt;&#39;</span>
<span class="go">&gt;&gt;&gt;</span>
</pre></div>
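<p>The pattern is dense, so here is a minimal sketch that spells out the same expression with <code>re.VERBOSE</code>, one commented piece per line, and tries it on a few messier tags: an unquoted value, an attribute without a value, a self-closing tag, and a quoted value containing <code>&gt;</code>. The names <code>TAG_RE</code>, <code>find_tags</code> and <code>sample</code> are illustrative assumptions and are not part of the original post.</p>

```python
import re

# Same pattern as in the post, split up with re.VERBOSE so each part can be
# commented. TAG_RE and find_tags are hypothetical names used for illustration.
TAG_RE = re.compile(
    r"""
    <\/?\w+                     # "<", an optional "/", then the tag name
    (
        (
            \s+\w+              # whitespace, then an attribute name
            (
                \s*=\s*         # "=" with optional whitespace around it
                (?:
                    \".*?\"     # a double-quoted value
                  | '.*?'       # a single-quoted value
                  | [^'\">\s]+  # an unquoted value
                )
            )?                  # the "= value" part is optional
        )+                      # one or more attributes
        \s*
      | \s*                     # or no attribute at all
    )
    \/?>                        # optional "/" of a self-closing tag, then ">"
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_tags(text):
    """Return every substring of text that the pattern accepts as a tag."""
    return [match.group() for match in TAG_RE.finditer(text)]


sample = """<input type=checkbox checked> <br/> <IMG SRC='a.png' alt="x > y">"""

for tag in find_tags(sample):
    print(tag)
```

<p>One limitation that follows directly from the pattern: attribute names are matched with <code>\w+</code>, which does not include <code>-</code>, so a tag such as <code>&lt;div data-x="1"&gt;</code> is not caught. That is one more reason to prefer a real parser, as the disclaimer says.</p>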