Tag Archive for 'regexp'

Python ultimate regular expression to catch HTML tags

A year and three months ago I came up with a PHP regexp to parse HTML tag soup. Here is an improved version, in Python (my favorite language so far), that should be much better at detecting strange HTML tags. It also supports attributes without a value, so it’s closer to the HTML specification, but it doesn’t stick to it strictly, in order to catch tag soup and malformed tags.

ultimate_regexp = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"

And here it is applied in a trivial example (in a Python shell):

>>> import re
>>>
>>> content = """This is the <strong>content</strong> in which we want to
<em>find</em> <a href="http://en.wikipedia.org/wiki/Html">HTML</a> tags."""
>>>
>>> ultimate_regexp = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
>>>
>>> for match in re.finditer(ultimate_regexp, content):
...   print(repr(match.group()))
...
'<strong>'
'</strong>'
'<em>'
'</em>'
'<a href="http://en.wikipedia.org/wiki/Html">'
'</a>'
>>>
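
To illustrate the claims about valueless attributes and tag soup, here is a small sketch run against messier input. The sample markup and the names ULTIMATE_REGEXP, samples and tags are made up for this illustration; the pattern is the one shown above, written as a raw string.

```python
import re

# Same pattern as above; the raw string keeps the backslashes intact.
ULTIMATE_REGEXP = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"

# Hypothetical tag soup: an upper-case tag with an unquoted value and a
# valueless attribute, a tag spread over several lines, and a self-closing tag.
samples = """<INPUT type=checkbox checked>
<img
   src='logo.png'
   alt="Logo" />
<br/>"""

tags = [m.group() for m in re.finditer(ULTIMATE_REGEXP, samples)]
for tag in tags:
    print(repr(tag))
```

Each of the three tags should come back as a single match, including the one that spans several lines, because \s also matches newlines.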

Ultimate Regular Expression for HTML tag parsing with PHP

Disclaimer: this is a dirty hack! To parse HTML or XML, use a dedicated library.
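
As a rough idea of what the dedicated-library route can look like, here is a minimal sketch using html.parser from Python's standard library (Python rather than PHP, purely for illustration). The TagCollector class and the sample markup are hypothetical and not part of the original post.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records every start and end tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag, {}))


collector = TagCollector()
collector.feed('<p class="intro">Some <em>tag</em> soup</p>')
collector.close()

for entry in collector.tags:
    print(entry)
```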

Tonight I found the ultimate regex to get HTML tags out of a string. It was written a year ago by Phil Haack on his blog. His regex is quite bullet-proof: it’s able to parse HTML tags written on multiple lines which contain any sort of attributes (with or without a value, with single or double quotes).

Unfortunately his regular expression was designed for Microsoft .NET, so I spent some time converting it to PHP. Here is the result:

$regex = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

And finally, my version based on the one above:

$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

The latter includes the following enhancement:

  • accepts hyphens in the middle of attribute names (thanks Ged)
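
The effect of the hyphen enhancement can be checked with a quick sketch. It ports both PHP patterns above to Python, assuming the /…/i delimiters translate to re.IGNORECASE; the names BASE and HYPHEN and the sample tag are invented for this illustration.

```python
import re

# First PHP pattern: attribute names are limited to \w+.
BASE = re.compile(
    r"<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>",
    re.IGNORECASE,
)

# Second PHP pattern: attribute names may contain hyphens in the middle.
HYPHEN = re.compile(
    r"<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>",
    re.IGNORECASE,
)

# Hypothetical sample with a hyphenated attribute name.
tag = '<meta http-equiv="refresh" content="5">'

base_match = BASE.search(tag)
hyphen_match = HYPHEN.search(tag)

print(base_match)            # the first pattern finds nothing here
print(hyphen_match.group())  # the second pattern matches the whole tag
```

The first pattern gives up on the whole tag because it cannot get past the hyphen in http-equiv, while the second one accepts it.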