A year and three months ago I came up with a PHP regexp to parse HTML tag soup. Here is an improved version, in Python (my favorite language so far), that should be much better at detecting strange HTML tags. It also supports attributes without a value, so it’s closer to the HTML specification, but it doesn’t strictly stick to it, in order to catch tag soup and malformed tags.
ultimate_regexp = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
And here it is applied to a trivial example (in a Python shell):
>>> import re
>>>
>>> content = """This is the <strong>content</strong> in which we want to <em>find</em> <a href="http://en.wikipedia.org/wiki/Html">HTML</a> tags."""
>>>
>>> ultimate_regexp = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"
>>>
>>> for match in re.finditer(ultimate_regexp, content):
...     print(repr(match.group()))
...
'<strong>'
'</strong>'
'<em>'
'</em>'
'<a href="http://en.wikipedia.org/wiki/Html">'
'</a>'
>>>
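The claim about valueless attributes and tag soup can be checked with a slightly messier input. This is only a minimal sketch: the `soup` string and the `tags` name are made up for illustration, and the pattern is the one above written as a raw string.

```python
import re

# Same pattern as above; the raw string keeps the backslashes intact for
# the regex engine.
ultimate_regexp = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"

# Hypothetical tag soup: an attribute without a value, an unquoted value,
# self-closing tags, and mixed quote styles.
soup = "<INPUT disabled><td width=100>cell</td><br/><img src='a.png' alt=\"\" />"

# Collect every tag the pattern recognises, in document order.
tags = [match.group() for match in re.finditer(ultimate_regexp, soup)]

for tag in tags:
    print(repr(tag))
```

Every tag in the sample should come back as its own match, while the text between the tags (`cell`) is left out.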
I adapted your ultimate regex to parse malformed downloaded XML files that had unexpected spaces in the tags.
for match in re.finditer(regexp, data):
    data = data.replace(match.group(), match.group().replace(' ', ''))

Since I know the XML schema I’m downloading does not contain spaces in the tags, this worked quite nicely.
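The same cleanup can probably be done in a single pass with `re.sub` and a replacement function. This is a sketch, not the commenter's actual code: the helper name `strip_spaces_in_tags` and the sample `data` are hypothetical, and it assumes, as above, that no tag or attribute in the schema legitimately needs a space.

```python
import re

# The pattern from the post, as a raw string.
regexp = r"(?i)<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"


def strip_spaces_in_tags(data):
    """Remove every space character inside each matched tag, in one pass.

    Text between tags is left alone because only the matched tags are
    passed to the replacement function.
    """
    return re.sub(regexp, lambda match: match.group().replace(" ", ""), data)


# Hypothetical malformed input: spaces that should not be in the tag names.
data = "<my tag>some value</my tag><item >x</item>"
cleaned = strip_spaces_in_tags(data)
print(cleaned)
```

Note that this would also glue real attributes onto the tag name (`<a href=...>` becomes `<ahref=...>`), so it only makes sense for a schema without attributes.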
Seems like it doesn’t catch doctype tags (<!DOCTYPE) and comments (<!--); otherwise it works great.
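One possible way to cover those two cases is to put two extra alternatives in front of the original pattern. This is a rough sketch rather than part of the original regexp: the names `COMMENT`, `DOCTYPE`, `TAG` and `extended_regexp` are invented here, the inline `(?i)` is replaced by the `re.IGNORECASE` flag so the pieces can be joined, and a doctype with an internal subset containing `>` would be cut short.

```python
import re

# Assumed extra alternatives (not in the original post).
COMMENT = r"<!--[\s\S]*?-->"   # lazy match up to the first "-->"
DOCTYPE = r"<!DOCTYPE[^>]*>"   # everything up to the next ">"

# The original pattern, minus its inline (?i) flag.
TAG = r"<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"

# Comments come first so that tags inside a comment are not matched separately.
extended_regexp = re.compile("|".join([COMMENT, DOCTYPE, TAG]), re.IGNORECASE)

# Hypothetical sample document.
content = '<!DOCTYPE html><!-- a <b>note</b> --><p class="x">Hi</p>'

found = [match.group() for match in extended_regexp.finditer(content)]
for item in found:
    print(repr(item))
```

With this ordering a comment is returned as one match, including any markup it contains.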