Ultimate Regular Expression for HTML tag parsing with PHP

Disclaimer: this is a dirty hack! To parse HTML or XML, use a dedicated library.
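For comparison, here is a rough sketch of the library route. It assumes PHP's bundled DOM extension (DOMDocument) is enabled; the sample markup and the variable names are illustrative only and do not come from the original post. It reads the link URL and the image URL of an anchor without any regex.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<a href="http://www.example.com/"><img src="logo.jpg"></a>';

// Collect libxml's complaints about sloppy markup instead of emitting warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// For every <a>, record the href together with the src of each nested <img>.
$pairs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    foreach ($a->getElementsByTagName('img') as $img) {
        $pairs[] = [
            'link'  => $a->getAttribute('href'),
            'image' => $img->getAttribute('src'),
        ];
    }
}

print_r($pairs);
```

The regex below remains useful as a quick hack when pulling in a parser is not an option.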

Tonight I found the ultimate regex to get HTML tags out of a string. It was written a year ago by Phil Haack on his blog. His regex is quite bullet-proof: it can match HTML tags that span multiple lines and contain any sort of attribute (with or without a value, with single or double quotes).

Unfortunately his regular expression was designed for Microsoft .NET, so I spent some time converting it to PHP. Here is the result:

$regex = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

And finally, my version based on the one above:

$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

The latter includes the following enhancement:

  • accepts hyphens in the middle of attribute names (thanks Ged)
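A minimal usage sketch, assuming the hyphen-aware pattern above and a made-up sample string (`$html` and the other variable names are mine, not part of the original post): `preg_match_all()` collects every tag the pattern matches, and `preg_replace()` strips them from the string.

```php
<?php
// Pattern copied verbatim from the post (hyphen-aware version).
$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

// Hypothetical sample input, for illustration only.
$html = "<p class=\"intro\">Hello <br/> <a href='index.html' title=home>world</a></p>";

// $matches[0] holds the full text of every matched tag.
preg_match_all($regex, $html, $matches);
$tags = $matches[0];

// Replacing every match with an empty string leaves only the text content.
$text = preg_replace($regex, '', $html);

print_r($tags);
echo $text, PHP_EOL;
```

The sample covers the three attribute styles mentioned earlier: double-quoted, single-quoted and unquoted values.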

31 thoughts on “Ultimate Regular Expression for HTML tag parsing with PHP”

  1. Because of WordPress’s parsing algorithm, an extra space was introduced between $regex = "/< and \/?\w+.... I’ve fixed this and it should be OK now. Can you please retry?

  2. Hello, cheers for this. However I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks. So anyone copy and pasting this should replace it with a ‘proper’ one.

  3. However I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks.

    Good catch! Thanks BazzA, I think you’ve found a WordPress bug. Let me investigate this issue…

  4. However I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks.

    I’ve just fixed the post content.

  5. heh.. this does not catch

    [HTML tags stripped by WordPress]

    for example. and this:

    [HTML tags stripped by WordPress]

    aha, that’s because of the http-equiv attribute.
    But for my needs it fits anyway – thanks a lot :)

  6. Also, I guess it is not a good idea to strip HTML on a forum about HTML; it would be better to convert it to entities, wouldn’t it? ;)

  7. Hi Ger, sorry about the tag deletion, but that’s the default WordPress policy. I think you should convert to entities manually.

    Regarding your regular expression issue, I guess you were talking about something similar to:

    <meta http-equiv="Refresh" content="3600">
    

    So yes, this tag doesn’t match because my regular expression wasn’t considering http-equiv as a valid attribute. This is of course wrong, as the HTML specs obviously allow hyphens in attribute names.

    I’ve updated my regular expression in the blog post.
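The difference can be checked with a small sketch, assuming the two patterns exactly as they appear in the post (the variable names are mine): the first version fails on the hyphenated `http-equiv` attribute, while the updated one matches the whole tag.

```php
<?php
// Both patterns copied verbatim from the post.
$original = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
$updated  = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

$tag = '<meta http-equiv="Refresh" content="3600">';

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$matchedByOriginal = preg_match($original, $tag);
$matchedByUpdated  = preg_match($updated, $tag);

var_dump($matchedByOriginal, $matchedByUpdated);
```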

  8. Which regex pattern can I use to get all HTML anchor tags from an HTML document? I used this one:

    preg_match_all("/<a>[\s-\w+&@#\/%?=~_|!:,.;\"']+/i",$cr,$pm,PREG_SET_ORDER);
    

    It’s working, but it does not match anchors that have images inside, e.g.:

    </a><a href="test.html" rel="nofollow"> </a>
    

    Can someone give me the right one so that I can get all types of anchors from an HTML document?
    Thanks

  9. Pingback: Python ultimate regular expression to catch HTLM tags at Coolkevmen

  10. Hi,

    Thanks for publishing this code, it’s very useful, however I’m trying to catch and remove non-standard tags generated by MS Word, which contain hyphens in the tag (eg ) and colons in the attributes (eg. w:st="on"), and I notice that this code doesn’t currently pick these up. Any chance of an amended version?

    Many thanks

  11. Hi there,

    Can anyone write a good regular expression like this one for Python, to remove all HTML tags? I really need one.

    thanks in advance

  12. advising people to parse it with regular expressions is.. not smart.

    I agree.

    To clarify: this code is far from being a good practice. It’s just a hack intended to get rid of HTML tag soup.

    Now, a little bit of context: PHP is not my language of choice and at the time I wrote this article I didn’t find any PHP library that is tolerant of tag soup. Hence the hack.

    For Python, the language I use every day, I recommend using lxml, especially its lxml.html module. And on that subject, don’t miss Ian Bicking’s post: “lxml: an underappreciated web scraping library”.

  13. I want to get every <a> tag that includes an <img> tag within it.
    e.g.:

    <a href="www.google.com"><img src="google.jpg"></a>
    

    Here I want to get http://www.google.com and google.jpg.
    Please help me with an example of how to do that.

    Simply put, I want to get the “image URL” and the “link URL” when an image is included inside an anchor tag.

    How can I do it?

    thanks
    Nayana Adassuriya

  14. Please provide a regular expression to extract any tag and its attributes… along with any JavaScript present in the attributes…

    Thanks

  15. Pingback: bertelli.name » Blog Archive » HTML hates Regexp

  16. This “ultimate” expression stumbles quite easily on the first:

    ';
        ... any other php code...
    ?>
    

    it encounters. End tags embedded in text are difficult for most regex solutions.

  17. @Cees: I agree, “ultimate” here is exaggerated. I may have been too enthusiastic when I wrote this blog post back in 2007. But I appreciate the defeating use case you’ve caught! :)

  18. Pingback: HTML hates Regexp | Bertelli's Place

  19. Pingback: List of Sites from my recent RegEX Re(lated) Search « Web Development Journal
