Ultimate Regular Expression for HTML tag parsing with PHP

Disclaimer: this is a dirty hack! To parse HTML or XML, use a dedicated library.

Tonight I found the ultimate regex to get HTML tags out of a string. It was written a year ago by Phil Haack on his blog. His regex is quite bullet-proof: it matches HTML tags that span multiple lines and contain any sort of attribute (with or without a value, with single or double quotes).

Unfortunately, his regular expression was designed for Microsoft .NET, so I’ve spent some time converting it to PHP. Here is the result:

$regex = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

And finally, my version based on the one above:

$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

The latter includes the following enhancement:

  • accepts hyphens in the middle of attribute names (thanks Ged)
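
As a quick illustration of how the pattern might be used, here is a minimal sketch that feeds it to preg_match_all(). The sample markup and the variable names $html and $tags are made up for this example; only the regex itself comes from the post.

```php
<?php
// The hyphen-aware pattern from the post, unchanged.
$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

// Hypothetical input: a quoted attribute, an unquoted one, valueless
// attributes, and a tag that continues on a second line.
$html = "<p class='intro'>Hello <input type=checkbox checked\n disabled /> <b>world</b></p>";

// $matches[0] holds every full tag that was found, in document order.
preg_match_all($regex, $html, $matches);
$tags = $matches[0];

foreach ($tags as $tag) {
    echo $tag, PHP_EOL;
}
```

With this input the sketch should find five tags: the opening p, the two-line input, and the b, /b and /p tags.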

22 Responses to “Ultimate Regular Expression for HTML tag parsing with PHP”


  • sorry, but isn’t working.

  • Because of the Wordpress parsing algorithm, an extra space was introduced between $regex = "/< and \/?\w+.... I’ve fixed this and it should be OK now. Can you retry, please?

  • Hello, cheers for this. However, I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks. So anyone copying and pasting this should replace it with a ‘proper’ one.

  • However I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks.

    Good catch! Thanks BazzA, I think you’ve found a Wordpress bug. Let me investigate this issue…

  • However I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks.

    I’ve just fixed the post content.

  • BTW, if you need to see a real example of this regexp in use, read the importImagesFromPost() function from my Wordpress to e107 v0.8 script. You will find there an adaptation of the regexp featured in this article.

  • heh.. this does not catch

    [HTML tags stripped by Wordpress]

    for example. and this:

    [HTML tags stripped by Wordpress]

    aha, that’s because attribute http-equiv
    But for my needs it fits anyway – thanks a lot :)

  • Also, I guess it’s not a good idea to strip HTML on a forum about HTML – converting it to entities would be better, wouldn’t it? ;)

  • Hi Ger, sorry about the tag deletion, but that’s the default Wordpress policy. I think you should convert to entities manually.

    Regarding your regular expression issue, I guess you were talking about something similar to:

    <meta http-equiv="Refresh" content="3600">
    

    So yes, this tag doesn’t match because my regular expression wasn’t considering http-equiv as a valid attribute. This is of course wrong, as the HTML specs obviously allow hyphens in attribute names.

    I’ve updated my regular expression in the blog post.
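
The difference between the two patterns in the post can be checked directly against the meta tag quoted above. This is only a small sketch; the variable names $old, $new, $oldResult and $newResult are chosen for this example.

```php
<?php
// First pattern from the post: attribute names are plain \w+.
$old = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
// Second pattern from the post: attribute names may contain inner hyphens.
$new = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

$tag = '<meta http-equiv="Refresh" content="3600">';

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$oldResult = preg_match($old, $tag);
$newResult = preg_match($new, $tag);

echo "old pattern: $oldResult, new pattern: $newResult", PHP_EOL;
```

The first pattern should report no match, because \w+ stops at the hyphen in http-equiv, while the updated one should match the whole tag.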

  • Pakistan Peshawar

    Which regex pattern can I use to get all HTML anchor tags from an HTML document? I used this one:

    preg_match_all("/<a>[\s-\w+&@#\/%?=~_|!:,.;\"']+/i",$cr,$pm,PREG_SET_ORDER);
    

    It’s working, but it doesn’t parse anchors that have images inside, e.g.:

    </a><a href="test.html" rel="nofollow"> </a>
    

    Can someone give me the right one so that I can get all types of anchors from an HTML document?
    Thanks

  • Please tell me the regular expression for the br tag.

  • Hi,

    Thanks for publishing this code, it’s very useful, however I’m trying to catch and remove non-standard tags generated by MS Word, which contain hyphens in the tag (eg ) and colons in the attributes (eg. w:st="on"), and I notice that this code doesn’t currently pick these up. Any chance of an amended version?

    Many thanks
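
One possible amendment, offered only as an untested-against-real-Word-output sketch: allow colons and hyphens inside both tag names and attribute names. The pattern $wordRegex and the sample markup below are assumptions made for this example, not part of the original post.

```php
<?php
// Variation on the post's pattern (an assumption, not the author's version):
// tag names become \w[\w:-]* and attribute names accept ':' as well as '-'.
$wordRegex = "/<\/?\w[\w:-]*((\s+(\w|\w[\w:-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

// Hypothetical Word-style markup, made up for illustration.
$input = '<st1:City w:st="on">Paris</st1:City> <my-tag>x</my-tag>';

// Replace every matched tag with an empty string, keeping only the text.
$clean = preg_replace($wordRegex, '', $input);

echo $clean, PHP_EOL;
```

Note that this removes every tag it matches, standard or not, so a real clean-up script would probably need an extra check on the tag name before deleting anything.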

  • Very helpful regex assistance, thank you!

  • Hi there,

    Can anyone write a good regular expression like this one for Python, to remove all HTML tags? I really need one.

    thanks in advance

  • @Hamza: take a look here.

  • Johnny B

    HTML isn’t a regular language, so advising people to parse it with regular expressions is.. not smart.

    See http://htmlparsing.icenine.ca for more information.

  • advising people to parse it with regular expressions is.. not smart.

    I agree.

    To clarify: this code is far from being a good practice. It’s just a hack intended to get rid of HTML tag soup.

    Now, a little bit of context: PHP is not my language of choice, and at the time I wrote this article I didn’t find any PHP library that was tolerant of tag soup. Hence the hack.

    For Python, the language I practice every day, I recommend using lxml, especially its lxml.html module. And on that subject, don’t miss Ian Bicking’s post: “lxml: an underappreciated web scraping library”.

  • You should try Tidy. It’s very forgiving of tag soup, and will allow you to parse the tag soup as an actual DOM (of sorts).

  • Fred-Eric Lafaille
    Warning: ereg_replace() [function.ereg-replace]: REG_BADRPT
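
A likely cause of this warning: the pattern in the post is written in Perl-compatible syntax for PHP's preg_* functions, whereas ereg_replace() expects POSIX syntax and will probably reject constructs such as .*? and (?:...). The ereg family was also deprecated in PHP 5.3 and removed in PHP 7. A minimal sketch of the same replacement with preg_replace(), using made-up sample markup:

```php
<?php
// Pattern from the post, including its / delimiters and the i flag,
// which the preg_* functions require and ereg_* does not understand.
$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";

// Hypothetical input for illustration.
$html = '<p>Some <em class="x">text</em></p>';

// Strip every matched tag and keep the text between them.
$text = preg_replace($regex, '', $html);

echo $text, PHP_EOL;
```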
    
  • Which part allows hyphens? I just need that part.
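
Going by the two patterns in the post, the hyphen support comes from the sub-pattern (\w|\w[\w-]*\w), which replaced the plain \w+ used for attribute names. The sketch below isolates it with ^ and $ anchors, which are added here purely for testing; the helper name isAttributeName() is hypothetical.

```php
<?php
// Attribute-name part of the post's pattern: either a single word character,
// or a word character, any mix of word characters and hyphens, then a final
// word character. Names therefore cannot start or end with a hyphen.
$attrName = '/^(\w|\w[\w-]*\w)$/';

// Hypothetical helper for this example only.
function isAttributeName($pattern, $name)
{
    return preg_match($pattern, $name) === 1;
}

var_dump(isAttributeName($attrName, 'http-equiv')); // inner hyphen
var_dump(isAttributeName($attrName, 'x'));          // single character
var_dump(isAttributeName($attrName, '-data'));      // leading hyphen
var_dump(isAttributeName($attrName, 'data-'));      // trailing hyphen
```

The first two calls should return true and the last two false.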

  • Nayana Adassuriya

    I want to get all the <a> tags that include another tag within them.
    eg:

    <a href="www.google.com"><img src="google.jpg"></a>
    

    Here I want to get http://www.google.com and google.jpg.
    Please help me with how to do that, with an example.

    Simply put, I want to get the “image url” and the “link url” when an image is included inside an anchor tag.

    How can I do it?

    thanks
    Nayana Adassuriya
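
For this kind of nested extraction, a parser is probably a better fit than the regex, in line with the disclaimer at the top of the post. Below is a minimal sketch using PHP's bundled DOM extension, assuming it is enabled (it is by default) and PHP 5.4 or later for the short array syntax. The $pairs structure and its key names are choices made for this example.

```php
<?php
// Sample markup taken from the comment above.
$html = '<a href="www.google.com"><img src="google.jpg"></a>';

$doc = new DOMDocument();
// Tag soup triggers libxml warnings; collect them quietly instead of printing.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// For each anchor, record its href next to the src of every image inside it.
$pairs = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    foreach ($anchor->getElementsByTagName('img') as $img) {
        $pairs[] = [
            'link'  => $anchor->getAttribute('href'),
            'image' => $img->getAttribute('src'),
        ];
    }
}

print_r($pairs);
```

The href comes back exactly as written in the markup (www.google.com here), so turning it into an absolute URL would be a separate step.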

  • Please provide regular expression to extract any tag and attributes… along with javascript present in attributes…

    Thanks
