Disclaimer: this is a dirty hack! To parse HTML or XML, use a dedicated library.
Tonight I found the ultimate regex to get HTML tags out of a string. It was written a year ago by Phil Haack on his blog. His regex is quite bullet-proof: it’s able to parse HTML tags written on multiple lines which contain any sort of attributes (with or without a value, with single or double quotes).
Unfortunately his regular expression was designed for Microsoft .NET, so I’ve spent some time converting it to PHP. Here is the result:
$regex = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
And finally, my version based on the one above:
$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
The latter includes the following enhancement:
- accepts hyphens in the middle of attribute names (thanks Ged)
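To see what the hyphen change does in practice, here is a rough sketch that ports both patterns from PHP to Python and runs them against a meta tag with an http-equiv attribute. This is a port for illustration, not the original PHP: the names VALUE, NAME_PLAIN, NAME_HYPHEN, tag_regex and find_tags are made up here, and the group around the attribute name is non-capturing in this version.

```python
import re

# Attribute value: double-quoted, single-quoted, or unquoted.
VALUE = r"""(?:".*?"|'.*?'|[^'">\s]+)"""

# Attribute name as in the first pattern: word characters only.
NAME_PLAIN = r"\w+"
# Attribute name as in the second pattern: hyphens allowed in the middle.
NAME_HYPHEN = r"(?:\w|\w[\w-]*\w)"


def tag_regex(name):
    """Build the tag-matching pattern around a given attribute-name sub-pattern."""
    return re.compile(
        r"</?\w+((\s+" + name + r"(\s*=\s*" + VALUE + r")?)+\s*|\s*)/?>",
        re.IGNORECASE,
    )


def find_tags(regex, html):
    """Return the full text of every tag the pattern matches."""
    return [m.group(0) for m in regex.finditer(html)]


html = '<meta http-equiv="refresh" content="5"><br />'
print(find_tags(tag_regex(NAME_PLAIN), html))
print(find_tags(tag_regex(NAME_HYPHEN), html))
```

With the first pattern only the br tag is found, because http-equiv is not accepted as an attribute name. With the second pattern both tags are found.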

Sorry, but it isn’t working.
Because of Wordpress’s parsing algorithm, an extra space was introduced between $regex = "/< and \/?\w+... in the post. I’ve fixed this and it should be OK now. Can you retry please?
Hello, cheers for this. However, I’ve noticed the last quote of the regex (ending the string) is one of those horrible Word-style quote marks. So anyone copying and pasting this should replace it with a ‘proper’ one.
Good catch! Thanks BazzA, I think you’ve found a Wordpress bug. Let me investigate this issue…
I’ve just fixed the post content.
BTW, if you need to see a real example use of this regexp, read the importImagesFromPost() function from my Wordpress to e107 v0.8 script. You will find there an adaptation of the regexp featured in this article.
heh.. this does not catch
[HTML tags stripped by Wordpress]
for example. and this:
[HTML tags stripped by Wordpress]
aha, that’s because of the attribute http-equiv
But for my needs it fits anyway – thanks a lot
Also, I guess it is not a good idea to strip HTML on a forum about HTML – it would be better to convert it to entities, wouldn’t it?
Hi Ger, sorry about the tag deletion, but that’s default Wordpress policy. I think you should convert to entities manually.
Regarding your regular expression issue, I guess you were talking about something similar to:
So yes, this tag doesn’t match because my regular expression wasn’t considering http-equiv as a valid attribute. This is of course wrong, as the HTML specs obviously allow hyphens in attribute names.
I’ve updated my regular expression in the blog post.
Which regex pattern can I use to get all HTML anchor tags from an HTML document? I used this one:
preg_match_all("/<a>[\s-\w+&@#\/%?=~_|!:,.;\"']+/i",$cr,$pm,PREG_SET_ORDER);
It’s working, but it is not parsing anchors that have images inside, e.g.:
Can someone give me the right one so that I can get all types of anchors from an HTML document?
thanks
for br tag tell the regular expression plzzzzzzz
Hi,
Thanks for publishing this code, it’s very useful. However, I’m trying to catch and remove non-standard tags generated by MS Word, which contain hyphens in the tag (eg ) and colons in the attributes (e.g. w:st="on"), and I notice that this code doesn’t currently pick these up. Any chance of an amended version?
Many thanks
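One possible amendment for the Word case, sketched in Python rather than PHP and not tried on real Word output: let both tag names and attribute names contain colons and hyphens in the middle. The NAME sub-pattern, the WORD_TAG constant and the strip_word_tags helper are assumptions made up for this illustration, and the sample tags are guesses at what Word emits.

```python
import re

# Attribute value: double-quoted, single-quoted, or unquoted.
VALUE = r"""(?:".*?"|'.*?'|[^'">\s]+)"""

# Tag or attribute name: starts and ends with a word character,
# with colons and hyphens allowed in between (e.g. w:st, o:p).
NAME = r"\w(?:[\w:-]*\w)?"

WORD_TAG = re.compile(
    r"</?" + NAME + r"((\s+" + NAME + r"(\s*=\s*" + VALUE + r")?)+\s*|\s*)/?>",
    re.IGNORECASE,
)


def strip_word_tags(html):
    """Remove every tag the amended pattern matches, keeping the text."""
    return WORD_TAG.sub("", html)


sample = '<st1:place w:st="on">London</st1:place><o:p></o:p>'
print(strip_word_tags(sample))
```

The same NAME sub-pattern could be dropped into the PHP regex in place of \w+ for the tag name and (\w|\w[\w-]*\w) for the attribute name, but that substitution is untested here.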
Very helpful regex assistance, thank you!
Hi there,
Can anyone write a good regular expression like this one for Python, to remove all HTML tags? I really need one,
thanks in advance
@Hamza: take a look here.
HTML isn’t a regular language, so advising people to parse it with regular expressions is.. not smart.
See http://htmlparsing.icenine.ca for more information.
I agree.
To clarify: this code is far from being a good practice. It’s just a hack intended to get rid of HTML tag soup.
Now, a little bit of context: PHP is not my language of choice, and at the time I wrote this article I didn’t find any PHP library that was tolerant of tag soup. Hence the hack.
For Python, the language I practice every day, I recommend using lxml, especially its lxml.html module. And on that subject, don’t miss Ian Bicking’s post: “lxml: an underappreciated web scraping library”.
You should try Tidy. It’s very forgiving of tag soup, and will allow you to parse the tag soup as an actual DOM (of sorts).
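As a rough illustration of the parser-based approach recommended above: lxml and Tidy are separate installs, so this sketch swaps in Python's standard-library html.parser module instead. It is not the lxml API. The class name TextExtractor and the function strip_tags are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags('<p class="x">Hello <b>world</b> &amp; co</p>'))
print(strip_tags('<p>unclosed <b>bold'))
```

Unlike the regex, this tolerates unclosed tags. Note that it also returns the contents of script and style elements as text, so those would need separate handling.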
Which part allows hyphens? I just need that part.
I want to get all the <a> tags that include an image tag within them, e.g.:
Here I want to get http://www.google.com and google.jpg. Please help me with how I can do that, with some example.
Simply, I want to get the “image url” and “link url” when an image is included inside an anchor tag.
How can I do it?
thanks
Nayana Adassuriya
Please provide regular expression to extract any tag and attributes… along with javascript present in attributes…
Thanks