Disclaimer: this is a dirty hack! To parse HTML or XML, use a dedicated library.
Tonight I found the ultimate regex to get HTML tags out of a string. It was written a year ago by Phil Haack on his blog. His regex is quite bullet-proof: it’s able to parse HTML tags written on multiple lines which contain any sort of attributes (with or without a value, with single or double quotes).
Unfortunately his regular expression was designed for Microsoft .NET, so I’ve spend some time to convert it to PHP. Here is the result:
$regex = "/<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
And finally, my version based on the one above:
$regex = "/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i";
The latter include the following enhancement:
- accept hyphens as attribute’s middle characters ( thanks Ged)