Whilst working on a script for my GetProThemes app recently, I came across a problem with PHP’s loadHTML and loadHTMLFile methods.
The problem
I noticed that when using loadHTMLFile to parse an HTML document, the character encoding (UTF-8 in this case) was not being taken into consideration. As a result, content extracted from the document came out as mojibake. Here is an example of the problem:
$i18n_str = 'Iñtërnâtiônàlizætiøn';

$html = <<<EOS
<!doctype html>
<head>
<meta charset="UTF-8">
<title>html 5 document</title>
</head>
<body>
<h1 id="title">$i18n_str</h1>
</body>
</html>
EOS;

$dom = new DOMDocument();
$dom->loadHTML( $html );

echo $dom->getElementById( 'title' )->textContent;
// output: Iñtërnâtiônà lizætiøn
After some digging into the PHP source code, I discovered that this method, along with loadHTML, relies on Libxml to determine the character set of the HTML document automatically. Libxml uses a function named htmlCheckEncoding for this purpose, which looks for a meta tag declaring the character set. Unfortunately, it only looks for the HTML4-style declaration:
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
This means that if your source document is HTML5, it will not pick up the newer meta tag declaration which has this form:
<meta charset="utf-8">
It seems that this glitch has been fixed in version 2.8.0 of Libxml, but if you are stuck with an older version then I have created a workaround.
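To see the difference for yourself, here is a minimal sketch that feeds plain DOMDocument the same kind of markup, but with the HTML4-style declaration. The markup and variable names are my own illustration; on an affected Libxml version this variant should come through intact while the HTML5 variant does not.

```php
<?php
// Same kind of document as the failing example, but declaring the
// character set with the HTML4-style http-equiv meta tag.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '<title>html 4 style declaration</title></head>'
      . '<body><h1 id="title">Iñtërnâtiônàlizætiøn</h1></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// With the declaration recognised, the text should be decoded as UTF-8.
$title = $dom->getElementById('title')->textContent;
echo $title, "\n";
```

This assumes the script file itself is saved as UTF-8.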
The solution
I have created drop-in replacements for the loadHTML/loadHTMLFile methods which automatically convert an HTML5 character set declaration, if it exists, into an HTML4 character set declaration, thus allowing Libxml to parse the document correctly.
Fixing the above example is trivial:
require_once 'DOMDocumentCharset.php';

$i18n_str = 'Iñtërnâtiônàlizætiøn';

$html = <<<EOS
<!doctype html>
<head>
<meta charset="UTF-8">
<title>html 5 test</title>
</head>
<body>
<h1 id="title">$i18n_str</h1>
</body>
</html>
EOS;

$dom = new DOMDocumentCharset();
$dom->loadHTMLCharset( $html );

echo $dom->getElementById( 'title' )->textContent;
// output: Iñtërnâtiônàlizætiøn
So, the fix involves:
1. Including the DOMDocumentCharset class
2. Instantiating DOMDocumentCharset rather than DOMDocument
3. Calling the new loadHTMLCharset method
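If you are curious what the conversion step might look like, here is a rough sketch of the idea. It is not the code from the DOMDocumentCharset class: the function name convertHtml5CharsetMeta and the regular expression are hypothetical, and the pattern only handles the simple case where charset is the first attribute of the meta tag.

```php
<?php
/**
 * Hypothetical helper (not the actual DOMDocumentCharset code): rewrites an
 * HTML5 <meta charset="..."> tag into the HTML4 http-equiv form.
 */
function convertHtml5CharsetMeta($html)
{
    // Matches e.g. <meta charset="utf-8">, <meta charset=utf-8> or
    // <meta charset='utf-8' />, case-insensitively.
    $pattern = '/<meta\s+charset\s*=\s*["\']?([a-z0-9_:.-]+)["\']?\s*\/?>/i';

    // Replace only the first match; an existing HTML4-style declaration
    // does not match the pattern and is left untouched.
    return preg_replace(
        $pattern,
        '<meta http-equiv="Content-Type" content="text/html; charset=$1">',
        $html,
        1
    );
}

$html5 = '<!doctype html><html><head><meta charset="UTF-8">'
       . '<title>t</title></head><body></body></html>';

$converted = convertHtml5CharsetMeta($html5);
echo $converted, "\n";
```

The converted string can then be passed to the regular loadHTML method. A regular expression is used here because the document cannot be parsed reliably until the declaration has been rewritten.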
The class will only activate the workaround if the installed Libxml version is less than 2.8.0, so upgrading Libxml will not break this code.
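A version check along these lines can be sketched with PHP's built-in LIBXML_DOTTED_VERSION constant and version_compare. The function name needsCharsetWorkaround is hypothetical and the class may well implement its check differently.

```php
<?php
/**
 * Hypothetical helper: reports whether a Libxml version predates 2.8.0,
 * the release in which the HTML5 declaration is recognised.
 * Defaults to the version PHP was built against.
 */
function needsCharsetWorkaround($libxmlVersion = LIBXML_DOTTED_VERSION)
{
    return version_compare($libxmlVersion, '2.8.0', '<');
}

// Check the installed version.
echo needsCharsetWorkaround() ? "workaround needed\n" : "workaround not needed\n";
```

Passing the version as a parameter makes the check easy to try against specific version strings such as '2.7.8'.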
The source code can be found on GitHub: dom-document-charset