Terence Eden’s Blog
<p><strong>Pretty Print HTML using PHP 8.4's new HTML DOM</strong></p>
<p><a href="https://shkspr.mobi/blog/2025/03/pretty-print-html-using-php-8-4s-new-html-dom/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://shkspr.mobi/blog/2025/03/pretty-print-html-using-php-8-4s-new-html-dom/</a></p>
<p>Those whom the gods would send mad, they first teach recursion.</p>
<p>PHP 8.4 introduces a new <a href="https://www.php.net/manual/en/class.dom-htmldocument.php" rel="nofollow noopener noreferrer" target="_blank">Dom\HTMLDocument class</a>. It is a modern HTML5 replacement for the ageing XHTML-based DOMDocument. You can <a href="https://wiki.php.net/rfc/domdocument_html5_parser" rel="nofollow noopener noreferrer" target="_blank">read more about how it works</a> - the short version is that it reads and correctly sanitises HTML and turns it into a nested object. Hurrah!</p>
<p>The one thing it <em>doesn't</em> do is pretty-printing. When you call <code>$dom->saveHTML()</code> it will output something like:</p>
<pre><code>&lt;html lang="en-GB"&gt;&lt;head&gt;&lt;title&gt;Test&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;Testing&lt;/h1&gt;&lt;main&gt;&lt;p&gt;Some &lt;em&gt;HTML&lt;/em&gt; and an &lt;img src="example.png"&gt;&lt;/p&gt;&lt;ol&gt;&lt;li&gt;List&lt;/li&gt;&lt;li&gt;Another list&lt;/li&gt;&lt;/ol&gt;&lt;/main&gt;&lt;/body&gt;&lt;/html&gt;</code></pre>
<p>Perfect for a computer to read, but slightly tricky for humans.</p>
<p>As was <a href="https://libraries.mit.edu/150books/2011/05/11/1985/" rel="nofollow noopener noreferrer" target="_blank">written by the sages</a>:</p>
<blockquote><p>A computer language is not just a way of getting a computer to perform operations but rather … it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.</p></blockquote>
<p>HTML <em>is</em> a programming language. Making markup easy to read for humans is a fine and noble goal.
The aim is to turn the single line above into something like:</p>
<pre><code>&lt;html lang="en-GB"&gt;
	&lt;head&gt;
		&lt;title&gt;Test&lt;/title&gt;
	&lt;/head&gt;
	&lt;body&gt;
		&lt;h1&gt;Testing&lt;/h1&gt;
		&lt;main&gt;
			&lt;p&gt;Some &lt;em&gt;HTML&lt;/em&gt; and an &lt;img src="example.png"&gt;&lt;/p&gt;
			&lt;ol&gt;
				&lt;li&gt;List&lt;/li&gt;
				&lt;li&gt;Another list&lt;/li&gt;
			&lt;/ol&gt;
		&lt;/main&gt;
	&lt;/body&gt;
&lt;/html&gt;</code></pre>
<p>Cor! That's much better!</p>
<p>I've cobbled together a script which is <em>broadly</em> accurate. There are a million-and-one edge cases and about twice as many personal preferences. This aims to be quick, simple, and basically fine. I am indebted to <a href="https://topic.alibabacloud.com/a/php-domdocument-recursive-formatting-of-indented-html-documents_4_86_30953142.html" rel="nofollow noopener noreferrer" target="_blank">this random Chinese script</a> and to <a href="https://github.com/wasinger/html-pretty-min" rel="nofollow noopener noreferrer" target="_blank">html-pretty-min</a>.</p>
<p><strong>Step By Step</strong></p>
<p>I'm going to walk through how everything works. This is as much for my benefit as for yours! This is beta code. It sorta-kinda-works for me. Think of it as a first pass at an attempt to prove that something can be done. Please don't use it in production!</p>
<p><strong>Setting up the DOM</strong></p>
<p>The new HTMLDocument should be broadly familiar to anyone who has used the previous one.</p>
<pre><code>$html = '&lt;html lang="en-GB"&gt;&lt;head&gt;&lt;title&gt;Test&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h1&gt;Testing&lt;/h1&gt;&lt;main&gt;&lt;p&gt;Some &lt;em&gt;HTML&lt;/em&gt; and an &lt;img src="example.png"&gt;&lt;/p&gt;&lt;ol&gt;&lt;li&gt;List&lt;li&gt;Another list&lt;/body&gt;&lt;/html&gt;';
$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );</code></pre>
<p>This automatically adds <code>&lt;head&gt;</code> and <code>&lt;body&gt;</code> elements.
If you don't want that, use the <a href="https://www.php.net/manual/en/libxml.constants.php#constant.libxml-html-noimplied" rel="nofollow noopener noreferrer" target="_blank"><code>LIBXML_HTML_NOIMPLIED</code> flag</a>:</p>
<pre><code>$dom = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );</code></pre>
<p><strong>Where not to indent</strong></p>
<p>There are certain elements whose contents shouldn't be pretty-printed because doing so might change the meaning or layout of the text. For example, in a paragraph:</p>
<pre><code>&lt;p&gt;
	Some
	&lt;em&gt;
		HT
		&lt;strong&gt;M&lt;/strong&gt;
		L
	&lt;/em&gt;
&lt;/p&gt;</code></pre>
<p>I've picked these elements from <a href="https://html.spec.whatwg.org/multipage/text-level-semantics.html#text-level-semantics" rel="nofollow noopener noreferrer" target="_blank">text-level semantics</a> and a few others which I consider sensible. Feel free to edit this list if you want.</p>
<pre><code>$preserve_internal_whitespace = [
	"a", "em", "strong", "small", "s", "cite", "q", "dfn", "abbr",
	"ruby", "rt", "rp", "data", "time", "pre", "code", "var", "samp", "kbd",
	"sub", "sup", "b", "i", "mark", "u", "bdi", "bdo", "span",
	"h1", "h2", "h3", "h4", "h5", "h6",
	"p", "li",
	"button", "form", "input", "label", "select", "textarea",
];</code></pre>
<p>The function has an option to <em>force</em> indenting every time it encounters an element.</p>
<p><strong>Tabs 🆚 Space</strong></p>
<p>Tabs, obviously. Users can set their tab width to their personal preference and it won't get confused with semantically significant whitespace.</p>
<pre><code>$indent_character = "\t";</code></pre>
<p><strong>Recursive Function</strong></p>
<p>This function reads through each node in the HTML tree. If the node should be indented, the function inserts a new text node, holding a newline and the requisite number of tabs, before the existing node. It also appends a suffix text node so that the closing tag is indented appropriately.
It then goes through the node's children and recursively repeats the process.</p>
<p><strong>This modifies the existing Document</strong>.</p>
<pre><code>function prettyPrintHTML( $node, $treeIndex = 0, $forceWhitespace = false )
{
	global $indent_character, $preserve_internal_whitespace;

	// If this node contains content which shouldn't be separately indented
	// And if whitespace is not forced
	if ( property_exists( $node, "localName" ) && in_array( $node->localName, $preserve_internal_whitespace ) && !$forceWhitespace ) {
		return;
	}

	// Does this node have children?
	if( property_exists( $node, "childElementCount" ) && $node->childElementCount > 0 ) {
		// Move in a step
		$treeIndex++;
		$tabStart = "\n" . str_repeat( $indent_character, $treeIndex );
		$tabEnd   = "\n" . str_repeat( $indent_character, $treeIndex - 1);

		// Remove any existing indenting at the start of the line
		$node->innerHTML = trim($node->innerHTML);

		// Loop through the children
		$i=0;
		while( $childNode = $node->childNodes->item( $i++ ) ) {
			// Is this child a text-only node?
			// ( $i was already incremented by the loop condition, so item( $i-1 ) is the current child )
			// If so, don't add a newline before it, and step past the node which follows it
			if ( $i > 0 ) {
				$olderSibling = $node->childNodes->item( $i-1 );
				if ( $olderSibling->nodeType == XML_TEXT_NODE && !$forceWhitespace ) {
					$i++;
					continue;
				}
				$node->insertBefore( $node->ownerDocument->createTextNode( $tabStart ), $childNode );
			}
			// Skip over the text node which was just inserted
			$i++;
			// Recursively indent all children
			prettyPrintHTML( $childNode, $treeIndex, $forceWhitespace );
		};

		// Suffix with a node which has "\n" and a suitable number of "\t"
		$node->appendChild( $node->ownerDocument->createTextNode( $tabEnd ) );
	}
}</code></pre>
<p><strong>Printing it out</strong></p>
<p>First, call the function.
<strong>This modifies the existing Document</strong>.</p>
<pre><code>prettyPrintHTML( $dom->documentElement );</code></pre>
<p>Then call <a href="https://www.php.net/manual/en/dom-htmldocument.savehtml.php" rel="nofollow noopener noreferrer" target="_blank">the normal <code>saveHtml()</code> serialiser</a>:</p>
<pre><code>echo $dom->saveHTML();</code></pre>
<p>Note - this does not print a <code>&lt;!doctype html&gt;</code> - you'll need to include that manually if you're intending to use the entire document.</p>
<p><strong>Licence</strong></p>
<p>I consider the above too trivial to license - but you may treat it as MIT if that makes you happy.</p>
<p><strong>Thoughts? Comments? Next steps?</strong></p>
<p>I've not written any formal tests, nor have I measured its speed. There may be subtle bugs and catastrophic errors. I know it doesn't work well if the HTML is already indented. It mysteriously prints double newlines for some unfathomable reason.</p>
<p>I'd love to know if you find this useful. Please <a href="https://gitlab.com/edent/pretty-print-html-using-php/" rel="nofollow noopener noreferrer" target="_blank">get involved on GitLab</a> or drop a comment here.</p>
<p><a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/howto/" target="_blank">#HowTo</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/html/" target="_blank">#HTML</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/php/" target="_blank">#php</a></p>