wp-php-toolkit/html
Latest stable version: v0.7.5
Composer install command:
composer require wp-php-toolkit/html
Package description
HTML component for WordPress.
Keywords:
README
| slug | html |
|---|---|
| title | HTML |
| install | wp-php-toolkit/html |
| credit_title | Ported from WordPress core |
| credit_body | The HTML component is a port of WordPress core's WP_HTML_Tag_Processor and WP_HTML_Processor. Source: [WordPress/wordpress-develop](https://github.com/WordPress/wordpress-develop/tree/trunk/src/wp-includes/html-api). Bug fixes flow in both directions. |
A pure-PHP HTML5 parser and tag rewriter mirroring WordPress core's HTML API. Treat HTML the way browsers do — without libxml2, DOMDocument, or regex hacks — and rewrite attributes in a single linear pass.
Why this exists
WordPress runs HTML fragments through filters every time a request renders: post content, block markup, comments, excerpts, widgets, feeds, imported documents. Those fragments can omit <html> and <body>, close tags implicitly, or mix browser-correct markup with author mistakes that DOMDocument and regular expressions do not model well.
The HTML component gives WordPress-style code the same parsing model WordPress core uses: a browser-compatible tokenizer and tree-aware processor that run in pure PHP. Choose it for exact-byte rewrites, imperfect fragments, and post-content filters where a full DOM would do too much work.
The component gives you two processors. WP_HTML_Tag_Processor is a forward-only cursor over tags and tokens — useful for attribute rewriting at scale. WP_HTML_Processor layers HTML5 tree construction on top so you can query by ancestry (breadcrumbs), serialize the parsed document, and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
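As a rough illustration of the tree-construction claim, the sketch below asks WP_HTML_Processor for the breadcrumbs of each P element in <p>one<p>two. It assumes the same autoloader path the other examples on this page use. If the second paragraph were nested inside the first, its breadcrumbs would end in P > P.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

// No closing tags: the second <p> implicitly closes the first.
$p = WP_HTML_Processor::create_fragment( '<p>one<p>two' );

$paths = array();
while ( $p->next_tag( 'P' ) ) {
    // Breadcrumbs list every open element from the root down to this tag.
    $paths[] = implode( ' > ', $p->get_breadcrumbs() );
}

echo implode( "\n", $paths ), "\n";
```

Both lines should read HTML > BODY > P, which shows the two paragraphs as siblings rather than parent and child.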
Footgun: Mutations are buffered. Nothing changes in the source string until you call get_updated_html(). If you read get_attribute() after a set_attribute() on the same tag, you see the new value — but downstream tooling reading the original string sees stale HTML until you serialize.
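A minimal sketch of that buffering behavior, assuming the autoloader path used elsewhere on this page. The variable names are illustrative only.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = '<img src="a.jpg">';
$tags = new WP_HTML_Tag_Processor( $html );
$tags->next_tag( 'img' );
$tags->set_attribute( 'alt', 'A' );

// The processor already reports the queued value.
$queued = $tags->get_attribute( 'alt' );

// The string passed to the constructor is untouched.
$source_is_stale = ( '<img src="a.jpg">' === $html );

// Only serializing produces the rewritten markup.
$updated = $tags->get_updated_html();

var_dump( $queued, $source_is_stale );
echo $updated, "\n";
```

Anything that needs the rewritten markup should read the return value of get_updated_html(), never the original variable.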
Add loading="lazy" to every image
The "hello world" of tag rewriting. One linear pass, no DOM, no reserialization cost beyond the bytes you actually changed.
Try this: change 'lazy' to 'eager' on the first image only by guarding the call with $tags->get_attribute( 'src' ) === 'hero.jpg'. Run the script again and notice that get_updated_html() changes only the bytes of the tags you actually modified.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<article>
<img src="hero.jpg" alt="Hero">
<p>Intro copy.</p>
<img src="inline.jpg" alt="Inline">
</article>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    // Don't clobber an explicit eager hint the author already set.
    if ( null === $tags->get_attribute( 'loading' ) ) {
        $tags->set_attribute( 'loading', 'lazy' );
    }
    $tags->set_attribute( 'decoding', 'async' );
}

echo $tags->get_updated_html();
<article>
<img decoding="async" loading="lazy" src="hero.jpg" alt="Hero">
<p>Intro copy.</p>
<img decoding="async" loading="lazy" src="inline.jpg" alt="Inline">
</article>
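The "Try this" variation can be sketched on its own as follows. This is a trimmed-down version of the example rather than the author's exact code: it drops the decoding attribute so that only one tag is modified, and it assumes the same autoloader path.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = '<img src="hero.jpg" alt="Hero"><img src="inline.jpg" alt="Inline">';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    // Only the hero image is touched; every other tag keeps its original bytes.
    if ( 'hero.jpg' === $tags->get_attribute( 'src' ) ) {
        $tags->set_attribute( 'loading', 'eager' );
    }
}

$updated = $tags->get_updated_html();
echo $updated, "\n";
```

The second image comes out byte-for-byte identical to the input because no update was queued for it.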
Rewrite relative links to absolute URLs
Use this before sending post content to an RSS feed, an email template, or a CDN-backed copy of a site. The processor rewrites only the changed bytes, so untouched markup stays byte-identical.
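<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<p>See <a href="/about">about</a>, <a href="https://example.com/x">x</a>,
and <a href="contact.html">contact</a>.</p>
HTML;

$base = 'https://my-site.test/';
$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'a' ) ) {
    $href = $tags->get_attribute( 'href' );
    // get_attribute() returns null when the attribute is missing and true
    // for a valueless attribute (<a href>), so accept non-empty strings only.
    if ( ! is_string( $href ) || '' === $href ) {
        continue;
    }
    // Leave absolute URLs (scheme:), protocol-relative URLs (//host), and
    // same-page fragments (#id) alone.
    if ( preg_match( '#^[a-z][a-z0-9+.-]*:#i', $href ) || 0 === strpos( $href, '//' ) || 0 === strpos( $href, '#' ) ) {
        continue;
    }
    // Everything else, including document-relative paths like contact.html,
    // is resolved against the site root.
    $tags->set_attribute( 'href', rtrim( $base, '/' ) . '/' . ltrim( $href, '/' ) );
}

echo $tags->get_updated_html();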
<p>See <a href="https://my-site.test/about">about</a>, <a href="https://example.com/x">x</a>,
and <a href="https://my-site.test/contact.html">contact</a>.</p>
Strip every script and inline event handler
A common sanitization step: neutralize untrusted HTML before display. Blank a script's body with set_modifiable_text() and strip every on* attribute via get_attribute_names_with_prefix().
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$untrusted = <<<'HTML'
<p onclick="x()">hi</p>
<script>evil()</script>
<img src="x" onerror="boom()">
HTML;

$tags = new WP_HTML_Tag_Processor( $untrusted );
while ( $tags->next_tag() ) {
    // next_tag() never lands on closing tags, so no is_tag_closer() guard
    // is needed here.
    if ( 'SCRIPT' === $tags->get_tag() ) {
        $tags->set_modifiable_text( '' );
    }
    foreach ( $tags->get_attribute_names_with_prefix( 'on' ) as $attr ) {
        $tags->remove_attribute( $attr );
    }
}

echo $tags->get_updated_html();
<p >hi</p>
<script></script>
<img src="x" >
Stamp a CSP nonce on inline scripts and styles
A nonce-based Content Security Policy requires every inline <script> and <style> to carry a nonce attribute that matches the value announced in the policy header. Tag-by-tag is exactly the right granularity.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

// 16 random bytes = 128 bits, the minimum nonce length the CSP spec recommends.
$nonce = bin2hex( random_bytes( 16 ) );

$html = <<<'HTML'
<head><style>body{font:16px sans-serif}</style></head>
<body><script>console.log("hi")</script><script src="vendor.js"></script></body>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag() ) {
    $tag = $tags->get_tag();
    if ( 'SCRIPT' === $tag || 'STYLE' === $tag ) {
        $tags->set_attribute( 'nonce', $nonce );
    }
}

echo "nonce: {$nonce}\n\n";
echo $tags->get_updated_html();
nonce: <random>
<head><style nonce="<random>">body{font:16px sans-serif}</style></head>
<body><script nonce="<random>">console.log("hi")</script><script nonce="<random>" src="vendor.js"></script></body>
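The nonce only takes effect if the response announces the same value in its Content-Security-Policy header. The sketch below shows one way to build that header in plain PHP. build_csp_header() is a hypothetical helper invented for this illustration, not part of the component, and the two directives shown are an assumed minimal policy.

```php
<?php
// Hypothetical helper: builds the header line for a nonce-based policy.
function build_csp_header( string $nonce ): string {
    return "Content-Security-Policy: script-src 'nonce-{$nonce}'; style-src 'nonce-{$nonce}'";
}

// 16 random bytes (128 bits), base64-encoded as the CSP nonce grammar allows.
$nonce  = base64_encode( random_bytes( 16 ) );
$header = build_csp_header( $nonce );

// In a web request, send it before any output:
// header( $header );
echo $header, "\n";
```

Generate a fresh nonce for every response and pass the same value to set_attribute( 'nonce', $nonce ) when stamping the tags.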
Build a srcset from a single src
Generate responsive image markup at render time without touching the editor data model. Read the existing src, derive a srcset with width descriptors, add a sizes hint.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html   = '<figure><img src="https://cdn.test/uploads/photo.jpg" alt="Sunset"></figure>';
$widths = array( 480, 768, 1200 );

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    $src = $tags->get_attribute( 'src' );
    // Skip images without a usable src and images that already have a srcset.
    if ( ! is_string( $src ) || null !== $tags->get_attribute( 'srcset' ) ) {
        continue;
    }

    // Assumes the src has no query string of its own.
    $variants = array();
    foreach ( $widths as $w ) {
        $variants[] = $src . '?w=' . $w . ' ' . $w . 'w';
    }

    $tags->set_attribute( 'srcset', implode( ', ', $variants ) );
    $tags->set_attribute( 'sizes', '(max-width: 768px) 100vw, 768px' );
}

echo $tags->get_updated_html();
<figure><img sizes="(max-width: 768px) 100vw, 768px" srcset="https://cdn.test/uploads/photo.jpg?w=480 480w, https://cdn.test/uploads/photo.jpg?w=768 768w, https://cdn.test/uploads/photo.jpg?w=1200 1200w" src="https://cdn.test/uploads/photo.jpg" alt="Sunset"></figure>
Decode HTML entities the way the spec demands
The HTML5 entity table has roughly 2,200 named references and a long list of edge cases. WP_HTML_Decoder implements the algorithm — don't roll your own.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

echo "attribute: " . WP_HTML_Decoder::decode_attribute( 'path?a=1&amp;b=2&copy' ) . "\n";
echo "text: " . WP_HTML_Decoder::decode_text_node( 'AT&amp;T — 100% 😀' ) . "\n";

// Safe URL prefix check that decodes character references while comparing.
// `&#x6A;` is the letter `j`, so this string really does start with javascript:.
// strpos() would miss it.
$is_javascript = WP_HTML_Decoder::attribute_starts_with(
    '&#x6A;avascript:alert(1)',
    'javascript:',
    'ascii-case-insensitive'
);
var_dump( $is_javascript );
attribute: path?a=1&b=2©
text: AT&T — 100% 😀
bool(true)
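To make the contrast with a plain string search concrete, here is a small side-by-side sketch. The input is an illustrative raw attribute value whose first letter is written as the character reference &#x6A;. The autoloader path is the one assumed throughout this page.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

// Raw (still-encoded) attribute text, as it appears in the HTML source.
$raw = '&#x6A;avascript:alert(1)';

// A byte-level prefix check sees "&#x6A;" and reports no match.
$naive = ( 0 === stripos( $raw, 'javascript:' ) );

// The decoder compares against the decoded characters instead.
$decoded_aware = WP_HTML_Decoder::attribute_starts_with( $raw, 'javascript:', 'ascii-case-insensitive' );

var_dump( $naive );         // bool(false)
var_dump( $decoded_aware ); // bool(true)
```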
Find images by ancestry with breadcrumbs
The full WP_HTML_Processor understands HTML5 tree construction, so you can ask "find every <img> directly inside a <figure>" without writing your own DOM walker.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<article>
<figure><img src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>
<p>Body copy <img src="emoji.png" alt=""> mid-paragraph.</p>
<figure><img src="diagram.png" alt="Diagram"></figure>
</article>
HTML;

$p = WP_HTML_Processor::create_fragment( $html );

$figure_images = 0;
while ( $p->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
    $p->add_class( 'figure-image' );
    $figure_images++;
}

echo "found {$figure_images} figure images\n";
echo $p->get_updated_html();
found 2 figure images
<article>
<figure><img class="figure-image" src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>
<p>Body copy <img src="emoji.png" alt=""> mid-paragraph.</p>
<figure><img class="figure-image" src="diagram.png" alt="Diagram"></figure>
</article>
Outline a document by walking tokens with depth
The full processor exposes get_current_depth() and get_breadcrumbs(). Combine with next_token() to print a structural outline.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<section><h1>Title</h1>
<section><h2>Chapter 1</h2><p>Body</p></section>
<section><h2>Chapter 2</h2><p>More body</p></section>
</section>
HTML;

$p = WP_HTML_Processor::create_fragment( $html );
while ( $p->next_token() ) {
    if ( '#tag' !== $p->get_token_type() || $p->is_tag_closer() ) {
        continue;
    }
    $tag = $p->get_tag();
    if ( ! preg_match( '/^H[1-6]$/', $tag ) ) {
        continue;
    }

    $indent = str_repeat( ' ', max( 0, $p->get_current_depth() - 2 ) );

    // Collect text tokens until the matching heading closer.
    $text = '';
    while ( $p->next_token() ) {
        if ( '#text' === $p->get_token_type() ) {
            $text .= $p->get_modifiable_text();
            continue;
        }
        if ( '#tag' === $p->get_token_type() && $tag === $p->get_tag() && $p->is_tag_closer() ) {
            break;
        }
    }

    echo "{$indent}{$tag} {$text}\n";
}
  H1 Title
   H2 Chapter 1
   H2 Chapter 2
Bookmarks: annotate a parent based on its children
Bookmarks are the one escape from forward-only scanning. Save a position, scan ahead, decide what to do, then seek() back and rewrite the earlier tag.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<ul>
<li><input type="checkbox" checked> Buy milk</li>
<li><input type="checkbox"> Walk the dog</li>
<li><input type="checkbox" checked> Read book</li>
</ul>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
$tags->next_tag( 'ul' );
$tags->set_bookmark( 'list' );

// Counts every <input> from here to the end of the document; the Tag
// Processor does not know where the list closes.
$total = 0;
$done  = 0;
while ( $tags->next_tag( 'input' ) ) {
    $total++;
    if ( null !== $tags->get_attribute( 'checked' ) ) {
        $done++;
    }
}

// Jump back to the <ul> and annotate it with the tally.
$tags->seek( 'list' );
$tags->set_attribute( 'data-progress', $done . '/' . $total );
$tags->release_bookmark( 'list' );

echo $tags->get_updated_html();
<ul data-progress="2/3">
<li><input type="checkbox" checked> Buy milk</li>
<li><input type="checkbox"> Walk the dog</li>
<li><input type="checkbox" checked> Read book</li>
</ul>
When to use which
| Use | For |
|---|---|
| WP_HTML_Tag_Processor | Attribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context. |
| WP_HTML_Processor::create_fragment() | Queries by ancestry (breadcrumbs), heading outline extraction, anything that needs to know "is this tag inside that one." |
| WP_HTML_Decoder::decode_text_node() | Turning entity-encoded text (`AT&amp;T`) back into raw text correctly. Implements the HTML5 entity algorithm, so don't roll your own. |
| WP_HTML_Decoder::attribute_starts_with() | Safe URL-prefix checks that decode HTML character references while comparing, so `j&#x61;vascript:` (where `&#x61;` is the letter a) is correctly recognized as starting with javascript:. The classic strpos approach misses these. |
Footgun: next_tag() only stops on opening tags. Closers and text are skipped, so a guard like ! $tags->is_tag_closer() inside a next_tag() loop is harmless but never fires. If you need to visit closing tags or text nodes, use next_token() instead and check get_token_type().
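A short sketch of the next_token() alternative, assuming the same autoloader path as the other examples. It records every tag and text token in document order, marking closers with a leading slash.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$tags = new WP_HTML_Tag_Processor( '<p>Hi <b>there</b></p>' );

$tokens = array();
while ( $tags->next_token() ) {
    if ( '#tag' === $tags->get_token_type() ) {
        // Unlike next_tag(), next_token() also stops on closing tags.
        $tokens[] = ( $tags->is_tag_closer() ? '/' : '' ) . $tags->get_tag();
    } elseif ( '#text' === $tags->get_token_type() ) {
        $tokens[] = '"' . $tags->get_modifiable_text() . '"';
    }
}

echo implode( ' ', $tokens ), "\n"; // P "Hi " B "there" /B /P
```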
Footgun: Tag-name matches are uppercase. get_tag() always returns the tag name in uppercase ('IMG', not 'img'). Compare accordingly. The filter argument to next_tag() is case-insensitive in either direction.
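The casing rule is easy to check with a mixed-case tag. This is a minimal sketch under the same autoloader assumption as the rest of the page.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$tags = new WP_HTML_Tag_Processor( '<Img src="a.png">' );

// The lowercase query still matches the mixed-case tag.
$found = $tags->next_tag( 'img' );
$name  = $tags->get_tag();

var_dump( $found );           // bool(true)
var_dump( $name );            // string(3) "IMG"
var_dump( 'img' === $name );  // bool(false): compare against 'IMG' instead
```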
Footgun: Don't confuse WP_HTML_Tag_Processor with the full processor. The cursor is forward-only and ancestry-blind, and it doesn't expose get_breadcrumbs() at all — calling that on a WP_HTML_Tag_Processor raises a Call to undefined method error. Breadcrumbs and HTML5 tree construction (implicit <tbody> insertion, automatic <p> closing, and the rest) live only on WP_HTML_Processor.
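One way to see the split between the two classes is to ask PHP which of them defines get_breadcrumbs(), then let the full processor report the path to a table cell. This sketch assumes the installed version supports table parsing, as the note above implies, and it guards the lookup in case it does not.

```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

$on_tag_processor  = method_exists( 'WP_HTML_Tag_Processor', 'get_breadcrumbs' );
$on_full_processor = method_exists( 'WP_HTML_Processor', 'get_breadcrumbs' );

var_dump( $on_tag_processor );  // bool(false)
var_dump( $on_full_processor ); // bool(true)

// The markup never mentions <tbody>; tree construction is expected to supply it.
$p    = WP_HTML_Processor::create_fragment( '<table><tr><td>cell</td></tr></table>' );
$path = array();
if ( null !== $p && $p->next_tag( 'TD' ) ) {
    $path = $p->get_breadcrumbs();
}

echo implode( ' > ', $path ), "\n";
```

If the printed path includes TBODY between TABLE and TR, the implicit element was inserted during parsing even though the source string does not contain it.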
Statistics
- Total downloads: 46.97k
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 0
- Clicks: 1
- Dependents: 4
- Suggesters: 0
Other information
- License: GPL-2.0-or-later
- Last updated: 2025-09-06