ioodev/elephscraper
Composer install command:
composer require ioodev/elephscraper
Package summary
ElephScraper is a lightweight and PHP-native web scraping toolkit built using Guzzle and Symfony DomCrawler. It provides a clean and powerful interface to extract HTML content, metadata, and structured data from any website.
README
ElephScraper is a lightweight, PHP-native web scraping toolkit, built on top of Guzzle and Symfony DomCrawler. This library provides a clean and powerful interface for extracting HTML, metadata, and structured data from any web page — or from an HTML string you already have yourself.
Fast. Clean. Eleph-style scraping. 🐘⚡
Part of the ioodev scraper ecosystem alongside SnakyScraper (Python) and NodeScraper (Node.js): three libraries with a similar API philosophy for three different language ecosystems.
Moving from riodevnet/elephscraper? See Migrating from v1.0 below. The namespace and package name changed in v1.1.0.
📋 Table of Contents
- Features
- Installation
- Basic Usage
- Error Handling
- Request Options (Headers, Timeout, Proxy, etc.)
- Full API Reference
- Project Structure
- Testing & Quality Tools
- Migrating from v1.0 (riodevnet/elephscraper)
- Contributing
- Changelog
- License
🚀 Features
- ✅ Extract metadata: title, description, keywords, author, charset, canonical, and more
- ✅ Full support for Open Graph, Twitter Card, CSRF token, and HTTP-equiv headers
- ✅ Extract headings, paragraphs, images, lists, and links, complete with rel, nofollow, and similar details
- ✅ Flexible filter() method with tag/class/ID-based selectors
- ✅ Can load from a URL or directly from an HTML string (fromHtml()); no HTTP request needed, great for testing
- ✅ Never throws a fatal error: fetch/parse failures can always be checked via isValid()/getError(), or optionally thrown as an exception (throwOnError)
- ✅ Custom headers, timeout, proxy, cookies, and other Guzzle options via the $options parameter
- ✅ Safe return types: string, array, or associative array; always null (never a crash) when data isn't found
- ✅ Strict types & full type-hints (PHP 8.0+) for a safer development experience
- ✅ Built on top of Guzzle + Symfony DomCrawler + CssSelector
- ✅ PHPUnit test suite, PHPStan level 6, and PHP-CS-Fixer (PSR-12) already set up
📦 Installation
Install via Composer:
composer require ioodev/elephscraper
Requires PHP 8.0 or newer.
🛠️ Basic Usage
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;

$scraper = new ElephScraper('https://example.com');

if (!$scraper->isValid()) {
    // Request failed (timeout, DNS error, 404, etc.); this does not throw a fatal error.
    die('Failed to load page: ' . $scraper->getError()?->getMessage());
}

echo $scraper->title();        // "Welcome to Example.com"
echo $scraper->description();  // "Example site for testing"
print_r($scraper->h1());       // ["Main Title", "News"]
print_r($scraper->openGraph());
Load from an HTML string (no HTTP request)
Useful for unit tests, or when you already have HTML from another source (headless browser, file cache, webhook payload, etc.):
$html = '<html><head><title>Static Page</title></head><body>...</body></html>';

$scraper = ElephScraper::fromHtml($html);

echo $scraper->title(); // "Static Page"
⚠️ Error Handling
By default, the constructor never throws an exception. This is intentional, so that a single broken URL in the middle of a batch or loop doesn't halt the entire run. Always check isValid() (or getError()) before using the result:
$scraper = new ElephScraper($url);

if (!$scraper->isValid()) {
    echo 'Error: ' . $scraper->getError()?->getMessage();
    // continue to the next URL, etc.
}
If you prefer a "fail-fast" model with try/catch, set throwOnError:
use Ioodev\Elephscraper\Exceptions\ScraperException;

try {
    $scraper = new ElephScraper($url, ['throwOnError' => true]);
} catch (ScraperException $e) {
    echo 'Scraping failed: ' . $e->getMessage();
}
All extraction methods (title(), h1(), links(), etc.) are safe to call even if the document failed to load: they return null instead of crashing, including in the edge cases that triggered a fatal error in previous versions.
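The non-throwing default lends itself to batch processing. The following is a minimal sketch of that pattern, assuming the package is installed via Composer; the URL list is hypothetical, and only the constructor, isValid(), getError(), and title() calls documented in this README are used.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;

// Hypothetical URL list, for illustration only.
$urls = [
    'https://example.com',
    'https://example.org/missing-page',
];

$titles = [];
$failures = [];

foreach ($urls as $url) {
    $scraper = new ElephScraper($url);

    if (!$scraper->isValid()) {
        // Record the failure and move on instead of aborting the batch.
        $failures[$url] = $scraper->getError()?->getMessage() ?? 'unknown error';
        continue;
    }

    // title() may return null when the page has no <title>.
    $titles[$url] = $scraper->title() ?? '(no title)';
}

printf("%d loaded, %d failed\n", count($titles), count($failures));
```

Every URL ends up in exactly one of the two arrays, so a failed request costs one log entry rather than the whole run.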
⚙️ Request Options (Headers, Timeout, Proxy, etc.)
The constructor's second parameter is an options array. It is merged with the defaults (timeout: 10, connect_timeout: 5, redirects followed, browser User-Agent) and passed on to Guzzle's request():
$scraper = new ElephScraper('https://example.com', [
    'timeout' => 20,
    'headers' => [
        'User-Agent'      => 'MyBot/1.0',
        'Accept-Language' => 'en-US,en;q=0.9',
    ],
    'proxy'  => 'http://localhost:8125',
    'verify' => false, // disable SSL verification (use with caution in production)
]);
You can also inject your own Guzzle Client instance (for example, for testing with a
mock handler):
$scraper = new ElephScraper('https://example.com', [
    'client' => $myMockedGuzzleClient,
]);
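One way to build such a mocked client is with Guzzle's own MockHandler. The sketch below assumes the 'client' option behaves as described above; the URL and the HTML body are invented for the example, and no real network request should take place.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;
use Ioodev\Elephscraper\ElephScraper;

// Invented response body, used instead of a live page.
$body = '<html><head><title>Mocked Page</title></head><body><h1>Hello</h1></body></html>';

// Queue a single canned response for the next request.
$mock = new MockHandler([
    new Response(200, ['Content-Type' => 'text/html; charset=utf-8'], $body),
]);

// A Guzzle client whose handler stack answers from the queue above.
$client = new Client(['handler' => HandlerStack::create($mock)]);

// Hypothetical URL; the mock handler replies regardless of the host.
$scraper = new ElephScraper('https://example.test/page', ['client' => $client]);

var_dump($scraper->isValid());
echo $scraper->title() ?? '(no title)', PHP_EOL;
```

Queuing a Response with status 500, or a GuzzleHttp\Exception\ConnectException, in the same way is a convenient route to exercising the isValid()/getError() path in tests.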
📚 Full API Reference
🔹 Page Metadata
$scraper->title();           // ?string
$scraper->description();     // ?string
$scraper->keywords();        // ?string[], comma-split and already trimmed
$scraper->keywordString();   // ?string, raw "content" attribute
$scraper->charset();         // ?string
$scraper->canonical();       // ?string
$scraper->contentType();     // ?string, from meta http-equiv="Content-Type"
$scraper->author();          // ?string
$scraper->csrfToken();       // ?string, checks <meta name="csrf-token">, falls back to <input name="csrf-token">
$scraper->image();           // ?string, shortcut for og:image
$scraper->viewport();        // ?string[], comma-split result from meta viewport
$scraper->viewportString();  // ?string
🔹 Open Graph & Twitter Card
$scraper->openGraph();                    // array<string,?string>, all common og: properties
$scraper->openGraph('og:title');          // ?string, a specific property
$scraper->twitterCard();                  // array<string,?string>, all common twitter: tags
$scraper->twitterCard('twitter:title');   // ?string, a specific property
🔹 Heading & Text
$scraper->h1();  // ?string[]
$scraper->h2();  // ?string[]
$scraper->h3();  // ?string[]
$scraper->h4();  // ?string[]
$scraper->h5();  // ?string[]
$scraper->h6();  // ?string[]
$scraper->p();   // ?string[], all <p> elements, trimmed
🔹 List
$scraper->ul();  // ?string[], all <li> text inside <ul>
$scraper->ol();  // ?string[], all <li> text inside <ol>
🔹 Images
$scraper->images();        // ?string[], all <img> src
$scraper->imageDetails();  // ?array<int, array{url:?string, alt_text:?string, title:?string}>
🔹 Links
$scraper->links();        // ?string[], all <a> href
$scraper->linkDetails();  // ?array<int, array{
                          //     url: ?string,
                          //     protocol: string,      // "https", "mailto", "" if relative, etc.
                          //     text: string,
                          //     title: string,
                          //     target: string,
                          //     rel: string[],
                          //     is_nofollow: bool,
                          //     is_ugc: bool,
                          //     is_noopener: bool,
                          //     is_noreferrer: bool,
                          // }>
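As an illustration of working with the linkDetails() shape listed above, the following sketch keeps only absolute http(s) links that are not marked nofollow. The sample markup is invented, and the code relies only on the url, text, protocol, and is_nofollow keys as documented here.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;

// Sample markup, invented for this example.
$html = <<<'HTML'
<html><body>
  <a href="https://example.com/docs" rel="noopener">Docs</a>
  <a href="https://ads.example.net/offer" rel="nofollow sponsored">Ad</a>
  <a href="/about">About</a>
  <a href="mailto:hello@example.com">Mail</a>
</body></html>
HTML;

$scraper = ElephScraper::fromHtml($html);

// linkDetails() may return null, so fall back to an empty list.
$links = $scraper->linkDetails() ?? [];

// Keep absolute http(s) links that are not marked rel="nofollow".
$followable = array_filter(
    $links,
    static fn (array $link): bool =>
        in_array($link['protocol'], ['http', 'https'], true) && !$link['is_nofollow']
);

foreach ($followable as $link) {
    echo $link['url'], ' (', $link['text'], ')', PHP_EOL;
}
```

array_filter() preserves the original keys; wrap the result in array_values() if a re-indexed list is needed.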
🔍 Custom DOM Filter
filter() is the most flexible method — ideal for scraping custom HTML structures
like product lists, article cards, data tables, etc.
$scraper->filter(
    element: 'div',
    attributes: ['id' => 'main'],
    multiple: false,
    extract: ['.title', '#desc', 'p'],
    returnHtml: false
);
Filter multiple elements at once:
$products = $scraper->filter(
    element: 'div',
    attributes: ['class' => 'product-card'],
    multiple: true,
    extract: ['.product-title', '.price'],
    returnHtml: false
);
// [
//     ['.product-title' => 'Wireless Mouse', '.price' => '$15.00'],
//     ['.product-title' => 'Mechanical Keyboard', '.price' => '$85.00'],
// ]
Get raw HTML from a single section:
$scraper->filter(
    element: 'section',
    attributes: ['class' => 'hero'],
    returnHtml: true
);
Selector rules for extract:
- Tag name: h2, p, span, etc.
- Class: .className (automatically matches even if the element has multiple classes)
- ID: #idName
Result array keys always follow the original selector string (e.g. $result['.title']). Quote characters in attributes values (for class, id, or other attributes) are handled safely and no longer break the selector as they could in previous versions.
Returns null if the document fails to load, or if no matching elements are found.
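Because filter() can return null, callers may want a fallback before iterating. This is a minimal sketch of that handling with invented sample markup; it assumes that a selector with no match inside a card yields a null value, which the README does not state explicitly.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;

// Sample markup, invented for this example.
$html = <<<'HTML'
<html><body>
  <div class="product-card featured">
    <h2 class="product-title">Wireless Mouse</h2>
    <span class="price">$15.00</span>
  </div>
  <div class="product-card">
    <h2 class="product-title">Mechanical Keyboard</h2>
    <span class="price">$85.00</span>
  </div>
</body></html>
HTML;

$scraper = ElephScraper::fromHtml($html);

// Fall back to an empty list when nothing matches or the document is invalid.
$products = $scraper->filter(
    element: 'div',
    attributes: ['class' => 'product-card'],
    multiple: true,
    extract: ['.product-title', '.price'],
    returnHtml: false
) ?? [];

foreach ($products as $product) {
    // Result keys follow the selector strings passed in "extract".
    $title = $product['.product-title'] ?? '(untitled)';
    $price = $product['.price'] ?? 'n/a';
    echo $title, ': ', $price, PHP_EOL;
}
```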
🔧 Low-Level Access
For cases not covered by the built-in methods, you can drop straight down to Symfony DomCrawler:
$scraper->isValid();     // bool, whether the document loaded successfully
$scraper->getError();    // ?Throwable, the last exception, if any
$scraper->getHtml();     // ?string, raw HTML
$scraper->getCrawler();  // ?Symfony\Component\DomCrawler\Crawler
$scraper->getUrl();      // ?string, source URL (or base URL from loadHtml())
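As an example of dropping down to the crawler, the sketch below reads table rows, a structure with no dedicated helper in the API listed above. The table markup and its id are invented; Crawler::filter(), each(), and text() are standard Symfony DomCrawler methods.

```php
<?php

declare(strict_types=1);

require_once __DIR__ . '/vendor/autoload.php';

use Ioodev\Elephscraper\ElephScraper;
use Symfony\Component\DomCrawler\Crawler;

// Sample markup, invented for this example.
$html = <<<'HTML'
<html><body>
<table id="prices">
  <tr><td>Wireless Mouse</td><td>$15.00</td></tr>
  <tr><td>Mechanical Keyboard</td><td>$85.00</td></tr>
</table>
</body></html>
HTML;

$scraper = ElephScraper::fromHtml($html);
$crawler = $scraper->getCrawler();

$rows = [];
if ($crawler !== null) {
    // One array of trimmed cell texts per table row.
    $rows = $crawler->filter('#prices tr')->each(
        static fn (Crawler $row): array => $row->filter('td')->each(
            static fn (Crawler $cell): string => trim($cell->text())
        )
    );
}

print_r($rows);
```

The null check matters because getCrawler() is documented as nullable; skipping it would reintroduce the crash the higher-level methods are designed to avoid.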
🗂 Project Structure
elephscraper/
├── .github/
│   └── workflows/
│       └── ci.yml                    # GitHub Actions: tests + static analysis on PHP 8.0–8.3
├── examples/
│   ├── basic-usage.php
│   ├── custom-options.php
│   └── from-html-and-filter.php
├── src/
│   ├── Exceptions/
│   │   ├── InvalidUrlException.php
│   │   └── ScraperException.php
│   ├── Support/
│   │   └── CssSelectorBuilder.php    # safe CSS selector builder (escaping, id/class normalization)
│   └── ElephScraper.php              # main class
├── tests/
│   └── Unit/
│       ├── CssSelectorBuilderTest.php
│       └── ElephScraperTest.php
├── .gitignore
├── .php-cs-fixer.php
├── CHANGELOG.md
├── composer.json
├── LICENSE
├── phpstan.neon
├── phpunit.xml
└── README.md
This separation is intentional, to make further development easier:
- src/Exceptions/: all exception classes, so library consumers can catch (ScraperException $e) specifically without catching generic PHP errors.
- src/Support/: internal helpers (currently CssSelectorBuilder) kept separate from the main class so they can be unit-tested independently and reused if more selector features are added later.
- tests/Unit/: mirrors the src/ namespace structure, one test file per class.
- examples/: runnable scripts (php examples/basic-usage.php) for quick onboarding without needing to read the whole README.
🧪 Testing & Quality Tools
composer install
composer test            # run PHPUnit
composer test:coverage   # PHPUnit + coverage report
composer analyse         # PHPStan level 6
composer lint            # check code formatting (PSR-12) without modifying files
composer lint:fix        # automatically fix code formatting
The test suite covers metadata extraction, heading/paragraph/list extraction, images &
links (including the edge case of relative links without rel), filter() (single &
multiple, including values containing quote characters), and behavior when the document
fails to load (must return null, not crash).
🔁 Migrating from v1.0 (riodevnet/elephscraper)
Version 1.1.0 changes the namespace and package name following the username rename
from riodevnet to ioodev. Migration steps:
composer remove riodevnet/elephscraper
composer require ioodev/elephscraper
Then find-and-replace throughout your project:
- use Riodevnet\Elephscraper\ElephScraper;
+ use Ioodev\Elephscraper\ElephScraper;
All method names remain exactly the same — there are no signature changes to any
public method that existed in v1.0, so you only need to update the use statement.
See CHANGELOG.md for the full list of changes, new features, and bug
fixes.
🤝 Contributing
Found a bug? Want to add a feature? Open an issue or submit a pull request at github.com/ioodev/elephscraper!
Before opening a PR, please run:
composer test
composer analyse
composer lint
📝 Changelog
See CHANGELOG.md for the full list of changes in each version.
📄 License
MIT License © 2025–2026 — ioodev
🔗 Related Libraries
- Guzzle
- Symfony DomCrawler
- Symfony CssSelector
- SnakyScraper — Python version
- NodeScraper — Node.js version
💡 Why ElephScraper?
ElephScraper is your trusty PHP elephant — strong, smart, and always ready to extract exactly the data you need. 🐘
Other information
- License: MIT
- Last updated: 2026-06-24