datahelm/crawler 问题修复 & 功能扩展

解决BUG、新增功能、兼容多环境部署,快速响应你的开发需求

邮箱:yvsm@zunyunkeji.com | QQ:316430983 | 微信:yvsm316

datahelm/crawler

Composer 安装命令:

composer require datahelm/crawler

包简介

Generic Scrapy-like web crawler for Laravel — auto-detects lists, pagination, fields, and supports API/SPA, infinite scroll, image downloading, dedup, resumable crawls, LLM-ready Markdown output, and pluggable output sinks.

README 文档

README

Scrapy-style web crawler for Laravel — auto-detects lists, pagination, and fields; supports API/SPA sites, infinite scroll, image downloading, dedup, pluggable output sinks, and LLM-ready Markdown output (like Firecrawl / Crawl4AI).

Documentation

📖 Documentation: https://datahelm.dev

Architecture overview

Installation

composer require datahelm/crawler

Publish config (optional):

php artisan vendor:publish --tag=crawler-config

Quick start

php artisan datahelm:scrap:generate "https://example.com/listing" --get-detail=true --robot
php artisan datahelm:robot:example --limit=10

datahelm:scrap:generate auto-detects the repeating item block, pagination, and fields on a listing page like this:

Example listing page auto-detection

Presets (field-detection heuristics)

A preset is a named bundle of heuristics used only by datahelm:scrap:generate — the auto-detection step that guesses selectors on a site it has never seen. It has no effect once a blueprint is saved; it only shapes what gets written into that blueprint the moment it's generated.

php artisan datahelm:scrap:generate "https://example.com/listing" --preset=ecommerce

Built-in presets (config/crawler.phppresets), selected with --preset= or the CRAWLER_PRESET env var (default: generic):

Preset Use for
generic Unsure / mixed content — safe default for any country, any vertical
ecommerce Online shops, marketplaces (adds handle, product-image, qty, …)
auctions Auction/lot listings
properties Real-estate listings

Each preset is an array of hints the detectors match against CSS classes, HTML attributes, and JSON field names:

Key Controls
price_patterns Regexes for currency formats ($, R$, , £, …) — locale-specific, add your own symbol if missing
image_field_hints CSS class / JSON key fragments that mark an image field (image, thumb, gallery, …)
link_field_hints Same, for the item's URL/link field (url, href, slug, handle, …)
rating_hints CSS class fragments for star/score widgets
stock_hints CSS class fragments for availability/inventory
sku_hints CSS class / JSON key fragments for product codes (SKU, EAN, MPN, …)
image_path_prefix A fixed URL path segment that identifies image URLs on a known platform (e.g. VTEX's /arquivos/); null = auto
list_core_fields, list_min_core_fields, list_min_success_rate, list_min_link_uniqueness Thresholds the detector uses to decide "this repeating block is really a list of items"
item_schema Suggested item_schema (type-coercion map) to carry into the generated blueprint

image_field_hints / link_field_hints / rating_hints / stock_hints / sku_hints are CSS-class vocabulary and stay in English regardless of the page's display language (developers write class="star-rating" on French/Portuguese/Arabic sites alike). Only price_patterns and image_path_prefix are actually locale/platform specific.

Adding your own preset — extend an existing one by merging in local vocabulary, in config/crawler.php:

'presets' => [
    // ...
    'auctions_pt_BR' => [
        'price_patterns'    => ['/R\$\s*[\d.,]+/'],
        'image_field_hints' => array_merge(
            ['image', 'img', 'photo', 'thumb'],
            ['foto', 'fotos', 'imagem', 'imagens', 'galeria'],
        ),
        'link_field_hints'  => ['url', 'link', 'href', 'permalink', 'lote'],
        'image_path_prefix' => '/arquivos/',
    ],
],
php artisan datahelm:scrap:generate <url> --preset=auctions_pt_BR

Item pipeline

Where a preset shapes how fields are found, the pipeline shapes what happens to their values afterwards — it runs on every crawl execution (datahelm:scrap:run), not just generation, transforming each already-extracted ScrapedItem before it's exported.

By default (config/crawler.phppipeline) every item passes through:

  1. TrimProcessor — collapses whitespace and trims every string field.
  2. AbsoluteUrlProcessor — resolves relative link/image/gallery_images/… URLs against the page they were scraped from.

A blueprint can override this default for itself with pipeline_names — a list of short names resolved against config('crawler.pipeline_registry'):

'pipeline_registry' => [
    'trim'            => TrimProcessor::class,
    'absolute_url'    => AbsoluteUrlProcessor::class,
    'schema_coercion' => SchemaCoercionProcessor::class,
],
{ "pipeline_names": ["trim", "schema_coercion"] }

This is a replacement, not an addition: listing ["trim"] runs only TrimProcessor for that blueprint — AbsoluteUrlProcessor no longer runs, so relative URLs are left as-is. Leaving pipeline_names empty ([], the default) keeps the global pipeline untouched — most blueprints never need to set this.

Adding a custom processor — implement ItemProcessor, register it, then reference it by name:

final class StripEmojiProcessor implements \DataHelm\Crawler\Pipeline\ItemProcessor
{
    public function process(ScrapedItem $item, string $pageUrl): ScrapedItem
    {
        $title = $item->get('title');
        if (is_string($title)) {
            $item->set('title', preg_replace('/[\x{1F300}-\x{1FAFF}]/u', '', $title));
        }

        return $item;
    }
}
// config/crawler.php
'pipeline_registry' => [
    // ...
    'strip_emoji' => \App\Pipeline\StripEmojiProcessor::class,
],
{ "pipeline_names": ["trim", "absolute_url", "strip_emoji"] }

Presets vs. pipeline

They sound similar (both pick a named config by string) but act at opposite ends of the process:

Presets Pipeline
Runs during scrap:generate only, once scrap:run, every execution
Acts on Detection heuristics — finding the right selectors Extracted values — transforming them after extraction
Selected via --preset=ecommerce on the CLI "pipeline_names": [...] in the blueprint JSON
Outlives the run? No — only its effect on the generated blueprint persists Yes — read from the blueprint on every future run

Markdown / LLM-ready output

Turn any page into clean Markdown instead of a wall of HTML — the feature Firecrawl and Crawl4AI are known for, now in the Laravel world. Two ways to use it:

1. As a field — render one element's content (an article body, a product description) as Markdown by setting the field type to markdown. The css selector locates the element; its content is converted to Markdown (headings, lists, links, images, code, tables preserved; scripts, styles, and site chrome stripped):

{
  "name": "description",
  "css": ".product-description",
  "type": "markdown"
}

2. As an output format — export a whole crawl as a single Markdown document, one section per item, ready to drop into an LLM context window or a RAG index. Set the blueprint's output.format:

{ "output": { "format": "markdown" } }
php artisan datahelm:scrap:run example --output=storage/app/scrapes/example.md

The converter (DataHelm\Crawler\Markdown\HtmlToMarkdown) is dependency-free (ext-dom only) and can be used on its own:

$markdown = (new \DataHelm\Crawler\Markdown\HtmlToMarkdown())->convert($html);

Output formats

Set output.format in the blueprint (or rely on the default). Every format writes to --output=<path> or storage/app/scrapes/<name>.<ext> when omitted; use --output=- to stream to STDOUT.

Format Ext Best for
json (default) .json Pretty array — human-readable, small crawls
jsonl .jsonl One object per line — large crawls, streaming consumers
csv .csv Spreadsheets; array fields are JSON-encoded per cell
markdown .md LLM ingestion / RAG — one Markdown section per item

HTTP transports

The package works out of the box with plain HTTP (guzzle). Heavier transports are optional — only needed for JS-heavy sites or bot protection.

Transport What it does Extra infrastructure
guzzle Plain HTTP (default) None
auto Escalates on bot blocks browserless and/or FlareSolverr recommended
browser Headless Chrome (JS / SPA) browserless
flaresolverr Cloudflare challenge solver FlareSolverr
scraping_api Managed anti-bot API Paid API key

Escalation ladder when using auto:

guzzle ─► browser ─► flaresolverr ─► scraping_api

Environment variables

Variable Default Description
CRAWLER_TRANSPORT guzzle guzzle, browser, flaresolverr, scraping_api, auto
CRAWLER_COMMAND_PREFIX datahelm Artisan command prefix (datahelm:scrap:generate, …)
BROWSERLESS_URL http://browserless:3000 browserless service URL
BROWSERLESS_TOKEN (empty) Optional browserless auth token
FLARESOLVERR_URL http://flaresolverr:8191 FlareSolverr service URL
FLARESOLVERR_MAX_TIMEOUT 60000 Challenge timeout (ms)
CRAWLER_PROXY_URL (empty) Upstream proxy for browser / flaresolverr transports
SCRAPING_API_URL (empty) Managed scraping API base URL
SCRAPING_API_KEY (empty) API key for scraping_api transport

When Laravel runs inside Docker on the same network as the services, use hostnames browserless and flaresolverr. When Laravel runs on the host machine, use http://localhost:3010 and http://localhost:8191.

Optional: anti-bot services only

Start browserless and FlareSolverr without a full development stack:

docker compose -f docker/compose.services.yml up -d

Stop when done (each service runs a full Chromium and uses RAM/CPU):

docker compose -f docker/compose.services.yml stop

Full development environment

For nginx, PHP, PostgreSQL, Redis, Supervisor, and all crawler services together, use the separate environment repository:

github.com/datahelm/environment

git clone https://github.com/datahelm/environment.git
cd environment
cp .env.example .env
export UID=$(id -u) GID=$(id -g)
docker compose up -d

Artisan commands

Command Description
datahelm:scrap:generate Auto-detect a site and generate a scrape blueprint
datahelm:scrap:run Run a blueprint and export items (JSON / JSONL / CSV / Markdown)
datahelm:scrap:shell Interactive CSS/XPath selector shell against a live URL
datahelm:scrap:validate Validate a blueprint JSON file
datahelm:robot:{name} Run a site-specific robot (scaffolded with --robot)

What's new

LLM-ready Markdown output — the Firecrawl / Crawl4AI feature, now in Laravel.

  • markdown field type — set a field's type to markdown to render the matched element's content as clean Markdown (headings, nested lists, ordered lists, links, images, fenced code with language, tables, blockquotes, <br>). Scripts, styles, and site chrome (nav/header/footer/aside) are stripped. See Markdown / LLM-ready output.
  • markdown output format — set output.format to markdown to export a whole crawl as a single Markdown document, one section per item, ready for an LLM context window or a RAG index. See Output formats.
  • HtmlToMarkdown converter (DataHelm\Crawler\Markdown\HtmlToMarkdown) — the engine behind both, dependency-free (ext-dom only) and usable on its own.
  • Fix: datahelm:scrap:run now honours the blueprint's output.format (JSON / JSONL / CSV / Markdown); previously it always wrote JSON.

License

MIT

统计信息

  • 总下载量: 0
  • 月度下载量: 0
  • 日度下载量: 0
  • 收藏数: 0
  • 点击次数: 2
  • 依赖项目数: 0
  • 推荐数: 0

GitHub 信息

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • 开发语言: PHP

其他信息

  • 授权协议: MIT
  • 更新时间: 2026-07-04

承接程序开发

PHP开发

VUE

Vue开发

前端开发

小程序开发

公众号开发

系统定制

数据库设计

云部署

网站建设

安全加固