README

Official PHP client for the WebScraping.AI API.

The API gives you LLM-powered scraping tools with Chromium JavaScript rendering, rotating proxies, and built-in HTML parsing — full HTML, visible text, selected page areas, AI-extracted fields, and free-form question answering over any URL.

Requirements

PHP 8.2 or newer
A PSR-18 HTTP client — Guzzle, Symfony HttpClient, or any other implementation
A PSR-17 message factory

If you don't already have these installed, the simplest pair is:

composer require guzzlehttp/guzzle nyholm/psr7

php-http/discovery (a transitive dependency) will pick them up automatically.

Installation

composer require webscraping-ai/webscraping-ai-php

Quick start

use WebScrapingAI\Client;

$client = new Client(apiKey: getenv('WEBSCRAPING_AI_KEY'));

// Full HTML
$html = $client->html(url: 'https://example.com');

// Visible text
$text = $client->text(url: 'https://example.com');

// HTML for one selector
$h1 = $client->selected(url: 'https://example.com', selector: 'h1');

// HTML for multiple selectors (returns array)
$chunks = $client->selectedMultiple(
    url: 'https://example.com',
    selectors: ['h1', 'p', 'a'],
);

// LLM question over a page
$answer = $client->question(
    url: 'https://example.com',
    question: 'What is the main topic?',
);

// LLM-extracted structured fields
$fields = $client->fields(
    url: 'https://example.com',
    fields: [
        'title' => 'Main product title',
        'price' => 'Current price',
    ],
);

// Account quota
$account = $client->account();

All optional parameters (headers, timeout, js, js_timeout, wait_for, proxy, country, custom_proxy, device, error_on_404, error_on_redirect, js_script, …) are PHP named arguments. See the API docs for the full parameter reference.

Bring your own HTTP client

By default, the client builds its own transport. If Guzzle is installed it is used with a request deadline applied (see Timeouts); otherwise php-http/discovery resolves whatever PSR-18 client is installed. To pin a specific client, pass it explicitly:

use GuzzleHttp\Client as Guzzle;
use Nyholm\Psr7\Factory\Psr17Factory;
use WebScrapingAI\Client;

$factory = new Psr17Factory();
$client = new Client(
    apiKey: getenv('WEBSCRAPING_AI_KEY'),
    httpClient: new Guzzle(['timeout' => 30.0]),
    requestFactory: $factory,
    uriFactory: $factory,
);

Injecting your own client opts out of the default deadline — configure transport-level timeouts on the client you pass.

Timeouts

Two different timeouts are in play, and they're easy to confuse:

The timeout parameter accepted by each endpoint method (html(), text(), …) controls server-side page retrieval — how long the API waits for the target page. It does not bound how long your HTTP client waits.
The transport timeout bounds how long the PSR-18 client itself will wait on TCP connect and on reading the response body, so a stalled connection can't hang your process forever.

By default the client applies a transport deadline when it builds its own client and Guzzle is available: a total request timeout of Client::DEFAULT_TIMEOUT (60s) and a TCP connect timeout of Client::DEFAULT_CONNECT_TIMEOUT (10s). Override them via the constructor:

$client = new Client(
    apiKey: getenv('WEBSCRAPING_AI_KEY'),
    timeout: 120.0,        // total request deadline, seconds
    connectTimeout: 5.0,   // TCP connect deadline, seconds
);

These constructor timeouts apply only to the auto-built default client. If you inject your own httpClient, or no concrete client (Guzzle) is available and discovery falls back to an unknown PSR-18 implementation, no deadline is imposed — in that case inject a client with timeouts configured to get one.

Errors

The client raises typed exceptions for every documented status code:

Status	Exception
400	`WebScrapingAI\Exception\BadRequestException`
402	`WebScrapingAI\Exception\PaymentRequiredException`
403	`WebScrapingAI\Exception\AuthenticationException`
429	`WebScrapingAI\Exception\RateLimitException`
500	`WebScrapingAI\Exception\ServerException`
504	`WebScrapingAI\Exception\GatewayTimeoutException`

All inherit from WebScrapingAI\Exception\ApiException, which exposes $message, $status, $statusCode, $statusMessage, $body, and $responseBody. The latter three are populated when the API surfaces target-page errors as 500s.

Transport-level failures raise WebScrapingAI\Exception\ApiTimeoutException (the PSR-18 client timed out) or WebScrapingAI\Exception\ApiConnectionException (DNS / connection refused / TLS).

All SDK-originated exceptions implement the marker interface WebScrapingAI\Exception\WebScrapingAIException, so a single catch (WebScrapingAIException $e) block catches everything.

Response shapes

The client returns whatever the API returns — it does not normalise or unwrap. A couple of current quirks worth knowing:

fields() returns ['result' => [...fields...]] (the live API wraps the extracted fields under a result key).
selectedMultiple() returns array<int, array<int, string>> — an outer wrapper containing all matched chunks concatenated.

These are upstream spec/server drifts; the official Ruby and Python clients return the same shapes.

Migration from 3.x

3.x was generated from the OpenAPI spec under the namespace OpenAPI\Client\ and used per-tag classes (AIApi, HTMLApi, etc.). 4.0 is a hand-authored rewrite with a single WebScrapingAI\Client entry point. There are no deprecation shims — pin to ^3.2 if you need the old surface.

Development

composer install
composer test       # PHPUnit
composer lint       # php-cs-fixer (dry-run)
composer analyse    # PHPStan

License

MIT — see LICENSE.

webscraping-ai/webscraping-ai-php

包简介

关键字：

README 文档