承接 survos/site-discovery-bundle 相关项目开发

从需求分析到上线部署,全程专人跟进,保证项目质量与交付效率

邮箱:yvsm@zunyunkeji.com | QQ:316430983 | 微信:yvsm316

survos/site-discovery-bundle

最新稳定版本:2.1.2

Composer 安装命令:

composer require survos/site-discovery-bundle

包简介

Discover hosted SaaS tenant sites (*.example.com) via web archive indexes. Currently supports Internet Archive CDX API only.

README 文档

README

Discovers tenant hostnames under a shared SaaS domain — e.g. *.pastperfectonline.com, *.omeka.net — by querying web archive indexes.

Current backend: Internet Archive CDX API only. Common Crawl support is planned but not yet implemented.

Requirements: PHP 8.4+, Symfony 8.0+

Background: how the CDX API works

The Internet Archive crawls the web continuously and stores every URL in a CDX (Capture inDeX). URLs are sorted in SURT (Sort-friendly URI Reordering Transform) order, which reverses domain label order:

https://fauquierhistory.pastperfectonline.com/path
  →  com,pastperfectonline,fauquierhistory)/path

For a SaaS platform like PastPerfect Online, every tenant has a subdomain. Their SURT keys look like:

com,pastperfectonline,fauquierhistory)/
com,pastperfectonline,bainbridgehistorymuseum)/advancedsearch

The tenant slug (fauquierhistory) sits between the shared SURT prefix (com,pastperfectonline,) and the closing ). This bundle queries CDX for all URLs under the registered domain, filters to subdomain-only rows, and extracts unique slugs.

Computing the SURT prefix

Reverse the domain labels, join with commas, add a trailing comma:

Domain SURT prefix
pastperfectonline.com com,pastperfectonline,
omeka.net net,omeka,
myheritage.com com,myheritage,
arcgis.com com,arcgis,

Installation

composer require survos/site-discovery-bundle

Register if not using Symfony Flex:

// config/bundles.php
return [
    Survos\SiteDiscoveryBundle\SurvosSiteDiscoveryBundle::class => ['all' => true],
];

Configuration

# config/packages/survos_site_discovery.yaml
survos_site_discovery:
    user_agent: "MyApp SiteDiscovery"   # defaults to "SurvosSiteDiscoveryBundle"

Console command

site:discover <domain> <surtPrefix> [options]

Arguments

Argument Description
domain Bare registered domain, e.g. pastperfectonline.com
surtPrefix SURT prefix for subdomain rows, e.g. com,pastperfectonline,

Options

Option Default Description
--output stdout Write JSONL to this file path
--limit 0 Stop after N unique sites (0 = unlimited). Always use a small number during development.
--page-size 5000 CDX rows per API request (max ~10 000)
--scheme https URL scheme used in base_url

Examples

# Discover PastPerfect Online sites, print to stdout (first 5 for testing)
bin/console site:discover pastperfectonline.com com,pastperfectonline, --limit=5

# Write to a JSONL file
bin/console site:discover pastperfectonline.com com,pastperfectonline, \
    --output=var/discovery/pastperfect-sites.jsonl

# Discover Omeka.net sites
bin/console site:discover omeka.net net,omeka, \
    --output=var/discovery/omeka-sites.jsonl

# Full discovery — slow, expect 10–30 s per CDX page
bin/console site:discover pastperfectonline.com com,pastperfectonline, \
    --output=var/discovery/pastperfect-sites.jsonl

Output JSONL shape

One JSON object per line:

{
  "slug":           "fauquierhistory",
  "host":           "fauquierhistory.pastperfectonline.com",
  "base_url":       "https://fauquierhistory.pastperfectonline.com",
  "discovered_via": "internet_archive_cdx",
  "validated":      false,
  "validated_at":   null
}

PHP API

Inject CdxDiscoveryService

use Survos\SiteDiscoveryBundle\Service\CdxDiscoveryService;
use Survos\SiteDiscoveryBundle\Model\DiscoveredSite;

final class MyHarvester
{
    public function __construct(
        private readonly CdxDiscoveryService $cdx,
    ) {}

    public function run(): void
    {
        foreach ($this->cdx->discover('pastperfectonline.com', 'com,pastperfectonline,') as $site) {
            // $site is a DiscoveredSite value object
            echo $site->slug;     // "fauquierhistory"
            echo $site->host;     // "fauquierhistory.pastperfectonline.com"
            echo $site->baseUrl;  // "https://fauquierhistory.pastperfectonline.com"

            $row = $site->toArray(); // JSONL-ready associative array
        }
    }
}

CdxDiscoveryService::discover() signature

public function discover(
    string $domain,       // e.g. "pastperfectonline.com"
    string $surtPrefix,   // e.g. "com,pastperfectonline,"
    string $scheme   = 'https',
    int    $limit    = 0,     // 0 = unlimited; set small for dev/testing
    int    $pageSize = 5000,
): \Generator  // yields DiscoveredSite

DiscoveredSite value object

final readonly class DiscoveredSite
{
    public string $slug;           // "fauquierhistory"
    public string $host;           // "fauquierhistory.pastperfectonline.com"
    public string $baseUrl;        // "https://fauquierhistory.pastperfectonline.com"
    public string $discoveredVia;  // "internet_archive_cdx"

    public function toArray(): array;  // JSONL-ready
}

CDX API technical notes

These notes are provided for agents and developers integrating with or extending this bundle.

Endpoint: https://web.archive.org/cdx/search/cdx

Parameters used by this bundle:

Parameter Value Purpose
url e.g. pastperfectonline.com Registered domain (no wildcard)
matchType domain Returns all URLs in the entire domain tree
output json Structured response; row 0 is a header array
fl urlkey Only fetch the SURT key column — cheapest option
collapse urlkey CDX-level deduplication
filter urlkey:{surtPrefix}[a-z0-9] Restrict to subdomain rows; skips bare domain entries
limit 5000 Rows per page
showResumeKey true Enables pagination
resumeKey {key from previous page} Continue from prior page

Pagination: when showResumeKey=true, the last row of each page is a resume-key string (not a urlkey). It does NOT start with the SURT prefix — that is how we distinguish it from real data rows. Pass it as resumeKey on the next request.

Why fl=urlkey instead of fl=original: the original field contains the raw crawled URL, which requires URL parsing to extract the hostname. The urlkey encodes the slug directly and unambiguously. One regex match is all that is needed.

Why matchType=domain instead of matchType=host: matchType=host with a wildcard (*.example.com) returns empty results. matchType=domain with the bare registered domain returns the full tree.

Latency: CDX API requests with matchType=domain can take 10–30 seconds per page. The response is streamed; the bundle waits for the full response. Plan accordingly.

Coverage gaps: sites blocked by robots.txt during crawl, or newer than the most recent IA crawl, will not appear. Use the output as a candidate list to be validated, not as a definitive registry.

Rate limiting

The CDX API is free and unauthenticated. The Internet Archive does not publish a formal rate limit, but hammering the API with parallel requests is antisocial. This bundle makes one sequential request per page. Do not add concurrency.

Downstream validation

This bundle only discovers candidate hostnames. It does not validate that a host is currently live or that it is still running the expected platform. Add a probe step in your consumer bundle, e.g.:

// Pseudo-code — implement in your bundle
$response = $httpClient->request('GET', $site->baseUrl . '/AdvancedSearch');
$isLive = $response->getStatusCode() === 200
    && str_contains($response->getContent(), 'pastperfectonline');

Planned backends

  • Common Crawl Host Index (higher coverage; requires DuckDB or Athena)
  • Static seed file (CSV/JSONL of known hosts, for offline or pre-seeded use)

Pull requests for additional backends are welcome. Implement CdxDiscoveryService as a reference — yield DiscoveredSite objects, accept a $limit parameter, stream results lazily.

License

MIT

统计信息

  • 总下载量: 164
  • 月度下载量: 0
  • 日度下载量: 0
  • 收藏数: 0
  • 点击次数: 3
  • 依赖项目数: 1
  • 推荐数: 0

GitHub 信息

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • 开发语言: PHP

其他信息

  • 授权协议: MIT
  • 更新时间: 2026-03-04

承接程序开发

PHP开发

VUE

Vue开发

前端开发

小程序开发

公众号开发

系统定制

数据库设计

云部署

网站建设

安全加固