mauricioperera/php-vector-store

Latest stable version: v1.1.1

Composer install command:

composer require mauricioperera/php-vector-store

Package description

Zero-dependency PHP vector database with BM25, hybrid search, Matryoshka, IVF indexing, and Int8 quantization

README

Zero-dependency PHP vector database with BM25 full-text search, hybrid search (vector + text), Matryoshka progressive search, IVF indexing, Int8 quantization, and 1-bit binary quantization (32x compression). Pure PHP 8.1+ — no SQLite, no C extensions, no FFI.

composer require mauricioperera/php-vector-store

Why

Most vector databases require C extensions (sqlite-vec), external services (Pinecone, Weaviate), or specific runtimes (Python). PHP Vector Store runs anywhere PHP runs — shared hosting, WordPress, Laravel, any framework.

New in v1.1: BinaryQuantizedStore, which uses 1-bit sign quantization (96 bytes per vector at 768d), runs brute-force search 27.8x faster than Int8, and searches by Hamming distance via XOR + popcount.

Added in v0.2: BM25 full-text search, hybrid search fusion (RRF + Weighted), multiple distance metrics, StoreInterface for polymorphism, typed models, and a PHPUnit test suite.

Scaling Guide

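| Vectors   | Recommended Config                  | Storage/vec | Total (100K) | Speed  |
|-----------|-------------------------------------|-------------|--------------|--------|
| <5K       | Float32 768d + Matryoshka           | 3,072 B     | 300 MB       | ~3ms   |
| 5K-20K    | Float32 384d + Matryoshka           | 1,536 B     | 150 MB       | ~1.4ms |
| 20K-100K  | Int8 384d + IVF + Matryoshka        | 392 B       | 38 MB        | ~5ms   |
| 100K-500K | Binary 768d + IVF + Matryoshka      | 96 B        | 9.4 MB       | ~6ms   |
| >500K     | Use sqlite-vec or external service  |             |              |        |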

Quick Start

use PHPVectorStore\VectorStore;
use PHPVectorStore\QuantizedStore;
use PHPVectorStore\IVFIndex;
use PHPVectorStore\HybridSearch;
use PHPVectorStore\HybridMode;
use PHPVectorStore\Distance;
use PHPVectorStore\BM25\Index as BM25Index;

// 1. Vector search
$store = new QuantizedStore( __DIR__ . '/vectors', 384 );
$store->set( 'articles', 'art-1', $embedding, ['title' => 'My Article'] );
$store->flush();

$results = $store->matryoshkaSearch( 'articles', $query, 5, [128, 256, 384] );

// 2. Full-text search (BM25)
$bm25 = new BM25Index();
$bm25->addDocument( 'articles', 'art-1', 'My article about machine learning...' );

$results = $bm25->search( 'articles', 'machine learning', 10 );

// 3. Hybrid search (vector + text combined)
$hybrid = new HybridSearch( $store, $bm25, HybridMode::RRF );
$results = $hybrid->search( 'articles', $query_vector, 'machine learning', 5 );

// 4. Multiple distance metrics
$results = $store->search( 'articles', $query, 5, 0, Distance::Euclidean );

Features

Vector Storage (Float32, Int8 & Binary)

// Full precision: dim x 4 bytes per vector
$store = new VectorStore( '/path', 768 );

// Int8 quantized: dim + 8 bytes per vector (4x smaller)
$q8 = new QuantizedStore( '/path', 384 );

// Binary quantized: ceil(dim/8) bytes per vector (32x smaller)
$b1 = new BinaryQuantizedStore( '/path', 768 );  // 96 B/vec

All three implement StoreInterface — use them interchangeably.
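
To make the binary format concrete, here is a minimal sketch of sign-based 1-bit quantization and Hamming distance via XOR + popcount. It is written independently of the library: the function names signQuantize, popcount8 and hammingDistance, the bit order, and the "greater than zero" threshold are assumptions for illustration, not BinaryQuantizedStore's actual internals.

```php
<?php
// Illustrative only: not the library's code. Bit order and threshold are assumptions.

/** Pack a float vector into ceil(dim/8) bytes: a bit is 1 when the component is > 0. */
function signQuantize(array $vector): string
{
    $bytes = '';
    foreach (array_chunk($vector, 8) as $chunk) {
        $byte = 0;
        foreach ($chunk as $i => $value) {
            if ($value > 0) {
                $byte |= 1 << (7 - $i);   // most significant bit first
            }
        }
        $bytes .= chr($byte);
    }
    return $bytes;
}

/** Number of set bits in a single byte value (0..255). */
function popcount8(int $byte): int
{
    $count = 0;
    while ($byte > 0) {
        $count += $byte & 1;
        $byte >>= 1;
    }
    return $count;
}

/** Hamming distance between two equal-length codes: popcount of their XOR. */
function hammingDistance(string $a, string $b): int
{
    $xor = $a ^ $b;                        // PHP XORs strings byte by byte
    $distance = 0;
    foreach (count_chars($xor, 1) as $byteValue => $occurrences) {
        $distance += popcount8($byteValue) * $occurrences;
    }
    return $distance;
}
```

A 768-dimensional vector packs into 96 bytes this way, which matches the per-vector size quoted above.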

BM25 Full-Text Search

Okapi BM25 inverted index, collection-aware, with persistence.

use PHPVectorStore\BM25\Index;
use PHPVectorStore\BM25\Config;
use PHPVectorStore\BM25\SimpleTokenizer;

$bm25 = new Index(
    config: new Config( k1: 1.5, b: 0.75 ),
    tokenizer: new SimpleTokenizer(),
);

// Index documents
$bm25->addDocument( 'articles', 'doc-1', 'The quick brown fox...' );
$bm25->addDocument( 'articles', 'doc-2', 'Database systems and SQL...' );

// Search
$results = $bm25->search( 'articles', 'quick fox', 10 );
// [['id' => 'doc-1', 'score' => 1.234, 'rank' => 1], ...]

// Get raw scores (for hybrid fusion)
$scores = $bm25->scoreAll( 'articles', 'quick fox' );
// ['doc-1' => 1.234, 'doc-2' => 0.0]

// Persist to disk
$bm25->save( '/path/vectors', 'articles' );  // writes articles.bm25.bin
$bm25->load( '/path/vectors', 'articles' );  // restores state

The SimpleTokenizer handles Unicode text with configurable stop words:

// Custom stop words for Spanish
$tokenizer = new SimpleTokenizer(
    stopWords: ['el', 'la', 'los', 'las', 'de', 'en', 'y', 'que', 'es', 'un', 'una'],
    minTokenLength: 2,
);
$bm25 = new Index( tokenizer: $tokenizer );
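
For readers unfamiliar with Okapi BM25, the scoring idea can be sketched in a few lines over pre-tokenized documents. This is not the library's Index class: the function name bm25Scores is hypothetical, and the IDF variant used here, ln(1 + (N - n + 0.5) / (n + 0.5)), is one common choice assumed for illustration.

```php
<?php
// Illustrative BM25 scorer; the IDF variant and function name are assumptions.

/**
 * @param array<string, string[]> $docs  document id => list of tokens
 * @param string[]                $query query tokens
 * @return array<string, float>          document id => score, highest first
 */
function bm25Scores(array $docs, array $query, float $k1 = 1.5, float $b = 0.75): array
{
    $n = count($docs);
    $avgLen = array_sum(array_map('count', $docs)) / max(1, $n);

    // Document frequency: how many documents contain each term.
    $df = [];
    foreach ($docs as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $scores = [];
    foreach ($docs as $id => $tokens) {
        $tf = array_count_values($tokens);   // term frequency within this document
        $score = 0.0;
        foreach (array_unique($query) as $term) {
            if (!isset($tf[$term])) {
                continue;
            }
            $idf = log(1 + ($n - $df[$term] + 0.5) / ($df[$term] + 0.5));
            // Length normalization: longer-than-average documents are penalized by b.
            $norm = $tf[$term] + $k1 * (1 - $b + $b * count($tokens) / $avgLen);
            $score += $idf * $tf[$term] * ($k1 + 1) / $norm;
        }
        $scores[$id] = $score;
    }
    arsort($scores);
    return $scores;
}
```

Documents that contain none of the query terms score 0.0, which is the behaviour the scoreAll() example output above also shows.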

Hybrid Search

Combines vector similarity with BM25 text relevance using fusion strategies.

use PHPVectorStore\HybridSearch;
use PHPVectorStore\HybridMode;

// RRF fusion (recommended — robust, no tuning needed)
$hybrid = new HybridSearch( $store, $bm25, HybridMode::RRF );
$results = $hybrid->search( 'articles', $vector, 'search text', 5 );

// Weighted fusion (tunable weights)
$hybrid = new HybridSearch( $store, $bm25, HybridMode::Weighted );
$results = $hybrid->search( 'articles', $vector, 'search text', 5, [
    'vectorWeight' => 0.7,
    'textWeight'   => 0.3,
]);

// Multi-collection hybrid
$results = $hybrid->searchAcross(
    ['articles', 'comments'],
    $vector, 'search text', 10,
);

RRF (Reciprocal Rank Fusion): score(d) = Σ 1/(k + rank(d)) — combines ranks from both legs without needing score normalization. Best default choice.
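
The RRF formula can be sketched as a small standalone function. The name rrfFuse and the default k = 60 are assumptions for illustration; in the library the constant is controlled through the rrfK option.

```php
<?php
// Illustrative Reciprocal Rank Fusion; not the library's HybridSearch code.

/**
 * @param array<int, string[]> $rankings one list of ids per search leg, best first
 * @return array<string, float>          id => fused score, highest first
 */
function rrfFuse(array $rankings, int $k = 60): array
{
    $fused = [];
    foreach ($rankings as $ranking) {
        foreach (array_values($ranking) as $index => $id) {
            $rank = $index + 1;   // ranks are 1-based
            $fused[$id] = ($fused[$id] ?? 0.0) + 1.0 / ($k + $rank);
        }
    }
    arsort($fused);
    return $fused;
}

$vectorLeg = ['art-2', 'art-1', 'art-3'];   // ids ranked by vector similarity
$textLeg   = ['art-1', 'art-4', 'art-2'];   // ids ranked by BM25
$fused = rrfFuse([$vectorLeg, $textLeg]);
// Order of array_keys($fused): art-1, art-2, art-4, art-3
```

Only ranks enter the sum, so the two legs can use completely different score scales.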

Weighted: Min-max normalizes both score sets to [0,1], then combined = w_vec * vecNorm + w_text * textNorm. Use when you want explicit control over the balance.
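
Weighted fusion can be sketched the same way. The helper names minMaxNormalize and weightedFuse are hypothetical, and two details are assumptions rather than documented behaviour: an id missing from one leg contributes 0 for that leg, and a leg whose scores are all equal normalizes to 1.0.

```php
<?php
// Illustrative weighted fusion; edge-case handling is an assumption.

/** Scale scores to [0, 1]; keys are preserved. */
function minMaxNormalize(array $scores): array
{
    if ($scores === []) {
        return [];
    }
    $min = min($scores);
    $range = max($scores) - $min;
    return array_map(
        fn($score) => $range > 0 ? ($score - $min) / $range : 1.0,
        $scores
    );
}

/**
 * @param array<string, float> $vecScores  id => vector similarity
 * @param array<string, float> $textScores id => BM25 score
 * @return array<string, float>            id => combined score, highest first
 */
function weightedFuse(array $vecScores, array $textScores, float $wVec = 0.7, float $wText = 0.3): array
{
    $vecNorm  = minMaxNormalize($vecScores);
    $textNorm = minMaxNormalize($textScores);

    $combined = [];
    foreach (array_keys($vecNorm + $textNorm) as $id) {
        $combined[$id] = $wVec * ($vecNorm[$id] ?? 0.0) + $wText * ($textNorm[$id] ?? 0.0);
    }
    arsort($combined);
    return $combined;
}
```

Because min-max normalization depends on the best and worst score in each leg, a single outlier can compress the rest of that leg toward 0, which is one reason RRF is the safer default.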

Distance Metrics

use PHPVectorStore\Distance;

// Cosine similarity (default) — best for normalized embeddings
$store->search( 'col', $query, 5, 0, Distance::Cosine );

// Euclidean distance — converted to similarity: 1/(1+dist)
$store->search( 'col', $query, 5, 0, Distance::Euclidean );

// Dot product — for pre-normalized vectors
$store->search( 'col', $query, 5, 0, Distance::DotProduct );

// Manhattan distance — robust to outliers: 1/(1+dist)
$store->search( 'col', $query, 5, 0, Distance::Manhattan );

Works with search(), matryoshkaSearch(), and searchAcross() on all three stores.
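
The 1/(1+dist) conversion mentioned for Euclidean and Manhattan distance keeps "higher is better" ordering across all metrics. A standalone sketch, with hypothetical function names that do not mirror the library's static helpers:

```php
<?php
// Illustrative distance-to-similarity conversion; function names are hypothetical.

function euclideanDistanceSketch(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $diff = $x - $b[$i];
        $sum += $diff * $diff;
    }
    return sqrt($sum);
}

function manhattanDistanceSketch(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += abs($x - $b[$i]);
    }
    return $sum;
}

/** Map a distance in [0, inf) to a similarity in (0, 1]; identical vectors give 1.0. */
function distanceToSimilarity(float $distance): float
{
    return 1.0 / (1.0 + $distance);
}
```

For the vectors [0, 0] and [3, 4], the Euclidean distance is 5 and the Manhattan distance is 7, giving similarities of 1/6 and 1/8.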

IVF Clustering

K-means partitions vectors into clusters for sub-linear search.

$ivf = new IVFIndex( $store, numClusters: 100, numProbes: 20 );
$ivf->build( 'articles' );
$results = $ivf->search( 'articles', $query, 5 );
$results = $ivf->matryoshkaSearch( 'articles', $query, 5, [128, 256, 384] );

Works with VectorStore, QuantizedStore, and BinaryQuantizedStore (via StoreInterface).
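
The query side of IVF can be sketched as follows: rank the centroids, then score only the vectors in the closest numProbes clusters. K-means training is omitted, and the array shapes and function names are assumptions for illustration, not the layout of the library's .ivf.json file.

```php
<?php
// Illustrative IVF probing; data layout and names are assumptions.

function cosineSimilaritySketch(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot   += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }
    $denominator = sqrt($normA) * sqrt($normB);
    return $denominator > 0.0 ? $dot / $denominator : 0.0;
}

/**
 * @param array<int, float[]>    $centroids cluster id => centroid vector
 * @param array<int, string[]>   $clusters  cluster id => member ids
 * @param array<string, float[]> $vectors   id => stored vector
 * @return array<string, float>             id => similarity, highest first
 */
function ivfSearchSketch(array $centroids, array $clusters, array $vectors, array $query, int $limit, int $numProbes): array
{
    // Step 1: find the numProbes centroids closest to the query.
    $centroidScores = [];
    foreach ($centroids as $clusterId => $centroid) {
        $centroidScores[$clusterId] = cosineSimilaritySketch($query, $centroid);
    }
    arsort($centroidScores);
    $probed = array_slice(array_keys($centroidScores), 0, $numProbes);

    // Step 2: score only the members of those clusters.
    $scores = [];
    foreach ($probed as $clusterId) {
        foreach ($clusters[$clusterId] ?? [] as $id) {
            $scores[$id] = cosineSimilaritySketch($query, $vectors[$id]);
        }
    }
    arsort($scores);
    return array_slice($scores, 0, $limit, true);
}
```

Raising numProbes scans more clusters, trading speed for recall; with numProbes equal to numClusters the search degenerates to brute force.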

Matryoshka Multi-Stage Search

Progressive refinement — each stage narrows candidates before the next.

$store->matryoshkaSearch( 'col', $query, 5, [128, 384, 768] );

Speedup: 3-5x over brute-force (Int8). In the benchmark under Performance, Binary Matryoshka search is 13.7x faster than Int8 Matryoshka search. Combined with IVF: 10-15x.
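
The staged narrowing can be sketched as a loop over prefix lengths. This is an independent illustration: the function names are hypothetical, and the rule "keep limit x multiplier candidates after every stage except the last" is an assumption about how the $multiplier parameter is used.

```php
<?php
// Illustrative Matryoshka search; candidate-count rule is an assumption.

/** Cosine similarity over the first $dims components only. */
function cosinePrefix(array $a, array $b, int $dims): float
{
    $dims = min($dims, count($a), count($b));
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    for ($i = 0; $i < $dims; $i++) {
        $dot   += $a[$i] * $b[$i];
        $normA += $a[$i] * $a[$i];
        $normB += $b[$i] * $b[$i];
    }
    $denominator = sqrt($normA) * sqrt($normB);
    return $denominator > 0.0 ? $dot / $denominator : 0.0;
}

/**
 * @param array<string, float[]> $vectors id => stored vector
 * @param int[]                  $stages  increasing prefix lengths, e.g. [128, 384, 768]
 * @return array<string, float>           id => final-stage score, highest first
 */
function matryoshkaSearchSketch(array $vectors, array $query, int $limit, array $stages, int $multiplier = 3): array
{
    $stages = array_values($stages);
    $lastStage = count($stages) - 1;
    $candidates = array_keys($vectors);
    $scores = [];

    foreach ($stages as $stageIndex => $dims) {
        // Re-score the surviving candidates using a longer prefix.
        $scores = [];
        foreach ($candidates as $id) {
            $scores[$id] = cosinePrefix($query, $vectors[$id], $dims);
        }
        arsort($scores);

        // Early stages keep a wider pool; the final stage keeps exactly $limit.
        $keep = $stageIndex === $lastStage ? $limit : $limit * $multiplier;
        $scores = array_slice($scores, 0, $keep, true);
        $candidates = array_keys($scores);
    }
    return $scores;
}
```

Most of the work happens at the cheapest stage, since only a few candidates are ever compared at full dimensionality.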

StoreInterface

VectorStore, QuantizedStore, and BinaryQuantizedStore all implement StoreInterface:

use PHPVectorStore\StoreInterface;

function buildIndex( StoreInterface $store ): void {
    $ivf = new IVFIndex( $store );
    $ivf->build( 'articles' );
}

// Works with any store
buildIndex( new VectorStore( '/path', 384 ) );
buildIndex( new QuantizedStore( '/path', 384 ) );
buildIndex( new BinaryQuantizedStore( '/path', 768 ) );

Typed Models

use PHPVectorStore\Document;
use PHPVectorStore\SearchResult;

$doc = new Document(
    id: 'doc-1',
    vector: [0.1, 0.2, /* ... */],
    text: 'The quick brown fox...',
    metadata: ['title' => 'My Doc'],
);

$result = new SearchResult(
    id: 'doc-1',
    score: 0.95,
    rank: 1,
    metadata: ['title' => 'My Doc'],
    collection: 'articles',
);

Typed Exceptions

use PHPVectorStore\Exception\VectorStoreException;
use PHPVectorStore\Exception\DimensionMismatchException;
use PHPVectorStore\Exception\CollectionNotFoundException;

Concurrency & Scaling Notes

File Locking

All flush() operations use flock(LOCK_EX) to prevent race conditions when multiple PHP processes write to the same collection simultaneously. This ensures atomic writes even under concurrent web requests.
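
The locking pattern described here can be sketched with PHP's built-in flock(). The helper name writeLocked and the file handling around it are assumptions for illustration; the library's flush() may structure its writes differently.

```php
<?php
// Illustrative exclusive-lock write; helper name and flow are assumptions.

function writeLocked(string $path, string $bytes): bool
{
    // 'c' creates the file if needed but does not truncate it before the lock is held.
    $handle = fopen($path, 'c');
    if ($handle === false) {
        return false;
    }
    try {
        // Blocks until no other process holds the lock.
        if (!flock($handle, LOCK_EX)) {
            return false;
        }
        ftruncate($handle, 0);
        $written = fwrite($handle, $bytes);
        fflush($handle);             // push data out before releasing the lock
        flock($handle, LOCK_UN);
        return $written === strlen($bytes);
    } finally {
        fclose($handle);
    }
}
```

Note that flock() is advisory: it only protects against other writers that also take the lock on the same file.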

Dimension Validation

set() throws DimensionMismatchException if the vector has fewer dimensions than the store was configured with. This catches mismatches early (e.g., passing a 384d vector to a 768d store).
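
The check amounts to comparing the vector length with the configured dimension count before anything is stored. A standalone sketch, using a hypothetical stand-in exception class so that it runs without the library installed:

```php
<?php
// Illustrative dimension check. DimensionMismatchSketch is a hypothetical stand-in
// for the library's DimensionMismatchException.

class DimensionMismatchSketch extends InvalidArgumentException
{
}

function assertDimensions(array $vector, int $expected): void
{
    $actual = count($vector);
    if ($actual < $expected) {
        throw new DimensionMismatchSketch(
            "Expected at least {$expected} dimensions, got {$actual}"
        );
    }
}

try {
    assertDimensions(array_fill(0, 384, 0.0), 768);   // 384d vector, 768d store
} catch (DimensionMismatchSketch $e) {
    // Re-embed with the right model or open a 384d store instead.
    $message = $e->getMessage();
}
```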

JSON Manifest Scaling

Each collection stores its ID list and metadata in a .json sidecar file. For collections approaching 100K vectors, this manifest can grow large (~10-20 MB). Considerations:

  • Memory: The entire manifest is loaded into memory on first access to a collection. For 100K vectors with metadata, budget ~50-100 MB of PHP memory.
  • Latency: JSON decode of a large manifest adds ~50-200ms on first load (cached for subsequent operations within the same request).
  • Mitigation: Use multiple collections (per entity type) to keep individual manifests small. A collection of 10K vectors has a ~1-2 MB manifest.

For datasets beyond 100K vectors, consider sqlite-vec or an external vector database.

API Reference

StoreInterface (VectorStore, QuantizedStore & BinaryQuantizedStore)

// Write
->set( $collection, $id, $vector, $metadata = [] )
->remove( $collection, $id ): bool
->drop( $collection )
->flush()

// Read
->get( $collection, $id ): ?array     // {id, vector, metadata}
->has( $collection, $id ): bool
->count( $collection ): int
->ids( $collection ): string[]
->collections(): string[]
->stats(): array
->dimensions(): int
->directory(): string

// Search
->search( $collection, $query, $limit = 5, $dimSlice = 0, $distance = null )
->matryoshkaSearch( $collection, $query, $limit = 5, $stages = [...], $multiplier = 3, $distance = null )
->searchAcross( $collections, $query, $limit = 5, $dimSlice = 0, $distance = null )

// Import/Export
->import( $collection, $records ): int
->export( $collection ): array

BM25\Index

->addDocument( $collection, $id, $text )
->removeDocument( $collection, $id )
->search( $collection, $query, $limit = 10 ): array
->scoreAll( $collection, $query ): array    // id => score
->count( $collection ): int
->vocabularySize( $collection ): int
->save( $directory, $collection )
->load( $directory, $collection )
->exportState( $collection ): array
->importState( $collection, $state )

HybridSearch

->search( $collection, $vector, $text, $limit = 5, $options = [] )
->searchAcross( $collections, $vector, $text, $limit = 5, $options = [] )

Options: fetchK, vectorWeight, textWeight, rrfK, dimSlice.

IVFIndex

new IVFIndex( StoreInterface $store, int $numClusters = 100, int $numProbes = 10 )

->build( $collection, $sampleDims = 128 ): array
->search( $collection, $query, $limit = 5, $dimSlice = 0 )
->matryoshkaSearch( $collection, $query, $limit, $stages, $multiplier = 3 )
->hasIndex( $collection ): bool
->indexStats( $collection ): ?array
->dropIndex( $collection )

Math (static)

VectorStore::normalize( $vector ): array
VectorStore::cosineSim( $a, $b, $dims ): float
VectorStore::euclideanDist( $a, $b, $dims ): float
VectorStore::dotProduct( $a, $b, $dims ): float
VectorStore::manhattanDist( $a, $b, $dims ): float
VectorStore::computeScore( $a, $b, $dims, Distance $distance ): float

Storage Format

vectors/
├── articles.bin          ← Float32: N x dim x 4 bytes
├── articles.json         ← Manifest: IDs + metadata
├── articles.q8.bin       ← Int8: N x (dim + 8) bytes
├── articles.q8.json      ← Int8 manifest
├── articles.b1.bin       ← Binary: N x ceil(dim/8) bytes
├── articles.b1.json      ← Binary manifest
├── articles.ivf.json     ← IVF: centroids + cluster assignments
├── articles.bm25.bin     ← BM25: inverted index (serialized PHP)
└── .htaccess             ← Access protection
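
To illustrate how an Int8 record can occupy dim + 8 bytes, here is a sketch that stores two float32 header values followed by one byte per component. The header contents (minimum and scale), their order, the little-endian encoding, and the function names are all assumptions; the actual .q8.bin layout may differ.

```php
<?php
// Illustrative Int8 record: 8-byte header (min, scale as float32) + 1 byte per dimension.
// The header layout is an assumption, not the documented .q8.bin format.

function int8Encode(array $vector): string
{
    $min = min($vector);
    $scale = (max($vector) - $min) / 255.0;
    if ($scale == 0.0) {
        $scale = 1.0;   // constant vector: every code is 0
    }
    $codes = array_map(
        fn($value) => (int) round(($value - $min) / $scale),
        array_values($vector)
    );
    // 'g' = little-endian float32, 'C' = unsigned byte.
    return pack('g2', $min, $scale) . pack('C*', ...$codes);
}

function int8Decode(string $record): array
{
    $header = unpack('gmin/gscale', substr($record, 0, 8));
    $codes  = unpack('C*', substr($record, 8));
    return array_values(array_map(
        fn($code) => $header['min'] + $code * $header['scale'],
        $codes
    ));
}
```

Decoding recovers each component to within about half a quantization step, which is why Int8 search quality stays close to Float32 for typical embeddings.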

Testing

composer install
vendor/bin/phpunit

57 tests across 6 suites: VectorStore, QuantizedStore, BinaryQuantizedStore, IVFIndex, BM25, HybridSearch.

Performance

Speed (1,000 vectors, bge-base 768d, PHP 8.2)

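| Method                 | Int8  | Binary | Speedup (Binary vs Int8) |
|------------------------|-------|--------|--------------------------|
| Brute-force 768d       | 556ms | 20ms   | 27.8x                    |
| Matryoshka 128→384→768 | 86ms  | 6.3ms  | 13.7x                    |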

Storage

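| Format       | Per vector | 10K     | 100K   | 500K   |
|--------------|------------|---------|--------|--------|
| Float32 768d | 3,072 B    | 30 MB   | 300 MB | 1.5 GB |
| Float32 384d | 1,536 B    | 15 MB   | 150 MB | 750 MB |
| Int8 768d    | 776 B      | 7.6 MB  | 76 MB  | 380 MB |
| Int8 384d    | 392 B      | 3.8 MB  | 38 MB  | 192 MB |
| Binary 768d  | 96 B       | 0.9 MB  | 9.4 MB | 47 MB  |
| Binary 384d  | 48 B       | 0.47 MB | 4.7 MB | 23 MB  |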
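
The per-vector column follows directly from the size formulas in the Storage Format section. A small helper, hypothetical and not part of the library, makes it easy to estimate other dimension counts:

```php
<?php
// Illustrative size calculator based on the formulas dim x 4, dim + 8, and ceil(dim / 8).

function bytesPerVector(string $format, int $dim): int
{
    return match ($format) {
        'float32' => $dim * 4,
        'int8'    => $dim + 8,
        'binary'  => intdiv($dim + 7, 8),   // integer form of ceil(dim / 8)
        default   => throw new InvalidArgumentException("Unknown format: {$format}"),
    };
}

// Raw .bin size for 100,000 binary 768d vectors, excluding the JSON manifest.
$totalBytes = bytesPerVector('binary', 768) * 100_000;   // 9,600,000 bytes
```

The totals in the table are rounded, and they cover only the vector data; the JSON manifest adds to the on-disk footprint.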

Integration Patterns

WordPress

$store = new QuantizedStore( WP_CONTENT_DIR . '/vectors', 384 );
$bm25  = new BM25\Index();

add_action( 'wp_after_insert_post', function( $id, $post ) use ( $store, $bm25 ) {
    if ( 'publish' !== $post->post_status ) return;
    $text   = $post->post_title . ' ' . wp_strip_all_tags( $post->post_content );
    $vector = array_slice( your_embedding_api( $text ), 0, 384 );
    $store->set( 'posts', (string) $id, $vector, ['title' => $post->post_title] );
    $bm25->addDocument( 'posts', (string) $id, $text );
    $store->flush();
    $bm25->save( WP_CONTENT_DIR . '/vectors', 'posts' );
}, 10, 2 );

// Hybrid search
$hybrid = new HybridSearch( $store, $bm25, HybridMode::RRF );
$results = $hybrid->search( 'posts', $query_vector, $search_text, 5 );

Laravel

// Service Provider
$this->app->singleton( StoreInterface::class, fn() =>
    new QuantizedStore( storage_path( 'vectors' ), 384 )
);

// Controller
public function search( Request $request ) {
    $store   = app( StoreInterface::class );
    $query   = array_slice( $this->embed( $request->q ), 0, 384 );
    $results = $store->matryoshkaSearch( 'articles', $query, 10, [128, 256, 384] );
    return Article::whereIn( 'id', array_column( $results, 'id' ) )->get();
}

Neuron AI (RAG)

use PHPVectorStore\Integration\NeuronVectorStore;

class MyRAG extends RAG {
    protected function vectorStore(): VectorStoreInterface {
        return new NeuronVectorStore(
            directory:  __DIR__ . '/vectors',
            dimensions: 384,
            quantized:  true,
            matryoshka: true,
        );
    }
}

Architecture

PHPVectorStore\
├── StoreInterface           ← Common interface
├── VectorStore              ← Float32 storage (implements StoreInterface)
├── QuantizedStore           ← Int8 storage (implements StoreInterface)
├── BinaryQuantizedStore     ← 1-bit storage (implements StoreInterface)
├── IVFIndex                 ← K-means clustering (wraps StoreInterface)
├── HybridSearch             ← Vector + BM25 fusion
├── Distance                 ← Enum: Cosine, Euclidean, DotProduct, Manhattan
├── HybridMode               ← Enum: RRF, Weighted
├── Document                 ← Typed model
├── SearchResult             ← Typed model
├── BM25\
│   ├── Index                ← Okapi BM25 inverted index
│   ├── Config               ← k1, b parameters
│   ├── TokenizerInterface   ← Pluggable tokenization
│   └── SimpleTokenizer      ← Unicode tokenizer with stop words
├── Exception\
│   ├── VectorStoreException
│   ├── DimensionMismatchException
│   └── CollectionNotFoundException
└── Integration\
    └── NeuronVectorStore    ← Neuron AI RAG adapter

License

MIT

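Statistics

  • Total downloads: 15
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 1
  • Clicks: 2
  • Dependents: 1
  • Recommendations: 1

GitHub information

  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2026-03-22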
