mauricioperera/php-vector-store
最新稳定版本:v1.1.1
Composer 安装命令:
composer require mauricioperera/php-vector-store
包简介
Zero-dependency PHP vector database with BM25, hybrid search, Matryoshka, IVF indexing, and Int8 quantization
关键字:
README 文档
README
Zero-dependency PHP vector database with BM25 full-text search, hybrid search (vector + text), Matryoshka progressive search, IVF indexing, Int8 quantization, and 1-bit binary quantization (32x compression). Pure PHP 8.1+ — no SQLite, no C extensions, no FFI.
composer require mauricioperera/php-vector-store
Why
Most vector databases require C extensions (sqlite-vec), external services (Pinecone, Weaviate), or specific runtimes (Python). PHP Vector Store runs anywhere PHP runs — shared hosting, WordPress, Laravel, any framework.
New in v1.1: BinaryQuantizedStore — 1-bit sign quantization, 96 bytes/vector (768d), 27.8x faster than Int8, Hamming distance search via XOR + popcount.
v0.2: BM25 full-text search, hybrid search fusion (RRF + Weighted), multiple distance metrics, StoreInterface for polymorphism, typed models, and a PHPUnit test suite.
Scaling Guide
| Vectors | Recommended Config | Storage/vec | Total (100K) | Speed |
|---|---|---|---|---|
| <5K | Float32 768d + Matryoshka | 3,072 B | 300 MB | ~3ms |
| 5K-20K | Float32 384d + Matryoshka | 1,536 B | 150 MB | ~1.4ms |
| 20K-100K | Int8 384d + IVF + Matryoshka | 392 B | 38 MB | ~5ms |
| 100K-500K | Binary 768d + IVF + Matryoshka | 96 B | 9.4 MB | ~6ms |
| >500K | Use sqlite-vec or external service | — | — | — |
Quick Start
use PHPVectorStore\VectorStore; use PHPVectorStore\QuantizedStore; use PHPVectorStore\IVFIndex; use PHPVectorStore\HybridSearch; use PHPVectorStore\HybridMode; use PHPVectorStore\Distance; use PHPVectorStore\BM25\Index as BM25Index; // 1. Vector search $store = new QuantizedStore( __DIR__ . '/vectors', 384 ); $store->set( 'articles', 'art-1', $embedding, ['title' => 'My Article'] ); $store->flush(); $results = $store->matryoshkaSearch( 'articles', $query, 5, [128, 256, 384] ); // 2. Full-text search (BM25) $bm25 = new BM25Index(); $bm25->addDocument( 'articles', 'art-1', 'My article about machine learning...' ); $results = $bm25->search( 'articles', 'machine learning', 10 ); // 3. Hybrid search (vector + text combined) $hybrid = new HybridSearch( $store, $bm25, HybridMode::RRF ); $results = $hybrid->search( 'articles', $query_vector, 'machine learning', 5 ); // 4. Multiple distance metrics $results = $store->search( 'articles', $query, 5, 0, Distance::Euclidean );
Features
Vector Storage (Float32, Int8 & Binary)
// Full precision: dim x 4 bytes per vector $store = new VectorStore( '/path', 768 ); // Int8 quantized: dim + 8 bytes per vector (4x smaller) $q8 = new QuantizedStore( '/path', 384 ); // Binary quantized: ceil(dim/8) bytes per vector (32x smaller) $b1 = new BinaryQuantizedStore( '/path', 768 ); // 96 B/vec
All three implement StoreInterface — use them interchangeably.
BM25 Full-Text Search
Okapi BM25 inverted index, collection-aware, with persistence.
use PHPVectorStore\BM25\Index; use PHPVectorStore\BM25\Config; use PHPVectorStore\BM25\SimpleTokenizer; $bm25 = new Index( config: new Config( k1: 1.5, b: 0.75 ), tokenizer: new SimpleTokenizer(), ); // Index documents $bm25->addDocument( 'articles', 'doc-1', 'The quick brown fox...' ); $bm25->addDocument( 'articles', 'doc-2', 'Database systems and SQL...' ); // Search $results = $bm25->search( 'articles', 'quick fox', 10 ); // [['id' => 'doc-1', 'score' => 1.234, 'rank' => 1], ...] // Get raw scores (for hybrid fusion) $scores = $bm25->scoreAll( 'articles', 'quick fox' ); // ['doc-1' => 1.234, 'doc-2' => 0.0] // Persist to disk $bm25->save( '/path/vectors', 'articles' ); // writes articles.bm25.bin $bm25->load( '/path/vectors', 'articles' ); // restores state
The SimpleTokenizer handles Unicode text with configurable stop words:
// Custom stop words for Spanish $tokenizer = new SimpleTokenizer( stopWords: ['el', 'la', 'los', 'las', 'de', 'en', 'y', 'que', 'es', 'un', 'una'], minTokenLength: 2, ); $bm25 = new Index( tokenizer: $tokenizer );
Hybrid Search
Combines vector similarity with BM25 text relevance using fusion strategies.
use PHPVectorStore\HybridSearch; use PHPVectorStore\HybridMode; // RRF fusion (recommended — robust, no tuning needed) $hybrid = new HybridSearch( $store, $bm25, HybridMode::RRF ); $results = $hybrid->search( 'articles', $vector, 'search text', 5 ); // Weighted fusion (tunable weights) $hybrid = new HybridSearch( $store, $bm25, HybridMode::Weighted ); $results = $hybrid->search( 'articles', $vector, 'search text', 5, [ 'vectorWeight' => 0.7, 'textWeight' => 0.3, ]); // Multi-collection hybrid $results = $hybrid->searchAcross( ['articles', 'comments'], $vector, 'search text', 10, );
RRF (Reciprocal Rank Fusion): score(d) = Σ 1/(k + rank(d)) — combines ranks from both legs without needing score normalization. Best default choice.
Weighted: Min-max normalizes both score sets to [0,1], then combined = w_vec * vecNorm + w_text * textNorm. Use when you want explicit control over the balance.
Distance Metrics
use PHPVectorStore\Distance; // Cosine similarity (default) — best for normalized embeddings $store->search( 'col', $query, 5, 0, Distance::Cosine ); // Euclidean distance — converted to similarity: 1/(1+dist) $store->search( 'col', $query, 5, 0, Distance::Euclidean ); // Dot product — for pre-normalized vectors $store->search( 'col', $query, 5, 0, Distance::DotProduct ); // Manhattan distance — robust to outliers: 1/(1+dist) $store->search( 'col', $query, 5, 0, Distance::Manhattan );
Works with search(), matryoshkaSearch(), and searchAcross() on all three stores.
IVF Clustering
K-means partitions vectors into clusters for sub-linear search.
$ivf = new IVFIndex( $store, numClusters: 100, numProbes: 20 ); $ivf->build( 'articles' ); $results = $ivf->search( 'articles', $query, 5 ); $results = $ivf->matryoshkaSearch( 'articles', $query, 5, [128, 256, 384] );
Works with VectorStore, QuantizedStore, and BinaryQuantizedStore (via StoreInterface).
Matryoshka Multi-Stage Search
Progressive refinement — each stage narrows candidates before the next.
$store->matryoshkaSearch( 'col', $query, 5, [128, 384, 768] );
Speedup: 3-5x over brute-force (Int8), 13.7x (Binary). Combined with IVF: 10-15x.
StoreInterface
VectorStore, QuantizedStore, and BinaryQuantizedStore all implement StoreInterface:
use PHPVectorStore\StoreInterface; function buildIndex( StoreInterface $store ): void { $ivf = new IVFIndex( $store ); $ivf->build( 'articles' ); } // Works with any store buildIndex( new VectorStore( '/path', 384 ) ); buildIndex( new QuantizedStore( '/path', 384 ) ); buildIndex( new BinaryQuantizedStore( '/path', 768 ) );
Typed Models
use PHPVectorStore\Document; use PHPVectorStore\SearchResult; $doc = new Document( id: 'doc-1', vector: [0.1, 0.2, ...], text: 'The quick brown fox...', metadata: ['title' => 'My Doc'], ); $result = new SearchResult( id: 'doc-1', score: 0.95, rank: 1, metadata: ['title' => 'My Doc'], collection: 'articles', );
Typed Exceptions
use PHPVectorStore\Exception\VectorStoreException; use PHPVectorStore\Exception\DimensionMismatchException; use PHPVectorStore\Exception\CollectionNotFoundException;
Concurrency & Scaling Notes
File Locking
All flush() operations use flock(LOCK_EX) to prevent race conditions when multiple PHP processes write to the same collection simultaneously. This ensures atomic writes even under concurrent web requests.
Dimension Validation
set() throws DimensionMismatchException if the vector has fewer dimensions than the store was configured with. This catches mismatches early (e.g., passing a 384d vector to a 768d store).
JSON Manifest Scaling
Each collection stores its ID list and metadata in a .json sidecar file. For collections approaching 100K vectors, this manifest can grow large (~10-20 MB). Considerations:
- Memory: The entire manifest is loaded into memory on first access to a collection. For 100K vectors with metadata, budget ~50-100 MB of PHP memory.
- Latency: JSON decode of a large manifest adds ~50-200ms on first load (cached for subsequent operations within the same request).
- Mitigation: Use multiple collections (per entity type) to keep individual manifests small. A collection of 10K vectors has a ~1-2 MB manifest.
For datasets beyond 100K vectors, consider sqlite-vec or an external vector database.
API Reference
StoreInterface (VectorStore, QuantizedStore & BinaryQuantizedStore)
// Write ->set( $collection, $id, $vector, $metadata = [] ) ->remove( $collection, $id ): bool ->drop( $collection ) ->flush() // Read ->get( $collection, $id ): ?array // {id, vector, metadata} ->has( $collection, $id ): bool ->count( $collection ): int ->ids( $collection ): string[] ->collections(): string[] ->stats(): array ->dimensions(): int ->directory(): string // Search ->search( $collection, $query, $limit = 5, $dimSlice = 0, $distance = null ) ->matryoshkaSearch( $collection, $query, $limit = 5, $stages = [...], $multiplier = 3, $distance = null ) ->searchAcross( $collections, $query, $limit = 5, $dimSlice = 0, $distance = null ) // Import/Export ->import( $collection, $records ): int ->export( $collection ): array
BM25\Index
->addDocument( $collection, $id, $text ) ->removeDocument( $collection, $id ) ->search( $collection, $query, $limit = 10 ): array ->scoreAll( $collection, $query ): array // id => score ->count( $collection ): int ->vocabularySize( $collection ): int ->save( $directory, $collection ) ->load( $directory, $collection ) ->exportState( $collection ): array ->importState( $collection, $state )
HybridSearch
->search( $collection, $vector, $text, $limit = 5, $options = [] ) ->searchAcross( $collections, $vector, $text, $limit = 5, $options = [] )
Options: fetchK, vectorWeight, textWeight, rrfK, dimSlice.
IVFIndex
new IVFIndex( StoreInterface $store, int $numClusters = 100, int $numProbes = 10 ) ->build( $collection, $sampleDims = 128 ): array ->search( $collection, $query, $limit = 5, $dimSlice = 0 ) ->matryoshkaSearch( $collection, $query, $limit, $stages, $multiplier = 3 ) ->hasIndex( $collection ): bool ->indexStats( $collection ): ?array ->dropIndex( $collection )
Math (static)
VectorStore::normalize( $vector ): array VectorStore::cosineSim( $a, $b, $dims ): float VectorStore::euclideanDist( $a, $b, $dims ): float VectorStore::dotProduct( $a, $b, $dims ): float VectorStore::manhattanDist( $a, $b, $dims ): float VectorStore::computeScore( $a, $b, $dims, Distance $distance ): float
Storage Format
vectors/
├── articles.bin ← Float32: N x dim x 4 bytes
├── articles.json ← Manifest: IDs + metadata
├── articles.q8.bin ← Int8: N x (dim + 8) bytes
├── articles.q8.json ← Int8 manifest
├── articles.b1.bin ← Binary: N x ceil(dim/8) bytes
├── articles.b1.json ← Binary manifest
├── articles.ivf.json ← IVF: centroids + cluster assignments
├── articles.bm25.bin ← BM25: inverted index (serialized PHP)
└── .htaccess ← Access protection
Testing
composer install vendor/bin/phpunit
57 tests across 6 suites: VectorStore, QuantizedStore, BinaryQuantizedStore, IVFIndex, BM25, HybridSearch.
Performance
Speed (1,000 vectors, bge-base 768d, PHP 8.2)
| Method | Int8 | Binary | Speedup |
|---|---|---|---|
| Brute-force 768d | 556ms | 20ms | 27.8x |
| Matryoshka 128→384→768 | 86ms | 6.3ms | 13.7x |
Storage
| Format | Per vector | 10K | 100K | 500K |
|---|---|---|---|---|
| Float32 768d | 3,072 B | 30 MB | 300 MB | 1.5 GB |
| Float32 384d | 1,536 B | 15 MB | 150 MB | 750 MB |
| Int8 768d | 776 B | 7.6 MB | 76 MB | 380 MB |
| Int8 384d | 392 B | 3.8 MB | 38 MB | 192 MB |
| Binary 768d | 96 B | 0.9 MB | 9.4 MB | 47 MB |
| Binary 384d | 48 B | 0.47 MB | 4.7 MB | 23 MB |
Integration Patterns
WordPress
$store = new QuantizedStore( WP_CONTENT_DIR . '/vectors', 384 ); $bm25 = new BM25\Index(); add_action( 'wp_after_insert_post', function( $id, $post ) use ( $store, $bm25 ) { if ( 'publish' !== $post->post_status ) return; $text = $post->post_title . ' ' . wp_strip_all_tags( $post->post_content ); $vector = array_slice( your_embedding_api( $text ), 0, 384 ); $store->set( 'posts', (string) $id, $vector, ['title' => $post->post_title] ); $bm25->addDocument( 'posts', (string) $id, $text ); $store->flush(); $bm25->save( WP_CONTENT_DIR . '/vectors', 'posts' ); }, 10, 2 ); // Hybrid search $hybrid = new HybridSearch( $store, $bm25, HybridMode::RRF ); $results = $hybrid->search( 'posts', $query_vector, $search_text, 5 );
Laravel
// Service Provider $this->app->singleton( StoreInterface::class, fn() => new QuantizedStore( storage_path( 'vectors' ), 384 ) ); // Controller public function search( Request $request ) { $store = app( StoreInterface::class ); $query = array_slice( $this->embed( $request->q ), 0, 384 ); $results = $store->matryoshkaSearch( 'articles', $query, 10, [128, 256, 384] ); return Article::whereIn( 'id', array_column( $results, 'id' ) )->get(); }
Neuron AI (RAG)
use PHPVectorStore\Integration\NeuronVectorStore; class MyRAG extends RAG { protected function vectorStore(): VectorStoreInterface { return new NeuronVectorStore( directory: __DIR__ . '/vectors', dimensions: 384, quantized: true, matryoshka: true, ); } }
Architecture
PHPVectorStore\
├── StoreInterface ← Common interface
├── VectorStore ← Float32 storage (implements StoreInterface)
├── QuantizedStore ← Int8 storage (implements StoreInterface)
├── BinaryQuantizedStore ← 1-bit storage (implements StoreInterface)
├── IVFIndex ← K-means clustering (wraps StoreInterface)
├── HybridSearch ← Vector + BM25 fusion
├── Distance ← Enum: Cosine, Euclidean, DotProduct, Manhattan
├── HybridMode ← Enum: RRF, Weighted
├── Document ← Typed model
├── SearchResult ← Typed model
├── BM25\
│ ├── Index ← Okapi BM25 inverted index
│ ├── Config ← k1, b parameters
│ ├── TokenizerInterface ← Pluggable tokenization
│ └── SimpleTokenizer ← Unicode tokenizer with stop words
├── Exception\
│ ├── VectorStoreException
│ ├── DimensionMismatchException
│ └── CollectionNotFoundException
└── Integration\
└── NeuronVectorStore ← Neuron AI RAG adapter
License
MIT
统计信息
- 总下载量: 15
- 月度下载量: 0
- 日度下载量: 0
- 收藏数: 1
- 点击次数: 2
- 依赖项目数: 1
- 推荐数: 1
其他信息
- 授权协议: MIT
- 更新时间: 2026-03-22