iliaal/phonetic
Native phonetic matching for PHP: Double Metaphone, Beider-Morse Phonetic Matching (BMPM), and Daitch-Mokotoff Soundex, the phonetic name-matching encoders that PHP core does not ship.
PHP core has soundex() and metaphone(), but not these three, which are the standard tools for fuzzy name matching, record linkage, and genealogy search across spelling and transliteration variants.
Choosing an algorithm
| | Double Metaphone | BMPM | Daitch-Mokotoff Soundex |
|---|---|---|---|
| Output | primary + alternate key | language-aware token set | distinct 6-digit codes |
| Two names match when | keys are equal | token sets intersect | code sets intersect |
| Strongest for | English and general Latin-script names | cross-language and transliteration variants (Slavic, Germanic, Hebrew, Romance) | Eastern-European and Ashkenazi surnames, genealogy |
| Spelling-variant recall | good | highest | high, within its language model |
| Ambiguity handling | up to 2 keys | many tokens | multiple codes |
| Relative speed | fastest (1.0x) | slowest (~40x) | middle (~3x) |
| Data source | clean-room published algorithm | Apache Commons Codec rule data | Apache Commons Codec rule data |
Rule of thumb: reach for Double Metaphone as a fast general-purpose default, BMPM when names cross languages or scripts, and Daitch-Mokotoff for Eastern-European and Jewish genealogy where it is the field standard.
API
Double Metaphone
Primary + alternate phonetic keys (Lawrence Philips). Clean-room implementation.
```php
double_metaphone(string $string, int $max_length = 4): array

double_metaphone("Schwarzenegger"); // ['primary' => 'XRSN', 'alternate' => 'XFRT']
double_metaphone("Smith");          // ['primary' => 'SM0', 'alternate' => 'XMT']
double_metaphone("Catherine", 3);   // ['primary' => 'K0R', 'alternate' => 'KTR']
```
`alternate` equals `primary` when the algorithm produced no alternate branch. `max_length` caps each key (default 4; 0 or negative = unlimited).
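Because `alternate` repeats `primary` when there is no alternate branch, storing both values verbatim can produce duplicate keys. A minimal sketch of collapsing a result into its distinct keys follows; `distinct_keys()` is a hypothetical helper, not part of the extension, and it assumes only the documented `primary`/`alternate` array shape.

```php
<?php
// Hypothetical helper (not part of the extension): collapse a
// double_metaphone() result into its distinct keys.
function distinct_keys(array $result): array
{
    // 'alternate' repeats 'primary' when there was no alternate branch,
    // so de-duplicate and reindex from 0.
    return array_values(array_unique([$result['primary'], $result['alternate']]));
}

// Array shape copied from the documented output for "Smith".
$keys = distinct_keys(['primary' => 'SM0', 'alternate' => 'XMT']);
// $keys is ['SM0', 'XMT']
```

With the extension loaded, the same helper could be applied directly, as in `distinct_keys(double_metaphone($name))`.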
Beider-Morse Phonetic Matching
Language-aware token set, joined by `|` (alternatives) and `-` (words). Matches Apache Commons Codec's default `BeiderMorseEncoder`.
```php
bmpm(string $string, int $name_type = BMPM_GENERIC, int $accuracy = BMPM_APPROX, string $language = ""): string

bmpm("Jackson");                            // "iakson|iaksun|...|zokson"
bmpm("Garcia", BMPM_SEPHARDIC, BMPM_EXACT); // "garsia|gartSa"
```
Empty `$language` auto-detects; pass a language name (e.g. `"russian"`) to force it. Constants: `BMPM_GENERIC`, `BMPM_ASHKENAZI`, `BMPM_SEPHARDIC`, `BMPM_APPROX`, `BMPM_EXACT`.
Daitch-Mokotoff Soundex
List of distinct 6-digit codes (the algorithm branches on ambiguous letters). Matches Apache Commons Codec's DaitchMokotoffSoundex in branching mode.
```php
dm_soundex(string $string): array

dm_soundex("Auerbach"); // ['097400', '097500']
dm_soundex("Peters");   // ['734000', '739400']
```
Usage
Encode each name, then compare. Double Metaphone matches on key equality; BMPM and Daitch-Mokotoff produce sets, so two names match when their sets intersect.
```php
// Double Metaphone: equal primary, or a primary that crosses the other's alternate
function names_sound_alike(string $a, string $b): bool {
    $x = double_metaphone($a);
    $y = double_metaphone($b);
    return $x['primary'] === $y['primary']
        || $x['primary'] === $y['alternate']
        || $x['alternate'] === $y['primary'];
}
names_sound_alike("Catherine", "Kathryn"); // true (both K0RN / KTRN)

// Daitch-Mokotoff: codes match when the sets overlap
(bool) array_intersect(
    dm_soundex("Moskowitz"), // ['645740']
    dm_soundex("Moskovitz")  // ['645740'] -> true
);

// BMPM: split a single-word token string on '|' and intersect
$a = explode('|', bmpm("Peterson")); // pitirzon, pitirzun
$b = explode('|', bmpm("Petersen")); // ..., pitirzon, ...
(bool) array_intersect($a, $b);      // true
```
For indexed lookup, store the encoded key(s) with each record, then encode the query once and look it up by encoded value, rather than re-encoding every stored name at search time. Multi-word BMPM output also separates words with `-`, so split on both `|` and `-` for those.
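The indexed-lookup idea can be sketched as a small in-memory inverted index. Everything named here (`bmpm_tokens()`, `PhoneticIndex`, the record ids) is hypothetical and not part of the extension. The sketch takes already-encoded keys as input so it runs without the extension loaded, and the codes in the usage lines are the `dm_soundex()` outputs documented above.

```php
<?php
// Hypothetical helper: split a bmpm() result into individual tokens.
// '|' separates alternatives and '-' separates words.
function bmpm_tokens(string $encoded): array
{
    $tokens = preg_split('/[|-]/', $encoded, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($tokens));
}

// Hypothetical in-memory inverted index: encoded key => list of record ids.
final class PhoneticIndex
{
    /** @var array<string|int, int[]> */
    private array $byKey = [];

    /** Store a record under every key its name encoded to. */
    public function add(int $id, array $keys): void
    {
        foreach (array_unique($keys) as $key) {
            $this->byKey[$key][] = $id;
        }
    }

    /** Return ids of records sharing at least one key with the query. */
    public function lookup(array $keys): array
    {
        $ids = [];
        foreach (array_unique($keys) as $key) {
            foreach ($this->byKey[$key] ?? [] as $id) {
                $ids[$id] = true; // array keys de-duplicate the ids
            }
        }
        return array_keys($ids);
    }
}

$index = new PhoneticIndex();
$index->add(1, ['097400', '097500']); // documented codes for "Auerbach"
$index->add(2, ['645740']);           // documented code for "Moskowitz"
$hits = $index->lookup(['645740']);   // documented code for "Moskovitz"
// $hits is [2]
```

In a database, the same shape would typically be a key column with an ordinary index, one row per record and key. Keeping one index per encoder is probably the safer design, since keys from different algorithms are not comparable.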
Performance
Single-name encode, warm, optimized non-ASan PHP 8.4 on one core, over a representative mix of 18 short names. Absolute time scales with input length; the relative ordering is the stable part.
| function | per call | throughput | relative |
|---|---|---|---|
| `double_metaphone()` | ~0.18 µs | ~5.7M/sec | 1.0x |
| `dm_soundex()` | ~0.5 µs | ~1.9M/sec | ~3x slower |
| `bmpm()` | ~7.3 µs | ~135k/sec | ~40x slower |
Double Metaphone is a single linear pass over the input, so it's the cheapest. Daitch-Mokotoff branches on ambiguous letters and dedups the resulting codes; a first-byte rule index keeps it fast. BMPM is the heaviest: language detection, a main transliteration pass, and two final rule passes, expanding a Cartesian product of phoneme alternatives capped at 20 per word. When you know the language, passing an explicit $language skips auto-detection and can cut bmpm time several-fold, though the gain depends on the chosen language's ruleset. Choose BMPM for recall, not throughput.
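To check the relative ordering on your own data, a rough per-call timing can be taken with a helper like the one below. This is a hypothetical micro-benchmark, not the harness that produced the figures above. `bench_encoder()` and its parameters are illustrative names, and PHP core's `metaphone()` stands in as the encoder so the sketch runs without the extension.

```php
<?php
// Hypothetical micro-benchmark: average microseconds per call of $encode
// over $names, repeated $rounds times.
function bench_encoder(callable $encode, array $names, int $rounds = 1000): float
{
    if ($names === [] || $rounds < 1) {
        throw new InvalidArgumentException('Need at least one name and one round.');
    }

    // Warm-up pass so one-time setup cost is not measured.
    foreach ($names as $name) {
        $encode($name);
    }

    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $rounds; $i++) {
        foreach ($names as $name) {
            $encode($name);
        }
    }
    $elapsedNs = hrtime(true) - $start;

    // Nanoseconds -> microseconds, averaged over every call made.
    return $elapsedNs / 1000 / ($rounds * count($names));
}

$names = ['Smith', 'Catherine', 'Peterson'];
// Core metaphone() is used here only as a stand-in encoder.
$usPerCall = bench_encoder('metaphone', $names);
printf("%.3f us per call\n", $usPerCall);
```

With the extension loaded, comparing `fn($n) => bmpm($n)` against `fn($n) => bmpm($n, BMPM_GENERIC, BMPM_APPROX, "russian")` would show how much skipping auto-detection saves for a given ruleset.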
Notes & limitations
- Input is UTF-8. `bmpm()` and `dm_soundex()` fold accented Latin and lowercase both Latin and Cyrillic script before rule matching, so raw `Иванов` encodes correctly.
- Greek-script input is a known limitation: Greek capitals are not lowercased (the algorithm's context-sensitive final-sigma cannot be expressed by a point-wise case map), so pass Greek names already lowercased or romanized.
- `double_metaphone()` targets ASCII/Latin; non-letter bytes are skipped, matching Apache Commons Codec.
Install
Via PIE:
pie install iliaal/phonetic
Requires PHP 8.1 or later.
License
BSD 3-Clause (see LICENSE).
The Beider-Morse and Daitch-Mokotoff rule data is vendored from Apache Commons Codec under the Apache License 2.0; its notice is included in Section 2 of the LICENSE file. Double Metaphone is a clean-room implementation of the published algorithm.
Other information
- License: BSD-3-Clause
- Last updated: 2026-06-29