iliaal/phonetic

Install command:

pie install iliaal/phonetic

Package description

Native phonetic matching for PHP: Double Metaphone, Beider-Morse Phonetic Matching, and Daitch-Mokotoff Soundex.

README

Native phonetic matching for PHP: Double Metaphone, Beider-Morse Phonetic Matching (BMPM), and Daitch-Mokotoff Soundex, the phonetic name-matching encoders that PHP core does not ship.

PHP core has soundex() and metaphone(), but not these three, which are the standard tools for fuzzy name matching, record linkage, and genealogy search across spelling and transliteration variants.

Choosing an algorithm

| | Double Metaphone | BMPM | Daitch-Mokotoff Soundex |
|---|---|---|---|
| Output | primary + alternate key | language-aware token set | distinct 6-digit codes |
| Two names match when | keys are equal | token sets intersect | code sets intersect |
| Strongest for | English and general Latin-script names | cross-language and transliteration variants (Slavic, Germanic, Hebrew, Romance) | Eastern-European and Ashkenazi surnames, genealogy |
| Spelling-variant recall | good | highest | high, within its language model |
| Ambiguity handling | up to 2 keys | many tokens | multiple codes |
| Relative speed | fastest (1.0x) | slowest (~40x) | middle (~3x) |
| Data source | clean-room published algorithm | Apache Commons Codec rule data | Apache Commons Codec rule data |

Rule of thumb: reach for Double Metaphone as a fast general-purpose default, BMPM when names cross languages or scripts, and Daitch-Mokotoff for Eastern-European and Jewish genealogy where it is the field standard.

API

Double Metaphone

Primary + alternate phonetic keys (Lawrence Philips). Clean-room implementation.

double_metaphone(string $string, int $max_length = 4): array

double_metaphone("Schwarzenegger");        // ['primary' => 'XRSN', 'alternate' => 'XFRT']
double_metaphone("Smith");                 // ['primary' => 'SM0',  'alternate' => 'XMT']
double_metaphone("Catherine", 3);          // ['primary' => 'K0R',  'alternate' => 'KTR']

alternate equals primary when the algorithm produces no alternate branch. $max_length caps the length of each key (default 4; 0 or negative means unlimited).

Beider-Morse Phonetic Matching

Language-aware token set, joined by | (alternatives) and - (words). Matches Apache Commons Codec's default BeiderMorseEncoder.

bmpm(string $string, int $name_type = BMPM_GENERIC, int $accuracy = BMPM_APPROX, string $language = ""): string

bmpm("Jackson");                           // "iakson|iaksun|...|zokson"
bmpm("Garcia", BMPM_SEPHARDIC, BMPM_EXACT);// "garsia|gartSa"

Empty $language auto-detects; pass a language name (e.g. "russian") to force it. Constants: BMPM_GENERIC, BMPM_ASHKENAZI, BMPM_SEPHARDIC, BMPM_APPROX, BMPM_EXACT.
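
As a rough illustration of the output format described above, the sketch below breaks an encoded string into a set of distinct tokens and tests two such sets for overlap. The helpers bmpm_tokens() and bmpm_codes_match() are hypothetical names, not part of the extension. They work on already-encoded strings, so the snippet runs without the extension: "garsia|gartSa" is the documented "Garcia" result, while the other inputs are made-up strings in the same format. Trimming parentheses is a defensive assumption (Apache Commons Codec wraps some prefixed-name results in parentheses); whether this extension emits them is not confirmed here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: split an encoded BMPM string into distinct tokens.
 * '|' separates alternatives and '-' separates words; both are treated
 * as token boundaries, so tokens from all words end up in one flat set.
 */
function bmpm_tokens(string $encoded): array
{
    $parts = preg_split('/[|\-]/', $encoded, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $tokens = [];
    foreach ($parts as $part) {
        // Assumption: drop any surrounding parentheses or whitespace.
        $token = trim($part, "() \t");
        if ($token !== '') {
            $tokens[] = $token;
        }
    }

    return array_values(array_unique($tokens));
}

/** Hypothetical helper: two encoded strings match when their token sets intersect. */
function bmpm_codes_match(string $encodedA, string $encodedB): bool
{
    return array_intersect(bmpm_tokens($encodedA), bmpm_tokens($encodedB)) !== [];
}

// Documented output for bmpm("Garcia", BMPM_SEPHARDIC, BMPM_EXACT):
var_dump(bmpm_tokens('garsia|gartSa'));                     // ['garsia', 'gartSa']

// Made-up strings in the same format:
var_dump(bmpm_codes_match('garsia|gartSa', 'garzia|gartSa')); // true
var_dump(bmpm_codes_match('garsia|gartSa', 'ab|ac-de'));      // false
```

With the extension loaded, the same helpers would be called on real output, for example bmpm_codes_match(bmpm($a), bmpm($b)).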

Daitch-Mokotoff Soundex

List of distinct 6-digit codes (the algorithm branches on ambiguous letters). Matches Apache Commons Codec's DaitchMokotoffSoundex in branching mode.

dm_soundex(string $string): array

dm_soundex("Auerbach");                    // ['097400', '097500']
dm_soundex("Peters");                      // ['734000', '739400']

Usage

Encode each name, then compare. Double Metaphone matches on key equality; BMPM and Daitch-Mokotoff produce sets, so two names match when their sets intersect.

// Double Metaphone: equal primary, or a primary that crosses the other's alternate
function names_sound_alike(string $a, string $b): bool {
    $x = double_metaphone($a);
    $y = double_metaphone($b);
    return $x['primary']   === $y['primary']
        || $x['primary']   === $y['alternate']
        || $x['alternate'] === $y['primary'];
}

names_sound_alike("Catherine", "Kathryn");   // true  (both K0RN / KTRN)

// Daitch-Mokotoff: codes match when the sets overlap
(bool) array_intersect(
    dm_soundex("Moskowitz"),                 // ['645740']
    dm_soundex("Moskovitz")                  // ['645740']  -> true
);

// BMPM: split a single-word token string on '|' and intersect
$a = explode('|', bmpm("Peterson"));         // pitirzon, pitirzun
$b = explode('|', bmpm("Petersen"));         // ..., pitirzon, ...
(bool) array_intersect($a, $b);              // true

For indexed lookup, store the encoded key(s) with each record and query by encoded value instead of re-encoding at search time. Multi-word BMPM output also separates words with -, so split on both | and - for those.
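
One way to picture the indexed lookup described above is an inverted index from encoded key to record IDs. This is a minimal in-memory sketch, not the extension's API: build_phonetic_index() and lookup_phonetic() are hypothetical helpers, and the $encode closure is a stand-in that replays the Daitch-Mokotoff codes documented earlier so the snippet runs without the extension. In a real application the index would more likely be a database column or table.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build an inverted index (phonetic key => record IDs).
 * $encode is any callable that returns a list of keys for a name.
 */
function build_phonetic_index(array $records, callable $encode): array
{
    $index = [];
    foreach ($records as $id => $name) {
        foreach (array_unique($encode($name)) as $key) {
            $index[$key][] = $id;
        }
    }
    return $index;
}

/** Hypothetical helper: return the IDs of records sharing at least one key with the query. */
function lookup_phonetic(array $index, array $queryKeys): array
{
    $hits = [];
    foreach (array_unique($queryKeys) as $key) {
        foreach ($index[$key] ?? [] as $id) {
            $hits[$id] = true; // use IDs as array keys to deduplicate
        }
    }
    return array_keys($hits);
}

// Stand-in encoder replaying the codes documented above.
// With the extension loaded, this could be dm_soundex(...) instead.
$documented = [
    'Moskowitz' => ['645740'],
    'Moskovitz' => ['645740'],
    'Auerbach'  => ['097400', '097500'],
    'Peters'    => ['734000', '739400'],
];
$encode = static fn (string $name): array => $documented[$name] ?? [];

// Records are encoded once, at indexing time.
$index = build_phonetic_index([1 => 'Moskowitz', 2 => 'Auerbach', 3 => 'Peters'], $encode);

// Only the query is encoded at search time.
var_dump(lookup_phonetic($index, $encode('Moskovitz'))); // [1]
```

The same shape works for Double Metaphone (index both primary and alternate) and for BMPM (index each token).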

Performance

Single-name encode, warm, optimized non-ASan PHP 8.4 on one core, over a representative mix of 18 short names. Absolute time scales with input length; the relative ordering is the stable part.

| function | per call | throughput | relative |
|---|---|---|---|
| double_metaphone() | ~0.18 µs | ~5.7M/sec | 1.0x |
| dm_soundex() | ~0.5 µs | ~1.9M/sec | ~3x slower |
| bmpm() | ~7.3 µs | ~135k/sec | ~40x slower |

Double Metaphone is a single linear pass over the input, so it is the cheapest. Daitch-Mokotoff branches on ambiguous letters and deduplicates the resulting codes; a first-byte rule index keeps it fast. BMPM is the heaviest: it runs language detection, a main transliteration pass, and two final rule passes, expanding a Cartesian product of phoneme alternatives capped at 20 per word. When you know the language, passing an explicit $language skips auto-detection and can cut bmpm() time several-fold, though the gain depends on the chosen language's ruleset. Choose BMPM for recall, not throughput.
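
To check these ratios on your own hardware, a rough timing harness along the following lines may help. It is only a sketch: bench_ns_per_call() is a hypothetical helper, the name list is arbitrary rather than the 18-name mix used for the figures above, and PHP core's metaphone() is used purely as a stand-in so the snippet runs without the extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: average wall-clock nanoseconds per call of $encode
 * over $names, repeated $rounds times after one warm-up pass.
 */
function bench_ns_per_call(callable $encode, array $names, int $rounds = 1000): float
{
    if ($names === [] || $rounds < 1) {
        throw new InvalidArgumentException('Need at least one name and one round.');
    }

    // Warm-up pass so one-time setup cost is not counted.
    foreach ($names as $name) {
        $encode($name);
    }

    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $rounds; $i++) {
        foreach ($names as $name) {
            $encode($name);
        }
    }
    $elapsed = hrtime(true) - $start;

    return $elapsed / ($rounds * count($names));
}

$names = ['Smith', 'Catherine', 'Peterson', 'Moskowitz'];

// Stand-in encoder from PHP core. With the extension loaded, pass
// double_metaphone(...), dm_soundex(...), or a closure around bmpm()
// with and without an explicit $language, and compare the results.
printf("metaphone(): %.1f ns per call\n", bench_ns_per_call('metaphone', $names));
```

Absolute numbers from a harness like this vary with hardware and build flags, so compare ratios between encoders rather than raw timings.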

Notes & limitations

  • Input is UTF-8. bmpm() and dm_soundex() fold accented Latin and lowercase both Latin and Cyrillic script before rule matching, so raw Иванов encodes correctly.
  • Greek-script input is a known limitation: Greek capitals are not lowercased (the algorithm's context-sensitive final-sigma cannot be expressed by a point-wise case map), so pass Greek names already lowercased or romanized.
  • double_metaphone() targets ASCII/Latin; non-letter bytes are skipped, matching Apache Commons Codec.

Install

Via PIE:

pie install iliaal/phonetic

Requires PHP 8.1 or later.

License

BSD 3-Clause (see LICENSE).

The Beider-Morse and Daitch-Mokotoff rule data is vendored from Apache Commons Codec under the Apache License 2.0; its notice is included in Section 2 of the LICENSE file. Double Metaphone is a clean-room implementation of the published algorithm.

Other information

  • Implementation language: C
  • License: BSD-3-Clause
  • Last updated: 2026-06-29
