README

Arabic search that works on MySQL / PostgreSQL / SQLite — no Elasticsearch, no Meilisearch.

Arabic text is written many ways for the same word: with or without diacritics, أ/إ/آ vs bare ا, ة vs ه, ى vs ي, and Persian/Urdu look-alike letters (ک, ی) that are visually identical to Arabic ones but sit at different Unicode codepoints. Naive LIKE misses all of these. This package normalizes both your stored text and the search term through the same pipeline, so مكة matches مكه, مُحَمَّد matches محمد, and کتاب (Persian keheh) matches كتاب (Arabic kaf).

You declare which columns are searchable once on the model; at query time you pass only the search word.

Article::arabicSearch('اسلام')->paginate(20);   // matches إسلام, الإسلام, ...

Installation

composer require ismailelbery/laravel-arabic-search
php artisan vendor:publish --tag=arabic-search-config
php artisan vendor:publish --tag=arabic-search-migrations

Requirements: PHP 8.1+, Laravel 10 / 11 / 12, and ext-mbstring. ext-intl is optional — if present it adds an NFKC pass that folds Arabic presentation forms from PDFs; a built-in map covers the common cases when it is absent.
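To make the optional ext-intl pass concrete, here is a minimal sketch of "NFKC when available, small map otherwise". The function name foldPresentationForms() and the four-entry fallback map are illustrative assumptions, not the package's actual code or its real map.

```php
<?php
// Hypothetical helper: folds Arabic presentation forms to base letters.
function foldPresentationForms(string $text): string
{
    // Preferred path: ext-intl's NFKC normalization.
    if (class_exists(\Normalizer::class)) {
        $out = \Normalizer::normalize($text, \Normalizer::FORM_KC);
        if ($out !== false) {
            return $out;
        }
    }

    // Fallback path: a tiny hand-written map (assumed subset, for illustration).
    return strtr($text, [
        "\u{FEFB}" => "\u{0644}\u{0627}", // lam-alef ligature, isolated form
        "\u{FEFC}" => "\u{0644}\u{0627}", // lam-alef ligature, final form
        "\u{FE8D}" => "\u{0627}",         // alef, isolated form
        "\u{FE8E}" => "\u{0627}",         // alef, final form
    ]);
}

// Both paths turn the single ligature codepoint into two letters.
var_dump(foldPresentationForms("\u{FEFB}") === "\u{0644}\u{0627}");
```

Either branch gives the same result for the codepoints in the map; NFKC simply covers far more of them.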

Normalization rules

This table is the contract. Each rule is individually toggleable in config/arabic-search.php.

Rule	Transform	Example	Default
`strip_invisibles`	remove zero-width & bidi controls (U+200B–200F, U+061C, U+FEFF, …)	`احمد‏` → `احمد`	on
`unicode_compatibility`	NFKC / presentation forms & ligatures	`ﻻ` → `لا`	on
`strip_tashkeel`	remove harakat & Quranic marks (U+064B–065F, U+0670, U+06D6–06ED, …)	`مُحَمَّد` → `محمد`	on
`strip_tatweel`	remove kashida U+0640	`محـــمد` → `محمد`	on
`normalize_alef`	أ إ آ ٱ ٲ ٵ → ا	`إسلام` → `اسلام`	on
`normalize_yeh`	ى (maqsura), ی (Farsi), ے (Urdu) → ي	`موسى`, `موسی` → `موسي`	on
`normalize_taa_marbuta`	ة → ه	`مكة` → `مكه`	on
`normalize_heh`	ہ ۀ ە (Urdu/Persian) → ه	`ہ` → `ه`	on
`normalize_kaf`	ک ڪ (Persian/Urdu) → ك	`کتاب` → `كتاب`	on
`normalize_waw`	ؤ ۆ ۇ ۈ → و	`مؤمن` → `مومن`	on
`normalize_hamza`	ؤ → و, ئ → ي	`قائم` → `قايم`	on
`strip_standalone_hamza`	ء → (removed)	`سماء` → `سما`	off
`normalize_dad_zah`	ظ → ض (tolerant of a common misspelling)	`ظل` → `ضل`	off
`normalize_digits`	٠١٢٣ and ۰۱۲۳ (Persian) → 0123	`٢٠٢٥` → `2025`	on
`lowercase_latin`	lowercase mixed Latin	`HeLLo` → `hello`	on
`collapse_whitespace`	collapse runs of whitespace to a single space, then trim	`  محمد   علي ` → `محمد علي`	on
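As a rough illustration of how a handful of these rules compose, the sketch below hand-rolls a subset of the table in plain PHP. sketchNormalize() is a hypothetical name and this is not the package's implementation; in real code use ArabicSearch::normalize().

```php
<?php
// Hypothetical re-implementation of a few rules from the table, for illustration only.
function sketchNormalize(string $text): string
{
    // strip_tashkeel: harakat and Quranic marks (ranges taken from the table)
    $text = preg_replace('/[\x{064B}-\x{065F}\x{0670}\x{06D6}-\x{06ED}]/u', '', $text) ?? $text;

    // strip_tatweel: kashida U+0640
    $text = str_replace("\u{0640}", '', $text);

    // Single-letter folds; codepoints are spelled out because the glyphs look alike.
    $text = strtr($text, [
        "\u{0623}" => "\u{0627}", // normalize_alef: alef with hamza above
        "\u{0625}" => "\u{0627}", // normalize_alef: alef with hamza below
        "\u{0622}" => "\u{0627}", // normalize_alef: alef with madda
        "\u{0671}" => "\u{0627}", // normalize_alef: alef wasla
        "\u{0649}" => "\u{064A}", // normalize_yeh: alef maqsura
        "\u{06CC}" => "\u{064A}", // normalize_yeh: Farsi yeh
        "\u{0629}" => "\u{0647}", // normalize_taa_marbuta
        "\u{06A9}" => "\u{0643}", // normalize_kaf: keheh
    ]);

    // collapse_whitespace
    return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}

// Persian keheh (U+06A9) and Arabic kaf (U+0643) spellings of the same word converge.
var_dump(sketchNormalize("\u{06A9}\u{062A}\u{0627}\u{0628}") === sketchNormalize("\u{0643}\u{062A}\u{0627}\u{0628}"));
```

Running the output through the function a second time changes nothing, which is the idempotency property the test suite checks for the real normalizer.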

Design decision — recall over precision. Normalization is intentionally lossy: مكة and مكه will match, by design. Precision is recovered by relevance ordering (exact > prefix > contains), not by being conservative here.
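The exact > prefix > contains ordering can be pictured with a small in-memory sketch. The function names and numeric ranks below are assumptions made for illustration; the package applies its ordering inside the query, and how it does so is not shown here.

```php
<?php
// Hypothetical ranking: higher is better, 0 means "no match".
function relevanceRank(string $normalizedValue, string $normalizedTerm): int
{
    if ($normalizedValue === $normalizedTerm) {
        return 3; // exact
    }
    if (str_starts_with($normalizedValue, $normalizedTerm)) {
        return 2; // prefix
    }
    if (str_contains($normalizedValue, $normalizedTerm)) {
        return 1; // contains
    }
    return 0;
}

// Drops non-matching values and sorts the rest by rank, best first.
function orderByRelevance(array $values, string $term): array
{
    $hits = array_filter($values, fn (string $v): bool => relevanceRank($v, $term) > 0);
    usort($hits, fn (string $a, string $b): int => relevanceRank($b, $term) <=> relevanceRank($a, $term));
    return $hits;
}
```

Both arguments are expected to be already normalized, which is why loose matching and precise ordering can coexist: the fold widens the candidate set, and the rank decides what surfaces first.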

Two rules are off by default because they are lossy across genuinely different words, not just spelling variants of one letter — enable them only if you want that tolerance:

strip_standalone_hamza — merges سماء/سما.
normalize_dad_zah — folds ظ→ض, so ظلّ (shade) and ضلّ (to go astray) collide. Turn it on when your users frequently confuse the two letters. The toggle applies to both search paths (shadow column and whereArabicVariants) so they stay consistent.

Enable in config/arabic-search.php:

'rules' => [
    'normalize_dad_zah' => true,
],

Toggling a rule changes the normalizer version, so run arabic-search:rebuild afterwards on any table that uses shadow columns.

Debug any term end-to-end:

php artisan arabic-search:inspect "مُحَمَّدٌ ٢٠٢٥"

Setup on a model

Add the trait and list your searchable columns:

use IsmailElbery\ArabicSearch\Concerns\HasArabicSearch;

class Article extends Model
{
    use HasArabicSearch;

    protected array $arabicSearchable = ['title', 'body'];
}

Add the shadow columns. Edit the published migration (or write your own using the macro):

Schema::table('articles', function (Blueprint $table) {
    $table->arabicNormalized(['title', 'body']); // adds title_normalized, body_normalized
});

Backfill existing rows:

php artisan arabic-search:rebuild "App\Models\Article"

That's it. New/updated rows keep their shadow columns in sync automatically on save.

How it works

You never search the original column. The package maintains a normalized shadow column next to it (title → title_normalized). On save, an observer normalizes the source into the shadow column; on search, the term is normalized with the same pipeline and matched against the shadow column. Because both sides run identical PHP normalization, they are guaranteed to agree — there is no SQL-vs-app drift.

articles
├── title              "مُحَمَّدٌ رسولُ الله"   ← original, shown to the user
└── title_normalized   "محمد رسول الله"         ← searched against
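The save/search symmetry can be sketched without Laravel. Everything below is a simplified stand-in: standInNormalize() strips tashkeel only, withShadowColumn() plays the role of the observer, and searchTitles() plays the role of the LIKE query. None of these names come from the package.

```php
<?php
// Stand-in for the package's normalizer: strips tashkeel only, to keep the sketch short.
function standInNormalize(string $text): string
{
    return preg_replace('/[\x{064B}-\x{065F}\x{0670}]/u', '', $text) ?? $text;
}

// "On save": what an observer would do before the row is written.
function withShadowColumn(array $row): array
{
    $row['title_normalized'] = standInNormalize($row['title']);
    return $row;
}

// "On search": normalize the term the same way, then match the shadow value only.
function searchTitles(array $rows, string $term): array
{
    $needle = standInNormalize($term);
    return array_values(array_filter(
        $rows,
        fn (array $row): bool => str_contains($row['title_normalized'], $needle)
    ));
}

$rows = [withShadowColumn(['title' => 'مُحَمَّدٌ رسولُ الله'])];
$hits = searchTitles($rows, 'محمد');
echo count($hits), PHP_EOL;
```

The original title is returned untouched for display; only the shadow value takes part in matching.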

⚠️ Bulk writes bypass model events. Model::query()->update(), insert(), upsert() and raw SQL do not fire the observer, so the shadow columns go stale. Run arabic-search:rebuild afterwards.

Standalone normalizer (no model needed)

use IsmailElbery\ArabicSearch\Facades\ArabicSearch;

ArabicSearch::normalize('مُحَمَّدٌ');          // "محمد"
ArabicSearch::tokenize('بسم الله الرحمن');    // ["بسم","الله","الرحمن"]

Searching an existing table with no shadow column

Have a legacy users table you can't (or don't want to) alter? Use variant expansion — it matches every orthographic spelling directly against the raw column, no _normalized column and no rebuild needed:

User::whereArabicVariants('name', 'اسلام')->paginate();
DB::table('users')->whereArabicVariants('name', 'اسلام')->get();

// Two (or more) columns — OR-ed together:
User::whereArabicVariants(['first_name', 'last_name'], 'اسلام')->get();

// Composes with other conditions:
DB::table('docs')->where('pinned', true)
    ->orWhereArabicVariants('title', 'اسلام')->get();

Searching اسلام matches stored اسلام, إسلام, أسلام, آسلام, الإسلام, and diacritized/kashida spellings like إِسْلَام and اســلام — while correctly not matching a different word like اسلم. It works on MySQL, PostgreSQL and SQLite (a PCRE-backed REGEXP function is registered automatically for SQLite).
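One way to picture variant expansion: each letter of the normalized term becomes a character class of its spellings, and optional diacritics/kashida are allowed after every letter. The sketch below uses PHP's preg_match purely for illustration. variantPattern() and its four-entry class map are assumptions, and the pattern the package actually generates for each database is not shown here.

```php
<?php
// Hypothetical builder: turns an already-normalized term into a tolerant regex.
function variantPattern(string $normalizedTerm): string
{
    // Assumed subset of letter families (base letter => accepted spellings).
    $classes = [
        "\u{0627}" => "\u{0627}\u{0623}\u{0625}\u{0622}\u{0671}", // alef family
        "\u{064A}" => "\u{064A}\u{0649}\u{06CC}",                 // yeh family
        "\u{0647}" => "\u{0647}\u{0629}",                         // heh / taa marbuta
        "\u{0643}" => "\u{0643}\u{06A9}",                         // kaf / keheh
    ];
    // Tashkeel and tatweel may appear after any letter in stored text.
    $marks = '[\x{064B}-\x{065F}\x{0670}\x{0640}]*';

    $pattern = '';
    foreach (preg_split('//u', $normalizedTerm, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $pattern .= isset($classes[$ch]) ? '[' . $classes[$ch] . ']' : preg_quote($ch, '/');
        $pattern .= $marks;
    }
    return '/' . $pattern . '/u';
}

$pattern = variantPattern("\u{0627}\u{0633}\u{0644}\u{0627}\u{0645}"); // the term from the examples above
var_dump(preg_match($pattern, "\u{0627}\u{0644}\u{0625}\u{0633}\u{0644}\u{0627}\u{0645}")); // with definite article and hamza
var_dump(preg_match($pattern, "\u{0627}\u{0633}\u{0644}\u{0645}"));                         // different word, one alef short
```

The pattern is unanchored, so it behaves like a contains match. The second word fails because the class for the second alef has nothing to match.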

When to use which:

	Shadow column (`HasArabicSearch`)	Variant expansion (`whereArabicVariants`)
Schema change	adds `_normalized` column	none
Backfill	`arabic-search:rebuild`	none
Matching	`LIKE` on the normalized column	regex on the raw column
Uses an index	no in v1 (`LIKE` infix); fulltext planned	no (regex full-scan)
Best for	tables you own	legacy/read-only tables, small–medium

Configuration highlights

Key	Meaning
`term_logic`	`and` (all tokens must match, default) or `or` (any)
`order_by_relevance`	exact > prefix > contains ordering (default `true`)
`min_token_length`	tokens shorter than this are dropped (default `2`)
`column_suffix`	shadow-column suffix (default `_normalized`)
`match_mode`	reserved. v1 always uses `like`; index-backed `fulltext` is on the roadmap and not yet wired
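To show how min_token_length and term_logic interact, here is a small in-memory sketch. The function names are hypothetical, and returning false when every token has been dropped is an assumption of this sketch, not documented package behaviour.

```php
<?php
// Splits a normalized term on whitespace and drops tokens shorter than the minimum.
function tokenizeTerm(string $normalizedTerm, int $minTokenLength = 2): array
{
    $tokens = preg_split('/\s+/u', trim($normalizedTerm), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_filter(
        $tokens,
        // Count characters, not bytes: Arabic letters are two bytes each in UTF-8.
        fn (string $t): bool => preg_match_all('/./su', $t) >= $minTokenLength
    ));
}

// 'and' requires every token to occur in the text; 'or' requires at least one.
function matchesTokens(string $normalizedText, array $tokens, string $logic = 'and'): bool
{
    if ($tokens === []) {
        return false; // assumption: nothing left to search for
    }
    $hits = array_filter($tokens, fn (string $t): bool => str_contains($normalizedText, $t));
    return $logic === 'or' ? count($hits) > 0 : count($hits) === count($tokens);
}
```

With the defaults, a lone single-letter conjunction in the search box is ignored rather than matching nearly every row.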

Changing any rule changes the normalizer version (ArabicSearch::version()); rerun arabic-search:rebuild so stored data matches.
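Why a rule change forces a rebuild is easiest to see if you imagine the version as a fingerprint of the enabled rules. The sketch below is one plausible way to derive such a fingerprint. It is an assumption for illustration only and says nothing about how ArabicSearch::version() is actually computed.

```php
<?php
// Hypothetical fingerprint: same rule set => same string, any toggle => different string.
function ruleSetVersion(array $rules): string
{
    ksort($rules); // key order in the config file must not affect the result
    return substr(sha1(json_encode($rules)), 0, 8);
}

$before = ruleSetVersion(['normalize_alef' => true, 'normalize_dad_zah' => false]);
$after  = ruleSetVersion(['normalize_alef' => true, 'normalize_dad_zah' => true]);
var_dump($before === $after); // bool(false)
```

Shadow values written under the old fingerprint were produced by a different pipeline than the one now applied to search terms, so they have to be regenerated.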

What this does NOT do (yet)

Naming the limits earns more trust than hiding them:

No morphological / root analysis. Searching كتب will not automatically find مكتوب/كاتب. (Root-based matching is a v2 maybe.)
No stemming — light prefix/suffix stripping (ال، و، ب، ون، ين) is planned for v1.1, opt-in.
No synonym/fuzzy/Levenshtein matching.
LIKE infix matches can't use an index, so every search scans the table. That is fine for small/medium tables. Index-backed fulltext matching is planned but not in v1; for very large datasets today, reach for a dedicated engine.

Use this vs. Meilisearch/Typesense: reach for this when you want correct Arabic matching on the database you already have, with zero extra infrastructure. Reach for a dedicated engine when you need typo-tolerance, faceting, or sub-10ms search over millions of rows.

Testing

composer install
vendor/bin/phpunit

The suite leads with an input → expected-output table (NormalizationTest) plus idempotency checks and an integration SearchTest against in-memory SQLite.

License

MIT.
