Description:
Currently, MySQL lacks a collation that properly supports Arabic case-insensitive and diacritic-insensitive comparison, especially for variants of the Arabic letter "Alef" such as:
- ا (U+0627)
- أ (U+0623)
- إ (U+0625)
- آ (U+0622)
These characters are treated as distinct letters by existing collations such as `utf8mb4_general_ci` or `utf8mb4_unicode_ci`, which produces incorrect query results and unexpected behavior for native Arabic speakers. For example, the query:
SELECT * FROM users WHERE name = 'احمد';
fails to match:
- "أحمد"
- "إحمد"
- "آحمد"
### Feature Request:
Introduce a new collation like `utf8mb4_arabic_ai_ci` or extend the current Arabic collations to:
- Normalize all Alef variants to a base form.
- Optionally ignore diacritics (tashkeel).
- Support case-insensitivity.
This would bring MySQL closer to true Arabic linguistic handling, improve search relevance, and fix common user frustration in Arabic applications.
### References:
- Full article with technical explanation and proposed solution:
https://ahmadessamdev.medium.com/arabic-case-insensitive-in-database-systems-how-to-solve-...
Thank you!
How to repeat:
-- Create table
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
);
-- Insert sample Arabic names with Alef variants
INSERT INTO users (name) VALUES
('احمد'), -- Alef
('أحمد'), -- Alef with Hamza above
('إحمد'), -- Alef with Hamza below
('آحمد'); -- Alef with Madda
-- Now try to search using plain Alef:
SELECT * FROM users WHERE name = 'احمد';
✅ Expected:
The query should return all 4 rows, treating all Alef variants as equal.
❌ Actual:
It only returns the exact match 'احمد'.
🎯 Why this matters
Arabic users expect that searching for "احمد" should match "أحمد", "إحمد", and "آحمد", which all represent the same logical name in Arabic. Current collations treat these as completely different letters.
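Until such a collation exists, the Alef normalization requested above can be approximated in SQL. The following is only a sketch of a possible workaround, not a substitute for collation support. It assumes the `users` table from "How to repeat", a `utf8mb4` connection character set, and MySQL 5.7 or later for the generated column. The names `name_normalized` and `idx_users_name_normalized` are hypothetical, and diacritics (tashkeel) are not handled here.

```sql
-- Ad hoc form: fold the three Alef variants to plain Alef (U+0627) before comparing.
SELECT *
FROM users
WHERE REPLACE(REPLACE(REPLACE(name, 'أ', 'ا'), 'إ', 'ا'), 'آ', 'ا') = 'احمد';

-- Indexed form: keep an Alef-normalized copy of the name in a stored generated column.
-- name_normalized and idx_users_name_normalized are illustrative names only.
ALTER TABLE users
  ADD COLUMN name_normalized VARCHAR(100)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
    GENERATED ALWAYS AS (
      REPLACE(REPLACE(REPLACE(name, 'أ', 'ا'), 'إ', 'ا'), 'آ', 'ا')
    ) STORED,
  ADD INDEX idx_users_name_normalized (name_normalized);

-- The search term must be normalized the same way by the application.
SELECT id, name
FROM users
WHERE name_normalized = 'احمد';
```

With the sample data above, both queries should match all four rows. The drawback is that every application has to remember to normalize both the stored value and the search term, which is exactly the burden a dedicated collation would remove.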