# Unicode 17 Support

**Status:** Draft  
**Roadmap Section:** Other  
**Primary Contact:** Ryusuke Kajiyama  
**Email:** ryusuke.kajiyama@oracle.com  
**Company / Organization:** Oracle Information Systems Japan  
**Role / Team:** Technology Business Development Director  
**Additional Authors / Contributors:** None listed  
**Date:** 2026-05-17  
**Target Release:** MySQL 10.0.x  
**Related References:** Bug #95205

---

## 1. High-Level Description

### Executive Summary

This proposal updates MySQL 10.0 Unicode support from the current Unicode Collation Algorithm (UCA) 9.0 baseline used by the `utf8mb4_0900_*` collation family to a new Unicode 17 baseline. The change should be delivered by adding a new, versioned `utf8mb4_1700_*` collation family rather than redefining the existing `utf8mb4_0900_*` collations in place.

The primary user benefit is modern Unicode behavior for sorting, comparison, language-specific tailoring, emoji, and newly encoded scripts while preserving predictable behavior for existing deployments. Existing schemas using `utf8mb4_0900_*` should continue to behave exactly as before. New deployments and opt-in migrations can adopt Unicode 17 collations explicitly.

The proposal also recommends aligning MySQL's vendored ICU dependency with Unicode 17, because MySQL regular expression support is ICU-backed and Unicode-version drift between collations and regular expressions would create confusing behavior.

### User / Developer Stories

As a MySQL application developer, I want MySQL 10.0 to support the latest Unicode version available as of May 2026, so that applications can correctly store, compare, sort, and search modern text, including newer scripts and emoji.

As a database administrator, I want existing `utf8mb4_0900_*` collations to remain stable during upgrade, so that ORDER BY, GROUP BY, UNIQUE constraints, replication, and application-visible behavior do not change unexpectedly.

As a DBA planning a Unicode upgrade, I want a new explicit `utf8mb4_1700_*` family, preflight queries, and documented migration steps, so that I can test equality and ordering changes before moving production schemas.

As a connector or client-library maintainer, I want new collation names and IDs to be introduced in a versioned and discoverable way, so that clients can update metadata mappings safely.

As a MySQL server developer, I want the existing generated UCA data-table model to be extended rather than replaced wholesale, so that the implementation can reuse proven scanner, comparison, and collation-registration infrastructure.

### Scope

**In Scope**

- Add a new Unicode 17 / UCA 17 based `utf8mb4_1700_*` collation family.
- Preserve existing `utf8mb4_0900_*` collations without semantic changes.
- Generate new versioned UCA data tables analogous to the existing `uca900_*` tables.
- Update or generalize the UCA data generation pipeline currently centered around `uca9-dump.cc`.
- Register new collations in the server collation metadata layer and expose them through existing mechanisms such as `SHOW COLLATION` and `INFORMATION_SCHEMA.COLLATIONS`.
- Evaluate whether new MySQL 10.0 installations should default to `utf8mb4_1700_ai_ci` while in-place upgrades retain the existing server default.
- Update the vendored ICU baseline to a Unicode 17 capable ICU release, preferably ICU 78.x or a newer compatible maintenance release.
- Add tests for sorting, equality, uniqueness, index behavior, full-text search, regular expressions, replication, upgrade, backup/restore, and connector metadata.
- Provide migration guidance, audit SQL, and compatibility warnings.

**Out of Scope / Limitations**

- Do not redefine `utf8mb4_0900_*` to use Unicode 17 weights.
- Do not remove or deprecate existing `utf8mb4_0900_*` collations as part of this proposal.
- Do not redesign MySQL's metadata character set behavior in the same change.
- Do not introduce automatic storage-time Unicode normalization.
- Do not make in-place upgrades automatically convert user schemas to `utf8mb4_1700_*`.
- Do not combine this proposal with any broader `utf8mb3` removal or `utf8` alias behavior change.
- An exhaustive enumeration of code point differences between Unicode 9 and Unicode 17 is outside this proposal; the design relies on ingesting official Unicode/UCA/CLDR/ICU data rather than on hand-maintained lists.

### References

- MySQL Server source files and directories including `strings/uca9-dump.cc`, `strings/uca900_data.h`, `strings/uca900_ja_data.h`, `strings/uca900_zh_data.h`, `strings/ctype-uca.cc`, `strings/CHARSET_INFO.txt`, `mysys/charset.cc`, and `extra/icu/icu-release-77-1`.
- ICU 78 release documentation indicating Unicode 17 support.
- Unicode 17 data snapshots, including script, block, emoji, emoji sequence, emoji ZWJ sequence, and derived normalization data.
- Prior MySQL collation-related work such as the multibyte equality fix around `utf8mb4_0900_ai_ci` behavior.

---

## 2. Requirements

### Functional Requirements

- FR1. MySQL 10.0 MUST add a new Unicode 17 based `utf8mb4_1700_*` collation family.
- FR2. MySQL 10.0 MUST preserve the semantics of all existing `utf8mb4_0900_*` collations.
- FR3. The new collations MUST be exposed through existing metadata surfaces, including `SHOW COLLATION` and `INFORMATION_SCHEMA.COLLATIONS`.
- FR4. The new collations MUST have distinct collation names and collation IDs from existing collations.
- FR5. The implementation MUST update generated UCA data tables for Unicode 17 rather than manually encoding individual new characters.
- FR6. The implementation MUST include Unicode 17 DUCET data and required locale tailoring data in a reproducible generation pipeline.
- FR7. The implementation SHOULD provide `utf8mb4_1700_ai_ci` as the primary accent-insensitive, case-insensitive default candidate for fresh MySQL 10.0 installations.
- FR8. In-place upgrades MUST preserve the existing `collation_server` value unless the administrator explicitly changes it.
- FR9. The implementation SHOULD include commonly used sensitivity variants such as accent-sensitive, case-sensitive, and locale-specific collations where equivalent `0900` variants exist.
- FR10. The implementation SHOULD update the vendored ICU dependency to a Unicode 17 capable version to keep `REGEXP` behavior aligned with the Unicode generation used for collations.
- FR11. Unsupported mixed-version replication cases, where a replica does not understand `utf8mb4_1700_*` collation names or IDs, MUST be rejected by the server, reported with a warning, or clearly documented.
- FR12. Documentation MUST describe migration risks for ordering, equality, uniqueness, full-text indexes, regex behavior, replication, backup/restore, and connector metadata.
- FR13. Documentation SHOULD include audit SQL that helps users find existing `utf8mb4_0900_*` columns and detect possible duplicate-key collisions under `utf8mb4_1700_*`.
- FR14. Utilities and dump/restore workflows SHOULD emit explicit character set and collation clauses where needed to avoid accidental default-collation drift.

### Non-Functional Requirements

- NFR1. The change MUST be backward compatible for existing schemas that continue to use `utf8mb4_0900_*`.
- NFR2. The change SHOULD avoid measurable regression in common string comparison, ORDER BY, GROUP BY, index creation, and bulk-load workloads.
- NFR3. The Unicode 17 data generation process MUST be reproducible from checked-in or documented upstream Unicode/UCA/CLDR/ICU inputs.
- NFR4. The design SHOULD minimize changes to the runtime comparison engine by reusing the existing UCA scanner and collation infrastructure where possible.
- NFR5. The feature SHOULD be observable through existing metadata surfaces without requiring new administrative concepts.
- NFR6. The migration path MUST be explicit and testable by DBAs before production cutover.
- NFR7. The implementation SHOULD include performance benchmarks for representative multilingual, CJK, emoji-heavy, and ASCII-dominant datasets.
- NFR8. The documentation SHOULD be concise enough for community review while retaining enough implementation detail for server, QA, connector, and documentation teams.

---

## 3. Impact Checklist

- [x] Configuration options or system variables
- [x] Command-line options or utilities
- [x] User-visible behavior
- [x] Upgrade / downgrade compatibility

---

## 4. High-Level Specification

### Summary of the Approach

MySQL 10.0 should introduce Unicode 17 support as a parallel, versioned collation family. Existing Unicode 9 collations remain unchanged; new Unicode 17 collations are added under names such as `utf8mb4_1700_ai_ci`.

The approach has four pillars:

1. **Versioned addition, not replacement.** Keep `utf8mb4_0900_*` stable and add `utf8mb4_1700_*`.
2. **Reproducible data generation.** Generate new UCA/CLDR data tables from Unicode 17 inputs using an updated version of the existing UCA generation workflow.
3. **Aligned Unicode dependencies.** Update ICU so regular expression behavior does not lag behind the new collation generation.
4. **Explicit migration.** Preserve upgrade behavior by default and provide DBA-controlled migration steps.

### User Interface

### Configuration / Knobs — New configuration clauses or options

No new SQL grammar is required.

Existing clauses should accept the new collation names:

- `CHARACTER SET utf8mb4`
- `COLLATE utf8mb4_1700_ai_ci`
- `ALTER DATABASE ... COLLATE ...`
- `ALTER TABLE ... CONVERT TO CHARACTER SET ... COLLATE ...`
- Column-level `CHARACTER SET` and `COLLATE` clauses.

Fresh installations MAY use `utf8mb4_1700_ai_ci` as the compiled or initialized default collation for `utf8mb4`, subject to product decision. In-place upgrades MUST NOT change existing schema-level or server-level defaults automatically.

### Configuration / Knobs — New system variables or command-line options

No new system variables are required.

Existing variables remain the control points:

- `character_set_server`
- `collation_server`
- `character_set_database`
- `collation_database`
- `character_set_connection`
- `collation_connection`
- related client/session character set variables.

If product management decides to default fresh MySQL 10.0 installations to Unicode 17 behavior, the install-time default for `collation_server` MAY become `utf8mb4_1700_ai_ci`. Upgrade code MUST preserve the existing configured value.
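
The default-selection policy described above can be sketched as a small decision function. This is only an illustration of the intended rule, not server code: the function name, its parameters, and the assumption that an unconfigured upgraded server keeps `utf8mb4_0900_ai_ci` are all hypothetical.

```python
from typing import Optional

# Hypothetical constants for illustration; the 1700 name is the one proposed here.
LEGACY_DEFAULT = "utf8mb4_0900_ai_ci"
PROPOSED_FRESH_DEFAULT = "utf8mb4_1700_ai_ci"


def effective_collation_server(
    configured: Optional[str],
    is_in_place_upgrade: bool,
    fresh_installs_use_1700: bool,
) -> str:
    """Return the collation_server value the policy above would select."""
    if configured is not None:
        # An explicitly configured value is always preserved.
        return configured
    if is_in_place_upgrade:
        # Assumption: an upgraded server without an explicit setting keeps
        # the default it had before the upgrade.
        return LEGACY_DEFAULT
    # Fresh install: the result depends on the product decision.
    return PROPOSED_FRESH_DEFAULT if fresh_installs_use_1700 else LEGACY_DEFAULT
```

Keeping the product decision as an explicit input makes both outcomes testable before the decision is final.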

### Configuration / Knobs — New command-line options for utilities

No new command-line options are required.

Existing utilities such as `mysqldump` (and `mysqlpump`, if applicable) and client tools SHOULD continue to emit explicit character set and collation clauses where they already do so. Documentation SHOULD recommend explicit collation clauses during migrations so that restored or newly created objects do not silently pick up a changed server default.

### Configuration / Knobs — New UDFs or similar extension points

No new UDFs or extension points are required.

### New Statements

No new SQL statements are required.

The feature is exposed through existing DDL and metadata statements.

### Observability

The new collations MUST be observable through existing mechanisms:

- `SHOW COLLATION`
- `SHOW CHARACTER SET`
- `INFORMATION_SCHEMA.COLLATIONS`
- `INFORMATION_SCHEMA.COLUMNS`
- `INFORMATION_SCHEMA.SCHEMATA`
- `INFORMATION_SCHEMA.TABLES`
- client metadata surfaces that expose collation IDs or charset numbers.

### User Procedure

A recommended DBA migration procedure is:

1. Upgrade MySQL binaries to a version that supports both `utf8mb4_0900_*` and `utf8mb4_1700_*`.
2. Confirm that existing schemas still use their original collations.
3. Inventory databases, tables, and columns using `utf8mb4_0900_*`.
4. For each migration candidate, compare ordering under the old and new collations.
5. Check for possible duplicate-key collisions under the new collation before converting UNIQUE or PRIMARY KEY columns.
6. Convert database defaults, table defaults, or individual columns explicitly.
7. Rebuild affected indexes and full-text indexes where required.
8. Validate application behavior, connectors, replication, backup/restore, and query plans.
9. Only after validation, optionally update server defaults for future object creation.
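
Steps 3, 5, and 6 can be supported by generated audit SQL. The following is a minimal sketch of a helper script that builds such statements; the helper names are hypothetical, `utf8mb4_1700_ai_ci` is the collation proposed in this document and does not exist in any released server, and the collision check covers a single `utf8mb4` column only (multi-column keys would need the same treatment per column).

```python
import re

# Proposed name from this document; not available in any released server.
TARGET_COLLATION = "utf8mb4_1700_ai_ci"

_IDENTIFIER = re.compile(r"[A-Za-z0-9_$]+")


def quote_ident(name: str) -> str:
    """Backtick-quote a simple identifier and reject anything else."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"


def inventory_sql() -> str:
    """Step 3: list columns that still use a utf8mb4_0900_* collation."""
    # '_' is a LIKE wildcard, so it is escaped with an explicit ESCAPE character.
    return (
        "SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME\n"
        "FROM INFORMATION_SCHEMA.COLUMNS\n"
        "WHERE COLLATION_NAME LIKE 'utf8mb4|_0900|_%' ESCAPE '|'\n"
        "ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;"
    )


def collision_sql(schema: str, table: str, column: str,
                  target: str = TARGET_COLLATION) -> str:
    """Step 5: find values that would compare equal under the target collation."""
    if not _IDENTIFIER.fullmatch(target):
        raise ValueError(f"unsupported collation name: {target!r}")
    col = quote_ident(column)
    return (
        f"SELECT {col} COLLATE {target} AS collation_key, COUNT(*) AS occurrences\n"
        f"FROM {quote_ident(schema)}.{quote_ident(table)}\n"
        "GROUP BY collation_key\n"
        "HAVING occurrences > 1;"
    )


def convert_sql(schema: str, table: str, target: str = TARGET_COLLATION) -> str:
    """Step 6: explicit, DBA-initiated table conversion."""
    if not _IDENTIFIER.fullmatch(target):
        raise ValueError(f"unsupported collation name: {target!r}")
    return (
        f"ALTER TABLE {quote_ident(schema)}.{quote_ident(table)} "
        f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {target};"
    )
```

Any row returned by the collision query identifies values that would violate a UNIQUE or PRIMARY KEY constraint after conversion and should be resolved before running the `ALTER TABLE`.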

### Security Context

This proposal does not introduce a new privilege model, authentication mechanism, encryption behavior, or direct security boundary.

However, collation changes can affect security-sensitive application logic when applications rely on string equality or ordering for identifiers, usernames, case-insensitive comparisons, allowlists, denylists, or uniqueness checks. Documentation SHOULD warn users to validate such logic before migration.

Mixed-version replication should be treated carefully because older replicas may not understand new collation names or IDs. The server SHOULD provide clear errors or warnings for unsupported topologies.

### Compatibility and Behavior Change

This proposal is backward compatible by default, provided it is implemented as a new collation family.

Expected compatibility behavior:

- Existing `utf8mb4_0900_*` collations continue to behave exactly as before.
- Existing schemas are not automatically converted.
- In-place upgrades preserve existing server-level and schema-level defaults.
- New `utf8mb4_1700_*` collations are opt-in for existing deployments.
- Fresh installations MAY use `utf8mb4_1700_ai_ci` as the default if product policy chooses that behavior.

Behavior changes occur only where users explicitly choose the new collation, or where new installations inherit a new default. Possible changes include:

- Different `ORDER BY` results.
- Different `GROUP BY` and `DISTINCT` equivalence classes.
- New UNIQUE constraint conflicts during migration.
- Different `MIN()` / `MAX()` results where collation order changes.
- Different `LIKE`, string function, or comparison outcomes in edge cases.
- Different full-text tokenization or ranking behavior if character classification changes.
- Different `REGEXP` behavior after ICU upgrade.
- Connector or client issues if collation IDs are hard-coded or not updated.
- Replication incompatibility if a source emits new-collation DDL/DML to an older replica.
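
To make the first three items concrete, the sketch below shows how swapping the comparison key changes both sort order and equivalence classes for the same values. The `folded_key` function is a deliberately crude stand-in (NFD decomposition, mark stripping, case folding) and is not UCA; a real comparison would use sort keys obtained from the server under each collation. All function names are hypothetical.

```python
import unicodedata
from collections import defaultdict
from typing import Callable, Iterable, List


def folded_key(text: str) -> str:
    """Crude accent- and case-insensitive key; NOT real UCA weights."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.casefold()


def equivalence_classes(values: Iterable[str],
                        key: Callable[[str], str]) -> List[List[str]]:
    """Group values that compare equal under `key`; keep groups of 2 or more."""
    groups = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


def order_changed(values: Iterable[str],
                  old_key: Callable[[str], str],
                  new_key: Callable[[str], str]) -> bool:
    """Report whether sorting under the two keys yields different sequences."""
    items = list(values)
    return sorted(items, key=old_key) != sorted(items, key=new_key)
```

A set of values that is unique under one key can contain several multi-member groups under another, which is exactly the situation that surfaces as a duplicate-key error during conversion.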

---

## 5. Low-Level Design

### Block Diagram

_TBD_

### Interface Specification

#### New collation names

The exact list should be finalized by the MySQL server team, but the naming pattern SHOULD follow existing MySQL versioned collation practice:

- `utf8mb4_1700_ai_ci`
- `utf8mb4_1700_as_ci`
- `utf8mb4_1700_as_cs`
- `utf8mb4_1700_bin` if a versioned binary counterpart is needed for consistency
- locale-specific variants corresponding to existing important `0900` variants, for example Japanese and Chinese tailorings where generated data exists.

The proposal MUST NOT rename or reinterpret existing `utf8mb4_0900_*` collations.

#### Metadata

Each new collation requires:

- unique collation ID;
- collation name;
- character set name `utf8mb4`;
- compiled/default flags as appropriate;
- sort length metadata;
- pad attribute, expected to remain `NO PAD` for UCA 9.0+ style collations unless implementation analysis finds a reason otherwise;
- `strxfrm_multiply` and related sort-key metadata;
- correct exposure through `SHOW COLLATION` and `INFORMATION_SCHEMA.COLLATIONS`.
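
A registration check along these lines could guard against ID or name clashes when new entries are added. This is a sketch only: the `CollationEntry` type and `registration_errors` function are hypothetical and do not mirror the server's `CHARSET_INFO` structure, and any ID used for a `utf8mb4_1700_*` entry in an example is a placeholder, since real IDs must be assigned by the server team.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CollationEntry:
    collation_id: int
    name: str
    charset: str
    pad_attribute: str  # "NO PAD" or "PAD SPACE"


def registration_errors(existing: Iterable[CollationEntry],
                        proposed: Iterable[CollationEntry]) -> List[str]:
    """Return human-readable problems; an empty list means no clash was found."""
    used_ids = {entry.collation_id: entry.name for entry in existing}
    used_names = set(used_ids.values())
    errors: List[str] = []
    for entry in proposed:
        if entry.collation_id in used_ids:
            errors.append(
                f"{entry.name}: ID {entry.collation_id} already used by "
                f"{used_ids[entry.collation_id]}"
            )
        if entry.name in used_names:
            errors.append(f"{entry.name}: name already registered")
        if entry.charset != "utf8mb4":
            errors.append(f"{entry.name}: expected charset utf8mb4")
        if entry.pad_attribute not in ("NO PAD", "PAD SPACE"):
            errors.append(f"{entry.name}: unknown pad attribute")
        # Track the entry so clashes among the proposed entries are caught too.
        used_ids.setdefault(entry.collation_id, entry.name)
        used_names.add(entry.name)
    return errors
```

Running such a check in the build or test suite would turn an accidental ID reuse into an immediate failure rather than a connector-visible metadata bug.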

#### Data files

Add generated files analogous to current `uca900_*` assets:

- `strings/uca1700_data.h`
- `strings/uca1700_ja_data.h` where Japanese tailoring is supported
- `strings/uca1700_zh_data.h` where Chinese tailoring is supported
- additional locale-specific generated files if the supported collation list requires them.

Existing `uca900_*` files remain unchanged.

#### Generator

Either generalize `strings/uca9-dump.cc` into a version-parameterized generator or add a new Unicode 17 generator. The generator should take official Unicode 17 DUCET/allkeys data and CLDR tailoring data as inputs and produce deterministic C/C++ header outputs.

The generation process SHOULD be documented so future Unicode upgrades can be repeated without reverse-engineering the workflow.
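
As a rough illustration of the input side of such a generator, the sketch below parses lines in the DUCET `allkeys.txt` layout (code points, a semicolon, then bracketed collation elements, with `*` marking variable elements) and renders them in sorted order so that output is deterministic. It is written in Python for brevity and is not the existing `uca9-dump.cc` logic; the rendered row format is hypothetical and does not reproduce the layout of the `uca900_*` headers.

```python
import re
from typing import Dict, Iterable, List, Tuple

# One collation element: (is_variable, (primary, secondary, tertiary, ...)).
Element = Tuple[bool, Tuple[int, ...]]
Table = Dict[Tuple[int, ...], List[Element]]

_ELEMENT = re.compile(r"\[([.*])([0-9A-Fa-f.]+)\]")


def parse_allkeys(lines: Iterable[str]) -> Table:
    """Parse allkeys-style lines into {code point sequence: collation elements}."""
    table: Table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("@"):
            continue  # skip blanks and directives such as @version
        left, right = line.split(";", 1)
        code_points = tuple(int(cp, 16) for cp in left.split())
        table[code_points] = [
            (marker == "*", tuple(int(w, 16) for w in weights.split(".")))
            for marker, weights in _ELEMENT.findall(right)
        ]
    return table


def render_rows(table: Table) -> List[str]:
    """Render entries sorted by code point sequence for reproducible output."""
    rows = []
    for code_points in sorted(table):
        cps = " ".join(f"U+{cp:04X}" for cp in code_points)
        elements = "".join(
            "[" + ("*" if variable else ".")
            + ".".join(f"{w:04X}" for w in weights) + "]"
            for variable, weights in table[code_points]
        )
        rows.append(f"{cps} -> {elements}")
    return rows
```

Sorting by key rather than relying on input order is one simple way to keep regenerated files byte-identical across runs, which makes review diffs meaningful.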

#### Runtime comparison

The implementation should reuse the existing UCA scanner and comparison logic where possible. If `uca_scanner_900` is version-specific only because of data layout, introduce a generic scanner or a `uca_scanner_1700` counterpart using the same conceptual model.

The implementation must preserve support for:

- multi-level UCA weights;
- contractions;
- expansions;
- implicit weights;
- Japanese kana-sensitive / quaternary behavior where applicable;
- locale-specific tailoring.
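
For implicit weights, the general derivation in UTS #10 maps a code point with no explicit table entry to two collation elements computed from the code point value. The sketch below shows that arithmetic only; it assumes the caller supplies the base value (`0xFB40` for core Han ideographs, `0xFB80` for other unified ideographs, `0xFBC0` for other code points), omits the separate formula UTS #10 defines for scripts such as Tangut, and is not taken from `ctype-uca.cc`.

```python
from typing import List, Tuple

BASE_CORE_HAN = 0xFB40    # unified ideographs in the core CJK blocks
BASE_OTHER_HAN = 0xFB80   # other unified ideographs (extension blocks)
BASE_UNASSIGNED = 0xFBC0  # any other code point without an explicit entry


def implicit_primaries(code_point: int, base: int) -> Tuple[int, int]:
    """Return the two primary weights (AAAA, BBBB) derived from a code point."""
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return aaaa, bbbb


def implicit_elements(code_point: int, base: int) -> List[Tuple[int, int, int]]:
    """Return the two collation elements [.AAAA.0020.0002][.BBBB.0000.0000]."""
    aaaa, bbbb = implicit_primaries(code_point, base)
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]
```

Because the weights are derived rather than stored, newly encoded ideographs sort after explicitly weighted characters in code point order within their base group, without requiring table entries.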

#### ICU / REGEXP

Update the vendored ICU directory from the current ICU 77.1 baseline to a Unicode 17 capable ICU release, ideally ICU 78.x or an approved maintenance release.

The change should include:

- build integration updates;
- platform compatibility checks;
- ICU data packaging updates;
- regex regression tests;
- documentation notes for any visible regex differences.

#### Replication and version gates

New collation IDs and names must be considered versioned metadata. The server should reject or clearly warn about unsupported replication where older replicas cannot resolve `utf8mb4_1700_*` metadata.

At minimum, release notes and upgrade docs must state that schemas using `utf8mb4_1700_*` should not be replicated to older MySQL versions that do not support those collations.
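
A preflight check for this rule can be expressed as a set difference between the collations a source schema uses and the collations a replica reports. The sketch below assumes both name lists have already been collected (for example from `INFORMATION_SCHEMA` on the source and `SHOW COLLATION` on the replica); the function names are hypothetical and this is not a proposal for how the server itself should enforce the gate.

```python
from typing import Iterable, List


def unresolved_collations(used_by_source: Iterable[str],
                          supported_by_replica: Iterable[str]) -> List[str]:
    """Return collations used on the source that the replica does not know."""
    supported = set(supported_by_replica)
    return sorted({name for name in used_by_source if name not in supported})


def assert_replica_compatible(used_by_source: Iterable[str],
                              supported_by_replica: Iterable[str]) -> None:
    """Raise a clear error instead of letting replication fail later."""
    missing = unresolved_collations(used_by_source, supported_by_replica)
    if missing:
        raise RuntimeError(
            "replica cannot resolve collations: " + ", ".join(missing)
        )
```

Running the check before enabling replication gives operators the explicit failure this section asks for, even on topologies where the server-side gate is only a warning.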

### Design / Implementation Steps

1. **Confirm Unicode target**
   - Confirm Unicode 17 is the target baseline for MySQL 10.0.
   - Confirm the ICU version that provides the chosen Unicode 17 baseline.

2. **Update ICU dependency**
   - Replace or update `extra/icu/icu-release-77-1` with a Unicode 17 capable ICU release.
   - Update build scripts and platform-specific integration.
   - Run baseline regex and ICU tests.

3. **Build Unicode 17 data generation pipeline**
   - Generalize `uca9-dump.cc` or add a new generator.
   - Feed Unicode 17 DUCET/allkeys and CLDR tailoring inputs.
   - Produce deterministic `uca1700_*` headers.
   - Add generation documentation.

4. **Add generated Unicode 17 data files**
   - Add `uca1700_data.h` and locale-specific generated files.
   - Keep `uca900_*` files intact.
   - Ensure generated files are reviewed for size, build impact, and license compatibility.

5. **Register new collations**
   - Add `utf8mb4_1700_*` entries to collation registration structures.
   - Assign stable, non-conflicting collation IDs.
   - Verify metadata output through `SHOW COLLATION` and `INFORMATION_SCHEMA`.

6. **Extend comparison implementation**
   - Wire new generated data into `ctype-uca.cc` or equivalent collation implementation.
   - Validate contractions, expansions, implicit weights, and locale tailorings.
   - Verify sort-key generation and `strxfrm_multiply` assumptions.

7. **Default behavior and upgrade policy**
   - Decide whether fresh MySQL 10.0 installations default to `utf8mb4_1700_ai_ci`.
   - Ensure in-place upgrades preserve existing defaults.
   - Add tests for install and upgrade paths.

8. **Replication, backup, and connector validation**
   - Add tests for DDL containing new collations in mixed-version contexts.
   - Validate dump/restore with explicit and implicit collations.
   - Coordinate connector metadata updates for new IDs.

9. **Documentation and migration guide**
   - Document new collations, compatibility risks, and migration procedure.
   - Provide audit SQL and duplicate-collision checks.
   - Document known regex/full-text differences if any.

10. **Performance and regression testing**
    - Benchmark sort, compare, index creation, joins, GROUP BY, DISTINCT, and bulk load.
    - Include multilingual, CJK, emoji, and ASCII-heavy datasets.
    - Block release on unacceptable regressions.

---

## 6. QA Notes

QA must treat this as a broad semantic change, not just an added collation list.

### Core collation tests

- Verify that every `utf8mb4_1700_*` collation appears in `SHOW COLLATION` with correct metadata.
- Verify `INFORMATION_SCHEMA.COLLATIONS` rows for name, charset, ID, default flag, compiled flag, sort length, and pad attribute.
- Verify explicit `CREATE DATABASE`, `CREATE TABLE`, column definitions, and `ALTER TABLE ... CONVERT TO CHARACTER SET ... COLLATE ...` with new collation names.
- Verify old `utf8mb4_0900_*` results remain unchanged.
- Verify `WEIGHT_STRING()` output stability for golden test cases.
- Verify `ORDER BY`, `GROUP BY`, `DISTINCT`, `MIN()`, and `MAX()` behavior under `_0900_` and `_1700_`.

### Equality and uniqueness tests

- Test UNIQUE indexes before and after conversion.
- Add cases where byte length differs but collation equality may hold.
- Reuse regression patterns from prior multibyte equality bugs.
- Verify duplicate-key detection during `ALTER TABLE ... CONVERT TO CHARACTER SET`.
- Verify case-insensitive, accent-insensitive, case-sensitive, and accent-sensitive variants.

### Unicode 17 coverage tests

- Include scripts introduced after Unicode 9, such as Chorasmian, Dives Akuru, Dogra, Elymaic, Kawi, Makasar, Medefaidrin, and other scripts identified in Unicode 17 data.
- Include supplementary-plane characters.
- Include emoji code points, emoji presentation characters, emoji modifiers, flag sequences, keycap sequences, tag sequences, and ZWJ sequences.
- Include combining marks and normalization-related edge cases, while confirming MySQL does not perform storage-time normalization.
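
Test inputs for the emoji and normalization cases can be built from explicit code point escapes so that the data does not depend on editor or font support. The sketch below is one possible fixture set; the sample names and helper functions are hypothetical, and the checks only describe the strings themselves, not server behavior.

```python
import unicodedata

SAMPLES = {
    # Regional indicator symbols J + P.
    "flag_sequence": "\U0001F1EF\U0001F1F5",
    # Digit one + variation selector-16 + combining enclosing keycap.
    "keycap_sequence": "1\uFE0F\u20E3",
    # Thumbs up + medium skin tone modifier.
    "modifier_sequence": "\U0001F44D\U0001F3FD",
    # Man + ZWJ + woman + ZWJ + girl.
    "zwj_sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
    # Black flag + tag characters "gbeng" + cancel tag.
    "tag_sequence": (
        "\U0001F3F4\U000E0067\U000E0062\U000E0065"
        "\U000E006E\U000E0067\U000E007F"
    ),
    # First code point of CJK Unified Ideographs Extension B.
    "supplementary_plane": "\U00020000",
}


def has_supplementary(text: str) -> bool:
    """True if any code point lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


def utf8_length(text: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def nfc_nfd_bytes_differ(text: str) -> bool:
    """True if the NFC and NFD forms have different UTF-8 encodings."""
    nfc = unicodedata.normalize("NFC", text).encode("utf-8")
    nfd = unicodedata.normalize("NFD", text).encode("utf-8")
    return nfc != nfd
```

Strings for which `nfc_nfd_bytes_differ` is true are useful for confirming that the server stores whichever form the client sent, since storage-time normalization is out of scope.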

### Locale tailoring tests

- Validate Japanese tailoring, including kana-sensitive behavior where supported.
- Validate Chinese tailoring where generated data exists.
- Validate any other locale-specific `1700` collations introduced.
- Compare against expected UCA/CLDR orderings.

### Full-text search tests

- Test built-in full-text parser behavior with `utf8mb4_1700_*`.
- Test ngram parser behavior for CJK text.
- Test punctuation and word-boundary behavior for new scripts and emoji-adjacent text.
- Test full-text index rebuild after collation conversion.
- Compare result sets and ranking before and after conversion where applicable.

### Regular expression tests

- Run existing regex suites after ICU upgrade.
- Add tests for Unicode properties and character classes affected by Unicode 17.
- Test multibyte-safe matching with new scripts and emoji.
- Document any ICU-induced behavior differences.

### Upgrade and migration tests

- Verify in-place upgrade preserves `collation_server`.
- Verify fresh-install default behavior according to product decision.
- Verify schema objects retain old collations unless explicitly altered.
- Verify conversion from `_0900_` to `_1700_` works at database, table, and column levels.
- Verify rollback guidance for failed conversions.

### Replication tests

- Test row-based and statement-based replication with new-collation DDL and DML.
- Test mixed-version topologies and confirm unsupported cases fail clearly.
- Test replication after all nodes support `utf8mb4_1700_*`.
- Test clone/restore scenarios with new collation metadata.

### Backup and restore tests

- Verify logical dump includes explicit character set and collation clauses where needed.
- Restore dumps onto servers with different defaults and confirm schema semantics are preserved.
- Test restore failure or warning behavior on servers that do not support `utf8mb4_1700_*`.

### Connector and protocol tests

- Verify C API metadata surfaces new charset/collation IDs correctly.
- Coordinate tests for major connectors and ORMs that may hard-code collation IDs or names.
- Verify introspection tools correctly display and preserve new collation names.

### Performance tests

- Benchmark string comparison hot paths.
- Benchmark filesort and `ORDER BY` on large datasets.
- Benchmark `GROUP BY`, `DISTINCT`, and joins involving collated string columns.
- Benchmark index creation and `ALTER TABLE ... CONVERT TO CHARACTER SET`.
- Benchmark ASCII-only, mixed Latin, CJK, supplementary-plane, and emoji-heavy datasets.
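
Benchmark inputs for these dataset classes should be reproducible across runs and machines. The sketch below generates seeded pseudo-random strings from fixed code point ranges; the range choices, dataset names, and function are illustrative assumptions, and real benchmarks would likely add realistic text corpora alongside synthetic data.

```python
import random
from typing import Dict, List, Tuple

# Inclusive code point ranges per dataset class (illustrative choices).
RANGES: Dict[str, List[Tuple[int, int]]] = {
    "ascii": [(0x61, 0x7A)],                      # a-z
    "mixed_latin": [(0x61, 0x7A), (0xE0, 0xF6)],  # a-z plus accented Latin-1 letters
    "cjk": [(0x4E00, 0x9FA5)],                    # CJK Unified Ideographs
    "supplementary": [(0x20000, 0x2A6D6)],        # CJK Extension B
    "emoji": [(0x1F600, 0x1F64F)],                # Emoticons block
}


def make_dataset(kind: str, rows: int, length: int, seed: int = 0) -> List[str]:
    """Return `rows` strings of `length` code points drawn from RANGES[kind]."""
    ranges = RANGES[kind]
    rng = random.Random(seed)  # fixed seed gives the same data every run
    dataset = []
    for _ in range(rows):
        chars = []
        for _ in range(length):
            low, high = rng.choice(ranges)
            chars.append(chr(rng.randint(low, high)))
        dataset.append("".join(chars))
    return dataset
```

The same seed and parameters always produce the same rows, so a regression between two server builds can be attributed to the build rather than to differences in input data.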

### Release criteria

- All existing `utf8mb4_0900_*` compatibility tests pass unchanged.
- New `utf8mb4_1700_*` metadata is complete and stable.
- Unicode 17 data generation is reproducible.
- ICU upgrade tests pass on supported platforms.
- Mixed-version replication behavior is explicitly tested and documented.
- Migration documentation includes audit SQL, duplicate-collision checks, and rollback notes.
- No unacceptable performance regression remains open.