| Bug #120599 | Field_set::sql_type() Corrupts 4-byte UTF-8 Characters, Breaking Replication | ||
|---|---|---|---|
| Submitted: | 2 Jun 8:24 | Modified: | 2 Jun 12:53 |
| Reporter: | fei yang | Email Updates: | |
| Status: | Open | Impact on me: | |
| Category: | MySQL Server | Severity: | S3 (Non-critical) |
| Version: | 8.0.43 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[2 Jun 11:37]
fei yang
Description:
Field_set::sql_type() silently replaces 4-byte UTF-8 characters (e.g., emojis) with '?' during charset conversion from column charset (utf8mb4) to system_charset_info (utf8mb3). This causes:
Replication failure when store_create_info() path is used for binlog (triggered by sql_generate_invisible_primary_key=ON), producing duplicate SET value errors on the replica.
Incorrect SHOW CREATE TABLE output — emojis displayed as '?', making distinct values indistinguishable.
The sibling function Field_enum::sql_type() properly handles this with a well_formed_error check and hex literal fallback; Field_set::sql_type() lacks this entirely.
How to repeat:
Case 1: Replication Failure with GIPK
Setup: Master-slave replication, sql_generate_invisible_primary_key=ON on both.
-- On master:
CREATE TABLE t1(
c0 INT,
c1 SET('🔥','😅','🚀')
) CHARSET=utf8mb4;
Replica error: Column 'c1' has duplicated value '?' in SET — replication breaks.
Case 2: SHOW CREATE TABLE Corruption
CREATE TABLE t2(
c1 SET('🔥','😅','🚀')
) CHARSET=utf8mb4;
SHOW CREATE TABLE t2\G
Actual output: c1 set('?','?','?') — all three emojis become indistinguishable '?'.
Root Cause
In sql/field.cc, Field_set::sql_type() converts SET values from column charset (utf8mb4) to system_charset_info (utf8mb3) via String::copy() with &dummy_errors. When 4-byte characters cannot be represented in utf8mb3, they are silently replaced with '?'. The dummy_errors counter is never checked, so no fallback is attempted.
Field_enum::sql_type() in the same file handles this correctly: after String::copy(), it checks well_formed_error() and falls back to hex literal representation (e.g., _utf8mb4 0xF09F9A80), preserving the original value.
When is_pk_generated=true in mysql_create_table(), binlog uses store_create_info() (which calls Field_set::sql_type()) instead of the original thd->query(), writing corrupted SQL to the binlog and breaking replication on the replica.
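The collapse described above can be illustrated outside the server. The following is a minimal sketch, not the MySQL code: `to_utf8mb3_lossy` and `has_duplicates_after_conversion` are hypothetical names, and the conversion is a simplified stand-in that assumes well-formed UTF-8 input and replaces every 4-byte sequence with '?'. It shows how three distinct SET members become identical when the error count is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a utf8mb4 -> utf8mb3 conversion (not the MySQL
// implementation): each 4-byte UTF-8 sequence is replaced by '?' and counted
// in *errors. Assumes the input is well-formed UTF-8.
std::string to_utf8mb3_lossy(const std::string &in, unsigned *errors) {
  assert(errors != nullptr);
  std::string out;
  *errors = 0;
  for (std::size_t i = 0; i < in.size();) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    const std::size_t len =
        lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    if (len == 4) {
      out += '?';  // not representable in a 3-byte encoding
      ++*errors;
    } else {
      out.append(in, i, len);
    }
    i += len;
  }
  return out;
}

// Returns true when two or more members become identical after conversion.
// The error count is deliberately ignored here, mirroring the unchecked
// dummy_errors described in the report.
bool has_duplicates_after_conversion(const std::vector<std::string> &members) {
  std::set<std::string> seen;
  for (const std::string &member : members) {
    unsigned ignored_errors = 0;
    if (!seen.insert(to_utf8mb3_lossy(member, &ignored_errors)).second)
      return true;
  }
  return false;
}
```

Called with the three emoji byte sequences from the test case, `has_duplicates_after_conversion` returns true, which corresponds to the duplicated value '?' error seen on the replica.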
[2 Jun 11:38]
fei yang
Correction to Suggested Fix & Field_enum has the same bug
The well_formed_error check in Field_enum::sql_type() does not work. After String::copy() replaces unmappable 4-byte UTF-8 chars with '?', the resulting '?' is well-formed in utf8mb3, so well_formed_error is always 0 and the hex literal branch is never reached. Also, even if reached, the hex literal uses the converted (already corrupted) data, not the original bytes. So the well_formed_error code in Field_enum::sql_type() is effectively dead code.
The correct fix is: check the conversion error counter returned by String::copy(). When it is non-zero, generate a hex literal from the original unconverted bytes (*pos, *len), not the corrupted converted data.
Both Field_set::sql_type() and Field_enum::sql_type() need this fix.
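The shape of the proposed fix might look like the sketch below. It is a standalone illustration under stated assumptions, not a patch against sql/field.cc: `convert_counting_errors` and `render_member` are hypothetical helpers, the conversion is a simplified stand-in for the real charset conversion, and quote escaping in the non-fallback path is omitted for brevity. The point it shows is that the decision is driven by the conversion error count and that the hex literal is built from the original bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the charset conversion (not the MySQL API):
// replaces each 4-byte UTF-8 sequence with '?' and reports how many
// replacements were made. Assumes well-formed UTF-8 input.
std::string convert_counting_errors(const std::string &in, unsigned *errors) {
  assert(errors != nullptr);
  std::string out;
  *errors = 0;
  for (std::size_t i = 0; i < in.size();) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    const std::size_t len =
        lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    if (len == 4) {
      out += '?';
      ++*errors;
    } else {
      out.append(in, i, len);
    }
    i += len;
  }
  return out;
}

// Hypothetical helper: renders one SET/ENUM member for DDL output.
// If the conversion lost anything, fall back to a hex literal built from
// the ORIGINAL bytes, never from the converted (already corrupted) data.
std::string render_member(const std::string &original) {
  unsigned errors = 0;
  const std::string converted = convert_counting_errors(original, &errors);
  if (errors == 0) return "'" + converted + "'";  // escaping omitted

  static const char digits[] = "0123456789ABCDEF";
  std::string literal = "_utf8mb4 0x";
  for (unsigned char byte : original) {
    literal += digits[byte >> 4];
    literal += digits[byte & 0x0F];
  }
  return literal;
}
```

With this logic the three emoji members would be rendered as three different hex literals (for example `_utf8mb4 0xF09F9A80` for the rocket), so they stay distinct in the generated statement.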
[2 Jun 12:53]
fei yang
I'd like to provide a pure-ASCII reproduction that avoids the need to type emoji characters directly, which may be difficult in some terminals.
Reproduction (pure ASCII, copy-paste friendly):
-- Step 1: Enable GIPK on both master and slave
SET GLOBAL sql_generate_invisible_primary_key=ON;
-- Step 2: On master, create a table with emoji SET values using UNHEX
SET @e1 = UNHEX('F09F94A5'); -- 🔥
SET @e2 = UNHEX('F09F9885'); -- 😅
SET @e3 = UNHEX('F09F9A80'); -- 🚀
SET @sql = CONCAT(
'CREATE TABLE test.t1(c0 INT, c1 SET(''',
@e1, ''',''',
@e2, ''',''',
@e3, ''')) CHARSET=utf8mb4'
);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
-- Step 3: Bug manifests - SHOW CREATE TABLE shows '?' for all emojis
SHOW CREATE TABLE test.t1\G
-- Expected: c1 set('🔥','😅','🚀')
-- Actual: c1 set('?','?','?')
-- Step 4: On replica, replication breaks with:
-- Error 'Column 'c1' has duplicated value '?' in SET'
-- Because three distinct emojis all become indistinguishable '?' in binlog,
-- causing a duplicate SET value error on the replica.
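The hex strings passed to UNHEX above are the 4-byte UTF-8 encodings of the three code points (assumed here to be U+1F525, U+1F605 and U+1F680). As a cross-check, a small sketch that derives them; `utf8_hex_4byte` is a hypothetical helper name and only handles supplementary-plane code points, which are exactly the ones utf8mb3 cannot store.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the uppercase hex of the 4-byte UTF-8
// encoding of a supplementary-plane code point (U+10000..U+10FFFF).
std::string utf8_hex_4byte(std::uint32_t cp) {
  assert(cp >= 0x10000 && cp <= 0x10FFFF);
  const unsigned char bytes[4] = {
      static_cast<unsigned char>(0xF0 | (cp >> 18)),           // 11110xxx
      static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)),  // 10xxxxxx
      static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),   // 10xxxxxx
      static_cast<unsigned char>(0x80 | (cp & 0x3F)),          // 10xxxxxx
  };
  static const char digits[] = "0123456789ABCDEF";
  std::string hex;
  for (unsigned char byte : bytes) {
    hex += digits[byte >> 4];
    hex += digits[byte & 0x0F];
  }
  return hex;
}
```

For example, `utf8_hex_4byte(0x1F525)` yields `F09F94A5`, the value assigned to @e1.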

Suggested fix: Add the same well_formed_error check and hex literal fallback to Field_set::sql_type(), matching Field_enum::sql_type() behavior.