Bug #120599 Field_set::sql_type() Corrupts 4-byte UTF-8 Characters, Breaking Replication
Submitted: 2 Jun 8:24  Modified: 2 Jun 12:53
Reporter: fei yang
Status: Open
Category: MySQL Server  Severity: S3 (Non-critical)
Version: 8.0.43  OS: Any
Assigned to:  CPU Architecture: Any

[2 Jun 8:24] fei yang
Description:
Field_set::sql_type() silently replaces 4-byte UTF-8 characters (e.g., emojis) with '?' during charset conversion from the column charset (utf8mb4) to system_charset_info (utf8mb3). This causes:

1. Replication failure when the store_create_info() path is used for the binlog (triggered by sql_generate_invisible_primary_key=ON), producing duplicate SET value errors on the replica.
2. Incorrect SHOW CREATE TABLE output: emojis are displayed as '?', making distinct values indistinguishable.

The sibling function Field_enum::sql_type() properly handles this with a well_formed_error check and hex literal fallback; Field_set::sql_type() lacks this entirely.

How to repeat:
Case 1: Replication Failure with GIPK
Setup: master-slave replication with sql_generate_invisible_primary_key=ON on both servers.
-- On master:
CREATE TABLE t1(
  c0 INT,
  c1 SET('🔥','😅','🚀')
) CHARSET=utf8mb4;

Replica error: Column 'c1' has duplicated value '?' in SET — replication breaks.

Case 2: SHOW CREATE TABLE Corruption
CREATE TABLE t2(
  c1 SET('🔥','😅','🚀')
) CHARSET=utf8mb4;

SHOW CREATE TABLE t2\G

Actual output: c1 set('?','?','?') — all three emojis become indistinguishable '?'.

Root cause:
In sql/field.cc, Field_set::sql_type() converts SET values from column charset (utf8mb4) to system_charset_info (utf8mb3) via String::copy() with &dummy_errors. When 4-byte characters cannot be represented in utf8mb3, they are silently replaced with '?'. The dummy_errors counter is never checked, so no fallback is attempted.
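The lossy conversion described above can be illustrated with a small standalone sketch. This is not MySQL's String::copy(); the function names to_utf8mb3_lossy and seq_len are hypothetical, and the input is assumed to be well-formed UTF-8. It only shows how replacing every 4-byte sequence with '?' makes distinct emojis collapse to the same value when the error counter is ignored.

```cpp
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence, derived from its lead byte.
std::size_t seq_len(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead >> 5) == 0x06) return 2;
  if ((lead >> 4) == 0x0E) return 3;
  if ((lead >> 3) == 0x1E) return 4;
  return 1;  // stray continuation or invalid lead byte: treat as one byte
}

// Hypothetical stand-in for a utf8mb4 -> utf8mb3 conversion.
// Sequences of 1 to 3 bytes are copied unchanged; each 4-byte sequence
// is replaced with '?' and counted in *errors.
std::string to_utf8mb3_lossy(const std::string &in, unsigned *errors) {
  std::string out;
  *errors = 0;
  for (std::size_t i = 0; i < in.size();) {
    std::size_t len = seq_len(static_cast<unsigned char>(in[i]));
    if (i + len > in.size()) len = in.size() - i;  // truncated tail
    if (len == 4) {
      out += '?';
      ++*errors;
    } else {
      out.append(in, i, len);
    }
    i += len;
  }
  return out;
}
```

Calling to_utf8mb3_lossy on "\xF0\x9F\x94\xA5" and on "\xF0\x9F\x98\x85" yields "?" in both cases, with the error counter set to 1. A caller that discards the counter, as the report says happens with dummy_errors, has no way to notice the loss.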

Field_enum::sql_type() in the same file handles this correctly: after String::copy(), it checks well_formed_error() and falls back to hex literal representation (e.g., _utf8mb4 0xF09F9A80), preserving the original value.

When is_pk_generated=true in mysql_create_table(), binlog uses store_create_info() (which calls Field_set::sql_type()) instead of the original thd->query(), writing corrupted SQL to the binlog and breaking replication on the replica.

Suggested fix:
Add the same well_formed_error check and hex literal fallback to Field_set::sql_type(), matching Field_enum::sql_type() behavior.
[2 Jun 11:38] fei yang
Correction to Suggested Fix & Field_enum has the same bug

The well_formed_error check in Field_enum::sql_type() does not work. After String::copy() replaces unmappable 4-byte UTF-8 chars with '?', the resulting '?' is well-formed in utf8mb3, so well_formed_error is always 0 and the hex literal branch is never reached. Also, even if reached, the hex literal uses the converted (already corrupted) data, not the original bytes. So the well_formed_error code in Field_enum::sql_type() is effectively dead code.

The correct fix is: check the conversion error counter returned by String::copy(). When it is non-zero, generate a hex literal from the original unconverted bytes (*pos, *len), not the corrupted converted data. Both Field_set::sql_type() and Field_enum::sql_type() need this fix.
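The proposed logic can be sketched outside the server as follows. This is an illustration under stated assumptions, not a patch: convert_lossy, hex_literal, and render_member are hypothetical names, the converter is a simplified stand-in for the utf8mb4 to utf8mb3 copy, and quoting or escaping of the member text is omitted. The point is that the decision is driven by the conversion error counter and that the hex literal is built from the original bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lossy converter: 4-byte UTF-8 sequences become '?' and are
// counted in *errors; everything else is copied unchanged.
// Assumes well-formed UTF-8 input.
std::string convert_lossy(const std::string &in, unsigned *errors) {
  std::string out;
  *errors = 0;
  for (std::size_t i = 0; i < in.size();) {
    unsigned char lead = static_cast<unsigned char>(in[i]);
    std::size_t len = 1;
    if ((lead >> 5) == 0x06) len = 2;
    else if ((lead >> 4) == 0x0E) len = 3;
    else if ((lead >> 3) == 0x1E) len = 4;
    if (i + len > in.size()) len = in.size() - i;  // truncated tail
    if (len == 4) {
      out += '?';
      ++*errors;
    } else {
      out.append(in, i, len);
    }
    i += len;
  }
  return out;
}

// Builds a literal such as "_utf8mb4 0xF09F9A80" from raw bytes.
std::string hex_literal(const std::string &raw, const std::string &charset) {
  static const char digits[] = "0123456789ABCDEF";
  std::string out = "_" + charset + " 0x";
  for (unsigned char c : raw) {
    out += digits[c >> 4];
    out += digits[c & 0x0F];
  }
  return out;
}

// Renders one SET/ENUM member: quoted text when the conversion was lossless,
// otherwise a hex literal of the ORIGINAL (unconverted) bytes.
std::string render_member(const std::string &raw) {
  unsigned errors = 0;
  std::string converted = convert_lossy(raw, &errors);
  if (errors != 0) return hex_literal(raw, "utf8mb4");
  return "'" + converted + "'";  // escaping omitted in this sketch
}
```

With this shape, render_member("\xF0\x9F\x9A\x80") produces _utf8mb4 0xF09F9A80, and the three emojis from the test case render as three different literals instead of three identical '?' values.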
[2 Jun 12:53] fei yang
I'd like to provide a pure-ASCII reproduction that avoids the need to type emoji characters directly, which may be difficult in some terminals.

Reproduction (pure ASCII, copy-paste friendly):

-- Step 1: Enable GIPK on both master and slave
SET GLOBAL sql_generate_invisible_primary_key=ON;

-- Step 2: On master, create a table with emoji SET values using UNHEX
SET @e1 = UNHEX('F09F94A5');  -- 🔥
SET @e2 = UNHEX('F09F9885');  -- 😅
SET @e3 = UNHEX('F09F9A80');  -- 🚀
SET @sql = CONCAT(
  'CREATE TABLE test.t1(c0 INT, c1 SET(''',
  @e1, ''',''',
  @e2, ''',''',
  @e3, ''')) CHARSET=utf8mb4'
);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

-- Step 3: The bug manifests: SHOW CREATE TABLE shows '?' for every emoji
SHOW CREATE TABLE test.t1\G
-- Expected: c1 set('🔥','😅','🚀')
-- Actual:   c1 set('?','?','?')

-- Step 4: On the replica, replication breaks with:
--   Error 'Column 'c1' has duplicated value '?' in SET'
-- because the three distinct emojis are all written to the binlog as '?',
-- which the replica rejects as duplicate SET values.
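
For readers who want to confirm that the UNHEX inputs above really are 4-byte UTF-8 sequences, and therefore outside what utf8mb3 can store, the following sketch decodes such a hex string to its code point. The function name decode_4byte_utf8_hex is hypothetical, and the input is assumed to consist of hex digits only (std::stoul throws on anything else).

```cpp
#include <cstdint>
#include <string>

// Decodes an 8-digit hex string (e.g. "F09F94A5") as one 4-byte UTF-8
// sequence. Returns the code point, or 0 if the bytes are not a well-formed
// 4-byte sequence encoding U+10000..U+10FFFF.
std::uint32_t decode_4byte_utf8_hex(const std::string &hex) {
  if (hex.size() != 8) return 0;
  std::uint32_t b[4];
  for (int i = 0; i < 4; ++i) {
    b[i] = static_cast<std::uint32_t>(
        std::stoul(hex.substr(2 * i, 2), nullptr, 16));
  }
  if ((b[0] & 0xF8) != 0xF0) return 0;  // lead byte must be 11110xxx
  for (int i = 1; i < 4; ++i) {
    if ((b[i] & 0xC0) != 0x80) return 0;  // continuation must be 10xxxxxx
  }
  std::uint32_t cp = ((b[0] & 0x07) << 18) | ((b[1] & 0x3F) << 12) |
                     ((b[2] & 0x3F) << 6) | (b[3] & 0x3F);
  if (cp < 0x10000 || cp > 0x10FFFF) return 0;  // overlong or out of range
  return cp;
}
```

The three values used in Step 2 decode to U+1F525, U+1F605, and U+1F680, all above U+FFFF, which is why each one needs four bytes in UTF-8.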