Bug #112950 mysql 5.5 Upgrade 8.0 exception caused by 0900 collate
Submitted: 3 Nov 2023 8:20    Modified: 4 Nov 2023 11:15
Reporter: xichao li (OCA)
Status: Not a Bug    Impact on me: None
Category: MySQL Server: Charsets    Severity: S3 (Non-critical)
Version: 8.0    OS: Any
Assigned to:    CPU Architecture: Any

[3 Nov 2023 8:20] xichao li
Description:
Question:
Recently, while performing a MySQL upgrade, we found the following:
After importing otherwise normal data from MySQL 5.5 into MySQL 8.0, a unique index collision error was raised.
At first we suspected the migration tool we used, but after further analysis and verification we found that the problem is caused by the 0900 series collations in MySQL 8.0.

We then checked the server code and found that there is indeed logic that can produce this behavior.

Since we could not find a corresponding explanation in the official documentation, we cannot confirm whether this is expected behavior or a bug in the 0900 collations.
We are filing this bug because it does cause an error in practice.

Related call stack and implementation:

|-row_insert_for_mysql (./storage/innobase/row/row0mysql.cc:1713)
  |-row_insert_for_mysql_using_ins_graph (./storage/innobase/row/row0mysql.cc:1583)
    |-row_ins_step (./storage/innobase/row/row0ins.cc:3710)
      |-row_ins (./storage/innobase/row/row0ins.cc:3586)
        |-row_ins_index_entry_step (./storage/innobase/row/row0ins.cc:3467)
          |-row_ins_index_entry (./storage/innobase/row/row0ins.cc:3331)
            |-row_ins_clust_index_entry (./storage/innobase/row/row0ins.cc:3139)
              |-row_ins_clust_index_entry_low (./storage/innobase/row/row0ins.cc:2442)
                |-btr_pcur_t::open (./storage/innobase/include/btr0pcur.h:646)
                  |-btr_cur_search_to_nth_level (./storage/innobase/btr/btr0cur.cc:1246)
                    |-page_cur_search_with_match_bytes (./storage/innobase/page/page0cur.cc:758)
                      |-cmp_dtuple_rec_with_match_bytes (./storage/innobase/rem/rem0cmp.cc:810)
                        |-cmp_data (./storage/innobase/rem/rem0cmp.cc:449)
                          |-cmp_whole_field (./storage/innobase/rem/rem0cmp.cc:369)
                            |-innobase_mysql_cmp (./storage/innobase/rem/rem0cmp.cc:109)
                              |-my_strnncollsp_uca_900 (./strings/ctype-uca.cc:4936)
                                |-my_strnncoll_uca_900 (./strings/ctype-uca.cc:4898)
                                  |-my_strnncoll_uca<uca_scanner_900<Mb_wc_utf8mb4, 1>, 1, Mb_wc_utf8mb4> (./strings/ctype-uca.cc:1798)
                                    |-next (./strings/ctype-uca.cc:1729)
                                      |-next_raw (./strings/ctype-uca.cc:1476)

template <class Mb_wc, int LEVELS_FOR_COMPARE>
ALWAYS_INLINE int uca_scanner_900<Mb_wc, LEVELS_FOR_COMPARE>::next_raw() {
  int remain_weight = more_weight();
  if (remain_weight >= 0) return remain_weight;

  do {
    my_wc_t wc = 0;

    /* Get next code point */
    int mblen = mb_wc(&wc, sbeg, send);
    if (mblen <= 0) {
      if (LEVELS_FOR_COMPARE == 1) {
        ++weight_lv;
        return -1;
      }

      if (++weight_lv < LEVELS_FOR_COMPARE) {
        if (LEVELS_FOR_COMPARE == 4 && cs->coll_param == &ja_coll_param) {
          // Return directly if we don't have quaternary weight.
          if (weight_lv == 3 && !has_quaternary_weight) return -1;
        }
        /*
          Restart scanning from the beginning of the string, and add
          a level separator.
        */
        sbeg = sbeg_dup;
        return 0;
      }

      // If we don't have any more levels left, we're done.
      return -1;
    }

    sbeg += mblen;
    assert(wc <= uca->maxchar);  // mb_wc() has already checked this.

    if (my_uca_have_contractions(uca)) {
      const uint16 *cweight;
      /*
        If we have scanned a code point which can have previous context,
        and there were some more code points already before,
        then verify that {prev_char, wc} together form
        a real previous context pair.
        Note, we support only 2-character long sequences with previous
        context at the moment. CLDR does not have longer sequences.
        CLDR doesn't have previous context rule whose first character is
        0x0000, so the initial value (0) of prev_char won't break the logic.
      */
      if (my_uca_can_be_previous_context_tail(uca->contraction_flags, wc) &&
          my_uca_can_be_previous_context_head(uca->contraction_flags,
                                              prev_char) &&
          (cweight = previous_context_find(prev_char, wc))) {
        // For Japanese kana-sensitive collation.
        if (LEVELS_FOR_COMPARE == 4 && cs->coll_param == &ja_coll_param) {
          int quat_wt = handle_ja_contraction_quat_wt();
          prev_char = 0;
          if (quat_wt > 0) return quat_wt;
        }
        prev_char = 0; /* Clear for the next code point */
        return *cweight;
      } else if (my_uca_can_be_contraction_head(uca->contraction_flags, wc)) {
        /* Check if wc starts a contraction */
        size_t chars_skipped;  // Ignored.
        if ((cweight = contraction_find(wc, &chars_skipped))) return *cweight;
      }
      prev_char = wc;
    }

    // For Japanese kana-sensitive collation.
    if (LEVELS_FOR_COMPARE == 4 && cs->coll_param == &ja_coll_param) {
      int quat_wt = handle_ja_common_quat_wt(wc);
      if (quat_wt == -1)
        continue;
      else if (quat_wt)
        return quat_wt;
    }
    /* Process single code point */
    uint page = wc >> 8;
    uint code = wc & 0xFF;

    /* If weight page for wc does not exist, then calculate algoritmically */
    const uint16 *wpage = uca->weights[page];
    if (!wpage) return next_implicit(wc);

    /* Calculate pointer to wc's weight, using page and offset */
    wbeg = UCA900_WEIGHT_ADDR(wpage, weight_lv, code);
    wbeg_stride = UCA900_DISTANCE_BETWEEN_WEIGHTS;
    num_of_ce_left = UCA900_NUM_OF_CE(wpage, code);
  } while (!wbeg[0]); /* Skip ignorable code points */   <<<<<<<

  uint16 rtn = *wbeg;
  wbeg += wbeg_stride;
  --num_of_ce_left;
  return rtn;
}
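
The while (!wbeg[0]) condition at the end of the loop is the part in question: code points whose weights are all zero are skipped entirely. This can also be observed from SQL with WEIGHT_STRING() (our own check, not taken from the documentation; X'EFBBBF' is the UTF-8 encoding of U+FEFF and X'36' is the character '6'). Under utf8mb4_0900_ai_ci the weight string for U+FEFF should come back empty, while an ordinary character gets a non-empty weight:

-- U+FEFF contributes no weight at the primary level, so its weight string is empty.
SELECT HEX(WEIGHT_STRING(_utf8mb4 X'EFBBBF' COLLATE utf8mb4_0900_ai_ci)) AS w_feff,
       HEX(WEIGHT_STRING(_utf8mb4 X'36'     COLLATE utf8mb4_0900_ai_ci)) AS w_digit_6;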

How to repeat:
step-1:
CREATE TABLE `ta1` (
  `c1` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  PRIMARY KEY (`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

step-2:
insert into ta1 values("62284830187725972");

step-3: 
Download the attached file and execute it:
source d2.sql

Note: In some editors the contents of the attached d2.sql look identical to the statement in step-2.
However, vim shows its actual contents as follows:
INSERT INTO ta1 VALUES ("<feff>62284830187725972");
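
An equivalent statement that should reproduce the collision without the attachment (assuming d2.sql simply prepends the UTF-8 bytes EF BB BF, as the vim output above suggests, and that the connection character set is utf8mb4) is:

-- X'EFBBBF' is the UTF-8 encoding of U+FEFF.
INSERT INTO ta1 VALUES (CONCAT(_utf8mb4 X'EFBBBF', _utf8mb4 '62284830187725972'));

This INSERT is rejected with a duplicate-key error (ER_DUP_ENTRY) on the primary key, the same failure we saw during the import.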
[3 Nov 2023 8:23] xichao li
attached sql file

Attachment: d2.sql (application/octet-stream, text), 51 bytes.

[3 Nov 2023 12:14] MySQL Verification Team
Hi Mr. li,

Thank you for your bug report.

Our strict policy is that you should upgrade from 5.5 to 5.6, then from 5.6 to 5.7 and lastly 5.7 to 8.0.

However, we found that your analysis is correct.

Verified as reported.
[3 Nov 2023 13:51] Roy Lyseng
Posted by developer:
 
This is not a bug.
First, this is not an upgrade bug, since the table is created in 8.0 with a collation that was introduced in 8.0.
Second, the data set includes code points that are indeed ignorable according to utf8mb4_0900_ai_ci.
Thus, the strings '62284830187725972' and '<FEFF>62284830187725972' will be considered equal and cause a unique index violation.
U+FEFF is the code point ZERO WIDTH NO-BREAK SPACE, but if this data comes from UTF-16, it is a BOM (Byte Order Mark) for big-endian UTF-16.
The workaround here is to sanitize the data. It may also be possible to switch the indexed column to a binary collation.
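
A minimal sketch of both suggestions might look like this, using the ta1/c1 names from the repro and assuming the only offending character is U+FEFF (UTF-8 bytes EF BB BF):

-- Option A: locate values containing the U+FEFF byte sequence, comparing as
-- binary so that the ignorable-weight rule of utf8mb4_0900_ai_ci cannot hide it.
SELECT c1 FROM ta1
WHERE CONVERT(c1 USING binary) LIKE CONCAT('%', X'EFBBBF', '%');

-- Option B: switch the key column to a binary collation, under which the
-- BOM-prefixed and plain strings are no longer considered equal.
ALTER TABLE ta1
  MODIFY c1 varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;

Rows found by option A still need to be de-duplicated by hand before the import: stripping U+FEFF from '<FEFF>62284830187725972' would collide with the existing '62284830187725972'.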
[4 Nov 2023 11:15] xichao li
Thank you for your feedback.

Reading the code, we could see the mechanism used for string comparison under the utf8mb4_0900 collations, and we also suspected that this might be intended behavior of utf8mb4_0900_ai_ci itself.
But since we found no documentation describing it, we submitted this bug for confirmation.

As you said, the solution is to clean up the abnormal data.
For the databases still to be upgraded, we plan to switch the collation back to utf8mb4_general_ci, as used before 8.0, to avoid a recurrence of this problem. After all, deleting data is not something DBAs like to do.

In addition, since 8.0.1, utf8mb4_0900_ai_ci has been the default collation for the utf8mb4 character set.
We can only see that utf8mb4_0900_ai_ci is recommended over utf8mb4_general_ci, but there is no complete description of the differences between them. This may cause users to miss the difference, causing upgrades to fail.
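
As a concrete illustration of the difference we mean (our own check; X'EFBBBF' is the UTF-8 encoding of U+FEFF), the BOM-prefixed string compares equal to the plain string under utf8mb4_0900_ai_ci but not under utf8mb4_general_ci:

SELECT CONCAT(_utf8mb4 X'EFBBBF', _utf8mb4 '62284830187725972') COLLATE utf8mb4_0900_ai_ci
         = _utf8mb4 '62284830187725972' AS eq_0900_ai_ci,
       CONCAT(_utf8mb4 X'EFBBBF', _utf8mb4 '62284830187725972') COLLATE utf8mb4_general_ci
         = _utf8mb4 '62284830187725972' AS eq_general_ci;

This should return 1 for eq_0900_ai_ci and 0 for eq_general_ci.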

Finally, judging from our recent upgrades, two out of five databases hit this problem, and we suspect that many more systems may be affected.
Our guess is that end users of applications may paste data from anywhere, and the applications then insert it into the MySQL database; if characters like <feff> are present, they will mostly end up stored in tables.
Since MySQL 5.7 is nearing the end of its life cycle, this error will inevitably occur as such systems are upgraded to 8.0.