Bug #119167 SIGSEGV in calc_row_difference()/memcmp during GROUP BY + MAX(CONCAT(...)) (temporary table update path)
Submitted: 16 Oct 8:28 Modified: 16 Oct 10:45
Reporter: shing aq Email Updates:
Status: Need Feedback Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:8.0.25 OS:Linux (ARM Kylin v10)
Assigned to: CPU Architecture:ARM (ARM Kylin v10)
Tags: calc_row_difference, coredump, memcmp, TempTableAggregateIterator

[16 Oct 8:28] shing aq
Description:
In production, running a large GROUP BY query involving many MAX(CONCAT(...)) columns caused mysqld to receive a SIGSEGV and crash the entire server (master outage). The crash stack (captured from the error log) shows memcmp in libc being invoked from the calc_row_difference path.

08:06:51 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0xfffaaaf756000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = fff940553010 thread_stack 0x46000
/mysql/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2c) [0x1c99324]
/mysql/bin/mysqld(handle_fatal_signal+0x284) [0xf5e07c]
linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xfffc292f07e0]
/lib64/libc.so.6(memcmp+0x10c) [0xfffc28852f0c]
/mysql/bin/mysqld() [0x1d8c158]
/mysql/bin/mysqld(ha_innobase::update_row(unsigned char const*, unsigned char*)+0xb0) [0x1d8cd78]
/mysql/bin/mysqld(handler::ha_update_row(unsigned char const*, unsigned char*)+0x184) [0x103e114]
/mysql/bin/mysqld(TempTableAggregateIterator::Init()+0x28c) [0x125086c]
/mysql/bin/mysqld(filesort(THD*, Filesort*, RowIterator*, unsigned long, unsigned long long, Filesort_info*, Sort_result*, unsigned long long*)+0x408) [0x1023c30]
/mysql/bin/mysqld(SortingIterator::DoSort()+0x4c) [0xdbea1c]
/mysql/bin/mysqld(SortingIterator::Init()+0x1c) [0xdbea7c]
/mysql/bin/mysqld(Query_expression::ExecuteIteratorQuery(THD*)+0x2e4) [0xef6a1c]
/mysql/bin/mysqld(Query_expression::execute(THD*)+0x2c) [0xef6ccc]
/mysql/bin/mysqld(Sql_cmd_dml::execute_inner(THD*)+0x234) [0x9e9e21c]
/mysql/bin/mysqld(Sql_cmd_dml::execute(THD*)+0x4b4) [0xea7054]
/mysql/bin/mysqld(mysql_execute_command(THD*, bool)+0xd4c) [0xe557a44]
/mysql/bin/mysqld(dispatch_sql_command(THD*, Parser_state*)+0x324) [0xe5a4f4]
/mysql/bin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x12c4) [0xe5bb8c]
/mysql/bin/mysqld(do_command(THD*)+0x160) [0xe5cado]
/mysql/bin/mysqld() [0xf51428]
/mysql/bin/mysqld() [0x21f14f4]
/lib64/libpthread.so.0(+0x88cc) [0xfffc291788cc]
/lib64/libc.so.6(+0xda1ec) [0xfffc288aa1ec]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (fffaaf431828): is an invalid pointer
Connection ID (thread ID): 24189625
Status: NOT_KILLED

This is the SQL we ran in production. After running this query, the database immediately coredumped and exited.
The error log contains the stack trace shown above.  
The table structure is similar to the one in production:
CREATE TABLE test_large_tmp (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  col_01 VARCHAR(64),
  col_02 VARCHAR(64),
  col_03 VARCHAR(64),
  col_04 VARCHAR(64),
  col_05 VARCHAR(64),
-- ... many similar columns omitted
  col_99 VARCHAR(64),
  col_100 VARCHAR(64),
  grp_key1 VARCHAR(20),
  grp_key2 VARCHAR(20),
  grp_key3 VARCHAR(20),
  grp_key4 VARCHAR(20),
  grp_key5 VARCHAR(20)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The SQL we ran in production:
-- Table size: approximately 5.15 million rows
SELECT 
    COUNT(1),
    MAX(CONCAT(CONCAT('', ''), col_01)) AS col1,
    MAX(CONCAT(CONCAT('', ''), col_02)) AS col2,
    MAX(CONCAT(CONCAT('', ''), col_03)) AS col3,
    MAX(CONCAT(CONCAT('', ''), col_04)) AS col4,
    MAX(CONCAT(CONCAT('', ''), col_05)) AS col5,
-- ... many similar columns omitted
    MAX(CONCAT(CONCAT('', ''), col_68)) AS col68,
    MAX(CONCAT(CONCAT('', ''), col_69)) AS col69,
    MAX(CONCAT(CONCAT('', ''), col_70)) AS col70
FROM test_large_tmp
GROUP BY grp_key1, grp_key2, grp_key3, grp_key4, grp_key5;

Why we think this is a bug:
By reviewing the source code of version 8.0.25, I inferred the related function call path:
TempTableAggregateIterator::Init() -> handler::ha_update_row() -> ha_innobase::update_row() -> calc_row_difference() -> memcmp()
We can confirm that the function at 0x1d8c158 is indeed calc_row_difference().
After analyzing this function, I found that it calls memcmp() in two places. 
The relevant code is as follows:
	/* for virtual columns, we only materialize
	its index, and index field length would not
	exceed max_field_len. So continue if the
	first max_field_len bytes are matched up */
	if (o_len != UNIV_SQL_NULL && n_len != UNIV_SQL_NULL &&
		o_len >= max_field_len && n_len >= max_field_len &&
		memcmp(o_ptr, n_ptr, max_field_len) == 0) {

		dfield_t *vfield = dtuple_get_nth_v_field(uvect->old_vrow, num_v);
		col->copy_type(dfield_get_type(vfield));

		if (is_multi_value) {
			innobase_get_multi_value(prebuilt->m_mysql_table, i, vfield, nullptr,
			static_cast<uint>(old_row - new_row), comp,
								   uvect->heap);
			} else {
			  buf = innodb_fill_old_vcol_val(prebuilt, vfield, o_len, col,
											 old_mysql_row_col, col_pack_len, buf);
			}
			num_v++;
			continue;
		}
    }
	
    // After analysis, I believe that this SQL should enter the branch below, rather than the one above, so the crash probably occurs in this branch:
    if (o_len != n_len || (o_len != UNIV_SQL_NULL && o_len != 0 &&
                           0 != memcmp(o_ptr, n_ptr, o_len))) {
      /* The field has changed */
      bool multi_value_calc_by_diff = false;
      dfield_t old_field, new_field;

      ufield = uvect->fields + n_changed;

      UNIV_MEM_INVALID(ufield, sizeof *ufield);

      /* Let us use a dummy dfield to make the conversion
      from the MySQL column format to the InnoDB format */

      /* If the length of new geometry object is 0, means
      this object is invalid geometry object, we need
      to block it. */
      if (DATA_GEOMETRY_MTYPE(col_type) && o_len != 0 && n_len == 0) {
        return (DB_CANT_CREATE_GEOMETRY_OBJECT);
      }
      // rest of the code omitted

How to repeat:
I'm sorry that I currently cannot reproduce this bug reliably. I suspect that during execution, o_len might be assigned an incorrect value, causing memcmp to access out-of-bounds memory.
I have also reviewed the source code but have not found a complete chain of evidence to support this yet. I am continuing to investigate this bug.

If your team has any internal insights or ideas, I would greatly appreciate if you could share them.

I will continue to study and attempt to reproduce this bug. If I can reproduce it reliably, I will continue to update and refine the details here.
[16 Oct 10:45] MySQL Verification Team
Hello!

Don't bother studying 8.0.25 source code.  Please test against 8.0.43 and 8.4.6.

There were many fixes since,  maybe this is related:

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-28.html
"
InnoDB: An update operation that inserted a new row into a temporary table used for aggregation caused the temporary table to be moved to disk and the update operation to be retried on the new on-disk temporary table. A BLOB pointer in the record data prepared before the temporary table was moved to disk was rendered stale, causing a failure. (Bug #33242407)"
[16 Oct 14:49] Jean-François Gagné
> If your team has any internal insights or ideas, I would greatly appreciate if you could share them.

What I would suggest is loading the content of the table in a recent MySQL version (8.0.25 dates from 2021-05-11, 8.0.43 is the most recent from 2025-07-22), and running the query to reproduce the crash.

If you repro, do a binary search in the table to see which rows are the cause (load the half left of the table and try reproducing; if not, load half right; then iterate to narrow down).

If you do not repro with 8.0.43, try with 8.0.25.  Hopefully you will repro, then you can narrow-down which row and work around the problem, and you will know upgrading is a way to fully solve the problem.  If you do not repro with a clean 8.0.25, then narrowing-down the bug will be very hard.