MySQL Bugs: #90854: SQL thread hangs in "Syncing ndb table schema operation and binlog"

Bug #90854	SQL thread hangs in "Syncing ndb table schema operation and binlog"
Submitted:	14 May 2018 8:30	Modified:	19 May 2018 16:08
Reporter:	Daniël van Eeden (OCA)
Status:	Can't repeat
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	7.6.4	OS:	Any
Assigned to:	MySQL Verification Team	CPU Architecture:	Any
Tags:	ndb, replication, sql thread

Description:
mysql> select * from information_schema.processlist where id=1087917\G
*************************** 1. row ***************************
     ID: 1087917
   USER: system user
   HOST: 
     DB: db1
COMMAND: Connect
   TIME: 403319
  STATE: Syncing ndb table schema operation and binlog
   INFO: ALTER TABLE t1 CHANGE c1 ____c1 varchar(255) CHARACTER SET utf8mb4 DEFAULT NULL
1 row in set (0.02 sec)

The SQL thread stays in this state, but I don't see any progress or resource usage.
ndb_desc showed that the column had already been renamed, and ndb_show_tables didn't show any internal temporary tables.
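
For scale, the TIME column in the processlist row above is reported in seconds. A minimal Python sketch of the conversion follows; the variable names are illustrative only. Note that for a replication SQL thread this column may reflect the age of the last replicated event rather than strictly the time spent in the current state, so treat the result as an approximation of how long the thread was stuck.

```python
from datetime import timedelta

# TIME value copied from the processlist row above (seconds).
processlist_time_s = 403319

elapsed = timedelta(seconds=processlist_time_s)
print(elapsed)  # 4 days, 16:01:59
```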

I tried to stop the SQL thread; it seemed to hang in the "Killing slave" state and eventually resulted in:
mysql> stop slave sql_thread;
ERROR 1146 (42S02): Table 'mysql.slave_relay_log_info' doesn't exist

I tried to restart mysqld. This resulted in:

08:12:21 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=805306368
read_buffer_size=131072
max_used_connections=0
max_threads=3000
thread_count=2
connection_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4282510 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x236cca0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7ffdfc7324e0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf74adb]
/usr/sbin/mysqld(handle_fatal_signal+0x461)[0x82ba71]
/lib64/libpthread.so.0(+0xf6d0)[0x7f052afca6d0]
/usr/sbin/mysqld(_ZN17Cost_model_server4initEv+0x5a)[0xc9c42a]
/usr/sbin/mysqld(_Z9lex_startP3THD+0x2d)[0xd223ad]
/usr/sbin/mysqld(_ZN21Execute_sql_statement19execute_server_codeEP3THD+0xe2)[0xd6a212]
/usr/sbin/mysqld(_ZN18Prepared_statement23execute_server_runnableEP15Server_runnable+0x1cd)[0xd6c43d]
/usr/sbin/mysqld(_ZN13Ed_connection14execute_directEP15Server_runnable+0xb4)[0xd6d604]
/usr/sbin/mysqld(_ZN13Ed_connection14execute_directE19st_mysql_lex_string+0x34)[0xd6d704]
/usr/sbin/mysqld(_ZN20Ndb_local_connection13execute_queryE19st_mysql_lex_stringPKjPK10Suppressor+0x5d)[0x132fccd]
/usr/sbin/mysqld(_ZN20Ndb_local_connection17execute_query_isoE19st_mysql_lex_stringPKjPK10Suppressor+0x66)[0x132fe86]
/usr/sbin/mysqld(_ZN20Ndb_local_connection11delete_rowsEPKcmS1_mbz+0x198)[0x13302c8]
/usr/sbin/mysqld[0x131501e]
/usr/sbin/mysqld[0x8739ac]
/usr/sbin/mysqld(_Z26ha_binlog_index_purge_fileP3THDPKc+0x33)[0x87d8d3]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG17purge_index_entryEP3THDPyb+0x359)[0xf04cb9]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG10purge_logsEPKcbbbPyb+0x368)[0xf0d838]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG22purge_logs_before_dateElb+0x498)[0xf0e3e8]
/usr/sbin/mysqld[0x8250a5]
/usr/sbin/mysqld(_Z11mysqld_mainiPPc+0x874)[0x825e44]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f052999f445]
/usr/sbin/mysqld[0x81ba35]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (9cdfee18): DELETE FROM mysql.ndb_binlog_index WHERE File='../log/binlog.000792'
Connection ID (thread ID): 0
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

The same backtrace demangled with c++filt:
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf74adb]
/usr/sbin/mysqld(handle_fatal_signal+0x461)[0x82ba71]
/lib64/libpthread.so.0(+0xf6d0)[0x7f052afca6d0]
/usr/sbin/mysqld(Cost_model_server::init()+0x5a)[0xc9c42a]
/usr/sbin/mysqld(lex_start(THD*)+0x2d)[0xd223ad]
/usr/sbin/mysqld(Execute_sql_statement::execute_server_code(THD*)+0xe2)[0xd6a212]
/usr/sbin/mysqld(Prepared_statement::execute_server_runnable(Server_runnable*)+0x1cd)[0xd6c43d]
/usr/sbin/mysqld(Ed_connection::execute_direct(Server_runnable*)+0xb4)[0xd6d604]
/usr/sbin/mysqld(Ed_connection::execute_direct(st_mysql_lex_string)+0x34)[0xd6d704]
/usr/sbin/mysqld(Ndb_local_connection::execute_query(st_mysql_lex_string, unsigned int const*, Suppressor const*)+0x5d)[0x132fccd]
/usr/sbin/mysqld(Ndb_local_connection::execute_query_iso(st_mysql_lex_string, unsigned int const*, Suppressor const*)+0x66)[0x132fe86]
/usr/sbin/mysqld(Ndb_local_connection::delete_rows(char const*, unsigned long, char const*, unsigned long, bool, ...)+0x198)[0x13302c8]
/usr/sbin/mysqld[0x131501e]
/usr/sbin/mysqld[0x8739ac]
/usr/sbin/mysqld(ha_binlog_index_purge_file(THD*, char const*)+0x33)[0x87d8d3]
/usr/sbin/mysqld(MYSQL_BIN_LOG::purge_index_entry(THD*, unsigned long long*, bool)+0x359)[0xf04cb9]
/usr/sbin/mysqld(MYSQL_BIN_LOG::purge_logs(char const*, bool, bool, bool, unsigned long long*, bool)+0x368)[0xf0d838]
/usr/sbin/mysqld(MYSQL_BIN_LOG::purge_logs_before_date(long, bool)+0x498)[0xf0e3e8]
/usr/sbin/mysqld[0x8250a5]
/usr/sbin/mysqld(mysqld_main(int, char**)+0x874)[0x825e44]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f052999f445]
/usr/sbin/mysqld[0x81ba35]
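
Read bottom-up, the demangled trace suggests the crash happened during server startup, while old binary logs were being purged and the NDB handler tried to delete the matching rows from mysql.ndb_binlog_index (consistent with the DELETE query shown in the crash log). The Python sketch below shows one way to split such frames into module, symbol, offset and address for easier reading. `FRAME_RE` and `parse_frame` are hypothetical helper names, not part of any MySQL tooling, and the pattern only targets the frame layout seen in this report.

```python
import re

# Matches frames such as
#   /usr/sbin/mysqld(lex_start(THD*)+0x2d)[0xd223ad]
#   /usr/sbin/mysqld[0x131501e]
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"      # binary or shared-library path
    r"(?:\((?P<body>.*)\))?"       # optional "symbol+offset" in parentheses
    r"\[(?P<addr>0x[0-9a-f]+)\]$"  # return address in brackets
)

def parse_frame(line):
    """Split one backtrace line into its parts; return None if it is not a frame."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    body = m.group("body")
    symbol = offset = None
    if body and "+" in body:
        # The offset follows the last "+"; the symbol may be empty
        # (for example "(+0xf6d0)" in the libpthread frame).
        symbol, _, offset = body.rpartition("+")
        symbol = symbol or None
    return {
        "module": m.group("module"),
        "symbol": symbol,
        "offset": offset,
        "addr": m.group("addr"),
    }

# A few frames copied from the demangled trace above.
sample = [
    "/usr/sbin/mysqld(Cost_model_server::init()+0x5a)[0xc9c42a]",
    "/usr/sbin/mysqld(lex_start(THD*)+0x2d)[0xd223ad]",
    "/usr/sbin/mysqld[0x131501e]",
    "/usr/sbin/mysqld(ha_binlog_index_purge_file(THD*, char const*)+0x33)[0x87d8d3]",
]
parsed = [parse_frame(line) for line in sample]
for frame in parsed:
    print(frame["symbol"] or "<no symbol>", frame["addr"])
```

Frames without a symbol (such as 0x131501e) would still need to be resolved against the matching mysqld binary and its debug symbols.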

How to repeat:
Unknown

Hi,
Thanks for the report. I need a bit more data to find out what's going on.

If I understand correctly, you have two NDB Cluster 7.6.4 setups replicating from one to the other. You executed the ALTER TABLE on the master, where it completed fine and was written to the binlog; it was then executed on the slave, and that is where it is stuck. Is that correct?

Can you please get me the ndb_error_reporter output from both master and slave, so we can look at the full logs and not only the mysqld crash output?

How is replication set up between the master and the slave cluster?

Thanks
Bogdan

Replication is from an InnoDB instance running 5.7.21 to the cluster running 7.6.4.

Replication is in minimal-RBR format, but since this is an ALTER it will be sent as a query.

Hi Daniël,

Some questions from our dev team:

- How large was the table?
- Was the ALTER only a column rename, with the type unchanged? (Can we have the exact ALTER statement?)
- In this release a column rename becomes a copying ALTER, and the copy may still have been in progress when the slave was stopped. How long was the slave stuck before it was stopped?

Thanks
Bogdan

Hi Daniël,

I can't reproduce this, but let's see if the dev team has enough data to figure out what happened and how to prevent it from happening again.

Thanks
Bogdan

I can reproduce this immediately upon starting the API node, about half the time, on 7.5.10.