MySQL Bugs: #114777: lost connection in MySQL NDB cluster

Bug #114777	lost connection in MySQL NDB cluster
Submitted:	25 Apr 2024 7:09	Modified:	8 May 2024 6:26
Reporter:	CunDi Fang	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	8.0.35-cluster MySQL Cluster Community S	OS:	Any (20.04)
Assigned to:		CPU Architecture:	Any

Description:
Hello, I found a bug in 8.0.35-cluster version of MYSQL cluster. It will cause the current mysql service to crash.

The detail is as follow.

OS version and name:
Ubuntu 22.04.3 LTS (Jammy Jellyfish)
Linux eb1f47b08982 6.5.11-8-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-8 (2024-01-30T12:27Z) x86_64 x86_64 x86_64 GNU/Linux

I start by executing the alter statement on a node:

Poc 1:
```
ALTER TABLE mytest90.test5 ALTER COLUMN column2 SET DEFAULT 479;
```

While this statement is in the process of executing, note that it is in the process of executing, not at the end or not yet started, the following delete statement is executed on another node.

Poc 2:
```
delete from mytest90.test5
where
EXISTS (
  select
      subq_0.c0 as c0,
      subq_0.c1 as c1,
      subq_0.c1 as c2,
      mytest90.test5.column1 as c3,
      case when (ref_0.column6 is NULL)
          and (EXISTS (
            select
                ref_2.column1 as c0,
                mytest90.test5.column2 as c1,
                ref_0.column2 as c2,
                ref_2.column1 as c3,
                ref_2.column2 as c4
              from
                mytest90.test1 as ref_2
              where false
              limit 37)) then mytest90.test5.column2 else mytest90.test5.column2 end
         as c4
    from
      mytest90.test3 as ref_0
        right join (select
              ref_1.column4 as c0,
              mytest90.test5.column2 as c1
            from
              mytest90.test4 as ref_1
            where ref_1.column5 is not NULL
            limit 54) as subq_0        on (false)
    where 9 is not NULL
limit 60);
```

This time it was found that the mysql service of the node that tried to execute the delete statement crashed:

failed:Lost connection to MySQL server during query

How to repeat:
Importing the database file I give later, and then executing Poc

Suggested fix:
It should be the process of the alter statement, a series of locking operations, resulting in the delete statement can not be executed properly, but why it will cause the mysql service crash, this I have no clue at the moment.

Architecture Information:
'''
[NDBD DEFAULT]
NoOfReplicas =2
DataMemory = 512M
IndexMemory = 64M

[NDB_MGMD]
NodeId=1
hostname =192.172.10.8
datadir =/var/lib/mysql-cluster

[NDBD]
NodeId =2
hostname =192.172.10.9
datadir =/usr/local/mysql-cluster/data
NodeGroup=0
[NDBD]
NodeId =3
hostname =192.172.10.10
datadir =/usr/local/mysql-cluster/data
NodeGroup=1
[NDBD]
NodeId =4
hostname =192.172.10.11
datadir =/usr/local/mysql-cluster/data
NodeGroup=0
[NDBD]
NodeId =5
hostname =192.172.10.12
datadir =/usr/local/mysql-cluster/data
NodeGroup=1

[mysqld]
NodeId =6
hostname =192.172.10.9
[mysqld]
NodeId =7
hostname =192.172.10.10
[mysqld]
NodeId =8
hostname =192.172.10.11
[mysqld]
NodeId =9
hostname =192.172.10.12
'''

Attempted and successfully reproduced!

Please upload ndb_error_report output

Okay, so despite some effort, I did manage to surface the bug in my environment and packaged up the log file, which will be uploaded later. In the meantime, I have an updated understanding of the bug.

Instead of just 2 nodes, he needs the synchronized synergy of 4 nodes.

Poc in node1:
```
ALTER TABLE mytest90.test5 ALTER COLUMN column2 SET DEFAULT 479;
```
result:
```
SQL executed successfully.
start time: 1714479495.92 end time:1714479495.426890
```

Poc in node2:
```
delete from mytest90.test5
where
EXISTS (
  select
      subq_0.c0 as c0,
      subq_0.c1 as c1,
      subq_0.c1 as c2,
      mytest90.test5.column1 as c3,
      case when (ref_0.column6 is NULL)
          and (EXISTS (
            select
                ref_2.column1 as c0,
                mytest90.test5.column2 as c1,
                ref_0.column2 as c2,
                ref_2.column1 as c3,
                ref_2.column2 as c4
              from
                mytest90.test1 as ref_2
              where false
              limit 37)) then mytest90.test5.column2 else mytest90.test5.column2 end
         as c4
    from
      mytest90.test3 as ref_0
        right join (select
              ref_1.column4 as c0,
              mytest90.test5.column2 as c1
            from
              mytest90.test4 as ref_1
            where ref_1.column5 is not NULL
            limit 54) as subq_0        on (false)
    where 9 is not NULL
limit 60);
```
result:
```
MySQL query failed: Lost connection to MySQL server during query
start time: 1714479495.18168 end time:1714479495.194724
```

Poc in node3:
```
ALTER TABLE mytest90.test5 RENAME COLUMN column2 TO column64;
```
result:
```
SQL executed successfully.
start time: 1714479495.106 end time:1714479495.775025
```

Poc in node4:
```
insert into mytest90.test5 values (
cast(nullif(cast(coalesce(cast(null as float),
    cast(null as float)) as float),
  cast(null as float)) as float),
26);
```
result:
```
SQL executed successfully.
start time: 1714479495.11671 end time:1714479495.16158
```

The bug is indeed reproducible, but it is very demanding in terms of time points and sequences.

time line of this bug

Attachment: Figure_114777.png (image/png, text), 31.15 KiB.

There are no log files in this archive.
Check if you have log files.
If you delete log files while server is already running it will not work, you need to shutdown all nodes, delete logs and then start cluster.

No, this is not a complete log with crash data.

I'm pretty sure I can reproduce the bug consistently with my cluster configuration and databases, plus at certain points in time in a certain chronological order, and then this zip is the one that I get from the "db_error_reporter /var/lib/mysqcluster/config.ini root " command.

Be sure to note that sql statements are executed on the corresponding nodes in a specific order and at a specific point in time.

No crash in the logs that I can see so cluster did not crash, can you upload logs from the SQL node you were connected to.

Thanks

Did you mean this one?
```
2024-05-08T05:42:49Z UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=1135254ad6e65d8565c7a1bb98c0a03b430d0f46
Thread pointer: 0x7f1068000fd0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f110c108b70 thread_stack 0x100000
/usr/local/mysql-cluster/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x2134dc1]
/usr/local/mysql-cluster/bin/mysqld(print_fatal_signal(int)+0x2a2) [0xff07a2]
/usr/local/mysql-cluster/bin/mysqld(handle_fatal_signal+0xa5) [0xff0955]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f11377ee520]
/usr/local/mysql-cluster/bin/mysqld(ndb_pushed_builder_ctx::make_pushed_join()+0xaa) [0x2640c0a]
/usr/local/mysql-cluster/bin/mysqld(ndbcluster_push_to_engine(THD*, AccessPath*, JOIN*)+0x264) [0x2623fb4]
/usr/local/mysql-cluster/bin/mysqld(JOIN::push_to_engines()+0x71) [0xe5ec01]
/usr/local/mysql-cluster/bin/mysqld(JOIN::optimize(bool)+0x1cf8) [0xe78978]
/usr/local/mysql-cluster/bin/mysqld(Query_block::optimize(THD*, bool)+0xc1) [0xedac91]
/usr/local/mysql-cluster/bin/mysqld(Query_expression::optimize(THD*, TABLE*, bool, bool)+0xad) [0xf5990d]
/usr/local/mysql-cluster/bin/mysqld(Item_subselect::exec(THD*)+0x285) [0x120de95]
/usr/local/mysql-cluster/bin/mysqld(Item_exists_subselect::val_bool()+0x38) [0x12095d8]
/usr/local/mysql-cluster/bin/mysqld(Item_exists_subselect::val_int()+0x11) [0x12052e1]
/usr/local/mysql-cluster/bin/mysqld(Sql_cmd_delete::delete_from_single_table(THD*)+0xdf1) [0x136aa81]
/usr/local/mysql-cluster/bin/mysqld(Sql_cmd_dml::execute(THD*)+0x1e4) [0xee4ef4]
/usr/local/mysql-cluster/bin/mysqld(mysql_execute_command(THD*, bool)+0xae7) [0xe80bc7]
/usr/local/mysql-cluster/bin/mysqld(dispatch_sql_command(THD*, Parser_state*)+0x51b) [0xe843bb]
/usr/local/mysql-cluster/bin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x2351) [0xe86d01]
/usr/local/mysql-cluster/bin/mysqld(do_command(THD*)+0x15b) [0xe8787b]
/usr/local/mysql-cluster/bin/mysqld() [0xfe09b8]
/usr/local/mysql-cluster/bin/mysqld() [0x2848944]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1137840ac3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f11378d1bf4]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f106800abd0): is an invalid pointer
Connection ID (thread ID): 19
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
2024-05-08T05:42:49.491645Z 0 [Warning] [MY-011070] [Server] 'Disabling symbolic links using --skip-symbolic-links (or equivalent) is the default. Consider not using this option as it' is deprecated and will be removed in a future release.
2024-05-08T05:42:49.491796Z 0 [System] [MY-010116] [Server] /usr/local/mysql-cluster/bin/mysqld (mysqld 8.0.35-cluster) starting as process 11357
2024-05-08T05:42:49.500615Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2024-05-08T05:42:49.784563Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2024-05-08T05:42:49.877916Z 0 [ERROR] [MY-010901] [Server] Can't open shared library '/usr/local/mysql-cluster/lib/plugin/' (errno: 22 /usr/local/mysql-cluster/lib/plugin/: cannot read file data: Is a directory).
2024-05-08T05:42:49.877950Z 0 [Warning] [MY-010736] [Server] Couldn't load plugin named '' with soname ''.
2024-05-08T05:42:49.879457Z 0 [System] [MY-010866] [NDB] Metadata: Initialization
```

Bug #114850 is marked as duplicate of this one