Bug #98549 Node1 Crash on cluster mysql NDB
Submitted: 11 Feb 2020 8:49 Modified: 24 Apr 2020 16:32
Reporter: Damien Larrieu Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.7.21 OS:CentOS (CentOS 7)
Assigned to: MySQL Verification Team CPU Architecture:x86

[11 Feb 2020 8:49] Damien Larrieu
Description:
Hi,

I had a crash of the node1 on a cluster NDB 5.7.21.
I didn't find the root cause. I suppose a bug but i'm not sure and i would like your help.
Here the log for Node 1. I haven't find anything on others nodes at the same time.

2020-01-29T07:59:43.488838Z 4302827 [Note] Aborted connection 4302827 to db: 'mysql' user: 'agent_user' host: 'localhost' (Got an error writing communication packets)
NdbReceiver::handle_rec_attrs: attrId: 12341 currRecAttr: (nil) rec_attr_list: 0x7f96cc2d1b40 attrSize: 12336 0
27
08:00:17 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=26
max_threads=151
thread_count=22
connection_count=20
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68193 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f96cc555860
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f972c793e30 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf91d8b]
/usr/sbin/mysqld(handle_fatal_signal+0x461)[0x840301]
/lib64/libpthread.so.0(+0xf680)[0x7f9764002680]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f96cc341a30): select trxitems0_.TRX_REC_ITEM as col_0_0_, sum(trxitems0_.TRX_SETTELMENT_AMOUNT) as col_1_0_, trxitems0_.TRX_SETTELMENT_CURRENCY as col_2_0_ from BOBOM.TRX_ITEMS trxitems0_ where (trxitems0_.TRX_BUSDATE between '2020-01-29 00:00:00' and '2020-01-29 00:00:00' or '2020-01-29 00:00:00' is null or '2020-01-29 00:00:00' is null) and trim(upper(trxitems0_.TRX_AMOUNT_SIGN))='CR' and (trim(upper(trxitems0_.TRX_REC_ITEM))='00518' or '00518' is null) group by trxitems0_.TRX_REC_ITEM , trxitems0_.TRX_SETTELMENT_CURRENCY limit 1000
Connection ID (thread ID): 2327130
Status: NOT_KILLED

2020-01-29T08:00:24.032727Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2020-01-29T08:00:24.135090Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2020-01-29T08:00:24.156413Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 1074521038
2020-01-29T08:00:24.156437Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 1074521329
2020-01-29T08:00:24.166207Z 0 [Note] InnoDB: Database was not shutdown normally!
2020-01-29T08:00:24.166226Z 0 [Note] InnoDB: Starting crash recovery.
2020-01-29T08:00:24.305340Z 0 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 95 96 97 98 99
2020-01-29T08:00:24.806129Z 0 [Note] InnoDB: Apply batch completed
2020-01-29T08:00:24.806171Z 0 [Note] InnoDB: Last MySQL binlog file position 0 751459687, file name binlogs-63.000134
2020-01-29T08:00:25.257331Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2020-01-29T08:00:25.257355Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2020-01-29T08:00:25.257700Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2020-01-29T08:00:25.308390Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2020-01-29T08:00:25.405989Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2020-01-29T08:00:25.406009Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2020-01-29T08:00:25.406616Z 0 [Note] InnoDB: Waiting for purge to start
2020-01-29T08:00:25.456768Z 0 [Note] InnoDB: 5.7.21 started; log sequence number 1074521329
2020-01-29T08:00:25.457314Z 0 [Note] InnoDB: Loading buffer pool(s) from /mysql/mysql_cluster/mysqld/ib_buffer_pool
2020-01-29T08:00:25.457509Z 0 [Note] Plugin 'FEDERATED' is disabled.
2020-01-29T08:00:25.457758Z 0 [ERROR] Can't open shared library '/usr/lib64/mysql/plugin/audit_trial.so' (errno: 0 /usr/lib64/mysql/plugin/audit_trial.so: cannot open shared object file: No such file or directory)
2020-01-29T08:00:25.674188Z 0 [Note] InnoDB: Buffer pool(s) load completed at 200129 12:00:25
2020-01-29T08:00:25.909448Z 0 [Note] NDB: NodeID is 63, management server 'PS-PR-SRL-DB-03:1186'
2020-01-29T08:00:26.309997Z 0 [Note] NDB[0]: NodeID: 63, all storage nodes connected
2020-01-29T08:00:26.310156Z 0 [Note] NDB Binlog: Starting...
2020-01-29T08:00:26.310281Z 0 [Note] NDB Util: Starting...
2020-01-29T08:00:26.310452Z 1 [Note] NDB Binlog: Started
2020-01-29T08:00:26.310463Z 1 [Note] NDB Binlog: Setting up
2020-01-29T08:00:26.310612Z 0 [Note] NDB Index Stat: Starting...
2020-01-29T08:00:26.310633Z 0 [Note] NDB Index Stat: Wait for server start completed
2020-01-29T08:00:26.310981Z 2 [Note] NDB Util: Wait for server start completed
2020-01-29T08:00:26.311261Z 1 [Note] NDB Binlog: Created schema Ndb object, reference: 0x8004003f, name: 'Ndb Binlog schema change monitoring'
2020-01-29T08:00:26.311315Z 1 [Note] NDB Binlog: Created injector Ndb object, reference: 0x8005003f, name: 'Ndb Binlog data change monitoring'
2020-01-29T08:00:26.311322Z 1 [Note] NDB Binlog: Setup completed
2020-01-29T08:00:26.311326Z 1 [Note] NDB Binlog: Wait for server start completed

2020-01-29T08:00:26.407364Z 0 [ERROR] unknown variable 'SET GLOBAL net_read_timeout=600;'
2020-01-29T08:00:26.407381Z 0 [ERROR] Aborting

How to repeat:
I don't know how to repeat this crash.No particular processing at this time
[12 Feb 2020 23:55] MySQL Verification Team
Hi,

Thank you for your report but unfortunately, this does not give us much to go on.

Can you upload the full report generated by ndb_error_reporter?
Can you let us know if there's anything out of the ordinary, or new, that you recently did?

kind regards
[13 Feb 2020 9:17] Damien Larrieu
OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)
DB: Server version: 5.7.21-ndb-7.5.9-cluster-commercial-advanced-log MySQL Cluster Server - Advanced Edition (Commercial)
[6 Mar 2020 13:22] Damien Larrieu
first part of log archive

Attachment: ndb_error_report_20200228101159.7z.001 (application/octet-stream, text), 2.86 MiB.

[6 Mar 2020 13:23] Damien Larrieu
second part of log archive

Attachment: ndb_error_report_20200228101159.7z.002 (application/octet-stream, text), 2.86 MiB.

[6 Mar 2020 13:25] Damien Larrieu
third part of log archive

Attachment: ndb_error_report_20200228101159.7z.003 (application/octet-stream, text), 2.86 MiB.

[6 Mar 2020 13:27] Damien Larrieu
last part of log archive

Attachment: ndb_error_report_20200228101159.7z.004 (application/octet-stream, text), 2.10 MiB.

[6 Mar 2020 13:28] Damien Larrieu
Hi,

My customer has given me logs from all nodes that you will get in 4 archives cut with 7zip.

Best regards, Damien
[6 Mar 2020 16:14] MySQL Verification Team
Hi Damien,

Thanks for the logs. I will start the analysis.

> Advanced Edition (Commercial)

You are using a commercial version of the MCCGE, please be sure to open SR at MOS and link this bug number to that SR.

Thanks
Bogdan
[24 Mar 2020 16:32] MySQL Verification Team
Hi,

> I had a crash of the node1

There is no NODE1 crash recorded in the logs you provided.
There are no recorded issues with the cluster at all in 2020.

The only crash was the SQL node, and this I can't reproduce nor deduce what really happened from the log. The "(Got an error writing communication packets)" could be 
 - network (or any other hardware) error
 - the high load so cluster was unresponsive (I cannot say this is a case as I don't see any high load at that time on the cluster)

I don't see the management nodes for this event (you only have here 26-28 February, nothing for January) and in the data node logs, I don't see anything peculiar. 

It does look like you were running a backup and that you stopped some nodes (or whole cluster) that day but I don't see no crashes on the cluster side and can't say why SQL node crashed.

kind regards
Bogdan
[25 Apr 2020 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".