MySQL Bugs: #86360: Error 2306 'Pointer too large' on ndbmtd start after a full cluster crash

Bug #86360	Error 2306 'Pointer too large' on ndbmtd start after a full cluster crash
Submitted:	17 May 2017 15:34	Modified:	22 May 2017 21:01
Reporter:	Andrew Blackmore
Status:	Can't repeat
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	7.4.15	OS:	Ubuntu (16.04)
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
I have a cluster with 4 data nodes that experienced a full crash in which all nodes failed simultaneously. When I attempt to bring the cluster back online, I receive this error message:

Forced node shutdown completed. Caused by error 2306: 'Pointer too large (Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

I have attempted to restart multiple times with different partial restarts, but both data nodes of this particular node group produce this error message, so I cannot get the cluster back online.

How to repeat:
I am not sure how to get into this state, but once it is reached the error is consistently repeatable; I have tried restarting many times.

Hi Andrew, 

I cannot reproduce this issue, and the error log you provided does not contain enough data. The error you are getting (pointer too large in dbdih) is not helpful on its own, because a lot of different issues that are not bugs can lead to it.

kind regards
bogdan

I believe the initial error that actually triggered the chain of events is in that same error log, just a little further up:

Time: Wednesday 17 May 2017 - 02:11:03
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Unhandled sections in sendSignal for GSN 33 (KEYINFO20).
Error object: 
Program: ndbmtd
Pid: 2884 thr: 1
Version: mysql-5.6.36 ndb-7.4.15
Trace: /usr/local/mysql/data/ndb_2_trace.log.9 [t1..t11]
***EOM***

This happened at the time the entire cluster crashed, and I believe it caused the rest of the issues.
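
When an error log holds several entries, it can help to list them in order and see which error came first. The sketch below is only an illustration: it assumes every entry follows the "Key: value" layout of the single entry quoted above and ends with a `***EOM***` line, which this thread does not confirm for other entries. The function names `parse_ndb_error_entries` and `entries_with_code` are hypothetical and are not part of any MySQL tool.

```python
# Entry copied from the error log quoted above; used here as sample input.
SAMPLE = """\
Time: Wednesday 17 May 2017 - 02:11:03
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Unhandled sections in sendSignal for GSN 33 (KEYINFO20).
Error object:
Program: ndbmtd
Pid: 2884 thr: 1
Version: mysql-5.6.36 ndb-7.4.15
Trace: /usr/local/mysql/data/ndb_2_trace.log.9 [t1..t11]
***EOM***
"""


def parse_ndb_error_entries(text):
    """Split error-log text into a list of dicts, one per entry.

    Assumption: each entry is a run of "Key: value" lines closed by
    a line containing only ***EOM***.
    """
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "***EOM***":
            if current:
                entries.append(current)
            current = {}
            continue
        # Split on the first colon only, so values such as the
        # timestamp keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        # Keep a trailing entry that has no closing marker.
        entries.append(current)
    return entries


def entries_with_code(entries, code):
    """Return the entries whose Error field equals the given code."""
    return [e for e in entries if e.get("Error") == str(code)]


if __name__ == "__main__":
    for entry in parse_ndb_error_entries(SAMPLE):
        print(entry["Time"], entry["Error"], entry["Error data"])
```

Entries come back in file order, so the first element of the list is the earliest entry in the log, provided the log is written chronologically.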

Hi,

It is possible, but a send signal error is usually an effect and not the cause of a crash. What really caused the crash I can't say from the info I have, and I doubt it's a bug.

Normally a partial cluster start, then an initial start of the remaining nodes, followed by an initial start of the other nodes, clears everything up, but you were unlucky enough to have 2 nodes from the same node group fail. In that case restoring from backup is the fastest and safest (and often the only) option.

This is now entering the domain of support, so I suggest you contact the Oracle Support team. They can:
 - figure out why this happened to you and how to prevent it from happening again
 - get the system up and running with the least amount of downtime

Without ndb_2_trace.log.9 [t1..t11] we can't see how the crash happened, but often even with those logs we might be in the same situation.

best regards
bogdan

I have added the log files you mentioned so that you can look at them. I have already moved on to a less volatile solution going forward.

Thanks for uploading trace.log.9. We'll see if there's anything there that shows how to reproduce the problem.

all best
Bogdan

Regarding "Unhandled sections in sendSignal for GSN 33 (KEYINFO20)":

Do you have tables with large primary or unique keys?
Typically some char or varchar keys.

Yes, there is a table with a unique key that is VARCHAR(16).

Hi Andrew, 

Any chance you can give us the table structure? You can mangle the column names; just leave the types. And can you tell us the count(*) for that table?

thanks
Bogdan