Bug #86360 Error 2306 'Pointer too large' on ndbmtd start after a full cluster crash
Submitted: 17 May 2017 15:34 Modified: 22 May 2017 21:01
Reporter: Andrew Blackmore
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: 7.4.15 OS: Ubuntu (16.04)
Assigned to: Bogdan Kecman CPU Architecture: Any

[17 May 2017 15:34] Andrew Blackmore
Description:
I have a cluster with 4 data nodes that experienced a full crash in which all nodes failed simultaneously. When attempting to bring the cluster back online, I receive this error message:

Forced node shutdown completed. Caused by error 2306: 'Pointer too large (Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

I have attempted to restart multiple times with different partial restarts, but both data nodes of this particular node group produce this error message, so I cannot get the cluster back online.

How to repeat:
I am not sure how to get into this state, but once it is reached it is consistently repeatable: I have tried restarting many times.
[18 May 2017 4:16] Bogdan Kecman
Hi Andrew, 

I cannot reproduce this issue, and the error log you provided does not contain enough data. The error you are getting (pointer too large in dbdih) is not helpful on its own, since a lot of different issues that are not bugs can lead to it.

kind regards
bogdan
[18 May 2017 15:30] Andrew Blackmore
I believe that the initial error that actually triggered the chain of events is in that same error log, just a little further up:

Time: Wednesday 17 May 2017 - 02:11:03
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Unhandled sections in sendSignal for GSN 33 (KEYINFO20).
Error object: 
Program: ndbmtd
Pid: 2884 thr: 1
Version: mysql-5.6.36 ndb-7.4.15
Trace: /usr/local/mysql/data/ndb_2_trace.log.9 [t1..t11]
***EOM***

This happened at the time the entire cluster crashed, and I believe it caused the rest of the issues.
[18 May 2017 19:33] Bogdan Kecman
Hi,

It is possible, but a send signal error is usually an effect and not the cause of the crash. What really caused the crash I can't say from the info I have, and I doubt it's a bug.

Normally a partial cluster start and an initial start of the remaining nodes, followed by an initial start of the other nodes, clears everything up, but you were unlucky enough to have 2 nodes from the same group fail. In that case restoring from backup is the fastest/safest (and often only) possibility.

Now this is entering the domain of support, so I suggest you contact the Oracle Support team. They can:
 - figure out why this happened to you and how to prevent it from happening again
 - get the system up and running with the least amount of downtime

Without ndb_2_trace.log.9 [t1..t11] we can't see how the crash happened, but often even with those logs we might be in the same situation.

best regards
bogdan
[18 May 2017 19:41] Andrew Blackmore
I have added the log files you mentioned so that you can look at them. I have already moved on to a less volatile solution going forward.
[18 May 2017 19:52] Bogdan Kecman
Thanks for uploading trace.log.9. We'll see if there's anything there that shows how to reproduce the problem.

all best
Bogdan
[19 May 2017 8:00] Mauritz Sundell
Regarding Unhandled sections in sendSignal for GSN 33 (KEYINFO20).

Do you have tables with large primary or unique keys?
Typically these would be CHAR or VARCHAR keys.
[22 May 2017 15:56] Andrew Blackmore
Yes, there is a table with a unique key that is VARCHAR(16).
[22 May 2017 20:50] Bogdan Kecman
Hi Andrew, 

Any chance you can give us the table structure? You can mangle the column names; just leave the types. And can you tell us the count(*) for that table?

thanks
Bogdan