MySQL Bugs: #40724: Entire Cluster Crashes When A LG With Large Undo Log >150M Is Created

Bug #40724	Entire Cluster Crashes When A LG With Large Undo Log >150M Is Created
Submitted:	14 Nov 2008 2:08	Modified:	19 Feb 2009 10:48
Reporter:	Mikiya Okuno
Status:	Duplicate
Category:	MySQL Cluster: Disk Data	Severity:	S1 (Critical)
Version:	6.3.18	OS:	Linux (Ubuntu 8.10)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
Creating a logfile group with a large undo buffer (more than 150M) crashes the entire cluster, and the cluster cannot be brought back online after the crash.

How to repeat:
I can create a logfile group with a 150M undo buffer.

mysql> CREATE LOGFILE GROUP lg_1 ADD UNDOFILE 'undo_2.dat' INITIAL_SIZE 256M UNDO_BUFFER_SIZE 150M ENGINE NDB;
Query OK, 0 rows affected (18.14 sec) 

mysql> DROP LOGFILE GROUP lg_1 ENGINE NDB;
Query OK, 0 rows affected (0.68 sec)    

But I cannot create it with a 151M undo buffer.

mysql> CREATE LOGFILE GROUP lg_1 ADD UNDOFILE 'undo_2.dat' INITIAL_SIZE 256M UNDO_BUFFER_SIZE 151M ENGINE NDB;
ERROR 1528 (HY000): Failed to create LOGFILE GROUP

Then the following messages appear in the management client and the cluster goes down.

ndb_mgm> Node 11: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.                                                                                         
Node 12: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. 

After that, I cannot start the cluster again. The following messages are displayed when a start is attempted:

ndb_mgm> ALL START
Node 12: Start initiated (version 6.3.18)
Node 11: Start initiated (version 6.3.18)
ndb_mgm> Node 11: Forced node shutdown completed. Occured during startphase 4. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 12: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

To start the cluster again, an --initial restart is required.

Suggested fix:
n/a

I ran the test with the following settings.

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=1024M
IndexMemory=64M 
LockPagesInMainMemory=1

MaxNoOfTables=256
MaxNoOfOrderedIndexes=512
MaxNoOfUniqueHashIndexes=256
MaxNoOfAttributes=8192      
MaxNoOfConcurrentOperations=250000
FragmentLogFileSize=64M           
NoOfFragmentLogFiles=4            
RedoBuffer=64M                    
ODirect=1                         

### Disk data related 
DiskPageBufferMemory=2000M
SharedGlobalMemory=1500M
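
For reference, the sizes quoted in this report can be converted to bytes with a small script. This is only a sketch and not part of the original test: `to_bytes` is a hypothetical helper, and it assumes the K/M/G suffixes mean powers of 1024. It does not explain the crash; it only shows how the passing and failing UNDO_BUFFER_SIZE values compare with each other and with the configured SharedGlobalMemory.

```python
# Hypothetical helper for reading the sizes in this report; not part of MySQL or NDB.
# Assumption: the K, M and G suffixes are multiples of 1024.
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(size: str) -> int:
    """Convert a size string such as '150M' to a number of bytes."""
    size = size.strip().upper()
    if size and size[-1] in SUFFIXES:
        return int(size[:-1]) * SUFFIXES[size[-1]]
    return int(size)


if __name__ == "__main__":
    # Values taken from the statements and settings quoted in the report.
    report_sizes = [
        ("UNDO_BUFFER_SIZE (succeeds)", "150M"),
        ("UNDO_BUFFER_SIZE (fails)", "151M"),
        ("INITIAL_SIZE", "256M"),
        ("SharedGlobalMemory", "1500M"),
    ]
    for label, value in report_sizes:
        print(f"{label:<30}{value:>6} = {to_bytes(value):>13,} bytes")
```

Under that assumption, the failing value is exactly 1M (1,048,576 bytes) larger than the last value that worked, and both are well below the 1500M configured for SharedGlobalMemory.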

duplicate of bug#34102 (which I just fixed)