Bug #45096 Forced node shutdown upon restart with full data area
Submitted: 26 May 2009 13:03 Modified: 20 Jan 2016 10:27
Reporter: Guido Ostkamp Email Updates:
Status: Closed    Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0    OS: Solaris
Assigned to: MySQL Verification Team CPU Architecture:Any

[26 May 2009 13:03] Guido Ostkamp
Description:
Hello,

during stability tests we are facing the following critical startup failure with a full data area:

*****************************************************************************
...
memmanagerlock waiting for lock, contentions: 2400 spins: 49473860
jbalock waiting for lock, contentions: 200 spins: 176660
2009-05-26 14:45:53 [ndbd] INFO     -- Killed by node 3 as copyfrag failed, error: 827
2009-05-26 14:45:53 [ndbd] INFO     -- NDBCNTR (Line: 260) 0x0000000a
2009-05-26 14:45:53 [ndbd] INFO     -- Error handler startup shutting down system
2009-05-26 14:45:53 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-05-26 14:45:53 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2009-05-26 14:45:57 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error
 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, rest
*****************************************************************************

Interestingly, the error message at the end is truncated in the log file.

This situation recurs on each startup (we tried twice).

We are using bzr revid jonas@mysql.com-20090524191743-7gc0xl8kmnpmluux dated Sun 2009-05-24 21:17:43 +0200 compiled on Solaris Sparc with
CC=cc CXX=CC CFLAGS="-xO5 -fast -mt -m64 -xbinopt=prepare" CXXFLAGS="-xO5 -fast -mt -m64 -xbinopt=prepare" LDFLAGS="-xbinopt=prepare" ./configure --prefix=/export/home/wsch/6.4_2009_01_29 --with-plugins=max --without-docs --without-man

Full logs will be uploaded shortly.

How to repeat:
1. Create records in DB tables until DB is full on first node (nodeid=2)
2. Put some background 'insert' load on the nodes
3. kill -9 repeatedly on second node (nodeid=3)
4. Problem occurs
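Assuming a standard two-data-node NDB test setup, the steps above can be sketched roughly as follows. The database, table, and column names are placeholders (they do not appear in the original report), and the exact process-matching pattern for the kill depends on how ndbd was started:

```shell
# 1. Fill the cluster until DataMemory on the first data node (nodeid=2)
#    is exhausted; inserts start failing with NDB error 827.
#    test.t1 and its 'payload' column are illustrative names.
while mysql -e "INSERT INTO test.t1 (payload) VALUES (REPEAT('x', 1000))"; do :; done

# 2. Keep some background insert load running (expected to fail while full).
while :; do
    mysql -e "INSERT INTO test.t1 (payload) VALUES (REPEAT('x', 1000))"
    sleep 1
done &

# 3. Repeatedly kill -9 the second data node (nodeid=3); the angel process
#    restarts it, and the node restart then dies in start phase 5 with
#    "copyfrag failed, error: 827".
pkill -9 -f 'ndbd.*ndb-nodeid=3'
```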
[26 May 2009 13:05] Guido Ostkamp
Full logs uploaded to FTP server, file 'bug-data-45096.tar.bz2'.
[26 May 2009 13:06] Guido Ostkamp
Just for your info:

We tried an additional startup with '--initial', but it failed as well.
[1 Jun 2009 12:58] Henrik Ingo
Hi Guido

Thank you for this report. Just to confirm that I've understood correctly:

Question 1:

"2. Put some background 'insert' load on the nodes"
Just wanted to know what happens here? Do the inserts fail because the DB is full?

Question 2:

In the end the node doesn't start even when there is no insert load anymore? And this is both with and without --initial?

Question 3:

Since 1 node is still alive and should contain all data, does it help if you:
 - delete some data first
  - (alternatively if this was unacceptable, you could also restart the node with more DataMemory allocated)
 - then let the other node join the cluster
[3 Jun 2009 14:17] Guido Ostkamp
Hello Henrik,

here are the answers to your questions:

Question 1:

The inserts fail (as expected).

Question 2:

I retried the tests. After stopping the load, it is still not possible to restart the second node (forced node shutdown occurs).

I then stopped the first node as well and restarted it (which was possible). After it came up, restarting the second node was still unsuccessful. While the restart of the second node was running, we entered a strange state in which it was no longer possible to use the mysql shell at all, even on the first node (after a "use <dbname>" command, the mysql shell hung).

Question 3:

Deleting might be unacceptable (in the case of customer data, we are not allowed to do that), but it was also technically impossible: a trigger fired on the delete attempt, and the update caused by the trigger could not be executed due to the lack of space.

I expect restart with changed configs to work, but this is not a feasible solution.

Regards

Guido Ostkamp
[20 Jan 2016 10:27] MySQL Verification Team
Hi,

Testing with the latest 7.4.9, I can reach the described situation where the second node won't start after the database is full, but contrary to what you encountered with 7.0, you can:
 - shut down the mgm node(s)
 - change your config, increasing DataMemory
 - start the mgm node(s)
 - start the second node with --initial
 - when the second node is fully started, restart the first data node
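The recovery steps above can be sketched as a command sequence. The node IDs (1 for the management node, 2 and 3 for the data nodes), the config path, and the DataMemory value are illustrative assumptions matching the two-data-node setup in the report, not values taken from it:

```shell
# Stop the management node (assumed nodeid=1); data nodes keep running.
ndb_mgm -e "1 STOP"

# In config.ini, raise DataMemory for the data nodes, e.g.:
#   [ndbd default]
#   DataMemory = 2G     # illustrative value, larger than before

# Restart the management node so it rereads the changed config.
ndb_mgmd -f /path/to/config.ini --reload

# Start the second data node from scratch; with --initial it wipes its
# local data area and rebuilds from the surviving node.
ndbd --ndb-nodeid=3 --initial

# Once node 3 shows "Started" in the ndb_mgm SHOW output, restart the
# first data node so it also picks up the larger DataMemory.
ndb_mgm -e "2 RESTART"
```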

I tried this twice on 7.4.9 and it worked both times without a hitch.

kind regards
Bogdan Kecman