MySQL Bugs: #46289: Error 2341 after reboot and then initial when changing redo logfile size

Bug #46289	Error 2341 after reboot and then initial when changing redo logfile size
Submitted:	18 Jul 2009 19:36	Modified:	27 Sep 2009 20:04
Reporter:	Clint Alexander	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1.34 ndb-7.0.6	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	2341, initial restart, Redo logfile

Description:

After performing an accidental non-initial restart of an NDBD node and then restarting with initial, we received Error: 2341.

----------------------------------
THE TASK 
With few custom configurations, we're attempting to import 2800 MyISAM tables into Cluster.

SHELL> mysqldump | sed "s/ENGINE=MYISAM/ENGINE=NDBCLUSTER/g" > dumpfile.sql

----------------------------------
THE CLUSTER SETUP

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=2    @192.168.0.1  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 0, Master)
id=3    @192.168.0.2  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 0)
id=4    @192.168.0.3  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1)
id=5    @192.168.0.4  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.168.0.100  (mysql-5.1.34 ndb-7.0.6)

[mysqld(API)]   4 node(s)
id=11   @192.168.0.11  (mysql-5.1.34 ndb-7.0.6)
id=12   @192.168.0.12  (mysql-5.1.34 ndb-7.0.6)
id=13   @192.168.0.13  (mysql-5.1.34 ndb-7.0.6)
id=14   @192.168.0.14  (mysql-5.1.34 ndb-7.0.6)

---- System Specs:
NDBD - Dual Quad Intel Xeon 3.1GHz | 16GB
MYSQLD - Dual Intel 2.66GHz | 8GB
MGMD - Dual P4 3.20GHz | 2GB

----------------------------------
CONFIG.INI

[TCP DEFAULT]
SendBufferMemory                = 2M
ReceiveBufferMemory             = 2M
[NDB_MGMD DEFAULT]
PortNumber                      = 1186
Datadir                         = /var/lib/mysql-cluster
[NDB_MGMD]
Id                              = 1
Hostname                        = 192.168.0.100
ArbitrationRank                 = 1
[NDBD DEFAULT]
NoOfReplicas                    = 2
Datadir                         = /var/lib/mysql-cluster
DataMemory                      = 11G 
IndexMemory                     = 2G  
StartPartialTimeout             = 0  
StartPartitionedTimeout         = 0  
StartFailureTimeout             = 0  
MaxNoOfTables                   = 20186  
LogLevelStartup                 = 15
LogLevelShutdown                = 15
LogLevelCheckpoint              = 8
LogLevelNodeRestart             = 15
MaxNoOfExecutionThreads         = 4   
[NDBD]
Id                              = 2
Hostname                        = 192.168.0.1
[NDBD]
Id                              = 3
Hostname                        = 192.168.0.2 
[NDBD]
Id                              = 4
Hostname                        = 192.168.0.3 
[NDBD]
Id                              = 5
Hostname                        = 192.168.0.4 
[MYSQLD DEFAULT]
BatchSize                       = 512    
[MYSQLD]
Id                              = 11
Hostname                        = 192.168.0.11 
[MYSQLD]
Id                              = 12
Hostname                        = 192.168.0.12 
[MYSQLD]
Id                              = 13
Hostname                        = 192.168.0.13 
[MYSQLD]
Id                              = 14
Hostname                        = 192.168.0.14 

----------------------------------
THE STEPS TAKEN

ndb_mgm> 2 restart
Node 2: Node shutdown initiated
Node 2: Node shutdown completed, restarting, no start.
Node 2 is being restarted

<... waited a while ...>

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=2    @192.168.0.86  (mysql-5.1.34 ndb-7.0.6, not started)
...

ndb_mgm> 2 start
Database node 2 is being started.

ndb_mgm> Node 2: Start initiated (version 7.0.6)
Node 2: Data usage decreased to 0%(0 32K pages of total 360448)
Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2350: 'Invalid configuration received from Management Server(Configuration error). Permanent error, external action needed'.

[ Checked log on node 2 ...]
Time: Saturday 18 July 2009 - 02:32:42
Status: Permanent error, external action needed
Message: Invalid configuration received from Management Server (Configuration error)
Error: 2350
Error data: Invalid file size for redo logfile,  size only changable with --initial
Error object: DBLQH (Line: 13457) 0x0000000a
Program: /usr/bin/ndbmtd
Pid: 32677
Trace: /var/lib/mysql-cluster//ndb_2_trace.log.6 /var/lib/mysql-cluster//ndb_2_trace.log.6_t1 /var/lib/mysql-cluster//ndb_2_trace.log.6_t2 /var/lib/m

(... oops, forgot the --initial )
(performed initial start on ndbd server)

SHELL> service ndbd initial
(/usr/bin/ndbmtd --initial --ndb-nodeid=2 -c 192.168.0.100:1186)
Initializing ndbmtd:2009-07-18 03:17:56 [ndbd] INFO     -- Configuration fetched from '192.168.0.100:1186', generation: 1

(... waited and watched mgm console ...)

ndb_mgm> Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

(... checked log on NDBD server ...)

Time: Saturday 18 July 2009 - 02:45:29
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 3977) 0x0000000e
Program: /usr/bin/ndbmtd
Pid: 420
Trace: /var/lib/mysql-cluster//ndb_2_trace.log.8 /var/lib/mysql-cluster//ndb_2_trace.log.8_t1 /var/lib/mysql-cluster//ndb_2_trace.log.8_t2 /var/lib/mysql-cluster//ndb

How to repeat:
Described in description.

Suggested fix:
Unknown.

There is a reference to 192.168.0.86 which should be 192.168.0.1 (for NDBMTD01) -- addresses were changed for clarity and ease of reading but I obviously typo'd on that line.

Clint,

Just a correction in your description, you did actually a:
 ndb_mgmd> 2 restart -n

right?

Did you change something to the configuration around FragmentLogFileSize or the number of them?
I was not succesful to repeat with the steps given.

Cheers,
Geert

I can't repeat with the following steps, using ndbd or ndbmtd:

* Changing configuration, for example FragmentLogFileSize from 16M to 32M.
* Reload ndb_mgmd configuration
* ndb_mgm> 3 restart -n
* ndb_mgm> 3 start
* Failes because we need initial node restart with the above change
* Start ndbd --initial
* Node joines as normal.

Was this repeatable or was this a single incident?

Cheers,

Geert

There was a total of 2 restarts on the data node. I did -not- restart the node with -n or any other arguments; the first attempt was a normal 'restart' with no option, the second attempt I had to start on the node itself with --initial.

I copied and pasted from the terminal so what I described is exactly what I had seen/typed (except for the IP addresses of course).

Although I usually give a good walk-thru and some code to recreate the problem, this one happened once and I only had the time to create a bug report. 

I'm finishing up 5.1 CMCDBA certifications this week, so I will try to create an easy recreation script over the weekend and will post it here.

Stay tuned...

//Clint

Hi,
Last night I run into this problem. We had a maintenance (software upgrade on network stuffs). I made a fresh backup, didn't stop the cluster, I wanted to examine the recover. It is failed.
none of the 2 of ndbd node came online....

Time: Tuesday 4 August 2009 - 00:34:53
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming
 error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 2837) 0x0000000a
Program: /usr/sbin/ndbmtd
Pid: 6401
Trace: /var/lib/mysql-cluster-trr/ndb_3_trace.log.4 /var/lib/mysql-cluster-trr/n
db_3_trace.log.4_t1 /var/lib/mysql-cluster-trr/ndb_3_trace.log.4_t2 /var/lib/mys
ql-c
Time: Tuesday 4 August 2009 - 00:39:53
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) 
on other node(s) (Restart error)
Error: 2308
Error data: Node 4 disconnected
Error object: QMGR (Line: 2910) 0x0000000a
Program: /usr/sbin/ndbmtd
Pid: 6546
Trace: /var/lib/mysql-cluster-trr/ndb_3_trace.log.5 /var/lib/mysql-cluster-trr/n
db_3_trace.log.5_t1 /var/lib/mysql-cluster-trr/ndb_3_trace.log.5_t2 /var/lib/mys
ql-cluster-trr/ndb_3_trac

Time: Tuesday 4 August 2009 - 00:25:48
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate
 if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5300) 0x0000000a
Program: /usr/sbin/ndbmtd
Pid: 4304
Trace: /var/lib/mysql-cluster-trr/ndb_4_trace.log.3 /var/lib/mysql-cluster-trr/ndb_4_trace.log.3_t1 /var/
lib/mys
Time: Tuesday 4 August 2009 - 00:39:52
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error m
essage, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 2837) 0x0000000a
Program: /usr/sbin/ndbmtd
Pid: 13714
Trace: /var/lib/mysql-cluster-trr/ndb_4_trace.log.4 /var/lib/mysql-cluster-trr/ndb_4_trace.log.4_t1 /var/
lib/mysql-cluster-trr/ndb_4_trace.log.4_t2 /var/lib/mysql-

so I had to recover from the backup.

Had the same problem today.  Issued a SHUTDOWN and then restarted the management and data nodes.

Time: Thursday 27 August 2009 - 10:52:09
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 2837) 0x0000000a
Program: /usr/local/mysql/libexec/ndbmtd
Pid: 4149
Trace: /usr/local/mysql/data/ndb_21_trace.log.12 /usr/local/mysql/data/ndb_21_trace.log.12_t1 /usr/local/mysql/data/ndb_21_trace.log.12_t2 /usr/loc

I also had to restart the data nodes with --initial and restore from a backup.

Hi,

1) There is not sufficient logs to know exactly what has failed.
2) We're releasing 7.0.7 today (?) (note:: source only, binaries takes a bit
   longer) that has fixed a few problems resembling this.
   In short I have high hopes that problem is actually fixed.

Setting it need-feedback so you can test with 7.0.7

/Jonas

Jonas,

Thanks you for the information.  I will be looking for the new 7.0.7 release since we use the source distribution.  I will build it tomorrow and see how it goes.

Thanks,
Matthew Bilek
Webroot Software, Inc.
mbilek@webroot.com

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".