Bug #25710 Restarting data node _always_ requires --initial
Submitted: 18 Jan 2007 22:32 Modified: 13 Mar 2009 9:08
Reporter: Bugs System Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.1.14 OS:Linux (CentOS4.4 x86_64)
Assigned to: Assigned Account CPU Architecture:Any
Tags: annoying, cluster, ndbd, restart, schema, slow

[18 Jan 2007 22:32] Bugs System
Description:
After updating from 5.1.12 to 5.1.14, every ndbd data node restart requires the use of --initial for the node to start up; otherwise the errors below appear in the log and the node does not start.

We had 3 x 5GB tablespace files, but reduced them to 1 and dropped data to force a faster --initial start. Configuration testing takes a long time, because every change means waiting for all the data nodes to start.

The cluster is 4x Dell PowerEdge 1950 (2x dual-core CPUs, 4GB RAM, 72GB RAID1 local storage). All servers are ndbd and SQL nodes; one is also mgmd.

ndb_1_cluster.log:
2007-01-18 17:00:01 [MgmSrvr] INFO     -- Node 1: Node 6 Connected
2007-01-18 17:00:01 [MgmSrvr] ALERT    -- Node 7: Node 6 Disconnected
2007-01-18 17:00:01 [MgmSrvr] INFO     -- Node 7: Communication to Node 6 closed
2007-01-18 17:00:01 [MgmSrvr] ALERT    -- Node 7: Arbitration check won - node group majority
2007-01-18 17:00:01 [MgmSrvr] INFO     -- Node 7: President restarts arbitration thread [state=6]
2007-01-18 17:00:01 [MgmSrvr] INFO     -- Node 7: DICT: remove lock by failed node 6 for NodeRestart
2007-01-18 17:00:01 [MgmSrvr] INFO     -- Node 7: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
2007-01-18 17:00:01 [MgmSrvr] ALERT    -- Node 8: Node 6 Disconnected
2007-01-18 17:00:01 [MgmSrvr] INFO     -- Node 8: Communication to Node 6 closed
2007-01-18 17:00:01 [MgmSrvr] ALERT    -- Node 9: Node 6 Disconnected
2007-01-18 17:00:01 [MgmSrvr] INFO     -- Node 9: Communication to Node 6 closed
2007-01-18 17:00:01 [MgmSrvr] INFO     -- Node 7: Local checkpoint 1508 started. Keep GCI = 2029109 oldest restorable GCI = 2036131
2007-01-18 17:00:01 [MgmSrvr] ALERT    -- Node 6: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

How to repeat:
kill -9 ndbd
ndbd
Watch it fail.
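The failure signature is the "Forced node shutdown ... error 2341" line shown in the clusterlog excerpt above, so the repro can be checked mechanically. A minimal sketch (the helper name is invented; the log path is the one used elsewhere in this report):

```shell
# Hypothetical helper: after "kill -9 ndbd; ndbd", check the management
# server's cluster log for the error-2341 forced shutdown quoted above.
check_restart_crash() {
    # $1: path to the ndb_mgmd cluster log
    grep -q 'Forced node shutdown completed' "$1" &&
        grep -q 'error 2341' "$1" &&
        echo 'node crashed during restart (error 2341)'
}

# Usage (path as in this report):
# check_restart_crash /var/lib/mysql-cluster/ndb_1_cluster.log
```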
[18 Jan 2007 22:33] Bugs System
I have also tried using the ndb_mgm program to stop/start nodes, but the same error appears.
[18 Jan 2007 23:08] Jonas Oreland
Hi,

Please upload config.ini, clusterlog and error/trace files
(and maybe your schema, if it could be schema related...)

/Jonas
[19 Jan 2007 19:35] Bugs System
/etc/config.ini from cluster.

Attachment: config.ini (application/octet-stream, text), 688 bytes.

[19 Jan 2007 19:36] Bugs System
Updated version field to the actual MySQL version used.
[19 Jan 2007 19:41] Bugs System
/var/lib/mysql-cluster/cluster.log file from cluster after restarting ndb_mgmd ok, but restarting node6 failed.

Attachment: cluster.log (text/x-log), 23.60 KiB.

[19 Jan 2007 19:43] Bugs System
/var/lib/mysql-cluster/ndb_1_out.log file from cluster.

Attachment: ndb_1_out.log (text/x-log), 425 bytes.

[19 Jan 2007 19:57] Jonas Oreland
Hi,

I also need ndb_6_error.log and ndb_6_trace.log.*
(from the crash demonstrated in the cluster log).

These are stored locally on the data nodes...

/Jonas
[19 Jan 2007 19:58] Bugs System
/var/lib/mysql/lindb.err file from cluster node that failed to start.

Attachment: lindb.err (application/octet-stream, text), 485 bytes.

[19 Jan 2007 20:02] Bugs System
/data/ndb_6_error.log file from the ndbd cluster node that failed to start.

Attachment: ndb_6_error.log (text/x-log), 1.04 KiB.

[19 Jan 2007 22:13] Bugs System
Trace log from the ndbd server crash.

Attachment: ndb_6_trace.log.1.gz (application/x-gzip, text), 31.26 KiB.

[19 Jan 2007 22:14] Bugs System
Trace log from the ndbd server crash.

Attachment: ndb_6_trace.log.2.gz (application/x-gzip, text), 31.21 KiB.

[19 Jan 2007 22:18] Bugs System
Trace log from the ndbd server crash.

Attachment: ndb_6_trace.log.next (application/octet-stream, text), 1 bytes.

[19 Jan 2007 22:23] Bugs System
What exactly is needed for schema information? e.g. an information_schema dump?
[20 Jan 2007 0:43] Jonas Oreland
Hi,

It looks like a memory shortage.
Can you try "all dump 1000" from your running cluster
and paste the output from the clusterlog here in the bug report?

/Jonas
[22 Jan 2007 17:00] Bugs System
I just found that stopping all ndbd nodes, doing rm -f on all files in the /data directory, then running ndbd --initial, and restarting data nodes works fine both from the command line and from ndb_mgm. Neat. This cleared the tablespace, datafile, and on-disk tables, though. Not so neat.

I am recreating the tablespace now to see what effect this has on restarting data nodes.
[22 Jan 2007 17:08] Bugs System
With no data in the on-disk cluster tablespace, all the nodes are starting fine. I have tried changing the order of restarting all the nodes in the cluster (e.g. restarting all the nodes starting from the current master and forcing the master to "move"; restarting all the nodes with the master node last; restarting from ndb_mgm; restarting from the command line using killall ndbd; ndbd).

Next test is to add some data and see what happens.
[22 Jan 2007 20:17] Bugs System
I have noticed that if you delete /var/lib/mysql-cluster/ndb_1_cluster.log while ndb_mgmd is running, the file is not recreated until you restart ndb_mgmd.

Here's the output of all dump 1000:

2007-01-22 15:14:10 [MgmSrvr] INFO     -- NDB Cluster Management Server. Version 5.1.14 (beta)
2007-01-22 15:14:10 [MgmSrvr] INFO     -- Id: 1, Command port: 1186
2007-01-22 15:14:12 [MgmSrvr] INFO     -- Node 1: Node 9 Connected
2007-01-22 15:14:12 [MgmSrvr] INFO     -- Node 1: Node 7 Connected
2007-01-22 15:14:12 [MgmSrvr] INFO     -- Node 1: Node 8 Connected
2007-01-22 15:14:12 [MgmSrvr] INFO     -- Node 9: Prepare arbitrator node 1 [ticket=15b200084b71b9b3]
2007-01-22 15:14:12 [MgmSrvr] INFO     -- Node 7: Started arbitrator node 1 [ticket=15b200084b71b9b3]
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 7: Data usage is 0%(124 32K pages of total 51200)
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 7: Index usage is 39%(63 8K pages of total 160)
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 7: Resource 0 min: 0 max: 640 curr: 74
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 7: Resource 1 min: 0 max: 0 curr: 3
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 7: Resource 2 min: 0 max: 0 curr: 7
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 8: Data usage is 0%(128 32K pages of total 51200)
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 8: Index usage is 41%(66 8K pages of total 160)
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 8: Resource 0 min: 0 max: 640 curr: 74
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 8: Resource 1 min: 0 max: 0 curr: 3
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 8: Resource 2 min: 0 max: 0 curr: 7
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 9: Data usage is 0%(128 32K pages of total 51200)
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 9: Index usage is 41%(66 8K pages of total 160)
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 9: Resource 0 min: 0 max: 640 curr: 74
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 9: Resource 1 min: 0 max: 0 curr: 3
2007-01-22 15:15:25 [MgmSrvr] INFO     -- Node 9: Resource 2 min: 0 max: 0 curr: 7

You will notice only 3 nodes listed, as node 6 will not start up at all now. I believe the problem is related to memory usage. The nodes were stable through ndbd restarts until 1600MB data memory and 1MB index memory. As soon as I increased index memory to 4MB, ndbd crashed at stage 5 again. The strange thing is, after changing the setting back to 1MB for index memory, ndbd still will not start, not even using --initial!
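For reference, the per-node "Data usage" / "Index usage" lines in the dump above can be reduced to a short summary with a small filter. A sketch assuming exactly the clusterlog format shown above (the function name is invented):

```shell
# Sketch: summarize "Data usage" / "Index usage" lines from the
# clusterlog output of "all dump 1000", in the format quoted above.
summarize_usage() {
    awk '/(Data|Index) usage/ {
        # e.g. "... Node 7: Data usage is 0%(124 32K pages of total 51200)"
        for (i = 1; i <= NF; i++) if ($i == "Node") node = $(i + 1)
        sub(/:$/, "", node)                       # strip trailing colon
        kind = ($0 ~ /Data usage/) ? "data" : "index"
        match($0, /[0-9]+%/)                      # first percentage on the line
        print node, kind, substr($0, RSTART, RLENGTH)
    }'
}

# Usage: summarize_usage < ndb_1_cluster.log
```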
[22 Jan 2007 20:35] Bugs System
My mistake: node 6 does start up with --initial, but it will not restart using just ndbd (or "6 restart" from ndb_mgm).

All the other nodes now will not start without --initial.

Is the crashed node's data distributed to other nodes in the cluster, preventing a clean startup without --initial on every node for every subsequent restart?
[23 Jan 2007 18:08] Bugs System
An older Bug #25357 "ndbd node cannot restart with disk tables" shows similar characteristics. At least I'm not the only one :/
[30 Jan 2007 16:17] Bugs System
Aha, after testing more thoroughly I have more information on this bug. I modified the data and index memory settings, dumped and restored the data, changed/stopped mysqld, and restarted ndbd with/without --initial many times.

I can get the cluster to repeat the behaviour every single time.

It looks like ndbd restarts break, and start to require --initial, after doing an ALTER TABLE on any of the databases to add an index on one field (the second index on the table, counting the PK).

I'm pretty sure this is not related to memory. The servers are 64-bit EM64T, with data and index memory both set to 100MB for testing. pmap on the ndbd processes shows under 300MB for the active process. The machine is also running with at least 2GB free during testing.

I'm attaching a complete trace of a crash immediately following an alter table run.
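For anyone trying to reproduce this, the trigger described above amounts to adding a second index to an existing disk-data table. A hypothetical sketch (table, column, and tablespace names are invented — the real schema is in the attachments):

```shell
# Hypothetical repro script; names are illustrative, not from the
# attached schema. Writes the SQL, then (commented out) the steps
# that run it and restart a data node without --initial.
cat > repro.sql <<'EOF'
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    b  INT
) TABLESPACE ts1 STORAGE DISK ENGINE=NDBCLUSTER;

-- Second index (beyond the PK): the step after which plain "ndbd"
-- restarts reportedly start failing with error 2341.
ALTER TABLE t1 ADD INDEX idx_b (b);
EOF

# mysql < repro.sql
# killall ndbd && ndbd        # fails; only "ndbd --initial" recovers
```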
[30 Jan 2007 16:19] Bugs System
Most recent logs after running ALTER TABLE statements and trying to do an ndbd restart. All nodes now require --initial to restart.

Attachment: cluster.tar.gz (application/x-gzip, text), 183.29 KiB.

[30 Jan 2007 16:24] Bugs System
I'm now running with these changes to config.ini on all nodes, and have restarted ndb_mgmd after every change to the file. The rest of the file just specifies hostnames, ids, and the datadir.

[NDBD DEFAULT]
NoOfReplicas=2
DataDir=/data
DataMemory=200M
IndexMemory=200M
StringMemory=15
MaxNoOfAttributes=10000
MaxNoOfConcurrentOperations=500000
MaxNoOfLocalOperations=800000
TimeBetweenWatchDogCheck=10000
#StopOnError=0
#LockPagesInMainMemory=1
[31 Jan 2007 16:18] Bugs System
Updated OS.
[31 Jan 2007 16:59] Bugs System
These are the exact MySQL-provided RPMs used on the nodes (management RPMs only installed on the first node):

MySQL-ndb-storage-5.1.14-0.glibc23
MySQL-ndb-tools-5.1.14-0.glibc23
MySQL-client-5.1.14-0.glibc23
MySQL-ndb-management-5.1.14-0.glibc23
MySQL-server-5.1.14-0.glibc23
MySQL-ndb-extra-5.1.14-0.glibc23
perl-DBD-MySQL-2.9004-3.1
[6 Feb 2007 23:20] Jason Downing
Hi Brian,

I reported the 25357 bug. Do you know if they have fixed this bug in the 5.1.15 release? I filed my bug ages ago, and I'm fairly disappointed by the lack of response from MySQL. The new release shows all sorts of trivial issues fixed, but this issue renders the cluster unusable for us, and they have left my bugs unfixed for a month. Any idea what's going on with it?

The other issue that has caused us to drop the cluster is that even with disk tables the capacity of the database is really low. With in-memory tables and 2Gb of RAM on each data node (no partitioning) we can get a 300 Mb SQL file in. With disk data tables this goes up to maybe 400 Mb, even though only a very small percentage of our fields are indexed. Any idea what gives here? My email is jasondowning@hytechscales.com

Jason
[7 Feb 2007 1:20] Jonas Oreland
Hi Brian and Jason,

Very sorry that we have not had the time to resolve your issues yet.
Unfortunately, I cannot make any promises as to when the time will arise.

This bug report looks like it has everything needed to reproduce and fix the issue.
Thanks for this.

I assure you that we will try to fix this as soon as we can,

/Jonas
[8 May 2007 9:10] Jonas Oreland
Hi Brian.

I am (very) confident that this is the same as
http://bugs.mysql.com/bug.php?id=24166

which was fixed in 5.1.15.

Could you please retest with a newer version than 5.1.14?

And again, sorry that this bug report has lingered so long.

/Jonas
[1 Mar 2008 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[6 Mar 2008 17:13] Bugs System
We dropped the cluster a long time ago, but I hope this works for others.