Bug #44806 2009-05-11 17:27:19 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling
Submitted: 11 May 2009 22:48    Modified: 25 May 2009 13:33
Reporter: James Eastman
Status: Won't fix
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0    OS: Linux (Linux ast4.nts.BLAH.com 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007 i686 i686 i386 GNU/Linux)
Assigned to:    CPU Architecture: Any
Tags: 7.0.5-0.rhel4.i386, MySQL 7.0 Cluster storage node, ndb, NDB startup

[11 May 2009 22:48] James Eastman
Description:
I recently upgraded my MySQL cluster members to MySQL Cluster 7.0 and now I cannot start my ndb storage nodes.  I did my installs with the RPMs found at mysql.com.  For my startup I brought up my admin nodes (two of them) and then I tried to start one ndb storage node.  The storage node never comes up.  On the management node in the ndb_mgm tool I see ->

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=2    @146.18.237.115  (mysql-5.1.32 ndb-7.0.5, starting, Nodegroup: 0)
id=3 (not connected, accepting connect from 146.18.237.49)
id=4 (not connected, accepting connect from 146.18.237.104)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @146.18.237.48  (mysql-5.1.32 ndb-7.0.5)

[mysqld(API)]   3 node(s)
id=5 (not connected, accepting connect from 146.18.237.48)
id=6 (not connected, accepting connect from 146.18.237.59)
id=7 (not connected, accepting connect from 146.18.237.103)

ndb_mgm>

My current /etc/my.cnf on *BOTH* management nodes *AND* ndb storage nodes is ->

[MYSQLD]
ndbcluster
ndb-connectstring="146.18.237.48:1186;146.18.237.59:1186"

# Options for ndbd process:
[mysql_cluster]
ndb-connectstring="146.18.237.48:1186;146.18.237.59:1186

My /var/lib/mysql-cluster/config.ini on these same nodes is ->

[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[NDB_MGMD DEFAULT]
PortNumber=1186
datadir=/var/lib/mysql-cluster  # Directory for management node log files

[NDB_MGMD]
Id=1
Hostname=146.18.237.48
ArbitrationRank=1

[NDB_MGMD]
Id=2
Hostname=146.18.237.59
ArbitrationRank=1

[NDBD DEFAULT]
NoOfReplicas=3
Datadir=/usr/local/mysql/data
DataMemory=80M
IndexMemory=18M

MaxNoOfConcurrentOperations=100000

StringMemory=25
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
MaxNoOfAttributes=24576
DiskCheckpointSpeedInRestart=100M
InitFragmentLogFiles=FULL
NoOfFragmentLogFiles=6

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

MemReportFrequency=30
BackupReportFrequency=10

### Params for setting logging 
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

### Params for increasing Disk throughput 
#BackupMaxWriteSize=1M
#BackupDataBufferSize=16M
#BackupLogBufferSize=4M
#BackupMemory=20M
#Reports indicate that ODirect=1 can cause I/O errors (OS err code 5) on some systems. You must test.
#ODirect=1

### Watchdog 
TimeBetweenWatchdogCheckInitial=30000

### TransactionInactiveTimeout  - should be enabled in Production 
#TransactionInactiveTimeout=30000
### CGE 6.3 - REALTIME EXTENSIONS 
#RealTimeScheduler=1
#SchedulerExecutionTimer=80
#SchedulerSpinTimer=40

### DISK DATA 
#SharedGlobalMemory=384M
#read my blog how to set this:
#DiskPageBufferMemory=3072M

### Multithreading 
MaxNoOfExecutionThreads=8
[NDBD]
Id=3
Hostname=146.18.237.115

[NDBD]
Id=4
Hostname=146.18.237.49

[NDBD]
Id=5
Hostname=146.18.237.104

[MYSQLD]
Id=7
Hostname=146.18.237.48
[MYSQLD]
Id=8
Hostname=146.18.237.48
[MYSQLD]
Id=9
Hostname=146.18.237.48
[MYSQLD]
Id=10
Hostname=146.18.237.48
[MYSQLD]
Id=11
Hostname=146.18.237.48

[MYSQLD]
Id=12
Hostname=146.18.237.59
[MYSQLD]
Id=13
Hostname=146.18.237.59
[MYSQLD]
Id=14
Hostname=146.18.237.59
[MYSQLD]
Id=15
Hostname=146.18.237.59
[MYSQLD]
Id=16
Hostname=146.18.237.59

[MYSQLD]
Id=17
Hostname=146.18.237.103
[MYSQLD]
Id=18
Hostname=146.18.237.103
[MYSQLD]
Id=19
Hostname=146.18.237.103
[MYSQLD]
Id=20
Hostname=146.18.237.103
[MYSQLD]
Id=21
Hostname=146.18.237.103

The node id for the ndb node that isn't coming up is ->

[NDBD]
Id=3
Hostname=146.18.237.115

Here's my /usr/local/mysql/data/ndb_2_out.log ->

2009-05-11 17:41:00 [ndbd] INFO     -- Angel pid: 14831 ndb pid: 14832
NDBMT: non-mt
2009-05-11 17:41:00 [ndbd] INFO     -- NDB Cluster -- DB node 2
2009-05-11 17:41:00 [ndbd] INFO     -- mysql-5.1.32 ndb-7.0.5-beta --
2009-05-11 17:41:00 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 1028Mb initial: 1048Mb
Adding 1048Mb to ZONE_LO (1,33535)
2009-05-11 17:41:00 [ndbd] INFO     -- Start initiated (mysql-5.1.32 ndb-7.0.5)
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310256 for In/Deflate buffer
WOPool::init(61, 9)
RWPool::init(22, 13)
2009-05-11 17:41:01 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=99
2009-05-11 17:41:01 [ndbd] INFO     -- Watchdog: User time: 10  System time: 96
RWPool::init(42, 18)
RWPool::init(62, 13)
2009-05-11 17:41:01 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=199
2009-05-11 17:41:01 [ndbd] INFO     -- Watchdog: User time: 10  System time: 106
2009-05-11 17:41:01 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=299
2009-05-11 17:41:01 [ndbd] INFO     -- Watchdog: User time: 10  System time: 116
Using 1 fragments per node
RWPool::init(c2, 18)
RWPool::init(e2, 14)
WOPool::init(41, 7)
RWPool::init(82, 12)
RWPool::init(a2, 51)
WOPool::init(21, 6)

In my /var/log/messages I see ->

May 11 17:41:00 ast4 ndbd: 2009-05-11 17:41:00 [ndbd] INFO     -- Configuration fetched from '146.18.237.48:1186', generation: 1
May 11 17:41:00 ast4 ndbd: ndbd startup succeeded

I’ve read through other bug posts and I can’t seem to find the magic sauce.  Please help.  Thanks in advance for your attention.

-JE

How to repeat:
Upgrade the cluster to MySQL Cluster 7.0, as best I can tell.
[12 May 2009 5:13] James Eastman
I attempted a startup of another ndb storage node and now I have some interesting files to attach.  So, I'm attaching those files.  Also, the ndb_mgm client shows ->

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=2 (not connected, accepting connect from 146.18.237.115)
id=3 (not connected, accepting connect from 146.18.237.49)
id=4    @146.18.237.104  (mysql-5.1.32 ndb-7.0.5, starting, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @146.18.237.48  (mysql-5.1.32 ndb-7.0.5)

[mysqld(API)]   3 node(s)
id=5 (not connected, accepting connect from 146.18.237.48)
id=6 (not connected, accepting connect from 146.18.237.59)
id=7 (not connected, accepting connect from 146.18.237.103)

ndb_mgm> Node 4: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
[12 May 2009 5:20] James Eastman
the error log

Attachment: ndb_4_error.log (application/octet-stream, text), 2.02 KiB.

[12 May 2009 5:20] James Eastman
the out.log

Attachment: ndb_4_out.log (application/octet-stream, text), 15.04 KiB.

[12 May 2009 5:22] James Eastman
Can't bring the cluster back up.  Web apps not working.
[12 May 2009 5:29] Jonas Oreland
The 6.3 to 7.0 upgrade is not an online upgrade (bug#44294); the data node filesystem is not compatible.
You must back up your 6.3 cluster, start 7.0 with --initial, and restore...
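
Roughly, that flow is the following sketch; the backup id, node ids, and paths below are illustrative placeholders (the BACKUP directory normally sits under the data node's DataDir), not values confirmed for this cluster:

# 1. on the running 6.3 cluster, take a native backup from the management client
ndb_mgm -e "START BACKUP WAIT COMPLETED"

# 2. shut down, install the 7.0 binaries, then start each data node empty
ndbd --initial

# 3. with the 7.0 cluster up, restore schema and data from the backup
#    (-m restores metadata, run once; -r restores data, run once per data node)
ndb_restore -c 146.18.237.48:1186 -n 2 -b 1 -m -r --backup_path=/usr/local/mysql/data/BACKUP/BACKUP-1
ndb_restore -c 146.18.237.48:1186 -n 3 -b 1 -r --backup_path=/usr/local/mysql/data/BACKUP/BACKUP-1

ndb_restore connects as an API node, so a free [MYSQLD]/API slot in config.ini is needed while it runs.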
[12 May 2009 14:25] James Eastman
Uh oh... since I've already done the upgrade and an initial start on one of my ndb nodes, is there any way to dump that data and/or revert my upgrade to the previous version and then dump my data?  Please, please, please :-D
[12 May 2009 14:42] Jonas Oreland
Hi,

If you did "--initial" on node=2,
you can start the cluster partially (i.e. without node 2)
by starting nodes 3 and 4 like this:
ndbd --ndb-nodeid=3 --nowait-nodes=2
ndbd --ndb-nodeid=4 --nowait-nodes=2

So do this *with 6.3 binaries*, and then take a backup.
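
The partial start can be verified from the management client before taking the backup, for example:

ndb_mgm -e show
ndb_mgm -e "all status"   # nodes 3 and 4 should report started; node 2 stays not connected

The backup itself is then taken with START BACKUP, as sketched above.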

---

But I would first take a file-system backup of the ndb_X_fs directories
that you have... in case something goes wrong...
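
A minimal sketch of that, assuming the DataDir from the config.ini above (/usr/local/mysql/data) and a placeholder target path; run on each data node, ideally while ndbd is stopped:

tar czf /root/ndb_fs_backup.tgz /usr/local/mysql/data/ndb_*_fs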
[12 May 2009 14:45] James Eastman
That idea sounds nice... Now, I've also already upgraded the mgmt node(s) and the MySQL API nodes.  Do I need to downgrade them as well *OR* can I leave them at 7.0?
[12 May 2009 16:51] James Eastman
I've followed your instructions and downgraded box 1, which has the admin management server and MySQL server, to ver 6.3.20, *AND* I've downgraded the *ONE* ndb node I'm bringing up to do this backup to ver 6.3.20.  I've started the management server *and* then an ndb storage node, and in the ndb_mgm client I see ->

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=2 (not connected, accepting connect from 146.18.237.115)
id=3 (not connected, accepting connect from 146.18.237.49)
id=4    @146.18.237.104  (mysql-5.1.30 ndb-6.3.20, starting, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @146.18.237.48  (mysql-5.1.30 ndb-6.3.20)

[mysqld(API)]   3 node(s)
id=5 (not connected, accepting connect from 146.18.237.48)
id=6 (not connected, accepting connect from 146.18.237.59)
id=7 (not connected, accepting connect from 146.18.237.103)

ndb_mgm>

On the ndb node I see lots of output in the logs.  I'll attach that log as a file.  The thing is, that NDB node seems to be hung in any case.  What gives?  Thanks.

James
[12 May 2009 16:52] James Eastman
Latest ndb node log file

Attachment: ndb_4_out.log (application/octet-stream, text), 16.56 KiB.

[13 May 2009 16:53] James Eastman
I've finally been able to downgrade my NDB nodes to 6.3, as well as my management node.  I've done the same to my MySQL Client/API nodes.  There is one problem, however.  When I start *ONLY ONE* MySQL client / API node everything is fine.  I can see my databases and get to them from an external client / PHP.  However, when I start a second API node all of the client access falls apart.  I've noticed that on the mysql client nodes there are directories in /var/lib/mysql that correspond to my DB names.  Is this supposed to be the case?  In any case I'll now attach the /var/lib/mysql/ast1.nts.BLAH.com.err file so you can see what I'm seeing.  What gives?
[13 May 2009 16:59] James Eastman
mysql API node error file

Attachment: ast1.nts.BLAH.com.err (application/octet-stream, text), 14.81 KiB.

[13 May 2009 17:01] James Eastman
Also, here's what ndb_mgm showed at the time of the MySQL API node errors ->

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=2    @146.18.237.115  (mysql-5.1.30 ndb-6.3.20, Nodegroup: 0)
id=3    @146.18.237.49  (mysql-5.1.30 ndb-6.3.20, Nodegroup: 0)
id=4    @146.18.237.104  (mysql-5.1.30 ndb-6.3.20, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @146.18.237.48  (mysql-5.1.30 ndb-6.3.20)

[mysqld(API)]   15 node(s)
id=5 (not connected, accepting connect from 146.18.237.48)
id=6 (not connected, accepting connect from 146.18.237.48)
id=7 (not connected, accepting connect from 146.18.237.48)
id=8 (not connected, accepting connect from 146.18.237.48)
id=9 (not connected, accepting connect from 146.18.237.48)
id=10 (not connected, accepting connect from 146.18.237.59)
id=11 (not connected, accepting connect from 146.18.237.59)
id=12 (not connected, accepting connect from 146.18.237.59)
id=13 (not connected, accepting connect from 146.18.237.59)
id=14 (not connected, accepting connect from 146.18.237.59)
id=15 (not connected, accepting connect from 146.18.237.103)
id=16 (not connected, accepting connect from 146.18.237.103)
id=17 (not connected, accepting connect from 146.18.237.103)
id=18 (not connected, accepting connect from 146.18.237.103)
id=19 (not connected, accepting connect from 146.18.237.103)

ndb_mgm>
[13 May 2009 18:46] Jonas Oreland
Hi again

I didn't notice before, but using 3 replicas is not supported in 6.3
(it is fixed in 7.0, see http://bugs.mysql.com/bug.php?id=18621).

It sounds like a miracle that you didn't get any problems with this before.

So now I urge you to take a backup (using START BACKUP)
and restore it on a clean cluster, either 6.3 or 7.0.
And I also strongly recommend using NoOfReplicas=2 instead
(and hence 2 or 4 data nodes instead of 3), given that even if we
don't have any known bugs, our test coverage of 3-replica setups is quite low.

Read the manual carefully on how to take backup and restore it.
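
For the two-replica layout, the relevant part of config.ini would look roughly like this; the node ids and host names are only illustrative, reusing two of the hosts from this report:

[NDBD DEFAULT]
NoOfReplicas=2

[NDBD]
Id=2
Hostname=146.18.237.115

[NDBD]
Id=3
Hostname=146.18.237.49

Note that NoOfReplicas cannot be changed with a rolling restart; it takes an initial (re)start of the cluster followed by a restore of the data.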
[13 May 2009 19:37] James Eastman
ok, Ok .... I'll go to two API nodes and two NDB data nodes.  In any case, can I just use mysqldump to get the result I seek?
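
Something like the following is what I have in mind; the host is one of the API/mysqld hosts from the config, and the user and output file are placeholders:

mysqldump -h 146.18.237.48 -u root -p --all-databases > cluster_dump.sql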
[25 May 2009 13:33] Jonathan Miller
Please do the upgrade via online backup, using --initial and then restore, or wait for 7.0.6, as the upgrade issue is fixed in that version.
[13 Mar 2014 13:37] Omer Barnir
This bug is not scheduled to be fixed at this time.