Bug #43021 | MySQL Cluster NDB kernel thread stuck on startup | ||
---|---|---|---|
Submitted: | 19 Feb 2009 13:56 | Modified: | 23 Feb 2009 15:58 |
Reporter: | Guido Ostkamp | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 6.4.3 | OS: | Any |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[19 Feb 2009 13:56]
Guido Ostkamp
[19 Feb 2009 14:02]
Guido Ostkamp
Additional information has been uploaded on MySQL FTP server (bug-data-43021.tar.bz2). Regards Guido Ostkamp
[19 Feb 2009 14:56]
Jonas Oreland
is it still hanging ?
[19 Feb 2009 15:04]
Guido Ostkamp
In order to get you the stacktraces, I had to invoke dbx and that hit the heartbeat. First node went down with 2009-02-19 12:25:40 [ndbd] INFO -- Watchdog: User time: 97 System time: 742 saving 109090000 at 1005e24a0 (0)saving 109220000 at 1005e9ea0 (0) saving 1092e8000 at 1005edba0 (0)saving 109158000 at 1005e61a0 (0) saving 108fc8000 at 1005de7a0 (0) 2009-02-19 13:55:52 [ndbd] INFO -- Watchdog: User time: 22134 System time: 88783 2009-02-19 13:55:52 [ndbd] WARNING -- Watchdog: Warning overslept 8488 ms, expected 100 ms. 2009-02-19 13:55:53 [ndbd] INFO -- Node 3 disconnected 2009-02-19 13:55:53 [ndbd] INFO -- QMGR (Line: 2908) 0x0000000a 2009-02-19 13:55:53 [ndbd] INFO -- Error handler startup shutting down system 2009-02-19 13:55:53 [ndbd] INFO -- Error handler shutdown completed - exiting 2009-02-19 13:55:53 [ndbd] INFO -- Angel received ndbd startup failure count 1. 2009-02-19 13:55:56 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'. and second node then with 2009-02-19 12:26:59 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=9239 2009-02-19 12:26:59 [ndbd] INFO -- Watchdog: User time: 258 System time: 1315 saving 109088000 at 1005de528 (0) saving 109150000 at 1005e2228 (0)saving 1092e0000 at 1005e9c28 (0)saving 109218000 at 1005e5f28 (0) saving 108fc0000 at 1005da828 (0) 2009-02-19 13:55:51 [ndbd] INFO -- Node failure during restart 2009-02-19 13:55:51 [ndbd] INFO -- QMGR (Line: 3489) 0x0000000a 2009-02-19 13:55:51 [ndbd] INFO -- Error handler startup shutting down system 2009-02-19 13:55:51 [ndbd] INFO -- Error handler shutdown completed - exiting 2009-02-19 13:55:51 [ndbd] INFO -- Angel received ndbd startup failure count 1. 2009-02-19 13:55:56 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node fa iled during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'. Now both ndb's are gone. Regards Guido Ostkamp
[19 Feb 2009 15:13]
Jonas Oreland
ok, i meant does it hang it you try to start it again ?
[19 Feb 2009 15:35]
Guido Ostkamp
After restart and giving some bad log entries for a while, it seems to have recovered somehow this time: sing 4 fragments per node RWPool::init(c2, 18) RWPool::init(e2, 16) WOPool::init(41, 8) RWPool::init(82, 12) RWPool::init(a2, 52) WOPool::init(21, 10) WOPool::init(21, 10) WOPool::init(21, 10) WOPool::init(21, 10) WOPool::init(21, 10) 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=110 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 91 System time: 158 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=220 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 91 System time: 170 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=330 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 92 System time: 183 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=440 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 92 System time: 195 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=549 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 93 System time: 209 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=659 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 93 System time: 222 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=769 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 93 System time: 235 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=880 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 94 System time: 247 2009-02-19 16:17:27 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=990 2009-02-19 16:17:27 [ndbd] INFO -- Watchdog: User time: 94 System time: 260 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1100 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 95 System time: 273 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1210 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 95 System time: 286 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1320 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 96 System time: 299 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1430 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 96 System time: 312 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1540 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 97 System time: 325 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1650 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 97 System time: 338 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1760 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 98 System time: 351 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1869 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 98 System time: 364 2009-02-19 16:17:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=1980 2009-02-19 16:17:28 [ndbd] INFO -- Watchdog: User time: 99 System time: 377 2009-02-19 16:17:29 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=2089 2009-02-19 16:17:29 [ndbd] INFO -- Watchdog: User time: 99 System time: 390 2009-02-19 16:17:29 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=2199 2009-02-19 16:17:29 [ndbd] INFO -- Watchdog: User time: 100 System time: 403 2009-02-19 16:17:29 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=2310 2009-02-19 16:17:29 [ndbd] INFO -- Watchdog: User time: 100 System time: 416 2009-02-19 16:17:29 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=2420 2009-02-19 16:17:29 [ndbd] INFO -- Watchdog: User time: 101 System time: 429 2009-02-19 16:17:29 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=2530 2009-02-19 16:17:29 [ndbd] INFO -- Watchdog: User time: 101 System time: 442 2009-02-19 16:17:29 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=2639 2009-02-19 16:17:29 [ndbd] INFO -- Watchdog: User time: 101 System time: 455 ... 2009-02-19 16:17:32 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=5830 2009-02-19 16:17:32 [ndbd] INFO -- Watchdog: User time: 115 System time: 821 2009-02-19 16:17:32 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=5939 2009-02-19 16:17:32 [ndbd] INFO -- Watchdog: User time: 115 System time: 844 2009-02-19 16:17:33 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=6050 2009-02-19 16:17:33 [ndbd] INFO -- Watchdog: User time: 116 System time: 847 2009-02-19 16:17:33 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=6160 2009-02-19 16:17:33 [ndbd] INFO -- Watchdog: User time: 116 System time: 850 2009-02-19 16:17:33 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=6269 2009-02-19 16:17:33 [ndbd] INFO -- Watchdog: User time: 117 System time: 851 saving 109090000 at 1005e24a0 (0) saving 109158000 at 1005e61a0 (0)saving 109220000 at 1005e9ea0 (0)saving 1092e8000 at 1005edba0 (0) saving 108fc8000 at 1005de7a0 (0) restartCreateObj(1) file: 1 restartCreateObj(2) file: 1 restartCreateObj(3) file: 1 restartCreateObj(4) file: 1 restartCreateObj(5) file: 1 restartCreateObj(6) file: 1 restartCreateObj(7) file: 1 restartCreateObj(8) file: 1 restartCreateObj(9) file: 1 ... ORE table: 2 255 rows appliedRESTORE table: 2 255 rows applied RESTORE table: 2 246 rows appliedRESTORE table: 2 262 rows applied RESTORE table: 2 269 rows applied RESTORE table: 2 272 rows applied RESTORE table: 2 237 rows applied RESTORE table: 2 262 rows applied RESTORE table: 3 0 rows applied RESTORE table: 3 2 rows applied RESTORE table: 3 3 rows appliedRESTORE table: 3 4 rows applied RESTORE table: 3 3 rows applied RESTORE table: 3 3 rows applied RESTORE table: 3 2 rows applied RESTORE table: 3 3 rows applied RESTORE table: 4 2 rows appliedRESTORE table: 4 3 rows applied RESTORE table: 4 2 rows applied RESTORE table: 4 3 rows applied ... RESTORE table: 35 10465 rows applied delay: reqs=delay: reqs=delay: reqs=444444 delay: reqs=44 delay: reqs=44 delay: reqs=44 delay: reqs=44 delay: reqs=44 delay: reqs=44 delay: reqs=44 delay: reqs=44 ... delay: reqs=44 m_active_buckets.set(1) alloc_chunk(5071 16) -
[19 Feb 2009 16:12]
Jonas Oreland
ok, got an idea can you check if changing LockPagesInMainMemory to 2 makes a difference... if it does, I can fix it :-)
[20 Feb 2009 11:25]
Guido Ostkamp
I tried with "LockPagesInMainMemory=2" and while there are still some entries about stuck threads (all in one second), the number is not as high as with setting "...=1" (where those entries are visible for up to 7-10 seconds). I did not manage to entirely get stuck forever this time. Example for settting '2': DBMT: MaxNoOfExecutionThreads=8 NDBMT: workers=4 threads=4 2009-02-20 10:50:17 [ndbd] INFO -- NDB Cluster -- DB node 3 2009-02-20 10:50:17 [ndbd] INFO -- mysql-5.1.32 ndb-6.4.3-beta -- 2009-02-20 10:50:17 [ndbd] INFO -- Configuration fetched at eibe port 1186 2009-02-20 10:50:17 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 2254Mb initial: 2274Mb Adding 2274Mb to ZONE_LO (1,72767) NDBMT: num_threads=7 thr: 0 tid: 5 DBTC(0) DBDIH(0) DBDICT(0) NDBCNTR(0) QMGR(0) NDBFS(0) TRIX(0) DBUTIL(0) thr: 1 tid: 6 BACKUP(0) DBLQH(0) DBACC(0) DBTUP(0) SUMA(0) DBTUX(0) TSMAN(0) LGMAN(0) PGMAN(0) RESTORE(0) DBINFO(0) PGMAN(5 ) thr: 2 tid: 7 PGMAN(1) DBACC(1) DBLQH(1) DBTUP(1) BACKUP(1) DBTUX(1) RESTORE(1) thr: 3 tid: 8 PGMAN(2) DBACC(2) DBLQH(2) DBTUP(2) BACKUP(2) DBTUX(2) RESTORE(2) thr: 6 tid: 1 CMVMI(0) saving 108f00000 at 1005daaa0 (0) thr: 4 tid: 9 PGMAN(3) DBACC(3) DBLQH(3) DBTUP(3) BACKUP(3) DBTUX(3) RESTORE(3) 2009-02-20 10:50:26 [ndbd] WARNING -- Ndb kernel thread 5 is stuck in: Unknown place elapsed=70 2009-02-20 10:50:26 [ndbd] INFO -- Locked future allocations 2009-02-20 10:50:26 [ndbd] INFO -- Watchdog: User time: 13 System time: 543 thr: 5 tid: 10 PGMAN(4) DBACC(4) DBLQH(4) DBTUP(4) BACKUP(4) DBTUX(4) RESTORE(4) 2009-02-20 10:50:26 [ndbd] INFO -- Start initiated (mysql-5.1.32 ndb-6.4.3) 2009-02-20 10:50:26 [ndbd] INFO -- Watchdog: User time: 14 System time: 551 2009-02-20 10:50:26 [ndbd] WARNING -- Watchdog: Warning overslept 5109 ms, expected 100 ms. 2009-02-20 10:50:26 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=110 2009-02-20 10:50:26 [ndbd] INFO -- Watchdog: User time: 14 System time: 565 2009-02-20 10:50:26 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=220 2009-02-20 10:50:26 [ndbd] INFO -- Watchdog: User time: 15 System time: 577 2009-02-20 10:50:26 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Job Handling elapsed=330 2009-02-20 10:50:26 [ndbd] INFO -- Watchdog: User time: 15 System time: 588 saving 1093b0000 at 1005f18a0 (0) NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer ... Regards Guido Ostkamp
[23 Feb 2009 11:25]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/67163 2853 Jonas Oreland 2009-02-23 ndb - bug#43021 - mlockall can cause watchdog failure. 1) inform watchdog that we're calling mlockall (refresh_watchdog) 2) move setting of real watchdog interval to after mlockall
[23 Feb 2009 11:26]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/67164 2854 Jonas Oreland 2009-02-23 ndb - bug#43021 - mlockall can cause watchdog failure. 1) inform watchdog that we're calling mlockall (refresh_watchdog) 2) move setting of real watchdog interval to after mlockall
[23 Feb 2009 11:31]
Bugs System
Pushed into 5.1.32-ndb-6.2.17 (revid:jonas@mysql.com-20090223112617-en8yhzmffthq912s) (version source revid:jonas@mysql.com-20090223112617-en8yhzmffthq912s) (merge vers: 5.1.32-ndb-6.2.17) (pib:6)
[23 Feb 2009 11:32]
Bugs System
Pushed into 5.1.32-ndb-6.3.23 (revid:jonas@mysql.com-20090223112759-dk1lvkaoy4297lpx) (version source revid:jonas@mysql.com-20090223112759-dk1lvkaoy4297lpx) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)
[23 Feb 2009 11:33]
Bugs System
Pushed into 5.1.32-ndb-6.4.3 (revid:jonas@mysql.com-20090223112952-ihesm8xlerrd179p) (version source revid:jonas@mysql.com-20090223112952-ihesm8xlerrd179p) (merge vers: 5.1.32-ndb-6.4.3) (pib:6)
[23 Feb 2009 15:58]
Jon Stephens
Documented in the NDB-6.2.17, 6.3.23, and 6.4.3 changelogs as follows: When using ndbmtd, NDB kernel threads could hang while trying to start the data nodes with LockPagesInMainMemory set to 1.
[28 Dec 2009 13:36]
Chris Picton
I am still seeing this bug with 7.0.9 x86_64 rhel5 RPMS If I have LockPagesInMainMemory=1, starting ndbmtd gives a lot of the following: 2009-12-28 15:21:02 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=100 2009-12-28 15:21:02 [ndbd] INFO -- Watchdog: User time: 13 System time: 38 2009-12-28 15:21:02 [ndbd] WARNING -- Ndb kernel thread 1 is stuck in: Job Handling elapsed=100 2009-12-28 15:21:02 [ndbd] INFO -- Watchdog: User time: 13 System time: 38 2009-12-28 15:21:02 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Job Handling elapsed=100 2009-12-28 15:21:02 [ndbd] INFO -- Watchdog: User time: 13 System time: 38 2009-12-28 15:21:02 [ndbd] WARNING -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=100 2009-12-28 15:21:02 [ndbd] INFO -- Watchdog: User time: 13 System time: 38 2009-12-28 15:21:02 [ndbd] WARNING -- Ndb kernel thread 4 is stuck in: Job Handling elapsed=100 2009-12-28 15:21:02 [ndbd] INFO -- Watchdog: User time: 13 System time: 38 2009-12-28 15:21:02 [ndbd] WARNING -- Ndb kernel thread 5 is stuck in: Job Handling elapsed=100 Setting LockPagesInMainMemory to 0, or using ndbd, those warnings go away. My config is: [TCP DEFAULT] SendBufferMemory=4M ReceiveBufferMemory=4M [NDB_MGMD DEFAULT] Datadir=/srv/mysql-cluster/data LogDestination=FILE:filename=ndb_cluster.log,maxsize=10000000,maxfiles=6 ArbitrationRank=1 [NDB_MGMD] Id=1 Hostname=10.202.1.100 [NDB_MGMD] Id=2 Hostname=10.202.1.101 [NDBD DEFAULT] NoOfReplicas=2 Datadir=/srv/mysql-cluster/data FileSystemPathDD=/srv/mysql-cluster/data #FileSystemPathUndoFiles=/srv/mysql-cluster/data #FileSystemPathDataFiles=/srv/mysql-cluster/data DataMemory=56000M IndexMemory=8000M LockPagesInMainMemory=0 MaxNoOfConcurrentOperations=100000 StringMemory=25 MaxNoOfTables=4096 MaxNoOfOrderedIndexes=2048 MaxNoOfUniqueHashIndexes=512 MaxNoOfAttributes=24576 DiskCheckpointSpeedInRestart=100M FragmentLogFileSize=256M InitFragmentLogFiles=FULL NoOfFragmentLogFiles=328 RedoBuffer=32M TimeBetweenLocalCheckpoints=20 TimeBetweenGlobalCheckpoints=1000 TimeBetweenEpochs=100 MemReportFrequency=30 BackupReportFrequency=10 ### Params for setting logging LogLevelStartup=15 LogLevelShutdown=15 LogLevelCheckpoint=8 LogLevelNodeRestart=15 ### Params for increasing Disk throughput BackupMaxWriteSize=1M BackupDataBufferSize=16M BackupLogBufferSize=4M BackupMemory=20M #Reports indicates that odirect=1 can cause io errors (os err code 5) on some systems. You must test. ODirect=1 ### Watchdog TimeBetweenWatchdogCheckInitial=60000 ### TransactionInactiveTimeout - should be enabled in Production TransactionInactiveTimeout=30000 ### CGE 6.3 - REALTIME EXTENSIONS RealTimeScheduler=1 SchedulerExecutionTimer=80 SchedulerSpinTimer=40 ### DISK DATA SharedGlobalMemory=384M DiskPageBufferMemory=1024M ### Multithreading MaxNoOfExecutionThreads=8 ### Increasing the LongMessageBuffer b/c of a bug (20090903) LongMessageBuffer=32M BatchSizePerLocalScan=512 [NDBD] Id=3 Hostname=10.202.1.4 [NDBD] Id=4 Hostname=10.202.1.5 ## BELOW ARE (INACTIVE) SLOTS FOR DATA NODES TO ALLOW FOR GROWTH #[NDBD] #Id=5 #Hostname= #[NDBD] #Id=6 #Hostname= #[NDBD] #Id=7 #Hostname= #[NDBD] #Id=8 #Hostname= #[NDBD] #Id=9 #Hostname= #[NDBD] #Id=10 #Hostname= [MYSQLD DEFAULT] BatchSize=512 #BatchByteSize=2048K #MaxScanBatchSize=2048K [MYSQLD] Id=11 Hostname=10.202.1.2 [MYSQLD] Id=12 Hostname=10.202.1.2 [MYSQLD] Id=13 Hostname=10.202.1.2 [MYSQLD] Id=14 Hostname=10.202.1.2 [MYSQLD] Id=15 Hostname=10.202.1.2 [MYSQLD] Id=16 Hostname=10.202.1.2 [MYSQLD] Id=17 Hostname=10.202.1.2 [MYSQLD] Id=18 Hostname=10.202.1.2 [MYSQLD] Id=19 Hostname=10.202.1.3 [MYSQLD] Id=20 Hostname=10.202.1.3 [MYSQLD] Id=21 Hostname=10.202.1.3 [MYSQLD] Id=22 Hostname=10.202.1.3 [MYSQLD] Id=23 Hostname=10.202.1.3 [MYSQLD] Id=24 Hostname=10.202.1.3 [MYSQLD] Id=25 Hostname=10.202.1.3 [MYSQLD] Id=26 Hostname=10.202.1.3