Bug #11776 | mysql cluster unable to restart after clean shutdown | ||
---|---|---|---|
Submitted: | 6 Jul 2005 14:48 | Modified: | 22 Jul 2005 14:28 |
Reporter: | Fabio Coatti | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 4.1.12 | OS: | Linux (Linux (AMD64)) |
Assigned to: | Assigned Account | CPU Architecture: | Any |
[6 Jul 2005 14:48]
Fabio Coatti
[6 Jul 2005 16:17]
Fabio Coatti
attached file with full trace dump
[6 Jul 2005 20:09]
Jonas Oreland
Hi, This is a duplicate of http://bugs.mysql.com/bug.php?id=8918 (among others...there were 5 bug reports for the same bug :-)) This is when using more than 4Gb of memory, or actually when writing stuff to disk that was mapped in an address > 4Gb. This has been fixed in src and the fix will be in 4.1.13
[7 Jul 2005 13:57]
Fabio Coatti
I've just reinstalled from src downloaded with bitkeeper, ad the behaviour is exactly the same I've done this: backup (start backup) shutdown all compiled and installed sources (both storage nodes and mgm node) started storage nodes with --initial recovered from backup issued a shutdown from mgm node started storage nodes (without --initial) ... and storage nodes with the same behaviour, i.e.: 005-07-07 15:46:54 [MgmSrvr] INFO -- Node 2: DICT: index 117 activated done 2005-07-07 15:46:54 [MgmSrvr] INFO -- Node 2: DICT: index 118 activated done 2005-07-07 15:46:54 [MgmSrvr] INFO -- Node 2: DICT: index 119 activated done 2005-07-07 15:46:54 [MgmSrvr] INFO -- Node 2: DICT: index 120 activated done 2005-07-07 15:47:10 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected 2005-07-07 15:47:10 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected Am I doing something wrong? I'm pretty sure that the binaries installed are the right ones (just compiled), but which version should visible on mgm-node? it says 4.1.12 even for bk pulled version, right?
[18 Jul 2005 13:03]
Fabio Coatti
I've gust tried with nightly build as of 14/07, but no changes to tish behavoiur...
[20 Jul 2005 8:18]
Jonas Oreland
Hi Can you attatch the error/trace log from the latest build The old error/trace has been fixed...but you might have found another error...
[20 Jul 2005 17:28]
Fabio Coatti
I've compiled bk-version downloaded today; (tomorrow I can try with tar.gz of nightly build, if needed) the behaviour is the same, the errors are here: node 2: Date/Time: Wednesday 20 July 2005 - 19:17:28 Type of error: error Message: Internal program error (failed ndbrequire) Fault ID: 2341 Problem data: DbtupExecQuery.cpp Object of reference: DBTUP (Line: 443) 0x00000008 ProgramName: /usr/sbin/ndbd ProcessID: 25533 TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.3 Version 4.1.12 ***EOM*** Node 3: Date/Time: Wednesday 20 July 2005 - 19:17:28 Type of error: error Message: Internal program error (failed ndbrequire) Fault ID: 2341 Problem data: DbtupExecQuery.cpp Object of reference: DBTUP (Line: 443) 0x00000008 ProgramName: /usr/sbin/ndbd ProcessID: 5177 TraceFile: /var/lib/mysql-cluster/ndb_3_trace.log.19 Version 4.1.12 ***EOM*** I'll attach trace files asap.
[21 Jul 2005 13:47]
Fabio Coatti
I've tried also with 2005-07-16 nightly build, the behaviour is slightly different, The storage nodes reached and passed Phase 9, from mgm node the status was "started", but the tables were not OK. Steps: Compiled snapshot (16/07/05) backupped the wole database stopped and relaunched storage,api and mgm nodes (--initial) restored the backup issued shutdown from mgm node restarted the system, of course without --initial when I've tried to access tables, the database was corrupt (no tables) excerpts from mgm log: 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: DICT: index 126 rebuild done 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: DICT: index 127 rebuild done 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: DICT: index 128 rebuild done 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Communication to Node 4 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Communication to Node 5 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Communication to Node 6 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Communication to Node 7 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Communication to Node 8 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Communication to Node 9 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Communication to Node 10 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Communication to Node 0 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Start phase 8 completed (system restart) 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Start phase 9 completed (system restart) 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Started (version 4.1.12) 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Communication to Node 4 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Communication to Node 5 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Communication to Node 6 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Communication to Node 7 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Communication to Node 8 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Communication to Node 9 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Communication to Node 10 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Communication to Node 0 opened 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Start phase 8 completed (system restart) 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Start phase 9 completed (system restart) 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 3: Started (version 4.1.12) 2005-07-21 11:11:07 [MgmSrvr] INFO -- Node 2: Node 4 Connected 2005-07-21 11:11:08 [MgmSrvr] INFO -- Node 2: Node 4: API version 4.1.12 2005-07-21 11:11:08 [MgmSrvr] INFO -- Node 3: Node 4 Connected 2005-07-21 11:11:08 [MgmSrvr] INFO -- Node 3: Node 4: API version 4.1.12 2005-07-21 11:11:08 [MgmSrvr] INFO -- Node 3: Node 1: API version 4.1.12 2005-07-21 11:11:08 [MgmSrvr] INFO -- Node 2: Node 1: API version 4.1.12 2005-07-21 11:11:08 [MgmSrvr] INFO -- Node 3: Prepare arbitrator node 1 [ticket=3f2e000138a90f28] 2005-07-21 11:11:09 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [ticket=3f2e000138a90f28] 2005-07-21 11:11:53 [MgmSrvr] INFO -- Mgmt server state: nodeid 5 reserved for ip X.Y.Z.W, m_reserved_nodes 000000000000002e. 2005-07-21 11:11:53 [MgmSrvr] INFO -- Node 3: Node 5 Connected 2005-07-21 11:11:53 [MgmSrvr] INFO -- Node 2: Node 5 Connected 2005-07-21 11:11:53 [MgmSrvr] INFO -- Node 2: Node 5: API version 4.1.12 2005-07-21 11:11:53 [MgmSrvr] INFO -- Node 3: Node 5: API version 4.1.12 2005-07-21 11:12:06 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected 2005-07-21 11:12:06 [MgmSrvr] ALERT -- Node 3: Node 2 Disconnected 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Communication to Node 2 closed 2005-07-21 11:12:06 [MgmSrvr] ALERT -- Node 3: Network partitioning - arbitration required 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: President restarts arbitration thread [state=7] 2005-07-21 11:12:06 [MgmSrvr] ALERT -- Node 3: Arbitration won - positive reply from node 1 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: GCP Take over started 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: LCP Take over started 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: ParticipatingDIH = 0000000000000000 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: ParticipatingLQH = 0000000000000000 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000] 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000] 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000] 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: GCP Take over completed 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: LCP Take over completed (state = 4) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: ParticipatingDIH = 0000000000000000 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: ParticipatingLQH = 0000000000000000 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000] 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000] 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000] 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Table 0: TabCopyStatus: 4 TabUpdateStatus: 2 TabLcpStatus: 3 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 0: noLcpReplicas==0 0(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 1: noLcpReplicas==0 0(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Table 1: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 0: noLcpReplicas==0 0(on 2)=5(Idle) 1(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 1: noLcpReplicas==0 0(on 2)=5(Idle) 1(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Table 2: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 0: noLcpReplicas==0 0(on 2)=5(Idle) 1(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 1: noLcpReplicas==0 0(on 3)=5(Idle) 1(on 2)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Table 3: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 0: noLcpReplicas==0 0(on 2)=5(Idle) 1(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 1: noLcpReplicas==0 0(on 3)=5(Idle) 1(on 2)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Table 4: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 0: noLcpReplicas==0 0(on 2)=5(Idle) 1(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 1: noLcpReplicas==0 0(on 3)=5(Idle) 1(on 2)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Table 5: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 0: noLcpReplicas==0 0(on 2)=5(Idle) 1(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 1: noLcpReplicas==0 0(on 3)=5(Idle) 1(on 2)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Table 6: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3 [CUT] 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 0: noLcpReplicas==0 0(on 2)=5(Idle) 1(on 3)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Node 3: Fragment 1: noLcpReplicas==0 0(on 3)=5(Idle) 1(on 2)=5(Idle) 2005-07-21 11:12:06 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 000000000000002a. 2005-07-21 11:12:06 [MgmSrvr] WARNING -- Allocate nodeid (2) failed. Connection from ip X.Y.Z.W. Returned error string "Id 2 already allocated by an other node." 2005-07-21 11:12:06 [MgmSrvr] INFO -- Mgmt server state: node id's 2 4 connected but not reserved 2005-07-21 11:12:06 [MgmSrvr] INFO -- Mgmt server state: node id's 1 not connected but reserved 2005-07-21 11:12:07 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected 2005-07-21 11:12:07 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000022. 2005-07-21 11:12:07 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 reserved for ip X.Y.Z.Q, m_reserved_nodes 000000000000002a. 2005-07-21 11:12:08 [MgmSrvr] INFO -- Node 1: Node 3 Connected 2005-07-21 11:12:09 [MgmSrvr] WARNING -- Allocate nodeid (2) failed. Connection from ip X.Y.Z.W. Returned error string "Id 2 already allocated by an other node." 2005-07-21 11:12:09 [MgmSrvr] INFO -- Mgmt server state: node id's 2 4 connected but not reserved 2005-07-21 11:12:09 [MgmSrvr] INFO -- Mgmt server state: node id's 1 not connected but reserved 2005-07-21 11:12:12 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 reserved for ip X.Y.Z.W, m_reserved_nodes 000000000000002e. 2005-07-21 11:12:13 [MgmSrvr] INFO -- Node 1: Node 2 Connected error: Date/Time: Thursday 21 July 2005 - 11:12:06 Type of error: error Message: Pointer too large Fault ID: 2306 Problem data: DbtupExecQuery.cpp Object of reference: DBTUP (Line: 604) 0x00000008 ProgramName: /usr/sbin/ndbd ProcessID: 10204 TraceFile: /var/lib/mysql-cluster/ndb_3_trace.log.22 Version 4.1.12 ***EOM*** Date/Time: Thursday 21 July 2005 - 11:12:05 Type of error: error Message: Pointer too large Fault ID: 2306 Problem data: DbtupExecQuery.cpp Object of reference: DBTUP (Line: 604) 0x00000008 ProgramName: /usr/sbin/ndbd ProcessID: 16174 TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.13 Version 4.1.12 ***EOM*** And this, if can be useful: Date/Time: Thursday 21 July 2005 - 11:12:53 Type of error: error Message: Arbitrator shutdown Fault ID: 2305 Problem data: Arbitrator decided to shutdown this node Object of reference: QMGR (Line: 3796) 0x00000008 ProgramName: /usr/sbin/ndbd ProcessID: 16342 TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.14 Version 4.1.12 ***EOM***
[21 Jul 2005 13:48]
Fabio Coatti
Trace files for nightly build 16-07-05
Attachment: ndb_2_trace.log.14.gz (application/x-gzip, text), 55.69 KiB.
[21 Jul 2005 13:49]
Fabio Coatti
Trace files for nightly build 16-07-05
Attachment: ndb_3_trace.log.22.gz (application/x-gzip, text), 29.69 KiB.
[21 Jul 2005 14:22]
Jonas Oreland
Hi, I'm a bit confused...this resembles the bug report that I refered to earlier that has been fixed from _4.1.13_ But you say that you downloaded and compiled latest..and still error logs show version 4.1.12 /Jonas
[21 Jul 2005 14:35]
Fabio Coatti
I understand, but I've downloaded this tar file: mysql-4.1.14-nightly-20050716.tar.gz and yesterday I've installed latest BK tree. The ndbd executable is compiled today; I'll try again once more to be sure to have compiled the right package, just to be on the safe side (I really hope this is my fault). I'll report back asap.
[22 Jul 2005 14:22]
Fabio Coatti
I've tried again, after cleaning up the whole tree, this time with 4.1.14-nightly-20050716 The problems seems almost vanished, so I suppose some problme on my side. Anyway, I've seen only one issue in cluster restarting, as shown here: 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 2: Communication to Node 0 opened 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 2: Start phase 8 completed (system restart) 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 2: Start phase 9 completed (system restart) 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 2: Started (version 4.1.14) 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 3: Start phase 2 completed (node restart) 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 2: Node 4 Connected 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 2: Node 5 Connected 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 2: Node 4: API version 4.1.14 2005-07-22 14:30:00 [MgmSrvr] INFO -- Node 2: Node 5: API version 4.1.14 2005-07-22 14:30:01 [MgmSrvr] INFO -- Node 2: Node 1: API version 4.1.14 2005-07-22 14:30:02 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected 2005-07-22 14:30:02 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000036. 2005-07-22 14:30:02 [MgmSrvr] WARNING -- Allocate nodeid (3) failed. Connection from ip X.Y.Z.W. Returned error string "Id 3 already allocated by another node." 2005-07-22 14:30:02 [MgmSrvr] INFO -- Mgmt server state: node id's 3 connected but not reserved 2005-07-22 14:30:02 [MgmSrvr] INFO -- Mgmt server state: node id's 1 not connected but reserved I'm still investigating this, but it's hard to reproduce, so I suppose that this bug can be closed, thanks for the help.
[22 Jul 2005 14:28]
Jonas Oreland
Closed...as it after some investigation already was reported and solved.