| Bug #12937 | Unable to recover slave from 'Out of operation records' condition - cluster/rbr | ||
|---|---|---|---|
| Submitted: | 1 Sep 2005 20:10 | Modified: | 15 Sep 2005 14:55 |
| Reporter: | Jonathan Miller | Email Updates: | |
| Status: | Closed | ||
| Category: | Server: Cluster | Severity: | S2 (Serious) |
| Version: | 5.1.2 | OS: | Linux (Linux) |
| Assigned to: | Tomas Ulin | Target Version: | |
[1 Sep 2005 20:10]
Jonathan Miller
[1 Sep 2005 20:10]
Jonathan Miller
10) edit config.ini and added MaxNoOfConcurrentOperations: 500000 under NDBD DEFAULT 11) ndb_mgm ndb10:14000 -e "1 stop" # Manager shuts down. 12) ndb_mgmd -f config.ini # Manager starts up 13) ndb_mgm ndb10:14000 -e "2 restart" # Data node goes into restart. 14) ndb_mgm ndb10:14000 -e "2 status" # Continue to check 2 until back up and running completely. 15) ndb_mgm ndb10:14000 -e "3 restart" # Data node goes into restart. 16) ndb_mgm ndb10:14000 -e "3 status" Connected to Management Server at: ndb10:14000 Node 3: starting (Phase 4) (Version 5.1.0) # Data node goes bye bye ndb_mgm ndb10:14000 -e "3 status" Connected to Management Server at: ndb10:14000 Node 3: not connected # Error log ndb_3_error.log: Current byte-offset of file-pointer is: 867 Date/Time: Thursday 1 September 2005 - 20:59:46 Type of error: error Message: Internal program error (failed ndbrequire) Fault ID: 2341 Problem data: DbdihMain.cpp Object of reference: DBDIH (Line: 8410) 0x0000000a ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd ProcessID: 26952 TraceFile: /space/run/ndb_3_trace.log.1 Version 5.1.0 (a_drop5p3) ***EOM*** NDB_1_Cluster.log: 2005-09-01 20:58:55 [MgmSrvr] INFO -- Node 4: Communication to Node 3 opened2005-09-01 20:58:55 [MgmSrvr] INFO -- Node 1: Node 3 Connected 2005-09-01 20:58:55 [MgmSrvr] INFO -- Node 2: Communication to Node 3 opened2005-09-01 20:58:55 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000002. 2005-09-01 20:58:55 [MgmSrvr] INFO -- Node 5: Communication to Node 3 opened2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 2: Node 3 Connected 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 5: Node 3 Connected 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 4: Node 3 Connected 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 2: Node 3: API version 5.1.0 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: Node 2 Connected 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: Node 4 Connected 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: Node 5 Connected 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: CM_REGCONF president = 4, own Node = 3, our dynamic id = 6 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: Node 2: API version 5.1.2 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: Node 4: API version 5.1.2 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: Node 5: API version 5.1.0 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: Start phase 1 completed 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 4: Node 3: API version 5.1.0 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 5: Node 3: API version 5.1.0 2005-09-01 20:58:57 [MgmSrvr] INFO -- Node 3: Start phase 2 completed (node restart) 2005-09-01 20:58:58 [MgmSrvr] INFO -- Node 3: Receive arbitrator node 1 [ticket=11db0003130cb6f0] 2005-09-01 20:59:00 [MgmSrvr] INFO -- Node 3: Start phase 3 completed (node restart) 2005-09-01 20:59:00 [MgmSrvr] INFO -- Node 3: Start phase 4 completed (node restart) 2005-09-01 20:59:01 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected 2005-09-01 20:59:01 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed2005-09-01 20:59:01 [MgmSrvr] INFO -- Node 1: Node 3 Connected 2005-09-01 20:59:01 [MgmSrvr] ALERT -- Node 4: Node 3 Disconnected 2005-09-01 20:59:01 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed2005-09-01 20:59:01 [MgmSrvr] ALERT -- Node 4: Arbitration check won - node group majority 2005-09-01 20:59:01 [MgmSrvr] INFO -- Node 4: President restarts arbitration thread [state=6] 2005-09-01 20:59:01 [MgmSrvr] ALERT -- Node 5: Node 3 Disconnected 2005-09-01 20:59:01 [MgmSrvr] INFO -- Node 5: Communication to Node 3 closed2005-09-01 20:59:04 [MgmSrvr] INFO -- Node 2: Communication to Node 3 opened2005-09-01 20:59:04 [MgmSrvr] INFO -- Node 5: Communication to Node 3 opened2005-09-01 20:59:05 [MgmSrvr] INFO -- Node 4: Communication to Node 3 opened 17) retry to start node 3 and recieved the same. 18) shutdown cluster: ndb_mgm ndb10:14000 -e "shutdown" [ndbdev@ndb10 run]$ ndb_mgm $ndbc -e "shutdown" Connected to Management Server at: ndb10:14000 3 NDB Cluster storage node(s) have shutdown. NDB Cluster management server shutdown. 19) vi config.ini and reduce the vaule down to 100000 and try to restart cluster # All data nodes die. The following is found in the cluster log: 2005-09-01 21:03:38 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected 2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 1: Node 3 Connected 2005-09-01 21:03:38 [MgmSrvr] ALERT -- Node 4: Node 3 Disconnected 2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed2005-09-01 21:03:38 [MgmSrvr] ALERT -- Node 2: Node 5 Disconnected 2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 2: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo = 2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 2: Communication to Node 5 closed2005-09-01 21:03:38 [MgmSrvr] ALERT -- Node 4: Node 5 Disconnected 2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo = 2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 4: Communication to Node 5 closed2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 1: Node 5 Connected 2005-09-01 21:03:38 [MgmSrvr] INFO -- Node 1: Node 2 Connected 2005-09-01 21:03:39 [MgmSrvr] INFO -- Node 1: Node 4 Connected # Now both node 3 and node 5 that are on NDB12 are getting: Date/Time: Thursday 1 September 2005 - 21:06:46 Type of error: error Message: Internal program error (failed ndbrequire) Fault ID: 2341 Problem data: DbdihMain.cpp Object of reference: DBDIH (Line: 8410) 0x0000000a ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd ProcessID: 27169 TraceFile: /space/run/ndb_5_trace.log.2 Version 5.1.0 (a_drop5p3) ***EOM*** # Node 2 and 4 are getting: Date/Time: Thursday 1 September 2005 - 21:06:01 Type of error: error Message: Arbitrator shutdown Fault ID: 2305 Problem data: Arbitrator decided to shutdown this node Object of reference: QMGR (Line: 3790) 0x0000000a ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd ProcessID: 5326 TraceFile: /space/run/ndb_2_trace.log.2 Version 5.1.2 (a_drop5p4) ***EOM*** 20) vi config.ini, remove entry and try to restart cluster # geting same response as in step 19. All files will be moved to ndb under jonathan home directory /bug### where ### == this bug report number.
[13 Sep 2005 16:36]
Tomas Ulin
I copied the filesystem and started up a cluster ok with it. So I cannot reproduce the SR crash in DBDIH. I don't know if there could be some issue with the nodes being different versions 5.1.0 and 5.1.2? Don't think so. Can you try to see if you can do a SR with this filesystem, in your setup, with the latest wl2325 src?
[14 Sep 2005 20:28]
Jonathan Miller
Will take a look at it per your request.
[15 Sep 2005 14:55]
Jonathan Miller
Can not reproduce with latest clone of 5.1.2. [jbm]
