MySQL Bugs: #12937: Unable to recover slave from 'Out of operation records' condition

Bug #12937	Unable to recover slave from 'Out of operation records' condition - cluster/rbr
Submitted:	1 Sep 2005 20:10	Modified:	15 Sep 2005 14:55
Reporter:	Jonathan Miller	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.1.2	OS:	Linux (Linux)
Assigned to:	Tomas Ulin	CPU Architecture:	Any

Description:
I was recreating Bug 11726 to verify a fix for the error message. Once I received the error, I decided to try and recover from the problem.

SETUP:
-----------------------------------------------------------------
Master Cluster: 
NDB08/NDB09.
2 NDBD nodes each
NDB08 = ndb_mgmd
NDB08 = mysqld

Slave Cluster:
NDB10/NDB12
2 NDBD nodes each
NDB10 = ndb_mgmd
NDB10 = mysqld
--------------------------------------------------------------------
Setps:
---------------------------------------------------------------------
1) Brought up both clusters and mysqld processes
2) On Master > reset master;
3) On Slave > resert slave; & start slave;
# Grants issues
4) On master > source /home/ndbdev/jmiller/setupFiles/grant.sql
# Slave now connects and is happy
5) On Master > create database atae;
6) On Slave > SHOW DATABASES;
# atae is on slave.
7) on NDB12 start /space/ndbtest/go7 script pointing at NDB08
# dcacache table created on both master and slave.
8)on NDB12 start /space/ndbtest/go8 script pointing at NDB08
# records are being loaded on master and replicated to slave.
# Continue to watch slave using "SELECT COUNT(*) FROM dcacache;"
# Notice that the number of records has stopped increasing. Using the SQL command a 
# couple more times and I get a really wierd result returned
# The count jumps from 430000 to 4294967295 and back.

mysql> select count(*) from dcacache;
+----------+
| count(*) |
+----------+
|   430000 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) from dcacache;
+----------+
| count(*) |
+----------+
|   430000 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from dcacache;
+------------+
| count(*)   |
+------------+
| 4294967295 |
+------------+
1 row in set (0.01 sec)

mysql> select count(*) from dcacache;
+----------+
| count(*) |
+----------+
|   430000 |
+----------+
1 row in set (0.00 sec)

# Found the new error that I was looking for in Bug 11726
050901 20:53:15 [Warning] Slave: Got temporary error 233 'Out of operation records in transaction coordinator (increase MaxNoOfConcurrentOperations)' from NDB Error_code: 1297

# Decided to try and recover from this.
9) On slave > stop slave;

-- With have to do in 2 parts due to 8k limit --

How to repeat:
see above:

10) edit config.ini and added MaxNoOfConcurrentOperations: 500000 under NDBD DEFAULT
11) ndb_mgm ndb10:14000 -e "1 stop"
# Manager shuts down.
12) ndb_mgmd -f config.ini
# Manager starts up 
13) ndb_mgm ndb10:14000 -e "2 restart"
# Data node goes into restart.
14) ndb_mgm ndb10:14000 -e "2 status" 
# Continue to check 2 until back up and running completely.
15) ndb_mgm ndb10:14000 -e "3 restart"
# Data node goes into restart.
16) ndb_mgm ndb10:14000 -e "3 status"
Connected to Management Server at: ndb10:14000
Node 3: starting (Phase 4) (Version 5.1.0)
# Data node goes bye bye
ndb_mgm ndb10:14000 -e "3 status"
Connected to Management Server at: ndb10:14000
Node 3: not connected
# Error log ndb_3_error.log:
Current byte-offset of file-pointer is: 867
Date/Time: Thursday 1 September 2005 - 20:59:46
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbdihMain.cpp
Object of reference: DBDIH (Line: 8410) 0x0000000a
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 26952
TraceFile: /space/run/ndb_3_trace.log.1
Version 5.1.0 (a_drop5p3)
***EOM***
NDB_1_Cluster.log:
2005-09-01 20:58:55 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened2005-09-01 20:58:55 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2005-09-01 20:58:55 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened2005-09-01 20:58:55 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000002.
2005-09-01 20:58:55 [MgmSrvr] INFO     -- Node 5: Communication to Node 3 opened2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 5: Node 3 Connected
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 4: Node 3 Connected
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 2: Node 3: API version 5.1.0
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: Node 2 Connected
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: Node 5 Connected
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: CM_REGCONF president = 4, own Node = 3, our dynamic id = 6
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: Node 2: API version 5.1.2
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: Node 4: API version 5.1.2
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: Node 5: API version 5.1.0
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: Start phase 1 completed
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 4: Node 3: API version 5.1.0
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 5: Node 3: API version 5.1.0
2005-09-01 20:58:57 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (node restart)
2005-09-01 20:58:58 [MgmSrvr] INFO     -- Node 3: Receive arbitrator node 1 [ticket=11db0003130cb6f0]
2005-09-01 20:59:00 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (node restart)
2005-09-01 20:59:00 [MgmSrvr] INFO     -- Node 3: Start phase 4 completed (node restart)
2005-09-01 20:59:01 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2005-09-01 20:59:01 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed2005-09-01 20:59:01 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2005-09-01 20:59:01 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2005-09-01 20:59:01 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed2005-09-01 20:59:01 [MgmSrvr] ALERT    -- Node 4: Arbitration check won - node group majority
2005-09-01 20:59:01 [MgmSrvr] INFO     -- Node 4: President restarts arbitration thread [state=6]
2005-09-01 20:59:01 [MgmSrvr] ALERT    -- Node 5: Node 3 Disconnected
2005-09-01 20:59:01 [MgmSrvr] INFO     -- Node 5: Communication to Node 3 closed2005-09-01 20:59:04 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened2005-09-01 20:59:04 [MgmSrvr] INFO     -- Node 5: Communication to Node 3 opened2005-09-01 20:59:05 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
17) retry to start node 3 and recieved the same.

18) shutdown cluster: ndb_mgm ndb10:14000 -e "shutdown"
[ndbdev@ndb10 run]$ ndb_mgm $ndbc -e "shutdown"
Connected to Management Server at: ndb10:14000
3 NDB Cluster storage node(s) have shutdown.
NDB Cluster management server shutdown.

19) vi config.ini and reduce the vaule down to 100000 and try to restart cluster
# All data nodes die. The following is found in the cluster log:
2005-09-01 21:03:38 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2005-09-01 21:03:38 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed2005-09-01 21:03:38 [MgmSrvr] ALERT    -- Node 2: Node 5 Disconnected
2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 2: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo =
2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 closed2005-09-01 21:03:38 [MgmSrvr] ALERT    -- Node 4: Node 5 Disconnected
2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo =
2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 4: Communication to Node 5 closed2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 1: Node 5 Connected
2005-09-01 21:03:38 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2005-09-01 21:03:39 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
# Now both node 3 and node 5 that are on NDB12 are getting:
Date/Time: Thursday 1 September 2005 - 21:06:46
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbdihMain.cpp
Object of reference: DBDIH (Line: 8410) 0x0000000a
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 27169
TraceFile: /space/run/ndb_5_trace.log.2
Version 5.1.0 (a_drop5p3)
***EOM***

# Node 2 and 4 are getting:
Date/Time: Thursday 1 September 2005 - 21:06:01
Type of error: error
Message: Arbitrator shutdown
Fault ID: 2305
Problem data: Arbitrator decided to shutdown this node
Object of reference: QMGR (Line: 3790) 0x0000000a
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 5326
TraceFile: /space/run/ndb_2_trace.log.2
Version 5.1.2 (a_drop5p4)
***EOM***
20) vi config.ini, remove entry and try to restart cluster
# geting same response as in step 19.

All files will be moved to ndb under jonathan home directory /bug### where ### == this bug report number.

I copied the filesystem and started up a cluster ok with it.

So I cannot reproduce the SR crash in DBDIH.

I don't know if there could be some issue with the nodes being different versions 5.1.0 and 5.1.2?  Don't think so.

Can you try to see if you can do a SR with this filesystem, in your setup, with the latest wl2325 src?

Will take a look at it per your request.

Can not reproduce with latest clone of 5.1.2. [jbm]