Bug #11726 Error Message does not give cause: Slave SQL thread retried transaction 10 time(s)
Submitted: 4 Jul 2005 16:21 Modified: 1 Sep 2005 19:31
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine   Severity: S3 (Non-critical)
Version: 5.1.0-wl2325-wl1354-new   OS: Linux (Linux)
Assigned to: Tomas Ulin   CPU Architecture: Any

[4 Jul 2005 16:21] Jonathan Miller
Description:
050704 18:05:38 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
050704 18:05:38 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master1.000001' position 3429007

The above error message gives no indication of what the original problem was or is. Error messages of this type are pretty much useless to the customer and to QA.
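For reference, the only action the current message suggests is raising slave_transaction_retries and restarting the slave. A minimal sketch of that, assuming the variable is dynamic in this build (the value 20 is just an example):

  -- check the current retry limit (the default is 10)
  SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';

  -- raise it and restart the SQL thread; the "SLAVE START" in the
  -- message is the older synonym for START SLAVE
  SET GLOBAL slave_transaction_retries = 20;
  START SLAVE;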

An example of a better error message (note: I am not sure this is actually a deadlock condition; it is just an example):

050704 18:05:38 [ERROR] Slave SQL thread retried transaction 10 time(s) due to deadlock condition. Consider raising the value of the slave_transaction_retries variable.
050704 18:05:38 [ERROR] Error running query, slave SQL thread aborted due to slave_transaction_retries reaching max value. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master1.000001' position 3429007

NOTE: I have seen the following error message many times with no other message before or after it. When this happens, the user has no idea where to look or what to do. This will generate support calls and emails and waste both the customer's time and ours.

050704 18:05:38 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master1.000001' position 3429007

Fix what problem???
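As things stand, about the only place left to look for more detail is the slave status, e.g.:

  -- the Last_Errno and Last_Error fields are where the SQL thread's
  -- most recent error should show up, since the log line above
  -- never mentions it
  SHOW SLAVE STATUS\G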

How to repeat:
Run Sabre test with Cluster Replication

Suggested fix:
We should know why we are having to retry; the reason for the retry should be part of the error message. We should also know why the SQL thread was aborted, and that should be part of the error message as well.
[31 Aug 2005 21:57] Jonathan Miller
I will see if I can reproduce.
[1 Sep 2005 19:31] Jonathan Miller
Trying to recreate this, I now get these messages:
replication started in log 'FIRST' at position 4
050901 20:44:50 [Note] NDB Binlog: CREATE TABLE Event: REPL$atae/dcacache
050901 20:50:12 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:50:27 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:50:43 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:50:59 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:51:17 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:51:35 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:51:54 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233

050901 20:53:15 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

050901 20:53:15 [Warning] Slave: Got temporary error 233 'Out of operation records in transaction coordinator (increase MaxNoOfConcurrentOperations)' from NDB Error_code: 1297

050901 20:53:15 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master1.000001' position 49123051
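The temporary error above at least points at the actual cause: cluster configuration. The change it suggests would be a config.ini edit on the data nodes, roughly as follows (the value is illustrative; the default is 32768, and the change needs a rolling restart of the data nodes to take effect):

  [ndbd default]
  # applies to all data nodes
  MaxNoOfConcurrentOperations = 65536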

Trying to recover from this problem has produced other issues; I will open another bug for those. The error message given back in this case is much better than it was. Thanks for fixing it :-)