Bug #11726 Error Message does not give cause: Slave SQL thread retried transaction 10 time(s)
Submitted: 4 Jul 2005 16:21 Modified: 1 Sep 2005 19:31
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine   Severity: S3 (Non-critical)
Version: 5.1.0-wl2325-wl1354-new   OS: Linux (Linux)
Assigned to: Tomas Ulin   CPU Architecture: Any

[4 Jul 2005 16:21] Jonathan Miller
Description:
050704 18:05:38 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
050704 18:05:38 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master1.000001' position 3429007

The above error message gives no indication of what the original problem was or is. Error messages of this type are pretty much useless to the customer and to QA.
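For reference, the only action the current message suggests is raising slave_transaction_retries and restarting the slave. A minimal sketch of that, assuming the variable is dynamic in this build (the value 20 is just an example):

  -- check the current retry limit (the default is 10)
  SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';

  -- raise it and restart the SQL thread; the "SLAVE START" in the
  -- message is the older synonym for START SLAVE
  SET GLOBAL slave_transaction_retries = 20;
  START SLAVE;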

An example of a better error message (note: I am not sure this is actually a deadlock condition; it is just an example):

050704 18:05:38 [ERROR] Slave SQL thread retried transaction 10 time(s) due to deadlock condition. Consider raising the value of the slave_transaction_retries variable.
050704 18:05:38 [ERROR] Error running query, slave SQL thread aborted due to slave_transaction_retries reaching max value. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master1.000001' position 3429007

NOTE: I have seen the following error message many times with no other message before or after it. When this happens, the user has no idea where to look or what to do. This will generate support calls and emails and waste both the customer's time and ours.

050704 18:05:38 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master1.000001' position 3429007

Fix what problem???
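As things stand, about the only place left to look for more detail is the slave status, e.g.:

  -- the Last_Errno and Last_Error fields are where the SQL thread's
  -- most recent error should show up, since the log line above
  -- never mentions it
  SHOW SLAVE STATUS\G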

How to repeat:
Run Sabre test with Cluster Replication

Suggested fix:
We should know why we are having to retry; the reason for the retry should be part of the error message. We should also know why the SQL thread was aborted, and that should be part of the error message as well.
[31 Aug 2005 21:57] Jonathan Miller
I will see if I can reproduce.
[1 Sep 2005 19:31] Jonathan Miller
Trying to recreate this, I now get these messages:
replication started in log 'FIRST' at position 4
050901 20:44:50 [Note] NDB Binlog: CREATE TABLE Event: REPL$atae/dcacache
050901 20:50:12 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:50:27 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:50:43 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:50:59 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:51:17 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:51:35 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233
050901 20:51:54 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 233

050901 20:53:15 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

050901 20:53:15 [Warning] Slave: Got temporary error 233 'Out of operation records in transaction coordinator (increase MaxNoOfConcurrentOperations)' from NDB Error_code: 1297

050901 20:53:15 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master1.000001' position 49123051
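The temporary error above at least points at the actual cause: cluster configuration. The change it suggests would be a config.ini edit on the data nodes, roughly as follows (the value is illustrative; the default is 32768, and the change needs a rolling restart of the data nodes to take effect):

  [ndbd default]
  # applies to all data nodes
  MaxNoOfConcurrentOperations = 65536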

Trying to recover from this problem has produced other issues; I will open another bug for those. The error message given back in this case is much better than it was. Thanks for fixing it :-)