Bug #70318 some transactions may fail in SQL_thread without actually retrying
Submitted: 12 Sep 2013 16:13 Modified: 20 Nov 2013 17:08
Reporter: Santosh Praneeth Banda
Status: No Feedback
Category: MySQL Server: Replication    Severity: S2 (Serious)
Version: 5.6.12    OS: Any
Assigned to:    CPU Architecture: Any

[12 Sep 2013 16:13] Santosh Praneeth Banda
Description:
Some transactions may fail in the SQL thread without actually being retried slave_transaction_retries times. This can happen when a transaction sets thd->is_fatal_error even though the underlying error is actually a lock wait timeout. The thd->is_fatal_error flag is set here:

int mysql_update(...)
{
  ...
  if (table->file->is_fatal_error(error, HA_CHECK_DUP_KEY))
    flags|= ME_FATALERROR; /* Other handler errors are fatal */
  ...
}

virtual bool is_fatal_error(int error, uint flags)
{
  if (!handler::is_fatal_error(error, flags) ||
      error == HA_ERR_NO_PARTITION_FOUND ||
      error == HA_ERR_NOT_IN_LOCK_PARTITIONS)
    return FALSE;
  return TRUE;
}

has_temporary_error() then checks thd->is_fatal_error like this:

static int has_temporary_error(THD *thd)
{
  ...
  if (thd->is_fatal_error || !thd->is_error())
    DBUG_RETURN(0);
  ...
}

In 5.1, has_temporary_error() does not check thd->is_fatal_error; it only tests whether an error is pending:

if (!thd->is_error())
  DBUG_RETURN(0);

How to repeat:
Generate transactions that unnecessarily set the thd->is_fatal_error flag and check that the SQL thread does not retry them.

Suggested fix:
Use a different flag rather than thd->is_fatal_error. This is the revision that introduced the use of thd->is_fatal_error: http://bazaar.launchpad.net/~mysql/mysql-server/5.6/revision/2876.415.1
[20 Oct 2013 17:08] Venkatesh Duggirala
Hello Santosh, 

We have looked into the code snippet and the concern you raised above.
Our analysis follows.

In the mysql_update() function:

int mysql_update(...)
{
  ...
  while (!(error= info.read_record(&info)) && !thd->killed)
  {
    ...
    error= table->file->ha_update_row(...);  /* or ha_bulk_update_row() */
    ...
    if (table->file->is_fatal_error(error, HA_CHECK_DUP_KEY))
      flags|= ME_FATALERROR; /* Other handler errors are fatal */
    ...
  }
  ...
}

Note the while condition: read_record() is where the records get locked. If the SQL thread is executing an UPDATE, it tries to lock the required records in read_record(), and it enters the while loop only if it succeeds in taking the lock. The lock wait timeout of the internal SQL thread is set to one year, so once the SQL thread has acquired the lock in read_record(), it will hold it for up to a year. Hence ha_update_row() in the code above cannot return ER_LOCK_WAIT_TIMEOUT unless an update transaction takes more than one year, which is not a usual case.

Note: the one-year lock timeout for the SQL thread cannot be altered.
See the code at the end of the init_slave_thread() function for details.

We have also done some testing to verify the above analysis.
If you have any test cases or scenarios where the SQL thread does not retry
a transaction even though the error was a lock wait timeout, please let us
know and we will look into it.
[21 Nov 2013 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".