Bug #93347 VIEW_CHANGE_LOG_EVENT may not be logged on a overloaded primary
Submitted: 26 Nov 2018 14:56 Modified: 26 Feb 2019 13:20
Reporter: Nuno Carvalho Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S3 (Non-critical)
Version:5.7 OS:Any
Assigned to: CPU Architecture:Any

[26 Nov 2018 14:56] Nuno Carvalho
Description:
The VIEW_CHANGE_LOG_EVENT, meaning that some member joined
the group, is logged on all members.
The VIEW_CHANGE_LOG_EVENT must be logged on the message stream
order, since it does mark the stream position on which the new
member joined.
On SECONDARY members this is achieved by queueing it in-order on
the applier queue, which already ensures the stream order.
On PRIMARY members, this is achieved by waiting that all precedent
local transactions are already committed, only then the
VIEW_CHANGE_LOG_EVENT is committed. No posterior transaction can
commit before the VIEW_CHANGE_LOG_EVENT since the plugin pipeline
will ensure that.

The wait for the local transactions commit is done on
handlers/certification_handler.cc
```
538 int Certification_handler::wait_for_local_transaction_execution()
539 {
...
565   if ((error= sql_command_interface->
566                   wait_for_server_gtid_executed(local_gtid_certified_string,
567                                                 GTID_WAIT_TIMEOUT)))
568   {
569     /* purecov: begin inspected */
570     if (error == 1) //timeout
571     {
572       log_message(MY_ERROR_LEVEL,
573                   "Timeout when waiting for the server to execute local "
574                   "transactions in order assure the group change proper logging");
575     }
576     else
577     {
578       log_message(MY_ERROR_LEVEL,
579                   "Error when waiting for the server to execute local "
580                   "transactions in order assure the group change proper logging");
581     }
582     /* purecov: end */
583   }
584   delete sql_command_interface;
585   DBUG_RETURN(error);
586 }
```

The GTID_WAIT_TIMEOUT is 30 seconds, which means, despite unlikely,
that if the already prepared and allowed to commit transactions take
more than 30 seconds to commit, the wait is interrupted and a error
is thrown.

The GTID_WAIT_TIMEOUT is 30 seconds, which means, despite unlikely,
that if the already prepared and allowed to commit transactions take
more than 30 seconds to commit, the wait is interrupted and a error
is thrown.

That error will make the VIEW_CHANGE_LOG_EVENT not to be logged:
```
522   //Assure the last known local transaction was already executed
523   error= wait_for_local_transaction_execution();
524 
525   /**
526     Create a transactional block for the View change log event
527     GTID
528     BEGIN
529     VCLE
530     COMMIT
531   */
532   if (!error)
533     error= inject_transactional_events(pevent,cont);
534 
535   DBUG_RETURN(error);
536 }
```

If the new member that joined the group recovers from this member it
will not find a VIEW_CHANGE_LOG_EVENT which will make it never
leave the RECOVERY state.

How to repeat:
See the code.

Suggested fix:
Extend the change on http://rb.no.oracle.com/rb/r/20494/diff/22-23#2
to handle the generic case.

Or change the timeout to infinite.
[26 Feb 2019 13:20] David Moss
Posted by developer:
 
Thank you for your feedback, this has been fixed in upcoming versions and the following was added to the 5.7.26 and 8.0.16 change logs:

On overloaded servers there was a possibility that when a member joined the group, the VIEW_CHANGE_LOG_EVENT event which marks that point was not logged in the correct place. This could lead to errors in the data transfer to the newly joining server and data divergence. Now, the VIEW_CHANGE_LOG_EVENT event is logged in the correct place in the binary log. In addition, there are some warnings logged about the delays in logging the event.