Bug #99778 Error when processing certification information in the incremental recovery proc
Submitted: 4 Jun 2020 12:27   Modified: 2 Jul 2020 14:12
Reporter: Snehal Bhavsar
Status: Closed
Category: MySQL Server: Group Replication   Severity: S2 (Serious)
Version: 8.0.18   OS: Any
Assigned to:   CPU Architecture: Any

[4 Jun 2020 12:27] Snehal Bhavsar
Description:
Hi..!! 
We are facing the errors below in an InnoDB Cluster. After the clone process completes successfully, the node restarts and begins incremental distributed recovery, but it gets stuck there: it neither comes back ONLINE nor rejoins the cluster, and after some time it moves to the MISSING state.

Could anyone please look into these errors and suggest a fix?
How can we get the node back ONLINE?

2020-06-04T09:25:20.477859Z 170 [ERROR] [MY-013328] [Repl] Plugin group_replication reported: 'The certification information could not be set in this server: 'Certification information is too large for transmission.''
2020-06-04T09:25:20.478647Z 170 [ERROR] [MY-011624] [Repl] Plugin group_replication reported: 'Error when processing certification information in the incremental recovery process'
2020-06-04T09:25:20.478887Z 170 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.'
2020-06-04T09:25:20.481430Z 170 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2020-06-04T09:25:24.297814Z 158 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='172.31.197.131', master_port= 1122, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2020-06-04T09:25:24.400161Z 158 [ERROR] [MY-011575] [Repl] Plugin group_replication reported: 'All donors left. Aborting group replication incremental recovery.'
2020-06-04T09:25:24.400286Z 158 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.'
2020-06-04T09:25:24.400358Z 158 [Warning] [MY-011646] [Repl] Plugin group_replication reported: 'Skipping leave operation: member already left the group.'
2020-06-04T09:25:24.400388Z 158 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'

More details:
mysql> select count_transactions_rows_validating from performance_schema.replication_group_member_stats;

+------------------------------------+
| count_transactions_rows_validating |
+------------------------------------+
|                           17271312 |
|                           17271298 |
+------------------------------------+

How to repeat:
1. Run the addInstance() command to add the joining instance to the existing InnoDB Cluster.

2. Choose Clone for the recovery process (answer C at the prompt).

3. Once the clone process is completed, monitor the status of that node; it shows
"recoveryStatusText": "Distributed recovery in progress" for a while, with
"status": "RECOVERING"
4. After some time the process ends with the above errors in the logs and the node goes missing:
"status": "(MISSING)"
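The steps above can be sketched in MySQL Shell; the host, port, and account below are illustrative assumptions, not taken from this report:

```javascript
// MySQL Shell (JS mode) -- illustrative sketch only.
var cluster = dba.getCluster();

// Steps 1-2: add the joining instance, selecting clone-based recovery
// up front instead of answering the interactive [C]lone prompt.
cluster.addInstance('clusteradmin@172.31.197.132:1122',
                    {recoveryMethod: 'clone'});

// Steps 3-4: poll the member state; in this bug it stays RECOVERING
// ("Distributed recovery in progress") and eventually shows (MISSING).
cluster.status();
```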
[4 Jun 2020 13:01] Snehal Bhavsar
Error logs of Joiner instance at the time of adding the node to the cluster

Attachment: Error Log Details.txt (text/plain), 19.17 KiB.

[4 Jun 2020 13:27] Nuno Carvalho
Hi Snehal,

Thank you for reporting this issue, it is being analyzed.

As a workaround, two minutes after the error
[ERROR] [MY-013328] [Repl] Plugin group_replication reported: 'The certification information could not be set in this server: 'Certification information is too large for transmission.''
please do:
  RESET SLAVE FOR CHANNEL "group_replication_recovery";
and then rejoin the server.
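Put together, the workaround looks like the following; note that rejoining via START GROUP_REPLICATION is an assumption on my part, since the report does not say how the rejoin was performed (it could equally be done with Cluster.rejoinInstance() from MySQL Shell):

```sql
-- On the stuck joining member, about two minutes after the
-- MY-013328 "Certification information is too large" error.

-- Discard the failed recovery channel state:
RESET SLAVE FOR CHANNEL 'group_replication_recovery';

-- Then rejoin the group:
START GROUP_REPLICATION;
```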

Best regards,
Nuno Carvalho
[4 Jun 2020 15:33] Snehal Bhavsar
Hello All..!!

Yes, this solution worked for us. Running the command to reset the 'group_replication_recovery' channel and then adding the node to the cluster via clone worked, and the node is ONLINE now.

Thanks a lot for the support and for providing an immediate fix for this issue.
[4 Jun 2020 15:51] Nuno Carvalho
Hi Snehal,

Great that the workaround worked; we will work on a fix.

Best regards,
Nuno Carvalho
[13 Jun 2020 8:38] Snehal Bhavsar
Hello all,

I am facing a similar issue today, and the joiner node again fails to recover to the ONLINE state.

In the earlier case, we later noticed that the value of count_transactions_rows_validating was low, which is why the node could be added.

But today I followed exactly the same steps to rejoin the node, and with the higher values of count_transactions_rows_validating shown below, the node is unfortunately not added back to the cluster by this workaround.

mysql> select count_transactions_rows_validating from performance_schema.replication_group_member_stats;
+------------------------------------+
| count_transactions_rows_validating |
+------------------------------------+
|                            3720382 |
|                            3720402 |
|                                  0 |
+------------------------------------+ 

Hence the node goes missing again and again: even when all the incremental recovery lag has been caught up, it ends up with the errors below.

Errors in logs:

2020-06-13T07:13:26.391684Z 142 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='172.31.197.131', master_port= 1122, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2020-06-13T07:13:26.512967Z 142 [ERROR] [MY-011575] [Repl] Plugin group_replication reported: 'All donors left. Aborting group replication incremental recovery.'
2020-06-13T07:13:26.513049Z 142 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.'
2020-06-13T07:13:26.513095Z 142 [Warning] [MY-011646] [Repl] Plugin group_replication reported: 'Skipping leave operation: member already left the group.'
2020-06-13T07:13:26.513117Z 142 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
[13 Jun 2020 15:01] Snehal Bhavsar
The errors seem to be the same as before.

2020-06-13T07:42:52.226511Z 213 [System] [MY-010562] [Repl] Slave I/O thread for channel 'group_replication_recovery': connected to master 'mysql_innodb_cluster_11@172.31.197.131:1122',replication started in log 'FIRST' at position 4
2020-06-13T07:48:23.595309Z 214 [ERROR] [MY-013328] [Repl] Plugin group_replication reported: 'The certification information could not be set in this server: 'Certification information is too large for transmission.''
2020-06-13T07:48:23.595434Z 214 [ERROR] [MY-011624] [Repl] Plugin group_replication reported: 'Error when processing certification information in the incremental recovery process'
2020-06-13T07:48:23.595536Z 214 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.'
2020-06-13T07:48:23.595823Z 214 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2020-06-13T07:48:27.233845Z 212 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='172.31.197.131', master_port= 1122, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2020-06-13T07:48:27.347563Z 212 [ERROR] [MY-011575] [Repl] Plugin group_replication reported: 'All donors left. Aborting group replication incremental recovery.'
2020-06-13T07:48:27.347641Z 212 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.'
2020-06-13T07:48:27.347716Z 212 [Warning] [MY-011646] [Repl] Plugin group_replication reported: 'Skipping leave operation: member already left the group.'
2020-06-13T07:48:27.347742Z 212 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
[13 Jun 2020 15:12] Snehal Bhavsar
Also, please let us know how low the value of count_transactions_rows_validating should be before a joining node can be added to the InnoDB Cluster. This can be a problem in a production environment where the server is under high load: we may have to wait a long time for count_transactions_rows_validating to come down before we are able to join the node.

In short, when the error 'Certification information is too large for transmission.' occurs, what is the maximum acceptable size of the certification information, and is there any way to configure it externally to manage such a situation?
[16 Jun 2020 15:05] Nuno Carvalho
Hi Snehal,

This bug happens when the clone takes a considerable amount of time; to overcome it, please follow the workaround I gave.

Best regards,
Nuno Carvalho
[2 Jul 2020 14:12] Margaret Fisher
Posted by developer:
 
Added changelog entry for MySQL 8.0.22:

While a remote cloning procedure was taking place on a joining member during distributed recovery, Group Replication considered the pre-cloning gtid_executed value of the joining member when identifying the common set of transactions that had been applied on all members. This meant that garbage collection for applied transactions from the group's set of certification information (shown as the count_transactions_rows_validating field in the Performance Schema table replication_group_member_stats) did not take place during the remote cloning procedure. If the remote cloning procedure took a long time, the certification information could therefore get too large to transmit to the joining member when it restarted after the remote cloning procedure, in which case an error was raised and the member was not able to join the group. 

To avoid this issue, Group Replication now considers only group members with ONLINE status when identifying the common set of transactions that have been applied on all members. When a joining member enters ONLINE state after distributed recovery, its certification information is updated with the certification information from the donor at the time when the member joined, and garbage collection takes place for this on future rounds. 

As a workaround for this issue in earlier releases, after the remote cloning operation completes, wait two minutes to allow a round of garbage collection to take place to reduce the size of the group's certification information. Then issue the following statement on the joining member, so that it stops trying to apply the previous set of certification information:

RESET SLAVE FOR CHANNEL group_replication_recovery;
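As a sanity check before issuing that statement, the shrinking of the group's certification information after a garbage-collection round can be watched with the same Performance Schema query used earlier in this report (the changelog above suggests waiting about two minutes for a round to run):

```sql
-- Run on an existing group member; after a garbage-collection round
-- the value should drop noticeably from the pre-clone figure.
SELECT MEMBER_ID, COUNT_TRANSACTIONS_ROWS_VALIDATING
FROM performance_schema.replication_group_member_stats;
```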

Also noted workaround in
https://dev.mysql.com/doc/refman/8.0/en/group-replication-cloning.html