Bug #88329 Group replication thread's initialization methods can get stuck waiting
Submitted: 2 Nov 2017 10:40 Modified: 22 Feb 2018 9:12
Reporter: Pedro Gomes
Status: Closed
Category:MySQL Server: Group Replication Severity:S3 (Non-critical)
Version:8.0.3 OS:Any
Assigned to: CPU Architecture:Any

[2 Nov 2017 10:40] Pedro Gomes
Description:
Observed when coding a new thread routine. This is a theoretical problem that has never been seen in the current code.

So usually, when GR starts a thread, the starting methods do:

> start_method
> {
>   lock(run_lock) 
>
>   launch_thread
>
>   while (!running)
>   {
>     mysql_cond_wait(&run_cond, &run_lock);  << Step A
>   }
>
>   unlock(run_lock)
> }

And threads handling methods do:

>
> thread_handler_method
> {
>   lock(run_lock) 
>   running=true << Step B
>   mysql_cond_broadcast(&run_cond);  
>   unlock(run_lock)
>
>   execution
>
>   lock(run_lock) 
>   running=false << Step C
>   mysql_cond_broadcast(&run_cond);
>   unlock(run_lock)
>
> }

What this does not take into account is that, if the thread has near-zero execution time, Steps B and C can both run before the waiter in Step A reacquires run_lock and rechecks the flag.
The flag then changes to true and back to false while A is still in its loop, so A never observes running == true and keeps waiting.
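
A minimal sketch of that interleaving, using standard C++ primitives in place of the MySQL ones. The names RacyState and starter_sees_running are hypothetical and not part of the GR code. To make the bad ordering deterministic, the sketch assumes the worst case and runs Steps B and C inside a single critical section, and it uses a timed wait so the stuck state is reported instead of blocking forever.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for run_lock, run_cond and the running flag.
struct RacyState {
  std::mutex run_lock;
  std::condition_variable run_cond;
  bool running = false;
};

// Returns true if the starter observed running == true.
// Returns false if the wait timed out, i.e. the original untimed
// loop in Step A would still be blocked.
bool starter_sees_running(std::chrono::milliseconds timeout) {
  RacyState s;
  std::unique_lock<std::mutex> lock(s.run_lock);

  // The worker cannot take run_lock until the starter begins waiting.
  std::thread worker([&s] {
    // Assumption: B and C share one critical section to force the
    // worst-case ordering (zero execution time between them).
    std::lock_guard<std::mutex> guard(s.run_lock);
    s.running = true;          // Step B
    s.run_cond.notify_all();
    s.running = false;         // Step C
    s.run_cond.notify_all();
  });

  // Step A: equivalent to while (!running) wait, but with a deadline.
  bool seen = s.run_cond.wait_for(lock, timeout,
                                  [&s] { return s.running; });
  lock.unlock();
  worker.join();
  return seen;
}
```

Calling starter_sees_running(std::chrono::milliseconds(100)) returns false here: by the time the starter gets the lock back, the flag has already returned to false.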

Looking at the PSI keys in the code gives a list of threads to check for this issue.

extern PSI_thread_key
               key_GR_THD_applier_module_receiver,
               key_GR_THD_cert_broadcast,
               key_GR_THD_delayed_init,
               key_GR_THD_plugin_session,
               key_GR_THD_group_partition_handler,
               key_GR_THD_recovery;
 

How to repeat:
This was seen in new thread code where no execution code was being activated.

The code above can be placed in some plugin method and run to see that it can get stuck, but AFAIK no test can be written to check this against the current code.

Suggested fix:
The slave code, for example, uses a thread id on start and a termination flag on stop.

A simpler solution is to add a termination flag that can be used here. So the code would be

> start_method
> {
>   lock(run_lock) 
>
>   terminated= false;
>   launch_thread
>
>   while (!running && !terminated)
>   {
>     mysql_cond_wait(&run_cond, &run_lock);  
>   }
>
>   unlock(run_lock)
> }

and in the handler 

> thread_handler_method
> {
>   lock(run_lock) 
>   running=true 
>   mysql_cond_broadcast(&run_cond);  
>   unlock(run_lock)
>
>   execution
>
>   lock(run_lock) 
>   running=false 
>   terminated= true;
>   mysql_cond_broadcast(&run_cond);
>   unlock(run_lock)
>
> }
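
The suggested fix can be sketched with standard C++ primitives as follows. This is an illustration under assumptions, not the patch that was applied: GuardedState, guarded_handler and guarded_start are hypothetical names, std::mutex and std::condition_variable stand in for the MySQL wrappers, and the timed wait exists only so a hang would be reported rather than block.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for run_lock, run_cond and the two flags.
struct GuardedState {
  std::mutex run_lock;
  std::condition_variable run_cond;
  bool running = false;
  bool terminated = false;
};

// Thread handler: same shape as the pseudocode, with an empty body.
void guarded_handler(GuardedState &s) {
  {
    std::lock_guard<std::mutex> guard(s.run_lock);
    s.running = true;
    s.run_cond.notify_all();
  }
  // execution would go here (near-zero time in the problem case)
  {
    std::lock_guard<std::mutex> guard(s.run_lock);
    s.running = false;
    s.terminated = true;  // stays true, so the starter cannot miss it
    s.run_cond.notify_all();
  }
}

// Returns true if the starter left its wait loop before the timeout.
bool guarded_start(std::chrono::milliseconds timeout) {
  GuardedState s;
  std::unique_lock<std::mutex> lock(s.run_lock);
  s.terminated = false;

  std::thread worker(guarded_handler, std::ref(s));

  // Equivalent to while (!running && !terminated) wait.
  bool released = s.run_cond.wait_for(lock, timeout, [&s] {
    return s.running || s.terminated;
  });
  lock.unlock();
  worker.join();
  return released;
}
```

Because terminated is never reset by the handler, the wait condition is satisfied whether the starter wakes up after Step B or only after Step C, so guarded_start is expected to return true regardless of how short the execution phase is.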
[22 Feb 2018 9:12] David Moss
Posted by developer:
 
As this is theoretical and not seen by users, closing without a change log entry.
[23 Apr 2018 13:38] Nuno Carvalho
Fixed in 8.0.11.
[31 May 2018 14:48] David Moss
Reclosing.