Description:
Observed when coding a new thread routine. This is a theoretical problem, never seen with the current code.
Usually, when GR starts a thread, the starting method does:
> start_method
> {
> lock(run_lock)
>
> launch_thread
>
> while (!running)
> {
> mysql_cond_wait(&run_cond, &run_lock); << Step A
> }
>
> unlock(run_lock)
> }
And the thread handler methods do:
>
> thread_handler_method
> {
> lock(run_lock)
> running=true << Step B
> mysql_cond_broadcast(&run_cond);
> unlock(run_lock)
>
> execution
>
> lock(run_lock)
> running=false << Step C
> mysql_cond_broadcast(&run_cond);
> unlock(run_lock)
>
> }
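What is not taken into account here is that if the thread has near-zero execution time, Steps B and C can both execute between the moment the waiter in Step A is signaled and the moment it reacquires run_lock and re-checks the flag.
So the flag changes to true and then back to false before the loop in A evaluates it. The start method never sees running == true and goes back to waiting for a signal that will not come.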
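The interleaving can be sketched in standard C++, with `std::mutex` and `std::condition_variable` standing in for the MySQL mutex and condition wrappers. This is only an illustration under assumptions: the names `RacyState`, `racy_handler` and `racy_start_and_wait` are hypothetical, the worst-case ordering is forced by joining the handler before the first check (the real start method holds the lock while launching), and a timeout is added so the demonstration returns instead of blocking forever.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for run_lock, run_cond and the running flag.
struct RacyState {
  std::mutex run_lock;
  std::condition_variable run_cond;
  bool running = false;
};

// Handler with an empty "execution" phase: Step B and Step C run back to back.
void racy_handler(RacyState &s) {
  {
    std::lock_guard<std::mutex> guard(s.run_lock);
    s.running = true;  // Step B
    s.run_cond.notify_all();
  }
  // execution: nothing to do
  {
    std::lock_guard<std::mutex> guard(s.run_lock);
    s.running = false;  // Step C
    s.run_cond.notify_all();
  }
}

// Returns true if the starter observed running == true, false if it gave up
// after `timeout`. The start method in the report has no timeout.
bool racy_start_and_wait(RacyState &s, std::chrono::milliseconds timeout) {
  std::thread t(racy_handler, std::ref(s));
  // Assumption for the demo: force the worst case, where both Step B and
  // Step C complete before the starter checks the flag.
  t.join();
  std::unique_lock<std::mutex> lock(s.run_lock);
  assert(!s.running);  // the flag already went true and back to false
  // Step A, with a timeout so the demonstration terminates.
  return s.run_cond.wait_for(lock, timeout, [&s] { return s.running; });
}
```

Calling `racy_start_and_wait(state, std::chrono::milliseconds(50))` returns false here: the starter never observes `running == true`, which corresponds to the unbounded wait in the original loop.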
Looking at the PSI thread keys in the code, we get the list of threads to check for this issue:
extern PSI_thread_key
key_GR_THD_applier_module_receiver,
key_GR_THD_cert_broadcast,
key_GR_THD_delayed_init,
key_GR_THD_plugin_session,
key_GR_THD_group_partition_handler,
key_GR_THD_recovery;
How to repeat:
This was seen in new thread code where no execution code was being activated.
The code above can be put in some plugin method and run to see that it can get stuck, but AFAIK no test can be created for this under the current code.
Suggested fix:
The slave threads deal with this by using, for example, a thread id on start and a termination flag on stop.
A simpler solution is to add a termination flag that can be used here. So the code would be:
> start_method
> {
> lock(run_lock)
>
> terminated= false;
> launch_thread
>
> while (!running && !terminated)
> {
> mysql_cond_wait(&run_cond, &run_lock);
> }
>
> unlock(run_lock)
> }
and in the handler
> thread_handler_method
> {
> lock(run_lock)
> running=true
> mysql_cond_broadcast(&run_cond);
> unlock(run_lock)
>
> execution
>
> lock(run_lock)
> running=false
> terminated= true;
> mysql_cond_broadcast(&run_cond);
> unlock(run_lock)
>
> }
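A minimal sketch of the suggested fix in standard C++, again with `std::mutex` and `std::condition_variable` standing in for the MySQL wrappers. The names `FixedState`, `fixed_handler` and `fixed_start_and_wait` are hypothetical, and the `join()` at the end is there only to keep the sketch self-contained; it is not part of the proposal.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for run_lock, run_cond and the two flags.
struct FixedState {
  std::mutex run_lock;
  std::condition_variable run_cond;
  bool running = false;
  bool terminated = false;
};

// Handler with an empty "execution" phase, as in the problematic case.
void fixed_handler(FixedState &s) {
  {
    std::lock_guard<std::mutex> guard(s.run_lock);
    s.running = true;
    s.run_cond.notify_all();
  }
  // execution: nothing to do
  {
    std::lock_guard<std::mutex> guard(s.run_lock);
    s.running = false;
    s.terminated = true;  // records that the thread did start and finish
    s.run_cond.notify_all();
  }
}

// Returns true if the thread was still running when the starter woke up,
// false if it had already terminated. Either way the starter returns.
bool fixed_start_and_wait(FixedState &s) {
  std::unique_lock<std::mutex> lock(s.run_lock);
  s.terminated = false;
  std::thread t(fixed_handler, std::ref(s));
  while (!s.running && !s.terminated) {
    s.run_cond.wait(lock);  // releases run_lock while blocked
  }
  assert(s.running || s.terminated);  // loop exit condition
  const bool still_running = s.running;
  lock.unlock();
  t.join();  // sketch only: lets the handler finish before returning
  return still_running;
}
```

Because `terminated` stays true after the handler finishes, the wait loop exits even when the starter misses the short window in which `running` was true. The return value depends on scheduling, so callers should treat both outcomes as a successful start.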