Description:
THD has the following member:
killed_state volatile killed;
This is used all over the place to detect if threads are killed or not.
There's one problem with this however, volatile doesn't do barriers. So everything could be awfully different for each load that's done.
I think this is actually a bug in replication. In start_slave_thread() there's not only this accessed but a bunch of other things that are volatile too. It appears that there's likely an opportunity to wedge things if a slave thread is killed at exactly the wrong time.
there's also the possibility to lose the killing of a thread in debug builds (a DBUG_EVALUATE_IF will read&write thd->killed, possibly clobbering a changed value).
There's also several places where killed is likely to be missed, potentially for quite a while... some of the other logic around the place looks suspect with a volitle thd->killed, at the very least leading to rather unexpected behaviour.
But the big thing? Since killed is used for more things, there exists the possibility of code that's like "if(!killed) killed=thing;" which would completely fall apart on POWER (and ARM and maybe others) in the current code (this may currently be the case... there's *sooo* many places to audit to work out if this is an actual bug).
How to repeat:
This is a tricky one... there's several scenarios described above where this could be a problem, largely caught by code analysis as especially in the various error paths, mtr test suite is likely quite inadequate to catch them.
If this does all magically currently work, it's not because of design, it's distinctly by accident.
I've observed some strange shutdown behaviour on POWER8 at least, which could be explained by the use of volatile around thd->killed and threads just reading stale values for a long time.
I'm pretty sure the logic in opt_explain.cc for explaining other threads could be convinced to do the wrong thing when reading a stale killed value, although a very close audit would be required to be sure.
My patch is if anything, wonderfully defensive.
Suggested fix:
my patch!
I *think* the correct thing to do here is to remove const for a bunch of the methods that do end up checking an atomic variable... which is why there's changes in const in my patch. I'm going to ask Paul McKenney about this as he's been knee deep in the C++ spec relating to multicore and can likely give a definitive answer.
This changes plugin ABI slightly:
< int thd_killed(const void* thd);
< void thd_set_kill_status(const void* thd);
---
> int thd_killed(void* thd);
> void thd_set_kill_status(void* thd);
but the idea of plugin ABI is pure fantasy anyway :)