Description:
SHOW ENGINE INNODB STATUS prints rw-lock stats as part of SEMAPHORE section.
Based on code analysis and run I found that stats are inconsistent and incorrect.
Let me explain this with the code snippet.
Pre-text
========
"show engine innodb status" print following information about 3 types of locks along with summary.
RW-shared spins 60819, rounds 979369, OS waits 16525
RW-excl spins 12945, rounds 353215, OS waits 7308
RW-sx spins 12024, rounds 49540, OS waits 537
Spin rounds per wait: 16.10 RW-shared, 27.29 RW-excl, 4.12 RW-sx
Logic is simple.
- Flow try to acquire a lock, if it fails to acquire, it would start spinning this would cause "spins" counter to increment.
- Spinning happens for srv_n_spin_wait_rounds (default 30) rounds. How many rounds are needed for each spins is aggregated in "rounds" counter.
- If despite of spin there is no success in acquriring lock then it causes OS-waits. reflected in "OS waits" counter.
With that background let's understand the issues (All code references are based on mysql-8.0.19 tag)
inconsistency with shared lock stats calculation
================================================
When lock enter spinning it would spin for <= srv_n_spin_wait_rounds (default 30) rounds. So the "rounds" aggregator should increment based on number-of-rounds completed.
If you look at the logic of other locks (X and SX) then it does what is expected.
rw_lock_x_lock_func()
699 spin_count += i;
rw_lock_sx_lock_func()
793 spin_count += i;
(Increment spin_count by round).
If you look at logic of shared lock it does 1 increment for N rounds of spin cycle.
rw_lock_s_lock_spin()
340 ++spin_count;
So to be consistent it should use the same protocol that other 2 locks are using. Counting total rounds (and not representational round) for the given spin cycle.
incorrectness with all lock stats calculation
=============================================
As explained above each lock is suppose to count the rounds and add it to spin_lock.
While the flow is spinning the cycle if the lock is released it will come out of spin-cycle even before it completes it 30 rounds (this is obivious. no point in spinning if you got the needed things). This can be refered as short-circuit spin-cycle.
But since there are multiple contenders for the same condition only one would succeed.
So let's say a thread start spinning and on 8th attempt (out of 30) it sees the lock is released and comes out of the spin-cycle loop but before it can acquire the lock some other threads takes it over the said thread now needs to resume it spin-cycle from the point where it left (note: from the point where it left).
Now the spin_count increment logic will add round/attempt count to spin_count and on re-try it will *re-add* those existing attempt one more time when it should add the difference only.
spin_count += i;
Say spin_count = 0 .... i = 8 and thread get signal that lock is ready.
spin_count is set to 8 (spin_count += i;) but thread fails to get the lock and restart spinning.
Now i = 30 (thread complete the spinning attempts) this time spin_count is incremented to 38 and not 30.
So the said 8 attempts is re-added twice.
[This could be small off error but imagine when the said locks and counters are done millions of times small error would aggravate to something really off messing the final count].
670 lock_loop:
671
......
......
......
689 /* Spin waiting for the lock_word to become free */
690 os_rmb;
691 while (i < srv_n_spin_wait_rounds && lock->lock_word <= X_LOCK_HALF_DECR) {
692 if (srv_spin_wait_delay) {
693 ut_delay(ut_rnd_interval(0, srv_spin_wait_delay));
694 }
695
696 i++;
697 }
698
699 spin_count += i;
700
701 if (i >= srv_n_spin_wait_rounds) {
702 os_thread_yield();
703
704 } else {
705 goto lock_loop;
706 }
707 }
inconsistency with SX lock stats calculation
============================================
As per the protocol (used by other 2 locks S and X) when thread fails to acquire a lock it start spinning.
This causes "spins" counter to increment once to reflect for the said attempt to acquire lock thread had to spin.
But as explained above before the thread completes it round of 30 attempts it could be interrupted multiple times.
Each time it is interrupted it shouldn't cause "spins" counter to increment as it is part of the same "spins" action.
This protocol is correctly handled in S and X locks as follows:
rw_lock_s_lock_spin()
322 rw_lock_stats.rw_s_spin_wait_count.inc();
rw_lock_x_lock_func()
683 } else {
684 if (!spinning) {
685 spinning = true;
686 rw_lock_stats.rw_x_spin_wait_count.inc();
687 }
688
But the same logic in SX causes "spin" counter to increment number of times the said spin loop is interrupted (including awakes post OS-wait).
rw_lock_sx_lock_func()
775 rw_lock_stats.rw_sx_spin_wait_count.add(spin_wait_count);
....
781 ++spin_wait_count;
-------------------------------------------------------------------------------------------------------------
Why all these things matter:
* If you DB is facing contention then you look at the PFS or SHOW ENGINE .... to check what is causing it.
* Based on the inputs of SHOW ENGINE you are suppose to re-tune innodb_spin_wait_delay and innodb_spin_wait_pause_multiplier.
* If the stats reflecting effect of these 2 values are off then your tunning can go wrong.
* Also, innodb_spin_wait_pause_multiplier tunning is becoming more and more important with latency differences orignating from x86 skylake/cascade-lake/(other old mode), ARM (absence of pause), etc....
* So having proper monitoring data is important for tunning things correctly.
Let me given you and example:
I tried same workload for same time with same seed condition and collected the stats.
Number of-course will not match but just look at the by what % ratios are off.
Effect is more profound when you are analyzing the real hot-spot.
before:
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 254923
OS WAIT ARRAY INFO: signal count 194719
RW-shared spins 90595, rounds 117625, OS waits 27338
RW-excl spins 9704, rounds 284216, OS waits 8344
RW-sx spins 38777, rounds 306944, OS waits 2730
Spin rounds per wait: 1.30 RW-shared, 29.29 RW-excl, 7.92 RW-sx
after:
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 240651
OS WAIT ARRAY INFO: signal count 179004
RW-shared spins 92767, rounds 1465426, OS waits 31100
RW-excl spins 10900, rounds 338101, OS waits 9880
RW-sx spins 35817, rounds 276527, OS waits 2649
Spin rounds per wait: 15.80 RW-shared, 31.02 RW-excl, 7.72 RW-sx
[There are other performance problems especially given the code is really hot.
I will point them in the different issues since they are performance problems and not a bug].
----------------------------------------------------------------------
Testing: Basic mtr testing done. No related regression observed.
How to repeat:
"SHOW ENGINE INNODB STATUS" and code analysis/inspection.
Suggested fix:
I will be attaching patch (along with OCA). One can also track it here.
https://github.com/mysqlonarm/mysql-server/tree/bug-rw_lock_stats
https://github.com/mysqlonarm/mysql-server/commit/4256a9a3f12dce5960a1e8a57a7c7b4c4b3a3a61