Description:
The testcase ndb_rpl_last_conflict_epoch_var fails quite consistently with a result diff when run with valgrind.
It seems to be a problem with the "Ndb_slave_max_replicated_epoch" status variable which return what seems to be a bogus value.
Have checked the slaves log file and there one can see a printouts saying "NDB Slave : MaxReplicatedEpoch set to 0 (0/0) at Slave start"
How to repeat:
Running the test with valgrind will in most cases produce a result diff:
$7.5> ./mtr ndb_rpl_last_conflict_epoch_var --valgrind
<snip>
@@ -36,12 +36,12 @@
still < the saved session epoch
Max replicated epoch should be < committed epoch
@max_replicated_epoch < @master_session_epoch
-1
+0
Some invariants :
@conflict_detected_epoch != 0
1
@conflict_detected_epoch > @max_replicated_epoch
-1
+0
Primary has detected the problem, but Secondary
is unaware as the slave is offline.
Now bring it back
With some extra printouts of the affected @values, you get bwlow log showing the @max_replicated_epoch seems to contain garbage:
@@ -28,26 +28,34 @@
insert into test.t1 values (2,2,repeat("A", 1000));
commit;
Capturing session epoch from Master
+@master_session_epoch
+111669149700
Show that last_conflict_epoch is 0
Variable_name Value
Ndb_conflict_last_conflict_epoch 0
Restart Secondary->Primary replication
Check that max_replicated_epoch is
still < the saved session epoch
+@max_replicated_epoch
+90194313230
+@master_session_epoch
+111669149700
Max replicated epoch should be < committed epoch
@max_replicated_epoch < @master_session_epoch
-1
+0
Some invariants :
@conflict_detected_epoch != 0
1
@conflict_detected_epoch > @max_replicated_epoch
-1
+0
Primary has detected the problem, but Secondary
is unaware as the slave is offline.
Now bring it back
Now sync up
Now lets check that the max replicated epoch
on the Primary is == the last_conflict_epoch
+@max_replicated_epoch
+133143986201
@max_replicated_epoch = @conflict_detected_epoch
1
Cleanup
Suggested fix:
Since problem occurs concistently when the mysqld is run in valgrind it feels like this is timing related. Tested by adding some sleeps just to see if waiting for something to be replicated would solve the problem, but unfortunately it didn't.
Also considered that saving the values in @variables in the different mysqld(s) was done in a wrong way but I couldn't determine that was the case either. Alternatively one could save the values in mysqltest's $variables.
Notice that it's weird that the comparision fails. The check does " 90194313230 < 111669149700" which should normally evaluate to 1. But since there seems to be some garbage in the @max_replicated_epoch variable the comparison probably overflows somehow.