Bug #80958 Ndb_slave_max_replicated_epoch contains garbage value on slow host
Submitted: 5 Apr 2016 9:25
Modified: 11 Aug 2016 14:52
Reporter: Magnus Blåudd
Status: Closed
Category: MySQL Cluster: Replication
Severity: S2 (Serious)
Version: 7.5.2
OS: Any
Assigned to:
CPU Architecture: Any

[5 Apr 2016 9:25] Magnus Blåudd
Description:
The testcase ndb_rpl_last_conflict_epoch_var fails quite consistently with a result diff when run with valgrind.

It seems to be a problem with the "Ndb_slave_max_replicated_epoch" status variable, which returns what seems to be a bogus value.

Have checked the slave's log file, where one can see printouts saying "NDB Slave : MaxReplicatedEpoch set to 0 (0/0) at Slave start".

How to repeat:

Running the test with valgrind will in most cases produce a result diff:

$7.5> ./mtr ndb_rpl_last_conflict_epoch_var --valgrind
<snip>
@@ -36,12 +36,12 @@
 still < the saved session epoch
 Max replicated epoch should be < committed epoch
 @max_replicated_epoch < @master_session_epoch
-1
+0
 Some invariants :
 @conflict_detected_epoch != 0
 1
 @conflict_detected_epoch > @max_replicated_epoch
-1
+0
 Primary has detected the problem, but Secondary
 is unaware as the slave is offline.
 Now bring it back

With some extra printouts of the affected @variables, you get the below log showing that @max_replicated_epoch seems to contain garbage:
@@ -28,26 +28,34 @@
 insert into test.t1 values (2,2,repeat("A", 1000));
 commit;
 Capturing session epoch from Master
+@master_session_epoch
+111669149700
 Show that last_conflict_epoch is 0
 Variable_name	Value
 Ndb_conflict_last_conflict_epoch	0
 Restart Secondary->Primary replication
 Check that max_replicated_epoch is 
 still < the saved session epoch
+@max_replicated_epoch
+90194313230
+@master_session_epoch
+111669149700
 Max replicated epoch should be < committed epoch
 @max_replicated_epoch < @master_session_epoch
-1
+0
 Some invariants :
 @conflict_detected_epoch != 0
 1
 @conflict_detected_epoch > @max_replicated_epoch
-1
+0
 Primary has detected the problem, but Secondary
 is unaware as the slave is offline.
 Now bring it back
 Now sync up
 Now lets check that the max replicated epoch
 on the Primary is == the last_conflict_epoch
+@max_replicated_epoch
+133143986201
 @max_replicated_epoch = @conflict_detected_epoch
 1
 Cleanup

Suggested fix:
Since the problem occurs consistently when mysqld is run under valgrind, it feels like this is timing related. Tested by adding some sleeps just to see if waiting for something to be replicated would solve the problem, but unfortunately it didn't.

Also considered that the values were being saved in @variables in the different mysqld(s) in a wrong way, but I couldn't determine that this was the case either. Alternatively, one could save the values in mysqltest's $variables.
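As a minimal sketch of that alternative (hypothetical, untested; query_get_value and let are standard mysqltest constructs, and the variable names are illustrative):

```
# Capture the status value into a mysqltest $variable on the reading
# connection, instead of into a server-side @variable
let $max_replicated_epoch = query_get_value(SHOW STATUS LIKE 'Ndb_slave_max_replicated_epoch', Value, 1);
echo max_replicated_epoch: $max_replicated_epoch;
```

A $variable lives in the test driver rather than in any one mysqld session, which sidesteps questions about which server connection the @variable was set on.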

Notice that it's weird that the comparison fails at all. The check does "90194313230 < 111669149700", which should normally evaluate to 1. But since there seems to be some garbage in the @max_replicated_epoch variable, the comparison probably overflows somehow.
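For reference, the suspicious values can be decoded into the same hi/lo pair format the slave log itself prints (cf. "MaxReplicatedEpoch set to 0 (0/0)" above). A quick sketch, assuming the usual NDB packing of a 64-bit epoch as (GCI_hi << 32) | GCI_lo:

```python
def decode_epoch(epoch):
    """Split a 64-bit NDB epoch into its (GCI_hi, GCI_lo) parts."""
    return epoch >> 32, epoch & 0xFFFFFFFF

# The three @max_replicated_epoch / @master_session_epoch values
# from the log above
for e in (90194313230, 111669149700, 133143986201):
    hi, lo = decode_epoch(e)
    print(f"{e} = {hi}/{lo}")
# Prints:
# 90194313230 = 21/14
# 111669149700 = 26/4
# 133143986201 = 31/25
```

Decoded this way, the unexpected 90194313230 is epoch 21/14 while the saved session epoch 111669149700 is 26/4, i.e. the value is at least a well-formed (if stale-looking) epoch rather than random bits, which would fit the timing-related theory above.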
[11 Aug 2016 14:52] Jon Stephens
Change in test only. Fixed in NDB 7.4.13 and 7.5.4. Closed.