MySQL Bugs: #2011: rare race condition producing "binlog has bad magic number" error in slave

Bug #2011	rare race condition producing "binlog has bad magic number" error in slave
Submitted:	4 Dec 2003 6:10	Modified:	4 Dec 2003 6:34
Reporter:	Guilhem Bichot	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	4.0.16	OS:	Any (all)
Assigned to:	Guilhem Bichot	CPU Architecture:	Any

Description:
This causes a rare failure of the rpl_max_relay_size.test test. Is present in 4.0.16, 4.1.1 and the 5.0 tree.

How to repeat:
The failure can be made repeatable by adding a sleep() before check_binlog_magic()
around line 3540 in sql/slave.cc, and then running the test.

Suggested fix:
will push a fix now.

More precisely, the error says that the slave SQL thread could not use one of the relay logs because it has a bad magic number.

Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Fixed in the 4.0 tree:
ChangeSet@1.1640.1.1, 2003-12-04 15:30:14+01:00, guilhem@mysql.com

Hello, 

I am a PhD student in the University of Central Florida, and my research involves debugging techniques, especially in the area of concurrent bugs. 

I have been trying to understand in depth this bug, but I am having some difficulties. I will really appreciate your help, if you can point me in the right direction and help me understand this bug better. 

I was able to reproduce the bug by it by adding sleep() in the specified location (in slave.cc:next_event after is_active() and before check_binlog_magic()). 

However, the failure was not as I expected. In particular, I have printed a detailed trace, which shows what happens and where my confusion comes from.  

1) After the SQL thread checks is_active(), it goes to sleep. 
2) The IO thread does a lot of work, during the sleep and rotates the logs, etc. 
3) The SQL thread comes back from sleep and does check_binlog_magic(). However, there is no error and we continue with another iteration of the loop!!! The SQL thread detects that the "hot log got rotated under our feet" and does reopen_relay_log(). 
4) The SQL thread reads many more relay log events, before if fails in check_binlog_magic and when it fails there is no interleaving of other threads between is_active() and check_binlog_magic(). 

Thank you very much, 
Any insight will be highly appreciated. 

Martin

Hello Martin. This bug was fixed six years ago; you say you reproduced this bug, do you mean you are using a 4.0.16 version intentionally to study this old already-fixed bug?

Yes, I am sorry, I should have clarified. 
Yes - I am using 4.0.16 and intentionally reproduced this bug in order to study it. When I apply the patch that you submitted, the bug is does not appear any more. 

My problem was in understanding this bug, even though I spend quite some time. Initially I thought the problem would be: 
- The SQL thread checks is_active() 
   - The IO thread rotates the binlog, or does something else. 
- The SQL thread comes back and binlog_magic fails. 

However, my experiments show that binlog_magic does not fail when the SQL thread comes back, and the failure happens much later, after many more iterations of next_event(). It also appears that the SQL thread has recovered from the interleaving by doing reopen_relay_log(). On the iteration that binlog_magic fails, there is no remote interleaving during the execution of function next_event(). 

I hope that makes some sense. I am not a mysql developer, so I am not very familiar with the code. However, I have found MySQL to be very valuable for research - because of the detailed bug comments, test cases, patches, etc. 

Thank you for your time. 
Martin

Hello Martin. I wanted to build 4.0.16, so that I could see the bug in action (I have no remembrance of its details), alas it does not build on my recent Ubuntu (it requires LinuxThreads which I don't have, my Ubuntu has NPTL). Thus, I'm sorry, I don't know how to answer your questions. If you get stuck, I suggest picking a different, much more recent bug report, as study target. You'll get fresh memories from the developer(s) who fixed it, and a source tree which builds on nowadays OS :-)

Thank you very much for taking the time to look at this. 
I checked my code and I realized that I was also not able to use 4.0.16. 
Instead I am using 4.1.16 - and I manually re-introduced the bug by changing the code. I had forgotten that I did that - I thought I was using 4.0.16. 
A patch, to re-introduce the bug would look something like this: 

In slave.cc 

- if (!hot_log) /* if hot_log, we already have this mutex */
-    pthread_mutex_lock(log_lock);
  if (rli->relay_log.is_active(rli->linfo.log_file_name))

  DBUG_ASSERT(rli->cur_log_fd == -1);
+ sleep(1);
- if (check_binlog_magic(cur_log,&errmsg))
- { 
-   if (!hot_log)
-     pthread_mutex_unlock(log_lock);
-   goto err;
- }
- if (!hot_log)
-   pthread_mutex_unlock(log_lock);
- continue;
- }
- if (!hot_log)
-    pthread_mutex_unlock(log_lock);

+ if (check_binlog_magic(cur_log,&errmsg))
+   goto err;
+ continue; 
+ }

Thanks again for the time that you already spent looking at this. 
I was actually hoping that you would remember the details of the bug, without having to debug yourself. 

Martin

Hello Martin. So you patched 4.1.16. But that version is more than 2 years newer than 4.0.16, so that can explain why you don't see, when testing 4.1.16, the same bug as with 4.0.16.

Yes - you are probably right. 
Even though I thought that I am triggering the same bug, because I still get the same error "binlog has bad magic number". Also I checked that even in newer versions, such as 5.0.82 the slave.cc file is almost unchanged for all those years.

Hello Martin. When I compare slave.cc in recent 4.1 vs recent 5.0, I see 390 blocks of differences (so-called "diff hunks"); the file has changed a lot, actually.

Hmm. In this case, it is probably as you said - the code has changed and the bug is likely triggered in a different way because of that. 
Thanks again for spending extra time and looking into this. 

There are a bunch of other mysql concurrency bugs that I can look at :)
Also, if my research leads to something useful, I hope I can contribute a little bit on some OPEN bugs as well. 

Martin