Bug #30752 | rpl_dual_pos_advance valgrind (jump depends on uninitialized LOG_INFO) | ||
---|---|---|---|
Submitted: | 31 Aug 2007 19:12 | Modified: | 28 Nov 2007 18:42 |
Reporter: | Timothy Smith | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.0 | OS: | Any |
Assigned to: | Sven Sandberg | CPU Architecture: | Any |
Tags: | pbfail |
[31 Aug 2007 19:12]
Timothy Smith
[14 Sep 2007 8:46]
Sven Sandberg
== Scenario Causing the Error ====== The THD class has a field `LOG_INFO *current_linfo' which stores the filename of the current logfile used by the thread. When the main thread purges a log file, it first checks whether the log file is in use by any other thread. This is done at sql_repl:212-223 by iterating through the global `threads' variable and checking for each thread whether current_linfo contains the name of the log file to purge. The `current_linfo' is modified concurrently so that it temporarily points to uninitialized memory at the intervals sql_repl:367-369 or sql_repl:1341-1343, and reading the uninitialized memory causes the bug. More precisely, the scenario is as follows. Thread A executes sql_repl.cc:367 or sql_repl.cc:1341. At both these places, current_linfo of Thread A is assigned the address of a local uninitialized struct. Thread B executes sql_repl:215 with tmp pointing to Thread A. This causes a crash inside the call to bcmp() at sql_repl.cc:218. (The struct would be initialized two lines later by Thread A, at sql_repl.cc:369 or sql_repl.cc:1343, respectively.) == Reproducibility ====== I have not been able to reproduce this (after running the rpl_dual_pos_advance test 2800 times). The cause was found by inspecting the code. According to Lars, the two threads causing the error do not normally run at the same time, because one is run on server and the other on slave. Hence, the error only happens when replicating in several steps, so that one instance of the program is both master and slave. In particular, the rpl_dual_pos_advance test replicates in a loop, from I to II and from II to I. == Suggested Action ====== Fix A. There is a simple fix: we currently do "let current_linfo point to a struct on the stack, then initialize the struct". Instead, we should reverse the order, and do "initialize the struct on stack, then let current_linfo point to it". This amounts to reordering the code. Fix B. Lars pointed out that there is an underlying systematic error here. It would be better if LOG_INFO::log_file_name was initialized to the empty string in the constructor. That would make it less easy to make similar bugs in the future. Warning: we still have to lock before writing log_file_name. If we would write to the log_file_name while another thread was examining it, we would temporarily remove the trailing zero so that the other thread would access it. Fix C. Mats pointed out that we should always put a lock on variables that are readable from other threads. In particular, we should lock before assigning &linfo to thd->current_linfo.
[14 Sep 2007 8:47]
Sven Sandberg
== Detailed description of what goes wrong ====== This includes a proof that nothing else caused the bug. Let us roll up the traceback: 1. At bcmp:59, one of the following is uninitialized. - s1 Or Something That It Points To (OSTIPT) - s2 OSTIPT - len 2. The variables at bcmp:59:my_bcmp() map to variables in sql_repl.cc:218: - log_name OSTIPT - linfo->log_file_name OSTIPT - log_name_len log_name_len is initialized (and even correct if log_name is correctly initialized). So either: (2.1) log_name OSTIPT is uninitialized, or (2.2) linfo->log_file_name OSTIPT is uninitialized. We consider these cases separately. 2.1. Let us first investigate whether log_name OSTIPT is uninitialized at sql_repl.cc:218. log_name in log_in_use() maps to log_info.log_filename in log.cc:1174. First, note that this is a stack variable, so the address of log_info.log_filename is not uninitialized. It remains to check whether STIPT is uninitialized. The line log.cc:1173-1174 can only be reached if none of the two gotos above was executed. Therefore, the call to find_log_pos at log.cc:1164 returned 0. To avoid digging into code that probably works, we make the following assumption: **Assumption 2.1: my_b_gets() in mf_iocache2.c:184 is correctly implemented. We now prove the following. **Lemma 2.1: If Assumption 2.1 holds, and find_log_pos() is called and returns 0, then find_log_pos() initializes its parameter linfo->log_file_name. **Proof. Let us look inside find_log_pos() at log.cc:820. find_log_pos() can only return 0 if the second break statement of the for loop is reached, and the first break statement is not reached. If the first break is not reached, it means that the return value from the call to my_b_gets() at log.cc is >1. By Assumption 2.1, the return value from my_b_gets() is >1 only if it copied a null-terminated string of length >0 to the second parameter. From this, it follows that fname in find_log_pos() is initialized before find_log_pos() returns. But fname is an alias for linfo->log_file_name in find_log_pos(), so it follows that linfo->log_file_name is properly initialized by find_log_pos(). **End of proof of Lemma 2.1 We now apply Lemma 2.1, and obtain that (under assumption 2.1), log_info.log_filename is correctly initialized at log.cc:1174. This shows what we wanted: both log_name and what it points to are correctly initialized. 2.2. It remains to check whether linfo->log_file_name OSTIPT is uninitialized at sql_repl.cc:218. Since sql_repl.cc:218 is reached, linfo is not null, by the condition of the surrounding if(). Note that linfo is an alias for the current_linfo member of the THD class, for one of the threads in the global variable threads. Therefore, the current_linfo of some thread is not null. Let us find the possible points in the code when current_linfo of some thread is not null. Note that current_linfo is set to 0 in the constructor of THD, at sql_class.cc:213. Before this point, it was uninitialized, but the THD object cannot have been added to the global threads variable (this is seen by following all function, method, and constructor calls in THD::THD() before sql.class.cc:213). Therefore, current_linfo can only by non-null after the code explicitly sets it to non-null, and before it re-sets it to null again. Let us grep for current_linfo in the code (I removed matches in binary files, SCCS files and comments): $ grep -n -H current_linfo */* libmysqld/sql_class.cc:213: current_linfo = 0; sql/repl_failsafe.cc:301: thd->current_linfo = &linfo; sql/repl_failsafe.cc:381: thd->current_linfo = 0; sql/sql_class.cc:213: current_linfo = 0; sql/sql_class.h:1520: LOG_INFO* current_linfo; sql/sql_repl.cc:184: if ((linfo = tmp->current_linfo)) sql/sql_repl.cc:215: if ((linfo = tmp->current_linfo)) sql/sql_repl.cc:367: thd->current_linfo = &linfo; sql/sql_repl.cc:736: thd->current_linfo = 0; sql/sql_repl.cc:751: thd->current_linfo = 0; sql/sql_repl.cc:1341: thd->current_linfo = &linfo; sql/sql_repl.cc:1429: thd->current_linfo = 0; Since there are no places where the address of current_linfo is taken, there are no aliases for it. Hence, it can only be non-null at one of the three places found by grep: - sql/repl_failsafe.cc:301 - sql/sql_repl.cc:367 - sql/sql_repl.cc:1341 At all these places, current_linfo is set to the address of a local variable. This address is correct and initialized until the function returns. At all three places, thd->current_linfo is set to 0 before the function returns. We now have the following conceivable scenarios when thd->current_info could be uninitialized: 2.2.1 thread A assigns the address of a local struct to thd->current_linfo; thread B loads thd->current_linfo into a local variable; thread A returns from the current function and invalidates the local struct; and thread B reads from its copy of &thd->current_linfo. However, threads A and B properly lock LOCK_thread_count, so this is not possible. 2.2.2 thread A assigns the address of a local struct to thd->current_linfo, and the data at that address is uninitialized. Then thread B reads thd->current_linfo before thread A initializes the data. This can happen in two of the three places above, namely sql_repl.cc:367 and sql_repl.cc:1341. This is the bug. Remark: Going back to grep's results, we see that 2.2.1 is a risk not only when thread B executes sql_repl:215, but also when it executes sql_repl:184. However, the lock is properly locked here, too.
[14 Sep 2007 8:51]
Sven Sandberg
The bug is in 5.0, 5.1, and 5.2
[14 Sep 2007 10:27]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/34258 ChangeSet@1.2501, 2007-09-14 12:27:17+02:00, sven@murkla.(none) +2 -0 BUG#30752 rpl_dual_pos_advance valgrind (jump depends on uninitialized LOG_INFO) Problem: one thread could read uninitialized memory from (the stack of) another thread. Fix: swapped order of initializing the memory and making it available to the other thread. Fix: put lock around the statement that makes the memory available to the other thread. Fix: all fields of the struct are now initialized in the constructor, to avoid future problems.
[14 Sep 2007 10:39]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/34261 ChangeSet@1.2501, 2007-09-14 12:39:11+02:00, sven@murkla.(none) +2 -0 BUG#30752 rpl_dual_pos_advance valgrind (jump depends on uninitialized LOG_INFO) Problem: one thread could read uninitialized memory from (the stack of) another thread. Fix: swapped order of initializing the memory and making it available to the other thread. Fix: put lock around the statement that makes the memory available to the other thread. Fix: all fields of the struct are now initialized in the constructor, to avoid future problems.
[3 Oct 2007 9:57]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/34816 ChangeSet@1.2537, 2007-10-03 11:57:14+02:00, sven@murkla.(none) +2 -0 BUG#30752 rpl_dual_pos_advance valgrind (jump depends on uninitialized LOG_INFO) Problem: one thread could read uninitialized memory from (the stack of) another thread. Fix: swapped order of initializing the memory and making it available to the other thread. Fix: put lock around the statement that makes the memory available to the other thread. Fix: all fields of the struct are now initialized in the constructor, to avoid future problems.
[27 Nov 2007 10:49]
Bugs System
Pushed into 5.0.54
[27 Nov 2007 10:50]
Bugs System
Pushed into 5.1.23-rc
[27 Nov 2007 10:53]
Bugs System
Pushed into 6.0.4-alpha
[28 Nov 2007 18:42]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at http://dev.mysql.com/doc/en/installing-source.html Documented bugfix in 5.0.54, 5.1.23, and 6.0.4 changelogs as follows: One thread could read uninitialized memory from the stack of another thread. This issue was only known to occur in a mysqld process acting as both a master and a slave.