Bug #30752 rpl_dual_pos_advance valgrind (jump depends on uninitialized LOG_INFO)
Submitted: 31 Aug 2007 19:12 Modified: 28 Nov 2007 18:42
Reporter: Timothy Smith Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.0 OS:Any
Assigned to: Sven Sandberg CPU Architecture:Any
Tags: pbfail

[31 Aug 2007 19:12] Timothy Smith
Description:
I don't believe I've seen this before, probably a rare timing-related bug?  It showed up in mysql-5.0-maint, but probably other versions could be affected.

VALGRIND: 'Conditional jump or move depends on uninitialised value(s)'
    COUNT: 1
    FUNCTION: my_bcmp    FILES:    slave.err
    TESTS:    rpl_dual_pos_advance
    STACK: at 0x9F305D: my_bcmp (bcmp.c:59)
             by 0x741560: log_in_use(char const*) (sql_repl.cc:218)
             by 0x675C5F: MYSQL_LOG::purge_logs(char const*, bool, bool, bool, unsigned long long*) (log.cc:1173)
             by 0x6770E7: MYSQL_LOG::purge_first_log(st_relay_log_info*, bool) (log.cc:1053)
             by 0x737424: next_event(st_relay_log_info*) (slave.cc:5032)
             by 0x73B7CF: exec_relay_log_event(THD*, st_relay_log_info*) (slave.cc:3253)
             by 0x73C302: handle_slave_sql (slave.cc:3933)
             by 0x4D37192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x52AA47C: clone (in /lib64/libc-2.4.so)

How to repeat:
See the pb-valgrind host here:

https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=mysql-5.0-maint&order=618

Suggested fix:
Ensure that the LOG_INFO structure is properly initialized - in particular, the log_file_name member.
[14 Sep 2007 8:46] Sven Sandberg
== Scenario Causing the Error ======

The THD class has a field `LOG_INFO *current_linfo' which stores the
filename of the current logfile used by the thread.  When the main
thread purges a log file, it first checks whether the log file is in
use by any other thread.  This is done at sql_repl:212-223 by
iterating through the global `threads' variable and checking for each
thread whether current_linfo contains the name of the log file to
purge.  The `current_linfo' is modified concurrently so that it
temporarily points to uninitialized memory at the intervals
sql_repl:367-369 or sql_repl:1341-1343, and reading the uninitialized
memory causes the bug.

More precisely, the scenario is as follows.  Thread A executes
sql_repl.cc:367 or sql_repl.cc:1341.  At both these places,
current_linfo of Thread A is assigned the address of a local
uninitialized struct.  Thread B executes sql_repl:215 with tmp
pointing to Thread A.  This causes a crash inside the call to bcmp()
at sql_repl.cc:218.  (The struct would be initialized two lines later
by Thread A, at sql_repl.cc:369 or sql_repl.cc:1343, respectively.)

== Reproducibility ======

I have not been able to reproduce this (after running the
rpl_dual_pos_advance test 2800 times).  The cause was found by
inspecting the code.

According to Lars, the two threads causing the error do not normally
run at the same time, because one is run on server and the other on
slave.  Hence, the error only happens when replicating in several
steps, so that one instance of the program is both master and slave.
In particular, the rpl_dual_pos_advance test replicates in a loop,
from I to II and from II to I.

== Suggested Action ======

Fix A.  There is a simple fix: we currently do "let current_linfo
point to a struct on the stack, then initialize the struct".  Instead,
we should reverse the order, and do "initialize the struct on stack,
then let current_linfo point to it". This amounts to reordering the
code.

Fix B.  Lars pointed out that there is an underlying systematic error
here.  It would be better if LOG_INFO::log_file_name was initialized
to the empty string in the constructor.  That would make it less easy
to make similar bugs in the future.

Warning: we still have to lock before writing log_file_name.  If we
would write to the log_file_name while another thread was examining
it, we would temporarily remove the trailing zero so that the other
thread would access it.

Fix C.  Mats pointed out that we should always put a lock on variables
that are readable from other threads.  In particular, we should lock
before assigning &linfo to thd->current_linfo.
[14 Sep 2007 8:47] Sven Sandberg
== Detailed description of what goes wrong ======

This includes a proof that nothing else caused the bug.

Let us roll up the traceback:

1. At bcmp:59, one of the following is uninitialized.
 - s1 Or Something That It Points To (OSTIPT)
 - s2 OSTIPT
 - len

2. The variables at bcmp:59:my_bcmp() map to variables in
sql_repl.cc:218:

 - log_name OSTIPT
 - linfo->log_file_name OSTIPT
 - log_name_len

log_name_len is initialized (and even correct if log_name is correctly
initialized). So either: (2.1) log_name OSTIPT is uninitialized, or
(2.2) linfo->log_file_name OSTIPT is uninitialized.  We consider these
cases separately.

2.1. Let us first investigate whether log_name OSTIPT is uninitialized
at sql_repl.cc:218. log_name in log_in_use() maps to
log_info.log_filename in log.cc:1174. First, note that this is a stack
variable, so the address of log_info.log_filename is not
uninitialized. It remains to check whether STIPT is uninitialized.
The line log.cc:1173-1174 can only be reached if none of the two gotos
above was executed. Therefore, the call to find_log_pos at log.cc:1164
returned 0.  To avoid digging into code that probably works, we make
the following assumption:

**Assumption 2.1: my_b_gets() in mf_iocache2.c:184 is correctly
implemented.

We now prove the following.

**Lemma 2.1: If Assumption 2.1 holds, and find_log_pos() is called and
returns 0, then find_log_pos() initializes its parameter
linfo->log_file_name.

**Proof.  Let us look inside find_log_pos() at
log.cc:820. find_log_pos() can only return 0 if the second break
statement of the for loop is reached, and the first break statement is
not reached. If the first break is not reached, it means that the
return value from the call to my_b_gets() at log.cc is >1. By
Assumption 2.1, the return value from my_b_gets() is >1 only if it
copied a null-terminated string of length >0 to the second
parameter. From this, it follows that fname in find_log_pos() is
initialized before find_log_pos() returns. But fname is an alias for
linfo->log_file_name in find_log_pos(), so it follows that
linfo->log_file_name is properly initialized by find_log_pos().
**End of proof of Lemma 2.1

We now apply Lemma 2.1, and obtain that (under assumption 2.1),
log_info.log_filename is correctly initialized at log.cc:1174.  This
shows what we wanted: both log_name and what it points to are
correctly initialized.

2.2. It remains to check whether linfo->log_file_name OSTIPT is
uninitialized at sql_repl.cc:218.  Since sql_repl.cc:218 is reached,
linfo is not null, by the condition of the surrounding if().  Note
that linfo is an alias for the current_linfo member of the THD class,
for one of the threads in the global variable threads.  Therefore, the
current_linfo of some thread is not null.  Let us find the possible
points in the code when current_linfo of some thread is not null.

Note that current_linfo is set to 0 in the constructor of THD, at
sql_class.cc:213.  Before this point, it was uninitialized, but the
THD object cannot have been added to the global threads variable (this
is seen by following all function, method, and constructor calls in
THD::THD() before sql.class.cc:213).

Therefore, current_linfo can only by non-null after the code
explicitly sets it to non-null, and before it re-sets it to null
again.  Let us grep for current_linfo in the code (I removed matches
in binary files, SCCS files and comments):

$ grep -n -H current_linfo */*
libmysqld/sql_class.cc:213:  current_linfo =  0;
sql/repl_failsafe.cc:301:  thd->current_linfo = &linfo;
sql/repl_failsafe.cc:381:  thd->current_linfo = 0;
sql/sql_class.cc:213:  current_linfo =  0;
sql/sql_class.h:1520:  LOG_INFO*  current_linfo;
sql/sql_repl.cc:184:    if ((linfo = tmp->current_linfo))
sql/sql_repl.cc:215:    if ((linfo = tmp->current_linfo))
sql/sql_repl.cc:367:  thd->current_linfo = &linfo;
sql/sql_repl.cc:736:  thd->current_linfo = 0;
sql/sql_repl.cc:751:  thd->current_linfo = 0;
sql/sql_repl.cc:1341:    thd->current_linfo = &linfo;
sql/sql_repl.cc:1429:  thd->current_linfo = 0;

Since there are no places where the address of current_linfo is taken,
there are no aliases for it.  Hence, it can only be non-null at one of
the three places found by grep:
 - sql/repl_failsafe.cc:301
 - sql/sql_repl.cc:367
 - sql/sql_repl.cc:1341

At all these places, current_linfo is set to the address of a local
variable.  This address is correct and initialized until the function
returns.  At all three places, thd->current_linfo is set to 0 before
the function returns.  We now have the following conceivable scenarios
when thd->current_info could be uninitialized:

2.2.1 thread A assigns the address of a local struct to
thd->current_linfo; thread B loads thd->current_linfo into a local
variable; thread A returns from the current function and invalidates
the local struct; and thread B reads from its copy of
&thd->current_linfo.  However, threads A and B properly lock
LOCK_thread_count, so this is not possible.

2.2.2 thread A assigns the address of a local struct to
thd->current_linfo, and the data at that address is uninitialized.
Then thread B reads thd->current_linfo before thread A initializes the
data.  This can happen in two of the three places above, namely
sql_repl.cc:367 and sql_repl.cc:1341.  This is the bug.

Remark: Going back to grep's results, we see that 2.2.1 is a risk not
only when thread B executes sql_repl:215, but also when it executes
sql_repl:184.  However, the lock is properly locked here, too.
[14 Sep 2007 8:51] Sven Sandberg
The bug is in 5.0, 5.1, and 5.2
[14 Sep 2007 10:27] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34258

ChangeSet@1.2501, 2007-09-14 12:27:17+02:00, sven@murkla.(none) +2 -0
  BUG#30752 rpl_dual_pos_advance valgrind (jump depends on uninitialized LOG_INFO)
  Problem: one thread could read uninitialized memory from (the stack of) another thread.
  Fix: swapped order of initializing the memory and making it available to the other thread.
  Fix: put lock around the statement that makes the memory available to the other thread.
  Fix: all fields of the struct are now initialized in the constructor, to avoid future problems.
[14 Sep 2007 10:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34261

ChangeSet@1.2501, 2007-09-14 12:39:11+02:00, sven@murkla.(none) +2 -0
  BUG#30752 rpl_dual_pos_advance valgrind (jump depends on uninitialized LOG_INFO)
  Problem: one thread could read uninitialized memory from (the stack of) another thread.
  Fix: swapped order of initializing the memory and making it available to the other thread.
  Fix: put lock around the statement that makes the memory available to the other thread.
  Fix: all fields of the struct are now initialized in the constructor, to avoid future problems.
[3 Oct 2007 9:57] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34816

ChangeSet@1.2537, 2007-10-03 11:57:14+02:00, sven@murkla.(none) +2 -0
  BUG#30752 rpl_dual_pos_advance valgrind (jump depends on uninitialized LOG_INFO)
  Problem: one thread could read uninitialized memory from (the stack of) another
  thread.
  Fix: swapped order of initializing the memory and making it available to the
  other thread.
  Fix: put lock around the statement that makes the memory available to the other
  thread.
  Fix: all fields of the struct are now initialized in the constructor, to avoid
  future problems.
[27 Nov 2007 10:49] Bugs System
Pushed into 5.0.54
[27 Nov 2007 10:50] Bugs System
Pushed into 5.1.23-rc
[27 Nov 2007 10:53] Bugs System
Pushed into 6.0.4-alpha
[28 Nov 2007 18:42] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.0.54, 5.1.23, and 6.0.4 changelogs as follows:

        One thread could read uninitialized memory from the stack of
        another thread. This issue was only known to occur in a
        mysqld process acting as both a master and a slave.