Bug #2926 some potential race conditions
Submitted: 23 Feb 2004 10:41    Modified: 31 Mar 2004 7:34
Reporter: Min Xu                Status: Not a Bug
Category: MySQL Server          Severity: S3 (Non-critical)
Version: 4.1.1                  OS: Solaris (Solaris 8 & 9)
Assigned to: Sergei Golubchik   CPU Architecture: Any

[23 Feb 2004 10:41] Min Xu
Description:
The following are potential race conditions found by my experimental race detector
during execution close to the server crash described in bug #644.
They may have caused that bug. There are multiple of them, and I am sure not all
of them are really harmful: they were found by an automatic tool, and my
knowledge of MySQL doesn't allow me to distinguish which are real bugs and which
are not.

First of all, let me say a few words on the format of the following report. My
detector reports two pieces of information (a small made-up example follows
this list):

  * Two statements executed from two threads that raced. One of the statements
    is a read from a shared variable, and the other is a write to the same
    variable.

  * A trace of statements (a program slice, if you wish) that was affected by
    the read statement from the first bullet. The trace of statements has the
    format read A -> write B -> read B -> write C -> etc. Between each
    read/write pair of statements there is some data dependency, i.e. A's
    value affects B's value. The trace finally ends with a write statement.
    That is where the race happens: the write statement from the first bullet
    happens just before the final write statement but after the previous write
    in the trace.
    (Notice that these statements are not necessarily from within the same
     function. They may come from different functions, or from multiple calls
     of the same function.)
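
To make the format concrete, here is a small made-up example (the variables A,
B and C are invented purely for illustration and do not appear in MySQL):

  /* Thread 1 executes the dependency chain the detector reports: */
  tmp = A;          /* read A   (the racing read from the first bullet)   */
  B   = tmp + 1;    /* write B  (A's value affects B's value)             */
  C   = B * 2;      /* read B -> write C (the final write of the trace)   */

  /* Thread 2 executes the racing write from the first bullet, which lands
     after thread 1's write to B but before its final write to C: */
  A   = 42;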

The tool is still experimental, so it can make mistakes. Some races may seem
nonsensical, but keep in mind this may be due to context switches: the tool
assumes all accesses from the same _processor_ come from the same _thread_,
when in fact they can come from different threads.

But don't worry, I have done as much initial screening as I could. The obviously
false ones are not reported here. I have sorted these races from the most likely
to the least likely.

============================================================================
queue_remove <--> queue_insert

At least two races on queue->elements and queue->root[*] were found in the two
functions above.

1. The racing statements are:

mysys/queues.c:202 queue->root[idx]=queue->root[queue->elements--];

and

mysys/queues.c:175 idx= ++queue->elements;

After "queue->elements" was loaded, decremented and stored on line 202, it was
used to compute half_queue on line 227
(half_queue=(elements=queue->elements) >> 1).
Finally the value of half_queue was saved by a kernel context switch. But
before the save finished, the original "queue->elements" value was incremented
by another thread on line 175.

The context switch should happen before queue_remove returns (_downheap was
inlined in my code). The value of "half_queue" may or may not still be useful
when the context switch happens. If it is still useful, then this is likely a
race condition.
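
In code terms, a minimal sketch of the suspected lost update (simplified and
annotated by me; it assumes for the moment that the callers do not serialize
the two calls with a lock of their own):

  /* Thread A, inside queue_remove() (queues.c:202): a non-atomic
     read-modify-write of queue->elements */
  queue->root[idx]=queue->root[queue->elements--];

  /* Thread B, inside queue_insert() (queues.c:175): another non-atomic
     read-modify-write of the same counter */
  idx= ++queue->elements;

  /* If B's load of queue->elements falls between A's load and A's store
     (or the other way around), one of the two updates is lost and root[]
     can be indexed inconsistently. */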

2. The racing statements are:

queues.c:202 queue->root[idx]=queue->root[queue->elements--];

and

queues.c:187 queue->root[idx]=element;

After the read of "queue->root[queue->elements--]" on line 202, the value was
passed to "element" in the inlined function _downheap on line 226
(element=queue->root[idx]). Then the value was passed to "queue->root[idx]"
on line 246 (queue->root[idx]=element). Then another invocation of queue_remove
passed the value to "queue->root[idx]" (idx may be different from the previous
idx) on line 202, and similarly the value was passed to "queue->root[idx]" at
line 246. But before the write on line 246 finished, the original value of
"queue->root[queue->elements--]" was changed by a different thread on line 187
(queue->root[idx]=element).

This may not be harmful. I don't know if the value of
"queue->root[queue->elements--]" is supposed to stay unchanged across the two
calls of queue_remove.

In summary, the race was:

"line 202 read" --> "line 226" --> "line 246 write" --> "line 202 read" -->
"line 226" --> "line 246 write"

Another race is quite similar:
"line 202 read" --> "line 202 write" --> "line 202 read" --> "line 226" -->
"line 246 write"

===========================================================================
queue_remove <--> queue_remove

One race on queue->root[] was found. The racing statements are:

201     byte *element=queue->root[++idx]; /* Intern index starts from 1 */

and

246   queue->root[idx]=element;

The value of "queue->root[++idx]" was first copied into a register used for
element. Then a context switch happens (this should happen before the function
returns; otherwise the register won't be saved, since SPARC has windowed
registers and a function return pops a register window). During the context
switch the register was saved, but before the save finished, the original value
of "queue->root[++idx]" was updated by a different thread on line 246.

===========================================================================
hash_search <--> my_hash_insert

The race condition is on "pos->data" variable. The two racing statements are:

mysys/hash.c:157 DBUG_RETURN (pos->data);

and

hash.c:338       pos->data=(byte*) record;

After pos->data is read, it is used in open_table (line 830 in sql_base.cc):
  "for (table=(TABLE*) hash_search(&open_cache,(byte*) key,key_length)"
Then in the same function, the value of "table" is passed to thd->open_tables
on line 897. Later on, in the function close_thread_table on line 424, the value
is passed to "table"; finally "table" is saved in "unused_table", "table->next"
and "table->prev" in the same function on line 455. But before the value is
saved, another thread modifies pos->data on line 338 of hash.c in the function
my_hash_insert().

This may or may not be a bug. It depends on whether the value in the hash table
has to stay unchanged between open_table and close_thread_table.

Similarly, in hash_next, on line 183 of hash.c, the variable "pos->next" is
raced on between two threads. The dependency again goes into both open_table and
close_thread_table. And finally, the overwriting statement is on line 339 of
hash.c in my_hash_insert.
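
If the callers serialize all access to the open_cache hash with one
server-level mutex, both of these races would be benign. A rough sketch of
that caller-side pattern (the mutex name LOCK_open is taken from sql_base.cc;
whether every path really takes it is exactly what I cannot tell from the
trace):

  /* rough sketch of the caller-side locking convention, not the exact
     sql_base.cc code */
  pthread_mutex_lock(&LOCK_open);
  table= (TABLE*) hash_search(&open_cache,(byte*) key,key_length);
  /* ... link table into thd->open_tables ... */
  pthread_mutex_unlock(&LOCK_open);

  /* inserts into the same hash would be done under the same mutex, so
     pos->data and pos->next could not change under a reader */
  pthread_mutex_lock(&LOCK_open);
  my_hash_insert(&open_cache,(byte*) table);
  pthread_mutex_unlock(&LOCK_open);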

============================================================================
mi_open <--> mi_open

The two racing statements are:

myisam/mi_open.c:57  for (pos=myisam_open_list ; pos ; pos=pos->next)

and

myisam/mi_open.c:583
myisam_open_list=list_add(myisam_open_list,&m_info->open_list)

"myisam_open_list" is used to compute pos, and the data dependency is
eventually propagated to

580       thr_lock_data_init(&share->lock,&m_info->lock,(void*) m_info);

via the "share->lock" variable. Finally the dependency is propagated into

thr_lock.c:661       pthread_mutex_lock(&lock->mutex);

However, in the meantime another thread executed mi_open() and changed
myisam_open_list on line 583.

This may not be a bug. It is perhaps the case that once myisam_open_list has
been used to compute some pointer in the for loop, it can be freely changed.

============================================================================
my_hash_insert <--> my_hash_insert

This is also similar to the race in hash_search and my_hash_insert. The two
racing statements are:

hash.c:324
idx=hash_mask(rec_hashnr(info,record),info->blength,info->records+1);

and

hash.c:349    info->blength+= info->blength;

"info->blength" was used to compute idx. The value of idx is then passed to
empty[0] on line 334 (empty[0]=pos[0];). Then the value is used in hash_next on
line 183 (hash->current_record= idx;). But before the value of
"hash->current_record" is saved, another thread changed "info->blength" in
my_hash_insert on line 349.

Again, this might be a false alarm, since info->blength may not have to stay
fixed by the time hash_next is called.
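
To see why a stale idx could matter, here is a sketch of how a power-of-two
mask hash typically maps a hash value to a bucket (illustrative only; the real
function is hash_mask() in mysys/hash.c and may differ in detail):

  /* illustrative sketch of a blength-dependent bucket computation,
     not the exact hash_mask() code */
  unsigned bucket_for(unsigned hashnr, unsigned blength, unsigned records)
  {
    unsigned idx= hashnr & (blength - 1);   /* blength is a power of two */
    if (idx >= records)                     /* fold back into the used part */
      idx&= (blength >> 1) - 1;
    return idx;
  }

  /* If info->blength doubles between computing idx and using it, the same
     hash value can map to a different bucket. */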

============================================================================
_mi_read_rnd_static_record <--> mi_lock_database

The two racing statements are:

myisam/mi_statrec.c:241 if ((! cache_read || share->base.reclength >
cache_length) && share->tot_locks == 0)

and

myisam/mi_locking.c:160       share->tot_locks++;

The race is on the variable "share->tot_locks". Once its value was read on
line 241, a context switch may occur. Even though the variable only controls
the if statement, the value can remain in a register until the context switch
happens. I don't know if the switch happens before or after the if statement
finishes. But during the register save of the context switch, another thread
came and changed "share->tot_locks" on line 160 of mi_locking.c. Anyway, this
might be a false alarm. But if the store to "share->tot_locks" happens before
the if statement finishes, the race could well be harmful.

How to repeat:
see #644
[23 Feb 2004 14:23] Sergei Golubchik
>
> Collecting a stack trace is always on my todo list. I am
> painfully aware that it might not be easy for me to do
> it soon, since my tool works on a binary level. My tool
> runs inside a simulator where everything available is the
> binary instruction stream. But I will definitely look at

That's enough, assuming you have access to registers of the
analyzed process.

> the files you pointed out and try to figure out how to
> add this feature.
> 
> It was also my impression that these functions are not intended
> to be thread-safe. This was quite an uncommon multithreading
> practice to me. I previously thought the most common way
> to write multithreaded programs is to write thread-safe
> functions and use those functions without worrying about
> races. One thing I thought might be true in MySQL is that
> the developer sometimes puts a data structure
> (queue, hash, etc.) local to a thread. Therefore, the data
> structure is only accessible through the current thread
> pointer. In these cases, there is really no need for any locking
> of the data structure. Is this what's happening in
> MySQL as an optimization?

Not exactly. And the practice, actually, is common and very logical.
There is some simple data structure, and a library to manage it,
e.g. a priority hash, a binary tree, or a linked list.
Of course such a generic library need not and cannot bother
about threads; it can be run in a single-threaded or multi-threaded
environment.

But if such a structure is used to store some shared data, then the
caller should of course take precautions to avoid race conditions.

> About the statistics updates. I should mention that the
> frequency I was reporting was of the actually harmful races,
> for each of them. It means that each time the race happens, at
> least one thread's update to the stats is lost.

Yes, this is exactly what I meant - using statistic_increment() one
says "increment a variable really fast, and I don't care if this
increment sometimes - but rarely - will be lost".
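
In other words (a sketch of the trade-off only, not the actual
statistic_increment() definition; the counter and mutex names are made up for
illustration):

  /* exact but slower: every increment is serialized by a mutex */
  pthread_mutex_lock(&LOCK_some_status);
  some_status_counter++;
  pthread_mutex_unlock(&LOCK_some_status);

  /* fast but lossy: a plain increment; two threads may both read the old
     value, and then one of the two increments is lost */
  some_status_counter++;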
[26 Feb 2004 11:55] Min Xu
After two "painful" (my eyes hurt :-)) days spent adding a rough stack trace feature to my detector, I was able to get more information about the races I've reported.

By looking at the calling context of each reported race condition here, I think I have ruled out two of them as not bugs. However, there are still two for which I cannot tell whether they are bugs.

Basically, for the race conditions on queue_insert and queue_remove, the two functions are called from the context of thr_end_alarm and thr_alarm, which are properly locked. Therefore, this race is not a bug.

For the race condition on hash_search and my_hash_insert, these two functions are called from the context of open_tables()->open_table(). The race on pos->data (table) is a false alarm, since after the table is found in the hash, it is OK for another thread to modify the pos->data pointer (while adding another new table into the hash).

Another race I couldn't rule out is the race on the field "blength". The racing statements are:

mysys/hash.c:324  idx=hash_mask(rec_hashnr(info,record),info->blength,info->records+1);

mysys/hash.c:349    info->blength+= info->blength;

According to the stack trace, idx is computed based on info->blength, then used to compute pos=data+idx and then empty[0]. Then the same thread called hash_next() in open_table(), which uses the value of empty[0] to compute something. In the meantime, info->blength was doubled when the records in the table overflowed. Can this be a bug? The trace shows that after my_hash_insert, a position variable is computed based on the old blength, and then the position is used to address the record in hash_next. However, blength changed, which might affect the computed pos? Is the hash table here a dynamic hash? Do buckets in the hash move around when the size of the hash is changed? (I assume blength is the number of buckets in the hash table.)

Finally, the race that may really be related to bug #644 is the _mi_read_rnd_static_record vs. mi_lock_database race.

The rough stack traces are:

  _mi_read_rnd_static_record+0xd4
  mi_scan
  _ZN9ha_myisam8rnd_nextEPc
  _Z13rr_sequentialP14st_read_record+0x14
  _Z10sub_selectP4JOINP13st_join_tableb
  _Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure
  _ZN4JOIN4execEv
  
and:

  mi_lock_database
  _Z15unlock_externalP3THDPP8st_tablej
  _Z24mysql_unlock_read_tablesP3THDP13st_mysql_lock
  _ZN4JOIN9join_freeEb
  _Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure
  _ZN4JOIN4execEv

It seems that locking was not used in the do_select function. The race was on the share->tot_locks variable. After

./myisam/mi_statrec.c:241       if ((! cache_read || share->base.reclength > cache_length) && share->tot_locks == 0)

My detector observed that the value of share->tot_locks was changed by another thread at:

./myisam/mi_locking.c:160       share->tot_locks++;

before the first thread exited the function _mi_read_rnd_static_record. I don't know if the update happens before the if statement finishes, though.

OK, that's all I know for now. I am not sure if I explained everything well. Please ask for explanation if you need it. Looking forward to your comments!

Thanks for your time!
[26 Mar 2004 13:17] Min Xu
Hi There,

I have just tested a version of MySQL 4.1 from the BK repository using my race detector. This new version of 4.1 doesn't crash on the prepared queries that used to crash 4.1.1 (refer to bug #644 for details of that bug). Here I want to report what I have found.

The new version of MySQL does not crash, and my detector also did not report one of the races I have reported here (#2926): the hash_insert and my_hash_insert race. Therefore, this disappearing race may have contributed to bug #644.

On the other hand, the race I reported between _mi_read_rnd_static_record and mi_lock_database is still reported in the new version of the server. This means my detector may have made a mistake. That was not a real bug.

In any case, please feel free to close this bug, since the crash doesn't happen anymore with the new version of the server from the BK repository. The two potential races reported here are either removed or false.

Thanks for your time.
[28 Mar 2004 9:35] Aleksey Kishkin
Dear Min Xu, thank you for your work. I am closing this issue.
[30 Mar 2004 5:25] Michael Widenius
I have taken a quick look at the reported things:

- In MySQL all queues and hashes that are used by many threads have their own locks.
- The code in myisam/mi_open.c (test_if_reopen() and list_add()) is safe because this code is protected by the mutex THR_LOCK_myisam.
- The code in myisam/mi_statrec.c is safe, as tot_locks was also incremented by this thread (all MySQL/MyISAM code uses table locks per statement) in mi_locking.c, and we are only interested in knowing whether the value != 0 (see the sketch after this list). The given code is only relevant when you are using MyISAM as a library without using table locks.
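
Roughly (a simplified sketch, not the exact MyISAM code):

  /* The reading thread has already taken its own table lock for this
     statement, which incremented share->tot_locks: */
  mi_lock_database(info, F_RDLCK);          /* share->tot_locks++ */

  /* Later, in _mi_read_rnd_static_record(), tot_locks is therefore >= 1
     for this thread no matter what other threads do, so the "== 0" branch
     cannot be taken: */
  if ((! cache_read || share->base.reclength > cache_length) &&
      share->tot_locks == 0)
  { ... }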

My guess of what happened is that the prepared statement code used the same table object for both clients, which would lead to a lot of strange things. This issue has already been fixed in 4.1.

Regards,
Monty
[30 Mar 2004 6:27] Min Xu
Thanks a lot for your comments. They are more useful to me than you might think. :-)

> My guess of what happend was that the prepared statement code used the same
> table object for both clients, which would lead to a lot of strange things.

What you have said here makes a lot of sense to me. In fact, my detector reported many races. I did the initial screening, and I remember there were
races involving the table structure. Basically, I have seen races on the fields
of table or JOIN data structures, especially where a table that is in use
can be re-initialized by a different thread. But somehow I didn't report them,
because I thought they were false alarms. :-(

> This issue has already been fixed in 4.1

The fix is after 4.1.1, right? The latest version I tested that has the problem
is 4.1.1.
[31 Mar 2004 7:34] Sergei Golubchik
Yes.
But if you compile MySQL yourself anyway, you can get a daily source snapshot from http://downloads.mysql.com/snapshots.php and give it a try