Bug #62100 the original table missing while execute the ddl 'alter table add column '
Submitted: 6 Aug 2011 1:01 Modified: 30 Mar 2012 19:12
Reporter: luo longjiu Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Storage Engines Severity:S1 (Critical)
Version:5.1.48 OS:Linux
Assigned to: Inaam Rana
Tags: Contribution

[6 Aug 2011 1:01] luo longjiu
Description:
on our daily database operation,we add a column on 128 tables,unlucky,we find some table lose:

(1)ddl script:
alter table xxxx_0000-xxxx_0127 add column biz_type tinyint default 1 comment 'xxx';
(2)error log:
task log:
Error 1005: Can't create table 'xxxx_01.#sql-57f0_25a510' (errno: -1) 

Mysql err log:
110803  3:39:16  InnoDB: Warning: problems renaming 'xx_22/#sql-2635_23d3a8' to 'xx_22/xxxx_1451', 25000 iterations  (fil0fil.c:: fil_rename_tablespace)
InnoDB: Warning: tablespace './xx_22/#sql-2635_23d3a8.ibd' has i/o ops stopped for a long time 24999  (fil0fil.c:: fil_mutex_enter_and_prepare_for_io)
110803  3:39:16  InnoDB: Warning: problems renaming 'xx_22/#sql-2635_23d3a8' to 'xx_22/xxxx_1451', 25001 iterations 
110803  3:39:16 [ERROR] Cannot find or open table xx_22/xxxx_1451 from the internal data dictionary of InnoDB though the .frm file for the table exists. Maybe you have deleted and recreated InnoDB data files but have forgotten to delete the corresponding .frm files of InnoDB tables, or you have moved .frm files to another database?
or, the table contains indexes that this version of the engine doesn't support.

(3):
we attempt increase the value of  three parametes:
| table_definition_cache     2048  |---> original 256        
| table_open_cache           2048  |---> original 256  
| innodb_open_files          65535 |---> original 300

How to repeat:
can't repeat

Suggested fix:
we have meet three times on out product database environment!!!
[6 Aug 2011 8:58] Valerii Kravchuk
Please, check if the same problem still happens with a newer version, 5.1.58.
[6 Aug 2011 9:05] luo longjiu
we try to reappear the scene on our testing environment(5.1.48) with same my.cnf,
but the mising table phenomenon  no longer appear.
[6 Aug 2011 10:23] Sveta Smirnova
Thank you for the feedback.

Do you use --innodb-file-per-table option? If yes please send us output of `ls -la datadir\databasedir` Please also send us full error log file.
[11 Aug 2011 19:48] Sveta Smirnova
Thank you for the files provided.

Unfortunately this is still misterius. Could you please send us your configuration file also? Please check OS error log in case if you had disk failures in time when table was missed.
[12 Aug 2011 0:54] luo longjiu
Hi Sveta Smirnova,in this fault,we execute the ddl opertion on 16 database nodes,and most of the node go wrong.
Thank you for your reply.
[12 Aug 2011 1:09] hui liu
pstack info

Attachment: mysqld_pstack.txt (text/plain), 153.60 KiB.

[12 Aug 2011 1:20] hui liu
To avoiding suffering such DDL lost table issues in our product, such patch is used for mysql.-5.1.48(innodb plugin-1.0.9), which two things are done:
* add debug info when retry over 25000 times
* wake-up io handler threads in fil_mutex_enter_and_prepare_for_io
[12 Aug 2011 1:23] hui liu
suggested patch for force wake the io handler threads

Attachment: ddl_deadlock_force_wake.patch (application/octet-stream, text), 1.38 KiB.

[12 Aug 2011 1:26] hui liu
As a matter of fact, there is anther issue, that is why failed DDL could not rollback? It works in innobase while not in innodb_plugin. Ticket #62146 discussed that issue separately:
http://bugs.mysql.com/bug.php?id=62146
[24 Aug 2011 8:15] hui liu
used patch for product

Attachment: 62100.patch (application/octet-stream, text), 3.28 KiB.

[7 Sep 2011 3:51] hui liu
Let me add some explanation for the race condition risk and the solution of this issue.

First of all, let's have a look for the stacks.

Thread 4 (Thread 0x47903940 (LWP 7929)):
#0  0x000000326480a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaaac8c5d2 in os_event_wait_low (event=0x43a68a0, reset_sig_count=0) at os/os0sync.c:399
#2  0x00002aaaaace154b in srv_master_thread (arg=0x0) at srv/srv0srv.c:2827
#3  0x0000003264806367 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003263cd30ad in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x41efa940 (LWP 7916)):
#0  0x000000326480a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaaac8c5d2 in os_event_wait_low (event=0x47c79b0, reset_sig_count=0) at os/os0sync.c:399
#2  0x00002aaaaac8b1ec in os_aio_simulated_handle (global_segment=1, message1=0x41efa0f0, message2=0x41efa0e8, type=0x41efa0e0) at os/os0file.c:4191
#3  0x00002aaaaac4c800 in fil_aio_wait (segment=1) at fil/fil0fil.c:4504
#4  0x00002aaaaace1dfd in io_handler_thread (arg=0x2aaaaaf6b048) at srv/srv0start.c:475
#5  0x0000003264806367 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003263cd30ad in clone () from /lib64/libc.so.6

It's clear the io_handler_thread and srv_master_thread are both waiting for the event by pthread_cond_wait(&(event->cond_var), &(event->os_mutex)), while the os_event_set is using pthread_cond_broadcast(cond) to post the signals, therefore, master thread would receive the signal while not io handler thread, with a very very low ratio.

Secondly, based on this background, let's have a look what happen for DDL, and the reason caused this issue.

For the fil_rename_tablespace function, key logic is:

retry:  
       mutex_enter(&fil_system->mutex);
       count++;
        if (count > 25000) {
             space->stop_ios = FALSE;
             mutex_exit(&fil_system->mutex);
             return(FALSE);
        }

     space = fil_space_get_by_id(id);
     space->stop_ios = TRUE;
     if (node->n_pending > 0 || node->n_pending_flushes > 0) {
                mutex_exit(&fil_system->mutex);
                os_thread_sleep(20000);
                goto retry;

   
From the error.log, we could confirm that retried 25000 times and exit for DDL, and then let's have a look for the master thread that received the signal, which we expected to do the flush actions.

Same as above, we list the key logic that master thread blocked in fil_mutex_enter_and_prepare_for_io:

retry:
        mutex_enter(&fil_system->mutex);
        if (space != NULL && space->stop_ios) {
                mutex_exit(&fil_system->mutex);
                os_thread_sleep(20000);
                goto retry;
               

Seen from error.log, we blocked here, the space that DDL renamed table #sql_table happened in unflushed space for master, then we could hit the bug.

Lastly, let me explain the fixing patch of the bug. 
The ideal is very simple, adding forced wakeup for the io handler threads to do the flush, just that all. As a matter of fact, the followed logic for master thread is also the same logic, but as file io_stop is TRUE, it's blocked while not reaching the normal logic after the blocked position.
[20 Sep 2011 17:20] Sveta Smirnova
See also bug #62474
[27 Jan 2012 6:58] Mark Callaghan
I got this bug -- see http://bugs.mysql.com/bug.php?id=64144. I am still trying to understand how it went wrong and how the patch here helps. From the retry code in fil_rename_tablespace, I think the 'if' block is the source of retries. If the 'else' block is run I don't see how it would get stuck in a retry loop. And I think that the cause would be n_pending > 0, but it sure would be nice to have that in the error message as suggested by the bug filer.

       if (node->n_pending > 0 || node->n_pending_flushes > 0) {
                /* There are pending i/o's or flushes, sleep for a while and
                retry */
                mutex_exit(&fil_system->mutex);
                os_thread_sleep(20000);
                goto retry;

        } else if (node->modification_counter > node->flush_counter) {
                /* Flush the space */

I don't understand the code for stop_ios in fil_mutex_enter_and_prepare_io. When n_open < max_n_open then this returns early and stop_ios is not checked. That doesn't look right to me. It will allow other threads to increment n_pending after stop_ios has been set.

I am curious if that is the problem. Suppose that n_pending > 0 when fil_rename_tablespace starts and that buffer pool flushing is done and this table has many dirty pages so n_pending continues to get incremented and the thread in fil_rename_tablespace continues to wait.

        if (fil_system->n_open < fil_system->max_n_open) {
                return;
        }

        space = fil_space_get_by_id(space_id);

        if (space != NULL && space->stop_ios) {
                /* We are going to do a rename file and want to stop new i/o's
                for a while */

In looking at the production db error log, it waits for ~12,000 loops and then there is a long lock wait error message for another thread that waited 241 seconds for trx_doublewrite->mutex. The retry loop runs 50 times/second as it sleeps 20 ms per iteration and 240 * 50 == 12000

I am not sure if the thread stuck in fil_rename_tablespace caused or is suffering from the long semaphore wait. The lost table in this case is about 4G or 262144 pages. Assuming the server flushed 1000 pages/second that takes 262 seconds to flush all dirty pages and I wonder if the delay from flushing dirty pages can explain this. 

InnoDB: Warning: a long semaphore wait:
--Thread 1707227456 has waited at buf/buf0flu.c line 825 for 241.00 seconds the semaphore:
Mutex at 0x2ab8ec256528 created file trx/trx0sys.c line 191, lock var 1
waiters flag 1
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
[27 Jan 2012 16:36] Inaam Rana
Mark,

You are right. Following code in fil_mutex_enter_and_prepare_for_io() needs to be fixed:

 921         if (fil_system->n_open < fil_system->max_n_open) {
 922 
 923                 return;

We need to push this code after the stop_ios check.

However, the hang you see on production server is likely not caused by this. You see a thread waiting on doublwrite mutex which implies that a batch write on doublewrite buffer is taking very long. If we return from above then it is possible to keep flushing dirty pages of the tablespace that we are trying to rename but that should not cause a very very slow doublewrite buffer batch.

I think explanation for that is the fact (as mentioned in an earlier comment by Hickey) that we wake up IO helper threads when we have posted a full doublwrite batch. Very likely what happened during this batch is that the rename tablespace command has set stop_ios and n_open > max_open. We are waiting for pending IOs to complete but we have not waken up the IO helper threads.

I have assigned the bug to myself. I'll post a patch internally for review
[27 Jan 2012 17:59] Mark Callaghan
Inaam - can you help me understand why the thread gets stuck in that case? Can't something else come along and wake up the write threads? Doesn't the thread that used the doublewrite buffer wake the write threads? I want to spend more time trying to reproduce the problem but I don't understand the failure scenario in this case yet.
[27 Jan 2012 20:33] Inaam Rana
Mark,

Here is how this works:

* All write activity on data pages happens through doublewrite buffer.
* Writes are done in batches either LRU batch or flush_list batch (this is as of 5.1 plugin. In 5.6 we introduced single page flushes as well)
* When a flush starts it starts posting the pages to doublewrite buffer: see buf_flush_post_to_doublewrite_buf(). At this stage pages are not written to disk. They are just 'held' in doublewrite buffer.
* Once either the doublewrite buffer gets full (the size is harcoded 128 Pages) or the flush batch is finished a call to buf_flush_buffered_writes() is made. This call actually flushes the pages to disk. During this call:
  * doublewrite mutex is acquired
  * First all pages are written to doublewrite buffer in a big single IO
  * System tablespace, where doublewrite buffer resides, is flushed to disk.
  * Pages are written to data file. This write to datafile is made through this call:
                fil_io(OS_FILE_WRITE | OS_AIO_SIMULATED_WAKE_LATER,
                       FALSE, buf_block_get_space(block), 0,
                       buf_block_get_page_no(block), 0, UNIV_PAGE_SIZE,
                       (void*)block->frame, (void*)block);

   * Note the OS_AIO_SIMULATED_WAKE_LATER flag. We pass this flag because we want to dispatch all IO request before waking up the IO threads. The reason is that IO helper threads try to coalesce these requests and we want to give coalescing maximum chance.

* Now coming to your question why can't any other thread wake up IO helper threads:
  * The master thread is doing a flush batch and is stuck in fil_mutex_enter_and_prepare_for_io()
  * Other threads trying to do any write need to go through doublewrite and the doublewrite mutex is held by the master thread.
  * A read activity can still wake up IO threads but we use simulated AIO only in case of readahead. If there is no readahead happening then a single won't go to IO helper threads.

Hope it is clear as mud! Let me know if you have any questions.
[28 Jan 2012 3:18] Mark Callaghan
Why do you think the thread is stuck in fil_rename_tablespace? 

If is there because node->n_pending > 0, then writer/reader threads already made it past the stop_ios check in fil_mutex_enter_and_prepare_io and past fil_node_prepare_for_io. I don't think there is anything that needs to be woken up.

If it is there because node->n_pending_flushes > 0 then fil_flush is in progress. I haven't looked at that codepath recently.

If it is there because node->modification_counter > node->flush_counter then fil_flush will be called and I don't think the fil_rename_tablespace thread should get stuck.

So I don't understand yet why it would need to wake the writer threads. And if it did they would likely get stuck by the stop_ios check.

If you help me get less confused I can try to reproduce the problem.
[28 Jan 2012 3:40] Mark Callaghan
Alas, I think I get it, kind of. Some writes have been requested so n_pending was incremented. The double write buffer is not full so the writes don't get done. The writer threads won't be woken up because the double write buffer isn't full.
[28 Jan 2012 4:53] Mark Callaghan
I was able to reproduce the problem. Use an optimized build as the test case is not deterministic. It won't fail every time. I modified fil0fil.c and attached a file with the patch. Those changes make it more likely that the race is hit.  The race was hit once the first 10 times I ran it. When it is hit the ibd file is not renamed and the frm file is renamed.

t/i.test:
--source include/have_innodb_plugin.inc

create table t (i int primary key auto_increment, c char(250), d char(250), e char(250)) engine=innodb;
create index cx on t(c);
create index dx on t(d);
create index ex on t(e);

insert into t values (NULL, 'a', 'b', 'c');
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;
insert into t select NULL, 'a', 'b', 'c' from t;

show innodb status;
select * from information_schema.table_statistics where table_name = 't';

rename table t to s;

--------------------

t/i-master.opt:

--innodb_file_format=Barracuda --innodb_file_per_table=1

--------------------

The interesting thread stacks are:

select,os_thread_sleep,fil_rename_tablespace,dict_table_rename_in_cache,row_rename_table_for_mysql,innobase_rename_table,ha_innobase::rename_table,mysql_rename_table,do_rename,rename_tables,mysql_rename_tables,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone

select,os_thread_sleep,fil_mutex_enter_and_prepare_for_io,_fil_io,buf_flush_buffered_writes,buf_flush_buffered_writes,buf_flush_try_neighbors,buf_flush_batch,log_preflush_pool_modified_pages,log_checkpoint_margin_background,srv_master_thread,start_thread,clone
[28 Jan 2012 4:55] Mark Callaghan
patch for mysql 5.1.52

Attachment: f.patch (application/octet-stream, text), 2.32 KiB.

[28 Jan 2012 4:58] Inaam Rana
Mark,

You got that right!
Congratulations on deciphering my cryptic explanation.
[28 Jan 2012 15:53] Mark Callaghan
Is the assert for the second call to dict_table_rename_in_cache from row_rename_table_for_mysql needed? I would rather not crash the server. Can it handle an error?

                        ut_a(dict_table_rename_in_cache(table,
                                                        old_name, FALSE));
[28 Jan 2012 18:59] Mark Callaghan
proposed patch for mysql 5.1.52

Attachment: fil.patch (application/octet-stream, text), 10.86 KiB.

[30 Jan 2012 17:34] Valerii Kravchuk
Bug #63553 was marked as a duplicate of this one.
[3 Feb 2012 4:57] hui liu
Great analyze from Mark and Inaam. Looking forward the stable and final patch. 
And I wonder what's the progress now? 

Compared with the patch provided before, Mark's patch is also try to wake up the write thread while in fil_rename_tablespace instead of 
fil_mutex_enter_and_prepare_for_io, with some different logic handle.

As mentioned above, max_n_open judgement is also a risk when no other updates for the rest tables during that period, the position really needs tweak.

Thanks again to let me more known more about the puzzled bug.
[3 Feb 2012 21:11] Mark Callaghan
Your work made this much easier for me to debug. Had I read your stack trace I would have saved even more time.
[20 Feb 2012 19:19] Sveta Smirnova
See also bug #62894
[20 Feb 2012 19:22] Sveta Smirnova
See also bug#62474
[30 Mar 2012 19:12] John Russell
Added to changelog for 5.1.62, 5.5.22, 5.6.5: 

A DDL operation such as ALTER TABLE ... ADD COLUMN could stall,
eventually timing out with an Error 1005: Can't create table message
referring to fil_rename_tablespace.