| Bug #54617 | During shutdown: Failing assertion: arr->n_reserved == 0 | | |
| --- | --- | --- | --- |
| Submitted: | 18 Jun 2010 16:59 | Modified: | 9 Dec 2010 1:00 |
| Reporter: | Vasil Dimov | Email Updates: | |
| Status: | Duplicate | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S3 (Non-critical) |
| Version: | 5.5.5 | OS: | Any |
| Assigned to: | Assigned Account | CPU Architecture: | Any |
[18 Jun 2010 16:59]
Vasil Dimov
[18 Jun 2010 19:05]
Jimmy Yang
The assertion is about sync_primary_wait_array: its n_reserved count is not zero when the server shuts down. The call stack is as follows:

    ut_a(arr->n_reserved == 0);
    sync_array_free()
    sync_close()
    innobase_shutdown_for_mysql()

The n_reserved count is controlled by the following two APIs:

    sync_array_reserve_cell
    sync_array_free_cell

By code inspection alone, there is one suspect place in rw_lock_s_lock_spin() that could return without calling sync_array_free_cell():

    rw_lock_s_lock_spin()
    {
    lock_loop:
            /* We try once again to obtain the lock */
            if (TRUE == rw_lock_s_lock_low(lock, pass, file_name, line)) {
                    rw_s_spin_round_count += i;

                    return; /* Success */ <=== could return without calling
                                               sync_array_free_cell() if there
                                               was a loop back to lock_loop
            } else {
                    sync_array_reserve_cell() <=== reserve

                    if (TRUE == rw_lock_s_lock_low(lock, pass, file_name, line)) {
                            sync_array_free_cell(sync_primary_wait_array, index);

                            return; /* Success */
                    }

                    goto lock_loop;
            }
    }
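For reference, the balance that the failing assertion checks can be modeled in a few lines. The following is a minimal standalone sketch of the reserve/free pairing; the simplified bodies are assumptions for illustration, not InnoDB's actual sync0arr.c code:

```c
#include <assert.h>

/* Minimal model: each waiter reserves a cell before blocking and
   must free it on every exit path; shutdown asserts the balance. */
typedef struct {
    unsigned long n_reserved; /* cells currently reserved by waiters */
} sync_array_t;

static void
sync_array_reserve_cell(sync_array_t* arr)
{
    arr->n_reserved++; /* the real code also picks and fills a free cell */
}

static void
sync_array_free_cell(sync_array_t* arr)
{
    assert(arr->n_reserved > 0);
    arr->n_reserved--;
}

/* Called via sync_close() at shutdown: every reserve must have been
   matched by a free by now; this models ut_a(arr->n_reserved == 0). */
static void
sync_array_free(sync_array_t* arr)
{
    assert(arr->n_reserved == 0);
}

int
main(void)
{
    sync_array_t arr = { 0 };

    sync_array_reserve_cell(&arr);
    sync_array_free_cell(&arr); /* omit this and sync_array_free() fires */
    sync_array_free(&arr);

    return 0;
}
```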
[18 Jun 2010 19:31]
Jimmy Yang
On second look, the loop in rw_lock_s_lock_spin() should be fine: the "goto lock_loop;" is preceded by a sync_array_wait_event() call, which itself calls sync_array_free_cell(). The assertion now looks more like a race condition.
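To make that control flow concrete, here is a hedged sketch of the loop as described above. The stub types and bodies are assumptions added so the shape compiles standalone; it is simplified from InnoDB's sync0rw.c and is not the real source:

```c
#include <stdio.h>

typedef unsigned long ulint;
typedef int ibool;
#define TRUE  1
#define FALSE 0

typedef struct { int locked; } rw_lock_t;
typedef struct { ulint n_reserved; } sync_array_t;

static sync_array_t wait_array_obj;
static sync_array_t* sync_primary_wait_array = &wait_array_obj;

/* Stub: try to take the lock in shared mode (the real code is atomic). */
static ibool rw_lock_s_lock_low(rw_lock_t* lock) {
    return lock->locked ? FALSE : TRUE;
}

static void sync_array_reserve_cell(sync_array_t* arr, ulint* index) {
    arr->n_reserved++;
    *index = 0; /* the real code picks a free cell */
}

static void sync_array_free_cell(sync_array_t* arr, ulint index) {
    (void) index;
    arr->n_reserved--;
}

/* The key point from the comment above: the wait itself frees the
   cell before the caller loops back, so the retry path does not leak. */
static void sync_array_wait_event(sync_array_t* arr, ulint index) {
    /* ... block until the lock's release event fires ... */
    sync_array_free_cell(arr, index);
}

static void rw_lock_s_lock_spin(rw_lock_t* lock) {
    ulint index;
lock_loop:
    /* ... spin for a while before retrying ... */
    if (TRUE == rw_lock_s_lock_low(lock)) {
        return; /* success: nothing was reserved on this pass */
    }
    sync_array_reserve_cell(sync_primary_wait_array, &index);
    if (TRUE == rw_lock_s_lock_low(lock)) {
        sync_array_free_cell(sync_primary_wait_array, index);
        return; /* success: the reserved cell is freed explicitly */
    }
    sync_array_wait_event(sync_primary_wait_array, index);
    goto lock_loop;
}

int main(void) {
    rw_lock_t lock = { 0 };
    rw_lock_s_lock_spin(&lock);
    printf("n_reserved at end: %lu\n", sync_primary_wait_array->n_reserved);
    return 0;
}
```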
[19 Jun 2010 4:09]
Jimmy Yang
Cannot reproduce. Please verify.
[25 Jun 2010 18:15]
Elena Stepanova
Not quite sure if it's the same problem, but I got this assertion, with the stack trace below, on mysqld bootstrap performed by MTR during regular SysQA tests with mysql-trunk-release. There were no special conditions; the test was run numerous times before and after, on the same machine as well as others, and this was the only time it ever happened so far, so I assume it's a very low-probability race condition. I'm running a statistical test on the same box now (1000 runs, only doing bootstrap and executing a 1-line test), but it does not look promising -- no crashes after the first 500 rounds.

    mysql-5.5.5-m3-solaris10-x86_64-test/bin/mysqld --no-defaults --bootstrap
        --basedir=mysql-5.5.5-m3-solaris10-x86_64-test
        --datadir=mysql-5.5.5-m3-solaris10-x86_64-test/vardirs/02_load_RBR_InnoDB_MyISAM_var/install.db
        --loose-skip-falcon --loose-skip-ndbcluster
        --tmpdir=mysql-5.5.5-m3-solaris10-x86_64-test/vardirs/02_load_RBR_InnoDB_MyISAM_var/tmp/
        --core-file
        --lc-messages-dir=mysql-5.5.5-m3-solaris10-x86_64-test/share/
        --character-sets-dir=mysql-5.5.5-m3-solaris10-x86_64-test/share/charsets

    [7] sigacthandler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff165588
    ---- called from signal handler with signal 11 (SIGSEGV) ------
    [8] sync_array_free(arr = ???) (optimized), at 0xbf5b69 (line ~292) in "sync0arr.c"
    [9] sync_close() (optimized), at 0xc0a7a5 (line ~1454) in "sync0sync.c"
    [10] innobase_shutdown_for_mysql() (optimized), at 0xbedf2f (line ~2083) in "srv0start.c"
    [11] innobase_end(hton = ???, type = ???) (optimized), at 0xaec015 (line ~2484) in "ha_innodb.cc"
    [12] ha_finalize_handlerton(plugin = ???) (optimized), at 0x9ca567 (line ~421) in "handler.cc"
    [13] plugin_deinitialize(plugin = ???, ref_check = ???) (optimized), at 0x896f3f (line ~861) in "sql_plugin.cc"
    [14] reap_plugins() (optimized), at 0x897563 (line ~949) in "sql_plugin.cc"
    [15] plugin_shutdown() (optimized), at 0x899025 (line ~1634) in "sql_plugin.cc"
    [16] clean_up(print_message = ???) (optimized), at 0x80ce1b (line ~1510) in "mysqld.cc"
    [17] unireg_abort(exit_code = ???) (optimized), at 0x80ccd4 (line ~1442) in "mysqld.cc"
    [18] mysqld_main(argc = ???, argv = ???) (optimized), at 0x811f57 (line ~4853) in "mysqld.cc"
    [19] main(argc = ???, argv = ???) (optimized), at 0x80b859 (line ~24) in "main.cc"
[28 Jun 2010 17:15]
Mikhail Izioumtchenko
I'll try to reproduce it, but the code needs to be looked at by Inaam. We also see some very-low-reproducibility valgrind warnings on uninitialized memory from that area; not sure if I filed a bug for those.
[30 Jun 2010 15:01]
Inaam Rana
I am assigning this to Sunny and adding myself as a reviewer because I think there is at least one code path in the purge thread that looks suspicious. I am not sure whether a separate purge thread is used in the above cases or not. If no purge thread is used, then we'll need some other explanation for this. In the case where a separate purge thread is used, i.e. innodb_purge_threads=1:

At shutdown we call logs_empty_and_mark_files_at_shutdown(). This will 'suspend' the purge thread. The suspension has the effect of bringing the purge thread out of its main loop to the following:

    3059         mutex_enter(&kernel_mutex);
    3060
    3061         /* Decrement the active count. */
    3062         srv_suspend_thread();
    3063
    3064         mutex_exit(&kernel_mutex);
    3065
    3066         /* Free the thread local memory. */
    3067         thr_local_free(os_thread_get_curr_id());
    3068
    3069         mutex_enter(&kernel_mutex);
    3070
    3071         /* Free the slot for reuse. */
    3072         slot = srv_table_get_nth_slot(slot_no);
    3073         slot->in_use = FALSE;
    3074
    3075         mutex_exit(&kernel_mutex);

Note that after 'suspending' itself, the purge thread tries to acquire the kernel mutex again on line 3069. If the thread doing the shutdown is holding the kernel mutex at that time (which is quite possible), the purge thread will wait. Meanwhile, since the number of active threads was already decremented when the purge thread was suspended, the shutdown thread will assume that all background threads are now inactive and will return from logs_empty_and_mark_files_at_shutdown() after doing the cleanup. If, before the purge thread gets a chance to run, the closing thread executes the following:

    1969         logs_empty_and_mark_files_at_shutdown();
    1970
    1971         if (srv_conc_n_threads != 0) {
    1972                 fprintf(stderr,
    1973                         "InnoDB: Warning: query counter shows %ld queries"
    1974                         " still\n"
    1975                         "InnoDB: inside InnoDB at shutdown\n",
    1976                         srv_conc_n_threads);
    1977         }
    1978
    1979         /* 2. Make all threads created by InnoDB to exit */
    1980
    1981         srv_shutdown_state = SRV_SHUTDOWN_EXIT_THREADS;

then the purge thread will exit inside os_event_wait_low() without freeing its sync_array cell. The bottom line is that once a thread declares itself suspended, it should not try to acquire any mutex/rw-lock that could potentially result in a wait; see the standalone sketch below.
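The following is a standalone sketch of that race pattern in plain C with pthreads. The names purge_thread and wait_array are hypothetical stand-ins for InnoDB's purge thread and sync_primary_wait_array, and whether the race fires on a given run is timing-dependent:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t kernel_mutex = PTHREAD_MUTEX_INITIALIZER;
static int  n_active_threads = 1;  /* the "purge thread" below */
static int* wait_array;            /* stand-in for sync_primary_wait_array */

static void*
purge_thread(void* arg)
{
    (void) arg;

    pthread_mutex_lock(&kernel_mutex);
    n_active_threads--;            /* analogue of srv_suspend_thread() */
    pthread_mutex_unlock(&kernel_mutex);

    /* BUG window: we are already counted as inactive, yet we block on
       the mutex again; shutdown may tear down shared state meanwhile. */
    pthread_mutex_lock(&kernel_mutex);
    if (wait_array == NULL) {
        printf("race hit: shutdown freed the wait array while this"
               " thread still needed it\n");
    }
    pthread_mutex_unlock(&kernel_mutex);

    return NULL;
}

int
main(void)
{
    pthread_t t;

    wait_array = malloc(sizeof(*wait_array));
    pthread_create(&t, NULL, purge_thread, NULL);

    /* Shutdown thread: once no workers are counted as active, free the
       shared state. Nothing forces the worker past its second lock
       acquisition first -- that is the race. */
    for (;;) {
        pthread_mutex_lock(&kernel_mutex);
        if (n_active_threads == 0) {
            free(wait_array);      /* analogue of sync_array_free() */
            wait_array = NULL;
            pthread_mutex_unlock(&kernel_mutex);
            break;
        }
        pthread_mutex_unlock(&kernel_mutex);
    }

    pthread_join(t, NULL);
    return 0;
}
```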
[30 Jun 2010 17:14]
Mikhail Izioumtchenko
Inaam's observation has merit, but I'd expect the tests to use the default value, which is innodb_purge_threads=0. I'll try another reproduction venue: my stress test suite using normal shutdown only, varying the builds and trying valgrind with it.
[19 Jul 2010 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[20 Jul 2010 5:42]
Vasil Dimov
Does not happen in PB2 anymore.
[21 Jul 2010 22:37]
Sunny Bains
Fix for this bug checked into InnoDB internal 5.5 tree.
[27 Sep 2010 10:18]
Luis Soares
It seems this is regressing in the mysql-next-mr codebase. Two new failures with the "Failing assertion: arr->n_reserved == 0" symptom have been spotted recently in valgrind runs:

1. http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=3365824
2. http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=3364469

These were observed in mysql-next-mr-stage and mysql-next-mr.
[18 Nov 2010 15:14]
Vasil Dimov
Failed as described above on revision-id vasil.dimov@oracle.com-20100617095953-0v2pan110jlxpeor; then on the next run of the same revision (vasil.dimov@oracle.com-20100617095953-0v2pan110jlxpeor) it did not fail, and it has not failed since.
[9 Dec 2010 1:00]
Sunny Bains
The underlying problem is the same as bug #58653, so I'm closing this bug as a duplicate of that one, even though the symptoms are different.