| Bug #5898 | Intentional server crash in InnoDB code | ||
|---|---|---|---|
| Submitted: | 5 Oct 2004 8:42 | Modified: | 30 Oct 2004 8:49 |
| Reporter: | Jani Tolonen | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S1 (Critical) |
| Version: | 4.0, 4.1, 5.0 | OS: | Any (All) |
| Assigned to: | Marko Mäkelä | CPU Architecture: | Any |
[5 Oct 2004 12:02]
Heikki Tuuri
Marko,
only crash the server if the same semaphore wait time exceeds 600 seconds during several ROUNDS of the check. The check runs every 2 seconds currently.
Regards,
Heikki
rv_error_monitor_thread(
/*=====================*/
/* out: a dummy parameter */
void* arg __attribute__((unused)))
/* in: a dummy parameter required by
os_thread_create */
{
ulint cnt = 0;
#ifdef UNIV_DEBUG_THREAD_CREATION
fprintf(stderr, "Error monitor thread starts, id %lu\n",
os_thread_pf(os_thread_get_curr_id()));
#endif
loop:
srv_error_monitor_active = TRUE;
cnt++;
os_thread_sleep(2000000);
if (difftime(time(NULL), srv_last_monitor_time) > 60) {
/* We referesh InnoDB Monitor values so that averages are
printed from at most 60 last seconds */
srv_refresh_innodb_monitor_stats();
}
sync_array_print_long_waits();
/* Flush stderr so that a database user gets the output
to possible MySQL error file */
fflush(stderr);
if (srv_shutdown_state < SRV_SHUTDOWN_LAST_PHASE) {
goto loop;
}
srv_error_monitor_active = FALSE;
/* We count the number of threads in os_thread_exit(). A created
thread should always use that to exit and not use return() to exit. */
os_thread_exit(NULL);
#ifndef __WIN__
return(NULL);
#else
return(0);
[7 Oct 2004 13:29]
Marko Mäkelä
ChangeSet@1.2033, 2004-10-07 15:58:47+03:00, marko@hundin.mysql.fi InnoDB: tolerate system clock glitches a little better in the error monitor thread. (Bug #5898) Jani, the function times() is not a solution, since it measures CPU time, not real time. Something like RDTSC on x86 would work, but such functions are not available on all platforms. For now, I implemented a counter that checks if the fatal timeout has been exceeded for multiple successive rounds of the check. According to Heikki, normally the locks should be released within one second. If the system clock is adjusted, with this fix InnoDB won't crash if the currently active locks are released within ten seconds. A long-term fix would be to implement a graceful shutdown in case the check fails. released within
[30 Oct 2004 8:49]
Heikki Tuuri
Fixed in 4.0.22.

Description: In InnoDB code, innobase/sync/sync0arr.c Depending on version, near line 914 there is the following piece of code: ------------------------------------------------------------------------------------------- if (cell->wait_object != NULL && difftime(time(NULL), cell->reservation_time) > fatal_timeout) { fprintf(stderr, "InnoDB: Error: semaphore wait has lasted > %lu seconds\n" "InnoDB: We intentionally crash the server, because it appears to be hung.\n", fatal_timeout); ut_error; } ------------------------------------------------------------------------------------------- The above test is checking whether a semaphore has got stuck. The fatal timeout is 600 seconds. If InnoDB notices that it has waited for a semaphore to complete more than 600 seconds, it crashes the server intentionally. There are two problems here: 1) A bug in the if() test. It relays on system function time(), which is dependent on the local time of the computer. If someone changes the machine time after a semaphore has been created, it can lead to a crash. There are strong indications that this has happened on some servers already. A fix to this problem is to create an internal time counter instead of relaying on the system time. Maybe using times() function instead, mysql.cc works as an example. 2) Although it is somewhat okay to crash a server that runs purely transactional table handlers, such as InnoDB, this is still not a good approach for systems running MyISAM / ISAM / HEAP table types, as data loss is possible. Also crashing a server in case of a deadlock or 'being stuck' seems quite a brutal way. Continously, defining a 'fatal timeout' such as 600 seconds at the moment is very vague. It all depends on the hardware, application design and current load. A machine can swap really badly for over an hour without being stuck, just as an example. A long term fix would probably be to implement an internal alarm send/receive system for cases like this. In case an alarm is triggered, send user a warning or an error and let user or application on top of the DB server decide what to do. How to repeat: On a heavy loaded system using InnoDB, where semaphores are being created, change system clock 10 minutes or more forward. You may find MySQL server being crashed. Suggested fix: See the description.