Bug #13057 Server crash after mysqlhotcopy after update from 4.1.13 to 4.1.14
Submitted: 8 Sep 2005 8:56 Modified: 12 Oct 2005 12:33
Reporter: Beat Vontobel (Silver Quality Contributor) (OCA) Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S1 (Critical)
Version:4.1.14 OS:Linux (Linux 2.4.21 (SuSE Professional))
Assigned to: CPU Architecture:Any

[8 Sep 2005 8:56] Beat Vontobel
Description:
Since the update from 4.1.13 to 4.1.14 the mysqld process crashes (at a first glance randomly) during the first queries after the UNLOCK TABLES from mysqlhotcopy. This happened two out of five times during our nightly backup. After that I had to downgrade again to 4.1.13. We never experienced anything similar with versions before 4.1.14, and it didn't happen again since the downgrade to 4.1.13.

During backup a huge number of locked queries build up, some of them trying to get a LOCK TABLES themselves. I found the infamous "Lost connection to MySQL server during query" in quite a few of our applications logs, so more than one query got interrupted - but I don't know which one caused the crash. The stack traces below suggest, that the crash didn't happen at exactly the same query both times:

Resolved stack trace crash 1:
0x808cb93 handle_segfault + 423
0x82e4c18 pthread_sighandler + 184
0x82d1d90 my_strcasecmp_utf8 + 232
0x80b7fe0 find_field_in_table__FP3THDP8st_tablePCcUibT4PUi + 148
0x80b826f find_field_in_tables__FP3THDP10Item_identP13st_table_listPP13st_table_listb + 451
0x804ea90 fix_fields__10Item_fieldP3THDP13st_table_listPP4Item + 68
0x805666a fix_fields__9Item_funcP3THDP13st_table_listPP4Item + 130
0x80b92e5 setup_conds__FP3THDP13st_table_listPP4Item + 109
0x80bd363 prepare__4JOINPPP4ItemP13st_table_listUiP4ItemUiP8st_orderT6T4T6P13st_select_lexP18st_select_lex_unit + 387
0x80c009f mysql_select__FP3THDPPP4ItemP13st_table_listUiRt4List1Z4ItemP4ItemUiP8st_orderT7T5T7UlP13select_resultP18st_se
lect_lex_unitP13s + 695
0x80d9422 mysql_multi_update__FP3THDP13st_table_listPt4List1Z4ItemT2P4ItemUl15enum_duplicatesbP18st_select_lex_unitP13st
_select_lex + 294
0x809e46d mysql_execute_command__FP3THD + 7673
0x80a1437 mysql_parse__FP3THDPcUi + 207
0x809b73e dispatch_command__F19enum_server_commandP3THDPcUi + 1578
0x809b108 do_command__FP3THD + 188
0x809a7d7 handle_one_connection + 615
0x82e23cc pthread_start_thread + 220
0x830bd5a thread_start + 4

Resolved stack trace crash 2:
0x808cb93 handle_segfault + 423
0x82e4c18 pthread_sighandler + 184
0x804d158 set_field__10Item_fieldP5Field + 52
0x804ef82 fix_fields__10Item_fieldP3THDP13st_table_listPP4Item + 1334
0x80b8c2b setup_fields__FP3THDPP4ItemP13st_table_listRt4List1Z4ItembPt4List1Z4ItemT4 + 151
0x80bd2f8 prepare__4JOINPPP4ItemP13st_table_listUiP4ItemUiP8st_orderT6T4T6P13st_select_lexP18st_select_lex_unit + 280
0x80c009f mysql_select__FP3THDPPP4ItemP13st_table_listUiRt4List1Z4ItemP4ItemUiP8st_orderT7T5T7UlP13select_resultP18st_se
lect_lex_unitP13s + 695
0x80bd1a6 handle_select__FP3THDP6st_lexP13select_result + 150
0x809e742 mysql_execute_command__FP3THD + 8398
0x80a1437 mysql_parse__FP3THDPcUi + 207
0x809b73e dispatch_command__F19enum_server_commandP3THDPcUi + 1578
0x809b108 do_command__FP3THD + 188
0x809a7d7 handle_one_connection + 615
0x82e23cc pthread_start_thread + 220
0x830bd5a thread_start + 4

How to repeat:
Run mysqlhotcopy on a very busy server with a lot of LOCK and UNLOCK TABLES.
[8 Sep 2005 8:57] Beat Vontobel
Switched to S1, as this is a crashing bug.
[24 Sep 2005 11:38] Valeriy Kravchuk
Thank you for the bug report.

I see nothing in the 4.1.14 changes (http://dev.mysql.com/doc/mysql/en/news-4-1-14.html) that may cause such a bug that is never observed on 4.1.13. But, please, look at the changes list for any items similar to your environment, usage profile etc.

Please, provide the error log for the appropriate period, your my.cnf file content and hardware description.

The definition of a "very busy server" will be also useful for those who will try to repeat this behaviour. How many queries per second, how many simultaneous connections?

We really need more information to build a repeatable test case or prove that it is not, say, a bad hardware or configuration used.
[27 Sep 2005 8:34] Beat Vontobel
Of course I can not prove you, that it's a new bug in 4.1.14. I just never had a crash on our master server with any build before and I didn't have a crash since I switched back to 4.1.13. I switched to 4.1.14 again today and activated the general query log. Let's wait what happens.

Your questions:

1. I see the following three bugfixes in the change-log for 4.1.14, that possibly _could_ have introduced a new bug concerning table locking:

* FLUSH TABLES WITH READ LOCK combined with LOCK TABLE .. WRITE caused deadlock. (BUG#9459)

* When two threads compete for the same table, a deadlock could occur if one thread has also a lock on another table through LOCK TABLES and the thread is attempting to remove the table in some manner and the other thread want locks on both tables. (BUG#10600)

* If a thread (connection) has tables locked, the query cache is switched off for that thread. This prevents invalid results where the locking thread inserts values between a second thread connecting and selecting from the table. (BUG#12385)

2. Hardware/OS/Configuration/Usage:

Hardware: Dell PowerEdge 2650

CPUs: Dual Intel Xeon 3.06GHz (with HyperThreading turned on)

RAM: 2 GB (no swapping occurs)

/proc/version: Linux version 2.4.21-273-smp4G (root@i386.suse.de) (gcc version 3.3.1 (SuSE Linux)) #1 SMP Mon Jan 17 13:19:07 UTC 2005

MySQL: mysql-standard-4.1.13-pc-linux-gnu-i686 (dedicated db server, no other processes besides basic os)

Database usage:
* 22 databases with a total of 220 tables (mostly MyISAM, some InnoDB, some MERGE, few MEMORY)
* total database size of about 13 GB
* 24h-average of 100 queries/s and 30 concurrent connections (not evenly distributed)
* query cache turned on
* two slaves (still on 4.1.14, as they don't crash, only master affected)

3. my.cnf/error.log

See attached files (error.log since update to 4.1.14, including downgrade to 4.1.13 in the end).
[27 Sep 2005 8:59] Beat Vontobel
error.log

Attachment: error.log (application/octet-stream, text), 6.31 KiB.

[29 Sep 2005 15:11] Valeriy Kravchuk
On September 27th you wrote:

"I switched to 4.1.14 again today and activated the general query log. Let's wait what happens."

Have you noted any new crashes after that switch? Please, inform me as soon as you got something new.

Thank you for the error log and my.cnf files. Error log is a bit old, but it contains a crash information. Are you sure you have enough memory for mysqld all the time? Not obvious for me. Please, send the top results for the system without mysql running, when it will be possible.

show create table for the table used in that REPLACE statement noted in the log (if it has something to do with the problem) can also give me some insights for the other questions.
[29 Sep 2005 16:14] Beat Vontobel
No, I'm sorry, no new crashes so far, I will inform you as soon as it happens again. In the attached log you can actually find two crashes, both happend exactly at the end of the hotcopy-backup (over the period of one week). After the second crash I switched immediately back to 4.1.13, that's why the log is a little bit old. Since then (the whole period on 4.1.13) nothing unusual happened.

The table definitions for the two tables involved in the REPLACE INTO ... SELECT ... FROM ... statement. First the table SELECTed from (`meteonews`.`metar`):

CREATE TABLE `metar` (
  `id` varchar(4) collate latin1_general_ci NOT NULL default '',
  `datetime` datetime NOT NULL default '0000-00-00 00:00:00',
  `dd` smallint(3) default NULL,
  `ff` smallint(6) default NULL,
  `fxx` smallint(6) default NULL,
  `tt` tinyint(4) default NULL,
  `td` tinyint(4) default NULL,
  `rh` tinyint(3) default NULL,
  `vis` varchar(5) collate latin1_general_ci default NULL,
  `ppnn` smallint(6) default NULL,
  `cld_okt` tinyint(1) default NULL,
  `cld_hgt` smallint(6) default NULL,
  `ceiling` smallint(6) default NULL,
  `prcp_type` enum('snow','hail','rain') collate latin1_general_ci default NULL,
  `prcp_str` tinyint(1) default NULL,
  `wx` tinyint(3) NOT NULL default '-99',
  `flags` set('ltg','fog','CAVOK','NSC','SKC','CLR','err') collate latin1_general_ci NOT NULL default '',
  `metar` text collate latin1_general_ci NOT NULL,
  `source` enum('unknown','NOAA','SCAA','FSU','MeteoSchweiz','SAWS') collate latin1_general_ci NOT NULL default 'unknown',
  `inserttime` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (`id`),
  KEY `datetime` (`datetime`),
  KEY `inserttime` (`inserttime`),
  KEY `source` (`source`),
  KEY `flags` (`flags`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;

Then the table REPLACEd INTO (`met_data`.`metar_archive`):

CREATE TABLE `metar_archive` (
  `id` char(4) collate latin1_general_ci NOT NULL default '',
  `datetime` datetime NOT NULL default '0000-00-00 00:00:00',
  `dd` smallint(3) default NULL,
  `ff` smallint(6) default NULL,
  `fxx` smallint(6) default NULL,
  `tt` tinyint(4) default NULL,
  `td` tinyint(4) default NULL,
  `rh` tinyint(3) default NULL,
  `vis` char(5) collate latin1_general_ci default NULL,
  `ppnn` smallint(6) default NULL,
  `cld_okt` tinyint(1) default NULL,
  `cld_hgt` smallint(6) default NULL,
  `ceiling` smallint(6) default NULL,
  `prcp_type` enum('snow','hail','rain') collate latin1_general_ci default NULL,
  `prcp_str` tinyint(1) default NULL,
  `wx` tinyint(3) NOT NULL default '-99',
  `flags` set('ltg','fog','CAVOK','NSC','SKC','CLR','err') collate latin1_general_ci NOT NULL default '',
  PRIMARY KEY  (`id`,`datetime`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;

This table (`met_data`.`metar_archive`) is actually part of the following MERGEd table `meteonews`.`metar_archiv`:

CREATE TABLE `metar_archiv` (
  `id` char(4) collate latin1_general_ci NOT NULL default '',
  `datetime` datetime NOT NULL default '0000-00-00 00:00:00',
  `dd` smallint(3) default NULL,
  `ff` smallint(6) default NULL,
  `fxx` smallint(6) default NULL,
  `tt` tinyint(4) default NULL,
  `td` tinyint(4) default NULL,
  `rh` tinyint(3) default NULL,
  `vis` char(5) collate latin1_general_ci default NULL,
  `ppnn` smallint(6) default NULL,
  `cld_okt` tinyint(1) default NULL,
  `cld_hgt` smallint(6) default NULL,
  `ceiling` smallint(6) default NULL,
  `prcp_type` enum('snow','hail','rain') collate latin1_general_ci default NULL,
  `prcp_str` tinyint(1) default NULL,
  `wx` tinyint(3) NOT NULL default '-99',
  `flags` set('ltg','fog','CAVOK','NSC','SKC','CLR','err') collate latin1_general_ci NOT NULL default '',
  PRIMARY KEY  (`id`,`datetime`)
) ENGINE=MRG_MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci UNION=(`met_data`.`metar_archive`,`met_archive`.`metar_2004_1`,`met_archive`.`metar_2004_2`,`met_archive`.`metar_2005_1`);

All the other tables in this MERGE are compressed MyISAM tables.

On memory: I never actually saw this machine do some real swapping, so I think it should be okay. Most of the tables we use for everyday queries are really small (in the range of up to only a few MB). And we never experienced any performance bottlenecks.

Unfortunately I can not give you a top output without mysqld running (it's a production server), but see attached top.txt for one including mysqld.
[29 Sep 2005 16:39] Beat Vontobel
top output

Attachment: top.txt (text/plain), 10.70 KiB.

[12 Oct 2005 12:33] Beat Vontobel
I'm sorry about this, the crash didn't happen again. I'm actually pretty sure I hit a bug before as the crash happened two times in exactly the same circmstances on a server that runs absolutely rock solid. But now mysqld 4.1.14 is up and running since 15 days without crash. Maybe it was a hidden bug but I'm just not able to reproduce it again, so I closed the bug as "not a bug". I'll still keep the general query log running, though, for the case it would happen again.