Bug #62210 memory corruption for THD after deadlock in foreign key reference constrain
Submitted: 19 Aug 2011 11:47 Modified: 2 Feb 2012 19:59
Reporter: liu hickey (OCA) Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: InnoDB Plugin storage engine Severity:S2 (Serious)
Version:5.1.48 (1.0.9) OS:Linux (2.6.18-164.el5)
Assigned to: CPU Architecture:Any
Tags: deadlock; lock timeout

[19 Aug 2011 11:47] liu hickey
Description:

We hit such issue after running for about one month with foreign key reference:
one parent table can not be updated as lock timeout.

Three tables with foreign key referenced, meta_table(parent), meta_column(child), and mete_column_type(grandson)
for example, meta_column table schema is:
-----------------------------------------------------------------------------------------------------------+
| meta_column | CREATE TABLE `meta_column` (
  `column_id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'id',
  `table_id` bigint(20) NOT NULL COMMENT '',
  `name` varchar(255) DEFAULT '' COMMENT '
  `type` varchar(50) DEFAULT '' COMMENT '',
  `length` int(11) DEFAULT NULL COMMENT '',
  `description` text COMMENT '',
  `idx` int(11) DEFAULT NULL,
  PRIMARY KEY (`column_id`),
  KEY `FK_META_MTC` (`table_id`),
  KEY `index_meta_column_table_id` (`table_id`),
  KEY `index_meta_column_table_idx` (`idx`),
  CONSTRAINT `FK_META_MTC` FOREIGN KEY (`table_id`) REFERENCES `meta_table` (`table_id`)
) ENGINE=InnoDB AUTO_INCREMENT=10785288 DEFAULT CHARSET=utf8 COMMENT='                                                      |

The application runs well but one day, all the update/insert/delete operation for meta_table are blocked until lock timeout, while the queries to other tables work well. After about 10 hours, it's still exist and we failed to locate the hidden problem, therefore, we restart mysqld and everything is ok.

Before restart mysqld, the innodb status shows some latest deadlock info:

*** (1) TRANSACTION:
TRANSACTION 56F5B, ACTIVE 8 sec, process no 5325, OS thread id 1235179840 inserting
mysql tables in use 1, locked 1
LOCK WAIT 1329 lock struct(s), heap size 162232, 303636 row lock(s), undo log entries 237
MySQL thread id 4456815, query id 19076201 172.24.113.136 cheetah update
insert into meta_column(table_id,name,type,idx,description) values(34917,'auction_id','bigint',0,'')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 177 page no 38 n bits 128 index `PRIMARY` of table `magicloud`.`meta_table` trx id 56F5B lock mode S locks rec but not gap waiting
Record lock, heap no 49 PHYSICAL RECORD: n_fields 21; compact format; info bits 0
*** (2) TRANSACTION:
TRANSACTION 56F5D, ACTIVE 5 sec, process no 5325, OS thread id 1234913600 starting index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
461 lock struct(s), heap size 47544, 27735 row lock(s), undo log entries 10471
MySQL thread id 4456818, query id 19083183 172.24.113.136 cheetah Updating
update meta_table set partition_path='hdfs://hdpnn:9000/group/taobao/taobao/hive/m_group_changed_detail' where name='m_group_changed_detail'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 177 page no 38 n bits 128 index `PRIMARY` of table `magicloud`.`meta_table` trx id 56F5D lock_mode X locks rec but not gap
Record lock, heap no 5 PHYSICAL RECORD: n_fields 21; compact format; info bits 0
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 177 page no 130 n bits 280 index `index_meta_table_name` of table `magicloud`.`meta_table` trx id 56F5D lock_mode X waiting
Record lock, heap no 137 PHYSICAL RECORD: n_fields 3; compact format; info bits 0

*** WE ROLL BACK TRANSACTION (1)

Seen from above, it looks like a normal deadlock, but it is not, as we can do any update the parent table from then on.
No other issue is detected in /var/log/messages or mysql error log

We tried to observe some abnormal issues from the generated core file, and there is really some unexpected things we caught.

5371804 related THD context for mi:
39436     mi = {
39437       host = 0x1a00000900000000 <Address 0x1a00000900000000 out of bounds>,
39438       user = 0x1000040000000 <Address 0x1000040000000 out of bounds>,
39439       password = 0x600000000000000 <Address 0x600000000000000 out of bounds>,
39440       log_file_name = 0x2100210464747303 <Address 0x2100210464747303 out of bounds>,
39441       port = 1828724992,  
39442       connect_retry = 1667852129,
39443       pos = 7237413245730254700,
39444       server_id = 7022349238346806369,
39445       ssl = 1819239263,
39446       ssl_verify_server_cert = 544107893,
39447       ssl_key = 0x6373656420746573 <Address 0x6373656420746573 out of bounds>,
39448       ssl_cert = 0x3d6e6f6974706972 <Address 0x3d6e6f6974706972 out of bounds>,
39449       ssl_ca = 0x6572656877202727 <Address 0x6572656877202727 out of bounds>,
39450       ssl_capath = 0x695f656c62617420 <Address 0x695f656c62617420 out of bounds>,
39451       ssl_cipher = 0x2030373439333d64 <Address 0x2030373439333d64 out of bounds>,
39452       relay_log_name = 0x656d616e20646e61 <Address 0x656d616e20646e61 out of bounds>,
39453       relay_log_pos = 7881145911960414013
39454     },

*************************** 10. row ***************************
     Id: 5371804
   User: cheetah
   Host: 172.24.113.146:59330
     db: magicloud
Command: Sleep
   Time: 801
  State:
   Info: NULL

Unfortunately, memory corruption is hit, and almost half of these THD instances suffered the problem.

How to repeat:
It's hard to catch the scenario but we could try to repeat with lots of foreign key referenced tables with high stress tests. That might give us some clue.
[19 Aug 2011 11:48] liu hickey
gdb info for generated core file

Attachment: gdb.log (application/octet-stream, text), 67.00 KiB.

[19 Aug 2011 11:49] liu hickey
show innodb status

Attachment: innodb.log (application/octet-stream, text), 54.58 KiB.

[19 Aug 2011 11:49] liu hickey
show processlist info

Attachment: proc.log (application/octet-stream, text), 4.66 KiB.

[17 Jan 2012 17:12] Sveta Smirnova
Thank you for the report.

You have 1 long-running transaction:

TRANSACTION 99E87, ACTIVE 1212 sec, process no 5325, OS thread id 1105189184
1414 lock struct(s), heap size 178616, 308451 row lock(s), undo log entries 1782
MySQL thread id 5427673, query id 108306779 172.24.113.136 cheetah
Trx read view will not see trx with id >= 99E88, sees < 99E88

It holds 1414 lock structs and can easily hold locks on the parent table. You also have a lot of sleeping transactions. Do you use autocommit=0 and don't close connections?

When you see same problem next time try to close all idle sleeping connections and check if this solves the issue.
[2 Feb 2012 2:18] liu hickey
Thanks Sveta pointing out the detail info which I did not pay much attention to. Yes, it's possible to block transactions applying locks.

We did set autocommit=0 and keep connections on. Killing the oldest trx for 'hang' situation on product environment is what we usually did, it's a useful quick-repair skill for running MySQL.

As we failed to identify this trouble which hit once each month (twice for all), we later removed the foreign key reference later, and that system running well from then, so I could not provide further info for this problem :(

If it's due to some client's uncommit logic bug, after removed FK ref, this issue should also be hit. However it's not. So I eliminate the application's potential uncommit bug.

The innodb status is monitored after the parent table 'hang' 10 hours, considering the time, it seems that long trx doesn't contribute much for the problem. 

Hope any others hit such issue.
[2 Feb 2012 19:59] Sveta Smirnova
Thank you for the feedback.

> If it's due to some client's uncommit logic bug, after removed FK ref,
this issue should also be hit. However it's not. So I eliminate the
application's potential uncommit bug.

Not necessary. If parent table holds the lock removing foreign key will help and this is expected.

Closing report as "Can't repeat" for now.