Bug #24884 mysqld got signal 11 on a slave host
Submitted: 7 Dec 2006 13:24 Modified: 17 Nov 2008 21:46
Reporter: Cyril SCETBON Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S1 (Critical)
Version:4.1.18 OS:Linux (Debian GNU/Linux 3.1)
Assigned to: Heikki Tuuri CPU Architecture:Any

[7 Dec 2006 13:24] Cyril SCETBON
Description:
A master is replicated over 2 slaves which are used for different tasks.
One of the slave which is the least loaded works fine but the other is getting signal 11.

How to repeat:
unknown
[7 Dec 2006 13:24] Cyril SCETBON
The error file

Attachment: magic-sql2.err.log (application/octet-stream, text), 41.78 KiB.

[7 Dec 2006 13:25] Cyril SCETBON
1st stack

Attachment: mysql_06_12_06_12_15.trace (application/octet-stream, text), 1.31 KiB.

[7 Dec 2006 13:25] Cyril SCETBON
2nd stack

Attachment: mysql_07_12_06_05_40.trace (application/octet-stream, text), 1.36 KiB.

[7 Dec 2006 13:26] Cyril SCETBON
3rd stack

Attachment: mysql_07_12_06_05_45.trace (application/octet-stream, text), 1.36 KiB.

[7 Dec 2006 13:26] Cyril SCETBON
4th stack

Attachment: mysql_07_12_06_05_50.trace (application/octet-stream, text), 1.36 KiB.

[7 Dec 2006 13:27] Valeriy Kravchuk
Thank you for a problem report. Please, try to repeat with a newr version, 4.1.22, on botyh master and slaves. In case of similar problem, please, send the appropriate part of error log and your my.cnf file content from failing slave.
[7 Dec 2006 13:28] Cyril SCETBON
All the stacks are the same :

0x81214d8 handle_segfault + 356
0xb7f38420 _end + -1348008912
0xa (?)
0x821ac04 row_search_for_mysql + 6792
0x81a1bf4 _ZN11ha_innobase13general_fetchEPcjj + 64
0x8199ced _ZN7handler15read_range_nextEv + 37
0x81904e0 _ZN12QUICK_SELECT8get_nextEv + 308
0x81957da _Z8rr_quickP14st_read_record + 26
0x8166b0b _Z21join_init_read_recordP13st_join_table + 71
0x8165ca5 _Z10sub_selectP4JOINP13st_join_tableb + 97
0x815faa9 _Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure + 265
0x81578d1 _ZN4JOIN4execEv + 3553
0x8103409 _ZN30subselect_single_select_engine4execEv + 101
0x80fff84 _ZN14Item_subselect4execEv + 32
0x8100ba7 _ZN21Item_exists_subselect3valEv + 19
0x80e276d _ZN13Item_func_not7val_intEv + 21
0x80e7855 _ZN13Item_cond_and7val_intEv + 65
0x8165d41 _Z10sub_selectP4JOINP13st_join_tableb + 253
0x815faa9 _Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure + 265
0x81578d1 _ZN4JOIN4execEv + 3553
0x8158448 _Z12mysql_selectP3THDPPP4ItemP13st_table_listjR4ListIS1_ES2_jP8st_orderSB_S2_SB_mP13select_resultP18st_select_lex_unitP13st_sel + 380
0x8154cc1 _Z13handle_selectP3THDP6st_lexP13select_result + 173
0x8133a45 _Z21mysql_execute_commandP3THD + 849
0x8138c9e _Z11mysql_parseP3THDPcj + 166
0x81325e9 _Z16dispatch_command19enum_server_commandP3THDPcj + 893
0x813222e _Z10do_commandP3THD + 122
0x8131a25 handle_one_connection + 829
0xb7f21b63 _end + -1348101261
0xb7e5a18a _end + -1348918886

except the lines containing the _end symbol
[7 Dec 2006 13:30] Cyril SCETBON
I can't change the version at the moment cause it is in production. I'll change it only if I'm only sure it's the only way to solve the issue.
[7 Dec 2006 13:41] Valeriy Kravchuk
Sorry, I missed your files uploaded when answering... 

Anyway, please, send my.cnf content from both slaves, and describe their hardware (how much RAM do they have) and software (uname -a results, glibc versions).
[7 Dec 2006 13:48] Cyril SCETBON
The my.cnf of the slave concerned by the issue

Attachment: my.cnf (application/octet-stream, text), 2.51 KiB.

[7 Dec 2006 14:21] Cyril SCETBON
uname -a configuration of the slave concerned

Linux magic-sql2.echo-net.net 2.6.18.1 #1 SMP Thu Nov 2 09:15:36 CET 2006 i686 GNU/

cat /proc/cpuinfo :
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 4
model name	: Intel(R) Xeon(TM) CPU 3.00GHz
stepping	: 1
cpu MHz		: 3000.356
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips	: 6003.63

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 15
model		: 4
model name	: Intel(R) Xeon(TM) CPU 3.00GHz
stepping	: 1
cpu MHz		: 3000.356
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips	: 6000.52

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 15
model		: 4
model name	: Intel(R) Xeon(TM) CPU 3.00GHz
stepping	: 1
cpu MHz		: 3000.356
cache size	: 1024 KB
physical id	: 3
siblings	: 2
core id		: 0
cpu cores	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips	: 6000.46

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 15
model		: 4
model name	: Intel(R) Xeon(TM) CPU 3.00GHz
stepping	: 1
cpu MHz		: 3000.356
cache size	: 1024 KB
physical id	: 3
siblings	: 2
core id		: 0
cpu cores	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips	: 6000.45

cat /proc/meminfo :
MemTotal:      4086360 kB
MemFree:       1096580 kB
Buffers:        198104 kB
Cached:        2552752 kB
SwapCached:          0 kB
Active:        2139048 kB
Inactive:       697912 kB
HighTotal:     3211208 kB
HighFree:       855180 kB
LowTotal:       875152 kB
LowFree:        241400 kB
SwapTotal:      526312 kB
SwapFree:       526308 kB
Dirty:             524 kB
Writeback:           0 kB
AnonPages:       85688 kB
Mapped:          10124 kB
Slab:           138796 kB
PageTables:       1224 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   2569492 kB
Committed_AS:   239828 kB
VmallocTotal:   116728 kB
VmallocUsed:      3312 kB
VmallocChunk:   113404 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
[7 Dec 2006 15:33] Heikki Tuuri
Hi!

It may be that the table is corrupt.

What does CHECK TABLE ... say?

Have you upgraded to 4.1.xx recently?

Regards,

Heikki
[7 Dec 2006 15:52] Cyril SCETBON
We didn't upgrade the database for a year.
I checked all the tables referenced with a "select ... from XX" in the error log and none is corrupted.
[7 Dec 2006 15:52] Cyril SCETBON
We didn't upgrade the database for a year.
I checked all the tables referenced with a "select ... from XX" in the error log and none is corrupted.
[7 Dec 2006 15:55] Heikki Tuuri
But secondary indexes may be corrupt. You can check also them with CHECK TABLE ... .

Warning: for a big table (bigger than the buffer pool), CHECK TABLE will run very long, hours or days.
[7 Dec 2006 16:30] Cyril SCETBON
mysql> check table APPUSER extended;
+----------------+-------+----------+----------+
| Table          | Op    | Msg_type | Msg_text |
+----------------+-------+----------+----------+
| magic4.APPUSER | check | status   | OK       |
+----------------+-------+----------+----------+
1 row in set (0.01 sec)

mysql> check table USER_NOTIFICATION extended;
+--------------------------+-------+----------+----------+
| Table                    | Op    | Msg_type | Msg_text |
+--------------------------+-------+----------+----------+
| magic4.USER_NOTIFICATION | check | status   | OK       |
+--------------------------+-------+----------+----------+
1 row in set (0.46 sec)

mysql> check table NOTIFICATION extended;
+---------------------+-------+----------+----------+
| Table               | Op    | Msg_type | Msg_text |
+---------------------+-------+----------+----------+
| magic4.NOTIFICATION | check | status   | OK       |
+---------------------+-------+----------+----------+
1 row in set (0.64 sec)
[8 Dec 2006 8:52] Cyril SCETBON
Another suggestion ?
[8 Dec 2006 13:34] Cyril SCETBON
This problem is causing shutdown of our production slave. 
Did you find anything else ?
[8 Dec 2006 13:43] Heikki Tuuri
Hi!

Can you repeat this on 4.1.22?

Please give the SELECT query that is crashing, and print EXPLAIN SELECT ... for it.

Regards,

Heikki
[8 Dec 2006 14:05] Cyril SCETBON
If I was able I would do it. But I don't know which one is causing the segmentation fault. You've got all the error files I found.
[8 Dec 2006 15:09] Valeriy Kravchuk
Please, check with 4.1.22, as Heikki suggested.

One of the crashing queries is:

select ID_USER, LOGIN, OTFA_HASH, US_CREATED_BY from APPUSER where OTFA_HASH is not null and US_STATUS <> 'K' and not exists (select UN.ID_USER from USER_NOTIFICATION UN, NOTIFICATION N where UN.ID_USER = APPUSER.ID_USER and UN.ID_NOTIFICATION = N.ID_NOTIFICATION and N.ID_MAIL in (3, 5))

Can you send EXPLAIN results for it, for example?
[8 Dec 2006 15:18] Cyril SCETBON
mysql> explain select ID_USER, LOGIN, OTFA_HASH, US_CREATED_BY from APPUSER where OTFA_HASH is not null and US_STATUS <> 'K' and not exists (select UN.ID_USER from USER_NOTIFICATION UN, NOTIFICATION N where UN.ID_USER = APPUSER.ID_USER and UN.ID_NOTIFICATION = N.ID_NOTIFICATION and N.ID_MAIL in (3, 5))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: APPUSER
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 253
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: N
         type: range
possible_keys: PRIMARY,IK1_NOTIFICATION
          key: IK1_NOTIFICATION
      key_len: 2
          ref: NULL
         rows: 91
        Extra: Using where; Using index
*************************** 3. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: UN
         type: ref
possible_keys: PRIMARY,IK1_USER_NOTIFICATION,IK2_USER_NOTIFICATION
          key: IK1_USER_NOTIFICATION
      key_len: 4
          ref: magic4.N.ID_NOTIFICATION
         rows: 1
        Extra: Using where; Using index
[8 Dec 2006 15:33] Shane Bester
the stack trace reminds me of that READ-COMMITTED bug, but that was >= 5.0.  Can it be related to bug #19834 ?
[8 Dec 2006 15:36] Cyril SCETBON
As you saw we got this in my.cnf :
transaction-isolation = READ-COMMITTED
[8 Dec 2006 15:48] Heikki Tuuri
I think the READ COMMITTED bug was in 5.1. It was about releasing a row lock.

You are not using innodb_locks_unsafe_for_binlog?
[8 Dec 2006 15:59] Cyril SCETBON
No it's OFF.

You don't thing that we have to stop using READ COMMITTED transaction isolation level ?
[11 Dec 2006 10:58] Cyril SCETBON
Hi, You didn't answer to my last question. 
What's your opinion ? 
Anything else ?
[12 Dec 2006 14:52] Heikki Tuuri
http://bugs.mysql.com/bug.php?id=24996 may be a duplicate of this.
[12 Dec 2006 15:33] Heikki Tuuri
The seg fault may happen in the function below.

row0sel.c in 4.1.18

/******************************************************************
Stores the row id to the prebuilt struct. */
static
void
row_sel_store_row_id_to_prebuilt(
/*=============================*/
        row_prebuilt_t* prebuilt,       /* in: prebuilt */
        rec_t*          index_rec,      /* in: record */
        dict_index_t*   index)          /* in: index of the record */
{
        byte*   data;
        ulint   len;
        data = rec_get_nth_field(index_rec,
                        dict_index_get_sys_col_pos(index, DATA_ROW_ID), &len);

        if (len != DATA_ROW_ID_LEN) {
                fprintf(stderr,
"InnoDB: Error: Row id field is wrong length %lu in ", (ulong) len);
                dict_index_name_print(stderr, prebuilt->trx, index);
                fprintf(stderr, "\n"
"InnoDB: Field number %lu, record:\n",
                        (ulong) dict_index_get_sys_col_pos(index, DATA_ROW_ID))\
;
                rec_print(stderr, index_rec);
                putc('\n', stderr);
                ut_error;
        }

        ut_memcpy(prebuilt->row_id, data, len);
}

0x08245770 <row_search_for_mysql+6640>: call   0x82134e0 <dict_index_name_print>
0x08245775 <row_search_for_mysql+6645>: mov    0xfffffb6c(%ebp),%edi
0x0824577b <row_search_for_mysql+6651>: mov    0x18(%edi),%edx
0x0824577e <row_search_for_mysql+6654>: mov    0x30(%edx),%ecx
0x08245781 <row_search_for_mysql+6657>: mov    0x34(%edx),%edx
0x08245784 <row_search_for_mysql+6660>: sub    $0x4,%ecx
0x08245787 <row_search_for_mysql+6663>: lea    (%ecx,%ecx,4),%eax
0x0824578a <row_search_for_mysql+6666>: testb  $0x1,0xc(%edi)
0x0824578e <row_search_for_mysql+6670>: lea    (%ecx,%eax,2),%eax
0x08245791 <row_search_for_mysql+6673>: lea    (%edx,%eax,4),%eax
0x08245794 <row_search_for_mysql+6676>: je     0x8245880 <row_search_for_mysql+6 912>
0x0824579a <row_search_for_mysql+6682>: mov    0x8(%eax),%eax
0x0824579d <row_search_for_mysql+6685>: mov    %eax,0x8(%esp)
0x082457a1 <row_search_for_mysql+6689>: mov    0x844674c,%edx
0x082457a7 <row_search_for_mysql+6695>: mov    $0x83ef500,%eax
0x082457ac <row_search_for_mysql+6700>: mov    %eax,0x4(%esp)
0x082457b0 <row_search_for_mysql+6704>: mov    %edx,(%esp)
0x082457b3 <row_search_for_mysql+6707>: call   0x80c2a38 <__cxa_pure_virtual+240 >
0x082457b8 <row_search_for_mysql+6712>: mov    %esi,0x4(%esp)
0x082457bc <row_search_for_mysql+6716>: mov    0x844674c,%ecx
0x082457c2 <row_search_for_mysql+6722>: mov    %ecx,(%esp)
0x082457c5 <row_search_for_mysql+6725>: call   0x8309290 <rec_print>
0x082457ca <row_search_for_mysql+6730>: movl   $0xa,(%esp)
0x082457d1 <row_search_for_mysql+6737>: mov    0x844674c,%esi
0x082457d7 <row_search_for_mysql+6743>: mov    %esi,0x4(%esp)
0x082457db <row_search_for_mysql+6747>: call   0x80c32c8 <exit+48>
0x082457e0 <row_search_for_mysql+6752>: mov    0x844674c,%edi
0x082457e6 <row_search_for_mysql+6758>: mov    %edi,(%esp)
0x082457e9 <row_search_for_mysql+6761>: call   0x831c950 <ut_print_timestamp>
0x082457ee <row_search_for_mysql+6766>: call   0x831e680 <os_thread_get_curr_id>
0x082457f3 <row_search_for_mysql+6771>: mov    %eax,(%esp)
0x082457f6 <row_search_for_mysql+6774>: call   0x831e670 <os_thread_pf>
0x082457fb <row_search_for_mysql+6779>: mov    %eax,0x8(%esp)
0x082457ff <row_search_for_mysql+6783>: mov    0x84238dc,%esi
0x08245805 <row_search_for_mysql+6789>: mov    $0x868,%edx
0x0824580a <row_search_for_mysql+6794>: mov    %edx,0x10(%esp)
0x0824580e <row_search_for_mysql+6798>: mov    0x844674c,%edi
0x08245814 <row_search_for_mysql+6804>: mov    $0x83ef01d,%ecx
0x08245819 <row_search_for_mysql+6809>: mov    %ecx,0xc(%esp)
0x0824581d <row_search_for_mysql+6813>: mov    %esi,0x4(%esp)
0x08245821 <row_search_for_mysql+6817>: mov    %edi,(%esp)
[12 Dec 2006 15:46] Cyril SCETBON
Any way to fix it ?
[12 Dec 2006 16:43] Heikki Tuuri
Can you compile a debug build of mysqld, run it inside gdb and repeat the crash with the query below?

select ID_USER, LOGIN, OTFA_HASH, US_CREATED_BY from APPUSER
where OTFA_HASH is not null and US_STATUS <> 'K' and not exists (select
UN.ID_USER from USER_NOTIFICATION UN, NOTIFICATION N where UN.ID_USER =
APPUSER.ID_USER and UN.ID_NOTIFICATION = N.ID_NOTIFICATION and N.ID_MAIL in (3,
5))

When mysqld crashes in gdb, do

(gdb) bt full

so that we see where it crashes.

Regards,

Heikki
[12 Dec 2006 16:54] Cyril SCETBON
Sorry, but I can do it. It often happens, but we don't know how to reproduce this BUG.

Do you still suspect the option transaction-isolation ?
[12 Dec 2006 17:00] Heikki Tuuri
You could test if the bug disappears if you change the isolation level.

But you CAN catch this bug inside gdb. Just start inside gdb the debug build of mysqld compiled from source. It works like a normal mysqld until it gets a seg fault. Then do that:

(gdb) bt full
[12 Dec 2006 17:06] Cyril SCETBON
I'll try it.

For the debug build version of mysql I don't think that the managers will accept it :-(
But maybe ..
[19 Dec 2006 11:00] Cyril SCETBON
The BUG seems to be still present when not using READ COMMITTED transaction isolation level :-(

However I've noticed that Justin Albstmeijer (http://bugs.mysql.com/bug.php?id=24996) do not hit the BUG anymore (I suppose). I've read all the changes between release 4.1.18 and 4.1.22 but I don't see anything related to my BUG. Did I miss something ?
[21 Dec 2006 10:47] Cyril SCETBON
Anything ?
[21 Dec 2006 14:14] Heikki Tuuri
Hi!

I am not aware of any change between 4.1.18 and 4.1.22 that would fix this bug. It is pure luck if some change has masked or fixes this bug. But please test with 4.1.22.

Regards,

Heikki
[4 Jan 2007 16:22] Cyril SCETBON
One of the slaves has been upgraded to 4.1.22 and we are waiting for news about it.
[5 Feb 2007 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[17 Nov 2008 21:46] Cyril SCETBON
No more feedback