MySQL Bugs: #77261: 5.6.23 crash while replicating

Bug #77261	5.6.23 crash while replicating
Submitted:	5 Jun 2015 15:12	Modified:	6 Jun 2015 13:26
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.6.23	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
A passive master in a master-master pair died on me today. Stack trace provided in case it's of any help.
Seen on 5.6.23 CentOS 6.

How to repeat:
13:54:39 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

key_buffer_size=67108864
read_buffer_size=131072
max_used_connections=6
max_threads=7000
thread_count=7
connection_count=4
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 8225457 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f1fa4013fc0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f1ffb4fb7e0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8dce65]
/usr/sbin/mysqld(handle_fatal_signal+0x494)[0x666234]
/lib64/libpthread.so.0(+0xf710)[0x7f2f2d946710]
/lib64/libc.so.6(+0x84329)[0x7f2f2c643329]
/usr/sbin/mysqld(_ZN12Field_string6unpackEPhPKhjb+0x7f)[0x7b600f]
/usr/sbin/mysqld(_Z10unpack_rowPK14Relay_log_infoP5TABLEjPKhPK9st_bitmapPS5_PmS5_+0x4b1)[0x8996e1]
/usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEPK14Relay_log_info+0x94)[0x873554]
/usr/sbin/mysqld(_ZN14Rows_log_event12do_apply_rowEPK14Relay_log_info+0x2f)[0x86cdef]
/usr/sbin/mysqld(_ZN14Rows_log_event24do_index_scan_and_updateEPK14Relay_log_info+0x287)[0x874a27]
/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0x8ba)[0x87a9aa]
/usr/sbin/mysqld(_ZN9Log_event11apply_eventEP14Relay_log_info+0x74)[0x8806f4]
/usr/sbin/mysqld(_Z26apply_event_and_update_posPP9Log_eventP3THDP14Relay_log_info+0x263)[0x8ab013]
/usr/sbin/mysqld[0x8b0872]
/usr/sbin/mysqld(handle_slave_sql+0xb0b)[0x8b234b]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0xb0397a]
/lib64/libpthread.so.0(+0x79d1)[0x7f2f2d93e9d1]
/lib64/libc.so.6(clone+0x6d)[0x7f2f2c6a78fd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 33245
Status: NOT_KILLED

This was on a server that had been recently (with it's peer) converted from SBR to RBR.
Due to some unexpected inconsistencies I had been skipping statements to try to get the boxes
back in sync. I don't have a clear repeatable way of reproducing this so hopefully the stack
trace will provide some sort of pointer.

Suggested fix:
If you can figure out why it died and the cause has not been resolved already, fix it.

configuration of the server concerned.

Attachment: etc_my.cnf_for_bug_report.txt (text/plain), 3.24 KiB.

Note: one of the messages I saw prior to this as I was trying to skip statements with pt-slave-restart --error-numbers=1062,1032 was the following:

                   Last_Error: Could not execute Update_rows event on table mydb.mytable; Got error 171 from storage engine, Error_code: 1030; handler error HA_ERR_CORRUPT_EVENT; the event's master log binlog.020378, end_log_pos 38062529

I was messing around with the binlog format trying to change back to SBR from RBR which finally I concluded is not possible, so wonder if somehow this triggered the other master to write something strange or simply this box picked up something it wasn't expecting to.

I don't believe that the events were "corrupt per-se" but it's possible that this changing of binlog_format between ROW and STATEMENT may trigger something which isn't expected.

A similar crash just happened on another box. I tried to change slave_exec_mode to IDEMPOTENT but it didn't seem very happy with that.

16:52:47 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

key_buffer_size=67108864
read_buffer_size=131072
max_used_connections=8
max_threads=7000
thread_count=7
connection_count=4
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 8225457 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fc070028c30
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fc088c177e0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8dce65]
/usr/sbin/mysqld(handle_fatal_signal+0x494)[0x666234]
/lib64/libpthread.so.0(+0xf710)[0x7fcff87cb710]
/lib64/libc.so.6(+0x8431b)[0x7fcff74c831b]
/usr/sbin/mysqld(_ZN12Field_string6unpackEPhPKhjb+0x7f)[0x7b600f]
/usr/sbin/mysqld(_Z10unpack_rowPK14Relay_log_infoP5TABLEjPKhPK9st_bitmapPS5_PmS5_+0x4b1)[0x8996e1]
/usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEPK14Relay_log_info+0x94)[0x873554]
/usr/sbin/mysqld(_ZN14Rows_log_event12do_apply_rowEPK14Relay_log_info+0x2f)[0x86cdef]
/usr/sbin/mysqld(_ZN14Rows_log_event24do_index_scan_and_updateEPK14Relay_log_info+0x287)[0x874a27]
/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0x8ba)[0x87a9aa]
/usr/sbin/mysqld(_ZN9Log_event11apply_eventEP14Relay_log_info+0x74)[0x8806f4]
/usr/sbin/mysqld(_Z26apply_event_and_update_posPP9Log_eventP3THDP14Relay_log_info+0x263)[0x8ab013]
/usr/sbin/mysqld[0x8b0872]
/usr/sbin/mysqld(handle_slave_sql+0xb0b)[0x8b234b]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0xb0397a]
/lib64/libpthread.so.0(+0x79d1)[0x7fcff87c39d1]
/lib64/libc.so.6(clone+0x6d)[0x7fcff752c8fd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 21246841
Status: NOT_KILLED

Again on 5.6.23, which I'm going to upgrade to 5.6.24.

On this second box I know I did this:

root@server2 [(none)]> show global variables like 'slave_exec_mode';
+-----------------+--------+
| Variable_name   | Value  |
+-----------------+--------+
| slave_exec_mode | STRICT |
+-----------------+--------+
1 row in set (0.00 sec)

root@server2 [(none)]> set global slave_exec_mode = 'IDEMPOTENT';
Query OK, 0 rows affected (0.00 sec)

For what it's worth I'd been skipping some errors using pt-slave-restart in another window:

[root@server2 ~]# pt-slave-restart --error-numbers=1062,1032 --min-sleep=0.002
...
2015-06-05T18:52:32  relaylog.029483   102589056 1062
2015-06-05T18:52:33  relaylog.029483   102633725 1062
2015-06-05T18:52:33  relaylog.029483   102843197 1610
Error 1610 is not in --error-numbers.
[root@server2 ~]# pt-slave-restart --error-numbers=1062,1032,1610 --min-sleep=0.002
2015-06-05T18:52:47  relaylog.029483   102843197 1610
DBD::mysql::st execute failed: MySQL server has gone away [for Statement "SHOW SLAVE STATUS"] at /usr/bin/pt-slave-restart line 2924.
[root@server2 ~]#

and then I saw the server had crashed.

The prettified stack from 5.6.23:

0x00000000008dce65: my_print_stacktrace at stacktrace.c:225
0x0000000000666234: handle_fatal_signal at signal_handler.cc:162
0x00000000007b600f: Field_string::unpack(unsigned char*, unsigned char const*, unsigned int, bool) at field.cc:7007
0x00000000008996e1: unpack_row(Relay_log_info const*, TABLE*, unsigned int, unsigned char const*, st_bitmap const*, unsigned char const**, unsigned long*, unsigned char const*) at rpl_record.cc:349
0x0000000000873554: Update_rows_log_event::do_exec_row(Relay_log_info const*) at log_event.cc:13080
0x000000000086cdef: Rows_log_event::do_apply_row(Relay_log_info const*) at log_event.cc:10310
0x0000000000874a27: Rows_log_event::do_index_scan_and_update(Relay_log_info const*) at log_event.cc:10712
0x000000000087a9aa: Rows_log_event::do_apply_event(Relay_log_info const*) at log_event.cc:11411
0x00000000008806f4: Log_event::apply_event(Relay_log_info*) at log_event.cc:3185
0x00000000008ab013: apply_event_and_update_pos(Log_event**, THD*, Relay_log_info*) at rpl_slave.cc:3718
0x00000000008b0872: exec_relay_log_event at rpl_slave.cc:4226
0x00000000008b234b: handle_slave_sql at rpl_slave.cc:6051
0x0000000000b0397a: pfs_spawn_thread at pfs.cc:1863

  memcpy(to, from, length);
  // Pad the string with the pad character of the fields charset
  field_charset->cset->fill(field_charset, (char*) to + length, field_length - length, field_charset->pad_char);  <-------------crash
  return from+length;
}

I do have a reason to believe *5.6.25* might be a better choice to try.

I'll see if I can download 5.6.25 then and see if that behaves better.  Thanks.

5.6.25 seems to fix this. I see no crash when setting IDEMPOTENT mode.
Thanks.

Please note: when you set IDEMPOTENT mode you lose _all_ visibility of the fact the slave may be skipping or ignoring broken comments. That's not ideal at all.

In my case right now I don't care but I can't _see_ is the statements I'm skipping are now ok or not.

So having counters (not in 5.6 as that would be an incompatible change) but in 5.7 which indicate the number of times a broken statement/event in IDEMPOTENT mode was skipped would be very good. I could use that counter to determine when the issue has gone away, thus allowing me to remove the mode and go back to STRICT mode (if wanted).

That functionality would be most helpful.

closing as can't repeat because not repeatable with current version.