MySQL Bugs: #93346: Crash in Slave_reporting_capability::va

Bug #93346	Crash in Slave_reporting_capability::va_report
Submitted:	26 Nov 2018 14:00	Modified:	3 Dec 2018 11:54
Reporter:	Bernardo Perez	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.19	OS:	Linux
Assigned to:	MySQL Verification Team	CPU Architecture:	x86

Description:
The error happens when trying to write the following message on the Last_Error section of the show slave status:

"Query caused different errors on master and slave. Error on master: message (format)='' error code=3; Error on slave:actual message='no error', error code=0. Default database:''. Query:'FLUSH SLOW LOGS'"

Seems that when running the "FLUSH SLOW LOGS" we are hitting: 

https://bugs.mysql.com/bug.php?id=83232

When the error is being printed, the instance crashes with the following signature:

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
2018-11-26T12:38:14.899031Z 0 [Warning] Insecure configuration for --secure-file-priv: Location is accessible to all OS users. Consider choosing a different directory.
2018-11-26T12:38:14.899097Z 0 [Note] /usr/sbin/mysql/bin/mysqld (mysqld 5.7.19-log) starting as process 24792 ...
2018-11-26T12:38:14.900367Z 0 [Warning] Although a path was specified for the --slow-query-log-file option, log tables are used. To enable logging to files use the --log-output=file option.
2018-11-26T12:38:14.900675Z 0 [Warning] You need to use --log-bin to make --log-slave-updates work.
2018-11-26T12:38:14.902399Z 0 [Note] InnoDB: PUNCH HOLE support available
2018-11-26T12:38:14.902412Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-11-26T12:38:14.902415Z 0 [Note] InnoDB: Uses event mutexes
2018-11-26T12:38:14.902418Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2018-11-26T12:38:14.902421Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.3
2018-11-26T12:38:14.902424Z 0 [Note] InnoDB: Using Linux native AIO
2018-11-26T12:38:14.902715Z 0 [Note] InnoDB: Number of pools: 1
2018-11-26T12:38:14.902817Z 0 [Note] InnoDB: Using CPU crc32 instructions
2018-11-26T12:38:14.905096Z 0 [Note] InnoDB: Initializing buffer pool, total size = 3G, instances = 8, chunk size = 128M
2018-11-26T12:38:15.085612Z 0 [Note] InnoDB: Completed initialization of buffer pool
2018-11-26T12:38:15.123209Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2018-11-26T12:38:15.139363Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2018-11-26T12:38:15.167556Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 130869254967
2018-11-26T12:38:15.167568Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 130869256035
2018-11-26T12:38:15.168475Z 0 [Note] InnoDB: Database was not shutdown normally!
2018-11-26T12:38:15.168485Z 0 [Note] InnoDB: Starting crash recovery.
2018-11-26T12:38:15.327012Z 0 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 89 90 91 92 93 94 95 96 97 98 99
2018-11-26T12:38:15.871093Z 0 [Note] InnoDB: Apply batch completed
2018-11-26T12:38:15.871115Z 0 [Note] InnoDB: Last MySQL binlog file position 0 456, file name mysql-bin-changelog.000006
2018-11-26T12:38:15.978448Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-11-26T12:38:15.978462Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-11-26T12:38:15.978497Z 0 [Note] InnoDB: Setting file '/usr/sbin/db/innodb/ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2018-11-26T12:38:16.013843Z 0 [Note] InnoDB: File '/usr/sbin/db/innodb/ibtmp1' size is now 12 MB.
2018-11-26T12:38:16.014682Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2018-11-26T12:38:16.014699Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2018-11-26T12:38:16.014853Z 0 [Note] InnoDB: Waiting for purge to start
2018-11-26T12:38:16.065475Z 0 [Note] InnoDB: 5.7.19 started; log sequence number 130869256035
2018-11-26T12:38:16.065836Z 0 [Note] InnoDB: Loading buffer pool(s) from /usr/sbin/db/innodb/ib_buffer_pool
2018-11-26T12:38:16.065900Z 0 [Note] Plugin 'FEDERATED' is disabled.
2018-11-26T12:38:16.073642Z 0 [Note] Server hostname (bind-address): '*'; port: 3306
2018-11-26T12:38:16.073696Z 0 [Note] IPv6 is available.
2018-11-26T12:38:16.073724Z 0 [Note]   - '::' resolves to '::';
2018-11-26T12:38:16.073773Z 0 [Note] Server socket created on IP: '::'.
2018-11-26T12:38:16.102376Z 0 [Warning] Recovery from master pos 53055540 and file mysql-bin-changelog.013899 for channel ''. Previous relay log pos and relay log file had been set to 4, /usr/sbin/log/relaylog/relaylog.269564 respectively.
2018-11-26T12:38:16.105788Z 0 [Note] InnoDB: Buffer pool(s) load completed at 181126 12:38:16
2018-11-26T12:38:16.108759Z 0 [Note] Event Scheduler: Loaded 0 events
2018-11-26T12:38:16.108913Z 0 [Note] /usr/sbin/mysql/bin/mysqld: ready for connections.
Version: '5.7.19-log'  socket: '/tmp/mysql.sock'  port: 3306  MySQL Community Server (GPL)
2018-11-26T12:38:16.108928Z 0 [Note] Executing 'SELECT * FROM INFORMATION_SCHEMA.TABLES;' to get a list of tables using the deprecated partition engine. You may use the startup option '--disable-partition-engine-check' to skip this check.
2018-11-26T12:38:16.108931Z 0 [Note] Beginning of list of non-natively partitioned tables
2018-11-26T12:38:16.117593Z 0 [Note] End of list of non-natively partitioned tables
2018-11-26T12:39:00.044082Z 4 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2018-11-26T12:39:00.045659Z 4 [Note] Slave I/O thread for channel '': connected to master 'repladmin@172.21.0.28:3306',replication started in log 'mysql-bin-changelog.013899' at position 53055540
2018-11-26T12:39:00.046489Z 5 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'mysql-bin-changelog.013899' at position 53055540, relay log '/usr/sbin/log/relaylog/relaylog.269568' position: 4
12:39:00 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=1
max_threads=312
thread_count=3
connection_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 180291 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x2b7b7804c000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 2b7a8f913990 thread_stack 0x40000
/usr/sbin/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0xebdbfc]
/usr/sbin/mysql/bin/mysqld(handle_fatal_signal+0x451)[0x7b0d71]
/lib64/libpthread.so.0(+0xf100)[0x2b7a8fc28100]
/lib64/libc.so.6(+0x86ec1)[0x2b7a91125ec1]
/usr/sbin/mysql/bin/mysqld[0x1308a79]
/usr/sbin/mysql/bin/mysqld(my_vsnprintf_ex+0x26f)[0x13095ef]
/usr/sbin/mysql/bin/mysqld(_ZNK26Slave_reporting_capability9va_reportE8logleveliPKcS2_P13__va_list_tag+0x1b0)[0xe86af0]
/usr/sbin/mysql/bin/mysqld(_ZNK26Slave_reporting_capability6reportE8logleveliPKcz+0x77)[0xe86817]
/usr/sbin/mysql/bin/mysqld(_ZN15Query_log_event14do_apply_eventEPK14Relay_log_infoPKcm+0x107e)[0xe3781e]
/usr/sbin/mysql/bin/mysqld(_ZN9Log_event11apply_eventEP14Relay_log_info+0x62)[0xe35552]
/usr/sbin/mysql/bin/mysqld(_Z26apply_event_and_update_posPP9Log_eventP3THDP14Relay_log_info+0x1e0)[0xe78720]
/usr/sbin/mysql/bin/mysqld(handle_slave_sql+0x15c6)[0xe84f96]
/usr/sbin/mysql/bin/mysqld(pfs_spawn_thread+0x18f)[0x11dc6ff]
/lib64/libpthread.so.0(+0x7dc5)[0x2b7a8fc20dc5]
/lib64/libc.so.6(clone+0x6d)[0x2b7a91195c9d]

How to repeat:
The engine is in a crash loop every time trying to apply the statement:

"FLUSH SLOW LOGS" and trying to print the error.

I debugged the steps with gdb until we get into the problem.

Attachment: gdb debug (application/octet-stream, text), 12.41 KiB.

Thank you for the bug report. Please try latest released version 5.7.24. Thanks.

Hello Miguel, thanks for your fast reply.

Are you suggesting that the problems related to the crash while using Slave_reporting_capability::va_report and my_vsnprintf_ex is fixed in 5.7.24?

Thanks in advance.

Can you also kindly point to the fix that solves this issue here?

https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-24.html#mysqld-5-7-24-bug

Regards,

Thank you for the feedback. What I was asking is you follow the basic instructions how to report a bug: try with latest released version and provide a repeatable test case. Further instructions here:
https://dev.mysql.com/doc/refman/5.7/en/bug-reports.html

Hello Miguel,

Thanks for the explanation. I thought that the instructions where clear. 

To reproduce the problem you just need to follow the steps of bug:

https://bugs.mysql.com/bug.php?id=83232

Create a master and an slave and then:
chmod 000 data/c7a-slow.log
mysql> flush slow logs;
ERROR 29 (HY000): File './c7a-slow.log' not found (Errcode: 13 - Permission denied)

Check replication slave status.
Last_SQL_Error: Query caused different errors on master and slave.     Error on master: message (format)='Invalid error code' error code=29 ; Error on slave: actual message='no error', error code=0. Default database: ''. Query: 'flush slow logs'

I did this and I just reproduced it again with the same error (but different memory addres and error codes):

(gdb) bt
#0  process_str_arg (cs=cs@entry=0x1da7cc0 <my_charset_latin1>, to=to@entry=0x2b8d2962c092 "", end=end@entry=0x2b8d2962c43b "", width=18446744073709551615, par=0x303030303030303 <Address 0x303030303030303 out of bounds>, print_type=0) at /local/mysql-5.7.19.R1/strings/my_vsnprintf.c:194
#1  0x00000000013095ef in my_vsnprintf_ex (cs=0x1da7cc0 <my_charset_latin1>, to=0x2b8d2962c092 "", n=<optimized out>, fmt=<optimized out>, ap=0x2b8c425a6328) at /local/mysql-5.7.19.R1/strings/my_vsnprintf.c:607
#2  0x0000000000e86af0 in Slave_reporting_capability::va_report (this=0x2b8d2962c000, level=<optimized out>, err_code=3002, prefix_msg=0x0, msg=0x2b8c4abd2001 "Query caused different errors on master and slave. Error on master: message (format)='%s' error code=%d; Error on slave:actual message='%s', error code=%d. Default database:'%s'. Query:'%s'",
    args=0x2b8c425a6328) at /local/mysql-5.7.19.R1/sql/rpl_reporting.cc:162
#3  0x0000000000e86817 in Slave_reporting_capability::report (this=<optimized out>, level=<optimized out>, err_code=<optimized out>, msg=<optimized out>) at /local/mysql-5.7.19.R1/sql/rpl_reporting.cc:113
#4  0x0000000000e3781e in Query_log_event::do_apply_event (this=0x2b8d22318020, rli=0x2b8d2962c000, query_arg=0x2b8d222da184 "FLUSH SLOW LOGS", q_len_arg=<optimized out>) at /local/mysql-5.7.19.R1/sql/log_event.cc:4837
#5  0x0000000000e35552 in Log_event::apply_event (this=this@entry=0x2b8d22318020, rli=rli@entry=0x2b8d2962c000) at /local/mysql-5.7.19.R1/sql/log_event.cc:3315
#6  0x0000000000e78720 in apply_event_and_update_pos (ptr_ev=ptr_ev@entry=0x2b8c425a69b0, thd=thd@entry=0x2b8d29729000, rli=rli@entry=0x2b8d2962c000) at /local/mysql-5.7.19.R1/sql/rpl_slave.cc:4714
#7  0x0000000000e84f96 in exec_relay_log_event (rli=0x2b8d2962c000, thd=0x2b8d29729000) at /local/mysql-5.7.19.R1/sql/rpl_slave.cc:5224
#8  handle_slave_sql (arg=arg@entry=0x2b8d2978d000) at /local/mysql-5.7.19.R1/sql/rpl_slave.cc:7358
#9  0x00000000011dc6ff in pfs_spawn_thread (arg=0x2b8d22d5f3a0) at /local/mysql-5.7.19.R1/storage/perfschema/pfs.cc:2188
#10 0x00002b8c428b3dc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00002b8c43e28c9d in clone () from /lib64/libc.so.6

I've been debugging and I will update more tomorrow but the problem seems to be in line 43 of derror.cc:

 29 /*
 30   Error messages are stored sequentially in an array.
 31   But logically error messages are organized in sections where
 32   each section contains errors that are consecutively numbered.
 33   This function maps from a "logical" mysql_errno to an array
 34   index and returns the string.
 35 */
 36 const char* MY_LOCALE_ERRMSGS::lookup(int mysql_errno)
 37 {
 38   int offset= 0; // Position where the current section starts in the array.
 39   for (int i= 0; i < NUM_SECTIONS; i++)
 40   {
 41     if (mysql_errno < (errmsg_section_start[i] + errmsg_section_size[i]))
 42       return errmsgs[mysql_errno - errmsg_section_start[i] + offset]; 
 43     offset+= errmsg_section_size[i];
 44   }
 45   return "Invalid error code";
 46 }

I am able to consistently reproduce this in 5.7.19

The document says:

"Preferably, you should test the problem using the latest production or development version of MySQL Server before posting."

It is preferred but not forced. I will try tomorrow with the higher version certified on our systems with 5.7.23

Hi,

> It is preferred but not forced. I will try tomorrow with the higher version certified on our systems with 5.7.23

You stated that you are reproducing this by following steps from bug #83232

The bug #83232 is fixed in 5.7.24, not 5.7.23 so doubtfully 5.7.23 will show much different result from 5.7.19 you originally tested with. You have to test with 5.7.24 to see if the existing bugfix for Bug #83232 solved this bug too.

In the meantime I'll do the test myself too, but since you can easily reproduce this I'd like you to test 5.7.24

Thanks
Bogdan

Hello Bogdan,

Thanks for taking a look into this.

I get your point regarding this bug https://bugs.mysql.com/bug.php?id=83232 being fixed in 5.7.24.

The bug 83232 is related to FLUSH SLOW LOGS breaking replication. This prints an error and the replication is broken. 

The problem here is something else. When the engine is trying to print that error in the Last_Error section of the show slave status report, there is a crash because the array of the line 42 of the previous code mentioned gets somehow, populated with the wrong memory address.

In particular:

In the call done 0xe377b3 when %rax,-0x298(%rbp) the content of the address is incorrect. Then, afterwards when in 0xe377eb -0x298(%rbp),%r8 will get populated with that address and it breaks. 

All this happens when working on this: return errmsgs[mysql_errno - errmsg_section_start[i] + offset];

Looks like errmsgs is populating the array inccorrectly.

(gdb) x/100i 0xe37784
=> 0xe37784 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4068>:	mov    0x68(%rbx),%rdi
   0xe37788 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4072>:	test   %edx,%edx
   0xe3778a <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4074>:	jne    0xe37bc0 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+5152>
   0xe37790 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4080>:	lea    0x77af3e(%rip),%rax        # 0x15b26d5
   0xe37797 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4087>:	mov    %rax,-0x288(%rbp)
   0xe3779e <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4094>:	mov    %r14d,%esi
   0xe377a1 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4097>:	mov    %r9,-0x2a8(%rbp)
   0xe377a8 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4104>:	mov    %edx,-0x2a0(%rbp)
   0xe377ae <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4110>:	callq  0x7cf620 <ER_THD(THD const*, int)>
   0xe377b3 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4115>:	mov    %rax,-0x298(%rbp)
   0xe377ba <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4122>:	lea    0xfdaa8f(%rip),%rax        # 0x1e12250 <THR_THD>
   0xe377c1 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4129>:	mov    (%rax),%edi
   0xe377c3 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4131>:	callq  0x770240 <pthread_getspecific@plt>
   0xe377c8 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4136>:	mov    $0xbba,%esi
   0xe377cd <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4141>:	mov    %rax,%rdi
   0xe377d0 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4144>:	callq  0x7cf620 <ER_THD(THD const*, int)>
   0xe377d5 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4149>:	mov    -0x290(%rbp),%rcx
   0xe377dc <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4156>:	mov    -0x2a8(%rbp),%r9
   0xe377e3 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4163>:	xor    %esi,%esi
   0xe377e5 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4165>:	mov    -0x2a0(%rbp),%edx
   0xe377eb <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4171>:	mov    -0x298(%rbp),%r8

I am attaching another file with some extra debug in gdb showing the mentioned. Uploading again as public.

Attachment: gdb debug 2.txt (text/plain), 40.39 KiB.

Hi,

I'm getting different results from you both on .23 and .19.

Few questions
 - did you by any chance tried .24 (not that I actually believe it fixes anything but is good to know)
 - what binaries are you using? I assume you built those .19 and .23 yourself? What exactly env. you used to build them, what version of glibc, what gcc, what boost version.. compiled with what parameters? 
 - Can you reproduce this with oracle binaries?

thanks
Bogdan

Hello Bogdan,

Thanks for testing this! 

You mention that you are not able to get the same results, could you please confirm that you do not get crash? That would help us in our side.

To your questions:

-did you by any chance tried .24 (not that I actually believe it fixes
anything but is good to know)

I am a bit busy but I will try testing in all the lines up 5.7.23 and will come back from result. 

- what binaries are you using? I assume you built those .19 and .23
yourself? What exactly env. you used to build them, what version of
glibc, what gcc, what boost version.. compiled with what parameters?

Correct, we use our own binaries and we compile them. I will try to get some insight to your questions. 

- Can you reproduce this with oracle binaries? 

Will try to reproduce this.

Hi Bernardo,

> You mention that you are not able to get the same results, 
> could you please confirm that you do not get crash? 
> That would help us in our side.

Exactly that, I do crash the replication but no mysql crash... I used both Oracle binaries in sandbox and I compiled it myself (some Fedora 23 system).. 

> Correct, we use our own binaries and we compile them. 
> I will try to get some insight to your questions. 

Would be good to see what's doing it, maybe compiler optimized something out, would not be the first time that glibc/gcc bug produce issues like this, that's why we always check if this happens on oracle binaries too.

> - Can you reproduce this with oracle binaries? 
> Will try to reproduce this.

That would be very important step. I can't say with 100% certainty I'm doing everything you are on your system. I do follow up what I understand you did but that's all so if you also can't reproduce it on Oracle binaries then we are clear it's something to do with build envinronment.

What I tested so far is on .19 I tried reproducing 82323 but I see the error message and no crash. I also tried on .19 and .23 and .24 to crash replication in many different ways (corrupting files, corrupting network, messing up with master, editing data on slave..) to break replication get an error in the Show Slave Status and all I managed to do is break replication and get the error in the SSS, no crashes.

Another question - do you get crash for "any" error in replication or only for "Query caused different ...."

thanks for working with us on this
Bogdan

Hello Bogdan,

I've been testing on my side and also context switched to a couple of extra things, that's why I've been a bit silent.

As mentioned, I've tested in all our versions from 16 to 23. 

It fails in all the version from 16 to 21. In 22 and 23 It works.

We do not have yet a .24 binary but anyway it works for 22 and 23 so I guess this would mean it works for 24.

I am still gathering the information on how we compile the binary (GCC, options, env, flags, etc..) 

I have some part of the picture but not all. Will come back to you with that.

Checking where the error happens, I can see code changes right where we can see the array being accessed:

diff -b mysql-5.7.21/sql/derror.cc mysql-5.7.22/sql/derror.cc                                                                                                                                                                                                                                                
1c1
< /* Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.
---
> /* Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
41c41,42
<     if (mysql_errno < (errmsg_section_start[i] + errmsg_section_size[i]))
---
>     if (mysql_errno >= errmsg_section_start[i] &&
>         mysql_errno < (errmsg_section_start[i] + errmsg_section_size[i]))

I still don't understand why on this bit of the calls (you can see this in the last file attached) there is a callq that returns a position already out of bounds...

   0xe377ae <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4110>:	callq  0x7cf620 <ER_THD(THD const*, int)>
   0xe377b3 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4115>:	mov    %rax,-0x298(%rbp)
   0xe377ba <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4122>:	lea    0xfdaa8f(%rip),%rax        # 0x1e12250 <THR_THD>
   0xe377c1 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4129>:	mov    (%rax),%edi
   0xe377c3 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4131>:	callq  0x770240 <pthread_getspecific@plt>
   0xe377c8 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4136>:	mov    $0xbba,%esi
   0xe377cd <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4141>:	mov    %rax,%rdi
   0xe377d0 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4144>:	callq  0x7cf620 <ER_THD(THD const*, int)>
   0xe377d5 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4149>:	mov    -0x290(%rbp),%rcx
   0xe377dc <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4156>:	mov    -0x2a8(%rbp),%r9
   0xe377e3 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4163>:	xor    %esi,%esi
   0xe377e5 <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4165>:	mov    -0x2a0(%rbp),%edx
   0xe377eb <Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned long)+4171>:	mov    -0x298(%rbp),%r8

> Q: Another question - do you get crash for "any" error in replication or only for "Query caused different ...."

It only fails for that one. Rest of the errors work. 

I will come back to you with more information soon.

Hello Bodgan,

I've been debugging a bit more and here is what I can see:

Debug of the trace on 5.7.19 when it fails:

(gdb) bt
#0  MY_LOCALE_ERRMSGS::lookup (this=0x1e26300 <global_errmsgs>, mysql_errno=mysql_errno@entry=29) at /local/mysql-5.7.19.R1/sql/derror.cc:41
#1  0x00000000007cf635 in ER_THD (thd=<optimized out>, mysql_errno=mysql_errno@entry=29) at /local/mysql-5.7.19.R1/sql/derror.cc:57
#2  0x0000000000e377b3 in Query_log_event::do_apply_event (this=0x2afad14da020, rli=0x2afad1a2c000, query_arg=0x2afad149c184 "flush slow logs", q_len_arg=<optimized out>) at /local/mysql-5.7.19.R1/sql/log_event.cc:4837
#3  0x0000000000e35552 in Log_event::apply_event (this=this@entry=0x2afad14da020, rli=rli@entry=0x2afad1a2c000) at /local/mysql-5.7.19.R1/sql/log_event.cc:3315
#4  0x0000000000e78720 in apply_event_and_update_pos (ptr_ev=ptr_ev@entry=0x2af9ea9c19b0, thd=thd@entry=0x2afad1429000, rli=rli@entry=0x2afad1a2c000) at /local/mysql-5.7.19.R1/sql/rpl_slave.cc:4714
#5  0x0000000000e84f96 in exec_relay_log_event (rli=0x2afad1a2c000, thd=0x2afad1429000) at /local/mysql-5.7.19.R1/sql/rpl_slave.cc:5224
#6  handle_slave_sql (arg=arg@entry=0x2afad1b88000) at /local/mysql-5.7.19.R1/sql/rpl_slave.cc:7358
#7  0x00000000011dc6ff in pfs_spawn_thread (arg=0x2afacb1603a0) at /local/mysql-5.7.19.R1/storage/perfschema/pfs.cc:2188
#8  0x00002af9eac8ddc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00002af9ec202c9d in clone () from /lib64/libc.so.6
(gdb) f 0
#0  MY_LOCALE_ERRMSGS::lookup (this=0x1e26300 <global_errmsgs>, mysql_errno=mysql_errno@entry=29) at /local/mysql-5.7.19.R1/sql/derror.cc:41  // here we see mysql_errno= 29
41	/local/mysql-5.7.19.R1/sql/derror.cc: No such file or directory.
(gdb) p errmsgs
$1 = (const char **) 0x2af9f2fc2020
(gdb) p *errmsgs
$2 = 0x2af9f2fc4208 "hashchk"
(gdb) p errmsg_section_start
$3 = {1000, 3000}
(gdb) p i
$4 = 0
(gdb) p errmsg_section_start[i]
$5 = 1000
(gdb) p offset
$6 = 0
(gdb)  p NUM_SECTIONS
$7 = 2
(gdb) p mysql_errno - errmsg_section_start[i] + offset
$8 = -971                           // We can see this is a pointer not an array so technically, this can be coded to be negative but I don't think you guys coded this in that way. I can also see that every time it fails this is a negative value. 
(gdb) p mysql_errno >= errmsg_section_start[i]
$9 = false
(gdb) p mysql_errno < (errmsg_section_start[i] + errmsg_section_size[i]
A syntax error in expression, near `'.
(gdb) p mysql_errno < (errmsg_section_start[i] + errmsg_section_size[i])
$10 = true
(gdb) p errmsg_section_start[i] + errmsg_section_size[i]
$11 = 1886

with changed made in 5.7.22:

const char* MY_LOCALE_ERRMSGS::lookup(int mysql_errno)
{
  int offset= 0; // Position where the current section starts in the array.
  for (int i= 0; i < NUM_SECTIONS; i++)
  {
    if (mysql_errno >= errmsg_section_start[i] &&
        mysql_errno < (errmsg_section_start[i] + errmsg_section_size[i]))
      return errmsgs[mysql_errno - errmsg_section_start[i] + offset];
    offset+= errmsg_section_size[i];
  }
  return "Invalid error code";
}

mysql_errno= 29 will never make the condition true. So, it will never get inside of that condition and will return "Invalid error code"

Now I will post the debug in 5.7.22

Debug of the same issue in 5.7.22

Breakpoint 2, MY_LOCALE_ERRMSGS::lookup (this=0x1f60fa0 <global_errmsgs>, mysql_errno=mysql_errno@entry=29) at /local/home/apol/dev/ws/mysql-sources/src/RDSMySQL/sql/derror.cc:41
(gdb) p mysql_errno >= errmsg_section_start[i]
value has been optimized out
(gdb) p mysql_errno
$7 = 29
(gdb) info registers
rax            0x15a1b66	22682470
rbx            0x2b66e48ac020	47720920956960
rcx            0x2b66e489a13a	47720920883514
rdx            0x159ddf9	22666745
rsi            0x1d	29
rdi            0x1f60fa0	32903072
rbp            0x2b6083cc24c0	0x2b6083cc24c0
rsp            0x2b6083cc24c0	0x2b6083cc24c0
r8             0x15ac212	22725138
r9             0x2b66e4812000	47720920326144
r10            0x15b3e42	22756930
r11            0x1d	29
r12            0x2b66bdbb5000	47720269828096
r13            0x2b6083cc2630	47693528049200
r14            0x2b66e48950c4	47720920862916
r15            0x0	0
rip            0x8571e9	0x8571e9 <MY_LOCALE_ERRMSGS::lookup(int)+41>
eflags         0x287	[ CF PF SF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) x/1s 0x15a1b66
0x15a1b66:	"Invalid error code"
(gdb) s

And then, we can see in the output of "show slave status" we can see:

Last_SQL_Errno: 3002
Last_SQL_Error: Query caused different errors on master and slave. Error on master: message (format)='Invalid error code' error code=29; Error on slave:actual message='no error', error code=0. Default database:''. Query:'flush slow logs'

I understand that the code implemented after .22 was to prevent low error codes that could result in a memory corruption. You mentioned that you tried to reproduce the issue in .19 and for you it was not crashing but reporting an error.

What error it was returning? 

What mysql_errno value its reported on your binaries when you reproduce the error?

Also, can you provide some guidance on how you compile the binaries? 

I am not still abandoning other lines of testing and we are evaluating every thing you previously mentioned, will keep working on that and come back with the data.

Hi,

> What error it was returning? 

> What mysql_errno value its reported on your binaries when you reproduce the error?

Last_SQL_Errno: 3002
Last_SQL_Error: Query caused different errors on master and slave.     Error on master: message (format)='Invalid error code' error code=29 ; Error on slave: actual message='no error', error code=0. Default database: ''. Query: 'flush slow logs'

> Also, can you provide some guidance on how you compile the binaries? 

cmake .
make

 :)

used general linux tgz with boost on fedora 23 "fairly up to date" (have not run dnf update for few weeks at least, maybe months) 

> I am not still abandoning other lines of testing and we are evaluating
> every thing you previously mentioned, will keep working on that and come
> back with the data.

To be honest if even you cannot reproduce with .22 and .23 I'm not sure what are we chasing attm? 

> It only fails for that one. Rest of the errors work. 

Ah, I understood that it fails for you for any error message, I need to retest this with .19 again as I was doing thorough testing with .24 breaking replication in 100 different ways, but with .19 I just did basic crash to get an error, was not chasing the "Query caused different errors...". IIRC I did .19 with "query caused..." too without crash but can't be sure at this point tbh; but again, as it does not crash .22, .23 nor .24 even if I do reproduce it back with .19 .. what's the point?

All best
Bogdan

Hello Bodgan,

We would really appreciate if you can see if you can reproduce the error with .19 as now we are concerned about our binaries as you pointed that there could be a problem there.

If you can reproduce it in .19 it would allow us to eliminate the problem with the binaries.

Hi,

> We would really appreciate if you can see if you can reproduce 
> the error with .19 as now we are concerned about our binaries 
> as you pointed that there could be a problem there.

Hi, you are actually able to fetch the same Oracle binary yourself to try. You do understand that for bugs system it's only relevant if it exist in latest version or not as whatever the case is upgrade is always only way to solve a bug. Now I did test again with 19 and I did not reproduce a crash but to be 100% sure the issue is with your binary I'd have to do way more work and this bug rally do not require it, as neither you can reproduce it using relevant version :)

kind regards
Bogdan

Hello Bodgan,

thanks for all your work on this.

I can see that you've marked this as "Can't repeat" however, there is a clear buffer underflow bug been shown in the derror.cc in the line 42 that was fixed in the version 5.7.22

Even though, this is not crashing in your binaries, it is still reporting a "null" when reproducing it and, from a security perspective, it was a clear bug and this is why it got fixed on your code on version 5.7.22.

A different behavior of the binaries and you not being able to reproduce the crash does not remove the bug related to the buffer underflow. 

So I would really appreciate that you mark the bug as either verified or duplicated if it was related to another bug you identified when introduced the fix. 

Regarding the testing on your binaries, we did it, but in another linux version and got similar results than you. It was important for us to have more data on this. 

We will keep working in our side. Thanks a lot for your time and work on this issue. As in previous occasions, was great to collaborate with you.

Hi,

> I can see that you've marked this as "Can't repeat" however, there is a clear
> buffer underflow bug been shown in the derror.cc in the line 42 that was fixed
> in the version 5.7.22

Yes, there "was a bug", that at the time status was changed does not exist / can't be reproduced hence the "can't reproduce".

> So I would really appreciate that you mark the bug as either verified or duplicated
> if it was related to another bug you identified when introduced the fix. 

For "duplicate" I'd need to know exactly what bug fix solved the problem, and I don't. "can't reproduce" is rather same "end state" as "duplicate of a closed bug". When I catch what bug report actually did it I'll switch this one to duplicate. Verified is not a closed state so it cannot be a state for this one.

> We will keep working in our side. Thanks a lot for your time and work on this issue. 
> As in previous occasions, as great to collaborate with you.

Likewise. I do appreciate all the work you put into your reports, makes my job that easier :)

kind regards
Bogdan