Bug #39077 Command "change master" causes server crash (test "rpl_heartbeat")
Submitted: 27 Aug 2008 17:49
Modified: 28 Jan 2009 14:27
Reporter: Joerg Bruehe
Status: Closed
Category: MySQL Server: Replication
Severity: S1 (Critical)
Version: 6.3.17 Cluster
OS: Linux (RedHat 5, x86 + x86_64)
Assigned to: Andrei Elkin
CPU Architecture: Any

[27 Aug 2008 17:49] Joerg Bruehe
Description:
Found in the build of cluster-6.3.17 (based on 5.1.27):
Test "rpl_heartbeat" causes a server crash.

The effect is specific to the RPM builds on RedHat 5, x86 + x86_64;
RedHat 5 on IA64 works, as do RPM builds on RedHat 4 and SuSE on all CPUs.

Crash symptom:

=====
rpl.rpl_heartbeat 'row'        [ fail ]

mysqltest: At line NNN: query 'change master to master_host='127.0.0.1',master_port=$MASTER_MYPORT, master_user='root', master_heartbeat_period= 4294968' failed with wrong errno 2013: 'Lost connection to MySQL server during query', instead of 1615...

The result from queries just before the failure was:
reset master;
set @@global.slave_net_timeout= 10;
change master to master_host='127.0.0.1',master_port=MASTER_PORT, master_user='root';
show status like 'Slave_heartbeat_period';;
Variable_name   Slave_heartbeat_period
Value   5.000
change master to master_host='127.0.0.1',master_port=MASTER_PORT, master_user='root', master_heartbeat_period= 4294968;
ERROR HY000: Lost connection to MySQL server during query
=====

Same effect in "stmt" and "mix" log modes.
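
A note on the value used by the test: 4294968 looks like it was picked to sit just above the largest accepted heartbeat period. Assuming the period is given in seconds and kept internally as milliseconds in an unsigned 32-bit value (an assumption, not stated in this report), the largest whole number of seconds that fits is 4294967. A minimal sketch of that check, with a hypothetical helper name:

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the MySQL source.
   Assumption: the heartbeat period is supplied in seconds and stored
   as milliseconds in an unsigned 32-bit integer. */
static int heartbeat_period_in_range(double seconds)
{
    /* Convert to milliseconds and compare against the 32-bit maximum. */
    return seconds >= 0.0 && seconds * 1000.0 <= (double)UINT32_MAX;
}
```

Under these assumptions, `heartbeat_period_in_range(4294967.0)` returns 1 and `heartbeat_period_in_range(4294968.0)` returns 0, which would explain why the test expects an out-of-range error here rather than a lost connection.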

How to repeat:
Found during a release build.
[27 Aug 2008 17:51] Joerg Bruehe
Fixed a typing error in the title ...

Classification of the severity is difficult, as I cannot tell whether it always happens or only with some specific options.
[28 Aug 2008 18:33] Andrei Elkin
Looks like a crash.
Need to talk to Joerg to find a way to reproduce it.
[20 Nov 2008 13:30] Jonas Oreland
Can't really comment on E/R, this is replication code that the cluster trees have...
but I think D1/I4 is reasonable,
W3: don't use the heartbeat feature

Assignment/fixing ETA will have to be negotiated with Lars,
but I think we should still be lead
[11 Dec 2008 19:35] Andrei Elkin
I investigated and found the following:

I took the server binary that fails the test on frigg34
and set up an environment to run it against the replication suite on blade10 (suggested by Joerg as having properties similar to frigg34, particularly libc).

The frigg34 server passes all the tests except rpl_heartbeat.
So the failure is reproducible.

*** stack smashing detected ***:
/users/aelkin/BZR/mysql-5.1-telco-6.3/sql/mysqld terminated
       rpl.rpl_heartbeat 'row'        [ fail ]
       ...  wrong errno 2013: 'Lost connection to MySQL server during query', instead of 1619...

However, another server executable, built using the frigg34 configure
options, does not hit this crash with rpl_heartbeat.
All in all, when there is no "smashing" the test passes, and the "smashing" seems to
be related to the way the build is done.
[11 Dec 2008 20:31] Andrei Elkin
-fstack-protector is found to be related. Still, forcing my build to compile with that option does not produce a crashing ("smashing") server executable, at least on my local host.
Linux mysql1000 2.6.24-22-generic
$ gcc --version
gcc (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu3)
libc.so.6
[11 Dec 2008 22:03] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/61413

2791 Andrei Elkin	2008-12-12
      Bug #39077  Command "change master" causes server crash (test "rpl_heartbeat")
      
      The crash happened due to too small size of a char array that is used to print out
      a warning on out-of-range for the heartbeat value.
      
      Corrected by setting the size to a sufficiently large value.
[12 Dec 2008 10:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/61459

2791 Andrei Elkin	2008-12-12
      Bug #39077  Command "change master" causes server crash (test "rpl_heartbeat")
            
      The crash happened due to too small size of a char array that is used to print out
      a warning on out-of-range for the heartbeat value.
            
      Corrected by setting the size to a sufficiently large value.
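
For readers unfamiliar with this class of bug: a fixed-size char array that receives a formatted warning can be overrun when the formatted text is longer than the array, and a build with -fstack-protector then aborts with "stack smashing detected". The committed fix enlarged the array. The sketch below is not the patch and not MySQL code; it shows a complementary, hypothetical pattern (the helper name and message text are made up) that bounds the write with `snprintf` so the array cannot be overrun whatever its size:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the MySQL source: formats an out-of-range
   warning into a caller-supplied buffer without writing past its end.
   Returns 1 if the whole message fit, 0 if it was truncated or failed. */
static int format_range_warning(char *buf, size_t size, double max_seconds)
{
    int needed = snprintf(buf, size,
                          "heartbeat period exceeds the maximum of %.3f seconds",
                          max_seconds);
    /* snprintf returns the length the full message would have had;
       a value >= size means the output was cut short. */
    return needed >= 0 && (size_t)needed < size;
}
```

With a 16-byte buffer the call reports truncation and leaves a NUL-terminated 15-character prefix; with a 128-byte buffer the full message fits.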
[12 Dec 2008 12:16] Andrei Elkin
Pushed to mysql-5.1-telco-6.3.
[12 Dec 2008 12:17] Bugs System
Pushed into 5.1.30-ndb-6.3.20  (revid:aelkin@mysql.com-20081212102531-hiwuc8zt0t343iqt) (version source revid:aelkin@mysql.com-20081212102531-hiwuc8zt0t343iqt) (pib:5)
[12 Dec 2008 14:48] Bugs System
Pushed into 5.1.30-ndb-6.4.0  (revid:aelkin@mysql.com-20081212102531-hiwuc8zt0t343iqt) (version source revid:jonas@mysql.com-20081212144400-y7rid1rkmgo5o6i6) (pib:5)
[12 Dec 2008 16:28] Andrei Elkin
pb does not reveal this problem because of the imperfect build environment there.
In particular, -fstack-protector should be passed to the compiler, as rh5 does on
frigg34.
Still, the affected piece of source code in 6.0 is the same as in telco, so 6.0 needs this patch just as much.
[14 Dec 2008 12:38] Jon Stephens
Documented bugfix in the ndb-6.3.20 changelog as follows:

        Issuing a CHANGE MASTER TO ... MASTER_HEARTBEAT_PERIOD = period
        statement using an out-of-range value for period caused the
        slave to crash.

Set bug status to NDI pending merge to 6.0 tree.
[28 Jan 2009 14:27] Jon Stephens
Also documented in the 6.0.8 changelog (per IRC discussion with Andrei). Closed.