| Bug #55641 | Stopping node gracefully in the master cluster can create an inconsistent slave | ||
|---|---|---|---|
| Submitted: | 29 Jul 2010 22:48 | Modified: | 29 Sep 2010 11:33 |
| Reporter: | Matthew Montgomery | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Replication | Severity: | S2 (Serious) |
| Version: | mysql-5.1-telco-7.1 | OS: | Any |
| Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[19 Aug 2010 9:42]
Jonas Oreland
Iff this is true, it's a P1 Can someone please verify this again... or as suggested see when/how it was introduced (check backwards) /Jonas
[19 Aug 2010 9:43]
Jonas Oreland
assigning to bockie for action...
[19 Aug 2010 16:10]
MySQL Verification Team
This appears to be an extremely old bug (perhaps has always been there) it has been reproduced it using 6.1.2, 6.3.9, 7.0.15, 6.4.0 and 7.1.8
[19 Aug 2010 16:53]
MySQL Verification Team
Correction: I meant 7.1.6 instead of 7.1.8
[20 Aug 2010 6:06]
Jonas Oreland
matthew, if you feel like helping out further, a good next step would be to inspect binlog to determine if problem is at master site or at slave. /Jonas
[20 Aug 2010 14:26]
MySQL Verification Team
As suspected it's a problem with the master not writing inserts into the binlog master$ mysql -e "select count(*) from world.City" +----------+ | count(*) | +----------+ | 63583 | +----------+ ## Read all binlogs: master$ mysqlbinlog -v ndbsup-1-bin.000001 | grep -c "INSERT INTO world.City" 62376 master$ mysqlbinlog -v ndbsup-1-bin.000002 | grep -c "INSERT INTO world.City" 0 master$ mysqlbinlog -v ndbsup-1-bin.000003 | grep -c "INSERT INTO world.City" mysqlbinlog: File 'ndbsup-1-bin.000003' not found (Errcode: 2) 0 slave$ mysql -e "select count(*) from world.City"; +----------+ | count(*) | +----------+ | 62376 | +----------+
[23 Aug 2010 14:06]
Jonas Oreland
Hi, We discussed problem...and it might be that the problem is related to "graceful" shutdown. Can you see if same problem is reproducible if using "stop -a" or "kill -9" to stop node instead ? /Jonas
[23 Aug 2010 15:20]
MySQL Verification Team
It appears that when using "2 stop -a" or kill -n `pidof ndbd` the bug is not reproduced and master and slave remain synchronized.
[24 Aug 2010 11:15]
Jonas Oreland
update title...
[24 Aug 2010 14:06]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116651 3261 Jonas Oreland 2010-08-24 ndb - bug#55641 - commit to test
[24 Aug 2010 14:10]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116655 3262 Jonas Oreland 2010-08-24 ndb - bug#55641 - commit to test
[25 Aug 2010 7:49]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116709 3263 Jonas Oreland 2010-08-25 ndb - bug#55641 - fix bug in node iterating
[26 Aug 2010 9:03]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116836 3264 Jonas Oreland 2010-08-26 ndb - bug#55641 - fix suma wrt. graceful shutdown
[26 Aug 2010 9:03]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116837 3264 Jonas Oreland 2010-08-26 ndb - bug#55641 - fix suma wrt. graceful shutdown, bug#18538 graceful shutdown can abort transactions
[26 Aug 2010 10:11]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116843 3265 Jonas Oreland 2010-08-26 ndb - bug#55641 - fix upgrade...daa
[26 Aug 2010 12:46]
Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.47-ndb-6.3.37 (revid:jonas@mysql.com-20100826123701-s6etvy71vidlaxme) (version source revid:jonas@mysql.com-20100826100724-msmzp71lce098ke3) (merge vers: 5.1.47-ndb-6.3.37) (pib:21)
[26 Aug 2010 12:46]
Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.18 (revid:jonas@mysql.com-20100826124130-kuw81qy5le9lpl7o) (version source revid:jonas@mysql.com-20100826124130-kuw81qy5le9lpl7o) (merge vers: 5.1.47-ndb-7.0.18) (pib:21)
[26 Aug 2010 12:54]
Jonas Oreland
pushed into 6.3.37, 7.0.18 and 7.1.7
[26 Aug 2010 12:55]
Jonas Oreland
matthew, you're most welcome to test it /Jonas
[27 Aug 2010 8:04]
Jon Stephens
Documented bugfix in the NDB-6.3.37, 7.0.18, and 7.1.7 changelogs as follows:
The planned shutdown of a data node in the master cluster could
sometimes cause rows to be skipped when writing transactions to
the binary log, leading to an inconsistent slave cluster.
Closed.
[23 Sep 2010 13:25]
MySQL Verification Team
After some additional automated testing I am still able to reproduce this.
5.1.47-ndb-7.0.18-brXXXXX-log
#!/bin/bash
PATH=/data/mysql/sandbox/7.0.18/bin:/data/mysql/sandbox/7.0.18/libexec:/home/matt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
while true ; do
for id in 4 ; do
echo -en "master ndbd: " $(ndb_mgm -e "show" | grep Master | cut -f 1)
echo -en "\tkilling node : $id \t"
mysql -e 'truncate world.City';
mysqlslap -q "insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234);" -i 50 --number-of-queries 2000 -c 1 --create-schema world 2>&1 > /dev/null &
sleep 10
ndb_mgm -e "$id stop -a" 2>&1 > /dev/null
wait
sleep 1
master=$(mysql -Bse "select count(*) from world.City" -S /tmp/mysql70.sock)
slave=$(mysql -Bse "select count(*) from world.City" -S /tmp/mysql72.sock )
if [ $master == $slave ] ; then
echo "PASS"
else
echo "FAIL"
fi
echo "Master : $master --> Slave : $slave"
ndbd 2>&1 >/dev/null
ndb_waiter >/dev/null
done
done
matt@silo:~/mysql/sandbox/7.0.18$ ./bug55641.sh
master ndbd: id=5 killing node : 4 mysqlslap: Cannot run query insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234); ERROR : Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER
PASS
Master : 9471 --> Slave : 9471
master ndbd: id=5 killing node : 5 mysqlslap: Cannot run query insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234); ERROR : Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER
PASS
Master : 8429 --> Slave : 8429
master ndbd: id=4 killing node : 4 PASS
Master : 100000 --> Slave : 100000
master ndbd: id=5 killing node : 5 PASS
Master : 100000 --> Slave : 100000
master ndbd: id=4 killing node : 4 PASS
Master : 100000 --> Slave : 100000
master ndbd: id=5 killing node : 5 FAIL
Master : 100000 --> Slave : 99999
master ndbd: id=4 killing node : 4 mysqlslap: Cannot run query insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234); ERROR : Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER
PASS
Master : 10848 --> Slave : 10848
master ndbd: id=5 killing node : 5 PASS
Master : 100000 --> Slave : 100000
master ndbd: id=4 killing node : 4 PASS
Master : 100000 --> Slave : 100000
master ndbd: id=5 killing node : 5 PASS
Several times the mysqlslap command gets an error due to node failure, which is expected. Master and slave remain in sync in all these cases. However, about one in 12 executions fail with one row missing from the slave.
[29 Sep 2010 7:29]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/119334 3307 Jonas Oreland 2010-09-29 ndb - bug#55641 - followup fix, fix check_switchover
[29 Sep 2010 7:38]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/119335 3308 Jonas Oreland 2010-09-29 ndb - bug#55641 - followup fix, fix check_switchover
[29 Sep 2010 7:43]
Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.47-ndb-6.3.38 (revid:jonas@mysql.com-20100929073631-xzt5ipk6hbps4tik) (version source revid:jonas@mysql.com-20100929073631-xzt5ipk6hbps4tik) (merge vers: 5.1.47-ndb-6.3.38) (pib:21)
[29 Sep 2010 7:43]
Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.19 (revid:jonas@mysql.com-20100929073942-j33arj0iotj2ckvp) (version source revid:jonas@mysql.com-20100929073942-j33arj0iotj2ckvp) (merge vers: 5.1.47-ndb-7.0.19) (pib:21)
[29 Sep 2010 7:48]
Jonas Oreland
DOCS: More testing of scenario revealed a missing piece of handling of the graceful shutdown. This patch addresses that. Pushed to 6.3.38, 7.0.19 and 7.1.8
[29 Sep 2010 10:54]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/119379 3288 Martin Skold 2010-09-29 [merge] Merge removed: cluster_change_hist.txt modified: mysql-test/collections/default.experimental mysql-test/suite/ndb/r/ndb_database.result mysql-test/suite/ndb/t/ndb_database.test sql/ha_ndbcluster.cc sql/ha_ndbcluster.h sql/ha_ndbcluster_binlog.cc sql/handler.cc sql/handler.h sql/sql_show.cc sql/sql_table.cc storage/ndb/include/kernel/GlobalSignalNumbers.h storage/ndb/include/kernel/signaldata/FsReadWriteReq.hpp storage/ndb/include/mgmapi/mgmapi.h storage/ndb/include/ndbapi/NdbDictionary.hpp storage/ndb/src/kernel/blocks/ERROR_codes.txt storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp storage/ndb/src/kernel/blocks/dblqh/Dblqh.hpp storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp storage/ndb/src/kernel/blocks/dbtup/Dbtup.hpp storage/ndb/src/kernel/blocks/dbtup/DbtupIndex.cpp storage/ndb/src/kernel/blocks/dbtup/DbtupMeta.cpp storage/ndb/src/kernel/blocks/dbtux/Dbtux.hpp storage/ndb/src/kernel/blocks/dbtux/DbtuxBuild.cpp storage/ndb/src/kernel/blocks/dbtux/DbtuxMaint.cpp storage/ndb/src/kernel/blocks/dbtux/DbtuxNode.cpp storage/ndb/src/kernel/blocks/dbtux/DbtuxTree.cpp storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.cpp storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.hpp storage/ndb/src/kernel/blocks/ndbfs/Ndbfs.cpp storage/ndb/src/kernel/blocks/ndbfs/Ndbfs.hpp storage/ndb/src/kernel/blocks/ndbfs/VoidFs.cpp storage/ndb/src/kernel/blocks/suma/Suma.cpp storage/ndb/src/kernel/blocks/suma/Suma.hpp storage/ndb/src/kernel/main.cpp storage/ndb/src/ndbapi/DictCache.cpp storage/ndb/src/ndbapi/DictCache.hpp storage/ndb/src/ndbapi/NdbDictionary.cpp storage/ndb/src/ndbapi/NdbDictionaryImpl.cpp storage/ndb/src/ndbapi/NdbDictionaryImpl.hpp storage/ndb/test/include/NdbRestarter.hpp storage/ndb/test/ndbapi/testIndex.cpp storage/ndb/test/ndbapi/testRestartGci.cpp storage/ndb/test/ndbapi/testSystemRestart.cpp storage/ndb/test/run-test/daily-basic-tests.txt storage/ndb/test/src/NdbRestarter.cpp
[29 Sep 2010 11:33]
Jon Stephens
Documented as follows in the NDB-6.3.8, 7.0.19, and 7.1.8 changelogs:
The graceful shutdown of a data node in the master cluster could
sometimes cause rows to be skipped when writing transactions to
the binary log. Testing following an initial fix for this issue
revealed an additional case where the graceful shutdown of a data
node was not handled properly. The current fix addresses this case.
Closed.

Description: A controlled shutdown of an NDBD node while transactions are in progress the binlog can miss receipt of some committed rows. In a 2 node cluster shutting down a node will result in an out of sync slave about 50% of the time. How to repeat: Basic master-slave replication setup. (1 mgm, 1 sql and 2 data nodes in each cluster) /* -- my.cnf -- */ [mysqld] user=mmontgomery socket=/tmp/mysql_matt.sock port=3371 datadir=/users/mmontgomery/data1/issue/data ndbcluster ndb_connectstring=localhost:4321 log-bin server-id=1 (or 2) /* -- end my.cnf --*/ /* -- config.ini -- */ [ndbd default] NoOfReplicas=2 Datadir=/users/mmontgomery/data1/issue [ndb_mgmd] PortNumber=4321 ID=1 Datadir=/users/mmontgomery/data1/issue Hostname=localhost [NDBD] Id=2 Hostname=localhost [NDBD] Id=3 Hostname=localhost [API] Id=5 Hostname=localhost [API] Id=6 /* -- end config.ini -- */ mysqlM> CREATE DATABASE world; mysqlM> CREATE TABLE `City` (`ID` int(11) NOT NULL AUTO_INCREMENT,`Name` char(35) NOT NULL DEFAULT '',`CountryCode` char(3) NOT NULL DEFAULT '',`District` char(20) NOT NULL DEFAULT '',`Population` int(11) NOT NULL DEFAULT '0',PRIMARY KEY (`ID`) ) ENGINE=ndbcluster AUTO_INCREMENT=18168 DEFAULT CHARSET=latin1; master$ mysqlslap -h 127.0.0.1 -P 3371 -q "insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234);" -i 500 --number-of-queries 2000 -c 1 --create-schema world & master$ ndb_mgm -e "2 stop" Connected to Management Server at: localhost:4321 Node 2 has shutdown. master$ mysqlslap: Cannot run query insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234); ERROR : Got temporary error 286 'Node failure caused abort of transaction' from NDBCLUSTER master$ mysql -e "select count(*) from City" world; +----------+ | count(*) | +----------+ | 18166 | +----------+ slave$ mysql -e "select count(*) from City" world; +----------+ | count(*) | +----------+ | 17374 | +----------+ slave$ mysql -e "show slave status\G" | grep -Ei 'running|behind' Slave_IO_Running: Yes Slave_SQL_Running: Yes Seconds_Behind_Master: 0