Bug #55641 | Stopping node gracefully in the master cluster can create an inconsistent slave | ||
---|---|---|---|
Submitted: | 29 Jul 2010 22:48 | Modified: | 29 Sep 2010 11:33 |
Reporter: | Matthew Montgomery | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Replication | Severity: | S2 (Serious) |
Version: | mysql-5.1-telco-7.1 | OS: | Any |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[29 Jul 2010 22:48]
Matthew Montgomery
[19 Aug 2010 9:42]
Jonas Oreland
Iff this is true, it's a P1 Can someone please verify this again... or as suggested see when/how it was introduced (check backwards) /Jonas
[19 Aug 2010 9:43]
Jonas Oreland
assigning to bockie for action...
[19 Aug 2010 16:10]
MySQL Verification Team
This appears to be an extremely old bug (perhaps has always been there) it has been reproduced it using 6.1.2, 6.3.9, 7.0.15, 6.4.0 and 7.1.8
[19 Aug 2010 16:53]
MySQL Verification Team
Correction: I meant 7.1.6 instead of 7.1.8
[20 Aug 2010 6:06]
Jonas Oreland
matthew, if you feel like helping out further, a good next step would be to inspect binlog to determine if problem is at master site or at slave. /Jonas
[20 Aug 2010 14:26]
MySQL Verification Team
As suspected it's a problem with the master not writing inserts into the binlog master$ mysql -e "select count(*) from world.City" +----------+ | count(*) | +----------+ | 63583 | +----------+ ## Read all binlogs: master$ mysqlbinlog -v ndbsup-1-bin.000001 | grep -c "INSERT INTO world.City" 62376 master$ mysqlbinlog -v ndbsup-1-bin.000002 | grep -c "INSERT INTO world.City" 0 master$ mysqlbinlog -v ndbsup-1-bin.000003 | grep -c "INSERT INTO world.City" mysqlbinlog: File 'ndbsup-1-bin.000003' not found (Errcode: 2) 0 slave$ mysql -e "select count(*) from world.City"; +----------+ | count(*) | +----------+ | 62376 | +----------+
[23 Aug 2010 14:06]
Jonas Oreland
Hi, We discussed problem...and it might be that the problem is related to "graceful" shutdown. Can you see if same problem is reproducible if using "stop -a" or "kill -9" to stop node instead ? /Jonas
[23 Aug 2010 15:20]
MySQL Verification Team
It appears that when using "2 stop -a" or kill -n `pidof ndbd` the bug is not reproduced and master and slave remain synchronized.
[24 Aug 2010 11:15]
Jonas Oreland
update title...
[24 Aug 2010 14:06]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116651 3261 Jonas Oreland 2010-08-24 ndb - bug#55641 - commit to test
[24 Aug 2010 14:10]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116655 3262 Jonas Oreland 2010-08-24 ndb - bug#55641 - commit to test
[25 Aug 2010 7:49]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116709 3263 Jonas Oreland 2010-08-25 ndb - bug#55641 - fix bug in node iterating
[26 Aug 2010 9:03]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116836 3264 Jonas Oreland 2010-08-26 ndb - bug#55641 - fix suma wrt. graceful shutdown
[26 Aug 2010 9:03]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116837 3264 Jonas Oreland 2010-08-26 ndb - bug#55641 - fix suma wrt. graceful shutdown, bug#18538 graceful shutdown can abort transactions
[26 Aug 2010 10:11]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/116843 3265 Jonas Oreland 2010-08-26 ndb - bug#55641 - fix upgrade...daa
[26 Aug 2010 12:46]
Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.47-ndb-6.3.37 (revid:jonas@mysql.com-20100826123701-s6etvy71vidlaxme) (version source revid:jonas@mysql.com-20100826100724-msmzp71lce098ke3) (merge vers: 5.1.47-ndb-6.3.37) (pib:21)
[26 Aug 2010 12:46]
Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.18 (revid:jonas@mysql.com-20100826124130-kuw81qy5le9lpl7o) (version source revid:jonas@mysql.com-20100826124130-kuw81qy5le9lpl7o) (merge vers: 5.1.47-ndb-7.0.18) (pib:21)
[26 Aug 2010 12:54]
Jonas Oreland
pushed into 6.3.37, 7.0.18 and 7.1.7
[26 Aug 2010 12:55]
Jonas Oreland
matthew, you're most welcome to test it /Jonas
[27 Aug 2010 8:04]
Jon Stephens
Documented bugfix in the NDB-6.3.37, 7.0.18, and 7.1.7 changelogs as follows: The planned shutdown of a data node in the master cluster could sometimes cause rows to be skipped when writing transactions to the binary log, leading to an inconsistent slave cluster. Closed.
[23 Sep 2010 13:25]
MySQL Verification Team
After some additional automated testing I am still able to reproduce this. 5.1.47-ndb-7.0.18-brXXXXX-log #!/bin/bash PATH=/data/mysql/sandbox/7.0.18/bin:/data/mysql/sandbox/7.0.18/libexec:/home/matt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games while true ; do for id in 4 ; do echo -en "master ndbd: " $(ndb_mgm -e "show" | grep Master | cut -f 1) echo -en "\tkilling node : $id \t" mysql -e 'truncate world.City'; mysqlslap -q "insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234);" -i 50 --number-of-queries 2000 -c 1 --create-schema world 2>&1 > /dev/null & sleep 10 ndb_mgm -e "$id stop -a" 2>&1 > /dev/null wait sleep 1 master=$(mysql -Bse "select count(*) from world.City" -S /tmp/mysql70.sock) slave=$(mysql -Bse "select count(*) from world.City" -S /tmp/mysql72.sock ) if [ $master == $slave ] ; then echo "PASS" else echo "FAIL" fi echo "Master : $master --> Slave : $slave" ndbd 2>&1 >/dev/null ndb_waiter >/dev/null done done matt@silo:~/mysql/sandbox/7.0.18$ ./bug55641.sh master ndbd: id=5 killing node : 4 mysqlslap: Cannot run query insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234); ERROR : Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER PASS Master : 9471 --> Slave : 9471 master ndbd: id=5 killing node : 5 mysqlslap: Cannot run query insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234); ERROR : Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER PASS Master : 8429 --> Slave : 8429 master ndbd: id=4 killing node : 4 PASS Master : 100000 --> Slave : 100000 master ndbd: id=5 killing node : 5 PASS Master : 100000 --> Slave : 100000 master ndbd: id=4 killing node : 4 PASS Master : 100000 --> Slave : 100000 master ndbd: id=5 killing node : 5 FAIL Master : 100000 --> Slave : 99999 master ndbd: id=4 killing node : 4 mysqlslap: Cannot run query insert into world.City values('', 'Weurt', 'NLD', 'Gelderland', 1234); ERROR : Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER PASS Master : 10848 --> Slave : 10848 master ndbd: id=5 killing node : 5 PASS Master : 100000 --> Slave : 100000 master ndbd: id=4 killing node : 4 PASS Master : 100000 --> Slave : 100000 master ndbd: id=5 killing node : 5 PASS Several times the mysqlslap command gets an error due to node failure, which is expected. Master and slave remain in sync in all these cases. However, about one in 12 executions fail with one row missing from the slave.
[29 Sep 2010 7:29]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/119334 3307 Jonas Oreland 2010-09-29 ndb - bug#55641 - followup fix, fix check_switchover
[29 Sep 2010 7:38]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/119335 3308 Jonas Oreland 2010-09-29 ndb - bug#55641 - followup fix, fix check_switchover
[29 Sep 2010 7:43]
Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.47-ndb-6.3.38 (revid:jonas@mysql.com-20100929073631-xzt5ipk6hbps4tik) (version source revid:jonas@mysql.com-20100929073631-xzt5ipk6hbps4tik) (merge vers: 5.1.47-ndb-6.3.38) (pib:21)
[29 Sep 2010 7:43]
Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.19 (revid:jonas@mysql.com-20100929073942-j33arj0iotj2ckvp) (version source revid:jonas@mysql.com-20100929073942-j33arj0iotj2ckvp) (merge vers: 5.1.47-ndb-7.0.19) (pib:21)
[29 Sep 2010 7:48]
Jonas Oreland
DOCS: More testing of scenario revealed a missing piece of handling of the graceful shutdown. This patch addresses that. Pushed to 6.3.38, 7.0.19 and 7.1.8
[29 Sep 2010 10:54]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/119379 3288 Martin Skold 2010-09-29 [merge] Merge removed: cluster_change_hist.txt modified: mysql-test/collections/default.experimental mysql-test/suite/ndb/r/ndb_database.result mysql-test/suite/ndb/t/ndb_database.test sql/ha_ndbcluster.cc sql/ha_ndbcluster.h sql/ha_ndbcluster_binlog.cc sql/handler.cc sql/handler.h sql/sql_show.cc sql/sql_table.cc storage/ndb/include/kernel/GlobalSignalNumbers.h storage/ndb/include/kernel/signaldata/FsReadWriteReq.hpp storage/ndb/include/mgmapi/mgmapi.h storage/ndb/include/ndbapi/NdbDictionary.hpp storage/ndb/src/kernel/blocks/ERROR_codes.txt storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp storage/ndb/src/kernel/blocks/dblqh/Dblqh.hpp storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp storage/ndb/src/kernel/blocks/dbtup/Dbtup.hpp storage/ndb/src/kernel/blocks/dbtup/DbtupIndex.cpp storage/ndb/src/kernel/blocks/dbtup/DbtupMeta.cpp storage/ndb/src/kernel/blocks/dbtux/Dbtux.hpp storage/ndb/src/kernel/blocks/dbtux/DbtuxBuild.cpp storage/ndb/src/kernel/blocks/dbtux/DbtuxMaint.cpp storage/ndb/src/kernel/blocks/dbtux/DbtuxNode.cpp storage/ndb/src/kernel/blocks/dbtux/DbtuxTree.cpp storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.cpp storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.hpp storage/ndb/src/kernel/blocks/ndbfs/Ndbfs.cpp storage/ndb/src/kernel/blocks/ndbfs/Ndbfs.hpp storage/ndb/src/kernel/blocks/ndbfs/VoidFs.cpp storage/ndb/src/kernel/blocks/suma/Suma.cpp storage/ndb/src/kernel/blocks/suma/Suma.hpp storage/ndb/src/kernel/main.cpp storage/ndb/src/ndbapi/DictCache.cpp storage/ndb/src/ndbapi/DictCache.hpp storage/ndb/src/ndbapi/NdbDictionary.cpp storage/ndb/src/ndbapi/NdbDictionaryImpl.cpp storage/ndb/src/ndbapi/NdbDictionaryImpl.hpp storage/ndb/test/include/NdbRestarter.hpp storage/ndb/test/ndbapi/testIndex.cpp storage/ndb/test/ndbapi/testRestartGci.cpp storage/ndb/test/ndbapi/testSystemRestart.cpp storage/ndb/test/run-test/daily-basic-tests.txt storage/ndb/test/src/NdbRestarter.cpp
[29 Sep 2010 11:33]
Jon Stephens
Documented as follows in the NDB-6.3.8, 7.0.19, and 7.1.8 changelogs: The graceful shutdown of a data node in the master cluster could sometimes cause rows to be skipped when writing transactions to the binary log. Testing following an initial fix for this issue revealed an additional case where the graceful shutdown of a data node was not handled properly. The current fix addresses this case. Closed.