MySQL Bugs: #46732: Binary data from popular Wiki system crashes all nodes

Bug #46732	Binary data from popular Wiki system crashes all nodes
Submitted:	14 Aug 2009 16:41	Modified:	21 Aug 2009 17:28
Reporter:	Clint Alexander
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	mysql-5.1-ndb-7.0	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	BINARY, cluster, mediawiki, mysql-5.1.34 ndb-7.0.6, ndbd, objectcache, wiki

Description:
A popular wiki system called "MediaWiki" uses a table named "objectcache" on systems that do not run memcached. At some point, a query is made with binary data that completely crashes all the nodes on a cluster.

Not all binary data has this effect, only 'certain queries' that I have not yet identified.

My setup is:

1 Management Server
4 Data nodes
2 SQL Nodes

Note -- the SQL node that causes the crash is a slave in a standard replication process from another remote database. The purpose is to duplicate/test all the queries performed on the stand-alone (master) SQL server prior to using the cluster in production. It's a good thing I did that. :)

I do not think the replication is a factor, but I thought I would mention it just in case.

I've attached the segment of the binary log that causes the crash -- refer to attachment and the "How to repeat" instructions.

Objectcache overview:
http://www.mediawiki.org/wiki/Manual:Objectcache_table

How to repeat:

CREATE DATABASE `wiki`;

USE `wiki`;

CREATE TABLE `objectcache` (
  `keyname` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  `value` mediumblob,
  `exptime` datetime DEFAULT NULL,
  UNIQUE KEY `keyname` (`keyname`),
  KEY `exptime` (`exptime`)
) ENGINE=ndbcluster DEFAULT CHARSET=utf8;

# use attached binary_crash.sql
mysql -e "source binary_crash.sql"

All nodes crash at this point.

Suggested fix:
Unknown

Binlog data that causes a total cluster crash when imported

Attachment: binary_crash.sql.gz (application/x-gzip, text), 1.15 KiB.

Additional note...

I created the binlog copy with the following command:

mysqlbinlog -v -v --start-position=6972392 --stop-position=6972782 /var/lib/mysql/mysql.000013 > binary_crash.sql

Updated severity to Serious since this only happens with certain (uncommon?) binary queries. Otherwise it would be Critical.

Please attach logs and error logs

http://dev.mysql.com/doc/mysql-cluster-excerpt/5.1/en/mysql-cluster-programs-ndb-error-rep...

I have since cleaned up with an --initial restart. I had been skipping this table in replication so it wouldn't crash again, but I have turned it back on so that the error is triggered in the original configuration by normal operation (instead of just by importing the binary crash file). As soon as it crashes, I'll generate the report and attach it.

Stay tuned...

I can't reproduce the problem with the provided binary_crash.sql file.

First of all the log entries in the file refer to a table `objectcache`
in the database `manual` and not `wiki` even though the comments in the
file say `wiki.objectcache`. As values like 'wiki:messageslock' in the
comments also change to e.g. 'manual:messageslock' when decoding the
actual BINLOG data strings, I assume you did a 'manual' -> 'wiki'
search&replace on the file?
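
For anyone who wants to repeat that check, here is a minimal sketch of how the names inside the BINLOG payloads can be inspected. It assumes the dump was produced by mysqlbinlog with row events printed as base64 lines between BINLOG ' and '/*!*/;, and that each base64 line decodes on its own. The function name binlog_payload_strings and the sample payload are made up for illustration; the sample is a simplified fragment, not a real binlog event and not taken from the attached file.

```python
import base64
import re


def binlog_payload_strings(dump_text, min_len=4):
    """Return printable ASCII runs found inside base64 BINLOG '...' payloads.

    Assumption: every base64 line inside a payload is independently
    decodable, so the decoded lines can simply be concatenated.
    """
    printable = re.compile(rb"[ -~]{%d,}" % min_len)
    found = []
    # Grab everything between BINLOG ' and the closing quote.
    for block in re.findall(r"BINLOG '\s*(.*?)'", dump_text, flags=re.S):
        raw = b"".join(base64.b64decode(line) for line in block.split())
        found += [m.decode("ascii") for m in printable.findall(raw)]
    return found


# Hypothetical example: the comment line was edited to say `wiki`,
# but the encoded payload still carries the original database name.
sample_payload = base64.b64encode(b"\x06manual\x00\x0bobjectcache\x00").decode()
sample_dump = (
    "### INSERT INTO wiki.objectcache\n"
    "BINLOG '\n" + sample_payload + "\n'/*!*/;\n"
)
print(binlog_payload_strings(sample_dump))
```

A plain text replace only touches the human-readable comment lines that -v adds; the base64 data that the server actually executes is left unchanged, which is why the names differ.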

After creating the `objectcache` table in the `manual` database I can
replay the binlog statements, but none of them seems to have any effect.
The table is still empty after replaying the log even though
none of the BINLOG statements produces any warnings or errors and all
nodes are still alive at this point.

I apologize for this. When I was duplicating the error to make sure the import did what I said it would, I did not want it to insert into the original database, so I created a new one and like you guessed -- did a string replace, hoping that would cover it. I'm not a master at the binary logs (yet). However, I'm not sure why this did not work in your testing environment as it continues to work for mine.

I'm still waiting for this to happen again on my network while running under normal activity and configuration. Whereas it had been happening at least once every 24 hours, it has not happened again yet. But I am still monitoring and waiting.

I'll try to provide a much better recreation method in the next posting and I apologize for this one not producing what I intended. It's a little embarrassing, but I'll get over it. :)

//Clint

I have not had this happen since we began the second monitoring session. I am closing this ticket and if the problem comes up again, we can readdress this ticket (or refer to it) in the subsequent report.

Again, I apologize for the open-ended report.

//Clint