MySQL Bugs: #12950: ndb backup failure if updating column with charset during backup

Bug #12950	ndb backup failure if updating column with charset during backup
Submitted:	2 Sep 2005 13:39	Modified:	13 Oct 2005 11:46
Reporter:	A M
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	4.1.14	OS:	Linux (redhat ent. 3)
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
if an update (or possibly any dml) is attempted on ndb tables during an ndb backup, the backup fails with:

Start of backup failed
* 3001: Could not start backup
* Send to process or receive failed.

and the master ndbd node crashes; the other ndbds remain alive

after master ndbd crashes, clients experience a temporary hang (~2secs) and then everything goes back to normal.

i'm not sure if this is due to just issuing a command during a backup, or if this has to do with backing up while under load. i came across the problem during load testing.

i did notice that if i pause the load testing right before i backup, then the backup runs smoothly.

attached is part of the ndb cluster log when the backup is attempted and the crash occurs

How to repeat:
i've been able to repeat this consistently with the following setup:

ndb cluster on 2 computers, each computer having ndb_mgmd, ndbd and mysql
(i.e. every computer has all three node types running).
computers are two single cpu compaq dl360, 1ghz and 800mhz, 1Gb ram each running redhat ent. 3 (glibc 2.3) and mysql/ndb v. 4.1.14
connected by gbit ethernet

1. schema:
create a table and populate it with a large amount of data, say over 50 mb
(the point is that the bug is exposed only if the backup takes more than a few seconds to run, so we need a large amount of dummy data)
for instance:
create table testdata(
keycol integer,
datacol1 integer,
datacol2 integer,
textcol text character set utf8,
primary key (keycol)) engine=ndbcluster;

populated with ~500,000 rows, where textcol is populated with ~100bytes
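
One way to produce that dummy data is sketched below. This is not the reporter's actual script: the function name `make_insert_batches`, the batch size, and the random value ranges are assumptions made for illustration. It only builds multi-row INSERT statements as strings for the testdata table above, with ~100 bytes of mixed-case ASCII in textcol.

```python
import random
import string


def make_insert_batches(total_rows=500_000, batch_size=1_000, text_len=100, seed=0):
    """Yield multi-row INSERT statements for the testdata table (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the generated data is repeatable
    alphabet = string.ascii_letters  # mixed case, nothing that needs SQL escaping
    for start in range(1, total_rows + 1, batch_size):
        stop = min(start + batch_size, total_rows + 1)
        rows = []
        for key in range(start, stop):
            text = "".join(rng.choices(alphabet, k=text_len))
            rows.append(
                f"({key}, {rng.randrange(1000)}, {rng.randrange(1000)}, '{text}')"
            )
        # One statement per batch keeps the number of round trips low.
        yield (
            "insert into testdata (keycol, datacol1, datacol2, textcol) values\n"
            + ",\n".join(rows)
            + ";"
        )
```

Printing each yielded statement and piping the output into the mysql client should fill the table; the batch size may need lowering if the cluster's transaction limits are hit.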

i've included a text column because in my setup, i've got a text column included in the update, maybe it's part of the problem.. i haven't checked the test case without the text column

2. load:
i've used load testing (openSTA) to create load on the system (frontend is a web app) before issuing the backup, but a client issuing an update once every 200ms should be enough.

3. once you've had a few seconds of updates, try backing up.

note: i'm quite sure that it doesn't matter which table an update touches during the backup (as long as it's ndb), the crash will occur

cluster log, look for backup #2

Attachment: ndb_2_cluster.log (application/octet-stream, text), 34.85 KiB.

regarding the updates, i am referring to row updates with pk condition.
i.e. in our example:
update testdata set textcol = reverse(textcol) where keycol = <id>
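
A minimal sketch of such a client follows, assuming a caller-supplied `execute` callable that sends one SQL string to a mysqld in the cluster (for example a thin wrapper around a DB-API cursor). The function name `run_update_load` and all of its parameters are hypothetical; only the update statement itself comes from this report.

```python
import random
import time


def run_update_load(execute, max_key, interval_s=0.2, duration_s=30.0,
                    sleep=time.sleep, clock=time.monotonic, rng=None):
    """Issue one single-row pk update per interval until duration_s has elapsed.

    execute is an assumed callable taking one SQL string.
    Returns the number of statements issued.
    """
    rng = rng or random.Random()
    issued = 0
    deadline = clock() + duration_s
    while clock() < deadline:
        key = rng.randint(1, max_key)  # pick a primary key in the populated range
        execute(
            f"update testdata set textcol = reverse(textcol) where keycol = {key}"
        )
        issued += 1
        sleep(interval_s)  # ~200ms between updates by default
    return issued
```

Start this a few seconds before requesting the backup and keep it running until the backup finishes or fails. The sleep and clock parameters are injectable only so the loop can be exercised without a real delay.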

Hi,

can you please attach the error log and trace files as well?

/Jonas

What is the load situation on the servers:
* with only updates running
* with only backup running
* with backup and updates running

The cluster log speaks of missed heartbeats and the error log is watchdog timeout.

Both indicate high load.
But then again it might be something else :-)

/Jonas

hi

what exactly do you mean by load on the system?

load average?
user/system/idle?
process cpu usage?
would you like the output of vmstat w/ 1 sec intervals during a load test and crash?  you might have a problem correlating the output with the exact second of the crash..

during the tests the system was under load (i came across the problem while running load tests), i would say some 60 transactions per second, 20 being inserts on non-indexed tables, 20 being updates on one table and 20 updates on a second table.  the updates are single-row changes using a pk condition.
however note that in the context of cpu load, the cpu wasn't working that hard.. i noticed an average of about 30% for both computers.

by the way, the tps is the total for both computers in the cluster (i.e. the queries do not run on only one mysql in the cluster), maybe this has something to do with the problem.

also, the cluster log is showing the missed heartbeats right after the backup request (and subsequent crash of the master ndb data node).

note that a double digit tps is not required for the crash.  if you have a backup that takes say 20 seconds, and attempt to update any ndb tables, an ndb node will crash.

I managed to reproduce this once in ~10 tries...
working on finding out why...

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30150

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30152

Would you be willing to test the patch for this before I close the bug report?
(so that I have fixed the correct bug:-))

i'd be happy to check the patch, but it would have to wait about a week before i could get to it.

Pushed into 4.1.15
Note the problem does not occur in 5.0

----

I'll close this bug report.
Please reopen if you find that it still fails, once you get around to testing it.

ndbd could crash if there were updates of a column with charset, if 
  a backup was running at the same time.

The problem was that the backup incorrectly used charset-normalized reads
  for its internal triggers.

This could also lead to "incorrect" data in the backup 
  (the columns being normalized, e.g. aAa -> AAA if charset latin_ci),
  if such a column was updated _during_ the backup.
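
The data effect described above can be illustrated with a toy model. This is not NDB code: `normalize_ci` and `backup_log_entry` are hypothetical names, and `str.upper()` merely stands in for the normalization a case-insensitive collation applies. It shows only why a normalized read is fine for comparisons but wrong for copying data into a backup; it does not model the crash.

```python
def normalize_ci(value):
    """Stand-in for a case-insensitive collation: equal strings map to one form."""
    return value.upper()


def backup_log_entry(new_value, normalized_read):
    """Return what a backup trigger would record for an updated column value."""
    return normalize_ci(new_value) if normalized_read else new_value


stored = "aAa"

# Comparisons are where normalization belongs: both sides map to the same form.
assert normalize_ci(stored) == normalize_ci("AAA")

# Behaviour described in the bug: the backup records the normalized form.
buggy_entry = backup_log_entry(stored, normalized_read=True)

# Behaviour after the fix: the backup records the value exactly as stored.
fixed_entry = backup_log_entry(stored, normalized_read=False)
```

In this model a restore from `buggy_entry` would silently change the case of the stored text, while `fixed_entry` round-trips unchanged.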

Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Bugfix documented in 4.1.15 changelog. Closed.

Checked bug on 4.1.15.

Looks ok to me :)