Bug #18026 Node recovery with varchar and charset causing inconsistency
Submitted: 7 Mar 2006 12:44 Modified: 14 Mar 2006 10:15
Reporter: Tomas Ulin Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:5.1.8 bk src OS:
Assigned to: Jonas Oreland CPU Architecture:Any

[7 Mar 2006 12:44] Tomas Ulin
Description:
see below

How to repeat:
in one window:

./mysql-test-run --fast --start-and-exit --do-test=ndb
../client/mysql test -u root --socket=var/tmp/master.sock
mysql> create table t1 (i int key) engine=ndbcluster;

in another window:

../storage/ndb/src/mgmclient/ndb_mgm localhost 9350
ndb_mgm> 2 restart -n

in first window:

master> alter table t1 add column b int;

in other window:

ndb_mgm> 2 start

BAM!

on node 1:

Message: Internal program error (failed ndbassert) (Internal error, programming error or missing error message, please report a bug)
Error: 2343
Error data: dbtup/DbtupScan.cpp
Error object: DBTUP (Line: 327) 0x0000000e

in code:

      case AccLockReq::Refused:
        jam();
        // we cannot see deleted tuple (assert only)
        ndbassert(false);
        // skip it

trace log on node 1:

DBTUP   028206 003208 003482 003208 003482 
DBACC   001816 001823 001830 001832 001845 
DBACC   001062 002652 002729 001162 
DBTUP   032304 032325 000327 

--------------- Signal ----------------
r.bn: 249 "DBTUP", r.proc: 1, r.sigId: 142135 gsn: 332 "NEXT_SCANREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 1, s.sigId: 142134 length: 3 trace: 2 #sec: 0 fragInf: 0
 H'00000001 H'ffffff00 H'00000001
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 1, r.sigId: 142134 gsn: 389 "STORED_PROCCONF" prio: 1
s.bn: 249 "DBTUP", s.proc: 1, s.sigId: 142133 length: 2 trace: 2 #sec: 0 fragInf: 0
 H'0000000b H'00000000
--------------- Signal ----------------
r.bn: 249 "DBTUP", r.proc: 1, r.sigId: 142133 gsn: 391 "STORED_PROCREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 1, s.sigId: 142132 length: 5 trace: 2 #sec: 0 fragInf: 0
 H'0000000b H'00000003 H'00000001 H'00000002 H'00000080
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 1, r.sigId: 142132 gsn: 86 "ACC_SCANCONF" prio: 1
s.bn: 249 "DBTUP", s.proc: 1, s.sigId: 142131 length: 8 trace: 2 #sec: 0 fragInf: 0
 H'00000000 H'00000001 H'00000003 H'00000000 H'00000080 H'00000000 H'0f700100
 H'00000001
--------------- Signal ----------------
r.bn: 249 "DBTUP", r.proc: 1, r.sigId: 142131 gsn: 88 "ACC_SCANREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 1, s.sigId: 142130 length: 8 trace: 2 #sec: 0 fragInf: 0
 H'00000000 H'00f70001 H'00000003 H'00000000 H'00000080 H'00000000 H'0f700100
 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 1, r.sigId: 142130 gsn: 171 "COPY_FRAGREQ" prio: 1
s.bn: 246 "DBDIH", s.proc: 1, s.sigId: 142129 length: 8 trace: 2 #sec: 0 fragInf: 0
 H'00000030 H'00f60001 H'00000003 H'00000000 H'00000002 H'00000001 H'00000001
 H'00000000
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 1, r.sigId: 142129 gsn: 178 "CREATE_FRAGCONF" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: -1 length: 5 trace: 2 #sec: 0 fragInf: 0
 H'00000030 H'00000003 H'00000000 H'00000002 H'00000002
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 1, r.sigId: 142128 gsn: 178 "CREATE_FRAGCONF" prio: 1
s.bn: 246 "DBDIH", s.proc: 1, s.sigId: 142126 length: 5 trace: 2 #sec: 0 fragInf: 0
 H'00000030 H'00000003 H'00000000 H'00000001 H'00000002
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 1, r.sigId: 142127 gsn: 362 "UTIL_UNLOCK_CONF" prio: 1
s.bn: 256 "DBUTIL", s.proc: 1, s.sigId: 142125 length: 3 trace: 2 #sec: 0 fragInf: 0
 senderData: 0
 senderRef: 1000001
 lockId: 1
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 1, r.sigId: 142126 gsn: 180 "CREATE_FRAGREQ" prio: 1
s.bn: 246 "DBDIH", s.proc: 1, s.sigId: 142123 length: 8 trace: 2 #sec: 0 fragInf: 0
 H'00000030 H'00f60001 H'00000003 H'00000000 H'00000002 H'00000001 H'00000000
[13 Mar 2006 12:34] Jonas Oreland
The problem was in cluster.schema which has a varchar primary key
  During node recovery, an incorrect ACC_LOCK is sent, as key is
  not normalized before computing hash, and locking.

This could cause inconsistent data between primary/backup
Today, hugo-toolkit does not support charset's so it's not possible to test
  from there....This should be fixed...
[13 Mar 2006 12:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/3772
[13 Mar 2006 12:47] Jonas Oreland
Jon, when documenting this, please explain that this was a critical bug.
Very easy to get, and possible inconsistency could follow...
[13 Mar 2006 15:14] Jonas Oreland
pushed into 5.1.8
[14 Mar 2006 10:15] Jon Stephens
Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

Additional info:

Documented bugfix, noting associated issues, in 5.1.8 changelog. Closed.