Bug #29501 DN crashes in LGMAN during NR while DD schema operations are being handled
Submitted: 2 Jul 2007 21:59
Modified: 11 Jul 2007 14:45
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Disk Data
Severity: S2 (Serious)
Version: MySQL-5.1-new-ndb
OS: Linux
Assigned to: Jonas Oreland
CPU Architecture: Any

[2 Jul 2007 21:59] Jonathan Miller
Description:
Hi,

While trying to reproduce http://bugs.mysql.com/bug.php?id=22704 I found an issue with LGMAN.

Using the cid_ndb_dd.pl script, I killed one of the data nodes just as the delete of the data was about to take place. I then restarted it after the test had dropped the LFG and TS, and before it had started to create the LFG again.

It looks like the current code does not handle this case.

-----------------

LGMAN   001341 003101

003101:

    if(! (pages >= group_pages))
    {
      ndbout << heading << " Tail: " << ptr.p->m_file_pos[TAIL]
             << " Head: " << ptr.p->m_file_pos[HEAD]
             << " free: " << group_pages << "(" << last << ")"
             << " found: " << pages;
      for(Uint32 i = 0; i<3; i++)
      {
        ndbout << " - " << ptr.p->m_tail_pos[i];
      }
      ndbout << endl;

      ndbrequire(pages >= group_pages); <-003101

-------------------------
/space/run/ndb_2_fs/./lg1/undofile.dat rw O_DIRECT: 1
/space/run/ndb_2_fs/./ts1/datafile.dat rw O_DIRECT: 1
Dbdict: name=sys/def/SYSTAB_0,id=0,obj_ptr_i=4
Dbdict: name=sys/def/NDB$EVENTS_0,id=1,obj_ptr_i=5
Dbdict: name=mysql/def/ndb_schema,id=2,obj_ptr_i=6
Dbdict: name=mysql/def/NDB$BLOB_2_3,id=3,obj_ptr_i=7
Dbdict: name=mysql/def/ndb_apply_status,id=4,obj_ptr_i=8
RESTORE table: 0 1039 rows applied
RESTORE table: 0 1012 rows applied
RESTORE table: 1 3 rows applied
RESTORE table: 1 1 rows applied
RESTORE table: 2 2 rows applied
RESTORE table: 2 2 rows applied
RESTORE table: 3 0 rows applied
RESTORE table: 3 0 rows applied
RESTORE table: 4 0 rows applied
RESTORE table: 4 0 rows applied
Applying undo to LCP: 15
2007-07-02 23:22:54 [ndbd] INFO     -- Undo head - ./lg1/undofile.dat page: 1 lsn: 0
before flush log Tail: [ 1089536 9535 ] Head: [ 1089536 1 ] free: 9599(0) found: 9534 - [ 1089536 0 ] - [ 1089536 0 ] - [ 1089536 0 ]
2007-07-02 23:22:54 [ndbd] INFO     -- lgman.cpp
2007-07-02 23:22:54 [ndbd] INFO     -- LGMAN (Line: 3101) 0x0000000a
2007-07-02 23:22:54 [ndbd] INFO     -- Error handler startup shutting down system
2007-07-02 23:22:55 [ndbd] INFO     -- Error handler shutdown completed - aborting
2007-07-02 23:22:55 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2007-07-02 23:22:55 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
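
The values in the "before flush log" line can be checked by hand against the failed condition. The following is only a minimal sketch, not the LGMAN code: `LogPos`, `pages_between` and `invariant_holds` are hypothetical names, and the subtraction used to derive "found" is an assumption that happens to reproduce the logged value (9535 - 1 = 9534). The real computation in lgman.cpp may differ, for example when the log wraps around or spans several files.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a log position as printed in the node log:
// [ <file reference> <page index> ]. Not the actual LGMAN type.
struct LogPos {
  std::uint32_t file;
  std::uint32_t page;
};

// Assumption: head and tail lie in the same file with head at or before
// tail, so the number of pages found between them is a plain difference.
// Wrap-around and multi-file groups are deliberately not modelled.
std::uint32_t pages_between(const LogPos& head, const LogPos& tail) {
  assert(head.file == tail.file && head.page <= tail.page);
  return tail.page - head.page;
}

// Mirrors the condition guarded by ndbrequire(pages >= group_pages).
bool invariant_holds(std::uint32_t pages, std::uint32_t group_pages) {
  return pages >= group_pages;
}
```

With Head [ 1089536 1 ], Tail [ 1089536 9535 ] and free: 9599 from the log, the sketch yields 9534 pages found, which is fewer than 9599, so the condition is false and the ndbrequire at line 3101 fires.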

How to repeat:
1) Create LFG and TS
2) Kill one data node
3) Drop TS and LFG on surviving node
4) Try to restart DN that was killed in step 2 (this node still has the old LFG)

The data node restarted in step 4 should then die with:

Time: Monday 2 July 2007 - 23:22:54
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 3101) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 16420
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.19 (beta)

Suggested fix:
We should be able to handle LFG and TS changes during a node restart (NR).
[2 Jul 2007 22:01] Jonathan Miller
Sorry, I left out step 5:
1) Create LFG and TS
2) Kill one data node
3) Drop TS and LFG on surviving node
4) Try to restart DN that was killed in step 2 (this node still has the old LFG)
5) Start an LFG create
[3 Jul 2007 9:20] Jonas Oreland
test scripts

Attachment: test.tgz (application/x-compressed-tar, text), 474 bytes.

[3 Jul 2007 9:49] Jonas Oreland
initial start
ndb_mgm -e "3 restart -a -n"
create_tab D1
ndb_mgm -e "3 start"
[3 Jul 2007 12:34] Jonas Oreland
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=104&mail=154282
[4 Jul 2007 10:06] Jon Stephens
Documented bugfix for telco-6.2.4 release; left PQ status.
[11 Jul 2007 14:32] Tomas Ulin
Pushed to 5.1.21 (the changeset comment cites the wrong bug number, 25901).
[11 Jul 2007 14:45] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.21 changelog.