Bug #29501 DN crashes in LGMAN during NR while DD schema operations are being handled
Submitted: 2 Jul 2007 23:59
Modified: 11 Jul 2007 16:45
Reporter: Jonathan Miller
Status: Closed
Category: Server: ClusterDD
Severity: S2 (Serious)
Version: MySQL-5.1-new-ndb
OS: Linux
Assigned to: Jonas Oreland
Target Version:

[2 Jul 2007 23:59] Jonathan Miller
Description:
Hi,

While trying to reproduce http://bugs.mysql.com/bug.php?id=22704 I found an issue with
LGMAN.

Using the cid_ndb_dd.pl script, I killed one of the data nodes just as the delete of the
data was about to take place. I then restarted it after the test had dropped the LGF and
TS and before it had started to create the LGF again.

So it looks like this is not handled by the current code.

-----------------

LGMAN   001341 003101

003101:

    if(! (pages >= group_pages))
    {
      ndbout << heading << " Tail: " << ptr.p->m_file_pos[TAIL]
             << " Head: " << ptr.p->m_file_pos[HEAD]
             << " free: " << group_pages << "(" << last << ")"
             << " found: " << pages;
      for(Uint32 i = 0; i<3; i++)
      {
        ndbout << " - " << ptr.p->m_tail_pos[i];
      }
      ndbout << endl;

      ndbrequire(pages >= group_pages); <-003101

-------------------------
/space/run/ndb_2_fs/./lg1/undofile.dat rw O_DIRECT: 1
/space/run/ndb_2_fs/./ts1/datafile.dat rw O_DIRECT: 1
Dbdict: name=sys/def/SYSTAB_0,id=0,obj_ptr_i=4
Dbdict: name=sys/def/NDB$EVENTS_0,id=1,obj_ptr_i=5
Dbdict: name=mysql/def/ndb_schema,id=2,obj_ptr_i=6
Dbdict: name=mysql/def/NDB$BLOB_2_3,id=3,obj_ptr_i=7
Dbdict: name=mysql/def/ndb_apply_status,id=4,obj_ptr_i=8
RESTORE table: 0 1039 rows applied
RESTORE table: 0 1012 rows applied
RESTORE table: 1 3 rows applied
RESTORE table: 1 1 rows applied
RESTORE table: 2 2 rows applied
RESTORE table: 2 2 rows applied
RESTORE table: 3 0 rows applied
RESTORE table: 3 0 rows applied
RESTORE table: 4 0 rows applied
RESTORE table: 4 0 rows applied
Applying undo to LCP: 15
2007-07-02 23:22:54 [ndbd] INFO     -- Undo head - ./lg1/undofile.dat page: 1 lsn: 0
before flush log Tail: [ 1089536 9535 ] Head: [ 1089536 1 ] free: 9599(0) found: 9534 - [ 1089536 0 ] - [ 1089536 0 ] - [ 1089536 0 ]
2007-07-02 23:22:54 [ndbd] INFO     -- lgman.cpp
2007-07-02 23:22:54 [ndbd] INFO     -- LGMAN (Line: 3101) 0x0000000a
2007-07-02 23:22:54 [ndbd] INFO     -- Error handler startup shutting down system
2007-07-02 23:22:55 [ndbd] INFO     -- Error handler shutdown completed - aborting
2007-07-02 23:22:55 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2007-07-02 23:22:55 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

How to repeat:
1) Create LGF and TS
2) Kill one data node
3) Drop TS and LGF on the surviving node
4) Try to restart the DN that was killed in step 2 (this node still has the old LGF)

The restarted data node should then die with:

Time: Monday 2 July 2007 - 23:22:54
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error
or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 3101) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 16420
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.19 (beta)

Suggested fix:
We should be able to handle LGF and TS changes during a node restart (NR).
[3 Jul 2007 0:01] Jonathan Miller
Sorry, I left out step 5:
1) Create LGF and TS
2) Kill one data node
3) Drop TS and LGF on the surviving node
4) Try to restart the DN that was killed in step 2 (this node still has the old LGF)
5) Start an LGF create
[3 Jul 2007 11:20] Jonas Oreland
test scripts

Attachment: test.tgz (application/x-compressed-tar, text), 474 bytes.

[3 Jul 2007 11:49] Jonas Oreland
initial start
ndb_mgm -e "3 restart -a -n"
create_tab D1
ndb_mgm -e "3 start"
[3 Jul 2007 14:34] Jonas Oreland
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=104&mail=154282
[4 Jul 2007 12:06] Jon Stephens
Documented bugfix for telco-6.2.4 release; left PQ status.
[11 Jul 2007 16:32] Tomas Ulin
Pushed to 5.1.21 (the changeset comment wrongly refers to 25901).
[11 Jul 2007 16:45] Jon Stephens
Thank you for your bug report. A fix for this issue has been committed to the source
repository for this product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available
version, including the bug fix. More information about accessing the source trees is
available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.21 changelog.