MySQL Bugs: #29501: DN crashes in LGMAN during NR while DD schema operations are being handled

Bug #29501	DN crashes in LGMAN during NR while DD schema operations are being handled
Submitted:	2 Jul 2007 21:59	Modified:	11 Jul 2007 14:45
Reporter:	Jonathan Miller
Status:	Closed
Category:	MySQL Cluster: Disk Data	Severity:	S2 (Serious)
Version:	MySQL-5.1-new-ndb	OS:	Linux
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
Hi,

While trying to reproduce http://bugs.mysql.com/bug.php?id=22704 I found an issue with LGMAN.

Using the cid_ndb_dd.pl script, I killed one of the data nodes just as the delete of the data was about to take place. I then restarted it after the test had dropped the LFG and TS, and before it had started to create the LFG again.

So it looks like this is not handled by the current code.

-----------------

LGMAN   001341 003101

003101:

    if(! (pages >= group_pages))
    {
      ndbout << heading << " Tail: " << ptr.p->m_file_pos[TAIL]
             << " Head: " << ptr.p->m_file_pos[HEAD]
             << " free: " << group_pages << "(" << last << ")"
             << " found: " << pages;
      for(Uint32 i = 0; i<3; i++)
      {
        ndbout << " - " << ptr.p->m_tail_pos[i];
      }
      ndbout << endl;

      ndbrequire(pages >= group_pages); // <- 003101

-------------------------
/space/run/ndb_2_fs/./lg1/undofile.dat rw O_DIRECT: 1
/space/run/ndb_2_fs/./ts1/datafile.dat rw O_DIRECT: 1
Dbdict: name=sys/def/SYSTAB_0,id=0,obj_ptr_i=4
Dbdict: name=sys/def/NDB$EVENTS_0,id=1,obj_ptr_i=5
Dbdict: name=mysql/def/ndb_schema,id=2,obj_ptr_i=6
Dbdict: name=mysql/def/NDB$BLOB_2_3,id=3,obj_ptr_i=7
Dbdict: name=mysql/def/ndb_apply_status,id=4,obj_ptr_i=8
RESTORE table: 0 1039 rows applied
RESTORE table: 0 1012 rows applied
RESTORE table: 1 3 rows applied
RESTORE table: 1 1 rows applied
RESTORE table: 2 2 rows applied
RESTORE table: 2 2 rows applied
RESTORE table: 3 0 rows applied
RESTORE table: 3 0 rows applied
RESTORE table: 4 0 rows applied
RESTORE table: 4 0 rows applied
Applying undo to LCP: 15
2007-07-02 23:22:54 [ndbd] INFO     -- Undo head - ./lg1/undofile.dat page: 1 lsn: 0
before flush log Tail: [ 1089536 9535 ] Head: [ 1089536 1 ] free: 9599(0) found: 9534 - [ 1089536 0 ] - [ 1089536 0 ] - [ 1089536 0 ]
2007-07-02 23:22:54 [ndbd] INFO     -- lgman.cpp
2007-07-02 23:22:54 [ndbd] INFO     -- LGMAN (Line: 3101) 0x0000000a
2007-07-02 23:22:54 [ndbd] INFO     -- Error handler startup shutting down system
2007-07-02 23:22:55 [ndbd] INFO     -- Error handler shutdown completed - aborting
2007-07-02 23:22:55 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2007-07-02 23:22:55 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
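
To make the failure concrete: in the excerpt above, "free:" prints group_pages and "found:" prints pages, so the "before flush log" line reports group_pages = 9599 and pages = 9534, and the condition pages >= group_pages is false. The following is only a minimal sketch of that arithmetic. The helper names extract_after and undo_pages_sufficient are hypothetical and are not part of lgman.cpp.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (not part of lgman.cpp): reads the unsigned number
// that directly follows `key` in a log line. Parsing stops at the first
// non-digit, so "free: 9599(0)" yields 9599.
bool extract_after(const std::string& line, const std::string& key,
                   std::uint32_t& out)
{
  const std::string::size_type pos = line.find(key);
  if (pos == std::string::npos)
    return false;
  std::istringstream in(line.substr(pos + key.size()));
  return static_cast<bool>(in >> out);
}

// Mirrors the condition guarded by ndbrequire(pages >= group_pages)
// in the excerpt above; it does not reproduce any other LGMAN logic.
bool undo_pages_sufficient(std::uint32_t pages, std::uint32_t group_pages)
{
  return pages >= group_pages;
}
```

Feeding the "before flush log" line through extract_after with the keys "free: " and "found: " gives 9599 and 9534, and undo_pages_sufficient(9534, 9599) returns false, which matches the failed ndbrequire at line 3101.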

How to repeat:
1) Create an LFG and a TS
2) Kill one data node
3) Drop the TS and LFG on the surviving node
4) Try to restart the DN that was killed in step 2 (this node still has the old LFG)

The restarted data node should die with:

Time: Monday 2 July 2007 - 23:22:54
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 3101) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 16420
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.19 (beta)

Suggested fix:
We should be able to handle LFG and TS changes in an NR situation.

Sorry, I left out step 5:
1) Create an LFG and a TS
2) Kill one data node
3) Drop the TS and LFG on the surviving node
4) Try to restart the DN that was killed in step 2 (this node still has the old LFG)
5) Start an LFG create

test scripts

Attachment: test.tgz (application/x-compressed-tar, text), 474 bytes.

initial start
ndb_mgm -e "3 restart -a -n"
create_tab D1
ndb_mgm -e "3 start"

https://intranet.mysql.com/secure/mailarchive/mail.php?folder=104&mail=154282

Documented bugfix for telco-6.2.4 release; left PQ status.

Pushed to 5.1.21 (note: the changeset comment contains the wrong number, 25901).

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.21 changelog.