Bug #9994 | NDB randomly crashes on one node | ||
---|---|---|---|
Submitted: | 19 Apr 2005 8:51 | Modified: | 16 Sep 2005 11:17 |
Reporter: | Isabel Garcia Lorenzo | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 4.1.11 | OS: | Linux (Red Hat Linux AS 3.0 (update 4)) |
Assigned to: | Pekka Nousiainen | CPU Architecture: | Any |
[19 Apr 2005 8:51]
Isabel Garcia Lorenzo
[19 Apr 2005 9:25]
Pekka Nousiainen
Probably caused by simultaneous table/index drop from 2 different mysql sessions.
[19 Apr 2005 10:29]
Martin Skold
Looking at the trace it is not a random event causing the data node. A failed CREATE TABLE seems to run into a bug when trying to clean up (looks like some problem when dropping related indexes of the table). Is this really a showstopper? Did the node recover after restart and was the CREATE TABLE possible to perform later?
[19 Apr 2005 15:17]
Isabel Garcia Lorenzo
Thanx for the wuick response! Not sure what table caused the error, but after a cluster shutdown we can't bring up any of the 2 NDB nodes. The error we get is pretty similar to the one we reported before (maybe the first error belonged to the NDB node trying to restart?) Right now, when we try to start an NDB node, this is what we get: CLUSTER LOG ------------------ Apr 19 16:54:18 vtlinuxdes3 NDB[19393]: [MgmSrvr] NDB Cluster Management Server. Version 4.1.11 Apr 19 16:54:18 vtlinuxdes3 NDB[19393]: [MgmSrvr] Id: 1, Command port: 1187 Apr 19 16:54:18 vtlinuxdes3 ndb_mgmd: ndb_mgmd startup succeeded Apr 19 16:54:37 vtlinuxdes3 NDB[19393]: [MgmSrvr] Mgmt server state: nodeid 2 reserved for ip 192.168.200.11, m_reserved_nodes 0000000000000006. Apr 19 16:54:37 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 1: Node 2 Connected Apr 19 16:55:10 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 2: Start phase 1 completed Apr 19 16:56:10 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 2: Start phase 2 completed (system restart) Apr 19 16:56:10 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 2: Start phase 3 completed (system restart) Apr 19 16:56:34 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 1: Node 2 Disconnected Apr 19 16:56:35 vtlinuxdes3 NDB[19393]: [MgmSrvr] Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000002. Apr 19 16:59:05 vtlinuxdes3 NDB[19393]: [MgmSrvr] Mgmt server state: nodeid 3 reserved for ip 192.168.200.12, m_reserved_nodes 000000000000000a. Apr 19 16:59:06 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 1: Node 3 Connected Apr 19 16:59:19 vtlinuxdes3 NDB[19393]: [MgmSrvr] Mgmt server state: nodeid 2 reserved for ip 192.168.200.11, m_reserved_nodes 000000000000000e. Apr 19 16:59:19 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 1: Node 2 Connected Apr 19 16:59:20 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 3: Node 2 Connected Apr 19 16:59:20 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 2: Node 3 Connected Apr 19 16:59:20 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 2: Start phase 1 completed Apr 19 16:59:21 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 3: CM_REGCONF president = 2, own Node = 3, our dynamic id = 2 Apr 19 16:59:21 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 2: Node 3: API version 4.1.11 Apr 19 16:59:21 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 3: Node 2: API version 4.1.11 Apr 19 16:59:21 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 3: Start phase 1 completed Apr 19 16:59:21 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 2: Start phase 2 completed (system restart) Apr 19 16:59:21 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 3: Start phase 2 completed (system restart) Apr 19 16:59:21 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 2: Start phase 3 completed (system restart) Apr 19 16:59:21 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 3: Start phase 3 completed (system restart) Apr 19 16:59:45 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 1: Node 2 Disconnected Apr 19 16:59:45 vtlinuxdes3 NDB[19393]: [MgmSrvr] Node 1: Node 3 Disconnected Apr 19 16:59:45 vtlinuxdes3 NDB[19393]: [MgmSrvr] Mgmt server state: nodeid 2 freed, m_reserved_nodes 000000000000000a. Apr 19 16:59:45 vtlinuxdes3 NDB[19393]: [MgmSrvr] Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000002. NDB LOG ------------ Date/Time: Tuesday 19 April 2005 - 16:59:44 Type of error: error Message: Internal program error (failed ndbrequire) Fault ID: 2341 Problem data: Dbdict.cpp Object of reference: DBDICT (Line: 2108) 0x00000002 ProgramName: ndbd ProcessID: 3709 TraceFile: /data/db/mysql-cluster/ndb_2_trace.log.18 ***EOM*** TRACE FILE ------------- JAM CONTENTS up->down left->right ?=not block entry BLOCK ADDR ADDR ADDR ADDR ADDR ADDR ADDR ADDR DBDICT ?002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 002095 002237 002243 --------------- Signal ---------------- r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 675211 gsn: 403 "TC_SCHVERCONF" prio: 1 s.bn: 245 "DBTC", s.proc: 2, s.sigId: 675210 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000082 H'00000083 --------------- Signal ---------------- r.bn: 245 "DBTC", r.proc: 2, r.sigId: 675210 gsn: 404 "TC_SCHVERREQ" prio: 1 s.bn: 250 "DBDICT", s.proc: 2, s.sigId: 675209 length: 6 trace: 0 #sec: 0 fragInf: 0 H'00000082 H'00000006 H'00000000 H'00fa0002 H'00000006 H'00000083 --------------- Signal ---------------- r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 675209 gsn: 396 "TAB_COMMITCONF" prio: 1 s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 675208 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000083 H'00000002 H'00000082 --------------- Signal ---------------- r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 675208 gsn: 398 "TAB_COMMITREQ" prio: 1 s.bn: 250 "DBDICT", s.proc: 2, s.sigId: 675207 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000083 H'00fa0002 H'00000082 --------------- Signal ---------------- r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 675207 gsn: 396 "TAB_COMMITCONF" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 675206 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000083 H'00000002 H'00000082 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 675206 gsn: 398 "TAB_COMMITREQ" prio: 1 s.bn: 250 "DBDICT", s.proc: 2, s.sigId: 675205 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000083 H'00fa0002 H'00000082 --------------- Signal ---------------- r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 675205 gsn: 185 "DIADDTABCONF" prio: 1 s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 675204 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00000083 --------------- Signal ---------------- r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 675204 gsn: 109 "ADD_FRAGCONF" prio: 1 s.bn: 250 "DBDICT", s.proc: 2, s.sigId: 675203 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00018ea8 H'00000001 --------------- Signal ---------------- r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 675203 gsn: 308 "LQHADDATTCONF" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 675202 length: 3 trace: 0 #sec: 0 fragInf: 0 H'00000083 H'ffffff00 H'00000001 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 675202 gsn: 674 "TUX_ADD_ATTRCONF" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 675201 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 675201 gsn: 674 "TUX_ADD_ATTRCONF" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 675200 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 .... .... ... ... --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 670288 gsn: 417 "TUP_ADD_ATTRREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670287 length: 5 trace: 0 #sec: 0 fragInf: 0 H'00000001 H'00000000 H'00000009 H'0018023e H'0008000e --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 670287 gsn: 415 "TUP_ADD_ATTCONF" prio: 1 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 670286 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 670286 gsn: 417 "TUP_ADD_ATTRREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670285 length: 5 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 H'00000008 H'00010331 H'00000001 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 670285 gsn: 415 "TUP_ADD_ATTCONF" prio: 1 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 670284 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 670284 gsn: 417 "TUP_ADD_ATTRREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670283 length: 5 trace: 0 #sec: 0 fragInf: 0 H'00000001 H'00000000 H'00000008 H'00010331 H'00000001 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 670283 gsn: 415 "TUP_ADD_ATTCONF" prio: 1 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 670282 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 670282 gsn: 417 "TUP_ADD_ATTRREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670281 length: 5 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 H'00000007 H'00010331 H'00000001 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 670281 gsn: 415 "TUP_ADD_ATTCONF" prio: 1 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 670280 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 670280 gsn: 417 "TUP_ADD_ATTRREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670279 length: 5 trace: 0 #sec: 0 fragInf: 0 H'00000001 H'00000000 H'00000007 H'00010331 H'00000001 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 670279 gsn: 415 "TUP_ADD_ATTCONF" prio: 1 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 670278 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 670278 gsn: 417 "TUP_ADD_ATTRREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670277 length: 5 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 H'00000006 H'00010331 H'00000001 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 670277 gsn: 415 "TUP_ADD_ATTCONF" prio: 1 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 670276 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 670276 gsn: 417 "TUP_ADD_ATTRREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670275 length: 5 trace: 0 #sec: 0 fragInf: 0 H'00000001 H'00000000 H'00000006 H'00010331 H'00000001 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 670275 gsn: 310 "LQHADDATTREQ" prio: 1 s.bn: 250 "DBDICT", s.proc: 2, s.sigId: 670274 length: 19 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000005 H'0000006d H'ffffff00 H'00000006 H'00010331 H'00000001 H'00000007 H'00010331 H'00000001 H'00000008 H'00010331 H'00000001 H'00000009 H'0018023e H'0008000e H'0000000a H'00010251 H'00000007 --------------- Signal ---------------- r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 670274 gsn: 308 "LQHADDATTCONF" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670273 length: 3 trace: 0 #sec: 0 fragInf: 0 H'0000006d H'00000232 H'00000001 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 670273 gsn: 415 "TUP_ADD_ATTCONF" prio: 1 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 670272 length: 2 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 670272 gsn: 417 "TUP_ADD_ATTRREQ" prio: 1 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 670271 length: 5 trace: 0 #sec: 0 fragInf: 0 H'00000000 H'00000000 H'00000005 H'00010351 H'00000007
[20 Apr 2005 12:45]
Isabel Garcia Lorenzo
Thanx for the wuick response! Not sure what table caused the error, but after a cluster shutdown we can't bring up any of the 2 NDB nodes. The error we get is pretty similar to the one we reported before (maybe the first error belonged to the NDB node trying to restart?)
[21 Apr 2005 13:53]
Martin Skold
Did you try starting node with ndbd --initial
[22 Apr 2005 8:49]
Francisco Javier Garcia Humphries
We have started with --initial. We lose all our NDB tables (as expected) and restore/recreate them. The problem is that the ndbrequire issue keeps happening again and again, corrupting our nodes every time. Any idea on why this is happening so often?
[22 Apr 2005 8:49]
Francisco Javier Garcia Humphries
We have started with --initial. We lose all our NDB tables (as expected) and restore/recreate them. The problem is that the ndbrequire issue keeps happening again and again, corrupting our nodes every time. Any idea on why this is happening so often?
[22 Apr 2005 8:50]
Francisco Javier Garcia Humphries
We have started with --initial. We lose all our NDB tables (as expected) and restore/recreate them. The problem is that the ndbrequire issue keeps happening again and again, corrupting our nodes every time. Any idea on why this is happening so often?
[22 Apr 2005 8:56]
Jonas Oreland
Hi, just checking... Could it be that you have changed MaxNoOfTables (order similar) The ndbrequire indicates that a table is defined with a tableId larger than currently allocated pool.
[22 Apr 2005 9:38]
Francisco Javier Garcia Humphries
Hi We have used different values for MaxNoOfTables, but we haven't changed the value since the last --initial. Right now, i've tried to change the parameter to the max (1600) and tried to start, but nodes are still not coming up (disconnect on phase 4).
[22 Apr 2005 9:43]
Isabel Garcia Lorenzo
Trace from the last startup attempt with Max Tables 1600
Attachment: ndb_2_trace.log.zip (application/x-zip-compressed, text), 70.04 KiB.
[22 Apr 2005 9:43]
Isabel Garcia Lorenzo
We have added a trace file from the last startup attempt.
[22 May 2005 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[15 Sep 2005 13:11]
Pekka Nousiainen
Patched fixes from 5.0 bug#11355.
[16 Sep 2005 11:17]
Jon Stephens
Thank you for your bug report. This issue has already been fixed in the latest released version of that product, which you can download at http://www.mysql.com/downloads/ Additional info: Documented fix in 5.0.10 changelog.