Bug #9282 API Node Crashes/Reloads on 'DELETE FROM'
Submitted: 18 Mar 2005 16:15 Modified: 13 Oct 2005 11:04
Reporter: David Seltzer Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:4.1.10 OS:Linux (Redhat Enterprise Linux 3.2)
Assigned to: Jonas Oreland CPU Architecture:Any

[18 Mar 2005 16:15] David Seltzer
Description:
The problem query was running routinely (about every 10 minutes until it stopped working). The problem does not seem to be an issue with the specific API node. I can replicate the issue on any of the API nodes connected to the cluster.

Running a delete statement seems to cause the API node to dump the client connection. I have observed this problem using both the standard mysql client on linux and the MysqlConnecter.net from the windows environment.

Output of the Problem Query:
------------------------------------------------------------------------------
mysql> DELETE FROM BLOOM_Lines WHERE LineDateTime<DATE_SUB(NOW(), INTERVAL 14 DAY);
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql>
------------------------------------------------------------------------------

There are some usefull entries in the cluster log:

Output of the mgm log: (Node 8 is the server that the query is being run on)
------------------------------------------------------------------------------------
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Node 4: Node 8 Disconnected
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Node 4: Communication to Node 8 closed
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Node 2: Node 8 Disconnected
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Node 2: Communication to Node 8 closed
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Node 3: Node 8 Disconnected
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Node 3: Communication to Node 8 closed
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Node 5: Node 8 Disconnected
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Node 5: Communication to Node 8 closed
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Allocate nodeid (0) failed. Connection from ip 10.0.30.6. Returned error string "No free node id found for mysqld(API)."
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Mgmt server state: node id's  2 3 4 5 connected but not reserved
Mar 17 09:45:57 mgm01 NDB[8773]: [MgmSrvr] Mgmt server state: nodeid 8 freed, m_reserved_nodes 00000000000000c2.
Mar 17 09:46:00 mgm01 NDB[8773]: [MgmSrvr] Mgmt server state: nodeid 8 reserved for ip 10.0.30.6, m_reserved_nodes 00000000000001c2.
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 2: Communication to Node 8 opened
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 3: Communication to Node 8 opened
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 2: Node 8 Connected
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 3: Node 8 Connected
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 5: Communication to Node 8 opened
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 5: Node 8 Connected
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 4: Communication to Node 8 opened
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 3: Node 8: API version 4.1.10
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 2: Node 8: API version 4.1.10
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 4: Node 8 Connected
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 4: Node 8: API version 4.1.10
Mar 17 09:46:01 mgm01 NDB[8773]: [MgmSrvr] Node 5: Node 8: API version 4.1.10
---------------------------------------------------------------------------------------

How to repeat:
Here are the current system setup files:

The current table status is as follows:
-----------------------------------------------------------------------------------
mysql> show table status like 'BLOOM_Lines'\G
*************************** 1. row ***************************
           Name: BLOOM_Lines
         Engine: ndbcluster
        Version: 9
     Row_format: Dynamic
           Rows: 193965
 Avg_row_length: 0
    Data_length: 0
Max_data_length: NULL
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: NULL
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment: LineText
1 row in set (0.00 sec)

mysql>
-----------------------------------------------------------------------------------

Table Structure:
-----------------------------------------------------------------------------------
CREATE TABLE `BLOOM_Lines` (
  `LineID` int(10) unsigned NOT NULL default '0',
  `LineDateTime` datetime NOT NULL default '0000-00-00 00:00:00',
  `LineText` varchar(200) NOT NULL default '',
  `LineSource` tinyint(4) NOT NULL default '0',
  `SpeakerType` tinyint(4) NOT NULL default '0',
  `SpeakerID` tinyint(4) NOT NULL default '0',
  PRIMARY KEY  (`LineID`),
  KEY `LineDateTime` (`LineDateTime`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1 COMMENT='LineText'
-----------------------------------------------------------------------------------

my.cnf of API Node
-----------------------------------------------------------------------------------
[mysqld]
ndbcluster
basedir=/usr/local/mysql
datadir=/usr/local/mysql/data
socket=/tmp/mysql.sock
port=3306
user=mysql
skip-innodb
log-bin
server-id=1
ft_min_word_len=3
[mysql_cluster]
ndb-connectstring=mgm01
-----------------------------------------------------------------------------------

Cluster Config
-----------------------------------------------------------------------------------
[NDB_MGMD]
Hostname: mgm01
DataDir: /usr/local/mysql/data
LogDestination:   SYSLOG:facility=syslog

[NDBD DEFAULT]
NoOfReplicas: 2
DataMemory: 4G
IndexMemory: 1G
LockPagesInMainMemory: Y
DataDir: /usr/local/mysql/data

## Implented for bulk load via LOAD DATA INFILE
MaxNoOfConcurrentOperations: 131072
MaxNoOfConcurrentTransactions: 8192

[NDBD]
Hostname: stor01

[NDBD]
Hostname: stor02

[NDBD]
Hostname: stor01

[NDBD]
Hostname: stor02

[MYSQLD]
Hostname: web01

[MYSQLD]
Hostname: web02

[MYSQLD]
Hostname: ingressdb01
-----------------------------------------------------------------------------------
[21 Mar 2005 10:57] Johan Andersson
Did the mysqld print a stacktrace?
If so, please put the entire stack trace here.
[21 Mar 2005 14:29] David Seltzer
This version doesnt seem to support the --debug flag for mysqld. The configure options are:
./configure '--prefix=/usr/local/mysql' '--localstatedir=/usr/local/mysql/data' '--libexecdir=/usr/local/mysql/bin' '--with-extra-charsets\
=complex' '--enable-thread-safe-client' '--enable-local-infile' '--with-ndbcluster' '--with-innodb' 'CC=gcc' 'CFLAGS=-march=nocona -O3' 'CXXFLAGS=-march=noco\
na -O3' 'CXX=gcc'

The log output on the API node:
-------------------------------------------------------------------------------
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=8388600
read_buffer_size=131072
max_used_connections=5
max_connections=100
threads_connected=5
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 225791 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Number of processes running now: 0
050317 14:18:07  mysqld restarted
-------------------------------------------------------------------------------
[22 Mar 2005 1:47] frank homann
i can confirm this bug with mysql 4.1.10a on a 3 nodes cluster configuration on linux 2.6.10 with nptl...

the query responsable for the crash:
DELETE FROM max_adviews WHERE t_stamp>=20050321160000 AND t_stamp<20050321170000;

on a 587708 rows table:
mysql> SHOW COLUMNS FROM max_adviews;
+----------+--------------+------+-----+-------------------+-------+
| Field    | Type         | Null | Key | Default           | Extra |
+----------+--------------+------+-----+-------------------+-------+
| userid   | varchar(32)  |      | MUL |                   |       |
| bannerid | mediumint(9) |      | MUL | 0                 |       |
| zoneid   | mediumint(9) |      |     | 0                 |       |
| t_stamp  | timestamp    | YES  | MUL | CURRENT_TIMESTAMP |       |
| host     | varchar(255) |      |     |                   |       |
| source   | varchar(50)  |      |     |                   |       |
| country  | char(2)      |      |     |                   |       |
+----------+--------------+------+-----+-------------------+-------+

the backtrace:
Cannot determine thread, fp=0xb2f34c1c, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x80c7387
0x8331508
0x14
0x826c871
0x8274e3a
0x8273c04
0x826ced8
0x826ce57
0x8287feb
0x832efd2
0x8365d7a
New value of fp=(nil) failed sanity check, terminating stack trace!

i've also tried to resolve the backtrace, but i get the following output:
$ nm -n /usr/sbin/mysqld > mysqld.sym
nm: /usr/sbin/mysqld: no symbols
[12 Apr 2005 13:54] Jonas Oreland
Hi,

I tried with a 2-node cluster and 4.1.11.
And was unable to get this error.
But there is to my knowledge no major difference between 4.1.11 and 4.1.10 regarding this

I managed to get missed heartbeat once, but then the transaction was correctly aborted
(we're working on improving heartbeat handling).

/Jonas
[21 Apr 2005 15:13] Jonas Oreland
I could not reproduce it simply...but I know johan can..but he's currently in polen
[26 Apr 2005 17:45] frank homann
hi, i can give you a shell account on our test servers and tell you how to reproduce it...
please contact me either with icq (96372521) or jabber (skunk@jabber.linux.it)
[26 Apr 2005 18:53] frank homann
btw, running against your precompiled mysql-max binary i get the following backtrace:

0x8136a43
0xffffe420
(nil)
0x83db3d4
0x83e057e
0x83dfa76
0x83db936
0x83db8ae
0x83d22d6
0xb7fb8add
0xb7eea58a

resolving the stack dump against the symbols file obtained from your mysql-debug binary:

# ./resolve_stack_dump -s mysqld.sym -n mysqld.stack
0x8136a43 get_opt_type__14field_longlongP6StringUl + 259
0xffffe420 _end + -140810812
(nil)
0x83db3d4 field_separator + 55508
0x83e057e field_separator + 76414
0x83dfa76 field_separator + 73590
0x83db936 field_separator + 56886
0x83db8ae field_separator + 56750
0x83d22d6 field_separator + 18390
0xb7fb8add _end + -1349055359
0xb7eea58a _end + -1349900498

hope this helps a little more resolving this very nasty bug...
[23 May 2005 8:13] frank homann
this bug shows up when deleting many rows (ex. 30k+) on a large table (ex. 100k+ rows) at once.
as workaround just delete fewer records (ex. using the "limit 20000" statement) and repeat the query until affected_rows<limit_value.
it's probably necessary to insert a few seconds delay between deletions in order to avoid this bug...
[16 Sep 2005 13:44] Fabio Coatti
I can confirm this behaviour also for version 4.1.15-nightly-20050826, compiled for Opteron. (I'm testing this version because of bug #12118)

I get no trace or whatever in ndbd logs, in mysqld.log i have only this:

mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=3
max_connections=100
threads_connected=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 233983 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Number of processes running now: 0

The system and configuration is the same as bugs #11776 and #12118 (two dual opterons with 8 Gb each, with gentoo linux)
The table has 734847 rows. 

When delete from table is issued, I get this:
mysql> delete from table;
ERROR 2013 (HY000): Lost connection to MySQL server during query

On mgm node, I get this:
2005-09-16 15:39:27 [MgmSrvr] ALERT    -- Node 3: Node 6 Disconnected
2005-09-16 15:39:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 closed
2005-09-16 15:39:27 [MgmSrvr] ALERT    -- Node 2: Node 6 Disconnected
2005-09-16 15:39:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 6 closed
2005-09-16 15:39:27 [MgmSrvr] INFO     -- Mgmt server state: nodeid 6 freed, m_reserved_nodes 000000000000002e.
2005-09-16 15:39:27 [MgmSrvr] INFO     -- Mgmt server state: nodeid 4 reserved for ip X.Y.Z.Q, m_reserved_nodes 000000000000003e.
2005-09-16 15:39:27 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2005-09-16 15:39:27 [MgmSrvr] INFO     -- Node 3: Node 4: API version 4.1.15
2005-09-16 15:39:27 [MgmSrvr] INFO     -- Node 2: Node 4 Connected
2005-09-16 15:39:27 [MgmSrvr] INFO     -- Node 2: Node 4: API version 4.1.15

(2,3 are ndb nodes, >= 4 API nodes)

If more informations or test are needed, just let me know.
[16 Sep 2005 14:38] Fabio Coatti
More on this; I've tried to find the number of rows that I can delete with one command;
I've seen that with limit 100000 the rows are deleted, but after this I'm unable to re-issue the same command, see: (note, no on else was using the database)

mysql> delete from table limit 200000;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    1
Current database: test_d2c

ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> select count(*) from table;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    1
Current database: test_d2c

+----------+
| count(*) |
+----------+
|   734848 |
+----------+
1 row in set (1.02 sec)

mysql> delete from table limit 100000;
Query OK, 100000 rows affected (2 min 5.01 sec)

mysql> select count(*) from table;
+----------+
| count(*) |
+----------+
|   634848 |
+----------+
1 row in set (0.01 sec)

mysql> delete from table limit 200000;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
mysql> delete from table limit 100000;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
[19 Sep 2005 10:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30038
[19 Sep 2005 10:58] Jonas Oreland
Pushed into 4.1.15
[28 Sep 2005 7:28] Jonas Oreland
And 5.0.14
[5 Oct 2005 16:02] Jonas Oreland
is more info needed?
[6 Oct 2005 6:33] Jonas Oreland
When performing large delete, (typically > 100k rows)
  there _could_ occur a deference of an invalid pointer causing mysqld to crash.

The fix introduces lock of a mutex just before changing the mentioned pointer.
  The mutex is only used when expanding objidmap in ndbapi
[13 Oct 2005 11:04] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 4.1.15 and 5.0.14 changelogs. Closed.
[3 Jan 2010 13:41] manish kumar
HI,

When i delete some thing from table.Table getting crash.

Manish Kumar