Bug #61841 Failed ndbrequire in DblqhMain.cpp causing complete cluster shutdown
Submitted: 12 Jul 2011 22:03 Modified: 27 Apr 2015 7:44
Reporter: Evan Kinney Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:5.1.51-ndb-7.2.0-devmilestone OS:Linux (RHEL Server 6.1 (2.6.32-131.2.1.el6.x86_64))
Assigned to: MySQL Verification Team CPU Architecture:Any

[12 Jul 2011 22:03] Evan Kinney
Description:
While under somewhat intensive load, the ndbmtd daemon crashes and forces a shutdown of the cluster due to a failed call to ndbrequire from DblqhMain.cpp.

---
Time: Tuesday 12 July 2011 - 16:54:09
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dblqh/DblqhMain.cpp
Error object: DBLQH (Line: 10092) 0x00000002
Program: /usr/sbin/ndbmtd
Pid: 22465 thr: 2
Version: mysql-5.1.51 ndb-7.2.0-beta
Trace: /mysql-cluster/node1/ndb_1_trace.log.4 /mysql-cluster/node1/ndb_1_trace.log.4_t1 /mysql-cluster/node1/ndb_1_tra
---

We are able to reproduce this error every time this set of procs is run. If needed, I can get the person who's doing this to step through the procs so we can figure out which one is doing this.

We're running four API nodes (with 16 connections each) and four data nodes (two on each server with 8 threads each). There is a MySQL proxy instance in front of everything to load balance between the API nodes.

How to repeat:
Specific queries and/or proc information can be provided if necessary.
[14 Jul 2011 9:31] MySQL Verification Team
I have run numerous tests with 7.2 without any crash so going blindly into reproducing this bug will not work out for sure. 

Is there a way you can produce a "script" to run against SQL nodes that will crash the cluster?
[14 Jul 2011 10:28] Jonas Oreland
queries and data is needed...it could be data-dependant...so maybe exact set of data is needed...

i think(not sure) that this has been fixed in our internal version (but there doesnt exist any new snapshot release)

/Jonas
[15 Aug 2011 16:40] Evan Kinney
Okay, so I stuck some ndbout statements above that ndbrequire() on line 10092 in Dblqh::execSCAN_FRAGREQ().

I'm running a select statement that returns somewhere around 37,400 rows:
-----
select br.id
      from base_releases br, versioned_components vc, abstract_components ac, specific_components sc, platforms p
      where sc.versioned_component_id = vc.id and
        vc.abstract_component_id = ac.id and
        ac.base_release_id = br.id and
        sc.platform_id = p.id limit 100;
-----

This crashes this particular data node (interestingly enough, it doesn't crash the entire cluster when I use ndbd instead of ndbmtd, just the one data node) with the ndbrequire() error referenced above.

Looking at my debug output, the ndbrequire() is failing because max_rows becomes zero after being anywhere from 1 to 8:

-----
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 12
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 12
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 12
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 12
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 14
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 14
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 14
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 14
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 16
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 16
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 16
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 16
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 18
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 18
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 18
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 18
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 20
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 20
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 20
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 20
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 22
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 22
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 22
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 22
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 24
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 24
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 24
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 1
Dblqh::execSCAN_FRAGREQ - transid1: 24
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 8
Dblqh::execSCAN_FRAGREQ - transid1: 26
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 8
Dblqh::execSCAN_FRAGREQ - transid1: 26
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 8
Dblqh::execSCAN_FRAGREQ - transid1: 26
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 8
Dblqh::execSCAN_FRAGREQ - transid1: 26
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
Dblqh::execSCAN_FRAGREQ - max_rows: 0
Dblqh::execSCAN_FRAGREQ - transid1: 26
Dblqh::execSCAN_FRAGREQ - transid2: 1055488
2011-08-15 12:31:12 [ndbd] INFO     -- dblqh/DblqhMain.cpp
2011-08-15 12:31:12 [ndbd] INFO     -- DBLQH (Line: 10094) 0x00000002
2011-08-15 12:31:12 [ndbd] INFO     -- Error handler shutting down system
2011-08-15 12:31:12 [ndbd] INFO     -- Error handler shutdown completed - aborting
-----

Any clue as to why this would happen? I'm trying to get some sample data put together for you now.

/ek
[15 Aug 2011 16:42] Evan Kinney
I meant to say that the above statement would return 37,400 rows without the limit clause... but I suppose that's obvious.

/ek
[15 Aug 2011 17:46] Evan Kinney
Okay, so I was playing around with the where clause and ran into a new error:

-----
2011-08-15 13:33:12 [ndbd] INFO     -- dbspj/DbspjMain.cpp
2011-08-15 13:33:12 [ndbd] INFO     -- DBSPJ (Line: 4895) 0x00000002
2011-08-15 13:33:12 [ndbd] INFO     -- Error handler shutting down system
2011-08-15 13:33:12 [ndbd] INFO     -- Error handler shutdown completed - aborting
-----

This resulted from:
-----
select br.id
      from base_releases br, versioned_components vc, abstract_components ac, specific_components sc, platforms p
      where sc.platform_id = p.id;
-----

Do you think the Dbspj stuff could be causing the errors in the local query handler? I'd think the distributed join stuff would be a likely candidate for why this is failing...

Is there a way to disable it and revert back to the pre-7.2 behavior to see if that still happens?

/ek
[15 Aug 2011 18:23] Evan Kinney
Sorry for the deluge of comments, but I figured it out.

This is definitely due to the DBSPJ stuff:

-----
mysql> EXPLAIN    -> select br.id
    ->       from base_releases br, versioned_components vc, abstract_components ac, specific_components sc, platforms p
    ->       where sc.platform_id = p.id;
+----+-------------+-------+------+-----------------------------+-----------+---------+--------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys               | key       | key_len | ref          | rows | Extra                                        |
+----+-------------+-------+------+-----------------------------+-----------+---------+--------------+------+----------------------------------------------+
|  1 | SIMPLE      | br    | ALL  | NULL                        | NULL      | NULL    | NULL         |    5 |                                              |
|  1 | SIMPLE      | p     | ALL  | PRIMARY                     | NULL      | NULL    | NULL         |   95 | Parent of 2 pushed join@1; Using join buffer |
|  1 | SIMPLE      | sc    | ref  | sc_pi_vci,platform_id_index | sc_pi_vci | 4       | partsdb.p.id |  371 | Child of 'p' in pushed join@1                |
|  1 | SIMPLE      | ac    | ALL  | NULL                        | NULL      | NULL    | NULL         | 3022 | Using join buffer                            |
|  1 | SIMPLE      | vc    | ALL  | NULL                        | NULL      | NULL    | NULL         | 6062 | Using join buffer                            |
+----+-------------+-------+------+-----------------------------+-----------+---------+--------------+------+----------------------------------------------+
5 rows in set (0.01 sec)
-----

I set ndb_join_pushdown to off and ran the query again (I also ran the stored proc that was crashing before) and it executed successfully. Turned it back on, ran it again, and the cluster crashed.

/ek
[18 Aug 2011 7:26] Jonas Oreland
This is explain in snapshot from yesterday,
mysql> explain select br.id       from base_releases br, versioned_components vc, abstract_components ac, specific_components sc, platforms p       where sc.versioned_component_id = vc.id and         vc.abstract_component_id = ac.id and         ac.base_release_id = br.id and         sc.platform_id = p.id limit 100;
+----+-------------+-------+--------+----------------------------------------------------------+-----------+---------+-----------------------------------+------+--------------------------------+
| id | select_type | table | type   | possible_keys                                            | key       | key_len | ref                               | rows | Extra                          |
+----+-------------+-------+--------+----------------------------------------------------------+-----------+---------+-----------------------------------+------+--------------------------------+
|  1 | SIMPLE      | p     | ALL    | PRIMARY                                                  | NULL      | NULL    | NULL                              |   95 | Parent of 5 pushed join@1      |
|  1 | SIMPLE      | sc    | ref    | sc_pi_vci,platform_id_index,versioned_component_id_index | sc_pi_vci | 4       | partsdb.p.id                      |  371 | Child of 'p' in pushed join@1 |
|  1 | SIMPLE      | vc    | eq_ref | PRIMARY,abstract_component_id_index                      | PRIMARY   | 4       | partsdb.sc.versioned_component_id |    1 | Child of 'sc' in pushed join@1 |
|  1 | SIMPLE      | ac    | eq_ref | PRIMARY,abstract_components_base_release_id_fk           | PRIMARY   | 4       | partsdb.vc.abstract_component_id  |    1 | Child of 'vc' in pushed join@1 |
|  1 | SIMPLE      | br    | eq_ref | PRIMARY                                                  | PRIMARY   | 4       | partsdb.ac.base_release_id        |    1 | Child of 'ac' in pushed join@1 |
+----+-------------+-------+--------+----------------------------------------------------------+-----------+---------+-----------------------------------+------+--------------------------------+
5 rows in set (0.02 sec)