| Bug #44195 | NDBMTD - DD - PGMAN (Line: 1470) - Internal program error (failed ndbrequire) | ||
|---|---|---|---|
| Submitted: | 9 Apr 2009 20:26 | Modified: | 9 Oct 2009 13:14 |
| Reporter: | Jonathan Miller | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Disk Data | Severity: | S2 (Serious) |
| Version: | mysql-5.1-telco-7.0 | OS: | Linux |
| Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[9 Apr 2009 20:26]
Jonathan Miller
Error and trace logs
Attachment: pgman.tgz (application/x-compressed-tar, text), 244.91 KiB.
[9 Apr 2009 20:54]
Jonathan Miller
Not sure this will help or not, but... dbt2 was doing a new_order transaction and hit a (Lock wait timeout exceeded) Thu Apr 9 11:54:36 2009 Microseconds : 405658 tid:1095936320 mysql/dbc_new_order.c:88 mysql reports: SQL: call new_order(9, 1, 72, 1, 15, 48856, 9, 5, 67518, 9, 1, 73728, 9, 1, 96256, 9, 2, 53248, 9, 3, 81406, 9, 3, 26592, 9, 4, 38892, 9, 9, 57343, 9, 6, 84911, 9, 6, 36352, 9, 9, 60387, 9, 3, 98303, 9, 8, 47102, 9, 8, 40911, 9, 6, @rc), ERROR: 1205 Lock wait timeout exceeded; try restarting transaction Thu Apr 9 11:54:36 2009 Microseconds : 405680 tid:1095936320 mysql/dbc_common.c:97 ROLLBACK INITIATED Next.... DN Failure Thu Apr 9 11:54:42 2009 Microseconds : 366238 tid:1075546432 mysql/dbc_new_order.c:88 mysql reports: SQL: call new_order(3, 7, 928, 0, 9, 81920, 3, 9, 46588, 3, 5, 24576, 3, 2, 89536, 3, 7, 10172, 3, 2, 52592, 7, 8, 72702, 3, 2, 69613, 3, 4, 97607, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, @rc), ERROR: 1297 Got temporary error 286 'Node failure caused abort of transaction' from NDBCLUSTER Thu Apr 9 11:54:42 2009 Microseconds : 366309 tid:1075546432 mysql/dbc_common.c:97 ROLLBACK INITIATED The New-Order Transaction The New-Order business transaction consists of entering a complete order through a single database transaction. It represents a mid-weight, read-write transaction with a high frequency of execution and stringent response time requirements to satisfy on-line users. This transaction is the backbone of the workload. It is designed to place a variable load on the system to reflect on-line database activity as typically found in production environments. Entering a new order is done in a single database transaction with the following steps: 1. Create an order header, comprised of: 2 row selections with data retrieval, 1 row selection with data retrieval and update, 2 row insertions. 2. Order a variable number of items (average ol_cnt = 10), comprised of: (1 * ol_cnt) row selections with data retrieval, (1 * ol_cnt) row selections with data retrieval and update, (1 * ol_cnt) row insertions.
[10 Apr 2009 3:43]
Jonathan Miller
Sorry, meant to add this sooner: clone:mysql-5.1-telco-7.0 rundate_2009-04-09_09:29 build_script:/space/cluster_rep_auto/scripts/boot.sh ------------------------------------------------------------ revno: 2890 revision-id: frazer@mysql.com-20090408222804-ve6xi3f1gaunzzxy parent: frazer@mysql.com-20090408181037-8hsxlwqqydgl8f72 parent: frazer@mysql.com-20090408222513-d9ddlkxomto876ld committer: Frazer Clement branch nick: mysql-5.1-telco-6.4 timestamp: Wed 2009-04-08 23:28:04 +0100 message: Merge 6.3->7.0
[27 Apr 2009 12:59]
Jonathan Miller
Hi, Found that testing hit this error again, but using a different test. The framework was running TPC-"B" on disk data this time using 4 DN in a "MT Mixed" mode. (i.e. 2 NDB Single Threaded, 2 NDB Multiple threads). The crashing DN was an NDBMTD. Time: Sunday 26 April 2009 - 23:05:18 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: pgman.cpp Error object: PGMAN (Line: 1470) 0x00000006 Program: /data0/cr_autotest/libexec/ndbmtd Pid: 18068 Trace: ./ndb_3_trace.log.1 ./ndb_3_trace.log.1_t1 ./ndb_3_trace.log.1_t2 ./ndb_3_trace.log.1_t3 ./ndb_3_trace.log.1_t4 ./ndb_3_ I will up load trace files shortly
[27 Apr 2009 13:06]
Jonathan Miller
Trace Files from latest crash
Attachment: tpcb-run.tgz (application/x-compressed-tar, text), 306.09 KiB.
[20 Aug 2009 6:31]
Jonas Oreland
see bug#46507
[9 Oct 2009 11:16]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/86332 3105 Jonas Oreland 2009-10-09 ndb - bug#44195 - increase size of protected area, to make sure that multiple threads dont access data in parallel
[9 Oct 2009 11:18]
Bugs System
Pushed into 5.1.39-ndb-7.1.0 (revid:jonas@mysql.com-20091009111806-bi2i4o4kychhf3e8) (version source revid:jonas@mysql.com-20091009111806-bi2i4o4kychhf3e8) (merge vers: 5.1.39-ndb-7.1.0) (pib:12)
[9 Oct 2009 11:21]
Jonas Oreland
pushed to 7.0.9 and 7.1
[9 Oct 2009 13:14]
Jon Stephens
Documented bugfix in the NDB-7.0.9 changelog as follows:
Multi-threaded data nodes could in some cases attempt to access
the same memory structure in parallel, in a non-safe manner. This
could result in data node failure when running ndbmtd while
using Disk Data tables.
Closed.

Description: Hi, Had a little time and was decided to scanned results from ACRT and found the following: Time: Thursday 9 April 2009 - 13:55:28 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: pgman.cpp Error object: PGMAN (Line: 1470) 0x00000006 Program: /data0/cr_autotest/libexec/ndbmtd Pid: 31813 Trace: ./ndb_3_trace.log.1 ./ndb_3_trace.log.1_t1 ./ndb_3_trace.log.1_t2 ./ndb_3_trace.log.1_t3 Version: mysql-5.1.32 ndb-7.0.5-beta ***EOM*** PGMAN 001586 001470 --------------- Signal ---------------- r.bn: 261/5 "PGMAN", r.proc: 3, r.sigId: 7402172 gsn: 270 "FSWRITECONF" prio: 1 s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 139344156 length: 1 trace: 0 #sec: 0 fragInf: 0 UserPointer: 19 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 7402171 gsn: 409 "TIME_SIGNAL" prio: 1 s.bn: 252 "QMGR", s.proc: 3, s.sigId: 139344155 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00000004 void Pgman::fswriteconf(Signal* signal, Ptr<Page_entry> ptr) { D("fswriteconf"); D(ptr); Page_state state = ptr.p->m_state; ndbrequire(state & Page_entry::PAGEOUT); <-001470 void Pgman::execFSWRITECONF(Signal* signal) { jamEntry(); <- 001586 FsConf* conf = (FsConf*)signal->getDataPtr(); Ptr<Page_entry> ptr; m_page_entry_pool.getPtr(ptr, conf->userPointer); fswriteconf(signal, ptr); } Signal_log: COMMIT: INSERT tab: 2 frag: 0 activeCreat: 0 rowid: [ m_page_no: 0 m_file_no: 62451 m_page_idx: 7252] key: 0 MySQLD Log: 090409 11:54:42 [ERROR] Got error 286 when reading table './dbt2/stock' 090409 11:54:42 [ERROR] Got error 4031 when reading table './dbt2/stock' 090409 11:54:42 [ERROR] Got error 4031 when reading table './dbt2/stock' 090409 11:54:44 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00 Node 3 log.out jbalock waiting for lock, contentions: 4 spins: 371795 jbalock waiting for lock, contentions: 400 spins: 32967759 2009-04-09 13:55:28 [ndbd] INFO -- pgman.cpp 2009-04-09 13:55:28 [ndbd] INFO -- PGMAN (Line: 1470) 0x00000006 2009-04-09 13:55:28 [ndbd] INFO -- Error handler shutting down system 2009-04-09 13:55:29 [ndbd] INFO -- Error handler shutdown completed - exiting 2009-04-09 13:55:29 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. How to repeat: I have not been able to break this down to a finite set of steps: I have not seen this before today. Happened using ACRT on the following: Config = 2-dn-mt-mysql-5.1-telco-7.0 Test = TPC-C-DD (DBT2), 10 threads 2-dn-mt = 2 Data MT Nodes [atrt] basedir=/data0/cr_autotest/run-2-dn-mt-mysql-5.1-telco-7.0/run baseport=15000 clusters= .master mt = 2 [ndb_mgmd] [mysqld] skip-grant-tables skip-innodb ndb_use_exact_count=0 loose-join_cache_level=6 loose-ndb-cluster-connection-pool=1 loose-ndb_extra_logging=9 loose-engine_condition_pushdown=1 #loose-ndb_cache_check_time=1000 [cluster_config] MaxNoOfExecutionThreads=3 MaxNoOfSavedMessages = 30 NoOfReplicas = 2 DataMemory = 8000M IndexMemory = 1000M DiskPageBufferMemory=300MB DiskCheckpointSpeed=16M RedoBuffer=200M NoOfFragmentLogFiles=10 FragmentLogFileSize=512M InitFragmentLogFiles=FULL ODirect= 1 DiskIOThreadPool=40 SharedGlobalMemory=384M SendBufferMemory = 4M MaxNoOfConcurrentOperations = 250000 MaxNoOfLocalOperations = 275000 MaxNoOfConcurrentIndexOperations = 20000 MaxNoOfAttributes=2048 MaxNoOfOrderedIndexes=512 MaxNoOfUniqueHashIndexes=512 LockPagesInMainMemory=1 MemReportFrequency=200 LogLevelCongestion=15 LogLevelStatistic=15 [cluster_config.master] ndb_mgmd = ndb21 ndbd = ndb21,ndb22 mysqld = ndb18 ndbapi= ndb18,ndb18,ndb18,ndb18 [cluster_config.ndbd.1.master] FileSystemPath=/data1/ [cluster_config.ndbd.2.master] FileSystemPath=/data1/ # # Generated by atrt # Thu Apr 9 09:50:30 2009 [mysql_cluster.master] ndb-connectstring= ndb21:15000 [cluster_config.ndb_mgmd.1.master] PortNumber= 15000 [mysqld.1.master] datadir= /data0/cr_autotest/run-2-dn-mt-mysql-5.1-telco-7.0/run/mysqld.1 socket= /data0/cr_autotest/run-2-dn-mt-mysql-5.1-telco-7.0/run/mysqld.1/mysql.sock port= 15001 ndb-connectstring= ndb21:15000 ndbcluster [client.1.master] host= ndb18 socket= /data0/cr_autotest/run-2-dn-mt-mysql-5.1-telco-7.0/run/mysqld.1/mysql.sock port= 15001 Suggested fix: N/A