Bug #44195 NDBMTD - DD - PGMAN (Line: 1470) - Internal program error (failed ndbrequire)
Submitted: 9 Apr 2009 20:26 Modified: 9 Oct 2009 13:14
Reporter: Jonathan Miller Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Disk Data Severity:S2 (Serious)
Version:mysql-5.1-telco-7.0 OS:Linux
Assigned to: Jonas Oreland CPU Architecture:Any

[9 Apr 2009 20:26] Jonathan Miller
Description:
Hi,

Had a little time and was decided to scanned results from ACRT and found the following:

Time: Thursday 9 April 2009 - 13:55:28
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: pgman.cpp
Error object: PGMAN (Line: 1470) 0x00000006
Program: /data0/cr_autotest/libexec/ndbmtd
Pid: 31813
Trace: ./ndb_3_trace.log.1 ./ndb_3_trace.log.1_t1 ./ndb_3_trace.log.1_t2 ./ndb_3_trace.log.1_t3
Version: mysql-5.1.32 ndb-7.0.5-beta
***EOM***

PGMAN   001586 001470

--------------- Signal ----------------
r.bn: 261/5 "PGMAN", r.proc: 3, r.sigId: 7402172 gsn: 270 "FSWRITECONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 139344156 length: 1 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 19
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 7402171 gsn: 409 "TIME_SIGNAL" prio: 1
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 139344155 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004

void
Pgman::fswriteconf(Signal* signal, Ptr<Page_entry> ptr)
{
  D("fswriteconf");
  D(ptr);

  Page_state state = ptr.p->m_state;
  ndbrequire(state & Page_entry::PAGEOUT); <-001470

void
Pgman::execFSWRITECONF(Signal* signal)
{
  jamEntry(); <- 001586
  FsConf* conf = (FsConf*)signal->getDataPtr();
  Ptr<Page_entry> ptr;
  m_page_entry_pool.getPtr(ptr, conf->userPointer);

  fswriteconf(signal, ptr);
}

Signal_log:
COMMIT: INSERT tab: 2 frag: 0 activeCreat: 0 rowid: [ m_page_no: 0 m_file_no: 62451 m_page_idx: 7252] key: 0

MySQLD Log:

090409 11:54:42 [ERROR] Got error 286 when reading table './dbt2/stock'
090409 11:54:42 [ERROR] Got error 4031 when reading table './dbt2/stock'
090409 11:54:42 [ERROR] Got error 4031 when reading table './dbt2/stock'
090409 11:54:44 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00

Node 3 log.out

jbalock waiting for lock, contentions: 4 spins: 371795
jbalock waiting for lock, contentions: 400 spins: 32967759
2009-04-09 13:55:28 [ndbd] INFO     -- pgman.cpp
2009-04-09 13:55:28 [ndbd] INFO     -- PGMAN (Line: 1470) 0x00000006
2009-04-09 13:55:28 [ndbd] INFO     -- Error handler shutting down system
2009-04-09 13:55:29 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-04-09 13:55:29 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
I have not been able to break this down to a finite set of steps: I have not seen this before today.

Happened using ACRT on the following:
Config = 2-dn-mt-mysql-5.1-telco-7.0
Test = TPC-C-DD (DBT2), 10 threads
2-dn-mt = 2 Data MT Nodes

[atrt]
basedir=/data0/cr_autotest/run-2-dn-mt-mysql-5.1-telco-7.0/run
baseport=15000
clusters= .master
mt = 2

[ndb_mgmd]

[mysqld]
skip-grant-tables
skip-innodb
ndb_use_exact_count=0
loose-join_cache_level=6
loose-ndb-cluster-connection-pool=1
loose-ndb_extra_logging=9
loose-engine_condition_pushdown=1
#loose-ndb_cache_check_time=1000

[cluster_config]
MaxNoOfExecutionThreads=3
MaxNoOfSavedMessages = 30
NoOfReplicas = 2
DataMemory = 8000M
IndexMemory = 1000M
DiskPageBufferMemory=300MB
DiskCheckpointSpeed=16M
RedoBuffer=200M
NoOfFragmentLogFiles=10
FragmentLogFileSize=512M
InitFragmentLogFiles=FULL
ODirect= 1
DiskIOThreadPool=40
SharedGlobalMemory=384M
SendBufferMemory = 4M
MaxNoOfConcurrentOperations = 250000
MaxNoOfLocalOperations = 275000
MaxNoOfConcurrentIndexOperations = 20000
MaxNoOfAttributes=2048
MaxNoOfOrderedIndexes=512
MaxNoOfUniqueHashIndexes=512
LockPagesInMainMemory=1
MemReportFrequency=200
LogLevelCongestion=15
LogLevelStatistic=15

[cluster_config.master]
ndb_mgmd = ndb21
ndbd = ndb21,ndb22
mysqld = ndb18
ndbapi=  ndb18,ndb18,ndb18,ndb18

[cluster_config.ndbd.1.master]
FileSystemPath=/data1/

[cluster_config.ndbd.2.master]
FileSystemPath=/data1/

#
# Generated by atrt
# Thu Apr  9 09:50:30 2009

[mysql_cluster.master]
ndb-connectstring= ndb21:15000

[cluster_config.ndb_mgmd.1.master]
PortNumber= 15000

[mysqld.1.master]
datadir= /data0/cr_autotest/run-2-dn-mt-mysql-5.1-telco-7.0/run/mysqld.1
socket= /data0/cr_autotest/run-2-dn-mt-mysql-5.1-telco-7.0/run/mysqld.1/mysql.sock
port= 15001
ndb-connectstring= ndb21:15000
ndbcluster

[client.1.master]
host= ndb18
socket= /data0/cr_autotest/run-2-dn-mt-mysql-5.1-telco-7.0/run/mysqld.1/mysql.sock
port= 15001

Suggested fix:
N/A
[9 Apr 2009 20:26] Jonathan Miller
Error and trace logs

Attachment: pgman.tgz (application/x-compressed-tar, text), 244.91 KiB.

[9 Apr 2009 20:54] Jonathan Miller
Not sure this will help or not, but...

dbt2 was doing a new_order transaction and hit a (Lock wait timeout exceeded)

Thu Apr  9 11:54:36 2009
Microseconds : 405658
tid:1095936320 mysql/dbc_new_order.c:88
mysql reports: SQL: call new_order(9, 1, 72, 1, 15, 48856, 9, 5, 67518, 9, 1, 73728, 9, 1, 96256, 9,                                 2, 53248, 9, 3, 81406, 9, 3, 26592, 9, 4, 38892, 9, 9, 57343, 9, 6,                                 84911, 9, 6, 36352, 9, 9, 60387, 9, 3, 98303, 9, 8, 47102, 9, 8, 40911,           9, 6, @rc),  ERROR: 1205 Lock wait timeout exceeded; try restarting transaction
Thu Apr  9 11:54:36 2009
Microseconds : 405680
tid:1095936320 mysql/dbc_common.c:97
ROLLBACK INITIATED

Next.... DN Failure

Thu Apr  9 11:54:42 2009
Microseconds : 366238
tid:1075546432 mysql/dbc_new_order.c:88
mysql reports: SQL: call new_order(3, 7, 928, 0, 9, 81920, 3, 9, 46588, 3, 5, 24576, 3, 2, 89536, 3,                                 7, 10172, 3, 2, 52592, 7, 8, 72702, 3, 2, 69613, 3, 4, 97607, 3, 4,                                 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,                          0, 0, @rc),  ERROR: 1297 Got temporary error 286 'Node failure caused abort of transaction' from NDBCLUSTER
Thu Apr  9 11:54:42 2009
Microseconds : 366309
tid:1075546432 mysql/dbc_common.c:97
ROLLBACK INITIATED

The New-Order Transaction
The New-Order business transaction consists of entering a complete order through a single database transaction. It represents a mid-weight, read-write transaction with a high frequency of execution and stringent response time
requirements to satisfy on-line users. This transaction is the backbone of the workload. It is designed to place a variable load on the system to reflect on-line database activity as typically found in production environments.

Entering a new order is done in a single database transaction with the following steps:
1. Create an order header, comprised of:
   2 row selections with data retrieval,
   1 row selection with data retrieval and update,
   2 row insertions.
2. Order a variable number of items (average ol_cnt = 10), comprised of:
   (1 * ol_cnt) row selections with data retrieval,
   (1 * ol_cnt) row selections with data retrieval and update,
   (1 * ol_cnt) row insertions.
[10 Apr 2009 3:43] Jonathan Miller
Sorry, meant to add this sooner:

clone:mysql-5.1-telco-7.0
rundate_2009-04-09_09:29
build_script:/space/cluster_rep_auto/scripts/boot.sh
------------------------------------------------------------
revno: 2890
revision-id: frazer@mysql.com-20090408222804-ve6xi3f1gaunzzxy
parent: frazer@mysql.com-20090408181037-8hsxlwqqydgl8f72
parent: frazer@mysql.com-20090408222513-d9ddlkxomto876ld
committer: Frazer Clement 
branch nick: mysql-5.1-telco-6.4
timestamp: Wed 2009-04-08 23:28:04 +0100
message:
  Merge 6.3->7.0
[27 Apr 2009 12:59] Jonathan Miller
Hi,

Found that testing hit this error again, but using a different test. The framework was running TPC-"B" on disk data this time using 4 DN in a "MT Mixed" mode. (i.e. 2 NDB Single Threaded, 2 NDB Multiple threads). The crashing DN was an NDBMTD.

Time: Sunday 26 April 2009 - 23:05:18
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: pgman.cpp
Error object: PGMAN (Line: 1470) 0x00000006
Program: /data0/cr_autotest/libexec/ndbmtd
Pid: 18068
Trace: ./ndb_3_trace.log.1 ./ndb_3_trace.log.1_t1 ./ndb_3_trace.log.1_t2 ./ndb_3_trace.log.1_t3 ./ndb_3_trace.log.1_t4 ./ndb_3_  

I will up load trace files shortly
[27 Apr 2009 13:06] Jonathan Miller
Trace Files from latest crash

Attachment: tpcb-run.tgz (application/x-compressed-tar, text), 306.09 KiB.

[20 Aug 2009 6:31] Jonas Oreland
see bug#46507
[9 Oct 2009 11:16] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/86332

3105 Jonas Oreland	2009-10-09
      ndb - bug#44195 - increase size of protected area, to make sure that multiple threads dont access data in parallel
[9 Oct 2009 11:18] Bugs System
Pushed into 5.1.39-ndb-7.1.0 (revid:jonas@mysql.com-20091009111806-bi2i4o4kychhf3e8) (version source revid:jonas@mysql.com-20091009111806-bi2i4o4kychhf3e8) (merge vers: 5.1.39-ndb-7.1.0) (pib:12)
[9 Oct 2009 11:21] Jonas Oreland
pushed to 7.0.9 and 7.1
[9 Oct 2009 13:14] Jon Stephens
Documented bugfix in the NDB-7.0.9 changelog as follows:

        Multi-threaded data nodes could in some cases attempt to access
        the same memory structure in parallel, in a non-safe manner. This
        could result in data node failure when running ndbmtd while
        using Disk Data tables.

Closed.