Bug #52058 | ndbd crashes in Pgman::get_page_entry during backup | |
---|---|---|---
Submitted: | 15 Mar 2010 9:58 | Modified: | 19 Sep 2010 9:49 |
Reporter: | Niklas Hallqvist | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-5.1-telco-6.3 | OS: | Linux (rhel5) |
Assigned to: | Assigned Account | CPU Architecture: | Any |
Tags: | 6.3.27a |
[15 Mar 2010 9:58]
Niklas Hallqvist
[15 Mar 2010 11:53]
Pekka Nousiainen
This looks like an unhandled resource problem. A workaround should be to set DiskPageBufferMemory to something larger than the default 64M (see ref manual 5.1). Context:

```
Pgman::get_page_entry()
  if (find_page_entry(..
  if (m_page_entry_pool.getNoOfFree() == 0 ...
  if (seize_page_entry(..   // fail
  ndbrequire(false);
```
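For anyone hitting the same assertion, a minimal config.ini sketch of that workaround; the 256M figure is only illustrative (it is the value tried later in this thread), not a tuned recommendation:

```ini
# config.ini -- illustrative value only
[NDBD DEFAULT]
# Default is 64M; a larger disk page buffer gives PGMAN more page entries,
# the resource that runs out in Pgman::get_page_entry() above
DiskPageBufferMemory=256M
```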
[16 Mar 2010 23:22]
Niklas Hallqvist
I raised that parameter to 256M, but tonight one of the nodes failed at almost exactly the same point in time as last time, that is, 26 minutes into the backup process:

```
Time: Tuesday 16 March 2010 - 22:26:21
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: pgman.cpp
Error object: PGMAN (Line: 484) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 1097
Trace: /var/lib/mysql-cluster/data/ndb_3_trace.log.8
Version: mysql-5.1.37 ndb-6.3.27a-GA
***EOM***
```

Or do I need to restart ndb_mgmd as well as the other nodes after changing config.ini? I just restarted the ndbd nodes. Our ordinary DBA is away for a week at the moment :-( I'll restart all the nodes ASAP.
[17 Mar 2010 9:16]
Pekka Nousiainen
Yes, ndb_mgmd must be restarted first and then the DB nodes.
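A rough sketch of that order of operations, assuming node 1 is the management node and nodes 2 and 3 are the data nodes (as the logs in this report suggest); the config.ini path is an assumption for this setup:

```sh
# 1. Restart the management node so it rereads config.ini
ndb_mgm -e "1 STOP"
ndb_mgmd -f /var/lib/mysql-cluster/config.ini

# 2. Then restart the data nodes one at a time (rolling restart);
#    wait for each to report "started" (ndb_mgm -e "SHOW") before the next
ndb_mgm -e "2 RESTART"
ndb_mgm -e "3 RESTART"
```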
[18 Mar 2010 7:04]
Niklas Hallqvist
OK, this time the nodes were properly restarted in a rolling restart. But at about the same point in time (i.e. 26 minutes into the backup), both nodes stopped with another error:

```
Time: Wednesday 17 March 2010 - 22:26:33
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 2 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 257) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 2030
Trace: /var/lib/mysql-cluster/data/ndb_2_trace.log.21
Version: mysql-5.1.37 ndb-6.3.27a-GA
***EOM***
```

The log has details about the GCP that do not fit in this comment; if they are relevant I can attach them separately. The other node looks similar, but with some extra messages during the 20 seconds it stayed alive after the first node went down.

I realize there are configurable parameters for the GCP timeout and interval, but I have trouble understanding what to set them to. I am reading http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-nd... and will try to figure things out. I guess we should write the backup to another place or something, since this seems to be related to I/O throughput (which surprises me, as the system used is rather high profile, but I'll check things out).

Anyway, I do understand there are newer versions, although not packaged for RHEL5, at least none that I have found. Are there known fixes in this area in them? I have no trouble compiling things myself, I just don't want to waste time if there are no relevant fixes. Or are there other ways to accomplish backups without disturbing this online system (it validates credit card transactions in real time for mobile terminals)?
[18 Mar 2010 7:38]
Niklas Hallqvist
I found a message that crops up every so often in the cluster log: "send buffer was full, but sleeping solved it", or similar. I have therefore raised the SendBufferMemory limit from 2M to 64M. I have no idea whether this helps with the problem at hand, but hopefully it relieves the pressure somewhat. Is there an analysis tool that can help figure out which configurables are approaching their limits ahead of time?
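For completeness, SendBufferMemory is a transporter parameter, so the change described above would look roughly like this in config.ini (illustrative only):

```ini
[TCP DEFAULT]
# Default is 2M; a larger send buffer only absorbs peaks, it does not
# fix a sustained overload (as noted in the next comment)
SendBufferMemory=64M
```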
[18 Mar 2010 8:43]
Pekka Nousiainen
Looking at the config:

IndexMemory at 25% of DataMemory may be unnecessarily large if memory is needed for other things.

TimeBetweenLocalCheckpoints=6 is an unusual setting and causes continuous writing of LCPs; it may be behind the problem. Setting it back to the default now can cause other issues, such as a need for more REDO and UNDO space (see the sketch below). Here are a couple of links:
http://umangg.blogspot.com/2009/10/mysql-cluster-errors-faced-while.html
http://lists.mysql.com/cluster/3119

GCP stop is a well-known problem. With disk data it may be caused by too-slow disks (plenty of material to google). High-performance setups usually require careful work by our *support engineers*.

Send buffer full is usually associated with replication; the problem is sending all changed data back to mysql. A larger buffer helps only to handle peaks, not constant overload.
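To spell out why TimeBetweenLocalCheckpoints=6 means continuous LCPs: per the 5.1 reference manual (as I read it), the value is not a time at all but the base-2 logarithm of the number of 4-byte words written before a new LCP is started:

```ini
[NDBD DEFAULT]
# TimeBetweenLocalCheckpoints is the base-2 log of 4-byte words written
# between LCP starts (5.1 manual):
#   6  -> 4 * 2^6  bytes = 256 bytes  => a new LCP starts almost immediately
#                                        after the previous one finishes
#   20 -> 4 * 2^20 bytes = 4 MB       => the default
TimeBetweenLocalCheckpoints=20
```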
[18 Mar 2010 9:25]
Niklas Hallqvist
Well, memory isn't a scarce resource here; there are 26 GB per node, and we have lots to spare to throw at NDB if needed. Also, indexing will be added in the future, due to new demands on the application. Is there really a problem with a high index-to-data memory ratio?

I will raise the LCP parameter, but that value was hinted at by other error messages we have gotten in the past. In fact I believe the exact value we chose may well have come from the first link you supplied. I see he also raises the redo log fragment size; I'll try that. Thanks for your prompt answer anyway.

We are not reluctant to use your support engineers if needed. However, as experienced database engineers ourselves, although coming from other engine backgrounds, our first approach was to tune this ourselves with the help of the net. This is also part of our own learning curve. If it turns out too hard to fix this quickly, we'll certainly pay for help.
[18 Mar 2010 9:52]
Pekka Nousiainen
IndexMemory is unfortunately named. It stores only hash indexes, basically primary keys, and only a "pointer" is stored. Normal (i.e. ordered) indexes use DataMemory.

About versions: no newer RHEL5 rpms exist. There have been many fixes after 6.3.27a; with disk data, at least bug#47832 and bug#48910.

You have >= 26G of RAM and still need disk data? It's best to put tables in memory when possible, such as small but frequently accessed tables. Our disk buffer cache is not advanced.
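A small, hypothetical example of what that means for an NDB table: only the hash index backing the primary key lives in IndexMemory, while row data and any ordered (secondary) indexes live in DataMemory:

```sql
CREATE TABLE txn_example (                   -- hypothetical table
  id         BIGINT NOT NULL PRIMARY KEY,    -- PK hash index: IndexMemory
  created_at DATETIME,
  amount     DECIMAL(12,2),
  KEY ix_created (created_at)                -- ordered index: DataMemory
) ENGINE=NDBCLUSTER;
```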
[18 Mar 2010 10:08]
Niklas Hallqvist
Thanks for the IndexMemory explanation. No, we don't need disk data (yet, anyway). At the start we had just 6G per node and had to go down that path; after upgrading we have not reverted to just in-core data. Do you think that moving away from disk data makes things more stable? We can well do that.

Another nagging little bit of ignorance: pure in-core data does not mean less data integrity, does it? Things get stored on disk anyhow, right, but are always readily available in-core as well?
[18 Mar 2010 10:28]
Pekka Nousiainen
Yes, avoid disk data if you can. NDB was originally a memory-only db with only hash indexes. Disk data has lots of new code (therefore bugs) and makes some issues worse, such as GCP stop.

In-memory tables do not exist on disk as such. Their data goes only to LCPs and REDO logs, and to the backup replica. So there is no "loading" of a table from disk into memory. Data integrity is based on storing the data in two main memories (one replica per node).
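For reference, one version-agnostic way to move a disk-data table back into memory is to re-create it without the TABLESPACE ... STORAGE DISK clause and copy the rows over. This is only a sketch with made-up table names, not a tested procedure for this cluster:

```sql
-- New table with no TABLESPACE/STORAGE DISK clause => pure in-memory NDB table
CREATE TABLE txn_mem (
  id      BIGINT NOT NULL PRIMARY KEY,
  payload VARCHAR(255)
) ENGINE=NDBCLUSTER;

INSERT INTO txn_mem SELECT * FROM txn_disk;                  -- copy the data
RENAME TABLE txn_disk TO txn_disk_old, txn_mem TO txn_disk;  -- swap names
```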
[18 Mar 2010 10:52]
Niklas Hallqvist
Well, I sort of understand. But what worries me is this: if both nodes go away and I restart one of them, are the LCP + REDO log available to get the DB close to the state it was in when it crashed? Or is the backup replica the last snapshot of the database?
[18 Mar 2010 12:46]
Pekka Nousiainen
The GCP commits to disk (REDO log) on each node about once a second. So if both nodes crash at the same time, you can lose some of the last transactions that had already been reported to the application as committed.
[18 Mar 2010 13:05]
Niklas Hallqvist
Good! This is good enough for us, since the critical last transactions that may be lost can be recreated by application log inspection, for which we already have automated routines. We will likely revert to memory-only in a few weeks. Until then I will do some tuning in order to get the current setup stable again. I assume you still want bug reports if new (as in, not previously reported) crashes turn up?
[18 Mar 2010 13:11]
Pekka Nousiainen
Yes, we want bug reports and are thankful for them (community testing). This particular bug# remains open because there is an unhandled resource problem (the page entries we started with).
[19 Mar 2010 15:34]
Niklas Hallqvist
Further analysis has shown a pretty interesting pattern. The crashes now occur consistently during every attempt to take a backup, about 26 minutes into it. Looking at the last working backups, they finished at about 25-26 minutes with just around 4 GB of data. Actually, the Data figure of the last working backups shows a 32-bit wraparound. Maybe the wraparound causes data corruption that does not trigger assertions until later... Yesterday's backup hit yet another assertion, in getPtr of ArrayPool.hpp, so it looks like we suffer from somewhat random crashes now, but always at the same point in the backup process. From the log:

```
../ndb_1_cluster.log.4:2010-02-23 22:25:10 [MgmSrvr] INFO -- Node 3: Backup 128 started from node 1 completed. StartGCP: 16461312 StopGCP: 16462037 #Records: 11150462 #LogRecords: 3827 Data: 4176110276 bytes Log: 957984 bytes
../ndb_1_cluster.log.4:2010-02-24 22:24:27 [MgmSrvr] INFO -- Node 2: Backup 129 started from node 1 completed. StartGCP: 16503444 StopGCP: 16504149 #Records: 11272317 #LogRecords: 4240 Data: 4193757492 bytes Log: 999888 bytes
../ndb_1_cluster.log.4:2010-02-25 22:24:30 [MgmSrvr] INFO -- Node 2: Backup 130 started from node 1 completed. StartGCP: 16546247 StopGCP: 16546953 #Records: 11412007 #LogRecords: 4021 Data: 4214566668 bytes Log: 1012688 bytes
../ndb_1_cluster.log.4:2010-02-26 22:25:11 [MgmSrvr] INFO -- Node 2: Backup 131 started from node 1 completed. StartGCP: 16589015 StopGCP: 16589741 #Records: 11566933 #LogRecords: 4443 Data: 4238895484 bytes Log: 1104588 bytes
../ndb_1_cluster.log.4:2010-02-27 22:25:01 [MgmSrvr] INFO -- Node 2: Backup 132 started from node 1 completed. StartGCP: 16631771 StopGCP: 16632492 #Records: 11702368 #LogRecords: 4410 Data: 4258702508 bytes Log: 1088188 bytes
../ndb_1_cluster.log.4:2010-02-28 22:25:16 [MgmSrvr] INFO -- Node 2: Backup 133 started from node 1 completed. StartGCP: 16674510 StopGCP: 16675239 #Records: 11823760 #LogRecords: 2652 Data: 4276084372 bytes Log: 731628 bytes
../ndb_1_cluster.log.4:2010-03-01 22:25:22 [MgmSrvr] INFO -- Node 2: Backup 134 started from node 1 completed. StartGCP: 16717243 StopGCP: 16717975 #Records: 11878794 #LogRecords: 3219 Data: 4285519892 bytes Log: 859792 bytes
../ndb_1_cluster.log.4:2010-03-02 22:24:52 [MgmSrvr] INFO -- Node 2: Backup 135 started from node 1 completed. StartGCP: 16759926 StopGCP: 16760642 #Records: 11984669 #LogRecords: 3473 Data: 5888348 bytes Log: 904656 bytes
../ndb_1_cluster.log.4:2010-03-03 22:25:10 [MgmSrvr] INFO -- Node 2: Backup 136 started from node 1 completed. StartGCP: 16802592 StopGCP: 16803317 #Records: 12099903 #LogRecords: 3667 Data: 22498020 bytes Log: 942116 bytes
../ndb_1_cluster.log.4:2010-03-04 22:25:07 [MgmSrvr] INFO -- Node 2: Backup 137 started from node 1 completed. StartGCP: 16845309 StopGCP: 16846033 #Records: 12232455 #LogRecords: 4877 Data: 41521248 bytes Log: 1113348 bytes
../ndb_1_cluster.log.4:2010-03-05 22:25:11 [MgmSrvr] INFO -- Node 2: Backup 138 started from node 1 completed. StartGCP: 16888084 StopGCP: 16888810 #Records: 12362710 #LogRecords: 5209 Data: 63175608 bytes Log: 1478320 bytes
../ndb_1_cluster.log.4:2010-03-06 22:25:15 [MgmSrvr] INFO -- Node 2: Backup 139 started from node 1 completed. StartGCP: 16930850 StopGCP: 16931578 #Records: 12476982 #LogRecords: 5858 Data: 201887820 bytes Log: 3788572 bytes
../ndb_1_cluster.log.4:2010-03-07 22:26:10 [MgmSrvr] INFO -- Node 2: Backup 140 started from node 1 completed. StartGCP: 16973540 StopGCP: 16974294 #Records: 12632474 #LogRecords: 4418 Data: 514356848 bytes Log: 4947736 bytes
../ndb_1_cluster.log.4:2010-03-08 22:26:25 [MgmSrvr] INFO -- Node 2: Backup 141 started from node 1 completed. StartGCP: 17016214 StopGCP: 17016975 #Records: 12893621 #LogRecords: 3238 Data: 733051660 bytes Log: 845640 bytes
../ndb_1_cluster.log.4:2010-03-09 22:25:59 [MgmSrvr] INFO -- Node 2: Backup 142 started from node 1 completed. StartGCP: 17058877 StopGCP: 17059627 #Records: 13033168 #LogRecords: 4636 Data: 753255496 bytes Log: 1051148 bytes
../ndb_1_cluster.log:2010-03-10 22:26:58 [MgmSrvr] INFO -- Node 2: Backup 143 started from node 1 completed. StartGCP: 17101528 StopGCP: 17102304 #Records: 13195566 #LogRecords: 4688 Data: 776490128 bytes Log: 1129676 bytes
```
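To make the wraparound explicit: the Data counter in these log lines behaves like a 32-bit byte count, so once a backup passes 2^32 bytes the reported value wraps around. Using backups 134 and 135 above:

```
2^32                    = 4294967296 bytes
Backup 134 (reported)   = 4285519892 bytes   (just under the 32-bit limit)
Backup 135 (reported)   =    5888348 bytes
Backup 135 + 2^32       = 4300855644 bytes   (~4.3 GB, which fits the growth trend)
```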
[22 Mar 2010 8:02]
Pekka Nousiainen
Please attach ndb_error_reporter output.
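For anyone unfamiliar with it, ndb_error_reporter is pointed at the cluster configuration file and collects logs and trace files from all nodes over ssh; a rough invocation (the config.ini path and ssh user are assumptions for this setup):

```sh
# Bundles cluster logs and trace files from the nodes listed in config.ini
# (adding --fs would also include the data nodes' file systems, at the cost
#  of a much larger archive)
ndb_error_reporter /var/lib/mysql-cluster/config.ini root
```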
[22 Mar 2010 8:27]
Niklas Hallqvist
OK, the report is there. Currently we have stopped trying to take backups. While I am here, may I ask if it's feasible to split the cluster, having one of the nodes act as a read-only snapshot (e.g. by adding a firewall rule emulating a network failure between the nodes), use mysqldump to get backups instead of the NDB backup, and afterwards just restart the read-only node with connectivity restored, so that that node will resync with the one that has remained r/w?
[22 Mar 2010 9:03]
Pekka Nousiainen
For the record, the ArrayPool crash is somewhere in DbtupExecQuery.cpp:

```cpp
if (Roptype == ZREAD) {
  jam();
  if (setup_read(&req_struct, regOperPtr, regFragPtr, regTabPtr,
                 disk_page != RNIL)) {
    if (handleReadReq(signal, regOperPtr, regTabPtr, &req_struct) != -1) {
      req_struct.log_size= 0;
      sendTUPKEYCONF(signal, &req_struct, regOperPtr);
```
[22 Mar 2010 9:11]
Pekka Nousiainen
> may i ask if it's feasible to split the cluster, having one
> of the nodes acting as a readonly snapshot (e.g. by adding a fw rule

It's not an operation supported by us. Maybe it was never implemented because there is already a snapshot backup. I guess it might work if you disable arbitration, stop API ops between commits, and cut the cable.
[19 Aug 2010 9:49]
Jonas Oreland
This could be caused by http://bugs.mysql.com/bug.php?id=54986. Can this be retried with e.g. 6.3.36?
[19 Sep 2010 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".