Bug #33793 Race condition between "release gci" and node-failure handling
Submitted: 10 Jan 2008 11:52
Modified: 20 Feb 2008 21:48
Reporter: Lars Torstensson
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: *
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: MicroGCP
Triage: D1 (Critical) / R2 (Low) / E2 (Low)

[10 Jan 2008 11:52] Lars Torstensson
Description:
A rolling upgrade from 6.2.4 -> 6.2.9 fails.
During the upgrade, some config parameters were changed.

Config changes:
NoOfFragmentLogFiles=64 -> NoOfFragmentLogFiles=362
DiskCheckpointSpeed=2M  -> DiskCheckpointSpeed=7M
DiskCheckpointSpeedInRestart=2M -> DiskCheckpointSpeedInRestart=7M
FragmentLogFileSize=16 -> FragmentLogFileSize=32M
RedoBuffer=128M -> RedoBuffer=32M
TimeBetweenEpochs=100

ndb_mgm> all status
Node 1: started (mysql-5.1.22 ndb-6.2.9)
Node 2: started (mysql-5.1.19 ndb-6.2.4)
Node 3: started (mysql-5.1.22 ndb-6.2.9)
Node 4: started (mysql-5.1.19 ndb-6.2.4)

ndb_mgm> 2 stop
Node 2: Node shutdown initiated
Node 2: Node shutdown completed. Initiated by signal 15.
Node 1: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Node 4: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

Node 2 has shutdown. 

Error log from node 1
Time: Wednesday 9 January 2008 - 18:13:36
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: suma/Suma.cpp
Error object: SUMA (Line: 4976) 0x00000006
Program: /ahp_shared_software/mysql/current-ndbd-1/bin/ndbd
Pid: 3764
Trace: /dbdata/1/ndb_1_trace.log.4
Version: mysql-5.1.22 ndb-6.2.9-beta
***EOM***

How to repeat:
1. The mgm servers were upgraded.
2. Node 1 was restarted with -i.
3. Node 3 was restarted with -i.
4. Node 2 was stopped, and node 1 then failed.
[10 Jan 2008 13:09] Bogdan Kecman
As explained here:
<http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-config-params-ndbd.html>

> NoOfFragmentLogFiles=64 -> NoOfFragmentLogFiles=362
> FragmentLogFileSize=16 -> FragmentLogFileSize=32M

These changes require an "Initial System Restart".
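
For a sense of scale, here is a rough calculation of how much the redo log grows with the two quoted parameters. It is only a sketch: it assumes the documented layout of redo log files in sets of 4 per NoOfFragmentLogFiles unit, and it assumes the unitless "16" in the report means 16M. The function name is made up for illustration.

```python
def redo_log_bytes(no_of_fragment_log_files, fragment_log_file_size_mb):
    """Approximate total redo log size per data node, in bytes.

    Assumption: redo log files are allocated in sets of 4, so the total is
    NoOfFragmentLogFiles * 4 * FragmentLogFileSize.
    """
    files_per_set = 4
    mib = 1024 * 1024
    return no_of_fragment_log_files * files_per_set * fragment_log_file_size_mb * mib


# Old settings; "16" is assumed to mean 16M (the report gives no unit).
old = redo_log_bytes(64, 16)
# New settings from the report.
new = redo_log_bytes(362, 32)

print(old // 2**20, "MiB ->", new // 2**20, "MiB")
# prints "4096 MiB -> 46336 MiB"
```

Under these assumptions the on-disk redo log grows from about 4 GiB to about 45 GiB per data node, so the existing log files cannot simply be reused.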
[11 Jan 2008 6:36] Jonas Oreland
If release gci has updated max-acked-gci but has not yet released the last page,
  and a node failure occurs at that point,

then
  the node-failure code contains (contained) an incorrect assertion that the last
  page should be empty.

---

Consequence is cluster-failure.

---

Solution is to correct assertion.
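
The window described above can be modelled with a small sketch. This is not the SUMA code: the classes, method names, and the `ndbrequire` helper below are hypothetical stand-ins, and the "fixed" handler only illustrates the idea of not assuming an acked page has already been released.

```python
from dataclasses import dataclass, field


def ndbrequire(cond):
    # Hypothetical stand-in for a failed ndbrequire, which shuts the node down.
    if not cond:
        raise RuntimeError("failed ndbrequire")


@dataclass
class Page:
    """Hypothetical buffer page holding events up to max_gci."""
    max_gci: int


@dataclass
class Buffer:
    """Hypothetical event buffer; not the real SUMA data structures."""
    max_acked_gci: int = 0
    pages: list = field(default_factory=list)

    def ack_gci(self, gci):
        # Step 1 of "release gci": record the ack. Pages are not freed yet.
        self.max_acked_gci = max(self.max_acked_gci, gci)

    def release_acked_pages(self):
        # Step 2 of "release gci": free pages whose events are all acked.
        self.pages = [p for p in self.pages if p.max_gci > self.max_acked_gci]

    def on_node_failure_buggy(self):
        # Incorrect assumption: an acked GCI implies its page is already gone.
        for page in self.pages:
            ndbrequire(page.max_gci > self.max_acked_gci)
        return list(self.pages)

    def on_node_failure_fixed(self):
        # Tolerate pages that are acked but not yet released: skip them.
        return [p for p in self.pages if p.max_gci > self.max_acked_gci]


# A node failure arrives between step 1 and step 2.
buf = Buffer(pages=[Page(max_gci=10)])
buf.ack_gci(10)
try:
    buf.on_node_failure_buggy()
    crashed = False
except RuntimeError:
    crashed = True
print("buggy handler crashed:", crashed)
print("fixed handler returns:", buf.on_node_failure_fixed())
```

In this model the buggy handler fails only when the node failure lands between the two steps, which fits the observation that the bug is rarely hit.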

---

Note on changed subject:
1) This bug has nothing to do with changing of config parameters
2) This bug has nothing to do with upgrade

I.e. this is just a plain old bug. It is very unlikely to occur, as this is the first time
  we have seen it, but it is present in all MySQL versions with replication.
[11 Jan 2008 7:32] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/40893

ChangeSet@1.2185, 2008-01-11 08:33:09+01:00, jonas@perch.ndb.mysql.com +4 -0
  bug#33793 -
    dont assume that page is "all empty"
    only as gci is acked, as release_gci might not have processed it yet
[11 Jan 2008 8:19] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/40895

ChangeSet@1.2529, 2008-01-11 09:20:16+01:00, jonas@perch.ndb.mysql.com +4 -0
  ndb - bug#33793
    dont assume that page is "all empty"
    only as gci is acked, as release_gci might not have processed it yet
[11 Jan 2008 9:33] Jonas Oreland
pushed to drop6, 51-ndb, 51-telco-gca, telco-61, telco-62, telco-63, telco-64 and 51-telco
[1 Feb 2008 14:18] Jon Stephens
Documented in 5.1.23-ndb-6.3.8 changelog as follows:

        A race condition could occur (very rarely) when the release of a
        GCI was followed by a data node failure.

Left bug in PQ status pending additional merges.
[2 Feb 2008 12:05] Jon Stephens
Also documented for 5.1.23-ndb-6.2.11; left status unchanged.
[20 Feb 2008 16:03] Bugs System
Pushed into 5.1.24-rc
[20 Feb 2008 16:03] Bugs System
Pushed into 6.0.5-alpha
[20 Feb 2008 21:48] Jon Stephens
Also documented for 5.1.24 and 6.0.5.