Bug #21815 mysqld is not informed of cluster shutdown, making slave thread print errors
Submitted: 24 Aug 2006 18:04 Modified: 26 Dec 2006 16:33
Reporter: Jonathan Miller
Status: Verified
Category:MySQL Cluster: Replication Severity:S3 (Non-critical)
Version:mysql-5.1 OS:Linux (Linux)
Assigned to: CPU Architecture:Any
Tags: 5.1.12

[24 Aug 2006 18:04] Jonathan Miller
Description:
You can see from the following 3 messages that mysqld knows the cluster is shutting down or has failed:

Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection.
Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection.
Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection.

But the slave process (i.e. the SQL thread) still tries to insert into the database anyway and fails:

060824 14:26:01 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table dbt2.district, Error_code: 4023
060824 14:26:10 [ERROR] Slave: Error 'Got temporary error 286 'Node failure caused abort of transaction' from NDBCLUSTER' on query. Default database: ''. Query: 'COMMIT', Error_code: 1297
060824 16:17:53 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table dbt2.order_line, Error_code: 4023
060824 16:17:53 [ERROR] Slave: Error in Write_rows event: when locking tables, Error_code: -1
060824 16:17:53 [ERROR] Slave (additional info): Can't lock file (errno: 4009) Error_code: 1015
060824 16:17:53 [Warning] Slave: Got error 4009 'Cluster Failure' from NDB Error_code: 1296
060824 16:17:53 [Warning] Slave: Can't lock file (errno: 4009) Error_code: 1015
060824 16:17:53 [Warning] Slave: Unknown error Error_code: 1105

How to repeat:
Start cluster replication, then shut down the slave cluster while leaving the slave mysqld up.

Suggested fix:
The slave SQL thread should stop, but with a message such as "Management server closed connection". The SQL thread should be smart enough to know when the cluster is not up to support transactions.
[24 Aug 2006 18:41] Jonas Oreland
There is no HA property of ndb_mgmd;
they can die/restart/die etc. without affecting ndbapi/mysqld/slave.

(Try it for yourself.)

I.e. this is not a bug.
    The slave _must_ continue to try to apply until it gets a 4009 back.

The fact that ndb_mgmd is "outside" the cluster is a nice feature,
  as it reduces the number of components that affect total
  availability.
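
The rule stated above (keep applying until a 4009 comes back) can be illustrated with a minimal Python sketch. This is not mysqld's actual slave code. `apply_event` is a hypothetical callable that returns 0 on success or an NDB error code on failure, and the `max_attempts` cap is an assumption added only so that the loop terminates.

```python
# 4009 is reported as 'Cluster Failure' in the log quoted in this report.
CLUSTER_FAILURE = 4009


def apply_with_retry(apply_event, max_attempts=5):
    """Retry apply_event() until it succeeds or reports a cluster failure.

    apply_event is a hypothetical callable: it returns 0 on success,
    otherwise an NDB error code. max_attempts is an illustrative cap.
    Returns a (decision, attempts_used) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        code = apply_event()
        if code == 0:
            return ("applied", attempt)
        if code == CLUSTER_FAILURE:
            # Only 4009 is treated as "the cluster is gone".
            return ("stop_sql_thread", attempt)
        # Any other code (e.g. 4023 or 286 in the log above) is retried.
    return ("gave_up", max_attempts)
```

For example, an event that fails with 4023, then 286, then 4009 would be retried twice and then stop the SQL thread on the third attempt.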
[24 Aug 2006 19:17] Jonathan Miller
This has nothing to do with ndb_mgmd. It has to do with mysqld seeing that the cluster is gone and stopping the SQL thread correctly instead of continuing to try to do writes.

/jeb
[24 Aug 2006 19:32] Jonas Oreland
I don't understand (my comments are marked with **)

** these 3 statements do not mean that the cluster is down
Management server closed connection early. It is probably being shut down (or
has problems). We will retry the connection.
Management server closed connection early. It is probably being shut down (or
has problems). We will retry the connection.
Management server closed connection early. It is probably being shut down (or
has problems). We will retry the connection.

But the slave process (i.e. the SQL thread) still tries to insert into the
database anyway and fails:

** this does not mean that the cluster has shut down
060824 14:26:01 [ERROR] Slave: Error in Write_rows event: error during
transaction execution on table dbt2.district, Error_code: 4023
** this does not mean that the cluster has shut down
060824 14:26:10 [ERROR] Slave: Error 'Got temporary error 286 'Node failure
caused abort of transaction' from NDBCLUSTER' on query. Default database: ''.
Query: 'COMMIT', Error_code: 1297
** this does not mean that the cluster has shut down
060824 16:17:53 [ERROR] Slave: Error in Write_rows event: error during
transaction execution on table dbt2.order_line, Error_code: 4023
060824 16:17:53 [ERROR] Slave: Error in Write_rows event: when locking tables,
Error_code: -1

** these 4 (at the same time) mean that mysqld (slave) has lost its connection to the cluster
060824 16:17:53 [ERROR] Slave (additional info): Can't lock file (errno: 4009)
Error_code: 1015
060824 16:17:53 [Warning] Slave: Got error 4009 'Cluster Failure' from NDB
Error_code: 1296
060824 16:17:53 [Warning] Slave: Can't lock file (errno: 4009) Error_code: 1015
060824 16:17:53 [Warning] Slave: Unknown error Error_code: 1105
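
The distinction drawn in the annotations above (only 4009 indicates a lost cluster connection) can be sketched as a small log filter. This is a minimal sketch, assuming the error-log line shapes quoted in this report. The function name and the regular expression are illustrative and are not part of mysqld or NDB.

```python
import re

# 4009 is reported as 'Cluster Failure' in the log quoted in this report.
CLUSTER_FAILURE = "4009"

# Matches the two spellings seen in the quoted log:
# "errno: 4009" and "error 4009" (also "temporary error 286").
_NDB_CODE = re.compile(r"(?:errno: |error )(\d+)")


def lost_cluster_connection(line):
    """Return True if a log line carries NDB error 4009."""
    return CLUSTER_FAILURE in _NDB_CODE.findall(line)


sample = [
    "060824 14:26:01 [ERROR] Slave: Error in Write_rows event: error during "
    "transaction execution on table dbt2.district, Error_code: 4023",
    "060824 16:17:53 [Warning] Slave: Got error 4009 'Cluster Failure' "
    "from NDB Error_code: 1296",
]
flags = [lost_cluster_connection(line) for line in sample]
```

Note that the filter looks only at the NDB-level code. The trailing `Error_code:` value (1015, 1296, 1105) is ignored, which is why the 4023 line is not flagged.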

---
1) Does it continue to issue warnings/errors after that?
2) Or do you want the feature that ndbd should synchronize with whatever mysqlds
   are out there during graceful shutdown?
[25 Aug 2006 13:23] Jonathan Miller
Here is another example that I did today:

060825 15:17:26 [ERROR] Slave: Error in Delete_rows event: row application failed, Error_code: 4023
060825 15:17:26 [ERROR] Slave: Error in Delete_rows event: error during transaction execution on table dbt2.stock, Error_code: 4023
060825 15:17:26 [ERROR] Slave: Error in Write_rows event: when locking tables, Error_code: -1
060825 15:17:26 [ERROR] Slave (additional info): Can't lock file (errno: 4009) Error_code: 1015
060825 15:17:26 [Warning] Slave: Got error 4009 'Cluster Failure' from NDB Error_code: 1296
060825 15:17:26 [Warning] Slave: Can't lock file (errno: 4009) Error_code: 1015
060825 15:17:26 [Warning] Slave: Unknown error Error_code: 1105
060825 15:17:26 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master2.000011' position 4532994
Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection.
Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection.
Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection.

Restarting or even stopping ndb_mgmd does not produce any errors on the mysqld side.

The cluster should communicate a graceful shutdown to mysqld so that "failure" messages are not printed in the mysqld error log.
[25 Aug 2006 14:12] Jonas Oreland
Jonas's questions:
1) Does it continue to issue warnings/errors after that?
2) Or do you want the feature that ndbd should synchronize with whatever
   mysqlds are out there during graceful shutdown?

I don't know the answer to 1),
but I think you mean yes to 2).

---

2) is a sane feature-request...

Actually, management of the entire replication solution sucks imho,
  and I think one should compile a full list of new features
  that will make it easier to handle in general, and for cluster in particular,
  and then prioritize this list. But I doubt any such thing will happen before
  tomas is back.

--- a small note: changing title on this bug report

As mysqld itself is not informed of graceful cluster shutdown,
I am changing it to
"mysqld is not informed of cluster shutdown, making slave print errors"

/Jonas
[25 Aug 2006 14:43] Jonathan Miller
1) Does it continue to issue warnings/errors after that?
No, it does not.