Bug #26758 | SLAVE Replication SQL Thread Crashes DataNode(s) | ||
---|---|---|---|
Submitted: | 1 Mar 2007 17:16 | Modified: | 4 Jun 2007 8:37 |
Reporter: | Steve Edwards | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 5.1.16 | OS: | Linux (Debian Linux 64bit) |
Assigned to: | Guangbao Ni | CPU Architecture: | Any |
Tags: | cluster, Error: 2341 datamemory index, replication, sql thread |
[1 Mar 2007 17:16]
Steve Edwards
[2 Mar 2007 15:52]
Steve Edwards
Further investigation reveals this fault seems to be a function of replicating between a master cluster with 2 replicas / 4 nodes to a cluster with 1 replica / 2 nodes. Having reduced the size of the database to enable 2 replicas on the SLAVE i have been unable to reproduce this error. Have downgraded the severity to non-critical and will investigate further.
[2 Mar 2007 17:14]
Steve Edwards
Testing Replication MASTER -> SLAVE (2 replicas) we encountered the same error as before. Replication caused the non master data node in the slave cluster to crash after replicating for ~15mins. This appears to be the same issue as bug id=25598 [SLAVE CLUSTER - ERROR LOG] 2007-03-02 18:04:06 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. [NON MASTER DATE NODE ERROR LOG] Time: Friday 2 Mars 2007 - 18:04:05 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dbtup/DbtupRoutines.cpp Error object: DBTUP (Line: 477) 0x0000000e Program: ndbd Pid: 11166 Trace: /SPP/data/ndb_3_trace.log.2 Version: Version 5.1.15 (beta) ***EOM***
[5 Mar 2007 12:39]
Steve Edwards
Upgraded slave cluster to 5.1.16 - so far unable to reproduce this issue.
[6 Mar 2007 8:57]
Steve Edwards
Master -> Slave replication between clusters in 5.1.16 seems more functional than in 5.1.15 but was still extremely unstable during our tests. Slave cluster data nodes crashed repeatedly especially when replicating > 200 reqs/sec (see errors and trace files below) and during replication Datamemory useage increased on slave data nodes until exhausted indicating what looks like some sort of memory leak. SLAVE CLUSTER DATA NODE ERRORS - Time: Monday 5 Mars 2007 - 16:51:11 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: backup/Backup.cpp Error object: BACKUP (Line: 3670) 0x0000000e Program: ndbd Pid: 17054 Trace: /SPP/data/ndb_3_trace.log.9 Version: Version 5.1.16 (beta) ***EOM*** Time: Monday 5 Mars 2007 - 17:18:24 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dblqh/DblqhMain.cpp Error object: DBLQH (Line: 6909) 0x0000000a Program: ndbd Pid: 18233 Trace: /SPP/data/ndb_3_trace.log.10 Version: Version 5.1.16 (beta) ***EOM*** Time: Tuesday 6 Mars 2007 - 01:28:07 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dbtup/DbtupRoutines.cpp Error object: DBTUP (Line: 477) 0x0000000e Program: ndbd Pid: 31488 Trace: /SPP/data/ndb_3_trace.log.11 Version: Version 5.1.16 (beta) ***EOM***
[6 Mar 2007 9:05]
Steve Edwards
Trace files from crashed Slave cluster data node
Attachment: ndb_error_trace.tar.gz (application/x-gzip, text), 391.85 KiB.
[6 Mar 2007 9:16]
Steve Edwards
Slave cluster increase in Datamemory during replication.
Attachment: slave_cluster_datamemory.jpg (image/jpeg, text), 151.03 KiB.
[8 Mar 2007 14:27]
Steve Edwards
The memory leak issue (and probably the data node crashes) during replication seem to be related to a bug in the slaves update handling function replace_record() - According to the replication mysqld trace file replicated UPDATE statements appear to always be executed as a delete and insert operation. This would seemingly explain why datamemory use on the slave increases. I think the problem occurs due to table having two unique indexes. The conditional last_uniq_key() wrapped around the call to ha_update_row() never evaluates to true so the update is always executed as a delete / insert. ./sql/log_event.cc replace_record() 6664 DESCRIPTION 6665 Similar to how it is done in mysql_insert(), we first try to do 6666 a ha_write_row() and of that fails due to duplicated keys (or 6667 indices), we do an ha_update_row() or a ha_delete_row() instead. 6668 */ 6743 /* 6744 REPLACE is defined as either INSERT or DELETE + INSERT. If 6745 possible, we can replace it with an UPDATE, but that will not 6746 work on InnoDB if FOREIGN KEY checks are necessary. 6747 6748 I (Matz) am not sure of the reason for the last_uniq_key() 6749 check as, but I'm guessing that it's something along the 6750 following lines. 6751 6752 Suppose that we got the duplicate key to be a key that is not 6753 the last unique key for the table and we perform an update: 6754 then there might be another key for which the unique check will 6755 fail, so we're better off just deleting the row and inserting 6756 the correct row. 6757 */ 6758 if (last_uniq_key(table, keynum) && 6759 !table->file->referenced_by_foreign_key()) 6760 { 6761 error=table->file->ha_update_row(table->record[1], 6762 table->record[0]); 6763 DBUG_RETURN(error); 6764 } 6765 else 6766 { 6767 if ((error= table->file->ha_delete_row(table->record[1]))) 6768 DBUG_RETURN(error); 6769 /* Will retry ha_write_row() with the offending row removed. */ 6770 } 6771 } ./sql/sql_insert.cc 1042 /* Check if there is more uniq keys after field */ 1043 1044 static int last_uniq_key(TABLE *table,uint keynr) 1045 { 1046 while (++keynr < table->s->keys) 1047 if (table->key_info[keynr].flags & HA_NOSAME) 1048 return 0; 1049 return 1; 1050 }
[8 Mar 2007 14:28]
Steve Edwards
replication mysqld debug trace update row
Attachment: cluster_repli_update_row.trace (application/octet-stream, text), 34.25 KiB.
[16 Mar 2007 15:27]
Steve Edwards
Removing the call to last_unique_key() causes the replicated update to be correctly performed as a replace (NdbOperation::updateTuple()) rather than a delete / write. Cluster assures that all tables have a unique PK and replication events appear to be written to / replayed from the binlog by PK. If a cluster table has additional unique indexes last_uniq_key() on PK fails and update is performed as a delete / write. Memory allocated for the deleted row or write buffer does not appear to be free'd properly (until data nodes are restarted). vi ./sql/log_event.cc 6758 //if (last_uniq_key(table, keynum) && 6759 if (!table->file->referenced_by_foreign_key()) 6760 { 6761 error=table->file->ha_update_row(table->record[1], 6762 table->record[0]); 6763 DBUG_RETURN(error); 6764 } 6765 else 6766 { 6767 if ((error= table->file->ha_delete_row(table->record[1]))) I'm not fluent enough with the codebase to propose this as a patch - but i thought the additional info might be useful to the programmer who looks into this.
[16 Mar 2007 15:29]
Steve Edwards
mysqld slave replication trace file - call to last_uniq_key removed - update correctly performed as replace
Attachment: update_row_postfix.trace (application/octet-stream, text), 16.16 KiB.
[8 May 2007 14:41]
Jonas Oreland
Hi, Can you also upload you config.ini? Can you also possibly upload a binlog which reproduces this problem (and schemas) /Jonas
[4 Jun 2007 8:37]
Guangbao Ni
with the 5.1.18 source codes, i can't reproduced it.