MySQL Bugs: #3659: >1 concurrent update causes deadlock

Bug #3659	>1 concurrent update causes deadlock
Submitted:	5 May 2004 10:34	Modified:	19 Jul 2004 13:39
Reporter:	Johan Andersson	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-4.1.2-alpha bk	OS:	FermiLinux LTS 3.01
Assigned to:	Magnus Blåudd	CPU Architecture:	Any

Description:

Two java clients talking to ONE mysqld. Autocommit is on.
Both java clients receive the same stream of data. The data stream is unpacked
and inserted/updated in the database.

However, when two updating transactions starts at the same time on the same table, then they will deadlock (since the first update trans will start scan one fragment, and the second update will start at another, then they will both wait each others fragment).

Below are table definitions and transaction log.  "2 query ..." denotes the
query executed by one of the apps and "1 query ... " is a query executed by the other app (other connection). In mysqld.log, the last the last two updates cause an deadlock. (triggers errors 296 and 266, which we get in an exception caught in the applications. One application throws 266 and the other 296.)

------ BEGIN table defs -----------------------------------------

DROP TABLE IF EXISTS T38_INTRADAY;
create table T38_INTRADAY ( ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
CD_REFERENTIEL varchar(16) NOT NULL default '',
DA_INT time NOT NULL default '00:00:00',
MT_CRS float default NULL,
MT_VOL int(11) default NULL,
#KEY CD_REFERENTIEL (CD_REFERENTIEL),
#KEY DA_INT (DA_INT)
) Type=NDBCluster;

DROP TABLE IF EXISTS T38_INTRADAY_1MIN;
create table T38_INTRADAY_1MIN ( ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
CD_REFERENTIEL varchar(16) NOT NULL default '',
DA_INT datetime NOT NULL default '1970-01-01 00:00:00',
MT_CRS_OUV float default NULL,
MT_CRS_CLO float default NULL,
MT_CRS_HAUT float default NULL,
MT_CRS_BAS float default NULL,
MT_VOL int(11) default NULL,
#KEY CD_REFERENTIEL (CD_REFERENTIEL),
#KEY DA_INT (DA_INT)
) Type=NDBCluster;

DROP TABLE IF EXISTS FFF;
CREATE TABLE FFF (TS BIGINT(20) UNSIGNED NOT NULL,
		  ID INTEGER NOT NULL,
		  SEQNO INTEGER NOT NULL,
		  FRAME VARCHAR(255) NOT NULL DEFAULT '') Type=NDBCluster;

CREATE INDEX IDX_T38_INTRADAY_1MIN ON T38_INTRADAY_1MIN (CD_REFERENTIEL, DA_INT);

--------------------- END TABLE DEFS ---------------------------------------

Look for ### below!!!
--------------------- BEGIN Mysqld.log --------------------------------------

2 Query
insert into T38_INTRADAY (CD_REFERENTIEL, DA_INT, MT_CRS, MT_VOL)            values ('XPARFR0000127771', '12:53:59', 20.93, 1472.0)

1 Query
insert into T38_INTRADAY (CD_REFERENTIEL, DA_INT, MT_CRS, MT_VOL)            values ('XPARFR0000127771', '12:53:57', 20.93, 951.0)

1 Query
select MAX(CONCAT(DA_INT, '|', MT_CRS_CLO)) from T38_INTRADAY_1MIN            where CD_REFERENTIEL='XPARFR0000127771'            group by CD_REFERENTIEL

2 Query
select MAX(CONCAT(DA_INT, '|', MT_CRS_CLO)) from T38_INTRADAY_1MIN            where CD_REFERENTIEL='XPARFR0000127771'            group by CD_REFERENTIEL

### HERE IT DEADLOCKS ("2 query" is connection, "1 query" is the other):
2 Query       
update T38_INTRADAY_1MIN            set MT_CRS_CLO=20.93, MT_CRS_HAUT=(GREATEST(MT_CRS_HAUT,20.93)), MT_CRS_BAS=(LEAST(MT_CRS_BAS,20.93)), MT_VOL=(MT_VOL+1472.0)            where CD_REFERENTIEL='XPARFR0000127771' and DA_INT='2004-05-04 12:53:00'

1 Query      
update T38_INTRADAY_1MIN            set MT_CRS_CLO=20.93, MT_CRS_HAUT=(GREATEST(MT_CRS_HAUT,20.93)), MT_CRS_BAS=(LEAST(MT_CRS_BAS,20.93)), MT_VOL=(MT_VOL+951.0)            where CD_REFERENTIEL='XPARFR0000127771' and DA_INT='2004-05-04 12:53:00'
040504 12:54:17	     

 3 Connect      

--------------------- END mysqld.log ---------------------------------------

How to repeat:
Shouldn't be that hard....

Suggested fix:

Suggestions:

1. Implement scan filters so that not all rows are fetched to the mysqld server -> SELECT performance would improve.

2. Filtered locking?

update T38_INTRADAY_1MIN set MT_CRS_CLO=20.93,
MT_CRS_HAUT=(GREATEST(MT_CRS_HAUT,20.93)), MT_CRS_BAS=(LEAST(MT_CRS_BAS,20.93)),
MT_VOL=(MT_VOL+1472.0) where CD_REFERENTIEL='XPARFR0000127771' and
DA_INT='2004-05-04 12:53:00'

With the current table schema, this query will start a full table scan of T38_INTRADAY_1MIN, when the record is found an update of that record will be sent to NDB thus setting a write lock on the record. That record lock will then be held until the rest of the records in the table has been scanned, and the the transaction can be committed. During this time(which can be long) it seems like the update transaction has timed out. And thus error 266 is returned. The other UPDATE also starts a scan which will not be able to pass the exclusive lock on the record that the first UPDATE has aquired. And subsequently that scan has timed out with error 296.
So this is not really a deadlock problem, but a timeout problem.

Solution:
1. Add an ordered index on CD_REFERENTIELL or DA_INT. KEY(CD_REFERENTIELL, DA_INT).
This will allow a ordered index scan instead of the full table scan and thus it will take less time, thus allowing the two queries to finish without timeouts.

2. Use latest version. :)

In the long run:
1. Add test cases for this where a long running full table scan is used to update only one or a few records in a big table. It should work to do it this way, even if it's slow.

2. Maybe the NdbConnection::refresh function has to be called from handler once in a while, but it would be better if this is done internally by NdbScanOperation if necessary.

Problem occurred even when both clients were not updating the same records.

Problem did not occur with only one client, update was done in no time. As soon as
the second client is interacting with the DB (even on differents rows since the test was also done with an additional ID column whose value was different for each client) the UPDATE leads to a 266 or 296 error being thrown.

WL 1822 Flush Locks when being locked in scan to avoid unnecessary deadlocks
and WL 1823 Flexible time-out to avoid aborting both deadlocked transactions
added for part of this problem
Fix for scan time-out applied and pushed already

Fixed by WL 1822, 1823 and two bug fixes

It should not be closed (yet). It has been very much improved. Deadlocks are a lot more infrequent, but they can still happen.

I have two java applications connected to a mysql server connected to two ndb nodes. Martin has the Java program (Updater.java). The two applications are not updating the same data. The data is differentiated with an application id.

Exception3: Deadlock found when trying to get lock; Try restarting transaction,  message from server: "Lock wait timeout exceeded; try restarting transaction"
stmt: com.mysql.jdbc.PreparedStatement@a470b8:  UPDATE johan SET d='kalleankakajsaanka01' WHERE a=17692 AND aid =2 AND tid=1

Hmm, I might have been to "submitter"-happy. I willl investigate this further. It is a very rare bug, and not easily reproducable.