Bug #51643 Prepared statements with loadbalancer works on wrong connection
Submitted: 2 Mar 2010 14:29 Modified: 4 Mar 2010 17:07
Reporter: Gunn Olaussen Email Updates:
Status: Closed Impact on me:
None 
Category:Connector / J Severity:S2 (Serious)
Version:5.0.12 OS:Any
Assigned to: CPU Architecture:Any

[2 Mar 2010 14:29] Gunn Olaussen
Description:
PreparedStatements (and other objects created by ConnectionImpl) holds a reference to the ConnectionImpl that created them. When the load balancer picks a new connection, this doesn't affect the reference that the PreparedStatement holds. For e.g. PreparedStatements this can mean that inserts/updates on the prepared statement goes to a outdated (or even a connection that doesn't exist any more), while the commits go to the new connection, leading to silent data loss in the JDBC driver.

(This is the same for PreparedCalls, Statements and Metadata also, but we think it is most dangerous with prepared statements.)

How to repeat:
Table: just one PK, long.

Pseudocode:
PreparedStatement ps = conn.prepareStatement("insert into t1 values ($)");
long counter = 1;
while (true) {
   ps.execute(counter++);
   conn.commit();
   long dbCount = conn.execute("SELECT COUNT(*) FROM TABLE")
   assertEquals(counter, dbCount);
}

What will happen here is this (time axis down):
conn1        conn2              comment
ins 1
commit                          proxy changed to conn 2
             select count()     returns correct count
ins 2                           prepared stmt keeps old ConnectionImpl
             commit             not on the same connection that we
                                insert into
             select count()     will return wrong value, since ins 2 is
                                not committed

Suggested fix:
We have a patch (applied) that fixes this problem, but are not sure this is how you should fix it. This is what we did:

Introduce a InternalConnection (we called it MySQLConnection in the patch) which contains all non-private methods from ConnectionImpl. This interface extends Connection and ConnectionProperties. It also need a couple of methods from ConnectionPropertiesImpl - probably the Impl is used some places where the interface should be used instead. ConnectionImpl implements InternalConnection, and InternalConnection is used everywhere ConnectionImpl is used today (in e.g. the constructor of PreparedStatement, Metadata, etc. Huge search/replace).

This has to be a interface, because the LoadBalancerConnectionProxy is going to proxy it, therefore all places which make LoadBalancerConnectionProxy proxy as Connection should now be changed to make LoadBalancerConnectionProxy proxy as InternalConnection.

At this point we have a LoadBalancerConnectionProxy which can be used for a ConnectionImpl (now hidden behind the InternalConnection interface). We then added one public method in InternalConnection: void setProxy(InternalConnection), which just stores the supplied InternalConnection, and InternalConnection getProxy() which returns this proxy if it is set, and "(InternalConnection) this" if it is not set. LoadBalancerConnectionProxy.pickConnection() should call conn.setProxy(this) when it creates new connections, and ConnectionImpl should be changed to call getProxy() every place it uses "this" (where it creates new objects which keep a reference to the connection).

The approach above has been tested and shown to work, but one could probably do a better job at hiding information than we did in the attached patch (quick hack used to prove that it work).
[2 Mar 2010 14:30] Gunn Olaussen
Here is some more information about how we tracked down this bug:

We have the following cluster setup:
siv27: ndbd, mysqld:56200
siv28: ndbd, mysqld:56300
siv29: ndb_mgmd, mysqld:56100, ndbapi

24/02 14:30:06.393 INFO Thr{0011} ClusterJDBCClient.connect: Creating
JDBC connection to MySQL Cluster using URL:
'jdbc:mysql:loadbalance://siv29:56100,siv27:56200,siv28:56300'
24/02 14:30:06.393 FINE Thr{0011} ClusterJDBCClient.connect: JDBC
Properties used: [autoreconnect=false reconnect=false
loadBalanceStrategy=random user=XXX loadBalanceBlacklistTimeout=5000
allowMultiQueries=true drivername=com.mysql.jdbc.Driver
dumpQueriesOnException=true password=YYY
logger=com.mysql.jdbc.log.StandardLogger enablePacketDebug=true
traceProtocol=true]

The goal is to run TPCC load, but the datamodel creation fails because
the number of lines doesn't match the expected:

24/02 14:35:26.152 FINE Thr{0011} Transaction.message:
com.sun.jet.load.oltp.transactions.RunOLTP$ModelCheck:
Expected #WAREHOUSE to be 1 was 1,
expected #DISTRICTS to be 5 was 5,
expected #ORDERS to be 15000 was 14975,
expected #CUSTOMERS to be 15000 was 13500,
expected #ITEMS to be 20000 was 20000,
expected #STOCKS to be 20000 was 19901

Lets take the ORDERS table as an example:

By connecting to the mysqld we see that the missing rows are the last 25
that were inserted (id 2976-3000 for district 5). In the load log I see
that a commit has been done after o_id=2975 and then another after the
last 25 orders:

24/02 14:35:25.132 FINE Thr{0062} Transaction.message:
com.sun.jet.load.oltp.transactions.OrderPopulate
OrderPopulate1_0:(populate) INSERT INTO orders: o_id: '2975', d_id: '5',
w_id: '1', c_id: '2177'
24/02 14:35:25.145 FINER Thr{0062} JDBCTransaction.commit:
com.sun.jet.load.oltp.transactions.OrderPopulate OrderPopulate1_0 commit()
24/02 14:35:25.148 FINE Thr{0062} Transaction.message:
com.sun.jet.load.oltp.transactions.OrderPopulate
OrderPopulate1_0:(populate) INSERT INTO orders: o_id: '2976', d_id: '5',
w_id: '1', c_id: '640'

skip...

24/02 14:35:25.609 FINE Thr{0062} Transaction.message:
com.sun.jet.load.oltp.transactions.OrderPopulate
OrderPopulate1_0:(populate) INSERT INTO orders: o_id: '2999', d_id: '5',
w_id: '1', c_id: '347'
24/02 14:35:25.636 FINE Thr{0062} Transaction.message:
com.sun.jet.load.oltp.transactions.OrderPopulate
OrderPopulate1_0:(populate) INSERT INTO orders: o_id: '3000', d_id: '5',
w_id: '1', c_id: '756'
24/02 14:35:25.648 FINE Thr{0062} Transaction.message:
com.sun.jet.load.oltp.transactions.OrderPopulate
OrderPopulate1_0:loadOrders, progress = 3000
24/02 14:35:25.648 FINER Thr{0062} JDBCTransaction.commit:
com.sun.jet.load.oltp.transactions.OrderPopulate OrderPopulate1_0 commit()

The same queries can be found in the query log for the mysqld on siv27
(with timestamp 100224 14:35:25):

12 Query INSERT INTO orders(o_id, o_c_id, o_d_id, o_w_id, o_entry_d,
o_ol_cnt,o_total_amount, o_all_local, o_carrier_id)values (2975, 2177,
5, 1, CURDATE(), 23, 6697.79, 1, null)
12 Query commit
12 Query INSERT INTO orders(o_id, o_c_id, o_d_id, o_w_id, o_entry_d,
o_ol_cnt,o_total_amount, o_all_local, o_carrier_id)values (2976, 640, 5,
1, CURDATE(), 44, 5910.8, 1, null)

skip...

12 Query INSERT INTO orders(o_id, o_c_id, o_d_id, o_w_id, o_entry_d,
o_ol_cnt,o_total_amount, o_all_local, o_carrier_id)values (2999, 347, 5,
1, CURDATE(), 42, 9262.23, 1, null)
12 Query INSERT INTO orders(o_id, o_c_id, o_d_id, o_w_id, o_entry_d,
o_ol_cnt,o_total_amount, o_all_local, o_carrier_id)values (3000, 756, 5,
1, CURDATE(), 19, 3973.39, 1, null)

But here there is no commit afterwards. However, there is a commit in
the query log of siv29:

100224 14:35:25 18 Query commit

When seeing this problem we checked out the connector/J source code and
built it ourselves so we have added some extra tracing and turned on packet
debug:

The last order which has been commited is 2975:

Wed Feb 24 14:35:25 CET 2010 TRACE: send() packet payload:
aa 00 00 00 03 49 4e 53 . . . . . I N S
45 52 54 20 49 4e 54 4f E R T . I N T O
20 6f 72 64 65 72 73 28 . o r d e r s (
6f 5f 69 64 2c 20 6f 5f o _ i d , . o _
63 5f 69 64 2c 20 6f 5f c _ i d , . o _
64 5f 69 64 2c 20 6f 5f d _ i d , . o _
77 5f 69 64 2c 20 6f 5f w _ i d , . o _
65 6e 74 72 79 5f 64 2c e n t r y _ d ,
20 6f 5f 6f 6c 5f 63 6e . o _ o l _ c n
74 2c 6f 5f 74 6f 74 61 t , o _ t o t a
6c 5f 61 6d 6f 75 6e 74 l _ a m o u n t
2c 20 6f 5f 61 6c 6c 5f , . o _ a l l _
6c 6f 63 61 6c 2c 20 6f l o c a l , . o
5f 63 61 72 72 69 65 72 _ c a r r i e r
5f 69 64 29 76 61 6c 75 _ i d ) v a l u
65 73 20 28 32 39 37 35 e s . ( 2 9 7 5
2c 20 32 31 37 37 2c 20 , . 2 1 7 7 , .
35 2c 20 31 2c 20 43 55 5 , . 1 , . C U
52 44 41 54 45 28 29 2c R D A T E ( ) ,
20 32 33 2c 20 36 36 39 . 2 3 , . 6 6 9
37 2e 37 39 2c 20 31 2c 7 . 7 9 , . 1 ,
20 6e 75 6c 6c 29 . n u l l )

Then the commit with our added debug shows that the commit is done on the
correct host:

Wed Feb 24 14:35:25 CET 2010 INFO: Commit with host='siv27:56200' and
origHost='siv27:56200'
Wed Feb 24 14:35:25 CET 2010 TRACE: send() packet payload:
07 00 00 00 03 63 6f 6d . . . . . c o m
6d 69 74 m i t

Then it continues to insert the remaining rows (this is the last):

Wed Feb 24 14:35:25 CET 2010 TRACE: send() packet payload:
a9 00 00 00 03 49 4e 53 . . . . . I N S
45 52 54 20 49 4e 54 4f E R T . I N T O
20 6f 72 64 65 72 73 28 . o r d e r s (
6f 5f 69 64 2c 20 6f 5f o _ i d , . o _
63 5f 69 64 2c 20 6f 5f c _ i d , . o _
64 5f 69 64 2c 20 6f 5f d _ i d , . o _
77 5f 69 64 2c 20 6f 5f w _ i d , . o _
65 6e 74 72 79 5f 64 2c e n t r y _ d ,
20 6f 5f 6f 6c 5f 63 6e . o _ o l _ c n
74 2c 6f 5f 74 6f 74 61 t , o _ t o t a
6c 5f 61 6d 6f 75 6e 74 l _ a m o u n t
2c 20 6f 5f 61 6c 6c 5f , . o _ a l l _
6c 6f 63 61 6c 2c 20 6f l o c a l , . o
5f 63 61 72 72 69 65 72 _ c a r r i e r
5f 69 64 29 76 61 6c 75 _ i d ) v a l u
65 73 20 28 33 30 30 30 e s . ( 3 0 0 0
2c 20 37 35 36 2c 20 35 , . 7 5 6 , . 5
2c 20 31 2c 20 43 55 52 , . 1 , . C U R
44 41 54 45 28 29 2c 20 D A T E ( ) , .
31 39 2c 20 33 39 37 33 1 9 , . 3 9 7 3
2e 33 39 2c 20 31 2c 20 . 3 9 , . 1 , .
6e 75 6c 6c 29 n u l l )

And here is the commit (to the wrong host):

Wed Feb 24 14:35:25 CET 2010 INFO: Commit with host='siv29:56100' and
origHost='siv29:56100'
Wed Feb 24 14:35:25 CET 2010 TRACE: send() packet payload:
07 00 00 00 03 63 6f 6d . . . . . c o m
6d 69 74 m i t

We added some output to pickConnection() in RandomBalanceStrategy on a
later test run and saw that it picks a new host to connect to after each
commit (which is by design). So it selects another host right before I
insert the last batch of rows. This is the host which ends up getting
the commit, but the rows were inserted on the previous connection since
that is the one the PreparedStatement is holding.
[3 Mar 2010 19:35] Mark Matthews
Fixed for 5.1.13.
[4 Mar 2010 17:07] Tony Bedford
An entry has been added to the 5.1.13 changelog:

Objects created by ConnectionImpl, such as prepared statements, hold a reference to the ConnectionImpl that created them. However, when the load balancer picked a new connection, it did not update the reference contained in, for example, the PreparedStatement. This resulted in inserts and updates being directed to invalid connections, while commits were directed to the new connection. This resulted in silent data loss.