Bug #41532 Slower performance since 5.1.7 with rewriteBatchedStatements option
Submitted: 17 Dec 2008 9:06 Modified: 2 Apr 2009 16:13
Reporter: Philippe Martin Email Updates:
Status: Closed Impact on me:
Category:Connector / J Severity:S5 (Performance)
Version:5.1.7 OS:Any
Assigned to: Mark Matthews CPU Architecture:Any

[17 Dec 2008 9:06] Philippe Martin
The driver is slower since the 5.1.7 version with the rewriteBatchedStatements option set to true.

If the query has the "ON DUPLICATE UPDATE" the performance is better between the same query without the instruction "ON DUPLICATE UPDATE".

Bench test results:
The test execute one query with different batch size and with or without the instruction "ON DUPLICATE UPDATE".
The test with rewriteBatchedStatements set at true:
ms 5.1.7, ms 5.1.6, batch size, on duplicate update

The test with rewriteBatchedStatements set at false:
ms 5.1.7, ms 5.1.6, batch size, on duplicate update

The test on the 5.1.7 driver with rewriteBatchedStatements set to true and with the instruction "ON DUPLICATE UPDATE" it's very slower (5507ms) than the same test on the 5.1.6 driver (57ms).

How to repeat:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.sql.Types;

public class BugRewriteBatchedStatements517 {

    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/test?user=root";
        if (args != null && args.length > 0) {
            url = args[0];
        try {
        } catch (ClassNotFoundException e) {
        String testWithoutOnDuplicateKey = "INSERT INTO TEST_DUPLICATE(ID, S1, S2, S3, D1, D2, D3, N1, N2, N3) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"; 
        String testWithOnDuplicateKey = testWithoutOnDuplicateKey + " ON DUPLICATE KEY UPDATE S1 = VALUES(S1), S2 = VALUES(S2), S3 = VALUES(S3), D1 = VALUES(D1), D2 = VALUES(D2), D3 = VALUES(D3), N1 = N1 + VALUES(N1), N2 = N2 + VALUES(N2), N2 = N2 + VALUES(N2)"; 
        System.out.println("ms,batch size,rewriteBatchedStatements,on duplicate");  
        // test: 10
        test(10, testWithOnDuplicateKey, url, false, false);
        test(10, testWithoutOnDuplicateKey, url, false, true);
        test(10, testWithOnDuplicateKey, url, true, false);
        test(10, testWithoutOnDuplicateKey, url, true, true);
        // test: 100
        test(100, testWithOnDuplicateKey, url, false, false);
        test(100, testWithoutOnDuplicateKey, url, false, true);
        test(100, testWithOnDuplicateKey, url, true, false);
        test(100, testWithoutOnDuplicateKey, url, true, true);
        // test: 1000
        test(1000, testWithOnDuplicateKey, url, false, false);
        test(1000, testWithoutOnDuplicateKey, url, false, true);
        test(1000, testWithOnDuplicateKey, url, true, false);
        test(1000, testWithoutOnDuplicateKey, url, true, true);
    private static void test(int test, String query, String url, boolean rewriteBatchedStatements, boolean onDuplicateOption) {
        Connection conn = null;
        Date d1 = new Date(System.currentTimeMillis());
        Date d2 = new Date(System.currentTimeMillis() + 1000000);
        Date d3 = new Date(System.currentTimeMillis() + 1250000);
        try {
            conn = DriverManager.getConnection(url + ( rewriteBatchedStatements ? "&rewriteBatchedStatements=true" : "" ));
            conn.createStatement().execute("DROP TABLE IF EXISTS TEST_DUPLICATE");
            PreparedStatement ptmt = conn.prepareStatement(query);
            for (int i = 0; i < test; i++) {                
                ptmt.setObject(1, Integer.valueOf(i), Types.INTEGER);
                ptmt.setObject(2, String.valueOf(i), Types.VARCHAR);
                ptmt.setObject(3, String.valueOf(i * 0.1), Types.VARCHAR);
                ptmt.setObject(4, String.valueOf(i / 3), Types.VARCHAR);
                ptmt.setObject(5, new Timestamp(d1.getTime()), Types.TIMESTAMP);
                ptmt.setObject(6, new Timestamp(d2.getTime()), Types.TIMESTAMP);
                ptmt.setObject(7, new Timestamp(d3.getTime()), Types.TIMESTAMP);
                ptmt.setObject(8, BigDecimal.valueOf(i + 0.1), Types.DECIMAL);
                ptmt.setObject(9, BigDecimal.valueOf(i * 0.1), Types.DECIMAL);
                ptmt.setObject(10, BigDecimal.valueOf(i / 3), Types.DECIMAL);
            long startTime = System.nanoTime();
            long stopTime = System.nanoTime();
            ResultSet rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM TEST_DUPLICATE");
            if (rs != null && rs.next() && rs.getInt(1) == test) {
                System.out.println(((stopTime - startTime) / 1000000) + "," + test + "," + rewriteBatchedStatements + "," +  onDuplicateOption );
            } else {
                System.out.println("KO," + test + "," + rewriteBatchedStatements + "," +  onDuplicateOption );
            conn.createStatement().execute("DROP TABLE IF EXISTS TEST_DUPLICATE");
        } catch (SQLException ex) {
            System.out.println("KO : " + ex.getMessage());
        } finally {
            if (conn != null) {
                try {
                } catch (SQLException ignored) {
[17 Dec 2008 15:42] Tonci Grgin
Hi Philippe and thanks for your report.

We are aware of this slowdown and we have no means of fixing it for now. We technically can't rewrite "ON DUPLICATE KEY UDPATE" to be faster/different, it's a "bug" that it worked before. The problem is we can't parse out the "UPDATE ..." portion at the end.
This is related to another decision we made and that is *not* to put parser in connector. I think reasons for this decision are sound and reasonable. It's easy enough to just duplicate the stuff after "VALUES ..." and to see if "ON DUPLICATE KEY UPDATE" exists and that it isn't in a string, or an identifier, but to figure out what is in the "SET ...." requires a parser.
Following on that, we don't rewrite batch statements into "INSERT INTO foo VALUES (), (), ()" anymore so, you're paying for the extra round trips because of statements being executed sequentially. Of course, this is true if statements have "ON DUPLICATE KEY UPDATE" in them.
[18 Dec 2008 8:49] Philippe Martin
Hi Tonci
Thanks for your quick response.

There are an error in test class, the log result inverse boolean to inform if the query has or has not the instruction "ON DUPLICATE KEY ". My analyze is also inversed, and the conclusion is:
The test on the 5.1.7 driver with rewriteBatchedStatements set to true and WITHOUT the
instruction "ON DUPLICATE UPDATE" it's very slower (5507ms) than the same test on the
5.1.6 driver (57ms).

I analyze the process and there are a bug in the containsOnDuplicateKeyInString of the com.mysql.jdbc.PreparedStatement class.
The function:
protected boolean containsOnDuplicateKeyInString(String sql) {
     return StringUtils.indexOfIgnoreCaseRespectMarker(
                " ON DUPLICATE KEY UPDATE ",
                !this.connection.isNoBackslashEscapesSet()) != -1;

The first character the string " ON DUPLICATE KEY UPDATE " is a blank character and the first character is a special character for the function StringUtils.indexOfIgnoreCaseRespectMarker. 
If there are many blank characters in the sql parameter, a function with an expensive CPU cost is called for each blank character EXCEPT if the string " ON DUPLICATE KEY UPDATE " is in the sql parameter.
If the string " ON DUPLICATE KEY UPDATE " is replaced by the string "ON DUPLICATE KEY UPDATE " the test result is now for the 5.1.7 driver version:

[18 Dec 2008 12:15] Philippe Martin
Hi Tonci

I am understand your response but a simple solution is possible with no more latency to bath query with "ON DUPLICATE KEY".
The solution is to check if there an argument after the index of the " ON DUPLICATE KEY ".
The index is already searched (if a trivial bug in StringUtils.indexOfIgnoreCaseRespectMarker is fixed) and the parsing of the arguments are done after the research of the index.
Add a simple "if" in ParseInfo class fixed this issue. 
I add the files with the fix.

The last test executed with this fix:
The query latency with "ON DUPLICATE KEY" is 95ms for 1000 baths (before: 246ms)
[18 Dec 2008 12:16] Philippe Martin
bath query with "ON DUPLICATE KEY"

Attachment: PreparedStatement.java (application/octet-stream, text), 191.19 KiB.

[18 Dec 2008 12:17] Philippe Martin
Fix minor bug for bath query with "ON DUPLICATE KEY"

Attachment: StringUtils.java (application/octet-stream, text), 55.64 KiB.

[18 Dec 2008 14:19] Tonci Grgin
Philippe, I'll make sure someone looks this over. Thanks.
[18 Dec 2008 16:01] Tony Bedford
The following 'change' was added to the 5.1.7 changelog:

When statements include ON DUPLICATE UPDATE, and rewriteBatchedStatements is set to true, batched statements are not rewritten into the form INSERT INTO table VALUES (), (), (), instead the statements are executed sequentially.
[18 Dec 2008 16:04] Tonci Grgin
Thanks Tony but more correct formulation would be "ON DUPLICATE KEY UPDATE SET".

Philippe, we are still going to look into this again. Tony was just fixing documentation to reflect current status of connector.
[18 Dec 2008 16:45] Philippe Martin
Thanks to analyze the fix.

I thinks this bug is critical (for my usage of the driver/database it is true ;) ) because it's not possible to use the rewriteBatchedStatements option with more than 50 batches with the 5.1.7 driver.

It is also a good news if the instruction "ON DUPLCATE KEY" is batched.

The tests duration/latency/usage of CPU are better with my fixes on my test platform.
[18 Dec 2008 17:05] Tonci Grgin
Philippe, that's the curse of being in charge of core development... Many can support MySQL in a very good way with their "fixes" but we need to make sure every patch is what it claims to be, not just a fix for someones immediate needs.
Thus, we are checking on your test case and conclusions drawn and we'll see if they "hold water" in greater scheme of things.
[18 Dec 2008 17:24] Philippe Martin
Yes of course it is not possible to make a version by bug or for one user and it is not my answer/requirement.

I have a small knowledge of the driver and i don't know all need of the driver, i just know my need.

I do not want a new version, i want, if it is possible, a fix for the next version and if it is necessary i can help to do.
[18 Dec 2008 21:38] Jess Balint
Philippe, Can you include a diff of your changes instead of the whole files?
[18 Dec 2008 23:00] Philippe Martin
Last fix diff + ServerPreparedStatement.java to compile the package

Attachment: FixDiff.txt (text/plain), 5.21 KiB.

[19 Dec 2008 10:32] Philippe Martin
New fix: the bug is in fact in StringUtils.java

Attachment: FixDiff.txt (text/plain), 5.17 KiB.

[19 Dec 2008 10:46] Philippe Martin

The last diff file have 2 fix:
1) A modification of indexOfIgnoreCaseRespectMarker in StringUtils: it is the fix of this bug.

2) The others diff, it is to add the batch processing for insert query with an instruction " ON DUPLICATE KEY " if this query have no arguments for specify the instruction. Example:
- not batched query: INSERT INTO T (X, Y) VALUES (?, ?) ON DUPLICATE KEY UPDATE Y = VALUES( ? );
[19 Dec 2008 10:47] Philippe Martin
- not batched query: INSERT INTO T (X, Y) VALUES (?, ?) ON DUPLICATE KEY UPDATE Y = ? ;
[20 Dec 2008 18:22] Philippe Martin
Zip of fix with tests : java files + class files +  diff files

Attachment: bug-data-41532.zip (application/x-zip, text), 159.58 KiB.

[20 Dec 2008 18:26] Philippe Martin

I done some performance tests with and without my fix. 
I want to explain the very bad status of the actual version and what it is possible to have for the next version.
Can you re-opened this bug because it is easy to fix it and it is not possible to use the rewriteBatchedStatements option (-99% insert/second !!!!!), the performances are better without the option but the performances are lower (minimum -50%) than the 5.1.6 version with the option.

Tests result with the analyzes:

The differences of performance are between the 5.1.6 version with rewriteBatchedStatements
It's the avg of 100 exececution.
If the diff is in +-5% is set to "=".

Test without 'ON DUPLICATE KEY'
Batch Size    5.1.6    5.1.7    fix   first optim   second optim  5.1.7wo 
1              700       =      =         =             +50%       +50%
10            5500     -44%     =         =              =         -53%
100           12300    -93%     =         =             +22%       -69%
1000          15700   -99.4%    -8%      -8%            +32%       -75%
-> all CPU is used by this bug
-> second optim use the same code of the 5.1.7wo (if nb batch < 4)
-> second optim is better with a batch >= 100
-> 5.1.7wo it no THE solution

Batch Size 5.1.6    5.1.7    fix   first optim   second optim    5.1.7wo
1           900       =      =         =             +33%          +33%
10         2500       =      +8%     +170%          +290%           =
100        3800       =      =       +230%          +475%          -15%
1000       3800     - 18%    =       +295%          +530%          -15%
-> this bug is no visible because the query is no batched ('ON DUPLICATE KEY')
-> the first/second option are batched the query : batch efficacity
-> the second option has added a reduction of the client CPU consumption

Test without 'ON DUPLICATE KEY'  10 THREADS on mono CPU plateform with server + client
Batch Size    5.1.6     5.1.7   second optim     "5.1.6 mono thread"
1              300       -8%         =                 +133%
10            1500      -20%        +25%               +266%
100           4800      -95%        +25%               +156%
1000          3700     -99.7%      +105%               +324%
-> there are one CPU -> contention with multiple threading process

Test with 'ON DUPLICATE KEY'     10 THREADS on mono CPU plateform with server + client
Batch Size    5.1.6     5.1.7     second optim   "5.1.6 mono thread"
1              500       =             =                 +80%
10             900       =           +144%              +170%
100            1000      =           +490%              +280%
1000           750       =           +810%              +406%
-> second optim : batch efficacity

5.1.6 = the perf of the 5.1.6 version: insertion / second
5.1.7 = the perf of the 5.1.7 version: insertion / second
fix = 5.1.7  + fix of this bug 
first optim =  fix
           + the "ON DUPLICATE KEY" is batched if there are no param after " ON DUPLICATE KEY".
second optim = First optim 
           + new parser for batch query.
           + cost of option setting rt-wise is also set for canRewriteAsMultivalueInsertStatement
5.1.7wo = 5.1.7 version without the rewriteBatchedStatements option.

I added zip file (bug-data-41532.zip) with all java files, class files and diff files to test the changes.
Can you return me your feedback ?


[23 Dec 2008 11:05] Philippe Martin
StringUtils diff updated, last file have not the correct diff file (java file ok)

Attachment: bug-data-41532.zip (application/x-zip, text), 159.57 KiB.

[12 Jan 2009 15:45] Philippe Martin
Hi all,

Have you analyzed my fix ?
Can you return me your feedback ?

[12 Jan 2009 15:58] Tonci Grgin
Philippe, I've pinged relevant people. We'll see what they have to say.
[28 Jan 2009 13:39] Philippe Martin

Have you a feedback ?
If you have no time to analyze all change, please can you just valid the change in the StringUtils (the function indexOfIgnoreCaseRespectMarker ) to have good performances and the same performances than the 5.1.6 version if the rewriteBatchedStatements option is set to true.


Ps: the diff of the StringUtils class. The fix is just to replace the wrong code by the code in the function indexOfIgnoreCaseRespectQuotes.
@@ -1020,8 +1021,8 @@
 				contextMarker = c;
 			} else if (c == target.charAt(0) && !escaped
 					&& contextMarker == Character.MIN_VALUE) {
-				if (indexOfIgnoreCase(i, src, target) != -1)
+       if (startsWithIgnoreCase(src, i, target))
[18 Feb 2009 13:50] Tonci Grgin
Bug#42970 was marked as duplicate of this one. Reopening discussion.
[2 Mar 2009 15:29] Philippe Martin
Did you change or make a decision on the resolution of this bug?
[23 Mar 2009 13:46] Philippe Martin

Can you return me your decision ?

[23 Mar 2009 19:14] Mark Matthews

We fixed the performance of indexOfIgnoreCaseRespectQuotes() in December. Can you try a nightly build and see if that changes the performance profile for your case?

They're available from http://downloads.mysql.com/snapshots.php

We'll have to do some more thorough analysis on the other changes you propose (to determine arguments after " ON DUPLICATE KEY UPDATE " to make sure they don't have side effects.
[24 Mar 2009 9:58] Philippe Martin

Thanks for your response.

Unfortunately, the correction of the function indexOfIgnoreCase does not fix the bug in the function indexOfIgnoreCaseRespectMarker.

Indeed, the indexOfIgnoreCase function is used by the indexOfIgnoreCaseRespectMarker function for EACH occurrence of the first target character in the source, the indexOfIgnoreCase searches the target in ALL the source. There is two cases :

-	If the target is present, the indexOfIgnoreCaseRespectMarker function will return the first index of the first target character because the indexOfIgnoreCase function returned a index of the target in the source. Example: the source is “INSERT INTO A (B) VALUES (1) ON DUPLICATE KEY UPDATE B = 2;”, the target is “ ON DUPLICATE KEY UPDATE ”, the indexOfIgnoreCaseRespectMarker function return 28, the indexOfIgnoreCase function return 6. It is a wrong result !

-	If the target is not present, for EACH first target character in the source, the source is checked by indexOfIgnoreCase function. Example, the source is “INSERT INTO A (B) VALUES ( ?, ?, ?, ? ), ( ?, ?, ?, ? );”, the target is “ ON DUPLICATE KEY UPDATE ”. The indexOfIgnoreCase function is used 16 times for search target in source with a different first position but not different end position. 16 times is for two batches,  for 24 batches is 148 times. But the source length is 408 for 24 batches and 55 for 2 bathes. The checked  character is 402 + 398 + …. = 9869 for 24 batches and 35 for 2 batches. For 120 batches is 256 685, for 240 batches is 1 031 765. 
If the startsWithIgnoreCase is used (to replace the indexOfIgnoreCase function, it is my fix) in the indexOfIgnoreCaseRespectMarker function, the checked  character is lower than: target length * number of first target character in source. Example, for 24 batches is: 148 * 2 = 296 because the second character is always different in source, the current value is 9869. For 240 bathes is 2888, the current value is 1 031 765.

In conclusion, the bug is due to the usage of the indexOfIgnoreCase function in the indexOfIgnoreCaseRespectMarker function.

Best regards,
[25 Mar 2009 20:05] Mark Matthews
We've fixed the performance regression in indexOfIgnoreCaseRespectMarker(), as well as made it possible for the driver to even rewrite INSERT statements with ON DUPLICATE KEY UPDATE clauses in them as long as the UPDATE clause contains no reference to LAST_INSERT_ID(), as that would cause the driver to return bogus values for getGeneratedKeys() invocations.
[26 Mar 2009 3:56] Mark Matthews
See patch/diff at http://bazaar.launchpad.net/~mark-mysql/connectorj/5.1/revision/766
[26 Mar 2009 9:55] Philippe Martin

Thanks, the bug is now fixed.

You have not selected my second optimization.
The second optimization has a very important reduction of the parsing cost, it is done by a reuse of the original query parsing.
In the current version, the cost of the parsing is important because:
- A new query is build but if the parameter cachePrepStmts is set to true, the new query is generally more length than the maximal query length to be cached and it is necessary to parse the new query.
- The parsing cost of the new query is equal to the original query cost multiply by the number of batches.
In the current version, the rewriteBatchedStatements is better because the MySQL server is better with the new query but is not better because the driver is better.

The second optimization reduces the driver parsing cost because:
- A new query is build but there are just a flag added on the first position. The new query has a small length and is possible to set the new query in the prepare statement cache. The flag is #B#<batch size>, example: for 12 bathes, the original query is “INSERT INTO T (A) VALUES (?)”, the new query is “#B#12INSERT INTO T (A) VALUES (?)”.
- The original parsing is set into a ThreadLocal object to easily get it, but it is also possible to use another system (example by new prepareStatement function in com.mysql.jdbc.Connection class and com.mysql.jdbc.ConnectionImpl class).  
- The original parsing is used to build the new ParseInfo object. It is not necessary to parse the new query. (see the algorithm in the constructor of ParseInfo class in PreparedStatement class)
- It is used by the queries: UPDATE, INSERT and INSERT with ON DUPLICATE KEY.

[27 Mar 2009 12:46] Mark Matthews
Which diff contained your second optimization? I didn't see it the first time around. We'd have to way the changes caused by that, to just simply increasing the limit to the size of the statements cached, which is already parameterized.
[27 Mar 2009 13:45] Philippe Martin
The optimizations are in the DiffPreparedStatement.txt file, in the last bug-data-41532.zip file attached.

Yes, it is possible to increase the size of the cache but it is not always the best option because in this case :
- All large queries are cached.
- It’s not possible to always fix the best size if the number of batch is not known.
- It is a cache with a HashMap map and a LRU mechanism, it is not possible to cache more than 1000/2000 … without bad effect because the queries are large length and the cost of the hash computed and the search in the large list of data with a large key length is greater or equal than the parsing cost. There is also the cost of the LRU …
[2 Apr 2009 16:13] Tony Bedford
An entry was added to the 5.1.8 changelog:

MySQL Connector/J 5.1.7 was slower than previous versions when the rewriteBatchedStatements option was set to true.

If the query used ON DUPLICATE UPDATE the performance was greater than without ON DUPLICATE UPDATE.
[3 Apr 2009 10:33] Tony Bedford
Changelog message was corrected. "ON DUPLICATE UPDATE" was changed to "ON DUPLICATE KEY UPDATE".
[3 Apr 2009 15:21] Tony Bedford
The changelog entry was rewritten to include more relevant information:

MySQL Connector/J 5.1.7 was slower than previous versions when the rewriteBatchedStatements option was set to true.


The performance regression in indexOfIgnoreCaseRespectMarker()has been fixed. It has also been made possible for the driver to rewrite INSERT statements with ON DUPLICATE KEY UPDATE clauses in them, as long as the UPDATE clause contains no reference to LAST_INSERT_ID(), as that would cause the driver to return bogus values for getGeneratedKeys() invocations. This has resulted in improved performance over version 5.1.7.
[3 Jun 2009 8:16] Noam Barcay
I don't think this bug is specific to cases where rewriteBatchedStatements=true and should not be closed.
Looking at the source code of 5.1.7 it seems every prepared statement creates an instance of ParseInfo which calls containsOnDuplicateKeyInString(sql) which calls StringUtils.indexOfIgnoreCaseRespectMarker - which is very slow.
This is verified by profiling as well - prepared statements are much slower to create in 5.1.7.

If one is using Spring for example, which creates statements for each update() call, this becomes a huge performance hit.
[3 Jun 2009 21:04] Mark Matthews

There's caveats, warnings strewn throughout the JDBC specifcation, tutorials, etc. that state that Connection.prepare() *is expensive*. That's the entire reason that connection pools, and connector/J itself offer caching of prepared statements. If you cache prepared statements using Connector/J's implementation (cachePrepStmts=true), the parse info is only constructed once per connection that a statement is prepared on.

We can consider not doing this parsing if batch-rewriting isn't enabled, however, the common use case is that prepared statements in Java are long-lived, and we have to take advantage of that to provide performance gains elsewhere. 

Also, in a future version we're considering normalizing the return value of executeUpdate() when ON DUPLICATE KEY exists in an INSERT statement (it can return 0 or 2 for INSERTs, which confuses layers on top, such as JPA or Hibernate, today you have to tell these frameworks not to trust the value, which isn't safe).

[3 Jun 2009 21:09] Mark Matthews
You might actually be interested in profiling 5.1.8 snapshots after this was fixed, I don't think you'll see an impact larger than the "normal" parsing that's always been done (ignoring the buggy code for detecting ON DUPLICATE KEY UPDATE in 5.1.7). It doesn't show up as a hotspot in our profiling runs.

See http://downloads.mysql.com/snapshots.php#connector-j