Bug #34176 | Connector/J hang retrieving large data set w/5.0.45 server | |
---|---|---|---
Submitted: | 30 Jan 2008 20:53 | Modified: | 3 Mar 2008 19:21 |
Reporter: | Jeremy Gilbert | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | Connector / J | Severity: | S2 (Serious) |
Version: | 5.0.8 | OS: | Linux (Red Hat Enterprise WS 3 (Taroon Update 9)) |
Assigned to: | CPU Architecture: | Any | |
Tags: | Connector/J, hang, jdk1.6.0_02, ReadAheadInputStream.fill(4), Red Hat, Server 5.0.45, socketRead0() |
[30 Jan 2008 20:53]
Jeremy Gilbert
[30 Jan 2008 21:07]
Jeremy Gilbert
Client and server traces, client stack trace
Attachment: traces.tar.gz (application/x-gzip, text), 464.61 KiB.
[30 Jan 2008 21:10]
Jeremy Gilbert
I had to strip some lines out of the mysqld.trace to meet the 500 KB limit on files included with the bug report. In doing so, some of the line counts from my original description became incorrect. Here is a new analysis of the server trace with the correct line numbers. In "mysqld.trace", starting at line 88632, the server begins processing the query. From there until line 95442 it appears to be analyzing the query and preparing to fetch data. From there until line 95539 it appears to be fetching rows. It then seems to do some sorting until line 96009. Next it looks like it starts buffering up the response and sending it in blocks of 16384 bytes (see lines 102265, 108525, 114702, 120996, and 127097). At this point it seems to have reached the maximum buffered capacity of the socket, because the write returns EAGAIN, indicating that it would block. It looks like the server then enables blocking mode and tries to write again. Then, at line 128255, this blocked write returns EINTR, although I don't know what the mechanism is here.
[30 Jan 2008 23:10]
Jeremy Gilbert
I have done more research, and I think this may somehow be a Linux kernel bug. I switched back to MySQL server 4.1.9 standard and tried the same query, expecting it to work, but it hung. So I rebooted with an older version of the kernel (2.4.21-52.ELsmp), tried again with MySQL server 4.1.9, and it worked without error. Now it gets stranger and more disturbing. I then rebooted again with the latest kernel (2.4.21-53.ELsmp) and tried it with MySQL server 4.1.9, expecting it to hang, but it worked without any error. Then I uninstalled MySQL server 4.1.9, installed MySQL server 5.0.45, tried the query, and it works. So I am willing to bet that if I had just rebooted the Linux box earlier, the problem would have gone away. That does not make me very comfortable, but it probably means that it isn't a MySQL bug after all. Do you know whether other strange MySQL problems have been reported against the latest Red Hat 3 kernel?
[31 Jan 2008 8:26]
Tonci Grgin
Hi Jeremy and thanks for your report. I consulted with my colleagues and no one is aware of any such problem... Maybe netstat will provide enough info so you can file a bug report with RH?
[1 Mar 2008 0:02]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[3 Mar 2008 17:22]
Jeremy Gilbert
Upgrading to JDK 6 update 4 solved the problem. Beware of a possible incompatibility between JDK 6 update 2 and glibc-2.3.2-95.50 when using Connector/J. Using 'netstat' and 'pstack' I was able to see that the MySQL server was properly sending the reply (the Send-Q of the socket reached its maximum size), so I focused on the Java client code using Connector/J and saw that it was blocked in the pthreads recv() call, which seemed odd to me because clearly there was data to receive. I remembered that I had recently run up2date on this server, and one of the packages that was upgraded was glibc, which contains the pthreads libraries. I considered the possibility that there might be some incompatibility between the JDK I was using and the new version of pthreads, so I upgraded to JDK 6 update 4 and the problem went away.
[3 Mar 2008 19:21]
Tonci Grgin
Jeremy, if I read you correctly, this is !Bg. Thanks for following up on this and providing valuable info to all. Thanks for your interest in MySQL.
[6 Jul 2008 4:17]
Rico H.
I think this really is a bug in the MySQL server, and I believe I have found the cause: it is related to the query cache. There must be some kind of bug, because sometimes the server stops sending rows to the JDBC driver, so after a long time a java.io.EOFException is thrown. Whenever this blocking issue occurs, re-executing the same program ends up with the same result. However, you will see that issuing a "reset query cache" (or restarting mysql) makes your program work again... but only for some time, until it happens again. My workarounds are (depending on your needs):

a) Disable the query cache by setting "query_cache_size=0" in your my.cnf.

b) Change your streaming queries so that they include the SQL_NO_CACHE flag. This is what I have done, so now my queries look like "select SQL_NO_CACHE * from t1". No more hangs so far.

c) Change query_cache_type to a different option (but this will involve adding flags to some queries).

Honestly, I think there is some kind of bug around the query cache not being able to handle so many results (due to the streaming result set). Some bugs point to OS networking issues, but I have updated my Red Hat 5.2 box with the latest packages and kernel, and also updated to Java 1.6.0_06, MySQL 5.0.51a and JDBC 5.1.6, and the same thing happened. I hope this helps you. I have lost a lot of time debugging this, as it is very hard to reproduce.
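Workaround (b) can be automated on the client side. A tiny helper along these lines (the class and method names are mine, not part of Connector/J) inserts MySQL's SQL_NO_CACHE hint after the leading SELECT keyword before the query is sent:

```java
public class NoCacheRewriter {
    // Inserts MySQL's SQL_NO_CACHE hint after the leading SELECT keyword,
    // so the result of a streaming query never lands in the query cache.
    // Statements that are not plain SELECTs, or that already carry the
    // hint, are returned unchanged.
    public static String addNoCache(String sql) {
        if (sql.toUpperCase().contains("SQL_NO_CACHE")) {
            return sql;
        }
        return sql.replaceFirst("(?i)^(\\s*select\\b)", "$1 SQL_NO_CACHE");
    }

    public static void main(String[] args) {
        System.out.println(addNoCache("select * from t1"));
        // prints: select SQL_NO_CACHE * from t1
    }
}
```

Passing every streaming query through such a rewriter keeps the hint in one place instead of scattered across the codebase.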
[26 Oct 2009 11:46]
Tomaž Šolc
Hi, we are seeing this exact bug on Debian Lenny (MySQL 5.0.51a, kernel 2.6.31) with programs in Java, C++ and Python, so I am confident that this problem is not limited to Java. It appears to be triggered when a large streaming query (i.e. UseQueryResult in C++, SSCursor in Python) is interrupted after only a handful of rows have been retrieved. My guess is that when the amount of data actually read from the server is smaller than query_cache_limit, a corrupted query cache entry somehow gets created. After that, the same query will expose the problem until the server is restarted. But if you disable the query cache as Rico described, the query works normally. You can also "repair" the "broken" query by just adding a space to it (because the query cache works on the raw, unparsed query string). I am attaching a simple test case in C++ that exposes this bug. If you run it with the "_exit()" line enabled, it will create the broken query cache entry I mentioned. The table I used for this test is the image SQL dump from Wikipedia, but any sufficiently large table will do.
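The "add a space" repair works precisely because the query cache keys its entries on the raw query text, not on the parsed statement. A toy model in plain Java (no MySQL involved; the map merely stands in for the cache, and the "corrupted" value is hypothetical) shows why a one-character change bypasses a poisoned entry:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a cache keyed on the raw, unparsed query string, the way
// MySQL's query cache is. A bad entry is stored under one exact byte
// sequence, so any textual change -- even an extra space -- misses it.
public class RawKeyCacheDemo {
    public static final Map<String, String> cache = new HashMap<>();

    public static String lookup(String rawSql) {
        return cache.get(rawSql); // no parsing, no whitespace normalization
    }

    public static void main(String[] args) {
        cache.put("select * from t1", "<corrupted entry>");
        // The exact same text hits the broken entry...
        System.out.println(lookup("select * from t1"));
        // ...while the same query with one extra space misses the cache
        // entirely, so the server executes it fresh.
        System.out.println(lookup("select  * from t1"));
    }
}
```

This also explains why "reset query cache" clears the symptom: it discards the bad entry rather than fixing anything about the query.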
[26 Oct 2009 11:47]
Tomaž Šolc
Test case for the query cache bug
Attachment: mysql_cache_bug.cpp (text/x-c++src), 597 bytes.
[26 Oct 2009 12:33]
Tonci Grgin
Tomaž, Rico, thanks for your thoughts and tests. There are numerous bug reports filed against the server's query cache, so please search BugsDB.
[26 Oct 2009 12:59]
Tomaž Šolc
Search returns over 680 bug reports for "query cache". I haven't checked them all, but from other searches that I did, this is the only bug report that describes exactly what I'm seeing on my servers. Can you perhaps reassign the category from Connector/J to Server: Query Cache, or should I open a new bug report?
[26 Oct 2009 13:09]
Tonci Grgin
Tomaž, sure, but I have already passed this report to my server colleagues and will let them make the ruling.
[26 Oct 2009 13:11]
Tomaž Šolc
Ok. Thanks for your help Tonci.
[26 Oct 2009 13:47]
Valeriy Kravchuk
If you suspect a server bug, please try to repeat this with a recent, actively supported server version, 5.0.86 and/or 5.1.40, and let us know the results.