Bug #18972 Dataset too large
Submitted: 10 Apr 2006 17:37
Modified: 10 Apr 2006 21:19
Reporter: Thijs Sanders
Status: Not a Bug
Category: Connector / J
Severity: S2 (Serious)
Version: MySQL Connector/J 3.1.12
OS: Linux (Red Hat 9)
Assigned to:
CPU Architecture: Any

[10 Apr 2006 17:37] Thijs Sanders
Description:
Hello everybody,

I'm currently building a large database retrieval system (Java and MySQL). Everything works fine, but I have a big problem: the returned ResultSet is far larger than it should be!

The table has the following structure:

Probe_set_id     MEDIUMINT(6) UNSIGNED
Expressionvalue  FLOAT(5,1)
Dataset          TINYINT(2) UNSIGNED

I'm trying to retrieve a dataset of 6.5 million rows with the following query:

SELECT Probe_Set_ID, Expressionvalue FROM Test5 WHERE dataset=1

This results in Java using around 500 megabytes to retrieve the ResultSet! If my calculations are right, it should take around the following:

float: 4 bytes * 6,500,000 = 26,000,000 bytes
unsigned MEDIUMINT (read as a Java long): 8 bytes * 6,500,000 = 52,000,000 bytes

That would be 78,000,000 bytes in total, which is around 75 megabytes, not 500!

Code sample:

import java.sql.*;

String dataBaseName = "testDB";
String dataBaseUser = "java";
String dataBasePass = "clustering";
String url = "jdbc:mysql://localhost/" + dataBaseName;

// Load the Connector/J driver (legacy class name) and open a read-only connection
Class.forName("org.gjt.mm.mysql.Driver").newInstance();
Connection conn = DriverManager.getConnection(url, dataBaseUser, dataBasePass);
conn.setReadOnly(true);

// Plain Statement: the driver reads the entire result set into memory
Statement s = conn.createStatement();
s.execute("SELECT Probe_Set_ID, Expressionvalue FROM Test5 WHERE dataset=1");
ResultSet rs = s.getResultSet();

The ResultSet is way too large and I can't see why. I hope someone can help me. I'm using the newest MySQL JDBC driver!

The Java application (a gene expression retrieval server) runs on the same server as the MySQL server. The Probe_set_id column used to be a CHAR(30) column, but I changed it to an unsigned MEDIUMINT (which Java converts to a long); even so, the actual memory use still differs a lot from my estimate. I also changed the Expressionvalue column from DOUBLE to FLOAT, which should save another 4 bytes (if the driver doesn't transfer Strings and convert them to the desired primitive type later).

It's also possible that the driver downloads the values as Strings and converts them to the desired primitive type when requested. The problem with that theory is that the Probe_set_id column used to have an average string length of 17, and now that I've numbered the Probe_set_id column its average length is 4, yet the size of the ResultSet still doesn't differ much.

The average lengths of the two columns are 4 (Probe_set_id) and 3 (Expressionvalue). I still can't believe that the latter would consume 500 MB of memory! I assumed that making the average String length smaller and changing to smaller primitive types would improve performance and shorten download times. That holds when working with the MySQL database directly (faster downloads, faster retrieval), but not when connecting to it from Java: the ResultSet stays as big as it used to be, even after shrinking the average lengths and switching to smaller primitive types.

Is it possible that the driver stores a lot of extra data alongside the data that I actually want, and if so, which options should I enable, disable, or change?

How to repeat:
The bug isn't really repeatable as such. You would have to create a large table with 6.5 million fake rows and the same structure and retrieve it with MySQL.
[10 Apr 2006 18:53] Mark Matthews
This is covered in the manual; see the "ResultSet" section of:

http://dev.mysql.com/doc/refman/5.0/en/cj-implementation-notes.html
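
That section also describes streaming result sets: by default Connector/J materializes the entire result set in memory, but a forward-only, read-only Statement with a fetch size of Integer.MIN_VALUE makes the driver stream rows one at a time. A minimal sketch of that idiom, reusing the connection from the code sample above:

// Streaming sketch: TYPE_FORWARD_ONLY + CONCUR_READ_ONLY + a fetch size of
// Integer.MIN_VALUE tells Connector/J to stream rows instead of buffering
// the whole result set in memory.
Statement stream = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
    ResultSet.CONCUR_READ_ONLY);
stream.setFetchSize(Integer.MIN_VALUE);
ResultSet rows = stream.executeQuery(
    "SELECT Probe_Set_ID, Expressionvalue FROM Test5 WHERE dataset=1");
while (rows.next()) {
    // only the current row is held in memory at any point
    int probeSetId = rows.getInt(1);
    float expression = rows.getFloat(2);
}
rows.close();
stream.close();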
[10 Apr 2006 20:11] Thijs Sanders
Thank you for the reply, Mark. Streaming is also a really good option (which I will certainly implement, since it consumes less memory and relieves pressure on the GC). But what I really would like to know is why the size of the ResultSet stayed the same after converting the table: the Strings were shorter and I switched to smaller primitive types. I might have overlooked it, but could you please answer this?
[10 Apr 2006 20:22] Mark Matthews
If you're using "plain" java.sql.Statements, MySQL returns the values as strings, which the JDBC driver stores as byte[]s. Therefore the "client storage size" for plain statements is related to the string representation of the value, not the binary representation of the same value.

If you're using MySQL 4.1 or newer, the server will return "native" binary representations of the values when you use prepared statements, and in some cases these will be smaller. For example, the integer value '1', encoded as a string, will of course be smaller than 32 bits (8 bytes), while '1234567891234' will be larger than 8 bytes. The JDBC driver will use these "native" binary representations when storing the values, so in many cases the memory required will be less.
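
A minimal sketch of that prepared-statement path, assuming the connection from the code sample above (the table and column names come from the report; with MySQL 4.1+ and server-side prepared statements in effect, Connector/J receives the result through the binary protocol):

// PreparedStatement against MySQL 4.1+: result values arrive in the binary
// protocol, so an integer costs its fixed native size rather than the
// length of its decimal string representation.
PreparedStatement ps = conn.prepareStatement(
    "SELECT Probe_Set_ID, Expressionvalue FROM Test5 WHERE dataset=?");
ps.setInt(1, 1);
ResultSet binaryRs = ps.executeQuery();
while (binaryRs.next()) {
    int probeSetId = binaryRs.getInt(1);
    float expression = binaryRs.getFloat(2);
    // ... process the row ...
}
binaryRs.close();
ps.close();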
[10 Apr 2006 20:31] Thijs Sanders
Ahh, yes, thank you Mark, that was what I was searching for! That was really hard to find. Only, 32 bits is 4 bytes, not 8, but thanks again!
[10 Apr 2006 20:41] Mark Matthews
Sorry for the typo!
[10 Apr 2006 20:42] Mark Matthews
Realize also that there is overhead for every Java object instance, so your calculations are quite simplistic.

Arrays have 16 bytes or so of overhead, and the JDBC driver stores each field value as its own byte[] (and each row as an array of those). Storing a row as one byte[] with offsets would actually take more memory in most cases, since each int "offset" into the array would take 8 bytes _each_ (in general, most primitives have an 8-byte overhead).

Here are your calculations taking that overhead into account:

float: 4 bytes * 6,500,000 = 26,000,000 bytes (+ 52,000,000 bytes JVM overhead)
unsigned MEDIUMINT as long: 8 bytes * 6,500,000 = 52,000,000 bytes (+ 52,000,000 bytes JVM overhead)
Array for each row: 104,000,000 bytes overhead

That comes to 286,000,000 bytes (about 272 MB) for the data alone, not including other intermediate objects created by the Java class libraries themselves when doing I/O and numeric conversions, or the internal object instances the driver creates.
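
A rough, JVM-dependent way to see that per-row cost is to allocate two byte[]s per row, sized to the average field lengths from the report, and compare heap usage before and after. The 1,000,000-row scale, the class name, and the Runtime-based measurement are illustrative choices, not from this thread:

// Rough, JVM-dependent sketch: approximate the per-row cost of holding two
// fields as byte[]s, the way the driver stores rows from plain statements.
public class RowOverheadSketch {
    public static void main(String[] args) {
        final int ROWS = 1000000; // scaled down from the 6.5 million rows
        Runtime rt = Runtime.getRuntime();
        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        byte[][][] rows = new byte[ROWS][][];
        for (int i = 0; i < ROWS; i++) {
            // average field lengths from the report: 4 (Probe_set_id), 3 (Expressionvalue)
            rows[i] = new byte[][] { new byte[4], new byte[3] };
        }

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("~" + ((after - before) / ROWS) + " bytes per row");
        System.out.println("rows held: " + rows.length); // keep the array reachable
    }
}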
[10 Apr 2006 21:19] Thijs Sanders
Ahh, yes, that would explain a lot. That is one big overhead. I guess streaming is the only good option in this case, because holding all the rows and fields in byte arrays takes too much memory and too many resources, stresses the GC, and needs to allocate too many times.