Bug #70369 Undocummented behavior of InnoDB tables using CHAR data type
Submitted: 17 Sep 2013 16:07 Modified: 3 Jan 2014 15:35
Reporter: Przemyslaw Malkowski Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Documentation Severity:S3 (Non-critical)
Version:5.5 5.6 OS:Any
Assigned to: Daniel Price CPU Architecture:Any
Tags: InnnoDB CHAR

[17 Sep 2013 16:07] Przemyslaw Malkowski
Description:
As per documentation (http://dev.mysql.com/doc/refman/5.5/en/storage-requirements.html), for CHAR data type and multi-byte character sets, the row size in bytes is always at maximum possible value. 
So CHAR(3) for utf-8 will always have 9 bytes, while for latin1 it will be 3 bytes. 
This is easy to verify in MyISAM tables - their size on disk as well as DATA_LENGTH row attribute reflects the 3x difference between utf-8 and latin1. 
For example, same table data in 1-byte charset vs multi-byte:
+-----------------+-------------+
| table_name      | DATA_LENGTH |
+-----------------+-------------+
| char_lat_myisam |      101000 |
| char_utf_myisam |      301000 |
+-----------------+-------------+

But it does not look the same in InnoDB tables... The problem is whatever character is set or actual data inserted (single byte characters or "wide" characters) - I end up with exactly the same table size. 

Tested on both 5.5.33 and 5.6.13.

How to repeat:
Create Database: CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */

CREATE TABLE char_utf_innodb (
val1 char(100))
ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE char_lat_innodb (
val1 char(100))
ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE char_utf_innodb_2 (
val1 char(100))
ENGINE=InnoDB DEFAULT CHARSET=utf8;

Each table has 1001000 same rows like below:

mysql> select * from char_utf_innodb limit 2;
+------+
| val1 |
+------+
| abc  |
| abc  |
+------+
2 rows in set (0.00 sec)

mysql> select * from char_lat_innodb limit 2;
+------+
| val1 |
+------+
| abc  |
| abc  |
+------+
2 rows in set (0.00 sec)

mysql> select * from char_utf_innodb_2 limit 2;
+-----------+
| val1      |
+-----------+
| ऊऊऊ       |
| ऊऊऊ       |
+-----------+
2 rows in set (0.00 sec)

mysql> show table status like 'char_lat_innodb'\G
*************************** 1. row ***************************
           Name: char_lat_innodb
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 1001163
 Avg_row_length: 135
    Data_length: 135954432
Max_data_length: 0
   Index_length: 0
      Data_free: 6291456
 Auto_increment: NULL
    Create_time: 2013-09-17 17:51:25
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: row_format=COMPACT
        Comment: 
1 row in set (0.00 sec)

mysql> show table status like 'char_utf_innodb'\G
*************************** 1. row ***************************
           Name: char_utf_innodb
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 1001169
 Avg_row_length: 136
    Data_length: 137003008
Max_data_length: 0
   Index_length: 0
      Data_free: 5242880
 Auto_increment: NULL
    Create_time: 2013-09-17 17:50:29
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options: row_format=COMPACT
        Comment: 
1 row in set (0.00 sec)

mysql> show table status like 'char_utf_innodb_2'\G
*************************** 1. row ***************************
           Name: char_utf_innodb_2
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 1001169
 Avg_row_length: 136
    Data_length: 137003008
Max_data_length: 0
   Index_length: 0
      Data_free: 5242880
 Auto_increment: NULL
    Create_time: 2013-09-17 17:50:15
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options: row_format=COMPACT
        Comment: 
1 row in set (0.01 sec)

+-------------------+-------------+
| table_name        | DATA_LENGTH |
+-------------------+-------------+
| char_lat_innodb   |   135954432 |
| char_utf_innodb   |   137003008 |
| char_utf_innodb_2 |   137003008 |
+-------------------+-------------+

ls -lh:
-rw-rw----. 1 mysql mysql 146800640 Sep 17 17:51 char_lat_innodb.ibd
-rw-rw----. 1 mysql mysql 146800640 Sep 17 17:50 char_utf_innodb_2.ibd
-rw-rw----. 1 mysql mysql 146800640 Sep 17 17:50 char_utf_innodb.ibd

Suggested fix:
Row size should be smaller for single byte character set for CHAR data types also in InnoDB tables. Now it looks like InnoDB is wasting lot of space using maxed space allocations regardless of character set and multi-byte characters stored.
If that is impossible, at least the documentation should mention this per-engine difference.
[17 Sep 2013 17:05] MySQL Verification Team
Last points in this page gives some clues as to what happens:
http://dev.mysql.com/doc/refman/5.5/en/innodb-physical-record.html
[17 Sep 2013 17:18] Jeremy Cole
This is more subtle than it would perhaps seem. The following two cases occur:

* CHAR(100) CHARSET latin1: Stored as fixed-width field (or CHAR(100)), with length 100, and end-space padded.
* CHAR(100) CHARSET utf8: Stored as variable-width field (or VARCHAR(100)). End-space padded up to at least 100 bytes, but may store up to 300 bytes in total.

For the two cases you used (value "abc" and value "ऊऊऊ") the storage required for CHAR(100) CHARSET utf8 is exactly the same. However, the value "abc" (3 bytes) is padded by 97 bytes, and "ऊऊऊ" (9 bytes) is padded by 91 bytes.

InnoDB is internally making an optimization here to store a CHAR(100) in the utf8 character set as a "pseudo fixed width" field. This could probably be documented a bit better.
[17 Sep 2013 17:27] Jeremy Cole
By the way, using your table sizes it is easy to verify that this case is closer to storing the minimum allocation for either string than the maximum:

146800640 bytes / 1001000 rows => 146.654 bytes/row

This comes from the 100 bytes for val1, plus 26 bytes of per-row overhead (including a 6-byte implicit clustered primary key), plus ~20 bytes of tablespace overhead/reservation/waste.
[17 Sep 2013 17:42] Valeriy Kravchuk
So, space usage is actually efficient, but it should be clearly documented, IMHO.
[18 Sep 2013 14:11] Przemyslaw Malkowski
Thank you Jeremy for excellent explanation. The fact that InnoDB uses optimization to store utf-8 CHAR similar way as it's storing VARCHAR explains a lot.
By stating the initial title as InnoDB being inefficient, I actually based it on my earlier tests on smaller tables, but I end up showing an example of larger table where it is showing actual efficiency for this specific case as Jeremy noticed. My bad.

So, below another example, with smaller table and smaller CHAR column width. 

Let's say I have CHAR(10) column, and I fill it with all 10 characters, 1 byte/char for both latin1 and uft8 plus 3 bytes/char for second utf8 table.

In MyISAM table both .MYD file size and Data_length makes exactly 31 bytes per row in utf8 table regardless of characters stored.
In InnoDB, after I fill these tables with just 100k rows:
* when table is utf-8 and I fill CHAR(10) column with single-bye characters - Data_length shows ~47 bytes/row
.ibd size makes ~125 bytes/row

* when table is latin1 and I fill CHAR(10) column with characters - Data_length shows again ~47 bytes/row
.ibd size makes also ~125 bytes/row

* when table is utf-8 and I fill CHAR(10) column with three-bytes characters - Data_length shows ~68 bytes/row
.ibd size makes ~146 bytes/row

So the conclusion is that given only the Data_length statistics, we have about ~37 bytes overhead per row. But given the result file size, it's already ~115 bytes overhead per single CHAR(10) row.

This makes an example utf-8 InnoDB table, filled with 100k rows of single-byte characters size at 12582912 bytes vs the same table as MyISAM is 3100000 bytes. 4 times bigger despite the fact InnoDB uses optimization for multi-byte charset here.
For 1M rows, it's 30MB MyISAM vs 48MB InnoDB. Even for 10M rows it's still 296MB vs 388MB.

The clear advantage of InnoDB over MyISAM is when 3 characters are stored per CHAR(100) column - 288MB MyISAM table vs 140MB InnoDB.

The different storage usage behaviour in InnoDB should be docummented, also here I think: http://dev.mysql.com/doc/refman/5.5/en/char.html.
[18 Sep 2013 14:23] Przemyslaw Malkowski
Changing title to more appropriate.
[19 Sep 2013 8:49] MySQL Verification Team
Hello Przemyslaw,

Thank you for the report.

Thanks,
Umesh
[3 Jan 2014 15:35] Daniel Price
The Data Storage Requirements section (http://dev.mysql.com/doc/refman/5.5/en/storage-requirements.html) was missing references to InnoDB-specific information about CHAR data type storage requirements for InnoDB tables, found here: http://dev.mysql.com/doc/refman/5.5/en/innodb-table-and-index.html. 

References have been added. This section also provides information about InnoDB physical row structure (and "overhead" storage costs).  

Also added a reference to the data type storage requirements section from http://dev.mysql.com/doc/refman/5.5/en/char.html. 

Thank you for the bug report.