Bug #50552 ndb_size.pl fails to calculate utf8 fields correctly
Submitted: 22 Jan 2010 17:18 Modified: 16 Apr 2012 11:52
Reporter: Patrick Mulvany (OCA) Email Updates:
Status: In review Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.1-telco-7.0, 5.5.20-ndb-7.2.5 OS:Any
Assigned to: CPU Architecture:Any
Tags: 7.0.9, Contribution

[22 Jan 2010 17:18] Patrick Mulvany
Description:
ndb_size.pl calculates field size without checking charecter set so when a table uses utf8 and collation utf8_general_ci the calculation fails to spot max usage = 3xfield length rounded up.

This also means that complex tables using utf8 and with total fields varchar over 3k (ie 12x varchar(255)) appear to work but will not be portable to NDB cluster

How to repeat:
Use InnoDB or MyISAM engines :-

mysql> create table t1 (c1 varchar(255),c2 varchar(255),c3 varchar(255),c4 varchar(255),c5 varchar(255),c6 varchar(255),c7 varchar(255),c8 varchar(255),c9 varchar(255),c10 varchar(255),c11 varchar(255),c12 varchar(255)) character set utf8;
Query OK, 0 rows affected (0.00 sec)

ndb_size.pl test --format=html > test.html

test.t1
DataMemory for Columns

* means varsized DataMemory
Column Name 	Type 	Varsized 	Key 	4.1 	5.0 	5.1
c3 	varchar(255) 	Y 		256 	256 	4*
c4 	varchar(255) 	Y 		256 	256 	4*
c9 	varchar(255) 	Y 		256 	256 	4*
c2 	varchar(255) 	Y 		256 	256 	4*
c10 	varchar(255) 	Y 		256 	256 	4*
c6 	varchar(255) 	Y 		256 	256 	4*
c12 	varchar(255) 	Y 		256 	256 	4*
c11 	varchar(255) 	Y 		256 	256 	4*
c7 	varchar(255) 	Y 		256 	256 	4*
c5 	varchar(255) 	Y 		256 	256 	4*
HIDDEN_NDB_PKEY 	bigint 		PRI 	8 	8 	8
c1 	varchar(255) 	Y 		256 	256 	4*
c8 	varchar(255) 	Y 		256 	256 	4*
Fixed Size Columns DM/Row 				3080 	3080 	8
Varsize Columns DM/Row 				0 	0 	48

but if you use the same creation script on a NDB cluster you will get :-

ERROR 1118 (42000): Row size too large. The maximum row size for the used table type, not counting BLOBs, is 8052. You have to change some columns to TEXT or BLOBs

Suggested fix:
Implement character set detection an alter calculation appropriately.

mysql>  show table status ;
+------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+-----------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time         | Check_time | Collation       | Checksum | Create_options | Comment |
+------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+-----------------+----------+----------------+---------+
| t1   | MyISAM |      10 | Dynamic    |    0 |              0 |           0 | 281474976710655 |         1024 |         0 |           NULL | 2010-01-22 17:11:56 | 2010-01-22 17:11:56 | NULL       | utf8_general_ci |     NULL |                |         |
+------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+-----------------+----------+----------------+---------+
1 row in set (0.00 sec)
[26 Jan 2010 15:21] Patrick Mulvany
Patch to add --utf8 switch and warn of rows too long for NDB engine

Attachment: ndb_size.pl.patch (application/octet-stream, text), 5.46 KiB.

[1 Feb 2010 10:22] Sveta Smirnova
Thank you for the report.

Verified as described.
[5 Feb 2010 15:42] Lenz Grimmer
Hi Patrick! Thank you very much for your patch contribution.

In order for us to accept you patch, We have to ask you for one small favour -
could you please send us a signed and filled out copy of the Sun Contributor
Agreement (SCA) as outlined on this page?

  http://forge.mysql.com/wiki/Contributing_Code#Paperwork

You will only have to do this once and it's valid for all other Sun-governed
Open Source projects as well.

Please let me know, if you have any questions or concerns about this!
about this. Thanks!
[10 Feb 2010 12:09] Patrick Mulvany
SCA sent on 8th no idea how long these take to process ;) 

When I get time (isn't it always) I will do a more complete patch covering double byte characters and per table/field as well.

Probably using something like :-
select  TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,CHARACTER_SET_NAME,COLLATION_NAME from information_schema.columns where table_schema not in ('mysql','information_schema') and CHARACTER_SET_NAME is not null;
[11 Feb 2010 11:41] Patrick Mulvany
Updated patch to add --default-character-set=utf8|utf2|utf16 option, now detects column charsets, removed defunct --uft8 option

Attachment: ndb_size.pl.patch (application/octet-stream, text), 6.68 KiB.

[11 Feb 2010 14:12] Patrick Mulvany
Full fix handles any character set pulling length from information_schema with backout using fixed values for utf8,ucs2 7 utf16

Attachment: ndb_size.pl.patch (application/octet-stream, text), 6.97 KiB.

[24 Mar 2010 11:38] Patrick Mulvany
SCA accepted ;)

https://sca.dev.java.net/CA_signatories.htm#m

Phew that took a little while to get sorted.

Paddy
[25 Mar 2010 11:20] Frazer Clement
Hi Patrick,
  Thanks for the bug report and patch.  
  I'll have a look at ndb_size.pl and your patch today.
Thanks,
Frazer Clement
[26 Mar 2010 18:08] Frazer Clement
Hi Patrick,
  Sorry for the delay in responding, I was not that familiar with Perl or ndb_size.pl previously!
  I've reviewed your patch and it looks good.  
  I've suggested a few changes in a new patch, which I will attach to this bug report.  Please let me know what you think, any mistakes etc.
  Once we've got a version we are both happy with then I will commit it to mysql-5.1-telco-6.2 and upwards.
  Please get in touch with any questions / issues.  My email contact is frazer@mysql.com.
Thanks,
Frazer

  Changes suggested in new patch :
    - Added other 'bugs' to the comments list
    - Removed change of shortcut -h from help to hostname
    - Renamed $infoschema to $using_infoschema
    - Removed capitalisation of information_schema
    - Added warnings when default character set mappings are being used
      (Note that no per-column charset info is available if the information_schema is unavailable)
    - Moved charlen * size calculation out of type-specific switch
    - Added $lenbytes variable
    - Added separate 'Max bytes/row' sum, which is used for rowlength checks.
      (Per version, for potential future scenario of different limits / version)
[26 Mar 2010 18:09] Frazer Clement
Modified version of latest patch

Attachment: bug#50522.patch (text/x-patch), 9.23 KiB.

[6 Mar 2012 17:41] Chris Miller
When will this patch be committed? It's been two years, just sayin... I'm fortunate enough to have found this searching the interwebs, and utf8 is more the rule than the exception these days.
[16 Apr 2012 11:52] Patrick Mulvany
Currently working on an updated patch that fixes a few more issues bring this more up to date as it has been a bit stagnant.
Currently proposed patch does not handle set and per column charecter sets correctly.