Bug #69869 Spaces aren't spaces
Submitted: 30 Jul 2013 7:32 Modified: 31 Jul 2013 8:55
Reporter: Owen Miller Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Parser Severity:S4 (Feature request)
Version:5.7.2 OS:Linux (Ubuntu 12.04)
Assigned to: CPU Architecture:Any
Tags:  , 1064, space, Unicode

[30 Jul 2013 7:32] Owen Miller
Description:
Variations of the space character are handled differently. From their appearance to users and their syntactical use in language, users naturally expect them to be treated the same. When they are treated differently for the sake of it (as they are here), it creates a lot of friction in the user experience.

How to repeat:
With the bépo keyboard layout in Ubuntu 12.04, typing SHIFT+SPACEBAR (easy to do accidentally) results in U+00A0 NO-BREAK SPACE, whereas typing SPACEPAR results in U+0020 SPACE

Suggested fix:
Treat all these spaces the same:
http://www.fileformat.info/info/unicode/category/Zs/list.htm

Alternatively, at least detect that different spaces are being used and warn the user that even though they're typing spaces, apparently it's not the right kind of space for MySQL. It took me ages to diagnose this problem, with the most common error being 1064, error in SQL syntax
[30 Jul 2013 15:08] MySQL Verification Team
I think this bug report is about the parser.  
Previously I opened similar one internally about actual data.
Bug 14271638 - CHARACTER SET DUPLICATE HANDLING SEEMS BROKEN

The crux of the matter was some character codes are different, but appear the same so the only sure way to represent all varchar is in hex notation otherwise dump/reload with mysqldump can corrupt or give duplicate errors, e.g:

mysql> select hex(a),a from t1;
+--------+------+
| hex(a) | a    |
+--------+------+
| 29     | )    |
| A4     | )    |
+--------+------+
2 rows in set (0.00 sec)
[31 Jul 2013 6:17] Owen Miller
yeah sorry, I should've said that this was in the parser, not the data tables. It was where I was writing my SELECT statements. I'm not quite sure what you mean about the hex representation, but these are indeed different characters. It's only from a human point of view that they're the same.
[31 Jul 2013 8:55] MySQL Verification Team
I believe currently the function my_isspace determines whether given char is whitespace or not. That is defined as:

#define	my_isspace(s, c)  (((s)->ctype+1)[(uchar) (c)] & _MY_SPC)
#define	_MY_SPC	010	/* Spacing character */

In unicode there are more whitespace characters for example:
http://en.wikipedia.org/wiki/Whitespace_character