Bug #90602 WHERE statements with JSON extraction do not follow the table collation
Submitted: 24 Apr 2018 12:00 Modified: 30 May 2018 14:49
Reporter: Dieter Devlieghere Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: JSON Severity:S3 (Non-critical)
Version:5.7 OS:Ubuntu
Assigned to: CPU Architecture:Any
Tags: EXTRACT, json

[24 Apr 2018 12:00] Dieter Devlieghere
Description:
There is a discrepancy between selecting data in regular varchar fields versus working with data in a JSON field. 

You can create a table with a json field and a varchar field using utf8_general_ci as table collation. Then, when you do a WHERE on an extracted piece of JSON data, you will not get the expected results.

For example, using utf8_general_ci, doing a SELECT query with

WHERE mycolumn = "accént"

will return values where mycolumn contains "accent" or "accént" or another variation.

However, doing the same on a value extracted from a JSON field, you will only get values which match exactly. 

Likely this is due to (as I've read elsewhere) the internal utf8mb4 incoding used to encode JSON data?

Is this behaviour by design, or can it be considered as a bug?

How to repeat:
# Create a demo table:
CREATE TABLE `testdata` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` json,
  `title_plain` VARCHAR(200),
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci ROW_FORMAT=COMPRESSED;

# Insert a record with accents:
insert into `testdata` (title, title_plain) 
values 
('{"nl" : "accent", "fr" : "áccent"}', "áccent");

# The following will return the record we inserted:
select * from `testdata`
where title_plain = "accent";

# The following will return no records:
select * from `testdata`
where title->>"$.fr" = "accent";

# The following will return the record we inserted:
select * from `testdata`
where title->>"$.fr" = "áccent";

# The following will return the record we inserted:
select * from `testdata`
where CONVERT(title->>"$.fr" using utf8) = "accent";

Suggested fix:
Make the behaviour consistent?
[26 Apr 2018 9:09] Jonathan Vercoutere
Data stored in a JSON field uses the utf8mb4_bin collation, so you're also going to run into unexpected disrepancies concerning case sensitivity.

Preferred behaviour would see the JSON contents using the same collation as defined on the table they are in.

Ultimately there are workarounds to solve this issue, but they increase code complexity and decrease performance. The inconsistency also increases the likelihood of unknowingly introducing bugs.
[30 Apr 2018 14:49] MySQL Verification Team
Please read: https://dev.mysql.com/doc/refman/5.7/en/json.html#json-comparison

"MySQL handles strings used in JSON context using the utf8mb4 character set and utf8mb4_bin collation. Strings in other character sets are converted to utf8mb4 as necessary."
[1 Jun 2018 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".