Bug #70901 ORDER BY of a TEXT column doesn't use the same ordering as the < > = operators
Submitted: 13 Nov 2013 20:18 Modified: 14 Nov 2013 18:02
Reporter: Arthur O'Dwyer Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:5.5.31, 5.5.35, 5.1.71, 5.0.97, 5.6.15 OS:Any
Assigned to: CPU Architecture:Any

[13 Nov 2013 20:18] Arthur O'Dwyer
Description:
MySQL's CHAR columns are literally padded with spaces.

MySQL's VARCHAR columns aren't literally padded, but they still compare via < > = as if they were space-padded; for example "a" = "a  " and "a\n" < "a". Doing an ORDER BY a VARCHAR column sorts the rows according to this ordering.

MySQL's TEXT columns use the same space-padding < > = comparisons as VARCHAR columns, but when you ORDER BY a TEXT column, the order you get is actually a strict dictionary ordering (each string `x` compares strictly greater-than each proper prefix of `x`).

This last fact means that TEXT columns sort (and apparently are indexed) in an order that's not equivalent to how they compare with < > =. This is extremely surprising behavior. I imagine it also breaks various range-scan optimizations (SELECT ... WHERE mytext BETWEEN foo AND bar), although I haven't tried looking for the breaks.

How to repeat:
create database if not exists d;
use d;
drop table if exists t;
create table t (v text collate utf8_bin);
insert into t values ('');
insert into t values ('\n\n');
select hex(t1.v), hex(t2.v), t1.v < t2.v as `<`, t1.v = t2.v as `=`, t1.v > t2.v as `>` from t t1 join t t2 order by t1.v, t2.v;

  +-----------+-----------+------+------+------+
  | hex(t1.v) | hex(t2.v) | <    | =    | >    |
  +-----------+-----------+------+------+------+
  |           |           |    0 |    1 |    0 |
  |           | 0A0A      |    0 |    0 |    1 |
  | 0A0A      |           |    1 |    0 |    0 |
  | 0A0A      | 0A0A      |    0 |    1 |    0 |
  +-----------+-----------+------+------+------+

Notice that according to the right-hand columns, "" > "\n\n" and "\n\n" < "", but according to the left-hand columns, "" sorts earlier than "\n\n" in an ORDER BY.

Suggested fix:
I suppose this behavior might be documented somewhere. It probably can't be "fixed" at this point. However, it would be a good idea to audit the code for optimizations that assume ORDER BY and/or indexes use the same collation order as < > =, and fix them up as they're found.
[14 Nov 2013 18:02] Sveta Smirnova
Thank you for the report.

Sort order, indeed, is different for VARCHAR and TEXT. Verified as described.