Description:
I have a reproducable bug in MySQL 4.1.11 and 5.0.2 where a strange combination of data and index length is producing erroneous results. This only happens with a very specific set of circumstances, where there is a compound index in which the field you're querying on is limited to a number of characters. This only seems to happen after data is moved using INSERT ... SELECT;
I have placed a data file containing the data that causes this bug at a URL which I will give to the developers privately. It is copyrighted data, and so cannot be made publicly available. The script to reproduce, using that file, is below. I am confident that this will happen any time one imports data into a utf-8 table and then selects out of it.
Here are the results from running the code below against the MySQL Max 4.1.11 binaries from mysql.com on RedHat Linux 8.0. They are identical to tests performed Debian Sarge using MySQL 5.0.2.
CitiesUsingEquals
0
id city country_abbreviation latitude longitude
28 Durban IR 27.05000000 62.21666670
33 Durban SF -29.85000000 31.01666670
38 Durban FR 43.53333330 0.60000000
43 Durban FR 43.01666670 1.35000000
48 Durban FR 43.00000000 2.81666670
CitiesUsingLike
5
id city country_abbreviation latitude longitude
28 Durban IR 27.05000000 62.21666670
33 Durban SF -29.85000000 31.01666670
38 Durban FR 43.53333330 0.60000000
43 Durban FR 43.01666670 1.35000000
48 Durban FR 43.00000000 2.81666670
CitiesUsingEqualsNoId
5
id city country_abbreviation latitude longitude
28 Durban IR 27.05000000 62.21666670
33 Durban SF -29.85000000 31.01666670
38 Durban FR 43.53333330 0.60000000
43 Durban FR 43.01666670 1.35000000
48 Durban FR 43.00000000 2.81666670
CitiesUsingLikeNoId
5
How to repeat:
drop table if exists FeatureBrief;
create table FeatureBrief (
Feature_int_ID int,
FeatureType_str_code varchar(5),
Admin1_str_code varchar(5),
Country_str_code varchar(2),
Feature_int_font_code int,
Feature_str_name varchar(120),
Feature_str_local_name varchar(120),
Feature_dec_lat numeric(12,7),
Feature_dec_lon numeric(12,7),
Feature_str_lat varchar(12),
Feature_str_lon varchar(12),
Feature_dec_source_lat numeric(12,7),
Feature_dec_source_lon numeric(12,7),
Feature_str_source_lat varchar(12),
Feature_str_source_lon varchar(12)
) type=myisam character set utf8;
load data local infile 'Feature.brief.txt' into table FeatureBrief fields terminated by '\t' lines terminated by '\r\n';
create index FeatureBrief_Feature_str_name_index on FeatureBrief(Feature_str_name(7));
drop table if exists citiesBrief;
create table citiesBrief (
id int unsigned auto_increment primary key,
city varchar(100),
country_abbreviation char(3),
latitude decimal(16,8),
longitude decimal(16,8)
) type=myisam character set=utf8;
-- Note that there is no extended character information being selected
-- in this insert.
insert into citiesBrief
select
null,
F.Feature_str_name,
F.Country_str_code,
F.Feature_dec_lat,
F.Feature_dec_lon
from FeatureBrief F;
-- This index produces the problem. Notice that 0 rows are selected on the first query using =
create index citiesBrief_city_id_index on citiesBrief(city(7),id);
select * from citiesBrief where city = 'Durban';
select FOUND_ROWS() as CitiesUsingEquals;
-- These show that the problem does not manifest on the source table
--select * from FeatureBrief where Feature_str_name = 'Durban';
--select FOUND_ROWS() as FeatureUsingEquals;
select * from citiesBrief where city like 'Durban';
select FOUND_ROWS() as CitiesUsingLike;
--select * from FeatureBrief where Feature_str_name like 'Durban';
--select FOUND_ROWS() as FeatureUsingLike;
-- recreating the index with just the city field, the problem goes away
alter table citiesBrief drop index citiesBrief_city_id_index;
create index citiesBrief_city_id_index on citiesBrief(city(7));
select * from citiesBrief where city = 'Durban';
select FOUND_ROWS() as CitiesUsingEqualsNoId;
select * from citiesBrief where city like 'Durban';
select FOUND_ROWS() as CitiesUsingLikeNoId;
Suggested fix:
Whatever is going wrong here, the equals query should not fail where the like query does not! The data is _identical_ in all 5 resulting columns, so this clearly looks like a problem with the indexing code.