MySQL Bugs: #100292: utf-8 decoding fails if the first byte is 0xEF

Bug #100292	utf-8 decoding fails if the first byte is 0xEF
Submitted:	22 Jul 2020 14:37	Modified:	23 Jul 2020 18:41
Reporter:	Rafal Somla
Status:	Closed
Category:	Connector / C++	Severity:	S3 (Non-critical)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Decoding a UTF-8 string fails if the string starts with the byte \xEF. The reason is that the leading \xEF byte is treated as the start of a BOM and skipped, even when the rest of the BOM sequence (\xEF\xBB\xBF) does not follow, which leaves invalid UTF-8 data. This happens in the function str_decode() in <cdk/foundation/string.h>, which invokes rapidjson code to perform the UTF-8 conversion. It first creates a rapidjson input stream from the raw bytes that contain the UTF-8 data:

  rapidjson::EncodedInputStream<FROM, Mem_stream<char> > input(bytes);

It also creates an output stream and calls rapidjson::Transcoder<>::Transcode(input, output). When EncodedInputStream<> is constructed, its constructor looks for a BOM at the beginning of the data and skips it if present:

    EncodedInputStream(InputByteStream& is) : is_(is) { 
        current_ = Encoding::TakeBOM(is_);
    }

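This leads to the issue: for input that begins with \xEF but is not a BOM, the \xEF byte is still consumed, so decoding starts in the middle of a multi-byte sequence.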
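
The failure mode can be modelled without rapidjson. The sketch below is a simplified illustration, not the rapidjson source: eager_bom_skip() and strict_bom_skip() are hypothetical names. The first consumes bytes while comparing them against the BOM, so a partial match loses the bytes already read. The second is shown only for contrast and skips the prefix only when all three BOM bytes are present.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model of a BOM check that consumes bytes as it compares them.
// Returns the index of the first byte the decoder would see afterwards.
std::size_t eager_bom_skip(const std::string &bytes)
{
    static const unsigned char bom[] = {0xEF, 0xBB, 0xBF};
    std::size_t pos = 0;
    for (unsigned char expected : bom) {
        if (pos >= bytes.size() ||
            static_cast<unsigned char>(bytes[pos]) != expected)
            return pos;  // mismatch: bytes before pos were consumed and are lost
        ++pos;
    }
    return pos;  // full BOM matched and skipped
}

// Contrast: skip the prefix only if the complete 3-byte BOM is present.
std::size_t strict_bom_skip(const std::string &bytes)
{
    return bytes.compare(0, 3, "\xEF\xBB\xBF") == 0 ? 3 : 0;
}
```

For the bytes \xEF\xBC\x88, eager_bom_skip() returns 1, so the decoder would start at the continuation byte \xBC, while strict_bom_skip() returns 0 and leaves the data intact.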

How to repeat:
  Session sess(...);
  auto res = sess.sql("SELECT '\xef\xbc\x88'").execute();
  Value row = res.fetchOne().get(0); // throws an exception here
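
For reference, the three bytes in the query above form a valid UTF-8 sequence on their own. The sketch below is a minimal hand-written decoder for a single three-byte sequence; decode_utf8_3() is a hypothetical helper, not part of the connector or rapidjson, and it does not reject overlong forms or surrogates.

```cpp
#include <cstdint>

// Decodes one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
// Returns the code point, or -1 if the bytes do not have that shape.
std::int32_t decode_utf8_3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    if ((b0 & 0xF0) != 0xE0)
        return -1;  // not a three-byte lead byte
    if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80)
        return -1;  // not continuation bytes
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
}
```

decode_utf8_3(0xEF, 0xBC, 0x88) yields 0xFF08 (U+FF08, FULLWIDTH LEFT PARENTHESIS). If the leading 0xEF is dropped, the sequence starts with 0xBC, which is a continuation byte and cannot begin a character, so the helper returns -1.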

Suggested fix:
Use a different input stream implementation, one that does not interpret BOMs. rapidjson may provide such a class.
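
A minimal sketch of what such a stream could look like is given below. It is an assumption for illustration only: RawByteStream and read_all() are hypothetical names, the Peek()/Take()/Tell() members merely follow the general shape of rapidjson's input stream interface, and this is not necessarily how the fix in the connector was implemented.

```cpp
#include <cstddef>
#include <string>

// Hypothetical pass-through byte stream: hands out every byte unchanged and
// never inspects the leading bytes for a BOM.
class RawByteStream {
public:
    typedef char Ch;

    explicit RawByteStream(const std::string &bytes) : data_(bytes), pos_(0) {}

    // Next byte without consuming it ('\0' at end of input).
    Ch Peek() const { return pos_ < data_.size() ? data_[pos_] : '\0'; }
    // Next byte, advancing the position ('\0' at end of input).
    Ch Take() { return pos_ < data_.size() ? data_[pos_++] : '\0'; }
    // Number of bytes consumed so far.
    std::size_t Tell() const { return pos_; }

private:
    std::string data_;
    std::size_t pos_;
};

// Reads the whole input back through the stream, to show nothing is dropped.
std::string read_all(const std::string &bytes)
{
    RawByteStream in(bytes);
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i)
        out.push_back(in.Take());
    return out;
}
```

With this stream, read_all("\xEF\xBC\x88") returns the same three bytes, so a decoder reading from it sees the complete sequence.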

Posted by developer:
 
Fixed in 8.0.22.

String decoding failed for utf-8 strings that began with a \xEF
byte-order mark.