Bug #100292 utf-8 decoding fails if the first byte is 0xEF
Submitted: 22 Jul 2020 14:37 Modified: 23 Jul 2020 18:41
Reporter: Rafal Somla
Status: Closed
Category:Connector / C++ Severity:S3 (Non-critical)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[22 Jul 2020 14:37] Rafal Somla
Decoding of a utf-8 string fails if the string starts with \xEF. The reason is that the \xEF byte is treated as a BOM marker and skipped, which leaves invalid utf-8 data. This happens in the function str_decode() in <cdk/foundation/string.h>, which invokes rapidjson code to perform the utf-8 conversion. It first creates a rapidjson input stream from the raw bytes that contain the utf-8 data:

  rapidjson::EncodedInputStream<FROM, Mem_stream<char> > input(bytes);

It also creates an output stream and calls rapidjson::Transcoder<>::Transcode(input, output). When EncodedInputStream<> is constructed, its constructor looks for a BOM marker at the beginning of the data and skips it if present:

    EncodedInputStream(InputByteStream& is) : is_(is) { 
        current_ = Encoding::TakeBOM(is_);

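This leads to the issue: for data that starts with \xEF but not with a complete BOM, the leading byte is still consumed, and the remaining bytes are no longer valid utf-8.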
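To illustrate the failure mode, here is a minimal standalone sketch. It is not the rapidjson code; the helper names bom_skip_greedy() and bom_skip_strict() are made up for this example. It assumes a BOM check that consumes bytes one at a time and does not put them back after a partial match, and compares it with a check that only skips a complete \xEF\xBB\xBF sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not rapidjson code: models a BOM check that consumes
// bytes one by one and does not restore them after a partial match.
// Returns the number of leading bytes that would be dropped.
std::size_t bom_skip_greedy(const std::string &bytes)
{
  static const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
  std::size_t pos = 0;
  while (pos < 3 && pos < bytes.size() &&
         static_cast<unsigned char>(bytes[pos]) == bom[pos])
    ++pos;
  assert(pos <= bytes.size());
  return pos;
}

// Hypothetical alternative: drops bytes only if all three BOM bytes are there.
std::size_t bom_skip_strict(const std::string &bytes)
{
  return bytes.compare(0, 3, "\xEF\xBB\xBF") == 0 ? 3 : 0;
}
```

For the bytes from the test case below, \xEF\xBC\x88 (U+FF08), the greedy variant drops the leading \xEF and leaves \xBC\x88, which starts with a continuation byte. The strict variant leaves the data untouched.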

How to repeat:
  Session sess(...);
  auto res = sess.sql("SELECT '\xef\xbc\x88'").execute();
  Value row = res.fetchOne().get(0); // throws an exception here

Suggested fix:
Use a different input stream implementation, one that does not interpret BOM markers. Possibly rapidjson provides such a class.
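
A rough sketch of what such a stream could look like, under stated assumptions: Raw_byte_stream and decode_utf8() are hypothetical names, the Peek/Take/Tell methods only loosely follow the shape of a rapidjson-style input stream, and the decoder is deliberately simplified (no checks for overlong forms or surrogates). This is not the fix that went into the connector, only an illustration of the idea that every byte, including a leading \xEF, reaches the decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stream: hands out every byte unchanged, with no BOM handling.
class Raw_byte_stream
{
  const std::string &m_bytes;
  std::size_t m_pos = 0;
public:
  explicit Raw_byte_stream(const std::string &bytes) : m_bytes(bytes) {}
  bool at_end() const { return m_pos >= m_bytes.size(); }
  // Returns 0 at the end of the data.
  unsigned char Peek() const
  { return at_end() ? 0 : static_cast<unsigned char>(m_bytes[m_pos]); }
  // The caller must check at_end() before calling Take().
  unsigned char Take()
  { assert(!at_end()); return static_cast<unsigned char>(m_bytes[m_pos++]); }
  std::size_t Tell() const { return m_pos; }
};

// Simplified utf-8 decoder reading from the stream above.
// Throws std::runtime_error on an invalid lead or continuation byte.
std::vector<std::uint32_t> decode_utf8(const std::string &bytes)
{
  Raw_byte_stream in(bytes);
  std::vector<std::uint32_t> out;
  while (!in.at_end())
  {
    unsigned char lead = in.Take();
    std::uint32_t cp;
    int extra;  // number of continuation bytes expected
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::runtime_error("invalid utf-8 lead byte");
    for (; extra > 0; --extra)
    {
      if (in.at_end() || (in.Peek() & 0xC0) != 0x80)
        throw std::runtime_error("invalid utf-8 continuation byte");
      cp = (cp << 6) | (in.Take() & 0x3F);
    }
    out.push_back(cp);
  }
  return out;
}
```

With this sketch, the bytes \xEF\xBC\x88 decode to the single code point U+FF08. A real BOM is not stripped either: \xEF\xBB\xBF comes out as U+FEFF, so a caller that wants to drop it has to do so explicitly.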
[23 Jul 2020 18:41] Paul DuBois
Posted by developer:
Fixed in 8.0.22.

String decoding failed for utf-8 strings that began with a \xEF
byte, which was treated as a byte-order mark.