Description:
The "Info" column of SHOW PROCESSLIST, which holds the query text, can contain invalid UTF-8 bytes even though the column is declared as a UTF-8 varchar. This can lead to data corruption or a crash in the client, because it breaks the expectation that MySQL only returns bytes that are validly encoded in the specified character set.
How to repeat:
*** ON CLIENT A ***
1) Connect to the server
2) Execute: SET NAMES utf8mb4;
3) Using the C API mysql_real_query(), execute the query where the query bytes given are:
53454C45 4354205F 62696E61 72792027 EDAFB327 2C20736C 65657028 333029
… which contains bytes that are invalid in utf8mb4 (ED AF B3 is the encoding of a surrogate, U+DBF3) and decodes to:
SELECT _binary '<invalid utf-8 bytes>', sleep(30)
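The query bytes for step 3 can be rebuilt from the hex dump with a small helper. This is a minimal sketch: the names QUERY_HEX, hex_value and decode_hex are hypothetical and not part of the MySQL C API; only the hex string itself comes from this report.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hex dump of the query, copied verbatim from the report. */
const char QUERY_HEX[] =
    "53454C45 4354205F 62696E61 72792027 EDAFB327 2C20736C 65657028 333029";

/* Returns the value of one hex digit, or -1 if c is not a hex digit. */
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decodes whitespace-separated hex digits into out (capacity cap).
 * Returns the number of bytes written, or -1 on malformed input
 * or if out is too small. */
long decode_hex(const char *hex, unsigned char *out, size_t cap) {
    assert(hex != NULL && out != NULL);
    size_t n = 0;
    while (*hex != '\0') {
        if (isspace((unsigned char)*hex)) { hex++; continue; }
        int hi = hex_value(hex[0]);
        int lo = (hi < 0) ? -1 : hex_value(hex[1]);
        if (hi < 0 || lo < 0 || n == cap) return -1;
        out[n++] = (unsigned char)((hi << 4) | lo);
        hex += 2;
    }
    return (long)n;
}
```

With an already connected MYSQL *conn, the decoded buffer and its length would then be passed as mysql_real_query(conn, (const char *)buf, (unsigned long)len). The explicit length matters because the buffer is raw bytes, not a validated string.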
*** ON ANY OTHER CLIENT ***
1) Connect to the server
2) Execute: SET NAMES utf8mb4;
3) Execute: SHOW PROCESSLIST
4) Note that the "Info" column of the row for the query executed by Client A contains invalid UTF-8 bytes, which a client that validates the declared character set cannot decode.
• The 'Info' column of the result is:
field->name = "Info"
field->length = 400
field->flags = 0
field->decimals = 39
field->charsetnr = 45
field->type = MYSQL_TYPE_VAR_STRING
• Note: SHOW COLLATION WHERE Id = 45;
==> utf8mb4_general_ci,utf8mb4,45,Yes,Yes,1
• The bytes returned in the MYSQL_ROW for the Info field are exactly as above:
53454C45 4354205F 62696E61 72792027 EDAFB327 2C20736C 65657028 333029
… which is invalid UTF-8.
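To check the claim on the client side, a validating client might run something like the following over the returned field bytes. This is a sketch that assumes utf8mb4 validity is the same as well-formed UTF-8 per RFC 3629; is_valid_utf8 is a hypothetical helper, not a MySQL function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if buf[0..len) is well-formed UTF-8 per RFC 3629:
 * no overlong forms, no surrogates (U+D800..U+DFFF), nothing above U+10FFFF. */
bool is_valid_utf8(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b0 = buf[i];
        size_t need;                          /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;   /* range allowed for the first one */

        if (b0 <= 0x7F) { i++; continue; }    /* ASCII */
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2;
            if (b0 == 0xE0) lo = 0xA0;        /* rejects overlong 3-byte forms */
            if (b0 == 0xED) hi = 0x9F;        /* rejects surrogates */
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3;
            if (b0 == 0xF0) lo = 0x90;        /* rejects overlong 4-byte forms */
            if (b0 == 0xF4) hi = 0x8F;        /* rejects values above U+10FFFF */
        } else {
            return false;                     /* 0x80..0xC1 or 0xF5..0xFF */
        }

        if (len - i - 1 < need) return false; /* truncated sequence */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return false;
        for (size_t k = 2; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return false;
        }
        i += need + 1;
    }
    return true;
}
```

For the bytes ED AF B3, the lead byte ED only permits a second byte in the range 80..9F, so AF is rejected and the function returns false for the whole field.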
### The Problem ###
Clients that faithfully validate the returned bytes as UTF-8 are then forced to choose between:
a) crash on invalid UTF-8
b) detect the invalid bytes, interpret the returned bytes with something like latin1 encoding, and then convert _that_ to UTF-8, which results in malformed data.
Neither is a good solution. For generalized clients executing any user-provided query, allowing option b for any field of any query could easily lead to data corruption.
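To illustrate why option b corrupts data, here is a sketch of the fallback, assuming "latin1" means ISO-8859-1 (each byte maps to the code point with the same value). latin1_to_utf8 is a hypothetical helper, not a MySQL function.

```c
#include <assert.h>
#include <stddef.h>

/* Re-encodes len bytes of ISO-8859-1 as UTF-8 into out (capacity cap).
 * Returns the number of bytes written, or 0 if out is too small. */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out, size_t cap) {
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            /* ASCII is copied unchanged. */
            if (n + 1 > cap) return 0;
            out[n++] = in[i];
        } else {
            /* U+0080..U+00FF becomes a two-byte UTF-8 sequence. */
            if (n + 2 > cap) return 0;
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}
```

Fed the three bytes ED AF B3, this produces the six bytes C3 AD C2 AF C2 B3. The result is valid UTF-8, but it no longer matches the bytes that were actually in the query, and the same silent rewrite would apply to any other field handled this way.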
Suggested fix:
If MySQL is to accept arbitrary bytes in any query it executes (including bytes that are invalid in the connection's character set), then the "Info" column of SHOW PROCESSLIST needs to be a binary blob, since its bytes could be invalid for the specified character set.