Bug #79993 cmd_query_iter erroneously calls ".encode('utf8')" on bytestrings
Submitted: 14 Jan 2016 23:40
Modified: 12 Jul 2017 15:47
Reporter: Ben Jolitz
Status: Closed
Category: Connector / Python
Severity: S2 (Serious)
Version: all versions
OS: Any
Assigned to:
CPU Architecture: Any
Tags: broken multiquery, invalid use of 'encode', submission failure

[14 Jan 2016 23:40] Ben Jolitz
Description:
To serialize a unicode string, one picks an encoding. Here that encoding is UTF-8.

However, if you call encode('utf8') on a bytestring that was serialized from unicode, Python 2.x will first attempt to /decode/ the bytestring using the 'ascii' codec and then encode the result back using utf8.

If the bytestring contains any non-ASCII bytes (as it will after serializing non-ASCII unicode text), the implicit decode step raises UnicodeDecodeError.
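
The implicit round trip can be imitated on Python 3, where it has to be spelled out. This is only a sketch: py2_style_encode is a hypothetical helper, not part of Python or Connector/Python, and it assumes the Python 2 default codec of 'ascii'.

```python
def py2_style_encode(bytestring, encoding="utf8"):
    """Imitate Python 2's str.encode(): decode as ASCII first, then encode."""
    # Python 2 performs this decode step implicitly with the 'ascii' codec
    text = bytestring.decode("ascii")
    return text.encode(encoding)


serialized = u"square root: \u221a".encode("utf8")

# Pure-ASCII bytestrings survive the round trip, which hides the bug
ascii_result = py2_style_encode(b"SELECT 1")

# Non-ASCII bytes fail in the hidden decode step, not in the encode step
try:
    py2_style_encode(serialized)
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_position = exc.start            # index of the first offending byte
    bad_byte = exc.object[exc.start]    # its value as an integer
```

Because plain ASCII queries pass through unchanged, the problem only surfaces once a statement carries non-ASCII data.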

The incorrect code is inside connection.py, line 515.

Here is the /incorrect/ excerpt:

    def cmd_query_iter(self, statements):
        """Send one or more statements to the MySQL server

        Similar to the cmd_query method, but instead returns a generator
        object to iterate through results. It sends the statements to the
        MySQL server and through the iterator you can get the results.

        statement = 'SELECT 1; INSERT INTO t1 VALUES (); SELECT 2'
        for result in cnx.cmd_query(statement, iterate=True):
            if 'columns' in result:
                columns = result['columns']
                rows = cnx.get_rows()
            else:
                # do something useful with INSERT result

        Returns a generator.
        """
        if not isinstance(statements, bytearray):
            if isstr(statements):
                statements = bytearray(statements.encode('utf-8'))
            else:
                statements = bytearray(statements)

        # Handle the first query result
        yield self._handle_result(self._send_cmd(ServerCmd.QUERY, statements))

        # Handle next results, if any
        while self._have_next_result:
            self.handle_unread_result()
            yield self._handle_result(self._socket.recv())

On Python 2, isstr does 'isinstance(obj, basestring)', which returns True for both unicode strings and bytestrings, so already-encoded bytestrings also reach the encode('utf-8') call.
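
Why a basestring-style check is too broad can be shown with a stand-in. The sketch below is written for Python 3, which has no basestring, so BASESTRING_TYPES, isstr_py2_like and is_text are hypothetical names that only mimic the Python 2 behaviour described above.

```python
# Hypothetical stand-in for Python 2's basestring, which covers both
# unicode text and bytestrings
BASESTRING_TYPES = (str, bytes)


def isstr_py2_like(obj):
    """Mimic isstr on Python 2: True for text and for bytestrings alike."""
    return isinstance(obj, BASESTRING_TYPES)


def is_text(obj):
    """The narrower check that encode() actually needs: text only."""
    return isinstance(obj, str)


query_text = u"SELECT '\u221a'"
query_bytes = query_text.encode("utf8")

# The broad check cannot tell the two apart...
broad = (isstr_py2_like(query_text), isstr_py2_like(query_bytes))
# ...while the narrow check singles out the value that still needs encoding
narrow = (is_text(query_text), is_text(query_bytes))
```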

How to repeat:
To demonstrate the Python 2 unicode issue, here is an example of what happens when you call encode on a bytestring.

I would like to note that by the time cmd_query_iter is called, the cursor has already converted the unicode query into a bytestring, so cmd_query_iter always sees a bytestring.

(cpython27) BenJolitz-Laptop:~/software$ python
Python 2.7.10 (default, Oct 23 2015, 18:05:06)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> a = u'This string has a unicode char (\u221a) which is the squareroot'
>>> type(a)
<type 'unicode'>
>>> bytestring = a.encode('utf8')
>>> type(bytestring)
<type 'str'>
>>> # the string is now a bytestring. What happens if we call encode on it again?
>>> bytestring.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)
>>> ^D

Suggested fix:
Change:

from .catch23 import PY2, isstr

To:

from .catch23 import PY2, isstr, UNICODE_TYPES

And write cmd_query_iter like this:

    def cmd_query_iter(self, statements):
        """Send one or more statements to the MySQL server

        Similar to the cmd_query method, but instead returns a generator
        object to iterate through results. It sends the statements to the
        MySQL server and through the iterator you can get the results.

        statement = 'SELECT 1; INSERT INTO t1 VALUES (); SELECT 2'
        for result in cnx.cmd_query(statement, iterate=True):
            if 'columns' in result:
                columns = result['columns']
                rows = cnx.get_rows()
            else:
                # do something useful with INSERT result

        Returns a generator.
        """
        if not isinstance(statements, bytearray):
            if isstr(statements):
                # Detect if the incoming string is unicode and serialize it
                # into a UTF8 bytestring
                if isinstance(statements, UNICODE_TYPES):
                    statements = statements.encode('utf8')
                statements = bytearray(statements)
            else:
                statements = bytearray(statements)

        # Handle the first query result
        yield self._handle_result(self._send_cmd(ServerCmd.QUERY, statements))

        # Handle next results, if any
        while self._have_next_result:
            self.handle_unread_result()
            yield self._handle_result(self._socket.recv())
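
The type handling in the patch can be exercised on its own, without a server connection. A minimal sketch, assuming Python 3 names: TEXT_TYPES stands in for UNICODE_TYPES and to_statement_bytes is a hypothetical helper. Neither is taken from the connector.

```python
TEXT_TYPES = (str,)  # assumed stand-in for catch23.UNICODE_TYPES


def to_statement_bytes(statements):
    """Return statements as a bytearray, encoding only real text."""
    if isinstance(statements, bytearray):
        return statements
    if isinstance(statements, TEXT_TYPES):
        # Text still has to be serialized; bytes are already serialized
        statements = statements.encode("utf8")
    return bytearray(statements)


text = u"SELECT '\u221a'"
expected = bytearray(b"SELECT '\xe2\x88\x9a'")

# All three input forms should end up as the same bytearray
from_text = to_statement_bytes(text)
from_bytes = to_statement_bytes(text.encode("utf8"))
from_bytearray = to_statement_bytes(bytearray(text.encode("utf8")))
```

The point of the check is that only text is encoded, so a bytestring handed over by the cursor is wrapped as-is instead of being encoded a second time.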
[19 Jan 2016 7:56] Chiranjeevi Battula
Hello Ben Jolitz,

Thank you for the bug report.
Verified as described, with help from the developers.

Thanks,
Chiranjeevi.
[19 Jan 2016 7:57] Chiranjeevi Battula
Error Message :

Traceback (most recent call last):
  File "D:\Python\79993.py", line 25, in <module>
    bytestring.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)
[19 Jan 2016 20:55] Ben Jolitz
In this context, "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)"

means that encode('utf8') was called on a bytestring: Python 2 first attempts to decode the bytestring as 'ascii' (not utf8), and only then would it encode the resulting unicode string back into a bytestring with the utf8 encoding.
[8 Mar 2016 1:11] Ben Jolitz
This bug is now well over 49 days old.

The patch given works.

I do not understand the blockage.
[12 Jul 2017 15:47] Paul DuBois
Posted by developer:
 
Fixed in 2.1.7.

With Python 2.x, for a call to encode('utf8') on a bytestring that
was serialized from unicode, Python attempted to decode the string
using the 'ascii' codec and encode it back using 'utf8'. The result
was encoding failure for bytestrings that contain non-ASCII
characters.