Description:
To serialize a unicode string, one has to pick an encoding; UTF-8 is the usual choice.
However, if you call encode('utf8') on a bytestring that was already serialized from unicode, Python 2.x first implicitly /decodes/ the bytestring using the 'ascii' codec and then encodes the result back using UTF-8.
If the bytestring contains any non-ASCII bytes (as it will whenever the original unicode string had non-ASCII characters), the implicit decode raises a UnicodeDecodeError.
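The implicit decode-then-encode step can be made explicit in a small sketch. The helper name py2_style_encode is hypothetical and is not part of the connector; it only spells out what Python 2 does behind the scenes, assuming the interpreter's default encoding is left at 'ascii'. Written this way, the snippet behaves the same on Python 2 and Python 3.

```python
def py2_style_encode(bytestring, encoding='utf8'):
    """Hypothetical helper: spell out the two steps Python 2 performs
    when encode() is called on a bytestring (str)."""
    # Step 1: implicit decode with the default 'ascii' codec.
    text = bytestring.decode('ascii')
    # Step 2: the encode that was actually requested.
    return text.encode(encoding)


ascii_only = u'SELECT 1'.encode('utf8')
non_ascii = u'SELECT "\u221a"'.encode('utf8')

# Pure-ASCII bytes survive the round trip unchanged.
round_tripped = py2_style_encode(ascii_only)

# Non-ASCII bytes fail in step 1, before any encoding happens.
try:
    py2_style_encode(non_ascii)
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
```

This is why queries containing only ASCII appear to work while any query with a non-ASCII character fails.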
The incorrect code is inside connection.py, line 515.
Here is the /incorrect/ excerpt:
    def cmd_query_iter(self, statements):
        """Send one or more statements to the MySQL server

        Similar to the cmd_query method, but instead returns a generator
        object to iterate through results. It sends the statements to the
        MySQL server and through the iterator you can get the results.

        statement = 'SELECT 1; INSERT INTO t1 VALUES (); SELECT 2'
        for result in cnx.cmd_query(statement, iterate=True):
            if 'columns' in result:
                columns = result['columns']
                rows = cnx.get_rows()
            else:
                # do something useful with INSERT result

        Returns a generator.
        """
        if not isinstance(statements, bytearray):
            if isstr(statements):
                statements = bytearray(statements.encode('utf-8'))
            else:
                statements = bytearray(statements)

        # Handle the first query result
        yield self._handle_result(self._send_cmd(ServerCmd.QUERY, statements))

        # Handle next results, if any
        while self._have_next_result:
            self.handle_unread_result()
            yield self._handle_result(self._socket.recv())
On Python 2, isstr does 'isinstance(object, basestring)', which returns True for both unicode strings and bytestrings, so a bytestring also reaches the encode('utf-8') call above.
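The difference between the two type checks can be sketched as follows. The names PY2, isstr and UNICODE_TYPES are taken from this report, but the bodies below are an assumed reconstruction, not the verbatim contents of catch23; STRING_TYPES and isunicode are hypothetical names added for illustration.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    # basestring is the common base class of str (bytes) and unicode.
    STRING_TYPES = (basestring,)  # noqa: F821 (Python 2 only)
    UNICODE_TYPES = (unicode,)    # noqa: F821 (Python 2 only)
else:
    STRING_TYPES = (str,)
    UNICODE_TYPES = (str,)


def isstr(obj):
    """Assumed shape of isstr: True for any string-like object."""
    return isinstance(obj, STRING_TYPES)


def isunicode(obj):
    """Hypothetical stricter check: True only for text that still
    needs to be encoded."""
    return isinstance(obj, UNICODE_TYPES)


text_query = u'SELECT 1'
byte_query = b'SELECT 1'

# On Python 2 both objects pass isstr, but only text_query passes
# isunicode; on Python 3 byte_query passes neither check.
text_checks = (isstr(text_query), isunicode(text_query))
byte_checks = (isstr(byte_query), isunicode(byte_query))
```

Only the stricter check tells the caller whether encode() is safe to call.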
How to repeat:
To demonstrate the Python 2 unicode issue, here is what happens when you call encode on a bytestring.
Note that by the time cmd_query_iter is called, the cursor has already converted the unicode query into a bytestring, so cmd_query_iter always sees a bytestring.
(cpython27) BenJolitz-Laptop:~/software$ python
Python 2.7.10 (default, Oct 23 2015, 18:05:06)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> a = u'This string has a unicode char (\u221a) which is the squareroot'
>>> type(a)
<type 'unicode'>
>>> bytestring = a.encode('utf8')
>>> type(bytestring)
<type 'str'>
>>> # the string is now a bytestring. What happens if we call encode on it again?
>>> bytestring.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)
>>> ^D
Suggested fix:
Change:
    from .catch23 import PY2, isstr
To:
    from .catch23 import PY2, isstr, UNICODE_TYPES
And rewrite cmd_query_iter as follows:
    def cmd_query_iter(self, statements):
        """Send one or more statements to the MySQL server

        Similar to the cmd_query method, but instead returns a generator
        object to iterate through results. It sends the statements to the
        MySQL server and through the iterator you can get the results.

        statement = 'SELECT 1; INSERT INTO t1 VALUES (); SELECT 2'
        for result in cnx.cmd_query(statement, iterate=True):
            if 'columns' in result:
                columns = result['columns']
                rows = cnx.get_rows()
            else:
                # do something useful with INSERT result

        Returns a generator.
        """
        if not isinstance(statements, bytearray):
            if isstr(statements):
                # Detect if the incoming string is unicode and serialize it
                # into a UTF8 bytestring
                if isinstance(statements, UNICODE_TYPES):
                    statements = statements.encode('utf8')
                statements = bytearray(statements)
            else:
                statements = bytearray(statements)

        # Handle the first query result
        yield self._handle_result(self._send_cmd(ServerCmd.QUERY, statements))

        # Handle next results, if any
        while self._have_next_result:
            self.handle_unread_result()
            yield self._handle_result(self._socket.recv())
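The conversion logic of the suggested fix can be checked in isolation, without a server connection. The sketch below pulls it into a standalone function; to_query_bytes is a hypothetical name, and UNICODE_TYPES is defined locally here as an assumption about what the catch23 import provides. The isstr check is dropped because both remaining branches end in the same bytearray() call.

```python
import sys

if sys.version_info[0] == 2:
    UNICODE_TYPES = (unicode,)  # noqa: F821 (Python 2 text type)
else:
    UNICODE_TYPES = (str,)


def to_query_bytes(statements):
    """Hypothetical standalone version of the conversion step."""
    if isinstance(statements, bytearray):
        # Already in the form the socket layer expects.
        return statements
    if isinstance(statements, UNICODE_TYPES):
        # Only real text is encoded; bytestrings are left alone.
        statements = statements.encode('utf8')
    return bytearray(statements)


query = u'SELECT "\u221a"'

# A unicode query is encoded exactly once.
from_text = to_query_bytes(query)

# A query that was already encoded (as the cursor does before calling
# cmd_query_iter) passes through without a second encode.
from_bytes = to_query_bytes(query.encode('utf8'))
```

Both calls should produce the same bytearray, which is the behavior the unpatched method fails to provide on Python 2.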