Bug #86916 LOAD XML does not decode entities
Submitted: 4 Jul 2017 3:33 Modified: 5 Jul 2017 8:51
Reporter: John Hawkinson Email Updates:
Status: Verified Impact on me:
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:5.7.18 OS:Any (OS 10.10.5)
Assigned to: CPU Architecture:Any
Tags: XML

[4 Jul 2017 3:33] John Hawkinson
MySQL's LOAD XML statement does not consistently parse XML entity encodings when reading in XML. This means

1. Equivalent XML constructs can be loaded as different SQL strings.
2. MySQL strings can end up containing XML entity codes instead of the actual strings they ought to have. This is unexpected and indeed, unwanted.

It looks like & gets properly converted to an ampersand, however 
 is not properly converted to a newline.

(Background: I found this through a much more complicated example, but I have tried to simplify this to the minimum reasonable case that makes sense. In my realworld case, the source XML data file contained data that should be associated with multiple SQL tables. For some of the data, it was convenient to format it with <field> tags as I looped through the XML and did not know what fields might appear. For some other data, there was a strict mapping from particular elements to particular fields, so it was easier to represent them as attributes of a single tag. 

But because the same data imported differently in MySQL in these two ways, it not possible to database joins between the two tables because some have entity cods and others do not.)

How to repeat:
Suppose we start with a single XML file that is uses a schema that is not one of the three formats supported by LOAD XML. Therefore, it has to be transformed into an appropriate format before we begin. Here is input.xml:

    ipsum &amp; dolor</td>

The first <td> element contains a field name, and the second contains the field value.

We transform it using XML Starlet (or raw XSLT) into the
<field name="column1">value1</field>
format that LOAD XML can understand:

$ < input.xml xml sel -t -e row  -m //tr -e field -a name -v td[1] -b -v td[2] > test1.xml
$ cat test1.xml
<row><field name="f1">Lorem
    ipsum &amp; dolor</field></row>

The <field/> tag contains a literal newline.

But we can also transform it into the 
<row column1="value1"/>
format, also known to LOAD XML:

$ < input.xml xml sel -t -e rows  -m //tr -e row -a f1 -v td[2] > test2.xml
$ cat test2.xml
<rows><row f1="Lorem&#10;    ipsum &amp; dolor"/></rows>

In this case, the literal newline is converted into a &#10; entity reference, because it is a quoted string inside a tag's attribute.

Now if we load these two into MySQL, we get different results:

mysql> create table t (f1 varchar(255));
Query OK, 0 rows affected (0.02 sec)
mysql> LOAD XML LOCAL INFILE 'test1.xml' INTO TABLE t;
Query OK, 1 row affected (0.00 sec)
Records: 1  Deleted: 0  Skipped: 0  Warnings: 0

mysql> LOAD XML LOCAL INFILE 'test2.xml' INTO TABLE t;
Query OK, 1 row affected (0.01 sec)
Records: 1  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from t;
| f1                          |
| Lorem
    ipsum & dolor     |
| Lorem&#10;    ipsum & dolor |
2 rows in set (0.00 sec)

In both cases, the &amp; entity was properly decoded to a literal ampersand. The literal newline was properly read from test1.xml, but the entity-coded &#10; newline was not converted from test2.xml.

Suggested fix:
Consistently decode XML entity codes in all variants of LOAD XML.
[5 Jul 2017 8:51] Umesh Shastry
Hello John Hawkinson,

Thank you for the report and test case.