Bug #33761 ExtractValue: parse fails with valid prologue, succeeds with invalid prologue!
Submitted: 9 Jan 2008 10:25 Modified: 1 Feb 2008 10:01
Reporter: Roland Bouman Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: XML functions Severity:S2 (Serious)
Version:5.1.22 OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[9 Jan 2008 10:25] Roland Bouman
Description:
The following syntax rules are taken from http://www.w3.org/TR/2006/REC-xml-20060816/ :

[1] document ::= prolog  element  Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl  Misc*)?
[28] doctypedecl ::= '<!DOCTYPE' S  Name (S  ExternalID)? S? ('[' intSubset ']' S?)? '>'
[75] ExternalID ::= 'SYSTEM' S  SystemLiteral
                  | 'PUBLIC' S PubidLiteral S SystemLiteral
[11]   	SystemLiteral	   ::=   	('"' [^"]* '"') | ("'" [^']* "'")
[12]   	PubidLiteral	   ::=   	'"' PubidChar* '"' | "'" (PubidChar - "'")* "'"

Therefore, the following text should be regarded as well-formed XML:

<!DOCTYPE html 
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
>
<html>bla</html>

Howeverm, ExtractValue fails to parse this and reports this as "Incorrect XML".

Removing the SystemLiteral yields this document:

<!DOCTYPE html 
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
>
<html>bla</html>

Note that this is not correct according to the XML specification. Yet, ExtractValue parses this without issue.

This is a serious problem because it does not allow one to parse standards compliant XHTML webpages which is arguably one of the most useful features of the extractvalue function.

How to repeat:
mysql> select extractvalue(
    -> '<!DOCTYPE html
    '>   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    '>   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    '> >
    '> <html>bla</html>'
    -> ,'/html'
    -> );
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| extractvalue(
'<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
>
<html>bla</html>'
,'/html'
) |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NULL                                                                                                                                                                         |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                               |
+---------+------+-----------------------------------------------------------------------------------------------------------------------+
| Warning | 1522 | Incorrect XML value: 'parse error at line 3 pos 61: '</-//W3C//DTD XHTML 1.0 Transitio>' unexpected ('</EN>' wanted)' |
+---------+------+-----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

FAILS! this is not correct behaviour, it should have succeeded as it it perfectly well-formed XML.

The next example, with not well-formed XML works without issue:

mysql> select extractvalue(
    -> '<!DOCTYPE html
    '>   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    '> >
    '> <html>bla</html>'
    -> ,'html'
    -> );
+----------------------------------------------------------------------------------------------------------------+
| extractvalue(
'<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>
<html>bla</html>'
,'html'
) |
+----------------------------------------------------------------------------------------------------------------+
| bla                                                                                                            |
+----------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Suggested fix:
Please fix this and parse according to standard XML syntax.
[1 Feb 2008 10:01] Valeriy Kravchuk
Thank you for a bug report. Verified just as described with 5.1.22.