Description:
The following syntax rules are taken from http://www.w3.org/TR/2006/REC-xml-20060816/ :
[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
[75] ExternalID ::= 'SYSTEM' S SystemLiteral
| 'PUBLIC' S PubidLiteral S SystemLiteral
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
Therefore, the following text should be regarded as well-formed XML:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
>
<html>bla</html>
Howeverm, ExtractValue fails to parse this and reports this as "Incorrect XML".
Removing the SystemLiteral yields this document:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>
<html>bla</html>
Note that this is not correct according to the XML specification. Yet, ExtractValue parses this without issue.
This is a serious problem because it does not allow one to parse standards compliant XHTML webpages which is arguably one of the most useful features of the extractvalue function.
How to repeat:
mysql> select extractvalue(
-> '<!DOCTYPE html
'> PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
'> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
'> >
'> <html>bla</html>'
-> ,'/html'
-> );
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| extractvalue(
'<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
>
<html>bla</html>'
,'/html'
) |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NULL |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+-----------------------------------------------------------------------------------------------------------------------+
| Warning | 1522 | Incorrect XML value: 'parse error at line 3 pos 61: '</-//W3C//DTD XHTML 1.0 Transitio>' unexpected ('</EN>' wanted)' |
+---------+------+-----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
FAILS! this is not correct behaviour, it should have succeeded as it it perfectly well-formed XML.
The next example, with not well-formed XML works without issue:
mysql> select extractvalue(
-> '<!DOCTYPE html
'> PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
'> >
'> <html>bla</html>'
-> ,'html'
-> );
+----------------------------------------------------------------------------------------------------------------+
| extractvalue(
'<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>
<html>bla</html>'
,'html'
) |
+----------------------------------------------------------------------------------------------------------------+
| bla |
+----------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Suggested fix:
Please fix this and parse according to standard XML syntax.