Bug #25701 XML: ExtractValue() using wrong context to calculate XPath position()
Submitted: 18 Jan 2007 18:59 Modified: 4 Oct 2007 17:13
Reporter: Travis Haagen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: XML functions Severity:S3 (Non-critical)
Version:5.1.15-BK OS:Linux (Linux, Win32)
Assigned to: Alexander Barkov CPU Architecture:Any
Tags: ExtractValue, position, XML, XPath

[18 Jan 2007 18:59] Travis Haagen
Description:
When using the XPath short-hand for descendant (//), ExtractValue() appears to be calculating the position of returned nodes using the parent context, while it should be using the root context. For example, it is expected that //b[1] be equivallent to /descendant-or-self::b[1] but MySQL is not treating them as such. For now, the work-around is to use queries like /descendant-or-self::b[1] instead of //b[1] until this bug is fixed.

How to repeat:
EXAMPLE A - //b[1] returns the first and third occurance of b, while it should return only the first. What it is actually doing is returning the first occurance from every parent context it comes across.

mysql> SELECT ExtractValue(
    ->     "<a><b>one</b><b>two</b></a><a><b>three</b></a>",
    ->     "//b[1]") AS RESULT;
+-----------+
| RESULT    |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)

EXAMPLE B - //b[position()=1] has the same problem as //b[1] above

mysql> SELECT ExtractValue(
    ->     "<a><b>one</b><b>two</b></a><a><b>three</b></a>",
    ->     "//b[position()=1]") AS RESULT;
+-----------+
| RESULT    |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)

EXAMPLE C - //b[2] appears to work correctly, but only because there is only one parent context with a second 'b' element

mysql> SELECT ExtractValue(
    ->     "<a><b>one</b><b>two</b></a><a><b>three</b></a>",
    ->     "//b[2]") AS RESULT;
+--------+
| RESULT |
+--------+
| two    |
+--------+
1 row in set (0.00 sec)

EXAMPLE D - //b[3] should return the third occurance of 'b', but returns nothing, because there is no parent context with three 'b' elements

mysql> SELECT ExtractValue(
    ->     "<a><b>one</b><b>two</b></a><a><b>three</b></a>",
    ->     "//b[3]") AS RESULT;
+--------+
| RESULT |
+--------+
|        |
+--------+
1 row in set (0.00 sec)

EXAMPLE D - The work-around is to use queries like /descendant-or-self::b[3], because it is returning the correct result

mysql> SELECT ExtractValue(
    ->     "<a><b>one</b><b>two</b></a><a><b>three</b></a>",
    ->     "/descendant-or-self::b[3]") AS RESULT;
+--------+
| RESULT |
+--------+
| three  |
+--------+
1 row in set (0.00 sec)

Suggested fix:
Use the root-context for calculating position instead of the parent-context for XPath expressions of the form //b[1] or //b[position()=1].
[23 Jan 2007 13:33] Valeriy Kravchuk
Thank you for a bug report. Verified just as described with 5.1.15-BK on Linux:

openxs@suse:~/dbs/5.1> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.1.15-beta Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select extractvalue('<a><b>one</b><b>two</b></a><a><b>three</b></a>', '//b[1]') as result;
+-----------+
| result    |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)

mysql> select extractvalue('<a><b>one</b><b>two</b></a><a><b>three</b></a>', '//b[position()=1]') as result;
+-----------+
| result    |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)

mysql> select extractvalue('<a><b>one</b><b>two</b></a><a><b>three</b></a>', '//b[2]') as result;
+--------+
| result |
+--------+
| two    |
+--------+
1 row in set (0.00 sec)

mysql> select extractvalue('<a><b>one</b><b>two</b></a><a><b>three</b></a>', '//b[3]') as result;
+--------+
| result |
+--------+
|        |
+--------+
1 row in set (0.00 sec)

mysql> select extractvalue('<a><b>one</b><b>two</b></a><a><b>three</b></a>', '//descendant-or-self::b[3]') as result;
+--------+
| result |
+--------+
| three  |
+--------+
1 row in set (0.00 sec)
[4 Sep 2007 17:10] Alexander Barkov
Another problem is that XPath accepts a not-well formed XML.
A well-formed XML must have exactly one root element, while
the value in the example have two root elements "a" and
actually consists of a concatenation of two XML values:

"<a><b>one</b><b>two</b></a>" and "<a><b>three</b></a>".

However, the same problem shows up with a well-formed value as well:

mysql> select extractvalue('<root><a><b>one</b><b>two</b></a><a><b>three</b></a></root>','//b[1]') as result;
+-----------+
| result    |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)

mysql> select extractvalue('<root><a><b>one</b><b>two</b></a><a><b>three</b></a></root>','/descendant-or-self::b[1]') as result;
+--------+
| result |
+--------+
| one    |
+--------+
1 row in set (0.00 sec)
[3 Oct 2007 11:05] Alexander Barkov
It seems this is not really a bug.

Take a look here: http://www.w3.org/TR/xpath

> // is short for /descendant-or-self::node()/. For example, //para is short for
> /descendant-or-self::node()/child::para and so will select any para element in
> the document (even a para element that is a document element will be selected 
> by //para since the document element node is a child of the root node); 
> div//para is short for div/descendant-or-self::node()/child::para and so will 
> select all para descendants of div children.

> NOTE: The location path //para[1] does not mean the same as the location path 
> /descendant::para[1]. The latter selects the first descendant para element; 
> the former selects all descendant para elements that are the first para 
> children of their parents.

So, //b[1] returns all descendant "b" elements that are the first "b" children
of their parents. They are "one" and "three".

Isn't it correct?
[3 Oct 2007 18:35] Travis Haagen
You may be correct that this is expected behavior, but the XPath specification is counterintuitive when it states, "// is short for /descendant-or-self::node()/". One would assume that if they mean the same thing that the position predicate would work the same. Thank you for thoroughly investigating this issue.

I did find an example of how you can achieve the result of //b[1] using 'descendant-or-self' or 'descendant'. Still, in my opinion, this should be explained in more detail in the XPath specification.

mysql>  SELECT ExtractValue(
    ->          "<a><b>one</b><b>two</b></a><a><b>three</b></a>",
    ->          "//b[1]") AS RESULT;
+-----------+
| RESULT    |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)

mysql>  SELECT ExtractValue(
    ->          "<a><b>one</b><b>two</b></a><a><b>three</b></a>",
    ->          "/descendant-or-self::*/b[1]") AS RESULT;
+-----------+
| RESULT    |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)

mysql>  SELECT ExtractValue(
    ->          "<a><b>one</b><b>two</b></a><a><b>three</b></a>",
    ->          "/descendant::*/b[1]") AS RESULT;
+-----------+
| RESULT    |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)
[4 Oct 2007 6:34] Alexander Barkov
I tried a similar script with Oracle 10g:

drop table t1;
create table t1 (x xmltype);
insert into t1 values (xmltype.createxml('<root><a><b>one</b><b>two</b></a><a><b
>three</b></a></root>'));
select Extract(x, '//b[1]') from t1;
select Extract(x, '//b[2]') from t1;
select Extract(x, '/descendant-or-self::*/b[1]') from t1;
select Extract(x, '/descendant-or-self::*/b[2]') from t1;
select Extract(x, '/descendant-or-self::b[1]') from t1;
select Extract(x, '/descendant-or-self::b[2]') from t1;
select Extract(x, '/descendant-or-self::b[3]') from t1;

And the answer was:

SQL> 
EXTRACT(X,'//B[1]')
--------------------------------------------------------------------------------
<b>one</b><b>three</b>

SQL>
EXTRACT(X,'//B[2]')
--------------------------------------------------------------------------------
<b>two</b>

SQL>
EXTRACT(X,'/DESCENDANT-OR-SELF::*/B[1]')
--------------------------------------------------------------------------------
<b>one</b><b>three</b>

SQL>
EXTRACT(X,'/DESCENDANT-OR-SELF::*/B[2]')
--------------------------------------------------------------------------------
<b>two</b>

SQL>
EXTRACT(X,'/DESCENDANT-OR-SELF::B[1]')
--------------------------------------------------------------------------------
<b>one</b>

SQL>
EXTRACT(X,'/DESCENDANT-OR-SELF::B[2]')
--------------------------------------------------------------------------------
<b>two</b>

SQL>
EXTRACT(X,'/DESCENDANT-OR-SELF::B[3]')
--------------------------------------------------------------------------------
<b>three</b>

I.e.  both  '//b[1]' and '/descendant-or-self::*/b[1]'
use positions from the original contexts.
While '/descendant-or-self::b[1]'  re-enumerates the nodes consequently,
starting from 1.

MySQL works the same:

mysql> select ExtractValue('<root><a><b>one</b><b>two</b></a><a><b>three</b></a></root>','//b[1]') as x;
+-----------+
| x         |
+-----------+
| one three |
+-----------+
1 row in set (0.01 sec)

mysql> select ExtractValue('<root><a><b>one</b><b>two</b></a><a><b>three</b></a></root>','/descendant-or-self::*/b[1]') as x;
+-----------+
| x         |
+-----------+
| one three |
+-----------+
1 row in set (0.00 sec)

mysql> select ExtractValue('<root><a><b>one</b><b>two</b></a><a><b>three</b></a></root>','/descendant-or-self::b[3]') as x;
+-------+
| x     |
+-------+
| three |
+-------+
1 row in set (0.00 sec)

So '//b[1]' is in fact equal to '/descendant-or-self::*/b[1]'.

But '/descendant-or-self::b[1]' is something different.
[4 Oct 2007 17:13] Travis Haagen
This appears to not be a bug and I am closing it.