MySQL Bugs: #18134: XML: ExtractValue() should distinguish between absent and empty elements

Bug #18134	XML: ExtractValue() should distinguish between absent and empty elements
Submitted:	10 Mar 2006 13:19	Modified:	19 May 2006 7:46
Reporter:	Roland Bouman	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S4 (Feature request)
Version:	5.1.7 beta	OS:	Windows (winxp pro)
Assigned to:	Jon Stephens	CPU Architecture:	Any

Description:
The ExtractValue() funtion returns the text content of the node(s) selected by the XPath expression argument. If the XPath expression matches nothing, an empty string is returned, making it impossible to distinguish between a match with an existing node without text content.

It was expected that a NULL value would be returned.

The benefit of returning a NULL value would be that it would provide a simple, convenient and intuitive way to check for the existance of a node.

How to repeat:
mysql> select extractValue('<tag>text</tag>','*[2]/text()') = '' as c;
+---+
| c |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql> select extractValue('<tag></tag>','*[1]/text()') = '' as c;
+---+
| c |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

Suggested fix:
Please return NULL in case the XPath expression did not match any nodes.

changed the title

(Following based on recent email discussion with Bar, Reggie, PeterG, et al.)

According to http://www.w3.org/TR/xpath#function-string -

"A node-set is converted to a string by returning the string-value of the 
node in the node-set that is first in document order. If the node-set is 
empty, an empty string is returned."

It doesn't say anything about showing *why* the node-set was empty. You'll have to use count() or some other means (e.g. REGEXP, SUBSTRING) to determine the reason.

I've changed the category to Docs and assigned to myself. 

TODO: Update the ExtractValue() function description to make it clear that, since this is what the XPath spec says, this is what we do as well. (And show suggested solution to problem.)

John, 

the link you are providing points to the description of the XPath string() function. Naturally, if my example would read something like:

ExtractValue('<tag>text</tag>','string(*[2]/text())') 

the result should be an empty string.

However, my sample code does not use the string function. This is what we have:

XML: <tag>text</tag>
XPath: *[2]/text()

the result should be an empty nodeset according to the spec (http://www.w3.org/TR/xpath#node-tests), and IE and Mozilla seem to agree on this (at least in the context of xslt - files attached, open the xml in the browser). 

Now, I know that in MySQL the ExtractValue() function always returns the result as a (MySQL) character type - there just isn't a datatype that could pass as a true nodeset. So it's clear that some conversion must take place. However, I do NOT see why that final conversion to a character type should be performed according to the XPath string() function. There is no standard that compels MySQL to do so, because there is no standard that compels you to do the final conversion to a character type in the first place.
From that perspective, I think it is a mistake to choose the semantics of the XPath string() function to implement the final conversion, as it makes it much less convenient to distinguish between XPaths that evaluate to an empty node set and XPaths that genuinly evaluate to the empty string.
If you still think that this is only a documentation issue, then please explicitly describe that the final conversion of nodesets to strings has the semantics of the XPath string() function, so that it is as if each XPath passed to ExtractValue has a string({actual argument XPath expression}) implicitly wrapped around it.

Hope I'm making my point clear

XML file containing the expression

Attachment: test.xml (text/plain), 87 bytes.

xslt accompanying the xm. Open the xml in Internet explorer or Mozilla or firefox

Attachment: test.xslt (text/plain), 597 bytes.

Hi Roland,

NULL is not appropriate here, since in SQL it means "unknown", and this is not the same as saying that we know there's no match. We may not know *why* there's no match, but the fact that there isn't one is not in doubt.

ExtractValue() can be thought of as working by appending a call to text() to the XPath expression passed to it as an argument. Thus, ExtractValue(@xml, '/a/b') produces the same result as ExtractValue(@xml, '/a/b/text()'). If you test /a/b/text() against either <a><b/></a> or <a><c>5</c></a> the result is the same - an empty nodeset - and in both cases we represent this as an empty string, meaning "none".

John, 

thanks for taking the time to elaborate. 
Your argument concerning SQL null semantics is understandable, but I think not entirely correct. I've only got a draft of the 2003 standard, but that sais:

..the null value is neither equal to any other value nor not equal to any other value — it is unknown whether or not it is equal to any given value..

(ISO/IEC 9075-1:2003 (E), p 15)

and also:

"A value of the Boolean data type is either true or false. The truth value of unknown is sometimes represented by the null value."

(ISO/IEC 9075-1:2003 (E), p 16)

So only in a boolean context can it mean "unknown" in a literal sense. I don't think the boolean context applies here, so it must mean "..unknown whether or not it is equal to any given value..."

Now, I do agree that to know that there was no match means a lot more than that it is unknown wheter it is equal to a given value: we know for sure that there is _no value at all_ . 

There is of course also another interpretation of SQL null that claims that NULL actually means "missing value" or "not applicable". To me it seems that this interpretation could quite appropriately represent the empty nodeset given that ExtractValue() tries to extract the *value*. 

Anyway, I feel that both interpretations of NULL are valid, and that means that if MySQL sticks with the first interpretation, it is not appropriate to return the NULL (no matter how convenient it may seem to at least me).

As for your explanation of the text-conversion behaviour of ExtractValue, I do not want to offend you, but I think that it is incorrect. Consider this:

XML: <tag><b/></tag>
XPath: count(//b)

Suppose we stick to your interpretation, ExtractValue would actually execute this:

count(//b)/text()

Well, both IE and FireFox implementations regard this as invalid. Although I cannot find anything in the XPath spec that sais they should regard it as invalid, we can both see that this XPath expression could _never_ return the number of b elements. That's because count() evaluates to a number. /text() finds text *nodes*, and since count does not return a nodeset, it should find none here.

On the other hand, if we assume that ExtractValue() actually does something like this:

string(count(//b))

The result will check out. 

(Of course, neither the string() nor the text() explanation adequately describe the exact behaviour of ExtractValue in case the xpath expression returns a nodeset containing multiple nodes. ExtractValue returns a space separated list of all text values which is at least not compliant with the XPath 1.0 spec according to which only the first value should be returned)

Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

Additional info:

1. "NULL is not equal (or unequal) to any known value" -> stands for UNKNOWN.

2. If we don't know whether a Boolean value is true or false, then we may use NULL to indicate that this is the case.

This does *not* mean that NULL is restricted to use in a Boolean context.

As for how ExtractValue() handles multiple matches, that's a separate issue and not really relevant to this bug. However, there are processors that handle multiple matches in this manner, so such behaviour is not without precedent - and this already indicated in the XML Functions section of the 5.1 Manual).

---

Following a lengthy discussion with PeterG, Bar, and Reggie, I've updated the ExtractValue() function description to indicate that the fact that we do not distinguish between "no matching element" and "matching element contains no child text nodes" is intentional. I've also provided an example (verified in 5.1.11-bk) showing how to discover which is the case.

John, 

thanks for taking the time.

The only reason for me to mention the multiple value case is to show that it is not really possible to describe how ExtractValue converts from nodeset to MySQL character type - at least not using XPath terminology.