MySQL Bugs: #96163: Use cases for variable assignment in DML statements- do good alternatives exist?

Bug #96163	Use cases for variable assignment in DML statements- do good alternatives exist?
Submitted:	10 Jul 2019 18:30	Modified:	15 Jul 2019 12:31
Reporter:	Dean Trower	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S4 (Feature request)
Version:	8+	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
I understand that the ability to assign user-defined variables within a SELECT OR UPDATE statement, such as:

SELECT @var:=colvalue FROM table WHERE...

is deprecated starting from MySQL v8.

I do understand that there are some good reasons for this, including at least:

*  Order of column evaluation is nondeterministic, so that, e.g.

SET @v:=5;
SELECT @v, @v:=10;

might return [5,10] or it might return [10,10].

*  Variable types are checked before statements are evaluated, so changing a type can have unintended/unexpected consequences.

*  Order of *row* evaluation need not match the final order sent to the client.  (see stack overflow question:  https://stackoverflow.com/questions/56939034/)

It might be expected that here:

SELECT @v:=columnValue, ... WHERE <condition matching multiple rows>
SELECT @v;

the second SELECT would return the columnValue of the *last* row returned by the first SELECT.  This is usually true, but not always, because DISTINCT, and GROUP BY or ORDER BY with a column reference can trigger column evaluation prior to re-sorting of the rows.  e.g., say table t has 1000 rows:

SET @n:=0;
SELECT @n:=@n+1 FROM t ORDER BY 1 DESC LIMIT 5;
SELECT @n;

Returned result sets are:

1000
 999
 998
 997
 996

and

1000

That would probably be expected given the explicit ORDER BY, but here:

SELECT DISTINCT IFNULL(@n:=Name,'unknown') FROM people ORDER BY <some non-indexed expression> LIMIT 10;
SELECT @n;

The second select can return a value not even included in the set returned by the first, which probably isn't at all what's expected!

[NOTE:  The docs don't mention this issue.  They really, really should.]

----------------------------------------------------------------------------

However, the abovementioned problems notwithstanding, I'd like to point out that:

*  LOTS of existing code assigns variables in SELECT (so removing the ability to do so would break a great many DB applications... though I'm sure you know that already),

*  While window functions can sometimes be used for some of the same purposes, there remain valid use cases for assigning variables within a SELECT.

and

*  Yes, as a last resort you could always do this stuff with cursors... but that's kind of the nuclear option, to be used when all else fails, and never, ever desirable, the way I see it...

How to repeat:

I'd like to illustrate some of those use cases:

(1)  Simple running calculation over rows:

SET @runningTotal:=0;
SELECT value, @runningTotal:=@runningTotal+value FROM table WHERE ...

This could be replaced with the use of SUM() as a window function.

(2)  Less simple running calculation over rows (assume CRC32lookup() is a stored function with 1 argument):

SET @crc32 := 0xFFFFFFFF+0;
SELECT byteValue, CAST((@crc32 := (@crc32>>8)^CRC32lookup((@crc32^byteValue)&255))^(0xFFFFFFFF+0) AS UNSIGNED) AS runningCRC32 FROM lots_of_bytes WHERE ... ORDER BY ...

Which computes a running CRC32 of all the values in the `byteValue` column as retrieved by the query.

I DON'T think this could be replaced by using window functions at the moment.
It probably could be if you introduced the ability to use stored functions as window (and ideally also as aggregate) functions.

(3)  Simple running calculation over rows, in an UPDATE:

SET @n:=0;
UPDATE t SET idx=(@n:=@n+1) ORDER BY valueCol;

Since window functions can't be used directly in an update, this can be replace by a window function ONLY if there's a unique or primary key with which to join, i.e. if you can do:

UPDATE t INNER JOIN (SELECT RowID,ROW_NUMBER() OVER (ORDER BY valueCol) AS idx) t2 USING (RowID) SET t.idx=t2.idx;

If there's no such unique RowID column, window functions can't be used at all.

(4)  More complex running calculation over rows, in an UPDATE:

SET @prev:=0,@prev2:=1;
UPDATE t SET fibonacci=(@prev := (@prev2+(@prev2 := @prev))) ORDER BY someCol;
SELECT fibonacci FROM t ORDER BY someCol,fibonacci;  # returns the fibonacci sequence!                        

While this looks like it might be a good candidate for "IFNULL(SUM(fibonacci) OVER (ORDER BY someCol ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING),1), actually a combination of various restrictions in MySQL makes this impossible to do with window functions at all, AFAIK.  Here the updates have to be done in a specific order, and only single-table UPDATE syntax supports ORDER BY.  So because it can't be in the UPDATE itself, the window function would have to go in a subquery of the form: UPDATE t SET fibonacci=(subquery returning a scalar) ORDER BY...  But same-table subqueries are not permitted in UPDATE statements.  You can nest it within an outer subquery to materialize it, but that means the entire subquery is calculated before the update begins, which means that the window function can't see the updates to the previous two rows while it's calculating the value for a given row; so it just doesn't work!
Also once again you can't use window functions if you don't have a unique column in the table.  This is true *even* if you could use the window function inside the UPDATE directly, because you need to be able to define a total order over all rows in the table, or the order in which you do the UPDATE in might differ from the order of rows used by the window function, if someCol contains duplicate values.  To ensure these two orderings are guaranteed to be in sync, both the window function and the UPDATE itself need to be ORDER BY SomeCol,...,UniqueCol.  (Or at least you need to be sure that the set of columns you're ordering on never contains a duplicate set of values for any two DB rows; and using a unique column at the end of the ORDER BY is the simplest way to guarantee that).
So even if you could do this with window functions, it becomes significantly more complex (and error-prone if you forget about total sort order, etc).

(5)  Return a row to the client, but ALSO save some column information (or some calculations derived therefrom) for later use on the SQL side:

SELECT a, b, c, @d:=d, @e:=e, <complex expression involving (@x := <complex function of column data>), ..., y, x FROM table WHERE <complicated WHERE clause resolving to a single row>

This is a really simple and (probably?) really common thing to do; you SELECT something from a single row, while also populating some variables with data from that row for use in later SQL statements.
While there are indeed other ways to do this (e.g. a SELECT to send the data to the client followed by a separate SET or SELECT...INTO to populate the server-side user variables), they are nowhere near as concise or efficient.

[ As an aside, MariaDB supports the very useful feature of using LAST_VALUE() as a normal non-window function that evaluates it's arguments in order and simply returns the last one, which lets you assign column values or expressions to variables without also returning them, i.e. you can write:

SELECT a, b, c, @d:=d, LAST_VALUE(@e:=e, @x:=someExpression(a,b,c,...z), f) FROM table WHERE RowID=...

to return columns a,b,c,d,f to the client and simultaneously assign variables @d, @e, and @x from the same row, WITHOUT also sending the last two of these to the client.
This would be a nice feature for MySQL to implement also (preferably with corresponding FIRST_VALUE() functionality as well). ]

Suggested fix:

----------------------------------------------------------------------------

>>>>  So, what I am actually "feature request"ing here is that MySQL not deprecate or remove useful functionality that can't be efficiently or concisely duplicated in other ways!  <<<<

----------------------------------------------------------------------------

i.e. Either deprecate variable assignment in SELECT/UPDATE *only* in a restricted range of cases (e.g. in SELECT where column evaluation order makes it ambiguous and also where row evaluation before re-ordering would lead to unexpected results, say...), ***OR*** provide alternative functionality to accomplish all of what I've described above.

For performing running calculations on rows as they are returned, window functions offer an obvious (and indeed in many ways more powerful) alternative, but to actually duplicate what can be done with variables, you would need to be able to use stored functions as window functions (and preferably as aggregate functions as well, just by-the-by).  Even then, it's more difficult, because you have to actually define a stored function first, rather than just issue the SQL SELECT you want.  Perhaps there could be some syntax to convert an in-line expression directly into a window or aggregate function, e.g. just OTOH something vaguely like:

LAMBDA(v, <expression involving PREV(i) and v>)

where v is the argument (a column name or expression), PREV(i) is the result of the evaluation of LAMBDA() on the ith-previously-processed row, and the <expression involving PREV(i) and v> gives the actual result returned?
The idea here being that you wouldn't need to actually create a stored function to do a once-off custom window-function calculation, it could all be specified within the SELECT.

For UPDATE, you'd need to allow use of window functions in the UPDATE itself (not just in a subquery).  Even then, it's far from ideal.

For both populating variables and simultaneously sending data back to the client from a single result row, it's hard to see how the existing syntax could be improved.
Indeed, it would be nice to extend it with LAST_VALUE() and FIRST_VALUE() as non-window functions (see the MariaDB docs for LAST_VALUE()).
(Also useful for UPDATE, for example the fibonacci sequence UPDATE from above was fortuitously able to be written as a single expression, but a more robust, general way of doing this sort of thing would be:
 UPDATE t SET fibonacci=LAST_VALUE(@new:=@prev2+@prev, @prev2:=@prev, @prev:=@new) ...}

Inside a SELECT, it would also be nice to be able to say "evaluate these expressions first in left-to-right order, before evaluating the rest of the returned row", so that you could reliably use something like this without worrying about column evaluation order:

SELECT
  DO_FIRST(@x:=<very complex row expression>, @y:=<very complex row expression possibly involving @x>, @z:=<very complex row expression possibly involving @x,@y>)
  (@x+@y)*@z, @z-(@y/@x), <complex row expression involving @x,@y,@z>, <some other complex row expression involving @x,@y,@z>
FROM table WHERE ...

which (provided there's a unique column such as `RowID` you can use for self-joins) is equivalent, but VERY (!) much simpler and easy to understand, then:

SELECT (s1.x+s1.y)*s1.z, s1.z-(s1.y/s1.x), <complex row expression involving s1.x,s1.y,s1.z>, <some other complex row expression involving s1.x,s1.y,s1.z> FROM table INNER JOIN
 (SELECT RowID, s2.x, s2.y, <very complex row expression possibly involving s2.x,s2.y> AS z FROM table INNER JOIN
   (SELECT RowID, s3.x, <very complex row expression possibly involving s3.x> AS y FROM table INNER JOIN
     (SELECT RowID, <very complex row expression> AS x FROM table) s3 USING (RowID)
   ) s2 USING (RowID)
 ) s1 USING (RowID)
WHERE ...

(In this speculative syntax, I'm thinking that DO_FIRST would not return anything or create a column, just evaluate it's arguments left-to-right, but you could also consider a syntax that returns column values).
Or alternatively some kind of modifier to the SELECT, like:

SELECT FIXED_EVAL_ORDER
  LAST_VALUE(@x:=<very complex row expression>, @y:=<very complex row expression possibly involving @x>, @z:=<very complex row expression possibly involving @x,@y>), @x+@y)*@z), @z-(@y/@x), <complex row expression involving @x,@y,@z>, <some other complex row expression involving @x,@y,@z>

Perhaps you could also use arguments to indicate which (or how many leftmost) columns need to be evaluated first, while leaving the others free to be evaluated in any order, e.g.

SELECT FIXED_EVAL_ORDER(2) @a:=<something>, @b:=<something>, @a+@b, @a*@b FROM ...    # force first 2 columns to evaluate left-to-right before the rest
or, different (better?) syntax:
SELECT FIXED_EVAL_ORDER(3,1) @b:=<something involving @a>, tableCol1, @a:=<something>, tableCol27, tableCol42 FROM ...    # force evaluation of 3rd column, *then* 1st column, before evaluating remaining columns in arbitrary order.

To avoid some of the known pitfalls of variable assignment inside SELECT/UPDATE, instead of deprecating it altogether, consider deprecating only certain cases of it, such as (just as a minimum rough example, I'm quite sure a lot more analysis is needed):

*  Use of a variable and assignment to that variable in two different columns of a SELECT, where column evaluation order is undefined, e.g. SELECT @a, @a:=@a+1;

*  When a SELECT both assigns a value to a variable, and has columns that depend on the *previous row's* assignment to that variable (i.e. so excluding all my FIXED_EVAL_ORDER examples above), and where the use of LIMIT, DISTINCT, and GROUP BY or ORDER BY on the returned values themselves could result in assignments to the variable that don't correspond to returned row order.  (i.e. because rows must be partially or fully evaluated BEFORE a sort is applied, and the variable assignments either affect, or occur *inside* returned expressions and so can't be postponed till afterwards).  Actually, this possibly should apply to any function that has side effects when used inside a SELECT, and not just to variable assignments!     

*  Where change-of-variable-type after an assignment might cause problems?

By "deprecated in these cases", I mean of course that some future version of MySQL might throw an error in these situations.

Finally, I'd also like to point out (though I'm sure you already know), that even if window functions and other new (and not as yet existing) language features could be used to completely, elegantly, and efficiently replace all use-cases where a variable assignment occurs in a SELECT or UPDATE, it would *STILL* be an absolutely *HUGE* amount of work involved in converting all the legacy applications that use variables this way to use the new feature set instead... particularly because it isn't just a matter of a search-and-replace to change some syntax- the actual structure of the queries involved would have to be carefully analysed and rewritten.

(Which leads me to wonder, "Are you serious?!?" ...and since apparently you are, what it is about the (admittedly somewhat problematic) use of variable assignment in SELECT and UPDATE is ****SO**** bad that you'd take this drastic step?)

HI Mr. Trower,

Thank you for your bug report.

I sincerely do not see what is it that you are asking us to do. Improve documentation or add a feature.

If it is documentation, we would like to know exactly which chapter should be updated and with what info. You do not have to provide exact wordings , of course.

If it is a feature, you would have to define it in much, much more detail. This is simply, first of all, because assignments in DMLs will NOT come back. This is due to both stochastic nature of the SELECT list evaluation and due to the simple fact that this is not prescribed by latest SQL standards. Hence, define your feature as precisely as possible. Then, it might be accepted.

Hi.

Sorry, that *was* long and rather vague. The actual "feature request" was, more-or-less, this line:

>>>> So, what I am actually "feature request"ing here is that MySQL not deprecate or remove useful functionality that can't be efficiently or concisely duplicated in other ways! <<<<

Given that you're committed to removing all assignments in DMLs, that therefore transmutes into:

"Please MySQL, create or implement some language features that allow the same kind of things to be done that could up till now be done with variables, allowing for a similar level of conciseness (in the SQL syntax) and (computational) efficiency".

I don't have a *specific* request, both because this is by nature somewhat open-ended, and because the specification of new syntax should probably be left to the MySQL developers (or even to the SQL standards people). So that's why I listed several use-cases of variable assignment in DML, and then... waffled a bit I guess you could say, on how they could be done differently, using window functions or other things.

However, it's clear that a good start would be to allow the use of stored functions as window and aggregate functions, and remove the restriction on using window functions directly in an UPDATE. Perhaps also consider a syntax that allows in-line lambda expressions (for window functions) or defines them just for the one statement (like WITH/CTEs but for stored window & non-window functions?). And allow ORDER BY in multi-table UPDATEs (the assumption being that the window functions would see the previously-updated rows). And remove the restrictions on same-table subqueries.
Also nice would be a way to combine the functionality of SELECT (return data to the client) with SELECT...INTO (set some variables) so that a single statement could somehow do both (because if you create a very complex SELECT query, you don't want to have to do it twice it for the two purposes). And perhaps some syntax to allow the explicit definition of a list of sub-expressions, (each which might depend on the previously-defined ones), so that you can build up complex formulas hierarchically inside an SQL statement, rather than writing each expression (column to return, ORDER/GROUP BY expression, etc) out in full. (Yes, I'm aware that technically you *can't* reliably do that with variables... but it's still conceptually related, and something MySQL needs).

...

This is a long laundry list of things obviously, but the common theme is "stuff needed to replicate what you could previously do with variable assignments inside DML statements".

I'm by no means claiming that this list would cover it; nor were the 4 or 5 use-cases involving variables that I mentioned in any way an exhaustive list. (They were basically just what immediately occurred to me as I was writing the feature request).

What I'm getting at is that I think further thought and analysis is definitely called for on this issue, including the opportunity for other MySQL users affected by this (deprecation) to chip in. Maybe that's already happened somewhere, and/or maybe I should have posted on the forums rather than here... but this *IS* a genuine feature request, even if the features being requested aren't well defined! (Only some of what I'd like to be able to *DO* with them). So that's why this bug tracker was my first thought.

If another MySQL user comes here with the same concerns I have, and reads this page, maybe they could post their own use-cases for variable assignment in DML, if different than the ones I already mentioned, so that alternative ways of accomplishing the same tasks might be discussed, and examined to see what (if any) SPECIFIC features MySQL might need to implement to replace the use of variable assignments in those particular use-cases?

That should better explain my intention here better, I hope...?

[Also, I regret the Fibonacci example: A better one would be more like the CRC32 calculation, where the existing data within the table is also used as part of the running calculation].

[Regarding my request for documentation improvements, that was more of an aside. Maybe it doesn't even matter so much since the subject of the docs is being deprecated anyway, and/or maybe I should post a separate bug report for documentation deficiencies. But essentially, as long as variable assignment in DML is still legal, I think that the docs page about user variables should say something about the issue described in the Stack Overflow question that I linked to previously.]

Hi,

In order to verify the feature request, we need it well defined. You do not have to write the syntax. You could refer to some sub-paragraph of the SQL standard instead.

I'm afraid my knowledge of the SQL standard does not extend much past the fact that it exists. I've never read it, and as it isn't available for free, am not going to satisfy idle curiosity by doing so now.

I am also confused as to how it would help. Are you saying that all use-cases I've described in this feature request have already been considered in-depth by the SQL standards committee, and that they ALREADY have syntax in the SQL standard that covers alternative methods for performing them? So that all I need to do is ask you to implement the relevant bits of the standard? ...Without reading it I can't be sure, of course, but that seems very dubious, particularly since these use cases are all based on variable assignment in DML statements, which you already told me the SQL standard doesn't support! (Unless what you said perhaps meant that it *was* supported, but isn't anymore?)

As I said, I don't know the SQL standard, but I wouldn't expect any new language feature that I made up and requested to just so happen to already be in it; any such things would presumably need to be proprietary extensions to the standard (which I had assumed would be fine, given that each DBMS has it's own "flavour" of SQL anyway).

I consider myself a competent database programmer (meaning I'm very comfortable using databases like MySQL as back-ends for other applications). But I am not a database *engine* programmer - I have only a vague notion how MySQL works internally - and I'm certainly not qualified to be an SQL language-feature architect.

So asking me to provide you with specific new language features to implement is asking for something that is well outside my expertise. I could certainly take a stab at it, but I'd expect I'd be knocked back with either "We can't do that because MySQL's internal architecture doesn't support it", or "We won't do that because it subtly (or not-so-subtly) violates the design of the SQL language".

In deprecating variable-assignments-in-DML, you are deprecating a feature that is in use, and is (I think, probably) in quite *widespread* use. Certainly I use it quite often, and there are lots of blogs and internet tutorials that demonstrate how to use it in MySQL to achieve certain specific tasks.

I am pointing this out, and giving a few examples of this use, and asking for YOU to come up with alternative means (of some kind) to achieve the same goals.
I don't think it's unreasonable to expect that it should be YOUR DEVELOPMENT TEAM that decides just what kind of specific features within MySQL need to be modified or added in order to do that, since they are the ones with the expert knowledge required for that.

The above notwithstanding, I've already given you a "laundry list" of things that would probably go some fair way towards replacing the deprecated functionality. I can expand on that if you'd like. I can also go ahead and make up some new language features (complete with detailed syntax if you want it), that I think would also be helpful in that regard... *if* you think there's any genuine chance at all that you might really go ahead and implement them.

-------------------
Two other things:
-------------------

* Should I open a separate bug report for the documentation on page "9.4 User-Defined Variables" failing to mention the row-evaluation-order problem (from the StackOverflow question https://stackoverflow.com/questions/56939034/ )?

and

* On the same page, where it says "Previous releases of MySQL made it possible to assign a value to a user variable in statements other than SET. This functionality is supported in MySQL 8.0 for backward compatibility but is subject to removal in a future release of MySQL.",

would you consider adding something like:

"Techniques that made use of this functionality will then cease working. New language features such as window functions and recursive common table expressions (see WITH) are already able to replace some use-cases. Other new language features are under consideration that are expected to make up the gap. __Click here___ to view or contribute to a forum discussion with our development team on the proposed changes and how they are expected to replace deprecated code-patterns in legacy code."

...and then providing such a forum, with developer team participation, as described?

(Or you could reword it slightly and link directly to this bug report, but frankly in that case I think it'd need to be rewritten and cleaned up a lot first, if the bug tracker provides a way to do so).

Hi Mr. Trower,

Sorry, but we do not need a specific feature request, to use your own words. You do not have to provide the syntax , nor do you have to quote the standard, but you can not be vague, like you were in your descriptions. You can not just ask "to be provided by a replacement for deprecated feature(s)". That is not specific at all. 

Also, we do not see what is it there to document. To write down that those that are used to deprecated features will not be able to use them is so self-evident that writing something like that is unnecessary.

Regarding forums, we have public forums like forums.mysql.com. These are forums where MySQL users can exchange views, experiences and each others. MySQL developers do not partake in those forums. However, MySQL developers do appear frequently on Oracle Open World meetings and various other meetings. 

They also publish their articles on:

https://mysqlserverteam.com