Bug #100345 The results of REGEXP_SUBSTR() are different for same string.
Submitted: 28 Jul 2020 5:44 Modified: 4 Aug 2020 14:00
Reporter: James Lee Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:8.0.20, 8.0.21 OS:Linux
Assigned to: CPU Architecture:Any

[28 Jul 2020 5:44] James Lee
Description:
Description:
Although they are the same string, the results of REGEXP_SUBSTR() are different when the first parameter is a string constant or a column of a table.

Like Bug #100196. 

How to repeat:
The first way to repeat this bug:

1. Initialize DB with the below "Configure file".
2. Start DB.
Below are the steps of running sqls:
3. create table if not exists t_test(a int not null primary key, b varchar(10));
4. insert into t_test(a,b) values(5, '11a22A33a');
5. 
mysql> SELECT REGEXP_SUBSTR(b,'[^A]+',1,1) AS STR FROM t_test where a = 5;
+-------+
| STR   |
+-------+
| 11a22 |
+-------+

6. 
mysql> SELECT REGEXP_SUBSTR('11a22A33a','[^A]+',1,1) AS STR;
+-----+
| STR |
+-----+
| 11  |
+-----+

Configure file:

[mysqld]
sync_binlog=1

innodb_flush_log_at_trx_commit = 1

# support transaction
default-storage-engine=innodb

port            = 5539
socket          = /home/myname/mysql8.0.20/bin/mysql1.sock
bind_address = 127.0.0.1
datadir  = /home/myname/mysql8.0.20/data/data
log-error=/home/myname/mysql8.0.20/log/mysqld1.log
pid-file=/home/myname/mysql8.0.20/bin/mysqld1.pid
server-id       = 1
basedir=/home/myname/mysql8.0.20/
log-bin=../binlog/mysql-bin
relay-log=../relaylog/relay-bin
tmpdir=/home/myname/mysql8.0.20/data/tmp

gtid_mode = on
enforce_gtid_consistency=on

# replication options
expire_logs_days=10
max_binlog_size=100M
binlog_format=row
innodb_flush_log_at_trx_commit=1
character-set-server=UTF8MB4
collation_server = utf8mb4_bin

=================================================
Another way to repeat this bug:

CREATE TABLE `t_test1` (
  `a` int NOT NULL,
  `b` varchar(10) COLLATE utf8mb4_bin DEFAULT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
insert into t_test1(a,b) values(5 ,'11a22A33a');
SELECT REGEXP_SUBSTR(b,'[^A]+',1,1) AS STR FROM t_test1 where a =5;
+-------+
| STR   |
+-------+
| 11a22 |
+-------+

CREATE TABLE `t_test2` (
  `a` int NOT NULL,
  `b` varchar(10) DEFAULT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
insert into t_test2(a,b) values(5 ,'11a22A33a');
SELECT REGEXP_SUBSTR(b,'[^A]+',1,1) AS STR FROM t_test2 where a =5;
+------+
| STR  |
+------+
| 11   |
+------+

SELECT REGEXP_SUBSTR('11a22A33a','[^A]+',1,1) AS STR FROM DUAL;
+-----+
| STR |
+-----+
| 11  |
+-----+
[28 Jul 2020 8:00] MySQL Verification Team
Hello James Lee,

Thank you for the report and test case.

regards,
Umesh
[4 Aug 2020 14:00] Martin Hansson
Posted by developer:
 
This is expected behavior, as Erlend pointed out. It can also be observed the following way:

mysql> SELECT REGEXP_SUBSTR('11a22A33a', '[^A]+', 1, 1, 'c'); # Case sensitive
+------------------------------------------------+
| REGEXP_SUBSTR('11a22A33a', '[^A]+', 1, 1, 'c') |
+------------------------------------------------+
| 11a22                                          |
+------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT REGEXP_SUBSTR('11a22A33a', '[^A]+', 1, 1, 'i'); # Case insensitive
+------------------------------------------------+
| REGEXP_SUBSTR('11a22A33a', '[^A]+', 1, 1, 'i') |
+------------------------------------------------+
| 11                                             |
+------------------------------------------------+
[4 Aug 2020 17:02] Erlend Dahl
Replacing the case-insensitive collation with utf8mb4_0900_as_cs also makes the problem go away.