MySQL Bugs: #98632: Remove indirect error reporting in various places.

Bug #98632	Remove indirect error reporting in various places.
Submitted:	17 Feb 2020 9:49	Modified:	18 Feb 2020 4:34
Reporter:	Simon Mudd (OCA)
Status:	Verified
Category:	MySQL Server: Group Replication	Severity:	S4 (Feature request)
Version:	8.0, 5.7	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	indirect, logging, not_showing_actual_message

Description:
There are several places where, for various reasons, mysqld does not report the actual error message to the user but instead tells the user to look elsewhere.

This is not helpful: the error is known and has been recorded, but not where the user can immediately see it. This forces the user to take extra steps to identify the cause of the problem before it can be resolved.

An example is shown under "How to repeat" below.

How to repeat:
Start up a group replication cluster using dbdeployer (3 nodes). Shut it down.
Try to restart one of the nodes:

node1 [localhost:21920] {msandbox} ((none)) > start group_replication;
ERROR 3092 (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.
node1 [localhost:21920] {msandbox} ((none)) > exit

The error message here is useless as it provides no immediate information to the user.

Checking the logs shows:

2020-02-17T09:22:14.130798Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 22045'
2020-02-17T09:22:15.249375Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 22045'
2020-02-17T09:22:17.705763Z 9 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2020-02-17T09:22:17.706343Z 9 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
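
The extra step described above (searching the error log for what the plugin actually reported) can be sketched as a small script. This is only an illustration and not part of MySQL: the regular expressions are inferred from the four log lines quoted above, and the names `LOG_LINE`, `PLUGIN_MSG`, `group_replication_errors` and `SAMPLE` are hypothetical.

```python
import re

# Assumed layout, inferred from the log lines quoted above:
# timestamp, thread id, [level], [error code], [subsystem], message.
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>\d+)\s+"
    r"\[(?P<level>\w+)\]\s+\[(?P<code>MY-\d+)\]\s+\[(?P<subsystem>\w+)\]\s+"
    r"(?P<message>.*)$"
)

# The group_replication plugin wraps its own text in single quotes.
PLUGIN_MSG = re.compile(r"^Plugin group_replication reported: '(?P<text>.*)'$")


def group_replication_errors(lines):
    """Return (timestamp, code, text) for each group_replication ERROR line."""
    found = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None or m.group("level") != "ERROR":
            continue  # not an error log line in the assumed layout
        p = PLUGIN_MSG.match(m.group("message"))
        if p is not None:
            found.append((m.group("ts"), m.group("code"), p.group("text")))
    return found


# The four lines from the report, used as sample input.
SAMPLE = [
    "2020-02-17T09:22:14.130798Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 22045'",
    "2020-02-17T09:22:15.249375Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 22045'",
    "2020-02-17T09:22:17.705763Z 9 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'",
    "2020-02-17T09:22:17.706343Z 9 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'",
]

if __name__ == "__main__":
    for ts, code, text in group_replication_errors(SAMPLE):
        print(ts, code, text)
```

Having to run something like this just to learn why START GROUP_REPLICATION failed is exactly the indirection this report asks to remove.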

Suggested fix:
There are four different messages here, and it is not entirely clear which one is best to report. I suspect that 'Timeout on wait for view after joining group' is the closest, although it says the server has joined the group while also implying that the join never really completed.

A better message might be appropriate: 'Timeout while attempting to join the group' sounds better to me. That is the sort of error message I'd like to see when I run START GROUP_REPLICATION, as it would give a clearer indication of the problem without my having to search the error logs (which may not be immediately available).

Note: there are several other similar errors, especially in parallel replication, that ask the user to check performance_schema.replication_applier_status_by_worker for errors. If you have the error, why not remember it and report it?

Example:

Slave SQL Running: No
SQL Error 3547: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c0e97a91-96e5-11e9-87ae-525400460e96:36' at master log binlog.000001, end_log_pos 11646. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

In cases like this it helps a lot to get the actual error immediately rather than having to go and search for it (that is, store it in the Last_SQL_Errno / Last_SQL_Error fields of SHOW SLAVE STATUS).
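
To illustrate the lookup the user is currently forced to do by hand, here is a minimal sketch that pulls the worker number and transaction out of the coordinator message quoted above and builds the follow-up query. The pattern is inferred only from that one sample message. The column names WORKER_ID, LAST_ERROR_NUMBER and LAST_ERROR_MESSAGE are assumed from the performance_schema table definition and should be checked against your server version. The names `WORKER_FAILURE`, `worker_lookup` and `SAMPLE_ERROR` are hypothetical.

```python
import re

# Pattern inferred from the single Last_SQL_Error sample in this report.
WORKER_FAILURE = re.compile(
    r"Worker (?P<worker>\d+) failed executing transaction "
    r"'(?P<gtid>[^']+)' at master log (?P<log>[^,\s]+), "
    r"end_log_pos (?P<pos>\d+)"
)


def worker_lookup(last_sql_error):
    """Extract worker id and transaction, and build the follow-up query.

    Returns None if the message does not match the assumed pattern.
    """
    m = WORKER_FAILURE.search(last_sql_error)
    if m is None:
        return None
    worker_id = int(m.group("worker"))
    # worker_id is an int parsed by the regex, so formatting it into the
    # statement is safe here. Column names are an assumption (see above).
    query = (
        "SELECT WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE "
        "FROM performance_schema.replication_applier_status_by_worker "
        f"WHERE WORKER_ID = {worker_id}"
    )
    return {
        "worker": worker_id,
        "gtid": m.group("gtid"),
        "log": m.group("log"),
        "end_log_pos": int(m.group("pos")),
        "query": query,
    }


SAMPLE_ERROR = (
    "Coordinator stopped because there were error(s) in the worker(s). "
    "The most recent failure being: Worker 1 failed executing transaction "
    "'c0e97a91-96e5-11e9-87ae-525400460e96:36' at master log binlog.000001, "
    "end_log_pos 11646. See error log and/or "
    "performance_schema.replication_applier_status_by_worker table for more "
    "details about this failure or others, if any."
)

if __name__ == "__main__":
    info = worker_lookup(SAMPLE_ERROR)
    print(info["query"])
```

If the worker's own error were stored in Last_SQL_Errno / Last_SQL_Error, none of this would be needed.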

Change to a feature request.

I have provided 2 specific examples here.  Should I provide other similar errors if I see them?

Hello Simon,

Thank you for the reasonable feature request!

regards,
Umesh