Bug #84538 Replication fail over does not recover after server marked as unavailable
Submitted: 17 Jan 2017 15:39 Modified: 20 Dec 2017 6:13
Reporter: Martin Dunsmore Email Updates:
Status: Verified Impact on me:
None 
Category:Connector / NET Severity:S2 (Serious)
Version:6.8.3 - 7.0 OS:Windows
Assigned to: CPU Architecture:Any
Tags: mysql connector net replication

[17 Jan 2017 15:39] Martin Dunsmore
Description:
If an exception is thrown, when using MySQL Replication Groups, such as the following: Authentication to host 'some-mysql-host' for user 'mysql-user' using method 'mysql_native_password' failed with message: Reading from the stream has failed. The Replication Server in the Replication Group is set to IsAvailable = false from within the GetNewConnection method in ReplicationManager.cs.

In this scenario, the handle fail over logic will never be triggered and therefore never recover and will always be unavailable until application restart.

How to repeat:
Run this:

        static void Main(string[] args)
        {
            while (true)
            {
                for (int i = 0; i < 25; i++)
                {
                    Task.Factory.StartNew(() =>
                    {
                        using (var mysql = new MySqlConnection("server=SomeMySqlGroup"))
                        {
                            mysql.Open();
                            var data = MySqlHelper.ExecuteReader(mysql, "SELECT SLEEP(1);"); // Put a break point here before killing processes on the server

                        }
                    });
                }

                Thread.Sleep(500);
            }
        }

After this code has been running for say around 30 seconds, put a break point where stated above and then kill all user processes on each server within the Replication Group. Then remove the break point and allow the code to continue, this will most likely trigger the following exception (may have to repeat a couple of times):

ConnectorTest.vshost.exe Error: 0 : MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'some-host' for user 'mysql-user' using method 'mysql_native_password' failed with message: Reading from the stream has failed. ---> MySql.Data.MySqlClient.MySqlException (0x80004005): Reading from the stream has failed. ---> System.IO.EndOfStreamException: Attempted to read past the end of the stream.
   at MySql.Data.MySqlClient.MySqlStream.ReadFully(Stream stream, Byte[] buffer, Int32 offset, Int32 count) in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\MySqlStream.cs:line 188
   at MySql.Data.MySqlClient.MySqlStream.LoadPacket() in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\MySqlStream.cs:line 204
   at MySql.Data.MySqlClient.MySqlStream.LoadPacket() in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\MySqlStream.cs:line 228
   at MySql.Data.MySqlClient.MySqlStream.ReadPacket() in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\MySqlStream.cs:line 143
   at MySql.Data.MySqlClient.NativeDriver.ReadPacket() in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\NativeDriver.cs:line 137
   at MySql.Data.MySqlClient.Authentication.MySqlAuthenticationPlugin.ReadPacket() in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\Authentication\MySQLAuthenticationPlugin.cs:line 171
   at MySql.Data.MySqlClient.Authentication.MySqlAuthenticationPlugin.AuthenticationFailed(Exception ex) in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\Authentication\MySQLAuthenticationPlugin.cs:line 92
   at MySql.Data.MySqlClient.Authentication.MySqlAuthenticationPlugin.ReadPacket() in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\Authentication\MySQLAuthenticationPlugin.cs:line 177
   at MySql.Data.MySqlClient.Authentication.MySqlAuthenticationPlugin.Authenticate(Boolean reset) in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\Authentication\MySQLAuthenticationPlugin.cs:line 131
   at MySql.Data.MySqlClient.NativeDriver.Authenticate(String authMethod, Boolean reset) in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\NativeDriver.cs:line 505
   at MySql.Data.MySqlClient.NativeDriver.Open() in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\NativeDriver.cs:line 309
   at MySql.Data.MySqlClient.Driver.Open() in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\Driver.cs:line 240
   at MySql.Data.MySqlClient.Driver.Create(MySqlConnectionStringBuilder settings) in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\Driver.cs:line 227
   at MySql.Data.MySqlClient.Replication.ReplicationManager.GetNewConnection(String groupName, Boolean master, MySqlConnection connection) in C:\Users\Guest\Downloads\mysql-connector-net-6.9.9-src\Source\MySql.Data\Replication\ReplicationManager.cs:line 172

Once this or any other exception excluding exception number 1042 is thrown in GetNewConnection group.HandleFailover will never be called.

Suggested fix:
The suggested fix would be to only mark the server as unavailable if the exception number is 1042 UnableToConnectToHost (Given when the connection is unable to successfully connect to host.) as this is the only time the handle fail over logic is triggered within GetNewConnection

Current logic:
connection.driver = null;
server.IsAvailable = false;
MySqlTrace.LogError(ex.Number, ex.ToString());
if (ex.Number == 1042)
{
    // retry to open a failed connection and update its status
    group.HandleFailover(server, ex);
}

Suggested logic:
connection.driver = null;

MySqlTrace.LogError(ex.Number, ex.ToString());
if (ex.Number == 1042)
{
    server.IsAvailable = false;
    // retry to open a failed connection and update its status
    group.HandleFailover(server, ex);
}
[25 Jan 2017 18:09] Roberto Ezequiel Garcia Ballesteros
The reason having the 'server.IsAvailable = false;' outside 1042 error is to avoid an infinite loop because if there is another error different to 1042 it will try to reconnect forever. In this case is expected to throw the exception to the application to have a different behavior when reconnecting.
[3 Feb 2017 12:41] Muhammed Khalifa
The unfortunately side effect of that is the server will never recover from this state, even after the "retryTime" interval because it is not monitored by a timer.

In this scenario the query is possibly exceeding the net_read_time timeout (transient failure) and a exception is thrown which causes the server to be marked as unavailable permanently.

If the application constantly encounters this transient error, then yes the application reconnection logic will keep trying to the open a connection. 

It may be that more specific error exception codes need to be considered before marking the server as unavailable or create timers so that "retryTime" is honoured to recover from these transient failures.
[3 Feb 2017 12:45] Muhammed Khalifa
Sorry, i can't seem to edit my comment to correct it:

The unfortunate side effect of that is, the server will never recover from this state, even after the "retryTime" interval because it is not monitored by a timer.

In this scenario the query is possibly exceeding the net_read_time timeout (transient failure) and a exception is thrown which causes the server to be marked as unavailable permanently.

If the application constantly encounters this transient error, then yes the application reconnection logic will keep trying to open a connection. 

It may be that more specific error exception codes need to be considered before marking the server as unavailable or create timers so that "retryTime" is honoured to recover from these transient failures.
[20 Dec 2017 6:13] Chiranjeevi Battula
Hello Muhammed Khalifa,

Thank you for the feedback.
Verified based on internal discussion with dev's.

Thanks,
Chiranjeevi.