Bug #120126 mysql connector c xdev get stuck when connect in keepalived and mysql server restart
Submitted: 23 Mar 2:53 Modified: 25 Mar 9:34
Reporter: 芳彬 杨 Email Updates:
Status: Open Impact on me:
None 
Category:Connector / C Severity:S2 (Serious)
Version:9.6.0 and 8.0.33 OS:Ubuntu (20.04)
Assigned to: CPU Architecture:Any

[23 Mar 2:53] 芳彬 杨
Description:
1 This issue exists both in mysql connector c 9.6.0 and 8.0.33
2 Use the x dev api c connect mysql server 8.0.33 which is behind the "keepalived"
3 When restart the mysql server, the mysqlx api will get stuck for 15 minutes whih time_out option setted
5 In fact , no keepalive this issue will happen too, but very hard to reproduce. and "keepalive" will make is very easy to reproduce

How to repeat:
1 download the lib from mysql webside
https://dev.mysql.com/downloads/connector/cpp/

2 set two mysql server 8.0.33, and make them to be Master-Master Replication mode

3 set up keepalived, the config in two machine is like this:

machineA:

global_defs {
    router_id R1
    script_user opencapture
    enable_script_security
    notification_email {
       jjtseng@tvbs.com.tw
    }
    notification_email_from opensolution@tvbs.com.tw
    smtp_server 10.254.101.46 25
    smtp_connect_timeout 30
}
vrrp_script chk_mysql_port {
    script "/opt/chk_mysql.sh"
    interval 2
    weight -50
    fall 2
    rise 1
}

vrrp_instance VI_1 {
    state MASTER
    interface ens18
    mcast_src_ip 10.254.16.162
    virtual_router_id 190
    priority 120
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.254.16.190
    }

track_script {
   chk_mysql_port
   }
}

machineB:

! Configuration File for keepalived
global_defs {
   router_id R2
   script_user opencapture
   enable_script_security
}
vrrp_script chk_mysql_port {
    script "/opt/chk_mysql.sh"
    interval 2
    weight -5
    fall 2
    rise 1
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens18
    mcast_src_ip 10.254.16.163
    virtual_router_id 190
    priority 80
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.254.16.190
    }

track_script {
   chk_mysql_port
   }
}

The master will be in machine A

4 In machine B, create a sessiong using the below options.  Attention, must be in machine B. If in A will not be easy to reproduce

    mysqlx_session_options_t *opt;
    opt = mysqlx_session_options_new();
    int optRes = mysqlx_session_option_set(opt,
        OPT_HOST(ip.c_str()), OPT_PORT(port),
        OPT_USER(name.c_str()), OPT_PWD(password.c_str()),
        OPT_DB(database.c_str()),
        OPT_COMPRESSION(MYSQLX_COMPRESSION_DISABLED),
        OPT_CONNECT_TIMEOUT(mConnecTimeOutMs),
        OPT_READ_TIMEOUT(14000),
        OPT_WRITE_TIMEOUT(10000),
        PARAM_END);

5 restart machine A's mysqlserver and call the below in a "while(can add some thread sleep)" together

mysqlx_sql(mySessiont, sql.c_str(), sql.length());

//the sql is a simple "select" query

6 then the "mysqlx_sql" will gei stuck for 15 minuets in one of the calling

Suggested fix:
In fact, I have found the stuck location:

in /cdk/foundation/connection_openssl.cc

the "ssl_read" in "bool TLS::Read_some_op::common_read()" which called in  do_wait()

though I have set the socket time out, it will happen because the 'while' in do_wait() won't judge it to complete, so the "common_read()" will enter many times and will get get stuck until the linux report poxi error

But I have no good idea to fix it, because the "do_wait()" will enter many times when close so close will get stuck long too. Only control the time out in "do_wait" is not enough.
[25 Mar 1:41] 芳彬 杨
Sorry~ something is missing for reproduce step.
For step 5
"5 restart machine A's mysqlserver and call the below in a "while(can add some thread sleep)" together

mysqlx_sql(mySessiont, sql.c_str(), sql.length());

//the sql is a simple "select" query"

should be
"5 restart machine A's mysqlserver and call the below in a "while(can add some thread sleep)" together

mysqlx_sql(mySessiont, sql.c_str(), sql.length());
and if the errcode is 11
close the seccion(mysqlx_session_close) and recreate using the above parameter(mysqlx_get_session_from_options)
the continue call mysqlx_sql

//the sql in mysqlx_sql is a simple "select" query
//and must call the above repeated at the same time when mysql server restart, if wait several minutes, it won't happen
"

when restart mysql server, the process is like this:
mysqlx_sql fail, errcode 11
mysqlx_get_session_from_options fail for several times, errcode is 111
mysqlx_get_session_from_options success
mysqlx_sql fail, and get stuck 15minutes
[25 Mar 9:03] 芳彬 杨
Here is my test code with 9.6.0:
may need restart the mysql serveral times:

mysqlx_session_t* MysqlApiTest::_Bug120126_Connect()
{
        mysqlx_session_t  *sess = NULL;
    unsigned int curErrcode;
    mysqlx_session_options_t *opt = NULL;
    mysqlx_error_t *error = NULL;
    opt = mysqlx_session_options_new();

    int optRes = mysqlx_session_option_set(opt,
        OPT_HOST(mIp.c_str()), OPT_PORT(mPort),
        OPT_USER(mUser.c_str()), OPT_PWD(mPwd.c_str()),
        OPT_DB(mDbName.c_str()),
        OPT_COMPRESSION(MYSQLX_COMPRESSION_DISABLED),
        OPT_CONNECT_TIMEOUT(10*1000),
        OPT_READ_TIMEOUT(10000),
        OPT_WRITE_TIMEOUT(15000),
        PARAM_END);

    sess = mysqlx_get_session_from_options(opt, &error);
    mysqlx_free(opt);

    if (sess == NULL)
    {
        std::cout << "session create null" << std::endl;
        curErrcode = mysqlx_error_num(error);  //if option error, mCurErrcode = 0
        std::string str(mysqlx_error_message(error));
        std::cout << str << std::endl;
        mysqlx_free(error);
        return sess;
    }
    std::cout << "session create success" << std::endl;
    if (error != NULL) {
        mysqlx_free(error);
        error = NULL;
    }
    return sess;
}
void MysqlApiTest::Bug120126Test_keepAlived()
{
    std::cout << "\nBug120126Test_keepAlived begin: mysql server restart behind alived will get stuck" << std::endl;
    mysqlx_session_t  *sess = NULL;
    unsigned int curErrcode;

    sess = _Bug120126_Connect();

    if (sess == NULL)
    {
        return;
    }

    
    std::chrono::time_point<std::chrono::system_clock> begin_time = std::chrono::system_clock::now();

    const std::string sql = "SELECT * FROM app_shared_data limit 10;";
    long sql_succ_count = 0;
    long recon_count = 0;
    while (true && sess)
    {
        std::chrono::time_point<std::chrono::system_clock> begin_sql_time = std::chrono::system_clock::now();

        mysqlx_result_t* result = mysqlx_sql(sess, sql.c_str(), sql.length());

        std::chrono::time_point<std::chrono::system_clock> end_sql_time = std::chrono::system_clock::now();

        std::chrono::milliseconds sql_cost = std::chrono::duration_cast<std::chrono::milliseconds>(end_sql_time - begin_sql_time);

        if (result != NULL)
        {
            sql_succ_count++;
            mysqlx_free(result);
            result = NULL;
            if (sql_succ_count % 10 == 0)
            {
                std::cout << "mysqlx_sql success:"<< sql_succ_count << std::endl;
            }

            //sleep
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
        else {
            if (sql_cost.count() > 10 * 1000)
            {
                std::cout << "mysqlx_sql fail and cost too long:"<< sql_cost.count() << std::endl;
            }

            sql_succ_count = 0;

            curErrcode = mysqlx_error_num(sess);  //if option error, mCurErrcode = 0
            std::string str(mysqlx_error_message(sess));
            std::cout << "mysqlx_sql fail:" << curErrcode << "," << str << std::endl;

            if(curErrcode == 11
            || curErrcode == 104
            || curErrcode == 23
            || curErrcode == 15)
            {
                recon_count = 0;
                std::cout << "close session and do reconnect" << std::endl;
                //free and reconnect
                mysqlx_session_close(sess);
                sess = NULL;
                while ((sess = _Bug120126_Connect()) == NULL)
                {
                    recon_count++;
                    std::cout << "reconnect: " << recon_count << std::endl;

                    //sleep
                    std::this_thread::sleep_for(std::chrono::seconds(2));
                }
                
            }
        }

        std::chrono::time_point<std::chrono::system_clock> end_time = std::chrono::system_clock::now();
    
        std::chrono::milliseconds totaltime = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - begin_time);

        if (totaltime.count() > 20 * 60 * 1000)//20min
        {
            if(sess)
            {
                mysqlx_session_close(sess);
                sess = NULL;
                std::cout << "run too long, and break" << std::endl;
                break;
            }
        }
    }
    std::cout << "Bug120126Test_keepAlived end" << std::endl;
}
[25 Mar 9:34] 芳彬 杨
Here is my log(I restart the mysqlserver twice) and the client must run in machine B:

Bug120126Test_keepAlived begin: mysql server restart behind alived will get stuck
session create success
mysqlx_sql fail:11,CDK Error: OpenSSL: error:0A000126:SSL routines::unexpected eof while reading
close session and do reconnect
session create null
CDK Error: Connection refused (generic:111)
reconnect: 1
session create null
CDK Error: Connection refused (generic:111)
reconnect: 2
session create null
CDK Error: Connection refused (generic:111)
reconnect: 3
session create success
mysqlx_sql success:10
mysqlx_sql fail and cost too long:Bug120126Test_keepAlived begin: mysql server restart behind alived will get stuck
session create success
mysqlx_sql fail:11,CDK Error: OpenSSL: error:0A000126:SSL routines::unexpected eof while reading
close session and do reconnect
session create null
CDK Error: Connection refused (generic:111)
reconnect: 1
session create null
CDK Error: Connection refused (generic:111)
reconnect: 2
session create null
CDK Error: Connection refused (generic:111)
reconnect: 3
session create success
mysqlx_sql success:10
mysqlx_sql fail and cost too long:938196
mysqlx_sql fail:110,CDK Error: Connection timed out (generic:110)
mysqlx_sql fail:32,CDK Error: Broken pipe (generic:32)
mysqlx_sql fail:1,CDK Error: Can't write message while another one is being written
close session and do reconnect
session create success
mysqlx_sql success:10
mysqlx_sql success:20
mysqlx_sql success:30

mysqlx_sql fail:110,CDK Error: Connection timed out (generic:110)
mysqlx_sql fail:32,CDK Error: Broken pipe (generic:32)
mysqlx_sql fail:1,CDK Error: Can't write message while another one is being written
close session and do reconnect
session create success
mysqlx_sql success:10
mysqlx_sql success:20
mysqlx_sql success:30