金融應用場景下跨資料中心的 MGR 架構方案(二)

語言: CN / TW / HK

本文來源:原創投稿

* GreatSQL社群原創內容未經授權不得隨意使用,轉載請聯絡小編並註明來源。

內容提綱

  • 1、什麼是Async Replication Auto failover

  • 2、基於MGR的兩地三中心資料庫架構方案

  • 3、配置Async Replication Auto failover

    • 3.1、建立複製通道

    • 3.2、對複製通道新增多個複製源

  • 4、模擬故障,確認可自動切換

如何在多個數據中心部署多套MGR叢集,並實現故障快速切換。

上篇文章介紹瞭如何在多資料中心部署多套MGR叢集 ,並構建叢集間的複製通道。這樣一旦主AZ不可用時,在校驗完資料後,就可以切換到備用AZ的MGR叢集,非常方便。

本文我們繼續深入介紹如何利用 Async Replication Auto failover 實現故障自動轉移的。

1、什麼是Async Replication Auto failover

從MySQL 8.0.22開始,推出一個新特性" Async Replication Auto failover ",當MGR叢集發生故障時,其從庫可以更方便的實現快速自動切主。直譯過來是“ 非同步複製自動故障轉移 ”,但實際上它也是支援半同步複製場景的。

You can use MySQL Server's new asynchronous connection failover mechanism to automatically establish an asynchronous (source to replica) replication connection to a new source after the existing connection from a replica to its source fails. The connection fails over if the replication I/O thread stops due to the source stopping or due to a network failure. The asynchronous connection failover mechanism can be used to keep a replica synchronized with multiple MySQL servers or groups of servers that share data. To activate asynchronous connection failover for a replication channel set SOURCE_CONNECTION_AUTO_FAILOVER=1 on the CHANGE MASTER TO statement for the channel, and set up a source list for the channel using the asynchronous_connection_failover_add_source and asynchronous_connection_failover_delete_source functions.

詳細介紹見官方文件 17.4.9 Switching Sources with Asynchronous Connection Failover

2、基於MGR的兩地三中心資料庫架構方案

在兩地三中心架構下,可以採用下面這個部署方案

在這個架構方案裡,MGR-B可以採用 非同步複製 或  增強半同步複製 通道從MGR-A複製資料,這要取決於兩個AZ之間的網路狀況。

在金融應用場景下,這個網路條件一般可以得到保障,因此 優先採用增強版同步方式

而跨城異地AZ裡的MGR C則因為網路延遲較大,大概率會採用非同步複製方式。

在上述方案中,不管是MGR-B還是C,都面臨一個問題: 那就是複製源指向的主機例項,發生故障不可用之後,如何快速切換,實現自動故障轉移

在以往,只能靠第三方工具實現切換。

在MySQL 8.0.22新增"Async Replication Auto failover"特性後,就沒這個煩惱了。

其工作機制是 在一個複製通道上設定多個複製源(source),它還支援對多個源設定不同權重。當發現主複製源發生故障異常中斷後(會先嚐試重連幾次),即可實現自動切換到新的複製源。當原來的複製源恢復後,如果其權重更高,還會再切換回去

3、配置Async Replication Auto failover

部署的過程很簡單,幾條命令就搞定了。

3.1、建立複製通道

按照常規方式,在從例項上(本案以MGR-B為例)建立一個複製通道

[[email protected] mgrB-1][(none)]> CHANGE REPLICATION SOURCE TO
MASTER_HOST='172.16.16.10', MASTER_PORT=3306, 
MASTER_USER='repl', MASTER_PASSWORD='repl',
MASTER_AUTO_POSITION=1,
SOURCE_CONNECTION_AUTO_FAILOVER=1, #這裡是關鍵,表示開啟自動故障轉移
MASTER_RETRY_COUNT=3, #最多重試3次
MASTER_CONNECT_RETRY=10 #每次重試間隔10秒
FOR CHANNEL 'MGR-A';

#簡單解釋下幾個引數
- SOURCE_CONNECTION_AUTO_FAILOVER=1  #這裡是關鍵,表示開啟自動故障轉移
- MASTER_RETRY_COUNT=3  #表示最多重試3次,預設是是86400次
- MASTER_CONNECT_RETRY=10 #表示每次重試間隔10秒,預設是60秒

確認新增的複製通道生效了:

[[email protected] mgrB-1][(none)]> SELECT * FROM performance_schema.replication_applier_status\G
*************************** 1. row ***************************
              CHANNEL_NAME: mgr-a
             SERVICE_STATE: ON
           REMAINING_DELAY: NULL
COUNT_TRANSACTIONS_RETRIES: 0
*************************** 2. row ***************************
              CHANNEL_NAME: group_replication_applier
             SERVICE_STATE: ON
           REMAINING_DELAY: NULL
COUNT_TRANSACTIONS_RETRIES: 0

3.2、對複製通道新增多個複製源

接下來再對這個複製通道新增多個複製源(多次呼叫該UDF即可):

[[email protected] mgrB-1][(none)]> SELECT asynchronous_connection_failover_add_source('MGR-A','172.16.16.10',3306,null,60);
[[email protected] mgrB-1][(none)]> SELECT asynchronous_connection_failover_add_source('MGR-A','172.16.16.11',3306,null,60);
[[email protected] mgrB-1][(none)]> SELECT asynchronous_connection_failover_add_source('MGR-A','172.16.16.12',3306,null,60);

#簡單解釋下幾個引數
MGR-A  #表示複製通道,和上面的複製通道同名
172.16.16.10 #表示該複製源的IP
3306 #表示該複製源的埠
null #表示network_namespace,未來的特性,現在先放空即可
60 #表示該複製源的權重,上面我們介紹了不同權重的作用,值越大越有機會搶到成為複製源

確認多個複製源生效:

[[email protected] mgrB-1][(none)]> SELECT * FROM performance_schema.replication_asynchronous_connection_failover;
+--------------+--------------+------+-------------------+--------+--------------+
| CHANNEL_NAME | HOST         | PORT | NETWORK_NAMESPACE | WEIGHT | MANAGED_NAME |
+--------------+--------------+------+-------------------+--------+--------------+
| mgr-a        | 172.16.16.10 | 3306 |                   |     60 |              |
| mgr-a        | 172.16.16.11 | 3306 |                   |     60 |              |
| mgr-a        | 172.16.16.12 | 3306 |                   |     60 |              |
+--------------+--------------+------+-------------------+--------+--------------+

啟動該複製通道:

[[email protected] mgrB-1][(none)]> START REPLICA FOR CHANNEL 'MGR-A';

確認複製通道和MGR的狀態都正常:

[[email protected] mgrB-1][(none)]> SELECT * FROM performance_schema.replication_connection_status\G
*************************** 1. row ***************************
                                      CHANNEL_NAME: mgr-a
                                        GROUP_NAME:
                                       SOURCE_UUID: b084f8a1-96a8-11eb-9a70-525400fb993a
                                         THREAD_ID: 3084
                                     SERVICE_STATE: ON
                         COUNT_RECEIVED_HEARTBEATS: 5974
                          LAST_HEARTBEAT_TIMESTAMP: 2021-05-29 18:53:13.879720
                          RECEIVED_TRANSACTION_SET: 476c0276-be03-11eb-bd34-525400e802e2:21-31:1000016-1000017
                                 LAST_ERROR_NUMBER: 0
                                LAST_ERROR_MESSAGE:
                              LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
                           LAST_QUEUED_TRANSACTION: 476c0276-be03-11eb-bd34-525400e802e2:31
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2021-05-27 17:19:43.201000
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 2021-05-27 17:19:43.203315
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 2021-05-27 17:19:43.203349
                              QUEUEING_TRANSACTION:
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000
*************************** 2. row ***************************
                                      CHANNEL_NAME: group_replication_applier
                                        GROUP_NAME: f195537d-19ac-11eb-b29f-5254002eb6d6
                                       SOURCE_UUID: f195537d-19ac-11eb-b29f-5254002eb6d6
                                         THREAD_ID: NULL
                                     SERVICE_STATE: ON
                         COUNT_RECEIVED_HEARTBEATS: 0
                          LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00.000000
                          RECEIVED_TRANSACTION_SET: 476c0276-be03-11eb-bd34-525400e802e2:1-31:1000015-1000017,
f195537d-19ac-11eb-b29f-5254002eb6d6:1-18
                                 LAST_ERROR_NUMBER: 0
                                LAST_ERROR_MESSAGE:
                              LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
                           LAST_QUEUED_TRANSACTION: f195537d-19ac-11eb-b29f-5254002eb6d6:18
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 2021-05-27 17:04:03.407281
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 2021-05-27 17:04:03.407317
                              QUEUEING_TRANSACTION:
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000

執行 SHOW REPLICA STATUS 檢視狀態:

*************************** 1. row ***************************
             Replica_IO_State: Waiting for master to send event
                  Source_Host: 172.16.16.10
                  Source_User: repl
                  Source_Port: 3306
                Connect_Retry: 10
...
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
...
                  Source_UUID: 5499a6cb-91cb-11eb-966f-525400e802e2
...
           Source_Retry_Count: 3
...
           Retrieved_Gtid_Set: 476c0276-be03-11eb-bd34-525400e802e2:21-31:1000016-1000017
            Executed_Gtid_Set: 476c0276-be03-11eb-bd34-525400e802e2:1-31:1000015-1000017,
f195537d-19ac-11eb-b29f-5254002eb6d6:1-18
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: mgr-a
...               

先記住上面輸出結果中的 Source_Host 和  Source_UUID 等資訊,下面模擬一次複製源伺服器宕機後,自動切換複製源的場景。

4、模擬故障,確認可自動切換

在當前複製源伺服器上,執行 kill -9 殺掉 mysqld 程序,然後就能看到從伺服器上有類似如下日誌:

# 先嚐試3次(每次間隔10秒)重連舊的複製源伺服器
[ERROR] [MY-010584] [Repl] Slave I/O for channel 'mgr-a': error connecting to master '[email protected]:3306' - retry-time: 10 retries: 3 message: Can't connect to MySQL server on '172.16.16.10:3306' (111), Error_code: MY-002003

#重試失敗,停止複製I/O執行緒
[Note] [MY-010563] [Repl] Slave I/O thread for channel 'mgr-a' killed while connecting to master
[Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.

# 再次啟動複製I/O執行緒,連線到新的複製源伺服器
[System] [MY-010562] [Repl] Slave I/O thread for channel 'mgr-a': connected to master '[email protected]:3306',replication started in log 'FIRST' at position 8598
# 告知UUID發生切換了
[Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was ec2fcbeb-976c-11eb-a652-525400e2078a.

再次執行 SHOW REPLICA STATUS 確認複製源切換了:

*************************** 1. row ***************************
             Replica_IO_State: Waiting for master to send event
                  Source_Host: 172.16.16.11
                  Source_User: repl
                  Source_Port: 3306
                Connect_Retry: 10
...
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
...
                  Source_UUID: ec2fcbeb-976c-11eb-a652-525400e2078a
...
           Source_Retry_Count: 3
...
           Retrieved_Gtid_Set: 476c0276-be03-11eb-bd34-525400e802e2:21-32:1000016-1000017
            Executed_Gtid_Set: 476c0276-be03-11eb-bd34-525400e802e2:1-32:1000015-1000017,
f195537d-19ac-11eb-b29f-5254002eb6d6:1-18
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: mgr-a
...               

因為3個複製源的權重設定為一樣,所以當原來的複製源伺服器宕機恢復後,不會再切換回去。而如果舊的複製源伺服器權重設定較高的話,當他恢復後,會再次發生切換,切回原來的源:

#沒有任何嘗試重連的行為,直接發起切換
[Note] [MY-011026] [Repl] Slave I/O thread killed while reading event for channel 'mgr-a'.
[Note] [MY-010570] [Repl] Slave I/O thread exiting for channel 'mgr-a', read up to log 'FIRST', position 8871
[Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
[System] [MY-010562] [Repl] Slave I/O thread for channel 'mgr-a': connected to master '[email protected]:3306',replication started in log 'FIRST' at position 8871
[Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was 5499a6cb-91cb-11eb-966f-525400e802e2. -- 再次切回原來的主

這就很方便的可以實現自動故障轉移了。

現在,我們利用MGR + 增強半同步複製 + 自動故障轉移 構建了一套金融級應用場景下的兩地多中心資料庫架構方案。推薦選用可靠性、穩定性更高的GreatSQL,可以更放心的使用MGR( GreatSQL,打造更好的MGR生態 )。

後面再繼續介紹基於MGR的其他架構解決方案。

Enjoy GreatSQL :)

文章推薦:

掃碼新增GreatSQL社群助手微信好友

傳送“ 加群 ”加入GreatSQL/MGR交流群

亦可掃碼加入 GreatSQL/MGR交流 QQ群

想看更多技術好文,點個 “在看” 吧!