A Problem Common to All Hadoop 2.X Versions


【Overview】

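For an RM or NN configured in HA mode, a client request sent to the standby node fails, either because the connection cannot be established or because the standby refuses to serve it; the client then sends the request to the Active node. The Hadoop client performs this switch internally and automatically, so the upper-layer application does not need to be aware of it (in essence, the client sends the request to one node and, if that fails, tries the other).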
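The "try one node, then the other" behavior can be illustrated with a minimal sketch. This is not Hadoop's actual retry code: the class name `FailoverSketch`, the method `callWithFailover`, and the use of `Callable` to stand in for a node are all assumptions made for illustration.

```java
import java.net.ConnectException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

public class FailoverSketch {
    /**
     * Hypothetical helper: try the nodes round-robin until one call succeeds
     * or maxAttempts is exhausted.
     */
    static <T> T callWithFailover(List<Callable<T>> nodes, int maxAttempts) {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Callable<T> node = nodes.get(attempt % nodes.size());
            try {
                return node.call();
            } catch (Exception e) {
                // Standby refused the call or was unreachable: fail over to the next node.
                last = e;
            }
        }
        throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
    }

    /** Simulates a standby that refuses connections and an active node that answers. */
    static String demo() {
        Callable<String> standby = () -> {
            throw new ConnectException("Connection refused");
        };
        Callable<String> active = () -> "RUNNING";
        return callWithFailover(Arrays.asList(standby, active), 4);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints RUNNING
    }
}
```

The caller only sees the final result; the failed attempt against the standby stays hidden inside the loop, which is why the upper-layer application normally never notices the switch.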

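Last week I investigated a related problem: with the cluster healthy, requests to both nodes failed, and kept failing, so the client was stuck in an endless loop. The root cause turned out to be in Hadoop's internal RPC mechanism, and the problem exists throughout the 2.X versions. This article walks through it.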

【Symptoms】

One day, a colleague on the application team reported a problem: the yarn client failed when requesting the status of an application.

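After learning the symptom, I first checked the logs of both RMs and found no errors. I then tried to fetch the status of the "problematic" application through the command line and through a yarn client, and both returned it correctly.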

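Talking to the colleague again, I learned that only this one application was affected; every other application could be queried normally. He also provided the error output from the failing query: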

22/06/20 20:48:06 DEBUG ipc.Client: IPC Client (1291113768) connection to resourcemanager-0/172.16.55.7:8032 from 28573: closed
22/06/20 20:48:06 TRACE ipc.ProtobufRpcEngine: 1: Exception <- resourcemanager-0/172.16.55.7:8032: getApplicationReport {java.net.ConnectException: Call From pvs285731713/10.33.72.132 to resourcemanager-0:8032 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused}
22/06/20 20:48:06 TRACE retry.RetryInvocationHandler: Call#-2: ApplicationBaseProtocol.getApplicationReport([application_id { id: 1 cluster_timestamp: 1655720645233 }])
java.net.ConnectException: Call From pvs285731713/10.33.72.132 to resourcemanager-0:8032 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:827)
  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757)
  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
  at org.apache.hadoop.ipc.Client.call(Client.java:1495)
  at org.apache.hadoop.ipc.Client.call(Client.java:1394)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
  at com.sun.proxy.$Proxy7.getApplicationReport(Unknown Source)
  at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:236)
  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
  at com.sun.proxy.$Proxy8.getApplicationReport(Unknown Source)
  at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:509)
  at Test.main(Test.java:27)
Caused by: java.net.ConnectException: Connection refused: no further information
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
  at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
  at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:701)
  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:814)
  at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1610)
  at org.apache.hadoop.ipc.Client.call(Client.java:1441)
  ... 16 more
22/06/20 20:48:06 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From pvs285731713/10.33.72.132 to resourcemanager-0:8032 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm1 after 4 failover attempts. Trying to failover after sleeping for 1017ms.
22/06/20 20:48:06 TRACE retry.RetryInvocationHandler: #-2 processRetryInfo: retryInfo=RetryInfo{retryTime=5131870305, delay=1017, action=RetryAction(action=FAILOVER_AND_RETRY, delayMillis=1017, reason=null), expectedFailoverCount=12, failException=null}, waitTime=1016
22/06/20 20:48:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
22/06/20 20:48:08 TRACE ipc.ProtobufRpcEngine: 1: Call -> resourcemanager-1:8032: getApplicationReport {application_id { id: 1 cluster_timestamp: 1655720645233 }}
22/06/20 20:48:08 TRACE ipc.ProtobufRpcEngine: 1: Exception <- resourcemanager-1:8032: getApplicationReport {java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "resourcemanager-1":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost}
22/06/20 20:48:08 TRACE retry.RetryInvocationHandler: Call#-2: ApplicationBaseProtocol.getApplicationReport([application_id { id: 1 cluster_timestamp: 1655720645233 }])
java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "resourcemanager-1":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:827)
  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:770)
  at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:461)
  at org.apache.hadoop.ipc.Client.getConnection(Client.java:1590)
  at org.apache.hadoop.ipc.Client.call(Client.java:1441)
  at org.apache.hadoop.ipc.Client.call(Client.java:1394)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
  at com.sun.proxy.$Proxy7.getApplicationReport(Unknown Source)
  at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:236)
  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
  at com.sun.proxy.$Proxy8.getApplicationReport(Unknown Source)
  at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:509)
  at Test.main(Test.java:27)
Caused by: java.net.UnknownHostException
  ... 19 more
22/06/20 20:48:08 INFO retry.RetryInvocationHandler: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "resourcemanager-1":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over rm2 after 5 failover attempts. Trying to failover after sleeping for 2978ms.

【Analysis】

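Putting the above together: the RMs logged no errors and the command line could fetch the application's status, so a server-side problem could largely be ruled out. On the application side, only this one application could not be queried while all others were fine, and the application code uses a separate yarn client object for each application. At this point it was fairly certain that the problem was on the client side.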

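The error log shows that RM1 is the standby and does not listen on port 8032, so the client's failure to connect to RM1 is expected. The client then tries to connect and send the request to RM2, but that attempt throws an UnknownHostException, so the client goes back to RM1, and the cycle repeats indefinitely. The UnknownHostException is therefore the prime suspect.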

To dig further, let's read through the source code and follow the interaction flow.

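First, when the client creates a connection object, it checks whether the server address has been resolved; if not, it throws an exception immediately (this is where the exception in the log above comes from):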

public Connection(ConnectionId remoteId, int serviceClass) throws IOException {
    this.remoteId = remoteId;
    this.server = remoteId.getAddress();
    if (server.isUnresolved()) {
        throw NetUtils.wrapException(
            server.getHostName(),
            server.getPort(),
            null,
            0,
            new UnknownHostException());
    }
    ...
}
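The check above relies on `InetSocketAddress.isUnresolved()`. The following sketch, using only the JDK, shows how such an address behaves; the class name `UnresolvedAddressDemo` and the helper `canConnect` are hypothetical, and `canConnect` only mirrors the condition in the constructor rather than reproducing Hadoop's code.

```java
import java.net.InetSocketAddress;

public class UnresolvedAddressDemo {
    /** Hypothetical helper mirroring the constructor's check: only resolved addresses pass. */
    static boolean canConnect(InetSocketAddress server) {
        return !server.isUnresolved();
    }

    public static void main(String[] args) {
        // createUnresolved never performs a lookup; the object keeps only host name and port.
        InetSocketAddress unresolved =
            InetSocketAddress.createUnresolved("resourcemanager-1", 8032);
        System.out.println(unresolved.isUnresolved()); // prints true
        System.out.println(unresolved.getAddress());   // prints null
        System.out.println(canConnect(unresolved));    // prints false

        // An IP literal needs no DNS lookup, so this address is always resolved.
        InetSocketAddress literal = new InetSocketAddress("127.0.0.1", 8032);
        System.out.println(canConnect(literal));       // prints true
    }
}
```

An `InetSocketAddress` is immutable, so an instance that starts out unresolved stays unresolved for as long as it is kept; a new lookup requires building a new object.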

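Second, when the server side is deployed in HA mode, the client's RPC proxy layer has retry logic: when a single RPC request hits an exception, a callback switches to the other RM, fetches the corresponding proxy object, and retries the request.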

When fetching the proxy object, the provider actually creates one proxy object per RM and caches it in a map; later calls take it straight from the map.

// ConfiguredRMFailoverProxyProvider.java
public synchronized ProxyInfo<T> getProxy() {
    String rmId = rmServiceIds[currentProxyIndex];
    T current = proxies.get(rmId);
    if (current == null) {
        current = getProxyInternal();
        proxies.put(rmId, current);
    }
    return new ProxyInfo<T>(current, rmId);
}

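When a proxy object is first created, the server address is resolved. If it cannot be resolved, an unresolved socket address is created and stored in the proxy object (note: this is the address used when establishing the connection).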

// ConfiguredRMFailoverProxyProvider.java
// Obtain the proxy object
protected T getProxyInternal() {
    try {
        // Resolve the RM address
        final InetSocketAddress rmAddress = rmProxy.getRMAddress(conf, protocol);
        return rmProxy.getProxy(conf, protocol, rmAddress);
    } catch (IOException ioe) {
        LOG.error(
            "Unable to create proxy to the ResourceManager " +
            rmServiceIds[currentProxyIndex], ioe);
        return null;
    }
}

// ClientRMProxy.java
public InetSocketAddress getRMAddress(YarnConfiguration conf, Class<?> protocol) 
    throws IOException {
    if (protocol == ApplicationClientProtocol.class) {
        return conf.getSocketAddr(
            YarnConfiguration.RM_ADDRESS,
            YarnConfiguration.DEFAULT_RM_ADDRESS,
            YarnConfiguration.DEFAULT_RM_PORT);
    } 
    ...
}

// Configuration.java
public InetSocketAddress getSocketAddr(
    String name, String defaultAddress, int defaultPort) {
    final String address = getTrimmed(name, defaultAddress);
    return NetUtils.createSocketAddr(address, defaultPort, name);
}

// NetUtils.java
public static InetSocketAddress createSocketAddr(String target, int defaultPort, String configName) {
    ...
    return createSocketAddrForHost(host, port);
}

public static InetSocketAddress createSocketAddrForHost(String host, int port) {
    String staticHost = getStaticResolution(host);
    String resolveHost = (staticHost != null) ? staticHost : host;

    InetSocketAddress addr;
    try {
        InetAddress iaddr = SecurityUtil.getByName(resolveHost);
        // if there is a static entry for the host, make the returned
        // address look like the original given host
        if (staticHost != null) {
            iaddr = InetAddress.getByAddress(host, iaddr.getAddress());
        }
        addr = new InetSocketAddress(iaddr, port);
    } catch (UnknownHostException e) {
        // Catch the exception and create an unresolved socket address
        addr = InetSocketAddress.createUnresolved(host, port);
    }
    return addr;
}

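At this point the cause is clear: the server address is resolved and stored only when the proxy object is first created, and the proxy object is cached in the map and reused from then on. When the connection is actually set up, the client checks whether the address is resolved and throws immediately if it is not. If the RM whose address failed to resolve happens to be the Active one, the problem described above occurs.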
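The interaction between the cache and the resolved-address check can be reproduced in a small simulation. Everything here is an assumption for illustration: `FakeProxy` stands in for the cached RM proxy, the `dnsHealthy` flag simulates whether name resolution works, and an IP literal plays the role of a successful lookup so the example does not depend on real DNS.

```java
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

public class CachedProxyBugDemo {
    /** Hypothetical stand-in for a cached proxy: it keeps the address built at creation time. */
    static final class FakeProxy {
        final InetSocketAddress address;

        FakeProxy(InetSocketAddress address) {
            this.address = address;
        }

        /** Same condition as the connection constructor: unresolved means immediate failure. */
        void call() throws UnknownHostException {
            if (address.isUnresolved()) {
                throw new UnknownHostException("Invalid host name: " + address.getHostString());
            }
        }
    }

    private final Map<String, FakeProxy> proxies = new HashMap<>();
    boolean dnsHealthy = false; // simulated DNS state, broken at first

    /** Same shape as getProxy(): create once, then always reuse the cached object. */
    FakeProxy getProxy(String rmId) {
        FakeProxy proxy = proxies.get(rmId);
        if (proxy == null) {
            InetSocketAddress addr = dnsHealthy
                ? new InetSocketAddress("127.0.0.1", 8032)                       // simulated success
                : InetSocketAddress.createUnresolved("resourcemanager-1", 8032); // lookup failed
            proxy = new FakeProxy(addr);
            proxies.put(rmId, proxy);
        }
        return proxy;
    }

    /** Counts failed calls made after the simulated DNS has recovered. */
    static int failuresAfterDnsRecovers(int attempts) {
        CachedProxyBugDemo client = new CachedProxyBugDemo();
        client.getProxy("rm2");   // first use happens while DNS is broken
        client.dnsHealthy = true; // DNS recovers, but the cached proxy is never rebuilt
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                client.getProxy("rm2").call();
            } catch (UnknownHostException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(failuresAfterDnsRecovers(5)); // prints 5
    }
}
```

Every attempt fails even though resolution would now succeed, because nothing ever replaces the cached entry. This matches the log above, where each failover to rm2 ends in the same UnknownHostException.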

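Moreover, the problem affects only the single client (yarn client) involved and no others. This explains why only one application could not be queried on the application side while the rest were fine, and why querying again from the command line or from another client succeeded.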

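Finally, if the application handled the exception by creating a new client instead of continuing to send requests through the same client object, the problem would not occur either.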
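That client-side workaround could look roughly like the sketch below. `StatusClient` is a hypothetical interface, not a Hadoop type, and `queryWithFreshClients` is an illustrative helper; with the real yarn client, the factory would build and start a new client object, which resolves the RM addresses again.

```java
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class RecreateClientOnFailure {
    /** Hypothetical minimal client interface; not a Hadoop type. */
    interface StatusClient extends AutoCloseable {
        String getStatus(String appId) throws Exception;

        @Override
        void close(); // narrowed so that closing throws no checked exception
    }

    /** On failure, discard the client and retry with a newly created one. */
    static String queryWithFreshClients(Supplier<StatusClient> factory,
                                        String appId, int maxClients) {
        Exception last = null;
        for (int i = 0; i < maxClients; i++) {
            // try-with-resources closes the failed client before the next one is created.
            try (StatusClient client = factory.get()) {
                return client.getStatus(appId);
            } catch (Exception e) {
                last = e;
            }
        }
        throw new IllegalStateException("gave up after " + maxClients + " clients", last);
    }

    /** The first client simulates one that cached an unresolved address; later ones work. */
    static String demo() {
        AtomicInteger created = new AtomicInteger();
        Supplier<StatusClient> factory = () -> {
            final boolean broken = created.getAndIncrement() == 0;
            return new StatusClient() {
                @Override
                public String getStatus(String appId) throws Exception {
                    if (broken) {
                        throw new UnknownHostException("resourcemanager-1");
                    }
                    return "RUNNING";
                }

                @Override
                public void close() {
                }
            };
        };
        return queryWithFreshClients(factory, "application_1655720645233_0001", 3);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints RUNNING
    }
}
```

This only helps once the failing call actually returns control to the caller, so it is a mitigation on the application side rather than a fix for the underlying behavior.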

【Solution】

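The fix is fairly simple. Someone in the community had already found the problem and submitted a patch. The change removes the resolved-address check performed when the connection object is created; instead, when the connection is actually established, an unresolved address raises an exception that is caught and triggers a fresh resolution.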

So applying that patch resolves the problem.
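The idea behind the patch, as described above, can be sketched with the JDK alone. This is not the patch itself: the class `ReResolveSketch` and the methods `requireResolved` and `addressForConnect` are hypothetical names, and the choice of `UnresolvedAddressException` as the exception type is an assumption.

```java
import java.net.InetSocketAddress;
import java.nio.channels.UnresolvedAddressException;

public class ReResolveSketch {
    /** Fail at connect time, not at construction time, when the address is unresolved. */
    static InetSocketAddress requireResolved(InetSocketAddress addr) {
        if (addr.isUnresolved()) {
            throw new UnresolvedAddressException();
        }
        return addr;
    }

    /** Return an address usable for connecting, re-resolving once if the cached one is stale. */
    static InetSocketAddress addressForConnect(InetSocketAddress cached) {
        try {
            return requireResolved(cached);
        } catch (UnresolvedAddressException e) {
            // Building a new InetSocketAddress from host and port performs the lookup again.
            InetSocketAddress fresh =
                new InetSocketAddress(cached.getHostString(), cached.getPort());
            // If it is still unresolved, let the exception propagate so the caller can retry.
            return requireResolved(fresh);
        }
    }

    public static void main(String[] args) {
        // An IP literal is used as the "host" so the example does not depend on DNS.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("127.0.0.1", 8032);
        InetSocketAddress usable = addressForConnect(stale);
        System.out.println(stale.isUnresolved());  // prints true
        System.out.println(usable.isUnresolved()); // prints false
    }
}
```

Because the lookup is repeated on each connection attempt that starts from an unresolved address, a transient resolution failure at proxy-creation time no longer poisons the client permanently.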

【Summary】

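To sum up, this article used a real case to describe a problem caused by caching inside Hadoop's RPC layer. Hadoop RPC has many more such details, and we have hit other pitfalls as well; I'll cover them in later posts.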


This article was shared from the WeChat official account hncscwc (gh_383bc7486c1a).