DPDK Graph Pipeline 框架簡介與實現原理

語言: CN / TW / HK

時間 2023-02-07 01:01:56 位元組跳動SYS Tech

主題:

DPDK 全稱為 Data Plane Development Kit ，是近年來在高速網路通訊行業中炙手可熱的一種網路報文處理加速框架。DPDK 從十年前誕生直至發展到今天已經可以支援業界主流的高階網絡卡以及各類加速硬體裝置，同時也支援主流的各個CPU 體系結構（可以運行於 X86, Arm, Power 等平臺）。同時也可以運行於 Linux/FreeBsd/Windows 等主流作業系統之上。DPDK因為其優異的效能指標被廣泛的應用於閘道器/負載均衡/SDN/虛擬交換的各個場景。

本文主要介紹 DPDK 中的 libgraph 設計思想以及實現，libgraph 的設計思想源自於開源專案 Vector Packet Processor(VPP)。VPP 中的向量包處理優化方案是 libgraph 的參照物件。因為 VPP 專案整體是一個非常全面的解決方案，從中剝離核心的設計框架為其它輕量級專案所用就變得很有意義。這也是 Libgraph 的產生背景。雖然它目前還是DPDK 中相對比較新的 lib，但是其優秀的設計思想還是值得學習參考。

下面將首先介紹 libgraph 架構的核心概念：標量和向量包處理（scalar vs vector packet processing）以及它們之間的區別。然後再介紹 libgraph 中的核心元件，以及它們之間的聯絡和互動方式。最後，我們將介紹 libgraph 的具體用例以及優缺點。

背景

DPDK libgraph 是一個向量包處理的框架。與傳統的標量包處理模型（一系列函式負責處理一個數據包，重複直到所有資料包處理完畢）相比，向量資料包處理模型的差別則在於一系列的函式處理一批資料包。這種方法的主要目的是解決當 pipeline 變複雜時所遇到 i-cache miss 較高的問題。

標量處理模型 vs 向量處理模型

標量處理模型在效能上不如向量處理模型的另一個原因，就是在不同處理函式（業務邏輯）之間，報文傳遞到下一個處理函式的方式上有所不同。前者只是簡單地用 pointer assignment 的方法來傳遞，若處理基於不同批次的情況下，由於低效的 pointer assignment，會導致處理一批報文時效能低下。

而對於向量包處理模型而言，向量包處理模型反而可以減少上述標量處理模型因批處理而導致效能不理想的情況。這是因為在向量模型中，報文在業務邏輯之間的傳輸除了可以通過最原始的 pointer assignment 之外，也可以通過優化的 memory copy，最優情況下可以僅僅交換pointer。它不僅消除了標量處理模型會遇到的問題，而且在設計 pipeline 時還可以實現更優的記憶體分配並與業務邏輯解耦。

圖片來源：http://fd.io/gettingstarted/technology/

向量處理模型，graph 有 1 個起點（dpdk-input, n packets）以及 7 個業務邏輯

Graph 框架優點

總體而言，與傳統的標量處理模型相比，以下概括了 graph 框架主要的優勢

更好的 i-cache 管理，更好的icache locality。
靈活的 pipeline 模型（pipeline 框架是從業務邏輯中抽象出來的）
減少 pointer 複製
Node 可以累積多個先前 node 所傳遞的報文，因此批處理效能更好
表驅動節點排程（便於支援QoS）而不是 hardcode pipeline

Graph 工作流程

現在來說一下整體上 graph 框架的工作流程。如上圖所述的一樣，一個 graph 是由不同的節點 (node) 所組成的。與傳統的標量模型相比，兩個框架其實都包含一樣的邏輯。最大的不同是 graph 框架中 node 節點在執行時與其他 node節點的互動方式，或者具體地來說，是報文在node節點中的傳遞方式。

一個 node 中的 報文 傳送到不同的 node，home run 情況 (優化案例) 以及正常 enqueue情況；動態增加 node 的 objects 佇列大小

Node 的 Object 佇列大小

預設情況下，每個 node 都有一個 RTE_GRAPH_BURST_SIZE 大小（node->size）的 object 佇列，它定義了該 node 中可以容納和處理的報文數量。如果一個 node 試圖將報文排隊到下一個 node，但下一個 node 已經有報文在排隊並且已滿（例如在 l3fwd 中，使用者可以有多個接收佇列/ node 排隊到同一個下一個 node pkt_cls），node 的 object 佇列大小將會被動態的加倍。這意味著 node 具有批處理的優勢，因為它可以累積報文。

注意：當前 API 不支援 node 的 object 佇列大小在增大後再縮小

Objs 的傳遞 — 'home run'/normal enqueue

報文轉換有兩種型別，一種是正常的 enqueue，另一種是優化情況，“home run”。

Normal enqueue

在上面介紹 node 的 object 佇列大小部分，我們提到將報文排隊到下一個 node 的概念。對於普通的 enqueue，就是簡單的將當前 node 中處理好的報文，通過使用 rte_memcpy / pointer assignment 作為傳遞方法， enqueue 到下一個可能的 node。

Home run

Home run 則是一種優化的情況，它不像普通的 enqueue 把報文複製到下一個 node，而是簡單地交換當前 node 和下一個 node 之間的有關於報文的指標（例如報文、報文數目），從而消除 memory copy / pointer assignment 的開銷。當滿足以下條件時，home run 才會被觸發：

所有已處理好的報文都將前往同一個 node
下一個 node 的 object 佇列沒有任何需要處理的報文 (node 的 object 佇列為空)

/* Home run scenario */

/* Swap the pointers if dst don't have valid objs */

if (likely(dst->idx == 0)) {

    void **dobjs = dst->objs;

    uint16_t dsz = dst->size;

    dst->objs = src->objs;

    dst->size = src->size;

    src->objs = dobjs;

    src->size = dsz;

    dst->idx = src->idx;

    __rte_node_enqueue_tail_update(graph, dst);

}

Graph 遍歷 node的流程

/**

 * @file lib/graph/rte_graph_worker.h:rte_graph_walk

 */

 

/*

 * Walk on the source node(s) ((cir_start - head) -> cir_start) and then

 * on the pending streams (cir_start -> (cir_start + mask) -> cir_start)

 * in a circular buffer fashion.

 *

 *  +-----+ <= cir_start - head [number of source nodes]

 *  |     |

 *  | ... | <= source nodes

 *  |     |

 *  +-----+ <= cir_start [head = 0] [tail = 0]

 *  |     |

 *  | ... | <= pending streams

 *  |     |

 *  +-----+ <= cir_start + mask

 */

    const rte_graph_off_t *cir_start = graph->cir_start;

    const rte_node_t mask = graph->cir_mask;

    uint32_t head = graph->head;



    while (likely(head != graph->tail)) {

        node = RTE_PTR_ADD(graph, cir_start[(int32_t)head++]);

        RTE_ASSERT(node->fence == RTE_GRAPH_FENCE);

        objs = node->objs;

        rte_prefetch0(objs);



        if (rte_graph_has_stats_feature()) {

            [...]

        } else {

            node->process(graph, node, objs, node->idx);

        }

        node->idx = 0;

        head = likely((int32_t)head > 0) ? head & mask : head;

    }

    graph->tail = 0;

注意：rte_graph_walk 呼叫 struct rte_graph 的記憶體佈局以及 circular buffer (cir_start) 可以參考後面元件的介紹

graph walk 的概念簡單明瞭，它總是有次序地去完成一系列操作：

21行：定位下一個需要處理的 node
23行：準備下一個需要處理的 node 的 objects /報文
29行：處理定位好的 node
32行：檢查 graph walk 是否完成

在 graph walk 層面來看，最重要的就是第一步：定位下一個要處理的 node。DPDK 的 graph 框架利用了 circular buffer 和附帶的 head 和 tail 指標定位下一個 node，如果 head 不等於 tail 指標，這就代表了 circular buffer 裡有需要處理的 node，而在 libgraph 庫中，這些待處理的 node 也被稱為 pending stream。

回到第一步，它是以一個 circular buffer (cir_start[head++]) 的 head 來定位，或者更詳細地從 graph walk 的始點說起，由於 graph->head 在初始化的時候被指定為 -src_node_count （負的 source node 數量，可以參考下一部分 rte_graph 的記憶體佈局），所以 cir_start[head] 其實就是第一個 source node。那麼 tail 的作用是什麼？libgraph 是如何利用它來動態修改 circular buffer 和 pending stream 幫助 node 的定位呢？

在回答這些問題之前，我們首先來了解下面兩個 API 函式，分別是 __rte_node_enqueue_prologue 和 __rte_node_enqueue_tail_update。

__rte_node_enqueue_prologue(struct rte_graph *graph, struct rte_node *node,

                const uint16_t idx, const uint16_t space) {

    /* Add to the pending stream list if the node is new */

    if (idx == 0)

        __rte_node_enqueue_tail_update(graph, node);



    if (unlikely(node->size < (idx + space)))

        __rte_node_stream_alloc(graph, node);

}

每次當前 node 需要將報文 enqueue 到任何下一個 node 時，都需要呼叫 __rte_node_enqueue_prologue去檢查以下內容：

下一個 node 是否已被新增到 pending stream （待處理的 node）
是否需要為下一個 node 分配記憶體

其中變數 idx 和 space 分別表示下一個 node 已有的報文數目，以及當前 node 想要enqueue的數目。

static __rte_always_inline void

__rte_node_enqueue_tail_update(struct rte_graph *graph, struct rte_node *node)

{

    uint32_t tail;



    tail = graph->tail;

    graph->cir_start[tail++] = node->off;

    graph->tail = tail & graph->cir_mask;

}

而API 函式 __rte_node_enqueue_tail_update目的就是調整 tail 把下一個 node 放到 pending stream 裡。

下圖展示了報文的傳遞跟 node 是如何被加到 pending stream 的關係。

注意一下 node 新增到 pending stream 的順序取決於處理報文時的順序。例如，如果報文 1 和 3 到 node2，而報文 2 和 4 到 node3，那麼 circular buffer/pending stream 中 node1 和 node2 的順序將被交換。

元件

libgraph 的核心元件是 graph 和 node。graph 作為排程 node 的全域性資料結構，而 node 則是用來封裝業務邏輯。

Graph

3 個node; 1 個 source node (源節點)和 2 個 node (非源節點)的 graph 結構

上圖顯示了具有 3 個節點的 graph 結構，包含以下關鍵元件：

graph_list

儲存所有建立成功的 graph。如果 graph 建立成功，graph 會被新增到 graph_list 中。與傳統的 main loop 設計相比（把函式直接掛載到 CPU 核上），使用者可以透過這種管理模式把 graph_list 中的 graph 對映到不同的 CPU 核上。

graph->node_list

每個建立成功的 graph 中的 node_list 都包含了先前新增到 graph 中的 node。注意此處使用的是 graph_node 結構，其中包含了本身的 node（struct node* node）以及其相鄰的 node。

struct graph_node vs struct node

struct node 結構體的用法僅僅是用來描述對應的 node，和 graph 是沒有關係的。而 struct graph_node 其實就是把 struct node 包裝了一層，主要目的就是強調這個 node 在一個 graph 中。

struct rte_graph

接下來介紹 struct rte_graph（rte_*: runtime 時所用到的資料結構）的記憶體佈局，建議先把先前的 graph walk 部分瀏覽一遍，以更好了解 rte_graph 是如何被使用。

rte_graph 記憶體佈局。注意此例除有 3 個 source node，而待處理的流 (pending stream)中有 2 個節點。

現在可以更清楚地看到 struct graph 及其子元件 struct node 的目的是儲存內部資料/用於設計，而 struct rte_graph 和 struct rte_node 是用於 runtime。runtime 的記憶體佈局包含了所有必需的元件，比如所有source node、non-source nodes 以及 pending steam 的 node（從cir_start 開始）。

先前我們提到在 graph walk 時需要從 source node 開始，以及需要利用 circular buffer 來定位待處理的node。現在透過分析記憶體佈局的幫助下，我們能更好地去想象 graph walk 是如何從 cir_start[-src_node_count] 開始，直到再沒有任何 pending stream (head 等於 tail）為止。

struct node

理解了 graph 的運作原理後，我們現在討論一下底層的關鍵元件 node。先前我們已經提到 graph_node 和 node 的部分區別，其實關鍵區別在於 node 實際包含了什麼資訊。

struct node {

        STAILQ_ENTRY(node) next;              /* Next node in the list. */

        char name[RTE_NODE_NAMESIZE];         /* Name of the node. */

        uint64_t flags;                       /* Node config flag - source node?*/

        rte_node_process_t process;           /* Node process function. */

        rte_node_init_t init;                 /* Node init function. */

        rte_node_fini_t fini;                 /* Node fini function. */

        rte_node_t id;                        /* Allocated identifier for the node. */

        rte_node_t parent_id;                 /* Parent node identifier. */

        rte_edge_t nb_edges;                  /* Number of edges from this node. */

        char next_nodes[][RTE_NODE_NAMESIZE]; /* Names of next nodes. */

};

Struct node 結構體 (lib/graph/graph_private.c)

向量包處理的概念是一次處理一個向量/一批報文，所以一個節點其實就是代表一個function，只不過處理物件是一批報文。

一個節點除了包含相鄰 node 的資訊外，還包含有助於識別 node 的基本資訊，例如 id、parent_id、flags、name 等。最重要的是，除了核心處理函式 rte_node_process_t process 以外，它還包含了在建立、銷燬 graph 時執行的 init 和 fini 函式。 init 函式的一些實際可以是配置路由表、定義 ACL 規則等。

Node 的註冊

node 的註冊是通過 contructor scheme 的 (attribute((constructor))）來完成的。下面顯示了一個非常簡單的 node 註冊示例。需要強調的是： .process 的函式定義突出了向量資料包處理的概念（需要處理的報文 (**objs) 及其數量 (nb_objs)）。

/**

 * @param objs

 *   Pointer to an array of objects to be processed.

 * @param nb_objs

 *   Number of objects in the array.

 *

 */


 /* This hello world node simply does nothing but to return the number

     of objects in the node */

static uint16_t

hello_world_process(struct rte_graph *graph, struct rte_node *node, void **objs,

        uint16_t nb_objs)

{

        printf( Hello World\n );

        return nb_objs;

}



static struct rte_node_register hello_world_node = {

        .process = hello_world_process,

        .flags = RTE_NODE_SOURCE_F,

        .name =  hello_world_node ,

};



RTE_NODE_REGISTER(hello_world_node);

構造一個 print Hello World 的 source node

source node vs non-source node

在任何處理模型都必須有一個起點，而在 libgraph 中每個 graph pipeline 都必須從 source node 開始。而對於非 source node 來說，它們則是依賴於先前 node 是否會將報文排隊到它們的節點，因此它們只會出現在動態的 pending stream 中。

總結

本文介紹了 DPDK graph API 的框架以及它內部是如何實現的。 graph 框架的優勢在於減少 i-cache miss 、更好的抽象業務邏輯，以及提高使用者設計 pipeline 的靈活度，但 graph 框架也同時帶來下列缺點：

引入潛在記憶體安全風險
複雜的程式碼邏輯，需要大量的測試工作
初學者學習曲線陡峭

總體而言，與傳統的 pipeline 模式相比，graph 框架也會出現 tradeoff。使用者需要考慮其業務邏輯的複雜程度，才能有計劃地設計出可管理的 graph pipeline。graph 模型更適用於 pipeline 比較複雜的場景。有興趣的讀者可以持續關注後期即將釋出的《DPDK Graph Pipeline 框架示例和效能分析》，以更好地理解 DPDK libgraph 的效能和實際用途。

「其他文章」