深入解析 TiFlash丨面向編譯器的自動向量化加速

語言: CN / TW / HK

時間 2022-07-11 20:36:57 PingCAP

SIMD（Single Instruction Multiple Data）是一種採用一個控制器來控制多個處理器，同時對一組資料（又稱“資料向量”）中的每一個分別執行相同的操作從而實現空間上的並行性的技術，是重要的程式加速手段。本文將簡要介紹一些在 TiFlash 中使用編譯器進行自動向量化所需要的入門知識。

SIMD 介紹

SIMD 是重要的程式加速手段。CMU DB 組在 Advanced Database Systems 中有專門的兩個章節（ vectorization-1 , vectorization-2 ）介紹 SIMD 向量化在資料庫中的應用，可見其對現代資料庫系統的重要性。

TiFlash 目前支援的架構是 x86-64 和 Aarch64，作業系統平臺有 Linux 和 MacOS。受制於平臺 ISA 和作業系統 API，在不同環境中做 SIMD 支援會遇到不同的麻煩。

X86-64

我們在傳統上把 x86-64 平臺分為 4 個 Level：

x86-64 : CMOV, CMPXCHG8B, FPU, FXSR, MMX, FXSR, SCE, SSE, SSE2
x86-64-v2 : (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
x86-64-v3 : (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
x86-64-v4 : AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL

每個層次上有不同的拓展指令集支援。現狀是 TiFlash 在 x86-64 上編譯的目標是 x86-64-v2，而目前絕大部分家用和伺服器 CPU 均已支援 x86-64-v3。由於 Intel 目前面臨大小核架構的更新，x86-64-v4 的支援相對混亂，但在伺服器方面，比較新的型號均帶有不同程度的 AVX512 支援。在 AWS 的支援矩陣中我們可以看到第三代志強可拓展處理器等支援 AVX512 的型號已經被採用於生產環境。

x86-64 上不同 CPU 架構之前相同拓展指令集的開銷也是不同的，一般來說，可以在 Intel Intrinsic Guide 上簡要檢視相關指令在不同微架構上的 CPI 資訊。而如果要針對具體的平臺優化，則可以閱讀平臺相關的 Tuning Guides and Performance Analysis Papers ， INTEL® ADVANCED VECTOR EXTENSIONS 以及 Intel® 64 and IA-32 Architectures Software Developer Manuals （Software Optimization Reference Manual 系列）來獲得 Intel 官方的建議。

如何選擇 SSE，AVX/AVX2，AVX512？其實並不是技術越新，位寬越大，效果就一定越好。如，在 INTEL® ADVANCED VECTOR EXTENSIONS 的 2.8 章我們可以看到，混用傳統 SSE 和 AVX 指令集會導致所謂的 SSE-AVX Transition Penalty:

另一方面，AVX2，AVX512 都有相應的 Frequency Scaling 問題。Cloudflare 的文章 On the dangers of Intel's frequency scaling 以及 Gathering Intel on Intel AVX-512 Transitions 對這個問題都有分析。簡單而言，AVX-512 在密集計算中可以提高效能，此時 CPU 頻率下降，不過向量化本身極大的提升了速度。但是，如果在非密集場景下混用 AVX512 和普通指令，我們可以想象降頻給整體效能帶來的損失。

在 Intel 平臺上，SIMD指令集對應的是 XMM，YMM，ZMM 等暫存器，我們可以用 gdb 的

disassmble 指令來查看向量化的結果：

#!/usr/bin/env bash

args=(-batch -ex "file $1 " )

while IFS= read -r line;

args+=( "-ex" "disassemble ' $line '" )

done < <(nm --demangle $1 | grep $2 | cut -d\ -f3-)

gdb " ${args[@]} " | c++filt

# bash ./this-script.sh tiflash xxx

#!/usr/bin/env bash

# LLDB version

args=(--batch -o "file $1 " )

while IFS= read -r line;

args+=( "-o" "disassemble -F intel -n ' $line '" )

done < <(nm --defined-only --demangle $1 | grep $2 | cut -d\ -f3-)

lldb " ${args[@]} " | c++filt

# bash ./this-script.sh tiflash xxx

Aarch64

在 Arm 世界裡也存在平臺向量化指令集支援參差不齊的問題。Arm V8 目前已經細化出了 8 個版本：

在 SIMD 方面，Aarch64 主要有兩個三個的指令集 ASIMD，SVE，SVE2。ASIMD 已經在廣泛應用，事實上， GCC/Clang 會預設開啟 ASIMD 支援。在 Arm V8 中，SVE 一般不在 A Profile 中實現，而是用於 HPC 等的專業 CPU 中。在 Arm V9 中，SVE，SVE2 已經成為標配的拓展指令集。

ASIMD 描述的是定長向量化操作，作用於 64bit 和 128bit 的暫存器，功能上和 SSE 系列接近。 SVE 則是使用變長向量，Vendor 可以提供最高到 2048bit 的超寬暫存器。使用 Per-Lane Prediction 的方案，SVE 指令集建立了一種無需知道實際暫存器寬度的程式設計模型。

在實際應用中， AWS C7g （基於 AWS Graviton3）已經開始支援 SVE 指令集，最高可達 256bit 寬度。而 ASIMD 則在鯤鵬，AWS Graviton2 等 CPU 的例項上都有很好的實現。

在 AARCH64 上，常見的 ASIMD 相關的暫存器是 q0-q15，它們有時也會以 v0-v15 加字尾的形式出現在 ASM 中。SVE 等則使用 z0-z15。

SIMD 函式派發方案

TiFlash 的 CD Pipeline 對於每種 OS/Arch 組合生成一個統一的二進位制檔案包進行釋出，因此整體編譯的目標都是相對通用的架構。而 SIMD 指令集在不同平臺具有差異性，因此我們需要一些方案來派發被向量化的函式。以下提供兩大類方案，執行時和載入時。整體來說，可以參考以下條件來選擇：

如果想支援非 Linux 目標，且已知操作本身用時相對較多，不在乎多一兩個 branch，可以使用執行時的派發。在這種情況下，TiFlash 裡有提供對應向量化方案的執行時開關，功能更可控。
如果操作極其大量地被使用，且 branch 可能會影響效能，可以優先考慮載入時派發。TiFlash 在生產環境中基本上使用 Linux，所以可以只為 MacOS 提供預設版本的函式。

執行時派發

這個方案相對簡單，在 common/detect_features.h 中，TiFlash 提供了檢查具體 CPU 功能的方案，我們可以寫一個執行時檢查功能，再決定具體實現方案的函式入口。這種方案適用於已知向量化操作耗時比較長，相比可以忽略派發代價的情況。

觀察下面這段程式碼：

__attribute__((target( "avx512f" )))

void test4096_avx512 ( bool * __restrict a, const int * __restrict b)

{

for ( int i = 0 ; i < 4096 ; ++i)

{

a[i] = b[i] > 0 ;

}

__attribute__((target( "avx2" )))

void test4096_avx2 ( bool * __restrict a, const int * __restrict b)

{

for ( int i = 0 ; i < 4096 ; ++i)

{

a[i] = b[i] > 0 ;

}

__attribute__((noinline))

void test4096_generic ( bool * __restrict a, const int * __restrict b)

{

for ( int i = 0 ; i < 4096 ; ++i)

{

a[i] = b[i] > 0 ;

}

void test4096 ( bool * __restrict a, const int * __restrict b)

{

if (common::cpu_feature_flags.avx512f)

{

return test4096_avx512(a, b);

}

if (common::cpu_feature_flags.avx2)

{

return test4096_avx2(a, b);

}

return test4096_generic(a, b);

}

可以看到，函式入口就是檢測功能，呼叫對應平臺的實現：

而具體的函式則有相應平臺的向量化優化

實際上，對於這種同函式體的派發，TiFlash 已經提供了包裝好的 macro，以上程式碼可以寫為

# include <Common/TargetSpecific.h>

TIFLASH_DECLARE_MULTITARGET_FUNCTION(

/* return type */ void ,

/* function name */ test4096,

/* argument names */ (a, b),

/* argument list */ ( bool * __restrict a, const int * __restrict b),

/* body */ {

for ( int i = 0 ; i < 4096 ; ++i)

{

a[i] = b[i] > 0 ;

}

})

IFUNC 派發

在 Linux 上觀察 Glibc 的符號表：

我們可以看到，一些效能關鍵函式前被標記了 i 符號。這表示這些函式是 indirect 函式：即程式可以提供一個函式的多種實現，然後在程式載入連結階段由 ld 決定目標符號具體連結到哪個實現。Glibc 正是使用這個方案來決定一些關鍵函式如 memcpy/memcmp/memset 等的實現。

test4096 可以改寫：

void test4096 ( bool * __restrict a, const int * __restrict b) __ attribute__ ((ifunc( "test4096_resolver" ))) ;

extern "C"

void * test4096_resolver ()

{

if (__builtin_cpu_supports( "avx512f" ))

return reinterpret_cast < void *>(&test4096_avx512);

if (__builtin_cpu_supports( "avx2" ))

return reinterpret_cast < void *>(&test4096_avx2);

return reinterpret_cast < void *>(&test4096_generic);

}

這個方案減少了執行時派發的開銷，但是也有一定侷限性：

僅適用於 GNU/Linux 平臺
ifunc 的 resolver 必須在當前 unit 內。如果 resolver 是 c++ 的函式，需要提供 mangle 後的名字。
resolver 執行於進入 C 執行時和 C++ 執行時之前，不能用 TiFlash 的檢測功能。在 x86_64 平臺，可以使用 __builtin_cpu_supports ; 在 aarch64 上，可以使用以下方案：

# include <sys/auxv.h>

# ifndef HWCAP2_SVE2

# define HWCAP2_SVE2 (1 << 1)

# endif

# ifndef HWCAP_SVE

# define HWCAP_SVE (1 << 22)

# endif

# ifndef AT_HWCAP2

# define AT_HWCAP2 26

# endif

# ifndef AT_HWCAP

# define AT_HWCAP 16

# endif

namespace detail

{

static inline bool sve2_supported ()

{

auto hwcaps = getauxval(AT_HWCAP2);

return (hwcaps & HWCAP2_SVE2) != 0 ;

}

static inline bool sve_supported ()

{

auto hwcaps = getauxval(AT_HWCAP);

return (hwcaps & HWCAP_SVE) != 0 ;

}

} // namespace detail

另外一個有趣的例子是，如果你需要在 resolver 中讀取函式變數，你可能需要手動初始化 environ 指標：

extern char ** environ;

extern char **_dl_argv;

char ** get_environ () {

int argc = *( int *)(_dl_argv - 1 );

char **my_environ = ( char **)(_dl_argv + argc + 1 );

return my_environ;

}

typeof(f1) * resolve_f() {

environ = get_environ();

const char *var = getenv( "TOTO" );

if (var && strcmp (var, "ok" ) == 0 ) {

return f2;

}

return f1;

}

int f () __ attribute__ ((ifunc( "resolve_f" ))) ;

Function Multiversioning 派發

在 x86-64 上，Clang/GCC 實際上提供了更便捷的 IFUNC 實現方案：

# include <iostream>

__attribute__((target( "avx512f" )))

void test4096 ( bool * __restrict a, const int * __restrict b)

{

std :: cout << "using avx512" << std :: endl ;

for ( int i = 0 ; i < 4096 ; ++i)

{

a[i] = b[i] > 0 ;

}

__attribute__((target( "avx2" )))

void test4096 ( bool * __restrict a, const int * __restrict b)

{

std :: cout << "using avx2" << std :: endl ;

for ( int i = 0 ; i < 4096 ; ++i)

{

a[i] = b[i] > 0 ;

}

__attribute__((target( "default" )))

void test4096 ( bool * __restrict a, const int * __restrict b)

{

std :: cout << "using default" << std :: endl ;

for ( int i = 0 ; i < 4096 ; ++i)

{

a[i] = b[i] > 0 ;

}

int main () {

bool results[ 4096 ];

int data[ 4096 ];

for ( auto & i : data) {

std :: cin >> i;

}

test4096(results, data);

for ( const auto & i : results) {

std :: cout << i << std :: endl ;

}

這裡，我們不用區分函式名和提供 resolver，而是直接標記不同的 target，編譯器會自動生成 ifunc 的實現。

Macro 整合

可以使用以下程式碼整合 x86-64 和 aarch64 上的基於 IFUNC 的方案：

# ifdef __linux__

# include <sys/auxv.h>

# ifndef HWCAP2_SVE2

# define HWCAP2_SVE2 (1 << 1)

# endif

# ifndef HWCAP_SVE

# define HWCAP_SVE (1 << 22)

# endif

# ifndef AT_HWCAP2

# define AT_HWCAP2 26

# endif

# ifndef AT_HWCAP

# define AT_HWCAP 16

# endif

namespace detail

{

static inline bool sve2_supported ()

{

auto hwcaps = getauxval(AT_HWCAP2);

return (hwcaps & HWCAP2_SVE2) != 0 ;

}

static inline bool sve_supported ()

{

auto hwcaps = getauxval(AT_HWCAP);

return (hwcaps & HWCAP_SVE) != 0 ;

}

} // namespace detail

# endif

# define TMV_STRINGIFY_IMPL(X) #X

# define TMV_STRINGIFY(X) TMV_STRINGIFY_IMPL(X)

# define TIFLASH_MULTIVERSIONED_VECTORIZATION_X86_64(RETURN, NAME, ARG_LIST, ARG_NAMES, BODY) \

struct NAME##TiFlashMultiVersion \

{ \

__attribute__((always_inline)) static inline RETURN inlined_implementation ARG_LIST BODY; \

__attribute__((target( "default" ))) static RETURN dispatched_implementation ARG_LIST \

{ \

return inlined_implementation ARG_NAMES; \

}; \

__attribute__((target( "avx" ))) static RETURN dispatched_implementation ARG_LIST \

{ \

return inlined_implementation ARG_NAMES; \

}; \

__attribute__((target( "avx2" ))) static RETURN dispatched_implementation ARG_LIST \

{ \

return inlined_implementation ARG_NAMES; \

}; \

__attribute__((target( "avx512f,avx512vl,avx512bw,avx512cd" ))) static RETURN dispatched_implementation ARG_LIST \

{ \

return inlined_implementation ARG_NAMES; \

}; \

__attribute__((always_inline)) static inline RETURN invoke ARG_LIST \

{ \

return dispatched_implementation ARG_NAMES; \

}; \

# define TIFLASH_MULTIVERSIONED_VECTORIZATION_AARCH64(RETURN, NAME, ARG_LIST, ARG_NAMES, BODY) \

struct NAME##TiFlashMultiVersion \

{ \

__attribute__((always_inline)) static inline RETURN inlined_implementation ARG_LIST BODY; \

static RETURN generic_implementation ARG_LIST \

{ \

return inlined_implementation ARG_NAMES; \

}; \

__attribute__((target( "sve" ))) static RETURN sve_implementation ARG_LIST \

{ \

return inlined_implementation ARG_NAMES; \

}; \

__attribute__((target( "sve2" ))) static RETURN sve2_implementation ARG_LIST \

{ \

return inlined_implementation ARG_NAMES; \

}; \

static RETURN dispatched_implementation ARG_LIST \

__attribute__((ifunc(TMV_STRINGIFY(__tiflash_mvec_##NAME##_resolver)))); \

__attribute__((always_inline)) static inline RETURN invoke ARG_LIST \

{ \

return dispatched_implementation ARG_NAMES; \

}; \

extern "C" void * __tiflash_mvec_##NAME##_resolver() \

{ \

if (::detail::sve_supported()) \

{ \

<void *>(&NAME##TiFlashMultiVersion::sve_implementation); \

} \

if (::detail::sve2_supported()) \

{ \

return reinterpret_cast < void *>(&NAME##TiFlashMultiVersion::sve2_implementation); \

} \

return reinterpret_cast < void *>(&NAME##TiFlashMultiVersion::generic_implementation); \

}

# if defined(__linux__) && defined(__aarch64__)

# define TIFLASH_MULTIVERSIONED_VECTORIZATION TIFLASH_MULTIVERSIONED_VECTORIZATION_AARCH64

# elif defined(__linux__) && defined(__x86_64__)

# define TIFLASH_MULTIVERSIONED_VECTORIZATION TIFLASH_MULTIVERSIONED_VECTORIZATION_X86_64

# else

# define TIFLASH_MULTIVERSIONED_VECTORIZATION(RETURN, NAME, ARG_LIST, ARG_NAMES, BODY) \

struct NAME##TiFlashMultiVersion \

{ \

__attribute__((always_inline)) static inline RETURN invoke ARG_LIST BODY; \

# endif

TIFLASH_MULTIVERSIONED_VECTORIZATION(

int ,

sum,

( const int * __restrict a, int size),

(a, size),

{

int sum = 0 ;

for ( int i = 0 ; i < size; ++i) {

sum += a[i];

}

return sum;

}

)

面向編譯器的優化

LLVM 提供了一個很好的自動向量化指南： Auto-Vectorization in LLVM - LLVM 15.0.0git documentation 。可以參考其中的章節瞭解哪些常見模式可以用於向量化。簡單來說，我們可以思考迴圈的場景：能否簡化不必要的控制流，能否減少不透明的函式呼叫等等。除此之外，還可以考慮，對於一些簡單的函式定義，如果它會被大量連續呼叫，我們能否將函式定義在 header 中，讓編譯器看到並內聯這些函式，進而提升向量化的空間。

高德納說過，premature optimization is the root of all evil（過早優化是萬惡之源）。我們沒有必要為了向量化就把一些非效能關鍵部分的迴圈重寫成向量化友好的形式。結合 profiler 來決定進一步優化那些函式是一個比較好的選擇。

檢查向量化條件

我們使用以下引數檢查向量化過程：

-Rpass-missed='.*vectorize.*' 檢查編譯器為什麼沒有成功向量化
-Rpass='.*vectorize.* 檢查編譯器進行了那些向量化

具體地，在 TiFlash，我們先提取某個 object file 的編譯指令

cat compile_commands.json | grep "/VersionFilterBlockInputStream.cpp"

然後，在編譯指令前新增 -Rpass-missed='.*vectorize.*' 或者 -Rpass='.*vectorize.*' 來檢視相關資訊。

迴圈展開 Pragma

以下 pragma 可以用來控制迴圈展開策略，輔助向量化

void test1 ( int * a, int *b, int *c) {

# pragma clang loop unroll(full)

for ( int i = 0 ; i < 1024 ; ++i) {

c[i] = a[i] + b[i];

}

void test2 ( int * a, int *b, int *c) {

# pragma clang loop unroll(enable)

for ( int i = 0 ; i < 1024 ; ++i) {

c[i] = a[i] + b[i];

}

void test3 ( int * a, int *b, int *c) {

# pragma clang loop unroll(disable)

for ( int i = 0 ; i < 1024 ; ++i) {

c[i] = a[i] + b[i];

}

void test4 ( int * a, int *b, int *c) {

# pragma clang loop unroll_count(2)

for ( int i = 0 ; i < 1024 ; ++i) {

c[i] = a[i] + b[i];

}

向量化 Pragma

以下 pragma 可以建議 clang 進行向量化。

static constexpr int N = 4096 ;

int A[N];

int B[N];

struct H {

double a[ 4 ];

H operator *( const H& that) {

return {

a[ 0 ] * that.a[ 0 ],

a[ 1 ] * that.a[ 1 ],

a[ 2 ] * that.a[ 2 ],

a[ 3 ] * that.a[ 3 ],

};

}

};

H C[N];

H D[N];

H E[N];

void test1 () {

# pragma clang loop vectorize(enable)

for ( int i= 0 ; i < N; i++) {

C[i] = D[i] * E[i];

}

void test2 () {

for ( int i= 0 ; i < N; i++) {

C[i] = D[i] * E[i];

}

事實上，在 Aarch64 上，TiFlash 中 getDelta 預設就沒有向量化，而使用 hint 後則可以。

這些 pragma 如果想在 macro 內部使用，可以改為 _Pragma("clang loop vectorize(enable)") 的形式。

迴圈拆分

複用上面的例子

void x () {

# pragma clang loop vectorize(enable)

for ( int i= 0 ; i < N; i++) {

A[i + 1 ] = A[i] + B[i];

C[i] = D[i] * E[i];

}

void y () {

for ( int i= 0 ; i < N; i++) {

A[i + 1 ] = A[i] + B[i];

}

# pragma clang loop vectorize(enable)

for ( int i= 0 ; i < N; i++) {

C[i] = D[i] * E[i];

}

其中 x 函式沒有被向量化，因為 A 中存在資料依賴。y 中拆分兩個 loop 後，後一個 loop 則可以進行向量化。在實際情況下，如果 C[i] = D[i] * E[i] 的標量操作會相對佔用時間，這樣做迴圈拆分是比較有意義的。

理論上

# pragma clang loop distribution(enable)

可以自動處理相應情況，但是這裡即使使用這個 pragma，clang 仍然會相對保守。

控制向量化策略

調整單位向量大小

void test ( char *x, char *y, char * z) {

# pragma clang loop vectorize_width(8)

for ( int i= 0 ; i < 4096 ; i++) {

x[i] = y[i] * z[i];

}

比如在 Aarch64 上，vectorize_width(1) 意味著沒有向量化，vectorize_width(8) 意味著用 64bit 暫存器，vectorize_width(16) 意味著用 128bit 暫存器。

除此之外，還可以用 vectorize_width(fixed) ， vectorize_width(scalable) 調整對定長和變長向量的傾向。

調整向量化批次大小

可以用 interleave_count(4) 向編譯器建議向量化時展開的迴圈批次。在一定範圍內提高批次大小可以促進處理器利用超標量和亂序執行進行加速。

void test ( char *x, char *y, char * z) {

# pragma clang loop vectorize_width(8) interleave_count(4)

for ( int i= 0 ; i < 4096 ; i++) {

x[i] = y[i] * z[i];

}

提取定長迴圈單元

以下函式用來確認資料庫列存中第一個可見列：

const uint64_t * filterRow

(

const uint64_t * data,

size_t length,

uint64_t

{

for ( size_t i = 0 ; i < length; ++i) {

if (data[i] > current_version) {

return data + i;

}

return nullptr ;

}

它不能被向量化，因為迴圈內部有存在向外跳轉的控制流。

這種情況下，可以手動提取出一段迴圈來幫助編譯器做自動向量化：

const uint64_t * filterRow

(

const uint64_t * data,

size_t length,

uint64_t

{

size_t i = 0 ;

for (; i + 64 < length; i += 64 ) {

uint64_t mask = 0 ;

# pragma clang loop vectorize(enable)

for ( size_t j = 0 ; j < 64 ; ++j) {

mask |= data[i + j] > current_version ? ( 1u ll << j) : 0 ;

}

if (mask) {

return data + i + __builtin_ctzll(mask);

}

for (; i < length; ++i) {

if (data[i] > current_version) {

return data + i;

}

return nullptr ;

}

（ __builtin_ctzll 是用來計算整數末尾 0 的個數的編譯器內建函式，一般可以高效地翻譯成一條指令）