深入淺出帶你走進 RocksDB

語言: CN / TW / HK

時間 2022-09-30 17:01:31 開務數據庫

主題: relaxed glibc tracker rocksdb

RocksDB 是基於 Google LevelDB 研發的高性能 Key-Value 持久化存儲引擎，以庫組件形式嵌入程序中，為大規模分佈式應用在 SSD 上運行提供優化。RocksDB 重點是提供工具支持，具體實現將交給上層應用。

正是這種高度可定製化能力，使得 RocksDB 可涵蓋包括工作負載等眾多場景的應用，今天我們將逐一為大家介紹：

"為什麼需要內存管理器？為什麼不使用現有的內存管理器？ RocksDB 究竟是如何實現的？"

01 為什麼需要內存管理器？

RocksDB 有很多核心場景需要分配內存的，包括但不限於 Memtable、 Cache、Iterator 等，高效優質的內存分配器就顯得尤為重要。一個優秀的通用內存分配器需要具備以下特性:

儘量避免內存碎片；併發分配性能好；
額外的空間損耗盡量少；
兼具通用性、兼容性、可移植性且易調試。

內存管理可以分為 3 個層次，自下而上分別是：

操作系統內核的內存管理；
glibc 層使用系統調用維護的內存管理；
應用程序從 glibc 層動態分配內存後，根據應用程序本身的程序特性進行優化，比如使用引用計數、std::shared_ptr、apache 等內存池方式的內存管理。

目前大部分服務端程序使用 glibc 提供的 malloc/free 系列函數，而 glibc 使用的 ptmalloc2 在性能上遠遠落後於 Google 的 Tcmalloc 和 Facebook 的 Jemalloc 。後兩者只需使用 LD_PRELOAD 環境變量啟動程序即可，甚至並不需要重新編譯。

02 為什麼不使用現有內存管理器？

RocksDB 使用內存的場景集中在 Memtable、Cache、Iterator 等核心鏈路上，在保證高性能的同時，還需要實現對內存分配器內部控制（包括監控等）。

當前現有的內存分配器不具有主動上報內存監控細節，所以從長遠看，依舊需要自己實現 RocksDB 專有的內存管理器。內存管理器設計思路如下：

小內存的申請通過 Thread-cache 或者 Per-cpu cache，保證了 CPU 不會頻繁的 cache-miss；
若支持 MAP_HUGETLB，則會直接 mmap；
若大頁內存則直接從底層的分配器去申請。

03 RocksDB 是如何實現的？

如圖所示，RocksDB 的內存管理器是支持併發的，接下來讓我們一起從源碼入手，看看具體如何實現的。

Allocator


// Abstract interface for allocating memory in blocks. This memory is freed
// When the allocator object is destroyed. See the Arena class for more info.
class Allocator {
 public:
  virtual ~Allocator() {}
  // 分配內存
  virtual char* Allocate(size_t bytes) = 0;                     
  // 對齊分配內存
  virtual char* AllocateAligned(size_t bytes, size_t huge_page_size = 0,
                                Logger* logger = nullptr) = 0;  
  // 塊大小
  virtual size_t BlockSize() const = 0;                         
};

Arena

Arena 負責實現 RocksDB 的內存分配，我們從中可以看到其針對不同大小的內存分配請求，採取不同的分配策略，從而減少內存碎片。‍

class Arena : public Allocator {
    static const size_t kInlineSize = 2048;
    static const size_t kMinBlockSize;  // 4K
    static const size_t kMaxBlockSize;  // 2G
    
private:
    char inline_block_[kInlineSize] __attribute__((__aligned__(alignof(max_align_t))));
    const size_t kBlockSize;       // 每個 Block 的大小
    using Blocks = std::vector<char*>;
    Blocks blocks_;                // 分配的新 Block 地址（不夠了就從新分配）
    
    struct MmapInfo {
        void* addr_;
        size_t length_;
        MmapInfo(void* addr, size_t length) : addr_(addr), length_(length) {}
    };
    std::vector<MmapInfo> huge_blocks_;     // 大塊使用 mmap
    size_t irregular_block_num = 0;         // 不整齊的塊分配次數（塊大於 kBlockSize/4）
    
    char* unaligned_alloc_ptr_ = nullptr;   // 未對齊的一端指針（高地址開始）
    char* aligned_alloc_ptr_ = nullptr;     // 對齊一端指針（低地址開始）
    size_t alloc_bytes_remaining_ = 0;      // 內存剩餘
    
    size_t blocks_memory_ = 0;              // 已經分配的內存大小
    
    #ifdef MAP_HUGETLB
    size_t hugetlb_size_ = 0;
    #endif  // MAP_HUGETLB
    char* AllocateFromHugePage(size_t bytes);
    char* AllocateFallback(size_t bytes, bool aligned);
    char* AllocateNewBlock(size_t block_bytes);
    
    AllocTracker* tracker_;
}

針對 Allocate 和 AllocateAligned，我們採用對同一塊 Block 的兩端進行分配。AllocateAligned 從內存塊的低地址開始分配，Allocate 從高地址開始分配。分配流程圖如下：

ConcurrentArena

內存分配器不僅需要減少內存碎片，同樣需要保證併發分配的性能，那麼 RocksDB 是如何實現 ConcurrentArena 的呢？

從內存管理架構圖可以看出，RocksDB 維護了 CoreLocal 內存數組，每個線程從所在 CPU 對應的本地 Shard 上分配內存，若不足再去主內存 Arena 進行分配。我們從幾個核心類開始逐一介紹：

1、ConcurrentArena


class ConcurrentArena : public Allocator {
 public:
  // block_size and huge_page_size are the same as for Arena (and are
  // in fact just passed to the constructor of arena_.  The core-local
  // shards compute their shard_block_size as a fraction of block_size
  // that varies according to the hardware concurrency level.
  explicit ConcurrentArena(size_t block_size = Arena::kMinBlockSize,
                           AllocTracker* tracker = nullptr,
                           size_t huge_page_size = 0);
                           
  char* Allocate(size_t bytes) override {
    return AllocateImpl(bytes, false /*force_arena*/,
                        [this, bytes]() { return arena_.Allocate(bytes); });
  }

 private:
  ...
  CoreLocalArray<Shard> shards_;    // 維護了一個 CoreLocal 內存數組
  Arena arena_;                     // 主內存
  ...
};

2、CoreLocalArray


// An array of core-local values. Ideally the value type, T, is cache aligned to
// prevent false sharing.
template <typename T>
class CoreLocalArray {
 public:
  ...
  // returns pointer to element for the specified core index. This can be used,
  // e.g., for aggregation, or if the client caches core index.
  T* AccessAtCore(size_t core_idx) const;

 private:
  std::unique_ptr<T[]> data_;
  int size_shift_;
};

3、併發分配流程


template <typename Func>
  char* AllocateImpl(size_t bytes, bool force_arena, const Func& func) {
    size_t cpu;

    // 1：大塊則直接從 Arena 上分配，需要加鎖
    std::unique_lock<SpinMutex> arena_lock(arena_mutex_, std::defer_lock);
    if (bytes > shard_block_size_ / 4 || force_arena ||
        ((cpu = tls_cpuid) == 0 &&
         !shards_.AccessAtCore(0)->allocated_and_unused_.load(
             std::memory_order_relaxed) &&
         arena_lock.try_lock())) {
      if (!arena_lock.owns_lock()) {
        arena_lock.lock();
      }
      auto rv = func();
      Fixup();
      return rv;
    }

    // 2：挑選 CPU 對應的 Shard
    Shard* s = shards_.AccessAtCore(cpu & (shards_.Size() - 1));
    if (!s->mutex.try_lock()) {
      s = Repick();
      s->mutex.lock();
    }
    std::unique_lock<SpinMutex> lock(s->mutex, std::adopt_lock);

    size_t avail = s->allocated_and_unused_.load(std::memory_order_relaxed);
    
    // 2.1：若當前 Shard 可用內存不夠時，去 Arena 分配存入 Shard
    if (avail < bytes) {
      // reload
      std::lock_guard<SpinMutex> reload_lock(arena_mutex_);

      // If the arena's current block is within a factor of 2 of the right
      // size, we adjust our request to avoid arena waste.
      auto exact = arena_allocated_and_unused_.load(std::memory_order_relaxed);
      assert(exact == arena_.AllocatedAndUnused());

      if (exact >= bytes && arena_.IsInInlineBlock()) {
        // If we haven't exhausted arena's inline block yet, allocate from arena
        // directly. This ensures that we'll do the first few small allocations
        // without allocating any blocks.
        // In particular this prevents empty memtables from using
        // disproportionately large amount of memory: a memtable allocates on
        // the order of 1 KB of memory when created; we wouldn't want to
        // allocate a full arena block (typically a few megabytes) for that,
        // especially if there are thousands of empty memtables.
        auto rv = func();
        Fixup();
        return rv;
      }

      avail = exact >= shard_block_size_ / 2 && exact < shard_block_size_ * 2
                  ? exact
                  : shard_block_size_;
      s->free_begin_ = arena_.AllocateAligned(avail);
      Fixup();
    }
    s->allocated_and_unused_.store(avail - bytes, std::memory_order_relaxed);

    // 3：根據是否對齊，判斷是從高地址/低地址分配
    char* rv;
    if ((bytes % sizeof(void*)) == 0) {
      // aligned allocation from the beginning
      rv = s->free_begin_;
      s->free_begin_ += bytes;
    } else {
      // unaligned from the end
      rv = s->free_begin_ + avail - bytes;
    }
    return rv;
  }

總結

入參 Func 就是上面傳入的 Lambda 表達式：this, bytes { returnarena_.Allocate(bytes) ；
當請求內存塊較大時，直接從 Arena 分配且需要加鎖；否則直接從當前 CPU 對應的 Shard 分配；
若當前 Shard 可用內存不夠，需要從 Arena 再次請求；4、根據是否對齊，判斷從高/低地址分配。

「其他文章」