Building a Visual Monitoring System for JuiceFS with Grafana and Prometheus


Published: 2022-05-25 10:09:32, Juicedata

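With a distributed file system that stores massive amounts of data, users usually need an intuitive view of how metrics such as total capacity, file count, CPU load, disk I/O, and cache usage change over time.

Rather than reinventing the wheel, JuiceFS exposes real-time status data through a Prometheus-compatible API. Add that API to your own Prometheus Server to build up time series data, and a tool such as Grafana can then visualize the state of the JuiceFS file system. Tip: there is a hands-on video at the end of the article.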

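Quick start

This section assumes that Prometheus Server, Grafana, and the JuiceFS client all run on the same host, where:

• Prometheus Server: collects and stores the time series data of the various metrics. For installation, see the official documentation ^[1].
• Grafana: reads the time series data from Prometheus and visualizes it. For installation, see the official documentation ^[2].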

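Ⅰ. Get the real-time data

JuiceFS exposes its data through a Prometheus-style API. Once a file system is mounted, the real-time monitoring data emitted by the client is available at http://localhost:9567/metrics by default.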
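The endpoint returns plain text in the Prometheus exposition format, so it is easy to inspect programmatically. The following is a minimal Python sketch, not part of JuiceFS: the helper names parse_metrics and fetch_metrics and the demo_* series in SAMPLE are made up for illustration, and the parser assumes samples carry no trailing timestamp.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into {series: float}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. The series
    key keeps its label set verbatim, e.g. 'name{label="x"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; this sketch assumes
        # that samples carry no trailing timestamp.
        series, _, value = line.rpartition(" ")
        try:
            samples[series.strip()] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser cannot read
    return samples


def fetch_metrics(url="http://localhost:9567/metrics"):
    """Download and parse the metrics exposed by a mounted JuiceFS client."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative sample only; the series names are hypothetical and real
# output contains many more series.
SAMPLE = """\
# HELP demo_used_space Total used space in bytes.
# TYPE demo_used_space gauge
demo_used_space{mp="/jfs",vol_name="demo"} 4096
demo_uptime 12.5
"""

if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

With a file system mounted, calling fetch_metrics() would return the live values instead of the sample.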

Ⅱ. Add the API to Prometheus Server

Edit the Prometheus configuration file and add a new job that points to the JuiceFS API address, for example:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "juicefs"
    static_configs:
      - targets: ["localhost:9567"]

Assuming the configuration file is named prometheus.yml, start the service with that configuration:

./prometheus --config.file=prometheus.yml

Visit http://localhost:9090 to see the Prometheus web interface.
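Besides opening the web interface, you can check from a script that the new job is actually being scraped. The sketch below assumes the job name juicefs from the configuration above and uses the instant-query endpoint of the Prometheus HTTP API (/api/v1/query); the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL of the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def healthy_targets(response):
    """Return the instances whose `up` sample equals 1 in a query response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    return [
        item["metric"].get("instance", "")
        for item in response["data"]["result"]
        # Each instant-vector sample is [timestamp, "value-as-string"].
        if float(item["value"][1]) == 1.0
    ]


def check_juicefs_job(base_url="http://localhost:9090"):
    """Ask Prometheus which targets of the `juicefs` job are up."""
    url = build_query_url(base_url, 'up{job="juicefs"}')
    with urllib.request.urlopen(url, timeout=5) as resp:
        return healthy_targets(json.load(resp))
```

If the scrape works, check_juicefs_job() should include localhost:9567 in its result; an empty list suggests that the target is down or the job name differs.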

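Ⅲ. Display Prometheus data in Grafana

Create a new Data Source as shown in the figure below:

• Name: to make it easy to identify, you can enter the name of the file system.
• URL: the Prometheus data endpoint, http://localhost:9090 by default.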
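The same data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request. The Grafana address http://localhost:3000, the API token, and the helper names are assumptions for illustration, not values from this article.

```python
import json
import urllib.request


def datasource_payload(name, prometheus_url="http://localhost:9090"):
    """Body for creating a Prometheus data source that Grafana proxies to."""
    return {
        "name": name,            # e.g. the file system name, as suggested above
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",       # Grafana's backend issues the queries
    }


def build_request(grafana_url, api_token, payload):
    """Build (but do not send) the POST request for /api/datasources."""
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_token,  # hypothetical token
        },
        method="POST",
    )
```

Passing the result of build_request("http://localhost:3000", token, datasource_payload("myjfs")) to urllib.request.urlopen would send it; the name myjfs is a placeholder.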

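Then create a dashboard from grafana_template.json ^[3]. Open the new dashboard to see the visual charts for the file system:

Collecting monitoring metrics

How you collect monitoring metrics depends on how JuiceFS is deployed. Each method is described below.

Mount point

After a JuiceFS file system is mounted with the juicefs mount command, its monitoring metrics can be collected from http://localhost:9567/metrics. You can change this address with the --metrics option, for example: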

$ juicefs mount --metrics localhost:9567 ...

You can view these metrics with a command-line tool:

$ curl http://localhost:9567/metrics

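In addition, the root directory of every JuiceFS file system contains a hidden file named .stats, which also shows the monitoring metrics. For example (assuming the mount point is /jfs):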

$ cat /jfs/.stats

Kubernetes

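By default the JuiceFS CSI driver exposes monitoring metrics on port 9567 of the mount pod. You can change this by adding the metrics option to mountOptions (see the CSI driver documentation ^[4] for how to modify mountOptions), for example: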

apiVersion: v1
kind: PersistentVolume
metadata:
  name: juicefs-pv
  labels:
    juicefs-name: ten-pb-fs
spec:
  ...
  mountOptions:
    - metrics=0.0.0.0:9567

Add a scrape job to prometheus.yml to collect the metrics:

scrape_configs:
  - job_name: 'juicefs'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
      action: keep
      regex: juicefs-mount
    - source_labels: [__address__]
      action: replace
      regex: ([^:]+)(:\d+)?
      replacement: $1:9567
      target_label: __address__
    - source_labels: [__meta_kubernetes_pod_node_name]
      target_label: node
      action: replace
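The second relabel rule is the least obvious one: it rewrites each discovered pod address so that Prometheus always scrapes port 9567, whatever port service discovery reported. The Python sketch below imitates that single rule to show its effect. It is an illustration rather than Prometheus code, and it relies on the fact that Prometheus anchors relabel regexes at both ends, which re.fullmatch reproduces.

```python
import re

# Same pattern as the relabel rule: host part, then an optional ":port".
ADDRESS_RE = re.compile(r"([^:]+)(:\d+)?")


def rewrite_address(address, port=9567):
    """Imitate `replacement: $1:9567` applied to the __address__ label."""
    match = ADDRESS_RE.fullmatch(address)
    if match is None:
        # With action "replace", a non-matching regex leaves the label as is.
        return address
    return "%s:%d" % (match.group(1), port)


if __name__ == "__main__":
    for addr in ("10.1.2.3:8080", "10.1.2.3", "[fe80::1]:8080"):
        print(addr, "->", rewrite_address(addr))
```

Note that a bracketed IPv6 address does not match the pattern, so this rule would leave such an address unchanged.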

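This assumes that the Prometheus server runs inside the Kubernetes cluster. If it runs outside the cluster, make sure that it can reach the Kubernetes nodes, and add the api_server and tls_config settings to the configuration above as described in this issue ^[5]: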

scrape_configs:
  - job_name: 'juicefs'
    kubernetes_sd_configs:
    - api_server: <Kubernetes API Server>
      role: pod
      tls_config:
        ca_file: <...>
        cert_file: <...>
        key_file: <...>
        insecure_skip_verify: false
    relabel_configs:
    ...

S3 gateway

By default the JuiceFS S3 gateway exposes monitoring metrics at http://localhost:9567/metrics. You can change this address with the --metrics option, for example:

$ juicefs gateway --metrics localhost:9567 ...

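If you deploy the JuiceFS S3 gateway in Kubernetes, you can collect its metrics with a Prometheus configuration like the one in the Kubernetes section. The main difference is the regular expression applied to the __meta_kubernetes_pod_label_app_kubernetes_io_name label, for example: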

scrape_configs:
  - job_name: 'juicefs-s3-gateway'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: juicefs-s3-gateway
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(:\d+)?
        replacement: $1:9567
        target_label: __address__
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
        action: replace

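Collecting with Prometheus Operator

Prometheus Operator ^[6] lets users deploy and manage Prometheus quickly in a Kubernetes environment. With the ServiceMonitor CRD that Prometheus Operator provides, the scrape configuration can be generated automatically. For example (assuming the Service of the JuiceFS S3 gateway is deployed in the kube-system namespace):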

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: juicefs-s3-gateway
spec:
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: juicefs-s3-gateway
  endpoints:
    - port: metrics

Hadoop

The JuiceFS Hadoop Java SDK can report monitoring metrics to Pushgateway or Graphite.

Pushgateway

Enable metric reporting to Pushgateway:

<property>
  <name>juicefs.push-gateway</name>
  <value>host:port</value>
</property>

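You can also change how often metrics are reported with the juicefs.push-interval setting. The default is once every 10 seconds.

As the official Pushgateway documentation recommends, set honor_labels: true in the Prometheus scrape configuration.

Note in particular that the timestamp of a metric that Prometheus scrapes from Pushgateway is the scrape time, not the time at which the JuiceFS Hadoop Java SDK reported it. See the official Pushgateway documentation ^[7] for details.

By default Pushgateway keeps metrics in memory only. To persist them to disk, set the file path with the --persistence.file option and the write frequency with the --persistence.interval option (default: every 5 minutes).

Every process that uses the JuiceFS Hadoop Java SDK has its own unique metrics, and Pushgateway remembers every metric it has ever received. The number of metrics therefore keeps growing, which consumes too much memory and slows down Prometheus scrapes, so clean up the metrics on Pushgateway regularly.

Run the following command periodically to clear the Pushgateway metrics. Clearing them does not affect running JuiceFS Hadoop Java SDK processes, which keep reporting data. Note that Pushgateway must be started with the --web.enable-admin-api option, and that the command wipes all monitoring metrics in Pushgateway.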

$ curl -X PUT http://host:9091/api/v1/admin/wipe
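If the cleanup is driven from a scheduled script rather than from curl, the same request can be issued from Python. This is a minimal sketch of the call above: host:9091 is the placeholder address from the command, and the helper names are hypothetical.

```python
import urllib.request


def build_wipe_request(pushgateway="http://host:9091"):
    """Build the PUT request that wipes all metrics held by Pushgateway."""
    return urllib.request.Request(
        pushgateway.rstrip("/") + "/api/v1/admin/wipe",
        method="PUT",
    )


def wipe(pushgateway="http://host:9091"):
    """Send the wipe request and return the HTTP status code.

    Requires Pushgateway to run with --web.enable-admin-api; otherwise the
    request is rejected and urlopen raises an HTTPError.
    """
    with urllib.request.urlopen(build_wipe_request(pushgateway), timeout=10) as resp:
        return resp.status
```

A cron job or similar scheduler could call wipe() at whatever interval keeps the metric count manageable.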

Graphite

Enable metric reporting to Graphite:

<property>
  <name>juicefs.push-graphite</name>
  <value>host:port</value>
</property>

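You can also change how often metrics are reported with the juicefs.push-interval setting. The default is once every 10 seconds.

For all configuration parameters supported by the JuiceFS Hadoop Java SDK, see the documentation ^[8].

Using Consul as the registry

JuiceFS can use Consul as the registry for the monitoring metrics API. The default Consul address is 127.0.0.1:8500, and you can change it with the --consul option, for example: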

$ juicefs mount --consul 1.2.3.4:8500 ...

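Once a Consul address is configured, the --metrics option is no longer required: JuiceFS sets the metrics URL automatically based on its own network and port situation. If --metrics is also set, JuiceFS first tries to listen on the configured URL.

Every instance registered in Consul has the serviceName juicefs, and its serviceId has the format <IP>:<mount-point>, for example 127.0.0.1:/tmp/jfs.

The meta of each instance contains two dimensions, hostname and mountpoint. A mountpoint value of s3gateway indicates that the instance is an S3 gateway.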
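To see what has been registered, you can list the juicefs service through Consul's catalog HTTP API (/v1/catalog/service/<name>). The sketch below turns such a response into scrape targets. It assumes that the registered service port is the metrics port and that the meta keys are exactly hostname and mountpoint as described above; the helper names are hypothetical.

```python
import json
import urllib.request


def targets_from_catalog(entries):
    """Convert Consul catalog entries into a list of scrape target records."""
    targets = []
    for entry in entries:
        # ServiceAddress may be empty, in which case the node address applies.
        host = entry.get("ServiceAddress") or entry.get("Address", "")
        meta = entry.get("ServiceMeta") or {}
        targets.append({
            "target": "%s:%s" % (host, entry.get("ServicePort")),
            "hostname": meta.get("hostname", ""),
            "mountpoint": meta.get("mountpoint", ""),
            # Per the text above, this mountpoint value marks an S3 gateway.
            "is_s3_gateway": meta.get("mountpoint") == "s3gateway",
        })
    return targets


def fetch_targets(consul="http://127.0.0.1:8500", service="juicefs"):
    """Query the Consul catalog for all instances of the given service."""
    url = "%s/v1/catalog/service/%s" % (consul.rstrip("/"), service)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return targets_from_catalog(json.load(resp))
```

Prometheus can also discover these instances directly through its own Consul service discovery, so a script like this is mainly useful for checking what was registered.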

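Visualizing monitoring metrics

Grafana dashboard templates

JuiceFS provides several Grafana dashboard templates. After you import a template, it displays the collected monitoring metrics. The templates currently available are:

Template name	Description
`grafana_template.json` ^[9]	Displays metrics collected from mount points, the S3 gateway (non-Kubernetes deployment), and the Hadoop Java SDK
`grafana_template_k8s.json` ^[10]	Displays metrics collected from the Kubernetes CSI driver and the S3 gateway (Kubernetes deployment)

An example Grafana dashboard looks like the figure below: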

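Summary

Grafana serves as the high-level observation tool: when something goes wrong, first check whether any metric looks abnormal, then analyze further. It is also advisable to set up alerts on important metrics, so that you are notified of abnormal system states in real time and can troubleshoot promptly. You can watch the video below to learn more.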

References

[1] Official documentation: http://prometheus.io/docs/introduction/first_steps/

[2] Official documentation: http://grafana.com/docs/grafana/latest/installation/

[3] grafana_template.json: http://github.com/juicedata/juicefs/blob/main/docs/en/grafana_template.json

[4] CSI driver documentation: http://juicefs.com/docs/zh/csi/examples/mount-options

[5] This issue: http://github.com/prometheus/prometheus/issues/4633

[6] Prometheus Operator: http://github.com/prometheus-operator/prometheus-operator

[7] Official Pushgateway documentation: http://github.com/prometheus/pushgateway/blob/master/README.md#about-timestamps

[8] Documentation: ../deployment/hadoop_java_sdk.md#客戶端配置引數

[9] grafana_template.json: http://github.com/juicedata/juicefs/blob/main/docs/en/grafana_template.json

[10] grafana_template_k8s.json: http://github.com/juicedata/juicefs/blob/main/docs/en/grafana_template_k8s.json

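Open source community contribution guide

JuiceFS was open-sourced in January 2021. Open source software cannot grow without everyone's support: an article, a page of documentation, an idea, a suggestion, a bug report, or a bug fix all help drive the project forward, whatever their size. You are welcome to join the JuiceFS community and contribute in any of these ways. (http://github.com/juicedata/juicefs)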
