Author | 鄭建華

Updated by | 趙露陽、王迎港

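Deep learning frameworks generally compute gradients and propagate them backward through an automatic differentiation (autograd) mechanism. This article uses a simple example to take a first look at how autograd is implemented in OneFlow.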

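1 Automatic differentiation basics

There is plenty of material on automatic differentiation. In my view, the series 自動微分的原理介紹 (http://mp.weixin.qq.com/s/BwQxmNoSBEnUlJ1luOwDag) and the material it cites give a fairly complete and clear account of the background.

The following gives an intuitive explanation of how gradients propagate in several cases.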

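1.1 Gradient propagation in a stack network

Take the stack network x -> f -> g -> z as an example. By the chain rule:

∂z/∂x = ∂z/∂g * ∂g/∂f * ∂f/∂x

At run time, during backward propagation of the gradient:

z passes ∂z/∂g to g.
If node g has a weight w that needs a gradient, it computes ∂z/∂w = ∂z/∂g * ∂g/∂w.
g computes ∂g/∂f, multiplies it by the gradient received from z, and passes the result to f. g only needs to pass f the result of the chained product, not the individual factors.
During the forward pass in training, g has to save the intermediate results that computing ∂g/∂f depends on, for use in the backward pass.
The other nodes propagate in the same way.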
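The steps above can be sketched in a few lines of plain Python. The concrete functions f, g and z below are assumptions chosen only for illustration; they do not come from OneFlow. The point is that each node multiplies the incoming gradient by its own local derivative and hands on only the product.

```python
import math

# Hypothetical stack x -> f -> g -> z, chosen only for illustration:
#   f(x) = x * x,  g(f) = sin(f),  z(g) = 3 * g
def forward(x):
    f = x * x
    g = math.sin(f)
    z = 3.0 * g
    # Save what the backward pass needs (here: x and f).
    return z, (x, f)

def backward(saved):
    x, f = saved
    grad_g = 3.0                      # z passes dz/dg to g
    grad_f = grad_g * math.cos(f)     # g multiplies by its local dg/df
    grad_x = grad_f * 2.0 * x         # f multiplies by its local df/dx
    return grad_x

z, saved = forward(0.5)
dz_dx = backward(saved)

# Cross-check against a central finite difference.
eps = 1e-6
numeric = (forward(0.5 + eps)[0] - forward(0.5 - eps)[0]) / (2 * eps)
print(dz_dx, numeric)
```

The two printed values should agree to several decimal places, which is a cheap sanity check for any hand-written backward function.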

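1.2 Gradient propagation in a simple graph

Take a simple graph topology as an example, in which x reaches z through two nodes, u and v.

Before continuing, we need the basic formula for differentiating a multivariate composite function.
Here u and v are both functions of x and y, and z is a function of u and v:

∂z/∂x = ∂z/∂u * ∂u/∂x + ∂z/∂v * ∂v/∂x
∂z/∂y = ∂z/∂u * ∂u/∂y + ∂z/∂v * ∂v/∂y

From this formula, the gradient of z with respect to x propagates along two paths, z -> u -> x and z -> v -> x, and node x takes the sum of the two gradients as the gradient of z with respect to x.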
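A small numeric instance may make the two-path sum concrete. The functions u, v and z below are assumptions picked for the example, not anything defined by the article or by OneFlow.

```python
# Hypothetical functions, chosen only for illustration:
#   u = x * y,  v = x + y,  z = u * v
def grad_z_wrt_x(x, y):
    u, v = x * y, x + y
    dz_du, dz_dv = v, u          # local gradients of z = u * v
    du_dx, dv_dx = y, 1.0        # local gradients of u and v w.r.t. x
    via_u = dz_du * du_dx        # path z -> u -> x
    via_v = dz_dv * dv_dx        # path z -> v -> x
    return via_u + via_v         # x sums the gradients of both paths

dz_dx = grad_z_wrt_x(2.0, 3.0)
print(dz_dx)
```

Expanding z = x²y + xy² by hand gives ∂z/∂x = 2xy + y², which is 21 at (2, 3) and matches the sum of the two paths.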

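1.3 Gradient propagation in a complex graph

Now consider an example with a slightly more complex topology:

The graph in the figure can be viewed as x -> U -> L, where U is the subgraph e -> ... -> h. The subgraph f -> g can be viewed as V.

Node h has to pass its gradient to g and k. Node e has to sum the gradients arriving from f and k to obtain ∂L/∂e. So the gradient of L with respect to x can still be decomposed by path: along one path the gradients of consecutive nodes are multiplied, and the gradients arriving over multiple paths are added.

This blog post (http://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/) has an almost identical topology diagram and gives the gradient formulas for some of the weight parameters.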
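The "multiply along a path, add across paths" rule can be written as a short recursive function. The topology below is an assumption reconstructed from the node names in the text, and the edge derivatives are made-up numbers.

```python
# Assumed topology (reconstructed from the node names in the text):
#   x -> e, e -> f -> g -> h, e -> k -> h, h -> L
# Each edge (src, dst) carries a made-up local derivative d(dst)/d(src).
local_grad = {
    ("x", "e"): 2.0,
    ("e", "f"): 3.0, ("f", "g"): 0.5, ("g", "h"): 4.0,
    ("e", "k"): -1.0, ("k", "h"): 2.0,
    ("h", "L"): 1.5,
}

def grad(target, source):
    """Return d(target)/d(source): sum over paths of the product of edge gradients."""
    if source == target:
        return 1.0
    total = 0.0
    for (src, dst), d in local_grad.items():
        if src == source:
            total += d * grad(target, dst)   # multiply along a path, add across paths
    return total

dL_de = grad("L", "e")   # e sums what arrives via f and via k
dL_dx = grad("L", "x")
print(dL_de, dL_dx)
```

This naive recursion revisits shared subpaths, which is fine for a toy graph. A real engine avoids the repeated work by visiting each node once in topological order.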

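2 Basic tensor-related concepts in autograd

2.1 Leaf nodes

OneFlow's autograd documentation (http://docs.oneflow.org/en/master/basics/05_autograd.html) introduces the concepts of leaf node and root node. A node with only outputs and no inputs is a leaf node; a node with only inputs and no outputs is a root node.

My understanding is that if weight, bias and data are regarded as part of the computation graph, these nodes are leaf nodes (ops are not leaf nodes). This is clearest from the viewpoint of the backward graph (http://discuss.pytorch.org/t/what-is-the-purpose-of-is-leaf/87000/9): the grad_fn of these nodes is empty, and backward propagation stops when it reaches them.

is_leaf and requires_grad are closely related, yet independent of each other. PyTorch explains them as follows (http://pytorch.org/docs/stable/generated/torch.Tensor.is_leaf.html#torch.Tensor.is_leaf):

Nodes with requires_grad=false are all leaf nodes, for example data.
Nodes with requires_grad=true are also leaf nodes if they were created by the user, for example weight and bias.
During the backward pass, only the gradients of leaf nodes are populated. To populate the gradient of a non-leaf node, retain_grad must be enabled explicitly (in PyTorch, by calling retain_grad() on the tensor).
A gradient is computed and populated only when requires_grad=true. Take y = relu(x): y is created by an op and is not a leaf node. If x requires a gradient, then y.requires_grad == true, but no gradient needs to be populated for y.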
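The rules above can be restated as a toy model. ToyTensor and toy_relu are hypothetical names invented for this sketch; they are not OneFlow or PyTorch classes, and the model only encodes the is_leaf rules as described here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToyTensor:
    """Hypothetical stand-in for a tensor; not OneFlow's actual class."""
    requires_grad: bool = False
    grad_fn: Optional[str] = None   # name of the op that produced this tensor

    @property
    def is_leaf(self) -> bool:
        # Leaf: either no gradient is required, or the tensor was created by the user.
        return (not self.requires_grad) or self.grad_fn is None

def toy_relu(x: ToyTensor) -> ToyTensor:
    # The output requires grad iff the input does; only then is a grad_fn recorded.
    return ToyTensor(requires_grad=x.requires_grad,
                     grad_fn="relu_backward" if x.requires_grad else None)

data = ToyTensor(requires_grad=False)     # e.g. input data
weight = ToyTensor(requires_grad=True)    # e.g. a user-created parameter
y = toy_relu(weight)                      # created by an op
print(data.is_leaf, weight.is_leaf, y.is_leaf, y.requires_grad)
```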

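As for the concept of a leaf node, what I have found so far is mostly intuitive description; I have not seen a strict, clear definition. This may be because users rarely use is_leaf directly (http://discuss.pytorch.org/t/what-is-the-purpose-of-is-leaf/87000/9), so the concept only comes up when reading code.

The following material can be consulted for further reference:

What is the purpose of is_leaf? (http://discuss.pytorch.org/t/what-is-the-purpose-of-is-leaf/87000)
葉子節點和tensor的requires_grad引數 (http://zhuanlan.zhihu.com/p/85506092)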

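2.2 tensor detach

The detach method of Tensor (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/tensor_impl.cpp#L155) creates a new tensor with the following attributes:

requires_grad = false
is_leaf = true

detach means separating the tensor from the backward graph used for gradient computation. The new tensor shares storage with the original object but takes no part in constructing the topology of the backward graph. The requires_grad attribute of the original object is unchanged.

In the code below, for example, modifying the data of one object also changes the data of the other.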

    import oneflow as flow

    y = flow.Tensor([1, 2, 3])
    x = y.detach()
    x[0] = 4
    assert y[0] == 4
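The same behaviour can be mimicked without OneFlow. In the sketch below, ToyTensor is a hypothetical class whose storage is a plain Python list; it only illustrates "new object, shared storage, no part in the backward graph" and says nothing about how OneFlow implements detach internally.

```python
class ToyTensor:
    """Hypothetical tensor whose storage is a plain Python list."""
    def __init__(self, storage, requires_grad=False, is_leaf=True):
        self.storage = storage
        self.requires_grad = requires_grad
        self.is_leaf = is_leaf

    def detach(self):
        # New object, same storage; it takes no part in the backward graph.
        return ToyTensor(self.storage, requires_grad=False, is_leaf=True)

y = ToyTensor([1.0, 2.0, 3.0], requires_grad=True)
x = y.detach()
x.storage[0] = 4.0
print(y.storage[0], y.requires_grad, x.requires_grad)
```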

3 Example code

This article uses the following code to observe OneFlow's autograd mechanism.

```
import oneflow as flow

# y is scalar
x = flow.tensor([-1.0, 2.0], requires_grad=True)
y = flow.relu(x).sum()
y.backward()
print(x.grad)

# y is not scalar
x = flow.tensor([-1.0, 2.0], requires_grad=True)
y = flow.relu(x)
y.backward(flow.Tensor([1, 1]))
print(x.grad)
```

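The y.backward method can be called in two ways:

If y is a scalar (such as a loss), no argument needs to be passed.
If y is a vector, a vector with the same shape as y must be passed as the argument.

Why the difference? The references listed at the end of this section explain this in some detail. In short:

If the output of the function is a vector, the shape of the gradient tensor blows up in dimensionality during backward propagation, which is complex to implement and performs poorly.
If the output of the function is a scalar, the gradient tensor propagated backward has the same shape as the parameter variable, so there is no blow-up and the implementation is easier.
For the vector version of backward, one can imagine that some loss function exists and that the argument of backward is the gradient propagated from the loss to y. Because the gradients of consecutive nodes are multiplied, substituting ones for this imagined gradient makes the result x.grad the gradient of y with respect to x.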
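In other words, the argument of backward is the vector v in a vector-Jacobian product vᵀJ. The sketch below spells this out for relu in plain Python. The helper names relu_jacobian and vjp are assumptions made for this example, and a real engine never materializes the full Jacobian.

```python
def relu_jacobian(x):
    """Jacobian of y = relu(x): diagonal, 1 where x > 0 and 0 elsewhere."""
    n = len(x)
    return [[(1.0 if (i == j and x[i] > 0) else 0.0) for j in range(n)]
            for i in range(n)]

def vjp(v, jac):
    """Vector-Jacobian product v^T J; the result has the same shape as x."""
    n_out, n_in = len(jac), len(jac[0])
    return [sum(v[i] * jac[i][j] for i in range(n_out)) for j in range(n_in)]

x = [-1.0, 2.0]
x_grad = vjp([1.0, 1.0], relu_jacobian(x))   # mirrors y.backward(ones)
print(x_grad)
```

With v set to ones, the result equals the gradient of sum(y) with respect to x, which is why the scalar and the vector example produce the same x.grad.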

The rest of this article uses y.backward(flow.Tensor([1, 1])) as the example for observing the autograd mechanism. Its backward graph has only one step, x <- y.

References

自動求梯度 (http://tangshusen.me/Dive-into-DL-PyTorch/#/chapter02_prerequisite/2.3_autograd?id=_233-梯度)
PyTorch 的 backward 為什麼有一個 grad_variables 引數？ (http://zhuanlan.zhihu.com/p/29923090)

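3.1 Where the gradient result is stored

Reading the grad attribute of Tensor (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/api/python/framework/tensor.cpp#L611) calls the acc_grad() method (acc is presumably short for accumulate). This tells us where the gradient is actually stored, so we can focus on that part when reading the code.

The call flow is as follows:

Note: MirroredTensor in the figure has been renamed LocalTensor in the latest source code; the two are the same thing.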

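4 Class relationships related to autograd

The figure below shows the relationships among the autograd-related classes.

Before reading the autograd code, refer to this class diagram to get to know the structure and relationships; it helps in understanding the role of each part of the code.

In eager mode, the user builds the forward computation graph step by step by composing ops. While the forward computation runs, the engine records the information that autograd needs for the backward graph, and this backward graph is executed when the backward method is called.

With reference to the class diagram above:

From the tensor's point of view

The forward op outputs a tensor y, which is the TensorIf <- ReluFunctor part.
From y one can find the class that actually performs the gradient computation in the backward graph, via the chain TensorIf -> FunctionNode -> ReLU.
The backward_fn_ of FunctionNode wraps an OpExprGradClosure, which is responsible only for computing the gradient of the current node.
ReLU is the class that performs the gradient computation; it calls the ReluGradFunctor op to do so.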

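From the point of view of how the backward graph is stored

Information about the backward graph is stored in FunctionNode.
The root of the backward graph is the grad_fn_node_ member of a tensor (such as y or loss).
next_functions_ of FunctionNode holds the downstream nodes of the backward graph; the current node passes its gradient results to them. The links between these FunctionNodes form the topology of the backward graph.
The gradient of a tensor is stored at TensorImpl.AutogradMeta.acc_grad_.
AutogradMeta.current_grad_ is the sum of the gradients passed to the current node from upstream in the backward graph. If tensor t is an input to ops u and v, the gradients propagated back by u and v are accumulated into current_grad_. "current" presumably refers to the running sum as of the computation in progress.
Although FunctionNode does not hold tensor instances, it holds pointers to the tensors' AutogradMeta members.

For node y in the relu example above:
1. output_meta_data_ is y.autograd_meta_
2. input_meta_data_ is x.autograd_meta_
3. So FunctionNode can obtain the upstream and downstream gradient data and read and write it

AutoGradCaptureState can store state needed for the gradient computation; for example, computing the gradient of relu needs its forward output y.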

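From the point of view of how the backward graph is executed

GraphTask is responsible for executing the backward graph.
FunctionNode stores only the necessary data.
Based on this data, GraphTask builds the data structures it needs for traversal, visits every node, and performs the gradient computation.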

5 Preparation for autograd during the forward computation

Execution of the backward graph is data-driven: the storage structure and content of the data determine the concrete actions taken.

The discussion below applies only to eager mode. In lazy mode, building the backward graph is part of the multi-round optimization passes (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter.cpp#L98).

The earlier discussion of Op、Kernel與直譯器 (http://mp.weixin.qq.com/s/gXH7HZ9cFHtcFY_2GZ_PnQ) already covered the role of the Interpreter, but it focused on op execution and ignored everything related to grad.

GetInterpreter (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp#L67) actually returns an AutogradInterpreter object (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp#L42). Its Apply method (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter.cpp#L86) calls the embedded Interpreter and also records the information needed for gradient computation.

The main flow of AutogradInterpreter::Apply is as follows:

The first step of Apply computes requires_grad. As long as any input of the op has requires_grad set to true, the output of the op also has requires_grad set to true (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter.cpp#L151-L152), provided the data type of the output supports gradients. This is where the requires_grad of y is decided.

For example, with y = relu(x), if the data type supports gradients, y.requires_grad equals x.requires_grad.
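As a one-line restatement of this rule, consider the hypothetical helper below. Its name and signature are assumptions made for illustration and do not correspond to a function in the OneFlow code base.

```python
def infer_output_requires_grad(input_requires_grad, dtype_supports_grad=True):
    """Hypothetical helper: the output requires grad if any input does
    and the output data type supports gradients."""
    return any(input_requires_grad) and dtype_supports_grad

relu_case = infer_output_requires_grad([True])            # y = relu(x)
mixed_case = infer_output_requires_grad([False, True])    # one input is enough
no_inputs = infer_output_requires_grad([])                # an op without tensor inputs
int_case = infer_output_requires_grad([True], dtype_supports_grad=False)
print(relu_case, mixed_case, no_inputs, int_case)
```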

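Next, the embedded interpreter internal_ is called to perform the computation. Gradient mode is temporarily disabled while the embedded interpreter runs: some ops may call the interpreter in a nested fashion or several times (ReluGradFunctor is also executed through the interpreter), and none of these calls need the gradient logic.

Note that constructing x does not run any grad-related logic, because requires_grad of all the inputs is false; the requires_grad of x is set only at the very end of construction (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/api/python/utils/tensor_utils.cpp#L187).

The following looks at the logic of several core functions in detail.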

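5.1 Building the gradient closure

As mentioned in the description of the class diagram, OpExprGradClosure is responsible only for the gradient computation of the current node.

The core code of the GetOrCreateOpGradClosure function (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_expr.cpp#L146) is as follows: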

    template<>
    Maybe<OpExprGradClosure> BuiltinOpExprImpl<UserOpConf>::GetOrCreateOpGradClosure() const {
      if (!op_grad_func_.get()) {
        ...
        op_grad_func_.reset(NewObj<std::string, OpExprGradFunctionIf>(proto().op_type_name()));
        JUST(op_grad_func_->Init(*this));
      }
      return std::make_shared<OpExprGradClosure>(op_grad_func_);
    }

NewObj calls AutoRegistrationFactory (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/common/auto_registration_factory.h#L94) to obtain the pre-registered factory and create the object. A similar registration mechanism came up in the earlier discussion of Op指令在虛擬機器中的執行 (http://mp.weixin.qq.com/s/r5LOoEh-Qw57pokr0miGlw).

Here the value of op_type_name is relu. Searching the code for "relu" turns up the macro that registers ReLU (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/gradient_funcs/activation.cpp#L562). After macro expansion the code looks like this:

    static AutoRegistrationFactory<std::string, OpExprGradFunctionIf>::CreatorRegisterType
        g_registry_var4("relu", ([]() { return new ReLU; }));

So the object actually returned is ReLU (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/gradient_funcs/activation.cpp#L200). Its Init function is a no-op.
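The registration pattern itself is easy to imitate. The sketch below is a rough Python analogue, assuming a module-level dictionary in place of the static factory. The names register and new_obj are invented for this example and are not OneFlow APIs.

```python
_creators = {}   # op_type_name -> zero-argument creator

def register(op_type_name):
    """Decorator playing the role of the registration macro (hypothetical name)."""
    def wrap(cls):
        _creators[op_type_name] = cls
        return cls
    return wrap

def new_obj(op_type_name):
    """Look up the pre-registered creator and build a fresh object."""
    return _creators[op_type_name]()

@register("relu")
class ReLU:
    def init(self, op_expr):
        pass   # a no-op, like the Init function described above

grad_func = new_obj("relu")
print(type(grad_func).__name__)
```

Registration happens as a side effect of defining the class, so the lookup code never has to name ReLU directly; the C++ version gets the same effect from a static variable initialized at load time.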

OpExprGradClosure simply stores ReLU so that it can be called when backward runs. The whole call flow is as follows:

5.2 Capturing the data needed for gradient computation

The call flow is as follows:

The Capture function (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter.cpp#L122) saves the data needed by the later gradient computation.

Note that OpExprGradFunction::CaptureIf (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_expr_grad_function.h#L93) saves detached tensors. These tensors share data with the original tensors; gradient data can be read and written through them, but they take no part in building the topology of the backward graph.

This function passes the detached outputs of the op, received from the Interpreter, to ReLU::Capture (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_expr_grad_function.h#L128); for relu this is the forward output y. ReLU::Capture stores output[0] in saved_tensors_ of ReLUCaptureState (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/gradient_funcs/activation.cpp#L209), because for relu the gradient can be computed from y alone.
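The capture-then-apply pattern can be sketched in plain Python as follows. The classes reuse the names ReLU and ReLUCaptureState for readability, but they are simplified stand-ins working on Python lists, and the method signatures are assumptions rather than the C++ signatures.

```python
class ReLUCaptureState:
    def __init__(self):
        self.saved_tensors = []

class ReLU:
    def capture(self, ctx, inputs, outputs):
        # relu's gradient can be computed from y alone, so only y is saved.
        ctx.saved_tensors.append(outputs[0])

    def apply(self, ctx, out_grads):
        y = ctx.saved_tensors[0]
        dy = out_grads[0]
        # The gradient passes through where y > 0 and is zero elsewhere.
        return [[0.0 if yi <= 0 else di for yi, di in zip(y, dy)]]

x = [-1.0, 2.0]
y = [max(v, 0.0) for v in x]          # forward relu
ctx = ReLUCaptureState()
relu = ReLU()
relu.capture(ctx, inputs=[x], outputs=[y])            # during the forward pass
in_grads = relu.apply(ctx, out_grads=[[1.0, 1.0]])    # during the backward pass
print(in_grads)
```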

5.3 Saving the structure of the backward graph

AutogradInterpreter::Apply constructs a lambda expression, backward_fn (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter.cpp#L103-L110), whose core logic is a single line, grad_closure->Apply.

The main purpose of this lambda is to capture the smart pointer grad_closure. The lambda ends up as the backward_fn_ member of FunctionNode. This is what creates the edge from FunctionNode to OpExprGradClosure in the class diagram, and what makes it possible to find the closure from a FunctionNode and run the node's gradient computation.

The function GetThreadLocalAutogradEngine()->AddNode (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter.cpp#L113) is key. The main task of AddNode (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L478) is to create FunctionNodes for the inputs and outputs and to save the data needed to traverse the backward graph. Its inputs/outputs parameters are the inputs/outputs of the forward op. For relu, inputs is x and outputs is y.

In the example code above, x is a leaf node that requires a gradient, so AddAccumulateFunctionNode sets its grad_fn_node to a no-op function (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L508). It is a no-op because a leaf node only stores a gradient and does not compute one itself; the gradient it needs is saved into x.autograd_meta_ by the upstream node of the backward graph.

After that, a GraphFunctionNode is constructed for y and linked to the other nodes (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L491), and saved to grad_fn_node (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L495). Note that backward_fn here is the lambda expression from AutogradInterpreter::Apply (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter.cpp#L103-L109).

Also note that inputs/outputs in AddBackwardFuncPtr refer to the op, whereas the variables of the same name in the GraphFunctionNode constructor refer to the FunctionNode; their meanings and the objects they point to differ.

After construction, the contents of the grad_fn_node_ fields of x and y are as follows:

x.grad_fn_node_

    name_: accumulate_grad
    next_functions_: empty
    input_meta_data_: empty
    output_meta_data_: size=1, x.autograd_meta_, requires_grad=true, is_leaf=true
    output_tensor_infos_: corresponds to x, the input of the relu forward op
    backward_fn_: an empty function, defined in AddAccumulateFunctionNode

y.grad_fn_node_

    name_: relu_backward
    next_functions_: size=1, x.grad_fn_node, a no-op, the GraphFunctionNode constructed in AddAccumulateFunctionNode
    input_meta_data_: x.autograd_meta_, requires_grad=true, is_leaf=true
    output_meta_data_: size=1, y.autograd_meta_, requires_grad=false, is_leaf=false
    output_tensor_infos_: corresponds to y, the output of the relu forward op
    backward_fn_: the lambda function defined in AutogradInterpreter::Apply

backward traverses the backward graph starting from the roots, based on this data.
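To make the shape of this data easier to picture, here is a minimal Python model of the two nodes. The dataclasses reuse the field names from the dump, but they are simplified assumptions: requires_grad and output_tensor_infos_ are left out, and the relu backward function is only a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class AutogradMeta:
    is_leaf: bool
    current_grad: Optional[list] = None   # gradient arriving in this backward pass
    acc_grad: Optional[list] = None       # what tensor.grad would return

@dataclass
class FunctionNode:
    name: str
    backward_fn: Callable
    input_meta_data: List[AutogradMeta] = field(default_factory=list)
    output_meta_data: List[AutogradMeta] = field(default_factory=list)
    next_functions: List["FunctionNode"] = field(default_factory=list)

def relu_backward_placeholder(out_grads):
    # Placeholder only; the real node would call the gradient closure here.
    return out_grads

x_meta = AutogradMeta(is_leaf=True)
y_meta = AutogradMeta(is_leaf=False)

# Leaf node: its backward function does nothing, it only receives gradients.
x_node = FunctionNode("accumulate_grad", backward_fn=lambda out_grads: [],
                      output_meta_data=[x_meta])
# Non-leaf node: next_functions points at the nodes that receive its input gradients.
y_node = FunctionNode("relu_backward", backward_fn=relu_backward_placeholder,
                      input_meta_data=[x_meta], output_meta_data=[y_meta],
                      next_functions=[x_node])
print(y_node.name, "->", [n.name for n in y_node.next_functions])
```

Note that y_node.input_meta_data[0] and x_node.output_meta_data[0] are the same object, which is how a gradient written by the relu node becomes visible to the leaf node.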

6 The entry point of backward

As mentioned in 《OneFlow原始碼閱讀4：tensor型別體系與local tensor》 (http://segmentfault.com/a/1190000041989895), the Tensor class is wrapped on the Python side, where Python mechanisms are used to register methods on it; backward is one of these wrapped methods.

The relevant source files are:

python/oneflow/framework/tensor.py
python/oneflow/autograd/__init__.py
oneflow/python/oneflow/autograd/autograd.py
oneflow/api/python/autograd/autograd.cpp

The C++ call flow is as follows:

Here is the example code used in this article once more:

    import oneflow as flow

    x = flow.tensor([-1.0, 2.0], requires_grad=True)
    y = flow.relu(x)
    y.backward(flow.Tensor([1, 1]))
    print(x.grad)

When the example code runs, the main arguments of Backward (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/api/python/autograd/autograd.cpp#L90) have the following values:

outputs: y, the tensor output by relu
out_grads: [1, 1]

CheckAndInitOutGrads (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/api/python/autograd/autograd.cpp#L49) returns the gradient that the loss passes through the current op to the current node. Part of its logic is what Section 3 discussed:

If y is a vector, backward must be given a vector with the same shape as y (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/api/python/autograd/autograd.cpp#L72-L81).
If y is a scalar, backward needs no argument, and the framework automatically constructs a tensor of all ones (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/api/python/autograd/autograd.cpp#L70).
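The two rules can be sketched as a small validation function. Everything below is an assumption made for illustration: shapes are plain tuples, a gradient is a (shape, values) pair, and the function name merely echoes CheckAndInitOutGrads without reproducing its actual signature or error messages.

```python
import math

def check_and_init_out_grads(output_shapes, out_grads):
    """Return one gradient per output; None means 'not supplied by the user'."""
    if len(output_shapes) != len(out_grads):
        raise ValueError("outputs and out_grads must have the same length")
    result = []
    for shape, grad in zip(output_shapes, out_grads):
        numel = math.prod(shape)               # the empty shape () has one element
        if grad is None:
            if numel != 1:
                raise ValueError("an implicit gradient is only allowed for scalar outputs")
            result.append((shape, [1.0]))      # ones-like for a scalar
        else:
            grad_shape, _ = grad
            if grad_shape != shape:
                raise ValueError(f"out_grad shape {grad_shape} does not match {shape}")
            result.append(grad)
    return result

scalar_grads = check_and_init_out_grads([()], [None])
vector_grads = check_and_init_out_grads([(2,)], [((2,), [1.0, 1.0])])
print(scalar_grads, vector_grads)
```

Calling it with a vector output and no gradient, for example check_and_init_out_grads([(2,)], [None]), raises ValueError, which corresponds to the requirement that a non-scalar y needs an explicit argument.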

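7 autograd.grad

Gradient computation and backward propagation are usually triggered through tensor.backward or autograd.backward, but occasionally the autograd.grad interface (http://oneflow.readthedocs.io/en/master/generated/oneflow.autograd.grad.html?highlight=.grad#oneflow.autograd.grad) is used. autograd.grad is very similar to autograd.backward; the main differences are:

autograd.backward starts from outputs (Tensor) and computes the gradient of every leaf node; the gradients accumulate and are stored in the tensor.grad of the corresponding inputs (Tensor).
autograd.grad starts from the specified outputs and ends at the specified inputs, and returns a TensorTuple of the gradients corresponding to inputs, in the order of the inputs argument. The gradients are returned directly and are not accumulated in the tensor.grad of inputs.

Because autograd.grad executes only part of the backward graph, in OneFlow's static graph mode (lazy mode) GraphTask has to prune the graph once when counting in-degrees, removing the nodes that need no computation (see the GraphTask::ComputeDependenciesAndPruneNode (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L346) interface). It also records the index of each input and, after FunctionNode::Apply (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L474) runs, captures the grads that need to be kept so that they can be returned to the user at the end.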
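The pruning idea can be illustrated on a toy graph: keep only the nodes that lie on some path from the root to one of the requested inputs. The graph, the node names and the function prune are assumptions made for this sketch and are not taken from ComputeDependenciesAndPruneNode.

```python
def prune(next_functions, root, targets):
    """Return the nodes that lie on some path from root to a target node."""
    keep = set()

    def reaches_target(node):
        # A node is needed if it is a target or any downstream node is needed.
        needed = node in targets
        for nxt in next_functions.get(node, []):
            if reaches_target(nxt):      # every branch is visited, none is skipped
                needed = True
        if needed:
            keep.add(node)
        return needed

    reaches_target(root)
    return keep

# Hypothetical backward graph: loss -> a -> w1 and loss -> b -> w2
graph = {"loss": ["a", "b"], "a": ["w1"], "b": ["w2"]}
kept = prune(graph, "loss", targets={"w1"})
print(sorted(kept))
```

Asking only for the gradient of w1 leaves the whole b -> w2 branch out, so its backward functions never need to run.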

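8 The call flow of GraphAutogradEngine in the backward computation

The flow of the backward graph computation can be analyzed by combining three kinds of information:

The flow code
The values of grad_fn_node_ of x and y above
The class diagram and the relationships among the classes

The arguments of the RunBackwardAndSaveGrads4LeafTensor function (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L445) are:

outputs: y, the output of relu
out_grads: the ones [1, 1] constructed by the user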

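8.1 Accumulating the gradients passed back

In RunBackwardAndSaveGrads4LeafTensor (http://github.com/Oneflow-Inc/oneflow/blob/release/0.7.0/oneflow/core/autograd/autograd_engine.cpp#L447), PushPartialTensor (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L450) accumulates the gradient passed from the loss into autograd_meta_.current_grad_.acc_tensor_. As noted in Section 4, TensorArg.acc_tensor_ stores the sum of the gradients passed from the loss. This is the gradient received by the roots (that is, y): either the ones created automatically by the framework or a gradient supplied by the user (usually also ones).

The logic of this line of code can be expressed with the following pseudocode:

    outputs[i].impl_.autograd_meta_.current_grad_.acc_tensor_ += out_grads[i]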

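8.2 Constructing and executing the backward computation task

FunctionNode only records the basic information of the backward graph. RunBackwardAndSaveGrads4LeafTensor additionally constructs a GraphTask object to represent one backward computation task.

The GraphTask constructor (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L452) mainly initializes the roots_ nodes of the backward graph and sets the dependency count dependencies_ of every node in the graph to 0. In the example code, roots_ is y (usually it is the loss).
ComputeDependencies (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L321) performs a depth-first traversal of the backward graph and counts the dependencies of each node.
GraphTask::Apply (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L405) implements the traversal of the backward graph (the save_grad_for_leaf argument passed in is true). A FunctionNode is put into the execution queue (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L439) only when its dependency count is 0, so the backward graph is traversed in topological order, and by the time FunctionNode::Apply runs, everything it depends on has already run. The gradient computation logic in GraphTask::Apply consists of two main parts:
Calling node->Apply to compute the gradient of a single node (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L421)
Calling node->AccGrad4LeafTensor to store the computed gradient (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L430)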
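The scheduling logic described above, counting dependencies first and then running a node once its count drops to zero, can be sketched as follows. The graph is a hypothetical one in which x feeds two ops, and the function names are assumptions; the sketch records only the execution order and performs no gradient arithmetic.

```python
from collections import deque

def compute_dependencies(next_functions, roots):
    """Count, for every node reachable from the roots, how many nodes feed it."""
    deps, seen, stack = {}, set(), list(roots)
    while stack:                       # iterative depth-first traversal
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        deps.setdefault(node, 0)
        for nxt in next_functions.get(node, []):
            deps[nxt] = deps.get(nxt, 0) + 1
            stack.append(nxt)
    return deps

def run(next_functions, roots):
    deps = compute_dependencies(next_functions, roots)
    queue = deque(n for n in roots if deps[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)             # stands in for node->Apply(...)
        for nxt in next_functions.get(node, []):
            deps[nxt] -= 1
            if deps[nxt] == 0:         # all upstream gradients have arrived
                queue.append(nxt)
    return order

# Hypothetical backward graph in which x feeds two ops: y -> u -> x, y -> v -> x
graph = {"y": ["u", "v"], "u": ["x"], "v": ["x"]}
order = run(graph, ["y"])
print(order)
```

x runs last because its count only reaches zero after both u and v have passed their gradients on, which is exactly the guarantee that the accumulated gradient is complete before a node uses it.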

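8.3 Gradient computation of a node

In FunctionNode::Apply (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L187), the core logic of the for loop that processes output_meta_data_ (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L195-L205) can be expressed with the following pseudocode:

    acc_tensor = output_meta_data_[i].current_grad_.acc_tensor_
    if (acc_tensor != nullptr) {
      output_grads[i] = acc_tensor
    } else {
      output_grads[i] = zeros()
    }

This shows that output_grads holds a copy of (the pointers to) the gradient data passed from upstream, to be used as the argument of backward_fn_.

As shown later, the core logic of backward_fn (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L206) is:

    // d(y) denotes the local gradient of the current node computed from y,
    // e.g. the gradient of relu computed from its output y.
    input_grads = d(y) * output_grads

input_grads is the gradient that the current node passes to its downstream nodes; it is assigned when backward_fn is called.

The core logic of the for loop that processes input_meta_data (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L214) can be expressed with the pseudocode below. In essence it accumulates the gradient that the current node passes downstream into the current_grad of the downstream node, which is how gradients propagate. If a tensor is an input to several ops, the gradients from those ops are added together.

    input_meta_data_[i].current_grad_.acc_tensor_ += input_grads[i]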
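Put together, the three pieces of pseudocode amount to the following plain-Python sketch. Meta and node_apply are hypothetical names, gradients are plain lists, and the local gradients of the two ops are made-up constants.

```python
def add(a, b):
    """Accumulate b into a, treating None as 'nothing received yet'."""
    return b if a is None else [p + q for p, q in zip(a, b)]

class Meta:
    """Hypothetical stand-in for AutogradMeta; only current_grad is modelled."""
    def __init__(self, size):
        self.size = size
        self.current_grad = None

def node_apply(backward_fn, input_metas, output_metas):
    # 1. Collect the gradients that arrived at the outputs (zeros if none did).
    output_grads = [m.current_grad if m.current_grad is not None else [0.0] * m.size
                    for m in output_metas]
    # 2. Compute the gradients with respect to the inputs.
    input_grads = backward_fn(output_grads)
    # 3. Accumulate them into the downstream nodes' current_grad.
    for meta, g in zip(input_metas, input_grads):
        meta.current_grad = add(meta.current_grad, g)

x_meta, u_meta, v_meta = Meta(2), Meta(2), Meta(2)
u_meta.current_grad = [1.0, 1.0]
v_meta.current_grad = [0.5, 0.5]
# x feeds two ops; made-up local gradients: u = 2 * x and v = 3 * x.
node_apply(lambda gs: [[2.0 * g for g in gs[0]]], [x_meta], [u_meta])
node_apply(lambda gs: [[3.0 * g for g in gs[0]]], [x_meta], [v_meta])
print(x_meta.current_grad)
```

After both nodes have run, x holds 2 * 1.0 + 3 * 0.5 in each position, the sum of the contributions of the two ops.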

8.3.1 Executing the gradient computation: backward_fn

The following considers only the execution of the root node of the example, that is, the FunctionNode corresponding to y. For y, backward_fn is the lambda expression defined in AutogradInterpreter::Apply (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/framework/op_interpreter/op_interpreter.cpp#L103-L110). For relu, execution proceeds as follows:

Section 5.1 already established that OpExprGradClosure::impl_ is ReLU (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/gradient_funcs/activation.cpp#L200).
As described above, among the arguments of backward_fn, output_grads is the gradient data passed from upstream; backward_fn computes the gradient of relu, and the product of the two is assigned to in_grads. These arguments are passed all the way down to ReLU::Apply (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/gradient_funcs/activation.cpp#L213).

The functor name of functional::ReluGrad (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/gradient_funcs/activation.cpp#L219) is ReluGrad; the corresponding functor is ReluGradFunctor (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/functional/impl/activation_functor.cpp#L61), in the namespace oneflow::one::functional::impl.

Below ReluGradFunctor, the computation is implemented by a Primitive-based kernel. The op name used in ReluGradFunctor is "relu_grad". The registration of relu_grad is wrapped in a macro definition (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/user/kernels/activation_kernels.cpp#L331) and actually returns a BinaryPrimitiveKernel, a slightly special Primitive-based kernel. Concretely it is a BroadcastElementwiseBinary factory under ep::primitive (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/user/kernels/activation_kernels.cpp#L337-L339), whose cpu and cuda registrations are located in:

oneflow/core/ep/cpu/primitive/broadcast_elementwise_binary.cpp
oneflow/core/ep/cuda/primitive/broadcast_elementwise_binary.cu

The final implementation is in binary_functor.h (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/ep/common/primitive/binary_functor.h#L354):

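```
template<DeviceType device, typename Src, typename Dst>
struct BinaryFunctor<device, BinaryOp::kReluBackwardWithDyY, Src, Dst> {
  OF_DEVICE_FUNC BinaryFunctor(Scalar attr0, Scalar attr1) {}

  // The gradient passes through where y > 0 and is zero elsewhere.
  OF_DEVICE_FUNC Dst operator()(Src dy, Src y) const {
    return static_cast<Dst>((y <= static_cast<Src>(0.0)) ? static_cast<Src>(0.0) : dy);
  }
};
```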

This completes the gradient computation logic.

8.4 Storing the gradient

After FunctionNode::Apply finishes, GraphTask::Apply calls FunctionNode::AccGrad4LeafTensor (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L430) to copy the gradient data for leaf nodes.
In the example above, y is not a leaf node, so nothing substantial happens when y.grad_fn_node_ is processed. For x, CopyOrAccGrad (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L84) is called; in pseudocode its logic is:

    autograd_meta.acc_grad_ += autograd_meta.current_grad_

autograd_meta.acc_grad_ is the gradient of x that is read on the Python side.
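A minimal sketch of "copy or accumulate", assuming gradients are plain Python lists and that a missing gradient is represented by None. Meta and copy_or_acc_grad are hypothetical names that mirror the pseudocode, not the C++ function.

```python
class Meta:
    """Hypothetical stand-in for AutogradMeta."""
    def __init__(self):
        self.current_grad = None   # gradient of the backward pass in progress
        self.acc_grad = None       # what tensor.grad returns

def copy_or_acc_grad(meta):
    if meta.current_grad is None:
        return                                             # nothing arrived
    if meta.acc_grad is None:
        meta.acc_grad = list(meta.current_grad)            # first backward: copy
    else:                                                  # later ones: accumulate
        meta.acc_grad = [a + c for a, c in zip(meta.acc_grad, meta.current_grad)]

x_meta = Meta()
for _ in range(2):                 # two backward passes without zeroing the grad
    x_meta.current_grad = [0.0, 1.0]
    copy_or_acc_grad(x_meta)
print(x_meta.acc_grad)
```

This is why running backward twice without clearing the gradient doubles the value read from x.grad.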

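8.5 Releasing temporary gradients

Section 5 described how, while the forward graph is built, the corresponding FunctionNode and the backward_fn of each forward op are stored. When gradients are actually computed and propagated backward, these backward_fn functions are chained together to form the topology of the backward graph. For each node, backward_fn can be expressed as a function output_grads, inputs/outputs (optional) -> inputs_grads.

Here output_grads is the accumulated gradient computed upstream in the chain rule. Once the backward_fn of the current node has finished, that node's output_grads is no longer used and becomes a temporary gradient. FunctionNode->ReleaseOutTensorArgs() (http://github.com/Oneflow-Inc/oneflow/blob/48e511e40e09551408c96722c09bd061ce320687/oneflow/core/autograd/autograd_engine.cpp#L432) is then called to release this temporary gradient promptly.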

References

oneflow master (http://github.com/Oneflow-Inc/oneflow/tree/48e511e40e09551408c96722c09bd061ce320687)
OneFlow學習筆記：Autograd解析 (http://mp.weixin.qq.com/s/6zm4xRpRkptchGOyyk0JCA)
OneFlow: AUTOGRAD (http://docs.oneflow.org/en/master/basics/05_autograd.html)
自動微分的原理介紹 (http://mp.weixin.qq.com/s/BwQxmNoSBEnUlJ1luOwDag)
自動求梯度 (http://tangshusen.me/Dive-into-DL-PyTorch/#/chapter02_prerequisite/2.3_autograd?id=_233-梯度)
PyTorch 的 backward 為什麼有一個 grad_variables 引數？ (http://zhuanlan.zhihu.com/p/29923090)
PyTorch 101, Part 1: Understanding Graphs, Automatic Differentiation and Autograd (http://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/)

You are welcome to download and try the latest version, OneFlow v0.8.0: http://github.com/Oneflow-Inc/oneflow/

OneFlow Source Code Analysis: The Automatic Differentiation Mechanism
