CosineWarmup: Theory and Code in Practice

Abstract: CosineWarmup is a highly practical training strategy. This tutorial walks you through implementing it, covering both the theory and a hands-on coding section.

This article is shared from the Huawei Cloud community post 《CosineWarmup理論介紹與代碼實戰》, by 李長安.

CosineWarmup is a highly practical training strategy, and this tutorial covers it from both a theoretical and a hands-on coding perspective.

In the hands-on part, a small LeNet-5-style convolutional network is used as the test model, with the Cifar10 dataset as the benchmark data.

Warmup first appeared in the paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. Much like warming up before a run, the network starts training with a small learning rate so that it can gradually get used to the data distribution; as training proceeds, the learning rate slowly grows, and once the specified number of epochs is reached, training continues at the originally configured learning rate.

The cosine learning rate schedule comes from the paper Bag of Tricks for Image Classification with Convolutional Neural Networks, which adjusts the learning rate with a cosine function.

In most cases, warmup is only applied during the first five epochs, and combining warmup with a cosine learning rate schedule usually yields better results.

  • Warmup

Warmup is a learning rate warm-up method mentioned in the ResNet paper. Training begins with a small learning rate for a few epochs or steps (for example 4 epochs or 10,000 steps), and then switches to the preset learning rate for the rest of training. Because the model weights are randomly initialized at the start, a large learning rate at that point can make the model unstable (oscillate). With warmup, the learning rate stays small over the first few epochs or steps, allowing the model to gradually stabilize; once it is relatively stable, training continues with the preset learning rate, which speeds up convergence and improves the final model quality.
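To make the rule concrete, here is a minimal sketch of a linear warmup (my own illustration, not code from the paper): over the first warmup_steps steps the learning rate ramps linearly from start_lr to the target end_lr.

def linear_warmup_lr(step, warmup_steps, start_lr, end_lr):
    # Linearly interpolate from start_lr to end_lr over the warmup phase,
    # then hold the target learning rate afterwards.
    if step >= warmup_steps:
        return end_lr
    return start_lr + (end_lr - start_lr) * step / warmup_steps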

  • Cosine annealing

When we optimize an objective with gradient descent, the learning rate should become smaller as the loss approaches its global minimum, so that the model can get as close to that point as possible. Cosine annealing lowers the learning rate by following a cosine curve: as x increases, the cosine value first falls slowly, then drops quickly, and finally falls slowly again. This decay pattern pairs well with the learning rate and produces good results at very little computational cost.
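As a hedged sketch (again my own illustration, using the standard single-cycle formula rather than anything defined in this post), the annealed learning rate at step t out of T total steps can be computed as:

import math

def cosine_annealing_lr(base_lr, t, T, eta_min=0.0):
    # Decay from base_lr down to eta_min along a half cosine wave.
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / T))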

  • Cosine annealing with warmup
  • Single-cycle cosine annealing decay curve

Taking single-cycle cosine annealing decay as an example of the warmup-plus-cosine-annealing strategy: the learning rate first rises slowly, and once it reaches the configured maximum it is decayed along a cosine curve, as the sketch below illustrates. When training on large datasets, this rise-and-decay pattern is often repeated over several cycles.
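The following sketch combines the two rules into the single-cycle schedule just described (my own illustration with illustrative parameter values, not code from the post):

import math

def warmup_cosine_lr(step, warmup_steps, total_steps, base_lr):
    # Linear warmup for the first warmup_steps steps, cosine decay afterwards.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# For example, 5 warmup epochs out of 100, peaking at a learning rate of 0.1:
# lrs = [warmup_cosine_lr(s, 5, 100, 0.1) for s in range(100)]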

Code Implementation

from paddle.optimizer.lr import LinearWarmup
from paddle.optimizer.lr import CosineAnnealingDecay


class Cosine(CosineAnnealingDecay):
    """
    Cosine learning rate decay
    lr = 0.5 * lr * (math.cos(epoch * (math.pi / epochs)) + 1)
    Args:
        lr(float): initial learning rate
        step_each_epoch(int): steps each epoch
        epochs(int): total training epochs
    """

    def __init__(self, lr, step_each_epoch, epochs, **kwargs):
        super(Cosine, self).__init__(
            learning_rate=lr,
            T_max=step_each_epoch * epochs)
        self.update_specified = False


class CosineWarmup(LinearWarmup):
    """
    Cosine learning rate decay with warmup
    [0, warmup_epoch): linear warmup
    [warmup_epoch, epochs): cosine decay
    Args:
        lr(float): initial learning rate
        step_each_epoch(int): steps each epoch
        epochs(int): total training epochs
        warmup_epoch(int): epoch num of warmup
    """

    def __init__(self, lr, step_each_epoch, epochs, warmup_epoch=5, **kwargs):
        assert epochs > warmup_epoch, \
            "total epoch({}) should be larger than warmup_epoch({}) in CosineWarmup.".format(
                epochs, warmup_epoch)
        warmup_step = warmup_epoch * step_each_epoch
        start_lr = 0.0
        end_lr = lr
        # The cosine decay covers the epochs that remain after warmup.
        lr_sch = Cosine(lr, step_each_epoch, epochs - warmup_epoch)

        super(CosineWarmup, self).__init__(
            learning_rate=lr_sch,
            warmup_steps=warmup_step,
            start_lr=start_lr,
            end_lr=end_lr)
        self.update_specified = False
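As a quick sanity check (an assumed usage sketch, not part of the original post, with illustrative numbers), the scheduler can be stepped manually once per iteration and its values inspected:

# Illustrative numbers: 10 steps per epoch, 10 epochs total, 2 warmup epochs.
sched = CosineWarmup(lr=0.1, step_each_epoch=10, epochs=10, warmup_epoch=2)
for step in range(100):
    if step % 20 == 0:
        print(step, sched())  # calling the scheduler returns the current learning rate
    sched.step()              # advance the schedule by one iteration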

Hands-on Practice

import paddle
import paddle.nn.functional as F
from paddle.vision.transforms import ToTensor
print(paddle.__version__)
2.0.2
transform = ToTensor()
cifar10_train = paddle.vision.datasets.Cifar10(mode='train',
                                               transform=transform)
cifar10_test = paddle.vision.datasets.Cifar10(mode='test',
                                              transform=transform)
# Build the training data loader
train_loader = paddle.io.DataLoader(cifar10_train, batch_size=64, shuffle=True)
# Build the test data loader
test_loader = paddle.io.DataLoader(cifar10_test, batch_size=64, shuffle=True)
Cache file /home/aistudio/.cache/paddle/dataset/cifar/cifar-10-python.tar.gz not found, downloading http://dataset.bj.bcebos.com/cifar/cifar-10-python.tar.gz 
Begin to download
Download finished
class MyNet(paddle.nn.Layer):
    def __init__(self, num_classes=10):
        super(MyNet, self).__init__()
        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3), stride=1, padding=1)
        # self.pool1 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3), stride=2, padding=0)
        # self.pool2 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv3 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=0)
        # self.DropBlock = DropBlock(block_size=5, keep_prob=0.9, name='le')
        self.conv4 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=1)
        self.flatten = paddle.nn.Flatten()
        self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)
        self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        # x = self.pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        # x = self.pool2(x)
        x = self.conv3(x)
        x = F.relu(x)
        # x = self.DropBlock(x)
        x = self.conv4(x)
        x = F.relu(x)
        x = self.flatten(x)
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x
# Print a summary of the model
cnn2 = MyNet()
model2 = paddle.Model(cnn2)
model2.summary((64, 3, 32, 32))
---------------------------------------------------------------------------
 Layer (type)     Input Shape          Output Shape         Param #    
===========================================================================
   Conv2D-1     [[64, 3, 32, 32]]     [64, 32, 32, 32]          896      
   Conv2D-2     [[64, 32, 32, 32]]    [64, 64, 15, 15]        18,496     
   Conv2D-3     [[64, 64, 15, 15]]     [64, 64, 7, 7]         36,928     
   Conv2D-4      [[64, 64, 7, 7]]      [64, 64, 4, 4]         36,928     
  Flatten-1      [[64, 64, 4, 4]]        [64, 1024]               0      
   Linear-1        [[64, 1024]]           [64, 64]            65,600     
   Linear-2         [[64, 64]]            [64, 10]               650     
===========================================================================
Total params: 159,498
Trainable params: 159,498
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 0.75
Forward/backward pass size (MB): 25.60
Params size (MB): 0.61
Estimated Total Size (MB): 26.96
---------------------------------------------------------------------------
{'total_params': 159498, 'trainable_params': 159498}
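As a quick sanity check on the Param # column (my own arithmetic, not part of the original output): Conv2D-1 has 3 x 3 x 3 x 32 = 864 weights plus 32 biases, i.e. 896 parameters, and Linear-1 has 1024 x 64 weights plus 64 biases, i.e. 65,600.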
# Configure the model
from paddle.metric import Accuracy

# Note: the CosineWarmup class defined above takes warmup_epoch (default 5) rather than
# warmup_steps, and sets start_lr/end_lr internally (0 -> lr), so only the arguments it
# actually accepts are passed here. step_each_epoch would normally match len(train_loader)
# so that the schedule spans the full run.
scheduler = CosineWarmup(lr=0.5, step_each_epoch=100, epochs=8)
optim = paddle.optimizer.SGD(learning_rate=scheduler, parameters=model2.parameters())
model2.prepare(
    optim,
    paddle.nn.CrossEntropyLoss(),
    Accuracy()
)
# Train and evaluate the model
model2.fit(train_loader,
           test_loader,
           epochs=3,
           verbose=1,
           )
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/3
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
 return (isinstance(seq, collections.Sequence) and
step 782/782 [==============================] - loss: 1.9828 - acc: 0.2280 - 106ms/step         
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 157/157 [==============================] - loss: 1.5398 - acc: 0.3646 - 35ms/step        
Eval samples: 10000
Epoch 2/3
step 782/782 [==============================] - loss: 1.7682 - acc: 0.3633 - 106ms/step         
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 157/157 [==============================] - loss: 1.7934 - acc: 0.3867 - 34ms/step        
Eval samples: 10000
Epoch 3/3
step 782/782 [==============================] - loss: 1.3394 - acc: 0.4226 - 105ms/step         
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 157/157 [==============================] - loss: 1.4539 - acc: 0.3438 - 35ms/step        
Eval samples: 10000

Summary

I had mentioned CosineWarmup many times before without ever implementing it, so this finally fills a hole I dug a long time ago. As before, no comparison experiment is set up here, because the strategy really does work. A small model on a small dataset may not show the benefit of this training strategy very clearly; if you are interested, try it with a larger model and a larger dataset.

 
