Regularization in Neural Networks



Adding regularization will often help to prevent overfitting (the high-variance problem).

1. Logistic regression

Recall the optimization objective function used during training: $$
\min \limits_{w,b}J\left(w,b\right), \ \ \ \ w\in\mathbb{R}^{n_x},b\in\mathbb{R} \tag{1-1}
$$
where $$
J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)\ \tag{1-2}
$$ $L_2$ regularization (most commonly used):
$$ J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{2m}\left\lVert w \right\rVert_2^2\ \tag{1-3}
$$
where $$ \left\lVert w \right\rVert_2^2=\sum_{j=1}^{n_x}w_j^2=w^Tw\tag{1-4}
$$
Why do we regularize just the parameter $w$? Because $w$ is usually a high-dimensional parameter vector while $b$ is a scalar. Almost all the parameters are in $w$ rather than $b$.
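As a minimal sketch (not from the original notes), the $L_2$-regularized cost in equation (1-3) can be computed with numpy as follows; the function and variable names are illustrative, and the loss is assumed to be the cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def l2_regularized_cost(w, b, X, Y, lambd):
    """Cross-entropy cost plus the L2 penalty from equation (1-3).

    Assumed shapes: X is (n_x, m), Y is (1, m), w is (n_x, 1), b is a scalar.
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)                     # predictions, shape (1, m)
    cross_entropy = -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))   # (lambda / 2m) * ||w||_2^2
    return cross_entropy + l2_penalty
```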
$L_1$ regularization:
$$
J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{m}\left\lvert w \right\rvert_1\tag{1-5}
$$
where $$
\left\lvert w \right\rvert_1=\sum_j^{n_x}\left\lvert w_j \right\rvert \tag{1-6} $$
$w$ will end up being sparse. In other words, the $w$ vector will have a lot of zeros in it. This can help with compressing the model a little.
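A rough sketch of the $L_1$ penalty from equation (1-5) and its subgradient (illustrative names, not part of the original notes):

```python
import numpy as np

def l1_penalty_and_grad(w, lambd, m):
    """L1 penalty (lambda / m) * ||w||_1 and its subgradient with respect to w."""
    penalty = (lambd / m) * np.sum(np.abs(w))
    grad = (lambd / m) * np.sign(w)    # pushes small weights toward exactly zero, hence sparsity
    return penalty, grad
```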

2. Neural network "Frobenius norm"

$$ J\left(w^{[1]},b^{[1]},\cdots,w^{[L]},b^{[L]}\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{2m}\sum_{l=1}^{L}{\left\lVert w^{[l]} \right\rVert_F^2 }\tag{2-1}
$$ where $$
\left\lVert w^{[l]} \right\rVert_F^2=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\left(w_{ij}^{[l]}\right)^2 \tag{2-2} $$
$L_2$ regularization is also called weight decay:
$$
\begin{aligned}
dw^{[l]}&=\left(\text{from backprop}\right)+\frac{\lambda}{m}w^{[l]}\\
w^{[l]}:&=w^{[l]}-\alpha\, dw^{[l]}\\
&=\left(1-\frac{\alpha\lambda}{m}\right)w^{[l]}-\alpha\left(\text{from backprop}\right)
\end{aligned}\tag{2-3}
$$
This keeps the weights $w$ from growing too large, which helps avoid overfitting.
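A hedged sketch of the weight-decay update in equation (2-3) for a single layer; `dW_backprop` stands for the gradient of the unregularized loss, and the names are illustrative:

```python
import numpy as np

def gradient_step_with_weight_decay(W, dW_backprop, lambd, m, alpha):
    """One update step with L2 regularization (equation 2-3)."""
    dW = dW_backprop + (lambd / m) * W   # add the regularization term to the gradient
    W = W - alpha * dW                   # equals (1 - alpha*lambd/m) * W - alpha * dW_backprop
    return W
```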

3. Inverted dropout

For each training example, a random subset of the nodes can be dropped.
Inverted dropout (dropout is applied in both forward and back propagation): $$
\begin{aligned}
d^{[3]}&=np.random.rand(a^{[3]}.shape[0],a^{[3]}.shape[1]) < keep\_prob\\
a^{[3]}&=np.multiply(a^{[3]},d^{[3]})&&\#\ \text{element-wise multiplication}\\
a^{[3]}&=a^{[3]}/keep\_prob&&\#\ \text{keep the expected value of } a^{[3]} \text{ unchanged (inverted dropout)}\\
z^{[4]}&=w^{[4]}a^{[3]}+b^{[4]}
\end{aligned}\tag{3-1}
$$

By dividing by `keep_prob`, the inverted dropout technique ensures that the expected value of $a^{[3]}$ remains the same. This makes test time easier because there is less of a scaling problem; dropout is not used at test time.
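A minimal runnable sketch of the inverted-dropout forward step in equation (3-1), assuming `a3` is the activation of layer 3; names follow the pseudocode above and are illustrative:

```python
import numpy as np

def inverted_dropout_forward(a3, W4, b4, keep_prob=0.8):
    """Apply inverted dropout to a3, then compute z4 = W4 a3 + b4."""
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # mask: True with probability keep_prob
    a3 = np.multiply(a3, d3)           # zero out the dropped units
    a3 = a3 / keep_prob                # scale up so the expected value of a3 is unchanged
    z4 = np.dot(W4, a3) + b4
    return z4, d3                      # d3 is reused in backprop; no dropout at test time
```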