Regularization in Neural Networks
Adding regularization will often help to prevent the overfitting problem (the high-variance problem).
1. Logistic regression
Recall the optimization objective used during training:
$$
\min \limits_{w,b}J\left(w,b\right), \ \ \ \ w\in\mathbb{R}^{n_x},b\in\mathbb{R} \tag{1-1}
$$
where
$$
J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)\ \tag{1-2}
$$
$L_2$ regularization (most commonly used):
$$ J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{2m}\left\lVert w \right\rVert_2^2\ \tag{1-3}
$$
where
$$ \left\lVert w \right\rVert_2^2=\sum_{j=1}^{n_x}w_j^2=w^Tw\tag{1-4}
$$
Why do we regularize just the parameter $w$? Because $w$ is usually a high-dimensional parameter vector while $b$ is a scalar; almost all of the parameters are in $w$ rather than $b$.
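As a minimal sketch of equation (1-3) in NumPy (the helper name `l2_regularized_cost`, the parameter name `lam`, and the data layout `X` of shape `(n_x, m)` are assumptions for illustration, not part of these notes):

```python
import numpy as np

def l2_regularized_cost(w, b, X, Y, lam):
    """Cross-entropy cost (1-2) plus the L2 penalty (lambda/2m)*||w||_2^2 from (1-3).

    Assumed shapes: X is (n_x, m), Y is (1, m), w is (n_x, 1), b is a scalar.
    """
    m = X.shape[1]
    y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))          # sigmoid(w^T x + b)
    cross_entropy = -np.sum(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat)) / m
    l2_penalty = (lam / (2 * m)) * np.sum(np.square(w))          # (lambda/2m) * w^T w
    return cross_entropy + l2_penalty
```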
$L_1$ regularization:
$$
J\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{m}\left\lvert w \right\rvert_1\tag{1-5}
$$
where
$$
\left\lvert w \right\rvert_1=\sum_{j=1}^{n_x}\left\lvert w_j \right\rvert \tag{1-6}
$$
$w$ will end up being sparse. In other words, the $w$ vector will have a lot of zeros in it, which can help compress the model a little.
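A corresponding sketch of the $L_1$ term from (1-5) (the function name `l1_penalty` and the argument names are made up here purely for illustration):

```python
import numpy as np

def l1_penalty(w, m, lam):
    """The (lambda/m)*||w||_1 term of (1-5); added to the cross-entropy cost of (1-2)."""
    return (lam / m) * np.sum(np.abs(w))
```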
2. Neural network "Frobenius norm"
$$ J\left(w^{[1]},b^{[1]},\cdots,w^{[L]},b^{[L]}\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(\hat y^{(i)},y^{(i)}\right)+\frac{\lambda}{2m}\sum_{l=1}^{L}{\left\lVert w^{[l]} \right\rVert_F^2 }\tag{2-1}
$$
where
$$
\left\lVert w^{[l]} \right\rVert_F^2=\sum_{i=1}^{n^{[l-1]}}\sum_{j=1}^{n^{[l]}}\left(w_{ij}^{[l]}\right)^2 \tag{2-2}
$$
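The per-layer penalty in (2-1)/(2-2) can be sketched as follows (the function name and the list-of-matrices representation of the weights are assumptions, not from the notes):

```python
import numpy as np

def frobenius_penalty(weights, m, lam):
    """(lambda/2m) * sum_l ||W^[l]||_F^2 from (2-1); `weights` is assumed to be the list [W1, ..., WL]."""
    return (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
```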
$L_2$ regularization is also called weight decay:
$$
\begin{aligned}
dw^{[l]}&=\left(\text{from backprop}\right)+\frac{\lambda}{m}w^{[l]}\\
w^{[l]}&:=w^{[l]}-\alpha\, dw^{[l]}\\
&=\left(1-\frac{\alpha\lambda}{m}\right)w^{[l]}-\alpha\left(\text{from backprop}\right)
\end{aligned}
\tag{2-3}
$$
This keeps the weights $w$ from becoming too large, which helps prevent overfitting.
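A small sketch of the weight-decay update in (2-3), assuming `dW_backprop` is the unregularized gradient of layer $l$ from backprop (names and shapes here are hypothetical):

```python
import numpy as np

def weight_decay_step(W, dW_backprop, lam, m, alpha):
    """One gradient-descent step with the L2 term, matching (2-3)."""
    dW = dW_backprop + (lam / m) * W     # gradient including the Frobenius penalty
    return W - alpha * dW                # equals (1 - alpha*lam/m) * W - alpha * dW_backprop
```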
3. Inverted dropout
For different training examples, a random subset of nodes can be eliminated.
Inverted dropout (dropout is needed in both the forward and backward passes):
$$
\begin{aligned}
d^{[3]}&=\text{np.random.rand}(a^{[3]}.\text{shape}[0],\,a^{[3]}.\text{shape}[1])<\text{keep\_prob}\\
a^{[3]}&=\text{np.multiply}(a^{[3]},d^{[3]})\qquad \#\ a3*d3,\ \text{element-wise multiplication}\\
a^{[3]}&\ /\!=\text{keep\_prob}\qquad \#\ \text{in order to not reduce the expected value of } a^{[3]}\ \text{(inverted dropout)}\\
z^{[4]}&=w^{[4]}a^{[3]}+b^{[4]}\\
z^{[4]}&\ /\!=\text{keep\_prob}
\end{aligned}
\tag{3-1}
$$
By dividing by keep_prob, this inverted dropout technique ensures that the expected value of $a^{[3]}$ remains the same. This makes test time easier because there is less of a scaling problem; dropout is not used at test time.
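A runnable sketch of the inverted-dropout forward step for layer 3 (the array shapes and the value of `keep_prob` are made up for illustration; at test time the mask and the division are simply omitted):

```python
import numpy as np

keep_prob = 0.8
a3 = np.random.randn(5, 10)                                 # hypothetical activations of layer 3, shape (n3, m)
w4 = np.random.randn(3, 5)                                  # hypothetical weights and bias of layer 4
b4 = np.zeros((3, 1))

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # boolean mask, True with probability keep_prob
a3 = np.multiply(a3, d3)                                    # zero out roughly (1 - keep_prob) of the units
a3 /= keep_prob                                             # scale up so the expected value of a3 is unchanged
z4 = np.dot(w4, a3) + b4                                    # next layer's pre-activation, shape (3, m)
```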