Neural Network Notes (2): Batch Normalization & Dropout

2021-09-05, about 2,530 characters, estimated reading time 6 minutes

Batch Normalization

The corresponding assignment notebook is BatchNormalization.ipynb.

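Batch Normalization. Batch Normalization is a technique proposed by Ioffe and Szegedy in 2015 that takes much of the headache out of properly initializing neural networks :). It explicitly forces the activations throughout the network to follow a unit Gaussian distribution at the start of training. This is possible because normalization is a simple, differentiable operation. In practice, applying the technique usually means inserting a BatchNorm layer between a fully connected layer (or a convolutional layer, covered later) and the activation function. This section does not go into further detail because the referenced paper explains it well; the point to remember is that Batch Normalization has become very common in neural networks. Networks that use it are noticeably more robust to bad initialization. In one sentence: Batch Normalization can be seen as doing preprocessing before every layer of the network, except that the operation is integrated into the network itself. Done!¹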

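The Batch Normalization paper describes the phenomenon of Internal Covariate Shift: the distribution of each layer's inputs changes during training because the parameters of the preceding layers change, so every layer has to keep adapting to a new input distribution. In a deep network, a change in the parameters of an early layer can be amplified as it propagates through the later layers. This makes it hard to choose a suitable learning rate, and it can push nonlinearities into their saturated regime, where training becomes difficult. BN reduces Internal Covariate Shift effectively, which allows higher learning rates and makes parameter initialization less delicate.² ³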

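The idea behind BN is to normalize each layer's inputs in the hope of speeding up training. It has long been known that a network converges faster when its inputs are whitened.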

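Normalizing the mean and standard deviation of a unit can reduce the expressive power of the network that contains it. To preserve that expressive power, the normalized input $\hat{x}^{(k)}$ is usually replaced by a scaled and shifted version: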

$$ y^{(k)}=\gamma^{(k)}\hat{x}^{(k)}+\beta^{(k)} $$

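In particular, when $\gamma=\sqrt{\sigma^2+\epsilon}$ (that is, $\gamma^2=\sigma^2$ if $\epsilon$ is ignored) and $\beta=\mu$, where $\mu$ and $\sigma^2$ are the batch mean and variance, the layer becomes an identity transform and the distribution of the original input features is preserved. This step is what keeps the expressive power of the input, at least to some extent. $\gamma$ and $\beta$ are two parameters that are learned during training.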
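The identity case can be checked numerically. The snippet below is a small sketch, not part of the assignment: the array shape, the random data, and the variable names are assumptions chosen for illustration. It normalizes a batch, then applies $\gamma=\sqrt{\sigma^2+\epsilon}$ and $\beta=\mu$ and confirms that the original input comes back.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))  # 8 samples, 4 features (illustrative)
eps = 1e-5

# Per-feature batch statistics and the normalized activations
mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# gamma = sqrt(var + eps) and beta = mu undo the normalization exactly
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

max_abs_diff = np.max(np.abs(y - x))
print(max_abs_diff)  # difference is at floating-point rounding level
```

Because the network can learn these values of $\gamma$ and $\beta$, inserting a BN layer never removes the option of passing the activations through unchanged.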

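Forward and Backward Passes

Forward pass

During training, every batch updates the running averages of the mean and variance, which are later used to normalize inputs at test time. The update is: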

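# momentum is the decay coefficient applied to the old running value.
# Note: PyTorch's default momentum=0.1 uses the opposite convention
# (it weights the new batch statistic), which corresponds to 0.9 here.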
running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var
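To make the role of `momentum` concrete, here is a minimal sketch with scalar statistics. The two helper functions and the constant batch statistic of 5.0 are hypothetical, introduced only for illustration. The first helper follows the update rule shown above; the second follows the convention PyTorch documents for its BatchNorm layers, where `momentum` multiplies the new statistic instead of the old one.

```python
def update_running(running, sample, momentum=0.9):
    """Update as written above: momentum weights the old running value."""
    return momentum * running + (1 - momentum) * sample


def update_running_torch_style(running, sample, momentum=0.1):
    """PyTorch-style update: momentum weights the new batch statistic."""
    return (1 - momentum) * running + momentum * sample


running_a = 0.0
running_b = 0.0
for _ in range(200):
    # Pretend every batch reports the same statistic, 5.0
    running_a = update_running(running_a, 5.0, momentum=0.9)
    running_b = update_running_torch_style(running_b, 5.0, momentum=0.1)

print(running_a, running_b)  # both approach 5.0
```

With these settings the two conventions describe the same exponential moving average, so a momentum of 0.9 in one corresponds to 0.1 in the other.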

Assignment code:

if mode == "train":
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Per-feature statistics of the current mini-batch
    sample_mean = np.mean(x, axis=0)
    sample_var = np.var(x, axis=0)

    # Normalize, then scale and shift
    x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
    out = gamma * x_hat + beta
    cache = (x, x_hat, sample_mean, sample_var, gamma, beta, eps)

    # Update the running statistics used at test time
    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

elif mode == "test":
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Normalize with the running statistics accumulated during training
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
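The two branches above are fragments of a larger function. The sketch below assembles them into a self-contained version so the train/test behaviour can be observed end to end. The function name `bn_forward_sketch`, the defaults for `eps` and `momentum`, and the synthetic data are assumptions for illustration; unlike the assignment function it returns only the output and no cache.

```python
import numpy as np


def bn_forward_sketch(x, gamma, beta, bn_param):
    """Batch norm forward pass for input of shape (N, D); returns the output only."""
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)
    D = x.shape[1]
    running_mean = bn_param.get("running_mean", np.zeros(D))
    running_var = bn_param.get("running_var", np.zeros(D))

    if mode == "train":
        # Normalize with the statistics of the current batch
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    elif mode == "test":
        # Normalize with the running statistics collected during training
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    else:
        raise ValueError(f"Invalid batchnorm mode: {mode!r}")

    # Store the running statistics back so the next call can see them
    bn_param["running_mean"] = running_mean
    bn_param["running_var"] = running_var
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
gamma, beta = np.ones(3), np.zeros(3)
bn_param = {"mode": "train"}
for _ in range(200):
    batch = rng.normal(loc=10.0, scale=3.0, size=(64, 3))
    out_train = bn_forward_sketch(batch, gamma, beta, bn_param)

bn_param["mode"] = "test"
test_batch = rng.normal(loc=10.0, scale=3.0, size=(1000, 3))
out_test = bn_forward_sketch(test_batch, gamma, beta, bn_param)

print(out_train.mean(axis=0), out_train.std(axis=0))
print(out_test.mean(axis=0), out_test.std(axis=0))
```

In training mode each feature of the output has mean zero and standard deviation close to one by construction. In test mode the output is only approximately standardized, because it relies on the running estimates rather than on the statistics of the batch being processed.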

Backward pass

The partial derivatives to compute are $\frac{\partial L}{\partial x_i}$, $\frac{\partial L}{\partial \gamma}$, and $\frac{\partial L}{\partial \beta}$. The paper contains the derivation. The implementation is:

# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

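N, D = dout.shape
x, x_hat, sample_mean, sample_var, gamma, beta, eps = cache
# Gradient w.r.t. the normalized input
dx_hat = dout * gamma
# Gradients w.r.t. the batch variance and mean (all sums run over the batch axis)
dvar = -0.5 * np.sum(dx_hat * (x - sample_mean), axis=0) * np.power(sample_var + eps, -1.5)
dmean = np.sum(dx_hat * (-1.0 / np.sqrt(sample_var + eps)), axis=0) + dvar * np.sum(-2 * (x - sample_mean), axis=0) / N
# x influences the output directly, through the variance, and through the mean
dx = dx_hat / np.sqrt(sample_var + eps) + dvar * 2 * (x - sample_mean) / N + dmean / N
dgamma = np.sum(dout * x_hat, axis=0)
dbeta = np.sum(dout, axis=0)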

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
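A backward pass like the one above is easy to get subtly wrong, so it is worth comparing it against a numerical gradient. The following is a sketch of such a check under stated assumptions: `bn_forward`, `bn_backward`, `numerical_grad`, and `rel_error` are hypothetical standalone helpers written for this example (they are not the assignment's own utilities), and the shapes and step size are illustrative.

```python
import numpy as np


def bn_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm; returns the output and a cache for backprop."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x, x_hat, mean, var, gamma, eps)


def bn_backward(dout, cache):
    """Analytic gradients, following the same chain-rule steps as the code above."""
    x, x_hat, mean, var, gamma, eps = cache
    N = x.shape[0]
    dx_hat = dout * gamma
    dvar = -0.5 * np.sum(dx_hat * (x - mean), axis=0) * (var + eps) ** -1.5
    dmean = (-np.sum(dx_hat, axis=0) / np.sqrt(var + eps)
             + dvar * np.sum(-2.0 * (x - mean), axis=0) / N)
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mean) / N + dmean / N
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    return dx, dgamma, dbeta


def numerical_grad(f, arr, dout, h=1e-5):
    """Central-difference gradient of sum(f(arr) * dout) with respect to arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + h
        pos = f(arr)
        arr[idx] = old - h
        neg = f(arr)
        arr[idx] = old  # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad


def rel_error(a, b):
    """Largest absolute difference, scaled by the magnitude of the inputs."""
    return np.max(np.abs(a - b)) / max(1e-8, np.max(np.abs(a)) + np.max(np.abs(b)))


rng = np.random.default_rng(1)
x = 5.0 * rng.normal(size=(4, 5)) + 12.0
gamma = rng.normal(size=5)
beta = rng.normal(size=5)
dout = rng.normal(size=(4, 5))

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, cache)

err_dx = rel_error(dx, numerical_grad(lambda a: bn_forward(a, gamma, beta)[0], x, dout))
err_dgamma = rel_error(dgamma, numerical_grad(lambda g: bn_forward(x, g, beta)[0], gamma, dout))
err_dbeta = rel_error(dbeta, numerical_grad(lambda b: bn_forward(x, gamma, b)[0], beta, dout))
print(err_dx, err_dgamma, err_dbeta)
```

If the analytic formulas are right, all three relative errors should be very small; a large error in `dx` alone usually points to a mistake in the mean or variance path.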

Adding BN to the Fully Connected Net

Initialization

# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# First hidden layer
self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dims[0])
self.params['b1'] = np.zeros(hidden_dims[0])
# The forward pass reads gamma/beta for both batchnorm and layernorm,
# so create the scale and shift parameters in either case.
if self.normalization in ("batchnorm", "layernorm"):
    self.params['gamma1'] = np.ones(hidden_dims[0])
    self.params['beta1'] = np.zeros(hidden_dims[0])
# Remaining hidden layers
for i in range(1, len(hidden_dims)):
    self.params['W' + str(i + 1)] = weight_scale * np.random.randn(hidden_dims[i - 1], hidden_dims[i])
    self.params['b' + str(i + 1)] = np.zeros(hidden_dims[i])
    if self.normalization in ("batchnorm", "layernorm"):
        self.params['gamma' + str(i + 1)] = np.ones(hidden_dims[i])
        self.params['beta' + str(i + 1)] = np.zeros(hidden_dims[i])
# Output layer (no normalization after it)
self.params['W' + str(len(hidden_dims) + 1)] = weight_scale * np.random.randn(hidden_dims[-1], num_classes)
self.params['b' + str(len(hidden_dims) + 1)] = np.zeros(num_classes)

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

Computing the scores

        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        aff_outs=[]
        bn_outs=[]
        relu_outs=[]
        aff_caches=[]
        bn_caches=[]
        relu_caches=[]
        for i in range(self.num_layers-1):
          # affine forward
          aff_out, aff_cache=None, None
          if i==0:
            aff_out, aff_cache=affine_forward(X,self.params['W1'],self.params['b1'])
          else:
            # No dropout at this stage, so the previous layer's ReLU output is the input
            aff_out, aff_cache=affine_forward(relu_outs[-1],self.params['W'+str(i+1)],self.params['b'+str(i+1)])
          aff_outs.append(aff_out)
          aff_caches.append(aff_cache)
          # BN forward
          if self.normalization=="batchnorm":
            if i!=self.num_layers-1:
              tgamma=self.params['gamma'+str(i+1)]
              tbeta=self.params['beta'+str(i+1)]
              bnp=self.bn_params[i]
              bn_out, bn_cache=batchnorm_forward(aff_out,tgamma,tbeta,bnp)
              bn_outs.append(bn_out)
              bn_caches.append(bn_cache)
            else:
              bn_out=aff_out
          elif self.normalization=="layernorm":
            if i!=self.num_layers-1:
              tgamma=self.params['gamma'+str(i+1)]
              tbeta=self.params['beta'+str(i+1)]
              bnp=self.bn_params[i]
              bn_out, bn_cache=layernorm_forward(aff_out,tgamma,tbeta,bnp)
              bn_outs.append(bn_out)
              bn_caches.append(bn_cache)
            else:
              bn_out=aff_out
          else:
            bn_out=aff_out
          # ReLU forward
          relu_out, relu_cache=relu_forward(bn_out)
          relu_outs.append(relu_out)
          relu_caches.append(relu_cache)
          pass
        i=self.num_layers
        aff_out, aff_cache=affine_forward(relu_outs[-1],self.params['W'+str(i)],self.params['b'+str(i)])
        aff_outs.append(aff_out)
        aff_caches.append(aff_cache)
        scores=aff_outs[-1]

        pass

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

Computing the gradients

        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # print(self.num_layers, len(drop_caches),len(relu_caches),len(bn_caches),len(aff_caches))
        loss, grad=softmax_loss(scores,y)
        i=self.num_layers
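        grad,grads['W'+str(i)],grads['b'+str(i)]=affine_backward(grad,aff_caches[i-1])
        # The loss regularizes every W, so the output layer needs the L2 gradient term too
        grads['W'+str(i)]+=self.reg*self.params['W'+str(i)]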
        for i in range(self.num_layers):
          loss+=0.5*self.reg*np.sum(np.square(self.params['W'+str(i+1)]))
        # backprop
        for i in range(self.num_layers-1,0,-1):
          grad=relu_backward(grad, relu_caches[i-1])
          if self.normalization=="batchnorm":
            if i!=self.num_layers:
              grad,grads['gamma'+str(i)],grads['beta'+str(i)]=batchnorm_backward_alt(grad,bn_caches[i-1])
          elif self.normalization=="layernorm":
            if i!=self.num_layers:
              grad,grads['gamma'+str(i)],grads['beta'+str(i)]=layernorm_backward(grad,bn_caches[i-1])
          grad,grads['W'+str(i)],grads['b'+str(i)]=affine_backward(grad,aff_caches[i-1])
          grads['W'+str(i)]+=self.reg*self.params['W'+str(i)]


        pass

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

Dropout

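Dropout is a simple yet extremely effective regularization method that complements L1 regularization, L2 regularization, and max-norm constraints. During training, dropout keeps each neuron active with probability p (a hyperparameter) and sets it to zero otherwise. The course assignment shows that Dropout is effective against overfitting. With the Dropout layers added, the complete loss function is: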

    def loss(self, X, y=None):
        """Compute loss and gradient for the fully connected net.
        
        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
            scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
            names to gradients of the loss with respect to those parameters.
        """
        X = X.astype(self.dtype)
        mode = "test" if y is None else "train"

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param["mode"] = mode
        if self.normalization == "batchnorm":
            for bn_param in self.bn_params:
                bn_param["mode"] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        aff_outs=[]
        bn_outs=[]
        relu_outs=[]
        drop_outs=[]
        aff_caches=[]
        bn_caches=[]
        relu_caches=[]
        drop_caches=[]
        for i in range(self.num_layers-1):
          # affine forward
          aff_out, aff_cache=None, None
          if i==0:
            aff_out, aff_cache=affine_forward(X,self.params['W1'],self.params['b1'])
          else:
            aff_out, aff_cache=affine_forward(drop_outs[-1],self.params['W'+str(i+1)],self.params['b'+str(i+1)])
          aff_outs.append(aff_out)
          aff_caches.append(aff_cache)
          # BN forward
          if self.normalization=="batchnorm":
            if i!=self.num_layers-1:
              tgamma=self.params['gamma'+str(i+1)]
              tbeta=self.params['beta'+str(i+1)]
              bnp=self.bn_params[i]
              bn_out, bn_cache=batchnorm_forward(aff_out,tgamma,tbeta,bnp)
              bn_outs.append(bn_out)
              bn_caches.append(bn_cache)
            else:
              bn_out=aff_out
          elif self.normalization=="layernorm":
            if i!=self.num_layers-1:
              tgamma=self.params['gamma'+str(i+1)]
              tbeta=self.params['beta'+str(i+1)]
              bnp=self.bn_params[i]
              bn_out, bn_cache=layernorm_forward(aff_out,tgamma,tbeta,bnp)
              bn_outs.append(bn_out)
              bn_caches.append(bn_cache)
            else:
              bn_out=aff_out
          else:
            bn_out=aff_out
          # ReLU forward
          relu_out, relu_cache=relu_forward(bn_out)
          relu_outs.append(relu_out)
          relu_caches.append(relu_cache)
          if self.use_dropout:
            drop_out, drop_cache=dropout_forward(relu_out,self.dropout_param)
            drop_caches.append(drop_cache)
          else:
            drop_out=relu_out
          drop_outs.append(drop_out)
          pass
        i=self.num_layers
        aff_out, aff_cache=affine_forward(drop_outs[-1],self.params['W'+str(i)],self.params['b'+str(i)])
        aff_outs.append(aff_out)
        aff_caches.append(aff_cache)
        scores=aff_outs[-1]

        pass

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early.
        if mode == "test":
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # print(self.num_layers, len(drop_caches),len(relu_caches),len(bn_caches),len(aff_caches))
        loss, grad=softmax_loss(scores,y)
        i=self.num_layers
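        grad,grads['W'+str(i)],grads['b'+str(i)]=affine_backward(grad,aff_caches[i-1])
        # The loss regularizes every W, so the output layer needs the L2 gradient term too
        grads['W'+str(i)]+=self.reg*self.params['W'+str(i)]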
        for i in range(self.num_layers):
          loss+=0.5*self.reg*np.sum(np.square(self.params['W'+str(i+1)]))
        # backprop
        for i in range(self.num_layers-1,0,-1):
          if self.use_dropout:
            grad=dropout_backward(grad,drop_caches[i-1])
          grad=relu_backward(grad, relu_caches[i-1])
          if self.normalization=="batchnorm":
            if i!=self.num_layers:
              grad,grads['gamma'+str(i)],grads['beta'+str(i)]=batchnorm_backward_alt(grad,bn_caches[i-1])
          elif self.normalization=="layernorm":
            if i!=self.num_layers:
              grad,grads['gamma'+str(i)],grads['beta'+str(i)]=layernorm_backward(grad,bn_caches[i-1])
          grad,grads['W'+str(i)],grads['b'+str(i)]=affine_backward(grad,aff_caches[i-1])
          grads['W'+str(i)]+=self.reg*self.params['W'+str(i)]


        pass

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
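The loss function above calls `dropout_forward` and `dropout_backward`, which this post does not list. Below is a minimal sketch of what they could look like, assuming the "inverted dropout" variant in which `p` is the probability of keeping a neuron and the surviving activations are divided by `p` during training, so that nothing has to be rescaled at test time. The optional `rng` argument and the `"seed"` key are assumptions made for reproducibility in this example; this is not necessarily the author's implementation.

```python
import numpy as np


def dropout_forward(x, dropout_param, rng=None):
    """Inverted dropout: keep each unit with probability p and rescale by 1/p."""
    p, mode = dropout_param["p"], dropout_param["mode"]
    mask = None
    if mode == "train":
        if rng is None:
            rng = np.random.default_rng(dropout_param.get("seed"))
        # Entries are 1/p where the unit is kept and 0 where it is dropped
        mask = (rng.random(x.shape) < p) / p
        out = x * mask
    elif mode == "test":
        out = x  # nothing to do at test time thanks to the 1/p scaling
    else:
        raise ValueError(f"Invalid dropout mode: {mode!r}")
    return out, (dropout_param, mask)


def dropout_backward(dout, cache):
    """Gradients flow only through the units that were kept."""
    dropout_param, mask = cache
    if dropout_param["mode"] == "train":
        return dout * mask
    return dout


rng = np.random.default_rng(0)
x = rng.normal(size=(500, 500)) + 10.0

out_train, cache = dropout_forward(x, {"mode": "train", "p": 0.75}, rng=rng)
out_test, _ = dropout_forward(x, {"mode": "test", "p": 0.75})
dx = dropout_backward(np.ones_like(x), cache)

frac_dropped = np.mean(out_train == 0)
print(x.mean(), out_train.mean(), out_test.mean(), frac_dropped)
```

The printed means should be close to each other, which is the point of the `1/p` scaling: the expected activation is the same in training and at test time, while roughly a fraction `1 - p` of the units is zeroed in each training pass.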
