batch_normalization/notes/batch_normalization.md
## 1. Reduce internal covariate shift via mini-batch statistics
One way to reduce the ill effects of the internal covariate shift within a neural network is to normalize the layer inputs. This operation not only forces the inputs to follow the same distribution but also whitens each of them. The method is motivated by studies [2] showing that network training converges faster when its inputs are whitened; as a consequence, enforcing the whitening of each layer's inputs is a desirable property for the network.
However, the full whitening of each layer’s inputs is costly and not fully differentiable. Batch normalization overcomes this issue by considering two assumptions:
#### Fully connected layers
The implementation for fully connected layers is fairly simple. We just need to compute the mean and the variance of each batch and then scale and shift the feature map with the gamma and beta parameters presented earlier.
During the backward pass, we will use backpropagation in order to update these two parameters.
```python
mean = torch.mean(X, axis=0)                       # per-feature mean over the batch
variance = torch.mean((X - mean)**2, axis=0)       # per-feature (biased) variance
X_hat = (X - mean) / torch.sqrt(variance + eps)    # normalize (eps avoids division by zero)
out = gamma * X_hat + beta                         # scale and shift with the learned parameters
```

#### Convolutional layers
The implementation for convolutional layers is almost the same as before. We just need to perform some reshaping in order to adapt to the shape of the input that we get from the previous layer: the statistics are now computed per channel, i.e. over the batch and spatial dimensions.
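As a rough sketch (reusing `X`, `gamma`, `beta` and `eps` from the fully connected snippet above, and assuming an input of shape `(N, C, H, W)`), the per-channel statistics and the broadcasting of the parameters could look like this:

```python
# Sketch only: X is assumed to have shape (N, C, H, W).
# Mean and variance are computed per channel, over the batch and spatial dimensions.
mean = torch.mean(X, dim=(0, 2, 3), keepdim=True)                # shape (1, C, 1, 1)
variance = torch.mean((X - mean)**2, dim=(0, 2, 3), keepdim=True)
X_hat = (X - mean) / torch.sqrt(variance + eps)
# gamma and beta hold one value per channel; reshaping lets them broadcast over N, H and W.
out = gamma.reshape(1, -1, 1, 1) * X_hat + beta.reshape(1, -1, 1, 1)
```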
In PyTorch, the backpropagation is very easy to handle. One important thing here is to specify that our gamma and beta are parameters, so that they get updated during the backward phase.
To do so, we will declare them as `nn.Parameter()` in our layer and we will initialize them with random values.
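For instance, a minimal sketch of this declaration inside the layer's `__init__` (the `num_features` argument is an assumption on my part):

```python
# Sketch: nn.Parameter registers gamma and beta with the module, so autograd
# computes their gradients and the optimizer updates them at each step.
self.gamma = nn.Parameter(torch.rand(num_features))   # learnable scale
self.beta = nn.Parameter(torch.rand(num_features))    # learnable shift
```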
### During inference
This moving average is stored in a global variable that is updated during the training phase.
In order to store this moving average in our layer during training, we can use buffers. We initialize these buffers when we instantiate our layer with PyTorch's `register_buffer()` method.
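A possible sketch (the buffer names and the `momentum` hyperparameter are assumptions on my part):

```python
# In __init__ (sketch): buffers are saved in the module's state_dict but are not
# parameters, so the optimizer never updates them directly.
self.register_buffer("running_mean", torch.zeros(num_features))
self.register_buffer("running_var", torch.ones(num_features))

# In forward(), during training only (sketch): exponential moving average of the
# batch statistics, later reused at inference time.
self.running_mean = (1 - momentum) * self.running_mean + momentum * mean
self.running_var = (1 - momentum) * self.running_var + momentum * variance
```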
### Final module
The final module is then composed of all the blocks that we described earlier. We add a condition on the shape of the input data in order to know whether we are dealing with a fully connected layer or a convolutional layer.
One important thing to notice here is that we only need to implement the `forward()` method. As our class inherits from `nn.Module`, PyTorch's autograd takes care of the backward pass automatically (thank you PyTorch ❤️).
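Putting the pieces together, here is one possible sketch of such a module (the class name, default values and some implementation details are my own assumptions, not necessarily the original code):

```python
import torch
import torch.nn as nn


class CustomBatchNorm(nn.Module):
    """Sketch of a batch normalization layer handling both FC and convolutional inputs."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        # Learnable scale and shift.
        self.gamma = nn.Parameter(torch.rand(num_features))
        self.beta = nn.Parameter(torch.rand(num_features))
        # Running statistics used at inference time (buffers, not parameters).
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, X):
        # Condition on the input shape: 2D -> fully connected, 4D -> convolutional.
        if X.dim() == 2:
            dims, shape = (0,), (1, -1)
        else:
            dims, shape = (0, 2, 3), (1, -1, 1, 1)

        if self.training:
            mean = X.mean(dim=dims)
            variance = X.var(dim=dims, unbiased=False)
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * variance
        else:
            mean, variance = self.running_mean, self.running_var

        X_hat = (X - mean.reshape(shape)) / torch.sqrt(variance.reshape(shape) + self.eps)
        # Only forward() is implemented: autograd derives the backward pass on its own.
        return self.gamma.reshape(shape) * X_hat + self.beta.reshape(shape)


# Quick usage check of the sketch:
bn = CustomBatchNorm(64)
y = bn(torch.randn(8, 64, 32, 32))   # convolutional input (N, C, H, W)
z = bn(torch.randn(8, 64))           # fully connected input (N, C)
```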