diff --git a/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md b/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md
index e91b2883..b14cc295 100644
--- a/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md
+++ b/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md
@@ -103,14 +103,14 @@ Some platforms are specifically developed for deep learning research and applica
 ### 3.1.5 Why is deep neural network difficult to train?

-1. Gradient Gradient
+1. The Vanishing Gradient Problem:
 The disappearance of the gradient means that the gradient will become smaller and smaller as seen from the back and the front through the hidden layer, indicating that the learning of the front layer will be significantly slower than the learning of the latter layer, so the learning will get stuck unless the gradient becomes larger.

 The reason for the disappearance of the gradient is affected by many factors, such as the size of the learning rate, the initialization of the network parameters, and the edge effect of the activation function.

 In the deep neural network, the gradient calculated by each neuron is passed to the previous layer, and the gradient received by the shallower neurons is affected by all previous layer gradients. If the calculated gradient value is very small, as the number of layers increases, the obtained gradient update information will decay exponentially, and the gradient disappears. The figure below shows the learning rate of different hidden layers:
 ![](img/ch3/3-8.png)

-2. Exploding Gradient
+2. Exploding Gradients:
 In a network structure such as a deep network or a Recurrent Neural Network (RNN), gradients can accumulate in the process of network update, becoming a very large gradient, resulting in a large update of the network weight value, making the network unstable; In extreme cases, the weight value will even overflow and become a $NaN$ value, which cannot be updated anymore.

 3. Degeneration of the weight matrix results in a reduction in the effective degrees of freedom of the model. The degradation rate of learning in the parameter space is slowed down, which leads to the reduction of the effective dimension of the model. The available degrees of freedom of the network contribute to the gradient norm in learning. As the number of multiplication matrices (ie, network depth) increases, The product of the matrix becomes more and more degraded. In nonlinear networks with hard saturated boundaries (such as ReLU networks), as the depth increases, the degradation process becomes faster and faster. The visualization of this degradation process is shown in a 2014 paper by Duvenaud et al: