I'm studying neural networks and I have some questions about the theory of gradient descent.

  1. Why is the fluctuation (variance of the updates) of batch gradient descent smaller than that of SGD?
  2. Why does SGD avoid local minima better than batch gradient descent?
  3. Batch gradient descent surveys all the data and then makes an update. What does that mean?

Best Answer


Everything comes down to the trade-off between exploitation and exploration.

  1. Gradient descent uses all the data to compute each update, which gives a more accurate (lower-variance) gradient and therefore a smoother trajectory. In neural networks, (mini-)batch gradient descent is used in practice because full-batch gradient descent is not feasible on large datasets. Stochastic gradient descent instead uses a single example per update, which adds noise. With BGD and GD you exploit more data per update, so the updates fluctuate less (see the sketch after this list).

  2. With SGD you can escape local minima because using a single example per update adds noise, which favors exploration: the noisy updates can reach solutions that BGD could not. With SGD you explore more.

  3. BGD takes the dataset and splits it into N chunks of B samples each (B is the batch_size). That forces each epoch of updates to pass through all the data in the dataset.
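
Here is a minimal sketch (not from the original answer) comparing the three update rules on a toy linear-regression loss. All names (`X`, `y`, `grad`, `lr`, `batch_size`) are illustrative assumptions, and the hyperparameters are arbitrary:

```python
# Sketch: full-batch GD vs. SGD vs. mini-batch GD on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy dataset: 1000 samples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    """Gradient of the mean squared error on the samples (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

lr, batch_size = 0.1, 32
w_gd = w_sgd = w_mb = np.zeros(5)

for epoch in range(10):
    # 1. Full-batch GD: one update per epoch computed from ALL samples,
    #    so the gradient has low variance and the trajectory fluctuates little.
    w_gd = w_gd - lr * grad(w_gd, X, y)

    # 2. SGD: one update from a single random sample, so the gradient is noisy;
    #    that noise is what can push the iterate out of shallow local minima.
    i = rng.integers(len(y))
    w_sgd = w_sgd - lr * grad(w_sgd, X[i:i+1], y[i:i+1])

    # 3. Mini-batch: split the data into chunks of `batch_size` samples and do
    #    one update per chunk, so each epoch still passes through the whole dataset.
    perm = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        w_mb = w_mb - lr * grad(w_mb, X[idx], y[idx])

print("full-batch:", np.round(w_gd, 2))
print("sgd:       ", np.round(w_sgd, 2))
print("mini-batch:", np.round(w_mb, 2))
```

Running this, the full-batch and mini-batch estimates move smoothly toward `true_w`, while the SGD estimate jumps around much more from update to update, which is exactly the fluctuation and the exploration the answer describes.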