Overview
In stochastic gradient descent, each update step uses the gradient of the loss computed on a single data point (or, more generally, a small subset of the full training set):
{% \theta_{k+1} = \theta_k - \nu \nabla J_i(\theta_k) %}
where {% J(\theta_k) %} is the loss computed over the entire training set,
{% J_i(\theta_k) %} is the loss computed on the {% i^{th} %} data point, and {% \nu %} is the step size (learning rate).
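As a concrete illustration, here is a minimal Python sketch of this update, assuming (for illustration only) a squared-error loss {% J_i(\theta) = (x_i^T \theta - y_i)^2 %}; the loss function, the name `sgd`, and its parameters are not taken from the text above.

```python
import numpy as np

def sgd(X, y, theta, nu=0.01, epochs=10):
    """Sketch of SGD under an assumed squared-error per-point loss."""
    n = len(y)
    for _ in range(epochs):
        for i in np.random.permutation(n):      # visit points in random order
            # Gradient of the i-th point's loss: 2 * (x_i . theta - y_i) * x_i
            grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]
            # theta_{k+1} = theta_k - nu * grad J_i(theta_k)
            theta = theta - nu * grad_i
    return theta
```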
When the loss function is designed so that the total loss is the sum of the individual losses, the gradient of the total loss is the sum of the gradients of the individual losses:
{% \nabla J(\theta_k) = \sum_i \nabla J_i(\theta_k) %}
Consequently, with a small step size, one full pass of stochastic gradient descent over the training set is nearly equivalent to a single iteration of ordinary gradient descent, since the individual updates add up to approximately one step along the full gradient.
However, because each update uses only a single data point, the updates are noisy, and this randomness can help the algorithm escape shallow local minima.
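A quick numerical check of the sum-of-gradients identity, under the same assumed squared-error loss as the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
theta = rng.normal(size=3)

# Gradient of the total loss sum_i (x_i . theta - y_i)^2 in closed form
grad_total = 2.0 * X.T @ (X @ theta - y)
# Sum of the per-point gradients
grad_sum = sum(2.0 * (X[i] @ theta - y[i]) * X[i] for i in range(50))

print(np.allclose(grad_total, grad_sum))  # True
```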
Mini Batch
The mini-batch algorithm is an intermediate between full gradient descent and stochastic gradient descent. It splits the training data into batches (small collections of data points) and performs a gradient descent step on each batch in turn.
Each mini-batch is smaller than the full training set. Sometimes the batches are created as a partition of the training set (that is, they are disjoint and their union equals the training set); sometimes a batch is just a random selection of points from the training set (see re-sampling).
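Below is a sketch of the shuffle-and-partition variant, again assuming the squared-error loss used in the earlier sketches; the batch size and other parameters are placeholders.

```python
import numpy as np

def minibatch_gd(X, y, theta, nu=0.01, batch_size=8, epochs=10):
    """Sketch of mini-batch gradient descent under an assumed squared-error loss."""
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle, then partition
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Average gradient over the batch (a common convention)
            grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)
            theta = theta - nu * grad
    return theta
```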