Overview
Gradient descent with momentum maintains an exponentially weighted moving average of past gradients, and uses that average, rather than the raw gradient, to update the parameters at each iteration.
{% \nu_t = \beta \nu_{t-1} - \alpha \nabla J(\theta_t) %}
{% \theta_{t+1} = \theta_t + \nu_t %}
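The two update rules above can be sketched as follows. This is a minimal illustration, not a reference implementation; the function and parameter names (`momentum_gd`, `grad`, `theta0`) are chosen for this example, and the default hyperparameters are common but arbitrary choices.

```python
import numpy as np

def momentum_gd(grad, theta0, alpha=0.1, beta=0.9, steps=100):
    """Gradient descent with momentum.

    nu_t     = beta * nu_{t-1} - alpha * grad(theta_t)
    theta_{t+1} = theta_t + nu_t
    """
    theta = np.asarray(theta0, dtype=float)
    nu = np.zeros_like(theta)  # velocity starts at zero
    for _ in range(steps):
        nu = beta * nu - alpha * grad(theta)  # accumulate a decaying average of gradients
        theta = theta + nu                    # step in the direction of the velocity
    return theta

# Example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta_star = momentum_gd(lambda t: 2 * t, np.array([5.0]), alpha=0.05, beta=0.9, steps=200)
```

With `beta = 0` this reduces to plain gradient descent; larger `beta` retains more of the previous velocity, which damps oscillations across steep directions and accelerates progress along consistent ones.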