Overview
The gradient used for the update is an exponentially weighted moving average of the computed gradients, with decay rate {% \beta_1 %}:
{% \vec{m}_t = \beta_1 \vec{m}_{t-1} + (1-\beta_1)\vec{g}_t %}
An exponentially weighted normalization factor, accumulated from the squared gradient norm with decay rate {% \beta_2 %}, is also maintained:
{% v_t = \beta_2 v_{t-1} + (1-\beta_2)\|\vec{g}_t\|^2 %}
The parameters are then updated with the smoothed gradient, scaled by the square root of the normalization factor (with a small {% \epsilon %} for numerical stability):
{% \vec{\theta}_{t+1} = \vec{\theta}_t - \frac{\alpha \vec{m}_t}{\sqrt{v_t} + \epsilon} %}
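The three equations above can be sketched in NumPy as a single update step. This is a minimal illustration, not a reference implementation; the function name `ewa_step` and the default hyperparameter values ({% \alpha = 0.001 %}, {% \beta_1 = 0.9 %}, {% \beta_2 = 0.999 %}, {% \epsilon = 10^{-8} %}) are assumptions for the example. Note that, per the equations, {% v_t %} here is a scalar built from the squared gradient norm, not a per-coordinate second moment.

```python
import numpy as np

def ewa_step(theta, g, m, v, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update step following the equations above.

    theta: parameter vector, g: current gradient,
    m: exponentially weighted gradient (vector),
    v: exponentially weighted squared gradient norm (scalar).
    Hypothetical helper name and defaults; returns the new state.
    """
    # m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
    m = beta1 * m + (1 - beta1) * g
    # v_t = beta2 * v_{t-1} + (1 - beta2) * ||g_t||^2
    v = beta2 * v + (1 - beta2) * np.dot(g, g)
    # theta_{t+1} = theta_t - alpha * m_t / (sqrt(v_t) + eps)
    theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta, m, v
```

Because {% v_t %} is a scalar, every coordinate is rescaled by the same factor; the per-coordinate direction comes entirely from {% \vec{m}_t %}.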