Simple Recurrent Neural Networks
Overview
Network Architecture
A simple recurrent neural network can be described as
{% \vec{h}_t = \sigma(U \vec{h}_{t-1} + W \vec{s}_t + \vec{b}) %}
where {% \vec{s}_t %} is the input vector at time t and {% \vec{h}_t %}
is the output (hidden) vector, which is also fed back into the network at the next time step.
Here, U and W are two different weight matrices, {% \vec{b} %} is a bias vector, and {% \sigma %} is an element-wise activation function, commonly tanh or the logistic sigmoid.
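To make the recurrence concrete, the sketch below implements one step of it in NumPy. The layer sizes, the random initialization, and the choice of tanh as {% \sigma %} are assumptions made only for this example.

```python
import numpy as np

input_size, hidden_size = 4, 8
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
W = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input weights
b = np.zeros(hidden_size)                                    # bias vector

def rnn_step(h_prev, s_t):
    """One step of the recurrence: h_t = sigma(U h_{t-1} + W s_t + b)."""
    return np.tanh(U @ h_prev + W @ s_t + b)

# Run a short input sequence through the network, feeding each output
# back in as the previous hidden state.
h = np.zeros(hidden_size)
for s_t in rng.normal(size=(5, input_size)):
    h = rnn_step(h, s_t)
```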
Training
Training of a simple RNN uses the Backpropagation Through Time algorithm, a variant of the
Backpropagation algorithm. The network is "unrolled" across the input sequence and ordinary backpropagation
is applied to the unrolled graph. That is, a simple one-layer network effectively becomes an n-layer network
by the {% n^{th} %} input.
Because of the way a recurrent neural network is constructed, each additional sequential input adds another
layer of computation to the backpropagation, which increases computation time and also gives rise to
the vanishing/exploding gradient problem discussed below.
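The sketch below illustrates this unrolling: the forward pass stores one hidden state per input, and the backward pass then walks back through every stored state, so the work grows with the sequence length. The tanh activation and the toy loss {% L = \tfrac{1}{2}\|\vec{h}_T\|^2 %} on the final hidden state are assumptions chosen only to keep the example short.

```python
import numpy as np

def bptt(U, W, b, inputs):
    """Unroll the RNN over `inputs`, then backpropagate through every step."""
    hidden_size = b.shape[0]
    h = np.zeros(hidden_size)
    states = [h]                              # h_0, h_1, ..., h_T
    for s_t in inputs:                        # forward pass: one "layer" per input
        h = np.tanh(U @ h + W @ s_t + b)
        states.append(h)

    dU, dW, db = np.zeros_like(U), np.zeros_like(W), np.zeros_like(b)
    grad_h = states[-1]                       # dL/dh_T for the toy loss L = 0.5 * ||h_T||^2
    for t in reversed(range(len(inputs))):
        h_t, h_prev, s_t = states[t + 1], states[t], inputs[t]
        grad_a = grad_h * (1.0 - h_t ** 2)    # backprop through tanh
        dU += np.outer(grad_a, h_prev)
        dW += np.outer(grad_a, s_t)
        db += grad_a
        grad_h = U.T @ grad_a                 # pass the gradient back one time step
    return dU, dW, db

# Example usage with small random parameters (the sizes are arbitrary).
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(8, 8))
W = rng.normal(scale=0.1, size=(8, 4))
b = np.zeros(8)
dU, dW, db = bptt(U, W, b, rng.normal(size=(5, 4)))
```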
Vanishing/Exploding Gradients
The Chain Rule, when applied to a deep neural network, expresses the desired gradient as the product
of several factors.
{% grad_1 \times grad_2 \times \cdots \times grad_n %}
A recurrent network's gradient includes one such factor for each prior time step;
that is, the gradient for the {% n^{th} %} input will contain at least n factors.
As n gets large, if the factors are less than 1, the computed gradient will tend toward 0 (the vanishing gradient).
If the factors are greater than 1, the computed gradient can explode.
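A quick numeric check makes this visible. The per-step factors 0.9 and 1.1 below are arbitrary stand-ins for the gradient terms; real networks have varying factors, but the same multiplicative effect applies.

```python
# Toy illustration of grad_1 * grad_2 * ... * grad_n when every factor is
# the same constant; the values 0.9 and 1.1 are arbitrary choices.
for factor in (0.9, 1.1):
    for n in (10, 100, 500):
        print(f"factor={factor}, n={n}: product={factor ** n:.3e}")
# factor = 0.9 -> the product shrinks toward 0 (vanishing gradient)
# factor = 1.1 -> the product grows without bound (exploding gradient)
```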