Simple Recurrent Neural Networks

Overview


Network Architecture


A simple recurrent neural network can be described as
{% \vec{h}_t = \sigma(U \vec{h}_{t-1} + W \vec{s}_t + \vec{b}) %}
where {% \vec{s}_t %} is the input vector at time t, {% \vec{h}_t %} is the output (hidden state) vector, which is also fed back into the network at the next step, and {% \sigma %} is an element-wise nonlinearity such as tanh.

Here, U and W are two different weight matrices: U is applied to the previous hidden state and W to the current input (see Salem).
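
As a concrete sketch, the recurrence can be written in a few lines of NumPy. The dimensions, the random initialization, and the choice of tanh for {% \sigma %} are illustrative assumptions made for this example, not prescribed by the text.

    import numpy as np

    input_dim, hidden_dim = 4, 8
    rng = np.random.default_rng(0)

    U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # recurrent weights (previous hidden state)
    W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input weights
    b = np.zeros(hidden_dim)                                   # bias vector

    def step(h_prev, s_t):
        # One step of the recurrence: h_t = sigma(U h_{t-1} + W s_t + b)
        return np.tanh(U @ h_prev + W @ s_t + b)

    # Run the recurrence over a short random input sequence.
    h = np.zeros(hidden_dim)
    sequence = rng.normal(size=(5, input_dim))
    for s_t in sequence:
        h = step(h, s_t)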

Training


A simple RNN is trained with the Backpropagation Through Time (BPTT) algorithm, a variant of standard backpropagation. BPTT "unrolls" the network across the input sequence and then applies ordinary backpropagation to the unrolled computation. That is, a simple one-layer network behaves like an n-layer network when processing the {% n^{th} %} input.

Because of the way a recurrent neural network is constructed, each additional input in the sequence adds another layer of computation to the backward pass. This increases computation time and also gives rise to the vanishing/exploding gradient problem discussed below.
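
The following continues the NumPy sketch from the Network Architecture section and unrolls it for a backward pass. The squared-error loss at the final step is purely an illustrative assumption; the point is that every earlier time step contributes one more "layer" to the gradient computation.

    states = [np.zeros(hidden_dim)]            # h_0
    for s_t in sequence:
        states.append(step(states[-1], s_t))   # forward pass, storing every h_t

    target = np.ones(hidden_dim)               # assumed target for the last output
    grad_h = 2.0 * (states[-1] - target)       # dL/dh_n for the squared-error loss

    grad_U = np.zeros_like(U)
    for t in range(len(sequence), 0, -1):      # walk backward through the unrolled layers
        pre_act_grad = grad_h * (1.0 - states[t] ** 2)   # tanh'(a) = 1 - tanh(a)^2
        grad_U += np.outer(pre_act_grad, states[t - 1])  # accumulate dL/dU from this step
        grad_h = U.T @ pre_act_grad                      # pass the gradient back to h_{t-1}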

Vanishing/Exploding Gradients


When the chain rule is applied to a deep neural network, the desired gradient is expressed as a product of several factors:
{% \mathrm{grad}_1 \times \mathrm{grad}_2 \times \cdots \times \mathrm{grad}_n %}
In a recurrent network, the gradient picks up one such factor for each prior time step (that is, the gradient at the {% n^{th} %} input involves a product of at least n factors).
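
Written out for the recurrence defined above (a sketch of the standard derivation, reusing the {% U %}, {% W %}, {% \vec{b} %}, and {% \sigma %} from the Network Architecture section), each factor is the Jacobian of one step with respect to the previous hidden state:
{% \frac{\partial \vec{h}_n}{\partial \vec{h}_k} = \prod_{t=k+1}^{n} \frac{\partial \vec{h}_t}{\partial \vec{h}_{t-1}} = \prod_{t=k+1}^{n} \mathrm{diag}\left(\sigma'(U \vec{h}_{t-1} + W \vec{s}_t + \vec{b})\right) U %}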

As n grows, if the factors are mostly smaller than 1 in magnitude, the computed gradient shrinks toward 0 (vanishes); if they are mostly larger than 1, the computed gradient can explode.
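
A tiny numeric illustration (constructed for this example, not taken from the text): treating each per-step factor as a scalar slightly below or slightly above 1 shows how quickly the product vanishes or explodes as n grows.

    # Repeatedly multiplying factors < 1 drives the product toward 0;
    # factors > 1 make it blow up.
    for factor in (0.9, 1.1):
        for n in (10, 50, 100):
            print(f"factor={factor}, n={n}: product={factor ** n:.3e}")
    # factor=0.9, n=100 gives ~2.7e-05 (vanishing);
    # factor=1.1, n=100 gives ~1.4e+04 (exploding).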