Backpropagation
Overview
Setup
For this example, a neural network is envisioned as in the following diagram.
{%
\begin{bmatrix}
\\
I_1 \\
\\
\end{bmatrix}
\times
\begin{bmatrix}
& & \\
& W_1 & \\
& & \\
\end{bmatrix}
=
\begin{bmatrix}
\\
A_1 \\
\\
\end{bmatrix}
\circ f_1
\rightarrow
\begin{bmatrix}
\\
I_2 \\
\\
\end{bmatrix}
\times
\begin{bmatrix}
& & \\
& W_2 & \\
& & \\
\end{bmatrix}
=
\begin{bmatrix}
\\
A_2 \\
\\
\end{bmatrix}
\circ f_2
\rightarrow
%}
Each layer has a row vector of inputs, labeled {% I_i %}.
That vector is then multiplied by a weight matrix, {% W_i %}.
The result of the multiplication is labeled {% A_i %}; it is a row vector that becomes the input to that
layer's activation function (labeled {% f_i %} in the diagram). The result of the activation function then becomes
the input, {% I_{i+1} %}, to the next layer.
{% A_i = I_i \times W_i %}
The output of the ith layer is defined as
{% O_i = f_i(A_i) %}
Then, for a two-layer network, the final output can be written as
{% O = f_2(f_1(I \times W_1) \times W_2) %}
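As a concrete illustration, here is a minimal NumPy sketch of this forward pass. The layer sizes, the random initialization, and the choice of the sigmoid function for {% f_1 %} and {% f_2 %} are assumptions made purely for the example.

```python
import numpy as np

def sigmoid(x):
    # One possible choice for the activation functions f_i.
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
I  = rng.normal(size=(1, 3))   # row vector of inputs
W1 = rng.normal(size=(3, 4))   # first layer's weight matrix
W2 = rng.normal(size=(4, 2))   # second layer's weight matrix

A1 = I @ W1          # A_1 = I x W_1
O1 = sigmoid(A1)     # O_1 = f_1(A_1), which becomes the next layer's input I_2
A2 = O1 @ W2         # A_2 = O_1 x W_2
O  = sigmoid(A2)     # O = f_2(f_1(I x W_1) x W_2)
```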
Loss Function
In order to train the network, we need to define a smooth loss function to be minimized.
{% Loss = E(f_2(f_1(I \times W_1) \times W_2)) %}
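The error function {% E %} is left abstract above. One common concrete choice, used in the sketch below purely for illustration, is the squared error against a target row vector {% T %} (an assumed quantity, not something defined earlier).

```python
# Continuing the forward-pass sketch above.
T = np.array([[1.0, 0.0]])  # assumed target values, for illustration only

def squared_error(O, T):
    # E(O) = 1/2 * sum((O - T)^2); the factor of 1/2 keeps the gradient tidy.
    return 0.5 * np.sum((O - T) ** 2)

loss = squared_error(O, T)
```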
Gradient Descent
The backpropagation algorithm simply applies gradient descent to the gradient of the loss function with respect to each layer's weight matrix, computed using the chain rule:
{% \frac{\partial{E}}{\partial{W_1}} = \frac{\partial{E}}{\partial{O_2}} \times \frac{\partial{O_2}}{\partial{A_2}} \times \frac{\partial{A_2}}{\partial{O_1}}
\times \frac{\partial{O_1}}{\partial{W_1}} %}
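Continuing the NumPy sketch, and assuming the sigmoid activations and squared-error loss chosen above, one backpropagation pass and gradient descent update might look like this. The learning rate is an arbitrary assumption.

```python
lr = 0.1  # learning rate for the gradient descent step (arbitrary)

# dE/dO_2 for the squared-error loss, and dO_2/dA_2 for the sigmoid.
dE_dO2  = O - T                     # shape (1, 2)
dO2_dA2 = O * (1.0 - O)             # sigmoid'(A_2), elementwise
delta2  = dE_dO2 * dO2_dA2          # dE/dA_2

# Gradients with respect to each layer's weight matrix, via the chain rule.
dE_dW2 = O1.T @ delta2                         # shape (4, 2), matches W2
delta1 = (delta2 @ W2.T) * O1 * (1.0 - O1)     # dE/dA_1
dE_dW1 = I.T @ delta1                          # shape (3, 4), matches W1

# One gradient descent update for each weight matrix.
W2 -= lr * dE_dW2
W1 -= lr * dE_dW1
```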