Backpropagation

Overview


Setup


For this example, a neural network is envisioned as in the following diagram.
{% \begin{bmatrix} \\ I_1 \\ \\ \end{bmatrix} \times \begin{bmatrix} & & \\ & W_1 & \\ & & \\ \end{bmatrix} = \begin{bmatrix} \\ A_1 \\ \\ \end{bmatrix} \circ f_1 \rightarrow \begin{bmatrix} \\ I_2 \\ \\ \end{bmatrix} \times \begin{bmatrix} & & \\ & W_2 & \\ & & \\ \end{bmatrix} = \begin{bmatrix} \\ A_2 \\ \\ \end{bmatrix} \circ f_2 \rightarrow \begin{bmatrix} \\ O \\ \\ \end{bmatrix} %}
Each layer has a row vector of inputs, labeled {% I_i %}.
That vector is then multiplied by a weight matrix, {% W_i %}.

The result of the multiplication is labeled {% A_i %}, a row vector that becomes the input to that layer's activation function (labeled {% f_i %} in the diagram). The result of the activation function then becomes the input, {% I_{i+1} %}, to the next layer.
{% A_i = I_i \times W_i %}
The output of the ith layer is defined as
{% O_i = f_i(A_i) %}
Then, for a two-layer network, the final output can be written as
{% O = f_2(f_1(I \times W_1) \times W_2) %}
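
The following is a minimal sketch of this forward pass in NumPy. The layer sizes, the sigmoid activation, and the variable names are illustrative assumptions, not something fixed by the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

I = rng.normal(size=(1, 3))      # row vector of inputs to layer 1
W1 = rng.normal(size=(3, 4))     # weight matrix of layer 1
W2 = rng.normal(size=(4, 2))     # weight matrix of layer 2

A1 = I @ W1                      # A_1 = I_1 x W_1
O1 = sigmoid(A1)                 # O_1 = f_1(A_1), which is the input I_2 to layer 2
A2 = O1 @ W2                     # A_2 = I_2 x W_2
O2 = sigmoid(A2)                 # O = f_2(f_1(I x W_1) x W_2)
```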

Loss Function


In order to train the network, we need to define a smooth loss function to be minimized.
{% Loss = E(f_2(f_1(I \times W_1) \times W_2)) %}
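
The text leaves {% E %} unspecified, so as one possible choice, the sketch below continues the example above with a squared-error loss against a hypothetical target vector {% T %}.

```python
# Squared-error loss against a hypothetical target T (an assumed choice of E).
T = np.array([[0.0, 1.0]])            # target for the output O_2
loss = 0.5 * np.sum((O2 - T) ** 2)    # Loss = E(O_2)
```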

Gradient Descent


The backpropagation algorithm simply applies Gradient Descent using the gradient of the loss function with respect to each layer's weight matrix, computed via the chain rule:
{% \frac{\partial{E}}{\partial{W_1}} = \frac{\partial{E}}{\partial{O_2}} \times \frac{\partial{O_2}}{\partial{A_2}} \times \frac{\partial{A_2}}{\partial{O_1}} \times \frac{\partial{O_1}}{\partial{A_1}} \times \frac{\partial{A_1}}{\partial{W_1}} %}
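
The sketch below works out one backpropagation and gradient-descent step for the two-layer network above, under the sigmoid activation and squared-error loss assumed earlier; the learning rate is also an arbitrary choice.

```python
lr = 0.1                                 # assumed learning rate

# dE/dO_2 and dO_2/dA_2 combined into one "delta" per layer
dE_dO2 = O2 - T                          # derivative of 0.5*(O2 - T)^2
delta2 = dE_dO2 * O2 * (1.0 - O2)        # dE/dA_2 (sigmoid' = O2*(1 - O2))

dE_dW2 = O1.T @ delta2                   # dE/dW_2, since A_2 = O_1 x W_2

dE_dO1 = delta2 @ W2.T                   # dA_2/dO_1 = W_2^T
delta1 = dE_dO1 * O1 * (1.0 - O1)        # dE/dA_1
dE_dW1 = I.T @ delta1                    # dE/dW_1, since A_1 = I x W_1

# gradient descent update
W1 -= lr * dE_dW1
W2 -= lr * dE_dW2
```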
