Backpropagation
Overview
Setup
For this example, a neural network is envisioned as in the following diagram.
{%
\begin{bmatrix}
\\
I_1 \\
\\
\end{bmatrix}
\times
\begin{bmatrix}
& & \\
& W_1 & \\
& & \\
\end{bmatrix}
=
\begin{bmatrix}
\\
A_1 \\
\\
\end{bmatrix}
\circ f_1
\rightarrow
\begin{bmatrix}
\\
I_2 \\
\\
\end{bmatrix}
\times
\begin{bmatrix}
& & \\
& W_2 & \\
& & \\
\end{bmatrix}
=
\begin{bmatrix}
\\
A_2 \\
\\
\end{bmatrix}
\circ f_2
\rightarrow
%}
Each layer has a row vector of inputs, labeled {% I_i %}.
That vector is then multiplied by a weight matrix, {% W_i %}.
The result of the multiplication is labeled {% A_i %}; it is a row vector that becomes the input to that
layer's activation function (labeled {% f_i %} in the diagram). The result of the activation function then becomes
the input, {% I_{i+1} %}, to the next layer.
{% A_i = I_i \times W_i %}
The output of the ith layer is defined as
{% O_i = f_i(A_i) %}
Then, for a two-layer network, the final output can be written as
{% O = f_2(f_1(I \times W_1) \times W_2) %}
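As a concrete illustration, here is a minimal NumPy sketch of this forward pass. The layer sizes, the random initialization, and the choice of the sigmoid function for {% f_1 %} and {% f_2 %} are assumptions made purely for the example.

```python
import numpy as np

def sigmoid(x):
    # One possible choice for the activation functions f_i.
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
I  = rng.normal(size=(1, 3))   # row vector of inputs
W1 = rng.normal(size=(3, 4))   # first layer's weight matrix
W2 = rng.normal(size=(4, 2))   # second layer's weight matrix

A1 = I @ W1          # A_1 = I x W_1
O1 = sigmoid(A1)     # O_1 = f_1(A_1), which becomes the next layer's input I_2
A2 = O1 @ W2         # A_2 = O_1 x W_2
O  = sigmoid(A2)     # O = f_2(f_1(I x W_1) x W_2)
```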
Loss Function
In order to train the network, we need to define a smooth loss function to be minimized.
{% Loss = E(f_2(f_1(I \times W_1) \times W_2)) %}
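The error function {% E %} is left abstract above. One common concrete choice, used in the sketch below purely for illustration, is the squared error against a target row vector {% T %} (an assumed quantity, not something defined earlier).

```python
# Continuing the forward-pass sketch above.
T = np.array([[1.0, 0.0]])  # assumed target values, for illustration only

def squared_error(O, T):
    # E(O) = 1/2 * sum((O - T)^2); the factor of 1/2 keeps the gradient tidy.
    return 0.5 * np.sum((O - T) ** 2)

loss = squared_error(O, T)
```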
Gradient Descent
The backpropagation algorithm simply applies gradient descent to the gradient of the loss function with respect to each layer's weight matrix, computed using the chain rule:
{% \frac{\partial{E}}{\partial{W_1}} = \frac{\partial{E}}{\partial{O_2}} \times \frac{\partial{O_2}}{\partial{A_2}} \times \frac{\partial{A_2}}{\partial{O_1}}
\times \frac{\partial{O_1}}{\partial{W_1}} %}
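Continuing the NumPy sketch, and assuming the sigmoid activations and squared-error loss chosen above, one backpropagation pass and gradient descent update might look like this. The learning rate is an arbitrary assumption.

```python
lr = 0.1  # learning rate for the gradient descent step (arbitrary)

# dE/dO_2 for the squared-error loss, and dO_2/dA_2 for the sigmoid.
dE_dO2  = O - T                     # shape (1, 2)
dO2_dA2 = O * (1.0 - O)             # sigmoid'(A_2), elementwise
delta2  = dE_dO2 * dO2_dA2          # dE/dA_2

# Gradients with respect to each layer's weight matrix, via the chain rule.
dE_dW2 = O1.T @ delta2                         # shape (4, 2), matches W2
delta1 = (delta2 @ W2.T) * O1 * (1.0 - O1)     # dE/dA_1
dE_dW1 = I.T @ delta1                          # shape (3, 4), matches W1

# One gradient descent update for each weight matrix.
W2 -= lr * dE_dW2
W1 -= lr * dE_dW1
```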