Backpropagation

Overview


Backpropagation is the standard method used to mathematically tune the parameters of a neural network during training. At its heart, it is simply a method for calculating the gradient of the loss function and then applying gradient descent.

Setup


For this example, a neural network layer is envisioned as in the following diagram.
{% \begin{bmatrix} \\ I_1 \\ \\ \end{bmatrix} \times \begin{bmatrix} & & \\ & W_1 & \\ & & \\ \end{bmatrix} \circ f_1 %}
  • Each layer has a row vector of inputs, labeled {% I_i %}.
  • That vector is multiplied by a weight matrix, {% W_i %}.
  • The result of the multiplication (which we will label {% A_i %}, that is, {% A_i = I_i \times W_i %}) is a row vector that becomes the input to that layer's activation function (labeled {% f_1 %} in the diagram). The result of the activation function then becomes the input vector, {% I_{i+1} %}, to the next layer.
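The single-layer forward pass above can be sketched in NumPy; the tanh activation and the specific shapes here are illustrative assumptions, not from the source.

```python
import numpy as np

# A minimal sketch of one layer's forward pass, assuming a tanh activation.
I1 = np.array([[0.5, -1.0, 2.0]])  # row vector of inputs, I_1 (shape 1x3)
W1 = np.ones((3, 4))               # weight matrix, W_1 (shape 3x4)
A1 = I1 @ W1                       # A_1 = I_1 x W_1 (row vector, shape 1x4)
I2 = np.tanh(A1)                   # activation output becomes the next layer's input, I_2
```

Note that the row-vector convention means the input multiplies the weight matrix from the left, so `I1 @ W1` has shape `(1, 4)`.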

A two-layer network can then be visualized as
{% \begin{bmatrix} \\ A_1 \\ \\ \end{bmatrix} \circ f_1 \rightarrow \begin{bmatrix} \\ A_2 \\ \\ \end{bmatrix} \circ f_2 \rightarrow %}
The output of the ith layer is defined as
{% O_i = f_i(A_i) %}
Then, for a two-layer network, the final output can be written as
{% O = f_2(f_1(I \times W_1) \times W_2) %}
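The composed expression {% O = f_2(f_1(I \times W_1) \times W_2) %} translates directly into code. This sketch assumes tanh for both activations, which is an assumption for illustration only.

```python
import numpy as np

# A sketch of the two-layer forward pass O = f2(f1(I x W1) x W2),
# assuming tanh activations for both layers.
def forward(I, W1, W2):
    O1 = np.tanh(I @ W1)     # O_1 = f_1(I x W_1)
    return np.tanh(O1 @ W2)  # O   = f_2(O_1 x W_2)
```

Each `@` is the row-vector-times-matrix product from the diagram, so the output is again a row vector.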

Loss Function


In order to train the network, we need to define a smooth loss function to be minimized.
{% Loss = E(f_2(f_1(I \times W_1) \times W_2)) %}
Loss functions are often extended to include a regularization term.
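As a concrete sketch, the loss {% E %} could be squared error with an added L2 regularization term; both the choice of squared error and the penalty weight here are illustrative assumptions, not from the source.

```python
import numpy as np

# Squared-error loss: a common smooth choice for E (an assumption here).
def squared_error(O, target):
    return 0.5 * np.sum((O - target) ** 2)

# Extends the base loss with an L2 penalty on every weight matrix;
# lam is a hypothetical regularization strength.
def regularized_loss(O, target, weights, lam=1e-3):
    return squared_error(O, target) + lam * sum(np.sum(W ** 2) for W in weights)
```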

Gradient Descent


The backpropagation algorithm simply applies gradient descent using the gradient of the loss function with respect to each layer's weight matrix, computed via the chain rule:
{% \frac{\partial{E}}{\partial{W_1}} = \frac{\partial{E}}{\partial{O_2}} \times \frac{\partial{O_2}}{\partial{A_2}} \times \frac{\partial{A_2}}{\partial{O_1}} \times \frac{\partial{O_1}}{\partial{W_1}} %}
Note that many machine learning texts use denominator layout, which reverses the order of the factors in the chain rule given above.
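The chain-rule factors above can be sketched concretely and verified against a finite-difference estimate. This sketch assumes tanh activations and a squared-error loss (both assumptions for illustration); each intermediate matches one factor in the chain rule.

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.standard_normal((1, 3))    # input row vector
W1 = rng.standard_normal((3, 4))   # layer-1 weights
W2 = rng.standard_normal((4, 2))   # layer-2 weights
t = rng.standard_normal((1, 2))    # training target (illustrative)

# E = 0.5 * sum((O_2 - t)^2), with f_1 = f_2 = tanh (assumptions).
def loss(W1, W2):
    O1 = np.tanh(I @ W1)
    O2 = np.tanh(O1 @ W2)
    return 0.5 * np.sum((O2 - t) ** 2)

# Analytic gradients via the chain rule.
O1 = np.tanh(I @ W1)
O2 = np.tanh(O1 @ W2)
dA2 = (O2 - t) * (1 - O2 ** 2)        # dE/dO2 * dO2/dA2
dW2 = O1.T @ dA2                      # dE/dW2
dA1 = (dA2 @ W2.T) * (1 - O1 ** 2)    # dA2/dO1, then dO1/dA1
dW1 = I.T @ dA1                       # dE/dW1

# Finite-difference check on one entry of W1.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (loss(W1p, W2) - loss(W1, W2)) / eps
assert abs(numeric - dW1[0, 0]) < 1e-4
```

A gradient descent step would then update each weight matrix, e.g. `W1 -= learning_rate * dW1`.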

Topics


  • Numeric Implementation
  • Analytic Implementation