Overview
A machine learning algorithm is, in effect, a function that maps points from the input space {% X %} to the space of answers {% Y %}:
{% f_{\vec{\theta}}: X \to Y %}
The function depends on a set of parameters, here denoted {% \vec{\theta} %}.
The predicted value of y is given by
{% \hat{y} = f_{\vec{\theta}}(\vec{x}) %}
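As a concrete illustration (a hypothetical linear model, not prescribed by the text), such a parameterized function might look like this in Python:

```python
import numpy as np

# A hypothetical example of f_theta: a simple linear model.
# theta is the parameter vector; x is a single input point from X.
def f(theta: np.ndarray, x: np.ndarray) -> float:
    """Predicted value y_hat = f_theta(x)."""
    return float(theta @ x)

theta = np.array([0.5, -1.2])   # example parameter values (arbitrary)
x = np.array([2.0, 3.0])        # example input point (arbitrary)
y_hat = f(theta, x)             # the prediction, here 0.5*2.0 + (-1.2)*3.0 = -2.6
```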
Error
There is an error function that measures how far a prediction is from the true value:
{% Error(y, \hat{y}) %}
Then the total empirical error is
{% \sum_i Error(y_i, \hat{y}_i) %}
(Note: a somewhat more abstract treatment is possible in which the total empirical error is not a sum of per-point errors but a general function of the whole set of points.)
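For example, with the squared error as a hypothetical choice of Error, the total empirical error over a small made-up dataset could be computed as:

```python
import numpy as np

# Hypothetical choice of Error: the squared error of a single prediction.
def error(y: float, y_hat: float) -> float:
    return (y - y_hat) ** 2

# Total empirical error: the sum of per-point errors over the dataset.
def total_error(ys: np.ndarray, y_hats: np.ndarray) -> float:
    return float(sum(error(y, y_hat) for y, y_hat in zip(ys, y_hats)))

ys = np.array([1.0, 2.0, 3.0])       # true values (made-up data)
y_hats = np.array([0.9, 2.2, 2.7])   # predictions (made-up data)
print(total_error(ys, y_hats))       # 0.01 + 0.04 + 0.09 ≈ 0.14
```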
Gradient Descent
The standard method of solution is to use gradient descent to find the optimal parameters {% \vec{\theta} %}.
Note that the gradient descent algorithm in this case treats f as a function of its parameters {% \vec{\theta} %}, not of {% \vec{x} %}, which is fixed for the given sample dataset. Also, gradient descent typically treats f as a black box: it approximates the gradient with a numeric procedure (see numeric differentiation) that evaluates the error at nearby points.
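A minimal sketch of this black-box approach (hypothetical, assuming the squared-error total error above): each partial derivative is approximated by central differences, and the parameters are moved a small step against the gradient.

```python
import numpy as np

def numeric_gradient(E, theta, eps=1e-6):
    """Approximate dE/dtheta by central differences, treating E as a
    black-box function of the parameters only (the data is fixed)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (E(theta + step) - E(theta - step)) / (2 * eps)
    return grad

def gradient_descent(E, theta, lr=0.01, n_steps=1000):
    """Plain gradient descent on the total empirical error E."""
    for _ in range(n_steps):
        theta = theta - lr * numeric_gradient(E, theta)
    return theta

# Example: fit a linear model y_hat = theta . x to made-up data.
xs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ys = np.array([2.0, -1.0, 1.0])
E = lambda theta: float(np.sum((ys - xs @ theta) ** 2))  # black-box total error
theta_opt = gradient_descent(E, np.zeros(2))
```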
Known Function
The backpropagation machinery comes into play when the analytic form of the function f above is known ahead of time and all of its partial derivatives can be computed analytically. The error for a single example is then
{% E(f_{\vec{\theta}}(\vec{x}), y) %}
The gradient of the error is calculated using the chain rule:
{% \frac{\partial{E(\vec{x})}}{\partial{\vec{x}}} = \frac{\partial{E}}{\partial{\vec{f}}} \times \frac{\partial{\vec{f}}}{\partial{\vec{x}}} %}
But now we want the gradient with respect to {% \vec{\theta} %}, not {% \vec{x} %}:
{% \frac{\partial{E(\vec{x})}}{\partial{\vec{\theta}}} = \frac{\partial{E}}{\partial{\vec{f}}} \times \frac{\partial{\vec{f}}}{\partial{\vec{\theta}}} %}
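A minimal sketch of this closed-form gradient, assuming (purely for illustration) that f is the linear model from above and E is the squared error for a single example:

```python
import numpy as np

# Assumed setup (illustrative, not prescribed by the text):
#   f_theta(x) = theta . x   and   E(f, y) = (f - y)^2
def analytic_gradient(theta: np.ndarray, x: np.ndarray, y: float) -> np.ndarray:
    """dE/dtheta = dE/df * df/dtheta, computed in closed form."""
    f = float(theta @ x)     # forward pass: the prediction f_theta(x)
    dE_df = 2.0 * (f - y)    # dE/df for the squared error
    df_dtheta = x            # df/dtheta for the linear model
    return dE_df * df_dtheta

theta = np.array([0.5, -1.2])
x = np.array([2.0, 3.0])
grad = analytic_gradient(theta, x, y=1.0)   # exact gradient, no numeric approximation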
Function Composition
Backpropagation comes into play when the function in question can be expressed as a composition
{% f(g(\vec{x})) %}
where f has a vector of parameters {% \vec{\theta} %} and g has a vector of parameters {% \vec{\xi} %}.
The gradient with respect to the outer parameters {% \vec{\theta} %} has the same form as before,
{% \frac{\partial{E(\vec{x})}}{\partial{\vec{\theta}}} = \frac{\partial{E}}{\partial{\vec{f}}} \times \frac{\partial{\vec{f}}}{\partial{\vec{\theta}}} %}
while the gradient with respect to the inner parameters {% \vec{\xi} %} picks up an extra factor from the chain rule,
{% \frac{\partial{E(\vec{x})}}{\partial{\vec{\xi}}} = \frac{\partial{E}}{\partial{\vec{f}}} \times \frac{\partial{\vec{f}}}{\partial{\vec{g}}} \times \frac{\partial{\vec{g}}}{\partial{\vec{\xi}}} %}
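A minimal sketch of these two formulas, assuming for illustration that g scales its input elementwise by {% \vec{\xi} %}, f is a linear read-out with parameters {% \vec{\theta} %}, and E is the squared error (none of which is fixed by the text):

```python
import numpy as np

# Assumed composition (illustrative only):
#   g_xi(x)    = xi * x       elementwise scaling, parameters xi
#   f_theta(h) = theta . h    linear read-out, parameters theta
#   E(f, y)    = (f - y)^2    squared error
def backprop(theta, xi, x, y):
    # Forward pass, keeping the intermediate value g.
    g = xi * x                # inner function g_xi(x)
    f = float(theta @ g)      # outer function f_theta(g)
    # Backward pass: apply the chain rule one factor at a time.
    dE_df = 2.0 * (f - y)     # dE/df
    dE_dtheta = dE_df * g     # dE/dtheta = dE/df * df/dtheta   (df/dtheta = g)
    dE_dg = dE_df * theta     # dE/dg     = dE/df * df/dg       (df/dg = theta)
    dE_dxi = dE_dg * x        # dE/dxi    = dE/dg * dg/dxi      (dg/dxi = x, elementwise)
    return dE_dtheta, dE_dxi

theta = np.array([1.0, -0.5])
xi = np.array([0.3, 2.0])
x = np.array([2.0, 1.0])
dE_dtheta, dE_dxi = backprop(theta, xi, x, y=1.0)
```

The key point the sketch illustrates is that the factor {% \frac{\partial{E}}{\partial{\vec{f}}} %} (here dE_df) is computed once and reused for every downstream parameter gradient, which is what makes backpropagation efficient.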
Neural Networks
Neural networks and deep learning are the most prominent application of backpropagation. For the derivation of backpropagation in that setting, see:
Backpropagation in a Neural Network