Overview
A machine learning algorithm is, in effect, a function that maps points from the input space {% X %} to the space of answers {% Y %}:
{% f_{\vec{\theta}}: X \to Y %}
The function depends on a set of parameters, here denoted {% \vec{\theta} %}.
The predicted value of y is given by
{% \hat{y} = f_{\vec{\theta}}(\vec{x}) %}
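As a concrete illustration (a hypothetical linear model, not prescribed by the text), such a parameterized function might look like this in Python:

```python
import numpy as np

# A hypothetical example of f_theta: a simple linear model.
# theta is the parameter vector; x is a single input point from X.
def f(theta: np.ndarray, x: np.ndarray) -> float:
    """Predicted value y_hat = f_theta(x)."""
    return float(theta @ x)

theta = np.array([0.5, -1.2])   # example parameter values (arbitrary)
x = np.array([2.0, 3.0])        # example input point (arbitrary)
y_hat = f(theta, x)             # the prediction, here 0.5*2.0 + (-1.2)*3.0 = -2.6
```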
Error
There is an error function that measures how far a prediction is from the true value:
{% Error(y, \hat{y}) %}
Then the total empirical error is
{% \sum_i Error(y_i, \hat{y}_i) %}
(Note: a somewhat more abstract treatment is possible in which the total empirical error is not a sum of per-point errors but a general function of the whole set of points.)
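For example, with the squared error as a hypothetical choice of Error, the total empirical error over a small made-up dataset could be computed as:

```python
import numpy as np

# Hypothetical choice of Error: the squared error of a single prediction.
def error(y: float, y_hat: float) -> float:
    return (y - y_hat) ** 2

# Total empirical error: the sum of per-point errors over the dataset.
def total_error(ys: np.ndarray, y_hats: np.ndarray) -> float:
    return float(sum(error(y, y_hat) for y, y_hat in zip(ys, y_hats)))

ys = np.array([1.0, 2.0, 3.0])       # true values (made-up data)
y_hats = np.array([0.9, 2.2, 2.7])   # predictions (made-up data)
print(total_error(ys, y_hats))       # 0.01 + 0.04 + 0.09 ≈ 0.14
```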
Gradient Descent
The standard method of solution is to use gradient descent to find the optimal parameters {% \vec{\theta} %}.
Note that the gradient descent algorithm in this case treats f as a function of its parameters {% \vec{\theta} %}, not of {% \vec{x} %}, which is fixed for the given sample dataset. Also, gradient descent typically treats f as a black box: it approximates the gradient with a numeric procedure (see numeric differentiation) that evaluates the error at nearby points.
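A minimal sketch of this black-box approach (hypothetical, assuming the squared-error total error above): each partial derivative is approximated by central differences, and the parameters are moved a small step against the gradient.

```python
import numpy as np

def numeric_gradient(E, theta, eps=1e-6):
    """Approximate dE/dtheta by central differences, treating E as a
    black-box function of the parameters only (the data is fixed)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (E(theta + step) - E(theta - step)) / (2 * eps)
    return grad

def gradient_descent(E, theta, lr=0.01, n_steps=1000):
    """Plain gradient descent on the total empirical error E."""
    for _ in range(n_steps):
        theta = theta - lr * numeric_gradient(E, theta)
    return theta

# Example: fit a linear model y_hat = theta . x to made-up data.
xs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ys = np.array([2.0, -1.0, 1.0])
E = lambda theta: float(np.sum((ys - xs @ theta) ** 2))  # black-box total error
theta_opt = gradient_descent(E, np.zeros(2))
```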
Known Function
The backpropagation machinery comes into play when the analytic form of the function f above is known ahead of time and all of its partial derivatives can be computed analytically. The error for a single example is then
{% E(f_{\vec{\theta}}(\vec{x}), y) %}
The gradient of the error is calculated using the chain rule:
{% \frac{\partial{E(\vec{x})}}{\partial{\vec{x}}} = \frac{\partial{E}}{\partial{\vec{f}}} \times \frac{\partial{\vec{f}}}{\partial{\vec{x}}} %}
But now we want the gradient with respect to {% \vec{\theta} %}, not {% \vec{x} %}:
{% \frac{\partial{E(\vec{x})}}{\partial{\vec{\theta}}} = \frac{\partial{E}}{\partial{\vec{f}}} \times \frac{\partial{\vec{f}}}{\partial{\vec{\theta}}} %}
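A minimal sketch of this closed-form gradient, assuming (purely for illustration) that f is the linear model from above and E is the squared error for a single example:

```python
import numpy as np

# Assumed setup (illustrative, not prescribed by the text):
#   f_theta(x) = theta . x   and   E(f, y) = (f - y)^2
def analytic_gradient(theta: np.ndarray, x: np.ndarray, y: float) -> np.ndarray:
    """dE/dtheta = dE/df * df/dtheta, computed in closed form."""
    f = float(theta @ x)     # forward pass: the prediction f_theta(x)
    dE_df = 2.0 * (f - y)    # dE/df for the squared error
    df_dtheta = x            # df/dtheta for the linear model
    return dE_df * df_dtheta

theta = np.array([0.5, -1.2])
x = np.array([2.0, 3.0])
grad = analytic_gradient(theta, x, y=1.0)   # exact gradient, no numeric approximation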
Function Composition
Backpropagation comes into play when the function in question can be expressed as a composition
{% f(g(\vec{x})) %}
where f has a vector of parameters {% \vec{\theta} %} and g has a vector of parameters {% \vec{\xi} %}.
The gradient with respect to the outer parameters {% \vec{\theta} %} has the same form as before,
{% \frac{\partial{E(\vec{x})}}{\partial{\vec{\theta}}} = \frac{\partial{E}}{\partial{\vec{f}}} \times \frac{\partial{\vec{f}}}{\partial{\vec{\theta}}} %}
while the gradient with respect to the inner parameters {% \vec{\xi} %} picks up an extra factor from the chain rule,
{% \frac{\partial{E(\vec{x})}}{\partial{\vec{\xi}}} = \frac{\partial{E}}{\partial{\vec{f}}} \times \frac{\partial{\vec{f}}}{\partial{\vec{g}}} \times \frac{\partial{\vec{g}}}{\partial{\vec{\xi}}} %}
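A minimal sketch of these two formulas, assuming for illustration that g scales its input elementwise by {% \vec{\xi} %}, f is a linear read-out with parameters {% \vec{\theta} %}, and E is the squared error (none of which is fixed by the text):

```python
import numpy as np

# Assumed composition (illustrative only):
#   g_xi(x)    = xi * x       elementwise scaling, parameters xi
#   f_theta(h) = theta . h    linear read-out, parameters theta
#   E(f, y)    = (f - y)^2    squared error
def backprop(theta, xi, x, y):
    # Forward pass, keeping the intermediate value g.
    g = xi * x                # inner function g_xi(x)
    f = float(theta @ g)      # outer function f_theta(g)
    # Backward pass: apply the chain rule one factor at a time.
    dE_df = 2.0 * (f - y)     # dE/df
    dE_dtheta = dE_df * g     # dE/dtheta = dE/df * df/dtheta   (df/dtheta = g)
    dE_dg = dE_df * theta     # dE/dg     = dE/df * df/dg       (df/dg = theta)
    dE_dxi = dE_dg * x        # dE/dxi    = dE/dg * dg/dxi      (dg/dxi = x, elementwise)
    return dE_dtheta, dE_dxi

theta = np.array([1.0, -0.5])
xi = np.array([0.3, 2.0])
x = np.array([2.0, 1.0])
dE_dtheta, dE_dxi = backprop(theta, xi, x, y=1.0)
```

The key point the sketch illustrates is that the factor {% \frac{\partial{E}}{\partial{\vec{f}}} %} (here dE_df) is computed once and reused for every downstream parameter gradient, which is what makes backpropagation efficient.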
Neural Networks
Neural networks and deep learning are the most prominent application of backpropagation. For the derivation of backpropagation in that setting, see:
Backpropagation in a Neural Network