Overview
Gradient Descent is an algorithm for finding a minimum (or, with a sign change, a maximum; see optimization) of a given function of one or more variables.
It is a local algorithm: it starts at a given point, then examines the function values in the area around that point to determine which point to evaluate next. That is, it looks at the slope of the function in each direction and follows the direction of steepest descent.
A library implementing the basic gradient descent algorithm can be found at Gradient Descent.
Intuition
The above graph represents an error surface. Clicking the run button places a red dot somewhere randomly on the chart. Each subsequent click of the run button runs one iteration of the gradient descent algorithm. The slider moves the dot to a new point on the chart, so that the algorithm can be viewed starting from a different location.
Algorithm
The algorithm takes a multivariable function {% g %} and an initial set of parameters {% \vec{ \theta_k } %}, then calculates the gradient {% \vec{g_k} %}, multiplies the gradient by a learning rate {% \nu_k %} (which can be a constant or can be calculated on each pass), and subtracts that result from the parameter vector. In this description, {% k %} is the iteration number, so for instance {% \vec{g_k} %} is the gradient at the kth iteration. (Murphy, pg. 247)
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu_k \vec{g_k} %}
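As an illustrative sketch (not the actual API of the library linked above; the function name and defaults here are made up), the update rule can be written in a few lines of Python:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, iterations=100):
    """Repeatedly step against the gradient: theta_{k+1} = theta_k - nu * g_k."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        theta = theta - lr * grad(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
minimum = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0])
```

With a learning rate of 0.1, each step shrinks the distance to the minimum at 3 by a constant factor, so `minimum` converges there quickly.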
The learning rate can be specified as a constant value, a predefined sequence of values, or a value calculated on each iteration.
In addition, you can provide a learning rate vector, which gives a different learning rate for each parameter.
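A sketch of a per-parameter learning-rate vector, applied elementwise (the objective function and rates here are hypothetical, chosen only to illustrate the idea):

```python
import numpy as np

# Different learning rate for each parameter: the second parameter lies on a
# much steeper axis of f(x, y) = x^2 + 10 * y^2, so it gets a smaller rate.
lr = np.array([0.1, 0.01])

def grad(theta):
    # Gradient of f(x, y) = x^2 + 10 * y^2.
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([4.0, 4.0])
for _ in range(200):
    theta = theta - lr * grad(theta)  # elementwise product with the rate vector
```

Both coordinates contract toward the minimum at the origin, despite the very different curvatures along the two axes.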
Pitfalls
There are some situations where gradient descent works well and can mostly run on its own; others require careful attention from the user. For example, when the error function is specified as a squared error, the total error, and with it the gradient, can become quite large. In such a situation, if the learning rate is not tuned correctly, the algorithm can bounce around erratically.
One way to address these issues is to get close to the optimal values before running gradient descent. This can be accomplished with a regression, with a random search, or by first optimizing one parameter at a time.
Gradient Descent and Machine Learning
Gradient descent is the primary method used when applying machine learning to a dataset generated from some unknown distribution, particularly when no assumptions (such as a parametric form) can be made about that distribution.
In the machine learning context, gradient descent is used for loss minimization. In this case, the function to be minimized is a loss computed from the training data set. The total loss is usually expressed as a sum, over the training points, of a loss computed from the forecast {% f(x_i) %} of the response function {% f %} and the target value {% y_i %}.
{% Total \, Loss = \sum_i loss(f(x_i), y_i) %}
The gradient of the total loss is the sum of the gradients of the individual losses.
{% \nabla Total \, Loss = \sum_i \nabla loss(f(x_i), y_i) %}
This can be used to parallelize the problem: each worker computes the gradient over a subset of the data, and the partial gradients are summed.
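For concreteness, here is a sketch of that decomposition for a hypothetical one-parameter least-squares model (the model and data are illustrative, not from the text):

```python
import numpy as np

# Model f(x) = w * x with squared loss; each example's gradient with respect
# to w is 2 * (w * x_i - y_i) * x_i.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w = 0.5

# Per-example gradients: these could be computed on separate workers.
per_example = [2 * (w * xi - yi) * xi for xi, yi in zip(x, y)]
total_grad = sum(per_example)

# The same gradient computed directly on the full batch.
batch_grad = 2 * np.sum((w * x - y) * x)
```

Summing the per-example gradients reproduces the batch gradient exactly, which is what makes data-parallel training possible.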
Stochastic Gradient Descent
In stochastic gradient descent, points are sampled from the training set and a gradient is computed from the sampled set. (The sampling need not be random; it could be as simple as taking single points round robin.)
Then one iteration of descent is performed. From a mathematical perspective, this algorithm follows an estimate of the expected gradient of the loss, rather than the actual loss gradient.
{% \mathbb{E}\left[ \sum_i \nabla loss(f(x_i), y_i) \right] \propto \nabla Total \, Loss %}
where the sum is taken over a randomly sampled subset of the training set.
Because the sampled set can be small compared to the full training set (in fact, it could be just a single point), running one iteration of the algorithm is much less computationally intensive than running it on the full set.
In a simple variant of stochastic gradient descent, the algorithm takes each point in the training set in turn, computes an individual gradient, and runs one iteration of the descent.
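A minimal sketch of stochastic gradient descent with single-point sampling, assuming a one-parameter linear model with squared loss (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit f(x) = w * x to noiseless data y = 3 * x by single-point SGD.
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x

w, lr = 0.0, 0.1
for _ in range(1000):
    i = rng.integers(len(x))            # sample one training point
    g = 2 * (w * x[i] - y[i]) * x[i]    # gradient of that point's loss
    w = w - lr * g                      # one descent iteration
```

Each iteration touches a single example, so its cost is independent of the training set size, yet `w` still converges toward the true slope of 3.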
Because of the randomness of the SGD algorithm, it is thought to provide a way to escape small local minima in the loss surface: the random kick it introduces may jolt a solution point out of a local minimum.
(see Prince, chapter 6)
Step Sizes
Multiple common methods exist for choosing a step size in each iteration of the gradient descent algorithm.
- Constant Learning Rate
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu \vec{g_k} %} where {% \nu %} is a constant
- Constant Step Size
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu_k \vec{g_k} %} where {% \nu_k %} is chosen such that {% || \nu_k \vec{g_k} || %} is some constant.
- Iteration Dependent Step Sizes
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu_k \vec{g_k} %} where {% \nu_k %} changes as {% k %} increases. The typical method is to decrease {% \nu_k %} as {% k %} increases.
- Line Search
- Backtracking Line Search
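Backtracking line search can be sketched as follows, using the standard Armijo sufficient-decrease condition (this is a generic sketch, not the linked library's implementation; the constants are conventional defaults):

```python
import numpy as np

def backtracking_step(f, grad_f, theta, nu0=1.0, beta=0.5, c=1e-4):
    """Shrink the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_f(theta)
    nu = nu0
    # Accept nu only if the step decreases f by at least c * nu * ||g||^2.
    while f(theta - nu * g) > f(theta) - c * nu * np.dot(g, g):
        nu *= beta  # backtrack: shrink the candidate step
    return theta - nu * g

# Minimize f(theta) = ||theta - 3||^2 starting from the origin.
f = lambda t: np.sum((t - 3) ** 2)
grad_f = lambda t: 2 * (t - 3)
theta = np.array([0.0])
for _ in range(50):
    theta = backtracking_step(f, grad_f, theta)
```

Because the step size adapts on every iteration, no learning-rate tuning is needed; the cost is extra function evaluations inside the backtracking loop.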
Algorithms
Library
Gradient Descent API