Gradient Descent
Overview
Gradient Descent is an algorithm for finding the minimum of a given multivariable function. It is a local algorithm: it starts
at a given point, and then examines the function values around that point (the local area) in order to determine which point to evaluate next.
Technically, it determines the rate of change of the function along each axis of the inputs (the gradient), which gives the direction of
fastest increase, and then proceeds in the opposite direction by a given step size.
(see gradient)
A gradient descent library can be found at Gradient Descent.
Intuition
Demo Video: demonstrates the gradient descent algorithm.
Algorithm
The algorithm takes a multivariable function {% g %} and an initial set of parameters {% \vec{ \theta_k } %}, then calculates the
gradient {% \vec{g_k} %}, multiplies the
gradient by a learning rate {% \nu_k %} (which can be a constant or calculated on each pass), and subtracts that result
from the parameter vector. In this description, {% k %} is the iteration number, so for instance {% \vec{g_k} %}
is the gradient at the kth iteration.
(Murphy, pg. 247)
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu_k \vec{g_k} %}
The learning rate can be specified as a constant value, a sequence of values, or a value computed on each iteration.
In addition, you can provide a learning rate vector, which gives a different learning rate for each input parameter.
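As a concrete illustration, here is a minimal sketch of the update rule in Python. The function being minimized, its gradient, and all parameter names are assumptions chosen for the example, not part of any particular library.
```python
import numpy as np

def gradient_descent(grad, theta0, learning_rate=0.1, iterations=100):
    """Minimize a function given its gradient, starting from theta0."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(iterations):
        g_k = grad(theta)                    # gradient at the current parameters
        theta = theta - learning_rate * g_k  # theta_{k+1} = theta_k - nu_k * g_k
    return theta

# Example: minimize f(x, y) = (x - 1)^2 + (y + 2)^2, whose gradient is
# (2(x - 1), 2(y + 2)); the minimum is at (1, -2).
grad_f = lambda t: np.array([2 * (t[0] - 1), 2 * (t[1] + 2)])
print(gradient_descent(grad_f, theta0=[0.0, 0.0]))
```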
Pitfalls
There are some situations where gradient descent works well and can mostly run on its own. Other situations
require careful attention from the user. For example, when the error function is specified as
squared error, the total error, and hence the gradient, can become quite large. In such a situation, if the
learning rate is not tuned correctly, the algorithm can bounce around erratically rather than converge.
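A small sketch of this behavior, where the quadratic loss and the two learning rates are illustrative assumptions:
```python
# Minimize loss(theta) = theta^2, whose gradient is 2 * theta, starting at theta = 1.
grad = lambda theta: 2 * theta

for lr in (0.1, 1.1):  # a well-tuned rate vs. one that is too large
    theta = 1.0
    for _ in range(10):
        theta -= lr * grad(theta)
    print(f"lr={lr}: theta after 10 steps = {theta:.4f}")
# lr=0.1 converges toward 0; lr=1.1 overshoots the minimum on every step,
# bouncing across it with growing amplitude and diverging.
```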
One way to address these issues is to get close to the optimal values before running gradient descent.
This can be accomplished by using a regression, a random search, or by optimizing one parameter at a time first.
Gradient Descent and Machine Learning
Gradient descent is the primary method used when applying machine learning to a dataset that has been
generated from some unknown distribution, particularly when no assumptions (such as a parametric form) can
be made about the distribution.
In the machine learning context, gradient descent is used in
loss minimization.
In such a case, the function to be minimized is some loss computed from the training data set.
The loss is usually expressed as a sum, over the training points, of a loss computed from the prediction {% f(x_i) %}
of the response function f and the target value {% y_i %}.
{% Total \, Loss = \sum_i loss(f(x_i), y_i) %}
The gradient of the total loss is the sum of the gradients of the individual losses.
{% \nabla Total \, Loss = \sum_i \nabla loss(f(x_i), y_i) %}
This can be used to parallelize the problem: each worker computes the gradients for its share of the data, and the partial sums are combined.
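A sketch of this decomposition, assuming a squared-error loss and a linear model f (both chosen only for illustration):
```python
import numpy as np

# loss(f(x_i), y_i) = (theta . x_i - y_i)^2 for a linear model f.
def example_gradient(theta, x_i, y_i):
    """Gradient of the squared-error loss for one training example."""
    return 2 * (theta @ x_i - y_i) * x_i

def total_gradient(theta, X, y):
    """Sum of the individual gradients. Each term depends only on one
    example, so the terms can be computed on different workers and the
    partial sums combined afterward."""
    return sum(example_gradient(theta, x_i, y_i) for x_i, y_i in zip(X, y))
```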
Stochastic Gradient Descent
In stochastic gradient descent, points are
sampled from the training set, a gradient is computed from them, and one iteration of descent is performed.
Essentially, this algorithm follows the expected gradient of the loss, rather than the actual
gradient of the total loss.
{% \mathbb{E}[\nabla loss(f(x_i), y_i)] = \frac{1}{N} \sum_i \nabla loss(f(x_i), y_i) = \frac{1}{N} \, \nabla \, Total \, Loss %}
where the expectation is taken over a point {% (x_i, y_i) %} drawn uniformly at random from the training set of size {% N %}.
Because the sampled points can be a small set compared to the full training set (in fact, a sample
could be just a single point), running one iteration of the algorithm is much less computationally intensive
than running an iteration over the full set.
In a simple variation of stochastic gradient descent, the algorithm takes each point in the training set in turn,
computes the individual gradient, and runs one iteration of the descent, as in the sketch below.
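A minibatch sketch of the idea in Python; the batch size, sampling scheme, and function names are assumptions, and setting batch_size=1 gives the one-point-at-a-time variant just described:
```python
import numpy as np

def sgd(example_grad, theta0, X, y, learning_rate=0.01, epochs=10, batch_size=32):
    """Stochastic gradient descent over a training set (X, y).

    example_grad(theta, x_i, y_i) returns the gradient of the loss
    for a single training example."""
    rng = np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # visit the data in a random order
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Gradient estimated from the sampled points only.
            g = sum(example_grad(theta, X[i], y[i]) for i in batch) / len(batch)
            theta = theta - learning_rate * g
    return theta
```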
Because of the randomness of the SGD algorithm, it is thought to provide a way to bypass small local minima in the
loss surface: the random kick in each step may jolt the solution point out of a local minimum.
Step Sizes
Alternative Methods