Overview
Gradient Descent is an algorithm for finding a minimum (or, with a sign change, a maximum; see optimization) of a given function of one or more variables.
It is a local algorithm: it starts at a given point, then examines the function values in the area around that point to determine which point to evaluate next. That is, it looks at the slope of the function in each direction and follows the direction of steepest descent.
A library implementing the basic gradient descent algorithm can be found at Gradient Descent.
Intuition
The above graph represents an error surface. Clicking the run button places a red dot somewhere randomly on the chart. Each subsequent click of the run button runs one iteration of the gradient descent algorithm. The slider moves the dot to a new point on the chart, so that the algorithm can be viewed starting from a different location.
Algorithm
The algorithm takes a multivariable function {% g %} and an initial set of parameters {% \vec{ \theta_k } %}, then calculates the gradient {% \vec{g_k} %}, multiplies the gradient by a learning rate {% \nu_k %} (which can be a constant or can be calculated on each pass), and subtracts that result from the parameter vector. In this description, {% k %} is the iteration number, so for instance {% \vec{g_k} %} is the gradient at the kth iteration. (Murphy, pg. 247)
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu_k \vec{g_k} %}
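As an illustrative sketch (not the actual API of the library linked above; the function name and defaults here are made up), the update rule can be written in a few lines of Python:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, iterations=100):
    """Repeatedly step against the gradient: theta_{k+1} = theta_k - nu * g_k."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        theta = theta - lr * grad(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
minimum = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0])
```

With a learning rate of 0.1, each step shrinks the distance to the minimum at 3 by a constant factor, so `minimum` converges there quickly.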
The learning rate can be specified as a constant value, a predefined sequence of values, or a value calculated on each iteration.
In addition, you can provide a learning rate vector, which gives a different learning rate for each parameter.
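A sketch of a per-parameter learning-rate vector, applied elementwise (the objective function and rates here are hypothetical, chosen only to illustrate the idea):

```python
import numpy as np

# Different learning rate for each parameter: the second parameter lies on a
# much steeper axis of f(x, y) = x^2 + 10 * y^2, so it gets a smaller rate.
lr = np.array([0.1, 0.01])

def grad(theta):
    # Gradient of f(x, y) = x^2 + 10 * y^2.
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([4.0, 4.0])
for _ in range(200):
    theta = theta - lr * grad(theta)  # elementwise product with the rate vector
```

Both coordinates contract toward the minimum at the origin, despite the very different curvatures along the two axes.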
Pitfalls
There are some situations where gradient descent works well and can mostly run on its own; others require careful attention from the user. For example, when the error function is specified as a squared error, the total error, and with it the gradient, can become quite large. In such a situation, if the learning rate is not tuned correctly, the algorithm can bounce around erratically.
One way to address these issues is to get close to the optimal values before running gradient descent. This can be accomplished with a regression, with a random search, or by first optimizing one parameter at a time.
Gradient Descent and Machine Learning
Gradient descent is the primary method used when applying machine learning to a dataset generated from some unknown distribution, particularly when no assumptions (such as a parametric form) can be made about that distribution.
In the machine learning context, gradient descent is used for loss minimization. In this case, the function to be minimized is a loss computed from the training data set. The total loss is usually expressed as a sum, over the training points, of a loss computed from the forecast {% f(x_i) %} of the response function {% f %} and the target value {% y_i %}.
{% Total \, Loss = \sum_i loss(f(x_i), y_i) %}
The gradient of the total loss is the sum of the gradients of the individual losses.
{% \nabla Total \, Loss = \sum_i \nabla loss(f(x_i), y_i) %}
This can be used to parallelize the problem: each worker computes the gradient over a subset of the data, and the partial gradients are summed.
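For concreteness, here is a sketch of that decomposition for a hypothetical one-parameter least-squares model (the model and data are illustrative, not from the text):

```python
import numpy as np

# Model f(x) = w * x with squared loss; each example's gradient with respect
# to w is 2 * (w * x_i - y_i) * x_i.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w = 0.5

# Per-example gradients: these could be computed on separate workers.
per_example = [2 * (w * xi - yi) * xi for xi, yi in zip(x, y)]
total_grad = sum(per_example)

# The same gradient computed directly on the full batch.
batch_grad = 2 * np.sum((w * x - y) * x)
```

Summing the per-example gradients reproduces the batch gradient exactly, which is what makes data-parallel training possible.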
Stochastic Gradient Descent
In stochastic gradient descent, points are sampled from the training set and a gradient is computed from the sampled set. (The sampling need not be random; it could be as simple as taking single points round robin.)
Then one iteration of descent is performed. From a mathematical perspective, this algorithm follows an estimate of the expected gradient of the loss, rather than the actual loss gradient.
{% \mathbb{E}\left[ \sum_i \nabla loss(f(x_i), y_i) \right] \propto \nabla Total \, Loss %}
where the sum is taken over a randomly sampled subset of the training set.
Because the sampled set can be small compared to the full training set (in fact, it could be just a single point), running one iteration of the algorithm is much less computationally intensive than running it on the full set.
In a simple variant of stochastic gradient descent, the algorithm takes each point in the training set in turn, computes an individual gradient, and runs one iteration of the descent.
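A minimal sketch of stochastic gradient descent with single-point sampling, assuming a one-parameter linear model with squared loss (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit f(x) = w * x to noiseless data y = 3 * x by single-point SGD.
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x

w, lr = 0.0, 0.1
for _ in range(1000):
    i = rng.integers(len(x))            # sample one training point
    g = 2 * (w * x[i] - y[i]) * x[i]    # gradient of that point's loss
    w = w - lr * g                      # one descent iteration
```

Each iteration touches a single example, so its cost is independent of the training set size, yet `w` still converges toward the true slope of 3.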
Because of the randomness of the SGD algorithm, it is thought to provide a way to escape small local minima in the loss surface: the random kick it introduces may jolt a solution point out of a local minimum.
(see Prince, chapter 6)
Step Sizes
Multiple common methods exist for choosing a step size in each iteration of the gradient descent algorithm.
- Constant Learning Rate
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu \vec{g_k} %} where {% \nu %} is a constant
- Constant Step Size
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu_k \vec{g_k} %} where {% \nu_k %} is chosen such that {% || \nu_k \vec{g_k} || %} is some constant.
- Iteration Dependent Step Sizes
{% \vec{ \theta_{k+1} } = \vec{ \theta_k } - \nu_k \vec{g_k} %} where {% \nu_k %} changes as {% k %} increases. The typical method is to decrease {% \nu_k %} as {% k %} increases.
- Line Search
- Backtracking Line Search
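Backtracking line search can be sketched as follows, using the standard Armijo sufficient-decrease condition (this is a generic sketch, not the linked library's implementation; the constants are conventional defaults):

```python
import numpy as np

def backtracking_step(f, grad_f, theta, nu0=1.0, beta=0.5, c=1e-4):
    """Shrink the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_f(theta)
    nu = nu0
    # Accept nu only if the step decreases f by at least c * nu * ||g||^2.
    while f(theta - nu * g) > f(theta) - c * nu * np.dot(g, g):
        nu *= beta  # backtrack: shrink the candidate step
    return theta - nu * g

# Minimize f(theta) = ||theta - 3||^2 starting from the origin.
f = lambda t: np.sum((t - 3) ** 2)
grad_f = lambda t: 2 * (t - 3)
theta = np.array([0.0])
for _ in range(50):
    theta = backtracking_step(f, grad_f, theta)
```

Because the step size adapts on every iteration, no learning-rate tuning is needed; the cost is extra function evaluations inside the backtracking loop.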
Algorithms
Library
Gradient Descent API