Ridge Regression
Ridge regression takes the usual loss function from OLS regression and adds a term proportional to the squared norm of the weight vector.
{% L(\vec{w}) = \frac{1}{N} \sum_{i=1}^N (y_i - (w_0 + \vec{w}^T \vec{x}_i))^2 + \lambda || \vec{w} || ^2 %}
Here {% \lambda || \vec{w} || ^2 %} is the additional term. It penalizes the regression for having
large weights, pushing the weights toward zero. The term includes the hyper-parameter {% \lambda %},
which controls how strongly the penalty affects the regression.
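As a minimal sketch of this loss in NumPy (the function name `ridge_loss` and the argument `lam` are placeholder choices, not from the text):

```python
import numpy as np

def ridge_loss(X, y, w0, w, lam):
    """Mean squared error plus the L2 penalty lambda * ||w||^2.

    X   : (N, D) feature matrix
    y   : (N,)   targets
    w0  : scalar bias (not penalized)
    w   : (D,)   weight vector
    lam : penalty strength lambda
    """
    residuals = y - (w0 + X @ w)       # y_i - (w_0 + w^T x_i)
    mse = np.mean(residuals ** 2)      # (1/N) * sum of squared residuals
    penalty = lam * np.dot(w, w)       # lambda * ||w||^2
    return mse + penalty
```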
The optimal weights for ridge regression are given by
{% \vec{w}_{opt} = (\lambda I_D + X^TX)^{-1} X^T \vec{y} %}
where {% X %} is the {% N \times D %} design matrix, {% I_D %} is the {% D \times D %} identity matrix, and the constant factor of {% N %} from the averaged loss has been absorbed into {% \lambda %}.
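The closed-form solution translates directly to NumPy. The sketch below (function name `ridge_fit` and the synthetic data are illustrative assumptions) solves the linear system rather than forming the inverse explicitly, which is numerically more stable:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: (lambda * I_D + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    A = lam * np.eye(D) + X.T @ X
    # Solve A w = X^T y instead of computing the matrix inverse.
    return np.linalg.solve(A, X.T @ y)

# Tiny usage example on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w_opt = ridge_fit(X, y, lam=0.1)
```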