CSE 291 LECTURE NOTES

March 3, 2005


MSE = VARIANCE + BIAS^2

Lemma:  MSE(theta hat)  =  E[ (theta hat - theta)^2 ]  =  var(theta hat) +  (E[theta hat] - theta)^2

Proof:
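
The proof is left blank in these notes. One standard argument, sketched here with the shorthand mu = E[theta hat] (a symbol that does not appear in the original), adds and subtracts mu inside the square:

```latex
\begin{align*}
E\big[(\hat\theta - \theta)^2\big]
  &= E\big[(\hat\theta - \mu + \mu - \theta)^2\big] \\
  &= E\big[(\hat\theta - \mu)^2\big]
     + 2(\mu - \theta)\,E[\hat\theta - \mu]
     + (\mu - \theta)^2 \\
  &= \mathrm{var}(\hat\theta) + \big(E[\hat\theta] - \theta\big)^2
\end{align*}
```

The cross term vanishes because mu - theta is a constant and E[theta hat - mu] = 0.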

An unbiased estimator has zero bias:  E[theta hat] - theta = 0.  It is possible that an alternative estimator might have non-zero bias, but with reduced variance so that total MSE is smaller.  Among unbiased estimators, of course, the smallest variance is best.
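
As a small illustration of a biased estimator beating an unbiased one, consider estimating a mean theta from n i.i.d. draws with variance sigma^2, using c times the sample mean. This setup, the function name, and the numbers are assumptions chosen for illustration, not part of the lecture. The MSE follows directly from the lemma:

```python
def mse_of_scaled_mean(c, theta, sigma2, n):
    """MSE of the estimator c * (sample mean), via MSE = variance + bias^2."""
    variance = c * c * sigma2 / n      # var(c * mean) = c^2 * sigma^2 / n
    bias = c * theta - theta           # E[c * mean] - theta
    return variance + bias ** 2

# Illustrative values (assumed, not from the notes).
theta, sigma2, n = 1.0, 4.0, 4

unbiased = mse_of_scaled_mean(1.0, theta, sigma2, n)

# The c that minimizes variance + bias^2 for these values.
c_best = theta ** 2 / (theta ** 2 + sigma2 / n)
shrunk = mse_of_scaled_mean(c_best, theta, sigma2, n)

print(unbiased, shrunk)  # 1.0 0.5
```

The best c depends on the unknown theta, so this is not a usable estimator by itself; it only shows that some shrinkage towards zero can lower MSE.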


BIAS AND VARIANCE IN LINEAR REGRESSION

Remember that
b hat  =  (X'X)^-1 X'y
y hat  =  X b hat
Note that the least-squares estimates of the coefficients b are a linear function of y.

Gauss-Markov theorem:  Among all estimators of b that are unbiased and linear functions of y, the least-squares solution has smallest variance.

Corollary:  For any vector a, a'b hat has the smallest variance among all linear unbiased estimators of a'b.

However, as we just saw, smallest variance among unbiased estimators does not imply smallest mean-squared error.  Moreover, expected prediction error and MSE differ only by the constant sigma^2:

E(Y - f hat(x))^2  =  sigma^2 + E(x'b hat - f(x))^2  =  sigma^2 + MSE(f hat(x)).
From a machine learning point of view, minimizing MSE, not variance, is what we really want to do.
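
The identity above can be checked in a few lines. This sketch assumes the usual setup, which the notes do not spell out: a new observation Y = f(x) + epsilon, where epsilon has mean zero and variance sigma^2 and is independent of the training data, and hence of f hat(x).

```latex
\begin{align*}
E\big(Y - \hat f(x)\big)^2
  &= E\big(\varepsilon + f(x) - \hat f(x)\big)^2 \\
  &= E[\varepsilon^2]
     + 2\,E[\varepsilon]\,E\big[f(x) - \hat f(x)\big]
     + E\big(f(x) - \hat f(x)\big)^2 \\
  &= \sigma^2 + \mathrm{MSE}\big(\hat f(x)\big)
\end{align*}
```

The middle term is zero because E[epsilon] = 0, and independence is what allows the expectation of the product to factor.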

Note that we are still assuming our model is correct, so sigma^2 is fixed for all x and the error in f hat(x) just comes from not enough training data, not from the model being wrong.

Additional note:  Suppose we get to choose the placement of the training x values?  Where should we put them?
 
 

SHRINKAGE METHODS

Last Wed. we looked at stepwise methods for feature selection.  Today we'll look at two other approaches that try to reduce MSE, by reducing variance more directly.  Secondarily, two methods also make models more comprehensible, by eliminating some predictors entirely.

Terminology: Dimensionality reduction, feature selection, feature construction, feature extraction, basis function selection; supervised versus unsupervised approaches.

So-called "shrinkage" methods reduce the impact of predictors in a smooth way, by reducing the magnitude of their coefficients.  The most common method is called ridge regression. 

How do you choose the best amount of shrinkage?  Minimize an estimate of the average prediction error, e.g. with cross-validation.
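
A minimal sketch of choosing the amount of shrinkage by leave-one-out cross-validation, using ridge regression (described in the following sections) as the shrinkage method. The data, the candidate lambda values, and the function names are hypothetical; the sketch assumes two predictors, no intercept, and skips standardization for brevity.

```python
def ridge_fit_2d(X, y, lam):
    """Ridge coefficients for two predictors, no intercept:
    solves (X'X + lam*I) b = X'y with the 2x2 inverse written out."""
    a11 = sum(r[0] * r[0] for r in X) + lam
    a12 = sum(r[0] * r[1] for r in X)
    a22 = sum(r[1] * r[1] for r in X) + lam
    c1 = sum(r[0] * t for r, t in zip(X, y))
    c2 = sum(r[1] * t for r, t in zip(X, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)

def loo_cv_error(X, y, lam):
    """Average squared error when each point is predicted by a model
    fitted to all the other points."""
    total = 0.0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        b1, b2 = ridge_fit_2d(X_train, y_train, lam)
        pred = X[i][0] * b1 + X[i][1] * b2
        total += (y[i] - pred) ** 2
    return total / len(X)

# Hypothetical data with two strongly correlated predictors.
X = [[-2.0, -1.8], [-1.0, -1.2], [0.0, 0.3], [1.0, 0.9], [2.0, 1.8]]
y = [-3.9, -2.3, 0.4, 1.8, 4.1]

candidates = [0.0, 0.1, 1.0, 10.0]   # assumed grid of lambda values
cv = {lam: loo_cv_error(X, y, lam) for lam in candidates}
best_lam = min(cv, key=cv.get)
print(cv, best_lam)
```

With more data, k-fold cross-validation would be the cheaper choice; leave-one-out is used here only because the toy data set is tiny.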
 
 

RIDGE REGRESSION

The idea here is to place a penalty on large coefficient values.  This shrinks the estimated coefficients towards zero.  We minimize a modified RSS:
SUM_i (y_i - b0 - SUM_j x_ij*b_j)^2  + lambda SUM_j>=1 b_j^2
The bigger lambda is, the more the shrinkage.  For neural networks, the same idea is called weight decay.

The effect of shrinkage depends on the size of the x values, so usually these are standardized to have unit variance.  They are also standardized to have zero mean, so that b0 can be estimated separately, without shrinkage.

Mathematically, minimizing the modified RSS is equivalent to minimizing the regular RSS

SUM_i (y_i - b0 - SUM_j x_ij*b_j)^2
subject to the constraint that  SUM_j>=1 bj^2 <= s  for some constant s.  This means that correlated predictors that cancel each other out must be given small weights, rather than opposing large weights.


SOLVING RIDGE REGRESSION

The solution to a ridge regression problem is still a matrix times y, where the matrix is a function of X and lambda.  So the estimator is linear in y but, for lambda > 0, not unbiased.  Typically, for a well-chosen lambda, the squared bias introduced is less than the decrease in variance caused by the constraint on the estimates, so MSE goes down.

The standard least-squares solution is  b hat  =  (X'X)^-1 X'y.  For ridge regression,  b"  =  (X'X + lambda I)^-1 X'y
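
The ridge formula can be derived by setting the gradient of the penalized RSS to zero. The sketch below assumes the columns of X have been centered, so that b0 is estimated separately and drops out of the problem:

```latex
\begin{align*}
\mathrm{RSS}_\lambda(b) &= (y - Xb)^\top (y - Xb) + \lambda\, b^\top b \\
\frac{\partial\, \mathrm{RSS}_\lambda}{\partial b}
  &= -2 X^\top (y - Xb) + 2 \lambda b = 0 \\
(X^\top X + \lambda I)\, b &= X^\top y
\end{align*}
```

For lambda > 0 the matrix X'X + lambda I is positive definite, so it is invertible even when X'X itself is singular.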

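As lambda tends to infinity, the coefficients b" tend to zero.  If the columns of X are highly collinear, then some least-squares coefficients may be misleadingly negative, because two correlated predictors can receive large coefficients of opposite sign that nearly cancel.  With ridge regression, as the coefficients shrink towards zero, their signs tend to become more meaningful.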


MATLAB EXAMPLE OF RIDGE REGRESSION

Here are four data points in two dimensions that are highly co-linear, i.e. the dimensions are correlated:
X =
   -2.0000   -2.5000
   -1.0000   -0.5000
         0    0.5000
    3.0000    7.5000
Note that I wanted the mean of each column to be zero, but I changed the (4,2) entry without restoring this condition.  Here are some values for the dependent variable that should be easy to predict using either predictor:
y =
   -6.2500
   -3.2500
    1.7500
    7.7500
Here are the least-squares fitted values of y:
y hat  =
   -5.9464
   -3.1607
   -0.1250
    7.9821
These y hat use the least-squares estimated coefficients:
b hat  =   inv(X'*X)*X'*y =     3.2857    -0.2500
Here are the coefficients using ridge regression:
b"  =  inv(X'*X+eye(2))*X'*y =     2.0511     0.2940
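Now both predictors have positive coefficients, which reflects their individual correlation with y better.  The new estimated y values are
   -4.84
   -2.20
    0.15
    8.36
with RSS 6.042 compared to 3.67 without shrinkage.  The training RSS has to go up, since least squares minimizes it by definition; the hope is that error on new data goes down.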
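
For readers without MATLAB, the example can be reproduced with the Python standard library alone. This is a sketch, not the original session: the helper names are made up, and the 2x2 inverse is written out by hand instead of calling a matrix library.

```python
def ridge_2d(X, y, lam):
    """Solve (X'X + lam*I) b = X'y for two predictors, no intercept.
    lam = 0 gives ordinary least squares."""
    a11 = sum(r[0] * r[0] for r in X) + lam
    a12 = sum(r[0] * r[1] for r in X)
    a22 = sum(r[1] * r[1] for r in X) + lam
    c1 = sum(r[0] * t for r, t in zip(X, y))
    c2 = sum(r[1] * t for r, t in zip(X, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)

def fitted(X, b):
    """Fitted values X b."""
    return [r[0] * b[0] + r[1] * b[1] for r in X]

def rss(X, y, b):
    """Residual sum of squares on the training data."""
    return sum((t - f) ** 2 for t, f in zip(y, fitted(X, b)))

# Data from the example above.
X = [[-2.0, -2.5], [-1.0, -0.5], [0.0, 0.5], [3.0, 7.5]]
y = [-6.25, -3.25, 1.75, 7.75]

b_ls = ridge_2d(X, y, 0.0)      # least squares
b_ridge = ridge_2d(X, y, 1.0)   # lambda = 1, matching eye(2) in the MATLAB call

print([round(v, 4) for v in b_ls])      # [3.2857, -0.25]
print([round(v, 4) for v in b_ridge])   # [2.0511, 0.294]
print(round(rss(X, y, b_ls), 2), round(rss(X, y, b_ridge), 3))  # 3.67 6.042
```

Changing the third argument of ridge_2d shows how the coefficients move as lambda grows.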