$$\hat{b} = (X^T X)^{-1} X^T y$$

Note that the least-squares estimates of the coefficients $b$ are a linear function of $y$.

$$\hat{y} = X \hat{b}$$
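The linearity claim can be checked numerically. The following is a minimal sketch in plain Python, not a definitive implementation: the helper names `solve` and `least_squares` and the toy data are assumptions made for illustration and do not come from the notes. It solves the normal equations $(X^T X) b = X^T y$ and confirms that the estimate for $y_1 + y_2$ is the sum of the estimates for $y_1$ and $y_2$.

```python
# Least squares via the normal equations, in plain Python (no external libraries).
# The data and helper names below are hypothetical, chosen only for illustration.

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [rhs[i]] for i in range(n)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        z[k] = (M[k][n] - sum(M[k][c] * z[c] for c in range(k + 1, n))) / M[k][k]
    return z

def least_squares(X, y):
    """Return b_hat solving (X^T X) b = X^T y."""
    cols = list(zip(*X))                             # columns of X
    XtX = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve(XtX, Xty)

# Hypothetical design matrix (a column of ones plus one predictor) and two responses.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y1 = [1.0, 3.0, 2.0, 5.0]
y2 = [0.0, 1.0, 0.0, -1.0]

b1 = least_squares(X, y1)
b2 = least_squares(X, y2)
b_sum = least_squares(X, [u + v for u, v in zip(y1, y2)])

# Linearity in y: the two printed vectors agree up to rounding error.
print(b_sum)
print([u + v for u, v in zip(b1, b2)])
```

Solving the normal equations directly is fine for a small example like this one; for ill-conditioned problems a QR-based solver is usually preferred.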
Gauss-Markov theorem: Among all estimators of b that are unbiased and linear functions of y, the least-squares solution has smallest variance.
Corollary: For any vector $a$, $a^T \hat{b}$ has the same property as an estimator of $a^T b$.
However, as we just saw, smallest variance does not imply smallest mean-squared error. Moreover, MSE and prediction accuracy are essentially the same:
$$E(Y - \hat{f}(x))^2 = \sigma^2 + E(x^T \hat{b} - f(x))^2 = \sigma^2 + \mathrm{MSE}(\hat{f}(x)).$$

From a machine learning point of view, minimizing MSE, not variance, is what we really want to do.
Note that we are still assuming our model is correct, so sigma^2 is fixed for all x and the error in f hat(x) just comes from not enough training data, not from the model being wrong.
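The identity relating prediction error to MSE can be derived in a few lines. This sketch assumes the usual setup, which the notes do not spell out here: $Y = f(x) + \varepsilon$ with $E\varepsilon = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, where the noise $\varepsilon$ in the new observation is independent of the training data and therefore of $\hat{f}(x)$.

```latex
\begin{align*}
E\big(Y - \hat{f}(x)\big)^2
  &= E\big(\varepsilon + f(x) - \hat{f}(x)\big)^2 \\
  &= E\varepsilon^2
     + 2\,E[\varepsilon]\,E\big[f(x) - \hat{f}(x)\big]
     + E\big(f(x) - \hat{f}(x)\big)^2 \\
  &= \sigma^2 + 0 + \mathrm{MSE}\big(\hat{f}(x)\big), \\[4pt]
\mathrm{MSE}\big(\hat{f}(x)\big)
  &= \mathrm{Var}\big(\hat{f}(x)\big)
     + \big(E\hat{f}(x) - f(x)\big)^2 .
\end{align*}
```

The last line is the standard variance plus squared bias decomposition. It shows why an unbiased estimator with the smallest variance need not have the smallest MSE: a biased estimator can win if its variance is low enough.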
Additional note: Suppose we get to choose the placement of the training x values? Where should we put them?
Terminology: Dimensionality reduction, feature selection, feature construction, feature extraction, basis function selection; supervised versus unsupervised approaches.
So-called "shrinkage" methods reduce the impact of predictors in a smooth way, by reducing the magnitude of their coefficients. The most common method is called ridge regression.
How do you choose the best amount of shrinkage? Minimize an estimate of the average prediction error, e.g. with cross-validation.
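As a rough illustration of choosing the amount of shrinkage by cross-validation, the sketch below runs leave-one-out cross-validation over a small grid of shrinkage values. It uses the four-row, two-predictor data set from the worked example in these notes and the closed-form ridge solution $(X^T X + \lambda I)^{-1} X^T y$ given there. The function names `ridge_fit` and `loocv_error` and the candidate grid are assumptions made for this example. With only four observations the error estimates are very noisy, so this shows the mechanics rather than a trustworthy choice.

```python
# Leave-one-out cross-validation over a grid of shrinkage values (plain Python).
# Data: the two-predictor example from these notes, fitted without an intercept.
X = [[-2.0, -2.5], [-1.0, -0.5], [0.0, 0.5], [3.0, 7.5]]
y = [-6.25, -3.25, 1.75, 7.75]

def ridge_fit(X, y, lam):
    """Two-predictor ridge solution (X^T X + lam*I)^-1 X^T y via the 2x2 inverse."""
    a = sum(r[0] * r[0] for r in X) + lam
    c = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    g0 = sum(r[0] * t for r, t in zip(X, y))
    g1 = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - c * c
    return [(d * g0 - c * g1) / det, (a * g1 - c * g0) / det]

def loocv_error(X, y, lam):
    """Average squared error when each row is predicted from the other rows."""
    total = 0.0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        b = ridge_fit(X_train, y_train, lam)
        pred = X[i][0] * b[0] + X[i][1] * b[1]
        total += (y[i] - pred) ** 2
    return total / len(X)

grid = [0.0, 0.1, 1.0, 10.0]          # hypothetical candidate shrinkage values
errors = {lam: loocv_error(X, y, lam) for lam in grid}
best_lam = min(errors, key=errors.get)  # value with the smallest estimated error

for lam in grid:
    print(lam, errors[lam])
print("chosen:", best_lam)
```

In practice the grid is usually spaced logarithmically, and k-fold cross-validation replaces leave-one-out when the data set is large.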
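Ridge regression minimizes a modified residual sum of squares (RSS) that adds a penalty on the size of the coefficients:

$$\sum_i \Big( y_i - b_0 - \sum_{j \ge 1} x_{ij} b_j \Big)^2 + \lambda \sum_{j \ge 1} b_j^2$$

The bigger $\lambda$ is, the more the shrinkage. For neural networks, the same idea is called weight decay.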
The effect of shrinkage depends on the size of the x values, so usually these are standardized to have unit variance. They are also standardized to have zero mean, so that b0 can be estimated separately, without shrinkage.
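A minimal sketch of this preprocessing step, using only the Python standard library. The function name `standardize` is an assumption, as is the choice of the population standard deviation (`pstdev`) over the sample version. The data are the predictor matrix and response from the worked example in these notes. Once the predictors are centered, the unpenalized intercept estimate is simply the mean of y.

```python
from statistics import mean, pstdev

# Predictors and response from the worked example in these notes.
X = [[-2.0, -2.5], [-1.0, -0.5], [0.0, 0.5], [3.0, 7.5]]
y = [-6.25, -3.25, 1.75, 7.75]

def standardize(X):
    """Center each column to mean 0 and scale it to unit (population) variance."""
    cols = list(zip(*X))
    means = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    Z = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]
    return Z, means, sds

Z, means, sds = standardize(X)

# With centered predictors the intercept is estimated separately, without shrinkage.
b0 = mean(y)

print("column means before centering:", means)   # the second column's mean is 1.25, not 0
print("intercept estimate b0:", b0)
```

The same `means` and `sds` must be reused to transform any new x before prediction; recomputing them on test data would change the scale the coefficients were fitted on.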
Mathematically, minimizing the modified RSS is equivalent to minimizing the regular RSS

$$\sum_i \Big( y_i - b_0 - \sum_{j \ge 1} x_{ij} b_j \Big)^2$$

subject to the constraint that $\sum_{j \ge 1} b_j^2 \le s$ for some constant $s$. This means that correlated predictors that cancel each other out must be given small weights, rather than opposing large weights.
The standard least-squares solution is $\hat{b} = (X^T X)^{-1} X^T y$. For ridge regression, $b'' = (X^T X + \lambda I)^{-1} X^T y$.
As $\lambda$ tends to infinity, the coefficients $b''$ tend to zero.
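The closed-form solution and its limiting behavior can be checked on the two-predictor data set from the worked example in these notes. This is a sketch: the function name `ridge_fit` and the particular list of shrinkage values are assumptions made for illustration, and the 2x2 inverse is written out by hand to keep the code free of external libraries.

```python
# Ridge coefficients (X^T X + lam*I)^-1 X^T y for increasing lam (plain Python).
# Data: the two-predictor example from these notes, fitted without an intercept.
X = [[-2.0, -2.5], [-1.0, -0.5], [0.0, 0.5], [3.0, 7.5]]
y = [-6.25, -3.25, 1.75, 7.75]

def ridge_fit(X, y, lam):
    """Two-predictor ridge solution, using the explicit inverse of a 2x2 matrix."""
    a = sum(r[0] * r[0] for r in X) + lam      # (X^T X)[0][0] + lam
    c = sum(r[0] * r[1] for r in X)            # (X^T X)[0][1]
    d = sum(r[1] * r[1] for r in X) + lam      # (X^T X)[1][1] + lam
    g0 = sum(r[0] * t for r, t in zip(X, y))   # (X^T y)[0]
    g1 = sum(r[1] * t for r, t in zip(X, y))   # (X^T y)[1]
    det = a * d - c * c
    return [(d * g0 - c * g1) / det, (a * g1 - c * g0) / det]

lams = [0.0, 1.0, 10.0, 100.0, 1e6]            # hypothetical shrinkage values
fits = {lam: ridge_fit(X, y, lam) for lam in lams}

# lam = 0 reproduces least squares, about (3.2857, -0.2500);
# lam = 1 gives about (2.0511, 0.2940), the ridge values quoted in the notes.
for lam in lams:
    b = fits[lam]
    print(f"lambda={lam:g}  b=({b[0]:.4f}, {b[1]:.4f})")
```

The printed coefficient vectors get shorter as lambda grows, and at the largest value both entries are close to zero.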
If the columns of X are highly collinear, then some coefficients may be misleadingly negative. With ridge regression, as coefficients shrink towards zero, their signs become meaningful.
X =
   -2.0000   -2.5000
   -1.0000   -0.5000
         0    0.5000
    3.0000    7.5000

Note that I wanted the mean of each column to be zero, but I changed the (4,2) entry without restoring this condition. Here are some values for the dependent variable that should be easy to predict using either predictor:

y =
   -6.2500
   -3.2500
    1.7500
    7.7500

Here are the best estimates of y:

y hat =
   -5.9464
   -3.1607
   -0.1250
    7.9821

These y hat use the least-squares estimated coefficients:

b hat = inv(X'*X)*X'*y = 3.2857 -0.2500

Here are the coefficients using ridge regression:

b" = inv(X'*X+eye(2))*X'*y = 2.0511 0.2940

Now both predictors have positive coefficients, which reflects their individual correlation with y better. The new estimated y values are