CSE 291 LECTURE NOTES

February 24, 2005
   

THE LINEAR REGRESSION MODEL

Suppose we want to learn how a real-valued random variable Y depends on real-valued variables X1 ... Xp.  This is a multivariate scenario unlike the univariate situations we've seen so far.  The model we assume is linear:
E[Y|X]  =  f(X) =  b0 + SUM_j Xj * bj
Note that we assume there is randomness in Y, so we cannot say that Y is a function of the Xj, just that the expectation of Y is.  The parameters we want to estimate (also called coefficients) are the p+1 scalars bj.   We're interested in the value of Y conditioned on the Xj values, so we don't need to model possible randomness in the Xj.

We may assume further that the deviations in the observed y values are Gaussian with zero mean.  That is, we assume that the following model is true

Y  =  N(b0 + SUM_j Xj*bj), sigma^2)
In other words, all deviations in the observed values of Y follow  Y - E[Y | X]  =  epsilon ~ N(0, sigma^2).


LEAST-SQUARES ESTIMATION

Suppose we have N training examples (subscript i).  Each training example is a p-dimensional vector (subscript j) along with a scalar.  The components of a training example are called independent variables, features, attributes, or predictors.  These are all synonyms!

The most popular estimation method is called "least squares."  We pick the b vector to minimize

RSS(b)  =  SUM_i (f(xi) - yi)^2
"RSS" stands for "residual sum of squares."  Note that this minimization treats all the xi equally, and that it penalizes large deviations dramatically.  Background knowledge about the real-world scenario may imply that these choices are not appropriate.  But we will see soon that the estimates based on least-squares are the maximum likelihood estimates, and they are also the minimum-variance unbiased estimates.

For now let's look at least squares algorithmically and geometrically.


MINIMIZING RSS

Let X be a matrix where each row is one training example, and the first column is all ones, so the size of X is N by p+1.  Let y be the column vector of observed values for the dependent variable.  Let b a column vector of p+1 parameter values.  In matrix notation  E[y] = f(X) = Xb  and
RSS(b)  =  (y - Xb)T(y - Xb)
To explain this in detail:  Xb is a matrix times a column vector, so we take the dot-product of each row of X with b.  This gives a column vector of size N, which we can subtract from y, giving e = y-Xb.  Now we take the dot product of e with itself by first converting it into a row vector with the transpose operation, indicated by the superscript T.

Note that y and X are fixed and the result RSS(b) is a scalar function of the p+1 parameters that are components of the vector b.  To minimize RSS(b) we set its derivative to zero.  The derivative is

d/db RSS(b)  =  -2XT(y - Xb)
This result can be proved by going back to the non-matrix formulation and computing the derivative of the RSS sum using standard calculus.  The second derivative is
d/db -2XT(y - Xb)  =  -2XTX
We can solve the equation  -2XT(y - Xb) = 0  using the matrix inverse:
            2XTy  =  -2XT((-Xb)
(XTX)-1XTy  =  b
This solution is only valid if the inverse actually exists, i.e. the matrix XTX is non-singular.  Note that XTX is square,of size p+1 by p+1. 

 

LINEAR DEPENDENCE

Suppose the columns of X are not linearly independent, e.g. one feature is a linear combination of other features.  This will happen for example if we use a "one of n" coding for a feature that is intrinsically discrete.  Then XTX is singular and the least-squares coefficients b are not defined uniquely.  This makes sense: you can get equally good predictions for y from alternative linear functions of the input vector.  But the predicted values y hat are still uniquely defined. Linear dependence between columns will also happen when the number of rows (i.e. training examples) is less than the number of columns (i.e. features).


VARIANCE OF THE ESTIMATED COEFFICIENTS

Let's assume that the points X at which we make training observations are totally fixed.  The only randomness comes from the observed dependent values, y.  Assume that these are not correlated with each other and that they follow some distribution with variance sigma^2, regardless of x.  What is the variance of the estimated b, which is a function of y?
var(b hat)  =  var((XTX)-1XTy)  =   (XTX)-1sigma^2
Note that this variance is a variance-covariance matrix, of size p+1 by p+1.  The entry ij is the expected value of (bi - mui)(bj - muj)  where mui is the expected value of b1.

Of course, sigma^2 is usually unknown.  We can estimate it in an unbiased way as

1/(N - p - 1) SUM_i (yi - yi hat)^2
Remember that each yi has a different expectation.  This expectation is an unknown function of xi, but we can estimate it as yi hat.
 
 

BIAS AND DISTRIBUTION OF THE ESTIMATED COEFFICIENTS

Let's assume further that the linear model actually is correct (only its parameter values are unknown) and that the deviations in the observed y values are Gaussian with zero mean.  That is, we assume that the following model is true
Y  =  N(b0 + SUM_j Xj*bj), sigma^2)
In other words, all deviations in the observed values of Y follow  Y - E[Y | X]  =  epsilon ~ N(0, sigma^2).

In this case one can show that the least-squares estimates are unbiased and follow a multivariate Gaussian distribution:

b hat  ~  N(b, (XTX)-1sigma^2)
Also, the estimated residual sum of squares RSS follows a chi-squared distribution with N-p-1 degrees of freedom:
RSS  =  SUM_i (yi - yi hat)^2  ~  sigma^2 CHISQ(N-p-1)
so we get an unbiased estimate of the variance as RSS / N-p-1.
 
 

HYPOTHESIS TESTING

Often we have many candidate predictors and we want to know which ones to keep or omit.

Suppose the unrestricted model has p+1 parameters, and yields residual sum of squares RSS1 = SUM_i (yi - yi hat)^2.  Suppose the restricted (nested) model has q+1 parameters where q < p, and yields RSS0.  The F statistic is
(RSS0 - RSS1)/ (p - q)*sigma^2 hat
where sigma^2 hat  =  RSS1/ N-p-1.  This statistic follows an F distribution.  Not surprisingly, this tends towards a chi-squared distribution with p-q degrees of freedom when N increases.  The current homework assignment asks you to work out the corresponding likelihood ratio test.