E[Y|X] = f(X) = b0 + SUM_j Xj * bj
Note that we assume there is randomness in Y, so we cannot say that Y is a function of the Xj, just that the expectation of Y is. The parameters we want to estimate (also called coefficients) are the p+1 scalars bj. We're interested in the value of Y conditioned on the Xj values, so we don't need to model possible randomness in the Xj.
We may assume further that the deviations in the observed y values are Gaussian with zero mean. That is, we assume that the following model is true:
Y ~ N(b0 + SUM_j Xj*bj, sigma^2)
In other words, all deviations in the observed values of Y follow Y - E[Y | X] = epsilon ~ N(0, sigma^2).
Suppose we have N training examples (subscript i). Each training example consists of a p-dimensional vector xi (components indexed by subscript j) along with a scalar yi. The components of xi are called independent variables, features, attributes, or predictors. These are all synonyms!
The most popular estimation method is called "least squares." We pick the b vector to minimize
RSS(b) = SUM_i (f(xi) - yi)^2
"RSS" stands for "residual sum of squares." Note that this minimization treats all the xi equally, and that it penalizes large deviations dramatically. Background knowledge about the real-world scenario may imply that these choices are not appropriate. But we will see soon that the estimates based on least-squares are the maximum likelihood estimates, and they are also the minimum-variance unbiased estimates.
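As a minimal sketch of the RSS definition above, the following plain-Python snippet evaluates RSS(b) for two candidate parameter vectors. The function names `predict` and `rss` and the tiny data set are made up for illustration only.

```python
def predict(b, x):
    """Linear prediction f(x) = b0 + sum_j x_j * b_j; b has p+1 entries."""
    return b[0] + sum(xj * bj for xj, bj in zip(x, b[1:]))


def rss(b, xs, ys):
    """Residual sum of squares over all training examples."""
    return sum((predict(b, x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical training set with N = 4 examples and p = 1 feature.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 8.0]

rss_a = rss([1.0, 2.0], xs, ys)  # candidate line y = 1 + 2x
rss_b = rss([0.0, 2.0], xs, ys)  # same slope, worse intercept
print(rss_a, rss_b)
```

The first candidate misses only the last point, by 1, while the second is off on every point, so its RSS is larger. Squaring means the single residual of size 2 contributes as much as four residuals of size 1.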
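For now let's look at least squares algorithmically and geometrically. Write X for the N by p+1 matrix whose row i is the training vector xi preceded by a constant 1 (so that b0 is handled like the other parameters), and write y for the column vector of the N observed values yi. Then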
RSS(b) = (y - Xb)^T (y - Xb)
To explain this in detail: Xb is a matrix times a column vector, so we take the dot-product of each row of X with b. This gives a column vector of size N, which we can subtract from y, giving e = y - Xb. Now we take the dot product of e with itself by first converting it into a row vector with the transpose operation, indicated by the superscript T.
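The matrix form can be checked numerically against the summation form. This is only a sketch with made-up data; it assumes that X carries a leading column of ones so that the first component of b plays the role of b0.

```python
import numpy as np

# Hypothetical data: N = 4 examples, p = 1 feature.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 8.0])

# Prepend a column of ones so that b[0] acts as the intercept b0.
X = np.column_stack([np.ones(len(x)), x])  # shape (N, p+1)
b = np.array([0.0, 2.0])

e = y - X @ b        # residual vector of length N
rss_matrix = e @ e   # e^T e, a scalar

# The same quantity written as an explicit sum over training examples.
rss_sum = sum((b[0] + x[i] @ b[1:] - y[i]) ** 2 for i in range(len(y)))
print(rss_matrix, rss_sum)
```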
Note that y and X are fixed and the result RSS(b) is a scalar function of the p+1 parameters that are components of the vector b. To minimize RSS(b) we set its derivative to zero. The derivative is
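d/db RSS(b) = -2 X^T (y - Xb)
This result can be proved by going back to the non-matrix formulation and computing the derivative of the RSS sum using standard calculus. The second derivative is
d^2/(db db^T) RSS(b) = 2 X^T X
Since X^T X is positive semidefinite, a point where the first derivative vanishes is a minimum. Setting the first derivative to zero gives
2 X^T y = 2 X^T X b
(X^T X)^-1 X^T y = b
This solution is only valid if the inverse actually exists, i.e. the matrix X^T X is non-singular. Note that X^T X is square, of size p+1 by p+1.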
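The closed-form solution can be sketched in NumPy as follows. The simulated data, the seed, and the variable names are assumptions made for illustration. The sketch solves the linear system X^T X b = X^T y with `np.linalg.solve` rather than forming the inverse explicitly, which is generally the numerically safer choice, and then checks that the derivative vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # shape (N, p+1)
b_true = np.array([1.0, 2.0, -1.0, 0.5])                    # hypothetical parameters
y = X @ b_true + rng.normal(scale=0.3, size=N)

# Solve X^T X b = X^T y without computing the inverse of X^T X.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The derivative -2 X^T (y - X b) should be numerically zero at b_hat.
grad = -2 * X.T @ (y - X @ b_hat)

# Cross-check against NumPy's general least-squares routine.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat, b_lstsq)
```

If X^T X is singular, for example because two columns of X are identical, `np.linalg.solve` raises `LinAlgError`, whereas `np.linalg.lstsq` still returns one of the minimizers.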
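If the xi are treated as fixed and the yi are uncorrelated with common variance sigma^2, the variance of the estimate is
var(b hat) = var((X^T X)^-1 X^T y) = (X^T X)^-1 sigma^2
Note that this variance is a variance-covariance matrix, of size p+1 by p+1. The entry ij is the expected value of (bi hat - mui)(bj hat - muj) where mui is the expected value of bi hat.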
Of course, sigma^2 is usually unknown. We can estimate it in an unbiased way as
sigma^2 hat = 1/(N - p - 1) SUM_i (yi - yi hat)^2
Remember that each yi has a different expectation. This expectation is an unknown function of xi, but we can estimate it as yi hat.
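Putting the last two formulas together, a rough sketch of estimating sigma^2 and the variance-covariance matrix of the coefficient estimates might look like this. The data are simulated with a known sigma purely so the estimate can be compared against it; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 200, 2, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y
y_hat = X @ b_hat  # estimated expectation of each yi

# Unbiased estimate of sigma^2, dividing by N - p - 1 rather than N.
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)

# Estimated variance-covariance matrix of b_hat, of size (p+1) by (p+1).
cov_b_hat = XtX_inv * sigma2_hat

# Square roots of the diagonal are the usual standard errors.
std_errors = np.sqrt(np.diag(cov_b_hat))
print(sigma2_hat, std_errors)
```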
Now suppose again that the deviations are Gaussian, that is,
Y ~ N(b0 + SUM_j Xj*bj, sigma^2)
In other words, all deviations in the observed values of Y follow Y - E[Y | X] = epsilon ~ N(0, sigma^2).
In this case one can show that the least-squares estimates are unbiased and follow a multivariate Gaussian distribution:
b hat ~ N(b, (X^T X)^-1 sigma^2)
Also, the estimated residual sum of squares RSS follows a chi-squared distribution with N-p-1 degrees of freedom, scaled by sigma^2:
RSS = SUM_i (yi - yi hat)^2 ~ sigma^2 CHISQ(N-p-1)
Since a chi-squared distribution with k degrees of freedom has mean k, we get an unbiased estimate of the variance as RSS / (N-p-1).
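These distributional claims can be checked by simulation. The sketch below is an illustration under assumed settings (fixed design matrix, arbitrary true parameters, 5000 repetitions), not a proof. It redraws the Gaussian noise many times and compares the empirical mean and covariance of b hat, and the mean of RSS / (N-p-1), with the theoretical values.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 30, 2, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # design held fixed
b_true = np.array([0.5, 1.0, -1.0])                         # hypothetical parameters

trials = 5000
b_hats = np.empty((trials, p + 1))
s2 = np.empty(trials)
for t in range(trials):
    # Fresh Gaussian deviations for the same X on every repetition.
    y = X @ b_true + rng.normal(scale=sigma, size=N)
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    b_hats[t] = b_hat
    s2[t] = np.sum((y - X @ b_hat) ** 2) / (N - p - 1)

mean_b_hat = b_hats.mean(axis=0)          # should be close to b_true
emp_cov = np.cov(b_hats, rowvar=False)    # should be close to theory_cov
theory_cov = np.linalg.inv(X.T @ X) * sigma**2
mean_s2 = s2.mean()                       # should be close to sigma^2
print(mean_b_hat, mean_s2)
```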
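To test whether a group of coefficients can be dropped, compare the larger model with p+1 parameters, whose residual sum of squares is RSS1, against a nested smaller model with q+1 parameters, whose residual sum of squares is RSS0. The test statistic is
F = (RSS0 - RSS1) / ((p - q) * sigma^2 hat)
where sigma^2 hat = RSS1 / (N-p-1). Under the null hypothesis that the smaller model is correct, this statistic follows an F distribution with p-q and N-p-1 degrees of freedom. Not surprisingly, (p-q)F tends towards a chi-squared distribution with p-q degrees of freedom when N increases. The current homework assignment asks you to work out the corresponding likelihood ratio test.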
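A minimal sketch of computing this statistic follows. It assumes that RSS1 comes from the larger model with p+1 parameters and RSS0 from a nested model that keeps only the first q features plus the intercept. The helper `rss_of_fit` and the simulated data are hypothetical.

```python
import numpy as np


def rss_of_fit(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b_hat
    return resid @ resid


rng = np.random.default_rng(3)
N, p, q = 100, 4, 2
features = rng.normal(size=(N, p))
# Simulated data in which only the first q features actually matter.
y = 1.0 + features[:, 0] - 2.0 * features[:, 1] + rng.normal(size=N)

X1 = np.column_stack([np.ones(N), features])  # larger model, p+1 columns
X0 = X1[:, : q + 1]                           # nested model, q+1 columns

rss1 = rss_of_fit(X1, y)
rss0 = rss_of_fit(X0, y)
sigma2_hat = rss1 / (N - p - 1)
F = (rss0 - rss1) / ((p - q) * sigma2_hat)
print(F)
```

Because the smaller model is nested in the larger one, RSS0 can never be less than RSS1, so F is non-negative. A p-value would come from the upper tail of the F distribution with p-q and N-p-1 degrees of freedom, for instance via `scipy.stats.f.sf(F, p - q, N - p - 1)` if SciPy is available.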