CSE 291 LECTURE NOTES

Tuesday February 8, 2005


LARGE-SAMPLE MAXIMUM-LIKELIHOOD

Let p*(xi,theta) be the distribution followed by a single element of an iid sample of size n.  We have  log p(x,theta) = SUM_i log p*(xi,theta)

We are going to prove that the MLE, i.e. the theta that maximizes the above sum, is essentially an ideal estimator as n tends to infinity.  More precisely, for large n with high probability (a) the MLE is close to the true theta, and (b) the variance of the MLE is close to the Cramer-Rao lower bound.

The qualifier "for large n with high probability" means a somewhat complicated statement about a certain probability tending to 1 as n tends to infinity.

Note on notation:  The superscript * on a likelihood function, and/or the subscript i on x, both indicate that we're talking about a single one of the iid observations.


PROPERTIES OF THE SCORE FUNCTION

Lemma:  For any fixed value of theta, the score function has zero mean.
Corollary:  The variance of the score function is the expected value of its square:  var[s(xi,theta)]  =  E_theta [ s(xi,theta)^2 ] .
Definition (reminder):  The "Fisher information" is  I = var[s(xi,theta)].

Lemma:  I = E[- d/d theta s(xi,theta)], i.e. minus the second derivative of the log likelihood.
Proof:
d2/d theta2 log p  =  d/d theta [1/p d p/d theta]  =  -1/p2 [d p/d theta]2 + 1/p d2 p/ d theta2.

Consider the first term, -1/p2 [d p/d theta]2 .  This is simply   - [d/d theta log p]2.

Consider the expectation of the second term.  This is E[ 1/p d2 p/ d theta2 ]  =  INTx 1/p d2 p/ d theta2 p dx
   =  INTx d2 p/ d theta2 dx  =  d2 p/ d theta2 INT p dx  =  d2 p/ d theta2 1  =  0.


WEAK LAW OF LARGE NUMBERS

The weak law of large numbers says that when n is large, for most iid samples (x1 ... xn) the observed average of f(xi) is close to its expected value.  This is true for any function f with finite variance and any distribution of the individual xi.

Theorem:  Let X1 ... Xn be iid random variables with mean mu and finite non-zero variance sigma^2.  Let Sn be X1 + ... + Xn.  Then for any epsilon > 0,  P( | S_n/n - mu | <= epsilon ) tends to one as n tends to infinity.

The statement above is one possible formalization of "for large n with high probability."  Expanding the "tends to" notation gives for every epsilon > 0, for every delta > 0, there exists m, for all n > m, P( | S_n/n - mu | <= epsilon ) >= 1-delta.

Note that we can apply the theorem to any derived random variables, i.e. Y1 = f(X1), Y2 = f(X2), etc.  However we do need f(X) to have finite variance, and some f(X) may not have finite variance even if X itself does.


FIRST LEMMA

Lemma:  With high probability, the MLE theta hat is near the true theta.
Proof (with missing steps): 

Let's apply the law of large numbers to the function f(xi) = log p*(xi,theta),  for any theta.

Let the corresponding expected value be  z(theta) = E[ log p*(x,theta)].  Reminder from the first lemma: this expectation is an integral over xi following the true theta.

The law of large numbers says that for large n, for most x, the observed average, i.e. (1/n) SUM log p*(xi,theta), is close to the expected value z(theta).

Remember, we don't know what the true theta is, so we consider all possible theta. 

Suppose this is true uniformly for all theta, i.e.  (1/n) SUM log p*(xi,theta)  is within the same epsilon of z(theta) for all theta.  Let g(x) = argmax_theta (1/n) SUM log p*(xi,theta). 

Then g(x) will be in the set of theta such that z(theta) is within two epsilon of z(theta_0).  (This depends on another lemma we have not proved yet.)
 
Now suppose that z(theta_0) - z(theta) is small if and only if theta_0 - theta is small.  In this case, g(x) will be close to theta_0.


CENTRAL LIMIT THEOREM

Theorem [central limit theorem]:  Let Y1 ... Yn be iid random variables with mean mu and finite non-zero variance sigma^2.  Let Sn be Y1 + ... + Yn.  Then the limit as n tends to infinity of  P( (Sn - n*mu)/(sigma*sqrt(n)) <= z ) is Phi(z)  where z ~ N(0,1).

In other words, (Yn - n*mu)/(sigma*sqrt(n))  tends towards a N(0,1) distribution.  This means that  Sn  tends towards a N(n*mu,n*sigma2) distribution.

Application:  Consider  (1/sqrt(n)) SUM s(xi,theta).  We know that the expectation of Yi = s(xi,theta) is 0, and its variance is I.  Therefore, (1/sqrt(n)) SUM s(xi,theta) tends towards a N(0,I) distribution.


TAYLOR EXPANSION

From efunda.com:  If a function  has continuous derivatives up to (n+1)th order, then this function can be expanded in the following fashion:

PROOF OF MAIN THEOREM

The proof uses the following: Because g(x) = theta_hat maximizes the log likelihood, we know  0 = s(x,theta hat).  The first lemma tells us that theta hat is near the true theta.  Hence we can do a Taylor expansion around theta and apply it to theta hat:
s(x,theta hat)   =  s(x,theta) + (theta hat - theta) d/d theta s(x,theta) + remainder(x,theta,theta hat)
where the remainder involves (theta hat - theta)^2, which is order-of-magnitude smaller than the first-order term.

Now we rearrange the equation:

(theta hat - theta) = - s(x,theta) / d/d theta s(x,theta).
Multiply by sqrt(n) on both sides and by 1/n top and bottom on the right:
sqrt(n) (theta hat - theta) =  1/sqrt(n) s(x,theta) / -1/n d/d theta s(x,theta)
The numerator is a sum of individual score functions:
 1/sqrt(n) s(x,theta)  =  1/sqrt(n) SUM s(xi,theta)
By the application above of the central limit theorem, this has Gaussian distribution N(0, I_theta) approximately.

Now consider the denominator:  -1/n d/d theta s(x,theta)  =  -1/n SUM d/d theta s(xi,theta).

We showed before that the expectation of -d/d theta s(xi,theta) is I.  So by the weak law of large numbers, the denominator is approximately equal to I.

Moving the denominator to the left, we have that  I*sqrt(n) (theta hat - theta)  follows approximately an N(0,I) distribution.  Therefore

theta hat ~ N(theta, I/(I2n))  =  N(theta,1/nI).
This says that the variance of theta hat is approximately the Cramer-Rao lower bound, i.e. 1/nI.  Note that I is the Fisher information content of a single observation xi, while nI is the Fisher information content of the entire training set of size n.