CSE 291 LECTURE NOTES
Tuesday February 8, 2005
LARGE-SAMPLE MAXIMUM-LIKELIHOOD
Let p*(xi,theta) be the distribution followed by a single element of an
iid sample of size n. We have log p(x,theta) = SUM_i
log p*(xi,theta)
We are going to prove that the MLE, i.e. the theta that maximizes
the above sum, is essentially an ideal estimator
as n tends to infinity. More precisely, for large n with high
probability (a)
the MLE is close to the true theta, and (b) the variance of the MLE is
close to the Cramer-Rao lower bound.
The qualifier "for large n with high probability" means a somewhat
complicated statement about a certain probability tending to 1 as n
tends to infinity.
Note on notation: The superscript * on a likelihood function,
and/or the subscript i on x, both indicate that we're talking about a
single one of the iid observations.
PROPERTIES OF THE SCORE FUNCTION
Lemma: For any fixed value of theta, the score function has zero
mean.
Corollary: The variance of the score function is the
expected value of its square: var[s(xi,theta)] =
E_theta
[ s(xi,theta)^2 ]
.
Definition (reminder): The "Fisher information" is I =
var[s(xi,theta)].
Lemma: I = E[- d/d theta s(xi,theta)], i.e. minus the
second derivative of the log likelihood.
Proof:
d2/d theta2 log p = d/d theta [1/p d
p/d theta] = -1/p2 [d p/d theta]2 +
1/p d2 p/ d theta2.
Consider the first term, -1/p2 [d p/d theta]2 .
This is simply - [d/d theta log p]2.
Consider the expectation of the second term. This is E[ 1/p d2
p/ d theta2 ] =
INTx 1/p d2 p/ d theta2 p dx
= INTx d2 p/ d theta2
dx
= d2 p/ d theta2 INT p dx = d2
p/ d theta2 1 = 0.
WEAK LAW OF LARGE NUMBERS
The weak
law
of large numbers says that when n is large, for most iid samples
(x1
... xn) the observed average of f(xi) is close to its expected
value. This is true for any function f with finite variance and
any distribution of the individual xi.
Theorem: Let X1 ... Xn be iid
random variables with mean mu and finite non-zero variance
sigma^2. Let Sn be X1 + ... + Xn. Then for any epsilon >
0, P( | S_n/n - mu | <= epsilon ) tends to one as n tends to
infinity.
The statement above is one possible formalization of "for large n with
high probability." Expanding the "tends to" notation gives for
every epsilon >
0, for every delta > 0, there exists m, for all n > m, P( | S_n/n
- mu | <= epsilon ) >= 1-delta.
Note that we can apply the theorem to any derived random variables,
i.e. Y1 = f(X1), Y2 = f(X2), etc. However we do need f(X) to have
finite variance, and some f(X) may not have finite variance even if X
itself does.
FIRST LEMMA
Lemma: With high probability,
the MLE theta hat is near the true theta.
Proof (with missing steps):
Let's apply the law of large numbers to the function f(xi) = log
p*(xi,theta), for any theta.
Let the corresponding expected value be z(theta) = E[ log
p*(x,theta)]. Reminder from the first lemma: this expectation is
an integral over xi
following the true theta.
The law of large numbers says that for large n, for most
x,
the observed average, i.e. (1/n) SUM log p*(xi,theta), is close to the
expected
value z(theta).
Remember, we don't know what the true theta is, so we consider all
possible
theta.
Suppose this is true uniformly for all theta,
i.e. (1/n) SUM log p*(xi,theta) is within the same epsilon of
z(theta) for all
theta.
Let g(x) = argmax_theta (1/n) SUM log p*(xi,theta).
Then g(x)
will be in
the
set of theta such that z(theta) is within two epsilon of
z(theta_0). (This depends on another lemma we have not proved
yet.)
Now suppose that z(theta_0) - z(theta) is small if and only if
theta_0 - theta is small. In this case, g(x) will be close to
theta_0.
CENTRAL LIMIT THEOREM
Theorem [central limit theorem]: Let Y1 ... Yn be iid random
variables with mean mu and finite non-zero variance sigma^2. Let
Sn be Y1 + ... + Yn. Then the limit as n tends to infinity
of P( (Sn - n*mu)/(sigma*sqrt(n)) <= z ) is Phi(z) where
z ~ N(0,1).
In other words, (Yn - n*mu)/(sigma*sqrt(n)) tends towards a
N(0,1) distribution. This means that Sn tends towards
a N(n*mu,n*sigma2) distribution.
Application: Consider (1/sqrt(n)) SUM s(xi,theta). We
know that the expectation of Yi = s(xi,theta) is 0, and its variance is
I. Therefore, (1/sqrt(n)) SUM s(xi,theta) tends towards a N(0,I)
distribution.
TAYLOR EXPANSION
From efunda.com: If a
function
has continuous
derivatives up to (n+1)th order, then this function
can be expanded in the following fashion:

PROOF OF MAIN THEOREM
The proof uses the following:
- (1) The Taylor expansion of the score function around theta.
- (2) The central
limit theorem applied to s(x,theta)/sqrt(n), as done above.
- (3) A law
of large numbers applied to 1/n
d/d theta s(x,theta) (the second derivative of the log likelihood).
Because g(x) = theta_hat maximizes the log likelihood, we know 0
=
s(x,theta hat). The first lemma tells us that
theta hat is near the true theta. Hence we can do a Taylor
expansion around theta and apply it to theta hat:
s(x,theta hat) = s(x,theta) + (theta hat
- theta) d/d theta s(x,theta) + remainder(x,theta,theta hat)
where the remainder involves (theta hat - theta)^2, which is
order-of-magnitude smaller than the first-order term.
Now we rearrange the equation:
(theta hat - theta) = - s(x,theta) / d/d theta s(x,theta).
Multiply by sqrt(n) on both sides and by 1/n top and bottom on the
right:
sqrt(n) (theta hat - theta) = 1/sqrt(n) s(x,theta) /
-1/n
d/d theta s(x,theta)
The numerator is a sum of individual score functions:
1/sqrt(n) s(x,theta) = 1/sqrt(n) SUM
s(xi,theta)
By the
application above of the central limit theorem, this has
Gaussian distribution N(0, I_theta) approximately.
Now consider the denominator: -1/n d/d theta s(x,theta)
= -1/n SUM d/d theta
s(xi,theta).
We showed before that the expectation of -d/d theta s(xi,theta) is
I. So by the weak law of large numbers, the denominator is
approximately equal to I.
Moving the denominator to the left, we have that I*sqrt(n)
(theta hat - theta) follows approximately an N(0,I)
distribution. Therefore
theta hat ~ N(theta, I/(I2n)) =
N(theta,1/nI).
This says that the variance of theta hat is approximately the
Cramer-Rao lower
bound, i.e. 1/nI. Note that I is the Fisher information content
of a
single observation xi, while nI is the Fisher information content of
the entire training set of size n.