CSE 291 LECTURE NOTES

Thursday February 10, 2005



WEAK LAW OF LARGE NUMBERS

Theorem:  Let X_1 ... X_n be iid random variables with mean mu and finite non-zero variance sigma^2.  Let S_n be X_1 + ... + X_n.  Then for any epsilon > 0,  P( | S_n/n - mu | <= epsilon ) tends to one as n tends to infinity.

This type of convergence is called "convergence in probability to a constant."
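
Illustration:  The theorem can be checked numerically.  The sketch below is a minimal simulation, assuming for illustration that the X_i are Uniform(0,1), so mu = 0.5.  The function name fraction_within, the number of trials, and the seed are illustrative choices, not part of the theorem.

```python
import random

def fraction_within(n, epsilon, trials=1000, seed=0):
    """Estimate P(|S_n/n - mu| <= epsilon) by simulation.

    Assumption for illustration: the X_i are Uniform(0,1), so mu = 0.5.
    """
    rng = random.Random(seed)
    mu = 0.5
    hits = 0
    for _ in range(trials):
        s_n = sum(rng.random() for _ in range(n))  # S_n = X_1 + ... + X_n
        if abs(s_n / n - mu) <= epsilon:
            hits += 1
    return hits / trials

# The estimated probability should climb towards one as n grows.
for n in (10, 100, 1000):
    print(n, fraction_within(n, epsilon=0.05))
```

The printed fractions are Monte Carlo estimates, so they fluctuate with the seed and the number of trials.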


CENTRAL LIMIT THEOREM

Theorem [central limit theorem]:  Let Y_1 ... Y_n be iid random variables with mean mu and finite non-zero variance sigma^2.  Let S_n be Y_1 + ... + Y_n.  Then for every real z, the limit as n tends to infinity of  P( (S_n - n*mu)/(sigma*sqrt(n)) <= z ) is Phi(z),  where Phi is the cumulative distribution function of the N(0,1) distribution.

This type of convergence is called "convergence in distribution."

In other words, (S_n - n*mu)/(sigma*sqrt(n))  tends towards a N(0,1) distribution.  Intuitively, this means that  S_n  is approximately N(n*mu, n*sigma^2) for large n.  However, convergence has to be towards something fixed, and N(n*mu, n*sigma^2) changes with n, so formally we have to say that (S_n - n*mu)/(sigma*sqrt(n))  tends towards  N(0,1).
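
Illustration:  A minimal simulation sketch of convergence in distribution, assuming for illustration that the Y_i are Exponential(1), so mu = 1 and sigma = 1.  It compares the empirical probability that the standardized sum is at most z with Phi(z).  The function names and the trial count are illustrative choices.

```python
import math
import random

def phi(z):
    """CDF of the N(0,1) distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_cdf_of_standardized_sum(n, z, trials=2000, seed=0):
    """Estimate P((S_n - n*mu)/(sigma*sqrt(n)) <= z) by simulation.

    Assumption for illustration: Y_i ~ Exponential(1), so mu = sigma = 1.
    """
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0
    count = 0
    for _ in range(trials):
        s_n = sum(rng.expovariate(1.0) for _ in range(n))
        if (s_n - n * mu) / (sigma * math.sqrt(n)) <= z:
            count += 1
    return count / trials

for z in (-1.0, 0.0, 1.0):
    print(z, empirical_cdf_of_standardized_sum(200, z), phi(z))
```

The two printed columns should be close but not equal, since n is finite and the first column is a Monte Carlo estimate.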


SLUTSKY'S THEOREM

Theorem [Slutsky]:  Let X_n and Y_n be sequences of random variables.  Suppose X_n tends to X in distribution, and Y_n tends to the constant b in probability. Then
X_n + Y_n tends to X + b in distribution, and
X_n * Y_n tends to bX in distribution.
Proof:  Omitted.

Intuitively, Slutsky's theorem says that the influence of Y_n on X_n is that of a constant, if Y_n tends to a constant.

Example:  Suppose sqrt(n)(X bar_n - mu)/sigma is asymptotically N(0,1), but the true variance sigma^2 is unknown.  Let S_n^2 be our estimator of the variance.  Suppose this estimator is unbiased (or at least asymptotically unbiased) and its variance tends to zero, so it tends to the true sigma^2 in probability.  Then sigma/S_n tends to the constant 1 in probability.  Applying the theorem to the product of sqrt(n)(X bar_n - mu)/sigma and sigma/S_n, the distribution of sqrt(n)(X bar_n - mu)/S_n is asymptotically N(0,1).
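
Illustration:  A minimal simulation sketch of this example, assuming for illustration Uniform(0,1) data (mu = 0.5) and the usual unbiased sample variance with divisor n - 1.  The function names studentized_mean and coverage are illustrative.

```python
import math
import random

def studentized_mean(sample, mu):
    """Return sqrt(n) * (sample mean - mu) / S_n, where S_n^2 is the
    unbiased sample variance (divisor n - 1)."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    return math.sqrt(n) * (xbar - mu) / math.sqrt(s2)

def coverage(n, z, trials=2000, seed=0):
    """Estimate P(studentized mean <= z) for Uniform(0,1) data (mu = 0.5)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if studentized_mean(sample, 0.5) <= z:
            count += 1
    return count / trials

# If the studentized mean is close to N(0,1), this estimate should be
# close to Phi(1.645), which is about 0.95.
print(coverage(n=200, z=1.645))
```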


MAIN THEOREM

Theorem:  Let p*(xi,theta) be the distribution followed by a single element of an iid sample of size n, and let theta hat be the MLE of theta.  Then
sqrt(n) * (theta hat - theta_0)  tends in distribution to  N(0, 1/I)
where theta_0 is the true theta and I is the Fisher information.

Proof:  Because g(x) = theta hat maximizes the log likelihood, we know  0 = s(x,theta hat).  The first lemma tells us that theta hat is near the true theta.  Hence we can do a Taylor expansion around the true theta (written theta below, i.e. theta_0) and apply it to theta hat:
s(x,theta hat)   =  s(x,theta) + (theta hat - theta) d/d theta s(x,theta) + remainder(x,theta,theta hat)
where the remainder involves (theta hat - theta)^2, which is of smaller order than the first-order term.

Now we set the left side to zero, drop the remainder, and rearrange the equation:

(theta hat - theta) = - s(x,theta) / [ d/d theta s(x,theta) ]
Multiply by sqrt(n) on both sides and by 1/n top and bottom on the right:
sqrt(n) (theta hat - theta) =  [ (1/sqrt(n)) s(x,theta) ] / [ -(1/n) d/d theta s(x,theta) ]
The numerator is a sum of individual score functions:
 1/sqrt(n) s(x,theta)  =  1/sqrt(n) SUM s(xi,theta)
We know that the expectation of Yi = s(xi,theta) is 0, and its variance is I.  Therefore by the central limit theorem, (1/sqrt(n)) SUM s(xi,theta)  tends in distribution to  N(0,I).

Now consider the denominator:  -1/n d/d theta s(x,theta)  =  -1/n SUM d/d theta s(xi,theta).  We showed before that the expectation of -d/d theta s(xi,theta) is I.  So by the weak law of large numbers, the denominator tends in probability to the constant I.
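
Numerical aside:  The three facts used for the numerator and denominator (mean of the score is 0, variance of the score is I, expectation of minus the derivative of the score is I) can be checked for a concrete model.  The sketch below assumes, purely for illustration, a Bernoulli(theta) model, for which I = 1/(theta(1 - theta)).  The function names are illustrative.

```python
import random

def score(x, theta):
    """Score of one Bernoulli(theta) observation x in {0, 1}:
    d/d theta of x*log(theta) + (1 - x)*log(1 - theta)."""
    return x / theta - (1 - x) / (1 - theta)

def score_derivative(x, theta):
    """Derivative of the score with respect to theta."""
    return -x / theta ** 2 - (1 - x) / (1 - theta) ** 2

def check_moments(theta=0.3, n=100000, seed=0):
    """Return (mean of score, variance of score,
    mean of minus the score derivative, exact Fisher information)."""
    rng = random.Random(seed)
    xs = [1 if rng.random() < theta else 0 for _ in range(n)]
    scores = [score(x, theta) for x in xs]
    mean_score = sum(scores) / n
    var_score = sum((s - mean_score) ** 2 for s in scores) / n
    mean_neg_deriv = -sum(score_derivative(x, theta) for x in xs) / n
    fisher = 1.0 / (theta * (1.0 - theta))
    return mean_score, var_score, mean_neg_deriv, fisher

# The first value should be near 0; the second and third near the fourth.
print(check_moments())
```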

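By Slutsky's theorem, we may treat the denominator as the constant I in the limit.  Moving it to the left, we have that  I*sqrt(n) (theta hat - theta)  tends in distribution to N(0,I).  Dividing by I divides the variance by I^2, which gives I/I^2 = 1/I.  Therefore  sqrt(n) * (theta hat - theta_0)  tends in distribution to  N(0, 1/I), which is what we wanted to prove.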

We can say informally that  theta hat  is approximately distributed as N(theta_0, I/(I^2 n))  =  N(theta_0, 1/(nI)).  This says that the variance of theta hat is approximately the Cramer-Rao lower bound, i.e. 1/(nI).

Remember that I is the Fisher information content of a single observation xi, while nI is the Fisher information content of the entire training set of size n.
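
Illustration:  A minimal simulation sketch of the main theorem, assuming for illustration the exponential model with density theta * exp(-theta * x).  For this model the MLE is 1 / (sample mean) and the Fisher information of one observation is I = 1/theta^2, so the limiting variance 1/I is theta_0^2.  The function names, the sample size, and the trial count are illustrative choices.

```python
import math
import random
import statistics

def mle_rate(sample):
    """MLE of theta for the density theta * exp(-theta * x): 1 / sample mean."""
    return len(sample) / sum(sample)

def scaled_errors(theta0, n, trials=2000, seed=0):
    """Simulate sqrt(n) * (theta hat - theta0) over many iid samples of size n."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # random.expovariate takes the rate parameter, here theta0.
        sample = [rng.expovariate(theta0) for _ in range(n)]
        errors.append(math.sqrt(n) * (mle_rate(sample) - theta0))
    return errors

theta0 = 2.0
errors = scaled_errors(theta0, n=400)
print("empirical variance of sqrt(n)(theta hat - theta0):", statistics.variance(errors))
print("1/I = theta0^2:", theta0 ** 2)
```

The two printed numbers should be close for large n, up to Monte Carlo error and the small finite-sample bias of the MLE.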

 

THE LOGIC OF TESTING A HYPOTHESIS

Suppose we have a family P_theta of possible probability distributions, where theta is in Theta.  Let's say we have a null hypothesis that is a subset Omega of Theta.

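Idea:  Given x, find the best-guess distribution inside Theta and also inside Omega.  Each of these gives a maximum likelihood.  Look at the ratio lambda(x) of the maximum likelihood over Theta to the maximum likelihood over Omega.  Because Omega is a subset of Theta, the maximum over Theta is at least the maximum over Omega, so  lambda(x) >= 1.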

We make decisions using a threshold k.  We reject the null hypothesis Omega if and only if lambda(x) > k.  We choose k so that

sup_{theta in Omega} P_theta(lambda(x) > k)  =  alpha
where alpha is called the significance level of the test.

Notes:

  1. Sometimes the null hypothesis Omega is a single point, e.g. mean = 0.
  2. We use sup instead of max above because the set Omega may be open, so it has a supremum but no maximum.
  3. If we have a sufficient statistic t(x), we can compute the likelihood ratio using just this, without needing x itself.
  4. Often, lambda(x) is an increasing function of t(x), so lambda(x) > k iff  t(x) > k'  for a corresponding threshold k'.
  5. If x has discrete values only, we may not be able to choose k so that the equality with alpha holds exactly.
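
Illustration:  A minimal worked sketch of the test above, assuming for illustration that the x_i are iid N(theta,1) with known unit variance and that the null hypothesis Omega is the single point theta = 0.  In this case lambda(x) = exp(n * xbar^2 / 2), which is an increasing function of t(x) = sqrt(n)*|xbar|, so lambda(x) > k iff t(x) > k' with k = exp(k'^2 / 2).  The function names are illustrative.

```python
import math
import random
from statistics import NormalDist

def likelihood_ratio(sample):
    """lambda(x) for N(theta,1) data with null theta = 0.

    The maximum over Theta is attained at theta = xbar, which gives
    lambda(x) = exp(n * xbar^2 / 2) >= 1.
    """
    n = len(sample)
    xbar = sum(sample) / n
    return math.exp(n * xbar ** 2 / 2)

def threshold(alpha):
    """Threshold k with P_0(lambda(x) > k) = alpha, found via
    t(x) = sqrt(n)*|xbar|, which is |N(0,1)| under the null."""
    k_prime = NormalDist().inv_cdf(1 - alpha / 2)
    return math.exp(k_prime ** 2 / 2)

def rejection_rate(n, alpha, trials=5000, seed=0):
    """Fraction of null-generated samples on which the test rejects."""
    rng = random.Random(seed)
    k = threshold(alpha)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if likelihood_ratio(sample) > k:
            rejections += 1
    return rejections / trials

# Under the null hypothesis, the rejection rate should be close to alpha.
print(rejection_rate(n=20, alpha=0.05))
```

Because the data here are continuous, the significance level can be met exactly in theory; the simulated rate differs from alpha only by Monte Carlo error.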