CSE 291 LECTURE NOTES

Thursday February 3, 2005


ACHIEVING THE CRAMER-RAO LOWER BOUND

We proved on Tuesday that there exists an unbiased estimator whose variance is 1/var[s(x,theta)] if and only if the score function can be written
s(x,theta)  =  d log p(x,theta)/dtheta  =   b(theta)*[h(x) - theta]
In this case h(x) is the MVUE.

Lemma:  The variance of h(x) is 1/b(theta).

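Proof:  Use the fact that  var[h(x)]*var[s(x,theta)]  =  cov[h(x), s(x,theta)]^2  =  1, and  var[s(x,theta)] = b(theta)^2*var[h(x)].  Substituting the second equation into the first gives  b(theta)^2*var[h(x)]^2 = 1.  Taking the positive square root,  var[h(x)] = 1/b(theta).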


A CRAMER-RAO EXAMPLE

Example:  Let x = (x1 ... xn) be the result of n independent coin flips with success probability theta.  As usual,  p(x,theta) = theta^m(x)*(1-theta)^(n-m(x))  where m(x) is the number of successes observed.  So
log p(x,theta)  =  m(x)*log(theta) + (n-m(x))*log(1-theta)

s(x,theta)  =  d log p(x,theta)/dtheta  =  m(x)/theta - (n-m(x))/(1-theta)  =  [m(x) - n*theta]/[theta*(1-theta)]  =  n/[theta*(1-theta)] * [g(x) - theta]

where g(x) = m(x)/n.  This has the required form with b(theta) = n/[theta*(1-theta)], so this g(x) is an MVUE and, by the lemma, its variance is 1/b(theta) = theta*(1-theta)/n.
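
The variance claim for the coin-flip example can be checked by direct summation over the binomial distribution of m(x).  The following is a minimal sketch; the function names and the values n = 20, theta = 0.3 are illustrative choices, not part of the lecture.

```python
from math import comb

def binomial_pmf(m, n, theta):
    """Probability of exactly m successes in n independent flips."""
    return comb(n, m) * theta**m * (1 - theta)**(n - m)

def exact_mean_and_variance_of_g(n, theta):
    """Mean and variance of g(x) = m(x)/n, summing over m = 0..n."""
    probs = [binomial_pmf(m, n, theta) for m in range(n + 1)]
    mean = sum(p * (m / n) for m, p in enumerate(probs))
    var = sum(p * (m / n - mean) ** 2 for m, p in enumerate(probs))
    return mean, var

def cramer_rao_bound(n, theta):
    """theta*(1-theta)/n, the bound for n coin flips."""
    return theta * (1 - theta) / n

n, theta = 20, 0.3   # illustrative values
mean, var = exact_mean_and_variance_of_g(n, theta)
print("mean of g:", mean)
print("variance of g:", var)
print("Cramer-Rao bound:", cramer_rao_bound(n, theta))
```

Up to floating-point rounding, the mean should equal theta (g is unbiased) and the variance should equal the bound.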
 
 

A DIGRESSION

Note that if you have n binary trials with success probability p, then the expected number of successes is np and the variance is npq, where q = 1 - p.  Often p and n are unknown, but p is small and n is large.  In this case q is close to 1, so the variance and the expectation are approximately equal.  This fact can be used to test informally whether the observed number of successes in two different scenarios is significantly different.

For example, suppose there were three abductions of children by strangers in California last year, and six this year.  The observed rate has doubled.  Is this a terrible crime wave?

The answer is no.  Let the null hypothesis be that the true expected number per year is np = 3, with random variability.  Under this hypothesis, the standard deviation is around sqrt(3) = 1.7.  About 2/3 of years will have a count within +/- one standard deviation of the mean, and about 95% within +/- two standard deviations.  In this application, roughly one year out of every three the number will be one or fewer, or five or more, without any change in the underlying rate.
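
These rule-of-thumb figures can be compared with an exact calculation.  The sketch below assumes the yearly count follows a Poisson distribution with mean 3, which is the standard approximation to a binomial with large n and small p; that modelling choice and the function names are assumptions added here.

```python
from math import exp, factorial, sqrt

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0                      # null hypothesis: np = 3
sd = sqrt(lam)                 # variance is approximately equal to the mean
low, high = lam - sd, lam + sd

# Counts above 30 have negligible probability when the mean is 3.
inside = sum(poisson_pmf(k, lam) for k in range(30) if low <= k <= high)
outside = 1.0 - inside

print("standard deviation:", sd)
print("P(count within one sd of the mean):", inside)
print("P(count outside one sd of the mean):", outside)
```

Because the count is discrete and the mean is small, the exact figures differ somewhat from the normal-curve rule of thumb, but the qualitative conclusion is the same: large year-to-year swings are common under a constant rate.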

Even with a very large sample, the number of information-rich examples may still be very low.  For example, there are over five million children in California but very little information is available about whether there has been a change in the probability of abduction.


NOTES ON TESTING A HYPOTHESIS

(1) Which null hypothesis you choose should depend on your point of view, and can change your final conclusion.  Here, should H0 be that np = 3, or that np = 4.5, where 4.5 is our best guess of the true rate, assuming that the true rate is constant?  Which H0 to choose is a real-world question, not a technical mathematical one.

(2) Once you have chosen H0, the mathematical question is "what is the probability of either the observed outcome, or a more extreme outcome?"  The definition of "more extreme" depends on the real-world scenario.

(3) The probability defined in (2) is called the p-value.  Your final conclusion is based on comparing the p-value to a threshold.  Which threshold you use is again a real-world question, not a mathematical one.
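
As an illustration of notes (1) to (3), the p-value for the observed count of six can be computed under both candidate null hypotheses.  This sketch assumes a Poisson model for the yearly count and takes "more extreme" to mean "six or more"; both are assumptions made for the example, as are the function names.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

def upper_tail_p_value(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(observed))

observed = 6
for lam in (3.0, 4.5):         # the two choices of H0 discussed in note (1)
    p = upper_tail_p_value(observed, lam)
    print("H0: np =", lam, " p-value =", round(p, 4))
```

The two null hypotheses give noticeably different p-values for the same observation, which is the point of note (1).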

 

FISHER INFORMATION

The variance of the score function is called the Fisher information.  It is a formalization of the concept "amount of information" that dates from the 1920s and so predates Shannon's famous notion of entropy (1948).  Of course, the two are related.

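Fisher information is additive over independent observations.  The log likelihood of an independent sample is a sum of per-observation terms, so the score is a sum of independent per-observation scores, and the variance of a sum of independent random variables is the sum of their variances.  If the sample (i.e. training set) is a set of iid observations, then the total information is n times the information provided by each observation.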
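
Additivity can be checked numerically on the coin-flip example from earlier in these notes.  The sketch below computes the variance of the score for n flips by exact summation and compares it with n times the single-flip value; the function names and theta = 0.3 are illustrative assumptions.

```python
from math import comb

def score(m, n, theta):
    """Score for n coin flips with m successes."""
    return m / theta - (n - m) / (1 - theta)

def fisher_information(n, theta):
    """Variance of the score, summing over m ~ Binomial(n, theta)."""
    probs = [comb(n, m) * theta**m * (1 - theta)**(n - m) for m in range(n + 1)]
    mean = sum(p * score(m, n, theta) for m, p in enumerate(probs))
    return sum(p * (score(m, n, theta) - mean) ** 2 for m, p in enumerate(probs))

theta = 0.3                                    # illustrative value
per_flip = fisher_information(1, theta)        # should be 1/[theta*(1-theta)]
for n in (1, 5, 25):
    print(n, fisher_information(n, theta), n * per_flip)
```

The two printed columns should agree up to rounding error.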
 
 

LARGE-SAMPLE MAXIMUM-LIKELIHOOD

Let p*(xi,theta) be the distribution followed by a single element of a large iid sample of size n.  We have  log p(x,theta) = SUM_i log p*(xi,theta)

Let's call this sum l(x,theta).  Given any theta, we can think of it as a function of x, i.e. a random variable.  It's a different random variable for each theta.

For each n, let thetahat_n be the MLE.  Because the MLE maximizes the log likelihood, thetahat_n is a solution of the equation D_theta l(x,theta) = 0, assuming that the maximum is not a "corner case" solution on the boundary of the parameter space.  Remember that D_theta l(x,theta) is the score function, called s(x,theta) before.
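
For the coin-flip model the score equation can be solved in closed form, which makes it a convenient check on a numerical solver.  The sketch below finds the root of the score by bisection and compares it with m/n; the solver, its tolerance, and the values m = 7, n = 20 are assumptions made for illustration.

```python
def score(theta, m, n):
    """Score for n coin flips with m successes."""
    return m / theta - (n - m) / (1 - theta)

def solve_score_equation(m, n, tol=1e-12):
    """Bisection for the root of the score; requires 0 < m < n."""
    lo, hi = 1e-9, 1 - 1e-9
    # The score is strictly decreasing in theta: positive near 0, negative near 1.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score(mid, m, n) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

m, n = 7, 20                   # illustrative data
theta_hat = solve_score_equation(m, n)
print("root of score equation:", theta_hat)
print("closed-form MLE m/n:   ", m / n)
```

When m = 0 or m = n the score has no root inside (0,1) and the maximum lies on the boundary; this is an instance of the "corner case" excluded above.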

We are going to prove that the MLE is essentially an ideal estimator as n tends to infinity.  More precisely, with probability one (a) the MLE tends towards the true theta, and (b) the variance of the MLE tends towards the Cramer-Rao lower bound.


CONSISTENCY

Definition:  Let (theta_n tilde) be a sequence of estimators of theta, for n >= 1.  The sequence is consistent if for all theta, theta_n tilde tends to theta as n tends to infinity.

Remember that each theta_n tilde is a function of x.  It's too much to ask that convergence be true for all x.  There are weak and strong versions of the definition using different probabilistic conditions on x.

Notes:
(1) The sequence (theta_n tilde) can be consistent, even though each theta_n tilde is not unbiased.
(2) Conversely, the sequence can fail to be consistent, even though each theta_n tilde is unbiased.
(3) A sequence can be consistent, but still converge very slowly, e.g. if each estimator throws away some useful information.
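
Notes (1) and (2) can be illustrated with the coin-flip model.  The estimators in this sketch are examples chosen here, not taken from the lecture: (m+1)/(n+2) is biased for every n but its mean squared error shrinks to zero, while the estimator that reports only the first flip x1 is unbiased but has the same error for every n.

```python
from math import comb

def bias_and_mse(estimator, n, theta):
    """Exact bias and mean squared error of estimator(m, n), m ~ Binomial(n, theta)."""
    probs = [comb(n, m) * theta**m * (1 - theta)**(n - m) for m in range(n + 1)]
    mean = sum(p * estimator(m, n) for m, p in enumerate(probs))
    mse = sum(p * (estimator(m, n) - theta) ** 2 for m, p in enumerate(probs))
    return mean - theta, mse

def smoothed(m, n):
    """Biased for every n, but the bias (1 - 2*theta)/(n + 2) vanishes as n grows."""
    return (m + 1) / (n + 2)

def first_flip_mse(theta):
    """x1 is 1 with probability theta, else 0: unbiased, MSE theta*(1-theta) for all n."""
    return theta * (1 - theta)

theta = 0.3                    # illustrative value
for n in (10, 100):
    bias, mse = bias_and_mse(smoothed, n, theta)
    print(n, "smoothed: bias", bias, "mse", mse, "| first flip: mse", first_flip_mse(theta))
```

A mean squared error that tends to zero implies convergence in probability, which is the weak form of consistency; the first-flip estimator cannot converge because its error never decreases.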

   

LARGE-SAMPLE EFFICIENCY

Next week we will prove that for large n, MLEs are consistent, and have variance only slightly above the Cramer-Rao lower bound.  This second property is called efficiency.

We shall use several intermediate results: