CSE 291 LECTURE NOTES

February 1, 2005
 
 

ANNOUNCEMENTS

See the new third assignment, due February 15.


THE SCORE FUNCTION

The function  log p(x,theta)  is called the log-likelihood function.  Its derivative  s(x,theta) = d log p(x,theta) / d theta  is called the score function.

Because we use natural logarithm and d/dx log x = 1/x, the chain rule for derivatives says that

s(x,theta)  =  1/p(x,theta) * d p(x,theta) / d theta
Generally, given x we want to guess theta such that p(x,theta) is high and  d p(x,theta) / d theta = 0, to be at a local maximum for p(x,theta).  Hence for fixed x, the score function says which values of theta are best: the optimum score is zero and any non-zero score is less desirable.
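
As a concrete illustration, here is a minimal sketch in Python that evaluates the score for the coin-flip model used in the example at the end of these notes, p(x,theta) = theta^m * (1-theta)^(n-m).  The function names and the data (7 successes in 10 flips) are assumptions made for illustration only.  The sketch compares the analytic score with a finite-difference derivative of the log-likelihood; the score should be positive below theta = m/n, zero at theta = m/n, and negative above it.

```python
import math

def log_p(m, n, theta):
    """Log-likelihood of one specific sequence of n flips with m successes."""
    return m * math.log(theta) + (n - m) * math.log(1.0 - theta)

def score(m, n, theta):
    """Analytic derivative of log_p with respect to theta."""
    return m / theta - (n - m) / (1.0 - theta)

def score_numeric(m, n, theta, eps=1e-6):
    """Central finite-difference approximation of the same derivative."""
    return (log_p(m, n, theta + eps) - log_p(m, n, theta - eps)) / (2.0 * eps)

m, n = 7, 10  # hypothetical data: 7 successes in 10 flips
for theta in (0.3, 0.5, 0.7, 0.9):
    # columns: theta, analytic score, numerical score
    print(theta, score(m, n, theta), score_numeric(m, n, theta))
```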
 
 

MEAN AND VARIANCE OF THE SCORE FUNCTION

Lemma:  For any fixed value of theta, the score function has zero mean:  E[s(x,theta)]  =  0,  where the expectation is over x drawn from p(x,theta).

Proof:  By definition E[s(x,theta)]  =  INT_x dx p(x,theta) d/dtheta log p(x,theta).

So  E[s(x,theta)]  =  INT_x dx p(x,theta) 1/p(x,theta) * d/dtheta p(x,theta)
                              =  INT_x dx d/dtheta p(x,theta)  =  d/dtheta INT_x dx p(x,theta)  =  d/dtheta 1  =  0.

Intuitively, the integral of a derivative is the derivative of the integral because the derivative of a sum is the sum of the derivatives.  This equality can fail if the range of x values over which we integrate is different for different theta, but we won't go into these complications.
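
The lemma can be checked numerically.  The sketch below assumes the coin-flip model from the example at the end of these notes, where x ranges over a finite set so the integral becomes a sum.  All sequences with the same number of successes m have the same score, so the sum is grouped by m.  The helper names are hypothetical.

```python
import math

def score(m, n, theta):
    """Score of any sequence of n flips that contains m successes."""
    return m / theta - (n - m) / (1.0 - theta)

def prob_m(m, n, theta):
    """Probability of observing exactly m successes in n flips."""
    return math.comb(n, m) * theta**m * (1.0 - theta)**(n - m)

def mean_score(n, theta):
    """E[s(x,theta)], summing over the possible success counts m."""
    return sum(prob_m(m, n, theta) * score(m, n, theta) for m in range(n + 1))

for theta in (0.1, 0.5, 0.8):
    # the mean should be zero up to floating-point rounding
    print(theta, mean_score(10, theta))
```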

Because the score function has zero mean, its variance is just the expected value of its square:

var[s(x,theta)]  =  E_theta [ s(x,theta)^2 ]
Note that the variance, like the mean, is an average over all values of x, given a certain theta.  The mean is always zero but the variance can be different for different theta.
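
To see the variance change with theta, the following sketch again assumes the coin-flip model from the example at the end of these notes and computes E[s(x,theta)^2] exactly by summing over the number of successes.  For comparison it prints n/[theta*(1-theta)], the coefficient that appears in the score of that example.  The helper names are hypothetical.

```python
import math

def score(m, n, theta):
    """Score of any sequence of n flips that contains m successes."""
    return m / theta - (n - m) / (1.0 - theta)

def prob_m(m, n, theta):
    """Probability of observing exactly m successes in n flips."""
    return math.comb(n, m) * theta**m * (1.0 - theta)**(n - m)

def var_score(n, theta):
    """var[s(x,theta)] = E[s(x,theta)^2], because the mean score is zero."""
    return sum(prob_m(m, n, theta) * score(m, n, theta) ** 2 for m in range(n + 1))

n = 10
for theta in (0.1, 0.3, 0.5):
    # columns: theta, variance of the score, n / (theta * (1 - theta))
    print(theta, var_score(n, theta), n / (theta * (1.0 - theta)))
```

In this model the variance of the score is smallest at theta = 0.5 and grows as theta approaches 0 or 1.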
 

CRAMER-RAO LOWER BOUND

Sometimes we can't find an MVUE, but we can find an unbiased estimator.  In this case we'd like to know how good its variance is.  One way to do this is to compare it to some lower bound.  The result we'll see now gives such a lower bound.

Suppose that the score function has small variance, for some theta.  Since the mean score is zero, this means that most x have scores close to zero, so whatever the x that we observe, it doesn't provide much information about the value of theta.  Hence every estimator of theta based on x is likely to be bad.

More specifically, the smaller the variance of s(x,theta), the bigger the variance of any unbiased estimator g(x), including the MVUE.

Theorem [Cramer, Rao]:  Suppose the family of distributions P_theta is defined by a density function p(x,theta) where theta is a single real-valued parameter.  Let g(x) be any unbiased estimator of theta.  Then

var_theta[g(x)]  >=  1/ var[s(x,theta)].
Proof:  We start with some properties of g(x).  First, the expectation of g(x) is theta so
INT_x g(x) p(x,theta) dx  =  theta

d/ d theta  INT_x g(x) p(x,theta) dx  =  1

INT_x g(x) d/ d theta  p(x,theta) dx  =  1

The last step above comes from the fact that g(x) is not a function of theta.  It also assumes regularity conditions that we won't go into.  Now using the fact  s(x,theta)  =  d log p(x,theta)/dtheta  =   1/p(x,theta) * d p(x,theta) / d theta,  we can write  d p(x,theta)/dtheta  =  s(x,theta) * p(x,theta),  so
INT_x g(x) * d log p(x,theta)/dtheta * p(x,theta) dx  =  1
The left-hand side is the expectation of  g(x) * s(x,theta),  so  E[ g(x)*s(x,theta) ]  =  1.

We proved above that E[s(x,theta)] = 0.  Consider the definition of the covariance of g(x) and s(x,theta):

cov[g(x), s(x,theta)]  =  E[ (g(x)-theta)*(s(x,theta)-0) ]
                                     =  E[ g(x)*s(x,theta) -  theta*s(x,theta) ]  =  E[ g(x)*s(x,theta) ]  -  0
Using the general result that the covariance squared is at most the product of the variances (the Cauchy-Schwarz inequality) gives
var[g(x)]*var[s(x,theta)]  >=  cov[g(x), s(x,theta)]^2  =  E[ g(x)*s(x,theta) ]^2  =  1
so var[g(x)]  >=  1/ var[s(x,theta)] as wanted.
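
The bound, and the identity E[ g(x)*s(x,theta) ] = 1 used in the proof, can be checked on a small case.  The sketch below assumes the coin-flip model from the example at the end of these notes with n = 4, and enumerates all 2^n sequences exactly.  The two estimators (the first flip alone, and the mean of the first two flips) are hypothetical choices: both are unbiased but ignore part of the data, so their variances should sit strictly above the bound.

```python
import itertools

def p_seq(x, theta):
    """Probability of one specific sequence x of 0/1 flips."""
    m = sum(x)
    return theta**m * (1.0 - theta)**(len(x) - m)

def score(x, theta):
    """Score of the sequence x at parameter value theta."""
    m = sum(x)
    return m / theta - (len(x) - m) / (1.0 - theta)

def expect(f, n, theta):
    """E[f(x)] by exact enumeration of all 2^n flip sequences."""
    return sum(p_seq(x, theta) * f(x) for x in itertools.product((0, 1), repeat=n))

def first_flip(x):
    """Unbiased estimator that uses only the first flip."""
    return x[0]

def first_two(x):
    """Unbiased estimator that averages only the first two flips."""
    return (x[0] + x[1]) / 2.0

def cr_bound(n, theta):
    """The lower bound 1 / var[s(x,theta)]."""
    return 1.0 / expect(lambda x: score(x, theta) ** 2, n, theta)

def summarize(g, n, theta):
    """Return (E[g], var[g], E[g*s]) for the estimator g."""
    mean = expect(g, n, theta)
    var = expect(lambda x: (g(x) - mean) ** 2, n, theta)
    cross = expect(lambda x: g(x) * score(x, theta), n, theta)
    return mean, var, cross

n, theta = 4, 0.3
print("bound", cr_bound(n, theta))
for g in (first_flip, first_two):
    print(g.__name__, summarize(g, n, theta))
```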


ACHIEVING THE CRAMER-RAO LOWER BOUND

Theorem:  There exists an unbiased estimator g(x) whose variance is 1/ var[s(x,theta)]  if and only if the score function can be written
s(x,theta)  =  d log p(x,theta)/dtheta  =   b(theta)*[h(x) - theta]
where h(x) is unbiased.  In this case h(x) is an MVUE with variance 1/b(theta).

Proof:  We used the fact that  var[g(x)]*var[s(x,theta)]  >=  cov[g(x), s(x,theta)]^2.  This inequality becomes an equality exactly when g(x) and s(x,theta), with their means subtracted, are linearly related, where the constant b is allowed to depend on theta:

s(x,theta) - E[s(x,theta)]  =  b(theta)*{ g(x) - E[g(x)] }
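Since E[s(x,theta)] = 0 and E[g(x)] = theta, simplifying gives
s(x,theta)  =  b(theta)*{ g(x) - theta }.
In this case g(x) is an MVUE.  Multiplying both sides by g(x) - theta and taking expectations gives  cov[g(x), s(x,theta)]  =  b(theta)*var[g(x)],  and we showed that this covariance is 1, so  var[g(x)]  =  1/b(theta).  Since the bound is achieved, b(theta)  =  var[s(x,theta)].  This quantity is called the Fisher information, so we know the MVUE variance.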
 
 

A CRAMER-RAO EXAMPLE

Example:  Let x = (x1 ... xn) be the result of n independent coin flips with success probability theta.  As usual,  p(x,theta) = theta^m(x)*(1-theta)^(n-m(x))  where m(x) is the number of successes observed.  So
log p(x,theta)  =  m(x)*log(theta) + (n-m(x))*log(1-theta)

s(x,theta)  =  d log p(x,theta)/dtheta  =  m(x)/theta - (n-m(x))/(1-theta)  =  n/[theta*(1-theta)] * [g(x) - theta]

where g(x) = m(x)/n.  So this g(x) is an MVUE, with b(theta) = n/[theta*(1-theta)] and variance 1/b(theta) = theta*(1-theta)/n.
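
As a sanity check on this example, the sketch below verifies for a hypothetical choice of n = 10 and theta = 0.3 that the score equals b(theta)*[g(x) - theta] for every possible success count, and that the variance of g(x) = m(x)/n equals 1/b(theta).  The helper names are assumptions made for illustration.

```python
import math

def score(m, n, theta):
    """Score of any sequence of n flips that contains m successes."""
    return m / theta - (n - m) / (1.0 - theta)

def b(n, theta):
    """Coefficient in the factored form s = b(theta) * (g(x) - theta)."""
    return n / (theta * (1.0 - theta))

def prob_m(m, n, theta):
    """Probability of observing exactly m successes in n flips."""
    return math.comb(n, m) * theta**m * (1.0 - theta)**(n - m)

def var_g(n, theta):
    """Variance of g(x) = m(x)/n, summing over the success count m."""
    return sum(prob_m(m, n, theta) * (m / n - theta) ** 2 for m in range(n + 1))

n, theta = 10, 0.3
# the factored form should hold for every possible success count m
for m in range(n + 1):
    assert math.isclose(score(m, n, theta), b(n, theta) * (m / n - theta), abs_tol=1e-9)

# columns: variance of g, the Cramer-Rao bound 1/b(theta)
print(var_g(n, theta), 1.0 / b(n, theta))
```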