Because we use the natural logarithm and d/dx log x = 1/x, the chain rule for derivatives says that
s(x,theta) = 1/p(x,theta) * d p(x,theta) / d theta
Generally, given x we want to guess theta such that p(x,theta) is high and d p(x,theta) / d theta = 0, to be at a local maximum for p(x,theta). Hence for fixed x, the score function says which values of theta are best: the optimum score is zero and any non-zero score is less desirable.
Lemma: For every theta, the expected value of the score function is zero: E[s(x,theta)] = 0.
Proof: By definition E[s(x,theta)] = INT_x dx p(x,theta) d/dtheta log p(x,theta). So
E[s(x,theta)] = INT_x dx p(x,theta) * 1/p(x,theta) * d/dtheta p(x,theta)
= INT_x dx d/dtheta p(x,theta)
= d/dtheta INT_x dx p(x,theta)
= d/dtheta 1 = 0.
Intuitively, the integral of a derivative is the derivative of the integral because the derivative of a sum is the sum of the derivatives. This equality can fail if the bounds over which we average x are different for different theta, but we won't go into these complications.
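The zero-mean property can be checked numerically. The sketch below is only an illustration under an assumed concrete family: the binomial distribution with n trials and success probability theta. That choice, and the helper names `binomial_pmf`, `score` and `expected_score`, are assumptions made for this example. It sums p(x,theta) * s(x,theta) over every possible outcome.

```python
from math import comb

def binomial_pmf(m, n, theta):
    """Probability of m successes in n independent trials (assumed example family)."""
    return comb(n, m) * theta**m * (1 - theta)**(n - m)

def score(m, n, theta):
    """d/dtheta log p(m, theta) for the binomial family."""
    return m / theta - (n - m) / (1 - theta)

def expected_score(n, theta):
    """Exact E[s(x,theta)], summing over every possible outcome m = 0..n."""
    return sum(binomial_pmf(m, n, theta) * score(m, n, theta) for m in range(n + 1))

for theta in (0.1, 0.5, 0.9):
    # Each printed value should be zero up to floating-point rounding error.
    print(theta, expected_score(10, theta))
```

Because the sum runs over a fixed set of outcomes that does not depend on theta, this example avoids the complication about theta-dependent bounds.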
Because the score function has zero mean, its variance is just the expected value of its square:
var[s(x,theta)] = E_theta [ s(x,theta)^2 ]
Note that the variance, like the mean, is an average over all values of x, given a certain theta. The mean is always zero but the variance can be different for different theta.
Suppose that the score function has small variance for some theta. Then almost every x that we are likely to observe has a score close to zero, so whichever x we observe, it provides little information about the value of theta. Hence every estimator of theta based on x is likely to be bad.
More specifically, the smaller the variance of s(x,theta), the bigger the variance of any unbiased estimator g(x), including the MVUE.
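To see how the variance of the score changes with theta, here is a minimal sketch. It again assumes the binomial family with n trials as the example distribution; the function name `score_variance` is hypothetical. It computes E[s(x,theta)^2] exactly by summing over all outcomes.

```python
from math import comb

def score_variance(n, theta):
    """Exact var[s(x,theta)] = E[s^2] for m successes in n trials (assumed family)."""
    total = 0.0
    for m in range(n + 1):
        prob = comb(n, m) * theta**m * (1 - theta)**(n - m)
        s = m / theta - (n - m) / (1 - theta)   # score of outcome m
        total += prob * s * s                   # the mean of s is zero, so var = E[s^2]
    return total

for theta in (0.5, 0.9, 0.99):
    v = score_variance(10, theta)
    # A smaller var[s] corresponds to a larger reciprocal 1/var[s].
    print(f"theta={theta}: var[s]={v:.2f}, 1/var[s]={1 / v:.5f}")
```

For this family the variance of the score is smallest at theta = 0.5 and grows as theta approaches 0 or 1.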
Theorem [Cramer, Rao]: Suppose the family of distributions P_theta is defined by a density function p(x,theta) where theta is a single real-valued parameter. Let g(x) be any unbiased estimator of theta. Then
var_theta[g(x)] >= 1/ var[s(x,theta)].
Proof: We start with some properties of g(x). First, the expectation of g(x) is theta so
INT_x g(x) p(x,theta) dx = theta
d/ d theta INT_x g(x) p(x,theta) dx = 1
INT_x g(x) d/ d theta p(x,theta) dx = 1
The last step above comes from the fact that g(x) is not a function of theta. It also assumes regularity conditions that we won't go into. Now using the fact s(x,theta) = d log p(x,theta)/dtheta = 1/p(x,theta) * d p(x,theta) / d theta, we get
INT_x g(x) * d log p(x,theta)/dtheta * p(x,theta) dx = 1
which is the expectation of g(x) * s(x,theta).
We proved above that E[s(x,theta)] = 0. Consider the definition of the covariance of g(x) and s(x,theta):
cov[g(x), s(x,theta)] = E[ (g(x)-theta)*(s(x,theta)-0) ]
= E[ g(x)*s(x,theta) - theta*s(x,theta) ] = E[ g(x)*s(x,theta) ] - 0
Using the general result that the covariance squared is at most the product of the variances gives
var[g(x)]*var[s(x,theta)] >= cov[g(x), s(x,theta)]^2 = E[ g(x)*s(x,theta) ]^2 = 1
so var[g(x)] >= 1/ var[s(x,theta)] as wanted.
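The two facts used in the proof, E[g(x)*s(x,theta)] = 1 and var[g(x)] >= 1/var[s(x,theta)], can be checked exactly on a small example. The sketch below assumes x is a sequence of n independent 0/1 coin flips with success probability theta, and compares two unbiased estimators chosen for illustration: the sample mean and the first flip alone. The names `moments`, `sample_mean` and `first_flip` are hypothetical.

```python
from itertools import product

def moments(n, theta, estimator):
    """Exact var[g], var[s] and E[g*s] over all 2**n coin-flip sequences x."""
    e_g = e_g2 = e_s2 = e_gs = 0.0
    for x in product((0, 1), repeat=n):
        m = sum(x)
        p = theta**m * (1 - theta)**(n - m)      # probability of this sequence
        s = m / theta - (n - m) / (1 - theta)    # score of this sequence
        g = estimator(x)
        e_g += p * g
        e_g2 += p * g * g
        e_s2 += p * s * s                        # E[s] = 0, so var[s] = E[s^2]
        e_gs += p * g * s
    return e_g2 - e_g**2, e_s2, e_gs

def sample_mean(x):
    return sum(x) / len(x)

def first_flip(x):
    return x[0]  # unbiased but wasteful: ignores every flip after the first

n, theta = 8, 0.3
for g in (sample_mean, first_flip):
    var_g, var_s, e_gs = moments(n, theta, g)
    print(g.__name__, "var[g] =", var_g, "bound =", 1 / var_s, "E[g*s] =", e_gs)
```

Both estimators give E[g*s] equal to 1 up to rounding, and neither has variance below 1/var[s]; the first-flip estimator is well above the bound.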
Theorem: Suppose the score function can be written as
s(x,theta) = d log p(x,theta)/dtheta = b(theta)*[h(x) - theta]
where h(x) is unbiased. In this case h(x) is an MVUE with variance 1/b(theta).
Proof: We used the fact that var[g(x)]*var[s(x,theta)] >= cov[g(x), s(x,theta)]^2. Not surprisingly, the covariance is maximized if g(x) and s(x,theta) are linearly related, where the constant b is allowed to depend on theta:
s(x,theta) - E[s(x,theta)] = b(theta)*{ g(x) - E[g(x)] }
Simplifying gives
s(x,theta) = b(theta)*{ g(x) - theta }.
In this case g(x) is an MVUE and b(theta) is the Fisher information so we know the MVUE variance.
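Example: Let x be a sequence of n independent coin flips, each a success with probability theta, and let m(x) be the number of successes. Then
log p(x,theta) = m(x)*log(theta) + (n-m(x))*log(1-theta)
s(x,theta) = d log p(x,theta)/dtheta = m(x)/theta - (n-m(x))/(1-theta) = n/[theta*(1-theta)] * [g(x) - theta]
where g(x) = m(x)/n. So this g(x) is an MVUE.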
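As a sanity check on the algebra, the factored form b(theta) * [g(x) - theta] with b(theta) = n/[theta*(1-theta)] and g(x) = m(x)/n can be compared against a numerical derivative of log p(x,theta). This is a rough sketch; the function names and the step size `eps` are assumptions made for the example.

```python
from math import log

def log_p(m, n, theta):
    """log p(x,theta) for a flip sequence with m successes out of n."""
    return m * log(theta) + (n - m) * log(1 - theta)

def score_closed_form(m, n, theta):
    """b(theta) * (g(x) - theta) with b(theta) = n/(theta*(1-theta)), g(x) = m/n."""
    return n / (theta * (1 - theta)) * (m / n - theta)

def score_numeric(m, n, theta, eps=1e-6):
    """Central finite-difference approximation of d/dtheta log p(x,theta)."""
    return (log_p(m, n, theta + eps) - log_p(m, n, theta - eps)) / (2 * eps)

n, theta = 20, 0.35
for m in (0, 7, 12, 20):
    # The two columns should agree to several decimal places.
    print(m, score_closed_form(m, n, theta), score_numeric(m, n, theta))
```

The score is zero exactly when m(x)/n equals theta, which matches the earlier remark that the best theta for a fixed x is the one with zero score.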