Theorem [Calyampudi Radhakrishna Rao, 1945]:
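Suppose g tilde(x) is an unbiased estimator of g(theta) and t is a sufficient statistic for theta. Define g hat(a) = E_theta [ g tilde(x') | t(x') = a ]. Then g hat(t(x)) is an unbiased estimator of g(theta) whose variance is less than or equal to that of g tilde.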
Proof outline:
(1) Show that g hat(a) = E_theta [ g tilde(x') | t(x') = a ] is the same function of a regardless of theta. This means that g hat(t(x)) is a statistic, i.e. a function of x only, and hence it is a legitimate estimator.
(2) Show that E_theta [ g hat(t(x)) ] = g(theta) for every theta. This says that g hat is unbiased.
(3) Show that var_theta( g tilde(x) ) >= var_theta( g hat(t(x)) ). This says that the variance of g hat is no larger than that of g tilde.
Step (1): The statistic t: X -> Y is sufficient for theta, so p(x | t(x) = a), i.e. the distribution of x conditional on a certain value of t, does not depend on theta. By the definition of expectation, if the space X is discrete then the expectation of f(x') given t(x') = a is
SUM_{x' s.t. t(x') = a} f(x')*p(x' | t(x') = a)
For any function f: X -> R, this expectation is the same regardless of theta, because f does not depend on theta and p(x' | t(x') = a) does not depend on theta. Let f be g tilde; the expectation of g tilde(x') given t(x') = a is a function of a but not a function of theta. (A similar argument applies if the space X is continuous, with an integral instead of a sum, but there are technical details we won't go into.)
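As a concrete check of Step (1), here is a minimal sketch for one assumed example: n i.i.d. Bernoulli(theta) observations with t(x) = sum of the x_i. The example and the helper names joint_prob and conditional_given_t are illustrative assumptions, not part of the notes. The code enumerates every x' with t(x') = a and computes p(x' | t(x') = a) exactly for two different values of theta.

```python
from fractions import Fraction
from itertools import product

def joint_prob(x, theta):
    """Probability of the binary tuple x under i.i.d. Bernoulli(theta)."""
    p = Fraction(1)
    for xi in x:
        p *= theta if xi == 1 else 1 - theta
    return p

def conditional_given_t(n, a, theta):
    """Return {x: p(x | t(x) = a)} for t(x) = sum(x), with 0 < theta < 1."""
    # All samples of length n whose statistic equals a.
    fibre = [x for x in product((0, 1), repeat=n) if sum(x) == a]
    total = sum(joint_prob(x, theta) for x in fibre)
    return {x: joint_prob(x, theta) / total for x in fibre}

# The conditional distribution should be identical for both values of theta.
for theta in (Fraction(1, 4), Fraction(2, 3)):
    print(theta, conditional_given_t(3, 1, theta))
```

In this example every x' with t(x') = a receives the same conditional probability whatever theta is, so any conditional expectation built from it is a function of a alone.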
Step (2): We must show that E_theta [ g hat(t(x)) ] = g(theta) for arbitrary theta.
Proof: We use the lemma about nested expectations, where the event A contains all x' such that t(x') = t(x). We drop the subscript theta on the expectations (everything we say is true separately for every theta), and get E[ g hat(t(x)) ] = E[ E[ g tilde(x') | t(x') = t(x) ] ] = E[ g tilde(x') ] = g(theta). The last equality holds because g tilde is unbiased.
Step (3): Define c(u) = (u - g(theta))^2 for real numbers u. For each theta this is a different function of u, but that's ok, since we work with one theta at a time. If u: X -> R is an unbiased estimator of g(theta), then E[ c(u(x)) | x ~ P_theta ] is the variance of u. In particular, E[ c(g tilde(x)) | x ~ P_theta ] is the variance of the original estimator g tilde.
Use Jensen's inequality, which applies because c is convex, where we condition on the event t(x') = a and we let u(x') = g tilde(x'). Jensen's inequality says
E[ c(g tilde(x')) | t(x') = a ] >= c(E[ g tilde(x') | t(x') = a ]) = c(g hat(a)).
This is true for every a. For the righthand side, remember that E[ g tilde(x') | t(x') = a ] = g hat(a) by definition.
Take the expectation of each side again, averaging over a = t(x) with x ~ P_theta. This gives
E[ E[ c(g tilde(x')) | t(x') = t(x) ] | x ~ P_theta ] >= E[ c(g hat(t(x))) | x ~ P_theta ].
The righthand side is the variance of g hat(t(x)), because g hat(t(x)) is unbiased by Step (2). Using the lemma about nested expectations again for the lefthand side gives
E[ E[ c(g tilde(x')) | t(x') = t(x) ] | x ~ P_theta ] = E[ c(g tilde(x')) | x' ~ P_theta ]
which is the variance of g tilde(x').
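Steps (2) and (3) can be checked exactly on a small assumed example: n i.i.d. Bernoulli(theta) observations, target g(theta) = theta, crude estimator g tilde(x) = x_1, and t(x) = sum of the x_i. By symmetry, E[ x'_1 | t(x') = a ] = a/n, so g hat(t(x)) is the sample mean. The example and the function names below are illustrative assumptions, not taken from the notes.

```python
from fractions import Fraction
from itertools import product

def prob(x, theta):
    """Probability of the binary tuple x under i.i.d. Bernoulli(theta)."""
    p = Fraction(1)
    for xi in x:
        p *= theta if xi == 1 else 1 - theta
    return p

def mean_and_variance(estimator, n, theta):
    """Exact mean and variance of estimator(x) over all 2**n samples."""
    samples = list(product((0, 1), repeat=n))
    mean = sum(estimator(x) * prob(x, theta) for x in samples)
    var = sum((estimator(x) - mean) ** 2 * prob(x, theta) for x in samples)
    return mean, var

def g_tilde(x):
    # Crude unbiased estimator: the first observation only.
    return Fraction(x[0])

def g_hat_of_t(x):
    # E[ x'_1 | sum(x') = sum(x) ] = sum(x) / n, by symmetry.
    return Fraction(sum(x), len(x))

n, theta = 4, Fraction(3, 10)
mean_tilde, var_tilde = mean_and_variance(g_tilde, n, theta)
mean_hat, var_hat = mean_and_variance(g_hat_of_t, n, theta)
print(mean_tilde, var_tilde)  # 3/10 21/100
print(mean_hat, var_hat)      # 3/10 21/400
```

Both estimators have mean theta, and conditioning on t divides the variance by n in this example.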
Lemma: Suppose t is a sufficient statistic, and g hat is the unique function of t that is an unbiased estimator of g(theta). Then g hat(t) is the MVUE.
Proof: Let g star(x) be any unbiased estimator of g(theta). Note that E[ g star(x) | t ] is a function of t and an unbiased estimator of g(theta). So, by uniqueness, E[ g star(x) | t ] = g hat(t). By the Rao-Blackwell theorem, the variance of g hat(t) is less than or equal to the variance of g star(x). So g hat(t) has the smallest variance among all unbiased estimators.
How can we know that the sufficient statistic t has the property that g hat(t) is unique? The answer is via the concept of completeness.
Algorithm:
(1) Find a sufficient statistic t.
(2) Show that the family of distributions of t is complete.
(3) Find a crude unbiased estimator g tilde(x).
(4) Evaluate g hat(t(x)) = E_theta[ g tilde(y) | t(y)=t(x) ]
Steps 1 and 2 only have to be done once for a given family of distributions P_theta. They can then be reused for different estimation targets g(theta).
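The four steps can be sketched on a standard textbook example, which is an assumption here and not part of the notes: x_1, ..., x_n i.i.d. Poisson(theta) with target g(theta) = exp(-theta), the probability that one observation is zero. Steps (1) and (2) are taken as given: t(x) = sum of the x_i is sufficient and its family of distributions is complete. For step (3), take g tilde(x) = 1 if x_1 = 0 and 0 otherwise. For step (4), given t(x') = a the count x'_1 is Binomial(a, 1/n), so g hat(a) = ((n-1)/n)^a. The code below, with hypothetical helper names, checks numerically that g hat is unbiased and has smaller variance than g tilde.

```python
import math

def poisson_pmf(k, lam):
    """P(T = k) for T ~ Poisson(lam), computed in log space to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def g_hat(t, n):
    """E[ 1{x'_1 = 0} | sum(x') = t ] = ((n - 1) / n) ** t."""
    return ((n - 1) / n) ** t

def mean_and_variance_of_g_hat(theta, n, kmax=200):
    """Mean and variance of g_hat(T), T ~ Poisson(n * theta), truncated at kmax."""
    lam = n * theta
    mean = sum(g_hat(k, n) * poisson_pmf(k, lam) for k in range(kmax))
    second = sum(g_hat(k, n) ** 2 * poisson_pmf(k, lam) for k in range(kmax))
    return mean, second - mean ** 2

theta, n = 1.5, 5
target = math.exp(-theta)             # g(theta)
var_tilde = target * (1 - target)     # variance of the indicator 1{x_1 = 0}
mean_hat, var_hat = mean_and_variance_of_g_hat(theta, n)
print(target, mean_hat)    # the two values should agree up to rounding
print(var_tilde, var_hat)  # var_hat should be the smaller of the two
```

The sum over k is truncated at kmax, which is harmless here because the Poisson tail beyond 200 is negligible for n * theta = 7.5; a larger theta or n would need a larger cutoff.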