Intuition: E_theta[ g tilde(x') | t(x')
= t(x)] means E[ g tilde(x') | t(x')
= t(x), x' ~ P_theta]. This is an average over all possible
observations x' that
have the same value for the sufficient statistic t as the actual
observation x. By averaging over all these x', we reduce
sensitivity to the particular observation x, while still using all the
information in x that is relevant, because t is a sufficient statistic.
Lemma: Let {A} be any partition of X, and let f be any function of x. Consider E[ E[ f(x) | A ] ] where the outer expectation averages over the events A. Claim: E[ E[ f(x) | A ] ] = E[ f(x) ].
Proof (discrete case only): Using the definition of expectation, E[ E[f(x)|A] = SUM_A p(A) * [ SUM_x in A p(x|x in A) * f(x) ].
Proof 1: Write A(x) to mean whichever subset contains x. Because
the {A}
are a partition of X, E[ E[f(x)|A] = SUM_x p(A(x)) *
p(x|x in A(x)) * f(x).
Now p(x|x in B) = p(x and x in B)/p(x in B) by the definition of conditional probability. The numerator is 0 if x is not in B. It is just p(x) if x is in B.
So E[ E[f(x)|A] = SUM_x p(A(x)) * p(x) / p(A(x)) *
f(x) and we can cancel p(A(x)) out, giving the result.
Proof 2: By definition p(x|x in B) = 0 if x is not in B, and
p(x|x in B) = p(x)/p(x in B) if x is in B. So we can drop
the condition "in A" from the inner sum, giving SUM_A p(A) * [ SUM_x
p(x|x
in A) * f(x) ].
Moving p(A) inside and swapping the sums gives SUM_x SUM_A [ p(A) *
p(x|x
in A) * f(x) ].
Because p(x|x in A) is non-zero only for A = A(x), we get SUM_x p(A(x)) * p(x)/p(A(x) * f(x).
Canceling out p(A(x)) gives the result.
Definition: The function c: R-> R is convex if bc(x) + (1-b)c(y) >= c(bx +(1-b)y) for all x and y, for 0 <= b <= 1.
For example, the functions x^2 and 1/x are convex. Subject to minor conditions, a function is convex iff its second derivative is always positive. (This means that if the first derivative is ever zero, the function has a global minimum.) See picture!Lemma [Jensen's
inequality, 1906]: Let u be a real-valued random variable and
let c be a convex function .
Then E[c(u)] >= c(E[u]).
Proof: We'll just do the proof in the finite discrete
case, where u has n different values of u that
have
non-zero probability. The proof goes by induction on n. The
base case, i.e. n = 2, is an exercise for
the reader.
Inductive case: We'll suppose Jensen's inequality is true for
n-1, and then prove it for n. Let U = {u_1, ...., u_n-1} be the
n-1 different values of u that
have
non-zero probability. The inductive hypothesis tells us
that E[c(u) | u in U ] >= c(E[u | u in U ]).
We need to prove E[c(u)] >=
c(E[u]) where u may take on n different values, that is u in U
union {u_n}.
Let p = P(u = u_n) and let z = E[u | u in U ].
Then E[u] = pu_n + (1-p)z. Similarly,
E[c(u)] = pc(u_n) +
(1-p)E[c(u) | u in U ] .
The inductive hypothesis tells us that E[c(u) | u in U ]
>= c(E[u | u in U ]). So E[c(u)]
>=
pc(u_n) + (1-p)c(z).
The definition of convexity tells us that pc(u_n) +
(1-p)c(z) >= c(pu_n + (1-p)z ) =
c(E[u]). END OF PROOF.
Theorem [Calyampudi
Radhakrishna Rao, 1945]:
Then g hat(t(x)) is an unbiased estimator for g with variance equal-or-smaller to that of g tilde.
Proof outline:
(1) Show that g hat(a) = E_theta [ g tilde(x') | t(x') = a ] is the
same
function of a regardless of theta. This means that g hat(t(x)) is
a
statistic, i.e. a function of x only, and hence it is a legitimate
estimator.
(2) Show that E_theta [ g hat(a) ] = g(theta) for every theta.
This
says that g hat is unbiased.
(3) Show that var_theta( g tilde(x)) >= var_theta(g hat(a)).
This
says that g hat has smaller variance.
Step (1): The statistic t: X -> Y is sufficient for theta, so p(x | t(x) = a), i.e. the distribution of x conditional on t, does not depend on theta. By the definition of expectation, if the space X is discrete then the expectation of f(x') given t(x') = a is
SUM_{x' s.t. t(x') = a} f(x')*p(x' | t(x') = a)For any function f: X -> R, this expectation is the same regardless of theta, because f does not depend on theta and p(x' | t(x') = a) does not depend on theta. Let f be g tilde; the expectation of g tilde(x') given t(x') = t(x) is a function of t(x) but not a function of theta. (A similar argument applies if the space X is continuous, with an integral instead of a sum, but there are technical details we won't go into.)