CSE 291 LECTURE NOTES

January 13, 2005
 
 

ANNOUNCEMENTS

Here is the second assignment, which is due in two weeks.  Be sure to check for clarifications and updated.


THE RAO-BLACKWELL THEOREM

How can we find an unbiased estimator with minimum MSE among all unbiased estimators?

Theorem: 
Claim: ghat(x) is an unbiased estimator with variance equal-or-smaller to that of gtilde(x).

Intuition:  E_theta[ g tilde(x') | t(x') = t(x)] means E[ g tilde(x') | t(x') = t(x), x' ~ P_theta].  This is an average over all possible observations x' that have the same value for the sufficient statistic t as the actual observation x.  By averaging over all these x', we reduce sensitivity to the particular observation x, while still using all the information in x that is relevant, because t is a sufficient statistic.


NESTED EXPECTATIONS LEMMA

Remember the definition of expectation (discrete case):  E [ f(x) ]  =  SUM_x p(x)*f(x  where the sum is over all x in the sample space X.

An event is a subset of the sample space X.  For example, B = {x : t(x) = b} is an event.  The notation E[ f(x) | B ] averages over the x belonging to B, so it means SUM_x in B p(x|x in B) * f(x).

Lemma:  Let {A} be any partition of X, and let f be any function of x.  Consider  E[ E[ f(x) | A ] ]  where the outer expectation averages over the events A.  Claim:  E[ E[ f(x) | A ] ] = E[ f(x) ].

Proof (discrete case only):  Using the definition of expectation, E[ E[f(x)|A]  =  SUM_A p(A) * [ SUM_x in A p(x|x in A) * f(x) ].


Proof 1: Write A(x) to mean whichever subset contains x.  Because the {A} are a partition of X, E[ E[f(x)|A]  =  SUM_x  p(A(x)) * p(x|x in A(x)) * f(x).

Now p(x|x in B) = p(x and x in B)/p(x in B) by the definition of conditional probability.   The numerator is 0 if x is not in B.  It is just p(x) if x is in B.

So E[ E[f(x)|A]  =  SUM_x  p(A(x)) * p(x) / p(A(x)) * f(x)  and we can cancel p(A(x)) out, giving the result.

Proof 2:  By definition p(x|x in B) = 0 if x is not in B, and p(x|x in B) =  p(x)/p(x in B) if x is in B.  So we can drop the condition "in A" from the inner sum, giving SUM_A p(A) * [ SUM_x p(x|x in A) * f(x) ].

Moving p(A) inside and swapping the sums gives SUM_x SUM_A [ p(A) * p(x|x in A) * f(x) ].

Because p(x|x in A) is non-zero only for A = A(x), we get SUM_x p(A(x)) * p(x)/p(A(x) * f(x).

Canceling out p(A(x)) gives the result.


JENSEN'S INEQUALITY

Definition:  The function c: R-> R is convex if  bc(x) + (1-b)c(y) >= c(bx +(1-b)y)  for all x and y, for 0 <= b <= 1.

For example, the functions x^2 and 1/x are convex.  Subject to minor conditions, a function is convex iff its second derivative is always positive.  (This means that if the first derivative is ever zero, the function has a global minimum.)  See picture!

Definition:  A real-valued random variable is a function X -> R.  Note that this is another name for what we called a statistic earlier.  Whatever the name, the probability distribution on X determines a probability distribution on the real numbers that are the values u(x).

Lemma [Jensen's inequality, 1906]:  Let u be a real-valued random variable and let c be a convex function .  Then E[c(u)] >= c(E[u]).

Proof:  We'll just do the proof in the finite discrete case, where u has n different values of u that have non-zero probability.  The proof goes by induction on n.  The base case, i.e. n = 2, is an exercise for the reader.

Inductive case:  We'll suppose Jensen's inequality is true for n-1, and then prove it for n.  Let U = {u_1, ...., u_n-1} be the n-1 different values of u that have non-zero probability.  The inductive hypothesis tells us that  E[c(u) | u in U ]  >=  c(E[u | u in U ]). 

We need to prove  E[c(u)] >= c(E[u])  where u may take on n different values, that is u in U union {u_n}.

Let p = P(u = u_n) and let z = E[u | u in U ].  Then  E[u] = pu_n  +  (1-p)z.  Similarly,  E[c(u)]  =  pc(u_n)  +  (1-p)E[c(u) | u in U ] .

The inductive hypothesis tells us that  E[c(u) | u in U ]  >=  c(E[u | u in U ]).  So E[c(u)] >= pc(u_n)  +  (1-p)c(z).

The definition of convexity tells us that pc(u_n)  +  (1-p)c(z)  >= c(pu_n  +  (1-p)z )  =  c(E[u]).  END OF PROOF.

Jensen's inequality also applies to conditional expectations:  if c is convex, then E[ c(f(x)) | x in A] >= c( E[ f(x) | x in A] ).  Justification:  those x such that x in A are really just a different sample space, and the result is true for any sample space.


RAO-BLACKWELL PROOF

Theorem [Calyampudi Radhakrishna Rao, 1945]: 

Then g hat(t(x)) is an unbiased estimator for g with variance equal-or-smaller to that of g tilde.

Proof outline:
(1) Show that g hat(a) = E_theta [ g tilde(x') | t(x') = a ] is the same function of a regardless of theta.  This means that g hat(t(x)) is a statistic, i.e. a function of x only, and hence it is a legitimate estimator.
(2) Show that E_theta [ g hat(a) ] = g(theta) for every theta.  This says that g hat is unbiased.
(3) Show that var_theta( g tilde(x)) >= var_theta(g hat(a)).  This says that g hat has smaller variance.

Step (1):  The statistic t: X -> Y is sufficient for theta, so p(x | t(x) = a), i.e. the distribution of x conditional on t, does not depend on theta.  By the definition of expectation, if the space X is discrete then the expectation of f(x') given t(x') = a is

SUM_{x' s.t. t(x') = a} f(x')*p(x' | t(x') = a)
For any function f: X -> R, this expectation is the same regardless of theta, because f does not depend on theta and p(x' | t(x') = a) does not depend on theta.  Let f be g tilde; the expectation of g tilde(x') given t(x') = t(x) is a function of t(x) but not a function of theta.  (A similar argument applies if the space X is continuous, with an integral instead of a sum, but there are technical details we won't go into.)