CSE 291 LECTURE NOTES

January 6, 2005
 
 

ANNOUNCEMENTS

See the introductory information and the guidelines for the students who will produce the lecture notes in LaTeX each week.


POINT ESTIMATION

The framework for point estimation is as follows.  We are given
(1) a sample space X = { x },
(2) a family of distributions on X {P_theta: theta in Theta}, and
(3) a function g: Theta -> R.  (Here R means the set of real numbers.)

Example:
(1) The sample space is { x } = X = Rn.  This means that x = (x1 ... xn) is a vector of n real numbers.
(2) The family of distributions is all Gaussians N(mu, sigma^2).  Note that theta is an ordered pair, theta = (mu, sigma^2), and each element of x is iid.  "iid" means "independent and identically distributed.
(3) Suppose we don't care what the precise distribution N(mu, sigma^2) is, but just what its mean is.  Then the function g is (mu, sigma^2) -> mu.

There is a true but unknown theta and hence a true but unknown value g(theta).  Our objective is to guess g(theta) as accurately as possible.
We are also given one training set, i.e. one sample x drawn according to P_theta, where theta is the unknown true theta. 

An estimator is a function ghat : X -> R.  Given the training set x, ghat(x) is an estimate.  Note that the estimator is our learning algorithm, while the estimate is the result of applying this algorithm to a particular set of observations.

Example continued:  The obvious estimator for mu is the function ghat(x) = x bar = (1/n) SUM x_i.
 

UNBIASEDNESS

The training set x follows P_theta so we can make probability statements about g hat, e.g. compute E_theta[g hat(x)].  Since we don't know what the true theta is, "compute" means "work out a formula" here.

Definition: The estimator g hat is unbiased if E_theta [g hat(x)] = g(theta) for all theta.  Notation:  E_theta [ghat(x)] = E [ghat(x) | x ~ P_theta].

Traditionally in machine learning the word bias means any sort of prior knowledge, e.g. the belief that survival rates are non-increasing.  In mathematical statistics the word has a different, more narrow meaning.  (Both meanings are quite different from the ordinary meaning "unjustified prejudice.")

Example continued:  It can be computed that x bar is unbiased: E[x bar] = mu.  Note:  This computation uses the fact that "the expectation of a sum is the sum of the expectations."  This fact is true even if the components of the sum are not independent.

Note that these computations are deductive, i.e. they involve reasoning, not learning.  Also note that these computations are not part of the estimator, i.e. not part of the learning algorithm.  We're doing the computation as part of analyzing the algorithm.


AN IDEAL ESTIMATOR ...

Unbiasedness is a nice property for an estimator, but it is not a strong requirement.  Example:  ghat(x) = x_1 is unbiased also.

Ideally we want P_theta(g hat(x) = g(theta)) = 1 for every theta.  In words, regardless of what the true theta is, we want ghat(x) to equal g(theta) with probability 1, where x is distributed according to the true theta.

When is this achievable?  A sufficient condition: If X can be partitioned into disjoint subsets X_theta such that for each theta, P_theta(X_theta) = 1.  Then let g hat(x) = g(theta) for x in X_theta.

This situation can be achieved asymptotically when x is a very large sample of independent observations. 

 

MEAN SQUARED ERROR (MSE)

Definition: The mean squared error (MSE) of an estimator g hat is
E[ (g hat(x) - g(theta))^2  | x ~ P_theta]

Example continued: Suppose x = (x1 ... xn) is an iid sample from a univariate normal distribution with parameter theta = (mu, sigma^2).  The obvious estimator for mu is x bar.  What is the MSE of x bar?  Answer: MSE[x bar] = sigma^2/n.  Note: This computation uses the fact that "the variance of a sum is the sum of the variances, if the components of the sum are independent."

A desirable property would be minimum mean square error: For every theta and every other estimator gbar, we want
E_theta [ (g hat - g(theta))^2 ]  <=  E_theta [ (g bar - g(theta))^2 ]
This is not achievable in general.  Consider the estimator g bar(x) = g(theta_0) regardless of x.  Although this is a bad estimator in general, it has zero error for theta = theta_0.  So ghat would have to have zero error for theta_0, and hence for all theta.  (Analogy: A stopped clock is perfectly accurate twice a day.)
 

MINIMUM VARIANCE UNBIASED ESTIMATORS (MVUE)

We just saw that getting an estimator that is minimum MSE is not achievable in general.  Suppose we restrict attention only to unbiased estimators.  We can often find ghat with the following property:
Note that if E_theta [g hat(x)] = g(theta) then E_theta [(g hat(x) - g(theta))^2] is the variance var_theta(g hat).  Therefore we can use the terminology "minimum variance unbiased estimator" (MVUE).

Next week we'll start work towards an algorithm for designing MVUEs.


BOOK RECOMMENDATIONS

Here are some books recommended as background reading.