CSE 291 LECTURE NOTES
January 6, 2005
ANNOUNCEMENTS
See the introductory information and the guidelines for the students who will produce
the lecture notes in LaTeX each week.
POINT ESTIMATION
The framework for point estimation is as follows. We are given
(1) a sample space X = { x },
(2) a family of
distributions on
X {P_theta: theta in Theta}, and
(3) a function g: Theta -> R.
(Here R means the set of real numbers.)
Example:
(1) The sample space is { x } = X = Rn. This means
that x = (x1 ... xn) is a vector of n real numbers.
(2) The family of distributions is all Gaussians N(mu, sigma^2).
Note that theta is an ordered pair, theta = (mu, sigma^2), and each
element of x is iid. "iid" means "independent and identically
distributed.
(3) Suppose we don't care what the precise distribution N(mu, sigma^2)
is, but just what its mean is. Then the function g is (mu,
sigma^2) -> mu.
There is a true but unknown theta and hence a true but unknown value
g(theta). Our objective is to guess g(theta) as accurately as
possible.
We are also given one training set, i.e. one sample x drawn according
to P_theta, where theta is the unknown true theta.
An estimator is a function ghat : X -> R. Given the
training set x, ghat(x) is
an estimate. Note that the estimator is our learning
algorithm, while the estimate is the result of applying this algorithm
to a particular set of observations.
Example continued: The obvious estimator for mu is the
function ghat(x) = x bar = (1/n) SUM x_i.
UNBIASEDNESS
The training set x follows P_theta so we can make
probability statements about g hat, e.g. compute E_theta[g
hat(x)]. Since we don't know what the true theta is, "compute"
means "work out a formula" here.
Definition: The estimator g hat is unbiased if
E_theta [g hat(x)] = g(theta) for all theta. Notation:
E_theta [ghat(x)] = E [ghat(x) | x ~ P_theta].
Traditionally in machine learning the word bias means any sort of prior
knowledge, e.g. the belief that survival rates are
non-increasing.
In mathematical statistics the word has a different, more narrow
meaning. (Both meanings are quite different from the ordinary
meaning "unjustified prejudice.")
Example continued: It can be computed that x bar is unbiased:
E[x bar]
= mu. Note: This computation uses the fact that "the
expectation of a sum is the sum of the expectations." This fact
is true even if the components of the sum are not independent.
Note that these computations are deductive, i.e. they involve
reasoning, not learning. Also note that these computations are
not
part of the estimator, i.e. not part of the learning algorithm.
We're doing the computation as part of analyzing the algorithm.
AN IDEAL ESTIMATOR ...
Unbiasedness is a nice property for an estimator, but it is not a
strong requirement. Example: ghat(x) = x_1 is unbiased also.
Ideally we want P_theta(g hat(x) = g(theta)) = 1
for every theta. In words, regardless of what the true theta is,
we want ghat(x) to equal g(theta) with probability 1, where x is
distributed according to the true theta.
When is this achievable? A sufficient condition: If X can be
partitioned into disjoint subsets X_theta such that for each theta,
P_theta(X_theta) = 1. Then let g hat(x) = g(theta) for x in
X_theta.
This situation can be achieved asymptotically when x is a very large
sample of independent observations.
MEAN SQUARED ERROR (MSE)
Definition: The mean
squared error (MSE) of an estimator g hat is
E[ (g hat(x) - g(theta))^2 | x ~
P_theta]
Example continued: Suppose x = (x1 ... xn) is an iid sample from a
univariate
normal distribution with parameter theta = (mu, sigma^2). The
obvious estimator for mu is x bar. What is the MSE of x
bar?
Answer: MSE[x bar] = sigma^2/n. Note: This computation uses the
fact that "the variance of a sum is the sum of the variances, if the
components of the sum are independent."
A desirable property would be minimum mean square error: For every
theta and every other estimator gbar, we want
E_theta [ (g hat - g(theta))^2 ] <= E_theta
[ (g
bar - g(theta))^2 ]
This is not achievable in general. Consider the estimator g
bar(x) = g(theta_0) regardless of x. Although this is a bad
estimator in general, it has zero error for theta = theta_0. So
ghat would have to have zero error for theta_0, and hence for all
theta. (Analogy: A stopped clock is perfectly accurate twice a
day.)
MINIMUM VARIANCE UNBIASED ESTIMATORS (MVUE)
We just saw that getting an estimator that is minimum MSE is not
achievable in general. Suppose we restrict attention only to
unbiased estimators. We can often find ghat with the following
property:
- ghat is unbiased and for every
theta and every other unbiased estimator gbar, E_theta [ (g hat -
g(theta))^2 ] <= E_theta [ (g
bar - g(theta))^2 ].
Note that if E_theta [g hat(x)] = g(theta) then E_theta [(g hat(x) -
g(theta))^2] is the variance var_theta(g hat). Therefore we can
use the terminology "minimum variance unbiased estimator" (MVUE).
Next week we'll start work towards an algorithm for designing MVUEs.
BOOK RECOMMENDATIONS
Here are some books recommended as background reading.