CSE 291 LECTURE NOTES

January 4, 2005
 
 

WELCOME

See the introductory information.


REASONING VERSUS LEARNING

Probability theory is deductive, i.e. a form of reasoning:  "Given information about a certain probability distribution, what is the probability of a certain event?"

Statistical inference is inductive, i.e. a form of learning:  "Suppose we assume a family of probability distributions, and we observe certain events.  Which member of the family should we use for future reasoning?"

All inductive reasoning has two important properties:  (1) the conclusion is not guaranteed to be true, and (2) the conclusion depends on assumptions (i.e. prior knowledge) as well as on observations.

Example:  Consider a large population of neurons.  At age t, a proportion pi(t) are still alive.  At each of the ages t1, t2, ..., ts we take a random sample of n neurons and count how many are alive; call the count at age ti ri.  An observation is an ordered tuple (r1 ... rs) of integers.  Terminology: Note that here one observation x is an entire vector  (r1 ... rs)  of multiple measurements.

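If pi(t) is known, and the samples at different ages are drawn independently, then each count ri is binomial with parameters n and pi(ti), so p(r1 ... rs) = PRODUCT_{i=1}^{s}  (n choose ri) pi(ti)^ri [1 - pi(ti)]^(n - ri).  This computation is an example of deductive reasoning using probabilities.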
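
The deductive computation can be sketched in a few lines of Python.  This is only an illustration: the function name, the sample size, and the values of pi(ti) are made up, and the counts at different ages are assumed to be independent.

```python
from math import comb

def observation_probability(r, n, pi_values):
    """Probability of the counts r = (r1, ..., rs) when n neurons are sampled
    at each age and pi_values[i] is the known proportion alive, pi(t_i).
    Assumes the samples at different ages are independent."""
    prob = 1.0
    for r_i, p_i in zip(r, pi_values):
        # Binomial probability of r_i survivors among n sampled neurons.
        prob *= comb(n, r_i) * p_i**r_i * (1 - p_i)**(n - r_i)
    return prob

# Hypothetical numbers: n = 10 neurons sampled at s = 3 ages.
p = observation_probability(r=[9, 7, 4], n=10, pi_values=[0.9, 0.7, 0.5])
print(p)
```

Everything here flows from the known pi to the probability of the data; nothing is learned.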

But pi(t) is not known!  What we know (that is, what we assume) is that pi is a non-increasing function of t, and we want to induce what pi(t) is.

Note the role of prior knowledge: even though the sequence (r1 ... rs) may not be non-increasing, we assume that pi is.

Generally, we assume a family P_theta of possible distributions on the sample space {x} and the task is to choose an appropriate theta.  Here, the observation x is (r1 ... rs).

General idea:  Given observed data x, choose theta such that P_theta(x) is high.  We'll return to this idea, which is called "maximum likelihood."
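
As a rough sketch of this idea for the neuron example, suppose (purely for illustration; this family is not part of the notes) that pi_theta(t) = exp(-theta t) with theta > 0, which is non-increasing in t.  The code below picks the theta on a grid that makes the observed counts most probable.  The data, the grid, and all names are hypothetical.

```python
from math import comb, exp, log, log1p

def log_likelihood(theta, r, n, ages):
    """log P_theta(r1 ... rs) under the ASSUMED family pi(t) = exp(-theta * t).
    Requires theta > 0 and every age > 0 so that 0 < pi(t) < 1."""
    total = 0.0
    for r_i, t_i in zip(r, ages):
        p_i = exp(-theta * t_i)
        total += (log(comb(n, r_i))
                  + r_i * (-theta * t_i)        # r_i * log(p_i)
                  + (n - r_i) * log1p(-p_i))    # (n - r_i) * log(1 - p_i)
    return total

def max_likelihood_theta(r, n, ages, grid):
    """Return the grid value of theta with the highest log-likelihood."""
    return max(grid, key=lambda theta: log_likelihood(theta, r, n, ages))

# Hypothetical data: n = 10 neurons sampled at ages 1, 2, 3.
ages = [1.0, 2.0, 3.0]
r = [8, 6, 5]
grid = [k / 100 for k in range(1, 201)]   # candidate thetas 0.01 ... 2.00
theta_hat = max_likelihood_theta(r, 10, ages, grid)
print(theta_hat)
```

A grid search is used only because it is easy to read; any numerical optimizer could be substituted.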

 

POINT ESTIMATION

We'll start with the classical theory of parameter estimation, because this is one of the great achievements of 20th century mathematics.  The notes below are based originally on the book by S. D. Silvey.  The book is available through Amazon and elsewhere; see addall.com.

Often we don't need to make a commitment to a particular entire distribution P_theta.  Instead, we just want to know some property of P_theta, for example the t* such that pi(t*) = 0.5, which is the age at which half the neurons are still alive.  Finding the best guess for t*, based on observations and assumptions, is called "point estimation" or "parameter estimation."

Formally, we are given (1) a sample space X = { x }, (2) a family {P_theta: theta in Theta} of distributions on X, and (3) a function g: Theta -> R.  There is a true but unknown theta and hence a true but unknown value g(theta).

An estimator is a function g hat : X -> R.  Given a particular outcome (aka observation, aka training data) x, g hat(x) is an estimate.  Note that the estimator is our learning algorithm, while the estimate is the result of applying this algorithm to a particular set of observations.

Example: Suppose x = (x1 ... xn) is an iid sample from a univariate normal distribution with parameter theta = (mu, sigma^2).  Here g(theta) = mu, and the obvious estimator for mu is the sample average, x bar = (x1 + ... + xn) / n.
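
The distinction between estimator and estimate can be made concrete with a small Python sketch.  The true values of mu and sigma, the sample size, and the seed are arbitrary choices for illustration; in practice the true parameter is unknown.

```python
import random

def g_hat(x):
    """Estimator for mu: the sample average of the observations x."""
    return sum(x) / len(x)

random.seed(0)                       # fixed seed so the run is repeatable
mu, sigma = 2.0, 1.5                 # hypothetical true parameter values
x = [random.gauss(mu, sigma) for _ in range(1000)]   # one observation x

estimate = g_hat(x)                  # the estimate for this particular x
print(estimate)
```

The function g_hat is the estimator (the learning algorithm); the number it returns for this one sample is the estimate.  A different sample would generally give a different estimate from the same estimator.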