Statistical inference is inductive, i.e. a form of learning:
"Suppose we
assume a family of probability distributions, and we observe certain
events. Which member of the family should we use for future
reasoning?"
All inductive reasoning has two important properties: (1) the
conclusion is not guaranteed to be true, and (2) the conclusion
depends on assumptions (i.e. prior knowledge) in addition to depending
on observations.
Example: Consider a large population of neurons. At age t, a proportion pi(t) of them are still alive. At each of the ages t1 ... ts we take a random sample of n neurons and count how many are alive, giving counts r1 ... rs. An observation is the sequence (r1 ... rs) of integers.
Terminology: Note that here one observation x is an entire vector (r1 ... rs) of multiple measurements.
If pi(t) is known, and the samples taken at different ages are
independent, then each count ri is binomial and
p(r1 ... rs) = PRODUCT_{i=1}^{s} (n choose ri) pi(ti)^ri [1 - pi(ti)]^(n - ri).
This computation is an example of deductive reasoning using
probabilities.
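The deductive computation above can be sketched in a few lines of Python. This is only an illustration: the function name joint_probability and the numbers in the usage line are made up for the example, and the sketch assumes the s samples are independent.

```python
from math import comb, prod

def joint_probability(r, n, pi_values):
    """Probability of the counts r = (r1 ... rs) when pi(t1) ... pi(ts) are known.

    Assumes the s samples are independent and each count ri is binomial
    with n trials and success probability pi(ti).
    """
    return prod(
        comb(n, ri) * p**ri * (1 - p) ** (n - ri)
        for ri, p in zip(r, pi_values)
    )

# Hypothetical numbers: n = 10 neurons sampled at s = 3 ages.
print(joint_probability(r=(9, 7, 4), n=10, pi_values=(0.9, 0.7, 0.4)))
```

Nothing is learned here: every pi(ti) is supplied as an input, and the probability of the data follows by calculation alone.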
But pi(t) is not known! What we know (that is, what we assume)
is that pi is a
non-increasing function of t, and we want to induce what pi(t) is.
Note the role of prior knowledge: even though the sequence (r1 ... rs) may not be non-increasing, we assume that pi is.
Generally, we assume a family P_theta of possible distributions on the sample space {x} and the task is to choose an appropriate theta. Here, the observation x is (r1 ... rs).
General idea: Given observed data x, choose theta such that
P_theta(x)
is as high as possible. We'll return to this idea, which is called
"maximum likelihood."
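To make the general idea concrete, here is a minimal sketch for the neuron example. It assumes, purely for illustration, the one-parameter family pi(t) = exp(-theta * t) with theta > 0. This family is not part of the notes; it is chosen because every member is non-increasing in t. The function names, the grid, and the data are likewise hypothetical.

```python
from math import comb, exp, log

def log_likelihood(theta, r, n, ages):
    """log P_theta(x) for x = (r1 ... rs), under the ASSUMED family
    pi(t) = exp(-theta * t). Requires theta > 0 and every age > 0."""
    total = 0.0
    for ri, t in zip(r, ages):
        p = exp(-theta * t)
        total += log(comb(n, ri)) + ri * log(p) + (n - ri) * log(1 - p)
    return total

def max_likelihood_theta(r, n, ages, grid):
    """Return the grid value of theta with the highest likelihood."""
    return max(grid, key=lambda theta: log_likelihood(theta, r, n, ages))

# Hypothetical data: n = 10 neurons sampled at ages 1, 3 and 8.
grid = [k / 1000 for k in range(1, 2001)]  # candidate thetas in (0, 2]
theta_hat = max_likelihood_theta(r=(9, 7, 4), n=10, ages=(1.0, 3.0, 8.0), grid=grid)
print(theta_hat)
```

The sketch works with the logarithm of P_theta(x), which has the same maximizer and avoids multiplying many small numbers. A grid search is used only because it is the simplest thing that matches the description; it is not a recommendation for real problems.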
An estimator is a function g hat : X -> R, where X is the sample
space. Given a particular outcome (aka observation, aka training
data) x, g hat(x) is an estimate. Note that the estimator is our
learning algorithm, while the estimate is the result of applying this
algorithm to a particular set of observations.
Example: Suppose x = (x1 ... xn) is an iid sample from a
univariate normal distribution with parameter theta = (mu, sigma^2).
The obvious estimator for mu is the sample average,
x bar = (x1 + ... + xn)/n.
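The distinction between estimator and estimate can be illustrated with the sample average. In the sketch below, the function mean_estimator plays the role of g hat, and the number it returns for one particular sample is the estimate. The true values mu = 5 and sigma = 2, the sample size, and the seed are arbitrary choices made for the illustration.

```python
import random
from statistics import fmean

def mean_estimator(x):
    """The estimator: a function from a sample (x1 ... xn) to a real number."""
    return fmean(x)

# Simulate an iid normal sample. Note that random.gauss takes the standard
# deviation sigma, not the variance sigma^2.
rng = random.Random(0)
sample = [rng.gauss(5.0, 2.0) for _ in range(1000)]

# The estimate: the result of applying the estimator to this sample.
estimate = mean_estimator(sample)
print(estimate)
```

Drawing a different sample would give a different estimate, while the estimator itself stays the same.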