s(x,theta) = d log p(x,theta)/dtheta = b(theta)*[h(x) - theta]
In this case h(x) is the MVUE.
Proof: Use the fact that var[h(x)]*var[s(x,theta)] = cov[h(x), s(x,theta)]^2 = 1, and var[s(x,theta)] = b(theta)^2*var[h(x)].
log p(x,theta) = m(x)*log(theta) + (n-m(x))*log(1-theta)
s(x,theta) = d log p(x,theta)/dtheta = m(x)/theta - (n-m(x))/(1-theta) = n/[theta*(1-theta)] * [g(x) - theta]
where g(x) = m(x)/n. So this g(x) is an MVUE and its variance is 1/b(theta) = theta*(1-theta)/n.
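The binomial calculation can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names and the values n = 20, theta = 0.3 are chosen here purely for illustration. It sums over the exact binomial distribution to compare the variance of g(x) = m(x)/n with theta*(1-theta)/n, and confirms that the two forms of the score agree for every m.

```python
from math import comb

def binomial_pmf(m, n, theta):
    """Probability of m successes in n independent trials."""
    return comb(n, m) * theta**m * (1 - theta) ** (n - m)

def score(m, n, theta):
    """Derivative in theta of m*log(theta) + (n-m)*log(1-theta)."""
    return m / theta - (n - m) / (1 - theta)

def check_binomial_mvue(n, theta):
    """Return (variance of g, theta*(1-theta)/n, largest gap between score forms)."""
    b = n / (theta * (1 - theta))
    var_g = 0.0
    gap = 0.0
    for m in range(n + 1):
        g = m / n
        var_g += binomial_pmf(m, n, theta) * (g - theta) ** 2
        # Compare the direct derivative with the factored form b*(g - theta).
        gap = max(gap, abs(score(m, n, theta) - b * (g - theta)))
    return var_g, theta * (1 - theta) / n, gap

# Illustrative values only.
var_g, bound, gap = check_binomial_mvue(n=20, theta=0.3)
print(var_g, bound, gap)
```

The first two printed numbers should agree up to rounding error, and the third should be close to zero.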
For example, suppose there were three abductions of children by
strangers in California last year, and six this year. The observed
rate has doubled. Is this a terrible crime wave?
The answer is no.
Let the null hypothesis be that the true expected number per year is np
= 3, with random variability. Under this hypothesis, the standard
deviation is around sqrt(3) = 1.7. About 2/3 of years will have a rate
within +/- one standard deviation of the mean, and about 95% within +/-
two standard deviations. In this application, about one year out of
every three the number will be zero, or five or more, without any
change in the underlying rate.
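The 2/3 and 95% figures are rules of thumb borrowed from the normal distribution. As a rough check, the sketch below assumes that the yearly count follows a Poisson distribution with mean 3. This is an assumption made here, although it is consistent with a standard deviation of sqrt(3). The name poisson_pmf is introduced only for this illustration.

```python
from math import exp, factorial, sqrt

def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return exp(-mean) * mean**k / factorial(k)

mean = 3.0          # np under the null hypothesis
sd = sqrt(mean)     # about 1.7

# Probability of a count within one standard deviation of the mean.
p_within_one_sd = sum(poisson_pmf(k, mean) for k in range(20) if abs(k - mean) <= sd)

# Probability of zero, or of five or more.
p_zero = poisson_pmf(0, mean)
p_five_or_more = 1.0 - sum(poisson_pmf(k, mean) for k in range(5))

print(round(p_within_one_sd, 3), round(p_zero + p_five_or_more, 3))
```

Under this assumed model the chance of zero, or five or more, comes out somewhat below one in three. This is to be expected, since the normal rule of thumb is only approximate for such a small mean.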
Even with a very large sample, the number of information-rich
examples may still be very low. For example, there are over five
million children in California but very little information is available
about whether there has been a change in the probability of abduction.
(1) Which null hypothesis you choose should depend on your point of
view, and can change your final conclusion. Here, should H0 be
that np = 3, or that np = 4.5, where 4.5 is our best guess of the true
rate, assuming that the true rate is constant? Which H0 to choose
is a real-world question, not a technical mathematical one.
(2) Once you have chosen H0, the mathematical question is "what is the probability of either the observed outcome, or a more extreme outcome?" The definition of "more extreme" depends on the real-world scenario.
(3) The probability defined in (2) is called the p-value. Your
final conclusion is based on comparing the p-value to a threshold.
Which threshold you use is again a real-world question, not a
mathematical one.
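Steps (1) to (3) can be sketched for the abduction example. The code below is illustrative only. It assumes a Poisson model for the yearly count, takes "more extreme" to mean "at least as many as observed this year", and uses helper names invented here. It computes the p-value of observing six under each of the two candidate null hypotheses.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return exp(-mean) * mean**k / factorial(k)

def upper_tail_p_value(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean): one choice of 'more extreme'."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(observed))

p_if_rate_3 = upper_tail_p_value(6, 3.0)     # H0: np = 3
p_if_rate_4_5 = upper_tail_p_value(6, 4.5)   # H0: np = 4.5

print(round(p_if_rate_3, 3), round(p_if_rate_4_5, 3))
```

The two null hypotheses give noticeably different p-values for the same data. With a threshold of 0.05, taken here only as an example, neither p-value leads to rejecting H0.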
Fisher information is additive, because variances of independent
random variables are additive. If the sample (i.e. training set) is a
set of iid observations, then the total information is n times the
information provided by each observation.
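Additivity can be verified exactly in a small case. The sketch below assumes iid Bernoulli observations, a model chosen here for illustration, and enumerates all 2^n possible samples to compute the variance of the summed score. The function names are hypothetical.

```python
from itertools import product

def bernoulli_score(x, theta):
    """Score of a single Bernoulli observation x in {0, 1}."""
    return x / theta - (1 - x) / (1 - theta)

def score_variance(n, theta):
    """Exact variance of the summed score over all 2**n iid samples."""
    first = 0.0
    second = 0.0
    for xs in product((0, 1), repeat=n):
        prob = 1.0
        total = 0.0
        for x in xs:
            prob *= theta if x == 1 else 1 - theta
            total += bernoulli_score(x, theta)
        first += prob * total
        second += prob * total**2
    return second - first**2

theta = 0.3
info_one = score_variance(1, theta)   # information in one observation
info_six = score_variance(6, theta)   # information in six observations

print(info_one, info_six, 6 * info_one)
```

The last two printed values should agree up to rounding error, and the first should equal 1/[theta*(1-theta)].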
Let's call this sum l(x,theta). Given any theta, we can think of it as a function of x, i.e. a random variable. It's a different random variable for each theta.
For each n, let thetahat_n be the MLE. Assuming that the MLE
is not a "corner case" solution, because it maximizes the
log likelihood, thetahat_n is a solution of the equation D_theta
l(x,theta) = 0. Remember that D_theta l(x,theta) is the score
function, called s(x,theta) before.
We are going to prove that the MLE is essentially an ideal estimator
as n tends to infinity. More precisely, with probability one (a)
the MLE tends towards the true theta, and (b) the variance of the MLE
tends towards the Cramer-Rao lower bound.
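A simulation can make claims (a) and (b) concrete, although it proves nothing. The sketch below assumes iid exponential observations with rate theta, a model chosen here because its MLE, n / sum(x), has a closed form and its information per observation is 1/theta^2. The function names, the seed and the sample sizes are all illustrative.

```python
import random

def exponential_mle(xs):
    """MLE of the rate theta for iid exponential data: n / sum(x)."""
    return len(xs) / sum(xs)

def mle_summary(n, theta=2.0, reps=1000, seed=0):
    """Return (mean of the MLE, variance of the MLE / Cramer-Rao bound)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xs = [rng.expovariate(theta) for _ in range(n)]
        estimates.append(exponential_mle(xs))
    mean = sum(estimates) / reps
    var = sum((e - mean) ** 2 for e in estimates) / reps
    bound = theta**2 / n  # 1 / (n * information per observation)
    return mean, var / bound

for n in (10, 100, 400):
    mean, ratio = mle_summary(n)
    print(n, round(mean, 3), round(ratio, 3))
```

As n grows, the average of the simulated MLEs should approach the true rate of 2, and the ratio of the simulated variance to the bound should approach 1, up to simulation noise.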
Remember that each theta_n tilde is a function of x. It's too
much to ask that convergence be true for all x. There are weak
and strong versions of the definition using different probabilistic
conditions on x.
Notes:
(1) The sequence (theta_n tilde) can be consistent, even though each
theta_n tilde is not unbiased.
(2) Conversely, the sequence can fail to be consistent, even
though each theta_n tilde is unbiased.
(3) A sequence can be consistent, but still converge very slowly, e.g.
if each estimator throws away some useful information.
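Notes (2) and (3) can be illustrated with three estimators of the mean of a Gaussian sample. This is a sketch under assumptions made here: unit variance, a true mean of 1, and estimator names invented for the example. The sample mean is consistent. The first observation alone is unbiased but not consistent. The mean of only the first sqrt(n) observations is consistent but converges slowly, because it throws away most of the sample.

```python
import random
from math import isqrt

def sample_mean(xs):
    return sum(xs) / len(xs)

def first_only(xs):
    """Unbiased for every n, but its spread never shrinks."""
    return xs[0]

def first_sqrt(xs):
    """Uses only the first floor(sqrt(n)) observations."""
    k = isqrt(len(xs))
    return sum(xs[:k]) / k

ESTIMATORS = {
    "sample_mean": sample_mean,
    "first_only": first_only,
    "first_sqrt": first_sqrt,
}

def mse_table(sizes, reps=300, true_mean=1.0, seed=0):
    """Simulated mean squared error of each estimator at each sample size."""
    rng = random.Random(seed)
    table = {}
    for n in sizes:
        squared = {name: 0.0 for name in ESTIMATORS}
        for _ in range(reps):
            xs = [rng.gauss(true_mean, 1.0) for _ in range(n)]
            for name, estimator in ESTIMATORS.items():
                squared[name] += (estimator(xs) - true_mean) ** 2
        table[n] = {name: total / reps for name, total in squared.items()}
    return table

table = mse_table((16, 256, 1024))
for n, row in table.items():
    print(n, {name: round(value, 4) for name, value in row.items()})
```

The error of the sample mean should shrink roughly like 1/n and that of first_sqrt roughly like 1/sqrt(n), while the error of first_only should stay near 1 regardless of n.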
We shall use several intermediate results: