X_n + Y_n tends to X + b in distribution, and
X_n * Y_n tends to bX in distribution.
Proof: Omitted.
Intuitively, Slutsky's theorem says that the influence of Y_n on X_n is that of a constant, if Y_n tends to a constant.
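A small simulation can make this concrete. The sketch below is only an illustration under assumptions that are not in these notes: the data are exponentially distributed with mean mu = 2, and the helper name studentized_means is made up for this example. It standardizes the sample mean using the sample standard deviation S_n in place of the unknown sigma, and checks that the result still looks like N(0,1).

```python
import math
import random
import statistics


def studentized_means(n, reps, mu=2.0, seed=0):
    """Return sqrt(n) * (sample mean - mu) / S_n for `reps` samples of size n.

    Assumption for illustration: data are exponential with mean mu.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        # expovariate takes the rate, which is 1 / mean
        xs = [rng.expovariate(1.0 / mu) for _ in range(n)]
        xbar = statistics.fmean(xs)
        s_n = statistics.stdev(xs)  # sample standard deviation, replaces sigma
        values.append(math.sqrt(n) * (xbar - mu) / s_n)
    return values


t_values = studentized_means(n=400, reps=2000)
frac_within = sum(abs(t) <= 1.96 for t in t_values) / len(t_values)

print(f"mean of statistic     = {statistics.fmean(t_values):.3f}")
print(f"std dev of statistic  = {statistics.stdev(t_values):.3f}")
print(f"fraction |t| <= 1.96  = {frac_within:.3f}")
```

If the normal approximation holds, the printed mean should be near 0, the standard deviation near 1, and the fraction near 0.95. The exact numbers depend on the seed and on n.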
Example: Suppose sqrt(n)(X bar_n - mu)/sigma is asymptotically N(0,1), but the true variance sigma^2 is unknown. Let S_n^2 be our estimator of the variance. Suppose this estimator is (at least asymptotically) unbiased and its variance tends to zero, so it tends to the true sigma^2 in probability. Then sigma/S_n tends to 1 in probability, and since sqrt(n)(X bar_n - mu)/S_n = [sqrt(n)(X bar_n - mu)/sigma] * (sigma/S_n), the theorem says that sqrt(n)(X bar_n - mu)/S_n is asymptotically N(0,1).
Expanding the score function around theta to first order gives
s(x,theta hat) = s(x,theta) + (theta hat - theta) d/d theta s(x,theta) + remainder(x,theta,theta hat)
where the remainder involves (theta hat - theta)^2, which is of smaller order than the first-order term when theta hat is close to theta.
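Now we rearrange the equation. Because theta hat maximizes the likelihood, s(x,theta hat) = 0; dropping the remainder and solving for theta hat - theta gives: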
(theta hat - theta) = - s(x,theta) / [d/d theta s(x,theta)].
Multiply by sqrt(n) on both sides and by 1/n top and bottom on the right:
sqrt(n) (theta hat - theta) = [(1/sqrt(n)) s(x,theta)] / [-(1/n) d/d theta s(x,theta)]
The numerator is a sum of individual score functions:
(1/sqrt(n)) s(x,theta) = (1/sqrt(n)) SUM s(xi,theta)
We know that the expectation of Yi = s(xi,theta) is 0, and its variance is I. Therefore by the central limit theorem, (1/sqrt(n)) SUM s(xi,theta) tends in distribution to N(0,I).
Now consider the denominator: -1/n d/d theta s(x,theta) = -1/n SUM d/d theta s(xi,theta). We showed before that the expectation of -d/d theta s(xi,theta) is I. So by the weak law of large numbers, the denominator tends in probability to the constant I.
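Moving the denominator to the left, we have that I*sqrt(n)(theta hat - theta) tends in distribution to N(0,I). This step is justified by Slutsky's theorem, since the denominator tends in probability to a constant. Dividing by I scales the variance by 1/I^2, so sqrt(n)(theta hat - theta) tends in distribution to N(0, 1/I), which is what we wanted to prove.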
We can say informally that theta hat tends in distribution to N(theta, I/(I^2 n)) = N(theta, 1/(nI)). This says that the variance of theta hat is approximately the Cramer-Rao lower bound, i.e. 1/(nI).
Remember that I is the Fisher information content of a single observation xi, while nI is the Fisher information content of the entire training set of size n.
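The limit N(0, 1/I) can be checked numerically on a model where everything is available in closed form. The sketch below assumes, purely for illustration, an exponential density theta * exp(-theta * x). For that model the score of one observation is 1/theta - x, the Fisher information is I = 1/theta^2, and the maximum likelihood estimate is theta hat = 1 / (sample mean). The function name scaled_mle_errors is hypothetical.

```python
import math
import random
import statistics


def scaled_mle_errors(n, reps, theta=2.0, seed=1):
    """Return sqrt(n) * (theta_hat - theta) for `reps` simulated samples.

    Assumed model (not from the notes): density theta * exp(-theta * x),
    whose maximum likelihood estimate is 1 / (sample mean).
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        xs = [rng.expovariate(theta) for _ in range(n)]
        theta_hat = 1.0 / statistics.fmean(xs)
        errors.append(math.sqrt(n) * (theta_hat - theta))
    return errors


theta = 2.0
errors = scaled_mle_errors(n=500, reps=2000, theta=theta)

fisher_info = 1.0 / theta**2              # I for a single observation
empirical_var = statistics.variance(errors)

print(f"empirical variance of sqrt(n)(theta_hat - theta) = {empirical_var:.3f}")
print(f"1 / I                                            = {1.0 / fisher_info:.3f}")
```

The two printed numbers should be close to each other. The agreement is only approximate, because n and the number of repetitions are finite.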
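Idea: Given x, find the best-guess distribution inside Theta and also inside Omega. Each of these gives a maximum likelihood. Look at the ratio of the maximum likelihood over Theta to the maximum likelihood over Omega, and call it lambda(x). Since Omega is contained in Theta, the maximum over Theta is at least as large, so lambda(x) >= 1.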
We make decisions using a threshold k. We reject the null hypothesis Omega if and only if lambda(x) > k. We choose k so that
sup_{theta in Omega} P_theta(lambda(x) > k) = alpha
where alpha is called the significance level of the test.
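As a worked illustration of choosing k, consider a simple case that is assumed here and not taken from the notes: the observations are N(mu, 1), Theta is all real mu, and Omega is the single point mu = 0. The maximum over Theta is attained at mu = X bar_n, and the ratio simplifies to lambda(x) = exp(n * (X bar_n)^2 / 2). Under Omega, sqrt(n) * X bar_n is N(0,1), so lambda(x) > k exactly when |sqrt(n) * X bar_n| > z, where z is the upper alpha/2 normal quantile and k = exp(z^2 / 2). Because Omega is one point, the supremum over Omega is trivial. The sketch below computes k this way and checks the rejection rate by simulation.

```python
import math
import random
import statistics


def likelihood_ratio(xs):
    """lambda(x) for N(mu, 1) data: max likelihood over all mu divided by
    the likelihood at mu = 0 (assumed setup, for illustration only)."""
    n = len(xs)
    xbar = statistics.fmean(xs)
    return math.exp(0.5 * n * xbar**2)


alpha = 0.05
# Upper alpha/2 quantile of the standard normal distribution
z = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
k = math.exp(0.5 * z**2)  # threshold giving significance level alpha

# Simulate under the null hypothesis mu = 0 and count rejections
rng = random.Random(2)
n, reps = 50, 4000
rejections = 0
for _ in range(reps):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if likelihood_ratio(xs) > k:
        rejections += 1
rejection_rate = rejections / reps

print(f"threshold k               = {k:.3f}")
print(f"simulated rejection rate  = {rejection_rate:.3f}")
```

The simulated rejection rate should be close to alpha, up to Monte Carlo error. When Omega contains more than one point, k generally has to be found from the worst case over Omega, which this sketch does not cover.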
Notes: