p_theta(x) = C(theta) exp[ Q1(theta)*t1(x) + ... + Qk(theta)*tk(x) ] h(x)
where theta is any collection of parameters and the Q and t functions are real-valued.
Note that by the factorization theorem, the vector [t1(x), ..., tk(x)] is sufficient.
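The exponential family includes the Gaussian and many other
families. The definition extends to discrete distributions, with
p_theta read as a probability mass function, and then includes the
Poisson, binomial, and other discrete families. It does not include
the uniform distributions on [0, theta] with theta unknown, because
their support depends on the parameter.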
Often, we have a major simplification: the parameter space is a subset of R^k and Qj(theta) = theta_j for j = 1, ..., k. The theta_j are then called the natural parameters. In this case,
p_theta(x) = C(theta) exp[ theta_1*t1(x) + ... + theta_k*tk(x) ] h(x)
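As an illustration of the natural parameterization, here is a minimal sketch for the Poisson family. It assumes the standard rewrite theta = log(lambda), t(x) = x, h(x) = 1/x! and C(theta) = exp(-e^theta); the function names are hypothetical and chosen only for this example. The code compares the textbook pmf with the exponential-family form.

```python
import math

def poisson_pmf_direct(x: int, lam: float) -> float:
    """Textbook Poisson pmf: lam^x * exp(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_pmf_natural(x: int, theta: float) -> float:
    """Same pmf written as C(theta) * exp(theta * t(x)) * h(x).

    Assumed identification (not from the notes): theta = log(lam).
    """
    c = math.exp(-math.exp(theta))   # C(theta) = exp(-e^theta)
    t = x                            # t(x) = x
    h = 1.0 / math.factorial(x)      # h(x) = 1/x!
    return c * math.exp(theta * t) * h

lam = 3.5
theta = math.log(lam)

# Largest disagreement between the two forms over x = 0, ..., 19.
max_gap = max(
    abs(poisson_pmf_direct(x, lam) - poisson_pmf_natural(x, theta))
    for x in range(20)
)
print(max_gap)
```

Here k = 1 and the natural parameter theta ranges over the whole real line, since any lambda > 0 corresponds to exactly one theta.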
Suppose (x_1, ..., x_n) is an iid sample from a univariate
Gaussian. As above, we have
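p_theta(x) = (2 pi sigma^2)^(-n/2) * exp( -1/(2 sigma^2) * SUM (xi - mu)^2 )
Here it looks like the parameter mu is involved with each separate xi. However, expanding the square gives SUM (xi - mu)^2 = SUM xi^2 - 2*mu*SUM xi + n*mu^2, so we can rewrite the above as
p_theta(x) = (2 pi sigma^2)^(-n/2) * exp( -1/(2 sigma^2) * [ SUM xi^2 - 2*mu*SUM xi + n*mu^2 ] )
= (2 pi sigma^2)^(-n/2) * exp( -n*mu^2/(2 sigma^2) ) * exp( -1/(2 sigma^2) * SUM xi^2 + mu/sigma^2 * SUM xi )
= C(theta) exp[ Q1(theta)*t1(x) + Q2(theta)*t2(x) ]
with theta = (mu, sigma^2), t1(x) = SUM xi^2, t2(x) = SUM xi, Q1(theta) = -1/(2 sigma^2), Q2(theta) = mu/sigma^2, and h(x) = 1.
We can describe the same family of distributions using a different definition of the parameters. Let phi = ( -1/(2 sigma^2), mu/sigma^2 ). In this case
p_phi(x) = C(phi) exp[ phi_1*t1(x) + phi_2*t2(x) ]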
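The factorization of the Gaussian joint density can be checked numerically. The sketch below assumes t1(x) = SUM xi^2 and t2(x) = SUM xi, with Q1 = -1/(2 sigma^2) and Q2 = mu/sigma^2; the function names and the sample values are made up for illustration.

```python
import math

def joint_density_direct(xs, mu, sigma2):
    """Joint density of an iid Gaussian sample, written with SUM (xi - mu)^2."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-ss / (2 * sigma2))

def joint_density_factored(xs, mu, sigma2):
    """Same density as C(theta) * exp(Q1*t1 + Q2*t2), with h(x) = 1."""
    n = len(xs)
    t1 = sum(x * x for x in xs)      # t1(x) = SUM xi^2
    t2 = sum(xs)                     # t2(x) = SUM xi
    q1 = -1 / (2 * sigma2)           # Q1(theta)
    q2 = mu / sigma2                 # Q2(theta)
    # C(theta) collects every factor that does not involve the data.
    c = (2 * math.pi * sigma2) ** (-n / 2) * math.exp(-n * mu ** 2 / (2 * sigma2))
    return c * math.exp(q1 * t1 + q2 * t2)

xs = [0.3, -1.2, 2.5, 0.9]           # illustrative sample, not from the notes
direct = joint_density_direct(xs, mu=0.5, sigma2=2.0)
factored = joint_density_factored(xs, mu=0.5, sigma2=2.0)
print(direct, factored)
```

The data enter the factored form only through the two sums, which is what makes the pair (SUM xi^2, SUM xi) sufficient.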
Theorem: Consider the exponential family of distributions
p_theta(x) = C(theta) exp[ theta_1*t1(x) + ... + theta_k*tk(x) ] h(x)
with sufficient statistic t(x) = (t1(x), ...,
tk(x)). Suppose the parameter space Theta contains a
k-dimensional rectangle. Then the family of distributions of t is
complete.
Proof: Omitted.
Notes: When you define a family of distributions, you have to say not just what the parameters are (e.g. mu and sigma^2) but also what the allowable ranges are for these (e.g. mu > 0, sigma^2 > mu).
To apply the theorem, you have to describe your exponential family using the natural parameters, e.g. phi = [ -1/(2 sigma^2), mu/sigma^2 ].
You then have to find a rectangle in the range of the natural
parameters. When a family of distributions is
highly restricted, e.g. we only consider Gaussians with mu = sigma^2,
then completeness can fail: you may not be able to find a rectangle of
full dimension, i.e. of dimension k.
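The rectangle condition can be made concrete for the Gaussian case. The following sketch assumes the natural parameters phi = ( -1/(2 sigma^2), mu/sigma^2 ) and uses hypothetical helper names. It shows that the unrestricted family covers a 2-dimensional rectangle, whereas under the restriction mu = sigma^2 the second natural parameter is stuck at 1, so the natural parameters lie on a line.

```python
def natural_params(mu, sigma2):
    """Map (mu, sigma^2) to phi = (-1/(2 sigma^2), mu/sigma^2)."""
    return (-1 / (2 * sigma2), mu / sigma2)

def original_params(phi1, phi2):
    """Inverse map; requires phi1 < 0 so that sigma^2 is positive."""
    sigma2 = -1 / (2 * phi1)
    return (phi2 * sigma2, sigma2)

# Unrestricted family: any phi1 < 0 gives a valid sigma^2 and phi2 is
# unconstrained, so the rectangle [-2, -1] x [-1, 1] lies in the range.
# Checking the corners is enough here because phi1 < 0 on the whole rectangle.
corners = [(-2.0, -1.0), (-2.0, 1.0), (-1.0, -1.0), (-1.0, 1.0)]
rectangle_ok = all(original_params(p1, p2)[1] > 0 for p1, p2 in corners)

# Restricted family mu = sigma^2 (with mu > 0): phi2 = mu/sigma^2 is always 1.
restricted = [natural_params(v, v) for v in (0.5, 1.0, 2.0, 10.0)]
second_coords = {phi2 for _, phi2 in restricted}

print(rectangle_ok, second_coords)
```

In the restricted case no rectangle of dimension 2 fits inside the range, so the theorem with k = 2 gives no conclusion about the pair (SUM xi^2, SUM xi).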
Because we use the natural logarithm and d/dx log x = 1/x, the chain rule for derivatives says that
s(x,theta) = 1/p(x,theta) * d p(x,theta) / d theta
Generally, given x we want to guess theta such that p(x,theta) is high, and at a local maximum of p(x,theta) in theta we have d p(x,theta) / d theta = 0. Hence for fixed x, the score function indicates which values of theta are candidates for the best guess: the score is zero at the optimum, and a non-zero score means p(x,theta) can still be increased by moving theta in the direction given by the sign of the score.
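The identity for the score can be illustrated with a small numerical sketch. It assumes an iid Gaussian sample with known variance 1 and unknown mean theta, for which the score is SUM (xi - theta); the function names, the sample, and the finite-difference step are choices made for this example only.

```python
import math

def density(xs, theta):
    """Joint density p(x, theta) of an iid N(theta, 1) sample."""
    n = len(xs)
    return (2 * math.pi) ** (-n / 2) * math.exp(-0.5 * sum((x - theta) ** 2 for x in xs))

def score(xs, theta):
    """Analytic score: d/d theta of log p(x, theta) = SUM (xi - theta)."""
    return sum(x - theta for x in xs)

def score_via_ratio(xs, theta, eps=1e-6):
    """(1 / p) * dp/d theta, with the derivative taken by central differences."""
    dp = (density(xs, theta + eps) - density(xs, theta - eps)) / (2 * eps)
    return dp / density(xs, theta)

xs = [0.3, -1.2, 2.5, 0.9]           # illustrative sample, not from the notes
theta_hat = sum(xs) / len(xs)        # sample mean, where the score vanishes

print(score(xs, 0.0), score_via_ratio(xs, 0.0))
print(score(xs, theta_hat))
```

At theta = 0 the score is positive, indicating that a larger theta raises the density; at the sample mean it is zero up to rounding.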