DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF CALIFORNIA, SAN DIEGO

CSE 291: Statistical Learning

Assignment 3

Due Tuesday February 15 in class.


GUIDELINES

All the guidelines from Assignments 1 and 2 still apply.  See also the January 25 feedback on Assignment 1.


PROBLEMS

Please use http://www.quicktopic.com/29/H/tpJLWrViBcP to ask questions about these problems.

(1) (a) Answer question (4) from Assignment 1 again, taking into account the feedback given in class.  You may reuse your previous answer, and you may look at answers written by other students.  However, an answer earning full marks before will not necessarily earn full marks again.  Try hard to be compelling, i.e. correct, clear, and convincing.

(b) Modeling financial data using Gaussians is questionable, because real-world financial distributions are typically heavy-tailed.  Therefore, repeat part (c) of that question using Cauchy distribution(s) instead of Gaussian distribution(s).  For comparability, select Cauchy distribution(s) that are as similar as possible to the Gaussian(s) you use for part (c) in your answer above.

Note that the mean (and the variance, and all higher moments) of the Cauchy is mathematically undefined, so you cannot define "similar" in terms of mean and variance.  Note also that you will need to use a large number of replications in your experiments.  Discuss briefly but correctly, clearly, and convincingly what you learn from your numerical experiments using Cauchy distributions.
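
One possible way to make "similar" concrete, offered only as a sketch, is to match the median and the quartiles of the two distributions.  The quartiles of a Gaussian lie at mu plus or minus about 0.6745 sigma, and the quartiles of a Cauchy lie at its location plus or minus its scale.  The function names below are illustrative, and quartile matching is an assumption, not the required definition.

```python
import math
import random
import statistics


def cauchy_sample(loc, scale, rng):
    """Draw one Cauchy(loc, scale) variate by inverting the CDF."""
    u = rng.random()
    return loc + scale * math.tan(math.pi * (u - 0.5))


def matched_cauchy_scale(sigma):
    """Cauchy scale whose quartiles coincide with those of N(mu, sigma^2).

    Assumption: "similar" means equal median and equal quartiles.
    """
    z75 = statistics.NormalDist().inv_cdf(0.75)  # about 0.6745
    return sigma * z75


def mean_and_median(loc, scale, n, seed=0):
    """Sample mean and sample median of n Cauchy(loc, scale) draws."""
    rng = random.Random(seed)
    xs = [cauchy_sample(loc, scale, rng) for _ in range(n)]
    return statistics.fmean(xs), statistics.median(xs)
```

Calling mean_and_median with increasing n and several seeds shows the sample median settling near the location while the sample mean keeps jumping, which is one reason a large number of replications is needed.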


(2) This question asks you to do hypothesis testing, as discussed in class.

Assume that a certain species is endangered, unless its habitat contains at least N animals of the species.  A developer claims that the habitat does contain at least N animals.  To check the truth of this claim, m animals are captured, then tagged, then released.  After the animals have mixed thoroughly, n animals are captured again, of which r are found to be tagged.  Assume that N is large compared to m and n.

What is an appropriate null hypothesis here?  What does the p-value mean in this context?  What is an appropriate statistic to compute from N, m, n, and r?  What is an appropriate rule for decision-making (i.e., for coming to a conclusion)?  Answer these questions from the point of view of an environmentalist adversary of the developer.  Make your answers concrete.  Do a numerical experiment to confirm that your decision-making rule is appropriate, and that your p-values are correct.
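
The numerical experiment could be organized along the following lines.  This is a sketch under one possible formulation, not the required answer: it assumes the habitat holds exactly N animals (the boundary of the developer's claim), so that the number of tagged animals in the second capture is hypergeometric, and it treats a large r as evidence against the claim.  The function names and the significance level alpha are illustrative.

```python
import math
import random


def p_value_table(N, m, n):
    """Return a list pv where pv[r] = P(R >= r) for r = 0..min(m, n).

    Assumption: exactly N animals, m of them tagged, n drawn without
    replacement, so R is hypergeometric.
    """
    total = math.comb(N, n)
    upper = min(m, n)
    pmf = [math.comb(m, k) * math.comb(N - m, n - k) / total
           for k in range(upper + 1)]
    tail = [0.0] * (upper + 2)
    for k in range(upper, -1, -1):
        tail[k] = tail[k + 1] + pmf[k]  # accumulate the upper tail
    return tail[:upper + 1]


def rejection_rate(N, m, n, alpha, trials, seed=0):
    """Fraction of simulated recaptures whose p-value is at most alpha."""
    rng = random.Random(seed)
    pv = p_value_table(N, m, n)
    population = [1] * m + [0] * (N - m)  # 1 marks a tagged animal
    rejections = 0
    for _ in range(trials):
        r = sum(rng.sample(population, n))
        if pv[r] <= alpha:
            rejections += 1
    return rejections / trials
```

If the p-values are computed correctly, the simulated rejection rate under this assumption should not exceed alpha by more than sampling noise.  Because r is discrete, the rate may fall noticeably below alpha.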


(3) For this question you may use the book by Casella and Berger as your primary reference, but you may want to use other sources also.

(a) Give a detailed definition of the exponential family of families of distributions.  Make sure that your definition applies to both discrete and continuous distributions, i.e. to probability density functions (pdfs) and to probability mass functions (pmfs).
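
As a starting point, the form commonly given in textbooks such as Casella and Berger can be written as below.  Here f is a pdf or a pmf, h(x) >= 0 and t_1(x), ..., t_k(x) are real-valued functions of x that do not depend on theta, and c(theta) >= 0 and w_1(theta), ..., w_k(theta) are real-valued functions of theta that do not depend on x.  A detailed answer should still state its own conditions and cover both the discrete and the continuous case.

```latex
f(x \mid \theta) = h(x)\, c(\theta)\,
  \exp\!\left( \sum_{i=1}^{k} w_i(\theta)\, t_i(x) \right)
```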

(b) Consider (i) Dirichlet distributions, (ii) power law distributions, and (iii) Zipf distributions.  Which of these are members of the exponential family?

(c) Let x1, x2 through xn be an iid sample from an exponential family distribution.  State a version of the exponential family completeness theorem that applies to x1, x2 through xn.  Explain carefully whether or not the theorem relies on the exponential family being described using its natural parameters.

(d) Consider the following families of restricted Gaussians: (i) mu = constant, (ii) sigma^2 = constant, and (iii) mu/sigma = constant.  For which of these families does the completeness theorem apply?  Note: See Section 3.4 of Casella and Berger.


(4) [Silvey Example 4.7]  A cell contains organelles which may be regarded as spheres of equal but unknown radius r, distributed randomly.  A section of the cell is observed through a microscope; this section contains cross-sections of n organelles with radii x1, x2 through xn.  Determine the maximum-likelihood estimate of r.  What is the distribution of this estimate?
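
A candidate answer can be checked numerically with a sketch like the following.  It rests on a modeling assumption that is not stated in the problem: the sectioning plane cuts each observed sphere at a distance d from its centre that is uniform on [0, r], so the observed radius is sqrt(r^2 - d^2).  All function names are illustrative.

```python
import math
import random


def observed_radius(r, rng):
    """Radius of the circle cut from a sphere of radius r by a random plane.

    Assumption: the plane's distance d from the centre is uniform on [0, r].
    """
    d = rng.uniform(0.0, r)
    return math.sqrt(r * r - d * d)


def log_likelihood(r, xs):
    """Log-likelihood of r under that assumption; -inf unless r > max(xs).

    The density of one observed radius x is x / (r * sqrt(r^2 - x^2))
    for 0 < x < r.
    """
    if r <= max(xs):
        return -math.inf
    return sum(math.log(x) - math.log(r) - 0.5 * math.log(r * r - x * x)
               for x in xs)


def max_cdf(t, r, n):
    """P(largest of n observed radii <= t) for 0 <= t <= r."""
    return (1.0 - math.sqrt(1.0 - (t / r) ** 2)) ** n
```

Evaluating log_likelihood on a grid of r values above the largest observed radius, and comparing the empirical distribution of that largest radius over many simulated sections with max_cdf, gives a quick consistency check on a derived answer.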