DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF CALIFORNIA, SAN DIEGO

CSE 291: Statistical Learning

Assignment 2

Due Tuesday February 1 in class.


GUIDELINES

You are encouraged to collaborate while solving the problems posed, and to use any books and other resources you wish.  However, you must write up your final solutions independently.  You are encouraged to share code, but you should have your own working implementation and you should write your own explanation of experimental results.  Your answers should be written in good, concise English with all necessary diagrams, plots, and explanations.  If necessary, you may make assumptions that are reasonable, and that do not make a question trivial.  If you do make an assumption, state it clearly.

These assignments are training for writing research papers.  Write up your answer to each question as if it were a piece of a research paper.  Polish your explanations, cite your sources, contribute something clear and definite (i.e. the result you are asked to show), but do not reinvent the wheel and do not get stuck on any single minor issue.

LaTeX is the intergalactic standard for writing research papers with mathematical content, so you should use it.  Only LaTeX can really typeset equations in a perfectly correct way.  (Mathematica qnd Word do not.)  Explain your work in full sentences and paragraphs, but make the answer to each question less than two pages single-spaced, unless it is really necessary to use more space.  Use BibTeX for citations.  Insert figures generated by Matlab into your LaTeX text at the appropriate places.  Make figures as simple and small as possible while still making them easy to read.  On the due date, you should submit a stapled 8.5x11 printout in class.

Mathematical proofs should be clear and not contain unexplained leaps, but it is not necessary to go into technical detail about measure theory, etc.  The statistical ideas are what is most important..  Some general hints on how to prove theorems:  

  1. Always make the high-level outline of the proof explicit.
  2. You have latitude to assume standard results, but if you do so, you should state each result precisely and cite a source.
  3. Do not try to show a difficult inequality using one global series of inequalities.  Instead, decompose the problem into parts, and prove each part separately.
  4. When you have a series of equations or inequalities, explain the most important steps of reasoning in English.  Use full sentences, not verbless phrases.  To describe your reasoning, use English words and phrases like "therefore" and "for all" rather than meta-level symbols such as => and the forall symbol.
  5. It is easy to make mistakes in algebraic manipulations.  However, it is much easier to check a proof than to invent one, so do not engage in wishful thinking and submit a proof with an algebraic error.

For some problems below, you are asked to create numerical examples in Matlab to check experimentally the correctness of a claim.  Depending on the claim, you may want to do the numerical verification before, after, or in parallel with the proof.  These Matlab examples are important.  They are the computational, modern component of the course.  The lesson to learn is how to use computation to advance understanding, to confirm symbolic results and to provide new insights that can be the springboard for further mathematical results.

Make your verification careful, convincing, and easy to understand.  In your submission, you should describe your verification process concisely.  To be convincing, take advantage of fast computers to use large enough sample sizes to reduce unwanted noise.  To be easy to understand, often, one two-dimensional figure showing two superimposed functions (e.g. an empirical histogram and a theoretical distribution) for visual comparison is ideal.  Always explain precisely but concisely how the functions plotted are defined mathematically.  Also, when you present numerical results in a figure or table, always say explicitly in English what the important lesson(s) to be drawn are.


PROBLEMS

Please use http://www.quicktopic.com/29/H/BzUhWUuMDS3F to ask questions about these problems.

(1) (a) Suppose e(x) and f(x) are both unbiased estimators of g(theta), with constant variance v and w respectively.  Let b be any constant.  Show that  be(x) + (1-b)f(x)  is also an unbiased estimator.  What is its variance?

(b) What is the optimal value for b that minimizes the variance of the combined estimator be(x) + (1-b)f(x) ?

(c) Extend your answer to parts (a) and (b) to the case of n estimators e1(x), e2(x), ..., en(x) each with its own constant variance.

(d) Are the results above really useful?  Hint:  Do estimators typically have constant variance?


(2) (a)  Let x1, x2 through xn be an iid sample from a Gaussian distribution.  Let z be the average of the xi.  Prove that the MVUE of the variance is the sum of (xi - z)2 divided by n-1.  Verify that this is the MVUE numerically.  (To illustrate numerically that you have found an MVUE, illustrate that your estimator is unbiased, and that it has variance smaller than some other reasonable unbiased estimator.)

(b) Consider an alternative estimator of the variance, namely the sum of (xi - z)2 divided by n instead of n-1.  Prove that this estimator has smaller MSE than the MVUE.  Verify this result numerically.

(c) Explain the connection between part (b) and the machine learning concept of overfitting.


(3) [Adapted from Silvey, 2.8.]  A biologist is interested in the prevalence of a certain SNP (single nucleotide polymorphism).  There are just two alternative genotypes, called A and B.  The biologist samples at random from a large population of the species.  She records the genotype of each individual, and stops sampling immediately after obtaining M individuals with genotype A (M > 1), by which time the total sample size is x.  Find an MVUE of pi, the proportion of genotype A in the population.

Here, you should make clear how you are following the informal algorithm for finding MVUEs explained in class.  Make clear what the sample space is, and what the family of probability distributions is.   You must do this explicitly as the foundation for finding any MVUE.

The answer M/x is incorrect, but for large M it is very close to the correct answer.  Be sure that your numerical experiments reveal that M/x is in fact incorrect.  This problem is an example of how numerical experiments can never prove that a mathematical result is correct.  They can only cast doubt on it, or suggest that it is approximately correct.


(4) (a) [Silvey, 1.3.]  Let x1, x2 through xn be independent Poisson random variables with common mean b.  Find the conditional distribution of x1, given x1+ x2+ ... + xn.

(b) [Silvey, 2.1.]  Let x1, x2 through xn be independent Poisson random variables with unknown common mean b.  Find an MVUE of e-b, the probability of the zero outcome.  Determine the variance of this estimator.  Verify your result numerically.

For (a), you should give a proof.  The answer is a well-known standard distribution--which one?  The answer to (a) is useful for (b).  In general, it is vital to express mathematical results as simply as possible, and to relate new results to old concepts.  Doing this makes it much easier to build on the new result in further reasoning, e.g. to use (a) for (b) here.