DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY OF CALIFORNIA, SAN DIEGO

CSE 291: Statistical Learning

Assignment 1

Deadline extended one week:  Due in class on Tuesday January 18, 2005.



GENERAL INSTRUCTIONS

You are encouraged to collaborate while solving the problems posed, and to use any books and other resources you wish.  However, you must write up your final solutions independently.  Your answers should be written in good, concise English with all necessary diagrams, plots, and explanations.  You must use LaTeX or similar high-quality software for text processing.  On the due date, you should submit a 8.5x11 printout in class.  You must staple the pages of your printout together; you must not use any sort of binder.

Mathematical proofs should be clear and not contain unexplained leaps, but it is not necessary to go into technical detail about measure theory, etc.  The statistical ideas are what is most important.

You are encouraged to share code, but you should have your own working implementation and you should write your own explanation of experimental results.

If necessary, you may make assumptions that are reasonable, and that do not make a question trivial.  If you do make an assumption, state it clearly.


PROBLEMS

Please ask questions about this assignment at http://www.quicktopic.com/29/H/hXscE9w3V7aD.

(1) [10 points]  Consider this unconstrained optimization problem:  minimize E[ (f(x) - y)2 ]  where  (x,y) ~ P.

(a) Explain carefully what the task is, i.e. explain all the notation used above.
(b) Show that the answer is f(x) = E[ y | x].
(c) Assume that y is binary, i.e. y = 0 or y = 1 always; in this case what is the answer?


(2) [10 points]  In a sample of n married couples, the standard deviation of the husbands' ages is four years, and the standard deviation of the wives' ages is three years.  Let d be the age difference within one couple.

(a) Under what circumstances would you expect to find the standard deviation of the d values in the sample to be about five years?
(b) Instead you find it to be two years.  One possibility is that this discrepancy is the result of random variability.  Give another reasonable explanation.
(c) Design and perform a numerical experiment using Matlab to answer the question "How large must n be, to inspire confidence that the discrepancy is not due to random variability?"


(3) [10 points]  See the paper Can We Learn to Beat the Best Stock? by Allan Borodin, Ran El-Yaniv, and Vincent Gogan, from the December 2003 NIPS conference.

(a) Implement the UBAH and UCBAL algorithms in Matlab.  Reproduce the DJIA and SP500 results for these algorithms given in Table 1 of the paper.
(b) Use the Matlab profiler to measure the average runtime of your implementation on the SP500 dataset.  Report the product of your runtime and the GHz speed of your CPU.  Everyone with product less than twice the best product achieved will get full marks.

Notes: 
(i) The relevant datasets are at http://www.cs.technion.ac.il/~rani/portfolios/.  These are matrices of relative stock prices, called X in the paper.
(ii) The point of part (b) is to implement UBAH and CBAL using Matlab builtin operations.  There is not much room to optimize these algorithms, but there is a lot of room to dis-optimize them, for example by using nested "for" loops.  Since the algorithms are fast, write an outer loop that executes each as many times as needed to give at least two digits of timing accuracy.  Preload the matrices you need and do not measure file I/O; Matlab is notoriously slow at I/O.


(4) [10 points]  Suppose you have $1000 now, and you want to maximize your wealth n years from now.  You may either keep your money as cash, in which case your gain or loss is fixed, or you may invest it in the stockmarket, in which case your gain or loss is random each year.  An oracle offers to tell you the level of the stockmarket m years from now; assume that the oracle's answer is guaranteed to be correct.

(a) For which m should you query the oracle?
(b) How much should you be willing to pay the oracle for its answer?
(c) Confirm your answers to parts (a) and (b) with a numerical experiment. 
(d) Suggest a real-world scenario where the answer to parts (a) and (b) would be useful.

Notes:  This puzzle is an exercise in formalizing a problem so that it can be solved mathematically.  Since the statement of the problem is vague, you will need to make assumptions in order to arrive at a useful answer.  When choosing your assumptions, aim for simplicity and interestingness.  For example, assumptions that lead to the solution m = 0 or m = n are not very interesting.   Here are some reasonable simple assumptions.
For part (c), you should use Matlab.  This is a simple example of checking experimentally the correctness of mathematical claims that you prove.  Depending on the claim, you may want to do the numerical verification before, after, or in parallel with the proof.  In your submission, you should describe your verification process concisely, and you should include a relevant plot generated in Matlab.



Most recently updated on January 7, 2005 by Charles Elkan, elkan@cs.ucsd.edu