| Department of Computer Science and Engineering | CSE 130 |
| University of California at San Diego | Fall 1999 |
The purpose of this project is to study some important features of scripting languages through the experience of solving the same problem in awk and in Java. (You may use Perl or Python instead of awk if you prefer, but awk is recommended because it is a simpler language.) Your report should discuss how scripting languages and object-oriented languages are "high-level" in different ways.
The assignment is to find a good investment strategy based on the daily prices of various mutual funds. Given the change in price from the previous day to today, how likely is it that a trend will continue tomorrow? Or will tomorrow's closing price be affected by what happens in other financial markets today?
Linear regression is an algorithm for finding the straight-line relationship that best fits two vectors of data X and Y. For more information on linear regression, see any good statistics textbook.
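As a rough illustration of the idea, here is a minimal least-squares sketch in Python. It is not the provided linreg.awk, and the function name linreg and its return values are assumptions made for this example. It fits y = slope * x + intercept and also returns r^2, the square of the correlation between the two vectors.

```python
def linreg(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r2). The name and return shape are
    hypothetical; they are not taken from linreg.awk.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # If y never varies there is nothing to explain, so report 0.0.
    r2 = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2
```

For perfectly linear data such as xs = [1, 2, 3] and ys = [2, 4, 6], the sketch gives a slope of 2, an intercept of 0, and an r^2 of 1.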
For this assignment you should compute the linear regression between two sets of vectors (also called time series):
(1) X = {percentage change of price of some mutual fund from yesterday to today}
    Y = {percentage change of price of the same mutual fund from two days ago to yesterday}
(2) X = {percentage change of price of some mutual fund from yesterday to today}
    Y = {percentage change of closing price of S&P500 from two days ago to yesterday}
In each case "yesterday" means the previous business day, i.e. the previous day for which a closing price is available. Case (1) estimates the accuracy of trend-following, while case (2) estimates how much the price of a mutual fund on one day is influenced by the S&P500 on the previous day.
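The two cases can be sketched in Python as follows. The helper names pct_changes and lagged_pairs are hypothetical, and the sketch assumes that prices are listed oldest first and that both series cover exactly the same business days.

```python
def pct_changes(prices):
    """Percentage change between consecutive closing prices (oldest first)."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


def lagged_pairs(fund_changes, other_changes):
    """Build the X and Y vectors for one case.

    X is the fund's change from yesterday to today; Y is the change of
    the other series from two days ago to yesterday. Both inputs are
    assumed to be aligned on the same business days.
    """
    if len(fund_changes) != len(other_changes):
        raise ValueError("series must cover the same business days")
    xs = fund_changes[1:]     # drop the first day: it has no "yesterday"
    ys = other_changes[:-1]   # drop the last day: it has no "tomorrow"
    return xs, ys
```

For case (1), pass the fund's own changes twice, as in lagged_pairs(fund, fund). For case (2), pass the fund's changes and the S&P500 changes.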
Your program will take one command line argument: the name of a file containing a list of mutual fund indices. A mutual fund index is a string of five uppercase letters ending with X, for example JAOSX. Your program should obtain fund price data from Yahoo! Finance. However, design the program so that it will cache the data in a local file. The cache file should be named data/<index>. For example, given an index ZZZZX, first search for the cache file data/ZZZZX. If this file does not exist, obtain the data from Yahoo! and store it in the cache file. When creating the cache file, you can assume that the directory data already exists. The range of the data should be from 1/1/1998 to 12/31/1998.
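The caching rule above might look like the following Python sketch. The names is_fund_symbol and load_cached are hypothetical, and the download step is passed in as a function (fetch) so that the cache logic can be shown without committing to any particular way of reaching the web.

```python
import os
import re


def is_fund_symbol(text):
    """True for five uppercase letters ending with X, for example JAOSX."""
    return re.fullmatch(r"[A-Z]{4}X", text) is not None


def load_cached(symbol, fetch, cache_dir="data"):
    """Return the raw price data for symbol, using data/<index> as a cache.

    fetch is a caller-supplied function (an assumption of this sketch)
    that downloads the data for one symbol and returns it as a string.
    The cache directory is assumed to exist already.
    """
    path = os.path.join(cache_dir, symbol)
    if os.path.exists(path):
        with open(path) as cache_file:
            return cache_file.read()
    text = fetch(symbol)
    with open(path, "w") as cache_file:
        cache_file.write(text)
    return text
```

The S&P500 name ^spc does not match the five-letter pattern, so a real program would have to treat it separately from the funds listed in the input file.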
To download data from the web, for awk, call the Unix text web browser lynx with the option -dump using the awk builtin function system. For the Java program, use the java.net.URL class to create an input stream.
Use /usr/bin/nawk for the awk interpreter, and JDK 1.2.1 for the Java compiler.
In each case, the top five mutual funds are those with the highest r^2 correlation coefficients. The r^2 value ranges between 0.0 and 1.0, and is 0.0 when the two series are independent, that is, when the prediction is no better than a random guess.
Your program should output the indices and the scores of the top 5 mutual funds for each of the cases (1) and (2):
#Autocorrelation
AAAAX 0.8100
BBBBX 0.8005
CCCCX 0.7993
...
#Correlation with S&P500
ZZZZX 0.5567
YYYYX 0.5234
...
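One way to produce a report in the format shown above is sketched here in Python. The names top_funds and format_report are hypothetical, the four-decimal formatting is inferred from the sample output, and breaking ties alphabetically is an assumption, since the handout does not say how ties should be ordered.

```python
def top_funds(scores, count=5):
    """Return the count highest-scoring (index, r2) pairs, best first.

    scores maps a fund index to its r^2 value. Ties are broken
    alphabetically, which is a choice made for this sketch only.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:count]


def format_report(title, scores, count=5):
    """Format one section of the output, e.g. title 'Autocorrelation'."""
    lines = ["#" + title]
    for name, r2 in top_funds(scores, count):
        lines.append("%s %.4f" % (name, r2))
    return "\n".join(lines)
```

Calling format_report once with the case (1) scores and once with the case (2) scores, and printing both results, gives the two sections of the output.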
WHAT WE WILL PROVIDE
We will provide the following:
1. An awk function that computes linear regressions is in ../public/project4/linreg.awk. You must implement the corresponding Java function yourself.
2. An example file containing a list of indices of mutual funds is in ../public/project4/fundindex.
3. Example awk code that extracts only the indices from ../public/project4/namelist, which was obtained from http://www.fundalarm.com/sort_x.htm.
4. The S&P500 index is named ^spc.
5. Given an index name, say YAFFX, the following URL returns the closing price of YAFFX from 1/2/1999 to 3/4/1999 in comma-separated format, unfortunately in reverse order:
http://chart.yahoo.com/table.csv?s=YAFFX&a=1&b=2&c=1999&d=3&e=4&f=1999&g=d&q=q&y=0&z=YAFFX&x=.csv
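The example URL in item 5 can be generalized to other indices and date ranges, as in the Python sketch below. The meaning of the parameters is inferred from that one example only: a, b, c appear to be the starting month, day, and year, and d, e, f the ending month, day, and year. The remaining parameters (g, q, y, z, x) are copied unchanged without knowing what they do. The function names are hypothetical, and the sketch only builds the URL string; it does not contact the server.

```python
from urllib.parse import urlencode

BASE_URL = "http://chart.yahoo.com/table.csv"


def price_url(symbol, start, end):
    """Build a price-table URL; start and end are (month, day, year) tuples.

    The parameter layout is inferred from the single example in the
    handout and may not be complete or exact.
    """
    params = [
        ("s", symbol),
        ("a", start[0]), ("b", start[1]), ("c", start[2]),
        ("d", end[0]), ("e", end[1]), ("f", end[2]),
        ("g", "d"), ("q", "q"), ("y", 0), ("z", symbol), ("x", ".csv"),
    ]
    return BASE_URL + "?" + urlencode(params)


def oldest_first(rows):
    """The data arrives newest first; return a copy in chronological order.

    rows is assumed to hold data lines only, with any header removed.
    """
    return list(reversed(rows))
```

For the date range required by this project, the call would be price_url("ZZZZX", (1, 1, 1998), (12, 31, 1998)). Note that urlencode writes the ^ in ^spc as %5E, which is the standard percent-encoded form of the same character.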
You might find other possibly helpful stuff under ../public/project4/.
To turn in your work, cd to the directory containing the two files and type the Unix command bundleP4. The bundleP4 script knows which files to look for and submit. Remember to check the class web page at http://www-cse.ucsd.edu/classes/fa99/cse130 and the message board for any changes to these directions.
Features of awk and relevant issues that should be discussed in your report include but are not limited to the following:
If you use Perl or Python instead of awk for the programming part of this project, then your report should discuss awk and also the language that you use.
As an appendix to your report, you should submit a printout of your software, with comments and documentation of professional quality. This documentation should be sufficient for another software engineer to maintain the program. Remember that good documentation is necessary but not sufficient. Comments and user instructions cannot alleviate bad engineering.
Also, don't forget to attach a copy of the team self-evaluation form. The form is required for all four projects, including this one. Be sure to follow all the rules and guidelines explained in the CSE 130 course description. Complete academic honesty is again required.