Assignment 4 Grading Guide
CSE 150 - Winter 2004
There are five parts to this assignment:
1. Write a Preprocessor.
You must write a preprocessor function that inputs text emails, computes the values of chosen features for these emails, and outputs the resulting vectors of feature values. If you like, the preprocessor may be written in a scripting language such as Python or Perl. This function will be used by both the learning and classifying algorithms. You should not have to replicate this code in both programs. In designing the preprocessor, you must choose which features to use. In your report, describe and justify your method for choosing features. You may use any interface you like, but be sure to describe this interface in your README.
2. Implement the naive Bayes Learner.
You must write a program that implements the naive Bayes learning algorithm. This program should input the set of labeled training text emails (as opposed to feature vectors) and estimate the necessary statistics. It should then output these statistics to a file to be used by the naive Bayes classifier. Use whatever input and output interfaces you like. In the README header, specify the exact command needed to run your learner on the spam and legit training emails. For instance, if the command to run your learner is:
spamNBLearner legit spam > spam_params.out
please specify this in the README header.
3. Implement the naive Bayes Classifier.
You must write a program that implements the naive Bayes classifier algorithm. This program should take one argument: the name of a file consisting of text emails strung together, in the same format as the spam and legit files. Using the statistics learned by the naive Bayes learner, it should output to stdout the estimated class of each input email: 1 for spam and 0 for legitimate. Each class should be on a separate line. If the name of your classifier executable is spamNBClassifier, then the command to run your classifier on the email messages in the file test would be:
spamNBClassifier test
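As a rough illustration of how the preprocessor, learner, and classifier described above could fit together, here is a minimal Python sketch. It is not a reference solution. The feature list, the function names, the binary word-presence features, and the add-one smoothing are all assumptions made for the example. Splitting an input file into individual emails and saving the statistics to a file are left out, since the file format is not described here.

```python
import math
import re

# Hypothetical feature words; the assignment leaves the choice of features to you.
FEATURES = ["free", "money", "meeting", "click"]


def preprocess(email_text, features=FEATURES):
    """Return a binary vector: 1 if the feature word occurs in the email, else 0."""
    words = set(re.findall(r"[a-z']+", email_text.lower()))
    return [1 if f in words else 0 for f in features]


def learn(vectors, labels):
    """Estimate P(class) and P(feature = 1 | class) with add-one smoothing (an assumption)."""
    n_features = len(vectors[0])
    params = {}
    for c in (0, 1):  # 0 = legitimate, 1 = spam
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = (len(rows) + 1) / (len(labels) + 2)
        cond = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)]
        params[c] = (prior, cond)
    return params


def classify(vector, params):
    """Return 1 (spam) or 0 (legitimate) by comparing log posteriors; ties go to 0."""
    scores = {}
    for c, (prior, cond) in params.items():
        score = math.log(prior)
        for x, p in zip(vector, cond):
            score += math.log(p if x else 1.0 - p)
        scores[c] = score
    return 1 if scores[1] > scores[0] else 0


# Tiny made-up training set, for illustration only.
train = [
    ("free money click now", 1),
    ("click for free money", 1),
    ("meeting at noon", 0),
    ("project meeting notes", 0),
]
vectors = [preprocess(text) for text, _ in train]
labels = [label for _, label in train]
params = learn(vectors, labels)

# One estimated class per line, as the classifier is expected to print.
for email in ["free money", "meeting tomorrow"]:
    print(classify(preprocess(email), params))
```

Working with log probabilities avoids numerical underflow when many features are multiplied together, and sharing preprocess between learn and classify keeps the feature code in one place.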
Please specify the name of the classifier executable in the README header. The expected output should be something like:
4. Test your classifier.
You are to test your classifier on each email of this set of unlabeled test emails. You should store the results of your classifier in the file test_results.out. Part of your grade will be based on the classification error of your algorithm on this test set. You may not hand-label this test set and use it in your learning algorithm.
5. Write the Report.
Your report should address the issues listed in the Assignment 4 handout.
README Header:
Your README file should have the following header:
Partner 1 Name: [full name of one partner]
Partner 2 Name: [full name of other partner]
Partner 1 Login: [login of one partner]
Partner 2 Login: [login of other partner]
NB Learner Command: [full command to run your NB learner]
NB Classifier Executable Name: [name of the command to run your classifier]
Here is an example README header:
Partner 1 Name: Anjum Gupta
Partner 2 Name: Kristin Branson
Partner 1 Login: a3gupta
Partner 2 Login: kbranson
NB Learner Command: spamNBLearner legit spam > spam_params.out
NB Classifier Executable Name: spamNBClassifier
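Keeping each header field on a single line in "Field: value" form makes the header easy to read mechanically. The sketch below shows one way a script might parse and check such a header. The function name parse_readme_header and the REQUIRED_FIELDS list are hypothetical, and nothing here is claimed to be the actual grading script.

```python
# Field names taken from the header template above.
REQUIRED_FIELDS = [
    "Partner 1 Name",
    "Partner 2 Name",
    "Partner 1 Login",
    "Partner 2 Login",
    "NB Learner Command",
    "NB Classifier Executable Name",
]


def parse_readme_header(text):
    """Parse leading 'Field: value' lines into a dict; stop at the first line without a colon."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            break
        key, value = line.split(":", 1)  # split on the first colon only
        fields[key.strip()] = value.strip()
    return fields


example = """Partner 1 Name: Anjum Gupta
Partner 2 Name: Kristin Branson
Partner 1 Login: a3gupta
Partner 2 Login: kbranson
NB Learner Command: spamNBLearner legit spam > spam_params.out
NB Classifier Executable Name: spamNBClassifier
"""

header = parse_readme_header(example)
missing = [name for name in REQUIRED_FIELDS if name not in header]
print("missing fields:", missing)
```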
Your programs should compile with the command make.
Here is the point division for this assignment:
| Element | Points |
| Preprocessor | 15 |
| Learner | 15 |
| Classifier | 15 |
| Code style and complete README | 10 |
| Performance on the test data | 10 |
| Report | 35 |
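The "Performance on the test data" element depends on classification error, that is, the fraction of test emails whose predicted class differs from the true class. A minimal sketch of that calculation follows. The function names and the sample values are made up for illustration, and how the error maps to points is not specified here.

```python
def read_classes(text):
    """Parse classifier output (one 0 or 1 per line) into a list of ints."""
    return [int(token) for token in text.split()]


def classification_error(predicted, actual):
    """Return the fraction of emails whose predicted class differs from the true class."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Made-up values: in practice the predictions would come from test_results.out.
predicted = read_classes("1\n0\n1\n1\n")
actual = [1, 0, 0, 1]
print(classification_error(predicted, actual))
```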