Assignment 4 Grading Guide
CSE 150 - Winter 2004

There are five parts to this assignment:

  1. Writing an email preprocessor.
  2. Implementing the naive Bayes learning algorithm.
  3. Implementing a naive Bayes classifier.
  4. Testing your classifier.
  5. Writing the report.

1. Write a Preprocessor.

You must write a preprocessor function that inputs text emails, computes the values of chosen features for these emails, and outputs the resulting vectors of feature values. If you like, the preprocessor may be written in a scripting language such as Python or Perl. This function will be used by both the learning and classifying algorithms. You should not have to replicate this code in both programs. In designing the preprocessor, you must choose which features to use. In your report, describe and justify your method for choosing features. You may use any interface you like, but be sure to describe this interface in your README.
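As an illustration only, here is a minimal Python sketch of one possible preprocessor. It assumes binary word-presence features over a small fixed vocabulary; the names VOCABULARY, tokenize, and extract_features are hypothetical, and your own feature choice and interface may differ.

```python
import re

# Hypothetical feature vocabulary (an assumption of this sketch);
# a real submission would choose and justify its own features.
VOCABULARY = ["free", "money", "meeting", "click"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_features(text, vocabulary=VOCABULARY):
    """Return a binary vector: 1 if the vocabulary word occurs in the email, else 0."""
    words = set(tokenize(text))
    return [1 if word in words else 0 for word in vocabulary]


if __name__ == "__main__":
    print(extract_features("Click here for FREE money!"))  # prints [1, 1, 0, 1]
```

Keeping this function in a single module lets both the learner and the classifier import it, so the code does not need to be replicated.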

2. Implement the naive Bayes Learner.

You must write a program that implements the naive Bayes learning algorithm. This program should input the set of labeled training text emails (as opposed to feature vectors) and estimate the necessary statistics. It should then output these statistics to a file to be used by the naive Bayes classifier. Use whatever input and output interfaces you like. In the README header, specify the exact command needed to run your learner on the spam and legit training emails. For instance, if the command to run your learner is:

spamNBLearner legit spam > spam_params.out

please specify this in the README header.
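The estimation step might be sketched as follows, assuming binary feature vectors, labels of 1 for spam and 0 for legitimate, and add-one (Laplace) smoothing. The smoothing, the JSON output format, and the function name learn_naive_bayes are assumptions of this sketch, not requirements; reading the email files and calling the preprocessor are omitted.

```python
import json


def learn_naive_bayes(vectors, labels):
    """Estimate P(class) and P(feature = 1 | class) from binary feature vectors.

    Add-one (Laplace) smoothing is an assumption of this sketch; it keeps
    every estimated probability strictly between 0 and 1.
    """
    n_features = len(vectors[0])
    params = {"prior": {}, "likelihood": {}}
    for c in (0, 1):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        # String keys so the parameters survive a JSON round trip unchanged.
        params["prior"][str(c)] = len(rows) / len(vectors)
        params["likelihood"][str(c)] = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
    return params


if __name__ == "__main__":
    # Toy training set: two spam and two legitimate emails, two features each.
    vectors = [[1, 0], [1, 1], [0, 0], [0, 1]]
    labels = [1, 1, 0, 0]
    # Writing to stdout lets the caller redirect the output to a parameter file.
    print(json.dumps(learn_naive_bayes(vectors, labels)))
```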

3. Implement the naive Bayes Classifier.

You must write a program that implements the naive Bayes classifier algorithm. This program should take one argument: the name of a file consisting of text emails strung together, in the same format as the spam and legit files. Using the statistics learned by the naive Bayes learner, it should output to stdout the estimated class of each input email: 1 for spam and 0 for legitimate. Each class should be on a separate line. If the name of your classifier executable is spamNBClassifier, then the command to run your classifier on the email messages in the file test would be:

spamNBClassifier test

Please specify the name of the classifier executable in the README header. The expected output should be something like:
1
1
0
1
0
...
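The classification rule can be sketched as below, assuming binary features and learned parameters held in a dictionary of class priors and per-feature likelihoods. The parameter layout, the example values, and the function name classify are hypothetical. Log probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow when there are many features.

```python
import math


def classify(vector, params):
    """Return 1 (spam) or 0 (legitimate) for one binary feature vector.

    Assumes every prior and likelihood lies strictly between 0 and 1,
    so that the logarithms below are defined.
    """
    scores = {}
    for c in ("0", "1"):
        score = math.log(params["prior"][c])
        for x, p in zip(vector, params["likelihood"][c]):
            # p is the estimated P(feature = 1 | class c).
            score += math.log(p if x == 1 else 1.0 - p)
        scores[c] = score
    return 1 if scores["1"] > scores["0"] else 0


if __name__ == "__main__":
    # Hypothetical parameters for two features, for illustration only.
    params = {
        "prior": {"0": 0.5, "1": 0.5},
        "likelihood": {"0": [0.25, 0.5], "1": [0.75, 0.5]},
    }
    for vector in ([1, 0], [0, 1]):
        print(classify(vector, params))  # one class per line: 1, then 0
```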

4. Test your classifier.

You are to test your classifier on each email in the provided set of unlabeled test emails. You should store the results of your classifier in the file test_results.out. Part of your grade will be based on the classification error of your algorithm on this test set. You may not hand-label this test set and use it in your learning algorithm.
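Because the test set is unlabeled, you can only estimate your own error on labeled data, for example on a portion of the training emails held out from learning. A minimal sketch of the error computation, with the hypothetical function name error_rate:

```python
def error_rate(predicted, actual):
    """Return the fraction of emails whose predicted class differs from the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    mistakes = sum(1 for p, a in zip(predicted, actual) if p != a)
    return mistakes / len(actual)


if __name__ == "__main__":
    # Toy example: two of four held-out emails are misclassified.
    print(error_rate([1, 0, 1, 1], [1, 1, 1, 0]))  # prints 0.5
```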

5. Write the Report.

Your report should address the issues raised in the Assignment 4 handout. You should also describe the classifier that your software learns, numerically and qualitatively.  Which features (and which values of which features) are most predictive of a message being spam?  How many features contribute significantly to classification, versus how many are uninformative?  What features might you want to add to your feature set in the future, to get a better classifier?  Is it adequate to represent a message as a "bag of words," or does this lose too much information?

README Header.

Your README file should have the following header:

Partner 1 Name: [full name of one partner]
Partner 2 Name: [full name of other partner]
Partner 1 Login: [login of one partner]
Partner 2 Login: [login of other partner]
NB Learner Command: [full command to run your NB learner]
NB Classifier Executable Name: [name of the command to run your classifier]

Here is an example README header:

Partner 1 Name: Anjum Gupta
Partner 2 Name: Kristin Branson
Partner 1 Login: a3gupta
Partner 2 Login: kbranson
NB Learner Command: spamNBLearner legit spam > spam_params.out
NB Classifier Executable Name: spamNBClassifier

Your programs should compile with the command make.

Here is the point division for this assignment:

Element                           Points
Preprocessor                      15
Learner                           15
Classifier                        15
Code style and complete README    10
Performance on the test data      10
Report                            35