| Department of Computer Science and Engineering |
CSE 150 |
| University of California at San Diego |
Winter 2004 |
Assignment 4
DUE MONDAY MARCH 15, 2004, AT 8 AM.
The task here is to build a spam filter using Bayesian
learning. You should implement the naive Bayes learning algorithm
(NB) described in class, and apply it to learn a classifier that
distinguishes spam from legitimate email messages as accurately as
possible.
You must write a preprocessor that converts an email message into a
vector of feature values. Use your creativity in choosing
features that you think are likely to be predictive, i.e. likely to be
different for the two classes of messages. Use at least 100
different features, and make sure your NB implementation can handle at
least 200 features and 2000 training examples. Most of your
features are likely to be words, but also use other features such as
the time of day of the message. Use human intelligence to avoid
features that may be highly predictive for the training data, but that
will not be useful for test data, for example the date of the email
message.
Here is a training set of about 500 spam messages
and here is a training set of about 500 legitimate
messages. You will have to divide each file into its separate
email messages; this is not difficult. You must also write a
script to preprocess each message, to compute the value for it of each
feature that you use.
On Tuesday March 9 we will publish a test set of about 500 mixed spam
and legitimate email messages. You will run your classifier to
generate a file of predicted labels 1 for spam and 0 for
legitimate. We will compute the percentage of correct labels that
you submit, and part of your score will be based on this
percentage. Note that for this assignment, both types of mistake
(classifying spam as legitimate, and vice versa) are equally bad.
Your report should include discussions of the following issues:
- In your experience, is the NB algorithm useful for spam filtering?
- What are the strengths and weaknesses of using NB compared to
hand-coding a spam recognizer?
- What programming problems did you have to solve in order to make
your implementation of NB successful?
- What have you learned about the practicalities of data mining
applications, like this one?
You should also describe the classifier that your software learns,
numerically and qualitatively. Which features (and which values
of which features) are most predictive of a message being spam?
How many features contribute significantly to classification, versus
how many are uninformative? What features might you want to add
to your feature set in the future, to get a better classifier? Is
it adequate to represent a message as a "bag of words," or does this
lose too much information?