CSE 150 LECTURE NOTES
March 4, 2004
ANNOUNCEMENTS
Today's handout is the description of the fourth (and last!) assignment.
FRAMEWORK FOR LEARNING A CLASSIFIER
With N training examples, the training data are a matrix with N rows
and p columns, where each example is represented by values for p
different features. Let feature value j for example i be x_ij.
The label of example i is y_i, for example y_i = 1 if message i is spam
and y_i = 0 if it is not spam.
Each test example is also represented as a row vector of length
p. The label y for a test example is unknown. The output of
the classifier is a guess at y.
Terminology: "feature" "attribute" "predictor" are all
synonyms; also "class" and "label" are synonyms.
RANDOM VARIABLES
A "random variable" is a function from outcomes to numerical
values. A
"probability
density function" or "pdf" assigns a probability to each value of a
random
variable. We write X ~ f(x) to mean P(X = x) =
f(x).
Notation: P(X = x) means the probability that the random
variable
X (always capitalized) has the particular value x (always lower
case).
When we write just P(X) we mean P(X = x) for any individual value x.
For example P(X,Y) = P(X)P(Y) means that for all values x and for all
values y, P(X =x and Y = y) = P(X =x)P(Y = y).
The notation P(X|Y) means P(X=x | Y = y). We can show all these
probabilities in a table. For example: ...
VIEWING AN EXAMPLE AS A VECTOR OF RANDOM VARIABLES
Mathematically, we view each training example as a random draw from a
population of outcomes. Each feature is a random variable.
The value
of this feature for this example is the value of this random variable.
Note that the class is a random variable also.
Let C be the class random variable, and let X1 to Xp be the feature
random variables. For a training example, the outcome of all of
these
is known.
For a test example, only the outcome of X1 to Xp is
known. We want to estimate P(C), i.e. P(C = c) for each different
value C might take. Often, the only possible values are C = 0 and
C =
1.
More precisely, we want to estimate the conditional probability P(C |
X1 and ... and Xp), which is different for each example since each
example has different values for the random variables X1 to Xp.
Precisely, the unconditional P(C = c) is the overall probability of
label value c in the whole population (also called the incidence of C=c
or the base rate of C=c).
BAYESIAN LEARNING
Bayesian learning is an approach to supervised learning using the
language
of probability theory and Bayes' rule.
Let's use Bayes' rule:
P(C=c | X1=x1 and ... and Xp=xp)
= P(X1=x1 and ... and Xp=xp | C=c) P(C=c) / P(X1=x1 and ... and
Xp=xp).
There are three factors above that we need to learn from the training
data.
The easiest is P(C = c).
The denominator is next easiest, because SUM_c P(C=c | X1=x1 and
... and Xp=xp) = 1 always.