CSE 150 LECTURE NOTES

March 4, 2004
 
 

ANNOUNCEMENTS

Today's handout is the description of the fourth (and last!) assignment.

 

FRAMEWORK FOR LEARNING A CLASSIFIER

With N training examples, the training data are a matrix with N rows and p columns, where each example is represented by values for p different features.  Let feature value j for example i be x_ij.

The label of example i is y_i, for example y_i = 1 if message i is spam and y_i = 0 if it is not spam.

Each test example is also represented as a row vector of length p.  The label y for a test example is unknown.  The output of the classifier is a guess at y.

Terminology:  "feature" "attribute" "predictor" are all synonyms; also "class" and "label" are synonyms.

 

RANDOM VARIABLES

A "random variable" is a function from outcomes to numerical values.  A "probability density function" or "pdf" assigns a probability to each value of a random variable.  We write X ~ f(x) to mean P(X = x) = f(x).

Notation:  P(X = x) means the probability that the random variable X (always capitalized) has the particular value x (always lower case).  When we write just P(X) we mean P(X = x) for any individual value x.

For example P(X,Y) = P(X)P(Y) means that for all values x and for all values y, P(X =x and Y = y) = P(X =x)P(Y = y).

The notation P(X|Y) means P(X=x | Y = y).  We can show all these probabilities in a table.  For example: ...


VIEWING AN EXAMPLE AS A VECTOR OF RANDOM VARIABLES

Mathematically, we view each training example as a random draw from a population of outcomes.  Each feature is a random variable.  The value of this feature for this example is the value of this random variable.

Note that the class is a random variable also.

Let C be the class random variable, and let X1 to Xp be the feature random variables.  For a training example, the outcome of all of these is known. 

For a test example, only the outcome of X1 to Xp is known.  We want to estimate P(C), i.e. P(C = c) for each different value C might take.  Often, the only possible values are C = 0 and C = 1.

More precisely, we want to estimate the conditional probability P(C | X1 and ... and Xp), which is different for each example since each example has different values for the random variables X1 to Xp. 

Precisely, the unconditional P(C = c) is the overall probability of label value c in the whole population (also called the incidence of C=c or the base rate of C=c).


BAYESIAN LEARNING

Bayesian learning is an approach to supervised learning using the language of probability theory and Bayes' rule.

Let's use Bayes' rule:

P(C=c | X1=x1 and ... and Xp=xp)  =  P(X1=x1 and ... and Xp=xp | C=c) P(C=c) / P(X1=x1 and ... and Xp=xp).

There are three factors above that we need to learn from the training data.

The easiest is P(C = c).

The denominator is next easiest, because SUM_c  P(C=c | X1=x1 and ... and Xp=xp) = 1 always.