CSE 150 LECTURE NOTES

March 9, 2004
 
 

ANNOUNCEMENTS

We have published the "what to submit" guide and the grading guide for the last assignment.


BAYESIAN LEARNING

Bayesian learning is an approach to supervised learning using the language of probability theory and Bayes' rule.

Let's use Bayes' rule:

P(C=c | X1=x1 and ... and Xp=xp)  =  P(X1=x1 and ... and Xp=xp | C=c) P(C=c) / P(X1=x1 and ... and Xp=xp).

There are three factors above that we need to learn from the training data.

The easiest is P(C = c): we estimate this as count(C = c)/n where count(C =c) means the number of training examples with value c for the class, and n is the total number of training examples

The denominator is next easiest, because it is the same for all different c.  There are three different ways to deal with the denominator:
  1. If we just want to know which c is most probable given X, we just need to find which numerator is highest, because the denominator is fixed given X
  2. SUM_c  P(C=c | X1=x1 and ... and Xp=xp) = 1 always, so we can calculate the numerator for every c and then calculate the value for the denominator that makes this sum equal to 1.
  3. The different labels c are mutually exclusive, so
P(X1=x1 and ... and Xp=xp) = SUM_c P(X1=x1 and ... and Xp=xp | C=c) P(C=c). 

With just two labels 0 and 1 we have

P(X1=x1 and ... and Xp=xp) = P(X1=x1 and ... and Xp=xp | C=0) P(C=0) + P(X1=x1 and ... and Xp=xp | C=1) P(C=1).

All the equations above are mathematically always true.  We have not made any assumptions yet.


THE NAIVE BAYES ASSUMPTION: CONDITIONAL INDEPENDENCE

So, we need to estimate P(X1=x1 and ... and Xp=xp | C=c) somehow.

Now we make an important assumption that in general is not exactly true.  Let's assume that

P(X1=x1 and ... and Xp=xp | C=c) = P(X1=x1 C = c) ... P(Xp=xp | C=c).

This is an assumption that the features are independent inside each class.  It is not an overall independence assumption.

GRE and GPA example.


IMPLEMENTING NAIVE BAYES LEARNING

Summary:  Features are viewed as random variables and usually called X1 through Xk.  One training or test example consists of one particular outcome for each of X1 through Xk.  The class or label is also a random variable, typically called C.  In the simple case C and all the Xj are discrete random variables, i.e. the set of possible outcomes is finite as opposed to real-valued.

Given the training data we calculate:
  1. A vector P(C=c) for each c.
  2. For each c and each feature Xj, a vector P(Xj = y | C = c) where y ranges over the different possible values of Xj.
If we have p features, each with the same number of alternative values (e.g. q = 10) then step 2 gives a matrix with p rows and q columns, for each c. 

For a test example, we know the value yj of each feature Xj, but we do not know the value of C.  For each alternative c, we calculate the numerator as a product of p+ 1 numbers:

N(c)  =  P(C=c) PRODUCT_j P(Xj=yj | C = c).

Each number in the product comes from one row of the matrix that we stored for C = c.  Finally, we predict the class c for which N(c) is highest.

Use logarithms of probabilities instead of probabilities.

Use pseudo-counts to avoid zero probabilities.


COMPUTATIONAL COMPLEXITY

Suppose we have N training examples, p features, and q values for each feature, and r alternative values for the class.

Then the space used to store the classifier is O(pqr).  This space is fixed regardless of how many training examples we have.

The time to learn the classifier is O(Np).  For each feature, for each example, we increment one count, namely count(C = c, Xj = y).

This is time linear in the size of the input data.  Any learning algorithm that inspects all training data must have this complexity or worse.