CSE 150 LECTURE NOTES

March 2, 2004
 
 

ANNOUNCEMENTS

Today is the deadline for the Otter project.  Because the TA account is full, the online submission deadline is extended until noon on Wednesday March 3.  ACS has increased the quota but it won't take effect until Wednesday morning.  We will move some backups so there may be space this evening.

Today's handout is A Plan for Spam by Paul Graham.  Use this as background for the last assignment, and also as an example of good easy-to-read technical writing.
 
 

FOUNDATIONS OF PROBABILISTIC REASONING

The rest of today will be a review of basic probability theory.

To get started with probabilities, we assume a universe of basic mutually exclusive outcomes, each with primitive probabilities, which sum to 1.  For example the universe may be the six faces of a die, each with equal probability 1/6.

An "event" is a subset of basic outcomes, for example "odd" or "four or less".  We have

Because events are sets, we can combine  them with set operations.  Set union corresponds to "or" while set intersection corresponds to "and."

By definition, the event A is independent of the event B if and only if  Pr(A & B) = Pr(A) * Pr(B)

Notice that then  Pr(A|B) = Pr(A & B) / Pr(B) = (Pr(A) * Pr(B)) / Pr(B) = Pr(A) 

The above assumes that the denominators are non-zero.  When doing probability calculations, you always have to pay attention to the case of probabilities that are zero.


CONDITIONAL PROBABILITIES

We can talk about probabilities of propositional combinations of events.
                                            Pr(A & B)
By definition   Pr(A|B) = ---------   where A and B are events.
                                                 Pr(B)

The "product rule" says that  Pr(A & B) = Pr(A|B)Pr(B).

Note that Pr(A|B) = 0.8 does not mean "whenever B is true then Pr(A) = 0.8" because it may be the case, for example, that   Pr(A|B,C) = 0.

The statement "A is independent of B" is equivalent to Pr(A|B) = Pr(A) and also equivalent to Pr(B|A) = Pr(B)/
 
 

BAYES' RULE

Bayes' rule is actually a theorem, due to the Reverend Thomas Bayes, 1702-1761, who was a clergyman in the town of Tunbridge Wells in England..  This discovery was published posthumously in 1763.

Supposedly Bayes did not publish his discovery because he thought it was hubris for humans to investigate the will of God.  He discovered how to compute Pr(Y|X) based on knowledge of Pr(X|Y).

What Bayes discovered is this formula:

        Pr(Y|X) = Pr(Y & X)/Pr(X) = (Pr(X|Y)Pr(Y)) / Pr(X)

Example:

What is the probability that a patient with a stiff neck actually has meningitis?
 
 

INFORMAL PROBABILITY REASONING

The same facts can be stated as frequencies:  of 10000 patients, 2 have meningitis, one of those has a stiff neck, and overall 500 have a stiff neck.  Reasoning with numbers in this format is much easier, presumably because experience gives us absolute counts directly, not frequency ratios.

The following article review (author unknown), available at http://www.stat.unipg.it/ncsu/info/jse/v5n3/resource.html
shows how difficult commonsense reasoning with probabilities is.

Review of The Psychology of Good Judgment by Gerd Gigerenzer (1996). Medical Decision Making, 16(3), 273-280.

Gigerenzer argues that physicians and their patients will better understand the chance of a false positive result if we replace the conventional conditional probability analysis by an equivalent frequency method. 

... a mammography problem: To facilitate early detection of breast cancer, women are encouraged from a particular age on to participate at regular intervals in routine screening, even if they have no obvious symptoms. Imagine you conduct in a certain region such a breast cancer screening using mammography. For symptom-free women aged 40 to 50 who participate in screening  using mammography, the following information is available for this region.

     Probability format:

The probability that one of these women has breast cancer is 1%. If a woman has breast cancer, the probability is 80% that she will have a positive mammography test. If a woman does not have breast cancer, the probability is 10% that she willstill have a positive mammography test. Imagine a woman (aged 40 to 50, no symptoms) who has a positive mammography test in your breast cancer screening. What is the probability that she actually has breast cancer? _____%

     Frequency format:

Ten out of every 1,000 women have breast cancer. Of these 10 women with breast cancer, 8 will have a positive mammography test. Of the remaining 990 women without breast cancer, 99 will still have a positive mammography test.Imagine a sample of women (aged 40 to 50, no symptoms) who have positive mammography tests in your breast cancer screening. How many of these women do actually have breast cancer? _____ out of _____

In a classic study by D. M. Eddy (see Dowie J. Elstein (ed.) (1988), Professional Judgment: A Reader in Clinical Decision Making, Cambridge University Press, pp. 45-590), essentially this same question, with just the probability format, was given to 100 physicians.  Ninety-five of the physicians gave the answer of  approximately 75% instead of the correct answer, which, in this example, is 7.48%.

In the present study, Gigerenzer found that, when the information was presented in the probability format, only 10% reasoned with the Bayes computation

     P(breast cancer | positive test) =

     (.01)(.80)/[(.01)(.80) + (.99)(.10)] = .0748.

For the group given the frequency format, 46% computed the Bayes probability in the simpler form:

     P(breast cancer | positive test) = 8/(8 + 99) = .0748. 

The article discusses some of the reactions of the physicians to even considering such problems. Here are some quotes:
 On such a basis one can't make a diagnosis. Statistical information is one big lie.

 I never inform my patients about statistical data. I would tell the patient that mammography is not so exact, and I would in any case perform a biopsy.

 Oh, what nonsense. I can't do it. You should test my daughter. She studies medicine.

 Statistics is alien to everyday concerns and of little use for judging individual persons.

Some doctors commented that getting the answer in the frequency form was simple.

 

DATA ORGANIZATION FOR LEARNING A CLASSIFIER

With N training examples, the training data are a matrix with N rows and p columns, where each example is represented by values fpr p different features.  Let feature value j for example i be x_ij.

The label of example i is y_i, for example y_i = 1 if message i is spam and y_i = 0 if it is not spam.

Each test example is also represented as a row vector of length p.  The label y for a test example is unknown.  The output of the classifier is a guess at y.