

Priors

When we are attempting to estimate $\theta_{ck}$, the most common prior applied within text classification is the M-ESTIMATE [Mitchell97]. This corresponds to pretending that the actual training data in $T$ is augmented by $NKw \equiv |\mathrm{Vocab}|$ pseudo-trials uniformly distributed across all the keywords in $\mathrm{Vocab}$. Operationally, this simply means that all keyword counters are "primed" with one before real statistics are collected. With the priors specified, we can now proceed to estimate $\theta_{ck}$ under each of the two models discussed in §7.4.1.
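As a brief worked form of this prior (a sketch only; the symbols $n$, $n_k$, $m$ and $p$ are not used elsewhere in this section and are introduced here purely for illustration): if keyword $k$ accounts for $n_k$ of $n$ observed trials, the m-estimate with prior probability $p$ and equivalent sample size $m$ is

$$\frac{n_k + m p}{n + m}$$

Choosing a uniform prior $p = 1/|\mathrm{Vocab}|$ and $m = |\mathrm{Vocab}|$ gives $m p = 1$, so the estimate becomes

$$\frac{n_k + 1}{n + |\mathrm{Vocab}|}$$

which is the same as priming every keyword counter with one before counting.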

Recall that the multi-variate Bernoulli model associates a single biased coin $\theta_{ck}$ with each keyword used by a class. The Bernoulli assumption is then that a document in the class will contain at least one occurrence of the keyword with probability $\theta_{ck}$, and will contain no instances of it with probability $1-\theta_{ck}$. We use a Boolean "indicator" function $B_{dk}$ to signal the presence/absence of keyword $k$ in document $d$, and another, $B_{dc}$, to signal whether the document is classified with respect to class $c$ ($B_{dc} = 1$ iff $d$ is classified as an instance of class $c$, and zero otherwise). The estimate is then:

$$\widetilde{\theta_{ck}} = \frac{1 + \sum\limits_{d \in T} B_{dk} B_{dc}}{2 + \sum\limits_{d \in T} B_{dc}}$$

Here the prior contributes one pseudo-document for each of the two outcomes (keyword present, keyword absent), hence the 1 in the numerator and the 2 in the denominator. Note that in this statistic the "non-occurrence" of keywords not present in the document affects our estimate, too. This somewhat odd MARK OF ZERO [Lewis96] should make us feel less sanguine about just what is captured by any Bernoulli model.
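The Bernoulli estimate above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bernoulli_estimates`, the representation of documents as token lists, and the toy corpus are all hypothetical and are not part of the original text.

```python
def bernoulli_estimates(docs, labels, target_class, vocab):
    """Smoothed Bernoulli estimate of theta_ck for every keyword k in vocab.

    docs   : list of token lists (hypothetical representation of T)
    labels : parallel list of class labels, one per document
    """
    # Keep only documents with B_dc = 1, as keyword sets:
    # presence matters here, not the number of occurrences.
    class_docs = [set(tokens)
                  for tokens, label in zip(docs, labels)
                  if label == target_class]
    theta = {}
    for k in vocab:
        # Sum over d of B_dk * B_dc: in-class documents that contain k.
        n_with_k = sum(1 for doc in class_docs if k in doc)
        # Prior adds 1 to the numerator and 2 to the denominator.
        theta[k] = (1 + n_with_k) / (2 + len(class_docs))
    return theta


# Toy training set (made up for illustration).
docs = [["ball", "goal", "ball"], ["goal", "team"], ["vote", "team"]]
labels = ["sport", "sport", "politics"]
vocab = ["ball", "goal", "team", "vote"]

bernoulli_theta = bernoulli_estimates(docs, labels, "sport", vocab)
print(bernoulli_theta)
```

With two "sport" documents, "goal" appears in both and gets (1 + 2) / (2 + 2) = 0.75, while "vote" appears in neither and still receives (1 + 0) / (2 + 2) = 0.25 rather than zero. The repeated "ball" in the first document counts only once, since only presence is recorded.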

The multinomial alternative for these estimates (assuming the same priors as above) is:

$$\widetilde{\theta_{ck}} = \frac{1 + \sum\limits_{d \in T} f_{dk} B_{dc}}{NKw + \sum\limits_{k' \in \mathrm{Vocab}} \sum\limits_{d \in T} f_{dk'} B_{dc}}$$

where $f_{dk}$ is the number of occurrences of keyword $k$ in document $d$.
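A minimal Python sketch of the multinomial estimate follows, in which each keyword counter is primed with one and occurrences (not mere presence) are counted. The function name `multinomial_estimates`, the token-list document representation, and the toy corpus are assumptions made for illustration only.

```python
def multinomial_estimates(docs, labels, target_class, vocab):
    """Smoothed multinomial estimate of theta_ck for every keyword k in vocab.

    docs   : list of token lists (hypothetical representation of T)
    labels : parallel list of class labels, one per document
    """
    # Prime every keyword counter with one pseudo-occurrence.
    counts = {k: 1 for k in vocab}
    for tokens, label in zip(docs, labels):
        if label != target_class:
            continue  # B_dc = 0: document does not contribute
        for token in tokens:
            if token in counts:
                counts[token] += 1  # accumulates the keyword frequency
    # Denominator: |Vocab| pseudo-counts plus all in-class keyword occurrences.
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


# Toy training set (made up for illustration).
docs = [["ball", "goal", "ball"], ["goal", "team"], ["vote", "team"]]
labels = ["sport", "sport", "politics"]
vocab = ["ball", "goal", "team", "vote"]

multinomial_theta = multinomial_estimates(docs, labels, "sport", vocab)
print(multinomial_theta)
```

In this toy corpus the "sport" class contains five keyword occurrences, so the denominator is 4 + 5 = 9; "ball" occurs twice and gets 3/9, and the unseen "vote" gets 1/9. Unlike the Bernoulli estimates, these values sum to one over the vocabulary, since they describe a single distribution over keywords rather than one coin per keyword.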

Empirically, the multinomial model seems to support better classification performance, especially when larger vocabulary sizes are considered [McCallum98b].




FOA © R. K. Belew - 00-09-21