When we are attempting to estimate $\theta_{ck}$, the most common prior applied
within text classification is the M-ESTIMATE [Mitchell97]. This corresponds to
pretending that the actual training data in $T$ is augmented by $NKw
\equiv |\mathit{Vocab}|$ pseudo-trials uniformly distributed across
all the keywords in $\mathit{Vocab}$. Operationally, this simply means
that all keyword counters are ``primed'' with one before real statistics
are collected. With the priors specified, we can now proceed to estimate
$\theta_{ck}$ under each of the two models discussed in §7.4.1.
Recall that the multi-variate
Bernoulli model associates a single biased coin $\theta_{ck}$ with each keyword
used by a class. The Bernoulli assumption is then that a document in the
class will contain at least one occurrence of the keyword with
probability $\theta_{ck}$, and will contain no instances of the keyword
with probability $1-\theta_{ck}$. We use a
Boolean ``indicator'' function $B_{dk}$ to signal the presence/absence
of a keyword in a document, and another $B_{dc}$ to signal whether the
document is or is not classified with respect to class $c$ ($B_{dc} = 1$ iff $d$ is
classified as an instance of class $c$, and zero otherwise). The estimate is then:
$$
\widetilde{\theta_{ck}} = \frac{1 + \sum\limits_{d \in T} B_{dk} B_{dc}}{2 + \sum\limits_{d \in T} B_{dc}}
$$
Note that in this statistic
the ``non-occurrence'' of keywords not present in the document
affects our estimate, too. This somewhat odd MARK OF ZERO [Lewis96] should make us feel less
sanguine about just what is captured by any Bernoulli model.
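
The Bernoulli estimate above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are taken to be lists of tokens, and the function name `bernoulli_estimate` and the toy data are hypothetical, not part of the text. Exact fractions are used so the effect of the primed counters is easy to read off.

```python
from fractions import Fraction


def bernoulli_estimate(docs, labels, keyword, cls):
    """Smoothed estimate of theta_ck under the multi-variate Bernoulli model.

    docs   : list of token lists (assumed toy representation)
    labels : list of class labels, one per document
    """
    # Documents d with B_dc = 1, reduced to sets because only
    # presence/absence of a keyword matters in this model.
    in_class = [set(d) for d, y in zip(docs, labels) if y == cls]
    # Number of class documents containing the keyword: sum of B_dk * B_dc.
    with_keyword = sum(1 for d in in_class if keyword in d)
    # Numerator primed with 1, denominator with 2 (present / absent).
    return Fraction(1 + with_keyword, 2 + len(in_class))


# Hypothetical toy training set T.
docs = [["ball", "game", "ball"], ["game", "score"], ["vote", "law"]]
labels = ["sports", "sports", "politics"]

print(bernoulli_estimate(docs, labels, "ball", "sports"))  # 1/2
print(bernoulli_estimate(docs, labels, "vote", "sports"))  # 1/4
```

Note that "vote" never occurs in a sports document, yet its estimate is 1/4 rather than zero. Each sports document that lacks the keyword also pulls the estimate down, which is the effect of absence described above.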
The
multinomial alternative (assuming the same priors as above) for these
estimates is:
$$
\widetilde{\theta_{ck}} = \frac{1 + \sum\limits_{d \in T} f_{dk} B_{dc}}{NKw + \sum\limits_{k' \in \mathit{Vocab}} \sum\limits_{d \in T} f_{dk'} B_{dc}}
$$
where $f_{dk}$ is the number of occurrences of keyword $k$ in document $d$.
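
A comparable sketch for the multinomial estimate follows, again assuming token-list documents. The name `multinomial_estimates`, the explicit `vocab` argument, and the toy data are hypothetical choices made for illustration. Here every keyword counter is primed with one, so the denominator grows by the vocabulary size.

```python
from collections import Counter
from fractions import Fraction


def multinomial_estimates(docs, labels, vocab, cls):
    """Smoothed estimates of theta_ck for every keyword k in vocab.

    Counts keyword occurrences (not just presence) over all
    documents labelled with class cls.
    """
    counts = Counter()
    for d, y in zip(docs, labels):
        if y == cls:
            # Accumulate within-document frequencies for class documents.
            counts.update(t for t in d if t in vocab)
    total = sum(counts.values())
    # One pseudo-trial per keyword: +1 on top, +|Vocab| on the bottom.
    return {k: Fraction(1 + counts[k], len(vocab) + total) for k in vocab}


# Hypothetical toy training set T and vocabulary.
docs = [["ball", "game", "ball"], ["game", "score"], ["vote", "law"]]
labels = ["sports", "sports", "politics"]
vocab = {"ball", "game", "score", "vote", "law"}

theta = multinomial_estimates(docs, labels, vocab, "sports")
print(theta["ball"])        # 3/10
print(theta["vote"])        # 1/10
print(sum(theta.values()))  # 1
```

Unlike the Bernoulli sketch, repeated occurrences of "ball" within one document raise its estimate, and the estimates over the whole vocabulary sum to one for each class.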
Empirically, the multinomial model seems to support
better classification performance, especially when larger vocabulary
sizes are considered [McCallum98b] .
Priors