We will talk about competing "hypotheses," for example, rules that
successfully divide our spam Email from our family's Email. If only very
simple hypotheses are to be considered, a relatively small amount of
data can be used to select between them. For example, if our hypothesis
is that spam Email always contains the phrase $$$$ BIG MONEY $$$$$,
a small amount of training data is sufficient to confirm or
disconfirm this rule [Sahami98].
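As a minimal sketch of how little data such a single-phrase rule needs, the following Python checks the rule against a handful of labeled messages. The messages, the function names, and the exact spelling of the phrase are invented for illustration and are not taken from [Sahami98].

```python
# Hypothetical spelling of the trigger phrase, assumed for this example.
PHRASE = "$$$$ BIG MONEY $$$$$"


def rule_predicts_spam(message, phrase=PHRASE):
    """Hypothesis: a message is spam exactly when it contains the phrase."""
    return phrase in message


def rule_accuracy(labeled_messages, phrase=PHRASE):
    """Fraction of (message, is_spam) pairs that the rule labels correctly."""
    correct = sum(
        rule_predicts_spam(message, phrase) == is_spam
        for message, is_spam in labeled_messages
    )
    return correct / len(labeled_messages)


# Toy training set, made up for this sketch.
training = [
    ("$$$$ BIG MONEY $$$$$ act now", True),
    ("Dinner at Grandma's on Sunday?", False),
    ("Cheap watches, limited offer", True),   # spam without the phrase
    ("Photos from the family trip", False),
]

print(rule_accuracy(training))
```

A single spam message that lacks the phrase, like the third one here, is already enough to disconfirm the claim that spam always contains it.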
But if we wish to consider elaborate discrimination rules, for example
ones including many key words and/or date information, it takes much
more data to tease apart all the various alternatives. The volume of
training data available, then, provides a very real constraint on how
complex the hypotheses we consider can be and how statistically reliable
we can expect the resulting rules to be on unseen test data.
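One standard way to make this trade-off concrete, which is not part of the text above, is the sample-size bound for a finite hypothesis class H. If some hypothesis in H is correct and the learner picks any hypothesis consistent with m training examples, its true error is at most epsilon with probability at least 1 - delta once m >= (ln|H| + ln(1/delta)) / epsilon. The sketch below assumes, purely for illustration, that hypotheses are conjunctions over k key words, where each key word is required present, required absent, or ignored, giving 3^k hypotheses. The function names and the default epsilon and delta are also assumptions.

```python
import math


def num_conjunctions(num_keywords):
    """Each key word is required present, required absent, or ignored."""
    return 3 ** num_keywords


def sample_bound(num_hypotheses, epsilon=0.05, delta=0.05):
    """Examples sufficient for a consistent learner over a finite class:
    m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / epsilon)


# The sufficient sample size grows with the number of key words considered.
for k in (1, 10, 100):
    print(k, sample_bound(num_conjunctions(k)))
```

Under these assumptions the required amount of data grows roughly linearly in the number of key words, so a fixed training set caps how elaborate the rules can be for a given reliability.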
Building hypotheses about documents