All samples of language, including the documents indexed by Web search
engines, depend heavily on {shared context} for comprehension. A
document's author makes assumptions, often tacit, about the intended
audience, and when the document appears in a ``traditional'' medium
(conference proceedings, academic journal, etc.) it is likely that
typical readers will understand it as intended. But one of the many
things the Web changes is the huge new audience it brings for documents,
many of whom will {\em not} share the author's intended context.
Because most search engines attempt to index indiscriminately across the
entire WWW, the {global} word frequency statistics they collect can only
reflect gross averages. The utility of an index term, as a discriminator
of relevant from irrelevant items, can become a muddy average of its
application across multiple, distinct sub-corpora within which the
term has a more focused meaning [REF866] [REF1097].
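The averaging effect can be made concrete with a minimal sketch. It assumes inverse document frequency (IDF) as the measure of a term's discriminating power, which the text above does not prescribe, and the three toy sub-corpora are invented for illustration.

```python
import math

def idf(term, docs):
    """Inverse document frequency of `term` over a list of tokenized documents."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else float("inf")

# Toy sub-corpora, invented for illustration only.
music = [
    "rock band tour".split(),
    "rock guitar solo".split(),
    "rock album review".split(),
]
climbing = [
    "rock climbing route".split(),
    "rope anchor belay".split(),
    "granite crag route".split(),
]
cooking = [
    "bread flour yeast".split(),
    "soup stock simmer".split(),
]

# An indiscriminate index pools every sub-corpus together.
whole_web = music + climbing + cooking

global_idf = idf("rock", whole_web)    # one number for the pooled collection
music_idf = idf("rock", music)         # "rock" is in every music document
climbing_idf = idf("rock", climbing)   # "rock" is in one climbing document
```

In this toy collection the pooled value falls between the two local values: the term discriminates nothing among the music documents, discriminates well among the climbing documents, and the global statistic reports neither fact.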
Hypertext information
environments such as the Web contain additional structural information
[Chakrabarti98b]. This
linkage information is typically exploited by browsing users. But
LINKAGE TOPOLOGY, the ``spatial'' structure imposed over
documents by their hypertext links to one another, can be used to
generate a concrete notion of context within which each document is
understood: Two documents and the words they contain are imagined to be
in the same context if they are close together in this space. Even in
unstructured portions of the Web, authors tend to cluster documents
about related topics by letting them point to each other via links. Such
linkage topology is useful inasmuch as browsers have a
better-than-random expectation that following links can provide them
with guidance. If this were not the case, browsing would be a waste of
time.
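One simple way to read ``close together in this space'' is as the number of link hops separating two documents. The sketch below is only an assumption about how such a distance might be computed: it treats links as undirected, uses breadth-first search, and the page names and link table are hypothetical.

```python
from collections import deque

def link_distance(links, start, goal):
    """Fewest link hops between two pages, ignoring link direction.

    Returns None when no chain of links connects the pages.
    """
    # Build an undirected adjacency map from the directed link table.
    neighbors = {}
    for src, targets in links.items():
        for dst in targets:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == goal:
            return dist
        for nxt in neighbors.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical link table: page -> pages it links to.
links = {
    "climbing/index": ["climbing/gear", "climbing/routes"],
    "climbing/gear": ["climbing/routes"],
    "music/index": ["music/bands"],
    "music/bands": ["music/index"],
    "portal": ["climbing/index", "music/index"],
}

same_topic = link_distance(links, "climbing/gear", "climbing/routes")
cross_topic = link_distance(links, "climbing/gear", "music/bands")
```

Under this reading, pages within one cluster sit a hop or two apart, while pages on unrelated topics are connected only through longer chains, if at all.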
This suggests that AGENTS (a.k.a. infobots, spiders, etc.)
which navigate over such structural links might be able to discover this
context. For example, agents browsing through pages about {\tt ROCK
CLIMBING} and {\tt ROCK 'N ROLL} should attribute different weights to
the word {\tt ROCK} depending on whether the query they are trying to
satisfy is about music or sports. Where an agent is situated in an
``environment'' (neighborhood of highly interlinked documents) provides
it with the {\em local context} within which to analyze word meanings
--- a structured, situated approach to polysemy. The words that surround
links in a document provide an agent with valuable information to
evaluate links and thus guide its path decisions --- a statistical
approach to action selection.
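As a rough illustration of how the words around links might guide an agent's path, the following sketch scores each outgoing link by summing term weights over its surrounding words, then converts the scores into selection probabilities. The softmax rule, the weight values, the parameter beta, and the page names are all assumptions made for this example; they are not taken from any of the systems described here.

```python
import math

def link_score(context_words, weights):
    """Sum of the agent's term weights over the words near a link."""
    return sum(weights.get(w.lower(), 0.0) for w in context_words)

def selection_probabilities(scores, beta=1.0):
    """Softmax over link scores; a larger beta makes the choice greedier."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {link: math.exp(beta * (s - peak)) for link, s in scores.items()}
    total = sum(exps.values())
    return {link: e / total for link, e in exps.items()}

# Hypothetical term weights for an agent serving a query about music.
music_weights = {"rock": 0.5, "band": 1.0, "guitar": 1.0, "climbing": -1.0}

# Words surrounding each outgoing link on the current page (invented).
link_contexts = {
    "bands.html": "classic rock band and guitar heroes".split(),
    "crags.html": "rock climbing crags near town".split(),
}

scores = {
    link: link_score(words, music_weights)
    for link, words in link_contexts.items()
}
probs = selection_probabilities(scores)
```

Both links mention the word rock, but the neighboring words pull the scores apart, so an agent serving the music query would follow the first link far more often than the second. An agent serving a climbing query would carry different weights and make the opposite choice.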
The idea of decentralizing the
index-building process is not new. Dividing the task into localized
indexing, performed by a set of {\em gatherers,} and centralized searching,
performed by a set of {\em brokers,} has been suggested since the early
days of the Web by the Harvest project [Bowman94] . WebWatcher [Armstrong95] and Letizia [Lieberman97] are agents that learn
to mimic the user by looking over his/her shoulder while browsing. Then
they perform look-ahead searches and make real-time suggestions for
pages that might interest the user. Fab [Balabanovic97] and Amalthaea [Moukas97] are multi-agent adaptive
filtering systems inspired by genetic algorithms, artificial life, and
market models. Term weighting and relevance feedback are used to adapt a
matching between a set of discovery agents (typically search engine
parasites) and a set of user profiles (corresponding to single- or
multiple-user interests).
Here we focus on InfoSpiders, a multi-agent
system developed by Filippo Menczer [REF1110] [REF1142] [REF1148] [REF1150]. In InfoSpiders an evolving
population of many agents is maintained, with each agent browsing from
document to document on-line, making autonomous decisions about which
links to follow, and adjusting its strategy. Population-wide dynamics
bias the search toward more promising areas and control the total amount
of computing resources devoted to the search activity. Basic features of
the algorithm are discussed below, followed by an example of how these
agents perform as searchers through a hypertext version of the
Encyclopedia Britannica.
Exploiting linkage for context