AltaVista was, in 1995, arguably the first search engine offered for
general use, and so AltaVista's
history is especially interesting. At that time, AltaVista was
developed by Digital Equipment Corporation primarily to demonstrate just
how powerful their new Alpha architecture was, especially its then-novel
64-bit addressing and the consequently vast data spaces. Indexing all
the WWW's pages and providing a useful service to many was simply good
publicity.
Since that time Digital has been acquired by Compaq,
and AltaVista spun off to CMGI. As searching newly authored pages on the
WWW has become increasingly profitable, similar search technology has
been applied to existing, traditionally published corpora to form the
next generation of DIGITAL LIBRARIES [Fox98] [Paepcke98]. It is amazing how closely
they resemble the vision H.G. Wells had of what a ``World Encyclopedia''
might mean, as early as 1938 [Wells38] !
As the Internet has reached a
mass audience and these new search engine users begin to FOA in earnest,
important new data is becoming available as to just how these real users
(as opposed to most IR experimental subjects, cf. Section §4.3.1 ) behave. Silverstein et al.
report on their analysis of approximately one billion $(10^9)$ queries
issued against the AltaVista search engine during six weeks in August
and September, 1998 [Silverstein99] . Another important
qualification on this preliminary study is that no attempt was made to
discriminate ``real,'' human-generated queries from automatic queries
generated by robots. Still, several features of this study are
significant.
First, fully 15% of the queries were entirely empty; they
contained no keywords! Two-thirds of these empty queries were generated
within AltaVista's ``advanced query'' interface. Clearly, good interface
design and user education remain fundamental issues for effective
search engine design.
Second, WWW searches use very short,
simple queries, averaging only 2.3 keywords/query (and not including the
zero-length queries in this average). Only 12.6% of queries used more
than three keywords. Of course the fact that AltaVista's interface does
not easily support longer, RelFbk queries (cf. Section §3.6 ) keeps these from occurring. Most
users also avoid query syntax and issue simple queries: only 20% of
queries used any of AltaVista's query operators ({\tt +, -, and, or,
not, near}); half of these used only one operator.
These findings are
especially significant because they paint a very different picture of the
``typical user'' than IR has traditionally held. When IR systems were
first developed, the target audience was primarily reference librarians,
SEARCH INTERMEDIARIES who helped library patrons find what they
were seeking from sophisticated systems such as DIALOG. These librarians
were specially educated, in particular in the subtleties of Boolean
query operators and other sophisticated techniques for constructing
exactly the right ``magic bullet'' query for a particular corpus. IR
system design and theory therefore generally assumed that queries were
fairly rich, structured expressions. At least at the moment, these
assumptions do not seem to hold for most Web searching.
But despite the
relatively simple form of most queries, the third interesting fact is
that Web queries are rarely repeated. Even folding case and ignoring
word order, only one third of queries appeared more than once in the
billion queries; only 14% occurred more than three times. {Evidence for
wide query novelty is especially striking given that, at least at this
juvenile stage of Internet usage, by far the dominant query topic
is {\tt SEX}. Not only was {\tt SEX} the most common token, but
sex-related terms accounted for 17 of the 25 most frequent query terms.
{\tt MP3} and {\tt CHAT} were the most popular non-sex-related tokens,
but their frequency was approximately a third of that of {\tt SEX}.}
These statistics are especially significant in the face of new services
such as AskJeeves which focus on
providing especially relevant answers for a restricted set of
anticipated queries.
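The repetition counts above depend on what is treated as ``the same'' query. The following is a minimal sketch, not the procedure Silverstein et al. actually used: it folds case and ignores word order by lowercasing, splitting on whitespace and sorting the tokens. The function names are hypothetical, and measuring the fractions over distinct (canonicalized) queries rather than over query instances is an assumption.

```python
from collections import Counter


def canonical_form(query: str) -> tuple:
    """Fold case and ignore word order: lowercase, split on whitespace, sort."""
    return tuple(sorted(query.lower().split()))


def repetition_stats(queries):
    """Return two fractions over distinct non-empty queries:
    (seen more than once, seen more than three times).

    Assumption: the denominator is the number of distinct canonical
    queries; the original study may have counted differently.
    """
    counts = Counter(canonical_form(q) for q in queries if q.strip())
    distinct = len(counts)
    if distinct == 0:
        return 0.0, 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    frequent = sum(1 for c in counts.values() if c > 3)
    return repeated / distinct, frequent / distinct


if __name__ == "__main__":
    # Toy log, invented for illustration only.
    log = ["Mp3 Chat", "chat mp3", "weather", "", "chat  MP3", "Chat mp3", "news"]
    print(repetition_stats(log))
```

In the toy log, the four spellings of the {\tt mp3 chat} query collapse to a single canonical form, so one of the three distinct queries counts as repeated. Any stricter notion of equivalence (stemming, removing operators) would raise the repetition rate further.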
Finally, Silverstein et al. attempted to analyze
query sessions. Knowing just when a query is part of a session is
notoriously difficult, especially when some queries are being generated
by robots; this study used a combination of server-set cookies and a
five-minute time window to capture coherent searches by the same user.
It appears that 78% of query sessions involve only a single query, and
that an average session involves only two queries! These data are
preliminary, but provide an interesting contrast to the power law,
Zipfian distribution of Web surfing behavior reported by Huberman et al.
[Huberman98] (cf. Section §3.2.2 ).
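The session heuristic described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the log is taken to be a list of (cookie, timestamp in seconds, query) records, and the five-minute window is interpreted as ``a new session begins when more than five minutes separate consecutive queries carrying the same cookie.'' The record layout, the function names and this reading of the window are all assumptions, not details confirmed by the study.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 5 * 60  # assumed reading of the five-minute window


def split_sessions(log):
    """Group (cookie, timestamp_seconds, query) records into sessions.

    Returns a list of sessions, each a list of query strings in time order.
    """
    by_cookie = defaultdict(list)
    for cookie, ts, query in log:
        by_cookie[cookie].append((ts, query))

    sessions = []
    for events in by_cookie.values():
        events.sort(key=lambda e: e[0])  # order each user's queries by time
        current = []
        last_ts = None
        for ts, query in events:
            # A long enough pause closes the current session.
            if last_ts is not None and ts - last_ts > SESSION_GAP_SECONDS:
                sessions.append(current)
                current = []
            current.append(query)
            last_ts = ts
        sessions.append(current)
    return sessions


def session_summary(sessions):
    """Return (fraction of single-query sessions, mean queries per session)."""
    n = len(sessions)
    if n == 0:
        return 0.0, 0.0
    single = sum(1 for s in sessions if len(s) == 1)
    mean_len = sum(len(s) for s in sessions) / n
    return single / n, mean_len


if __name__ == "__main__":
    # Toy log, invented for illustration only.
    log = [("a", 0, "mp3"), ("a", 100, "mp3 free"),
           ("a", 1000, "chat"), ("b", 50, "weather")]
    print(session_summary(split_sessions(log)))
```

Note that a robot issuing queries without accepting cookies would appear here as many one-query sessions, which is one reason the reported session statistics should be read as preliminary.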
The primary extension of the
search engine technology developed so far in this text is the
CRAWLING function that must harvest web pages prior to their
indexing. The design of web crawlers is now one of the most active areas
of computer science research and we provide only a few basic references
here.