One of the things you learn from students is jokes: This guy is shopping in a grocery store in Cambridge, Mass. He finishes, and lines up with a full basket under a big sign marking the aisle, ``Express: 10 items or less.'' When he gets to the front of the line, the exasperated clerk says: ``Look fella, I don't know if you're from Harvard and can't count, or MIT and can't read, but either way you're in the wrong line.'' You probably have to have gone to school in Cambridge to really appreciate this joke. I never did, but I find it funny because it laughs at an important division between thoughtful people.
Snow sees scientists and engineers, on the other hand, as optimistic, impatient do-ers! This leads him to hypothesize that ``literature changes more slowly than science'' (p. 9). He also thought that, due to the forces of ``fanatical belief in educational specialization'' and ``tendency to let our social forms crystallize'' (p. 18), the gap between the two cultures was ``... much less bridgeable among the young than it was even 30 years ago.'' He said this in 1959! Certainly these same forces have not helped matters in the intervening 40 years.
But Snow's most important recommendation remains true: ``There's only one way out of all this: it is, of course, by rethinking our education\ldots. [{\em Speaking especially of British education}:] Somehow we have set ourselves the task of producing even a tiny elite educated in one academic skill. For 150 years in Cambridge it was mathematics: then it was mathematics or classics: then natural science was allowed in. But still the choice had to be a single one.'' (pp. 19, 21)
The premise of this text is that Finding Out About (FOA), the process of actively seeking out information relevant to a topic of interest, absolutely demands a wide-ranging attack by both literary and scientific disciplines. The kind of fractionation that Snow describes has boxed investigators from various disciplines into corners from which they each attempt to address a broad range of fundamentally interdisciplinary questions of cognition. FOA is only one such question, but the tension between computational and linguistic sensibilities has been manifest in this domain for an especially long time.
For example, as part of an early meeting of cyberneticists exploring the way that communication and computation might interact, Benoit Mandelbrot, an eminent mathematician and physicist (now most famous for his fractal landscapes), presented hypothetical models of language use that would explain a phenomenon known as ZIPF'S LAW (a topic discussed in this text, cf. Sections §3.2 and §5.1), claiming they were analogous to physical systems with which he was familiar.
In reaction, A. S. C. Ross, a famous linguist of the 1950s, offered the
following commentary: ``[Mandelbrot] states that `language is a message
intentionally produced in order to be decoded word-by-word'. Many
schools of linguistic scholarship would reject such a view\ldots. It is,
indeed, important that there should be liaison between specialists in
communications theory and philologists [linguists, esp. concerned with
literature]. The gap between the two subjects is very wide, especially
in matters of technique and one wonders what philologists are going to
make of remarks such as `Our model of language is fully analogous to the
perfect gas of thermodynamics.' ... [But] all statements of this kind
really imply that the occurrence of a word at a given point in a text is
a matter of chance and this is what philologists and students of
literature will deny. If an English writer has to express the idea of
TEAPOT -- and whether he has to or not is not in the least
a matter of chance -- the probability of his using the word
TEAPOT is unity and the probability of his using the word
KETTLE is zero. ...'' [Ross53] Mandelbrot's probabilistic models
and statistics did not have much to say to this linguist.
An optimist,
however, could see a basic complementarity between statistical
methods and the linguists' syntactic methods. FOA's statistical methods
are good at semantics, knowing gross things about an entire document's
meaning -- what words mean in terms of how they relate to
other documents in the corpus and to users' queries. They blithely throw
away NOISE WORDS like AND, OF, THE, etc. because such words
are assumed to say little about the document's content. Syntactic
analysis captures the fine structure of individual sentences, and
depends critically on just the same noise words to reliably anchor its
parsing. \ldots is beginning to change all of that. The recent textbook by
Manning and Sch\"{u}tze [Manning99]
provides an excellent introduction to this methodology. And for a long
time Karen Sparck Jones has been exemplary in straddling these two
approaches. Her work [Robertson76] [SparckJones72] [SparckJones76] [SparckJones79a] [SparckJones79b] [REF703] [REF702] [SparckJones96] [SparckJones97] has consistently
sparked from one side of Snow's gulf to the other, making fundamental
contributions to each.
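To make the statistical side of this contrast concrete, here is a minimal sketch of what throwing away noise words looks like in practice. It is only an illustration, not the routine developed later in this text: the noise-word list, the function names `tokenize` and `content_terms`, and the sample sentence are all assumptions invented for this example.

```python
from collections import Counter
import re

# Hypothetical noise-word list, chosen only for illustration;
# a real list would be much longer and tuned to the corpus.
NOISE_WORDS = {"and", "of", "the", "a", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_terms(text, noise_words=NOISE_WORDS):
    """Count the tokens that remain after discarding noise words."""
    return Counter(t for t in tokenize(text) if t not in noise_words)


# Hypothetical sample "document".
doc = "The meaning of the document and the meaning of the query"
counts = content_terms(doc)
print(counts.most_common())
```

Notice that the surviving counts say something gross about what the sentence is about, while the word order and the noise words a parser would rely on are gone entirely.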
The title of this textbook also has cognitive aspirations. ``Cognitive'' stems from the Greek cognito, referring to structure, building. We typically imagine cognitive structures to be within an individual's head. But part of what is now known as the discipline of cognitive science is the realization that these representations can as well be built by many individuals as by one. The WWW as a knowledge representation is a topic considered further in Section §6.9.
I am personally drawn to the FOA problem because of the way it intermixes verbal and numeric sensibilities. To say that ``literary intellectuals'' are interested in language is almost tautological. But one of the major arguments put forward by this text is that many linguistic phenomena also have interesting statistical and mathematical properties. Computations involving these numbers are not only central to the engineering of effective search engines, but portend fundamental insights into the new forms of communication emerging on the World Wide Web (WWW).
Depending on your particular background, however, some of the techniques and perspectives discussed in this text will come naturally to you, and others will seem as if they arrive from a foreign planet. But if you apply some effort at understanding these other languages, you may just find out you have lots of new friends in the rest of the solar system. Literate people can learn new mathematical names to apply to their literature, and mathematicians can appreciate new features of the language going on about them.
Authors
who have in the past attempted to discuss language, of course using
language to do so, have long recognized the confusion that can result as
words are used in these two very different roles. Like many others, I
have chosen to use typography to help make this distinction. For
example, many of the examples used throughout the text will be drawn
from the area of {\tt ARTIFICIAL INTELLIGENCE}, a subdiscipline of computer
science. Terms like this, which are used as examples of lexical items
rather than as part of the discourse between me (the author) and you
(the reader), will appear CAPITALIZED and in FIXED
FONT.
Second, another typographic convention will be used to flag especially important terms that help to define the FOA problem. For example, DOMAIN OF DISCOURSE is the technical term used to describe {\tt ARTIFICIAL INTELLIGENCE}, the subject matter of the documents we hope to find. These terms are collected at the end of each chapter, for purposes of review.
Third, the fundamental relation between something in the world and what we think it means is a pivotal issue of this book. But \underline{about}-ness is also a natural, ubiquitous part of much of our communication, so much so that we will adopt the typographic convention of \underline{underlining} words such as \underline{about} and \underline{meaning} in order to highlight and better appreciate their use.
Finally, authors are always faced with decisions as to which thing they must say first. Making the right decision keeps the story moving forward, while interjecting a digression can make a reader lose their way. The WWW is most people's first experience with the HYPERTEXT alternative to this linear flow. Readers are given choice points and the opportunity to construct their own, nonlinear path through a text simply by clicking on links. Obviously such jumps are harder in a printed text. In this text small marginal notes are used to point to a tangential topic that a reader might choose to follow up. On the accompanying CD, clicking on correlated anchors will lead to brief discussions of these topics. Traditional footnotes will be used to provide important details or clarifications.
My interest in the topics discussed here goes back to my own dissertation. At that point I was primarily interested in machine learning techniques, and learned just enough about free-text information retrieval to use it as a demonstration ``domain'' for the ``connectionist'' learning techniques I proposed (cf. Section §6.5.2). Since then, I have become increasingly interested in the issues surrounding FOA, and have now taught courses in Information Retrieval for many years, at the University of California in San Diego and the University of Wisconsin in Madison.
This book began as a series of lecture notes for these classes. In the first years, I used Keith van Rijsbergen's seminal text [vanR79]. (This book was already out of print when I first found it, but van Rijsbergen's text has now been placed in its entirety on the WWW.) This text so influenced my thinking on this subject that it occupies a special relationship with FOA: I quote from it especially often, and use the special typographic convention of \vanR{pageNumber}. With Keith's permission, I include a complete copy of his hypertext on the FOA CD, and every such reference will allow you to click directly to the quoted page.
Three other texts deserve special mention. The collection of chapters edited by Frakes and Baeza-Yates [Frakes92a] provides an excellent introduction to many topics. Fox's Chapter 7 in particular figures heavily in Chapter 2 of this text. A second edition of this text is now available [Baeza-Yates99]. Robert Korfhage has also written a textbook especially useful from the perspective of library science [Korfhage97]. I highly recommend {\bf Readings in Information Retrieval}, edited by Karen Sparck Jones and Peter Willett [SparckJones97], as a companion to this text. This collection pulls together many classic papers from IR's distant past, some of which are now hard to get. A supplement (available at the FOA website) links readings from this collection as an adjunct to this textbook.
Because I teach primarily in a Computer Science department, the primary audience for this textbook is computer science students, both graduate and undergraduate, like those I have had the good fortune to meet in my classes. At the same time, I have tried to suppress technical details or explain them in ways that should make the most important themes accessible to audiences (e.g., linguists, library scientists) more comfortable with words than equations. Search engine technologies are central to the FOA problem, but the text was designed to be accessible to those who write such computer programs as well as to those who do not.
Executable versions of all basic routines are available on the attached CDROM; current versions are maintained at the FOA website. Using these together with the test corpora and experimental data (queries, relevance assessments), students and teachers should be able to explore many variations without changing any code. Source code for the routines is also provided for those programmers who want to modify or extend the basic functionalities.
A number of exercises are scattered throughout the text, but they are an admittedly uneven mix. Those collected at the end of chapters are intended as basic review exercises; those placed within the text are often more challenging. The primary assignments for my classes are a series of machine problems: extended programming assignments which cumulatively build all of the parts of a basic search engine. The details of these assignments are available to instructors who might be interested.
The first chapter of the text is designed to give any audience a broad overview of the basic questions underlying FOA and how they interact. The next three chapters cover the core issues involved in building and evaluating a generic search engine, at a level appropriate to undergraduates. Chapter 5 collects several important topics that require more mathematical sophistication, and Chapters 6 and 7 consider extensions of the basic core material at a graduate level. Chapter 6 considers extensions of basic search technologies that use features of documents beyond keywords to draw more ``artificially intelligent'' (AI) inferences about them. Chapter 7 focuses on how one particular branch of AI, machine learning, has been used to automatically learn more about both documents and the users searching through them. Chapter 8 concludes with a look at the most active developments in FOA, and a reassessment of which fundamental issues will be with us for the foreseeable future.
I had the good fortune to have had David Blair at the University of Michigan (in a single lecture!) make it clear that FOA wasn't just an engineering problem, but is important to anyone deeply interested in language. Mike Gordon (energized by that same lecture), Manfred Kochen, Bob Lindsay, Gary and Judy Olson, and Maurita Holland were all in Ann Arbor, and taught me more than I would really appreciate until years later.
Keith van Rijsbergen's unswerving confidence has made this book possible. His book is where I began and the standard I have tried to maintain. Gerry Salton and Karen Sparck Jones have been generous and patient with me as they have been to so many others in the IR community. I thank Nick Belkin, Bruce Croft, Steve Robertson, Norbert Fuhr, Sue Dumais and David Lewis for uncountable, interesting SIGIR dinners. I am happy to acknowledge the influence of the industrious groups around Carnegie Mellon University and Just Research, led by Tom Mitchell and Andrew McCallum, especially on Chapter 7.
A summer of exciting conversation (1987) with Ed Hutchins and Don Norman of UCSD's Cognitive Science Department helped me think more broadly about ``parallel distributed processing'' models of cognition, involving networks of people rather than neurons, as parts of social systems. I have benefited from a long, productive relationship with the editors and others working at Encyclopedia Britannica. I am grateful to have met Mortimer Adler (once!), and especially to have worked closely with Editor-in-Chief Bob McHenry and others at Encyclopedia Britannica in Chicago, Chris Needham (in London), and Bob Clarke, John Dimm, John McInerney and Harold Kester in La Jolla. I enjoyed a pleasant sabbatical at the University of Wisconsin in Madison, teaching with and learning from Jude Shavlik and Mark Craven. Paul Kube is, more than anyone else I know, comfortable in both of Snow's two cultures (and several others as well); he has helped me sober and balance many aspects of this manuscript. I thank Kim Itkonen for turning my words about words into a wonderful image for the cover.
Most of my own research has been done in collaboration with students. Many of my thoughts about what I had done right and wrong with AIR were shaped in conversations with Dan Rose, concerning his thesis. I am also grateful to both Dan and Susan Gruber for their help in shaping very early drafts of all chapters. Brian Bartell asked hard questions about FOA from the beginning, and I have appreciated the pleasure of his collaboration ever since. John Hatton, Amy Steier and Fil Menczer have all helped me explore aspects of FOA as part of their own research.
Chris Rosin and Terry Jones provided useful feedback on some chapters, and Marti Hearst (University of California, Berkeley) and Paul Thompson (University of Minnesota and St. Thomas University) used early drafts of FOA with their classes. I am grateful to Shari Chappell and Laura Dorfman for their rescue attempts at Cambridge University Press.
Will, Lee, Cori and Julie are my nearest and dearest family. Simply completing this book (finally!) is the best apology I can offer them. Beyond that ... ``Whereof one cannot speak, one must remain silent.''
It is here that I must say: despite the best efforts of these many friends and colleagues, I know I haven't said it all, and that mistakes surely still remain. I have written down those things I wish I'd known when I began my thesis, for use by students in the classes I teach. If it helps you avoid any of the mistakes it has taken me a decade to learn, it will almost have been worth it.