Department of Computer Science and Engineering, CSE 250A
University of California at San Diego, Winter 2001
The team scores are 20, 26, 29, 32, and 34 out of 40. However, one score is being changed to zero because of plagiarism. Two short paragraphs were copied word-for-word from two different web pages, without attribution.
The average number of hours reported per person for this project ranges from 10.3 to 90. I would expect you to spend about ten hours per week on the 250A projects, i.e. about 25 hours per person per project. If you are spending much less time, you are not putting in the effort of a full-time student and getting a good grade will be difficult. If you are spending much more time, you should think more about efficiency and prioritization.
Content. No team did very well on the criterion "well-chosen starting point for new research." It is important not to "reinvent the wheel," and to make your new contributions as general as possible. In particular, you should discuss broad algorithmic ideas separately from application-specific heuristics. For this project, you should explain why A* is not appropriate and why greedy best-first search is the appropriate algorithm. Then you should discuss explicitly the issue of how to recognize the value of links that do not lead to goal pages directly, but do lead there indirectly. Then and only then, you should describe specific heuristics, such as keyword weights.
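The following is a minimal sketch of greedy best-first search over a link graph, given only to make the algorithmic idea concrete. Greedy best-first search orders the frontier by the heuristic estimate alone, whereas A* also adds the cost of the path so far. The function name, the toy link graph, and the scores are all hypothetical and are not taken from any team's project. Note that the non-goal page "people" must receive a high score for the goal pages behind it to be reached early, which is the issue of indirect link value.

```python
import heapq

def greedy_best_first(start, links, score, is_goal, max_pages=100):
    """Visit pages in order of decreasing heuristic score.

    links:   dict mapping a page to the pages it links to (hypothetical toy graph)
    score:   function giving a heuristic estimate of a page's promise
    is_goal: function that says whether a visited page is a goal page
    Returns (goal pages found, all pages visited), both in visit order.
    """
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(start), start)]
    seen = {start}
    found, visited = [], []
    while frontier and len(visited) < max_pages:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        if is_goal(page):
            found.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return found, visited

# Toy example: "people" is not a goal page, but it leads to goal pages.
toy_links = {"home": ["news", "people"], "people": ["alice", "bob"]}
toy_scores = {"home": 0.1, "news": 0.2, "people": 0.6, "alice": 0.9, "bob": 0.9}
found, visited = greedy_best_first(
    "home", toy_links, toy_scores.get, lambda p: p in {"alice", "bob"}
)
```

In this toy graph the search visits "people" before "news" only because the assumed scores give the intermediate page a high value. A heuristic that scored pages purely on whether they look like goal pages would not do this.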
You should be creative but skeptical about reusing previous work from other fields. For example, there is no strong reason to expect a particular weighting method that happens to be useful in traditional information retrieval, such as tf/idf, to be especially useful for real-time web search.
When designing and discussing an experiment, use an appropriate independent variable. For example, if you have a heuristic to make the retrieval of pages faster, show results as a function of clock time or CPU time, not of number of pages visited.
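As one possible illustration, the sketch below records both wall-clock time and CPU time alongside the number of pages visited, so that results can be plotted against whichever independent variable is appropriate. The names timed_crawl and fake_fetch are hypothetical, and fake_fetch merely stands in for fetching and classifying one page.

```python
import itertools
import time

def timed_crawl(fetch_next, budget_seconds):
    """Call fetch_next() until the wall-clock budget is used up.

    Returns a list of (elapsed seconds, CPU seconds, pages visited, hits so far),
    with one entry per page visited.
    """
    log = []
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    pages = hits = 0
    while time.perf_counter() - wall_start < budget_seconds:
        is_hit = fetch_next()
        pages += 1
        hits += int(is_hit)
        log.append((time.perf_counter() - wall_start,
                    time.process_time() - cpu_start,
                    pages, hits))
    return log

# Hypothetical stand-in for a real page fetch: every third page is a hit.
_counter = itertools.count()

def fake_fetch():
    return next(_counter) % 3 == 0

log = timed_crawl(fake_fetch, budget_seconds=0.01)
```

Plotting the fourth column against the first or second shows hits as a function of time. Plotting it against the third would hide any speedup per page.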
Always think about and discuss explicitly the issue of statistical significance for experimental comparisons, even if your conclusion is that you cannot evaluate significance quantitatively. In particular, when you report a mean, you should also report the corresponding standard deviation s. Then you can use the standard error, s/sqrt(n) where n is the number of measurements averaged, to say whether the difference between two means is statistically significant, i.e. reproducible.
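The calculation can be sketched as follows. The sample measurements are made up for illustration. Two steps go beyond what is stated above and are assumptions: the standard errors of two means are combined as the square root of the sum of their squares, which presumes independent samples, and a difference of more than about two combined standard errors is treated as a rough rule of thumb for significance.

```python
import math
import statistics

def summarize(xs):
    """Return (mean, standard deviation s, standard error s/sqrt(n))."""
    n = len(xs)
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)      # sample standard deviation, n - 1 denominator
    return mean, s, s / math.sqrt(n)

def difference_in_standard_errors(a, b):
    """Difference between the two means, in units of its combined standard error.

    Assumption: the two samples are independent, so the standard error of the
    difference is sqrt(se_a**2 + se_b**2).
    """
    mean_a, _, se_a = summarize(a)
    mean_b, _, se_b = summarize(b)
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Made-up measurements, e.g. goal pages found in five runs of each method.
method_a = [12, 15, 11, 14, 13]
method_b = [9, 10, 8, 11, 12]
ratio = difference_in_standard_errors(method_a, method_b)
```

For these made-up numbers both samples have s of about 1.58, the means are 13 and 10, and the difference is about three combined standard errors.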
Abstract. An abstract should be as specific and concrete as possible, while remaining less technical than the whole paper. For example, do not write "Certain modifications improve the pages hit vs. pages searched performance ratio, while some do not." Instead, describe briefly but explicitly the methods that do work, and your most important experiments, results, and conclusions. If you still have space, also describe the methods that do not work. Remember that when a paper is published, many people read the abstract but never read the paper. Make the abstract as useful as possible to these readers. Typical good abstracts are 150 to 200 words long.
Introduction. Avoid exaggerated claims, and claims that may not be true or cannot be proved. Avoid unnecessary metaphors. For example, do not write "The web is a sea of information, containing trillions of pages..." or "As the web grows, search engines that provide exhaustive indexing become less effective for searches in a specific genre."
Organization. Some conferences, journals, or research areas have the convention that all papers use the same section titles. Otherwise, including in most of computer science, you should use informative titles, not clichés like "Problem Statement" or single words like "Results." Do not use identical sentences in the abstract and elsewhere in a paper, and do not repeat the same arguments anywhere. It is fine for the concluding section of a paper to be very short.
Write in the present tense as much as possible, and organize the description of your work logically. Avoid writing in the past tense, and avoid any hint of chronological organization. For example, do not write "... changing it to fit our needs proved difficult ... In the end, we decided to ..." Present your work in a mostly impersonal way. Use "we" and "our" when convenient, but not continuously.
General writing. Avoid non sequiturs such as "It is important to have an accurate web crawler to maintain currency of online indexes." Currency means recency here, and accuracy and recency are not obviously linked. The sentence should either be changed, or expanded into an argument that explains the link between accuracy and currency.
Be precise and clear in all descriptions, and use simple declarative sentences as much as possible. Precision and clarity lead you step by step to new insights. A straightforward list of observations, in a logical order, is an excellent way to organize most technical descriptions and analyses.
Do not use a comma to concatenate two sentences that should be separate. Avoid category errors such as "In a web crawler, the search task traverses links..." Avoid weak jokes and irrelevant allusions such as "Search for the Holy Grail." Learn once and for all to avoid common spelling mistakes such as "can not" and "effect" where "affect" is correct. ("Affect" is usually a verb while "effect" is usually a noun.) Avoid footnotes, and also phrases and sentences in parentheses.
Mathematical writing. Instead of using a multiletter identifier such as NDocs, use a single letter identifier with an explanation that is a full sentence such as "... where n is the number of documents in the sample corpus." Use the simplest and most sober possible notation: avoid non-Roman letters, unusual symbols, and boldface as much as possible. When defining a novel concept, use function arguments instead of subscripts and superscripts. Follow standards in notation. For example, write "z = log xy" not "z = log ( x*y )". Never use the same letter or symbol with two different meanings.
Whenever possible, give an entire equation instead of just a formula, i.e. a fragment of an equation. Equations and their surrounding sentences should be written to flow naturally, with as little punctuation as possible. For example, do not write this:
"The weight of each keyword in a document d, is defined as: w_kd = f_kd x idf_k where f_kd is the frequency of a keyword k in a document d (the term frequency)."
Instead, write this:
"The weight of a term k in a document d is w(k,d) = f(k,d)t(k) where f(k,d) is the number of times k appears in d and t(k) is the inverse document frequency of k."
An equation should be displayed, i.e. centered on its own line, if it is long or if you need to give it a number in order to refer to it later. Whether displayed or not, each equation should be part of a complete sentence.
Figures. It is easy to make good charts with Gnuplot, but very difficult with Excel. Charts should not contain unnecessary background lines or shading. The labels on axes and plotted points and curves should be informative. On axes, units should be made clear and numerical values should be well-chosen round numbers. The origin should be at zero whenever possible.
Bibliography. References are like comments in software: adding them at the end misses the point. As you progress through a project, you should be looking for related published work that can give you ideas and save you work. As you find these references, you should cite them immediately in project planning documents, notes, and paper drafts.
References should be complete. As an absolute minimum, each reference should contain the precise, correct title, the last names and initials of at least the first three authors, the exact title of the journal or proceedings or book, the year of publication, and correct page numbers. BibTeX makes all this easy.