CSE134A LECTURE NOTES

November 28, 2001
 
 

ANNOUNCEMENTS

The final exam is on Thursday next week, 3pm to 6pm, in this room.  The exam will be similar in format and style to the midterm.
 
 

WEB SERVICES

There are very few public services using SOAP--we'll discuss possible reasons later.

The SOAP standard does not provide any of the following:

Most of these are necessary for all non-trivial distributed applications, including Internet-based web services.
 
 

THE FUTURE OF SOAP

Microsoft is pushing SOAP as part of its .NET initiative.  To support .NET, and all its programming standards, Microsoft provides software development environments and large, useful libraries of reusable code.  Critically, Microsoft provides easy-to-use GUI-based programming tools, so it can attract a much larger number of developers than vendors who do not provide visual tools.  Microsoft also provides excellent documentation and guidance for programmers at all levels of expertise.

Microsoft's strategy is always embrace, extend, dominate.  Because so many services are needed on top of SOAP by applications, this strategy should work well with SOAP.

Most SOAP services will be private, not public for end-users.  End-users need attractive, easy-to-use graphical services, as provided by HTML, not remote-procedure-call interfaces

The era of free web services is over.  How will SOAP services be paid for?  There may be an initialization problem: we need reliable, useful micropayment services in order to support the development of other SOAP services.
 
 

HOW GOOGLE WORKS

The class description says "If time permits, the course will also cover algorithms for organizing and searching web-based information, such as those used by the Google search engine."  We have half a lecture to do this!

Difficulty: Which web pages are the best to suggest, given a short query like "britney spears"?  Google has indexed about 779000 web pages containing these two words.  Among all these, Google ranks www.britneyspears.com first, which happens to be the official web site.  (Incidentally, this site is an example of two major trends: it crashes my Netscape browser, and it's implemented in PHP.)

Idea: Backlinks indicate importance.  For example, about 1130 pages have links pointing to www.britneyspears.com.  Difficulty: It's easy to create meaningless backlinks automatically, i.e. to "spam" this heuristic.

Refined idea: If many important pages link to you, then you are an important page.  If the Yahoo home page points to you, then you are important.

Mathematically, let u be a web page.  Define R(u) = sum R(v)/N(v)  where v is all pages that point to u, and N(v) is the number of links out of v.

Difficulty: Consider two pages A and B where A points to B and the only link from B is back to A.  Now R(A) and R(B) are infinite, which doesn't make sense.
 
 

THE RANDOM SURFER MODEL

If someone was following links and came across pages A and B, eventually they would get bored and jump somewhere else.

Imagine someone surfing the web.  Each minute, with probability p they jump randomly, and with probability 1-p they follow a link from the current page.  What is the probability P(u) that the random surfer is visiting page u?

Intuitively, this is a measure of how important a page is.  A self-contained set of pages with links to each other, but few links from outside,
will get very little importance.

The equation is P(u) = p/N + (1-p) sum R(v)/N(v) where v is all pages that point to u, and N is the total number of web pages.

There is a simple iterative algorithm for calculating approximate values of P(u) for every web page.  A reasonable value for p is 0.15.
 
 

GOOGLE IMPLEMENTATION ISSUES

Computing the "pagerank" P(u) requires knowledge of a large fraction of the whole web.  However the computation can be done once, off line.  One new page doesn't change pageranks much, so the pagerank of new pages can be computed approximately as soon as they are discovered.

Breadth-first crawling will tend to visit pages with high pagerank first.

What about pages with no outgoing links?  Often these are pages that we know about because links point to them, but they haven't been downloaded yet.  A Google innovation is to index these using the anchor text of the links pointing to them.

Other heuristics for measuring importance: font size, visibility.

Difficulties: Dynamically generated pages: changing URLs, the "invisible web."   Security: Honey pots and confidentiality breaches.
 
 

GRAPH PROPERTIES OF THE WEB

We can view the web as a graph where each page is a node and each link is a directed edge.  This graph has many interesting properties.

(1) It's an expander graph.  Let S be a subset of nodes and let N be the set of neighbors of S.  Then the size of N is almost always more than alpha times the size of S, where alpha > 1.

(2) The probability that a web page has k links pointing to it is proportional to 1/k2.1 about for k > 1
 

number of inlinks
proportion of pages
1
1
2
1/4.3
3
1/10
4
1/21
...
...

(3) The average path length between any two web pages is about 19.
 



Copyright (c) by Charles Elkan, 2001.