The SOAP standard does not provide any of the following:
Microsoft's strategy is always embrace, extend, dominate. Because so many services are needed on top of SOAP by applications, this strategy should work well with SOAP.
Most SOAP services will be private, not public for end-users. End-users need attractive, easy-to-use graphical services, as provided by HTML, not remote-procedure-call interfaces
The era of free web services is over. How will SOAP services be
paid for? There may be an initialization problem: we need reliable,
useful micropayment services in order to support the development of other
SOAP services.
Difficulty: Which web pages are the best to suggest, given a short query like "britney spears"? Google has indexed about 779000 web pages containing these two words. Among all these, Google ranks www.britneyspears.com first, which happens to be the official web site. (Incidentally, this site is an example of two major trends: it crashes my Netscape browser, and it's implemented in PHP.)
Idea: Backlinks indicate importance. For example, about 1130 pages have links pointing to www.britneyspears.com. Difficulty: It's easy to create meaningless backlinks automatically, i.e. to "spam" this heuristic.
Refined idea: If many important pages link to you, then you are an important page. If the Yahoo home page points to you, then you are important.
Mathematically, let u be a web page. Define R(u) = sum R(v)/N(v) where v is all pages that point to u, and N(v) is the number of links out of v.
Difficulty: Consider two pages A and B where A points
to B and the only link from B is back to A. Now R(A) and
R(B) are infinite, which doesn't make sense.
Imagine someone surfing the web. Each minute, with probability p they jump randomly, and with probability 1-p they follow a link from the current page. What is the probability P(u) that the random surfer is visiting page u?
Intuitively, this is a measure of how important a page is. A self-contained
set of pages with links to each other, but few links from outside,
will get very little importance.
The equation is P(u) = p/N + (1-p) sum R(v)/N(v) where v is all pages that point to u, and N is the total number of web pages.
There is a simple iterative algorithm for calculating approximate values
of P(u) for every web page. A reasonable value for
p is 0.15.
Breadth-first crawling will tend to visit pages with high pagerank first.
What about pages with no outgoing links? Often these are pages that we know about because links point to them, but they haven't been downloaded yet. A Google innovation is to index these using the anchor text of the links pointing to them.
Other heuristics for measuring importance: font size, visibility.
Difficulties: Dynamically generated pages: changing URLs, the "invisible
web." Security: Honey pots and confidentiality breaches.
(1) It's an expander graph. Let S be a subset of nodes and let N be the set of neighbors of S. Then the size of N is almost always more than alpha times the size of S, where alpha > 1.
(2) The probability that a web page has k links pointing to it
is proportional to 1/k2.1 about for k > 1
|
|
|
| 1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
(3) The average path length between any two web pages is about
19.