CSE134A LECTURE NOTES
November 5, 2001
ANNOUNCEMENTS
On Friday we handed back grade slips for the second project.
WEB SITE ARCHITECTURAL OBJECTIVES
Today's lecture is based on "A Blueprint for Building Web Sites Using
the Microsoft Windows DNA Platform," Draft Version 0.9, Microsoft
Corporation, January 2000. (The link is to a local PDF copy of this
document.)
This document explains a standard architecture that is successful in
achieving several important objectives. The article is an example
of what is called a "white paper" in the computer industry.
Important design objectives for any large web site include:
- Scalability: cope easily with exponential growth in demand; scalability
should be "linear," i.e. capacity should grow in proportion to the number
of servers added.
- Availability: use redundancy and functional specialization to limit
fault propagation.
- Security: protect servers from attacks and also data from theft.
- Manageability: keep operations and system administration simple, flexible,
and scalable in human effort.
- Designability: keep application design simple and flexible.
The standard overall design that meets these goals consists of loosely
connected tiers (i.e. levels) of replicated, task-focused servers.
- Front-end servers have no long-term state; they are cloned.
- Load-balancing software and/or hardware spreads requests across multiple
front-end servers, and includes failure detection.
- Persistent content is partitioned across multiple back-end servers.
- Back-end servers are grouped into partitions.
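The two tiers above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mechanism described in the white paper: the class name LoadBalancer, the function partition_for, and the server names are all hypothetical, and a real load balancer would detect failures itself (for example with periodic health checks) instead of being told through mark_down.

```python
import hashlib


class LoadBalancer:
    """Round-robin over cloned front-end servers, skipping failed ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        # Stand-in for failure detection: stop routing to this clone.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once, starting from the rotating index.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy front-end server")


def partition_for(key, num_partitions):
    """Map a content key to a back-end partition using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical usage: three clones, one of which fails.
balancer = LoadBalancer(["web1", "web2", "web3"])
first_four = [balancer.pick() for _ in range(4)]  # web1, web2, web3, web1
balancer.mark_down("web2")
after_failure = [balancer.pick() for _ in range(3)]  # web3, web1, web3
```

Because the front ends hold no long-term state, any clone can serve any request, so the balancer is free to rotate. Persistent content is different: a given key must always reach the same partition, which is why the sketch uses a stable hash for the back end.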
Application complexity is managed by specialization: different servers
perform different functions. Manageability and scalability are often
achieved in part by outsourcing, i.e. remote hosting. A specialized
company installs the servers near a main Internet access point.
Usability issue: provide limited functionality even if some back-end
servers are unavailable, e.g. let users send mail even if they cannot read
messages; let users see the catalog even if they cannot place orders.
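The catalog example can be sketched as follows. This is a minimal illustration that assumes a front end which treats the catalog as required and the ordering back end as optional; the names BackendUnavailable, render_catalog_page, fetch_catalog, and fetch_cart are hypothetical.

```python
class BackendUnavailable(Exception):
    """Raised by a (hypothetical) back-end client when its server is down."""


def render_catalog_page(fetch_catalog, fetch_cart):
    # The catalog is essential: if it fails, the error propagates.
    page = {"catalog": fetch_catalog()}
    # The cart is optional: if the ordering back end is down,
    # show the catalog anyway with ordering switched off.
    try:
        page["cart"] = fetch_cart()
        page["ordering_enabled"] = True
    except BackendUnavailable:
        page["cart"] = None
        page["ordering_enabled"] = False
    return page
```

The design choice is to decide, per back end, whether its failure should fail the whole request or only disable one feature.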
Availability/security issues: prevent a script failure from crashing
the web server, prevent failure on one server from being repeated on identical
servers.
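One common way to keep a script failure from crashing the web server is to run the script in a separate process, so that a crash or hang is contained. The sketch below assumes this approach, which is not necessarily the one the white paper recommends; the function name run_script_isolated and the status codes it returns are illustrative choices. It does not address the second issue, a failure being repeated on identical servers.

```python
import subprocess
import sys


def run_script_isolated(source, timeout_seconds=5.0):
    """Run Python source in a child process; return (status, body)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # A hung script is killed; the server itself keeps running.
        return 504, "script timed out"
    if result.returncode != 0:
        # A crashed script yields an error page, not a crashed server.
        return 500, "script failed"
    return 200, result.stdout
```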
SECURITY DOMAINS
Security is a vital issue with at least two aspects: protect servers from
attacks, and also protect data from theft. Security is based on multiple
separate security zones. Each zone is protected by a firewall, i.e. a
network packet filter. Typically there are at least three zones:
the public Internet, a public-facing so-called "demilitarized zone,"
abbreviated DMZ, and a private zone with sensitive data.
A security mechanism, for example the use of security domains, has three
major objectives:
- prevent unauthorized disclosure (privacy)
- prevent unauthorized modification (integrity)
- prevent denial-of-service attacks (availability).
Security domains are regions with restricted and monitored communication.
Domains may be defined geographically, organizationally, by server type,
or by data type. Domains may be nested but preferably should not overlap.
A "firewall" is a device that inspects every packet of data coming into
(or out of) a domain. Typically suspicious packets are simply dropped.
Simple firewalls, known as packet filters, just look at the IP addresses
and port numbers mentioned in packets.
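A simple packet filter of this kind can be sketched as a list of allow rules with a default of dropping everything else. The rule format, the names Rule and allows, and the example networks are assumptions for illustration; real firewalls also match on destination address, protocol, and direction.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source_network: str  # CIDR block, e.g. "10.0.0.0/8"
    dest_port: int       # destination port the rule allows


def allows(rules, source_ip, dest_port):
    """Return True if any rule allows the packet; otherwise drop it."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        network = ipaddress.ip_network(rule.source_network)
        if rule.dest_port == dest_port and address in network:
            return True
    return False  # default policy: drop


# Hypothetical DMZ policy: web traffic from anywhere, SSH only from inside.
DMZ_RULES = [
    Rule("0.0.0.0/0", 80),
    Rule("0.0.0.0/0", 443),
    Rule("10.0.0.0/8", 22),
]
```

With these rules, a packet from 203.0.113.7 to port 80 is allowed, while a packet from the same address to port 22 is dropped.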
Each front-end server in the DMZ has an operating system hardened for
security. Firewalls separate it from the Internet and also from the
internal network.
NETWORKING
A large web site will have several separate local-area networks (LANs):
- One for connecting to the Internet, through multiple ISPs
- One inside the DMZ connecting web servers to firewalls and to temporary
storage servers
- The organization's secure internal network containing confidential data
- A pre-existing corporate network.
Each security domain often has a separate management network that overlays
the other networks. Management involves consoles, human monitors,
and automated monitoring software. With a physically separate management
network, each host has at least two network interface cards.
Monitoring is based on logging, which can impose a heavy network and
storage load, up to 4000 gigabytes per day at Yahoo.
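To get a feel for what 4000 gigabytes per day means as a sustained load, the figure can be converted to a per-second rate. The calculation below assumes decimal gigabytes (10^9 bytes) and traffic spread evenly over the day; actual peaks would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def sustained_rate(gigabytes_per_day):
    """Return (megabytes per second, megabits per second)."""
    bytes_per_second = gigabytes_per_day * 1e9 / SECONDS_PER_DAY
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


megabytes_per_s, megabits_per_s = sustained_rate(4000)
# Roughly 46 megabytes per second, or about 370 megabits per second.
```

That is more than a single 100 megabit per second Ethernet link can carry, which helps explain why a physically separate management network is attractive.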
Clients benefit from multiple ISPs through a feature of the standard
Internet Domain Name System (DNS): round-robin behavior. A name can map
to several IP addresses, which are handed out in rotation. If an IP
address does not respond, the client just has to hit "reload."
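The reload behavior can be simulated without touching the network. This is a simplified model: the class RoundRobinDNS, the function fetch, and the documentation-range addresses are hypothetical, and it ignores the caching that real resolvers and browsers perform.

```python
from collections import deque


class RoundRobinDNS:
    """Toy name server that rotates its address list on every query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        answer = list(self._addresses)
        self._addresses.rotate(-1)  # next query starts with the next address
        return answer


def fetch(dns, is_reachable, max_reloads=3):
    """Try the first address of each answer; 'reload' on failure.

    Returns (address, reloads_used), or (None, max_reloads) if every
    attempt failed.
    """
    for attempt in range(1 + max_reloads):
        address = dns.resolve()[0]
        if is_reachable(address):
            return address, attempt
    return None, max_reloads


# Hypothetical site reachable through two ISPs, the first of which is down.
dns = RoundRobinDNS(["192.0.2.10", "198.51.100.10"])
down = {"192.0.2.10"}
result = fetch(dns, lambda address: address not in down)
# result is ("198.51.100.10", 1): one reload was enough.
```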
All front-end servers at one ISP respond to the same IP address, which is
handled by a load balancer.
Copyright (c) by Charles Elkan, 2001.