Proceeedings of the 25th Annual Conference on Computer Communications
(INFOCOM), Barcelona, Spain, Apr. 2006.

Analysis of Long-Running Replicated Systems

Sriram Ramabhadran and Joseph Pasquale
Department of Computer Science and Engineering
University of California, San Diego

Abstract: We address the problem of using replication to reliably
maintain state in a distributed system for time spans that far exceed
the lifetimes of individual replicas. This scenario is relevant for
any system comprised of a potentially large and selectable number
of replicated components, each of which may be highly unreliable,
where the goal is to have enough replicas to keep the system
``alive'' (meaning at least one replica is working or available)
for a certain expected period of time, i.e., the system's lifetime.
In particular, this applies to recent efforts to build highly
available storage systems based on the peer-to-peer paradigm.
We model notions of replica loss and replica repair in such
systems by a simple Markov chain model, and derive an expression
for the lifetime of the replicated state. We then apply this model
to study the impact of practical considerations like storage and
bandwidth limits on the system, and describe methods to optimally
choose system parameters so as to maximize lifetime. Our analysis
sheds light on the efficacy of various replication strategies.

Full paper in pdf