The debate concerning these models dates back almost 40 years, but
Zipfian distributions and attempts to explain them continue to arise. For
example, many have been struck by language-like properties exhibited by
the long sequences of genetic codes found in all living species' DNA.
The fact that a simple ``alphabet'' of four nucleic acid BASE-PAIRS (BPs)
({\tt A,C,G,T} in DNA) is broken into three-letter CODONS, each of which
means one of twenty possible ``words'' corresponding to amino
acids, has led many to wonder what we might learn by viewing the genome
as a linguistic object [Sereno91].
Mantegna
et al. [Mantegna94] were led to
consider the frequency distributions of such ``words'' in the DNA
``corpus.'' Further, they considered differences in the distributions
between coding regions of the genome and non-coding regions that
are never expressed. Their first result is that this sequence data does
indeed contain ``linguistic features,'' especially in the non-coding
regions. By analyzing various genetic corpora (e.g., approximately one
million BPs taken from 14 mammalian sequences), they found that, in
contrast to what we might expect of completely random sequences, the
rank-frequency distribution of six-BP words could be well fit by a
(log-log linear) Zipf exponent of -0.28. They conclude: \bq These results
are consistent with the possible existence of one (or more than one)
structured biological languages present in non-coding DNA sequences. \eq
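The rank-frequency measurement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure Mantegna et al. actually used: the function names `kmer_rank_frequency` and `loglog_slope` are hypothetical, words are taken from overlapping windows (an assumption), and the ``corpus'' is a synthetic sequence of independent bases with made-up, unequal frequencies rather than real genomic data.

```python
import math
import random
from collections import Counter

def kmer_rank_frequency(seq, k=6):
    """Counts of overlapping k-letter words, sorted from most to least frequent."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sorted(counts.values(), reverse=True)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank), rank starting at 1."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

random.seed(0)
# Synthetic stand-in for a DNA corpus: independent bases with unequal,
# purely illustrative frequencies (not taken from any real genome).
seq = "".join(random.choices("ACGT", weights=[0.35, 0.15, 0.15, 0.35], k=100_000))
freqs = kmer_rank_frequency(seq, k=6)
slope = loglog_slope(freqs)
print(f"{len(freqs)} distinct six-BP words, fitted log-log slope {slope:.2f}")
```

Because the synthetic bases are drawn independently, any non-flat rank-frequency curve this produces comes from unequal letter frequencies alone, with no ``language'' involved.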
Subsequent analysis, however, makes it quite clear that
any such interpretations are ill-founded [Bonhoeffer96]. Deviations from
fully random sequence behavior can be attributed to two simple
characteristics of biological sequence data. First, define $H(n)$ to be
the entropy of the distribution of length-$n$ nucleotide sequences.
Then the redundancy $R(n)$ of length-$n$ words is: \[ R(n) = 1 -
\frac{H(n)}{2n} \] $R(1)$ then simply increases with the {\em
variance} of the frequencies of the four bases; but that the bases occur with
very different frequencies is a well-known biological fact. Second, there are very
short-range correlations between nucleic acids (which are very easy to
imagine given the basic three-letter genetic code), and in
DNA the most common words are simply combinations of the most probable
letters, because recombination events cross over, especially in regions
of short repeats. There are still interesting questions (e.g.,
why coding and non-coding regions differ in their nucleic acid
frequencies), but this analysis does undermine any claim of large-scale language-like properties
within DNA sequences.
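As a rough illustration of the redundancy measure, the following Python sketch estimates $H(n)$ from a sequence and computes $R(n) = 1 - H(n)/(2n)$. It assumes entropy is measured in bits (so that $2n$ is the maximum for a four-letter alphabet) and that words are read from overlapping windows; the function names `block_entropy` and `redundancy` and the toy sequences are hypothetical, chosen only for the example.

```python
import math
from collections import Counter

def block_entropy(seq, n):
    """Shannon entropy H(n), in bits, of the empirical distribution of length-n words."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def redundancy(seq, n):
    """R(n) = 1 - H(n) / (2n); 2n bits is the maximum entropy over four letters."""
    return 1.0 - block_entropy(seq, n) / (2 * n)

uniform = "ACGT" * 50   # all four bases equally frequent
skewed = "AAAC" * 50    # very unequal base frequencies

print(redundancy(uniform, 1))  # equal base frequencies
print(redundancy(skewed, 1))   # unequal base frequencies
print(redundancy(uniform, 2))  # equal frequencies, but strictly periodic order
```

The toy sequences separate the two effects named above: `skewed` has nonzero $R(1)$ purely because its letter frequencies differ, while `uniform` has no redundancy at $n=1$ but substantial redundancy at $n=2$ because each base fully determines the next.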
A final, very recent example of how Zipf-like
distributions arise is offered by analyses of WWW SURFING
behaviors [Huberman98], and makes
this same point (but cf. §8.1
for more recent, apparently contradictory data generated from massive
AltaVista logs). Consider each page click by a browsing user to be a
character, and the amount of time spent by the same user on the same
host to be the length of a ``word.'' Then (surprise!), empirical data
capturing the rank-frequency distribution of each WWW surfing ``ride''
again shows a (log-log linear) Zipfian relationship with slope equal to
-1.5, as shown in Figure (figure).
Huberman et al. also propose
a model explaining this empirical data. Assume that the ``value'' (what
we might think of as perceived relevance) $V(L)$ of each page in a
browsing sequence of length $L$ goes up or down according to independent,
identically distributed (iid) Gaussian random variables $\epsilon_L$: \[ V(L) =
V(L-1) + \epsilon_L \] Using economic reasoning, Huberman et al. then
hypothesize: \bq ... an individual will continue to surf until the
expected cost of continuing is perceived to be larger than the
discounted expected value of the information to be found in the
future.... Even if the value of the current page is negative, it may be
worthwhile to proceed, because a collection of high value pages may
still be found. If the value is sufficiently negative, however, then it
is no longer worth the risk to continue. \eq If users' browsing behaviors
follow a random walk governed by these considerations, Huberman et al.
show that the passage times to this cutoff threshold are given by the
inverse Gaussian distribution: \begin{equation} \Pr (L)=\sqrt{\frac{\lambda }{2 \pi
L^{3}}} \exp \left[ \frac{-\lambda (L-\mu )^{2}}{2\mu ^{2}L}\right]
\label{eq:websurf} \end{equation} where $\mu $ is the mean of the random walk length
variable $L$, $\mu ^{3}/\lambda $ is its variance and $\lambda $ is a
scaling parameter.
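To connect this density to the observed log-log slope of -1.5, the sketch below evaluates the formula above (the inverse Gaussian form) and estimates its local slope $d \log \Pr(L) / d \log L$ numerically. The function names and the parameter values `mu = 50.0` and `lam = 1.0` are illustrative assumptions, not values reported by Huberman et al.

```python
import math

def inverse_gaussian_pdf(L, mu, lam):
    """Density of path length L: sqrt(lam / (2 pi L^3)) * exp(-lam (L - mu)^2 / (2 mu^2 L))."""
    prefactor = math.sqrt(lam / (2 * math.pi * L ** 3))
    return prefactor * math.exp(-lam * (L - mu) ** 2 / (2 * mu ** 2 * L))

def local_loglog_slope(L, mu, lam, h=1e-4):
    """Central-difference estimate of d log P / d log L at L."""
    up = math.log(inverse_gaussian_pdf(L * math.exp(h), mu, lam))
    down = math.log(inverse_gaussian_pdf(L * math.exp(-h), mu, lam))
    return (up - down) / (2 * h)

mu, lam = 50.0, 1.0  # illustrative values only, chosen so the variance mu^3/lam is large
for L in (5, 20, 50, 200):
    print(L, round(local_loglog_slope(L, mu, lam), 3))
```

Taking logarithms of the density gives a $-\frac{3}{2}\log L$ term plus an exponential correction, so the local slope works out to $-\frac{3}{2} - \lambda (L^{2}-\mu^{2})/(2\mu^{2}L)$. It equals $-1.5$ exactly at $L=\mu$ and stays close to that value over a wide range of $L$ when $\lambda/\mu$ is small, that is, when the variance is large.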