FOA Home
As before, we begin by defining a word to be any sequence of characters
separated by spaces. Let us therefore consider an alphabet of $M$
characters, interspersed with a specially-designated space character $
We will consider an especially simple model (similar to that used by
many others [Li92] [ Miller57] [Hill70] [Hill74] ) in which our random monkey hits
all keys -- space and letters -- with equal probability $p$:
p&=&\Pr(\emptyset) = \Pr(\mathtt{A}) = \cdots = \Pr(\mathtt{Z})
\nonumber \\ &=&{1}\over{M+1}
We can use a LEXICOGRAPHIC TREES to
conveniently organize words by the order in which the $k$ characters
occur prior to the terminating space, as shown in Figure
(figure) This shows a set of $M+1$ trees, each rooted in the
words' starting character. Leaf nodes at level $k$ are all labeled with
the probability of the sequence of $k-1$ characters prior to the space
occurring at level $k$.
One immediate observation is that N_{k} \), the
number of words \( w_{i} \) of length $k$ or less, is: N_k &\equiv&
\hbox{\it Number}(w_i\; | \; i\le k) \nonumber \\ &=& \sum_{i=1}^k M^i
\nonumber \\ &=& {M(1-M^k)}\over{1-M} In an infinitely long sequence of
characters generated according to Equation (FOAref) , we will
expect to find a ``word'' $w_k$ terminating at level $k$ (i.e., a string
of $k$ unbroken non-space characters bracketed by two spaces) with
probability defined in terms of the independent character probabilities
$p$ : p_k &\equiv& \Pr(w_k) \nonumber \\ &=& p^{k+2} \nonumber \\ &=&
{c\over {(M+1)^{k+2}}} We can compute $c$, the constant of
proportionality, by including all the $M^k$ words of length $k$, and
summing these probabilities over all possible words (including
unrealistic, infinitely long ones!): makes this issue disappear:
}
Next
consider the rank of these words. Since the probability of a word's
occurrence is an exponentially decreasing function of its length, we
know that the M highest ranked words are exactly the one character
words; next come the $M^{2}$ two-letter words; and so on. Using Equation
(FOAref) we therefore know how the rank $r_{k}$ of all words
$w_{k}$ terminating on level $k$ must be bounded above and below:
N_{k-1} &<& r_k \leq N_{k} \nonumber \\ {\tilde r}_k &=&
{{N_{k-1}+1+N_k}\over{2}} \nonumber \\ &=& (M^k - 1) {{M+1}\over
{2(M-1)} } where ${\tilde r}_k$ denotes a compromise, ``average'' rank
for all the $M^k$ equiprobable words.
Note that (FOAref) and
(FOAref) define the words' probability and rank, respectively,
in terms of the common metric $k$. As Li [Li92] notes, Zipf's Law is fundamentally
about this transformation, from an exponential distribution onto a rank
variable.
Solving both equations for $k$: k&=& -{{\ln(M\cdot
p_k)}\over{\ln(M+1)}} \nonumber \\ k&=& {{\ln\left( {{2(M-1) {\tilde
r}_k}\over{M+1}}+1\right)}\over{\ln M}} we can now set them equal and
derive an expression for a word's probability in terms of its rank:
{{\ln (M\cdot p_k)}\over{\ln(M+1)}}&=& -{{\ln\left({{2(M-1){\tilde
r}_k}\over{M+1}}+1\right)} \over{\ln M}} \nonumber \\ p_k &=& {1\over
M}\left({{2(M-1){\tilde r}_k}\over{M+1}}+1\right)^{{-\ln(M+1)}\over{\ln
M}} This has just the functional form required by Mandelbrot's
generalized Zipf's Law (cf. Equation (FOAref) ): p_k&=& {C
\over{({\tilde r}_k + B)^\alpha}}}\ \mathrm{,\ where} \nonumber \\
\alpha &=& {{\ln(M+1)}\over{\ln M}} \nonumber \\ B &=&
{{M+1}\over{2(M-1)}} \nonumber \\ C &=& {1\over
M}\left({{M+1}\over{2(M-1)}}\right)^\alpha \end{eqnarray}
Top of Page
Derivation of Zipf's Law for random texts
Subsections