Dynamic Programming and Hidden Markov Models
underlying the org.biojava.bio.dp package
Draft: Version 0.1
15 January 01

Richard K. Belew

Introduction

Together with the textbook Biological Sequence Analysis [Durbin et al, Cambridge Univ Press, 1998], the org.biojava.bio.dp package represents an excellent resource for teaching students of computational biology. This document is designed to help students familiar with the theory of Hidden Markov models but unfamiliar with the biojava implementation, and perhaps hackers who are using the code but unclear as to the theory behind it. It draws heavily from, and attempts to remain consistent with, both of these original sources. It also provides only the most superficial gloss of fundamental issues in statistical inference. Biological Sequence Analysis (esp. Chapters 3,4,11) is most highly recommended.

Still ToDo:

References to classes needs to be consistent and appropriate given relative positions in biojava distribution tree.
Mathematics are left in TeX $formula$ format
Incorporate CSS dependencies?

Finite State Machines

Interface overview

The central representation used by hidden Markov models (HMMs) is a finite state machine (FSM). This basic representation is built from a (finite!) number of States (nodes) and Transitions (links) among them. As the FSM proceeds from some (specially distinguished) initial state along some transition to the next, it also produces output of some sort. Most simply we can imagine that we are using the FSM to generate strings and a single character is produced every time we enter a state (Moore FSA) or cross a transition (Mealy FSA).

A slightly more difficult example arises in the alignment of two strings A and B. A very simple FSA might have three states corresponding to: a match between two characters in A and B; a hypothesized "gap" in A corresponding to a character in B; and a converse "gap" in B corresponding to a character in A. In this case, the emitted characters are a ternery "delta" difference between the indices of the two strings.

Markov chains are a special class of non-deterministic FSMs which jump from state to state according to probabilities $p_{ij}$ associated with each transition from state $s_i$ to state $s_j$, and the additional important property that the probability of transistion $p_{ij}$ depends only on the state $s_i$ and not any states the system may have visited earlier.

The central new component HMMs add is the distinction between "visible" states and "hidden" ones. This distinguishes a (visible) EmissionState which emits an output character when it is visited, from a DotState which does not. Said another way, there is no longer a one-to-one correspondence between states and the emitted evidence about what state the HMM is in.

The general State interface can then be implemented in various ways that take special advantage of specific features of the states , ala DNAState. Rather than going through the abstract Alphabet associated with States rather than the more efficient routines like DNATools.forIndex().

For example, in SimpleStateTrainer.train():

for( Iterator i = ((FiniteAlphabet) state.alphabet()).iterator(); i.hasNext();) {
      Symbol r = (Symbol) i.next();
      sum += ((Double) c.get(r)).doubleValue();
}
for( Iterator i = ((FiniteAlphabet) state.alphabet()).iterator(); i.hasNext();) {
      Symbol res = (Symbol) i.next();
      Double d = (Double) c.get(res);
      state.setWeight(res, Math.log(d.doubleValue() / sum) );
}

While in DNAStateTrainer.train():

      
for(int i = 0; i < c.length; i++) {
        Symbol r = DNATools.forIndex(i);
        sum += c[i] += (nullModel == null) ? 0.0 : Math.exp(nullModel.getWeight(r))*weight;
}
for(int i = 0; i < c.length; i++) {
        Symbol r = DNATools.forIndex(i);
        setWeight(r, Math.log(c[i] / sum) );
}

The StateFactory notion is useful for changing between extensional and intentional implemenations of state.

... great freedom for changing how the states are implemented. For example, insertFactory may always return views onto a single underlying probability distribution in which case all insert states will emit the same type of stuff - or it may create an individual state for each insert state, in which case each insert can be a different type of thing. You could make the insert state a view onto your null model, or onto a protein-specific distribution. [ProfileHMM]

MarkovModel

It should come as no surprise that when we consider the conjunctions of a long series of events, each of which may be unlikely, the computation of the product of these small numbers can quickly become very small, even with floating point representations. It turns out that computing probabilities in log() space, with log(probability) stored and products $p_i * p_j$ replaced with $log(p_i) + log(p_j)$ is much more convenient. Further, since many of the computations we want to do have to do with relative probabilities and since the log() transform is monotonic (i.e., $p_i < p_j \iff log(p_i) < log(p_j)$), this simplification can be used without affecting our answers.

"Magical" start and end states allow recursive (a single state representing an entire HMM) constructions. The FlatModel allows any recursion to be "flattened," for example when the model is being trained.

ProfileHMM builds on the basic HMM by assuming that it will have a fixed number of states: Col match states, Col delete states and Col+1 insert states.

Heads

Single: consume single sequence
Pair: alignments
>2: "more interesting things" ??

Dynamic programming (DP)

All the real work using the HMM is done by functions in the DP classes, SingleDP and PairwiseDP.

viterbi() "decodes" an observed sequence of (emitted) characters into the most likely path through the hidden states of the HMM. (Yes, this is the same Viterbi known locally for his role at Qualcomm, philanthropic efforts at KPBS, etc.!)

It is worth considering the complexity of the Viterbi algorithm.

State [] states
int [][] transitions                      /* States x States */
double [][] transitionScore               /* States x States */
BackPointer [] oldPointers = new BackPointer[stateCount];
BackPointer [] newPointers = new BackPointer[stateCount];

Thus the space required for the algorithm is $O(mn)$ where $m$ and $n$ are lengths of the sequences being compared. All of the basic computations over this structure are also $O(mn)$ (or simply $O(n^2)$ if we make the reasonable assumption that the two sequences are of comparable lengths).

forward() functions very much like the Viterbi routine. Like Viterbi it is also a "forward" inference, in the sense that it reasons from the beginning of the string ahead to the current state. but forward() considers the full sum over all possible paths rather than the maximum (most likely) path considered by viterbi.

backward() is analogous to forward(), but rather than reasoning from the beginning of the sequence forward to the current state and symbol, the inferences is backwards from "posterior" states and symbols at the end of the sequence. Of course there is nothing wrong with using both forms of reasoning to estimate the probability of the observed sequence, assuming that we are in a particular HMM state.

scoreWeightMatrix(WeightMatrix matrix,SymbolList symList, int start)
scores SymbolList from start with WeightMatrix, returning log probability WeightMatrix having generated SymList

generate(int length)
Generates an alignment from a model

Training Algorithm

Training algorithms are the mechanism by which an HMM is constructed from sequence data. Suppose first that the actual state sequences through the model were known. For example, it might be possible to label a training set of real sequence data using some other knowledge (e.g., a human expert or some other kind of test). Then observed relative frquencies in the training set can be used to form "maximum likelihood" estimates of the probability of transitions and emissions in a model.

From BaumWelchTrainer:

    

    SingleDPMatrix fm = (SingleDPMatrix) dp.forwardMatrix(rll);
    double fs = fm.getScore();
    
    SingleDPMatrix bm = (SingleDPMatrix) dp.backwardMatrix(rll);
    double bs = bm.getScore();

    // state trainer
    for (int i = 1; i <= resList.length(); i++) {
      Symbol res = resList.symbolAt(i);
      for (int s = 0; s < dp.getDotStatesIndex(); s++) {
        double [] fsc = fm.scores[i];
        double [] bsc = bm.scores[i];
        if (! (states[s] instanceof MagicalState)) {
          trainer.addStateCount((EmissionState) states[s], res, Math.exp(fsc[s] + bsc[s] - fs));
        } } }

    // transition trainer
    for (int i = 0; i <= resList.length(); i++) {
      Symbol res = (i < resList.length()) ? resList.symbolAt(i + 1) :
                    MagicalState.MAGICAL_SYMBOL;
      double [] fsc = fm.scores[i];
      double [] bsc = bm.scores[i+1];
      for (int s = 0; s < states.length; s++) {
        int [] ts = backwardTransitions[s];
        double [] tss = backwardTransitionScores[s];
        for (int tc = 0; tc < ts.length; tc++) {
          int t = ts[tc];
          double weight = (states[t] instanceof EmissionState)
            ? ((EmissionState) states[t]).getWeight(res)
            : 0.0;
          if (weight != Double.NEGATIVE_INFINITY) {
              trainer.addTransitionCount( states[s], states[t], Math.exp(fsc[s] + tss[tc] + weight + bsc[t] - fs ) );
          } } } }

This circular relationship between increasing knowledge of model and data suggests an iterative procedure for improving them both known as the Baum-Welch algorithm. The remarkable feature of this procedure is that, beginning from no knowledge (i.e, random probabilities) in a HMM, increasingly good (more likely) estimates both the model and their decoding of the data are obtained until this iteration produces no improvement.

From TrainingAlgorithm

train(SequenceDB,NullModel,NullWeight,StoppingCriteria)

Demos

demo/Dice

demo/SearchProfile

Putting it all together:

  
FiniteAlphabet PROTEIN = ProteinTools.getXAlphabet();
nullModel = createNullModel(PROTEIN);
SequenceDB seqDB = readSequenceDB(seqFile, PROTEIN);
ProfileHMM profile = createProfile(seqDB, PROTEIN);
DP dp = DPFactory.createDP(profile);

TrainingAlgorithm ta = new BaumWelchTrainer(dp);
ta.train(seqDB, nullModel, 5, 
         new StoppingCriteria() {
           public boolean isTrainingComplete(TrainingAlgorithm ta) {
                 if(ta.getCycle() == 5) {return true;} else {return false;} 
           }
});

for(SequenceIterator si = seqDB.sequenceIterator(); si.hasNext(); ) {
  Sequence seq = si.nextSequence();
  SymbolList [] rl = { seq };
  StatePath statePath = dp.viterbi(rl);
  double fScore = dp.forward(rl);
  double bScore = dp.backward(rl);
}

Misc. Utilities

GUI for sequence and alignments
Sanger's GFFformat for sharing sequence feature (annotation) information
Support vector machine (SVM) statistical analyses
Meme interface

Richard K. Belew

Dynamic Programming and Hidden Markov Models underlying the org.biojava.bio.dp package Draft: Version 0.1 15 January 01

Richard K. Belew

Introduction

Still ToDo:

Finite State Machines

Interface overview

MarkovModel

Dynamic programming (DP)

Training Algorithm

Demos

demo/Dice

demo/SearchProfile

Misc. Utilities

Dynamic Programming and Hidden Markov Models
underlying the org.biojava.bio.dp package
Draft: Version 0.1
15 January 01