
 
 
Research Article Volume 5 Issue 4
     
 
	Hidden Markov model approaches for biological studies
 Xiang Yang Lou1,2   
1Department of Pediatrics, University of Arkansas for Medical Sciences, USA
2Arkansas Children's Hospital Research Institute, USA
Correspondence: Xiang-Yang Lou, Ph.D., 1 Children's Way, Slot 512-43, Little Rock, AR 72202, USA, Tel 1-501-364-6516, Fax 1-501-364-1413
Received: February 14, 2017 | Published: March 31, 2017
Citation:  Lou XY. Hidden markov model approaches for biological studies. Biom Biostat Int J. 2017;5(4):132-144. DOI: 10.15406/bbij.2017.05.00139
        
       
Abstract
  An organism is a multi-level and modularized complex system composed of numerous interwoven metabolic and regulatory networks. Functional associations and random evolutionary events during evolution result in elusive molecular, physiological, metabolic, and evolutionary relationships. It is a daunting challenge for biological studies to decipher the complex biological mechanisms and crack the codes of life. Hidden Markov models, and more generally hidden Markov random fields, can capture both random signals and the inherent correlation structure, typically in time and space, and have emerged as a powerful approach to solve many analytical problems in biology. This article introduces the theory of the hidden Markov model and the computational algorithms for its three fundamental statistical problems, and summarizes striking applications of hidden Markov models to biological and medical studies.
  Keywords: hidden Markov model, hidden Markov random field, applications to biological studies, pattern recognition for biological sequences, genetic mapping, gene prediction, biological image analysis
 
Introduction
  A Markov process, named after the Russian mathematician Andrey Andreyevich Markov who developed much of the relevant statistical theory,1 is a stochastic process that satisfies the Markov property, characterized as "memorylessness" (also known as "non-aftereffect").2 The term "stochastic process", also called a random function and interpreted as a random element in a function space, refers to the statistical phenomenon that the outcomes of a collection of events or variables, typically in temporal and/or spatial order, are not deterministic but, instead, probabilistic; "memorylessness" means that given the state of a variable (or the states of a few adjacent variables) in a process, the succeeding variables and preceding variables are conditionally independent in the sense of probability. Markov processes can be categorized into various types according to their nature, such as whether the parameter of the random function is continuous or discrete and whether the state space is countable or infinite: discrete-time vs. continuous-time, countable vs. infinite states, first-order (the conditional probability of a variable depending only on its one preceding variable) vs. high-order (the conditional probability of a variable depending on its few preceding variables), time-homogeneous (stationary transition probabilities) vs. time-nonhomogeneous, and uni-dimensional vs. multi-dimensional. Markov processes are quite common in both the natural world and human societies; random walk, the drunkard's walk, Lévy flight, Brownian motion, and thermal electron transfer between different sites are all examples of time-homogeneous continuous-time Markov processes;3 the spread of epidemics, growth of populations, and vegetation dynamics can be approximated by a Markov process.4,5 A Markov chain is a special type of Markov process that is countable in state space and discrete in time.6 In most cases of interest, the states in an underlying Markov process are not observable, but there exists a probabilistic function of the states, associated with another set of stochastic sources, that produces observed symbols characterized as signals; such a process is correspondingly called a hidden Markov process.7 The probabilistic model used to characterize a hidden Markov process is referred to as a hidden Markov model (abbreviated as HMM). The most common HMM is uni-dimensional (one-dimensional); extensions to high-dimensional cases include the multi-dimensional hidden Markov model or Markov mesh random field, and, more generally, the Markov random field (MRF), also known as a Markov network. HMMs have a wide range of applications in signal processing, speech recognition, character decoding, image analysis, economics, sociology, and the life sciences.3,8-10 This review will briefly describe the statistical principle of HMMs and then focus on recent advances in applications of HMMs to biology and biomedicine.
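To make the Markov property concrete, the following is a minimal sketch (a hypothetical three-state chain with illustrative probabilities) of simulating a time-homogeneous, first-order Markov chain in Python: each step depends only on the current state through a fixed transition matrix.

```python
# Minimal sketch: simulate a time-homogeneous first-order Markov chain.
# The states and transition probabilities are hypothetical, for illustration only.
import numpy as np

states = ["sunny", "cloudy", "rainy"]       # hypothetical state space
P = np.array([[0.7, 0.2, 0.1],              # row i holds P(next state | current state i)
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

rng = np.random.default_rng(0)
chain = [0]                                 # start in state "sunny"
for _ in range(9):
    # memorylessness: the next state is drawn using only the current state's row
    chain.append(rng.choice(3, p=P[chain[-1]]))

print([states[s] for s in chain])
```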
 
Hidden Markov model
  The hidden Markov process is a class of doubly stochastic processes, characterized by the Markov property and output independence, in which an underlying Markov process is hidden, meaning the variable states cannot be directly observed but can be inferred through another set of stochastic processes evident as a sequence of observed outputs.11 The Markov process generates the sequence of variable states, specified by the initial state probabilities and the state transition probabilities between variables, while the observation process outputs the measurable signals, specified by a state-dependent probability distribution; the observations can thus be viewed as a noisy realization of a Markov process. In what follows the first-order HMM is used to illustrate the theory. Since any probabilistic question about higher order Markov chains is reducible, by a data augmentation device, to a corresponding question about first-order Markov chains,6 extension of these methods to the higher order cases is straightforward. As shown in Figure 1A, a uni-dimensional (1-D) hidden Markov process involves a sequence of hidden random variables $X_1, X_2, \ldots, X_T$, where each variable $X_t$ has $N$ potential states and the conditional probability satisfies the Markov dependence, $P(X_t \mid X_{t-1}, \ldots, X_1) = P(X_t \mid X_{t-1})$, $t = 2, \ldots, T$; namely, the underlying states follow a first-order Markov chain. If $P(X_t = j \mid X_{t-1} = i)$ does not depend on $t$, that is, the transition probability is independent of the position of a variable in the sequence, it is called a time-homogeneous Markov chain. Without loss of generality, the time-nonhomogeneous Markov chain is used here to illustrate the statistical principle. There is a series of output signals, either discrete or continuous, denoted by $O_1, O_2, \ldots, O_T$, associated, through a set of probability distributions, with the variable states underlying the hidden Markov process $(X_1, X_2, \ldots, X_T)$, which are not observed. A hidden Markov process can be characterized by the corresponding HMM, which is formally defined by the following components: the elemental structure, including the hidden variables and their state spaces and the output symbols and their value ranges, and a set of three-tuple parameters denoted by $\lambda = (\pi, A, B)$, where $\pi$ is the initial state probability distribution, $A$ is the state transition probability distribution, and $B$ is the emission probability distribution. The initial probability vector is

$\pi = (\pi_i)$, $\pi_i = P(X_1 = i)$, $i = 1, \ldots, N$,
Figure 1 Hidden Markov processes: (A) One-dimensional hidden Markov process, (B) Two-dimensional hidden Markov process, and (C) Hidden Markov random field.
 
 
where $\sum_{i=1}^{N} \pi_i = 1$; the transition probability matrix of variable $X_t$ ($t = 2, \ldots, T$) is

$A_t = (a_{t,ij})_{N \times N}$, $a_{t,ij} = P(X_t = j \mid X_{t-1} = i)$, $i, j = 1, \ldots, N$,

where $\sum_{j=1}^{N} a_{t,ij} = 1$; and the output probability distribution of observation $O_t$ ($t = 1, \ldots, T$) is, either as a matrix or as a family of distributions depending on the type of observations,

$B_t = (b_{t,i}(o))$, $b_{t,i}(o) = P(O_t = o \mid X_t = i)$, $i = 1, \ldots, N$,

where $\sum_{o} b_{t,i}(o) = 1$.
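For illustration, the sketch below writes down the three-tuple $\lambda = (\pi, A, B)$ for a hypothetical two-state model with three output symbols and evaluates the probability of one particular joint realization of states and observations; the numbers are illustrative only.

```python
# Minimal sketch of the HMM parameterization lambda = (pi, A, B) defined above,
# for a hypothetical two-state model emitting one of three discrete symbols.
import numpy as np

pi = np.array([0.6, 0.4])            # initial state probabilities, sum to 1
A  = np.array([[0.9, 0.1],           # A[i, j] = P(X_t = j | X_{t-1} = i)
               [0.2, 0.8]])
B  = np.array([[0.5, 0.4, 0.1],      # B[i, k] = P(O_t = symbol k | X_t = i)
               [0.1, 0.3, 0.6]])

# Probability of the joint realization (states 0, 0, 1; symbols 0, 1, 2):
p = pi[0] * B[0, 0] * A[0, 0] * B[0, 1] * A[0, 1] * B[1, 2]
print(p)
```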
    As shown in Figure 1B, the underlying process of a two-dimensional (2-D) hidden Markov process is a Markov mesh consisting of an array of random variables $\{X_{i,j}\}$,12,13 where each variable $X_{i,j}$ can potentially take $N$ states ($i = 1, \ldots, R$; $j = 1, \ldots, C$), and the conditional probability of a node in the 2-D lattice satisfies the property of memorylessness, depending only on its nearest neighbors,

$P(X_{i,j} \mid \{X_{k,l}: (k,l) \prec (i,j)\}) = P(X_{i,j} \mid X_{i-1,j}, X_{i,j-1})$, $i = 2, \ldots, R$, $j = 2, \ldots, C$,

where $(k,l) \prec (i,j)$ denotes the nodes preceding node $(i,j)$ in the lattice ordering, and $P(X_{i,j} = s \mid X_{i-1,j} = k, X_{i,j-1} = l)$ is the state transition probability, denoted by $a_{i,j}(k, l; s)$; when $i = 1$ and $j > 1$,

$P(X_{1,j} \mid \{X_{k,l}: (k,l) \prec (1,j)\}) = P(X_{1,j} \mid X_{1,j-1})$, $j = 2, \ldots, C$,

where $P(X_{1,j} = s \mid X_{1,j-1} = l)$ is the transition probability along the first row, denoted by $a_{1,j}(l; s)$; when $j = 1$ and $i > 1$,

$P(X_{i,1} \mid \{X_{k,l}: (k,l) \prec (i,1)\}) = P(X_{i,1} \mid X_{i-1,1})$, $i = 2, \ldots, R$,

where $P(X_{i,1} = s \mid X_{i-1,1} = k)$ is the transition probability along the first column, denoted by $a_{i,1}(k; s)$; when $i = 1$ and $j = 1$, there is a set of initial state probabilities, denoted by

$\pi_s = P(X_{1,1} = s)$, $s = 1, \ldots, N$,

where $\sum_{s=1}^{N} \pi_s = 1$. The output signal $O_{i,j}$ conforms to the following distribution given the state of $X_{i,j}$, being conditionally independent of the states of the other variables and of the other observations,

$P(O_{i,j} = o \mid X_{i,j} = s, \text{all other states and observations}) = P(O_{i,j} = o \mid X_{i,j} = s)$, $i = 1, \ldots, R$, $j = 1, \ldots, C$,

where $P(O_{i,j} = o \mid X_{i,j} = s)$ is the output probability, denoted by $b_{i,j,s}(o)$.
    HMMs can be further extended to higher dimensional cases,14-17 and even more generally to hidden MRFs or Markov networks.18-23 A hidden MRF is a Markov random field degraded by conditionally independent noise, in which the set of underlying variables is latent,24 and it is often described by a graphical model for its representation of dependencies. As shown in Figure 1C, each vertex (node) corresponds to a variable and an edge connecting two vertices represents a dependency. When there is an edge connecting every two distinct variables in a subset of vertices, such a subset is called a clique. Each clique is associated with a nonnegative function, called a potential function, to specify its contribution to the probability distribution. A maximal clique is a clique that cannot be extended by including one more adjacent vertex, that is, a clique that does not exist exclusively within the vertex set of a larger clique. The joint distribution of a set of variables in an MRF can be expressed as a product of potential functions based on clique factorization,

$P(X = x) = \dfrac{1}{Z} \prod_{c \in C} \phi_c(x_c)$,

where $C$ is a set of cliques, usually the set of maximal cliques, $\phi_c$ is the potential function of clique $c$, $x_c$ is the state configuration of clique $c$, and $Z$ is the normalizing constant that makes the probabilities sum to 1 over the whole state space. The underlying variables are hidden, but each variable outputs a signal confounded with noise. The output observation of a variable is conditionally independent of the states of the other variables and of the other outputs.
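As a small worked illustration of this factorization, the following sketch (a hypothetical three-node chain with binary states) computes the normalizing constant and one joint probability directly from two pairwise clique potentials.

```python
# Minimal sketch of clique factorization for a hypothetical chain MRF x1 - x2 - x3
# with binary states; the joint distribution is a normalized product of the
# potentials of the two maximal cliques {x1, x2} and {x2, x3}.
import itertools
import numpy as np

phi_12 = np.array([[2.0, 1.0],      # potential table for clique {x1, x2}
                   [1.0, 3.0]])
phi_23 = np.array([[1.0, 2.0],      # potential table for clique {x2, x3}
                   [4.0, 1.0]])

def unnormalized(x1, x2, x3):
    return phi_12[x1, x2] * phi_23[x2, x3]

# normalizing constant Z: sum the product of potentials over the whole state space
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=3))
p_011 = unnormalized(0, 1, 1) / Z   # P(x1 = 0, x2 = 1, x3 = 1)
print(Z, p_011)
```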
 
Statistical inference in hidden Markov models
  Three fundamental problems are addressed for HMMs:11 (1) evaluation or scoring, to compute the probability of the observation sequence for a given model; (2) decoding or optimization, to find the optimal corresponding state sequence given the observation sequence and the model; and (3) learning or training, to estimate the model parameters (initial probabilities, transition probabilities, and emission probabilities) that best explain the observation sequences, given the model structure describing the relationships between variables. Although all these computations can be implemented by naive brute-force algorithms, which exhaustively enumerate all possible state sequences and do the calculations over them for the observed series of events, such algorithms are not efficient. Their computational complexity increases exponentially with problem size, so the implementation quickly becomes computationally infeasible as the number of states and the length of the sequence increase, even for modest numbers. The 1-D Markov model can be used to illustrate this point. When moving from one variable to the next along the Markov chain, each current state may shift to any of the possible next states, and a state transition diagram can be formed by connecting all plausible moves, as shown in Figure 2A. Each path from the first variable to the final variable represents a possible state combination, and there are a total of $N^T$ paths, which can be, in many cases of interest, an astronomical number. Powerful algorithms can be developed through recursive factorization or dynamic programming by making use of the conditional independence given the Markov blanket of a variable or a group of variables, so that substantial redundant calculations and storage are avoided, and far fewer arithmetic operations and less computing resources such as computer memory are required.
Figure 2 Trellis algorithm: (A) The forward trellis and (B) The Viterbi trellis.
 
 
  The principle of the trellis algorithm is extensively used in statistical analysis for 1-D hidden Markov models. As visualized in Figure 2, the states of the various variables are arranged into a trellis according to the order of the variables, whose dimensions are, respectively, the number of states and the length of the variable sequence. The nodes at a vertical slice represent the states of the corresponding variable. In a trellis graph, each node at each variable position connects to at least one node at the previous variable and/or at least one node at the next variable (each node for the first variable and for the last variable can be viewed as always connecting with one dummy node before the start position and after the final position, respectively), thus forming an interconnected network. By use of the conditional independence, the intermediate calculations on all the paths in which a node is involved can be individually cached to avoid redundant calculation. The forward-backward algorithm,25,26 the Viterbi algorithm,27-29 and the Baum-Welch algorithm (a special type of Expectation-Maximization, abbreviated EM, algorithm)30,31 were developed based on the trellis diagram and can be used for purposes of evaluation, decoding, and learning, respectively. The relevant theory and methods are concisely recapitulated as follows.
  Forward-backward  algorithm
  The recursive forward or backward algorithm can be used to compute the probability of a sequence of observations $(o_1, o_2, \ldots, o_T)$ generated by a particular model, denoted by

$P(O \mid \lambda) = P(o_1, o_2, \ldots, o_T \mid \lambda)$.

Corresponding to each node in a trellis diagram (Figure 2A), two storage variables are used to cache the forward probability (the sum of probabilities of all paths leaving from the start of the sequence and ending in the state considered) and the backward probability (the sum of probabilities of all paths starting with the considered state and going to the end of the sequence), respectively denoted by $\alpha_t(i)$ and $\beta_t(i)$ ($t = 1, \ldots, T$; $i = 1, \ldots, N$). Both the forward algorithm and the backward algorithm involve three steps: initialization, recursion (or induction), and termination.

(1) Initialization

Calculate the forward probability at the first position, $\alpha_1(i) = \pi_i\, b_{1,i}(o_1)$, $i = 1, \ldots, N$.

Specify the backward probability at the last position, $\beta_T(i) = 1$, $i = 1, \ldots, N$.

(2) Recursion

Compute the forward probability, $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{t,ij}\right] b_{t,j}(o_t)$, $t = 2, \ldots, T$, $j = 1, \ldots, N$.

Compute the backward probability, $\beta_t(i) = \sum_{j=1}^{N} a_{t+1,ij}\, b_{t+1,j}(o_{t+1})\, \beta_{t+1}(j)$, $t = T-1, \ldots, 1$, $i = 1, \ldots, N$.

(3) Termination

When $t = T$, the forward recursion stops. Then the probability of the whole sequence of observations can be found by summing the forward probabilities over all the states at the final variable, $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.

When $t = 1$, the backward recursion stops. Then the probability of the whole sequence of observations can also be found by summing the products of the forward and backward probabilities over all the states at the first variable, $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_1(i)\, \beta_1(i)$.
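As an illustration, the following is a minimal sketch (not tied to any particular software) of the forward and backward recursions for a time-homogeneous discrete HMM; for long sequences, scaling or log-space arithmetic would be needed to avoid numerical underflow.

```python
# Minimal sketch of the forward-backward recursions for a time-homogeneous
# discrete HMM with parameters pi, A, B as defined earlier; obs is a list of
# observed symbol indices.
import numpy as np

def forward_backward(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # initialization
    alpha[0] = pi * B[:, obs[0]]
    beta[T - 1] = 1.0

    # recursion
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # termination: both expressions equal P(observations | model)
    return alpha, beta, alpha[T - 1].sum(), (alpha[0] * beta[0]).sum()
```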
 Viterbi algorithm
  The Viterbi algorithm can be used to identify the most likely sequence of hidden states to produce an observed sequence, termed the Viterbi path, denoted by $Q^* = (q_1^*, q_2^*, \ldots, q_T^*)$, so that

$Q^* = \arg\max_{Q} P(Q \mid O, \lambda)$.

Corresponding to each node (Figure 2B), two storage variables are used to cache the probability of the most likely path for the partial sequence of observations and the state at the previous variable leading to this path, denoted by $\delta_t(i)$ and $\psi_t(i)$ ($t = 1, \ldots, T$; $i = 1, \ldots, N$), respectively. The Viterbi algorithm consists of four steps: initialization, recursion, termination, and backtracking.

(1) Initialization

Calculate $\delta_1(i) = \pi_i\, b_{1,i}(o_1)$ and set $\psi_1(i) = 0$, $i = 1, \ldots, N$.

(2) Recursion

Calculate $\delta_t(j) = \max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{t,ij}\right] b_{t,j}(o_t)$, and record the state $\psi_t(j) = \arg\max_{1 \le i \le N} \left[\delta_{t-1}(i)\, a_{t,ij}\right]$, $t = 2, \ldots, T$, $j = 1, \ldots, N$.

(3) Termination

The recursion ends when $t = T$. The probability of the most likely path is found by $P^* = \max_{1 \le i \le N} \delta_T(i)$. The state of this path at variable $T$ is found by $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$.

(4) Backtracking

The state of the optimal path at variable $t$ is found by $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T-1, \ldots, 1$.
EM algorithm

Given a set of observation sequences and the HMM structure, the EM algorithm can be implemented for model fitting. The EM algorithm is an iterative method to find the maximum likelihood estimate(s) (MLE) based on the following principle:32 from the Kullback-Leibler divergence theory,33 the expected log-likelihood function of the complete data (consisting of the observed data, known as the incomplete data, and the unobserved latent data) is a lower bound on the log-likelihood of the observed data; that is, the expected log-likelihood of the complete data under any probability distribution of the hidden data is less than or equal to the log-likelihood of the observed data. Therefore, we can use the expected log-likelihood function of the complete data as a working function, iteratively approaching the log-likelihood of the observed data, the true objective function, and thereby find the MLE. The EM algorithm alternates between an expectation step (E-step) and a maximization step (M-step). In an E-step, the expectation of the log-likelihood of the complete data is evaluated using the current estimate of the parameters to create a function for maximization, and in an M-step, the parameters maximizing the expected log-likelihood found in the E-step are computed. It can be proved that such an iteration never decreases the objective function, assuring that EM converges to a local optimum of the likelihood. The EM algorithm is particularly efficient when there is a closed-form solution to the MLE for the complete data. The EM algorithm includes three stages: initialization, a series of iterations, and termination.

(1) Initialization

Given a set of initial parameter values $\lambda^{(0)}$, start the EM iterations.

(2) Iteration

Each cycle of EM iteration involves two steps, an E-step followed by an M-step, alternately optimizing the log-likelihood with respect to the posterior probabilities and the parameters, respectively.

E-step: Calculate the posterior probability of the latent data using the current parameter estimate $\lambda^{(k)}$, $P(X \mid O, \lambda^{(k)})$. Specifically, as mentioned in the forward-backward algorithm, compute the forward probabilities $\alpha_t(i)$ and the backward probabilities $\beta_t(i)$ for each sequence of observations $(o_1, o_2, \ldots, o_T)$ using the current estimate. Further, calculate the posterior state probabilities of a variable and of the state combinations of two adjacent variables as follows,

$\gamma_t(i) = P(X_t = i \mid O, \lambda^{(k)}) = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$, $t = 1, \ldots, T$, $i = 1, \ldots, N$,

$\xi_t(i, j) = P(X_{t-1} = i, X_t = j \mid O, \lambda^{(k)}) = \dfrac{\alpha_{t-1}(i)\, a_{t,ij}\, b_{t,j}(o_t)\, \beta_t(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_{t-1}(i)\, a_{t,ij}\, b_{t,j}(o_t)\, \beta_t(j)}$, $t = 2, \ldots, T$, $i, j = 1, \ldots, N$.

Performing the E-step assures that the expected log-likelihood function is evaluated under the current estimated parameters $\lambda^{(k)}$. Then the function for the expectation of the log-likelihood of the complete data is computed by using the estimated posterior probabilities of the hidden data.

M-step: Estimate the new parameters that maximize the expected log-likelihood found in the E-step. When the complete data are available, the estimation of parameters in an HMM is straightforward; for example, the initial state probability $\pi_i$ ($i = 1, \ldots, N$) can be calculated from $\gamma_1(i)$ by the counting method, the transition probability $a_{t,ij}$ ($t = 2, \ldots, T$; $i, j = 1, \ldots, N$) can be calculated from $\xi_t(i, j)$ by the counting method, and the emission parameter $b_{t,i}$ ($t = 1, \ldots, T$; $i = 1, \ldots, N$) can be found from $\gamma_t(i)$ and the observation $o_t$, depending on the form of the emission probability function. Performing the M-step assures that the expected log-likelihood of the complete data computed from the E-step is maximized with respect to the parameters.

(3) Termination

Repeat the E-step and the M-step until convergence is reached, for example when the objective function no longer increases or the parameter estimates no longer change.
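Below is a minimal sketch of one possible Baum-Welch implementation for a single observation sequence and a time-homogeneous discrete HMM; scaling is omitted for brevity, so it is suitable only for short sequences.

```python
# Minimal sketch of the Baum-Welch (EM) iteration for a time-homogeneous
# discrete HMM and a single observation sequence obs (list of symbol indices).
import numpy as np

def baum_welch(pi, A, B, obs, n_iter=20):
    T, N = len(obs), len(pi)
    obs = np.asarray(obs)
    for _ in range(n_iter):
        # E-step: forward and backward probabilities under the current parameters
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        p_obs = alpha[T - 1].sum()

        # posterior state probabilities gamma and pairwise posteriors xi
        gamma = alpha * beta / p_obs
        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A *
                     (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / p_obs

        # M-step: re-estimate the parameters from expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B_new = np.zeros_like(B)
        for k in range(B.shape[1]):
            B_new[:, k] = gamma[obs == k].sum(axis=0)
        B = B_new / gamma.sum(axis=0)[:, None]
    return pi, A, B
```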
  In theory, extension of the above 1-D methods to higher dimensional HMMs and hidden MRFs is straightforward. In practice, such an extension is not easy to implement. One solution is to convert a multi-dimensional model into a 1-D vector Markov model by considering the set of nodes with a fixed coordinate along a given direction (e.g., horizontal, vertical, or diagonal in a regular plane) as a new vector node. For example, the rows, the columns, or the anti-diagonal and its parallels of a 2-D lattice respectively form super nodes, generating a vector Markov chain.34–38 Generalization to the 3-D case is also straightforward.39–41 The forward-backward algorithm, the Viterbi algorithm, and the EM algorithm are then applicable to the new HMM. The main limitation of these approaches is that, although they avoid exhaustive computation along the chosen dimension, it is still necessary to consider all possible combinations of states in the resulting vector, and thus the complexity is lessened only in one direction; in other words, the computational complexity of the algorithms grows exponentially with the data size in the other dimension(s), e.g., the number of rows, the number of columns, or the minimum of them in the 2-D lattice,34,42,43 making the computations intractable in most cases of interest. Alternatively, restricted models with reduced connectivity, such as pseudo multi-dimensional HMMs,44,45 embedded HMMs,46,47 and dependence-tree HMMs,48 have been suggested for use. One of their major shortcomings is that the dependence of a node on its neighbors in a fully connected multi-dimensional HMM is not guaranteed to be captured. Several attempts have also been made to heuristically reduce the complexity of the HMM algorithms by making simplifying assumptions.34,36,42,43,49–52 The main disadvantage of these approaches is that they only provide approximate computations, such that the probabilistic model is no longer theoretically sound.

  For hidden MRFs, there are also a few methods for exact computation, such as variable elimination methods (an analogue of the forward-backward algorithm) and belief propagation methods (sum-product and max-product message passing, corresponding to the forward-backward and the Viterbi algorithms, respectively).53 However, exact inference for hidden MRFs and higher dimensional HMMs is a nondeterministic polynomial time-hard (NP-hard) problem, and it is believed that no general method exists for efficient exact computation on hidden MRFs and higher dimensional HMMs. In most cases, exact inference is not feasible because of the tremendous computational burden. Approximate inference approaches and numerical computing techniques, such as Markov chain Monte Carlo sampling,54 variational inference,55 and loopy belief propagation,56,57 are more feasible.
  My research team, funded by an NSF grant, is developing a telescopic algorithm that makes full use of the property of conditional independence in these models; its computational complexity is expected to grow linearly (rather than exponentially) with the sizes of both dimensions of a 2-D HMM, thus greatly lowering the cost in computing resources, including computer memory and computing time, and making exact, iterative statistical inference feasible for high-dimensional HMMs.
 
Applications to biological studies
  HMMs offer an effective means for prediction and pattern recognition in biological studies. Since it was first introduced to computational biology by Churchill,58 the HMM has emerged as a promising tool for diverse biological problems, and there is an expanding list of applications in biology.59,60 Here we highlight the major applications across various biological research areas.
  Genetic mapping
  Genetic mapping, which includes linkage map construction and gene mapping, is to place a collection of genetic loci, such as genes, genetic markers, and other forms of DNA segments, onto their respective positions on the genome, based on the information of meiotic crossover (recombination between homologous chromosomes). Genetic mapping is the basis of map-based (positional) cloning, marker-assisted selection, and forward genetics.
  Gene transmission from parent to offspring at multiple genetic loci is by nature a 2-D Markov process: along the generation (vertical) dimension, given the genotypes of the parents, the genotype of an offspring is independent of his/her other ancestors, and along the chromosome (horizontal) dimension, given the inheritance state of a locus (denoting whether the paternal allele and the maternal allele are grand-paternally or grand-maternally inherited), the inheritance state at the locus on one side is independent of that at the locus on the other side, assuming no crossover interference (otherwise, it would be a higher order Markov process). The ordered alleles or inheritance state of an individual at a locus can be viewed as a node in a 2-D Markov mesh. However, the genotype at a gene locus, as well as the ordered genotypes and the inheritance states at genetic markers and genes, is usually not directly observed. Thus, gene transmission is a 2-D hidden Markov process. Two seminal algorithms, the Elston-Stewart algorithm61,62 and the Lander-Green algorithm,63 are the major tools for genetic reconstruction and extracting inheritance information, laying the cornerstones of genetic mapping. The most popular analytical methods and software packages are derived from them; for example, the former underlies FASTLINK64–66 and VITESSE,67 and the latter underlies GENEHUNTER,68 MERLIN,69 and ALLEGRO.70,71 But both algorithms use the strategy that converts the 2-D model into a 1-D vector model: the Elston-Stewart algorithm and the Lander-Green algorithm correspond to the 1-D row-vector and column-vector HMMs, respectively. Therefore, both categories of approaches have intrinsic weaknesses: the Elston-Stewart algorithm is only applicable to a handful of genetic loci, because its computational complexity grows exponentially with the number of loci, while the Lander-Green algorithm is only applicable to relatively small pedigrees, because its computational complexity grows exponentially with the number of nonfounders in a pedigree. It is expected that the telescopic algorithm under development will offer a solution that circumvents the limitations of the Elston-Stewart and Lander-Green algorithms.
  A simple pedigree consisting of four generations and seven individuals, shown in Figure 3B, is used to illustrate its correspondence to the 2-D HMM in Figure 3A. In the dimension of time (along the generations), the marriage nodes 1 × 2, 3 × 4, and 5 × 6, respectively, correspond to the first three rows in Figure 3A, while individual 7 corresponds to the last row. In the dimension of space (along a chromosome), the genetic loci, either markers or genes, in Figure 3B correspond to the columns in Figure 3A. In the absence of both crossover interference and higher order linkage disequilibrium, it is a 2-D HMM of order one. The two-tuple state variable consisting of the ordered genotype and inheritance state is hidden, and its number of states varies from node to node depending on the allele number at a locus and the parental genotypes. The transition probability along the first row is determined by the linkage disequilibrium parameters between two contiguous loci or the haplotype frequencies, as both individuals of this marriage node are founders; those along the second and the third rows are determined by the linkage disequilibrium coefficients and the recombination rate, as one individual of that node is a founder and the other is a nonfounder; and that along the fourth row is determined by the recombination rate, as the only individual of that node is a nonfounder. The transition probability along a column is either 0.25 (equi-probable transmission) if the states of the parents and the offspring are compatible, or 0 otherwise. The emission parameters depend on the type of locus. For a genetic marker, they take the value of 1 if the hidden state is compatible with the marker observation or 0 otherwise. For a gene locus, they are the penetrance or another proper probabilistic function linking a genotype to the phenotype. A general pedigree also corresponds to a similar 2-D HMM. Given the straightforward correspondence between 2-D HMMs and pedigrees, the telescopic algorithm can be adapted to the genetics context to build an innovative genetic reconstruction engine, and the inheritance processes can be restored in the statistical sense. Based on the reconstructed inheritance processes, genotypic similarity between relatives can be evaluated in terms of the distribution of the hidden state variable at a certain locus (or loci), and it is in principle feasible to develop powerful parametric and nonparametric methods to test linkage, association, or both.
  
  
Figure 3 An illustrative diagram of the correspondence between a 2-D HMM and its corresponding pedigree: (A) the 2-D HMM and (B) a simple pedigree with four generations.
 
 
  
  
  
  
  Biological sequence  analysis
  Proteins and nucleic acids, including DNA and RNA, are linearly ordered sequences of monomers such as amino acids and nucleotides. The primary structure of a biological macromolecule (i.e., the linear sequence of monomer units) is its structural and functional basis, determining the higher order structure and 3-D shape/conformation in living systems and hence defining its function. A biological sequence is not only important in the description of a biological polymer but is also the basic information-bearing entity for its biological role. Biological sequence analysis can shed light on a sequence's features, structure, function, and evolution.
  New sequences are adapted from pre-existing sequences rather than invented de novo. Over a long evolutionary course, induced by biotic and abiotic agents, an ancestral sequence may undergo mutation, recombination, and insertion/deletion, which are further subjected to evolutionary forces such as selection and genetic drift, to develop into the extant sequences. Because of certain biological mechanisms, such as the triplet codon, dinucleotide tracts (e.g., CpG islands), and the sequence consensus due to conserved functional domains, there also exist immanent associations between sequence units. It has also been demonstrated that a biological sequence is not a completely random arrangement.72 Haussler et al. reported that a biological sequence such as a protein or DNA could be characterized by an HMM, where the match (including identity and substitution) or gap (i.e., insertion/deletion) occurring at each site can be considered a hidden state that satisfies the Markov structure, while the sequence of monomers is the observation emitted from a set of state-dependent probability distributions.73 HMMs are recognized as a powerful analytical tool and are extensively used for pattern recognition in a single sequence (such as direct repeats and characteristic subsequences), sequence alignment, sequence classification, similarity query, and homology detection.74–77
  Sequence alignment is to arrange a pair or a group of biological sequences next to one another so that their most similar elements are juxtaposed, and thereby to identify regions of similarity that may be a consequence of structural, functional, or evolutionary relationships among the sequences. Alignments are conventionally shown as traces or as a position-specific scoring table (also known as a profile), where the aligned sequences are typically represented as rows within a matrix and gaps may be inserted between the units so that each element in a sequence is either a match or a gap and identical or similar elements are aligned in successive columns. If the aligned sequences share a common ancestry, mismatches can be interpreted as point mutations, and gaps correspond to insertion or deletion mutations introduced in one or more lineages. Successful alignments are a prerequisite for many bioinformatics tasks such as pattern recognition, characteristic subsequence extraction, and motif search. Computational approaches to sequence alignment generally fall into global (full-length) and local alignments, and into pairwise and multiple sequence alignments. Global alignment aligns every element in all analyzed sequences, so it needs to consider all mismatches and gaps across the entire length. A global alignment is best for describing relationships and inferring homology between query biological molecules. Local alignment attempts to align only parts of the query sequences (subsequences) and to identify regions of similarity or similar sequence motifs within long sequences. Local alignments are often preferable because it is common that only parts of the compared sequences are homologous (e.g., they share one conserved domain, whereas other domains are unique), and, on many occasions, only a portion of the sequence is conserved enough to carry a detectable signal, whereas the rest has diverged beyond recognition. Multiple sequence alignment is an extension of pairwise alignment that incorporates more than two sequences at a time. Although multiple sequence alignment shares the same principle as pairwise alignment, the algorithmic complexity of exact computation increases exponentially with the number of query sequences, and thus it is usually implemented by heuristic and progressive algorithms. Several common methods include the Needleman-Wunsch algorithm78 for pairwise global alignment, the Smith-Waterman algorithm79 for pairwise local alignment, the Feng-Doolittle algorithm80 for progressive multiple sequence alignment, and profile analysis81 for aligning a family of similar sequences, finding distantly related sequences, and identifying known sequence domains in new sequences by sequence comparison. However, these standard methods depend heavily upon the choice of the substitution/scoring matrix (e.g., the Dayhoff mutational distance matrix for protein sequences) and of gap penalties, which can be somewhat subjective and arbitrary. Application of HMMs to sequence alignment may circumvent these limitations of the traditional methods. In the HMM context, an alignment can be viewed as a series of nodes, each of which corresponds to a position (column) in the alignment and may take one of three hidden states: match, insertion, and deletion. A set of transition probabilities specifies the potential transition from one state to another in a move along the sequence of positions.
A match state and an insertion state emit an output token from a set of possible symbols (e.g., nucleotides or amino acids) according to their respective emission probabilities, while a deletion state is silent. The HMM can be trained on unaligned sequences or on pre-constructed multiple alignments, and the model parameters, including the transition probabilities between match, insertion, and deletion states and the state-specific emission probabilities, can be learned from the training data. Once training is done, the optimal state path for each sequence can be decoded based on the fitted model, and the similarity between sequences can also be evaluated based on the (log) p-value. Compared with the customary methods, the HMM-based methods are well grounded in probability theory and have a consistent probabilistic basis behind gap penalties and substitution scores, in general producing a better result; they also run in an automatic regime and thus require less skill.82,83
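As a concrete illustration of the training step, the following minimal sketch derives match-state emission probabilities for a profile HMM from a small hypothetical alignment, using the common heuristic that columns with at most 50% gaps are treated as match columns and adding pseudocounts; a complete implementation would also estimate the insertion and deletion state transition probabilities.

```python
# Minimal sketch: match-state emission probabilities of a profile HMM estimated
# from a hypothetical pre-constructed multiple alignment ('-' marks a gap).
import numpy as np

alignment = ["ACG-T",
             "ACGGT",
             "A-GCT",
             "TCG-T"]
alphabet = "ACGT"

columns = list(zip(*alignment))
# heuristic: a column with at most 50% gaps is modeled by a match state
match_cols = [col for col in columns if col.count("-") <= len(col) / 2]

emissions = []
for col in match_cols:
    # count residues in the column and add one pseudocount per symbol
    counts = np.array([sum(ch == a for ch in col) for a in alphabet], float) + 1.0
    emissions.append(counts / counts.sum())

for i, e in enumerate(emissions):
    print(f"match state {i + 1}:", dict(zip(alphabet, e.round(2))))
```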
  After a profile or traces is constructed for a group of related  sequences derived from a common ancestral sequence, likewise, the HMM can be  applied to extract consensus segments, to create a retrieval database, to  capture information about the degree of conservation, to find the family of  sequences, to test the similarity of new sequences to a group of aligned  sequences (probe), and further to classify sequences, to infer the evolutionary  relationship, and to construct a phylogenetic tree.
  Gene finding and feature discovery
  In computational biology, gene finding or gene prediction is to identify the coding regions or genes in genomic DNA, also known as genome annotation.84–87 There are two categories of methods for gene prediction. One is the ab initio methods, which find gene(s) by systematically searching the genomic DNA alone for certain signals and statistical properties of protein-coding genes. A gene is composed of specific structures, including the promoter sequence, start codon, exons, introns, stop codon, and untranslated regions. There are certain sequence features in these regions, such as the donor (GT dinucleotide) and acceptor (AG dinucleotide) sites of an intron. Furthermore, protein-coding DNA has certain periodicities and other statistical properties. Thus, the HMM can be used for gene discovery. The principle is that, by considering the gene structures and characteristic subsequences as hidden states and the genomic DNA as the observed symbols, gene prediction is, similar to natural language recognition, a decoding problem for a hidden Markov process. Specifically, the HMM is fitted using a set of annotated sequences (i.e., sequences with known coding and noncoding regions); then the most likely hidden states for a new sequence are found based on the trained HMM. Such a principle of prediction is also applicable to other bioinformatics analyses, including functional domain finding, sequence pattern extraction, motif search, and identification of CpG islands, open reading frames, transcription factor binding sites, cis-regulatory modules, and noncoding RNA.88
  A hypothetical example is used here to illustrate the application of the Viterbi decoding algorithm to identifying a CpG island, a region of DNA with a high frequency of CpG sites, where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' → 3' direction. As cytosines in CpG dinucleotides are vulnerable to methylation, which can change them with a high chance into thymine by accidental deamination, the CpG dinucleotide frequency in genomic DNA is often much lower than expected. However, methylation is often suppressed in the promoter regions of active genes, and therefore those regions often have a much higher CpG frequency than elsewhere in the genome. CpG islands, at least 200 (typically 300-3,000) base pairs in length with a GC content greater than 50% and a ratio of observed-to-expected CpG number above 60%, are often associated with the start of a gene (promoter region) in most mammalian genomes, and thus the presence of a CpG island is an important signal for gene finding. The HMM is a powerful approach to determine whether a short DNA segment comes from a CpG island or not, and to find all CpG islands contained in a long genomic segment. There are several choices of HMM for CpG islands. Assume an HMM, as shown in Figure 4, is available (or has already been trained from a set of empirical observations), in which there are two hidden states (CpG, denoted by "+", and non-CpG, denoted by "−") with their initial state, transition, and emission probabilities. The Viterbi algorithm is used to find the most probable path of states for the following short sequence (generated from the website: http://statgen.ncsu.edu/sblse/animations/cgIsland.html):
  TCTCGCTGCCGCCAACCCTCGGCGCCGTCGGGTTCGCCGCGGCTCTGATAAGTCCCGT
TTATGGTACCCGGGCCGATCTCTGGTGGGAATCGGAGACCTGTGTACCCTGACGCATC
CGTTTGTGTTCCCTACACGGCCGACGCAGACCGGGCGCGCGGCGCCACCCAACGAAGC
CCGGGTATGGCACGTGCCCCAGGCGGTGCCCTACCCGTATTTCGGGACAAGTTCCCGG
ATCGGGTGAAAGTTAACGGAAGGATGCCAAGCAATAGCGGCCACAGGACCCGCCTGGC
GACGCATGGACTGGATCCGGAGGTCTGGCCAACAGTTGATTTCATGGGTTACAGCCCC
GGTGTAGATCCCCTCATGGTCTCCCGAACCGATTAGTTTGAAAACTGTATCTCCTGGC
CGCCTAACAGGTATAAAGAGCCGGCTCACACTGGGGTGAGGGGGCGCGTGGCCCCCTT.
    The Viterbi algorithm is implemented as  follows. (For ease of notation, the logarithm of the probability is used here.)
Figure 4 A hypothetical hidden Markov model for CpG island.	
 
 
  
  (1) Initialization

For the first symbol, T, compute $\delta_1(+) = \log \pi_+ + \log b_+(\mathrm{T})$ and set $\psi_1(+) = 0$, and compute $\delta_1(-) = \log \pi_- + \log b_-(\mathrm{T})$ and set $\psi_1(-) = 0$.

  (2) Recursion

For the second symbol, C, compute $\delta_2(+) = \max\{\delta_1(+) + \log a_{++},\ \delta_1(-) + \log a_{-+}\} + \log b_+(\mathrm{C})$ and record the state $\psi_2(+)$ achieving the maximum, and compute $\delta_2(-) = \max\{\delta_1(+) + \log a_{+-},\ \delta_1(-) + \log a_{--}\} + \log b_-(\mathrm{C})$ and record $\psi_2(-)$.

For the third symbol, T, compute $\delta_3(+)$ and $\psi_3(+)$, and $\delta_3(-)$ and $\psi_3(-)$, in the same manner.

    …

  (3) Termination

When the last symbol is reached, $P^* = \max\{\delta_{463}(+),\ \delta_{463}(-)\}$ and $q_{463}^* = \arg\max\{\delta_{463}(+),\ \delta_{463}(-)\}$.

Therefore, with the probabilities of the model in Figure 4, the probability of the most likely path is $P^*$ as computed above, and the state of this path at position 463 is "+". A tabular illustration of the Viterbi algorithm is shown in Table 1.
Table 1 The implementation of the Viterbi algorithm for the hypothetical example
 
 
  (4) Backtracing
    Finally, the best path of states can be identified as follows:
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--+++
-++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++-++++++++++++++++++---++++++++-+++
+++++---+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++--+--+++++-+-++++++++++++++++++++++++++++++++++++++++++--+--
--+-+---+-++++++++++++++++++++---+-++++++++++++++++++++++++++++++
+++++++++,
    where "+" denotes a CpG state and "−" denotes a non-CpG state in the path. There is a large proportion of CpG states in the best path, suggesting that this sequence may be a CpG island.
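To make the decoding step concrete, the sketch below applies the same log-space Viterbi recursion to a two-state CpG/non-CpG model for the beginning of the query sequence; because the numerical parameters of the model in Figure 4 are not reproduced here, the initial, transition, and emission probabilities used are hypothetical stand-ins.

```python
# Minimal sketch: Viterbi decoding of CpG ("+") vs non-CpG ("-") states with
# hypothetical model parameters (the actual values of Figure 4 are not shown here).
import numpy as np

seq = "TCTCGCTGCCGCCAACCCTCGG"             # first bases of the query sequence
sym = {b: i for i, b in enumerate("ACGT")}
obs = [sym[b] for b in seq]

pi = np.array([0.5, 0.5])                  # states: 0 = "+", 1 = "-"
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])
B = np.array([[0.15, 0.35, 0.35, 0.15],    # "+" state favors C and G
              [0.30, 0.20, 0.20, 0.30]])   # "-" state is closer to uniform

T = len(obs)
delta = np.log(pi) + np.log(B[:, obs[0]])  # initialization
psi = np.zeros((T, 2), dtype=int)
for t in range(1, T):                      # recursion
    scores = delta[:, None] + np.log(A)
    psi[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + np.log(B[:, obs[t]])

path = [int(delta.argmax())]               # termination
for t in range(T - 1, 0, -1):              # backtracking
    path.append(int(psi[t][path[-1]]))
print("".join("+-"[s] for s in path[::-1]))
```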
  The other category is called the homology-based methods for gene  discovery, also known as the comparative genome approaches or the empirical  methods. They predict potential gene structure and infer the gene function  based on the similarity and homology between a new sequence and the known  sequences assessed from a sequence alignment analysis, for which, as mentioned  in the previous subsection on sequence alignment, HMMs are an effective means. The  homology-based methods are also applicable to identification of regulatory  genes, detection of open reading frames, exploration of junk genes, and others.
  Molecular structure  prediction
  The high-level structure of a biological macromolecule, such as the DNA double helix and supercoiling (superhelicity), the RNA stem-loop, and the protein α-helix, β-sheet, β-turn, and the spatial arrangement and structure of subunit complexes, is highly related to its 3-D conformation and molecular interactions, and thus plays a very important role in its biological activities and stability. The higher order structure of macromolecules is governed by foundational principles of chemistry and physics and can be predicted from the primary structure. HMMs offer an effective solution to predict the potential higher order structure and dynamics of a biological molecule from its linear sequence.89–91 In essence, such a prediction is a pattern recognition problem that models the structural elements, such as the α-helix, β-sheet, and β-turn, as hidden data and the linear sequence of units as the observation data. A set of sequences with known high-level structure is used for HMM learning and finding the best-fit parameters. Then the trained model is used to predict the hidden states of a target sequence.
  Biological image analysis
  Images generated from modern imaging platforms, such as computed tomography, magnetic resonance imaging, and ultrasound imaging, have been accumulating at an explosive pace, representing an important category of biological data. Computerized image analysis is a research hotspot being actively explored in biology.92,93 There usually exist temporal and/or spatial dependencies among the pixels in an image or a series of images;94 for example, there will be spatial correlations between the pixels in an image and temporal correlations between the pixels in images taken at different time points. HMMs or hidden MRFs can well model this correlation structure with random noise. HMMs have been intensively applied to diverse image analyses, including image segmentation, image noise filtering, image retrieval, image classification, and image recognition.24,56,95,96 Within the HMM, the feature of a pixel, usually denoted by a vector quantization for color, greyscale, texture, and shape, is viewed as a random variable with noise, while a Markov mesh is used to model the spatial structure and conditional independencies. Hence, the HMM can well describe the statistical properties of an image and can then be used to fit digital image data. The general steps include: collecting a volume of images as training data, extracting the feature information for each pixel such as grey value and texture, defining an HMM structure such as the hidden states, implementing model learning using the training data, and then performing image segmentation, classification, and pattern recognition for query images, or feature extraction for database retrieval and search.
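As one simplified illustration of hidden-MRF-style labeling (not the specific methods cited above), the sketch below segments a noisy greyscale image with known class means using iterated conditional modes (ICM), which greedily updates each pixel's label from its Gaussian data term and a Potts smoothness prior over the 4-neighborhood.

```python
# Minimal sketch: ICM labeling of a greyscale image under a hidden-MRF-style model
# (Gaussian data term per class, Potts prior over the 4-neighborhood).
import numpy as np

def icm_segment(img, means, var=0.01, beta=1.0, n_iter=5):
    K = len(means)
    # data term: negative log-likelihood of each pixel under each class
    data = np.stack([(img - m) ** 2 / (2 * var) for m in means], axis=-1)
    labels = data.argmin(axis=-1)              # initialize by maximum likelihood
    H, W = img.shape
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                cost = data[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # Potts prior: penalize disagreement with each neighbor
                        cost += beta * (np.arange(K) != labels[ni, nj])
                labels[i, j] = cost.argmin()
    return labels

# tiny synthetic example: two noisy regions with means 0.2 and 0.8
rng = np.random.default_rng(1)
img = np.vstack([np.full((8, 16), 0.2), np.full((8, 16), 0.8)])
img = img + rng.normal(0, 0.1, img.shape)
print(icm_segment(img, means=[0.2, 0.8]))
```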
  Epidemiological prediction
  Many epidemic and etiological processes, such as infectious disease transmission and chronic disease progression, are multi-state and multi-stage Markov processes; that is, the future epidemic outbreak or disease status depends on the current state, not on the past, because pathogen dispersion in a population and pathogen multiplication and spread in the body satisfy a temporal/spatial Markov dependence. On the other hand, not all data are perfectly recorded in many cases, and thus the variables of interest may be latent; for example, the exact time of onset for a patient cannot be determined because examinations or follow-ups are conducted at limited time points, and epidemic dynamics are not available for all geographical regions and years in a survey. HMMs can well model epidemic dynamics, the progression of a chronic disease, and the geographical distribution of a disease, and are very useful for epidemiological analysis, prediction, and surveillance.97–102 The main steps include: collecting data on epidemic surveys or chronic disease status; defining hidden states, such as epidemic and non-epidemic states or healthy and disease phases (more flexibly, the number of hidden states is not necessarily predefined in an HMM); training the HMM and estimating the parameters; and using the fitted model for prediction or disease mapping. Further, putative risk factors can be introduced into an HMM by building a set of transition probabilities that depend on these risk factors, in order to identify risk factor(s) and/or improve the prediction accuracy.
  Phylogenetic tree  construction
  Similar to gene transmission from generation to generation, the molecular evolutionary course is a combination of two Markov processes: one that operates in the dimension of space (along a genome) and one that operates in the dimension of time (along the branches of a phylogenetic tree).103 Compared with traditional evolutionary models that ignore the dependencies between genetic loci, a phylogenetic HMM can express the true relationships more accurately and thereby usually gives better results.104,105 This category of models is also used for epitope discovery.106 Moreover, HMMs are applicable to handling ambiguity in evolution and comparative genomics analyses due to missing data107 and heterogeneity in evolutionary rate.108
  Text mining
  While the rapidly increasing text data, such as biological literature, medical documents, and electronic health records, provide a large amount of material for evidence-based practice and data-driven research, it is a key step to harness the massive text data and extract useful information effectively. Automatic text mining is an effective means of gaining potentially useful knowledge from a large text database.109 The HMM is a powerful approach for text mining, information retrieval, and document classification.110-112 The basic idea is that a paper or a clinical record contains several semantic tags, such as symptom, therapy, and performance, or index fields, such as the title field and author field, whose content is the phrase of interest. Extraction of information is first to determine the index field and then to grab the phrases. Such a process corresponds to an HMM: the index fields represent the combination of hidden states, and the corpus consisting of words/phrases provides the observed symbols; thus the HMM may be used for text mining. Specifically, the steps can be summarized as follows: define an HMM and choose the HMM structure, label a set of known papers or documents as the training corpora, train the model using the tagged examples, and then use the learned model to find the best label sequence for a query document and output the phrases in the desired fields, achieving the text mining.
 
Future prospect
  An organism is a multi-level and modularized intricate system composed of numerous interwoven metabolic and regulatory networks. Over the long period of evolution, functional associations and random evolutionary events have resulted in elusive molecular, physiological, metabolic, and evolutionary relationships. It is a daunting challenge for biological studies to decipher the complex biological mechanisms and crack the codes of life. Recent technological advances, including innovations in high-throughput experimental methods such as genomics, transcriptomics, epigenomics, proteomics, and metabolomics, provide powerful discovery tools for biological studies, enabling dissection of the multi-level modular networks and tracing of historical events. However, the analyses of systems biology, computational biology, and bioinformatics play an indispensable role in extracting knowledge from massive data and assembling a comprehensive map of the underlying biological system. An HMM not only models random noise but also captures the intrinsic dependencies between units, so that many biological mechanisms or phenomena can be characterized by an HMM. Recently, HMMs have demonstrated tremendous potential in genetic network analysis and the integration of multi-omics data.113–115 Thus, HMMs, as a powerful tool to tackle complicated analytical problems, are expected to have more compelling applications in biology.
 
Acknowledgments
  The author would like to thank Tingting Hou and Shouye Liu for their contributions to this study. This project was supported in part by NSF grant DMS1462990 and UAMS Research Scholar Pilot Grant Awards in Child Health G1-51898 to X.-Y.L.
 
 Conflicts of interest
  The author declares no conflict of interest in this work.
References
  
    - Basharin GP, Langville AN, Naumov VA. The life and work of A.A. Markov. Linear Algebra and its Applications. 2004;386:3–26.
 
- Shannon CE. A  Mathematical Theory of Communication. Bell  System Technical Journal. 1948;27(3):379–423.
 
- Bharucha-Reid AT. Elements of the theory  of Markov processes and their applications. Mc Graw-Hill; 1960.
 
- Grassly NC,  Fraser C. Mathematical models of infectious disease transmission. Nat Rev Microbiol. 2008;6(6):477–487.
 
- Balzter H . Markov  chain models for vegetation dynamics. Ecological  Modelling. 2000;126(2–3):139–154.
 
- Billingsley P.  Statistical methods in markov chains. The  Annals of Mathematical Statistics. 1961;32(1):12–40.
 
- Ephraim Y, Merhav  N. Hidden Markov processes. IEEE  Transactions on Information Theory. 2002;48(6):1518–1569.
 
- Vidyasagar M. Hidden markov processes:  theory and applications to biology. Princeton  University Press; 2014.
 
- Dymarski P, editor. Hidden Markov models, theory and applications. InTech.
 
- Ching WK. Markov chains: models, algorithms  and applications. 2nd ed, New York: Springer.  2013.
 
- Rabiner L, Juang B. An introduction to hidden Markov models. IEEE ASSP Magazine. 1986;3(1):4–16.
 
- Woods JW. Two-dimensional discrete Markovian fields. IEEE Transactions on Information Theory. 1972;18(2):232–240.
 
- Fornasini E. 2D Markov-Chains. Linear Algebra and Its Applications.  1990;140:101–127.
 
- Politis DN. Markov-chains in many  dimensions. Advances in Applied  Probability. 1994;26(3):756–774.
 
- Derin H, Kelly  PA. Discrete-index Markov-type random processes. Proceedings of the IEEE. 1989;77(10):1485–1510.
 
- Abend K, Harley T, Kanal L. Classification of binary random patterns. IEEE Transactions on Information Theory. 1965;11(4):538–544.
 
- Gray AJ, Kay JW,  Titterington DM. An empirical study of the simulation of various models used  for images. IEEE Transactions on Pattern  Analysis and Machine Intelligence. 1994;16(5):507–513.
 
- Lévy P. Special problem of Brownian motion, and a general theory of Gaussian random functions. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; 1956.
 
- Kindermann R, Snell JL. Markov random  fields and their applications. American  Mathematical Society Providence, RI 1; 1980.
 
- Shakya S, Santana R. Markov Networks in  Evolutionary Computation. Springer Berlin  Heidelberg; 2012.
 
- Smyth P. Belief networks, hidden Markov models, and Markov random fields: a unifying view. Pattern Recognition Letters. 1997;18(11-13):1261–1268.
 
- Jordan MI. Graphical models. Statistical Science. 2004;19(1):140–155.
 
- Kunsch H, Geman S, Kehagias A. Hidden Markov random fields. The Annals of Applied Probability. 1995;5(3):577–602.
 
- Zhang Y, Brady M,  Smith S. Segmentation of brain MR images through a hidden markov random field  model and the expectation-maximization algorithm. IEEE Trans Med Imaging. 2001;20(1):45–57.
 
- Stratonovich R. Conditional  markov processes. Theory of Probability  & Its Applications. 1960;5(2):156–178.
 
- Rabiner LR. A  Tutorial on hidden markov-models and selected applications in speech  recognition. Proceedings of the Ieee.  1989; 77(2):257–286.
 
- Viterbi A. Error bounds for  convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory. 1967;13(2):260–269.
 
- Viterbi AJ. A  personal history of the Viterbi algorithm. IEEE  Signal Processing Magazine. 2006;23(4):120–142.
 
- Forney GD. The viterbi  algorithm. Proceedings of the IEEE.  1973;61(3):268–278.
 
- Baum LE, Petrie T.  Statistical inference for probabilistic functions of finite state markov  chains. The annals of mathematical  statistics. 1966;37(6):1554-1563.
 
- Baum LE. A  maximization technique occurring in the statistical analysis of probabilistic  functions of markov chains. The Annals of  Mathematical Statistics. 1970;41(1):164–171.
 
- Dempster AP,  Laird NM, Rubin DB. Maximum likelihood from incomplete data via the em  algorithm. Journal of the Royal  Statistical Society. Series B  (Methodological). 1977;39(1):1–38.
 
- Kullback S, Leibler RA. On information and sufficiency. The Annals of Mathematical Statistics. 1951;22(1):79–86.
 
- Li J, Najmi A,  Gray RM. Image classification by a two-dimensional hidden Markov model. IEEE  Transactions on Signal Processing. 2000; 48(2):517–533.
 
- Li YJ. An analytic solution for  estimating two-dimensional hidden Markov models. Applied Mathematics and Computation. 2007; 185(2):810–822.
 
- Devijver PA. Probabilistic labeling in a hidden second order Markov mesh. In: Pattern Recognition in Practice II. Amsterdam: North Holland; 1985.
 
- Du S, Wang J, Wei Y. New learning  algorithms for third-order 2-D hidden Markov models. International Journal of Advancements in Computing Technology. 2011;3(2):104–111.
 
- Ma X, Schonfeld  D, Khokhar A. A General two-dimensional hidden markov model and its application  in image classification. in IEEE  International Conference on Image Processing (ICIP2007); 2007.
 
- Qian W, Titterington DM. Pixel labelling  for three-dimensional scenes based on Markov mesh models. Signal Processing. 1991; 22(3):313–328.
 
- Joshi D, Li J,  Wang JZ. A computationally efficient approach to the estimation of two- and  three-dimensional hidden Markov models. IEEE  Transactions on Image Processing. 2006;15(7):1871–1886.
 
- Li J, Joshi D,  Wang JZ. Stochastic modeling of volume images with a 3-D hidden Markov model. in Image Processing, ICIP '04. 2004  International Conference on; 2004.
 
- Sargin ME,  Altinok A, Rose K, et al. Conditional Iterative decoding of two dimensional  hidden markov models, in 15th IEEE International  Conference on Image Processing (ICIP 2008); 2008..
 
- Baumgartner J, Flesia AG, Gimenez J, et al. A new approach to image segmentation with two-dimensional hidden Markov models. 2013 BRICS Congress on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence; 2013.
 
- Werner S, Rigoll G. Pseudo 2-dimensional hidden Markov models in speech recognition. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU'01). IEEE; 2001.
 
- Lin HC, Wang LL,  Yang SN. Color image retrieval based on hidden Markov models. IEEE Transactions on Image Processing.  1997; 6(2):332–339.
 
- Kuo SS, Agazzi  OE. Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov  models. IEEE Transactions on Pattern  Analysis and Machine Intelligence. 1994;16(8):842–848.
 
- Nefian AV, Hayes MH III. Face recognition using embedded hidden Markov model. In: IEEE Conference on Audio and Video-based Biometric Person Authentication; 1999.
 
- Merialdo B, Jiten J, Huet B. Multi-dimensional dependency-tree hidden Markov models. In: International Conference on Acoustics, Speech, and Signal Processing; 2006.
 
- Dementhon D, Doermann D, Stuckelberg MV. Hidden Markov models for images. In: International Conference on Pattern Recognition; 2000.
 
- Merialdo B, Marchand-Maillet S, Huet B. Approximate Viterbi decoding for 2D-hidden Markov models. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00); 2000.
 
- Perronnin F, Dugelay JL, Rose K. Deformable face mapping for person identification. In: 2003 International Conference on Image Processing (ICIP 2003); 2003.
 
- Baggenstoss PM. Two-dimensional hidden Markov model for classification of continuous-valued noisy vector fields. IEEE Transactions on Aerospace and Electronic Systems. 2011;47(2):1073–1080.
 
- Koller D, Friedman N. Probabilistic  Graphical Models: Principles and Techniques. MIT Press; 2009.
 
- Brooks SP. Markov chain Monte Carlo method and its application. Journal of the Royal Statistical Society: Series D (The Statistician). 1998;47(1):69–100.
 
- Jordan MI. An introduction to variational methods for graphical models. Machine Learning. 1999;37(2):183–233.
 
- Blake A, Kohli P, Rother C. Markov Random Fields for Vision and Image Processing. MIT Press; 2011.
 
- Wang C, Paragios N. Markov random fields in vision perception: a survey. Rapport de recherche; 2012.
 
- Churchill GA. Stochastic  models for heterogeneous DNA sequences. Bull  Math Biol. 1989;51(1):79–94.              
 
- Choo KH, Tong JC,  Zhang L. Recent applications of Hidden Markov Models in computational biology. Genomics Proteomics Bioinformatics.  2004;2(2):84–96.
 
- Koski T. Hidden Markov models for bioinformatics. Springer Netherlands; 2001.
 
- Elston RC,  Stewart J. A general model for the genetic analysis of pedigree data. Hum Hered. 1971;21(6): 523–542.
 
- Cannings C,  Thompson EA, Skolnick MH. Probability functions on complex pedigrees. Advances in Applied Probability. 1978;  10(1):26–61.
 
- Lander ES, Green P. Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA. 1987;84(8):2363–2367.
 
- Cottingham RW, Idury RM, Schaffer AA. Faster sequential genetic linkage computations. American Journal of Human Genetics. 1993;53(1):252–263.
 
- Schaffer AA. Faster linkage analysis computations for pedigrees with loops or unused alleles. Hum Hered. 1996;46(4):226–235.
 
- Schaffer AA. Avoiding  recomputation in linkage analysis. Hum  Hered. 1994;44(4):225–237.
 
- O'Connell JR,  Weeks DE. The VITESSE algorithm for rapid exact multilocus linkage analysis via  genotype set-recoding and fuzzy inheritance. Nat Genet. 1995;11(4):402–408.
 
- Kruglyak L. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet. 1996;58(6):1347–1363.
 
- Abecasis GR. Merlin-rapid  analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002;30(1):97–101.
 
- Gudbjartsson DF. Allegro  version 2. Nat Genet. 2005;37(10):1015–1016.
 
- Gudbjartsson DF. Allegro, a new computer program for multipoint linkage analysis. Nat Genet. 2000;25(1):12–13.
 
- Gentleman JF, Mullin RC. The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability. Biometrics. 1989;45(1):35–52.
 
- Krogh A, Brown M, Mian IS, et al. Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology. 1994;235(5):1501–1531.
 
- Karlin S. New  approaches for computer analysis of nucleic acid sequences. Proc Natl Acad Sci USA. 1983;80(18):5660–5664.
 
- Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci. 1996;12(2):95–107.
 
- Yoon BJ. Hidden Markov models and their applications in biological sequence analysis. Curr Genomics. 2009;10(6):402–415.
 
- Eddy SR. Hidden Markov models. Curr Opin Struct Biol. 1996;6(3):361–365.
 
- Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. 1970;48(3):443–453.
 
- Smith TF,  Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology.  1981;147(1):195–197.
 
- Feng DF,  Doolittle RF. Progressive sequence alignment as a prerequisite to correct  phylogenetic trees. J Mol Evol. 1987;25(4):351–360.
 
- Gribskov M,  McLachlan AD, Eisenberg D. Profile analysis: detection of distantly related  proteins. Proc Natl Acad Sci USA.  1987; 84(13):4355–4358.
 
- He M, Petoukhov S. Mathematics of bioinformatics: theory, methods and applications. Wiley; 2011.
 
- Mount DW. Using hidden Markov models to align multiple sequences. Cold Spring Harb Protoc. 2009;(7):41.
 
- Krogh A, Mian IS, Haussler D. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 1994;22(22):4768–4778.
 
- Lukashin AV,  Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998;26(4):1107–1115.
 
- Burge C, Karlin S.  Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997;268(1):78–94.
 
- Pedersen JS, Hein J. Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics. 2003;19(2):219–227.
 
- Bang H, et al., editors. Statistical methods in molecular biology. 1st ed. Methods in Molecular Biology (Walker JM, series editor). Humana Press; 2010. 636 p.
 
- Churchill GA. Hidden Markov chains and  the analysis of genome structure. Computers  & Chemistry. 1992;16(2):107–115.
 
- Goldman N, Thorne  JL, Jones DT. Using evolutionary trees in protein secondary structure  prediction and other comparative sequence analyses. J Mol Biol. 1996;263(2):196–208.
 
- Won KJ. An  evolutionary method for learning HMM structure: prediction of protein secondary  structure. BMC Bioinformatics. 2007;  8: 357.
 
- Roeder AH. A  computational image analysis glossary for biologists. Development. 2012;139(17):3071–3080.
 
- Rittscher J.  Characterization of biological processes through automated image analysis. Annu Rev Biomed Eng. 2010;12:315–344.
 
- Wang Y. Analysis  of spatio-temporal brain imaging patterns by Hidden Markov Models and serial  MRI images. Hum Brain Mapp. 2014; 35(9):4777–4794.
 
- Li SZ. Markov random field modeling in  computer vision. Springer Japan; 2012.
 
- Li J, Gray RM. Image segmentation and compression using hidden Markov models. Springer US; 2000.
 
- Le Strat Y, Carrat F. Monitoring epidemiologic surveillance data using hidden Markov models. Stat Med. 1999;18(24):3463–3478.
 
- Cooper B,  Lipsitch M. The analysis of hospital infection data using hidden Markov models. Biostatistics. 2004;5(2):223–237.
 
- Watkins RE. Disease  surveillance using a hidden Markov model. BMC  Med Inform Decis Mak. 2009;9:39.
 
- Green PJ, Richardson S. Hidden Markov models and disease mapping. Journal of the American Statistical Association. 2002;97(460):1055–1070.
 
- Jackson CH. Multistate Markov models for disease progression with classification error. Journal of the Royal Statistical Society: Series D (The Statistician). 2003;52(2):193–209.
 
- Cook RJ, Lawless JF. Statistical issues in modeling chronic disease in cohort studies. Statistics in Biosciences. 2014;6(1):127–161.
 
- Nielsen R, editor. Statistical Methods in Molecular Evolution. Springer; 2010.
 
- Siepel A,  Haussler D. Combining phylogenetic and hidden Markov models in biosequence  analysis. Journal of Computational  Biology. 2004;11(2-3):413–428.
 
- Husmeier D. Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models. Bioinformatics. 2005;21(Suppl 2):ii166–ii172.
 
- Lacerda M,  Scheffler K, Seoighe C. Epitope discovery with phylogenetic hidden Markov  models. Mol Biol Evol. 2010;27(5):1212–1220.
 
- Bykova NA, Favorov AV, Mironov AA. Hidden Markov models for evolution and comparative genomics analysis. PLoS One. 2013;8(6):e65012.
 
- Felsenstein J,  Churchill GA. A Hidden Markov Model approach to variation among sites in rate  of evolution. Mol Biol Evol. 1996; 13(1):93–104.
 
- Aggarwal CC, Zhai CX. Mining text data. New York: Springer; 2012.
 
- Jang H, Song SK, Myaeng SH. Text mining for medical documents using a hidden Markov model. In: Ng HT, et al., editors. Information Retrieval Technology: Third Asia Information Retrieval Symposium, AIRS 2006, Singapore, October 16–18, 2006, Proceedings. Berlin, Heidelberg: Springer; 2006:553–559.
 
- Yi K, Beheshti J. A hidden Markov  model-based text classification of medical documents. J Inf Sci. 2009;35(1):67–81.
 
- Mooney RJ, Bunescu R. Mining  knowledge from text using information extraction. SIGKDD Explor Newsl. 2005;7(1):3–10.
 
- Wei P, Pan W. Network-based genomic discovery: application and comparison of Markov random-field models. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2010;59(1):105–125.
 
- Rider AK, Chawla NV, Emrich SJ. A survey of current integrative network algorithms for systems biology. In: Prokop A, Csukás B, editors. Systems Biology: Integrative Biology and Simulation Tools. Dordrecht: Springer Netherlands; 2013:479–495.
 
- Wei Z, Li H. A hidden spatial-temporal Markov random field model for network-based analysis of time course gene expression data. Annals of Applied Statistics. 2008;2(1):408–429.
 
  
 
 
  
©2017 Lou. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon the work non-commercially.