DNA, Words and Models: Statistics of Exceptional Words

An important problem in computational biology is identifying short DNA sequences (mathematically, 'words') associated with a biological function. One approach is to determine whether a particular word occurs simply by chance or is statistically significant, for example because of its frequency or location. This book introduces the mathematical and statistical ideas used in solving this so-called exceptional word problem. It begins with a detailed description of the principal models used in sequence analysis: Markovian models are central here and capture compositional information about the sequence being analysed. There follows an introduction to several statistical methods used for finding words that are exceptional with respect to the chosen model. The second half of the book is illustrated with numerous examples drawn from the analysis of bacterial genomes, making this a practical guide for users who face a real situation and need to choose an adequate procedure.
Contents
Introduction to Markov chain models
Overrepresentation of Chi sites in E. coli and H. influenzae