Statistical Modeling
Prof. William H. Press
CAM 397: Introduction to Mathematical Modeling
11/3/08 and 11/5/08
What is a statistical model, as distinct from other kinds of models?
Models take inputs, turn some algorithmic crank, and produce (predict) outputs:
- (initial conditions) + (dynamical equations, boundary conditions) = (time evolution)
- (geometrical description, materials properties, loading forces) + (laws of solid mechanics) = (stress/strain/failure predictions)
- (incomplete or probabilistic data) + (underlying statistical model) = (probabilistic values of hidden parameters)
Everybody's favorite first example: repeated measurements, with errors, of a quantity.
The $x_i$ are repeated measurements of some underlying true value $\mu$, with measurement errors (standard deviations) $\sigma_i$.
The best (in some magical way) estimate is the inverse-variance weighted mean
$$\hat\mu = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$$
The error of that estimate is
$$\sigma_{\hat\mu} = \left(\sum_i 1/\sigma_i^2\right)^{-1/2}$$
and if you're fancy, goodness-of-fit:
$$\chi^2 = \sum_i \frac{(x_i-\hat\mu)^2}{\sigma_i^2}, \qquad \text{s.b.} \approx N-1$$
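A minimal C++ sketch of these formulas; the measurement values here are made up purely for illustration:

#include <cstdio>
#include <cmath>
#include <vector>

int main() {
    // hypothetical measurements x_i with claimed standard deviations sigma_i
    std::vector<double> x     {10.2, 9.8, 10.5, 10.1};
    std::vector<double> sigma { 0.3, 0.2,  0.4,  0.3};
    double sw = 0., swx = 0.;                 // accumulate 1/sigma^2 and x/sigma^2
    for (size_t i = 0; i < x.size(); i++) {
        double w = 1./(sigma[i]*sigma[i]);
        sw  += w;
        swx += w*x[i];
    }
    double muhat = swx/sw;                    // inverse-variance weighted mean
    double err   = std::sqrt(1./sw);          // its standard error
    double chisq = 0.;                        // goodness-of-fit, s.b. ~ N-1
    for (size_t i = 0; i < x.size(); i++)
        chisq += (x[i]-muhat)*(x[i]-muhat)/(sigma[i]*sigma[i]);
    std::printf("muhat = %.4f +/- %.4f, chisq = %.2f (N-1 = %zu)\n",
                muhat, err, chisq, x.size()-1);
    return 0;
}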
What is really going on is that there is a stochastic process underlying the experiment:
true (hidden) parameters -> process -> observed data, with stochasticity ("noise") entering along the way.
So a statistical model is an idealization or simplification of the stochastic process
- sometimes a rather distant idealization, as, e.g., finite element equations are rather distant idealizations of the underlying molecular dynamics
With the additional complication that most statistical models are interesting only as inverse problems
- get hidden parameters from observed data
- inverse problems occur in other kinds of models, but forward modeling is more typical
In everybody's favorite example of repeated measurements, the statistical model is additive Gaussian noise.
And the inverse problem turns out to be trivial: the Bayesian estimate of the parameter is also a Gaussian.
Or, if you must, you can view it as an ML estimate, with error from the likelihood ratio theorem and the Fisher information matrix.
I would normally deprecate the latter perspective but for its access to goodness-of-fit information (take my course for more on this).
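A one-line version of why the posterior is Gaussian (assuming a uniform prior on $\mu$): the product of the Gaussian likelihood factors is itself a Gaussian in $\mu$,
$$p(\mu \mid \{x_i\}) \;\propto\; \prod_i \exp\!\left[-\frac{(x_i-\mu)^2}{2\sigma_i^2}\right]
\;\propto\; \exp\!\left[-\frac{(\mu-\hat\mu)^2}{2\sigma_{\hat\mu}^2}\right],
\qquad
\hat\mu = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2},\quad
\sigma_{\hat\mu}^2 = \frac{1}{\sum_i 1/\sigma_i^2}$$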
Let's show a nontrivial generalization for an historically (1996) interesting example.
The Hubble constant H_0 is the rate of expansion of the Universe.
Measurement by classical astronomy is very difficult:
- each measurement a multi-year project
- calibration issues
Between 1930 and 2000, credible measurements ranged from 30 to 120 (km/s/Mpc), many with claimed small errors.
The consensus view was "we just don't know H_0." Or was it just failure to apply an adequate statistical model to the existing data?
Grim observational situation: not the range of values, but the inconsistency of the claimed errors. This forbids any kind of "just average", because goodness-of-fit rejects the possibility that these experiments are measuring the same value.
Let's model, using Bayes' law, the idea that some experiments are right, some are wrong, and we don't know which are which.
Let $p$ be the probability that an experiment is right. Now, expand out the prior over $v$, a bit vector of which experiments are right or wrong, e.g. $v = (1,0,0,1,1,0,1,\ldots)$, with prior
$$P(v \mid p) = p^{\#(v=1)}\,(1-p)^{\#(v=0)}$$
If an experiment is wrong, its value is taken as uniform over any big enough range of values. Now you stare at this a while and realize that the sum over $v$ is just a multinomial expansion, one factor per experiment:
$$P(H_0 \mid \text{data}) \;\propto\; \int dp \prod_i \left[\, p\, \mathcal{N}(x_i \mid H_0, \sigma_i) + (1-p)\, U(x_i)\,\right]$$
where $U$ is the uniform density over the allowed range.
(This is called a mixture model. Probably we should have guessed!)
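A minimal C++ sketch of evaluating this posterior on a grid; the measurement values and claimed errors below are made-up stand-ins, not the actual 1996 compilation:

#include <cstdio>
#include <cmath>
#include <vector>

int main() {
    // hypothetical H0 measurements (km/s/Mpc) and claimed errors
    std::vector<double> x  {55., 72., 48., 80., 68., 95., 73.};
    std::vector<double> sig{ 3.,  5.,  4.,  8.,  6., 10.,  7.};
    const double lo = 30., hi = 120.;          // "any big enough" range of values
    const double U = 1./(hi - lo);             // uniform density for a wrong experiment
    const double PI = 3.141592653589793;
    const int NH = 181, NP = 51;               // grid sizes for H0 and p
    std::vector<double> post(NH, 0.);
    for (int ih = 0; ih < NH; ih++) {
        double H0 = lo + (hi - lo)*ih/(NH - 1);
        for (int ip = 1; ip < NP; ip++) {      // integrate over p in (0,1)
            double p = double(ip)/NP;
            double prod = 1.;
            for (size_t i = 0; i < x.size(); i++) {
                double g = std::exp(-0.5*std::pow((x[i]-H0)/sig[i], 2))
                           / (sig[i]*std::sqrt(2.*PI));
                prod *= p*g + (1.-p)*U;        // mixture: right with prob p, wrong with 1-p
            }
            post[ih] += prod;                  // sum over the p grid ~ integral dp
        }
    }
    double norm = 0.;
    for (double v : post) norm += v;
    for (int ih = 0; ih < NH; ih += 20)        // print a coarse sampling of the posterior
        std::printf("H0 = %6.1f  posterior ~ %.4f\n",
                    lo + (hi - lo)*ih/(NH - 1), post[ih]/norm);
    return 0;
}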
And the answer is: [posterior distribution for H_0, compared with the Wilkinson Microwave Anisotropy Probe satellite 5-year results (2008), WMAP + supernovae].
This is not a Gaussian; it's just whatever shape came out of the data.
It's not even necessarily unimodal (although it is for this data).
If you leave out some of the middle-value experiments, it splits to be bimodal, thus showing that this method is not tail-trimming.
If you don't sum over the v's but only integrate over the p's, you get the probability that each measurement is correct.
If you care, you can also get the probability distribution of p (probability of a measurement being correct a priori) (this is of course not universal, but depends on the field and its current state)
Let's look at a statistical model very different from additive Gaussian noise: a Markov process.
- Directed graph; may (usually does) have loops.
- Discrete time steps.
- Each time step, the state advances with probabilities labeled on the outgoing edges (self-loops also OK).
- "Markov" because there is no memory: the process knows only what state it is in now.
- Markov models are especially important because there exists a fast algorithm for parsing their state from observed data: so-called Hidden Markov Models (HMMs).
A (right) stochastic matrix has non-negative entries with rows summing to 1:
$$A_{ij} = P(\text{from state } i \text{ to state } j), \qquad \sum_j A_{ij} = 1$$
A population vector over the states then advances by
$$\mathbf{p}_{t+1} = A^T \mathbf{p}_t$$
(transpose, because of the way from/to are defined).
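A tiny C++ illustration of advancing a population vector; the 3-state matrix here is an assumed toy example:

#include <cstdio>
#include <array>

int main() {
    // an assumed 3-state right stochastic matrix: rows are "from", columns are "to"
    double A[3][3] = {{0.9, 0.1, 0.0},
                      {0.0, 0.8, 0.2},
                      {1.0, 0.0, 0.0}};
    std::array<double,3> p {1., 0., 0.};       // start with all population in state 0
    for (int t = 0; t < 50; t++) {
        std::array<double,3> q {0., 0., 0.};
        for (int i = 0; i < 3; i++)            // p_{t+1} = A^T p_t
            for (int j = 0; j < 3; j++)
                q[j] += A[i][j]*p[i];
        p = q;
    }
    std::printf("after 50 steps: %.4f %.4f %.4f\n", p[0], p[1], p[2]);
    return 0;
}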
Note the two different ways of drawing the same Markov model:
- a directed graph with loops
- adding the time dimension: a directed graph with no loops (can't go backward in time)
In a Hidden Markov Model, we don't get to observe the states; instead we see a symbol that each state probabilistically emits when it is entered:
sequence of (hidden) states -> sequence of (observed) symbols.
What can we say about the sequence of states, given a sequence of observations?
The Forward-Backward algorithm. Let's try to estimate the probability of being in a certain state at a certain time.
Define $\alpha_t(i)$ as the probability of state $i$ at time $t$ given (only) the data up to and including $t$ (the forward estimate). Written out, it is a huge sum over all possible paths ending in state $i$, each term the likelihood (or Bayes probability with uniform prior) of that exact path and the exact observed data:
$$\alpha_t(i) \;\propto\; \sum_{\text{paths ending in } i} P(\text{that exact path and the exact observed data})$$
As written, this is computationally infeasible. But it satisfies an easy recurrence:
$$\alpha_{t+1}(j) \;\propto\; b_j(y_{t+1}) \sum_i \alpha_t(i)\, A_{ij}$$
where $b_j(y)$ is the probability that state $j$ emits symbol $y$.
Define $\beta_t(i)$ as the probability of state $i$ at time $t$ given (only) the data to the future of $t$ (the backward estimate). Now, there is a backward recurrence:
$$\beta_t(i) \;\propto\; \sum_j A_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j), \qquad \beta_{N-1}(i) = \text{uniform (no data to the future of } N-1\text{)}$$
And the grand estimate using all the data is (the "forward-backward algorithm"):
$$P_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{L}, \qquad L = \sum_i \alpha_t(i)\,\beta_t(i)$$
$L$ is the likelihood (or Bayes probability) of the data. Actually, it's independent of $t$! You could use its numerical value to compare different models.
Worried about multiplying the $\alpha$'s and $\beta$'s as independent probabilities? The Markov property guarantees that the past and future data are conditionally independent given state $i$ at time $t$.
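A compact C++ sketch of the forward-backward recurrences for a small HMM. The transition and emission matrices are assumed toy values; a real implementation would also rescale the alphas and betas to avoid underflow on long sequences:

#include <cstdio>
#include <vector>

int main() {
    const int M = 2;                            // number of hidden states
    // assumed toy model: transitions A[i][j] and emission probs b[state][symbol]
    double A[M][M] = {{0.95, 0.05},
                      {0.10, 0.90}};
    double b[M][2] = {{0.5, 0.5},               // state 0 emits symbols 0/1 equally
                      {0.9, 0.1}};              // state 1 strongly prefers symbol 0
    std::vector<int> y {0, 1, 1, 0, 0, 0, 1, 0};// observed symbol sequence
    int N = y.size();
    std::vector<std::vector<double>> alpha(N, std::vector<double>(M)),
                                     beta (N, std::vector<double>(M));
    // forward pass: alpha[t][i] ~ P(data up to t, state i at t), uniform start
    for (int i = 0; i < M; i++) alpha[0][i] = b[i][y[0]]/M;
    for (int t = 1; t < N; t++)
        for (int j = 0; j < M; j++) {
            double s = 0.;
            for (int i = 0; i < M; i++) s += alpha[t-1][i]*A[i][j];
            alpha[t][j] = s*b[j][y[t]];
        }
    // backward pass: beta[t][i] ~ P(data after t | state i at t), uniform at the end
    for (int i = 0; i < M; i++) beta[N-1][i] = 1.;
    for (int t = N-2; t >= 0; t--)
        for (int i = 0; i < M; i++) {
            double s = 0.;
            for (int j = 0; j < M; j++) s += A[i][j]*b[j][y[t+1]]*beta[t+1][j];
            beta[t][i] = s;
        }
    // combine: P_t(i) = alpha*beta / L, where L (the data likelihood) is t-independent
    for (int t = 0; t < N; t++) {
        double L = 0.;
        for (int i = 0; i < M; i++) L += alpha[t][i]*beta[t][i];
        std::printf("t=%d  P(state 0)=%.3f  P(state 1)=%.3f\n",
                    t, alpha[t][0]*beta[t][0]/L, alpha[t][1]*beta[t][1]/L);
    }
    return 0;
}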
Let's work a biologically motivated example.
In the galaxy Zyzyx, the Qiqiqi lifeform has a genome consisting of a linear sequence of amino acids, each chosen from 26 chemical possibilities, denoted A-Z.
- Genes alternate with intergenic regions.
- In intergenic regions, all of A-Z are equiprobable.
- In genes, the vowels AEIOU are more frequent.
- Genes always end with Z.
- The length distributions of genes and intergenic regions are known (have been measured).
Can we find the genes?
(On Earth, it's 20 amino acids, with the additional complication of a genetic code mapping codons of three bases each, from the alphabet ACGT, into one a.a. Our example thus simplifies by having no ambiguity of reading frame, and also no ambiguity of strand.)
Example data, with genes and intergenic regions distinguished (by color in the original slide); a sketch of generating such data follows below:
qnkekxkdlscovjfehvesmdeelnzlzjeknvjgetyuhgvxlvjnvqlmcermojkrtuczg
rbmpwrjtynonxveblrjuqiydehpzujdogaensduoermiadaaustihpialkxicilgk
tottxxwawjvenowzsuacnppiharwpqviuammkpzwwjboofvmrjwrtmzmcxdkclvky
vkizmckmpvwfoorbvvrnvuzfwszqithlkubjruoyyxgwvfgxzlzbkuwmkmzgmnsyb
(pvowel = 0.45)
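Since the generative rules are fully specified, it may help to see one way such a toy genome could be produced. This is only a sketch under assumed parameters: the mean lengths and the gamma shape are my choices, not values given in the slides, and the gene-symbol model (vowel with probability 0.45, otherwise any letter) is a rough stand-in:

#include <algorithm>
#include <cstdio>
#include <random>
#include <string>

int main() {
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> u(0., 1.);
    // assumed gamma-law length distributions (shape, scale); means ~166 and ~33
    std::gamma_distribution<double> lenInter(3., 166./3.), lenGene(3., 33./3.);
    const std::string vowels = "aeiou", letters = "abcdefghijklmnopqrstuvwxyz";
    std::string genome;
    auto geneChar = [&]() {                      // gene symbol: vowel with prob ~0.45
        if (u(rng) < 0.45) return vowels[rng() % 5];
        return letters[rng() % 26];              // otherwise any letter (rough sketch)
    };
    for (int block = 0; block < 4; block++) {
        int nI = std::max(1, (int)lenInter(rng));
        for (int k = 0; k < nI; k++) genome += letters[rng() % 26];   // intergene: uniform
        int nG = std::max(2, (int)lenGene(rng));
        for (int k = 0; k < nG-1; k++) genome += geneChar();
        genome += 'z';                           // genes always end with Z
    }
    std::printf("%s\n", genome.c_str());
    return 0;
}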
If we know the rules and (rough) probabilities, we can model directly. The forward-backward results on the previous data are:
[Figure: posterior probabilities of state G and state Z along the sequence. Annotations: the actual gene start; "the right z"; a chance excess of vowels is enough to make it not completely sure!; another z.]
But we can also learn the rules by Bayesian re-estimation of the transition and output matrices (Baum-Welch re-estimation).
Given the data, we can re-estimate A as an average over the data:
$$\hat A_{ij} = \frac{\sum_t \alpha_t(i)\, A_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)}{\sum_t \alpha_t(i)\, \beta_t(i)}
= \frac{\text{expected number of } i \to j \text{ transitions}}{\text{expected number of } i \text{ states}}$$
(the backward recurrence says that the numerator summed over $j$ equals the denominator, so the re-estimated rows still sum to 1; note that $L$ cancels between numerator and denominator).
Similarly, re-estimate b:
$$\hat b_i(k) = \frac{\sum_{t:\, y_t = k} \alpha_t(i)\, \beta_t(i)}{\sum_t \alpha_t(i)\, \beta_t(i)}
= \frac{\text{expected number of } i \text{ states emitting } k}{\text{expected number of } i \text{ states}}$$
Hatted A and b are improved estimates of the hidden parameters. With them, you can go back and re-estimate alpha and beta. And so forth. (A sketch of one re-estimation sweep follows below.)
This is a special case of what is called an EM method. One can prove that Baum-Welch re-estimation always increases the overall likelihood L, iteratively to a (possibly only local) maximum.
Notice that re-estimation doesn't require any additional information, or any training data. It is pure unsupervised learning.
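A C++ sketch of one Baum-Welch sweep, written to plug into the forward-backward sketch above; the vector-of-vectors types here differ slightly from the fixed-size arrays used there, so adapt as needed:

#include <vector>
using Mat = std::vector<std::vector<double>>;

// One Baum-Welch re-estimation sweep: given alpha, beta (as in the forward-backward
// sketch), the current A and b, and the observed symbols y, return improved
// estimates Ahat and bhat from expected transition and emission counts.
void baumWelchSweep(const Mat& alpha, const Mat& beta,
                    const Mat& A, const Mat& b,
                    const std::vector<int>& y,
                    Mat& Ahat, Mat& bhat) {
    int N = y.size(), M = A.size(), K = b[0].size();
    Ahat.assign(M, std::vector<double>(M, 0.));
    bhat.assign(M, std::vector<double>(K, 0.));
    std::vector<double> occ(M, 0.);                   // expected number of i states
    for (int t = 0; t < N; t++)
        for (int i = 0; i < M; i++) {
            double g = alpha[t][i]*beta[t][i];        // ~ P_t(i) up to the constant L
            occ[i]        += g;
            bhat[i][y[t]] += g;                       // expected emissions of symbol y[t]
            if (t < N-1)
                for (int j = 0; j < M; j++)           // expected i -> j transitions
                    Ahat[i][j] += alpha[t][i]*A[i][j]*b[j][y[t+1]]*beta[t+1][j];
        }
    for (int i = 0; i < M; i++) {                     // normalize (the L's cancel)
        double rowA = 0.;
        for (int j = 0; j < M; j++) rowA += Ahat[i][j];
        for (int j = 0; j < M; j++) Ahat[i][j] /= rowA;   // i->j transitions / i states
        for (int k = 0; k < K; k++) bhat[i][k] /= occ[i]; // i emitting k / i states
    }
}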
[Figures: forward-backward parsing before (previous result) and after re-estimation (data size N = 10^5); and how the log-likelihood increases with iteration number.]
Parsing (forward-backward) can work on even small fragments, but re-estimation takes a lot of data.
On many problems, re-estimation can hill-climb to the right answer from amazingly crude initial guesses:

a[0][0] = 1.-1./100.;   // genes and intergenes alternate and are each about 100 long
a[0][1] = 1.-a[0][0];
a[1][1] = 1.-1./100.;
a[1][2] = 1.-a[1][1];
a[2][0] = 1.;           // there is a one-symbol gene-end marker
for (i=0;i<26;i++) {    // but we don't know anything about which symbols are preferred
    b[0][i] = 1./26.;   // in genes, end-genes, or intergenes
    b[1][i] = 1./26.;
    b[2][i] = 1./26.;
}

[Figure: log-likelihood and accuracy vs. iteration number (in this example we know the right answers!). The log-likelihood increases monotonically; a period of stagnation is not unusual.]
Final estimates of the transition and symbol probability matrices:

Transition probabilities (columns: from state; each column sums to 1):
            state I   state G   state Z
to state I  0.99397   0.00000   1.00000
to state G  0.00603   0.96991   0.00000
to state Z  0.00000   0.03009   0.00000

1/0.00603 = 166 and 1/0.03009 = 33.2 (the implied mean residence times) -- why these values?

Symbol probabilities b (columns: state I, state G, state Z):
A 0.03746 0.09039 0.00245
E 0.03935 0.09196 0.00218
I 0.03684 0.08903 0.00266
O 0.03875 0.08992 0.00109
U 0.03740 0.09214 0.00144
B 0.03772 0.02590 0.00096
C 0.03891 0.02716 0.00686
D 0.03945 0.02792 0.00140
F 0.03862 0.02515 0.00037
G 0.03888 0.02505 0.00057
H 0.03884 0.02874 0.00116
J 0.03652 0.02926 0.00188
K 0.03838 0.02777 0.00069
L 0.03836 0.02673 0.00113
M 0.03823 0.02822 0.00035
N 0.03885 0.02639 0.00005
P 0.03888 0.02677 0.00493
Q 0.03880 0.02743 0.00572
R 0.04055 0.02844 0.00115
S 0.03933 0.02862 0.00769
T 0.03923 0.02393 0.00053
V 0.03924 0.02772 0.00057
W 0.03862 0.02826 0.00308
X 0.03820 0.02945 0.00039
Y 0.03871 0.02759 0.00045
Z 0.03586 0.00006 0.95026

Yes, it discovered all the vowels. It discovered Z, but didn't quite figure out that state 3 (the Z state) always emits Z.
An obvious flaw in the model: self-loops in Markov models must always give (a discrete approximation of) exponentially distributed residence times; the waiting time to an exit event is like the waiting time to an event in a Poisson process, which is exponentially distributed.
But Qiqiqi genes and intergenes are roughly gamma-law distributed in length. In fact, they're exactly gamma-law, because that's how I constructed the genome (not from a Markov model!).
So the model really is a model, not a representation of the actual physics. This is usually true in statistical modeling.
Can we make the results more accurate by somehow incorporating length information?
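To spell out the first point: if a state has self-loop probability $s$ (so exit probability $1-s$ per step), its residence time is geometric, the discrete analog of an exponential,
$$P(\tau = k) = s^{\,k-1}(1-s), \qquad \langle\tau\rangle = \frac{1}{1-s},$$
which is how the values 1/0.00603 = 166 and 1/0.03009 = 33.2 above arise.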
Generalized Hidden Markov Model (GHMM), also called Hidden Semi-Markov Model (HSMM):
- the idea is to impose (or learn by re-estimation) an arbitrary probability distribution for the residency time tau in each state
- can be thought of as an ordinary HMM where every state gets expanded into a "timer" cluster of states
- output symbol probabilities are identical for all states in a timer (equal to those of the state before it was expanded)
Two useful timer clusters:
- a chain of $n$ states, each either exiting the cluster with probability $p_i$ or passing to the next timer state, gives an arbitrary distribution with $\tau \le n$:
$$P(\tau) = [\,p_1;\;(1-p_1)p_2;\;(1-p_1)(1-p_2)p_3;\;\ldots\,]$$
- a chain of $n$ identical states, each with a self-loop and exit probability $p$, gives a (discrete) Gamma-law distribution:
$$P(\tau) \approx \mathrm{Gamma}(n;p), \qquad \langle\tau\rangle = \frac{n}{p}$$
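For the second cluster, the exact discrete form is the negative binomial (the sum of $n$ independent geometric waits), which tends to the continuous $\mathrm{Gamma}(n)$ shape as $n$ grows:
$$P(\tau = k) = \binom{k-1}{n-1}\, p^{\,n}\, (1-p)^{\,k-n}, \qquad k = n, n+1, \ldots, \qquad \langle\tau\rangle = \frac{n}{p}$$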
So, our intergene-gene-z example becomes:
So how well do we do? Accuracy with respect to genes is shown as a per-symbol confusion table laid out as
TP FP
FN TN
with sensitivity = TP/(TP+FN) and specificity = TN/(FP+TN):

n0 = n1 = 1 (previous HMM):  accuracy = 0.9629,  table = [0.1466 0.0227; 0.0144 0.8163]
n0 = 2, n1 = 5:              accuracy = 0.9690,  table = [0.1498 0.0195; 0.0115 0.8192]
n0 = 3, n1 = 8:              accuracy = 0.9726,  table = [0.1518 0.0175; 0.0099 0.8208]

(The sensitivity/specificity arithmetic for the last row is worked below.)
Typically, for this example, the error is starting a gene ~5 symbols too late or ~3 too early.
For whole genes (length ~50), the sensitivity and specificity are basically 1.0000, because, with pvowel = 0.45, the gene is highly statistically significant. What the HMM or GHMM does well is to call the boundaries as exactly as possible.
Obviously there's a theoretical bound on the achievable accuracy, given that the exact sequence of an FP or FN might also occur as a TP or TN. Can you calculate or estimate the bound?
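For the n0 = 3, n1 = 8 row, for example, the definitions above give
$$\text{sensitivity} = \frac{0.1518}{0.1518+0.0099} \approx 0.939, \qquad
\text{specificity} = \frac{0.8208}{0.0175+0.8208} \approx 0.979, \qquad
\text{accuracy} = 0.1518+0.8208 = 0.9726$$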
Summary remarks
Like other kinds of modeling, statistical modeling has a bunch of standard tools from which you can construct or invert models:
- Gaussian mixture models
- hidden Markov models
- hierarchical models
- Gaussian process regression (a.k.a. linear prediction or kriging)
- various log-odds things, including logistic regression
- Markov-chain Monte Carlo
- neural networks
- various EM variants
- wavelet smoothing and other function bases
- etc., etc., etc.
But, also like other modeling, new problems can require you to invent new tools.
Statistical modeling and machine learning are overlapping fields, though with different emphases.
Recommended books:
- Gelman, Carlin, Stern, and Rubin, Bayesian Data Analysis, 2nd ed.
- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning
- (and of course) Numerical Recipes, 3rd ed. (esp. chapters 14, 15, and 16), available free from ICES subnets at http://nrbook.com/institutional