Statistical Modeling. Prof. William H. Press. CAM 397: Introduction to Mathematical Modeling. 11/3/08 and 11/5/08.


What is a statistical model, as distinct from other kinds of models? Models take inputs, turn some algorithmic crank, and produce ("predict") outputs:
(initial conditions) + (dynamical equations, boundary conditions) = (time evolution)
(geometrical description, materials properties, loading forces) + (laws of solid mechanics) = (stress/strain/failure predictions)
(incomplete or probabilistic data) + (underlying statistical model) = (probabilistic values of hidden parameters)

Everybody's favorite first example: repeated measurements, with errors, of a quantity. The x_i are repeated measurements of some underlying true value, with measurement errors ("standard deviations") σ_i. The best (in some magical way) estimate is the inverse-variance weighted mean
$$\hat x = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2},$$
the error of that estimate is
$$\sigma_{\hat x} = \Big(\sum_i 1/\sigma_i^2\Big)^{-1/2},$$
and, if you're fancy, the goodness-of-fit
$$\chi^2 = \sum_i \frac{(x_i - \hat x)^2}{\sigma_i^2}$$
should be approximately N - 1.
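
Here is a minimal numerical sketch of those three formulas. The measurement values and error bars below are made up for illustration, not from any real data set:

/* Minimal sketch: inverse-variance weighted average of repeated
   measurements x[i] with standard deviations sig[i], plus the
   chi-square goodness-of-fit (values are illustrative only). */
#include <stdio.h>
#include <math.h>

int main(void) {
    double x[]   = {10.1, 9.8, 10.4, 10.0};   /* hypothetical measurements */
    double sig[] = {0.2, 0.3, 0.4, 0.25};     /* their claimed std. deviations */
    int i, n = 4;
    double sw = 0.0, swx = 0.0;
    for (i = 0; i < n; i++) {                 /* accumulate weights 1/sigma^2 */
        double w = 1.0 / (sig[i] * sig[i]);
        sw  += w;
        swx += w * x[i];
    }
    double xhat = swx / sw;                   /* best estimate (weighted mean) */
    double err  = 1.0 / sqrt(sw);             /* its standard error */
    double chi2 = 0.0;
    for (i = 0; i < n; i++)                   /* goodness-of-fit: should be ~ n-1 */
        chi2 += (x[i] - xhat) * (x[i] - xhat) / (sig[i] * sig[i]);
    printf("xhat = %.4f +/- %.4f, chi2 = %.3f (expect ~ %d)\n",
           xhat, err, chi2, n - 1);
    return 0;
}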

What is really going on is that there is a stochastic process underlying the experiment: true (hidden) parameters → process (+ stochasticity, "noise") → observed data. So a statistical model is an idealization or simplification of the stochastic process, sometimes a rather distant idealization, just as, e.g., finite element equations are rather distant idealizations of the underlying molecular dynamics. There is the additional complication that most statistical models are interesting only as inverse problems: get the hidden parameters from the observed data. Inverse problems occur in other kinds of models too, but forward modeling is more typical there.

In everybody's favorite example of repeated measurements, the statistical model is additive Gaussian noise: x_i = (true value) + N(0, σ_i²). And the inverse problem turns out to be trivial: the Bayesian estimate of the parameter is also a Gaussian. Or, if you must, you can view it as an ML estimate, with error from the likelihood ratio theorem and the Fisher information matrix. I would normally deprecate the latter perspective but for its access to goodness-of-fit information (take my course for more on this).

Let's show a nontrivial generalization on a historically (1996) interesting example. The Hubble constant H0 is the rate of expansion of the Universe. Measurement by classical astronomy is very difficult: each measurement is a multi-year project, with calibration issues. Between 1930 and 2000, credible measurements ranged from 30 to 120 (km/s/Mpc), many claiming small errors. The consensus view was "we just don't know H0." Or was it just failure to apply an adequate statistical model to the existing data?

Grim observational situation: the problem is not the range of values, but the inconsistency of the claimed errors. This forbids any kind of "just average", because goodness-of-fit rejects the possibility that these experiments are measuring the same value.

Let's model, using Bayes law, the idea that some experiments are right, some are wrong, and we don't know which are which. Let p be the probability that an experiment is right. Now expand out the prior in terms of v, a bit vector recording which experiments are right or wrong, e.g. (1,0,0,1,1,0,1,...), whose prior probability is $p^{\#(v=1)}\,(1-p)^{\#(v=0)}$.

(A "wrong" experiment is modeled as returning any value within a big enough range, i.e. a broad uniform.) Now you stare at this a while and realize that the sum over v is just a multinomial expansion: the posterior is a product over experiments of mixture terms. (This is called a mixture model. Probably we should have guessed!)
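
A small sketch of that mixture posterior in the same spirit: it grids over H0 and over p, models a "wrong" experiment as uniform over a broad range, and multiplies the per-experiment mixture terms. The measurement values, claimed errors, ranges, and grid sizes below are all illustrative assumptions, not the actual 1996 compilation:

#include <stdio.h>
#include <math.h>

#define NEXP 5
#define NH   181    /* H0 grid from 30 to 120               */
#define NP   99     /* grid for p, the prob. of being right */

static double gauss(double x, double mu, double s) {
    double z = (x - mu) / s;
    return exp(-0.5 * z * z) / (s * 2.5066282746310002);  /* 1/(s*sqrt(2*pi)) */
}

int main(void) {
    double c[NEXP] = {55., 72., 90., 68., 75.};  /* hypothetical measurements (km/s/Mpc) */
    double s[NEXP] = { 3.,  5.,  8.,  4., 10.};  /* hypothetical claimed errors          */
    double Hlo = 30., Hhi = 120.;
    double post[NH], norm = 0.0;
    for (int ih = 0; ih < NH; ih++) {
        double H0 = Hlo + (Hhi - Hlo) * ih / (NH - 1);
        double sum = 0.0;
        for (int ip = 0; ip < NP; ip++) {        /* integrate over a flat prior on p */
            double p = (ip + 1.0) / (NP + 1.0);
            double prod = 1.0;
            for (int i = 0; i < NEXP; i++)       /* right: Gaussian; wrong: broad uniform */
                prod *= p * gauss(c[i], H0, s[i]) + (1.0 - p) / (Hhi - Hlo);
            sum += prod;
        }
        post[ih] = sum;
        norm += sum;
    }
    for (int ih = 0; ih < NH; ih += 20)          /* print a coarse view of the posterior */
        printf("H0 = %6.1f   posterior weight = %.4g\n",
               Hlo + (Hhi - Hlo) * ih / (NH - 1), post[ih] / norm);
    return 0;
}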

And the answer is shown as a posterior distribution for H0 (with, for comparison, the Wilkinson Microwave Anisotropy Probe satellite 5-year results (2008), from WMAP + supernovae). This is not a Gaussian; it's just whatever shape came out of the data. It's not even necessarily unimodal (although it is for this data). If you leave out some of the middle-value experiments, it splits and becomes bimodal, thus showing that this method is not tail-trimming.

If you don't sum over the v's but only integrate over p, you get the probability that each measurement is correct.

If you care, you can also get the probability distribution of p, the a priori probability of a measurement being correct. (This is of course not universal, but depends on the field and its current state.)

Let's look at a statistical model very different from additive Gaussian noise: a Markov process. The directed graph may (and usually does) have loops. Time advances in discrete steps. At each time step, the state advances with the probabilities labeled on the outgoing edges (self loops are also OK). "Markov" because there is no memory: the process knows only what state it is in now. Markov models are especially important because there exists a fast algorithm for parsing their state from observed data, the so-called Hidden Markov Models (HMMs).

A (right) stochastic matrix has non-negative entries with rows summing to 1: A_ij is the transition probability from state i to state j. A population (probability) vector π over the states then evolves as
$$\pi_{t+1} = A^T \pi_t$$
(the transpose appears because of the way from/to are defined).
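
A tiny sketch of that population-vector update, written elementwise as pi_{t+1}(j) = sum_i pi_t(i) A[i][j]; the 3x3 matrix here is a made-up example:

#include <stdio.h>

int main(void) {
    double A[3][3] = {{0.90, 0.10, 0.00},   /* rows sum to 1: from-state i ... */
                      {0.00, 0.95, 0.05},   /* ... to-state j                  */
                      {1.00, 0.00, 0.00}};
    double pi[3] = {1.0, 0.0, 0.0};         /* start surely in state 0 */
    for (int t = 0; t < 50; t++) {          /* iterate the update */
        double next[3] = {0.0, 0.0, 0.0};
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                next[j] += pi[i] * A[i][j];
        for (int j = 0; j < 3; j++) pi[j] = next[j];
    }
    printf("population vector after 50 steps: %.4f %.4f %.4f\n", pi[0], pi[1], pi[2]);
    return 0;
}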

Note the two different ways of drawing the same Markov model: as a directed graph with loops, or, adding the time dimension, as a directed graph with no loops (you can't go backward in time).

In a Hidden Markov Model, we don't get to observe the states; instead we see a symbol that each state probabilistically emits when it is entered. So there is a sequence of (hidden) states and a sequence of (observed) symbols. What can we say about the sequence of states, given a sequence of observations?
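
For concreteness, a short sketch of the generative direction: simulate hidden states from a transition matrix and emit one observed symbol per state visit. The 2-state, 3-symbol matrices are made-up placeholders:

#include <stdio.h>
#include <stdlib.h>

#define NSTATE 2
#define NSYM   3

static int draw(const double *prob, int n) {   /* sample an index from a pmf */
    double u = rand() / (RAND_MAX + 1.0), c = 0.0;
    for (int k = 0; k < n; k++) { c += prob[k]; if (u < c) return k; }
    return n - 1;
}

int main(void) {
    double A[NSTATE][NSTATE] = {{0.9, 0.1}, {0.2, 0.8}};          /* transitions */
    double b[NSTATE][NSYM]   = {{0.7, 0.2, 0.1}, {0.1, 0.3, 0.6}};/* emissions   */
    int state = 0;
    for (int t = 0; t < 20; t++) {
        int sym = draw(b[state], NSYM);        /* observed symbol */
        printf("t=%2d  hidden=%d  symbol=%d\n", t, state, sym);
        state = draw(A[state], NSTATE);        /* hidden transition */
    }
    return 0;
}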

The Forward-Backward algorithm. Let's try to estimate the probability of being in a certain state at a certain time. Define α_t(i) as the probability of state i at time t given (only) the data up to and including t: the forward estimate. Written out directly, it is a huge sum over all possible paths of the likelihood (or Bayes probability with uniform prior) of each exact path together with the exact observed data. As written, this is computationally infeasible. But it satisfies an easy recurrence,
$$\alpha_{t+1}(j) \propto \Big[\sum_i \alpha_t(i)\, A_{ij}\Big]\, b_j(y_{t+1}).$$

Define β_t(i) as the probability of state i at time t given (only) the data to the future of t: the backward estimate. Now there is a backward recurrence,
$$\beta_t(i) \propto \sum_j A_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j),$$
started from a uniform prior at t = N-1 (there is no data to the future of N-1). And the grand estimate using all the data (the "forward-backward algorithm") combines the two: P_t(i) ∝ α_t(i) β_t(i). The normalization involves L, the likelihood (or Bayes probability) of the data; actually, it's independent of t! You could use its numerical value to compare different models. Worried about multiplying the α's and β's as if they were independent probabilities? The Markov property guarantees that the data before t and the data after t are conditionally independent given the state i at time t, so the product is legitimate.
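
A compact sketch of the forward-backward recurrences just described, with per-step renormalization so the products don't underflow (renormalizing does not change the posterior state probabilities). The toy transition matrix, emission matrix, and observation sequence are made up:

#include <stdio.h>

#define T 6
#define M 2     /* number of states  */
#define K 2     /* number of symbols */

int main(void) {
    double A[M][M] = {{0.9, 0.1}, {0.2, 0.8}};
    double b[M][K] = {{0.8, 0.2}, {0.3, 0.7}};
    int obs[T] = {0, 0, 1, 1, 1, 0};
    double alpha[T][M], beta[T][M], post[T][M];

    /* forward pass */
    for (int i = 0; i < M; i++) alpha[0][i] = (1.0 / M) * b[i][obs[0]];
    for (int t = 1; t < T; t++) {
        double s = 0.0;
        for (int j = 0; j < M; j++) {
            alpha[t][j] = 0.0;
            for (int i = 0; i < M; i++) alpha[t][j] += alpha[t-1][i] * A[i][j];
            alpha[t][j] *= b[j][obs[t]];
            s += alpha[t][j];
        }
        for (int j = 0; j < M; j++) alpha[t][j] /= s;   /* renormalize */
    }
    /* backward pass: uniform at the last time (no data to the future) */
    for (int i = 0; i < M; i++) beta[T-1][i] = 1.0;
    for (int t = T - 2; t >= 0; t--) {
        double s = 0.0;
        for (int i = 0; i < M; i++) {
            beta[t][i] = 0.0;
            for (int j = 0; j < M; j++)
                beta[t][i] += A[i][j] * b[j][obs[t+1]] * beta[t+1][j];
            s += beta[t][i];
        }
        for (int i = 0; i < M; i++) beta[t][i] /= s;    /* renormalize */
    }
    /* combine forward and backward estimates */
    for (int t = 0; t < T; t++) {
        double s = 0.0;
        for (int i = 0; i < M; i++) { post[t][i] = alpha[t][i] * beta[t][i]; s += post[t][i]; }
        for (int i = 0; i < M; i++) post[t][i] /= s;
        printf("t=%d  P(state 0)=%.3f  P(state 1)=%.3f\n", t, post[t][0], post[t][1]);
    }
    return 0;
}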

Let's work a biologically motivated example. In the galaxy Zyzyx, the Qiqiqi lifeform has a genome consisting of a linear sequence of amino acids, each chosen from 26 chemical possibilities, denoted A-Z. Genes alternate with intergenic regions. In intergenic regions, all of A-Z are equiprobable. In genes, the vowels AEIOU are more frequent. Genes always end with Z. The length distribution of genes and intergenic regions is known (it has been measured). Can we find the genes? (On Earth, it's 20 amino acids, with the additional complication of a genetic code mapping three base-4 codons (ACGT) into one amino acid. Our example thus simplifies by having no ambiguity of reading frame, and also no ambiguity of strand.) A sample of the sequence, with the genes and intergenes marked in the original figure (pvowel = 0.45):
qnkekxkdlscovjfehvesmdeelnzlzjeknvjgetyuhgvxlvjnvqlmcermojkrtuczg
rbmpwrjtynonxveblrjuqiydehpzujdogaensduoermiadaaustihpialkxicilgk
tottxxwawjvenowzsuacnppiharwpqviuammkpzwwjboofvmrjwrtmzmcxdkclvky
vkizmckmpvwfoorbvvrnvuzfwszqithlkubjruoyyxgwvfgxzlzbkuwmkmzgmnsyb

If we know the rules and (rough) probabilities, we can model directly. The forward-backward results on the previous data are plots of the posterior probabilities of state G and state Z along the sequence; the slide's annotations mark the actual gene start, the right terminating z, another z, and a chance excess of vowels large enough to make one call not completely sure.

But we can also learn the rules, by Bayesian re-estimation of the transition and output matrices (Baum-Welch re-estimation). Given the data, we can re-estimate A as the expected number of i → j transitions divided by the expected number of visits to state i. Estimating these as averages over the data (the backward recurrence shows that the required expressions are equal), we get
$$\hat A_{ij} = \frac{\sum_t \alpha_t(i)\, A_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)}{\sum_t \alpha_t(i)\, \beta_t(i)}$$
(note that L cancels between numerator and denominator).

Similarly, re-estimate b as the expected number of visits to state i that emit symbol k, divided by the expected number of visits to state i:
$$\hat b_i(k) = \frac{\sum_{t:\,y_t=k} \alpha_t(i)\, \beta_t(i)}{\sum_t \alpha_t(i)\, \beta_t(i)}.$$
The hatted A and b are improved estimates of the hidden parameters. With them, you can go back and re-estimate α and β, and so forth. This is a special case of what is called an EM method. One can prove that Baum-Welch re-estimation always increases the overall likelihood L, iterating to a (possibly only local) maximum. Notice that re-estimation doesn't require any additional information, or any training data. It is pure unsupervised learning.
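
A sketch of one Baum-Welch sweep, implementing the two re-estimation formulas above on a toy model. Because α and β are left unnormalized for this short sequence, the likelihood L cancels in the ratios exactly as noted; all matrices and observations here are illustrative:

#include <stdio.h>

#define T 8
#define M 2
#define K 2

int main(void) {
    double A[M][M] = {{0.8, 0.2}, {0.3, 0.7}};
    double b[M][K] = {{0.6, 0.4}, {0.2, 0.8}};
    int obs[T] = {0, 0, 1, 1, 0, 1, 1, 1};
    double alpha[T][M], beta[T][M];

    /* unnormalized forward and backward passes (T is small, no underflow) */
    for (int i = 0; i < M; i++) alpha[0][i] = (1.0 / M) * b[i][obs[0]];
    for (int t = 1; t < T; t++)
        for (int j = 0; j < M; j++) {
            alpha[t][j] = 0.0;
            for (int i = 0; i < M; i++) alpha[t][j] += alpha[t-1][i] * A[i][j];
            alpha[t][j] *= b[j][obs[t]];
        }
    for (int i = 0; i < M; i++) beta[T-1][i] = 1.0;
    for (int t = T - 2; t >= 0; t--)
        for (int i = 0; i < M; i++) {
            beta[t][i] = 0.0;
            for (int j = 0; j < M; j++)
                beta[t][i] += A[i][j] * b[j][obs[t+1]] * beta[t+1][j];
        }

    /* expected transition counts and state-occupation counts */
    double Ahat[M][M] = {{0}}, bhat[M][K] = {{0}}, occ[M] = {0};
    for (int t = 0; t < T - 1; t++)
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
                Ahat[i][j] += alpha[t][i] * A[i][j] * b[j][obs[t+1]] * beta[t+1][j];
    for (int t = 0; t < T; t++)
        for (int i = 0; i < M; i++) {
            double g = alpha[t][i] * beta[t][i];   /* proportional to P(state i at t) */
            occ[i] += g;
            bhat[i][obs[t]] += g;
        }
    for (int i = 0; i < M; i++) {
        double row = 0.0;
        for (int j = 0; j < M; j++) row += Ahat[i][j];
        for (int j = 0; j < M; j++) Ahat[i][j] /= row;  /* normalize: L cancels */
        for (int k = 0; k < K; k++) bhat[i][k] /= occ[i];
        printf("state %d: Ahat = %.3f %.3f   bhat = %.3f %.3f\n",
               i, Ahat[i][0], Ahat[i][1], bhat[i][0], bhat[i][1]);
    }
    return 0;
}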

Before (previous result) and after re-estimation (data size N = 10^5): parsing (forward-backward) can work on even small fragments, but re-estimation takes a lot of data. A companion plot shows how the log-likelihood increases with iteration number.

On many problems, re-estimation can hill-climb to the right answer from amazingly crude initial guesses. Here the initial guess encodes only that genes and intergenes alternate and are each about 100 long, and that there is a one-symbol gene-end marker; we claim to know nothing about which symbols are preferred in genes, end-genes, or intergenes:

a[0][0] = 1.-1./100.;  a[0][1] = 1.-a[0][0];   /* state 0 (intergene): ~100 long, then a gene       */
a[1][1] = 1.-1./100.;  a[1][2] = 1.-a[1][1];   /* state 1 (gene): ~100 long, then the end marker    */
a[2][0] = 1.;                                  /* state 2 (gene-end marker): back to intergene      */
for (i=0;i<26;i++) {                           /* all symbols equally probable in every state       */
    b[0][i] = 1./26.;
    b[1][i] = 1./26.;
    b[2][i] = 1./26.;
}

The log-likelihood increases monotonically with iteration number, and so does the accuracy (in this example we know the right answers!); a period of stagnation is not unusual.

Final estimates of the transition and symbol probability matrices. A (columns = from-state, rows = to-state):

           state I   state G   state Z
state I    0.99397   0.00000   1.00000
state G    0.00603   0.96991   0.00000
state Z    0.00000   0.03009   0.00000

Why these values? 1/0.00603 = 166 and 1/0.03009 = 33.2 are the implied mean residence times (lengths) of the intergene and gene states.

b (symbol emission probabilities):

     state I   state G   state Z
A    0.03746   0.09039   0.00245
E    0.03935   0.09196   0.00218
I    0.03684   0.08903   0.00266
O    0.03875   0.08992   0.00109
U    0.03740   0.09214   0.00144
B    0.03772   0.02590   0.00096
C    0.03891   0.02716   0.00686
D    0.03945   0.02792   0.00140
F    0.03862   0.02515   0.00037
G    0.03888   0.02505   0.00057
H    0.03884   0.02874   0.00116
J    0.03652   0.02926   0.00188
K    0.03838   0.02777   0.00069
L    0.03836   0.02673   0.00113
M    0.03823   0.02822   0.00035
N    0.03885   0.02639   0.00005
P    0.03888   0.02677   0.00493
Q    0.03880   0.02743   0.00572
R    0.04055   0.02844   0.00115
S    0.03933   0.02862   0.00769
T    0.03923   0.02393   0.00053
V    0.03924   0.02772   0.00057
W    0.03862   0.02826   0.00308
X    0.03820   0.02945   0.00039
Y    0.03871   0.02759   0.00045
Z    0.03586   0.00006   0.95026

Yes, it discovered all the vowels. It discovered Z, but didn't quite figure out that the Z state always emits Z.

An obvious flaw in the model: self-loops in Markov models must always give (a discrete approximation of) exponentially distributed residence times, because the waiting time to an exit event in a Poisson process is exponentially distributed. But Qiqiqi genes and intergenes are roughly gamma-law distributed in length. In fact, they're exactly gamma-law, because that's how I constructed the genome, not from a Markov model! So the model really is a model, not a representation of the actual physics. This is usually true in statistical modeling. Can we make the results more accurate by somehow incorporating length info?

Generalized Hidden Markov Model (GHMM), also called Hidden Semi-Markov Model (HSMM). The idea is to impose (or learn by re-estimation) an arbitrary probability distribution for the residency time τ in each state. It can be thought of as an ordinary HMM where every state gets expanded into a timer cluster of sub-states, with output symbol probabilities identical for all states in a timer (equal to those of the state before it was expanded). A chain of n sub-states with exit probabilities p_1, ..., p_n gives an arbitrary distribution with τ ≤ n, namely
$$[\,p_1;\ (1-p_1)p_2;\ (1-p_1)(1-p_2)p_3;\ \ldots\,],$$
while a cluster of n sub-states that each self-loop with probability 1-p gives a (discrete) Gamma-law distribution, τ ~ Gamma(n, p), with mean ⟨τ⟩ = n/p.
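
A small sketch of the gamma-law timer idea: n sub-states, each self-looping with probability 1-p, replace a single state; the code tabulates the resulting residence-time distribution by propagating probability along the sub-state chain (the values of n and p below are arbitrary choices):

#include <stdio.h>

int main(void) {
    int n = 3;                 /* sub-states in the timer cluster         */
    double p = 0.1;            /* advance probability per time step       */
    double prob[16] = {0.0};   /* prob[k]: in sub-state k, not yet exited */
    prob[0] = 1.0;             /* enter the cluster at sub-state 0        */
    double mean = 0.0, mass = 0.0;
    for (int t = 1; t <= 500; t++) {
        double exitp = prob[n-1] * p;            /* P(residence time == t) */
        for (int k = n - 1; k >= 1; k--)         /* advance with prob p, else stay */
            prob[k] = prob[k] * (1.0 - p) + prob[k-1] * p;
        prob[0] *= (1.0 - p);
        mean += t * exitp;
        mass += exitp;
        if (t % 20 == 0)
            printf("P(tau = %3d) = %.5f\n", t, exitp);
    }
    printf("mean residence = %.2f (theory n/p = %.1f), probability covered = %.4f\n",
           mean, n / p, mass);
    return 0;
}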

So our intergene-gene-Z example becomes a model in which the intergene and gene states are each expanded into timer clusters (with n0 and n1 sub-states, respectively), while the one-symbol Z state is left alone.

So how well do we do? Accuracy with respect to gene symbols is shown along with the per-symbol confusion table, laid out as TP FP / FN TN:

n0 = n1 = 1 (the previous HMM): accuracy = 0.9629, table = 0.1466 0.0227 / 0.0144 0.8163
n0 = 2, n1 = 5: accuracy = 0.9690, table = 0.1498 0.0195 / 0.0115 0.8192
n0 = 3, n1 = 8: accuracy = 0.9726, table = 0.1518 0.0175 / 0.0099 0.8208

Typically, for this example, the errors come from starting a gene ~5 symbols too late or ~3 too early. Here sensitivity = TP/(TP+FN) and specificity = TN/(FP+TN). For whole genes (length ~50), the sensitivity and specificity are basically 1.0000, because, with pvowel = 0.45, a gene is highly statistically significant. What the HMM or GHMM does well is to call the boundaries as exactly as possible. Obviously there's a theoretical bound on the achievable accuracy, given that the exact sequence of an FP or FN might also occur as a TP or TN. Can you calculate or estimate the bound?
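
As a quick consistency check on the first row of that table, a tiny computation using only the numbers quoted above:

#include <stdio.h>

int main(void) {
    /* n0 = n1 = 1 confusion-table fractions: TP FP / FN TN */
    double TP = 0.1466, FP = 0.0227, FN = 0.0144, TN = 0.8163;
    printf("accuracy    = %.4f\n", TP + TN);          /* should reproduce 0.9629 */
    printf("sensitivity = %.4f\n", TP / (TP + FN));
    printf("specificity = %.4f\n", TN / (FP + TN));
    return 0;
}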

Summary remarks. Like other kinds of modeling, statistical modeling has a bunch of standard tools from which you can construct or invert models:
Gaussian mixture models
hidden Markov models
hierarchical models
Gaussian process regression (a.k.a. linear prediction or kriging)
various log-odds things, including logistic regression
Markov-chain Monte Carlo
neural networks
various EM variants
wavelet smoothing and other function bases
etc., etc., etc.
But, also like other modeling, new problems can require you to invent new tools. Statistical modeling and machine learning are overlapping fields, though with different emphases.
Recommended books:
Gelman, Carlin, Stern, and Rubin, Bayesian Data Analysis, 2nd ed.
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning
(and of course) Numerical Recipes, 3rd ed. (esp. chapters 14, 15, and 16), available free from ICES subnets at http://nrbook.com/institutional