Statistical Data models, Non-parametrics, Dynamics

Non-informative, proper and improper priors
For a real quantity bounded to an interval, the standard prior is the uniform distribution.
For an unbounded real quantity, the standard is uniform - but with what density?
For a real quantity on a half-open interval, the standard prior is f(s) = 1/s - but its integral diverges!
Divergent priors are called improper - they can only be used with convergent likelihoods.
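For instance, on the positive half-line the scale prior f(s) = 1/s cannot be normalized, yet the posterior it induces is proper whenever the likelihood decays fast enough:

\[
\int_0^\infty \frac{ds}{s} = \infty,
\qquad\text{but}\qquad
f(s \mid x) \propto \frac{L(x \mid s)}{s}
\ \text{ is proper whenever }\
\int_0^\infty \frac{L(x \mid s)}{s}\, ds < \infty .
\]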

Dirichlet distribution - prior for a discrete distribution

Mean of the Dirichlet - Laplace's estimator
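The standard result behind this slide: for a Dirichlet prior Dir(α_1, ..., α_K) the mean of component i is α_i / Σ_j α_j, so with a uniform prior (all α_i = 1) and observed counts n_i the posterior mean is Laplace's estimator:

\[
\mathrm{E}[p_i \mid n_1,\dots,n_K] = \frac{n_i + 1}{n + K},
\qquad n = \sum_{j=1}^{K} n_j .
\]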

Occurrence table probability

Occurrence table probability, uniform prior:

Non-parametric inference
How can we perform inference about a distribution without assuming a distribution family?
A distribution over the reals can be approximated by a piecewise uniform distribution, or by a mixture of real distributions.
But how many parts? This is non-parametric inference.

Non-parametric inference: change-points, Rao-Blackwellization
Given times for events (e.g. coal-mining disasters), infer a piecewise constant intensity function (the change-point problem).
The state is a set of change-points with intensities in between.
But how many pieces? This is non-parametric inference.
MCMC: given the current state, propose a change in a segment boundary or an intensity.
But it is also possible to integrate out the proposed intensities.

Probability ratio in MCMC
For a proposed merge of intervals j and j+1, with sizes proportional to (α, 1-α): were the counts n_j and n_{j+1} obtained by tossing a coin with success probability α, or not? Compute the model probability ratio as in HW1.
Also, the total number of breakpoints has a Poisson prior distribution with parameter (average) λ.
Probability ratio in favor of a split: see the sketch below.
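A minimal Matlab sketch of such a ratio (not the HW1 derivation), assuming the segment intensities are integrated out under Gamma(a,b) priors and the number of breakpoints k has a Poisson(lambda) prior; the proposal ratio of the MCMC move is not included:

function lr = logsplitratio(nj, nj1, lj, lj1, k, lambda, a, b)
% Log model-probability ratio in favor of splitting one interval
% (counts nj+nj1, length lj+lj1) into two sub-intervals.
%   nj, nj1 - event counts in the two proposed sub-intervals
%   lj, lj1 - lengths of the two sub-intervals
%   k       - current number of breakpoints (before the split)
%   lambda  - Poisson prior mean for the number of breakpoints
%   a, b    - Gamma prior parameters for the intensities
  % log marginal likelihood of n events on an interval of length l,
  % with the Poisson intensity integrated out under its Gamma(a,b) prior
  lmarg = @(n, l) a*log(b) - gammaln(a) + gammaln(a+n) - (a+n)*log(b+l);
  lr = lmarg(nj, lj) + lmarg(nj1, lj1) ...   % two separate intensities
       - lmarg(nj+nj1, lj+lj1) ...           % versus one common intensity
       + log(lambda) - log(k+1);             % Poisson prior on breakpoint count
end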

Averaging over the MCMC run: positions and number of breakpoints

Averaging over the MCMC run: positions, with uniform test data

Mixture of Normals

Mixture of Normals: elimination of nuisance parameters

Mixture of Normals: elimination of nuisance parameters (integrate using the normalization constants of the Gaussian and Gamma distributions)
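A sketch of what that integration yields for one component, assuming a conjugate Normal-Gamma prior N(μ | μ_0, (κ_0 τ)^{-1}) Gam(τ | a_0, b_0) on its mean and precision (hyperparameters not specified on the slide):

\[
p(x_1,\dots,x_n) = (2\pi)^{-n/2}\,
\frac{\Gamma(a_n)}{\Gamma(a_0)}\,
\frac{b_0^{a_0}}{b_n^{a_n}}\,
\sqrt{\frac{\kappa_0}{\kappa_n}},
\]
\[
\kappa_n = \kappa_0 + n,\quad
a_n = a_0 + \tfrac{n}{2},\quad
b_n = b_0 + \tfrac12\sum_{i=1}^n (x_i - \bar{x})^2
      + \frac{\kappa_0\, n\, (\bar{x}-\mu_0)^2}{2(\kappa_0+n)} .
\]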

Matlab Mixture of Normals, MCMC (AutoClass method)
function [lh,lab,trlpost,trm,trstd,trlab,trct,nbounc]= mmnonu1(x,n,k,labi,nn);
%[lh,lab,trlpost,trm,trstd,trlab,trct,nbounc]= MMNONU1(x,N,k,labi,NN);
% 1D MCMC mixture modelling
% inputs
%   x         - 1D data column vector
%   N         - MCMC iterations
%   k         - number of components
%   lab, labi - component labelling of the data (vector)
%   NN        - thinning (optional)

Matlab Mixture of Normals, MCMC
function [lab,trlh,trm,trstd,trlab,trct,nbounc]= mmnonu1(x,n,k,labi,nn);
%[lh,lab,trlpost,trm,trstd,trlab,trct,nbounc]= MMNONU1(x,N,k,labi,NN);
% outputs
%   trlh  - thinned trace of log probability (optional)
%   trm   - thinned trace of means vector (optional)
%   trstd - thinned vector of standard deviations (optional)
%   trlab - thinned trace of labels vector (size(x,1) by N/NN) (optional)
%   trct  - thinned trace of mixing proportions

Matlab Mixture of Normals, MCMC
N=10000; NN=100;
x=[randn(100,1)-1;randn(100,1)*3;randn(100,1)+1];  % 3 components synthetic data
k=2;
labi=ceil(rand(size(x))*k);
[llhc,lab2,trl,trm,trstd,trlab,trct,nbounc]= mmnonu1(x,N,k,labi,NN);
[llhc2,lab2,trl2,trm2,trstd2,trlab2,trct2,nbounc]= mmnonu1(x,N,k,lab2,NN);
(repeated for k = 3, 4, 5)

Matlab Mixture of Normals, MCMC: the three components and the joint empirical distribution

Matlab Mixture of Normals, MCMC: putting them together makes the identification seem harder.

Matlab Mixture of Normals, MCMC: K=2 (traces of means and standard deviations)

Matlab Mixture of Normals, MCMC: K=3, burn-in progressing (traces of means and standard deviations)

Matlab Mixture of Normals, MCMC: K=3, burnt in (traces of means and standard deviations)

Matlab Mixture of Normals, MCMC: K=4, low probability; no focus, no interpretation as 4 clusters (traces of means and standard deviations)

Matlab Mixture of Normals, MCMC: K=5, low probability (traces of means and standard deviations)

Matlab Mixture of Normals, MCMC: trace of state labels. X sample: 1-100: (mean -1, std 1), 101-200: (mean 0, std 3), 201-300: (mean 1, std 1). Label trace shown for the unsorted and sorted sample.

Mixtures of multivariate normals
This works the same way, but instead of a Gamma distribution for the (inverse) variance we use the Wishart distribution, a matrix-valued distribution over precision (inverse covariance) matrices.
Competes well with both clustering and Expectation Maximization, which are prone to overfitting (clustering cannot handle overlapping components).

Dynamic Systems, time series
An abundance of linear prediction models exists.
For non-linear and chaotic systems, methods were developed in the 1990s (Santa Fe); Gershenfeld, Weigend: The Future of Time Series.
Online/offline: prediction/retrodiction.

Hidden Markov Models
Given a sequence of discrete signals x_i: is there a model likely to have produced x_i from a sequence of states s_i of a finite Markov chain?
P(. | s) - transition probability in state s
S(. | s) - signal probability in state s
Applications: speech recognition, bioinformatics, ...
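A minimal Matlab sketch of the complete-data log probability such a sampler scores, assuming the column convention of the hmmsim documentation below (each column of the transition matrix P and signal matrix S is a discrete pdf) and an assumed initial state distribution pi0:

function ll = hmmloglik(x, s, P, S, pi0)
% Complete-data log likelihood of a signal sequence x(1:T)
% together with a hidden state sequence s(1:T).
%   P(i,j) - probability of moving to state i from state j
%   S(v,j) - probability of emitting signal v in state j
%   pi0    - initial state distribution
  T = numel(x);
  ll = log(pi0(s(1))) + log(S(x(1), s(1)));
  for t = 2:T
    ll = ll + log(P(s(t), s(t-1))) + log(S(x(t), s(t)));
  end
end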

Hidden Markov Models
function [Pn,Sn,stn,trP,trS,trst,tll]= hmmsim(A,N,n,s,prop,Po,So,sto,NN);
%[Pn,Sn,stn,trP,trS,trst]=HMMSIM(A,N,n,s,prop,Po,So,sto,NN);
% Compute trace of posterior for hmm parameters
%   A    - the sequence of signals
%   N    - the length of trace
%   n    - number of states in Markov chain
%   s    - number of signal values
%   prop - proposal stepsize
% optional inputs:
%   Po   - starting transition matrix (each of n columns a discrete pdf in n-vector)
%   So   - starting signal matrix (each of n columns a discrete pdf

Hidden Markov Models
function [Pn,Sn,stn,trP,trS,trst,tll]= hmmsim(A,N,n,s,prop,Po,So,sto,NN);
%          in s-vector)
%   sto  - starting state sequence (congruent to vector A)
%   NN   - thinning of trace, default 10
% outputs
%   Pn   - last transition matrix in trace
%   Sn   - last signal emission matrix
%   stn  - last hidden state vector (congruent to A)
%   trP  - trace of transition matrices
%   trS  - trace of signal matrices
%   trst - trace of hidden state vectors

Hidden Markov Models

Hidden Markov Models

Evidence Based Education: EBE Home Page Evidence is often incomplete or equivocal. One of the problems that commonly afflicts politicians is feeling the need to act, or at least to be seen to be acting, despite the absence of any clear evidence about what action is most appropriate. A more mature response in many areas of educational policy would be to acknowledge that we do not really know enough to support a clear decision. Claims that have been enshrined in textbooks are suddenly unprovable. (Truth wears off, Lehrer, 2010)

Hidden Markov Models

Hidden Markov Models: over 100000 iterations the burn-in is visible; 2 states, 2 signals. P - transition matrix, S - signal matrix.

3 states (trace plot)

2 vs 3 states: log probability traces

MCMC Convergence

Kolmogorov-Smirnov test
Is a sample of n points from a given distribution? D = max absolute difference between the empirical and theoretical CDFs; compute D*sqrt(n) and reject if it is larger than ca 2.
Are two samples of sizes n1 and n2 from the same distribution? D = max absolute difference between the two empirical CDFs; the test statistic is D*sqrt(n1*n2/(n1+n2)).
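A minimal Matlab sketch of the two-sample statistic described above:

function [D, Dscaled] = kstwo(x1, x2)
% Two-sample Kolmogorov-Smirnov statistic:
% D is the max absolute difference between the two empirical CDFs,
% Dscaled = D*sqrt(n1*n2/(n1+n2)) is the test statistic quoted above.
  n1 = numel(x1); n2 = numel(x2);
  xs = sort([x1(:); x2(:)]);                 % pooled evaluation points
  F1 = arrayfun(@(v) mean(x1(:) <= v), xs);  % empirical CDF of sample 1
  F2 = arrayfun(@(v) mean(x2(:) <= v), xs);  % empirical CDF of sample 2
  D = max(abs(F1 - F2));
  Dscaled = D * sqrt(n1*n2/(n1+n2));
end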

Block-wise KS test for 4 MCMC runs (red/black = non-reject); panels: long, short, conv, X.

Berry and Linoff have eloquently stated their preferences with the often quoted sentence: "Neural networks are a good choice for most classification problems when the results of the model are more important than understanding how the model works". Neural networks typically give the right answer

Dynamic Systems and Takens' Theorem
Lag vectors (x_i, x_{i-1}, ..., x_{i-T+1}), for all i, occupy a submanifold of E^T if T is large enough.
This manifold is diffeomorphic to the original state space and can be used to create a good dynamic model.
Takens' theorem assumes no noise and must be verified empirically.
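A minimal Matlab sketch of building such lag vectors; row i of the result is the T-dimensional lag vector ending at sample i+T-1:

function V = lagembed(x, T)
% Delay (lag-vector) embedding of a scalar time series x:
% V(i,:) = (x(i+T-1), x(i+T-2), ..., x(i)), a point in E^T.
  x = x(:);
  N = numel(x);
  V = zeros(N - T + 1, T);
  for k = 1:T
    V(:, k) = x(T - k + 1 : N - k + 1);
  end
end

For example, lagembed(seriesA, 3) (with seriesA a hypothetical vector holding the laser series) gives the 3D point cloud viewed stereoscopically a few slides below.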

Dynamic Systems and Takens' Theorem

Santa Fe 1992 Competition
Unstable laser
Intensive care unit data, apnea
Exchange rate data
Synthetic series with drift
White dwarf star data
Bach's unfinished fugue

Stereoscopic 3D view of state space manifold, series A (laser)
The points seem to lie on a surface, which means that a lag vector of length 3 gives good prediction of the time series. The surface is either produced for a training batch, or produced on the fly from neighboring data points (possibly downweighting very old points).

Figure in the book is misleading: the origin is where the surface touches the ground.

Variational Bayes

True trajectory in state space (Valpola-Karhunen 2002)

Reconstructed trajectory in inferred state space

Chapman-Kolmogorov version of Bayes' rule
\[
f(\theta_t \mid D_t) \propto f(d_t \mid \theta_t)\int f(\theta_t \mid \theta_{t-1})\, f(\theta_{t-1} \mid D_{t-1})\, d\theta_{t-1}
\]

Observation- and video-based particle filter tracking
Defence: tracking with heterogeneous observations
Crowd analysis: tracking from video

Cycle in particle filter: each time step cycles through an importance (weighted) sample, a resampled ordinary sample, a diffused sample, and reweighting by the likelihood. X - state, Z - observation. (Sketch below.)
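A minimal Matlab sketch of one such cycle, assuming hypothetical function handles diffuse (samples the state transition) and lik (evaluates p(Z | X) for each particle):

function [xp, w] = pfstep(xp, w, z, diffuse, lik)
% One time step of the particle filter cycle above.
%   xp      - particles, one row per particle (the state X)
%   w       - importance weights (need not be normalized)
%   z       - new observation (Z)
%   diffuse - function handle: propagates particles through the dynamics
%   lik     - function handle: lik(z, xp) evaluates p(z | x) per particle
  n = size(xp, 1);
  c = cumsum(w(:)); c = c / c(end);
  idx = zeros(n, 1);
  for i = 1:n                       % resample: weighted -> ordinary sample
    idx(i) = find(rand <= c, 1, 'first');
  end
  xp = diffuse(xp(idx, :));         % diffuse through the state dynamics
  w = lik(z, xp);                   % weight by the observation likelihood
  w = w / sum(w);                   % renormalize importance weights
end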

Particle filter - general tracking