Introduction to Machine Learning


Introduction to Machine Learning
Brown University CSCI 1950-F, Spring 2012
Prof. Erik Sudderth
Lecture 25: Markov Chain Monte Carlo (MCMC), Course Review and Advanced Topics
Many figures courtesy Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective

Monte Carlo Methods
Estimation of expected model properties via simulation:
E[f] = ∫ f(z) p(z) dz ≈ (1/L) Σ_{l=1}^{L} f(z^(l)),   z^(l) ~ p(z)
Provably good if L is sufficiently large:
- Unbiased for any sample size
- Variance inversely proportional to sample size (and independent of the dimension of the space)
- Weak law of large numbers
- Strong law of large numbers
Problem: drawing samples from complex distributions. Alternatives for hard problems:
- Importance sampling
- Markov chain Monte Carlo (MCMC)
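To make the estimator concrete, here is a minimal sketch (not from the slides) that approximates E[f(z)] for the illustrative choice f(z) = z² under a standard normal p(z), where the exact answer is 1; the sample size L is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, L):
    """Monte Carlo estimate of E[f(z)] from L i.i.d. samples z ~ p(z)."""
    z = sampler(L)
    return np.mean(f(z))

# Example: E[z^2] under p(z) = N(0, 1) is exactly 1.
est = mc_expectation(lambda z: z**2, lambda L: rng.standard_normal(L), L=10_000)
print(est)  # close to 1; the error shrinks like 1/sqrt(L)
```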

Markov Chain Monte Carlo
z^(0) → z^(1) → z^(2) → ...,   z^(t+1) ~ q(z | z^(t))
- At each time point, the state is a configuration of all the variables in the model: parameters, hidden variables, etc.
- We design the transition distribution q(z | z^(t)) so that the chain is irreducible and ergodic, with a unique stationary distribution p*(z):
  p*(z) = ∫ q(z | z') p*(z') dz'
- For learning, the target equilibrium distribution is usually the posterior distribution given data x: p*(z) = p(z | x)
- Popular recipes: Metropolis-Hastings and Gibbs samplers
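The slide names Metropolis-Hastings as one such recipe; below is a minimal random-walk Metropolis sketch targeting an illustrative 1D density known only up to a constant (the bimodal target and proposal scale are my assumptions, not material from the lecture).

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(z):
    # Unnormalized log density of an illustrative bimodal target.
    return np.logaddexp(-0.5 * (z - 2.0) ** 2, -0.5 * (z + 2.0) ** 2)

def metropolis(log_p, z0, n_steps, step=1.0):
    """Random-walk Metropolis: symmetric Gaussian proposal q(z'|z) = N(z, step^2)."""
    z, samples = z0, []
    for _ in range(n_steps):
        z_prop = z + step * rng.standard_normal()
        # Accept with probability min(1, p(z')/p(z)); the symmetric proposal cancels.
        if np.log(rng.uniform()) < log_p(z_prop) - log_p(z):
            z = z_prop
        samples.append(z)
    return np.array(samples)

chain = metropolis(log_target, z0=0.0, n_steps=5000)
print(chain.mean(), chain.std())
```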

Gibbs Sampler for a 2D Gaussian / General Gibbs Sampler
At each step, one variable i = i(t) is resampled from its conditional while the others are held fixed:
z_i^(t) ~ p(z_i | z_{\i}^(t−1)),   z_j^(t) = z_j^(t−1) for j ≠ i(t)
Under mild conditions, the sampler converges assuming all variables are resampled infinitely often (the order can be fixed or random).
[Figure: coordinate-wise Gibbs updates on a correlated 2D Gaussian over (z_1, z_2); C. Bishop, Pattern Recognition & Machine Learning]
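A minimal sketch of the 2D Gaussian case, assuming a zero-mean target with unit variances and correlation rho (an illustrative choice): each conditional of a bivariate Gaussian is itself Gaussian, so each coordinate is resampled from N(rho · other, 1 − rho²).

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_2d_gaussian(rho, n_steps):
    """Gibbs sampler for a zero-mean bivariate Gaussian with unit variances
    and correlation rho; each conditional is N(rho * other, 1 - rho**2)."""
    z = np.zeros(2)
    samples = np.empty((n_steps, 2))
    cond_std = np.sqrt(1.0 - rho**2)
    for t in range(n_steps):
        z[0] = rho * z[1] + cond_std * rng.standard_normal()  # resample z1 | z2
        z[1] = rho * z[0] + cond_std * rng.standard_normal()  # resample z2 | z1
        samples[t] = z
    return samples

samples = gibbs_2d_gaussian(rho=0.9, n_steps=20_000)
print(np.corrcoef(samples.T)[0, 1])  # approaches 0.9 as the chain mixes
```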

Probabilistic Mixture Models
θ_k = {µ_k, Σ_k}
p(z_i | π) = Cat(z_i | π)
p(x_i | z_i, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
[Figure: 2D data points colored by their mixture component assignments]

Mixture Sampler Pseudocode
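The pseudocode itself was shown graphically on the slide; the following is a minimal sketch of one standard Gibbs sweep for a Gaussian mixture, under simplifying assumptions of my own (known spherical covariance sigma2·I, prior mu_k ~ N(0, tau2·I), symmetric Dirichlet prior on the weights), not necessarily the exact algorithm from the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_sweep(x, z, mu, pi, sigma2=1.0, tau2=10.0, alpha=1.0):
    """One Gibbs sweep for a K-component Gaussian mixture with known spherical
    covariance sigma2*I, prior mu_k ~ N(0, tau2*I), and pi ~ Dir(alpha,...,alpha)."""
    N, D = x.shape
    K = len(pi)

    # (1) Resample each assignment z_i | x_i, pi, mu.
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # N x K
    log_p = np.log(pi)[None, :] - 0.5 * sq_dist / sigma2
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=p[i]) for i in range(N)])

    # (2) Resample the mixture weights pi | z from their Dirichlet posterior.
    counts = np.bincount(z, minlength=K)
    pi = rng.dirichlet(alpha + counts)

    # (3) Resample each mean mu_k | x, z from its Gaussian posterior.
    for k in range(K):
        prec = 1.0 / tau2 + counts[k] / sigma2
        mean = (x[z == k].sum(axis=0) / sigma2) / prec
        mu[k] = mean + rng.standard_normal(D) / np.sqrt(prec)

    return z, mu, pi

# Illustrative usage on synthetic 2D data with K = 3:
x = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [0, 4])])
z = rng.integers(0, 3, size=len(x))
mu, pi = rng.standard_normal((3, 2)), np.full(3, 1 / 3)
for _ in range(50):
    z, mu, pi = gibbs_sweep(x, z, mu, pi)
```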

Snapshots of Mixture Gibbs Sampler
[Figure: cluster assignments from two initializations (A and B) after 2, 10, and 50 iterations]

Collapsed Sampling Algorithms
π ~ Dir(α),   z_i ~ Cat(π),   x_i ~ F(θ_{z_i}),   θ_k ~ G(β)
Conjugate priors allow exact marginalization of parameters, to make an equivalent model with fewer variables.
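As a concrete instance (the standard result, not an equation taken from this slide): integrating out π under its Dirichlet prior and θ_k under the conjugate prior G(β) gives a collapsed Gibbs conditional for each assignment of the form

```latex
p(z_i = k \mid z_{\setminus i}, x, \alpha, \beta)
  \;\propto\; \bigl(N_k^{\setminus i} + \alpha_k\bigr)\,
  p\bigl(x_i \mid \{x_j : z_j = k,\ j \neq i\}, \beta\bigr),
```

where N_k^{\setminus i} counts the other observations currently assigned to component k, and the second factor is that component's posterior predictive density under G(β).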

Gibbs: Representation and Mixing
- Standard Gibbs: alternately sample assignments and parameters
- Collapsed Gibbs: marginalize parameters, sample assignments
[Figure: log-probability traces for multiple initializations, and quantiles over 100 chains, for the two samplers]

MCMC & Computational Resources
Best practical option: a few (> 1) initializations, run for as many iterations as possible.

End of New Material
Next slides: some review and some advertisement of advanced topics.

The Main Learning Problems

             Supervised learning               Unsupervised learning
Discrete     classification or categorization  clustering
Continuous   regression                        dimensionality reduction

- Supervised: learn to approximate a function from examples
- Unsupervised: learn a representation which compresses data
- Probabilistic learning: learn by maximizing probability, or minimizing an expected loss

Supervised Learning
Generative ML or MAP learning (e.g., naïve Bayes):
max_{π,θ} log p(π) + log p(θ) + Σ_{i=1}^{N} [ log p(y_i | π) + log p(x_i | y_i, θ) ]
Discriminative ML or MAP learning (e.g., logistic regression):
max_{θ} log p(θ) + Σ_{i=1}^{N} log p(y_i | x_i, θ)
[Graphical models: training plates over (x_i, y_i), i = 1..N, and test nodes (x_t, y_t), for each approach]

Learning via Optimization
ML estimate: ŵ = arg min_w − Σ_i log p(y_i | x_i, w)
MAP estimate: ŵ = arg min_w − log p(w) − Σ_i log p(y_i | x_i, w)
Gradient vectors: for f : R^M → R, ∇_w f : R^M → R^M with (∇_w f(w))_k = ∂f(w)/∂w_k
Hessian matrices: ∇²_w f : R^M → R^{M×M} with (∇²_w f(w))_{k,l} = ∂²f(w)/∂w_k ∂w_l
Optimization of smooth functions:
- Closed form: find zero-gradient points, check curvature
- Iterative: initialize somewhere, use gradients to take steps towards better (by convention, smaller) values
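A minimal sketch of the iterative option for the discriminative model above: gradient ascent on the MAP objective for binary logistic regression with a Gaussian prior w ~ N(0, I/lam). The step size, prior strength, and synthetic data are illustrative assumptions, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def map_logistic_regression(X, y, lam=1.0, step=0.1, n_iters=500):
    """Gradient ascent on log p(w) + sum_i log p(y_i | x_i, w) for binary
    logistic regression with Gaussian prior w ~ N(0, I/lam)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Gradient of the MAP objective: likelihood term minus prior shrinkage.
        grad = X.T @ (y - sigmoid(X @ w)) - lam * w
        w += step * grad / len(y)
    return w

# Illustrative synthetic data: labels generated from a known weight vector.
X = np.c_[np.ones(200), rng.standard_normal((200, 2))]
w_true = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)
print(map_logistic_regression(X, y))  # roughly recovers w_true
```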

Unsupervised Learning
Clustering:
max_{π,θ} log p(π) + log p(θ) + Σ_{i=1}^{N} log [ Σ_{z_i} p(z_i | π) p(x_i | z_i, θ) ]
Dimensionality reduction:
max_{π,θ} log p(π) + log p(θ) + Σ_{i=1}^{N} log [ ∫ p(z_i | π) p(x_i | z_i, θ) dz_i ]
- No notion of training and test data: labels are never observed
- As before, maximize the posterior probability of the model parameters
- For hidden variables associated with each observation, we marginalize over possible values rather than estimating them
- Fully accounts for uncertainty in these variables
- There is one hidden variable per observation, so we cannot estimate them perfectly even with infinite data
- Must use a generative model (a discriminative model degenerates)
[Graphical model: π → z_i → x_i ← θ, plate over i = 1..N]

Expectation Maximization (EM)
[Graphical models: supervised training and testing over (x_i, y_i); unsupervised learning with hidden z_i]
π, θ: parameters (define a low-dimensional manifold)
z_1, ..., z_N: hidden data (locate observations on the manifold)
- Initialization: randomly select starting parameters
- E-step: given parameters, find the posterior of the hidden data (equivalent to test inference of the full posterior distribution)
- M-step: given posterior distributions, find likely parameters (similar to supervised ML/MAP training)
- Iteration: alternate E-step and M-step until convergence

EM as Lower Bound Maximization
ln p(x | θ) = ln ∫ p(x, z | θ) dz ≥ ∫ q(z) ln p(x, z | θ) dz − ∫ q(z) ln q(z) dz = L(q, θ)
- Initialization: randomly select starting parameters θ^(0)
- E-step: given parameters, find the posterior of the hidden data: q^(t) = arg max_q L(q, θ^(t−1))
- M-step: given posterior distributions, find likely parameters: θ^(t) = arg max_θ L(q^(t), θ)
- Iteration: alternate E-step and M-step until convergence
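A minimal sketch of this alternation for a Gaussian mixture, under my simplifying assumption of fixed identity covariances so the M-step stays short: the E-step computes the responsibilities q(z_i), and the M-step re-estimates the weights and means.

```python
import numpy as np

def em_spherical_gmm(x, K, n_iters=100, seed=0):
    """EM for a K-component Gaussian mixture with fixed identity covariances.
    E-step: responsibilities r[i, k] = q(z_i = k); M-step: update pi and mu."""
    rng = np.random.default_rng(seed)
    N, D = x.shape
    mu = x[rng.choice(N, K, replace=False)]   # initialize means at random data points
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: r[i, k] proportional to pi_k * N(x_i | mu_k, I).
        log_r = np.log(pi) - 0.5 * ((x[:, None] - mu[None]) ** 2).sum(-1)
        r = np.exp(log_r - log_r.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)
        # M-step: maximize the expected complete-data log likelihood.
        Nk = r.sum(0)
        pi = Nk / N
        mu = (r.T @ x) / Nk[:, None]
    return pi, mu

# Illustrative usage: pi, mu = em_spherical_gmm(data, K=3)
```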

Gaussian Mixture Models vs. HMMs
Mixture model:
p(z_i | π, µ, Σ) = Cat(z_i | π),   z_i ∈ {1, ..., K}
p(x_i | z_i, π, µ, Σ) = Norm(x_i | µ_{z_i}, Σ_{z_i})
Hidden Markov model:
p(z_t | z_{t−1}, z_{t−2}, ..., π, µ, Σ) = Cat(z_t | π_{z_{t−1}})
p(x_t | z_t, π, µ, Σ) = Norm(x_t | µ_{z_t}, Σ_{z_t})
[Graphical models: independent (z_i, x_i) pairs for the mixture; a Markov chain z_1 → ... → z_5 with emissions x_1, ..., x_5 for the HMM]
We recover the mixture model when all rows of the state transition matrix are equal.
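A minimal, illustrative sketch (not from the slide) of ancestral sampling from this HMM; note that when every row of the transition matrix equals the same distribution, the states become i.i.d. and the model reduces to the mixture.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_hmm(pi0, A, mus, sigma, T):
    """Ancestrally sample (z_1..z_T, x_1..x_T) from a 1D-Gaussian-emission HMM:
    z_1 ~ Cat(pi0), z_t ~ Cat(A[z_{t-1}]), x_t ~ N(mus[z_t], sigma^2)."""
    z = np.empty(T, dtype=int)
    z[0] = rng.choice(len(pi0), p=pi0)
    for t in range(1, T):
        z[t] = rng.choice(len(pi0), p=A[z[t - 1]])
    x = mus[z] + sigma * rng.standard_normal(T)
    return z, x

pi0 = np.array([0.5, 0.5])
A_sticky = np.array([[0.95, 0.05], [0.05, 0.95]])   # persistent hidden states
A_mixture = np.tile(pi0, (2, 1))                    # identical rows: reduces to a mixture model
z, x = sample_hmm(pi0, A_sticky, mus=np.array([-2.0, 2.0]), sigma=0.5, T=200)
```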

Probabilistic PCA & Factor Analysis
- Both models: data is a linear function of low-dimensional latent coordinates, plus Gaussian noise
p(z_i | θ) = N(z_i | 0, I)
p(x_i | z_i, θ) = N(x_i | W z_i + µ, Ψ)
p(x_i | θ) = N(x_i | µ, W W^T + Ψ)   (low-rank covariance parameterization)
- Factor analysis: Ψ is a general diagonal matrix
- Probabilistic PCA: Ψ is a multiple of the identity matrix, Ψ = σ² I
[Figure: latent prior p(z), conditional p(x | ẑ), and marginal p(x) in the (x_1, x_2) plane; C. Bishop, Pattern Recognition & Machine Learning]
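As a concrete illustration of the probabilistic PCA case, here is a sketch of the standard closed-form maximum-likelihood fit (Tipping & Bishop), under the assumption that the top-M eigenvalues of the sample covariance exceed the noise level; this is not code from the lecture.

```python
import numpy as np

def ppca_ml(x, M):
    """Closed-form ML fit of probabilistic PCA: x ~ N(mu, W W^T + sigma2*I).
    sigma2 is the average of the discarded eigenvalues of the sample covariance;
    W spans the top-M eigenvectors, scaled by sqrt(eigenvalue - sigma2)."""
    mu = x.mean(axis=0)
    S = np.cov(x - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(S)             # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending
    sigma2 = evals[M:].mean()
    W = evecs[:, :M] * np.sqrt(np.maximum(evals[:M] - sigma2, 0.0))
    return mu, W, sigma2
```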

Linear State Space Models
- States & observations jointly Gaussian:
  - All marginals & conditionals are Gaussian
  - Linear transformations remain Gaussian
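For reference, in one standard notation (the symbols A, C, Q, R are my assumption, not taken from the slide), the linear-Gaussian state space model is

```latex
z_t = A\, z_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q), \\
x_t = C\, z_t + v_t,   \qquad v_t \sim \mathcal{N}(0, R),
```

so every joint, marginal, and conditional over the states and observations is Gaussian.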

Simple Linear Dynamics
[Figure: sample state trajectories over time for Brownian motion and constant-velocity models]

Kalman Filter
- Represent Gaussians by their mean & covariance
- Recursion: prediction step, Kalman gain, update step
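A minimal sketch of one filtering step in that mean-and-covariance parameterization, using the standard predict/gain/update equations for the linear-Gaussian model; the A, C, Q, R notation carries over from the sketch above and is an assumption rather than the slide's own symbols.

```python
import numpy as np

def kalman_step(mu, Sigma, x, A, C, Q, R):
    """One Kalman filter step: predict through the linear dynamics, then update
    with observation x. Returns the new filtered mean and covariance."""
    # Prediction: push the previous posterior through the dynamics.
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q
    # Kalman gain: how much to trust the observation vs. the prediction.
    S = C @ Sigma_pred @ C.T + R                 # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)
    # Update: correct the prediction with the observed residual.
    mu_new = mu_pred + K @ (x - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new
```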

Constant Velocity Tracking
[Figure: Kalman filter vs. Kalman smoother estimates for constant-velocity tracking; K. Murphy, 1998]

Nonlinear State Space Models
- State dynamics and measurements given by potentially complex nonlinear functions
- Noise sampled from non-Gaussian distributions

Examples of Nonlinear Models
- Dynamics implicitly determined by geophysical simulations
- Observed image is a complex function of the 3D pose, other nearby objects & clutter, lighting conditions, camera calibration, etc.

Nonlinear Filtering
- Prediction step
- Update step
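In the standard form of the recursive Bayes filter (notation assumed here, matching the state space sketches above), these two steps are

```latex
\text{Prediction:}\quad p(z_t \mid x_{1:t-1}) = \int p(z_t \mid z_{t-1})\, p(z_{t-1} \mid x_{1:t-1})\, dz_{t-1} \\
\text{Update:}\quad p(z_t \mid x_{1:t}) \propto p(x_t \mid z_t)\, p(z_t \mid x_{1:t-1})
```

For nonlinear dynamics or non-Gaussian noise, neither step has a closed form in general, which motivates the approximate filters on the next slides.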

Approximate Nonlinear Filters
- No direct representation of continuous functions, or closed form for the prediction integral
- Big literature on approximate filtering:
  - Histogram filters
  - Extended & unscented Kalman filters
  - Particle filters
  - ...

Nonlinear Filtering Taxonomy
Histogram filter:
- Evaluate on a fixed discretization grid
- Only feasible in low dimensions
- Expensive or inaccurate
Extended/unscented Kalman filter:
- Approximate posterior as Gaussian via linearization, quadrature, ...
- Inaccurate for multimodal posterior distributions
Particle filter:
- Dynamically evaluate states with highest probability
- Monte Carlo approximation

Particle Filters
Also known as condensation, sequential Monte Carlo, survival of the fittest, ...
- Represent state estimates using a set of samples
- Propagate over time using importance sampling
[Figure: sample-based density estimate → weight by observation likelihood → resample & propagate by dynamics]
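A minimal sketch of one bootstrap particle filter step matching the three panels above: propagate samples through the dynamics, weight by the observation likelihood, and resample. The generic `sample_dynamics` and `obs_loglik` arguments are illustrative placeholders for a user-supplied model, not code from the lecture.

```python
import numpy as np

rng = np.random.default_rng(6)

def particle_filter_step(particles, x_obs, sample_dynamics, obs_loglik):
    """One bootstrap particle filter step.
    particles: (L, D) array of equally weighted samples of the previous posterior.
    sample_dynamics(z): samples z_t ~ p(z_t | z_{t-1}) for each row of z.
    obs_loglik(z, x): log p(x | z_t) evaluated for each row of z."""
    # Propagate each particle through the (possibly nonlinear) dynamics.
    particles = sample_dynamics(particles)
    # Weight by the observation likelihood (importance weights).
    log_w = obs_loglik(particles, x_obs)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Resample to obtain an equally weighted approximation of the new posterior.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]
```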

Particle Filtering Movie (M. Isard, 1996)

Dynamic Bayesian Networks
Specify and exploit internal structure in the hidden states underlying a time series.
[Graphical model: maneuver mode → spatial position → noisy observations]

DBN Hand Tracking Video
Isard et al., 1998

Particle Filtering Caveats
Particle filters are easy to implement, and effective in many applications, BUT:
- It can be difficult to know how many samples to use, or to tell when the approximation is poor
- They sometimes suffer catastrophic failures, where NO particles have significant posterior probability
- This is particularly true with peaky observations in high-dimensional spaces
[Figure: a peaky observation likelihood versus the broader dynamics prior]

The Big Picture
Ghahramani & Roweis