Lecture 3: Machine learning, classification, and generative models

EE E6820: Speech & Audio Processing & Recognition
Lecture 3: Machine learning, classification, and generative models
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 7, 2008

1 Classification
2 Generative models
3 Gaussian models
4 Hidden Markov models

Outline
1 Classification
2 Generative models
3 Gaussian models
4 Hidden Markov models

Classification and generative models

[Figure: spectrogram (f/Hz vs. time/s) and an F1/Hz vs. F2/Hz formant scatter separating the vowels "ay" and "ao"]

Classification
- discriminative models
- discrete, categorical random variable of interest
- fixed set of categories

Generative models
- descriptive models
- continuous or discrete random variable(s) of interest
- can estimate parameters
- Bayes rule makes them useful for classification

Building a classifier

- Define classes/attributes: could state explicit rules, but better to define them through training examples
- Define the feature space
- Define the decision algorithm: set its parameters from examples
- Measure performance: calculate a (weighted) error rate

[Figure: Pols vowel formants, "u" (x) vs. "o" (o), F2/Hz vs. F1/Hz]

Classification system parts

Sensor → signal → pre-processing/segmentation → segment → feature extraction → feature vector → classification → class → post-processing

- Pre-processing/segmentation: e.g. STFT, locate vowels
- Feature extraction: e.g. formant extraction
- Post-processing: e.g. context constraints, costs/risk

Feature extraction

- The right features are critical: waveform vs. formants vs. cepstra; invariance under irrelevant modifications
- Theoretically equivalent features may act very differently in a particular classifier: good representations make important aspects explicit and remove irrelevant information
- Feature design incorporates domain knowledge, although more data means less need for cleverness
- A smaller feature space (fewer dimensions) means simpler models (fewer parameters), less training data needed, and faster training

[Example: inverting MFCCs]

Optimal classification

Minimize the probability of error with the Bayes optimal decision:

$\hat\theta = \arg\max_{\theta_i} p(\theta_i \mid x)$

$p(\text{error}) = \int p(\text{error} \mid x)\, p(x)\, dx = \sum_i \int_{\Lambda_i} \big(1 - p(\theta_i \mid x)\big)\, p(x)\, dx$

where $\Lambda_i$ is the region of $x$ where $\theta_i$ is chosen... but $p(\theta_i \mid x)$ is largest in that region, so $p(\text{error})$ is minimized.

Sources of error

[Figure: class-conditional densities $p(x \mid \omega_1)\Pr(\omega_1)$ and $p(x \mid \omega_2)\Pr(\omega_2)$ with decision regions $\hat X_{\omega_1}$, $\hat X_{\omega_2}$; $\Pr(\omega_2 \mid \hat X_{\omega_1}) = \Pr(\text{err} \mid \hat X_{\omega_1})$ and $\Pr(\omega_1 \mid \hat X_{\omega_2}) = \Pr(\text{err} \mid \hat X_{\omega_2})$]

- Suboptimal threshold / regions (bias error): use a Bayes classifier
- Incorrect distributions (model error): better distribution models / more training data
- Misleading features ("Bayes error"): irreducible for a given feature set, regardless of classification scheme

Two roads to classification

The optimal classifier is $\hat\theta = \arg\max_{\theta_i} p(\theta_i \mid x)$, but we don't know $p(\theta_i \mid x)$.

- Can model that distribution directly, e.g. nearest neighbor, SVM, AdaBoost, neural net: a map from inputs $x$ to outputs $\theta_i$ (a discriminative model)
- Often easier to model the data likelihood $p(x \mid \theta_i)$ and use Bayes rule to convert it to $p(\theta_i \mid x)$ (a generative, descriptive model)

Nearest neighbor classification

[Figure: Pols vowel formants "u" (x), "o" (o), "a" (+) in F1/Hz vs. F2/Hz space, with a new observation $x_0$]

- Find the closest match (nearest neighbor)
- A naïve implementation takes O(N) time for N training points
- As N → ∞, the error rate approaches twice the Bayes error rate
- With K summarized classes, takes O(K) time
- Locality-sensitive hashing gives approximate nearest neighbors in $O(d\, n^{1/c^2})$ time (Andoni and Indyk, 2006)
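A minimal sketch of the nearest-neighbor rule on formant-style features, assuming NumPy; the training points and labels below are made up for illustration, not the Pols data from the figure.

```python
import numpy as np

def nearest_neighbor_classify(x_new, X_train, y_train):
    """Label x_new with the class of its closest training point (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # O(N) distance computations
    return y_train[np.argmin(dists)]

# Toy formant-like training data (F1, F2 in Hz); values are illustrative only
X_train = np.array([[300.0, 900.0], [350.0, 850.0], [500.0, 1000.0], [550.0, 950.0]])
y_train = np.array(["u", "u", "o", "o"])

print(nearest_neighbor_classify(np.array([340.0, 880.0]), X_train, y_train))  # -> "u"
```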

Support vector machines

[Figure: separating hyperplane $(w \cdot x) + b = 0$ with margin lines $(w \cdot x) + b = \pm 1$ and labels $y_i = \pm 1$]

Large-margin linear classifier for separable data
- regularization of the margin avoids over-fitting
- can be adapted to non-separable data (C parameter)
- made nonlinear using kernels $k(x_1, x_2) = \Phi(x_1) \cdot \Phi(x_2)$

Depends only on the training points near the decision boundary, the support vectors. Unique, optimal solution for a given $\Phi$ and $C$.
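As a sketch (not part of the lecture), scikit-learn's SVC exposes the same choices discussed above, a kernel and the C parameter; the 2-D data below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Two slightly overlapping synthetic classes in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# C trades margin width against training errors; an RBF kernel makes the boundary nonlinear
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.predict([[1.5, 1.5]]))
print(clf.support_vectors_.shape)   # only these points determine the decision boundary
```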

Outline
1 Classification
2 Generative models
3 Gaussian models
4 Hidden Markov models

Generative models

- Describe the data using structured probabilistic models
- Observations are random variables whose distribution depends on model parameters
- Source distributions $p(x \mid \theta_i)$:
  - reflect variability in the features
  - reflect noise in the observation
  - generally have to be estimated from data (rather than known in advance)

[Figure: overlapping source distributions $p(x \mid \omega_i)$ for classes $\omega_1, \ldots, \omega_4$]

Generative models (2)

Three things to do with generative models:

- Evaluate the probability of an observation, possibly under multiple parameter settings: $p(x),\ p(x \mid \theta_1),\ p(x \mid \theta_2), \ldots$
- Estimate model parameters from observed data: $\hat\theta = \arg\min_{\hat\theta} C(\hat\theta, \theta \mid x)$ for some cost $C$ between the estimate and the true parameters
- Run the model forward to generate new data: $x \sim p(x \mid \hat\theta)$

Random variables review

- Random variables have joint distributions, $p(x, y)$
- Marginal distribution of $y$: $p(y) = \int p(x, y)\, dx$
- Knowing one value in a joint distribution constrains the remainder
- Conditional distribution of $x$ given $y$: $p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{p(x, y)}{\int p(x, y)\, dx}$

[Figure: joint density $p(x, y)$ with marginals $p(x)$, $p(y)$ (parameters $\mu_x, \mu_y, \sigma_x, \sigma_y, \sigma_{xy}$) and the conditional slice $p(x \mid y = Y)$]

Bayes rule

$p(x \mid y)\, p(y) = p(x, y) = p(y \mid x)\, p(x) \quad\Rightarrow\quad p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$

- can reverse conditioning given priors/marginals
- terms can be discrete or continuous
- generalizes to more variables: $p(x, y, z) = p(x \mid y, z)\, p(y, z) = p(x \mid y, z)\, p(y \mid z)\, p(z)$
- allows conversion between joint, marginals, and conditionals
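A tiny numeric illustration of reversing the conditioning for a two-class problem (not from the slides; the prior and likelihood values are made up):

```python
# p(theta): prior class probabilities; p(x | theta): likelihood of one observed x under each class
priors = {"u": 0.6, "o": 0.4}
likelihoods = {"u": 0.02, "o": 0.05}

evidence = sum(likelihoods[c] * priors[c] for c in priors)               # p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}  # p(theta | x)
print(posteriors)   # sums to 1; the larger posterior is the Bayes decision
```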

Bayes rule for generative models

Run generative models "backwards" to compare them:

$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$, i.e. $\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$, with evidence $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$

- The posterior is the classification we're looking for: a combination of the prior belief in each class with the likelihood under our model, normalized by the evidence (so the posteriors sum to 1)
- Objection: priors are often unknown... but omitting them amounts to assuming they are all equal

Computing probabilities and estimating parameters

Want the probability of the observation under a model, $p(x)$, regardless of parameter settings.

- Full Bayesian integral: $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$
- Difficult to compute in general; approximate as $p(x \mid \hat\theta)$ using
  - Maximum likelihood (ML): $\hat\theta = \arg\max_\theta p(x \mid \theta)$
  - Maximum a posteriori (MAP), ML plus a prior: $\hat\theta = \arg\max_\theta p(\theta \mid x) = \arg\max_\theta p(x \mid \theta)\, p(\theta)$

Model checking

After estimating parameters, run the model forward. Check that:
- the model is rich enough to capture the variation in the data
- the parameters are estimated correctly
- there aren't any bugs in your code

Generate data from the model, $x \sim p(x \mid \hat\theta)$, and compare it to the observations:
- are they similar under some statistics $T(x) : \mathbb{R}^d \to \mathbb{R}$?
- can you find the real data set in a group of synthetic data sets?

Then go back and update your model accordingly (Gelman et al., 2003, ch. 6).

Outline
1 Classification
2 Generative models
3 Gaussian models
4 Hidden Markov models

Gaussian models

The easiest way to model distributions is via a parametric model: assume a known form and estimate a few parameters. The Gaussian model is simple and useful. In 1-D:

$p(x \mid \theta_i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu_i}{\sigma_i} \right)^2 \right]$

Parameters: the mean $\mu_i$ and standard deviation $\sigma_i$, fit to the data.

[Figure: 1-D Gaussian $p(x \mid \omega_i)$ with mean $\mu_i$ and width $\sigma_i$ marked at the $P_M e^{-1/2}$ level]
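A minimal sketch of fitting a 1-D Gaussian to each class by maximum likelihood and classifying by likelihood, assuming NumPy; the synthetic F1 values and equal-prior assumption are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D features (e.g. F1 in Hz) for two classes
data = {"u": rng.normal(300, 30, 200), "o": rng.normal(500, 40, 200)}

# ML estimates of (mu, sigma) for each class
params = {c: (x.mean(), x.std()) for c, x in data.items()}

def log_gaussian(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

x_new = 350.0
# With equal priors, the Bayes decision is the class with the highest likelihood
print(max(params, key=lambda c: log_gaussian(x_new, *params[c])))   # -> "u"
```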

Gaussians in d dimensions

$p(x \mid \theta_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right]$

Described by a d-dimensional mean $\mu_i$ and a $d \times d$ covariance matrix $\Sigma_i$.

[Figure: contour and surface plots of a 2-D Gaussian density]
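A sketch of evaluating the d-dimensional Gaussian log-density above, assuming NumPy; the mean and covariance are arbitrary example values.

```python
import numpy as np

def log_mvn_pdf(x, mu, Sigma):
    """Log of the d-dimensional Gaussian density, transcribed from the formula above."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)                 # log |Sigma|
    quad = diff @ np.linalg.solve(Sigma, diff)           # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(log_mvn_pdf(np.array([1.0, -1.0]), mu, Sigma))
```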

Gaussian mixture models

Single Gaussians cannot model:
- distributions with multiple modes
- distributions with nonlinear correlations

What about a weighted sum?

$p(x) \approx \sum_k c_k\, p(x \mid \theta_k)$

where $\{c_k\}$ is a set of weights and $\{p(x \mid \theta_k)\}$ is a set of Gaussian components
- can fit anything given enough components

Interpretation: each observation is generated by one of the Gaussians, chosen with probability $c_k = p(\theta_k)$.
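A sketch of that generative interpretation, assuming NumPy: pick a component with probability $c_k$, then draw from that component's Gaussian. The weights, means, and covariances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([0.3, 0.7])                               # mixture weights (sum to 1)
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])

def sample_gmm(n):
    z = rng.choice(len(c), size=n, p=c)                # hidden component index per draw
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

X, z = sample_gmm(500)
print(X.shape, np.bincount(z) / len(z))                # empirical component frequencies ≈ c
```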

Gaussian mixtures (2)

[Figure: original data with a nonlinear correlation, the fitted Gaussian components, and the resulting mixture surface]

Problem: finding the $c_k$ and $\theta_k$ parameters
- easy if we knew which $\theta_k$ generated each $x$

Expectation-maximization (EM)

General procedure for estimating model parameters when some are unknown, e.g. which GMM component generated a point.

Iteratively update the model parameters $\theta$ to maximize $Q$, the expected log-probability of the observed data $x$ and hidden data $z$:

$Q(\theta, \theta^t) = \sum_z p(z \mid x, \theta^t) \log p(z, x \mid \theta)$

- E step: calculate $p(z \mid x, \theta^t)$ using $\theta^t$
- M step: find the $\theta$ that maximizes $Q$ using $p(z \mid x, \theta^t)$
- can prove $p(x \mid \theta)$ is non-decreasing, hence a maximum-likelihood model
- local optimum: depends on initialization

Fitting GMMs with EM

Want to find:
- the parameters of the Gaussians, $\theta_k = \{\mu_k, \Sigma_k\}$
- the weights/priors on the Gaussians, $c_k = p(\theta_k)$
... that maximize the likelihood of the training data $x$.

If we could assign each $x$ to a particular $\theta_k$, estimation would be direct. Hence treat the mixture indices $z$ as hidden:
- form $Q = E[\log p(x, z \mid \theta)]$
- differentiate with respect to the model parameters
- obtain equations for $\mu_k$, $\Sigma_k$, $c_k$ that maximize $Q$

GMM EM update equations

Parameters that maximize $Q$, with $\nu_{nk} \equiv p(z_k \mid x_n, \theta^t)$:

$\mu_k = \frac{\sum_n \nu_{nk}\, x_n}{\sum_n \nu_{nk}}
\qquad
\Sigma_k = \frac{\sum_n \nu_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_n \nu_{nk}}
\qquad
c_k = \frac{1}{N} \sum_n \nu_{nk}$

- Each involves $\nu_{nk}$, the "fuzzy membership" of $x_n$ in Gaussian $k$
- Each updated parameter is just a sample average, weighted by fuzzy membership
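A compact sketch of these updates as code, assuming NumPy and SciPy; the initialization, regularization term, and synthetic data are illustrative choices, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """EM for a full-covariance GMM, implementing the update equations above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mus = X[rng.choice(N, K, replace=False)]                 # init means at random data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    c = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E step: nu[n, k] = p(z_k | x_n, theta_t)
        lik = np.column_stack([c[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                               for k in range(K)])
        nu = lik / lik.sum(axis=1, keepdims=True)

        # M step: membership-weighted sample averages
        Nk = nu.sum(axis=0)
        mus = (nu.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (nu[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        c = Nk / N
    return c, mus, Sigmas

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
print(em_gmm(X, K=2)[0])   # mixture weights should approach [0.5, 0.5]
```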

GMM examples

Vowel data fit with different mixture counts.

[Figure: fits with 1, 2, 3, and 4 Gaussians; training log-likelihoods log p(x) = -1911, -1864, -1849, and -1840 respectively]

Outline
1 Classification
2 Generative models
3 Gaussian models
4 Hidden Markov models

Markov models

A (first-order) Markov model is a finite-state system whose behavior depends only on the current state: the future is independent of the past, conditioned on the present.

Example generative Markov model with states S, A, B, C, E and transition probabilities $p(q_{n+1} \mid q_n)$:

             q_{n+1}
  q_n     S     A     B     C     E
   S      0     1     0     0     0
   A      0    0.8   0.1   0.1    0
   B      0    0.1   0.8   0.1    0
   C      0    0.1   0.1   0.7   0.1
   E      0     0     0     0     1

A sample state sequence: S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
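A sketch of running this chain forward, assuming NumPy: start in S and sample transitions until the absorbing end state E is reached. The seed and output are illustrative.

```python
import numpy as np

states = ["S", "A", "B", "C", "E"]
P = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.8, 0.1, 0.1, 0.0],
              [0.0, 0.1, 0.8, 0.1, 0.0],
              [0.0, 0.1, 0.1, 0.7, 0.1],
              [0.0, 0.0, 0.0, 0.0, 1.0]])

rng = np.random.default_rng(3)
q = ["S"]
while q[-1] != "E":
    i = states.index(q[-1])
    q.append(states[rng.choice(len(states), p=P[i])])
print(" ".join(q))   # a different run of A/B/C states each seed, always ending in E
```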

Hidden Markov models

A Markov model where the state sequence $Q = \{q_n\}$ is not directly observable ("hidden"). But the observations $X$ do depend on $Q$:
- $x_n$ is a random variable that depends only on the current state: $p(x_n \mid q_n)$

[Figure: emission distributions $p(x \mid q)$ for states A, B, and C; the state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC and the resulting observation sequence $x_n$ over time steps $n$]

... can still tell something about the state sequence from the observations.

(Generative) Markov models

An HMM is specified by parameters $\Theta$:
- states $q^i$
- transition probabilities $a_{ij} \equiv p(q^j_n \mid q^i_{n-1})$
- emission distributions $b_i(x) \equiv p(x \mid q^i)$
- (plus initial state probabilities $\pi_i \equiv p(q^i_1)$)

[Figure: example model with emitting states for "k", "a", "t", per-state emission densities $b_i(x)$, and transition matrix rows (1.0 0.0 0.0 0.0), (0.9 0.1 0.0 0.0), (0.0 0.9 0.1 0.0), (0.0 0.0 0.9 0.1)]

Markov models for sequence recognition

Independence of observations: observation $x_n$ depends only on the current state $q_n$:

$p(X \mid Q) = p(x_1, x_2, \ldots, x_N \mid q_1, q_2, \ldots, q_N) = p(x_1 \mid q_1)\, p(x_2 \mid q_2) \cdots p(x_N \mid q_N) = \prod_{n=1}^N p(x_n \mid q_n) = \prod_{n=1}^N b_{q_n}(x_n)$

Markov transitions: the transition to the next state depends only on the current state:

$p(Q \mid M) = p(q_1, q_2, \ldots, q_N \mid M) = p(q_N \mid q_{N-1}, \ldots, q_1)\, p(q_{N-1} \mid q_{N-2}, \ldots, q_1) \cdots p(q_2 \mid q_1)\, p(q_1) = p(q_N \mid q_{N-1})\, p(q_{N-1} \mid q_{N-2}) \cdots p(q_2 \mid q_1)\, p(q_1) = p(q_1) \prod_{n=2}^N p(q_n \mid q_{n-1}) = \pi_{q_1} \prod_{n=2}^N a_{q_{n-1} q_n}$

Model-fit calculations

From state-based modeling:

$p(X \mid \Theta_j) = \sum_{\text{all } Q} p(X \mid Q, \Theta_j)\, p(Q \mid \Theta_j)$

For HMMs:

$p(X \mid Q) = \prod_{n=1}^N b_{q_n}(x_n) \qquad p(Q \mid M) = \pi_{q_1} \prod_{n=2}^N a_{q_{n-1} q_n}$

Hence, solve for $\hat\Theta = \arg\max_{\Theta_j} p(\Theta_j \mid X)$, using Bayes rule to convert from $p(X \mid \Theta_j)$.

Sum over all Q???

Summing over all paths

Model M1 has states S, A, B, E with transition probabilities S→A 0.9, S→B 0.1, A→A 0.7, A→B 0.2, A→E 0.1, B→B 0.8, B→E 0.2, and observation likelihoods $p(x \mid q)$:

          x1      x2      x3
   A      2.5     0.2     0.1
   B      0.1     2.2     2.3

All possible 3-emission paths $Q_k$ from S to E:

   Path Q        p(Q|M) = Π p(q_n|q_{n-1})       p(X|Q,M) = Π p(x_n|q_n)        p(X,Q|M)
   S A A A E     .9 × .7 × .7 × .1 = 0.0441      2.5 × 0.2 × 0.1 = 0.05         0.0022
   S A A B E     .9 × .7 × .2 × .2 = 0.0252      2.5 × 0.2 × 2.3 = 1.15         0.0290
   S A B B E     .9 × .2 × .8 × .2 = 0.0288      2.5 × 2.2 × 2.3 = 12.65        0.3643
   S B B B E     .1 × .8 × .8 × .2 = 0.0128      0.1 × 2.2 × 2.3 = 0.506        0.0065
                               Σ = 0.1109                       Σ = p(X|M) = 0.4020

The forward recursion

A dynamic-programming-like technique to sum over all Q. Define $\alpha_n(i)$ as the probability of getting to state $q^i$ at time step $n$ (by any path):

$\alpha_n(i) = p(x_1, x_2, \ldots, x_n, q_n = q^i) = p(x_1^n, q^i_n)$

$\alpha_{n+1}(j)$ can be calculated recursively:

$\alpha_{n+1}(j) = \left[ \sum_{i=1}^S \alpha_n(i)\, a_{ij} \right] b_j(x_{n+1})$

[Figure: trellis of model states $q^i$ over time steps $n$ illustrating the recursion]
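A sketch of the recursion on the toy model from the path-summing slide, assuming NumPy; the transitions out of S are treated as initial probabilities and the exits into the non-emitting end state E are folded in at the last step.

```python
import numpy as np

pi = np.array([0.9, 0.1])               # p(first state | S) for [A, B]
a = np.array([[0.7, 0.2],               # transitions among the emitting states A, B
              [0.0, 0.8]])
exit_p = np.array([0.1, 0.2])           # p(E | A), p(E | B)
b = np.array([[2.5, 0.2, 0.1],          # b_A(x1), b_A(x2), b_A(x3)
              [0.1, 2.2, 2.3]])         # b_B(x1), b_B(x2), b_B(x3)

alpha = pi * b[:, 0]                    # alpha_1(i) = pi_i * b_i(x1)
for n in range(1, b.shape[1]):
    alpha = (alpha @ a) * b[:, n]       # alpha_{n+1}(j) = [sum_i alpha_n(i) a_ij] * b_j(x_{n+1})
print(alpha @ exit_p)                   # total p(X | M) ≈ 0.402, matching the sum over all paths
```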

Forward recursion (2)

- Initialize $\alpha_1(i) = \pi_i\, b_i(x_1)$
- Then the total probability is $p(x_1^N \mid \Theta) = \sum_{i=1}^S \alpha_N(i)$
- A practical way to solve for $p(X \mid \Theta_j)$ and hence select the most probable model (recognition)

[Figure: observations X evaluated under each model; compare $p(X \mid M_1)\, p(M_1)$, $p(X \mid M_2)\, p(M_2)$, ..., and choose the best]

Optimal path

May be interested in the actual $q_n$ assignments: which state was active at each time frame, e.g. phone labeling (for training?).

The total probability is over all paths... but we can also solve for the single best path, the Viterbi state sequence. The probability along the best path to state $q^j$ at time $n+1$ is

$\hat\alpha_{n+1}(j) = \left[ \max_i \{ \hat\alpha_n(i)\, a_{ij} \} \right] b_j(x_{n+1})$

- backtrack from the final state to get the best path
- the final probability is a product only (no sum)
- the log-domain calculation is just a summation

The best path often dominates the total probability: $p(X \mid \Theta) \approx p(X, \hat Q \mid \Theta)$
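A sketch of Viterbi decoding on the same toy model as the forward-recursion example, assuming NumPy: the sum is replaced by a max and backpointers are kept for the traceback.

```python
import numpy as np

pi = np.array([0.9, 0.1]); exit_p = np.array([0.1, 0.2])
a = np.array([[0.7, 0.2], [0.0, 0.8]])
b = np.array([[2.5, 0.2, 0.1], [0.1, 2.2, 2.3]])
states = ["A", "B"]

delta = pi * b[:, 0]                               # best-path score ending in each state
back = []
for n in range(1, b.shape[1]):
    scores = delta[:, None] * a                    # scores[i, j] = delta_n(i) * a_ij
    back.append(scores.argmax(axis=0))             # best predecessor for each state j
    delta = scores.max(axis=0) * b[:, n]

last = int((delta * exit_p).argmax())              # fold in the exit into the end state
path = [last]
for bp in reversed(back):
    path.append(bp[path[-1]])
best = [states[i] for i in reversed(path)]
print(best, (delta * exit_p).max())                # ['A', 'B', 'B'], ≈ 0.364 (the S A B B E path)
```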

Interpreting the Viterbi path

The Viterbi path assigns each $x_n$ to a state $q^i$:
- performing classification based on $b_i(x)$...
- while at the same time applying the transition constraints $a_{ij}$

[Figure: observations $x_n$ with the inferred Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]

Can be used for segmentation:
- train an HMM with "garbage" and "target" states
- decode on new data to find targets and their boundaries

Can be used for (heuristic) training:
- e.g. forced alignment to bootstrap a speech recognizer
- e.g. train classifiers based on the labels...

Aside: Training and test data

A rich model can learn every training example (overtraining). But the goal is to classify new, unseen data:
- sometimes a cross-validation set is used to decide when to stop training

For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly...)

Aside (2): Model complexity

More training data allows the use of larger models. More model parameters create a better fit to the training data:
- more Gaussian mixture components
- more HMM states

For a fixed training set size, there will be some optimal model size that avoids overtraining.

Summary

- Classification is making discrete (hard) decisions
- The basis is comparison with known examples, explicitly or via a model
- Classification models:
  - discriminative models, like SVMs, neural nets, and boosters, directly learn the posteriors $p(\theta_i \mid x)$
  - generative models, like Gaussians, GMMs, and HMMs, model the likelihoods $p(x \mid \theta)$
- Bayes rule lets us use generative models for classification
- EM allows parameter estimation even with some data missing

Parting thought: Is it wise to use generative models for discrimination, or vice versa?

References

Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. IEEE Symposium on Foundations of Computer Science, pages 459-468, Washington, DC, USA, 2006. IEEE Computer Society.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, second edition, July 2003. ISBN 158488388X.

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

Jeff A. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1997.

Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, June 1998.