Brief Introduction of Machine Learning Techniques for Content Analysis

1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20

Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM)

Overview 3 Any computer program that can improve its performance at some task through experience (or training) can be called a learning program. In the early days, computer scientists developed learning algorithms based on heuristics and insights into human reasoning mechanisms (e.g., decision trees), while neuroscientists attempted to devise learning methods by imitating the structure of the human brain. Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis, Springer, 2007

Basic Statistical Learning Problems 4 Many learning tasks can be formulated as one of the following two problems. Regression: X is the input variable, Y the output variable. Infer a function f(x) so that, given a value x of the input variable X, y = f(x) is a good prediction of the true value y of the output variable Y.

Basic Statistical Learning Problems 5 Classification: Assume that a random variable X can belong to one of a finite set of classes C = {1, 2, ..., K}. Given the value x of variable X, infer its class label l = g(x), where l \in C. It is also of great interest to estimate the probability P(k|x) that X belongs to class k, k \in C. In fact, both the regression and classification problems can be formulated using the same framework.

6 Categorizations of Machine Learning Techniques Unsupervised vs. Supervised For inferring the functions f(x) and g(x), if pairs of training data (x_i, y_i) or (x_i, l_i), i = 1, ..., N, are available, then the inference process is called supervised learning. Most regression methods are supervised learning. Unsupervised methods strive to automatically partition a given data set into a predefined number of clusters; this is also called clustering.

7 Categorizations of Machine Learning Techniques Generative Models vs. Discriminative Models Discriminative models strive to learn P(k|x) directly from the training set, without attempting to model the observation x. Generative models compute P(k|x) by first modeling the class-conditional probabilities P(x|k) as well as the class probabilities P(k), and then applying Bayes' theorem: P(k|x) = P(x|k) P(k) / P(x), i.e., posterior probability \propto likelihood \times prior probability.

Categorizations of Machine Learning Techniques 8 Generative models: Naïve Bayes, Bayesian Networks, Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), etc. Discriminative models: Neural Networks, Support Vector Machines (SVM), Maximum Entropy Models (MEM), Conditional Random Fields (CRF), etc.

9 Categorizations of Machine Learning Techniques Models for Simple Data vs. Models for Complex Data Complex data consist of sub-entities that are strongly related to one another; e.g., a beach scene is usually composed of a blue sky on top, an ocean in the middle, and a sand beach at the bottom. For simple data: Naïve Bayes, GMM, NN, SVM. For complex data: BN, HMM, MEM, CRF, M3-net.

10 Categorizations of Machine Learning Techniques Model Identification vs. Model Prediction Model identification: to discover an existing law of nature. The model identification paradigm is an ill-posed problem and suffers from the curse of dimensionality. The goal of model prediction is to predict events well, but not necessarily through identification of the model of the events.

11 Gaussian Mixture Model Wei-Ta Chu 2008/11/20

Introduction 12 By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combinations, almost any continuous density can be approximated to arbitrary accuracy. C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

Introduction 13 Consider a superposition of K Gaussian densities, p(x) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma_k). Each Gaussian density N(x | \mu_k, \Sigma_k) is called a component of the mixture and has its own mean \mu_k and covariance \Sigma_k.

Introduction 14 From the sum and product rules, the marginal density is given by p(x) = \sum_{k=1}^{K} p(k) p(x|k) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma_k). We can view \pi_k = p(k) as the prior probability of picking the kth component, and the density N(x | \mu_k, \Sigma_k) = p(x|k) as the probability of x conditioned on k.

Introduction 15 From Bayes' theorem, the posterior probability p(k|x) is given by \gamma_k(x) \equiv p(k|x) = \pi_k N(x | \mu_k, \Sigma_k) / \sum_{j=1}^{K} \pi_j N(x | \mu_j, \Sigma_j). The Gaussian mixture distribution is governed by the parameters \pi, \mu, and \Sigma. One way to set these parameters is maximum likelihood. Assuming the data points are independent and identically distributed, the log likelihood is \ln p(X | \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \{ \sum_{k=1}^{K} \pi_k N(x_n | \mu_k, \Sigma_k) \}.

Introduction 16 In the case of a single variable x, the Gaussian distribution takes the form N(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{ -\frac{(x-\mu)^2}{2\sigma^2} \}. For a D-dimensional vector x, the multivariate Gaussian distribution takes the form N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \}.
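
A minimal sketch (not from the slides) of the two density formulas above, written with NumPy; the function names gauss_1d and gauss_nd are illustrative only.

```python
import numpy as np

def gauss_1d(x, mu, sigma2):
    """Univariate Gaussian N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gauss_nd(x, mu, cov):
    """Multivariate Gaussian N(x | mu, Sigma) for a D-dimensional vector x."""
    D = x.shape[0]
    diff = x - mu
    norm = 1.0 / np.sqrt(((2 * np.pi) ** D) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

# Example: a 2-D standard Gaussian evaluated at its mean equals 1 / (2*pi).
print(gauss_nd(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.1592
```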

Maximizing Likelihood 17 Setting the derivative of \ln p(X | \pi, \mu, \Sigma) with respect to the means \mu_k of the Gaussian components to zero gives \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) x_n, where N_k = \sum_{n=1}^{N} \gamma(z_{nk}). The mean of the kth Gaussian component is obtained by taking a weighted mean of all the points in the data set, in which the weighting factor for data point x_n is given by the posterior probability \gamma(z_{nk}).

Maximizing Likelihood 18 Setting the derivative of \ln p(X | \pi, \mu, \Sigma) with respect to the covariances \Sigma_k of the Gaussian components to zero gives \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T. Each data point is weighted by the corresponding posterior probability.

Maximizing Likelihood 19 Maximize \ln p(X | \pi, \mu, \Sigma) with respect to the mixing coefficients \pi_k, under the constraint that the mixing coefficients sum to one. Using a Lagrange multiplier \lambda, we maximize the quantity \ln p(X | \pi, \mu, \Sigma) + \lambda ( \sum_{k=1}^{K} \pi_k - 1 ). If we multiply both sides of the resulting condition by \pi_k and sum over k, making use of the constraint, we find \lambda = -N. Using this to eliminate \lambda and rearranging, we obtain \pi_k = N_k / N: the mixing coefficient of the kth component is given by the average responsibility which that component takes for explaining the data points.

20 Expectation-Maximization (EM) Algorithm We first choose some initial values for the means, covariances, and mixing coefficients. Expectation step (E step): use the current parameters to evaluate the posterior probabilities (responsibilities). Maximization step (M step): re-estimate the means, covariances, and mixing coefficients using these responsibilities. Each update to the parameters resulting from an E step followed by an M step is guaranteed to increase the log likelihood function.
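
A compact EM-for-GMM sketch following the E/M steps above. This is illustrative only, not the slides' own code; it assumes NumPy, full covariances, and a fixed number of iterations rather than a convergence test.

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM to data X of shape (N, D) with plain EM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]                 # initial means
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(D)] * K)    # initial covariances
    pi = np.full(K, 1.0 / K)                                # initial mixing coefficients
    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] = p(k | x_n)
        dens = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            norm = 1.0 / np.sqrt(((2 * np.pi) ** D) * np.linalg.det(cov[k]))
            quad = np.einsum('nd,dn->n', diff, np.linalg.solve(cov[k], diff.T))
            dens[:, k] = norm * np.exp(-0.5 * quad)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, cov
```

In practice one would also monitor the log likelihood and stop when it no longer increases; that check is omitted here to keep the sketch short.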

Example 21

EM for Gaussian Mixtures 22

Case Study 23 Jiang, et al. A new method to segment playfield and its applications in match analysis in sports video, In Proc. of ACM MM, pp. 292-295, 2004.

Case Study 24 The conditional density of a pixel x belonging to the playfield region is modeled with a mixture of M Gaussian densities: p(x | playfield) = \sum_{i=1}^{M} \pi_i N(x | \mu_i, \Sigma_i).

Related Resources 25 GMMBAYES - Bayesian Classifier and Gaussian Mixture Model Toolbox http://www.it.lut.fi/project/gmmbayes/downloads/src/gmmbayestb/ Netlab http://www.ncrg.aston.ac.uk/netlab/index.php Matlab toolboxes collection http://stommel.tamu.edu/~baum/toolboxes.html

26 Hidden Markov Model Wei-Ta Chu 2008/11/20

Markov Model 27 Chain rule: p(x_1, ..., x_N) = \prod_{n=1}^{N} p(x_n | x_1, ..., x_{n-1}). If we assume that each of the conditional distributions is independent of all previous observations except the most recent, we obtain the first-order Markov chain. First-order Markov chain: p(x_1, ..., x_N) = p(x_1) \prod_{n=2}^{N} p(x_n | x_{n-1}). Second-order Markov chain: p(x_1, ..., x_N) = p(x_1) p(x_2 | x_1) \prod_{n=3}^{N} p(x_n | x_{n-1}, x_{n-2}). C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

Example 28 What's the probability that the weather for eight consecutive days is sun-sun-sun-rain-rain-sun-cloudy-sun? [State-transition diagram with three weather states (rain, cloudy, sunny) and their transition probabilities; cf. Rabiner (1989).] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993. L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
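
A minimal sketch of the calculation above for an observable Markov chain. The transition matrix used here is the one from Rabiner's 1989 tutorial (state 1 = rain, 2 = cloudy, 3 = sunny), which the original diagram appears to follow; treat the exact values as an assumption, since the diagram itself is not fully recoverable.

```python
import numpy as np

A = np.array([[0.4, 0.3, 0.3],   # from rain
              [0.2, 0.6, 0.2],   # from cloudy
              [0.1, 0.1, 0.8]])  # from sunny

def sequence_prob(states, A, first_state_prob=1.0):
    """P(q1, q2, ..., qT) for an observable Markov chain (0-based state indices)."""
    p = first_state_prob
    for s_prev, s_next in zip(states[:-1], states[1:]):
        p *= A[s_prev, s_next]
    return p

# sun-sun-sun-rain-rain-sun-cloudy-sun, given that day 1 is sunny
states = [2, 2, 2, 0, 0, 2, 1, 2]
print(sequence_prob(states, A))  # ~1.536e-4 with these assumed probabilities
```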

Coin-Toss Model 29 You are in a room with a curtain through which you cannot see what is happening. On the other side of the curtain is another person who is performing a coin-tossing experiment (using one or more coins). The person will not tell you which coin he selects at any time; he will only tell you the result of each coin flip. A typical observation sequence would be a series of heads and tails, e.g., O = (H H T T T H T T H ...). The question is: how do we build a model to explain the observed sequence of heads and tails?

Coin-Toss Model 30 [Diagrams: a 1-coin model (observable Markov model) with two states, Head and Tail, and transition probabilities P(H) and 1-P(H); and a 2-coins model (hidden Markov model) with two hidden states, transition probabilities a_11, 1-a_11, a_22, 1-a_22, and per-state emission probabilities P(H) = P_1, P(T) = 1-P_1 for state 1 and P(H) = P_2, P(T) = 1-P_2 for state 2.]

Coin-Toss Model 31 [Diagram: a 3-coins model (hidden Markov model) with three fully connected hidden states and transition probabilities a_ij; emission probabilities P(H) = P_1, P_2, P_3 and P(T) = 1-P_1, 1-P_2, 1-P_3 for states 1, 2, and 3 respectively.]

Elements of HMM 32 N: the number of states in the model. M: the number of distinct observation symbols per state. The state-transition probability distribution A = {a_ij}. The observation symbol probability distribution B = {b_j(k)}. The initial state distribution \pi = {\pi_i}. To describe an HMM, we usually use the compact notation \lambda = (A, B, \pi).
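
Purely for illustration, the elements of an HMM \lambda = (A, B, \pi) written out for the 2-coins model above; the specific probabilities (including P_1 and P_2) are made-up assumptions.

```python
import numpy as np

N = 2                      # number of hidden states (which coin is being tossed)
M = 2                      # number of observation symbols (0 = Head, 1 = Tail)
A = np.array([[0.7, 0.3],  # a_ij = P(state j at time t+1 | state i at time t)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],  # b_j(k) = P(symbol k | state j); coin 1: P(H) = 0.9
              [0.2, 0.8]]) #                                  coin 2: P(H) = 0.2
pi = np.array([0.5, 0.5])  # initial state distribution
```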

Three Basic Problems of HMM 33 Problem 1: Probability Evaluation How do we compute P(O | \lambda), the probability that the observed sequence O was produced by the model \lambda? This scores how well a given model matches a given observation sequence.

Three Basic Problems of HMM 34 Problem 2: Optimal State Sequence Attempt to uncover the hidden part of the model, that is, to find the correct state sequence. For practical situations, we usually use an optimality criterion to solve this problem as well as possible.

Three Basic Problems of HMM 35 Problem 3: Parameter Estimation Attempt to optimize the model parameters to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence because it is used to train the HMM.

Solution to Problem 1 36 There are N^T possible state sequences. Consider one fixed state sequence q = (q_1 q_2 ... q_T). The probability of the observation sequence O given this state sequence is P(O | q, \lambda) = \prod_{t=1}^{T} P(o_t | q_t, \lambda), where we have assumed statistical independence of observations; thus we get P(O | q, \lambda) = b_{q_1}(o_1) b_{q_2}(o_2) ... b_{q_T}(o_T).

Solution to Problem 1 37 The probability of such a state sequence can be written as P(q | \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}. The joint probability of O and q, i.e., the probability that O and q occur simultaneously, is simply the product of the above terms: P(O, q | \lambda) = P(O | q, \lambda) P(q | \lambda). The probability of O is obtained by summing this joint probability over all possible state sequences q, giving P(O | \lambda) = \sum_{all q} P(O | q, \lambda) P(q | \lambda).

Solution to Problem 1 38 The Forward Procedure Define \alpha_t(i) = P(o_1 o_2 ... o_t, q_t = i | \lambda), the probability of the partial observation sequence o_1, o_2, ..., o_t (until time t) and state i at time t, given the model. We solve for it inductively. 1. Initialization: \alpha_1(i) = \pi_i b_i(o_1). 2. Induction: \alpha_{t+1}(j) = [ \sum_{i=1}^{N} \alpha_t(i) a_{ij} ] b_j(o_{t+1}). 3. Termination: P(O | \lambda) = \sum_{i=1}^{N} \alpha_T(i).

Solution to Problem 1 39 The Forward Procedure Requires on the order of N^2 T calculations, rather than the 2T N^T required by the direct calculation.
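
A minimal sketch of the forward procedure above. The variable names alpha, A, B, and pi follow the slide notation; the toy parameter values are assumed, not from the slides.

```python
import numpy as np

def forward(obs, A, B, pi):
    """Return P(O | lambda) and the alpha table for observation indices obs."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # 1. initialization
    for t in range(1, T):                         # 2. induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                 # 3. termination

# Toy 2-state, 2-symbol HMM (assumed values)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
prob, _ = forward([0, 1, 0], A, B, pi)
print(prob)   # P(Head, Tail, Head | lambda)
```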

40 The Backward Procedure Define \beta_t(i) = P(o_{t+1} o_{t+2} ... o_T | q_t = i, \lambda), the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model. 1. Initialization: \beta_T(i) = 1. 2. Induction: \beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j).
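
A companion sketch of the backward procedure, using the same assumed toy HMM conventions as the forward sketch above.

```python
import numpy as np

def backward(obs, A, B):
    """Return beta table: beta[t, i] = P(o_{t+1}, ..., o_T | q_t = i, lambda)."""
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))                        # 1. initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                # 2. induction, backwards in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(backward([0, 1, 0], A, B))
```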

Solution to Problem 2 41 We define the quantity \delta_t(i) = max_{q_1, ..., q_{t-1}} P(q_1 ... q_{t-1}, q_t = i, o_1 ... o_t | \lambda), which is the best score (highest probability) along a single path, at time t, that accounts for the first t observations and ends in state i. By induction we have \delta_{t+1}(j) = [ max_i \delta_t(i) a_{ij} ] b_j(o_{t+1}).

Solution to Problem 2 42 The Viterbi Algorithm 1. Initialization: \delta_1(i) = \pi_i b_i(o_1), \psi_1(i) = 0. 2. Recursion: \delta_t(j) = max_i [ \delta_{t-1}(i) a_{ij} ] b_j(o_t), \psi_t(j) = argmax_i [ \delta_{t-1}(i) a_{ij} ]. 3. Termination: P* = max_i \delta_T(i), q_T* = argmax_i \delta_T(i). 4. Path (state sequence) backtracking: q_t* = \psi_{t+1}(q_{t+1}*), t = T-1, ..., 1. The major difference between the Viterbi algorithm and the forward procedure is the maximization over previous states in place of the summation.
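
A minimal Viterbi sketch following steps 1-4 above; the toy HMM values are assumed, as in the earlier sketches.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Return the most likely state sequence and its probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))                      # best score ending in each state
    psi = np.zeros((T, N), dtype=int)             # argmax back-pointers
    delta[0] = pi * B[:, obs[0]]                  # 1. initialization
    for t in range(1, T):                         # 2. recursion
        trans = delta[t - 1][:, None] * A         # trans[i, j] = delta_{t-1}(i) a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    best_prob = delta[-1].max()                   # 3. termination
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):                 # 4. backtracking
        path.append(psi[t][path[-1]])
    return path[::-1], best_prob

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi([0, 1, 0], A, B, pi))
```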

Solution to Problem 3 43 Choose \lambda = (A, B, \pi) such that its likelihood, P(O | \lambda), is locally maximized using an iterative procedure such as the Baum-Welch algorithm (also known as the EM algorithm or the forward-backward algorithm). Define \xi_t(i, j) = P(q_t = i, q_{t+1} = j | O, \lambda), the probability of being in state i at time t and state j at time t+1, given the model and the observation sequence: \xi_t(i, j) = \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j) / P(O | \lambda).

Solution to Problem 3 44 The probability of being in state i at time t, given the observation sequence O and the model, is \gamma_t(i) = P(q_t = i | O, \lambda) = \alpha_t(i) \beta_t(i) / P(O | \lambda) = \sum_{j=1}^{N} \xi_t(i, j).

Solution to Problem 3 45 The re-estimation formulas: \bar{\pi}_i = \gamma_1(i); \bar{a}_{ij} = \sum_{t=1}^{T-1} \xi_t(i, j) / \sum_{t=1}^{T-1} \gamma_t(i); \bar{b}_j(k) = \sum_{t: o_t = k} \gamma_t(j) / \sum_{t=1}^{T} \gamma_t(j).
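
A compact single-iteration Baum-Welch re-estimation sketch built from the alpha/beta recursions of the earlier forward/backward sketches. It is illustrative only: a single observation sequence, assumed toy parameters, and no scaling for numerical stability; in practice the update is repeated until the likelihood converges.

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One re-estimation step for lambda = (A, B, pi) on observation indices obs."""
    N, T = A.shape[0], len(obs)
    # Forward and backward tables
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # gamma_t(i) = P(q_t = i | O, lambda); xi_t(i, j) = P(q_t = i, q_{t+1} = j | O, lambda)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    # Re-estimation formulas from the slide above
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(baum_welch_step([0, 1, 0, 0], A, B, pi))
```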

Types of HMMs 46 Ergodic; left-right; parallel-path left-right.

Case Study 47 Features: field descriptor, edge descriptor, grass and sand, player height. Peng, et al. Extract highlights from baseball game video with hidden Markov models, In Proc. of ICIP, vol. 1, pp. 609-612, 2002.

Related Resources 48 Hidden Markov Model (HMM) Toolbox for Matlab http://www.cs.ubc.ca/~murphyk/software/hmm/hmm.html The General Hidden Markov Model library (GHMM) http://ghmm.sourceforge.net/ HTK Speech Recognition Toolkit http://htk.eng.cam.ac.uk

49 Support Vector Machine Wei-Ta Chu 2008/11/20

Introduction: Linearly Separable Data 50 For a linearly separable data set S = {(x_i, y_i)}, y_i \in {+1, -1}, there exists a hyperplane w \cdot x + b = 0 that correctly separates all the points in S, i.e., y_i (w \cdot x_i + b) > 0 for all i. Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis, Springer, 2007

Introduction 51 Rescale (w, b) so that the points closest to the hyperplane satisfy |w \cdot x_i + b| = 1. This normalization leads to the canonical form for the SVM: y_i (w \cdot x_i + b) \ge 1 for all i.

Properties 52

Properties 53

54 The goal of SVM is to find the pair of hyperplanes H1: w \cdot x + b = 1 and H2: w \cdot x + b = -1 that maximize the margin, subject to the constraints y_i (w \cdot x_i + b) \ge 1. This can be formulated as the constrained optimization problem: minimize \frac{1}{2} ||w||^2 subject to y_i (w \cdot x_i + b) \ge 1, i = 1, ..., N.

55 Introduction: Linearly Non-Separable Data When the SVM derived above is applied to non-separable data sets, some data points may lie at a distance on the wrong side of the margin. We relax the constraint to y_i (w \cdot x_i + b) \ge 1 - \xi_i, with slack variables \xi_i \ge 0. The SVM for the non-separable case can be cast as the optimization problem: minimize \frac{1}{2} ||w||^2 + C \sum_i \xi_i subject to y_i (w \cdot x_i + b) \ge 1 - \xi_i, \xi_i \ge 0.

General Idea 56 General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: \Phi: x \mapsto \phi(x).

Optimization Problem 57 Use the Lagrange multiplier method to find the optimal solution, transforming the original problem into a dual problem. Kernel trick: we don't really compute the mapping function \phi; instead, we use a kernel function K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) to obtain the inner product of data points in the higher-dimensional space.
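
A brief usage sketch with scikit-learn (a library not mentioned in the slides, given here only as one convenient implementation): an RBF-kernel SVM, where the kernel trick replaces any explicit mapping \phi(x).

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable
clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # C is the soft-margin penalty
clf.fit(X, y)
print(clf.score(X, y))                          # training accuracy
```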

Related Resources 58 LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html

References 59 L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989. Statistical Data Mining Tutorials, Tutorial Slides by Andrew Moore: http://www.autonlab.org/tutorials/index.html Jiang, et al. A new method to segment playfield and its applications in match analysis in sports video, In Proc. of ACM MM, pp. 292-295, 2004. Peng, et al. Extract highlights from baseball game video with hidden Markov models, In Proc. of ICIP, vol. 1, pp. 609-612, 2002. LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html

References 60 Hidden Markov Model (HMM) Toolbox for Matlab http://www.cs.ubc.ca/~murphyk/software/hmm/hmm.html The General Hidden Markov Model library (GHMM) http://ghmm.sourceforge.net/ HTK Speech Recognition Toolkit http://htk.eng.cam.ac.uk

Next Week 61 Introduction to Audio and Music