
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014

Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory

What is Machine Learning? Outline of this section 1 What is Machine Learning? Basic concepts 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory

What is Machine Learning? Basic concepts What is ML? Figure 1: Examples of hand-written digits. Key terms: training set, target vector, test set.

What is Machine Learning? Basic concepts 1 Generalization: the ability to correctly categorize new examples that differ from those used for training. 2 Supervised learning: regression, classification. 3 Unsupervised learning: clustering, density estimation, projecting data from a high-dimensional space to a low-dimensional space.

Curve Fitting Outline of this section 1 What is Machine Learning? 2 Curve Fitting An example Overfitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory

Curve Fitting An example An example (important). Figure 2: Example of polynomial curve fitting. Green curve - sin(2πx). Blue dots - the observed targets t_n.

Curve Fitting An example The polynomial function is
y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M (1)
and the error function is
E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 (2)
We can solve the curve fitting problem by choosing the value of w for which E(w) is as small as possible.
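
As a rough illustration (not from the slides), the minimizer of E(w) can be computed in closed form with ordinary least squares. The helper names (design_matrix, fit_polynomial) and the synthetic data are my own choices, assuming NumPy:

```python
import numpy as np

def design_matrix(x, M):
    """Rows are [1, x_n, x_n^2, ..., x_n^M] for each input x_n."""
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    """Closed-form minimizer of E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    Phi = design_matrix(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution of Phi w ~ t
    return w

# Synthetic data in the spirit of the example: targets are sin(2*pi*x) plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

w = fit_polynomial(x, t, M=3)
print(w)
```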

Curve Fitting An example Figure 3: Curve fitting with different values of M

Curve Fitting An example Figure 4: Coefficient values w obtained for different values of M

Curve Fitting Overfitting Why did overfitting happen? Figure 5: Overfitting. The Taylor series of sin(2πx) has infinitely many terms. The more flexible polynomials with larger values of M become increasingly tuned to the random noise on the target values.

Curve Fitting Overfitting How to measure over-fitting? 1 Use a test set to evaluate the error E(w) in (2); 2 Use the root-mean-square (RMS) error:
E_RMS = \sqrt{2E(w^*)/N} (3)
where w^* is the fitted parameter vector. The RMS error measures how well we are doing at predicting the target values for new observations, and the division by N lets us compare data sets of different sizes.
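
Continuing the sketch above, a hypothetical rms_error helper evaluates E_RMS on the training data and on held-out test data; the test set here is synthetic and only illustrative:

```python
def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w) / N); dividing by N lets us compare data sets of different size."""
    Phi = design_matrix(x, len(w) - 1)
    E = 0.5 * np.sum((Phi @ w - t) ** 2)
    return np.sqrt(2 * E / len(x))

# Held-out test data drawn from the same underlying curve
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=x_test.shape)
print(rms_error(w, x, t), rms_error(w, x_test, t_test))
```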

Curve Fitting Overfitting How to avoid/control over-fitting? I 1 Increase the size of the training data (N). Figure 6: More training data

Curve Fitting Overfitting How to avoid/control over-fitting? II 2 Regularization
\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{λ}{2} \|w\|^2 (4)
where \|w\|^2 = w^T w = w_0^2 + w_1^2 + \cdots + w_M^2, and λ governs the relative importance of the regularization term compared with the sum-of-squares error term.
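
A minimal sketch of the regularized fit, reusing the design_matrix and rms_error helpers from the earlier examples. The closed-form ridge solution w = (Φ^T Φ + λI)^{-1} Φ^T t is a standard result; the particular λ value below is only illustrative:

```python
def fit_polynomial_regularized(x, t, M, lam):
    """Closed-form minimizer of (4): w = (Phi^T Phi + lam*I)^{-1} Phi^T t (ridge regression)."""
    Phi = design_matrix(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

# A flexible model (M = 9) kept in check by a small amount of regularization
w_reg = fit_polynomial_regularized(x, t, M=9, lam=np.exp(-18))
print(rms_error(w_reg, x_test, t_test))
```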

Curve Fitting Overfitting How to avoid/control over-fitting? III Figure 7: Different coefficients with different values of λ

Curve Fitting Overfitting How to avoid/control over-fitting? IV Figure 8: Fitted curves for different values of λ. 3 Choosing a proper value for the model complexity. 4 Adopting a Bayesian approach.

Probability Theory Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory The rules of probability Bayesian probabilities Curve fitting re-visited 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory

Probability Theory The rules of probability The rules of probability
1 Sum rule: p(X) = \sum_Y p(X, Y) (5)
2 Product rule: p(X, Y) = p(Y|X) p(X) (6)
3 Bayes' theorem: p(Y|X) = \frac{p(X|Y) p(Y)}{p(X)} (7)
Bayes' theorem plays a central role in pattern recognition and machine learning.
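
To make the three rules concrete, here is a small illustrative example (not from the slides) using a 2x2 joint distribution table; the numbers are arbitrary but sum to 1:

```python
import numpy as np

# Joint distribution p(X, Y) over two binary variables, as a table (rows: X, columns: Y)
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)               # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)
p_y_given_x = p_xy / p_x[:, None]    # from the product rule: p(Y|X) = p(X, Y) / p(X)
p_x_given_y = p_xy / p_y[None, :]

# Bayes' theorem: p(X|Y) p(Y) / p(X) reproduces p(Y|X)
bayes = p_x_given_y * p_y[None, :] / p_x[:, None]
print(np.allclose(bayes, p_y_given_x))  # True
```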

Probability Theory Bayesian probabilities Bayesian probabilities Now we turn to a more general Bayesian view. Probabilities provide a quantification of uncertainty, e.g. the uncertainty about whether the Arctic polar ice will have melted by the end of the century, an event that cannot be repeated.

Probability Theory Bayesian probabilities Insight into Bayes' theorem
prior probability - p(w)
observed data - D = {t_1, t_2, ..., t_N}
p(w|D) = \frac{p(D|w) p(w)}{p(D)} (8)
Bayes' theorem converts the prior probability p(w) into the posterior probability p(w|D). Stated in words: posterior ∝ likelihood × prior.

Probability Theory Curve fitting re-visited Curve fitting re-visited I Here we return to the curve fitting example to gain some insight into the error function and regularization.
training data - x = (x_1, x_2, ..., x_N)^T
target values - t = (t_1, t_2, ..., t_N)^T
Given x, the target t is assumed to have a Gaussian distribution with mean equal to y(x, w), thus
p(t|x, w, β) = \mathcal{N}(t \mid y(x, w), β^{-1}) (9)
where β is the inverse variance (precision) of the distribution.

Probability Theory Curve fitting re-visited Curve fitting re-visited II Figure 9: Distribution of t

Probability Theory Curve fitting re-visited Curve fitting re-visited III We now use the training data {x, t} to determine the values of the unknown parameters w and β by maximum likelihood.
p(t|x, w, β) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), β^{-1}) (10)
Maximizing the logarithm of the likelihood,
\ln p(t|x, w, β) = -\frac{β}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{N}{2} \ln β - \frac{N}{2} \ln(2π) (11)
gives the corresponding polynomial coefficients w_ML. The precision β can also be obtained:
\frac{1}{β_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{y(x_n, w_{ML}) - t_n\}^2 (12)
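
A short sketch of the maximum-likelihood fit, reusing the hypothetical fit_polynomial and design_matrix helpers from the earlier example: maximizing (11) over w reduces to least squares, so w_ML coincides with the earlier fit, and β_ML then follows from (12):

```python
def max_likelihood_fit(x, t, M):
    """w_ML from least squares (maximizing (11) over w); beta_ML is the inverse mean squared residual (12)."""
    w_ml = fit_polynomial(x, t, M)
    Phi = design_matrix(x, M)
    beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)
    return w_ml, beta_ml

w_ml, beta_ml = max_likelihood_fit(x, t, M=3)
print(w_ml, beta_ml)
```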

Probability Theory Curve fitting re-visited Curve fitting re-visited IV The first term of equation (11) has the same form as the sum-of-squares error in equation (2).
A step towards the Bayesian approach: consider a Gaussian prior
p(w|α) = \mathcal{N}(w \mid 0, α^{-1} I) = \left(\frac{α}{2π}\right)^{(M+1)/2} \exp\{-\frac{α}{2} w^T w\} (13)
By Bayes' theorem,
p(w|x, t, α, β) ∝ p(t|x, w, β) p(w|α) (14)
Maximizing the posterior (MAP) is equivalent to minimizing
\frac{β}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{α}{2} w^T w (15)
which is the regularized sum-of-squares error (4) with λ = α/β.
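
This equivalence can be checked numerically with the fit_polynomial_regularized sketch from above; the α and β values below are purely illustrative:

```python
# Minimizing (15) is the regularized least-squares problem (4) with lambda = alpha / beta,
# so the MAP estimate can be obtained directly from the ridge solution.
alpha, beta = 5e-3, 11.1   # illustrative prior precision and noise precision
w_map = fit_polynomial_regularized(x, t, M=9, lam=alpha / beta)
print(w_map)
```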

Probability Theory Curve fitting re-visited Bayesian curve fitting Will be introduced in Section 3.3.

Model Selection Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory

Model Selection Model Selection I If data is plentiful, we can train a range of models on one subset and compare them on an independent validation set; otherwise, we can use cross-validation. Figure 10: Cross-validation procedure
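
A minimal K-fold cross-validation sketch, assuming the fit_polynomial and rms_error helpers from the curve-fitting examples above; the fold construction and K = 5 are arbitrary choices:

```python
def k_fold_cv_rms(x, t, M, K=5):
    """Average held-out RMS error over K folds; used to compare polynomial orders M."""
    folds = np.array_split(np.arange(len(x)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit_polynomial(x[train], t[train], M)
        errors.append(rms_error(w, x[val], t[val]))
    return np.mean(errors)

# Choose the model order with the smallest cross-validated error
scores = {M: k_fold_cv_rms(x, t, M) for M in range(10)}
print(min(scores, key=scores.get))
```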

Model Selection Model Selection II To correct for the bias of maximum likelihood, we can use the Akaike information criterion (AIC): choose the model for which the quantity
\ln p(D|w_{ML}) - M (16)
is largest, where M is the number of adjustable parameters in the model.

The curse of dimensionality Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory

The curse of dimensionality The curse of dimensionality I Figure 11: Data

The curse of dimensionality The curse of dimensionality II Figure 12: Result

The curse of dimensionality The curse of dimensionality III Figure 13: The curse of dimensionality

The curse of dimensionality The curse of dimensionality IV The volume of a sphere of radius r in D dimensions is
V_D(r) = K_D r^D (17)
so the fraction of its volume lying between radius 1 - ε and radius 1 is
\frac{V_D(1) - V_D(1 - ε)}{V_D(1)} = 1 - (1 - ε)^D (18)
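
A quick numerical check of (18), showing how the shell fraction approaches 1 as D grows; the choice of ε and of the dimensions printed is arbitrary:

```python
import numpy as np

# Fraction of a unit sphere's volume lying in a thin shell of thickness eps, from (18)
eps = 0.01
for D in (1, 2, 20, 100, 1000):
    print(D, 1 - (1 - eps) ** D)
# As D grows, almost all of the volume concentrates in the thin shell near the surface.
```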

The curse of dimensionality The curse of dimensionality V Figure 14: The mass distribution. Not all intuitions developed in spaces of low dimensionality will generalize to spaces of high dimensionality.

The curse of dimensionality Why can we still apply effective techniques in high-dimensional spaces? 1 Real data will often be confined to a region of the space having lower effective dimensionality. 2 Real data will typically exhibit some smoothness properties, so that for the most part small changes in the input variables will produce small changes in the target variables.

Decision theory Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory Minimizing the misclassification rate Minimizing the expected loss The reject option Inference & decision 7 Information theory

Decision theory Minimizing the misclassification rate Leave out.

Decision theory Minimizing the expected loss Leave out.

Decision theory The reject option Classification errors arise in regions where the largest of the posterior probabilities p(C_k|x) is significantly less than unity, or equivalently where the joint distributions p(x, C_k) have comparable values. In such regions it can be better to avoid making a decision; this is the reject option.

Decision theory Inference & decision Three approaches to solving decision problems Bayesian approach - generative models Model the posterior probability directly - discriminative models Directly generative class label - probabilities plays no role

Information theory Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory Key concepts Relate the key concepts to PR

Information theory Key concepts Entropy
H[x] = -\sum_x p(x) \log_2 p(x) (19)
Kullback-Leibler divergence (KL divergence)
KL(p \| q) = -\int p(x) \ln q(x) dx - \left(-\int p(x) \ln p(x) dx\right) = -\int p(x) \ln\frac{q(x)}{p(x)} dx (20)
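
For illustration, the discrete analogues of (19) and (20) can be computed directly; the distributions p and q below are made up:

```python
import numpy as np

def entropy(p):
    """H[x] = -sum_x p(x) log2 p(x), in bits; terms with p(x) = 0 contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_x p(x) ln(p(x) / q(x)); non-negative, zero iff p equals q."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(entropy(p))           # 1.75 bits
print(kl_divergence(p, q))  # > 0, since q differs from p
print(kl_divergence(p, p))  # 0.0
```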

Information theory Relate the key concepts to PR
p(x) - an unknown distribution
q(x|θ) - a parametric distribution used to approximate p(x)
x_n, for n = 1, 2, ..., N - samples drawn from p(x), so that
KL(p \| q) ≈ \frac{1}{N} \sum_{n=1}^{N} \{-\ln q(x_n|θ) + \ln p(x_n)\} (21)
Minimizing this KL divergence with respect to θ is equivalent to maximizing the likelihood function.

Thank you!