Non-parametric Methods


Non-parametric Methods. Machine Learning. Alireza Ghane / Torsten Möller.

Outline
- Machine Learning: What, Why, and How?
- Curve Fitting: Regression and Model Selection
- Decision Theory: ML, Loss Function, MAP
- Probability Theory: Probabilities and Parameter Estimation
- Kernel Density Estimation
- Nearest-neighbour
- Conclusion


Hand-written Digit Recognition. [Figure: all 63 (out of 10,000) MNIST test digits misclassified by the shape-context matching method; the text above each digit gives the example number, the true label, and the assigned label. Belongie et al., PAMI 2002.] It is difficult to hand-craft rules about digits.

Hand-written Digit Recognition. Represent the input image as a vector x_i ∈ R^784. Suppose we also have a target vector t_i, e.g. t_i = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0): this is supervised learning. With a discrete, finite label set, perhaps t_i ∈ {0, 1}^10, it is a classification problem. Given a training set {(x_1, t_1), ..., (x_N, t_N)}, the learning problem is to construct a good function y(x) from these, with y : R^784 → R^10.

Face Detection. A classification problem: t_i ∈ {0, 1, 2} for non-face, frontal face, profile face. (Schneiderman and Kanade, IJCV 2002.)

Spam Detection. A classification problem: t_i ∈ {0, 1} for non-spam, spam; x_i contains counts of words, e.g. Viagra, stock, outperform, multi-bagger.

Stock Price Prediction. Problems in which t_i is continuous are called regression. E.g. t_i is the stock price, and x_i contains company profit, debt, cash flow, gross sales, number of spam emails sent, ...

Clustering Images. Only x_i is given: unsupervised learning. E.g. x_i describes an image, and we want to find groups of similar images. (Wang et al., CVPR 2006.)

Types of Learning Problems
- Supervised learning: classification, regression
- Unsupervised learning: density estimation; clustering (k-means, mixture models, hierarchical clustering); hidden Markov models
- Reinforcement learning


An Example - Polynomial Curve Fitting. Suppose we are given a training set of N observations (x_1, ..., x_N) and (t_1, ..., t_N), with x_i, t_i ∈ R. This is a regression problem: estimate y(x) from these data.

Polynomial Curve Fitting. What form is y(x)? Let's try polynomials of degree M: y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M. This is the hypothesis space. How do we measure success? By the sum of squared errors: E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2. Among the functions in the class, choose the one that minimizes this error.
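A minimal NumPy sketch of this fit, assuming noisy sin(2πx) toy data and illustrative function names (not from the slides): it builds the polynomial design matrix and minimizes E(w) by least squares.

```python
import numpy as np

def design_matrix(x, M):
    # Columns are x^0, x^1, ..., x^M, so that y(x, w) = Phi @ w.
    return np.vander(x, M + 1, increasing=True)

def fit_polynomial(x, t, M):
    # Choose w minimizing E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 (least squares).
    w, *_ = np.linalg.lstsq(design_matrix(x, M), t, rcond=None)
    return w

def sum_squared_error(x, t, w):
    residual = design_matrix(x, len(w) - 1) @ w - t
    return 0.5 * np.sum(residual ** 2)

# Illustrative toy data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)

for M in (1, 3, 9):
    w_star = fit_polynomial(x_train, t_train, M)
    print(f"M={M}  E(w*)={sum_squared_error(x_train, t_train, w_star):.4f}")
```

With 10 datapoints, the M = 9 fit drives the training error to essentially zero, which previews the over-fitting discussion on the next slide.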

Which Degree of Polynomial? This is a model selection problem. With M = 9 the training error is E(w*) = 0: this is over-fitting.

Generalization. [Figure: training and test E_RMS versus M.] Generalization is the holy grail of ML: we want good performance on new data. Measure generalization using a separate test set, with the root-mean-squared (RMS) error E_RMS = sqrt(2 E(w*) / N).

Controlling Over-fitting: Regularization. As the order M of the polynomial increases, so do the coefficient magnitudes. Penalize large coefficients in the error function: Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2 + (λ/2) ||w||^2.


Controlling Over-fitting: Regularization. [Figure: training and test E_RMS versus ln λ.] Note the E_RMS for the training set: a perfect match of the model to the training set is a result of over-fitting. Training and test error show a similar trend.
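A hedged sketch of the regularized fit under the same toy-data assumptions as before; it uses the closed-form ridge solution w = (Φ^T Φ + λI)^{-1} Φ^T t, which minimizes Ẽ(w) for the polynomial design matrix Φ (the λ values are arbitrary illustrations).

```python
import numpy as np

def fit_polynomial_regularized(x, t, M, lam):
    # Minimize E~(w) = 1/2*||Phi w - t||^2 + lam/2*||w||^2.
    # Closed form: w = (Phi^T Phi + lam*I)^(-1) Phi^T t.
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

# Tiny lambda is close to the unregularized fit (huge coefficients, over-fitting);
# larger lambda shrinks the coefficients.
for lam in (1e-8, 1e-3, 1.0):
    w = fit_polynomial_regularized(x, t, M=9, lam=lam)
    print(f"lambda={lam:g}  max|w|={np.max(np.abs(w)):.2f}")
```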

Over-fitting: Dataset Size. With more data, a more complex model (M = 9) can be fit. Rule of thumb: have about 10 datapoints for each parameter.

Validation Set. Split the training data into a training set and a validation set. Train different models (e.g. different-order polynomials) on the training set, and choose the model (e.g. the order of the polynomial) with minimum error on the validation set.

Cross-validation. [Figure: S = 4 cross-validation runs, each holding out a different fold.] Data are often limited. Cross-validation creates S groups of the data, uses S - 1 of them to train, and the remaining one to validate. The extreme case, leave-one-out cross-validation (LOO-CV), sets S to the number of training data points. Cross-validation is an effective method for model selection, but it can be slow: models with multiple complexity parameters require an exponential number of runs.
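A sketch of S-fold cross-validation for choosing the polynomial degree, under the same illustrative toy-data assumptions as before (the fold count, seed, and degree range are arbitrary choices, not the slides').

```python
import numpy as np

def fit_poly(x, t, M):
    w, *_ = np.linalg.lstsq(np.vander(x, M + 1, increasing=True), t, rcond=None)
    return w

def cv_error(x, t, M, S=5):
    # Average validation RMS error over S folds for a degree-M polynomial.
    folds = np.array_split(np.arange(len(x)), S)
    errs = []
    for s in range(S):
        val = folds[s]
        train = np.concatenate([folds[j] for j in range(S) if j != s])
        w = fit_poly(x[train], t[train], M)
        pred = np.vander(x[val], M + 1, increasing=True) @ w
        errs.append(np.sqrt(np.mean((pred - t[val]) ** 2)))
    return np.mean(errs)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)

scores = {M: cv_error(x, t, M) for M in range(10)}
best_M = min(scores, key=scores.get)
w_final = fit_poly(x, t, best_M)   # retrain the chosen model on all training data
print("selected degree:", best_M)
```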

Summary. We want models that generalize to new data. Train the model on the training set and measure performance on a held-out test set; performance on the test set is a good estimate of performance on new data.

Summary - Model Selection. Which model should we use, e.g. which degree of polynomial? Training-set error is lower with a more complex model, so we can't just choose the model with the lowest training error. Peeking at the test error is also unfair, e.g. picking the polynomial with the lowest test error: performance on the test set would then no longer be a good estimate of performance on new data.


Summary - Solutions I. Use a validation set: train models on the training set (e.g. different-degree polynomials), measure performance on the held-out validation set, and measure the performance of the chosen model on the held-out test set. If there is little data and lots of time, cross-validation on the training set can replace a separate validation set: choose the model with the lowest error over all cross-validation folds (e.g. the polynomial degree), then retrain that model using all the training data (e.g. the polynomial coefficients).


Summary - Solutions II. Use regularization: train a complex model (e.g. a high-order polynomial) but penalize it for being too complex (e.g. large weight magnitudes). The balance between error and regularization (λ) is chosen using cross-validation. Alternatively, get more data.



Decision Theory. For a sample x, decide which class C_k it is from. Ideas: maximum likelihood; minimum loss/cost (e.g. misclassification rate); maximum a posteriori (MAP).

Decision: Maximum Likelihood. Inference step: determine statistics from the training data, p(x, t) or p(x | C_k). Decision step: determine the optimal t for a test input x: t* = arg max_k p(x | C_k), where p(x | C_k) is the likelihood.


Decision: Minimum Misclassification Rate. [Figure: joint densities p(x, C_1) and p(x, C_2) over decision regions R_1 and R_2.] p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx, and in general p(mistake) = Σ_k Σ_{j ≠ k} ∫_{R_j} p(x, C_k) dx. Here x̂ is a decision boundary and x* is the optimal decision boundary: x* = arg min p(mistake).


Decision: Minimum Loss/Cost. Minimizing the misclassification rate: R* = arg min_{{R_i}, i ∈ {1,...,K}} Σ_k Σ_j L(R_j, C_k). Minimizing a weighted loss/cost function: R* = arg min_{{R_i}, i ∈ {1,...,K}} Σ_k Σ_j W_{j,k} L(R_j, C_k). This is useful when the populations of the classes are different, or when the cost of failure is non-symmetric.
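For a single test point this amounts to choosing the decision j that minimizes the expected loss Σ_k W_{j,k} p(C_k | x). A small sketch with hypothetical posteriors and a made-up cost matrix W (both are assumptions for illustration, not values from the slides):

```python
import numpy as np

# Hypothetical posteriors p(C_k | x) for one test point, K = 2 classes.
posterior = np.array([0.7, 0.3])

# W[j, k]: cost of deciding class j when the true class is k (made-up values);
# here deciding class 0 when the truth is class 1 is ten times worse.
W = np.array([[0.0, 10.0],
              [1.0,  0.0]])

expected_loss = W @ posterior            # expected cost of each possible decision
decision = int(np.argmin(expected_loss))

# 0-1 loss (misclassification rate) would pick argmax of the posterior, class 0;
# the asymmetric cost matrix flips the decision to class 1.
print(expected_loss, decision)
```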

Decision: Maximum A Posteriori (MAP). Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B), so p(C_k | x) ∝ p(x | C_k) p(C_k), i.e. posterior ∝ likelihood × prior. This provides an a posteriori belief about the estimate, rather than a single point estimate, and it can utilize a priori information in the decision.


Coin Tossing. Let's say you're given a coin, and you want to find out P(heads), the probability that if you flip it, it lands heads. Flip it a few times: H H T, giving P(heads) = 2/3. Hmm... is this rigorous? Does this make sense?


Coin Tossing - Model. Bernoulli distribution: P(heads) = µ, P(tails) = 1 - µ. Assume the coin flips are independent and identically distributed (i.i.d.), i.e. all are separate samples from the Bernoulli distribution. Given data D = {x_1, ..., x_N}, with heads x_i = 1 and tails x_i = 0, the likelihood of the data is p(D | µ) = Π_{n=1}^{N} p(x_n | µ) = Π_{n=1}^{N} µ^{x_n} (1 - µ)^{1 - x_n}.

Maximum Likelihood Estimation. Given D with h heads and t tails, what should µ be? Maximum likelihood estimation (MLE): choose the µ which maximizes the likelihood of the data, µ_ML = arg max_µ p(D | µ). Since ln(·) is monotone increasing, µ_ML = arg max_µ ln p(D | µ).

Maximum Likelihood Estimation. Likelihood: p(D | µ) = Π_{n=1}^{N} µ^{x_n} (1 - µ)^{1 - x_n}. Log-likelihood: ln p(D | µ) = Σ_{n=1}^{N} x_n ln µ + (1 - x_n) ln(1 - µ). Take the derivative and set it to 0: d/dµ ln p(D | µ) = Σ_{n=1}^{N} x_n (1/µ) - (1 - x_n) (1/(1 - µ)) = h/µ - t/(1 - µ) = 0, which gives µ_ML = h / (t + h).

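A small sketch checking this result on the H H T example from the coin-tossing slide; the grid search is only a sanity check and the helper name is illustrative.

```python
import numpy as np

def bernoulli_log_likelihood(mu, x):
    # ln p(D | mu) = sum_n x_n ln(mu) + (1 - x_n) ln(1 - mu)
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

x = np.array([1, 1, 0])              # H H T
h, t = x.sum(), len(x) - x.sum()
mu_ml = h / (h + t)                  # closed form: h / (t + h) = 2/3
print("mu_ML =", mu_ml)

# Sanity check: the closed form agrees with a grid maximization.
grid = np.linspace(0.01, 0.99, 99)
print("grid maximizer ~", grid[np.argmax([bernoulli_log_likelihood(m, x) for m in grid])])
```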

Bayesian Learning. Wait, does this make sense? What if I flip once and it lands heads: do I believe µ = 1? Learn µ the Bayesian way: P(µ | D) = P(D | µ) P(µ) / P(D), i.e. posterior ∝ likelihood × prior. The prior encodes the knowledge that most coins are 50-50. A conjugate prior makes the math simpler and gives an easy interpretation; for the Bernoulli, the Beta distribution is its conjugate.


Beta Distribution. We will use the Beta distribution to express our prior knowledge about coins: Beta(µ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] µ^{a-1} (1 - µ)^{b-1}, where the ratio of Gamma functions is the normalization constant. The parameters a and b control the shape of this distribution.

Posterior. P(µ | D) ∝ P(D | µ) P(µ) ∝ [Π_{n=1}^{N} µ^{x_n} (1 - µ)^{1 - x_n}] µ^{a-1} (1 - µ)^{b-1} = µ^h (1 - µ)^t µ^{a-1} (1 - µ)^{b-1} = µ^{h+a-1} (1 - µ)^{t+b-1}. The simple form of the posterior is due to the use of a conjugate prior; the parameters a and b act as extra observations. Note that as N = h + t → ∞, the prior is ignored.


Maximum A Posteriori. Given the posterior P(µ | D) we could compute a single value, known as the maximum a posteriori (MAP) estimate for µ: µ_MAP = arg max_µ P(µ | D). This is known as point estimation. However, the correct Bayesian thing to do is to use the full distribution over µ, i.e. to compute E_µ[f] = ∫ p(µ | D) f(µ) dµ. This integral is usually hard to compute.

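A sketch contrasting the ML, MAP, and posterior-mean estimates for the coin example; the Beta(2, 2) prior (favouring roughly 50-50 coins) and the single observed head are illustrative assumptions.

```python
# Beta(a, b) prior encoding the belief that most coins are roughly 50-50
# (a = b = 2 is an arbitrary illustrative choice).
a, b = 2.0, 2.0

# Observed data: a single flip that landed heads.
h, t = 1, 0

# Conjugacy: the posterior is Beta(a + h, b + t), so a and b act as extra observations.
a_post, b_post = a + h, b + t

mu_ml = h / (h + t)                               # 1.0 -- overconfident point estimate
mu_map = (a_post - 1) / (a_post + b_post - 2)     # posterior mode, approx. 0.667
mu_mean = a_post / (a_post + b_post)              # posterior mean E[mu | D] = 0.6
print(mu_ml, mu_map, mu_mean)
```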

Polynomial Curve Fitting: What We Did. Recap: the hypothesis space was polynomials of degree M, y(x, w) = w_0 + w_1 x + ... + w_M x^M, and we chose the function in the class minimizing the sum of squared errors E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2.


Curve Fitting: Probabilistic Approach. [Figure: Gaussian conditional p(t | x_0, w, β) of width 2σ around y(x_0, w).] Model the targets as Gaussian around the curve: p(t | x, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{-1}). Then ln p(t | x, w, β) = -(β/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2 + (N/2) ln β - (N/2) ln(2π), i.e. -β E(w) plus terms that are constant in w. Maximizing the log-likelihood is therefore equivalent to minimizing E(w), and we can optimize for β as well.
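A sketch of this equivalence under the earlier toy-data assumptions: w_ML is just the least-squares solution, and setting the derivative with respect to β to zero gives 1/β_ML = (1/N) Σ_n {y(x_n, w_ML) - t_n}^2.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=20)

M = 3
Phi = np.vander(x, M + 1, increasing=True)

# Maximizing the Gaussian log-likelihood over w is the same problem as
# minimizing E(w), so w_ML is the ordinary least-squares solution.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Setting d/d(beta) of the log-likelihood to zero gives
#   1 / beta_ML = (1/N) * sum_n (y(x_n, w_ML) - t_n)^2
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)
print("w_ML =", w_ml)
print("beta_ML =", beta_ml)
```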


Curve Fitting: Bayesian Approach. Likelihood: p(t | x, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{-1}). Posterior distribution: p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α). Maximizing the posterior amounts to minimizing (β/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2 + (α/2) w^T w, i.e. β E(w) plus a regularization term.


Curve Fitting: Bayesian. p(t* | x*, w, β, α) = N(t* | P_w(x*), Q_{w,β,α}(x*)); integrating out w gives the predictive distribution p(t* | x*, x, t) = N(t* | m(x*), s^2(x*)), with m(x) = φ(x)^T S Σ_{n=1}^{N} φ(x_n) t_n, s^2(x) = β^{-1} (1 + φ(x)^T S φ(x)), and S^{-1} = (α/β) I + Σ_{n=1}^{N} φ(x_n) φ(x_n)^T.
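A sketch of these predictive formulas with a polynomial basis φ(x) = (1, x, ..., x^M); the toy data and the values of α and β are illustrative assumptions, and S is parameterized as in the reconstruction above.

```python
import numpy as np

def bayesian_poly_predict(x_train, t_train, x_star, M=9, alpha=5e-3, beta=11.1):
    # m(x)   = phi(x)^T S sum_n phi(x_n) t_n
    # s^2(x) = (1/beta) * (1 + phi(x)^T S phi(x))
    # S^-1   = (alpha/beta) * I + sum_n phi(x_n) phi(x_n)^T
    Phi = np.vander(x_train, M + 1, increasing=True)        # rows are phi(x_n)^T
    S = np.linalg.inv((alpha / beta) * np.eye(M + 1) + Phi.T @ Phi)
    phi_star = np.vander(np.atleast_1d(float(x_star)), M + 1, increasing=True)[0]
    mean = phi_star @ S @ (Phi.T @ t_train)
    var = (1.0 / beta) * (1.0 + phi_star @ S @ phi_star)
    return mean, var

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

m, s2 = bayesian_poly_predict(x, t, x_star=0.5)
print(f"predictive mean {m:.3f}, predictive std {np.sqrt(s2):.3f}")
```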


Histograms. Consider the problem of modelling the distribution of brightness values in pictures taken on sunny days versus cloudy days. We could build histograms of pixel values for each class.

Histograms. E.g. for sunny days, count the number n_i of datapoints (pixels) whose brightness value falls into each bin: p_i = n_i / (N Δ_i). The estimate is sensitive to the bin width Δ_i and is discontinuous at the bin edges. In a D-dimensional space with M bins per dimension, there are M^D bins.

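A small sketch of the histogram density estimate p_i = n_i / (N Δ_i); the two-component toy data, bin count, and range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(0.3, 0.1, 400), rng.normal(0.7, 0.05, 200)])

# Histogram density estimate: p_i = n_i / (N * Delta_i) within each bin.
counts, edges = np.histogram(data, bins=20, range=(0.0, 1.0))
widths = np.diff(edges)
N = counts.sum()                     # points that fall inside the histogram range
p = counts / (N * widths)

# The estimate integrates to one but is sensitive to the bin width Delta_i
# and is discontinuous at the bin edges.
print(np.sum(p * widths))   # 1.0
```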

Local Density Estimation. In a histogram we use nearby points to estimate the density. For a small region around x, estimate the density as p(x) = K / (N V), where K is the number of points in the region, V is the volume of the region, and N is the total number of datapoints.

Kernel Density Estimation. Keep the idea of using nearby points to estimate the density, but obtain a smoother estimate: place a small bump at each datapoint, with a kernel function k(·) determining the shape of these bumps. The density estimate is p(x) ≈ (1/N) Σ_{n=1}^{N} k((x - x_n) / h).

Kernel Density Estimation. Example using a Gaussian kernel: p(x) = (1/N) Σ_{n=1}^{N} (2πh^2)^{-1/2} exp(-||x - x_n||^2 / (2h^2)).
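A sketch of the Gaussian-kernel estimate in one dimension; the mixture toy data, evaluation grid, and bandwidth h = 0.05 are illustrative assumptions.

```python
import numpy as np

def kde_gaussian(x_query, data, h):
    # p(x) = (1/N) * sum_n (2*pi*h^2)^(-1/2) * exp(-(x - x_n)^2 / (2 h^2)), 1-D case.
    diff = np.asarray(x_query)[:, None] - np.asarray(data)[None, :]
    norm = 1.0 / np.sqrt(2.0 * np.pi * h ** 2)
    return np.mean(norm * np.exp(-diff ** 2 / (2.0 * h ** 2)), axis=1)

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(0.3, 0.1, 400), rng.normal(0.7, 0.05, 200)])

xs = np.linspace(0.0, 1.0, 201)
density = kde_gaussian(xs, data, h=0.05)   # h controls the smoothness of the estimate
print(density.max())
```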


Kernel Density Estimation. Other kernels: rectangle, triangle, Epanechnikov. KDE is fast at training time but slow at test time, since all datapoints must be kept, and it is sensitive to the kernel bandwidth h.


Nearest-neighbour. Instead of relying on the kernel bandwidth to get a proper density estimate, fix the number of nearby points K: p(x) = K / (N V), where V is the volume of the smallest region around x that contains K datapoints. Note: the result is not a proper density estimate, since its integral diverges.
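A one-dimensional sketch of this estimate, taking V to be the length of the interval around x reaching out to the K-th nearest datapoint; the Gaussian toy data and K = 20 are illustrative assumptions.

```python
import numpy as np

def knn_density_1d(x_query, data, K):
    # p(x) ~= K / (N * V), with V the length of the interval around x that
    # reaches out to the K-th nearest training point (1-D case).
    data = np.asarray(data)
    out = []
    for x in np.atleast_1d(x_query):
        dist = np.sort(np.abs(data - x))
        V = 2.0 * dist[K - 1]
        out.append(K / (len(data) * V))
    return np.array(out)

rng = np.random.default_rng(6)
data = rng.normal(0.0, 1.0, 500)
print(knn_density_1d([0.0, 1.0, 3.0], data, K=20))
```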


Nearest-neighbour for Classification. K-nearest-neighbour is often used for classification. Classification: predict labels t_i from x_i; e.g. x_i ∈ R^2 and t_i ∈ {0, 1} with a 3-nearest-neighbour classifier. K = 1 is referred to as nearest-neighbour.
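A sketch of a K-nearest-neighbour classifier matching that 2-D, two-class setting; the Gaussian class clusters and the query point are illustrative assumptions.

```python
import numpy as np

def knn_classify(x_query, X_train, t_train, K=3):
    # Predict the majority label among the K nearest training points.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:K]
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(7)
X_train = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(20, 2)),    # class 0 cluster
                     rng.normal([1.5, 1.5], 0.5, size=(20, 2))])   # class 1 cluster
t_train = np.array([0] * 20 + [1] * 20)

print(knn_classify(np.array([1.0, 1.0]), X_train, t_train, K=3))
```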

Nearest-neighbour for Classification. A good baseline method. It is slow, but fancy data structures (KD-trees, locality-sensitive hashing) can be used for efficiency. It also has nice theoretical properties: as we obtain more training data points, the space becomes more densely filled with labelled data, and as N → ∞ the error is no more than twice the Bayes error.

Bayes Error. [Figure: joint densities p(x, C_1) and p(x, C_2) over regions R_1 and R_2, with a decision boundary x̂ and the optimal boundary x*.] The Bayes error is the best classification error possible given the features. With the two classes' joint densities shown, the decision rule "decide C_1 if x ≤ x̂" makes errors on the red, green, and blue regions; the optimal decision rule "decide C_1 if x ≤ x*" errs only on the green and blue regions, whose area is the Bayes error.

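Since every rule errs wherever it picks the class with the smaller joint density, the Bayes error is ∫ min_k p(x, C_k) dx. A sketch that evaluates this numerically for two illustrative Gaussian classes (the densities, priors, and grid are assumptions, not the slides' figure):

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Illustrative joint densities p(x, C_k) = p(x | C_k) * p(C_k) for two classes.
xs = np.linspace(-6.0, 8.0, 20001)
p1 = 0.5 * gauss(xs, 0.0, 1.0)     # p(x, C_1)
p2 = 0.5 * gauss(xs, 2.0, 1.0)     # p(x, C_2)

# The optimal rule decides the class with the larger joint density at each x,
# so the Bayes error is the integral of the smaller one.
dx = xs[1] - xs[0]
bayes_error = np.sum(np.minimum(p1, p2)) * dx
print(f"Bayes error ~ {bayes_error:.3f}")   # about 0.159 for this pair
```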


Conclusion. Readings: Chapter 1.1, 1.3, 1.5, 2.1.
- Types of learning problems: supervised (regression, classification), unsupervised.
- Learning as optimization: squared-error loss function, maximum likelihood (ML), maximum a posteriori (MAP).
- We want generalization and must avoid over-fitting: cross-validation, regularization, Bayesian priors on model parameters.

Conclusion. Readings: Ch. 2.5.
- Kernel density estimation: model the density p(x) using kernels around the training datapoints.
- Nearest neighbour: model the density, or perform classification, using the nearest training datapoints.
- Multivariate Gaussian: needed for next week's lectures; if you need a refresher, read pp. 78-81.