Machine Learning Lecture 2

Machine Learning - Lecture 2: Probability Density Estimation
16.10.2017
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de
leibe@vision.rwth-aachen.de

Announcements
Exceptional number of lecture participants this year. Current count: 449 participants. This is very nice, but it stretches our resources to their limits.
- Monday lecture slot: shifted to 8:30-10:00 in AH IV (276 seats). We will monitor the situation and take further action if the space is not sufficient.
- Thursday lecture slot: will stay at 14:15-15:45 in H02 (C.A.R.L, 786 seats).
- Exercises (non-mandatory): we will try to offer corrections, but we will have to see how to handle those numbers.

Announcements
- L2P electronic repository: slides, exercises, and supplementary material will be made available here. Lecture recordings will be uploaded 2-3 days after the lecture.
- Course webpage: http://www.vision.rwth-aachen.de/courses/ (slides will also be made available on the webpage).
- Please subscribe to the lecture on the Campus system! This is important to get email announcements and L2P access.

Course Outline
- Fundamentals: Probability Density Estimation
- Classification Approaches: Linear Discriminants, Support Vector Machines, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns
- Deep Learning: Foundations, Convolutional Neural Networks, Recurrent Neural Networks

Topics of This Lecture
- Basic concepts: minimizing the misclassification rate, minimizing the expected loss, discriminant functions
- Probability Density Estimation: general concepts, Gaussian distribution
- Parametric Methods: Maximum Likelihood approach, Bayesian vs. Frequentist views on probability, Bayesian Learning

"The theory of inverse probability is founded upon an error, and must be wholly rejected." (R.A. Fisher, 1925)
(Portrait: Thomas Bayes, 1701-1761. Image source: Wikipedia)

Example: Handwritten Character Recognition
Goal: classify a new letter such that the probability of misclassification is minimized.

Concept 1: Priors (a priori probabilities)
What we can tell about the probability before seeing the data.
Example: two classes, C_1 = "a" and C_2 = "b", with priors p(C_1) = 0.75 and p(C_2) = 0.25.
In general: \sum_k p(C_k) = 1.

Concept 2: Conditional probabilities
Let x be a feature vector. x measures/describes certain properties of the input, e.g. number of black pixels, aspect ratio, ...
p(x|C_k) describes the likelihood of x for class C_k.
(Figure: class-conditional densities p(x|a) and p(x|b) plotted over x.)

Example: at x = 15, p(x|b) is much smaller than p(x|a), so the decision should be "a" here.
At x = 25, p(x|a) is much smaller than p(x|b), so the decision should be "b" here.
At x = 20, which class? Remember that p(a) = 0.75 and p(b) = 0.25, i.e. the decision should again be "a".
How can we formalize this?
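The slide reads the likelihood values off a figure; as a minimal sketch of the same reasoning, the snippet below assumes illustrative Gaussian class-conditional densities for "a" and "b" (the means and spreads are invented for the example, not taken from the lecture) and compares the weighted likelihoods p(x|C_k) p(C_k), anticipating the formalization via Bayes' theorem on the next slides.

```python
from scipy.stats import norm

# Priors from the slide: p(a) = 0.75, p(b) = 0.25.
priors = {"a": 0.75, "b": 0.25}

# Assumed class-conditional densities p(x|C_k); the parameters are
# illustrative only and are not taken from the lecture figure.
likelihoods = {"a": norm(loc=15.0, scale=4.0),
               "b": norm(loc=25.0, scale=4.0)}

def decide(x):
    """Pick the class with the larger weighted likelihood p(x|C_k) * p(C_k)."""
    scores = {k: likelihoods[k].pdf(x) * priors[k] for k in priors}
    return max(scores, key=scores.get), scores

for x in (15.0, 25.0, 20.0):
    cls, scores = decide(x)
    print(f"x = {x:4.1f}  p(x|a)p(a) = {scores['a']:.4f}  "
          f"p(x|b)p(b) = {scores['b']:.4f}  -> decide '{cls}'")
```

At x = 20 the two (assumed) likelihoods are equal, so the prior tips the decision towards "a", exactly as argued on the slide.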

Concept 3: Posterior probabilities
We are typically interested in the a posteriori probability, i.e. the probability of class C_k given the measurement vector x. Bayes' theorem:

p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)} = \frac{p(x|C_k)\, p(C_k)}{\sum_i p(x|C_i)\, p(C_i)}

Interpretation: Posterior = (Likelihood × Prior) / Normalization Factor.
(Figure: likelihoods p(x|a), p(x|b); the weighted versions p(x|a)p(a), p(x|b)p(b); and the posteriors p(a|x), p(b|x) with the resulting decision boundary.)

Bayesian Decision Theory
Goal: minimize the probability of a misclassification,

p(\text{error}) = \int_{R_1} p(C_2|x)\, p(x)\, dx + \int_{R_2} p(C_1|x)\, p(x)\, dx

(In the figure, the green and blue regions stay constant; only the size of the red error region varies with the decision boundary.)
Optimal decision rule: decide for C_1 if p(C_1|x) > p(C_2|x).
This is equivalent to p(x|C_1) p(C_1) > p(x|C_2) p(C_2),
which is again equivalent to the likelihood-ratio test

\frac{p(x|C_1)}{p(x|C_2)} > \frac{p(C_2)}{p(C_1)}

with the right-hand side acting as the decision threshold.

Generalization to More Than 2 Classes
Decide for class k whenever it has the greatest posterior probability of all classes:
p(C_k|x) > p(C_j|x) for all j ≠ k,
equivalently p(x|C_k) p(C_k) > p(x|C_j) p(C_j) for all j ≠ k.
Likelihood-ratio test: p(x|C_k) / p(x|C_j) > p(C_j) / p(C_k) for all j ≠ k.

Classifying with Loss Functions
Generalization to decisions with a loss function: differentiate between the possible decisions and the possible true classes.
Example: medical diagnosis. Decisions: sick or healthy (or: further examination necessary). Classes: patient is sick or healthy.
The cost may be asymmetric: loss(decision = healthy | patient = sick) >> loss(decision = sick | patient = healthy).
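As a small numerical sketch (not part of the lecture), the error integral above can be approximated on a grid for an assumed pair of Gaussian class-conditionals; all parameters below are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

# Assumed two-class setup (parameters invented for illustration).
p1, p2 = 0.75, 0.25
joint1 = lambda x: norm(15.0, 4.0).pdf(x) * p1    # p(x, C_1) = p(x|C_1) p(C_1)
joint2 = lambda x: norm(25.0, 4.0).pdf(x) * p2    # p(x, C_2) = p(x|C_2) p(C_2)

xs = np.linspace(0.0, 40.0, 4001)
dx = xs[1] - xs[0]
j1, j2 = joint1(xs), joint2(xs)

def p_error(boundary):
    """p(error) = int_{R_1} p(x, C_2) dx + int_{R_2} p(x, C_1) dx,
    with R_1 = {x < boundary} and R_2 = {x >= boundary}."""
    in_R1 = xs < boundary
    return j2[in_R1].sum() * dx + j1[~in_R1].sum() * dx

# Bayes-optimal boundary: where the joint densities cross (between the means here).
between = (xs > 15.0) & (xs < 25.0)
bayes_boundary = xs[between][np.argmin(np.abs(j1 - j2)[between])]

for b in (bayes_boundary, 18.0, 25.0):
    print(f"boundary = {b:5.2f}  p(error) = {p_error(b):.4f}")
```

The minimum is attained at the crossing point of the weighted likelihoods; placing the boundary anywhere else only enlarges the error ("red") region, matching the likelihood-ratio rule above.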

Classifying with Loss Functions
In general, we can formalize this by introducing a loss matrix L_kj:
L_kj = loss for decision C_j if the truth is C_k.
Loss functions may be different for different actors. Example: a stock trader and a bank deciding whether to invest in a subprime asset (decisions: invest / don't invest) will use different loss matrices L_stocktrader(subprime) and L_bank(subprime), with entries depending on the expected gain c_gain. Example: cancer diagnosis, where the loss matrix is strongly asymmetric.
Different loss functions may lead to different Bayes optimal strategies.

Minimizing the Expected Loss
The optimal solution is the one that minimizes the loss. But the loss function depends on the true class, which is unknown.
Solution: minimize the expected loss.
Example: 2 classes C_1, C_2 and 2 decisions α_1, α_2, with loss function L(α_j|C_k) = L_kj.
Expected loss (= risk R) for the two decisions:

R(\alpha_j|x) = \sum_k L_{kj}\, p(C_k|x)

Goal: decide such that the expected loss is minimized, i.e. decide α_1 if R(α_2|x) > R(α_1|x).
This can be done by choosing the regions R_j accordingly, which is easy to do once we know the posterior class probabilities p(C_k|x):

R(\alpha_2|x) > R(\alpha_1|x)
L_{12}\, p(C_1|x) + L_{22}\, p(C_2|x) > L_{11}\, p(C_1|x) + L_{21}\, p(C_2|x)
(L_{12} - L_{11})\, p(C_1|x) > (L_{21} - L_{22})\, p(C_2|x)
\frac{L_{12} - L_{11}}{L_{21} - L_{22}} > \frac{p(C_2|x)}{p(C_1|x)} = \frac{p(x|C_2)\, p(C_2)}{p(x|C_1)\, p(C_1)}
\frac{p(x|C_1)}{p(x|C_2)} > \frac{(L_{21} - L_{22})\, p(C_2)}{(L_{12} - L_{11})\, p(C_1)}

This is the adapted decision rule, taking the loss into account.

The Reject Option
Classification errors arise from regions where the largest posterior probability p(C_k|x) is significantly less than 1. These are the regions where we are relatively uncertain about class membership. For some applications, it may be better to reject the automatic decision entirely in such a case and, e.g., consult a human expert.
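A minimal sketch of this rule in code, with an invented asymmetric loss matrix in the spirit of the medical example (the loss entries, posteriors, and reject threshold below are assumptions for illustration, not values from the lecture):

```python
import numpy as np

# Assumed loss matrix L[k, j] = loss for decision alpha_j when the truth is C_k.
# Rows: true class (C_1 = sick, C_2 = healthy); columns: decisions (sick, healthy).
# Missing a sick patient is penalized much more heavily (asymmetric loss).
L = np.array([[0.0, 100.0],
              [1.0,   0.0]])

def expected_losses(posterior):
    """R(alpha_j|x) = sum_k L[k, j] * p(C_k|x), for every decision alpha_j."""
    return L.T @ posterior

def decide(posterior, reject_threshold=0.8):
    """Choose the decision with minimal expected loss; defer to a human
    expert (reject option) when the largest posterior is too small."""
    if posterior.max() < reject_threshold:
        return "reject"
    return f"alpha_{np.argmin(expected_losses(posterior)) + 1}"

for p_sick in (0.9, 0.3, 0.02):
    post = np.array([p_sick, 1.0 - p_sick])      # [p(C_1|x), p(C_2|x)]
    print(f"p(sick|x) = {p_sick:.2f}  risks = {expected_losses(post)}  -> {decide(post)}")
```

Note how the asymmetric loss can make the classifier decide "sick" even when p(sick|x) is small, and how uncertain cases are deferred via the reject option.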

Discriminant Functions
Formulate classification in terms of comparisons: discriminant functions y_1(x), ..., y_K(x).
Classify x as class C_k if y_k(x) > y_j(x) for all j ≠ k.
Examples: y_k(x) = p(C_k|x); y_k(x) = p(x|C_k) p(C_k); y_k(x) = log p(x|C_k) + log p(C_k).

Different Views on the Decision Problem
- Generative methods: y_k(x) ∝ p(x|C_k) p(C_k). First determine the class-conditional densities for each class individually and separately infer the prior class probabilities; then use Bayes' theorem to determine class membership.
- Discriminative methods: y_k(x) = p(C_k|x). First solve the inference problem of determining the posterior class probabilities; then use decision theory to assign each new x to its class.
- Alternative: directly find a discriminant function y_k(x) which maps each input x directly onto a class label.

Topics of This Lecture
Probability Density Estimation: general concepts, Gaussian distribution. Parametric Methods: Maximum Likelihood approach, Bayesian vs. Frequentist views on probability, Bayesian Learning.

Probability Density Estimation
Up to now: Bayes optimal classification, based on the probabilities p(x|C_k) p(C_k). How can we estimate (= learn) those probability densities?
Supervised training case: data and class labels are known. Estimate the probability density for each class C_k separately: p(x|C_k). (For simplicity of notation, we will drop the class label C_k in the following.)
Data: x_1, x_2, x_3, x_4, ...  Estimate: p(x).
Methods: parametric representations (today), non-parametric representations (lecture 3), mixture models (lecture 4).

The Gaussian (or Normal) Distribution
One-dimensional case, with mean μ and variance σ²:

N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}

Multi-dimensional case, with mean μ and covariance Σ:

N(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right\}
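A small sketch that evaluates the multivariate density exactly as written above (numpy only; scipy.stats.multivariate_normal would give the same values; the mean, covariance, and query points are made up):

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Evaluate N(x | mu, Sigma) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
    / ((2 pi)^{D/2} |Sigma|^{1/2}) for a D-dimensional x."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    cov = np.asarray(cov, float)
    D = mu.shape[0]
    diff = x - mu
    exponent = -0.5 * diff @ np.linalg.solve(cov, diff)   # quadratic form in the exponent
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(exponent) / norm_const

# Example with made-up parameters: a 2D Gaussian with correlated dimensions.
mu = np.array([1.0, 2.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
print(gaussian_density([1.5, 1.5], mu, cov))

# The 1D formula is the special case D = 1, Sigma = [[sigma^2]]:
print(gaussian_density([0.0], [0.0], [[1.0]]))   # = 1/sqrt(2*pi) ~ 0.3989
```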

Gaussian Distribution Properties
Central Limit Theorem: the distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. In practice, the convergence to a Gaussian can be very rapid. This makes the Gaussian interesting for many applications. Example: the sum of N uniform [0,1] random variables.

Quadratic form: N(x|μ, Σ) depends on x only through the exponent

\Delta^2 = (x-\mu)^T \Sigma^{-1} (x-\mu)

Here, Δ is often called the Mahalanobis distance from x to μ.

Shape of the Gaussian: Σ is a real, symmetric matrix. We can therefore decompose it into its eigenvectors,

\Sigma = \sum_{i=1}^{D} \lambda_i u_i u_i^T

and thus obtain \Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i} with y_i = u_i^T (x-\mu).
The density is constant on ellipsoids with main directions along the eigenvectors u_i and scaling factors \sqrt{\lambda_i}.

The marginals of a Gaussian are again Gaussians.

Special cases:
- Full covariance matrix Σ = [σ_ij]: general ellipsoid shape.
- Diagonal covariance matrix Σ = diag{σ_i}: axis-aligned ellipsoid.
- Uniform variance Σ = σ² I: hypersphere.

Parametric Methods
Given: data X = {x_1, x_2, ..., x_N} and a parametric form of the distribution with parameters θ, e.g. for the Gaussian distribution θ = (μ, σ).
Learning = estimation of the parameters θ.
Likelihood of θ: the probability that the data X have indeed been generated from a probability density with parameters θ,

L(\theta) = p(X|\theta)

(Slide adapted from Bernt Schiele.)
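The eigendecomposition identity above can be checked numerically; the following sketch uses a made-up covariance matrix and query point and verifies that the quadratic form and the sum over eigen-directions give the same Mahalanobis distance:

```python
import numpy as np

# Made-up symmetric covariance matrix, mean, and query point.
cov = np.array([[4.0, 1.2],
                [1.2, 2.0]])
mu = np.array([0.0, 0.0])
x = np.array([2.0, -1.0])

diff = x - mu

# Mahalanobis distance via the quadratic form in the exponent.
delta_sq = diff @ np.linalg.solve(cov, diff)

# Same quantity via the eigendecomposition Sigma = sum_i lambda_i u_i u_i^T:
# Delta^2 = sum_i y_i^2 / lambda_i  with  y_i = u_i^T (x - mu).
lam, U = np.linalg.eigh(cov)          # eigenvalues lambda_i, eigenvectors as columns
y = U.T @ diff
delta_sq_eig = np.sum(y ** 2 / lam)

print(delta_sq, delta_sq_eig)          # the two values agree
# The contours of constant density are ellipsoids whose axes point along
# the eigenvectors u_i, with lengths proportional to sqrt(lambda_i).
```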

Maximum Likelihood Approach
Computation of the likelihood:
- Single data point: p(x_n|θ).
- Assumption: all data points are independent, so the likelihood is

L(\theta) = p(X|\theta) = \prod_{n=1}^{N} p(x_n|\theta)

- Negative log-likelihood:

E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n|\theta)

Estimation of the parameters θ (learning): we want to obtain θ̂ such that L(θ̂) is maximized, i.e. maximize the likelihood or, equivalently, minimize the negative log-likelihood.

Minimizing the log-likelihood: how do we minimize a function? Take the derivative and set it to zero:

\frac{\partial}{\partial\theta} E(\theta) = -\sum_{n=1}^{N} \frac{\partial}{\partial\theta} \ln p(x_n|\theta) = -\sum_{n=1}^{N} \frac{\frac{\partial}{\partial\theta} p(x_n|\theta)}{p(x_n|\theta)} \stackrel{!}{=} 0

Log-likelihood for the Normal distribution (1D case), with p(x_n|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{\|x_n-\mu\|^2}{2\sigma^2}}:

E(\theta) = -\sum_{n=1}^{N} \ln p(x_n|\mu, \sigma) = -\sum_{n=1}^{N} \ln\left( \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\|x_n-\mu\|^2}{2\sigma^2} \right\} \right)

\frac{\partial}{\partial\mu} E(\mu, \sigma) = -\sum_{n=1}^{N} \frac{\frac{\partial}{\partial\mu} p(x_n|\mu, \sigma)}{p(x_n|\mu, \sigma)} = -\sum_{n=1}^{N} \frac{x_n - \mu}{\sigma^2} = -\frac{1}{\sigma^2} \left( \sum_{n=1}^{N} x_n - N\mu \right)

Setting this to zero, we thus obtain the sample mean

\hat\mu = \frac{1}{N} \sum_{n=1}^{N} x_n

and, in a similar fashion, the sample variance

\hat\sigma^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat\mu)^2

θ̂ = (μ̂, σ̂) is the Maximum Likelihood estimate for the parameters of a Gaussian distribution. This is a very important result. Unfortunately, it is wrong... or not wrong, but rather biased.

Assume the samples x_1, x_2, ..., x_N come from a true Gaussian distribution with mean μ and variance σ². We can then compute the expectations of the ML estimates with respect to the data set values. It can be shown that

E(\mu_{ML}) = \mu \qquad E(\sigma^2_{ML}) = \frac{N-1}{N}\, \sigma^2

The ML estimate will underestimate the true variance. Corrected estimate:

\tilde\sigma^2 = \frac{N}{N-1}\, \sigma^2_{ML} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \hat\mu)^2
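A quick numerical sketch of the (N-1)/N bias (the true parameters are made up; the expectation is approximated by averaging over many repeated small samples):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, N, trials = 5.0, 2.0, 5, 200_000

ml_vars = np.empty(trials)
for t in range(trials):
    x = rng.normal(mu_true, sigma_true, size=N)
    mu_hat = x.mean()                          # ML estimate: sample mean
    ml_vars[t] = np.mean((x - mu_hat) ** 2)    # ML estimate: sample variance (biased)

print("E[sigma_ML^2]      ~", ml_vars.mean())                  # ~ (N-1)/N * sigma^2
print("(N-1)/N * sigma^2  =", (N - 1) / N * sigma_true ** 2)   # = 3.2
print("corrected estimate ~", ml_vars.mean() * N / (N - 1))    # ~ sigma^2 = 4.0
```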

Maximum Likelihood: Limitations
Maximum Likelihood has several significant limitations:
- It systematically underestimates the variance of the distribution! E.g., consider the case N = 1, X = {x_1}. The maximum-likelihood estimate is μ̂ = x_1 and σ̂ = 0!
- We say ML overfits to the observed data.
- We will still often use ML, but it is important to know about this effect.

Deeper Reason
Maximum Likelihood is a Frequentist concept: in the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available.
This is in contrast to the Bayesian interpretation: in the Bayesian view, probabilities quantify the uncertainty about certain states or events. This uncertainty can be revised in the light of new evidence.
Bayesians and Frequentists do not like each other too well...
(Slide adapted from Bernt Schiele.)

Bayesian vs. Frequentist View
To see the difference, suppose we want to estimate the uncertainty whether the Arctic ice cap will have disappeared by the end of the century. This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times. In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting. If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior:

Posterior ∝ Likelihood × Prior

This generally allows us to get better uncertainty estimates for many situations.
Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse.

Bayesian Approach to Parameter Learning
Conceptual shift: Maximum Likelihood views the true parameter vector θ as unknown, but fixed. In Bayesian learning, we consider θ to be a random variable.
This allows us to use knowledge about the parameters θ, i.e. to use a prior for θ. Training data then converts this prior distribution on θ into a posterior probability density. The prior thus encodes the knowledge we have about the type of distribution we expect to see for θ.
(Slide adapted from Bernt Schiele.)

Bayesian Learning
Bayesian Learning is an important concept. However, it would lead too far here; I will introduce it in more detail in the Advanced ML lecture.

References and Further Reading
More information can be found in Bishop's book:
- Gaussian distribution and ML: Ch. 1.2.4 and 2.3.1-2.3.4
- Bayesian Learning: Ch. 1.2.3 and 2.3.6
- Nonparametric methods: Ch. 2.5
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
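Bayesian learning itself is deferred to the Advanced ML lecture, but as a small, self-contained taste of the prior-to-posterior conversion described above, here is a hedged sketch of the standard conjugate update for the mean of a Gaussian with known variance (the setting treated in Bishop, Ch. 2.3.6; all numbers below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Known observation noise and a made-up "true" mean used only to generate data.
sigma = 1.0
mu_true = 2.0
x = rng.normal(mu_true, sigma, size=10)
N = len(x)

# Prior over the unknown mean: p(mu) = N(mu | mu_0, sigma_0^2).
mu_0, sigma_0_sq = 0.0, 4.0

# Posterior p(mu | X) = N(mu | mu_N, sigma_N^2):
#   1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
#   mu_N = sigma_N^2 * (mu_0/sigma_0^2 + sum(x)/sigma^2)
sigma_N_sq = 1.0 / (1.0 / sigma_0_sq + N / sigma ** 2)
mu_N = sigma_N_sq * (mu_0 / sigma_0_sq + x.sum() / sigma ** 2)

print("ML estimate of the mean :", x.mean())
print("posterior mean mu_N     :", mu_N)        # pulled slightly toward the prior mu_0
print("posterior variance      :", sigma_N_sq)  # shrinks as N grows
```

With little data the posterior mean stays close to the prior; as N grows it approaches the ML estimate and the posterior variance shrinks.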