Machine Learning Lecture 1

Machine Learning Lecture 1 Introduction 18.04.2016 Bastian Leibe, RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Many slides adapted from B. Schiele

Organization Lecturer Prof. Bastian Leibe (leibe@vision.rwth-aachen.de) Assistants Alexander Hermans (hermans@vision.rwth-aachen.de) Aljosa Osep (osep@vision.rwth-aachen.de) Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on the webpage There is also an L2P electronic repository Please subscribe to the lecture on the Campus system! Important to get email announcements and L2P access! 2

Language The official course language will be English if at least one English-speaking student is present. If not, you can choose. However: Please tell me when I'm talking too fast or when I should repeat something in German for better understanding! You may at any time ask questions in German! You may turn in your exercises in German. You may answer exam questions in German. 3

Organization Structure: 3V (lecture) + 1Ü (exercises) 6 EECS credits Part of the area Applied Computer Science Place & Time Lecture: Mon 16:15-17:45, room UMIC 025 Lecture/Exercises: Tue 16:15-17:45, room UMIC 025 Exam Written exam 1st try: Thu 18.08., 14:00-17:30 2nd try: Fri 16.09., 14:00-17:30 4

Exercises and Supplementary Material Exercises Typically 1 exercise sheet every 2 weeks. Pen & paper and Matlab-based exercises Hands-on experience with the algorithms from the lecture. Send your solutions the night before the exercise class. Need to reach 50% of the points to qualify for the exam! Teams are encouraged! You can form teams of up to 3 people for the exercises. Each team should only turn in one solution. But list the names of all team members in the submission. 5

Course Webpage Exercise on Tuesday http://www.vision.rwth-aachen.de/teaching/ 6

Textbooks Most lecture topics will be covered in Bishop's book. Some additional topics can be found in Duda & Hart. Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 2006 (available in the library's "Handapparat") R.O. Duda, P.E. Hart, D.G. Stork Pattern Classification 2nd Ed., Wiley-Interscience, 2000 Research papers will be given out for some topics. Tutorials and deeper introductions. Application papers 7

How to Find Us Office: UMIC Research Centre Mies-van-der-Rohe-Strasse 15, room 124 Office hours If you have questions about the lecture, come to Alex or Aljosa. My regular office hours will be announced (additional slots are available upon request) Send us an email beforehand to confirm a time slot. Questions are welcome! 8

Machine Learning Statistical Machine Learning Principles, methods, and algorithms for learning and prediction on the basis of past evidence Already everywhere Speech recognition (e.g. Siri) Computer vision (e.g. face detection) Hand-written character recognition (e.g. letter delivery) Information retrieval (e.g. image & video indexing) Operating systems (e.g. caching) Fraud detection (e.g. credit cards) Text filtering (e.g. email spam filters) Game playing (e.g. strategy prediction) Robotics (everywhere) Slide credit: Bernt Schiele 9

Machine Learning Goal Machines that learn to perform a task from experience Why? Crucial component of every intelligent/autonomous system Important for a system s adaptability Important for a system s generalization capabilities Attempt to understand human learning Slide credit: Bernt Schiele 10

Machine Learning: Core Questions Learning to perform a task from experience Learning Most important part here! We do not want to encode the knowledge ourselves. The machine should learn the relevant criteria automatically from past observations and adapt to the given situation. Tools Statistics Probability theory Decision theory Information theory Optimization theory Slide credit: Bernt Schiele 11

Machine Learning: Core Questions Learning to perform a task from experience Task Can often be expressed through a mathematical function y = f(x; w) x: Input y: Output w: Parameters (this is what is "learned") Classification vs. Regression Regression: continuous y Classification: discrete y, e.g. class membership, sometimes also the posterior probability Slide credit: Bernt Schiele 12
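
A minimal sketch (not from the slides) of how the same parametric form y = f(x; w) covers both settings: a linear function produces a continuous regression output, and an argmax over per-class scores produces a discrete classification output. All weight values below are made up for illustration.

```python
import numpy as np

def f(x, w):
    """Parametric model y = f(x; w): here simply a linear function of the input."""
    return float(np.dot(w, x))

x = np.array([1.0, 0.5, -2.0])              # input features
w_regression = np.array([0.3, -1.2, 0.8])   # learned parameters (made-up values)

# Regression: the continuous output itself is the prediction.
y_regression = f(x, w_regression)

# Classification: one score per class, the predicted class is the argmax.
W_classification = np.array([[0.3, -1.2, 0.8],    # parameters of class 0 (made up)
                             [-0.5, 0.7, 0.1]])   # parameters of class 1 (made up)
scores = W_classification @ x
y_class = int(np.argmax(scores))

print(f"regression output y = {y_regression:.2f}")
print(f"classification output y = class {y_class}")
```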

Example: Regression Automatic control of a vehicle x → f(x; w) → y Slide credit: Bernt Schiele 13

Examples: Classification Email filtering: x ∈ [a-z]* (the message text), y ∈ {important, spam} Character recognition Speech recognition Slide credit: Bernt Schiele 14

Machine Learning: Core Problems Input x: Features Invariance to irrelevant input variations Selecting the right features is crucial Encoding and use of domain knowledge Higher-dimensional features are more discriminative. Curse of dimensionality Complexity increases exponentially with number of dimensions. Slide credit: Bernt Schiele 15

Machine Learning: Core Questions Learning to perform a task from experience Performance: 99% correct classification Of what??? Characters? Words? Sentences? Speaker/writer independent? Over what data set? The car drives without human intervention 99% of the time on country roads Slide adapted from Bernt Schiele 16

Machine Learning: Core Questions Learning to perform a task from experience Performance measure: Typically one number % correctly classified letters Average driving distance (until crash...) % games won % correctly recognized words, sentences, answers Generalization performance Training vs. test All data Slide credit: Bernt Schiele 17

Machine Learning: Core Questions Learning to perform a task from experience Performance measure: more subtle problem Also necessary to compare partially correct outputs. How do we weight different kinds of errors? Example: L2 norm (figure comparing the outputs of two classifiers not reproduced here) Slide credit: Bernt Schiele 18

Machine Learning: Core Questions Learning to perform a task from experience What data is available? Data with labels: supervised learning Images / speech with target labels Car sensor data with target steering signal Data without labels: unsupervised learning Automatic clustering of sounds and phonemes Automatic clustering of web sites Some data with, some without labels: semi-supervised learning No examples: learning by doing Feedback/rewards: reinforcement learning Slide credit: Bernt Schiele 19

Machine Learning: Core Questions y = f(x; w) w: characterizes the family of functions w: indexes the space of hypotheses w: vector, connection matrix, graph, ... Slide credit: Bernt Schiele 20

Machine Learning: Core Questions Learning to perform a task from experience Learning Most often learning = optimization Search in hypothesis space Search for the best function / model parameter w y = f(x; w) I.e. maximize w.r.t. the performance measure Slide credit: Bernt Schiele 21
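
A minimal sketch of "learning = optimization", assuming a least-squares performance measure and a linear model f(x; w) (neither is specified on the slide): gradient descent searches the hypothesis space indexed by w for the parameters that best fit the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: y = 2*x - 1 plus noise (made-up example).
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x - 1.0 + 0.1 * rng.normal(size=50)
X = np.stack([x, np.ones_like(x)], axis=1)    # add a bias feature

w = np.zeros(2)                                # initial hypothesis
lr = 0.1
for _ in range(500):
    residual = X @ w - y                       # f(x; w) - y
    grad = 2 * X.T @ residual / len(y)         # gradient of the mean squared error
    w -= lr * grad                             # move towards a better hypothesis

print("learned w:", w)    # should be close to [2, -1]
```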

Course Outline Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Discriminative Approaches (5 weeks) Linear Discriminant Functions Support Vector Machines Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Outlook: Deep Learning Generative Models (4 weeks) Bayesian Networks Markov Random Fields Probabilistic Inference 22

Topics of This Lecture (Re-)view: Probability Theory Probabilities Probability densities Expectations and covariances Bayes Decision Theory Basic concepts Minimizing the misclassification rate Minimizing the expected loss Discriminant functions 23

Probability Theory Probability theory is nothing but common sense reduced to calculation. Pierre-Simon de Laplace, 1749-1827 24 Image source: Wikipedia

Probability Theory Example: apples and oranges We have two boxes to pick from. Each box contains both types of fruit. What is the probability of picking an apple? Formalization: Let B ∈ {r, b} be a random variable for the box we pick. Let F ∈ {a, o} be a random variable for the type of fruit we get. Suppose we pick the red box 40% of the time. We write this as p(B = r) = 0.4, p(B = b) = 0.6. The probability of picking an apple given a choice for the box is p(F = a | B = r) = 0.25, p(F = a | B = b) = 0.75. What is the probability of picking an apple, p(F = a)? 25 Image source: C.M. Bishop, 2006

Probability Theory More general case: Consider two random variables X ∈ {x_i} and Y ∈ {y_j}. Consider N trials and let n_ij = #{X = x_i ∧ Y = y_j}, c_i = #{X = x_i}, r_j = #{Y = y_j}. Then we can derive the Joint probability p(X = x_i, Y = y_j) = n_ij / N, the Marginal probability p(X = x_i) = c_i / N, and the Conditional probability p(Y = y_j | X = x_i) = n_ij / c_i. 26 Image source: C.M. Bishop, 2006

Probability Theory Rules of probability Sum rule: p(X) = Σ_Y p(X, Y) Product rule: p(X, Y) = p(Y | X) p(X) 27 Image source: C.M. Bishop, 2006

The Rules of Probability Thus we have the Sum Rule p(X) = Σ_Y p(X, Y) and the Product Rule p(X, Y) = p(Y | X) p(X). From those, we can derive Bayes' Theorem p(Y | X) = p(X | Y) p(Y) / p(X), where p(X) = Σ_Y p(X | Y) p(Y). 28
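
As a quick numerical check of the sum rule, product rule, and Bayes' theorem on the apples-and-oranges example from above (all numbers as given on the earlier slide):

```python
# Priors over boxes and likelihoods of fruit given box (from the example slide).
p_B = {"r": 0.4, "b": 0.6}
p_F_given_B = {("a", "r"): 0.25, ("o", "r"): 0.75,
               ("a", "b"): 0.75, ("o", "b"): 0.25}

# Sum rule + product rule: p(F=a) = sum_B p(F=a | B) p(B)
p_a = sum(p_F_given_B[("a", box)] * p_B[box] for box in p_B)
print(f"p(F=a) = {p_a:.2f}")                 # 0.25*0.4 + 0.75*0.6 = 0.55

# Bayes' theorem: p(B=r | F=a) = p(F=a | B=r) p(B=r) / p(F=a)
p_r_given_a = p_F_given_B[("a", "r")] * p_B["r"] / p_a
print(f"p(B=r | F=a) = {p_r_given_a:.3f}")   # ≈ 0.182
```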

Probability Densities Probabilities over continuous variables are defined over their probability density function (pdf) p(x). The probability that x lies in the interval (-∞, z) is given by the cumulative distribution function F(z) = ∫_{-∞}^{z} p(x) dx. 29 Image source: C.M. Bishop, 2006
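
A small numerical sketch of the relation between a pdf and its cumulative distribution function, using a standard Gaussian as the (assumed) example density:

```python
import math

def pdf(x, mu=0.0, sigma=1.0):
    """Gaussian probability density function."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def cdf(z, lo=-10.0, n=100000):
    """F(z) = integral of the pdf over (-inf, z), approximated by a midpoint sum."""
    h = (z - lo) / n
    return sum(pdf(lo + (i + 0.5) * h) for i in range(n)) * h

# The probability that x lies in (-inf, z) is F(z); the total probability is 1.
print(f"F(0)    = {cdf(0.0):.4f}")    # ~0.5 for a symmetric density
print(f"F(1.96) = {cdf(1.96):.4f}")   # ~0.975
print(f"F(10)   = {cdf(10.0):.4f}")   # ~1.0: the pdf integrates to 1
```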

Expectations The average value of some function f(x) under a probability distribution p(x) is called its expectation: E[f] = Σ_x p(x) f(x) (discrete case), E[f] = ∫ p(x) f(x) dx (continuous case). If we have a finite number N of samples drawn from a pdf, then the expectation can be approximated by E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n). We can also consider a conditional expectation E_x[f | y] = Σ_x p(x | y) f(x). 30

Variances and Covariances The variance provides a measure of how much variability there is in f(x) around its mean value E[f(x)]: var[f] = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2. For two random variables x and y, the covariance is defined by cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])] = E_{x,y}[x y] - E[x] E[y]. If x and y are vectors, the result is a covariance matrix cov[x, y] = E_{x,y}[(x - E[x])(y^T - E[y^T])]. 31
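
A short sketch of the sample approximations from the last two slides: Monte Carlo estimates of an expectation, a variance, and a covariance matrix from N samples of a 2-D Gaussian whose parameters are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# N samples from an assumed joint distribution p(x, y) (2-D Gaussian, made-up parameters).
mean = np.array([1.0, -2.0])
cov_true = np.array([[1.0, 0.6],
                     [0.6, 2.0]])
samples = rng.multivariate_normal(mean, cov_true, size=100000)

f = lambda v: v[:, 0] ** 2            # some function f(x) of the first component

# E[f] ≈ (1/N) * sum_n f(x_n)
E_f = f(samples).mean()

# var[f] = E[f^2] - E[f]^2
var_f = (f(samples) ** 2).mean() - E_f ** 2

# Covariance matrix of the vector-valued sample (should approach cov_true).
cov_est = np.cov(samples, rowvar=False)

print("E[x^2]     ≈", round(E_f, 3))   # analytically 1^2 + 1 = 2
print("var[x^2]   ≈", round(var_f, 3))
print("cov matrix ≈\n", np.round(cov_est, 3))
```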

Bayes Decision Theory Thomas Bayes, 1701-1761 The theory of inverse probability is founded upon an error, and must be wholly rejected. R.A. Fisher, 1925 32 Image source: Wikipedia

Bayes Decision Theory Example: handwritten character recognition Goal: Classify a new letter such that the probability of misclassification is minimized. 33 Slide credit: Bernt Schiele Image source: C.M. Bishop, 2006

Bayes Decision Theory Concept 1: Priors (a priori probabilities) p(C_k) What we can tell about the probability before seeing the data. Example: C_1 = a, C_2 = b, with p(C_1) = 0.75, p(C_2) = 0.25. In general: Σ_k p(C_k) = 1. Slide credit: Bernt Schiele 34

Bayes Decision Theory Concept 2: Conditional probabilities p(x | C_k) Let x be a feature vector. x measures/describes certain properties of the input, e.g. number of black pixels, aspect ratio, ... p(x | C_k) describes its likelihood for class C_k. (Plot: the likelihoods p(x|a) and p(x|b) as functions of x.) Slide credit: Bernt Schiele 35

Bayes Decision Theory Example: x = 15. Question: Which class? Since p(x|b) is much smaller than p(x|a), the decision should be "a" here. Slide credit: Bernt Schiele 36

Bayes Decision Theory Example: x = 25. Question: Which class? Since p(x|a) is much smaller than p(x|b), the decision should be "b" here. Slide credit: Bernt Schiele 37

Bayes Decision Theory Example: x = 20. Question: Which class? Remember that p(a) = 0.75 and p(b) = 0.25, i.e., the decision should again be "a". How can we formalize this? Slide credit: Bernt Schiele 38

Bayes Decision Theory Concept 3: Posterior probabilities p(C_k | x) We are typically interested in the a posteriori probability, i.e. the probability of class C_k given the measurement vector x. Bayes' Theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_i p(x | C_i) p(C_i). Interpretation: Posterior = Likelihood × Prior / Normalization Factor. Slide credit: Bernt Schiele 39
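
A sketch of how the posterior is computed for the two-class letter example, assuming (for illustration only) Gaussian class-conditional densities p(x|a) and p(x|b) with made-up parameters, together with the priors p(a) = 0.75 and p(b) = 0.25 from the slides:

```python
import math

def gaussian(x, mu, sigma):
    """Assumed class-conditional likelihood p(x | C_k): a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {"a": 0.75, "b": 0.25}
likelihood_params = {"a": (15.0, 5.0),   # (mean, std) for p(x|a) -- made-up values
                     "b": (25.0, 5.0)}   # (mean, std) for p(x|b) -- made-up values

def posterior(x):
    """p(C_k | x) = p(x | C_k) p(C_k) / sum_i p(x | C_i) p(C_i)."""
    unnormalized = {k: gaussian(x, *likelihood_params[k]) * priors[k] for k in priors}
    z = sum(unnormalized.values())       # normalization factor p(x)
    return {k: v / z for k, v in unnormalized.items()}

for x in (15.0, 20.0, 25.0):
    post = posterior(x)
    print(f"x = {x}: p(a|x) = {post['a']:.3f}, p(b|x) = {post['b']:.3f}")
```

With these assumed parameters the output matches the narrative of the preceding slides: at x = 15 and x = 20 the decision is "a" (at x = 20 only because of the prior), at x = 25 it is "b".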

Bayes Decision Theory (Plots: the likelihoods p(x|a) and p(x|b); the products Likelihood × Prior, p(x|a)p(a) and p(x|b)p(b), with the decision boundary; and the posteriors p(a|x) and p(b|x) over x.) Posterior = Likelihood × Prior / Normalization Factor Slide credit: Bernt Schiele 40

Bayesian Decision Theory Goal: Minimize the probability of a misclassification. The green and blue regions stay constant. Only the size of the red region varies! p(mistake) = ∫_{R_1} p(C_2 | x) p(x) dx + ∫_{R_2} p(C_1 | x) p(x) dx 41 Image source: C.M. Bishop, 2006

Bayes Decision Theory Optimal decision rule: Decide for C_1 if p(C_1 | x) > p(C_2 | x). This is equivalent to p(x | C_1) p(C_1) > p(x | C_2) p(C_2), which is again equivalent to the likelihood-ratio test p(x | C_1) / p(x | C_2) > p(C_2) / p(C_1), where the right-hand side is the decision threshold. Slide credit: Bernt Schiele 42
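
The same rule in code, again assuming the made-up Gaussian likelihoods and priors from the earlier sketch: compare the likelihood ratio against the fixed threshold p(C_2)/p(C_1) instead of computing the full posteriors.

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Same made-up likelihood parameters and priors as in the earlier sketch.
p_C1, p_C2 = 0.75, 0.25                   # priors for classes C1 = 'a', C2 = 'b'
params_C1, params_C2 = (15.0, 5.0), (25.0, 5.0)

def decide(x):
    """Likelihood-ratio test: choose C1 iff p(x|C1)/p(x|C2) > p(C2)/p(C1)."""
    ratio = gaussian(x, *params_C1) / gaussian(x, *params_C2)
    threshold = p_C2 / p_C1               # fixed decision threshold, independent of x
    return "C1 (a)" if ratio > threshold else "C2 (b)"

for x in (15.0, 20.0, 25.0):
    print(f"x = {x}: decide {decide(x)}")
```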

Generalization to More Than 2 Classes Decide for class k whenever it has the greatest posterior probability of all classes: p(C_k | x) > p(C_j | x) ∀ j ≠ k, equivalently p(x | C_k) p(C_k) > p(x | C_j) p(C_j) ∀ j ≠ k. Likelihood-ratio test: p(x | C_k) / p(x | C_j) > p(C_j) / p(C_k) ∀ j ≠ k. Slide credit: Bernt Schiele 43

Classifying with Loss Functions Generalization to decisions with a loss function Differentiate between the possible decisions and the possible true classes. Example: medical diagnosis Decisions: sick or healthy (or: further examination necessary) Classes: patient is sick or healthy The cost may be asymmetric: loss(decision = healthy | patient = sick) >> loss(decision = sick | patient = healthy) Slide credit: Bernt Schiele 44

Classifying with Loss Functions In general, we can formalize this by introducing a loss matrix L_kj, where L_kj = loss for decision C_j if the truth is C_k. Example: cancer diagnosis, with a loss matrix L_cancer diagnosis whose rows index the true class ("Truth") and whose columns index the decision (the matrix entries are not reproduced here). 45

Classifying with Loss Functions Loss functions may be different for different actors. Example: a stock trader and a bank have different loss matrices L_stocktrader(subprime) and L_bank(subprime) over the decisions {invest, don't invest} (the matrix entries are not reproduced here). Different loss functions may lead to different Bayes optimal strategies. 46

Minimizing the Expected Loss The optimal solution is the one that minimizes the loss. But: the loss function depends on the true class, which is unknown. Solution: Minimize the expected loss E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx. This can be done by choosing the regions R_j such that each x is assigned to the decision j that minimizes Σ_k L_kj p(C_k | x), which is easy to do once we know the posterior class probabilities p(C_k | x). 47

Minimizing the Expected Loss Example: 2 classes C_1, C_2; 2 decisions α_1, α_2; loss function L(α_j | C_k) = L_kj. Expected loss (= risk R) for the two decisions: R(α_1 | x) = L_11 p(C_1 | x) + L_21 p(C_2 | x), R(α_2 | x) = L_12 p(C_1 | x) + L_22 p(C_2 | x). Goal: Decide such that the expected loss is minimized, i.e. decide α_1 if R(α_2 | x) > R(α_1 | x). Slide credit: Bernt Schiele 48

Minimizing the Expected Loss R(α_2 | x) > R(α_1 | x) ⇔ L_12 p(C_1 | x) + L_22 p(C_2 | x) > L_11 p(C_1 | x) + L_21 p(C_2 | x) ⇔ (L_12 - L_11) p(C_1 | x) > (L_21 - L_22) p(C_2 | x) ⇔ (L_12 - L_11) / (L_21 - L_22) > p(C_2 | x) / p(C_1 | x) = p(x | C_2) p(C_2) / (p(x | C_1) p(C_1)) ⇔ p(x | C_1) / p(x | C_2) > (L_21 - L_22) / (L_12 - L_11) · p(C_2) / p(C_1). This is the adapted decision rule taking into account the loss. Slide credit: Bernt Schiele 49
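
A small sketch of the adapted decision rule, assuming the asymmetric loss matrix shown in the code (made-up values in the spirit of the medical example) and the same made-up Gaussian likelihoods and priors as before: the decision now minimizes the expected loss R(α_j | x) = Σ_k L_kj p(C_k | x) instead of just picking the largest posterior.

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = [0.75, 0.25]                       # p(C1), p(C2)
params = [(15.0, 5.0), (25.0, 5.0)]         # made-up Gaussian likelihood parameters

# Loss matrix L[k][j]: loss for decision alpha_j when the truth is C_k (made-up, asymmetric).
L = [[0.0, 1.0],     # truth C1: deciding alpha_2 costs 1
     [10.0, 0.0]]    # truth C2: deciding alpha_1 costs 10 (much worse)

def posteriors(x):
    unnorm = [gaussian(x, *params[k]) * priors[k] for k in range(2)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def decide(x):
    post = posteriors(x)
    # Expected loss (risk) of each decision: R(alpha_j | x) = sum_k L[k][j] * p(C_k | x)
    risks = [sum(L[k][j] * post[k] for k in range(2)) for j in range(2)]
    return min(range(2), key=lambda j: risks[j]), risks

for x in (15.0, 18.0, 20.0):
    j, risks = decide(x)
    print(f"x = {x}: risks = {[round(r, 3) for r in risks]}, decide alpha_{j + 1}")
```

With these made-up numbers, at x = 20 the posterior still favors C_1 (0.75 vs. 0.25), but the high cost of wrongly deciding α_1 pushes the expected-loss-optimal decision to α_2.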

The Reject Option Classification errors arise from regions where the largest posterior probability p(C_k | x) is significantly less than 1. These are the regions where we are relatively uncertain about class membership. For some applications, it may be better to reject the automatic decision entirely in such a case and e.g. consult a human expert. 50 Image source: C.M. Bishop, 2006
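
A minimal sketch of the reject option, assuming a rejection threshold θ on the largest posterior (the posterior vectors below are made-up numbers standing in for p(C_k | x)):

```python
def classify_with_reject(posteriors, theta=0.8):
    """Return the most probable class, or 'reject' if we are too uncertain."""
    k = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return k if posteriors[k] >= theta else "reject"

# Made-up posterior vectors p(C_k | x) for three inputs.
print(classify_with_reject([0.95, 0.05]))   # confident -> class 0
print(classify_with_reject([0.55, 0.45]))   # uncertain -> 'reject' (consult a human expert)
print(classify_with_reject([0.10, 0.90]))   # confident -> class 1
```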

Discriminant Functions Formulate classification in terms of comparisons: Discriminant functions y_1(x), ..., y_K(x). Classify x as class C_k if y_k(x) > y_j(x) ∀ j ≠ k. Examples (Bayes Decision Theory): y_k(x) = p(C_k | x), y_k(x) = p(x | C_k) p(C_k), y_k(x) = log p(x | C_k) + log p(C_k). Slide credit: Bernt Schiele 51
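
A sketch of the third discriminant function, y_k(x) = log p(x | C_k) + log p(C_k), for K classes, again assuming made-up 1-D Gaussian class-conditionals; working in the log domain gives the same argmax decision while avoiding numerical underflow.

```python
import math

# Made-up priors and Gaussian likelihood parameters for K = 3 classes.
priors = [0.5, 0.3, 0.2]
params = [(0.0, 1.0), (3.0, 1.5), (7.0, 2.0)]    # (mean, std) of p(x | C_k)

def log_gaussian(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def discriminants(x):
    """y_k(x) = log p(x | C_k) + log p(C_k)."""
    return [log_gaussian(x, *params[k]) + math.log(priors[k]) for k in range(len(priors))]

def classify(x):
    y = discriminants(x)
    return max(range(len(y)), key=lambda k: y[k])   # class with the largest y_k(x)

for x in (-1.0, 2.5, 8.0):
    print(f"x = {x}: class C_{classify(x) + 1}")
```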

Different Views on the Decision Problem Generative methods: y_k(x) ∝ p(x | C_k) p(C_k). First determine the class-conditional densities for each class individually and separately infer the prior class probabilities. Then use Bayes' theorem to determine class membership. Discriminative methods: y_k(x) = p(C_k | x). First solve the inference problem of determining the posterior class probabilities. Then use decision theory to assign each new x to its class. Alternative: Directly find a discriminant function y_k(x) which maps each input x directly onto a class label. 52

Next Lectures Ways to estimate the probability densities p(x | C_k): Non-parametric methods: Histograms, k-nearest Neighbor, Kernel Density Estimation. Parametric methods: Gaussian distribution, Mixtures of Gaussians. Discriminant functions: Linear discriminants, Support vector machines (→ next lectures). 53

References and Further Reading More information, including a short review of Probability theory and a good introduction in Bayes Decision Theory can be found in Chapters 1.1, 1.2 and 1.5 of Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 2006 54