Classification. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2012

Topics: discriminant functions; logistic regression; perceptron; generative models; generative vs. discriminative; Naïve Bayes.

Classification problem. Given: a training set of $N$ labeled input-output pairs $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, with $y \in \{1, \dots, K\}$. Goal: given an input $x$, assign it to one of $K$ classes. Examples: spam filtering, handwritten digit recognition.

Types of classifiers. A discriminant function $f(x)$ directly indicates the class of the input $x$. Probabilistic classification approaches can be divided into two main categories: generative and discriminative.

Target coding scheme. Binary classification: a target variable $y \in \{0, 1\}$. Multiple classes ($K > 2$): $y$ is a vector of length $K$ (1-of-K coding). Target class $C_j$: $y_j = 1$ and $y_i = 0$ for all $i \neq j$, e.g., $y = (0, 0, 1, 0)^T$.
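
A minimal sketch of the 1-of-K coding in NumPy (the helper name one_hot is illustrative; labels are taken as 0-based indices):

    import numpy as np

    def one_hot(y, K):
        """Encode integer class labels y (0-based, in {0, ..., K-1}) as 1-of-K vectors."""
        Y = np.zeros((len(y), K))
        Y[np.arange(len(y)), y] = 1
        return Y

    # The third class out of K = 4 becomes (0, 0, 1, 0):
    print(one_hot(np.array([2]), 4))   # [[0. 0. 1. 0.]]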

Linear classifiers: the decision boundaries are linear functions, i.e., $(d-1)$-dimensional hyperplanes within the $d$-dimensional input space. Linearly separable: the data points can be exactly classified by linear decision surfaces. We start with binary linear classification.

Discriminant functions. $f(x; w) = w^T x$, where $x = [1, x_1, x_2, \dots, x_d]^T$ and $w = [w_0, w_1, w_2, \dots, w_d]^T$ ($w_0$: bias). If $f(x; w) = w^T x \geq 0$ then $C_1$, else $C_2$. Decision boundary: $f(x; w) = 0$. Here $f(x; w)$ predicts discrete class labels.
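
A minimal sketch of this decision rule (NumPy; the bias term is handled by prepending a 1 to the input, and the weights are just example values):

    import numpy as np

    def linear_discriminant(x, w):
        """Decide C1 if w^T [1, x] >= 0, otherwise C2."""
        x_aug = np.concatenate(([1.0], x))   # prepend the bias feature
        return "C1" if w @ x_aug >= 0 else "C2"

    w = np.array([-3.0, 0.75, 1.0])          # example weights (bias first)
    print(linear_discriminant(np.array([4.0, 2.0]), w))   # C1
    print(linear_discriminant(np.array([1.0, 1.0]), w))   # C2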

Decision boundary (linear). Example: $-3 + \frac{3}{4}x_1 + x_2 = 0$, i.e., $w = [-3, 0.75, 1]$; if $w^T x \geq 0$ then $y = 1$, else $y = 0$. [Figure: the linear boundary plotted in the $(x_1, x_2)$ plane.]

Non-linear decision boundary: choose non-linear features; the classifier is still linear in the parameters $w$. Example: $-1 + x_1^2 + x_2^2 = 0$, with $\phi(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^T$, $w = [-1, 0, 0, 1, 1, 0]^T$, and $x = [x_1, x_2]^T$. If $w^T \phi(x) \geq 0$ then $y = 1$, else $y = 0$. [Figure: the circular boundary in the $(x_1, x_2)$ plane.]
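
A sketch of this feature map and decision rule (NumPy; the boundary is the unit circle from the example above):

    import numpy as np

    def phi(x):
        """phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2] for x = [x1, x2]."""
        x1, x2 = x
        return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

    w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # -1 + x1^2 + x2^2 = 0

    def predict(x):
        return 1 if w @ phi(x) >= 0 else 0

    print(predict([0.5, 0.5]))   # inside the circle  -> 0
    print(predict([1.5, 0.0]))   # outside the circle -> 1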

SSE cost function (two classes): $J(w) = \|Xw - y\|^2$, where $X$ is the $n \times (d+1)$ design matrix with rows $[1, x_1^{(i)}, \dots, x_d^{(i)}]$ and $w = [w_0, w_1, \dots, w_d]^T$. Is it suitable for classification?

Least squares vs. logistic regression (we will see logistic regression in the next slides): least squares penalizes predictions that are "too correct" and also lacks robustness to noise. [Figure: least squares vs. logistic regression decision boundaries.]

SSE cost function (two classes, recap): $J(w) = \|Xw - y\|^2$, with the design matrix $X$ and weight vector $w$ as before.

SSE cost function (two classes) with a sigmoid: $J(w) = \|g(Xw) - y\|^2$, where $g$ is a sigmoid function applied element-wise. Can the sigmoid solve the problem?

Multi-class classification. [Figure: a two-class problem vs. a multi-class problem in the $(x_1, x_2)$ plane.]

Multi-class classification: one-vs-all (one-vs-rest). [Figure: one binary classifier per class (class 1, class 2, class 3), each separating its class from the rest.]

Multi-class approaches: one-vs-all and all-vs-all.

Multiple classes: $y \in \{1, 2, \dots, K\}$; if $f_i(x) > f_j(x)$ for all $j \neq i$, then decide $C_i$.
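
A sketch of this rule with one linear discriminant per class (NumPy; W stacks the K per-class weight vectors, and the names are illustrative):

    import numpy as np

    def predict_multiclass(x, W):
        """W: (K, d+1) matrix of per-class weights; x: (d+1,) with a leading 1 for the bias.
        Decide the class i with the largest score f_i(x) = w_i^T x."""
        scores = W @ x
        return int(np.argmax(scores))   # 0-based class index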

SSE cost function (multiple classes): $J(W) = \mathrm{Tr}\{(XW - Y)^T(XW - Y)\}$, where $X$ is the design matrix as before, $W = [w_1 \cdots w_K]$ collects the per-class weight vectors, and $Y$ collects the 1-of-K target vectors $y^{(1)}, \dots, y^{(n)}$.

Least squares vs. logistic regression. [Figure: decision boundaries found by least squares and by logistic regression.]

Logistic regression: more general than discriminant functions; $f(x; w)$ predicts the posterior probability $P(y = 1 \mid x)$.

Logistic regression (cont'd): $f(x; w) = g(w^T x)$, where $g(\cdot)$ is an activation function; here it is the sigmoid (logistic) function $g(z) = \frac{1}{1 + e^{-z}}$.
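
A minimal sketch of the sigmoid and the resulting logistic regression output (NumPy; the helper names are illustrative):

    import numpy as np

    def sigmoid(z):
        """g(z) = 1 / (1 + e^{-z})."""
        return 1.0 / (1.0 + np.exp(-z))

    def f(x, w):
        """Logistic regression output: the estimated P(y = 1 | x, w)."""
        return sigmoid(w @ x)

    print(sigmoid(0.0))   # 0.5: the decision boundary w^T x = 0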

Logistic regression (cont'd): $0 \leq f(x; w) \leq 1$ is the estimated probability that $y = 1$ for input $x$ (with $y = 0$ or $y = 1$): $f(x; w) = P(y = 1 \mid x, w)$, and $P(y = 1 \mid x, w) = 1 - P(y = 0 \mid x, w)$.

Logistic regression (cont'd). Example: cancer diagnosis (malignant vs. benign); $f(x; w) = 0.7$ means a 70% chance of the tumor being malignant.

Logistic regression: decision surface. A decision surface is $f(x; w) = \text{constant}$, i.e., $g(w^T x) = \frac{1}{1 + e^{-w^T x}} = \text{constant}$, so the decision surfaces are linear functions of $x$. Logistic regression is a generalized linear model, with more complex analytical and computational properties than linear regression.

Logistic regression: decision surface (cont'd). If $f(x; w) \geq 0.5$ then $y = 1$, else $y = 0$; this is equivalent to: if $w^T x \geq 0$ then $y = 1$, else $y = 0$.

Logistic regression: cost function. Is SSE a proper cost function for classification? $J(w) = \frac{1}{n}\sum_{i=1}^{n} \big(y^{(i)} - f(x^{(i)}; w)\big)^2$. Is it convex? Is it appropriate for the classification problem? Are the conditional distributions $p(y \mid x, w)$ in the classification problem Gaussian?

Logistic regression: loss function. $\mathrm{Loss}(y, f(x; w)) = -\log f(x; w)$ if $y = 1$, and $-\log(1 - f(x; w))$ if $y = 0$. Is it related to the 0-1 loss, $\mathrm{Loss}(y, \hat{y}) = 1$ if $\hat{y} \neq y$ and $0$ if $\hat{y} = y$?

Logistic regression: loss function. For $y = 1$ or $y = 0$: $\mathrm{Loss}(y, f(x; w)) = -y \log f(x; w) - (1 - y)\log(1 - f(x; w))$, where $f(x; w) = \frac{1}{1 + \exp(-w^T x)}$.

Logistic regression: cost function. $\mathrm{Loss}(y, f(x; w)) = -\log p(y \mid f(x; w))$ when assuming a Bernoulli distribution for $p(y \mid w, x)$: $p(y \mid w, x) = f(x; w)^{y}(1 - f(x; w))^{(1-y)}$. Maximum (conditional) log-likelihood: $\log p(D_y \mid D_x, w) = \sum_{i=1}^{n} \log p(y^{(i)} \mid w, x^{(i)})$.

Logistic regression: cost function. $J(w) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(y^{(i)}, f(x^{(i)}; w)) = -\frac{1}{n}\sum_{i=1}^{n} \big[y^{(i)} \log f(x^{(i)}; w) + (1 - y^{(i)})\log(1 - f(x^{(i)}; w))\big]$. There is no closed-form solution for $\min_w J(w)$; however, $J(w)$ is convex.

Gradient descent: $w^{t+1} = w^{t} - \eta \nabla_w J(w^{t})$, with $\nabla_w J(w) = \frac{1}{n}\sum_{i=1}^{n} \big(f(x^{(i)}; w) - y^{(i)}\big)x^{(i)}$. Similar to the gradient of SSE for linear regression?
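
A minimal batch gradient descent sketch for this cost and gradient (NumPy; X is a design matrix with a leading column of ones, and the function names and hyperparameters are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy(w, X, y):
        """J(w) = -(1/n) * sum[ y log f + (1 - y) log(1 - f) ]."""
        p = sigmoid(X @ w)
        eps = 1e-12                       # avoid log(0)
        return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    def gradient_descent(X, y, eta=0.1, n_iters=1000):
        """w_{t+1} = w_t - eta * (1/n) * X^T (f(X; w_t) - y)."""
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            grad = X.T @ (sigmoid(X @ w) - y) / len(y)
            w = w - eta * grad
        return w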

Logistic regression example. [Figure adapted from Jaakkola's slides.]

Generalization to multi-class: train a logistic regression classifier $f_i(x; w)$ for each class $i$; $f_i(x; w)$ predicts the probability of $y = i$ (i.e., $P(y = i \mid x, w)$). On a new input $x$, pick the class that maximizes $f_i(x; w)$: $\hat{y}(x) = \arg\max_i f_i(x; w)$.

Logistic regression: multi-class ($K > 2$). $p(y = k \mid x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K}\exp(w_j^T x)}$: the normalized exponential (aka softmax). If $w_k^T x \gg w_j^T x$ for all $j \neq k$, then $p(C_k \mid x) \approx 1$ and $p(C_j \mid x) \approx 0$. Compare with $p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_{j=1}^{K} p(x \mid C_j)\,p(C_j)}$.
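
A numerically stable sketch of the softmax above (subtracting the maximum score cancels out in the ratio but avoids overflow):

    import numpy as np

    def softmax(scores):
        """p(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x), given scores_k = w_k^T x."""
        z = scores - np.max(scores)    # stability shift; does not change the result
        e = np.exp(z)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities summing to 1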

Logistic Regression (LR): summary. LR is a linear classifier. The LR optimization problem is obtained by maximum likelihood, assuming a Bernoulli distribution for the conditional probabilities. There is no closed-form solution, but the cost function is convex, so gradient ascent on the likelihood (equivalently, gradient descent on the cost) reaches the global optimum.

Perceptron algorithm: a linear discriminant model. Two classes: $y \in \{-1, 1\}$, with $y = -1$ for $C_2$ and $y = 1$ for $C_1$. Goal: $\forall i,\ x^{(i)} \in C_1 \Rightarrow w^T x^{(i)} > 0$ and $\forall i,\ x^{(i)} \in C_2 \Rightarrow w^T x^{(i)} < 0$. $f(x; w) = g(w^T x)$, with $g(z) = -1$ for $z < 0$ and $g(z) = 1$ for $z \geq 0$.

Perceptron architecture. [Figure: the perceptron computing $f(x)$ from the inputs through the weights.]

Perceptron criterion: $J_P(w) = -\sum_{i \in M} w^T x^{(i)} y^{(i)}$, where $M$ is the subset of training data that are misclassified. Assumption: the classes are linearly separable.

Some classification criteria. [Figure: the misclassification count $J(w)$ vs. the perceptron criterion $J_P(w)$ as functions of $(w_0, w_1)$; figure source: Duda's book.]

Some classification criteria. [Figure: the perceptron criterion $J_P(w)$ vs. the squared-error criterion $J_q(w)$ as functions of $(w_0, w_1)$; figure source: Duda's book.]

Stochastic gradient descent for the perceptron: $w^{t+1} = w^{t} - \eta \nabla_w J_P(w^{t})$, with $\nabla_w J_P(w) = -\sum_{i \in M} x^{(i)} y^{(i)}$. Online perceptron: if $x^{(i)}$ is misclassified, $w^{t+1} = w^{t} + \eta x^{(i)} y^{(i)}$. Perceptron convergence theorem: the algorithm converges for linearly separable data. There may be many solutions; which one is found among them?
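
A sketch of the online perceptron under these formulas (NumPy; labels in {-1, +1}, X with a leading bias column; names and defaults are illustrative):

    import numpy as np

    def perceptron(X, y, eta=1.0, n_epochs=100):
        """Online perceptron: whenever x^(i) is misclassified (y_i * w^T x_i <= 0),
        update w <- w + eta * y_i * x_i. Stops early once all points are classified correctly."""
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            errors = 0
            for x_i, y_i in zip(X, y):
                if y_i * (w @ x_i) <= 0:     # misclassified (or on the boundary)
                    w = w + eta * y_i * x_i
                    errors += 1
            if errors == 0:                   # converged (linearly separable case)
                break
        return w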

Convergence of the perceptron: each update changes $w$ in a direction that corrects the error on the misclassified example.

Bayes decision theory. Bayes' theorem: $p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{p(x)}$. Posterior probability: $p(C_k \mid x)$; likelihood: $p(x \mid C_k)$; prior probability: $p(C_k)$.

Bayes decision theory. Bayes decision: choose the class with the highest $p(C_k \mid x)$. Cost function: the probability of misclassification, i.e., minimizing the chance of assigning $x$ to the wrong class. This can be extended to minimizing an expected loss instead of the misclassification error.

Minimizing the misclassification rate. $R_k$: decision regions; all points in $R_k$ are assigned to class $C_k$. $p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx$, which is minimized by choosing the class with the highest $p(C_k \mid x)$.

Probability of correct classification (multi-class): $p(\text{correct}) = \sum_{k=1}^{K} p(x \in R_k, C_k) = \sum_{k=1}^{K} \int_{R_k} p(x, C_k)\,dx$, and $p(\text{correct}) = 1 - p(\text{mistake})$.

Discriminative vs. generative approach

Generative approach. Inference stage: determine $p(x \mid C_k)$ for each class individually, determine $p(C_k)$, and use Bayes' theorem to find $p(C_k \mid x)$. Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if $p(C_i \mid x) > p(C_j \mid x)$ for all $j \neq i$, then decide $C_i$. It models the distribution of the inputs as well as the outputs, and can generate synthetic data points.

Discriminative approach. Inference stage: determine the posterior class probabilities $p(C_k \mid x)$ directly. Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if $p(C_i \mid x) > p(C_j \mid x)$ for all $j \neq i$, then decide $C_i$. Example: logistic regression.

Generative approach: example. $p(x \mid C_k) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)\}$ for $k = 1, 2$ (shared covariance $\Sigma$), with $p(C_1) = p$ and $p(C_2) = 1 - p$.

Generative approach: example (cont'd). Maximum likelihood estimation on $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$: $p = \frac{n_1}{n}$, $\mu_1 = \frac{1}{n_1}\sum_{i=1}^{n} y^{(i)} x^{(i)}$, $\mu_2 = \frac{1}{n_2}\sum_{i=1}^{n} (1 - y^{(i)}) x^{(i)}$, $\Sigma = \frac{n_1}{n} S_1 + \frac{n_2}{n} S_2$, where $S_k = \frac{1}{n_k}\sum_{i \in C_k} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^T$ for $k = 1, 2$.
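
A sketch of these maximum likelihood estimates (NumPy; the two classes are encoded as $y \in \{0, 1\}$ with $y = 1$ for $C_1$, and the helper name is illustrative):

    import numpy as np

    def fit_shared_gaussian(X, y):
        """MLE for the two-class Gaussian model with shared covariance.
        Returns the prior p = p(C_1), the class means, and the shared covariance."""
        X1, X2 = X[y == 1], X[y == 0]          # C_1 <-> y = 1, C_2 <-> y = 0
        n, n1, n2 = len(X), len(X1), len(X2)
        p = n1 / n
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = (X1 - mu1).T @ (X1 - mu1) / n1
        S2 = (X2 - mu2).T @ (X2 - mu2) / n2
        Sigma = (n1 / n) * S1 + (n2 / n) * S2  # shared covariance: weighted average
        return p, mu1, mu2, Sigma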

Class-conditional densities vs. posterior probabilities. [Figure: the class-conditional densities and the corresponding posterior probabilities.]

Maximum likelihood on what? Generative models: the data likelihood $P(X, Y \mid w)$; they also model $P(X \mid w)$. Discriminative models: the conditional data likelihood; they learn $P(Y \mid X, w)$, which is what is required for classification, and do not waste effort learning $P(X \mid w)$.

Discriminative vs. generative: number of parameters ($d$-dimensional feature space). Logistic regression: $d + 1$ parameters, $w = (w_0, w_1, \dots, w_d)$. Generative approach (Gaussian class-conditional densities with a shared covariance matrix): $2d$ parameters for the means, $d(d+1)/2$ parameters for the shared covariance matrix, and one parameter for the class prior $p(C_1)$.

Naïve Bayes classifier. Generative methods can require a high number of parameters; Naïve Bayes avoids this with the conditional independence assumption: $p(x \mid C_k) = p(x_1 \mid C_k)\,p(x_2 \mid C_k)\cdots p(x_d \mid C_k)$.

Naïve Bayes classifier. $p(x \mid C_k) = p(x_1 \mid C_k)\,p(x_2 \mid C_k)\cdots p(x_d \mid C_k)$, so $p(C_k \mid x) \propto p(C_k)\prod_{i=1}^{d} p(x_i \mid C_k)$. Two-class Gaussian Naïve Bayes: $2d + 1$ parameters.

Naïve Bayes: discrete example (PlayTennis). $p(\text{PlayTennis} = \text{Yes}) = 9/14 = 0.64$, $p(\text{PlayTennis} = \text{No}) = 5/14 = 0.36$; $p(\text{Outlook} = \text{Sunny} \mid \text{PlayTennis} = \text{Yes}) = 2/9 \approx 0.22$, $p(\text{Outlook} = \text{Sunny} \mid \text{PlayTennis} = \text{No}) = 3/5 = 0.6$.

Naïve Bayes: discrete example. For $x = (\text{Sunny}, \text{Cool}, \text{High}, \text{Strong})$: $p(\text{Yes} \mid x) \propto p(\text{Yes})\,p(\text{Sunny} \mid \text{Yes})\,p(\text{Cool} \mid \text{Yes})\,p(\text{High} \mid \text{Yes})\,p(\text{Strong} \mid \text{Yes}) = 0.0053$, and $p(\text{No} \mid x) \propto p(\text{No})\,p(\text{Sunny} \mid \text{No})\,p(\text{Cool} \mid \text{No})\,p(\text{High} \mid \text{No})\,p(\text{Strong} \mid \text{No}) = 0.0206$, so predict No.
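
A sketch of this computation; the conditional probabilities for Cool, High, and Strong are the usual PlayTennis table estimates, which the slide assumes but does not list:

    # Unnormalized Naive Bayes scores for x = (Sunny, Cool, High, Strong).
    p_yes, p_no = 9 / 14, 5 / 14

    score_yes = p_yes * (2/9) * (3/9) * (3/9) * (3/9)   # p(Yes) * prod_i p(x_i | Yes)
    score_no  = p_no  * (3/5) * (1/5) * (4/5) * (3/5)   # p(No)  * prod_i p(x_i | No)

    print(round(score_yes, 4), round(score_no, 4))      # 0.0053 0.0206
    # Normalizing gives p(No | x) = score_no / (score_yes + score_no) ~ 0.8, so predict No.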

Bayes optimal classifier. Training data: $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$; $H$: hypothesis space. $p(C_k \mid x, D) = \sum_{h \in H} p(C_k \mid x, h)\,p(h \mid D)$, where $p(h \mid D) \propto p(D \mid h)\,p(h)$ and $p(D \mid h) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}, h)$.

Summary of alternatives. Generative: the most demanding, because it finds the joint distribution $p(x, C_k)$; usually needs a large training set to estimate $p(x \mid C_k)$; can find $p(x)$, which is useful for outlier or novelty detection. Discriminative: specifies only what is really needed (i.e., $p(C_k \mid x)$); more computationally efficient. Discriminant function: does not have many of the capabilities of probabilistic methods.