Discriminative Models

No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada

Outline Generative vs. Discriminative models Statistical Learning Theory Linear models: Perceptron Linear Regression Minimum Classification Error Support Vector Machines Ridge Regression and LASSO Compressed Sensing Neural Networks

Generative vs. discriminative models The posterior probability p(ω_i | X) plays the key role in pattern classification and machine learning. Generative Models: focus on the probability distribution of the data, p(ω_i | X) ∝ p(ω_i) p(X | ω_i) (the plug-in MAP rule). Discriminative Models: directly model a discriminant function, p(ω_i | X) ~ g_i(X).

Pattern classification based on Discriminant Functions (I) Instead of designing a classifier based on the probability distribution of the data, we can build an ad-hoc classifier based on some discriminant functions that model the class boundary information directly. Classifier based on discriminant functions: For N classes, we define a set of discriminant functions h_i(x) (i = 1, 2, ..., N), one for each class. For an unknown pattern with feature vector Y, the classifier makes the decision ω(Y) = argmax_i h_i(Y). Each discriminant function h_i(x) has a pre-defined functional form and a set of unknown parameters θ_i, so we rewrite it as h_i(x; θ_i). The parameters θ_i (i = 1, 2, ..., N) need to be estimated from some training data.

Statistical Learning Theory Training samples (x_i, y_i) (i = 1, 2, ..., m) Random variables X, Y with joint distribution P(X, Y) Input space: x drawn from X; Output space: y drawn from Y Machine learning tries to learn: y = h(x) + ε Hypothesis space: h(·) drawn from H Loss function: L(y, y'), e.g. 0/1 loss, squared error, ...

Statistical Learning Theory Empirical loss (a.k.a. empirical risk, in-sample error): R_emp(h) = (1/m) Σ_i L(y_i, h(x_i)). Generalization error (a.k.a. generalization risk): R(h) = E_{(X,Y)~P(X,Y)}[L(Y, h(X))]. Empirical loss ≠ generalization error. Learnable or not: whether empirical risk minimization (ERM) also leads to minimizing the generalization error.

Statistical Learning Theory Learnability depends on the capacity of the hypothesis space. VC generalization bounds (Vapnik-Chervonenkis theory) relate the generalization error to the empirical loss through a term involving d_VC, the VC-dimension, which depends only on the hypothesis space H.
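
A typical form of such a bound, stated here only up to constants since the exact expression on the original slide is not preserved in this transcription: with probability at least 1 - δ over m i.i.d. training samples, every h in H satisfies

\[
R(h) \;\le\; R_{\mathrm{emp}}(h) \;+\; O\!\left( \sqrt{ \frac{ d_{VC}\,\ln(m / d_{VC}) + \ln(1/\delta) }{ m } } \right).
\]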

Generalization Bounds The weak law of large numbers: the empirical average converges in probability to the expectation as m grows. Concentration inequalities (Hoeffding's inequality) quantify the rate of this convergence.

Generalization Bounds For a single hypothesis h: apply Hoeffding's inequality directly. Extend to the whole hypothesis space by a union bound over the hypotheses. This gives the first generalization bound.
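
As a concrete sketch (standard statements for a loss bounded in [0, 1] and a finite hypothesis space H; the exact formulas on the original slide are not preserved in this transcription):

\[
P\big( |R(h) - R_{\mathrm{emp}}(h)| > \epsilon \big) \;\le\; 2\, e^{-2 m \epsilon^2}
\quad \text{(single fixed } h\text{, Hoeffding)},
\]
\[
P\big( \exists\, h \in \mathcal{H} : |R(h) - R_{\mathrm{emp}}(h)| > \epsilon \big) \;\le\; 2\, |\mathcal{H}|\, e^{-2 m \epsilon^2}
\quad \text{(union bound over } \mathcal{H}\text{)}.
\]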

Pattern classification based on Discriminant Functions (II) Some common forms for discriminant functions: Linear discriminant function: h(x) = w^T x + b. Quadratic discriminant function (2nd order). Polynomial discriminant function (N-th order). Neural network (arbitrary nonlinear functions). Optimal discriminant functions: the optimal MAP classifier is a special case obtained by choosing the discriminant functions as the class posterior probabilities.

Pattern classification based on Linear Discriminant Functions The unknown parameters of the discriminant functions are estimated to optimize an objective function by some gradient descent method: Perceptron: a simple learning algorithm. Linear Regression: achieving a good mapping. Minimum Classification Error (MCE): minimizing empirical classification errors. Support Vector Machine (SVM): maximizing the separation margin.

Binary Classification Task Separating two classes using linear models. (Figure: training samples with label +1 and label -1.)

Perceptron Rosenblatt (1962) Linear models for two-class problems. Perceptron algorithm: a very simple learning algorithm. Randomly initialize w(0) and b(0), t = 0 For each sample (x_i, y_i) (i = 1, ..., m): Calculate the actual output: h_i(t) = f(x_i) Update the weights upon mistakes: w(t+1) = w(t) + (y_i - h_i(t)) x_i, b(t+1) = b(t) + (y_i - h_i(t)), t = t + 1 End for
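
A minimal sketch of this update rule in Python/NumPy, assuming labels in {+1, -1} and f(x) = sign(w^T x + b) (the slide does not spell out f, so the sign activation is an assumption here):

import numpy as np

def perceptron_train(X, y, epochs=10, seed=0):
    """Perceptron with the (y_i - h_i) update from the slide; labels y in {+1, -1}."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.standard_normal(X.shape[1])   # random initialization w(0)
    b = 0.0                                      # b(0), initialized to zero here
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h_i = 1.0 if w @ x_i + b >= 0 else -1.0   # assumed f(x) = sign(w^T x + b)
            # (y_i - h_i) is 0 on correct predictions, so only mistakes change w and b
            w = w + (y_i - h_i) * x_i
            b = b + (y_i - h_i)
    return w, b

# Usage on a tiny linearly separable toy set
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = perceptron_train(X, y)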

Convergence of Perceptron If the training data is linearly separable, then the perceptron is guaranteed to converge, and there is an upper bound on the number of times the perceptron will adjust its weights during training. A proof can be found in the standard references (e.g., Novikoff's perceptron convergence theorem).

Linear Regression Find a good mapping from x to y (+1 or -1). (Figure: training samples with label +1 and label -1.)

Linear Regression Find a good mapping from X to y: stack the samples as X = [x_1; x_2; ...; x_T] (one sample per row) and Y = [y_1, y_2, ..., y_T]^T = [+1, -1, ..., +1]^T. The model predicts Ŷ = X w^T. The least-squares estimate is w* = argmin_w Σ_i (x_i w^T - y_i)^2, with closed-form solution w* = (X^T X)^{-1} X^T Y. Matrix inversion is expensive when x is high-dimensional. Linear regression does NOT work well for classification.
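
A minimal NumPy sketch of this closed-form solution (using the column-vector convention w, i.e. predictions X @ w, rather than the slide's row-vector x_i w^T; the toy data is illustrative):

import numpy as np

# Toy data: rows of X are samples, Y holds +1/-1 labels treated as regression targets
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])

# Closed-form least-squares solution w* = (X^T X)^{-1} X^T Y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
w = np.linalg.solve(X.T @ X, X.T @ Y)

# Classify by the sign of the regression output
predictions = np.sign(X @ w)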

Minimum Classification Error (MCE) Counting errors on training samples (x_i, y_i): define the misclassification measure g_i = -y_i x_i w^T, so that g_i < 0 means correct classification and g_i > 0 means wrong classification. w* = argmin_w Σ_i H(g_i) = argmin_w Σ_i H(-y_i x_i w^T), where H(·) is the step function (1 for positive arguments, 0 otherwise). Smoothing H with the logistic sigmoid l(x) = 1 / (1 + e^{-σx}) gives w* = argmin_w Σ_i l(g_i) = argmin_w Σ_i l(-y_i x_i w^T).

Minimum Classification Error (MCE) Optimization using gradient descent. The objective function (the smoothed count of training errors): Q(w) = Σ_i l(g_i) = Σ_i l(-y_i x_i w^T). The gradient follows by the chain rule: ∇_w Q(w) = -σ Σ_i l(g_i)(1 - l(g_i)) y_i x_i. May also use SGD.
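
A minimal NumPy sketch of this smoothed-error gradient descent (batch version; the step size, σ, and iteration count are illustrative choices, not values from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mce_train(X, y, sigma=1.0, lr=0.1, n_iters=200):
    """Gradient descent on the smoothed error count sum_i l(-y_i x_i^T w)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = -y * (X @ w)                            # misclassification measures g_i
        l = sigmoid(sigma * g)                      # smoothed 0/1 errors
        grad = -sigma * ((l * (1 - l) * y) @ X)     # chain rule, summed over samples
        w -= lr * grad
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = mce_train(X, y)
predictions = np.sign(X @ w)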

Large-Margin Classifier: Support Vector Machine (SVM)

Support Vector Machine (I) The decision boundary H should be as far away from the data of both classes as possible, i.e. we should maximize the margin m. (Figure: separating hyperplane H between margin hyperplanes H1 and H2, with Class 1 and Class 2 on either side and margin m between them.)

Support Vector Machine (II) The decision boundary can be found by solving the following constrained optimization problem: minimize (1/2)||w||^2 over w and b, where ||w||^2 = w^T w, subject to y_i (x_i w^T + b) ≥ 1 for all i. This is then converted to its dual problem, sketched below.
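
The textbook Lagrangian dual of this hard-margin problem, included here for completeness since the transcription drops the slide's formula, is

\[
\max_{\alpha}\; \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, x_i^{T} x_j
\qquad \text{subject to}\;\; \alpha_i \ge 0,\;\; \sum_{i=1}^{m} \alpha_i y_i = 0,
\]

with solution w = Σ_i α_i y_i x_i; the training samples with α_i > 0 are the support vectors.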

Linearly Non-Separable cases We allow errors ξ_i in classification, giving the soft-margin SVM. (Figure: overlapping Class 1 and Class 2 samples.)

Support Vector Machine (III) Soft-margin SVM can be formulated as: w* = argmin_{w, ξ} (1/2)||w||^2 + C Σ_i ξ_i subject to y_i (x_i w^T + b) ≥ 1 - ξ_i and ξ_i ≥ 0 (∀ i). It can be converted to a dual form (the same as the hard-margin dual, except the multipliers are box-constrained: 0 ≤ α_i ≤ C).

Support Vector Machine (IV) Restating the soft-margin formulation: w* = argmin_{w, ξ} (1/2)||w||^2 + C Σ_i ξ_i subject to y_i (x_i w^T + b) ≥ 1 - ξ_i and ξ_i ≥ 0 (∀ i). Soft-margin SVM is equivalent to minimizing the following unconstrained (hinge-loss) cost function: Σ_i max(0, 1 - f(x_i)) + (1/(2C)) ||w||^2, where f(x_i) = y_i (x_i w^T + b). (Figure: hinge loss as a function of f(x_i), with the kink at +1.)
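
A minimal NumPy sketch of training a linear SVM by subgradient descent on this hinge-loss objective (the learning rate, regularization weight lam, and iteration count are illustrative, not values from the slides; lam plays the role of 1/(2C) up to rescaling):

import numpy as np

def svm_hinge_train(X, y, lam=0.01, lr=0.1, n_iters=500):
    """Subgradient descent on sum_i max(0, 1 - y_i(w^T x_i + b)) + lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        active = margins < 1                       # samples violating the margin
        grad_w = 2 * lam * w - (y[active] @ X[active])
        grad_b = -np.sum(y[active])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_hinge_train(X, y)
predictions = np.sign(X @ w + b)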

Support Vector Machine (V) For a nonlinear separation boundary: map the data into a feature space and use a kernel function. (Figure: a nonlinear mapping x → f(x) from the input space to the feature space, where the classes become linearly separable.)

Support Vector Machine (VI) Nonlinear SVM based on a nonlinear mapping x → f(x): in the dual problem the data appear only through inner products f(x_i)^T f(x_j), so we replace them by a kernel function K(x_i, x_j) = f(x_i)^T f(x_j). Kernel trick: no need to know the original mapping function f() explicitly.

Support Vector Machine (VII) Popular kernel functions: Polynomial kernels, K(x, z) = (x^T z + 1)^p. Gaussian (RBF) kernels, K(x, z) = exp(-||x - z||^2 / (2σ^2)).
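
A small NumPy sketch of these two kernels as Gram-matrix computations (the degree p and width σ are illustrative hyperparameters):

import numpy as np

def polynomial_kernel(X, Z, degree=2):
    """K(x, z) = (x^T z + 1)^degree, computed for all pairs of rows of X and Z."""
    return (X @ Z.T + 1.0) ** degree

def rbf_kernel(X, Z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for all pairs of rows of X and Z."""
    sq_dists = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K_poly = polynomial_kernel(X, X)   # 3 x 3 Gram matrix
K_rbf = rbf_kernel(X, X)           # 3 x 3 Gram matrix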

From 2-class to Multi-class Use multiple 2-class classifiers One vs. One One vs. all Direct Multi-class formulation Multiple linear discriminants MCE classifiers for N-class Multi-class SVMs

Learning Discriminative Models in general The objective function for learning SVMs (hinge loss plus L2 regularization, as above) is one instance of the objective function for learning discriminative models in general: Q = error function + regularization term.

L_p norm The L_p norm is defined as ||x||_p = (Σ_i |x_i|^p)^{1/p}. L_2 norm (Euclidean norm): ||x||_2 = (Σ_i x_i^2)^{1/2}. L_0 norm: the number of non-zero entries. L_1 norm: ||x||_1 = Σ_i |x_i|. L_∞ norm (maximum norm): ||x||_∞ = max_i |x_i|.
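
A quick NumPy illustration of these norms on a small example vector:

import numpy as np

x = np.array([3.0, 0.0, -4.0])

l2 = np.linalg.norm(x, ord=2)         # 5.0
l1 = np.linalg.norm(x, ord=1)         # 7.0
linf = np.linalg.norm(x, ord=np.inf)  # 4.0
l0 = np.count_nonzero(x)              # 2 (number of non-zero entries)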

L_p norm in 3-D (Figure: the constraint sets ||x||_p ≤ 1 in 3-D for different values of p.)

Ridge Regression Ridge Regression = Linear Regression + L_2-norm regularization: w* = argmin_w Σ_i (x_i w^T - y_i)^2 + λ||w||_2^2. Closed-form solution: w* = (X^T X + λI)^{-1} X^T Y.
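
A minimal NumPy sketch of the ridge closed-form solution (λ is an illustrative regularization weight, and the toy data is made up):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
lam = 0.1

# Closed form: w* = (X^T X + lam * I)^{-1} X^T Y
d = X.shape[1]
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)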

LASSO LASSO: least absolute shrinkage and selection operator. LASSO = Linear Regression + L_1-norm regularization: w* = argmin_w Σ_i (x_i w^T - y_i)^2 + λ||w||_1. Equivalent to minimizing Σ_i (x_i w^T - y_i)^2 subject to ||w||_1 ≤ t. Leading to sparse solutions. No closed form; optimized with subgradient methods.
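
As a hedged sketch, one simple way to optimize this objective is the iterative soft-thresholding algorithm (ISTA), which the slides do not mention explicitly; the λ and iteration count below are illustrative:

import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, lam=0.1, n_iters=500):
    """ISTA for (1/2)||Y - Xw||^2 + lam * ||w||_1 (same objective up to rescaling of lam)."""
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - Y)                 # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w

X = np.array([[1.0, 2.0, 0.1], [2.0, 1.0, -0.2], [-1.0, -1.5, 0.3], [-2.0, -0.5, 0.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
w_sparse = lasso_ista(X, Y)   # some entries end up exactly zero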

Compressed Sensing a.k.a. Compressive Sensing; Sparse Coding. A real object = a sparse code over a large dictionary.

Compressed Sensing Math formulation: find the sparsest code, min_w ||w||_0 subject to y = Dw (where D is the dictionary). Or some simpler (convex) relaxations: min_w ||w||_1 subject to y = Dw, or min_w ||y - Dw||_2^2 + λ||w||_1.

Advanced Topics Multi-class SVMs Max-margin Markov Networks Compressed Sensing (or Sparse Coding) Relevance Vector Machine Transductive SVMs