Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer


Contents: Classification Problem; Bayesian Classifier, Decision; Linear Classifiers, MAP Models; Logistic Regression; Regularized Empirical Risk Minimization; Kernel Perceptron, Support Vector Machine; Ridge Regression, LASSO; Representer Theorem; Dualized Perceptron, Dual SVM, Mercer Map; Learning with Structured Input & Output: Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm.

Prerequisites: Statistics (random variables, distributions, Bayes' formula); Linear Algebra (vectors & matrices, transpose, inverse matrices, eigenvalues & eigenvectors); Calculus/Analysis (derivatives, partial derivatives, gradients).

Classification. Input: an instance x ∈ X. For example, X can be a vector space over attributes; the instance is then an assignment of the attributes, and x = (x_1, ..., x_m) is a feature vector. Output: a class y ∈ Y, where Y is a finite set. The class is also referred to as the target attribute; y is also referred to as the (class) label. [Diagram: x → classifier → y]

Classification: Example. Input: an instance x ∈ X, where X is the set of all possible combinations of a regimen of medications. Attributes: "Medication #1 included?", ..., "Medication #6 included?"; the attribute values (a binary vector) form the feature vector describing the medication combination. Output: y ∈ Y = {toxic, ok}. [Diagram: medication combination → classifier → toxic/ok]

Classification: Example. Input: an instance x ∈ X, where X is the set of all 16 × 16 pixel bitmaps. Attributes: "gray value of pixel 1", ..., "gray value of pixel 256"; the instance x is the vector of 256 pixel gray values. Output: y ∈ Y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, the recognized digit. [Diagram: bitmap → classifier → "6"]

Classification: Example. Input: an instance x ∈ X, where X is the bag-of-words representation of all possible email texts. Attributes: "Word #1 occurs?", ..., "Word #m occurs?", with m on the order of 1,000,000 dictionary words (Aardvark, Beneficiary, Friend, Sterling, Science, ...). Output: y ∈ Y = {spam, ok}. Example email: "Dear Beneficiary, We are pleased to notify you that your Email address has been picked online in this second quarter's MICROSOFT CONSUMER AWARD (MCA) as a Winner of One Hundred and Fifty Five Thousand Pounds Sterling ..." → classifier → spam.

Classifier Learning. Input to the learner: training data T_n = {(x_1, y_1), ..., (x_n, y_n)}, given as a feature matrix X with rows x_i = (x_i1, ..., x_im) and a label vector y = (y_1, ..., y_n).

Classifier Learning. Input to the learner: training data T_n = {(x_1, y_1), ..., (x_n, y_n)} with feature matrix X and label vector y. Output: a model y_θ: X → Y, for example y_θ(x) = 1 if φ(x)^T θ > 0 and 0 otherwise, i.e., a linear classifier with parameter vector θ.
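To make these objects concrete, here is a minimal sketch, assuming NumPy, the identity feature map φ(x) = x, and made-up toy values for the data and the parameter vector θ:

```python
import numpy as np

# Toy training data T_n = {(x_1, y_1), ..., (x_n, y_n)}:
# the rows of X are the feature vectors x_i = (x_i1, ..., x_im), y holds the labels.
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [0.0, 0.0]])
y = np.array([1, 0, 1, 0])

def predict(theta, x, phi=lambda v: v):
    """Linear classifier y_theta(x): 1 if phi(x)^T theta > 0, else 0."""
    return 1 if phi(x) @ theta > 0 else 0

theta = np.array([0.5, 0.5])  # some parameter vector (illustrative, not learned)
print([predict(theta, x_i) for x_i in X])  # predictions for the training instances
```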

BAYESIAN CLASSIFICATION

Empirical Inference. Inference of the probability of y given instance x and training data T_n: p(y | x, T_n). Inference of the most likely class: y* = argmax_y p(y | x, T_n). We must make assumptions about the process by which the data are generated in order to be able to calculate the most probable class; we assume all data are independent given the model.

Empirical Inference. Inference of the probability of y given instance x and training data T_n:
p(y | x, T_n) = ∫ p(y, θ | x, T_n) dθ = ∫ p(y | x, θ) p(θ | T_n) dθ,
an integration over the space of model parameters (Bayesian model averaging), using the independence assumption; p(y | x, θ) is the class probability at instance x given θ, and p(θ | T_n) is the a posteriori probability (posterior) of the model given the training data. Inference of the most likely class:
y* = argmax_y p(y | x, T_n) = argmax_y ∫ p(y | x, θ) p(θ | T_n) dθ.

Empirical Inference. p(y | x, T_n) = ∫ p(y | x, θ) p(θ | T_n) dθ. Generally, there is no closed-form solution for classification, and the integral is difficult to approximate since the space of all parameter vectors θ is too large.

Empirical Inference. Inference of the probability of y given instance x and training data:
p(y | x, T_n) = ∫ p(y | x, θ) p(θ | T_n) dθ ≈ p(y | x, θ_MAP), where θ_MAP = argmax_θ p(θ | T_n).
Approximation of the weighted integral by its maximum: classification through the most probable single model instead of an integral over all models.

Inference Example. Clinical study: medication combination x and outcome y. Inference of the probability of y given instance x and training data:
p(y | x, T_n) = ∫ p(y | x, θ) p(θ | T_n) dθ (integral over all models) ≈ p(y | x, θ_MAP), where θ_MAP = argmax_θ p(θ | T_n) is the most probable model given the training data (the maximum a posteriori model).
The weighted integral is approximated by its maximum: classification through the most probable single model instead of an integral over all models.

Graphical Model for Classification. A graphical model defines a stochastic process; it constitutes our modeling assumptions about the data generation process. First, a model parameter θ is selected (sampled); it parameterizes the training data via p(y_i | x_i, θ). The distribution of the data, p(x_i), is not further modeled. [Plate diagram: θ generates the labels y_i from the instances x_i for i = 1, ..., n, and likewise the label y of a new instance x.]

Example. Evolution determines the physiological parameters θ of humans. Given these parameters and a combination of medication, Nature rolls dice to decide whether we survive this combination of drugs. Every time this combination of medicine is administered, the dice are re-rolled according to p(y_i | x_i, θ) to determine the result.

Empirical Inference. Computation of θ_MAP:
θ_MAP = argmax_θ p(θ | T_n)
= argmax_θ p(θ, T_n) / p(T_n)
= argmax_θ p(θ) p(X) p(y | X, θ) / p(T_n)   (p(y | X, θ) is the data model)
= argmax_θ p(y | X, θ) p(θ)   (p(X) and p(T_n) are constants w.r.t. θ)

Empirical Inference. Computation of p(y | X, θ): by the independence of the training data (from the graphical model),
p(y | X, θ) = ∏_{i=1}^n p(y_i | x_i, θ).
The discriminative class probabilities p(y_i | x_i, θ) are directly specified by the model.
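As a tiny illustrative sketch of this factorization (the values below are made-up stand-ins for the per-example probabilities p(y_i | x_i, θ)); in practice the product is evaluated in log space:

```python
import numpy as np

# Stand-ins for p(y_i | x_i, theta), i = 1, ..., n (hypothetical values):
per_example_probs = np.array([0.9, 0.8, 0.95])

log_likelihood = np.log(per_example_probs).sum()  # log p(y | X, theta) = sum_i log p(y_i | x_i, theta)
likelihood = np.exp(log_likelihood)               # p(y | X, theta) = 0.9 * 0.8 * 0.95
print(likelihood, log_likelihood)
```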

Empirical Inference, Discriminative Models. Summary of empirical inference to this point:
p(y | x, T_n) = ∫ p(y | x, θ) p(θ | T_n) dθ (integral over all models: Bayesian model averaging) ≈ p(y | x, θ_MAP) (MAP: approximation by the most probable model);
θ_MAP = argmax_θ p(y | X, θ) p(θ) (likelihood of the classes times the prior over model parameters);
p(y | X, θ) = ∏_{i=1}^n p(y_i | x_i, θ) (the training data are independent);
p(y_i | x_i, θ) is directly specified by the model.

DISCRIMINATIVE APPROACH

Class Probabilities: Discriminative Models. How should we model p(y | x, θ)? Simple approach: assume p depends on x only through x^T θ, i.e. p(y | x, θ) = q(y | x^T θ). Linear model, e.g. binary logistic regression:
p(y = +1 | x, θ) = 1 / (1 + exp(-(x^T θ + b)))
p(y = -1 | x, θ) = 1 - p(y = +1 | x, θ) = 1 / (1 + exp(+(x^T θ + b)))
Later, we look at other frameworks for linear models.

Binary Logistic Regression. Binary classification with classes +1 and -1:
p(y = +1 | x, θ) = 1 / (1 + exp(-(x^T θ + b))).
Decision point: p(y = +1 | x, θ) = p(y = -1 | x, θ) ⇔ 1/2 = 1 / (1 + exp(-(x^T θ + b))) ⇔ x^T θ + b = 0.
The set of points {x : x^T θ + b = 0} forms a separating hyperplane between the classes -1 and +1.

Linear Models. Hyperplane given by normal vector θ and displacement b: H_θ = {x : f_θ(x) = x^T θ + b = 0}. Decision function: f_θ(x) = x^T θ + b. Classifier: y_θ(x) = sign(f_θ(x)). Discriminative class probability: P(y = +1 | x, θ) = 1 / (1 + exp(-(x^T θ + b))). [Figures: the hyperplane f_θ(x) = 0 in the (x_1, x_2) plane with the half-spaces f_θ(x) > 0 and f_θ(x) < 0, and the class-conditional densities p(x | y = +1, θ) and p(x | y = -1, θ) separated by the decision boundary.]
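A small sketch of these three quantities in NumPy (θ, b, and x are arbitrary illustrative values):

```python
import numpy as np

def decision_function(x, theta, b):
    """f_theta(x) = x^T theta + b."""
    return x @ theta + b

def classify(x, theta, b):
    """y_theta(x) = sign(f_theta(x))."""
    return np.sign(decision_function(x, theta, b))

def p_positive(x, theta, b):
    """P(y = +1 | x, theta) = 1 / (1 + exp(-(x^T theta + b)))."""
    return 1.0 / (1.0 + np.exp(-decision_function(x, theta, b)))

theta, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 1.0])
print(decision_function(x, theta, b),  # 1.5
      classify(x, theta, b),           # +1.0
      p_positive(x, theta, b))         # sigmoid(1.5), about 0.82
```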

Logistic Regression: Learning Problem. Inference of θ_MAP = argmax_θ p(θ | T_n). Another assumption: the prior is normally distributed with mean 0, p(θ) = N(θ; 0, Σ).

Logistic Regression: Learning Problem. Inference of the MAP parameter:
θ_MAP = argmax_θ p(θ | T_n)
= argmax_θ p(y | X, θ) p(θ)
= argmax_θ [log p(y | X, θ) + log p(θ)]
= argmax_θ [∑_{i=1}^n log p(y_i | x_i, θ) + log N(θ; 0, Σ)]
= argmax_θ [∑_{i: y_i = +1} log(1 / (1 + exp(-(x_i^T θ + b)))) + ∑_{i: y_i = -1} log(1 / (1 + exp(+(x_i^T θ + b)))) + log( exp(-½ θ^T Σ^{-1} θ) / √((2π)^m |Σ|) )]
= argmax_θ [-∑_{i=1}^n log(1 + exp(-y_i (x_i^T θ + b))) - ½ θ^T Σ^{-1} θ]
= argmin_θ [∑_{i=1}^n log(1 + exp(-y_i (x_i^T θ + b))) + ½ θ^T Σ^{-1} θ]

Logistic Regression: Learning Problem. Inference of the MAP parameter for binary logistic regression with classes y_i ∈ {-1, +1}:
θ_MAP = argmin_θ [∑_{i=1}^n log(1 + exp(-y_i (x_i^T θ + b))) + ½ θ^T Σ^{-1} θ].
How can θ_MAP be computed? To be continued.
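A sketch of this optimization criterion as code, assuming the common special case Σ = σ²·I so that ½ θ^T Σ^{-1} θ becomes a ridge penalty (function and variable names are illustrative):

```python
import numpy as np

def neg_log_posterior(theta, b, X, y, sigma2=1.0):
    """sum_i log(1 + exp(-y_i (x_i^T theta + b))) + theta^T theta / (2 * sigma2),
    for labels y_i in {-1, +1}; the prior covariance is assumed to be sigma2 * I."""
    margins = y * (X @ theta + b)
    # np.logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way
    loss = np.logaddexp(0.0, -margins).sum()
    return loss + theta @ theta / (2.0 * sigma2)

X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5]])
y = np.array([+1, -1, +1])
print(neg_log_posterior(np.zeros(2), 0.0, X, y))  # = 3 * log(2) at theta = 0, b = 0
```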

FEATURE MAPPINGS

Linear Classification. Reformulation by adding a constant input feature (affine transformation):
f(x) = φ(x)_{1..m}^T θ_{1..m} + b = ∑_{f=1}^m φ(x)_f θ_f + b = ∑_{f=1}^{m+1} φ(x)_f θ_f = φ(x)_{1..m+1}^T θ_{1..m+1},
where φ(x)_{m+1} = 1 and θ_{m+1} = b. Thus the classifier f(x) = x^T θ + b, y(x) = sign(f(x)) can be written simply as f(x) = φ(x)^T θ, y(x) = sign(f(x)).
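A minimal sketch of this reformulation (illustrative values): appending a constant 1 to the feature vector and the offset b to the parameter vector leaves the decision value unchanged:

```python
import numpy as np

x = np.array([0.3, -1.2, 0.7])           # original feature vector, m = 3
theta, b = np.array([1.0, 0.5, -2.0]), 0.25

x_aug = np.append(x, 1.0)                # phi(x) with constant feature phi(x)_{m+1} = 1
theta_aug = np.append(theta, b)          # parameter vector with theta_{m+1} = b

print(x @ theta + b, x_aug @ theta_aug)  # both give f(x) = -1.45
```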

Additional Feature Maps. The abstraction φ(x) allows us to learn in more general feature spaces: we can replace x by φ(x) and use the same learning!
θ_MAP = argmin_θ [∑_{i=1}^n log(1 + exp(-y_i (φ(x_i)^T θ + b))) + ½ θ^T Σ^{-1} θ]
Aside: the tensor product between an n-dimensional and an m-dimensional vector is the nm-dimensional vector of all products of elements:
x ⊗ y = (x_1, ..., x_n) ⊗ (y_1, ..., y_m) = (x_1 y_1, ..., x_1 y_m, ..., x_n y_1, ..., x_n y_m).

Feature Mappings. Linear mapping: φ(x_i) = x_i. Quadratic mapping: φ(x_i) = (x_i, x_i ⊗ x_i) (tensor product). Polynomial mapping: φ(x_i) = (x_i, x_i ⊗ x_i, ..., x_i ⊗ ... ⊗ x_i) with up to p factors. Frequently, feature mappings do not have a closed-form expression but can be specified indirectly via their inner products, e.g., RBF kernels or hash kernel functions; a sketch of the quadratic mapping follows below.
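A sketch of the quadratic mapping via the tensor (outer) product, assuming NumPy (the higher-order polynomial mapping repeats the same construction):

```python
import numpy as np

def phi_quadratic(x):
    """Quadratic feature map phi(x) = (x, x tensor x):
    the original features followed by all pairwise products x_i * x_j."""
    return np.concatenate([x, np.outer(x, x).ravel()])

x = np.array([2.0, 3.0])
print(phi_quadratic(x))  # [2. 3. 4. 6. 6. 9.] -> dimension m + m^2
```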

Sufficient Statistics and Feature Mappings. Linear mappings: the linear mapping φ(x_i) = x_i is the sufficient statistic when p(x | y, θ) = N(x; μ_y, Σ) and the covariance matrix is the same for all classes; a linear mapping φ(x_i) = x_i is then sufficient to calculate the class probabilities. Quadratic mappings: more generally, a quadratic mapping is the sufficient statistic when the classes have different covariance matrices.

Linear Models with Feature Mappings. Hyperplane given by normal vector θ and displacement b in feature space: H_θ = {x : f_θ(x) = φ(x)^T θ + b = 0}. Decision function: f_θ(x) = φ(x)^T θ + b. Classifier: y_θ(x) = sign(f_θ(x)). Discriminative class probability: P(y = +1 | x, θ) = 1 / (1 + exp(-(φ(x)^T θ + b))). [Figures: with the quadratic mapping φ(x_i) = (x_i, x_i ⊗ x_i), the decision boundary f_θ(x) = 0 is nonlinear in input space and separates the class-conditional densities p(x | y = +1, θ) and p(x | y = -1, θ).]

MULTI-CLASS CLASSIFICATION

Multi-class Classification. Motivation: we would like to extend classification to problems with more than 2 classes, Y = {1, ..., k}. Problem: we cannot separate k classes with a single hyperplane. Idea: each class y has a separate function f(x, y) that is used to predict how likely y is given x; each function is modeled as linear; we predict the class y with the highest-scoring function for x.

Multi-class Logistic Regression. Probability of class y:
p(y | x, θ) = exp(φ(x)^T θ_y + b_y) / ∑_{z ∈ Y} exp(φ(x)^T θ_z + b_z).
The exponent is affine in φ(x) (linear + offset); the denominator is constant w.r.t. y. Class y is the most likely class if it satisfies y ∈ argmax_{z ∈ Y} φ(x)^T θ_z + b_z, which is a linear (+ offset) decision function.
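A small sketch of these class probabilities and the argmax decision rule, with one parameter vector θ_y per class stored as a row of a matrix (illustrative values, φ(x) = x):

```python
import numpy as np

def class_probabilities(x, Theta, b):
    """p(y | x, theta) = exp(x^T theta_y + b_y) / sum_z exp(x^T theta_z + b_z)."""
    scores = Theta @ x + b        # decision values f(x, y) for all classes y
    scores -= scores.max()        # subtract a constant for numerical stability; cancels in the ratio
    exps = np.exp(scores)
    return exps / exps.sum()

def predict(x, Theta, b):
    """y(x) = argmax_z (x^T theta_z + b_z), the highest-scoring class."""
    return np.argmax(Theta @ x + b)

Theta = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # k = 3 classes, 2 features
b = np.array([0.0, 0.1, 0.0])
x = np.array([0.5, 2.0])
print(class_probabilities(x, Theta, b), predict(x, Theta, b))  # class 1 is most likely
```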

Linear Models, Multi-class Case. Hyperplane for class y given by normal vector θ_y and displacement b_y: H_{θ,y} = {x : f_θ(x, y) = φ(x)^T θ_y + b_y = 0}. Decision functions: f_θ(x, y) = φ(x)^T θ_y + b_y. Classifier: y_θ(x) = argmax_{z ∈ Y} f_θ(x, z). Discriminative class probability: P(y | x, θ) = exp(φ(x)^T θ_y + b_y) / ∑_{z ∈ Y} exp(φ(x)^T θ_z + b_z). [Figure: three classes y_1, y_2, y_3 in the (x_1, x_2) plane with regions f_θ(x, y_1) > 0, f_θ(x, y_2) > 0, and f_θ(x, y_3) > 0.]

Logistic Regression: Learning Problem. Inference of the MAP parameter θ = (θ_1, ..., θ_k)^T:
θ_MAP = argmax_θ p(θ | T_n)
= argmax_θ p(y | X, θ) p(θ)
= argmax_θ [log p(y | X, θ) + log p(θ)]
= argmax_θ [∑_{i=1}^n log p(y_i | x_i, θ) + log N(θ; 0, Σ)]
= argmax_θ [∑_{i=1}^n log( exp(φ(x_i)^T θ_{y_i} + b_{y_i}) / ∑_{z ∈ Y} exp(φ(x_i)^T θ_z + b_z) ) + log( exp(-½ θ^T Σ^{-1} θ) / √((2π)^m |Σ|) )]
= argmin_θ [∑_{i=1}^n ( log ∑_{z ∈ Y} exp(φ(x_i)^T θ_z + b_z) - (φ(x_i)^T θ_{y_i} + b_{y_i}) ) + ½ θ^T Σ^{-1} θ]

Summary: Learning Logistic Regression. If the modelling assumptions are fulfilled, i.e. the data generation model from the graphical-model slide above holds and p(θ) = N(θ; 0, Σ) (the prior is normally distributed), then we use
P(y | x, θ) = exp(φ(x)^T θ_y + b_y) / ∑_{z ∈ Y} exp(φ(x)^T θ_z + b_z),
and the maximum a posteriori parameter is
θ_MAP = argmin_θ [∑_{i=1}^n ( log ∑_{z ∈ Y} exp(φ(x_i)^T θ_z + b_z) - (φ(x_i)^T θ_{y_i} + b_{y_i}) ) + ½ θ^T Σ^{-1} θ].
How can θ_MAP be computed? To be continued.
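A sketch of this multi-class criterion as code, again assuming Σ = σ²·I and offsets already absorbed via a constant feature (names and data are illustrative):

```python
import numpy as np

def multiclass_neg_log_posterior(Theta, Phi, y, sigma2=1.0):
    """sum_i [ log sum_z exp(phi(x_i)^T theta_z) - phi(x_i)^T theta_{y_i} ]
       + ||Theta||^2 / (2 * sigma2).
    Phi: n x d matrix of feature vectors phi(x_i); y: class indices in {0, ..., k-1};
    Theta: k x d matrix with one parameter vector theta_z per class."""
    scores = Phi @ Theta.T                          # n x k matrix of phi(x_i)^T theta_z
    log_norm = np.log(np.exp(scores).sum(axis=1))   # log sum_z exp(phi(x_i)^T theta_z)
    correct = scores[np.arange(len(y)), y]          # phi(x_i)^T theta_{y_i}
    return (log_norm - correct).sum() + (Theta ** 2).sum() / (2.0 * sigma2)

Phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])                   # 2 examples, constant feature appended
y = np.array([0, 2])
Theta = np.zeros((3, 3))                            # k = 3 classes
print(multiclass_neg_log_posterior(Theta, Phi, y))  # = 2 * log(3) at Theta = 0
```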

GENERATIVE APPROACH

Empirical Inference, Generative Models. Computation of p(y | X, θ): by the independence of the training data (from the graphical model),
p(y | X, θ) = ∏_{i=1}^n p(y_i | x_i, θ).
Generative model: apply Bayes' rule,
p(y_i | x_i, θ) = p(x_i | y_i, θ) p(y_i | θ) / ∑_{z ∈ Y} p(x_i | z, θ) p(z | θ),
where p(x_i | y_i, θ) and p(y_i | θ) are model specific. [Plate diagram: θ generates both x_i and y_i for i = 1, ..., n, as well as a new instance x with label y.]

Exponential Family. The probability of a class label is part of the parameter vector: p(y | θ) = π_y, so that
p(y_i | x_i, θ) = p(x_i | y_i, θ) p(y_i | θ) / ∑_{z ∈ Y} p(x_i | z, θ) p(z | θ).
The conditional probability of x is given by
p(x | y, θ) = h(x) exp(φ(x)^T θ_y - ln g(θ_y)).
For k classes, we partition the parameter vector θ = (θ_1, ..., θ_k, π_1, ..., π_k).

Exponential Family. The conditional probability of x is given by p(x | y, θ) = h(x) exp(φ(x)^T θ_y - ln g(θ_y)). The representation φ(x) is the sufficient statistic: φ(x) conveys all useful information about x for the probability distribution. h(x) is the base measure. The partition function g(θ_y) normalizes the distribution. The distribution is specified by h(x), φ(x), θ, and g. Many common distributions are in the exponential family.

Exponential Family: Normal Distribution. The conditional probability of x is given by p(x | y, θ) = h(x) exp(φ(x)^T θ_y - ln g(θ_y)). Example: the multivariate normal distribution
N(x; μ, Σ) = (1 / √((2π)^m |Σ|)) exp(-½ (x - μ)^T Σ^{-1} (x - μ)).
Can it be represented in the exponential family form? [Figure: contour plot of a bivariate normal density N((x_1, x_2); (0, 0), Σ).]

Exponential Family: Normal Distribution. The multivariate normal N(x; μ, Σ) = (1 / √((2π)^m |Σ|)) exp(-½ (x - μ)^T Σ^{-1} (x - μ)) in exponential family form:
φ(x) = (x, x ⊗ x), θ = (Σ^{-1} μ, -½ vec(Σ^{-1})), h(x) = (2π)^{-m/2}, g(θ) = √|Σ| exp(½ μ^T Σ^{-1} μ).
[Figure: contour plot of a bivariate normal density.]

Exponential Family: Normal Distribution. Example: the univariate normal distribution
N(x; μ, σ) = (1 / (σ√(2π))) exp(-(x - μ)^2 / (2σ^2)).
Exponential family form: φ(x) = (x, x^2), θ = (μ/σ^2, -1/(2σ^2)), h(x) = (2π)^{-1/2}, g(θ) = σ exp(μ^2 / (2σ^2)).
[Figure: density curve of N(x; 0, 1).]
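As a quick numerical check of this exponential family form of the univariate normal (a sketch; the values of μ and σ are arbitrary):

```python
import numpy as np

mu, sigma = 1.5, 0.8
x = np.linspace(-3.0, 3.0, 7)

# Standard density N(x; mu, sigma)
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Exponential family form h(x) * exp(phi(x)^T theta - ln g(theta))
theta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])
phi = np.stack([x, x ** 2], axis=1)              # phi(x) = (x, x^2), one row per x
h = (2 * np.pi) ** -0.5                          # base measure h(x)
g = sigma * np.exp(mu ** 2 / (2 * sigma ** 2))   # partition function g(theta)
pdf_expfam = h * np.exp(phi @ theta - np.log(g))

print(np.allclose(pdf, pdf_expfam))              # True: both forms agree
```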

Exponential Family in Classification. The conditional probability of x is given by p(x | y, θ) = h(x) exp(φ(x)^T θ_y - ln g(θ_y)). Substitute into Bayes' rule (recall p(y | θ) = π_y):
p(y_i | x_i, θ) = p(x_i | y_i, θ) p(y_i | θ) / ∑_{z ∈ Y} p(x_i | z, θ) p(z | θ)
= h(x_i) exp(φ(x_i)^T θ_{y_i} - ln g(θ_{y_i})) π_{y_i} / ∑_{z ∈ Y} h(x_i) exp(φ(x_i)^T θ_z - ln g(θ_z)) π_z.

Exponential Family in Classification. Continuing the substitution, the base measure h(x_i) cancels and the terms ln π_y - ln g(θ_y) can be collected into class offsets:
p(y_i | x_i, θ) = exp(φ(x_i)^T θ_{y_i} + b_{y_i}) / ∑_{z ∈ Y} exp(φ(x_i)^T θ_z + b_z), with b_{y_i} = ln π_{y_i} - ln g(θ_{y_i}).
The offsets b_1, ..., b_k can be appended to the parameter vector, θ = (θ_1, ..., θ_k, b_1, ..., b_k).

Exponential Family in Classification. With the offsets absorbed into the parameter vector (via a constant feature), we obtain
p(y_i | x_i, θ) = exp(φ(x_i)^T θ_{y_i}) / ∑_{z ∈ Y} exp(φ(x_i)^T θ_z),
with decision functions f(x, y) = φ(x)^T θ_y and classifier y(x) = argmax_{z ∈ Y} f(x, z).

Generative Logistic Regression. Using the generative approach and its assumptions (the generative data generation model shown in the graphical model above, and p(x | y, θ) an exponential family distribution), we arrived at this conditional distribution for y:
p(y | x, θ) = exp(φ(x)^T θ_y) / ∑_{z ∈ Y} exp(φ(x)^T θ_z).
We do not know the parameters θ_y; we will soon show how to infer the MAP (maximum a posteriori) parameter.

Linear Classification: Summary. In the 2-class case, the linear classifier has a decision function f(x) = φ(x)^T θ + b and a classifier y(x) = sign(f(x)). In the multi-class case, the linear classifier has decision functions f(x, y) = φ(x)^T θ_y + b_y and a classifier y(x) = argmax_{z ∈ Y} f(x, z). The data is mapped by φ(x) to feature space. The offsets b_y can be appended to the end of the vectors θ_y, and a constant 1 is added to the end of each φ(x_i). Each parameter vector θ_y is a normal vector of a separating hyperplane.