Universität Potsdam, Institut für Informatik
Linear Classifiers
Blaine Nelson, Tobias Scheffer
Contents
- Classification Problem
- Bayesian Classifier: Decision, Linear Classifiers, MAP Models, Logistic Regression
- Regularized Empirical Risk Minimization: Perceptron, Support Vector Machine, Ridge Regression, LASSO
- Kernels: Representer Theorem, Dualized Perceptron, Dual SVM, Mercer Map
- Learning with Structured Input & Output: Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm
Prerequisites
- Statistics: random variables, distributions, Bayes' formula
- Linear algebra: vectors & matrices, transpose, inverse matrices, eigenvalues & eigenvectors
- Calculus (analysis): derivatives, partial derivatives, gradients
Classification
- Input: an instance $x \in X$. E.g., $X$ can be a vector space over attributes; an instance is then an assignment of the attributes, and $x = (x_1, \dots, x_m)^T$ is a feature vector.
- Output: a class $y \in Y$, where $Y$ is a finite set. The class is also referred to as the target attribute; $y$ is also referred to as the (class) label.
- Schematically: $x \rightarrow$ classifier $\rightarrow y$.
Classification: Example
- Input: an instance $x \in X$, where $X$ is the set of all possible combinations of a medication regimen.
- Attributes: "Medication #1 included?" through "Medication #6 included?"; the attribute values form the feature vector describing the medication combination, e.g. $x = (0, 1, \dots, 0, 0)^T$.
- Output: $y \in Y = \{\text{toxic}, \text{ok}\}$.
Classification: Example
- Input: an instance $x \in X$, where $X$ is the set of all $16 \times 16$ pixel bitmaps.
- Attributes: gray value of pixel 1 through gray value of pixel 256; an instance is a vector of 256 pixel values, e.g. $x = (0.1, 0.3, 0.45, 0.65, \dots, 0.87)^T$.
- Output: $y \in Y = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$: the recognized digit, e.g. classifier output "6".
Classification: Example
- Input: an instance $x \in X$, where $X$ is the bag-of-words representation of all possible email texts.
- Attributes: "Word #1 occurs?" through "Word #m occurs?" with $m \approx 1{,}000{,}000$ (e.g. Aardvark, ..., Beneficiary, ..., Friend, ..., Sterling, ...); an instance is a binary vector, e.g. $x = (0, 1, \dots, 0, 0)^T$.
- Output: $y \in Y = \{\text{spam}, \text{ok}\}$.
- Example email: "Dear Beneficiary, we are pleased to notify you that your Email address has been picked online in this second quarter's MICROSOFT CONSUMER AWARD (MCA) as a Winner of One Hundred and Fifty Five Thousand Pounds Sterling" $\rightarrow$ classifier $\rightarrow$ spam.
Classifier Learning
- Input to the learner: training data $T_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$, collected in the matrix and label vector
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
Classifier Learning
- Input to the learner: training data $T_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$ with matrix $X$ and label vector $y$ as above.
- Output: a model $y_\theta : X \to Y$, for example the linear classifier with parameter vector $\theta$:
$$y_\theta(x) = \begin{cases} 1 & \text{if } \phi(x)^T\theta \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
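As a minimal illustration (not part of the slides), such a linear classifier can be written in a few lines of numpy; here $\phi$ is assumed to be the identity mapping and the classes are encoded as 1/0:

```python
import numpy as np

def predict(theta, x):
    """Linear classifier: returns 1 if x^T theta >= 0, else 0.

    Assumes phi(x) = x (identity feature mapping); theta is the
    learned parameter vector.
    """
    return 1 if x @ theta >= 0 else 0

# Hypothetical toy example: two features, hand-picked parameters.
theta = np.array([0.5, -1.0])
print(predict(theta, np.array([2.0, 0.5])))  # 1, since 2*0.5 - 0.5 = 0.5 >= 0
```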
BAYESIAN CLASSIFICATION
Empirical Inference
- Inference of the probability of $y$ given instance $x$ and training data $T_n$: $p(y \mid x, T_n)$.
- Inference of the most likely class: $y^* = \operatorname{argmax}_y p(y \mid x, T_n)$.
- We must make assumptions about the process by which the data is generated to be able to calculate the most probable class. We assume all data are independent given the model $\theta$.
Empirical Inference
- Inference of the probability of $y$ given instance $x$ and training data $T_n$, by integration over the space of model parameters (Bayesian model averaging); the factorization uses the independence assumption:
$$p(y \mid x, T_n) = \int p(y, \theta \mid x, T_n)\, d\theta = \int p(y \mid x, \theta)\, p(\theta \mid T_n)\, d\theta$$
Here $p(y \mid x, \theta)$ is the class probability at instance $x$ given $\theta$, and $p(\theta \mid T_n)$ is the a-posteriori probability (posterior) of the model given the training data.
- Inference of the most likely class:
$$y^* = \operatorname{argmax}_y p(y \mid x, T_n) = \operatorname{argmax}_y \int p(y \mid x, \theta)\, p(\theta \mid T_n)\, d\theta$$
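Purely as an illustration of this integral (not from the slides): if one could sample models $\theta$ from the posterior $p(\theta \mid T_n)$, Bayesian model averaging would amount to averaging the per-model class probabilities over the samples. The sampler and the per-model probability function are assumed given here:

```python
import numpy as np

def bayesian_model_average(p_y_given_x_theta, posterior_samples, x):
    """Monte Carlo approximation of p(y | x, T_n).

    p_y_given_x_theta(x, theta) -> vector of class probabilities;
    posterior_samples: models theta drawn from p(theta | T_n)
    (assumed to come from some external sampler).
    """
    probs = [p_y_given_x_theta(x, theta) for theta in posterior_samples]
    return np.mean(probs, axis=0)  # average class probabilities over models
```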
Empirical Inference
- $p(y \mid x, T_n) = \int p(y \mid x, \theta)\, p(\theta \mid T_n)\, d\theta$ generally has no closed-form solution for classification, and it is difficult to approximate since the space of all parameter vectors is too large.
Empirical Inference: MAP Approximation
- Example: clinical study with medication combination $x$ and outcome $y$.
- Approximate the integral over all models (the weighted sum) by its maximum, i.e., classify with the single most probable model given the training data (the maximum a-posteriori model) instead of averaging over all models:
$$p(y \mid x, T_n) = \int p(y \mid x, \theta)\, p(\theta \mid T_n)\, d\theta \approx p(y \mid x, \theta_{MAP}), \quad \text{where } \theta_{MAP} = \operatorname{argmax}_\theta p(\theta \mid T_n)$$
Graphical Model for Classification
- A graphical model defines a stochastic process; it constitutes our modeling assumptions about the data generation process.
- First, a model parameter $\theta$ is selected (sampled); it parameterizes the training data via $p(y_i \mid x_i, \theta)$ for $i = 1, \dots, n$.
- The distribution of the data, $p(x_i)$, is not further modeled.
(Figure: plate notation with $\theta$ generating each label $y_i$ from $x_i$, and likewise the test pair $(x, y)$.)
Example
- Evolution determines the physiological parameters $\theta$ of humans.
- Given these parameters and a combination of medications, nature rolls dice to decide whether we survive this combination of drugs.
- Every time this combination of medicine is administered, the dice are re-rolled according to $p(y_i \mid x_i, \theta)$ to determine the result.
Empirical Inference
- Computation of $\theta_{MAP}$:
$$\theta_{MAP} = \operatorname{argmax}_\theta p(\theta \mid T_n) = \operatorname{argmax}_\theta \frac{p(\theta, T_n)}{p(T_n)}$$
Empirical Inference
- Computation of $\theta_{MAP}$, factorizing the joint probability according to the data model:
$$\theta_{MAP} = \operatorname{argmax}_\theta p(\theta \mid T_n) = \operatorname{argmax}_\theta \frac{p(\theta, T_n)}{p(T_n)} = \operatorname{argmax}_\theta \frac{p(\theta)\, p(X)\, p(y \mid X, \theta)}{p(T_n)}$$
Empirical Inference
- Dropping the factors that are constant w.r.t. $\theta$:
$$\theta_{MAP} = \operatorname{argmax}_\theta \frac{p(\theta)\, p(X)\, p(y \mid X, \theta)}{p(T_n)} = \operatorname{argmax}_\theta p(y \mid X, \theta)\, p(\theta)$$
Empirical Inference
- Computation of $p(y \mid X, \theta)$, using the independence of the training data (from the graphical model):
$$p(y \mid X, \theta) = \prod_{i=1}^n p(y_i \mid x_i, \theta)$$
- The discriminative class probabilities $p(y_i \mid x_i, \theta)$ are directly specified by the model.
Empirical Inference: Discriminative Models
Summary of empirical inference to this point:
- Bayesian model averaging (integral over all models), approximated by the most probable (MAP) model:
$$p(y \mid x, T_n) = \int p(y \mid x, \theta)\, p(\theta \mid T_n)\, d\theta \approx p(y \mid x, \theta_{MAP})$$
- MAP model: likelihood of the classes times the prior over model parameters:
$$\theta_{MAP} = \operatorname{argmax}_\theta p(y \mid X, \theta)\, p(\theta)$$
- The training data are independent:
$$p(y \mid X, \theta) = \prod_{i=1}^n p(y_i \mid x_i, \theta)$$
- $p(y_i \mid x_i, \theta)$ is directly specified by the model.
DISCRIMINATIVE APPROACH
Class Probabilities: Discriminative Models
- How should we model $p(y \mid x, \theta)$?
- Simple approach (a linear model): assume that $p$ depends on $x$ only through $x^T\theta$, i.e., $p(y \mid x, \theta) = q(y \mid x^T\theta)$.
- E.g., binary logistic regression:
$$p(y = +1 \mid x, \theta) = \frac{1}{1 + \exp\left(-(x^T\theta + b)\right)}, \qquad p(y = -1 \mid x, \theta) = 1 - p(y = +1 \mid x, \theta) = \frac{1}{1 + \exp\left(+(x^T\theta + b)\right)}$$
- Later, we look at other frameworks for linear models.
Binary Logistic Regression
- Binary classification with classes $+1$ and $-1$:
$$p(y = +1 \mid x, \theta) = \frac{1}{1 + \exp\left(-(x^T\theta + b)\right)}$$
- Decision point: $p(y = +1 \mid x, \theta) = p(y = -1 \mid x, \theta)$, i.e.,
$$\frac{1}{1 + \exp\left(-(x^T\theta + b)\right)} = \frac{1}{2} \iff x^T\theta + b = 0$$
- The set of points $\{x \mid x^T\theta + b = 0\}$ forms a separating hyperplane between classes $-1$ and $+1$.
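A minimal numpy sketch (not from the slides) of these class probabilities; the function names and the toy parameters are my own:

```python
import numpy as np

def sigmoid(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def p_positive(x, theta, b):
    """p(y = +1 | x, theta) for binary logistic regression."""
    return sigmoid(x @ theta + b)

# On the hyperplane x^T theta + b = 0, both classes are equally likely:
theta, b = np.array([1.0, -2.0]), 0.5
x_on_plane = np.array([1.5, 1.0])        # 1.5 - 2.0 + 0.5 = 0
print(p_positive(x_on_plane, theta, b))  # 0.5
```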
Linear Models
- Hyperplane given by normal vector $\theta$ and displacement $b$: $H = \{x \mid f(x) = x^T\theta + b = 0\}$.
- Decision function: $f(x) = x^T\theta + b$.
- Classifier: $y(x) = \operatorname{sign} f(x)$.
- Discriminative class probability: $P(y = +1 \mid x, \theta) = \frac{1}{1 + \exp\left(-(x^T\theta + b)\right)}$.
(Figures: the hyperplane $f(x) = 0$, with $f(x) > 0$ on the side the normal vector $\theta$ points to and $f(x) < 0$ on the other; the class-conditional densities $p(x \mid y = +1, \theta)$ and $p(x \mid y = -1, \theta)$ separated by the hyperplane.)
Logistic Regression: Learning Problem
- Inference of $\theta_{MAP} = \operatorname{argmax}_\theta p(\theta \mid T_n)$.
- Another assumption: the prior is normally distributed with mean $0$: $p(\theta) = N(\theta; 0, \Sigma)$.
Logistic Regression: Learning Problem
Inference of the MAP parameter:
$$\begin{aligned}
\theta_{MAP} &= \operatorname{argmax}_\theta p(\theta \mid T_n) = \operatorname{argmax}_\theta p(y \mid X, \theta)\, p(\theta) \\
&= \operatorname{argmax}_\theta \left[ \log p(y \mid X, \theta) + \log p(\theta) \right] \\
&= \operatorname{argmax}_\theta \left[ \sum_{i=1}^n \log p(y_i \mid x_i, \theta) + \log N(\theta; 0, \Sigma) \right] \\
&= \operatorname{argmax}_\theta \left[ \sum_{i:\, y_i = +1} \log \frac{1}{1 + \exp\left(-(x_i^T\theta + b)\right)} + \sum_{i:\, y_i = -1} \log \frac{1}{1 + \exp\left(+(x_i^T\theta + b)\right)} + \log \frac{1}{\sqrt{(2\pi)^m |\Sigma|}}\, e^{-\frac{1}{2}\theta^T \Sigma^{-1} \theta} \right] \\
&= \operatorname{argmax}_\theta \left[ -\sum_{i=1}^n \log\left(1 + \exp\left(-y_i (x_i^T\theta + b)\right)\right) - \frac{1}{2}\theta^T \Sigma^{-1} \theta \right] \\
&= \operatorname{argmin}_\theta \left[ \sum_{i=1}^n \log\left(1 + \exp\left(-y_i (x_i^T\theta + b)\right)\right) + \frac{1}{2}\theta^T \Sigma^{-1} \theta \right]
\end{aligned}$$
Logistic Regression: Learning Problem
- Inference of the MAP parameter for binary logistic regression with classes $y_i \in \{-1, +1\}$:
$$\theta_{MAP} = \operatorname{argmin}_\theta \left[ \sum_{i=1}^n \log\left(1 + \exp\left(-y_i (x_i^T\theta + b)\right)\right) + \frac{1}{2}\theta^T \Sigma^{-1} \theta \right]$$
- How can $\theta_{MAP}$ be computed? To be continued...
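As a hedged illustration (the optimization itself is only covered later), the objective being minimized can be written directly in numpy. The isotropic prior $\Sigma = \sigma^2 I$ is an assumption made here to keep the sketch short:

```python
import numpy as np

def map_objective(theta, b, X, y, sigma2=1.0):
    """Negative log-posterior of binary logistic regression (up to constants).

    X: (n, m) data matrix; y: labels in {-1, +1}.
    Assumes the prior covariance is Sigma = sigma2 * I, so the regularizer
    0.5 * theta^T Sigma^{-1} theta becomes 0.5 / sigma2 * ||theta||^2.
    """
    margins = y * (X @ theta + b)
    # log(1 + exp(-margin)), written via logaddexp for numerical stability
    loss = np.logaddexp(0.0, -margins).sum()
    return loss + 0.5 / sigma2 * theta @ theta
```

Any generic optimizer (e.g., gradient descent on this function) would then yield $\theta_{MAP}$; that step is deferred to the continuation of the lecture.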
FEATURE MAPPINGS
Linear Classification
- Reformulation by adding a constant input feature (affine transformation): with $\phi(x)_{m+1} = 1$ and $\theta_{m+1} = b$,
$$f(x) = x^T\theta + b = \sum_{f=1}^{m} \phi(x)_f\, \theta_f + b = \sum_{f=1}^{m+1} \phi(x)_f\, \theta_f = \phi(x)^T\theta$$
- The classifier is unchanged: $y(x) = \operatorname{sign} f(x)$, now with $f(x) = \phi(x)^T\theta$.
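A small sketch (helper names are my own) of folding the offset $b$ into the parameter vector by appending a constant feature:

```python
import numpy as np

def add_constant_feature(X):
    """Append a constant 1 to each feature vector (rows of X)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# theta_aug carries the offset b as its last component:
theta, b = np.array([0.5, -1.0]), 2.0
theta_aug = np.append(theta, b)
X = np.array([[2.0, 0.5]])
assert np.allclose(add_constant_feature(X) @ theta_aug, X @ theta + b)
```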
Additional Feature Maps
- The abstraction $\phi(x)$ allows us to learn in more general feature spaces: we can replace $x$ by $\phi(x)$ and use the same learning!
$$\theta_{MAP} = \operatorname{argmin}_\theta \left[ \sum_{i=1}^n \log\left(1 + \exp\left(-y_i (\phi(x_i)^T\theta + b)\right)\right) + \frac{1}{2}\theta^T \Sigma^{-1} \theta \right]$$
- Aside: the tensor product between an $n$- and an $m$-dimensional vector is the $nm$-dimensional vector of all products of elements:
$$x \otimes y = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \otimes \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = (x_1 y_1, \dots, x_1 y_m, \dots, x_n y_1, \dots, x_n y_m)^T$$
Feature Mappings
- Linear mapping: $\phi(x_i) = x_i$.
- Quadratic mapping (using the tensor product): $\phi(x_i) = \begin{pmatrix} x_i \\ x_i \otimes x_i \end{pmatrix}$.
- Polynomial mapping: $\phi(x_i) = \begin{pmatrix} x_i \\ \vdots \\ x_i \otimes x_i \otimes \cdots \otimes x_i \end{pmatrix}$ with $p$ factors in the last block.
- Frequently, feature mappings do not have a closed-form expression but can be specified indirectly via their inner products, e.g., the RBF kernel or hash kernel functions.
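A minimal numpy sketch of the linear and quadratic mappings above (function names are my own):

```python
import numpy as np

def phi_linear(x):
    """Linear mapping: phi(x) = x."""
    return x

def phi_quadratic(x):
    """Quadratic mapping: stack x on top of the tensor product x (x) x."""
    return np.concatenate([x, np.outer(x, x).ravel()])

x = np.array([2.0, 3.0])
print(phi_quadratic(x))  # [2. 3. 4. 6. 6. 9.]
```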
Sufficient Statistics, Feature Mappings
- Linear mappings: the linear mapping $\phi(x_i) = x_i$ is the sufficient statistic when $p(x \mid y, \theta) = N(x; \mu_y, \Sigma)$ and the covariance matrix is the same for all classes; a linear mapping then suffices to calculate the class probabilities.
- Quadratic mappings: more generally, a quadratic mapping is the sufficient statistic when the classes have different covariance matrices.
Linear Models: Feature Mappings
- Hyperplane given by normal vector $\theta$ and displacement $b$: $H = \{x \mid f(x) = \phi(x)^T\theta + b = 0\}$.
- Decision function: $f(x) = \phi(x)^T\theta + b$.
- Classifier: $y(x) = \operatorname{sign} f(x)$.
- Discriminative class probability: $P(y = +1 \mid x, \theta) = \frac{1}{1 + \exp\left(-(\phi(x)^T\theta + b)\right)}$.
(Figure: with a quadratic mapping, the decision boundary $f(x) = 0$ is nonlinear in the original space and separates the class-conditional densities $p(x \mid y = +1, \theta)$ and $p(x \mid y = -1, \theta)$.)
MULTI-CLASS CLASSIFICATION
Multi-class Classification
- Motivation: we would like to extend classification to problems with more than 2 classes: $Y = \{1, \dots, k\}$.
- Problem: we cannot separate $k$ classes with a single hyperplane.
- Idea: each class $y$ has a separate function $f(x, y)$ that is used to predict how likely $y$ is given $x$; each function is modeled as linear, and we predict the class with the highest-scoring function for $x$.
Multi-class Logistic Regression
- Probability of class $y$:
$$p(y \mid x, \theta) = \frac{\exp\left(\phi(x)^T\theta_y + b_y\right)}{\sum_{z \in Y} \exp\left(\phi(x)^T\theta_z + b_z\right)}$$
- The exponent is affine in $\phi(x)$ (linear + offset); the denominator is constant w.r.t. $y$.
- Class $y^*$ is the most likely class if it satisfies
$$y^* \in \operatorname{argmax}_{z \in Y}\; \phi(x)^T\theta_z + b_z$$
This is a linear (+offset) decision function.
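A small numpy sketch (my own names) of these class probabilities and the argmax prediction; the rows of `Theta` hold the per-class parameter vectors:

```python
import numpy as np

def class_probabilities(phi_x, Theta, b):
    """Softmax over the per-class scores phi(x)^T theta_z + b_z.

    Theta: (k, d) matrix whose row z is theta_z; b: (k,) offsets.
    """
    scores = Theta @ phi_x + b
    scores -= scores.max()   # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def predict(phi_x, Theta, b):
    """Most likely class: argmax_z phi(x)^T theta_z + b_z."""
    return np.argmax(Theta @ phi_x + b)
```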
Linear Models: Multi-class Case
- Hyperplanes given by normal vectors and displacements: $H_{\theta,y} = \{x \mid f(x, y) = \phi(x)^T\theta_y + b_y = 0\}$.
- Decision functions: $f(x, y) = \phi(x)^T\theta_y + b_y$.
- Classifier: $y(x) = \operatorname{argmax}_{z \in Y} f(x, z)$.
- Discriminative class probability:
$$P(y \mid x, \theta) = \frac{\exp\left(\phi(x)^T\theta_y + b_y\right)}{\sum_{z \in Y} \exp\left(\phi(x)^T\theta_z + b_z\right)}$$
(Figure: three classes $y_1, y_2, y_3$ with regions where $f(x, y_1) > 0$, $f(x, y_2) > 0$, and $f(x, y_3) > 0$.)
Logistic Regression: Learning Problem
Inference of the MAP parameter, with $\theta = (\theta_1^T, \dots, \theta_k^T)^T$:
$$\begin{aligned}
\theta_{MAP} &= \operatorname{argmax}_\theta p(\theta \mid T_n) = \operatorname{argmax}_\theta p(y \mid X, \theta)\, p(\theta) \\
&= \operatorname{argmax}_\theta \left[ \log p(y \mid X, \theta) + \log p(\theta) \right] \\
&= \operatorname{argmax}_\theta \left[ \sum_{i=1}^n \log p(y_i \mid x_i, \theta) + \log N(\theta; 0, \Sigma) \right] \\
&= \operatorname{argmax}_\theta \left[ \sum_{i=1}^n \log \frac{\exp\left(\phi(x_i)^T\theta_{y_i} + b_{y_i}\right)}{\sum_{z \in Y} \exp\left(\phi(x_i)^T\theta_z + b_z\right)} + \log \frac{1}{\sqrt{(2\pi)^m |\Sigma|}}\, e^{-\frac{1}{2}\theta^T \Sigma^{-1} \theta} \right] \\
&= \operatorname{argmin}_\theta \left[ \sum_{i=1}^n \left( \log \sum_{z \in Y} \exp\left(\phi(x_i)^T\theta_z + b_z\right) - \left(\phi(x_i)^T\theta_{y_i} + b_{y_i}\right) \right) + \frac{1}{2}\theta^T \Sigma^{-1} \theta \right]
\end{aligned}$$
Summary: Learning Logistic Regression
- If the modelling assumptions are fulfilled, i.e., the data generation model from slide 17 and $p(\theta) = N(\theta; 0, \Sigma)$ (the prior is normally distributed), then we use
$$P(y \mid x, \theta) = \frac{\exp\left(\phi(x)^T\theta_y + b_y\right)}{\sum_{z \in Y} \exp\left(\phi(x)^T\theta_z + b_z\right)}$$
and the maximum a-posteriori parameter is
$$\theta_{MAP} = \operatorname{argmin}_\theta \left[ \sum_{i=1}^n \left( \log \sum_{z \in Y} \exp\left(\phi(x_i)^T\theta_z + b_z\right) - \left(\phi(x_i)^T\theta_{y_i} + b_{y_i}\right) \right) + \frac{1}{2}\theta^T \Sigma^{-1} \theta \right]$$
- How can $\theta_{MAP}$ be computed? To be continued...
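Analogous to the binary case, a hedged numpy sketch of this multi-class objective (isotropic prior $\Sigma = \sigma^2 I$ assumed; the offsets $b$ are left unregularized; the optimization is again deferred):

```python
import numpy as np
from scipy.special import logsumexp

def multiclass_map_objective(Theta, b, Phi, y, sigma2=1.0):
    """Negative log-posterior of multi-class logistic regression (up to constants).

    Phi: (n, d) matrix of feature vectors phi(x_i); y: (n,) class indices 0..k-1;
    Theta: (k, d) per-class parameters; b: (k,) offsets.
    Assumes the prior covariance is Sigma = sigma2 * I.
    """
    scores = Phi @ Theta.T + b                  # (n, k): phi(x_i)^T theta_z + b_z
    log_norm = logsumexp(scores, axis=1)        # log sum_z exp(score_z)
    true_scores = scores[np.arange(len(y)), y]  # score of the correct class
    return (log_norm - true_scores).sum() + 0.5 / sigma2 * (Theta ** 2).sum()
```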
GENERATIVE APPROACH
Empirical Inference: Generative Models
- Computation of $p(y \mid X, \theta)$, using the independence of the training data (from the graphical model):
$$p(y \mid X, \theta) = \prod_{i=1}^n p(y_i \mid x_i, \theta)$$
- Generative model: apply Bayes' rule,
$$p(y_i \mid x_i, \theta) = \frac{p(x_i \mid y_i, \theta)\, p(y_i \mid \theta)}{\sum_{z \in Y} p(x_i \mid z, \theta)\, p(z \mid \theta)}$$
where $p(x_i \mid y_i, \theta)$ and $p(y_i \mid \theta)$ are model specific.
Exponential Family
- The probability of a class label is part of the parameter vector: $p(y \mid \theta) = \pi_y$, so
$$p(y_i \mid x_i, \theta) = \frac{p(x_i \mid y_i, \theta)\, \pi_{y_i}}{\sum_{z \in Y} p(x_i \mid z, \theta)\, \pi_z}$$
- The conditional probability of $x$ is given by
$$p(x \mid y, \theta) = h(x) \exp\left(\phi(x)^T\theta_y - \ln g(\theta_y)\right)$$
- For $k$ classes, we partition the parameter vector: $\theta = (\theta_1^T, \dots, \theta_k^T, \pi_1, \dots, \pi_k)^T$.
Exponential Family
- The conditional probability of $x$ is given by
$$p(x \mid y, \theta) = h(x) \exp\left(\phi(x)^T\theta_y - \ln g(\theta_y)\right)$$
- The representation $\phi(x)$ is the sufficient statistic: $\phi(x)$ conveys all useful information about $x$ for the probability distribution.
- $h(x)$ is the base measure; the partition function $g(\theta_y)$ normalizes the distribution.
- The distribution is specified by $h(x)$, $\phi(x)$, $\theta$, and $g$. Many common distributions are in the exponential family.
Exponential Family: Normal Distribution
- Example: the normal distribution
$$N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^m |\Sigma|}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
- Can it be represented in the exponential family form $p(x \mid y, \theta) = h(x) \exp\left(\phi(x)^T\theta_y - \ln g(\theta_y)\right)$?
(Figure: density of a two-dimensional normal distribution with mean $(0, 0)^T$.)
Exponential Family: Normal Distribution
- Exponential family form of the multivariate normal $N(x; \mu, \Sigma)$:
$$\phi(x) = \begin{pmatrix} x \\ x \otimes x \end{pmatrix}, \quad \theta_y = \begin{pmatrix} \Sigma^{-1}\mu \\ -\frac{1}{2}\operatorname{vec}(\Sigma^{-1}) \end{pmatrix}, \quad h(x) = (2\pi)^{-m/2}, \quad g(\theta_y) = \sqrt{|\Sigma|}\, \exp\left(\tfrac{1}{2}\mu^T \Sigma^{-1} \mu\right)$$
Exponential Family: Normal Distribution
- Example: the one-dimensional normal distribution
$$N(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$$
- Exponential family form:
$$\phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \quad \theta_y = \begin{pmatrix} \frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2} \end{pmatrix}, \quad h(x) = (2\pi)^{-1/2}, \quad g(\theta_y) = \sigma \exp\left(\frac{\mu^2}{2\sigma^2}\right)$$
(Figure: density of $N(x; 0, 1)$.)
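A quick numerical check (my own sketch, arbitrary test values) that the exponential family form above reproduces the 1-D normal density:

```python
import numpy as np

mu, sigma = 0.7, 1.3

def normal_pdf(x):
    """Standard parameterization of N(x; mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def exp_family_pdf(x):
    """Exponential family form h(x) * exp(phi(x)^T theta - ln g)."""
    phi = np.array([x, x ** 2])
    theta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])
    h = (2 * np.pi) ** -0.5
    g = sigma * np.exp(mu ** 2 / (2 * sigma ** 2))
    return h * np.exp(phi @ theta - np.log(g))

x = 1.9
assert np.isclose(normal_pdf(x), exp_family_pdf(x))
```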
Exponential Family in Classification
- Substitute the exponential family form $p(x \mid y, \theta) = h(x) \exp\left(\phi(x)^T\theta_y - \ln g(\theta_y)\right)$ into Bayes' rule (recall $p(y \mid \theta) = \pi_y$):
$$p(y_i \mid x_i, \theta) = \frac{p(x_i \mid y_i, \theta)\, \pi_{y_i}}{\sum_{z \in Y} p(x_i \mid z, \theta)\, \pi_z} = \frac{h(x_i) \exp\left(\phi(x_i)^T\theta_{y_i} - \ln g(\theta_{y_i})\right) \pi_{y_i}}{\sum_{z \in Y} h(x_i) \exp\left(\phi(x_i)^T\theta_z - \ln g(\theta_z)\right) \pi_z} = \frac{\exp\left(\phi(x_i)^T\theta_{y_i} + b_{y_i}\right)}{\sum_{z \in Y} \exp\left(\phi(x_i)^T\theta_z + b_z\right)}$$
with $b_{y_i} = \ln \pi_{y_i} - \ln g(\theta_{y_i})$; the parameter vector becomes $\theta = (\theta_1^T, \dots, \theta_k^T, b_1, \dots, b_k)^T$.
- Folding the offsets into the parameter vectors (via a constant feature) yields
$$p(y_i \mid x_i, \theta) = \frac{\exp\left(\phi(x_i)^T\theta_{y_i}\right)}{\sum_{z \in Y} \exp\left(\phi(x_i)^T\theta_z\right)}, \qquad f(x, y) = \phi(x)^T\theta_y, \qquad y(x) = \operatorname{argmax}_{z \in Y} f(x, z)$$
Generative Logistic Regression
- Using the generative approach and its assumptions (the data generation model from slide 52, and $p(x \mid y, \theta)$ an exponential family distribution), we arrived at this conditional distribution for $y$:
$$p(y \mid x, \theta) = \frac{\exp\left(\phi(x)^T\theta_y\right)}{\sum_{z \in Y} \exp\left(\phi(x)^T\theta_z\right)}$$
- We do not know the parameters $\theta_y$; we will soon show how to infer the MAP (maximum a-posteriori) parameter.
Linear Classification: Summary
- In the 2-class case, the linear classifier has the decision function $f(x) = \phi(x)^T\theta + b$ and the classifier $y(x) = \operatorname{sign} f(x)$.
- In the multi-class case, the linear classifier has the decision functions $f(x, y) = \phi(x)^T\theta_y + b_y$ and the classifier $y(x) = \operatorname{argmax}_{z \in Y} f(x, z)$.
- The data is mapped by $\phi(x)$ into feature space.
- The offsets $b_y$ can be appended to the end of the vectors $\theta_y$, with a 1 added to the end of each $\phi(x_i)$.
- The parameter vector $\theta_y$ is a normal vector of a separating hyperplane.