Machine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015

Machine Learning: Classification, Discriminative Learning. Structured output, structured input, discriminative function, joint input-output features, likelihood maximization, logistic regression (binary & multi-class case), conditional random fields. Marc Toussaint, University of Stuttgart, Summer 2015.

Discriminative Function

Represent a discrete-valued function $F: \mathbb{R}^n \to Y$ via a discriminative function $f: \mathbb{R}^n \times Y \to \mathbb{R}$ such that $F: x \mapsto \operatorname{argmax}_y f(x, y)$.

A discriminative function $f(x, y)$ maps an input $x$ to an output $\hat y(x) = \operatorname{argmax}_y f(x, y)$.

A discriminative function $f(x, y)$ has high value if $y$ is a correct answer to $x$, and low value if $y$ is a false answer. In that way a discriminative function e.g. discriminates correct sequence/image/graph-labellings from wrong ones.
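As a minimal illustration (my own toy sketch, not from the lecture; the prototype-based score is a made-up example), the prediction rule $F(x) = \operatorname{argmax}_y f(x, y)$ is just an argmax over candidate outputs:

```python
import numpy as np

def predict(f, x, labels):
    """F(x) = argmax_y f(x, y): return the label with the highest score."""
    scores = [f(x, y) for y in labels]
    return labels[int(np.argmax(scores))]

# Toy discriminative function: three fixed prototypes, score = negative distance to the prototype.
prototypes = {1: np.array([0.0, 0.0]), 2: np.array([2.0, 0.0]), 3: np.array([0.0, 2.0])}
f = lambda x, y: -np.linalg.norm(x - prototypes[y])

print(predict(f, np.array([1.8, 0.1]), labels=[1, 2, 3]))  # closest prototype is class 2
```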

Example Discriminative Function

Input: $x \in \mathbb{R}^2$; output $y \in \{1, 2, 3\}$.

[Figure: the three conditional class probabilities $p(y{=}1 \mid x)$, $p(y{=}2 \mid x)$, $p(y{=}3 \mid x)$ over the input space. Here they are already scaled to the interval $[0,1]$; this is explained later.]

How could we parameterize a discriminative function? Well, linear in features!

$f(x, y) = \sum_{j=1}^k \phi_j(x, y)\,\beta_j = \phi(x, y)^\top \beta$

Example: Let $x \in \mathbb{R}$ and $y \in \{1, 2, 3\}$. Typical features might be
$\phi(x, y) = \big([y{=}1]\,x,\ [y{=}2]\,x,\ [y{=}3]\,x,\ [y{=}1]\,x^2,\ [y{=}2]\,x^2,\ [y{=}3]\,x^2\big)$

Example: Let $x, y \in \{0, 1\}$ both be discrete. Features might be
$\phi(x, y) = \big([x{=}0][y{=}0],\ [x{=}0][y{=}1],\ [x{=}1][y{=}0],\ [x{=}1][y{=}1]\big)$
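A hedged sketch of the first example (my own code; `phi_input`, `phi_joint` and the weight values are made up, and the features are grouped per class rather than interleaved as on the slide, which is equivalent up to a permutation of $\beta$):

```python
import numpy as np

def phi_input(x):
    """Input features for scalar x: (x, x^2)."""
    return np.array([x, x**2])

def phi_joint(x, y, num_classes=3):
    """Joint input-output features: the block for class y holds phi_input(x), other blocks are zero."""
    k = len(phi_input(x))
    out = np.zeros(num_classes * k)
    out[(y - 1) * k:(y - 1) * k + k] = phi_input(x)
    return out

beta = np.array([0.5, -1.0, 0.2, 0.3, -0.4, 1.0])   # one weight per joint feature
f = lambda x, y: phi_joint(x, y) @ beta              # f(x, y) = phi(x, y)^T beta
print([f(1.5, y) for y in (1, 2, 3)])                # scores for the three classes
```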

More intuition...

Features connect input and output. Each $\phi_j(x, y)$ allows $f$ to capture a certain dependence between $x$ and $y$.

If both $x$ and $y$ are discrete, a feature $\phi_j(x, y)$ is typically a joint indicator function (a logical function), indicating a certain event.

Each weight $\beta_j$ mirrors how important/frequent/infrequent the dependence described by $\phi_j(x, y)$ is.

$f(x, y)$ is also called an energy, and the following methods are also called energy-based modelling, esp. in neural modelling.

In the remainder:
Logistic regression: binary case
Multi-class case
Preliminary comments on the general structured output case (Conditional Random Fields)

Logistic regression: Binary case

Binary classification example

[Figure: training data and the learned decision boundary in the 2D input space.]

Input $x \in \mathbb{R}^2$, output $y \in \{0, 1\}$. The example shows RBF ridge logistic regression.

A loss function for classification

Data $D = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$.

Bad idea: squared error regression (see also Hastie 4.2).

Maximum likelihood: We interpret the discriminative function $f(x, y)$ as defining class probabilities
$p(y \mid x) = \frac{e^{f(x,y)}}{\sum_{y'} e^{f(x,y')}}$
$p(y \mid x)$ should be high for the correct class, and low otherwise.

For each $(x_i, y_i)$ we want to maximize the likelihood $p(y_i \mid x_i)$:
$L^{\text{neg-log-likelihood}}(\beta) = -\sum_{i=1}^n \log p(y_i \mid x_i)$
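A small sketch of exactly these two quantities (my own code; the score matrix `F` and labels `y` are made-up toy values standing in for $f(x_i, y)$):

```python
import numpy as np

def class_probs(F):
    """Rows of F hold f(x_i, y) for all classes y; returns p(y | x_i) via the softmax."""
    F = F - F.max(axis=1, keepdims=True)      # stabilize the exponentials
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)

def neg_log_likelihood(F, y):
    """L(beta) = -sum_i log p(y_i | x_i), with y holding class indices 0..M-1."""
    P = class_probs(F)
    return -np.log(P[np.arange(len(y)), y]).sum()

# toy scores for n = 4 inputs and M = 3 classes, plus toy labels
F = np.array([[2.0, 0.1, -1.0], [0.0, 1.5, 0.2], [-0.3, 0.0, 2.2], [1.0, 1.0, 1.0]])
y = np.array([0, 1, 2, 0])
print(neg_log_likelihood(F, y))
```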

Logistic regression

In the binary case, we have two functions $f(x, 0)$ and $f(x, 1)$. W.l.o.g. we may fix $f(x, 0) = 0$. Therefore we choose features
$\phi(x, y) = [y{=}1]\,\phi(x)$
with arbitrary input features $\phi(x) \in \mathbb{R}^k$.

We have
$\hat y = \operatorname{argmax}_y f(x, y) = \begin{cases} 1 & \text{if } \phi(x)^\top\beta > 0 \\ 0 & \text{else} \end{cases}$
and conditional class probabilities
$p(1 \mid x) = \frac{e^{f(x,1)}}{e^{f(x,0)} + e^{f(x,1)}} = \sigma(f(x, 1))$
with the logistic sigmoid function $\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{e^{-z} + 1}$.

[Figure: plot of the logistic sigmoid $\sigma(z) = \exp(z)/(1+\exp(z))$.]

Given data $D = \{(x_i, y_i)\}_{i=1}^n$, we minimize
$L^{\text{logistic}}(\beta) = -\sum_{i=1}^n \log p(y_i \mid x_i) + \lambda \|\beta\|^2 = -\sum_{i=1}^n \big[ y_i \log p(1 \mid x_i) + (1 - y_i) \log(1 - p(1 \mid x_i)) \big] + \lambda \|\beta\|^2$

Optimal parameters $\beta$

Gradient (see exercises):
$\frac{\partial L^{\text{logistic}}(\beta)}{\partial \beta} = \sum_{i=1}^n (p_i - y_i)\,\phi(x_i) + 2\lambda I \beta = X^\top (p - y) + 2\lambda I \beta$
with $p_i := p(y{=}1 \mid x_i)$ and $X = \begin{pmatrix} \phi(x_1)^\top \\ \vdots \\ \phi(x_n)^\top \end{pmatrix} \in \mathbb{R}^{n \times k}$.

$\frac{\partial L^{\text{logistic}}(\beta)}{\partial \beta}$ is non-linear in $\beta$ ($\beta$ also enters the calculation of $p_i$), so there is no analytic solution.

Newton algorithm: iterate
$\beta \leftarrow \beta - H^{-1}\,\frac{\partial L^{\text{logistic}}(\beta)}{\partial \beta}$
with Hessian $H = \frac{\partial^2 L^{\text{logistic}}(\beta)}{\partial \beta^2} = X^\top W X + 2\lambda I$, where $W$ is diagonal with $W_{ii} = p_i(1 - p_i)$.
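A minimal numpy sketch of this Newton iteration (my own code, not the lecture's implementation; the function name `logistic_newton` and the two-blob toy data are assumptions, the gradient and Hessian are the formulas above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, lam=1.0, iters=20):
    """Binary ridge logistic regression: Newton steps beta <- beta - H^{-1} grad."""
    n, k = X.shape
    beta = np.zeros(k)
    I = np.eye(k)
    for _ in range(iters):
        p = sigmoid(X @ beta)                      # p_i = p(y=1 | x_i)
        grad = X.T @ (p - y) + 2 * lam * beta      # X^T (p - y) + 2*lambda*I*beta
        W = p * (1 - p)                            # diagonal of W
        H = X.T @ (X * W[:, None]) + 2 * lam * I   # X^T W X + 2*lambda*I
        beta = beta - np.linalg.solve(H, grad)
    return beta

# toy data: two Gaussian blobs, with a constant bias feature appended
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
X = np.hstack([X, np.ones((100, 1))])
y = np.array([0] * 50 + [1] * 50)
beta = logistic_newton(X, y, lam=1.0)
print((sigmoid(X @ beta) > 0.5).astype(int)[:10])  # predicted labels for the first 10 points
```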

RBF ridge logistic regression:

[Figure: training data, decision boundary, and the learned probability surface $p(1 \mid x)$.]

./x.exe -mode 2 -modelfeaturetype 4 -lambda 1e+0 -rbfbias 0 -rbfwidth .2

Polynomial (cubic) logistic regression:

[Figure: training data, decision boundary, and the learned probability surface $p(1 \mid x)$.]

./x.exe -mode 2 -modelfeaturetype 3 -lambda 1e+0

Recap: Regression vs. Classification

Regression: parameters $\beta$; predictive function $f(x) = \phi(x)^\top \beta$; least squares loss $L^{\text{ls}}(\beta) = \sum_{i=1}^n (y_i - f(x_i))^2$.

Classification: parameters $\beta$; discriminative function $f(x, y) = \phi(x, y)^\top \beta$; class probabilities $p(y \mid x) \propto e^{f(x,y)}$; neg-log-likelihood $L^{\text{neg-log-likelihood}}(\beta) = -\sum_{i=1}^n \log p(y_i \mid x_i)$.

Logistic regression: Multi-class case

Logistic regression: Multi-class case

Data $D = \{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, .., M\}$.

We choose $f(x, y) = \phi(x, y)^\top \beta$ with
$\phi(x, y) = \big([y{=}1]\,\phi(x),\ [y{=}2]\,\phi(x),\ \ldots,\ [y{=}M]\,\phi(x)\big)$
where $\phi(x)$ are arbitrary features. We have $M$ (or $M{-}1$) parameter blocks in $\beta$.

Conditional class probabilities:
$p(y \mid x) = \frac{e^{f(x,y)}}{\sum_{y'} e^{f(x,y')}}$
(optionally we may set $f(x, M) = 0$ and drop the last entry). Then
$f(x, y) = \log \frac{p(y \mid x)}{p(y{=}M \mid x)}$
(the discriminative functions model "log-ratios").

Given data $D = \{(x_i, y_i)\}_{i=1}^n$, we minimize
$L^{\text{logistic}}(\beta) = -\sum_{i=1}^n \log p(y{=}y_i \mid x_i) + \lambda \|\beta\|^2$

Optimal parameters $\beta$

Gradient:
$\frac{\partial L^{\text{logistic}}(\beta)}{\partial \beta_c} = \sum_{i=1}^n (p_{ic} - y_{ic})\,\phi(x_i) + 2\lambda I \beta_c = X^\top (p_c - y_c) + 2\lambda I \beta_c$, with $p_{ic} = p(y{=}c \mid x_i)$.

Hessian:
$H = \frac{\partial^2 L^{\text{logistic}}(\beta)}{\partial \beta_c \partial \beta_d} = X^\top W_{cd} X + 2[c{=}d]\,\lambda I$, with $W_{cd}$ diagonal and $W_{cd,ii} = p_{ic}\,([c{=}d] - p_{id})$.
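A hedged numpy sketch of the multi-class gradient above, used here in a plain gradient-descent loop instead of the full Newton step (my own toy code and data; the function names, learning rate and class centers are assumptions):

```python
import numpy as np

def softmax(F):
    F = F - F.max(axis=1, keepdims=True)
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_logistic_gd(X, y, M, lam=1.0, lr=0.005, iters=2000):
    """beta is k x M; column c is beta_c. Per-class gradient: X^T (p_c - y_c) + 2*lambda*beta_c."""
    n, k = X.shape
    beta = np.zeros((k, M))
    Y = np.eye(M)[y]                          # one-hot labels y_ic
    for _ in range(iters):
        P = softmax(X @ beta)                 # p_ic = p(y=c | x_i)
        grad = X.T @ (P - Y) + 2 * lam * beta
        beta -= lr * grad
    return beta

# toy 3-class data in 2D with a constant bias feature
rng = np.random.default_rng(1)
centers = np.array([[-2, 0], [2, 0], [0, 2]])
X = np.vstack([rng.normal(c, 0.7, (40, 2)) for c in centers])
X = np.hstack([X, np.ones((120, 1))])
y = np.repeat([0, 1, 2], 40)
beta = multiclass_logistic_gd(X, y, M=3)
print((softmax(X @ beta).argmax(axis=1) == y).mean())   # training accuracy
```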

Polynomial (quadratic) ridge 3-class logistic regression:

[Figure: training data, the $p = 0.5$ decision boundaries, and the three class-probability surfaces.]

./x.exe -mode 3 -modelfeaturetype 3 -lambda 1e+1

Structured Output & Structured Input

Structured Output & Structured Input

Regression: $\mathbb{R}^n \to \mathbb{R}$

Structured output:
$\mathbb{R}^n \to$ binary class label $\{0, 1\}$
$\mathbb{R}^n \to$ integer class label $\{1, 2, .., M\}$
$\mathbb{R}^n \to$ sequence labelling $y_{1:T}$
$\mathbb{R}^n \to$ image labelling $y_{1:W,1:H}$
$\mathbb{R}^n \to$ graph labelling $y_{1:N}$

Structured input:
relational database $\to \mathbb{R}$
labelled graph/sequence $\to \mathbb{R}$

Examples for Structured Output

Text tagging: X = sentence, Y = tagging of each word. http://sourceforge.net/projects/crftagger

Image segmentation: X = image, Y = labelling of each pixel. http://scholar.google.com/scholar?cluster=344770229904273582

Depth estimation: X = single image, Y = depth map. http://make3d.cs.cornell.edu/

CRFs in image processing

CRFs in image processing

Google "conditional random field image":
Multiscale Conditional Random Fields for Image Labeling (CVPR 2004)
Scale-Invariant Contour Completion Using Conditional Random Fields (ICCV 2005)
Conditional Random Fields for Object Recognition (NIPS 2004)
Image Modeling using Tree Structured Conditional Random Fields (IJCAI 2007)
A Conditional Random Field Model for Video Super-resolution (ICPR 2006)

Conditional Random Fields

Conditional Random Fields (CRFs)

CRFs are a generalization of logistic binary and multi-class classification.

The output $y$ may be an arbitrary (usually discrete) thing (e.g., a sequence/image/graph labelling).

Hopefully we can compute the maximization $\operatorname{argmax}_y f(x, y)$ over the output efficiently! $f(x, y)$ should be structured in $y$ so that this optimization is efficient.

The name CRF describes that $p(y \mid x) \propto e^{f(x,y)}$ defines a probability distribution (a.k.a. random field) over the output $y$ conditional to the input $x$. The word "field" usually means that this distribution is structured (a graphical model; see later part of the lecture).
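To illustrate why structure in $y$ makes the argmax tractable, here is a hedged sketch (my own example, not part of the lecture) for a chain-structured score $f(x, y) = \sum_t u_t(y_t) + \sum_t v(y_{t-1}, y_t)$, maximized by Viterbi-style dynamic programming instead of enumerating all $M^T$ labellings (the score tables `unary` and `pairwise` are assumed given, e.g. from features times weights):

```python
import numpy as np

def chain_argmax(unary, pairwise):
    """unary: T x M local scores u_t(y_t); pairwise: M x M coupling scores v(y_{t-1}, y_t).
    Returns the labelling y maximizing sum_t u_t(y_t) + sum_t v(y_{t-1}, y_t)."""
    T, M = unary.shape
    value = unary[0].copy()                  # best score of any prefix ending in each label
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        cand = value[:, None] + pairwise + unary[t][None, :]   # [prev label, current label]
        back[t] = cand.argmax(axis=0)
        value = cand.max(axis=0)
    y = np.zeros(T, dtype=int)
    y[-1] = value.argmax()
    for t in range(T - 1, 0, -1):            # backtrack the best path
        y[t - 1] = back[t, y[t]]
    return y

# toy example: 5 positions, 3 labels, couplings that favour keeping the same label
rng = np.random.default_rng(2)
unary = rng.normal(0, 1, (5, 3))
pairwise = np.full((3, 3), -0.5) + 1.0 * np.eye(3)
print(chain_argmax(unary, pairwise))
```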

CRFs: Core equations

$f(x, y) = \phi(x, y)^\top \beta$

$p(y \mid x) = \frac{e^{f(x,y)}}{\sum_{y'} e^{f(x,y')}} = e^{f(x,y) - Z(x,\beta)}$

$Z(x, \beta) = \log \sum_{y'} e^{f(x,y')}$   (log partition function)

$L(\beta) = -\sum_i \log p(y_i \mid x_i) = -\sum_i \big[ f(x_i, y_i) - Z(x_i, \beta) \big]$

$\nabla_\beta Z(x, \beta) = \sum_y p(y \mid x)\, \nabla_\beta f(x, y)$

$\nabla^2_\beta Z(x, \beta) = \sum_y p(y \mid x)\, \nabla_\beta f(x, y)\, \nabla_\beta f(x, y)^\top - \nabla_\beta Z\, \nabla_\beta Z^\top$

This gives the neg-log-likelihood $L(\beta)$, its gradient and Hessian.
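For small output spaces these equations can be checked by brute-force enumeration; a hedged sketch (my own toy code; the feature map `phi`, the weights and the pair-of-binary-labels output space are made up) computing $Z(x,\beta)$, $p(y \mid x)$ and the gradient of $-\log p(y \mid x)$ for one training pair:

```python
import numpy as np
from itertools import product

def crf_quantities(phi, beta, x, y_true, label_space):
    """Brute-force enumeration of the output space: returns Z(x, beta), p(.|x), and the
    gradient of -log p(y_true|x), i.e. -(phi(x, y_true) - sum_y p(y|x) phi(x, y))."""
    ys = list(label_space)
    feats = np.array([phi(x, y) for y in ys])      # one feature vector phi(x, y') per output y'
    f = feats @ beta                               # f(x, y') = phi(x, y')^T beta
    fmax = f.max()
    Z = fmax + np.log(np.exp(f - fmax).sum())      # log partition function, stabilized
    p = np.exp(f - Z)                              # p(y'|x)
    grad = -(feats[ys.index(y_true)] - p @ feats)
    return Z, p, grad

# toy structured output: a pair of binary labels; features couple x with each label and the two labels
phi = lambda x, y: np.array([x * y[0], x * y[1], float(y[0] == y[1])])
beta = np.array([0.5, -0.3, 1.0])
print(crf_quantities(phi, beta, x=1.2, y_true=(1, 0), label_space=list(product([0, 1], repeat=2))))
```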

Training CRFs

Maximize the conditional likelihood. But the Hessian is typically too large (images: ~10,000 pixels, ~50,000 features). If $f(x, y)$ has a chain structure over $y$, the Hessian is usually banded and computation time is linear in the chain length.

Alternative: efficient gradient methods, e.g. Vishwanathan et al.: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.

Other loss variants, e.g. hinge loss as with Support Vector Machines ("structured output SVMs").

Perceptron algorithm: minimizes hinge loss using a gradient method.

CRFs: the structure is in the features

Assume $y = (y_1, .., y_l)$ is a tuple of individual (local) discrete labels. We can assume that $f(x, y)$ is linear in features,
$f(x, y) = \sum_{j=1}^k \phi_j(x, y_j)\,\beta_j = \phi(x, y)^\top \beta$
where each feature $\phi_j(x, y_j)$ depends only on a subset $y_j$ of the labels. $\phi_j(x, y_j)$ effectively couples the labels in $y_j$. Then $e^{f(x,y)}$ is a factor graph.

Example: pair-wise coupled pixel labels

[Figure: a grid-structured factor graph over the pixel labels $y_{11}, .., y_{WH}$, with the image $x$ as input.]

Each black box corresponds to features $\phi_j(y_j)$ which couple neighboring pixel labels $y_j$. Each gray box corresponds to features $\phi_j(x_j, y_j)$ which couple a local pixel observation $x_j$ with a pixel label $y_j$.

Kernel Ridge Regression: the Kernel Trick

Reconsider the solution of ridge regression (using the Woodbury identity):
$\hat\beta^{\text{ridge}} = (X^\top X + \lambda I_k)^{-1} X^\top y = X^\top (X X^\top + \lambda I_n)^{-1} y$

Recall $X^\top = (\phi(x_1), .., \phi(x_n)) \in \mathbb{R}^{k \times n}$; then
$f^{\text{ridge}}(x) = \phi(x)^\top \hat\beta^{\text{ridge}} = \phi(x)^\top X^\top\, (X X^\top + \lambda I)^{-1} y = \kappa(x)\,(K + \lambda I)^{-1} y$

$K := X X^\top$ is called the kernel matrix and has elements $K_{ij} = k(x_i, x_j) := \phi(x_i)^\top \phi(x_j)$.
$\kappa$ is the row vector $\kappa(x) := \phi(x)^\top X^\top = k(x, x_{1:n})$.
The kernel function $k(x, x')$ calculates the scalar product in feature space.

The Kernel Trick

We can rewrite kernel ridge regression as
$f^{\text{ridge}}(x) = \kappa(x)\,(K + \lambda I)^{-1} y$ with $K_{ij} = k(x_i, x_j)$, $\kappa_i(x) = k(x, x_i)$.

At no place do we actually need to compute the parameters $\hat\beta$.
At no place do we actually need to compute the features $\phi(x_i)$.
We only need to be able to compute $k(x, x')$ for any $x, x'$.

This rewriting is called the kernel trick. It has great implications:
Instead of inventing funny non-linear features, we may directly invent funny kernels.
Inventing a kernel is intuitive: $k(x, x')$ expresses how correlated $y$ and $y'$ should be; it is a measure of similarity, it compares $x$ and $x'$. Specifying how comparable $x$ and $x'$ are is often more intuitive than defining features that might work.
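A hedged numpy sketch of exactly this rewriting (my own code and toy 1D data; the squared-exponential kernel used here is defined further below in the lecture, and `gamma`, `lam` are arbitrary choices):

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    """k(x, x') = exp(-gamma * |x - x'|^2) for all pairs of rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=1e-3, gamma=10.0):
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)     # alpha = (K + lambda I)^{-1} y

def kernel_ridge_predict(X_train, alpha, X_query, gamma=10.0):
    kappa = rbf_kernel(X_query, X_train, gamma)             # kappa_i(x) = k(x, x_i)
    return kappa @ alpha                                     # f(x) = kappa(x) (K + lambda I)^{-1} y

# toy data: noisy sine; note that neither beta nor explicit features appear anywhere
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(X, alpha, np.array([[0.25], [0.5], [0.75]])))
```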

Every choice of features implies a kernel. But does every choice of kernel correspond to a specific choice of features?

Reproducing Kernel Hilbert Space

Let's define a vector space $H_k$, spanned by infinitely many basis elements $\{\phi_x = k(\cdot, x) : x \in \mathbb{R}^d\}$. Vectors in this space are linear combinations of such basis elements, e.g.
$f = \sum_i \alpha_i \phi_{x_i}, \quad f(x) = \sum_i \alpha_i k(x, x_i)$

Let's define a scalar product in this space by first defining the scalar product for every pair of basis elements,
$\langle \phi_x, \phi_y \rangle := k(x, y)$
This is positive definite. Note that it follows
$\langle \phi_x, f \rangle = \sum_i \alpha_i \langle \phi_x, \phi_{x_i} \rangle = \sum_i \alpha_i k(x, x_i) = f(x)$

The $\phi_x = k(\cdot, x)$ is the feature we associate with $x$. Note that this is a function and infinite-dimensional.

Choosing $\alpha = (K + \lambda I)^{-1} y$ represents $f^{\text{ridge}}(x) = \sum_{i=1}^n \alpha_i k(x, x_i) = \kappa(x)\alpha$, and shows that ridge regression has a finite-dimensional solution in the basis elements $\{\phi_{x_i}\}$. A more general version of this insight is called the representer theorem.

Example Kernels

Kernel functions need to be positive definite: for any set of points the kernel matrix $K$ (with $K_{ij} = k(x_i, x_j)$) is a positive definite matrix, i.e., $z^\top K z > 0$ for all $z \neq 0$.

Examples:

Polynomial: $k(x, x') = (x^\top x')^d$. Let's verify for $d = 2$, $\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$:
$k(x, x') = (x^\top x')^2 = (x_1 x_1' + x_2 x_2')^2 = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\,(x_1'^2, \sqrt{2}\,x_1' x_2', x_2'^2)^\top = \phi(x)^\top \phi(x')$

Squared exponential (radial basis function): $k(x, x') = \exp(-\gamma\,|x - x'|^2)$
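A tiny numeric check of the $d = 2$ identity above, plus the squared-exponential kernel for comparison (my own sketch; the test points and `gamma` are arbitrary):

```python
import numpy as np

def poly2_kernel(x, xp):
    return float(x @ xp) ** 2

def poly2_features(x):
    """phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so that phi(x).phi(x') = (x.x')^2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def rbf_kernel(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp)**2))

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(x, xp), poly2_features(x) @ poly2_features(xp))   # both equal (x.x')^2 = 1
print(rbf_kernel(x, xp))
```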

Kernel Logistic Regression

For logistic regression we compute $\beta$ using the Newton iterates
$\beta \leftarrow \beta - (X^\top W X + 2\lambda I)^{-1} \big[ X^\top (p - y) + 2\lambda\beta \big]$   (1)
$\qquad = (X^\top W X + 2\lambda I)^{-1} X^\top \big[ W X \beta - (p - y) \big]$   (2)

Using the Woodbury identity
$(X^\top W X + A)^{-1} X^\top W = A^{-1} X^\top (X A^{-1} X^\top + W^{-1})^{-1}$   (3)
we can rewrite this as
$\beta \leftarrow \tfrac{1}{2\lambda}\, X^\top \big( X \tfrac{1}{2\lambda} X^\top + W^{-1} \big)^{-1} \big[ X\beta - W^{-1}(p - y) \big]$   (4)
$\qquad = X^\top \big( X X^\top + 2\lambda W^{-1} \big)^{-1} \big[ X\beta - W^{-1}(p - y) \big]$   (5)

We can now compute the discriminative function values $f_X = X\beta \in \mathbb{R}^n$ at the training points by iterating over those instead of $\beta$:
$f_X \leftarrow X X^\top \big( X X^\top + 2\lambda W^{-1} \big)^{-1} \big[ X\beta - W^{-1}(p - y) \big]$   (6)
$\qquad = K \big( K + 2\lambda W^{-1} \big)^{-1} \big[ f_X - W^{-1}(p_X - y) \big]$   (7)

Note that $p_X$ on the RHS also depends on $f_X$. Given $f_X$ we can compute the discriminative function values $f_Z = Z\beta \in \mathbb{R}^m$ for a set of $m$ query points $Z$ using
$f_Z \leftarrow \kappa \big( K + 2\lambda W^{-1} \big)^{-1} \big[ f_X - W^{-1}(p_X - y) \big], \quad \kappa = Z X^\top$   (8)
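A hedged sketch of the iteration (7) and the query-point formula (8) (my own toy code, not the lecture's implementation; the RBF kernel, the clipping of $p(1-p)$, and all function names and data are assumptions):

```python
import numpy as np

def rbf_kernel(A, B, gamma=5.0):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def kernel_logistic_fit(K, y, lam=1.0, iters=50):
    """Iterate f_X <- K (K + 2*lam*W^{-1})^{-1} [f_X - W^{-1}(p_X - y)]  (equation (7))."""
    f = np.zeros(len(y))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-f))                        # p_X
        Winv = 1.0 / np.clip(p * (1 - p), 1e-6, None)       # diagonal of W^{-1}, kept away from 0
        a = np.linalg.solve(K + 2 * lam * np.diag(Winv), f - Winv * (p - y))
        f = K @ a
    return a                                                 # reused for query points via (8)

def kernel_logistic_predict(X_train, a, X_query, gamma=5.0):
    kappa = rbf_kernel(X_query, X_train, gamma)              # kappa = Z X^T in kernel form
    return 1.0 / (1.0 + np.exp(-(kappa @ a)))                # sigma(f_Z) = p(1 | query)

# toy binary data in 2D
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 0.6, (40, 2)), rng.normal(+1, 0.6, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
a = kernel_logistic_fit(rbf_kernel(X, X), y)
print(kernel_logistic_predict(X, a, np.array([[-1.0, -1.0], [1.0, 1.0]])))
```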