Administrivia
Mini-project 2 due April 7, in class: implement multi-class reductions, naive Bayes, kernel perceptron, multi-class logistic regression, and two-layer neural networks. Training set:
Project proposals due April 2, in class: one page describing the project topic, goals, etc.; list your team members (2+).
Project presentations: April 23 and 27. Final report: May 3.

Kaggle: https://www.kaggle.com/competitions

Kernel Methods. Subhransu Maji. CMPSCI 689: Machine Learning. 24 March 2015 and 26 March 2015.

Feature mapping. Learn non-linear classifiers by mapping the features into a new space. Can we learn the XOR function with such a mapping?

Quadratic feature map
Let x = [x_1, x_2, ..., x_D]. Then the quadratic feature map is defined as:

    φ(x) = [1, √2 x_1, √2 x_2, ..., √2 x_D,
            x_1², x_1x_2, x_1x_3, ..., x_1x_D,
            x_2x_1, x_2², x_2x_3, ..., x_2x_D,
            ...,
            x_Dx_1, x_Dx_2, x_Dx_3, ..., x_D²]

It contains all single and pairwise terms. There are repetitions, e.g., x_1x_2 and x_2x_1, but hopefully the learning algorithm can handle redundant features.
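A minimal NumPy sketch of this feature map (the function name and the ordering of the pairwise terms are my own choices):

    import numpy as np

    def quadratic_feature_map(x):
        """Map x in R^D to [1, sqrt(2)*x_1, ..., sqrt(2)*x_D, all pairwise products x_i*x_j]."""
        x = np.asarray(x, dtype=float)
        pairwise = np.outer(x, x).ravel()      # all x_i * x_j, repetitions included
        return np.concatenate(([1.0], np.sqrt(2) * x, pairwise))

    print(quadratic_feature_map([1.0, 2.0, 3.0]).shape)   # (1 + D + D^2,) = (13,)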

Drawbacks of feature mapping
Computational: suppose training time is linear in the feature dimension; the quadratic feature map squares the training time.
Memory: the quadratic feature map squares the memory required to store the training data.
Statistical: the quadratic feature map squares the number of parameters. For now let's assume that regularization will deal with overfitting.

Quadratic kernel
The dot product between the feature maps of x and z is:

    φ(x)ᵀφ(z) = 1 + 2x_1z_1 + 2x_2z_2 + ... + 2x_Dz_D
                  + x_1²z_1² + x_1x_2z_1z_2 + ... + x_1x_Dz_1z_D
                  + ...
                  + x_Dx_1z_Dz_1 + x_Dx_2z_Dz_2 + ... + x_D²z_D²
              = 1 + 2 Σ_i x_iz_i + Σ_{i,j} x_ix_jz_iz_j
              = 1 + 2xᵀz + (xᵀz)²
              = (1 + xᵀz)²
              = K(x, z)    (the quadratic kernel)

Thus, we can compute φ(x)ᵀφ(z) in almost the same time as needed to compute xᵀz (one extra addition and multiplication). We will rewrite various algorithms using only dot products (or kernel evaluations), and not explicit features.
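A quick numerical check of this identity, reusing the quadratic_feature_map sketch above (illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=5), rng.normal(size=5)

    explicit = quadratic_feature_map(x) @ quadratic_feature_map(z)   # O(D^2) work
    kernel = (1.0 + x @ z) ** 2                                      # O(D) work
    print(np.isclose(explicit, kernel))                              # True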

Perceptron revisited
Input: training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and a feature map φ.

Perceptron training algorithm (obtained by replacing x with φ(x); note the dependence on φ):

    initialize w ← [0, ..., 0]
    for iter = 1, ..., T:
        for i = 1, ..., n:
            predict according to the current model:
                ŷ_i = +1 if wᵀφ(x_i) > 0, -1 otherwise
            if y_i = ŷ_i: no change
            else: w ← w + y_i φ(x_i)

Properties of the weight vector
Linear algebra recap: let U be a set of vectors in Rᴰ, i.e., U = {u_1, u_2, ..., u_d} with u_i ∈ Rᴰ.
Span(U) is the set of all vectors that can be represented as Σ_i a_iu_i with a_i ∈ R.
Null(U) is everything that is left, i.e., Rᴰ \ Span(U).

Perceptron representer theorem: during the run of the perceptron training algorithm, the weight vector w is always in the span of φ(x_1), φ(x_2), ..., φ(x_n):

    w = Σ_i α_i φ(x_i),   with updates   α_i ← α_i + y_i

so the classifier only needs dot products:

    wᵀφ(z) = (Σ_i α_i φ(x_i))ᵀ φ(z) = Σ_i α_i φ(x_i)ᵀφ(z)

Kernelized perceptron
Input: training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and a feature map φ.

Kernelized perceptron training algorithm:

    initialize α ← [0, 0, ..., 0]
    for iter = 1, ..., T:
        for i = 1, ..., n:
            predict according to the current model:
                ŷ_i = +1 if Σ_j α_j φ(x_j)ᵀφ(x_i) > 0, -1 otherwise
            if y_i = ŷ_i: no change
            else: α_i ← α_i + y_i

Here φ(x)ᵀφ(z) = (1 + xᵀz)^p is the polynomial kernel of degree p.
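A short NumPy sketch of this training loop with the degree-p polynomial kernel (function and variable names are my own):

    import numpy as np

    def poly_kernel(X, Z, p=2):
        """K[i, j] = (1 + x_i . z_j)^p for rows of X and Z."""
        return (1.0 + X @ Z.T) ** p

    def kernel_perceptron(X, y, T=10, p=2):
        """X: (n, D) features, y: (n,) labels in {-1, +1}. Returns dual coefficients alpha."""
        n = X.shape[0]
        alpha = np.zeros(n)
        K = poly_kernel(X, X, p)            # precompute the n x n Gram matrix
        for _ in range(T):
            for i in range(n):
                y_hat = 1 if alpha @ K[:, i] > 0 else -1
                if y_hat != y[i]:
                    alpha[i] += y[i]        # only the dual coefficient of x_i changes
        return alpha

    def predict(alpha, X_train, Z, p=2):
        """Sign of sum_j alpha_j K(x_j, z) for each row z of Z."""
        return np.sign(poly_kernel(X_train, Z, p).T @ alpha)

With p = 2 this can, for instance, fit the XOR labels on the four points of {0, 1}², which no linear perceptron on the raw features can.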

Support vector machines
Kernels existed long before SVMs, but were popularized by them. Does the representer theorem hold for SVMs? Recall that the objective function of an SVM is:

    min_w  ½ ||w||² + C Σ_n max(0, 1 - y_n wᵀx_n)

Let w = w∥ + w⊥, where w∥ ∈ Span({x_1, x_2, ..., x_n}) and w⊥ is orthogonal to that span.

Only w∥ affects classification:

    wᵀx_i = (w∥ + w⊥)ᵀx_i = w∥ᵀx_i + w⊥ᵀx_i = w∥ᵀx_i

The norm decomposes:

    wᵀw = (w∥ + w⊥)ᵀ(w∥ + w⊥) = w∥ᵀw∥ + w⊥ᵀw⊥ ≥ w∥ᵀw∥

So dropping w⊥ never increases the objective. Hence, at the optimum, w ∈ Span({x_1, x_2, ..., x_n}).

Kernel k-means
Initialize k centers by picking k points randomly. Repeat till convergence (or max iterations):
    Assign each point to the nearest center (assignment step)
    Estimate the mean of each group (update step)

The representer theorem is easy here: the objective is

    arg min_S  Σ_i Σ_{x ∈ S_i} ||φ(x) - μ_i||²,   where   μ_i = (1/|S_i|) Σ_{x ∈ S_i} φ(x)

Exercise: show how to compute ||φ(x) - μ_i||² using only dot products.
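One way to carry out the exercise (a sketch, my own code): expand ||φ(x) - μ_i||² = K(x, x) - (2/|S_i|) Σ_{x'∈S_i} K(x, x') + (1/|S_i|²) Σ_{x',x''∈S_i} K(x', x''), so an assignment step needs only the Gram matrix.

    import numpy as np

    def kernel_kmeans_assign(K, labels, k):
        """One assignment step of kernel k-means using only the n x n Gram matrix K."""
        n = K.shape[0]
        dist = np.zeros((n, k))
        for i in range(k):
            members = np.where(labels == i)[0]                        # assumes no cluster is empty
            m = len(members)
            cross = K[:, members].sum(axis=1) / m                     # (1/|S_i|) Σ_{x'} K(x, x')
            within = K[np.ix_(members, members)].sum() / (m * m)      # (1/|S_i|²) Σ_{x',x''} K(x', x'')
            dist[:, i] = np.diag(K) - 2.0 * cross + within
        return dist.argmin(axis=1)                                    # new cluster assignments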

What makes a kernel? A kernel is a mapping K: X × X → R. Functions that can be written as dot products are valid kernels:

    K(x, z) = φ(x)ᵀφ(z)

Example: the polynomial kernel K_d^(poly)(x, z) = (1 + xᵀz)^d.

Alternate characterization of a kernel: a function K: X × X → R is a kernel if K is positive semi-definite (psd). This property is also called Mercer's condition. It means that for all square-integrable functions f (i.e., ∫ f(x)² dx < ∞), the following holds:

    ∫∫ f(x) K(x, z) f(z) dz dx ≥ 0

Why is this characterization useful? We can prove properties of kernels that are otherwise hard to prove.

Theorem: if K_1 and K_2 are kernels, then K_1 + K_2 is also a kernel.
Proof:

    ∫∫ f(x) (K_1(x, z) + K_2(x, z)) f(z) dz dx
        = ∫∫ f(x) K_1(x, z) f(z) dz dx + ∫∫ f(x) K_2(x, z) f(z) dz dx
        ≥ 0 + 0

More generally, if K_1, K_2, ..., K_n are kernels, then Σ_i α_i K_i with α_i ≥ 0 is also a kernel. We can build new kernels by linearly combining existing kernels.

Why is this characterization useful? We can show that the Gaussian function is a kernel, also called the radial basis function (RBF) kernel:

    K^(rbf)(x, z) = exp(-γ ||x - z||²)

Let's look at the classification function of an SVM with the RBF kernel:

    f(z) = Σ_i α_i K^(rbf)(x_i, z) = Σ_i α_i exp(-γ ||x_i - z||²)

This is similar to a two-layer network with the RBF as the link function. Gaussian kernels are examples of universal kernels: they can approximate any function in the limit as the training data goes to infinity.
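A small sketch of this RBF decision function (names are mine; gamma is the bandwidth parameter):

    import numpy as np

    def rbf_kernel(X, Z, gamma=1.0):
        """K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
        sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)

    def rbf_decision_function(alpha, X_train, Z, gamma=1.0):
        """f(z) = sum_i alpha_i * K_rbf(x_i, z) for each row z of Z."""
        return rbf_kernel(X_train, Z, gamma).T @ alpha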

Kernels in practice
Feature mapping via kernels often improves performance. MNIST digits test error (60,000 training examples, http://yann.lecun.com/exdb/mnist/):

    linear SVM:             8.4%
    RBF SVM:                1.4%
    polynomial SVM (d=4):   1.1%

Kernels over general structures
Kernels can be defined over any pair of inputs, such as strings, trees, and graphs.
Kernel over trees: K(tree 1, tree 2) = number of common subtrees. This can be computed efficiently using dynamic programming and can be used with SVMs, perceptrons, k-means, etc. (http://en.wikipedia.org/wiki/tree_kernel)
For strings, the number of common substrings is a kernel (see the sketch below).
Graph kernels that measure graph similarity (e.g., the number of common subgraphs) have been used to predict the toxicity of chemical structures.
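As a concrete illustration of the string case (my own sketch, not from the slides): counting matching substring occurrences of every length equals the dot product of substring-occurrence-count feature vectors, hence it is a valid kernel, and a simple dynamic program computes it.

    def common_substring_kernel(s, t):
        """Sum over all substrings u of count_s(u) * count_t(u)."""
        # suffix[i][j] = length of the longest common suffix of s[:i] and t[:j]
        suffix = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        total = 0
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                if s[i - 1] == t[j - 1]:
                    suffix[i][j] = suffix[i - 1][j - 1] + 1
                    total += suffix[i][j]   # one match for every common substring ending here
        return total

    print(common_substring_kernel("abab", "ab"))   # 6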

Kernels for computer vision
Histogram intersection kernel between two histograms a and b:

    K(a, b) = Σ_j min(a_j, b_j)

Introduced by Swain and Ballard (1991) to compare color histograms.
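A one-line sketch of this kernel on matrices whose rows are histograms (illustrative):

    import numpy as np

    def histogram_intersection_kernel(A, B):
        """K[i, j] = sum_d min(A[i, d], B[j, d])."""
        return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=-1)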

Kernel classifier tradeoffs
There is a tradeoff between evaluation time and accuracy:
    Linear kernel:      h(z) = wᵀz                  evaluation is O(feature dimension)
    Non-linear kernel:  h(z) = Σ_i α_i K(x_i, z)    evaluation is O(N × feature dimension), typically more accurate

Here N is the number of support vectors x_i.

Kernel classification function
For the intersection kernel the classification function is:

    h(z) = Σ_i α_i K(x_i, z) = Σ_i α_i ( Σ_{j=1..D} min(x_ij, z_j) )

Kernel classification function
Key insight: the additive property lets us swap the order of the sums:

    h(z) = Σ_i α_i ( Σ_{j=1..D} min(x_ij, z_j) )
         = Σ_{j=1..D} Σ_i α_i min(x_ij, z_j)
         = Σ_{j=1..D} h_j(z_j),   where   h_j(z_j) = Σ_i α_i min(x_ij, z_j)

So the D-dimensional classification function decomposes into a sum of D one-dimensional functions h_j; a sketch of this follows.
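A direct sketch of the decomposition (my own variable names): both functions below compute the same value, the second one coordinate by coordinate. On its own the reordering does not save time; the next slides speed up each h_j.

    import numpy as np

    def h_naive(alpha, X_sv, z):
        """h(z) = sum_i alpha_i * sum_j min(x_ij, z_j): O(N * D)."""
        return float(alpha @ np.minimum(X_sv, z).sum(axis=1))

    def h_additive(alpha, X_sv, z):
        """Same value, written as sum_j h_j(z_j) with h_j(z_j) = sum_i alpha_i * min(x_ij, z_j)."""
        return float(sum(alpha @ np.minimum(X_sv[:, j], z[j]) for j in range(X_sv.shape[1])))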

Kernel classification function: Algorithm 1

    h_j(z_j) = Σ_i α_i min(x_ij, z_j)

Evaluating each h_j directly in this form costs O(N).

Kernel classification function: Algorithm 1 [Maji et al. PAMI 13]
Each h_j can be evaluated in O(log N) instead of O(N) by splitting the sum at z_j:

    h_j(z_j) = Σ_i α_i min(x_ij, z_j)
             = Σ_{i: x_ij < z_j} α_i x_ij  +  z_j Σ_{i: x_ij ≥ z_j} α_i

Sort the support vector values in each coordinate and pre-compute these sums for each rank. To evaluate, find the position of z_j in the sorted list of support vector values x_ij; this can be done in O(log N) time using binary search.
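A sketch of this precomputation and lookup for a single coordinate j (illustrative, following the split above; x_j holds the support vector values in coordinate j and alpha the dual coefficients):

    import numpy as np

    def precompute_coordinate(x_j, alpha):
        """Sort the coordinate values and pre-compute the two rank-indexed sums."""
        order = np.argsort(x_j)
        xs, a = x_j[order], alpha[order]
        prefix_ax = np.concatenate(([0.0], np.cumsum(a * xs)))         # sum of alpha_i * x_ij over ranks < r
        suffix_a = np.concatenate(([0.0], np.cumsum(a[::-1])))[::-1]   # sum of alpha_i over ranks >= r
        return xs, prefix_ax, suffix_a

    def h_j_exact(z_j, xs, prefix_ax, suffix_a):
        """h_j(z_j) = sum_{x_ij < z_j} alpha_i x_ij + z_j * sum_{x_ij >= z_j} alpha_i, via binary search."""
        r = np.searchsorted(xs, z_j, side="left")   # number of support values strictly below z_j
        return prefix_ax[r] + z_j * suffix_a[r]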

Kernel classification function: Algorithm 2 [Maji et al. PAMI 13]
For many problems h_j is smooth (the blue plot on the slide). Hence we can approximate it with a small number of uniformly spaced piecewise-linear segments (the red plot), bringing the per-coordinate evaluation cost down from O(N), or O(log N) with Algorithm 1, to O(1). This saves both time and space.
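A sketch of the O(1) approximation for one coordinate (my own simplification: tabulate the exact h_j on a uniform grid, then interpolate; reuses precompute_coordinate and h_j_exact from the previous sketch):

    import numpy as np

    def build_table(xs, prefix_ax, suffix_a, lo, hi, num_bins=30):
        """Tabulate the exact h_j at uniformly spaced breakpoints in [lo, hi]."""
        grid = np.linspace(lo, hi, num_bins)
        r = np.searchsorted(xs, grid, side="left")
        return grid, prefix_ax[r] + grid * suffix_a[r]

    def h_j_approx(z_j, grid, values):
        """O(1) evaluation: locate the bin directly and linearly interpolate."""
        step = grid[1] - grid[0]
        t = min(max((z_j - grid[0]) / step, 0.0), len(grid) - 1.001)
        i = int(t)
        frac = t - i
        return (1.0 - frac) * values[i] + frac * values[i + 1]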

Kernel classification function: additive kernels [Maji et al. PAMI 13]
The same trick applies to any additive kernel, i.e., any kernel of the form

    K(x, z) = Σ_i k_i(x_i, z_i)

Examples:
    Intersection:    k(a, b) = min(a, b)
    Chi-squared:     k(a, b) = 2ab / (a + b)
    Jensen-Shannon:  k(a, b) = a log((a + b)/a) + b log((a + b)/b)

Linear and intersection kernel SVM
Using histogram of oriented gradients features:

    Dataset                     Measure           Linear SVM   IK SVM   Speedup
    INRIA pedestrians           Recall @ 2 FPPI   78.9         86.6     2594x
    DC pedestrians              Accuracy          72.2         89.0     2253x
    Caltech101, 15 examples     Accuracy          38.8         50.1     37x
    Caltech101, 30 examples     Accuracy          44.3         56.6     62x
    MNIST digits                Error             1.44         0.77     2500x
    UIUC cars (single scale)    Precision @ EER   89.8         98.5     65x

On average the intersection kernel SVM is more accurate than linear and 100-1000x faster than a standard kernel classifier. A similar idea can be applied to training as well. Research question: when can we approximate kernels efficiently?

Slides credit
Some of the slides are based on the CIML book by Hal Daumé III.
Experiments on the various datasets: Efficient Classification for Additive Kernel SVMs, S. Maji, A. C. Berg and J. Malik, PAMI, Jan 2013.
Some resources:
    LIBSVM (kernel SVM classifier training and testing): http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    LIBLINEAR (fast linear classifier training): http://www.csie.ntu.edu.tw/~cjlin/liblinear/
    LIBSPLINE (fast additive kernel training and testing): https://github.com/msubhransu/libspline