
CSCI5654 (Linear Programming, Fall 2013) Lectures 10-12 Lectures 10,11 Slide# 1

Today's Lecture
1. Introduction to norms: $L_1, L_2, L_\infty$.
2. Casting absolute value and max operators.
3. Norm minimization problems.
4. Applications: fitting, classification, denoising, etc.
Lectures 10,11 Slide# 2

Norm
Basic idea: a notion of length of a vector and of distance between vectors.
Definition (Norm): Let $\ell : X \to \mathbb{R}_{\geq 0}$ be a mapping from a vector space to the non-negative reals. The function $\ell$ is a norm iff
1. $\ell(\vec{x}) = 0$ iff $\vec{x} = \vec{0}$,
2. $\ell(a\vec{x}) = |a|\,\ell(\vec{x})$,
3. $\ell(\vec{x} + \vec{y}) \leq \ell(\vec{x}) + \ell(\vec{y})$ (Triangle Inequality).
Length of $\vec{x}$: $\ell(\vec{x})$. Distance between $\vec{x}, \vec{y}$: $\ell(\vec{y} - \vec{x})\ (= \ell(\vec{x} - \vec{y}))$.
Lectures 10,11 Slide# 3

$L_2$ (Euclidean) norm
The distance from $(x_1, y_1)$ to $(x_2, y_2)$ is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
In general, $\|\vec{x}\|_2 = \sqrt{\vec{x}^t \vec{x}}$. The distance between $\vec{x}, \vec{y}$ is given by $\|\vec{x} - \vec{y}\|_2$.
Lectures 10,11 Slide# 4

$L_p$ norm
For $p \geq 1$, we define the $L_p$ norm as $\|\vec{x}\|_p = (|x_1|^p + |x_2|^p + \ldots + |x_n|^p)^{\frac{1}{p}}$.
The $L_p$ norm distance between two vectors is $\|\vec{x} - \vec{y}\|_p$.
$L_1$ norm: $\|\vec{x}\|_1 = |x_1| + |x_2| + \ldots + |x_n|$.
Class exercise: verify that the $L_1$ norm distance is indeed a norm.
Q: Why is $p \geq 1$ necessary?
Lectures 10,11 Slide# 5

$L_\infty$ norm
Obtained as the limit $\lim_{p \to \infty} \|\vec{x}\|_p$.
Definition: For a vector $\vec{x}$, $\|\vec{x}\|_\infty$ is defined as $\|\vec{x}\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_n|)$.
Ex-1: Can we verify that $L_\infty$ is really a norm?
Ex-2 (challenge): Prove that the limit definition coincides with the explicit form of the $L_\infty$ distance.
Lectures 10,11 Slide# 6

Norm: Exercise
Take the vectors $\vec{x} = (0, 1, -1)$, $\vec{y} = (2, 1, 1)$. Write down the $L_1, L_2, L_\infty$ norm distances between $\vec{x}$ and $\vec{y}$.
Lectures 10,11 Slide# 7
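
A quick numerical check of the exercise (an editorial addition, not part of the original slides; the course materials use Matlab, but this sketch uses Python/NumPy):

```python
import numpy as np

x = np.array([0.0, 1.0, -1.0])
y = np.array([2.0, 1.0, 1.0])
d = x - y                         # difference vector (-2, 0, -2)

print(np.linalg.norm(d, 1))       # L1 distance: |-2| + |0| + |-2| = 4
print(np.linalg.norm(d, 2))       # L2 distance: sqrt(4 + 0 + 4) = 2*sqrt(2)
print(np.linalg.norm(d, np.inf))  # L-infinity distance: max(2, 0, 2) = 2
```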

Norm Spheres
$\epsilon$-sphere: $\{\vec{x} \mid d(\vec{x}, \vec{0}) \leq \epsilon\}$ for a norm $d$.
[Figure: the unit spheres corresponding to the norms $L_1$, $L_2$, $L_\infty$.]
Note: Norm spheres are convex sets.
Lectures 10,11 Slide# 8

Norm Minimization Problems
General form: minimize $\|\vec{x}\|_p$ subject to $A\vec{x} \leq \vec{b}$.
Note-1: We always minimize norm functions (in this course).
Note-2: Maximization of a norm is generally a hard (non-convex) problem.
Lectures 10,11 Slide# 9

Unconstrained Norm Minimization
Let us first study the following problem: min. $\|A\vec{x} - \vec{b}\|_p$.
1. For $p = 2$, this problem is called (unconstrained) least squares. We will solve it using calculus.
2. For $p = 1, \infty$, this problem is called $L_1$ ($L_\infty$) least squares. We will reduce it to an LP.
3. Applications: solving (noisy) linear systems, regularization, denoising, maximum likelihood estimation, regression, and so on.
Lectures 10,11 Slide# 10

Unconstrained Least Squares
Decision Variables: $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$.
Objective function: min. $\|A\vec{x} - \vec{b}\|_2$.
Constraints: No explicit constraints.
Lectures 10,11 Slide# 11

Least Squares: Example
Example: min. $\|(2x + 3y - 1,\ y)\|_2$.
Solution: We will equivalently minimize $(2x + 3y - 1)^2 + y^2$. Setting the partial derivatives to zero (first-order optimality conditions), we obtain:
$\frac{\partial}{\partial x}\left[(2x + 3y - 1)^2 + y^2\right] = 0, \qquad \frac{\partial}{\partial y}\left[(2x + 3y - 1)^2 + y^2\right] = 0.$
In other words:
$4(2x + 3y - 1) = 0, \qquad 6(2x + 3y - 1) + 2y = 0.$
The optimum lies at $y = 0,\ x = \frac{1}{2}$. Verify that it is a minimum by checking the second-order conditions (Hessian matrix).
Lectures 10,11 Slide# 12

Unconstrained Least Squares
Problem: minimize $\|A\vec{x} - \vec{b}\|_2$.
Solution: We use calculus, i.e., minimum finding using partial derivatives. Recall: $\nabla f(x_1, \ldots, x_n) = (\partial_{x_1} f, \partial_{x_2} f, \ldots, \partial_{x_n} f)$. The criterion for an optimal point of $f(\vec{x})$ is $\nabla f = (0, \ldots, 0)$.
$\nabla\left(\|A\vec{x} - \vec{b}\|_2^2\right) = \nabla\left((A\vec{x} - \vec{b})^t (A\vec{x} - \vec{b})\right) = \nabla\left(\vec{x}^t A^t A \vec{x} - 2\vec{x}^t A^t \vec{b} + \vec{b}^t \vec{b}\right) = 2A^t A\vec{x} - 2A^t \vec{b} = 0.$
The minimum occurs at $\vec{x} = (A^t A)^{-1} A^t \vec{b}$ (assuming $A$ has full column rank).
A similar solution works for the $L_p$ norm for any even number $p$.
Lectures 10,11 Slide# 13
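
A small Python/NumPy sketch of this closed-form solution (an editorial illustration with arbitrary example data; the course's own experiments use Matlab):

```python
import numpy as np

# Overdetermined system A x ~ b (example values chosen only for illustration)
A = np.array([[2.0, 3.0], [0.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, 0.0, 2.0])

# Normal equations: x* = (A^t A)^{-1} A^t b (A assumed to have full column rank)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# In practice a least-squares solver is preferred; it avoids forming A^t A explicitly
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)  # both give the same minimizer of ||A x - b||_2
```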

$L_1$ Norm Minimization
Decision Variables: $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$.
Objective function: min. $\|A\vec{x} - \vec{b}\|_1$.
Constraints: No explicit constraints.
We will reduce this to a linear program.
Lectures 10,11 Slide# 14

Example
Example: min. $\|(2x + 3y - 1,\ y)\|_1 = |2x + 3y - 1| + |y|$.
Trick: Let $t_1 \geq 0,\ t_2 \geq 0$ be such that $|2x + 3y - 1| \leq t_1$ and $|y| \leq t_2$. Consider the following problem:
min. $t_1 + t_2$ s.t. $|2x + 3y - 1| \leq t_1,\ |y| \leq t_2,\ t_1, t_2 \geq 0$.
Goal: convince the class that this problem is equivalent to the original.
Note: $|y| \leq t_2$ can be rewritten as $-t_2 \leq y \leq t_2$.
Lectures 10,11 Slide# 15

Example: LP Formulation
min. $t_1 + t_2$
s.t. $2x + 3y - 1 \leq t_1$, $-2x - 3y + 1 \leq t_1$, $y \leq t_2$, $-y \leq t_2$, $t_1, t_2 \geq 0$.
Solution: $x = \frac{1}{2},\ y = 0$.
Q: Can we repeat the same trick for a constraint of the form $|y| \geq t_2$?
A: No. A disjunction of inequalities would be needed.
Lectures 10,11 Slide# 16

$L_1$ Norm Minimization
Problem: min. $\|A\vec{x} - \vec{b}\|_1$.
Solution: Let $A$ be an $m \times n$ matrix. Add variables $t_1, \ldots, t_m$ corresponding to the rows of $A$:
min. $\sum_{i=1}^m t_i$ s.t. $|A_i \vec{x} - b_i| \leq t_i$, $t_1, \ldots, t_m \geq 0$.
Equivalently:
min. $\sum_{i=1}^m t_i$ s.t. $A\vec{x} - \vec{b} \leq \vec{t}$, $-A\vec{x} + \vec{b} \leq \vec{t}$, $t_1, \ldots, t_m \geq 0$.
We write it in the general (standard) form:
max. $-\vec{1}^t \vec{t}$ s.t. $A\vec{x} - \vec{t} \leq \vec{b}$, $-A\vec{x} - \vec{t} \leq -\vec{b}$, $t_1, \ldots, t_m \geq 0$.
Lectures 10,11 Slide# 17
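
This reduction can be handed directly to an LP solver. The sketch below is an editorial illustration (the function name l1_minimize and the test data are mine, and it uses Python's scipy.optimize.linprog rather than the course's Matlab code):

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(A, b):
    """Minimize ||A x - b||_1 via the LP reduction with auxiliary variables t."""
    m, n = A.shape
    # Variables are z = (x, t); the objective is sum(t).
    c = np.concatenate([np.zeros(n), np.ones(m)])
    # Constraints:  A x - t <= b   and   -A x - t <= -b
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m   # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]

# The slide's example: minimize |2x + 3y - 1| + |y|
A = np.array([[2.0, 3.0], [0.0, 1.0]])
b = np.array([1.0, 0.0])
print(l1_minimize(A, b))   # expect roughly x = 0.5, y = 0
```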

$L_\infty$ Norm Minimization
Decision Variables: $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$.
Objective function: min. $\|A\vec{x} - \vec{b}\|_\infty$.
Constraints: No explicit constraints.
We will reduce this to a linear program.
Lectures 10,11 Slide# 18

Example
Example: min. $\|(2x + 3y - 1,\ y)\|_\infty = \min\ \max(|2x + 3y - 1|,\ |y|)$.
Trick: Let $t \geq 0$ be such that $|2x + 3y - 1| \leq t$ and $|y| \leq t$. Consider the following problem:
min. $t$ s.t. $|2x + 3y - 1| \leq t,\ |y| \leq t,\ t \geq 0$.
Goal: convince the class that this problem is equivalent to the original.
Note: $|y| \leq t$ can be rewritten as $-t \leq y \leq t$.
Lectures 10,11 Slide# 19

Example: LP Formulation
min. $t$
s.t. $2x + 3y - 1 \leq t$, $-2x - 3y + 1 \leq t$, $y \leq t$, $-y \leq t$, $t \geq 0$.
Solution: $x = \frac{1}{2},\ y = 0$.
Lectures 10,11 Slide# 20

$L_\infty$ Norm Minimization
Problem: min. $\|A\vec{x} - \vec{b}\|_\infty$.
Solution: Let $A$ be an $m \times n$ matrix. Add a single variable $t$ that bounds every row of the residual:
min. $t$ s.t. $|A_i \vec{x} - b_i| \leq t$ for all $i$, $t \geq 0$.
Equivalently:
min. $t$ s.t. $A\vec{x} - \vec{b} \leq t \cdot \vec{1}$, $-A\vec{x} + \vec{b} \leq t \cdot \vec{1}$, $t \geq 0$.
We write it in the general (standard) form:
max. $-t$ s.t. $A\vec{x} - t \cdot \vec{1} \leq \vec{b}$, $-A\vec{x} - t \cdot \vec{1} \leq -\vec{b}$, $t \geq 0$.
Lectures 10,11 Slide# 21
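
A matching sketch for the $L_\infty$ reduction (editorial illustration; the function name linf_minimize is mine, and scipy.optimize.linprog stands in for the course's Matlab tooling):

```python
import numpy as np
from scipy.optimize import linprog

def linf_minimize(A, b):
    """Minimize ||A x - b||_inf via the LP reduction with a single bound variable t."""
    m, n = A.shape
    ones = np.ones((m, 1))
    # Variables are z = (x, t); the objective is t.
    c = np.concatenate([np.zeros(n), [1.0]])
    # Constraints:  A x - t*1 <= b   and   -A x - t*1 <= -b
    A_ub = np.block([[A, -ones], [-A, -ones]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)]       # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]

# The slide's example: minimize max(|2x + 3y - 1|, |y|)
A = np.array([[2.0, 3.0], [0.0, 1.0]])
b = np.array([1.0, 0.0])
x_opt, t_opt = linf_minimize(A, b)
print(x_opt, t_opt)   # expect roughly (0.5, 0) with optimal t = 0
```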

Application-1: Estimation of Solutions Lectures 10,11 Slide# 22

Application-1: Solution Estimation
Problem: We are trying to solve the linear equation $A\vec{x} = \vec{b}$, where
1. the entries in $A, \vec{b}$ could have random errors (noisy measurements), and
2. there are many more equations than variables.
[Figure: block diagram in which $\vec{x}$ passes through $A$ and additive noise to produce $\vec{b}$.]
Lectures 10,11 Slide# 23

Application: Estimation Example
Problem: We are trying to measure a quantity $x$ using noisy measurements:
$2x = 5.1,\quad 3x = 7.6,\quad 4x = 9.91,\quad 5x = 12.45,\quad 6x = 15.1.$
Q-1: Is there a single value of $x$ that explains all the measurements?
Q-2: What do we do in this case?
Lectures 10,11 Slide# 24

Application: Estimation Example
Problem: We are trying to measure a quantity $x$ using noisy measurements:
$2x = 5.1,\quad 3x = 7.6,\quad 4x = 9.91,\quad 5x = 12.45,\quad 6x = 15.1.$
Find $x$ that nearly fits all the measurements, i.e.,
min. $\|(2x - 5.1,\ 3x - 7.6,\ 4x - 9.91,\ 5x - 12.45,\ 6x - 15.1)\|_p$.
We can choose $p = 1, 2, \infty$ and see what answer we get.
Lectures 10,11 Slide# 25

Application: Estimation ($L_2$)
$L_2$ norm: Apply least squares with
$A = (2, 3, 4, 5, 6)^t, \quad \vec{b} = (5.1, 7.6, 9.91, 12.45, 15.1)^t.$
Solving for $x = \operatorname{argmin} \|Ax - \vec{b}\|_2$ yields $x \approx 2.505$.
Residual error: $(-0.09, -0.08, 0.11, 0.08, -0.06)$.
Lectures 10,11 Slide# 26

Application: Estimation ($L_1$, $L_\infty$)
$L_1$ norm: Reduce to a linear program using
$A = (2, 3, 4, 5, 6)^t, \quad \vec{b} = (5.1, 7.6, 9.91, 12.45, 15.1)^t.$
Solution: $x = 2.49$. Residual error: $(-0.12, -0.13, 0.05, 0.0, -0.16)$.
$L_\infty$ norm: Reduce to a linear program. Solution: $x = 2.501$. Residual error: $(-0.1, -0.1, 0.1, 0.05, -0.1)$.
Lectures 10,11 Slide# 27
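
The three fits can be reproduced with the reductions above. The following self-contained Python sketch is an editorial illustration; the resulting estimates should all land near 2.5, though the exact $L_1$ and $L_\infty$ values depend on the solver and may differ slightly from the numbers reported on the slides:

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[2.0], [3.0], [4.0], [5.0], [6.0]])
b = np.array([5.1, 7.6, 9.91, 12.45, 15.1])
m, n = A.shape

# L2: ordinary least squares
x_l2 = np.linalg.lstsq(A, b, rcond=None)[0]

# L1: variables (x, t_1..t_m); minimize sum(t) s.t. |A x - b| <= t
c1 = np.concatenate([np.zeros(n), np.ones(m)])
G = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
h = np.concatenate([b, -b])
x_l1 = linprog(c1, A_ub=G, b_ub=h,
               bounds=[(None, None)] * n + [(0, None)] * m,
               method="highs").x[:n]

# L-infinity: variables (x, t); minimize t s.t. |A x - b| <= t
cinf = np.concatenate([np.zeros(n), [1.0]])
Ginf = np.block([[A, -np.ones((m, 1))], [-A, -np.ones((m, 1))]])
x_linf = linprog(cinf, A_ub=Ginf, b_ub=h,
                 bounds=[(None, None)] * n + [(0, None)],
                 method="highs").x[:n]

for name, x in [("L2", x_l2), ("L1", x_l1), ("Linf", x_linf)]:
    print(name, x, "residuals:", A @ x - b)
```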

Experiment with Larger Matrices (Matlab)
$L_1$ norm: a large number of entries with small residuals, but the spread is larger.
$L_2$ norm: the residuals look like a Gaussian distribution.
$L_\infty$ norm: the residuals look more uniform, with a smaller spread.
Matlab code is available; run your own experiments!
Lectures 10,11 Slide# 28

Regularized Approximation
Ordinary least squares can have poor performance (ill-conditioning problem).
Tikhonov Regularized Approximation: minimize $\|A\vec{x} - \vec{b}\|_2^2 + \lambda \|\vec{x}\|_2^2$, where $\lambda \geq 0$ is a tuning parameter.
Note-1: This is the same as the norm minimization
$\left\| \begin{bmatrix} A \\ \lambda^{\frac{1}{2}} I \end{bmatrix} \vec{x} - \begin{bmatrix} \vec{b} \\ \vec{0} \end{bmatrix} \right\|_2^2.$
Note-2: In general, we can also have regularized approximation using $L_1$, $L_\infty$ and other norms.
Lectures 10,11 Slide# 29
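
A minimal sketch of the stacked formulation (an editorial illustration; the function name tikhonov and the ill-conditioned example matrix are mine):

```python
import numpy as np

def tikhonov(A, b, lam):
    """Tikhonov-regularized solution via the stacked least-squares formulation."""
    n = A.shape[1]
    # Minimizing the stacked L2 norm squared equals ||A x - b||^2 + lam * ||x||^2.
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    return np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]

# Nearly singular system to illustrate ill conditioning
A = np.array([[1.0, 1.0], [1.0, 1.0001]])
b = np.array([2.0, 2.0001])
print(tikhonov(A, b, 0.0))   # ordinary least squares, sensitive to the near-singularity
print(tikhonov(A, b, 0.1))   # regularized solution, better behaved
```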

Penalty Function Approximation
Problem: Solve minimize $\phi(A\vec{x} - \vec{b})$, where $\phi$ is a penalty function.
If $\phi = L_1, L_2, L_\infty$, this is exactly the same as norm minimization.
Note-1: In general, $\phi$ need not be a norm.
Note-2: $\phi$ is sometimes called a loss function.
Lectures 10,11 Slide# 30

Penalty Functions: Example
Deadzone Linear Penalty: Let $\vec{x} = (x_1, \ldots, x_n)^t$. The deadzone linear penalty is the sum of penalties on the individual components $x_1, \ldots, x_n$:
$\phi_{dz}(\vec{x}) : \phi_{dz}(x_1) + \phi_{dz}(x_2) + \cdots + \phi_{dz}(x_n),$ wherein
$\phi_{dz}(u) = \begin{cases} 0 & |u| \leq a \\ b(|u| - a) & |u| > a \end{cases}$
[Figure: the deadzone penalty $\phi_{dz}$, flat on $[-a, a]$, compared against the $L_1$ and $L_2$ penalties.]
Lectures 10,11 Slide# 31

Deadzone Linear Penalty (cont)
Deadzone vs. $L_2$: The $L_2$ norm imposes a large penalty when the residual is high, so outliers in the data can affect performance drastically. The deadzone penalty function is generally less sensitive to outliers.
Q: How do we solve the deadzone penalty approximation problem?
A: Apply the tricks used for $L_1, L_\infty$ (upcoming assignment).
Lectures 10,11 Slide# 32

Other Penalty Functions
Huber Penalty: The Huber penalty is the sum of functions on the individual components $x_1, \ldots, x_n$: $\phi_H(x_1) + \phi_H(x_2) + \cdots + \phi_H(x_n)$, wherein
$\phi_H(u) = \begin{cases} u^2 & |u| \leq a \\ a(2|u| - a) & |u| > a \end{cases}$
[Figure: the Huber penalty $\phi_H$, quadratic on $[-a, a]$ and linear outside, compared against the $L_2$ penalty.]
Lectures 10,11 Slide# 33

Penalty Function Approximation: Summary
General observations:
$L_2$: Natural and easy to solve (using the pseudo-inverse). However, sensitive to outliers.
$L_1$: Sparse: ensures that most residuals are very small. Insensitive to outliers. However, the spread of the error is larger.
$L_\infty$: Non-sparse but minimizes the spread. Very sensitive to outliers.
Other loss functions provide various tradeoffs.
Details: Cf. Boyd's book.
Lectures 10,11 Slide# 34

Application-2: Signal Denoising Lectures 10,11 Slide# 35

Application-2: Signal Denoising
Input: Samples of a noisy signal $\vec{y}_n : (y_1, y_2, \ldots, y_N)^t$. The original signal is unknown (very little data available).
Problem: Find a new signal $\vec{y}$ by minimizing
minimize $\|\vec{y} - \vec{y}_n\|_p^p + \phi_s(\vec{y})$,
where $\phi_s$ is a smoothing function.
[Figure: the original signal and the signal with noise.]
Lectures 10,11 Slide# 36

Smoothing Criteria
Quadratic Smoothing: $L_2$ norm on successive differences.
$\phi_q(\vec{y}) = \sum_{i=2}^N (y_i - y_{i-1})^2.$
Total Variation Smoothing: $L_1$ norm on successive differences.
$\phi_{tv}(\vec{y}) = \sum_{i=2}^N |y_i - y_{i-1}|.$
Maximum Variation Smoothing: $L_\infty$ norm on successive differences.
$\phi_{max}(\vec{y}) = \max_{i=2}^N |y_i - y_{i-1}|.$
Lectures 10,11 Slide# 37

Quadratic Smoothing
Problem: minimize $\|\vec{y} - \vec{y}_n\|_2^2 + \phi_q(\vec{y})$.
This can be viewed as a least squares problem:
minimize $\left\| \begin{bmatrix} I \\ \Delta \end{bmatrix} \vec{y} - \begin{bmatrix} \vec{y}_n \\ \vec{0} \end{bmatrix} \right\|_2^2,$
where $\Delta$ is the successive-difference matrix:
$\Delta = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{bmatrix}.$
Lectures 10,11 Slide# 38

Total/Maximum Variation Smoothing
Total variation smoothing: an $L_1$ least squares problem.
minimize $\left\| \begin{bmatrix} I \\ \Delta \end{bmatrix} \vec{y} - \begin{bmatrix} \vec{y}_n \\ \vec{0} \end{bmatrix} \right\|_1.$
Maximum variation smoothing: an $L_\infty$ least squares problem.
minimize $\left\| \begin{bmatrix} I \\ \Delta \end{bmatrix} \vec{y} - \begin{bmatrix} \vec{y}_n \\ \vec{0} \end{bmatrix} \right\|_\infty.$
Here $\Delta$ is the same successive-difference matrix as before.
Lectures 10,11 Slide# 39
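
A sketch of quadratic smoothing as a stacked least-squares problem (an editorial illustration with synthetic data, not the slides' Matlab experiment; the weight lam on the smoothing term is an added tuning knob, whereas the slide formulation weights both terms equally):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy signal
N = 200
t = np.linspace(0, 10, N)
y_noisy = np.sin(t) + 0.3 * rng.standard_normal(N)

# Successive-difference matrix Delta, shape (N-1) x N
Delta = np.zeros((N - 1, N))
Delta[np.arange(N - 1), np.arange(N - 1)] = -1.0
Delta[np.arange(N - 1), np.arange(1, N)] = 1.0

# Quadratic smoothing: minimize ||y - y_noisy||_2^2 + lam * ||Delta y||_2^2,
# solved as a single stacked least-squares problem.
lam = 25.0
A_stack = np.vstack([np.eye(N), np.sqrt(lam) * Delta])
b_stack = np.concatenate([y_noisy, np.zeros(N - 1)])
y_smooth = np.linalg.lstsq(A_stack, b_stack, rcond=None)[0]

# The smoothed signal should have much smaller successive differences
print(np.linalg.norm(Delta @ y_smooth), "<", np.linalg.norm(Delta @ y_noisy))
```

The total-variation and maximum-variation versions follow by feeding the same stacked matrix and vector into the $L_1$ and $L_\infty$ LP reductions sketched earlier.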

Experiment
[Figure: the original signal, the signal with noise, the $L_1$, $L_2$, $L_\infty$ reconstructions, and histograms of the $L_1$, $L_2$, $L_\infty$ residuals.]
Lectures 10,11 Slide# 40

Other Smoothing Functions
Consider three successive data points instead of two:
$\phi_T : \sum_{i=2}^{m-1} \left| 2y_i - (y_{i-1} + y_{i+1}) \right|.$
Modify the matrix $\Delta$ as follows:
$\Delta = \begin{bmatrix} -1 & 2 & -1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & -1 & 2 & -1 & \cdots & 0 & 0 & 0 \\ \vdots & & \ddots & \ddots & \ddots & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & -1 & 2 & -1 \end{bmatrix}.$
$\Delta$ is called a Toeplitz matrix.
Lectures 10,11 Slide# 41

Application-3: Regression Lectures 10,11 Slide# 42

Linear Regression
Inputs: Data points $\vec{x}_i : (x_{i1}, \ldots, x_{in})$ and outcomes $y_i$.
Goal: Find a function that best fits the data:
$f(\vec{x}) = c_1 x_1 + c_2 x_2 + \ldots + c_n x_n + c_0.$
[Figure: input vs. output plot for the given data.]
Lectures 10,11 Slide# 43


Linear Regression
Let $X$ be the data matrix and $\vec{y}$ the corresponding outcome vector.
Note: The $i$-th row $X_i$ represents data point $i$, and $y_i$ is the corresponding outcome for $X_i$.
If $\vec{c}^t \vec{x} + c_0$ is the best line fit, the error for the $i$-th data point is
$\epsilon_i : (X_i \vec{c}) + c_0 - y_i.$
We can write the overall error as
$\left\| [X\ \vec{1}]\, \vec{c} - \vec{y} \right\|_p,$
where $\vec{c}$ now includes $c_0$ as its last entry.
Lectures 10,11 Slide# 45

Regression
Inputs: Data $X$, outcomes $\vec{y}$.
Decision Vars: $c_0, c_1, \ldots, c_n$, the coefficients of the straight line.
Solution: minimize $\|[X\ \vec{1}]\, \vec{c} - \vec{y}\|_p$.
Note: Instead of a norm, we could use any penalty function.
Ordinary Regression: min. $\|[X\ \vec{1}]\, \vec{c} - \vec{y}\|_p$.
Tikhonov Regression: min. $\|[X\ \vec{1}]\, \vec{c} - \vec{y}\|_2^2 + \lambda \|\vec{c}\|_2^2$.
Lectures 10,11 Slide# 46
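
A short Python sketch of the $[X\ \vec{1}]$ construction for ordinary ($L_2$) regression (an editorial illustration with synthetic 1-D data, not the slides' data set):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y roughly linear in x, plus noise
x = rng.uniform(0, 10, size=50)
y = 1.2 * x + 0.7 + 0.3 * rng.standard_normal(50)

# Augment with a column of ones so the intercept c0 is part of the coefficient vector
X1 = np.column_stack([x, np.ones_like(x)])

# Ordinary (L2) regression: least squares on [X 1] c ~ y
c = np.linalg.lstsq(X1, y, rcond=None)[0]
print("slope, intercept:", c)   # should be close to (1.2, 0.7)

# For L1 or L-infinity regression, feed the same [X 1] and y into the LP
# reductions sketched earlier instead of np.linalg.lstsq.
```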

Experiment
Experiment: Create a noisy data set with outliers:
80 regular data points with low noise, and
5 outlier data points with large noise.
Goal: Compare the effect of $L_1$, $L_2$, $L_\infty$ norm regressions.
Lectures 10,11 Slide# 47

Regression with Outliers
[Figure: the data with outliers and the fitted $L_1$, $L_2$, $L_\infty$ regression lines.]
Lectures 10,11 Slide# 48

Regression with Outliers (Error Histogram)
[Figure: histograms of the $L_1$, $L_2$, $L_\infty$ norm residuals.]
Lectures 10,11 Slide# 49

Linear Regression With Non-Linear Models
Input: Data $X$, outcomes $\vec{y}$, functions $g_1, \ldots, g_k$.
Goal: Fit a model of the form $y_i = c_0 + c_1 g_1(\vec{x}_i) + c_2 g_2(\vec{x}_i) + \ldots + c_k g_k(\vec{x}_i)$.
Solution: Transform the data matrix as follows:
$\hat{X} = \begin{bmatrix} g_1(X_1) & \ldots & g_k(X_1) \\ g_1(X_2) & \ldots & g_k(X_2) \\ \vdots & & \vdots \\ g_1(X_n) & \ldots & g_k(X_n) \end{bmatrix}.$
Apply linear regression on $(\hat{X}, \vec{y})$.
Lectures 10,11 Slide# 50
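
A sketch of the feature transformation followed by ordinary regression (an editorial illustration; the basis functions and synthetic data are chosen for the example and are not the slides' data set):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 1-D data following a non-linear trend
x = np.linspace(0, 10, 100)
y = 2.0 * np.cos(x) + 0.2 * x + 1.0 + 0.2 * rng.standard_normal(x.size)

# Basis functions g_1, ..., g_k
basis = [np.cos, np.sin, lambda v: v]

# Transformed data matrix [g_1(x) ... g_k(x) 1]; the model remains linear in c
X_hat = np.column_stack([g(x) for g in basis] + [np.ones_like(x)])

# Ordinary L2 regression on the transformed matrix
c = np.linalg.lstsq(X_hat, y, rcond=None)[0]
print(c)   # should be close to (2.0, 0.0, 0.2, 1.0)
```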

Experiment
Experiment: Create a noisy non-linear data set with outliers:
$y = 0.2\sin(x) + 1.5\cos(x) + 2\cos(1.5x) + 0.2x - 3 + \text{Noise},$
80 regular data points with low noise, and
5 outlier data points with large noise.
[Figure: the data and outliers for regression with a non-linear model.]
Goal: Compare the effect of $L_1$, $L_2$, $L_\infty$ norm regressions.
Lectures 10,11 Slide# 51

Experiment: Linear Regression, Non-Linear Model
[Figure: the data with outliers and the fitted $L_1$, $L_2$, $L_\infty$ regression curves.]
Lectures 10,11 Slide# 52

Experiment: Residuals
[Figure: histograms of the $L_1$, $L_2$, $L_\infty$ norm residuals.]
Lectures 10,11 Slide# 53

Classification
Problem: Given two classes of data points: $\vec{x}_1^+, \ldots, \vec{x}_N^+$ and $\vec{x}_1^-, \ldots, \vec{x}_M^-$.
Goal: Find a separating hyperplane between the classes.
[Figure: the $\vec{x}^+$ and $\vec{x}^-$ points and a separating classifier line.]
Lectures 10,11 Slide# 54

Application-4: Classification Lectures 10,11 Slide# 55

Linear Classification
Problem: Given two classes of data points: $\vec{x}_1^+, \ldots, \vec{x}_N^+$ and $\vec{x}_1^-, \ldots, \vec{x}_M^-$.
Goal: Find a separating hyperplane between the classes.
Q-1: Will such a hyperplane always exist?
A: Not always! Only if the data is linearly separable.
Plan: 1. Assume that such a hyperplane exists. 2. Deal with cases where a hyperplane does not exist.
Lectures 10,11 Slide# 56

Linear Classification (cont)
Solution: Use linear programming.
Decision Variables: $w_0, w_1, w_2, \ldots, w_n$, representing the classifier $f_{\vec{w}}(\vec{x}) : w_0 + w_1 x_1 + \cdots + w_n x_n$.
Linear Programming: max. ??? s.t. $X^+ \vec{w} \geq \vec{0}$, $X^- \vec{w} \leq \vec{0}$, with $\vec{w} = (w_1, \ldots, w_n, w_0)$ and
$X^+ : \begin{bmatrix} \vec{x}_1^+ & 1 \\ \vec{x}_2^+ & 1 \\ \vdots & \vdots \\ \vec{x}_N^+ & 1 \end{bmatrix}, \qquad X^- : \begin{bmatrix} \vec{x}_1^- & 1 \\ \vec{x}_2^- & 1 \\ \vdots & \vdots \\ \vec{x}_M^- & 1 \end{bmatrix}.$
Lectures 10,11 Slide# 57

Classification Using Linear Programs
Objective: Let us try $\sum_i w_i = \vec{1}^t \vec{w}$.
Note: We write $\langle \vec{a}, \vec{b} \rangle$ for the dot product $\vec{a}^t \vec{b}$.
Simplified data representation: Represent the two classes by the labels $+1, -1$.
Data: $(\vec{x}_1, y_1), \ldots, (\vec{x}_N, y_N) \equiv (X, \vec{y})$, where $y_i = +1$ for class I data and $y_i = -1$ for class II data.
Linear Program: min. $\langle \vec{1}, \vec{w} \rangle + w_0$ s.t. $y_i (\langle \vec{x}_i, \vec{w} \rangle + w_0) \geq 0$.
Verify that this formulation works.
Lectures 10,11 Slide# 58

Experiment: Classification
Experiment: Find a classifier for the following points.
[Figure: the two classes of data points to be classified.]
Lectures 10,11 Slide# 59

Experiment: Classification
Experiment: Outcome.
[Figure: several different classifier lines separating the two classes.]
Note: Many different classifiers are possible.
Note-2: What is the deal with points that lie on the classifier line?
Lectures 10,11 Slide# 60

Dual, Complementary Slackness
Linear Program: min. $\langle \vec{1}, \vec{w} \rangle + w_0$ s.t. $y_i (\langle \vec{x}_i, \vec{w} \rangle + w_0) \geq 0$.
Complementary Slackness: Let $\alpha_1, \ldots, \alpha_N$ be the dual variables. Then
$\alpha_i \, y_i (\langle \vec{x}_i, \vec{w} \rangle + w_0) = 0, \quad \alpha_i \geq 0.$
Claim: There are at least $n+1$ points $\vec{x}_{j_1}, \ldots, \vec{x}_{j_{n+1}}$ s.t. $\langle \vec{x}_{j_i}, \vec{w} \rangle + w_0 = 0$.
Note: $n+1$ points fully determine the hyperplane $\vec{w}; w_0$.
Lectures 10,11 Slide# 61

Classifying Points
Problem: Given a new point $\vec{x}$, find which class it belongs to.
Solution-1: Just find the sign of $\langle \vec{w}, \vec{x} \rangle + w_0$; we know $\vec{w}; w_0$ from our LP solution.
Solution-2: More complicated. Express $\vec{x}$ as a linear combination of the support vectors:
$\vec{x} = \lambda_1 \vec{x}_{j_1} + \lambda_2 \vec{x}_{j_2} + \cdots + \lambda_{n+1} \vec{x}_{j_{n+1}}.$
We can then write
$\langle \vec{w}, \vec{x} \rangle + w_0 = \sum_i \lambda_i \underbrace{\left( w_0 + \langle \vec{w}, \vec{x}_{j_i} \rangle \right)}_{0} + \Big(1 - \sum_i \lambda_i\Big) w_0 = \Big(1 - \sum_i \lambda_i\Big) w_0.$
This is impractical for linear programming formulations.
Lectures 10,11 Slide# 62

Non-Separable Data
Q: What if the data is not linearly separable? The linear program will be primal infeasible. In practice, linear separation may hold without noise but not with noise.
Solution: Relax the conditions on the half-space $\vec{w}$.
[Figure: two misclassified points at distances $\xi_1, \xi_2$ from the separating line.]
Lectures 10,11 Slide# 63

Non-Separable Data: Linear Programming
Classifier: (old) $w_0 + \langle \vec{w}, \vec{x} \rangle$; (new) $\langle \vec{w}, \vec{x} \rangle + 1$.
Note: The classifier $\sum_i w_i x_i + w_0$ is equivalent to $\sum_i \frac{w_i}{w_0} x_i + 1$ if $w_0 > 0$.
LP Formulation:
minimize $\sum_i \xi_i^+ + \sum_j \xi_j^-$ s.t. $\langle \vec{x}_i^+, \vec{w} \rangle + 1 \geq -\xi_i^+$, $\langle \vec{x}_j^-, \vec{w} \rangle + 1 \leq \xi_j^-$, $\xi_i^+, \xi_j^- \geq 0$.
The variables $\xi$ encode tolerance; we minimize them as part of the objective.
Exercise: Verify that the support vector interpretation of the dual works.
Lectures 10,11 Slide# 64
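
A sketch of this relaxed formulation as an LP (an editorial illustration: the function name lp_classifier and the tiny data set are mine, and scipy.optimize.linprog stands in for whatever solver the course uses):

```python
import numpy as np
from scipy.optimize import linprog

def lp_classifier(X_pos, X_neg):
    """Classifier <w, x> + 1 with slack xi on each point; minimize the total slack."""
    N, n = X_pos.shape
    M, _ = X_neg.shape
    # Variables z = (w, xi_plus, xi_minus); objective = sum of all slacks.
    c = np.concatenate([np.zeros(n), np.ones(N + M)])
    # <x_i^+, w> + 1 >= -xi_i^+   <=>   -X_pos w - xi^+ <= 1
    # <x_j^-, w> + 1 <=  xi_j^-   <=>    X_neg w - xi^- <= -1
    A_ub = np.block([
        [-X_pos, -np.eye(N), np.zeros((N, M))],
        [X_neg, np.zeros((M, N)), -np.eye(M)],
    ])
    b_ub = np.concatenate([np.ones(N), -np.ones(M)])
    bounds = [(None, None)] * n + [(0, None)] * (N + M)   # w free, slacks >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.fun     # weights w and total slack (0 if separable this way)

# Tiny illustrative data set: positives above, negatives below the x-axis
X_pos = np.array([[0.1, 1.0], [0.3, 1.5], [0.2, 2.0]])
X_neg = np.array([[0.1, -1.0], [0.3, -1.5], [0.4, -2.0]])
w, slack = lp_classifier(X_pos, X_neg)
print("w =", w, "total slack =", slack)
print(X_pos @ w + 1, X_neg @ w + 1)   # positives >= 0, negatives <= 0 (up to slack)
```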

Non-Linear Model Classification
Linear classifier: $\langle \vec{w}, \vec{x} \rangle + 1$.
Non-linear functions $\phi_i : \mathbb{R}^n \to \mathbb{R}$ give the classifier $\langle \vec{w}, (\phi_1(\vec{x}), \phi_2(\vec{x}), \ldots, \phi_m(\vec{x})) \rangle + 1$.
This is exactly the same idea as linear regression with non-linear functions.
Lectures 10,11 Slide# 65