CSCI5654 (Linear Programming, Fall 2013): Lectures 10, 11


Slide 2: Today's Lecture
1. Introduction to norms: $L_1$, $L_2$, $L_\infty$.
2. Casting absolute value and max operators.
3. Norm minimization problems.
4. Applications: fitting, classification, denoising, etc.

Slide 3: Norm
Basic idea: a notion of the length of a vector and of the distance between vectors.
Definition (Norm): Let $l : X \to \mathbb{R}_{\ge 0}$ be a mapping from a vector space $X$ to the non-negative reals. The function $l$ is a norm iff
1. $l(\vec{x}) = 0$ iff $\vec{x} = \vec{0}$,
2. $l(a\vec{x}) = |a|\, l(\vec{x})$,
3. $l(\vec{x} + \vec{y}) \le l(\vec{x}) + l(\vec{y})$ (Triangle Inequality).
Length of $\vec{x}$: $l(\vec{x})$. Distance between $\vec{x}, \vec{y}$ is $l(\vec{y} - \vec{x})$ ($= l(\vec{x} - \vec{y})$).

Slide 4: $L_2$ (Euclidean) norm
Distance between $(x_1, y_1)$ and $(x_2, y_2)$ is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
In general, $\|\vec{x}\|_2 = \sqrt{\vec{x}^t \vec{x}}$.
Distance between $\vec{x}, \vec{y}$ is given by $\|\vec{x} - \vec{y}\|_2$.

Slide 5: $L_p$ norm
For $p \ge 1$, we define the $L_p$ norm as $\|\vec{x}\|_p = (|x_1|^p + |x_2|^p + \cdots + |x_n|^p)^{1/p}$.
The $L_p$ norm distance between two vectors is $\|\vec{x} - \vec{y}\|_p$.
$L_1$ norm: $\|\vec{x}\|_1 = |x_1| + |x_2| + \cdots + |x_n|$.
Class exercise: verify that the $L_1$ norm is indeed a norm.
Q: Why is $p \ge 1$ necessary?

Slide 6: $L_\infty$ norm
Obtained as the limit $\lim_{p \to \infty} \|\vec{x}\|_p$.
Definition: For a vector $\vec{x}$, $\|\vec{x}\|_\infty$ is defined as $\|\vec{x}\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_n|)$.
Ex-1: Can we verify that $L_\infty$ is really a norm?
Ex-2 (challenge): Prove that the limit definition coincides with the explicit form of the $L_\infty$ norm.

Slide 7: Norm: Exercise
Take vectors $\vec{x} = (0, 1, -1)$, $\vec{y} = (2, 1, 1)$. Write down the $L_1$, $L_2$, $L_\infty$ norm distances between $\vec{x}, \vec{y}$.
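One way to check an answer to this exercise numerically is sketched below. It assumes the third entry of $\vec{x}$ is $-1$ (the sign is not legible in the extracted slide), and the helper name `lp_distance` is hypothetical. It also evaluates a large finite $p$ to illustrate the limit definition of the $L_\infty$ norm.

```python
import numpy as np

def lp_distance(u, v, p):
    """L_p distance between u and v for finite p >= 1, or p = np.inf for the max norm."""
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

# Vectors from the exercise; the -1 entry is an assumption about the garbled slide.
x = np.array([0.0, 1.0, -1.0])
y = np.array([2.0, 1.0, 1.0])

l1 = lp_distance(x, y, 1)         # |-2| + |0| + |-2|
l2 = lp_distance(x, y, 2)         # sqrt(4 + 0 + 4)
linf = lp_distance(x, y, np.inf)  # max(2, 0, 2)

# A large finite p is already close to the max norm.
l200 = lp_distance(x, y, 200)
```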

Slide 8: Norm Spheres
$\epsilon$-sphere: $\{\vec{x} \mid d(\vec{x}, \vec{0}) \le \epsilon\}$ for norm $d$.
[Figure: the unit spheres ($\epsilon = 1$) corresponding to the norms $L_1$, $L_2$, $L_\infty$.]
Note: Norm spheres are convex sets.

Slide 9: Norm Minimization problems
General form: minimize $\|\vec{x}\|_p$ subject to $A\vec{x} \le \vec{b}$.
Note-1: We always minimize the norm functions (in this course).
Note-2: Maximization of a norm is generally a hard (non-convex) problem.

Slide 10: Unconstrained Norm Minimization
Let's first study the following problem: min. $\|A\vec{x} - \vec{b}\|_p$.
1. For $p = 2$, this problem is called (unconstrained) least squares. We will solve this using calculus.
2. For $p = 1, \infty$, this problem is called $L_1$ ($L_\infty$) least squares. We will reduce this problem to LP.
3. Applications: solving (noisy) linear systems, regularization, denoising, max. likelihood estimation, regression and so on.

Slide 11: Unconstrained Least Squares
Decision Variables: $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$.
Objective function: min. $\|A\vec{x} - \vec{b}\|_2$.
Constraints: No explicit constraint.

Slide 12: Least Squares: Example
Example: min. $\|(2x + 3y - 1,\; y)\|_2$.
Solution: We will equivalently minimize $(2x + 3y - 1)^2 + y^2$. Setting the partial derivatives to zero (the first-order optimality conditions), we obtain:
  $\frac{\partial}{\partial x}\left[(2x + 3y - 1)^2 + y^2\right] = 0$
  $\frac{\partial}{\partial y}\left[(2x + 3y - 1)^2 + y^2\right] = 0$
In other words, we obtain:
  $4(2x + 3y - 1) = 0$
  $6(2x + 3y - 1) + 2y = 0$
The optimum lies at $y = 0$, $x = \frac{1}{2}$. Verify that it is a minimum by checking the second-order derivatives (Hessian matrix).

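Slide 13: Unconstrained Least Squares
Problem: minimize $\|A\vec{x} - \vec{b}\|_2$.
Solution: We will use calculus: find the minimum using partial derivatives.
Recall: $\nabla f(x_1, \ldots, x_n) = (\partial_{x_1} f, \partial_{x_2} f, \ldots, \partial_{x_n} f)$. The criterion for an optimal point of $f(\vec{x})$ is that $\nabla f = (0, \ldots, 0)$.
  $\nabla(\|A\vec{x} - \vec{b}\|_2^2) = \nabla((A\vec{x} - \vec{b})^t (A\vec{x} - \vec{b}))$
  $= \nabla(\vec{x}^t A^t A \vec{x} - 2\vec{x}^t A^t \vec{b} + \vec{b}^t \vec{b})$
  $= 2A^t A \vec{x} - 2A^t \vec{b} = 0$
The minimum will occur at $\vec{x}^* = (A^t A)^{-1} A^t \vec{b}$ (assume $A$ has full column rank).
A similar solution exists for the $L_p$ norm for even $p$.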
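As a sanity check on the closed form, the sketch below solves the normal equations $A^t A \vec{x} = A^t \vec{b}$ for the earlier example min. $\|(2x + 3y - 1,\; y)\|_2$, written as $A = [[2, 3], [0, 1]]$, $\vec{b} = (1, 0)$. The function name is hypothetical, and NumPy's `lstsq` is used only as an independent reference.

```python
import numpy as np

def least_squares_normal_eq(A, b):
    """Solve min ||Ax - b||_2 through the normal equations (A must have full column rank)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    # Solving the linear system avoids forming the explicit inverse (A^t A)^{-1}.
    return np.linalg.solve(A.T @ A, A.T @ b)

# The example residual vector (2x + 3y - 1, y) equals A @ (x, y) - b.
A = np.array([[2.0, 3.0],
              [0.0, 1.0]])
b = np.array([1.0, 0.0])

x_ne = least_squares_normal_eq(A, b)
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference solver
```

For ill-conditioned $A$, a QR- or SVD-based solver such as `lstsq` is usually the safer choice, since forming $A^t A$ squares the condition number.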

Slide 14: $L_1$ norm minimization
Decision Variables: $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$.
Objective function: min. $\|A\vec{x} - \vec{b}\|_1$.
Constraints: No explicit constraint.
We will reduce this to a linear program.

Slide 15: Example
Example: min. $\|(2x + 3y - 1,\; y)\|_1 = |2x + 3y - 1| + |y|$.
Trick: Let $t_1 \ge 0$, $t_2 \ge 0$ be s.t.
  $|2x + 3y - 1| \le t_1$
  $|y| \le t_2$
Consider the following problem:
  min. $t_1 + t_2$
  s.t. $|2x + 3y - 1| \le t_1$
       $|y| \le t_2$
       $t_1, t_2 \ge 0$
Goal: convince class that this problem is equivalent to the original.
Note: $|y| \le t_2$ can be rewritten as $y \le t_2 \wedge y \ge -t_2$.

Slide 16: Example: LP formulation
  min. $t_1 + t_2$
  s.t. $2x + 3y - 1 \le t_1$
       $-2x - 3y + 1 \le t_1$
       $y \le t_2$
       $-y \le t_2$
       $t_1, t_2 \ge 0$
Solution: $x = \frac{1}{2}$, $y = 0$.
Q: Can we repeat the same trick if $|y| \ge t_2$?
A: No. A disjunction of inequalities will be needed.

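Slide 17: $L_1$ norm minimization
Problem: min. $\|A\vec{x} - \vec{b}\|_1$.
Solution: Let $A$ be an $m \times n$ matrix. Add variables $t_1, \ldots, t_m$ corresponding to the rows of $A$.
  min. $\sum_{i=1}^m t_i$ s.t. $|A_i \vec{x} - b_i| \le t_i$, $t_1, \ldots, t_m \ge 0$
Removing the absolute values:
  min. $\sum_{i=1}^m t_i$ s.t. $A\vec{x} - \vec{b} \le \vec{t}$, $-A\vec{x} + \vec{b} \le \vec{t}$, $t_1, \ldots, t_m \ge 0$
We write it in the general form:
  max. $-\vec{1}^t \vec{t}$ s.t. $A\vec{x} - \vec{t} \le \vec{b}$, $-A\vec{x} - \vec{t} \le -\vec{b}$, $t_1, \ldots, t_m \ge 0$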
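The reduction above can be sketched with SciPy's `linprog`, which minimizes, so the objective is kept as min. $\vec{1}^t \vec{t}$ rather than the equivalent max. form. The function name `l1_minimize` is hypothetical, and the stacked variable vector $(\vec{x}, \vec{t})$ is a choice made here, not something the slides prescribe.

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(A, b):
    """Solve min ||Ax - b||_1 as an LP over the stacked variables z = (x, t)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = A.shape
    cost = np.concatenate([np.zeros(n), np.ones(m)])   # minimize sum of t_i
    I = np.eye(m)
    # A x - t <= b   and   -A x - t <= -b
    A_ub = np.block([[A, -I], [-A, -I]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m      # x free, t >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n], res.fun

# The running example: residual vector (2x + 3y - 1, y).
A = np.array([[2.0, 3.0], [0.0, 1.0]])
b = np.array([1.0, 0.0])
x_l1, obj_l1 = l1_minimize(A, b)
```

`linprog` treats variables as non-negative by default, so the explicit `(None, None)` bounds on $\vec{x}$ are what keep the decision variables free.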

Slide 18: $L_\infty$ norm minimization
Decision Variables: $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$.
Objective function: min. $\|A\vec{x} - \vec{b}\|_\infty$.
Constraints: No explicit constraint.
We will reduce this to a linear program.

Slide 19: Example
Example: min. $\|(2x + 3y - 1,\; y)\|_\infty = \min \max(|2x + 3y - 1|, |y|)$.
Trick: Let $t \ge 0$ be s.t.
  $|2x + 3y - 1| \le t$
  $|y| \le t$
Consider the following problem:
  min. $t$
  s.t. $|2x + 3y - 1| \le t$
       $|y| \le t$
       $t \ge 0$
Goal: convince class that this problem is equivalent to the original.
Note: $|y| \le t$ can be rewritten as $-t \le y \le t$.

Slide 20: Example: LP formulation
  min. $t$
  s.t. $2x + 3y - 1 \le t$
       $-2x - 3y + 1 \le t$
       $y \le t$
       $-y \le t$
       $t \ge 0$
Solution: $x = \frac{1}{2}$, $y = 0$.

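Slide 21: $L_\infty$ norm minimization
Problem: min. $\|A\vec{x} - \vec{b}\|_\infty$.
Solution: Let $A$ be an $m \times n$ matrix. Add a single variable $t$ that bounds the residual of every row of $A$.
  min. $t$ s.t. $|A_i \vec{x} - b_i| \le t$ for $i = 1, \ldots, m$, $t \ge 0$
Removing the absolute values:
  min. $t$ s.t. $A\vec{x} - \vec{b} \le t \cdot \vec{1}$, $-A\vec{x} + \vec{b} \le t \cdot \vec{1}$, $t \ge 0$
We write it in the general form:
  max. $-t$ s.t. $A\vec{x} - t \cdot \vec{1} \le \vec{b}$, $-A\vec{x} - t \cdot \vec{1} \le -\vec{b}$, $t \ge 0$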
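A matching sketch for the $L_\infty$ case, again using SciPy's `linprog` in its minimization form. The function name `linf_minimize` is hypothetical, and the three-point data set in the usage is made up for illustration: for $b = (0, 1, 4)$ and a single unknown, the minimizer is the midpoint of the extreme values.

```python
import numpy as np
from scipy.optimize import linprog

def linf_minimize(A, b):
    """Solve min ||Ax - b||_inf as an LP over the stacked variables z = (x, t)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = A.shape
    cost = np.concatenate([np.zeros(n), [1.0]])        # minimize the single bound t
    ones = np.ones((m, 1))
    # A x - t*1 <= b   and   -A x - t*1 <= -b
    A_ub = np.block([[A, -ones], [-A, -ones]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)]          # x free, t >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n], res.fun

# Running example from the slides: residual vector (2x + 3y - 1, y).
x_ex, t_ex = linf_minimize([[2.0, 3.0], [0.0, 1.0]], [1.0, 0.0])

# Hypothetical 1-D data: minimize max(|x - 0|, |x - 1|, |x - 4|).
x_mid, t_mid = linf_minimize([[1.0], [1.0], [1.0]], [0.0, 1.0, 4.0])
```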

Slide 22: Application-1: Estimation of Solutions

Slide 23: Application-1: Solution Estimation
Problem: We are trying to solve the linear equation $A\vec{x} = \vec{b}$.
1. The entries in $A$, $\vec{b}$ could have random errors (noisy measurements).
2. There are many more equations than variables.
[Figure: diagram relating $\vec{x}$, $A\vec{x}$, the noise, and $\vec{b}$.]

Slide 24: Application: Estimation Example
Problem: We are trying to measure a quantity $x$ using noisy measurements:
  $2x = 5.1$
  $3x = 7.6$
  $4x = 9.91$
  $5x = 12.45$
  $6x = 15.1$
Q-1: Is there a value of $x$ that explains the measurements?
Q-2: What do we do in this case?

Slide 25: Application: Estimation Example
Problem: We are trying to measure a quantity $x$ using noisy measurements:
  $2x = 5.1$
  $3x = 7.6$
  $4x = 9.91$
  $5x = 12.45$
  $6x = 15.1$
Find $x$ that nearly fits all the measurements, i.e.,
  min. $\|(2x - 5.1,\; 3x - 7.6,\; 4x - 9.91,\; 5x - 12.45,\; 6x - 15.1)\|_p$.
We can choose $p = 1, 2, \infty$ and see what answer we get.

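Slide 26: Application: Estimation
$L_2$ norm: Apply least squares for $A = (2, 3, 4, 5, 6)^t$, $\vec{b} = (5.1, 7.6, 9.91, 12.45, 15.1)^t$.
Solving for $x^* = \arg\min \|Ax - \vec{b}\|_2$ yields the estimate of $x$ (the numeric value is missing in the source).
Residual error (as printed): (0.09, 0.08, .11, 0.08, .06).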
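With a single unknown, the least-squares estimate has the closed form $x^* = A^t \vec{b} / A^t A$. The sketch below evaluates it for the five measurements listed on the previous slide. The residual sign convention $Ax - \vec{b}$ is an assumption, and the computed numbers are not claimed to reproduce the figures printed on this slide.

```python
import numpy as np

# Measurements a_i * x = b_i, taken from the estimation example.
a = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
b = np.array([5.1, 7.6, 9.91, 12.45, 15.1])

# One-variable least squares: x* = (a . b) / (a . a).
x_ls = float(a @ b) / float(a @ a)

# Residuals under the (assumed) convention a * x - b.
residual = a * x_ls - b

# Cross-check against the general-purpose solver.
x_ref, *_ = np.linalg.lstsq(a[:, None], b, rcond=None)
```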

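Slide 27: Application: Estimation
$L_1$ norm: Reduce to a linear program using $A$, $\vec{b}$ (matrix entries missing in the source).
Solution: $x$ = (value missing in the source). Residual error (as printed): (.12, .13, 0.05, 0.0, .16).
$L_\infty$ norm: Reduce to a linear program.
Solution: $x$ = (value missing in the source). Residual error (as printed): (.1, .1, .1, .05, .1).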

Slide 28: Experiment with Larger Matrices (matlab)
$L_1$ norm: large number of entries with small residuals, spread is larger.
$L_2$ norm: residuals look like a Gaussian distribution.
$L_\infty$ norm: residuals look more uniform, smaller spread.
Matlab code is available, run your own experiments!

Slide 29: Regularized Approximation
Ordinary least squares can have poor performance (ill-conditioning problem).
Tikhonov Regularized Approximation: minimize $\|A\vec{x} - \vec{b}\|_2^2 + \lambda \|\vec{x}\|_2^2$.
$\lambda \ge 0$ is a tuning parameter.
Note-1: This is the same as the following norm minimization:
  minimize $\left\| \begin{bmatrix} A \\ \lambda^{1/2} I \end{bmatrix} \vec{x} - \begin{bmatrix} \vec{b} \\ \vec{0} \end{bmatrix} \right\|_2^2$.
Note-2: In general, we can also have regularized approximation using $L_1$, $L_\infty$ and other norms.
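Note-1 can be checked numerically: solving the stacked least-squares problem should give the same answer as the regularized normal equations $(A^t A + \lambda I)\vec{x} = A^t \vec{b}$. The sketch below does this on a small, nearly rank-deficient matrix that is made up for illustration. The function name `tikhonov` is hypothetical.

```python
import numpy as np

def tikhonov(A, b, lam):
    """Minimize ||Ax - b||_2^2 + lam * ||x||_2^2 via the stacked least-squares form."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])   # [A; sqrt(lam) I]
    b_aug = np.concatenate([b, np.zeros(n)])           # [b; 0]
    x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return x

# Hypothetical ill-conditioned data: the two columns are almost identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
b = np.array([2.0, 2.0001, 1.9999])
lam = 0.1

x_stacked = tikhonov(A, b, lam)
x_normal = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b)
```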

Slide 30: Penalty Function Approximation
Problem: Solve minimize $\phi(A\vec{x} - \vec{b})$, where $\phi$ is a penalty function.
If $\phi = L_1, L_2, L_\infty$, this is exactly the same as norm minimization.
Note-1: In general, $\phi$ need not be a norm.
Note-2: $\phi$ is sometimes called a loss function.

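Slide 31: Penalty Functions: Example
Deadzone Linear Penalty: Let $\vec{x} = (x_1, \ldots, x_n)^t$. The deadzone linear penalty is the sum of penalties on each individual component $x_1, \ldots, x_n$:
  $\phi_{dz}(\vec{x}) = \phi_{dz}(x_1) + \phi_{dz}(x_2) + \cdots + \phi_{dz}(x_n)$,
wherein
  $\phi_{dz}(u) = 0$ if $|u| \le a$, and $\phi_{dz}(u) = b(|u| - a)$ if $|u| > a$.
[Figure: $\phi_{dz}$ plotted together with the $L_1$ and $L_2$ penalties; the deadzone lies between $-a$ and $a$.]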

Slide 32: Deadzone Linear Penalty (cont)
Deadzone vs. $L_2$: The $L_2$ norm imposes a large penalty when the residual is high. Therefore, outliers in the data can affect performance drastically.
The deadzone penalty function is generally less sensitive to outliers.
Q: How do we solve the deadzone penalty approximation problem?
A: Apply the tricks for $L_1$, $L_\infty$ (upcoming assignment).

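Slide 33: Other penalty functions
Huber Penalty: The Huber penalty is the sum of functions on individual components $x_1, \ldots, x_n$:
  $\phi_H(x_1) + \phi_H(x_2) + \cdots + \phi_H(x_n)$,
wherein
  $\phi_H(u) = u^2$ if $|u| \le a$, and $\phi_H(u) = a(2|u| - a)$ if $|u| > a$.
[Figure: $\phi_H$ plotted together with the $L_2$ penalty; the quadratic region lies between $-a$ and $a$.]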
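The two penalties can be written as short vectorized functions, which makes it easy to compare how each one scores a small residual and an outlier. This is a minimal sketch. The function names are hypothetical, and the parameters `a` and `b` follow the symbols used in the definitions above.

```python
import numpy as np

def deadzone_linear(u, a, b=1.0):
    """Deadzone linear penalty: zero for |u| <= a, slope b outside the deadzone."""
    au = np.abs(np.asarray(u, dtype=float))
    return np.where(au <= a, 0.0, b * (au - a))

def huber(u, a):
    """Huber penalty: quadratic for |u| <= a, linear growth a*(2|u| - a) beyond."""
    u = np.asarray(u, dtype=float)
    au = np.abs(u)
    return np.where(au <= a, u ** 2, a * (2.0 * au - a))

# A small residual and an outlier, scored by each penalty and by the squared loss.
r = np.array([0.5, 3.0])
dz = deadzone_linear(r, a=1.0, b=3.0)   # small residual is ignored entirely
hb = huber(r, a=1.0)                    # outlier is charged linearly
sq = r ** 2                             # outlier is charged quadratically
```

Both branches of the Huber penalty agree at $|u| = a$ (each gives $a^2$), so the function is continuous there.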

Slide 34: Penalty Function Approximation: Summary
General observations:
$L_2$: Natural, easy to solve (using the pseudo-inverse). However, sensitive to outliers.
$L_1$: Sparse: ensures that most residuals are very small. Insensitive to outliers. However, the spread of the error is larger.
$L_\infty$: Non-sparse but minimizes spread. Very sensitive to outliers.
Other loss functions: provide various tradeoffs.
Details: Cf. Boyd's book.

Slide 35: Application-2: Signal Denoising

Slide 36: Application-2: Signal Denoising
Input: Given samples of a noisy signal $\vec{y}_n = (y_{n,1}, y_{n,2}, \ldots, y_{n,N})^t$. The original signal is unknown (very little data available).
Problem: Find a new signal $\vec{y}$ by minimizing
  $\|\vec{y} - \vec{y}_n\|_p^p + \phi_s(\vec{y})$,
where $\phi_s$ is a smoothing function.
[Figure: two panels, "Original Signal" and "Signal With Noise".]

Slide 37: Smoothing Criteria
Quadratic Smoothing: $L_2$ norm on successive differences.
  $\phi_q(\vec{y}) = \sum_{i=2}^{N} (y_i - y_{i-1})^2$.
Total Variation Smoothing: $L_1$ norm on successive differences.
  $\phi_{tv}(\vec{y}) = \sum_{i=2}^{N} |y_i - y_{i-1}|$.
Maximum Variation Smoothing: $L_\infty$ norm on successive differences.
  $\phi_{max}(\vec{y}) = \max_{i=2}^{N} |y_i - y_{i-1}|$.

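Slide 38: Quadratic Smoothing
Problem: minimize $\|\vec{y} - \vec{y}_n\|_2^2 + \phi_q(\vec{y})$, where $\vec{y}_n$ is the noisy signal.
This can be viewed as a least squares problem:
  minimize $\left\| \begin{bmatrix} I \\ \Delta \end{bmatrix} \vec{y} - \begin{bmatrix} \vec{y}_n \\ \vec{0} \end{bmatrix} \right\|_2^2$,
where $\Delta$ is the $(N-1) \times N$ first-difference matrix: row $i$ has $-1$ in column $i$ and $+1$ in column $i+1$, so that $\|\Delta \vec{y}\|_2^2 = \phi_q(\vec{y})$.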
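A minimal sketch of quadratic smoothing as a stacked least-squares problem follows. The test signal, noise level and random seed are made up for illustration. The `weight` parameter is an addition that is not on the slide (the slide corresponds to `weight = 1`), and the function names are hypothetical.

```python
import numpy as np

def difference_matrix(N):
    """(N-1) x N first-difference matrix: (D @ y)[i] = y[i+1] - y[i]."""
    return np.diff(np.eye(N), axis=0)

def quadratic_smooth(y_noisy, weight=1.0):
    """Minimize ||y - y_noisy||_2^2 + weight * sum of squared successive differences."""
    y_noisy = np.asarray(y_noisy, dtype=float)
    N = y_noisy.size
    D = difference_matrix(N)
    M = np.vstack([np.eye(N), np.sqrt(weight) * D])    # [I; sqrt(weight) * D]
    rhs = np.concatenate([y_noisy, np.zeros(N - 1)])   # [y_noisy; 0]
    y, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return y

# Hypothetical noisy signal: a slow sine wave plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 50)
y_noisy = np.sin(t) + 0.3 * rng.standard_normal(t.size)

y_smooth = quadratic_smooth(y_noisy)
D = difference_matrix(t.size)
rough_before = float(np.sum((D @ y_noisy) ** 2))
rough_after = float(np.sum((D @ y_smooth) ** 2))
```

Larger values of `weight` trade fidelity to the noisy samples for a smoother reconstruction.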

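Slide 39: Total/Maximum Variation Smoothing
Total variation smoothing: $L_1$ least squares problem.
  minimize $\left\| \begin{bmatrix} I \\ \Delta \end{bmatrix} \vec{y} - \begin{bmatrix} \vec{y}_n \\ \vec{0} \end{bmatrix} \right\|_1$
Max. variation smoothing: $L_\infty$ least squares problem.
  minimize $\left\| \begin{bmatrix} I \\ \Delta \end{bmatrix} \vec{y} - \begin{bmatrix} \vec{y}_n \\ \vec{0} \end{bmatrix} \right\|_\infty$
where $\Delta$ is the $(N-1) \times N$ first-difference matrix: row $i$ has $-1$ in column $i$ and $+1$ in column $i+1$.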

Slide 40: Experiment
[Figure: panels "Original Signal", "Signal With Noise", "L1 reconstructed", "L2 reconstructed", "L∞ reconstructed", "L1 residuals", "L2 residuals", "L∞ residuals".]

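Slide 41: Other Smoothing Functions
Consider three successive data points instead of two:
  $\phi_T(\vec{y}) = \sum_{i=2}^{m-1} |2y_i - (y_{i-1} + y_{i+1})|$.
Modify the matrix $\Delta$ as follows: each row has the entries $-1, 2, -1$ in three consecutive columns, shifted one column to the right from one row to the next.
$\Delta$ is called a Toeplitz matrix.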

Slide 42: Application-3: Regression

Slide 43: Linear Regression
Inputs: Data points $\vec{x}_i = (x_{i1}, \ldots, x_{in})$ and outcome $y_i$.
Goal: Find a function that best fits the data.
  $f(\vec{x}) = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n + c_0$
[Figure: "input vs. output plot for given data".]


Slide 45: Linear Regression
Let $X$ be the data matrix and $\vec{y}$ be the corresponding outcome vector.
Note: The $i$-th row $X_i$ represents data point $i$, and $y_i$ is the corresponding outcome for $X_i$.
If $\vec{c}^t \vec{x} + c_0$ is the best line fit, the error for the $i$-th data point is:
  $\epsilon_i = (X_i \vec{c}) + c_0 - y_i$.
We can write the overall error as $\|[X\ \vec{1}]\, \vec{c} - \vec{y}\|_p$, where $\vec{c}$ now includes $c_0$ as its last entry.

Slide 46: Regression
Inputs: Data $X$, outcome $\vec{y}$.
Decision Vars: $c_0, c_1, \ldots, c_n$, the coefficients of the straight line.
Solution: minimize $\|[X\ \vec{1}]\, \vec{c} - \vec{y}\|_p$.
Note: Instead of the norm, we could use any penalty function.
Ordinary Regression: min $\|[X\ \vec{1}]\, \vec{c} - \vec{y}\|_p^p$.
Tikhonov Regression: min $\|[X\ \vec{1}]\, \vec{c} - \vec{y}\|_2^2 + \lambda \|\vec{c}\|_2^2$.
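The effect of the norm choice can be seen on a small synthetic line fit. The sketch below builds $[X\ \vec{1}]$, fits it once with $p = 2$ (least squares) and once with $p = 1$ (the LP reduction with one slack variable per data point), and compares the slopes. The data are hypothetical: 80 noise-free points on $y = 2x + 1$ plus 5 points shifted by $+30$, chosen so that the result is deterministic. It is not the data set used in the lecture.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l2(X, y):
    """Least-squares fit of [X 1] c = y; the last entry of c is the intercept c0."""
    M = np.hstack([X, np.ones((X.shape[0], 1))])
    c, *_ = np.linalg.lstsq(M, y, rcond=None)
    return c

def fit_l1(X, y):
    """L1 fit of [X 1] c = y through the LP with slack variables t >= |M c - y|."""
    M = np.hstack([X, np.ones((X.shape[0], 1))])
    m, k = M.shape
    cost = np.concatenate([np.zeros(k), np.ones(m)])
    A_ub = np.block([[M, -np.eye(m)], [-M, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * k + [(0, None)] * m      # c free, t >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:k]

# Hypothetical data: 85 points on y = 2x + 1, the last 5 turned into outliers.
x = np.linspace(0.0, 10.0, 85)
X = x[:, None]
y = 2.0 * x + 1.0
y[-5:] += 30.0

c_l2 = fit_l2(X, y)   # pulled toward the outliers
c_l1 = fit_l1(X, y)   # stays on the line through the 80 regular points
```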

Slide 47: Experiment
Experiment: Create a noisy data set with outliers.
80 regular data points with low noise.
5 outlier data points with large noise.
Goal: Compare the effect of $L_1$, $L_2$, $L_\infty$ norm regressions.

Slide 48: Regression with Outliers
[Figure: "Regression with outliers", showing the $L_1$, $L_2$ and $L_\infty$ fits.]

Slide 49: Regression with Outliers (Error Histogram)
[Figure: histograms titled "L1 norm residual", "L2 norm residual", "L∞ norm residual".]

Slide 50: Linear Regression With Non-Linear Models
Input: Data $X$, outcome $\vec{y}$, functions $g_1, \ldots, g_k$.
Goal: Fit a model of the form $y_i = c_0 + c_1 g_1(\vec{x}) + c_2 g_2(\vec{x}) + \ldots + c_k g_k(\vec{x})$.
Solution: Transform the data matrix as follows:
  $X' = \begin{bmatrix} g_1(X_1) & \ldots & g_k(X_1) \\ g_1(X_2) & \ldots & g_k(X_2) \\ \vdots & & \vdots \\ g_1(X_n) & \ldots & g_k(X_n) \end{bmatrix}$.
Apply linear regression on $(X', \vec{y})$.
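The transformation can be sketched as a function that evaluates each $g_j$ on the data and stacks the results as columns. In this sketch the constant column for $c_0$ is placed first, which is a layout choice made here. The basis functions and the noise-free data are hypothetical, picked so that the fitted coefficients can be checked exactly.

```python
import numpy as np

def design_matrix(x, basis):
    """Columns: a constant 1 (for c0) followed by g_1(x), ..., g_k(x)."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)] + [g(x) for g in basis]
    return np.column_stack(cols)

# Hypothetical basis functions g_1, g_2, g_3 of a scalar input.
basis = [np.sin, np.cos, lambda v: np.cos(1.5 * v)]

# Noise-free data generated from known coefficients (c0 = 0).
x = np.linspace(-5.0, 5.0, 85)
y = 0.2 * np.sin(x) + 1.5 * np.cos(x) + 2.0 * np.cos(1.5 * x)

Xg = design_matrix(x, basis)
coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)   # ordinary (L2) linear regression
```

The model is non-linear in $\vec{x}$ but still linear in the coefficients, which is why the same regression machinery, including the $L_1$ and $L_\infty$ variants, applies unchanged to the transformed matrix.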

Slide 51: Experiment
Experiment: Create a noisy non-linear data set with outliers.
  y = 0.2sin(x)+1.5 cos(x)+2 cos(1.5 x)+.2 x 3+Noise.
80 regular data points with low noise.
5 outlier data points with large noise.
[Figure: "Regression with non linear model", showing the data and the outliers.]
Goal: Compare the effect of $L_1$, $L_2$, $L_\infty$ norm regressions.

Slide 52: Experiment: linear regression with a non-linear model
[Figure: "Regression with outliers", showing the data, the outliers, and the $L_1$, $L_2$ and $L_\infty$ fits.]

Slide 53: Experiment: residuals
[Figure: histograms titled "L1 norm residual", "L2 norm residual", "L∞ norm residual".]

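Slide 54: Classification
Problem: Given two classes of data points: $\vec{x}_1^+, \ldots, \vec{x}_N^+$ and $\vec{x}_1^-, \ldots, \vec{x}_M^-$.
Goal: Find a separating hyperplane between the classes of data points.
[Figure: the $\vec{x}^+$ points, the $\vec{x}^-$ points, and a classifier line.]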

Slide 55: Application-4: Classification

Slide 56: Linear Classification
Problem: Given two classes of data points: $\vec{x}_1^+, \ldots, \vec{x}_N^+$ and $\vec{x}_1^-, \ldots, \vec{x}_M^-$.
Goal: Find a separating hyperplane between the classes of data points.
Q-1: Will such a hyperplane always exist?
A: Not always! Only if the data is linearly separable.
1. Assume that such a hyperplane exists.
2. Deal with cases where a hyperplane does not exist.

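Slide 57: Linear Classification (cont)
Solution: Use linear programming.
Decision Variables: $w_0, w_1, w_2, \ldots, w_n$, representing the classifier:
  $f_w(\vec{x}) = w_0 + w_1 x_1 + \cdots + w_n x_n$.
Linear Programming:
  max. ???
  s.t. $X^+ \vec{w} \ge 0$
       $X^- \vec{w} \le 0$
where
  $X^+ = \begin{bmatrix} \vec{x}_1^+ & 1 \\ \vdots & \vdots \\ \vec{x}_N^+ & 1 \end{bmatrix}$, $X^- = \begin{bmatrix} \vec{x}_1^- & 1 \\ \vdots & \vdots \\ \vec{x}_M^- & 1 \end{bmatrix}$.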

Slide 58: Classification Using Linear Programs
Objective: Let's try $\sum_i w_i = \vec{1}^t \vec{w}$.
Note: We write $\langle \vec{a}, \vec{b} \rangle$ for the dot product $\vec{a}^t \vec{b}$.
Simplified data representation: Represent the two classes as $+1$, $-1$ for the data.
Data: $(\vec{x}_1, y_1), \ldots, (\vec{x}_N, y_N) \equiv (X, \vec{y})$, where $y_i = +1$ for class I and $y_i = -1$ for class II data.
Linear Program:
  min. $\langle \vec{1}, \vec{w} \rangle + w_0$
  s.t. $y_i (\langle \vec{x}_i, \vec{w} \rangle + w_0) \ge 0$
Verify that this formulation works.

Slide 59: Experiment: Classification
Experiment: Find a classifier for the following points.
[Figure: "Data for classification".]

Slide 60: Experiment: Classification
Experiment: Outcome.
[Figure: "Classifiers".]
Note: Many different classifiers are possible.
Note-2: What is the deal with points that lie on the classifier line?

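Slide 61: Dual, Complementary Slackness
Linear Program:
  min. $\langle \vec{1}, \vec{w} \rangle + w_0$
  s.t. $y_i (\langle \vec{x}_i, \vec{w} \rangle + w_0) \ge 0$
Complementary Slackness: Let $\alpha_1, \ldots, \alpha_N$ be the dual variables.
  $\alpha_i\, y_i (\langle \vec{x}_i, \vec{w} \rangle + w_0) = 0$, $\alpha_i \ge 0$.
Claim: There are at least $n+1$ points $\vec{x}_{j_1}, \ldots, \vec{x}_{j_{n+1}}$ s.t. $\langle \vec{x}_{j_i}, \vec{w} \rangle + w_0 = 0$.
Note: $n+1$ points fully determine the hyperplane $(\vec{w}; w_0)$.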

Slide 62: Classifying Points
Problem: Given a new point $\vec{x}$, let us find which class it belongs to.
Solution-1: Just find the sign of $\langle \vec{w}, \vec{x} \rangle + w_0$; we know $(\vec{w}; w_0)$ from our LP solution.
Solution-2: More complicated. Express $\vec{x}$ as a linear combination of support vectors:
  $\vec{x} = \lambda_1 \vec{x}_{j_1} + \lambda_2 \vec{x}_{j_2} + \cdots + \lambda_{n+1} \vec{x}_{j_{n+1}}$
We can then write
  $\langle \vec{w}, \vec{x} \rangle + w_0 = \sum_i \lambda_i \underbrace{(w_0 + \langle \vec{w}, \vec{x}_{j_i} \rangle)}_{0} + \left(1 - \sum_i \lambda_i\right) w_0 = \left(1 - \sum_i \lambda_i\right) w_0$.
This is impractical for linear programming formulations.

Slide 63: Non-Separable Data
Q: What if the data is not linearly separable?
The linear program will be primal infeasible.
In practice, linear separation may hold without noise but not with noise.
Solution: Relax the conditions on the half-space $\vec{w}$.
[Figure: points on the wrong side of the separating line, with their violations labelled $\xi_1$, $\xi_2$.]

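Slide 64: Non-Separable Data: Linear Programming
Classifier: (old) $w_0 + \langle \vec{w}, \vec{x} \rangle$. (new) $\langle \vec{w}, \vec{x} \rangle + 1$.
Note: The classifier $\sum_i w_i x_i + w_0$ is equivalent to $\sum_i \frac{w_i}{w_0} x_i + 1$ if $w_0 \ne 0$.
LP Formulation:
  minimize $\|\vec{\xi}\|_1$
  s.t. $\langle \vec{x}_i^+, \vec{w} \rangle + 1 \ge -\xi_i^+$
       $\langle \vec{x}_j^-, \vec{w} \rangle + 1 \le \xi_j^-$
       $\xi_i^+, \xi_j^- \ge 0$
The variable $\vec{\xi}$ encodes tolerance. We minimize $\vec{\xi}$ as part of the objective.
Exercise: Verify that the support vector interpretation of the dual works.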
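The idea of paying for violations with slack variables can be sketched as follows. Note that this is not the exact formulation on the slide: it uses a commonly used variant that keeps $w_0$ as a variable and asks for $y_i(\langle \vec{w}, \vec{x}_i \rangle + w_0) \ge 1 - \xi_i$, so the margin rather than $w_0$ is normalized to 1. The function name and the six data points are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def lp_classifier(X, y):
    """Minimize sum(xi) s.t. y_i (<w, x_i> + w0) >= 1 - xi_i, xi >= 0.

    Variables are stacked as z = (w, w0, xi). This margin-1 form is an
    assumed variant, not the slide's normalization w0 = 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N, n = X.shape
    cost = np.concatenate([np.zeros(n + 1), np.ones(N)])
    # Rewrite the constraint as: -y_i <w, x_i> - y_i w0 - xi_i <= -1
    A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(N)])
    b_ub = -np.ones(N)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * N   # w, w0 free; xi >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n], res.x[n], res.x[n + 1:]

# Hypothetical, linearly separable 2-D data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 2.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w, w0, xi = lp_classifier(X, y)
predicted = np.sign(X @ w + w0)
```

On separable data the optimal total slack is zero. On non-separable data the LP stays feasible and the non-zero $\xi_i$ mark the points that violate the margin.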

Slide 65: Non Linear Model Classification
Linear Classifier: $\langle \vec{w}, \vec{x} \rangle + 1$.
Non-linear functions: $\phi_i : \mathbb{R}^n \to \mathbb{R}$.
  $\langle \vec{w}, (\phi_1(\vec{x}), \phi_2(\vec{x}), \ldots, \phi_m(\vec{x})) \rangle$.
This is exactly the same idea as linear regression with non-linear functions.
