CSCI5654 (Linear Programming, Fall 2013): Lectures 10-12
Today's Lecture: 1. Introduction to norms: L_1, L_2, L_∞. 2. Casting absolute value and max operators. 3. Norm minimization problems. 4. Applications: fitting, classification, denoising, etc.
Norm. Basic idea: a notion of length/distance for vectors. Definition (Norm): Let ℓ : X → R_{≥0} be a mapping from a vector space to the non-negative reals. The function ℓ is a norm iff 1. ℓ(x) = 0 iff x = 0, 2. ℓ(a x) = |a| ℓ(x), 3. ℓ(x + y) ≤ ℓ(x) + ℓ(y) (triangle inequality). Length of x: ℓ(x). Distance between x and y: ℓ(y − x) (= ℓ(x − y)).
L_2 (Euclidean) Norm. The distance between (x_1, y_1) and (x_2, y_2) is sqrt((x_1 − x_2)^2 + (y_1 − y_2)^2). In general, ||x||_2 = sqrt(x^T x). The distance between x and y is given by ||x − y||_2.
L_p Norm. For p ≥ 1, we define the L_p norm as ||x||_p = (|x_1|^p + |x_2|^p + ... + |x_n|^p)^{1/p}. The L_p norm distance between two vectors is ||x − y||_p. L_1 norm: ||x||_1 = |x_1| + |x_2| + ... + |x_n|. Class exercise: verify that the L_1 norm distance is indeed a norm. Q: Why is p ≥ 1 necessary?
L_∞ Norm. Obtained as the limit lim_{p→∞} ||x||_p. Definition: For a vector x, ||x||_∞ is defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_n|). Ex-1: Can we verify that L_∞ is really a norm? Ex-2 (challenge): Prove that the limit definition coincides with the explicit form of the L_∞ norm.
Norm: Exercise. Take the vectors x = (0, 1, −1), y = (2, 1, 1). Write down the L_1, L_2, L_∞ norm distances between x and y.
Norm Spheres. ε-sphere: { x | d(x, 0) ≤ ε } for a norm d. [Figure: the unit spheres (ε = 1) corresponding to the norms L_1, L_2, L_∞.] Note: norm spheres are convex sets.
Norm Minimization Problems. General form: minimize ||x||_p subject to A x ≤ b. Note-1: We always minimize the norm functions (in this course). Note-2: Maximization of a norm is generally a hard (non-convex) problem.
Unconstrained Norm Minimization. Let's first study the following problem: min ||A x − b||_p. 1. For p = 2, this problem is called (unconstrained) least squares; we will solve it using calculus. 2. For p = 1, ∞, this problem is called L_1 (L_∞) least squares; we will reduce it to an LP. 3. Applications: solving (noisy) linear systems, regularization, denoising, maximum-likelihood estimation, regression, and so on.
Unconstrained Least Squares. Decision variables: x = (x_1, ..., x_n) ∈ R^n. Objective function: min ||A x − b||_2. Constraints: no explicit constraints.
Least Squares: Example. Example: min ||(2x + 3y − 1, y)||_2. Solution: We equivalently minimize (2x + 3y − 1)^2 + y^2. Setting the partial derivatives to zero, we obtain the conditions ∂/∂x [(2x + 3y − 1)^2 + y^2] = 0 and ∂/∂y [(2x + 3y − 1)^2 + y^2] = 0. In other words: 4(2x + 3y − 1) = 0 and 6(2x + 3y − 1) + 2y = 0. The optimum lies at y = 0, x = 1/2. Verify that it is a minimum by checking the second-order conditions (the Hessian matrix).
Unconstrained Least Squares. Problem: minimize ||A x − b||_2. Solution: We find the minimum using partial derivatives. Recall: ∇f(x_1, ..., x_n) = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n), and the criterion for an optimal point of f(x) is ∇f = (0, ..., 0). ∇(||A x − b||_2^2) = ∇((A x − b)^T (A x − b)) = ∇(x^T A^T A x − 2 x^T A^T b + b^T b) = 2 A^T A x − 2 A^T b = 0. The minimum occurs at x = (A^T A)^{−1} A^T b (assuming A has full column rank). A similar solution exists for the L_p norm with even p.
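Code sketch (not part of the original slides): a minimal NumPy illustration of the closed-form solution x = (A^T A)^{-1} A^T b; the matrix A and vector b are made-up example data.
```python
import numpy as np

# Made-up overdetermined system A x = b (more equations than unknowns).
A = np.array([[2.0, 3.0],
              [0.0, 1.0],
              [1.0, -1.0]])
b = np.array([1.0, 0.0, 0.5])

# Normal-equations solution x = (A^T A)^{-1} A^T b (assumes full column rank).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Numerically preferred alternative: np.linalg.lstsq solves the same problem.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)  # the two answers should agree
```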
L_1 Norm Minimization. Decision variables: x = (x_1, ..., x_n) ∈ R^n. Objective function: min ||A x − b||_1. Constraints: no explicit constraints. We will reduce this to a linear program.
Example. Example: min ||(2x + 3y − 1, y)||_1 = |2x + 3y − 1| + |y|. Trick: let t_1 ≥ 0, t_2 ≥ 0 be such that |2x + 3y − 1| ≤ t_1 and |y| ≤ t_2. Consider the following problem: min t_1 + t_2 s.t. |2x + 3y − 1| ≤ t_1, |y| ≤ t_2, t_1, t_2 ≥ 0. Goal: convince the class that this problem is equivalent to the original. Note: |y| ≤ t_2 can be rewritten as −t_2 ≤ y ≤ t_2.
Example: LP Formulation. min t_1 + t_2 s.t. 2x + 3y − 1 ≤ t_1, −2x − 3y + 1 ≤ t_1, y ≤ t_2, −y ≤ t_2, t_1, t_2 ≥ 0. Solution: x = 1/2, y = 0. Q: Can we repeat the same trick for a constraint of the form |y| ≥ t_2? A: No; a disjunction of inequalities would be needed.
L_1 Norm Minimization. Problem: min ||A x − b||_1. Solution: Let A be an m × n matrix. Add variables t_1, ..., t_m corresponding to the m rows: min Σ_{i=1}^m t_i s.t. |A_i x − b_i| ≤ t_i, t_1, ..., t_m ≥ 0; equivalently, min Σ_{i=1}^m t_i s.t. A x − b ≤ t, −A x + b ≤ t, t_1, ..., t_m ≥ 0. In the standard (maximization) form: max −1^T t s.t. A x − t ≤ b, −A x − t ≤ −b, t_1, ..., t_m ≥ 0.
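Code sketch (not part of the original slides): one way to set up this LP with scipy.optimize.linprog; the helper name l1_fit is made up, and the usage example reuses the measurement data from the estimation example later in the lecture.
```python
import numpy as np
from scipy.optimize import linprog

def l1_fit(A, b):
    """Solve min_x ||A x - b||_1 via the LP reduction: variables z = [x, t]."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])    # objective: sum of t
    A_ub = np.block([[A, -np.eye(m)],                 # A x - t <= b
                     [-A, -np.eye(m)]])               # -A x - t <= -b
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m     # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

# Usage: the noisy measurements 2x = 5.1, 3x = 7.6, ... from the slides.
A = np.array([[2.0], [3.0], [4.0], [5.0], [6.0]])
b = np.array([5.1, 7.6, 9.91, 12.45, 15.1])
print(l1_fit(A, b))  # L_1 estimate of x for this data
```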
L_∞ Norm Minimization. Decision variables: x = (x_1, ..., x_n) ∈ R^n. Objective function: min ||A x − b||_∞. Constraints: no explicit constraints. We will reduce this to a linear program.
Example. Example: min ||(2x + 3y − 1, y)||_∞ = min max(|2x + 3y − 1|, |y|). Trick: let t ≥ 0 be such that |2x + 3y − 1| ≤ t and |y| ≤ t. Consider the following problem: min t s.t. |2x + 3y − 1| ≤ t, |y| ≤ t, t ≥ 0. Goal: convince the class that this problem is equivalent to the original. Note: |y| ≤ t can be rewritten as −t ≤ y ≤ t.
Example: LP Formulation. min t s.t. 2x + 3y − 1 ≤ t, −2x − 3y + 1 ≤ t, y ≤ t, −y ≤ t, t ≥ 0. Solution: x = 1/2, y = 0.
L_∞ Norm Minimization. Problem: min ||A x − b||_∞. Solution: Let A be an m × n matrix. Add a single variable t bounding all m rows: min t s.t. |A_i x − b_i| ≤ t for each row i, t ≥ 0; equivalently, min t s.t. A x − b ≤ t·1, −A x + b ≤ t·1, t ≥ 0. In the standard (maximization) form: max −t s.t. A x − t·1 ≤ b, −A x − t·1 ≤ −b, t ≥ 0.
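Code sketch (not part of the original slides): the same idea for the L_∞ reduction, with a single bound variable t; the helper name linf_fit is made up.
```python
import numpy as np
from scipy.optimize import linprog

def linf_fit(A, b):
    """Solve min_x ||A x - b||_inf via the LP reduction: variables z = [x, t]."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), [1.0]])          # objective: t
    ones = np.ones((m, 1))
    A_ub = np.block([[A, -ones],                      # A x - t*1 <= b
                     [-A, -ones]])                    # -A x - t*1 <= -b
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)]         # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

# Usage: the same noisy measurement data as before.
A = np.array([[2.0], [3.0], [4.0], [5.0], [6.0]])
b = np.array([5.1, 7.6, 9.91, 12.45, 15.1])
print(linf_fit(A, b))  # roughly 2.501, matching the slides
```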
Application-1: Estimation of Solutions
Application-1: Solution Estimation. Problem: We are trying to solve the linear equation A x = b, where: 1. the entries of A, b could have random errors (noisy measurements); 2. there are many more equations than variables. [Figure: block diagram in which x passes through A, noise is added, and b is observed.]
Application: Estimation Example. Problem: We are trying to measure a quantity x using noisy measurements: 2x = 5.1, 3x = 7.6, 4x = 9.91, 5x = 12.45, 6x = 15.1. Q-1: Is there a value of x that explains all the measurements? Q-2: What do we do in this case?
Application: Estimation Example. Problem: We are trying to measure a quantity x using noisy measurements: 2x = 5.1, 3x = 7.6, 4x = 9.91, 5x = 12.45, 6x = 15.1. Find an x that nearly fits all the measurements, i.e., min ||(2x − 5.1, 3x − 7.6, 4x − 9.91, 5x − 12.45, 6x − 15.1)||_p. We can choose p = 1, 2, ∞ and see what answer we get.
Application: Estimation. L_2 norm: Apply least squares with A = (2, 3, 4, 5, 6)^T and b = (5.1, 7.6, 9.91, 12.45, 15.1)^T. Solving x = argmin ||A x − b||_2 yields x ≈ 2.505. Residual error: (−0.09, −0.08, 0.11, 0.08, −0.06).
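Code sketch (not part of the original slides): a quick NumPy check of the L_2 estimate for this data.
```python
import numpy as np

# Least squares for a single unknown x, using the slide's measurements.
A = np.array([[2.0], [3.0], [4.0], [5.0], [6.0]])
b = np.array([5.1, 7.6, 9.91, 12.45, 15.1])
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x)          # roughly 2.505
print(A @ x - b)  # residual error vector
```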
Application: Estimation. L_1 norm: Reduce to a linear program using the same A = (2, 3, 4, 5, 6)^T, b = (5.1, 7.6, 9.91, 12.45, 15.1)^T. Solution: x = 2.49. Residual error: (−0.12, −0.13, 0.05, 0.0, −0.16). L_∞ norm: Reduce to a linear program. Solution: x = 2.501. Residual error: (−0.1, −0.1, 0.1, 0.05, −0.1).
Experiment with Larger Matrices (Matlab). L_1 norm: a large number of entries with small residuals, but the spread is larger. L_2 norm: residuals look like a Gaussian distribution. L_∞ norm: residuals look more uniform, with a smaller spread. Matlab code is available, run your own experiments!
Regularized Approximation. Ordinary least squares can have poor performance (ill-conditioning problem). Tikhonov regularized approximation: minimize ||A x − b||_2^2 + λ ||x||_2^2, where λ ≥ 0 is a tuning parameter. Note-1: This is the same as the norm minimization minimize || [A; λ^{1/2} I] x − [b; 0] ||_2^2 (stack λ^{1/2} I below A and 0 below b). Note-2: In general, we can also have regularized approximation using L_1, L_∞, and other norms.
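Code sketch (not part of the original slides): Tikhonov regularization via the stacked least-squares form above; the helper name tikhonov_fit and the small ill-conditioned example are made up.
```python
import numpy as np

def tikhonov_fit(A, b, lam):
    """Minimize ||A x - b||_2^2 + lam * ||x||_2^2 via the stacked form
    || [A; sqrt(lam) I] x - [b; 0] ||_2^2."""
    m, n = A.shape
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return x

# Made-up nearly collinear columns (ill conditioning).
A = np.array([[1.0, 1.0001],
              [1.0, 0.9999],
              [2.0, 2.0002]])
b = np.array([2.0, 2.0, 4.1])
print(tikhonov_fit(A, b, lam=0.0))  # ordinary least squares
print(tikhonov_fit(A, b, lam=0.1))  # regularized solution, typically tamer
```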
Penalty Function Approximation. Problem: Solve minimize φ(A x − b), where φ is a penalty function. If φ = L_1, L_2, L_∞, this is exactly the same as norm minimization. Note-1: In general, φ need not be a norm. Note-2: φ is sometimes called a loss function.
Penalty Functions: Example. Deadzone linear penalty: Let x = (x_1, ..., x_n)^T. The deadzone linear penalty is a sum of penalties on the individual components x_1, ..., x_n: φ_dz(x) := φ_dz(x_1) + φ_dz(x_2) + ... + φ_dz(x_n), wherein φ_dz(u) = 0 if |u| ≤ a, and b(|u| − a) if |u| > a. [Figure: φ_dz compared with the L_1 and L_2 penalties around the interval [−a, a].]
Deadzone Linear Penalty (cont). Deadzone vs. L_2: The L_2 norm imposes a large penalty when the residual is high; therefore, outliers in the data can affect performance drastically. The deadzone penalty function is generally less sensitive to outliers. Q: How do we solve the deadzone penalty approximation problem? A: Apply the same tricks as for L_1, L_∞ (upcoming assignment).
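Code sketch (not part of the original slides, and only one possible way to apply the trick): an epigraph-style LP for the deadzone penalty, assuming the slope b > 0; the helper name deadzone_fit and its parameters a, slope are illustrative.
```python
import numpy as np
from scipy.optimize import linprog

def deadzone_fit(A, b, a, slope):
    """Minimize sum_i phi_dz((A x - b)_i), where phi_dz(u) = 0 for |u| <= a
    and slope*(|u| - a) otherwise. Uses phi_dz(u) = max(0, slope*(u - a),
    slope*(-u - a)) with one epigraph variable t_i per residual."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])
    I = np.eye(m)
    A_ub = np.block([[slope * A, -I],      # slope*(A x - b - a) <= t
                     [-slope * A, -I]])    # slope*(-(A x - b) - a) <= t
    b_ub = np.concatenate([slope * (b + a), slope * (a - b)])
    bounds = [(None, None)] * n + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

# Usage on the measurement data from the estimation example.
A = np.array([[2.0], [3.0], [4.0], [5.0], [6.0]])
b = np.array([5.1, 7.6, 9.91, 12.45, 15.1])
print(deadzone_fit(A, b, a=0.05, slope=1.0))
```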
Other Penalty Functions. Huber penalty: The Huber penalty is a sum of functions on the individual components x_1, ..., x_n: φ_H(x_1) + φ_H(x_2) + ... + φ_H(x_n), wherein φ_H(u) = u^2 if |u| ≤ a, and a(2|u| − a) if |u| > a. [Figure: φ_H compared with the L_2 penalty around the interval [−a, a].]
Penalty Function Approximation: Summary. General observations: L_2: natural, easy to solve (using the pseudo-inverse), but sensitive to outliers. L_1: sparse; ensures that most residuals are very small and is insensitive to outliers, but the spread of the error is larger. L_∞: non-sparse but minimizes the spread; very sensitive to outliers. Other loss functions provide various tradeoffs. Details: cf. Boyd's book.
Application-2: Signal Denoising
Application-2: Signal Denoising. Input: Given samples of a noisy signal y_n : (y_1, y_2, ..., y_N)^T. The original signal is unknown (very little data available). Problem: Find a new signal y by minimizing ||y − y_n||_p^p + φ_s(y), where φ_s is a smoothing function. [Figure: the original signal and the signal with noise.]
Smoothing Criteria. Quadratic smoothing: L_2 norm on successive differences, φ_q(y) = Σ_{i=2}^N (y_i − y_{i−1})^2. Total variation smoothing: L_1 norm on successive differences, φ_tv(y) = Σ_{i=2}^N |y_i − y_{i−1}|. Maximum variation smoothing: L_∞ norm on successive differences, φ_max(y) = max_{i=2..N} |y_i − y_{i−1}|.
Quadratic Smoothing. Problem: minimize ||y − y_n||_2^2 + φ_q(y). This can be viewed as a least squares problem: minimize || [I; Δ] y − [y_n; 0] ||_2^2, where Δ is the first-difference matrix with rows (−1, 1, 0, ..., 0), (0, −1, 1, 0, ..., 0), ..., (0, ..., 0, −1, 1).
Total/Maximum Variation Smoothing. Total variation smoothing is an L_1 least squares problem: minimize || [I; Δ] y − [y_n; 0] ||_1. Maximum variation smoothing is an L_∞ least squares problem: minimize || [I; Δ] y − [y_n; 0] ||_∞. Here Δ is the same first-difference matrix as before.
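Code sketch (not part of the original slides): quadratic smoothing as a stacked least-squares problem; the weighting knob lam is an assumed addition (the slides stack Δ with weight 1).
```python
import numpy as np

def quadratic_smooth(y_noisy, lam=1.0):
    """Minimize ||y - y_noisy||_2^2 + lam * sum_i (y_i - y_{i-1})^2 via the
    stacked least-squares form || [I; sqrt(lam) D] y - [y_noisy; 0] ||_2^2."""
    N = len(y_noisy)
    D = np.diff(np.eye(N), axis=0)   # first-difference matrix, rows (-1, 1)
    A = np.vstack([np.eye(N), np.sqrt(lam) * D])
    b = np.concatenate([y_noisy, np.zeros(N - 1)])
    y, *_ = np.linalg.lstsq(A, b, rcond=None)
    return y

# Made-up noisy test signal.
t = np.linspace(0, 10, 50)
y_noisy = np.sin(t) + 0.3 * np.random.default_rng(0).standard_normal(50)
print(quadratic_smooth(y_noisy, lam=5.0)[:5])
```
Total and maximum variation smoothing replace the L_2 objective with L_1 / L_∞ and can be handled with the LP reductions shown earlier.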
Experiment. [Figure: the original signal, the signal with noise, the L_1 / L_2 / L_∞ reconstructed signals, and histograms of the L_1 / L_2 / L_∞ residuals.]
Other Smoothing Functions. Consider three successive data points instead of two: φ_T : Σ_{i=2}^{m−1} |2y_i − (y_{i−1} + y_{i+1})|. Modify the matrix Δ so that its rows become (−1, 2, −1, 0, ..., 0), (0, −1, 2, −1, 0, ..., 0), ..., (0, ..., 0, −1, 2, −1). Δ is called a Toeplitz matrix.
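Code sketch (not part of the original slides): building the second-difference Toeplitz matrix; the sign convention (rows (−1, 2, −1)) follows the reconstruction above and is an assumption.
```python
import numpy as np

def second_difference_matrix(n):
    """Toeplitz smoothing matrix with rows (-1, 2, -1):
    (D2 @ y)[i] = 2*y[i+1] - (y[i] + y[i+2])."""
    D2 = np.zeros((n - 2, n))
    for i in range(n - 2):
        D2[i, i:i + 3] = [-1.0, 2.0, -1.0]
    return D2

print(second_difference_matrix(6))
```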
Application-3: Regression
Linear Regression. Inputs: Data points x_i : (x_i1, ..., x_in) and outcomes y_i. Goal: Find a function that best fits the data: f(x) = c_1 x_1 + c_2 x_2 + ... + c_n x_n + c_0. [Figure: input vs. output plot for the given data.]
Linear Regression. Let X be the data matrix and y the corresponding outcome vector. Note: the i-th row X_i represents data point i, and y_i is the corresponding outcome for X_i. If c^T x + c_0 is the best line fit, the error for the i-th data point is ε_i : (X_i c) + c_0 − y_i. We can write the overall error as || [X 1] (c; c_0) − y ||_p.
Regression. Inputs: Data X, outcomes y. Decision variables: c_0, c_1, ..., c_n, the coefficients of the line. Solution: minimize || [X 1] (c; c_0) − y ||_p. Note: Instead of a norm, we could use any penalty function. Ordinary regression: min || [X 1] (c; c_0) − y ||_p^p. Tikhonov regression: min || [X 1] (c; c_0) − y ||_2^2 + λ ||c||_2^2.
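Code sketch (not part of the original slides): ordinary (L_2) regression on a made-up 1-D data set with outliers; the design matrix is the [X 1] construction from the slide.
```python
import numpy as np

# Made-up 1-D regression data with two outliers.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 1.2 * x + 0.5 + 0.3 * rng.standard_normal(40)
y[[5, 20]] += 8.0                          # outliers

# Design matrix [X 1]: feature column(s) plus a column of ones for c_0.
X1 = np.column_stack([x, np.ones_like(x)])

# Ordinary (L_2) regression via least squares.
c1, c0 = np.linalg.lstsq(X1, y, rcond=None)[0]
print("c_1, c_0 =", c1, c0)
```
For L_1 or L_∞ regression, apply the LP reductions from the earlier slides to the same pair ([X 1], y).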
Experiment. Create a noisy data set with outliers: 80 regular data points with low noise and 5 outlier data points with large noise. Goal: Compare the effect of L_1, L_2, L_∞ norm regressions.
Regression with Outliers. [Figure: regression with outliers, comparing the L_1, L_2, and L_∞ fits.]
Regression with Outliers (Error Histogram). [Figure: histograms of the L_1, L_2, and L_∞ norm residuals.]
Linear Regression with Non-Linear Models. Input: Data X, outcomes y, functions g_1, ..., g_k. Goal: Fit a model of the form y_i = c_0 + c_1 g_1(x_i) + c_2 g_2(x_i) + ... + c_k g_k(x_i). Solution: Transform the data matrix into X̂, whose i-th row is (g_1(X_i), ..., g_k(X_i)), and apply linear regression on (X̂, y).
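Code sketch (not part of the original slides): regression with non-linear basis functions; the choice g_1 = sin, g_2 = cos and the synthetic data are made up.
```python
import numpy as np

# Made-up data generated from a sinusoidal model plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 0.5 + 2.0 * np.sin(x) - 1.0 * np.cos(x) + 0.2 * rng.standard_normal(60)

# Transformed data matrix [1, g_1(x), g_2(x)], then ordinary linear regression.
X_hat = np.column_stack([np.ones_like(x), np.sin(x), np.cos(x)])
c, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
print("c_0, c_1, c_2 =", c)
```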
Experiment. Create a noisy non-linear data set with outliers: y = 0.2 sin(x) + 1.5 cos(x) + 2 cos(1.5x) + 0.2x − 3 + noise, with 80 regular data points with low noise and 5 outlier data points with large noise. [Figure: the data, the outliers, and the regression with the non-linear model.] Goal: Compare the effect of L_1, L_2, L_∞ norm regressions.
Experiment: Linear Regression, Non-Linear Model. [Figure: regression with outliers, comparing the L_1, L_2, and L_∞ fits against the data and the outliers.]
Experiment: Residuals. [Figure: histograms of the L_1, L_2, and L_∞ norm residuals.]
Classification. Problem: Given two classes of data points, x^+_1, ..., x^+_N and x^-_1, ..., x^-_M. Goal: Find a separating hyperplane between the classes. [Figure: the two classes x^+, x^- and a separating classifier line.]
Application-4: Classification
Linear Classification. Problem: Given two classes of data points, x^+_1, ..., x^+_N and x^-_1, ..., x^-_M. Goal: Find a separating hyperplane between the classes. Q-1: Will such a hyperplane always exist? A: Not always! Only if the data is linearly separable. Plan: 1. Assume that such a hyperplane exists. 2. Deal with cases where a hyperplane does not exist.
Linear Classification (cont). Solution: Use linear programming. Decision variables: w_0, w_1, w_2, ..., w_n, representing the classifier f_w(x) : w_0 + w_1 x_1 + ... + w_n x_n. Linear program: max ??? s.t. X^+ w̃ ≥ 0, X^- w̃ ≤ 0, where w̃ = (w_1, ..., w_n, w_0) and X^+ = [x^+_1 1; x^+_2 1; ...; x^+_N 1], X^- = [x^-_1 1; x^-_2 1; ...; x^-_M 1] (each row is a data point with an appended 1).
Classification Using Linear Programs. Objective: Let's try Σ_i w_i = 1^T w. Note: We write ⟨a, b⟩ for the dot product a^T b. Simplified data representation: Use labels +1, −1 for the two classes. Data: (x_1, y_1), ..., (x_N, y_N) ≡ (X, y), where y_i = +1 for class I and y_i = −1 for class II data. Linear program: min ⟨1, w⟩ + w_0 s.t. y_i (⟨x_i, w⟩ + w_0) ≥ 0 for all i. Verify that this formulation works.
Experiment: Classification. Experiment: Find a classifier for the following points. [Figure: data for classification.]
Experiment: Classification. Experiment: Outcome. [Figure: several classifiers separating the data.] Note: Many different classifiers are possible. Note-2: What is the deal with points that lie on the classifier line?
Dual, Complementary Slackness. Linear program: min ⟨1, w⟩ + w_0 s.t. y_i (⟨x_i, w⟩ + w_0) ≥ 0. Complementary slackness: Let α_1, ..., α_N be the dual variables; then α_i · y_i (⟨x_i, w⟩ + w_0) = 0 and α_i ≥ 0. Claim: There are at least n+1 points x_{j_1}, ..., x_{j_{n+1}} s.t. ⟨x_{j_i}, w⟩ + w_0 = 0. Note: n+1 points fully determine the hyperplane (w; w_0).
Classifying Points. Problem: Given a new point x, find which class it belongs to. Solution-1: Just compute the sign of ⟨w, x⟩ + w_0; we know (w; w_0) from our LP solution. Solution-2: More complicated. Express x as a linear combination of the support vectors: x = λ_1 x_{j_1} + λ_2 x_{j_2} + ... + λ_{n+1} x_{j_{n+1}}. Then ⟨w, x⟩ + w_0 = Σ_i λ_i (⟨w, x_{j_i}⟩ + w_0) + (1 − Σ_i λ_i) w_0 = (1 − Σ_i λ_i) w_0, since each ⟨w, x_{j_i}⟩ + w_0 = 0. This is impractical for linear programming formulations.
Non-Separable Data. Q: What if the data is not linearly separable? The linear program will be primal infeasible. In practice, linear separation may hold without noise but not with noise. Solution: Relax the conditions on the half-space w. [Figure: two misclassified points with slacks ξ_1, ξ_2.]
Non-Separable Data: Linear Programming. Classifier: (old) w_0 + ⟨w, x⟩; (new) ⟨w, x⟩ + 1. Note: the classifier Σ_i w_i x_i + w_0 is equivalent to Σ_i (w_i / w_0) x_i + 1 if w_0 > 0. LP formulation: minimize ||ξ^+||_1 + ||ξ^-||_1 s.t. ⟨x^+_i, w⟩ + 1 ≥ −ξ^+_i, ⟨x^-_j, w⟩ + 1 ≤ ξ^-_j, ξ^+_i, ξ^-_j ≥ 0. The variables ξ encode tolerance; we minimize them as part of the objective. Exercise: Verify that the support vector interpretation of the dual works.
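Code sketch (not part of the original slides): the relaxed classification LP with slack variables, using the "new" classifier f(x) = ⟨w, x⟩ + 1; the helper name soft_margin_lp and the synthetic clusters are made up.
```python
import numpy as np
from scipy.optimize import linprog

def soft_margin_lp(X_pos, X_neg):
    """Minimize the sum of slacks xi subject to
    <x+_i, w> + 1 >= -xi+_i  and  <x-_j, w> + 1 <= xi-_j, with xi >= 0.
    Variables z = [w, xi_pos, xi_neg]."""
    N, n = X_pos.shape
    M, _ = X_neg.shape
    c = np.concatenate([np.zeros(n), np.ones(N + M)])
    top = np.block([[-X_pos, -np.eye(N), np.zeros((N, M))]])  # -X+ w - xi+ <= 1
    bot = np.block([[X_neg, np.zeros((M, N)), -np.eye(M)]])   #  X- w - xi- <= -1
    A_ub = np.vstack([top, bot])
    b_ub = np.concatenate([np.ones(N), -np.ones(M)])
    bounds = [(None, None)] * n + [(0, None)] * (N + M)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

# Made-up 2-D clusters, one per class.
rng = np.random.default_rng(2)
X_pos = rng.normal([2.0, 2.0], 0.8, size=(30, 2))
X_neg = rng.normal([-2.0, -2.0], 0.8, size=(30, 2))
w = soft_margin_lp(X_pos, X_neg)
print("w =", w, "; classify a new x by the sign of <w, x> + 1")
```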
Non-Linear Model Classification. Linear classifier: ⟨w, x⟩ + 1. With non-linear functions φ_i : R^n → R, the classifier becomes ⟨w, (φ_1(x), φ_2(x), ..., φ_m(x))⟩ + 1. This is exactly the same idea as linear regression with non-linear functions.