Learning From Data Lecture 8 Linear Classification and Regression


1 Learning From Data Lecture 8: Linear Classification and Regression. Outline: Linear Classification; Linear Regression. M. Magdon-Ismail, CSCI 4100/6100

2 recap: Approximation Versus Generalization
VC Analysis: $E_{out} \le E_{in} + \Omega(d_{vc})$. 1. Did you fit your data well enough ($E_{in}$)? 2. Are you confident your $E_{in}$ will generalize to $E_{out}$?
Bias-Variance Analysis: $E_{out} = \text{bias} + \text{var}$. 1. How well can you fit your data (bias)? 2. How close to that best fit can you get (var)?
[Figures: in-sample and out-of-sample error versus model complexity (VC dimension $d_{vc}$); fitting $y = \sin(x)$ with the average hypothesis $\bar{g}(x)$.]
The VC Insurance Co.: the VC bound is like a warranty. If you look at your data before choosing $\mathcal{H}$, your warranty is void.
Fitting $\sin(x)$: $\mathcal{H}_0$ gives bias = 0.50, var = 0.25, $E_{out}$ = 0.75; $\mathcal{H}_1$ gives bias = 0.21, var = 1.69, $E_{out}$ = 1.90.

3 recap: Decomposing The Learning Curve
VC Analysis: expected error decomposes into in-sample error $E_{in}$ plus generalization error ($E_{out} - E_{in}$). Pick $\mathcal{H}$ that can generalize and has a good chance to fit the data.
Bias-Variance Analysis: expected error decomposes into bias plus variance. Pick ($\mathcal{H}$, $\mathcal{A}$) to approximate $f$ and not behave wildly after seeing the data.
[Figures: the two decompositions of the learning curve, expected error versus number of data points $N$.]

4 Three Learning Problems (Credit Analysis)
Approve or Deny: Classification, $y \in \{-1,+1\}$.
Amount of Credit: Regression, $y \in \mathbb{R}$.
Probability of Default: Logistic Regression, $y \in [0,1]$.
Linear models are perhaps the most fundamental models; the linear model is the first model to try.

5 The Linear Signal
$s = w^t x$
Linear in $w$: makes the algorithms work.
Linear in $x$: gives the line/hyperplane separator.
$x$ is the augmented vector: $x \in \{1\} \times \mathbb{R}^d$.
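A minimal sketch of the augmented vector and the signal in numpy (the numbers are made up for illustration):

```python
import numpy as np

# Augment the raw input with the constant coordinate x0 = 1,
# so the threshold/bias w0 lives inside the weight vector.
def augment(x_raw):
    return np.concatenate(([1.0], x_raw))

w = np.array([-0.5, 2.0, 1.0])        # (w0, w1, w2) for a d = 2 input
x = augment(np.array([0.3, 0.2]))     # x in {1} x R^d

s = w @ x                             # the linear signal s = w^t x
print(s)                              # 0.3, a single real number
```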

6 The Linear Signal
The same signal $s = w^t x$ drives all three models:
Classification: $\text{sign}(w^t x) \in \{-1,+1\}$.
Regression: $w^t x \in \mathbb{R}$.
Logistic regression: $y = \theta(s) = \theta(w^t x) \in [0,1]$.

7 Linear Classification
$\mathcal{H}_{lin} = \{\, h(x) = \text{sign}(w^t x) \,\}$
1. $E_{in} \approx E_{out}$ because $d_{vc} = d+1$: $E_{out}(h) \le E_{in}(h) + O\!\left(\sqrt{\tfrac{d}{N}\log N}\right)$.
2. If the data is linearly separable, PLA will find a separator, so $E_{in} = 0$. Then $E_{in} = 0 \Rightarrow E_{out} \approx 0$ ($f$ is well approximated by a linear fit).
PLA update: $w(t+1) = w(t) + y\,x$ for a misclassified data point $(x, y)$.
What if the data is not separable ($E_{in} = 0$ is not possible)? The pocket algorithm.
How to ensure $E_{in} \approx 0$ is possible? Select good features.
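The PLA update rule above translates almost line for line into code. A minimal sketch, my own and not from the lecture, assuming X is the N x (d+1) augmented data matrix and y the vector of +/-1 labels:

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm. X: N x (d+1) augmented inputs; y: labels in {-1,+1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:    # found a separator: E_in = 0
            break
        n = misclassified[0]           # any misclassified point will do
        w = w + y[n] * X[n]            # the update: w(t+1) = w(t) + y x
    return w
```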

8 Non-Separable Data
[Figures: datasets that no line can separate perfectly.]

9 The Pocket Algorithm
Minimizing $E_{in}$ is a hard combinatorial problem.
The Pocket Algorithm: run PLA; at each step, keep the best $E_{in}$ (and its $w$) so far.
(It's not rocket science, but it works.)
(Other approaches: linear regression, logistic regression, linear programming, ...)
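A sketch of the pocket idea (again my own construction, not the lecture's code): run the PLA update as usual, but evaluate E_in after every step and return the best weights seen rather than the last ones.

```python
import numpy as np

def pocket(X, y, max_iters=1000):
    """Pocket algorithm: PLA updates, returning the weights with the lowest E_in seen."""
    def e_in(w):
        return np.mean(np.sign(X @ w) != y)
    w = np.zeros(X.shape[1])
    best_w, best_e = w.copy(), e_in(w)
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w                                      # data was separable after all
        w = w + y[misclassified[0]] * X[misclassified[0]]  # ordinary PLA step
        if e_in(w) < best_e:                               # pocket the best (E_in, w) so far
            best_w, best_e = w.copy(), e_in(w)
    return best_w
```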

10 Digits Data
Each digit is a 16 x 16 image.

11 Digits Data
Each digit is a 16 x 16 image, so the raw input is 256-dimensional:
$x = (1, x_1, \ldots, x_{256})$ (input), $w = (w_0, w_1, \ldots, w_{256})$ (linear model), so $d_{vc} = 257$.

12 Intensity and Symmetry Features
Feature: an important property of the input that you think is useful for classification.
(dictionary.com: "a prominent or conspicuous part or characteristic")
$x = (1, x_1, x_2)$ (input), $w = (w_0, w_1, w_2)$ (linear model), so $d_{vc} = 3$.
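The slides do not spell out how the two features are computed; one plausible choice, assumed here, is the average pixel intensity and a symmetry score based on the left-right flip of the 16 x 16 image:

```python
import numpy as np

def extract_features(img):
    """img: 16x16 array of pixel intensities. Returns the augmented vector (1, x1, x2)."""
    intensity = img.mean()                            # x1: average intensity
    symmetry = -np.abs(img - np.fliplr(img)).mean()   # x2: 0 iff perfectly left-right symmetric
    return np.array([1.0, intensity, symmetry])
```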

13 PLA on Digits Data
[Figure: $E_{in}$ and $E_{out}$ (error on a log scale, 1% to 50%) versus iteration number $t$ for PLA.]

14 Pocket on Digits Data
[Figures: $E_{in}$ and $E_{out}$ (error on a log scale, 1% to 50%) versus iteration number $t$, for PLA and for the pocket algorithm.]

15 Linear Regression
Credit analysis input: age 32 years; gender male; salary 40,000; debt 26,000; years in job 1 year; years at home 3 years.
Classification: Approve/Deny. Regression: Credit Line (a dollar amount), so $y \in \mathbb{R}$.
$h(x) = \sum_{i=0}^{d} w_i x_i = w^t x$

16 Least Squares Linear Regression
Noisy target: $y = f(x) + \epsilon$, i.e. $P(y \mid x)$. Hypothesis: $h(x) = w^t x$.
In-sample error: $E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(x_n) - y_n)^2$.
Out-of-sample error: $E_{out}(h) = \mathbb{E}_x[(h(x) - y)^2]$.
[Figures: the least-squares fit as a line in one dimension and a plane in two dimensions.]

17 Using Matrices for Linear Regression
$X = \begin{bmatrix} x_1^t \\ x_2^t \\ \vdots \\ x_N^t \end{bmatrix}$ (data matrix, $N \times (d+1)$), $\quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$ (target vector), $\quad \hat{y} = \begin{bmatrix} w^t x_1 \\ w^t x_2 \\ \vdots \\ w^t x_N \end{bmatrix} = Xw$ (in-sample predictions).
$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2 = \frac{1}{N} \|\hat{y} - y\|_2^2 = \frac{1}{N} \|Xw - y\|_2^2 = \frac{1}{N} (w^t X^t X w - 2 w^t X^t y + y^t y)$
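In code the matrix form of E_in is essentially one line. A sketch, assuming X and y are built as above:

```python
import numpy as np

def e_in(w, X, y):
    """E_in(w) = (1/N) * ||Xw - y||^2."""
    r = X @ w - y                 # vector of residuals y_hat - y
    return (r @ r) / len(y)
```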

18 Linear Regression Solution
$E_{in}(w) = \frac{1}{N} (w^t X^t X w - 2 w^t X^t y + y^t y)$
Vector calculus: to minimize $E_{in}(w)$, set $\nabla_w E_{in}(w) = 0$, using $\nabla_w (w^t A w) = (A + A^t) w$ and $\nabla_w (w^t b) = b$.
With $A = X^t X$ and $b = X^t y$: $\nabla_w E_{in}(w) = \frac{2}{N} (X^t X w - X^t y)$.
Setting $\nabla_w E_{in}(w) = 0$ gives $X^t X w = X^t y$ (the normal equations), so $w_{lin} = (X^t X)^{-1} X^t y$ when $X^t X$ is invertible.

19 Linear Regression Algorithm
1. Construct the data matrix $X$ and the target vector $y$ from the data set $(x_1, y_1), \ldots, (x_N, y_N)$, where each $x$ includes the $x_0 = 1$ coordinate.
2. Compute the pseudo-inverse $X^{\dagger}$ of the matrix $X$. If $X^t X$ is invertible, $X^{\dagger} = (X^t X)^{-1} X^t$.
3. Return $w_{lin} = X^{\dagger} y$.
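A direct numpy sketch of the three steps (variable names are mine; np.linalg.pinv also handles the case where X^t X is not invertible):

```python
import numpy as np

def linear_regression(x_raw, y):
    """x_raw: N x d raw inputs; y: N real targets. Returns w_lin."""
    N = len(y)
    X = np.hstack([np.ones((N, 1)), x_raw])   # step 1: add the x0 = 1 coordinate
    w_lin = np.linalg.pinv(X) @ y             # steps 2 and 3: w_lin = pinv(X) y
    return w_lin
```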

20 Generalization
The linear regression algorithm gets the smallest possible $E_{in}$ in one step. Generalization is also good: one can obtain a regression version of $d_{vc}$, and there are other bounds, for example $\mathbb{E}[E_{out}(h)] = \mathbb{E}[E_{in}(h)] + O\!\left(\frac{d}{N}\right)$.
[Figure: learning curve for linear regression; with noise variance $\sigma^2$, the expected $E_{in}$ and $E_{out}$ both approach $\sigma^2$ as the number of data points $N$ grows past $d+1$.]
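A small simulation of this learning curve, entirely my own construction: fit a noisy linear target for growing N and watch the expected E_in approach sigma^2 from below and E_out approach it from above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 0.5                              # sigma^2 = 0.25 is the noise floor
w_true = rng.standard_normal(d + 1)

def noisy_data(N):
    X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
    return X, X @ w_true + sigma * rng.standard_normal(N)

for N in [10, 20, 50, 100, 1000]:
    X, y = noisy_data(N)
    w = np.linalg.pinv(X) @ y                  # one-step least-squares fit
    X_test, y_test = noisy_data(10000)         # fresh data to estimate E_out
    e_in = np.mean((X @ w - y) ** 2)
    e_out = np.mean((X_test @ w - y_test) ** 2)
    print(N, round(e_in, 3), round(e_out, 3))  # both tend to sigma^2 as N grows
```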

21 Linear Regression for Classification
Linear regression can learn any real-valued target function, and $\pm 1$ are real values! So for $y_n = \pm 1$:
Use linear regression to get $w$ with $w^t x_n \approx y_n = \pm 1$.
Then $\text{sign}(w^t x_n)$ will likely agree with $y_n$.
These can also be good initial weights for a classification algorithm.
Example: classifying "1" versus "not 1" (multiclass reduced to 2-class).
[Figure: the digits data plotted by average intensity and symmetry.]
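A sketch of the trick in numpy (my own construction): regress on the +/-1 labels as if they were real targets, classify with the sign of the output, and keep w as a warm start for pocket/PLA.

```python
import numpy as np

def regression_for_classification(X, y):
    """X: N x (d+1) augmented inputs; y: +/-1 labels treated as real values."""
    w = np.linalg.pinv(X) @ y                # one-step least-squares fit to the labels
    e_in = np.mean(np.sign(X @ w) != y)      # classification error of sign(w^t x)
    return w, e_in                           # w is also a good initial point for pocket/PLA
```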
