Regression and Classification with Linear Models
CMPSCI 383, Nov 15, 2011
Today's topics
- Learning from Examples: brief review
- Univariate Linear Regression
  - Batch gradient descent
  - Stochastic gradient descent
- Multivariate Linear Regression
- Regularization
- Linear Classifiers
  - Perceptron learning rule
- Logistic Regression
Learning from Examples (supervised learning)
[figure slides]
Important issues
- Generalization
- Overfitting
- Cross-validation
  - Holdout cross-validation
  - K-fold cross-validation
  - Leave-one-out cross-validation
- Model selection
Recall Notation
Training set: $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, where each $y_j$ was generated by an unknown function $y = f(x)$.
Goal: discover a hypothesis $h$ that best approximates the true function $f$.
Loss Functions
Suppose the true prediction for input $x$ is $f(x) = y$, but the hypothesis gives $h(x) = \hat{y}$.
$L(x, y, \hat{y}) = \mathrm{Utility}(\text{result of using } y \text{ given input } x) - \mathrm{Utility}(\text{result of using } \hat{y} \text{ given input } x)$
Simplified version: $L(y, \hat{y})$
- Absolute-value loss: $L_1(y, \hat{y}) = |y - \hat{y}|$
- Squared-error loss: $L_2(y, \hat{y}) = (y - \hat{y})^2$
- 0/1 loss: $L_{0/1}(y, \hat{y}) = 0$ if $y = \hat{y}$, else 1
Generalization loss: expected loss over all possible examples.
Empirical loss: average loss over the available examples.
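A minimal sketch of these three loss functions in Python; the function names are mine, not from the lecture:

```python
def absolute_value_loss(y, y_hat):
    """L1 loss: |y - y_hat|."""
    return abs(y - y_hat)

def squared_error_loss(y, y_hat):
    """L2 loss: (y - y_hat)^2."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """0/1 loss: 0 if the prediction is exactly right, else 1."""
    return 0 if y == y_hat else 1
```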
Univariate Linear Regression
[figure slide]
Univariate Linear Regression contd.
Weight vector: $\mathbf{w} = [w_0, w_1]$
$h_w(x) = w_1 x + w_0$
Find the weight vector that minimizes the empirical loss, e.g., $L_2$:
$\mathrm{Loss}(h_w) = \sum_{j=1}^{N} L_2(y_j, h_w(x_j)) = \sum_{j=1}^{N} (y_j - h_w(x_j))^2 = \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2$
i.e., find $w^*$ such that $w^* = \operatorname{argmin}_w \mathrm{Loss}(h_w)$
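A small sketch of this empirical $L_2$ loss for a univariate hypothesis; the function name and sample data are illustrative:

```python
def empirical_l2_loss(w0, w1, xs, ys):
    """Sum of squared errors of h_w(x) = w1*x + w0 over the training set."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys))

# Loss of the hypothesis h(x) = 2x + 1 on three made-up points.
print(empirical_l2_loss(1.0, 2.0, [0, 1, 2], [1.1, 2.9, 5.2]))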
Weight Space
[figure slide]
Finding w*
Find weights such that:
$\frac{\partial}{\partial w_0} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = 0$ and $\frac{\partial}{\partial w_1} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = 0$
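Solving these two equations yields the standard closed-form least-squares weights; a sketch of that solution (fit_univariate is an illustrative name):

```python
def fit_univariate(xs, ys):
    """Closed-form least-squares weights for h_w(x) = w1*x + w0."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Setting both partial derivatives to zero and solving:
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    w0 = (sy - w1 * sx) / n
    return w0, w1
```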
Gradient Descent
$w_i \leftarrow w_i - \alpha \frac{\partial}{\partial w_i} \mathrm{Loss}(\mathbf{w})$
where $\alpha$ is the step size, or learning rate.
Gradient Descent contd.
For one training example $(x, y)$:
$w_0 \leftarrow w_0 + \alpha (y - h_w(x))$ and $w_1 \leftarrow w_1 + \alpha (y - h_w(x))\, x$
For $N$ training examples (batch gradient descent):
$w_0 \leftarrow w_0 + \alpha \sum_j (y_j - h_w(x_j))$ and $w_1 \leftarrow w_1 + \alpha \sum_j (y_j - h_w(x_j))\, x_j$
Stochastic gradient descent: take a step for one training example at a time, as in the sketch below.
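A sketch contrasting the two update styles for the univariate case; the helper names and the random choice of example in the stochastic step are my own conventions:

```python
import random

def batch_gd_step(w0, w1, xs, ys, alpha):
    """One batch update: sum the gradient contribution of every example."""
    errors = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
    w0 += alpha * sum(errors)
    w1 += alpha * sum(e * x for e, x in zip(errors, xs))
    return w0, w1

def sgd_step(w0, w1, xs, ys, alpha):
    """One stochastic update: step on a single randomly chosen example."""
    j = random.randrange(len(xs))
    e = ys[j] - (w1 * xs[j] + w0)
    return w0 + alpha * e, w1 + alpha * e * xs[j]
```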
The Multivariate Case
$h_w(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + \cdots + w_n x_{j,n} = w_0 + \sum_i w_i x_{j,i}$
Augmented vectors: add a feature to each $\mathbf{x}$ by tacking on a 1: $x_{j,0} = 1$
Then: $h_w(\mathbf{x}_j) = \mathbf{w} \cdot \mathbf{x}_j = \mathbf{w}^T \mathbf{x}_j = \sum_i w_i x_{j,i}$
And the batch gradient descent update becomes:
$w_i \leftarrow w_i + \alpha \sum_j (y_j - h_w(\mathbf{x}_j))\, x_{j,i}$
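With the augmented column of 1s, the batch update vectorizes cleanly; a NumPy sketch (the alpha and epochs defaults are illustrative, not from the slides):

```python
import numpy as np

def batch_gd_multivariate(X, y, alpha=0.01, epochs=1000):
    """Batch gradient descent on augmented data (first column of X is all 1s)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = y - X @ w           # y_j - h_w(x_j), one entry per example
        w += alpha * (X.T @ errors)  # w_i += alpha * sum_j errors_j * x_{j,i}
    return w
```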
The Multivariate Case contd.
Or, solving analytically:
Let $\mathbf{y}$ be the vector of outputs for the training examples, and $\mathbf{X}$ the data matrix, where each row is an input vector.
Solving $\mathbf{y} = \mathbf{X}\mathbf{w}$ for $\mathbf{w}^*$:
$\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
where $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ is the pseudo-inverse.
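The analytic solution is one line in NumPy; np.linalg.pinv computes the pseudo-inverse, which is numerically preferable to forming the inverse of $\mathbf{X}^T \mathbf{X}$ explicitly:

```python
import numpy as np

def fit_analytic(X, y):
    """Least-squares weights: w* = (X^T X)^{-1} X^T y, via the pseudo-inverse."""
    return np.linalg.pinv(X) @ y
```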
Regularization
$\mathrm{Cost}(h) = \mathrm{EmpLoss}(h) + \lambda\, \mathrm{Complexity}(h)$
$\mathrm{Complexity}(h_w) = L_q(\mathbf{w}) = \sum_i |w_i|^q$
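For $q = 2$ the regularized cost still has a closed-form minimizer, $\mathbf{w}^* = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$ (ridge regression). A sketch of that standard result; the slides give only the cost function, the lam default is illustrative, and in practice $w_0$ is often left unpenalized, which this sketch skips for brevity:

```python
import numpy as np

def fit_ridge(X, y, lam=0.1):
    """L2-regularized least squares: w* = (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
```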
L1 vs. L2 Regularization
[figure slide]
Linear Classification: hard thresholds
[figure slide]
Linear Classification: hard thresholds contd.
Decision boundary: in the linear case, a linear separator, i.e., a hyperplane.
Linearly separable: data is linearly separable if the classes can be separated by a linear separator.
Classification hypothesis:
$h_w(\mathbf{x}) = \mathrm{Threshold}(\mathbf{w} \cdot \mathbf{x})$, where $\mathrm{Threshold}(z) = 1$ if $z \geq 0$ and 0 otherwise.
Perceptron Learning Rule
For a single example $(\mathbf{x}, y)$:
$w_i \leftarrow w_i + \alpha (y - h_w(\mathbf{x}))\, x_i$
- If the output is correct, i.e., $y = h_w(\mathbf{x})$, the weights don't change.
- If $y = 1$ but $h_w(\mathbf{x}) = 0$, then $w_i$ is increased when $x_i$ is positive and decreased when $x_i$ is negative.
- If $y = 0$ but $h_w(\mathbf{x}) = 1$, then $w_i$ is decreased when $x_i$ is positive and increased when $x_i$ is negative.
Perceptron Convergence Theorem: for any data set that's linearly separable and any training procedure that continues to present each training example, the learning rule is guaranteed to find a solution in a finite number of steps.
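A sketch of the rule run repeatedly over the training set; max_epochs is a safety cap I added, since the convergence theorem guarantees termination only for linearly separable data:

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_epochs=100):
    """Perceptron learning on augmented data (first column of X is all 1s).

    Labels in y are 0 or 1, matching the hard-threshold hypothesis above.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xj, yj in zip(X, y):
            pred = 1 if w @ xj >= 0 else 0      # Threshold(w . x)
            if pred != yj:
                w += alpha * (yj - pred) * xj   # w_i += alpha*(y - h_w(x))*x_i
                converged = False
        if converged:
            break
    return w
```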
Perceptron Performance
[figure slide]
Linear Classification with Logistic Regression
An important function
[figure: the logistic function]
Logistic Regression
$h_w(\mathbf{x}) = \mathrm{Logistic}(\mathbf{w} \cdot \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$
For a single example $(\mathbf{x}, y)$ and the $L_2$ loss function:
$w_i \leftarrow w_i + \alpha (y - h_w(\mathbf{x}))\, h_w(\mathbf{x})(1 - h_w(\mathbf{x}))\, x_i$
where $h_w(\mathbf{x})(1 - h_w(\mathbf{x}))$ is the derivative of the logistic function.
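A sketch of this single-example update; the function names and the alpha default are illustrative:

```python
import numpy as np

def logistic(z):
    """Logistic(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd_step(w, xj, yj, alpha=0.1):
    """One stochastic update for logistic regression under L2 loss."""
    h = logistic(w @ xj)
    # The factor h*(1 - h) is the derivative of the logistic function.
    return w + alpha * (yj - h) * h * (1 - h) * xj
```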
Logistic Regression Performance
[figure: the separable case]
Summary
- Learning from Examples: brief review
  - Loss functions
  - Generalization
  - Overfitting
  - Cross-validation
  - Regularization
- Univariate Linear Regression
  - Batch gradient descent
  - Stochastic gradient descent
- Multivariate Linear Regression
- Regularization
- Linear Classifiers
  - Perceptron learning rule
- Logistic Regression
Next Class
Artificial Neural Networks, Nonparametric Models, & Support Vector Machines
Secs. 18.7–18.9