CSE 417T: Introduction to Machine Learning
Lecture 11: Review
Henry Chai, 10/02/18
Formal Setup

(Learning diagram, built up incrementally across three slides.) An unknown target function $f: \mathcal{X} \to \mathcal{Y}$, later generalized to an unknown target distribution $P(y \mid x)$ ($f$ plus noise), together with a probability distribution $P$ on $\mathcal{X}$ and an error measure, generates the training data $D = (x_1, y_1), \ldots, (x_N, y_N)$. The learning algorithm $A$ selects from the hypothesis set $H$ a learned hypothesis $g$.
Hoeffding's Inequality (Validation)

For a given hypothesis $h$:
- $E_{\text{in}}(h)$ = in-sample error of $h$
- $E_{\text{out}}(h)$ = out-of-sample error of $h$

$$P\left[\left|E_{\text{in}}(h) - E_{\text{out}}(h)\right| > \epsilon\right] \le 2e^{-2\epsilon^2 N}$$

- As $N$ increases, the RHS decreases.
- As $\epsilon$ decreases, the RHS increases.
Hoeffding's Inequality (Corrected)

For a given finite hypothesis set $H = \{h_1, \ldots, h_M\}$:
- $E_{\text{in}}(g)$ = in-sample error of the best hypothesis in $H$
- $E_{\text{out}}(g)$ = out-of-sample error of the best hypothesis in $H$

$$P\left[\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2M e^{-2\epsilon^2 N}$$

- As $N$ increases, the RHS decreases.
- As $\epsilon$ decreases, the RHS increases.
- As $M$ increases, the RHS increases.
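To make the dependence on $N$, $\epsilon$, and $M$ concrete, here is a minimal sketch that evaluates the right-hand side of the bound (assuming NumPy; the function name `hoeffding_bound` is illustrative):

```python
import numpy as np

def hoeffding_bound(N, eps, M=1):
    """Upper bound on P[|E_in - E_out| > eps] for M hypotheses."""
    return 2 * M * np.exp(-2 * eps**2 * N)

# The bound shrinks with more data and loosens with more hypotheses.
print(hoeffding_bound(N=1000, eps=0.05))         # single hypothesis
print(hoeffding_bound(N=1000, eps=0.05, M=100))  # finite hypothesis set
```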
! " $ =! " +! $!{" $} The Union Bound is bad A B 7
Dichotomies

Given some finite sample of points $x_1, \ldots, x_N$ from the input space and a single hypothesis $h \in H$, applying $h$ to each point in $x_1, \ldots, x_N$ results in a dichotomy: $(h(x_1), \ldots, h(x_N))$ is a vector of $N$ $+1$'s and $-1$'s.

Given $x_1, \ldots, x_N$, each hypothesis in $H$ generates a dichotomy, but not necessarily a unique dichotomy!

The set of dichotomies induced by $H$ on $x_1, \ldots, x_N$ is
$$H(x_1, \ldots, x_N) = \{\,(h(x_1), \ldots, h(x_N)) : h \in H\,\}$$
Growth Function

The growth function of $H$ is the largest number of dichotomies $H$ can induce across all data sets of size $N$:
$$m_H(N) = \max_{x_1, \ldots, x_N \in \mathcal{X}} \left|H(x_1, \ldots, x_N)\right|$$
Growth Function (Shattering)

Observe that $m_H(N) \le 2^N$ for all $H$ and $N$.

Given $H$, if $\exists\, x_1, \ldots, x_N$ s.t. $|H(x_1, \ldots, x_N)| = 2^N$, then $H$ shatters $x_1, \ldots, x_N$.

If $\exists\, x_1, \ldots, x_N$ that is shattered by $H$, then $m_H(N) = 2^N$.
Growth Function (Break Points)

If $m_H(k) < 2^k$, then $k$ is a break point for $H$.

If there is at least one break point for $H$, then $m_H(N)$ is polynomial in $N$.
Growth Function: Example

$\mathcal{X} = \mathbb{R}$ and $H$ = positive rays: $h(x) = \text{sign}(x - a)$

(Figure: points $x_1, x_2, \ldots, x_N$ on the real line; the dichotomy is determined by which of the $N + 1$ gaps the threshold $a$ falls in.)

Growth function: $m_H(N) = N + 1$
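As a quick sanity check, a sketch that brute-forces the dichotomies positive rays induce on $N$ sample points by sweeping the threshold $a$ (NumPy assumed; names are illustrative):

```python
import numpy as np

def ray_dichotomies(xs):
    """Enumerate the distinct dichotomies positive rays induce on xs."""
    xs = np.sort(xs)
    # Candidate thresholds: below all points, between each pair, above all.
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    return {tuple(np.sign(xs - a).astype(int)) for a in cuts}

xs = np.random.rand(5)
print(len(ray_dichotomies(xs)))  # N + 1 = 6
```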
Growth Function: Example

$\mathcal{X} = \mathbb{R}$ and $H$ = positive intervals

(Figure: points $x_1, \ldots, x_N$ on the real line; a dichotomy is determined by which of the $N + 1$ gaps each of the two interval endpoints falls in.)

Growth function: $m_H(N) = \binom{N+1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
Growth Function: Example

$\mathcal{X} = \mathbb{R}^2$ and $H$ = convex sets

(Figure: $N$ points placed on a circle; any subset of them can be captured by a convex region, so every dichotomy is achievable.)

Growth function: $m_H(N) = 2^N$
! " #$ % " '() % > + 4. H 21 2 34 5 67 $ Vapornik- Chervonenkis (VC)-Bound Or " '() % " #$ % + 5 $ log <= H >$? with probability at least 1 A 15
! "# H = the largest value of & s.t. ' H & = 2 ) The VC-dimension is the greatest number of points that can be shattered by H VC-Dimension If * is the smallest breakpoint for H, then! "# H = * 1 ' H & & / 01 + 1 3 456 7 3 8) 7 + 9! "# :;< ) ) 16
Sample Complexity

How many samples do we need in our training data to say that the generalization error is at most $\epsilon$ with probability at least $1 - \delta$?

Set
$$\epsilon \ge \sqrt{\frac{8}{N} \ln \frac{4\left((2N)^{d_{\text{VC}}} + 1\right)}{\delta}}$$

Conclude that we need
$$N \ge \frac{8}{\epsilon^2} \ln \frac{4\left((2N)^{d_{\text{VC}}} + 1\right)}{\delta}$$

- As $\delta$ decreases, the RHS increases.
- As $\epsilon$ decreases, the RHS increases.
- As $d_{\text{VC}}$ decreases, the RHS decreases.
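Because $N$ appears on both sides of the inequality, the sample complexity is typically found by iterating to a fixed point. A sketch of that approach (NumPy assumed; the starting guess and function name are illustrative):

```python
import numpy as np

def sample_complexity(eps, delta, dvc, iters=50):
    """Iterate N = (8/eps^2) * ln(4*((2N)^dvc + 1)/delta) to a fixed point."""
    N = 1000.0  # arbitrary starting guess
    for _ in range(iters):
        N = (8 / eps**2) * np.log(4 * ((2 * N)**dvc + 1) / delta)
    return int(np.ceil(N))

print(sample_complexity(eps=0.1, delta=0.1, dvc=3))  # roughly 30,000
```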
Penalty for Model Complexity

Given $N$ samples, how good can we say our learned hypothesis will do with confidence at least $1 - \delta$?

Conclude that
$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{8}{N} \ln \frac{4\left((2N)^{d_{\text{VC}}} + 1\right)}{\delta}}$$
Approximation-Generalization Tradeoff

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{8}{N} \ln \frac{4\, m_H(2N)}{\delta}}$$

- $E_{\text{in}}(g)$ measures how well $g$ approximates $f$; it decreases as $d_{\text{VC}}$ increases.
- The penalty term measures how well $g$ generalizes; it increases as $d_{\text{VC}}$ increases.
Bias-Variance Tradeoff

$$\mathbb{E}_D\left[E_{\text{out}}\left(g^{(D)}\right)\right] = \mathbb{E}_x\left[\underbrace{\mathbb{E}_D\left[\left(g^{(D)}(x) - \bar{g}(x)\right)^2\right]}_{\text{variance}} + \underbrace{\left(\bar{g}(x) - f(x)\right)^2}_{\text{bias}}\right]$$

- Variance: how variable is $g$? Increases as $H$ becomes more complex.
- Bias: how well, on average, does $g$ approximate $f$? Decreases as $H$ becomes more complex.
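The decomposition can be estimated by simulation. A sketch in the spirit of the textbook's experiment, fitting a constant hypothesis to two-point samples of $f(x) = \sin(\pi x)$ (NumPy assumed; the target and model are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)

# Learn a constant hypothesis h(x) = b from each 2-point dataset D.
n_datasets, n_test = 10000, 1000
x_test = rng.uniform(-1, 1, n_test)

g_D = np.empty((n_datasets, n_test))
for d in range(n_datasets):
    x = rng.uniform(-1, 1, 2)
    b = f(x).mean()            # least-squares constant fit to D
    g_D[d] = b                 # g^{(D)}(x) is constant in x

g_bar = g_D.mean(axis=0)       # average hypothesis over datasets
bias = np.mean((g_bar - f(x_test))**2)
var = np.mean((g_D - g_bar)**2)
print(bias, var)               # roughly 0.5 and 0.25 for this setup
```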
! "#$! "#$ Expected error! %& Expected error! %& Number of training points, ' Number of training points, ' Simple model Complex model 23
Expected error! "#$ Generalization error In-sample error! %& Expected error Variance Bias! "#$! %& Number of training points, ' Number of training points, ' VC analysis Bias-Variance analysis 24
Test Sets

Instead of bounding $E_{\text{out}}(g)$ using $E_{\text{in}}(g)$, estimate $E_{\text{out}}(g)$ using the error on some test dataset $D_{\text{test}}$: $E_{\text{test}}(g)$.

If $D_{\text{test}}$ is not involved in the training process, then we are validating $g$ using $D_{\text{test}}$. Therefore, Hoeffding's bound applies with $M = |H| = 1$:
$$P\left[\left|E_{\text{test}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2e^{-2\epsilon^2 K}, \quad \text{where } K = \left|D_{\text{test}}\right|$$

- As $K$ increases, $2e^{-2\epsilon^2 K}$ decreases.
- As $K$ increases, $E_{\text{test}}(g)$ increases (fewer points are left for training, so the learned $g$ is worse).
3 Learning Problems

- Classification: $\mathcal{Y} = \{-1, +1\}$
- Predicting Probabilities: $\mathcal{Y} = [0, 1]$
- Regression: $\mathcal{Y} = \mathbb{R}$
Linear Models

$h(x)$ = some function of $w^T x$, where
$$x = \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_d \end{bmatrix}^T$$
3 Learning Solutions

- Linear Classification: $h(x) = \text{sign}\left(w^T x\right)$
- Logistic Regression: $h(x) = \theta\left(w^T x\right)$
- Linear Regression: $h(x) = w^T x$
Linear Classification: Perceptron

Given some input $x = (x_0 = 1, x_1, \ldots, x_d)$:
$$h(x) = \text{sign}\left(\sum_{i=0}^{d} w_i x_i\right)$$
Perceptron Learning Algorithm

PLA finds a linear separator in finite time, if the data is linearly separable.

Given: training data $D = (x_1, y_1), \ldots, (x_N, y_N)$
- Initialize $w$ to all zeros or (small) random numbers.
- While some training example is misclassified, i.e. $\exists\, (x_i, y_i) \in D$ s.t. $h(x_i) = \text{sign}\left(w^T x_i\right) \ne y_i$:
  - Randomly pick a misclassified training example $(x, y)$.
  - Update $w$: $w = w + yx$.
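A minimal runnable sketch of PLA (NumPy assumed; the synthetic dataset and function names are illustrative):

```python
import numpy as np

def pla(X, y, max_iters=10000, seed=0):
    """Perceptron learning algorithm. X has a leading column of 1s."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        wrong = np.nonzero(np.sign(X @ w) != y)[0]
        if len(wrong) == 0:          # linearly separated: done
            return w
        i = rng.choice(wrong)        # random misclassified example
        w = w + y[i] * X[i]          # PLA update
    return w                         # may not separate non-separable data

# Tiny separable example: label by the sign of x1 - x2.
rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.uniform(-1, 1, (50, 2))]
y = np.sign(X[:, 1] - X[:, 2])
w = pla(X, y)
print(np.mean(np.sign(X @ w) == y))  # 1.0 on separable data
```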
Perceptron Learning Algorithm: Intuition

Suppose $(x, y) \in D$ is a misclassified training example and $y = +1$, so $w^T x$ is negative.

After updating $w' = w + yx$:
$$w'^T x = (w + yx)^T x = w^T x + y\, x^T x$$
which is less negative than $w^T x$, because $y > 0$ and $x^T x > 0$.

A similar argument holds if $y = -1$.
#! "# $ = 1 ' ( ")* $ +, ". " / Linear Regression: Squared Error # = 1 ' ( + /, " $." ")* = 1 ' 0$. / where 6 = ( 6 / " = 6 + 6 7 ")* = 1 ' 0$. 8 0$. 32
Minimizing Error

- Find the gradient.
- Set it equal to zero.
- Solve.
- (Check that the solution is a minimum.)
! "# $ = 1 ' ($ +, ($ + = 1 ' ($ 2 +, ($ + Minimizing Error = 1 ' $2 ( 2 ($ 2$ 2 ( + + + 2 + /! "# $ = 1 ' 2(2 ($ 2( 2 + = 0 2( 2 ($ 2( 2 + = 0 ( 2 ($ = ( 2 + $ = ( 2 ( 56 ( 2 + 34
" # $% & = 1 ) 2+, +& 2+, / Checking 0 " # $% & = 1 ) 2+, + 0 " # $% & is (almost always) positive definite & = +, + 34 +, / is a unique global minimum 35
Linear Regression Algorithm

Input: $D = (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
1. Construct $X$ and $y$.
2. Compute the pseudo-inverse of $X$: $X^\dagger = \left(X^T X\right)^{-1} X^T$.
3. Compute $w = X^\dagger y$.
Output: $w$
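A sketch of the algorithm (NumPy assumed); `np.linalg.pinv` computes the pseudo-inverse and is numerically safer than forming $(X^T X)^{-1}$ explicitly:

```python
import numpy as np

def linear_regression(X, y):
    """One-shot least squares: w = pinv(X) @ y."""
    return np.linalg.pinv(X) @ y

# Fit y = 2 + 3*x1 with noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 100)
X = np.c_[np.ones(100), x1]                   # leading column of 1s
y = 2 + 3 * x1 + 0.1 * rng.standard_normal(100)
print(linear_regression(X, y))                # approximately [2, 3]
```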
Linear Regression for Classification

Key observation: $\{-1, +1\} \subset \mathbb{R}$.

Use linear regression to find $w = \left(X^T X\right)^{-1} X^T y$.

$w$ minimizes $E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \left(w^T x_n - y_n\right)^2$, so in general $\text{sign}\left(w^T x_n\right) \approx y_n$.

Use $w$ for linear classification: $g(x) = \text{sign}\left(w^T x\right)$.
The Pocket Algorithm

Input: $D = (x_1, y_1), \ldots, (x_N, y_N)$, $T$
1. Initialize $w$ and $\hat{w}$ to all zeros and $E_{\text{pocket}} = E_{\text{in}}(w)$.
2. For $t = 1, 2, \ldots, T$:
   a. Randomly pick a misclassified training example $(x, y)$.
   b. Update $w$: $w = w + yx$.
   c. If $E_{\text{in}}(w) < E_{\text{pocket}}$:
      I. $E_{\text{pocket}} = E_{\text{in}}(w)$
      II. $\hat{w} = w$
Output: $\hat{w}$
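A sketch of the pocket algorithm (NumPy assumed; names are illustrative). It runs ordinary PLA updates but keeps the best weights seen so far in its "pocket", which is what makes it usable on non-separable data:

```python
import numpy as np

def e_in(w, X, y):
    """0/1 in-sample error: fraction of misclassified points."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_hat, e_pocket = w.copy(), e_in(w, X, y)
    for _ in range(T):
        wrong = np.nonzero(np.sign(X @ w) != y)[0]
        if len(wrong) == 0:            # separable: current w is perfect
            return w
        i = rng.choice(wrong)
        w = w + y[i] * X[i]            # ordinary PLA update
        if e_in(w, X, y) < e_pocket:   # keep the best weights seen so far
            w_hat, e_pocket = w.copy(), e_in(w, X, y)
    return w_hat
```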
Logistic Regression

Training data does not consist of probabilities; observations are still binary: $y_n = \pm 1$.

Goal is to learn $P(y = +1 \mid x)$:
$$h(x) = \theta\left(w^T x\right), \quad \theta(s) = \frac{e^s}{1 + e^s} = \left(1 + e^{-s}\right)^{-1} \in [0, 1]$$

Note that $1 - \theta(s) = \theta(-s)$.
Cross-Entropy Error

Some hypothesis $h$ is good if the probability of the training data $D$ given $h$ is high:
$$E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$$

(Minimizing this cross-entropy error is equivalent to maximizing the likelihood of $D$.)
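As a small sketch (NumPy assumed), the error can be computed in one vectorized expression; `np.logaddexp` evaluates $\ln(1 + e^z)$ without overflow:

```python
import numpy as np

def cross_entropy(w, X, y):
    """E_in(w) = (1/N) * sum of ln(1 + exp(-y_n w^T x_n))."""
    # logaddexp(0, z) = ln(1 + e^z), computed stably
    return np.mean(np.logaddexp(0, -y * (X @ w)))
```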
Gradient Descent: Intuition

- Iterative method for minimizing functions.
- Requires the gradient to exist everywhere.
- Particularly useful for minimizing convex functions, like the cross-entropy error.
Gradient Descent: Intuition

Suppose the current location is $w^{(t)}$. Move some distance $\eta$ in the most downhill direction possible, $\hat{v}$:
$$w^{(t+1)} = w^{(t)} + \eta \hat{v}$$
Gradient Descent: Intuition

Fix $\eta$ and choose $\hat{v}$ to minimize $\Delta E_{\text{in}}(\hat{v})$ after making the update $w^{(t+1)} = w^{(t)} + \eta \hat{v}$:
$$\Delta E_{\text{in}}(\hat{v}) = E_{\text{in}}\left(w^{(t)} + \eta \hat{v}\right) - E_{\text{in}}\left(w^{(t)}\right)$$

By a first-order Taylor expansion, $E_{\text{in}}\left(w^{(t)} + \eta \hat{v}\right) \approx E_{\text{in}}\left(w^{(t)}\right) + \eta\, \hat{v}^T \nabla E_{\text{in}}\left(w^{(t)}\right)$, so
$$\Delta E_{\text{in}}(\hat{v}) \approx \eta\, \hat{v}^T \nabla E_{\text{in}}\left(w^{(t)}\right) \ge -\eta \left\|\nabla E_{\text{in}}\left(w^{(t)}\right)\right\|$$

with equality when
$$\hat{v} = -\frac{\nabla E_{\text{in}}\left(w^{(t)}\right)}{\left\|\nabla E_{\text{in}}\left(w^{(t)}\right)\right\|}$$
! " Small! Large! Variable! " Set! " =! $ & ' () * "! " decreases as + increases, because & ' () * " decreases as ' () * " approaches its minimum 44
Gradient Descent

Input: $D = (x_1, y_1), \ldots, (x_N, y_N)$, $\eta$
1. Initialize $w^{(0)}$ to all zeros and set $t = 0$.
2. While the termination condition is not satisfied:
   a. Compute $\nabla E_{\text{in}}\left(w^{(t)}\right)$.
   b. Update $w$: $w^{(t+1)} = w^{(t)} - \eta\, \nabla E_{\text{in}}\left(w^{(t)}\right)$.
   c. Increment $t$: $t = t + 1$.
Output: $w^{(t)}$, used as $g(x) = P(y = +1 \mid x) = \frac{1}{1 + e^{-w^T x}}$
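A sketch of gradient descent for logistic regression (NumPy assumed; the termination test on the gradient norm is one illustrative choice):

```python
import numpy as np

def grad_e_in(w, X, y):
    """Gradient of cross-entropy error: -(1/N) sum y_n x_n / (1 + e^{y_n w^T x_n})."""
    return -np.mean((y / (1 + np.exp(y * (X @ w))))[:, None] * X, axis=0)

def gradient_descent(X, y, eta=0.1, max_iters=10000, tol=1e-6):
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        g = grad_e_in(w, X, y)
        if np.linalg.norm(g) < tol:   # termination: gradient nearly zero
            break
        w = w - eta * g               # fixed-step update
    return w
```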
Stochastic Gradient Descent (SGD)

Input: $D = (x_1, y_1), \ldots, (x_N, y_N)$, $\eta$
1. Initialize $w^{(0)}$ to all zeros and set $t = 0$.
2. While the termination condition is not satisfied:
   a. Pick a random data point $(x, y) \in D$.
   b. Compute the pointwise gradient $\nabla e\left(w^{(t)}, x, y\right) = \frac{-yx}{1 + e^{y\, w^{(t)T} x}}$.
   c. Update $w$: $w^{(t+1)} = w^{(t)} - \eta\, \nabla e\left(w^{(t)}, x, y\right) = w^{(t)} + \frac{\eta\, yx}{1 + e^{y\, w^{(t)T} x}}$.
   d. Increment $t$: $t = t + 1$.
Output: $w^{(t)}$, used as $g(x) = P(y = +1 \mid x) = \frac{1}{1 + e^{-w^T x}}$
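The stochastic variant replaces the full gradient with the pointwise gradient of one randomly chosen example per update; a sketch (NumPy assumed; names are illustrative):

```python
import numpy as np

def sgd(X, y, eta=0.1, n_updates=100000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_updates):
        i = rng.integers(len(y))   # one random example per update
        w = w + eta * y[i] * X[i] / (1 + np.exp(y[i] * (X[i] @ w)))
    return w
```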
Logistic Regression for Classification

Use logistic regression to find $w$.

Use $w$ for classification: if $P(y = +1 \mid x) = \theta\left(w^T x\right) \ge \frac{1}{2}$, then classify $x$ as $+1$; otherwise, classify $x$ as $-1$:
$$g(x) = \text{sign}\left(\frac{1}{1 + e^{-w^T x}} - \frac{1}{2}\right) = \text{sign}\left(w^T x\right)$$
Error Measures: Classification

Fingerprint recognition: inputs are fingerprints; outputs: $+1$ means you, $-1$ means not you.

For personalized coupons (cost of predicting $h(x)$ when the truth is $f(x)$):

                f(x) = +1    f(x) = -1
  h(x) = +1         0            1
  h(x) = -1       100            0

For unlocking phones:

                f(x) = +1    f(x) = -1
  h(x) = +1         0         1000
  h(x) = -1         1            0
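A sketch of how such a cost matrix yields a weighted in-sample error (NumPy assumed; the cost values mirror the coupon table above, and the names are illustrative):

```python
import numpy as np

# cost[h, f]: cost of predicting h when the truth is f (+1 -> index 0, -1 -> index 1)
coupon_cost = np.array([[0, 1],
                        [100, 0]])

def weighted_error(h_pred, f_true, cost):
    """Average cost of predictions h_pred against true labels f_true."""
    idx = lambda v: (v < 0).astype(int)   # map +1 -> 0, -1 -> 1
    return np.mean(cost[idx(h_pred), idx(f_true)])

h = np.array([+1, -1, +1, -1])
f = np.array([+1, +1, -1, -1])
print(weighted_error(h, f, coupon_cost))  # (0 + 100 + 1 + 0) / 4 = 25.25
```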
Nonlinear Models

- Decide on a transformation $\Phi: \mathcal{X} \to \mathcal{Z}$.
- Convert $D = (x_1, y_1), \ldots, (x_N, y_N)$ to $\tilde{D} = \left(\Phi(x_1) = z_1, y_1\right), \ldots, \left(\Phi(x_N) = z_N, y_N\right)$.
- Fit a linear model using $\tilde{D}$, obtaining $\tilde{w}$.
- Return the corresponding predictor in the original space: $g(x) = \tilde{w}^T \Phi(x)$.
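A sketch of the recipe end to end, with an illustrative quadratic transform $\Phi(x) = (1, x_1, x_2, x_1^2, x_2^2)$ and the linear-regression fit from earlier (NumPy assumed):

```python
import numpy as np

def phi(X):
    """Quadratic feature transform: (1, x1, x2, x1^2, x2^2) per row."""
    return np.c_[np.ones(len(X)), X, X**2]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 0.6)   # circular decision boundary

Z = phi(X)                                    # work in Z space
w_tilde = np.linalg.pinv(Z) @ y               # linear fit in Z space
g = lambda X: np.sign(phi(X) @ w_tilde)       # predictor back in X space
print(np.mean(g(X) == y))                     # high accuracy on this data
```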
Tradeoffs

                    Low-Dimensional    High-Dimensional
                    Transformations    Transformations
  E_in              High               Low
  Generalization    Good               Bad