CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18


1 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18

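2 Formal Setup. Unknown target function $f: \mathcal{X} \to \mathcal{Y}$. Training data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Learning algorithm $\mathcal{A}$. Hypothesis set $\mathcal{H}$. Learned hypothesis $g \in \mathcal{H}$, $g \approx f$.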

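3 Formal Setup. Unknown target function $f: \mathcal{X} \to \mathcal{Y}$. Probability distribution $P$ on $\mathcal{X}$. Training data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Learning algorithm $\mathcal{A}$. Hypothesis set $\mathcal{H}$. Learned hypothesis $g \in \mathcal{H}$, $g \approx f$.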

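4 Formal Setup. Unknown target distribution: $f: \mathcal{X} \to \mathcal{Y}$ plus noise. Probability distribution $P$ on $\mathcal{X}$. Training data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Error measure. Learning algorithm $\mathcal{A}$. Hypothesis set $\mathcal{H}$. Learned hypothesis $g \in \mathcal{H}$, $g \approx f$.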

5 Hoeffding's Inequality (Validation). For a given hypothesis $h$: $E_{in}(h)$ = in-sample error of $h$; $E_{out}(h)$ = out-of-sample error of $h$. $P[\,|E_{in}(h) - E_{out}(h)| > \epsilon\,] \le 2e^{-2\epsilon^2 n}$. As $n$ increases, RHS decreases. As $\epsilon$ decreases, RHS increases.

6 Hoeffding's Inequality (Corrected). For a given finite hypothesis set $\mathcal{H} = \{h_1, \ldots, h_M\}$: $E_{in}(g)$ = in-sample error of the best hypothesis in $\mathcal{H}$; $E_{out}(g)$ = out-of-sample error of the best hypothesis in $\mathcal{H}$. $P[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,] \le 2Me^{-2\epsilon^2 n}$. As $n$ increases, RHS decreases. As $\epsilon$ decreases, RHS increases. As $M$ increases, RHS increases.
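To make the three "RHS" observations concrete, here is a minimal sketch that evaluates the right-hand side $2Me^{-2\epsilon^2 n}$ for a few settings. The function name and the sample values are illustrative assumptions, not part of the slides; $M = 1$ gives the single-hypothesis (validation) case.

```python
import math

def hoeffding_bound(eps, n, M=1):
    """Right-hand side 2*M*exp(-2*eps^2*n) of the Hoeffding bound.

    M=1 is the single-hypothesis case; M>1 is the finite-hypothesis-set case.
    """
    return 2 * M * math.exp(-2 * eps ** 2 * n)

# Illustrative values only: the bound shrinks with n and grows with M.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(0.05, n), hoeffding_bound(0.05, n, M=100))
```

Note that the bound can exceed 1 for small $n$ or large $M$, in which case it says nothing useful.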

7 The Union Bound is bad. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. [Figure: Venn diagram of overlapping events $A$ and $B$.]

8 Dichotomy. Given some finite sample of points $x_1, \ldots, x_n$ from the input space and a single hypothesis $h \in \mathcal{H}$, applying $h$ to each point in $x_1, \ldots, x_n$ results in a dichotomy. $(h(x_1), \ldots, h(x_n))$ is a vector of $n$ +1's and -1's. Given $x_1, \ldots, x_n$, each hypothesis in $\mathcal{H}$ generates a dichotomy, but not necessarily a unique dichotomy! The set of dichotomies induced by $\mathcal{H}$ on $x_1, \ldots, x_n$ is $\mathcal{H}(x_1, \ldots, x_n) = \{(h(x_1), \ldots, h(x_n)) \mid h \in \mathcal{H}\}$.

9 Growth Function. The growth function of $\mathcal{H}$ is the largest number of dichotomies $\mathcal{H}$ can induce across all data sets of size $n$: $m_{\mathcal{H}}(n) = \max_{x_1, \ldots, x_n \in \mathcal{X}} |\mathcal{H}(x_1, \ldots, x_n)|$.

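10 Growth Function (Shattering). Observe that $m_{\mathcal{H}}(n) \le 2^n$ for all $\mathcal{H}$ and $n$. Given $\mathcal{H}$, if $\exists\, x_1, \ldots, x_n$ s.t. $|\mathcal{H}(x_1, \ldots, x_n)| = 2^n$, then $\mathcal{H}$ shatters $x_1, \ldots, x_n$. If $\exists\, x_1, \ldots, x_n$ that is shattered by $\mathcal{H}$, then $m_{\mathcal{H}}(n) = 2^n$.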

11 Growth Function (Break Points). If $m_{\mathcal{H}}(k) < 2^k$, then $k$ is a break point for $\mathcal{H}$. If there is at least one break point for $\mathcal{H}$, then $m_{\mathcal{H}}(n)$ is polynomial in $n$.

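12 Growth Function: Example. $\mathcal{X} = \mathbb{R}$ and $\mathcal{H}$ = positive rays: $h(x) = \mathrm{sign}(x - a)$. $m_{\mathcal{H}}(n) = n + 1$, since the threshold $a$ can fall in any of the $n + 1$ gaps around $n$ distinct points. [Figure: sample points on the real line.]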

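13 Growth Function: Example. $\mathcal{X} = \mathbb{R}$ and $\mathcal{H}$ = positive intervals. $m_{\mathcal{H}}(n) = \binom{n+1}{2} + 1 = \frac{1}{2}n^2 + \frac{1}{2}n + 1$. [Figure: sample points on the real line.]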
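The two growth-function formulas for positive rays and positive intervals can be checked by brute force. The sketch below counts the distinct dichotomies induced on one hypothetical set of five distinct points; the helper names and the point values are assumptions made for illustration. For these two hypothesis sets, any set of distinct points attains the maximum, so the counts match $n + 1$ and $\binom{n+1}{2} + 1$.

```python
from math import comb

def cut_points(points):
    """One threshold below all points, one between each adjacent pair, one above all."""
    pts = sorted(points)
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return [pts[0] - 1] + mids + [pts[-1] + 1]

def ray_dichotomies(points):
    """Distinct dichotomies induced by positive rays h(x) = sign(x - a)."""
    return {tuple(1 if x > a else -1 for x in points) for a in cut_points(points)}

def interval_dichotomies(points):
    """Distinct dichotomies induced by positive intervals: +1 inside (lo, hi), -1 outside."""
    cuts = cut_points(points)
    return {tuple(1 if lo < x < hi else -1 for x in points)
            for i, lo in enumerate(cuts) for hi in cuts[i:]}

points = [0.5, 1.7, 2.2, 3.9, 5.0]  # hypothetical sample, n = 5
n = len(points)
print(len(ray_dichotomies(points)), n + 1)
print(len(interval_dichotomies(points)), comb(n + 1, 2) + 1)
```

Only thresholds in distinct gaps matter, which is why enumerating one threshold per gap is enough to list every dichotomy.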

14 Growth Function: Example. $\mathcal{X} = \mathbb{R}^2$ and $\mathcal{H}$ = convex sets. $m_{\mathcal{H}}(n) = 2^n$. [Figure: sample points in the plane.]

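15 Vapnik-Chervonenkis (VC) Bound. $P[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,] \le 4\, m_{\mathcal{H}}(2n)\, e^{-\epsilon^2 n / 8}$. Or: $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{n} \log \frac{4\, m_{\mathcal{H}}(2n)}{\delta}}$ with probability at least $1 - \delta$.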

16 VC-Dimension. $d_{vc}(\mathcal{H})$ = the largest value of $n$ s.t. $m_{\mathcal{H}}(n) = 2^n$. The VC-dimension is the greatest number of points that can be shattered by $\mathcal{H}$. If $k$ is the smallest break point for $\mathcal{H}$, then $d_{vc}(\mathcal{H}) = k - 1$. $m_{\mathcal{H}}(n) \le n^{d_{vc}} + 1 = O(n^{d_{vc}})$.

17 Sample Complexity. How many samples do we need in our training data to say that the generalization error is less than $\epsilon$ with probability at least $1 - \delta$? Set $\sqrt{\frac{8}{n} \log \frac{4((2n)^{d_{vc}} + 1)}{\delta}} \le \epsilon$. Conclude that we need $n \ge \frac{8}{\epsilon^2} \log \frac{4((2n)^{d_{vc}} + 1)}{\delta}$. As $\delta$ decreases, RHS increases. As $\epsilon$ decreases, RHS increases. As $d_{vc}$ decreases, RHS decreases.
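Because $n$ appears on both sides of the sample-complexity inequality, it has no closed-form solution. One common approach is fixed-point iteration: plug a guess for $n$ into the right-hand side and repeat. The sketch below assumes the natural logarithm and the polynomial bound $m_{\mathcal{H}}(2n) \le (2n)^{d_{vc}} + 1$; the function name, starting guess, and example values are illustrative assumptions.

```python
import math

def sample_complexity(eps, delta, d_vc, n0=1000.0, iters=100):
    """Iterate n <- (8/eps^2) * ln(4*((2n)^d_vc + 1)/delta) from the guess n0."""
    n = n0
    for _ in range(iters):
        n = (8 / eps ** 2) * math.log(4 * ((2 * n) ** d_vc + 1) / delta)
    return n

# Hypothetical setting: eps = 0.1, delta = 0.1, d_vc = 3.
n = sample_complexity(0.1, 0.1, 3)
print(math.ceil(n))
```

The iteration settles quickly here because the right-hand side grows only logarithmically in $n$.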

18 Penalty for Model Complexity. Given $n$ samples, how good can we say our learned hypothesis will do with confidence at least $1 - \delta$? Conclude that $E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{n} \log \frac{4((2n)^{d_{vc}} + 1)}{\delta}}$.

19 Approximation-Generalization Tradeoff. $E_{out}(g) \le E_{in}(g) + O\!\left(\sqrt{d_{vc} \frac{\log n}{n}}\right)$. First term: how well does $g$ approximate $f$? Second term: how well does $g$ generalize?

20 Approximation-Generalization Tradeoff. $E_{out}(g) \le E_{in}(g) + O\!\left(\sqrt{d_{vc} \frac{\log n}{n}}\right)$. The first term decreases as $d_{vc}$ increases. The second term increases as $d_{vc}$ increases.

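21 Bias-Variance Tradeoff. $\mathbb{E}_{\mathcal{D}}[E_{out}(g^{(\mathcal{D})})] = \mathbb{E}_x\!\left[\mathbb{E}_{\mathcal{D}}\!\left[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\right] + (\bar{g}(x) - f(x))^2\right]$. First term (variance): how variable is $g$? Second term (bias): how well, on average, does $g$ approximate $f$?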

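22 Bias-Variance Tradeoff. $\mathbb{E}_{\mathcal{D}}[E_{out}(g^{(\mathcal{D})})] = \mathbb{E}_x\!\left[\mathbb{E}_{\mathcal{D}}\!\left[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\right] + (\bar{g}(x) - f(x))^2\right]$. The first term (variance) increases as $\mathcal{H}$ becomes more complex. The second term (bias) decreases as $\mathcal{H}$ becomes more complex.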

23 [Figure: learning curves for a simple model and a complex model. Each panel plots expected error against the number of training points, $n$, with one curve for $E_{out}$ and one for $E_{in}$.]

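24 [Figure: two learning-curve panels, each plotting expected error against the number of training points, $n$, with curves for $E_{out}$ and $E_{in}$. VC analysis: $E_{out}$ is split into in-sample error and generalization error. Bias-variance analysis: $E_{out}$ is split into bias and variance.]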

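25 Test Sets. Instead of bounding $E_{out}(g)$ using $E_{in}(g)$, estimate $E_{out}(g)$ using the error on some test dataset $\mathcal{D}_{test}$, $E_{test}(g)$. If $\mathcal{D}_{test}$ is not involved in the training process, then we are validating $g$ using $\mathcal{D}_{test}$. Therefore, Hoeffding's bound applies with $M = |\mathcal{H}| = 1$: $P[\,|E_{test}(g) - E_{out}(g)| > \epsilon\,] \le 2e^{-2\epsilon^2 n_{test}}$, where $n_{test} = |\mathcal{D}_{test}|$. As $n_{test}$ increases, $2e^{-2\epsilon^2 n_{test}}$ decreases. As $n_{test}$ increases, $E_{test}(g)$ increases, because fewer points are left for training.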

26 3 Learning Problems
Problem                    Domain
Classification             $\mathcal{Y} = \{-1, +1\}$
Predicting Probabilities   $\mathcal{Y} = [0, 1]$
Regression                 $\mathcal{Y} = \mathbb{R}$

27 Linear Models. $h(x)$ = some function of $w^T x$, where $x = [1, x_1, x_2, \ldots, x_d]^T$.

28 3 Learning Solutions
Problem                 Model
Linear Classification   $h(x) = \mathrm{sign}(w^T x)$
Logistic Regression     $h(x) = \theta(w^T x)$
Linear Regression       $h(x) = w^T x$

29 Linear Classification: Perceptron. Given some input $x = [x_0 = 1, x_1, \ldots, x_d]$: $h(x) = \mathrm{sign}\!\left(\sum_{i=0}^{d} w_i x_i\right)$.

30 Perceptron Learning Algorithm. PLA finds a linear separator in finite time, if the data is linearly separable.
Given: training data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
Initialize $w$ to all zeros or (small) random numbers
While $\exists$ some misclassified training example, i.e. $(x_i, y_i) \in \mathcal{D}$ s.t. $h(x_i) = \mathrm{sign}(w^T x_i) \ne y_i$:
  Randomly pick a misclassified training example, $(x, y)$
  Update $w$: $w = w + yx$
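The PLA steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's reference implementation: the toy data set, the `max_updates` safety cap, the fixed random seed, and the convention $\mathrm{sign}(0) = -1$ are all assumptions made here. Each input carries the constant $x_0 = 1$ as its first coordinate.

```python
import random

def sign(s):
    # Assumed convention: a score of exactly 0 is classified as -1.
    return 1 if s > 0 else -1

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def pla(data, max_updates=10_000, seed=0):
    """Perceptron Learning Algorithm; data is a list of (x, y) with x[0] == 1."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(max_updates):  # cap is a safeguard for non-separable data
        wrong = [(x, y) for x, y in data if sign(dot(w, x)) != y]
        if not wrong:
            break
        x, y = rng.choice(wrong)                   # random misclassified example
        w = [wi + y * xi for wi, xi in zip(w, x)]  # update w = w + y x
    return w

# Hypothetical linearly separable toy data.
data = [((1.0, 0.0, 0.0), -1), ((1.0, 0.2, 0.3), -1), ((1.0, 0.1, 0.5), -1),
        ((1.0, 2.0, 2.0), +1), ((1.0, 1.5, 1.0), +1), ((1.0, 3.0, 0.5), +1)]
w = pla(data)
print(w, [sign(dot(w, x)) for x, _ in data])
```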

31 Perceptron Learning Algorithm. Suppose $(x, y) \in \mathcal{D}$ is a misclassified training example and $y = +1$. Then $w^T x$ is negative. After updating $w' = w + yx$, $w'^T x = (w + yx)^T x = w^T x + y\, x^T x$ is less negative than $w^T x$, because $y > 0$ and $x^T x > 0$. A similar argument holds if $y = -1$.

32 Linear Regression: Squared Error. $E_{in}(w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2 = \frac{1}{n} \|Xw - y\|^2$, where $\|z\|^2 = \sum_{i=1}^{n} z_i^2$, $= \frac{1}{n} (Xw - y)^T (Xw - y)$.

33 Minimizing Error. Find the gradient. Set it equal to zero. Solve. (Check that the solution is a minimum.)

34 Minimizing Error. $E_{in}(w) = \frac{1}{n} (Xw - y)^T (Xw - y) = \frac{1}{n} (w^T X^T X w - 2 w^T X^T y + y^T y)$. $\nabla E_{in}(w) = \frac{1}{n} (2 X^T X w - 2 X^T y) = 0$, so $2 X^T X w - 2 X^T y = 0$, so $X^T X w = X^T y$, so $w = (X^T X)^{-1} X^T y$.

35 Checking. $\nabla E_{in}(w) = \frac{1}{n} (2 X^T X w - 2 X^T y)$. $\nabla^2 E_{in}(w) = \frac{1}{n}\, 2 X^T X$ is (almost always) positive definite, so $w = (X^T X)^{-1} X^T y$ is a unique global minimum.

36 Linear Regression Algorithm
Input: $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
1. Construct $X$ and $y$
2. Compute the pseudo-inverse of $X$: $X^{\dagger} = (X^T X)^{-1} X^T$
3. Compute $w = X^{\dagger} y$
Output: $w$
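As a dependency-free sketch of the algorithm above, the code below solves the normal equations $X^T X w = X^T y$ by Gauss-Jordan elimination rather than forming the pseudo-inverse explicitly. That substitution is a choice made here for illustration and gives the same $w$ when $X^T X$ is invertible. The helper names and the toy data (generated from $y = 1 + 2x$) are assumptions.

```python
def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * pc for a, pc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def linear_regression(X, y):
    """Return w minimizing (1/n)||Xw - y||^2, assuming X^T X is invertible."""
    d = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve(XtX, Xty)

# Hypothetical data lying exactly on y = 1 + 2x; the first column is x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = linear_regression(X, y)
print(w)
```

In practice a numerical library's least-squares routine is usually preferable to hand-rolled elimination; the version here only keeps the example self-contained.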

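37 Linear Regression for Classification. Key observation: $\{-1, +1\} \subset \mathbb{R}$. Use linear regression to find $w = (X^T X)^{-1} X^T y$. $w$ minimizes $E_{in}(w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$. In general, $\mathrm{sign}(w^T x_i)$ is likely to agree with $y_i$. Use $w$ for linear classification: $g(x) = \mathrm{sign}(w^T x)$.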

38 The Pocket Algorithm
Input: $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $T$
1. Initialize $w$ to all zeros and $E_{best} = \infty$
2. For $t = 1, 2, \ldots, T$
   a. Randomly pick a misclassified training example, $(x, y)$
   b. Update $w$: $w = w + yx$
   c. If $E_{in}(w) < E_{best}$
      I. $E_{best} = E_{in}(w)$
      II. $\hat{w} = w$
Output: $\hat{w}$
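A minimal sketch of the pocket algorithm follows, under a few assumptions that the slide does not spell out: $E_{in}$ is taken to be the fraction of misclassified training points, $\mathrm{sign}(0) = -1$, the loop stops early if no point is misclassified, and the toy data (with one deliberately mislabeled point so that it is not linearly separable) is hypothetical.

```python
import random

def sign(s):
    return 1 if s > 0 else -1  # assumed convention: sign(0) = -1

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def e_in(w, data):
    """Fraction of training examples that w misclassifies."""
    return sum(sign(dot(w, x)) != y for x, y in data) / len(data)

def pocket(data, T, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    w_hat, e_best = list(w), float("inf")
    for _ in range(T):
        wrong = [(x, y) for x, y in data if sign(dot(w, x)) != y]
        if not wrong:            # nothing left to fix; stop early
            break
        x, y = rng.choice(wrong)
        w = [wi + y * xi for wi, xi in zip(w, x)]  # PLA update
        e = e_in(w, data)
        if e < e_best:           # keep the best weights seen so far "in the pocket"
            e_best, w_hat = e, list(w)
    return w_hat

# Hypothetical data: the last point is a +1 sitting among the -1 points.
data = [((1.0, 0.0, 0.0), -1), ((1.0, 0.2, 0.3), -1), ((1.0, 0.1, 0.5), -1),
        ((1.0, 2.0, 2.0), +1), ((1.0, 1.5, 1.0), +1), ((1.0, 3.0, 0.5), +1),
        ((1.0, 0.1, 0.2), +1)]
w_hat = pocket(data, T=200)
print(w_hat, e_in(w_hat, data))
```

Unlike plain PLA, the returned weights never get worse as $T$ grows, because the pocket only changes when the in-sample error strictly improves.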

39 Logistic Regression. Training data does not consist of probabilities. Observations are still binary: $y_i = \pm 1$. Goal is to learn $f(x) = P(y = +1 \mid x)$. $h(x) = \theta(w^T x) = \frac{1}{1 + e^{-w^T x}} \in [0, 1]$. Note that $1 - \theta(w^T x) = \theta(-w^T x)$.

40 Cross-entropy Error. Some hypothesis $h$ is good if the probability of the training data $\mathcal{D}$ given $h$ is high. $E_{in}(w) = \frac{1}{n} \sum_{i=1}^{n} \ln\!\left(1 + e^{-y_i w^T x_i}\right)$.

41 Gradient Descent: Intuition. Iterative method for minimizing functions. Requires the gradient to exist everywhere. Particularly useful for minimizing convex functions, like the cross-entropy error.

42 Gradient Descent: Intuition. Suppose the current location is $w^{(t)}$. Move some distance, $\eta$, in the most downhill direction possible, $\hat{v}$: $w^{(t+1)} = w^{(t)} + \eta \hat{v}$.

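43 Fix $\eta$ and choose the unit vector $\hat{v}$ to minimize $\Delta E_{in}$ after making the update $w^{(t+1)} = w^{(t)} + \eta \hat{v}$. $\Delta E_{in}(\hat{v}) = E_{in}(w^{(t)} + \eta \hat{v}) - E_{in}(w^{(t)})$. By a first-order Taylor expansion, $\Delta E_{in}(\hat{v}) \approx \left(E_{in}(w^{(t)}) + \eta \hat{v}^T \nabla E_{in}(w^{(t)})\right) - E_{in}(w^{(t)}) = \eta \hat{v}^T \nabla E_{in}(w^{(t)}) \ge -\eta \|\nabla E_{in}(w^{(t)})\|$, with equality at $\hat{v} = -\frac{\nabla E_{in}(w^{(t)})}{\|\nabla E_{in}(w^{(t)})\|}$.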

44 Choosing $\eta$: small $\eta$, large $\eta$, variable $\eta_t$. Set $\eta_t = \eta_0 \|\nabla E_{in}(w^{(t)})\|$. $\eta_t$ decreases as $t$ increases, because $\|\nabla E_{in}(w^{(t)})\|$ decreases as $E_{in}(w^{(t)})$ approaches its minimum.

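45 Gradient Descent
Input: $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $\eta_0$
1. Initialize $w^{(0)}$ to all zeros and set $t = 0$
2. While termination condition is not satisfied
   a. Compute $\nabla E_{in}(w^{(t)})$
   b. Update $w$: $w^{(t+1)} = w^{(t)} - \eta_0 \nabla E_{in}(w^{(t)})$
   c. Increment $t$: $t = t + 1$
Output: $w^{(t)}$, with $g(x) = P(y = +1 \mid x) = \frac{1}{1 + e^{-w^{(t)T} x}}$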
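Here is a minimal sketch of batch gradient descent applied to the logistic-regression cross-entropy error, using the gradient $\nabla E_{in}(w) = -\frac{1}{n} \sum_i \frac{y_i x_i}{1 + e^{y_i w^T x_i}}$. The slide leaves the termination condition open; the iteration cap and gradient-norm tolerance below are assumptions, as are the function names, the step size, and the small non-separable toy data set.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def cross_entropy(w, data):
    """E_in(w) = (1/n) * sum ln(1 + exp(-y * w^T x))."""
    return sum(math.log(1 + math.exp(-y * dot(w, x))) for x, y in data) / len(data)

def gradient(w, data):
    """Gradient of the cross-entropy error at w."""
    g = [0.0] * len(w)
    for x, y in data:
        c = -y / (1 + math.exp(y * dot(w, x)))
        for j, xj in enumerate(x):
            g[j] += c * xj / len(data)
    return g

def gradient_descent(data, eta0=0.1, max_iters=5000, tol=1e-6):
    w = [0.0] * len(data[0][0])
    for _ in range(max_iters):            # assumed termination: iteration cap ...
        g = gradient(w, data)
        if math.sqrt(dot(g, g)) < tol:    # ... or a (nearly) vanishing gradient
            break
        w = [wj - eta0 * gj for wj, gj in zip(w, g)]
    return w

def predict_proba(w, x):
    """g(x) = P(y = +1 | x) = 1 / (1 + exp(-w^T x))."""
    return 1 / (1 + math.exp(-dot(w, x)))

# Hypothetical 1-D data with x_0 = 1 prepended; the labels overlap, so w stays bounded.
data = [((1.0, 0.0), -1), ((1.0, 1.0), -1), ((1.0, 2.0), +1),
        ((1.0, 3.0), -1), ((1.0, 4.0), +1), ((1.0, 5.0), +1)]
w = gradient_descent(data)
print(w, cross_entropy(w, data))
```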

46 Stochastic Gradient Descent (SGD)
Input: $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $\eta_0$
1. Initialize $w^{(0)}$ to all zeros and set $t = 0$
2. While termination condition is not satisfied
   a. Pick a random data point in $\mathcal{D}$, $(x, y)$
   b. Compute $\nabla e(w^{(t)}, x, y) = \frac{-y x}{1 + e^{y\, w^{(t)T} x}}$
   c. Update $w$: $w^{(t+1)} = w^{(t)} - \eta_0 \nabla e(w^{(t)}, x, y)$
   d. Increment $t$: $t = t + 1$
Output: $w^{(t)}$, with $g(x) = P(y = +1 \mid x) = \frac{1}{1 + e^{-w^{(t)T} x}}$
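The SGD variant differs from batch gradient descent only in using the gradient of the error on a single randomly chosen point. The sketch below assumes a fixed number of steps as the termination condition, a small constant step size, a fixed random seed, and a hypothetical toy data set; none of these choices come from the slides.

```python
import math
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def cross_entropy(w, data):
    """E_in(w) = (1/n) * sum ln(1 + exp(-y * w^T x))."""
    return sum(math.log(1 + math.exp(-y * dot(w, x))) for x, y in data) / len(data)

def sgd(data, eta0=0.01, steps=20000, seed=0):
    """Stochastic gradient descent for logistic regression."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(steps):                       # assumed termination: fixed step count
        x, y = rng.choice(data)                  # pick a random data point
        c = -y / (1 + math.exp(y * dot(w, x)))   # pointwise gradient is c * x
        w = [wj - eta0 * c * xj for wj, xj in zip(w, x)]
    return w

# Hypothetical 1-D data with x_0 = 1 prepended.
data = [((1.0, 0.0), -1), ((1.0, 1.0), -1), ((1.0, 2.0), +1),
        ((1.0, 3.0), -1), ((1.0, 4.0), +1), ((1.0, 5.0), +1)]
w = sgd(data)
print(w, cross_entropy(w, data))
```

Each step costs one data point instead of a full pass, at the price of a noisier path toward the minimum.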

47 Logistic Regression for Classification. Use logistic regression to find $w$. Use $w$ for classification: if $P(y = +1 \mid x) = \theta(w^T x) \ge \frac{1}{2}$, then classify $x$ as +1; otherwise, classify $x$ as -1. Equivalently, $g(x) = \mathrm{sign}\!\left(\frac{1}{1 + e^{-w^T x}} - \frac{1}{2}\right)$.

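48 Logistic Regression for Classification. Use logistic regression to find $w$. Use $w$ for classification: if $P(y = +1 \mid x) = \theta(w^T x)$ is at least the chosen threshold, then classify $x$ as +1; otherwise, classify $x$ as -1. Equivalently, $g(x)$ is the sign of $\frac{1}{1 + e^{-w^T x}}$ minus that threshold.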

49 Error: Classification. Fingerprint recognition: inputs are fingerprints; outputs: +1 means you, -1 means not you. For personalized coupons: [table comparing $f$ and $g$, entries not recovered]. For unlocking phones: [table comparing $f$ and $g$, entries not recovered].

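50 Nonlinear Models. Decide on a transformation $\Phi: \mathcal{X} \to \mathcal{Z}$. Convert $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ to $\tilde{\mathcal{D}} = \{(\Phi(x_1) = z_1, y_1), \ldots, (\Phi(x_n) = z_n, y_n)\}$. Fit a linear model using $\tilde{\mathcal{D}}$, $\tilde{g}(z)$. Return the corresponding predictor in the original space: $g(x) = \tilde{g}(\Phi(x))$.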
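As a small illustration of the recipe above, the sketch below uses a hypothetical transformation $\Phi(x_1, x_2) = (1, x_1^2, x_2^2)$, which turns a circular boundary in the original space into a linear one in the transformed space. The linear model is fit with a deterministic variant of the perceptron update (always taking the first misclassified point), which is an assumption made to keep the example reproducible; the data set is also made up.

```python
def sign(s):
    return 1 if s > 0 else -1  # assumed convention: sign(0) = -1

def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

def phi(x):
    """Hypothetical feature transform: (x1, x2) -> (1, x1^2, x2^2)."""
    x1, x2 = x
    return (1.0, x1 ** 2, x2 ** 2)

def fit_perceptron(data_z, max_updates=10_000):
    """Fit a linear separator in the transformed space with perceptron updates."""
    w = [0.0] * len(data_z[0][0])
    for _ in range(max_updates):
        wrong = [(z, y) for z, y in data_z if sign(dot(w, z)) != y]
        if not wrong:
            break
        z, y = wrong[0]  # deterministic pick, for reproducibility
        w = [wi + y * zi for wi, zi in zip(w, z)]
    return w

# Hypothetical data: +1 near the origin, -1 farther out (not linearly separable in x).
data = [((0.0, 0.0), +1), ((0.3, 0.2), +1), ((-0.4, 0.1), +1), ((0.2, -0.5), +1),
        ((1.5, 0.0), -1), ((0.0, -1.6), -1), ((1.2, 1.2), -1), ((-1.4, 0.9), -1)]

data_z = [(phi(x), y) for x, y in data]  # convert D to the transformed data set
w_tilde = fit_perceptron(data_z)         # fit the linear model in Z

def g(x):
    """Predictor in the original space: g(x) = g~(Phi(x))."""
    return sign(dot(w_tilde, phi(x)))

print(w_tilde, [g(x) for x, _ in data])
```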

51 Tradeoffs
                 Low-Dimensional Transformations   High-Dimensional Transformations
$E_{in}$         High                              Low
Generalization   Good                              Bad
