CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18


Overfitting: Overfitting is fitting the training data more than is warranted, i.e., fitting noise rather than signal.

Estimating $E_{\text{out}}$ instead of $E_{\text{in}}$: $E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}$. Regularization estimates this quantity (the overfit penalty).

Regularization: Constrain hypothesis sets to prevent them from being able to fit noise. Learning algorithms are optimization problems, and regularization imposes constraints on that optimization.

Regularization: minimize $E_{\text{aug}}(w, \lambda) = E_{\text{in}}(w) + \frac{\lambda}{N}\,\Omega(w)$. Ridge: $\Omega(w) = \sum_{q=0}^{d} w_q^2$. Low order: $\Omega(w) = \sum_{q=0}^{d} q\,w_q^2$. Lasso: $\Omega(w) = \sum_{q=0}^{d} |w_q|$.
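
As a concrete illustration of the ridge penalty above, here is a minimal NumPy sketch (the data arrays, the function name, and the choice of `lam` are hypothetical) showing how minimizing the augmented error with a squared-weight regularizer leads to the familiar closed-form ridge solution:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize E_aug(w) = (1/N)||Xw - y||^2 + (lam/N) w^T w.
    Setting the gradient to zero gives w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical usage: 100 noisy points in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)
w = ridge_fit(X, y, lam=1.0)
```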

Estimating $E_{\text{out}}$ instead of $E_{\text{in}}$: $E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}$. Validation estimates this quantity ($E_{\text{out}}(h)$ itself).

Test sets: Estimate $E_{\text{out}}(g)$ using the error on some test dataset $D_{\text{test}}$: $E_{\text{test}}(g)$. If $D_{\text{test}}$ is not involved in the training process, then $\mathbb{P}\!\left[\left|E_{\text{test}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2e^{-2\epsilon^2 K}$, where $K = |D_{\text{test}}|$.

Picking $K$: More test data leads to a tighter bound on $E_{\text{out}}(g^-)$, but fewer training data generally means the learned $g^-$ is worse, i.e., $E_{\text{out}}(g^-)$ tends to increase as $N - K$ decreases. With high probability, $E_{\text{out}}(g) \le E_{\text{out}}(g^-) \le E_{\text{test}}(g^-) + O\!\left(\tfrac{1}{\sqrt{K}}\right)$. Return $g$, but bound $E_{\text{out}}(g)$ using $E_{\text{test}}(g^-) + O\!\left(\tfrac{1}{\sqrt{K}}\right)$. Practical rule of thumb: $K = \tfrac{N}{5}$.
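
To make the test-set guarantee concrete, here is a small sketch (NumPy only; the function name and the values of `K` and `delta` are hypothetical) that inverts the Hoeffding bound above to get the error bar $\epsilon$ attached to a test error measured on $K$ points:

```python
import numpy as np

def hoeffding_epsilon(K, delta):
    """Smallest epsilon such that P[|E_test - E_out| > epsilon] <= delta,
    using the single-hypothesis bound 2 * exp(-2 * epsilon^2 * K) <= delta."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * K))

# Hypothetical usage: a 1000-point test set with a 95% guarantee.
print(hoeffding_epsilon(K=1000, delta=0.05))  # roughly 0.043
```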

Validation set: $D_{\text{train}}$ is used to build a finite set of candidate hypotheses: $H_{\text{val}} = \{g_1^-, g_2^-, \dots, g_M^-\}$. $D_{\text{val}}$ is used to select the hypothesis $g_{m^*}^-$ from $H_{\text{val}}$. By Hoeffding's bound for a finite hypothesis set, with high probability $E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\tfrac{\ln M}{K}}\right)$.

$E_{\text{in}}$ vs. $E_{\text{val}}$ vs. $E_{\text{test}}$:
- $E_{\text{in}}$: incredibly biased; related to $E_{\text{out}}$ by the VC bound.
- $E_{\text{val}}$: slightly biased; related to $E_{\text{out}}$ by Hoeffding's bound (multiple hypotheses).
- $E_{\text{test}}$: not biased; related to $E_{\text{out}}$ by Hoeffding's bound (single hypothesis).

Three Learning Principles:
- Occam's Razor: the simplest model that fits the data is also the most plausible.
- Sampling Bias: if the data is sampled in a biased way, learning will produce a similarly biased outcome.
- Data Snooping: if a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

Decision Tree: Example. (Figure: a decision tree that splits on features such as Tired / Not Tired, Backpack / Lunchbox / Both, and Rain / No Rain, with leaves predicting Metro, Bike, or Drive.)

ID3 Learning Algorithm: Initialize the tree as a single leaf that contains all labels. While there is an impure leaf (not all labels are the same):
- Pick an arbitrary impure leaf.
- Find the feature, $x_i$, with the largest information gain relative to the labels in that leaf.
- Create a child (or split) for each unique value of $x_i$.
- Assign each label in the original leaf to one of its children depending on its corresponding $x_i$ value. The original leaf is no longer a leaf; all of its children are new leaves.
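
The split criterion above relies on entropy and information gain. Here is a minimal NumPy sketch (the function names and the assumption of categorical features are my own) of how the best split feature could be chosen:

```python
import numpy as np

def entropy(labels):
    """H(p) = -sum_c p_c * log2(p_c) over the label distribution in a leaf."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Entropy of the leaf minus the weighted entropy of the children
    created by splitting on each unique feature value."""
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def best_split_feature(X, y):
    """Return the column index with the largest information gain."""
    return max(range(X.shape[1]), key=lambda i: information_gain(X[:, i], y))
```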

Decision Tree / ID3 Pros: Intuitive / explainable. Can handle categorical and real-valued features. Automatically performs feature selection. The ID3 algorithm has a preference for shorter trees (simpler hypotheses).

Decision Tree / ID3 Cons: The ID3 algorithm is greedy, so there is no optimality guarantee. Overfitting! Heuristics ("regularization"): do not split leaves that are past a fixed depth $d$, that have fewer than $k$ labels, or where the maximal information gain is less than $\tau$. Pruning ("validation"): evaluate each split using a validation set and remove the one that most improves the validation error.

Bagging: Short for bootstrap aggregating. Combines the predictions of many independent hypotheses to reduce variance. Bootstrapping: a statistical method for estimating properties of a distribution, given (potentially a small number of) samples from that distribution; relies on resampling the samples with replacement many, many times. Aggregating: combining multiple hypotheses, $h_1, h_2, \dots, h_T$, to arrive at a single hypothesis.

Split-Feature Randomization: Predictions made by trees trained on similar datasets are highly correlated. To decorrelate these predictions, randomly limit the features available at each iteration of the ID3 algorithm: every time the ID3 algorithm goes to split an impure leaf, randomly select $k < d$ features and only allow the algorithm to split on one of those $k$ features. For classification, a common choice is $k = \sqrt{d}$; for regression, a common choice is $k = d/3$.

Random Forests: Input: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, $T$, $k$.
For $t = 1, 2, \dots, T$:
- Create a dataset, $D_t$, by sampling $N$ points from $D$ with replacement.
- Learn a decision tree, $h_t$, using $D_t$ and the ID3 algorithm with split-feature randomization.
Output: $\bar{h}$, the aggregated hypothesis.
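
A compact sketch of the loop above, assuming scikit-learn's DecisionTreeClassifier as a stand-in for ID3 (its `max_features="sqrt"` option plays the role of split-feature randomization); the function names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T, seed=0):
    """Train T trees, each on a bootstrap sample, with feature-subsampled splits."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=len(X))    # sample N points with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    """Aggregate by majority vote over the individual tree predictions."""
    votes = np.stack([tree.predict(X) for tree in trees])
    return np.sign(votes.sum(axis=0))                 # assumes labels in {-1, +1}
```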

Boosting: Another ensemble method (like bagging) that combines the predictions of multiple hypotheses. Aims to reduce the bias of a weak or highly biased hypothesis set (it can also reduce variance). Intuition: iteratively reweight inputs, giving more weight to inputs that are difficult to predict correctly. Fundamentally requires that we have access to weak learners that are better than random chance.

AdaBoost: Input: $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$ with $y_n \in \{-1, +1\}$, and $T$.
Initialize input weights: $w_1^{(0)} = \dots = w_N^{(0)} = \frac{1}{N}$.
For $t = 1, \dots, T$:
1. Train a weak learner (hypothesis), $h_t$, by minimizing the weighted training error.
2. Compute the weighted training error of $h_t$: $\epsilon_t = \sum_{n=1}^{N} w_n^{(t-1)} \, \mathbb{1}\!\left[h_t(x_n) \ne y_n\right]$.
3. Compute the importance of $h_t$: $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
4. Update the weights: $w_n^{(t)} = \frac{w_n^{(t-1)}}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_n) = y_n \\ e^{\alpha_t} & \text{if } h_t(x_n) \ne y_n \end{cases} = \frac{w_n^{(t-1)} \, e^{-\alpha_t y_n h_t(x_n)}}{Z_t}$, where $Z_t$ normalizes the weights.
Output: an aggregated hypothesis $H(x) = \mathrm{sign}\!\left(F(x)\right) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
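
Here is a minimal NumPy sketch of the loop above, using a scikit-learn depth-1 decision tree as the weak learner; the function names and the small epsilon guard in the log are hypothetical additions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T):
    """AdaBoost with decision stumps; labels y are assumed to be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                               # initial input weights
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                          # weighted training error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12)) # importance of h_t
        w = w * np.exp(-alpha * y * pred)                 # reweight the inputs
        w = w / w.sum()                                   # normalize (the Z_t step)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Aggregated hypothesis: sign of the weighted vote of the weak learners."""
    F = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(F)
```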

Why AdaBoost?
1. If you only have access to weak learners (because of computational constraints),
2. and want your final hypothesis to be a weighted combination of weak learners (because weak learners are not great on their own),
3. then AdaBoost greedily minimizes the exponential loss (because the exponential loss upper bounds binary error): $\ell(h, x, y) = e^{-y\,h(x)}$.

Nearest Neighbor Intuition: Classify a point as the label of the most similar training point. Use Euclidean distance as the similarity metric: $d(x, x') = \|x - x'\| = \sqrt{\sum_{i=1}^{d} (x_i - x_i')^2}$.

The Nearest Neighbor Hypothesis. (Figure: the decision regions of the nearest neighbor hypothesis $g(x)$, which returns the label of the training point closest to $x$, on an example 2D dataset.)

Generalization of Nearest Neighbor: Claim: $E_{\text{out}}$ for the nearest neighbor hypothesis is not much worse than the best possible $E_{\text{out}}$! Formally: with high probability, $E_{\text{out}}(g_{\text{NN}}) \le 2\,E_{\text{out}}(g^*)$ as $N \to \infty$. Interpretation: half of the data's predictive power is in the nearest neighbor!

$k$-Nearest Neighbors ($k$NN): Classify a point as the most common label among the labels of the $k$ nearest training points.
- When $k = 1$, $g$ is the nearest neighbor hypothesis: complicated decision boundaries; may overfit.
- When $k = N$, $g$ always predicts the most common label in the training dataset: no decision boundaries; may underfit.
$k$ controls the complexity of the hypothesis set, so $k$ affects how well the learned hypothesis will generalize. Practical rules of thumb: $k = 3$, $k = \sqrt{N}$, or pick $k$ by cross-validation.
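
A minimal NumPy sketch of the $k$NN prediction rule just described (binary labels in {-1, +1} and a brute-force distance computation are assumptions of this sketch):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the majority label among the k nearest training points to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return np.sign(y_train[nearest].sum())        # majority vote for labels in {-1, +1}
```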

$k$NN Pros and Cons. Pros: Intuitive / explainable. No training / retraining. Provably near-optimal in terms of $E_{\text{out}}$. Cons: Computationally expensive: always needs to store all data, $O(Nd)$ memory, and computing $g(x)$ requires computing $d(x, x_n)$ for $n = 1, \dots, N$ and finding the $k$ closest points, $O(Nd + N \log k)$ time. Suffers from the curse of dimensionality.

Curse of Dimensionality: The fundamental assumption of $k$NN is that similar points, or points close to one another, should have the same label; the closer two points are, the more confident we can be that they will have the same label. As the number of dimensions of the input grows, the less likely it is that two random points will be close, and the more points it takes to cover the input space.

Curing the Curse of Dimensionality: More data. Fewer dimensions. Blessing of non-uniformity: data from the real world is rarely uniformly distributed across the input space.

Computational Cost of $k$NN: No training required! Memory: $O(Nd)$. Computing $g(x)$: $O(Nd + N \log k)$. Idea: preprocess inputs in order to speed up predictions, either by reducing the number of inputs held in memory (eliminating redundancies) or by organizing inputs in data structures that make searching for nearest neighbors more efficient.

Data Condensing: Reduce the number of inputs while maintaining the same predictions on all inputs. Let $g_S$ be the $k$NN hypothesis when trained on $S$. $S \subseteq D$ is training-set consistent if $g_S(x_n) = g_D(x_n)$ for all $(x_n, y_n) \in D$. Training-set consistency is a much weaker constraint than decision-boundary consistency. (Figures: an example 2D dataset and a condensed subset that produces the same predictions on all training points.)

Organizing the Inputs: Intuition: split the inputs into clusters, groups of points that are close to one another but far from other groups. If an input point is really close to one group of points and really far from the other groups, then we can skip searching through the other groups and just look for nearest neighbors in the close group! We want cluster centers to be far apart and cluster radii to be small.

Radial Basis Functions (RBFs): $k$NN only considers some points and weights them equally; RBFs consider all points but weight them unequally. Intuition: all points are useful, but some points are more useful than others! Bonus: no need to choose $k$. $g(x) = \mathrm{sign}\!\left( \dfrac{\sum_{n=1}^{N} y_n \, e^{-\|x - x_n\|^2 / r}}{\sum_{n=1}^{N} e^{-\|x - x_n\|^2 / r}} \right)$ for some scale (radius) $r$.
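
A short NumPy sketch of the RBF prediction rule above (the scale parameter `r`, the function name, and binary {-1, +1} labels are assumptions of this sketch):

```python
import numpy as np

def rbf_predict(X_train, y_train, x, r):
    """Weight every training label by a Gaussian bump centered at its input."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)   # ||x - x_n||^2 for all n
    weights = np.exp(-sq_dists / r)                 # closer points get larger weights
    return np.sign(weights @ y_train / weights.sum())
```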

Maximal Margin Linear Separators: The margin of a separating hyperplane is the distance between the hyperplane and the nearest training point. Questions: How can we efficiently find a maximal-margin linear separator? Why are linear separators with larger margins better? What can we do if the data is not linearly separable?

Maximizing the Margin: $\min_{w, b} \ \frac{1}{2} w^T w$ subject to $y_n \left(w^T x_n + b\right) \ge 1 \ \forall (x_n, y_n) \in D$. This optimization problem can be solved (approximately) using quadratic programming (QP) in $O(d^3)$ time. Let $H_\rho$ = linear separators with minimum margin $\rho$. If the input space is a $d$-dimensional sphere of radius $R$, then $d_{VC}(H_\rho) \le \min\!\left(d, \left\lceil \frac{R^2}{\rho^2} \right\rceil\right) + 1$.
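
A sketch of the primal QP above using CVXPY (assuming CVXPY is available, `X` is an N x d array, and labels `y` are in {-1, +1}; this is an illustration, not the course's prescribed solver):

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve: minimize (1/2) w^T w  subject to  y_n (w^T x_n + b) >= 1 for all n.
    Assumes the data is linearly separable; otherwise the problem is infeasible."""
    N, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```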

Linearly Inseparable Data: What can we do if the data is not linearly separable? Accept some non-zero in-sample error (how much in-sample error should we tolerate?), or apply a non-linear transformation that shifts the data into a space where it is linearly separable (how can we pick a non-linear transformation?).

Soft-Margin SVMs: $\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n$ subject to $y_n \left(w^T x_n + b\right) \ge 1 - \xi_n \ \forall (x_n, y_n) \in D$ and $\xi_n \ge 0 \ \forall n \in \{1, \dots, N\}$. $\xi_n$ is the soft error on the $n$-th training point: if $\xi_n > 1$, then $y_n \left(w^T x_n + b\right) < 0$ and $(x_n, y_n)$ is incorrectly classified; if $0 < \xi_n < 1$, then $y_n \left(w^T x_n + b\right) > 0$ and $(x_n, y_n)$ is correctly classified but inside the margin. $\sum_{n=1}^{N} \xi_n$ is the soft in-sample error.

Nonlinear Dual SVMs: Decide on a transformation $\Phi: \mathcal{X} \to \mathcal{Z}$. Find a maximal-margin separating hyperplane in the transformed space, $(\tilde{w}, \tilde{b})$, by solving the QP: $\min_{\alpha} \ \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m \, \Phi(x_n)^T \Phi(x_m) - \sum_{n=1}^{N} \alpha_n$ subject to $\sum_{n=1}^{N} \alpha_n y_n = 0$ and $\alpha_n \ge 0 \ \forall n \in \{1, \dots, N\}$. Return the corresponding predictor in the original space: $g(x) = \mathrm{sign}\!\left( \sum_{n \,:\, \alpha_n > 0} \alpha_n y_n \, \Phi(x_n)^T \Phi(x) + \tilde{b} \right)$.

Perceptrons vs. SVMs:
- Perceptrons: in a low-dimensional input space, $E_{\text{in}}$ is high but generalization is good; in a high-dimensional input space, $E_{\text{in}}$ is low but generalization is bad.
- SVMs: in a low-dimensional input space, $E_{\text{in}}$ is high but generalization is good; in a high-dimensional input space, $E_{\text{in}}$ is low and generalization is okay.
$d_{VC}(H) = d + 1$ vs. $d_{VC}(H_\rho) \le \min\!\left(d, \left\lceil \frac{R^2}{\rho^2} \right\rceil\right) + 1$.

Efficiency: Depending on the transformation $\Phi$ and the dimensionality $d$ of the original input space, computing $\Phi(x)$ can be computationally expensive; computing the order-$Q$ polynomial transform $\Phi_Q(x)$ requires $O(d^Q)$ time. High-dimensional transformations can result in good hypotheses (as long as they don't overfit), but high-dimensional transformations are expensive. Approach: instead of computing $\Phi(x)$, find a function $K_\Phi$ such that $K_\Phi(x, x') = \Phi(x)^T \Phi(x') \ \forall x, x' \in \mathcal{X}$.
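
As a quick illustration of the kernel trick just described, the sketch below (NumPy; the degree and the test vectors are arbitrary) checks that the polynomial kernel $K(x, z) = (1 + x^T z)^Q$ matches the inner product of an explicit degree-2 transform without ever forming $\Phi(x)$; the particular scaled transform shown is one standard construction and an assumption of this sketch:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_kernel(x, z, Q):
    """K(x, z) = (1 + x^T z)^Q, computed in O(d) time."""
    return (1.0 + x @ z) ** Q

def phi2(x):
    """Explicit degree-2 transform whose inner product equals (1 + x^T z)^2."""
    d = len(x)
    quad = [x[i] * x[j] * (1.0 if i == j else np.sqrt(2.0))
            for i, j in combinations_with_replacement(range(d), 2)]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, quad))

x = np.array([0.5, -1.0, 2.0])
z = np.array([1.5, 0.25, -0.5])
print(poly_kernel(x, z, Q=2), phi2(x) @ phi2(z))   # the two numbers agree
```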

Nonlinear Dual SVMs (kernelized): Decide on a (valid) kernel function $K_\Phi$. Find a maximal-margin separating hyperplane in the transformed space, $(\tilde{w}, \tilde{b})$, by solving the QP: $\min_{\alpha} \ \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m \, K_\Phi(x_n, x_m) - \sum_{n=1}^{N} \alpha_n$ subject to $\sum_{n=1}^{N} \alpha_n y_n = 0$ and $\alpha_n \ge 0 \ \forall n \in \{1, \dots, N\}$. Return the corresponding predictor in the original space: $g(x) = \mathrm{sign}\!\left( \sum_{n \,:\, \alpha_n > 0} \alpha_n y_n \, K_\Phi(x_n, x) + \tilde{b} \right)$.

(Figure: the decision regions of two linear separators, $h_1$ and $h_2$, and a target built by combining them with logical operations, $f(x) = \mathrm{OR}\!\left(\mathrm{AND}\!\left(h_1(x), \overline{h_2(x)}\right), \mathrm{AND}\!\left(\overline{h_1(x)}, h_2(x)\right)\right)$.)

Building a Network. (Figure: a small multilayer network of perceptrons implementing $f(x) = \mathrm{OR}\!\left(\mathrm{AND}\!\left(h_1(x), \overline{h_2(x)}\right), \mathrm{AND}\!\left(\overline{h_1(x)}, h_2(x)\right)\right)$, using fixed combination weights such as $\pm 1$ and $\pm 1.5$ for the AND / OR units.)

Feed-Forward Neural Network (NN): Replace the hard sign function with a soft, differentiable approximation, $\theta$. (Figure: a feed-forward network with bias nodes, hidden layers of $\theta$ units, and a single output $h(x)$.)

Architecture: The architecture of a NN is the vector of layer dimensionalities: $\mathbf{d} = \left[d^{(0)}, d^{(1)}, \dots, d^{(L)}\right]$. The NN has $L$ layers: $L - 1$ hidden layers and 1 output layer. Layer $l$ has dimension $d^{(l)}$, i.e., layer $l$ has $d^{(l)} + 1$ nodes, counting the bias node. Every architecture corresponds to a hypothesis set; a hypothesis is specified by setting all the weights.

Weights, Signals, and Outputs: The weights between layer $l - 1$ and layer $l$ form a matrix $W^{(l)} \in \mathbb{R}^{\left(d^{(l-1)} + 1\right) \times d^{(l)}}$; $W^{(l)}_{ij}$ is the weight between node $i$ in layer $l - 1$ and node $j$ in layer $l$. Every node has an incoming signal, $s^{(l)}_j$, and an outgoing output, $x^{(l)}_j$: $x^{(l)} = \begin{bmatrix} 1 \\ \theta\!\left(s^{(l)}\right) \end{bmatrix}$ and $s^{(l)} = \left(W^{(l)}\right)^T x^{(l-1)}$.

Forward Propagation: Input: weights $W^{(1)}, \dots, W^{(L)}$ and a query point $x$. Initialize $x^{(0)} = \begin{bmatrix} 1 \\ x \end{bmatrix}$. For $l = 1, \dots, L$: $s^{(l)} = \left(W^{(l)}\right)^T x^{(l-1)}$ and $x^{(l)} = \begin{bmatrix} 1 \\ \theta\!\left(s^{(l)}\right) \end{bmatrix}$. Output: $x^{(1)}, \dots, x^{(L)}$.
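
A minimal NumPy sketch of the forward pass above, using $\theta = \tanh$ and prepending the bias entry at every layer; the list-of-matrices representation of the weights and the function name are assumptions of this sketch:

```python
import numpy as np

def forward_propagation(weights, x):
    """weights[l-1] is W^(l) with shape (d^(l-1) + 1, d^(l)); returns x^(1), ..., x^(L)."""
    outputs = []
    x_prev = np.concatenate(([1.0], x))               # x^(0) = [1; x]
    for W in weights:
        s = W.T @ x_prev                              # incoming signals s^(l)
        x_prev = np.concatenate(([1.0], np.tanh(s)))  # outgoing outputs x^(l) = [1; theta(s^(l))]
        outputs.append(x_prev)
    return outputs
```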

Backpropagation: Input: weights $W^{(1)}, \dots, W^{(L)}$ and a training point $(x, y)$. Run forward propagation to get $x^{(1)}, \dots, x^{(L)}$. Initialize $\delta^{(L)} = 2\left(x^{(L)}_1 - y\right)\left(1 - \left(x^{(L)}_1\right)^2\right)$. For $l = L - 1, \dots, 1$: compute $\delta^{(l)} = \left(1 - x^{(l)} \otimes x^{(l)}\right) \otimes \left[W^{(l+1)} \delta^{(l+1)}\right]$, dropping the bias component. Output: $\delta^{(1)}, \dots, \delta^{(L)}$.

Computing Gradients: Input: $W^{(1)}, \dots, W^{(L)}$ and $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$. Initialize $E_{\text{in}} = 0$ and $G^{(l)} = 0 \cdot W^{(l)}$ for $l = 1, \dots, L$. For $n = 1, \dots, N$:
- Run forward propagation to get $x^{(1)}, \dots, x^{(L)}$.
- Run backpropagation to get $\delta^{(1)}, \dots, \delta^{(L)}$.
- Increment $E_{\text{in}}$: $E_{\text{in}} = E_{\text{in}} + \frac{1}{N}\left(x^{(L)}_1 - y_n\right)^2$.
- For $l = 1, \dots, L$: compute $G^{(l)}_n = x^{(l-1)} \left(\delta^{(l)}\right)^T$ and increment $G^{(l)}$: $G^{(l)} = G^{(l)} + \frac{1}{N} G^{(l)}_n$.
Output: $G^{(1)}, \dots, G^{(L)}$, the gradients of $E_{\text{in}}$ with respect to $W^{(1)}, \dots, W^{(L)}$.
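
The sketch below strings the two passes together for one training point, mirroring the per-point gradient $G^{(l)}_n = x^{(l-1)} \left(\delta^{(l)}\right)^T$ above; tanh units, squared error, a single real-valued output, and the function name are assumptions of this sketch:

```python
import numpy as np

def point_gradients(weights, x, y):
    """Gradients of the squared error on one point (x, y) w.r.t. each weight matrix W^(l)."""
    L = len(weights)
    # Forward pass: xs[l] = x^(l) with the bias entry prepended, theta = tanh.
    xs = [np.concatenate(([1.0], x))]
    for W in weights:
        xs.append(np.concatenate(([1.0], np.tanh(W.T @ xs[-1]))))
    # Output-layer sensitivity: delta^(L) = 2 (x1^(L) - y) (1 - (x1^(L))^2).
    out = xs[L][1]
    deltas = [None] * (L + 1)
    deltas[L] = np.array([2.0 * (out - y) * (1.0 - out ** 2)])
    # Hidden-layer sensitivities; back[1:] drops the bias component of W^(l+1) delta^(l+1).
    for l in range(L - 1, 0, -1):
        back = weights[l] @ deltas[l + 1]
        deltas[l] = (1.0 - xs[l][1:] ** 2) * back[1:]
    # Per-point gradients: G^(l) = x^(l-1) (delta^(l))^T.
    return [np.outer(xs[l - 1], deltas[l]) for l in range(1, L + 1)]
```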

Complexity: Both forward and backpropagation contain matrix multiplications involving $W^{(1)}, \dots, W^{(L)}$; both take time $O\!\left(\left|W^{(1)}\right| + \dots + \left|W^{(L)}\right|\right)$. Computing $G^{(1)}, \dots, G^{(L)}$ requires running forward and backpropagation for each training point, so each iteration of gradient descent for a neural network takes time $O\!\left(N\left(\left|W^{(1)}\right| + \dots + \left|W^{(L)}\right|\right)\right)$. Use stochastic gradient descent instead! Also use parallelization and GPUs / TPUs!

Stochastic Gradient Descent for Neural Networks: Input: $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$ and a step size $\eta$. Initialize all weights, $W^{(1)}_0, \dots, W^{(L)}_0$, to small, random numbers and set $i = 0$. While some termination condition is not satisfied:
- Randomly select a point $(x_n, y_n) \in D$.
- Compute $G^{(l)}_i = \nabla_{W^{(l)}} \, e\!\left(h\!\left(x_n; W^{(1)}_i, \dots, W^{(L)}_i\right), y_n\right)$ for $l = 1, \dots, L$.
- Update the weights: $W^{(l)}_{i+1} = W^{(l)}_i - \eta \, G^{(l)}_i$.
- Increment $i$: $i = i + 1$.
Output: $W^{(1)}_i, \dots, W^{(L)}_i$.
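
A minimal training-loop sketch tying the pieces together (the function and parameter names are hypothetical, `grad_fn` stands in for a per-point gradient routine such as the earlier point_gradients sketch, a fixed iteration budget serves as the termination condition, and small Gaussian initialization follows the next slide):

```python
import numpy as np

def sgd_train(D, layer_dims, grad_fn, eta=0.1, iterations=10000, seed=0):
    """layer_dims = [d0, d1, ..., dL]; grad_fn(weights, x, y) returns one gradient per layer."""
    rng = np.random.default_rng(seed)
    # Initialize each W^(l) in R^{(d^(l-1)+1) x d^(l)} with small random numbers.
    weights = [0.1 * rng.normal(size=(layer_dims[l - 1] + 1, layer_dims[l]))
               for l in range(1, len(layer_dims))]
    for _ in range(iterations):
        x, y = D[rng.integers(len(D))]                   # randomly select a training point
        grads = grad_fn(weights, x, y)
        weights = [W - eta * G for W, G in zip(weights, grads)]  # SGD update
    return weights
```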

Initialization and Termination: Initialization: randomness is good for non-convex optimization; initialize weights by sampling from $\mathcal{N}(0, \sigma^2)$. Termination: for complicated surfaces, the gradient's magnitude is not a good metric for proximity to a minimum. A simple solution: combine multiple termination criteria, e.g., stop if enough iterations have passed and the improvement in error is small.