CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18


Overfitting: Overfitting is fitting the training data more than is warranted, i.e., fitting noise rather than signal.

Estimating $E_{\text{out}}$ instead of $E_{\text{in}}$: $E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}$. Regularization estimates this quantity (the overfit penalty).

Regularization: Constrain hypothesis sets to prevent them from being able to fit noise. Learning algorithms are optimization problems, and regularization imposes constraints on that optimization.

Regularization: minimize $E_{\text{aug}}(w, \lambda) = E_{\text{in}}(w) + \frac{\lambda}{N}\,\Omega(w)$. Ridge: $\Omega(w) = \sum_{q=0}^{d} w_q^2$. Low order: $\Omega(w) = \sum_{q=0}^{d} q\,w_q^2$. Lasso: $\Omega(w) = \sum_{q=0}^{d} |w_q|$.
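
As a concrete illustration of the ridge penalty above, here is a minimal NumPy sketch (the data arrays, the function name, and the choice of `lam` are hypothetical) showing how minimizing the augmented error with a squared-weight regularizer leads to the familiar closed-form ridge solution:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize E_aug(w) = (1/N)||Xw - y||^2 + (lam/N) w^T w.
    Setting the gradient to zero gives w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical usage: 100 noisy points in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)
w = ridge_fit(X, y, lam=1.0)
```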

Estimating $E_{\text{out}}$ instead of $E_{\text{in}}$: $E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}$. Validation estimates this quantity ($E_{\text{out}}(h)$ itself).

Test sets: Estimate $E_{\text{out}}(g)$ using the error on some test dataset $D_{\text{test}}$: $E_{\text{test}}(g)$. If $D_{\text{test}}$ is not involved in the training process, then $\mathbb{P}\!\left[\left|E_{\text{test}}(g) - E_{\text{out}}(g)\right| > \epsilon\right] \le 2e^{-2\epsilon^2 K}$, where $K = |D_{\text{test}}|$.

Picking $K$: More test data leads to a tighter bound on $E_{\text{out}}(g^-)$, but fewer training data generally means the learned $g^-$ is worse, i.e., $E_{\text{out}}(g^-)$ tends to increase as $N - K$ decreases. With high probability, $E_{\text{out}}(g) \le E_{\text{out}}(g^-) \le E_{\text{test}}(g^-) + O\!\left(\tfrac{1}{\sqrt{K}}\right)$. Return $g$, but bound $E_{\text{out}}(g)$ using $E_{\text{test}}(g^-) + O\!\left(\tfrac{1}{\sqrt{K}}\right)$. Practical rule of thumb: $K = \tfrac{N}{5}$.
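
To make the test-set guarantee concrete, here is a small sketch (NumPy only; the function name and the values of `K` and `delta` are hypothetical) that inverts the Hoeffding bound above to get the error bar $\epsilon$ attached to a test error measured on $K$ points:

```python
import numpy as np

def hoeffding_epsilon(K, delta):
    """Smallest epsilon such that P[|E_test - E_out| > epsilon] <= delta,
    using the single-hypothesis bound 2 * exp(-2 * epsilon^2 * K) <= delta."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * K))

# Hypothetical usage: a 1000-point test set with a 95% guarantee.
print(hoeffding_epsilon(K=1000, delta=0.05))  # roughly 0.043
```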

Validation set: $D_{\text{train}}$ is used to build a finite set of candidate hypotheses: $H_{\text{val}} = \{g_1^-, g_2^-, \dots, g_M^-\}$. $D_{\text{val}}$ is used to select the hypothesis $g_{m^*}^-$ from $H_{\text{val}}$. By Hoeffding's bound for a finite hypothesis set, with high probability $E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\tfrac{\ln M}{K}}\right)$.

$E_{\text{in}}$ vs. $E_{\text{val}}$ vs. $E_{\text{test}}$:
- $E_{\text{in}}$: incredibly biased; related to $E_{\text{out}}$ by the VC bound.
- $E_{\text{val}}$: slightly biased; related to $E_{\text{out}}$ by Hoeffding's bound (multiple hypotheses).
- $E_{\text{test}}$: not biased; related to $E_{\text{out}}$ by Hoeffding's bound (single hypothesis).

Three Learning Principles:
- Occam's Razor: the simplest model that fits the data is also the most plausible.
- Sampling Bias: if the data is sampled in a biased way, learning will produce a similarly biased outcome.
- Data Snooping: if a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

Decision Tree: Example. (Figure: a decision tree that splits on features such as Tired / Not Tired, Backpack / Lunchbox / Both, and Rain / No Rain, with leaves predicting Metro, Bike, or Drive.)

ID3 Learning Algorithm: Initialize the tree as a single leaf that contains all labels. While there is an impure leaf (not all labels are the same):
- Pick an arbitrary impure leaf.
- Find the feature, $x_i$, with the largest information gain relative to the labels in that leaf.
- Create a child (or split) for each unique value of $x_i$.
- Assign each label in the original leaf to one of its children depending on its corresponding $x_i$ value. The original leaf is no longer a leaf; all of its children are new leaves.
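
The split criterion above relies on entropy and information gain. Here is a minimal NumPy sketch (the function names and the assumption of categorical features are my own) of how the best split feature could be chosen:

```python
import numpy as np

def entropy(labels):
    """H(p) = -sum_c p_c * log2(p_c) over the label distribution in a leaf."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Entropy of the leaf minus the weighted entropy of the children
    created by splitting on each unique feature value."""
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def best_split_feature(X, y):
    """Return the column index with the largest information gain."""
    return max(range(X.shape[1]), key=lambda i: information_gain(X[:, i], y))
```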

Decision Tree / ID3 Pros: Intuitive / explainable. Can handle categorical and real-valued features. Automatically performs feature selection. The ID3 algorithm has a preference for shorter trees (simpler hypotheses).

Decision Tree / ID3 Cons: The ID3 algorithm is greedy, so there is no optimality guarantee. Overfitting! Heuristics ("regularization"): do not split leaves that are past a fixed depth $d$, that have fewer than $k$ labels, or where the maximal information gain is less than $\tau$. Pruning ("validation"): evaluate each split using a validation set and remove the one that most improves the validation error.

Bagging: Short for bootstrap aggregating. Combines the predictions of many independent hypotheses to reduce variance. Bootstrapping: a statistical method for estimating properties of a distribution, given (potentially a small number of) samples from that distribution; relies on resampling the samples with replacement many, many times. Aggregating: combining multiple hypotheses, $h_1, h_2, \dots, h_T$, to arrive at a single hypothesis.

Split-Feature Randomization: Predictions made by trees trained on similar datasets are highly correlated. To decorrelate these predictions, randomly limit the features available at each iteration of the ID3 algorithm: every time the ID3 algorithm goes to split an impure leaf, randomly select $k < d$ features and only allow the algorithm to split on one of those $k$ features. For classification, a common choice is $k = \sqrt{d}$; for regression, a common choice is $k = d/3$.

Random Forests: Input: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, $T$, $k$.
For $t = 1, 2, \dots, T$:
- Create a dataset, $D_t$, by sampling $N$ points from $D$ with replacement.
- Learn a decision tree, $h_t$, using $D_t$ and the ID3 algorithm with split-feature randomization.
Output: $\bar{h}$, the aggregated hypothesis.
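
A compact sketch of the loop above, assuming scikit-learn's DecisionTreeClassifier as a stand-in for ID3 (its `max_features="sqrt"` option plays the role of split-feature randomization); the function names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T, seed=0):
    """Train T trees, each on a bootstrap sample, with feature-subsampled splits."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=len(X))    # sample N points with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    """Aggregate by majority vote over the individual tree predictions."""
    votes = np.stack([tree.predict(X) for tree in trees])
    return np.sign(votes.sum(axis=0))                 # assumes labels in {-1, +1}
```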

Boosting: Another ensemble method (like bagging) that combines the predictions of multiple hypotheses. Aims to reduce the bias of a weak or highly biased hypothesis set (it can also reduce variance). Intuition: iteratively reweight inputs, giving more weight to inputs that are difficult to predict correctly. Fundamentally requires that we have access to weak learners that are better than random chance.

AdaBoost: Input: $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$ with $y_n \in \{-1, +1\}$, and $T$.
Initialize input weights: $w_1^{(0)} = \dots = w_N^{(0)} = \frac{1}{N}$.
For $t = 1, \dots, T$:
1. Train a weak learner (hypothesis), $h_t$, by minimizing the weighted training error.
2. Compute the weighted training error of $h_t$: $\epsilon_t = \sum_{n=1}^{N} w_n^{(t-1)} \, \mathbb{1}\!\left[h_t(x_n) \ne y_n\right]$.
3. Compute the importance of $h_t$: $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
4. Update the weights: $w_n^{(t)} = \frac{w_n^{(t-1)}}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_n) = y_n \\ e^{\alpha_t} & \text{if } h_t(x_n) \ne y_n \end{cases} = \frac{w_n^{(t-1)} \, e^{-\alpha_t y_n h_t(x_n)}}{Z_t}$, where $Z_t$ normalizes the weights.
Output: an aggregated hypothesis $H(x) = \mathrm{sign}\!\left(F(x)\right) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
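
Here is a minimal NumPy sketch of the loop above, using a scikit-learn depth-1 decision tree as the weak learner; the function names and the small epsilon guard in the log are hypothetical additions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T):
    """AdaBoost with decision stumps; labels y are assumed to be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                               # initial input weights
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                          # weighted training error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12)) # importance of h_t
        w = w * np.exp(-alpha * y * pred)                 # reweight the inputs
        w = w / w.sum()                                   # normalize (the Z_t step)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Aggregated hypothesis: sign of the weighted vote of the weak learners."""
    F = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(F)
```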

Why AdaBoost?
1. If you only have access to weak learners (because of computational constraints),
2. and want your final hypothesis to be a weighted combination of weak learners (because weak learners are not great on their own),
3. then AdaBoost greedily minimizes the exponential loss (because the exponential loss upper bounds binary error): $\ell(h, x, y) = e^{-y\,h(x)}$.

Nearest Neighbor Intuition: Classify a point as the label of the most similar training point. Use Euclidean distance as the similarity metric: $d(x, x') = \|x - x'\| = \sqrt{\sum_{i=1}^{d} (x_i - x_i')^2}$.

The Nearest Neighbor Hypothesis. (Figure: the decision regions of the nearest neighbor hypothesis $g(x)$, which returns the label of the training point closest to $x$, on an example 2D dataset.)

Generalization of Nearest Neighbor: Claim: $E_{\text{out}}$ for the nearest neighbor hypothesis is not much worse than the best possible $E_{\text{out}}$! Formally: with high probability, $E_{\text{out}}(g_{\text{NN}}) \le 2\,E_{\text{out}}(g^*)$ as $N \to \infty$. Interpretation: half of the data's predictive power is in the nearest neighbor!

$k$-Nearest Neighbors ($k$NN): Classify a point as the most common label among the labels of the $k$ nearest training points.
- When $k = 1$, $g$ is the nearest neighbor hypothesis: complicated decision boundaries; may overfit.
- When $k = N$, $g$ always predicts the most common label in the training dataset: no decision boundaries; may underfit.
$k$ controls the complexity of the hypothesis set, so $k$ affects how well the learned hypothesis will generalize. Practical rules of thumb: $k = 3$, $k = \sqrt{N}$, or pick $k$ by cross-validation.
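
A minimal NumPy sketch of the $k$NN prediction rule just described (binary labels in {-1, +1} and a brute-force distance computation are assumptions of this sketch):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the majority label among the k nearest training points to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return np.sign(y_train[nearest].sum())        # majority vote for labels in {-1, +1}
```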

$k$NN Pros and Cons. Pros: Intuitive / explainable. No training / retraining. Provably near-optimal in terms of $E_{\text{out}}$. Cons: Computationally expensive: always needs to store all data, $O(Nd)$ memory, and computing $g(x)$ requires computing $d(x, x_n)$ for $n = 1, \dots, N$ and finding the $k$ closest points, $O(Nd + N \log k)$ time. Suffers from the curse of dimensionality.

Curse of Dimensionality: The fundamental assumption of $k$NN is that similar points, or points close to one another, should have the same label; the closer two points are, the more confident we can be that they will have the same label. As the number of dimensions of the input grows, the less likely it is that two random points will be close, and the more points it takes to cover the input space.

Curing the Curse of Dimensionality: More data. Fewer dimensions. Blessing of non-uniformity: data from the real world is rarely uniformly distributed across the input space.

Computational Cost of $k$NN: No training required! Memory: $O(Nd)$. Computing $g(x)$: $O(Nd + N \log k)$. Idea: preprocess inputs in order to speed up predictions, either by reducing the number of inputs held in memory (eliminating redundancies) or by organizing inputs in data structures that make searching for nearest neighbors more efficient.

Data Condensing: Reduce the number of inputs while maintaining the same predictions on all inputs. Let $g_S$ be the $k$NN hypothesis when trained on $S$. $S \subseteq D$ is training-set consistent if $g_S(x_n) = g_D(x_n)$ for all $(x_n, y_n) \in D$. Training-set consistency is a much weaker constraint than decision-boundary consistency. (Figures: an example 2D dataset and a condensed subset that produces the same predictions on all training points.)

Organizing the Inputs: Intuition: split the inputs into clusters, groups of points that are close to one another but far from other groups. If an input point is really close to one group of points and really far from the other groups, then we can skip searching through the other groups and just look for nearest neighbors in the close group! We want cluster centers to be far apart and cluster radii to be small.

Radial Basis Functions (RBFs): $k$NN only considers some points and weights them equally; RBFs consider all points but weight them unequally. Intuition: all points are useful, but some points are more useful than others! Bonus: no need to choose $k$. $g(x) = \mathrm{sign}\!\left( \dfrac{\sum_{n=1}^{N} y_n \, e^{-\|x - x_n\|^2 / r}}{\sum_{n=1}^{N} e^{-\|x - x_n\|^2 / r}} \right)$ for some scale (radius) $r$.
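
A short NumPy sketch of the RBF prediction rule above (the scale parameter `r`, the function name, and binary {-1, +1} labels are assumptions of this sketch):

```python
import numpy as np

def rbf_predict(X_train, y_train, x, r):
    """Weight every training label by a Gaussian bump centered at its input."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)   # ||x - x_n||^2 for all n
    weights = np.exp(-sq_dists / r)                 # closer points get larger weights
    return np.sign(weights @ y_train / weights.sum())
```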

Maximal Margin Linear Separators: The margin of a separating hyperplane is the distance between the hyperplane and the nearest training point. Questions: How can we efficiently find a maximal-margin linear separator? Why are linear separators with larger margins better? What can we do if the data is not linearly separable?

Maximizing the Margin: $\min_{w, b} \ \frac{1}{2} w^T w$ subject to $y_n \left(w^T x_n + b\right) \ge 1 \ \forall (x_n, y_n) \in D$. This optimization problem can be solved (approximately) using quadratic programming (QP) in $O(d^3)$ time. Let $H_\rho$ = linear separators with minimum margin $\rho$. If the input space is a $d$-dimensional sphere of radius $R$, then $d_{VC}(H_\rho) \le \min\!\left(d, \left\lceil \frac{R^2}{\rho^2} \right\rceil\right) + 1$.
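
A sketch of the primal QP above using CVXPY (assuming CVXPY is available, `X` is an N x d array, and labels `y` are in {-1, +1}; this is an illustration, not the course's prescribed solver):

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve: minimize (1/2) w^T w  subject to  y_n (w^T x_n + b) >= 1 for all n.
    Assumes the data is linearly separable; otherwise the problem is infeasible."""
    N, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```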

Linearly Inseparable Data: What can we do if the data is not linearly separable? Accept some non-zero in-sample error (how much in-sample error should we tolerate?), or apply a non-linear transformation that shifts the data into a space where it is linearly separable (how can we pick a non-linear transformation?).

Soft-Margin SVMs: $\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n$ subject to $y_n \left(w^T x_n + b\right) \ge 1 - \xi_n \ \forall (x_n, y_n) \in D$ and $\xi_n \ge 0 \ \forall n \in \{1, \dots, N\}$. $\xi_n$ is the soft error on the $n$-th training point: if $\xi_n > 1$, then $y_n \left(w^T x_n + b\right) < 0$ and $(x_n, y_n)$ is incorrectly classified; if $0 < \xi_n < 1$, then $y_n \left(w^T x_n + b\right) > 0$ and $(x_n, y_n)$ is correctly classified but inside the margin. $\sum_{n=1}^{N} \xi_n$ is the soft in-sample error.

Nonlinear Dual SVMs: Decide on a transformation $\Phi: \mathcal{X} \to \mathcal{Z}$. Find a maximal-margin separating hyperplane in the transformed space, $(\tilde{w}, \tilde{b})$, by solving the QP: $\min_{\alpha} \ \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m \, \Phi(x_n)^T \Phi(x_m) - \sum_{n=1}^{N} \alpha_n$ subject to $\sum_{n=1}^{N} \alpha_n y_n = 0$ and $\alpha_n \ge 0 \ \forall n \in \{1, \dots, N\}$. Return the corresponding predictor in the original space: $g(x) = \mathrm{sign}\!\left( \sum_{n \,:\, \alpha_n > 0} \alpha_n y_n \, \Phi(x_n)^T \Phi(x) + \tilde{b} \right)$.

Perceptrons vs. SVMs:
- Perceptrons: in a low-dimensional input space, $E_{\text{in}}$ is high but generalization is good; in a high-dimensional input space, $E_{\text{in}}$ is low but generalization is bad.
- SVMs: in a low-dimensional input space, $E_{\text{in}}$ is high but generalization is good; in a high-dimensional input space, $E_{\text{in}}$ is low and generalization is okay.
$d_{VC}(H) = d + 1$ vs. $d_{VC}(H_\rho) \le \min\!\left(d, \left\lceil \frac{R^2}{\rho^2} \right\rceil\right) + 1$.

Efficiency: Depending on the transformation $\Phi$ and the dimensionality $d$ of the original input space, computing $\Phi(x)$ can be computationally expensive; computing the order-$Q$ polynomial transform $\Phi_Q(x)$ requires $O(d^Q)$ time. High-dimensional transformations can result in good hypotheses (as long as they don't overfit), but high-dimensional transformations are expensive. Approach: instead of computing $\Phi(x)$, find a function $K_\Phi$ such that $K_\Phi(x, x') = \Phi(x)^T \Phi(x') \ \forall x, x' \in \mathcal{X}$.
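
As a quick illustration of the kernel trick just described, the sketch below (NumPy; the degree and the test vectors are arbitrary) checks that the polynomial kernel $K(x, z) = (1 + x^T z)^Q$ matches the inner product of an explicit degree-2 transform without ever forming $\Phi(x)$; the particular scaled transform shown is one standard construction and an assumption of this sketch:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_kernel(x, z, Q):
    """K(x, z) = (1 + x^T z)^Q, computed in O(d) time."""
    return (1.0 + x @ z) ** Q

def phi2(x):
    """Explicit degree-2 transform whose inner product equals (1 + x^T z)^2."""
    d = len(x)
    quad = [x[i] * x[j] * (1.0 if i == j else np.sqrt(2.0))
            for i, j in combinations_with_replacement(range(d), 2)]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, quad))

x = np.array([0.5, -1.0, 2.0])
z = np.array([1.5, 0.25, -0.5])
print(poly_kernel(x, z, Q=2), phi2(x) @ phi2(z))   # the two numbers agree
```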

Nonlinear Dual SVMs (kernelized): Decide on a (valid) kernel function $K_\Phi$. Find a maximal-margin separating hyperplane in the transformed space, $(\tilde{w}, \tilde{b})$, by solving the QP: $\min_{\alpha} \ \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m \, K_\Phi(x_n, x_m) - \sum_{n=1}^{N} \alpha_n$ subject to $\sum_{n=1}^{N} \alpha_n y_n = 0$ and $\alpha_n \ge 0 \ \forall n \in \{1, \dots, N\}$. Return the corresponding predictor in the original space: $g(x) = \mathrm{sign}\!\left( \sum_{n \,:\, \alpha_n > 0} \alpha_n y_n \, K_\Phi(x_n, x) + \tilde{b} \right)$.

(Figure: the decision regions of two linear separators, $h_1$ and $h_2$, and a target built by combining them with logical operations, $f(x) = \mathrm{OR}\!\left(\mathrm{AND}\!\left(h_1(x), \overline{h_2(x)}\right), \mathrm{AND}\!\left(\overline{h_1(x)}, h_2(x)\right)\right)$.)

Building a Network. (Figure: a small multilayer network of perceptrons implementing $f(x) = \mathrm{OR}\!\left(\mathrm{AND}\!\left(h_1(x), \overline{h_2(x)}\right), \mathrm{AND}\!\left(\overline{h_1(x)}, h_2(x)\right)\right)$, using fixed combination weights such as $\pm 1$ and $\pm 1.5$ for the AND / OR units.)

Feed-Forward Neural Network (NN): Replace the hard sign function with a soft, differentiable approximation, $\theta$. (Figure: a feed-forward network with bias nodes, hidden layers of $\theta$ units, and a single output $h(x)$.)

Architecture: The architecture of a NN is the vector of layer dimensionalities: $\mathbf{d} = \left[d^{(0)}, d^{(1)}, \dots, d^{(L)}\right]$. The NN has $L$ layers: $L - 1$ hidden layers and 1 output layer. Layer $l$ has dimension $d^{(l)}$, i.e., layer $l$ has $d^{(l)} + 1$ nodes, counting the bias node. Every architecture corresponds to a hypothesis set; a hypothesis is specified by setting all the weights.

Weights, Signals, and Outputs: The weights between layer $l - 1$ and layer $l$ form a matrix $W^{(l)} \in \mathbb{R}^{\left(d^{(l-1)} + 1\right) \times d^{(l)}}$; $W^{(l)}_{ij}$ is the weight between node $i$ in layer $l - 1$ and node $j$ in layer $l$. Every node has an incoming signal, $s^{(l)}_j$, and an outgoing output, $x^{(l)}_j$: $x^{(l)} = \begin{bmatrix} 1 \\ \theta\!\left(s^{(l)}\right) \end{bmatrix}$ and $s^{(l)} = \left(W^{(l)}\right)^T x^{(l-1)}$.

Forward Propagation: Input: weights $W^{(1)}, \dots, W^{(L)}$ and a query point $x$. Initialize $x^{(0)} = \begin{bmatrix} 1 \\ x \end{bmatrix}$. For $l = 1, \dots, L$: $s^{(l)} = \left(W^{(l)}\right)^T x^{(l-1)}$ and $x^{(l)} = \begin{bmatrix} 1 \\ \theta\!\left(s^{(l)}\right) \end{bmatrix}$. Output: $x^{(1)}, \dots, x^{(L)}$.
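
A minimal NumPy sketch of the forward pass above, using $\theta = \tanh$ and prepending the bias entry at every layer; the list-of-matrices representation of the weights and the function name are assumptions of this sketch:

```python
import numpy as np

def forward_propagation(weights, x):
    """weights[l-1] is W^(l) with shape (d^(l-1) + 1, d^(l)); returns x^(1), ..., x^(L)."""
    outputs = []
    x_prev = np.concatenate(([1.0], x))               # x^(0) = [1; x]
    for W in weights:
        s = W.T @ x_prev                              # incoming signals s^(l)
        x_prev = np.concatenate(([1.0], np.tanh(s)))  # outgoing outputs x^(l) = [1; theta(s^(l))]
        outputs.append(x_prev)
    return outputs
```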

Backpropagation: Input: weights $W^{(1)}, \dots, W^{(L)}$ and a training point $(x, y)$. Run forward propagation to get $x^{(1)}, \dots, x^{(L)}$. Initialize $\delta^{(L)} = 2\left(x^{(L)}_1 - y\right)\left(1 - \left(x^{(L)}_1\right)^2\right)$. For $l = L - 1, \dots, 1$: compute $\delta^{(l)} = \left(1 - x^{(l)} \otimes x^{(l)}\right) \otimes \left[W^{(l+1)} \delta^{(l+1)}\right]$, dropping the bias component. Output: $\delta^{(1)}, \dots, \delta^{(L)}$.

Computing Gradients: Input: $W^{(1)}, \dots, W^{(L)}$ and $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$. Initialize $E_{\text{in}} = 0$ and $G^{(l)} = 0 \cdot W^{(l)}$ for $l = 1, \dots, L$. For $n = 1, \dots, N$:
- Run forward propagation to get $x^{(1)}, \dots, x^{(L)}$.
- Run backpropagation to get $\delta^{(1)}, \dots, \delta^{(L)}$.
- Increment $E_{\text{in}}$: $E_{\text{in}} = E_{\text{in}} + \frac{1}{N}\left(x^{(L)}_1 - y_n\right)^2$.
- For $l = 1, \dots, L$: compute $G^{(l)}_n = x^{(l-1)} \left(\delta^{(l)}\right)^T$ and increment $G^{(l)}$: $G^{(l)} = G^{(l)} + \frac{1}{N} G^{(l)}_n$.
Output: $G^{(1)}, \dots, G^{(L)}$, the gradients of $E_{\text{in}}$ with respect to $W^{(1)}, \dots, W^{(L)}$.
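
The sketch below strings the two passes together for one training point, mirroring the per-point gradient $G^{(l)}_n = x^{(l-1)} \left(\delta^{(l)}\right)^T$ above; tanh units, squared error, a single real-valued output, and the function name are assumptions of this sketch:

```python
import numpy as np

def point_gradients(weights, x, y):
    """Gradients of the squared error on one point (x, y) w.r.t. each weight matrix W^(l)."""
    L = len(weights)
    # Forward pass: xs[l] = x^(l) with the bias entry prepended, theta = tanh.
    xs = [np.concatenate(([1.0], x))]
    for W in weights:
        xs.append(np.concatenate(([1.0], np.tanh(W.T @ xs[-1]))))
    # Output-layer sensitivity: delta^(L) = 2 (x1^(L) - y) (1 - (x1^(L))^2).
    out = xs[L][1]
    deltas = [None] * (L + 1)
    deltas[L] = np.array([2.0 * (out - y) * (1.0 - out ** 2)])
    # Hidden-layer sensitivities; back[1:] drops the bias component of W^(l+1) delta^(l+1).
    for l in range(L - 1, 0, -1):
        back = weights[l] @ deltas[l + 1]
        deltas[l] = (1.0 - xs[l][1:] ** 2) * back[1:]
    # Per-point gradients: G^(l) = x^(l-1) (delta^(l))^T.
    return [np.outer(xs[l - 1], deltas[l]) for l in range(1, L + 1)]
```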

Complexity: Both forward and backpropagation contain matrix multiplications involving $W^{(1)}, \dots, W^{(L)}$; both take time $O\!\left(\left|W^{(1)}\right| + \dots + \left|W^{(L)}\right|\right)$. Computing $G^{(1)}, \dots, G^{(L)}$ requires running forward and backpropagation for each training point, so each iteration of gradient descent for a neural network takes time $O\!\left(N\left(\left|W^{(1)}\right| + \dots + \left|W^{(L)}\right|\right)\right)$. Use stochastic gradient descent instead! Also use parallelization and GPUs / TPUs!

Stochastic Gradient Descent for Neural Networks: Input: $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$ and a step size $\eta$. Initialize all weights, $W^{(1)}_0, \dots, W^{(L)}_0$, to small, random numbers and set $i = 0$. While some termination condition is not satisfied:
- Randomly select a point $(x_n, y_n) \in D$.
- Compute $G^{(l)}_i = \nabla_{W^{(l)}} \, e\!\left(h\!\left(x_n; W^{(1)}_i, \dots, W^{(L)}_i\right), y_n\right)$ for $l = 1, \dots, L$.
- Update the weights: $W^{(l)}_{i+1} = W^{(l)}_i - \eta \, G^{(l)}_i$.
- Increment $i$: $i = i + 1$.
Output: $W^{(1)}_i, \dots, W^{(L)}_i$.
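
A minimal training-loop sketch tying the pieces together (the function and parameter names are hypothetical, `grad_fn` stands in for a per-point gradient routine such as the earlier point_gradients sketch, a fixed iteration budget serves as the termination condition, and small Gaussian initialization follows the next slide):

```python
import numpy as np

def sgd_train(D, layer_dims, grad_fn, eta=0.1, iterations=10000, seed=0):
    """layer_dims = [d0, d1, ..., dL]; grad_fn(weights, x, y) returns one gradient per layer."""
    rng = np.random.default_rng(seed)
    # Initialize each W^(l) in R^{(d^(l-1)+1) x d^(l)} with small random numbers.
    weights = [0.1 * rng.normal(size=(layer_dims[l - 1] + 1, layer_dims[l]))
               for l in range(1, len(layer_dims))]
    for _ in range(iterations):
        x, y = D[rng.integers(len(D))]                   # randomly select a training point
        grads = grad_fn(weights, x, y)
        weights = [W - eta * G for W, G in zip(weights, grads)]  # SGD update
    return weights
```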

Initialization and Termination: Initialization: randomness is good for non-convex optimization; initialize weights by sampling from $\mathcal{N}(0, \sigma^2)$. Termination: for complicated surfaces, the gradient's magnitude is not a good metric for proximity to a minimum. A simple solution: combine multiple termination criteria, e.g., stop if enough iterations have passed and the improvement in error is small.