CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18


1 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18

2 Overfitting: Overfitting is fitting the training data more than is warranted, i.e., fitting noise rather than signal.

3 Estimating E_out instead of E_in: E_out(h) = E_in(h) + overfit penalty. Regularization estimates the overfit penalty term.

4 Regularization: Constrain hypothesis sets to prevent them from being able to fit noise. Learning algorithms are optimization problems, and regularization imposes constraints on that optimization.

5 Regularization: minimize E_aug(w) = E_in(w) + (λ/N) Ω(w), where Ω is the regularizer:
- Ridge: Ω(w) = Σ_q w_q²
- Low order: Ω(w) = Σ_q q w_q²
- Lasso: Ω(w) = Σ_q |w_q|
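To make the regularized optimization concrete, here is a minimal NumPy sketch of ridge-regularized least squares (not from the slides; the toy data, the λ value, and the exact scaling convention are assumptions for illustration):

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Regularized least squares: w = (Z^T Z + lam * I)^(-1) Z^T y,
    which minimizes the squared error plus a ridge penalty on w."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# Toy example: fit a noisy line with and without regularization.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = 2 * x + 0.3 * rng.standard_normal(20)
Z = np.column_stack([np.ones_like(x), x])   # add a bias feature

w_unreg = ridge_fit(Z, y, lam=0.0)
w_ridge = ridge_fit(Z, y, lam=1.0)
print(w_unreg, w_ridge)   # the ridge weights are shrunk toward zero
```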

6 Estimating E_out instead of E_in: E_out(h) = E_in(h) + overfit penalty. Validation estimates E_out(h) directly.

7 Test sets: Estimate E_out(g) using the error on some test dataset D_test: E_test(g). If D_test is not involved in the training process, then Hoeffding's inequality gives P(|E_test(g) − E_out(g)| > ε) ≤ 2e^(−2ε²K), where K = |D_test|.
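As an illustrative calculation (not on the slide), solving 2e^(−2ε²K) = δ for ε shows how the test-set bound tightens as K grows:

```python
import math

def hoeffding_epsilon(K, delta=0.05):
    """Smallest epsilon such that P(|E_test - E_out| > epsilon) <= delta
    for a single hypothesis evaluated on K i.i.d. test points."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * K))

for K in (100, 1000, 10000):
    print(K, round(hoeffding_epsilon(K), 4))
# Larger test sets give tighter bounds: roughly 0.136, 0.043, 0.0136
```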

8 Picking K: More test data leads to a tighter bound on E_out(g^-), but fewer training data generally means the learned g^- is worse, i.e., E_out(g^-) tends to increase as the number of training points decreases. With high probability, E_out(g) ≤ E_out(g^-) ≤ E_test(g^-) + O(1/√K). Return g (trained on all the data) but bound E_out(g) using E_test(g^-) + O(1/√K). Practical rule of thumb: K = N/5.

9 Test sets: Estimate E_out(g) using the error on some test dataset D_test: E_test(g). If D_test is not involved in the training process, then P(|E_test(g) − E_out(g)| > ε) ≤ 2e^(−2ε²K), where K = |D_test|.

10 ! "#$%& is used to build a finite set of candidate hypotheses: H ($) = {, -.,, 0.,,, 2. }.! ($) is used to select the hypothesis from H ($) :, 2 Validation set. 5 6 ($), ", 2 > ; 2 >?.0@A B. 6 ($), 2 C DE 2 B. 6 89", 2. 6 ($), 2 + C DE 2 B with high probability 10

11 ! "# vs.! $%& vs.! '()' Bias! "# Incredibly biased! $%& Slightly biased! '()' Not biased Relationship to * +,- VC-bound Hoeffding s bound (multiple hypotheses) Hoeffding s bound (single hypothesis) 11

12 Three Learning Principles:
- Occam's Razor: the simplest model that fits the data is also the most plausible.
- Sampling Bias: if the data is sampled in a biased way, learning will produce a similarly biased outcome.
- Data Snooping: if a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

13 Decision Tree: Example. (Figure: a decision tree that splits on features such as Tired / Not Tired, Rain / No Rain, and Backpack / Lunchbox, with leaf predictions such as Metro, Bike, and Drive.)

14 ID3 Learning Algorithm:
Initialize the tree as a single leaf that contains all labels.
While there is an impure leaf (not all labels are the same):
- Pick an arbitrary impure leaf.
- Find the feature with the largest information gain relative to the labels in that leaf.
- Create a child (or split) for each unique value of that feature.
- Assign each label in the original leaf to one of its children depending on its corresponding feature value.
- The original leaf is no longer a leaf; all of its children are new leaves.
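A minimal Python sketch of the ID3 loop above, assuming categorical features and entropy-based information gain (the function names and the nested-dict tree representation are made up for illustration; no stopping heuristics are included):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, feature):
    """Reduction in label entropy from splitting on `feature`."""
    base = entropy(labels)
    remainder = 0.0
    for value in set(r[feature] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[feature] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return base - remainder

def id3(rows, labels, features):
    """rows: list of dicts (feature -> value); returns a nested-dict tree."""
    if len(set(labels)) == 1:          # pure leaf: all labels the same
        return labels[0]
    if not features:                   # no features left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {"feature": best, "children": {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree["children"][value] = id3([rows[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [f for f in features if f != best])
    return tree
```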

15 Decision Tree / ID3 Pros:
- Intuitive / explainable.
- Can handle categorical and real-valued features.
- Automatically performs feature selection.
- The ID3 algorithm has a preference for shorter trees (simpler hypotheses).

16 Decision Tree / ID3 Cons:
- The ID3 algorithm is greedy, so there is no optimality guarantee.
- Overfitting! Heuristics ("regularization"): do not split leaves that are past a fixed depth, or that have fewer than some minimum number of labels, or where the maximal information gain is less than some threshold.
- Pruning ("validation"): evaluate each split using a validation set and remove the split whose removal most improves the validation error.

17 Bagging: Short for "bootstrap aggregating"; combines the predictions of many independent hypotheses to reduce variance.
- Bootstrapping: a statistical method for estimating properties of a distribution, given (potentially a small number of) samples from that distribution. Relies on resampling those samples with replacement many, many times.
- Aggregating: combining multiple hypotheses h_1, h_2, …, h_T to arrive at a single hypothesis.

18 Split-Feature Randomization: Predictions made by trees trained on similar datasets are highly correlated. To decorrelate these predictions, randomly limit the features available at each iteration of the ID3 algorithm: every time the ID3 algorithm goes to split an impure leaf, randomly select k < d of the d features and only allow the algorithm to split on one of those k features. For classification, a common choice is k = √d; for regression, a common choice is k = d/3.

19 Random Forests:
Input: D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, T, k
For t = 1, 2, …, T:
- Create a dataset D_t by sampling N points from D with replacement.
- Learn a decision tree h_t using D_t and the ID3 algorithm with split-feature randomization.
Output: the aggregated hypothesis (e.g., the majority vote of h_1, …, h_T).
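A sketch of this procedure, using scikit-learn's DecisionTreeClassifier with max_features="sqrt" as a stand-in for ID3 with split-feature randomization (the slides use ID3; this substitution and the helper names are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T=100, rng=None):
    """Train T trees on bootstrap samples of (X, y); max_features='sqrt'
    plays the role of split-feature randomization (k = sqrt(d))."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = len(X)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)          # sample N points with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, X):
    """Aggregate the T trees by majority vote."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (T, num_points)
    preds = []
    for col in votes.T:
        values, counts = np.unique(col, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)
```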

20 Boosting: Another ensemble method (like bagging) that combines the predictions of multiple hypotheses. Aims to reduce the bias of a weak or highly biased hypothesis set (it can also reduce variance). Intuition: iteratively reweight inputs, giving more weight to inputs that are difficult to predict correctly. Fundamentally requires access to weak learners that are better than random chance.

21 AdaBoost:
Input: D = {(x_1, y_1), …, (x_N, y_N)} with y_n ∈ {−1, +1}, and T
Initialize input weights: w_1^(1) = … = w_N^(1) = 1/N
For t = 1, …, T:
1. Train a weak learner (hypothesis) h_t by minimizing the weighted training error.
2. Compute the weighted training error of h_t: ε_t = Σ_{n=1}^N w_n^(t) 1[h_t(x_n) ≠ y_n]
3. Compute the importance of h_t: α_t = (1/2) ln((1 − ε_t)/ε_t)
4. Update the weights: w_n^(t+1) ∝ w_n^(t) e^(−α_t) if h_t(x_n) = y_n and w_n^(t+1) ∝ w_n^(t) e^(α_t) if h_t(x_n) ≠ y_n; equivalently, w_n^(t+1) = w_n^(t) e^(−α_t y_n h_t(x_n)) / Z_t, where Z_t normalizes the weights.
Output: the aggregated hypothesis H(x) = sign(F(x)) = sign(Σ_{t=1}^T α_t h_t(x)).
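A sketch of the AdaBoost loop above, assuming depth-1 decision trees (stumps) from scikit-learn as the weak learner; the helper names, the default T, and the handling of degenerate ε_t are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps as the weak learner; y must be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)                        # initial input weights
    hypotheses, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # minimize the weighted error
        pred = stump.predict(X)
        eps = max(np.sum(w * (pred != y)), 1e-12)  # weighted training error
        if eps >= 0.5:                             # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # importance of this stump
        w = w * np.exp(-alpha * y * pred)          # up-weight the mistakes
        w = w / np.sum(w)                          # normalize (the Z_t step)
        hypotheses.append(stump)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(hypotheses, alphas, X):
    """Weighted-majority vote of the weak learners."""
    agg = sum(a * h.predict(X) for h, a in zip(hypotheses, alphas))
    return np.sign(agg)
```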

22 Why AdaBoost?
1. If you only have access to weak learners (because of computational constraints),
2. and you want your final hypothesis to be a weighted combination of weak learners (because weak learners are not great on their own),
3. then AdaBoost greedily minimizes the exponential loss, e(H, x, y) = e^(−y H(x)) (because the exponential loss upper bounds the binary error).

23 Nearest Neighbor Intuition: Classify a point as the label of the most similar training point. Use Euclidean distance as the similarity metric: d(x, x') = ‖x − x'‖ = √(Σ_i (x_i − x'_i)²).

24 The Nearest Neighbor Hypothesis: g(x) is the label of the training point nearest to x. (Figure: the resulting nearest-neighbor decision boundary.)

25 Generalization of Nearest Neighbor: Claim: E_out for the nearest neighbor hypothesis is not much worse than the best possible E_out. Formally: with high probability, E_out(g_NN) ≤ 2 E_out(g*) as N → ∞, where g* is the optimal predictor. Interpretation: half of the data's predictive power is in the nearest neighbor!

26 k-Nearest Neighbors (kNN): Classify a point as the most common label among the labels of the k nearest training points.
- When k = 1, g is the nearest neighbor hypothesis: complicated decision boundaries; may overfit.
- When k = N, g always predicts the most common label in the training dataset: no decision boundaries; may underfit.
- k controls the complexity of the hypothesis set, so k affects how well the learned hypothesis will generalize.
- Practical rules of thumb: k = 3, k ≈ √N, or pick k by cross-validation.
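A minimal sketch of the kNN prediction rule (the function name and the brute-force distance computation are illustrative choices, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the most common label among the k nearest training points
    under Euclidean distance."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Usage sketch: X_train is an (N, d) array, y_train an (N,) array of labels.
# pred = knn_predict(X_train, y_train, np.array([0.2, -1.3]), k=3)
```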

27 kNN Pros and Cons:
Pros:
- Intuitive / explainable.
- No training / retraining.
- Provably near-optimal in terms of E_out.
Cons:
- Computationally expensive: always needs to store all data, O(Nd) memory, and computing g(x) requires computing all N distances and finding the k closest points, O(Nd + N log k) time.
- Suffers from the curse of dimensionality.

28 Curse of Dimensionality: The fundamental assumption of kNN is that similar points, i.e., points close to one another, should have the same label; the closer two points are, the more confident we can be that they will have the same label. As the number of input dimensions grows, it becomes less likely that two random points will be close, and it takes more points to cover the input space.

29 Curing the Curse of Dimensionality: More data; fewer dimensions. Blessing of non-uniformity: data from the real world is rarely uniformly distributed across the input space.

30 Computational Cost of kNN: No training required! Memory: O(Nd). Computing g(x): O(Nd + N log k). Idea: preprocess inputs in order to speed up predictions: reduce the number of inputs held in memory by eliminating redundancies, and organize inputs in data structures that make searching for nearest neighbors more efficient.

31 Data Condensing: Reduce the number of inputs while maintaining the same predictions on all inputs. Let g_S be the kNN hypothesis when trained on S ⊆ D. S is training-set consistent if g_S(x_n) = g_D(x_n) for every x_n in D. Training-set consistency is a much weaker constraint than decision-boundary consistency.

32 Organizing the Inputs: Intuition: split the inputs into clusters, groups of points that are close to one another but far from other groups. If an input point is really close to one group of points and really far from the other groups, then we can skip searching through the other groups and just look for nearest neighbors in the close group! We want cluster centers to be far apart and cluster radii to be small.

33 Radial Basis Functions (RBF): kNN only considers some points and weights them equally; RBFs consider all points but weight them unequally. Intuition: all points are useful, but some points are more useful than others! Bonus: no need to choose k. The prediction is a distance-weighted vote over all training points, e.g. g(x) = sign(Σ_{n=1}^N y_n exp(−‖x − x_n‖² / r²)).
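Since the exact weighting formula did not survive transcription, here is a sketch assuming a Gaussian radial basis weighting with width r and labels in {−1, +1}:

```python
import numpy as np

def rbf_predict(X_train, y_train, x_query, r=1.0):
    """Distance-weighted vote over ALL training points: closer points get
    exponentially more weight. y_train is assumed to be in {-1, +1}."""
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
    weights = np.exp(-sq_dists / r**2)       # Gaussian (radial basis) weights
    return np.sign(np.sum(weights * y_train))
```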

34 Maximal Margin Linear Separators: The margin of a separating hyperplane is the distance between the hyperplane and the nearest training point. Questions: How can we efficiently find a maximal-margin linear separator? Why are linear separators with larger margins better? What can we do if the data is not linearly separable?

35 Maximizing the Margin:
minimize (1/2) w^T w subject to y_n (w^T x_n + b) ≥ 1 for all n = 1, …, N.
This optimization problem can be solved (approximately) using quadratic programming (QP).
Let H_γ = {linear separators with minimum margin γ}. If the input space is a d-dimensional sphere of radius R, then d_VC(H_γ) ≤ min(d, ⌈R²/γ²⌉) + 1.

36 Linearly Inseparable Data: What can we do if the data is not linearly separable? Accept some non-zero in-sample error (how much in-sample error should we tolerate?), or apply a non-linear transformation that shifts the data into a space where it is linearly separable (how can we pick a non-linear transformation?).

37 Soft-Margin SVMs:
minimize (1/2) w^T w + C Σ_{n=1}^N ξ_n
subject to y_n (w^T x_n + b) ≥ 1 − ξ_n and ξ_n ≥ 0 for all n = 1, …, N.
- ξ_n is the soft error on the n-th training point.
- If ξ_n > 1, then y_n (w^T x_n + b) < 0 and (x_n, y_n) is incorrectly classified.
- If 0 < ξ_n < 1, then y_n (w^T x_n + b) > 0 and (x_n, y_n) is correctly classified but lies inside the margin.
- Σ_{n=1}^N ξ_n is the soft in-sample error.
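The slides stop at the optimization problem; in practice one can hand it to an off-the-shelf solver. A hedged example using scikit-learn's SVC (the toy data and the particular C values are assumptions) shows how C trades margin width against soft error:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, nearly-but-not-quite linearly separable data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(200))

# Small C tolerates more soft error (wider margin); large C penalizes it heavily.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_.sum(), clf.score(X, y))
```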

38 Nonlinear Dual SVMs:
Decide on a transformation Φ: X → Z.
Find a maximal-margin separating hyperplane in the transformed space, (w*, b*), by solving the QP:
minimize (1/2) Σ_{n=1}^N Σ_{m=1}^N y_n y_m α_n α_m Φ(x_n)^T Φ(x_m) − Σ_{n=1}^N α_n
subject to Σ_{n=1}^N y_n α_n = 0 and α_n ≥ 0 for all n = 1, …, N (with α_n ≤ C added in the soft-margin case).
Return the corresponding predictor in the original space:
g(x) = sign(Σ_{n: α_n > 0} y_n α_n Φ(x_n)^T Φ(x) + b*).

39 Perceptrons vs. SVMs:
- Perceptrons: in a low-dimensional input space, E_in is high but generalization is good; in a high-dimensional input space, E_in is low but generalization is bad.
- SVMs: in a low-dimensional input space, E_in is high but generalization is good; in a high-dimensional input space, E_in is low and generalization is okay.
- Reason: d_VC(H_Φ) = d̃ + 1 for perceptrons in the transformed space versus d_VC(H_γ) ≤ min(d̃, ⌈R²/γ²⌉) + 1 for maximal-margin separators.

40 Efficiency: Depending on the transformation Φ and the dimensionality of the original input space, d, computing Φ(x) can be computationally expensive; computing a degree-Q polynomial transformation Φ_Q(x) requires O(d^Q) time. High-dimensional transformations can result in good hypotheses (as long as they don't overfit), but high-dimensional transformations are expensive. Approach: instead of computing Φ(x), find a kernel function K such that K(x, x') = Φ(x)^T Φ(x') for all x, x' ∈ X.

41 Nonlinear Dual SVMs (kernelized):
Decide on a (valid) kernel function K.
Find a maximal-margin separating hyperplane in the transformed space, (w*, b*), by solving the QP:
minimize (1/2) Σ_{n=1}^N Σ_{m=1}^N y_n y_m α_n α_m K(x_n, x_m) − Σ_{n=1}^N α_n
subject to Σ_{n=1}^N y_n α_n = 0 and α_n ≥ 0 for all n = 1, …, N (with α_n ≤ C added in the soft-margin case).
Return the corresponding predictor in the original space:
g(x) = sign(Σ_{n: α_n > 0} y_n α_n K(x_n, x) + b*).
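A small check of the kernel idea, assuming the degree-2 polynomial kernel K(x, x') = (x^T x')² (the slides leave K unspecified): the kernel value matches the inner product of the explicit d²-dimensional transformation without ever forming Φ(x):

```python
import numpy as np

def poly2_kernel(x, xp):
    """Degree-2 polynomial kernel K(x, x') = (x^T x')^2."""
    return float(np.dot(x, xp)) ** 2

def poly2_features(x):
    """Explicit transformation Phi(x) with d^2 coordinates x_i * x_j."""
    return np.outer(x, x).ravel()

x, xp = np.array([1.0, 2.0, -1.0]), np.array([0.5, -1.0, 3.0])
lhs = poly2_kernel(x, xp)                                    # O(d) work
rhs = float(np.dot(poly2_features(x), poly2_features(xp)))   # O(d^2) work
print(lhs, rhs)   # identical: the kernel never computes Phi explicitly
```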

42 (Figure: a target region built by combining two perceptrons, h_1 and h_2.) The target is an OR of ANDs of h_1(x) and h_2(x), e.g. f(x) = OR(AND(h_1(x), ¬h_2(x)), AND(¬h_1(x), h_2(x))), i.e., f = XOR(h_1, h_2).

43 Building a Network: f(x) = OR(AND(h_1(x), ¬h_2(x)), AND(¬h_1(x), h_2(x))) can be implemented as a small network of perceptrons: the inputs feed into h_1 and h_2, whose outputs (together with bias nodes) feed through AND units into an OR unit. (Figure: the corresponding network with weights w_{i,j}.)

44 Feed-Forward Neural Network (NN): Replace the hard sign function with a soft, differentiable approximation, θ (e.g., tanh). (Figure: a feed-forward network with bias nodes, layers of θ units, and output h(x).)

45 Architecture: The architecture of a NN is the vector of layer dimensionalities d = (d^(0), d^(1), …, d^(L)).
- A NN with this architecture has L layers: L − 1 hidden layers and 1 output layer.
- Layer l has dimension d^(l), i.e., layer l has d^(l) + 1 nodes, counting the bias node.
- Every architecture corresponds to a hypothesis set; a hypothesis is specified by setting all the weights.

46 Weights, Signals and Outputs:
- The weights between layer l − 1 and layer l are a matrix W^(l) ∈ R^((d^(l−1)+1) × d^(l)); W^(l)_{ij} is the weight between node i in layer l − 1 and node j in layer l.
- Every node has an incoming signal s^(l)_j and an outgoing output x^(l)_j, with x^(l) = θ(s^(l)) and s^(l) = (W^(l))^T x^(l−1).

47 Forward Propagation:
Input: weights W^(1), …, W^(L) and a query point x
Initialize x^(0) = (1, x)
For l = 1, …, L:
- s^(l) = (W^(l))^T x^(l−1)
- x^(l) = (1, θ(s^(l))) for hidden layers (the output layer has no bias component)
Output: x^(1), …, x^(L)

48 Backpropagation:
Input: weights W^(1), …, W^(L) and a training point (x, y)
Run forward propagation to get x^(1), …, x^(L)
Initialize δ^(L) = 2 (x^(L) − y) (1 − (x^(L))²)
For l = L − 1, …, 1:
- Compute δ^(l) = (1 − x^(l) ⊗ x^(l)) ⊗ [W^(l+1) δ^(l+1)] (dropping the bias component)
Output: δ^(1), …, δ^(L)

49 Computing Gradients:
Input: W^(1), …, W^(L) and D = {(x_1, y_1), …, (x_N, y_N)}
Initialize E_in = 0 and G^(l) = 0 · W^(l) for l = 1, …, L
For n = 1, …, N:
- Run forward propagation to get x^(1), …, x^(L)
- Run backpropagation to get δ^(1), …, δ^(L)
- Increment E_in: E_in = E_in + (1/N) (x^(L) − y_n)²
- For l = 1, …, L: compute G_n^(l) = x^(l−1) (δ^(l))^T and increment G^(l): G^(l) = G^(l) + (1/N) G_n^(l)
Output: G^(1), …, G^(L), the gradients of E_in with respect to W^(1), …, W^(L)
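A NumPy sketch of forward propagation and backpropagation for a tanh network with squared error, following the equations above; the architecture, weight shapes, and example numbers are assumptions for illustration:

```python
import numpy as np

def forward(W, x):
    """Forward propagation with tanh units. W[l] has shape (d_{l-1}+1, d_l).
    Returns layer outputs xs[0..L]; hidden layers carry a leading bias 1."""
    xs = [np.concatenate(([1.0], x))]
    for l, Wl in enumerate(W):
        s = Wl.T @ xs[-1]                          # s^(l) = (W^(l))^T x^(l-1)
        out = np.tanh(s)
        if l < len(W) - 1:                         # hidden layers get a bias node
            out = np.concatenate(([1.0], out))
        xs.append(out)
    return xs

def backward(W, xs, y):
    """Backpropagation for squared error with tanh units.
    Returns gradients G[l] of the per-example error w.r.t. W[l]."""
    L = len(W)
    deltas = [None] * (L + 1)
    deltas[L] = 2 * (xs[L] - y) * (1 - xs[L] ** 2)   # output-layer sensitivity
    for l in range(L - 1, 0, -1):
        # W[l] connects layer l to layer l+1 (the math W^(l+1)); drop the bias row.
        back = W[l] @ deltas[l + 1]
        deltas[l] = (1 - xs[l][1:] ** 2) * back[1:]
    return [np.outer(xs[l], deltas[l + 1]) for l in range(L)]

# Example: a 2-4-1 architecture with made-up weights and data.
rng = np.random.default_rng(0)
W = [rng.normal(0, 0.1, size=(3, 4)), rng.normal(0, 0.1, size=(5, 1))]
xs = forward(W, np.array([0.5, -0.2]))
G = backward(W, xs, np.array([1.0]))   # G[l] has the same shape as W[l]
```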

50 Complexity: Both forward and backpropagation contain matrix multiplications involving W^(1), …, W^(L), so both take time O(|W^(1)| + … + |W^(L)|). Computing G^(1), …, G^(L) requires running forward and backpropagation for each training point (x_n, y_n) ∈ D, so each iteration of gradient descent for a neural network takes time O(N (|W^(1)| + … + |W^(L)|)). Use stochastic gradient descent instead! Also use parallelization and GPUs / TPUs!

51 Stochastic Gradient Descent for Neural Networks:
Input: D = {(x_1, y_1), …, (x_N, y_N)} and a step size η
Initialize all weights W_0^(1), …, W_0^(L) to small, random numbers and set t = 0
While some termination condition is not satisfied:
- Randomly select a point (x_n, y_n) ∈ D
- Compute the gradients G_t^(l) of the pointwise error e(h(x_n), y_n) with respect to W_t^(l) for l = 1, …, L
- Update the weights: W_{t+1}^(l) = W_t^(l) − η G_t^(l)
- Increment t: t = t + 1
Output: W_t^(1), …, W_t^(L)
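A sketch of the SGD loop, reusing the forward/backward functions from the sketch above; the step size, the fixed-iteration termination rule, and the function name are assumptions:

```python
import numpy as np

def sgd_train(W, X, Y, eta=0.05, max_iters=10000, rng=None):
    """Stochastic gradient descent: repeatedly pick one random training point,
    compute its gradient via forward/backward propagation, and step downhill.
    Termination here is simply a fixed iteration budget (one simple choice).
    X is an (N, d_in) array, Y an (N, d_out) array."""
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(max_iters):
        n = rng.integers(len(X))                  # randomly select one point
        xs = forward(W, X[n])                     # from the sketch above
        G = backward(W, xs, Y[n])
        W = [Wl - eta * Gl for Wl, Gl in zip(W, G)]
    return W
```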

52 Initialization and Termination: Initialization: randomness is good for non-convex optimization; initialize weights by sampling from N(0, σ²) for a small σ². Termination: for complicated error surfaces, the gradient's magnitude is not a good metric for proximity to a minimum; a simple solution is to combine multiple termination criteria, e.g., stop if enough iterations have passed and the improvement in error is small.
