TDT4173 Machine Learning


TDT4173 Machine Learning. Lecture 3: Bagging & Boosting + SVMs. Norwegian University of Science and Technology. Helge Langseth, IT-VEST 310, helgel@idi.ntnu.no

Outline
1. Ensemble methods: Background, Bagging, Boosting
2. Support Vector Machines: Background, Linear separators, The dual problem, Non-separable subspaces, Nonlinearity and kernels

Summary points from last lesson
1. Decision trees: representation, learning, overfitting, bias
2. Evaluating hypotheses: sample error, true error, estimators, confidence intervals for the observed hypothesis error, the central limit theorem, comparing hypotheses, comparing learners
3. Computational learning theory: bounding the true error, PAC learning

Ensemble methods: Motivation (Background)
1. I ask one person a question, where everyone has probability p = 0.6 of getting it right.
2. I ask B > 1 (independent) people the same question and let them vote.
Which setup will give me the most reliable result?

[Figure: the probability that the majority vote is correct, plotted as a function of the number of voters B (x-axis from 0 to 300, y-axis from 0.5 to 1); it starts at p = 0.6 for a single voter and rises towards 1 as B grows.]
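To make the motivation concrete, here is a small sketch (my illustration, not code from the lecture) that computes the probability of the majority vote being correct, assuming B independent voters who are each right with probability p = 0.6; scipy's binomial distribution does the counting.

```python
from scipy.stats import binom

def majority_vote_accuracy(B, p=0.6):
    """Probability that a strict majority of B independent voters,
    each correct with probability p, answers correctly."""
    # P(X > B // 2) where X ~ Binomial(B, p); exact for odd B (no ties possible).
    return binom.sf(B // 2, B, p)

for B in [1, 11, 51, 101, 301]:
    print(B, round(majority_vote_accuracy(B), 3))
```

A single voter is right 60% of the time, while the accuracy of the vote climbs towards 1 as B grows, which is the behaviour the curve on the slide illustrates.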

Ensemble methods (Background). Fundamental setup:
1. Generate a collection (ensemble) of classifiers from a given dataset. Each classifier can be weak.
2. Each classifier is given a different dataset. The datasets are generated by resampling.
3. The final classification is defined by voting.
Nice property: each weak classifier in the ensemble may be fairly simple (silly, and with a huge bias); we only require that it is able to do better than choosing blindly. Still, the ensemble of classifiers can be made quite clever!

Weak learner example (Background). Consider a two-dimensional dataset with two classes. [Figure: scatter plot of the data for a two-class problem.] What is a weak classifier in this case? One example: the half-space classifier, which also aggregates well.

Learning the half-space classifier (naïve implementation). Consider a dataset D = {D_1, D_2, ..., D_m}, where each D_j is an observation ((x_1, x_2), c) and c is either red or blue.
For d over each dimension:
  For j over each observation:
    1. Let b(d, j) be x_d from observation D_j.
    2. r_l <- (number of red observations with x_d <= b(d, j)) + (number of blue observations with x_d > b(d, j))
    3. quality(d, j) <- max{r_l, m - r_l}
(d_0, j_0) <- argmax_{d, j} quality(d, j)
Define the half-space classifier at b(d_0, j_0).
[Figure: the data and the resulting classifier for the data in iteration 1.]
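A minimal Python sketch of this naïve learner (my own illustration, not course code): it assumes the data come as a numpy array X of shape (m, 2) and a numpy array c of string labels 'red'/'blue', and returns the best axis-aligned threshold.

```python
import numpy as np

def learn_half_space(X, c):
    """Naive half-space (decision stump) learner from the slides.
    X: (m, 2) array of points; c: numpy array of labels, 'red' or 'blue'."""
    m = X.shape[0]
    best = (-1, None)  # (quality, (dimension, threshold, orientation))
    for d in range(X.shape[1]):
        for j in range(m):
            b = X[j, d]
            # red on the "<= b" side, blue on the "> b" side
            r_l = np.sum((X[:, d] <= b) & (c == 'red')) + \
                  np.sum((X[:, d] > b) & (c == 'blue'))
            quality = max(r_l, m - r_l)
            if quality > best[0]:
                orientation = 'red_below' if r_l >= m - r_l else 'blue_below'
                best = (quality, (d, b, orientation))
    return best[1]

def predict_half_space(stump, x):
    """Classify a single point x with the learned stump."""
    d, b, orientation = stump
    below = x[d] <= b
    if orientation == 'red_below':
        return 'red' if below else 'blue'
    return 'blue' if below else 'red'
```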

Two methods of ensemble learning (Background). We will consider two methods of ensemble learning:
Bagging: Bagging (Bootstrap Aggregating) generates new datasets for the classifiers by drawing at random from the original data (with replacement).
Boosting: Data for the classifiers are adaptively resampled (re-weighted), so that data points that the classifiers have had problems classifying are given more weight.

Bootstrapping (Bagging). What is a bootstrap sample? Consider a dataset D = {D_1, D_2, ..., D_m} with m data points. A bootstrap sample D_i can be created from D by choosing m points from D at random. The selection is done with replacement. [Figure: the original dataset and one bootstrap sample drawn from it.]
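Drawing a bootstrap sample is essentially one line of numpy; the sketch below is an illustration, assuming the data are stored as arrays X and y.

```python
import numpy as np

def bootstrap_sample(X, y, rng=None):
    """Draw m points from (X, y) uniformly at random, with replacement."""
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    idx = rng.integers(0, m, size=m)   # indices may repeat; some never appear
    return X[idx], y[idx]
```

On average roughly 63% of the original points appear in a given bootstrap sample; the remaining slots are filled by duplicates of points already drawn.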

Bagging. Bagging (= bootstrap aggregating) produces replications of the training set by sampling with replacement. Each replication of the training set has the same size as the original set, but some examples can appear more than once while others don't appear at all. A classifier is generated from each replication. All classifiers are used to classify each sample from the test set, using a majority voting scheme.

Bagging, more formally:
1. Start with dataset D.
2. Generate B bootstrap samples D_1, D_2, ..., D_B.
3. Learn a weak classifier for each dataset: c_i <- L(D_i), where L(.) is the weak learner and c_i(.) : X -> {-1, +1}.
4. The aggregated classification of a new instance x is simply found by taking a majority vote over the classifiers:
\[ c(x) = \mathrm{sign}\Big( \sum_{b=1}^{B} c_b(x) \Big) \]
DEMO
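The whole procedure fits in a few lines; the sketch below is an illustration, under the assumption that weak_learner(X, y) returns a callable classifier c(x) taking values in {-1, +1}.

```python
import numpy as np

def bag(X, y, weak_learner, B, rng=None):
    """Learn B weak classifiers, one per bootstrap sample, and aggregate by voting.
    weak_learner(X, y) must return a function c(x) in {-1, +1}."""
    rng = np.random.default_rng(rng)
    classifiers = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        classifiers.append(weak_learner(X[idx], y[idx]))

    def aggregated(x):
        votes = sum(c(x) for c in classifiers)
        return 1 if votes >= 0 else -1               # majority vote; ties broken as +1
    return aggregated
```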

Bagging summary. When is it useful? Bagging should only be used if the learner is unstable. A learner is said to be unstable if a small change in the training set yields large variations in the classification. Examples of unstable learners: neural networks, decision trees. Examples of stable learners: k-nearest neighbours, the half-space classifier.
Bagging properties:
1. Improves the estimate if the learning algorithm is unstable.
2. Reduces the variance of predictions.
3. Degrades the estimate if the learning algorithm is stable.

Boosting.
1. Main idea, as for bagging: invent new datasets and learn a separate classifier for each dataset.
2. Clever trick: the data is now adaptively resampled, so that previously misclassified examples count more in the next dataset, while previously well-classified observations are weighted less.
3. The classifiers are aggregated by adding them together, but each classifier is weighted differently.

How to learn the WEIGHTED half-space classifier. Consider a dataset D = {D_1, D_2, ..., D_m}, where each D_j is an observation ((x_1, x_2), c) and c is either red or blue.
For d over each dimension:
  For j over each observation:
    1. Let b(d, j) be x_d from observation D_j.
    2. r_l <- (sum of weights of red observations with x_d <= b(d, j)) + (sum of weights of blue observations with x_d > b(d, j))
    3. quality(d, j) <- max{ r_l, sum_i w(i) - r_l }
(d_0, j_0) <- argmax_{d, j} quality(d, j)
Define the half-space classifier at b(d_0, j_0).
[Figure: the weighted data and the resulting classifier.]
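Relative to the unweighted stump sketched earlier, only the quality computation changes: counts are replaced by sums of weights. A minimal sketch (same assumed data layout as before, plus a per-observation weight vector w):

```python
import numpy as np

def weighted_quality(X, c, w, d, b):
    """Weighted split quality for dimension d and threshold b.
    X: (m, 2) array; c: labels 'red'/'blue'; w: per-observation weights."""
    r_l = np.sum(w[(X[:, d] <= b) & (c == 'red')]) + \
          np.sum(w[(X[:, d] > b) & (c == 'blue')])
    return max(r_l, np.sum(w) - r_l)
```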

Boosting. Boosting produces replications of the training set by re-weighting the importance of each observation in the original dataset. A classifier is generated from each replication, using the weighted dataset as input. All classifiers are used to classify each sample from the test set. Each classifier gets a weight based on its calculated accuracy on the weighted training set.

Boosting, more formally:
1. Start with dataset D and weights w_1 = [1/m, 1/m, ..., 1/m].
2. Repeat for t = 1, ..., T:
   1. c_t is the classifier learned from the weighted data (D, w_t).
   2. \( \epsilon_t \) is the error of c_t measured on the weighted data (D, w_t).
   3. Calculate \( \alpha_t = \frac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t} \).
   4. Update weights (1):
      \[ w_{t+1}(i) = w_t(i) \cdot \begin{cases} \exp(-\alpha_t) & \text{if } x_i \text{ was classified correctly} \\ \exp(+\alpha_t) & \text{otherwise} \end{cases} \]
   5. Update weights (2): normalize w_{t+1}.
3. The aggregated classification of a new instance x is found by taking a weighted vote over the classifiers:
\[ c(x) = \mathrm{sign}\Big( \sum_{t=1}^{T} \alpha_t c_t(x) \Big) \]
DEMO
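This loop is AdaBoost; a compact sketch follows, assuming (as an illustration) a weak_learner(X, y, w) that trains on weighted data and returns a callable c(x) in {-1, +1}, labels y in {-1, +1}, and 0 < epsilon_t < 0.5.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost sketch. y in {-1, +1}; weak_learner(X, y, w) returns a
    callable c(x) in {-1, +1} trained on the weighted data."""
    m = len(X)
    w = np.full(m, 1.0 / m)
    classifiers, alphas = [], []
    for _ in range(T):
        c = weak_learner(X, y, w)
        pred = np.array([c(x) for x in X])
        eps = np.sum(w[pred != y])            # weighted training error (assumed in (0, 0.5))
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)     # exp(-alpha) if correct, exp(+alpha) if wrong
        w = w / np.sum(w)                     # normalise
        classifiers.append(c)
        alphas.append(alpha)

    def aggregated(x):
        return np.sign(sum(a * c(x) for a, c in zip(alphas, classifiers)))
    return aggregated
```

The update w * exp(-alpha * y * pred) multiplies by exp(-alpha_t) exactly when y_i and the prediction agree, matching the case split on the slide.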

Boosting summary. Boosting properties:
- The error rate of c on the un-weighted training instances approaches zero exponentially quickly as T increases.
- Boosting forces the classifier to have at least 50% accuracy on the re-weighted training instances. This causes the learning system to produce a quite different classifier on the following trial (as next time, we focus on the mis-classifications of this round). This leads to an extensive exploration of the classifier space.
- Boosting is more clever than bagging, as it will never reduce classification performance.

Support Vector Machines: Description of the task (Background).
Data:
1. We have a set of data D = {(x_1, y_1), ..., (x_m, y_m)}. The instances are described by x_i, the class is y_i.
2. The data is generated by some unknown probability distribution P(x, y).
Task:
1. Be able to guess y at a new location x.
2. For SVMs one typically states this as: find an unknown function f(x) that estimates y at x.
3. Note! In this lesson we look at binary classification, and let y in {-1, +1} denote the classes.
4. We will look for linear functions, i.e. \( f(x) = b + w^T x = b + \sum_i w_i x_i \), summing over the components of x.

Describing the separating plane (Background). [Figure: the separating hyperplane {x : b + w^T x = 0} in the (x_1, x_2)-plane, with the normal vector w drawn perpendicular to it.]

More on the representation (Background). [Figure: two panels in the (x_1, x_2)-plane; left: same w, increase b (the plane shifts); right: change w with b = 0 (the plane rotates).]

How to find the best linear separator (Linear separators). We are looking for a linear separator for this data. There are so many solutions... but only one is considered the best! SVMs are called large margin classifiers... and the data points touching the margin lines are the support vectors.

The geometry of the problem (Linear separators). [Figure: the three parallel hyperplanes {x : b + w^T x = -1}, {x : b + w^T x = 0} and {x : b + w^T x = +1}, with the normal vector w.] Note! Since one margin line has b + w^T x = +1 and the other has b + w^T x = -1, the distance between them is 2/||w||.
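To see where the 2/||w|| comes from (a short derivation not spelled out on the slide): pick any x_+ on the plane b + w^T x = +1 and any x_- on the plane b + w^T x = -1. Subtracting the two equations gives w^T(x_+ - x_-) = 2, and projecting the connecting vector onto the unit normal w/||w|| gives the distance between the planes:
\[ \text{margin} = \frac{w^T (x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}. \]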

An optimisation problem (Linear separators).
Optimisation criterion: the distance between the margins is 2/||w||, so that is what we want to maximise. Equivalently, we can minimise ||w||/2. For simplicity of the mathematics, we will rather minimise ||w||^2 / 2.
Constraints: the margin separates all data observations correctly:
- b + w^T x_i <= -1 for y_i = -1,
- b + w^T x_i >= +1 for y_i = +1.
Alternative (equivalent) constraint set: y_i (b + w^T x_i) >= 1.

An optimisation problem (2). Mathematical programming setting: combining the above requirements we obtain
\[ \min_{w, b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (b + w^T x_i) - 1 \ge 0,\ i = 1, \dots, m. \]
Properties: the problem is convex, hence it has a unique minimum, and efficient algorithms for solving it exist.
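In practice one rarely solves this by hand. As a sketch (using scikit-learn, which is my choice of tool, not the lecture's), a linear SVC with a very large C approximates this hard-margin problem and exposes w, b and the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data; labels in {-1, +1}
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel='linear', C=1e6)   # very large C ~ hard-margin behaviour
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # w and b of the separating hyperplane
print(clf.support_vectors_)         # the x_i with alpha_i > 0
print(clf.dual_coef_)               # y_i * alpha_i for the support vectors
```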

The dual problem and the convex hull. The convex hull of {x_j} is the smallest subset of the instance space that is convex and contains all elements of {x_j}. Find it by drawing lines between all the x_j and choosing the outermost boundary.

The dual problem and the convex hull (2). Look at the difference between the closest points of the two convex hulls, c and d. The decision line must be orthogonal to the line between these two closest points. [Figure: the two convex hulls with the closest points c and d marked.] So, we want to minimise ||c - d||. c can be written as a weighted sum of all elements in the green class, \( c = \sum_{y_i = +1} \alpha_i x_i \), and similarly for d. Note: the alpha_i values are zero unless x_i is a support vector.

The dual problem and the convex hull (3). Minimising ||c - d|| is (modulo a constant) equivalent to this formulation:
\[ \min_{\alpha}\ \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{m} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0 \ \text{ and } \ \alpha_i \ge 0,\ i = 1, \dots, m. \]
Properties: the problem is convex, hence has a unique minimum. It is a quadratic programming problem with known solution methods. At the solution, alpha_i > 0 only if x_i is a support vector.
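Because this is a standard quadratic program, a generic QP solver can handle it directly. The sketch below uses the cvxopt library (my choice, not the lecture's) and maps the dual onto cvxopt's min (1/2) a^T P a + q^T a form; treat it as an illustration of the formulation rather than production code.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """Solve the hard-margin SVM dual with a generic QP solver (cvxopt).
    X: (m, n) array, y: (m,) array with entries in {-1, +1}."""
    solvers.options['show_progress'] = False      # silence solver output
    m = X.shape[0]
    K = X @ X.T                                   # Gram matrix x_i^T x_j
    P = matrix(np.outer(y, y) * K)                # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(m))                       # - sum_i alpha_i
    G = matrix(-np.eye(m))                        # alpha_i >= 0  <=>  -alpha_i <= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))    # sum_i y_i alpha_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.ravel(sol['x'])
    w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                             # support vectors: alpha_i > 0
    b0 = np.mean(y[sv] - X[sv] @ w)               # b recovered from the support vectors
    return alpha, w, b0
```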

Theoretical foundation.
1. Proofs of the SVM properties are available (but out of scope for us).
2. Large margins are smart: if there are small variations in x, we will still classify correctly.
3. There are many skinny-margin planes, but only one if you look for the fattest plane; the result is thus more robust.

What if the convex hulls are overlapping? (Non-separable subspaces.) If the convex hulls are overlapping, we cannot find a linear separator. To handle this, we optimise a criterion where we maximise the distance between the margin lines minus a penalty for misclassifications. This is equivalent to scaling the convex hulls and doing as before on the reduced convex hulls.

What if the convex hulls are overlapping? (2) The problem with scaling is (modulo a constant) equivalent to this formulation:
\[ \min_{\alpha}\ \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{m} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0 \ \text{ and } \ 0 \le \alpha_i \le C,\ i = 1, \dots, m. \]
Properties: the problem is as before, but C introduces the scaling; this is equivalent to incurring a cost for misclassification. Still solvable using standard methods.
Demo: different values of C: http://svm.dcs.rhbnc.ac.uk/pagesnew/gpat.shtml

Nonlinear problems: when scaling does not make sense. The problem is difficult to solve when x = (r, s) has only two dimensions... but if we blow it up to five dimensions, theta(x) = {r, s, rs, r^2, s^2}, i.e. invent the mapping theta(.) : R^2 -> R^5, and try to find the linear separator in R^5, then everything is OK.


Solving the problem in higher dimensions. We solve this as before, but remembering to look in the higher dimension:
\[ \min_{\alpha}\ \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \theta(x_i)^T \theta(x_j) - \sum_{i=1}^{m} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0 \ \text{ and } \ 0 \le \alpha_i \le C,\ i = 1, \dots, m. \]
Note that we do not need to evaluate theta(x) directly, only theta(x_i)^T theta(x_j). If we find a clever way of evaluating theta(x_i)^T theta(x_j) (i.e. one that is independent of the size of the target space), we can solve the problem easily, without even thinking about what theta(x) means. We define K(x_i, x_j) = theta(x_i)^T theta(x_j) and focus on finding K(., .) instead of the mapping. K is called a kernel.
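As a worked example (standard, though not on the slide): for x = (r, s) and z = (r', s'), the degree-2 polynomial kernel expands as
\[ (x^T z + 1)^2 = r^2 r'^2 + s^2 s'^2 + 2\, r s\, r' s' + 2\, r r' + 2\, s s' + 1 = \theta(x)^T \theta(z), \qquad \theta(x) = \big(r^2,\ s^2,\ \sqrt{2}\, r s,\ \sqrt{2}\, r,\ \sqrt{2}\, s,\ 1\big), \]
so evaluating the kernel in R^2 implicitly computes an inner product in a feature space much like the five-dimensional mapping on the previous slide, without ever constructing theta(x) explicitly.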

Kernel functions (Nonlinearity and kernels). Some common kernels K(x_i, x_j):
- Degree-d polynomial: \( K(x_i, x_j) = (x_i^T x_j + 1)^d \)
- Radial basis function (RBF): \( K(x_i, x_j) = \exp\big(-\|x_i - x_j\|^2 / (2\sigma^2)\big) \)
- Two-layer neural network: \( K(x_i, x_j) = \mathrm{sigmoid}(\eta\, x_i^T x_j + c) \)
Different kernels have different properties; finding the right kernel is a difficult task and can be hard to visualise. Example: the RBF kernel (implicitly) uses an infinite-dimensional representation for theta(.).
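As plain functions these kernels look like the sketch below (numpy, vectors as 1-D arrays; the "sigmoid" is implemented here with tanh, one common choice, and the parameter names are mine):

```python
import numpy as np

def polynomial_kernel(xi, xj, d=2):
    """Degree-d polynomial kernel: (x_i^T x_j + 1)^d."""
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """Radial basis function kernel: exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, eta=1.0, c=0.0):
    """Two-layer neural network kernel: sigmoid(eta * x_i^T x_j + c), using tanh."""
    return np.tanh(eta * (xi @ xj) + c)
```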

SVMs: Algorithmic summary.
- Select the parameter C (the trade-off between minimising the training-set error and maximising the margin).
- Select the kernel function and its associated parameters (e.g. sigma for the RBF kernel).
- Solve the optimisation problem using quadratic programming.
- Find the value b (the threshold) by using the support vectors.
- Classify a new point x using
\[ f(x) = \mathrm{sign}\Big( \sum_{i=1}^{m} y_i \alpha_i K(x, x_i) - b \Big) \]
Demo: different kernels: http://svm.dcs.rhbnc.ac.uk/pagesnew/gpat.shtml
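The whole recipe corresponds closely to a few lines of scikit-learn (again an illustrative sketch with made-up toy data, not the course's code); note that sklearn's gamma parameter plays the role of 1/(2 sigma^2) for the RBF kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: points inside/outside the unit circle, so not linearly separable
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

# Steps 1-2: choose C and a kernel (here RBF with gamma = 1 / (2 sigma^2))
clf = SVC(C=1.0, kernel='rbf', gamma=0.5)
# Steps 3-4: the quadratic program and the threshold b are handled internally
clf.fit(X, y)
# Step 5: classify new points
print(clf.predict(np.array([[0.1, 0.2], [2.0, 2.0]])))
print(clf.n_support_)   # number of support vectors per class
```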