ECE 5984: Introduction to Machine Learning


ECE 5984: Introduction to Machine Learning
Topics: Ensemble Methods: Bagging, Boosting
Readings: Murphy 16.4; Hastie Chapter 16
Dhruv Batra, Virginia Tech

Administrivia
HW3 due: April 14, 11:55pm. You will implement primal & dual SVMs.
Kaggle competition: Higgs Boson signal vs. background classification
https://inclass.kaggle.com/c/2015-spring-vt-ece-machinelearning-hw3
https://www.kaggle.com/c/higgs-boson

Administrivia
Project Mid-Sem Spotlight Presentations: 9 remaining; resume in class on April 20th.
Format: 5 slides (recommended), 4 minute time limit (STRICT) + 1-2 min Q&A.
Content: tell the class what you're working on, any results yet, problems faced.
Upload slides on Scholar.

New Topic: Ensemble Methods
Bagging
Boosting
Image Credit: Antonio Torralba

Synonyms
Ensemble Methods; Learning Mixture of Experts/Committees
Boosting types: AdaBoost, L2Boost, LogitBoost, <Your-Favorite-Keyword>Boost

A quick look back: so far you have learnt
Regression: Least Squares, Robust Least Squares
Classification (linear): Naïve Bayes, Logistic Regression, SVMs
Classification (non-linear): Decision Trees, Neural Networks, K-NNs

Recall: Bias-Variance Tradeoff
Demo: http://www.princeton.edu/~rkatzwer/polynomialregression/

Bias-Variance Tradeoff
Choice of hypothesis class introduces learning bias:
More complex class → less bias
More complex class → more variance
Slide Credit: Carlos Guestrin

Fighting the bias-variance tradeoff
Simple (a.k.a. weak) learners, e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
Good: low variance, don't usually overfit
Bad: high bias, can't solve hard learning problems
Sophisticated learners: kernel SVMs, deep neural nets, deep decision trees
Good: low bias, have the potential to learn with Big Data
Bad: high variance, difficult to generalize
Can we combine these properties? In general, no! But often yes in practice.
Slide Credit: Carlos Guestrin

Voting (Ensemble Methods)
Instead of learning a single classifier, learn many classifiers.
Output class: (weighted) vote of each classifier; classifiers that are most sure will vote with more conviction.
With sophisticated learners: uncorrelated errors → expected error goes down; on average, do better than a single classifier! (Bagging)
With weak learners, each one good at different parts of the input space: on average, do better than a single classifier! (Boosting)
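
To make the weighted-vote idea concrete, here is a minimal Python sketch (not from the slides; the classifier list, per-classifier weights, and the .predict interface are assumptions for illustration):

import numpy as np

def weighted_vote(classifiers, weights, X):
    """Combine binary classifiers (each predicting +1/-1) by a weighted vote."""
    votes = np.zeros(len(X))
    for clf, w in zip(classifiers, weights):
        votes += w * clf.predict(X)   # a surer classifier (larger w) votes with more conviction
    return np.sign(votes)             # output class = sign of the weighted sum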

Bagging
Bagging = Bootstrap Aggregating (average models trained on bootstrap resamples); details on board.
Bootstrap demo: http://wise.cgu.edu/bootstrap/
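
Since the algorithm itself is only worked out on the board, here is a rough Python sketch of the basic bagging recipe (the base_learner factory and the regression-style averaging are assumptions for illustration; use a majority vote for classification):

import numpy as np

def bagging_fit(base_learner, X, y, n_models=25, seed=0):
    """Train n_models copies of base_learner, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # sample n points with replacement
        m = base_learner()                 # fresh, unfitted model
        m.fit(X[idx], y[idx])
        models.append(m)
    return models

def bagging_predict(models, X):
    """Average the bootstrap models' predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)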

Decision Forests
Learn many trees & average their outputs.
We will formally visit this in the Bagging lecture.

Boosting [Schapire, 1989]
Pick a class of weak learners H = {h | h : X → Y}.
You have a black box that picks the best weak learner, under either an unweighted or a weighted sum of losses:
unweighted: $h^* = \arg\min_{h \in H} \sum_i L(y_i, h(x_i))$
weighted: $h^* = \arg\min_{h \in H} \sum_i w_i L(y_i, h(x_i))$
On each iteration t: compute the error of the current ensemble $f_t(x_i) = \sum_{t'=1}^{t} \alpha_{t'} h_{t'}(x_i)$ at each point; update the weight of each training example based on its error; learn a hypothesis $h_t$ and a strength $\alpha_t$ for this hypothesis.
Final classifier: $f_t(x) = f_{t-1}(x) + \alpha_t h_t(x)$
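
As a rough Python skeleton of this loop (illustrative only; compute_weights, fit_weak_learner, and choose_alpha stand in for the loss-specific steps that the next slides and HW4 fill in):

def boost(X, y, T, fit_weak_learner, choose_alpha, compute_weights):
    """Generic forward-stagewise boosting skeleton: f_t = f_{t-1} + alpha_t * h_t."""
    ensemble = []                                # list of (alpha_t, h_t) pairs
    for t in range(T):
        w = compute_weights(ensemble, X, y)      # per-example weights from the current errors
        h_t = fit_weak_learner(X, y, w)          # black box: best weak learner under the weighted loss
        alpha_t = choose_alpha(ensemble, h_t, X, y)
        ensemble.append((alpha_t, h_t))
    return lambda x: sum(a * h(x) for a, h in ensemble)   # the final classifier f_T(x)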

Boosting Demo
MATLAB demo by Antonio Torralba: http://people.csail.mit.edu/torralba/shortcourserloc/boosting/boosting.html

Types of Boosting

Loss Name                           Loss Formula              Boosting Name
Regression: Squared Loss            (y - f(x))^2              L2Boosting
Regression: Absolute Loss           |y - f(x)|                Gradient Boosting
Classification: Exponential Loss    exp(-y f(x))              AdaBoost
Classification: Log/Logistic Loss   log(1 + exp(-y f(x)))     LogitBoost

L2 Boosting
Loss: regression, squared loss (y - f(x))^2 → L2Boosting
Algorithm on board.
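
The board derivation is not in this transcript, but the standard L2Boosting recipe fits each new weak learner to the current residuals. A minimal Python sketch under that assumption (the weak_learner factory and the step size lr are illustrative choices):

import numpy as np

def l2_boost(X, y, weak_learner, T=100, lr=1.0):
    """L2Boosting sketch: repeatedly fit a weak regressor to the residuals y - f_{t-1}(X)."""
    f = np.zeros(len(y), dtype=float)
    models = []
    for t in range(T):
        residual = y - f              # negative gradient of the squared loss (up to a factor of 2)
        h = weak_learner()            # e.g., a shallow regression tree
        h.fit(X, residual)
        f += lr * h.predict(X)        # stagewise update: f_t = f_{t-1} + lr * h_t
        models.append(h)
    return models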

AdaBoost
Loss: classification, exponential loss exp(-y f(x)) → AdaBoost
Algorithm: you will derive it in HW4!
Guaranteed to achieve zero training error with infinite rounds of boosting (assuming no label noise).
Need to do early stopping.
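
For reference (and without spoiling the HW4 derivation), the usual discrete AdaBoost update looks roughly like this in Python; labels are assumed to be in {-1, +1}, and the weak learner is assumed to accept per-example sample weights:

import numpy as np

def adaboost(X, y, weak_learner, T=50):
    """Discrete AdaBoost sketch for labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                                # start with uniform example weights
    ensemble = []
    for t in range(T):
        h = weak_learner()
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y))                      # weighted training error of h_t (w sums to 1)
        if err >= 0.5:                                     # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # strength of this hypothesis
        ensemble.append((alpha, h))
        w *= np.exp(-alpha * y * pred)                     # up-weight mistakes, down-weight correct points
        w /= w.sum()
    return ensemble

The final prediction is sign(sum_t alpha_t h_t(x)), i.e., exactly the weighted vote from the earlier slide.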

What you should know
Voting/Ensemble methods.
Bagging: how to sample; under what conditions error is reduced.
Boosting: general outline; L2Boosting derivation; AdaBoost derivation.

Learning Theory
Probably Approximately Correct (PAC) Learning
What does it formally mean to learn?

Learning Theory
We have explored many ways of learning from data. But:
How good is our classifier, really?
How much data do I need to make it good enough?
Slide Credit: Carlos Guestrin

A simple setting
Classification with N data points.
Finite space H of possible hypotheses, e.g., decision trees of depth d.
A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero training error: error_train(h) = 0.
What is the probability that h has more than ε true error? error_true(h) ≥ ε
Slide Credit: Carlos Guestrin

Generalization error in finite hypothesis spaces
Theorem [Haussler '88]: Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1. For any learned hypothesis h that is consistent on the training data:
$P(\text{error}_{\text{true}}(h) > \epsilon) \le |H| \, e^{-N\epsilon}$
Even if h makes zero errors on the training data, it may make errors on test data.
Slide Credit: Carlos Guestrin
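
The inequality above is reconstructed, since only the theorem statement survived in the transcript; the argument behind it is a union bound. For a single h with error_true(h) > ε, the probability that it classifies all N i.i.d. training points correctly is at most (1-ε)^N, so

$P(\exists\, h \in H \text{ consistent with } D \text{ and } \text{error}_{\text{true}}(h) > \epsilon) \;\le\; |H|\,(1-\epsilon)^N \;\le\; |H|\, e^{-N\epsilon}$.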

Using a PAC bound
Set the bound $|H|\,e^{-N\epsilon} \le \delta$. Typically, 2 use cases:
1: Pick ε and δ, solve for the N you need.
2: Pick N and δ, solve for the ε you get.
Slide Credit: Carlos Guestrin
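
Working out both cases from $|H|\,e^{-N\epsilon} \le \delta$ (a reconstruction; the slide's worked formulas did not survive extraction):

$N \;\ge\; \frac{\ln|H| + \ln\frac{1}{\delta}}{\epsilon} \quad \text{(case 1: pick } \epsilon, \delta\text{)}, \qquad \epsilon \;\ge\; \frac{\ln|H| + \ln\frac{1}{\delta}}{N} \quad \text{(case 2: pick } N, \delta\text{)}.$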

Haussler '88 bound
Strengths: holds for all (finite) H; holds for all data distributions.
Weaknesses: requires a consistent classifier; requires a finite hypothesis space.
Slide Credit: Carlos Guestrin

Generalization bound for |H| hypotheses
Theorem: Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1. For any learned hypothesis h (not necessarily consistent):
$P(\text{error}_{\text{true}}(h) - \text{error}_{\text{train}}(h) > \epsilon) \le |H| \, e^{-2N\epsilon^2}$
Slide Credit: Carlos Guestrin
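
The bound above is reconstructed from the standard argument: Hoeffding's inequality controls the deviation for a single hypothesis, and a union bound over H contributes the |H| factor:

$P\big(\text{error}_{\text{true}}(h) - \text{error}_{\text{train}}(h) > \epsilon\big) \le e^{-2N\epsilon^2} \quad \text{(single } h\text{)}, \qquad P\big(\exists\, h \in H:\ \text{error}_{\text{true}}(h) - \text{error}_{\text{train}}(h) > \epsilon\big) \le |H|\, e^{-2N\epsilon^2}.$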

PAC bound and Bias-Variance tradeoff
Setting $|H|\,e^{-2N\epsilon^2} \le \delta$ and moving some terms around, with probability at least 1-δ:
$\text{error}_{\text{true}}(h) \le \text{error}_{\text{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2N}}$
A larger H drives the training-error (bias-like) term down but drives the complexity (variance-like) term up.
Important: the PAC bound holds for all h, but doesn't guarantee that the algorithm finds the best h!
Slide Credit: Carlos Guestrin