Announcements Kevin Jamieson


Announcements
My office hours TODAY 3:30 pm - 4:30 pm, CSE 666.
Poster Session - Pick one:
First poster session TODAY 4:30 pm - 7:30 pm, CSE Atrium.
Second poster session December 12, 4:30 pm - 7:30 pm, CSE Atrium.
Support your peers and check out the posters!
Poster description from website: We will hold a poster session in the Atrium of the Paul Allen Center. Each team will be given a stand to present a poster summarizing the project motivation, methodology, and results. The poster session will give you a chance to show off the hard work you put into your project, and to learn about the projects of your peers. We will provide poster boards that are 32x40 inches. Either one large poster or several pinned pages is fine (fonts should be easily readable from 5 feet away).
Course Evaluation: https://uw.iasystem.org/survey/200308 (or on MyUW)
Other anonymous Google form course feedback: https://bit.ly/2rmdyac
Homework 3 Problem 5 revisited. Optional. Can only increase your grade, but will not hurt it. 1

You may also like ML uses past data to make personalized predictions


Basics of Fair ML You work at a bank that gives loans based on credit score. You have historical data $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}$ is the credit score and $y_i \in \{0, 1\}$ indicates whether the loan was paid back. If the loan is paid back ($y_i = 1$) you receive $300 in interest; if the loan defaults ($y_i = 0$) you lose $700. Discussion based on [Hardt, Price, Srebro 16]. See http://www.fatml.org for more resources.
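To make the payoff numbers concrete (a quick worked step of my own, not from the slides): lending to an applicant with estimated repayment probability $p = P(y_i = 1 \mid x_i)$ is profitable in expectation exactly when $300\,p - 700\,(1 - p) > 0$, i.e., when $p > 0.7$. A purely profit-maximizing bank therefore lends only above a score threshold at which the estimated repayment probability exceeds 0.7; the fairness question in the following slides is whether and how that threshold may depend on the group $a_i$.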

Basics of Fair ML You work at a bank that gives loans based on credit score. Your boss tells you to make sure it doesn't discriminate on race. You have historical data $\{(x_i, a_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}$ is the credit score, $y_i \in \{0, 1\}$ indicates whether the loan was paid back, and $a_i \in \{\text{asian}, \text{white}, \text{hispanic}, \text{black}\}$ is race. If the loan is paid back ($y_i = 1$) you receive $300 in interest; if the loan defaults ($y_i = 0$) you lose $700. Discussion based on [Hardt, Price, Srebro 16]. See http://www.fatml.org for more resources.

Basics of Fair ML You work at a bank that gives loans based on credit score. Your boss tells you to make sure it doesn't discriminate on race. You have historical data $\{(x_i, a_i, y_i)\}_{i=1}^n$ with credit score $x_i \in \mathbb{R}$, repayment $y_i \in \{0, 1\}$, and race $a_i \in \{\text{asian}, \text{white}, \text{hispanic}, \text{black}\}$. - Fairness through unawareness: ignore $a_i$; everyone gets the same threshold, $P(x_i > t \mid a_i = a) = P(x_i > t)$. - Pro: simple. - Con: features are often a proxy for the protected group. Discussion based on [Hardt, Price, Srebro 16]. See http://www.fatml.org for more resources.

Basics of Fair ML You work at a bank that gives loans based on credit score. Your boss tells you to make sure it doesn't discriminate on race. You have historical data $\{(x_i, a_i, y_i)\}_{i=1}^n$ with credit score $x_i \in \mathbb{R}$, repayment $y_i \in \{0, 1\}$, and race $a_i \in \{\text{asian}, \text{white}, \text{hispanic}, \text{black}\}$. - Demographic parity: the proportion of loans made to each group is the same, $P(x_i > t \mid a_i = a) = P(x_i > t \mid a_i = a')$ for all groups $a, a'$. - Pro: sounds fair. - Con: groups more likely to pay back loans are penalized. Discussion based on [Hardt, Price, Srebro 16]. See http://www.fatml.org for more resources.

Basics of Fair ML You work at a bank that gives loans based on credit score. Your boss tells you to make sure it doesn't discriminate on race. You have historical data $\{(x_i, a_i, y_i)\}_{i=1}^n$ with credit score $x_i \in \mathbb{R}$, repayment $y_i \in \{0, 1\}$, and race $a_i \in \{\text{asian}, \text{white}, \text{hispanic}, \text{black}\}$. [Figure from the original slides: per-group credit score distributions, annotated with $P(y_i = 1 \mid x_i, a_i)$ and $P(x_i \le z \mid a_i)$.] Discussion based on [Hardt, Price, Srebro 16]. See http://www.fatml.org for more resources.

Basics of Fair ML You work at a bank that gives loans based on credit score. Your boss tells you to make sure it doesn't discriminate on race. You have historical data $\{(x_i, a_i, y_i)\}_{i=1}^n$ with credit score $x_i \in \mathbb{R}$, repayment $y_i \in \{0, 1\}$, and race $a_i \in \{\text{asian}, \text{white}, \text{hispanic}, \text{black}\}$. - Equal opportunity: the proportion of those who would pay back a loan who actually receive one is equal across groups. - Pro: Bayes optimal if the conditional distributions are the same. - Con: needs one class to be designated "good" and another "bad". Equal TPR: $P(x_i > t \mid y_i = 1, a_i = a) = P(x_i > t \mid y_i = 1, a_i = a')$ for all groups $a, a'$. Discussion based on [Hardt, Price, Srebro 16]. See http://www.fatml.org for more resources.
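As a rough illustration of how these criteria can be checked, the sketch below (mine, not from the lecture) estimates the per-group lending rate and true positive rate on synthetic data; the group names, score model, and threshold are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: credit score x, repayment y, group label a (all made up).
n = 10_000
a = rng.choice(["group1", "group2"], size=n)
x = rng.normal(loc=np.where(a == "group1", 620, 650), scale=50)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(x - 630) / 25))).astype(int)

t = 640          # one shared lending threshold for everyone
lend = x > t

for g in ["group1", "group2"]:
    mask = a == g
    lending_rate = lend[mask].mean()        # P(x > t | a = g): demographic parity compares these
    tpr = lend[mask & (y == 1)].mean()      # P(x > t | y = 1, a = g): equal opportunity compares these
    print(f"{g}: lending rate = {lending_rate:.2f}, TPR = {tpr:.2f}")
```

With a single shared threshold, the two printed TPRs will generally differ whenever the groups' score distributions differ; that gap is exactly what the equal opportunity criterion asks to be closed, typically by choosing group-specific thresholds.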

Basics of Fair ML You work at a bank that gives loans based on credit score. Your boss tells you to make sure it doesn't discriminate on race. You have historical data $\{(x_i, a_i, y_i)\}_{i=1}^n$ with credit score $x_i \in \mathbb{R}$, repayment $y_i \in \{0, 1\}$, and race $a_i \in \{\text{asian}, \text{white}, \text{hispanic}, \text{black}\}$. Discussion based on [Hardt, Price, Srebro 16]. See http://www.fatml.org for more resources.

www.fatml.org Discussion based on [Hardt, Price, Srebro 16]. See http://www.fatml.org for more resources.

Trees Machine Learning CSE546 Kevin Jamieson University of Washington December 4, 2018 12

Trees Build a binary tree, splitting along axes 13

Learning decision trees Start from an empty decision tree. Split on the next best attribute (feature), using, for example, information gain to select the attribute. Recurse. Prune. Kevin Jamieson 2016 14
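For concreteness, here is a minimal sketch (mine, not the lecture's code) of the information-gain score for a candidate split; entropy is measured in bits and the function names are illustrative.

```python
import numpy as np

def entropy(y):
    """Entropy (in bits) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x, threshold):
    """Entropy reduction from splitting labels y on feature values x at threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    child_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - child_entropy
```

Growing the tree then amounts to choosing, at each node, the (feature, threshold) pair with the largest gain, recursing on the two children, and pruning afterward.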

Trees Trees have low bias, high variance; deal with categorical variables well; are intuitive and interpretable; good software exists; some theoretical guarantees. 15

Random Forests Machine Learning CSE546 Kevin Jamieson University of Washington December 4, 2018 16

Random Forests Tree methods have low bias but high variance. One way to reduce variance is to construct a lot of lightly correlated trees and average them: Bagging: Bootstrap aggregating 17

Random Forests At each split, consider only a random subset of m of the p features (common defaults: m ~ p/3 for regression, m ~ sqrt(p) for classification). 18
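A minimal sketch of that recipe (mine, not the lecture's code): bootstrap resampling plus a random feature subset at each split, with labels assumed to be in {-1, +1}. In practice scikit-learn's RandomForestClassifier packages the same idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=100, seed=0):
    """Bagging of trees with per-split feature subsampling (max_features='sqrt')."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # bootstrap sample of the rows
        tree = DecisionTreeClassifier(max_features="sqrt")  # random subset of features per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_random_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])         # one row of votes per tree
    return np.sign(votes.mean(axis=0))                      # majority vote for labels in {-1, +1}
```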

Random Forests Random Forests have low bias, low variance; deal with categorical variables well; are not that intuitive or interpretable; give a notion of confidence estimates; good software exists; some theoretical guarantees; work well with default hyperparameters. 19

Boosting Machine Learning CSE546 Kevin Jamieson University of Washington December 4, 2018 20

Boosting 1988 Kearns and Valiant: Can weak learners be combined to create a strong learner? Weak learner definition (informal): An algorithm A is a weak learner for a hypothesis class H that maps X to $\{-1, 1\}$ if, for all input distributions over X and all $h \in H$, A correctly classifies h with error at most $1/2 - \gamma$ for some $\gamma > 0$. 21

Boosting 1988 Kearns and Valiant: Can weak learners be combined to create a strong learner? Weak learner definition (informal): An algorithm A is a weak learner for a hypothesis class H that maps X to $\{-1, 1\}$ if, for all input distributions over X and all $h \in H$, A correctly classifies h with error at most $1/2 - \gamma$ for some $\gamma > 0$. 1990 Robert Schapire: Yup! 1995 Schapire and Freund: Practical for 0/1 loss: AdaBoost 22

Boosting 1988 Kearns and Valiant: Can weak learners be combined to create a strong learner? Weak learner definition (informal): An algorithm A is a weak learner for a hypothesis class H that maps X to $\{-1, 1\}$ if, for all input distributions over X and all $h \in H$, A correctly classifies h with error at most $1/2 - \gamma$ for some $\gamma > 0$. 1990 Robert Schapire: Yup! 1995 Schapire and Freund: Practical for 0/1 loss: AdaBoost 2001 Friedman: Practical for arbitrary losses 23

Boosting 1988 Kearns and Valiant: Can weak learners be combined to create a strong learner? Weak learner definition (informal): An algorithm A is a weak learner for a hypothesis class H that maps X to $\{-1, 1\}$ if, for all input distributions over X and all $h \in H$, A correctly classifies h with error at most $1/2 - \gamma$ for some $\gamma > 0$. 1990 Robert Schapire: Yup! 1995 Schapire and Freund: Practical for 0/1 loss: AdaBoost 2001 Friedman: Practical for arbitrary losses 2014 Tianqi Chen: Scale it up! XGBoost 24

Boosting and Additive Models Machine Learning CSE546 Kevin Jamieson University of Washington December 4, 2018 25

Additive models Consider the first algorithm we used to get good classification for MNIST. Given: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$. Generate random functions: $\phi_t : \mathbb{R}^d \to \mathbb{R}$, $t = 1, \dots, p$. Learn some weights: $\hat{w} = \arg\min_w \sum_{i=1}^n \mathrm{Loss}\left(y_i, \sum_{t=1}^p w_t \phi_t(x_i)\right)$. Classify new data: $f(x) = \mathrm{sign}\left(\sum_{t=1}^p \hat{w}_t \phi_t(x)\right)$. 26

Additive models Consider the first algorithm we used to get good classification for MNIST. Given: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$. Generate random functions: $\phi_t : \mathbb{R}^d \to \mathbb{R}$, $t = 1, \dots, p$. Learn some weights: $\hat{w} = \arg\min_w \sum_{i=1}^n \mathrm{Loss}\left(y_i, \sum_{t=1}^p w_t \phi_t(x_i)\right)$. Classify new data: $f(x) = \mathrm{sign}\left(\sum_{t=1}^p \hat{w}_t \phi_t(x)\right)$. An interpretation: Each $\phi_t(x)$ is a classification rule to which we assign some weight $\hat{w}_t$. 27

Additive models Consider the first algorithm we used to get good classification for MNIST. Given: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$. Generate random functions: $\phi_t : \mathbb{R}^d \to \mathbb{R}$, $t = 1, \dots, p$. Learn some weights: $\hat{w} = \arg\min_w \sum_{i=1}^n \mathrm{Loss}\left(y_i, \sum_{t=1}^p w_t \phi_t(x_i)\right)$. Classify new data: $f(x) = \mathrm{sign}\left(\sum_{t=1}^p \hat{w}_t \phi_t(x)\right)$. An interpretation: Each $\phi_t(x)$ is a classification rule to which we assign some weight $\hat{w}_t$. Jointly optimizing over weights and functions, $\hat{w}, \hat{\phi}_1, \dots, \hat{\phi}_p = \arg\min_{w, \phi_1, \dots, \phi_p} \sum_{i=1}^n \mathrm{Loss}\left(y_i, \sum_{t=1}^p w_t \phi_t(x_i)\right)$, is in general computationally hard. 28
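As a reminder of what that MNIST-style recipe looks like in code, here is a minimal sketch (mine; the cosine feature form, dimensions, and placeholder arrays X_train, y_train, X_test are illustrative assumptions): the random functions $\phi_t$ are generated once and frozen, and only the weights are learned by a convex solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_random_features(d, p=500, seed=0):
    """Return a fixed random feature map phi: R^{n x d} -> R^{n x p}."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(d, p))
    b = rng.uniform(0, 2 * np.pi, size=p)
    return lambda X: np.cos(X @ G + b)   # phi_t generated once, never re-learned

# phi = make_random_features(d=X_train.shape[1])                       # X_train etc. are placeholders
# clf = LogisticRegression(max_iter=1000).fit(phi(X_train), y_train)   # learns only the weights w
# y_hat = clf.predict(phi(X_test))                                     # decision ~ sign(w^T phi(x))
```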

Forward Stagewise Additive models $b(x, \gamma)$ is a function with parameters $\gamma$. Examples: $b(x, \gamma) = \frac{1}{1 + e^{-\gamma^T x}}$, $b(x, \gamma) = \gamma_1 \mathbf{1}\{x_3 \le \gamma_2\}$. Idea: greedily add one function at a time. 29

Forward Stagewise Additive models $b(x, \gamma)$ is a function with parameters $\gamma$. Examples: $b(x, \gamma) = \frac{1}{1 + e^{-\gamma^T x}}$, $b(x, \gamma) = \gamma_1 \mathbf{1}\{x_3 \le \gamma_2\}$. Idea: greedily add one function at a time. AdaBoost: $b(x, \gamma)$: classifiers mapping to $\{-1, 1\}$; $L(y, f(x)) = \exp(-y f(x))$. 30

Forward Stagewise Additive models $b(x, \gamma)$ is a function with parameters $\gamma$. Examples: $b(x, \gamma) = \frac{1}{1 + e^{-\gamma^T x}}$, $b(x, \gamma) = \gamma_1 \mathbf{1}\{x_3 \le \gamma_2\}$. Idea: greedily add one function at a time. Boosted Regression Trees: $L(y, f(x)) = (y - f(x))^2$; $b(x, \gamma)$: regression trees. 31

Forward Stagewise Additive models $b(x, \gamma)$ is a function with parameters $\gamma$. Examples: $b(x, \gamma) = \frac{1}{1 + e^{-\gamma^T x}}$, $b(x, \gamma) = \gamma_1 \mathbf{1}\{x_3 \le \gamma_2\}$. Idea: greedily add one function at a time. Boosted Regression Trees: $L(y, f(x)) = (y - f(x))^2$. Efficient: no harder than learning regression trees! 32
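A minimal sketch of forward stagewise boosting with squared loss (mine, not the lecture's code), which shows why it is no harder than learning regression trees: each round simply fits a new tree to the current residuals. The depth and number of rounds are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, n_rounds=100, max_depth=3):
    f = np.zeros(len(y))                     # current additive model at the training points
    trees = []
    for _ in range(n_rounds):
        residual = y - f                     # for squared loss, the best new b(x, gamma) fits the residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f += tree.predict(X)                 # greedily add one function at a time
        trees.append(tree)
    return trees

def predict_boosted(trees, X):
    return sum(t.predict(X) for t in trees)  # f(x) = sum_t b(x, gamma_t)
```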

Forward Stagewise Additive models $b(x, \gamma)$ is a function with parameters $\gamma$. Examples: $b(x, \gamma) = \frac{1}{1 + e^{-\gamma^T x}}$, $b(x, \gamma) = \gamma_1 \mathbf{1}\{x_3 \le \gamma_2\}$. Idea: greedily add one function at a time. Boosted Logistic Trees: $L(y, f(x)) = -\left(y \log f(x) + (1 - y) \log(1 - f(x))\right)$; $b(x, \gamma)$: regression trees. Computationally hard to update. 33

Gradient Boosting Least squares and exponential loss are easy. But what about cross entropy? Huber loss? Idea: least-squares fit a regression tree to the n-dimensional gradient of the loss at the current predictions, and take a step in that direction. 34
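Here is a minimal sketch of that recipe for the cross-entropy loss (my illustration, not the lecture's code): compute the n-dimensional negative gradient of the loss at the current predictions, least-squares fit a regression tree to it, and take a small step in that direction. Labels are assumed to be in {0, 1}; the step size and depth are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_logistic(X, y, n_rounds=100, step=0.1, max_depth=3):
    F = np.zeros(len(y))                          # current scores F(x_i)
    trees = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-F))              # current probability estimates
        neg_grad = y - p                          # negative gradient of cross-entropy w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)  # LS fit to the gradient
        F += step * tree.predict(X)               # take a step in that direction
        trees.append(tree)
    return trees   # final score: F(x) = step * sum of tree outputs; probabilities via the sigmoid
```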

Gradient Boosting Least squares and 0/1 loss are easy. But what about cross entropy? Huber loss? AdaBoost targets the 0/1 loss; all the other boosted trees are minimizing binomial deviance. 35

Additive models Boosting is popular at parties: Invented by theorists, heavily adopted by practitioners. 36

Additive models Boosting is popular at parties: Invented by theorists, heavily adopted by practitioners. Computationally efficient with weak learners. But can also use trees! Boosting can scale. Kind of like sparsity? 37

Additive models Boosting is popular at parties: Invented by theorists, heavily adopted by practitioners. Computationally efficient with weak learners. But can also use trees! Boosting can scale. Kind of like sparsity? Gradient boosting is a generalization with good software packages (e.g., XGBoost). Effective on Kaggle. Fairly robust to overfitting, which can be controlled with shrinkage and subsampling. 38
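To show where shrinkage and subsampling appear in such packages, here is a hedged usage sketch assuming the xgboost Python package is installed; the hyperparameter values and the placeholder arrays X_train, y_train, X_test are illustrative, not recommendations from the lecture.

```python
from xgboost import XGBClassifier  # assumes the xgboost package is installed

model = XGBClassifier(
    n_estimators=500,      # number of trees added stagewise
    max_depth=3,           # keep each tree a weak-ish learner
    learning_rate=0.1,     # shrinkage on each tree's contribution
    subsample=0.8,         # row subsampling per boosting round
    colsample_bytree=0.8,  # feature subsampling per tree
)
# model.fit(X_train, y_train)      # X_train, y_train, X_test are placeholders
# y_hat = model.predict(X_test)
```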

Bagging versus Boosting Bagging averages many low-bias, lightly dependent classifiers to reduce the variance. Boosting learns a linear combination of high-bias, highly dependent classifiers to reduce error. Empirically, boosting appears to outperform bagging. 39

Which algorithm do I use? 40