
Statistisches Data Mining (StDM) Woche 11. Oliver Dürr, Institut für Datenanalyse und Prozessdesign, Zürcher Hochschule für Angewandte Wissenschaften. oliver.duerr@zhaw.ch. Winterthur, 29 November 2016

Multitasking reduces learning efficiency: no laptops during the theory lessons. Lid closed or almost closed (sleep mode).

Overview of classification (until the end of the semester)

Classifiers: K-Nearest-Neighbors (KNN), Logistic Regression, Linear Discriminant Analysis, Support Vector Machine (SVM), Classification Trees, Neural Networks (NN), Deep Neural Networks (e.g. CNN, RNN)
Evaluation: cross-validation, performance measures, ROC analysis / lift charts
Theoretical guidance / general ideas: Bayes classifier, bias-variance trade-off (overfitting)
Combining classifiers: Bagging, Boosting, Random Forest
Feature engineering: feature extraction, feature selection

Decision Trees. Chapter 8.1 in ISLR

Note on ISLR In ISLR they also include trees for regression. Here we focus on trees for classification

Example of a decision tree for classification

Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
 1  Yes  Single    125K  No
 2  No   Married   100K  No
 3  No   Single     70K  No
 4  Yes  Married   120K  No
 5  No   Divorced   95K  Yes
 6  No   Married    60K  No
 7  Yes  Divorced  220K  No
 8  No   Single     85K  Yes
 9  No   Married    75K  No
10  No   Single     90K  Yes

Model (decision tree) with splitting attributes Refund, MarSt and TaxInc:
Refund = Yes                                        -> NO
Refund = No, MarSt = Married                        -> NO
Refund = No, MarSt = Single/Divorced, TaxInc < 80K  -> NO
Refund = No, MarSt = Single/Divorced, TaxInc > 80K  -> YES

Source: Tan, Steinbach, Kumar

Another example of a decision tree, fitted to the same training data: split first on MarSt (Married -> NO), then on Refund (Yes -> NO), then on TaxInc for the remaining records (< 80K -> NO, > 80K -> YES). There could be more than one tree that fits the same data! Source: Tan, Steinbach, Kumar

Decision tree classification task

Induction: a tree induction algorithm learns a model (the decision tree) from the training set (Tid 1-10, attributes Attrib1-Attrib3, known class labels).
Deduction: the learned model is applied to the test set (Tid 11-15, class unknown) to predict the class labels.

Source: Tan, Steinbach, Kumar

Apply Model to Test Data

Start from the root of the tree. Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Following the splits: Refund = No -> MarSt = Married -> assign Cheat = No.

Source: Tan, Steinbach, Kumar


Restrict to recursive partitioning by binary splits. There are many approaches: C4.5 / C5.0 algorithms, SPRINT. Here: top-down splitting until no further split is possible (or another stopping criterion is met).

How to train a classification tree. Score: node impurity (next slides). Exercise: draw 3 splits to separate the data and draw the corresponding tree. (Figure: scatter plot of two classes in the (x1, x2) plane, with guide values x1 = 0.25 and x2 = 0.4.)

Construction of a classification tree: minimize the impurity in each node. The parent node p is split into 2 partitions. Maximize the gain over all possible splits:

$\mathrm{GAIN}_{\mathrm{split}} = \mathrm{IMPURITY}(p) - \sum_{i=1}^{2} \frac{n_i}{n}\,\mathrm{IMPURITY}(i)$

where $n_i$ is the number of records in partition $i$. Possible impurity measures: Gini index, entropy, misclassification error. Source: Tan, Steinbach, Kumar
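To make the gain formula concrete, here is a minimal R sketch that scores a single candidate split with the Gini index; gini() and split_gain() are illustrative helper names (not rpart functions), and the Petal.Length split on the iris data is only a toy example.

# Gini impurity of a node, given the label vector of the records in it
gini <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- table(labels) / length(labels)   # relative class frequencies p(j|t)
  1 - sum(p^2)
}

# Gain of a binary split: impurity of the parent minus the weighted
# impurities of the two child partitions
split_gain <- function(labels, in_left) {
  n  <- length(labels)
  nl <- sum(in_left)
  nr <- n - nl
  gini(labels) -
    (nl / n) * gini(labels[in_left]) -
    (nr / n) * gini(labels[!in_left])
}

# Toy example: split the iris data at Petal.Length < 2.5
split_gain(iris$Species, iris$Petal.Length < 2.5)   # about 0.33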

Construction of a classification tree: minimize the impurity in each node. Example: there are n_c = 4 different classes (red, green, blue, olive). (Figure: the data before the split and two possible splits.)

The three most common impurity measures. Here $p(j \mid t)$ is the relative frequency of class $j$ at node $t$ and $n_c$ is the number of different classes.

Gini index: $\mathrm{GINI}(t) = 1 - \sum_{j=1}^{n_c} [p(j \mid t)]^2$

Entropy: $\mathrm{Entropy}(t) = -\sum_{j=1}^{n_c} p(j \mid t)\,\log_2 p(j \mid t)$

Classification error: $\mathrm{Error}(t) = 1 - \max_{j \in \{1,\dots,n_c\}} p(j \mid t)$
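A minimal R sketch of the three measures, computed from the label vector of a node; node_impurity() is an illustrative helper name, not part of rpart. The example node with class counts C1 = 1 and C2 = 5 matches the worked examples on the following slides.

node_impurity <- function(labels, measure = c("gini", "entropy", "error")) {
  measure <- match.arg(measure)
  p <- table(labels) / length(labels)   # p(j|t), relative class frequencies
  p <- p[p > 0]                         # drop empty classes (0 * log2(0) := 0)
  switch(measure,
         gini    = 1 - sum(p^2),
         entropy = -sum(p * log2(p)),
         error   = 1 - max(p))
}

node <- c(rep("C1", 1), rep("C2", 5))   # node with class counts C1 = 1, C2 = 5
node_impurity(node, "gini")             # 0.278
node_impurity(node, "entropy")          # 0.65
node_impurity(node, "error")            # 1/6 = 0.167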

Computing the Gini index for three 2-class examples. At a node $t$ with $n_c = 2$ classes:

$\mathrm{GINI}(t) = 1 - \sum_{j=1}^{2} [p(j \mid t)]^2 = 1 - \big(p(\text{class 1} \mid t)^2 + p(\text{class 2} \mid t)^2\big)$

Class counts C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, Gini = 1 - 0^2 - 1^2 = 0
Class counts C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6, Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
Class counts C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6, Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

Source: Tan, Steinbach, Kumar

Computing the entropy for the same three examples. At a node $t$ with 2 classes:

$\mathrm{Entropy}(t) = -\big(p(\text{class 1})\log_2 p(\text{class 1}) + p(\text{class 2})\log_2 p(\text{class 2})\big)$

C1 = 0, C2 = 6: p(C1) = 0, p(C2) = 1, Entropy = -0*log2(0) - 1*log2(1) = 0
C1 = 1, C2 = 5: p(C1) = 1/6, p(C2) = 5/6, Entropy = -(1/6)*log2(1/6) - (5/6)*log2(5/6) = 0.65
C1 = 2, C2 = 4: p(C1) = 2/6, p(C2) = 4/6, Entropy = -(2/6)*log2(2/6) - (4/6)*log2(4/6) = 0.92

Source: Tan, Steinbach, Kumar

Computing the misclassification error for the same three examples. At a node $t$ with 2 classes:

$\mathrm{Error}(t) = 1 - \max\big(p(\text{class 1}),\, p(\text{class 2})\big)$

C1 = 0, C2 = 6: p(C1) = 0, p(C2) = 1, Error = 1 - max(0, 1) = 1 - 1 = 0
C1 = 1, C2 = 5: p(C1) = 1/6, p(C2) = 5/6, Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
C1 = 2, C2 = 4: p(C1) = 2/6, p(C2) = 4/6, Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

Source: Tan, Steinbach, Kumar

Compare the three impurity measures for a 2-class problem. (Figure: impurity as a function of p, the proportion of class 1.) Source: Tan, Steinbach, Kumar
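The comparison plot can be reproduced with a few lines of base R. As an assumption for plotting only, the entropy curve is rescaled by 1/2 so that all three curves peak at the same height of 0.5.

p       <- seq(0, 1, by = 0.01)           # proportion of class 1
gini    <- 2 * p * (1 - p)                # 1 - p^2 - (1 - p)^2
entropy <- ifelse(p %in% c(0, 1), 0, -(p * log2(p) + (1 - p) * log2(1 - p)))
error   <- 1 - pmax(p, 1 - p)

plot(p, entropy / 2, type = "l", col = "red",
     xlab = "p: proportion of class 1", ylab = "impurity")
lines(p, gini, col = "blue")
lines(p, error, col = "darkgreen")
legend("topright", legend = c("entropy / 2", "Gini", "misclassification error"),
       col = c("red", "blue", "darkgreen"), lty = 1)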

Using a tree for classification problems. Score: node impurity. Model: the class label in each region is given by the majority of all observation labels in that region. (Figure: the tree with splits x1 < 0.25 and x2 < 0.4, and the corresponding partition of the (x1, x2) plane into rectangular regions.)

Decision Tree in R

# rpart for recursive partitioning
library(rpart)
d <- rpart(Species ~ ., data = iris)
plot(d, margin = 0.2, uniform = TRUE)
text(d, use.n = TRUE)
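A possible follow-up, assuming the fitted tree d from above: the tree can be used for prediction with predict(); the confusion table is computed on the training data and only illustrates the call.

pred <- predict(d, newdata = iris, type = "class")   # predicted class labels
table(predicted = pred, true = iris$Species)         # confusion matrix (training data)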

Overfitting

The larger the tree, the lower the error on the training set (low bias). However, the test error can get worse (high variance). Taken from ISLR

Pruning

First strategy: stop growing the tree if the gain in the impurity measure is small. This is too short-sighted: a seemingly worthless split early on in the tree might be followed by a very good split. A better strategy is to grow a very large tree to the end and then prune it back in order to obtain a subtree. In rpart the pruning is controlled with a complexity parameter cp. cp is a tuning parameter of the tree which can be optimized (like k in kNN). Later: cross-validation as a way to systematically find the best meta-parameter.
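A sketch of how cp could be chosen in practice: rpart already performs an internal cross-validation while growing the tree, and its cp table can be used to pick the value with the smallest cross-validated error. The iris data is only an illustration; best_cp and pruned are illustrative variable names.

library(rpart)
fit <- rpart(Species ~ ., data = iris, control = rpart.control(cp = 0.001))
printcp(fit)   # table of cp values with cross-validated errors (xerror)
plotcp(fit)    # plot of cross-validated error versus cp
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # prune back to the selected subtree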

Pruning

# Original tree (crabs data from the MASS package)
library(MASS)
library(rpart)
t1 <- rpart(sex ~ ., data = crabs)
plot(t1, margin = 0.2, uniform = TRUE); text(t1, use.n = TRUE)

# Pruned trees for increasing complexity parameter cp
t2 <- prune(t1, cp = 0.05); plot(t2); text(t2, use.n = TRUE)
t3 <- prune(t1, cp = 0.1);  plot(t3); text(t3, use.n = TRUE)
t4 <- prune(t1, cp = 0.2);  plot(t4); text(t4, use.n = TRUE)

Trees vs Linear Models Slide Credit: ISLR

Pros and cons of tree models

Pros:
- Interpretability
- Robust to outliers in the input variables
- Can capture non-linear structures
- Can capture local interactions very well
- Low bias if appropriate input variables are available and the tree has sufficient depth

Cons:
- High variance (instability of trees)
- Tend to overfit
- Need big data sets to capture additive structures
- Inefficient for capturing linear structures

Joining Forces

Which supervised learning method is best? Image: Seni & Elder

Bundling improves performance. Image: Seni & Elder

Evolving results I: low-hanging fruit and slowed-down progress. After 3 weeks, at least 40 teams had improved on the Netflix classifier. Top teams showed about 5% improvement. However, improvement slowed. From http://www.research.att.com/~volinsky/netflix/

Details: Gravity. Quote: [home.mit.bme.hu/~gtakacs/download/gravity.pdf]

Details: When Gravity and Dinosaurs Unite. Quote: "Our common team blends the result of team Gravity and team Dinosaur Planet."

Details: BellKor / KorBell. Quote: "Our final solution (RMSE=0.8712) consists of blending 107 individual results."

Evolving results III: final results. The winner was an ensemble of ensembles (including BellKor), combined using gradient boosted decision trees. [http://www.netflixprize.com/assets/grandprize2009_bpc_bellkor.pdf]

Overview of Ensemble Methods

Many instances of the same classifier:
- Bagging (bootstrapping & aggregating): create new data sets using the bootstrap, train a classifier on each, and average over all predictions (e.g. take the majority vote).
- Bagged trees: use bagging with decision trees.
- Random Forest: bagged trees with a special trick.
- Boosting: an iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records.

Combining different classifiers:
- Weighted averaging over predictions.
- Stacking classifiers: use the output of the classifiers as input for a new classifier.

Bagging

Idea of the bootstrap. Idea: the sampling distribution lies close enough to the true distribution.

Bagging: Bootstrap Aggregating

Step 1: Create multiple data sets D_1, ..., D_t from the original training data D by bootstrapping.
Step 2: Build multiple classifiers C_1, ..., C_t, one per bootstrap sample.
Step 3: Combine the classifiers into C*. Aggregating: mean, majority vote.

Source: Tan, Steinbach, Kumar
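A minimal sketch of bagging by hand with classification trees in R; bagged_trees() and predict_bagged() are illustrative helper names, and in practice a package such as randomForest or ipred would be used instead.

library(rpart)

bagged_trees <- function(formula, data, B = 25) {
  # Step 1 + 2: one tree per bootstrap sample of the training data
  lapply(seq_len(B), function(b) {
    boot_idx <- sample(nrow(data), replace = TRUE)
    rpart(formula, data = data[boot_idx, ])
  })
}

predict_bagged <- function(trees, newdata) {
  # Step 3: aggregate the individual predictions by majority vote
  votes <- sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

trees <- bagged_trees(Species ~ ., iris, B = 25)
pred  <- predict_bagged(trees, iris)
mean(pred == iris$Species)   # accuracy of the bagged ensemble on the training data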

Why does it work? Suppose there are 25 base classifiers and each classifier has error rate ε = 0.35. Assume the classifiers are independent (that's the hardest assumption). Take the majority vote: it is wrong if 13 or more of the 25 classifiers are wrong. The number of wrong predictions is X ~ Bin(size = 25, p = 0.35), so

$\sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}\,(1-\varepsilon)^{25-i} \approx 0.06$

> 1 - pbinom(12, size = 25, prob = 0.35)
[1] 0.06044491

Why does it work? (Figure: ensemble error versus base classifier error for 25 base classifiers.) Ensembles are only better than a single classifier if each classifier is better than random guessing!
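The curve behind this statement can be reproduced with a short R sketch that plots the majority-vote error of 25 independent base classifiers as a function of their common error rate (an illustration under the independence assumption from the previous slide).

eps      <- seq(0, 1, by = 0.01)                   # error rate of each base classifier
ensemble <- 1 - pbinom(12, size = 25, prob = eps)  # P(13 or more of 25 are wrong)

plot(eps, ensemble, type = "l",
     xlab = "base classifier error", ylab = "ensemble error")
abline(0, 1, lty = 2)   # reference line: error of a single classifier
# For eps < 0.5 the majority vote beats a single classifier; for eps > 0.5 it is worse.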