Deconstructing Data Science


Transcription:

Deconstructing Data Science. David Bamman, UC Berkeley. Info 290. Lecture 6: Decision trees & random forests. Feb 2, 2016

Linear regression, Deep learning, Decision trees, Ordinal regression, Probabilistic graphical models, Random forests, Support vector machines, Logistic regression, Survival models, Topic models, Networks, Neural networks, Perceptron, K-means clustering, Hierarchical clustering

Decision trees. Random forests

20 questions

Feature                              Value
follow clinton                       0
follow trump                         0
benghazi                             0
negative sentiment + benghazi        0
illegal immigrants                   0
republican in profile                0
democrat in profile                  0
self-reported location = Berkeley    1

[Decision tree diagram over these features; internal nodes include: lives in Berkeley, follows Trump, email, profile Republican]

[Decision tree diagrams: one over the political features above (lives in Berkeley, follows Trump, email, profile Republican) and one splitting on function words (the, a, he, she, they, an, are, our, them, him, her, hers, his)]

How do we find the best tree?

Decision trees (from Flach 2014)

<x, y> training data
[Decision tree diagram with splits on x2 > 15 vs. x2 ≤ 15, x2 > 5 vs. x2 ≤ 5, and x1 ≤ 10 vs. x1 > 10]

Decision trees (from Flach 2014)

Decision trees
Homogeneous(D): the elements in D are homogeneous enough that they can be labeled with a single label.
Label(D): the single most appropriate label for all elements in D.

Decision trees
Classification: Homogeneous(D) = all (or most) of the elements in D share the same label y; Label(D) = y.
Regression: Homogeneous(D) = the elements in D have low variance; Label(D) = the average of elements in D.

Decision trees (from Flach 2014)

Entropy
Measure of uncertainty in a probability distribution: H(X) = -∑_{x ∈ X} P(x) log P(x)
Examples: "a great ___", "the oakland ___"

"a great ___":  deal 12196, job 2164, idea 1333, opportunity 855, weekend 585, player 556, extent 439, honor 282, pleasure 267, gift 256, humor 221, tool 184, athlete 173, disservice 108
"the oakland ___":  athletics 185, raiders 185, museum 92, hills 72, tribune 51, police 49, coliseum 41
(counts from the Corpus of Contemporary American English)

Entropy: H(X) = -∑_{x ∈ X} P(x) log P(x)
High entropy means the phenomenon is less predictable; an entropy of 0 means it is entirely predictable.

Entropy
[Bar charts of P(X = x) for x = 1, ..., 6 under two distributions]
Uniform distribution: -6 × (1/6) log (1/6) = 2.58. A uniform distribution has maximum entropy.
Skewed distribution (one outcome with probability 0.4, the other five with 0.12 each): -0.4 log 0.4 - 5 × 0.12 log 0.12 = 2.36. This entropy is lower because the distribution is more predictable (if we always guess 2, we would be right 40% of the time).
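To make the arithmetic above concrete, here is a minimal Python sketch (not from the slides) that computes the entropy of the two distributions; the function name `entropy` and the choice of log base 2 are assumptions.

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_x P(x) log P(x), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1 / 6] * 6))          # ~2.58 bits: a uniform distribution has maximum entropy
print(entropy([0.4] + [0.12] * 5))   # ~2.36 bits: lower, because the distribution is more predictable
```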

Conditional entropy
Measures your level of surprise about some phenomenon Y if you have information about another phenomenon X.
Y = word, X = preceding bigram ("the oakland")
Y = label (democrat, republican), X = feature (lives in Berkeley)

Conditional entropy
Measures your level of surprise about some phenomenon Y if you have information about another phenomenon X (Y = label, X = feature value).
H(Y|X) = ∑_{x} P(X = x) H(Y | X = x)
H(Y | X = x) = -∑_{y ∈ Y} p(y|x) log p(y|x)

Information gain, aka mutual information: the reduction in entropy in Y as a result of knowing information about X.
IG(Y; X) = H(Y) - H(Y|X)
H(Y) = -∑_{y ∈ Y} p(y) log p(y)
H(Y|X) = -∑_{x ∈ X} p(x) ∑_{y ∈ Y} p(y|x) log p(y|x)
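As a rough illustration of these definitions, the following Python sketch (not part of the original slides) estimates H(Y), H(Y|X), and the information gain from paired samples; the function names are hypothetical.

```python
from collections import Counter, defaultdict
import math

def H(labels):
    """Empirical entropy H(Y) in bits, from a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum_x P(X = x) * H(Y | X = x), estimated from paired samples."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    return sum((len(g) / n) * H(g) for g in groups.values())

def information_gain(xs, ys):
    """IG = H(Y) - H(Y|X), i.e. the mutual information between X and Y."""
    return H(ys) - conditional_entropy(xs, ys)
```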

      1   2   3   4   5   6
x1    0   1   1   0   0   1
x2    0   0   0   1   1   1
y     +   -   -   +   +   -

Which of these features gives you more information about y?

      1   2   3   4   5   6
x1    0   1   1   0   0   1
x2    0   0   0   1   1   1
y     +   -   -   +   +   -

Contingency table for x1:
        x1 = 0   x1 = 1
y = +   3        0
y = -   0        3

Contingency table for x1 (as above):
        x1 = 0   x1 = 1
y = +   3        0
y = -   0        3

H(Y|X) = -∑_{x ∈ X} p(x) ∑_{y ∈ Y} p(y|x) log p(y|x)
P(x = 0) = 3 / (3 + 3) = 0.5          P(x = 1) = 3 / (3 + 3) = 0.5
P(y = + | x = 0) = 3 / (3 + 0) = 1    P(y = - | x = 0) = 0 / (3 + 0) = 0
P(y = + | x = 1) = 0 / (0 + 3) = 0    P(y = - | x = 1) = 3 / (0 + 3) = 1

Contingency table for x1 (as above):
        x1 = 0   x1 = 1
y = +   3        0
y = -   0        3

H(Y|X) = -∑_{x ∈ X} p(x) ∑_{y ∈ Y} p(y|x) log p(y|x)
       = -(3/6)(1 log 1 + 0 log 0) - (3/6)(0 log 0 + 1 log 1) = 0

      1   2   3   4   5   6
x1    0   1   1   0   0   1
x2    0   0   0   1   1   1
y     +   -   -   +   +   -

Contingency table for x2:
        x2 = 0   x2 = 1
y = +   1        2
y = -   2        1

Contingency table for x2 (as above):
        x2 = 0   x2 = 1
y = +   1        2
y = -   2        1

P(x = 0) = 3 / (3 + 3) = 0.5             P(x = 1) = 3 / (3 + 3) = 0.5
P(y = + | x = 0) = 1 / (1 + 2) = 0.33    P(y = - | x = 0) = 2 / (1 + 2) = 0.67
P(y = + | x = 1) = 2 / (2 + 1) = 0.67    P(y = - | x = 1) = 1 / (2 + 1) = 0.33

Contingency table for x2 (as above):
        x2 = 0   x2 = 1
y = +   1        2
y = -   2        1

H(Y|X) = -∑_{x ∈ X} p(x) ∑_{y ∈ Y} p(y|x) log p(y|x)
       = -(3/6)(0.33 log 0.33 + 0.67 log 0.67) - (3/6)(0.67 log 0.67 + 0.33 log 0.33) = 0.91
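A quick sanity check of the two hand computations above, assuming the binary-entropy shortcut H(Y | X = x) = H2(p), where p is the fraction of "+" labels in a branch; the helper name H2 is made up for this sketch.

```python
import math

def H2(p):
    """Binary entropy (in bits) of a two-outcome distribution (p, 1 - p); 0 log 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Split on x1: both branches are pure, so each conditional term is 0.
print(0.5 * H2(1.0) + 0.5 * H2(0.0))                  # 0.0

# Split on x2: each branch is a 1/3 vs. 2/3 mix of labels.
print(round(0.5 * H2(1 / 3) + 0.5 * H2(2 / 3), 2))    # 0.92 (the slide rounds to 0.33/0.67 and reports 0.91)
```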

Feature                              H(Y|X)
follow clinton                       0.91
follow trump                         0.77
benghazi                             0.45
negative sentiment + benghazi        0.33
illegal immigrants                   0
republican in profile                0.31
democrat in profile                  0.67
self-reported location = Berkeley    0.80

In decision trees, the feature with the lowest conditional entropy / highest information gain defines the best split. MI = IG = H(Y) - H(Y|X); for a given partition, H(Y) is the same for all features, so we can ignore it when deciding among them.

Feature                              H(Y|X)
follow clinton                       0.91
follow trump                         0.77
benghazi                             0.45
negative sentiment + benghazi        0.33
illegal immigrants                   0
republican in profile                0.31
democrat in profile                  0.67
self-reported location = Berkeley    0.80

How could we use this in other models (e.g., the perceptron)?

Decision trees
BestSplit identifies the feature with the highest information gain and partitions the data according to values for that feature.
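Putting the pieces together, here is a hedged sketch of the recursive tree-growing procedure the slides describe; the Python function names mirror the slide terminology (Homogeneous, Label, BestSplit), but the implementation details (dict-based nodes, a set of feature names, entropy in bits) are assumptions, not the lecture's code.

```python
from collections import Counter
import math

def entropy_of_counts(counts):
    """Entropy (in bits) of a label distribution given as raw counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def homogeneous(D):
    """All elements of D (a list of (x, y) pairs) share the same label."""
    return len({y for _, y in D}) <= 1

def label(D):
    """The single most appropriate label: the majority label in D."""
    return Counter(y for _, y in D).most_common(1)[0][0]

def best_split(D, features):
    """The feature whose partition has the lowest H(Y|X), i.e. the highest information gain."""
    def h_y_given(f):
        by_value = {}
        for x, y in D:
            by_value.setdefault(x[f], Counter())[y] += 1
        n = len(D)
        return sum(sum(c.values()) / n * entropy_of_counts(c.values())
                   for c in by_value.values())
    return min(features, key=h_y_given)

def grow_tree(D, features):
    """Recursive induction: stop when D is homogeneous (or features run out),
    otherwise split on the best feature and recurse on each partition."""
    if homogeneous(D) or not features:
        return {'label': label(D)}
    f = best_split(D, features)
    children = {v: grow_tree([(x, y) for x, y in D if x[f] == v], features - {f})
                for v in {x[f] for x, _ in D}}
    return {'feature': f, 'children': children}
```

On the toy table above, grow_tree would pick x1 at the root and then stop, since both resulting branches are homogeneous.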

Gini impurity
Measures the purity of a partition D (how diverse the labels are): if we were to pick an element in D and assign a label in proportion to the label distribution in D, how often would we make a mistake?
∑_{y ∈ Y} p_y (1 - p_y)
p_y is the probability of selecting an item with label y at random; (1 - p_y) is the probability of randomly assigning it the wrong label.

Gini impurity: ∑_{y ∈ Y} p_y (1 - p_y)

For x1 (branches 3/0 and 0/3):
G(0) = 1 × (1 - 1) + 0 × (1 - 0) = 0
G(1) = 0 × (1 - 0) + 1 × (1 - 1) = 0
G(x1) = (3 / (3 + 3)) × 0 + (3 / (3 + 3)) × 0 = 0

For x2 (branches 1/2 and 2/1):
G(0) = 0.33 × (1 - 0.33) + 0.67 × (1 - 0.67) = 0.44
G(1) = 0.67 × (1 - 0.67) + 0.33 × (1 - 0.33) = 0.44
G(x2) = (3 / (3 + 3)) × 0.44 + (3 / (3 + 3)) × 0.44 = 0.44
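A small Python sketch of the Gini computation, assuming labels are given as a plain list; the 0 and 0.44 values match the branches worked out above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: sum_y p_y * (1 - p_y), the chance of mislabeling a random element
    when labels are assigned in proportion to the empirical label distribution."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini(['+', '+', '+']))   # 0.0   (a pure branch, like both branches of x1)
print(gini(['+', '-', '-']))   # ~0.44 (a 1/3 vs. 2/3 branch, like both branches of x2)
```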

Classification
A mapping h from input data x (drawn from instance space X) to a label (or labels) y from some enumerable output space Y.
X = set of all skyscrapers; Y = {art deco, neo-gothic, modern}
x = the empire state building; y = art deco

Feature                              Value
follow clinton                       0
follow trump                         0
benghazi                             0
negative sentiment + benghazi        0
illegal immigrants                   0
republican in profile                0
democrat in profile                  0
self-reported location = Berkeley    1

[Decision tree over these features: lives in Berkeley, follows Trump, email, profile Republican]
The tree that we've learned is the mapping ĥ(x).

Feature                              Value
follow clinton                       0
follow trump                         0
benghazi                             0
negative sentiment + benghazi        0
illegal immigrants                   0
republican in profile                0
democrat in profile                  0
self-reported location = Berkeley    1

[Decision tree over these features: lives in Berkeley, follows Trump, email, profile Republican]
How is this different from the perceptron?

Regression
A mapping from input data x (drawn from instance space X) to a point y in ℝ (the set of real numbers).
x = the empire state building; y = 17444.5625

Feature                              Value
follow clinton                       0
follow trump                         0
benghazi                             0
negative sentiment + benghazi        0
illegal immigrants                   0
republican in profile                0
democrat in profile                  0
self-reported location = Berkeley    1

[Regression tree over these features (lives in Berkeley, follows Trump, email, profile Republican) with dollar-valued leaves: $10, $0, $7, $1, $13, $2]

Decision trees (from Flach 2014)

Variance
The level of dispersion of a set of values; how far they tend to fall from the average.
[Dot plots of two sets of values with the same mean but different spread]
Set 1: 5.1, 4.8, 5.3, 4.9    mean 5.0, variance 0.025
Set 2: 10, 3, 1, 9           mean 5.0, variance 10

Variance
The level of dispersion of a set of values; how far they tend to fall from the average.
Var(Y) = (1/N) ∑_{i=1}^{N} (y_i - ȳ)²,   where ȳ = (1/N) ∑_{i=1}^{N} y_i
[Same two example sets as above: mean 5.0 and 5.0; variance 0.025 and 10]

Regression trees
Rather than using entropy/Gini as a splitting criterion, we'll find the feature that results in the lowest variance of the data after splitting on the feature values.

      1     2     3     4     5     6
x1    0     1     1     0     0     1
x2    0     0     0     1     1     1
y     5.0   1.7   0     10    8     2.2

Split on x1:
x1 = 0: y = 5.0, 10, 8    (variance 6.33)
x1 = 1: y = 1.7, 0, 2.2   (variance 1.33)
Average variance: (3/6) × 6.33 + (3/6) × 1.33 = 3.83

      1     2     3     4     5     6
x1    0     1     1     0     0     1
x2    0     0     0     1     1     1
y     5.0   1.7   0     10    8     2.2

Split on x2:
x2 = 0: y = 5.0, 1.7, 0   (variance 6.46)
x2 = 1: y = 10, 8, 2.2    (variance 16.4)
Average variance: (3/6) × 6.46 + (3/6) × 16.4 = 11.43
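The same split comparison can be checked in a few lines of Python; note that the slide's branch variances (6.33, 1.33, 6.46, 16.4) correspond to the sample variance (dividing by N - 1), so that is what this sketch uses. The helper names are made up for illustration.

```python
def sample_variance(ys):
    """Sample variance (divide by N - 1), which reproduces the numbers on the slide."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / (len(ys) - 1)

def weighted_variance_after_split(xs, ys):
    """Average of the per-branch variances, weighted by branch size."""
    n = len(ys)
    branches = ([yy for xx, yy in zip(xs, ys) if xx == v] for v in set(xs))
    return sum((len(b) / n) * sample_variance(b) for b in branches)

y  = [5.0, 1.7, 0.0, 10.0, 8.0, 2.2]
x1 = [0, 1, 1, 0, 0, 1]
x2 = [0, 0, 0, 1, 1, 1]

print(round(weighted_variance_after_split(x1, y), 2))  # 3.83  -> split on x1
print(round(weighted_variance_after_split(x2, y), 2))  # 11.44 (the slide rounds intermediates and reports 11.43)
```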

Regression trees
Rather than using entropy/Gini as a splitting criterion, we'll find the feature that results in the lowest variance of the data after splitting on the feature values.
Homogeneous(D): the elements in D are homogeneous enough that they can be labeled with a single label (variance < some small threshold).
Label(D): the single most appropriate label for all elements in D; here, the average value of y in D.

Overfitting
With enough features, you can perfectly memorize the training data, encoding it in paths within the tree:
follow clinton = false, follow trump = false, benghazi = false, illegal immigrants = false, republican in profile = false, democrat in profile = false, self-reported location = Berkeley = true → Democrat
follow clinton = true, follow trump = false, benghazi = false, illegal immigrants = false, republican in profile = false, democrat in profile = false, self-reported location = Berkeley = true → Republican
[Function-word decision tree diagram: the, a, he, she, they, are, an, them, her, our, his, him, hers]

Pruning One way to prevent overfitting is to grow the tree to an arbitrary depth, and then prune back layers (delete subtrees)

Pruning
Deeper into the tree = more conjunctions of features; a shallower tree = only the most important (by IG) features.
[Function-word decision tree diagram: the, a, he, she, they, an, them, her, our, his, him, hers]

Interpretability
Decision trees are considered a relatively interpretable model, since they can be postprocessed into a sequence of decisions:
If self-reported location = Berkeley and benghazi = false, then y = Democrat.

Interpretability
Manageable for trees of small depth, but not for deep trees (each layer = one additional rule). Even in small trees, there are potentially many disjunctions (one "or" for each terminal node).
[Function-word decision tree diagram: the, a, he, she, they, an, are, them, our, him, her, his, hers]

Low bias: decision trees can perfectly match the training data (learning a perfect path through the conjunctions of features to recover the true y). High variance: because of that, they're very sensitive to whatever data you train on, resulting in very different models on different data.

Solution: train many models.
Bootstrap aggregating (bagging) is a method for reducing the variance of a model by averaging the results from multiple models trained on slightly different data. Bagging creates multiple versions of your dataset using the bootstrap (sampling data uniformly and with replacement).

Bootstrapped data
original:  x1  x2  x3  x4  x5   x6  x7  x8  x9  x10
rep 1:     x3  x9  x1  x3  x10  x6  x2  x9  x8  x1
rep 2:     x7  x9  x1  x1  x4   x9  x10 x7  x5  x6
rep 3:     x2  x3  x5  x8  x9   x8  x10 x1  x2  x4
rep 4:     x5  x1  x10 x5  x4   x2  x1  x9  x8  x10
Train one decision tree on each replicate and average the predictions (or take the majority vote).
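A minimal Python sketch of the bootstrap step, assuming the dataset is just a list of items; `bootstrap_replicate` is a hypothetical helper, not code from the lecture.

```python
import random

def bootstrap_replicate(data, rng=random):
    """One bootstrap replicate: len(data) items drawn uniformly, with replacement."""
    return [rng.choice(data) for _ in data]

original = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
replicates = [bootstrap_replicate(original) for _ in range(4)]
# Train one decision tree per replicate; average the predictions (regression)
# or take a majority vote (classification).
```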

De-correlating further
Bagging is great, but the variance goes down when the datasets are independent of each other. If there's one strong feature that's a great predictor, then the predictions will be dependent, because they all have that feature.
Solution: for each trained decision tree, only use a random subset of features.
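Combining the two ideas (bootstrap replicates plus a random feature subset per tree) gives a bare-bones sketch of the procedure; it reuses `grow_tree` from the earlier sketch, and the sqrt-of-the-number-of-features subset size is an assumed common default, not something stated on the slide.

```python
import math
import random

def random_forest(D, features, n_trees=100, rng=random):
    """Bagging plus a per-tree random feature subset, as described on the slide.
    Each tree is grown (with grow_tree from the earlier sketch) on a bootstrap
    replicate of D, using only ~sqrt(|features|) randomly chosen features."""
    k = max(1, int(math.sqrt(len(features))))
    forest = []
    for _ in range(n_trees):
        replicate = [rng.choice(D) for _ in D]              # sample uniformly, with replacement
        feature_subset = set(rng.sample(sorted(features), k))  # de-correlate the trees
        forest.append(grow_tree(replicate, feature_subset))
    return forest
```

At prediction time, each tree is descended to a leaf and the leaf labels are combined by majority vote (or averaged, for regression).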

Random forest