Day 3: Classification, logistic regression

Day 3: Classification, logistic regression
Introduction to Machine Learning Summer School, June 18 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
20 June 2018

Topics so far
Supervised learning, linear regression
Yesterday
o Overfitting
o Ridge and lasso regression
o Gradient descent
Today
o Bias-variance tradeoff
o Classification
o Logistic regression
o Regularization for logistic regression
o Classification metrics

Bias-variance tradeoff

Empirical vs population loss
Population distribution: let (x, y) ∼ D. We have
o a loss function ℓ(ŷ, y)
o a hypothesis class H
o training data S = {(x_i, y_i) : i = 1, 2, ..., N} ∼ D^N; think of S as a random variable
What we really want: f ∈ H to minimize the population loss
  L_D(f) ≜ E_D[ℓ(f(x), y)] = Σ_{(x,y)} ℓ(f(x), y) Pr(x, y)
ERM minimizes the empirical loss
  L_S(f) ≜ Ê_S[ℓ(f(x), y)] = (1/N) Σ_{i=1}^{N} ℓ(f(x_i), y_i)
e.g., Pr(x) = uniform(0, 1), y = w·x + ε with ε ∼ N(0, 0.1), so that Pr(y|x) = N(w·x, 0.1) and Pr(x, y) = Pr(x) Pr(y|x)

Empirical vs population loss
  L_D(f) ≜ E_D[ℓ(f(x), y)] = Σ_{(x,y)} ℓ(f(x), y) Pr(x, y)
  L_S(f) ≜ Ê_S[ℓ(f(x), y)] = (1/N) Σ_{i=1}^{N} ℓ(f(x_i), y_i)
f̂_S from some model overfits to S if there is f ∈ H with Ê_S[ℓ(f̂_S(x), y)] ≪ Ê_S[ℓ(f(x), y)] but E_D[ℓ(f̂_S(x), y)] ≫ E_D[ℓ(f(x), y)]
If f is independent of S_train, then both L_{S_train}(f) and L_{S_test}(f) are good approximations of L_D(f)
But generally f̂ depends on S_train. Why does this matter?
o L_{S_train}(f̂_{S_train}) is no longer a good approximation of L_D(f̂_{S_train})
o L_{S_test}(f̂_{S_train}) is still a good approximation of L_D(f̂_{S_train}), since f̂_{S_train} is independent of S_test
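
As a concrete illustration (not from the slides), here is a minimal numpy sketch of this setup: data is drawn from the example distribution Pr(x) = uniform(0, 1), y = w·x + ε, an over-flexible model is fit by ERM on S_train, and the empirical losses on S_train and on a large held-out S_test are compared; all names and values (true_w, the polynomial degree, the sample sizes) are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(0)
true_w, noise_std = 2.0, np.sqrt(0.1)            # assumed parameters of the toy distribution

def sample(n):
    # Pr(x) = uniform(0, 1), y = w.x + eps with eps ~ N(0, 0.1)
    x = rng.uniform(0, 1, size=n)
    return x, true_w * x + rng.normal(0, noise_std, size=n)

def sq_loss(f, x, y):
    return np.mean((f(x) - y) ** 2)              # empirical squared loss L_S(f)

x_tr, y_tr = sample(20)                          # S_train
x_te, y_te = sample(10_000)                      # large S_test, a proxy for L_D

# ERM over a flexible class (degree-15 polynomials), prone to overfitting
coeffs = np.polyfit(x_tr, y_tr, deg=15)
f_hat = lambda x: np.polyval(coeffs, x)

print("train loss:", sq_loss(f_hat, x_tr, y_tr))   # small: L_{S_train}(f_hat)
print("test loss :", sq_loss(f_hat, x_te, y_te))   # much larger: close to L_D(f_hat)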

Optimum unrestricted predictor
Consider the population squared loss: argmin_f L(f) ≜ E_D[ℓ(f(x), y)] = E_{(x,y)}[(f(x) − y)²]
Say H is unrestricted: any function f : x → y is allowed.
L(f) = E_{(x,y)}[(f(x) − y)²]
     = E_x[ E_y[(f(x) − y)² | x] ]
     = E_x[ E_y[(f(x) − E_y[y|x] + E_y[y|x] − y)² | x] ]
     = E_x[ E_y[(f(x) − E_y[y|x])² | x] ] + E_x[ E_y[(E_y[y|x] − y)² | x] ]
       + 2 E_x[ E_y[(f(x) − E_y[y|x])(E_y[y|x] − y) | x] ]   (cross term = 0, since f(x) − E_y[y|x] is not a function of y)
     = E_x[(f(x) − E_y[y|x])²] + E_{x,y}[(E_y[y|x] − y)²]
minimized by f(x) = E_y[y|x]; the second term is the noise

Bias-variance decomposition
Best unrestricted predictor: f*(x) = E_y[y|x]
L(f̂_S) = E_x[(f̂_S(x) − f*(x))²] + E_{x,y}[(f*(x) − y)²]
E_S[L(f̂_S)] = E_S E_x[(f̂_S(x) − f*(x))²] + noise
E_S E_x[(f̂_S(x) − f*(x))²]
  = E_x[ E_S[(f̂_S(x) − f*(x))² | x] ]
  = E_x[ E_S[(f̂_S(x) − E_S[f̂_S(x)] + E_S[f̂_S(x)] − f*(x))² | x] ]
  = E_x[ E_S[(f̂_S(x) − E_S[f̂_S(x)])² | x] ] + E_x[(E_S[f̂_S(x)] − f*(x))²]
    + 2 E_x[ E_S[(f̂_S(x) − E_S[f̂_S(x)])(E_S[f̂_S(x)] − f*(x)) | x] ]   (cross term = 0)
  = E_{S,x}[(f̂_S(x) − E_S[f̂_S(x)])²] + E_x[(E_S[f̂_S(x)] − f*(x))²]
So E_S[L(f̂_S)] = E_{S,x}[(f̂_S(x) − E_S[f̂_S(x)])²] + E_x[(E_S[f̂_S(x)] − f*(x))²] + E_{x,y}[(f*(x) − y)²]
               = variance + bias² + noise

Bias-variance tradeoff
E_S[L(f̂_S)] = E_{S,x}[(f̂_S(x) − E_S[f̂_S(x)])²] + E_x[(E_S[f̂_S(x)] − f*(x))²] + E_{x,y}[(f*(x) − y)²]
            = variance + bias² + noise,   with f̂_S ∈ H
noise is irreducible
variance can be reduced:
o get more data
o make f̂_S less sensitive to S: fewer candidates in H to choose from → less variance; reducing the complexity of the model class H decreases variance
bias² ≈ min_{f ∈ H} E_x[(f(x) − f*(x))²]
o expanding the model class H decreases bias
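
To make the decomposition concrete (my own illustration, not from the slides), the sketch below estimates bias² and variance by Monte Carlo: many training sets S are drawn from the same distribution, a polynomial model is fit to each, and the predictions are compared to their average and to the true f*; the chosen f*, noise level, and polynomial degrees are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)          # assumed true regression function E[y|x]
noise_std = 0.3

def sample(n):
    x = rng.uniform(0, 1, size=n)
    return x, f_star(x) + rng.normal(0, noise_std, size=n)

x_grid = np.linspace(0, 1, 200)
for deg in [1, 15]:                               # a simple vs a complex hypothesis class
    preds = []
    for _ in range(500):                          # many independent training sets S
        x_tr, y_tr = sample(20)
        preds.append(np.polyval(np.polyfit(x_tr, y_tr, deg), x_grid))
    preds = np.array(preds)                       # row s = predictions of f_hat_S on x_grid
    variance = preds.var(axis=0).mean()           # E_x E_S[(f_hat_S(x) - E_S[f_hat_S(x)])^2]
    bias2 = ((preds.mean(axis=0) - f_star(x_grid)) ** 2).mean()   # E_x[(E_S[f_hat_S(x)] - f*(x))^2]
    print(f"degree {deg:2d}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")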

Model complexity
Reducing the complexity of the model class H decreases variance; expanding the model class H decreases bias.
Complexity ≈ number of choices in H
o For any loss, for all f ∈ H, with probability greater than 1 − δ,
    L_D(f) ≤ L_S(f) + √( (log|H| + log(1/δ)) / N )
o there are many other variants for classes of infinite cardinality; the bounds are often loose
Complexity ≈ number of degrees of freedom, e.g., the number of parameters to estimate
o more data → can fit more complex models
o Is H_1 = {x ↦ w_0 + w_1·x + w_2·x} more complex than H_2 = {x ↦ w_0 + w_1·x}?
What we really need is how many different behaviors we can get on the same S
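
As a quick numeric illustration of the finite-class bound above (my own example with made-up numbers, ignoring constants): plugging in |H| = 1000, δ = 0.05, and N = 10000 gives a deviation term of about 0.03, so with that much data the empirical loss of every f ∈ H is a good proxy for its population loss.

import numpy as np

H_size, delta, N = 1000, 0.05, 10_000              # assumed values, just to plug into the bound
gap = np.sqrt((np.log(H_size) + np.log(1 / delta)) / N)
print(f"with prob. >= {1 - delta}: L_D(f) <= L_S(f) + {gap:.3f}")   # gap is about 0.031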

Overfitting
Summary
o What is overfitting?
o How to detect overfitting?
o Avoiding overfitting using model selection
o Bias-variance tradeoff

Classification
Supervised learning: estimate a mapping f from input x ∈ X to output y ∈ Y
o Regression: Y = R or other continuous variables
o Classification: Y takes a discrete set of values
Examples:
o Y = {spam, nospam}
o digits (labels, not numeric values): Y = {0, 1, 2, ..., 9}
Many successful applications of ML in vision, speech, NLP, healthcare

Classification vs regression
Label values do not have meaning
o Y = {spam, nospam} or Y = {0, 1} or Y = {−1, 1}
Ordering of labels does not matter (for the most part)
o f(x) = 0 when y = 1 is as bad as f(x) = 9 when y = 1
Often f(x) does not return labels y
o e.g., in binary classification with Y = {−1, 1} we often estimate f : X → R and then post-process to get ŷ: ŷ(x) = 1 if f(x) ≥ 0, and −1 otherwise
o mainly for computational reasons: remember, we need to solve min_{f ∈ H} Σ_i ℓ(f(x_i), y_i), and discrete values → combinatorial problems → hard to solve
o more generally, H ⊆ {f : X → R} with loss ℓ : R × Y → R; compare to regression, where typically H ⊆ {f : X → Y} and loss ℓ : Y × Y → R

Non-parametric classifiers

Nearest Neighbor (NN) classifier
Training data S = {(x_i, y_i) : i = 1, 2, ..., N}
Want to predict the label of a new point x
Nearest neighbor rule
o Find the closest training point: i* = argmin_i ρ(x, x_i)
o Predict the label of x as ŷ(x) = y_{i*}
Computation
o Training time: do nothing?
o Test time: search the training set for a nearest neighbor
Figure credit: Nati Srebro

Nearest Neighbor (NN) classifier
Where is the main model?
o i* = argmin_i ρ(x, x_i)
o What is the right distance between images? Between sound waves? Between sentences?
o Often ρ(x, x′) = ‖φ(x) − φ(x′)‖_2 or other norms, e.g., ‖x − x′‖_1, ‖x − x′‖_∞, or ‖φ(x) − φ(x′)‖_2 with a rescaled feature map such as φ(x) = (5x_1, x_2)
Slide credit: Nati Srebro

k-nearest Neighbor (knn) classifier Training data S = x -, y - : i = 1,2,, N Want to predict label of new point x k-nearest Neighbor Rule o Find the k closest training point: i P, i y,, i o Predict label of x as y( x = majority y -, y -,, y - Computation o Training time: Do nothing o Test time: search the training set for k NNs 15

k-nearest Neighbor Advantages ono training ouniversal approximator non-parametric Disadvantages onot scalable test time memory requirement test time computation oeasily overfits with small data 16

Training vs test error
1-NN
o Training error? 0
o Test error? Depends on Pr(x, y)
k-NN
o Training error: can be greater than 0
o Test error: again depends on Pr(x, y)
Figure credit: Nati Srebro

k-nearest Neighbor: Data Fit / Complexity Tradeoff h*= S= k=1 k=50 Slide credit: Nati Srebro k=5 k=12 k=100 k=200 18

Space partition
kNN gives a partitioning of X (or R^d) into regions labeled +1 and −1
What about discrete-valued features x?
Even for continuous x, can we get more structured partitions?
o easy to describe, e.g., R_2 = {x : x_1 < t_1 and x_2 > t_2}
o reduces degrees of freedom
Any non-overlapping partition using only (hyper)rectangles is representable by a tree
Figure credit: Greg Shaknarovich

Decision trees
Focus on binary trees (trees with at most two children at each node)
How to create trees? What is a good tree?
o Measure of purity at each leaf node, where each leaf node corresponds to a region R_l, e.g., purity(tree) = Σ_{leaves l} |#blue at R_l − #red at R_l|
o There are various metrics of (im)purity used in practice, but the rough idea is the same

Decision trees
How to create trees?
Training data S = {(x_i, y_i) : i = 1, 2, ..., N}, where y_i ∈ {blue, red}
At each point, purity(tree) = Σ_{leaves} |#blue at leaf − #red at leaf|
Start with all data at the root
o only one leaf = the root; what is purity(tree) then?

Decision trees
How to create trees?
Training data S = {(x_i, y_i) : i = 1, 2, ..., N}, where y_i ∈ {blue, red}
At each point, purity(tree) = Σ_{leaves} |#blue at leaf − #red at leaf|
Start with all data at the root
o only one leaf = the root; what is purity(tree) then?
Create a split based on a rule that increases the purity of the tree
o How complex can the rules be?

Decision trees
How to create trees?
Training data S = {(x_i, y_i) : i = 1, 2, ..., N}, where y_i ∈ {blue, red}
At each point, purity(tree) = Σ_{leaves} |#blue at leaf − #red at leaf|
Start with all data at the root
o only one leaf = the root; what is purity(tree) then?
Create a split based on a rule that increases the purity of the tree
o How complex can the rules be?
Repeat
When to stop? What is the complexity of a DT?
o Limit the number of leaf nodes
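
A rough sketch of the greedy split search described above (my own illustration, using the |#blue − #red| purity from the slide with {−1, +1} standing in for {red, blue}, and axis-aligned threshold rules; it is not an efficient or complete CART implementation):

import numpy as np

def purity(y):
    # |#blue - #red| at one leaf; y is a vector of +1/-1 labels
    return abs(np.sum(y == 1) - np.sum(y == -1))

def best_split(X, y):
    # greedily pick the (feature, threshold) rule that most increases total purity
    best = (None, None, purity(y))                       # baseline: keep a single leaf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            score = purity(left) + purity(right)         # purity summed over the two new leaves
            if score > best[2]:
                best = (j, t, score)
    return best

# made-up 2D data: the +1 class sits at larger x_1
X = np.array([[0.1, 0.5], [0.2, 0.9], [0.8, 0.4], [0.9, 0.7]])
y = np.array([-1, -1, 1, 1])
print(best_split(X, y))     # (0, 0.8, 4): split on feature 0 at threshold 0.8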

Decision trees
Advantages
o interpretable
o easy to deal with non-numeric features
o natural extensions to multi-class and multi-label problems
Disadvantages
o not scalable
o hard, non-smooth decisions
o often overfits in spite of regularization
Check the CART implementation in scikit-learn
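
For reference, a short usage sketch of scikit-learn's CART-based decision tree (my own example; the dataset and the max_leaf_nodes value are arbitrary choices, the latter echoing the 'limit the number of leaf nodes' idea above):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# limit complexity by capping the number of leaves (one way to regularize a DT)
tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X_tr, y_tr)
print("train accuracy:", tree.score(X_tr, y_tr))
print("test accuracy :", tree.score(X_te, y_te))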

Parametric classifiers
What is the equivalent of linear regression?
o something easy to train
o something easy to use at test time
f(x) = f_{w,w_0}(x) = w·x + w_0,   H = {f_{w,w_0} : x ↦ w·x + w_0 : w ∈ R^d, w_0 ∈ R}
but f(x) ∉ {−1, 1}! How do we get labels?
o reasonable choice: ŷ(x) = 1 if f_{ŵ,ŵ_0}(x) ≥ 0, and ŷ(x) = −1 otherwise
o linear classifier: ŷ(x) = sign(ŵ·x + ŵ_0)

Parametric classifiers
H = {f_{w,w_0} : x ↦ w·x + w_0 : w ∈ R^d, w_0 ∈ R},   ŷ(x) = sign(ŵ·x + ŵ_0)
{x : ŵ·x + ŵ_0 = 0} is the (linear) decision boundary, or separating hyperplane
o it separates R^d into the two halfspaces (regions) ŵ·x + ŵ_0 > 0 and ŵ·x + ŵ_0 < 0
More generally, ŷ(x) = sign(f̂(x)), so the decision boundary is f̂(x) = 0
[Figure: a separating hyperplane drawn in the (x_1, x_2) plane.]
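
A tiny sketch of this prediction rule (illustration only; the parameters w_hat, w0_hat and the query points are made up):

import numpy as np

w_hat, w0_hat = np.array([1.5, -2.0]), 0.5         # assumed learned parameters
X_new = np.array([[1.0, 0.2], [0.0, 1.0]])
y_pred = np.sign(X_new @ w_hat + w0_hat)            # linear classifier: sign(w.x + w_0)
print(y_pred)                                       # [ 1. -1.]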

Linear classifier

Classification vs regression (recap)
Label values do not have meaning: Y = {spam, nospam} or Y = {0, 1} or Y = {−1, 1}
Ordering of labels does not matter (for the most part): f(x) = 0 when y = 1 is as bad as f(x) = 9 when y = 1
Often f(x) does not return labels y: in binary classification with Y = {−1, 1} we often estimate f : X → R and post-process to ŷ(x) = sign(f(x))
What if we ignore all of the above and solve classification using regression?

Classification as regression
Binary classification: Y = {−1, 1} and X ⊆ R^d
Treat it as regression with the squared loss, say linear regression
o Training data S = {(x_i, y_i) : i = 1, 2, ..., N}
o ERM: (ŵ, ŵ_0) = argmin_{w, w_0} Σ_i (w·x_i + w_0 − y_i)²
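
A minimal sketch of this least-squares-then-threshold classifier (my own illustration; the data is synthetic, and np.linalg.lstsq is just one way to solve the ERM problem above):

import numpy as np

rng = np.random.default_rng(0)
# synthetic binary data: two Gaussian blobs with labels -1 / +1
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# ERM with squared loss: append a constant feature so the bias w_0 is learned too
A = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug, *_ = np.linalg.lstsq(A, y, rcond=None)        # stacked (w_hat, w0_hat)
y_hat = np.sign(A @ w_aug)                           # post-process scores to labels
print("train accuracy:", np.mean(y_hat == y))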

Classification as regression
[Figure: a 1-D example with labels y ∈ {−1, +1} plotted against x; the fitted line gives ŷ(x) = sign(w·x + w_0), with ŷ = +1 on one side of the threshold and ŷ = −1 on the other.]
Example credit: Greg Shaknarovich

Classification as regression
[Figure: the same 1-D example with additional points far to the right; they are classified correctly by ŷ(x) = sign(w·x + w_0), but their squared loss (w·x + w_0 − y)² is high.]
Example credit: Greg Shaknarovich

Classification as regression
[Figure: the resulting least-squares fit on this data, with the labels y = +1 and y = −1 marked.]
Example credit: Greg Shaknarovich

Classification as regression
Slide credit: Greg Shaknarovich

Surrogate losses
The correct loss to use is the 0-1 loss after thresholding:
  ℓ_{01}(f(x), y) = 1[sign(f(x)) ≠ y] = 1[f(x)·y < 0]
[Figure: the 0-1 loss ℓ(f(x), y) plotted against the margin f(x)·y.]

Surrogate losses
The correct loss to use is the 0-1 loss after thresholding:
  ℓ_{01}(f(x), y) = 1[sign(f(x)) ≠ y] = 1[f(x)·y < 0]
Linear regression uses ℓ_sq(f(x), y) = (f(x) − y)²
[Figure: the 0-1 loss and the squared loss plotted against the margin f(x)·y.]
Why not do ERM over ℓ_{01}(f(x), y) directly?
o it is non-continuous and non-convex

Surrogate losses
Hard to optimize over ℓ_{01}, so find another loss ℓ(ŷ, y) that is
o convex (for any fixed y) → easier to minimize
o an upper bound of ℓ_{01} → small ℓ implies small ℓ_{01}
Both are satisfied by the squared loss, but it can have large loss even when ℓ_{01}(ŷ, y) = 0
Two more surrogate losses in this course
o Logistic loss ℓ_log(ŷ, y) = log(1 + exp(−ŷ·y)) (TODAY)
o Hinge loss ℓ_hinge(ŷ, y) = max(0, 1 − ŷ·y) (TOMORROW)
[Figure: 0-1, squared, logistic, and hinge losses plotted against the margin f(x)·y.]
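
The small sketch below evaluates these losses as functions of the margin f(x)·y, which is essentially what the loss plot on the slide shows (my own code; the grid of margin values is arbitrary):

import numpy as np

margin = np.linspace(-2, 2, 9)                      # values of f(x) * y
zero_one = (margin < 0).astype(float)               # 0-1 loss: 1[f(x)y < 0]
squared = (margin - 1) ** 2                          # squared loss (f(x) - y)^2 when y in {-1,+1}
logistic = np.log(1 + np.exp(-margin))               # logistic loss
hinge = np.maximum(0.0, 1 - margin)                  # hinge loss

for row in zip(margin, zero_one, squared, logistic, hinge):
    print("m={:+.1f}  0-1={:.0f}  sq={:.2f}  log={:.2f}  hinge={:.2f}".format(*row))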

Logistic Regression

Logistic regression: ERM on a surrogate loss
Logistic loss: ℓ_log(f(x), y) = log(1 + exp(−f(x)·y))
S = {(x_i, y_i) : i = 1, 2, ..., N}, X ⊆ R^d, Y = {−1, 1}
Linear model: f(x) = f_{w,w_0}(x) = w·x + w_0
Minimize the training loss:
  (ŵ, ŵ_0) = argmin_{w, w_0} Σ_i log(1 + exp(−(w·x_i + w_0)·y_i))
Output classifier: ŷ(x) = sign(ŵ·x + ŵ_0)
[Figure: the logistic loss ℓ(f(x), y) plotted against the margin f(x)·y.]
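
A minimal gradient-descent sketch for this objective (my own illustration, anticipating the gradient-descent discussion; the step size, iteration count, and synthetic data are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
# synthetic data: two blobs with labels in {-1, +1}
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

w, w0, step = np.zeros(2), 0.0, 0.1
for _ in range(500):
    m = y * (X @ w + w0)                          # margins y_i * f(x_i)
    g = -y / (1 + np.exp(m))                      # derivative of log(1 + exp(-m_i)) w.r.t. f(x_i)
    w -= step * (X * g[:, None]).mean(axis=0)     # average gradient w.r.t. w
    w0 -= step * g.mean()                         # average gradient w.r.t. w_0
y_hat = np.sign(X @ w + w0)
print("train accuracy:", np.mean(y_hat == y))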

Logistic regression
(ŵ, ŵ_0) = argmin_{w, w_0} Σ_i log(1 + exp(−(w·x_i + w_0)·y_i))
Learns a linear decision boundary
o {x : w·x + w_0 = 0} is a hyperplane in R^d: the decision boundary
o {x : w·x + w_0 = 0} divides R^d into two halfspaces (regions)
o {x : w·x + w_0 ≥ 0} will get label +1 and {x : w·x + w_0 < 0} will get label −1
Maps x to a 1-D coordinate x′ = (w·x + w_0)/‖w‖, the signed distance to the decision boundary
Figure credit: Greg Shaknarovich

Logistic Regression
(ŵ, ŵ_0) = argmin_{w, w_0} Σ_i log(1 + exp(−(w·x_i + w_0)·y_i))
Convex optimization problem
Can solve using gradient descent
Can also add the usual regularization: ℓ_2, ℓ_1
o More details in the next session
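
For completeness, a short scikit-learn sketch of regularized logistic regression (my own example; in scikit-learn the penalty strength is controlled by C, the inverse of the regularization weight, and the values below are arbitrary):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in [0.01, 1.0, 100.0]:                        # smaller C = stronger l2 regularization
    clf = LogisticRegression(penalty="l2", C=C).fit(X, y)
    print(f"C={C:6.2f}  ||w||={np.linalg.norm(clf.coef_):.2f}  train acc={clf.score(X, y):.2f}")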