Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Similar documents
LDA, QDA, Naive Bayes

Dimension Reduction Methods

Machine Learning Practice Page 2 of 2 10/28/13

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

Machine Learning Linear Classification. Prof. Matteo Matteucci

Support Vector Machines

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Applied Machine Learning Annalisa Marsico

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Introduction to Machine Learning Midterm, Tues April 8

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Midterm exam CS 189/289, Fall 2015

Introduction to Machine Learning Midterm Exam

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

Statistical Methods for Data Mining

Statistical Methods for SVM

Introduction to Machine Learning Midterm Exam Solutions

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

Statistical Data Mining and Machine Learning Hilary Term 2016

CMU-Q Lecture 24:

Lecture 3: Statistical Decision Theory (Part II)

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Linear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear.

COMS 4771 Regression. Nakul Verma

ECE 5424: Introduction to Machine Learning

Math for Machine Learning Open Doors to Data Science and Artificial Intelligence. Richard Han

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

FINAL: CS 6375 (Machine Learning) Fall 2014

10-701/ Machine Learning - Midterm Exam, Fall 2010

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Course in Data Science

A Magiv CV Theory for Large-Margin Classifiers

Evaluation. Andrea Passerini Machine Learning. Evaluation

18.9 SUPPORT VECTOR MACHINES

Machine Learning for OR & FE

Lecture 6: Linear Regression (continued)

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Problem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56

Machine Learning Support Vector Machines. Prof. Matteo Matteucci

Introduction to Machine Learning

Evaluation requires to define performance measures to be optimized

Chap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University

ESS2222. Lecture 4 Linear model

Introduction to Machine Learning

COMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma

CS534 Machine Learning - Spring Final Exam

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Introduction to Data Science

The exam is closed book, closed notes except your one-page cheat sheet.

Statistical Machine Learning Hilary Term 2018

Day 5: Generative models, structured classification

Statistical Machine Learning from Data

CSCI-567: Machine Learning (Spring 2019)

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

VBM683 Machine Learning

Machine Learning for OR & FE

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

CS145: INTRODUCTION TO DATA MINING

Support Vector Machines

Part I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes

Linear Regression (9/11/13)

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

1 Machine Learning Concepts (16 points)

Stephen Scott.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Recap from previous lecture

A Bias Correction for the Minimum Error Rate in Cross-validation

Linear Model Selection and Regularization

Train the model with a subset of the data. Test the model on the remaining data (the validation set) What data to choose for training vs. test?

Holdout and Cross-Validation Methods Overfitting Avoidance

Introduction to Gaussian Process

Review: Support vector machines. Machine learning techniques and image analysis

Simple Linear Regression (single variable)

Applied Multivariate and Longitudinal Data Analysis

Day 3: Classification, logistic regression

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

ECE521 week 3: 23/26 January 2017

Warm up: risk prediction with logistic regression

CMSC858P Supervised Learning Methods

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

Support Vector Machines for Classification: A Statistical Portrait

CIS 520: Machine Learning Oct 09, Kernel Methods

CS7267 MACHINE LEARNING

Lecture 6: Linear Regression

Linear & nonlinear classifiers

Machine Learning Lecture 5

Lecture 9: Classification, LDA

CPSC 340: Machine Learning and Data Mining

Pattern Recognition 2018 Support Vector Machines

STA 4273H: Statistical Machine Learning

Support Vector Machine. Industrial AI Lab.

Lecture 9: Classification, LDA

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

6.036 midterm review. Wednesday, March 18, 15

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Transcription:

Final Overview Introduction to ML Marek Petrik 4/25/2017

This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood, cross validation Fundamental machine learning techniques: regression, model-selection, deep learning Educational goals: 1. How to apply basic methods 2. Reveal what happens inside 3. What are the pitfalls 4. Expand understanding of linear algebra, statistics, and optimization

What is Machine Learning Discover unknown function f: X = set of features, or inputs Y = target, or response Y = f(x) Sales 5 10 15 20 25 Sales 5 10 15 20 25 Sales 5 10 15 20 25 0 50 100 200 300 TV 0 10 20 30 40 50 Radio 0 20 40 60 80 100 Newspaper Sales = f(tv, Radio, Newspaper)

Statistical View of Machine Learning Probability space Ω: Set of all adults Random variable: X(ω) = R: Years of education Random variable: Y (ω) = R: Salary Income 20 30 40 50 60 70 80 Income 20 30 40 50 60 70 80 10 12 14 16 18 20 22 Years of Education 10 12 14 16 18 20 22 Years of Education

How Good are Predictions? Learned function ˆf Test data: (x 1, y 1 ), (x 2, y 2 ),... Mean Squared Error (MSE): MSE = 1 n n (y i ˆf(x i )) 2 i=1 This is the estimate of: MSE = E[(Y ˆf(X)) 2 ] = 1 Ω Important: Samples x i are i.i.d. (Y (ω) ˆf(X(ω))) 2 ω Ω

KNN: K-Nearest Neighbors Idea: Use similar training points when making predictions o o o o o o o o o o o o o o o o o o o o o o o o Non-parametric method (unlike regression)

Bias-Variance Decomposition Y = f(x) + ɛ Mean Squared Error can be decomposed as: MSE = E(Y ˆf(X)) 2 = Var( }{{ ˆf(X)) + (E( } ˆf(X))) 2 + Var(ɛ) }{{} Variance Bias Bias: How well would method work with infinite data Variance: How much does output change with different data sets

R 2 Statistic R 2 = 1 RSS n TSS = 1 i=1 (y i ŷ i ) 2 n i=1 (y i ȳ) 2 RSS - residual sum of squares, TSS - total sum of squares R 2 measures the goodness of the fit as a proportion Proportion of data variance explained by the model Extreme values: 0: Model does not explain data 1: Model explains data perfectly

Correlation Coefficient Measures dependence between two random variables X and Y r = Cov(X, Y ) Var(X) Var(Y ) Correlation coefficient r is between [ 1, 1] 0: Variables are not related 1: Variables are perfectly related (same) 1: Variables are negatively related (different)

Correlation Coefficient Measures dependence between two random variables X and Y r = Cov(X, Y ) Var(X) Var(Y ) Correlation coefficient r is between [ 1, 1] 0: Variables are not related 1: Variables are perfectly related (same) 1: Variables are negatively related (different) R 2 = r 2

Qualitative Features: Many Values The Right Way Predict salary as a function of state Feature state i {MA, NH, ME}

Qualitative Features: Many Values The Right Way Predict salary as a function of state Feature state i {MA, NH, ME} Introduce 2 indicator variables x i, z i : { 0 if state i = MA x i = z i = 1 if state i MA { 0 if state i = NH 1 if state i NH Predict salary as: β 0 + β 1 salary = β 0 + β 1 x i + β 2 z i = β 0 + β 2 β 0 if state i = MA if state i = NH if state i = ME

Qualitative Features: Many Values The Right Way Predict salary as a function of state Feature state i {MA, NH, ME} Introduce 2 indicator variables x i, z i : { 0 if state i = MA x i = z i = 1 if state i MA { 0 if state i = NH 1 if state i NH Predict salary as: β 0 + β 1 salary = β 0 + β 1 x i + β 2 z i = β 0 + β 2 β 0 if state i = MA if state i = NH if state i = ME Need an indicator variable for ME? Why? hint: linear independence

Outlier Data Points Data point that is far away from others Measurement failure, sensor fails, missing data point Can seriously influence prediction quality Y 4 2 0 2 4 6 20 Residuals 1 0 1 2 3 4 20 Studentized Residuals 0 2 4 6 20 2 1 0 1 2 X 2 0 2 4 6 Fitted Values 2 0 2 4 6 Fitted Values

Points with High Leverage Points with unusual value of x i Single data point can have significant impact on prediction R and other packages can compute leverages of data points Y 0 5 10 20 41 X2 2 1 0 1 2 Studentized Residuals 1 0 1 2 3 4 5 20 41 2 1 0 1 2 3 4 X 2 1 0 1 2 X1 0.00 0.05 0.10 0.15 0.20 0.25 Leverage Good to remove points with high leverage and residual

Best Subset Selection Want to find a subset of p features The subset should be small and predict well Example: credit rating + income + student + limit M 0 null model (no features); for k = 1, 2,..., p do Fit all ( p k) models that contain k features ; M k best of ( p k) models according to a metric (CV error, R 2, etc) end return Best of M 0, M 1,..., M p according to metric above Algorithm 1: Best Subset Selection

Regularization Ridge regression (parameter λ), l 2 penalty min β min RSS(β) + λ βj 2 = β j 2 n p y i β 0 β j x ij + λ j=1 j i=1 β 2 j Lasso (parameter λ), l 1 penalty min β min RSS(β) + λ β j = β j 2 n p y i β 0 β j x ij + λ j i=1 j=1 β j Approximations to the l 0 solution

Logistic Regression Predict probability of a class: p(x) Example: p(balance) probability of default for person with balance Linear regression: logistic regression: p(x) = β 0 + β 1 p(x) = eβ 0+β 1 X 1 + e β 0+β 1 X the same as: ( ) p(x) log = β 0 + β 1 X 1 p(x) Odds: p(x) /1 p(x)

Logistic Function y = ex 1 + e x Logistic 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 p(x) = x eβ 0+β 1 X 1 + e β 0+β 1 X

Logit Function ( ) p(x) log 1 p(x) Logit 4 2 0 2 4 0.0 0.2 0.4 0.6 0.8 1.0 ( ) p(x) p(x) log = β 0 + β 1 X 1 p(x)

Estimating Coefficients: Maximum Likelihood Likelihood: Probability that data is generated from a model Find the most likely model: l(model) = Pr[data model] max l(model) = max Pr[data model] model model Likelihood function is difficult to maximize Transform it using log (strictly increasing) max log l(model) model Strictly increasing transformation does not change maximum

Discriminative vs Generative Models Discriminative models Estimate conditional models Pr[Y X] Linear regression Logistic regression Generative models Estimate joint probability Pr[Y, X] = Pr[Y X] Pr[X] Estimates not only probability of labels but also the features Once model is fit, can be used to generate data LDA, QDA, Naive Bayes

LDA: Linear Discriminant Analysis Generative model: capture probability of predictors for each label 0 1 2 3 4 5 4 2 0 2 4 3 2 1 0 1 2 3 4 Predict:

LDA: Linear Discriminant Analysis Generative model: capture probability of predictors for each label 0 1 2 3 4 5 4 2 0 2 4 3 2 1 0 1 2 3 4 Predict: 1. Pr[balance default = yes] and Pr[default = yes]

LDA: Linear Discriminant Analysis Generative model: capture probability of predictors for each label 0 1 2 3 4 5 4 2 0 2 4 3 2 1 0 1 2 3 4 Predict: 1. Pr[balance default = yes] and Pr[default = yes] 2. Pr[balance default = no] and Pr[default = no]

LDA: Linear Discriminant Analysis Generative model: capture probability of predictors for each label 0 1 2 3 4 5 4 2 0 2 4 3 2 1 0 1 2 3 4 Predict: 1. Pr[balance default = yes] and Pr[default = yes] 2. Pr[balance default = no] and Pr[default = no] Classes are normal: Pr[balance default = yes]

QDA: Quadratic Discriminant Analysis Generalizes LDA LDA: Class variances Σ k = Σ are the same QDA: Class variances Σ k can differ

QDA: Quadratic Discriminant Analysis Generalizes LDA LDA: Class variances Σ k = Σ are the same QDA: Class variances Σ k can differ LDA or QDA has smaller training error on the same data?

QDA: Quadratic Discriminant Analysis Generalizes LDA LDA: Class variances Σ k = Σ are the same QDA: Class variances Σ k can differ LDA or QDA has smaller training error on the same data? What about the test error?

QDA: Quadratic Discriminant Analysis X2 4 3 2 1 0 1 2 X2 4 3 2 1 0 1 2 4 2 0 2 4 X 1 4 2 0 2 4 X 1

Naive Bayes Simple Bayes net classification With normal distribution over features X 1,..., X k special case of QDA with diagonal Σ Generalizes to non-normal distributions and discrete variables More on it later...

Maximum Margin Hyperplane X2 1 0 1 2 3 1 0 1 2 3 X 1

Introducing Slack Variables Maximum margin classifier max β,m s.t. M y i (β x i ) M β 2 = 1 Support Vector Classifier a.k.a Linear SVM Slack variables: ɛ Parameter: C max β,m,ɛ 0 M s.t. y i (β x i ) (M ɛ i ) β 2 = 1 ɛ 1 C

Introducing Slack Variables Maximum margin classifier max β,m s.t. M y i (β x i ) M β 2 = 1 Support Vector Classifier a.k.a Linear SVM Slack variables: ɛ max β,m,ɛ 0 Parameter: C What if C = 0? M s.t. y i (β x i ) (M ɛ i ) β 2 = 1 ɛ 1 C

Kernelized SVM Dual Quadratic Program (usually max-min, not here) max α 0 s.t. M α l 1 2 l=1 M α l y l = 0 l=1 M α j α k y j y k k(x j, x k ) j,k=1 Representer theorem: (classification test): f(z) = M α l y l k(z, x l ) > 0 l=1

Kernels Polynomial kernel ) d k(x 1, x 2 ) = (1 + x 1 x 2 Radial kernel k(x 1, x 2 ) = exp ( γ x 1 x 2 2 ) 2 Many many more

Polynomial and Radial Kernels X2 4 2 0 2 4 X2 4 2 0 2 4 4 2 0 2 4 X 1 4 2 0 2 4 X 1

Regression Trees Predict Baseball Salary based on Years played and Hits Example: Years < 4.5 5.11 Hits < 117.5 6.00 6.74

CART: Recursive Binary Splitting Greedy top-to-bottom approach Recursively divide regions to minimize RSS (y i ȳ 1 ) 2 + x i R 1 x i R 2 (y i ȳ 2 ) 2 Prune tree

Trees vs. KNN Trees do not require a distance metric Trees work well with categorical predictors Trees work well in large dimensions KNN are better in low-dimensional problems with complex decision boundaries

Bagging Stands for Bootstrap Aggregating Construct multiple bootstrapped training sets: Fit a tree to each one: T 1, T 2,..., T B ˆf 1, ˆf 2,..., ˆf B Make predictions by averaging individual tree predictions ˆf(x) = 1 B B b=1 ˆf b (x) Large values of B are not likely to overfit, B 100 is a good choice

Random Forests Many trees in bagging will be similar Algorithms choose the same features to split on Random forests help to address similarity: At each split, choose only from m randomly sampled features Good empirical choice is m = p Test Classification Error 0.2 0.3 0.4 0.5 m=p m=p/2 m= p 0 100 200 300 400 500 Number of Trees

Gradient Boosting (Regression) Boosting uses all of data, not a random subset (usually) Also builds trees ˆf 1, ˆf 2,... and weights λ 1, λ 2,... Combined prediction: ˆf(x) = i λ i ˆfi (x) Assume we have 1... m trees and weights, next best tree?

Gradient Boosting (Regression) Just use gradient descent Objective is to minimize RSS (1/2): 1 2 n (y i f(x i )) 2 i=1 Objective with the new tree m + 1: 1 2 n y i i=1 m ˆf j (x i ) ˆf m+1 (x i ) j=1 Greatest reduction in RSS: gradient y i m ˆf j (x i ) ˆf m+1 (x i ) j=1 2

ROC Curve Confusion matrix Reality Predicted Positive Negative Positive True Positive False Positive Negative False Negative True Negative ROC Curve True positive rate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

Area Under ROC Curve ROC Curve True positive rate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 False positive rate Larger area is better Many other ways to measure classifier performance, like F 1

Evaluation Method 1: Validation Set Just evaluate how well the method works on the test set Randomly split data to: 1. Training set: about half of all data 2. Validation set (AKA hold-out set): remaining half

Evaluation Method 1: Validation Set Just evaluate how well the method works on the test set Randomly split data to: 1. Training set: about half of all data 2. Validation set (AKA hold-out set): remaining half Choose the number of features/representation based on minimizing error on validation set

Evaluation Method 2: Leave-one-out Addresses problems with validation set Split the data set into 2 parts: 1. Training: Size n 1 2. Validation: Size 1 Repeat n times: to get n learning problems

Evaluation Method 3: k-fold Cross-validation Hybrid between validation set and LOO Split training set into k subsets 1. Training set: n n /k 2. Test set: n /k k learning problems Cross-validation error: CV (k) = 1 k k MSE i i=1

Bootstrap Goal: Understand the confidence in learned parameters Most useful in inference How confident are we in learned values of β: mpg = β 0 + β 1 power

Bootstrap Goal: Understand the confidence in learned parameters Most useful in inference How confident are we in learned values of β: mpg = β 0 + β 1 power Approach: Run learning algorithm multiple times with different data sets:

Bootstrap Goal: Understand the confidence in learned parameters Most useful in inference How confident are we in learned values of β: mpg = β 0 + β 1 power Approach: Run learning algorithm multiple times with different data sets: Create a new data-set by sampling with replacement from the original one

Principal Component Analysis Ad Spending 0 5 10 15 20 25 30 35 10 20 30 40 50 60 70 Population 1st Principal Component: Direction with the largest variance Z 1 = 0.839 (pop pop) + 0.544 (ad ad)

Principal Component Analysis Ad Spending 0 5 10 15 20 25 30 35 10 20 30 40 50 60 70 Population 1st Principal Component: Direction with the largest variance Is this linear? Z 1 = 0.839 (pop pop) + 0.544 (ad ad)

Principal Component Analysis Ad Spending 0 5 10 15 20 25 30 35 10 20 30 40 50 60 70 Population 1st Principal Component: Direction with the largest variance Z 1 = 0.839 (pop pop) + 0.544 (ad ad) Is this linear? Yes, after mean centering.

K-Means Algorithm Heuristic solution to the minimization problem 1. Randomly assign cluster numbers to observations 2. Iterate while clusters change 2.1 For each cluster, compute the centroid 2.2 Assign each observation to the closest cluster Note that: 1 C k i,i C k j=1 p (x ij x i j) 2 = 2 i,i C k j=1 p (x ij x kj ) 2

K-Means Illustration Data Step 1 Iteration 1, Step 2a Iteration 1, Step 2b Iteration 2, Step 2a Final Results

Dendrogram: Similarity Tree 0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10

What Next? This course: How to use machine learning. More tools: Gaussian processes Time series models Domain specific models (e.g., natural language processing)

What Next? This course: How to use machine learning. More tools: Gaussian processes Time series models Domain specific models (e.g., natural language processing) Doing ML Research Musts: Linear algebra, statistics, convex optimization Important: Probably Approximately Correct Learning