Final Overview Introduction to ML Marek Petrik 4/25/2017
This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood, cross validation Fundamental machine learning techniques: regression, model-selection, deep learning Educational goals: 1. How to apply basic methods 2. Reveal what happens inside 3. What are the pitfalls 4. Expand understanding of linear algebra, statistics, and optimization
What is Machine Learning Discover unknown function f: X = set of features, or inputs Y = target, or response Y = f(x) Sales 5 10 15 20 25 Sales 5 10 15 20 25 Sales 5 10 15 20 25 0 50 100 200 300 TV 0 10 20 30 40 50 Radio 0 20 40 60 80 100 Newspaper Sales = f(tv, Radio, Newspaper)
Statistical View of Machine Learning Probability space Ω: Set of all adults Random variable: X(ω) = R: Years of education Random variable: Y (ω) = R: Salary Income 20 30 40 50 60 70 80 Income 20 30 40 50 60 70 80 10 12 14 16 18 20 22 Years of Education 10 12 14 16 18 20 22 Years of Education
How Good are Predictions? Learned function ˆf Test data: (x 1, y 1 ), (x 2, y 2 ),... Mean Squared Error (MSE): MSE = 1 n n (y i ˆf(x i )) 2 i=1 This is the estimate of: MSE = E[(Y ˆf(X)) 2 ] = 1 Ω Important: Samples x i are i.i.d. (Y (ω) ˆf(X(ω))) 2 ω Ω
KNN: K-Nearest Neighbors Idea: Use similar training points when making predictions o o o o o o o o o o o o o o o o o o o o o o o o Non-parametric method (unlike regression)
Bias-Variance Decomposition Y = f(x) + ɛ Mean Squared Error can be decomposed as: MSE = E(Y ˆf(X)) 2 = Var( }{{ ˆf(X)) + (E( } ˆf(X))) 2 + Var(ɛ) }{{} Variance Bias Bias: How well would method work with infinite data Variance: How much does output change with different data sets
R 2 Statistic R 2 = 1 RSS n TSS = 1 i=1 (y i ŷ i ) 2 n i=1 (y i ȳ) 2 RSS - residual sum of squares, TSS - total sum of squares R 2 measures the goodness of the fit as a proportion Proportion of data variance explained by the model Extreme values: 0: Model does not explain data 1: Model explains data perfectly
Correlation Coefficient Measures dependence between two random variables X and Y r = Cov(X, Y ) Var(X) Var(Y ) Correlation coefficient r is between [ 1, 1] 0: Variables are not related 1: Variables are perfectly related (same) 1: Variables are negatively related (different)
Correlation Coefficient Measures dependence between two random variables X and Y r = Cov(X, Y ) Var(X) Var(Y ) Correlation coefficient r is between [ 1, 1] 0: Variables are not related 1: Variables are perfectly related (same) 1: Variables are negatively related (different) R 2 = r 2
Qualitative Features: Many Values The Right Way Predict salary as a function of state Feature state i {MA, NH, ME}
Qualitative Features: Many Values The Right Way Predict salary as a function of state Feature state i {MA, NH, ME} Introduce 2 indicator variables x i, z i : { 0 if state i = MA x i = z i = 1 if state i MA { 0 if state i = NH 1 if state i NH Predict salary as: β 0 + β 1 salary = β 0 + β 1 x i + β 2 z i = β 0 + β 2 β 0 if state i = MA if state i = NH if state i = ME
Qualitative Features: Many Values The Right Way Predict salary as a function of state Feature state i {MA, NH, ME} Introduce 2 indicator variables x i, z i : { 0 if state i = MA x i = z i = 1 if state i MA { 0 if state i = NH 1 if state i NH Predict salary as: β 0 + β 1 salary = β 0 + β 1 x i + β 2 z i = β 0 + β 2 β 0 if state i = MA if state i = NH if state i = ME Need an indicator variable for ME? Why? hint: linear independence
Outlier Data Points Data point that is far away from others Measurement failure, sensor fails, missing data point Can seriously influence prediction quality Y 4 2 0 2 4 6 20 Residuals 1 0 1 2 3 4 20 Studentized Residuals 0 2 4 6 20 2 1 0 1 2 X 2 0 2 4 6 Fitted Values 2 0 2 4 6 Fitted Values
Points with High Leverage Points with unusual value of x i Single data point can have significant impact on prediction R and other packages can compute leverages of data points Y 0 5 10 20 41 X2 2 1 0 1 2 Studentized Residuals 1 0 1 2 3 4 5 20 41 2 1 0 1 2 3 4 X 2 1 0 1 2 X1 0.00 0.05 0.10 0.15 0.20 0.25 Leverage Good to remove points with high leverage and residual
Best Subset Selection Want to find a subset of p features The subset should be small and predict well Example: credit rating + income + student + limit M 0 null model (no features); for k = 1, 2,..., p do Fit all ( p k) models that contain k features ; M k best of ( p k) models according to a metric (CV error, R 2, etc) end return Best of M 0, M 1,..., M p according to metric above Algorithm 1: Best Subset Selection
Regularization Ridge regression (parameter λ), l 2 penalty min β min RSS(β) + λ βj 2 = β j 2 n p y i β 0 β j x ij + λ j=1 j i=1 β 2 j Lasso (parameter λ), l 1 penalty min β min RSS(β) + λ β j = β j 2 n p y i β 0 β j x ij + λ j i=1 j=1 β j Approximations to the l 0 solution
Logistic Regression Predict probability of a class: p(x) Example: p(balance) probability of default for person with balance Linear regression: logistic regression: p(x) = β 0 + β 1 p(x) = eβ 0+β 1 X 1 + e β 0+β 1 X the same as: ( ) p(x) log = β 0 + β 1 X 1 p(x) Odds: p(x) /1 p(x)
Logistic Function y = ex 1 + e x Logistic 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 p(x) = x eβ 0+β 1 X 1 + e β 0+β 1 X
Logit Function ( ) p(x) log 1 p(x) Logit 4 2 0 2 4 0.0 0.2 0.4 0.6 0.8 1.0 ( ) p(x) p(x) log = β 0 + β 1 X 1 p(x)
Estimating Coefficients: Maximum Likelihood Likelihood: Probability that data is generated from a model Find the most likely model: l(model) = Pr[data model] max l(model) = max Pr[data model] model model Likelihood function is difficult to maximize Transform it using log (strictly increasing) max log l(model) model Strictly increasing transformation does not change maximum
Discriminative vs Generative Models Discriminative models Estimate conditional models Pr[Y X] Linear regression Logistic regression Generative models Estimate joint probability Pr[Y, X] = Pr[Y X] Pr[X] Estimates not only probability of labels but also the features Once model is fit, can be used to generate data LDA, QDA, Naive Bayes
LDA: Linear Discriminant Analysis Generative model: capture probability of predictors for each label 0 1 2 3 4 5 4 2 0 2 4 3 2 1 0 1 2 3 4 Predict:
LDA: Linear Discriminant Analysis Generative model: capture probability of predictors for each label 0 1 2 3 4 5 4 2 0 2 4 3 2 1 0 1 2 3 4 Predict: 1. Pr[balance default = yes] and Pr[default = yes]
LDA: Linear Discriminant Analysis Generative model: capture probability of predictors for each label 0 1 2 3 4 5 4 2 0 2 4 3 2 1 0 1 2 3 4 Predict: 1. Pr[balance default = yes] and Pr[default = yes] 2. Pr[balance default = no] and Pr[default = no]
LDA: Linear Discriminant Analysis Generative model: capture probability of predictors for each label 0 1 2 3 4 5 4 2 0 2 4 3 2 1 0 1 2 3 4 Predict: 1. Pr[balance default = yes] and Pr[default = yes] 2. Pr[balance default = no] and Pr[default = no] Classes are normal: Pr[balance default = yes]
QDA: Quadratic Discriminant Analysis Generalizes LDA LDA: Class variances Σ k = Σ are the same QDA: Class variances Σ k can differ
QDA: Quadratic Discriminant Analysis Generalizes LDA LDA: Class variances Σ k = Σ are the same QDA: Class variances Σ k can differ LDA or QDA has smaller training error on the same data?
QDA: Quadratic Discriminant Analysis Generalizes LDA LDA: Class variances Σ k = Σ are the same QDA: Class variances Σ k can differ LDA or QDA has smaller training error on the same data? What about the test error?
QDA: Quadratic Discriminant Analysis X2 4 3 2 1 0 1 2 X2 4 3 2 1 0 1 2 4 2 0 2 4 X 1 4 2 0 2 4 X 1
Naive Bayes Simple Bayes net classification With normal distribution over features X 1,..., X k special case of QDA with diagonal Σ Generalizes to non-normal distributions and discrete variables More on it later...
Maximum Margin Hyperplane X2 1 0 1 2 3 1 0 1 2 3 X 1
Introducing Slack Variables Maximum margin classifier max β,m s.t. M y i (β x i ) M β 2 = 1 Support Vector Classifier a.k.a Linear SVM Slack variables: ɛ Parameter: C max β,m,ɛ 0 M s.t. y i (β x i ) (M ɛ i ) β 2 = 1 ɛ 1 C
Introducing Slack Variables Maximum margin classifier max β,m s.t. M y i (β x i ) M β 2 = 1 Support Vector Classifier a.k.a Linear SVM Slack variables: ɛ max β,m,ɛ 0 Parameter: C What if C = 0? M s.t. y i (β x i ) (M ɛ i ) β 2 = 1 ɛ 1 C
Kernelized SVM Dual Quadratic Program (usually max-min, not here) max α 0 s.t. M α l 1 2 l=1 M α l y l = 0 l=1 M α j α k y j y k k(x j, x k ) j,k=1 Representer theorem: (classification test): f(z) = M α l y l k(z, x l ) > 0 l=1
Kernels Polynomial kernel ) d k(x 1, x 2 ) = (1 + x 1 x 2 Radial kernel k(x 1, x 2 ) = exp ( γ x 1 x 2 2 ) 2 Many many more
Polynomial and Radial Kernels X2 4 2 0 2 4 X2 4 2 0 2 4 4 2 0 2 4 X 1 4 2 0 2 4 X 1
Regression Trees Predict Baseball Salary based on Years played and Hits Example: Years < 4.5 5.11 Hits < 117.5 6.00 6.74
CART: Recursive Binary Splitting Greedy top-to-bottom approach Recursively divide regions to minimize RSS (y i ȳ 1 ) 2 + x i R 1 x i R 2 (y i ȳ 2 ) 2 Prune tree
Trees vs. KNN Trees do not require a distance metric Trees work well with categorical predictors Trees work well in large dimensions KNN are better in low-dimensional problems with complex decision boundaries
Bagging Stands for Bootstrap Aggregating Construct multiple bootstrapped training sets: Fit a tree to each one: T 1, T 2,..., T B ˆf 1, ˆf 2,..., ˆf B Make predictions by averaging individual tree predictions ˆf(x) = 1 B B b=1 ˆf b (x) Large values of B are not likely to overfit, B 100 is a good choice
Random Forests Many trees in bagging will be similar Algorithms choose the same features to split on Random forests help to address similarity: At each split, choose only from m randomly sampled features Good empirical choice is m = p Test Classification Error 0.2 0.3 0.4 0.5 m=p m=p/2 m= p 0 100 200 300 400 500 Number of Trees
Gradient Boosting (Regression) Boosting uses all of data, not a random subset (usually) Also builds trees ˆf 1, ˆf 2,... and weights λ 1, λ 2,... Combined prediction: ˆf(x) = i λ i ˆfi (x) Assume we have 1... m trees and weights, next best tree?
Gradient Boosting (Regression) Just use gradient descent Objective is to minimize RSS (1/2): 1 2 n (y i f(x i )) 2 i=1 Objective with the new tree m + 1: 1 2 n y i i=1 m ˆf j (x i ) ˆf m+1 (x i ) j=1 Greatest reduction in RSS: gradient y i m ˆf j (x i ) ˆf m+1 (x i ) j=1 2
ROC Curve Confusion matrix Reality Predicted Positive Negative Positive True Positive False Positive Negative False Negative True Negative ROC Curve True positive rate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Area Under ROC Curve ROC Curve True positive rate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 False positive rate Larger area is better Many other ways to measure classifier performance, like F 1
Evaluation Method 1: Validation Set Just evaluate how well the method works on the test set Randomly split data to: 1. Training set: about half of all data 2. Validation set (AKA hold-out set): remaining half
Evaluation Method 1: Validation Set Just evaluate how well the method works on the test set Randomly split data to: 1. Training set: about half of all data 2. Validation set (AKA hold-out set): remaining half Choose the number of features/representation based on minimizing error on validation set
Evaluation Method 2: Leave-one-out Addresses problems with validation set Split the data set into 2 parts: 1. Training: Size n 1 2. Validation: Size 1 Repeat n times: to get n learning problems
Evaluation Method 3: k-fold Cross-validation Hybrid between validation set and LOO Split training set into k subsets 1. Training set: n n /k 2. Test set: n /k k learning problems Cross-validation error: CV (k) = 1 k k MSE i i=1
Bootstrap Goal: Understand the confidence in learned parameters Most useful in inference How confident are we in learned values of β: mpg = β 0 + β 1 power
Bootstrap Goal: Understand the confidence in learned parameters Most useful in inference How confident are we in learned values of β: mpg = β 0 + β 1 power Approach: Run learning algorithm multiple times with different data sets:
Bootstrap Goal: Understand the confidence in learned parameters Most useful in inference How confident are we in learned values of β: mpg = β 0 + β 1 power Approach: Run learning algorithm multiple times with different data sets: Create a new data-set by sampling with replacement from the original one
Principal Component Analysis Ad Spending 0 5 10 15 20 25 30 35 10 20 30 40 50 60 70 Population 1st Principal Component: Direction with the largest variance Z 1 = 0.839 (pop pop) + 0.544 (ad ad)
Principal Component Analysis Ad Spending 0 5 10 15 20 25 30 35 10 20 30 40 50 60 70 Population 1st Principal Component: Direction with the largest variance Is this linear? Z 1 = 0.839 (pop pop) + 0.544 (ad ad)
Principal Component Analysis Ad Spending 0 5 10 15 20 25 30 35 10 20 30 40 50 60 70 Population 1st Principal Component: Direction with the largest variance Z 1 = 0.839 (pop pop) + 0.544 (ad ad) Is this linear? Yes, after mean centering.
K-Means Algorithm Heuristic solution to the minimization problem 1. Randomly assign cluster numbers to observations 2. Iterate while clusters change 2.1 For each cluster, compute the centroid 2.2 Assign each observation to the closest cluster Note that: 1 C k i,i C k j=1 p (x ij x i j) 2 = 2 i,i C k j=1 p (x ij x kj ) 2
K-Means Illustration Data Step 1 Iteration 1, Step 2a Iteration 1, Step 2b Iteration 2, Step 2a Final Results
Dendrogram: Similarity Tree 0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10
What Next? This course: How to use machine learning. More tools: Gaussian processes Time series models Domain specific models (e.g., natural language processing)
What Next? This course: How to use machine learning. More tools: Gaussian processes Time series models Domain specific models (e.g., natural language processing) Doing ML Research Musts: Linear algebra, statistics, convex optimization Important: Probably Approximately Correct Learning