1 Final Overview: Introduction to ML (Marek Petrik, 4/25/2017)

2 This Course: Introduction to Machine Learning
Build a foundation for practice and research in ML
Basic machine learning concepts: maximum likelihood, cross-validation
Fundamental machine learning techniques: regression, model selection, deep learning
Educational goals:
1. How to apply basic methods
2. Reveal what happens inside
3. What the pitfalls are
4. Expand understanding of linear algebra, statistics, and optimization

3 What is Machine Learning?
Discover an unknown function f:
X = set of features, or inputs
Y = target, or response
Y = f(X)
[Figure: Sales as a function of TV, Radio, and Newspaper advertising]
Sales = f(TV, Radio, Newspaper)

4 Statistical View of Machine Learning
Probability space Ω: set of all adults
Random variable X: Ω → R: years of education
Random variable Y: Ω → R: salary (income)
[Figure: Income vs. Years of Education]

5 How Good are Predictions?
Learned function \hat{f}
Test data: (x_1, y_1), (x_2, y_2), ...
Mean Squared Error (MSE):
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2
This is an estimate of:
MSE = E[(Y - \hat{f}(X))^2] = \frac{1}{|\Omega|} \sum_{\omega \in \Omega} (Y(\omega) - \hat{f}(X(\omega)))^2
Important: samples x_i are i.i.d.
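As a quick illustration (not from the slides), here is a minimal Python sketch of the test-set MSE estimate; the names `f_hat`, `X_test`, and `y_test` are hypothetical:

```python
import numpy as np

def test_mse(f_hat, X_test, y_test):
    """Estimate E[(Y - f_hat(X))^2] from i.i.d. test samples."""
    preds = f_hat(X_test)
    return np.mean((y_test - preds) ** 2)

# Example: evaluate a constant predictor on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + rng.normal(size=100)
print(test_mse(lambda X: np.full(len(X), y.mean()), X, y))
```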

6 KNN: K-Nearest Neighbors
Idea: use similar training points when making predictions
[Figure: KNN decision boundary on two-class data]
Non-parametric method (unlike regression)
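A minimal NumPy sketch of KNN classification, assuming Euclidean distance and a majority vote (the slide does not fix these choices):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the majority label among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```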

7 Bias-Variance Decomposition
Y = f(X) + \epsilon
The mean squared error can be decomposed as:
MSE = E[(Y - \hat{f}(X))^2] = \underbrace{Var(\hat{f}(X))}_{\text{Variance}} + \underbrace{(E[\hat{f}(X)] - f(X))^2}_{\text{Bias}^2} + Var(\epsilon)
Bias: how well the method would work with infinite data
Variance: how much the output changes with different data sets

8 R² Statistic
R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
RSS: residual sum of squares; TSS: total sum of squares
R² measures goodness of fit as a proportion: the proportion of data variance explained by the model
Extreme values:
0: model does not explain the data
1: model explains the data perfectly
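A short sketch of the R² computation, directly following the formula above:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS."""
    rss = np.sum((y - y_hat) ** 2)         # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return 1.0 - rss / tss
```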

9-10 Correlation Coefficient
Measures dependence between two random variables X and Y:
r = \frac{Cov(X, Y)}{\sqrt{Var(X) \, Var(Y)}}
The correlation coefficient r lies in [-1, 1]:
0: variables are not related
1: variables are perfectly related (same)
-1: variables are perfectly negatively related (opposite)
For simple linear regression: R^2 = r^2

11-13 Qualitative Features: Many Values, the Right Way
Predict salary as a function of state, with feature state_i ∈ {MA, NH, ME}
Introduce 2 indicator variables x_i, z_i:
x_i = 1 if state_i = MA, 0 otherwise
z_i = 1 if state_i = NH, 0 otherwise
Predict salary as:
salary = \beta_0 + \beta_1 x_i + \beta_2 z_i =
  \beta_0 + \beta_1  if state_i = MA
  \beta_0 + \beta_2  if state_i = NH
  \beta_0            if state_i = ME
Do we need an indicator variable for ME? Why? (hint: linear independence)
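A small sketch of the indicator encoding above; the function name `encode_state` is made up for illustration:

```python
import numpy as np

def encode_state(states):
    """Map state in {MA, NH, ME} to two indicator features (ME is the baseline)."""
    states = np.asarray(states)
    x = (states == "MA").astype(float)   # x_i = 1 iff state is MA
    z = (states == "NH").astype(float)   # z_i = 1 iff state is NH
    return np.column_stack([x, z])       # ME is encoded as (0, 0)

print(encode_state(["MA", "NH", "ME"]))
# [[1. 0.]
#  [0. 1.]
#  [0. 0.]]
```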

14 Outlier Data Points
A data point that is far away from the others
Causes: measurement failure, sensor failure, missing data
Can seriously influence prediction quality
[Figure: outlier visible in Y vs. X and in (studentized) residuals vs. fitted values]

15 Points with High Leverage
Points with an unusual value of x_i
A single data point can have a significant impact on the prediction
R and other packages can compute the leverage of data points
[Figure: high-leverage point in Y vs. X and in studentized residuals vs. leverage]
It is good practice to remove points with both high leverage and high residual

16 Best Subset Selection
Want to find a subset of the p features; the subset should be small and predict well
Example: credit rating + income + student + limit

Algorithm 1: Best Subset Selection
  M_0 ← null model (no features)
  for k = 1, 2, ..., p do
      fit all (p choose k) models that contain k features
      M_k ← best of the (p choose k) models according to a metric (CV error, R², etc.)
  end
  return the best of M_0, M_1, ..., M_p according to the metric above
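A rough Python sketch of Algorithm 1, assuming scikit-learn's `LinearRegression` and cross-validation error as the metric; it skips the null model M_0 and is exponential in p, as the algorithm implies:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def best_subset_selection(X, y):
    """Try every feature subset; keep the one with the best CV score."""
    p = X.shape[1]
    best_feats, best_score = (), -np.inf
    for k in range(1, p + 1):
        for feats in combinations(range(p), k):
            score = cross_val_score(LinearRegression(), X[:, list(feats)], y,
                                    scoring="neg_mean_squared_error").mean()
            if score > best_score:
                best_feats, best_score = feats, score
    return best_feats
```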

17 Regularization
Ridge regression (parameter λ), ℓ2 penalty:
\min_\beta RSS(\beta) + \lambda \sum_j \beta_j^2 = \min_\beta \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_j \beta_j^2
Lasso (parameter λ), ℓ1 penalty:
\min_\beta RSS(\beta) + \lambda \sum_j |\beta_j| = \min_\beta \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_j |\beta_j|
Both are approximations to the ℓ0 (subset selection) solution
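A hedged scikit-learn sketch of both penalties; note that scikit-learn calls the λ parameter `alpha`, and the data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(ridge.coef_, 2))      # all coefficients shrunk, none exactly 0
print(np.round(lasso.coef_, 2))      # irrelevant coefficients driven to 0
```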

18 Logistic Regression
Predict the probability of a class: p(X)
Example: p(balance), the probability of default for a person with a given balance
Linear regression: p(X) = \beta_0 + \beta_1 X
Logistic regression: p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
which is the same as:
\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X
Odds: \frac{p(X)}{1 - p(X)}
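A minimal sketch fitting logistic regression by maximum likelihood with scikit-learn; the `balance`/`default` data here is simulated, not the textbook data set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
balance = rng.uniform(0, 3000, size=500).reshape(-1, 1)
p_true = 1 / (1 + np.exp(-(balance[:, 0] - 1800) / 300))  # synthetic default prob.
default = rng.binomial(1, p_true)

model = LogisticRegression(max_iter=1000).fit(balance, default)
print(model.predict_proba([[2000.0]])[0, 1])  # estimated p(balance = 2000)
```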

19 Logistic Function
y = \frac{e^x}{1 + e^x}
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}

20 Logit Function
\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X

21 Estimating Coefficients: Maximum Likelihood
Likelihood: the probability that the data is generated from a model
l(\text{model}) = \Pr[\text{data} \mid \text{model}]
Find the most likely model:
\max_{\text{model}} l(\text{model}) = \max_{\text{model}} \Pr[\text{data} \mid \text{model}]
The likelihood function is difficult to maximize directly, so transform it using log (strictly increasing):
\max_{\text{model}} \log l(\text{model})
A strictly increasing transformation does not change the maximizer

22 Discriminative vs. Generative Models
Discriminative models:
Estimate the conditional model Pr[Y | X]
Examples: linear regression, logistic regression
Generative models:
Estimate the joint probability Pr[Y, X] = Pr[Y | X] Pr[X]
Estimate not only the probability of the labels but also of the features
Once fit, the model can be used to generate data
Examples: LDA, QDA, Naive Bayes

23-26 LDA: Linear Discriminant Analysis
Generative model: capture the probability of the predictors for each label
Predict using:
1. Pr[balance | default = yes] and Pr[default = yes]
2. Pr[balance | default = no] and Pr[default = no]
Classes are modeled as normal: e.g., Pr[balance | default = yes] is Gaussian

27-30 QDA: Quadratic Discriminant Analysis
Generalizes LDA:
LDA: class covariances Σ_k = Σ are the same
QDA: class covariances Σ_k can differ
Which of LDA and QDA has the smaller training error on the same data? What about the test error?
[Figure: LDA vs. QDA decision boundaries]

31 Naive Bayes
Simple Bayes-net classification
With a normal distribution over features X_1, ..., X_k: a special case of QDA with diagonal Σ
Generalizes to non-normal distributions and discrete variables
More on it later...

32 Maximum Margin Hyperplane
[Figure: maximum margin hyperplane separating two classes]

33-34 Introducing Slack Variables
Maximum margin classifier:
\max_{\beta, M} M \quad \text{s.t.} \quad y_i (\beta^\top x_i) \ge M, \; \|\beta\|_2 = 1
Support Vector Classifier, a.k.a. linear SVM, with slack variables ε and parameter C:
\max_{\beta, M, \epsilon \ge 0} M \quad \text{s.t.} \quad y_i (\beta^\top x_i) \ge M(1 - \epsilon_i), \; \|\beta\|_2 = 1, \; \sum_i \epsilon_i \le C
What if C = 0?
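A scikit-learn sketch of the support vector classifier on synthetic blobs; be aware that scikit-learn's `C` penalizes slack, so it acts roughly as the inverse of the slack budget C on the slide:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Large sklearn C -> little slack allowed (narrow margin);
# small sklearn C -> lots of slack (wide margin).
soft = SVC(kernel="linear", C=0.01).fit(X, y)   # wide margin, many violations
hard = SVC(kernel="linear", C=100.0).fit(X, y)  # narrow margin, few violations
print(len(soft.support_), len(hard.support_))   # soft margin uses more support vectors
```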

35 Kernelized SVM
Dual quadratic program (usually a max-min; not here):
\max_{\alpha \ge 0} \sum_{l=1}^{M} \alpha_l - \frac{1}{2} \sum_{j,k=1}^{M} \alpha_j \alpha_k y_j y_k \, k(x_j, x_k)
\text{s.t.} \quad \sum_{l=1}^{M} \alpha_l y_l = 0
Representer theorem (classification test):
f(z) = \sum_{l=1}^{M} \alpha_l y_l \, k(z, x_l) > 0

36 Kernels
Polynomial kernel:
k(x_1, x_2) = (1 + x_1^\top x_2)^d
Radial kernel:
k(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|_2^2)
Many, many more
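The two kernels written out directly; a minimal sketch assuming plain vector inputs:

```python
import numpy as np

def polynomial_kernel(x1, x2, d=3):
    """k(x1, x2) = (1 + x1 . x2)^d"""
    return (1.0 + np.dot(x1, x2)) ** d

def radial_kernel(x1, x2, gamma=1.0):
    """k(x1, x2) = exp(-gamma * ||x1 - x2||^2)"""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))
```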

37 Polynomial and Radial Kernels
[Figure: SVM decision boundaries with polynomial and radial kernels]

38 Regression Trees
Predict baseball salary based on Years played and Hits
[Figure: regression tree that first splits on Years, then on Hits]

39 CART: Recursive Binary Splitting
Greedy top-down approach
Recursively divide regions to minimize the RSS:
\sum_{x_i \in R_1} (y_i - \bar{y}_1)^2 + \sum_{x_i \in R_2} (y_i - \bar{y}_2)^2
Then prune the tree
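A sketch of one greedy split on a single feature, minimizing the two-region RSS above; a full CART implementation would recurse over all features and regions:

```python
import numpy as np

def best_split(x, y):
    """Find the threshold on one feature that minimizes the two-region RSS."""
    best_t, best_rss = None, np.inf
    for t in np.unique(x):
        left, right = y[x < t], y[x >= t]
        if len(left) == 0 or len(right) == 0:
            continue  # skip degenerate splits
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if rss < best_rss:
            best_t, best_rss = t, rss
    return best_t, best_rss
```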

40 Trees vs. KNN
Trees do not require a distance metric
Trees work well with categorical predictors
Trees work well in high dimensions
KNN is better in low-dimensional problems with complex decision boundaries

41 Bagging
Stands for Bootstrap Aggregating
Construct multiple bootstrapped training sets: T_1, T_2, ..., T_B
Fit a tree to each one: \hat{f}_1, \hat{f}_2, ..., \hat{f}_B
Make predictions by averaging the individual tree predictions:
\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)
Large values of B are not likely to overfit; B ≈ 100 is a good choice

42 Random Forests
Many trees in bagging will be similar: the algorithm chooses the same features to split on
Random forests address this similarity: at each split, choose only from m randomly sampled features
A good empirical choice is m = √p
[Figure: test classification error vs. number of trees for m = p, m = p/2, and m = √p]
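A scikit-learn sketch on synthetic data; `max_features="sqrt"` corresponds to the m = √p rule above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# max_features="sqrt" samples m = sqrt(p) candidate features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))
```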

43 Gradient Boosting (Regression)
Boosting uses all of the data, not a random subset (usually)
It also builds trees \hat{f}_1, \hat{f}_2, ... and weights \lambda_1, \lambda_2, ...
Combined prediction:
\hat{f}(x) = \sum_i \lambda_i \hat{f}_i(x)
Assume we already have trees 1, ..., m and their weights; what is the next best tree?

44 Gradient Boosting (Regression)
Just use gradient descent
The objective is to minimize the RSS (times 1/2):
\frac{1}{2} \sum_{i=1}^{n} (y_i - f(x_i))^2
Objective with the new tree m + 1:
\frac{1}{2} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} \hat{f}_j(x_i) - \hat{f}_{m+1}(x_i) \Big)^2
Greatest reduction in RSS: fit the new tree \hat{f}_{m+1} to the residuals (the negative gradient):
y_i - \sum_{j=1}^{m} \hat{f}_j(x_i)
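A minimal residual-fitting sketch of gradient boosting for regression, assuming scikit-learn's `DecisionTreeRegressor` and a fixed shrinkage λ; real implementations also tune depth and learning rate:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=50, lam=0.1, depth=2):
    """Fit each new tree to the current residuals (the negative gradient of RSS)."""
    trees, pred = [], np.zeros(len(y))
    for _ in range(n_trees):
        residuals = y - pred                     # negative gradient of (1/2) RSS
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        trees.append(tree)
        pred += lam * tree.predict(X)            # shrink each tree by lambda
    return trees

def boost_predict(trees, X, lam=0.1):
    return lam * sum(t.predict(X) for t in trees)
```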

45 ROC Curve
Confusion matrix:
                     | Reality: Positive | Reality: Negative
Predicted Positive   | True Positive     | False Positive
Predicted Negative   | False Negative    | True Negative
ROC curve: plots the true positive rate against the false positive rate

46 Area Under ROC Curve
[Figure: ROC curve, true positive rate vs. false positive rate]
A larger area under the curve is better
There are many other ways to measure classifier performance, such as the F_1 score

47-48 Evaluation Method 1: Validation Set
Just evaluate how well the method works on a held-out set
Randomly split the data into:
1. Training set: about half of all data
2. Validation set (a.k.a. hold-out set): the remaining half
Choose the number of features / the representation by minimizing the error on the validation set

49 Evaluation Method 2: Leave-One-Out
Addresses problems with the validation-set approach
Split the data set into 2 parts:
1. Training: size n - 1
2. Validation: size 1
Repeat n times to get n learning problems

50 Evaluation Method 3: k-fold Cross-Validation
A hybrid between the validation set and leave-one-out
Split the training set into k subsets; for each of the k learning problems:
1. Training set: n - n/k observations
2. Test set: n/k observations
Cross-validation error:
CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i
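A sketch of k-fold cross-validation with scikit-learn's `KFold` on synthetic data, averaging the per-fold MSEs as in the formula above:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

mses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
print(np.mean(mses))  # CV_(k) = average of the k fold MSEs
```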

51-53 Bootstrap
Goal: understand the confidence in learned parameters; most useful in inference
For example, how confident are we in the learned values of β in mpg = β_0 + β_1 · power?
Approach: run the learning algorithm multiple times with different data sets
Create each new data set by sampling with replacement from the original one
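A sketch of the bootstrap for regression coefficients; the helper `bootstrap_coefs` is a made-up name, and the standard deviation across resamples approximates the standard error of β:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bootstrap_coefs(X, y, n_boot=1000, seed=0):
    """Refit on resampled data sets to estimate the variability of beta."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample n rows with replacement
        coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
    return np.std(coefs, axis=0)              # bootstrap standard errors
```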

54-56 Principal Component Analysis
[Figure: Ad Spending vs. Population with the first principal component]
1st principal component: the direction with the largest variance
Z_1 = \phi_1 (\text{pop} - \overline{\text{pop}}) + \phi_2 (\text{ad} - \overline{\text{ad}}) for some loadings \phi_1, \phi_2
Is this linear? Yes, after mean centering.
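A scikit-learn sketch of the first principal component on synthetic population/ad-spending data; `PCA` mean-centers internally, which is what makes the component linear:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
pop = rng.normal(50, 10, size=200)
ad = 0.5 * pop + rng.normal(0, 3, size=200)
X = np.column_stack([pop, ad])

pca = PCA(n_components=1)                 # mean-centers the data internally
z1 = pca.fit_transform(X)[:, 0]           # scores on the 1st principal component
print(pca.components_[0], z1.var())       # loadings (phi_1, phi_2) and variance
```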

57 K-Means Algorithm
A heuristic solution to the minimization problem:
1. Randomly assign cluster numbers to the observations
2. Iterate while the clusters change:
   2.1 For each cluster, compute the centroid
   2.2 Assign each observation to the closest centroid
Note that:
\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2
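A minimal NumPy sketch of the heuristic above (Lloyd's algorithm); the empty-cluster guard is an implementation detail not on the slide:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Alternate centroid computation and reassignment until clusters stop changing."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(X))        # step 1: random assignment
    for _ in range(n_iter):
        # step 2.1: centroids (re-seed any cluster that became empty)
        centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)           # step 2.2: closest centroid
        if np.array_equal(new_assign, assign):      # stop when clusters don't change
            break
        assign = new_assign
    return assign, centroids
```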

58 K-Means Illustration
[Figure: K-means progression from random assignment (Step 1) through alternating steps 2a/2b to the final clusters]

59 Dendrogram: Similarity Tree

60-61 What Next?
This course: how to use machine learning
More tools: Gaussian processes, time-series models, domain-specific models (e.g., natural language processing)
Doing ML research:
Musts: linear algebra, statistics, convex optimization
Important: Probably Approximately Correct (PAC) learning
