Dimension Reduction Methods

Size: px

Start display at page:

Download "Dimension Reduction Methods"

Cleopatra Newman
5 years ago
Views:

1 Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28

2 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2. Regularization (shrinkage) 3. Dimensionality reduction (next class)

3 Best Subset Selection Want to find a subset of p features The subset should be small and predict well Example: credit rating + income + student + limit M 0 null model (no features); for k = 1, 2,..., p do Fit all ( p k) models that contain k features ; M k best of ( p k) models according to a metric (CV error, R 2, etc) end return Best of M 0, M 1,..., M p according to metric above Algorithm 1: Best Subset Selection

4 Achieving Scalability Complexity of Best Subset Selection?

5 Achieving Scalability Complexity of Best Subset Selection? Examine all possible subsets? How many?

6 Achieving Scalability Complexity of Best Subset Selection? Examine all possible subsets? How many? O(2 p )!

7 Achieving Scalability Complexity of Best Subset Selection? Examine all possible subsets? How many? O(2 p )! Heuristic approaches: 1. Stepwise selection: Solve the problem approximately: greedy 2. Regularization: Solve a different (easier) problem: relaxation

8 Which Metric to Use? M 0 null model (no features); for k = 1, 2,..., p do Fit all ( p k) models that contain k features ; M k best of ( p k) models according to a metric (CV error, R 2, etc) end return Best of M 0, M 1,..., M p according to metric above Algorithm 2: Best Subset Selection

9 Which Metric to Use? M 0 null model (no features); for k = 1, 2,..., p do Fit all ( p k) models that contain k features ; M k best of ( p k) models according to a metric (CV error, R 2, etc) end return Best of M 0, M 1,..., M p according to metric above Algorithm 3: Best Subset Selection 1. Direct error estimate: Cross validation, precise but computationally intensive

10 Which Metric to Use? M 0 null model (no features); for k = 1, 2,..., p do Fit all ( p k) models that contain k features ; M k best of ( p k) models according to a metric (CV error, R 2, etc) end return Best of M 0, M 1,..., M p according to metric above Algorithm 4: Best Subset Selection 1. Direct error estimate: Cross validation, precise but computationally intensive 2. Indirect error estimate: Mellow s C p : C p = 1 n (RSS +2dˆσ2 ) where ˆσ 2 Var[ɛ] Akaike information criterion, BIC, and many others. Theoretical foundations

11 Which Metric to Use? M 0 null model (no features); for k = 1, 2,..., p do Fit all ( p k) models that contain k features ; M k best of ( p k) models according to a metric (CV error, R 2, etc) end return Best of M 0, M 1,..., M p according to metric above Algorithm 5: Best Subset Selection 1. Direct error estimate: Cross validation, precise but computationally intensive 2. Indirect error estimate: Mellow s C p : C p = 1 n (RSS +2dˆσ2 ) where ˆσ 2 Var[ɛ] Akaike information criterion, BIC, and many others. Theoretical foundations 3. Interpretability Penalty: What is the cost of extra features

12 Regularization 1. Stepwise selection: Solve the problem approximately 2. Regularization: Solve a different (easier) problem: relaxation Solve a machine learning problem, but penalize solutions that use too much of the features

13 Regularization Ridge regression (parameter λ), l 2 penalty min β min RSS(β) + λ βj 2 = β j 2 n p y i β 0 β j x ij + λ j=1 j i=1 β 2 j Lasso (parameter λ), l 1 penalty min β min RSS(β) + λ β j = β j 2 n p y i β 0 β j x ij + λ j i=1 j=1 β j Approximations to the l 0 solution

14 Why Lasso Works Bias-variance trade-off Increasing λ increases bias Example: all features relevant Mean Squared Error Mean Squared Error λ R 2 on Training Data purple: test MSE, black: bias, green: variance dotted (ridge)

15 Why Lasso Works Bias-variance trade-off Increasing λ increases bias Example: some features relevant Mean Squared Error Mean Squared Error λ R 2 on Training Data purple: test MSE, black: bias, green: variance dotted (ridge)

16 Regularization Ridge regression (parameter λ), l 2 penalty min β min RSS(β) + λ βj 2 = β j 2 n p y i β 0 β j x ij + λ j=1 j i=1 β 2 j Lasso (parameter λ), l 1 penalty min β min RSS(β) + λ β j = β j 2 n p y i β 0 β j x ij + λ j i=1 j=1 β j Approximations to the l 0 solution

17 Regularization: Constrained Formulation Ridge regression (parameter λ), l 2 penalty min β n y i β 0 i=1 p β j x ij j=1 2 subject to j β 2 j s Lasso (parameter λ), l 1 penalty min β n y i β 0 i=1 p β j x ij j=1 Approximations to the l 0 solution 2 subject to j β j s

18 Lasso Solutions are Sparse Constrained Lasso (left) vs Constrained Ridge Regression (right) Constraints are blue, red are contours of the objective

19 Today Dimension reduction methods Principal component regression Partial least squares Interpretation in high dimensions Bayesian view of ridge regression and lasso

20 Dimensionality Reduction Methods Different approach to model selection We have many features: X 1, X 2,..., X p Transform features to a smaller number Z 1,..., Z M

21 Dimensionality Reduction Methods Different approach to model selection We have many features: X 1, X 2,..., X p Transform features to a smaller number Z 1,..., Z M Find constants φ jm New features Z m are linear combinations of X j : p Z m = φ jm X j j=1

22 Dimensionality Reduction Methods Different approach to model selection We have many features: X 1, X 2,..., X p Transform features to a smaller number Z 1,..., Z M Find constants φ jm New features Z m are linear combinations of X j : p Z m = φ jm X j j=1 Dimension reduction: M is much smaller than p

23 Using Transformed Features New features Z m are linear combinations of X j : Z m = Fit linear regression model: y i = θ 0 + p φ jm X j j=1 M θ m z im + ɛ i m=1 Run plain linear regression, logistic regression, LDA, or anything else

24 Recovering Coefficients for Original Features Prediction using transformed features M y i = θ 0 + θ m z im + ɛ i m=1 New features Z m are linear combinations of X j : p Z m = φ jm X j j=1 Consider prediction for data point i M θ m z im m=1

25 Recovering Coefficients for Original Features Prediction using transformed features M y i = θ 0 + θ m z im + ɛ i m=1 New features Z m are linear combinations of X j : p Z m = φ jm X j j=1 Consider prediction for data point i M M θ m z im = θ m m=1 m=1 j=1 p φ jm x ij

26 Recovering Coefficients for Original Features Prediction using transformed features M y i = θ 0 + θ m z im + ɛ i m=1 New features Z m are linear combinations of X j : p Z m = φ jm X j j=1 Consider prediction for data point i M M θ m z im = θ m m=1 m=1 j=1 p φ jm x ij = p M θ m φ jm x ij j=1 m=1

27 Recovering Coefficients for Original Features Prediction using transformed features M y i = θ 0 + θ m z im + ɛ i m=1 New features Z m are linear combinations of X j : p Z m = φ jm X j j=1 Consider prediction for data point i M M θ m z im = θ m m=1 m=1 j=1 p φ jm x ij = p M p θ m φ jm x ij = β j x ij j=1 m=1 j=1

28 Dimension Reduction 1. Reduce dimensions of features Z from X 2. Fit prediction model to compute θ 3. Compute weights for the original features β

29 Dimension Reduction 1. Reduce dimensions of features Z from X 2. Fit prediction model to compute θ 3. Compute weights for the original features β

30 Dimension Reduction 1. Reduce dimensions of features Z from X 2. Fit prediction model to compute θ 3. Compute weights for the original features β

31 How (and Why) Reduce Feaures? 1. Principal Component Analysis (PCA) 2. Partial least squares 3. Also: many other non-linear dimensionality reduction methods

32 Principal Component Analysis Unsupervised dimensionality reduction methods Works with n p data matrix X (no labels) Correlated features: pop and ad Ad Spending Population

33 1st Principal Component Ad Spending Population 1st Principal Component: Direction with the largest variance Z 1 = (pop pop) (ad ad)

34 1st Principal Component Ad Spending Population 1st Principal Component: Direction with the largest variance Is this linear? Z 1 = (pop pop) (ad ad)

35 1st Principal Component Ad Spending Population 1st Principal Component: Direction with the largest variance Z 1 = (pop pop) (ad ad) Is this linear? Yes, after mean centering.

36 1st Principal Component Ad Spending nd Principal Component Population st Principal Component green line: 1st principal component, minimize distances to all points

37 1st Principal Component Ad Spending nd Principal Component Population st Principal Component green line: 1st principal component, minimize distances to all points Is this the same as linear regression?

38 1st Principal Component Ad Spending nd Principal Component Population st Principal Component green line: 1st principal component, minimize distances to all points Is this the same as linear regression? No, like total least squares.

39 2nd Principal Component Ad Spending Population 2nd Principal Component: Orthogonal to 1st component, largest variance Z 2 = (pop pop) (ad ad)

40 1st Principal Component Population Ad Spending st Principal Component st Principal Component Population Ad Spending nd Principal Component nd Principal Component

41 Properties of PCA No more principal components than features Principal components are perpendicular Principal components are eigenvalues of X X Assumes normality, can break with heavy tails PCA depends on the scale of features

42 Principal Component Regression 1. Use PCA to reduce features to a small number of principal components 2. Fit regression using principal components Mean Squared Error Mean Squared Error Squared Bias Test MSE Variance Number of Components Number of Components

43 PCR vs Ridge Regression & Lasso PCR Ridge Regression and Lasso Mean Squared Error Squared Bias Test MSE Variance Mean Squared Error Number of Components Shrinkage Factor PCR selects combinations of all features (not feature selection) PCR is closely related to ridge regression

44 PCR Application Standardized Coefficients Income Limit Rating Student Cross Validation MSE Number of Components Number of Components

45 Standardizing Features Regularization and PCR depend on scales of features Good practice is to standardize features to have same variance x ij = 1 n x ij n i=1 (x ij x j ) 2 Do not standardize features when they have the same units PCA needs mean-centered features x ij = x ij x j

46 Partial Least Squares Supervised version of PCR Ad Spending Population

47 High-dimensional Data 1. Predict blood pressure from DNA: n = 200, p = Predicting user behavior online: n = , p =

48 Problem With High Dimensions Computational complexity Overfitting is a problem Y Y X X

49 Overfitting with Many Variables R Training MSE Test MSE Number of Variables Number of Variables Number of Variables

50 Interpreting Feature Selection 1. Solutions may not be unique 2. Must be careful about how we report solutions 3. Just because one combination of features predicts well, does not mean others will not

Linear Model Selection and Regularization

Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In