Biostatistics Lecture 16: Model Selection. Ruibin Xi, Peking University, School of Mathematical Sciences


Motivating Example 1
Interested in factors related to life expectancy (50 US states, 1969-71):
- Per capita income (1974)
- Illiteracy (1970, percent of population)
- Murder rate per 100,000 population
- Percent high-school graduates
- Mean number of days with minimum temperature below 30 degrees
- Land area in square miles

Motivating Example 2
The role of microRNA in regulating gene expression
Response: standard deviation of a gene's expression
Covariates:
- mean gene expression
- length of the 3'-UTR
- number of microRNA targets in the 3'-UTR
- mean target score of the microRNA targets
- number of common SNPs in the 3'-UTR
- mean minor allele frequency of the common SNPs in the 3'-UTR

Motivating Example 2: correlation matrix of the response and covariates (columns in the same order as the rows)

                          ExprSD    ExprMean  GeneLen   3'UTRLen  #Targets  TgtScore  #SNPs     MeanMAF
Expression SD             1         0.952031  0.084030  0.068084  0.146080  0.135781  0.035820  0.033893
Expression Mean           0.952031  1         0.157620  0.120004  0.165147  0.156963  0.055338  0.051066
Gene Length               0.084030  0.157620  1         0.471435  0.406605  0.311641  0.216913  0.173899
Length of 3'UTR           0.068084  0.120004  0.471435  1         0.246593  0.206723  0.227899  0.197424
Number of miRNA targets   0.146080  0.165147  0.406605  0.246593  1         0.849602  0.185446  0.142214
Mean target score         0.135781  0.156963  0.311641  0.206723  0.849602  1         0.151916  0.128167
Number of SNPs            0.035820  0.055338  0.216913  0.227899  0.185446  0.151916  1         0.947236
Mean MAF                  0.033893  0.051066  0.173899  0.197424  0.142214  0.128167  0.947236  1

Motivating Example 3
Communities and Crime
Response: total number of violent crimes per 100K population
Covariates (128), including:
- population of the community
- percentage of the population that is Caucasian
- percentage of the population that is African American
- median household income
- per capita income for African Americans
- percentage of kids born to never married
- number of vacant households
- number of sworn full-time police officers
http://archive.ics.uci.edu/ml/datasets/communities+and+crime

Motivating Example 4
Motif finding
Response: a univariate response measuring binding intensity (ChIP-seq or ChIP-chip data)
Covariates (~200): abundance scores of candidate motifs

Motivating Example 5
Genome-wide association studies
Response: disease status (affected or not)
Covariates (~10^6): single nucleotide polymorphisms (SNPs)

Motivating Example 6
Expression quantitative trait loci (eQTL) studies
Response (~20,000): gene expression levels
Covariates (~10^6): SNPs

General framework

Stepwise selection: Backward Elimination

Stepwise selection: Forward Selection

Drawbacks of stepwise selection
- One-at-a-time addition or deletion of variables may miss the optimal model
- P-values of the remaining predictors tend to be overstated (multiple testing)
- The selected model tends to be smaller than desirable for prediction purposes
- A variable not in the model may still be correlated with the response

Stepwise selection: an example
Interested in factors related to life expectancy (50 US states, 1969-71), as in Motivating Example 1:
- Per capita income (1974)
- Illiteracy (1970, percent of population)
- Murder rate per 100,000 population
- Percent high-school graduates
- Mean number of days with minimum temperature below 30 degrees
- Land area in square miles
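To make the stepwise procedures concrete, here is a minimal R sketch of backward elimination and forward selection with step(). It assumes the life-expectancy example corresponds to R's built-in state.x77 data, which contains the same variables; treat it as an illustration rather than the lecture's own code.

```r
## Backward and forward stepwise selection on the state.x77 life-expectancy data
states <- as.data.frame(state.x77)
names(states) <- make.names(names(states))   # e.g. "Life Exp" -> "Life.Exp"

full <- lm(Life.Exp ~ Income + Illiteracy + Murder + HS.Grad + Frost + Area,
           data = states)

## Backward elimination starting from the full model (AIC criterion by default)
backward <- step(full, direction = "backward", trace = FALSE)

## Forward selection starting from the intercept-only model
null    <- lm(Life.Exp ~ 1, data = states)
forward <- step(null, scope = formula(full), direction = "forward", trace = FALSE)

summary(backward)
```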

Bias, Variance, and Model Complexity
The bias-variance trade-off again.
Generalization: test-sample versus training-sample performance.
On the training data, performance usually increases monotonically with model complexity.

Measuring Performance
Target variable Y, vector of inputs X, prediction model \hat{f}(X).
Typical choices of loss function:
L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2 (squared error) or L(Y, \hat{f}(X)) = |Y - \hat{f}(X)| (absolute error).

Generalization Error
Test error (generalization error): Err = E[L(Y, \hat{f}(X))].
Note: this expectation averages over everything that is random, including the randomness in the training sample that produced \hat{f}.
Training error: \overline{err} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i)), the average loss over the training sample.
The training error is not a good estimate of the test error (next slide).

Training Error
The training error is not a good estimate of the test error: it consistently decreases with model complexity and drops to zero once the model is complex enough (overfitting).

Two separate goals
Model selection: estimating the performance of different models in order to choose the (approximately) best one.
Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
Ideal situation: split the data into three parts for training, validation (estimate prediction error and select a model), and testing (assess the chosen model).
Typical split: 50% / 25% / 25%.
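The three-way split can be coded in a few lines; the following sketch uses a simulated data frame purely to show the mechanics of the 50%/25%/25% split.

```r
## Random 50% / 25% / 25% split into training, validation and test sets
set.seed(1)
dat <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))  # illustrative data

n   <- nrow(dat)
grp <- sample(cut(seq_len(n), breaks = c(0, 0.50, 0.75, 1.00) * n,
                  labels = c("train", "valid", "test")))

train <- dat[grp == "train", ]   # fit the candidate models
valid <- dat[grp == "valid", ]   # estimate prediction error, select a model
test  <- dat[grp == "test",  ]   # assess the final chosen model, once
table(grp)
```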

Bias-Variance Decomposition
Assume Y = f(X) + \varepsilon with E(\varepsilon) = 0 and Var(\varepsilon) = \sigma_\varepsilon^2. Then for an input point X = x_0, using squared-error loss and a regression fit \hat{f}:
Err(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0]
         = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2
         = irreducible error + bias^2 + variance.
Irreducible error: variance of the target around its true mean. Bias^2: amount by which the average estimate differs from the true mean. Variance: expected squared deviation of \hat{f}(x_0) around its mean.

Bias-Variance Decomposition
Linear model fit \hat{f}(x) = \hat{\beta}^T x:
Err(x_0) = \sigma_\varepsilon^2 + [f(x_0) - E\hat{f}(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2,
where h(x_0) is the vector of linear weights producing the fit \hat{f}(x_0) = h(x_0)^T y. Averaging over the sample values x_i:
\frac{1}{N}\sum_{i=1}^{N} Err(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N} [f(x_i) - E\hat{f}(x_i)]^2 + \frac{p}{N}\sigma_\varepsilon^2,
the in-sample error. Model complexity is directly related to the number of parameters p.

Bias-Variance Decomposition (figure taken from The Elements of Statistical Learning)

Bias-Variance Decomposition: Example
50 observations, 20 predictors, uniformly distributed on [0,1]^{20}.
Left panels: Y is 0 if X_1 \le 1/2 and 1 if X_1 > 1/2, and we apply k-nearest neighbors.
Right panels: Y is 1 if \sum_{j=1}^{10} X_j > 5 and 0 otherwise, and we use best subset linear regression of size p.

Example, continued (figure: prediction error, squared bias, and variance for regression with squared-error loss and for classification with 0-1 loss)

Optimism of the Training Error Rate
Typically the training error rate is smaller than the true error, because the same data are used to fit the method and to assess its error:
\overline{err} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i)) < Err = E[L(Y, \hat{f}(X))],
so the training error is overly optimistic.

Optimism of the Training Error Rate
Err is a kind of extra-sample error: the test features need not coincide with the training feature vectors. Focus instead on the in-sample error:
Err_{in} = \frac{1}{N}\sum_{i=1}^{N} E_{Y} E_{Y^{new}}[L(Y_i^{new}, \hat{f}(x_i))],
i.e., we observe N new response values at each of the training points x_i, i = 1, ..., N.
Define the optimism op \equiv Err_{in} - E_y(\overline{err}). For squared error, 0-1, and other loss functions:
op = \frac{2}{N}\sum_{i=1}^{N} Cov(\hat{y}_i, y_i).

Optimism of the Training Error Rate
Summary: Err_{in} = E_y(\overline{err}) + \frac{2}{N}\sum_{i=1}^{N} Cov(\hat{y}_i, y_i).
The harder we fit the data, the greater Cov(\hat{y}_i, y_i) will be, thereby increasing the optimism.
For a linear fit with d independent covariates (basis functions): Err_{in} = E_y(\overline{err}) + 2\frac{d}{N}\sigma_\varepsilon^2.
The optimism increases linearly with the number d of covariates and decreases as the training sample size N increases.
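The linear-fit result can be checked by simulation. The sketch below (not from the lecture) repeatedly draws new responses at the same training inputs and compares the average gap between in-sample and training error with the theoretical optimism 2dσ²/N, where d counts the intercept as well.

```r
## Simulation check: optimism of a linear fit is about 2 * d * sigma^2 / N
set.seed(1)
N <- 100; d <- 10; sigma <- 1
X    <- matrix(rnorm(N * d), N, d)
beta <- rnorm(d)
f    <- X %*% beta                       # true mean at the training inputs

B <- 2000
opt <- replicate(B, {
  y    <- f + sigma * rnorm(N)           # training responses
  yhat <- fitted(lm(y ~ X))
  ynew <- f + sigma * rnorm(N)           # new responses at the same inputs
  mean((ynew - yhat)^2) - mean((y - yhat)^2)   # in-sample error minus training error
})
mean(opt)                   # should be close to the theoretical value below
2 * (d + 1) * sigma^2 / N   # theoretical optimism (d slopes plus the intercept)
```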

Optimism of the Training Error Rate
Ways to estimate prediction error:
- Estimate the optimism and add it to the training error rate. AIC, BIC, and related criteria work this way, for a special class of estimates that are linear in their parameters.
- Direct estimates of the test error: cross-validation, bootstrap. These can be used with any loss function and with nonlinear, adaptive fitting techniques.

Estimates of In-Sample Prediction Error
General form of the in-sample estimate: \widehat{Err}_{in} = \overline{err} + \widehat{op}, the training error plus an estimate of the optimism.
For a linear fit with Err_{in} = E_y(\overline{err}) + 2\frac{d}{N}\sigma_\varepsilon^2:
C_p = \overline{err} + 2\frac{d}{N}\hat{\sigma}_\varepsilon^2, the so-called C_p statistic,
where \hat{\sigma}_\varepsilon^2 is an estimate of the noise variance (from the mean squared error of a low-bias model), d is the number of basis functions, and N is the training sample size.
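A minimal R sketch of the C_p statistic, using the state.x77 life-expectancy data as a stand-in example and estimating the noise variance from the full (low-bias) model:

```r
## C_p = training error + 2 * d / N * sigma2_hat
Cp <- function(fit, sigma2_hat) {
  N   <- length(residuals(fit))
  d   <- length(coef(fit))            # number of parameters, including the intercept
  err <- mean(residuals(fit)^2)       # training error
  err + 2 * d / N * sigma2_hat
}

states <- as.data.frame(state.x77); names(states) <- make.names(names(states))
full <- lm(Life.Exp ~ Income + Illiteracy + Murder + HS.Grad + Frost + Area, data = states)
sub  <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = states)

sigma2_hat <- summary(full)$sigma^2   # noise variance from the low-bias (full) model
c(full = Cp(full, sigma2_hat), sub = Cp(sub, sigma2_hat))
```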

Estimates of In-Sample Prediction Error
Similarly, the Akaike Information Criterion (AIC) is a more generally applicable estimate of Err_{in} when a log-likelihood loss function is used. As N \to \infty:
-2 E[\log Pr_{\hat{\theta}}(Y)] \approx -\frac{2}{N} E[\mathrm{loglik}] + 2\frac{d}{N},
where Pr_\theta(Y) is a family of densities for Y (containing the true density), \hat{\theta} is the maximum-likelihood estimate of \theta, and \mathrm{loglik} = \sum_{i=1}^{N}\log Pr_{\hat{\theta}}(y_i) is the maximized log-likelihood.

AIC
For example, for the logistic regression model, using the binomial log-likelihood:
AIC = -\frac{2}{N}\mathrm{loglik} + 2\frac{d}{N}.
To use AIC for model selection, choose the model giving the smallest AIC over the set of models considered. Given a set of models f_\alpha(x) indexed by a tuning parameter \alpha, with training error \overline{err}(\alpha) and d(\alpha) parameters:
AIC(\alpha) = \overline{err}(\alpha) + 2\frac{d(\alpha)}{N}\hat{\sigma}_\varepsilon^2.
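In R, AIC() returns the criterion on the -2·loglik + 2d scale, so model selection amounts to picking the smallest value; the sketch below again uses the state.x77 data only as an illustrative stand-in for the lecture's example.

```r
## AIC-based selection among a few candidate linear models
states <- as.data.frame(state.x77); names(states) <- make.names(names(states))
candidates <- list(
  m1 = lm(Life.Exp ~ Murder,                            data = states),
  m2 = lm(Life.Exp ~ Murder + HS.Grad,                  data = states),
  m3 = lm(Life.Exp ~ Murder + HS.Grad + Frost,          data = states),
  m4 = lm(Life.Exp ~ Murder + HS.Grad + Frost + Income, data = states)
)
aics <- sapply(candidates, AIC)
aics
names(candidates)[which.min(aics)]   # the selected model
```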

Effective Number of Parameters
Let y = (y_1, ..., y_N)^T be the vector of outcomes, and \hat{y} the vector of predictions. For a linear fit (e.g., linear regression, quadratic shrinkage such as ridge regression, splines):
\hat{y} = S y,
where S is an N \times N matrix that depends on the input vectors x_i but not on y. The effective number of parameters is
d(S) = \mathrm{trace}(S)
(cf. \sum_i Cov(\hat{y}_i, y_i)); d(S) is the correct d to use in C_p = \overline{err} + 2\frac{d}{N}\hat{\sigma}_\varepsilon^2.
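For ridge regression the smoother matrix is S = X(X'X + λI)^{-1}X', so the effective number of parameters can be computed directly as its trace; a small sketch (state.x77 again used purely as an example):

```r
## Effective degrees of freedom d(S) = trace(S) for ridge regression
eff_df_ridge <- function(X, lambda) {
  p <- ncol(X)
  S <- X %*% solve(crossprod(X) + lambda * diag(p), t(X))   # X (X'X + lambda I)^{-1} X'
  sum(diag(S))                                              # trace of the smoother matrix
}

states <- as.data.frame(state.x77); names(states) <- make.names(names(states))
X <- scale(as.matrix(states[, c("Income", "Illiteracy", "Murder",
                                "HS.Grad", "Frost", "Area")]))
eff_df_ridge(X, lambda = 0)     # 6, the usual parameter count (no shrinkage)
eff_df_ridge(X, lambda = 50)    # < 6: shrinkage reduces the effective degrees of freedom
```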

Bayesian Approach and BIC
Like AIC, BIC is used when fitting is carried out by maximizing a log-likelihood. The Bayesian Information Criterion (BIC) is
BIC = -2\,\mathrm{loglik} + (\log N)\, d.
Assuming a Gaussian model with known variance \sigma_\varepsilon^2, -2\,\mathrm{loglik} = \sum_i (y_i - \hat{f}(x_i))^2 / \sigma_\varepsilon^2, so
BIC = \frac{N}{\sigma_\varepsilon^2}\left[\overline{err} + (\log N)\frac{d}{N}\sigma_\varepsilon^2\right].
BIC is proportional to AIC, with the factor 2 replaced by \log N. For N > e^2 (approximately 7.4), BIC penalizes complex models more heavily.

BIC: Motivation
Given a set of candidate models M_m, m = 1, ..., M, with model parameters \theta_m, the posterior probability of a given model is
Pr(M_m \mid Z) \propto Pr(M_m)\, Pr(Z \mid M_m),
where Z = \{x_i, y_i\}_{i=1}^{N} represents the training data. To compare two models, form the posterior odds:
\frac{Pr(M_m \mid Z)}{Pr(M_l \mid Z)} = \frac{Pr(M_m)}{Pr(M_l)} \cdot \frac{Pr(Z \mid M_m)}{Pr(Z \mid M_l)}.
If the odds exceed 1, choose model m. The prior over models (the first factor) is typically taken to be constant; the second factor, the contribution of the data Z to the posterior odds, is called the Bayes factor BF(Z). We need to approximate Pr(Z \mid M_m); BIC provides such an approximation, so we can estimate the posterior from BIC and compare the relative merits of the models.
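Since BIC approximates -2 log Pr(Z | M_m) up to constants, approximate posterior model probabilities under equal prior weights are exp(-BIC_m/2) / Σ_l exp(-BIC_l/2). A short sketch, with illustrative candidate models fitted to state.x77:

```r
## Approximate posterior model probabilities from BIC (equal prior model weights)
states <- as.data.frame(state.x77); names(states) <- make.names(names(states))
candidates <- list(
  m1 = lm(Life.Exp ~ Murder,                            data = states),
  m2 = lm(Life.Exp ~ Murder + HS.Grad,                  data = states),
  m3 = lm(Life.Exp ~ Murder + HS.Grad + Frost + Income, data = states)
)
bics <- sapply(candidates, BIC)
post <- exp(-(bics - min(bics)) / 2)   # exp(-BIC/2), shifted for numerical stability
round(post / sum(post), 3)             # approximate Pr(M_m | Z)
```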

General framework

Penalty-based methods

The Lasso

The Lasso and the Ridge Regression

Relationship with Bayesian methods

Lasso for orthogonal design
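Although the slide's derivation is not reproduced in this transcript, the key fact for an orthonormal design is that the lasso solution soft-thresholds the least-squares coefficients, while ridge regression shrinks them proportionally. A minimal sketch (the exact threshold depends on how the penalty is parameterized):

```r
## Lasso under an orthonormal design (X'X = I): soft-thresholding of the OLS estimate
## beta_lasso_j = sign(beta_ols_j) * max(|beta_ols_j| - lambda, 0)
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

soft_threshold(c(-2.0, -0.3, 0.1, 1.5), lambda = 0.5)
## compare: ridge under the same design shrinks proportionally, beta_ols / (1 + lambda)
```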

Estimation of regression coefficients

Asymptotic Results: preview

Asymptotic Results

Parameter Tuning
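In practice the penalty parameter λ is usually tuned by cross-validation; a minimal sketch with cv.glmnet on simulated data (the data and the choice of 10 folds are illustrative assumptions, not the lecture's example):

```r
## Tuning the lasso penalty lambda by 10-fold cross-validation
library(glmnet)

set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # alpha = 1: lasso penalty
cvfit$lambda.min                                   # lambda minimizing the CV error
cvfit$lambda.1se                                   # largest lambda within 1 SE of the minimum
coef(cvfit, s = "lambda.1se")                      # sparse coefficient vector at that lambda
```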

Adaptive Lasso
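A common way to implement the adaptive lasso is to pass weights 1/|β̃_j|^γ from an initial fit to glmnet's penalty.factor argument; the sketch below (an assumed implementation choice, with x and y as in the cross-validation sketch above) uses a ridge fit as the initial estimator.

```r
## Adaptive lasso via glmnet's penalty.factor: weights from an initial ridge fit
library(glmnet)

gamma <- 1
init  <- cv.glmnet(x, y, alpha = 0)                      # initial ridge estimator
b0    <- as.numeric(coef(init, s = "lambda.min"))[-1]    # drop the intercept
w     <- 1 / (abs(b0)^gamma + 1e-8)                      # small constant avoids division by zero

afit  <- cv.glmnet(x, y, alpha = 1, penalty.factor = w)  # weighted L1 penalty
coef(afit, s = "lambda.min")
```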

KKT conditions and Computation

Coordinate descent algorithm for computation
For linear regression, the glmnet R package implements coordinate descent for the penalized fit.
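A minimal glmnet call for the linear-regression lasso, reusing x and y from the sketches above; glmnet computes the entire coefficient path over a grid of λ values by coordinate descent.

```r
## Lasso path for linear regression with glmnet
library(glmnet)

fit <- glmnet(x, y, family = "gaussian", alpha = 1)  # alpha = 1: lasso
plot(fit, xvar = "lambda", label = TRUE)             # coefficient paths versus log(lambda)
coef(fit, s = 0.1)                                   # coefficients at lambda = 0.1
```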

Least Angle Regression (LAR)
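The LAR path can be computed with the lars package; since the lecture does not name an implementation, the call below is only an assumed choice (x and y as above).

```r
## Least angle regression with the lars package
library(lars)

larfit <- lars(x, y, type = "lar")   # type = "lasso" instead gives the lasso path
plot(larfit)                         # piecewise-linear coefficient paths
coef(larfit)                         # coefficients at each step along the path
```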

Generalization of Lasso and Ridge regression

The Dantzig Selector

The grouped Lasso
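As a sketch of the grouped lasso in practice, the grpreg package (one assumed implementation; gglasso is another) penalizes coefficients group-wise so that variables enter or leave the model in groups. The grouping below is hypothetical.

```r
## Grouped lasso with grpreg: variables enter or leave the model by group
library(grpreg)

set.seed(1)
n  <- 100
xg <- matrix(rnorm(n * 6), n, 6)
yg <- xg[, 1] + xg[, 2] + rnorm(n)
group <- c(1, 1, 2, 2, 2, 3)                  # columns 1-2, 3-5, 6 form three groups

gfit <- grpreg(xg, yg, group = group, penalty = "grLasso")
plot(gfit)                                    # grouped coefficient paths
coef(gfit, lambda = 0.05)                     # coefficients at a chosen lambda
```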

Sure Independence Screening
In linear regression: assume the true model is sparse; we need to show the sure screening property, i.e., that the retained set of predictors contains the true model with probability tending to one.

Sure Independence Screening
Rationale of correlation learning: when p > n, the OLS estimator has to be defined through the Moore-Penrose generalized inverse, while the ridge regression estimator, suitably scaled, approaches the vector of marginal correlations as the ridge parameter grows.
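Marginal correlation screening itself is only a few lines of code; the sketch below ranks predictors by |corr(x_j, y)| and keeps roughly n/log n of them, as suggested by Fan and Lv (the data are simulated for illustration).

```r
## Sure independence screening by marginal correlation
sis_screen <- function(x, y, d = floor(nrow(x) / log(nrow(x)))) {
  w <- abs(drop(cor(x, y)))                        # componentwise marginal correlations
  sort(order(w, decreasing = TRUE)[seq_len(d)])    # indices of the retained predictors
}

set.seed(1)
n <- 100; p <- 1000
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 1.5 * x[, 2] + x[, 3] + rnorm(n)

keep <- sis_screen(x, y)
head(keep)      # the active predictors 1, 2, 3 should typically be retained
## a penalized method (lasso, SCAD, ...) is then fitted on the reduced set x[, keep]
```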

Sure Independence Screening
Iteratively thresholded ridge regression screener (ITRRS)

Sure Independence Screening

Sure Independence Screening
s = 8, 18

Feature Screening by Distance Correlation
Székely, Rizzo and Bakirov (2007) proposed the distance correlation, which is defined through the characteristic functions of two random vectors.

Feature Screening by Distance Correlation
For two normal random variables, the distance correlation is a strictly increasing function of |ρ|.

Feature Screening by Distance Correlation

Feature Screening by Distance Correlation
Li et al. (2012), JASA.
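A sketch of distance-correlation screening in the spirit of Li et al. (2012), computing marginal distance correlations with dcor() from the energy package (an assumed implementation choice; x and y as in the marginal-correlation screening sketch above):

```r
## Distance-correlation screening: rank predictors by marginal distance correlation
library(energy)

dc   <- apply(x, 2, function(xj) dcor(xj, y))        # marginal distance correlations
keep <- order(dc, decreasing = TRUE)[1:floor(nrow(x) / log(nrow(x)))]
head(sort(keep))                                     # indices of the retained predictors
```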

Feature Screening by distance correlation

Bayesian Methods
See Blackboard.