Data analysis strategies for high dimensional social science data. M3 Conference, May 2013. W. Holmes Finch, Maria Hernández Finch, David E. McIntosh, & Lauren E. Moss, Ball State University

High dimensional data. High dimensional data refers to the situation in which the number of variables of interest, p, is nearly as large as, or larger than, the sample size, N. Analysis of high dimensional data is common in genomics and other areas of medical research. High dimensional data may also occur in educational and psychological research: specialized groups of interest, such as individuals with autism or children of migrant workers, may be very rare in the population and/or difficult to identify.

High dimensional data. High dimensional data causes a number of analytic problems for common statistical models, such as regression. These problems include overfitting, so that predictions for individuals not in the original sample are inaccurate, as well as bias in the estimated model coefficients and standard errors. Given these problems, researchers faced with high dimensional data must find alternative modeling methods.

Linear regression (OLS). Researchers often use linear regression models to relate one or more independent variables (IVs) to a single dependent variable (DV): $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_P x_{Pi} + \varepsilon_i$. Model parameter estimates are obtained by minimizing the least squares criterion: $E^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} \big(y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_P x_{Pi})\big)^2$.
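
For concreteness, here is a minimal Python sketch of fitting an OLS model and evaluating the least squares criterion. The data, sizes, and variable names are synthetic and purely illustrative, not the study data.

```python
# Minimal OLS sketch on synthetic data (sizes and names are illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, P = 20, 6                                   # small N relative to P, as in the studies below
X = rng.normal(size=(N, P))                    # independent variables (IVs)
y = X @ rng.normal(size=P) + rng.normal(scale=0.5, size=N)  # dependent variable (DV)

ols = LinearRegression().fit(X, y)             # minimizes the least squares criterion E^2
E2 = np.sum((y - ols.predict(X)) ** 2)         # E^2 = sum_i (y_i - yhat_i)^2
print(ols.intercept_, ols.coef_, E2)
```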

OLS. When p approaches N, several problems can result:
1. Overfitting of the model to the sample
2. Collinearity among the independent variables
3. Inability to obtain an admissible solution

Ridge Regression (RR). RR is designed to overcome the problems of overfitting and collinearity by imposing a penalty on the estimates of $\beta_j$, in a process known as shrinkage. Parameters are estimated by minimizing the penalized residual sum of squares: $\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{P} \beta_j^2$.

RR. $\lambda$ is a shrinkage parameter: larger values produce greater shrinkage of the regression coefficients. The shrinkage penalty minimizes the likelihood of the extreme $\beta_j$ estimates often associated with highly collinear data. This is an important property in high dimensional data, where collinearity can be a major problem.

RR. The selection of $\lambda$ uses jackknife (leave-one-out) cross validation. A fitted value of $y_i$ is obtained for each individual while that individual is excluded from the model estimation stage, and the squared difference between the model-predicted and actual values of $y_i$ is calculated. The optimal value of $\lambda$ is the one that minimizes the mean square error across all members of the sample.
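
A possible sketch of this procedure with scikit-learn, again on synthetic data. Note that scikit-learn calls the shrinkage parameter "alpha", and RidgeCV uses an efficient leave-one-out scheme by default.

```python
# Ridge regression with the shrinkage parameter chosen by leave-one-out cross-validation.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=20)

alphas = np.logspace(-3, 3, 50)                # candidate shrinkage values
rr = RidgeCV(alphas=alphas).fit(X, y)          # efficient LOO cross-validation by default
print(rr.alpha_)                               # selected shrinkage value
print(rr.coef_)                                # shrunken coefficient estimates
```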

Least Absolute Shrinkage and Selection Operator (LASSO). The LASSO is similar to RR in that it involves shrinkage of the model parameter estimates. Model parameters are estimated in the LASSO by minimizing: $\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{P} |\beta_j|$.

LASSO. In practice, the LASSO shrinks the OLS coefficient estimates by a constant amount governed by $\lambda$ and frequently behaves like a best subsets regression, selecting variables for inclusion in or exclusion from the analysis. Selection of $\lambda$ is carried out in the same manner as for RR.
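
A matching sketch for the LASSO, with $\lambda$ chosen by leave-one-out cross-validation (synthetic data; LassoCV again names the penalty "alpha"):

```python
# LASSO with the shrinkage parameter chosen by leave-one-out cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=20)

lasso = LassoCV(cv=LeaveOneOut(), n_alphas=100).fit(X, y)
print(lasso.alpha_)                            # selected shrinkage value
print(lasso.coef_)                             # some coefficients are shrunk exactly to zero
```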

Partial Least Squares (PLS). PLS is a data reduction technique in which the p IVs are reduced to m linear combinations. PLS involves the following steps:
1. Compute $\phi_{1j}$ for each IV $x_j$. This value is equivalent to $\beta_1$ from $y_i = \beta_0 + \beta_1 x_{ji} + \varepsilon_i$
2. Compute $z_1 = \sum_{j=1}^{J} \phi_{1j} x_j$
3. Fit the model $y_i = \theta_1 z_{1i}$
4. Orthogonalize each $x_j$ with respect to $z_1$
5. Obtain $\phi_{2j}$ and compute $z_2 = \sum_{j=1}^{J} \phi_{2j} x_j$, such that $z_2$ is orthogonal to $z_1$
6. Repeat to obtain $z_1$ through $z_m$, all of which are orthogonal

PLS In order to determine the number of linear combinations to estimate in PLS, we rely on jackknife cross validation, as described for RR. The minimum number of linear combinations, m, that reduces the jackknife mean square error across all members of the sample below a given threshold serves as the stopping point for PLS.
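
A possible sketch of this selection step using scikit-learn's PLSRegression on synthetic data. For simplicity the rule here takes the m with the smallest jackknife MSE rather than the first m below a threshold, which is an assumption of the sketch.

```python
# Partial least squares: choose the number of components m by leave-one-out mean square error.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=20)

def loo_mse(m):
    """Jackknife (leave-one-out) mean square error for a PLS model with m components."""
    scores = cross_val_score(PLSRegression(n_components=m), X, y,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    return -scores.mean()

mse_by_m = {m: loo_mse(m) for m in range(1, 5)}
best_m = min(mse_by_m, key=mse_by_m.get)
print(best_m, mse_by_m)
```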

Supervised Principal Components (SPC). SPC is based on principal components regression, in which the p IVs are reduced to a small number of linear combinations that account for most of their variance. However, unlike principal components regression, which maximizes only the IV variance accounted for, SPC creates the linear combinations so as to maximize both the IV variance and the covariance of the IVs with the DV.

SPC. The SPC algorithm involves the following steps (see the sketch below):
1. Estimate $y_i = \beta_0 + \beta_j x_{ji}$ for each IV
2. Compare $\beta_j$ to a predetermined threshold, t
3. Retain $x_j$ only if $\beta_j$ exceeds t
4. Compute the first m principal components, $z_1, z_2, \ldots, z_m$, using only the retained IVs
5. Use jackknife cross-validation to determine the optimal values of t and m
6. Estimate the model $y_i = \beta_0 + \beta_1 z_{1i} + \beta_2 z_{2i} + \cdots + \beta_m z_{mi}$
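
An illustrative sketch of these steps on synthetic data. The threshold t (here the median absolute univariate coefficient) and m = 1 are assumptions made for the sketch only; the talk tunes both t and m by jackknife cross-validation.

```python
# Supervised principal components sketch: univariate screening against a threshold t,
# principal components of the retained IVs, then regression of the DV on those components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=20)

# Step 1: univariate coefficient beta_j from regressing y on each x_j separately
beta = np.array([LinearRegression().fit(X[:, [j]], y).coef_[0] for j in range(X.shape[1])])

# Steps 2-3: retain x_j only if |beta_j| exceeds the threshold t (illustrative choice of t)
t = np.median(np.abs(beta))
keep = np.abs(beta) > t

# Steps 4 and 6: first m principal components of the retained IVs, then the final regression
m = 1
Z = PCA(n_components=m).fit_transform(X[:, keep])
spc = LinearRegression().fit(Z, y)
print(keep, spc.coef_)
```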

Random Forest (RF). RF relies on the recursive partitioning algorithm used in classification and regression trees (CART). In CART, the data are repeatedly divided into ever smaller groups based on values of the IVs, with each division minimizing, at that step, within-node heterogeneity in terms of the DV.

Sample CART tree. [Tree diagram: a single split on PASS_ENG < 4, with predicted values of 22.00 and 28.43 in the two resulting nodes]

RF. While CART has been shown to be a very effective tool for both prediction and explanation, it has a tendency to overfit the data. One solution to the overfitting problem is to grow a large number of trees from the same sample using the bootstrap technique.

RF. In RF, a large number B of bootstrap samples of size N are drawn with replacement, and a random subset of the predictor variables is considered at each split. For each bootstrap sample a CART tree is grown, and predicted values are obtained for each individual in the sample. Prediction results for each individual are then averaged across the B trees to obtain a single predicted value for the entire RF.
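
A minimal random forest sketch on synthetic data; scikit-learn's RandomForestRegressor handles the bootstrap samples, the random predictor subsets, and the averaging across trees.

```python
# Random forest sketch: B bootstrapped CART trees whose predictions are averaged.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=20)

rf = RandomForestRegressor(n_estimators=1000,      # B bootstrap samples / trees
                           max_features="sqrt",    # random subset of predictors per split
                           oob_score=True,         # out-of-bag R^2 for the whole forest
                           random_state=0).fit(X, y)
print(rf.oob_score_)                               # out-of-bag R^2
print(rf.predict(X[:3]))                           # predictions averaged across the B trees
```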

RF. Variable importance in RF can be measured using a permutation statistic, calculated as follows:
1. Each time $x_j$ is used in a split, calculate the mean square error for $y_i$
2. Permute the values of $x_j$ randomly so that any relationship with $y_i$ is removed
3. Calculate the mean square error for $y_i$ with the permuted $x_j$, based on the occasions when it is used
4. Repeat steps 2 and 3 a large number (e.g., 1000) of times
Variables with larger values of the permutation statistic are deemed more important, as they have a greater impact on prediction accuracy. A sketch of a permutation-importance calculation follows.
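
The sketch below uses scikit-learn's permutation_importance, which permutes each predictor and measures the increase in MSE over the whole forest; this is a close relative of, not identical to, the split-level statistic described above. Data and parameters are illustrative.

```python
# Permutation-based variable importance for a random forest (scikit-learn stand-in).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=20)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=100,          # number of random permutations
                             scoring="neg_mean_squared_error", random_state=0)
for j in np.argsort(imp.importances_mean)[::-1]:               # most important first
    print(f"x{j}: mean increase in MSE when permuted = {imp.importances_mean[j]:.3f}")
```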

Current data. Data from two studies were analyzed using OLS and the alternative methods.
Study 1: 14 Latino migrant students. The study focused on relating 6 measures of cognitive functioning, level of English language acquisition, and engagement in learning to an outcome measuring metacognition.
Study 2: 19 children with autism. The study focused on relating 13 measures of language and reading fluency to a standardized measure of reading aptitude.

STUDY 1: Migrant worker students

Results: LASSO. [Cross-validation plot: CV negative log likelihood versus log(lambda)]

Coefficients. OLS (* = VIF > 5), R² = 0.36; RR, R² = 0.36; LASSO.
Variable | OLS b / SE | p | RR b / SE | p | LASSO b
KBIT-2 verbal | 0.16 / 0.26 | 0.564 | 0.09 / 0.15 | 0.177 | 0.82
KBIT-2 nonverbal | 0.09 / 0.20 | 0.669 | 0.08 / 0.11 | 0.211 | 0.06
Active Engagement | -0.02 / 0.18 | 0.935 | 0.01 / 0.09 | 0.802 | 0.01
Passive Engagement | 0.02 / 0.09 | 0.873 | 0.00 / 0.03 | 0.929 | 0
CLIC English | 0.23 / 0.26 | 0.399 | 0.15 / 0.08 | 0.024 | 0.17
Time off task | -0.01 / 0.06 | 0.850 | -0.01 / 0.09 | 0.854 | -0.01

Random Forest. Average R² = 0.288. [Variable importance plot (%IncMSE) for MEP3.forest: MEP2.kbit_verb, MEP2.kbit_nverb, MEP2.pass_eng_a, MEP2.off_task, MEP2.act_eng_a, MEP2.clic_eng]

PLS. One statistically significant linear combination was found (b = 3.34, p = 0.03).
Variable | PLS coefficient (ϕ)
KBIT-2 verbal | 1.85
KBIT-2 nonverbal | 2.27
Active Engagement | 0.28
Passive Engagement | -0.01
CLIC English | 2.24
Time off task | -0.35

SPC. [Plot of likelihood ratio statistic versus threshold] The optimal threshold is near 0.35; selected variables must have coefficients > 0.35.

SPC. The first principal component was significantly (p = 0.055) related to the metacognition score (KMA reading); coefficient = 0.30.

SPC feature scores and importance.
Variable | Feature score (coefficient) | Importance | PLS coefficient (ϕ)
KBIT-2 verbal | 0.97 | 2957.64 | 1.85
KBIT-2 nonverbal | 0.93 | 2705.19 | 2.27
Active Engagement | 1.00 | 604.55 | 0.28
Passive Engagement | 0.33 | NA | -0.01
CLIC English | 0.06 | NA | 2.24
Time off task | 0.06 | NA | -0.35

STUDY 2: Children with autism

Results: LASSO. [Cross-validation plot: CV negative log likelihood versus log(lambda)]

Coefficients. OLS (* = VIF > 5), R² = 0.86; RR, R² = 0.69; LASSO.
Variable | OLS b / SE | p | RR b / SE | p | LASSO b
Word independent | 5.9* / 7.9 | 0.487 | 21.7 / 11.9 | 0.068 | 15.1
Word instructional | -6.1* / 5.3 | 0.301 | 18.2 / 11.8 | 0.122 | 9.1
Word frustration | 9.5* / 6.3 | 0.193 | 22.0 / 12.1 | 0.068 | 28.6
Passage independent | -9.7* / 21.6 | 0.673 | -10.5 / 10.1 | 0.298 | 0
Passage instructional | 2.1* / 29.1 | 0.673 | -11.0 / 8.4 | 0.189 | 0
Passage frustration | 40.0* / 31.9 | 0.266 | 14.1 / 9.8 | 0.150 | 1.4
Listening | -34.7* / 35.2 | 0.370 | 3.0 / 8.9 | 0.739 | 0
Main idea | 0.0* / 0.2 | 0.993 | -2.7 / 13.8 | 0.846 | 0
Details | 0.1* / 0.4 | 0.826 | -11.8 / 13.1 | 0.365 | 0
Sequence oral | 0.1* / 0.2 | 0.674 | -9.6 / 13.6 | 0.481 | -4.6
Sequence listening | 0.3 / 0.2 | 0.250 | 24.6 / 13.4 | 0.065 | 11.3
Cause/effect oral | 0.2 / 0.2 | 0.329 | 8.7 / 13.2 | 0.509 | 0
Cause/effect listening | 0.0* / 0.3 | 0.991 | -9.1 / 13.2 | 0.491 | 0

Random Forest. Mean R² = 0.227. [Variable importance plot (%IncMSE) for autism.forest.iri: iri1frustrationlevelwordlist, iri1independentlevelwordlist, iri1instructionallevelwordlist, iri1sequencepercenterrororal, iri1frustrationlevelpassagecomprehension, iri1mainideapercenterrorstotal, iri1sequencepercenterrorlistening, iri1listeningcomprehension, iri1detailpercenterrorstotal, iri1independentleveliripassagecomprehension, iri2causeandeffectpercenterrorslistening, iri1instructionallevelpassagecomprehension, iricauseandeffectpercenterrorsoral]

PLS. One statistically significant linear combination was found (b = 5.53, p = 0.002).
Variable | PLS coefficient (ϕ)
Word independent | 6.82
Word instructional | 6.91
Word frustration | 8.96
Passage independent | -3.75
Passage instructional | -5.74
Passage frustration | 0.96
Listening | -1.65
Main idea | -1.44
Details | -1.41
Sequence oral | -3.42
Sequence listening | 3.74
Cause/effect oral | -0.00
Cause/effect listening | -2.65

SPC. [Plot of likelihood ratio statistic versus threshold] The optimal threshold is near 0.2; selected variables must have coefficients > 0.2.

SPC. The first principal component was significantly (p = 0.033) related to reading aptitude; coefficient = 0.40.

SPC feature scores and importance.
Variable | Feature score (coefficient) | Importance | PLS coefficient (ϕ)
Word independent | 2.17 | 829.41 | 6.82
Word instructional | 1.95 | 663.81 | 6.91
Word frustration | 2.00 | 1202.84 | 8.96
Passage independent | 1.54 | 451.44 | -3.75
Passage instructional | 1.57 | 598.77 | -5.74
Passage frustration | 1.85 | 811.85 | 0.96
Listening | 1.62 | 885.11 | -1.65
Main idea | -0.21 | 188.94 | -1.44
Details | -0.23 | 158.74 | -1.41
Sequence oral | -0.26 | 109.92 | -3.42
Sequence listening | 0.07 | NA | 3.74
Cause/effect oral | -0.08 | NA | -0.00
Cause/effect listening | -0.19 | NA | -2.65

Conclusions. There are several alternatives available to researchers using high dimensional data. Some approaches are variants of standard OLS (RR, LASSO) that employ shrinkage parameters. Other methods involve data reduction (PLS, SPC). Yet another is nonparametric in nature (RF).

Conclusions. Results provided by the various methods may differ substantially, and researchers should select the method that best addresses their research problem. When the predictors represent different facets of a common construct, data reduction methods such as SPC and PLS may be optimal. When collinearity is a particular problem, the researcher may select RR or the LASSO. If the relationships between the predictors and the outcome are believed to be nonlinear, RF may be the best choice.