Bootstrapping, Randomization, 2B-PLS


Statistics, Tests, and Bootstrapping

Statistic: a measure that summarizes some feature of a set of data (e.g., mean, standard deviation, skew, coefficient of variation, regression slope, correlation, covariance, principal component, eigenvalue, F-value).

Statistical parameter: the value of a particular statistic for the entire population.

Statistical estimate: the value of a particular statistic for a sample of a population.

Statistical test: an assessment of how likely a null hypothesis is to be true given the data at hand, or how probable the data are under the null hypothesis given a random sample of the population.

Null hypothesis: usually the hypothesis that the estimated statistics from two or more samples result from two or more random samplings of the same population; in other words, the hypothesis that the samples do not come from different populations.

Confidence interval (or standard error): a metastatistic that expresses how closely a statistical estimate is likely to match the population parameter.

Bootstrapping and randomization tests

Used when the assumptions of ordinary (parametric) statistical tests are not met, or when it is not known whether they are met. These tests randomize the data with respect to the statistic being measured. The randomization is repeated a large number of times (e.g., 10,000) to generate a distribution of the randomized statistic, and the observed value from the real data is compared against this distribution to see whether it falls within the range of randomized values. Because the distribution is built from the data themselves, these tests automatically take biases, non-normality, etc. into account.

Types of randomization tests

Bootstrap: random resampling of the original data with replacement, recalculating the test statistic each time to determine its standard error.

Jackknife: like the bootstrap, but each individual data point is left out in turn and the test statistic is recalculated each time to determine its standard error.

Randomization (permutation): the original data are randomly shuffled, and the observed test statistic is compared to the randomized samples.

Monte Carlo: data are simulated under a particular hypothesis or model, and the real data are tested against the simulated data to see whether the model holds.
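As a concrete illustration of the jackknife entry above, here is a minimal Python sketch (not part of the course's Mathematica code) that computes a jackknife standard error for an arbitrary statistic:

```python
import numpy as np

def jackknife_se(data, statistic):
    """Jackknife standard error: leave each observation out in turn,
    recompute the statistic, and use the spread of the leave-one-out
    estimates to approximate the standard error."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

# For the mean, the jackknife SE reproduces the textbook s / sqrt(n).
se = jackknife_se([1, 2, 3, 4, 5], np.mean)
```

For the sample mean the jackknife estimate agrees exactly with the familiar s / sqrt(n); for statistics without a simple analytical standard error, the same leave-one-out loop applies unchanged.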

Example: test for a difference in means

1. Choose a statistic that describes the difference in means: D = Sqrt[(Mean[sample1] - Mean[sample2])^2], i.e., the absolute difference of the two means.
2. Pool the samples and randomly draw a new sample 1 and sample 2 with replacement.
3. Calculate D for the randomly drawn samples.
4. Repeat 10,000 times.
5. Compare the real D with the distribution of randomized D values.
6. The p-value is the proportion of randomized D values at least as large as the real D.
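The steps above can be sketched in Python (a hedged illustration; the course itself works in Mathematica, and the function name here is invented for the example):

```python
import numpy as np

def bootstrap_mean_diff_test(sample1, sample2, n_reps=10_000, seed=0):
    """Steps 1-6: pool the samples, repeatedly draw two new samples
    with replacement, recompute D, and compare the real D to the
    randomized distribution."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.asarray(sample1, float), np.asarray(sample2, float)
    observed = np.sqrt((s1.mean() - s2.mean()) ** 2)   # D for the real data
    pooled = np.concatenate([s1, s2])
    count = 0
    for _ in range(n_reps):
        r1 = rng.choice(pooled, size=len(s1), replace=True)
        r2 = rng.choice(pooled, size=len(s2), replace=True)
        if np.sqrt((r1.mean() - r2.mean()) ** 2) >= observed:
            count += 1
    return observed, count / n_reps   # D and its p-value

D, p = bootstrap_mean_diff_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
```

With two clearly separated samples, as above, very few randomized draws from the pooled data produce a difference as large as the observed one, so p is small; for two identical samples the observed D is 0 and p is 1.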

Figure: bootstrap replicates vs. theoretical normal density (Efron and Tibshirani, 1986).

Review

Eigenvalues: the variance on each PC axis (in Mathematica: Eigenvalues[CM]).
Eigenvectors: the loading of each original variable on each PC axis (in Mathematica: Eigenvectors[CM]).
Scores (= shape variables): the location of each data point on each PC axis (in Mathematica: PrincipalComponents[resids]).

Here resids are the residuals of the Procrustes coordinates and CM is the covariance matrix of the residuals.
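The same three quantities can be mirrored in Python with NumPy (a sketch, with simulated residuals standing in for real Procrustes data):

```python
import numpy as np

rng = np.random.default_rng(1)
resids = rng.normal(size=(30, 4))      # stand-in for Procrustes residuals

CM = np.cov(resids, rowvar=False)      # covariance matrix of the residuals
eigvals, eigvecs = np.linalg.eigh(CM)  # eigh handles the symmetric CM
order = np.argsort(eigvals)[::-1]      # sort PCs by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: location of each data point on each PC axis
scores = (resids - resids.mean(axis=0)) @ eigvecs
```

As a sanity check, the variance of the scores on PC 1 equals the first eigenvalue, and the eigenvalues sum to the total variance (the trace of CM).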

Statistical analysis: partitioning variance

In GMM, variance/covariance = variation in shape. The purpose of statistical analysis is to ascertain to what extent part of that variance is associated with a factor of interest, a.k.a. partitioning variance.

1. P-value: indicates whether the association is greater than expected by random chance.
2. Regression parameters (slopes, intercepts): indicate the axis in shape space associated with the factor; useful for modeling the aspect of shape associated with the factor.
3. Correlation coefficient (R): indicates the strength of the association between the variance and the factor.
4. Coefficient of determination (R^2): indicates the proportion of the variance that is associated with the factor.

Univariate linear regression

Y = aX + b + E

Linear regression of one Y variable onto one X variable; here we have regressed one principal component onto one explanatory variable.

Figure: PC 1 scores plotted against size, with a = 2.0, b = 0.5, R^2 = 0.85.

R^2 ranges from 0.0 (0% of the variance explained) to 1.0 (100% explained).
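A minimal Python sketch of such a regression, with simulated data chosen to resemble the slide's values (a = 2.0, b = 0.5; the data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
size = rng.uniform(0.6, 0.9, 50)                   # hypothetical size variable
pc1 = 2.0 * size + 0.5 + rng.normal(0, 0.03, 50)   # simulated PC 1 scores

a, b = np.polyfit(size, pc1, 1)                    # least-squares fit Y = aX + b
pred = a * size + b
r2 = 1 - np.sum((pc1 - pred) ** 2) / np.sum((pc1 - pc1.mean()) ** 2)
```

The recovered slope and intercept approximate the generating values, and R^2 falls between 0 and 1 as described above.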

Regression: a multivariate look

Figure: three diagrams contrasting the designs.
Univariate linear regression: one Y (shape) variable regressed on one X (other) variable.
Multiple linear regression: one Y (shape) variable regressed on several X variables (X1, X2, X3).
Multivariate linear regression: several Y variables (Y1, Y2, Y3; shape) regressed on one or more X variables.

ShapeRegress[proc, variable (, PCs)]

Function for multivariate regression of entire shape onto a single independent variable.

proc: a matrix of Procrustes-superimposed landmark coordinates, with objects in rows and coordinates in columns.
variable: the variable onto which proc is to be regressed; a vector containing an observation of a continuous variable for each object.
PCs: an optional parameter specifying which PC should be shown in the graph. By default the regression on PC 1 is shown.

Example of ShapeRegress[]

ShapeRegress[proc, x, 2]

Figure: the graph shows one dimension of the regression (PC 2 plotted against the variable).

R-square (all PCs) = 0.21
P[R-square is random] = 0.34

                        PC1         PC2         PC3         PC4         PC5
Intercept          -0.405353   -0.63799   -0.0227621  -0.07787     0.0231763
Slope               0.550962    0.867165   0.0309385   0.105842   -0.0315015
Univariate R-square 0.09        0.83       0.00        0.07        0.01

The overall R-square gives the total amount of shape variation explained by the x variable; the table gives the regression coefficients and the univariate R-square for each PC.

Two-Block Partial Least Squares

Shape is inherently multivariate. Independent variables, such as diet, vegetation, precipitation, and temperature, are likely to be correlated with one another. Two correlated independent variables will have overlapping correlation with shape: if vegetation and precipitation are correlated with each other, then both will be correlated with shape if either one of them is. One therefore needs to control for the correlations between independent variables in order to properly test many of them at once.

(Diagram: a block of shape variables Y1, Y2, Y3 related to a block of other variables X1, X2, X3, which may themselves be shape variables.)

Two-block partial least squares (2B-PLS)

2B-PLS finds paired regression axes (PLS axes), the major axes of correlation between two multivariate data sets (Block 1 and Block 2). PLS axis one is the best regression of data set one on data set two, where "best" means that it explains the most of both data sets.
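The core computation behind these paired axes is a singular value decomposition of the between-block covariance matrix. The Python sketch below illustrates that computation under the Rohlf and Corti framing (it is an illustration, not the course's TwoBlockPartialLeastSquares[] function, and the helper name and simulated data are invented):

```python
import numpy as np

def two_block_pls(block1, block2):
    """SVD of the between-block covariance matrix: the left and right
    singular vectors are the PLS axes for block 1 and block 2, and the
    singular values are the covariances captured by each axis pair."""
    X = block1 - block1.mean(axis=0)
    Y = block2 - block2.mean(axis=0)
    n = X.shape[0]
    cross_cov = X.T @ Y / (n - 1)          # between-block covariance matrix
    U, s, Vt = np.linalg.svd(cross_cov, full_matrices=False)
    scores1 = X @ U                        # block 1 scores on the PLS axes
    scores2 = Y @ Vt.T                     # block 2 scores on the PLS axes
    return scores1, scores2, s

# Simulated blocks sharing one latent signal, for demonstration only
rng = np.random.default_rng(3)
signal = rng.normal(size=(40, 1))
b1 = signal @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(40, 5))
b2 = signal @ rng.normal(size=(1, 3)) + 0.1 * rng.normal(size=(40, 3))
s1, s2, sv = two_block_pls(b1, b2)
```

By construction, the covariance between the two blocks' scores on the first PLS axis equals the first singular value, and the singular values come out in descending order.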

Example from the Rohlf and Corti paper (three figures): the regression axes for Block 1 and Block 2, with the correlation between the blocks shown on each axis.


TwoBlockPartialLeastSquares[data1, data2, {type1, type2} (, PLS)]

This function performs a two-block partial least squares analysis following the methodology of Rohlf and Corti (2000).

Arguments:
data1 and data2: two blocks of variables, one or both of which can be a matrix of Procrustes-superimposed landmark coordinates with objects in rows and coordinates in columns.
{type1, type2}: a list of two data types, in quotation marks. Allowable types are "Shape" (Procrustes-superimposed coordinates), "Standardized" (independent variables with different units of measurement that need to be standardized), and "Unstandardized" (independent variables with the same unit of measurement that do not need to be standardized).
PLS: an optional integer indicating which PLS axis to plot in the output graph. By default PLS 1 is plotted.

Example of TwoBlockPartialLeastSquares[data1, data2, {type1, type2} (, PLS)]

Stats:
Proportion of actual squared covariance explained by PLS 1: 0.48
Proportion of total possible squared covariance explained by all PLS axes: 0.08

Figure: scatter plot of the scores of each data block on their shared PLS axis (Data 2, PLS 1 against Data 1, PLS 1), and shape models for the positive end of PLS 1 showing what aspects of the two shapes are correlated.