Hypothesis Testing For Multilayer Network Data

Hypothesis Testing For Multilayer Network Data. Jun Li, Department of Mathematics and Statistics, Boston University. Joint work with Eric Kolaczyk.

Outline Background and Motivation Geometric structure of multilayer networks Frechet mean and its general central limit theorem Central limit theorem for multilayer network Simulation Study Potential directions

Background and Motivation There is a growing trend toward analyzing large collections of networks. In recent work by our group [1], a formal notion of a space of unilayer-network graph Laplacians was introduced and a central limit theorem was developed on it. Many natural and engineered systems are best described by collections of multiple networks, so multilayer network representations arise naturally. [1] Ginestet, C. E., Balachandran, P., Rosenberg, S., & Kolaczyk, E. D. (2014). Hypothesis Testing For Network Data in Functional Neuroimaging. arXiv preprint.

Background and Motivation: A Motivating Example n patients; d regions of interest (ROI); each patient's brain network is measured at months 1, 2, and 3, giving one layer per month. Assume patients received treatment between the 2nd and the 3rd month: test whether the treatment has an effect. Assume we measured two samples from different populations: test whether there exists a difference between the two populations.

Outline Geometric structure of multilayer networks Frechet mean and its general central limit theorem Central limit theorem for multilayer network Simulation Study Potential directions

Geometric Structure of Multilayer Networks Laplacian matrix and supra-Laplacian. The Laplacian matrix of a network G is defined by L = D − A. For a multilayer network M, we can list all of its node copies across layers and treat the result as a single network G_M; the supra-Laplacian of M is then defined as the Laplacian matrix of G_M.
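
To make the construction concrete, here is a minimal sketch (not from the talk) of how one might build a Laplacian and a supra-Laplacian with numpy. The function names, the list-of-layers input format, and the choice of coupling every node to its own copies in all other layers with a common weight inter_weight are illustrative assumptions.

import numpy as np

def laplacian(A):
    # Graph Laplacian L = D - A for a weighted adjacency matrix A.
    return np.diag(A.sum(axis=1)) - A

def supra_laplacian(layers, inter_weight=1.0):
    # Supra-Laplacian of a multilayer network given a list of T intra-layer
    # d x d adjacency matrices. Each node is coupled to its own copies in the
    # other layers with weight inter_weight (an assumed coupling pattern).
    T, d = len(layers), layers[0].shape[0]
    A = np.zeros((T * d, T * d))
    for t, At in enumerate(layers):
        A[t * d:(t + 1) * d, t * d:(t + 1) * d] = At              # intra-layer blocks
    for s in range(T):
        for t in range(T):
            if s != t:
                A[s * d:(s + 1) * d, t * d:(t + 1) * d] = inter_weight * np.eye(d)
    return laplacian(A)

Flattening the multilayer network into the single (Td) x (Td) adjacency matrix A is exactly the "treat M as a network G_M" step above; the Laplacian of that flattened network is the supra-Laplacian.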

Geometric Structure of Multilayer Networks (cont.) Geometric structure of unilayer-network graph Laplacians [2]:
Theorem 1. Let L_d be the set of d × d matrices L satisfying: (1) symmetry, L = L^T; (2) the entries in each row sum to 0; (3) the off-diagonal entries are non-positive, L_ij ≤ 0 for i ≠ j; (4) rank(L) = d − 1. Then L also satisfies: (5) positive semi-definiteness, L ⪰ 0.
The matrices with these properties form a submanifold of R^{d^2} of dimension d(d−1)/2, with corners. In addition, L_d is a convex subset of R^{d^2}.
[2] Ginestet, C. E., Balachandran, P., Rosenberg, S., & Kolaczyk, E. D. (2014). Hypothesis Testing For Network Data in Functional Neuroimaging. arXiv preprint.
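
The conditions of Theorem 1 are easy to verify numerically. The sketch below (illustrative only, with an assumed tolerance and function name) checks properties (1)-(5) for a candidate matrix.

import numpy as np

def is_graph_laplacian(L, tol=1e-8):
    d = L.shape[0]
    symmetric = np.allclose(L, L.T, atol=tol)                    # (1) L = L^T
    zero_row_sums = np.allclose(L.sum(axis=1), 0.0, atol=tol)    # (2) rows sum to 0
    off_diag = L - np.diag(np.diag(L))
    nonpositive = np.all(off_diag <= tol)                        # (3) L_ij <= 0 for i != j
    rank_ok = np.linalg.matrix_rank(L, tol=tol) == d - 1         # (4) rank(L) = d - 1
    psd = np.all(np.linalg.eigvalsh((L + L.T) / 2) >= -tol)      # (5) L >= 0
    return symmetric and zero_row_sums and nonpositive and rank_ok and psd

Per the theorem, (5) follows from (1)-(4), so the last check is only a sanity test; note that (4) holds exactly when the underlying graph is connected.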

Geometric Structure of Multilayer Networks (cont.) Two classes of multilayer-network supra-Laplacians. Class 1: inter-layer links couple each node only to its own copies in the other layers, with fixed weights (see the diagram on the original slide). Class 2: extends the Class 1 case by letting the inter-layer links take any positive weights.

Geometric Structure of Multilayer Networks (cont.) Geometric structure of the two classes of multilayer-network supra-Laplacians.
Theorem 2. Class 1 supra-Laplacians form a submanifold of R^{(nd)^2} of dimension nd(d−1)/2. Class 2 supra-Laplacians form a submanifold of R^{(nd)^2} of dimension nd(d−1)/2 + (n−1)d. Both submanifolds are convex subsets of R^{(nd)^2}.

Outline Frechet mean and its general central limit theorem Central limit theorem for multilayer network Simulation Study Potential directions

Frechet Mean and its General Central Limit Theorem Definition of Frechet mean. On a metric space (S, ρ) there is a notion of the mean µ of a distribution Q as the minimizer of the expected squared distance from a point,
µ = argmin_p ∫ ρ^2(p, q) Q(dq),
assuming the integral is finite (for some p) and the minimizer is unique, in which case one says that the Frechet mean of Q exists.
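
As a toy illustration (not the talk's method), the Frechet sample mean can be computed by directly minimizing the sum of squared distances to the sample points; with the Euclidean metric it reduces to the ordinary sample average, which gives a convenient sanity check. The function name and the use of scipy.optimize.minimize are assumptions.

import numpy as np
from scipy.optimize import minimize

def frechet_sample_mean(points, rho):
    # points: (n, s) array of vectorized sample points; rho: a metric on R^s.
    # Minimize the empirical Frechet function sum_i rho^2(x, Y_i).
    objective = lambda x: sum(rho(x, p) ** 2 for p in points)
    x0 = points.mean(axis=0)  # Euclidean average as a starting point
    return minimize(objective, x0).x

# Sanity check: with rho the Euclidean distance, the Frechet mean
# coincides with the usual sample mean.
Y = np.random.default_rng(0).normal(size=(50, 3))
mu_hat = frechet_sample_mean(Y, rho=lambda x, y: np.linalg.norm(x - y))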

Frechet Mean and its General Central Limit Theorem (cont.)
Theorem 3. On the metric space S, under some regularity conditions [3], we have the general CLT for the Frechet mean:
n^{1/2} [J(µ_n) − J(µ)] → N(0, Λ^{−1} C Λ^{−1}) in distribution, as n → ∞.
Notation:
h: x ↦ h(x; q) := ρ^2(J^{−1}(x), q);
µ_n: the Frechet sample mean of the empirical distribution;
J: a homeomorphism from a measurable subset of S to an open subset of R^s;
C: the covariance matrix of {D_r h(J(µ); Y_1), r = 1, ..., s};
Λ: the matrix [E D_{r,r'} h(J(µ); Y_1)]_{r,r' = 1, ..., s};
Y_i: i.i.d. S-valued random variables.
[3] Bhattacharya, R., & Lin, L. (2013). An omnibus CLT for Fréchet means and nonparametric inference on non-Euclidean spaces. arXiv preprint.

Outline Central limit theorem for multilayer network Simulation Study Potential directions

Central Limit Theorem for Multilayer Network
Theorem. Let M_1, ..., M_n denote n multilayer networks and let L_1, ..., L_n be the corresponding supra-Laplacians, with L̂_n their empirical mean. The L_i's are assumed to be independent and identically distributed according to a distribution Q. If the expectation Λ := E[L] does not lie on the boundary of L_d, if P[U] > 0 for some open subset U of L_d with Λ ∈ U, and if each element of L_i, i = 1, ..., n, has finite variance, then we obtain the following convergence in distribution:
n^{1/2} (J(L̂_n) − J(Λ)) → N(0, Σ),
where Σ := Cov[J(L)] and J(·) denotes the supra-half-vectorization of its matrix argument, that is, J stacks the upper-triangular entries of the symmetric matrix into a vector.
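
A minimal sketch of the supra-half-vectorization J: stack the strictly upper-triangular entries of the symmetric supra-Laplacian into a vector (the diagonal is redundant because the rows sum to zero). The exact definition of J is not spelled out in the talk, so the function name and the choice of which entries to keep are assumptions; the simulation slides later use only d(d−1)/2 coordinates per layer, consistent with dropping the fixed inter-layer blocks.

import numpy as np

def supra_half_vec(L):
    # Strictly upper-triangular entries of a symmetric matrix L, stacked
    # into a vector (assumed form of the supra-half-vectorization J).
    i, j = np.triu_indices(L.shape[0], k=1)
    return L[i, j]

Applying this map to each patient's supra-Laplacian yields the vectors Y_i used in the simulation study below.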

Outline Simulation Study Potential directions

Simulation Study: Motivating Example Review n patients; d regions of interest (ROI); each patient's network is measured at months 1, 2, and 3, giving one layer per month. Assume patients received treatment between the 2nd and the 3rd month: test whether the treatment has an effect. Assume we measured two samples from different populations: test whether there exists a difference between the two populations.

Simulation Study: Methods 1. Hypothesis testing procedure based on the central limit theorem only.
One-sample case: each multilayer network (patient) can be mapped into a d(d−1)/2 · T dimensional vector through the supra-half-vectorization J, where T is the number of layers (months). Suppose the resulting vectors are Y_1, ..., Y_n. Let a = (1, ..., 1, −1, ..., −1)^T be a contrast vector and X_i = a^T Y_i.
H_0: X's distribution has mean 0. H_1: X's distribution does not have mean 0.
Two-sample case: here we have two samples of such vectors representing the patients: Y_11, ..., Y_1n and Y_21, ..., Y_2n.
H_0: the two populations have the same mean. H_1: the two populations have different means.
We can construct T = (J(Ŷ_1) − J(Ŷ_2))^T Σ̂^{−1} (J(Ŷ_1) − J(Ŷ_2)), which has an asymptotic χ²_m distribution under the null hypothesis, where m = (d choose 2) · T [4].
[4] Ginestet, C. E., Balachandran, P., Rosenberg, S., & Kolaczyk, E. D. (2014). Hypothesis Testing For Network Data in Functional Neuroimaging. arXiv preprint.
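
A sketch of how the two-sample chi-square statistic above might be computed. The pooled covariance estimator and all names are assumptions rather than the talk's exact implementation.

import numpy as np
from scipy.stats import chi2

def two_sample_chi2_test(Y1, Y2):
    # Y1: (n1, m) and Y2: (n2, m) arrays of half-vectorized supra-Laplacians.
    n1, m = Y1.shape
    n2 = Y2.shape[0]
    diff = Y1.mean(axis=0) - Y2.mean(axis=0)
    # Pooled covariance, scaled to the variance of the difference of means
    # (an assumed estimator).
    S = ((n1 - 1) * np.cov(Y1, rowvar=False) +
         (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)
    Sigma_hat = S * (1.0 / n1 + 1.0 / n2)
    T_stat = diff @ np.linalg.solve(Sigma_hat, diff)
    p_value = chi2.sf(T_stat, df=m)   # asymptotic chi-square reference, m degrees of freedom
    return T_stat, p_value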

Simulation Study: Methods 2. Hypothesis testing procedure based on the bootstrap.
In many situations, the bootstrap can be used to perform hypothesis tests that are more reliable in finite samples than tests based on asymptotic theory. If the bootstrap is to work well, the original test statistic must be asymptotically pivotal [5]. In the one-dimensional case, a two-sided test based on the normal approximation has level error of order n^{−1}, which is reduced to order n^{−2} by using the bootstrap with an asymptotically pivotal statistic [6]. For the multi-dimensional case, the level error should depend on both n and d. As shown in our simulation, the bootstrap leads to higher power.
[5] Davidson, R., & MacKinnon, J. G. (1996). The power of bootstrap tests. Queen's Institute for Economic Research Discussion Paper.
[6] Hall, P. (2013). The bootstrap and Edgeworth expansion. Springer Science & Business Media.

Simulation Study: Methods 2. Hypothesis testing procedure based on the bootstrap (cont.)
One-sample case: the test statistic is defined as T_n = √n X̄_n / σ̂. The bootstrap resamples from the empirical distribution G_n based on the centered sample {X_i − X̄_n, i = 1, ..., n}. Here the X's are scalars.
Two-sample case: G_n is constructed in the same way as in the one-sample case. The test statistic is defined as T_n = n (X̄_n^{(1)} − X̄_n^{(2)})^T Σ̂^{−1} (X̄_n^{(1)} − X̄_n^{(2)}), where X̄_n^{(1)} and X̄_n^{(2)} denote the two sample means. Here the X's are vectors and Σ̂ is the pooled sample covariance matrix.
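
A sketch of the one-sample bootstrap procedure just described: the studentized statistic T_n = √n X̄_n / σ̂ is recomputed on resamples drawn from the centered sample, so that the bootstrap world obeys H_0. The number of replicates B, the seed handling, and the two-sided p-value convention are assumptions.

import numpy as np

def one_sample_bootstrap_test(X, B=999, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.size
    T_obs = np.sqrt(n) * X.mean() / X.std(ddof=1)
    X_centered = X - X.mean()                     # impose the null hypothesis
    T_boot = np.empty(B)
    for b in range(B):
        Xb = rng.choice(X_centered, size=n, replace=True)
        T_boot[b] = np.sqrt(n) * Xb.mean() / Xb.std(ddof=1)
    p_value = (1 + np.sum(np.abs(T_boot) >= abs(T_obs))) / (B + 1)  # two-sided
    return T_obs, p_value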

Simulation Study: Results [Figure: block model, one-sample bootstrap case; T value plotted against effect size (0 to 0.1), with panels for d = 6 and d = 10 at three sample sizes each, including n = 200 and n = 1000.]

Potential Directions Estimate the rate of convergence of our central limit theorem for multilayer networks as a function of d and n. Power analysis: determine how many patients we need to measure in order to ensure power at a given level; this requires the above convergence rate in the multivariate case. More applications based on this CLT for multilayer networks, e.g., regression on networks.