Nonparametric Estimation of Distributions in a Large-p, Small-n Setting


Nonparametric Estimation of Distributions in a Large-p, Small-n Setting

Jeffrey D. Hart
Department of Statistics, Texas A&M University

Current and Future Trends in Nonparametrics
Columbia, South Carolina, October 11, 2007

Outline

- A random effects location model
- Multiple hypothesis testing
- Brief review of estimation results for the location model
- Minimum distance estimation
- Simulation results
- Location-scale model
- Microarray example

Location model

$X_{ij} = \mu_i + \sigma\,\epsilon_{ij}$, $j = 1,\dots,n$, $i = 1,\dots,p$.

Assumptions:
- $\mu_1,\dots,\mu_p$ are i.i.d. as $G$.
- $\epsilon_{i1},\dots,\epsilon_{in}$, $i = 1,\dots,p$, are i.i.d. as $F$, where $F$ has mean 0 and standard deviation 1.
- All $\mu_i$'s are independent of all $\epsilon_{ij}$'s.
- $\sigma$ is an unknown constant.

Problem of interest: obtain nonparametric estimates of $F$ and $G$.
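As a quick illustration, data from this model can be simulated as follows (a minimal sketch; the standard normal choices for $F$ and $G$ and the value of $\sigma$ are illustrative, not part of the talk):

```python
import numpy as np

# Simulate from the location model X_ij = mu_i + sigma * eps_ij.
# G and F are both taken to be standard normal purely for illustration.
rng = np.random.default_rng(0)

p, n, sigma = 1000, 2, 1.0
mu = rng.normal(size=p)            # mu_1, ..., mu_p i.i.d. ~ G
eps = rng.normal(size=(p, n))      # eps_ij i.i.d. ~ F (mean 0, sd 1)
X = mu[:, None] + sigma * eps      # p x n data matrix
```

Each row shares a single $\mu_i$, so within-row differences carry information about $F$ alone.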

Connection with deconvolution

When $n = 1$, our location model is a classic deconvolution model. In this case ($n = 1$), it is clear that $F$ and $G$ are not both identifiable. In deconvolution, the distribution of $\epsilon$ is assumed to be known, in which case it is possible to consistently estimate the distribution of $\mu$ from the $X$-data. [Carroll and Hall (1988, JASA)]

Multiple hypothesis testing

- The location model is sometimes used in microarray analyses, where $p$ is the number of genes and $n$ is the number of measurements per gene.
- Test all hypotheses $H_{0i}: \mu_i = 0$, $i = 1,\dots,p$.
- Typically, the distribution of a test statistic (under the null) will depend on $F$.
- The dependence on $F$ is strong when $n$ is small.
- The previous point implies that it is desirable to infer $F$.

Nonparametric estimation of F and G

Is it possible to construct consistent nonparametric estimators of $F$ and $G$ when $p$ goes to infinity but $n$ is bounded? Perhaps surprisingly, the answer is yes, even when $n$ is as small as 2.

Two important early papers:
- Reiersøl (1950, Econometrica): identifiability
- Wolfowitz (1957, Ann. Math. Statist.): minimum distance estimation (MDE)

Two main types of estimators

- Explicit estimators: based on characteristic function inversion, in analogy to the simpler deconvolution problem.
- Minimum distance estimators: choose $F$ and $G$ so that the induced distribution of $(X_{i1},\dots,X_{in})$ is a good match to the empirical distribution.

More recent literature

- Horowitz and Markatou (1996, Review of Economic Studies): explicit estimators from panel data. (Error density assumed to be symmetric.)
- Li and Vuong (1998, JMVA): explicit estimator in the location model.
- Hall and Yao (2003, Ann. Statist.): explicit estimators and MDE histograms in the location model.
- Neumann (2006): strong consistency of MDEs of $F_0$ and $G_0$ in the location model.

Characteristic functions

$\psi(s,t) = E[\exp(isX_{i1} + itX_{i2})]$

Under conditions more general than the location model, $\psi(s,t)$ is consistently estimated by

$\hat\psi(s,t) = \binom{n}{2}^{-1} \sum_{1 \le j < k \le n} \hat\psi_{j,k}(s,t)$, where $\hat\psi_{j,k}(s,t) = \frac{1}{p} \sum_{l=1}^{p} \exp(isX_{lj} + itX_{lk})$.

In the location model,

$\psi(s,t) = \psi_\mu(s+t)\,\psi_\epsilon(\sigma s)\,\psi_\epsilon(\sigma t)$.
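The estimator $\hat\psi(s,t)$ is straightforward to compute directly; a short sketch (illustrative code, not the talk's implementation):

```python
import numpy as np
from itertools import combinations
from math import comb

def ecf_pairs(X, s, t):
    """psi_hat(s, t): average over all column pairs j < k of
    psi_hat_{j,k}(s, t) = (1/p) * sum_l exp(i*(s*X_lj + t*X_lk))."""
    p, n = X.shape
    total = 0j
    for j, k in combinations(range(n), 2):
        total += np.mean(np.exp(1j * (s * X[:, j] + t * X[:, k])))
    return total / comb(n, 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
val = ecf_pairs(X, 0.0, 0.0)   # equals 1 at the origin, as any cf must
```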

An MDE metric

Suppose $\hat\mu_1,\dots,\hat\mu_k$ are candidates for the quantiles of $G$ at probabilities $(j - 1/2)/k$, $j = 1,\dots,k$. An estimate of the characteristic function of $G$ is

$\hat\psi_\mu(t) = \frac{1}{k} \sum_{j=1}^{k} e^{it\hat\mu_j}$.

Given candidate quantiles for $F$, we may likewise compute an estimate $\hat\psi_\epsilon$ of $\psi_\epsilon$.

Metric:

$\int\!\!\int e^{-h^2(s^2+t^2)} \bigl|\hat\psi(s,t) - \hat\psi_\mu(s+t)\,\hat\psi_\epsilon(\hat\sigma s)\,\hat\psi_\epsilon(\hat\sigma t)\bigr|^2 \, ds\,dt$
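The quantile-based cf estimate is just the cf of the discrete uniform distribution on the candidate quantiles; a minimal sketch (the quantile and argument grids are illustrative choices):

```python
import numpy as np

def cf_from_quantiles(q, t):
    """psi_hat(t) = (1/k) * sum_j exp(i*t*q_j) for candidate quantiles q,
    evaluated at scalar or array argument t."""
    t = np.asarray(t, dtype=float)
    return np.exp(1j * t[..., None] * q).mean(axis=-1)

q = np.linspace(-2.0, 2.0, 20)   # illustrative candidate quantiles
t = np.linspace(-3.0, 3.0, 7)
vals = cf_from_quantiles(q, t)
```

In the full metric, two such estimates ($\hat\psi_\mu$ and $\hat\psi_\epsilon$) are combined and compared with $\hat\psi(s,t)$ under the Gaussian weight $e^{-h^2(s^2+t^2)}$.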

An estimation algorithm

1. Compute the metric for initial estimates of $F_0$ and $G_0$.
2. Randomly jitter the initial quantiles of $F_0$, and recompute the metric.
3. If the new distance is smaller than the previous one, accept the jittered quantiles.
4. Repeat 2 and 3 some predetermined number of times.
5. Repeat 2-4 for estimates of the $G_0$ quantiles.
6. Iterate 2-5 until the distance changes by less than, say, 1% from one iteration to the next.
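Steps 2-4 amount to a random search over candidate quantiles. A toy sketch of that accept-if-better loop (the objective here matches discrete quantiles to the standard normal cf on a grid; the step size, iteration count, and grid are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def jitter_search(objective, q0, n_iter=300, scale=0.1):
    """Steps 2-4: randomly jitter the candidate quantiles and keep a
    jittered set only when it lowers the objective (the MDE metric)."""
    q, best = np.sort(q0), objective(q0)
    for _ in range(n_iter):
        cand = np.sort(q + rng.normal(0.0, scale, size=q.shape))
        val = objective(cand)
        if val < best:              # step 3: accept only improvements
            q, best = cand, val
    return q, best

# Toy objective: L2 distance between the cf of the discrete distribution
# on the candidate quantiles and the N(0,1) cf exp(-t^2/2).
t = np.linspace(-3.0, 3.0, 61)

def objective(q):
    cf = np.mean(np.exp(1j * np.outer(t, q)), axis=1)
    return float(np.sum(np.abs(cf - np.exp(-0.5 * t**2)) ** 2))

q0 = np.zeros(15)                   # deliberately poor starting quantiles
q_hat, d_hat = jitter_search(objective, q0)
```

Because only improvements are accepted, the distance is non-increasing across iterations, which is what makes the stopping rule in step 6 sensible.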

Simulations

$X_{ij} = \mu_i + \sigma\,\epsilon_{ij}$, $i = 1,\dots,1000$, $j = 1, 2$

Two choices for $G$:
- standard normal
- bimodal mixture of two normals

Three choices for $F$:
- standard normal
- standard exponential, shifted to have mean 0
- $t_3$-distribution, rescaled to have variance 1

Two values of $\sigma$: 1 and 3
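The three error distributions can be generated with the stated standardizations as follows (sample size and seed are illustrative; the $t_3$ has variance $3/(3-2) = 3$, hence the $\sqrt{3}$ rescaling):

```python
import numpy as np

rng = np.random.default_rng(7)
size = 100_000

f_normal = rng.normal(size=size)                    # standard normal
f_expo = rng.exponential(1.0, size=size) - 1.0      # exponential shifted to mean 0
f_t3 = rng.standard_t(3, size=size) / np.sqrt(3.0)  # t_3 rescaled to variance 1
```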

Simulations, continued

Estimates of the quantile functions $G^{-1}(u)$ and $F^{-1}(u)$ were computed at $u = (j - 1/2)/30$, $j = 1,\dots,30$, for each data set generated from the location model.

$\hat\sigma^2 = (2p)^{-1} \sum_{i=1}^{p} (X_{i1} - X_{i2})^2$

Two hundred replications were performed at each combination of $F$, $G$, and $\sigma$. Some of the results are summarized in the graphs that follow.
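The variance estimator rests on the fact that, with $n = 2$, $X_{i1} - X_{i2} = \sigma(\epsilon_{i1} - \epsilon_{i2})$ has mean 0 and variance $2\sigma^2$, so the $\mu_i$ drop out. A quick illustrative check:

```python
import numpy as np

rng = np.random.default_rng(3)
p, sigma = 1000, 3.0
mu = rng.normal(size=p)
X = mu[:, None] + sigma * rng.normal(size=(p, 2))

# sigma_hat^2 = (2p)^(-1) * sum_i (X_i1 - X_i2)^2; the mu_i cancel in the difference
sigma2_hat = np.sum((X[:, 0] - X[:, 1]) ** 2) / (2 * p)
# close to sigma^2 = 9 for large p
```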

G = Normal, F = Normal

[Figure: estimated quantiles of $F$ for $\sigma = 1$ (left) and $\sigma = 3$ (right), quantile vs. probability.]

Red curve: $F^{-1}$. Black curve: median estimate of $F^{-1}$. Dashed curves: 10th and 90th percentiles of all estimates.

G = Normal, F = Exponential

[Figure: estimated quantiles of $F$ for $\sigma = 1$ (left) and $\sigma = 3$ (right), quantile vs. probability.]

Red curve: $F^{-1}$. Black curve: median estimate of $F^{-1}$. Dashed curves: 10th and 90th percentiles of all estimates.

G = Normal, F = $t_3$

[Figure: estimated quantiles of $F$ for $\sigma = 1$ (left) and $\sigma = 3$ (right), quantile vs. probability.]

Red curve: $F^{-1}$. Black curve: median estimate of $F^{-1}$. Dashed curves: 10th and 90th percentiles of all estimates.

G = Normal mixture, F = Normal

[Figure: estimated quantiles of $F$ for $\sigma = 1$ (left) and $\sigma = 3$ (right), quantile vs. probability.]

Red curve: $F^{-1}$. Black curve: median estimate of $F^{-1}$. Dashed curves: 10th and 90th percentiles of all estimates.

G = Normal mixture, F = Exponential

[Figure: estimated quantiles of $F$ for $\sigma = 1$ (left) and $\sigma = 3$ (right), quantile vs. probability.]

Red curve: $F^{-1}$. Black curve: median estimate of $F^{-1}$. Dashed curves: 10th and 90th percentiles of all estimates.

G = Normal mixture, F = $t_3$

[Figure: estimated quantiles of $F$ for $\sigma = 1$ (left) and $\sigma = 3$ (right), quantile vs. probability.]

Red curve: $F^{-1}$. Black curve: median estimate of $F^{-1}$. Dashed curves: 10th and 90th percentiles of all estimates.

$\mu_i$-free MDE estimates of F

Suppose $n \ge 3$ and define $\delta_{i1} = X_{i2} - X_{i1} = \sigma(\epsilon_{i2} - \epsilon_{i1})$ and $\delta_{i2} = X_{i2} - X_{i3} = \sigma(\epsilon_{i2} - \epsilon_{i3})$.

The pairs $(\delta_{i1}, \delta_{i2})$, $i = 1,\dots,p$, are a special case of the location model. It follows that the distribution of $\epsilon_{ij}$ is estimable from the differenced data! If one still wishes to estimate $G$, having a good estimate of $F$ will help in this process.
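A sketch of the differencing (illustrative code): with $n = 3$, the pairs $(\delta_{i1}, \delta_{i2})$ can be formed directly, and the $\mu_i$ cancel exactly no matter how large they are.

```python
import numpy as np

rng = np.random.default_rng(4)
p, sigma = 2000, 1.0
mu = rng.normal(0.0, 10.0, size=p)       # large, arbitrary location effects
eps = rng.normal(size=(p, 3))
X = mu[:, None] + sigma * eps

# delta_i1 = X_i2 - X_i1 and delta_i2 = X_i2 - X_i3: the mu_i cancel exactly,
# leaving functions of the errors alone
delta = np.column_stack([X[:, 1] - X[:, 0], X[:, 1] - X[:, 2]])
```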

Location-scale model

Suppose the observed data are $X_{ij}$, $j = 1,\dots,n$, $i = 1,\dots,p$. Consider the following model:

$X_{ij} = \mu_i + \sigma_i\,\epsilon_{ij}$, $j = 1,\dots,n$, $i = 1,\dots,p$

- $(\mu_1, \sigma_1),\dots,(\mu_p, \sigma_p)$ are i.i.d. as $G_0$.
- $\epsilon_{ij}$, $j = 1,\dots,n$, $i = 1,\dots,p$, are i.i.d. as $F_0$, and independent of $(\mu_1, \sigma_1),\dots,(\mu_p, \sigma_p)$.

MDE estimation of F based on residuals

In the location-scale model, the residuals

$e_{ij} = \dfrac{X_{ij} - \bar X_i}{S_i} = \dfrac{\epsilon_{ij} - \bar\epsilon_i}{S_{\epsilon,i}}$

are completely free of $(\mu_i, \sigma_i)$, $i = 1,\dots,p$.

- $f$: density of $\epsilon_{ij}$
- $f_n$: corresponding density of $e_{ij}$

Conjecture: unless $n$ is very small, $f$ is identifiable from $f_n$.

MDE estimation from residuals, continued

Algorithm:
- Compute a kernel density estimate $\hat f_n$ of $f_n$ from the residuals $e_{ij}$, $j = 1,\dots,n$, $i = 1,\dots,p$.
- Given a candidate $\tilde f$ for $f$, use simulation to approximate (arbitrarily well) the corresponding $\tilde f_n$.
- Compute $\int (\hat f_n(x) - \tilde f_n(x))^2 \, dx$.
- Try to find a density $\tilde f$ such that the corresponding $\tilde f_n$ minimizes the distance in the previous step.
- As candidate densities, use kernel smooths of candidate quantiles.
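The distance step can be sketched with a plain Gaussian kernel estimate and a trapezoidal approximation to the integral (grid, bandwidth, and sample sizes here are illustrative choices, not the talk's):

```python
import numpy as np

def kde(data, x, h):
    """Gaussian kernel density estimate of `data` on the grid `x`."""
    z = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def l2_dist(f1, f2, x):
    """Trapezoidal approximation to the integral of (f1 - f2)^2 over the grid."""
    g = (f1 - f2) ** 2
    dx = x[1] - x[0]
    return dx * (0.5 * (g[0] + g[-1]) + g[1:-1].sum())

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 5))
# standardized residuals e_ij = (X_ij - Xbar_i) / S_i
e = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

x = np.linspace(-3.0, 3.0, 201)
fn_hat = kde(e.ravel(), x, h=0.2)   # estimate of f_n from the residuals
```

A candidate $\tilde f_n$ would be produced by simulating errors from $\tilde f$, forming the same standardized residuals, and applying `kde` again; `l2_dist(fn_hat, fn_tilde, x)` is then the quantity to minimize.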

An example

- Suppose $\epsilon_{ij} + 1$ has a standard exponential distribution.
- Generate $5 \times 8038$ values of $\epsilon_{ij}$.
- Compute $5 \times 8038$ standardized residuals.
- Apply the algorithm from the previous page to estimate $f$.

Results for exponential example

[Figure: left panel, the exponential cdf $F(x)$ and its estimate; right panel, the residual densities.]

Microarray example

- Microarray data collected by Texas A&M nutritionist Robert Chapkin and coworkers. The data here are a subset of data from a larger study.
- $n = 5$ rats that were all given the same treatment
- $p = 8038$ genes
- $X_{ij} = \log(\text{expression level for gene } i \text{ of rat } j)$

Model

$X_{ij} = M_j + \mu_i + \sigma_i\,\epsilon_{ij}$, $j = 1,\dots,5$, $i = 1,\dots,8038$.

- $M_j$: rat effect
- $(\mu_i, \sigma_i)$: gene effect
- $\epsilon_{ij}$: error

Remarks:
1. The rat effects can be estimated very efficiently since $p$ is so large.
2. In this example our main interest is in estimating $F$, the distribution of each $\epsilon_{ij}$.
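To see why large $p$ makes the rat effects easy to estimate, here is a simplified simulation (all $\sigma_i = 1$, rat effects constrained to sum to zero for identifiability; one illustrative estimator, column means of the row-centered data, not necessarily the talk's):

```python
import numpy as np

rng = np.random.default_rng(6)
p, n = 8038, 5
M = np.array([0.5, -0.2, 0.0, 0.1, -0.4])     # rat effects, sum to 0
mu = rng.normal(size=p)                        # gene location effects
X = M[None, :] + mu[:, None] + rng.normal(size=(p, n))

# Row-centering removes each gene's mu_i; averaging the centered columns
# over the p genes then estimates M_j with standard error of order 1/sqrt(p).
M_hat = (X - X.mean(axis=1, keepdims=True)).mean(axis=0)
```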

Typical scatterplot

[Figure: Rat 1 vs. Rat 2, with $X_2$ plotted against $X_1$.]

Sample means and standard deviations

[Figure: $\log(\mathrm{SD})$ plotted against the mean for the 8038 genes.]

Residuals for two rats

[Figure: left panel, Rat 2 residuals vs. Rat 1 residuals; right panel, the same scatterplot with a lag of one gene.]

The elliptical pattern in the left plot is to be expected. The right plot is reassuring about independence between genes.

Density estimates

[Figure: left panel, the estimate of $f$; right panel, the residual densities. In the right-hand graph, red is the kernel estimate $\hat f_n$ and black is the fitted $\tilde f_n$.]

Further research

- Plenty of room to improve the algorithm for approximating MDEs.
- Identifiability issues in the location-scale model.
- Efficiency of MDE relative to explicit methods: ongoing work with Jan Johannes.