
Robust statistics Michael Love 7/10/2016

Robust topics
- Median
- MAD
- Spearman
- Wilcoxon rank test
- Weighted least squares
- Cook's distance
- M-estimators

Robust topics
- Median => middle
- MAD => spread
- Spearman => association
- Wilcoxon rank test => group diffs
- Weighted least squares
- Cook's distance => observation influence
- M-estimators => framework for estimation

What do we mean by robust? These are most important:
- robust to outliers
- robust to misspecification of the model
Robust doesn't mean low variance ("precise") or low bias ("accurate").
Accuracy ((TP+TN)/total), precision (1-FDR), sensitivity (TPR), specificity (1-FPR)

What do we mean by outlier?
- Technical error?
- Data entry error?
- An unaccounted-for tail of the data distribution?

How do most statistics work?
Typically, adding up (squared) terms from individual observations to make a statistic. The statistic often has useful math properties.
$\frac{1}{n} \sum_{i=1}^{n} y_i$
$\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$
$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
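As a concrete illustration (a minimal sketch with simulated data, not from the slides), each of these is just a sum over observations in R:

set.seed(1)
x <- rnorm(10)
y <- 2*x + rnorm(10)
mean(y)                                  # (1/n) * sum(y)
sum((y - mean(y))^2) / (length(y) - 1)   # same as var(y)
fit <- lm(y ~ x)
sum(residuals(fit)^2)                    # the sum of squares that least squares minimizes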

Median
The mean is typically used to describe the middle instead of the median.
dat <- matrix(rnorm(5*1e5), ncol=5)
means <- rowMeans(dat)
medians <- apply(dat, 1, median)

Efficiency
sd(medians)/sd(means)
[1] 1.197183
The median's standard deviation is about 20% larger here; for normal data the asymptotic ratio is sqrt(pi/2) ≈ 1.25, so the median is less efficient than the mean.

Median for non-normal data
X ~ 95% N(0, 1) + 5% N(0, 10)
dat <- cbind(matrix(rnorm(19*1e5, sd=1), ncol=19),
             matrix(rnorm(1*1e5, sd=10), ncol=1))
means <- rowMeans(dat)
medians <- apply(dat, 1, median)

Trimmed mean for non-normal data
X ~ 95% N(0, 1) + 5% N(0, 10)
dat <- cbind(matrix(rnorm(19*1e5, sd=1), ncol=19),
             matrix(rnorm(1*1e5, sd=10), ncol=1))
means <- rowMeans(dat)
# trim 5% from each end = 10% of data
tmeans <- apply(dat, 1, mean, trim=.05)

MAD: median absolute deviation

Median absolute deviation & SD
mad() will give you 1.4826 times the median absolute deviation. The factor 1.4826 ≈ 1/qnorm(0.75) makes the MAD a consistent estimator of the standard deviation for normally distributed data.
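A quick sanity check of the scaling, using simulated normal data (not from the slides):

set.seed(1)
x <- rnorm(1e4)
raw <- median(abs(x - median(x)))
mad(x, constant=1)   # the raw median absolute deviation
1.4826 * raw         # what mad(x) returns by default
sd(x)                # close to mad(x) for normal data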

Efficiency of MAD
dat <- matrix(rnorm(20*1e5), ncol=20)
sds <- apply(dat, 1, sd)
mads <- apply(dat, 1, mad)

MAD with outliers
dat <- cbind(matrix(rnorm(19*1e5, sd=1), ncol=19),
             matrix(rnorm(1*1e5, sd=10), ncol=1))
sds <- apply(dat, 1, sd)
mads <- apply(dat, 1, mad)

Spearman correlation
Translate data to ranks and compute the correlation.
+ less sensitive to outliers
- all subregions of the range count equally
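Equivalently, Spearman correlation is Pearson correlation computed on the ranks; a small sketch with simulated data:

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
cor(x, y, method="spearman")
cor(rank(x), rank(y))   # identical: Pearson on the ranks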

Spearman correlation
pearson: 0.9932279, spearman: 0.9879952

Spearman correlation
pearson: 0.69943398, spearman: -0.03404562

Spearman correlation
Drug resistance in cell lines vs. gene expression
pearson: 0.9977304, spearman: 0.1939154

Wilcoxon / Mann-Whitney rank test
W <- 0
for (i in seq_along(x)) {
  W <- W + sum(y <= x[i])
}
print(W)
[1] 211
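For continuous data without ties, this hand-computed W matches the statistic reported by wilcox.test; a sketch with simulated x and y (the 211 above came from the slide's own data, not reproduced here):

set.seed(1)
x <- rnorm(20)
y <- rnorm(20)
W <- 0
for (i in seq_along(x)) {
  W <- W + sum(y <= x[i])
}
W
wilcox.test(x, y)$statistic   # same value when there are no ties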

Wilcoxon vs t-test
p-value distribution (n=20), X ~ N(0, 1), Y ~ 90% N(0, 1) + 10% N(0, 50)

Wilcoxon vs t-test sensitivity (SD=1, n=4)

Wilcoxon for small sample size
wilcox.test(x=101:103, y=1:3)
Wilcoxon rank sum test
data: 101:103 and 1:3
W = 9, p-value = 0.1
alternative hypothesis: true location shift is not equal to 0
Note that even with complete separation of the groups, three observations per group cannot produce a two-sided p-value below 0.1.
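The p-value of 0.1 is not an accident: with three observations per group there are only choose(6, 3) = 20 possible rank assignments, and the most extreme arrangement in each direction bounds the two-sided p-value:

choose(6, 3)       # 20 possible rank assignments
2 / choose(6, 3)   # smallest achievable two-sided p-value: 0.1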

Weighted least squares
If you know the uncertainty associated with each observation $y_i$...

Weighted least squares
The best linear unbiased estimator uses the weights
$w_i = 1 / \widehat{\mathrm{Var}}(\varepsilon_i)$
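In R this is the weights argument of lm(); a minimal sketch assuming the per-observation variances v are known (v here is made up for illustration):

set.seed(1)
x <- rnorm(100)
v <- runif(100, 0.5, 5)                 # hypothetical known variances
y <- 1 + 2*x + rnorm(100, sd=sqrt(v))
fit <- lm(y ~ x, weights=1/v)           # w_i = 1 / Var(eps_i)
coef(fit)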

Cook's distance
                  fit1          fit2
(Intercept) 0.06755603  -0.006076753
x           0.59821827  -0.056409684

Cook's distance
dfbeta(fit1)[1,"x"]
[1] 0.654628
coef(fit1)[2] - coef(fit2)[2]
       x
0.654628
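The slides don't show how fit1 and fit2 were constructed, but the matching numbers imply fit2 is the same regression with the first (influential) observation removed; a hypothetical reconstruction:

set.seed(1)
d <- data.frame(x = c(10, rnorm(20)), y = c(10, rnorm(20)))  # obs 1 has high leverage
fit1 <- lm(y ~ x, data=d)
fit2 <- lm(y ~ x, data=d[-1, ])       # drop the first observation
dfbeta(fit1)[1, "x"]                  # change in slope from deleting obs 1
coef(fit1)["x"] - coef(fit2)["x"]     # the same number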

Cook's distance
cooksd <- cooks.distance(fit1)
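One common rule of thumb (a convention, not from the slides) flags observations whose Cook's distance exceeds 4/n:

plot(cooksd, type="h")
which(cooksd > 4 / length(cooksd))   # candidate influential observations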

M-estimators Because of the squared terms in regression, a few observations with high leverage move the estimated coefficient away from the true value.

M-estimators
M-estimators are a generalized framework for estimation; the M is for "maximum likelihood-type" estimation. Least squares is the maximum likelihood estimate for data with normally distributed error. For data $z_i$ and a parameter $\theta$ to be estimated:
$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_i \log f_{\theta}(z_i)$
Generalize to some function $\rho$:
$\tilde{\theta} = \arg\max_{\theta} \sum_i \rho_{\theta}(z_i)$

MLE reminder
What are we maximizing with MLE and normal errors? Something that looks like:
theta <- seq(-5, 5, .1)
plot(theta, dnorm(theta, log=TRUE))
(The log-density of the standard normal is a parabola, $-\theta^2/2$ plus a constant.)

Theory of estimation
It is interesting to look back to the very origin of the theory of estimation, namely to Gauss and his theory of least squares. Gauss was fully aware that his main reason for assuming an underlying normal distribution and a quadratic loss function was mathematical, i.e., computational, convenience. In later times, this was often forgotten, partly because of the central limit theorem.
- Huber, "Robust Estimation of a Location Parameter" (1964)

Theory of estimation
However, if one wants to be honest, the central limit theorem can at most explain why many distributions occurring in practice are approximately normal. The stress is on the word approximately. This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): What happens if the true distribution deviates slightly from the assumed normal one?
- Huber, "Robust Estimation of a Location Parameter" (1964)

M-estimators
We want easy derivatives for $\rho$ so that the minimum can be found using iterative methods like Newton-Raphson. For MLE with normal error this derivative is linear. With M-estimators we have other options.
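Huber's ρ is the classic example: quadratic near zero (like least squares) and linear in the tails, so its derivative is bounded; a sketch with a hypothetical helper, assuming the cutoff k = 1.345 that rlm uses by default:

huber_rho <- function(u, k=1.345) ifelse(abs(u) <= k, u^2/2, k*abs(u) - k^2/2)
u <- seq(-5, 5, 0.1)
plot(u, huber_rho(u), type="l")   # quadratic center, linear tails
lines(u, u^2/2, lty=2)            # least-squares loss for comparison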

M-estimators
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S.
library(MASS)
rob.fit <- rlm(y ~ x)
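A minimal sketch (simulated data with a few gross outliers, not from the slides) showing what rlm buys you relative to lm:

library(MASS)
set.seed(1)
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
y[1:5] <- y[1:5] + 50   # contaminate five observations
coef(lm(y ~ x))         # pulled away from the truth by the outliers
coef(rlm(y ~ x))        # Huber M-estimation; stays near (1, 2)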

Links on robust statistics in genomics
- SAMseq's implementation of the rank test (sequencing depth, noise of low counts, false discovery rate)
- voom weighted linear model
- edgeR and limma-voom sample quality weights
- DESeq2's use of Cook's distance