Discussion of sampling approach in big data
Big data discussion group at MSCS of UIC

Outline 1 Introduction 2 The framework 3 Bias and variance 4 Approximate computation of leverage 5 Empirical evaluation

Mainly based on Ping Ma, Michael Mahoney, Bin Yu (2015), A statistical perspective on algorithmic leveraging, Journal of Machine Learning Research, 16, 861-911

Sampling in big data analysis
- One popular approach: choose a small portion of the full data
- One possible way: uniform random sampling
- May perform poorly in the worst case

Leveraging approach
- Data-dependent sampling process
- Least-squares regression (Avron et al. 2010, Meng et al. 2014)
- Least absolute deviation and quantile regression (Clarkson et al. 2013, Yang et al. 2013)
- Low-rank matrix approximation (Mahoney and Drineas, 2009)
- Leveraging provides uniformly superior worst-case algorithmic results
- No previous work addresses the statistical aspects

Summary of the results
- Based on the linear model
- An analytic framework for evaluating sampling approaches
- Uses a Taylor expansion to approximate the subsampling estimator

Uniform approach vs. leveraging approach
- Compare the biases and variances, both conditional and unconditional
- Both are unbiased to leading order
- The leveraging approach improves the size-scale of the variance, but may inflate the variance when some leverage scores are small
- Neither leveraging nor the uniform approach dominates the other

New approaches
- Shrinkage leveraging estimator (SLEV): a convex combination of the leveraging sampling probability and the uniform probability
- Unweighted leveraging estimator (LEVUNW): leverage-based sampling with unweighted LS estimation
- Both approaches offer some improvements

Outline 1 Introduction 2 The framework 3 Bias and variance 4 Approximate computation of leverage 5 Empirical evaluation

Linear model
$y = X\beta_0 + \epsilon$, where $X$ is an $n \times p$ matrix, $\beta_0$ is a $p \times 1$ vector, and $\epsilon \sim N(0, \sigma^2 I)$
Least-squares estimator: $\hat\beta_{ols} = (X^TX)^{-1}X^Ty$

About $\hat\beta_{ols}$
- Computation time $O(np^2)$
- Can be written as $V\Sigma^{-1}U^Ty$, where $X = U\Sigma V^T$ (thin SVD)
- Can be solved approximately in $o(np^2)$ time with error bounded by $\epsilon$
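
A minimal NumPy sketch of the full-sample estimator via the thin SVD (function and variable names are mine, not from the paper):

```python
import numpy as np

def ols_via_svd(X, y):
    """Full-sample OLS using the thin SVD X = U S V^T,
    so that beta_hat = V S^{-1} U^T y; cost is O(n p^2)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD
    return Vt.T @ ((U.T @ y) / s)
```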

Leverage
Consider $\hat y = Hy$, where $H = X(X^TX)^{-1}X^T$. The $i$th diagonal element, $h_{ii} = x_i^T(X^TX)^{-1}x_i$, is called the statistical leverage of the $i$th observation.
- $\mathrm{Var}(e_i) = (1 - h_{ii})\sigma^2$
- Studentized residual: $e_i / (\hat\sigma\sqrt{1 - h_{ii}})$
- $h_{ii}$ has been used to identify influential observations

Leverage
$h_{ii} = \sum_{j=1}^{p} U_{ij}^2$
- Exact computation time: $O(np^2)$
- Approximate computation time: $o(np^2)$
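
The exact scores fall out of the same thin SVD as before; a sketch reusing the import above:

```python
def leverage_scores(X):
    """Exact statistical leverage scores h_ii = sum_j U_ij^2,
    the squared row norms of U from the thin SVD; O(n p^2)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return (U**2).sum(axis=1)  # each h_ii is in [0, 1]; they sum to p
```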

Sampling algorithm
- $\{\pi_i\}_{i=1}^n$ is a sampling distribution
- Randomly sample $r > p$ rows of $X$ and the corresponding elements of $y$, using $\{\pi_i\}_{i=1}^n$
- Rescale each sampled row/element by $1/\sqrt{r\pi_i}$ to form a weighted LS subproblem
- Solve the weighted LS subproblem; denote the solution $\tilde\beta_{wls}$
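
A sketch of these three steps in NumPy (sampling with replacement; the `weighted=False` switch anticipates the unweighted LEVUNW variant introduced below):

```python
def subsample_ls(X, y, probs, r, weighted=True, seed=None):
    """Sample r rows with probabilities probs, rescale each sampled
    row of X and element of y by 1/sqrt(r * pi_i), and solve the
    small least-squares subproblem."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=r, replace=True, p=probs)
    Xs, ys = X[idx], y[idx]
    if weighted:
        w = 1.0 / np.sqrt(r * probs[idx])   # rescaling factors
        Xs, ys = Xs * w[:, None], ys * w
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta
```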

Weighted LS subproblem
Let $S_X^T$ ($r \times n$) be the sampling matrix indicating the selected samples, and let $D$ ($r \times r$) be the diagonal rescaling matrix whose $i$th diagonal element is $1/\sqrt{r\pi_k}$ when the $k$th data point is the $i$th one chosen. The weighted LS estimator solves
$$\arg\min_\beta \|DS_X^Ty - DS_X^TX\beta\|^2$$

Weighted sampling estimators
$\tilde\beta_W = (X^TWX)^{-1}X^TWy$ with $W = S_XD^2S_X^T$ (an $n \times n$ diagonal random matrix). $W$ is a random matrix with $E(W_{ii}) = 1$.

Sampling approaches
- Uniform: $\pi_i = 1/n$ for all $i$; uniform sampling estimator (UNIF)
- Leverage-based: $\pi_i^{Lev} = h_{ii}/\sum_i h_{ii} = h_{ii}/p$; leveraging estimator (LEV)
- Shrinkage: $\pi_i = \alpha\pi_i^{Lev} + (1-\alpha)\pi_i^{Unif}$; shrinkage leveraging estimator (SLEV)
- Unweighted leveraging: sample with $\pi_i^{Lev}$ but solve the unweighted problem $\arg\min_\beta \|S_X^Ty - S_X^TX\beta\|^2$; unweighted leveraging estimator (LEVUNW)
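
A usage sketch tying the four approaches to the helpers above, assuming `X` and `y` are already in memory as NumPy arrays ($\alpha = 0.9$ anticipates the paper's later recommendation):

```python
n, p = X.shape
h = leverage_scores(X)
pi_unif = np.full(n, 1.0 / n)                      # UNIF
pi_lev = h / h.sum()                               # LEV: h_ii / p
alpha = 0.9
pi_slev = alpha * pi_lev + (1 - alpha) * pi_unif   # SLEV
r = 5 * p
beta_unif = subsample_ls(X, y, pi_unif, r)
beta_lev = subsample_ls(X, y, pi_lev, r)
beta_slev = subsample_ls(X, y, pi_slev, r)
beta_levunw = subsample_ls(X, y, pi_lev, r, weighted=False)  # LEVUNW
```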

Outline 1 Introduction 2 The framework 3 Bias and variance 4 Approximate computation of leverage 5 Empirical evaluation

Lemma 1
A Taylor expansion of $\tilde\beta_W$ around the point $E(w) = 1$ yields
$$\tilde\beta_W = \hat\beta_{ols} + (X^TX)^{-1}X^T\mathrm{Diag}\{\hat e\}(w - 1) + R_W,$$
where $\hat e = y - X\hat\beta_{ols}$ and $R_W$ is the Taylor expansion remainder.
Remark: (1) the Taylor expansion is valid when $R_W = o_p(\|W - 1\|)$; there is no theoretical justification for when this holds. (2) The formula does not apply to LEVUNW.
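
Where the linear term comes from (a one-step sketch, not in the slides): writing $\tilde\beta(w) = (X^T\mathrm{Diag}\{w\}X)^{-1}X^T\mathrm{Diag}\{w\}y$ and differentiating at $w = 1$ gives
$$\frac{\partial\tilde\beta}{\partial w_i}\Big|_{w=1} = (X^TX)^{-1}x_i\big(y_i - x_i^T\hat\beta_{ols}\big) = (X^TX)^{-1}x_i\hat e_i,$$
and stacking over $i$ yields the linear term $(X^TX)^{-1}X^T\mathrm{Diag}\{\hat e\}(w - 1)$.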

Lemma 2
$$E_W\big[\tilde\beta_W \mid y\big] = \hat\beta_{ols} + E_W[R_W]$$
$$\mathrm{Var}_W\big[\tilde\beta_W \mid y\big] = (X^TX)^{-1}X^T\Big[\mathrm{Diag}\{\hat e\}\,\mathrm{Diag}\Big\{\tfrac{1}{r\pi_i}\Big\}\,\mathrm{Diag}\{\hat e\}\Big]X(X^TX)^{-1} + \mathrm{Var}_W[R_W]$$
Remark: when $E_W[R_W]$ is negligible, $\tilde\beta_W$ is approximately unbiased relative to the full-sample estimate $\hat\beta_{ols}$. The variance is inversely proportional to the subsample size $r$.

Lemma 2 (unconditional)
$$E\big[\tilde\beta_W\big] = \beta_0$$
$$\mathrm{Var}\big[\tilde\beta_W\big] = \sigma^2(X^TX)^{-1} + \frac{\sigma^2}{r}(X^TX)^{-1}X^T\mathrm{Diag}\Big\{\frac{(1-h_{ii})^2}{\pi_i}\Big\}X(X^TX)^{-1} + \mathrm{Var}[R_W]$$
Remark: $\tilde\beta_W$ is unbiased for the true value $\beta_0$. The variance depends on the leverage scores and sampling probabilities, and is inversely proportional to the subsample size $r$.

UNIF
$$E_W\big[\tilde\beta_{UNIF} \mid y\big] = \hat\beta_{ols} + E_W[R_{UNIF}]$$
$$\mathrm{Var}_W\big[\tilde\beta_{UNIF} \mid y\big] = \frac{n}{r}(X^TX)^{-1}X^T\big[\mathrm{Diag}\{\hat e\}\,\mathrm{Diag}\{\hat e\}\big]X(X^TX)^{-1} + \mathrm{Var}_W[R_{UNIF}]$$
$$E\big[\tilde\beta_{UNIF}\big] = \beta_0$$
$$\mathrm{Var}\big[\tilde\beta_{UNIF}\big] = \sigma^2(X^TX)^{-1} + \frac{n\sigma^2}{r}(X^TX)^{-1}X^T\mathrm{Diag}\{(1-h_{ii})^2\}X(X^TX)^{-1} + \mathrm{Var}[R_{UNIF}]$$
Remark: (1) the variance scales with $n/r$ and could be very large unless $r$ is close to $n$; (2) the sandwich-type expression is not inflated by small $h_{ii}$.

LEV
$$E_W\big[\tilde\beta_{LEV} \mid y\big] = \hat\beta_{ols} + E_W[R_{LEV}]$$
$$\mathrm{Var}_W\big[\tilde\beta_{LEV} \mid y\big] = \frac{p}{r}(X^TX)^{-1}X^T\Big[\mathrm{Diag}\{\hat e\}\,\mathrm{Diag}\Big\{\tfrac{1}{h_{ii}}\Big\}\,\mathrm{Diag}\{\hat e\}\Big]X(X^TX)^{-1} + \mathrm{Var}_W[R_{LEV}]$$
$$E\big[\tilde\beta_{LEV}\big] = \beta_0$$
$$\mathrm{Var}\big[\tilde\beta_{LEV}\big] = \sigma^2(X^TX)^{-1} + \frac{p\sigma^2}{r}(X^TX)^{-1}X^T\mathrm{Diag}\Big\{\frac{(1-h_{ii})^2}{h_{ii}}\Big\}X(X^TX)^{-1} + \mathrm{Var}[R_{LEV}]$$
Remark: (1) the variance scales with $p/r$, not the sample size $n$; (2) the sandwich-type expression can be inflated by small $h_{ii}$.

SLEV
$\pi_i = \alpha\pi_i^{Lev} + (1-\alpha)\pi_i^{Unif}$
- Lemma 2 still holds
- If $(1-\alpha)$ is not small, the variance of SLEV does not get inflated too much
- If $(1-\alpha)$ is not large, the variance of SLEV retains the $p/r$ scale
- SLEV not only increases the small scores, but also shrinks the large ones

LEVUNW
A Taylor expansion of $\tilde\beta_{LEVUNW}$ around the point $E(w) = r\pi$ yields
$$\tilde\beta_{LEVUNW} = \hat\beta_{wls} + (X^TW_0X)^{-1}X^T\mathrm{Diag}\{\hat e_W\}(w - r\pi) + R_{LEVUNW},$$
where $\hat\beta_{wls} = (X^TW_0X)^{-1}X^TW_0y$, $\hat e_W = y - X\hat\beta_{wls}$, and $W_0 = \mathrm{Diag}\{r\pi_i\} = \mathrm{Diag}\{rh_{ii}/p\}$.

LEVUNW
$$E_W\big[\tilde\beta_{LEVUNW} \mid y\big] = \hat\beta_{wls} + E_W[R_{LEVUNW}]$$
$$\mathrm{Var}_W\big[\tilde\beta_{LEVUNW} \mid y\big] = (X^TW_0X)^{-1}X^T\mathrm{Diag}\{\hat e_W\}W_0\mathrm{Diag}\{\hat e_W\}X(X^TW_0X)^{-1} + \mathrm{Var}_W[R_{LEVUNW}]$$
Remark: for a given data set, $\tilde\beta_{LEVUNW}$ is approximately unbiased for $\hat\beta_{wls}$, but not for $\hat\beta_{ols}$.

LEVUNW
$$E\big[\tilde\beta_{LEVUNW}\big] = \beta_0$$
$$\mathrm{Var}\big[\tilde\beta_{LEVUNW}\big] = \sigma^2(X^TW_0X)^{-1}X^TW_0^2X(X^TW_0X)^{-1} + \sigma^2(X^TW_0X)^{-1}X^T\mathrm{Diag}\{I - P_{X,W_0}\}W_0\mathrm{Diag}\{I - P_{X,W_0}\}X(X^TW_0X)^{-1} + \mathrm{Var}[R_{LEVUNW}]$$
Remark: $\tilde\beta_{LEVUNW}$ is unbiased for $\beta_0$, and the variance is not inflated by small leverage scores.

Outline 1 Introduction 2 The framework 3 Bias and variance 4 Approximate computation of leverage 5 Empirical evaluation

Approximate computation
Based on Drineas et al. (2012)
- Generate an $r_1 \times n$ random matrix $\Pi_1$
- Generate a $p \times r_2$ random matrix $\Pi_2$
- Compute $R$ from the QR decomposition $\Pi_1X = QR$
- Return the leverage scores (squared row norms) of $XR^{-1}\Pi_2$
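
A sketch of the Gaussian (GFast-style) variant in NumPy. Note that a dense Gaussian $\Pi_1$ as below costs $O(r_1np)$, so the $o(np^2)$ bound requires the Hadamard-based $\Pi_1$ instead; function and variable names are mine:

```python
def approx_leverage(X, r1, r2, seed=None):
    """Approximate leverage scores via random projections:
    sketch X with Pi_1, take the QR factor R of Pi_1 X (needs r1 >= p),
    then estimate the row norms of X R^{-1} with a second projection Pi_2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Pi1 = rng.normal(0.0, 1.0 / np.sqrt(n), size=(r1, n))  # N(0, 1/n) entries
    _, R = np.linalg.qr(Pi1 @ X)                # Pi_1 X = Q R
    Z = np.linalg.solve(R.T, X.T).T             # Z = X R^{-1}
    Pi2 = rng.normal(0.0, 1.0 / np.sqrt(p), size=(p, r2))  # N(0, 1/p) entries
    return ((Z @ Pi2) ** 2).sum(axis=1)         # squared row norms
```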

Computation time
For appropriate choices of $r_1$ and $r_2$, if one chooses $\Pi_1$ to be a Hadamard-based random matrix, the computation time is $o(np^2)$.

Empirical studies
- $n = 20{,}000$ and $p = 1{,}000$
- BFast: each element of $\Pi_1$ and $\Pi_2$ is generated i.i.d. from $\{-1, 1\}$ with equal probability
- GFast: each element of $\Pi_1$ and $\Pi_2$ is generated i.i.d. from $N(0, 1/n)$ and $N(0, 1/p)$, respectively
- $r_1 = p, 1.5p, 2p, 3p, 4p, 5p$ and $r_2 = k\log(n)$ with $k = 1, 2, \ldots, 20$

Empirical studies: choice of $r_1$ and $r_2$
- As $r_1$ increases, the correlation (between approximate and exact leverage scores) is not sensitive, but the running time increases linearly
- As $r_2$ increases, the correlation increases rapidly, but the running time is not sensitive
- So choose a small $r_1$ and a large $r_2$

(Figure: choice of $r_1$ and $r_2$)

Empirical studies: computation time
- When $n \le 20{,}000$, the exact method takes less time
- When $n > 20{,}000$, the approximate approach has some advantage

(Figure: computation time)

Empirical studies: estimation comparison
- Compare the bias and variance of LEV, SLEV, and LEVUNW using exact, BFast, and GFast leverage scores
- The results are almost identical

(Figure: estimation comparison)

Outline 1 Introduction 2 The framework 3 Bias and variance 4 Approximate computation of leverage 5 Empirical evaluation

Plan
- Unconditional bias and variance for LEV and UNIF
- Unconditional bias and variance for SLEV and LEVUNW
- Conditional bias and variance for SLEV and LEVUNW
- Real data applications

Synthetic data
$y = X\beta + \epsilon$, where $\epsilon \sim N(0, 9I_n)$
- Nearly uniform leverage scores (GA): $X \sim N(1_p, \Sigma)$ with $\Sigma_{ij} = 2 \times 0.5^{|i-j|}$, and $\beta = (1_{10}^T, 0.1\,1_{p-20}^T, 1_{10}^T)^T$
- Moderately nonuniform leverage scores (T3): $X$ is drawn from a multivariate $t$-distribution with 3 degrees of freedom
- Very nonuniform leverage scores (T1): $X$ is drawn from a multivariate $t$-distribution with 1 degree of freedom
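
A data-generation sketch under my reading of this setup (the multivariate $t$ is produced by dividing a correlated normal by an independent $\sqrt{\chi^2_\nu/\nu}$; whether the $t$ designs share $\Sigma$ and the mean is my assumption):

```python
n, p = 1000, 50
rng = np.random.default_rng(0)
idx = np.arange(p)
Sigma = 2.0 * 0.5 ** np.abs(idx[:, None] - idx[None, :])  # Sigma_ij = 2 * 0.5^|i-j|
L = np.linalg.cholesky(Sigma)
Z = rng.normal(size=(n, p)) @ L.T            # rows ~ N(0, Sigma)
X_ga = 1.0 + Z                               # GA: N(1_p, Sigma)
nu = 3                                       # T3; use nu = 1 for T1
X_t = Z / np.sqrt(rng.chisquare(nu, size=(n, 1)) / nu)  # multivariate t rows
beta = np.concatenate([np.ones(10), 0.1 * np.ones(p - 20), np.ones(10)])
y = X_ga @ beta + rng.normal(0.0, 3.0, size=n)           # eps ~ N(0, 9 I_n)
```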

LEV vs UNIF: squared loss and variance
- $n = 1000$, $p = 10, 50, 100$; sampling repeated 1000 times
- The squared bias is much smaller than the variance (the variance dominates the squared loss)
- LEV and UNIF perform similarly for GA, less similarly for T3, and very differently for T1
- Both decrease as $r$ increases, but more slowly for UNIF

(Figure: LEV vs UNIF, squared loss and variance)

Improvements from SLEV and LEVUNW
- $n = 1000$, $p = 10, 50, 100$; sampling repeated 1000 times
- The methods perform similarly for GA, less similarly for T3, and differently for T1
- SLEV with $\alpha = 0.9$ and LEVUNW perform better

(Figure: improvements from SLEV and LEVUNW)

Choices of $\alpha$ in SLEV
- $n = 1000$, $p = 10, 50, 100$; sampling repeated 1000 times; T1 data
- $0.8 \le \alpha \le 0.9$ has a beneficial effect
- Recommend $\alpha = 0.9$
- LEVUNW performs better

(Figure: choices of $\alpha$ in SLEV)

Conditional bias and variance
- $n = 1000$, $p = 10, 50, 100$; sampling repeated 1000 times
- LEVUNW is biased for $\hat\beta_{ols}$
- LEVUNW has the smallest variance
- Recommend using SLEV with $\alpha = 0.9$

(Figure: conditional bias and variance)

Real data: RNA-Seq data
- $n = 51{,}751$ read counts from embryonic mouse stem cells
- $n_{ij}$ denotes the count of reads mapped to the genome starting at the $j$th nucleotide of the $i$th gene
- Response: $y_{ij} = \log(n_{ij} + 0.5)$
- Independent variables: 40 nucleotides denoted $b_{ij,-20}, b_{ij,-19}, \ldots, b_{ij,19}$
- Linear model: $y_{ij} = \alpha + \sum_{k=-20}^{19}\sum_{h \in H}\beta_{kh}I(b_{ij,k} = h) + \epsilon_{ij}$, where $H = \{A, C, G\}$ and T is used as the baseline level; $p = 121$
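
A sketch of the design-matrix construction for this model; the input format (`seqs` as 40-character strings of A/C/G/T) is my assumption, not from the paper:

```python
def rnaseq_design(seqs):
    """Intercept plus indicators I(b_{ij,k} = h) for the 40 positions
    k = -20,...,19 and h in {A, C, G} (T is the baseline level),
    giving p = 1 + 40 * 3 = 121 columns."""
    H = "ACG"
    rows = []
    for s in seqs:                   # s: one 40-nucleotide window
        feats = [1.0]                # intercept alpha
        for k in range(40):
            feats.extend(float(s[k] == h) for h in H)
        rows.append(feats)
    return np.array(rows)
```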

Sampling analysis
- UNIF, LEV, and SLEV
- $r = 2p, 3p, 4p, 5p, 10p, 20p, 50p$
- Compare sample bias (with respect to $\hat\beta_{ols}$) and variance
- Sampling repeated 100 times

Comparison
- Relatively uniform leverage scores
- Almost identical variances
- LEVUNW has slightly larger bias

(Figure: empirical results for real data I)

Real data: predicting gene expression of cancer patients
- $n = 5{,}520$ genes for 46 patients
- Randomly select one patient's gene expression as $y$ and use the remaining patients' gene expressions as predictors ($p = 45$)
- Sample sizes from 100 to 5000
- UNIF, LEV, and SLEV

Comparison
- Relatively nonuniform leverage scores
- SLEV and LEV have smaller variances
- LEVUNW has the largest bias

(Figure: empirical results for real data II)