Quantile Regression for Extraordinarily Large Data
Shih-Kang Chao, Department of Statistics, Purdue University
November 2016
Joint work with Stanislav Volgushev and Guang Cheng
With the advance of technology, it is increasingly common that a data set is so large that it cannot be stored on a single machine.
- Social media (views, likes, comments, images, ...)
- Meteorological and environmental surveillance
- Transactions in e-commerce
- Others...
Figure: A server room in Council Bluffs, Iowa. Photo: Google/Connie Zhou.
Divide and Conquer (D&C)
To exploit the opportunities in massive data, we need to deal with storage (disk) and computational (memory, processor) bottlenecks.
Divide-and-conquer paradigm: randomly divide the $N$ samples into $S$ groups of size $n = N/S$.
[Diagram: Problem($N$) split into $S$ subproblems of size $n$.]
- Aggregate the solutions of the subproblems to obtain a solution of the original problem
- Can be implemented on computational platforms such as Hadoop (White, 2012)
- Data stored in distributed locations can be analyzed in the same manner
[Diagram: Problem($N$) split into $S$ subproblems of size $n$.]
Does D&C fit statistical analysis? Sometimes it does, but sometimes it doesn't... In the following, we give two simple examples.
Example 1: sample mean
[Diagram: Problem($N$) split into four subproblems of size $n$, with subsample means $\bar X_{n,1}, \ldots, \bar X_{n,4}$.]
$$\frac{1}{4}\sum_{s=1}^{4} \bar X_s \;=\; \frac{1}{4n}\sum_{s=1}^{4}\sum_{i=1}^{n} X_{is} \;=\; \frac{1}{N}\sum_{i=1}^{N} X_i \;=\; \bar X_N.$$
It fits!
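This identity is easy to verify numerically; a minimal sketch (with synthetic normal data and equal group sizes, both assumptions of the sketch):

```python
# Averaging subsample means reproduces the full-sample mean exactly.
import numpy as np

rng = np.random.default_rng(0)
N, S = 2**15, 2**5              # total sample size, number of groups
X = rng.normal(size=N)
n = N // S

subsample_means = X.reshape(S, n).mean(axis=1)
print(subsample_means.mean())   # identical to ...
print(X.mean())                 # ... the full-sample mean
```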
Example 2: sample median
[Diagram: Problem($N$) split into four subproblems of size $n$, with subsample medians $X^{(0.5)}_1, \ldots, X^{(0.5)}_4$.]
$X^{(0.5)}_s$ = the middle value of the ordered $n$ samples in group $s$; $X^{(0.5)}$ = overall median.
$$\frac{1}{4}\sum_{s=1}^{4} X^{(0.5)}_s \;\overset{??}{=}\; X^{(0.5)}$$
Example 2: sample median
Simulation 1: $X_i \sim N(0,1)$; Simulation 2: $X_i \sim \mathrm{Exponential}(1)$. $N = 2^{15}$.
[Figure: true median vs. the simulated average $S^{-1}\sum_{s=1}^{S} X^{(0.5)}_s$ as $\log_N(S)$ grows; left panel $\tau = 0.5$, $N(0,1)$; right panel $\tau = 0.5$, $\mathrm{Exp}(1)$. The averaged medians drift away from the true median as $S$ grows, markedly so in the skewed exponential case.]
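A small companion simulation (a scaled-down version of the slides' setup; the $S$ grid and replication count are illustrative choices):

```python
# Averaged subsample medians are biased for a skewed distribution:
# for Exp(1) the true median is log(2), but the average of S subsample
# medians drifts upward as S grows (n = N/S shrinks).
import numpy as np

rng = np.random.default_rng(1)
N, reps = 2**15, 200
for S in [2**2, 2**6, 2**10]:
    n = N // S
    est = np.empty(reps)
    for r in range(reps):
        X = rng.exponential(scale=1.0, size=(S, n))
        est[r] = np.median(X, axis=1).mean()   # average of S medians
    print(f"S={S:5d}: mean estimate {est.mean():.4f} "
          f"vs true {np.log(2):.4f}")
```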
Major issues
- When does the D&C algorithm work? Especially for skewed and heavy-tailed distributions
- Statistical inference: asymptotic distribution; inference for the whole conditional distribution?
- Take advantage of the massive sample size to discover subtle patterns hidden in the whole distribution
Outline
1 Quantile regression
2 Two-step algorithm: D&C and quantile projection
3 Oracle rules: linear model and nonparametric model
4 Simulation
Quantile
Response $Y \in \mathbb{R}$, predictors $X$. For $\tau \in (0,1)$, the conditional quantile curve $Q(\cdot;\tau)$ of $Y$ conditional on $X$ is defined through
$$P\big(Y \le Q(x;\tau) \mid X = x\big) = \tau \quad \text{for all } x.$$
[Figure: $Q(x;\tau)$ at $\tau = 0.1, 0.25, 0.5, 0.75, 0.9$ under different models: $Y = 0.5X + N(0,1)$; $Y = X + (0.5+X)\,N(0,1)$; $Y \mid X = x \sim \mathrm{Beta}(1.5 - x,\ 0.5 + x)$.]
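Where $Q$ is known in closed form, the defining property can be checked numerically. A quick sketch using the first panel's model (synthetic data, not from the talk):

```python
# For Y = 0.5 X + N(0,1), the conditional quantile curve is
# Q(x; tau) = 0.5 x + Phi^{-1}(tau), so P(Y <= Q(X; tau)) should be tau.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
tau = 0.9
x = rng.uniform(-2.0, 2.0, size=10**6)
y = 0.5 * x + rng.normal(size=x.size)
q = 0.5 * x + norm.ppf(tau)        # true conditional quantile curve
print(np.mean(y <= q))             # close to tau = 0.9
```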
Quantile regression vs. mean regression
Mean regression: $Y_i = m(x_i) + \varepsilon_i$, $E[\varepsilon \mid X = x] = 0$.
- $m$: regression function, the object of interest
- $\varepsilon_i$: errors
Quantile regression: $P\big(Y \le Q(x;\tau) \mid X = x\big) = \tau$.
- No strict distinction between signal and noise
- Object of interest: properties of the conditional distribution of $Y \mid X = x$; contains much richer information than just the conditional mean
[Figure: two scatter plots of Bone Mass Density (BMD) against Age, with fitted curves.]
Quantile curves vs. conditional distribution
$Q(x_0;\tau) = F^{-1}_{Y\mid X}(\tau \mid x_0)$, where $F_{Y\mid X}(y \mid x)$ is the conditional distribution function of $Y$ given $X$.
[Figure: left panel, estimated quantile curves $\widehat Q(\mathrm{Age} = x_0;\tau)$ for $x_0 = 13, 18, 23$ plotted against $\tau$; right panel, the corresponding conditional CDFs $\widehat F(y \mid \mathrm{Age} = x_0)$ plotted against BMD.]
Quantile Regression as Optimization
$\{(X_i, Y_i)\}_{i=1}^{N}$ independent and identically distributed samples in $\mathbb{R}^{d+1}$.
Koenker and Bassett (1978): if $Q(x;\tau) = \beta(\tau)^\top x$, estimate by
$$\widehat\beta_{or}(\tau) := \arg\min_{b} \sum_{i=1}^{N} \rho_\tau\big(Y_i - b^\top X_i\big), \tag{1.1}$$
where $\rho_\tau(u) := \tau u^{+} + (1-\tau)u^{-}$ is the check function.
- The optimization problem (1.1) is convex (but non-smooth) and can be solved by linear programming
- $\widehat Q(x_0;\tau) := x_0^\top \widehat\beta_{or}(\tau)$ for any $x_0$
- More generally, we can consider a series approximation model
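A hedged illustration of (1.1): statsmodels' QuantReg minimizes the same check-function objective; the data below are synthetic, not from the talk.

```python
# Solve the Koenker-Bassett program (1.1) at tau = 0.5.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
N = 10**4
X = sm.add_constant(rng.uniform(size=(N, 2)))   # intercept + 2 covariates
beta = np.array([0.21, 0.89, 0.38])
y = X @ beta + rng.normal(scale=0.1, size=N)    # symmetric noise

fit = sm.QuantReg(y, X).fit(q=0.5)              # hat{beta}_or(0.5)
print(fit.params)                               # approx (0.21, 0.89, 0.38)
```

With symmetric noise the median regression coefficients coincide with the mean regression coefficients, which makes the output easy to sanity-check.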
A unified framework: $Q(x;\tau) \approx Z(x)^\top \beta(\tau)$
$m := \dim(Z)$ (it is possible that $m \to \infty$). Solve
$$\widehat\beta_{or}(\tau) := \arg\min_{b} \sum_{i=1}^{N} \rho_\tau\big(Y_i - b^\top Z(X_i)\big). \tag{1.2}$$
- Examples of $Z(x)$: linear model with fixed/increasing dimension, B-splines, polynomials, trigonometric polynomials
- $\widehat Q(x;\tau) := Z(x)^\top \widehat\beta_{or}(\tau)$
- Need to control the bias $\big|Q(x;\tau) - Z(x)^\top \beta(\tau)\big|$
Summary for quantile regression
- $\widehat\beta_{or}(\tau)$ is computationally infeasible when $N$ is so large that the data cannot be handled by a single machine
- To infer the whole conditional distribution at a fixed $x_0$, use
$$F_{Y\mid X}(y \mid x_0) = \int_0^1 1\{Q(x_0;\tau) < y\}\, d\tau,$$
where $1(A) = 1$ if $A$ is true. To approximate the integral, we need to compute $\widehat Q(x_0;\tau)$ for many $\tau$
D&C algorithm at fixed $\tau$
[Diagram: Problem($N$) split into four subproblems of size $n$, yielding $\widehat\beta_1(\tau_1), \ldots, \widehat\beta_4(\tau_1)$.]
$$\widehat\beta_s(\tau) := \arg\min_{b \in \mathbb{R}^m} \sum_{i=1}^{n} \rho_\tau\big(Y_{is} - b^\top Z(X_{is})\big), \tag{2.1}$$
$$\bar\beta(\tau) := \frac{1}{S}\sum_{s=1}^{S} \widehat\beta_s(\tau). \tag{2.2}$$
However, this is only for a fixed $\tau$!
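A minimal sketch of steps (2.1)-(2.2), using statsmodels' QuantReg as the subsample solver (an assumption of this sketch, not the talk's implementation) and equal group sizes:

```python
# Divide-and-conquer quantile regression at a fixed tau.
import numpy as np
import statsmodels.api as sm

def dc_quantreg(y, X, tau, S):
    """D&C estimate bar{beta}(tau): average of S subsample fits."""
    n = len(y) // S
    fits = [sm.QuantReg(y[s*n:(s+1)*n], X[s*n:(s+1)*n]).fit(q=tau).params
            for s in range(S)]
    return np.mean(fits, axis=0)

# Example usage (reusing y, X from the previous sketch):
# beta_bar = dc_quantreg(y, X, tau=0.5, S=16)
```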
Quantile projection algorithm
To avoid repeatedly applying D&C, take $B := (B_1, \ldots, B_q)^\top$, a B-spline basis defined on equidistant knots $\{t_1, \ldots, t_G\} \subset T$ with degree $r_\tau$, and set, for all $\tau$,
$$\widehat\beta(\tau) := \widehat\Xi^\top B(\tau). \tag{2.3}$$
Computation of $\widehat\Xi$:
1 Define a grid of quantile levels $\{\tau_1, \ldots, \tau_K\}$ on $[\tau_L, \tau_U]$, $K > q$. For each $\tau_k$, compute $\bar\beta(\tau_k)$
2 Compute, for $j = 1, \ldots, m$,
$$\widehat\alpha_j := \arg\min_{\alpha \in \mathbb{R}^q} \sum_{k=1}^{K} \big(\bar\beta_j(\tau_k) - \alpha^\top B(\tau_k)\big)^2. \tag{2.4}$$
3 Set the matrix $\widehat\Xi := [\widehat\alpha_1\ \widehat\alpha_2\ \cdots\ \widehat\alpha_m]$.
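A sketch of the projection step (2.3)-(2.4), assuming scipy >= 1.8 for BSpline.design_matrix; the grid size K, knot count G, degree r, and the placeholder input matrix are illustrative choices, not the talk's:

```python
import numpy as np
from scipy.interpolate import BSpline

tau_L, tau_U, K, G, r = 0.1, 0.9, 30, 10, 3
taus = np.linspace(tau_L, tau_U, K)               # quantile grid
# Clamped knot vector on [tau_L, tau_U]; q = G + r - 1 basis functions.
knots = np.r_[[tau_L]*r, np.linspace(tau_L, tau_U, G), [tau_U]*r]
B = BSpline.design_matrix(taus, knots, r).toarray()   # K x q

# Rows of beta_bar should be the D&C estimates bar{beta}(tau_k) from
# (2.2); random placeholders stand in here.
m = 4
beta_bar = np.random.default_rng(4).normal(size=(K, m))

# Least squares (2.4) for all m coordinates at once: Xi is q x m,
# with columns alpha_1, ..., alpha_m.
Xi, *_ = np.linalg.lstsq(B, beta_bar, rcond=None)

def beta_hat(tau):
    """Projection estimator beta_hat(tau) = Xi^T B(tau), see (2.3)."""
    b = BSpline.design_matrix(np.atleast_1d(tau), knots, r).toarray()
    return (b @ Xi).ravel()
```

Solving (2.4) for all coordinates jointly is just one least-squares call, since the design matrix $B$ is shared across $j$.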
Computation of $\widehat F_{Y\mid X}(y \mid x)$
Define $\widehat Q(x;\tau) := Z(x)^\top \widehat\beta(\tau) = Z(x)^\top \widehat\Xi^\top B(\tau)$. Compute
$$\widehat F_{Y\mid X}(y \mid x_0) := \tau_L + \int_{\tau_L}^{\tau_U} 1\{\widehat Q(x_0;\tau) < y\}\, d\tau, \tag{2.5}$$
where $0 < \tau_L < \tau_U < 1$.
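A minimal numerical version of (2.5): a Riemann sum of the indicator over a fine $\tau$ grid, reusing the beta_hat sketch above. The grid size is an illustrative choice.

```python
import numpy as np

def F_hat(y, Zx0, beta_hat, tau_L=0.1, tau_U=0.9, grid=1001):
    """Plug-in estimate of F_{Y|X}(y | x0) from projected quantiles."""
    taus = np.linspace(tau_L, tau_U, grid)
    Q = np.array([Zx0 @ beta_hat(t) for t in taus])   # Q_hat(x0; tau)
    # Riemann-sum approximation of the indicator integral in (2.5)
    return tau_L + (tau_U - tau_L) * np.mean(Q < y)
```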
Remarks
- Both $S$ and $K$ can grow with $N$; they determine the computational cost as well as the statistical performance
- Goal: find $S$ and $K$ such that $\bar\beta(\tau)$ and $\widehat\beta(\tau)$ are close to $\widehat\beta_{or}(\tau)$ in a suitable statistical sense
- This yields a sharp upper bound on $S$ and a lower bound on $K$, which impose computational limits
- The two-step procedure requires only one pass through the entire data
- Quantile projection requires only $\{\bar\beta(\tau_1), \ldots, \bar\beta(\tau_K)\}$, of size $m \times K$, without the need to access the raw data set
Oracle rules
For $T = [\tau_L, \tau_U] \subset (0,1)$, under regularity conditions (Chao, Volgushev, and Cheng, 2016):
$$a_N\, u^\top\big(\widehat\beta_{or}(\tau) - \beta(\tau)\big) \rightsquigarrow \mathcal{N} \quad \text{at any fixed } \tau \in T, \tag{3.1}$$
$$a_N\, u^\top\big(\widehat\beta_{or}(\cdot) - \beta(\cdot)\big) \rightsquigarrow G(\cdot) \quad \text{as a process in } \tau \in T, \tag{3.2}$$
where $\mathcal{N}$ is a centered normal distribution and $G$ is a centered Gaussian process.
- The oracle rule holds for $\bar\beta(\tau)$ if it satisfies (3.1), and for $\widehat\beta(\cdot)$ if it satisfies (3.2)
- Inference follows from the oracle rule
Two leading models
- Linear model: $m = \dim(Z(x))$ is fixed, and $Q(x;\tau) = Z(x)^\top\beta(\tau)$
- Univariate spline nonparametric model: $m = \dim(Z(x)) \to \infty$ with $N$, and
$$c_N(\gamma_N) := \sup_{x,\,\tau} \big|Q(x;\tau) - Z(x)^\top\gamma_N(\tau)\big| \to 0,$$
$$\gamma_N(\tau) := \arg\min_{\gamma \in \mathbb{R}^m} E\Big[\big(Z^\top\gamma - Q(X;\tau)\big)^2 f\big(Q(X;\tau) \mid X\big)\Big]. \tag{3.3}$$
$\gamma_N$ can be defined in many other ways; we do not go into detail here.
Notation: $J_m(\tau) := E\big[ZZ^\top f_{Y\mid X}(Q(X;\tau)\mid X)\big]$. Conditions imposed throughout the talk: Assumption (A) (see appendix).
Linear model with fixed dimension
$\mathcal{P}_1(\xi_m, M, \bar f, \bar{f'}, f_{\min})$: all distributions satisfying (A1)-(A3) with some constants $0 < \xi_m, M, \bar f, \bar{f'} < \infty$ and $f_{\min} > 0$.
Theorem 3.1 (Oracle Rule for $\bar\beta(\tau)$)
Suppose that $S = o\big(N^{1/2}(\log N)^{-2}\big)$. Then for all $\tau \in T$ and $u \in \mathbb{R}^m$,
$$\frac{\sqrt{N}\, u^\top\big(\bar\beta(\tau) - \beta(\tau)\big)}{\big(u^\top J_m(\tau)^{-1} E[Z(X)Z(X)^\top] J_m(\tau)^{-1} u\big)^{1/2}} \rightsquigarrow \mathcal{N}\big(0, \tau(1-\tau)\big).$$
If $S \gtrsim N^{1/2}$, then the weak convergence result above fails for some $P \in \mathcal{P}_1(1, d, \bar f, \bar{f'}, f_{\min})$.
$S = o\big(N^{1/2}(\log N)^{-2}\big)$ is sharp: we only miss by a factor of $(\log N)^2$.
$\Lambda_c^{\eta_\tau}(T)$: Hölder class with smoothness $\eta_\tau > 0$. $\ell^\infty(T)$: set of all uniformly bounded real functions on $T$.
Theorem 3.2 (Oracle Rule for $\widehat\beta(\cdot)$)
Suppose that $\tau \mapsto Q(x_0;\tau) \in \Lambda_c^{\eta_\tau}(T)$ for a fixed $x_0 \in \mathcal{X}$. If $S = o\big(N^{1/2}(\log N)^{-2}\big)$ and $K \ge G \gg N^{1/(2\eta_\tau)}$ with $r_\tau \ge \eta_\tau$, then the projection estimator $\widehat\beta(\tau)$ defined in (2.3) satisfies
$$\sqrt{N}\big(Z(x_0)^\top\widehat\beta(\cdot) - Q(x_0;\cdot)\big) \rightsquigarrow G(\cdot) \quad \text{in } \ell^\infty(T), \tag{3.4}$$
where $G$ is a centered Gaussian process (covariance function in the appendix, (5.1)).
If $G \lesssim N^{1/(2\eta_\tau)}$, the weak convergence in (3.4) fails for some $P \in \mathcal{P}_1(1, d, \bar f, \bar{f'}, f_{\min})$ with $\tau \mapsto Q(x_0;\tau) \in \Lambda_c^{\eta_\tau}(T)$.
$K \asymp N^{1/(2\eta_\tau)}$ is sharp!
Splines nonparametric model
We require an additional condition:
(L) For each $x \in \mathcal{X}$, the vector $Z(x)$ has zeroes in all but at most $r$ consecutive entries, where $r$ is fixed. Furthermore, $\sup_{x \in \mathcal{X}} \tilde E(Z(x), \gamma_N) = O(1)$.
This guarantees that the matrix $J_m(\tau)$ is a block matrix.
$$\mathcal{P}_L(Z, M, \bar f, \bar{f'}, f_{\min}, R) := \big\{\text{all sequences of distributions of } (X, Y) \text{ on } \mathbb{R}^{d+1} \text{ satisfying (A1)-(A3) with } M, \bar f, \bar{f'} < \infty,\ f_{\min} > 0, \text{ and (L) for some } r < R;\ m^2(\log N)^6 = o(N),\ c_N(\gamma_N)^2 = o(N^{-1/2})\big\}. \tag{3.5}$$
Splines nonparametric model
Theorem 3.3 (Oracle rule for $\bar\beta(\tau)$)
Let $\{(X_i, Y_i)\}_{i=1}^{N}$ be distributed as $P_N \in \mathcal{P}_L(Z, M, \bar f, \bar{f'}, f_{\min}, R)$, with $m \asymp N^\zeta$ for some $\zeta > 0$. If $S = o\big((N m^{-1}(\log N)^{-4})^{1/2}\big)$, then for all $\tau \in T$ and $x_0 \in \mathcal{X}$,
$$\frac{\sqrt{N}\, Z(x_0)^\top\big(\bar\beta(\tau) - \gamma_N(\tau)\big)}{\big(Z(x_0)^\top J_m(\tau)^{-1} E[ZZ^\top] J_m(\tau)^{-1} Z(x_0)\big)^{1/2}} \rightsquigarrow \mathcal{N}\big(0, \tau(1-\tau)\big).$$
If $S \gtrsim (N/m)^{1/2}$, the weak convergence above fails for some $P_N \in \mathcal{P}_L(Z, M, \bar f, \bar{f'}, f_{\min}, R)$.
Sharpness of $S = o\big((N/m)^{1/2}(\log N)^{-2}\big)$: the nonparametric rate is slower; again we only miss by a factor of $(\log N)^2$.
Theorem 3.4 (Oracle rule for $\widehat\beta(\tau)$)
Let the assumptions of Theorem 3.3 hold and assume that $\tau \mapsto Q(x_0;\tau) \in \Lambda_c^{\eta_\tau}(T)$ with $r_\tau \ge \eta_\tau$. Suppose that $c_N(\gamma_N) = o\big(N^{-1/2}\|Z(x_0)\|\big)$ and $K \ge G \gg N^{1/(2\eta_\tau)}\|Z(x_0)\|^{-1/\eta_\tau}$, and that the limit $H(\tau_1, \tau_2) := \lim_{N\to\infty} K_N(\tau_1, \tau_2) > 0$ exists for all $\tau_1, \tau_2 \in T$. Then
$$\frac{\sqrt{N}\big(Z(x_0)^\top\widehat\beta(\cdot) - Q(x_0;\cdot)\big)}{\|Z(x_0)\|} \rightsquigarrow G(\cdot) \quad \text{in } \ell^\infty(T),$$
where $G$ is a centered Gaussian process with covariance structure $E[G(\tau)G(\tau')] = H(\tau, \tau')$.
If $G \lesssim N^{1/(2\eta_\tau)}\|Z(x_0)\|^{-1/\eta_\tau}$, the weak convergence fails for some $\tau \mapsto Q(x_0;\tau) \in \Lambda_c^{\eta_\tau}(T)$, for all $S$.
Distribution function
$$\widehat F^{or}_{Y\mid X}(y \mid x_0) := \tau_L + \int_{\tau_L}^{\tau_U} 1\{Z(x_0)^\top\widehat\beta_{or}(\tau) < y\}\, d\tau$$
The oracle rule holds for both models.
Corollary 3.5
Under the same conditions as Theorem 3.2 (linear model) or Theorem 3.4 (nonparametric spline model), we have for any $x_0 \in \mathcal{X}$,
$$\sqrt{N}\big(\widehat F_{Y\mid X}(\cdot \mid x_0) - F_{Y\mid X}(\cdot \mid x_0)\big) \rightsquigarrow f_{Y\mid X}(\cdot \mid x_0)\, G\big(F_{Y\mid X}(\cdot \mid x_0)\big)$$
in $\ell^\infty\big((Q(x_0;\tau_L), Q(x_0;\tau_U))\big)$, where $G$ is the centered Gaussian process defined in the respective theorem. The same holds for $\widehat F^{or}_{Y\mid X}(\cdot \mid x_0)$.
Phase transitions
[Figure: regions of $(S, K)$ for the oracle rule of the linear model and the spline nonparametric model. On the $K$ axis, the thresholds $N^{1/(2\eta_\tau)}$ and $(N/m)^{1/(2\eta_\tau)}$; on the $S$ axis, the thresholds $(N/m)^{1/2}(\log N)^{-2}$, $(N/m)^{1/2}$, $N^{1/2}(\log N)^{-2}$, and $N^{1/2}$. The "?" regions mark the discrepancy between the sufficient and necessary conditions.]
Setting: linear model
- Model: $Y_i = 0.21 + \beta_{m-1}^\top X_i + \varepsilon_i$, $m = 4, 16, 32$
- $X_i$ follows a multivariate uniform distribution $U([0,1]^{m-1})$ with covariance matrix $\Sigma_X := E[X_i X_i^\top]$, $\Sigma_{jk} = 0.1^2 \cdot 0.7^{|j-k|}$ for $j, k = 1, \ldots, m-1$
- $\varepsilon \sim \mathcal{N}$ or $\varepsilon \sim \mathrm{EXP}$ (skewed)
- $x_0 = \big(1, (m-1)^{-1/2}\mathbf{1}_{m-1}^\top\big)^\top$
From Theorem 3.1, the coverage probability of the 95% confidence interval ($\alpha = 0.95$) is
$$P\Big\{x_0^\top\beta(\tau) \in \Big[x_0^\top\bar\beta(\tau) \pm N^{-1/2} f_{\varepsilon,\tau}^{-1}\sqrt{\tau(1-\tau)\, x_0^\top\Sigma_X^{-1} x_0}\;\Phi^{-1}(1-\alpha/2)\Big]\Big\}$$
- $S^*$: the value of $S$ at which the coverage starts to drop
(Additional details in the appendix.)
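A scaled-down sketch of this coverage experiment (normal errors, $m = 4$, $\tau = 0.5$; independent uniform covariates instead of the talk's correlated ones, $\Sigma_X$ estimated empirically, and a small $S$ grid and replication count; all simplifications of this sketch):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(5)
N, m, tau, alpha, reps = 2**12, 4, 0.5, 0.05, 100
coef = np.array([0.21, 0.21, 0.89, 0.38])        # intercept, then beta_3
x0 = np.r_[1.0, np.full(m - 1, (m - 1) ** -0.5)]
q_eps = norm.ppf(tau, scale=0.1)                 # tau-quantile of eps
f_eps = norm.pdf(q_eps, scale=0.1)               # f_{eps,tau}
target = x0 @ coef + q_eps                       # x0' beta(tau)

for S in [1, 4, 16, 64]:
    n, hits = N // S, 0
    for _ in range(reps):
        X = sm.add_constant(rng.uniform(size=(N, m - 1)))
        y = X @ coef + rng.normal(scale=0.1, size=N)
        # D&C estimate: average of S subsample quantile regressions
        bbar = np.mean([sm.QuantReg(y[s*n:(s+1)*n], X[s*n:(s+1)*n])
                        .fit(q=tau).params for s in range(S)], axis=0)
        Sig_inv = np.linalg.inv(X.T @ X / N)     # empirical Sigma_X^{-1}
        half = (norm.ppf(1 - alpha / 2) / (f_eps * np.sqrt(N))
                * np.sqrt(tau * (1 - tau) * x0 @ Sig_inv @ x0))
        hits += abs(x0 @ bbar - target) <= half
    print(f"S = {S:3d}: empirical coverage {hits / reps:.2f}")
```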
$\bar\beta(\tau)$, $\varepsilon \sim N(0, 0.1^2)$
[Figure: coverage probability against $\log_N(S)$ at $\tau = 0.1, 0.5, 0.9$, with curves for $N = 2^{14}, 2^{22}$ and $m = 4, 16, 32$.]
$\bar\beta(\tau)$, $\varepsilon \sim \mathrm{Exp}(0.8)$
[Figure: coverage probability against $\log_N(S)$ at $\tau = 0.1, 0.5, 0.9$, with curves for $N = 2^{14}, 2^{22}$ and $m = 4, 16, 32$.]
Summary for simulation of $\bar\beta(\tau)$
- As $m$ increases, $S^*$ decreases
- As $N$ increases, $S^*$ gets close to $N^{1/2}$
- For $\varepsilon \sim \mathcal{N}$, coverage is symmetric in $\tau$; for $\varepsilon \sim \mathrm{Exp}$, coverage is asymmetric in $\tau$
- The t distribution behaves similarly to the normal distribution
Quantile projection setting
- $B$: cubic B-spline basis with $q = \dim(B)$, defined on $G = 4 + q$ equidistant knots on $[\tau_L, \tau_U]$
- We require $K > q$ so that $\widehat\beta(\tau)$ is computable (see (2.4))
- $N = 2^{14}$
- $y_0 = Q(x_0;\tau)$, so that $F_{Y\mid X}(y_0 \mid x_0) = \tau$
From Theorem 3.2, the coverage probability with size $\alpha = 0.95$ is
$$P\Big\{\tau \in \Big[\widehat F_{Y\mid X}\big(Q(x_0;\tau) \mid x_0\big) \pm N^{-1/2}\sqrt{\tau(1-\tau)\, x_0^\top\Sigma_X^{-1} x_0}\;\Phi^{-1}(1-\alpha/2)\Big]\Big\}$$
$\widehat F_{Y\mid X}(y \mid x)$, $\varepsilon \sim N(0, 0.1^2)$, $m = 4$ for $\bar\beta(\tau)$
[Figure: coverage probability against $\log_N(S)$ at $y_0 = Q(x_0;\tau)$ for $\tau = 0.1, 0.5, 0.9$, with curves for $q = 7, 10, 15$ and $K = 20, 30$.]
$\widehat F_{Y\mid X}(y \mid x)$, $\varepsilon \sim N(0, 0.1^2)$, $m = 32$ for $\bar\beta(\tau)$
[Figure: coverage probability against $\log_N(S)$ at $y_0 = Q(x_0;\tau)$ for $\tau = 0.1, 0.5, 0.9$, with curves for $q = 7, 10, 15$ and $K = 20, 30$.]
$\widehat F_{Y\mid X}(y \mid x)$, $\varepsilon \sim \mathrm{Exp}(0.8)$, $m = 4$ for $\bar\beta(\tau)$
[Figure: coverage probability against $\log_N(S)$ at $y_0 = Q(x_0;\tau)$ for $\tau = 0.1, 0.5, 0.9$, with curves for $q = 7, 10, 15$ and $K = 20, 30$.]
$\widehat F_{Y\mid X}(y \mid x)$, $\varepsilon \sim \mathrm{Exp}(0.8)$, $m = 32$ for $\bar\beta(\tau)$
[Figure: coverage probability against $\log_N(S)$ at $y_0 = Q(x_0;\tau)$ for $\tau = 0.1, 0.5, 0.9$, with curves for $q = 7, 10, 15$ and $K = 20, 30$.]
Summary for simulation of $\widehat F_{Y\mid X}(y \mid x)$
- An increase in $m$ lowers both $S^*$ and $q^*$ (the critical value of $q$ for the oracle rule)
- Either $S > S^*$ or $q < q^*$ (with $q = G - 4$) leads to the collapse of the oracle rule
- Increasing $q$ and $K$ improves the coverage probability
- For $\varepsilon \sim \mathcal{N}$, coverage is no longer symmetric in $\tau$
Thank you for your attention
References
Chao, S.-K., Volgushev, S., and Cheng, G. (2016). Quantile process for semi and nonparametric regression models. arXiv preprint arXiv:1604.02130.
White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media / Yahoo Press.
Assumption (A): data (X i, Y i ) i=1,...,n form triangular array and are row-wise i.i.d. with Linear models (A1) Assume that Z i ξ m <, where ξ m = O(N b ) is allowed to increase to infinity, and that 1/M λ min (E[ZZ T ]) λ max (E[ZZ T ]) M holds uniformly in n for some fixed constant M. (A2) The conditional distribution F Y X (y x) is twice differentiable w.r.t. y. Denote the corresponding derivatives by f Y X (y x) and f Y X (y x). Assume that f := sup y,x uniformly in n. (A3) Assume that uniformly in n. f Y X (y x) <, f := sup f Y X (y x) < y,x 0 < f min inf inf f Y X(Q(x; τ) x) τ T x
Covariance of $G$, linear model:
$$E\big[G(\tau)G(\tau')\big] = Z(x_0)^\top J_m(\tau)^{-1} E\big[Z(X)Z(X)^\top\big] J_m(\tau')^{-1} Z(x_0)\,\big(\tau \wedge \tau' - \tau\tau'\big). \tag{5.1}$$
Covariance, spline model:
$$K_N(\tau_1, \tau_2) := \|Z(x_0)\|^{-2}\, Z(x_0)^\top J_m^{-1}(\tau_1) E[ZZ^\top] J_m^{-1}(\tau_2) Z(x_0)\,\big(\tau_1 \wedge \tau_2 - \tau_1\tau_2\big)$$
Simulation setting (continued):
$$\beta_3 = (0.21, 0.89, 0.38)^\top; \quad \beta_{15} = \big(\beta_3^\top, 0.63, 0.11, 1.01, 1.79, 1.39, 0.52, 1.62, 1.26, 0.72, 0.43, 0.41, 0.02\big)^\top; \quad \beta_{31} = \big(\beta_{15}^\top, 0.21, \beta_{15}^\top\big)^\top. \tag{5.2}$$
$$\beta(\tau) = \big(0.21 + 0.1\,\Phi^{-1}_{\sigma=0.1}(\tau),\ \beta_{m-1}^\top\big)^\top$$
$f_{\varepsilon,\tau} > 0$ is the height of the density of $\varepsilon_i$ evaluated at the $\tau$-quantile of $\varepsilon_i$.