M-Estimation under High-Dimensional Asymptotics


1 M-Estimation under High-Dimensional Asymptotics

2 "An out-of-the-park grand-slam home run" — Richard Olshen, on the 1964 Annals of Mathematical Statistics paper (Huber, "Robust Estimation of a Location Parameter").

3 Outline: Classical M-estimation · Big Data M-estimation

4 M-estimation Basics

Location model: $Y_i = \theta + Z_i$, $i = 1, \dots, n$; errors $Z_i \sim F$, not necessarily Gaussian.

Loss function $\rho(t)$, e.g. $t^2$, $|t|$, $-\log f(t)$, ...

(M)  $\min_\theta \sum_{i=1}^n \rho(Y_i - \theta)$

Asymptotic distribution: $\sqrt{n}(\hat\theta_n - \theta) \to_D N(0, V)$, $n \to \infty$.

Asymptotic variance, with $\psi = \rho'$:
$V(\psi, F) = \dfrac{\int \psi^2 \, dF}{\left(\int \psi' \, dF\right)^2}$

Information bound: $V(\psi, F) \ge \dfrac{1}{I(F)}$.
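
To make the (M) problem concrete, here is a minimal numerical sketch (not from the talk): it computes a Huber M-estimate of location for contaminated-normal data and compares it to the mean and median. The contamination mixture, the choice λ = 1.345, and the use of numpy/scipy are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, theta_true, lam = 1000, 1.0, 1.345
# Contaminated-normal errors: mostly N(0,1), occasionally N(0,10^2).
Z = np.where(rng.random(n) < 0.95, rng.normal(0, 1, n), rng.normal(0, 10, n))
Y = theta_true + Z

def rho_huber(t):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(t)
    return np.where(a <= lam, 0.5 * t**2, lam * a - 0.5 * lam**2)

# (M): minimize the empirical risk over theta.
theta_hat = minimize_scalar(lambda th: rho_huber(Y - th).sum()).x
print("Huber M-estimate:", theta_hat)
print("sample mean     :", Y.mean())
print("sample median   :", np.median(Y))
```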

5 The One-Step Viewpoint

[Facsimile of the first page of: P. J. Bickel, "One-Step Huber Estimates in the Linear Model," JASA 1975. Simple one-step versions of Huber's (M) estimates for the linear model are introduced; under mild regularity conditions they are consistent and asymptotically normal with variance $K(\psi, F)/n$, where $K(\psi, F) = \int \psi(t)^2 f(t)\, dt \,\big/\, \left(\int \psi'(t) f(t)\, dt\right)^2$.]

6 Regression M-estimation: the One-Step Viewpoint

Regression model: $Y_i = X_i^T \theta + Z_i$, $Z_i$ iid $F$, $i = 1, \dots, n$.

Objective function of (M): $R(\vartheta) = \sum_{i=1}^n \rho(Y_i - X_i^T \vartheta)$;  (M)  $\min_\vartheta R(\vartheta)$.

One-step estimate: with $\tilde\theta_n$ any $\sqrt{n}$-consistent estimate of $\theta$,
$\hat\theta_1 = \tilde\theta_n - [\operatorname{Hess} R|_{\tilde\theta_n}]^{-1} \nabla R|_{\tilde\theta_n}$.

Effectiveness: with $\hat\theta$ the true solution of the M-equation,
$E(\hat\theta_1 - \hat\theta)(\hat\theta_1 - \hat\theta)^T = o(n^{-1})$.
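
A minimal sketch of the one-step estimate above, assuming a least-squares pilot as the $\sqrt{n}$-consistent starting point and a Huber $\psi$; the sizes, λ, and error law are illustrative choices, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 500, 5, 1.345
X = rng.normal(0, 1 / np.sqrt(n), size=(n, p))      # design with N(0, 1/n) entries
theta_true = rng.normal(size=p)
Y = X @ theta_true + rng.standard_t(df=3, size=n)   # heavy-tailed errors

psi  = lambda r: np.clip(r, -lam, lam)               # psi = rho' (Huber)
dpsi = lambda r: (np.abs(r) <= lam).astype(float)    # psi'

theta_pilot = np.linalg.lstsq(X, Y, rcond=None)[0]   # root-n-consistent pilot (least squares)
r = Y - X @ theta_pilot
grad = -X.T @ psi(r)                                 # gradient of R at the pilot
hess = X.T @ (dpsi(r)[:, None] * X)                  # Hessian of R at the pilot
theta_onestep = theta_pilot - np.linalg.solve(hess, grad)   # one Newton step
print(theta_onestep - theta_true)
```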

7 Driving Idea of Classical Asymptotics

The M-estimate is asymptotically equivalent to a single step of Newton's method for finding a zero of $\nabla R$, starting at the true underlying parameter. Goes back to Fisher's Method of Scoring for the MLE.

8 Derivation of Asymptotic Variance Formula

Approximation to the one-step:
$\hat\theta_1 = \theta + \dfrac{1}{B(\psi, F)} (X^T X)^{-1} X^T (\psi(Z_i)) + o_p(n^{-1/2})$, where $B(\psi, F) = \int \psi' \, dF$.

Observe that $\operatorname{Var}\left((X^T X)^{-1} X^T (\psi(Z_i))\right) \approx (X^T X)^{-1} A(\psi, F)$, where $A(\psi, F) = \int \psi^2 \, dF$.

Hence if $X_{i,j} \sim N(0, \tfrac{1}{n})$,
$\operatorname{Var}(\hat\theta_i - \theta_i) \approx \dfrac{A(\psi, F)}{B(\psi, F)^2} = V(\psi, F)$.

9 Asymptotics for Regression, I

10 Asymptotics for Regression, II — P. J. Huber, Annals of Statistics 1973

11 (figure slide)

12 (figure slide)

13 [Facsimile of the first page of: N. El Karoui, D. Bean, P. J. Bickel, C. Lim, and B. Yu, "On robust regression with high-dimensional predictors," PNAS 2013. The paper studies regression M-estimates when $p$ and $n$ are both large with $p < n$; it gives an exact stochastic representation for the distribution of $\hat\beta = \arg\min_\beta \sum_{i=1}^n \rho(Y_i - X_i^T\beta)$, characterized by a nonlinear system of two deterministic equations when $p/n \to \kappa > 0$. One surprising consequence: when $p/n$ is large enough, least squares becomes preferable to least absolute deviations for double-exponential errors.]

14 Bickel, Yu, El Karoui

15 High-Dimensional Asymptotics (HDA)

$n, p_n \to \infty$; $X_{i,j}$ iid $N(0, \tfrac{1}{n})$; $p_n^{-1}\|\theta_{0,n}\|_2^2 \to \tau_0^2$; $Y = X\theta_0 + Z$; $n/p_n \to \delta \in (1, \infty)$.

Meaning of $\delta$: number of observations per parameter to be estimated.

16 Emergent Phenomena under HDA

Classical setting (random design, $p$ fixed, $n \to \infty$):
$\operatorname{var}(\hat\theta_i) \to V(\psi, F)$, $n \to \infty$.

HDA setting, for $n/p_n \to \delta \in (1, \infty)$:
- Effective score $\Psi^* = \Psi_{\delta,\psi,F}$ (to be described ...)
- Effective error distribution $F^* = F \star N(0, \tau_*^2)$
- Extra Gaussian noise: $\tau_* = \tau_*(\delta, \psi, F)$.

Asymptotic variance under HDA:
$\operatorname{var}(\hat\theta_i) \to V(\Psi^*, F^*)$, $n, p_n \to \infty$.

Classical correspondence: $\Psi_{\delta,\psi,F} \to \psi$, $F^*_{\delta,\psi,F} \to F$, as $\delta \to \infty$.

17 Immediate Implications

1. Classical formulas for confidence statements about M-estimates are overoptimistic under high-dimensional asymptotics (even dramatically so).
2. Maximum likelihood estimates are inefficient under high-dimensional asymptotics: the score $\psi = -(\log f)'$ does not yield an efficient estimator.
3. The usual Fisher information bound is not attainable, as $I(F^*) < I(F)$.

18 Sample implication of DLD & Montanari (2013)

Corollary. For an M-estimator under HDA with errors $Z_i$ iid $F$:
$\lim_{n\to\infty} \operatorname{Ave}_i \operatorname{var}(\hat\theta_i) \ge \dfrac{1}{1 - 1/\delta} \cdot \dfrac{1}{I(F)}$

The effect of the HDA parameter $\delta \in (1, \infty)$ is evident.

19 (figure slide)

20 Regularized ρ

Definition. $\rho: \mathbb{R} \to \mathbb{R}$ is smooth if it is $C^1$ with absolutely continuous derivative $\psi = \rho'$ having a bounded a.e. derivative $\psi'$.

Excludes $\ell_1$: $\rho_{\ell_1}(z) = |z|$. Allows Huber: $\rho_H(z) = \min(z^2/2,\; \lambda|z| - \lambda^2/2)$.

Regularized $\rho$:
$\rho_b(z) \equiv \min_{x \in \mathbb{R}} \left\{ b\rho(x) + \tfrac{1}{2}(x - z)^2 \right\}$,  (1)

the min-convolution of $\rho$ with a square loss.

21 Regularized Score Function

Regularized score function: $\Psi(z; b) = \rho_b'(z)$.

Example: Huber loss $\rho_H(z; \lambda)$, with score function $\psi(z; \lambda) = \min(\max(-\lambda, z), \lambda)$:
$\Psi(z; b) = b\, \psi\!\left( \dfrac{z}{1 + b};\, \lambda \right)$.

$\Psi$ is like $\psi$, but with central slope $\Psi'(\cdot\,; b) = \dfrac{b}{1+b} < 1$.

Effective score: for a special choice $b_*$, $\Psi^* = \Psi(z; b_*)$.
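
A small sketch checking the Huber formula for $\Psi(z;b)$ above against a direct numerical evaluation of the min-convolution (1) (the derivative of the Moreau envelope is $z - \operatorname{prox}_{b\rho}(z)$); the specific λ and b values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

lam, b = 3.0, 0.7

def rho_huber(t):
    a = abs(t)
    return 0.5 * t * t if a <= lam else lam * a - 0.5 * lam * lam

def psi_huber(t):
    return min(max(t, -lam), lam)

def Psi_via_prox(z):
    # rho_b is the min-convolution (1); its derivative equals z - prox_{b*rho}(z).
    prox = minimize_scalar(lambda x: b * rho_huber(x) + 0.5 * (x - z) ** 2).x
    return z - prox

def Psi_closed_form(z):
    return b * psi_huber(z / (1 + b))

for z in (-10.0, -1.0, 0.3, 2.5, 8.0):
    print(f"{z:6.1f}  {Psi_via_prox(z):.6f}  {Psi_closed_form(z):.6f}")
```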

22 Approximate Message Passing Algorithm

Initialization: $\theta^0 = 0$.

Adjusted residuals:
$R^t = Y - X\theta^t + \Psi(R^{t-1}; b_{t-1})$;  (2)

Effective score: choose $b_t > 0$ achieving empirical average slope $p/n \in (0, 1)$:
$\dfrac{p}{n} = \dfrac{1}{n} \sum_{i=1}^n \Psi'(R_i^t; b)$.  (3)

Scoring: apply the effective score function $\Psi(R^t; b_t)$:
$\theta^{t+1} = \theta^t + \delta\, X^T \Psi(R^t; b_t)$.  (4)
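
Below is a minimal Python sketch of steps (2)-(4) for the Huber loss, using the closed form $\Psi(z;b) = b\,\psi(z/(1+b);\lambda)$ from the previous slide and a root-finder for $b_t$ in (3). The data-generating choices (n, p, λ, the contaminated-normal errors) roughly follow the running example but are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
n, p, lam = 1000, 200, 3.0
delta = n / p
X = rng.normal(0, 1 / np.sqrt(n), size=(n, p))
theta0 = rng.normal(size=p)
Z = np.where(rng.random(n) < 0.95, rng.normal(0, 1, n), 10.0)   # 0.95 N(0,1) + 0.05 mass at 10
Y = X @ theta0 + Z

def Psi(r, b):                       # regularized Huber score, Psi(z;b) = b psi(z/(1+b); lam)
    return b * np.clip(r / (1 + b), -lam, lam)

def dPsi(r, b):                      # its derivative in r
    return (b / (1 + b)) * (np.abs(r) <= (1 + b) * lam)

def choose_b(r):                     # b_t with empirical average slope p/n = 1/delta, eq. (3)
    return brentq(lambda b: dPsi(r, b).mean() - 1.0 / delta, 1e-6, 1e6)

theta = np.zeros(p)                  # theta^0 = 0
R_prev, b_prev = np.zeros(n), 0.0    # R^{-1} = 0
for t in range(20):
    R = Y - X @ theta + Psi(R_prev, b_prev)      # adjusted residuals, eq. (2)
    b = choose_b(R)
    theta = theta + delta * (X.T @ Psi(R, b))    # effective-score step, eq. (4)
    R_prev, b_prev = R, b
print("RMSE(theta^t, theta_0):", np.sqrt(np.mean((theta - theta0) ** 2)))
```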

23 Determining the Regularization Parameter b_t

[Figure, two panels: determination of $b_t$ (average slope vs. $b$) and the history of $b_t$ over AMP iterations $t$, for the running example $\rho = \rho_H(\cdot\,; 3)$, $F = 0.95\,N(0,1) + 0.05\,H_{10}$.]

24 Connection of AMP to M-estimation

Lemma. Let $(\theta^*, R^*, b_*)$ be a fixed point of the AMP iteration (2), (3), (4) having $b_* > 0$. Then $\theta^*$ is a minimizer of the problem (M). Conversely, any minimizer $\hat\theta$ of the problem (M) corresponds to an AMP fixed point of the form $(\hat\theta, R^*, b_*)$.

25 Example

Size: $n = 1000$, $p = 200$, so $\delta = 5$.
Truth: $\theta_0$ random, $\|\theta_0\|_2^2 = 6p$.
Errors: $F = CN(0.05, 10)$, i.e. $F = 0.95\,\Phi + 0.05\,H_{10}$, where $H_x$ denotes unit mass at $x$.
Loss: $\rho = \rho_H(z; \lambda)$ with $\lambda = 3$.
Iterations: run AMP for 20 iterations.
Comparison: use CVX to obtain the Huber estimator directly.

26 Convergence of AMP in Example

[Figure, two panels: left, RMSE$(\hat\theta^t, \theta_0)$ vs. AMP iteration $t$ (convergence of AMP $\hat\theta^t$ to $\theta_0$); right, RMSE$(\hat\theta^t, \hat\theta)$ vs. AMP iteration $t$ (convergence of AMP $\hat\theta^t$ to the M-estimate $\hat\theta$).]

27 Contrast AMP with Traditional Scoring

Traditional residual: $Z = Y - X\theta$.

Traditional scoring:
$\theta^1 = \theta + \left[ \dfrac{1}{n} \sum_{i=1}^n \psi'(Z_i) \right]^{-1} (X^T X)^{-1} X^T (\psi(Z_i))$,  (5)

AMP:
$\theta^{t+1} = \theta^t + \delta\, X^T \Psi(R^t; b_t)$.

Object        | Scoring                              | AMP
Average slope | $\sum_{i=1}^n \psi'(Z_i)/n$          | $\sum_{i=1}^n \Psi'(R_i^t; b_t)/n = \delta^{-1}$
Gram          | $(X^T X)^{-1}$                       | $1$
Residual      | Traditional $(Z_i)$                  | Adjusted $(R_i^t)$
Score         | $\psi(\cdot)$                        | $\Psi(\cdot\,; b_t)$
Basepoint     | True $\theta$                        | Current iterate
Iterations    | $1$                                  | $t \gg 1$

28 Basic contrasts between HDA and Classical Asymptotics

Under the high-dimensional asymptotic, $\delta \in (1, \infty)$:
- The heuristic "expand around $\theta$" is incorrect; it understates the true uncertainty.
- The heuristic "errors equal the residuals" is incorrect; the residuals are noisier.
- The heuristic "take one step" is incorrect; one must take many steps.

29 Extra Variance at Initialization of AMP

Initialize with $\theta^0 = 0$, $R^{-1} = 0$.

Initial residual: $R^0 = Y - X\theta^0 = Z + X(\theta_0 - \theta^0)$.

The terms $Z$ and $X(\theta_0 - \theta^0)$ are independent. $X(\theta_0 - \theta^0)$ is Gaussian with variance
$\tau_0^2 = \|\theta_0 - \theta^0\|^2/n = \mathrm{MSE}(\theta^0, \theta_0)/\delta$.

$\mathrm{Var}(R_i^0) = \mathrm{Var}(Z) + \mathrm{Var}(X(\theta_0 - \theta^0)) = \mathrm{Var}(Z) + \mathrm{MSE}(\theta^0, \theta_0)/\delta$.

Extra Gaussian noise: $\tau_0^2 = \mathrm{MSE}(\theta^0, \theta_0)/\delta$.

30 Evolution of Extra Variance at later AMP iterations

Pretend $R^t \approx Z + \tau_t W$, with $W$ an independent standard normal.

Define the variance map
$\mathcal{V}(\tau^2, b; \delta, F) = \delta\, \mathbb{E}\left\{ \Psi(Z + \tau W; b)^2 \right\}$,

and the slope parameter $b = b(\tau; \delta, F)$: the smallest solution $b \ge 0$ of
$\dfrac{1}{\delta} = \mathbb{E}\left\{ \Psi'(Z + \tau W; b) \right\}$.  (6)

Definition: the dynamical system $\{\tau_t^2\}_{t \ge 0}$, starting at $\tau_0^2 \ge 0$, is given by
$\tau_{t+1}^2 = \mathcal{V}(\tau_t^2, b(\tau_t)) = \mathcal{V}(\tau_t^2, b(\tau_t; \delta, F); \delta, F)$.  (7)
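
A Monte Carlo sketch of the recursion (6)-(7), assuming the running example's contaminated normal ($0.95\,N(0,1) + 0.05$ point mass at 10), Huber λ = 3, δ = 5; the Monte Carlo sample size and starting τ₀² are illustrative. The iterates should approach the fixed point of the variance map (compare the next slide).

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
delta, lam, m = 5.0, 3.0, 200_000
Z = np.where(rng.random(m) < 0.95, rng.normal(0, 1, m), 10.0)   # F = 0.95 N(0,1) + 0.05 H_10
W = rng.normal(size=m)

def Psi(r, b):
    return b * np.clip(r / (1 + b), -lam, lam)

def dPsi(r, b):
    return (b / (1 + b)) * (np.abs(r) <= (1 + b) * lam)

def b_of_tau(tau):                    # smallest b solving the slope equation (6)
    r = Z + tau * W
    return brentq(lambda b: dPsi(r, b).mean() - 1.0 / delta, 1e-6, 1e6)

def variance_map(tau2):               # V(tau^2, b(tau); delta, F) = delta E{Psi(Z + tau W; b)^2}
    tau = np.sqrt(tau2)
    b = b_of_tau(tau)
    return delta * np.mean(Psi(Z + tau * W, b) ** 2)

tau2 = 1.2                            # illustrative starting value tau_0^2
for t in range(15):
    tau2 = variance_map(tau2)         # recursion (7)
    print(t, round(tau2, 4))
```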

31 Variance Map, in running example

[Figure: variance map $\mathcal{V}(\tau^2)$ (output variance vs. input variance) together with the line $y = x$; fixed point at $(0.472, 0.472)$. Contaminated normal $F = 0.95\,N(0,1) + 0.05\,H_{10}$, $\rho_H$ with $\lambda = 3$.]

32 Dynamics of $\tau_t$, in running example

[Figure: dynamics of $\tau_t^2$ under the variance map $\mathcal{V}(\tau^2)$, together with the line $y = x$. Contaminated normal $F = 0.95\,N(0,1) + 0.05\,H_{10}$, $\rho_H$ with $\lambda = 3$.]

33 Operating Characteristics from State Evolution

Definition: the state evolution formalism assigns predictions $\mathbb{E}\{\cdot \mid S\}$ under states $S = (\tau, b, \delta, F)$.

Functions $\xi$ of $\hat\theta - \theta_0$: predict $p^{-1} \sum_{i \le p} \xi(\hat\theta_i - \theta_{0,i})$ by
$\mathbb{E}(\xi(\hat\vartheta - \vartheta) \mid S) \equiv \mathbb{E}\left\{ \xi\!\left(\sqrt{\delta}\, \tau\, W\right) \right\}$,

Functions $\xi_2$ of (residual, error): predict $n^{-1} \sum_{i=1}^n \xi_2(R_i, Z_i)$ by
$\mathbb{E}(\xi_2(R, Z) \mid S) \equiv \mathbb{E}\, \xi_2(Z + \tau W, Z)$,

where $W \sim N(0, 1)$ and $Z \sim F$ is independent of $W$.

34 Example of Predictions

MSE at iteration $t$. Let $S_t = (\tau_t, b(\tau_t), \delta, F)$ denote the state of AMP at iteration $t$; predict
$\mathrm{MSE}(\theta^t, \theta_0) \approx \mathbb{E}((\hat\vartheta - \vartheta)^2 \mid S_t) = \mathbb{E}\left\{ (\sqrt{\delta}\, \tau_t W)^2 \right\} = \delta \tau_t^2$.

MSE at convergence. With $\tau_* > 0$ the limit of $\tau_t$, let $S_* = (\tau_*, b(\tau_*), \delta, F)$ denote the equilibrium state of AMP:
$\mathrm{MSE}(\hat\theta, \theta_0) \approx \mathbb{E}((\hat\vartheta - \vartheta)^2 \mid S_*) = \mathbb{E}\left\{ (\sqrt{\delta}\, \tau_* W)^2 \right\} = \delta \tau_*^2$.

Ordinary residuals $Y - X\hat\theta$ at AMP convergence. Setting $\eta(z; b) = z - \Psi(z; b)$:
$Y - X\hat\theta \approx_D \eta(Z + \tau_* W; b_*)$.

35 [Figure: experimental means from 10 simulations compared with state-evolution predictions under CN(0.05, 10), with Huber $\psi$, $\lambda = 3$. Upper left: $\hat\tau_t = \|\theta^t - \theta_0\|_2/\sqrt{n}$. Upper right: $\hat b_t$. Lower left: MSE, mean squared error. Lower right: MAE, mean absolute error. Blue + symbols: empirical means of AMP observables. Green curve: theoretical predictions by SE.]

36 Lower Bounds on $\tau_t^2$

Lemma. Suppose that $F$ has a well-defined Fisher information $I(F)$. Then for any $t > 0$,
$\tau_t^2 \ge \dfrac{1}{\delta\, I(F)}$.

Lemma (uses Barron-Madiman, 2007).
$I\!\left(F \star N(0, \tau^2)\right) \le \dfrac{I(F)}{1 + \tau^2 I(F)}$.

Corollary. For $t > k$,
$\tau_t^2 \ge \dfrac{1 + \delta^{-1} + \dots + \delta^{-k}}{\delta\, I(F)}$.

Corollary. For every accumulation point $\tau_\infty$ of $\tau_t$,
$\tau_\infty^2 \ge \dfrac{1}{\delta - 1} \cdot \dfrac{1}{I(F)}$.

Corollary. For an M-estimator under HDA with errors $Z_i$ iid $F$:
$\lim_{n} \operatorname{Ave}_i \operatorname{Var}(\hat\theta_i) \ge \dfrac{1}{1 - 1/\delta} \cdot \dfrac{1}{I(F)}$.
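
A brief sketch (not verbatim from the slides) of how these statements chain together, assuming the scalar information bound for $\Psi$ and the slope calibration (6):

```latex
% The slope calibration gives E\,\Psi'(Z+\tau_t W;b_t) = 1/\delta, and the scalar
% information bound E\,\Psi^2 \ge (E\,\Psi')^2 / I(\tilde F_t), with \tilde F_t = F \star N(0,\tau_t^2), yields
\[
  \tau_{t+1}^2 \;=\; \delta\,\mathbb{E}\,\Psi(Z+\tau_t W;b_t)^2
  \;\ge\; \frac{1}{\delta\, I\!\left(F \star N(0,\tau_t^2)\right)}
  \;\ge\; \frac{1}{\delta}\left(\frac{1}{I(F)} + \tau_t^2\right),
\]
% the last step by the Barron--Madiman convolution inequality of the second lemma.
% Iterating $k$ times from $\tau_{t-k}^2 \ge 1/(\delta I(F))$ gives the geometric sum
\[
  \tau_t^2 \;\ge\; \frac{1 + \delta^{-1} + \cdots + \delta^{-k}}{\delta\, I(F)}
  \;\longrightarrow\; \frac{1}{(\delta-1)\, I(F)} \qquad (k \to \infty),
\]
% and since \mathrm{Ave}_i \mathrm{Var}(\hat\theta_i) = \delta\,\tau_\infty^2, the final corollary follows.
```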

37 Correctness of State Evolution, 1

Basic Assumptions:
A1 The discrepancy function $\rho$ is convex and smooth;
A2 The matrices $\{X(n)\}_n$ have iid $N(0, \tfrac{1}{n})$ entries;
A3 $\theta_0$, $\theta^0 = 0$ are deterministic sequences such that $\mathrm{AMSE}(\theta^0, \theta_0) = \delta\tau_0^2$;
A4 $F$ has finite second moment.

Terminology: let $\{\tau_t^2\}_{t \ge 0}$ denote the state evolution sequence with initial condition $\tau_0^2$. Let $\{\theta^t, R^t\}_{t \ge 0}$ be the AMP trajectory with parameters $b_t$.

Definition: a function $\xi: \mathbb{R}^k \to \mathbb{R}$ is pseudo-Lipschitz if there exists $L < \infty$ such that, for all $x, y \in \mathbb{R}^k$,
$|\xi(x) - \xi(y)| \le L(1 + \|x\|_2 + \|y\|_2)\, \|x - y\|_2$.

38 Correctness of State Evolution, 2

Theorem. Assume A1-A4. Let $\xi: \mathbb{R} \to \mathbb{R}$ and $\xi_2: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be pseudo-Lipschitz functions. Then, for any $t > 0$, we have, for $W \sim N(0,1)$ independent of $Z \sim F$:
$\lim_{n \to \infty} \dfrac{1}{p} \sum_{i=1}^p \xi(\theta_i^t - \theta_{0,i}) =_{a.s.} \mathbb{E}\left\{ \xi\!\left(\sqrt{\delta}\, \tau_t W\right) \right\}$,  (8)
$\lim_{n \to \infty} \dfrac{1}{n} \sum_{i=1}^n \xi_2(R_i^t, Z_i) =_{a.s.} \mathbb{E}\left\{ \xi_2(Z + \tau_t W, Z) \right\}$.  (9)

39 Corollary of Correctness

Under HDA, we can define
$\mathrm{AMSE}(\hat\theta^t, \theta_0) =_{a.s.} \lim_{n,p \to \infty} \|\theta^t - \theta_0\|_2^2 / p$,
and then
$\mathrm{AMSE}(\hat\theta^t, \theta_0) = \delta \tau_t^2$.

40 Convergence of AMP to M-Estimator, 2

Theorem. Assume A1-A4, that $\rho$ is strongly convex, and $\delta > 1$. Let $(\tau_*, b_*)$ be a solution of the two equations
$\tau^2 = \delta\, \mathbb{E}\left\{ \Psi(Z + \tau W; b)^2 \right\}$,  (10)
$\dfrac{1}{\delta} = \mathbb{E}\left\{ \Psi'(Z + \tau W; b) \right\}$.  (11)
Assume that $\mathrm{AMSE}(\theta^0, \theta_0) = \delta\tau_*^2$. Then
$\lim_{t \to \infty} \mathrm{AMSE}(\theta^t, \hat\theta) = 0$.  (12)

41 Main Result: Asymptotic Variance Formula under High-Dimensional Asymptotics

Corollary. Assume the setting of the previous theorem. Then
$\lim_{n,p \to \infty} \operatorname{Ave}_{i \in [p]} \operatorname{Var}(\hat\theta_i) =_{a.s.} V(\Psi^*, F^*)$,  (13)
where the effective score is $\Psi^*(\cdot) = \Psi(\cdot\,; b_*)$, while the effective noise distribution is $F^* = F \star N(0, \tau_*^2)$. Here $(\tau_*, b_*)$ is the unique solution of the equations (10)-(11).

42 Driving Idea of High-Dimensional Asymptotics, 1

- The M-estimate is asymptotically equivalent to the limit of many steps of AMP.
- The first step of AMP contains extra Gaussian noise.
- The extra Gaussian noise is caused by errors in the initial parameter estimates.
- The extra Gaussian noise declines with AMP iterations but does not go away completely; it declines to the level defined by the fixed-point equations of state evolution.
- The extra Gaussian noise in the M-estimate is characterized by the fixed-point equations of state evolution.

43 Driving Idea of High-Dimensional Asymptotics, 2

Classically, the M-estimate propagates the true errors through the score:
$\hat\theta = \theta + \dfrac{1}{B(\psi, F)} (X^T X)^{-1} X^T (\psi(Z_i)) + o_p(n^{-1/2})$

Under HDA, it propagates noisier effective errors through the effective score:
$\hat\theta = \theta + \dfrac{1}{B(\Psi^*, F^*)} (X^T X)^{-1} X^T \left( \Psi^*(Z_i + \tau_* W_i) \right) + o_p(n^{-1/2})$

where $W_i \perp Z_i$, both are iid, and $\Psi^*$ is the effective score.

44 How is it Proved, 1?

Simply apply an existing paper: Bayati, Mohsen, and Andrea Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," IEEE Transactions on Information Theory 57(2), 2011.

Bayati-Montanari 2011 was designed for a seemingly different problem: asymptotics of the Lasso in the $p > n$ case,
$\min_\beta \; \|\tilde Y - X\beta\|_2^2 / 2 + \lambda \|\beta\|_1$.
The setting was compressed sensing, where $X_{i,j}$ iid $N(0, n^{-1})$ and $n/p_n \to \delta \in (0, 1)$.

The formalism of AMP and state evolution was introduced and developed in:
- DLD, Arian Maleki, and Andrea Montanari, "Message-passing algorithms for compressed sensing," PNAS, 2009.
- DLD, Arian Maleki, and Andrea Montanari, "The noise-sensitivity phase transition in compressed sensing," IEEE Transactions on Information Theory, 2011.

These papers systematically understood and used the Extra Gaussian Noise property of high-dimensional asymptotics. The generality of the Bayati-Montanari treatment easily accommodated M-estimation.

45 Iterative Thresholding

Soft threshold nonlinearity: $\eta(x; \lambda) = (|x| - \lambda)_+ \operatorname{sgn}(x)$.

Iterative solution $\beta^t$, $t = 0, 1, 2, 3, \dots$:
$\beta^0 = 0$
$z^t = \tilde Y - X\beta^t$
$\beta^{t+1} = \eta(\beta^t + X^T z^t; \lambda_t)$

A heuristic to solve the LASSO: $\min_\beta \|\tilde Y - X\beta\|_2^2/2 + \lambda \|\beta\|_1$.
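
A minimal ISTA sketch of this heuristic. One assumption beyond the slide: a step size $1/\|X\|_2^2$ is included (proximal-gradient view) so the iteration converges for a general design; the slide's recursion corresponds to a unit step.

```python
import numpy as np

def soft_threshold(x, lam):
    """eta(x; lam) = (|x| - lam)_+ sgn(x)"""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(X, Y, lam, n_iter=500):
    """Iterative soft thresholding for min 0.5 ||Y - X beta||^2 + lam ||beta||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # step size 1/L (assumption; the slide uses a unit step)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = Y - X @ beta                     # residual
        beta = soft_threshold(beta + step * (X.T @ z), step * lam)
    return beta
```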

46 Approximate Message Passing (AMP) vs. Iterative Thresholding

First-order Approximate Message Passing (AMP) algorithm:
$z^t = \tilde Y - X\beta^t + \dfrac{1}{\delta}\, z^{t-1} \left\langle \eta_{t-1}'(X^T z^{t-1} + \beta^{t-1}) \right\rangle$,
$\beta^{t+1} = \eta_t(\beta^t + X^T z^t)$,    $\eta_t'(s) = \dfrac{\partial}{\partial s} \eta_t(s)$,
where $\langle \cdot \rangle$ denotes the average of the entries.

Feature: essentially the same cost as iterative soft thresholding.

If $\eta$ = soft thresholding,
$\dfrac{1}{\delta} \left\langle \eta_{t-1}'(X^T z^{t-1} + \beta^{t-1}) \right\rangle = \dfrac{\|\beta^t\|_0}{n}$.
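
A minimal sketch of the first-order AMP recursion above, with soft thresholding; the Onsager term uses the identity $\|\beta^t\|_0/n$ from the slide. The threshold policy $\lambda_t = \alpha\,\|z^t\|/\sqrt{n}$ (with α an assumed tuning constant) is an illustrative choice, not the talk's.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def amp_lasso(X, Y, alpha=2.0, n_iter=30):
    n, p = X.shape
    beta, z = np.zeros(p), Y.copy()
    for _ in range(n_iter):
        tau = np.linalg.norm(z) / np.sqrt(n)            # noise-level estimate from the residual
        beta_new = soft_threshold(beta + X.T @ z, alpha * tau)
        onsager = (np.count_nonzero(beta_new) / n) * z  # (1/delta) <eta'> z = (||beta||_0 / n) z
        z = Y - X @ beta_new + onsager                  # corrected residual
        beta = beta_new
    return beta
```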

47 How is it Proved, 2?

Central recursion in Bayati-Montanari 2011: $h^t, q^t \in \mathbb{R}^N$, $z^t, m^t \in \mathbb{R}^n$; initial condition $q^0$; $m^{-1} = 0$.

$h^{t+1} = A^* m^t - \xi_t q^t$,    $m^t = g_t(b^t, w)$,
$b^t = A q^t - \lambda_t m^{t-1}$,    $q^t = f_t(h^t, x_0)$.

Reaction coefficients: $\xi_t = \langle g_t'(b^t, w) \rangle$;  $\lambda_t = \dfrac{1}{\delta} \langle f_t'(h^t, x_0) \rangle$.

State evolution: $\tau_t^2 = \mathbb{E}\{g_t(\sigma_t Z, W)^2\}$;  $\sigma_t^2 = \dfrac{1}{\delta}\, \mathbb{E}\{f_t(\tau_{t-1} Z, X_0)^2\}$,

where $W \sim F_W$, $X_0 \sim F_{X_0}$.

48 ϑ t+1 = δx T Ψ(W + S t ; b t ) + q t ϑ t ϑ t+1 X T δψ(w + S t ; b t ) q t ϑ t h t+1 = A m t ξ t q t h t+1 A m t ξ t q t S t = Xϑ t + Ψ(W + S t 1 ; b t 1 ) S t X ϑ t 1 Ψ(W + S t 1 ; b t 1 ) b t = Aq t λ t m t 1 b t A q t λ t m t 1 Table : Correspondences between terms in the centered recursions of DLD-Montanari 2013, and recursions analyzed in Bayati-Montanari We get exact correspondence between the two systems, provided we identify δψ(w + S t ; b t ) with m t = g t (b t ; w) and δh t with f t (h t ). One has, in particular, that λ t = 1 δ f t (ht ) = 1, and that ξ t = g t (bt, w) = δψ (W + S t ; b t ) = q t.

49 Full Circle, 1

Why did work on sparse signal recovery solve a problem in robust regression?

Duality of sparsity and robustness:
- An explicit link between solutions of the Lasso with $p > n$ and Huber regression with $p < n$.
- An identity between estimating a sparsely nonzero vector and uncovering outliers in a linear relation.

50 Full Circle, 2

$(\mathrm{Lasso}_\lambda)$   $\min_{\beta \in \mathbb{R}^n} \; \dfrac{1}{2} \|\tilde Y - \tilde X\beta\|_2^2 + \lambda \sum_i |\beta_i|$,  (14)

$(\mathrm{Huber}_\lambda)$   $\min_{\vartheta \in \mathbb{R}^p} \; \sum_{i=1}^n \rho_H(Y_i - \langle X_i, \vartheta\rangle; \lambda)$.  (15)

Let $\tilde X$ be a matrix with orthonormal rows such that $\tilde X X = 0$, i.e.
$\mathrm{null}(\tilde X) = \mathrm{image}(X)$,  (16)
and finally set $\tilde Y = \tilde X Y$.

Proposition. With problem instances $(Y, X)$ and $(\tilde Y, \tilde X)$ related as above, the optimal values of the Lasso problem $(\mathrm{Lasso}_\lambda)$ and the Huber problem $(\mathrm{Huber}_\lambda)$ are identical. The solutions of the two problems are in one-to-one relation. In particular, we have
$\hat\theta = (X^T X)^{-1} X^T (Y - \hat\beta)$.  (17)

(Numerous references: e.g. Art Owen/IPOD & earlier.)
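
A small numerical check of the Proposition (a sketch under illustrative sizes, with scipy.linalg.null_space used to build $\tilde X$ and a plain ISTA loop for the Lasso): the two optimal values should agree and (17) should hold to numerical accuracy.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, p, lam = 60, 10, 1.0
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.standard_t(3, size=n)

# Lasso instance: rows of Xt are an orthonormal basis of null(X^T), and Yt = Xt Y.
Xt = null_space(X.T).T                  # shape (n-p, n); Xt @ X == 0
Yt = Xt @ Y

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

beta = np.zeros(n)                      # ISTA with unit step (valid: ||Xt||_2 = 1)
for _ in range(5000):
    beta = soft_threshold(beta + Xt.T @ (Yt - Xt @ beta), lam)
lasso_val = 0.5 * np.sum((Yt - Xt @ beta) ** 2) + lam * np.sum(np.abs(beta))

def huber_obj(theta):
    r = Y - X @ theta
    a = np.abs(r)
    return np.where(a <= lam, 0.5 * r**2, lam * a - 0.5 * lam**2).sum()

res = minimize(huber_obj, np.zeros(p), method="BFGS")
print("Huber optimal value:", res.fun)
print("Lasso optimal value:", lasso_val)
print("max |theta_hat - (X^T X)^{-1} X^T (Y - beta_hat)| =",
      np.max(np.abs(res.x - np.linalg.solve(X.T @ X, X.T @ (Y - beta)))))
```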

51 Full Circle, 3

Isometry between the performance of:
- the Lasso in an $\varepsilon$-sparse regression problem, $p > n$;
- Huber (M)-estimation with $\varepsilon$-contaminated data, $p < n$.

52 Two Scalar Minimax Problems

Huber (1964) minimax problem:
$v(\varepsilon) = \min_\lambda \sup_H \left\{ V(\psi_\lambda, F) : F = (1 - \varepsilon)\Phi + \varepsilon H \right\}$
Here $\psi_\lambda$ is the Huber $\psi$ capped at $\lambda$, and $\Phi$ is $N(0, 1)$.

Minimax MSE under sparse means (DLD & Johnstone, 1992):
$m(\varepsilon) = \inf_\lambda \sup_H \left\{ \mathbb{E}_F \left(\eta_\lambda(X + Z) - X\right)^2 : X \sim (1 - \varepsilon)\nu_0 + \varepsilon H \right\}$
Here $\eta_\lambda$ is soft thresholding at $\lambda$, $\nu_0$ is a point mass at zero, and $Z \sim N(0,1)$ is independent of $X$.

53 Full Circle, 4 (w/ Montanari, Johnstone)

HDA regression with a fraction $\varepsilon$ of contaminated errors, $\delta = n/p$:
$V^*(\varepsilon, \delta) \equiv \min_\lambda \max_H \lim_n \operatorname{Ave}_i \operatorname{Var}(\hat\theta_i)$
$V^*(\varepsilon, \delta) = \begin{cases} +\infty & v(\varepsilon) > \delta \\[4pt] \dfrac{v(\varepsilon)}{1 - v(\varepsilon)/\delta} & v(\varepsilon) < \delta \end{cases}$

HDA sparse regression, $\delta = n/p < 1$: $\tilde Y = X\beta_0 + \sigma Z$, $Z$ iid $N(0,1)$, $X_{i,j}$ iid $N(0, n^{-1})$, $\beta_0$ is $\varepsilon$-sparse:
$M^*(\varepsilon, \delta) \equiv \sup_{\sigma > 0} \min_\lambda \max_{\|\beta_0\|_0/n \le \varepsilon} \lim_n \mathrm{MSE}(\hat\beta_\lambda, \beta_0)/\sigma^2$
$M^*(\varepsilon, \delta) = \begin{cases} +\infty & m(\varepsilon) > \delta \\[4pt] \dfrac{m(\varepsilon)}{1 - m(\varepsilon)/\delta} & m(\varepsilon) < \delta \end{cases}$

54 Conclusions

- High-dimensional asymptotics imposes extra Gaussian noise in estimation, not seen classically.
- Approximate Message Passing: a new algorithm to understand and analyze M-estimates, and a new type of analysis to obtain properties of estimates.
- New phenomena become visible, e.g. phase transitions in (M)-estimation, previously unknown.
- Alternate approach: N. El Karoui (arXiv).
