STAT 593 Robust statistics: Modeling and Computing


1 STAT 593 Robust statistics: Modeling and Computing Joseph Salmon Télécom ParisTech, Institut Mines-Télécom & University of Washington, Department of Statistics (Visiting Assistant Professor) 1 / 49

2 Outline Presentation / course organization Prerequisite / references Common estimators Linear Model 2 / 49

3 Table of Contents Presentation / course organization Teaching staff Practical aspects Prerequisite / references Common estimators Linear Model 3 / 49

4 Presentation Joseph Salmon (Assistant Professor): Positions: PhD student at Paris Diderot-Paris 7 (2007-2010); Post-Doc at Duke University ( ); Assistant Professor at Télécom ParisTech (2012-); Visiting Assistant Professor at UW (2018). Research themes: high dimensional statistics, optimization for machine learning, aggregation, image processing. joseph.salmon@telecom-paristech.fr Website: josephsalmon.eu 4 / 49

5 (No) Grades / office hours Beware: this is a Credit/No-Credit grading course. Office hours: Friday 10:30-11:30 AM, by appointment only. Office: B314 Padelford. Number of credits: 3. 5 / 49

6 Outline of the course Week 1 Introduction, examples, basic concepts, location, scale, equivariance. Week 2 Breakdown point, M-estimates, pseudo-observations. Week 3 L-statistics: linear combinations of order statistics. Week 4 Numerical computation of M-estimates, non-smooth convex optimization, Iteratively Reweighted Least Squares (IRLS). Week 5 Smoothing non-smooth problems. Week 6 Gâteaux differentiability, sensitivity curve, influence function. Week 7 Robust regression and multivariate statistics. Week 8 Quantile regression, crossing. Week 9 Guest lectures. Week 10 Project presentations. 6 / 49

7 Table of Contents Presentation / course organization Prerequisite / references General advice Reading Common estimators Linear Model 7 / 49

8 Prerequisites Probability basics: probability, expectation, law of large numbers, Gaussian distribution, central limit theorem. Books: Murphy (2012, ch. 1 and 2). Optimization basics: (differential) calculus, convexity, first order conditions, gradient descent, Newton's method. Books: Boyd and Vandenberghe (2004), Bertsekas (1999). (Bi-)linear algebra basics: vector spaces, norms, inner products, matrices, determinants, diagonalization. Reading: Horn and Johnson (1994). Numerical linear algebra: solving linear systems, Gaussian elimination, matrix factorizations, conditioning, etc. Reading: Golub and Van Loan (2013), Applied Numerical Computing by L. Vandenberghe. 8 / 49

12 Books, recommended reading Books on robust statistics: Maronna et al. (2006), Huber and Ronchetti (2009), Hampel et al. (1986), Rousseeuw and Leroy (1987). Book for linear models: Seber and Lee (2003). Books for optimization, Legendre/Fenchel conjugacy: Hiriart-Urruty and Lemaréchal (1993a, 1993b), Bauschke and Combettes (2011). Surveys on optimization: Parikh et al. (2013). 9 / 49

13 Algorithmic aspects: some advice Python installation: use Conda / Anaconda. Recommended tools: Jupyter / IPython Notebook, or IPython with a text editor (e.g., Atom, Sublime Text, Visual Studio Code, etc.). Key libraries (see their websites): Python, Scipy, Numpy, Pandas, scikit-learn, Statsmodels. 10 / 49

14 General advice Use a version control system for your work: Git (e.g., Bitbucket, GitHub, etc.) or Mercurial. Use a clean way of writing / presenting your code, for example PEP8 for Python (use for instance AutoPEP8), learn from good examples, etc. 11 / 49

15 List of interesting papers (I) Depth: D. L. Donoho and M. Gasko. "Breakdown properties of location estimates based on halfspace depth and projected outlyingness". In: Ann. Statist. 20.4 (1992). K. Mosler. "Depth statistics". In: Robustness and complex data structures. Springer, 2013. Linear models / Lasso methods: M. Avella-Medina and E. M. Ronchetti. "Robust and consistent variable selection in high-dimensional generalized linear models". In: Biometrika 105.1 (2018). H. Xu, C. Caramanis, and S. Mannor. "Robust regression and Lasso". In: IEEE Trans. Inf. Theory 56.7 (2010). A. Alfons, C. Croux, and S. Gelper. "Sparse least trimmed squares regression for analyzing high-dimensional large data sets". In: Ann. Appl. Stat. 7.1 (2013). 12 / 49

16 List of interesting papers (II) Robust optimization point of view: Y. Chen, C. Caramanis, and S. Mannor. "Robust sparse regression under adversarial corruption". In: ICML. 2013. D. Bertsimas, D. B. Brown, and C. Caramanis. "Theory and applications of robust optimization". In: SIAM Rev. 53.3 (2011). M. Chen, C. Gao, and Z. Ren. "A General Decision Theory for Huber's ε-contamination Model". In: Electron. J. Stat. 10.2 (2016). Geometric median: S. Minsker. "Geometric median and robust estimation in Banach spaces". In: Bernoulli 21.4 (2015). Robust covariance estimation: X. Wei and S. Minsker. "Estimation of the covariance structure of heavy-tailed distributions". In: NIPS. 2017. Smoothing non-smooth functions: Y. Nesterov. "Smooth minimization of non-smooth functions". In: Math. Program. 103.1 (2005). A. Beck and M. Teboulle. "Smoothing and first order methods: A unified framework". In: SIAM J. Optim. 22.2 (2012). 13 / 49

17 Table of Contents Presentation / course organization Prerequisite / references Common estimators Location estimation Scale estimation Masking effect Linear Model 14 / 49

18 Notation / Settings Observations: $n$ samples $x_1, \dots, x_n$, real numbers; later, samples will be elements of $\mathbb{R}^d$. Vector notation: $n$ samples $x_1, \dots, x_n$ (or $y_1, \dots, y_n$): $x = (x_1, \dots, x_n)^\top \in \mathbb{R}^n$ (or $y = (y_1, \dots, y_n)^\top \in \mathbb{R}^n$). Inner product: $\langle x, y \rangle = \sum_{i=1}^n x_i y_i$. 15 / 49

19 Sample mean (empirical mean) $\bar{x}_n$: empirical mean. Definition. Sample mean: $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i = \arg\min_{\mu \in \mathbb{R}} \sum_{i=1}^n (\mu - x_i)^2$. Rem: $\bar{x}_n = \langle x, \mathbf{1}_n \rangle / n$, where $\mathbf{1}_n = (1, \dots, 1)^\top \in \mathbb{R}^n$, and $\bar{x}_n \mathbf{1}_n$ is the (Euclidean) projection of $x$ onto $\mathrm{Span}(\mathbf{1}_n)$. 16 / 49

20 Mean: optimization problem [Figure: individual objectives $(\mu - x_i)^2$ and their sum] 17 / 49

22 Median $\mathrm{Med}_n(x)$: empirical median. Definition. Median: $\mathrm{Med}_n(x) \in \arg\min_{\mu \in \mathbb{R}} \sum_{i=1}^n |\mu - x_i|$. Rem: often, with $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$ (order statistics), $\mathrm{Med}_n(x) = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$ if $n$ is even, and $\mathrm{Med}_n(x) = x_{((n+1)/2)}$ if $n$ is odd. 18 / 49

23 Median: optimization problem [Figure: individual objectives $|\mu - x_i|$ and their sum] 19 / 49

27 Trimmed mean $\bar{x}_{n,0.25}$: trimmed mean. Definition. Trimmed mean (at level $\alpha$): $\bar{x}_{n,\alpha} = \frac{1}{n-2m} \sum_{i=m+1}^{n-m} x_{(i)}$, where $m = \lfloor (n-1)\alpha \rfloor$ and $x_{(i)}$ denotes the order statistics. Rem: $\lfloor u \rfloor$ is the integer part of $u$. 20 / 49
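
To make these three location estimators concrete, here is a minimal sketch (not from the slides; the sample and the outlier value are made up) comparing them on data containing one gross outlier. `scipy.stats.trim_mean` implements the trimmed mean above, cutting a proportion of each tail:

```python
# Sketch: empirical mean vs. median vs. trimmed mean on contaminated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=50)
x[0] = 100.0  # inject one gross outlier

print(np.mean(x))                # dragged toward the outlier
print(np.median(x))              # essentially unaffected
print(stats.trim_mean(x, 0.25))  # discards 25% of each tail, then averages
```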

28 Mean vs median vs trimmed mean $\mathrm{Med}_n(x)$: empirical median; $\bar{x}_{n,0.25}$: trimmed mean; $\bar{x}_n$: empirical mean. The trimmed mean and the median are robust to outliers; the (empirical) mean is not. 21 / 49

31 Variance / standard deviation $s_n$, $\bar{x}_n$: empirical mean. Definition. Variance: $\mathrm{var}_n(x) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x}_n)^2 = \frac{1}{n-1} \|x - \bar{x}_n \mathbf{1}_n\|^2$. Std: $s_n(x) = \sqrt{\mathrm{var}_n(x)}$ (where $\|z\|^2 = \sum_{i=1}^n z_i^2$). Rem: the normalization can change, $1/n$ or $1/(n-1)$ (unbiased). 22 / 49

32 Median Absolute Deviation (MAD) $\mathrm{Med}_n(x)$: empirical median. Definition. Median Absolute Deviation (MAD): $\mathrm{MAD}_n(x) = \mathrm{Med}_n(|\mathrm{Med}_n(x) - x|)$. Normalized Median Absolute Deviation (MADN): $\mathrm{MADN}_n(x) = \mathrm{MAD}_n(x) / 0.6745$. Rem: $\Phi^{-1}(3/4) \approx 0.6745$ ($\Phi$: standard Gaussian CDF). 23 / 49
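
These definitions translate almost verbatim into numpy; the sketch below is an illustration, not course code (`scipy.stats.median_abs_deviation` with `scale='normal'` computes the same MADN):

```python
# Sketch: MAD and MADN from the definitions above.
import numpy as np

def mad(x):
    """Median absolute deviation: Med_n(|Med_n(x) - x|)."""
    x = np.asarray(x)
    return np.median(np.abs(x - np.median(x)))

def madn(x):
    """Normalized MAD, consistent for the std under Gaussian data."""
    return mad(x) / 0.6745  # 0.6745 is approximately Phi^{-1}(3/4)

rng = np.random.default_rng(1)
x = rng.normal(scale=2.0, size=10_000)
print(madn(x))  # close to 2.0, the true standard deviation
```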

33 Newcomb's experiments (speed of light) [Figure: raw observations] 24 / 49

34 Newcomb's experiments (speed of light) [Figure: t-statistics vs. index] Standard statistical "3σ" rule of thumb: flag a sample $x_i$ as an outlier when $|t_i| > 3$, where $t_i = \frac{x_i - \bar{x}_n}{s_n}$. 24 / 49

35 Newcomb's experiments (speed of light) [Figure: robust t-statistics vs. index] Robust counterpart of the 3σ rule of thumb: flag a sample $x_i$ as an outlier when $|t_i| > 3$, where $t_i = \frac{x_i - \mathrm{Med}_n(x)}{\mathrm{MADN}_n(x)}$. Rem: helps limit the masking effect. 24 / 49
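
A quick sketch (simulated data, not Newcomb's) contrasting the two rules: a cluster of outliers inflates $\bar{x}_n$ and $s_n$, so the classical rule may miss some of them, while the robust rule keeps flagging them:

```python
# Sketch: classical vs. robust 3-sigma outlier flags on contaminated data.
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(size=95), np.full(5, 8.0)])  # 5 outliers at 8

t_classic = (x - x.mean()) / x.std(ddof=1)
madn = np.median(np.abs(x - np.median(x))) / 0.6745
t_robust = (x - np.median(x)) / madn

print(np.sum(np.abs(t_classic) > 3))  # classical rule: possibly masked
print(np.sum(np.abs(t_robust) > 3))   # robust rule: flags the cluster
```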

36 Table of Contents Presentation / course organization Prerequisite / references Common estimators Linear Model Least squares and variants Leverage points Multidimensional regression 25 / 49

37 Ordinary Least Squares: toy example (y-corruption) [Figures over several slides: raw data, then the OLS-sklearn fit, then the HuberRegressor-sklearn fit] 26 / 49

41 Ordinary Least Squares: toy example (x-corruption) [Figures over several slides: raw data, then fits by OLS-sklearn, HuberRegressor-sklearn, and LTS] 27 / 49
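
The slides' figures can be reproduced in spirit with scikit-learn; the data-generating process below is an assumption (the slides do not specify theirs), and only the estimator names (`LinearRegression`, `HuberRegressor`) are taken from the figure legends:

```python
# Sketch: OLS vs. a Huber M-estimator under y-corruption.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(3)
n = 100
X = rng.uniform(0, 10, size=(n, 1))
y = 1.0 + 0.5 * X.ravel() + rng.normal(scale=0.3, size=n)
y[:5] += 30.0  # y-corruption: five grossly shifted responses

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print(ols.coef_, ols.intercept_)      # pulled toward the corrupted points
print(huber.coef_, huber.intercept_)  # close to the true (0.5, 1.0)
```

Huber regression resists y-corruption, but as the x-corruption slides illustrate, high-leverage points call for estimators such as LTS or RANSAC (scikit-learn ships `RANSACRegressor`; LTS is not part of scikit-learn).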

46 A real 2D example Example: braking distance for cars as a function of speed ($n = 50$ measurements) [Figure: raw data, Distance vs. Speed] Dataset cars: 28 / 49

47 A real 2D example Example: braking distance for cars as a function of speed ($n = 50$ measurements) [Figure: raw data with OLS-sklearn fit, Distance vs. Speed] Dataset cars: 28 / 49

48 Modeling: single feature Observations: $(y_i, x_i)$, for $i = 1, \dots, n$. Linear model or linear regression hypothesis, assume: $y_i \approx \beta_0 + \beta_1 x_i$, with $\beta_0$: intercept (unknown) and $\beta_1$: slope (unknown). Rem: both parameters are unknown to the statistician. Definitions: $y$ is an observation or a variable to explain; $x$ is a feature or a covariate. 29 / 49

49 Modeling (II) $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. Definitions: intercept: the scalar $\beta_0$; slope: the scalar $\beta_1$; noise: the vector $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top$. Goal: estimate $\beta_0$ and $\beta_1$ (unknown) by $\hat{\beta}_0$ and $\hat{\beta}_1$, relying on the observations $(y_i, x_i)$ for $i = 1, \dots, n$. 30 / 49

50 OLS and center of gravity $y \approx \hat{\beta}_0 + \hat{\beta}_1 x$. [Figure: raw data fitted with intercept, and center of mass (speed = 15.4, dist = …); Data, OLS-sklearn] $\hat{\beta}_0$ = intercept (negative!), $\hat{\beta}_1$ = slope. Physical interpretation: the cloud of points' center of gravity belongs to the (estimated) regression line. 31 / 49

51 Centering Centered model: write, for any $i = 1, \dots, n$: $x'_i = x_i - \bar{x}_n$ and $y'_i = y_i - \bar{y}_n$, i.e., $x' = x - \bar{x}_n \mathbf{1}_n$ and $y' = y - \bar{y}_n \mathbf{1}_n$, with $\mathbf{1}_n = (1, \dots, 1)^\top \in \mathbb{R}^n$; then solving the OLS problem with $(x', y')$ leads to $\hat{\beta}'_0 = 0$ and $\hat{\beta}'_1 = \frac{\sum_{i=1}^n x'_i y'_i}{\sum_{i=1}^n x'^2_i}$. Rem: equivalent to choosing the cloud of points' center of mass as the origin, i.e., $(\bar{x}'_n, \bar{y}'_n) = (0, 0)$. 32 / 49

52 Centering (II) [Figure: raw data recentered to the center of mass, with OLS fit] 33 / 49

53 Centering and interpretation Consider the coefficient $\hat{\beta}_1$ ($\hat{\beta}_0 = 0$) for centered $y', x'$; then: $\hat{\beta}_1 \in \arg\min_{\beta_1 \in \mathbb{R}} \sum_{i=1}^n (y'_i - \beta_1 x'_i)^2 = \arg\min_{\beta_1 \in \mathbb{R}} \sum_{i=1}^n x'^2_i \left( \frac{y'_i}{x'_i} - \beta_1 \right)^2$. Interpretation: $\hat{\beta}_1$ is a weighted average of the slopes $y'_i / x'_i$: $\hat{\beta}_1 = \sum_{i=1}^n \frac{x'^2_i}{\sum_{j=1}^n x'^2_j} \cdot \frac{y'_i}{x'_i}$. Influence of extreme points: weights proportional to $x'^2_i$; leverage effect for far-away points. 34 / 49
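
The weighted-average identity is easy to check numerically; this is an illustrative sketch on synthetic data:

```python
# Sketch: after centering, the OLS slope equals the weighted average of the
# individual slopes y'_i / x'_i with weights proportional to x'^2_i.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 20, size=50)
y = 3.0 * x + rng.normal(size=50)

xc, yc = x - x.mean(), y - y.mean()           # centered variables
slope_ols = xc @ yc / (xc @ xc)               # closed-form centered OLS slope
weights = xc**2 / np.sum(xc**2)
slope_weighted = np.sum(weights * (yc / xc))  # weighted average of slopes

print(np.allclose(slope_ols, slope_weighted))  # True
```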

54 Extreme points leverage effect [Figure: recentered data; average of slopes, weighted by importance] 35 / 49

56 Extreme points leverage effect [Animation over many slides: least-squares fit as the sample size n grows, illustrating how far-away points pull the fitted line] 36 / 49

106 Multidimensional regression: model / vocabulary $y = X\beta + \varepsilon$, where $y \in \mathbb{R}^n$: observations vector; $X \in \mathbb{R}^{n \times p}$: design matrix (with features as columns); $\beta \in \mathbb{R}^p$: (unknown) true parameter to be estimated; $\varepsilon \in \mathbb{R}^n$: noise vector. Observations point of view: $y_i = \langle x_i, \beta \rangle + \varepsilon_i$ for $i = 1, \dots, n$, where $\langle \cdot, \cdot \rangle$ stands for the standard inner product. Features point of view: $y = \sum_{j=1}^p \beta_j x^j + \varepsilon$. 37 / 49

107 (Ordinary) Least Squares, (O)LS A least squares estimator is any solution of the following problem: $\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2 =: f(\beta)$, i.e., $\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \sum_{i=1}^n \left[ y_i - \langle x_i, \beta \rangle \right]^2$. Rem: uniqueness does not hold when features are collinear, and then there are infinitely many solutions. Rem: an intercept is often added. Rem: the Gaussian (negative log-)likelihood leads to the squared formulation. 38 / 49

108 Least squares: normal equation $\nabla f(\beta) = X^\top X \beta - X^\top y = X^\top (X\beta - y) = 0$. Theorem. Fermat's rule ensures that any LS solution $\hat{\beta}$ satisfies the normal equation: $X^\top X \hat{\beta} = X^\top y$, i.e., $\hat{\beta}$ is a solution of the linear system $A\beta = b$ for the matrix $A = X^\top X$ and right-hand side $b = X^\top y$. 39 / 49

109 Proof: gradient computation The gradient $\nabla f$ of $f$ is defined, for any $\beta$, as the vector satisfying $f(\beta + h) = f(\beta) + \langle h, \nabla f(\beta) \rangle + o(\|h\|)$ for any $h$. For the $f$ of interest here, this reads: $f(\beta + h) = \frac{1}{2} \|y\|^2 - \langle \beta + h, X^\top y \rangle + \frac{1}{2} (\beta + h)^\top X^\top X (\beta + h) = \frac{1}{2} \|y\|^2 - \langle \beta, X^\top y \rangle - \langle h, X^\top y \rangle + \frac{1}{2} \beta^\top X^\top X \beta + \frac{1}{2} h^\top X^\top X h + \beta^\top X^\top X h = f(\beta) + \langle h, X^\top X \beta - X^\top y \rangle + \frac{1}{2} h^\top X^\top X h$, where the last term is $o(\|h\|)$. Hence, $\nabla f(\beta) = X^\top X \beta - X^\top y = X^\top (X\beta - y)$. 40 / 49
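
The formula $\nabla f(\beta) = X^\top(X\beta - y)$ can be validated with a finite-difference check; a small sketch on random data:

```python
# Sketch: finite-difference check of grad f(beta) = X^T (X beta - y)
# for f(beta) = 0.5 * ||y - X beta||^2.
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 4
X, y = rng.normal(size=(n, p)), rng.normal(size=n)
beta = rng.normal(size=p)

f = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
grad = X.T @ (X @ beta - y)

eps = 1e-6
grad_fd = np.array([
    (f(beta + eps * np.eye(p)[j]) - f(beta - eps * np.eye(p)[j])) / (2 * eps)
    for j in range(p)
])
print(np.max(np.abs(grad - grad_fd)))  # tiny: the two gradients agree
```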

115 Vocabulary (and abuse of terms) Definition. We call Gramian matrix the matrix $X^\top X \in \mathbb{R}^{p \times p}$, whose general term is $[X^\top X]_{i,j} = \langle x^i, x^j \rangle$. Rem: $X^\top X$ is often referred to as the feature correlation matrix (true for standardized columns). Rem: when the columns are scaled such that $\|x^j\|^2 = n$ for all $j \in \{1, \dots, p\}$, the Gramian diagonal is $(n, \dots, n)$. $X^\top y = (\langle x^1, y \rangle, \dots, \langle x^p, y \rangle)^\top$: observations/features correlations. 41 / 49

116 OLS closed-form solution (full rank case) Theorem. If $X$ has full (column) rank (i.e., if $X^\top X$ is non-singular), then $\hat{\beta}^{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y$. Rem: if $X = \mathbf{1}_n$: $\hat{\beta}^{\mathrm{OLS}} = \frac{\langle \mathbf{1}_n, y \rangle}{\langle \mathbf{1}_n, \mathbf{1}_n \rangle} = \bar{y}_n$ (empirical mean). Rem: single feature $X = x = (x_1, \dots, x_n)^\top$: $\hat{\beta}^{\mathrm{OLS}} = \frac{\langle x, y \rangle}{\|x\|^2}$. Beware: in practice, avoid inverting the matrix $X^\top X$ numerically (time consuming); moreover, the matrix $X^\top X$ is not even invertible if $p \geq n$, e.g., in biology: $n$ patients, $p$ genes, with $p \gg n$. 42 / 49
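
In code, the warning above means: solve a least-squares problem (or at least a linear system) rather than forming $(X^\top X)^{-1}$. A sketch with numpy:

```python
# Sketch: computing the OLS solution without an explicit matrix inverse.
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.arange(1.0, p + 1) + rng.normal(scale=0.1, size=n)

# Preferred: least-squares solver based on an orthogonal factorization.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Normal equations via a linear solve; avoid np.linalg.inv(X.T @ X),
# which is slower and numerically less stable.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_lstsq, beta_solve))  # True on this well-posed problem
```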

117 Example Stackloss dataset: stack loss plant data, Brownlee (1965), contains 21 days of measurements from a plant's oxidation of ammonia to nitric acid; the nitric oxide pollutants are captured in an absorption tower. Number of samples: $n = 21$; number of features: $p = 3$. $y$ (to predict): STACKLOSS, 10 times the percentage of ammonia going into the plant that escapes from the absorption tower. Features: AIRFLOW, rate of operation of the plant; WATERTEMP, cooling water temperature in the tower; ACIDCONC, acid concentration of circulating acid, minus 50, times 10. 43 / 49

118 3σ rule to spot outliers in a linear model [Figure: OLS t-statistics vs. index] $t_i = \frac{y_i - \langle x_i, \hat{\beta}^{\mathrm{OLS}} \rangle}{\hat{\sigma}}$ with $\hat{\sigma} = \frac{\|y - X\hat{\beta}^{\mathrm{OLS}}\|}{\sqrt{n - p}}$. 44 / 49

119 3σ rule to spot outliers in a linear model [Figure: robust t-statistics (Huber) vs. index] $t_i = \frac{y_i - \langle x_i, \hat{\beta}^{\mathrm{Huber}} \rangle}{\hat{\sigma}}$ with $\hat{\sigma} = \frac{\|y - X\hat{\beta}^{\mathrm{Huber}}\|}{\sqrt{n - p}}$. 44 / 49

120 3σ rule to spot outliers in a linear model [Figure: robust t-statistics (RANSAC) vs. index] $t_i = \frac{y_i - \langle x_i, \hat{\beta}^{\mathrm{RANSAC}} \rangle}{\hat{\sigma}}$ with $\hat{\sigma} = \mathrm{MAD}_n(y - X\hat{\beta}^{\mathrm{RANSAC}}) / 0.6745$. 44 / 49

121 3σ rule to spot outliers in a linear model [Figure: robust t-statistics (Least Trimmed Squares) vs. index] $t_i = \frac{y_i - \langle x_i, \hat{\beta}^{\mathrm{LTS}} \rangle}{\hat{\sigma}}$ with $\hat{\sigma} = \mathrm{MAD}_n(y - X\hat{\beta}^{\mathrm{LTS}}) / 0.6745$. 44 / 49
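
A sketch of these flagging rules on synthetic data; `HuberRegressor` stands in here for the RANSAC/LTS fits of the last two slides (LTS is not available in scikit-learn), so the setup is illustrative only:

```python
# Sketch: residual t-statistics with a classical scale (OLS, ||r||/sqrt(n-p))
# vs. a robust scale (robust fit + MADN of the residuals).
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(7)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
y[:4] += 10.0  # four corrupted responses

ols = LinearRegression().fit(X, y)
r_ols = y - ols.predict(X)
t_ols = r_ols / (np.linalg.norm(r_ols) / np.sqrt(n - p))

huber = HuberRegressor().fit(X, y)
r_hub = y - huber.predict(X)
madn = np.median(np.abs(r_hub - np.median(r_hub))) / 0.6745
t_hub = r_hub / madn

print(np.where(np.abs(t_ols) > 3)[0])  # classical flags (masking possible)
print(np.where(np.abs(t_hub) > 3)[0])  # robust flags (should include 0-3)
```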

122 References I Alfons, A., C. Croux, and S. Gelper. "Sparse least trimmed squares regression for analyzing high-dimensional large data sets". In: Ann. Appl. Stat. 7.1 (2013). Avella-Medina, M. and E. M. Ronchetti. "Robust and consistent variable selection in high-dimensional generalized linear models". In: Biometrika 105.1 (2018). Bauschke, H. H. and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. New York: Springer, 2011. Beck, A. and M. Teboulle. "Smoothing and first order methods: A unified framework". In: SIAM J. Optim. 22.2 (2012). Bertsekas, D. P. Nonlinear programming. Athena Scientific, 1999. Bertsimas, D., D. B. Brown, and C. Caramanis. "Theory and applications of robust optimization". In: SIAM Rev. 53.3 (2011). 45 / 49

123 References II Boyd, S. and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004. Chen, M., C. Gao, and Z. Ren. "A General Decision Theory for Huber's ε-contamination Model". In: Electron. J. Stat. 10.2 (2016). Chen, Y., C. Caramanis, and S. Mannor. "Robust sparse regression under adversarial corruption". In: ICML. 2013. Donoho, D. L. and M. Gasko. "Breakdown properties of location estimates based on halfspace depth and projected outlyingness". In: Ann. Statist. 20.4 (1992). Golub, G. H. and C. F. Van Loan. Matrix computations. Fourth edition. Baltimore, MD: Johns Hopkins University Press, 2013. Hampel, F. R. et al. Robust statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Statistics. Wiley, 1986. 46 / 49

124 References III Hiriart-Urruty, J.-B. and C. Lemaréchal. Convex analysis and minimization algorithms. I. Vol. 305. Berlin: Springer-Verlag, 1993. Hiriart-Urruty, J.-B. and C. Lemaréchal. Convex analysis and minimization algorithms. II. Vol. 306. Berlin: Springer-Verlag, 1993. Horn, R. A. and C. R. Johnson. Topics in matrix analysis. Corrected reprint of the 1991 original. Cambridge: Cambridge University Press, 1994. Huber, P. J. and E. M. Ronchetti. Robust statistics. Second edition. Wiley Series in Probability and Statistics. Wiley, 2009. Maronna, R. A., R. D. Martin, and V. J. Yohai. Robust statistics: Theory and methods. Chichester: John Wiley & Sons, 2006. Minsker, S. "Geometric median and robust estimation in Banach spaces". In: Bernoulli 21.4 (2015). Mosler, K. "Depth statistics". In: Robustness and complex data structures. Springer, 2013. 47 / 49

125 References IV Murphy, K. P. Machine learning: a probabilistic perspective. MIT Press, 2012. Nesterov, Y. "Smooth minimization of non-smooth functions". In: Math. Program. 103.1 (2005). Parikh, N. et al. "Proximal algorithms". In: Foundations and Trends in Optimization 1.3 (2013). Rousseeuw, P. J. and A. M. Leroy. Robust regression and outlier detection. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons, 1987. Seber, G. A. F. and A. J. Lee. Linear Regression Analysis. Second edition. Wiley Series in Probability and Statistics. Wiley, 2003. Wei, X. and S. Minsker. "Estimation of the covariance structure of heavy-tailed distributions". In: NIPS. 2017. 48 / 49

126 References V Xu, H., C. Caramanis, and S. Mannor. "Robust regression and Lasso". In: IEEE Trans. Inf. Theory 56.7 (2010). 49 / 49
