Robust Statistics, Revisited

1 Robust Statistics, Revisited Ankur Moitra (MIT) joint work with Ilias Diakonikolas, Jerry Li, Gautam Kamath, Daniel Kane and Alistair Stewart

2 CLASSIC PARAMETER ESTIMATION Given samples from an unknown distribution in some class e.g. a 1-D Gaussian can we accurately estimate its parameters?

3 CLASSIC PARAMETER ESTIMATION Given samples from an unknown distribution in some class e.g. a 1-D Gaussian can we accurately estimate its parameters? Yes!

4 CLASSIC PARAMETER ESTIMATION Given samples from an unknown distribution in some class e.g. a 1-D Gaussian can we accurately estimate its parameters? Yes! empirical mean: μ̂ = (1/N) Σ_i X_i empirical variance: σ̂² = (1/N) Σ_i (X_i − μ̂)²

5 R. A. Fisher The maximum likelihood estimator is asymptotically efficient ( )

6 R. A. Fisher J. W. Tukey The maximum likelihood estimator is asymptotically efficient ( ) What about errors in the model itself? (1960)

7 ROBUST STATISTICS What estimators behave well in a neighborhood around the model?

8 ROBUST STATISTICS What estimators behave well in a neighborhood around the model? Let's study a simple one-dimensional example.

9 ROBUST PARAMETER ESTIMATION Given corrupted samples from a 1-D Gaussian: + = ideal model noise observed model can we accurately estimate its parameters?

10 How do we constrain the noise?

11 How do we constrain the noise? Equivalently: L1-norm of noise at most O(ε)

12 How do we constrain the noise? Equivalently: L1-norm of noise at most O(ε) Arbitrarily corrupt O(ε)-fraction of samples (in expectation)

13 How do we constrain the noise? Equivalently: L1-norm of noise at most O(ε) Arbitrarily corrupt O(ε)-fraction of samples (in expectation) This generalizes Huber's Contamination Model: An adversary can add an ε-fraction of samples

14 How do we constrain the noise? Equivalently: L1-norm of noise at most O(ε) Arbitrarily corrupt O(ε)-fraction of samples (in expectation) This generalizes Huber's Contamination Model: An adversary can add an ε-fraction of samples Outliers: Points the adversary has corrupted, Inliers: Points he hasn't
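
To make the contamination model concrete, here is a minimal sketch (my illustration, not code from the talk) of drawing samples in Huber's model: each point is an inlier from the Gaussian with probability 1−ε and is otherwise replaced by an arbitrary adversarial value (a hypothetical far-away point mass).

```python
import numpy as np

def huber_contaminated_samples(n, eps, mu=0.0, sigma=1.0, seed=None):
    """Draw n samples: inliers from N(mu, sigma^2), an eps-fraction (in
    expectation) replaced by arbitrary adversarial values."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=n)   # inliers
    corrupted = rng.random(n) < eps           # which points the adversary controls
    samples[corrupted] = 1e6                  # outliers: any values the adversary likes
    return samples, corrupted
```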

15 In what norm do we want the parameters to be close?

16 In what norm do we want the parameters to be close? Definition: The total variation distance between two distributions with pdfs f(x) and g(x) is

17 In what norm do we want the parameters to be close? Definition: The total variation distance between two distributions with pdfs f(x) and g(x) is From the bound on the L1-norm of the noise, we have: d_TV(ideal, observed) ≤ O(ε)

18 In what norm do we want the parameters to be close? Definition: The total variation distance between two distributions with pdfs f(x) and g(x) is Goal: Find a 1-D Gaussian that satisfies d_TV(estimate, ideal) ≤ O(ε)

19 In what norm do we want the parameters to be close? Definition: The total variation distance between two distributions with pdfs f(x) and g(x) is Equivalently, find a 1-D Gaussian that satisfies d_TV(estimate, observed) ≤ O(ε)
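
For reference, the definition on these slides written out as the standard formula:

```latex
d_{\mathrm{TV}}(f,g) \;=\; \tfrac{1}{2}\int \lvert f(x)-g(x)\rvert \,dx .
```

The two goals are equivalent up to constants by the triangle inequality, since the L1 bound on the noise gives d_TV(ideal, observed) ≤ O(ε).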

20 Do the empirical mean and empirical variance work?

21 Do the empirical mean and empirical variance work? No!

22 Do the empirical mean and empirical variance work? No! + = ideal model noise observed model

23 Do the empirical mean and empirical variance work? No! + = ideal model noise observed model A single corrupted sample can arbitrarily corrupt the estimates

24 Do the empirical mean and empirical variance work? No! + = ideal model noise observed model A single corrupted sample can arbitrarily corrupt the estimates But the median and median absolute deviation do work

26 Fact [Folklore]: Given samples from a distribution that are ε-close in total variation distance to a 1-D Gaussian the median and MAD recover estimates that satisfy where

27 Fact [Folklore]: Given samples from a distribution that are ε-close in total variation distance to a 1-D Gaussian the median and MAD recover estimates that satisfy where Also called (properly) agnostically learning a 1-D Gaussian
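
A minimal sketch of the folklore estimator in code (standard formulas, not from the talk; the constant 1.4826 ≈ 1/Φ⁻¹(3/4) rescales the MAD to the Gaussian standard deviation):

```python
import numpy as np

def robust_1d_gaussian(samples):
    """Agnostically learn a 1-D Gaussian: median for the mean,
    rescaled median absolute deviation (MAD) for the standard deviation."""
    mu_hat = np.median(samples)
    sigma_hat = 1.4826 * np.median(np.abs(samples - mu_hat))
    return mu_hat, sigma_hat

# A single wild sample ruins the empirical mean and variance,
# but barely moves the median and MAD:
x = np.concatenate([np.random.normal(0.0, 1.0, 999), [1e6]])
print(np.mean(x), np.std(x))       # badly corrupted
print(robust_1d_gaussian(x))       # close to (0, 1)
```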

28 Fact [Folklore]: Given samples from a distribution that are ε-close in total variation distance to a 1-D Gaussian the median and MAD recover estimates that satisfy where What about robust estimation in high-dimensions?

29 Fact [Folklore]: Given samples from a distribution that are ε-close in total variation distance to a 1-D Gaussian the median and MAD recover estimates that satisfy where What about robust estimation in high-dimensions? e.g. microarrays with 10k genes

30 OUTLINE Part I: Introduction Robust Estimation in One-dimension Robustness vs. Hardness in High-dimensions Our Results Part II: Agnostically Learning a Gaussian Parameter Distance Detecting When an Estimator is Compromised Filtering and Convex Programming Unknown Covariance Part III: Experiments and Extensions

32 Main Problem: Given samples from a distribution that are ε-close in total variation distance to a d-dimensional Gaussian give an efficient algorithm to find parameters that satisfy

33 Main Problem: Given samples from a distribution that are ε-close in total variation distance to a d-dimensional Gaussian give an efficient algorithm to find parameters that satisfy Special Cases: (1) Unknown mean (2) Unknown covariance

34 A COMPENDIUM OF APPROACHES Unknown Mean Error Guarantee Running Time

35 A COMPENDIUM OF APPROACHES Unknown Mean Tukey Median Error Guarantee Running Time

36 A COMPENDIUM OF APPROACHES Unknown Mean Tukey Median Error Guarantee O(ε) Running Time

37 A COMPENDIUM OF APPROACHES Unknown Mean Tukey Median Error Guarantee O(ε) Running Time NP-Hard

38 A COMPENDIUM OF APPROACHES Unknown Mean Tukey Median Error Guarantee O(ε) Running Time NP-Hard Geometric Median

39 A COMPENDIUM OF APPROACHES Unknown Mean Tukey Median Error Guarantee O(ε) Running Time NP-Hard Geometric Median poly(d,n)

40 A COMPENDIUM OF APPROACHES Unknown Mean Tukey Median Error Guarantee O(ε) Running Time NP-Hard Geometric Median O(ε√d) poly(d,n)

41 A COMPENDIUM OF APPROACHES Unknown Mean Tukey Median Error Guarantee O(ε) Running Time NP-Hard Geometric Median O(ε√d) poly(d,n) Tournament O(ε) N^O(d)

42 A COMPENDIUM OF APPROACHES Unknown Mean Tukey Median Error Guarantee O(ε) Running Time NP-Hard Geometric Median O(ε√d) poly(d,n) Tournament O(ε) N^O(d) Pruning O(ε√d) O(dN)

43 A COMPENDIUM OF APPROACHES
Unknown Mean        Error Guarantee    Running Time
Tukey Median        O(ε)               NP-Hard
Geometric Median    O(ε√d)             poly(d,n)
Tournament          O(ε)               N^O(d)
Pruning             O(ε√d)             O(dN)

44 The Price of Robustness? All known estimators are hard to compute or lose polynomial factors in the dimension

45 The Price of Robustness? All known estimators are hard to compute or lose polynomial factors in the dimension Equivalently: Computationally efficient estimators can only handle an ε ≲ 1/√d fraction of errors and get non-trivial (TV < 1) guarantees

47 The Price of Robustness? All known estimators are hard to compute or lose polynomial factors in the dimension Equivalently: Computationally efficient estimators can only handle an ε ≲ 1/√d fraction of errors and get non-trivial (TV < 1) guarantees Is robust estimation algorithmically possible in high-dimensions?

48 OUTLINE Part I: Introduction Robust Estimation in One-dimension Robustness vs. Hardness in High-dimensions Our Results Part II: Agnostically Learning a Gaussian Parameter Distance Detecting When an Estimator is Compromised Filtering and Convex Programming Unknown Covariance Part III: Experiments and Extensions

50 OUR RESULTS Robust estimation in high-dimensions is algorithmically possible! Theorem [Diakonikolas, Li, Kamath, Kane, Moitra, Stewart 16]: There is an algorithm that, when given samples from a distribution that is ε-close in total variation distance to a d-dimensional Gaussian, finds parameters that satisfy Moreover, the algorithm runs in time poly(n, d)

51 OUR RESULTS Robust estimation in high-dimensions is algorithmically possible! Theorem [Diakonikolas, Li, Kamath, Kane, Moitra, Stewart 16]: There is an algorithm that, when given samples from a distribution that is ε-close in total variation distance to a d-dimensional Gaussian, finds parameters that satisfy Moreover, the algorithm runs in time poly(n, d) Alternatively: Can approximate the Tukey median, etc., in interesting semi-random models

52 Simultaneously [Lai, Rao, Vempala 16] gave agnostic algorithms that achieve: and work for non-Gaussian distributions too

53 Simultaneously [Lai, Rao, Vempala 16] gave agnostic algorithms that achieve: and work for non-Gaussian distributions too Many other applications across both papers: product distributions, mixtures of spherical Gaussians, SVD, ICA

54 A GENERAL RECIPE Robust estimation in high-dimensions: Step #1: Find an appropriate parameter distance Step #2: Detect when the naïve estimator has been compromised Step #3: Find good parameters, or make progress Filtering: Fast and practical Convex Programming: Better sample complexity

55 A GENERAL RECIPE Robust estimation in high-dimensions: Step #1: Find an appropriate parameter distance Step #2: Detect when the naïve estimator has been compromised Step #3: Find good parameters, or make progress Filtering: Fast and practical Convex Programming: Better sample complexity Let's see how this works for unknown mean

56 OUTLINE Part I: Introduction Robust Estimation in One-dimension Robustness vs. Hardness in High-dimensions Our Results Part II: Agnostically Learning a Gaussian Parameter Distance Detecting When an Estimator is Compromised Filtering and Convex Programming Unknown Covariance Part III: Experiments and Extensions

58 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians

59 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians A Basic Fact: (1)

60 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians A Basic Fact: (1) This can be proven using Pinsker's Inequality and the well-known formula for KL-divergence between Gaussians
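
Spelled out (standard facts; my reconstruction of what the displayed bound (1) abbreviates): Pinsker's inequality plus the KL formula for identity-covariance Gaussians gives

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu_1, I)\,\middle\|\,\mathcal{N}(\mu_2, I)\right)
  = \tfrac{1}{2}\,\lVert \mu_1-\mu_2\rVert_2^2,
\qquad
d_{\mathrm{TV}} \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}}
  \;=\; \tfrac{1}{2}\,\lVert \mu_1-\mu_2\rVert_2 .
```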

61 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians A Basic Fact: (1)

62 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians A Basic Fact: (1) Corollary: If our estimate (in the unknown mean case) satisfies then

63 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians A Basic Fact: (1) Corollary: If our estimate (in the unknown mean case) satisfies then Our new goal is to be close in Euclidean distance

64 OUTLINE Part I: Introduction Robust Estimation in One-dimension Robustness vs. Hardness in High-dimensions Our Results Part II: Agnostically Learning a Gaussian Parameter Distance Detecting When an Estimator is Compromised Filtering and Convex Programming Unknown Covariance Part III: Experiments and Extensions

66 DETECTING CORRUPTIONS Step #2: Detect when the naïve estimator has been compromised

67 DETECTING CORRUPTIONS Step #2: Detect when the naïve estimator has been compromised [scatter plot: uncorrupted vs. corrupted points]

68 DETECTING CORRUPTIONS Step #2: Detect when the naïve estimator has been compromised [scatter plot: uncorrupted vs. corrupted points] There is a direction of large (> 1) variance

69 Key Lemma: If X1, X2, ..., XN come from a distribution that is ε-close to and then for (1) (2) with probability at least 1-δ

70 Key Lemma: If X1, X2, ..., XN come from a distribution that is ε-close to and then for (1) (2) with probability at least 1-δ Take-away: An adversary needs to mess up the second moment in order to corrupt the first moment
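
A minimal sketch of the detection step for the unknown-mean case (illustrative only; the threshold constant below is hypothetical, not the paper's exact certificate):

```python
import numpy as np

def certify_naive_mean(X, eps):
    """X: (N, d) array of eps-corrupted samples from N(mu, I).
    The empirical mean is trustworthy when no direction has variance much
    larger than 1; otherwise the second moment betrays the corruption."""
    mu_hat = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top_val, top_dir = eigvals[-1], eigvecs[:, -1]
    ok = top_val <= 1 + 10 * eps * np.log(1 / eps)   # illustrative threshold
    return mu_hat, top_val, top_dir, ok
```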

71 OUTLINE Part I: Introduction Robust Estimation in One-dimension Robustness vs. Hardness in High-dimensions Our Results Part II: Agnostically Learning a Gaussian Parameter Distance Detecting When an Estimator is Compromised Filtering and Convex Programming Unknown Covariance Part III: Experiments and Extensions

73 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers

74 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that:

75 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that: We can throw out more corrupted than uncorrupted points by thresholding projections onto v, where v is the direction of largest variance

76 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that: We can throw out more corrupted than uncorrupted points by thresholding projections onto v, where v is the direction of largest variance, and the threshold T has an explicit formula

77 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that: We can throw out more corrupted than uncorrupted points by thresholding projections onto v, where v is the direction of largest variance, and the threshold T has an explicit formula

78 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that: We can throw out more corrupted than uncorrupted points

79 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that: We can throw out more corrupted than uncorrupted points If we continue too long, we'd have no corrupted points left!

80 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that: We can throw out more corrupted than uncorrupted points If we continue too long, we'd have no corrupted points left! Eventually we find (certifiably) good parameters
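
A simplified sketch of the filtering loop for the unknown-mean case (illustrative: the real algorithm's threshold T comes from a Gaussian tail-bound formula, replaced here by a crude quantile rule):

```python
import numpy as np

def filter_mean(X, eps, max_iter=100):
    """Repeatedly remove points projecting far along the direction of
    largest variance, until the empirical covariance looks Gaussian."""
    X = X.copy()
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
        top_val, v = eigvals[-1], eigvecs[:, -1]
        if top_val <= 1 + 10 * eps * np.log(1 / eps):   # certifiably good
            return mu
        proj = np.abs((X - mu) @ v)                     # projections onto bad direction
        T = np.quantile(proj, 1 - eps)                  # crude stand-in for the paper's T
        X = X[proj <= T]                                # tail is mostly outliers
    return X.mean(axis=0)
```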

81 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that: We can throw out more corrupted than uncorrupted points If we continue too long, we'd have no corrupted points left! Eventually we find (certifiably) good parameters Running Time: Sample Complexity:

82 OUR ALGORITHM(S) Step #3: Either find good parameters, or remove many outliers Filtering Approach: Suppose that: We can throw out more corrupted than uncorrupted points If we continue too long, we'd have no corrupted points left! Eventually we find (certifiably) good parameters Running Time: Sample Complexity: Concentration of LTFs

83 OUTLINE Part I: Introduction Robust Estimation in One-dimension Robustness vs. Hardness in High-dimensions Our Results Part II: Agnostically Learning a Gaussian Parameter Distance Detecting When an Estimator is Compromised Filtering and Convex Programming Unknown Covariance Part III: Experiments and Extensions

85 A GENERAL RECIPE Robust estimation in high-dimensions: Step #1: Find an appropriate parameter distance Step #2: Detect when the naïve estimator has been compromised Step #3: Find good parameters, or make progress Filtering: Fast and practical Convex Programming: Better sample complexity

86 A GENERAL RECIPE Robust estimation in high-dimensions: Step #1: Find an appropriate parameter distance Step #2: Detect when the naïve estimator has been compromised Step #3: Find good parameters, or make progress Filtering: Fast and practical Convex Programming: Better sample complexity How about for unknown covariance?

87 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians

88 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians Another Basic Fact: (2)

89 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians Another Basic Fact: (2) Again, proven using Pinsker's Inequality

90 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians Another Basic Fact: (2) Again, proven using Pinsker's Inequality Our new goal is to find an estimate that satisfies:

91 PARAMETER DISTANCE Step #1: Find an appropriate parameter distance for Gaussians Another Basic Fact: (2) Again, proven using Pinsker's Inequality Our new goal is to find an estimate that satisfies: Distance seems strange, but it's the right one to use to bound TV
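
Written out (my reading of what the displayed bound (2) and the goal abbreviate; standard up to constants): for mean-zero Gaussians, the right parameter distance is the Frobenius distance after normalizing by one of the covariances,

```latex
d_{\mathrm{TV}}\!\left(\mathcal{N}(0,\Sigma_1),\,\mathcal{N}(0,\Sigma_2)\right)
  \;\le\; O\!\left(\lVert \Sigma_1^{-1/2}\,\Sigma_2\,\Sigma_1^{-1/2} - I \rVert_F\right),
\qquad\text{so we want}\quad
\lVert \Sigma^{-1/2}\,\widehat{\Sigma}\,\Sigma^{-1/2} - I \rVert_F \;\le\; \widetilde{O}(\varepsilon).
```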

92 UNKNOWN COVARIANCE What if we are given samples from?

93 UNKNOWN COVARIANCE What if we are given samples from? How do we detect if the naïve estimator is compromised?

94 UNKNOWN COVARIANCE What if we are given samples from? How do we detect if the naïve estimator is compromised? Key Fact: Let and Then restricted to flattenings of d x d symmetric matrices

95 UNKNOWN COVARIANCE What if we are given samples from? How do we detect if the naïve estimator is compromised? Key Fact: Let and Then restricted to flattenings of d x d symmetric matrices Proof uses Isserlis's Theorem
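
One consequence of Isserlis's theorem that powers this step (a standard identity, stated as my gloss on the Key Fact): for Y ~ N(0, I) and any symmetric matrix A,

```latex
\mathbb{E}\!\left[\,Y^{\top} A\, Y\,\right] = \operatorname{tr}(A),
\qquad
\operatorname{Var}\!\left(\,Y^{\top} A\, Y\,\right) = 2\,\lVert A\rVert_F^2 ,
```

so the covariance of the flattened matrices Y Y^T, restricted to flattenings of d x d symmetric matrices, is an explicit known operator; a restricted eigenvalue far above this baseline certifies corruption.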

96 UNKNOWN COVARIANCE What if we are given samples from? How do we detect if the naïve estimator is compromised? Key Fact: Let and Then restricted to flattenings of d x d symmetric matrices need to project out

97 Key Idea: Transform the data, look for restricted large eigenvalues

99 Key Idea: Transform the data, look for restricted large eigenvalues If were the true covariance, we would have for inliers

100 Key Idea: Transform the data, look for restricted large eigenvalues If were the true covariance, we would have for inliers, in which case: would have small restricted eigenvalues

101 Key Idea: Transform the data, look for restricted large eigenvalues If were the true covariance, we would have for inliers, in which case: would have small restricted eigenvalues Take-away: An adversary needs to mess up the (restricted) fourth moment in order to corrupt the second moment
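
A rough sketch of the unknown-covariance detection step (my illustration, not the paper's exact statistic): transform by the current covariance estimate, flatten the centered second-moment matrices, and check whether any direction has variance far above the Gaussian baseline of roughly 2.

```python
import numpy as np

def covariance_certificate(X, Sigma_hat):
    """If Sigma_hat were the true covariance, Y = Sigma_hat^{-1/2} X would be
    ~N(0, I) for inliers, and the flattened matrices Y Y^T - I would have
    variance at most ~2 in every (symmetric) direction."""
    evals, evecs = np.linalg.eigh(Sigma_hat)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T        # Sigma_hat^{-1/2}
    Y = X @ W.T
    d = Y.shape[1]
    Z = np.stack([np.outer(y, y).flatten() - np.eye(d).flatten() for y in Y])
    top = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[-1]
    return top                                          # compare against the ~2 baseline
```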

102 ASSEMBLING THE ALGORITHM Given samples that are ε-close in total variation distance to a d-dimensional Gaussian

103 ASSEMBLING THE ALGORITHM Given samples that are ε-close in total variation distance to a d-dimensional Gaussian Step #1: Doubling trick

104 ASSEMBLING THE ALGORITHM Given samples that are ε-close in total variation distance to a d-dimensional Gaussian Step #1: Doubling trick Now use algorithm for unknown covariance

105 ASSEMBLING THE ALGORITHM Given samples that are ε-close in total variation distance to a d-dimensional Gaussian Step #1: Doubling trick Now use algorithm for unknown covariance Step #2: (Agnostic) isotropic position

106 ASSEMBLING THE ALGORITHM Given samples that are ε-close in total variation distance to a d-dimensional Gaussian Step #1: Doubling trick Now use algorithm for unknown covariance Step #2: (Agnostic) isotropic position right distance, in general case

107 ASSEMBLING THE ALGORITHM Given samples that are ε-close in total variation distance to a d-dimensional Gaussian Step #1: Doubling trick Now use algorithm for unknown covariance Step #2: (Agnostic) isotropic position right distance, in general case Now use algorithm for unknown mean
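
A sketch of the assembly (my paraphrase; robust_cov and robust_mean stand for the unknown-covariance and unknown-mean routines sketched above, and a corrupted point contaminates the pair it lands in, so the effective corruption rate roughly doubles):

```python
import numpy as np

def robust_gaussian(X, eps, robust_cov, robust_mean):
    """X: (N, d) eps-corrupted samples from N(mu, Sigma)."""
    # Step 1: doubling trick -- differences of disjoint pairs are mean-free:
    # (X_{2i} - X_{2i+1}) / sqrt(2) ~ N(0, Sigma) for uncorrupted pairs.
    n_pairs = len(X) // 2
    D = (X[0:2 * n_pairs:2] - X[1:2 * n_pairs:2]) / np.sqrt(2)
    Sigma_hat = robust_cov(D, 2 * eps)
    # Step 2: (agnostic) isotropic position -- whiten with Sigma_hat^{-1/2},
    # then run the unknown-mean algorithm and undo the whitening.
    evals, evecs = np.linalg.eigh(Sigma_hat)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    mu_white = robust_mean(X @ W.T, eps)
    mu_hat = np.linalg.solve(W, mu_white)
    return mu_hat, Sigma_hat
```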

108 OUTLINE Part I: Introduction Robust Estimation in One-dimension Robustness vs. Hardness in High-dimensions Our Results Part II: Agnostically Learning a Gaussian Parameter Distance Detecting When an Estimator is Compromised Filtering and Convex Programming Unknown Covariance Part III: Experiments and Extensions

110 FURTHER RESULTS Use restricted eigenvalue problems to detect outliers

111 FURTHER RESULTS Use restricted eigenvalue problems to detect outliers Binary Product Distributions:

112 FURTHER RESULTS Use restricted eigenvalue problems to detect outliers Binary Product Distributions: Mixtures of Two c-balanced Binary Product Distributions:

113 FURTHER RESULTS Use restricted eigenvalue problems to detect outliers Binary Product Distributions: Mixtures of Two c-balanced Binary Product Distributions: Mixtures of k Spherical Gaussians:

114 SYNTHETIC EXPERIMENTS Error rates on synthetic data (unknown mean): + 10% noise

115 SYNTHETIC EXPERIMENTS Error rates on synthetic data (unknown mean): [plots: excess ℓ2 error vs. dimension for Filtering, Sample mean w/ noise, RANSAC, LRVMean, Pruning, Geometric Median]

116 SYNTHETIC EXPERIMENTS Error rates on synthetic data (unknown covariance, isotropic): + 10% noise close to identity

117 SYNTHETIC EXPERIMENTS Error rates on synthetic data (unknown covariance, isotropic): [plots: excess ℓ2 error vs. dimension for Filtering, Sample covariance w/ noise, RANSAC, LRVCov, Pruning]

118 SYNTHETIC EXPERIMENTS Error rates on synthetic data (unknown covariance, anisotropic): + 10% noise far from identity

119 SYNTHETIC EXPERIMENTS Error rates on synthetic data (unknown covariance, anisotropic): [plots: excess ℓ2 error vs. dimension for Filtering, Sample covariance w/ noise, RANSAC, LRVCov, Pruning]

120 REAL DATA EXPERIMENTS Famous study of [Novembre et al. 08]: Take top two singular vectors of people x SNP matrix (POPRES)

121 REAL DATA EXPERIMENTS Famous study of [Novembre et al. 08]: Take top two singular vectors of people x SNP matrix (POPRES) Original Data

123 REAL DATA EXPERIMENTS Famous study of [Novembre et al. 08]: Take top two singular vectors of people x SNP matrix (POPRES) Original Data Genes Mirror Geography in Europe

124 REAL DATA EXPERIMENTS Can we find such patterns in the presence of noise?

125 REAL DATA EXPERIMENTS Can we find such patterns in the presence of noise? 10% noise Pruning Projection What PCA finds

127 REAL DATA EXPERIMENTS Can we find such patterns in the presence of noise? 10% noise RANSAC Projection What RANSAC finds

128 REAL DATA EXPERIMENTS Can we find such patterns in the presence of noise? 10% noise XCS Projection What robust PCA (via SDPs) finds

129 REAL DATA EXPERIMENTS Can we find such patterns in the presence of noise? 10% noise Filter Projection What our methods find

130 REAL DATA EXPERIMENTS The power of provably robust estimation: 10% noise Filter Projection no noise Original Data What our methods find

131 LOOKING FORWARD Can algorithms for agnostically learning a Gaussian help in exploratory data analysis in high-dimensions?

132 LOOKING FORWARD Can algorithms for agnostically learning a Gaussian help in exploratory data analysis in high-dimensions? Isn't this what we would have been doing with robust statistical estimators, if we had them all along?

133 Summary: Nearly optimal algorithm for agnostically learning a high-dimensional Gaussian General recipe using restricted eigenvalue problems Further applications to other mixture models Is practical, robust statistics within reach? Thanks! Any Questions?
