Fast Algorithms for Segmented Regression


Jayadev Acharya (MIT), Ilias Diakonikolas (USC), Jerry Li (MIT), Ludwig Schmidt (MIT). June 21.

Statistical vs computational tradeoffs?

General Motivating Question: When is it worth it to trade statistical efficiency for runtime?

Given two estimators:
- Estimator A: great statistical rate, but slow to compute
- Estimator B: worse statistical rate, but faster to compute
When is it better to use estimator B vs estimator A?

"As data grows, it may be beneficial to consider faster inferential algorithms, because the increasing statistical strength of the data can compensate for the poor algorithmic quality." [Jor13]

Outline: Introduction, The exact algorithm, Our algorithm, Experiments

Introduction

Linear regression

We are given a labelled data set $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \in \mathbb{R}^d \times \mathbb{R}$ so that
$y^{(i)} = l^*(x^{(i)}) + \epsilon^{(i)}$,
where $l^*(x) = \langle \theta^*, x \rangle$ is an unknown linear function that we want to recover.

Assume that the $\epsilon^{(i)}$ are independent noise variables.

Goal: Find a linear $\hat{l}(x)$ minimizing
$\mathrm{MSE}(\hat{l}) = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{l}(x^{(i)}) - l^*(x^{(i)})\big)^2$.

We consider fixed design regression: we assume the $x^{(i)}$ are fixed, and the only randomness is over the $\epsilon^{(i)}$.

The least squares estimator

Definition (Least squares estimator). The least squares estimator, denoted $\hat{l}^{LS}$, is given by
$\hat{l}^{LS} \stackrel{\mathrm{def}}{=} \arg\min_{l \text{ linear}} \frac{1}{n} \sum_{i=1}^{n} \big(y^{(i)} - l(x^{(i)})\big)^2$.

The least squares fit is simply the best-fit linear function to the data.
How well does it recover the ground truth $l^*$?

The least squares estimator

Theorem. Let $\hat{l}^{LS}$ be as above. Suppose that $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Then with high probability,
$\mathrm{MSE}(\hat{l}^{LS}) = O\!\left(\sigma^2 \frac{d}{n}\right)$.
Moreover, $\hat{l}^{LS}$ can be computed in time $O(nd^2)$.

More recent work (see e.g. [CW13]) gets even faster theoretical runtimes.
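As a concrete illustration of the fixed-design setup and the least squares estimator above, here is a minimal numpy sketch (not from the talk; the sizes, noise level, and random ground truth below are made-up example values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example sizes and ground truth (assumptions for illustration only).
n, d = 1000, 5
sigma = 0.5
X = rng.normal(size=(n, d))                        # fixed design points x^(i)
theta_star = rng.normal(size=d)                    # unknown l*(x) = <theta*, x>
y = X @ theta_star + sigma * rng.normal(size=n)    # y^(i) = l*(x^(i)) + eps^(i)

# Least squares estimator: argmin over linear l of (1/n) sum_i (y^(i) - l(x^(i)))^2.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fixed-design MSE against the ground truth, (1/n) sum_i (l_hat(x^(i)) - l*(x^(i)))^2.
mse = np.mean((X @ theta_hat - X @ theta_star) ** 2)
print(f"MSE = {mse:.5f}, sigma^2 * d / n = {sigma**2 * d / n:.5f}")
```

The printed MSE should be on the order of $\sigma^2 d / n$, matching the rate in the theorem.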

Dealing with change

What if linear regression is insufficient?

[Figure: Dow Jones index data]

Q: What if your model changes as a function of one of your variables?
A: Model it with a piecewise linear fit!

Segmented Regression

Definition (Piecewise linearity). A function $f : \mathbb{R}^d \to \mathbb{R}$ is $k$-piecewise linear if there exists a partition of $\mathbb{R}$ into $k$ intervals $I_1, \ldots, I_k$ so that for all $j$, the function $f$ is linear when restricted to the set of $x \in \mathbb{R}^d$ with $x_1 \in I_j$.

Segmented Regression [BP98, YP13]. Given a data set $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$ so that $y^{(i)} = f^*(x^{(i)}) + \epsilon^{(i)}$, where the $\epsilon^{(i)}$ are independent noise and $f^*$ is $k$-piecewise linear, recover $f^*$ in MSE.
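To make the definition concrete, here is a small sketch (my illustration, not from the talk) of evaluating a $k$-piecewise linear function whose pieces are determined by the first coordinate $x_1$; the breakpoints and coefficient values are made-up examples:

```python
import numpy as np

def piecewise_linear(x, breakpoints, thetas):
    """Evaluate a k-piecewise linear f: R^d -> R at the rows of x.

    The partition of R is given by the k-1 sorted interior breakpoints;
    thetas has shape (k, d), one linear function <theta_j, .> per interval.
    Which piece applies is decided by the first coordinate x_1 only.
    """
    x = np.atleast_2d(x)
    piece = np.searchsorted(breakpoints, x[:, 0], side="right")
    return np.einsum("ij,ij->i", x, thetas[piece])

# Made-up example: k = 3 pieces in d = 2 dimensions.
breakpoints = np.array([-1.0, 1.0])
thetas = np.array([[2.0, 0.5], [-1.0, 0.5], [3.0, 0.5]])
x = np.array([[-2.0, 1.0], [0.0, 1.0], [2.0, 1.0]])
print(piecewise_linear(x, breakpoints, thetas))   # one value per row of x
```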

The exact algorithm

The k-piecewise linear LS estimator

Definition (Least squares estimator). The $k$-piecewise linear least squares estimator, denoted $\hat{f}_k^{LS}$, is given by
$\hat{f}_k^{LS} \stackrel{\mathrm{def}}{=} \arg\min_{f \; k\text{-piecewise linear}} \frac{1}{n} \sum_{i=1}^{n} \big(y^{(i)} - f(x^{(i)})\big)^2$.

Theorem. Let $\hat{f}_k^{LS}$ be as above. Suppose that $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Then with high probability,
$\mathrm{MSE}(\hat{f}_k^{LS}) = O\!\left(\sigma^2 \frac{kd}{n}\right)$.
Moreover, this rate is optimal.

The k-piecewise linear LS estimator, computationally

How fast can you compute this estimator?

Theorem (BP98). There is a dynamic program for computing the $k$-piecewise linear LS estimator on $n$ samples in $d$ dimensions which runs in time $O(n^2(d^2 + k))$.

So poly time... but quite slow as $n$ gets large: the algorithm took $\Theta(1)$ minutes to run for $10^4$ samples, and $\Theta(1)$ hours for $10^5$ samples.
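To show the structure of this dynamic program, here is a simplified Python sketch (my illustration, not the authors' implementation). It recomputes every candidate segment's least squares fit from scratch, so it does not achieve the $O(n^2(d^2+k))$ bound of the theorem, which reuses work across segments, but the recurrence over breakpoints is the same:

```python
import numpy as np

def segment_error(X, y, i, j):
    """Sum of squared residuals of the least squares fit on samples i..j-1."""
    Xs, ys = X[i:j], y[i:j]
    theta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return float(np.sum((Xs @ theta - ys) ** 2))

def exact_dp(X, y, k):
    """k-piecewise linear least squares via dynamic programming (simplified sketch).

    Assumes the rows of X are sorted by their first coordinate.
    Returns the optimal total squared error and the chosen segment boundaries.
    """
    n = X.shape[0]
    # Error of fitting one linear piece to each contiguous block of samples.
    err = {(i, j): segment_error(X, y, i, j)
           for i in range(n) for j in range(i + 1, n + 1)}
    INF = float("inf")
    # dp[p][j] = best error for covering the first j samples with p pieces.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    parent = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(j):
                if dp[p - 1][i] + err[(i, j)] < dp[p][j]:
                    dp[p][j] = dp[p - 1][i] + err[(i, j)]
                    parent[p][j] = i
    # Backtrack to recover the segment boundaries.
    bounds, j = [n], n
    for p in range(k, 0, -1):
        j = parent[p][j]
        bounds.append(j)
    return dp[k][n], bounds[::-1]
```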

Our algorithm

Our Results

Main Result (informal). An algorithm for segmented regression which runs in time linear in $n$... but has a worse theoretical statistical guarantee.

Formally:

Theorem. There is a $4k$-piecewise linear estimator $\hat{f}$ which can be computed in time $O(n(d^2 + k))$ so that w.h.p.
$\mathrm{MSE}(\hat{f}) \le \tilde{O}\!\left(\sigma^2 \frac{kd}{n} + \sigma \sqrt{\frac{k}{n}}\right)$.

Compare and contrast

Algorithm    | Statistical rate                                                            | Runtime
DP           | $O\!\left(\sigma^2 \frac{kd}{n}\right)$                                     | $O(n^2(d^2 + k))$
Our results  | $\tilde{O}\!\left(\sigma^2 \frac{kd}{n} + \sigma\sqrt{\frac{k}{n}}\right)$  | $O(n(d^2 + k))$

Given enough data, how much time does it take to get some target accuracy $\epsilon$?

DP: $O\!\left(\sigma^2 \frac{k^2 d^2}{\epsilon^2}(d^2 + k)\right)$
Our results: $\tilde{O}\!\left(\frac{\sigma^2 k}{\epsilon} \max\!\left(d, \frac{1}{\epsilon}\right)(d^2 + k)\right)$
Speedup: $\tilde{O}\!\left(\min\!\left(\frac{kd}{\epsilon}, k d^2\right)\right)$
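One way to read the "Our results" runtime (a back-of-the-envelope sketch, dropping constants and log factors): invert the statistical rate to find the sample size $n$ needed for target accuracy $\epsilon$, then substitute into the runtime. Requiring $\sigma^2 kd/n \le \epsilon$ gives $n \gtrsim \sigma^2 kd/\epsilon$, and requiring $\sigma\sqrt{k/n} \le \epsilon$ gives $n \gtrsim \sigma^2 k/\epsilon^2$; hence $n \approx \frac{\sigma^2 k}{\epsilon}\max\!\left(d, \frac{1}{\epsilon}\right)$ suffices, and the runtime $O(n(d^2 + k))$ becomes $\tilde{O}\!\left(\frac{\sigma^2 k}{\epsilon}\max\!\left(d, \frac{1}{\epsilon}\right)(d^2 + k)\right)$, matching the bound above.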

The Greedy Merging Algorithm

Input: a labelled data set $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$.

Sort the samples so that $x_1^{(1)} \le x_1^{(2)} \le \ldots \le x_1^{(n)}$.
Let $\mathcal{I}_0 \leftarrow \{\{x_1^{(1)}\}, \{x_1^{(2)}\}, \ldots, \{x_1^{(n)}\}\}$.
While $|\mathcal{I}_j| > 4k$:
  Let $\mathcal{I}_j = I_1, \ldots, I_s$.
  Pair up consecutive intervals: $J_u = I_{2u-1} \cup I_{2u}$, $u = 1, \ldots, s/2$.
  For each $J_u$, compute the least squares fit for all data points in $J_u$, and an error quantity $e_u$.
  Let $L$ be the set of the $2k$ indices $u$ with largest $e_u$.
  For $u = 1, \ldots, s/2$:
    If $u \notin L$, include the merged interval $I_{2u-1} \cup I_{2u}$ in $\mathcal{I}_{j+1}$.
    Else if $u \in L$, include $I_{2u-1}$ and $I_{2u}$ separately in $\mathcal{I}_{j+1}$.

Output: the linear least squares fit over each interval in $\mathcal{I}_j$.
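Here is a minimal Python sketch of this merging procedure (my illustration, not the authors' code). For the error quantity $e_u$ it simply uses the sum of squared residuals of each candidate's least squares fit; the talk's actual error rule uses knowledge of $\sigma^2$ (see the remarks below), so treat this as a simplified stand-in:

```python
import numpy as np

def ls_fit(X, y, idx):
    """Least squares fit on the samples in idx; returns (theta, sum of squared residuals)."""
    Xs, ys = X[idx], y[idx]
    theta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return theta, float(np.sum((Xs @ theta - ys) ** 2))

def greedy_merging(X, y, k):
    """Greedy merging sketch: returns the final intervals (index arrays) and one fit per interval.

    Assumes the rows of X are sorted by their first coordinate. The error e_u is the
    plain sum of squared residuals of the merged candidate, a simplified stand-in
    for the sigma^2-aware rule used in the talk.
    """
    intervals = [np.array([i]) for i in range(X.shape[0])]   # I_0: one singleton per sample
    while len(intervals) > 4 * k:
        # Pair up consecutive intervals; a trailing odd interval is carried over unchanged.
        pairs = [(intervals[2 * u], intervals[2 * u + 1]) for u in range(len(intervals) // 2)]
        leftover = [intervals[-1]] if len(intervals) % 2 else []
        errors = [ls_fit(X, y, np.concatenate(pair))[1] for pair in pairs]
        # Keep the 2k candidates with largest error un-merged (always merging at least
        # one pair per round so the loop terminates).
        num_keep = min(2 * k, len(pairs) - 1)
        keep_split = set(np.argsort(errors)[len(pairs) - num_keep:]) if num_keep > 0 else set()
        nxt = []
        for u, (left, right) in enumerate(pairs):
            if u in keep_split:
                nxt.extend([left, right])
            else:
                nxt.append(np.concatenate([left, right]))
        intervals = nxt + leftover
    fits = [ls_fit(X, y, idx)[0] for idx in intervals]
    return intervals, fits
```

Calling greedy_merging(X, y, k) on data sorted by the first coordinate returns at most $4k$ intervals and one linear fit per interval, which together define the piecewise linear estimate $\hat{f}$.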

Example: k = 2

[Figure: step-by-step illustration of greedy merging on one-dimensional data (x-axis: $x_1$) with $k = 2$.]

Remarks

- Algorithmically similar to an algorithm due to [ADHLS15] for histogram approximation; however, the analysis here is quite different and more involved.
- Can get a smooth tradeoff between runtime and number of pieces (see the paper for details).
- The error rule we use requires knowledge of $\sigma^2$; we also give a similar algorithm which (up to log factors) matches the same guarantees and requires no such knowledge.
- Can show that our algorithms are robust to model misspecification.

Experiments

Experiments: piecewise constant

[Figure: MSE, relative MSE ratio vs. $n$, running time (s) vs. $n$, and speed-up vs. $n$; curves for Merging with $k$, $2k$, and $4k$ pieces and for the exact DP.]

Experiments: piecewise linear

[Figure: MSE, relative MSE ratio vs. $n$, running time (s) vs. $n$, and speed-up vs. $n$; curves for Merging with $k$, $2k$, and $4k$ pieces and for the exact DP.]

Experiments: time vs. error trade-off

[Figure: MSE vs. time (s), log scales, for piecewise constant and piecewise linear data; curves for Merging with $k$, $2k$, and $4k$ pieces and for the exact DP.]

Experiments: real data

[Figure: Dow Jones index value over time, together with the exact DP fit and the merging fit.]

Experiments: summary

- Our algorithm runs 1000× faster with $n = 10^4$.
- Our algorithm's MSE on synthetic data was 2 to 4 times worse than the DP's.
- Given enough data, we get the same MSE 100× faster.

Conclusions

- When is it worth it to trade statistical effectiveness for algorithmic efficiency?
- We give an algorithm for segmented regression that gets a worse theoretical MSE, but a much faster runtime.
- Experimentally, our algorithm has slightly worse MSE, but runs 1000× faster.
- Open Question: Is this tradeoff necessary?

Thank you!
