Online isotonic regression


1 Online isotonic regression Wojciech Kotłowski (Poznań University of Technology) Joint work with: Wouter Koolen (CWI, Amsterdam) and Alan Malek (MIT)

2 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

3 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

4 Motivation I: house pricing. Assess the selling price of a house based on its attributes.

5 Motivation I: house pricing. Den Bosch data set (scatter plot of price vs. area).

6 Motivation I: house pricing. Fitting a linear function (price vs. area).

7 Motivation I: house pricing. Fitting an isotonic¹ function (price vs. area). ¹ isotonic = non-decreasing, order-preserving.

8 Motivation II: predicting good probabilities. Predictions of an SVM classifier (German credit data), plotted as label vs. score. Can we turn score values into conditional probabilities $P(y \mid x)$?

9 Motivation II: predicting good probabilities. Fitting an isotonic function to the labels [Zadrozny & Elkan, 2002] (label/probability vs. score).

10 Motivation II: predicting good probabilities. Calibration plots (reliability curves), fraction of positives: perfectly calibrated, Logistic (0.099), SVM (0.163), SVM + Isotonic (0.100). (Generated by a script from scikit-learn.org.)

11 Motivation II: predicting good probabilities. Calibration plots (reliability curves), fraction of positives: perfectly calibrated, Logistic (0.099), Naive Bayes (0.118), Naive Bayes + Isotonic (0.098). (Generated by a script from scikit-learn.org.)

12 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

13 Isotonic regression Definition Fit an isotonic (monotonically increasing) function to the data. Extensively studied in statistics [Ayer et al., 55; Brunk, 55; Robertson et al., 98]. Numerous applications: Biology, medicine, psychology, etc. Multicriteria decision support. Hypothesis tests under order constraints. Multidimensional scaling. Machine learning: probability calibration, ROC analysis.

14 Isotonic regression

15 Isotonic regression Definition: Given data $\{(x_t, y_t)\}_{t=1}^T \subset \mathbb{R} \times \mathbb{R}$, find an isotonic (non-decreasing) $f \colon \mathbb{R} \to \mathbb{R}$ which minimizes the squared error over the labels: $\min_f \sum_{t=1}^T (y_t - f(x_t))^2$, subject to $x_t \le x_q \implies f(x_t) \le f(x_q)$ for all $q, t \in \{1, \ldots, T\}$. The optimal solution $f^*$ is called the isotonic regression function. Only the values $f(x_t)$, $t = 1, \ldots, T$, matter.

16 Isotonic regression example (source: scikit-learn.org)
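A minimal sketch of how such an isotonic fit can be computed with scikit-learn's IsotonicRegression estimator (the data below is made up purely for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Made-up data: noisy labels that are roughly increasing in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.1, 0.4, 0.3, 0.6, 0.5, 0.9])

ir = IsotonicRegression(increasing=True, out_of_bounds="clip")
y_fit = ir.fit_transform(x, y)   # values f(x_1), ..., f(x_T) of the isotonic fit
print(y_fit)                     # non-decreasing, piecewise constant on level sets
print(ir.predict([2.5, 4.5]))    # between the x_t: linear interpolation
```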

17 Properties of isotonic regression: Depends on the instances (x) only through their order relation. Only defined at the points $\{x_1, \ldots, x_T\}$; often extended to $\mathbb{R}$ by linear interpolation. Piecewise constant (splits the data into level sets). Self-averaging property: the value of $f^*$ on a given level set equals the average of the labels in that level set, i.e. for any $v$: $v = \frac{1}{|S_v|} \sum_{t \in S_v} y_t$, where $S_v = \{t : f^*(x_t) = v\}$. If $y_t \in [a, b]$ for all $t$, then $f^*(x_t) \in [a, b]$ for all $t$.

18 Isotonic regression gives calibrated probabilities Definition: Let $y \in \{0, 1\}$. A probability estimator $p$ of $y$ is calibrated if $\mathbb{E}[y \mid p = v] = v$.

19 Isotonic regression gives calibrated probabilities Definition: Let $y \in \{0, 1\}$. A probability estimator $p$ of $y$ is calibrated if $\mathbb{E}[y \mid p = v] = v$. Fact: For binary labels, the isotonic regression $f^*$ is a calibrated probability estimator on the data set. Proof: Let $S_v = \{t : f^*(x_t) = v\}$. By self-averaging: $\mathbb{E}[y \mid f^*(x) = v] = \frac{1}{|S_v|} \sum_{t \in S_v} y_t = v$.

20 Pool Adjacent Violators Algorithm (PAVA) Iterative merging of data points into blocks until no violators of the isotonic constraints remain. The value assigned to each block is the average of the labels in that block. The final assignment of points to blocks corresponds to the level sets of the isotonic regression. Works in linear $O(T)$ time, but requires the data to be sorted. (A minimal implementation sketch is given below.)
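A minimal PAVA sketch for squared loss (my own illustrative implementation, not the authors' code); it assumes the points are already sorted by x and returns one fitted value per point:

```python
def pava(y, w=None):
    """Isotonic regression of the label sequence y (assumed sorted by x) under
    squared loss, via the Pool Adjacent Violators Algorithm."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [total weight, weighted mean of labels, #points]
    for yi, wi in zip(y, w):
        blocks.append([wi, yi, 1])
        # Merge adjacent blocks while they violate the isotonic constraint.
        while len(blocks) > 1 and blocks[-2][1] > blocks[-1][1]:
            w2, m2, n2 = blocks.pop()
            w1, m1, n1 = blocks.pop()
            blocks.append([w1 + w2, (w1 * m1 + w2 * m2) / (w1 + w2), n1 + n2])
    # Expand the blocks back into one fitted value per data point.
    fitted = []
    for wb, mb, nb in blocks:
        fitted.extend([mb] * nb)
    return fitted

print(pava([0.1, 0.6, 0.4, 0.5, 0.9]))  # -> [0.1, 0.5, 0.5, 0.5, 0.9]
```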

21 PAVA: example. (Scatter plot of the example data, y against x.)

22 PAVA: example Step 1: Sort the data in increasing order of x. (Table of the sorted x and y values.)

23 PAVA: example Step 2: Split the data into blocks $B_1, \ldots, B_r$ such that points with the same $x_t$ fall into the same block, and assign to each block $B_i$ ($i = 1, \ldots, r$) the value $f_i$, the average of the labels in that block. Initial blocks: $B_1 = \{1\}$, $B_2 = \{2, 3\}$, $B_3 = \{4\}$, $B_4 = \{5\}$, $B_5 = \{6\}$, $B_6 = \{7\}$, $B_7 = \{8\}$, $B_8 = \{9\}$, $B_9 = \{10\}$, $B_{10} = \{11, 12\}$, $B_{11} = \{13\}$.

24 PAVA: example Step 3: While there exists a violator, i.e. a pair of adjacent blocks $B_i, B_{i+1}$ such that $f_i > f_{i+1}$: merge $B_i$ and $B_{i+1}$ and assign the weighted average $\frac{|B_i| f_i + |B_{i+1}| f_{i+1}}{|B_i| + |B_{i+1}|}$. First merge: blocks $\{1\}, \{2,3\}, \{4\}, \{5\}, \{6\}, \{7\}, \{8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$ become $\{1\}, \{2,3,4\}, \{5\}, \{6\}, \{7\}, \{8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$.

25 PAVA: example Step 3 (continued): blocks $\{1\}, \{2,3,4\}, \{5\}, \{6\}, \{7\}, \{8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$ become $\{1\}, \{2,3,4\}, \{5\}, \{6\}, \{7,8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$.

26 PAVA: example Step 3 (continued): blocks $\{1\}, \{2,3,4\}, \{5\}, \{6\}, \{7,8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$ become $\{1\}, \{2,3,4\}, \{5\}, \{6,7,8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$.

27 PAVA: example Step 3 (continued): blocks $\{1\}, \{2,3,4\}, \{5\}, \{6,7,8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$ become $\{1\}, \{2,3,4\}, \{5\}, \{6,7,8\}, \{9\}, \{10\}, \{11,12,13\}$. No more violators, so the algorithm is finished.

28 PAVA: example Reading out the solution: the final blocks are $\{1\}, \{2,3,4\}, \{5\}, \{6,7,8\}, \{9\}, \{10\}, \{11,12,13\}$, and each point receives the value $f_i$ of its block. (Table of x, y, and the fitted values f.)

29 PAVA: example. (Plot of the data y and the fitted isotonic function f against x.)

30 Generalized isotonic regression Definition: Given data $\{(x_t, y_t)\}_{t=1}^T \subset \mathbb{R} \times \mathbb{R}$, find an isotonic $f \colon \mathbb{R} \to \mathbb{R}$ which minimizes $\min_{\text{isotonic } f} \sum_{t=1}^T \ell(y_t, f(x_t))$. The squared loss $(y_t - f(x_t))^2$ is replaced with a general loss $\ell(y_t, f(x_t))$.

31 Generalized isotonic regression Definition: Given data $\{(x_t, y_t)\}_{t=1}^T \subset \mathbb{R} \times \mathbb{R}$, find an isotonic $f \colon \mathbb{R} \to \mathbb{R}$ which minimizes $\min_{\text{isotonic } f} \sum_{t=1}^T \ell(y_t, f(x_t))$. The squared loss $(y_t - f(x_t))^2$ is replaced with a general loss $\ell(y_t, f(x_t))$. Theorem [Robertson et al., 1998]: All loss functions of the form $\ell(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)$ for some strictly convex $\Psi$ (Bregman divergences) result in the same isotonic regression function $f^*$.

32 Generalized isotonic regression examples of $\ell(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)$: Squared function $\Psi(y) = y^2$: $\ell(y, z) = y^2 - z^2 - 2z(y - z) = (y - z)^2$ (squared loss). Entropy $\Psi(y) = y \log y + (1 - y)\log(1 - y)$, $y \in [0, 1]$: $\ell(y, z) = -y \log z - (1 - y)\log(1 - z)$, up to a term depending only on $y$ (cross-entropy). Negative logarithm $\Psi(y) = -\log y$, $y > 0$: $\ell(y, z) = \frac{y}{z} - \log\frac{y}{z} - 1$ (Itakura-Saito distance / Burg entropy).
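A small illustrative sketch of these Bregman-divergence losses, written directly from $\ell(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)$ (the printed values are just a numerical sanity check):

```python
import math

def bregman(psi, dpsi, y, z):
    # Generic Bregman divergence generated by a strictly convex psi.
    return psi(y) - psi(z) - dpsi(z) * (y - z)

# Squared function: psi(y) = y^2  ->  (y - z)^2
sq = lambda y, z: bregman(lambda u: u * u, lambda u: 2 * u, y, z)
# Entropy: psi(y) = y log y + (1-y) log(1-y)  ->  cross-entropy up to a term in y
ent = lambda y, z: bregman(
    lambda u: u * math.log(u) + (1 - u) * math.log(1 - u),
    lambda u: math.log(u / (1 - u)),
    y, z)
# Negative logarithm: psi(y) = -log y  ->  y/z - log(y/z) - 1 (Itakura-Saito)
isd = lambda y, z: bregman(lambda u: -math.log(u), lambda u: -1.0 / u, y, z)

print(sq(0.3, 0.7), (0.3 - 0.7) ** 2)                        # both 0.16
print(isd(0.3, 0.7), 0.3 / 0.7 - math.log(0.3 / 0.7) - 1)    # equal
```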

33 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

34 Online learning framework A theoretical framework for the analysis of online algorithms. The learning process is, by its very nature, incremental. It avoids stochastic (e.g., i.i.d.) assumptions on the data sequence and designs algorithms which work well for any data. Meaningful performance guarantees based on observed quantities: regret bounds.

35 Online learning framework. (Diagram of one round: at trial $t$ the learner holds a strategy $f_t \colon X \to Y$; a new instance $(x_t, ?)$ arrives; the learner issues the prediction $\hat{y}_t = f_t(x_t)$; the feedback $y_t$ is revealed; the learner suffers loss $\ell(y_t, \hat{y}_t)$ and moves on to trial $t + 1$.)

36 Online learning framework Set of strategies (actions) $F$; known loss function $\ell$. The learner starts with some initial strategy (action) $f_1$. For $t = 1, 2, \ldots$: (1) the learner observes an instance $x_t$; (2) the learner predicts with $\hat{y}_t = f_t(x_t)$; (3) the environment reveals the outcome $y_t$; (4) the learner suffers loss $\ell(y_t, \hat{y}_t)$; (5) the learner updates its strategy $f_t \to f_{t+1}$.

37 Online learning framework The goal of the learner is to be close to the best $f \in F$ in hindsight. Cumulative loss of the learner: $L_T = \sum_{t=1}^T \ell(y_t, \hat{y}_t)$. Cumulative loss of the best strategy $f$ in hindsight: $L_T^* = \min_{f \in F} \sum_{t=1}^T \ell(y_t, f(x_t))$. Regret of the learner: $\mathrm{regret}_T = L_T - L_T^*$. The goal is to minimize the regret over all possible data sequences.
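A generic sketch of this protocol and the regret bookkeeping (purely illustrative; the learner, the comparator class F and the data stream are placeholders):

```python
def run_online(learner, data, loss):
    """data: list of (x_t, y_t); learner exposes predict(x) and update(x, y)."""
    total = 0.0
    for x_t, y_t in data:
        y_hat = learner.predict(x_t)      # learner predicts
        total += loss(y_t, y_hat)         # suffers loss after y_t is revealed
        learner.update(x_t, y_t)          # updates its strategy f_t -> f_{t+1}
    return total

def regret(learner, data, loss, comparators):
    """Regret against the best fixed strategy f in the class 'comparators'."""
    L_T = run_online(learner, data, loss)
    L_star = min(sum(loss(y, f(x)) for x, y in data) for f in comparators)
    return L_T - L_star
```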

38 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

39 Online isotonic regression. (Illustration of the protocol: points $x_1 < \ldots < x_8$ on the X axis, labels in $[0, 1]$ on the Y axis. The environment picks a yet unlabeled point, e.g. $x_5$, the learner predicts $\hat{y}_5$, the label $y_5$ is revealed, and the learner suffers loss $(\hat{y}_5 - y_5)^2$; the same then happens for another point, e.g. $x_1$, and so on until all points are labeled.)

50 Online isotonic regression The protocol: Given $x_1 < x_2 < \ldots < x_T$. At trial $t = 1, \ldots, T$: the environment chooses a yet unlabeled point $x_{i_t}$; the learner predicts $\hat{y}_{i_t} \in [0, 1]$; the environment reveals the label $y_{i_t} \in [0, 1]$; the learner suffers squared loss $(y_{i_t} - \hat{y}_{i_t})^2$.

51 Online isotonic regression The protocol: Given $x_1 < x_2 < \ldots < x_T$. At trial $t = 1, \ldots, T$: the environment chooses a yet unlabeled point $x_{i_t}$; the learner predicts $\hat{y}_{i_t} \in [0, 1]$; the environment reveals the label $y_{i_t} \in [0, 1]$; the learner suffers squared loss $(y_{i_t} - \hat{y}_{i_t})^2$. Strategies = isotonic functions: $F = \{f : f(x_1) \le f(x_2) \le \ldots \le f(x_T)\}$.

52 Online isotonic regression The protocol: Given $x_1 < x_2 < \ldots < x_T$. At trial $t = 1, \ldots, T$: the environment chooses a yet unlabeled point $x_{i_t}$; the learner predicts $\hat{y}_{i_t} \in [0, 1]$; the environment reveals the label $y_{i_t} \in [0, 1]$; the learner suffers squared loss $(y_{i_t} - \hat{y}_{i_t})^2$. Strategies = isotonic functions: $F = \{f : f(x_1) \le f(x_2) \le \ldots \le f(x_T)\}$. $\mathrm{regret}_T = \sum_{t=1}^T (y_{i_t} - \hat{y}_{i_t})^2 - \min_{f \in F} \sum_{t=1}^T (y_{i_t} - f(x_{i_t}))^2$.

53 Online isotonic regression $F = \{f : f(x_1) \le f(x_2) \le \ldots \le f(x_T)\}$, $\mathrm{regret}_T = \sum_{t=1}^T (y_{i_t} - \hat{y}_{i_t})^2 - \min_{f \in F} \sum_{t=1}^T (y_{i_t} - f(x_{i_t}))^2$. The cumulative loss of the learner should not be much larger than the loss of the (optimal) isotonic regression function in hindsight. Only the order $x_1 < \ldots < x_T$ matters, not the values. (A toy instantiation of the protocol is sketched below.)
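A toy instantiation of the protocol, just to make the regret definition concrete: the learner here is deliberately naive (it always predicts 0.5), the data is made up, and the comparator loss is computed with offline isotonic regression from scikit-learn:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
T = 100
x = np.arange(T, dtype=float)                 # x_1 < ... < x_T
y = np.clip(np.linspace(0, 1, T) + 0.2 * rng.standard_normal(T), 0, 1)
order = rng.permutation(T)                    # order in which the points arrive

loss_learner = 0.0
for t in order:
    y_hat = 0.5                               # naive learner: always predict 0.5
    loss_learner += (y[t] - y_hat) ** 2       # squared loss after y_t is revealed

# Loss of the best isotonic function in hindsight:
loss_best = float(np.sum((y - IsotonicRegression().fit_transform(x, y)) ** 2))
print("regret_T =", loss_learner - loss_best)
```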

54 The adversary is too powerful! Every algorithm will have $\Omega(T)$ regret. (Illustration: in each trial the adversary reveals a new point interleaved with the previously labeled ones, the learner predicts $\hat{y}_t$, and the adversary then chooses the label $y_t \in \{0, 1\}$ so that the learner suffers loss at least 1/4, while the revealed labels remain consistent with some isotonic function. The algorithm's loss is at least 1/4 per trial, while the loss of the best isotonic function in hindsight is 0.)

68 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

69 Fixed design The data $x_1, \ldots, x_T$ are known in advance to the learner. We will show that in this model, efficient online algorithms exist. Kotłowski, Koolen, Malek: Online Isotonic Regression. Proceedings of the Conference on Learning Theory (COLT), 2016.

70 Off-the-shelf online algorithms (general regret bound / bound for online IR): Stochastic Gradient Descent: $G_2 D_2 \sqrt{T}$ / $T$; Exponentiated Gradient: $G_\infty D_1 \sqrt{T \log d}$ / $T\sqrt{T \log T}$; Follow the Leader: $G_2^2 D_2^2\, d \log T$ / $T^2 \log T$; Exponential Weights: $d \log T$ / $T \log T$. These bounds are tight (up to logarithmic factors).

71 Exponential Weights (Bayes) with uniform prior Let $f = (f_1, \ldots, f_T)$ denote the values of $f$ at $(x_1, \ldots, x_T)$. Prior: $\pi(f) = \mathrm{const}$ for all $f$ with $f_1 \le \ldots \le f_T$. Posterior: $P(f \mid y_{i_1}, \ldots, y_{i_t}) \propto \pi(f)\, e^{-\frac{1}{2}\,\mathrm{loss}_{1 \ldots t}(f)}$. Prediction: $\hat{y}_{i_{t+1}} = \int f_{i_{t+1}}\, P(f \mid y_{i_1}, \ldots, y_{i_t})\, df$, the posterior mean.

72 Exponential Weights with uniform prior does not learn. (Plots of the prior mean and of the posterior mean after $t = 10, 20, 50$ and $100$ observations: the posterior mean stays close to the prior mean instead of fitting the data.)

77 The algorithm Exponential Weights on a covering net $F_K = \{f : f_t = k_t / K,\ k_t \in \{0, 1, \ldots, K\},\ f_1 \le \ldots \le f_T\}$, with $\pi(f)$ uniform on $F_K$. Efficient implementation by dynamic programming: $O(Kt)$ at trial $t$. Speed-up to $O(K)$ if the data is revealed in isotonic order. (A brute-force sketch of the predictor is given below.)
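A brute-force sketch of this predictor, enumerating the whole covering net (feasible only for tiny T and K; the dynamic programming mentioned above avoids the enumeration). Everything here is illustrative:

```python
import math
from itertools import combinations_with_replacement

def covering_net(T, K):
    """All isotonic functions f with f_t = k_t/K and 0 <= k_1 <= ... <= k_T <= K."""
    return [tuple(k / K for k in ks)
            for ks in combinations_with_replacement(range(K + 1), T)]

def ew_predict(F_K, labelled, t_next):
    """Exponential Weights (Bayes) posterior-mean prediction at position t_next.
    labelled: dict {position: revealed label}; prior is uniform on F_K."""
    num = den = 0.0
    for f in F_K:
        w = math.exp(-0.5 * sum((y - f[t]) ** 2 for t, y in labelled.items()))
        num += w * f[t_next]
        den += w
    return num / den

F_K = covering_net(T=5, K=4)
print(ew_predict(F_K, labelled={0: 0.1, 4: 0.9}, t_next=2))
```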

78 Covering net A finite set of isotonic functions on a discrete grid of $y$ values (illustrated on points $x_1, \ldots, x_{12}$). There are $O(T^K)$ functions in $F_K$.

83 Performance of the algorithm Regret bound: When $K = \Theta\big(T^{1/3} / \log^{1/3}(T)\big)$, $\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big)$.

84 Performance of the algorithm Regret bound: When $K = \Theta\big(T^{1/3} / \log^{1/3}(T)\big)$, $\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big)$. Matching lower bound $\Omega(T^{1/3})$ (up to a log factor).

85 Performance of the algorithm Regret bound: When $K = \Theta\big(T^{1/3} / \log^{1/3}(T)\big)$, $\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big)$. Matching lower bound $\Omega(T^{1/3})$ (up to a log factor). Proof idea: $\mathrm{Regret} = \big[\mathrm{Loss(alg)} - \min_{f \in F_K} \mathrm{Loss}(f)\big] + \big[\min_{f \in F_K} \mathrm{Loss}(f) - \min_{\text{isotonic } f} \mathrm{Loss}(f)\big]$.

86 Performance of the algorithm Regret bound: When $K = \Theta\big(T^{1/3} / \log^{1/3}(T)\big)$, $\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big)$. Matching lower bound $\Omega(T^{1/3})$ (up to a log factor). Proof idea: $\mathrm{Regret} = \big[\mathrm{Loss(alg)} - \min_{f \in F_K} \mathrm{Loss}(f)\big] + \big[\min_{f \in F_K} \mathrm{Loss}(f) - \min_{\text{isotonic } f} \mathrm{Loss}(f)\big]$, where the first term is at most $2 \log|F_K| = O(K \log T)$ and the second term is at most $T / (4K^2)$.
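Balancing the two terms in this decomposition recovers the choice of $K$ and the regret bound stated above (a quick sanity check of the calculation, not taken verbatim from the slides):

$$K \log T \approx \frac{T}{4K^2} \;\Longrightarrow\; K = \Theta\!\Big(\big(T/\log T\big)^{1/3}\Big), \qquad \mathrm{Regret} = O\Big(K \log T + \frac{T}{4K^2}\Big) = O\big(T^{1/3} \log^{2/3} T\big).$$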

87 Performance of the algorithm. (Plots of the prior mean and of the posterior mean of the covering-net Exponential Weights algorithm after $t = 10, 20, 50$ and $100$ observations.)

92 Other loss functions Cross-entropy loss $\ell(y, \hat{y}) = -y \log \hat{y} - (1 - y)\log(1 - \hat{y})$: the same bound $O\big(T^{1/3}\log^{2/3}(T)\big)$; covering net $F_K$ obtained by non-uniform discretization.

93 Other loss functions Cross-entropy loss $\ell(y, \hat{y}) = -y \log \hat{y} - (1 - y)\log(1 - \hat{y})$: the same bound $O\big(T^{1/3}\log^{2/3}(T)\big)$; covering net $F_K$ obtained by non-uniform discretization. Absolute loss $\ell(y, \hat{y}) = |y - \hat{y}|$: $O\big(\sqrt{T \log T}\big)$ obtained by Exponentiated Gradient; matching lower bound $\Omega(\sqrt{T})$ (up to a log factor).

94 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

95 Random permutation model A more realistic scenario for generating $x_1, \ldots, x_T$ which allows the data to be unknown in advance.

96 Random permutation model A more realistic scenario for generating $x_1, \ldots, x_T$ which allows the data to be unknown in advance. The data are chosen adversarially before the game begins, but are then presented to the learner in a random order. Motivation: the data gathering process is independent of the underlying data generation mechanism. Still a very weak assumption. Evaluation: regret averaged over all permutations of the data: $\mathbb{E}_\sigma[\mathrm{regret}_T]$. Kotłowski, Koolen, Malek: Random Permutation Online Isotonic Regression. Submitted, 2017.

97 Leave-one-out loss Definition: Given $t$ labeled points $\{(x_i, y_i)\}_{i=1}^t$, for $i = 1, \ldots, t$: take out the $i$-th point and give the remaining $t - 1$ points to the learner as training data; the learner predicts $\hat{y}_i$ on $x_i$ and receives loss $\ell(y_i, \hat{y}_i)$. Evaluate the learner by $\mathrm{loo}_t = \frac{1}{t}\sum_{i=1}^t \ell(y_i, \hat{y}_i)$. There is no sequential structure in the definition.

98 Leave-one-out loss Definition: Given $t$ labeled points $\{(x_i, y_i)\}_{i=1}^t$, for $i = 1, \ldots, t$: take out the $i$-th point and give the remaining $t - 1$ points to the learner as training data; the learner predicts $\hat{y}_i$ on $x_i$ and receives loss $\ell(y_i, \hat{y}_i)$. Evaluate the learner by $\mathrm{loo}_t = \frac{1}{t}\sum_{i=1}^t \ell(y_i, \hat{y}_i)$. There is no sequential structure in the definition. Theorem: If $\mathrm{loo}_t \le g(t)$ for all $t$, then $\mathbb{E}_\sigma[\mathrm{regret}_T] \le \sum_{t=1}^T g(t)$.
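A direct sketch of the leave-one-out evaluation above (illustrative; predict(train, x) stands for whatever algorithm is being evaluated):

```python
def leave_one_out_loss(points, predict, loss):
    """loo_t: average loss when each point is predicted from the other t-1 points.
    points: list of (x_i, y_i); predict(train, x) is the algorithm under study."""
    t = len(points)
    total = 0.0
    for i, (x_i, y_i) in enumerate(points):
        train = points[:i] + points[i + 1:]        # leave the i-th point out
        total += loss(y_i, predict(train, x_i))    # loss on the held-out point
    return total / t
```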

99 Fixed design to random permutation conversion Any algorithm for fixed design can be used in the random permutation setup by being re-run from scratch in each trial. We have shown that $\mathrm{loo}_t \le \frac{1}{t}\,\mathbb{E}_\sigma[\text{fixed-design-regret}_t]$. We thus get an optimal algorithm (Exponential Weights on a grid) with $\tilde{O}(t^{-2/3})$ leave-one-out loss for free, but it is complicated. Can we get simpler algorithms to work in this setup?

100 Follow the Leader (FTL) algorithm Definition: Given the past $t - 1$ data points, compute the optimal (loss-minimizing) isotonic function $f^*$ and predict on a new instance $x$ according to $f^*(x)$.

101 Follow the Leader (FTL) algorithm Definition: Given the past $t - 1$ data points, compute the optimal (loss-minimizing) isotonic function $f^*$ and predict on a new instance $x$ according to $f^*(x)$. FTL is undefined for isotonic regression.

102 Follow the Leader (FTL) algorithm Definition: Given the past $t - 1$ data points, compute the optimal (loss-minimizing) isotonic function $f^*$ and predict on a new instance $x$ according to $f^*(x)$. FTL is undefined for isotonic regression: with past labels $0, 0.2, 0.7, 1$ and a new instance $x$ lying between the points labeled $0.2$ and $0.7$, any value in $[0.2, 0.7]$ minimizes the loss on the past data, so $f^*(x)$ is not uniquely determined.

103 Forward Algorithm (FA) Definition: Given the past $t - 1$ data points and a new instance $x$, take any guess $y' \in [0, 1]$ of the new label and predict according to the optimal isotonic function $f^*$ computed on the past data including the new point $(x, y')$. Various popular prediction algorithms for isotonic regression fall into this framework (including linear interpolation [Zadrozny & Elkan, 2002] and many others [Vovk et al., 2015]).

107 Forward Algorithm (FA) Two extreme forward algorithms: guess-1 and guess-0, denoted $f^1$ and $f^0$. The prediction of any FA always lies between them: $f^0(x) \le f(x) \le f^1(x)$. (Illustration: data points $y_1, \ldots, y_8$ at $x_1, \ldots, x_8$ together with the curves $f^0$ and $f^1$; every FA predicts in the range between them.)
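A sketch of one forward algorithm, using scikit-learn for the isotonic fit (illustrative; the data is made up, and the guesses 0 and 1 correspond to the two extreme algorithms $f^0$ and $f^1$ above):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def forward_predict(x_train, y_train, x_new, guess=0.5):
    """Forward Algorithm: append (x_new, guess), fit isotonic regression on the
    augmented data, and predict with the fitted value at x_new."""
    x_aug = np.append(x_train, x_new)
    y_aug = np.append(y_train, guess)
    fit = IsotonicRegression().fit(x_aug, y_aug)
    return float(fit.predict([x_new])[0])

x_tr = np.array([1.0, 2.0, 4.0, 5.0])
y_tr = np.array([0.0, 0.2, 0.7, 1.0])
print(forward_predict(x_tr, y_tr, 3.0, guess=0.0))   # extreme FA f^0
print(forward_predict(x_tr, y_tr, 3.0, guess=1.0))   # extreme FA f^1
```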

111 Performance of FA Theorem: For squared loss, every forward algorithm has $\mathrm{loo}_t = O\big(\sqrt{\log t / t}\big)$. The bound is suboptimal, but only a factor of $O(t^{1/6})$ off. For cross-entropy loss, the same bound holds, but a more careful choice of the guess must be made.

112 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

113 Conclusions Two models for online isotonic regression: fixed design and random permutation. Optimal algorithm in both models: Exponential Weights (Bayes) on a grid. In the random permutation model, a class of forward algorithms with good bounds on the leave-one-out loss. Open problem: Extend any of these algorithms to the partial order case.

114 Bibliography Statistics M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26(4), 1955. H. D. Brunk. Maximum likelihood estimates of monotone parameters. Annals of Mathematical Statistics, 26(4), 1955. J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27, 1964. R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67, 1972. T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. John Wiley & Sons, 1998. Sara Van de Geer. Estimating a regression function. Annals of Statistics, 18, 1990. Cun-Hui Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2), 2002. Jan de Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32:1-24, 2009.

115 Bibliography Machine Learning Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, 2002. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, volume 119, ACM, 2005. Tom Fawcett and Alexandru Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97-106, 2007. Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. In NIPS, 2015. Aditya Krishna Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. In ICML, 2012. Rasmus Kyng, Anup Rao, and Sushant Sachdeva. Fast, provable algorithms for isotonic regression in all l_p-norms. In NIPS, 2015. Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009. T. Moon, A. Smola, Y. Chang, and Z. Zheng. IntervalRank: Isotonic regression with listwise and pairwise constraints. In WSDM, ACM, 2010. Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, 2011.

116 Bibliography Online isotonic regression Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In COLT, 2014. Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In COLT, 2015. Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Online isotonic regression. In COLT, 2016. Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Random permutation online isotonic regression. Submitted, 2017.

