Online isotonic regression


1 Online isotonic regression Wojciech Kotłowski (Poznań University of Technology) Joint work with: Wouter Koolen (CWI, Amsterdam) and Alan Malek (MIT)

2 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

3 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

4 Motivation I: house pricing. Assess the selling price of a house based on its attributes.

5 Motivation I: house pricing. Den Bosch data set (scatter plot of price vs. area).

6 Motivation I: house pricing. Fitting a linear function (price vs. area).

7 Motivation I: house pricing. Fitting an isotonic¹ function (price vs. area). ¹ isotonic = non-decreasing, order-preserving.

8 Motivation II: predicting good probabilities. Predictions of an SVM classifier (German credit data), plotted as label vs. score. Can we turn score values into conditional probabilities $P(y \mid x)$?

9 Motivation II: predicting good probabilities. Fitting an isotonic function to the labels [Zadrozny & Elkan, 2002] (label/probability vs. score).

10 Motivation II: predicting good probabilities. Calibration plots (reliability curves), fraction of positives: perfectly calibrated, Logistic (0.099), SVM (0.163), SVM + Isotonic (0.100). (Generated by a script from scikit-learn.org.)

11 Motivation II: predicting good probabilities. Calibration plots (reliability curves), fraction of positives: perfectly calibrated, Logistic (0.099), Naive Bayes (0.118), Naive Bayes + Isotonic (0.098). (Generated by a script from scikit-learn.org.)

12 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

13 Isotonic regression Definition Fit an isotonic (monotonically increasing) function to the data. Extensively studied in statistics [Ayer et al., 55; Brunk, 55; Robertson et al., 98]. Numerous applications: Biology, medicine, psychology, etc. Multicriteria decision support. Hypothesis tests under order constraints. Multidimensional scaling. Machine learning: probability calibration, ROC analysis.

14 Isotonic regression

15 Isotonic regression Definition: Given data $\{(x_t, y_t)\}_{t=1}^T \subset \mathbb{R} \times \mathbb{R}$, find an isotonic (non-decreasing) $f \colon \mathbb{R} \to \mathbb{R}$ which minimizes the squared error over the labels: $\min_f \sum_{t=1}^T (y_t - f(x_t))^2$, subject to $x_t \le x_q \implies f(x_t) \le f(x_q)$ for all $q, t \in \{1, \ldots, T\}$. The optimal solution $f^*$ is called the isotonic regression function. Only the values $f(x_t)$, $t = 1, \ldots, T$, matter.

16 Isotonic regression example (source: scikit-learn.org)
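A minimal sketch of how such an isotonic fit can be computed with scikit-learn's IsotonicRegression estimator (the data below is made up purely for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Made-up data: noisy labels that are roughly increasing in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.1, 0.4, 0.3, 0.6, 0.5, 0.9])

ir = IsotonicRegression(increasing=True, out_of_bounds="clip")
y_fit = ir.fit_transform(x, y)   # values f(x_1), ..., f(x_T) of the isotonic fit
print(y_fit)                     # non-decreasing, piecewise constant on level sets
print(ir.predict([2.5, 4.5]))    # between the x_t: linear interpolation
```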

17 Properties of isotonic regression: Depends on the instances (x) only through their order relation. Only defined at the points $\{x_1, \ldots, x_T\}$; often extended to $\mathbb{R}$ by linear interpolation. Piecewise constant (splits the data into level sets). Self-averaging property: the value of $f^*$ on a given level set equals the average of the labels in that level set, i.e. for any $v$: $v = \frac{1}{|S_v|} \sum_{t \in S_v} y_t$, where $S_v = \{t : f^*(x_t) = v\}$. If $y_t \in [a, b]$ for all $t$, then $f^*(x_t) \in [a, b]$ for all $t$.

18 Isotonic regression gives calibrated probabilities Definition: Let $y \in \{0, 1\}$. A probability estimator $p$ of $y$ is calibrated if $\mathbb{E}[y \mid p = v] = v$.

19 Isotonic regression gives calibrated probabilities Definition: Let $y \in \{0, 1\}$. A probability estimator $p$ of $y$ is calibrated if $\mathbb{E}[y \mid p = v] = v$. Fact: For binary labels, the isotonic regression $f^*$ is a calibrated probability estimator on the data set. Proof: Let $S_v = \{t : f^*(x_t) = v\}$. By self-averaging: $\mathbb{E}[y \mid f^*(x) = v] = \frac{1}{|S_v|} \sum_{t \in S_v} y_t = v$.

20 Pool Adjacent Violators Algorithm (PAVA) Iterative merging of data points into blocks until no violators of the isotonic constraints remain. The value assigned to each block is the average of the labels in that block. The final assignment of points to blocks corresponds to the level sets of the isotonic regression. Works in linear $O(T)$ time, but requires the data to be sorted. (A minimal implementation sketch is given below.)
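A minimal PAVA sketch for squared loss (my own illustrative implementation, not the authors' code); it assumes the points are already sorted by x and returns one fitted value per point:

```python
def pava(y, w=None):
    """Isotonic regression of the label sequence y (assumed sorted by x) under
    squared loss, via the Pool Adjacent Violators Algorithm."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [total weight, weighted mean of labels, #points]
    for yi, wi in zip(y, w):
        blocks.append([wi, yi, 1])
        # Merge adjacent blocks while they violate the isotonic constraint.
        while len(blocks) > 1 and blocks[-2][1] > blocks[-1][1]:
            w2, m2, n2 = blocks.pop()
            w1, m1, n1 = blocks.pop()
            blocks.append([w1 + w2, (w1 * m1 + w2 * m2) / (w1 + w2), n1 + n2])
    # Expand the blocks back into one fitted value per data point.
    fitted = []
    for wb, mb, nb in blocks:
        fitted.extend([mb] * nb)
    return fitted

print(pava([0.1, 0.6, 0.4, 0.5, 0.9]))  # -> [0.1, 0.5, 0.5, 0.5, 0.9]
```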

21 PAVA: example. (Scatter plot of the example data, y against x.)

22 PAVA: example Step 1: Sort the data in increasing order of x. (Table of the sorted x and y values.)

23 PAVA: example Step 2: Split the data into blocks $B_1, \ldots, B_r$ such that points with the same $x_t$ fall into the same block, and assign to each block $B_i$ ($i = 1, \ldots, r$) the value $f_i$, the average of the labels in that block. Initial blocks: $B_1 = \{1\}$, $B_2 = \{2, 3\}$, $B_3 = \{4\}$, $B_4 = \{5\}$, $B_5 = \{6\}$, $B_6 = \{7\}$, $B_7 = \{8\}$, $B_8 = \{9\}$, $B_9 = \{10\}$, $B_{10} = \{11, 12\}$, $B_{11} = \{13\}$.

24 PAVA: example Step 3: While there exists a violator, i.e. a pair of adjacent blocks $B_i, B_{i+1}$ such that $f_i > f_{i+1}$: merge $B_i$ and $B_{i+1}$ and assign the weighted average $\frac{|B_i| f_i + |B_{i+1}| f_{i+1}}{|B_i| + |B_{i+1}|}$. First merge: blocks $\{1\}, \{2,3\}, \{4\}, \{5\}, \{6\}, \{7\}, \{8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$ become $\{1\}, \{2,3,4\}, \{5\}, \{6\}, \{7\}, \{8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$.

25 PAVA: example Step 3 (continued): blocks $\{1\}, \{2,3,4\}, \{5\}, \{6\}, \{7\}, \{8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$ become $\{1\}, \{2,3,4\}, \{5\}, \{6\}, \{7,8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$.

26 PAVA: example Step 3 (continued): blocks $\{1\}, \{2,3,4\}, \{5\}, \{6\}, \{7,8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$ become $\{1\}, \{2,3,4\}, \{5\}, \{6,7,8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$.

27 PAVA: example Step 3 (continued): blocks $\{1\}, \{2,3,4\}, \{5\}, \{6,7,8\}, \{9\}, \{10\}, \{11,12\}, \{13\}$ become $\{1\}, \{2,3,4\}, \{5\}, \{6,7,8\}, \{9\}, \{10\}, \{11,12,13\}$. No more violators, so the algorithm is finished.

28 PAVA: example Reading out the solution: the final blocks are $\{1\}, \{2,3,4\}, \{5\}, \{6,7,8\}, \{9\}, \{10\}, \{11,12,13\}$, and each point receives the value $f_i$ of its block. (Table of x, y, and the fitted values f.)

29 PAVA: example. (Plot of the data y and the fitted isotonic function f against x.)

30 Generalized isotonic regression Definition: Given data $\{(x_t, y_t)\}_{t=1}^T \subset \mathbb{R} \times \mathbb{R}$, find an isotonic $f \colon \mathbb{R} \to \mathbb{R}$ which minimizes $\min_{\text{isotonic } f} \sum_{t=1}^T \ell(y_t, f(x_t))$. The squared loss $(y_t - f(x_t))^2$ is replaced with a general loss $\ell(y_t, f(x_t))$.

31 Generalized isotonic regression Definition: Given data $\{(x_t, y_t)\}_{t=1}^T \subset \mathbb{R} \times \mathbb{R}$, find an isotonic $f \colon \mathbb{R} \to \mathbb{R}$ which minimizes $\min_{\text{isotonic } f} \sum_{t=1}^T \ell(y_t, f(x_t))$. The squared loss $(y_t - f(x_t))^2$ is replaced with a general loss $\ell(y_t, f(x_t))$. Theorem [Robertson et al., 1998]: All loss functions of the form $\ell(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)$ for some strictly convex $\Psi$ (Bregman divergences) result in the same isotonic regression function $f^*$.

32 Generalized isotonic regression examples of $\ell(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)$: Squared function $\Psi(y) = y^2$: $\ell(y, z) = y^2 - z^2 - 2z(y - z) = (y - z)^2$ (squared loss). Entropy $\Psi(y) = y \log y + (1 - y)\log(1 - y)$, $y \in [0, 1]$: $\ell(y, z) = -y \log z - (1 - y)\log(1 - z)$, up to a term depending only on $y$ (cross-entropy). Negative logarithm $\Psi(y) = -\log y$, $y > 0$: $\ell(y, z) = \frac{y}{z} - \log\frac{y}{z} - 1$ (Itakura-Saito distance / Burg entropy).
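A small illustrative sketch of these Bregman-divergence losses, written directly from $\ell(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)$ (the printed values are just a numerical sanity check):

```python
import math

def bregman(psi, dpsi, y, z):
    # Generic Bregman divergence generated by a strictly convex psi.
    return psi(y) - psi(z) - dpsi(z) * (y - z)

# Squared function: psi(y) = y^2  ->  (y - z)^2
sq = lambda y, z: bregman(lambda u: u * u, lambda u: 2 * u, y, z)
# Entropy: psi(y) = y log y + (1-y) log(1-y)  ->  cross-entropy up to a term in y
ent = lambda y, z: bregman(
    lambda u: u * math.log(u) + (1 - u) * math.log(1 - u),
    lambda u: math.log(u / (1 - u)),
    y, z)
# Negative logarithm: psi(y) = -log y  ->  y/z - log(y/z) - 1 (Itakura-Saito)
isd = lambda y, z: bregman(lambda u: -math.log(u), lambda u: -1.0 / u, y, z)

print(sq(0.3, 0.7), (0.3 - 0.7) ** 2)                        # both 0.16
print(isd(0.3, 0.7), 0.3 / 0.7 - math.log(0.3 / 0.7) - 1)    # equal
```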

33 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

34 Online learning framework A theoretical framework for the analysis of online algorithms. The learning process is, by its very nature, incremental. It avoids stochastic (e.g., i.i.d.) assumptions on the data sequence and designs algorithms which work well for any data. Meaningful performance guarantees based on observed quantities: regret bounds.

35 Online learning framework. (Diagram of one round: at trial $t$ the learner holds a strategy $f_t \colon X \to Y$; a new instance $(x_t, ?)$ arrives; the learner issues the prediction $\hat{y}_t = f_t(x_t)$; the feedback $y_t$ is revealed; the learner suffers loss $\ell(y_t, \hat{y}_t)$ and moves on to trial $t + 1$.)

36 Online learning framework Set of strategies (actions) $F$; known loss function $\ell$. The learner starts with some initial strategy (action) $f_1$. For $t = 1, 2, \ldots$: (1) the learner observes an instance $x_t$; (2) the learner predicts with $\hat{y}_t = f_t(x_t)$; (3) the environment reveals the outcome $y_t$; (4) the learner suffers loss $\ell(y_t, \hat{y}_t)$; (5) the learner updates its strategy $f_t \to f_{t+1}$.

37 Online learning framework The goal of the learner is to be close to the best $f \in F$ in hindsight. Cumulative loss of the learner: $L_T = \sum_{t=1}^T \ell(y_t, \hat{y}_t)$. Cumulative loss of the best strategy $f$ in hindsight: $L_T^* = \min_{f \in F} \sum_{t=1}^T \ell(y_t, f(x_t))$. Regret of the learner: $\mathrm{regret}_T = L_T - L_T^*$. The goal is to minimize the regret over all possible data sequences.
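A generic sketch of this protocol and the regret bookkeeping (purely illustrative; the learner, the comparator class F and the data stream are placeholders):

```python
def run_online(learner, data, loss):
    """data: list of (x_t, y_t); learner exposes predict(x) and update(x, y)."""
    total = 0.0
    for x_t, y_t in data:
        y_hat = learner.predict(x_t)      # learner predicts
        total += loss(y_t, y_hat)         # suffers loss after y_t is revealed
        learner.update(x_t, y_t)          # updates its strategy f_t -> f_{t+1}
    return total

def regret(learner, data, loss, comparators):
    """Regret against the best fixed strategy f in the class 'comparators'."""
    L_T = run_online(learner, data, loss)
    L_star = min(sum(loss(y, f(x)) for x, y in data) for f in comparators)
    return L_T - L_star
```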

38 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

39 Online isotonic regression. (Illustration of the protocol: points $x_1 < \ldots < x_8$ on the X axis, labels in $[0, 1]$ on the Y axis. The environment picks a yet unlabeled point, e.g. $x_5$, the learner predicts $\hat{y}_5$, the label $y_5$ is revealed, and the learner suffers loss $(\hat{y}_5 - y_5)^2$; the same then happens for another point, e.g. $x_1$, and so on until all points are labeled.)

50 Online isotonic regression The protocol: Given $x_1 < x_2 < \ldots < x_T$. At trial $t = 1, \ldots, T$: the environment chooses a yet unlabeled point $x_{i_t}$; the learner predicts $\hat{y}_{i_t} \in [0, 1]$; the environment reveals the label $y_{i_t} \in [0, 1]$; the learner suffers squared loss $(y_{i_t} - \hat{y}_{i_t})^2$.

51 Online isotonic regression The protocol: Given $x_1 < x_2 < \ldots < x_T$. At trial $t = 1, \ldots, T$: the environment chooses a yet unlabeled point $x_{i_t}$; the learner predicts $\hat{y}_{i_t} \in [0, 1]$; the environment reveals the label $y_{i_t} \in [0, 1]$; the learner suffers squared loss $(y_{i_t} - \hat{y}_{i_t})^2$. Strategies = isotonic functions: $F = \{f : f(x_1) \le f(x_2) \le \ldots \le f(x_T)\}$.

52 Online isotonic regression The protocol: Given $x_1 < x_2 < \ldots < x_T$. At trial $t = 1, \ldots, T$: the environment chooses a yet unlabeled point $x_{i_t}$; the learner predicts $\hat{y}_{i_t} \in [0, 1]$; the environment reveals the label $y_{i_t} \in [0, 1]$; the learner suffers squared loss $(y_{i_t} - \hat{y}_{i_t})^2$. Strategies = isotonic functions: $F = \{f : f(x_1) \le f(x_2) \le \ldots \le f(x_T)\}$. $\mathrm{regret}_T = \sum_{t=1}^T (y_{i_t} - \hat{y}_{i_t})^2 - \min_{f \in F} \sum_{t=1}^T (y_{i_t} - f(x_{i_t}))^2$.

53 Online isotonic regression $F = \{f : f(x_1) \le f(x_2) \le \ldots \le f(x_T)\}$, $\mathrm{regret}_T = \sum_{t=1}^T (y_{i_t} - \hat{y}_{i_t})^2 - \min_{f \in F} \sum_{t=1}^T (y_{i_t} - f(x_{i_t}))^2$. The cumulative loss of the learner should not be much larger than the loss of the (optimal) isotonic regression function in hindsight. Only the order $x_1 < \ldots < x_T$ matters, not the values. (A toy instantiation of the protocol is sketched below.)
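A toy instantiation of the protocol, just to make the regret definition concrete: the learner here is deliberately naive (it always predicts 0.5), the data is made up, and the comparator loss is computed with offline isotonic regression from scikit-learn:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
T = 100
x = np.arange(T, dtype=float)                 # x_1 < ... < x_T
y = np.clip(np.linspace(0, 1, T) + 0.2 * rng.standard_normal(T), 0, 1)
order = rng.permutation(T)                    # order in which the points arrive

loss_learner = 0.0
for t in order:
    y_hat = 0.5                               # naive learner: always predict 0.5
    loss_learner += (y[t] - y_hat) ** 2       # squared loss after y_t is revealed

# Loss of the best isotonic function in hindsight:
loss_best = float(np.sum((y - IsotonicRegression().fit_transform(x, y)) ** 2))
print("regret_T =", loss_learner - loss_best)
```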

54 The adversary is too powerful! Every algorithm will have $\Omega(T)$ regret. (Illustration: in each trial the adversary reveals a new point interleaved with the previously labeled ones, the learner predicts $\hat{y}_t$, and the adversary then chooses the label $y_t \in \{0, 1\}$ so that the learner suffers loss at least 1/4, while the revealed labels remain consistent with some isotonic function. The algorithm's loss is at least 1/4 per trial, while the loss of the best isotonic function in hindsight is 0.)

68 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

69 Fixed design The data $x_1, \ldots, x_T$ are known in advance to the learner. We will show that in this model, efficient online algorithms exist. Kotłowski, Koolen, Malek: Online Isotonic Regression. Proceedings of the Conference on Learning Theory (COLT), 2016.

70 Off-the-shelf online algorithms (general regret bound / bound for online IR): Stochastic Gradient Descent: $G_2 D_2 \sqrt{T}$ / $T$; Exponentiated Gradient: $G_\infty D_1 \sqrt{T \log d}$ / $T\sqrt{T \log T}$; Follow the Leader: $G_2^2 D_2^2\, d \log T$ / $T^2 \log T$; Exponential Weights: $d \log T$ / $T \log T$. These bounds are tight (up to logarithmic factors).

71 Exponential Weights (Bayes) with uniform prior Let $f = (f_1, \ldots, f_T)$ denote the values of $f$ at $(x_1, \ldots, x_T)$. Prior: $\pi(f) = \mathrm{const}$ for all $f$ with $f_1 \le \ldots \le f_T$. Posterior: $P(f \mid y_{i_1}, \ldots, y_{i_t}) \propto \pi(f)\, e^{-\frac{1}{2}\,\mathrm{loss}_{1 \ldots t}(f)}$. Prediction: $\hat{y}_{i_{t+1}} = \int f_{i_{t+1}}\, P(f \mid y_{i_1}, \ldots, y_{i_t})\, df$, the posterior mean.

72 Exponential Weights with uniform prior does not learn. (Plots of the prior mean and of the posterior mean after $t = 10, 20, 50$ and $100$ observations: the posterior mean stays close to the prior mean instead of fitting the data.)

77 The algorithm Exponential Weights on a covering net $F_K = \{f : f_t = k_t / K,\ k_t \in \{0, 1, \ldots, K\},\ f_1 \le \ldots \le f_T\}$, with $\pi(f)$ uniform on $F_K$. Efficient implementation by dynamic programming: $O(Kt)$ at trial $t$. Speed-up to $O(K)$ if the data is revealed in isotonic order. (A brute-force sketch of the predictor is given below.)
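A brute-force sketch of this predictor, enumerating the whole covering net (feasible only for tiny T and K; the dynamic programming mentioned above avoids the enumeration). Everything here is illustrative:

```python
import math
from itertools import combinations_with_replacement

def covering_net(T, K):
    """All isotonic functions f with f_t = k_t/K and 0 <= k_1 <= ... <= k_T <= K."""
    return [tuple(k / K for k in ks)
            for ks in combinations_with_replacement(range(K + 1), T)]

def ew_predict(F_K, labelled, t_next):
    """Exponential Weights (Bayes) posterior-mean prediction at position t_next.
    labelled: dict {position: revealed label}; prior is uniform on F_K."""
    num = den = 0.0
    for f in F_K:
        w = math.exp(-0.5 * sum((y - f[t]) ** 2 for t, y in labelled.items()))
        num += w * f[t_next]
        den += w
    return num / den

F_K = covering_net(T=5, K=4)
print(ew_predict(F_K, labelled={0: 0.1, 4: 0.9}, t_next=2))
```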

78 Covering net A finite set of isotonic functions on a discrete grid of $y$ values (illustrated on points $x_1, \ldots, x_{12}$). There are $O(T^K)$ functions in $F_K$.

83 Performance of the algorithm Regret bound: When $K = \Theta\big(T^{1/3} / \log^{1/3}(T)\big)$, $\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big)$.

84 Performance of the algorithm Regret bound: When $K = \Theta\big(T^{1/3} / \log^{1/3}(T)\big)$, $\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big)$. Matching lower bound $\Omega(T^{1/3})$ (up to a log factor).

85 Performance of the algorithm Regret bound: When $K = \Theta\big(T^{1/3} / \log^{1/3}(T)\big)$, $\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big)$. Matching lower bound $\Omega(T^{1/3})$ (up to a log factor). Proof idea: $\mathrm{Regret} = \big[\mathrm{Loss(alg)} - \min_{f \in F_K} \mathrm{Loss}(f)\big] + \big[\min_{f \in F_K} \mathrm{Loss}(f) - \min_{\text{isotonic } f} \mathrm{Loss}(f)\big]$.

86 Performance of the algorithm Regret bound: When $K = \Theta\big(T^{1/3} / \log^{1/3}(T)\big)$, $\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big)$. Matching lower bound $\Omega(T^{1/3})$ (up to a log factor). Proof idea: $\mathrm{Regret} = \big[\mathrm{Loss(alg)} - \min_{f \in F_K} \mathrm{Loss}(f)\big] + \big[\min_{f \in F_K} \mathrm{Loss}(f) - \min_{\text{isotonic } f} \mathrm{Loss}(f)\big]$, where the first term is at most $2 \log|F_K| = O(K \log T)$ and the second term is at most $T / (4K^2)$.
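Balancing the two terms in this decomposition recovers the choice of $K$ and the regret bound stated above (a quick sanity check of the calculation, not taken verbatim from the slides):

$$K \log T \approx \frac{T}{4K^2} \;\Longrightarrow\; K = \Theta\!\Big(\big(T/\log T\big)^{1/3}\Big), \qquad \mathrm{Regret} = O\Big(K \log T + \frac{T}{4K^2}\Big) = O\big(T^{1/3} \log^{2/3} T\big).$$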

87 Performance of the algorithm. (Plots of the prior mean and of the posterior mean of the covering-net Exponential Weights algorithm after $t = 10, 20, 50$ and $100$ observations.)

92 Other loss functions Cross-entropy loss $\ell(y, \hat{y}) = -y \log \hat{y} - (1 - y)\log(1 - \hat{y})$: the same bound $O\big(T^{1/3}\log^{2/3}(T)\big)$; covering net $F_K$ obtained by non-uniform discretization.

93 Other loss functions Cross-entropy loss $\ell(y, \hat{y}) = -y \log \hat{y} - (1 - y)\log(1 - \hat{y})$: the same bound $O\big(T^{1/3}\log^{2/3}(T)\big)$; covering net $F_K$ obtained by non-uniform discretization. Absolute loss $\ell(y, \hat{y}) = |y - \hat{y}|$: $O\big(\sqrt{T \log T}\big)$ obtained by Exponentiated Gradient; matching lower bound $\Omega(\sqrt{T})$ (up to a log factor).

94 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

95 Random permutation model A more realistic scenario for generating $x_1, \ldots, x_T$ which allows the data to be unknown in advance.

96 Random permutation model A more realistic scenario for generating $x_1, \ldots, x_T$ which allows the data to be unknown in advance. The data are chosen adversarially before the game begins, but are then presented to the learner in a random order. Motivation: the data gathering process is independent of the underlying data generation mechanism. Still a very weak assumption. Evaluation: regret averaged over all permutations of the data: $\mathbb{E}_\sigma[\mathrm{regret}_T]$. Kotłowski, Koolen, Malek: Random Permutation Online Isotonic Regression. Submitted, 2017.

97 Leave-one-out loss Definition: Given $t$ labeled points $\{(x_i, y_i)\}_{i=1}^t$, for $i = 1, \ldots, t$: take out the $i$-th point and give the remaining $t - 1$ points to the learner as training data; the learner predicts $\hat{y}_i$ on $x_i$ and receives loss $\ell(y_i, \hat{y}_i)$. Evaluate the learner by $\mathrm{loo}_t = \frac{1}{t}\sum_{i=1}^t \ell(y_i, \hat{y}_i)$. There is no sequential structure in the definition.

98 Leave-one-out loss Definition: Given $t$ labeled points $\{(x_i, y_i)\}_{i=1}^t$, for $i = 1, \ldots, t$: take out the $i$-th point and give the remaining $t - 1$ points to the learner as training data; the learner predicts $\hat{y}_i$ on $x_i$ and receives loss $\ell(y_i, \hat{y}_i)$. Evaluate the learner by $\mathrm{loo}_t = \frac{1}{t}\sum_{i=1}^t \ell(y_i, \hat{y}_i)$. There is no sequential structure in the definition. Theorem: If $\mathrm{loo}_t \le g(t)$ for all $t$, then $\mathbb{E}_\sigma[\mathrm{regret}_T] \le \sum_{t=1}^T g(t)$.
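A direct sketch of the leave-one-out evaluation above (illustrative; predict(train, x) stands for whatever algorithm is being evaluated):

```python
def leave_one_out_loss(points, predict, loss):
    """loo_t: average loss when each point is predicted from the other t-1 points.
    points: list of (x_i, y_i); predict(train, x) is the algorithm under study."""
    t = len(points)
    total = 0.0
    for i, (x_i, y_i) in enumerate(points):
        train = points[:i] + points[i + 1:]        # leave the i-th point out
        total += loss(y_i, predict(train, x_i))    # loss on the held-out point
    return total / t
```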

99 Fixed design to random permutation conversion Any algorithm for fixed design can be used in the random permutation setup by being re-run from scratch in each trial. We have shown that $\mathrm{loo}_t \le \frac{1}{t}\,\mathbb{E}_\sigma[\text{fixed-design-regret}_t]$. We thus get an optimal algorithm (Exponential Weights on a grid) with $\tilde{O}(t^{-2/3})$ leave-one-out loss for free, but it is complicated. Can we get simpler algorithms to work in this setup?

100 Follow the Leader (FTL) algorithm Definition: Given the past $t - 1$ data points, compute the optimal (loss-minimizing) isotonic function $f^*$ and predict on a new instance $x$ according to $f^*(x)$.

101 Follow the Leader (FTL) algorithm Definition: Given the past $t - 1$ data points, compute the optimal (loss-minimizing) isotonic function $f^*$ and predict on a new instance $x$ according to $f^*(x)$. FTL is undefined for isotonic regression.

102 Follow the Leader (FTL) algorithm Definition: Given the past $t - 1$ data points, compute the optimal (loss-minimizing) isotonic function $f^*$ and predict on a new instance $x$ according to $f^*(x)$. FTL is undefined for isotonic regression: with past labels $0, 0.2, 0.7, 1$ and a new instance $x$ lying between the points labeled $0.2$ and $0.7$, any value in $[0.2, 0.7]$ minimizes the loss on the past data, so $f^*(x)$ is not uniquely determined.

103 Forward Algorithm (FA) Definition: Given the past $t - 1$ data points and a new instance $x$, take any guess $y' \in [0, 1]$ of the new label and predict according to the optimal isotonic function $f^*$ computed on the past data including the new point $(x, y')$. Various popular prediction algorithms for isotonic regression fall into this framework (including linear interpolation [Zadrozny & Elkan, 2002] and many others [Vovk et al., 2015]).

107 Forward Algorithm (FA) Two extreme forward algorithms: guess-1 and guess-0, denoted $f^1$ and $f^0$. The prediction of any FA always lies between them: $f^0(x) \le f(x) \le f^1(x)$. (Illustration: data points $y_1, \ldots, y_8$ at $x_1, \ldots, x_8$ together with the curves $f^0$ and $f^1$; every FA predicts in the range between them.)
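A sketch of one forward algorithm, using scikit-learn for the isotonic fit (illustrative; the data is made up, and the guesses 0 and 1 correspond to the two extreme algorithms $f^0$ and $f^1$ above):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def forward_predict(x_train, y_train, x_new, guess=0.5):
    """Forward Algorithm: append (x_new, guess), fit isotonic regression on the
    augmented data, and predict with the fitted value at x_new."""
    x_aug = np.append(x_train, x_new)
    y_aug = np.append(y_train, guess)
    fit = IsotonicRegression().fit(x_aug, y_aug)
    return float(fit.predict([x_new])[0])

x_tr = np.array([1.0, 2.0, 4.0, 5.0])
y_tr = np.array([0.0, 0.2, 0.7, 1.0])
print(forward_predict(x_tr, y_tr, 3.0, guess=0.0))   # extreme FA f^0
print(forward_predict(x_tr, y_tr, 3.0, guess=1.0))   # extreme FA f^1
```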

111 Performance of FA Theorem: For squared loss, every forward algorithm has $\mathrm{loo}_t = O\big(\sqrt{\log t / t}\big)$. The bound is suboptimal, but only a factor of $O(t^{1/6})$ off. For cross-entropy loss, the same bound holds, but a more careful choice of the guess must be made.

112 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions

113 Conclusions Two models for online isotonic regression: fixed design and random permutation. Optimal algorithm in both models: Exponential Weights (Bayes) on a grid. In the random permutation model, a class of forward algorithms with good bounds on the leave-one-out loss. Open problem: Extend any of these algorithms to the partial order case.

114 Bibliography Statistics M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26(4), 1955. H. D. Brunk. Maximum likelihood estimates of monotone parameters. Annals of Mathematical Statistics, 26(4), 1955. J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27, 1964. R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67, 1972. T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. John Wiley & Sons, 1998. Sara Van de Geer. Estimating a regression function. Annals of Statistics, 18, 1990. Cun-Hui Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2), 2002. Jan de Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32:1-24, 2009.

115 Bibliography Machine Learning Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, 2002. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, volume 119, ACM, 2005. Tom Fawcett and Alexandru Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97-106, 2007. Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. In NIPS, 2015. Aditya Krishna Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. In ICML, 2012. Rasmus Kyng, Anup Rao, and Sushant Sachdeva. Fast, provable algorithms for isotonic regression in all l_p-norms. In NIPS, 2015. Adam Tauman Kalai and Ravi Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009. T. Moon, A. Smola, Y. Chang, and Z. Zheng. IntervalRank: Isotonic regression with listwise and pairwise constraints. In WSDM, ACM, 2010. Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, 2011.

116 Bibliography Online isotonic regression Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In COLT, 2014. Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In COLT, 2015. Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Online isotonic regression. In COLT, 2016. Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Random permutation online isotonic regression. Submitted, 2017.

