Online isotonic regression
1 Online isotonic regression. Wojciech Kotłowski (Poznań University of Technology). Joint work with: Wouter Koolen (CWI, Amsterdam) and Alan Malek (MIT).
2 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions
4 Motivation I house pricing Assess the selling price of a house based on its attributes.
5 Motivation I house pricing. Den Bosch data set [scatter plot: area vs. price].
6 Motivation I house pricing. Fitting a linear function [plot: area vs. price].
7 Motivation I house pricing. Fitting an isotonic¹ function [plot: area vs. price]. ¹ isotonic = non-decreasing, order-preserving.
8 Motivation II predicting good probabilities. Predictions of an SVM classifier (german credit data) [scatter plot: score vs. label]. Can we turn score values into conditional probabilities P(y | x)?
9 Motivation II predicting good probabilities. Fitting an isotonic function to the labels [Zadrozny & Elkan, 2002] [plot: score vs. label/probability].
10 Motivation II predicting good probabilities. [Calibration plots (reliability curve), fraction of positives: Perfectly calibrated; Logistic (0.099); SVM (0.163); SVM + Isotonic (0.100). Generated by a script from scikit-learn.org.]
11 Motivation II predicting good probabilities. [Calibration plots (reliability curve), fraction of positives: Perfectly calibrated; Logistic (0.099); Naive Bayes (0.118); Naive Bayes + Isotonic (0.098). Generated by a script from scikit-learn.org.]
12 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions
13 Isotonic regression. Definition: fit an isotonic (monotonically non-decreasing) function to the data. Extensively studied in statistics [Ayer et al., 55; Brunk, 55; Robertson et al., 98]. Numerous applications: biology, medicine, psychology; multicriteria decision support; hypothesis tests under order constraints; multidimensional scaling; machine learning (probability calibration, ROC analysis).
14 Isotonic regression
15 Isotonic regression. Definition: given data {(x_t, y_t)}_{t=1}^T ⊂ ℝ × ℝ, find an isotonic (non-decreasing) f : ℝ → ℝ which minimizes the squared error over the labels: min_f Σ_{t=1}^T (y_t - f(x_t))², subject to: x_t ≤ x_q ⇒ f(x_t) ≤ f(x_q) for all q, t ∈ {1, …, T}. The optimal solution f* is called the isotonic regression function. Only the values f*(x_t), t = 1, …, T, matter.
16 Isotonic regression example (source: scikit-learn.org)
17 Properties of isotonic regression. Depends on the instances (x) only through their order relation. Only defined at the points {x_1, …, x_T}; often extended to ℝ by linear interpolation. Piecewise constant (splits the data into level sets). Self-averaging property: the value of f* on a given level set equals the average of the labels in that level set; for any v, v = (1/|S_v|) Σ_{t ∈ S_v} y_t, where S_v = {t : f*(x_t) = v}. If y_t ∈ [a, b] for all t, then f*(x_t) ∈ [a, b] for all t.
18 Isotonic regression gives calibrated probabilities. Definition: let y ∈ {0, 1}. A probability estimator p̂ of y is calibrated if E[y | p̂ = v] = v.
19 Isotonic regression gives calibrated probabilities. Definition: let y ∈ {0, 1}. A probability estimator p̂ of y is calibrated if E[y | p̂ = v] = v. Fact: for binary labels, the isotonic regression f* is a calibrated probability estimator on the data set. Proof: let S_v = {t : f*(x_t) = v}. By self-averaging, E[y | f*(x) = v] = (1/|S_v|) Σ_{t ∈ S_v} y_t = v.
20 Pool Adjacent Violators Algorithm (PAVA). Iterative merging of data points into blocks until no violators of the isotonic constraints exist. The value assigned to each block is the average of the labels in that block. The final assignment to blocks corresponds to the level sets of the isotonic regression. Works in linear O(T) time, but requires the data to be sorted.
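The merging loop can be sketched in a few lines of Python (a minimal sketch with my own function name, not code from the talk); note that each output value is the average of the labels in its block, which is exactly the self-averaging property:

```python
def pava(y, w=None):
    """Isotonic regression of labels y (already sorted by x) under squared loss.
    Returns the fitted values; w are optional positive weights."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [value, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge while the last two blocks violate the isotonic constraint.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v1, w1, n1 = blocks.pop()
            v0, w0, n0 = blocks.pop()
            blocks.append([(w0 * v0 + w1 * v1) / (w0 + w1), w0 + w1, n0 + n1])
    # Read out: each block's value, repeated over its points (the level sets).
    return [v for v, _, n in blocks for _ in range(n)]

print(pava([1.0, 0.0, 1.0]))  # -> [0.5, 0.5, 1.0]
```

The stack keeps the invariant that all retained blocks are already non-decreasing, so each point is pushed and popped at most once, giving the linear running time mentioned above.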
21 PAVA: example. [table of the raw data: columns x and y, with a scatter plot of y against x]
22 PAVA: example. Step 1: sort the data in increasing order of x. [table: the data before and after sorting]
23 PAVA: example. Step 2: split the data into blocks B_1, …, B_r, such that points with the same x_t fall into the same block. Assign to each block (i = 1, …, r) a value f_i which is the average of the labels in that block. [table: blocks B_1 = {1}, B_2 = {2, 3}, B_3 = {4}, B_4 = {5}, B_5 = {6}, B_6 = {7}, B_7 = {8}, B_8 = {9}, B_9 = {10}, B_10 = {11, 12}, B_11 = {13}, each with its value f_i]
24 27 PAVA: example. Step 3: while there exists a violator, i.e. a pair of adjacent blocks B_i, B_{i+1} with f_i > f_{i+1}, merge B_i and B_{i+1} and assign the weighted average f_i = (|B_i| f_i + |B_{i+1}| f_{i+1}) / (|B_i| + |B_{i+1}|). [tables: successive merges of {2, 3} with {4}, then {7} with {8}, then {6} with {7, 8}, then {11, 12} with {13}, ending with blocks {1}, {2, 3, 4}, {5}, {6, 7, 8}, {9}, {10}, {11, 12, 13}] No more violators: finished.
28 PAVA: example. Reading out the solution: each point receives the value f_i of its block. [table: final blocks {1}, {2, 3, 4}, {5}, {6, 7, 8}, {9}, {10}, {11, 12, 13} with their values f_i; columns x, y, f*]
29 PAVA: example. [plot: the data y and the fitted isotonic function f* against x]
30 Generalized isotonic regression. Definition: given data {(x_t, y_t)}_{t=1}^T ⊂ ℝ × ℝ, find an isotonic f : ℝ → ℝ which minimizes min_{isotonic f} Σ_{t=1}^T ℓ(y_t, f(x_t)). The squared loss (y_t - f(x_t))² is replaced with a general loss ℓ(y_t, f(x_t)).
31 Theorem [Robertson et al., 1998]: all loss functions of the form ℓ(y, z) = Ψ(y) - Ψ(z) - Ψ′(z)(y - z) for some strictly convex Ψ result in the same isotonic regression function f*.
32 Generalized isotonic regression examples, ℓ(y, z) = Ψ(y) - Ψ(z) - Ψ′(z)(y - z). Squared function Ψ(y) = y²: ℓ(y, z) = y² - z² - 2z(y - z) = (y - z)² (squared loss). Entropy Ψ(y) = y log y + (1 - y) log(1 - y), y ∈ [0, 1]: ℓ(y, z) = -y log z - (1 - y) log(1 - z) (cross-entropy, up to a term not depending on z). Negative logarithm Ψ(y) = -log y, y > 0: ℓ(y, z) = y/z - log(y/z) - 1 (Itakura-Saito distance / Burg entropy).
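These examples can be checked numerically. A small sketch (the function names are mine) that evaluates ℓ(y, z) = Ψ(y) - Ψ(z) - Ψ′(z)(y - z) for each generating Ψ:

```python
import math

def bregman(psi, dpsi, y, z):
    """Loss generated by a strictly convex psi with derivative dpsi."""
    return psi(y) - psi(z) - dpsi(z) * (y - z)

# Squared function psi(y) = y^2 gives back the squared loss (y - z)^2.
sq = bregman(lambda u: u * u, lambda u: 2 * u, 0.3, 0.7)

# Entropy psi(y) = y log y + (1 - y) log(1 - y) gives the cross-entropy
# (up to a term that does not depend on z, hence the same minimizer).
ce = bregman(lambda u: u * math.log(u) + (1 - u) * math.log(1 - u),
             lambda u: math.log(u / (1 - u)), 0.3, 0.7)

# Negative logarithm psi(y) = -log y gives the Itakura-Saito distance.
isd = bregman(lambda u: -math.log(u), lambda u: -1.0 / u, 0.3, 0.7)

print(abs(sq - (0.3 - 0.7) ** 2) < 1e-12)                        # True
print(abs(isd - (0.3 / 0.7 - math.log(0.3 / 0.7) - 1)) < 1e-12)  # True
```

All three are Bregman divergences of their Ψ, which is why they share a minimizer over isotonic functions.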
33 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions
34 Online learning framework. A theoretical framework for the analysis of online algorithms. The learning process is by its very nature incremental. Avoids stochastic (e.g., i.i.d.) assumptions on the data sequence; designs algorithms which work well for any data. Meaningful performance guarantees based on observed quantities: regret bounds.
35 Online learning framework. [diagram: at trial t the learner holds a strategy f_t : X → Y, receives a new instance (x_t, ?), predicts ŷ_t = f_t(x_t), receives feedback y_t, suffers loss ℓ(y_t, ŷ_t), and moves to trial t + 1]
36 Online learning framework. Set of strategies (actions) F; known loss function ℓ. The learner starts with some initial strategy (action) f_1. For t = 1, 2, …: (1) the learner observes instance x_t; (2) the learner predicts ŷ_t = f_t(x_t); (3) the environment reveals the outcome y_t; (4) the learner suffers loss ℓ(y_t, ŷ_t); (5) the learner updates its strategy f_t → f_{t+1}.
37 Online learning framework. The goal of the learner is to be close to the best f ∈ F in hindsight. Cumulative loss of the learner: L_T = Σ_{t=1}^T ℓ(y_t, ŷ_t). Cumulative loss of the best strategy in hindsight: L*_T = min_{f ∈ F} Σ_{t=1}^T ℓ(y_t, f(x_t)). Regret of the learner: regret_T = L_T - L*_T. The goal is to minimize the regret over all possible data sequences.
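The protocol and the regret can be made concrete with a toy example (the data, the running-mean learner, and the grid search for the best fixed prediction are illustrative choices of mine, not from the talk):

```python
def squared(y, yhat):
    return (y - yhat) ** 2

ys = [0.0, 1.0, 1.0, 0.0, 1.0]

# Learner: predict the running mean of past labels (0.5 before any data).
L = 0.0
seen = []
for y in ys:
    yhat = sum(seen) / len(seen) if seen else 0.5
    L += squared(y, yhat)   # loss suffered this trial
    seen.append(y)          # feedback: the label is revealed

# Best fixed prediction in hindsight, found by grid search over [0, 1]
# (for squared loss this is the mean of the labels).
best = min(sum(squared(y, c / 100) for y in ys) for c in range(101))
regret = L - best
print(regret >= 0)  # True for this data (in general regret can be negative)
```

Here the comparator class F is just the constant predictions; the isotonic class used later in the talk plays the same role in the same loop.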
38 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions
39 49 Online isotonic regression. [figure slides, Y ∈ [0, 1] over points x_1 < … < x_8: the environment picks a yet unlabeled point (here x_5), the learner predicts ŷ_5, the label y_5 is revealed, and the learner suffers loss (ŷ_5 - y_5)²; the process then repeats on x_1, and so on]
50 Online isotonic regression. The protocol: given x_1 < x_2 < … < x_T, at trial t = 1, …, T: the environment chooses a yet unlabeled point x_{i_t}; the learner predicts ŷ_{i_t} ∈ [0, 1]; the environment reveals the label y_{i_t} ∈ [0, 1]; the learner suffers squared loss (y_{i_t} - ŷ_{i_t})².
51 Strategies = isotonic functions: F = {f : f(x_1) ≤ f(x_2) ≤ … ≤ f(x_T)}.
52 regret_T = Σ_{t=1}^T (y_{i_t} - ŷ_{i_t})² - min_{f ∈ F} Σ_{t=1}^T (y_{i_t} - f(x_{i_t}))².
53 The cumulative loss of the learner should not be much larger than the loss of the (optimal) isotonic regression function in hindsight. Only the order x_1 < … < x_T matters, not the values.
54 66 The adversary is too powerful! Every algorithm will have Ω(T) regret. [figure slides: the adversary queries a point, observes the learner's prediction ŷ, and answers with a 0/1 label at distance at least 1/2 from ŷ, each time choosing the next query point so that the labels remain consistent with some isotonic function; each trial costs the learner loss at least 1/4]
67 The algorithm's loss is at least 1/4 per trial, while the loss of the best isotonic function is 0.
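This lower bound can be simulated. A sketch (my own reconstruction, assuming the standard halving construction) in which the adversary answers against any learner while staying consistent with an isotonic 0/1 labeling:

```python
def adversary_game(learner, T):
    """Play T rounds; the adversary queries the midpoint of the current
    'uncertain' interval and answers with a 0/1 label at distance >= 1/2
    from the learner's prediction, keeping labels non-decreasing in x."""
    lo, hi = 0.0, 1.0   # boundary of the uncertain region in X = [0, 1]
    history = []        # revealed (x, y) pairs
    total_loss = 0.0
    for _ in range(T):
        x = (lo + hi) / 2
        yhat = learner(history, x)
        if yhat <= 0.5:
            y = 1.0
            hi = x      # the 1-labeled region now extends down to x
        else:
            y = 0.0
            lo = x      # the 0-labeled region now extends up to x
        total_loss += (y - yhat) ** 2
        history.append((x, y))
    return total_loss

# Any learner suffers at least 1/4 per trial:
loss = adversary_game(lambda hist, x: 0.5, T=100)
print(loss >= 100 / 4)  # True
```

The binary-search structure keeps every 0-labeled point to the left of every 1-labeled point, so a step function jumping inside (lo, hi) is isotonic and fits all revealed labels exactly: hindsight loss 0, learner loss Ω(T).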
68 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions
69 Fixed design. The data x_1, …, x_T are known in advance to the learner. We will show that in this model, efficient online algorithms exist. [Kotłowski, Koolen, Malek: Online Isotonic Regression. Proc. of Conference on Learning Theory (COLT), 2016.]
70 Off-the-shelf online algorithms.
Algorithm | General bound | Bound for online IR
Stochastic Gradient Descent | G_2 D_2 √T | T
Exponentiated Gradient | G_∞ D_1 √(T log d) | √(T log T)
Follow the Leader | G² D² d log T | T² log T
Exponential Weights | d log T | T log T
These bounds are tight (up to logarithmic factors).
71 Exponential Weights (Bayes) with uniform prior. Let f = (f_1, …, f_T) denote the values of f at (x_1, …, x_T). Prior: π(f) = const for all f with f_1 ≤ … ≤ f_T. Posterior: P(f | y_{i_1}, …, y_{i_t}) ∝ π(f) e^{-(1/2) loss_{1…t}(f)}. Prediction: ŷ_{i_{t+1}} = ∫ f_{i_{t+1}} P(f | y_{i_1}, …, y_{i_t}) df, the posterior mean.
72 76 Exponential Weights with uniform prior does not learn. [figure slides: the prior mean and the posterior means at t = 10, 20, 50, 100; the posterior mean fails to fit the data]
77 The algorithm. Exponential Weights on a covering net F_K = {f : f_t = k_t/K, k_t ∈ {0, 1, …, K}, f_1 ≤ … ≤ f_T}, with π(f) uniform on F_K. Efficient implementation by dynamic programming: O(Kt) at trial t. Speed-up to O(K) if the data are revealed in isotonic order.
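A sketch of how such a dynamic program can look (my own reconstruction of the idea, with assumed names, not code from the paper): for every grid level at the queried point, accumulate the total posterior weight of monotone sequences passing through it, via one forward and one backward pass:

```python
import math

def ew_predict(T, K, labeled, i, eta=0.5):
    """Posterior-mean prediction at point i for Exponential Weights on the
    covering net F_K = {f : f_t = k_t / K, 0 <= k_1 <= ... <= k_T <= K},
    with a uniform prior; labeled maps point index -> label in [0, 1].
    Runs in O(T * K) time using running prefix/suffix sums."""
    def w(t, k):
        # Per-point weight: exp(-eta * squared loss) if the point is labeled.
        return math.exp(-eta * (labeled[t] - k / K) ** 2) if t in labeled else 1.0

    # Forward: A[t][k] = total weight of monotone prefixes ending at level k.
    A = [[w(0, k) for k in range(K + 1)]]
    for t in range(1, T):
        run, row = 0.0, []
        for k in range(K + 1):
            run += A[t - 1][k]
            row.append(w(t, k) * run)
        A.append(row)

    # Backward: B[t][k] = total weight of monotone suffixes starting at level k.
    B = [[1.0] * (K + 1) for _ in range(T)]
    B[T - 1] = [w(T - 1, k) for k in range(K + 1)]
    for t in range(T - 2, -1, -1):
        run = 0.0
        for k in range(K, -1, -1):
            run += B[t + 1][k]
            B[t][k] = w(t, k) * run

    # Posterior marginal of level k at point i (w(i, k) is counted twice above).
    marg = [A[i][k] * B[i][k] / w(i, k) for k in range(K + 1)]
    return sum(k / K * m for k, m in zip(range(K + 1), marg)) / sum(marg)

# With no labels yet, the prior mean at the middle of T = 3 points is 1/2.
print(ew_predict(3, 2, {}, 1))  # -> 0.5
```

Since the posterior is supported on monotone sequences, the resulting predictions are non-decreasing in the point index and stay in [0, 1]; labeling a point pulls the posterior mean there toward its label.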
78 82 Covering net. A finite set of isotonic functions on a discrete grid of y values. [figure slides over x_1, …, x_12] There are O(T^K) functions in F_K.
83 84 Performance of the algorithm. Regret bound: when K = Θ(T^{1/3} / log^{1/3}(T)), Regret = O(T^{1/3} log^{2/3}(T)). Matching lower bound Ω(T^{1/3}) (up to a log factor).
85 86 Proof idea: Regret = [Loss(alg) - min_{f ∈ F_K} Loss(f)] + [min_{f ∈ F_K} Loss(f) - min_{isotonic f} Loss(f)], where the first term is at most 2 log |F_K| = O(K log T) and the second is at most T / (4K²).
87 91 Performance of the algorithm. [figure slides: the prior mean and the posterior means at t = 10, 20, 50, 100 for the covering-net algorithm]
92 Other loss functions. Cross-entropy loss ℓ(y, ŷ) = -y log ŷ - (1 - y) log(1 - ŷ): the same bound O(T^{1/3} log^{2/3}(T)); the covering net F_K is obtained by a non-uniform discretization.
93 Absolute loss ℓ(y, ŷ) = |y - ŷ|: O(√(T log T)), obtained by Exponentiated Gradient. Matching lower bound Ω(√T) (up to a log factor).
94 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions
95 Random permutation model. A more realistic scenario for generating x_1, …, x_T which allows the data to be unknown in advance.
96 The data are chosen adversarially before the game begins, but are then presented to the learner in a random order. Motivation: the data-gathering process is independent of the underlying data-generating mechanism. Still a very weak assumption. Evaluation: regret averaged over all permutations of the data, E_σ[regret_T]. [Kotłowski, Koolen, Malek: Random Permutation Online Isotonic Regression. Submitted, 2017.]
97 Leave-one-out loss. Definition: given t labeled points {(x_i, y_i)}_{i=1}^t, for i = 1, …, t: take out the i-th point and give the remaining t - 1 points to the learner as training data; the learner predicts ŷ_i on x_i and receives loss ℓ(y_i, ŷ_i). Evaluate the learner by loo_t = (1/t) Σ_{i=1}^t ℓ(y_i, ŷ_i). There is no sequential structure in this definition.
98 Theorem: if loo_t ≤ g(t) for all t, then E_σ[regret_T] ≤ Σ_{t=1}^T g(t).
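The definition translates directly into code; a toy sketch (the "predict the mean training label" learner is an illustrative choice of mine, just to exercise the definition):

```python
def loo(points, learner, loss=lambda y, yhat: (y - yhat) ** 2):
    """Leave-one-out loss: average loss when each point is held out in turn."""
    t = len(points)
    total = 0.0
    for i in range(t):
        train = points[:i] + points[i + 1:]   # remaining t - 1 points
        x_i, y_i = points[i]
        total += loss(y_i, learner(train, x_i))
    return total / t

# Illustrative learner: ignore x, predict the mean training label.
mean_learner = lambda train, x: sum(y for _, y in train) / len(train)
pts = [(1, 0.0), (2, 1.0), (3, 1.0)]
print(loo(pts, mean_learner))  # -> 0.5
```

Plugging an online learner's prediction rule into `learner` gives exactly the loo_t that the theorem sums over t to bound the expected regret.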
99 Fixed design to random permutation conversion. Any algorithm for fixed design can be used in the random-permutation setup by re-running it from scratch in each trial. We have shown that loo_t ≤ (1/t) E_σ[fixed-design-regret_t]. We thus get an optimal algorithm (Exponential Weights on a grid) with Õ(t^{-2/3}) leave-one-out loss for free, but it is complicated. Can we get simpler algorithms to work in this setup?
100 Follow the Leader (FTL) algorithm. Definition: given the past t - 1 data points, compute the optimal (loss-minimizing) function f* and predict on a new instance x according to f*(x).
101 102 FTL is undefined for isotonic regression: f* is determined only at the past points, and at a new x between two level sets any value in between is loss-minimizing (in the example table, past labels 0, 0.2, 0.7, 1 leave f*(x) undetermined at the new point).
103 105 Forward Algorithm (FA). Definition: given the past t - 1 data points and a new instance x, take any guess y′ ∈ [0, 1] of the new label and predict according to the optimal function f* fitted to the past data including the new point (x, y′). [example tables: the fit with the guessed label filled in]
106 Various popular prediction algorithms for IR fall into this framework (including linear interpolation [Zadrozny & Elkan, 2002] and many others [Vovk et al., 2015]).
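A forward-algorithm sketch (my own code; the guess strategy is the parameter): append the new point with a guessed label, run PAVA on the augmented data, and read off the fitted value at the new point:

```python
def pava(points):
    """PAVA for squared loss; points = [(x, y), ...]; returns {x: fitted value}."""
    pts = sorted(points)
    blocks = []  # each block: [sum of labels, count, list of x values]
    for x, y in pts:
        blocks.append([y, 1, [x]])
        # Merge while adjacent block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s1, n1, xs1 = blocks.pop()
            s0, n0, xs0 = blocks.pop()
            blocks.append([s0 + s1, n0 + n1, xs0 + xs1])
    return {x: s / n for s, n, xs in blocks for x in xs}

def forward_predict(past, x_new, guess):
    """Forward algorithm: guess the new label, refit, and predict f*(x_new)."""
    return pava(past + [(x_new, guess)])[x_new]

past = [(1, 0.0), (3, 1.0)]
print(forward_predict(past, 2, guess=1.0))  # -> 1.0 (the guess-1 FA)
print(forward_predict(past, 2, guess=0.0))  # -> 0.0 (the guess-0 FA)
```

With intermediate guesses the prediction lands between these two extremes, which is the bracketing property f⁰(x) ≤ f(x) ≤ f¹(x) discussed next.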
107 110 Forward Algorithm (FA). Two extreme FAs: guess-1 and guess-0, denoted f¹ and f⁰. The prediction of any FA always lies between them: f⁰(x) ≤ f(x) ≤ f¹(x). [figure slides: labeled points y_1, …, y_8 with the two fits f¹ and f⁰; every FA predicts in the range between them]
111 Performance of FA. Theorem: for squared loss, every forward algorithm has loo_t = O(√(log t / t)). The bound is suboptimal, but only a factor of O(t^{1/6}) off. For cross-entropy loss, the same bound holds, but a more careful choice of the guess must be made.
112 Outline 1 Motivation 2 Isotonic regression 3 Online learning 4 Online isotonic regression 5 Fixed design online isotonic regression 6 Random permutation online isotonic regression 7 Conclusions
113 Conclusions Two models for online isotonic regression: fixed design and random permutation. Optimal algorithm in both models: Exponential Weights (Bayes) on a grid. In the random permutation model, a class of forward algorithms with good bounds on the leave-one-out loss. Open problem: Extend any of these algorithms to the partial order case.
114 Bibliography Statistics
M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26(4), 1955.
H. D. Brunk. Maximum likelihood estimates of monotone parameters. Annals of Mathematical Statistics, 26(4), 1955.
J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27, 1964.
R. E. Barlow and H. D. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67, 1972.
T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. John Wiley & Sons, 1998.
Sara Van de Geer. Estimating a regression function. Annals of Statistics, 18, 1990.
Cun-Hui Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2), 2002.
Jan de Leeuw, Kurt Hornik, and Patrick Mair. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32:1-24, 2009.
115 Bibliography Machine Learning
Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, 2002.
Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, volume 119, ACM, 2005.
Tom Fawcett and Alexandru Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97-106, 2007.
Vladimir Vovk, Ivan Petej, and Valentina Fedorova. Large-scale probabilistic predictors with and without guarantees of validity. In NIPS, 2015.
Aditya Krishna Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. In ICML, 2012.
Rasmus Kyng, Anup Rao, and Sushant Sachdeva. Fast, provable algorithms for isotonic regression in all l_p-norms. In NIPS, 2015.
Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
T. Moon, A. Smola, Y. Chang, and Z. Zheng. IntervalRank: Isotonic regression with listwise and pairwise constraints. In WSDM, ACM, 2010.
Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In NIPS, 2011.
116 Bibliography Online isotonic regression
Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In COLT, 2014.
Pierre Gaillard and Sébastien Gerchinovitz. A chaining algorithm for online nonparametric regression. In COLT, 2015.
Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Online isotonic regression. In COLT, 2016.
Wojciech Kotłowski, Wouter M. Koolen, and Alan Malek. Random permutation online isotonic regression. Submitted, 2017.
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationLearning, Games, and Networks
Learning, Games, and Networks Abhishek Sinha Laboratory for Information and Decision Systems MIT ML Talk Series @CNRG December 12, 2016 1 / 44 Outline 1 Prediction With Experts Advice 2 Application to
More informationBregman Divergences for Data Mining Meta-Algorithms
p.1/?? Bregman Divergences for Data Mining Meta-Algorithms Joydeep Ghosh University of Texas at Austin ghosh@ece.utexas.edu Reflects joint work with Arindam Banerjee, Srujana Merugu, Inderjit Dhillon,
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationListwise Approach to Learning to Rank Theory and Algorithm
Listwise Approach to Learning to Rank Theory and Algorithm Fen Xia *, Tie-Yan Liu Jue Wang, Wensheng Zhang and Hang Li Microsoft Research Asia Chinese Academy of Sciences document s Learning to Rank for
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationLinear Discrimination Functions
Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationLearnability, Stability, Regularization and Strong Convexity
Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago
More informationCS325 Artificial Intelligence Chs. 18 & 4 Supervised Machine Learning (cont)
CS325 Artificial Intelligence Cengiz Spring 2013 Model Complexity in Learning f(x) x Model Complexity in Learning f(x) x Let s start with the linear case... Linear Regression Linear Regression price =
More informationThe Online Approach to Machine Learning
The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationWorst-Case Bounds for Gaussian Process Models
Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationMinimax Fixed-Design Linear Regression
JMLR: Workshop and Conference Proceedings vol 40:1 14, 2015 Mini Fixed-Design Linear Regression Peter L. Bartlett University of California at Berkeley and Queensland University of Technology Wouter M.
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationConditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013
Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative Chain CRF General
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationDistributed online optimization over jointly connected digraphs
Distributed online optimization over jointly connected digraphs David Mateos-Núñez Jorge Cortés University of California, San Diego {dmateosn,cortes}@ucsd.edu Mathematical Theory of Networks and Systems
More informationOn Minimaxity of Follow the Leader Strategy in the Stochastic Setting
On Minimaxity of Follow the Leader Strategy in the Stochastic Setting Wojciech Kot lowsi Poznań University of Technology, Poland wotlowsi@cs.put.poznan.pl Abstract. We consider the setting of prediction
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationOnline Bounds for Bayesian Algorithms
Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic
More informationDecomposing Isotonic Regression for Efficiently Solving Large Problems
Decomposing Isotonic Regression for Efficiently Solving Large Problems Ronny Luss Dept. of Statistics and OR Tel Aviv University ronnyluss@gmail.com Saharon Rosset Dept. of Statistics and OR Tel Aviv University
More informationLinear discriminant functions
Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative
More informationPractical Agnostic Active Learning
Practical Agnostic Active Learning Alina Beygelzimer Yahoo Research based on joint work with Sanjoy Dasgupta, Daniel Hsu, John Langford, Francesco Orabona, Chicheng Zhang, and Tong Zhang * * introductory
More informationarxiv: v1 [math.oc] 4 Mar 2019
arxiv:1903.01054v1 [math.oc] 4 Mar 2019 The Application of Multi-block ADMM on Isotonic Regression Problems Junxiang Wang and Liang Zhao George Mason University Abstract The multi-block ADMM has received
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationU Logo Use Guidelines
Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationOnline Learning with Feedback Graphs
Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationSurrogate regret bounds for generalized classification performance metrics
Surrogate regret bounds for generalized classification performance metrics Wojciech Kotłowski Krzysztof Dembczyński Poznań University of Technology PL-SIGML, Częstochowa, 14.04.2016 1 / 36 Motivation 2
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationBetter Algorithms for Selective Sampling
Francesco Orabona Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy francesco@orabonacom nicolocesa-bianchi@unimiit Abstract We study online algorithms for selective sampling that use regularized
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationClassification objectives COMS 4771
Classification objectives COMS 4771 1. Recap: binary classification Scoring functions Consider binary classification problems with Y = { 1, +1}. 1 / 22 Scoring functions Consider binary classification
More informationMachine Learning. Linear Models. Fabio Vandin October 10, 2017
Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationLearning with Large Number of Experts: Component Hedge Algorithm
Learning with Large Number of Experts: Component Hedge Algorithm Giulia DeSalvo and Vitaly Kuznetsov Courant Institute March 24th, 215 1 / 3 Learning with Large Number of Experts Regret of RWM is O( T
More informationManual for a computer class in ML
Manual for a computer class in ML November 3, 2015 Abstract This document describes a tour of Machine Learning (ML) techniques using tools in MATLAB. We point to the standard implementations, give example
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationBoosting Methods: Why They Can Be Useful for High-Dimensional Data
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationLearning from Corrupted Binary Labels via Class-Probability Estimation
Learning from Corrupted Binary Labels via Class-Probability Estimation Aditya Krishna Menon Brendan van Rooyen Cheng Soon Ong Robert C. Williamson xxx National ICT Australia and The Australian National
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationSurrogate loss functions, divergences and decentralized detection
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationLogistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 8 Feb. 12, 2018 1 10-601 Introduction
More informationBinary Classification, Multi-label Classification and Ranking: A Decision-theoretic Approach
Binary Classification, Multi-label Classification and Ranking: A Decision-theoretic Approach Krzysztof Dembczyński and Wojciech Kot lowski Intelligent Decision Support Systems Laboratory (IDSS) Poznań
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationCSC242: Intro to AI. Lecture 21
CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages
More informationImportance Reweighting Using Adversarial-Collaborative Training
Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a
More informationMachine Learning. Lecture 3: Logistic Regression. Feng Li.
Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification
More informationLecture 1: Bayesian Framework Basics
Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of
More information