Optimal convex optimization under Tsybakov noise through connections to active learning


1 Optimal convex optimization under Tsybakov noise through connections to active learning. Aarti Singh. Joint work with: Aaditya Ramdas

2 Connections between convex optimization and active learning (a formal reduction)

3 Role of feedback in convex optimization and in active learning. Convex optimization: minimize computational complexity (# queries needed to find the optimum x*). Active learning: minimize sample complexity (# queries needed to find the decision boundary). [Raginsky-Rakhlin 09] (Figure: a convex f(x) with minimizer x* on domain X, and a regression function on X with its decision boundary.)

4 Active learning oracle model. Oracle provides Y ∈ {0, 1}: query x_1 → Y_1, query x_2 → Y_2, ..., query x_N → Y_N, where E[Y | X] = P(Y = 1 | X).

5 Stochastic optimization oracle model (first-order). Oracle provides noisy values of f(x) and g(x) = ∇f(x): query x_1 → f̂(x_1), ĝ(x_1); query x_2 → f̂(x_2), ĝ(x_2); ...; query x_N → f̂(x_N), ĝ(x_N), with E[f̂(x)] = f(x) and E[ĝ(x)] = g(x) (unbiased, variance σ²).
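This oracle is easy to emulate in code. Below is a minimal sketch, assuming Gaussian noise with standard deviation σ; the class name StochasticFirstOrderOracle and the toy quadratic in the usage line are illustrative choices, not from the talk.

```python
import numpy as np

class StochasticFirstOrderOracle:
    """Returns unbiased, noisy values of f(x) and g(x) = grad f(x)."""

    def __init__(self, f, grad, sigma=1.0, rng=None):
        self.f, self.grad, self.sigma = f, grad, sigma
        self.rng = rng or np.random.default_rng(0)

    def query(self, x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        f_hat = self.f(x) + self.sigma * self.rng.standard_normal()
        g_hat = self.grad(x) + self.sigma * self.rng.standard_normal(x.shape)
        return f_hat, g_hat  # E[f_hat] = f(x), E[g_hat] = grad f(x)

# Example: f(x) = ||x||^2 / 2, so grad f(x) = x
oracle = StochasticFirstOrderOracle(lambda x: 0.5 * x @ x, lambda x: x, sigma=0.5)
print(oracle.query(np.array([1.0, -2.0])))
```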

6 Connections in the 1-dim noiseless setting. Convex optimization: for a convex f(x) on domain X, P(sign(g(x)) = + | X) is a step function, jumping from 0 to 1 at the minimizer x*. Active learning: P(Y = 1 | X) is a step function, jumping at the decision boundary. (Figure: the two step functions over X.)

7 Connections in the 1-dim noisy setting. Convex optimization: with zero-mean noise on the gradient, P(sign(ĝ(x)) = + | X) still crosses 1/2 at the minimizer x*, but is no longer a step function. Active learning: P(Y = 1 | X) crosses 1/2 at the decision boundary. (Figure: the two regression functions over X.)

8 Minimax active learning rates in 1-dim. If the Tsybakov Noise Condition (TNC) holds with exponent κ ≥ 1, i.e. |P(Y = 1 | X = x) − 1/2| ≳ ‖x − x*‖^(κ−1), then the minimax optimal active learning rate in 1-dim is E[‖x̂_N − x*‖] ≍ N^(−1/(2κ−2)), and under 0/1 loss plus smoothness of P(Y | X), Risk(x̂_N) − Risk(x*) ≍ N^(−κ/(2κ−2)). [Castro-Nowak 07] (For κ = 2 these rates are N^(−1/2) and N^(−1).)

9 TNC and strong convexity. Strong convexity implies TNC with κ = 2: f(x) − f(x*) ≥ (λ/2)‖x − x*‖² implies ‖g(x) − g(x*)‖ ≥ λ‖x − x*‖, with g(x*) = 0. If the noise density grows linearly around its zero mean (Gaussian, uniform, triangular), then |P(sign(ĝ(x)) = + | X = x) − 1/2| ≳ ‖x − x*‖.
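The implication on this slide can be spelled out in one chain of inequalities; the following is a sketch in 1-dim assuming Gaussian gradient noise N(0, σ²) (the constant c can be taken as Φ(1) − 1/2).

```latex
f(x)-f(x^*) \;\ge\; \tfrac{\lambda}{2}\,|x-x^*|^2
\;\Longrightarrow\; |g(x)| \;\ge\; \lambda\,|x-x^*| \qquad (g(x^*)=0).

\text{With } \hat g(x) = g(x) + \sigma Z,\ Z \sim N(0,1),\ \text{and } x > x^*:\quad
P\big(\operatorname{sign}(\hat g(x)) = +\big) - \tfrac12
 \;=\; \Phi\!\big(g(x)/\sigma\big) - \tfrac12
 \;\ge\; c\,\min\!\Big(\tfrac{\lambda\,|x-x^*|}{\sigma},\,1\Big).
```

So near x* the label sign(ĝ(x)) satisfies the TNC with exponent κ = 2, which is exactly the condition under which the 1-dim active learning rates above apply.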

10 Algorithmic reduction (1-dim). In 1-dim, consider any active learning algorithm that is optimal for TNC exponent κ = 2. When given labels Y = sign(ĝ(x)), where f(x) is a strongly convex function with Lipschitz gradients, it yields E[‖x̂_T − x*‖] = O(T^(−1/2)) and E[f(x̂_T) − f(x*)] = O(T^(−1)). This matches the optimal rates for strongly convex functions [Nemirovski-Yudin 83, Agarwal-Bartlett-Ravikumar-Wainwright 10]. What about d-dim?
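As a toy illustration of this reduction (not the optimal active learning procedure analyzed in the talk), the sketch below feeds Y = sign(ĝ(x)) to a repeated-query bisection; all names and the per-round repetition schedule are assumptions.

```python
import numpy as np

def bisect_on_gradient_sign(noisy_grad, lo=0.0, hi=1.0, budget=2000):
    """Locate the minimizer of a 1-dim convex f on [lo, hi] using only the
    binary labels Y = sign(noisy_grad(x)), via repeated-query bisection."""
    rounds = max(1, int(np.log2(budget)))   # halve the interval this many times
    per_round = max(1, budget // rounds)    # label queries spent at each midpoint
    for _ in range(rounds):
        mid = 0.5 * (lo + hi)
        votes = sum(np.sign(noisy_grad(mid)) for _ in range(per_round))
        if votes > 0:    # gradient looks positive -> minimizer lies to the left
            hi = mid
        else:            # gradient looks non-positive -> minimizer lies to the right
            lo = mid
    return 0.5 * (lo + hi)

# Example: f(x) = (x - 0.3)^2, so g(x) = 2(x - 0.3), with Gaussian gradient noise
rng = np.random.default_rng(1)
noisy_g = lambda x: 2.0 * (x - 0.3) + rng.normal(scale=0.5)
print(bisect_on_gradient_sign(noisy_g, budget=4000))  # close to 0.3
```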

11 1-dim vs. d-dim. Convex optimization: the minimizer x* is a point (0-dim); rate T^(−1). Active learning: the decision boundary is a (d−1)-dim curve; rate N^(−2/(2+d−1)). Complexity of convex optimization in any dimension is the same as the complexity of active learning in 1 dimension.

12 Algorithmic reduction (d-dim): random coordinate descent with a 1-dim active learning subroutine. For e = 1, ..., E = d(log T)²: choose coordinate j at random from 1, ..., d; do active learning along coordinate j with sample budget T_e = T/E, treating sign(ĝ_j(x_t)) as the label Y_t. If f is strongly convex with Lipschitz gradients: inf_x̂ sup_{f,O} E[‖x̂ − x*_f‖] = Õ(T^(−1/2)) and inf_x̂ sup_{f,O} E[f(x̂) − f(x*)] = Õ(T^(−1)).
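A minimal sketch of the d-dim reduction under the same caveats: random coordinate choice per epoch, a bisection-style 1-dim subroutine on the sign of the chosen partial derivative, and a separable quadratic as the test function. The epoch count d(log T)² follows the slide; everything else (interval width, repetition schedule) is a placeholder.

```python
import numpy as np

def coordinate_descent_with_1d_al(noisy_grad, d, T, radius=1.0, rng=None):
    """Random coordinate descent: each epoch runs 1-dim active learning
    (repeated-query bisection on the sign of the j-th partial derivative)."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(d)
    E = max(1, int(d * np.log(T) ** 2))   # number of epochs, budget T_e = T/E each
    T_e = max(1, T // E)
    for _ in range(E):
        j = rng.integers(d)               # pick a coordinate uniformly at random
        lo, hi = x[j] - radius, x[j] + radius
        rounds = max(1, int(np.log2(T_e)))
        per_round = max(1, T_e // rounds)
        for _ in range(rounds):           # 1-dim subroutine along coordinate j
            mid = 0.5 * (lo + hi)
            x_mid = x.copy()
            x_mid[j] = mid
            votes = sum(np.sign(noisy_grad(x_mid)[j]) for _ in range(per_round))
            lo, hi = (lo, mid) if votes > 0 else (mid, hi)
        x[j] = 0.5 * (lo + hi)
    return x

# Example: f(x) = 0.5 * ||x - c||^2 with noisy gradients
c = np.array([0.3, -0.2, 0.5])
rng = np.random.default_rng(2)
g_hat = lambda x: (x - c) + 0.3 * rng.standard_normal(x.shape)
print(coordinate_descent_with_1d_al(g_hat, d=3, T=30000, rng=rng))
```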

13 Degree of convexity via the Tsybakov noise condition (TNC)

14 Degree of convexity via TNC. TNC for convex functions, κ > 1: f(x) − f(x*) ≥ λ‖x − x*‖^κ, which gives ‖g(x) − g(x*)‖ ≥ λ‖x − x*‖^(κ−1). This controls the strength of convexity around the minimum x*. A uniformly convex function (κ ≥ 2) implies TNC; uniform convexity controls the strength of convexity everywhere in the domain.

15 Minimax convex optimization rates. Theorem: If the TNC for convex functions holds, f(x) − f(x*) ≥ λ‖x − x*‖^κ with κ > 1, and f is Lipschitz, then the minimax optimal convex optimization rate over a bounded set (diameter ≤ 1) in d dimensions is ‖x̂_T − x*‖ ≍ T^(−1/(2κ−2)) and f(x̂_T) − f(x*) ≍ T^(−κ/(2κ−2)). Precisely the rates for 1-dim active learning! (Function-error rates: T^(−3/2) for κ = 3/2, T^(−1) for strongly convex κ = 2, T^(−1/2) for general convex, κ → ∞.)

16 Lower bounds based on active learning: inf_x̂ sup_{f ∈ S, O} E[‖x̂ − x*_f‖] = Ω(T^(−1/(2κ−2))). Setup: S = [0, 1]^d ∩ {‖x‖ ≤ 1}; oracle O: f̂(x) ~ N(f(x), σ²), ĝ(x) ~ N(g(x), σ² I_d). Two candidate functions: f_0(x) = c_1 Σ_{i=1}^d |x_i|^κ, and f_1(x) = c_1(|x_1 − 2a|^κ + Σ_{i=2}^d |x_i|^κ) + c_2 for x_1 ≤ 4a, f_1(x) = f_0(x) otherwise. (Figure: f_0 and f_1 agree for x_1 ≥ 4a; f_1's minimizer sits at x_1 = 2a.) Observation distributions: P_0 = P({X_i, f_0(X_i), g_0(X_i)}_{i=1}^T), P_1 = P({X_i, f_1(X_i), g_1(X_i)}_{i=1}^T).

17 Lower bounds based on active learning: inf_x̂ sup_f E[‖x̂ − x*_f‖] = Ω(T^(−1/(2κ−2))). Fano's inequality: if KL(P_0, P_1) ≤ constant, then inf_x̂ sup_f P(‖x̂ − x*_f‖ > ‖x*_{f_0} − x*_{f_1}‖/2) ≥ constant. KL(P_0, P_1) is controlled by the query that yields the maximum difference between the function/gradient values of f_0 and f_1 [Castro-Nowak 07], and is ≤ constant provided ‖x*_{f_0} − x*_{f_1}‖/2 = a ≍ T^(−1/(2κ−2)).
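For completeness, here is a sketch of why this choice of a keeps the KL divergence constant, using the Gaussian oracle and the construction from the previous slide (constants suppressed).

```latex
KL(P_0, P_1)
  \;=\; \sum_{i=1}^{T} \mathbb{E}\,
        \frac{\big(f_0(X_i)-f_1(X_i)\big)^2 + \|g_0(X_i)-g_1(X_i)\|^2}{2\sigma^2}
  \;\le\; \frac{T}{2\sigma^2}\,
        \max_{x}\Big[\big(f_0(x)-f_1(x)\big)^2 + \|g_0(x)-g_1(x)\|^2\Big]
  \;\lesssim\; \frac{T\,a^{2\kappa-2}}{\sigma^2},
```

since f_0 and f_1 differ only on the slab x_1 ≤ 4a, where their gradients differ by at most order a^(κ−1). Hence KL(P_0, P_1) stays below a constant once a ≍ T^(−1/(2κ−2)), while the two minimizers are 2a apart, which forces any estimator to incur error Ω(a) = Ω(T^(−1/(2κ−2))) against one of f_0, f_1.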

18 Lower bounds based on active learning (continued): inf_x̂ sup_f E[‖x̂ − x*_f‖] = Ω(T^(−1/(2κ−2))). The same Fano argument (if KL(P_0, P_1) ≤ constant, then inf_x̂ sup_f P(‖x̂ − x*_f‖ > ‖x*_{f_0} − x*_{f_1}‖/2) ≥ constant), with KL(P_0, P_1) bounded via the query that yields the maximum difference between function/gradient values [Castro-Nowak 07], also yields lower bounds for uniformly convex functions and for the zeroth-order oracle, matching [Iouditski-Nesterov 10, Jamieson-Nowak-Recht 12].

19 Epoch-based gradient descent. Initialize e = 1, the starting point x_1^1, the epoch length T_1, the radius R_1, and the step size; until the oracle budget T is exhausted, run projected gradient descent within the current ball, then update R_{e+1} and the step size and set e ← e + 1 (this update requires knowledge of κ). If f is a convex function that satisfies TNC(κ) and is Lipschitz: inf_x̂ sup_{f,O} E[‖x̂ − x*_f‖] = Õ(T^(−1/(2κ−2))) and inf_x̂ sup_{f,O} E[f(x̂) − f(x*)] = Õ(T^(−κ/(2κ−2))).
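A minimal sketch of such an epoch scheme, assuming κ is known; the epoch lengths, step sizes, and the shrink factor R_{e+1} = R_e · 2^(−1/(κ−1)) are illustrative placeholders rather than the constants from the talk.

```python
import numpy as np

def epoch_gd(noisy_grad, d, T, kappa=2.0, R1=1.0, G=1.0):
    """Epoch-based projected GD: averaged projected gradient steps inside a ball
    of radius R_e around the current center, then shrink the ball and repeat."""
    center = np.zeros(d)
    R, spent, e = R1, 0, 1
    while spent < T:
        T_e = min(2 ** e * 100, T - spent)          # epoch budget (placeholder schedule)
        eta = R / (G * np.sqrt(T_e))                # step size for convex Lipschitz f
        x, x_avg = center.copy(), np.zeros(d)
        for t in range(T_e):
            x = x - eta * noisy_grad(x)
            diff = x - center                       # project back onto B(center, R)
            nrm = np.linalg.norm(diff)
            if nrm > R:
                x = center + R * diff / nrm
            x_avg += (x - x_avg) / (t + 1)          # running average of iterates
        center = x_avg
        R = R / 2 ** (1.0 / (kappa - 1.0))          # shrink rate uses knowledge of kappa
        spent += T_e
        e += 1
    return center

# Example: f(x) = 0.5 * ||x - c||^2 (TNC with kappa = 2), noisy gradients
c = np.array([0.4, -0.1])
rng = np.random.default_rng(3)
g_hat = lambda x: (x - c) + 0.2 * rng.standard_normal(x.shape)
print(epoch_gd(g_hat, d=2, T=20000, kappa=2.0))
```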

20 Adapting to the degree of convexity

21 Adapting to the degree of convexity. For e = 1, ..., E = log √(T/log T) (ignoring κ): run any optimization procedure that is optimal for convex functions with sample budget T_e = T/E; then set R_{e+1} = R_e/2, e ← e + 1. Adapted from [Iouditski-Nesterov 10]. Since λ‖x_e − x*_e‖^κ ≤ f(x_e) − f(x*_e) ≲ R_e/√T_e (the rate for convex Lipschitz functions over a ball of radius R_e), there exists ē such that ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)) and x*_ē = x*.

22 Adapting to the degree of convexity (same procedure; ∃ ē s.t. ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)), x*_ē = x*). Figure: epoch 1, search ball of radius R_1 centered at x_1, containing x*_1 = x*.

23 Adapting to the degree of convexity (same procedure; ∃ ē s.t. ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)), x*_ē = x*). Figure: epoch 2, search ball of radius R_2 centered at x_2, containing x*_2 = x*.

24 Adapting to the degree of convexity (same procedure; ∃ ē s.t. ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)), x*_ē = x*). Figure: epoch ē, ball of radius R_ē around x_ē, with ‖x_ē − x*‖ ≲ T^(−1/(2κ−2)).

25 Adapting to the degree of convexity (same procedure; ∃ ē s.t. ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)), x*_ē = x*). Moreover, for all e ≥ ē, ‖x_e − x_ē‖ ≲ T^(−1/(2κ−2)). Figure: later epochs e > ē, with iterates x_e staying within T^(−1/(2κ−2)) of x_ē.

26 Adapting to the degree of convexity: guarantee. Same procedure (E = log √(T/log T) epochs, each running a convex-optimal procedure with budget T_e = T/E, then R_{e+1} = R_e/2). If f is a convex function that satisfies TNC(κ) and is Lipschitz: inf_x̂ sup_{f,O} E[‖x̂ − x*_f‖] = Õ(T^(−1/(2κ−2))) and inf_x̂ sup_{f,O} E[f(x̂) − f(x*)] = Õ(T^(−κ/(2κ−2))).
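A sketch of the κ-free wrapper: equal budget per epoch, halve the radius each epoch, and delegate each epoch to any procedure that is optimal for convex Lipschitz functions (here a projected averaged SGD supplied by the caller). The epoch count below approximates the log √T-type schedule on the slide; the rest is illustrative.

```python
import numpy as np

def adaptive_epochs(convex_opt_in_ball, dim, T):
    """kappa-free wrapper: R_{e+1} = R_e / 2, T_e = T / E, E ~ log sqrt(T) epochs."""
    E = max(1, int(0.5 * np.log2(T)))     # roughly log sqrt(T) epochs, independent of kappa
    T_e = max(1, T // E)
    center, R = np.zeros(dim), 1.0
    for _ in range(E):
        # any optimal convex-Lipschitz procedure, restricted to the ball B(center, R)
        center = convex_opt_in_ball(center, R, T_e)
        R /= 2.0
    return center

# Example epoch subroutine: projected averaged SGD on f(x) = 0.5 * ||x - c_true||^2
c_true = np.array([0.3, -0.4])
rng = np.random.default_rng(4)

def convex_opt_in_ball(center, R, n):
    x, avg = center.copy(), np.zeros_like(center)
    eta = R / np.sqrt(n)
    for t in range(n):
        g = (x - c_true) + 0.3 * rng.standard_normal(x.shape)   # noisy gradient
        x = x - eta * g
        diff = x - center
        if np.linalg.norm(diff) > R:                            # project onto B(center, R)
            x = center + R * diff / np.linalg.norm(diff)
        avg += (x - avg) / (t + 1)
    return avg

print(adaptive_epochs(convex_opt_in_ball, dim=2, T=20000))
```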

27 Adaptive active learning

28 Adaptive 1-dim active learning: Robust Binary Search, adaptive to κ. For e = 1, ..., E = log √(T/log T) (ignoring κ): do passive learning on the current interval with sample budget T_e = T/E; then set R_{e+1} = R_e/2, e ← e + 1. Adapted from [Iouditski-Nesterov 10].

29 Adaptive 1-dim active learning (same procedure). Since c‖x_e − x*_e‖^κ ≤ Risk(x_e) − Risk(x*_e) ≲ R_e/√T_e (the passive rate for threshold classifiers), there exists ē such that ‖x_ē − x*_ē‖ ≲ T^(−1/(2κ−2)) and x*_ē = x*.

30 Adaptive 1-dim active learning (same procedure). Moreover, for all e ≥ ē, ‖x_e − x_ē‖ ≲ T^(−1/(2κ−2)), with x*_ē = x*. Much simpler than [Hanneke 09].
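A toy version of the robust-binary-search idea: each epoch spends T_e passive (uniformly sampled) label queries on the current interval, fits the best threshold classifier by empirical risk, and then halves the interval around it. The details below are illustrative assumptions, not the exact procedure from the talk.

```python
import numpy as np

def robust_binary_search(label_oracle, T, lo=0.0, hi=1.0, rng=None):
    """Each epoch: passive threshold learning on [lo, hi], then halve the interval."""
    rng = rng or np.random.default_rng(0)
    E = max(1, int(0.5 * np.log2(T)))          # roughly log sqrt(T) epochs
    T_e = max(1, T // E)
    for _ in range(E):
        xs = rng.uniform(lo, hi, size=T_e)     # passive (uniform) sample on the interval
        ys = np.array([label_oracle(x) for x in xs])
        order = np.argsort(xs)
        xs, ys = xs[order], ys[order]
        # empirical risk of threshold classifiers "predict 1 iff x >= threshold":
        # errors[k] = (# of 1-labels among first k points) + (# of 0-labels among the rest)
        ones_left = np.concatenate(([0], np.cumsum(ys == 1)))
        zeros_right = (ys == 0).sum() - np.concatenate(([0], np.cumsum(ys == 0)))
        errors = ones_left + zeros_right
        k = int(np.argmin(errors))
        t_hat = lo if k == 0 else (hi if k == T_e else 0.5 * (xs[k - 1] + xs[k]))
        half = (hi - lo) / 4.0                 # halve the interval width around t_hat
        lo, hi = t_hat - half, t_hat + half
    return 0.5 * (lo + hi)

# Example: P(Y=1 | x) = clip(0.5 + (x - 0.37), 0, 1), decision boundary at 0.37
rng = np.random.default_rng(5)
oracle = lambda x: int(rng.random() < np.clip(0.5 + (x - 0.37), 0, 1))
print(robust_binary_search(oracle, T=20000, rng=rng))
```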

31 Reference & future directions. A. Ramdas and A. Singh, "Optimal rates for stochastic convex optimization under Tsybakov noise condition", to appear at ICML (available on arXiv). Open directions: Reduction from d-dim convex optimization to 1-dim active learning for κ-TNC functions (κ ≠ 2)? Adaptive d-dimensional active learning / model selection in active learning? Porting active learning results to yield non-convex optimization guarantees?
