Online (and Distributed) Learning with Information Constraints. Ohad Shamir


Ohad Shamir, Weizmann Institute of Science
Online Algorithms and Learning Workshop, Leiden, November 2014

Main Question
How well can we learn with information constraints on how we can interact with data?

Partial-Information Constraints
Only part of the data is visible to the learning algorithm.
Examples:
- A few coordinates from each example: multi-armed bandits and variants, attribute-efficient learning, missing features...
- A linear projection of each example: bandit linear optimization
- Some other function of each example: partial monitoring, bandit convex optimization...
Price of partial information: known in some cases, setting-dependent.

Memory Constraints
Example (Online Convex Optimization): consider the squared loss or exp-concave losses.
- O(log T) regret is possible, but the known algorithms require Ω(d²) memory [Vovk 1998; Azoury and Warmuth 2001; Hazan et al. 2007]
- O(√T) regret is possible with O(d)-memory algorithms (e.g. online gradient descent)
Is there a price to pay for linear-memory methods?
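To illustrate the O(d)-memory side of this trade-off, here is a minimal sketch of projected online gradient descent for the squared loss; the step size, radius, and data-stream interface are illustrative choices, not taken from the talk. The logarithmic-regret algorithms cited above maintain a d×d matrix, which is where the Ω(d²) memory comes from.

```python
import numpy as np

def online_gradient_descent(stream, d, radius=1.0):
    """O(d)-memory online learner: stores only the current iterate w.

    `stream` yields (a_t, y_t) pairs; the round-t loss is (<w, a_t> - y_t)^2.
    With bounded data and a ~1/sqrt(t) step size, regret grows as O(sqrt(T)).
    """
    w = np.zeros(d)
    total_loss = 0.0
    for t, (a, y) in enumerate(stream, start=1):
        pred = w @ a
        total_loss += (pred - y) ** 2
        grad = 2.0 * (pred - y) * a        # gradient of the squared loss at w
        w -= grad / np.sqrt(t)             # step size ~ 1/sqrt(t)
        norm = np.linalg.norm(w)
        if norm > radius:                  # project back onto the ball of radius `radius`
            w *= radius / norm
    return w, total_loss
```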

Memory Constraints
Example (Online PCA): i.i.d. data stream x_1, ..., x_m in R^d.
- Goal: find the direction(s) with most variance
- Solution: consider the empirical covariance matrix (1/m) Σ_{i=1}^m x_i x_i^T, a d×d matrix that is too large in high dimensions
Can we do online PCA with limited memory? [Warmuth and Kuzmin 2008; Arora et al. 2012; Mitliagkas et al. 2013; Balsubramani et al.]
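For contrast, a common O(d)-memory approach to streaming PCA (related to, though not identical to, the algorithms in the papers cited above) is an Oja-style update of a single direction. A minimal sketch, with an illustrative step-size schedule:

```python
import numpy as np

def oja_streaming_pca(stream, d, lr=1.0, seed=0):
    """Estimate the top principal direction with O(d) memory (Oja's rule).

    `stream` yields vectors x_t in R^d; no d x d covariance matrix is stored.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for t, x in enumerate(stream, start=1):
        w += (lr / t) * (x @ w) * x        # stochastic step on the Rayleigh quotient
        w /= np.linalg.norm(w)             # keep the estimate on the unit sphere
    return w
```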

Communication Constraints
Training examples are partitioned across a distributed system. Communication is (relatively) slow and expensive. Can we learn with small communication complexity?

Main Questions
- Can we quantify, information-theoretically, how such constraints affect our performance?
- Can we trade off between data size and information constraints?
[Figure: trade-off plot of memory/communication/information per example vs. sample complexity (data size), showing the feasible region]

Related Work
- Problem-specific lower bounds (e.g. multi-armed bandits)
- Learning with communication constraints [e.g. Balcan et al. 2012, Zhang et al. 2013]: non-i.i.d. data or a very small communication budget
- Communication/space bounds in theoretical computer science, but not for learning problems and/or with non-i.i.d. data
- Statistical estimation with limited memory [e.g. Hellman and Cover 1970, Leighton and Rivest 1986, Ertin and Potter 2003, Kontorovich 2012]: asymptotic / small-memory regime

(b, n, m) Protocols
Definition:
For t = 1, ..., m:
- Receive n i.i.d. instances X_t
- Compute W_t = f_t(X_t, W_1, W_2, ..., W_{t-1}), where W_t is at most b bits
Return f(W_1, ..., W_m)
[Figure: each batch X_t is compressed into a b-bit message W_t, which the learning algorithm sees together with W_1, ..., W_{t-1}]
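To make the definition concrete, here is a minimal sketch of the protocol as a driver loop, under the assumption that messages are byte strings; `compress` (the f_t's) and `combine` (the final f) are placeholders supplied by the protocol designer, and the b-bit budget is only checked, not enforced by construction.

```python
def bit_length(message: bytes) -> int:
    """Hypothetical helper: message size in bits, assuming byte-string messages."""
    return 8 * len(message)

def run_protocol(batches, compress, combine, b):
    """Generic (b, n, m) protocol: one round per batch X_t of n i.i.d. instances.

    Each message W_t may depend on X_t and on all previous messages
    W_1, ..., W_{t-1}, but not on the earlier raw batches.
    """
    messages = []                          # W_1, ..., W_{t-1}
    for X_t in batches:
        W_t = compress(X_t, messages)      # W_t = f_t(X_t, W_1, ..., W_{t-1})
        assert bit_length(W_t) <= b, "message exceeds the b-bit budget"
        messages.append(W_t)
    return combine(messages)               # output f(W_1, ..., W_m)
```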

Examples of (b, n, m) Protocols
Bounded-memory online algorithms:
For t = 1, ..., m:
- Receive an i.i.d. instance x_t
- Compute W_t = f_t(x_t, W_{t-1}), where W_t is at most b bits
Return f(W_m)

Examples of (b, n, m) Protocols
Learning from expert advice with partial information:
For t = 1, ..., m:
- A reward vector x_t is generated
- Observe W_t = f_t(x_t, W_1, W_2, ..., W_{t-1}), where W_t is at most b bits
E.g. for multi-armed bandits, W_t = x_{t,i_t}, where the coordinate i_t is selected based on the past observations x_{1,i_1}, ..., x_{t-1,i_{t-1}}.
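A minimal sketch of the multi-armed-bandit instance of this feedback model; the epsilon-greedy index rule is only a stand-in for "i_t selected based on past observations", not an algorithm from the talk, and the full reward matrix is assumed to be given up front purely for simulation.

```python
import numpy as np

def bandit_feedback(rewards, eps=0.1, seed=0):
    """Simulate bandit feedback: only x_{t, i_t} is revealed each round.

    `rewards` is a (T, d) array of reward vectors x_1, ..., x_T in {0, 1}^d;
    the index i_t depends only on previously observed coordinates.
    """
    rng = np.random.default_rng(seed)
    T, d = rewards.shape
    sums, counts = np.zeros(d), np.zeros(d)
    observed = []                                   # the messages W_t
    for t in range(T):
        if counts.min() == 0 or rng.random() < eps:
            i_t = int(rng.integers(d))              # explore
        else:
            i_t = int(np.argmax(sums / counts))     # exploit the empirical means
        w_t = rewards[t, i_t]                       # the only part of x_t observed
        sums[i_t] += w_t
        counts[i_t] += 1
        observed.append((i_t, w_t))
    return observed
```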

Examples of (b, n, m) Protocols
Non-interactive distributed learning:
[Figure: machines 1, ..., m hold batches X_1, ..., X_m and each independently sends a b-bit message W_t to a central learner]

Examples of (b, n, m) Protocols
One-pass distributed learning:
[Figure: machines 1, ..., m hold batches X_1, ..., X_m and are processed in a single pass, each machine producing a b-bit message W_t that is passed along]
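A minimal sketch of the non-interactive case, assuming the task is mean estimation: each machine compresses its local batch into a low-precision local mean (a stand-in for a b-bit message, here b = 16·d bits via float16) and a coordinator averages the messages; the task and the quantization are illustrative choices, not taken from the talk.

```python
import numpy as np

def machine_message(X_local):
    """One machine's message: its local mean stored in float16 (16 * d bits)."""
    return X_local.mean(axis=0).astype(np.float16)

def coordinator(messages):
    """Non-interactive aggregation: average the machines' messages."""
    return np.mean([msg.astype(np.float64) for msg in messages], axis=0)

# Usage: m machines, each holding n i.i.d. samples in R^d.
rng = np.random.default_rng(0)
m, n, d = 8, 100, 5
batches = [rng.normal(size=(n, d)) for _ in range(m)]
estimate = coordinator([machine_message(X) for X in batches])
```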

Hide-and-Seek Problem
- Instances x_t are drawn i.i.d. from a product distribution on {-1, +1}^d
- A single unknown coordinate j ∈ {1, ..., d} is biased: E[x_t] = ρ·e_j
- Goal: given the sampled data, find j

Result for (b, 1, m) Protocols (e.g. bounded memory)
Theorem. Given m i.i.d. instances and bias ρ:
- The biased coordinate can be detected if m ≳ log(d)/ρ²: just return the coordinate J with the highest empirical mean
- Any (b, 1, m) protocol will fail if m ≲ (d/b)/ρ²
Tight up to log factors; can be generalized to n > 1.
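The detector from the first bullet is just the empirical-mean maximizer. A minimal self-contained sketch, with the sampler included; the particular values of d, ρ and the constant in the sample size are illustrative.

```python
import numpy as np

def sample_hide_and_seek(m, d, j, rho, rng):
    """m i.i.d. samples from {-1, +1}^d where only coordinate j has mean rho."""
    p = np.full(d, 0.5)
    p[j] = (1.0 + rho) / 2.0               # P(x_j = +1) = (1 + rho) / 2
    return np.where(rng.random((m, d)) < p, 1, -1)

def detect_biased_coordinate(X):
    """Return the coordinate with the highest empirical mean.

    Succeeds with high probability once m is on the order of log(d) / rho^2
    (the unconstrained upper bound in the theorem).
    """
    return int(np.argmax(X.mean(axis=0)))

rng = np.random.default_rng(0)
d, rho, j = 1000, 0.2, 17
m = int(10 * np.log(d) / rho ** 2)         # sample size matching the upper bound
X = sample_hide_and_seek(m, d, j, rho, rng)
print(detect_biased_coordinate(X) == j)    # True with high probability
```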

Proof Idea
[Figure: the coordinates x_{t,1}, ..., x_{t,d} of instance x_t, with the biased coordinate j among them, are compressed into the b-bit message W_t]
- As long as W_t doesn't know j, it can only convey about b/d bits on x_{t,j} in expectation
- x_{t,j} conveys at most ρ² bits on j
- Standard data processing inequality: W_t conveys at most min{ρ², b/d} bits on j
- We prove an information contraction inequality: W_t conveys at most ρ²·b/d bits on j

Proof Idea
- W_t conveys at most ρ²·b/d bits on j
- After m rounds: at most m·ρ²·b/d bits of information on j
- Insufficient to detect j if m ≲ (d/b)/ρ²

Application: Online Learning with Partial Information
T reward vectors x_1, ..., x_T are chosen from {0, 1}^d, and some b bits of each x_t are sequentially observed.
Theorem: expected regret is Ω(√(d·T/b)).
The result is generic: it holds regardless of which b bits those are:
- The chosen coordinate x_{t,i_t} (multi-armed bandits)
- Some other coordinate x_{t,j_t}, or some subset of coordinates x_{t,j_1}, x_{t,j_2}, ..., x_{t,j_k} (semi-bandit feedback; prediction with limited advice; bandits with side information; attribute-efficient learning...)
- A linear projection of x_t (bandit linear optimization)
- Feedback through some bounded-width feedback matrix (partial monitoring)...
It holds even if the algorithm can choose the b bits based on x_t!

Application: Stochastic/Online Optimization
Theorem. There are stochastic/online optimization problems in R^d (caveat: non-convex, but efficiently solvable) such that:
- It is possible to get Õ(√T) regret
- Any b-memory online algorithm (or b-communication distributed algorithm) has Ω(√(d²·T/b)) regret
Conclusion: any online algorithm with o(d²) memory is suboptimal (e.g. gradient descent, mirror descent).
New memory-sample and communication-sample trade-offs: in a statistical setting, one can get the same average regret with more data, even under memory/communication constraints.

Application: Sparse PCA / Covariance Estimation
Simplest case, detecting correlations: given an i.i.d. sample x_1, ..., x_m, detect a single pair of correlated features.
- Statistically optimal methods use the empirical covariance matrix (1/m) Σ_{i=1}^m x_i x_i^T
- Problem: this requires too much communication/memory when the dimension d and the sample size m are large
- We show: there are situations where no memory/communication-efficient method can be statistically optimal

Application: Sparse PCA / Covariance Estimation
Theorem. Suppose that:
- E[x x^T] = I_d + τ(E_{i,j} + E_{j,i}) for unknown i, j
- Given m samples, for all i ≠ j, Pr(|(1/m) Σ_t x_{t,i} x_{t,j} − E[x_i x_j]| ≥ τ/2) ≤ 2 exp(−m·τ²/6)
Then for τ = Θ(1/d):
- m ≳ d²: enough to return the largest off-diagonal entry of the empirical covariance matrix
- m ≲ d⁴/b: any b-memory online algorithm / (b, n, m) protocol (for appropriate n) will fail on some such distribution
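The first bullet's statistically optimal baseline, sketched below; note that it materializes the full d×d empirical covariance matrix, which is exactly what the memory/communication constraint rules out.

```python
import numpy as np

def detect_correlated_pair(X):
    """Return the pair (i, j), i != j, with the largest off-diagonal entry of
    the empirical covariance matrix (1/m) * sum_t x_t x_t^T.

    Uses O(d^2) memory; with tau = Theta(1/d) it succeeds once m is on the
    order of d^2 samples.
    """
    m, d = X.shape
    C = (X.T @ X) / m                     # d x d empirical covariance matrix
    np.fill_diagonal(C, -np.inf)          # ignore the diagonal
    i, j = np.unravel_index(np.argmax(C), C.shape)
    return int(i), int(j)
```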

Summary
A generic framework for information constraints in learning. The same results apply to:
- Different information constraints (memory, communication, partial information...)
- Different learning settings (multi-armed bandits and variants, stochastic/online optimization, sparse PCA and covariance estimation...)
New resource trade-offs:
- More data, less memory
- More data, less communication (in a natural regime)
Information trade-offs for other learning settings?

More details: "Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation", on arXiv and in NIPS 2014. Thanks!
