Machine Learning in the Data Revolution Era


Slide 1: Machine Learning in the Data Revolution Era
Shai Shalev-Shwartz, School of Computer Science and Engineering, The Hebrew University of Jerusalem
Machine Learning Seminar Series, Google & University of Waterloo, July 2010

Slide 2: 10 Years Ago...
Allwein, Schapire & Singer (2000), which received the Best 10-Year Paper award at ICML 2010.
[Excerpt from the paper: Table 1, "Description of the datasets used in the experiments", listing the number of train/test examples, attributes, and classes for dermatology, satimage, glass, segmentation, ecoli, pendigits, yeast, vowel, soybean, thyroid, audiology, isolet, and letter; the surrounding text discusses the tradeoff between the error-correcting properties of the output codes and the difficulty of the binary learning problems they induce.]

Slide 3: Data Revolution
In 2009, more data was generated by individuals than in the entire history of mankind through...

Slide 4: Discriminative Learning
Document: "If only Tiger Woods could bond with fans as well as he did with his caddie on the Old Course practice range on Monday he might find the road to redemption all the more simple..."
About sport?

Slide 5: Discriminative Learning
Training set $\{(x_i, y_i)\}_{i=1}^m$ → Learning Algorithm (hypothesis set $H$, learning rule) → Output $h : X \to Y$
- $X$ - instance domain (e.g., documents)
- $Y$ - label set (e.g., document category: about sport?)
- Goal: learn a predictor $h : X \to Y$
- $H$ - set of possible predictors
- Learning rule - e.g., find $h \in H$ with minimal error on the training set

Slides 6-9: Large Scale Learning
- 10k training examples, 1 hour, 2.3% error
- 1M training examples, 1 week, 2.29% error
- We can always sub-sample and get an error of 2.3% using 1 hour
- Say we're happy with 2.3% error. What else can we gain from the larger data set?

Slides 10-11: How can more data help?
- Traditional: reduce error
- This talk: reduce runtime; compensate for missing information

Slide 12: Outline
How can more data reduce runtime?
- Learning using Stochastic Optimization (S. & Srebro 2008)
- Injecting Structure (S., Shamir, Sridharan 2010)
How can more data compensate for missing information?
- Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010). Technique: rely on Stochastic Optimization
- Ads Placement (Kakade, S., Tewari 2008). Technique: inject structure by exploration

Slides 13-14: Background on ML: Vapnik's General Learning Setting
Probably Approximately Solve:
$$\min_{h \in H} \ \mathbb{E}_{z \sim D}[f(h, z)]$$
where:
- $D$ is an unknown distribution
- we can only get samples $z_1, \dots, z_m$ drawn i.i.d. from $D$
- $H$ is a known hypothesis class
- $f(h, z)$ is the loss of using hypothesis $h$ on example $z$
Main question studied by Vapnik: how many examples are required to find $\hat{h} = A(z_1, \dots, z_m)$ such that
$$\mathbb{E}_{z \sim D}[f(\hat{h}, z)] - \min_{h \in H} \mathbb{E}_{z \sim D}[f(h, z)] \le \epsilon \quad \text{w.p.} \ge 1 - \delta.$$

Slide 15: Examples
- Document categorization: $z = (x, y)$, where $x$ represents a document and $y \in [k]$ is the topic; $h \in H$ is a predictor $h : X \to [k]$; $f(h, z) = \mathbb{1}[h(x) \ne y]$.
- k-means clustering: $z \in \mathbb{R}^d$; $h \in \mathbb{R}^{k \times d}$ specifies $k$ cluster centers; $f(h, z) = \min_j \|h_j - z\|$.
- Density estimation: $h$ is a parameter of a density $p_h(z)$; $f(h, z) = -\log p_h(z)$.
- Ads placement: $z = (x, c)$, where $x$ represents the context and $c \in [0,1]^k$ represents the gain of placing each ad; $h : X \to S_k$ returns the probability of placing each ad; $f(h, z) = \langle h(x), c \rangle$ is the expected gain of following policy $h$.
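
A minimal code sketch (not from the talk) of the four losses above, using NumPy; the function and argument names are illustrative, and the k-means loss is written with the squared distance.

    import numpy as np

    def categorization_loss(h, x, y):
        # Document categorization: 0-1 loss; h(x) is the predicted topic in {0, ..., k-1}
        return float(h(x) != y)

    def kmeans_loss(centers, z):
        # k-means: (squared) distance from z to the nearest of the k cluster centers
        return float(np.min(np.sum((centers - z) ** 2, axis=1)))

    def density_loss(log_density, z):
        # Density estimation: negative log-likelihood of z under the model p_h
        return -log_density(z)

    def ads_gain(policy, x, c):
        # Ads placement: policy(x) is a probability vector over the k ads and c their gains;
        # <policy(x), c> is the expected gain of following the policy (a gain, not a loss)
        return float(np.dot(policy(x), c))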

Slides 16-17: Vapnik's Learning Rule
Empirical Risk Minimization (ERM): return a hypothesis with the smallest risk on the training examples:
$$\hat{h} \in \operatorname*{argmin}_{h \in H} \ \frac{1}{m} \sum_{i=1}^m f(h, z_i)$$
Sample complexity: if $m \ge \Omega\!\left(\frac{C(H) \log(1/\delta)}{\epsilon^2}\right)$, where $C(H)$ is some complexity measure, then $\hat{h}$ is Probably Approximately Correct.
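
As a concrete illustration (not from the slides), here is a minimal ERM sketch over a finite hypothesis class, using threshold classifiers on [0,1] and the 0-1 loss; all names are illustrative.

    import numpy as np

    def erm(hypotheses, loss, samples):
        # Empirical Risk Minimization: return the hypothesis with the
        # smallest average loss on the training samples
        risks = [np.mean([loss(h, z) for z in samples]) for h in hypotheses]
        return hypotheses[int(np.argmin(risks))]

    # Toy usage: learn a threshold on the real line with the 0-1 loss
    hypotheses = [lambda x, t=t: int(x >= t) for t in np.linspace(0.0, 1.0, 101)]
    zero_one = lambda h, z: float(h(z[0]) != z[1])
    rng = np.random.default_rng(0)
    samples = [(x, int(x >= 0.3)) for x in rng.random(200)]
    best = erm(hypotheses, zero_one, samples)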

Slide 18: Example: Support Vector Machines
SVM is a very popular Machine Learning algorithm.

Slide 19: Support Vector Machines
SVM learning rule: given samples $(x_1, y_1), \dots, (x_m, y_m)$, solve
$$\min_w \ \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \max\{0, 1 - y_i \langle w, x_i \rangle\}$$
Casting it in the General Learning Setting:
- $z = (x, y)$, $h = w$
- $f(h, z) = \max\{0, 1 - y \langle w, x \rangle\} + \frac{\lambda}{2} \|w\|^2$
The SVM learning rule is exactly ERM for the above loss function.
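
A small sketch of the objective above as code (assumed names, not code from the talk): the rows of X are the examples x_i and y holds labels in {-1, +1}.

    import numpy as np

    def svm_objective(w, X, y, lam):
        # (lam/2) * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i * <w, x_i>)
        margins = y * (X @ w)
        hinge = np.maximum(0.0, 1.0 - margins)
        return 0.5 * lam * float(np.dot(w, w)) + float(hinge.mean())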

Slides 20-21: Solving the SVM problem
- The SVM learning rule can be rewritten as a Quadratic Optimization Problem.
- Solvable in polynomial time by standard solvers. Dedicated solvers also exist.
- End of story? Not quite... The runtime of standard solvers increases with $m$. What if $m$ is very large?

Slide 22: Machine Learning Perspective on Optimization
- Our real goal is not to minimize the SVM optimization problem; that problem is just a proxy for the real goal: find a hypothesis with small error on unseen data.
- We should study the runtime required to solve the Learning Problem:
  - $T(m, \epsilon)$ - runtime required to find an $\epsilon$-accurate solution w.r.t. unseen data when the number of training examples available is $m$
  - $T(\infty, \epsilon)$ - data-laden analysis: data is plentiful and computational complexity is the real bottleneck

Slides 23-24: ERM vs. Stochastic Optimization
Goal: $\min_h \mathbb{E}_{z \sim D}[f(h, z)]$
- ERM (traditional): $\min_h \frac{1}{m} \sum_{i=1}^m f(h, z_i)$
- SGD: $h \leftarrow h - \eta \nabla f(h, z_i)$
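
A minimal sketch of the SGD rule above (my own illustration, not code from the talk): each iteration draws a fresh example and takes a small step against the gradient of the loss on that single example.

    import numpy as np

    def sgd(grad_f, draw_example, w0, eta, steps, seed=0):
        # Stochastic Gradient Descent: w <- w - eta * grad_f(w, z), with a fresh z each step
        rng = np.random.default_rng(seed)
        w = np.array(w0, dtype=float)
        for _ in range(steps):
            z = draw_example(rng)
            w -= eta * grad_f(w, z)
        return w

    # Toy usage: least-squares regression, z = (x, y), loss (1/2) * (<w, x> - y)^2
    def draw(rng):
        x = rng.normal(size=3)
        return x, np.dot([1.0, -2.0, 0.5], x)   # y = <w*, x> with w* = (1, -2, 0.5)

    grad = lambda w, z: (np.dot(w, z[0]) - z[1]) * z[0]
    w_hat = sgd(grad, draw, np.zeros(3), eta=0.05, steps=5000)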

Slide 25: Data-Laden Analysis of SVM Solvers

Method                        | Optimization runtime    | T(∞, ε)
Interior Point                | m^3.5 · log log(1/ε)    | 1/ε^7
Dual decomposition            | m^2 · log(1/ε)          | 1/ε^4
SVM-Perf (Joachims 06)        | m/(λε)                  | 1/ε^4
SGD (S., Srebro, Singer 07)   | 1/(λε)                  | 1/ε^2

The best optimization method is the worst learning method...

Slide 26: More Data, Less Work
[Schematic plots of runtime vs. training set size for Dual Decomposition, SVM-Perf, and SGD (Pegasos).]

Slide 27: More Data, Less Work
[Theoretical plot and empirical plot (CCAT): runtime / millions of iterations (proportional to runtime) vs. training set size, marking the sample-complexity and data-laden regimes.]
Intuition: it is better to do simple operations on more examples than complicated operations on fewer examples.
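
A sketch of an SGD solver in the spirit of Pegasos mentioned on slide 26 (the projection step of the published algorithm is omitted, and the names are mine): each iteration touches a single example, so the per-iteration cost does not grow with m.

    import numpy as np

    def pegasos_style_sgd(X, y, lam, steps, seed=0):
        # SGD on the SVM objective with step size 1/(lam * t) and one example per step
        rng = np.random.default_rng(seed)
        m, d = X.shape
        w = np.zeros(d)
        for t in range(1, steps + 1):
            i = rng.integers(m)
            eta = 1.0 / (lam * t)
            margin_violated = y[i] * (X[i] @ w) < 1.0   # hinge loss active on example i?
            w *= (1.0 - eta * lam)                      # gradient of the regularization term
            if margin_violated:
                w += eta * y[i] * X[i]                  # hinge subgradient is -y_i * x_i
        return w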

Slide 28: Outline
How can more data reduce runtime?
- Learning using Stochastic Optimization (S. & Srebro 2008)
- Injecting Structure (S., Shamir, Sridharan 2010)
How can more data compensate for missing information?
- Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010). Technique: rely on Stochastic Optimization
- Ads Placement (Kakade, S., Tewari 2008). Technique: inject structure by exploration

Slide 29: Injecting Structure
Main idea: replace the original hypothesis class with a larger hypothesis class.
- On one hand: the larger class has more structure, so it is easier to optimize.
- On the other hand: a larger class means larger estimation error, so we need more examples.
[Diagram: the original hypothesis class nested inside the new, larger hypothesis class.]

Slides 30-35: Example - Learning 3-DNF
- Goal: learn a Boolean function $h : \{0,1\}^d \to \{0,1\}$
- Hypothesis class: 3-term DNF formulae, e.g. $h(x) = (x_1 \wedge x_3 \wedge x_7) \vee (x_4 \wedge x_2) \vee (x_5 \wedge x_9)$
- Sample complexity is of order $d/\epsilon$
- Kearns & Vazirani: if RP ≠ NP, it is not possible to efficiently find an $\epsilon$-accurate 3-DNF formula
- Claim: if $m \approx d^3/\epsilon$, it is possible to find a predictor with error $\le \epsilon$ in polynomial time

Slide 36: Proof
- Observation: a 3-term DNF formula $T_1 \vee T_2 \vee T_3$ can be rewritten as $\bigwedge_{u \in T_1, v \in T_2, w \in T_3} (u \vee v \vee w)$ for three sets of literals $T_1, T_2, T_3$.
- Define $\psi : \{0,1\}^d \to \{0,1\}^{2(2d)^3}$ such that for each triplet of literals $u, v, w$ there are two variables indicating whether $u \vee v \vee w$ is true or false.
- Observation: each 3-DNF can be represented as a single conjunction over $\psi(x)$.
- It is easy to learn a single conjunction (greedy or LP).
- [Diagram: 3-DNF formulae over $x$ sit inside the class of conjunctions over $\psi(x)$.]
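
To make the reduction concrete, here is a small sketch of my own (it assumes one indicator per triplet rather than the two per triplet on the slide, and learns the conjunction with the classic elimination algorithm rather than an LP):

    import numpy as np
    from itertools import product

    def psi(x):
        # Indicators of 'u OR v OR w' for every triplet of literals of x
        x = np.asarray(x, dtype=int)
        lits = np.concatenate([x, 1 - x])              # the 2d literals: x_j and NOT x_j
        return np.array([int(bool(u) or bool(v) or bool(w))
                         for u, v, w in product(lits, repeat=3)])

    def learn_conjunction(examples):
        # Elimination algorithm: keep only the features that are 1 on every positive example
        keep = None
        for phi, label in examples:
            if label == 1:
                keep = phi.copy() if keep is None else keep & phi
        if keep is None:                               # no positive example was seen
            keep = np.ones_like(examples[0][0])
        return lambda phi: int(np.all(phi[keep == 1] == 1))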

Slide 37: Trading samples for runtime

Algorithm               | samples | runtime
3-DNF over x            | d/ε     | 2^d
Conjunction over ψ(x)   | d³/ε    | poly(d)

[Schematic plot: runtime vs. m for the 3-DNF learner and the conjunction learner.]

Slide 38: More complicated example: Agnostic learning of Halfspaces with 0-1 loss
[Schematic plot: runtime vs. m, comparing Ben-David & Simon (2000) with S., Shamir, Sridharan (2010).]

Slide 39: Outline
How can more data reduce runtime?
- Learning using Stochastic Optimization (S. & Srebro 2008)
- Injecting Structure (S., Shamir, Sridharan 2010)
How can more data compensate for missing information?
- Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010). Technique: rely on Stochastic Optimization
- Ads Placement (Kakade, S., Tewari 2008). Technique: inject structure by exploration

Slide 40: Attribute efficient regression
- Each training example is a pair $(x, y) \in \mathbb{R}^d \times \mathbb{R}$
- Partial information: we can only view $O(1)$ attributes of each individual example

Slide 41: Sponsored Advertisement (contextual multi-armed bandit)
- Each training example is a pair $(x, c) \in \mathbb{R}^d \times [0,1]^k$
- Partial information: we don't know $c$; we can only guess some $y$ and then observe the value of $c_y$

Slide 42: How does more data help?
Three main techniques:
1. Missing information as noise
2. Active Exploration - try to fish out the relevant information
3. Inject structure - the problem is hard in the original representation but becomes simple in another representation (a different hypothesis class)
More data helps because:
1. It reduces variance - compensating for the noise
2. It allows more exploration
3. It compensates for the larger sample complexity due to using larger hypothesis classes

Slide 43: Attribute efficient regression
Formal problem statement:
- Unknown distribution $D$ over $\mathbb{R}^d \times \mathbb{R}$
- Goal: learn a linear predictor $x \mapsto \langle w, x \rangle$ with low risk $L_D(w) = \mathbb{E}_D[(\langle w, x \rangle - y)^2]$
- Training set: $(x_1, y_1), \dots, (x_m, y_m)$
- Partial information: for each $(x_i, y_i)$, the learner can view only $k$ attributes of $x_i$
- Active selection: the learner can choose which $k$ attributes to see
- Similar to "Learning with restricted focus of attention" (Ben-David & Dichterman 98)

Slide 44: Dealing with missing information
- Usually difficult - there are exponentially many ways to complete the missing information
- Popular approach: Expectation Maximization (EM)
- Previous methods usually do not come with guarantees (neither sample complexity nor computational complexity)

Slides 45-46: Ostrich approach
- Simply set the missing attributes to a default value
- Use your favorite full-information regression algorithm, e.g.
$$\min_w \ \sum_{i=1}^m (\langle w, x_i \rangle - y_i)^2 + \lambda \|w\|_p^p$$
  - Ridge Regression: $p = 2$; Lasso: $p = 1$
- Efficient. Correct? Sample complexity? How to choose the attributes?
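
For concreteness, a minimal sketch of this baseline (assumed names, my own illustration): the unobserved attributes are already filled with the default value 0 in X_filled, and ordinary ridge regression is run on the filled-in matrix.

    import numpy as np

    def ostrich_ridge(X_filled, y, lam):
        # Ridge regression on the zero-filled design matrix:
        # w = argmin_w sum_i (<w, x_i> - y_i)^2 + lam * ||w||_2^2
        m, d = X_filled.shape
        return np.linalg.solve(X_filled.T @ X_filled + lam * np.eye(d), X_filled.T @ y)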

Slide 47: Partial information as noise
Observation: $x = (x_1, x_2, \dots, x_d)^\top = \frac{1}{d} \sum_{j=1}^d (d\, x_j\, e_j)$, where $e_j$ is the $j$-th standard basis vector.
Therefore, choosing $i$ uniformly at random gives $\mathbb{E}_i[d\, x_i\, e_i] = x$.
If $\|x\| \le 1$ then $\|d\, x_i\, e_i\| \le d$ (i.e., the variance has increased).
We have reduced missing information to unbiased noise, and many examples can compensate for the added noise.
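
A quick numerical check of the observation above (illustrative code, not from the talk): averaging many one-attribute estimates d * x_i * e_i recovers x.

    import numpy as np

    def one_attribute_estimate(x, rng):
        # Observe a single uniformly chosen coordinate i and return d * x_i * e_i
        d = len(x)
        i = rng.integers(d)
        est = np.zeros(d)
        est[i] = d * x[i]
        return est

    rng = np.random.default_rng(0)
    x = rng.random(5)
    avg = np.mean([one_attribute_estimate(x, rng) for _ in range(100000)], axis=0)
    # avg is close to x: the estimate is unbiased, though a single estimate can have norm up to d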

Slide 48: Many examples compensate for noise
- True goal: minimize over $w$ the true risk $L_D(w) = \mathbb{E}_{(x,y)}[(\langle w, x \rangle - y)^2]$
- The loss on one example, $(\langle w, x \rangle - y)^2$, gives an unbiased estimate of $L_D(w)$
- Averaging the loss over many examples reduces the variance
- In our case, we construct an unbiased estimate of the loss of each single example; the variance increases, but many examples still reduce it back

Slides 49-51: Loss-Based Attribute Efficient Regression (LaBAER)
Theorem (Cesa-Bianchi, S., Shamir). Let $\hat{w}$ be the output of LaBAER. Then, with overwhelming probability,
$$L_D(\hat{w}) \le \min_{w : \|w\|_1 \le B} L_D(w) + \tilde{O}\!\left(\frac{d^2 B^2}{\sqrt{m}}\right),$$
where $d$ is the dimension and $m$ is the number of examples.
A factor of $d^4$ more examples compensates for the missing information! Can we do better?

Slide 52: Pegasos Attribute Efficient Regression (PAER)
- The factor $d^2$ in the bound arises because we estimate a matrix $x x^\top$
- Can we avoid estimating the matrix? Yes! Estimate the gradient of the loss instead of the loss
- Follow the Stochastic Gradient Descent approach
- This leads to a much better bound
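
A rough sketch of the gradient-estimation idea (my own illustration, not the published PAER algorithm, which samples attributes more carefully): the gradient of the squared loss is 2(<w, x> - y) x, and two independently sampled attributes give an unbiased estimate of it.

    import numpy as np

    def gradient_estimation_step(w, x, y, eta, rng):
        # Attribute-efficient SGD step: observe only two coordinates of x
        d = len(w)
        i, j = rng.integers(d), rng.integers(d)           # sampled independently
        inner_est = d * x[i] * w[i]                       # unbiased estimate of <w, x>
        grad_est = np.zeros(d)
        grad_est[j] = 2.0 * (inner_est - y) * d * x[j]    # unbiased estimate of 2(<w,x> - y) x
        return w - eta * grad_est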

Slide 53: Demonstration
- Full information classifiers (top line): error of 1.5%
- Our algorithms (bottom line): error of 4%

Slide 54: Demonstration
[Plots of test regression error vs. number of examples and vs. number of observed features, comparing Ridge Regression, Lasso, PAER, and LaBAER; PAER and LaBAER observe 4 pixels per example, while Ridge and Lasso observe all 768 pixels per example.]

Slide 55: Ads Placement: More data improves runtime
- Partial supervision
- Technique: Active Exploration
- A larger regret can be compensated for by having more examples
[Schematic plot: runtime vs. m for Halving and the Banditron.]
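
To illustrate the active-exploration idea in code (a generic epsilon-greedy sketch of my own; the slides' actual algorithm is the Banditron): occasionally pick a random ad just to observe its gain, otherwise follow the current model.

    import numpy as np

    def choose_ad(score, x, k, eps, rng):
        # Epsilon-greedy exploration for a contextual bandit: with probability eps
        # explore a uniformly random ad, otherwise exploit the highest-scoring ad.
        # Only the gain c[chosen] of the chosen ad is then revealed as feedback.
        if rng.random() < eps:
            return int(rng.integers(k))
        return int(np.argmax(score(x)))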

Slide 56: Summary
- Learning theory: many examples → smaller error
- This work: many examples → higher efficiency; compensating for missing information
- Techniques: 1. Stochastic optimization 2. Inject structure 3. Missing information as noise 4. Active Exploration
