Machine Learning in the Data Revolution Era

Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University of Jerusalem
Machine Learning Seminar Series, Google & University of Waterloo, July 2010

10 Years Ago...
Allwein, Schapire & Singer (2000), received the Best 10-Year Paper award at ICML 2010.

Table 1: Description of the datasets used in the experiments.

Problem        Train   Test   #Attributes   #Classes
dermatology      366      -            34          6
satimage        4435   2000            36          6
glass            214      -             9          7
segmentation    2310      -            19          7
ecoli            336      -             8          8
pendigits       7494   3498            16         10
yeast           1484      -             8         10
vowel            528    462            10         11
soybean          307    376            35         19
thyroid         9172      -            29         20
audiology        226      -            69         24
isolet          6238   1559           617         26
letter         16000   4000            16         26

Excerpt from the paper: "... tradeoff between the redundancy and correcting properties of the output codes and the difficulty of the binary learning problems it induces. The complete code has good error correcting properties; the distance between each pair of rows is ... However, many of the binary ..."

Data Revolution
In 2009, more data was generated by individuals than in the entire history of mankind through 2008...

Discriminative Learning
Document: "If only Tiger Woods could bond with fans as well as he did with his caddie on the Old Course practice range on Monday he might find the road to redemption all the more simple..."
About sport?

Discriminative Learning
Training set $\{(x_i, y_i)\}_{i=1}^m$  ->  Learning algorithm (hypothesis set $\mathcal{H}$, learning rule)  ->  Output $h : \mathcal{X} \to \mathcal{Y}$
  $\mathcal{X}$ - instance domain (e.g. documents)
  $\mathcal{Y}$ - label set (e.g. document category: is it about sport?)
  Goal: learn a predictor $h : \mathcal{X} \to \mathcal{Y}$
  $\mathcal{H}$ - set of possible predictors
  Learning rule - e.g. find $h \in \mathcal{H}$ with minimal error on the training set

Large Scale Learning
10k training examples  ->  1 hour   ->  2.3% error
1M training examples   ->  1 week   ->  2.29% error
Can always sub-sample and get an error of 2.3% using 1 hour.
Say we're happy with 2.3% error. What else can we gain from the larger data set?

How can more data help?
Traditional answer: more data  ->  reduce error
This talk: more data  ->  reduce runtime; compensate for missing information

Outline
How can more data reduce runtime?
  Learning using Stochastic Optimization (S. & Srebro 2008)
  Injecting Structure (S., Shamir & Sridharan 2010)
How can more data compensate for missing information?
  Attribute Efficient Learning (Cesa-Bianchi, S. & Shamir 2010) - Technique: rely on Stochastic Optimization
  Ads Placement (Kakade, S. & Tewari 2008) - Technique: inject structure by exploration

Background on ML: Vapnik's General Learning Setting
Probably Approximately Solve:
    $\min_{h \in \mathcal{H}} \; \mathbb{E}_{z \sim D}[f(h, z)]$
where:
  $D$ is an unknown distribution
  We can only get samples $z_1, \ldots, z_m$ drawn i.i.d. from $D$
  $\mathcal{H}$ is a known hypothesis class
  $f(h, z)$ is the loss of using hypothesis $h$ on example $z$
Main question studied by Vapnik: how many examples are required to find $\hat{h} = A(z_1, \ldots, z_m)$ s.t., with probability at least $1 - \delta$,
    $\mathbb{E}_{z \sim D}[f(\hat{h}, z)] - \min_{h \in \mathcal{H}} \mathbb{E}_{z \sim D}[f(h, z)] \le \epsilon$ ?

Examples
Document categorization: $z = (x, y)$, where $x$ represents a document and $y \in [k]$ is the topic; $h \in \mathcal{H}$ is a predictor $h : \mathcal{X} \to [k]$; $f(h, z) = \mathbb{1}[h(x) \neq y]$
k-means clustering: $z \in \mathbb{R}^d$; $h \in \mathbb{R}^{k \times d}$ specifies $k$ cluster centers; $f(h, z) = \min_j \|h_j - z\|^2$
Density estimation: $h$ is a parameter of a density $p_h(z)$; $f(h, z) = -\log p_h(z)$
Ads placement: $z = (x, c)$, where $x$ represents a context and $c \in [0,1]^k$ represents the gain of placing each ad; $h : \mathcal{X} \to S_k$ returns the probability of placing each ad; $f(h, z) = -\langle h(x), c \rangle$, minus the expected gain of following policy $h$
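To make the abstract template concrete, here is a minimal NumPy sketch of the four losses $f(h, z)$ above; the function names and argument conventions are my own, and the signs follow the minimization convention of the general setting.

    import numpy as np

    def doc_categorization_loss(h, z):
        # z = (x, y); h maps a document x to a label in [k]; 0-1 loss 1[h(x) != y].
        x, y = z
        return float(h(x) != y)

    def kmeans_loss(centers, z):
        # centers: (k, d) array of cluster centers; z: a point in R^d.
        return float(np.min(np.sum((centers - z) ** 2, axis=1)))

    def density_loss(log_p, z):
        # log_p(z) is the model's log-density; the loss is the negative log-likelihood.
        return -log_p(z)

    def ads_loss(h, z):
        # z = (x, c); h(x) is a probability vector over k ads; loss = minus the expected gain.
        x, c = z
        return -float(np.dot(h(x), c))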

Vapnik's Learning Rule
Empirical Risk Minimization (ERM): return a hypothesis with the smallest risk on the training examples:
    $\hat{h} \in \operatorname{argmin}_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m f(h, z_i)$
Sample complexity: if $m \ge \Omega\!\left(\frac{C(\mathcal{H}) \log(1/\delta)}{\epsilon^2}\right)$, where $C(\mathcal{H})$ is some complexity measure, then $\hat{h}$ is Probably Approximately Correct.
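As a toy illustration (not from the talk), the ERM rule over a finite hypothesis class can be written directly; here the class is a small grid of threshold predictors on the real line and the loss is the 0-1 loss.

    import numpy as np

    def erm(hypotheses, loss, examples):
        # Return the hypothesis with the smallest average loss on the training set.
        return min(hypotheses, key=lambda h: np.mean([loss(h, z) for z in examples]))

    # Toy example: threshold predictors on [0, 1] with the 0-1 loss.
    examples = [(0.1, 0), (0.2, 0), (0.6, 1), (0.8, 1)]
    hypotheses = [lambda x, t=t: int(x >= t) for t in np.linspace(0.0, 1.0, 11)]
    zero_one = lambda h, z: float(h(z[0]) != z[1])
    h_hat = erm(hypotheses, zero_one, examples)   # picks a threshold between 0.2 and 0.6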

Example: Support Vector Machines
SVM is a very popular machine learning algorithm.

Support Vector Machines
SVM learning rule: given samples $(x_1, y_1), \ldots, (x_m, y_m)$, solve:
    $\min_w \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max\{0, 1 - y_i \langle w, x_i \rangle\}$
Casting in the General Learning Setting: $z = (x, y)$, $h = w$, $f(h, z) = \max\{0, 1 - y\langle w, x \rangle\} + \frac{\lambda}{2}\|w\|^2$.
The SVM learning rule is exactly ERM for the above loss function.
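Written out in NumPy, the objective above is just a few lines; a sketch, where the data layout (X as an m-by-d matrix, labels in {-1, +1}) and the name lam for the regularization parameter are assumptions of mine.

    import numpy as np

    def svm_objective(w, X, y, lam):
        # X: (m, d) matrix of examples, y: (m,) vector of labels in {-1, +1}.
        margins = y * (X @ w)
        hinge = np.maximum(0.0, 1.0 - margins)        # max{0, 1 - y_i <w, x_i>}
        return lam / 2.0 * np.dot(w, w) + hinge.mean()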

Solving the SVM problem
The SVM learning rule can be rewritten as a quadratic optimization problem, solvable in polynomial time by standard solvers; dedicated solvers also exist.
End of story? Not quite... The runtime of standard solvers increases with $m$. What if $m$ is very large?

Machine Learning Perspective on Optimization
Our real goal is not to minimize the SVM optimization problem; the SVM optimization problem is just a proxy for achieving the real goal: find a hypothesis with small error on unseen data.
Should study the runtime required to solve the Learning Problem:
  $T(m, \epsilon)$ - runtime required to find an $\epsilon$-accurate solution w.r.t. unseen data when the number of training examples available is $m$
  $T(\infty, \epsilon)$ - "data laden" analysis: data is plentiful and computational complexity is the real bottleneck

ERM vs. Stochastic Optimization
Goal: $\min_h \mathbb{E}_{z \sim D}[f(h, z)]$
ERM (traditional): $\min_h \frac{1}{m}\sum_{i=1}^m f(h, z_i)$
SGD: $h \leftarrow h - \eta \nabla f(h, z_i)$
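For the SVM loss above, the single SGD step $h \leftarrow h - \eta \nabla f(h, z_i)$ becomes the Pegasos-style update sketched below; the step-size schedule and number of passes are illustrative choices, and the optional projection step of Pegasos is omitted.

    import numpy as np

    def sgd_svm(X, y, lam, epochs=1, seed=0):
        # X: (m, d) examples, y: (m,) labels in {-1, +1}; returns the learned weight vector.
        rng = np.random.default_rng(seed)
        m, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(m):
                t += 1
                eta = 1.0 / (lam * t)            # Pegasos-style step size
                grad = lam * w                   # gradient of the regularization term
                if y[i] * (X[i] @ w) < 1.0:      # the hinge loss is active on (x_i, y_i)
                    grad -= y[i] * X[i]
                w -= eta * grad                  # h <- h - eta * grad f(h, z_i)
        return w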

Data Laden Analysis of SVM Solvers

Method                          Optimization runtime                   $T(\infty, \epsilon)$
Interior Point                  $m^{3.5} \log\log\frac{1}{\epsilon}$   $\frac{1}{\epsilon^7}$
Dual decomposition              $m^2 \log\frac{1}{\epsilon}$           $\frac{1}{\epsilon^4}$
SVM-Perf (Joachims '06)         $\frac{m}{\lambda\epsilon}$            $\frac{1}{\epsilon^4}$
SGD (S., Srebro & Singer '07)   $\frac{1}{\lambda\epsilon}$            $\frac{1}{\epsilon^2}$

The best optimization method is the worst learning method...

More Data -> Less Work
[Figure: runtime vs. training set size for Dual Decomposition, SVM-Perf, and SGD (Pegasos).]

More Data -> Less Work
[Figure, left (theoretical): runtime vs. training set size, with the sample-complexity and data-laden regimes marked. Right (empirical, CCAT): millions of iterations (proportional to runtime) vs. training set size, over 300,000-700,000 examples.]
Intuition: better to do simple operations on more examples than complicated operations on fewer examples.

Outline
How can more data reduce runtime?
  Learning using Stochastic Optimization (S. & Srebro 2008)
  Injecting Structure (S., Shamir & Sridharan 2010)
How can more data compensate for missing information?
  Attribute Efficient Learning (Cesa-Bianchi, S. & Shamir 2010) - Technique: rely on Stochastic Optimization
  Ads Placement (Kakade, S. & Tewari 2008) - Technique: inject structure by exploration

Injecting Structure
Main idea: replace the original hypothesis class with a larger hypothesis class.
On one hand: the larger class has more structure -> easier to optimize.
On the other hand: larger class -> larger estimation error -> need more examples.
[Figure: the original hypothesis class contained inside the new, larger hypothesis class.]

Example: Learning 3-DNF
Goal: learn a Boolean function $h : \{0,1\}^d \to \{0,1\}$
Hypothesis class: 3-term DNF formulae, e.g. $h(x) = (x_1 \wedge x_3 \wedge x_7) \vee (x_4 \wedge x_2) \vee (x_5 \wedge x_9)$
Sample complexity is of order $d/\epsilon$.
Kearns & Vazirani: if RP $\neq$ NP, it is not possible to efficiently find an $\epsilon$-accurate 3-DNF formula.
Claim: if $m \gtrsim d^3/\epsilon$, it is possible to find a predictor with error $\le \epsilon$ in polynomial time.
[Figure: runtime vs. m.]

Proof
Observation: a 3-term DNF formula can be rewritten as $\bigwedge_{u \in T_1,\, v \in T_2,\, w \in T_3} (u \vee v \vee w)$ for three sets of literals $T_1, T_2, T_3$.
Define $\psi : \{0,1\}^d \to \{0,1\}^{2(2d)^3}$ s.t. for each triplet of literals $u, v, w$ there are two variables indicating whether $u \vee v \vee w$ is true or false.
Observation: each 3-DNF can be represented as a single conjunction over $\psi(x)$.
It is easy to learn a single conjunction (greedy or LP).
[Figure: 3-DNF formulae over x contained inside conjunctions over psi(x).]
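A small sketch of the expansion psi (illustrative code, not from the talk). Any 3-term DNF $(T_1) \vee (T_2) \vee (T_3)$ then equals the conjunction of the psi-features indexed by triplets $(u, v, w)$ with $u \in T_1$, $v \in T_2$, $w \in T_3$.

    import itertools
    import numpy as np

    def literals(x):
        # All 2d literal values of a boolean vector x: x_1..x_d and their negations.
        x = np.asarray(x, dtype=bool)
        return np.concatenate([x, ~x])

    def psi(x):
        # One indicator per ordered triplet of literals (u, v, w), telling whether the
        # clause (u OR v OR w) is satisfied by x.  (The slide's map also keeps the negated
        # indicator, hence dimension 2*(2d)^3; one copy per clause suffices to illustrate.)
        lits = literals(x)
        return np.array([int(u or v or w)
                         for u, v, w in itertools.product(lits, repeat=3)])

    x = [1, 0, 1, 1]                 # d = 4
    print(psi(x).shape)              # (2d)^3 = 512 features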

Trading samples for runtime

Algorithm                     samples          runtime
3-DNF over $x$                $d/\epsilon$     $2^d$
Conjunction over $\psi(x)$    $d^3/\epsilon$   $\mathrm{poly}(d)$

[Figure: runtime vs. m for the 3-DNF and Conjunction approaches.]

More complicated example: agnostic learning of halfspaces with the 0-1 loss
[Figure: runtime vs. m, comparing Ben-David & Simon (2000) with S., Shamir & Sridharan (2010).]

Outline
How can more data reduce runtime?
  Learning using Stochastic Optimization (S. & Srebro 2008)
  Injecting Structure (S., Shamir & Sridharan 2010)
How can more data compensate for missing information?
  Attribute Efficient Learning (Cesa-Bianchi, S. & Shamir 2010) - Technique: rely on Stochastic Optimization
  Ads Placement (Kakade, S. & Tewari 2008) - Technique: inject structure by exploration

Attribute Efficient Regression
Each training example is a pair $(x, y) \in \mathbb{R}^d \times \mathbb{R}$.
Partial information: can only view $O(1)$ attributes of each individual example.

Sponsored Advertisement (contextual multi-armed bandit)
Each training example is a pair $(x, c) \in \mathbb{R}^d \times [0,1]^k$.
Partial information: we don't know $c$; we can only guess some $y$ and then observe the value of $c_y$.

How does more data help?
Three main techniques:
1. Missing information as noise
2. Active exploration - try to fish out the relevant information
3. Inject structure - the problem is hard in the original representation but becomes simple in another representation (a different hypothesis class)
More data helps because:
1. It reduces variance - compensates for the noise
2. It allows more exploration
3. It compensates for the larger sample complexity due to using larger hypothesis classes

Attribute Efficient Regression
Formal problem statement:
  Unknown distribution $D$ over $\mathbb{R}^d \times \mathbb{R}$
  Goal: learn a linear predictor $x \mapsto \langle w, x \rangle$ with low risk: $L_D(w) = \mathbb{E}_D[(\langle w, x \rangle - y)^2]$
  Training set: $(x_1, y_1), \ldots, (x_m, y_m)$
  Partial information: for each $(x_i, y_i)$, the learner can view only $k$ attributes of $x_i$
  Active selection: the learner can choose which $k$ attributes to see
Similar to "Learning with restricted focus of attention" (Ben-David & Dichterman '98).

Dealing with missing information
Usually difficult - there are exponentially many ways to complete the missing information.
A popular approach: Expectation Maximization (EM).
Previous methods usually do not come with guarantees (neither sample complexity nor computational complexity).

Ostrich Approach
Simply set the missing attributes to a default value, and use your favorite regression algorithm for full information, e.g.,
    $\min_w \sum_{i=1}^m (\langle w, x_i \rangle - y_i)^2 + \lambda \|w\|_p^p$
Ridge regression: $p = 2$; Lasso: $p = 1$.
Efficient? Correct? Sample complexity? How to choose the attributes?
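A sketch of the ostrich baseline with ridge regression (p = 2): zero-fill the unobserved attributes and solve the usual closed form. The NaN convention and the function name are my own.

    import numpy as np

    def ostrich_ridge(X_partial, y, lam):
        # X_partial: (m, d) matrix with NaN marking the unobserved attributes.
        X = np.nan_to_num(X_partial, nan=0.0)                  # default value = 0
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)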

Partial Information as Noise
Observation:
    $x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} = \frac{1}{d}\begin{pmatrix} d\,x_1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} + \cdots + \frac{1}{d}\begin{pmatrix} 0 \\ \vdots \\ 0 \\ d\,x_d \end{pmatrix}$
Therefore, choosing $i$ uniformly at random gives $\mathbb{E}_i[d\,x_i e_i] = x$.
If $\|x\| \le 1$ then $\|d\,x_i e_i\| \le d$ (i.e. the variance increased).
We reduced missing information to unbiased noise; many examples can compensate for the added noise.
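The observation is easy to check numerically; a minimal sketch (the names are mine):

    import numpy as np

    def one_attribute_estimate(x, rng):
        # Observe a single uniformly chosen attribute i and return d * x_i * e_i.
        d = len(x)
        i = rng.integers(d)
        est = np.zeros(d)
        est[i] = d * x[i]
        return est

    rng = np.random.default_rng(0)
    x = np.array([0.2, -0.5, 0.1, 0.7])
    avg = np.mean([one_attribute_estimate(x, rng) for _ in range(100_000)], axis=0)
    # avg is close to x, illustrating E_i[d * x_i * e_i] = x (at the price of higher variance).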

Many examples compensate for noise
True goal: minimize over $w$ the true risk $L_D(w) = \mathbb{E}_{(x,y)}[(\langle w, x \rangle - y)^2]$.
The loss on one example, $(\langle w, x \rangle - y)^2$, gives an unbiased estimate of $L_D(w)$; averaging the loss over many examples reduces the variance.
In our case, we construct an unbiased estimate of the loss of each single example. The variance increases, but many examples still reduce it back.

Loss-Based Attribute Efficient Regression (LaBAER)
Theorem (Cesa-Bianchi, S. & Shamir). Let $\hat{w}$ be the output of LaBAER. Then, with overwhelming probability,
    $L_D(\hat{w}) \;\le\; \min_{w : \|w\|_1 \le B} L_D(w) \;+\; \tilde{O}\!\left(\frac{d^2 B^2}{\sqrt{m}}\right),$
where $d$ is the dimension and $m$ is the number of examples.
A factor of $d^4$ more examples compensates for the missing information!
Can we do better?

Pegasos Attribute Efficient Regression (PAER)
The factor $d^2$ in the bound arises because we estimate the matrix $x x^\top$.
Can we avoid estimating the matrix? Yes! Estimate the gradient of the loss instead of the loss itself, following the Stochastic Gradient Descent approach.
This leads to a much better bound.
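A simplified sketch of the gradient-estimation idea (my own illustration of the principle, not the exact PAER algorithm; uniform attribute sampling is a simplification): each example still reveals only O(1) attributes, yet yields an unbiased estimate of the gradient, which can be fed into an SGD update.

    import numpy as np

    def partial_info_gradient(w, x, y, rng):
        # The gradient of the squared loss is 2*(<w,x> - y)*x.  Estimate <w,x> from one
        # uniformly drawn attribute and x from a second, independent draw; independence
        # keeps the product unbiased.
        d = len(x)
        i = rng.integers(d)                          # attribute used to estimate <w, x>
        j = rng.integers(d)                          # independent attribute used to estimate x
        inner_est = d * w[i] * x[i]                  # E[d * w_i * x_i] = <w, x>
        grad = np.zeros(d)
        grad[j] = 2.0 * (inner_est - y) * d * x[j]   # E[grad] = 2*(<w,x> - y) * x
        return grad                                  # feed into w <- w - eta * grad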

Demonstration
Full-information classifiers (top line): error of 1.5%.
Our algorithms (bottom line): error of 4%.

Demonstration
[Figure, left: test regression error vs. number of examples (up to ~12,000); PAER and LaBAER observe 4 pixels per example, Ridge and Lasso observe all 768 pixels per example. Right: test regression error vs. total number of observed features (up to ~45,000) for Ridge Regression, Lasso, PAER, and LaBAER.]

Ads Placement: More data improves runtime
Partial supervision. Technique: Active Exploration.
A larger regret can be compensated for by having more examples.
[Figure: runtime vs. m for Halving and the Banditron.]
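For context, a hedged sketch of the Banditron update of Kakade, S. & Tewari (2008), where exploration supplies the missing label information; the interface and the exploration rate gamma are illustrative, see the paper for the exact algorithm and analysis.

    import numpy as np

    def banditron_step(W, x, y_true, gamma, rng):
        # W: (k, d) weight matrix; x: (d,) context; y_true is hidden and used only through
        # the bandit feedback 1[shown ad == correct ad].
        k = W.shape[0]
        y_hat = int(np.argmax(W @ x))                 # exploit: the current best guess
        p = np.full(k, gamma / k)
        p[y_hat] += 1.0 - gamma                       # explore with total probability gamma
        y_tilde = int(rng.choice(k, p=p))             # the ad actually shown
        feedback = float(y_tilde == y_true)           # the only supervision received
        update = np.zeros_like(W)
        update[y_tilde] += (feedback / p[y_tilde]) * x   # unbiased surrogate for the full-info update
        update[y_hat] -= x
        return W + update, y_tilde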

Summary
Learning theory: many examples -> smaller error.
This work: many examples -> higher efficiency, compensating for missing information.
Techniques:
1. Stochastic optimization
2. Inject structure
3. Missing information as noise
4. Active exploration