Computational Oracle Inequalities for Large Scale Model Selection Problems


University of California at Berkeley. Queensland University of Technology. ETH Zürich, September 2011. Joint work with Alekh Agarwal, John Duchi and Clément Levrard.

Large Scale Data Analysis

Observation: For many prediction problems, the amount of data available is effectively unlimited.

Information retrieval: Web search. 10^8 websites, 10^10 pages, 10^9 queries/day.
Natural language processing: Spelling correction. Google/Linguistics Data Consortium n-gram corpus: 10^11 sentences.
Computer vision: Captions. Facebook: 10^11 photos.

Observation: For many prediction problems, performance is limited by computational resources, not sample size.

Example: Peter Norvig, "Internet-Scale Data Analysis": on a spelling correction problem, trivial prediction rules estimated from a massive dataset perform much better than complex prediction rules (which can only be estimated from a dataset of modest size). Given a limited computational budget, what is the best trade-off? That is, should we spend our computation on gathering more data, or on estimating richer prediction rules?

Outline:
1. Computation is precious, not sample size: model selection and oracle inequalities.
2. Problem formulation.

Prediction Problem

Data: i.i.d. $Z_1, Z_2, \ldots, Z_n, Z$ from $\mathcal{Z}$. Use the data $Z_1, \ldots, Z_n$ to choose $f_n : \mathcal{Z} \to \mathcal{A}$ from a class $\mathcal{F}$. Aim to ensure that $f_n$ has small risk $L(f) = \mathbb{E}\,\ell(f, Z)$, where $\ell : \mathcal{F} \times \mathcal{Z} \to \mathbb{R}$ is a loss function.

Approximation-Estimation Trade-Off

Define the Bayes risk $L^* = \inf_f L(f)$, where the infimum is over measurable $f$. We can decompose the excess risk as
$$L(f_n) - L^* = \underbrace{\Big(L(f_n) - \inf_{f\in\mathcal{F}} L(f)\Big)}_{\text{estimation error}} + \underbrace{\Big(\inf_{f\in\mathcal{F}} L(f) - L^*\Big)}_{\text{approximation error}}.$$
Model selection: automatically choose $\mathcal{F}$ to optimize this trade-off.

The Model Selection Problem

Given function classes $\mathcal{F}_1, \ldots, \mathcal{F}_K$, use the data $Z_1, \ldots, Z_n$ to choose $f_n \in \bigcup_{i=1}^K \mathcal{F}_i$ that gives a good trade-off between the approximation error and the estimation error.

Example: Complexity-penalized model selection.
$$f_n^i = \arg\min_{f\in\mathcal{F}_i} L_n(f), \qquad f_n = \text{minimizer of } L_n(f_n^i) + \gamma_i(n),$$
where $\gamma_i(n)$ is a complexity penalty and $L_n$ is the empirical risk, $L_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f, Z_i)$.
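To make the procedure concrete, here is a minimal Python sketch of complexity-penalized selection over nested polynomial regression classes. The classes, the squared-error loss, and the penalty $\gamma_i(n) = \sqrt{i/n}$ are illustrative assumptions, not the specific choices analyzed in the talk.

```python
# Sketch: complexity-penalized model selection over nested classes
# F_i = polynomials of degree i, with illustrative penalty sqrt(i/n).
import numpy as np

def empirical_risk(coeffs, x, y):
    """Squared-error empirical risk L_n(f) for a polynomial fit."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def select_model(x, y, max_degree, gamma=lambda i, n: np.sqrt(i / n)):
    """Return (degree, coeffs) minimizing L_n(f_n^i) + gamma_i(n)."""
    n = len(x)
    best = None
    for i in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, deg=i)   # f_n^i = ERM over F_i
        penalized = empirical_risk(coeffs, x, y) + gamma(i, n)
        if best is None or penalized < best[0]:
            best = (penalized, i, coeffs)
    return best[1], best[2]

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 * x + 0.1 * rng.standard_normal(200)  # truth is degree 1
degree, _ = select_model(x, y, max_degree=8)
print(degree)
```

Because the penalty gap between consecutive degrees exceeds any empirical-risk reduction from overfitting, the rule selects the degree-1 class here.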

A Simple Oracle Inequality

Theorem. Suppose that we have risk bounds for each $\mathcal{F}_i$: with probability $1-\delta$,
$$\sup_{f\in\mathcal{F}_i} |L(f) - L_n(f)| \le \gamma_i(n) + c\sqrt{\frac{\log(1/\delta)}{n}}.$$
If $f_n$ is chosen via complexity regularization, $f_n^i = \arg\min_{f\in\mathcal{F}_i} L_n(f)$ with $f_n$ the minimizer of $L_n(f_n^i) + \gamma_i(n)$, then with probability $1-\delta$,
$$L(f_n) \le \min_i \left( \inf_{f\in\mathcal{F}_i} L(f) + 2\gamma_i(n) + c\sqrt{\frac{\log(1/\delta) + \log i}{n}} \right).$$

Notice that, for each $\mathcal{F}_i$ satisfying
$$\sup_{f\in\mathcal{F}_i} |L(f) - L_n(f)| \le \gamma_i(n) + c\sqrt{\frac{\log(1/\delta)}{n}},$$
we have
$$L(f_n^i) \le \inf_{f\in\mathcal{F}_i} L(f) + 2\gamma_i(n) + c\sqrt{\frac{\log(1/\delta)}{n}}.$$
But complexity regularization gives $f_n$ satisfying
$$L(f_n) \le \min_i \left( \inf_{f\in\mathcal{F}_i} L(f) + 2\gamma_i(n) + c\sqrt{\frac{\log(1/\delta) + \log i}{n}} \right).$$
Thus $f_n$ gives a near-optimal trade-off between the approximation error and the (bound on the) estimation error, with only a $\log i$ penalty.

Computation versus sample size

Complexity regularization requires computing the empirical risk minimizer for each $\mathcal{F}_i$: $f_n^i = \arg\min_{f\in\mathcal{F}_i} L_n(f)$, with $f_n$ the minimizer of $L_n(f_n^i) + \gamma_i(n)$. So computation typically grows linearly with $K$. The oracle inequality gives the best trade-off for a given sample size:
$$L(f_n) \le \min_i \left( \inf_{f\in\mathcal{F}_i} L(f) + 2\gamma_i(n) + c\sqrt{\frac{\log(1/\delta) + \log i}{n}} \right).$$


Linear sample size scaling with computation

Assumption. For class $\mathcal{F}_i$: it takes unit time to use a sample of size $n_i$ to choose $f_{n_i} \in \mathcal{F}_i$, and time $t$ to use a sample of size $t n_i$ to choose $f_{t n_i} \in \mathcal{F}_i$.

Our goal is a computational oracle inequality: $f_n$ compares favorably with each model estimated using the entire computational budget $T$,
$$L(f_n) \le \min_i \left( \underbrace{\inf_{f\in\mathcal{F}_i} L(f) + \gamma_i(T n_i)}_{\text{cf. estimating } f \text{ using the entire budget}} + O\!\left(\sqrt{\frac{\log(i/\delta)}{T n_i}}\right) \right).$$

Naïve solution: grid search

Allocate budget $T/K$ to each model, i.e., use a sample of size $T n_i/K$ for $\mathcal{F}_i$. Choose
$$f^i_{T n_i/K} = \arg\min_{f\in\mathcal{F}_i} L_{T n_i/K}(f), \qquad f_n = \text{minimizer of } L_{T n_i/K}(f^i_{T n_i/K}) + \gamma_i\!\left(\frac{T n_i}{K}\right).$$
This satisfies the oracle inequality
$$L(f_n) \le \min_i \left( \inf_{f\in\mathcal{F}_i} L(f) + \gamma_i\!\left(\frac{T n_i}{K}\right) + O\!\left(\sqrt{\frac{K}{T n_i}}\right) \right).$$

Model selection from nested classes

Suppose that the models are ordered by inclusion: $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_K$. Examples:
$$\mathcal{F}_i = \{\theta \in \mathbb{R}^d : \|\theta\| \le r_i\}, \quad r_1 \le r_2 \le \cdots \le r_K;$$
$$\mathcal{F}_i = \{\theta \in \mathbb{R}^{d_i} : \|\theta\| \le 1\}, \quad d_1 \le d_2 \le \cdots \le d_K.$$
Suppose that we have risk bounds for each $\mathcal{F}_i$: with probability $1-\delta$,
$$\sup_{f\in\mathcal{F}_i} |L(f) - L_n(f)| \le \gamma_i(n) + c\sqrt{\frac{\log(1/\delta)}{n}}.$$

Exploiting structure of nested classes

Want to exploit the monotonicity of risks and penalties: the excess risk $R_i = \inf_{f\in\mathcal{F}_i} L(f) - L^*$ is nonincreasing in $i$, while the penalty $\gamma_i(n)$ is nondecreasing in $i$. [Figure: $R_i$ decreasing and $\gamma_i(n)$ increasing as functions of $i$.]

Coarse grid sets

Want to spend computation on only a few classes, using monotonicity to interpolate for the rest. Partition the classes based on penalty values. [Figure: penalty values $\gamma_i(n)$ bucketed into geometric levels $1,\ 1+\lambda,\ (1+\lambda)^2,\ \ldots,\ (1+\lambda)^j,\ (1+\lambda)^{j+1},\ \ldots$; the coarse grid keeps one class per level.]

Coarse grids for model selection

Definition (Coarse grid). A set $S \subseteq \{1,\ldots,K\}$ is a coarse grid of size $s$ if $|S| = s$ and for each $i \in \{1,\ldots,K\}$ there is an index $j \in S$ such that
$$\gamma_i\!\left(\frac{T n_i}{s}\right) \le \gamma_j\!\left(\frac{T n_j}{s}\right) \le 2\gamma_i\!\left(\frac{T n_i}{s}\right).$$
Include a new class only after the penalty increases sufficiently; $s = O(\log K)$ is typical.
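One way to build such a grid greedily is sketched below. This variant keeps a class whenever its penalty more than doubles the last kept one, so every penalty lies within a factor 2 of a grid point (a mild downward variant of the definition above); the geometric penalty sequence used here is illustrative.

```python
# Sketch: greedy construction of a multiplicative (factor-2) cover of
# nondecreasing penalty values; a variant of the coarse grid definition.
def coarse_grid(penalties):
    """Return indices S such that every penalty is within a factor 2
    of penalties[j] for some j in S (penalties nondecreasing)."""
    S = [0]
    for i in range(1, len(penalties)):
        if penalties[i] > 2 * penalties[S[-1]]:
            S.append(i)
    return S

# Geometrically growing penalties: the grid is much smaller than K.
gammas = [1.5 ** i for i in range(20)]
S = coarse_grid(gammas)
print(S)  # keeps every other class here
```

With faster-growing penalties (ratio above 2 per class boundary) the grid would shrink further, toward the $O(\log K)$ size mentioned above.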

Complexity regularization on a coarse grid

Given a coarse grid set $S$ with cardinality $s$:
1. Allocate budget $T/s$ to each class in $S$.
2. Choose
$$f^i_{T n_i/s} = \arg\min_{f\in\mathcal{F}_i} L_{T n_i/s}(f), \qquad \hat f = \arg\min_{f \in \{f^j_{T n_j/s} : j \in S\}} \left( L_{T n_j/s}(f) + \gamma_j\!\left(\frac{T n_j}{s}\right) \right).$$

Theorem. With probability at least $1-\delta$,
$$L(\hat f) \le \min_{i\in\{1,\ldots,K\}} \left\{ \inf_{f\in\mathcal{F}_i} L(f) + 4\gamma_i\!\left(\frac{T n_i}{s}\right) \right\} + O\!\left(\sqrt{\frac{s\log K + \log(1/\delta)}{T n_K}}\right).$$

Corollary. For a coarse grid of logarithmic size, $|S| \le c\log K$, with probability at least $1-\delta$,
$$L(\hat f) \le \min_{i\in\{1,\ldots,K\}} \left\{ \inf_{f\in\mathcal{F}_i} L(f) + 4\gamma_i\!\left(\frac{T n_i}{c\log K}\right) \right\} + O\!\left(\sqrt{\frac{c\log^2 K + \log(1/\delta)}{T n_K}}\right).$$
Both the statistical and computational costs scale logarithmically with $K$.


Heterogeneous Models

In general, the $\mathcal{F}_i$ can be heterogeneous, not ordered by inclusion: different kernels, graphs in directed graphical models, subsets of features. Key idea: successively allocate computational quanta online.

Multi-Armed Bandits for Model Selection

Want the class $i$ that minimizes $\inf_{f\in\mathcal{F}_i} L(f) + \gamma_i(T n_i)$. Use the idea of optimism in the face of uncertainty: neatly trade off exploration and exploitation by choosing the class with the smallest lower bound on the criterion.

Want the class $i$ that minimizes $\inf_{f\in\mathcal{F}_i} L(f) + \gamma_i(T n_i)$. We know it suffices to choose a class $i$ to minimize $L_{T n_i}(f^i_{T n_i}) + \gamma_i(T n_i)$. Use the lower confidence bound
$$L_n(f_n^i) - \gamma_i(n) - \sqrt{\frac{\log K}{n}} + \gamma_i(T n_i),$$
where $n$ is the size of the sample that we have already allocated to class $i$.

The algorithm picks the class $i_t$ with the smallest lower confidence bound and allocates an additional sample of size $n_{i_t}$ to class $i_t$. The regret analysis of the upper-confidence-bound algorithm (Auer et al., 2002) extends to give oracle inequalities.
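The allocation rule can be simulated in a toy setting, assuming two classes whose losses are Bernoulli draws and penalties $\gamma_i(n) = c_i/\sqrt{n}$; the constants and the exact confidence width are illustrative, not those from the analysis.

```python
# Toy simulation: lower-confidence-bound allocation of computational
# quanta across model classes; losses are simulated Bernoulli draws.
import math
import random

def lcb_allocate(mus, c, T, quantum=10, seed=0):
    """Each round, give one quantum of samples to the class with the
    smallest lower confidence bound on inf L + gamma_i(T n_i)."""
    rng = random.Random(seed)
    K = len(mus)
    counts = [0] * K            # samples allocated to each class
    loss_sums = [0.0] * K
    gamma = lambda i, n: c[i] / math.sqrt(max(n, 1))
    for i in range(K):          # initialize: one quantum per class
        counts[i] += quantum
        loss_sums[i] += sum(rng.random() < mus[i] for _ in range(quantum))
    for _ in range(T - K):
        lcbs = []
        for i in range(K):
            n = counts[i]
            emp = loss_sums[i] / n
            lcbs.append(emp - gamma(i, n) - math.sqrt(math.log(K) / n)
                        + gamma(i, T * quantum))
        i_t = min(range(K), key=lambda i: lcbs[i])
        counts[i_t] += quantum
        loss_sums[i_t] += sum(rng.random() < mus[i_t] for _ in range(quantum))
    return counts

counts = lcb_allocate(mus=[0.2, 0.5], c=[1.0, 1.0], T=200)
print(counts)
```

As the confidence widths shrink, the class with the larger loss stops being selected, so almost all of the budget ends up on the better class, mirroring the $\log T/(n_i T \Delta_i^2)$ bound on suboptimal allocations.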

Oracle inequality under a separation assumption

Define
$$i^* = \arg\min_i \left( \inf_{f\in\mathcal{F}_i} L(f) + \gamma_i(T n_i) \right), \qquad \Delta_i = \inf_{f\in\mathcal{F}_i} L(f) + \gamma_i(T n_i) - \left( \inf_{f\in\mathcal{F}_{i^*}} L(f) + \gamma_{i^*}(T n_{i^*}) \right),$$
and assume $\gamma_i(n) = c_i/\sqrt{n}$.

Theorem. Let $T_i(T)$ be the number of times class $i$ is queried. There are constants $C, \kappa_1, \kappa_2$ such that, with probability at least $1 - \kappa_1 T K^{-4}$,
$$T_i(T) \le \frac{C}{n_i} \left( \frac{c_i + \kappa_2\sqrt{\log T}}{\Delta_i} \right)^2.$$

With $i^*$ and $\Delta_i$ defined as above, if we can incrementally update the choice $f_n^i$, then the fraction of the budget assigned to a suboptimal class $i$ is no more than $\log T / (n_i T \Delta_i^2)$. This is essentially optimal (Lai and Robbins, 1985).

Oracle inequality without separation

Assume that the functions in $\{\mathcal{F}_i\}$ map to a vector space and the loss $\ell(\cdot, z)$ is convex. Define $\hat f_T = \frac{1}{T}\sum_{t=1}^T f_t$, where the algorithm produces $f_t \in \mathcal{F}_{i_t}$ at time $t$.

Theorem. There is a constant $\kappa$ such that, with probability at least $1 - 2\kappa T K^{-3}$,
$$L(\hat f_T) \le \min_{i\in\{1,\ldots,K\}} \left( \inf_{f\in\mathcal{F}_i} L(f) + \gamma_i(T n_i) \right) + O\!\left( \sqrt{\frac{K \max\{\log T, \log K\}}{T}} \right).$$
Note the linear dependence on $K$.

Summary

For large-scale problems, data is cheap but computation is precious. Computational oracle inequalities for model selection: select a near-optimal model without wasting computation on other models. A nested complexity hierarchy ensures a cost logarithmic in the size of the hierarchy; if the hierarchy is not nested, the cost of model selection is linear in its size.

Open problems Fast rates: If excess risk and excess empirical risk are close, then penalized empirical risk gives oracle inequalities with fast rates. Can this be exploited to give computational oracle inequalities in these cases? For nested hierarchies, the analysis relied on a coarse multiplicative cover of the penalty values. If the penalties are data-dependent, when is this approach possible? What other structures on function classes lead to good computational oracle inequalities?

Fast Rates

Theorem. For $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$ and $\gamma_1(n) \le \gamma_2(n) \le \cdots$, if
$$\sup_i \sup_{f\in\mathcal{F}_i} \Big( L(f) - L(f_i^*) - 2\big(L_n(f) - L_n(f_i^*)\big) - \gamma_i(n) \Big) \le 0,$$
$$\sup_i \sup_{f\in\mathcal{F}_i} \Big( L_n(f) - L_n(f_i^*) - 2\big(L(f) - L(f_i^*)\big) - \gamma_i(n) \Big) \le 0,$$
then $L(f_n) \le \inf_i \big( L(f_i^*) + 9\gamma_i(n) \big)$, where $f_n$ minimizes $L_n(f_n^i) + 7\gamma_i(n)/2$ and $f_i^* = \arg\min_{f\in\mathcal{F}_i} L(f)$.

Examples: convex losses, classification with low noise. A large increase in the expected minimal empirical risk with sample size makes it challenging to avoid sometimes choosing a class that is too large.