for Large Scale Model Selection Problems
University of California at Berkeley · Queensland University of Technology
ETH Zürich, September 2011
Joint work with Alekh Agarwal, John Duchi and Clément Levrard.
Large Scale Data Analysis

Observation: For many prediction problems, the amount of data available is effectively unlimited.
- Information retrieval: Web search. 10^8 websites, 10^10 pages, 10^9 queries/day.
- Natural language processing: Spelling correction. Google/Linguistic Data Consortium n-gram corpus: 10^11 sentences.
- Computer vision: Captions. Facebook: 10^11 photos.
Large Scale Data Analysis

Observation: For many prediction problems, performance is limited by computational resources, not sample size.
- Information retrieval: Web search
- Natural language processing: Spelling correction
- Computer vision: Captions
Large Scale Data Analysis

Example: Peter Norvig, "Internet-Scale Data Analysis": on a spelling correction problem, trivial prediction rules estimated from a massive dataset perform much better than complex prediction rules (which allow only a dataset of modest size).

Given a limited computational budget, what is the best trade-off? That is, should we spend our computation on gathering more data, or on estimating richer prediction rules?
Outline
1. Computation is precious, not sample size (model selection; oracle inequalities)
2. Problem formulation
3.
4.
Prediction Problem

Z_1, Z_2, ..., Z_n, Z drawn i.i.d. from Z.
Use data Z_1, ..., Z_n to choose f_n : Z → A from a class F.
Aim to ensure f_n has small risk: L(f) = E ℓ(f, Z), where ℓ : F × Z → R is a loss function.
Approximation-Estimation Trade-Off

Define the Bayes risk L* = inf_f L(f), where the infimum is over measurable f.
We can decompose the excess risk as

    L(f_n) − L* = ( L(f_n) − inf_{f ∈ F} L(f) )  +  ( inf_{f ∈ F} L(f) − L* )
                       [estimation error]               [approximation error]

Model selection: automatically choose F to optimize this trade-off.
The Model Selection Problem

Given function classes F_1, ..., F_K, use the data Z_1, ..., Z_n to choose f_n ∈ ∪_{i=1}^K F_i that gives a good trade-off between the approximation error and the estimation error.

Example: Complexity-penalized model selection.

    f_n^i = argmin_{f ∈ F_i} L_n(f),       f_n = minimizer over i of L_n(f_n^i) + γ_i(n),

where γ_i(n) is a complexity penalty and L_n is the empirical risk:

    L_n(f) = (1/n) Σ_{i=1}^n ℓ(f, Z_i).
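As a concrete sketch of this procedure (the helper names `fit_erm` and `penalty`, and the polynomial-degree classes in the usage below, are illustrative, not from the talk), complexity-penalized model selection is a loop over the classes:

```python
import numpy as np

def empirical_risk(f, Z, loss):
    """L_n(f): average loss of f over the sample Z."""
    return float(np.mean([loss(f, z) for z in Z]))

def penalized_model_selection(Z, K, fit_erm, penalty, loss):
    """Complexity-penalized model selection over classes F_1, ..., F_K.

    fit_erm(i, Z) -> empirical risk minimizer f_n^i in class F_i
    penalty(i, n) -> complexity penalty gamma_i(n)
    Returns (i, f_n^i) minimizing L_n(f_n^i) + gamma_i(n).
    """
    n = len(Z)
    best = None
    for i in range(K):
        f_i = fit_erm(i, Z)
        score = empirical_risk(f_i, Z, loss) + penalty(i, n)
        if best is None or score < best[0]:
            best = (score, i, f_i)
    return best[1], best[2]

# Usage: F_i = polynomials of degree i, squared loss, penalty (i+1)/sqrt(n).
xs = np.linspace(0.0, 1.0, 20)
Z = list(zip(xs, 2 * xs + 1))                      # exactly linear data
fit = lambda i, Z: np.polyfit([x for x, _ in Z], [y for _, y in Z], deg=i)
loss = lambda f, z: (np.polyval(f, z[0]) - z[1]) ** 2
pen = lambda i, n: (i + 1) / np.sqrt(n)
i_hat, f_hat = penalized_model_selection(Z, K=4, fit_erm=fit, penalty=pen, loss=loss)
```

On this noiseless linear data, degrees 1 through 3 all achieve near-zero empirical risk, so the penalty breaks the tie in favor of the smallest adequate class.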
A Simple Oracle Inequality

Theorem. Suppose that we have risk bounds for each F_i: w.p. ≥ 1 − δ,

    sup_{f ∈ F_i} |L(f) − L_n(f)| ≤ γ_i(n) + c √(log(1/δ)/n).

If f_n is chosen via complexity regularization:

    f_n^i = argmin_{f ∈ F_i} L_n(f),       f_n = minimizer over i of L_n(f_n^i) + γ_i(n),

then with probability ≥ 1 − δ,

    L(f_n) ≤ min_i ( inf_{f ∈ F_i} L(f) + 2 γ_i(n) + c √((log(1/δ) + log i)/n) ).
A Simple Oracle Inequality

Notice that, for each F_i satisfying

    sup_{f ∈ F_i} |L(f) − L_n(f)| ≤ γ_i(n) + c √(log(1/δ)/n),

we have

    L(f_n^i) ≤ inf_{f ∈ F_i} L(f) + 2 γ_i(n) + c √(log(1/δ)/n).

But complexity regularization gives f_n satisfying

    L(f_n) ≤ min_i ( inf_{f ∈ F_i} L(f) + 2 γ_i(n) + c √((log(1/δ) + log i)/n) ).

Thus, f_n gives a near-optimal trade-off between the approximation error and the (bound on the) estimation error, with only a log i penalty.
Computation versus sample size

Complexity regularization involves computing the empirical risk minimizer for each F_i:

    f_n^i = argmin_{f ∈ F_i} L_n(f),       f_n = minimizer over i of L_n(f_n^i) + γ_i(n).

So computation typically grows linearly with K.

The oracle inequality gives the best trade-off for a given sample size:

    L(f_n) ≤ min_i ( inf_{f ∈ F_i} L(f) + 2 γ_i(n) + c √((log(1/δ) + log i)/n) ).
Problem formulation

Assumption (linear sample-size scaling with computation). For class F_i:
- It takes unit time to use a sample of size n_i to choose f_{n_i} ∈ F_i.
- It takes time t to use a sample of size t·n_i to choose f_{t n_i} ∈ F_i.

Our goal: a computational oracle inequality: f_n compares favorably with each model estimated using the entire computational budget T:

    L(f_n) ≤ min_i ( inf_{f ∈ F_i} L(f) + γ_i(T n_i)      [cf. estimating f using the entire budget]
                     + O(√(log(i/δ)/(T n_i))) ).
Naïve solution: grid search

Allocate budget T/K to each model, i.e. use a sample of size T n_i /K for F_i. Choose

    f^i = argmin_{f ∈ F_i} L_{T n_i /K}(f),       f_n = minimizer over i of L_{T n_i /K}(f^i) + γ_i(T n_i /K).

This satisfies the oracle inequality

    L(f_n) ≤ min_i ( inf_{f ∈ F_i} L(f) + γ_i(T n_i /K) + O(√(K/(T n_i))) ).
Model selection from nested classes

Suppose that the models are ordered by inclusion: F_1 ⊆ F_2 ⊆ ... ⊆ F_K.

Examples:
- F_i = { θ ∈ R^d : ‖θ‖ ≤ r_i },  r_1 ≤ r_2 ≤ ... ≤ r_K.
- F_i = { θ ∈ R^{d_i} : ‖θ‖ ≤ 1 },  d_1 ≤ d_2 ≤ ... ≤ d_K.

Suppose that we have risk bounds for each F_i: w.p. ≥ 1 − δ,

    sup_{f ∈ F_i} |L(f) − L_n(f)| ≤ γ_i(n) + c √(log(1/δ)/n).
Exploiting structure of nested classes

Want to exploit monotonicity of risks and penalties:
- The excess risk R_i = inf_{f ∈ F_i} L(f) − L* is non-increasing in i.
- The penalty γ_i(n) is non-decreasing in i.

[Figure: R_i decreasing and γ_i(n) increasing as functions of the class index i.]
Coarse grid sets

Want to spend computation on only a few classes; use monotonicity to interpolate for the rest. Partition based on penalty values.

[Figure: γ_i(n) plotted against the class index i, with horizontal levels 1, 1+λ, (1+λ)^2, ..., (1+λ)^j, (1+λ)^{j+1}; the coarse grid keeps one class among F_1, F_2, ..., F_j, F_{j+1}, ... per level.]
Coarse grids for model selection

Definition (Coarse grid). A set S ⊆ {1, ..., K} is a coarse grid of size s if |S| = s and for each i ∈ {1, ..., K} there is an index j ∈ S such that

    γ_i(T n_i /s) ≤ γ_j(T n_j /s) ≤ 2 γ_i(T n_i /s).

Include a new class only after the penalty increases sufficiently; |S| = O(log K) is typical.
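A minimal sketch of building such a grid greedily, assuming the penalty values for classes 1, ..., K are precomputed and non-decreasing (the doubling rule gives a factor-2 cover of the penalty values; the names are illustrative):

```python
def coarse_grid(penalties):
    """Greedy factor-2 coarse grid over non-decreasing penalty values.

    Start from the first class and include a new class only once its
    penalty exceeds twice the penalty of the last included class.  Every
    gamma_i is then within a factor of 2 of gamma_j for some j in S, and
    |S| = O(log(gamma_K / gamma_1)), typically O(log K).
    """
    S = [0]
    for i in range(1, len(penalties)):
        if penalties[i] > 2 * penalties[S[-1]]:
            S.append(i)
    return S

# Example: K = 7 classes with increasing penalties.
S = coarse_grid([1.0, 1.5, 2.5, 3.0, 5.0, 9.0, 20.0])   # -> [0, 2, 5, 6]
```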
Complexity regularization on a coarse grid

Given a coarse grid set S with cardinality s:
1. Allocate budget T/s to each class in S.
2. Choose

    f^j = argmin_{f ∈ F_j} L_{T n_j /s}(f)  for each j ∈ S,
    f̂ = argmin over { f^j : j ∈ S } of ( L_{T n_j /s}(f^j) + γ_j(T n_j /s) ).
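Putting the two steps together, a hypothetical sketch (the callbacks `fit_with_budget` and `gamma` are stand-ins for training class F_j on a sample of size m and for the penalty; they are not from the talk):

```python
import math

def coarse_grid_selection(S, T, n, fit_with_budget, gamma):
    """Complexity regularization restricted to a coarse grid S.

    Each class j in S receives budget T/s, i.e. a sample of size T*n[j]/s.
    fit_with_budget(j, m) -> (f_j, empirical risk of f_j) on a sample of size m
    gamma(j, m)           -> penalty gamma_j(m)
    Returns the grid index j minimizing L_m(f_j) + gamma_j(m).
    """
    s = len(S)
    best_j, best_score = None, float("inf")
    for j in S:
        m = (T * n[j]) // s              # sample size affordable with budget T/s
        f_j, risk = fit_with_budget(j, m)
        score = risk + gamma(j, m)
        if score < best_score:
            best_j, best_score = j, score
    return best_j
```

Only the s = |S| classes on the grid are ever trained, so the computational cost is O(log K) model fits rather than K.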
Complexity regularization on a coarse grid

Theorem. With probability at least 1 − δ,

    L(f̂) ≤ min_{i ∈ {1,...,K}} ( inf_{f ∈ F_i} L(f) + 4 γ_i(T n_i /s) ) + O( √( s (log K + log(1/δ)) / (T n_K) ) ).
Complexity regularization on a coarse grid

Corollary. For a coarse grid of logarithmic size, that is, |S| ≤ c log K, with probability at least 1 − δ,

    L(f̂) ≤ min_{i ∈ {1,...,K}} ( inf_{f ∈ F_i} L(f) + 4 γ_i(T n_i /s) ) + O( √( c log K (log K + log(1/δ)) / (T n_K) ) ).

Both the statistical and computational costs scale logarithmically with K.
Heterogeneous Models

In general, the F_i can be heterogeneous, not ordered by inclusion:
- Different kernels.
- Graphs in directed graphical models.
- Subsets of features.

Key idea: successively allocate computational quanta online.
Multi-Armed Bandits for Model Selection

Want the class i that minimizes

    inf_{f ∈ F_i} L(f) + γ_i(T n_i).

We know it suffices to choose a class i to minimize L_{T n_i}(f^i_{T n_i}) + γ_i(T n_i). Use the idea of optimism in the face of uncertainty: neatly trade off exploration and exploitation by choosing, at each round t, the class i_t with the smallest lower confidence bound on the criterion:

    L_n(f_n^i) − γ_i(n) − √(log K / n) + γ_i(T n_i),

where n is the size of the sample that we have allocated already to class i.
Multi-Armed Bandits for Model Selection

At each round, pick the class i_t with the smallest lower confidence bound, and allocate an additional sample of size n_{i_t} to class i_t. The regret analysis of the upper-confidence-bound algorithm (Auer et al., 2002) extends to give oracle inequalities.
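A sketch of that allocation loop under the talk's assumptions; `emp_risk` stands in for (incrementally) computing the empirical risk of class i's ERM on its current sample, and the specific helper names and penalties below are illustrative:

```python
import math

def lcb_allocate(T, n, emp_risk, gamma):
    """Allocate a budget of T computational quanta across K model classes.

    At each round, pick the class i_t with the smallest lower confidence
    bound on  inf_{f in F_i} L(f) + gamma_i(T n_i):

        L_m(f_m^i) - gamma_i(m) - sqrt(log K / m) + gamma_i(T n_i),

    where m is the sample size already allocated to class i, then give
    class i_t one more quantum (an additional sample of size n[i_t]).

    emp_risk(i, m) -> empirical risk of class i's ERM at sample size m
    gamma(i, m)    -> penalty gamma_i(m)
    Returns the final sample size allocated to each class.
    """
    K = len(n)
    counts = list(n)                     # one quantum per class to start
    for _ in range(T - K):
        def lcb(i):
            m = counts[i]
            return (emp_risk(i, m) - gamma(i, m)
                    - math.sqrt(math.log(K) / m) + gamma(i, T * n[i]))
        i_t = min(range(K), key=lcb)
        counts[i_t] += n[i_t]
    return counts
```

With a good class 0 (low risk) and a worse class 1, most of the budget flows to class 0 once the confidence intervals shrink; suboptimal classes receive only enough samples to rule them out.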
Oracle inequality under a separation assumption

Define

    i* = argmin_i ( inf_{f ∈ F_i} L(f) + γ_i(T n_i) ),   and assume γ_i(n) = c_i /√n,
    Δ_i = ( inf_{f ∈ F_i} L(f) + γ_i(T n_i) ) − ( inf_{f ∈ F_{i*}} L(f) + γ_{i*}(T n_{i*}) ).

Theorem. Let T_i(T) be the number of times class i is queried. There are constants C, κ_1, κ_2 such that with probability at least 1 − κ_1/(T K^4),

    T_i(T) ≤ (C / n_i) ( (c_i + κ_2 √(log T)) / Δ_i )^2.
Oracle inequality under a separation assumption

Define

    i* = argmin_i ( inf_{f ∈ F_i} L(f) + γ_i(T n_i) ),
    Δ_i = ( inf_{f ∈ F_i} L(f) + γ_i(T n_i) ) − ( inf_{f ∈ F_{i*}} L(f) + γ_{i*}(T n_{i*}) ).

If we can incrementally update the choice f_n^i, then the fraction of the budget that is assigned to a suboptimal class i is no more than O( log T / (n_i T Δ_i^2) ). This is essentially optimal (Lai and Robbins, 1985).
Oracle inequality without separation

Assume that functions in {F_i} map to a vector space and the loss ℓ(·, z) is convex. Define f̂_T = (1/T) Σ_{t=1}^T f_t, where the algorithm produces f_t ∈ F_{i_t} at time t.

Theorem. There is a constant κ such that with probability at least 1 − 2κ/(T K^3),

    L(f̂_T) ≤ min_{i ∈ {1,...,K}} ( inf_{f ∈ F_i} L(f) + γ_i(T n_i) ) + O( √( K max{log T, log K} / T ) ).

Note the linear dependence on K.
Summary

- For large-scale problems, data is cheap but computation is precious.
- Computational oracle inequalities for model selection: select a near-optimal model without wasting computation on the other models.
- A nested complexity hierarchy ensures cost logarithmic in the size of the hierarchy.
- If the hierarchy is not nested, the cost of model selection is linear in its size.
Open problems

- Fast rates: if the excess risk and the excess empirical risk are close, then penalized empirical risk minimization gives oracle inequalities with fast rates. Can this be exploited to give computational oracle inequalities in these cases?
- For nested hierarchies, the analysis relied on a coarse multiplicative cover of the penalty values. If the penalties are data-dependent, when is this approach possible?
- What other structures on function classes lead to good computational oracle inequalities?
Fast Rates

Theorem. For F_1 ⊆ F_2 ⊆ ... and γ_1(n) ≤ γ_2(n) ≤ ..., if

    sup_i sup_{f ∈ F_i} ( L(f) − L(f_i*) − 2 (L_n(f) − L_n(f_i*)) − γ_i(n) ) ≤ 0,
    sup_i sup_{f ∈ F_i} ( L_n(f) − L_n(f_i*) − 2 (L(f) − L(f_i*)) − γ_i(n) ) ≤ 0,

then L(f_n) ≤ inf_i ( L(f_i*) + 9 γ_i(n) ), where f_n minimizes L_n(f_n^i) + 7 γ_i(n)/2 and f_i* = argmin_{f ∈ F_i} L(f).

Examples: convex losses, classification with low noise.

A large increase in the expected minimal empirical risk with sample size makes it challenging to avoid sometimes choosing a class that is too large.