Computational and Statistical Learning Theory

1 Computational and Statistical Learning Theory
TTIC 31120, Prof. Nati Srebro
Lecture 7: Computational Complexity of Learning; Agnostic Learning

2 Hardness of Learning via Crypto
Assumption: no poly-time algorithm computes the cube root b -> b^{1/3} mod K for a non-negligible fraction of b, where K = pq (p, q primes with 3 not dividing p-1 or q-1). Write f_K(a) = a^3 mod K.
- (K, b) -> f_K^{-1}(b): very hard (no poly-time algorithm for a non-negligible fraction of (K, b))
- (K, a) -> f_K(a) = a^3 mod K: easy
Hard to learn H = { h_K : h_K(b, i) = (f_K^{-1}(b))_i }, even though, given the trapdoor D_K, (b, i) -> (f_K^{-1}(b))_i is easy (e.g. polytime) ==> hard to learn polytime functions.
For every K, h_K is in H, and given D_K the map a -> a^{1/3} mod K is computable using a log-depth logic circuit, and hence using a log-depth neural net:
hard to learn H ==> hard to learn log-depth circuits ==> hard to learn log-depth NNs
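
To make the trapdoor structure concrete, here is a small toy sketch (my own illustration, not part of the lecture; toy-sized primes chosen so that 3 does not divide p-1 or q-1, and Python 3.8+ for the modular inverse): cubing mod K is easy, inverting is easy given the trapdoor D_K, and the hypothesis h_K just reads off one bit of the inversion.

import_note = None  # pure standard library

p, q = 1019, 1031            # toy primes with 3 not dividing p-1 or q-1 (real keys are far larger)
assert (p - 1) % 3 != 0 and (q - 1) % 3 != 0
K = p * q
phi = (p - 1) * (q - 1)
d = pow(3, -1, phi)          # the trapdoor D_K: inverse of 3 modulo phi(K) (Python 3.8+)

def f(a):                    # the easy direction: a -> a^3 mod K
    return pow(a, 3, K)

def f_inv(b):                # easy only given the trapdoor d
    return pow(b, d, K)

def h(b, i):                 # the hypothesis h_K(b, i): the i-th bit of f_K^{-1}(b)
    return (f_inv(b) >> i) & 1

a = 123456
b = f(a)
assert f_inv(b) == a
print([h(b, i) for i in range(4)])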

3 Hardness of Learning via Crypto
- Public-key crypto is possible ==> hard to learn poly-time functions
- Hardness of the discrete cube root ==> hard to learn log(n)-depth logic circuits ==> hard to learn log(n)-depth, poly-size neural networks
- Hardness of breaking RSA ==>
  - hard to learn poly-length logical formulas
  - hard to learn poly-size automata (equivalently, regular expressions)
  - hard to learn push-down automata
  - for some depth d, hard to learn poly-size depth-d threshold circuits (the output of a unit is one iff the number of its input units that are one exceeds a threshold)
  - hard to learn O(1)-depth, poly-size neural networks
- Hardness of lattice-shortest-vector based cryptography ==>
  - hard to learn intersections of n^r halfspaces (for any r > 0)
  - hard to learn depth-2 neural networks with n^r hidden units

4 Intersections of Halfspaces
H_n^{k(n)} = { x -> AND_{i=1}^{k(n)} [⟨w_i, x⟩ > 0] : w_1, ..., w_{k(n)} in R^n }
The (unique) shortest lattice vector problem: SVP(v_1, ..., v_n in R^n) = argmin over integers a_1, ..., a_n (not all zero) of ||a_1 v_1 + a_2 v_2 + ... + a_n v_n||
O(n^1.5)-uSVP: only required to return the shortest vector if the next-shortest is O(n^1.5) times longer
O(n^1.5)-uSVP not in RP ==> the lattice-based cryptosystem is secure ==> for any r > 0, hard to learn H_n^{k(n) = n^r} ==> hard to learn 2-layer NNs with n^r hidden units
[Klivans & Sherstov]
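
To see why hardness transfers from intersections of halfspaces to 2-layer networks, here is a small sketch (mine, using numpy; not from the slides) of the same predictor written both ways: as an intersection of k halfspaces and as a depth-2 threshold network with k hidden units.

import numpy as np

def intersection_of_halfspaces(W, x):
    # W: (k, n) matrix whose rows are w_1, ..., w_k; predict 1 iff x lies in all k halfspaces
    return int(np.all(W @ x > 0))

def two_layer_threshold_net(W, x):
    # The same function as a depth-2 network: k threshold units, then an AND (threshold k) on top
    hidden = (W @ x > 0).astype(int)        # hidden unit i fires iff <w_i, x> > 0
    return int(hidden.sum() >= W.shape[0])  # output fires iff all k hidden units fire

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)
assert intersection_of_halfspaces(W, x) == two_layer_threshold_net(W, x)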

5 Hardness of Learning via Crypto
- (K, b) -> f_K^{-1}(b): very hard (no poly-time algorithm for a non-negligible fraction of (K, b))
- Easy to generate a random key pair (K, D_K)
- (K, a) -> f_K(a): easy
- Given D_K, b -> f_K^{-1}(b) is easy (e.g. polytime)
Hard to learn H = { h_K : h_K(b, i) = (f_K^{-1}(b))_i } ==> hard to learn polytime functions

6 Hardness of Learning via Crypto
- (K, b) -> f_K^{-1}(b): very hard (no poly-time algorithm for a non-negligible fraction of (K, b))
- Easy to generate a random key pair (K, D_K)
- No poly-time algorithm for all K and almost all b
- (K, a) -> f_K(a): easy
- Given D_K, b -> f_K^{-1}(b) is easy (e.g. polytime)
Hard to learn H = { h_K : h_K(b, i) = (f_K^{-1}(b))_i } ==> hard to learn polytime functions

7 Hardness of Learning: Take II
Recall how we proved hardness of proper learning: a reduction from deciding consistency with H. If we had an efficient proper learner, we could train it and find a consistent hypothesis in H whenever one exists.
Problem: if learning is not proper, the learner might return a good hypothesis not in H, even though D is not consistent with H.
Instead: a reduction from deciding between two possibilities:
- The sample is consistent with H: for every consistent sample, return 1 w.p. 3/4 (over the randomization in the algorithm)
- The sample comes from a random distribution, e.g. sampled so that the labels y are independent of x: for all but a negligible fraction of samples S ~ D^m, return 0 w.p. 3/4
[Amit Daniely]
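
The reduction can be pictured as follows: an efficient (possibly improper) learner yields such a distinguisher by training on part of the sample and thresholding the held-out error. This is only a schematic sketch; the black-box learner `learn`, the threshold, and the toy demo are my own illustrative choices, not the construction from the lecture.

import random

def distinguisher(sample, learn, holdout_frac=0.5, threshold=0.25):
    # If `learn` is a good (possibly improper) learner it does well on a sample consistent
    # with H, while no predictor can do well when labels are independent of x,
    # so the held-out error separates the two cases.
    random.shuffle(sample)
    split = int(len(sample) * (1 - holdout_frac))
    train, holdout = sample[:split], sample[split:]
    h = learn(train)                                          # black-box learner
    err = sum(h(x) != y for x, y in holdout) / len(holdout)
    return 1 if err < threshold else 0

# Toy demo with a trivial "learner" that just predicts the training-set majority label.
def majority_learner(train):
    maj = 1 if sum(y for _, y in train) >= 0 else -1
    return lambda x: maj

consistent = [(i, 1) for i in range(200)]                        # labels explained by a simple rule
random_lbls = [(i, random.choice([-1, 1])) for i in range(200)]  # labels independent of x
print(distinguisher(consistent, majority_learner), distinguisher(random_lbls, majority_learner))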

8 Hardness Relative to RSAT
RSAT assumption: for some f(K) = ω(1), there is no poly-time randomized algorithm that gets as input a K-SAT formula with n^{f(K)} constraints and:
- if the input is satisfiable, outputs 1 w.p. 3/4 (over the randomization in the algorithm)
- if each constraint is generated independently and uniformly at random, then, with probability approaching 1 (as n -> infinity) over the formula, outputs 0 w.p. 3/4 (over the randomization in the algorithm)
Theorem: under the RSAT assumption,
- poly-length DNFs are not efficiently PAC learnable (e.g. h(x) = (x_1 AND x_7 AND x_15 AND x_17) OR (x_2 AND x_24))
- intersections of ω(1) halfspaces are not efficiently PAC learnable
- 2-layer neural networks with O(log log log n) hidden units are not efficiently PAC learnable
[Amit Daniely]
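
For intuition about the second branch of the assumption, here is a small sketch (my own; the exact random model in the RSAT assumption may differ in details) of drawing each constraint of a K-SAT formula independently and uniformly at random, together with a satisfaction check.

import random

def random_ksat(n, m, k=3, seed=0):
    # Each of the m constraints is drawn independently and uniformly:
    # k distinct variables out of n, each negated or not with probability 1/2.
    rng = random.Random(seed)
    formula = []
    for _ in range(m):
        chosen = rng.sample(range(1, n + 1), k)
        clause = [v if rng.random() < 0.5 else -v for v in chosen]
        formula.append(clause)
    return formula

def satisfies(assignment, formula):
    # assignment: dict var -> bool; a clause is satisfied if at least one literal is true
    return all(any(assignment[abs(l)] == (l > 0) for l in clause) for clause in formula)

phi = random_ksat(n=20, m=20 ** 2)   # e.g. n^{f(K)} constraints with f(K) = 2
print(satisfies({v: True for v in range(1, 21)}, phi))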

9 Hardness of Learning
Efficiently properly learnable: axis-aligned rectangles in n dimensions; halfspaces in n dimensions; conjunctions on n variables
Efficiently learnable, but not properly: 3-term DNFs
Not efficiently learnable: DNF formulas of size poly(n); generic logical formulas of size poly(n); neural nets with at most poly(n) units; functions computable in poly(n) time

10 Realizable vs Agnostic
Definition: a family {H_n} of hypothesis classes is efficiently properly PAC-learnable if there exists a learning rule A such that for every n and every ε, δ > 0 there is m(n, ε, δ) such that for every D with L_D(h) = 0 for some h in H_n:
  w.p. at least 1 - δ over S ~ D^{m(n,ε,δ)},  L_D(A(S)) ≤ ε,
A(S)(x) can be computed in time poly(n, 1/ε, log(1/δ)), and A always outputs a predictor in H_n.
Definition: a family {H_n} of hypothesis classes is efficiently properly agnostically PAC-learnable if there exists a learning rule A such that for every n and every ε, δ > 0 there is m(n, ε, δ) such that for every D:
  w.p. at least 1 - δ over S ~ D^{m(n,ε,δ)},  L_D(A(S)) ≤ inf_{h in H_n} L_D(h) + ε,
A(S)(x) can be computed in time poly(n, 1/ε, log(1/δ)), and A always outputs a predictor in H_n.

11 Conditions for Efficient Agnostic Learning
ERM_H(S) = argmin_{h in H} L_S(h)
Claim: if VCdim(H_n) ≤ poly(n), each h in H_n is computable in time poly(n), and there is a poly-time (in the size of its input) algorithm for ERM_H (i.e. one that returns some empirical risk minimizer), then {H_n} is efficiently agnostically properly PAC learnable.
AGREEMENT_H(S, k) = 1 iff there exists h in H with L_S(h) ≤ 1 - k/|S|
Claim: if {H_n} is efficiently properly agnostically PAC learnable, then AGREEMENT_H is in RP.
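
As a minimal illustration of the first claim (ERM as the computational bottleneck), here is a sketch of ERM by exhaustive search over a small finite class; the toy class of integer thresholds is my own choice, picked only because enumeration there is obviously poly-time.

def empirical_risk(h, S):
    return sum(h(x) != y for x, y in S) / len(S)

def erm(H, S):
    # Return any minimizer of the empirical 0/1 risk over the (finite) class H.
    return min(H, key=lambda h: empirical_risk(h, S))

# Toy class: threshold predictors on the integers 0..9 (finite, so enumeration suffices).
H = [lambda x, t=t: 1 if x >= t else -1 for t in range(10)]
S = [(x, 1 if x >= 4 else -1) for x in range(10)]
best = erm(H, S)
print(empirical_risk(best, S))   # 0.0 on this realizable toy sample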

12 What is Properly Agnostically Learnable?
- Poly-time functions? No! (not even in the realizable case)
- Poly-length logical formulas? No! (not even in the realizable case)
- Poly-size depth-2 neural networks? No! (not even in the realizable case)
- Halfspaces (linear predictors)? X_n = {0,1}^n, H_n = { x -> [⟨w, x⟩ > 0] : w in R^n }.
  Claim: AGREEMENT_H is NP-hard (optional HW problem).
  Conclusion: if NP ≠ RP, halfspaces are not efficiently properly agnostically learnable. No!
- Conjunctions? Also NP-hard! No!
- Unions of segments on the line? X_n = [0,1], H_n = { x -> OR_{i=1}^n [a_i ≤ x ≤ b_i] : a_i, b_i in [0,1] }. Efficiently properly agnostically PAC learnable! Yes!
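
The positive example at the end of the slide is learnable because its ERM problem is tractable. Below is a sketch (my own, assuming distinct x values) of exact ERM for unions of at most n intervals on the line by dynamic programming over the sorted points; combined with the VC bound, this gives the efficient proper agnostic learner.

def erm_union_of_intervals(S, n_intervals):
    # Exact ERM for unions of at most `n_intervals` intervals on the line, by dynamic
    # programming over the points sorted by x.  State: (intervals started so far,
    # whether the current point is predicted +1), value: minimum mistakes so far.
    pts = sorted(S)                      # list of (x, y) with y in {+1, -1}
    INF = float("inf")
    dp = {(0, False): 0}
    for _, y in pts:
        new = {}
        for (k, inside), cost in dp.items():
            for nxt in (False, True):
                k2 = k + (1 if nxt and not inside else 0)   # opening a new interval
                if k2 > n_intervals:
                    continue
                mistake = (y == 1) != nxt                    # predict +1 iff nxt
                key = (k2, nxt)
                if cost + mistake < new.get(key, INF):
                    new[key] = cost + mistake
        dp = new
    return min(dp.values())

S = [(0.1, 1), (0.2, -1), (0.3, 1), (0.5, 1), (0.7, -1), (0.9, 1)]
print(erm_union_of_intervals(S, n_intervals=2))   # minimum number of training mistakes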

13 Source of the Hardness
min_{h in H} Σ_i ℓ(h_w(x_i); y_i), where h_w(x) = ⟨w, x⟩ and ℓ_01(h(x); y) = [y·h(x) ≤ 0]
[Figure: ℓ_01(h(x); y = 1) and ℓ_sqr(h(x); y = 1) plotted as functions of h(x) in R: the 0/1 loss is a non-convex step at 0, while the squared loss is a smooth convex curve]

14 Using a Surrogate Loss
min_{h in H} Σ_i ℓ(h_w(x_i); y_i)
Instead of ℓ_01(z; y), use a surrogate ℓ(z; y) such that:
- for every y, ℓ(z; y) is convex in z (and so easy to optimize)
- for every z, y: ℓ_01(z; y) ≤ ℓ(z; y)
Examples:
ℓ_sqr(z; y) = (y - z)^2
ℓ_hinge(z; y) = [1 - yz]_+ = max{0, 1 - yz}
ℓ_logistic(z; y) = log(1 + exp(-yz))
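
A quick numerical check of the two conditions (convexity in z and pointwise domination of the 0/1 loss), added by me; note that the logistic surrogate is written with a base-2 logarithm so that it equals 1 at z = 0 and the upper bound holds there, whereas the natural-log version dips slightly below 1 at the decision boundary.

import numpy as np

z = np.linspace(-3, 3, 601)            # real-valued prediction; take y = +1 (y = -1 is symmetric)
l01 = (z <= 0).astype(float)           # 0/1 loss: error iff y*z <= 0
l_sqr = (1.0 - z) ** 2                 # squared loss (y - z)^2
l_hinge = np.maximum(0.0, 1.0 - z)     # hinge loss [1 - yz]_+
l_log2 = np.log2(1.0 + np.exp(-z))     # logistic loss with base-2 log, so it passes through 1 at z = 0

# Each surrogate is convex in z and upper-bounds the 0/1 loss pointwise.
assert np.all(l_sqr >= l01) and np.all(l_hinge >= l01) and np.all(l_log2 >= l01)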

15 Agnostically Learning Halfspaces with the Hinge Loss
H = { x -> [⟨w, x⟩ > 0] : w in R^n },  H' = { x -> ⟨w, x⟩ : w in R^n }
argmin_{h in H'} (1/m) Σ_i ℓ_hinge(h(x_i); y_i) = argmin_{w in R^n} (1/m) Σ_i [1 - y_i⟨w, x_i⟩]_+
Use linear programming:
  min_{w in R^n, ξ in R^m} Σ_i ξ_i   s.t.   y_i⟨w, x_i⟩ ≥ 1 - ξ_i,  ξ_i ≥ 0
(at the optimum, ξ_i = [1 - y_i⟨w, x_i⟩]_+)
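
A minimal sketch of the linear program above using scipy.optimize.linprog; the choice of solver and the synthetic data are my own, since the slide only specifies the LP itself.

import numpy as np
from scipy.optimize import linprog

def hinge_erm_lp(X, y):
    # Hinge-loss ERM for a halfspace as an LP:
    #   min sum_i xi_i   s.t.   y_i <w, x_i> >= 1 - xi_i,   xi_i >= 0
    # Decision variables are z = [w (n entries), xi (m entries)].
    m, n = X.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])
    # Inequality  -y_i x_i . w - xi_i <= -1  encodes  y_i <w, x_i> + xi_i >= 1
    A_ub = np.hstack([-y[:, None] * X, -np.eye(m)])
    b_ub = -np.ones(m)
    bounds = [(None, None)] * n + [(0, None)] * m   # w free, xi nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(40))
w = hinge_erm_lp(X, y)
print(np.mean(np.sign(X @ w) != y))   # training 0/1 error of the hinge-loss minimizer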

16 Does Minimizing a Surrogate Loss Also Minimize the 0/1 Loss?
ℓ_01(z; y) = [yz ≤ 0] ≤ [1 - yz]_+ = ℓ_hinge(z; y)
Realizable case: if L_S^01(x -> ⟨w, x⟩) = 0 for some w, then L_S^hinge(x -> ⟨w', x⟩) = 0 for w' = w / min_i y_i⟨w, x_i⟩ (rescale so that every margin is at least 1). Hence L_S^hinge(ERM_hinge(S)) = 0, and so L_S^01(ERM_hinge(S)) ≤ L_S^hinge(ERM_hinge(S)) = 0.
Non-realizable case: what can we ensure by minimizing the surrogate loss???

17 Can we Efficiently Agnostically Learn?
- Minimizing a surrogate loss can be very bad: it might yield L_01(ŵ) = 0.49 even when min_w L_01(w) = 0.001
- Halfspaces are not efficiently properly agnostically PAC learnable: finding the halfspace that minimizes the number of errors on a training set is NP-hard
- What about improper learning? Next week we'll reduce learning intersections of halfspaces to agnostically learning halfspaces

18 Why Study Hardness?
- Understand why machine learning is essentially a computational problem
- Understand why we must sometimes settle for a non-exact/heuristic approach (e.g. using a surrogate loss)
- Understand what we can never guarantee, and not try to guarantee it (e.g. we cannot guarantee learning with a large NN just because some small NN completely explains the data)
- Understand, and be able to argue about, sample complexity gaps between the statistical limit (using any learning rule) and the computational limit (using only tractable learning rules)

19 Weak vs Strong Learning
Recall the definition of (realizable) PAC learning of H using rule A(·): for any D such that inf_{h in H} L_D(h) = 0 and any ε, δ > 0, using m(ε, δ) samples, w.p. at least 1 - δ over S ~ D^{m(ε,δ)}, L_D(A(S)) < ε.
A(·) is a weak learner for H if there exist ε < 1/2, δ < 1 and m such that for any D with inf_{h in H} L_D(h) = 0: w.p. at least 1 - δ over S ~ D^m, L_D(A(S)) < ε  (e.g. ε = 0.49 and 1 - δ = 0.01).
If H is weakly learnable, is it also strongly learnable? Yes: H weakly learnable ==> VCdim(H) < infinity ==> H is (strongly) learnable.
If {H_n} is efficiently weakly learnable, is it also efficiently strongly learnable? If we have access to an (efficient) weak learner A(·), can we use it to build an (efficient) strong learner?

20 The Boosting Problem
Boosting the confidence: if the learning algorithm works only with some very small fixed probability 1 - δ_0 (e.g. 1 - δ_0 = 0.01), can we construct a new algorithm that works with arbitrarily high probability 1 - δ (for any δ > 0)?
Boosting the error: if the learning algorithm only returns a predictor guaranteed to be slightly better than chance, i.e. with error ε_0 = 1/2 - γ < 1/2 (for some fixed γ > 0), can we construct a new algorithm that achieves arbitrarily low error ε?

21 Boosting the Confidence
For any δ:
1. For i = 1..k, with k = log(2/δ) / log(1/δ_0): collect m_0 independent samples S_i and set h_i = A(S_i)
   (w.p. at least 1 - δ/2, at least one h_i has L_D(h_i) ≤ ε_0)
2. Collect m_val = (4/ε^2) log(4k/δ) additional independent samples S_val
3. Return ĥ = argmin_{h in {h_1, ..., h_k}} L_{S_val}(h)   (ERM over a class of size k)
Claim: w.p. at least 1 - δ, L_D(ĥ) ≤ ε_0 + ε
Total samples used: O( m_0(ε_0) · log(1/δ) + log(1/δ)/ε^2 )
An efficient algorithm for some δ_0 < 1 and all ε_0 > 0, with runtime and sample complexity poly(n, 1/ε_0), yields an efficient algorithm for any δ > 0 with runtime poly(n, 1/ε, log(1/δ)).
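
A sketch of the confidence-boosting procedure; the unreliable base learner, the threshold class, and the parameter values in the demo are my own toy instantiation, not the values on the slide.

import random

def boost_confidence(A, sample_source, m0, k, m_val):
    # Run the unreliable learner k times on fresh samples, then pick the candidate with the
    # lowest error on an independent validation sample (ERM over a class of k candidates).
    candidates = [A(sample_source(m0)) for _ in range(k)]
    S_val = sample_source(m_val)
    return min(candidates, key=lambda h: sum(h(x) != y for x, y in S_val))

# Toy demo: a base learner that returns a good predictor only with probability 0.05.
def sample_source(m):
    xs = [random.random() for _ in range(m)]
    return [(x, 1 if x >= 0.5 else -1) for x in xs]

def unreliable_learner(S):
    if random.random() < 0.05:
        return lambda x: 1 if x >= 0.5 else -1      # the "good" predictor
    return lambda x: random.choice([-1, 1])          # a useless predictor

h = boost_confidence(unreliable_learner, sample_source, m0=50, k=200, m_val=500)
test = sample_source(2000)
print(sum(h(x) != y for x, y in test) / len(test))   # close to 0 with high probability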

22 Boosting the Error?
What if we can only find a predictor with relatively high error? (We can always find a predictor with error 1/2.)
What if we have an algorithm A(·) that, for any source distribution D such that inf_{h in H} L_D(h) = 0, finds L_D(A(S)) ≤ 1/2 - γ?
Can we use A(·) to find a predictor with arbitrarily low error?

23 Example: Weak Learning with a Weak Class
X = R^2, H = axis-aligned rectangles
Decision stumps: B = { x -> s · sign(θ - x_i) : i = 1, 2, s = ±1, θ in R }
Claim: for any D, if some h in H has L_D(h) = 0, then some h in B has L_D(h) ≤ 3/7 < 1/2
Since VCdim(B) = 3, with m = m_VC(ε = 0.001, δ = 0.9): w.p. at least 0.1 over S ~ D^m, L_D(ERM_B(S)) < 0.43
Conclusion: ERM_B(·) is a weak learner for H with ε = 0.43 < 0.5 and δ = 0.9 < 1
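
A sketch of ERM_B by exhaustive search over decision stumps (my own illustration; the rectangle used to generate labels is arbitrary). On data realizable by an axis-aligned rectangle, the best stump typically has training error well below 1/2, matching the claim.

import numpy as np

def erm_decision_stump(X, y):
    # Exhaustive ERM over stumps h(x) = s * sign(theta - x[i]) with i in {0, 1}, s = +/-1.
    # Candidate thresholds: midpoints between sorted coordinate values, plus the two extremes.
    m, d = X.shape
    best = (1.0, None)
    for i in range(d):
        vals = np.sort(X[:, i])
        thetas = np.concatenate([[vals[0] - 1], (vals[:-1] + vals[1:]) / 2, [vals[-1] + 1]])
        for theta in thetas:
            pred = np.where(X[:, i] < theta, 1, -1)
            for s in (1, -1):
                err = np.mean(s * pred != y)
                if err < best[0]:
                    best = (err, (i, theta, s))
    return best   # (training error, stump parameters)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.where((np.abs(X[:, 0]) < 0.5) & (np.abs(X[:, 1]) < 0.5), 1, -1)  # labels from a rectangle
print(erm_decision_stump(X, y))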
