Spectral Norm in Learning Theory: Some Selected Topics

Slide 1: Spectral Norm in Learning Theory: Some Selected Topics

Hans U. Simon (RUB), simon@lmi.rub.de
Homepage:

Slide 2: The Star of the Talk

$\lambda_1$

Slide 3: Spectral Norm of a Matrix

For a matrix $A \in \mathbb{R}^{m \times n}$, consider the singular values $\sigma_1(A) \ge \sigma_2(A) \ge \cdots$. If $A$ is symmetric and positive semidefinite, they coincide with the eigenvalues $\lambda_1(A) \ge \lambda_2(A) \ge \cdots$.

Definition of Spectral Norm: $\|A\| := \sigma_1(A) = \sqrt{\lambda_1(A^\top A)}$
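As a quick sanity check (my own illustration, not part of the slides), the following numpy sketch verifies that the largest singular value, the square root of the largest eigenvalue of $A^\top A$, and numpy's built-in matrix 2-norm all agree:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Spectral norm three ways: largest singular value, square root of the
# largest eigenvalue of A^T A, and numpy's matrix 2-norm.
sigma_1 = np.linalg.svd(A, compute_uv=False)[0]   # singular values come back descending
lambda_1 = np.linalg.eigvalsh(A.T @ A)[-1]        # eigvalsh returns ascending order
assert np.isclose(sigma_1, np.sqrt(lambda_1))
assert np.isclose(sigma_1, np.linalg.norm(A, 2))
print(sigma_1)
```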

Slide 4: Structure of the Talk

I. Statistical Query Learning and Correlation Query Learning
II. Ke Yang's Lower Bound on the Number of Statistical Queries
III. Other Bounds in Terms of the Spectral Norm
IV. The Hidden Number Problem from the Perspective of Learning Theory

Slide 5: Part I: SQ = CQ

Slide 6: Concept Learning

- $X$, a domain
- $x \in X$, an instance
- $D$, a probability distribution on $X$
- $(x, b) \in X \times \{-1, +1\}$, a labeled instance
- $f : X \to \{-1, +1\}$, a concept
- $F$, a class of concepts

Goal: Find a good approximation (hypothesis) $h : X \to \{-1, +1\}$ for an unknown target concept $f \in F$.

Slide 7: Statistical Query Learning (fixed distribution D)

- $h' : X \times \{-1, +1\} \to \{-1, +1\}$, a query function, whose expectation satisfies $E_D[h'(x, f(x))] = \Pr[h'(x, f(x)) = +1] - \Pr[h'(x, f(x)) = -1]$
- $\tau > 0$, a tolerance parameter

Information Gathering Mechanism: the LEARNER sends $(h', \tau)$ to the ORACLE; the ORACLE returns some $d$ to the LEARNER such that $E_D[h'(x, f(x))] - \tau \le d \le E_D[h'(x, f(x))] + \tau$.

After gathering enough information, the learner outputs a final hypothesis $h$.
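To make the mechanism concrete, here is a small simulation of an SQ oracle over a finite domain (an illustrative sketch, not from the talk; the target, distribution, and the random perturbation standing in for the adversary are all toy choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def sq_oracle(query, tau, domain, D, f):
    """Answer the statistical query (query, tau): return some d with
    |d - E_D[query(x, f(x))]| <= tau.  Here the oracle perturbs the
    exact expectation by a random amount within the tolerance."""
    exact = sum(D[x] * query(x, f(x)) for x in domain)
    return exact + rng.uniform(-tau, tau)

# Toy setup: X = {0,...,7} under the uniform distribution, target = parity of x.
domain = range(8)
D = {x: 1 / 8 for x in domain}
f = lambda x: +1 if bin(x).count("1") % 2 == 0 else -1

# Query function h': "does the least significant bit predict the label?"
h_prime = lambda x, b: +1 if (+1 if x % 2 == 0 else -1) == b else -1
print(sq_oracle(h_prime, tau=0.05, domain=domain, D=D, f=f))
```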

Slide 8: Statistical Query Learning (continued)

Efficiency of the learner is measured by:
- $q$, the number of specified query functions (including the final hypothesis)
- $\tau$, the smallest tolerance parameter ever used during learning

Success of the learner is measured by:
- $\varepsilon$, the probability of misclassification of the final hypothesis

Kearns' Result: An SQ learner can be simulated by a noise-tolerant PAC learner that has access to $\mathrm{poly}(q, 1/\tau, 1/\varepsilon)$ random examples.

Slide 9: Evaluation and Correlation Matrix

Equip the $\{-1, +1\}$-valued functions on $X$ with the following inner product:

$\langle h_1, h_2 \rangle_D := \sum_{x \in X} D(x) h_1(x) h_2(x) = E_D[h_1(x) h_2(x)]$

Associate with a concept class $F$ the evaluation matrix $E_F[x, f] := f(x)$ and the (positive semidefinite) correlation matrix $C_F[f_1, f_2] := \langle f_1, f_2 \rangle_D$. Column $f$ of matrix $E_F$ contains the function table for $f$.
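A toy instantiation of both matrices (my own example, not from the slides): take $X = \{0,1\}^3$ under the uniform distribution and let $F$ be the 8 parity functions $\chi_S(x) = (-1)^{\langle S, x \rangle}$, which are pairwise orthonormal, so $C_F$ is the identity:

```python
import numpy as np
from itertools import product

# X = {0,1}^3 under the uniform distribution; F = the 8 parities chi_S.
X = list(product([0, 1], repeat=3))
F = list(product([0, 1], repeat=3))                 # parities indexed by subsets S

E_F = np.array([[(-1) ** sum(s * xi for s, xi in zip(S, x)) for S in F]
                for x in X])                        # E_F[x, f] = f(x); columns = function tables

C_F = E_F.T @ E_F / len(X)                          # C_F[f1, f2] = <f1, f2>_D under uniform D
assert np.allclose(C_F, np.eye(len(F)))             # distinct parities are orthonormal
print(np.linalg.norm(C_F, 2))                       # spectral norm ||C_F|| = 1
```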

Slide 10: Correlation Query Learning (fixed distribution D)

- $h : X \to \{-1, 0, +1\}$, a query function
- $\tau > 0$, a tolerance parameter
- $\gamma > 0$, the desired correlation between $h$ and $f$ (correlation $\gamma$ corresponds to misclassification rate $\frac{1}{2} - \frac{\gamma}{2}$)

Information Gathering Mechanism: the LEARNER sends $(h, \tau)$ to the ORACLE; the ORACLE returns some $c$ to the LEARNER such that $\langle h, f \rangle_D - \tau \le c \le \langle h, f \rangle_D + \tau$.

After gathering enough information, the learner outputs a final hypothesis $h$.

Slide 11: Equivalence of the SQ and the CQ Model

Proof Idea: Mutual conversion $h' \leftrightarrow h$ between query functions $h' : X \times \{-1, +1\} \to \{-1, +1\}$ and $h : X \to \{-1, 0, +1\}$ such that the queries $(h', \tau)$ and $(h, \tau)$ yield precisely the same amount of information.

Slide 12: First Conversion

Given $h' : X \times \{\pm 1\} \to \{\pm 1\}$, consider

$X_0(h') := \{x \in X : h'(x, +1) = h'(x, -1)\}$
$X_1(h') := \{x \in X : h'(x, +1) \ne h'(x, -1)\}$

and the following function $h : X \to \{-1, 0, +1\}$:

$h(x) := \begin{cases} 0 & \text{if } x \in X_0(h') \\ h'(x, +1) & \text{if } x \in X_1(h') \end{cases}$

Note that for all $x \in X_1(h')$ and $b = \pm 1$: $h'(x, b) = b \cdot h'(x, +1) = b \cdot h(x)$.
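The conversion is mechanical, so it transliterates directly into code (a sketch; the function name `sq_to_cq` is my own):

```python
def sq_to_cq(h_prime):
    """First conversion: turn an SQ query function
    h' : X x {-1,+1} -> {-1,+1} into a CQ query function
    h : X -> {-1, 0, +1}."""
    def h(x):
        if h_prime(x, +1) == h_prime(x, -1):   # x in X_0(h'): label-independent
            return 0
        return h_prime(x, +1)                  # x in X_1(h'): h'(x, b) = b * h(x)
    return h
```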

Slide 13: First Conversion (continued)

$E_D[h'(x, f(x))] = \sum_{x \in X_0(h')} D(x) h'(x, f(x)) + \sum_{x \in X_1(h')} D(x) h'(x, f(x))$
$= \underbrace{\sum_{x \in X_0(h')} D(x) h'(x, +1)}_{=: k(h')} + \sum_{x \in X_1(h')} D(x) f(x) h(x)$
$= k(h') + \langle h, f \rangle_D$

Since the constant $k(h')$ depends only on $h'$ and $D$, not on the target $f$, an answer to either query can be converted into an equally accurate answer to the other.

Reverse Direction: Exploit the same relation between $E_D[h'(x, f(x))]$ and $\langle h, f \rangle_D$ again!

Slide 14: Part II: Ke Yang's Lower Bound

Slide 15: Yang's Bound

Theorem (Ke Yang, COLT 2002, JCSS 2005): The smallest possible number $q(F)$ of statistical queries is lower-bounded as follows:

$\sum_{i=1}^{q(F)} \lambda_i(C_F) \ge |F| \cdot \min\{\gamma^2, \tau^2\}$

Corollary: Because $\|C_F\| = \lambda_1(C_F)$:

$q(F) \ge \frac{|F|}{\|C_F\|} \cdot \min\{\gamma^2, \tau^2\}$

Because $F$ is not easier to learn than any subclass $F' \subseteq F$:

$q(F) \ge \sup_{F' \subseteq F} \left( \frac{|F'|}{\|C_{F'}\|} \right) \cdot \min\{\gamma^2, \tau^2\}$
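The corollary is easy to evaluate numerically (an illustrative computation of mine, not from the talk). For the class of all parities on $\{0,1\}^n$ under the uniform distribution, $C_F$ is the identity matrix, so the bound becomes $|F| \cdot \min\{\gamma^2, \tau^2\} = 2^n \cdot \min\{\gamma^2, \tau^2\}$, exponential in $n$ for fixed $\gamma, \tau$:

```python
import numpy as np
from itertools import product

def yang_lower_bound(E_F, gamma, tau):
    """Corollary of Yang's bound under the uniform distribution:
    q(F) >= (|F| / ||C_F||) * min(gamma^2, tau^2)."""
    m, n = E_F.shape                          # m = |X|, n = |F|
    C_F = E_F.T @ E_F / m                     # correlation matrix
    return n * min(gamma, tau) ** 2 / np.linalg.norm(C_F, 2)

# Evaluation matrix of the 2^n parities on {0,1}^n (a Hadamard-type matrix).
n = 8
X = list(product([0, 1], repeat=n))
E_F = np.array([[(-1) ** sum(s * xi for s, xi in zip(S, x)) for S in X]
                for x in X])
print(yang_lower_bound(E_F, gamma=0.1, tau=0.1))   # = 2^n * 0.01, exponential in n
```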

Slide 16: Sketch of Proof (Adversary Argument)

The CQ-oracle returns the answer 0 as long as possible. Say $(h_1, \tau_1), \dots, (h_{q-1}, \tau_{q-1})$ are answered 0. Let $h_q$ be the next query function (the final hypothesis if $q = q(F)$). Let $Q := \mathrm{span}(h_1, \dots, h_q)$. Let $f_Q$ denote the projection of $f$ onto the subspace $Q$. With these notations:

$\sum_{i=1}^{q(F)} \lambda_i(C_F) \ge \sum_{i=1}^{q} \lambda_i(C_F) \ge \sum_{f \in F} \|f_Q\|^2 \ge |F| \cdot \min\{\gamma^2, \tau^2\}$

Slide 17: Characterization of Weak SQ-Learnability

The following statements are equivalent:
1. $(F_n)_{n \ge 1}$ admits a weak polynomial learner in the SQ model.
2. The statistical query dimension of $(F_n)_{n \ge 1}$ is polynomially bounded.
3. $\sup_{F'_n \subseteq F_n} \frac{|F'_n|}{\|C_{F'_n}\|}$ is polynomially bounded.

1. ⟺ 2.: Blum, Furst, Jackson, Kearns, Mansour, Rudich (STOC 1994).
2. ⟺ 3.: The statistical query dimension is polynomially related to $\sup_{F'_n \subseteq F_n} \frac{|F'_n|}{\|C_{F'_n}\|}$.

Slide 18: Part III: The Many Faces of $\lambda_1$

Slide 19: Spectral Norm and Half-space Embeddings

A half-space embedding for a matrix $A \in \{-1, +1\}^{m \times n}$ places points $P_j$ (one per column) and half-spaces $H_i$ (one per row) in a $d$-dimensional space so that $P_j$ lies on the positive side $H_i^+$ exactly when $A_{i,j} = +1$. [Figure: half-space $H_i$ with positive side $H_i^+$, point $P_j$, and the margin, illustrating the case $A_{i,j} = +1$.]

$d(A)$ := the smallest possible dimension = the smallest rank of a sign-equivalent matrix
$\mu(A)$ := the largest possible guaranteed margin

Slide 20: Results by Forster (CCC 2001, JCSS 2002)

For every $A \in \{-1, +1\}^{m \times n}$:

$d(A) \ge \frac{\sqrt{mn}}{\|A\|} \qquad \mu(A) \le \frac{\|A\|}{\sqrt{mn}}$

In particular, for the Hadamard matrix $H_n \in \{-1, +1\}^{2^n \times 2^n}$:

$d(H_n) \ge 2^{n/2}$ (probabilistic communication complexity $\ge n/2$)
$\mu(H_n) \le 2^{-n/2}$
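A quick numerical check (not part of the slides): the Sylvester construction gives $H_n$, whose spectral norm is exactly $2^{n/2}$ because $H_n H_n^\top = 2^n I$, so Forster's dimension bound evaluates to $2^{n/2}$:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of H_n in {-1,+1}^{2^n x 2^n}."""
    H = np.array([[1]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

n = 6
H = hadamard(n)
m = H.shape[0]                                   # m = 2^n
spec = np.linalg.norm(H, 2)
assert np.isclose(spec, 2 ** (n / 2))            # ||H_n|| = 2^{n/2}
print("d(H_n) >=", np.sqrt(m * m) / spec)        # Forster's bound: 2^{n/2}
```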

Slide 21: Application: Threshold Circuits of Exponential Size

[Figure: a depth-2 threshold circuit. Boolean input units $x_1, \dots, x_n, y_1, \dots, y_n$ feed hidden units with bounded weights (bound $W$), each computing a matrix of small rank; the top unit has unbounded weights and outputs $H = \mathrm{sign}(\hat H)$, where $\hat H(x, y)$ is the matrix before thresholding.]

From Forster's theorem: the rank of $\hat H$ must be at least $2^{n/2}$. By subadditivity of rank: this is only possible when there are exponentially many hidden units.

Result by Forster, Krause, Lokam, Mubarakzjanov, Schmitt, H.U.S. (FSTTCS 2001): there must be at least $\frac{2^{n/2}}{nW + 1}$ units in the hidden layer.

Slide 22: Sufficient Conditions for Weak SQ Learnability

- $d(E_{F_n})$ is $\mathrm{poly}(n)$-bounded.
- $\mu(E_{F_n})^{-1}$ is $\mathrm{poly}(n)$-bounded.
- The probabilistic communication complexity of $E_{F_n}$ in the unbounded-error model is $O(\log n)$-bounded.
- Functions from $F_n$ can be evaluated by a depth-2 threshold circuit (unbounded weights at the top unit, $\mathrm{poly}(n)$-bounded weights at the hidden units) of $\mathrm{poly}(n)$ size.

Slide 23: Part IV: Can We Learn Hidden Numbers?

Slide 24: Prerequisites for the Hidden Number Problem

- $p$, an $n$-bit prime
- $\mathbb{Z}_p = \{0, \dots, p-1\}$, the smallest residues modulo $p$
- $(\mathbb{Z}_p, +, \cdot)$, the corresponding prime field
- $\mathbb{Z}_p^* = \{1, \dots, p-1\}$
- $(\mathbb{Z}_p^*, \cdot)$, the cyclic group of prime residues modulo $p$
- $g$, a generator of $\mathbb{Z}_p^*$
- $B : \mathbb{Z}_p^* \to \{-1, +1\}$, a binary predicate on prime residues

All arithmetic is understood modulo $p$.

Slide 25: The Hidden Number Problem (HNP[B])

Goal: Infer a hidden number $u \in \mathbb{Z}_p^*$ from observations.

Information-Gathering Mechanism: for each element $z \in \mathbb{Z}_p^*$, independently drawn at random (uniform distribution), the learner observes the pair $(z, B(u \cdot z))$.
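A minimal sketch of the example-drawing mechanism (my own illustration; the helper name `hnp_example` and the toy parameters are placeholder choices, and the parity predicate below merely stands in for a real predicate $B$):

```python
import secrets

def hnp_example(u, p, B):
    """Draw one labeled example for HNP[B]: a uniform z in Z_p^*
    together with the label B(u*z mod p)."""
    z = 1 + secrets.randbelow(p - 1)        # uniform over {1, ..., p-1}
    return z, B(u * z % p)

# Toy usage with a placeholder predicate (parity of the residue):
p, u = 101, 42
B = lambda r: +1 if r % 2 == 1 else -1
print(hnp_example(u, p, B))
```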

Slide 26: Motivation: Bit Security

Diffie-Hellman Function: $DH(g^a, g^b) = g^{ab}$.

Central Relation (Boneh, Venkatesan 1996; Vasco, Shparlinski 2000):

$g^{(b+r)a} \cdot g^{(b+r)x} = g^{ab + ar + xb + xr} = g^{(a+x)(b+r)} = DH(g^{a+x}, g^{b+r})$

Known: $n, p, g, g^a, g^b, r, x$. Unknown: $g^{ab}, g^{(b+r)a}$.
Hidden Number: $g^{(b+r)a}$. Random Instance: $g^{(b+r)x}$.
Note: $g^{b+r}$ should be a generator!

Slide 27: Casting HNP[B] as a Learning Problem

Put the problem into the concept learning framework:
- View $u$ as the target concept.
- View $z$ as a random instance.
- View $B(u \cdot z)$ as the correct classification label.

Note: The correlation between $u_1$ and $u_2$ can be expressed as $\Pr[B(u_1 z) = B(u_2 z)] - \Pr[B(u_1 z) \ne B(u_2 z)]$.

Assumption: The predicate $B$ must distinguish different hidden numbers $u_1 \ne u_2$:

$\Pr[B(u_1 z) = B(u_2 z)] - \Pr[B(u_1 z) \ne B(u_2 z)] \le 1 - \frac{1}{\mathrm{poly}(n)}$

Slide 28: Example of a Distinguishing Predicate

Consider the unbiased most significant bit:

$\mathrm{MSB}(z) := \begin{cases} +1 & \text{if } \frac{p+1}{2} \le z \le p-1 \\ -1 & \text{otherwise} \end{cases}$

Kiltz, H.U.S. (unpublished manuscript): The unbiased most significant bit distinguishes hidden numbers in the following strong sense: for $u_1 \ne u_2$,

$\Pr[\mathrm{MSB}(u_1 z) = \mathrm{MSB}(u_2 z)] - \Pr[\mathrm{MSB}(u_1 z) \ne \mathrm{MSB}(u_2 z)] \le \frac{2}{3}$
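For a small prime the $2/3$ bound can be checked exhaustively (my own experiment, not from the manuscript; the correlation below is exactly the left-hand side of the inequality, computed as $E_z[\mathrm{MSB}(u_1 z)\,\mathrm{MSB}(u_2 z)]$):

```python
from itertools import combinations

def msb(z, p):
    """Unbiased most significant bit: +1 iff (p+1)/2 <= z <= p-1."""
    return +1 if z >= (p + 1) // 2 else -1

def correlation(u1, u2, p):
    """E_z[MSB(u1*z) * MSB(u2*z)] over uniform z in Z_p^*."""
    return sum(msb(u1 * z % p, p) * msb(u2 * z % p, p)
               for z in range(1, p)) / (p - 1)

p = 101
worst = max(correlation(u1, u2, p)
            for u1, u2 in combinations(range(1, p), 2))
print(worst)   # per the stated bound, should stay below 2/3 for u1 != u2
```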

Slide 29: Proper PAC Learners Imply Bit Security

Theorem: If HNP[B] is properly and polynomially PAC-learnable under the uniform distribution, then bit $B$ of the Diffie-Hellman function is secure.

The proof translates similar proofs by Nguyen and Stern (2001) and by Vasco and Shparlinski (2000) into a learning-theoretic framework.

Conjecture: Such PAC learners do not exist.

Slide 30: Hidden Number Problem and SQ Learning

Let $M$ denote the multiplication table for $\mathbb{Z}_p^*$. Then $\mathrm{MSB} \circ M$ is the evaluation matrix for HNP[MSB].

Lemma (Kiltz, H.U.S., JCSS 2005): $\|\mathrm{MSB} \circ M\| = p^{1/2 + o(1)}$.

Because of the general relations (assuming the uniform distribution $D$)

$C_F = \frac{1}{|X|} E_F^\top E_F \quad \text{and} \quad \|C_F\| = \frac{1}{|X|} \|E_F\|^2,$

and because of Ke Yang's lower bound:

Corollary:
1. $q(\mathrm{HNP}[\mathrm{MSB}]) \ge \frac{(p-1)^2}{p^{1 + o(1)}} = p^{1 - o(1)}$.
2. HNP[MSB] is not weakly learnable in the SQ model.
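The evaluation matrix $\mathrm{MSB} \circ M$ is easy to build for small primes, so the lemma can be eyeballed numerically (my own sketch; the helper name `msb_matrix` is hypothetical, and the $p^{o(1)}$ factor hides terms that are visible at these small sizes, so the two printed columns agree only up to such factors):

```python
import numpy as np

def msb_matrix(p):
    """Evaluation matrix of HNP[MSB]: entry (z, u) = MSB(u*z mod p)."""
    msb = lambda r: +1 if r >= (p + 1) // 2 else -1
    return np.array([[msb(u * z % p) for u in range(1, p)]
                     for z in range(1, p)])

for p in [101, 257, 521]:
    E = msb_matrix(p)
    # Compare the spectral norm against p^{1/2}; the lemma predicts
    # growth like p^{1/2 + o(1)}.
    print(p, round(np.linalg.norm(E, 2), 1), round(p ** 0.5, 1))
```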

Slide 31: Conclusions

We have put the framework of SQ learning into a broader context, relating it to various (seemingly different) problems in learning theory, complexity theory, and cryptography. The key notion has been the spectral norm.

Slide 32: The End
