Leaving The Span Manfred K. Warmuth and Vishy Vishwanathan

1 Leaving The Span. Manfred K. Warmuth and Vishy Vishwanathan, UCSC and NICTA. Talk at NYAS Conference. Thanks to Dima and Jun.

2 Let's keep it simple: Linear Regression. Examples $(x_t, y_t)$. Linear hypothesis $w$. Predicts with $\hat{y}_t = w \cdot x_t$.

3 What if the data is not close to linear? [Figure: Original Space, axes $z_1$ and $z_2$.] Simply invent new variables/features :-)

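4 Close to linear in feature space. Embed instances into a feature space $\phi : \mathbb{R}^n \to \mathbb{R}^m$. Here $\phi(x_1, x_2) = (\underbrace{\arcsin(x_1)}_{x'_1}, \underbrace{x_2}_{x'_2})$. [Figure: Original Space with axes $z_1$, $z_2$; Feature Space with axes $z'_1 = z_1$, $z'_2 = \arcsin(z_2)$.]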

5 Does the expansion always work? Can you always improve things by inventing new features? Fitting the data may be - But is this learning?

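6 The Kernel Trick [BGV92]
If $w$ is a linear combination of expanded instances, then
$\hat{y} = \underbrace{\sum_t \alpha_t \phi(x_t)}_{w} \cdot \phi(x) = \sum_t \alpha_t \underbrace{\phi(x_t) \cdot \phi(x)}_{K(x_t, x)}$
The kernel function $K(x_t, x)$ is often efficient to compute. Example:
$\phi(\underbrace{(x_1, \dots, x_n)}_{n}) = \underbrace{(1, \dots, x_i, \dots, x_i x_j, \dots, x_i x_j x_k, \dots)}_{2^n \text{ products}}$
Kernel magic:
$K(x, z) = \phi(x) \cdot \phi(z) = \underbrace{\sum_{I \subseteq 1..n} \prod_{i \in I} x_i \prod_{i \in I} z_i}_{O(2^n) \text{ time}} = \underbrace{\prod_{i=1}^{n} (1 + x_i z_i)}_{O(n) \text{ time}}$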
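
The identity on this slide can be checked numerically. The following is a minimal sketch, not code from the talk: the function names `explicit_features`, `kernel_explicit` and `kernel_product` and the sample vectors are illustrative assumptions. It builds the $2^n$ subset products explicitly and compares their dot product with the $O(n)$ product form.

```python
from itertools import combinations
from math import prod

def explicit_features(x):
    """Map x in R^n to all 2^n subset products (the empty product is 1)."""
    n = len(x)
    return [prod(x[i] for i in subset)
            for size in range(n + 1)
            for subset in combinations(range(n), size)]

def kernel_explicit(x, z):
    """Dot product in the expanded feature space: O(2^n) time."""
    fx, fz = explicit_features(x), explicit_features(z)
    return sum(a * b for a, b in zip(fx, fz))

def kernel_product(x, z):
    """Closed form prod_i (1 + x_i * z_i): O(n) time."""
    return prod(1 + xi * zi for xi, zi in zip(x, z))

# Hypothetical sample vectors, chosen only for illustration.
x = [1, -2, 3, 0]
z = [2, 1, -1, 5]
print(len(explicit_features(x)))                      # 2^4 = 16 features
print(kernel_explicit(x, z), kernel_product(x, z))    # both values agree
```

The explicit version is only feasible for small $n$; the product form never touches an individual expanded feature, which is the point of the trick.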

7 Good news: many of our favorite algorithms can be kernelized: Linear Least Squares, Widrow-Hoff, Support Vector Machines, PCA, Simplex Algorithm, ... Kernel Trick: the weight vector is a linear combination of embedded instances; individual features are never accessed.

8 Linear combinations?
Representer Theorem [KW71]: $w = \operatorname{arginf}_{w} \left( \|w\|^2 + \eta \sum_t (w \cdot \phi(x_t) - y_t)^2 \right)$. The solution $w$ is a linear combination of the $\phi(x_t)$.
Rotation invariance [KWA97]: any algorithm whose predictions are not affected by rotating the instances in feature space must predict with a linear combination of embedded instances.
Sufficient conditions!

9 Linear or non-linear? :-( We give a problem for which kernel algorithms behave like linear algorithms. Embeddings don't help.

10 A hard problem. Hadamard matrix: $n$ instances (rows) and $n$ targets (columns). Instances are orthogonal. Target weight vectors are units.

11 The n data sets (each example is an instance with its label)
Data set 1: ((+1, +1, +1, +1), +1) ((+1, -1, +1, -1), +1) ((+1, +1, -1, -1), +1) ((+1, -1, -1, +1), +1)
Data set 2: ((+1, +1, +1, +1), +1) ((+1, -1, +1, -1), +1) ((+1, +1, -1, -1), -1) ((+1, -1, -1, +1), -1)
Data set 3: ((+1, +1, +1, +1), +1) ((+1, -1, +1, -1), -1) ((+1, +1, -1, -1), +1) ((+1, -1, -1, +1), -1)
Data set 4: ((+1, +1, +1, +1), +1) ((+1, -1, +1, -1), -1) ((+1, +1, -1, -1), -1) ((+1, -1, -1, +1), +1)
For each of the $n$ data sets: a subset of labeled examples is received, the labels of the remaining examples must be predicted, and the loss is averaged over all $n$ examples.

12 Without embeddings I. Any linear combination of $k$ training instances predicts zero on all $n - k$ test instances [LLW95, KWA97]. So the loss is 1 on $n - k$ of the $n$ instances, and the average square loss over all $n$ instances is at least $1 - \frac{k}{n}$. [Plot: $n = 1024$, $\lg(n) = 10$.]

13 Without embeddings II
Theorem: For any linear combination of $k$ rows of the $n$-dimensional Hadamard matrix and any of the $n$ targets, the average square loss over all $n$ instances is at least $1 - \frac{k}{n}$.
Theorem: Any linear combination of $k$ rows of the $n$-dimensional Hadamard matrix has distance at least $\sqrt{1 - \frac{k}{n}}$ from each of the $n$ unit vectors.
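
The first theorem can be checked on a small case. This is a sketch under stated assumptions, not the authors' code: it uses the Sylvester construction for the Hadamard matrix, and the helper names `hadamard` and `average_square_loss`, the choice $n = 8$, the training rows and the coefficients are all illustrative. It evaluates the average square loss of a weight vector restricted to the span of $k$ rows.

```python
from fractions import Fraction

def hadamard(n):
    """Sylvester construction; n is assumed to be a power of two."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def average_square_loss(H, train_rows, coeffs, target):
    """Loss of w = sum_t coeffs[t] * H[train_rows[t]] against one target column."""
    n = len(H)
    w = [sum(c * H[t][j] for c, t in zip(coeffs, train_rows)) for j in range(n)]
    total = Fraction(0)
    for row in H:
        prediction = sum(wj * xj for wj, xj in zip(w, row))
        total += (prediction - row[target]) ** 2   # label is the target column entry
    return total / n

n, k, target = 8, 3, 5
H = hadamard(n)
train_rows = [0, 2, 7]

# Coefficients y_t / n fit the k training labels exactly (rows are orthogonal).
fit = [Fraction(H[t][target], n) for t in train_rows]
print(average_square_loss(H, train_rows, fit, target))   # equals 1 - k/n

# Any other combination of the same rows stays at or above 1 - k/n.
other = [Fraction(1, 3), Fraction(-2), Fraction(5, 7)]
print(average_square_loss(H, train_rows, other, target) >= Fraction(n - k, n))
```

Because the rows are orthogonal, every vector in the span of the training rows predicts exactly zero on the $n - k$ unseen rows, so those rows always contribute loss 1.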

14 With embeddings. $\phi : H \mapsto Z$ [matrix display not recoverable]. So after one example you learned one target. Caveat: this embedding does poorly on the other targets. $\phi : H \mapsto$ [matrix display not recoverable; $k$ rows marked]. With $k$ independent examples you can learn the first $k$ targets.

15 Summary. Memorize labels of first $k$ instances. Correct on $k$ targets. No improvement possible.

16 Main Result
Theorem: No matter how the instances are embedded, no matter which $k$ training instances are chosen by the learner, and no matter what linear combination is used, for one of the targets the average square loss on all $n$ instances is at least $1 - \frac{k}{n}$.

17 Probabilistic model I. Uniform distribution on the $n$ rows of the Hadamard matrix. The algorithm first embeds the $n$ rows and then draws $k$ rows without replacement, all labeled by one of the $n$ targets. It chooses its hypothesis as a linear combination of the $k$ embedded instances. Average square loss for at least one of the targets is at least $1 - \frac{k}{n}$.

18 Probabilistic model II. As above, but the $k$ examples are drawn with replacement. Average square loss for at least one of the targets is at least $(1 - \frac{1}{n})^k$. [Plot for $n = 100$: curves for without replacement and with replacement.]
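
The two bounds can be compared directly. A minimal sketch with hypothetical function names: $(1 - \frac{1}{n})^k$ is the probability that a fixed row is never drawn in $k$ draws with replacement, so it is also the expected fraction of rows that remain unseen, and it never falls below $1 - \frac{k}{n}$.

```python
def bound_without_replacement(n, k):
    """Fraction of rows unseen after k distinct draws: 1 - k/n."""
    return 1 - k / n

def bound_with_replacement(n, k):
    """Expected fraction of rows unseen after k draws with replacement."""
    return (1 - 1 / n) ** k

n = 100   # same n as on the slide's plot
for k in (10, 50, 100):
    print(k,
          round(bound_without_replacement(n, k), 4),
          round(bound_with_replacement(n, k), 4))
```

At $k = n$ the bound without replacement reaches zero, while the bound with replacement is still roughly $1/e$, since repeated draws waste examples.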

19 Our Approach. Use the SVD spectrum instead. [Plot labels: Hadamard, Random.] Average square loss $\ge \frac{1}{n^2} \sum_{i=k+1}^{n} s_i^2 = 1 - \frac{k}{n}$.
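
The equality on this slide follows from the flat spectrum of the Hadamard matrix. A short worked derivation, assuming $H$ is the $n \times n$ Hadamard matrix with singular values $s_1 \ge \dots \ge s_n$:

```latex
H H^\top = n I_n
\;\Longrightarrow\;
s_1 = \dots = s_n = \sqrt{n},
\qquad
\frac{1}{n^2} \sum_{i=k+1}^{n} s_i^2
  = \frac{(n-k)\, n}{n^2}
  = 1 - \frac{k}{n}.
```

No singular value is small, so discarding all but $k$ of them always leaves a tail of mass $(n-k)\,n$.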

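20 Proof
$H$ ($n \times n$) is mapped to $Z$ ($n \times m$); $\hat{Z}$ ($k \times m$) is the first $k$ rows of $Z$; $a$ ($k \times 1$) is the weight vector.
Residuals for one target: $Z \hat{Z}^\top a - h$
All $n^2$ residuals: $Z \hat{Z}^\top A - H$, with $A$ of size $k \times n$
Average squared error: $\frac{1}{n^2} \| \underbrace{Z \hat{Z}^\top A}_{\text{rank} \le k} - H \|_F^2 \ge \frac{1}{n^2} \sum_{i=k+1}^{n} s_i^2$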

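21 Proof for non-square H
$H$ ($n \times q$) is mapped to $Z$ ($n \times m$); $\hat{Z}$ ($k \times m$) is the first $k$ rows of $Z$; $a$ ($k \times 1$) is the weight vector.
Residuals for one target: $Z \hat{Z}^\top a - h$
All $nq$ residuals: $Z \hat{Z}^\top A - H$, with $A$ of size $k \times q$
Average squared error: $\frac{1}{nq} \| \underbrace{Z \hat{Z}^\top A}_{\text{rank} \le k} - H \|_F^2 \ge \frac{1}{nq} \sum_{i=k+1}^{\min(n,q)} s_i^2$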

22 Additional Constraints. $\hat{Z} =$ [matrix display not recoverable]. Constraints: $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$. For the above $k$ instances, labeled by one of the $2^k$ columns, the only consistent weight vector is the unit identifying that column. With constraints all $2^k$ units can be obtained. Weight space can have rank $2^k$. With linear combinations of $k$ rows at most rank many units (i.e. $k$) can be expressed.

23 Additional Constraints - Part. The above $k$ rows appear as rows in the $2^k \times 2^k$ Hadamard matrix. Therefore any linear combination of the $k$ rows of the submatrix has distance at least $\sqrt{1 - \frac{k}{2^k}}$ from each of the $2^k$ unit vectors. Every linear combination has average square loss at least $1 - \frac{k}{2^k}$ on the full Hadamard matrix. You need the additional constraints to bring up the span? Constraints and consistency = unique solution.

24 Maintain additional constraints? Use the Exponentiated Gradient Algorithm [KW97].
Kernel methods: $w_i = \sum_{t=1}^{k} \hat{Z}_{t,i} a_t$
EG: $w_i = \exp\left( \sum_{t=1}^{k} \hat{Z}_{t,i} a_t \right) / \text{const}$
Now the log weights are a linear combination of expanded instances.

25 Average Squared Error. [Plot: curves for EG and Kernel algs, with $\frac{\ln(n)}{t}$, $1 - \frac{t}{n}$ and $(1 - \frac{1}{n})^t$.]

26 How does EG realize units? $\hat{Z} =$ [matrix display not recoverable]. EG: $w_i = \exp\left( \sum_{t=1}^{k} \hat{Z}_{t,i} a_t \right) / \text{const}$. Set coefficient $a_t = \pm\eta$ and let $\eta$ go to infinity. Each sign pattern corresponds to a different column.
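
The limiting behaviour can be illustrated numerically. This is a sketch under an assumption: the matrix shown on the slide is not recoverable, so $\hat{Z}$ is taken here to be the $k \times 2^k$ matrix whose columns are all sign patterns, in line with the "one of the $2^k$ columns" wording of the earlier constraints slide. The function name `eg_weights` and the chosen sign pattern are hypothetical.

```python
from itertools import product
from math import exp

def eg_weights(Z_hat, a):
    """w_i proportional to exp(sum_t Z_hat[t][i] * a[t]), normalized to sum to 1."""
    k, m = len(Z_hat), len(Z_hat[0])
    scores = [sum(Z_hat[t][i] * a[t] for t in range(k)) for i in range(m)]
    top = max(scores)                       # subtract the max for numerical stability
    unnorm = [exp(s - top) for s in scores]
    const = sum(unnorm)
    return [u / const for u in unnorm]

k = 3
columns = list(product([+1, -1], repeat=k))                 # all 2^k sign patterns
Z_hat = [[col[t] for col in columns] for t in range(k)]     # assumed k x 2^k matrix

signs = (+1, -1, +1)                                        # sign pattern of the a_t
for eta in (0.1, 1.0, 20.0):
    w = eg_weights(Z_hat, [s * eta for s in signs])
    best = w.index(max(w))
    print(eta, round(w[best], 4), columns[best])
```

As $\eta$ grows, the weight concentrates on the single column whose entries match the signs of the $a_t$, so each of the $2^k$ sign patterns selects a different unit vector while the weights stay nonnegative and sum to one.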

27 What constraints?

28 Good algs for sparsity? EG with loss $\text{loss}(w \cdot x_t, y_t)$: the Santa Cruz way. GD with loss $\text{loss}(w \cdot x_t, y_t)$ + sparsity regularizer such as $\|w\|_1$ or the entropy $\sum_i w_i \log w_i - w_i$: what the neural net community does. Open: Can the above handle worst-case example sequences? Regret bounds?

29 For the random bit matrix case [DPH]. Consistency + minimizing $\|w\|_1$ puts all weight on consistent components. Minimizing $\sum_i (w \cdot x_i - y_i)^2 + \eta \|w\|_1$ as $\eta \to 0$ puts all weight on consistent components. Minimizing $\sum_i |w \cdot x_i - y_i| + \eta \|w\|_1$ puts all weight on consistent components. Is $\eta \to 0$ required?

30 Optimization versus ML. Problem: noise-free linear regression, i.e. solve a system of linear equations. Optimization: any solution is good; what matters is time, space, accuracy. Machine Learning: how well does the solution generalize?

31 Incorporating side info. Kernel algorithms: none, $1 - k/n$. EG: $w_i \ge 0$ and $\sum_i w_i = 1$, $O(\frac{\log n}{k})$. [Figure labels: instances, targets.] Now the target is determined by any single example. Trivial algorithm beats EG.

32 Making it worse. Spectrum of the $n \times \log n$ matrix: all $2^{\log n}$ sign patterns. Spectrum of the $n \times n$ matrix produced by expanding the $\log n$ features to all $2^{\log n}$ products. Adding $n - \log n$ random features instead.

33 Random features cost. LLS error w.r.t. any single feature in the Hadamard matrix. Average error w.r.t. all single features in the random matrix. Minimum error w.r.t. all single features. $O(1)$ examples needed per random feature.

34 Which matrix? If the eigen-spectrum of the kernel matrix has a heavy tail, then the kernel is not useful: picked the wrong kernel, or the problem is too hard. If the SVD spectrum of the problem matrix has a heavy tail, then the problem is not learnable. Kernel matrix: dot products of instances. Problem matrix: instances as rows, targets as columns. We showed: the Hadamard problem matrix has a heavy tail, and adding random features makes the tail of the kernel matrix heavy.

35 Questions? We gave a problem that cannot be learned well by kernel algorithms. Similar bounds for classification? Linear neurons with sigmoided output? What is the optimal kernel for a given problem? Simpler generalization bounds for probabilistic settings? Feature selection? Is there a similar story for learning matrix parameters? Your questions :-)
