Advanced Machine Learning & Perception


1 Advanced Machine Learning & Perception Instructor: Tony Jebara

2 Topic 6
- Standard Kernels
- Unusual Input Spaces for Kernels
- String Kernels
- Probabilistic Kernels
- Fisher Kernels
- Probability Product Kernels
- Mixture Model Probability Product Kernels
- Junction Tree Algorithm for Kernels

3 SVMs and Kernels
Recall the SVM dual optimization:

$L_D: \max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ subject to $\alpha_i \in [0, C]$ and $\sum_i \alpha_i y_i = 0$.

Solve the QP by using the Gram matrix $K$ with entries $K_{ij} = k(x_i, x_j)$:

$K = \begin{pmatrix} k(x_1,x_1) & k(x_1,x_2) & k(x_1,x_3) \\ k(x_1,x_2) & k(x_2,x_2) & k(x_2,x_3) \\ k(x_1,x_3) & k(x_2,x_3) & k(x_3,x_3) \end{pmatrix}$

The KKT conditions then give us $b$. The SVM prediction is:

$f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i k(x, x_i) + b\right)$

Kernels replace vector dot products, giving a nonlinear decision boundary. Kernels also allow arbitrarily large structured input spaces!
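As a concrete sketch of these two pieces (the Gram matrix handed to the QP and the dual prediction rule), here is minimal Python; the function names and the placeholder linear kernel are illustrative assumptions, not from the slides:

```python
import numpy as np

def gram_matrix(X, kernel):
    # K_ij = k(x_i, x_j): the matrix handed to the QP solver
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def svm_predict(x, X, y, alpha, b, kernel):
    # f(x) = sign( sum_i alpha_i y_i k(x, x_i) + b )
    return np.sign(sum(a * yi * kernel(x, xi)
                       for a, yi, xi in zip(alpha, y, X)) + b)

linear = lambda u, v: u @ v            # placeholder kernel; any k(x, x') works
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
K = gram_matrix(X, linear)             # pass K and y to a QP to obtain alpha, b
```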

4 Standard Kernels on Vectors
Kernels usually take two input vectors and output a scalar:

Polynomial Kernel (gives $p$ twists in the SVM decision boundary):
$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle = (x_i^T x_j + 1)^p$

RBF Kernel ($\phi$ is infinite dimensional):
$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle = \exp\left(-\frac{1}{2\sigma^2} \|x_i - x_j\|^2\right)$

Hyperbolic Tangent Kernel:
$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle = \tanh(\kappa\, x_i^T x_j - \delta)$

But why just deal with vector inputs? We can apply kernels between any type of input: graphs, strings, probabilities, structured spaces, etc.!
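A minimal sketch of these three standard kernels in Python (parameter defaults are illustrative):

```python
import numpy as np

def polynomial_kernel(xi, xj, p=2):
    # (x_i^T x_j + 1)^p
    return (xi @ xj + 1) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    # exp(-||x_i - x_j||^2 / (2 sigma^2)); feature map is infinite dimensional
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def tanh_kernel(xi, xj, kappa=1.0, delta=1.0):
    # tanh(kappa x_i^T x_j - delta); not positive semi-definite for all settings
    return np.tanh(kappa * (xi @ xj) - delta)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(polynomial_kernel(xi, xj), rbf_kernel(xi, xj), tanh_kernel(xi, xj))
```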

5 String Kernels
Inputs are two strings: X = aardvark, X' = accent. We want k(X,X') = scalar value = $\phi(X)^T \phi(X')$.
One choice for the features $\phi(X)$ is the number of times each substring of length 1, 2 and 3 appears in X:
$\phi(X)$ = [#a #b #c #d #e #f ... #y #z #aa #ab #ac ...]
with one count vector for X and one for X'. This can be computed efficiently via dynamic programming.

6 String Kernels
We want k(X,X') = scalar value = $\phi(X)^T \phi(X')$.
The explicit kernel can be slow because of the many substrings s in our final vocabulary:
$\phi_s(X)$ = number of times substring s appears in X, e.g. $\phi_a(\text{aardvark}) = 3$ and $\phi_{ar}(\text{aardvark}) = 2$.
The implicit kernel is more efficient. To compute k(X,X'): for each substring s of X, count how often s appears in X' via dynamic programming, in time proportional to the product of the lengths of X and X'. This computes the dot product much more quickly!
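A brute-force sketch of the explicit feature-count version; the slides' dynamic-programming variant avoids materializing the vocabulary, but this makes the definition concrete (names are illustrative):

```python
from collections import Counter

def substring_counts(s, max_len=3):
    # phi_s(X): count of every substring of length 1..max_len in X
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

def string_kernel(x, xp, max_len=3):
    # explicit k(X, X') = phi(X)^T phi(X'): sum counts over shared substrings
    cx, cxp = substring_counts(x, max_len), substring_counts(xp, max_len)
    return sum(c * cxp[s] for s, c in cx.items() if s in cxp)

print(substring_counts("aardvark", 1)["a"])   # 3, matching phi_a on the slide
print(string_kernel("aardvark", "accent"))
```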

7 Probability Kernels
Inputs are two probabilities: $p = p(x\mid\theta)$ and $p' = p(x\mid\theta')$. We want k(p,p') = scalar value. How to design this?
Recall that a kernel can be converted into a (squared) distance:
$D(p_i, p_j) = k(p_i, p_i) - 2\,k(p_i, p_j) + k(p_j, p_j)$
What is a natural distance between two probabilities? The Kullback-Leibler divergence!
$D(p_i \,\|\, p_j) = \int p_i(x) \log \frac{p_i(x)}{p_j(x)}\, dx$ ... caveats?
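A small sketch of the kernel-to-distance conversion on discrete distributions; the Bhattacharyya kernel used as the example here is defined a few slides ahead:

```python
import numpy as np

def kernel_to_distance(k, pi, pj):
    # D(p_i, p_j) = k(p_i, p_i) - 2 k(p_i, p_j) + k(p_j, p_j)
    return k(pi, pi) - 2 * k(pi, pj) + k(pj, pj)

bhatta = lambda p, q: np.sum(np.sqrt(p * q))   # Bhattacharyya kernel (slide 10)
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
print(kernel_to_distance(bhatta, p, q))        # a Hellinger-style distance
```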

8 Fisher Kernels
KL isn't a symmetric distance, and probabilities lie on a manifold! So we must approximate it with a quadratic distance measure, like the Mahalanobis distance:
$D(p'' \,\|\, p') \approx (U_{p''} - U_{p'})^T I_{\theta^*} (U_{p''} - U_{p'})$
Linearize the manifold at $\theta^*$ (some maximum likelihood point); we get the distance between $p''$ and $p'$ via a distance from $\theta''$ to $\theta'$. The right kernel to go with the Mahalanobis distance is:
$k(p'', p') = U_{p''}^T I_{\theta^*}^{-1} U_{p'}$
where $U_{p''} = \nabla_\theta \log p(x\mid\theta)\big|_{\theta''}$ and $U_{p'} = \nabla_\theta \log p(x\mid\theta)\big|_{\theta'}$.
The matrix $I$ is the Fisher information:
$I_{\theta^*}(i,j) = \int p(x\mid\theta^*)\, \frac{\partial \log p(x\mid\theta)}{\partial\theta_i}\bigg|_{\theta^*} \frac{\partial \log p(x\mid\theta)}{\partial\theta_j}\bigg|_{\theta^*}\, dx$
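A minimal sketch of a Fisher kernel, assuming the simplest possible model, a unit-variance Gaussian with mean parameter mu, where the score and Fisher information are known in closed form (all names are illustrative):

```python
import numpy as np

def fisher_kernel(x1, x2, score, I):
    # k(x, x') = U_x^T I^{-1} U_x', with U_x the score and I the Fisher
    # information, both taken at the maximum likelihood point theta*
    return score(x1) @ np.linalg.solve(I, score(x2))

mu_star = 0.0                                  # maximum likelihood point theta*
score = lambda x: np.array([x - mu_star])      # d/dmu log N(x | mu, 1) = x - mu
I = np.array([[1.0]])                          # Fisher information of N(mu, 1)
print(fisher_kernel(0.5, 1.2, score, I))       # linear in (x - mu*)
```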

9 Other Possible Divergences
The Kullback-Leibler divergence is asymmetric and tricky. All divergences satisfy $D(p,q) \geq 0$ and $D(p,q) = 0$ iff $p = q$. Consider KL, Jensen-Shannon and symmetrized KL:
$KL(p,q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$
$JS(p,q) = \frac{1}{2} KL\!\left(p, \frac{p+q}{2}\right) + \frac{1}{2} KL\!\left(q, \frac{p+q}{2}\right)$
$SKL(p,q) = \frac{1}{2} KL(p,q) + \frac{1}{2} KL(q,p)$
The Hellinger divergence is symmetric & can be written as a kernel!
$H(p,q) = \frac{1}{4} \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx \propto k(p,p) - 2\,k(p,q) + k(q,q)$ ... what kernel is this?
sqrt(JS) is a true distance metric (Endres & Schindelin, 2003).
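These divergences are easy to check numerically on discrete distributions; a sketch (the Hellinger constant factor is dropped here):

```python
import numpy as np

def kl(p, q):
    # KL(p, q) = sum_x p(x) log(p(x) / q(x)); asymmetric, requires q(x) > 0
    return np.sum(p * np.log(p / q))

def js(p, q):
    # JS(p, q) = 1/2 KL(p, (p+q)/2) + 1/2 KL(q, (p+q)/2); sqrt(JS) is a metric
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def skl(p, q):
    # symmetrized KL
    return 0.5 * kl(p, q) + 0.5 * kl(q, p)

def hellinger(p, q):
    # sum_x (sqrt(p(x)) - sqrt(q(x)))^2, up to the slide's constant factor
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
print(js(p, q), skl(p, q), hellinger(p, q))
```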

10 Bhattacharyya Kernel
A kernel between distributions that gives the Hellinger divergence:
$k(p,q) = \int \sqrt{p(x)}\,\sqrt{q(x)}\, dx$
A slight generalization is the Probability Product Kernel:
$k(p,q) = \int p(x)^\rho\, q(x)^\rho\, dx$
Setting $\rho = 1/2$ gives the Bhattacharyya Kernel:
$k(p,q) = \int \sqrt{p(x)\, q(x)}\, dx$
Setting $\rho = 1$ gives the Expected Likelihood Kernel:
$k(p,q) = \int p(x)\, q(x)\, dx = E_p[q(x)] = E_q[p(x)]$
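For discrete distributions the integral becomes a sum, so the probability product kernel is one line; a sketch:

```python
import numpy as np

def prob_product_kernel(p, q, rho=0.5):
    # k(p, q) = sum_x p(x)^rho q(x)^rho  (discrete analogue of the integral)
    return np.sum(p ** rho * q ** rho)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
print(prob_product_kernel(p, q))          # Bhattacharyya kernel (rho = 1/2)
print(prob_product_kernel(p, q, rho=1))   # expected likelihood kernel
```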

11 Probability Product Kernel
Imagine we are given inputs which are probabilities, with labels: Dataset = {(p_1,y_1), (p_2,y_2), ..., (p_n,y_n)}. E.g. each input i is a brand of car: p_i is its brakepad failure probability over time, and y_i is binary: did the government approve it for US roads? Build an SVM with Probability Product Kernels to predict whether a brakepad failure distribution will be approved/rejected by the government.

12 Probability Product Kernel
Imagine instead we are given input data vectors with labels: Dataset = {(x_1,y_1), (x_2,y_2), ..., (x_n,y_n)}. E.g. each input i is a brand of car: x_i is the time when one car of brand i failed, and y_i is binary: did the government approve it for US roads? For each i, estimate from x_i a probability $p_i = p(x\mid\theta_i)$, which gives Dataset = {(p_1,y_1), (p_2,y_2), ..., (p_n,y_n)}. Then build an SVM with Probability Product Kernels to predict whether a brakepad failure distribution will be approved/rejected by the government.

13 Probability Product Kernel
Finally, imagine we are given input data vector-sets with labels: Dataset = {(χ_1,y_1), (χ_2,y_2), ..., (χ_n,y_n)}. E.g. each input i is a brand of car: χ_i is a set of times when several cars of brand i failed, and y_i is binary: did the government approve it for US roads? For each i, estimate from $\chi_i = \{x_{i1}, \ldots, x_{iT}\}$ a probability $p_i = p(x\mid\theta_i)$, which gives Dataset = {(p_1,y_1), (p_2,y_2), ..., (p_n,y_n)}. Then build an SVM with Probability Product Kernels to predict whether a brakepad failure distribution will be approved/rejected by the government, as sketched below.
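Putting the story together, a hypothetical end-to-end sketch: fit a 1-D Gaussian per bag of failure times, build the Bhattacharyya Gram matrix with the closed form for two 1-D Gaussians (a special case of the formula on the next slide), and train an SVM on it. The data, names and sklearn usage are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def fit_gaussian(bag):
    # theta_i = (mu, sigma^2) estimated from one brand's failure times
    return np.mean(bag), np.var(bag) + 1e-6

def bhatta_gauss_1d(t1, t2):
    # closed-form Bhattacharyya kernel between two 1-D Gaussians:
    # sqrt(2) (v1 v2)^(1/4) / sqrt(v1 + v2) * exp(-(m1 - m2)^2 / (4 (v1 + v2)))
    (m1, v1), (m2, v2) = t1, t2
    return (np.sqrt(2) * (v1 * v2) ** 0.25 / np.sqrt(v1 + v2)
            * np.exp(-(m1 - m2) ** 2 / (4 * (v1 + v2))))

# hypothetical failure-time bags (one per brand) and approval labels
bags = [[1.2, 1.5, 1.1], [4.0, 3.5, 4.4], [1.3, 1.0, 1.6], [3.8, 4.2, 3.9]]
labels = [1, -1, 1, -1]

thetas = [fit_gaussian(b) for b in bags]
K = np.array([[bhatta_gauss_1d(ti, tj) for tj in thetas] for ti in thetas])
clf = SVC(kernel="precomputed").fit(K, labels)
print(clf.predict(K))          # kernel rows of test bags against training bags
```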

14 Gaussian Product Kernels
For Gaussians with fixed covariance:
$k(p,p') = \int N^\rho(x\mid\mu)\, N^\rho(x\mid\mu')\, dx \propto \exp\left(-\|\mu - \mu'\|^2 / \sigma\right)$
(if $\mu = x$ and $\mu' = x'$, we get back the RBF kernel; the Fisher kernel here is just a linear kernel).
For Gaussians with variable covariance:
$k(p,p') = \int N^\rho(x\mid\mu,\Sigma)\, N^\rho(x\mid\mu',\Sigma')\, dx \propto |\Sigma|^{-\rho/2}\, |\Sigma'|^{-\rho/2}\, |\tilde\Sigma|^{1/2} \exp\left(-\frac{\rho}{2}\mu^T\Sigma^{-1}\mu - \frac{\rho}{2}\mu'^T\Sigma'^{-1}\mu' + \frac{1}{2}\tilde\mu^T\tilde\Sigma\tilde\mu\right)$
where $\tilde\Sigma = \left(\rho\Sigma^{-1} + \rho\Sigma'^{-1}\right)^{-1}$ and $\tilde\mu = \rho\Sigma^{-1}\mu + \rho\Sigma'^{-1}\mu'$.
(The Fisher kernel here is just a quadratic kernel.)
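A numerical sketch of the variable-covariance formula, dropping the rho- and pi-dependent constant factors just as the slide's proportionality does (names are illustrative):

```python
import numpy as np

def gaussian_product_kernel(mu1, S1, mu2, S2, rho=0.5):
    # probability product kernel between N(mu1, S1) and N(mu2, S2),
    # up to constant factors
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    S = np.linalg.inv(rho * P1 + rho * P2)       # combined covariance
    m = rho * P1 @ mu1 + rho * P2 @ mu2          # precision-weighted mean
    expo = (-0.5 * rho * mu1 @ P1 @ mu1
            - 0.5 * rho * mu2 @ P2 @ mu2
            + 0.5 * m @ S @ m)
    return (np.linalg.det(S1) ** (-rho / 2)
            * np.linalg.det(S2) ** (-rho / 2)
            * np.linalg.det(S) ** 0.5
            * np.exp(expo))

mu, S = np.array([0.0, 0.0]), np.eye(2)
mup, Sp = np.array([1.0, 0.0]), 2 * np.eye(2)
print(gaussian_product_kernel(mu, S, mup, Sp))
```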

15 Multinomial Product Kernels
Bernoulli (binary x): $p(x\mid\theta) = \prod_{d=1}^D \gamma_d^{x_d} (1-\gamma_d)^{1-x_d}$, giving
$k(\chi,\chi') = \int p(x)^\rho \left(p'(x)\right)^\rho dx = \prod_{d=1}^D \left[\left(\gamma_d\, \gamma_d'\right)^\rho + \left((1-\gamma_d)(1-\gamma_d')\right)^\rho\right]$
Multinomial (discrete x): $p(x\mid\theta) = \prod_d \alpha_d^{x_d}$, giving
$k(\chi,\chi') = \sum_{d=1}^D \left(\alpha_d\, \alpha_d'\right)^\rho$
For multinomial counts (with N words per document):
$k(\chi,\chi') = \left(\sum_{d=1}^D \left(\alpha_d\, \alpha_d'\right)^{1/2}\right)^N$
(The Fisher kernel for the Multinomial is linear.)
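A sketch of the Bhattacharyya count version, estimating each alpha_d as a word frequency (the word-count vectors are hypothetical):

```python
import numpy as np

def multinomial_bhattacharyya(counts1, counts2, N=1):
    # alpha_d estimated as word frequencies; k = (sum_d sqrt(a_d a'_d))^N
    a1 = counts1 / counts1.sum()
    a2 = counts2 / counts2.sum()
    return np.sum(np.sqrt(a1 * a2)) ** N

doc1 = np.array([3.0, 1.0, 0.0, 2.0])   # hypothetical word counts
doc2 = np.array([2.0, 2.0, 1.0, 1.0])
print(multinomial_bhattacharyya(doc1, doc2))
```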

16 Multinomial Product Kernels
WebKB dataset: faculty vs. student web page SVM classifier. 20-fold cross-validation, 1641 student & 1124 faculty pages. Use the Bhattacharyya Kernel on the multinomial (word frequency) model. [Plots: test error curves for training set sizes 77 and 622.]

17 Exponential Family Kernels
Exponential Family: Gaussian, Multinomial, Binomial, Poisson, Inverse Wishart, Gamma, Bernoulli, Dirichlet, ...
$p(x\mid\theta) = \exp\left(A(x) + T(x)^T\theta - K(\theta)\right)$
Maximum likelihood is straightforward:
$\frac{\partial K(\theta)}{\partial\theta} = \frac{1}{n}\sum_n T(x_n)$
All have the above form but different $A(x)$, $T(x)$, and convex $K(\theta)$.

18 Exponential Family Kernels
Compute the Bhattacharyya Kernel for the e-family $p(x\mid\theta) = \exp\left(A(x) + T(x)^T\theta - K(\theta)\right)$. There is an analytic solution for the e-family:
$k(p,p') = \int p(x\mid\theta)^{1/2}\, p(x\mid\theta')^{1/2}\, dx = \exp\left(K\!\left(\tfrac{1}{2}\theta + \tfrac{1}{2}\theta'\right) - \tfrac{1}{2}K(\theta) - \tfrac{1}{2}K(\theta')\right)$
It only depends on the convex cumulant-generating function $K(\theta)$. Meanwhile, the Fisher Kernel is always linear in the sufficient statistics:
$U_\chi = \nabla_\theta \log p(\chi\mid\theta)\big|_{\theta^*} = T(\chi) - g$, so $U_\chi^T I_{\theta^*}^{-1} U_{\chi'} = \left(T(\chi) - g\right)^T I_{\theta^*}^{-1} \left(T(\chi') - g\right)$ where $g = \nabla K(\theta^*)$.
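The closed form is easy to sanity-check, e.g. for a unit-variance Gaussian, whose natural parameter is theta = mu and whose cumulant-generating function is K(theta) = theta^2/2; a sketch (names are illustrative):

```python
import numpy as np

def efamily_bhattacharyya(K, theta1, theta2):
    # k(p, p') = exp( K((theta + theta')/2) - K(theta)/2 - K(theta')/2 )
    return np.exp(K((theta1 + theta2) / 2) - K(theta1) / 2 - K(theta2) / 2)

K_gauss = lambda th: th ** 2 / 2           # unit-variance Gaussian: theta = mu
print(efamily_bhattacharyya(K_gauss, 0.0, 2.0))
print(np.exp(-(0.0 - 2.0) ** 2 / 8))       # direct Gaussian result: same value
```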

19 Gaussian Product Kernels
Instead of a single x & x', construct p & p' from many samples. Use a bag of vectors $\{\chi_1, \ldots, \chi_N\}$:
$p(x) = N(x\mid\mu,\Sigma)$ with $\mu = \frac{1}{N}\sum_i \chi_i$ and $\Sigma = \frac{1}{N}\sum_i (\chi_i - \mu)(\chi_i - \mu)^T$
and likewise $\{\chi'_1, \ldots, \chi'_N\}$ gives $p'(x) = N(x\mid\mu',\Sigma')$ with $\mu'$, $\Sigma'$ estimated the same way. Then:
$k(p,p') \propto |\Sigma|^{-\rho/2}\, |\Sigma'|^{-\rho/2}\, |\tilde\Sigma|^{1/2} \exp\left(-\frac{\rho}{2}\mu^T\Sigma^{-1}\mu - \frac{\rho}{2}\mu'^T\Sigma'^{-1}\mu' + \frac{1}{2}\tilde\mu^T\tilde\Sigma\tilde\mu\right)$
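A sketch of the bag-fitting step; the resulting (mu, Sigma) pairs feed the Gaussian product kernel sketched earlier (the small ridge term is an illustrative regularizer, not from the slides):

```python
import numpy as np

def fit_bag(bag):
    # bag: (N, D) array of vectors -> MLE Gaussian (mu, Sigma)
    bag = np.asarray(bag)
    mu = bag.mean(axis=0)
    diff = bag - mu
    Sigma = diff.T @ diff / len(bag) + 1e-6 * np.eye(bag.shape[1])
    return mu, Sigma

bag1 = np.random.randn(30, 2)        # hypothetical bag of 30 2-D points
mu1, S1 = fit_bag(bag1)              # feed into gaussian_product_kernel(...)
```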

20 Kernelized Gaussian Product
The Bhattacharyya affinity on Gaussians fit to bags {x,...} & {x',...} is invariant to the order of tuples in each bag. But it is too simple: it is just the overlap of Gaussians on images, and we need more detail than the mean and covariance of pixels. So use another kernel to find the Gaussian in Hilbert space:
$\kappa(\chi,\chi') = \phi(\chi)^T \phi(\chi')$
Never compute outer products; use the kernel instead, e.g. the infinite-dimensional RBF:
$\kappa(\chi,\chi') = \exp\left(-\frac{1}{2\sigma^2}\|\chi - \chi'\|^2\right)$
Compute this mini-kernel between each pair of pixels in a given image. This gives a kernelized or augmented Gaussian $\mu$ and $\Sigma$ via the Gram matrix.

21 Kernelized Gaussian Product
Previously: $\mu = E\{\chi\}$ and $\Sigma = E\{(\chi-\mu)(\chi-\mu)^T\}$.
Now: $\mu = E\{\phi(\chi)\}$ and $\Sigma = E\{(\phi(\chi)-\mu)(\phi(\chi)-\mu)^T\}$.
Compute the Hilbert-space Gaussian's mean & covariance for each image: each bag or image becomes a pixel Gram matrix under the kappa kernel, e.g. an N x N matrix $[\kappa(x_i,x_j)]$ for one bag and an M x M matrix for the other. Use kernel PCA to regularize the infinite-dimensional RBF Gaussian, then evaluate
$k(p,p') \propto |\Sigma|^{-\rho/2}\, |\Sigma'|^{-\rho/2}\, |\tilde\Sigma|^{1/2} \exp\left(-\frac{\rho}{2}\mu^T\Sigma^{-1}\mu - \frac{\rho}{2}\mu'^T\Sigma'^{-1}\mu' + \frac{1}{2}\tilde\mu^T\tilde\Sigma\tilde\mu\right)$
This puts all dimensions (X, Y, I) on an equal footing.

22 Kernelized Gaussian Product
[Figures: the letter R represented with 3 KPCA components of the RBF kernel; reconstructions of the letter R with 1-4 KPCA components; reconstruction with 3 KPCA components plus smoothing.]

23 Kernelized Gaussian Product
Monochromatic images (x40) of crosses & squares, translating & scaling. SVM: train on 50, test on 50. The Fisher kernel for a Gaussian is the quadratic kernel. RBF kernel (red): 67% accuracy; Bhattacharyya (blue): 90% accuracy.

24 Kernelized Gaussian Product
SVM classification of NIST digit images 0,1,...,9. Sample each image to get a bag of 30 (X,Y) pixels. Train on a random 120 and test on a random 80 bags-of-vectors. Bhattacharyya outperforms the standard RBF due to its built-in invariance; the Fisher kernel for a Gaussian is quadratic.

25 Mixture Product Kernels
Beyond the exponential family: mixtures and hidden-variable models. Here we need the $\rho = 1$ Expected Likelihood kernel. With
$p(x) = \sum_{m=1}^M p(m)\, p(x\mid m)$ and $p'(x) = \sum_{n=1}^N p'(n)\, p'(x\mid n)$:
$k(\chi,\chi') = \int p(x)\, p'(x)\, dx = \sum_{m=1}^M \sum_{n=1}^N p(m)\, p'(n) \int p(x\mid m)\, p'(x\mid n)\, dx = \sum_{m=1}^M \sum_{n=1}^N p(m)\, p'(n)\, c_{m,n}$
Use $M \cdot N$ subkernel evaluations $c_{m,n}$ from our previous repertoire, as sketched below.
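A sketch for two hypothetical 1-D Gaussian mixtures, where each subkernel c_{m,n} has the closed form int N(x; m1, v1) N(x; m2, v2) dx = N(m1; m2, v1 + v2):

```python
import numpy as np

def gauss_overlap(m1, v1, m2, v2):
    # c_{m,n} = int N(x; m1, v1) N(x; m2, v2) dx = N(m1; m2, v1 + v2)
    v = v1 + v2
    return np.exp(-(m1 - m2) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def mixture_el_kernel(mix1, mix2):
    # mix = list of (weight, mean, var); k = sum_{m,n} w_m w'_n c_{m,n}
    return sum(w1 * w2 * gauss_overlap(m1, v1, m2, v2)
               for (w1, m1, v1) in mix1
               for (w2, m2, v2) in mix2)

p  = [(0.7, 0.0, 1.0), (0.3, 3.0, 0.5)]   # hypothetical 2-component mixtures
pp = [(0.5, 0.5, 1.0), (0.5, 2.0, 2.0)]
print(mixture_el_kernel(p, pp))
```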

26 HMM Product Kernels
Hidden Markov Models (sequences):
$p(x) = \sum_S p(s_0)\, p(x_0\mid s_0) \prod_{t=1}^T p(s_t\mid s_{t-1})\, p(x_t\mid s_t)$
The number of hidden configurations is large: $\#\text{configs} = |s|^T$. The kernel is:
$k(\chi,\chi') = \int p(x)\, p'(x)\, dx = \sum_S \sum_U p(S)\, p'(U) \int p(x\mid S)\, p'(x\mid U)\, dx = \sum_S \sum_U p(S)\, p'(U)\, c_{S,U}$
Naively this computes a cross-product of all hidden variables: $O\!\left(|s|^T |u|^T\right)$.

27 HMM Product Kernels
Take advantage of the structure in HMMs via their Bayesian network:
$k(\chi,\chi') = \sum_{S,U} \prod_t p(s_t\mid s_{t-1})\, p'(u_t\mid u_{t-1}) \int p(x_t\mid s_t)\, p'(x_t\mid u_t)\, dx_t = \sum_{S,U} \prod_t p(s_t\mid s_{t-1})\, p'(u_t\mid u_{t-1})\, c_{s_t,u_t}$
Only compute subkernels for common parents: evaluate a total of $O(T\,|s|\,|u|)$ subkernels. Form clique potential functions, i.e.
$k(\chi,\chi') = \sum_{S,U} p(s_0)\, p'(u_0)\, \psi(s_0,u_0) \prod_t p(s_t\mid s_{t-1})\, p'(u_t\mid u_{t-1})\, \psi(s_t,u_t)$
and sum via the junction tree algorithm over the cliques $(u_0,s_0), (u_1,s_1), (u_2,s_2), (u_3,s_3), (u_4,s_4)$.
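For this chain-structured case the junction tree summation reduces to a forward-style recursion over the joint state (s_t, u_t); a sketch with hypothetical 2-state HMMs (all names and numbers are illustrative):

```python
import numpy as np

def hmm_product_kernel(pi1, A1, pi2, A2, C, T):
    # pi: initial state distribution; A[i, j] = p(s_t = j | s_{t-1} = i)
    # C[s, u] = c_{s,u} = int p(x | s) p'(x | u) dx, the emission subkernels;
    # only T |s| |u| distinct subkernels are ever needed
    alpha = np.outer(pi1, pi2) * C           # p(s_0) p'(u_0) psi(s_0, u_0)
    for _ in range(1, T):
        alpha = (A1.T @ alpha @ A2) * C      # propagate both chains, apply psi
    return alpha.sum()

pi1, A1 = np.array([0.6, 0.4]), np.array([[0.9, 0.1], [0.2, 0.8]])
pi2, A2 = np.array([0.5, 0.5]), np.array([[0.7, 0.3], [0.3, 0.7]])
C = np.array([[0.30, 0.05], [0.05, 0.25]])   # precomputed emission subkernels
print(hmm_product_kernel(pi1, A1, pi2, A2, C, T=5))
```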

28 Sampling Product Kernels
We can approximate the probability product kernel via sampling. By definition, generative models can: 1) generate a sample, and 2) compute the likelihood of a sample. Thus, approximate the probability product via sampling:
$k(\chi,\chi') = k(p,p') = \int p(x)\, p'(x)\, dx \approx \frac{\beta}{N}\sum_{i=1}^N p'(x_i) + \frac{1-\beta}{N'}\sum_{i=1}^{N'} p(x_i')$
where $x_i \sim p(x)$ and $x_i' \sim p'(x)$. Beta controls how much sampling is done from each distribution.
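A sketch of the sampling approximation for two 1-D Gaussians, where the exact expected likelihood kernel is available for comparison (the distributions and sample sizes are illustrative):

```python
import numpy as np

def gauss_pdf(x, m, v):
    # density of N(m, v)
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

rng = np.random.default_rng(0)
n, beta = 100_000, 0.5
xp = rng.normal(0.0, 1.0, n)               # samples x_i  ~ p  = N(0, 1)
xq = rng.normal(1.0, 2.0, n)               # samples x_i' ~ p' = N(1, 4)

# k(p, p') ~ (beta / N) sum_i p'(x_i) + ((1 - beta) / N') sum_i p(x_i')
estimate = (beta * gauss_pdf(xp, 1.0, 4.0).mean()
            + (1 - beta) * gauss_pdf(xq, 0.0, 1.0).mean())
exact = gauss_pdf(0.0, 1.0, 1.0 + 4.0)     # int N(x;0,1) N(x;1,4) dx
print(estimate, exact)
```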
