Learning from permutations. Jean-Philippe Vert
1 Learning from permutations Jean-Philippe Vert
3 Perception
4 Communication
5 Mobility
6 Health
7 A common process: learning from data. Given examples (training data), make a machine learn how to predict on new samples, or discover patterns in the data. Statistics + optimization + computer science. Gets better with more training examples and bigger computers.
8-11 It (more or less) boils down to... (repeated over four figure-only build-up slides)
12 In practice (e.g., linear ridge logistic regression). Input $X_1, \dots, X_n \in \mathbb{R}^p$, output $Y_1, \dots, Y_n \in \{-1, 1\}$. Classifier: $f_\beta(X) = \mathrm{sign}(\beta^\top X)$ for $\beta \in \mathbb{R}^p$. Training: $$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^n \ell_i\left(\beta^\top X_i\right) + \lambda \|\beta\|^2 \right\} \quad \text{with } \ell_i\left(\beta^\top X_i\right) = \ln\left(1 + e^{-Y_i \beta^\top X_i}\right).$$ A convex optimization problem, scalable via SGD.
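To make the training step concrete, here is a minimal NumPy sketch of full-batch gradient descent on the regularized logistic loss above (SGD would subsample rows at each step); the function name and hyperparameter values are illustrative, not from the talk.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_ridge_logistic(X, Y, lam=0.1, lr=0.1, epochs=200):
    """Minimize (1/n) sum_i ln(1 + exp(-Y_i beta^T X_i)) + lam * ||beta||^2."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        margins = Y * (X @ beta)                       # Y_i * beta^T X_i
        # d/dbeta of ln(1 + e^{-m_i}) is -sigmoid(-m_i) * Y_i * X_i
        grad = -(X * (Y * sigmoid(-margins))[:, None]).mean(axis=0) + 2 * lam * beta
        beta -= lr * grad
    return beta

# toy usage on random linearly separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = np.sign(X @ rng.normal(size=5))
beta_hat = train_ridge_logistic(X, Y)
print(np.mean(np.sign(X @ beta_hat) == Y))             # training accuracy
```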
13-14 Questions $$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^n \ln\left(1 + e^{-Y_i \beta^\top X_i}\right) + \lambda \|\beta\|^2 \right\}$$ How to compute $\hat\beta$ in practice? Will $f_{\hat\beta}$ make good predictions? How to train nonlinear models? What if inputs are not vectors? What if outputs are not binary? ...
15 What if inputs are permutations? A permutation is a bijection $\sigma : [1, N] \to [1, N]$, where $\sigma(i)$ is the rank of item $i$. Composition: $(\sigma_1 \sigma_2)(i) = \sigma_1(\sigma_2(i))$. $S_N$ denotes the symmetric group, with $|S_N| = N!$.
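As a quick illustration of these definitions, a permutation can be stored as an integer array of ranks, and composition is then just array indexing. This toy snippet uses 0-indexing (my convention, versus the slide's $[1, N]$):

```python
import math
import numpy as np

# sigma[i] = rank of item i (0-indexed here, [1, N] on the slide)
sigma1 = np.array([2, 0, 1])       # 0 -> 2, 1 -> 0, 2 -> 1
sigma2 = np.array([1, 2, 0])       # 0 -> 1, 1 -> 2, 2 -> 0

# composition (sigma1 sigma2)(i) = sigma1(sigma2(i))
print(sigma1[sigma2])              # [0 1 2]: sigma2 is the inverse of sigma1

# |S_N| = N! grows very fast, so algorithms cannot enumerate S_N
print(math.factorial(20))          # 2432902008176640000
```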
16 Examples: ranking data; ranks extracted from data (histogram equalization, quantile normalization, ...).
17 Learning from permutations. Assume your data are permutations and you want to learn $f : S_N \to \mathbb{R}$. One solution: embed $S_N$ into a Euclidean (or Hilbert) space via $\Phi : S_N \to \mathbb{R}^p$ and learn a linear function $f_\beta(\sigma) = \beta^\top \Phi(\sigma)$. The corresponding kernel is $K(\sigma_1, \sigma_2) = \Phi(\sigma_1)^\top \Phi(\sigma_2)$.
18 How to define the embedding $\Phi : S_N \to \mathbb{R}^p$? It should encode interesting features, lead to efficient algorithms, and be invariant to renaming of the items, i.e., the kernel should be right-invariant: $\forall \sigma_1, \sigma_2, \pi \in S_N$, $K(\sigma_1 \pi, \sigma_2 \pi) = K(\sigma_1, \sigma_2)$.
19 Harmonic analysis on $S_N$. A representation of $S_N$ is a matrix-valued function $\rho : S_N \to \mathbb{C}^{d_\rho \times d_\rho}$ such that $\forall \sigma_1, \sigma_2 \in S_N$, $\rho(\sigma_1 \sigma_2) = \rho(\sigma_1) \rho(\sigma_2)$. A representation is irreducible (an irrep) if it is not equivalent to the direct sum of two other representations. $S_N$ has a finite number of irreps $\{\rho_\lambda : \lambda \in \Lambda\}$, where $\Lambda = \{\lambda \vdash N\}$ is the set of partitions of $N$ ($\lambda \vdash N$ iff $\lambda = (\lambda_1, \dots, \lambda_r)$ with $\lambda_1 \geq \dots \geq \lambda_r$ and $\sum_{i=1}^r \lambda_i = N$). For any $f : S_N \to \mathbb{R}$, the Fourier transform of $f$ is $$\forall \lambda \in \Lambda, \quad \hat{f}(\rho_\lambda) = \sum_{\sigma \in S_N} f(\sigma) \rho_\lambda(\sigma).$$
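As a hedged toy illustration of the Fourier-transform definition (not of the irreducible decomposition itself), one can evaluate $\hat{f}$ at the permutation representation of $S_3$, which is reducible but easy to write down; the choice of $f$ below is arbitrary:

```python
import numpy as np
from itertools import permutations

N = 3

def perm_rep(sigma):
    """Permutation representation rho(sigma): [rho]_{ij} = 1 if sigma(j) = i."""
    P = np.zeros((N, N))
    for j in range(N):
        P[sigma[j], j] = 1.0
    return P

# an arbitrary f : S_3 -> R, here the number of fixed points of sigma
f = {s: sum(s[i] == i for i in range(N)) for s in permutations(range(N))}

# Fourier coefficient: f_hat(rho) = sum over sigma of f(sigma) * rho(sigma)
f_hat = sum(f[s] * perm_rep(s) for s in permutations(range(N)))
print(f_hat)
```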
20 Right-invariant kernels (Bochner's theorem). An embedding $\Phi : S_N \to \mathbb{R}^p$ defines a right-invariant kernel $K(\sigma_1, \sigma_2) = \Phi(\sigma_1)^\top \Phi(\sigma_2)$ if and only if there exists $\phi : S_N \to \mathbb{R}$ such that $\forall \sigma_1, \sigma_2 \in S_N$, $K(\sigma_1, \sigma_2) = \phi(\sigma_2^{-1} \sigma_1)$ and $\forall \lambda \in \Lambda$, $\hat{\phi}(\rho_\lambda) \succeq 0$.
21 Some attempts: Kendall, SUQUAN (Jiao and Vert, 2015, 2017, 2018; Le Morvan and Vert, 2017)
22 SUQUAN embedding (Le Morvan and Vert, 2017). Let $\Phi(\sigma) = \Pi_\sigma$, the permutation representation (Serre, 1977): $[\Pi_\sigma]_{ij} = 1$ if $\sigma(j) = i$, 0 otherwise. Leads to new approaches for supervised quantile normalization (SUQUAN) and vector quantization.
23 Example: CIFAR-10. Discriminate images of horse vs. plane. Different methods learn different quantile functions. [Figure: learned quantile functions for original, median, SVD and SUQUAN-BND.]
24 Limits of the SUQUAN embedding. A linear model on $\Phi(\sigma) = \Pi_\sigma \in \mathbb{R}^{N \times N}$ captures first-order information of the form "i-th feature ranked at the j-th position". What about higher-order information such as "feature i larger than feature j"?
25 The Kendall embedding (Jiao and Vert, 2015, 2017): $\Phi_{i,j}(\sigma) = 1$ if $\sigma(i) < \sigma(j)$, 0 otherwise.
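This embedding has one coordinate per ordered pair, so it lives in dimension $O(N^2)$ and costs $O(N^2)$ to build explicitly; a small sketch to fix ideas (0-indexed):

```python
import numpy as np

def kendall_embedding(sigma):
    """Phi_{i,j}(sigma) = 1 if sigma(i) < sigma(j), else 0; O(N^2) entries."""
    sigma = np.asarray(sigma)
    return (sigma[:, None] < sigma[None, :]).astype(float)

print(kendall_embedding([2, 0, 1]))
```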
26 Kendall and Mallows kernels. The Kendall kernel is $K_\tau(\sigma, \sigma') = \Phi(\sigma)^\top \Phi(\sigma')$. The Mallows kernel is, for $\lambda \geq 0$, $K_M^\lambda(\sigma, \sigma') = e^{-\lambda \|\Phi(\sigma) - \Phi(\sigma')\|^2}$. Theorem (Jiao and Vert, 2015, 2017): the Kendall and Mallows kernels are positive definite and can be evaluated in $O(N \log N)$ time. The kernel trick is useful with few samples in large dimensions.
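In practice one never builds $\Phi$ explicitly. Here is a minimal sketch of both kernels via SciPy's kendalltau, which runs in $O(N \log N)$ in the spirit of Knight (1966); it assumes tie-free inputs, and the factor 2 comes from $\|\Phi(\sigma) - \Phi(\sigma')\|^2 = 2 n_d$ (see the proof slide below):

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_kernel(x, y):
    """Normalized Kendall kernel: (n_c - n_d) / (N choose 2) = Kendall's tau."""
    return kendalltau(x, y)[0]

def mallows_kernel(x, y, lam=1.0):
    """exp(-lam * ||Phi(x) - Phi(y)||^2) = exp(-2 * lam * n_d), no ties assumed."""
    N = len(x)
    n_pairs = N * (N - 1) / 2
    n_d = n_pairs * (1 - kendalltau(x, y)[0]) / 2   # recover n_d from tau
    return np.exp(-2 * lam * n_d)

x = np.array([1.2, 0.4, 3.3, 2.0])
y = np.array([0.9, 0.1, 2.5, 2.6])
print(kendall_kernel(x, y), mallows_kernel(x, y, lam=0.1))
```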
27 Remark. Kondor and Barbosa (2010) proposed the diffusion kernel on the Cayley graph of the symmetric group generated by adjacent transpositions, which is computationally intensive ($O(N^{2N})$). The Mallows kernel can be written as $K_M^\lambda(\sigma, \sigma') = e^{-\lambda n_d(\sigma, \sigma')}$, where $n_d(\sigma, \sigma')$ is the shortest-path distance on that Cayley graph; it can be computed in $O(N \log N)$. [Figure: Cayley graph of $S_4$.]
28 Higher-order kernels (Jiao and Vert, 2018). Take $\Phi(\sigma) = \Pi_\sigma^d$, a $d$-th order analogue of the permutation representation. For $d = 1$ this is the SUQUAN embedding; for $d = 2$ it leads to a new weighted Kendall kernel, whose weights can be optimized during training.
29 Conclusion. Machine learning beyond vectors, strings and graphs: different embeddings of the symmetric group (Kendall, SUQUAN). Open questions: scalability? robustness to adversarial attacks? THANK YOU!
30 References
R. E. Barlow, D. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York, 1972.
Y. Jiao and J.-P. Vert. The Kendall and Mallows kernels for permutations. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR: W&CP, 2015.
Y. Jiao and J.-P. Vert. The Kendall and Mallows kernels for permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
Y. Jiao and J.-P. Vert. The weighted Kendall and high-order kernels for permutations. Technical report, arXiv, 2018.
W. R. Knight. A computer method for calculating Kendall's tau with ungrouped data. Journal of the American Statistical Association, 61(314):436-439, 1966.
M. Le Morvan and J.-P. Vert. Supervised quantile normalisation. Technical report, arXiv:1706.00244, 2017.
J.-P. Serre. Linear Representations of Finite Groups. Graduate Texts in Mathematics. Springer-Verlag, New York, 1977.
O. Sysoev and O. Burdakov. A smoothed monotonic regression via L2 regularization. Technical Report LiTH-MAT-R-2016/01-SE, Department of Mathematics, Linköping University, 2016.
31 The quantile normalization (QN) embedding. Data: a permutation $\sigma \in S_N$, where $\sigma(i)$ is the rank of item/feature $i$. Fix a target quantile vector $q \in \mathbb{R}^N$ and define $\Phi_q : S_N \to \mathbb{R}^N$ by $\forall \sigma \in S_N$, $[\Phi_q(\sigma)]_i = q_{\sigma(i)}$. "Keep the order, change the values."
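A minimal NumPy sketch of this embedding: sorting twice gives the ranks $\sigma(i)$ (0-indexed below), and indexing the sorted target quantiles with them keeps the order while replacing the values.

```python
import numpy as np

def qn_embed(x, q):
    """[Phi_q(sigma)]_i = q_{sigma(i)}: keep the order of x, impose the values q."""
    q = np.sort(q)                      # target quantiles, non-decreasing
    ranks = np.argsort(np.argsort(x))   # sigma(i) = rank of feature i, 0-indexed
    return q[ranks]

x = np.array([5.0, -1.0, 2.5])
q = np.array([0.0, 1.0, 10.0])
print(qn_embed(x, q))                   # [10.  0.  1.]: order kept, values changed
```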
32 How to choose a "good" target distribution?
33-34 SUQUAN (Le Morvan and Vert, 2017). Learning after standard QN: (1) fix $q$ arbitrarily; (2) QN all samples to get $\Phi_q(\sigma_1), \dots, \Phi_q(\sigma_n)$; (3) learn a model on the normalized data, e.g.: $$\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^N} \left\{ \frac{1}{n} \sum_{i=1}^n \ell_i\left(\beta^\top \Phi_q(\sigma_i)\right) + \lambda \|\beta\|^2 \right\}.$$ Supervised QN (SUQUAN) instead jointly learns $q$ and the model: $$(\hat\beta, \hat{q}) = \operatorname*{argmin}_{\beta, q \in \mathbb{R}^N} \left\{ \frac{1}{n} \sum_{i=1}^n \ell_i\left(\beta^\top \Phi_q(\sigma_i)\right) + \lambda \|\beta\|^2 + \gamma \Omega(q) \right\}.$$
35 Computing $\Phi_q(\sigma)$. For $\sigma \in S_N$, let $\Pi_\sigma$ be the permutation representation (Serre, 1977): $[\Pi_\sigma]_{ij} = 1$ if $\sigma(j) = i$, 0 otherwise. Then $\Phi_q(\sigma) = \Pi_\sigma q$.
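A small numerical check that the matrix form agrees with direct indexing. The convention below puts the 1 at $[\Pi_\sigma]_{i,\sigma(i)}$ so that $(\Pi_\sigma q)_i = q_{\sigma(i)}$ matches $[\Phi_q(\sigma)]_i$ exactly; the slide's index convention differs by a transpose.

```python
import numpy as np

def perm_matrix(ranks):
    """Pi_sigma with [Pi]_{i, sigma(i)} = 1, so (Pi_sigma q)_i = q_{sigma(i)}."""
    N = len(ranks)
    P = np.zeros((N, N))
    P[np.arange(N), ranks] = 1.0
    return P

x = np.array([5.0, -1.0, 2.5])
ranks = np.argsort(np.argsort(x))   # sigma, 0-indexed
q = np.array([0.0, 1.0, 10.0])
print(perm_matrix(ranks) @ q)       # [10.  0.  1.], same as q[ranks]
```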
36 Linear SUQUAN as rank-1 matrix regression. Linear SUQUAN therefore solves
$$\min_{\beta, q \in \mathbb{R}^N} \left\{ \frac{1}{n} \sum_{i=1}^n \ell_i\left(\beta^\top \Phi_q(\sigma_i)\right) + \lambda \|\beta\|^2 + \gamma \Omega(q) \right\}
= \min_{\beta, q \in \mathbb{R}^N} \left\{ \frac{1}{n} \sum_{i=1}^n \ell_i\left(\beta^\top \Pi_{\sigma_i} q\right) + \lambda \|\beta\|^2 + \gamma \Omega(q) \right\}
= \min_{\beta, q \in \mathbb{R}^N} \left\{ \frac{1}{n} \sum_{i=1}^n \ell_i\left(\langle q \beta^\top, \Pi_{\sigma_i} \rangle_{\mathrm{Frobenius}}\right) + \lambda \|\beta\|^2 + \gamma \Omega(q) \right\}.$$
This is a particular linear model estimating a rank-1 matrix $M = q \beta^\top$, where each sample $\sigma \in S_N$ is represented by the matrix $\Pi_\sigma \in \mathbb{R}^{N \times N}$. The problem is non-convex, but alternating optimization over $\beta$ and $q$ is easy.
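A sketch of that alternating scheme, assuming scikit-learn's LogisticRegression for both linear subproblems; it exploits that $\beta^\top \Pi_\sigma q$ is also linear in $q$ for fixed $\beta$ (with coefficient vector $\beta$ reindexed by $\sigma^{-1}$). The $\Omega(q)$ penalty is replaced here by a crude sort-based monotonicity projection, and all names are illustrative, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def suquan_alternate(R, y, q0, n_iter=5, C=1.0):
    """Alternating-minimization sketch for the bilinear SUQUAN objective.
    R[k] holds the 0-indexed ranks sigma_k(i) of sample k."""
    q = np.sort(q0.astype(float))
    beta = None
    for _ in range(n_iter):
        # beta-step: with q fixed, the features are Phi_q(sigma_k) = q[ranks]
        clf_b = LogisticRegression(C=C).fit(q[R], y)
        beta = clf_b.coef_.ravel()
        # q-step: beta^T Phi_q(sigma) = sum_j q_j * beta[sigma^{-1}(j)],
        # so with beta fixed the problem is again linear, now in q
        Xq = np.take(beta, np.argsort(R, axis=1))
        clf_q = LogisticRegression(C=C).fit(Xq, y)
        q = np.sort(clf_q.coef_.ravel())   # crude projection: keep q non-decreasing
    return beta, q

# toy usage on random data
rng = np.random.default_rng(0)
n, N = 200, 10
X = rng.normal(size=(n, N))
R = np.argsort(np.argsort(X, axis=1), axis=1)
y = (X[:, 0] > X[:, 1]).astype(int)
beta, q = suquan_alternate(R, y, q0=np.linspace(0.0, 1.0, N))
```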
37 Experiments: CIFAR-10. Image classification into 10 classes (45 binary problems), $N = 5{,}000$ per class, $p = 1{,}024$ pixels. [Figure: test-set AUC of SUQUAN-SVD, SUQUAN-BND and SUQUAN-SPAV against baseline target quantiles (cauchy, exponential, uniform, gaussian, median); scatter plots of test AUC, SUQUAN-BND vs. median.]
38 Experiments: CIFAR-10. Example: horse vs. plane. Different methods learn different quantile functions. [Figure: learned quantile functions for original, median, SVD and SUQUAN-BND.]
39 Outline 1 The Kendall embedding
40 Limits of the QN embedding. A linear model on $\Phi(\sigma) = \Pi_\sigma \in \mathbb{R}^{N \times N}$ captures first-order information of the form "i-th feature ranked at the j-th position". What about higher-order information such as "feature i larger than feature j"?
41 Another representation: $\Phi_{i,j}(\sigma) = 1$ if $\sigma(i) < \sigma(j)$, 0 otherwise.
42 Kendall and Mallows kernels. The Kendall kernel is $K_\tau(\sigma, \sigma') = \Phi(\sigma)^\top \Phi(\sigma')$. The Mallows kernel is, for $\lambda \geq 0$, $K_M^\lambda(\sigma, \sigma') = e^{-\lambda \|\Phi(\sigma) - \Phi(\sigma')\|^2}$. Theorem (Jiao and Vert, 2015, 2017): the Kendall and Mallows kernels are positive definite and can be evaluated in $O(N \log N)$ time. The kernel trick is useful with few samples in large dimensions.
43 Proof. For any two permutations $\sigma, \sigma' \in S_N$: Inner product: $$\Phi(\sigma)^\top \Phi(\sigma') = \sum_{1 \leq i \neq j \leq N} \mathbb{1}_{\sigma(i) < \sigma(j)} \, \mathbb{1}_{\sigma'(i) < \sigma'(j)} = n_c(\sigma, \sigma'),$$ the number of concordant pairs. Distance: $$\|\Phi(\sigma) - \Phi(\sigma')\|^2 = \sum_{1 \leq i, j \leq N} \left(\mathbb{1}_{\sigma(i) < \sigma(j)} - \mathbb{1}_{\sigma'(i) < \sigma'(j)}\right)^2 = 2\, n_d(\sigma, \sigma'),$$ where $n_d$ is the number of discordant pairs. Both $n_c$ and $n_d$ can be computed in $O(N \log N)$ (Knight, 1966).
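For completeness, a self-contained sketch of the $O(N \log N)$ count of discordant pairs by merge-sort inversion counting, in the spirit of Knight (1966): relabel items so that $\sigma$ becomes the identity, then count inversions of the relabeled $\sigma'$.

```python
def count_discordant(sigma1, sigma2):
    """n_d(sigma1, sigma2) via merge-sort inversion counting, O(N log N)."""
    # relabel so that sigma1 becomes the identity, then count inversions
    order = sorted(range(len(sigma1)), key=lambda i: sigma1[i])
    a = [sigma2[i] for i in order]

    def sort_count(xs):
        if len(xs) <= 1:
            return xs, 0
        mid = len(xs) // 2
        left, nl = sort_count(xs[:mid])
        right, nr = sort_count(xs[mid:])
        merged, inv, i, j = [], nl + nr, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
                inv += len(left) - i     # every remaining left element is larger
        merged += left[i:] + right[j:]
        return merged, inv

    return sort_count(a)[1]

print(count_discordant([0, 1, 2, 3], [1, 0, 3, 2]))  # 2 discordant pairs
```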
44 Related work. Kondor and Barbosa (2010) proposed the diffusion kernel on the Cayley graph of the symmetric group generated by adjacent transpositions, which is computationally intensive ($O(N^{2N})$). The Mallows kernel can be written as $K_M^\lambda(\sigma, \sigma') = e^{-\lambda n_d(\sigma, \sigma')}$, where $n_d(\sigma, \sigma')$ is the shortest-path distance on that Cayley graph; it can be computed in $O(N \log N)$. [Figure: Cayley graph of $S_4$.]
45 Applications. [Figure: average accuracy on 10 microarray classification problems for SVM and KFD classifiers with Kendall, linear, polynomial and RBF kernels (ALL or TOP features), plus (k)TSP and APMV baselines (Jiao and Vert, 2017).]
46 Constraints on $f$. Ridge: $F_0 = \left\{ f \in \mathbb{R}^p : \frac{1}{p} \sum_{i=1}^p f_i^2 \leq 1 \right\}$. Non-decreasing: $F_{BND} = F_0 \cap I_0$, where $I_0 = \{ f \in \mathbb{R}^p : f_1 \leq f_2 \leq \dots \leq f_p \}$. Non-decreasing and smooth: $F_{SPAV} = \left\{ f \in I_0 : \sum_{j=1}^{p-1} (f_{j+1} - f_j)^2 \leq 1 \right\}$.
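Since the algorithms below need the Euclidean projection onto $I_0$, here is a short check of the key fact: projecting onto the monotone cone is exactly isotonic regression, which scikit-learn's IsotonicRegression solves by PAVA.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Euclidean projection of f onto I_0 = {f_1 <= ... <= f_p} via PAVA
f = np.array([0.2, 1.5, 1.1, 2.0, 1.8])
proj = IsotonicRegression().fit_transform(np.arange(len(f)), f)
print(proj)   # [0.2 1.3 1.3 1.9 1.9]: non-decreasing, closest to f in L2
```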
47 SUQUAN-BND and SUQUAN-SPAV

Alternate optimization in $w$ and $f$, with a monotonicity constraint on $f$: proximal gradient steps on $f$ use the Pool Adjacent Violators Algorithm (PAVA, Barlow et al., 1972) or the Smoothed Pool Adjacent Violators algorithm (SPAV, Sysoev and Burdakov, 2016) as proximal operator. Iterating the two steps produces a sequence of target quantiles, although in experiments the performance did not change significantly after the first iteration. Note also that, contrary to SUQUAN-SVD, this algorithm requires an initial non-decreasing target quantile; by default, use the median of the data quantile functions, which is often the default in standard QN normalisation.

Algorithm 2: SUQUAN-BND and SUQUAN-SPAV
Input: $(x_1, y_1), \dots, (x_n, y_n)$, $f_{init} \in I_0$, $\lambda, \gamma \in \mathbb{R}$
Output: target quantile $f \in I_0$
1: for $i = 1$ to $n$ do
2: $(rank_i, order_i) \leftarrow \mathrm{sort}(x_i)$
3: end for
4: $(w, b) \leftarrow \operatorname{argmin}_{w, b} \frac{1}{n} \sum_{i=1}^n \ell_i\left(w^\top f_{init}[rank_i] + b\right) + \lambda \|w\|^2$ (standard linear model optimisation)
5: $f \leftarrow \operatorname{argmin}_{f \in F_{BND}} \frac{1}{n} \sum_{i=1}^n \ell_i\left(f^\top w[order_i] + b\right)$ (isotonic optimisation problem using PAVA as prox), OR $f \leftarrow \operatorname{argmin}_{f \in F_{SPAV}} \frac{1}{n} \sum_{i=1}^n \ell_i\left(f^\top w[order_i] + b\right)$ (smoothed isotonic optimisation problem using SPAV as prox)
48 A variant: SUQUAN-SVD

With a ridge penalty only (no monotonicity constraint, i.e., $f$ restricted to $F_0$), SUQUAN amounts to a linear regression or classification problem over rank-1 matrices $M = f w^\top$. In the binary classification setting, a simple linear classifier (without covariance) is given by linear discriminant analysis: $M_{LDA} = \mu_+ - \mu_-$, where $\mu_+$ and $\mu_-$ are the means of the matrices $\Pi_{x_i}$ over the two classes: $$M_{LDA} = \frac{1}{n_+} \sum_{i : y_i = +1} \Pi_{x_i} - \frac{1}{n_-} \sum_{i : y_i = -1} \Pi_{x_i}.$$ Consequently, a good rank-1 candidate is the closest rank-1 matrix to $M_{LDA}$, given by its top singular vectors; keeping only the corresponding singular vector of $M_{LDA}$ yields a target quantile that can then be used to quantile normalise the data before running any linear classification method. Algorithm 1 summarises the method.

Algorithm 1: SUQUAN-SVD
Input: $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}$
Output: target quantile $f \in F_0$
1: $M_{LDA} \leftarrow 0 \in \mathbb{R}^{p \times p}$
2: $n_{+1} \leftarrow |\{i : y_i = +1\}|$
3: $n_{-1} \leftarrow |\{i : y_i = -1\}|$
4: for $i = 1$ to $n$ do
5: Compute $\Pi_{x_i}$ (by sorting $x_i$)
6: $M_{LDA} \leftarrow M_{LDA} + \frac{y_i}{n_{y_i}} \Pi_{x_i}$
7: end for
8: $(\delta, w, f) \leftarrow \mathrm{SVD}(M_{LDA}, 1)$

Each $\Pi_{x_i}$ involves an $O(p \ln p)$ sort of the entries of $x_i$, so computing $M_{LDA}$, a sum of $n$ permutation matrices, requires $O(np \ln p)$ operations. A naive SVD of $M_{LDA}$ would cost another $O(p^2)$ per power iteration; however, since multiplying a vector by a permutation matrix is just a reordering, the leading singular pair can be computed in $O(np)$ per iteration without forming $M_{LDA}$. Hence the complexity of SUQUAN-SVD is $O(np \ln p)$, the same as quantile normalisation alone.
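A compact NumPy sketch of Algorithm 1 under stated assumptions: it reuses the permutation-matrix convention from the earlier snippet and calls np.linalg.svd instead of a rank-1 power iteration, so this version is $O(p^3)$ rather than the slide's $O(np \ln p)$; names are illustrative.

```python
import numpy as np

def suquan_svd(X, y):
    """SUQUAN-SVD sketch: rank-1 SVD of
    M_LDA = mean(Pi_x | y = +1) - mean(Pi_x | y = -1)."""
    n, p = X.shape
    M = np.zeros((p, p))
    for xi, yi in zip(X, y):
        ranks = np.argsort(np.argsort(xi))      # O(p log p) sort per sample
        M[np.arange(p), ranks] += yi / np.sum(y == yi)
    U, S, Vt = np.linalg.svd(M)                 # full SVD for simplicity
    w, f = U[:, 0], Vt[0]                       # leading singular pair
    return w, f                                 # f: candidate target quantile

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = np.where(X[:, 0] > 0, 1, -1)
w, f = suquan_svd(X, y)
print(f)
```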
49 Experiments: Simulations. The true distribution of the entries of $X$ is normal; the data are then corrupted with a Cauchy, exponential, uniform or bimodal Gaussian distribution. $p = 1000$, $n$ varies, logistic regression. [Figure: AUC and Euclidean distance to the original target quantile as functions of the number of samples, for logistic regression on original and corrupted data, SUQUAN-BND and SUQUAN-SPAV.]