Topographic Mapping and Dimensionality Reduction of Binary Tensor Data of Arbitrary Rank
1 Topographic Mapping and Dimensionality Reduction of Binary Tensor Data of Arbitrary Rank
Peter Tiňo, Jakub Mažgút, Hong Yan, Mikael Bodén
Topographic Mapping and Dimensionality Reduction of Binary Tensor Data of Arbitrary Rank p.1/32
2 Motivation
- An increasing number of data processing tasks involve the manipulation of multi-dimensional objects - tensors.
- Applying pattern recognition or machine learning methods directly leads to high computational and memory requirements, as well as poor generalization.
- To address this curse of dimensionality, decomposition methods compress the data while capturing the dominant trends.
- New methods for processing multi-dimensional tensors in their natural structure have been introduced for
  - real-valued tensors
  - nonnegative tensors
  - symmetric tensors
- These are not suitable for binary tensors.
3 An Example
Source: [Li et al.: mpca, 2008]
4 Tensor
An $N$-th order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ is addressed by $N$ indices $i_n$ ranging from 1 to $I_n$, $n = 1, 2, \dots, N$.
A rank-1 tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ can be obtained as an outer product of $N$ non-zero vectors $u^{(n)} \in \mathbb{R}^{I_n}$, $n = 1, 2, \dots, N$:
$$\mathcal{A} = u^{(1)} \circ u^{(2)} \circ \dots \circ u^{(N)}.$$
For a particular index setting $i = (i_1, i_2, \dots, i_N) \in \Upsilon = \{1, \dots, I_1\} \times \{1, \dots, I_2\} \times \dots \times \{1, \dots, I_N\}$, we have
$$\mathcal{A}_i = \mathcal{A}_{i_1, i_2, \dots, i_N} = \prod_{n=1}^{N} u^{(n)}_{i_n},$$
where $u^{(n)}_{i_n}$ is the $i_n$-th component of the vector $u^{(n)}$.
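The rank-1 construction above can be sketched in NumPy by chaining pairwise outer products; the vectors and sizes below are illustrative, not taken from the slides:

```python
import numpy as np

# Rank-1 tensor of order N = 3 as an outer product of 3 vectors:
# A[i1, i2, i3] = u1[i1] * u2[i2] * u3[i3]
u1 = np.array([1.0, 2.0])        # vector in R^{I1}, I1 = 2
u2 = np.array([3.0, 4.0, 5.0])   # vector in R^{I2}, I2 = 3
u3 = np.array([6.0, 7.0])        # vector in R^{I3}, I3 = 2

# np.multiply.outer chains pairwise outer products into an order-3 tensor
A = np.multiply.outer(np.multiply.outer(u1, u2), u3)

assert A.shape == (2, 3, 2)
# Each entry is the product of the corresponding vector components
assert A[1, 2, 0] == u1[1] * u2[2] * u3[0]
```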
5 Basic algebra
A tensor can be multiplied by a matrix (2nd order tensor) using $n$-mode products. The $n$-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ by a matrix $U \in \mathbb{R}^{J \times I_n}$ is a tensor $(\mathcal{A} \times_n U)$ with entries
$$(\mathcal{A} \times_n U)_{i_1, \dots, i_{n-1}, j, i_{n+1}, \dots, i_N} = \sum_{i_n = 1}^{I_n} \mathcal{A}_{i_1, \dots, i_{n-1}, i_n, i_{n+1}, \dots, i_N} \, U_{j, i_n}.$$
An orthonormal basis $\{u^{(n)}_1, u^{(n)}_2, \dots, u^{(n)}_{I_n}\}$ for the $n$-mode space $\mathbb{R}^{I_n}$ is collected in the basis matrix $U^{(n)} = (u^{(n)}_1, u^{(n)}_2, \dots, u^{(n)}_{I_n})$.
Any tensor $\mathcal{A}$ can be decomposed into the product
$$\mathcal{A} = \mathcal{Q} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \dots \times_N U^{(N)},$$
with the expansion coefficients stored in the $N$-th order tensor $\mathcal{Q} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$.
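A minimal NumPy sketch of the $n$-mode product, using a generic helper (the function name and test data are illustrative):

```python
import numpy as np

def mode_n_product(A, U, n):
    """n-mode product A x_n U: contract the n-th mode of A (size I_n)
    with the second axis of U (shape J x I_n); mode n becomes size J."""
    # tensordot sums over A's axis n and U's axis 1, appending the new axis last
    T = np.tensordot(A, U, axes=(n, 1))
    # move the new axis (currently last) back to position n
    return np.moveaxis(T, -1, n)

A = np.arange(24.0).reshape(2, 3, 4)   # order-3 tensor, I = (2, 3, 4)
U = np.ones((5, 3))                    # J x I_2 matrix for mode n = 1

B = mode_n_product(A, U, 1)
assert B.shape == (2, 5, 4)
# With an all-ones U, every mode-1 slice of B is the sum of A over mode 1
assert np.allclose(B[0, 0, :], A[0, :, :].sum(axis=0))
```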
6 Tucker model for decomposing real tensors
The expansion can also be written as
$$\mathcal{A} = \sum_{i \in \Upsilon} \mathcal{Q}_i \, (u^{(1)}_{i_1} \circ u^{(2)}_{i_2} \circ \dots \circ u^{(N)}_{i_N}).$$
In other words, tensor $\mathcal{A}$ is expressed as a linear combination of $\prod_{n=1}^{N} I_n$ (a lot!) rank-1 basis tensors $(u^{(1)}_{i_1} \circ u^{(2)}_{i_2} \circ \dots \circ u^{(N)}_{i_N})$, obtained as outer products of the corresponding basis vectors.
More restricted models are available - e.g. PARAFAC.
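The equivalence between the mode-product form and the linear-combination form can be checked numerically; this is a toy sketch with random core and basis matrices (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
I = (2, 3, 4)
Q = rng.standard_normal(I)                       # full core tensor
U = [rng.standard_normal((In, In)) for In in I]  # one basis matrix per mode

# Tucker expansion A = Q x_1 U^(1) x_2 U^(2) x_3 U^(3) as a single contraction
A = np.einsum('abc,ia,jb,kc->ijk', Q, U[0], U[1], U[2])

# The same tensor as an explicit linear combination of rank-1 basis tensors
B = np.zeros(I)
for a in range(I[0]):
    for b in range(I[1]):
        for c in range(I[2]):
            rank1 = np.multiply.outer(
                np.multiply.outer(U[0][:, a], U[1][:, b]), U[2][:, c])
            B += Q[a, b, c] * rank1

assert np.allclose(A, B)
```

The triple loop runs over all $\prod_n I_n = 24$ rank-1 basis tensors, which makes the "a lot!" remark on the slide concrete even at this tiny size.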
7 Reduced rank representations of tensors
Assume a smaller number of basis tensors is sufficient to approximate all tensors in a given dataset:
$$\mathcal{A} \approx \sum_{i \in K} \mathcal{Q}_i \, (u^{(1)}_{i_1} \circ u^{(2)}_{i_2} \circ \dots \circ u^{(N)}_{i_N}), \qquad K \subset \Upsilon.$$
The tensors can then be found close to the $|K|$-dimensional hyperplane in the tensor space spanned by the basis tensors $(u^{(1)}_{i_1} \circ u^{(2)}_{i_2} \circ \dots \circ u^{(N)}_{i_N})$, $i \in K$, and $\mathcal{A}$ can be represented through the expansion coefficients $\mathcal{Q}_i$, $i \in K$.
8 The model
Data: $M$ tensors $D = \{\mathcal{A}_m\}_{m=1}^{M}$.
Each element $\mathcal{A}_{m,i}$ is assumed to be (independently) Bernoulli distributed with parameter (mean) $p_{m,i}$:
$$P(\mathcal{A}_{m,i} \mid p_{m,i}) = p_{m,i}^{\mathcal{A}_{m,i}} \, (1 - p_{m,i})^{1 - \mathcal{A}_{m,i}}.$$
The model is parametrized through the log-odds (natural parameter) $\theta_{m,i} = \log \frac{p_{m,i}}{1 - p_{m,i}}$.
The link function is the logistic function:
$$p_{m,i} = \sigma(\theta_{m,i}) = \frac{1}{1 + e^{-\theta_{m,i}}}, \qquad 1 - p_{m,i} = \sigma(-\theta_{m,i}).$$
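The natural-parameter identities on this slide can be verified directly; a small stdlib-only sketch (the function names are illustrative):

```python
import math

def sigmoid(theta):
    """Logistic link: sigma(theta) = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))

def bernoulli_log_lik(a, theta):
    """log P(a | theta) for binary a under the natural parametrization:
    P = sigma(theta)^a * sigma(-theta)^(1-a)."""
    return a * math.log(sigmoid(theta)) + (1 - a) * math.log(sigmoid(-theta))

theta = 1.5
p = sigmoid(theta)
# The log-odds of the mean recover the natural parameter ...
assert abs(math.log(p / (1 - p)) - theta) < 1e-12
# ... and sigma(-theta) equals 1 - p
assert abs(sigmoid(-theta) - (1 - p)) < 1e-12
```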
9 The model
For each data tensor $\mathcal{A}_m$, $m = 1, 2, \dots, M$, we have
$$P(\mathcal{A}_m \mid \theta_m) = \prod_{i \in \Upsilon} P(\mathcal{A}_{m,i} \mid \theta_{m,i}), \quad \text{where} \quad P(\mathcal{A}_{m,i} \mid \theta_{m,i}) = \sigma(\theta_{m,i})^{\mathcal{A}_{m,i}} \, \sigma(-\theta_{m,i})^{1 - \mathcal{A}_{m,i}}.$$
We collect all the parameters $\theta_{m,i}$ in a tensor $\Theta \in \mathbb{R}^{M \times I_1 \times I_2 \times \dots \times I_N}$ of order $N + 1$.
10 The model
So far the values in the parameter tensor $\Theta$ were unconstrained. We now constrain the $N$-th order parameter tensors $\theta_m \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ to lie in the subspace spanned by the reduced set of basis tensors $(u^{(1)}_{r_1} \circ u^{(2)}_{r_2} \circ \dots \circ u^{(N)}_{r_N})$, where $r_n \in \{1, 2, \dots, R_n\}$ and $R_n \le I_n$, $n = 1, 2, \dots, N$.
The indices $r = (r_1, r_2, \dots, r_N)$ take values from the set $\rho = \{1, \dots, R_1\} \times \{1, \dots, R_2\} \times \dots \times \{1, \dots, R_N\}$.
There is no explicit pressure in the model to ensure independence of the basis vectors. In practice, however, the optimized model parameters always represented independent basis vectors, as dependent basis vectors would lead to dependent basis tensors, implying a smaller than intended rank of the tensor decomposition.
11 The model
We also allow for an $N$-th order bias tensor $\Delta \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$, so that the parameter tensors $\theta_m$ are constrained onto an affine space:
$$\theta_m = \sum_{r \in \rho} \mathcal{Q}_{m,r} \, (u^{(1)}_{r_1} \circ u^{(2)}_{r_2} \circ \dots \circ u^{(N)}_{r_N}) + \Delta,$$
$$\theta_{m,i} = \sum_{r \in \rho} \mathcal{Q}_{m,r} \, (u^{(1)}_{r_1} \circ u^{(2)}_{r_2} \circ \dots \circ u^{(N)}_{r_N})_i + \Delta_i = \sum_{r \in \rho} \mathcal{Q}_{m,r} \prod_{n=1}^{N} u^{(n)}_{r_n, i_n} + \Delta_i.$$
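A toy NumPy check of this affine constraint for an order-2 example. The bias symbol was lost in the transcription, so it is written `Delta` here as an assumption; sizes, ranks, and random values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
I, R = (4, 5), (2, 3)                  # tensor sizes I_n and reduced ranks R_n <= I_n

U = [rng.standard_normal((I[n], R[n])) for n in range(2)]  # basis vectors as columns
Q_m = rng.standard_normal(R)           # expansion coefficients for one data tensor
Delta = rng.standard_normal(I)         # bias tensor (symbol assumed; see lead-in)

# theta_m = sum_r Q_{m,r} (u^(1)_{r1} o u^(2)_{r2}) + Delta, as one contraction
theta_m = np.einsum('ab,ia,jb->ij', Q_m, U[0], U[1]) + Delta

# Element-wise check against the explicit double sum over r = (r1, r2)
i, j = 3, 4
val = sum(Q_m[r1, r2] * U[0][i, r1] * U[1][j, r2]
          for r1 in range(R[0]) for r2 in range(R[1])) + Delta[i, j]
assert abs(theta_m[i, j] - val) < 1e-10
```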
12 The model
Log-likelihood:
$$\mathcal{L} = \sum_{m=1}^{M} \sum_{i \in \Upsilon} \left[ \mathcal{A}_{m,i} \log \sigma\!\left( \sum_{r \in \rho} \mathcal{Q}_{m,r} \prod_{n=1}^{N} u^{(n)}_{r_n, i_n} + \Delta_i \right) + (1 - \mathcal{A}_{m,i}) \log \sigma\!\left( -\sum_{r \in \rho} \mathcal{Q}_{m,r} \prod_{n=1}^{N} u^{(n)}_{r_n, i_n} - \Delta_i \right) \right]$$
13 Parameter Estimation
The log-likelihood is not convex in all parameters jointly, but it is convex in any one of these parameters if the others are fixed.
Analytical updates were derived from a lower bound on the likelihood, using a trick from [Schein et al., 2003]. The linear tensor structure gets through!
Iterative estimation scheme: while (convergence criterion not met)
1. $\arg\max_{\mathcal{Q}} \mathcal{L}$
2. $\arg\max_{u} \mathcal{L}$
3. $\arg\max_{\Delta} \mathcal{L}$
14 Parameter Estimation
Messy derivations, but the main message is: even though the original vector model [Schein et al., 2003] is non-linear in its parameters, the strong linear algebraic structure of the Tucker model for tensor decomposition can be superimposed on the parameter space of the tensor model, so that the efficient linear nature of the parameter updates of [Schein et al., 2003] is preserved.
15 Parameter Estimation - Basis Vectors
For $n = 1, 2, \dots, N$, define
$$\Upsilon_n = \{1, \dots, I_1\} \times \dots \times \{1, \dots, I_{n-1}\} \times \{1\} \times \{1, \dots, I_{n+1}\} \times \dots \times \{1, \dots, I_N\},$$
and analogously for $\rho_n$.
Given $i \in \Upsilon_n$ and an $n$-mode index $j \in \{1, 2, \dots, I_n\}$, the index $N$-tuple $(i_1, \dots, i_{n-1}, j, i_{n+1}, \dots, i_N)$ formed by inserting $j$ at the $n$-th place of $i$ is denoted by $[i, j | n]$.
16 Parameter Estimation - Basis Vectors
$$B^{(n)}_{m,i,q} = \sum_{r \in \rho_n} \mathcal{Q}_{m,[r,q|n]} \prod_{s=1, s \neq n}^{N} u^{(s)}_{r_s, i_s}$$
$$S^{(n)}_{q,j} = \sum_{m=1}^{M} \sum_{i \in \Upsilon_n} \left( 2\mathcal{A}_{m,[i,j|n]} - 1 - T_{m,[i,j|n]} \, \Delta_{[i,j|n]} \right) B^{(n)}_{m,i,q}$$
$$K^{(n)}_{q,t,j} = \sum_{m=1}^{M} \sum_{r \in \rho_n} \mathcal{Q}_{m,[r,t|n]} \sum_{i \in \Upsilon_n} T_{m,[i,j|n]} \, B^{(n)}_{m,i,q} \prod_{s=1, s \neq n}^{N} u^{(s)}_{r_s, i_s}$$
17 Parameter Estimation - Basis Vectors
For each $n$-mode coordinate $j \in \{1, 2, \dots, I_n\}$:
- Collect the $j$-th coordinate values of all $n$-mode basis vectors into a column vector $u^{(n)}_{:,j} = (u^{(n)}_{1,j}, u^{(n)}_{2,j}, \dots, u^{(n)}_{R_n,j})^T$.
- Stack all the $S^{(n)}_{q,j}$ values in a column vector $S^{(n)}_{:,j} = (S^{(n)}_{1,j}, S^{(n)}_{2,j}, \dots, S^{(n)}_{R_n,j})^T$.
- Construct an $R_n \times R_n$ matrix $K^{(n)}_{:,:,j}$ whose $q$-th row is $(K^{(n)}_{q,1,j}, K^{(n)}_{q,2,j}, \dots, K^{(n)}_{q,R_n,j})$, $q = 1, 2, \dots, R_n$.
The $n$-mode basis vectors are updated by solving $I_n$ linear systems of size $R_n \times R_n$:
$$K^{(n)}_{:,:,j} \, u^{(n)}_{:,j} = S^{(n)}_{:,j}, \qquad j = 1, 2, \dots, I_n.$$
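The solve step can be sketched as follows. The matrices `K` and right-hand sides `S` here are hypothetical random stand-ins for the precomputed quantities from the previous slide (a diagonal shift keeps the toy systems well-conditioned), so only the "one $R_n \times R_n$ solve per coordinate $j$" pattern is being shown:

```python
import numpy as np

rng = np.random.default_rng(2)
I_n, R_n = 6, 3

# Stand-ins for the precomputed quantities: one R_n x R_n matrix K[:, :, j]
# and one right-hand side S[:, j] per n-mode coordinate j
K = rng.standard_normal((R_n, R_n, I_n)) + 10.0 * np.eye(R_n)[:, :, None]
S = rng.standard_normal((R_n, I_n))

# Updated n-mode basis-vector coordinates, one column u[:, j] per j
u = np.empty((R_n, I_n))
for j in range(I_n):
    u[:, j] = np.linalg.solve(K[:, :, j], S[:, j])

# Each column solves its own linear system K[:, :, j] u[:, j] = S[:, j]
for j in range(I_n):
    assert np.allclose(K[:, :, j] @ u[:, j], S[:, j])
```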
18 Experiments - Synthetic Data
Goal: evaluate the amount of preserved information in compressed tensor representations and compare the performance with an existing real-valued tensor decomposition model (TensorLSI).
1. Sets of binary tensors were sampled from different Bernoulli natural parameter subspaces.
2. Each set contains 10,000 2nd-order binary tensors of size (30, 250).
3. On each set, both models were used to find subspaces using 80% of the tensors.
4. The hold-out sets of tensors (20%) were projected onto the subspaces and then reconstructed back.
5. To evaluate the match between the real-valued predictions and the target binary values we employ AUC analysis.
19 Synthetic Data
A sample of randomly generated binary tensors from the above Bernoulli natural parameter space:
20 ROC curve analysis
Let $\{x_1, x_2, \dots, x_J\}$ be the model prediction outputs for all nonzero elements of tensors from the test set, and $\{y_1, y_2, \dots, y_K\}$ the outputs for all zero elements. The AUC value for that prediction (reconstruction) of the test set tensors is
$$\mathrm{AUC} = \frac{\sum_{j=1}^{J} \sum_{k=1}^{K} C(x_j, y_k)}{J \, K},$$
where $J$ and $K$ are the total numbers of nonzero and zero tensor elements in the test set, respectively, and $C$ is the scoring function
$$C(x_j, y_k) = \begin{cases} 1 & \text{if } x_j > y_k, \\ 0 & \text{otherwise.} \end{cases}$$
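A direct stdlib-only transcription of this AUC formula (the function name and the toy outputs are illustrative; note that, as on the slide, ties score 0):

```python
def auc(pos, neg):
    """AUC as the fraction of (nonzero, zero) output pairs ranked correctly:
    C(x, y) = 1 if x > y, else 0."""
    wins = sum(1 for x in pos for y in neg if x > y)
    return wins / (len(pos) * len(neg))

# Toy model outputs for nonzero (x) and zero (y) tensor elements
x = [0.9, 0.8, 0.4]   # predictions at nonzero entries, J = 3
y = [0.7, 0.3]        # predictions at zero entries, K = 2

# Correctly ranked pairs: (0.9,0.7),(0.9,0.3),(0.8,0.7),(0.8,0.3),(0.4,0.3)
assert abs(auc(x, y) - 5 / 6) < 1e-12
```

This pairwise counting is $O(JK)$; for the tensor sizes in the experiments one would sort the outputs instead, but the quadratic form matches the formula on the slide exactly.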
21 Hold-out Binary Tensor Reconstructions
[Figure: AUC versus number of principal components for GML-PCA and TensorLSI]
22 Introns and Promoters in DNA Sequences
Introns: nucleotide sequences within a gene that are removed by RNA splicing to generate the final RNA product of a gene. Sequences that are joined together in the final mature RNA after RNA splicing are exons.
Promoters: regions of DNA that facilitate the transcription of a particular gene. Promoters are located near the genes they regulate and contain specific DNA sequences providing an initial binding site for RNA polymerase and for proteins - transcription factors - that recruit RNA polymerase.
23 Topographic Mapping of DNA Sequences
Goal: find a mapping that groups functionally similar sub-sequences.
Underlying principle: DNA sub-sequences from different functional regions differ in local term composition. To capture the composition we propose a binary tensor representation of the DNA sub-sequences.
As a dataset of DNA sequences we used 62,000 promoter and intronic subsequences employed in [Li et al., 2008].
24 Representation of a Genomic Sequence
[Figure: a DNA sub-sequence and its corresponding term-position matrix representation, with terms (short nucleotide strings such as aag, agc, acg, ...) indexing the rows and sequence positions indexing the columns]
25 3D PCA Projections of 10-dim Tensor Subspaces
[Figure: 3D PCA projections of sequences analyzed by GML-PCA and by TensorLSI, with promoter and intron sequences marked; the axes are the first three principal components]
26 Topographic Mapping of DNA sequences
[Figure: 2D PCA projection of expansion coefficients from GML-PCA, surrounded by term-position matrices of selected intron (I-1 to I-3) and promoter (P-1 to P-5) sequences]
27 Functional enrichment analysis of promoters
DNA-binding sites of transcription factors are often characterized as relatively short (5-15 nucleotide) sequence patterns. They may occur multiple times in the promoters of the genes whose expression they modulate.
To validate that our model picks up biologically meaningful patterns, we searched the compressed feature space of promoters for biologically relevant structure (including that left by transcription factors).
Genes that are transcribed by the same factors are often functionally similar, so suitable representations of promoters should correlate with the roles assigned to their genes. If the projection to a compressed space highlights such features, it is testament to a method's utility for processing biological sequences.
28 Functional enrichment analysis of promoters
Gene Ontology (GO) provides a controlled vocabulary for the annotation of genes, broadly categorized into terms for cellular component, biological process and molecular function.
Assigning biologically meaningful labels to promoters: sequences were mapped to gene identifiers. In cases of multiple gene identifiers for the same promoter sequence, we picked the identifier with the greatest number of annotations. This yielded a set of unique GO terms annotating the promoters.
We then evaluate whether promoters deemed similar in the topographic mapping are also functionally similar. This requires a notion of distance between a pair of promoters.
29 Functional enrichment analysis of promoters
Naive approach: use the Euclidean distance in the 10-dim coordinate space of natural parameters. This is not correct: (1) the basis tensors are not orthogonal; (2) they span a space of Bernoulli natural parameters that have a nonlinear relationship with the data values.
Model-based distance between two promoter sequences $m$ and $l$: the sum of average symmetrized Kullback-Leibler divergences between the noise distributions for all corresponding tensor elements $i \in \Upsilon$:
$$D(m, l) = \sum_{i \in \Upsilon} \frac{KL[p_{m,i} \,\|\, p_{l,i}] + KL[p_{l,i} \,\|\, p_{m,i}]}{2}.$$
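A stdlib-only sketch of this model-based distance for Bernoulli noise distributions, with the tensor elements flattened into lists (function names and toy values are illustrative):

```python
import math

def kl_bern(p, q):
    """KL[Bern(p) || Bern(q)] for Bernoulli parameters p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def model_distance(P_m, P_l):
    """Sum over tensor elements of the average symmetrized KL divergence."""
    return sum((kl_bern(pm, pl) + kl_bern(pl, pm)) / 2.0
               for pm, pl in zip(P_m, P_l))

# Flattened Bernoulli mean tensors for two promoter sequences (toy values)
P_m = [0.9, 0.2, 0.5]
P_l = [0.8, 0.3, 0.5]

d = model_distance(P_m, P_l)
assert d > 0.0
assert abs(model_distance(P_m, P_m)) < 1e-12      # zero distance to itself
assert abs(d - model_distance(P_l, P_m)) < 1e-12  # symmetric by construction
```

Symmetrizing makes $D$ usable as a distance between promoters even though plain KL divergence is asymmetric.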
30 Are the compressed promoter representations biologically meaningful?
In each test, one promoter is selected as a reference; this is repeated for all promoters. Given a reference promoter $m$, we label the group of promoters $S_m = \{l \mid D(m, l) < D_0\}$ within a pre-specified distance $D_0$ as positives and all others as negatives. We used $D_0 = 25$, usually rendering over one hundred positives.
For each GO term, Fisher's exact test resolves whether it occurs more often amongst the positives than would be expected by chance. The null hypothesis is that the GO term is not attributed more often than by chance to the positives. A small p-value indicates that the term is enriched at the position of the reference promoter $m$.
We repeated the tests after shuffling the points assigned to promoters. In no case did this permutation test identify a single GO term as significant.
31 Yes!
At the chosen significance level, 75 GO terms were enriched around one or more reference promoters. The observation that a subset of promoter sequences are functionally organized after decomposition into 10 basis tensors adds support to the method's ability to detect variation at an information-rich level.
We found a number of terms that are specifically concerned with chromatin structure (that packages the DNA), e.g. GO: Nucleosome, GO: Chromatin assembly or disassembly and GO: Protein-DNA complex assembly.
We also found several enriched terms related to development, e.g. GO: Reproductive process and GO: Female pregnancy.
32 Promoter regions assigned to GO biological process: Reproduction
[Figure: promoter regions assigned to the GO biological process term Reproduction]
MACHINE LEARNING ADVANCED MACHINE LEARNING Recap of Important Notions on Estimation of Probability Density Functions 2 2 MACHINE LEARNING Overview Definition pdf Definition joint, condition, marginal,
More informationMarch 27 Math 3260 sec. 56 Spring 2018
March 27 Math 3260 sec. 56 Spring 2018 Section 4.6: Rank Definition: The row space, denoted Row A, of an m n matrix A is the subspace of R n spanned by the rows of A. We now have three vector spaces associated
More informationSystems of Linear Equations
LECTURE 6 Systems of Linear Equations You may recall that in Math 303, matrices were first introduced as a means of encapsulating the essential data underlying a system of linear equations; that is to
More informationSystems of Linear Equations
Systems of Linear Equations Math 108A: August 21, 2008 John Douglas Moore Our goal in these notes is to explain a few facts regarding linear systems of equations not included in the first few chapters
More informationChapter 2. Error Correcting Codes. 2.1 Basic Notions
Chapter 2 Error Correcting Codes The identification number schemes we discussed in the previous chapter give us the ability to determine if an error has been made in recording or transmitting information.
More informationExample Linear Algebra Competency Test
Example Linear Algebra Competency Test The 4 questions below are a combination of True or False, multiple choice, fill in the blank, and computations involving matrices and vectors. In the latter case,
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationA Modular NMF Matching Algorithm for Radiation Spectra
A Modular NMF Matching Algorithm for Radiation Spectra Melissa L. Koudelka Sensor Exploitation Applications Sandia National Laboratories mlkoude@sandia.gov Daniel J. Dorsey Systems Technologies Sandia
More informationMachine Learning for Signal Processing Sparse and Overcomplete Representations
Machine Learning for Signal Processing Sparse and Overcomplete Representations Abelino Jimenez (slides from Bhiksha Raj and Sourish Chaudhuri) Oct 1, 217 1 So far Weights Data Basis Data Independent ICA
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationPractical considerations of working with sequencing data
Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!
More informationDegenerate Perturbation Theory
Physics G6037 Professor Christ 12/05/2014 Degenerate Perturbation Theory The treatment of degenerate perturbation theory presented in class is written out here in detail. 1 General framework and strategy
More informationLecture 7: Con3nuous Latent Variable Models
CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationPROTEIN SYNTHESIS INTRO
MR. POMERANTZ Page 1 of 6 Protein synthesis Intro. Use the text book to help properly answer the following questions 1. RNA differs from DNA in that RNA a. is single-stranded. c. contains the nitrogen
More informationBMD645. Integration of Omics
BMD645 Integration of Omics Shu-Jen Chen, Chang Gung University Dec. 11, 2009 1 Traditional Biology vs. Systems Biology Traditional biology : Single genes or proteins Systems biology: Simultaneously study
More informationMachine Learning - MT & 14. PCA and MDS
Machine Learning - MT 2016 13 & 14. PCA and MDS Varun Kanade University of Oxford November 21 & 23, 2016 Announcements Sheet 4 due this Friday by noon Practical 3 this week (continue next week if necessary)
More informationAlgebraic Statistics progress report
Algebraic Statistics progress report Joe Neeman December 11, 2008 1 A model for biochemical reaction networks We consider a model introduced by Craciun, Pantea and Rempala [2] for identifying biochemical
More informationUnsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent
Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:
More informationEE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationEECS 275 Matrix Computation
EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 22 1 / 21 Overview
More informationMATH Linear Algebra
MATH 304 - Linear Algebra In the previous note we learned an important algorithm to produce orthogonal sequences of vectors called the Gramm-Schmidt orthogonalization process. Gramm-Schmidt orthogonalization
More informationLogistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017
1 Will Monroe CS 109 Logistic Regression Lecture Notes #22 August 14, 2017 Based on a chapter by Chris Piech Logistic regression is a classification algorithm1 that works by trying to learn a function
More informationPreliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012
Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.
More informationBayes methods for categorical data. April 25, 2017
Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,
More informationChapter 3 Transformations
Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationSUPPLEMENTARY DATA - 1 -
- 1 - SUPPLEMENTARY DATA Construction of B. subtilis rnpb complementation plasmids For complementation, the B. subtilis rnpb wild-type gene (rnpbwt) under control of its native rnpb promoter and terminator
More informationc Springer, Reprinted with permission.
Zhijian Yuan and Erkki Oja. A FastICA Algorithm for Non-negative Independent Component Analysis. In Puntonet, Carlos G.; Prieto, Alberto (Eds.), Proceedings of the Fifth International Symposium on Independent
More informationPCA FACE RECOGNITION
PCA FACE RECOGNITION The slides are from several sources through James Hays (Brown); Srinivasa Narasimhan (CMU); Silvio Savarese (U. of Michigan); Shree Nayar (Columbia) including their own slides. Goal
More informationLatent Variable models for GWAs
Latent Variable models for GWAs Oliver Stegle Machine Learning and Computational Biology Research Group Max-Planck-Institutes Tübingen, Germany September 2011 O. Stegle Latent variable models for GWAs
More informationMachine Learning. Support Vector Machines. Manfred Huber
Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data
More informationDimensionality Reduction:
Dimensionality Reduction: From Data Representation to General Framework Dong XU School of Computer Engineering Nanyang Technological University, Singapore What is Dimensionality Reduction? PCA LDA Examples:
More informationA vector from the origin to H, V could be expressed using:
Linear Discriminant Function: the linear discriminant function: g(x) = w t x + ω 0 x is the point, w is the weight vector, and ω 0 is the bias (t is the transpose). Two Category Case: In the two category
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationAssignment 1 Math 5341 Linear Algebra Review. Give complete answers to each of the following questions. Show all of your work.
Assignment 1 Math 5341 Linear Algebra Review Give complete answers to each of the following questions Show all of your work Note: You might struggle with some of these questions, either because it has
More informationLinear Algebra 1 Exam 2 Solutions 7/14/3
Linear Algebra 1 Exam Solutions 7/14/3 Question 1 The line L has the symmetric equation: x 1 = y + 3 The line M has the parametric equation: = z 4. [x, y, z] = [ 4, 10, 5] + s[10, 7, ]. The line N is perpendicular
More informationr=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J
7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured
More informationUncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization
Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Haiping Lu 1 K. N. Plataniotis 1 A. N. Venetsanopoulos 1,2 1 Department of Electrical & Computer Engineering,
More informationPCA and admixture models
PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1
More information