
Dimension reduction methods: Algorithms and Applications Yousef Saad Department of Computer Science and Engineering University of Minnesota University of Padua June 7, 2018

Introduction, background, and motivation Common goal of data mining methods: to extract meaningful information or patterns from data. Very broad area includes: data analysis, machine learning, pattern recognition, information retrieval,... Main tools used: linear algebra; graph theory; approximation theory; optimization;... In this talk: emphasis on dimension reduction techniques and the interrelations between techniques Padua 06/07/18 p. 2

Introduction: a few factoids We live in an era increasingly shaped by DATA 2.5 × 10^18 bytes of data created in 2015 90% of data on the internet created since 2016 3.8 Billion internet users in 2017. 3.6 Million Google searches worldwide / minute (5.2 B/day) 15.2 Million text messages worldwide / minute Mixed blessing: Opportunities & big challenges. Trend is re-shaping & energizing many research areas... including: numerical linear algebra Padua 06/07/18 p. 3

Drowning in data Dimension Reduction PLEASE! [Cartoon: a person drowning in a stream of bits] Picture modified from http://www.123rf.com/photo_7238007_man-drowning-reaching-out-for-help.html

Major tool of Data Mining: Dimension reduction Goal is not as much to reduce size (& cost) but to: Reduce noise and redundancy in data before performing a task [e.g., classification as in digit/face recognition] Discover important features or parameters The problem: Given $X = [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$, find a low-dimensional representation $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{d \times n}$ of $X$. Achieved by a mapping $\Phi : x \in \mathbb{R}^m \to y \in \mathbb{R}^d$ so that $\Phi(x_i) = y_i$, $i = 1, \ldots, n$. Padua 06/07/18 p. 5

[Diagram: data matrix $X$ ($m \times n$) with column $x_i$ mapped to $Y$ ($d \times n$) with column $y_i$] $\Phi$ may be linear: $y_j = W^\top x_j$ for all $j$, i.e., $Y = W^\top X$... or nonlinear (implicit). Mapping $\Phi$ required to: Preserve proximity? Maximize variance? Preserve a certain graph? Padua 06/07/18 p. 6

Basics: Principal Component Analysis (PCA) In Principal Component Analysis W is computed to maximize the variance of the projected data:
$$\max_{W \in \mathbb{R}^{m \times d},\; W^\top W = I} \;\sum_{i=1}^{n} \Big\| y_i - \frac{1}{n}\sum_{j=1}^{n} y_j \Big\|_2^2, \qquad y_i = W^\top x_i .$$
Leads to maximizing $\mathrm{Tr}\left[ W^\top (X - \mu e^\top)(X - \mu e^\top)^\top W \right]$, with $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$. Solution: $W$ = { $d$ dominant eigenvectors } of the covariance matrix = set of left singular vectors of $\bar{X} = X - \mu e^\top$. Padua 06/07/18 p. 7

SVD: $\bar{X} = U \Sigma V^\top$, $U^\top U = I$, $V^\top V = I$, $\Sigma$ diagonal. Optimal $W = U_d$ = matrix of first $d$ columns of $U$. Solution $W$ also minimizes the reconstruction error: $\sum_i \| x_i - W W^\top x_i \|^2 = \sum_i \| x_i - W y_i \|^2$. In some methods recentering to zero is not done, i.e., $\bar{X}$ is replaced by $X$. Padua 06/07/18 p. 8
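As a concrete illustration (not from the slides), here is a minimal NumPy sketch of PCA via the SVD of the centered data matrix, following the notation above (columns of $X$ are the data points, $W = U_d$, $Y = W^\top \bar X$); the function name `pca` is just illustrative:

```python
import numpy as np

def pca(X, d):
    """PCA of X (m x n, columns = data points): return W = U_d, Y = W^T (X - mu e^T), mu."""
    mu = X.mean(axis=1, keepdims=True)         # column mean, mu in R^m
    Xbar = X - mu                              # centered data X - mu e^T
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)
    W = U[:, :d]                               # d dominant left singular vectors
    Y = W.T @ Xbar                             # low-dimensional representation, d x n
    return W, Y, mu

# usage sketch with random data: m = 50 features, n = 200 samples
X = np.random.rand(50, 200)
W, Y, mu = pca(X, d=2)
print(Y.shape)                                 # (2, 200)
```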

Unsupervised learning Unsupervised learning: methods do not exploit labeled data Example of digits: perform a 2-D projection Images of same digit tend to cluster (more or less) Such 2-D representations are popular for visualization Can also try to find natural clusters in data, e.g., in materials Basic clustering technique: K-means [Figure: 2-D PCA projection of digit images 5-7] [Figure: clusters of materials: superconductors, multiferroics, superhard, catalytic, photovoltaic, ferromagnetic, thermoelectric] Padua 06/07/18 p. 9

Example: The Swiss roll (data points in 3-D) [Figure: the original data in 3-D] Padua 06/07/18 p. 10

2-D reductions: [Figure: four panels showing the PCA, LPP, Eigenmaps, and ONPP embeddings of the Swiss roll] Padua 06/07/18 p. 11

Example: Digit images (a random sample of 30) [Figure: 30 sample digit images] Padua 06/07/18 p. 12

2-D reductions: [Figure: four panels showing the PCA, LLE, Kernel PCA (K-PCA), and ONPP embeddings of digits 0-4] Padua 06/07/18 p. 13

DIMENSION REDUCTION EXAMPLE: INFORMATION RETRIEVAL

Information Retrieval: Vector Space Model Given: a collection of documents (columns of a matrix A) and a query vector q. Collection represented by an $m \times n$ term-by-document matrix with $a_{ij} = L_{ij} G_i N_j$. Queries ("pseudo-documents") q are represented similarly, as a column. Padua 06/07/18 p. 15

Vector Space Model - continued Problem: find a column of A that best matches q. Similarity metric: cosine of the angle between a column of A and q: $$\frac{q^\top A(:, j)}{\| q \|_2 \, \| A(:, j) \|_2}$$ To rank all documents we need to compute $s = q^\top A$, where $s$ = similarity (row) vector. Literal matching not very effective. Padua 06/07/18 p. 16
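A minimal sketch of this ranking, assuming a dense NumPy term-by-document matrix A with nonzero columns; the name `rank_documents` is illustrative, not from the slides:

```python
import numpy as np

def rank_documents(A, q):
    """Cosine similarities between query q and each column of the term-document matrix A."""
    col_norms = np.linalg.norm(A, axis=0)            # ||A(:, j)||_2 for each document
    s = (q @ A) / (np.linalg.norm(q) * col_norms)    # s[j] = cos of angle between q and A(:, j)
    return np.argsort(-s), s                         # documents ranked by decreasing similarity
```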

Common approach: Use the SVD Need to extract intrinsic information or underlying semantic information LSI: replace A by a low rank approximation [from the SVD]: $$A = U \Sigma V^\top \;\longrightarrow\; A_k = U_k \Sigma_k V_k^\top$$ $U_k$: term space, $V_k$: document space. New similarity vector: $s_k = q^\top A_k = q^\top U_k \Sigma_k V_k^\top$. Called Truncated SVD or TSVD in the context of regularization. Main issues: 1) computational cost 2) updates Padua 06/07/18 p. 17
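A hedged sketch of the LSI similarity vector $s_k = q^\top U_k \Sigma_k V_k^\top$, using a dense SVD for clarity; for a large sparse A one would instead compute a partial SVD (e.g. scipy.sparse.linalg.svds):

```python
import numpy as np

def lsi_similarities(A, q, k):
    """Rank documents against the rank-k approximation A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    return q @ Uk @ Sk @ Vkt          # similarity (row) vector s_k over the documents
```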

Use of polynomial filters Idea: Replace $A_k$ by $A\,\phi(A^\top A)$, where $\phi$ is a filter function. Consider the step function (Heaviside): $$\phi(x) = \begin{cases} 0, & 0 \le x \le \sigma_k^2 \\ 1, & \sigma_k^2 \le x \le \sigma_1^2 \end{cases}$$ This would yield the same result as with TSVD but... it is not easy to use this function directly. Solution: use a polynomial approximation to $\phi$... then $s^\top = q^\top A\,\phi(A^\top A)$ requires only matrix-vector products. * See: E. Kokiopoulou & YS '04 Padua 06/07/18 p. 18
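A rough sketch of the filtering idea under simplifying assumptions: the Heaviside step is fit by a low-degree least-squares polynomial in the monomial basis (the cited work uses better-behaved polynomial approximations), and $\phi(A^\top A)$ is applied through matrix-vector products only; the function name and the crude estimate of $\sigma_1^2$ are illustrative:

```python
import numpy as np

def filtered_similarities(A, q, cutoff, deg=10):
    """Sketch of s = phi(A^T A) A^T q with phi ~ step function (0 below `cutoff`, 1 above)."""
    sigma1_sq = np.linalg.norm(A, 2) ** 2            # sigma_1^2; a cheap estimate suffices in practice
    x = np.linspace(0.0, sigma1_sq, 400)
    y = (x >= cutoff).astype(float)                  # Heaviside step at the cutoff
    c = np.polynomial.polynomial.polyfit(x, y, deg)  # power-basis coefficients, degree 0 upward
    # Horner evaluation of phi(A^T A) applied to A^T q: only mat-vecs with A and A^T
    v = A.T @ q
    r = c[-1] * v
    for ci in c[-2::-1]:
        r = A.T @ (A @ r) + ci * v
    return r                                         # similarity vector over the documents
```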

IR: Use of the Lanczos algorithm (J. Chen, YS '09) Lanczos algorithm = projection method on the Krylov subspace $\mathrm{Span}\{v, Av, \ldots, A^{m-1} v\}$. Can get singular vectors with Lanczos, & use them in LSI. Better: use the Lanczos vectors directly for the projection. K. Blom and A. Ruhe [SIMAX, vol. 26, 2005] perform a Lanczos run for each query [expensive]. Proposed: one Lanczos run with a random initial vector; then use the Lanczos vectors in place of the singular vectors. In short: results comparable to those of the SVD at a much lower cost. Padua 06/07/18 p. 19

Background: The Lanczos procedure Let $A$ be an $n \times n$ symmetric matrix.
ALGORITHM 1: Lanczos
1. Choose a vector $v_1$ with $\|v_1\|_2 = 1$. Set $\beta_1 \equiv 0$, $v_0 \equiv 0$
2. For $j = 1, 2, \ldots, m$ Do:
3.   $\alpha_j := v_j^\top A v_j$
4.   $w := A v_j - \alpha_j v_j - \beta_j v_{j-1}$
5.   $\beta_{j+1} := \|w\|_2$. If $\beta_{j+1} = 0$ then Stop
6.   $v_{j+1} := w / \beta_{j+1}$
7. EndDo
Note: The scalars $\beta_{j+1}, \alpha_j$ are selected so that $v_{j+1} \perp v_j$, $v_{j+1} \perp v_{j-1}$, and $\|v_{j+1}\|_2 = 1$. Padua 06/07/18 p. 20
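A direct transcription of Algorithm 1 into NumPy, with full reorthogonalization added (as the next slide recommends); the helper name `lanczos` and the return convention are illustrative:

```python
import numpy as np

def lanczos(A, v1, m):
    """m steps of the symmetric Lanczos procedure.
    Returns V_m = [v_1, ..., v_m], the diagonal alpha and off-diagonal beta of T_m."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))
    alpha = np.zeros(m)
    beta = np.zeros(m + 1)                           # beta[0] plays the role of beta_1 = 0
    V[:, 0] = v1 / np.linalg.norm(v1)
    for j in range(m):
        w = A @ V[:, j]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        if j > 0:
            w -= beta[j] * V[:, j - 1]
        w -= V[:, : j + 1] @ (V[:, : j + 1].T @ w)   # reorthogonalize (needed in practice)
        beta[j + 1] = np.linalg.norm(w)
        if beta[j + 1] == 0:                         # invariant subspace found: stop early
            return V[:, : j + 1], alpha[: j + 1], beta[1 : j + 1]
        V[:, j + 1] = w / beta[j + 1]
    return V[:, :m], alpha, beta[1:m]

# T_m can then be formed as on the next slide:
# V, alpha, beta_off = lanczos(A, v1, m)
# T = np.diag(alpha) + np.diag(beta_off, 1) + np.diag(beta_off, -1)
```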

Background: Lanczos (cont.) Lanczos recurrence: $\beta_{j+1} v_{j+1} = A v_j - \alpha_j v_j - \beta_j v_{j-1}$. Let $V_m = [v_1, v_2, \ldots, v_m]$. Then we have:
$$V_m^\top A V_m = T_m = \begin{pmatrix} \alpha_1 & \beta_2 & & & \\ \beta_2 & \alpha_2 & \beta_3 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_{m-1} & \alpha_{m-1} & \beta_m \\ & & & \beta_m & \alpha_m \end{pmatrix}$$
In theory the $v_j$'s are orthonormal. In practice one needs to reorthogonalize. Padua 06/07/18 p. 21

Tests: IR Information retrieval datasets:

Dataset   # Terms   # Docs   # queries   sparsity
MED       7,014     1,033    30          0.735
CRAN      3,763     1,398    225         1.412

[Figure: preprocessing time vs. iterations for lanczos and tsvd on the Med (left) and Cran (right) datasets] Padua 06/07/18 p. 22

Average query times [Figure: average query time vs. iterations for lanczos, tsvd, lanczos-rf, and tsvd-rf on the Med (left) and Cran (right) datasets] Padua 06/07/18 p. 23

Average retrieval precision [Figure: average precision vs. iterations for lanczos, tsvd, lanczos-rf, and tsvd-rf on the Med (left) and Cran (right) datasets] Retrieval precision comparisons Padua 06/07/18 p. 24

Supervised learning We now have data that is labeled Example: (health sciences) malignant - non malignant Example: (materials) photovoltaic, hard, conductor,... Example: (digit recognition) digits 0, 1,..., 9 [Figure: labeled data points forming classes a-g] Padua 06/07/18 p. 25

Supervised learning We now have data that is labeled Example: (health sciences) malignant - non malignant Example: (materials) photovoltaic, hard, conductor,... Example: (digit recognition) digits 0, 1,..., 9 [Figure: the same labeled classes a-g, now with an unlabeled test point marked "?"] Padua 06/07/18 p. 26

Supervised learning: classification Problem: Given labels (say "A" and "B") for each item of a given set, find a mechanism to classify an unlabelled item into either the "A" or the "B" class. Many applications. [Figure: two labeled classes and an unlabeled point "?"] Example: distinguish SPAM and non-SPAM messages. Can be extended to more than 2 classes. Padua 06/07/18 p. 27

Another application: Padua 06/07/18 p. 28

Supervised learning: classification Best illustration: written digit recognition example Given: a set of labeled samples (training set), and an (unlabeled) test image. Problem: find the label of the test image. [Diagram: training data (labeled images of digits 0-9) and test data (digit "?") both passed through dimension reduction] Roughly speaking: we seek dimension reduction so that recognition is more effective in the low-dimensional space. Padua 06/07/18 p. 29

Basic method: K-nearest neighbors (KNN) classification Idea of a voting system: get distances between the test sample and the training samples. Get the k nearest neighbors (here k = 8). The predominant class among these k items is assigned to the test sample. Padua 06/07/18 p. 30
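A minimal sketch of this voting rule, assuming the columns of X_train are the training samples and Euclidean distances; names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, labels, x_test, k=8):
    """Assign to x_test the predominant label among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_test[:, None], axis=0)  # distance to each column
    nearest = np.argsort(dists)[:k]                            # indices of the k nearest neighbors
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```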

Supervised learning: Linear classification Linear classifiers: Find a hyperplane which best separates the data in classes A and B. Example of application: distinguish between SPAM and non-SPAM e-mails. [Figure: linear classifier separating two classes] Note: The world is non-linear. Often this is combined with kernels, which amounts to changing the inner product. Padua 06/07/18 p. 31

A harder case: [Figure: spectral bisection (PDDP) on data that is not linearly separable] Use kernels to transform

Projection with kernels, $\sigma^2 = 2.7463$: [Figure: the transformed data with a Gaussian kernel] Padua 06/07/18 p. 33
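A sketch of this kind of kernel transformation, here realized as Gaussian-kernel (kernel) PCA on the columns of X; the slides do not spell out the exact construction, so this is only one standard way to implement "changing the inner product" with a Gaussian kernel of width sigma:

```python
import numpy as np

def gaussian_kernel_pca(X, d=2, sigma=1.0):
    """Kernel PCA sketch: Gaussian kernel on the columns of X, centered in feature space,
    projected onto the top d kernel eigenvectors."""
    n = X.shape[1]
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2 * (X.T @ X)       # pairwise squared distances
    K = np.exp(-D2 / (2 * sigma**2))                     # Gaussian kernel matrix
    J = np.eye(n) - np.ones((n, n)) / n                  # centering in feature space
    Kc = J @ K @ J
    vals, vecs = np.linalg.eigh(Kc)                      # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]
    Y = vecs[:, idx].T * np.sqrt(np.maximum(vals[idx], 0))[:, None]   # d x n embedding
    return Y
```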

Simple linear classifiers Let $X = [x_1, \ldots, x_n]$ be the data matrix and $L = [l_1, \ldots, l_n]$ the labels, either +1 or -1. 1st solution: Find a vector $u$ such that $u^\top x_i$ is close to $l_i$ for all $i$. Common solution: SVD to reduce the dimension of the data [e.g. to 2-D], then do the comparison in this space, e.g. A: $u^\top x_i \ge 0$, B: $u^\top x_i < 0$. [Figure; for clarity the principal axis $u$ is drawn below where it should be] Padua 06/07/18 p. 34
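A minimal sketch of the first solution above: fit $u$ by least squares so that $u^\top x_i \approx l_i$, then classify by the sign of $u^\top x$; function names are illustrative:

```python
import numpy as np

def train_linear_classifier(X, labels):
    """Find u with u^T x_i close to l_i (labels +1 / -1) by least squares."""
    u, *_ = np.linalg.lstsq(X.T, labels, rcond=None)   # rows of X.T are the x_i^T
    return u

def classify(u, x):
    """Class A if u^T x >= 0, class B otherwise."""
    return "A" if u @ x >= 0 else "B"
```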

Fisher's Linear Discriminant Analysis (LDA) Principle: Use label information to build a good projector, i.e., one that can discriminate well between classes. Define "between scatter": a measure of how well separated two distinct classes are. Define "within scatter": a measure of how well clustered items of the same class are. Objective: make the between scatter measure large and the within scatter measure small. Idea: Find a projector that maximizes the ratio of the between scatter measure over the within scatter measure. Padua 06/07/18 p. 35

Define:
$$S_B = \sum_{k=1}^{c} n_k\, (\mu^{(k)} - \mu)(\mu^{(k)} - \mu)^\top, \qquad S_W = \sum_{k=1}^{c} \sum_{x_i \in X_k} (x_i - \mu^{(k)})(x_i - \mu^{(k)})^\top$$
Where: $\mu$ = mean($X$), $\mu^{(k)}$ = mean($X_k$), $X_k$ = $k$-th class, $n_k = |X_k|$. [Figure: classes $X_1, X_2, X_3$ with their cluster centroids and the global centroid $\mu$] Padua 06/07/18 p. 36

Consider second moments for a vector $a$:
$$a^\top S_B a = \sum_{k=1}^{c} n_k\, |a^\top (\mu^{(k)} - \mu)|^2, \qquad a^\top S_W a = \sum_{k=1}^{c} \sum_{x_i \in X_k} |a^\top (x_i - \mu^{(k)})|^2$$
$a^\top S_B a$ = weighted variance of the projected $\mu^{(k)}$'s; $a^\top S_W a$ = weighted sum of variances of the projected classes $X_k$. LDA projects the data so as to maximize the ratio of these two numbers: $$\max_a \frac{a^\top S_B a}{a^\top S_W a}$$ Optimal $a$ = eigenvector associated with the largest eigenvalue of $S_B u_i = \lambda_i S_W u_i$. Padua 06/07/18 p. 37
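A dense NumPy/SciPy sketch of LDA as defined above: build $S_B$ and $S_W$ from the labeled data and keep the top eigenvectors of the pencil $(S_B, S_W)$; the small ridge added to $S_W$ is a common safeguard and not part of the slides:

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, labels, d=1):
    """Fisher LDA: columns of X are samples; returns U (m x d) so that Y = U.T @ X."""
    labels = np.asarray(labels)
    m = X.shape[0]
    mu = X.mean(axis=1)
    S_B = np.zeros((m, m))
    S_W = np.zeros((m, m))
    for k in np.unique(labels):
        Xk = X[:, labels == k]                 # k-th class
        nk = Xk.shape[1]
        muk = Xk.mean(axis=1)
        dk = (muk - mu)[:, None]
        S_B += nk * (dk @ dk.T)                # between scatter
        Ck = Xk - muk[:, None]
        S_W += Ck @ Ck.T                       # within scatter
    vals, vecs = eigh(S_B, S_W + 1e-8 * np.eye(m))   # generalized symmetric eigenproblem
    U = vecs[:, np.argsort(vals)[::-1][:d]]          # eigenvectors with largest eigenvalues
    return U
```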

LDA Extension to arbitrary dimensions Criterion: maximize the ratio of two traces: $$\frac{\mathrm{Tr}\,[U^\top S_B U]}{\mathrm{Tr}\,[U^\top S_W U]}$$ Constraint: $U^\top U = I$ (orthogonal projector). Reduced dimension data: $Y = U^\top X$. Common viewpoint: hard to maximize, therefore... alternative: solve instead the (easier) problem: $$\max_{U^\top S_W U = I} \mathrm{Tr}\,[U^\top S_B U]$$ Solution: largest eigenvectors of $S_B u_i = \lambda_i S_W u_i$. Padua 06/07/18 p. 38

LDA Extension to arbitrary dimensions (cont.) Consider the original problem: $$\max_{U \in \mathbb{R}^{n \times p},\; U^\top U = I} \frac{\mathrm{Tr}\,[U^\top A U]}{\mathrm{Tr}\,[U^\top B U]}$$ Let $A, B$ be symmetric and assume that $B$ is positive semi-definite with $\mathrm{rank}(B) > n - p$. Then $\mathrm{Tr}\,[U^\top A U] / \mathrm{Tr}\,[U^\top B U]$ has a finite maximum value $\rho_*$. The maximum is reached for a certain $U$ that is unique up to unitary transforms of columns. Consider the function: $$f(\rho) = \max_{V^\top V = I} \mathrm{Tr}\,[V^\top (A - \rho B) V]$$ Call $V(\rho)$ the maximizer for an arbitrary given $\rho$. Note: $V(\rho)$ = set of eigenvectors - not unique. Padua 06/07/18 p. 39

Define G(ρ) A ρb and its n eigenvalues: µ 1 (ρ) µ 2 (ρ) µ n (ρ). Clearly: f(ρ) = µ 1 (ρ) + µ 2 (ρ) + + µ p (ρ). Can express this differently. Define eigenprojector: P (ρ) = V (ρ)v (ρ) T Then: f(ρ) = Tr [V (ρ) T G(ρ)V (ρ)] = Tr [G(ρ)V (ρ)v (ρ) T ] = Tr [G(ρ)P (ρ)]. Padua 06/07/18 p. 40

Recall [e.g. Kato '65] that: $$P(\rho) = \frac{1}{2\pi i} \int_\Gamma (G(\rho) - zI)^{-1}\, dz$$ where $\Gamma$ is a smooth curve containing the $p$ eigenvalues of interest. Hence: $$f(\rho) = \frac{1}{2\pi i} \mathrm{Tr} \int_\Gamma G(\rho)\,(G(\rho) - zI)^{-1}\, dz = \cdots = \frac{1}{2\pi i} \mathrm{Tr} \int_\Gamma z\,(G(\rho) - zI)^{-1}\, dz$$ With this, one can prove: 1. $f$ is a non-increasing function of $\rho$; 2. $f(\rho) = 0$ iff $\rho = \rho_*$; 3. $f'(\rho) = -\mathrm{Tr}\,[V(\rho)^\top B V(\rho)]$. Padua 06/07/18 p. 41

Can now use Newton's method. Careful when defining $V(\rho)$: define the eigenvectors so that the mapping $V(\rho)$ is differentiable. $$\rho_{\mathrm{new}} = \rho + \frac{\mathrm{Tr}\,[V(\rho)^\top (A - \rho B) V(\rho)]}{\mathrm{Tr}\,[V(\rho)^\top B V(\rho)]} = \frac{\mathrm{Tr}\,[V(\rho)^\top A V(\rho)]}{\mathrm{Tr}\,[V(\rho)^\top B V(\rho)]}$$ Newton's method to find the zero of $f$ = a fixed point iteration with: $$g(\rho) = \frac{\mathrm{Tr}\,[V(\rho)^\top A V(\rho)]}{\mathrm{Tr}\,[V(\rho)^\top B V(\rho)]}.$$ Idea: Compute $V(\rho)$ by a Lanczos-type procedure. Note: standard problem - [not generalized] - inexpensive. See T. Ngo, M. Bellalij, and Y.S. for details. Padua 06/07/18 p. 42
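A dense sketch of the resulting fixed-point (Newton) iteration $g(\rho) = \mathrm{Tr}[V(\rho)^\top A V(\rho)] / \mathrm{Tr}[V(\rho)^\top B V(\rho)]$, computing $V(\rho)$ with a full eigendecomposition for simplicity; the slides advocate a Lanczos-type procedure for $V(\rho)$ instead, and the tolerances are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def trace_ratio(A, B, p, tol=1e-10, maxit=100):
    """Iterate rho <- Tr[V^T A V] / Tr[V^T B V], V = top-p eigenvectors of G(rho) = A - rho*B."""
    rho = 0.0
    for _ in range(maxit):
        vals, vecs = eigh(A - rho * B)
        V = vecs[:, np.argsort(vals)[::-1][:p]]            # top-p eigenvectors of G(rho)
        rho_new = np.trace(V.T @ A @ V) / np.trace(V.T @ B @ V)
        if abs(rho_new - rho) < tol * max(1.0, abs(rho)):
            return rho_new, V
        rho = rho_new
    return rho, V
```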

GRAPH-BASED TECHNIQUES

Multilevel techniques in brief Divide and conquer paradigms as well as multilevel methods in the sense of domain decomposition Main principle: very costly to do an SVD [or Lanczos] on the whole set. Why not find a smaller set on which to do the analysis without too much loss? Tools used: graph coarsening, divide and conquer For text data we use hypergraphs Padua 06/07/18 p. 44

Multilevel Dimension Reduction Main Idea: coarsen for a few levels. Use the resulting data set $\hat{X}$ to find a projector $P$ from $\mathbb{R}^m$ to $\mathbb{R}^d$. $P$ can be used to project the original data or new data. [Diagram: graph coarsening takes $X$ ($m \times n$) down to a last-level $X_r$ ($m \times n_r$); the projection step then produces $Y$ ($d \times n$) and $Y_r$ ($d \times n_r$)] Gain: Dimension reduction is done with a much smaller set. Hope: not much loss compared to using the whole data. Padua 06/07/18 p. 45

Making it work: Use of Hypergraphs for sparse data 1 2 3 4 5 6 7 8 9 * * * * a * * * * b A = * * * * c * * * d * * e e 1 2 9 net e c a 3 4 7 8 b 5 d 6 net d Left: a (sparse) data set of n entries in R m represented by a matrix A R m n Right: corresponding hypergraph H = (V, E) with vertex set V representing to the columns of A. Padua 06/07/18 p. 46

Hypergraph Coarsening uses column matching, similar to a common approach used in graph partitioning. Compute the non-zero inner product $\langle a^{(i)}, a^{(j)} \rangle$ between two vertices $i$ and $j$, i.e., the $i$th and $j$th columns of $A$. Note: $\langle a^{(i)}, a^{(j)} \rangle = \|a^{(i)}\| \, \|a^{(j)}\| \cos \theta_{ij}$. Modif. 1: Parameter $0 < \epsilon < 1$. Match two vertices, i.e., columns, only if the angle between them satisfies $\tan \theta_{ij} \le \epsilon$. Modif. 2: Scale coarsened columns. If $i$ and $j$ are matched and if $\|a^{(i)}\|_0 \ge \|a^{(j)}\|_0$, replace $a^{(i)}$ and $a^{(j)}$ by $$c^{(\ell)} = \sqrt{1 + \cos^2 \theta_{ij}}\; a^{(i)}.$$ Padua 06/07/18 p. 47
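A simplified, dense sketch of one level of this column matching (greedy pairing with an O(n^2) scan, columns assumed nonzero); a practical implementation would exploit the hypergraph structure and sparsity, and the scaling rule follows Modif. 2 above:

```python
import numpy as np

def coarsen_columns(A, eps=0.5):
    """Greedy one-level coarsening: match columns i, j when tan(theta_ij) <= eps and replace
    the pair by sqrt(1 + cos^2(theta_ij)) * a_i, a_i being the column with more nonzeros."""
    n = A.shape[1]
    norms = np.linalg.norm(A, axis=0)
    matched = np.zeros(n, dtype=bool)
    coarse_cols = []
    for i in range(n):
        if matched[i]:
            continue
        matched[i] = True
        best_j, best_cos = -1, 0.0
        for j in range(i + 1, n):
            if matched[j]:
                continue
            c = min(abs(A[:, i] @ A[:, j]) / (norms[i] * norms[j]), 1.0)   # |cos(theta_ij)|
            tan = np.sqrt(1.0 - c**2) / max(c, 1e-15)
            if tan <= eps and c > best_cos:
                best_j, best_cos = j, c
        if best_j >= 0:
            matched[best_j] = True
            keep = i if np.count_nonzero(A[:, i]) >= np.count_nonzero(A[:, best_j]) else best_j
            coarse_cols.append(np.sqrt(1.0 + best_cos**2) * A[:, keep])
        else:
            coarse_cols.append(A[:, i])            # unmatched column is kept as is
    return np.column_stack(coarse_cols)
```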

Call $C$ the coarsened matrix obtained from $A$ using the approach just described. Lemma: Let $C \in \mathbb{R}^{m \times c}$ be the coarsened matrix of $A$ obtained by one level of coarsening of $A \in \mathbb{R}^{m \times n}$, with columns $a^{(i)}$ and $a^{(j)}$ matched if $\tan \theta_{ij} \le \epsilon$. Then $$\left| x^\top A A^\top x - x^\top C C^\top x \right| \le 3 \epsilon \, \|A\|_F^2,$$ for any $x \in \mathbb{R}^m$ with $\|x\|_2 = 1$. Very simple bound for Rayleigh quotients for any $x$. Implies some bounds on singular values and norms - skipped. Padua 06/07/18 p. 48

Tests: Comparing singular values 55 50 45 Singular values to 50 and approximations Exact power it1 coarsen coars+zs rand 38 36 34 32 Singular values to 50 and approximations Exact power it1 coarsen coars+zs rand 50 45 Singular values to 50 and approximations Exact power it1 coarsen coars+zs rand 30 40 40 28 35 35 26 30 24 30 22 25 25 30 35 40 45 50 25 30 35 40 45 50 25 25 30 35 40 45 50 Results for the datasets CRANFIELD (left), MEDLINE (middle), and TIME (right). Padua 06/07/18 p. 49

Low rank approximation: Coarsening, random sampling, and rand+coarsening. Err1 = $\|A - H_k H_k^\top A\|_F$; Err2 = $\frac{1}{k} \sum_{i=1}^{k} \frac{|\hat{\sigma}_i - \sigma_i|}{\sigma_i}$.

Dataset        n       k    c       Coarsen Err1   Coarsen Err2   Rand Sampl Err1   Rand Sampl Err2
Kohonen        4470    50   1256    86.26          0.366          93.07             0.434
aft01          85      50   40      913.3          0.299          06.2              0.614
FA             617     30   1504    27.79          0.131          28.63             0.4
chipcool0      082     30   2533    6.091          0.313          6.199             0.360
brainpc2       27607   30   865     2357.5         0.579          2825.0            0.603
scfxm1-2b      33047   25   2567    2326.1         -              2328.8            -
thermomechtc   2158    30   6286    63.2           -              79.7              -
Webbase-1M     00005   25   15625   3564.5         -              -                 -

Padua 06/07/18 p. 50

LINEAR ALGEBRA METHODS: EXAMPLES

Updating the SVD (E. Vecharynski and YS '13) In applications, the data matrix X is often updated. Example: Information Retrieval (IR): can add documents, add terms, change weights,... Problem: Given the partial SVD of X, how to get a partial SVD of $X_{\mathrm{new}}$? Will illustrate only with an update of the form $X_{\mathrm{new}} = [X, D]$ (documents added in IR). Padua 06/07/18 p. 52

Updating the SVD: Zha-Simon algorithm Assume $A \approx U_k \Sigma_k V_k^\top$ and let $A_D = [A, D]$, $D \in \mathbb{R}^{m \times p}$. Compute $D_k = (I - U_k U_k^\top) D$ and its QR factorization: $[\hat{U}_p, R] = \mathrm{qr}(D_k, 0)$, $R \in \mathbb{R}^{p \times p}$, $\hat{U}_p \in \mathbb{R}^{m \times p}$. Note: $$A_D \approx [U_k, \hat{U}_p]\, H_D \begin{bmatrix} V_k & 0 \\ 0 & I_p \end{bmatrix}^\top, \qquad H_D \equiv \begin{bmatrix} \Sigma_k & U_k^\top D \\ 0 & R \end{bmatrix}$$ Zha & Simon ('99): Compute the SVD of $H_D$ and get the approximate SVD from the above equation. It turns out this is a Rayleigh-Ritz projection method for the SVD [E. Vecharynski & YS '13]. Can show optimality properties as a result. Padua 06/07/18 p. 53
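A small dense sketch of this update, taking Uk, Sk (the diagonal of $\Sigma_k$), Vk and the new block D as inputs; variable and function names are illustrative, not from the slides:

```python
import numpy as np

def zha_simon_update(Uk, Sk, Vk, D):
    """Zha-Simon style update for A_D = [A, D], given A ~ Uk diag(Sk) Vk^T."""
    k, p = Uk.shape[1], D.shape[1]
    Dk = D - Uk @ (Uk.T @ D)                        # (I - Uk Uk^T) D
    Up, R = np.linalg.qr(Dk)                        # reduced QR: Up (m x p), R (p x p)
    HD = np.block([[np.diag(Sk), Uk.T @ D],
                   [np.zeros((p, k)), R]])          # small (k+p) x (k+p) matrix H_D
    F, s, Gt = np.linalg.svd(HD)                    # SVD of H_D
    U_new = np.hstack([Uk, Up]) @ F[:, :k]          # updated left singular vectors
    V_big = np.block([[Vk, np.zeros((Vk.shape[0], p))],
                      [np.zeros((p, k)), np.eye(p)]])
    V_new = V_big @ Gt.T[:, :k]                     # updated right singular vectors
    return U_new, s[:k], V_new
```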

Updating the SVD When the number of updates is large this becomes costly. Idea: Replace $\hat{U}_p$ by a low dimensional approximation: use $\bar{U}$ of the form $\bar{U} = [U_k, Z_l]$ instead of $\bar{U} = [U_k, \hat{U}_p]$. $Z_l$ must capture the range of $D_k = (I - U_k U_k^\top) D$. Simplest idea: best rank-$l$ approximation using the SVD. Can also use Lanczos vectors from the Golub-Kahan-Lanczos algorithm. Padua 06/07/18 p. 54

An example LSI - with the MEDLINE collection: m = 7,014 (terms), n = 1,033 (docs), k = 75 (dimension), t = 533 (initial # docs), n_q = 30 (queries). Adding blocks of 25 docs at a time. The number of singular triplets of $(I - U_k U_k^\top) D$ used in the SVD projection ('SV') is 2. For the GKL approach ('GKL') 3 GKL vectors are used. These two methods are compared to Zha-Simon ('ZS'). We show average precision then time. Padua 06/07/18 p. 55

Retrieval accuracy, p = 25 [Figure: average precision vs. number of documents for ZS, SV, and GKL] Experiments show: the gain in accuracy is rather consistent. Padua 06/07/18 p. 56

Updating time, p = 25 [Figure: updating time (sec.) vs. number of documents for ZS, SV, and GKL] Times can be significantly better for large sets. Padua 06/07/18 p. 57

Conclusion Interesting new matrix problems in areas that involve the effective mining of data. Among the most pressing issues is that of reducing computational cost [SVD, SDP, ..., too costly]. Many online resources available. Huge potential in areas like materials science, though inertia has to be overcome. To a researcher in computational linear algebra: a big tide of change in the types of problems, algorithms, frameworks, culture,... But change should be welcome.

When one door closes, another opens; but we often look so long and so regretfully upon the closed door that we do not see the one which has opened for us. Alexander Graham Bell (1847-1922) In the words of Lao Tzu: If you do not change direction, you may end up where you are heading. Thank you! Visit my web-site at www.cs.umn.edu/~saad Padua 06/07/18 p. 59