
1 Linear algebra methods for data mining with applications to materials. Yousef Saad, Department of Computer Science and Engineering, University of Minnesota. SIAM 2012 Annual Meeting, Twin Cities, July 13th, 2012.

2 Introduction: a few factoids. Data is growing exponentially at an alarming rate: 90% of the data in the world today was created in the last two years. Every day, 2.3 million terabytes (2.3 x 10^18 bytes) are created. The term "big data" was coined to reflect this trend. A mixed blessing: opportunities & big challenges. The trend is re-shaping & energizing many research areas, including my own: numerical linear algebra.

3 Introduction: What is data mining? A set of methods and tools to extract meaningful information or patterns from data. Broad area: data analysis, machine learning, pattern recognition, information retrieval, ... Tools used: linear algebra; statistics; graph theory; approximation theory; optimization; ... This talk: a brief introduction with emphasis on the linear algebra viewpoint + our initial work on materials. Big emphasis on dimension reduction methods.

4 Drowning in data... Dimension Reduction, PLEASE! [Cartoon illustration.]

5 Major tool of data mining: dimension reduction. The goal is not as much to reduce size (& cost) but to: reduce noise and redundancy in data before performing a task [e.g., classification as in digit/face recognition]; discover important features or patterns. Map data so as to: preserve proximity? maximize variance? preserve a certain graph? Other term used: feature extraction [a general term for low-dimensional representations of data].

6 The problem of dimension reduction. Given $d \ll m$, find a mapping $\Phi : x \in \mathbb{R}^m \longrightarrow y \in \mathbb{R}^d$. Practically: given $X = [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$, find a low-dimensional representation $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{d \times n}$ of $X$. [Diagram: the $m \times n$ matrix $X$ with column $x_i$ mapped by $\Phi$ to the $d \times n$ matrix $Y$ with column $y_i$.] $\Phi$ may be linear or nonlinear (implicit). Linear case ($W \in \mathbb{R}^{m \times d}$): $Y = W^\top X$.

7 Example: Principal Component Analysis (PCA). In Principal Component Analysis, $W$ is computed to maximize the variance of the projected data:
$$\max_{W \in \mathbb{R}^{m \times d},\; W^\top W = I} \; \sum_{i=1}^{n} \Big\| y_i - \frac{1}{n} \sum_{j=1}^{n} y_j \Big\|_2^2, \qquad y_i = W^\top x_i .$$
This leads to maximizing $\mathrm{Tr}\left[ W^\top (X - \mu e^\top)(X - \mu e^\top)^\top W \right]$, where $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $e$ is the vector of all ones.

8 Solution: $W$ = { the $d$ dominant eigenvectors of the covariance matrix } = the set of leading left singular vectors of $\bar{X} = X - \mu e^\top$. SVD: $\bar{X} = U \Sigma V^\top$, with $U^\top U = I$, $V^\top V = I$, $\Sigma$ diagonal. Optimal $W = U_d$ = matrix of the first $d$ columns of $U$. The solution $W$ also minimizes the reconstruction error: $\sum_i \| x_i - W W^\top x_i \|^2 = \sum_i \| x_i - W y_i \|^2$. In some methods the recentering to zero is not done, i.e., $\bar{X}$ is replaced by $X$.
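
As a concrete illustration of the SVD route just described, here is a minimal NumPy sketch (my own, not code from the talk): it centers the data, keeps the d dominant left singular vectors, and forms the reduced representation $Y = W^\top \bar{X}$.

```python
import numpy as np

def pca(X, d):
    """Minimal PCA sketch. X is m x n, one data point per column.
    Returns the orthonormal basis W (m x d), the reduced data Y = W^T (X - mu e^T),
    and the mean mu used for centering."""
    mu = X.mean(axis=1, keepdims=True)                 # global centroid
    Xbar = X - mu                                      # recentered data
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)
    W = U[:, :d]                                       # d dominant left singular vectors
    return W, W.T @ Xbar, mu

# Usage: 200 random points in R^50 reduced to 2-D.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
W, Y, mu = pca(X, 2)
print(W.shape, Y.shape)   # (50, 2) (2, 200)
```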

9 Example: digit images (a random sample of 30). [Figure: 30 sample digit images.]

10 2-D reductions of the digit images. [Figure: four panels showing the 2-D projections obtained with PCA, LLE, Kernel PCA, and ONPP.]

11 Unsupervised learning: methods that do not exploit known labels. Example of the digits: perform a 2-D projection; images of the same digit tend to cluster (more or less). Such 2-D representations are popular for visualization. Can also try to find natural clusters in data, e.g., in materials. Basic clustering technique: K-means. [Figures: PCA projection of the digit images; diagram of materials classes such as superconductors, multiferroics, superhard, catalytic, photovoltaic, ferromagnetic, thermoelectric.]
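
For reference, a bare-bones Lloyd's iteration for the K-means technique mentioned above; this is a toy NumPy sketch (random initialization, dense distance computation, my own naming), not a production clustering code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's iteration. X is n x m (one point per row), k = # of clusters.
    Returns the cluster labels and the centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # random initial centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update step: recompute centroids (keep the old one if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```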

12 Example: sparse matrices viewpoint (J. Chen & YS '09). Communities modeled by an affinity graph [e.g., user A sends frequent e-mails to user B]. The adjacency graph is represented by a sparse matrix. [Figures: original matrix and reordered matrix.] Goal: find an ordering so that the blocks are as dense as possible. Use blocking techniques for sparse matrices. Advantage of this viewpoint: need not know the # of clusters. [Data: www-personal.umich.edu/~mejn/netdata/]

13 Supervised learning: classification. Best illustration: the handwritten digits recognition example. Given: a set of labeled samples (the training set) and an (unlabeled) test image. Problem: find the label of the test image. [Diagram: training data grouped as Digit 0, Digit 1, ..., Digit 9; a test image labeled "Digit ??"; dimension reduction applied to both.] Roughly speaking: we seek a dimension reduction so that recognition is more effective in the low-dimensional space.

14 K-nearest neighbors (KNN) classification. The idea of a voting system: get the distances between the test sample and the training samples; get the k nearest neighbors (here k = 8); the predominant class among these k items is assigned to the test sample. [Figure: test sample marked "?" among its 8 nearest neighbors.]
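
The voting rule just described fits in a few lines; here is a small illustrative NumPy sketch (brute-force distances, majority vote; my own function name), with the same k = 8 as on the slide.

```python
import numpy as np

def knn_classify(x_test, X_train, y_train, k=8):
    """Classify one test sample by majority vote among its k nearest neighbors.
    X_train is n x m (one sample per row), y_train a length-n array of labels."""
    dist = np.linalg.norm(X_train - x_test, axis=1)   # distances to all training samples
    nearest = np.argsort(dist)[:k]                    # indices of the k closest samples
    votes = y_train[nearest]
    classes, counts = np.unique(votes, return_counts=True)
    return classes[counts.argmax()]                   # predominant class wins
```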

15 Fisher's Linear Discriminant Analysis (LDA). Goal: use the label information to define a good projector, i.e., one that can discriminate well between the given classes. Define the "between scatter": a measure of how well separated two distinct classes are. Define the "within scatter": a measure of how well clustered items of the same class are. Objective: make the between-scatter measure large and the within-scatter measure small. Idea: find a projector that maximizes the ratio of the between-scatter measure over the within-scatter measure.

16 Define:
$$S_B = \sum_{k=1}^{c} n_k\, (\mu^{(k)} - \mu)(\mu^{(k)} - \mu)^T, \qquad S_W = \sum_{k=1}^{c} \sum_{x_i \in X_k} (x_i - \mu^{(k)})(x_i - \mu^{(k)})^T$$
where $\mu$ = mean$(X)$, $\mu^{(k)}$ = mean$(X_k)$, $X_k$ = the $k$-th class, and $n_k = |X_k|$. [Figure: cluster centroids and the global centroid $\mu$ for three classes $X_1$, $X_2$, $X_3$.]

17 Consider the second moments for a vector $a$:
$$a^T S_B a = \sum_{k=1}^{c} n_k\, |a^T (\mu^{(k)} - \mu)|^2, \qquad a^T S_W a = \sum_{k=1}^{c} \sum_{x_i \in X_k} |a^T (x_i - \mu^{(k)})|^2 .$$
$a^T S_B a$ = weighted variance of the projected centroids $\mu^{(k)}$; $a^T S_W a$ = weighted sum of the variances of the projected classes $X_k$. LDA projects the data so as to maximize the ratio of these two numbers: $\max_a \; \frac{a^T S_B a}{a^T S_W a}$. The optimal $a$ = eigenvector associated with the largest eigenvalue of the generalized problem $S_B u_i = \lambda_i S_W u_i$.

18 LDA: extension to arbitrary dimensions. Criterion: maximize the ratio of two traces, $\frac{\mathrm{Tr}[U^T S_B U]}{\mathrm{Tr}[U^T S_W U]}$, under the constraint $U^T U = I$ (orthogonal projector). Reduced-dimension data: $Y = U^T X$. Common viewpoint: this trace ratio is hard to maximize, therefore the alternative is to solve instead the (easier) problem $\max_{U^T S_W U = I} \mathrm{Tr}[U^T S_B U]$. Solution: the eigenvectors associated with the largest eigenvalues of $S_B u_i = \lambda_i S_W u_i$.
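
A small dense sketch of the "easier" alternative above: build S_B and S_W as defined on slide 16 and solve the generalized symmetric eigenproblem S_B u = lambda S_W u with SciPy. The function name and the small ridge added to S_W (which is often singular) are my own choices, not part of the talk.

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, labels, d, ridge=1e-8):
    """Dense LDA sketch. X is m x n (one sample per column), labels a length-n
    NumPy array of class labels. Returns U (m x d): the eigenvectors of
    S_B u = lambda S_W u associated with the d largest eigenvalues."""
    m, n = X.shape
    mu = X.mean(axis=1)
    S_B = np.zeros((m, m))
    S_W = np.zeros((m, m))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1)
        diff = (mu_c - mu)[:, None]
        S_B += Xc.shape[1] * (diff @ diff.T)        # between-class scatter
        Zc = Xc - mu_c[:, None]
        S_W += Zc @ Zc.T                            # within-class scatter
    S_W += ridge * np.eye(m)                        # S_W can be singular; regularize
    w, V = eigh(S_B, S_W)                           # generalized symmetric eigenproblem
    return V[:, ::-1][:, :d]                        # eigenvectors of the d largest eigenvalues
```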

19 LDA: extension to arbitrary dimensions (cont.). Consider the original problem:
$$\max_{U \in \mathbb{R}^{n \times p},\; U^T U = I} \; \frac{\mathrm{Tr}[U^T A U]}{\mathrm{Tr}[U^T B U]}$$
Let $A$, $B$ be symmetric and assume that $B$ is positive semi-definite with $\mathrm{rank}(B) > n - p$. Then $\mathrm{Tr}[U^T A U]/\mathrm{Tr}[U^T B U]$ has a finite maximum value $\rho_*$. The maximum is reached for a certain $U$ that is unique up to unitary transforms of its columns. Consider the function $f(\rho) = \max_{V^T V = I} \mathrm{Tr}[V^T (A - \rho B) V]$ and call $V(\rho)$ the maximizer. Note: $V(\rho)$ is a set of eigenvectors, so it is not unique.

20 Define $G(\rho) \equiv A - \rho B$ and its $n$ eigenvalues $\mu_1(\rho) \ge \mu_2(\rho) \ge \cdots \ge \mu_n(\rho)$. Clearly, $f(\rho) = \mu_1(\rho) + \mu_2(\rho) + \cdots + \mu_p(\rho)$. This can be expressed differently. Define the eigenprojector $P(\rho) = V(\rho) V(\rho)^T$. Then: $f(\rho) = \mathrm{Tr}[V(\rho)^T G(\rho) V(\rho)] = \mathrm{Tr}[G(\rho) V(\rho) V(\rho)^T] = \mathrm{Tr}[G(\rho) P(\rho)]$.

21 Recall [e.g. Kato '65] that $P(\rho) = \frac{1}{2\pi i} \int_\Gamma (G(\rho) - zI)^{-1}\, dz$, where $\Gamma$ is a smooth curve enclosing the $p$ eigenvalues of interest. Hence:
$$f(\rho) = \frac{1}{2\pi i}\, \mathrm{Tr} \int_\Gamma G(\rho)\,(G(\rho) - zI)^{-1}\, dz = \cdots = \frac{1}{2\pi i}\, \mathrm{Tr} \int_\Gamma z\,(G(\rho) - zI)^{-1}\, dz$$
With this, one can prove: 1. $f$ is a non-increasing function of $\rho$; 2. $f(\rho) = 0$ iff $\rho = \rho_*$; 3. $f'(\rho) = -\mathrm{Tr}[V(\rho)^T B V(\rho)]$.

22 Can now use Newton's method:
$$\rho_{\mathrm{new}} = \rho + \frac{\mathrm{Tr}[V(\rho)^T (A - \rho B) V(\rho)]}{\mathrm{Tr}[V(\rho)^T B V(\rho)]} = \frac{\mathrm{Tr}[V(\rho)^T A V(\rho)]}{\mathrm{Tr}[V(\rho)^T B V(\rho)]}$$
Newton's method to find the zero of $f$ = a fixed-point iteration with $g(\rho) = \frac{\mathrm{Tr}[V(\rho)^T A V(\rho)]}{\mathrm{Tr}[V(\rho)^T B V(\rho)]}$. Idea: compute $V(\rho)$ by a Lanczos-type procedure. Note: this is a standard eigenvalue problem [not a generalized one], hence inexpensive! See T. Ngo, M. Bellalij, and Y.S. for details.
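
Here is a small dense sketch of this iteration (my own code and naming). At each step it computes V(rho) with a full symmetric eigensolver, whereas the talk advocates a Lanczos-type procedure for the same standard eigenvalue problem; for LDA one would call it with A = S_B and B = S_W.

```python
import numpy as np

def trace_ratio(A, B, p, tol=1e-10, max_iter=100):
    """Newton / fixed-point iteration for max Tr[U^T A U] / Tr[U^T B U], U^T U = I.
    V(rho) = eigenvectors of the p largest eigenvalues of A - rho*B (a standard
    symmetric eigenproblem); rho is then updated to Tr[V^T A V] / Tr[V^T B V]."""
    rho = 0.0
    V = None
    for _ in range(max_iter):
        w, Q = np.linalg.eigh(A - rho * B)       # dense solve; the talk uses Lanczos here
        V = Q[:, -p:]                            # p dominant eigenvectors of G(rho)
        rho_new = np.trace(V.T @ A @ V) / np.trace(V.T @ B @ V)
        if abs(rho_new - rho) <= tol * max(1.0, abs(rho_new)):
            return rho_new, V
        rho = rho_new
    return rho, V
```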

23 Recent papers advocated similar or related techniques:
[1] C. Shen, H. Li, and M. J. Brooks, A convex programming approach to the trace quotient problem, ACCV (2), 2007.
[2] H. Wang, S. C. Yan, D. Xu, X. O. Tang, and T. Huang, Trace ratio vs. ratio trace for dimensionality reduction, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[3] S. Yan and X. O. Tang, Trace ratio revisited, Proceedings of the European Conference on Computer Vision.

24 Graph-based methods. Start with a graph of the data, e.g., the graph of k nearest neighbors (k-NN graph). Want: perform a projection which preserves the graph in some sense. Define a graph Laplacean $L = D - W$, e.g., with
$$w_{ij} = \begin{cases} 1 & \text{if } j \in N_i \\ 0 & \text{otherwise} \end{cases}, \qquad D = \mathrm{diag}\Big( d_{ii} = \sum_{j \ne i} w_{ij} \Big),$$
where $N_i$ = the neighborhood of $i$ (excluding $i$). [Figure: neighboring points $x_i$, $x_j$ in the original space and their images $y_i$, $y_j$.]

25 Example: the Laplacean eigenmaps approach. Laplacean Eigenmaps [Belkin-Niyogi '01] minimizes
$$F(Y) = \sum_{i,j=1}^{n} w_{ij}\, \| y_i - y_j \|^2 \quad \text{subject to} \quad Y D Y^\top = I .$$
Motivation: if $\| x_i - x_j \|$ is small (original data), we want $\| y_i - y_j \|$ to also be small (low-dimensional data). The original data is used only indirectly, through its graph. Leads to an $n \times n$ sparse eigenvalue problem [in sample space]. [Figure: nearby points $x_i$, $x_j$ mapped to nearby $y_i$, $y_j$.]
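
A compact sketch of the whole pipeline for this slide and the previous one, under my own naming and with a symmetrized 0/1 k-NN graph: build W and D, form L = D - W, and solve the generalized problem L y = lambda D y. A real implementation would keep everything sparse and use a sparse eigensolver instead of the dense call below.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, d, k=10):
    """X is n x m (one point per row). Returns the n x d embedding Y.
    Uses a symmetrized k-NN graph with 0/1 weights and solves L y = lambda D y,
    skipping the trivial constant eigenvector."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]       # k nearest neighbors, excluding i itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                        # symmetrize the affinity graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                     # graph Laplacean
    w, V = eigh(L, D)                             # dense generalized eigenproblem (sketch)
    return V[:, 1:d + 1]                          # drop the constant eigenvector
```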

26 Locally Linear Embedding (Roweis-Saul '00). Very similar to Eigenmaps; the graph Laplacean matrix is replaced by an affinity graph. Graph: each $x_i$ is written as a convex combination of its $k$ nearest neighbors: $x_i \approx \sum_{j \in N_i} w_{ij} x_j$, with $\sum_{j \in N_i} w_{ij} = 1$. The optimal weights are computed (a local calculation) by minimizing $\| x_i - \sum_j w_{ij} x_j \|$ for $i = 1, \ldots, n$. The mapped data $Y$ is then computed by minimizing $\sum_i \| y_i - \sum_j w_{ij} y_j \|^2$. [Figure: $x_i$ and its nearest neighbors $x_j$.]
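
The local weight computation is a small constrained least-squares solve per point. The sketch below (my own variable names; a standard ridge on the local Gram matrix is assumed) returns the n x n weight matrix W used in the embedding step.

```python
import numpy as np

def lle_weights(X, k=10, reg=1e-3):
    """X is n x m (one point per row). For each x_i, solve
    min || x_i - sum_j w_ij x_j || over its k nearest neighbors, with sum_j w_ij = 1,
    via the local Gram matrix (regularized, since it may be singular)."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]
        Z = X[nbrs] - X[i]                       # neighbors centered at x_i
        G = Z @ Z.T                              # k x k local Gram matrix
        G += reg * np.trace(G) * np.eye(k)       # small ridge for stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                 # enforce the sum-to-one constraint
    return W
```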

27 ONPP (Kokiopoulou and YS '05): Orthogonal Neighborhood Preserving Projections. A linear (orthogonal) version of LLE, obtained by writing $Y$ in the form $Y = V^\top X$. Same graph as LLE. Objective: preserve the affinity graph (as in LLE) *but* with the constraint $Y = V^\top X$. The problem solved to obtain the mapping is
$$\min_{V,\; V^T V = I} \; \mathrm{Tr}\left[ V^\top X (I - W^\top)(I - W) X^\top V \right].$$
In LLE, $V^\top X$ is replaced by $Y$.
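
Given the LLE weights, the ONPP projector follows from a single m x m symmetric eigenproblem. A minimal dense sketch under my own conventions (points as columns of X here; the lle_weights sketch above takes points as rows, so pass X.T to it):

```python
import numpy as np

def onpp(X, W, d):
    """ONPP sketch. X is m x n (one point per column); W is the n x n LLE weight
    matrix. Minimizes Tr[V^T X (I - W^T)(I - W) X^T V] over V^T V = I by taking
    the eigenvectors of X M X^T for the d smallest eigenvalues."""
    n = X.shape[1]
    E = np.eye(n) - W
    M = E.T @ E
    C = X @ M @ X.T                              # m x m symmetric matrix
    w, Q = np.linalg.eigh(C)
    V = Q[:, :d]                                 # eigenvectors of the d smallest eigenvalues
    return V, V.T @ X                            # projector and reduced data Y = V^T X
```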

28 Face recognition: background. Problem: we are given a database of images [arrays of pixel values] and a test (new) image. Question: does this new image correspond to one of those in the database?

29 Example: Eigenfaces [Turk-Pentland '91]. The idea is identical with the one we saw for the digits: consider each picture as a (1-D) column of all its pixels; put them together into an array $A$ of size (# pixels) x (# images); do an SVD of $A$ and perform the comparison with any test image in the low-dimensional space. Similar to LSI in spirit, but the data is not sparse.
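
A minimal sketch of that pipeline (my own code and parameter choices): vectorized faces as columns of A, an SVD of the centered matrix, and a nearest-neighbor comparison of the test image in the reduced space.

```python
import numpy as np

def eigenfaces_match(A, test, d=50):
    """A is (#pixels x #images), one vectorized face per column; 'test' is a
    vectorized test image. Returns the index of the closest database image,
    with distances measured in the d-dimensional eigenface space."""
    mu = A.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(A - mu, full_matrices=False)
    U_d = U[:, :d]                                 # leading left singular vectors
    coords = U_d.T @ (A - mu)                      # database images in reduced space
    t = U_d.T @ (test[:, None] - mu)               # test image in reduced space
    return int(np.argmin(np.linalg.norm(coords - t, axis=0)))
```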

30 Graph-based methods in a supervised setting. Graph-based methods can be adapted to the supervised mode. Idea: build $G$ so that nodes in the same class are neighbors. If $c$ = # of classes, $G$ consists of $c$ cliques. The weight matrix $W$ is block-diagonal, $W = \mathrm{diag}(W_1, W_2, \ldots, W_c)$. Note: $\mathrm{rank}(W) = n - c$. As before, the class graph Laplacean is $L_c = D - W$. Can be used for ONPP and other graph-based methods. Improvement: add a repulsion Laplacean [Kokiopoulou, YS '09].

31 Leads to an eigenvalue problem with the matrix $L_c - \rho L_R$, where $L_c$ = class Laplacean, $L_R$ = repulsion Laplacean, and $\rho$ = a parameter. [Figure: three classes, Class 1, Class 2, Class 3.] Test: ORL face database; 40 subjects, 10 sample images each; # of pixels per image: 112 x 92; total # of images: 400.

32 [Plot: recognition accuracy on the ORL database as a function of $\rho$, for a fixed number of training images per class, for several methods: ONPP, PCA, OLPP, Laplace, Fisher, and their repulsion (-R) variants.] Remarkable: there are values of $\rho$ which give better results than using the optimum $\rho$ obtained from maximizing the trace ratio.

33 Data mining for materials: materials informatics. [Collaborators: J. Chelikowsky (UT Austin) & Da Gao (U. of M.)] There is huge potential in exploiting two trends: (1) improvements in efficiency and capabilities of computational methods for materials; (2) recent progress in data mining techniques. Current practice: "one student, one alloy, one PhD" [see the special MRS issue on materials informatics]. Slow... Data mining can help speed up the process, e.g., by exploring in smarter ways. Issue 1: who will do the work? Few researchers are familiar with both worlds.

34 Issue 2: databases, and more generally sharing, are not too common in materials. "The inherently fragmented and multidisciplinary nature of the materials community poses barriers to establishing the required networks for sharing results and information. One of the largest challenges will be encouraging scientists to think of themselves not as individual researchers but as part of a powerful network collectively analyzing and using data generated by the larger community. These barriers must be overcome." NSTC report to the White House, June 2011. Materials Genome Initiative [NSF].

35 Unsupervised learning. 1970s: data mining "by hand": find coordinates that will cluster materials according to structure. 2-D projection from physical knowledge. "Anomaly detection": helped find that the compound CuF does not exist. See: J. R. Chelikowsky, J. C. Phillips, Phys. Rev. B 19 (1978).

36 Question: can modern data mining achieve a similar diagrammatic separation of structures? Should use only information from the two constituent atoms. Experiment: 67 binary octets. Use PCA, exploiting only data from the 2 constituent atoms: 1. number of valence electrons; 2. ionization energies of the s-states of the ion core; 3. ionization energies of the p-states of the ion core; 4. radii for the s-states as determined from model potentials; 5. radii for the p-states as determined from model potentials.

37 Result: [2-D PCA plot of the binary octets, separated by crystal structure with labels Z, W, D, R, ZW, WR; labeled compounds include MgTe, CdTe, AgI, CuI, MgSe, MgS, CuBr, CuCl, SiC, ZnO, and BN.]

38 Supervised learning: classification. Problem: classify an unknown binary compound into its crystal structure class. 55 compounds, 6 crystal structure classes, leave-one-out experiment. Case 1: use features 1:5 for atom A and 2:5 for atom B; no scaling is applied. Case 2: features 2:5 from each atom + scale features 2 to 4 by the square root of the # of valence electrons (feature 1). Case 3: features 1:5 for atom A and 2:5 for atom B; scale features 2 and 3 by the square root of the # of valence electrons.

39 Three methods tested: 1. PCA classification: project and do the identification in the space of reduced dimension (Euclidean distance in the low-dimensional space). 2. KNN: K-nearest neighbor classification. 3. Orthogonal Neighborhood Preserving Projections (ONPP), a graph-based method [see Kokiopoulou, YS '05]. [Table: recognition rates of KNN, ONPP, and PCA for Cases 1, 2, and 3.]
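
For completeness, here is the shape of the leave-one-out protocol used to produce such recognition rates: a hedged sketch with made-up names, where `classify` can be any single-sample classifier (e.g., the `knn_classify` sketch earlier) and X / labels stand for the feature matrix and the crystal-structure classes.

```python
import numpy as np

def leave_one_out_rate(X, labels, classify):
    """Leave-one-out recognition rate. X is n x m (one compound per row),
    labels a length-n NumPy array of classes, classify(x, X_train, y_train)
    any single-sample classifier. Returns the fraction classified correctly."""
    n = len(labels)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i                  # hold out compound i
        pred = classify(X[i], X[mask], labels[mask])
        correct += int(pred == labels[i])
    return correct / n
```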

40 A few experiments with a larger database: 529 $A_xB_y$ compounds (valid entries extracted). Source: Semiconductors: Data Handbook, Otfried Madelung, Springer, 3rd edition (January 22, 2004). Band-gap information is available. Experiments [unsupervised learning]: 2-D projection using SPDF blocks; 2-D projection showing the insulator / semiconductor property.

41 [Figure: elements grouped by S, D, P, and F blocks.]

42 Results for the SDP 2-D visualization. [Two panels: PCA projection and ONPP projection of the 526 compounds, labeled by SDP block combination (SP, DP, PP).] Features used: valence electrons; electronegativity; covalent radius; density; ionic radius; chemical scale; + band-gap.

43 Results for the 2-D semiconductor / insulator visualization. Had to consider AB-only compounds [119 of them]. [Two panels: PCA projection and ONPP projection of the 119 compounds with semiconductor (S) / insulator (I) labeling.] Features used: valence electrons; electronegativity; covalent radius; density; ionic radius.

44 Conclusion. Many interesting new matrix problems arise in areas that involve the effective mining of data. Among the most pressing issues is that of reducing computational cost [SVD, SDP, ... too costly]. Many online resources are available in computer science / applied math... but there is an inertia to overcome; in particular, data-sharing is not as widespread in materials science. Plus side: materials informatics will likely be energized by the materials genome project.

45 "When one door closes, another opens; but we often look so long and so regretfully upon the closed door that we do not see the one which has opened for us." Alexander Graham Bell. Thank you! Visit my web-site at saad
