From data-mining to nanotechnology and back: The new problems of numerical linear algebra Yousef Saad Department of Computer Science and Engineering University of Minnesota Computer Science Colloquium Nov. 9th, 2009

Introduction. Numerical linear algebra has always been a universal tool in science and engineering. Its focus has changed over the years to tackle new challenges:
1940s-1950s: Major issue: the flutter problem in aerospace engineering. Focus: the eigenvalue problem. Then came the discoveries of the LR and QR algorithms, and the package Eispack followed a little later.
1960s: Problems related to the power grid promoted what we know today as general sparse matrix techniques.
Late 1980s-1990s: Focus on parallel matrix computations.
Late 1990s: Big spur of interest in financial computing [Focus: stochastic PDEs].
CSE Colloquium 11/09/09 p. 2

Then the stock market tanked... and numerical analysts returned to more mundane basics. Recent/Current: Google page rank, data mining, problems related to internet technology, knowledge discovery, bio-informatics, nano-technology,... Major new factor: synergy between disciplines. Example: discoveries in materials (e.g., semiconductors replacing vacuum tubes in the 1950s) led to faster computers, which in turn led to better physical simulations. What about data mining and materials? Potential for a perfect union... CSE Colloquium 11/09/09 p. 3

NUMERICAL LINEAR ALGEBRA IN ELECTRONIC STRUCTURE

Electronic structure. Quantum mechanics: the single biggest scientific achievement of the 20th century. See "Thirty Years that Shook Physics", George Gamow, 1966. "There is plenty of room at the bottom" [Feynman, 1959] started the era of nanotechnology [making materials with specific properties]; see http://www.zyvex.com/nanotech/feynman.html Examples of applications: superhard materials, superconductors, drug design, efficient photovoltaic cells,... All these need as a basic tool: computational methods to determine electronic structure. CSE Colloquium 11/09/09 p. 5

Electronic structure and Schrödinger's equation. Basic and essential tool for simulations at the nano-scale. Determining matter's electronic structure can be a major challenge: the number of particles is large [a macroscopic amount contains ~10^23 electrons and nuclei] and the physical problem is intrinsically complex. Solution via the many-body Schrödinger equation: HΨ = EΨ. In its original form the above equation is very complex. CSE Colloquium 11/09/09 p. 6

The Hamiltonian H is of the form:

H = - \sum_{i} \frac{\hbar^2 \nabla_i^2}{2 M_i} - \sum_{j} \frac{\hbar^2 \nabla_j^2}{2 m} - \sum_{i,j} \frac{Z_i e^2}{|R_i - r_j|} + \frac{1}{2} \sum_{i,j} \frac{e^2}{|r_i - r_j|} + \frac{1}{2} \sum_{i,j} \frac{Z_i Z_j e^2}{|R_i - R_j|}

Ψ = Ψ(r_1, r_2, ..., r_n, R_1, R_2, ..., R_N) depends on the coordinates of all electrons/nuclei. It involves sums over all electrons / nuclei and their pairs. Note: \nabla_i^2 Ψ is the Laplacean of Ψ w.r.t. variable r_i; it represents the kinetic energy of the i-th particle. CSE Colloquium 11/09/09 p. 7

A hypothetical calculation [with a naive approach]: 10 atoms, each having 14 electrons [Silicon]... a total of 15*10 = 150 particles... Assume each coordinate will need 100 points for discretization... you will get

# Unknowns = \underbrace{100}_{\text{part. 1}} \times \underbrace{100}_{\text{part. 2}} \times \cdots \times \underbrace{100}_{\text{part. 150}} = 100^{150}

Methods based on this basic formulation are limited to a few atoms and useless for real compounds. CSE Colloquium 11/09/09 p. 8

"The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble. It therefore becomes desirable that approximate practical methods of applying quantum mechanics should be developed, which can lead to an explanation of the main features of complex atomic systems without too much computation." Dirac, 1929. In 1929 quantum mechanics was basically understood. Today the desire to have approximate practical methods is still alive. CSE Colloquium 11/09/09 p. 9

Several approximations/theories are used. Born-Oppenheimer approximation: neglect the motion of the nuclei [much heavier than electrons]. Replace many electrons by one-electron systems: each electron sees only average potentials from the other particles. Density Functional Theory [Hohenberg-Kohn 65]: observables are determined by the ground-state charge density. Consequence: an equation of the form

\left[ -\frac{\hbar^2}{2m}\nabla^2 + v_0(r) + V_H + V_{xc} \right] \Psi = E \Psi

v_0 = external potential, E_xc = exchange-correlation energy. CSE Colloquium 11/09/09 p. 10

Kohn-Sham equations = a nonlinear eigenvalue problem

\left[ -\frac{1}{2}\nabla^2 + (V_{ion} + V_H + V_{xc}) \right] \Psi_i = E_i \Psi_i,  i = 1, ..., n_o

\rho(r) = \sum_{i=1}^{n_o} |\Psi_i(r)|^2,   \nabla^2 V_H = -4\pi \rho(r)

Both V_xc and V_H depend on ρ. Potentials & charge densities must be self-consistent. A Broyden-type quasi-Newton technique is used. Typically, a small number of iterations is required. Most time-consuming part: diagonalization. CSE Colloquium 11/09/09 p. 11
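
To make the self-consistency loop concrete, here is a minimal, heavily simplified sketch in Python/NumPy of the iteration structure described above (build the Hamiltonian from the current density, diagonalize, recompute the density, mix). The 1-D model, the toy density-dependent potential, and the simple linear mixing are illustrative assumptions, not the PARSEC implementation.

import numpy as np

def scf_toy(n=200, n_occ=4, beta=0.3, tol=1e-8, maxit=100):
    """Schematic self-consistent field loop on a toy 1-D model."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    # Second-order finite-difference Laplacean (the real code uses high order).
    lap = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
           + np.diag(np.ones(n - 1), -1)) / h**2
    v_ext = 50.0 * (x - 0.5)**2            # toy external potential
    rho = np.ones(n) / n                   # initial guess for the density
    for it in range(maxit):
        v_eff = v_ext + 10.0 * rho         # toy density-dependent term (stands in for V_H + V_xc)
        H = -0.5 * lap + np.diag(v_eff)
        E, Psi = np.linalg.eigh(H)         # "diagonalization" = dominant cost
        rho_new = np.sum(Psi[:, :n_occ]**2, axis=1)
        rho_new /= rho_new.sum()
        if np.linalg.norm(rho_new - rho) < tol:
            break
        rho = (1 - beta) * rho + beta * rho_new   # simple linear mixing (real codes: Broyden)
    return E[:n_occ], rho

if __name__ == "__main__":
    energies, density = scf_toy()
    print("lowest occupied 'energies':", energies)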

Real-space Finite Difference Methods. Use high-order finite difference methods [Fornberg & Sloan 94]. Typical geometry = a cube, regular structure. The Laplacean matrix need not even be stored. [Figure: cubic domain with x, y, z axes and an order-4 finite difference stencil.] CSE Colloquium 11/09/09 p. 12
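
As an illustration of the matrix-free idea, here is a small sketch (Python/NumPy) of applying a 1-D fourth-order (5-point) finite-difference Laplacean without ever forming the matrix. The stencil coefficients are the standard order-4 central-difference weights; the routine name and the zero boundary values are assumptions for the example.

import numpy as np

# Standard 4th-order central-difference weights for the second derivative.
W = np.array([-1.0/12, 4.0/3, -5.0/2, 4.0/3, -1.0/12])

def apply_laplacean_1d(u, h):
    """Return an approximation of u'' on a uniform grid, matrix-free.

    Zero values are assumed outside the domain (Dirichlet)."""
    up = np.zeros(len(u) + 4)
    up[2:-2] = u                       # pad with zeros at both ends
    out = np.zeros_like(u)
    for k, w in enumerate(W):          # shift-and-add: 5 vector operations
        out += w * up[k:k + len(u)]
    return out / h**2

if __name__ == "__main__":
    n, h = 100, 1.0 / 101
    x = np.linspace(h, 1 - h, n)
    u = np.sin(np.pi * x)
    lap_u = apply_laplacean_1d(u, h)
    err = np.max(np.abs(lap_u[2:-2] + np.pi**2 * u[2:-2]))
    print("max interior error:", err)  # O(h^4) away from the boundary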

The physical domain CSE Colloquium 11/09/09 p. 13

Computational code: PARSEC; milestones. PARSEC = Pseudopotential Algorithm for Real-Space Electronic Calculations.
Sequential real-space code on Cray Y-MP [up to '93]
Cluster of SGI workstations [up to '96]
CM5 ['94-'96]: massive parallelism begins
IBM SP2 [using PVM]; Cray T3D [PVM + MPI] '96; Cray T3E [MPI] '97
IBM SP with 256+ nodes '98+
SGI Origin 3900 [128 processors] '99+
IBM SP + F90; PARSEC name given, '02
PARSEC released in 2005.
CSE Colloquium 11/09/09 p. 14

CHEBYSHEV FILTERING

Diagonalization via Chebyshev filtering. Given a basis [v_1, ..., v_m], filter each vector as

\hat{v}_i = p_k(A) v_i

where p_k is a low-degree polynomial that dampens unwanted components. The filtering step is not used to compute eigenvectors accurately. In effect, the SCF and diagonalization loops are merged, and convergence is well preserved. [Figure: degree-8 Chebyshev polynomial on the interval [-1, 1].] Yunkai Zhou, Y.S., Murilo L. Tiago, and James R. Chelikowsky, Phys. Rev. E, vol. 74, p. 066704 (2006). CSE Colloquium 11/09/09 p. 16
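
A minimal sketch (Python/NumPy, with assumed function names and test data) of the three-term Chebyshev recurrence that evaluates \hat{v} = p_k(A)v with a filter scaled so that the unwanted interval [a, b] of the spectrum is damped and everything below a is amplified. This illustrates the idea of the filtering step, not the actual PARSEC/CheFSI code.

import numpy as np

def cheb_filter(A, V, k, a, b):
    """Apply a degree-k Chebyshev filter p_k(A) to the columns of V.

    The filter is the Chebyshev polynomial T_k of the operator mapped so that
    [a, b] goes to [-1, 1]; eigencomponents below a are strongly amplified
    relative to those inside [a, b]."""
    e = (b - a) / 2.0            # half-width of the damped interval
    c = (b + a) / 2.0            # its center
    Y0 = V
    Y1 = (A @ V - c * V) / e     # degree-1 term of the mapped operator
    for _ in range(2, k + 1):
        Y2 = 2.0 * (A @ Y1 - c * Y1) / e - Y0   # three-term recurrence
        Y0, Y1 = Y1, Y2
    return Y1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 300
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = Q @ np.diag(np.linspace(0.0, 10.0, n)) @ Q.T   # symmetric test matrix
    V = rng.standard_normal((n, 5))
    W = cheb_filter(A, V, k=8, a=2.0, b=10.0)          # amplify eigenvalues below 2
    W, _ = np.linalg.qr(W)                             # re-orthonormalize the filtered basis
    # Rayleigh quotients should now concentrate near the low end of the spectrum.
    print(np.sort(np.linalg.eigvalsh(W.T @ A @ W)))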

Chebyshev subspace iteration - experiments. A large calculation: Si_9041 H_1860, using 48 processors [SGI Altix, Itanium proc., 1.6 GHz]. Hamiltonian size = 2,992,832; number of states = 19,015.

# A*x       # SCF   total eV/atom   1st CPU      total CPU
4,804,488   18      -92.00412       102.12 hrs   294.36 hrs

Calculation done in 2006. The largest one done in 1997: Si_525 H_276. It took a few days [48 h CPU] on 64 PEs of a Cray T3D. Now 2 hours on 1 PE (!) CSE Colloquium 11/09/09 p. 17


NUMERICAL LINEAR ALGEBRA IN DATA MINING

Introduction: What is data mining? Common goal of data mining methods: to extract meaningful information or patterns from data. A very broad area, which includes: data analysis, machine learning, pattern recognition, information retrieval,... Main tools used: linear algebra; graph theory; approximation theory; optimization;... In this talk: a brief overview with emphasis on dimension reduction techniques, interrelations between techniques, and graph theory tools. CSE Colloquium 11/09/09 p. 20

Major tool of data mining: dimension reduction. The goal is not just to reduce computational cost but to: reduce noise and redundancy in the data; discover features or patterns (e.g., supervised learning). Techniques depend on the application: preserve angles? preserve distances? maximize variance? ... CSE Colloquium 11/09/09 p. 21

The problem of dimension reduction. Given d << m, find a mapping

Φ : x ∈ R^m → y ∈ R^d

The mapping may be explicit [typically linear], e.g. y = V^T x, or implicit (nonlinear). Practically: given X ∈ R^{m×n}, want a low-dimensional representation Y ∈ R^{d×n} of X. CSE Colloquium 11/09/09 p. 22

Linear dimensionality reduction. Given: a data set X = [x_1, x_2, ..., x_n] and d, the dimension of the desired reduced space Y = [y_1, y_2, ..., y_n]. Want: a linear transformation from X to Y:

X ∈ R^{m×n},   V ∈ R^{m×d},   Y = V^T X ∈ R^{d×n}

m-dimensional objects (x_i) are flattened to the d-dimensional space (y_i). Constraint: the y_i's must satisfy certain properties → an optimization problem. CSE Colloquium 11/09/09 p. 23
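
A small sketch (Python/NumPy) of the most common instance of this setup, PCA: V is taken as the top d left singular vectors of the centered data and Y = V^T X. The function name and the random test data are assumptions for the example.

import numpy as np

def pca_reduce(X, d):
    """Linear dimension reduction Y = V^T X with V = top-d principal directions.

    X is m x n (one data point per column); returns V (m x d) and Y (d x n)."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                             # center the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = U[:, :d]                            # directions of maximal variance
    Y = V.T @ Xc
    return V, Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 3-dimensional points that lie mostly in a 2-D plane, plus a little noise.
    X = rng.standard_normal((2, 500))
    X = np.vstack([X, 0.01 * rng.standard_normal(500)])
    V, Y = pca_reduce(X, d=2)
    print(V.shape, Y.shape)   # (3, 2) (2, 500)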

Example 1: The Swiss-Roll (2000 points in 3-D). [Figure: the original data in 3-D.] CSE Colloquium 11/09/09 p. 24

2-D reductions: [Figure: four panels showing the PCA, LPP, Eigenmaps, and ONPP projections of the Swiss-Roll.] CSE Colloquium 11/09/09 p. 25

Example 2: Digit images (a random sample of 30). [Figure: 30 sample digit images.] CSE Colloquium 11/09/09 p. 26

2-D reductions: [Figure: four panels showing PCA, K-PCA, LLE, and ONPP projections of digits 0-4.] CSE Colloquium 11/09/09 p. 27

Unsupervised learning: clustering. Problem: partition a given set into subsets such that items of the same subset are most similar and those of two different subsets most dissimilar. [Figure: materials clustered into classes such as superhard, superconductors, photovoltaic, ferromagnetic, multiferroics, catalytic, thermoelectric.] Basic technique: the K-means algorithm [slow but effective]. Example of application: cluster bloggers by social groups (anarchists, ecologists, sports-fans, liberals-conservatives,...). CSE Colloquium 11/09/09 p. 28
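
Since the slide names K-means as the basic technique, here is a minimal K-means sketch in Python/NumPy (Lloyd's iteration with random initialization). The function name and the synthetic test data are assumptions for the example.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means (Lloyd's algorithm). X is n x m (one point per row)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
    labels, centers = kmeans(X, k=2)
    print(centers)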

Sparse Matrices viewpoint * Joint work with J. Chen Communities modeled by an affinity graph [e.g., user A sends frequent e-mails to user B ] Adjacency Graph represented by a sparse matrix Goal: find ordering so blocks are as dense as possible: Use blocking techniques for sparse matrices CSE Colloquium 11/09/09 p. 29

Advantage of this viewpoint: one need not know the # of clusters. Example of application: data set from http://www-personal.umich.edu/~mejn/netdata/, a network connecting bloggers of different political orientations [2004 US presidential election]. Communities: liberal vs. conservative. Graph: 1,490 vertices (blogs); the first 758 are liberal, the rest conservative. Edge i ~ j: a citation between blogs i and j. The blocking algorithm (density threshold = 0.4) finds subgraphs [note: density = |E|/|V|^2]. Smaller subgraph: conservative blogs; larger one: liberal. CSE Colloquium 11/09/09 p. 30

Supervised learning: classification. Problem: given labels (say A and B) for each item of a given set, find a mechanism to classify an unlabelled item into either the A or the B class. Many applications. Example: distinguish SPAM from non-SPAM messages. Can be extended to more than 2 classes. CSE Colloquium 11/09/09 p. 31

Linear classifiers. Find a hyperplane which best separates the data in classes A and B. [Figure: a linear classifier separating two classes.] CSE Colloquium 11/09/09 p. 32

A harder case: [Figure: Spectral Bisection (PDDP) on a data set that is not linearly separable.] Use kernels to transform.

[Figure: projection with kernels (σ² = 2.7463): the data transformed with a Gaussian kernel.] CSE Colloquium 11/09/09 p. 34

Fisher s Linear Discriminant Analysis (LDA) Define between scatter : a measure of how well separated two distinct classes are. Define within scatter : a measure of how well clustered items of the same class are. Goal: to make between scatter measure large, while making within scatter small. Idea: Project the data in low-dimensional space so as to maximize the ratio of the between scatter measure over within scatter measure of the classes. CSE Colloquium 11/09/09 p. 35

Let µ = the mean of X, and µ^{(k)} = the mean of the k-th class (of size n_k). Define:

S_B = \sum_{k=1}^{c} n_k (\mu^{(k)} - \mu)(\mu^{(k)} - \mu)^T,
S_W = \sum_{k=1}^{c} \sum_{x_i \in X_k} (x_i - \mu^{(k)})(x_i - \mu^{(k)})^T.

[Figure: cluster centroids and the global centroid.] CSE Colloquium 11/09/09 p. 36

Project the set onto a one-dimensional space spanned by a vector a. Then:

a^T S_B a = \sum_{k=1}^{c} n_k |a^T(\mu^{(k)} - \mu)|^2,
a^T S_W a = \sum_{k=1}^{c} \sum_{x_i \in X_k} |a^T(x_i - \mu^{(k)})|^2.

LDA projects the data so as to maximize the ratio of these two numbers:

max_a  (a^T S_B a) / (a^T S_W a)

Optimal a = the eigenvector associated with the largest eigenvalue of S_B u_i = λ_i S_W u_i. CSE Colloquium 11/09/09 p. 37
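
A compact sketch (Python with NumPy/SciPy) of exactly this computation: build S_B and S_W from labeled data and take the leading eigenvectors of the generalized problem S_B u = λ S_W u. scipy.linalg.eigh is used for the generalized symmetric eigenproblem; the function name, the small regularization, and the test data are assumptions.

import numpy as np
from scipy.linalg import eigh

def lda_directions(X, labels, d=1, reg=1e-6):
    """Return the top-d LDA directions for data X (m x n, one point per column)."""
    m, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    S_B = np.zeros((m, m))
    S_W = np.zeros((m, m))
    for k in np.unique(labels):
        Xk = X[:, labels == k]
        mu_k = Xk.mean(axis=1, keepdims=True)
        S_B += Xk.shape[1] * (mu_k - mu) @ (mu_k - mu).T
        S_W += (Xk - mu_k) @ (Xk - mu_k).T
    S_W += reg * np.eye(m)                      # small regularization so S_W is SPD
    w, U = eigh(S_B, S_W)                       # generalized symmetric eigenproblem
    return U[:, ::-1][:, :d]                    # eigenvectors for the largest eigenvalues

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.hstack([rng.normal(0, 1, (5, 100)), rng.normal(3, 1, (5, 100))])
    labels = np.array([0] * 100 + [1] * 100)
    a = lda_directions(X, labels, d=1)
    print(a.ravel())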

LDA in d dimensions. Criterion: maximize the ratio of two traces

Tr[U^T S_B U] / Tr[U^T S_W U]

under the constraint U^T U = I (orthogonal projector). Reduced-dimension data: Y = U^T X. Common viewpoint: this is hard to maximize, therefore... alternative: solve instead the (easier) problem

max_{U^T S_W U = I} Tr[U^T S_B U]

Solution: the eigenvectors associated with the largest eigenvalues of S_B u_i = λ_i S_W u_i. CSE Colloquium 11/09/09 p. 38

Trace ratio problem. * Joint work with Thanh Ngo and Mohamed Bellalij. Main point: the trace ratio can be maximized inexpensively. Let

f(ρ) = max_{U, U^T U = I} Tr[U^T (A - ρB) U]

and let U(ρ) be the maximizer of the above trace. Then, under some mild assumptions on A and B:
1) the optimal ρ for the trace ratio is a zero of f;
2) f'(ρ) = -Tr[U(ρ)^T B U(ρ)];
3) f is a decreasing function...
Newton's method to find the zero amounts to a fixed-point iteration:

ρ_new = Tr[U(ρ)^T A U(ρ)] / Tr[U(ρ)^T B U(ρ)]

CSE Colloquium 11/09/09 p. 39
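
A small sketch (Python/NumPy, dense and unoptimized) of the fixed-point iteration above: at each step take U(ρ) as the top-d eigenvectors of A - ρB and update ρ by the trace ratio. In the actual method U(ρ) would be obtained by a Lanczos procedure rather than a full eigendecomposition; the function name, tolerance, and test matrices are assumptions.

import numpy as np

def trace_ratio(A, B, d, tol=1e-10, maxit=50):
    """Maximize Tr[U^T A U] / Tr[U^T B U] over U with U^T U = I (U is n x d)."""
    rho = 0.0
    for _ in range(maxit):
        w, V = np.linalg.eigh(A - rho * B)      # U(rho) = top-d eigenvectors of A - rho*B
        U = V[:, -d:]
        rho_new = np.trace(U.T @ A @ U) / np.trace(U.T @ B @ U)
        if abs(rho_new - rho) < tol * max(1.0, abs(rho)):
            return rho_new, U
        rho = rho_new
    return rho, U

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((20, 20)); A = M @ M.T               # symmetric PSD
    N = rng.standard_normal((20, 20)); B = N @ N.T + np.eye(20)  # SPD
    rho, U = trace_ratio(A, B, d=3)
    print("optimal trace ratio:", rho)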

Idea: compute U(ρ) by an inexpensive Lanczos procedure. Note: this is a standard eigenvalue problem [not generalized], hence cheap. Recent papers advocated similar or related techniques:
[1] C. Shen, H. Li, and M. J. Brooks, A convex programming approach to the trace quotient problem. In ACCV (2), 2007.
[2] H. Wang, S. C. Yan, D. Xu, X. O. Tang, and T. Huang, Trace ratio vs. ratio trace for dimensionality reduction. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[3] S. Yan and X. O. Tang, Trace ratio revisited. In Proceedings of the European Conference on Computer Vision, 2006.
CSE Colloquium 11/09/09 p. 40

Background: The Lanczos procedure

ALGORITHM 1: Lanczos
1. Choose an initial vector v_1 of unit norm. Set β_1 := 0, v_0 := 0
2. For j = 1, 2, ..., m Do:
3.    α_j := (A v_j, v_j)
4.    w_j := A v_j - α_j v_j - β_j v_{j-1}
5.    β_{j+1} := ||w_j||_2. If β_{j+1} = 0 then Stop
6.    v_{j+1} := w_j / β_{j+1}
7. EndDo

In the first few steps of Newton's method only a rough approximation is needed. CSE Colloquium 11/09/09 p. 41
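
A direct transcription of this procedure into Python/NumPy, as a minimal sketch without the reorthogonalization that a production code would need. The function name and the small symmetric test matrix are assumptions.

import numpy as np

def lanczos(A, v1, m):
    """Run m steps of the Lanczos procedure on the symmetric matrix A.

    Returns V (n x k Lanczos vectors) and the tridiagonal T_k = V^T A V."""
    n = len(v1)
    V = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m + 1)
    v_prev = np.zeros(n)
    v = v1 / np.linalg.norm(v1)                   # step 1: unit initial vector
    for j in range(m):
        V[:, j] = v
        Av = A @ v
        alpha[j] = Av @ v                         # step 3: alpha_j = (A v_j, v_j)
        w = Av - alpha[j] * v - beta[j] * v_prev  # step 4
        beta[j + 1] = np.linalg.norm(w)           # step 5
        if beta[j + 1] == 0.0:                    # invariant subspace found: stop early
            k = j + 1
            T = (np.diag(alpha[:k]) + np.diag(beta[1:k], 1) + np.diag(beta[1:k], -1))
            return V[:, :k], T
        v_prev, v = v, w / beta[j + 1]            # step 6
    T = np.diag(alpha) + np.diag(beta[1:m], 1) + np.diag(beta[1:m], -1)
    return V, T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((200, 200)); A = (M + M.T) / 2
    V, T = lanczos(A, rng.standard_normal(200), m=60)
    # The extreme eigenvalues of T approximate those of A well.
    print(np.sort(np.linalg.eigvalsh(T))[-3:], np.sort(np.linalg.eigvalsh(A))[-3:])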

INFORMATION RETRIEVAL

Information Retrieval: the vector space model. Given: a collection of documents (columns of a matrix A) and a query vector q. The collection is represented by an m × n term-by-document matrix with

a_ij = L_ij G_i N_j

Queries ('pseudo-documents') q are represented similarly to a column. CSE Colloquium 11/09/09 p. 43

Vector space model, continued. Problem: find a column c of A that best matches q. Similarity metric: the angle between the column and q; use cosines:

c^T q / (||c||_2 ||q||_2)

To rank all documents we need to compute s = A^T q, the similarity vector. Literal matching is not very effective. CSE Colloquium 11/09/09 p. 44
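
A short sketch (Python with NumPy/SciPy sparse) of this ranking step: normalize the columns of the term-by-document matrix and the query, compute s = A^T q, and sort. The toy matrix and the function name are assumptions.

import numpy as np
from scipy.sparse import csc_matrix

def rank_documents(A, q):
    """Cosine-similarity ranking: A is a sparse m x n term-by-document matrix."""
    col_norms = np.asarray(np.sqrt(A.power(2).sum(axis=0))).ravel()   # ||c||_2 per column
    s = (A.T @ q) / (col_norms * np.linalg.norm(q) + 1e-300)          # guard empty columns
    return np.argsort(-s), s

if __name__ == "__main__":
    # 4 terms x 3 documents (toy example); the query contains terms 0 and 2.
    A = csc_matrix(np.array([[1., 0., 1.],
                             [0., 1., 0.],
                             [1., 1., 0.],
                             [0., 0., 1.]]))
    q = np.array([1., 0., 1., 0.])
    order, s = rank_documents(A, q)
    print(order, s)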

Common approach: use the SVD. Need to extract intrinsic or underlying semantic information. LSI: replace A by a low-rank approximation [from the SVD]:

A = U Σ V^T,   A_k = U_k Σ_k V_k^T

U_k: term space, V_k: document space. New similarity vector:

s_k = A_k^T q = V_k Σ_k U_k^T q

Main issues: 1) computational cost, 2) updates. CSE Colloquium 11/09/09 p. 45
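
A minimal LSI sketch (Python/NumPy): compute a rank-k truncated SVD of A and rank documents with s_k = V_k Σ_k U_k^T q. For large sparse matrices one would use scipy.sparse.linalg.svds instead of the dense SVD used here; the function name and random data are assumptions.

import numpy as np

def lsi_rank(A, q, k):
    """Rank documents with a rank-k LSI approximation of the term-document matrix A."""
    U, sig, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sigk, Vk = U[:, :k], sig[:k], Vt[:k, :].T
    s_k = Vk @ (sigk * (Uk.T @ q))      # s_k = V_k Sigma_k U_k^T q
    return np.argsort(-s_k), s_k

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((50, 20))            # stand-in for a term-by-document matrix
    q = rng.random(50)                  # stand-in for a query vector
    order, scores = lsi_rank(A, q, k=5)
    print(order[:5])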

Use of polynomial filters. * Joint work with E. Kokiopoulou. The idea: replace A_k by A φ(A^T A), where φ is a certain filter function. Consider the step (Heaviside) function

φ(x) = 0 for 0 ≤ x ≤ σ_k²,   φ(x) = 1 for σ_k² ≤ x ≤ σ_1²

This would yield the same result as with the TSVD but... it is not easy to use this function directly. Solution: use a polynomial approximation to φ. Note: s^T = q^T A φ(A^T A) requires only mat-vecs. CSE Colloquium 11/09/09 p. 46
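
To illustrate the "only mat-vecs" remark, here is a small sketch (Python/NumPy) that evaluates s = φ(A^T A) A^T q for a polynomial φ given by its coefficients, using Horner's rule so that only products with A and A^T appear. The coefficients of a genuinely useful filter would come from a least-squares fit to the step function (next slide); the simple polynomial used here is only a placeholder.

import numpy as np

def filtered_scores(A, q, coeffs):
    """Compute s = phi(A^T A) A^T q where phi(x) = sum_j coeffs[j] * x**j.

    Horner's rule: only matrix-vector products with A and A^T are needed."""
    v = A.T @ q                      # start from A^T q
    s = coeffs[-1] * v
    for c in reversed(coeffs[:-1]):  # s <- (A^T A) s + c * v
        s = A.T @ (A @ s) + c * v
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((40, 15))
    q = rng.random(40)
    s = filtered_scores(A, q, coeffs=[0.0, 2.0, -1.0])   # phi(x) = 2x - x^2 (placeholder)
    # Check against the explicit computation.
    M = A.T @ A
    ref = (2.0 * M - M @ M) @ (A.T @ q)
    print(np.allclose(s, ref))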

Polynomial filters: examples. Approach: approximate a piecewise polynomial φ by a polynomial in the least-squares sense. [Figure: step-like filter φ on an interval [a, b].] Ideal for situations where the data must be explored only once or a small number of times. Details skipped; see: E. Kokiopoulou and YS, Polynomial Filtering in Latent Semantic Indexing for Information Retrieval, ACM-SIGIR, 2004. CSE Colloquium 11/09/09 p. 47

IR: Use of the Lanczos algorithm. * Joint work with Jie Chen. Lanczos is good at catching large (and small) eigenvalues: one can compute singular vectors with Lanczos and use them in LSI. Can do better: use the Lanczos vectors directly for the projection. Related idea: K. Blom and A. Ruhe [SIMAX, vol. 26, 2005] use Lanczos bidiagonalization. We use a similar approach, but directly with AA^T or A^T A. CSE Colloquium 11/09/09 p. 48

IR: Use of the Lanczos algorithm (1). Let A ∈ R^{m×n}. Apply the Lanczos procedure to M = AA^T. Result:

Q_k^T A A^T Q_k = T_k

with Q_k orthogonal and T_k tridiagonal. Define s_i = the orthogonal projection of Ab onto the subspace spanned by the first i Lanczos vectors:

s_i := Q_i Q_i^T A b

s_i can be easily updated from s_{i-1}: s_i = s_{i-1} + q_i q_i^T A b. CSE Colloquium 11/09/09 p. 49

IR: Use of the Lanczos algorithm (2). If n < m it may be more economical to apply Lanczos to M = A^T A, which is n × n. Result:

Q_k^T A^T A Q_k = T_k

Define: t_i := A Q_i Q_i^T b, i.e., project b first, then apply A to the result. CSE Colloquium 11/09/09 p. 50
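
Using the lanczos sketch given earlier, the projection of variant (1) can be formed as below. This is a toy illustration with an assumed random matrix, not the experimental setup of the slides.

import numpy as np
# assumes the lanczos(A, v1, m) sketch defined earlier is available in scope

rng = np.random.default_rng(1)
m, n, k = 300, 120, 30
A = rng.random((m, n))                 # stand-in for a term-by-document matrix
b = rng.random(n)                      # vector whose product Ab we approximate

M = A @ A.T                            # variant (1): Lanczos applied to AA^T (m x m)
Q, T = lanczos(M, rng.random(m), k)    # Lanczos basis Q_k (m x k), random start
s_k = Q @ (Q.T @ (A @ b))              # s_k = Q_k Q_k^T A b: projection of Ab onto span(Q_k)
# Relative projection error; small because span(Q_k) captures the dominant directions of AA^T.
print(np.linalg.norm(s_k - A @ b) / np.linalg.norm(A @ b))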

Theory well understood: the approach works well because Lanczos produces good approximations to the dominant singular vectors. Advantages of Lanczos over polynomial filters: (1) no need for eigenvalue estimates; (2) mat-vecs are performed only in preprocessing. Disadvantages: (1) need to store the Lanczos vectors; (2) preprocessing must be redone when A changes; (3) need for reorthogonalization, expensive for large k. CSE Colloquium 11/09/09 p. 51

Tests: IR. Information retrieval datasets:

Dataset   # Terms   # Docs   # queries   sparsity
MED       7,014     1,033    30          0.735
CRAN      3,763     1,398    225         1.412

[Figure: preprocessing time vs. number of iterations for lanczos and tsvd, on the Med and Cran datasets.] CSE Colloquium 11/09/09 p. 52

Average query times. [Figure: average query time vs. iterations for lanczos, tsvd, lanczos-rf, and tsvd-rf, on the Med and Cran datasets.] CSE Colloquium 11/09/09 p. 53

Average retrieval precision. [Figure: average precision vs. iterations for lanczos, tsvd, lanczos-rf, and tsvd-rf, on the Med and Cran datasets.] Retrieval precision comparisons. CSE Colloquium 11/09/09 p. 54

In summary: results comparable to those of the SVD... at a much lower cost. Thanks: helpful tools and datasets are widely available. We used TMG [developed at the U. of Patras (D. Zeimpekis, E. Gallopoulos)]. CSE Colloquium 11/09/09 p. 55

Face recognition: background. Problem: we are given a database of images [arrays of pixel values] and a test (new) image. [Figure: sample face images.] Question: does this new image correspond to one of those in the database? CSE Colloquium 11/09/09 p. 56

Difficulty: positions, expressions, lighting,... Eigenfaces: a Principal Component Analysis technique. Specific situation: poor images or deliberately altered images [occlusion]. See real-life examples [international man-hunt]. CSE Colloquium 11/09/09 p. 57

Eigenfaces. Consider each picture as a (1-D) column of all its pixels. Put the columns together into an array A of size (# pixels) × (# images). Do an SVD of A and perform the comparison with any test image in the low-dimensional space. Similar to LSI in spirit, but the data is not sparse. Idea: replace the SVD by Lanczos vectors (same as for IR). CSE Colloquium 11/09/09 p. 58
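
A minimal eigenfaces-style sketch (Python/NumPy): vectorize the images into the columns of A, compute a truncated SVD, project the database and the test image, and pick the nearest match. The mean-centering, the random placeholder "images", and the function names are assumptions for the example.

import numpy as np

def eigenfaces_fit(A, k):
    """A: (#pixels x #images). Returns the mean face and the top-k left singular vectors."""
    mean = A.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(A - mean, full_matrices=False)
    return mean, U[:, :k]

def identify(A, mean, Uk, test):
    """Return the index of the database image closest to `test` in the reduced space."""
    coords = Uk.T @ (A - mean)            # k-dimensional codes of the database images
    t = Uk.T @ (test[:, None] - mean)     # code of the test image
    return int(np.argmin(np.linalg.norm(coords - t, axis=0)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((112 * 92, 40))        # placeholder for 40 vectorized face images
    mean, Uk = eigenfaces_fit(A, k=10)
    test = A[:, 7] + 0.05 * rng.random(112 * 92)   # a noisy copy of image 7
    print(identify(A, mean, Uk, test))    # expected: 7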

Tests: face recognition. Tests with 2 well-known data sets:
ORL: 40 subjects, 10 sample images each; # of pixels: 112 × 92; total # of images: 400.
AR set: 126 subjects, 4 facial expressions selected for each [natural, smiling, angry, screaming]; # of pixels: 112 × 92; total # of images: 504.
CSE Colloquium 11/09/09 p. 59

Tests: face recognition. Recognition accuracy of the Lanczos approximation vs. the SVD on the ORL and AR datasets. [Figure: average error rate vs. subspace dimension for lanczos and tsvd, ORL and AR datasets.] The vertical axis shows the average error rate; the horizontal axis is the subspace dimension. CSE Colloquium 11/09/09 p. 60

GRAPH-BASED METHODS

Graph-based methods. Start with a graph of the data, e.g. the graph of k nearest neighbors (k-nn graph). Want: a projection that preserves the graph in some sense. [Figure: neighboring points x_i, x_j mapped to neighboring points y_i, y_j.] Define a graph Laplacean:

L = D - W,   e.g., w_ij = 1 if j ∈ N_i and 0 else;   D = diag(d_ii),  d_ii = \sum_{j ≠ i} w_ij

with N_i = the neighborhood of i (excluding i). CSE Colloquium 11/09/09 p. 62

Example: Laplacean eigenmaps. Laplacean Eigenmaps [Belkin & Niyogi 02] *minimizes*

F_EM(Y) = \sum_{i,j=1}^{n} w_ij ||y_i - y_j||^2   subject to   Y D Y^T = I

Notes:
1. Motivation: if ||x_i - x_j|| is small (original data), we want ||y_i - y_j|| to also be small (low-dimensional data).
2. Note: Min instead of Max as in PCA [counter-intuitive].
3. The above problem uses the original data only indirectly, through its graph.
CSE Colloquium 11/09/09 p. 63

The problem translates to:

min_{Y ∈ R^{d×n}, Y D Y^T = I} Tr[ Y (D - W) Y^T ]

Solution (sort eigenvalues increasingly):

(D - W) u_i = λ_i D u_i ;   y_i = u_i^T ;   i = 1, ..., d

An n × n sparse eigenvalue problem [in sample space]. Note: one can assume D = I; this amounts to rescaling the data. The problem then becomes

(I - W) u_i = λ_i u_i ;   y_i = u_i^T ;   i = 1, ..., d

CSE Colloquium 11/09/09 p. 64
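
A compact sketch (Python with NumPy/SciPy) of the whole pipeline on dense arrays: build a symmetrized k-nn adjacency matrix W, form D and L = D - W, solve the generalized eigenproblem, and keep the d eigenvectors after the trivial constant one. For large data one would use sparse matrices and an iterative (Lanczos-type) eigensolver; the function names, the choice k = 8, and the spiral test data are assumptions.

import numpy as np
from scipy.linalg import eigh

def knn_adjacency(X, k):
    """Symmetric 0/1 adjacency of the k-nn graph. X is m x n (points = columns)."""
    D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # pairwise squared distances
    n = X.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]     # k nearest neighbors, excluding i itself
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)                 # symmetrize

def laplacean_eigenmaps(X, d, k=8):
    W = knn_adjacency(X, k)
    Dg = np.diag(W.sum(axis=1))
    L = Dg - W
    vals, U = eigh(L, Dg)                     # (D - W) u = lambda D u, ascending order
    return U[:, 1:d + 1].T                    # skip the constant eigenvector; Y is d x n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 3 * np.pi, 400))
    X = np.vstack([t * np.cos(t), t * np.sin(t)])   # a 2-D spiral (Swiss-roll-like curve)
    Y = laplacean_eigenmaps(X, d=1, k=8)
    print(Y.shape)                            # (1, 400): a 1-D unrolling of the spiral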

A unified view. * Joint work with Efi Kokiopoulou and J. Chen. Most techniques lead to one of two types of problems.
First: Y results directly from computing eigenvectors (LLE, Eigenmaps,...):

min_{Y ∈ R^{d×n}, Y Y^T = I} Tr[ Y M Y^T ]

Second: low-dimensional data Y = V^T X, with G = identity, or X D X^T, or X X^T:

min_{V ∈ R^{m×d}, V^T G V = I} Tr[ V^T X M X^T V ]

Observation: the 2nd is just a projected version of the 1st. CSE Colloquium 11/09/09 p. 65

Computing k-nn graphs. * Joint work with J. Chen and H. R. Fang. Main idea: divide and conquer. The 2nd smallest eigenvector roughly separates the set in two. Recursively divide, and handle the interface with care. Again, exploit Lanczos! [Figure: the set split into overlapping subsets X_1, X_2, X_3 by hyperplanes.] Cost: O(n^{1+δ}), where δ depends on parameters; less than O(n^2) but higher than O(n log n). A C implementation [J. Chen] is publicly available [GPL]. CSE Colloquium 11/09/09 p. 66

Graph-based methods in a supervised setting. Subjects of the training set are known (labeled). Q: given a test image (say), find its label. [Figure: labeled training images of Z. Dostal, R. Blaheta, and Z. Strakos, plus an unlabeled test image.] Question: find the label (best match) for the test image. CSE Colloquium 11/09/09 p. 67

Methods can be adapted to the supervised mode by building the graph so that it uses the class labels. Idea: build G so that nodes in the same class are neighbors. If c = # classes, G consists of c cliques. The matrix W is block-diagonal:

W = diag(W_1, W_2, W_3, W_4, W_5)

Note: rank(W) = n - c. Can be used for LPP, ONPP, etc. CSE Colloquium 11/09/09 p. 68

Repulsion Laplaceans. * Joint work with E. Kokiopoulou ('09). Idea: create repulsion forces (energies) between items that are close but *not* in the same class. [Figures: three classes with repulsion between nearby items of different classes; recognition accuracy on ORL (5 training images per class) vs. subspace dimension for onpp, pca, olpp, R-laplace, fisher, onpp-R, and olpp.] CSE Colloquium 11/09/09 p. 69

Data mining for materials: materials informatics. Huge potential in exploiting two trends:
1. Enormous improvements in the efficiency and capabilities of computational methods for materials;
2. Recent progress in data mining techniques.
For example, cluster materials into classes according to properties, types of atomic structures ('point groups'),... Current practice: one student, one alloy, one PhD; a slow pace of discovery. Data mining can help speed up the process by pointing to the more promising alloys. CSE Colloquium 11/09/09 p. 70

Materials informatics at work: illustrations.
- 1970s: 'data mining by hand': find coordinates to cluster materials according to structure.
- 2-D projection from physical knowledge; see: J. R. Chelikowsky, J. C. Phillips, Phys. Rev. B 19 (1978).
- Anomaly detection: helped find that the compound CuF does not exist.
CSE Colloquium 11/09/09 p. 71

Example 1 [Norskov et al., '03,...]: use of genetic algorithms to search through a database of binary materials. Led to the discovery of a promising catalytic material with low cost. Example 2 [Curtarolo et al., PRL vol. 91, 2003]: goal: narrow the search so as to do fewer electronic structure calculations. 55 binary metallic alloys were considered in 114 crystal structures. Observation: the energies of different crystal structures are correlated. Use PCA: 9 dimensions are good enough to yield acceptable accuracy. [Figure: matrix of energies, structures (114) by alloys (55).] CSE Colloquium 11/09/09 p. 72

Conclusion. Many, many interesting new matrix problems related to the new economy and newly emerging scientific fields:
1. Information technologies [learning, data mining,...]
2. Computational chemistry / materials science
3. Bio-informatics: computational biology, genomics,...
Important: many resources for data mining are available online (repositories, tutorials); it is very easy to get started. Materials informatics is very likely to become a major force. For a few recent papers and pointers visit my web-site at www.cs.umn.edu/~saad. CSE Colloquium 11/09/09 p. 73

When one door closes, another opens; but we often look so long and so regretfully upon the closed door that we do not see the one which has opened for us. Alexander Graham Bell (1847-1922) Thank you! CSE Colloquium 11/09/09 p. 74