Linear algebra methods for data mining with applications to materials
Yousef Saad, Department of Computer Science and Engineering, University of Minnesota
ICSC 2012, Hong Kong, Jan 4-7, 2012

HAPPY BIRTHDAY TONY!

Introduction: What is data mining? The common goal of data mining methods is to extract meaningful information or patterns from data. It is a very broad area that includes data analysis, machine learning, pattern recognition, information retrieval, ... The main tools used are linear algebra, graph theory, approximation theory, statistics, optimization, ... This talk: a brief introduction with emphasis on dimension reduction, followed by applications.

Major tool of data mining: dimension reduction. Given d ≪ m, find a mapping Φ : x ∈ R^m → y ∈ R^d. The mapping may be explicit [typically linear], e.g. y = V^T x, or implicit (nonlinear). Techniques depend on the application: preserve angles? preserve distances? maximize variance? The primary goals of dimension reduction are to reduce noise and redundancy in the data and to discover features or patterns.
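
A minimal sketch of an explicit linear mapping y = V^T x, with an arbitrary orthonormal V from a QR factorization; the sizes m, d and the random data are made up for illustration:

```python
import numpy as np

# Hypothetical sizes: m-dimensional data reduced to d dimensions.
m, d, n_samples = 100, 2, 500
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n_samples))            # data, one column per sample

# Any orthonormal basis V of a d-dimensional subspace defines a linear map.
V, _ = np.linalg.qr(rng.standard_normal((m, d)))   # m x d, with V^T V = I

Y = V.T @ X        # reduced data: d x n_samples, y = V^T x column by column
print(Y.shape)     # (2, 500)
```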

Basic linear dimensionality reduction: PCA. We are given points in R^n and we want to project them into R^d. What is the best way to do this, i.e., what are the best axes on which to project the data? Q: best in what sense? A: maximize the variance of the projected data. This is Principal Component Analysis [PCA].
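
A minimal PCA sketch, assuming the standard formulation: center the data and take the leading left singular vectors of the centered data matrix as the variance-maximizing axes (variable names are illustrative):

```python
import numpy as np

def pca_project(X, d):
    """Project columns of X (n x N, one sample per column) onto the
    d leading principal axes, i.e. the directions of maximal variance."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                  # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = U[:, :d]                                 # principal axes (n x d)
    return V.T @ Xc, V, mu                       # reduced data is d x N

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 200))
Y, V, mu = pca_project(X, d=2)
print(Y.shape)   # (2, 200)
```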

Unsupervised learning: clustering. Problem: partition a given set into subsets such that items of the same subset are most similar and items of two different subsets are most dissimilar. [Illustration: example material classes — superhard, superconductors, photovoltaic, ferromagnetic, multiferroics, catalytic, thermoelectric.] Basic technique: the K-means algorithm [slow but effective]. Example of application: cluster bloggers by social groups.
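
A minimal K-means sketch (Lloyd's algorithm) with random initialization and a fixed number of iterations; not tuned for speed or robustness:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm. X: N x n array (one sample per row)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute centers (keep the old center if a cluster went empty)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```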

Example: a sparse-matrix viewpoint (J. Chen & YS '09). Communities are modeled by an affinity graph [e.g., user A sends frequent e-mails to user B]. The adjacency graph is represented by a sparse matrix. Goal: find an ordering of the rows and columns so that the diagonal blocks are as dense as possible. Advantage of this viewpoint: one need not know the number of clusters in advance; blocking techniques for sparse matrices can be used.
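
The blocking algorithm of Chen & Saad is not spelled out in these slides; as a stand-in, the sketch below reorders a sparse adjacency matrix with reverse Cuthill-McKee (scipy) so that its nonzeros concentrate near the diagonal, which only illustrates the general idea of revealing structure by permuting a sparse matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Toy symmetric affinity graph (adjacency matrix) as a sparse matrix.
rng = np.random.default_rng(2)
A = (rng.random((60, 60)) < 0.05).astype(float)
A = csr_matrix(np.maximum(A, A.T))            # symmetrize

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_reordered = A[perm, :][:, perm]             # nonzeros pulled toward the diagonal
```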

Supervised learning: classification. Problem: given labels (say A and B) for each item of a given set, find a mechanism to classify an unlabelled item into either the A or the B class. There are many applications; an example is distinguishing spam from non-spam messages. The setting can be extended to more than two classes.

Linear classifiers: find a hyperplane which best separates the data in classes A and B.

Linear classifiers. Let X = [x_1, ..., x_n] be the data matrix and L = [l_1, ..., l_n] the labels, each either +1 or -1. First solution: find a vector v such that v^T x_i is close to l_i for all i. A common approach: use the SVD to reduce the dimension of the data [e.g. to 2-D], then do the comparison in this reduced space, e.g. class A: v^T x_i ≥ 0, class B: v^T x_i < 0.
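
A minimal least-squares sketch of the "find v with v^T x_i ≈ l_i" idea; the data here is synthetic, and a real pipeline would first reduce the dimension (e.g. with the SVD) as the slide suggests:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 2-class data: columns of X are samples, labels are +1 / -1.
n = 200
X = np.hstack([rng.normal(+1.0, 1.0, (5, n // 2)),
               rng.normal(-1.0, 1.0, (5, n // 2))])
l = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

# Least-squares fit: v^T x_i ~ l_i  <=>  X^T v ~ l.
v, *_ = np.linalg.lstsq(X.T, l, rcond=None)

pred = np.where(v @ X >= 0, 1, -1)    # class A: v^T x >= 0, class B: v^T x < 0
print("training accuracy:", (pred == l).mean())
```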

Need for nonlinear classifiers. [Plot: spectral bisection (PDDP) applied to a 2-D data set.] The result is not too good; use kernels to transform the data.

Projection with kernels. [Plot: the same data transformed with a Gaussian kernel, σ² = 2.7463.]
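
A sketch of the kernel idea: form a Gaussian (RBF) kernel matrix and project the implicitly transformed data onto 2 dimensions via the top eigenvectors of the centered kernel (a kernel-PCA-style construction; σ and the random data are illustrative, not the values behind the plot above):

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """X: N x n (samples in rows). Returns the N x N Gaussian kernel matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_project_2d(X, sigma):
    """Kernel-PCA style 2-D embedding of the implicitly transformed data."""
    K = gaussian_kernel(X, sigma)
    N = len(X)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one    # center in feature space
    w, V = np.linalg.eigh(Kc)                     # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:2]                 # two largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

Y = kernel_project_2d(np.random.default_rng(4).standard_normal((100, 3)), sigma=1.7)
```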

Linear Discriminant Analysis (LDA). Define the "between scatter": a measure of how well separated two distinct classes are. Define the "within scatter": a measure of how well clustered items of the same class are. Goal: make the between-scatter measure large while making the within-scatter measure small. Idea: project the data onto a low-dimensional space so as to maximize the ratio of the between-scatter measure to the within-scatter measure of the classes.

Let µ be the mean of X, and µ^{(k)} the mean of the k-th class (of size n_k). Define:

S_B = Σ_{k=1}^{c} n_k (µ^{(k)} − µ)(µ^{(k)} − µ)^T,

S_W = Σ_{k=1}^{c} Σ_{x_i ∈ X_k} (x_i − µ^{(k)})(x_i − µ^{(k)})^T.

[Illustration: cluster centroids and the global centroid.]

Project the set onto a one-dimensional space spanned by a vector a. Then:

a^T S_B a = Σ_{k=1}^{c} n_k |a^T (µ^{(k)} − µ)|²,
a^T S_W a = Σ_{k=1}^{c} Σ_{x_i ∈ X_k} |a^T (x_i − µ^{(k)})|².

LDA projects the data so as to maximize the ratio of these two numbers:

max_a (a^T S_B a) / (a^T S_W a).

The optimal a is the eigenvector associated with the largest eigenvalue of S_B u_i = λ_i S_W u_i.

LDA: extension to arbitrary dimension. We would like to project onto d dimensions, i.e., to maximize the ratio of two traces,

Tr[U^T S_B U] / Tr[U^T S_W U]   subject to U^T U = I,

with reduced-dimension data Y = U^T X. Common belief: this is hard to maximize. In fact it is not a big issue [see Ngo, Bellalij & YS]. A common alternative is to solve instead the (easier) problem

max_{U^T S_W U = I} Tr[U^T S_B U],

whose solution is given by the eigenvectors associated with the largest eigenvalues of S_B u_i = λ_i S_W u_i.
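
A minimal sketch of the eigenvector computation for the easier problem: build S_B and S_W as defined above and take the d leading eigenvectors of the pencil (S_B, S_W); the small ridge added to S_W (in case it is singular) is an assumption of this sketch, not part of the slides:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, d):
    """X: n x N (one sample per column). Returns an n x d projection matrix U."""
    n, N = X.shape
    mu = X.mean(axis=1, keepdims=True)
    S_B = np.zeros((n, n))
    S_W = np.zeros((n, n))
    for k in np.unique(labels):
        Xk = X[:, labels == k]
        mk = Xk.mean(axis=1, keepdims=True)
        S_B += Xk.shape[1] * (mk - mu) @ (mk - mu).T
        S_W += (Xk - mk) @ (Xk - mk).T
    S_W += 1e-6 * np.eye(n)            # small ridge in case S_W is singular
    w, V = eigh(S_B, S_W)              # generalized eigenproblem, ascending order
    U = V[:, ::-1][:, :d]              # eigenvectors of the d largest eigenvalues
    return U                           # reduced data: Y = U.T @ X
```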

Face recognition: background. Problem: we are given a database of images [arrays of pixel values] and a test (new) image. Question: does this new image correspond to one of those in the database?

Difficulty: positions, expressions, background, lighting, neckties, ... Eigenfaces: a Principal Component Analysis technique. Specific situation: poor images or deliberately altered images [occlusion]; see real-life examples [international man-hunt].

Eigenfaces. Consider each picture as a (1-D) column of all its pixels and put the columns together into an array A of size (# pixels) × (# images). Do an SVD of A and perform the comparison with any test image in the low-dimensional space. Similar to LSI in spirit, but the data is not sparse. Idea: replace the SVD by Lanczos vectors (same as for information retrieval).
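
A minimal eigenfaces-style sketch using a truncated SVD; a Lanczos-based solver could replace the dense SVD, as the slide suggests. The mean-centering and the nearest-neighbor matching rule here are standard choices assumed for illustration:

```python
import numpy as np

def eigenface_match(A, test_image, d=40):
    """A: (#pixels x #images) matrix whose columns are flattened training
    images. Returns the index of the closest training image in the
    d-dimensional SVD subspace."""
    mean = A.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(A - mean, full_matrices=False)
    Ud = U[:, :d]                        # the "eigenfaces"
    coords = Ud.T @ (A - mean)           # d x (#images) coordinates
    q = Ud.T @ (test_image.reshape(-1, 1) - mean)
    dists = np.linalg.norm(coords - q, axis=0)
    return int(dists.argmin())
```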

Tests with two well-known data sets. ORL: 40 subjects, 10 sample images each; # of pixels: 112 × 92; total # of images: 400. AR set: 126 subjects, 4 facial expressions selected for each [natural, smiling, angry, screaming]; # of pixels: 112 × 92; total # of images: 504.

Tests: face recognition. Recognition accuracy of the Lanczos approximation vs. the truncated SVD. [Plots: two panels, ORL and AR datasets, comparing lanczos and tsvd; y-axis: average error rate, x-axis: subspace dimension.]

GRAPH-BASED TECHNIQUES

Graph-based methods. Start with a graph of the data, e.g. the graph of k nearest neighbors (k-NN graph). Goal: find a projection that preserves the graph in some sense. Define a graph Laplacean L = D − W, where, e.g.,

w_ij = 1 if j ∈ N_i, 0 otherwise;   D = diag(d_ii),  d_ii = Σ_{j ≠ i} w_ij,

with N_i the neighborhood of i (excluding i).

Example: the Laplacean eigenmaps approach. Laplacean eigenmaps *minimizes*

F_EM(Y) = Σ_{i,j=1}^{n} w_ij ||y_i − y_j||²   subject to Y^T D Y = I.

Motivation: if ||x_i − x_j|| is small (original data), we want ||y_i − y_j|| to also be small (low-dimensional data). Note: a minimization, instead of a maximization as in PCA. The problem uses the original data only indirectly, through its graph, and leads to an n × n sparse eigenvalue problem [in sample space].
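
A sketch of both ingredients: build a symmetrized k-NN adjacency matrix W with 0/1 weights (as in the example above), form L = D − W, and embed by solving L y = λ D y, keeping the eigenvectors of the smallest nontrivial eigenvalues. A dense solver is used here for brevity; at scale one would use a sparse eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

def knn_graph(X, k):
    """X: N x n (samples in rows). Returns a symmetric 0/1 adjacency matrix W."""
    N = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                  # exclude self-loops
    W = np.zeros((N, N))
    for i in range(N):
        W[i, np.argsort(d2[i])[:k]] = 1.0
    return np.maximum(W, W.T)                     # symmetrize

def laplacian_eigenmaps(X, k=8, dim=2):
    W = knn_graph(X, k)
    D = np.diag(W.sum(axis=1))
    L = D - W                                     # graph Laplacean
    w, V = eigh(L, D)                             # L y = lambda D y, ascending
    return V[:, 1:dim + 1]                        # skip the trivial eigenvector
```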

Graph-based methods in a supervised setting. The subjects of the training set are known (labeled). [Illustration: rows of training images labeled Name1, Name2, Name3.] Question: given a test image, find its label (best match).

These methods can be adapted to a supervised mode. Idea: build G so that nodes in the same class are neighbors. If c = # classes, G consists of c cliques and the matrix W is block diagonal, W = diag(W_1, W_2, ..., W_c). Note: rank(W) = n − c. This construction can be used for LPP, ONPP, etc. Recent improvement: add a repulsion Laplacean [Kokiopoulou & YS '09].

APPLICATION TO MATERIALS

Application to materials. Studying materials properties using ab-initio methods can be a major challenge. Density Functional Theory and the Kohn-Sham equations are used to determine the electronic structure, and they must often be solved many, many times for a single simulation.

Kohn-Sham equations: a nonlinear eigenvalue problem

[ −(1/2) ∇² + (V_ion + V_H + V_xc) ] Ψ_i = E_i Ψ_i,   i = 1, ..., n_o,

ρ(r) = Σ_{i=1}^{n_o} |Ψ_i(r)|²,   −∇² V_H = 4π ρ(r).

Both V_xc and V_H depend on ρ, so the potentials and charge densities must be self-consistent. Most time-consuming part: diagonalization.

Data mining for materials: materials informatics. There is huge potential in exploiting two trends: (1) improvements in the efficiency and capabilities of computational methods for materials, and (2) recent progress in data mining techniques. Current practice: "one student, one alloy, one PhD", a slow pace of discovery. Data mining can help speed up the process, e.g., by exploring the space of materials in a smarter way.

However: databases, and sharing in general, are not in great shape in materials science. "The inherently fragmented and multidisciplinary nature of the materials community poses barriers to establishing the required networks for sharing results and information. One of the largest challenges will be encouraging scientists to think of themselves not as individual researchers but as part of a powerful network collectively analyzing and using data generated by the larger community. These barriers must be overcome." Materials Genome Initiative [NSF], NSTC report to the White House, June 2011.

Unsupervised clustering. 1970s: "data mining by hand": find coordinates that cluster materials according to structure; a 2-D projection built from physical knowledge [see J. R. Chelikowsky and J. C. Phillips, Phys. Rev. B 19 (1978)]. Anomaly detection: helped find that the compound CuF does not exist.

Question: can modern data mining achieve a similar diagrammatic separation of structures? It should use only information from the two constituent atoms. Experiment: 67 binary octets. Use PCA, exploiting only data from the 2 constituent atoms:
1. Number of valence electrons;
2. Ionization energies of the s-states of the ion core;
3. Ionization energies of the p-states of the ion core;
4. Radii for the s-states as determined from model potentials;
5. Radii for the p-states as determined from model potentials.

Result: [2-D PCA projection separating the structure classes (labeled Z, W, D, R, ZW, WR); compounds shown include MgTe, CdTe, AgI, CuI, MgSe, MgS, CuBr, CuCl, SiC, ZnO, BN.]

Supervised learning: classification. Problem: classify an unknown binary compound into its crystal structure class; 55 compounds, 6 crystal structure classes, leave-one-out experiment.
Case 1: use features 1:5 for atom A and 2:5 for atom B; no scaling is applied.
Case 2: features 2:5 from each atom; scale features 2 to 4 by the square root of the number of valence electrons (feature 1).
Case 3: features 1:5 for atom A and 2:5 for atom B; scale features 2 and 3 by the square root of the number of valence electrons.

Three methods tested:
1. PCA classification: project and do the identification in the space of reduced dimension (Euclidean distance in the low-dimensional space).
2. KNN: K-nearest-neighbor classification.
3. Orthogonal Neighborhood Preserving Projections (ONPP), a graph-based method [see Kokiopoulou & YS, 2005].

K-nearest-neighbor (KNN) classification. Arguably the simplest of all methods. The idea is a voting system: compute the distances between the test sample and all other compounds, consider the classes of the k nearest neighbors (k = 8 here), and assign the predominant class among these k items to the test sample. [Illustration: the test sample, marked by an asterisk, among its neighbors.]
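
A minimal KNN sketch matching the description above (majority vote among the k nearest neighbors, plain Euclidean distance):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_test, k=8):
    """X_train: N x n feature rows, y_train: N labels. Euclidean distances,
    majority vote among the k nearest training samples."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```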

Results: recognition rate for the 3 methods using the data in the different ways described above.

Case     KNN     ONPP    PCA
Case 1   0.909   0.945   0.945
Case 2   0.945   0.945   1.000
Case 3   0.945   0.945   0.982

Melting point prediction. It can be very difficult to predict materials properties by correlations alone. Test: 44 AB suboctet compounds; experimental melting points are used in a leave-one-out experiment with simple linear regression and Tikhonov regularization.
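
A minimal sketch of the leave-one-out protocol with Tikhonov-regularized (ridge) linear regression; the feature matrix, the melting-point vector, and the value of the regularization parameter are placeholders, not the data of the experiment:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Tikhonov-regularized least squares: (X^T X + lam I) w = X^T y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

def leave_one_out(X, y, lam=1e-2):
    """Predict each sample from a model trained on all the others."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w = ridge_fit(X[mask], y[mask], lam)
        preds[i] = X[i] @ w
    return preds

# X: 44 x p matrix of atomic-feature vectors (placeholder), y: melting points (K).
```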

Atomic features used:
(1) Number of valence electrons;
(2) Radius for the s-states as determined from model potentials;
(3) Radius for the p-states as determined from model potentials;
(4) Electronegativity;
(5) Boiling point;
(6) First ionization potential;
(7) Heat of vaporization;
(8) Atomic number.

[Scatter plot: experimental (observed) vs. predicted melting points in K; blue dashed lines mark the boundaries for 15% relative error.]

Conclusion. There are many interesting new matrix problems related to data mining as well as to emerging scientific fields: (1) information technologies [learning, data mining, ...]; (2) computational chemistry / materials science; (3) bio-informatics: computational biology, genomics, ... Important: lots of resources are available online (repositories, tutorials, ...), so it is easy to get started. Materials informatics will likely be energized by the materials genome project.