Advanced data analysis


Advanced data analysis
Akisato Kimura (木村昭悟)
NTT Communication Science Laboratories
E-mail: akisato@ieee.org

Advanced data analysis
1. Introduction (Aug 20)
2. Dimensionality reduction (Aug 20, 21): PCA, LPP, FDA, CCA, PLS
3. Non-linear methods (Aug 27): Kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering (Aug 28): K-means, spectral clustering
5. Generalization (Sep 3)

Class web page: http://www.brl.ntt.co.jp/people/akisato/titech/class.html
Slides and data will be uploaded on this page.

Advanced data analysis
1. Introduction
2. Dimensionality reduction: PCA, LPP, FDA, CCA, PLS
3. Non-linear methods: Kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering: K-means, spectral clustering
5. Generalization

Curse of dimensionality
Samples $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $d \gg 1$.
If your data samples are high-dimensional, they are often too complex to analyze directly.
Usual geometric intuition is often applicable only to low-dimensional problems; in high-dimensional problems it can even be misleading.

Curse of dimensionality (cont.)
When the dimensionality $d$ increases:
The volume of the unit hyper-cube $V_c$ is always 1.
The volume of the inscribed hyper-sphere $V_s$ (radius 0.5) goes to 0:
$d = 1$: $V_s = 1$; $d = 2$: $V_s = \pi \cdot 0.5^2 \approx 0.79$; $d = 3$: $V_s = \frac{4}{3}\pi \cdot 0.5^3 \approx 0.52$; general $d$: $V_s = \frac{\pi^{d/2}}{\Gamma(d/2+1)} 0.5^d \to 0$.
The relative size of the hyper-sphere gets small: $V_s / V_c \to 0$ (in contradiction to our geometric intuition).
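As a quick numerical check of this ratio, here is a minimal Python sketch (not part of the original slides; the helper name is mine) that evaluates $V_s / V_c$ for a few dimensions:

```python
import numpy as np
from scipy.special import gamma

def sphere_to_cube_ratio(d, r=0.5):
    """Volume of the radius-r hyper-sphere inscribed in the unit hyper-cube."""
    v_sphere = np.pi ** (d / 2) / gamma(d / 2 + 1) * r ** d
    return v_sphere / 1.0              # the unit hyper-cube has volume 1

for d in [1, 2, 3, 10, 100]:
    print(d, sphere_to_cube_ratio(d))  # the ratio shrinks rapidly toward 0
```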

Curse of dimensionality (cont.)
Grid sampling requires an exponentially large number of points:
$d = 1$: $n = 5$; $d = 2$: $n = 5^2$; $d = 3$: $n = 5^3$; general $d$: $n = 5^d$.
Unless you have an exponentially large number of samples, your high-dimensional samples are never dense.

Dimensionality reduction
We want to reduce the dimensionality of the data while preserving the intrinsic information in the data.
Dimensionality reduction is also called embedding.
If the dimension is reduced to 3 or fewer, it is also called data visualization.
Basic assumption (or belief) behind dimensionality reduction: your high-dimensional data is redundant in some sense.

Notation: Linear embedding
Data samples: $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $d \gg 1$.
Embedding matrix: $B \in \mathbb{R}^{m \times d}$, $1 \le m \le d$.
Embedded data samples: $\{z_i\}_{i=1}^n$, $z_i = B x_i \in \mathbb{R}^m$.

Advanced data analysis
1. Introduction
2. Dimensionality reduction: PCA, LPP, FDA, CCA, PLS
3. Non-linear methods: Kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering: K-means, spectral clustering
5. Generalization

Principal component analysis (PCA)
Idea: we want to get rid of a redundant dimension of the data samples, e.g., (10, 0), (20, 0.1), (30, 0.1) → 10, 20, 30.
This can be achieved by minimizing the distance between the embedded samples and the original samples.

Data centering
We center the data samples by $\bar{x}_i = x_i - \frac{1}{n}\sum_{j=1}^n x_j$, so that $\sum_{i=1}^n \bar{x}_i = 0$.
In matrix form: $\bar{X} = X H$, where
$X = (x_1\ x_2\ \cdots\ x_n)$, $\bar{X} = (\bar{x}_1\ \bar{x}_2\ \cdots\ \bar{x}_n)$, $H = I_n - \frac{1}{n} 1_{n \times n}$,
$I_n$: $n$-dimensional identity matrix, $1_{n \times n}$: $n \times n$ matrix with all ones.
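A minimal NumPy sketch of this centering step (variable names are my own, not from the slides); it forms the centering matrix $H$ explicitly only to mirror the formula, whereas in practice one would simply subtract the mean:

```python
import numpy as np

def center_columns(X):
    """Center data stored column-wise: X is d x n, returns Xbar = X H."""
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix H = I_n - (1/n) 1_{n x n}
    return X @ H                            # same as X - X.mean(axis=1, keepdims=True)

X = np.random.randn(5, 100)                 # toy data with d = 5, n = 100
Xbar = center_columns(X)
print(np.allclose(Xbar.sum(axis=1), 0))     # centered columns sum to zero -> True
```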

Orthogonal projection
$\{b_i \in \mathbb{R}^d\}_{i=1}^m$: orthonormal basis of the $m$-dimensional embedding subspace:
$\langle b_i, b_j \rangle = b_i^\top b_j = \delta_{i,j} = 1$ if $i = j$, $0$ if $i \ne j$.
In matrix form: $B = (b_1\ b_2\ \cdots\ b_m)^\top$, $B B^\top = I_m$.
The orthogonal projection of $\bar{x}_i$ is expressed by $\sum_{j=1}^m \langle b_j, \bar{x}_i \rangle b_j \ (= B^\top B \bar{x}_i)$.

PCA criterion
Minimize the sum of squared distances:
$\sum_{i=1}^n \| B^\top B \bar{x}_i - \bar{x}_i \|^2 = -\mathrm{tr}(B C B^\top) + \mathrm{tr}(C)$,
so minimizing it is equivalent to maximizing $\mathrm{tr}(B C B^\top) = \sum_{i=1}^m b_i^\top C b_i$,
where $C = \sum_{i=1}^n \bar{x}_i \bar{x}_i^\top = \bar{X} \bar{X}^\top$.
PCA criterion: $B_{\mathrm{PCA}} = \arg\max_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B C B^\top)$ subject to $B B^\top = I_m$.

PCA: Summary
A PCA solution: $B_{\mathrm{PCA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$,
where $\{(\lambda_i, \psi_i)\}_{i=1}^d$ are the sorted eigenvalues and normalized eigenvectors of $C \psi = \lambda \psi$:
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, $\langle \psi_i, \psi_j \rangle = \delta_{i,j}$.
PCA embedding of a sample $x$: $z = B_{\mathrm{PCA}} \left( x - \frac{1}{n} X 1_n \right)$ (data centering),
where $1_n$ is the $n$-dimensional vector with all ones.
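This summary translates almost directly into NumPy. Below is a minimal sketch (my own function and variable names, not from the slides) that centers the data, eigendecomposes $C = \bar{X}\bar{X}^\top$, and embeds the samples, assuming one sample per column as in the notation slide:

```python
import numpy as np

def pca_embed(X, m):
    """PCA as on the slides: X is d x n (one sample per column); returns (B, Z)."""
    mean = X.mean(axis=1, keepdims=True)
    Xbar = X - mean                          # data centering
    C = Xbar @ Xbar.T                        # scatter matrix C = Xbar Xbar^T
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest eigenvalues
    B = eigvecs[:, order].T                  # B_PCA = (psi_1 ... psi_m)^T
    Z = B @ Xbar                             # embedded samples z_i = B (x_i - mean)
    return B, Z

B, Z = pca_embed(np.random.randn(4, 200), m=2)
print(B.shape, Z.shape)                      # (2, 4) (2, 200)
```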

Proof
We first derive necessary conditions for a solution.
Lagrangian: $L(B, \Delta) = \mathrm{tr}(B C B^\top) - \mathrm{tr}\left( (B B^\top - I_m) \Delta \right)$,
$\Delta$: Lagrange multipliers (symmetric matrix).
Necessary conditions:
$\frac{\partial L}{\partial B} = 2 B C - 2 \Delta B = 0 \;\Rightarrow\; C B^\top = B^\top \Delta$ (1)
$\frac{\partial L}{\partial \Delta} = B B^\top - I_m = 0 \;\Rightarrow\; B B^\top = I_m$ (2)

Proof (cont.)
Eigendecomposition of $\Delta$: $\Delta = T \Gamma T^\top$ (3),
$T$: orthogonal matrix ($T^{-1} = T^\top$), $\Gamma$: diagonal matrix.
Substituting (3) into (1): $C B^\top = B^\top T \Gamma T^\top$ (4), hence $C B^\top T = B^\top T \Gamma$ (5).
This is an eigensystem of $C$:
$\Gamma = \mathrm{diag}(\lambda_{k_1}, \lambda_{k_2}, \ldots, \lambda_{k_m})$ (6),
$B^\top T = (\psi_{k_1}\ \psi_{k_2}\ \cdots\ \psi_{k_m})$, $k_i \in \{1, 2, \ldots, d\}$,
so $B = T (\psi_{k_1}\ \psi_{k_2}\ \cdots\ \psi_{k_m})^\top$ (7).

Proof (cont.)
$B B^\top = I_m \Rightarrow \mathrm{rank}(B) = m \Rightarrow$ all $\{k_i\}_{i=1}^m$ are distinct.
Summary of the necessary conditions:
(3) $\Delta = T \Gamma T^\top$
(6) $\Gamma = \mathrm{diag}(\lambda_{k_1}, \lambda_{k_2}, \ldots, \lambda_{k_m})$
(7) $B = T (\psi_{k_1}\ \psi_{k_2}\ \cdots\ \psi_{k_m})^\top$
with all $\{k_i\}_{i=1}^m$ distinct.

Proof (cont.)
Now we choose the $\{k_i\}_{i=1}^m$ that maximize the objective function $\mathrm{tr}(B C B^\top)$.
Using (2), (4) and (6):
$\mathrm{tr}(B C B^\top) = \mathrm{tr}(B B^\top T \Gamma T^\top) = \mathrm{tr}(T \Gamma T^\top) = \mathrm{tr}(\Gamma T^\top T) = \sum_{i=1}^m \lambda_{k_i}$ (using that $T$ is orthogonal).
Because $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, choosing $k_i = i$ maximizes the objective.
Choosing $T = I_m$ then gives $B = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$.

Pearson correlation
Correlation coefficient for $\{(s_i, t_i)\}_{i=1}^n$:
$\rho = \frac{\sum_{i=1}^n (s_i - \bar{s})(t_i - \bar{t})}{\sqrt{\sum_{i=1}^n (s_i - \bar{s})^2} \sqrt{\sum_{i=1}^n (t_i - \bar{t})^2}}$, where $\bar{s} = \frac{1}{n}\sum_i s_i$ and $\bar{t} = \frac{1}{n}\sum_i t_i$.
Positively correlated: $\rho > 0$; uncorrelated: $\rho \approx 0$; negatively correlated: $\rho < 0$.
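A small NumPy check of this definition against np.corrcoef (the data below is made up for illustration):

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 1.9, 3.2, 3.9])

rho = ((s - s.mean()) * (t - t.mean())).sum() / (
    np.sqrt(((s - s.mean()) ** 2).sum()) * np.sqrt(((t - t.mean()) ** 2).sum())
)
print(rho, np.corrcoef(s, t)[0, 1])   # both give the same correlation coefficient
```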

PCA uncorrelates data
$B_{\mathrm{PCA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$.
The covariance matrix of the PCA-embedded samples is diagonal:
$\frac{1}{n} \sum_{i=1}^n z_i z_i^\top = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$ (homework).
Elements of $z$ are uncorrelated!

Examples
The data is well described.
PCA is intuitive, easy to implement, analytically computable, and fast.

Examples (cont.)
Iris data (4-d → 2-d), Letter data (16-d → 2-d).
The embedded samples seem informative.

Examples (cont.)
However, PCA does not necessarily preserve interesting information such as clusters.

Homework
1. Implement PCA and reproduce the 2-dimensional examples shown in the class.
Datasets 1 and 2 are available from http://www.ms.k.u-tokyo.ac.jp/sugi/data/dataanalysis/
(Optional) Test PCA on your own (artificial or real) data and analyze characteristics of PCA.

Homework (cont.)
2. Prove that PCA uncorrelates samples. More specifically, prove that the covariance matrix of the PCA-embedded samples is the following diagonal matrix:
$\frac{1}{n} \sum_{i=1}^n z_i z_i^\top = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$,
where $B_{\mathrm{PCA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$ and $z_i = B_{\mathrm{PCA}} \bar{x}_i$.


Advanced data analysis
1. Introduction
2. Dimensionality reduction: PCA, LPP, FDA, CCA, PLS
3. Non-linear methods: Kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering: K-means, spectral clustering
5. Generalization

Locality preserving projection (LPP)
PCA finds a subspace that describes the data well.
However, PCA can miss some interesting structures such as clusters.
Another idea: find a subspace that preserves local structures in the data well.

Similarity matrix
Similarity matrix $W$: the more similar $x_i$ and $x_j$ are, the larger $W_{i,j}$ is.
Assumptions on $W$:
Symmetric: $W_{i,j} = W_{j,i}$.
Normalized: $0 \le W_{i,j} \le 1$.
$W$ is also called the affinity matrix.

Examples of similarity matrix
Distance-based: $W_{i,j} = \exp\left( -\|x_i - x_j\|^2 / \gamma^2 \right)$, $\gamma > 0$.
Nearest-neighbor-based: $W_{i,j} = 1$ if $x_i$ is a $k$-nearest neighbor of $x_j$ or $x_j$ is a $k$-nearest neighbor of $x_i$; otherwise $W_{i,j} = 0$.
A combination of the two is also possible.
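A minimal NumPy sketch of these two constructions (helper names are mine; the slides do not prescribe an implementation):

```python
import numpy as np

def distance_similarity(X, gamma):
    """W_ij = exp(-||x_i - x_j||^2 / gamma^2); X is d x n."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / gamma ** 2)

def knn_similarity(X, k):
    """W_ij = 1 if x_i is a k-nearest neighbor of x_j or vice versa, else 0."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    n = X.shape[1]
    W = np.zeros((n, n))
    for j in range(n):
        nn = np.argsort(sq[:, j])[1:k + 1]   # skip the point itself
        W[nn, j] = 1
    return np.maximum(W, W.T)                # symmetrize (the "or" condition)

X = np.random.randn(2, 50)
W = knn_similarity(X, k=5)
```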

LPP criterion
Idea: embed two close points close to each other, i.e., minimize
$\sum_{i,j=1}^n W_{i,j} \| B x_i - B x_j \|^2 \ (\ge 0)$.
This is expressed as $2\,\mathrm{tr}(B X L X^\top B^\top)$ (homework), where
$X = (x_1\ x_2\ \cdots\ x_n)$, $L = D - W$, $D = \mathrm{diag}\left( \sum_{j=1}^n W_{1,j}, \ldots, \sum_{j=1}^n W_{n,j} \right)$.
Since $B = 0$ gives a meaningless solution, we impose the constraint $B X D X^\top B^\top = I_m$.

LPP: Summary
LPP criterion: $B_{\mathrm{LPP}} = \arg\min_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B X L X^\top B^\top)$ subject to $B X D X^\top B^\top = I_m$.
Solution (homework): $B_{\mathrm{LPP}} = (\psi_d\ \psi_{d-1}\ \cdots\ \psi_{d-m+1})^\top$,
where $\{(\lambda_i, \psi_i)\}_{i=1}^d$ are the sorted generalized eigenvalues and normalized eigenvectors of $X L X^\top \psi = \lambda X D X^\top \psi$:
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, $\langle X D X^\top \psi_i, \psi_j \rangle = \delta_{i,j}$.
LPP embedding of a sample $x$: $z = B_{\mathrm{LPP}} x$.
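Under the same column-wise data convention, here is a minimal sketch of the LPP criterion using SciPy's generalized symmetric eigensolver (my own code, not from the slides):

```python
import numpy as np
from scipy.linalg import eigh

def lpp_embed(X, W, m):
    """LPP as on the slides: X is d x n, W is an n x n similarity matrix."""
    D = np.diag(W.sum(axis=1))     # degree matrix
    L = D - W                      # graph Laplacian
    A = X @ L @ X.T                # X L X^T
    C = X @ D @ X.T                # X D X^T (constraint matrix)
    eigvals, eigvecs = eigh(A, C)  # generalized eigenvalues in ascending order
    B = eigvecs[:, :m].T           # eigenvectors with the m smallest eigenvalues
    return B, B @ X

# Example with a distance-based similarity matrix:
X = np.random.randn(3, 80)
sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
W = np.exp(-sq)                    # gamma = 1
B_lpp, Z = lpp_embed(X, W, m=2)
```

Note that scipy.linalg.eigh(A, C) returns eigenvectors normalized so that $\psi_i^\top (X D X^\top) \psi_j = \delta_{i,j}$, matching the constraint above.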

Generalized eigenvalue problem
$A \psi = \lambda C \psi$, $C$: positive definite symmetric matrix.
Then there exists a positive definite symmetric matrix $C^{1/2}$ such that $(C^{1/2})^2 = C$:
Eigenvalue decomposition of $C$: $C = \sum_i \gamma_i \phi_i \phi_i^\top$, $\gamma_i > 0$,
so $C^{1/2} = \sum_i \sqrt{\gamma_i}\, \phi_i \phi_i^\top$.

Generalized eigenvalue problem (cont.)
$A \psi = \lambda C \psi$. Letting $\phi = C^{1/2} \psi$, we obtain $C^{-1/2} A C^{-1/2} \phi = \lambda \phi$.
This is an ordinary eigenproblem.
Ordinary eigenvectors are orthonormal: $\langle \phi_i, \phi_j \rangle = \delta_{i,j} = 1$ if $i = j$, $0$ if $i \ne j$.
Generalized eigenvectors are $C$-orthonormal: $\langle C \psi_i, \psi_j \rangle = \delta_{i,j}$.
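A small NumPy sketch of this reduction (illustrative only, with made-up matrices): it builds $C^{-1/2}$ from the eigendecomposition of $C$ and checks that the generalized eigenvalues of $(A, C)$ match the ordinary eigenvalues of $C^{-1/2} A C^{-1/2}$:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T                        # symmetric
C = M @ M.T + 4 * np.eye(4)        # symmetric positive definite

gam, Phi = np.linalg.eigh(C)
C_inv_half = Phi @ np.diag(gam ** -0.5) @ Phi.T          # C^{-1/2}

lam_gen = eigh(A, C, eigvals_only=True)                  # generalized eigenvalues of (A, C)
lam_ord = np.linalg.eigvalsh(C_inv_half @ A @ C_inv_half)
print(np.allclose(np.sort(lam_gen), np.sort(lam_ord)))   # True
```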

Examples
Blue: PCA. Green: LPP.
Note: the similarity matrix is defined by the nearest-neighbor-based method with 50 nearest neighbors.
LPP describes the data well, and it also preserves the cluster structure.
LPP is intuitive, easy to implement, analytically computable, and fast.

Examples (cont.)
Embedding handwritten numerals from 3 to 8 into a 2-dimensional subspace.
Each image consists of 16x16 pixels.

Examples (cont.)
LPP finds (slightly) clearer clusters than PCA.

Drawbacks of LPP
The obtained results highly depend on the similarity matrix $W$.
Appropriately designing the similarity matrix (e.g., choosing $k$ or $\gamma$) is not always easy.

Local scaling of samples
Densities of samples may be locally different (dense regions vs. sparse regions).
Using the same $\gamma$ globally in the similarity matrix may not be appropriate:
$W_{i,j} = \exp\left( -\|x_i - x_j\|^2 / \gamma^2 \right)$, $\gamma > 0$.

Local scaling heuristic
$\gamma_i$: scaling around the sample $x_i$:
$\gamma_i = \| x_i - x_i^{(k)} \|$, where $x_i^{(k)}$ is the $k$-th nearest neighbor of $x_i$.
Local-scaling-based similarity matrix:
$W_{i,j} = \exp\left( -\|x_i - x_j\|^2 / (\gamma_i \gamma_j) \right)$.
A heuristic choice is $k = 7$.
L. Zelnik-Manor & P. Perona, Self-tuning spectral clustering, Advances in Neural Information Processing Systems 17, 1601-1608, MIT Press, 2005.
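A short sketch of this heuristic (my own implementation of the formula above, using the suggested default k = 7):

```python
import numpy as np

def local_scaling_similarity(X, k=7):
    """W_ij = exp(-||x_i - x_j||^2 / (gamma_i gamma_j)), gamma_i = distance to the k-th NN."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # squared distances, n x n
    dist = np.sqrt(sq)
    gamma = np.sort(dist, axis=0)[k]        # row 0 is the point itself, row k its k-th neighbor
    return np.exp(-sq / np.outer(gamma, gamma))

W = local_scaling_similarity(np.random.randn(2, 100))
```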

Graph theory
Graph: a set of vertices and edges.
Adjacency matrix $W$: $W_{i,j}$ is the number of edges from the $i$-th to the $j$-th vertex.
Vertex degree $d_i$: the number of edges connected to the $i$-th vertex.

Spectral graph theory
Spectral graph theory studies relationships between the properties of a graph and its adjacency matrix.
Graph Laplacian $L$:
$L_{i,j} = d_i$ if $i = j$; $L_{i,j} = -1$ if $i \ne j$ and $W_{i,j} > 0$; $L_{i,j} = 0$ otherwise.

Relation to spectral graph theory
Suppose our similarity matrix $W$ is defined based on nearest neighbors.
Consider the following graph: each vertex corresponds to a point $x_i$, and an edge exists if $W_{i,j} > 0$.
Then $W$ is the adjacency matrix, $D$ is the diagonal matrix of vertex degrees, and $L = D - W$ is the graph Laplacian.

Homework
1. Prove that $\sum_{i,j=1}^n W_{i,j} \| B x_i - B x_j \|^2 = 2\,\mathrm{tr}(B X L X^\top B^\top)$, where
$X = (x_1\ x_2\ \cdots\ x_n)$, $L = D - W$, $D = \mathrm{diag}\left( \sum_{j=1}^n W_{1,j}, \ldots, \sum_{j=1}^n W_{n,j} \right)$.

Homework (cont.)
2. Let $B$ be an $m \times d$ matrix ($1 \le m \le d$), and let $C, D$ be $d \times d$ positive definite symmetric matrices.
Let $\{(\lambda_i, \psi_i)\}_{i=1}^d$ be the sorted generalized eigenvalues and normalized eigenvectors of $C \psi = \lambda D \psi$:
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, $\langle D \psi_i, \psi_j \rangle = \delta_{i,j}$.
Then prove that a solution of
$B_{\min} = \arg\min_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B C B^\top)$ subject to $B D B^\top = I_m$
is given by $B_{\min} = (\psi_d\ \psi_{d-1}\ \cdots\ \psi_{d-m+1})^\top$.

Homework (cont.)
3. (Optional) Implement LPP and reproduce the 2-dimensional examples shown in the class.
Datasets 1 and 2 are available from http://www.ms.k.u-tokyo.ac.jp/sugi/data/dataanalysis/
Test LPP on your own (artificial or real) data and analyze characteristics of LPP.


Advanced data analysis
1. Introduction
2. Dimensionality reduction: PCA, LPP, FDA, CCA, PLS
3. Non-linear methods: Kernel trick, kernel PCA, kernel LPP, Laplacian eigenmap, kernel FDA/CCA
4. Clustering: K-means, spectral clustering
5. Generalization

Supervised dimensionality reduction
The best embedding is unknown in general.
If every sample has a class label, the best embedding is the one such that samples in different classes are well separated.
(Figure: two candidate directions, one better for representing large variances, the other better for representing local structures. Which is the best?)

Supervised dimensionality reduction
Samples $\{x_i\}_{i=1}^n$ have class labels $\{y_i\}_{i=1}^n$:
$\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{1, 2, \ldots, c\}$.
We want to obtain an embedding such that samples in different classes are well separated from each other.

Within-class scatter matrix
Sum of scatters within each class:
$S^{(w)} = \sum_{y=1}^{c} \sum_{i : y_i = y} (x_i - \mu_y)(x_i - \mu_y)^\top$,
where $\mu_y = \frac{1}{n_y} \sum_{i : y_i = y} x_i$ is the mean of the samples in class $y$ and $n_y$ is the number of samples in class $y$.

Between-class scatter matrix
Sum of scatters between classes:
$S^{(b)} = \sum_{y=1}^{c} n_y (\mu_y - \mu)(\mu_y - \mu)^\top$,
where $\mu_y = \frac{1}{n_y} \sum_{i : y_i = y} x_i$ is the mean of the samples in class $y$ and $\mu = \frac{1}{n} \sum_{i=1}^n x_i$ is the mean of all samples.
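These two scatter matrices are straightforward to compute. Below is a minimal sketch under the same column-wise data convention (my own helper, not from the slides):

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class S_w and between-class S_b; X is d x n, y holds integer class labels."""
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)           # mean of all samples
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[:, y == label]
        n_y = Xc.shape[1]
        mu_y = Xc.mean(axis=1, keepdims=True)    # mean of samples in class y
        S_w += (Xc - mu_y) @ (Xc - mu_y).T
        S_b += n_y * (mu_y - mu) @ (mu_y - mu).T
    return S_w, S_b

X = np.random.randn(3, 60)
y = np.repeat([1, 2, 3], 20)
S_w, S_b = scatter_matrices(X, y)
```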

Fisher discriminant analysis (FDA)
Idea: minimize the within-class scatter and maximize the between-class scatter, e.g., by maximizing $\mathrm{tr}\left( (B S^{(w)} B^\top)^{-1} B S^{(b)} B^\top \right)$.
To disable arbitrary scaling, we impose $B S^{(w)} B^\top = I_m$.
FDA criterion: $B_{\mathrm{FDA}} = \arg\max_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B S^{(b)} B^\top)$ subject to $B S^{(w)} B^\top = I_m$.

FDA: Summary
FDA criterion: $B_{\mathrm{FDA}} = \arg\max_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B S^{(b)} B^\top)$ subject to $B S^{(w)} B^\top = I_m$.
Solution: $B_{\mathrm{FDA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$,
where $\{(\lambda_i, \psi_i)\}$ are the sorted generalized eigenvalues and normalized eigenvectors of $S^{(b)} \psi = \lambda S^{(w)} \psi$:
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, $\langle S^{(w)} \psi_i, \psi_j \rangle = \delta_{i,j}$.
FDA embedding of a sample $x$: $z = B_{\mathrm{FDA}} x$.
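Given the scatter matrices from the sketch above, the FDA solution is again a generalized eigenproblem; a minimal continuation (my own code, reusing S_w, S_b, X from that sketch):

```python
from scipy.linalg import eigh

m = 2
eigvals, eigvecs = eigh(S_b, S_w)      # generalized eigenvalues in ascending order
B_fda = eigvecs[:, ::-1][:, :m].T      # eigenvectors with the m largest eigenvalues
Z = B_fda @ X                          # FDA embedding; eigh normalizes so B S_w B^T = I_m
```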

Examples of FDA
FDA can find an appropriate subspace.

Examples of FDA (cont.)
However, FDA does not work well if the samples in a class are multi-modal.

Dimensionality of the embedding space
It holds that $\mathrm{rank}(S^{(b)}) \le c - 1$, which means that the eigenvalues $\{\lambda_i\}_{i=c}^{d}$ ($\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$) are always zero (homework).
Due to the multiplicity of eigenvalues, the eigenvectors $\{\psi_i\}_{i=c}^{d}$ can be arbitrarily rotated in the null space of $S^{(b)}$.
Thus, FDA essentially requires $m \le c - 1$.
When $c = 2$, $m$ cannot be larger than 1!

Local Fisher discriminant analysis (LFDA)
Idea: take the locality of the data into account.
1. Nearby samples in the same class are made close.
2. Samples in different classes are made apart.
3. Far-apart samples in the same class can be ignored.
M. Sugiyama: Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis, JMLR, 8(May), 2007.

Pairwise expressions of scatters (homework)
$S^{(w)} = \frac{1}{2} \sum_{i,j=1}^n Q^{(w)}_{i,j} (x_i - x_j)(x_i - x_j)^\top$,
$S^{(b)} = \frac{1}{2} \sum_{i,j=1}^n Q^{(b)}_{i,j} (x_i - x_j)(x_i - x_j)^\top$,
where
$Q^{(w)}_{i,j} = 1/n_y$ if $y_i = y_j = y$, and $0$ if $y_i \ne y_j$ (samples in the same class are made close),
$Q^{(b)}_{i,j} = 1/n - 1/n_y$ if $y_i = y_j = y$, and $1/n$ if $y_i \ne y_j$ (samples in different classes are made apart).

Locality-aware scatters
$S^{(lw)} = \frac{1}{2} \sum_{i,j=1}^n Q^{(lw)}_{i,j} (x_i - x_j)(x_i - x_j)^\top$,
$S^{(lb)} = \frac{1}{2} \sum_{i,j=1}^n Q^{(lb)}_{i,j} (x_i - x_j)(x_i - x_j)^\top$,
where $W_{i,j}$ is a similarity matrix and
$Q^{(lw)}_{i,j} = W_{i,j}/n_y$ if $y_i = y_j = y$, and $0$ if $y_i \ne y_j$ (nearby samples in the same class are made close),
$Q^{(lb)}_{i,j} = W_{i,j}(1/n - 1/n_y)$ if $y_i = y_j = y$, and $1/n$ if $y_i \ne y_j$ (samples in different classes are made apart).

LFDA: Summary
LFDA criterion: $B_{\mathrm{LFDA}} = \arg\max_{B \in \mathbb{R}^{m \times d}} \mathrm{tr}(B S^{(lb)} B^\top)$ subject to $B S^{(lw)} B^\top = I_m$.
Solution: $B_{\mathrm{LFDA}} = (\psi_1\ \psi_2\ \cdots\ \psi_m)^\top$,
where $\{(\lambda_i, \psi_i)\}$ are the sorted generalized eigenvalues and normalized eigenvectors of $S^{(lb)} \psi = \lambda S^{(lw)} \psi$:
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, $\langle S^{(lw)} \psi_i, \psi_j \rangle = \delta_{i,j}$.
LFDA embedding of a sample $x$: $z = B_{\mathrm{LFDA}} x$.
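A compact sketch of the locality-aware scatters and the resulting LFDA embedding (my own implementation of the pairwise formulas above; W can be built with, e.g., the local_scaling_similarity helper sketched earlier):

```python
import numpy as np
from scipy.linalg import eigh

def lfda_embed(X, y, W, m):
    """LFDA as on the slides: X is d x n, y holds class labels, W is an n x n similarity matrix."""
    d, n = X.shape
    y = np.asarray(y)
    same = (y[:, None] == y[None, :])                        # same-class indicator
    n_y = np.array([np.sum(y == y[i]) for i in range(n)])    # class size for each sample
    Q_lw = np.where(same, W / n_y[None, :], 0.0)
    Q_lb = np.where(same, W * (1.0 / n - 1.0 / n_y[None, :]), 1.0 / n)
    S_lw = np.zeros((d, d))
    S_lb = np.zeros((d, d))
    for i in range(n):                                       # S = (1/2) sum_ij Q_ij (x_i - x_j)(x_i - x_j)^T
        diff = X - X[:, [i]]
        S_lw += 0.5 * (diff * Q_lw[i]) @ diff.T
        S_lb += 0.5 * (diff * Q_lb[i]) @ diff.T
    eigvals, eigvecs = eigh(S_lb, S_lw)                      # ascending generalized eigenvalues
    B = eigvecs[:, ::-1][:, :m].T                            # the m largest
    return B, B @ X

# Usage: B_lfda, Z = lfda_embed(X, y, local_scaling_similarity(X), m=2)
```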

Examples of LFDA
Similarity matrix: nearest-neighbor-based method with 50 nearest neighbors.
LFDA works well even for samples with within-class multi-modality.
Since $\mathrm{rank}(S^{(lb)})$ is generally larger than $c$, $m$ can be larger in LFDA.

Examples of FDA/LFDA
Thyroid disease data (5-dimensional), representing several statistics obtained from blood tests.
Label: healthy or sick.
Sickness can be caused by hyper-functioning of the thyroid (working too much) or hypo-functioning of the thyroid (working too little).

Projected samples onto 1-d space
FDA: sick and healthy samples are not separated; hyper- and hypo-functioning are completely mixed.
LFDA: sick and healthy samples are nicely separated; hyper- and hypo-functioning are also well separated.
