Machine Learning (BSMC-GA 4439) Wenke Liu


Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018

Biomedical data are usually high-dimensional: the number of samples (n) is relatively small, whereas the number of features (p) can be large; sometimes p >> n.

Problems: difficulty in interpreting the data; the curse of dimensionality (data points are sparse in high-dimensional space, so statistical methods that rely on local properties may not work well); extracted patterns may be unstable. Dimension reduction is usually the first step.

Dimension Reduction: mapping the data to a low-dimensional space. For each p-dimensional data point x = (x_1, x_2, ..., x_p)^T, find a k-dimensional representation x* = (x*_1, x*_2, ..., x*_k)^T that captures the content of the original data, where k < p. When there is additional information (supervised): feature selection, ridge regression, LASSO, and other regularization methods. When we only have X alone (unsupervised): feature extraction with a linear transformation (projection) of the original data, and nonlinear extensions. In this lecture, we will focus on the unsupervised case, especially the projection methods.

A Note on Notation: the data matrix X is a p x n matrix, sometimes with p >> n (for example, p gene expression levels for n different samples). Its entry x_ij is the value of feature i in sample j:

X = [ x_11 ... x_1j ... x_1n ]
    [  ...      ...      ... ]
    [ x_i1 ... x_ij ... x_in ]
    [  ...      ...      ... ]
    [ x_p1 ... x_pj ... x_pn ]

Matrix multiplication: Pierce, Rod. (6 Oct 2014). "How to Multiply Matrices". Math Is Fun. http://www.mathsisfun.com/algebra/matrix-multiplying.html

Mapping data to a lower-dimensional space by linear combination: seek a factorization of the data matrix X,

X ≈ B R, with X (p x n), B (p x k), R (k x n),

where the k columns of the matrix B are basis vectors, and the n columns of the matrix R are representations of the n columns of X defined on B; k <= min(p, n). Both B and R contain useful information! Or, in other words, find a projection matrix P that maps X to a lower-dimensional space:

R = P X, with R (k x n), P (k x p), X (p x n).

Different methods place different constraints on B (or P) and use different criteria for approximating X.
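As an illustration (not from the slides), here is a minimal numpy sketch of the shapes involved, using a truncated SVD as one possible way to choose B and the projection P; the data here are random placeholders.

```python
import numpy as np

# Hypothetical toy data: p = 50 features, n = 20 samples (columns are samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))

k = 3  # target dimensionality
# One possible factorization X ~= B @ R: take B from a truncated SVD of X.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
B = U[:, :k]                 # p x k basis vectors
P = B.T                      # k x p projection matrix
R = P @ X                    # k x n low-dimensional representation
X_hat = B @ R                # rank-k approximation of X

print(B.shape, R.shape, X_hat.shape)  # (50, 3) (3, 20) (50, 20)
```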

Principal Component Analysis (PCA): an intuitive approach to the dimension reduction problem. Project the data matrix X to a new space, such that: axes of the new coordinate system are orthogonal and of the same scale as in the original space; projected values preserve the variance of the original data; projected vectors are uncorrelated. In other words, we want to rotate the original axes and align them with the directions of largest variance.

A two-dimensional example of PCA ISLR, chapter 6

PCA reveals redundancy in the data Shlens, 2003

Mathematically: find a projection of X as its representation, Z = P X, so that X = P^{-1} Z. Columns of P^{-1} are orthogonal and of unit length. Z captures the variance of X, in the sense that the sum over i = 1..p of (x_i - μ_i)(x_i - μ_i)^T equals the sum over i = 1..p of (z_i - μ*_i)(z_i - μ*_i)^T, where x_i and z_i are rows of X and Z. Or, in matrix form: tr(Σ(X)) = tr(Σ(Z)), with Σ(Z) = diag(λ_1, λ_2, ..., λ_p), where Σ(X) is the covariance matrix of X. The values of λ are as large as possible, and the rows of Z are the uncorrelated principal components of X.

Solution of Principal Components: it turns out that the columns of P^{-1} are unit-length eigenvectors of Σ(X), and λ_1, λ_2, ..., λ_p are the corresponding eigenvalues. Moreover P^{-1} = P^T, so P P^T = I_p. The ith principal component is z_i = p_i X, where p_i is the ith row of P, and Var(z_i) = λ_i, so z_i accounts for λ_i / Σ_j λ_j of the total variance in X. Ordering λ_1 ≥ λ_2 ≥ ... ≥ λ_p, the first k PCs with significantly large λ values will capture the majority of the variability in X.
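A minimal numpy sketch of this eigendecomposition route, on toy random data (my own example, not from the slides); the covariance here is scaled by 1/n, which changes the eigenvalues only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 100))          # p = 5 features, n = 100 samples (columns)
Xc = X - X.mean(axis=1, keepdims=True) # row-center each feature

# Sample covariance (scaled by n; proportional to Sigma(X) as defined above).
S = Xc @ Xc.T / Xc.shape[1]

# Eigendecomposition: columns of V are unit eigenvectors, lam the variances.
lam, V = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]          # sort eigenvalues in decreasing order
lam, V = lam[order], V[:, order]

P = V.T                                # the i-th row of P gives the i-th PC direction
Z = P @ Xc                             # principal component scores
print(lam / lam.sum())                 # proportion of variance per component
```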

Singular Value Decomposition (SVD): PCA is closely related to SVD. Let A be a p x n matrix of real numbers; then there exist a p x p orthogonal matrix U and an n x n orthogonal matrix V such that

A = U D V^T, with A (p x n), U (p x p), D (p x n), V^T (n x n),

where the d_ii ≥ 0 on the diagonal of D are the singular values of A, the columns of U are p orthogonal eigenvectors of A A^T, and the columns of V are n orthogonal eigenvectors of A^T A. The non-zero d_ii^2 are exactly the non-zero eigenvalues shared by A A^T and A^T A.

Obtain PCs by SVD: if the data matrix X is row-centered, then the sample covariance matrix is S = Σ(X) = X X^T. (Note: we can always row-center X by right-multiplying by the centering matrix I_n - (1/n) 1 1^T.) After factorizing X by SVD, S = U D V^T V D^T U^T = U D^2 U^T, so tr(S) = tr(D^2) and D^2 = Λ. The columns of U are unit-length and mutually orthogonal, D^2 is diagonal, and its elements are the eigenvalues of X X^T; U and D^2 are exactly what we want for P^T and Σ(Z).
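A short numpy sketch of the SVD route on toy data (an assumed example); it checks numerically that the squared singular values are the eigenvalues of X X^T.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 100))
n = X.shape[1]

# Row-center by right-multiplying with the centering matrix I_n - (1/n) 11^T.
C = np.eye(n) - np.ones((n, n)) / n
Xc = X @ C

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = np.diag(d) @ Vt                    # principal component scores, Z = D V^T

# Eigenvalues of Xc Xc^T are the squared singular values.
lam = d ** 2
print(np.allclose(Xc @ Xc.T, U @ np.diag(lam) @ U.T))  # True
```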

Interpretation of PCA: apply PCA on X with SVD, X = U D V^T = U Z, with Z = D V^T. Elements of U are called loadings; the columns of U (eigenvectors of X X^T) are loading vectors that define the rotation directions of the original axes. Elements of Z are called scores; each column of Z is a representation of the corresponding column of X in the new space defined by U, and each row of Z is a principal component. If the first k (k < p) columns of U capture a sufficiently large portion of the variance of X, X can be reduced to a lower-dimensional space: X ≈ U_k Z_k.

PCA as an exploratory analysis PCA is often a pre-processing step before clustering or regression analysis ISLR, chapter 10

How Many PCs? X can be reduced to a k-dimensional space by only looking at the first k components that explain a large proportion of the variance. Find the elbow on the scree plot. ISLR, chapter 10
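A small scikit-learn sketch (toy data, not from the slides) of how one might inspect explained-variance ratios and pick k; the 90% threshold is just an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
samples = rng.normal(size=(200, 10))   # scikit-learn expects samples in rows

pca = PCA()
pca.fit(samples)

# Cumulative proportion of variance explained; look for the "elbow".
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.90)) + 1   # e.g. keep enough PCs for ~90% variance
print(pca.explained_variance_ratio_, k)
```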

To standardize or not to standardize PCA can be performed on the sample covariance matrix S or the sample correlation matrix R. The latter is equivalent to performing PCA on the standardized data (each of the p feature measurements is normalized to have mean 0 and sd 1). If the observed features are of different units, it is recommended to standardize the data.
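A brief sketch, assuming two synthetic features on very different scales, showing why standardization matters; the scales and sample size are arbitrary choices of mine.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Two features on very different scales (e.g. different measurement units).
samples = np.column_stack([rng.normal(scale=100.0, size=300),
                           rng.normal(scale=1.0, size=300)])

raw = PCA().fit(samples)
std = PCA().fit(StandardScaler().fit_transform(samples))

# On raw data the large-scale feature dominates PC1; after standardization it does not.
print(raw.explained_variance_ratio_)
print(std.explained_variance_ratio_)
```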

Effects of standardization ISLR, chapter 10

A use case in a genomics study: apply PCA to the correlation matrix of different probe sets (gene expression features), and extract transcriptional components (eigenvector loadings) and activity measurements (principal component scores).

Limitations of PCA: PCA is a linear method, so it may not be productive if the structure in the features is non-linear. Extensions such as kernel PCA may improve the analysis. http://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html

Limitations of PCA (cont'd): PCA captures the variance of the data, but variance may not be the informative quantity. Projection methods with different constraints may be more desirable. http://gael-varoquaux.info/science/ica_vs_pca.html

Independent Component Analysis (ICA): look for a linear transformation that results in statistically independent, non-Gaussian signals: X = A S, S = A^{-1} X, where A is the mixing matrix and its columns are latent vectors. The rows of S are statistically independent signal sources with unit variance. ICA is based on higher-order statistics rather than variance, and it imposes stronger constraints than PCA (independence in addition to lack of correlation).

Formulating ICA: in information theory, dependence among random variables is measured by mutual information,

I(y_1, y_2, ..., y_m) = Σ_{i=1..m} H(y_i) - H(y),

where H(y_i) is the differential entropy of the individual y_i and H(y) is the differential entropy of the joint distribution. If y_1, y_2, ..., y_m are uncorrelated, mutual information can be expressed in terms of negentropy J(y):

I(y_1, y_2, ..., y_m) = J(y) - Σ_{i=1..m} J(y_i),

where J(y) measures the distance of y from normality: J(y) = H(y_gauss) - H(y).

Formulating ICA: negentropy J(·) is invariant under invertible linear transformations. Therefore, searching for a linear transformation of X such that Z = W X has minimal mutual information is equivalent to finding a W such that the negentropy of each row of Z is maximized, under the constraint that the rows of Z are uncorrelated:

I(z_1, z_2, ..., z_m) = J(z) - Σ_{i=1..m} J(z_i) = J(Wx) - Σ_{i=1..m} J(z_i).

Formulating ICA: negentropy J(z_i) is not easy to compute, but it can be approximated using a contrast function G(·):

J(z) ≈ [E{G(z)} - E{G(ν)}]^2,

where G(·) is a non-quadratic function and ν is a standardized Gaussian random variable. Z and W can be found by maximizing the approximation of each J(z_i); they are estimates of S and A^{-1}. Different algorithms use different contrast functions; some common choices are

G_1(y) = (1/a_1) log(cosh(a_1 y)), G_2(y) = -(1/a_2) exp(-a_2 y^2 / 2), G_3(y) = (1/4) y^4.
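In practice these approximations are implemented in algorithms such as FastICA. A minimal scikit-learn sketch on simulated mixtures (my own toy sources; note that scikit-learn stores samples in rows, so the model reads X = S A^T rather than the X = A S convention used above):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
n = 2000
# Two hypothetical non-Gaussian sources: a uniform signal and a Laplacian one.
S = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])
A = np.array([[1.0, 0.5], [0.4, 1.0]])       # mixing matrix
X = S @ A.T                                  # observed mixtures (samples in rows)

ica = FastICA(n_components=2, fun='logcosh', random_state=0)  # log-cosh contrast
S_hat = ica.fit_transform(X)                 # estimated sources (up to order/scale/sign)
A_hat = ica.mixing_                          # estimated mixing matrix
```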

ICA finds non-Gaussian directions http://gael-varoquaux.info/science/ica_vs_pca.html

Application of ICA on expression data Kong et al., Biotechniques, 2008

Limitations of ICA: no clear selection criterion for the components; sensitive to initial values; does not work well if the underlying distributions are close to Gaussian.

Non-negative Matrix Factorization (NMF): given a nonnegative matrix X, find nonnegative matrices W and H such that

X ≈ W H, with X (p x n), W (p x k), H (k x n).

Columns of W are basis vectors, and columns of H are the coefficients for each sample. The rank k has to be determined based on heuristics.
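A minimal scikit-learn sketch on synthetic nonnegative data; note that scikit-learn stores samples in rows, so its factor matrices are transposed relative to the p x n convention above, and the variable names and rank here are my own illustrative choices.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(6)
# Hypothetical nonnegative data, e.g. expression levels: 500 samples x 50 features.
X = rng.gamma(shape=2.0, scale=1.0, size=(500, 50))

k = 4                                        # rank chosen by heuristics (see later slides)
nmf = NMF(n_components=k, init='nndsvda', max_iter=500, random_state=0)
coeffs = nmf.fit_transform(X)                # 500 x k per-sample coefficients (slide's H, transposed)
basis = nmf.components_                      # k x 50 nonnegative basis vectors (slide's W, transposed)
X_hat = coeffs @ basis                       # rank-k nonnegative approximation of X
```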

Comparison between PCA and NMF: PCA extracts distributed representations along orthogonal directions, whereas NMF learns additive combinations of parts. Lee & Seung, 1999, Nature

Geometric interpretation of NMF: the factorization X = W H has a geometric counterpart: all data points x lie in the convex cone generated by the column vectors of W, which is embedded in the first orthant of R^p. In fact, if all the data points are strictly positive, there are many such cones, and the factorization is not unique. Geometrically, the NMF method seeks to approximate the conic hull of X by a low-dimensional cone in the first orthant. Data points that lie outside the cone cannot be reconstructed exactly.

Geometric interpretation of NMF Bauckhage, 16th LWA Workshops, 2014

NMF of gene expression data Brunet et al., 2004, PNAS

Determine the rank of factorization Cophenetic Correlation Brunet et al., 2004, PNAS

NMF is widely used in classification of cancer samples Tamayo et al., 2007, PNAS

Limitations of NMF: the solution is usually not unique; sensitive to initial values; high computational complexity; projection of new data can be problematic (W is not always invertible, and a new H may contain negative values).

What to use? PCA is the most common method, but its interpretation may not be clear. ICA extracts non-Gaussian information from the data, but it may not be clear which components are more important. NMF produces additive features, but it is computationally heavy and the solution can be unstable. Zinovyev et al., Biochem Biophys Res Commun, 2013

Nonlinear extensions: kernel PCA. In PCA, X is projected to Z = D V^T, where the columns of V are eigenvectors of X^T X. Kernel PCA instead seeks a representation of X from the eigenvalues and eigenvectors of a centered kernel matrix K* = K - 1_N K - K 1_N + 1_N K 1_N, where K_ij = k(x_i, x_j) = Φ(x_i)^T Φ(x_j) and 1_N is the N x N matrix with all entries equal to 1/N. Schölkopf et al., ICANN '97

The kernel trick: the trick is that Φ(·) can be a very complex function while k(·) remains relatively simple. Kernel PCA projects the data into a nonlinear feature space without explicitly calculating each projection direction. A popular kernel function is the Gaussian Radial Basis Function (RBF): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)). http://sebastianraschka.com/articles/2014_kernel_pca.html
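A short scikit-learn sketch on the classic two-circles toy data (my own choice of example, in the spirit of the scikit-learn demo linked earlier), comparing RBF kernel PCA with linear PCA; here gamma plays the role of 1/(2σ²).

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric circles: a linearly inseparable toy dataset.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# RBF kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10.0)
Z_kpca = kpca.fit_transform(X)                 # circles become (nearly) linearly separable
Z_pca = PCA(n_components=2).fit_transform(X)   # linear PCA only rotates the circles
```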

Dimension reduction with neural networks: a feed-forward neural network with a non-linear activation function can approximate any function. An autoencoder network trained to reconstruct its input may represent the data in a lower-dimensional latent space. http://blog.fastforwardlabs.com/2016/08/12/introducing-variational-autoencoders-in-prose-and.html
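A minimal PyTorch sketch of such an autoencoder; the architecture, layer sizes, and dummy data are illustrative assumptions, not a prescribed model from the course.

```python
import torch
from torch import nn

# Sketch: 784-dimensional input (e.g. flattened MNIST digits) compressed to a 2-D code.
class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # low-dimensional latent representation
        return self.decoder(z)       # reconstruction of the input

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)              # dummy batch standing in for real data
for _ in range(10):                  # training loop sketch
    opt.zero_grad()
    loss = loss_fn(model(x), x)      # reconstruction loss
    loss.backward()
    opt.step()
```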

Visualizing high-dimensional data: Multidimensional Scaling (MDS) and t-distributed Stochastic Neighbor Embedding (t-SNE).

MDS: Multidimensional Scaling (MDS) maps high-dimensional data to a low-dimensional space. It preserves the pairwise distances (or dissimilarities) between the original data points. This can be done in a linear, nonlinear, or non-metric manner.

Metric MDS: in the case of classical MDS, it is equivalent to PCA. The stress function is

S(z_1, z_2, ..., z_n) = Σ_{i<j} (d_ij - ||z_i - z_j||)^2.

Nonlinear MDS such as Sammon mapping emphasizes preserving the nearby points:

S_NL(z_1, z_2, ..., z_n) = (1 / Σ_{i<j} d_ij) Σ_{i<j} (d_ij - ||z_i - z_j||)^2 / d_ij.

Andrewman327/Wikipedia Commons

Non-metric MDS: use an increasing function θ of the original distances (preserving their ranks):

S_NM(z_1, z_2, ..., z_n) = Σ_{i<j} (θ(d_ij) - ||z_i - z_j||)^2 / Σ_{i<j} ||z_i - z_j||^2.

We have to optimize both the new coordinates and the function θ, which enables nonlinear transformations.
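A brief scikit-learn sketch contrasting metric and non-metric MDS on a small subsample of the digits data (my choice of example data, not from the slides):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X = X[:300]                                   # subsample to keep the example fast

# Metric MDS minimizes a stress function over pairwise Euclidean distances.
Z_metric = MDS(n_components=2, random_state=0).fit_transform(X)

# Non-metric MDS only tries to preserve the rank order of the distances.
Z_nonmetric = MDS(n_components=2, metric=False, random_state=0).fit_transform(X)
```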

t-SNE: preserve the joint probabilities of pairs of original data points. Model pairwise similarities in the original space with a Gaussian distribution, represent them in the low-dimensional space with a t-distribution, and minimize the Kullback-Leibler divergence between the two as the cost function. It is computationally expensive, so one may have to select random subsets for large data.
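A minimal scikit-learn t-SNE sketch on the digits data (illustrative choice of dataset and perplexity):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)           # 1797 samples, small enough to use as a whole

# Perplexity roughly controls the effective neighborhood size; 5-50 is a common range.
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Z.shape)                                # (1797, 2)
```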

t-SNE vs. nonlinear metric MDS Van der Maaten and Hinton, 2008

I still believe that unsupervised learning is going to be crucial, and things will work incredibly much better than they do now when we get that working properly, but we haven't yet. - Geoffrey Hinton, 2017

Homework Cluster analysis and dimension reduction on a subset of the MNIST data

Further reading
Textbook: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie T, Tibshirani R, Friedman J. Springer, 2011. Chapter 14.
PCA: Shlens J (2003). A tutorial on principal component analysis.
ICA: Hyvärinen A (1997). Independent Component Analysis by Minimization of Mutual Information. Lee SI, Batzoglou S (2003). Application of independent component analysis to microarrays. Genome Biol. 4(11):R76.
NMF: Lee DD, Seung HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature. 401(6755):788-91.
General topics: Zinovyev A, Kairov U, Karpenyuk T, Ramanculov E (2013). Blind source separation methods for deconvolution of complex signals in cancer biology. Biochem Biophys Res Commun. 430(3):1182-7.