Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395


1 Data Mining: Dimensionality reduction. Hamid Beigy, Sharif University of Technology, Fall 1395.

2 Outline
1. Introduction
2. Feature selection methods
3. Feature extraction methods: Principal component analysis, Kernel principal component analysis, Factor analysis, Locally linear embedding, Linear discriminant analysis


4 Introduction
The complexity of any classifier or regressor depends on the number of input variables or features. These complexities include:
Time complexity: In most learning algorithms, the time complexity depends on the number of input dimensions (D) as well as on the size of the training set (N). Decreasing D decreases the time complexity of the algorithm for both the training and testing phases.
Space complexity: Decreasing D also decreases the amount of memory needed for the training and testing phases.
Sample complexity: Usually the number of training examples (N) is a function of the length of the feature vectors (D). Hence, decreasing the number of features also decreases the required number of training examples. As a rule of thumb, the number of training patterns should be 10 to 20 times the number of features.

5 Introduction
There are several reasons why we are interested in reducing dimensionality as a separate preprocessing step:
Decreasing the time complexity of classifiers or regressors.
Decreasing the cost of extracting/producing unnecessary features.
Simpler models are more robust on small data sets.
Simpler models have less variance and thus are less dependent on noise and outliers.
The description of the classifier or regressor is simpler/shorter.
Visualization of the data is simpler.

6 Peaking phenomenon
In practice, for a finite number of training examples (N), increasing the number of features (l) gives an initial improvement in performance, but beyond a critical value, further increasing the number of features increases the probability of error. This is known as the peaking phenomenon. If the number of samples increases (N_2 > N_1), the peaking phenomenon occurs at a larger number of features (l_2 > l_1).

7 Dimensionality reduction methods
There are two main methods for reducing the dimensionality of inputs:
Feature selection: These methods select d (d < D) dimensions out of the D dimensions and discard the remaining D - d dimensions.
Feature extraction: These methods find a new set of d (d < D) dimensions that are combinations of the original dimensions.


9 Feature selection methods
Feature selection methods select d (d < D) dimensions out of the D dimensions and discard the remaining D - d dimensions. Reasons for performing feature selection:
Increasing the predictive accuracy of classifiers or regressors.
Removing irrelevant features.
Enhancing learning efficiency (reducing computational and storage requirements).
Reducing the cost of future data collection (making measurements on only those variables relevant for discrimination/prediction).
Reducing the complexity of the resulting classifier/regressor description (providing an improved understanding of the data and the model).
Feature selection is not necessarily required as a pre-processing step for classification/regression algorithms to perform well; several algorithms handle over-fitting through regularization or through averaging, as in ensemble methods.

10 Feature selection methods
Feature selection methods can be categorized into three categories:
Filter methods: These methods use the statistical properties of features to filter out poorly informative features.
Wrapper methods: These methods evaluate the feature subset within the classifier/regressor algorithm. They are classifier/regressor dependent and usually perform better than filter methods.
Embedded methods: These methods embed the search for the optimal subset into the classifier/regressor design. They are also classifier/regressor dependent.
Two key steps in the feature selection process:
Evaluation: An evaluation measure is a means of assessing a candidate feature subset.
Subset generation: A subset generation method is a means of generating a subset for evaluation.

11 Feature selection methods (Evaluation measures)
A large number of features are not informative: they are either irrelevant or redundant. Irrelevant features are those that do not contribute to a classification or regression rule. Redundant features are those that are strongly correlated with other features. In order to choose a good feature set, we require a measure of how much a feature contributes to the separation of classes, either individually or in the context of already selected features; that is, we need to measure relevancy and redundancy. There are two types of measures:
Measures that rely on the general properties of the data. These assess the relevancy of individual features and are used to eliminate feature redundancy. All these measures are independent of the final classifier. They are inexpensive to compute but may not detect redundancy well.
Measures that use a classification rule as part of their evaluation. In this approach, a classifier is designed using the reduced feature set and a measure of classifier performance is employed to assess the selected features. A widely used measure is the error rate.

12 Feature selection methods (Evaluation measures)
The following measures rely on the general properties of the data:
Feature ranking: Features are ranked by a metric and those that fail to achieve a prescribed score are eliminated. Examples of such metrics: Pearson correlation, mutual information, and information gain (see the sketch after this list).
Interclass distance: A measure of distance between classes is defined based on distances between members of each class. Example metric: Euclidean distance.
Probabilistic distance: The computation of a probabilistic distance between class-conditional probability density functions, i.e. the distance between p(x | C_1) and p(x | C_2) (two classes). Example metric: the Chernoff dissimilarity measure.
Probabilistic dependency: These are multi-class criteria that measure the distance between the class-conditional probability density functions and the mixture probability density function of the data irrespective of class, i.e. the distance between p(x | C_i) and p(x). Example metric: the Joshi dissimilarity measure.
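As an illustration of the filter approach, the following sketch ranks features by mutual information with the class label and keeps the top d. The data matrix `X`, the labels `y`, and the choice `d = 5` are hypothetical; the metric could equally be Pearson correlation or information gain.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, d):
    """Filter-style feature ranking: score each feature individually
    by mutual information with the class label and keep the top d."""
    scores = mutual_info_classif(X, y)      # one score per column of X
    order = np.argsort(scores)[::-1]        # best-scoring features first
    return order[:d], scores

# Hypothetical usage: X is an (N, D) data matrix, y holds class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 3] + 0.1 * rng.normal(size=200) > 0).astype(int)  # only feature 3 matters
selected, scores = rank_features(X, y, d=5)
print("selected feature indices:", selected)
```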

13 Feature selection methods (Search algorithms)
Complete search: These methods guarantee finding the optimal subset of features according to a specified evaluation criterion. Exhaustive search and branch-and-bound methods are examples of complete search.
Sequential search: In these methods, features are added or removed sequentially. These methods are not optimal, but they are simple to implement and fast to produce results.
Best individual N: The simplest method is to assign a score to each feature and then select the N top-ranked features.
Sequential forward selection: A bottom-up search procedure that adds new features to the feature set one at a time until the final feature set is reached (see the sketch below).
Generalized sequential forward selection: In this approach, at each step r > 1 features are added instead of a single feature.
Sequential backward elimination: A top-down procedure that deletes a single feature at a time until the final feature set is reached.
Generalized sequential backward elimination: In this approach, at each step r > 1 features are deleted instead of a single feature.
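A minimal sketch of sequential forward selection in the wrapper style, assuming a scikit-learn-like estimator and cross-validated accuracy as the evaluation measure. The estimator, the target subset size d, and the scoring choice are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, d, estimator=None):
    """Greedily add the feature that most improves cross-validated accuracy."""
    estimator = estimator or LogisticRegression(max_iter=1000)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < d:
        scores = {j: cross_val_score(estimator, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)   # feature whose addition scores best
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical usage on a small synthetic data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = (X[:, 0] - X[:, 4] > 0).astype(int)
print(sequential_forward_selection(X, y, d=3))
```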


15 Feature extraction
In feature extraction, we are interested in finding a new set of k (k < D) dimensions that are combinations of the original D dimensions. Feature extraction methods may be supervised or unsupervised. Examples of feature extraction methods:
Principal component analysis (PCA)
Factor analysis (FA)
Multi-dimensional scaling (MDS)
ISOMap
Locally linear embedding (LLE)
Linear discriminant analysis (LDA)

16 Principal component analysis
PCA projects D-dimensional input vectors to k-dimensional vectors via a linear mapping with minimum loss of information. The new dimensions are combinations of the original D dimensions. The problem is to find a matrix W such that the mapping $Z = W^T X$ results in the minimum loss of information.
PCA is unsupervised and tries to maximize the variance. The principal component is $w_1$ such that the sample, after projection onto $w_1$, is most spread out, so that the differences between the sample points become most apparent. For uniqueness of the solution, we require $\|w_1\| = 1$. Let $\Sigma = \mathrm{Cov}(x)$ and consider the principal component $w_1$; we have $z_1 = w_1^T x$ and
$$\mathrm{Var}(z_1) = E[(w_1^T x - w_1^T \mu)^2] = E[w_1^T (x-\mu)(x-\mu)^T w_1] = w_1^T E[(x-\mu)(x-\mu)^T] w_1 = w_1^T \Sigma w_1.$$

17 Principal component analysis (cont.)
The mapping problem becomes
$$w_1 = \arg\max_{w_1} \; w_1^T \Sigma w_1 \quad \text{subject to } w_1^T w_1 = 1.$$
Writing this as a Lagrange problem, we maximize
$$w_1^T \Sigma w_1 - \alpha (w_1^T w_1 - 1).$$
Taking the derivative with respect to $w_1$ and setting it equal to 0, we obtain
$$2\Sigma w_1 = 2\alpha w_1 \;\Rightarrow\; \Sigma w_1 = \alpha w_1.$$
Hence $w_1$ is an eigenvector of $\Sigma$ and $\alpha$ is the corresponding eigenvalue. Since we want to maximize $\mathrm{Var}(z_1)$, we have
$$\mathrm{Var}(z_1) = w_1^T \Sigma w_1 = \alpha\, w_1^T w_1 = \alpha.$$
Hence, we choose the eigenvector with the largest eigenvalue, i.e. $\lambda_1 = \alpha$.

18 Principal component analysis (cont.)
The second principal component, $w_2$, should also maximize the variance, be of unit length, and be orthogonal to $w_1$ ($z_1$ and $z_2$ must be uncorrelated). The mapping problem for the second principal component becomes
$$w_2 = \arg\max_{w_2} \; w_2^T \Sigma w_2 \quad \text{subject to } w_2^T w_2 = 1 \text{ and } w_2^T w_1 = 0.$$
Writing this as a Lagrange problem, we maximize
$$w_2^T \Sigma w_2 - \alpha (w_2^T w_2 - 1) - \beta (w_2^T w_1 - 0).$$
Taking the derivative with respect to $w_2$ and setting it equal to 0, we obtain
$$2\Sigma w_2 - 2\alpha w_2 - \beta w_1 = 0.$$
Pre-multiplying by $w_1^T$, we obtain
$$2 w_1^T \Sigma w_2 - 2\alpha\, w_1^T w_2 - \beta\, w_1^T w_1 = 0.$$
Note that $w_1^T w_2 = 0$ and that $w_1^T \Sigma w_2 = (w_1^T \Sigma w_2)^T = w_2^T \Sigma w_1$ is a scalar.

19 Principal component analysis (cont.)
Since $\Sigma w_1 = \lambda_1 w_1$, we have
$$w_1^T \Sigma w_2 = w_2^T \Sigma w_1 = \lambda_1\, w_2^T w_1 = 0.$$
Then $\beta = 0$ and the problem reduces to
$$\Sigma w_2 = \alpha w_2.$$
This implies that $w_2$ should be the eigenvector of $\Sigma$ with the second largest eigenvalue $\lambda_2 = \alpha$. Similarly, you can show that the other dimensions are given by the eigenvectors with decreasing eigenvalues.
Since $\Sigma$ is symmetric, for two different eigenvalues the corresponding eigenvectors are orthogonal. (Show it.)
If $\Sigma$ is positive definite ($x^T \Sigma x > 0$ for all non-null vectors $x$), then all its eigenvalues are positive.
If $\Sigma$ is singular, its rank is $k$ ($k < D$) and $\lambda_i = 0$ for $i = k+1, \ldots, D$.

20 Principal component analysis (cont.)
Define
$$Z = W^T (X - m),$$
where the $k$ columns of $W$ are the $k$ leading eigenvectors of $S$ (the estimator of $\Sigma$) and $m$ is the sample mean of $X$. Subtracting $m$ from $X$ before projection centers the data on the origin. How should the variances be normalized? A sketch of this procedure is given below.
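A minimal NumPy sketch of this eigendecomposition view of PCA, under the assumption that the rows of `X` are samples; the function and variable names are illustrative, not part of the slides.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k leading eigenvectors of the
    sample covariance matrix (the principal components)."""
    m = X.mean(axis=0)                      # sample mean
    Xc = X - m                              # center the data on the origin
    S = np.cov(Xc, rowvar=False)            # sample covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in decreasing order
    W = eigvecs[:, order[:k]]               # D x k matrix of leading eigenvectors
    Z = Xc @ W                              # z = W^T (x - m) for every sample
    return Z, W, m, eigvals[order]

# Hypothetical usage on random correlated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Z, W, m, lam = pca(X, k=1)
print("leading eigenvalue:", lam[0])
```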

21 Principal component analysis (cont.)
How do we select k? If all eigenvalues are positive but $|S| = \prod_{i=1}^{D} \lambda_i$ is small, then some eigenvalues have little contribution to the variance and may be discarded. The scree graph is the plot of variance (eigenvalue) as a function of the number of eigenvectors.

22 Principal component analysis (cont.)
How do we select k? We select the leading k components that explain more than, for example, 95% of the variance. The proportion of variance (PoV) is
$$\mathrm{PoV} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{D} \lambda_i}.$$
By analyzing it visually, we can choose k.
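A small sketch of choosing k from the PoV curve, reusing sorted eigenvalues such as those returned by the hypothetical `pca` sketch above; the 95% threshold is the example value from the slide.

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Return the smallest k whose leading eigenvalues explain at least
    `threshold` of the total variance (proportion of variance, PoV)."""
    eigvals = np.sort(eigvals)[::-1]            # decreasing order
    pov = np.cumsum(eigvals) / eigvals.sum()    # PoV for k = 1, ..., D
    return int(np.searchsorted(pov, threshold) + 1), pov

# Hypothetical eigenvalue spectrum.
lam = np.array([5.0, 2.0, 0.6, 0.3, 0.1])
k, pov = choose_k(lam)
print(k, pov)   # smallest k reaching 95% of the variance
```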

23 Principal component analysis (cont.)
How do we select k? Another possibility is to ignore the eigenvectors whose corresponding eigenvalues are less than the average input variance (why?). In the pre-processing phase, it is better to pre-process the data such that each dimension has zero mean and unit variance (why, and when?).
Question: Can we use the correlation matrix instead of the covariance matrix? Derive the solution for PCA.

24 Principal component analysis (cont.)
PCA is sensitive to outliers: a few points far from the center have a large effect on the variances and thus on the eigenvectors.
Question: How can we use robust estimation methods to calculate the parameters in the presence of outliers? A simple method is to discard isolated data points that are far away.
Question: When D is large, calculating, sorting, and processing S may be tedious. Is it possible to calculate the eigenvectors and eigenvalues directly from the data without explicitly calculating the covariance matrix?

25 Principal component analysis (reconstruction error)
In PCA, the input vector x is projected to the z space as
$$z = W^T (x - \mu).$$
When W is an orthogonal matrix ($W^T W = I$), it can be back-projected to the original space as
$$\hat{x} = W z + \mu,$$
where $\hat{x}$ is the reconstruction of x from its representation in the z space.
Question: Show that PCA minimizes the following reconstruction error:
$$\sum_i \|\hat{x}^i - x^i\|^2.$$
The reconstruction error depends on how many of the leading components are taken into account.
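The following sketch back-projects PCA codes and measures the reconstruction error; it builds on the hypothetical `pca` function sketched earlier and is only an illustration of the formula above.

```python
import numpy as np

def reconstruct(Z, W, m):
    """Back-project z = W^T(x - m) to x_hat = W z + m."""
    return Z @ W.T + m

def reconstruction_error(X, X_hat):
    """Total squared reconstruction error: sum_i ||x_hat_i - x_i||^2."""
    return np.sum((X_hat - X) ** 2)

# Hypothetical usage with the earlier pca sketch:
#   Z, W, m, lam = pca(X, k)
#   err = reconstruction_error(X, reconstruct(Z, W, m))
# Keeping fewer leading components increases the error.
```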

26 Kernel principal component analysis
PCA can be extended to find nonlinear directions in the data using kernel methods. Kernel PCA finds the directions of most variance in the feature space instead of the input space. Linear principal components in the feature space correspond to nonlinear directions in the input space. Using the kernel trick, all operations can be carried out in terms of the kernel function in the input space without having to transform the data into the feature space.
Let $\phi$ denote a mapping from the input space to the feature space; each point in the feature space is the image $\phi(x)$ of a point x in the input space. In the feature space, we can find the first kernel principal component $W_1$ (with $W_1^T W_1 = 1$) by solving
$$\Sigma_\phi W_1 = \lambda_1 W_1.$$
Assuming that the points are centered, the covariance matrix $\Sigma_\phi$ in the feature space equals
$$\Sigma_\phi = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)\, \phi(x_i)^T.$$

27 Kernel principal component analysis (cont.)
Plugging $\Sigma_\phi$ into $\Sigma_\phi W_1 = \lambda_1 W_1$, we obtain
$$\frac{1}{N} \left( \sum_{i=1}^{N} \phi(x_i)\, \phi(x_i)^T \right) W_1 = \lambda_1 W_1$$
$$\frac{1}{N} \sum_{i=1}^{N} \phi(x_i)\left( \phi(x_i)^T W_1 \right) = \lambda_1 W_1$$
$$\sum_{i=1}^{N} \frac{\phi(x_i)^T W_1}{N \lambda_1}\, \phi(x_i) = W_1$$
$$\sum_{i=1}^{N} c_i\, \phi(x_i) = W_1, \qquad c_i = \frac{\phi(x_i)^T W_1}{N \lambda_1} \text{ is a scalar value.}$$

28 Kernel principal component analysis (cont.)
Now substituting $\sum_{i=1}^{N} c_i \phi(x_i) = W_1$ into $\Sigma_\phi W_1 = \lambda_1 W_1$, we obtain
$$\frac{1}{N} \left( \sum_{i=1}^{N} \phi(x_i)\, \phi(x_i)^T \right) \left( \sum_{j=1}^{N} c_j\, \phi(x_j) \right) = \lambda_1 \sum_{i=1}^{N} c_i\, \phi(x_i)$$
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} c_j\, \phi(x_i)\, \phi(x_i)^T \phi(x_j) = \lambda_1 \sum_{i=1}^{N} c_i\, \phi(x_i)$$
$$\sum_{i=1}^{N} \phi(x_i) \sum_{j=1}^{N} c_j\, \phi(x_i)^T \phi(x_j) = N \lambda_1 \sum_{i=1}^{N} c_i\, \phi(x_i).$$
Replacing $\phi(x_i)^T \phi(x_j)$ by $K(x_i, x_j)$:
$$\sum_{i=1}^{N} \phi(x_i) \sum_{j=1}^{N} c_j\, K(x_i, x_j) = N \lambda_1 \sum_{i=1}^{N} c_i\, \phi(x_i).$$

29 Kernel principal component analysis (cont.)
Multiplying by $\phi(x_k)^T$, we obtain
$$\sum_{i=1}^{N} \phi(x_k)^T \phi(x_i) \sum_{j=1}^{N} c_j\, K(x_i, x_j) = N \lambda_1 \sum_{i=1}^{N} c_i\, \phi(x_k)^T \phi(x_i)$$
$$\sum_{i=1}^{N} K(x_k, x_i) \sum_{j=1}^{N} c_j\, K(x_i, x_j) = N \lambda_1 \sum_{i=1}^{N} c_i\, K(x_k, x_i).$$
By some algebraic simplification, we obtain (do it)
$$K^2 C = N \lambda_1\, K C.$$
Multiplying by $K^{-1}$, we obtain
$$K C = N \lambda_1\, C \quad\Rightarrow\quad K C = \eta_1 C.$$
The weight vector C is the eigenvector corresponding to the largest eigenvalue $\eta_1$ of the kernel matrix K.

30 Kernel principal component analysis (cont.)
Replacing $\sum_{i=1}^{N} c_i \phi(x_i) = W_1$ in the constraint $W_1^T W_1 = 1$, we obtain
$$\sum_{i=1}^{N} \sum_{j=1}^{N} c_i\, c_j\, \phi(x_i)^T \phi(x_j) = 1 \quad\Rightarrow\quad C^T K C = 1.$$
Using $K C = \eta_1 C$, we obtain
$$C^T (\eta_1 C) = 1 \quad\Rightarrow\quad \eta_1\, C^T C = 1 \quad\Rightarrow\quad \|C\|^2 = \frac{1}{\eta_1}.$$
Since C is obtained as an eigenvector of K, it will have unit norm; to ensure that $W_1$ is a unit vector, we rescale C by $1/\sqrt{\eta_1}$.

31 Kernel principal component analysis (cont.)
In general, we do not map the input space to the feature space via $\phi$, hence we cannot compute $W_1$ using $\sum_{i=1}^{N} c_i \phi(x_i) = W_1$. We can, however, project any point $\phi(x)$ onto the principal direction $W_1$:
$$W_1^T \phi(x) = \sum_{i=1}^{N} c_i\, \phi(x_i)^T \phi(x) = \sum_{i=1}^{N} c_i\, K(x_i, x).$$
When $x = x_i$ is one of the input points, we have
$$a_i = W_1^T \phi(x_i) = K_i^T C,$$
where $K_i$ is the column vector corresponding to the i-th row of K and $a_i$ is the projection in the reduced dimension. This shows that all computations are carried out using only kernel operations. A sketch of these steps is given below.
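A minimal kernel PCA sketch following these steps, with an RBF kernel and kernel-matrix centering added as assumptions (the slides simply assume centered feature-space points); the names and the bandwidth value are illustrative.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca(X, k, gamma=1.0):
    """Project the data onto the k leading kernel principal components."""
    N = X.shape[0]
    K = rbf_kernel(X, gamma)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one   # center the points in feature space
    eta, C = np.linalg.eigh(Kc)                  # eigenvectors of the kernel matrix
    order = np.argsort(eta)[::-1][:k]
    eta, C = eta[order], C[:, order]
    C = C / np.sqrt(eta)                         # rescale so each direction has unit length
    return Kc @ C                                # a_i = K_i^T C for every input point

# Hypothetical usage on a small random sample.
rng = np.random.default_rng(0)
A = kernel_pca(rng.normal(size=(100, 3)), k=2, gamma=0.5)
print(A.shape)
```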

32 Factor analysis
In PCA, from the original dimensions $x_i$ (for $i = 1, \ldots, D$), we form a new set of variables z that are linear combinations of the $x_i$:
$$Z = W^T (X - \mu).$$
In factor analysis (FA), we assume that there is a set of unobservable, latent factors $z_j$ (for $j = 1, \ldots, k$) which, acting in combination, generate x. Thus the direction is opposite to that of PCA. The goal is to characterize the dependency among the observed variables by means of a smaller number of factors.
Suppose there is a group of variables that have high correlation among themselves and low correlation with all the other variables. Then there may be a single underlying factor that gave rise to these variables. FA, like PCA, is a one-group procedure and is unsupervised. The aim is to model the data in a smaller-dimensional space without loss of information, which in FA is measured as the correlation between variables.

33 Factor analysis (cont.) [figure omitted]

34 Factor analysis (cont.)
As in PCA, we have a sample x drawn from some unknown probability density with $E[x] = \mu$ and $\mathrm{Cov}(x) = \Sigma$. We assume that the factors are unit normals, $E[z_j] = 0$, $\mathrm{Var}(z_j) = 1$, and are uncorrelated, $\mathrm{Cov}(z_i, z_j) = 0$ for $i \neq j$. To explain what is not explained by the factors, there is an added source for each input, denoted by $\epsilon_i$. It is assumed to be zero-mean, $E[\epsilon_i] = 0$, with some unknown variance, $\mathrm{Var}(\epsilon_i) = \psi_i$. These specific sources are uncorrelated among themselves, $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$, and are also uncorrelated with the factors, $\mathrm{Cov}(\epsilon_i, z_j) = 0$ for all $i, j$.
FA assumes that each input dimension $x_i$ can be written as a weighted sum of the $k < D$ factors $z_j$ plus a residual term:
$$x_i - \mu_i = v_{i1} z_1 + v_{i2} z_2 + \cdots + v_{ik} z_k + \epsilon_i,$$
or in vector-matrix form,
$$x - \mu = V z + \epsilon,$$
where V is the $D \times k$ matrix of weights, called factor loadings.

35 Factor analysis (cont.)
We assume that $\mu = 0$; we can always add $\mu$ back after projection. Given that $\mathrm{Var}(z_j) = 1$ and $\mathrm{Var}(\epsilon_i) = \psi_i$, we have
$$\mathrm{Var}(x_i) = v_{i1}^2 + v_{i2}^2 + \cdots + v_{ik}^2 + \psi_i,$$
or in vector-matrix form,
$$\Sigma = \mathrm{Var}(X) = V V^T + \Psi.$$
There are two uses of factor analysis: it can be used for knowledge extraction when we find the loadings and try to express the variables using fewer factors, and it can also be used for dimensionality reduction when $k < D$.

36 Factor analysis (cont.)
For dimensionality reduction, we need to find the factors $z_j$ from the $x_i$. We want to find loadings $w_{ji}$ such that
$$z_j = \sum_{i=1}^{D} w_{ji} x_i + \epsilon_i.$$
In vector form, for input x, this can be written as
$$z = W^T x + \epsilon.$$
For N inputs, the problem reduces to linear regression and its solution is
$$W = (X^T X)^{-1} X^T Z.$$
We do not know Z; multiplying and dividing both sides by $N-1$, we obtain (derive it!)
$$W = \left( \frac{X^T X}{N-1} \right)^{-1} \frac{X^T Z}{N-1} = S^{-1} V, \qquad Z = X W = X S^{-1} V.$$
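A brief sketch of this dimensionality-reduction step: given estimated loadings V and the sample covariance S, the factor scores are Z = X S^{-1} V. The loadings here come from scikit-learn's FactorAnalysis purely as a stand-in estimator; the slides do not prescribe how V is fitted, and the data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Hypothetical data generated from 2 latent factors in 6 observed dimensions.
Z_true = rng.normal(size=(300, 2))
V_true = rng.normal(size=(6, 2))
X = Z_true @ V_true.T + 0.1 * rng.normal(size=(300, 6))
X = X - X.mean(axis=0)                       # assume mu = 0, as on the slide

fa = FactorAnalysis(n_components=2).fit(X)   # stand-in estimator for the loadings
V = fa.components_.T                         # D x k loading matrix
S = np.cov(X, rowvar=False)                  # sample covariance matrix
Z = X @ np.linalg.inv(S) @ V                 # factor scores: Z = X S^{-1} V
print(Z.shape)
```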

37 Locally Linear Embedding
Locally linear embedding (LLE) recovers global nonlinear structure from locally linear fits. The idea is that each point can be approximated as a weighted sum of its neighbors. The neighbors are defined either by a given number of neighbors (n) or by a distance threshold ($\epsilon$).
Let $x^r$ be an example in the input space and let $x_s^{(r)}$ be its neighbors. We find weights that minimize the following objective function:
$$E[W \mid X] = \sum_{r=1}^{N} \left\| x^r - \sum_s w_{rs}\, x_s^{(r)} \right\|^2.$$
The idea in LLE is that the reconstruction weights $w_{rs}$ reflect the intrinsic geometric properties of the data, and these properties remain valid in the new space. The first step of LLE is to find the $w_{rs}$ that minimize the above objective function subject to $\sum_s w_{rs} = 1$.

38 Locally Linear Embedding (cont.)
The second step of LLE is to keep the $w_{rs}$ fixed and construct the new coordinates Y so as to minimize the following objective function, subject to the constraints $\mathrm{Cov}(Y) = I$ and $E[Y] = 0$:
$$E[Y \mid W] = \sum_{r=1}^{N} \left\| y^r - \sum_s w_{rs}\, y_s^{(r)} \right\|^2.$$
The weights are found by solving a least-squares problem subject to two constraints: $w_{rs} = 0$ if $x_s$ is not a neighbor of $x^r$, and the rows of the weight matrix sum to one. The constrained weights that minimize the reconstruction errors are invariant to rotations, rescalings, and translations of each data point and its neighbors; they therefore characterize intrinsic geometric properties of each neighborhood rather than properties tied to a particular frame of reference. If the data lie on or near a smooth nonlinear manifold of lower dimensionality $d \ll D$, the same weights that reconstruct a point in the high-dimensional input space are expected to remain valid for local patches on the manifold.
S. T. Roweis and L. K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, Vol. 290, Dec. 2000.
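A minimal sketch of LLE using scikit-learn's implementation of the Roweis-Saul algorithm, applied to a hypothetical Swiss-roll sample; the neighbor count and target dimension are illustrative choices.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Hypothetical data: a 3-D Swiss roll whose intrinsic dimensionality is 2.
X, color = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Step 1 (reconstruction weights w_rs) and step 2 (embedding coordinates Y)
# are both carried out inside fit_transform.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)
print(Y.shape, lle.reconstruction_error_)
```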

39 Linear discriminant analysis
Linear discriminant analysis (LDA) is a supervised method of dimensionality reduction for classification problems. Consider first the case of two classes, and suppose we take the D-dimensional input vector x and project it down to one dimension using
$$z = W^T x.$$
If we place a threshold on z and classify $z \geq -w_0$ as class $C_1$, and otherwise as class $C_2$, then we obtain our standard linear classifier. In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may become strongly overlapping in one dimension.

40 Linear discriminant analysis (cont.)
However, by adjusting the components of the weight vector W, we can select a projection that maximizes the class separation. Consider a two-class problem in which there are $N_1$ points of class $C_1$ and $N_2$ points of class $C_2$, so that the mean vectors of the two classes are given by
$$m_j = \frac{1}{N_j} \sum_{i \in C_j} x_i.$$

41 Linear discriminant analysis (cont.)
The simplest measure of the separation of the classes, when projected onto W, is the separation of the projected class means. This suggests that we might choose W so as to maximize
$$m_2 - m_1 = W^T (m_2 - m_1), \qquad \text{where } m_j = W^T m_j \text{ denotes the projected mean.}$$
This expression can be made arbitrarily large simply by increasing the magnitude of W. To solve this problem, we constrain W to have unit length, so that $\sum_i w_i^2 = 1$. Using a Lagrange multiplier to perform the constrained maximization, we then find that
$$W \propto (m_2 - m_1).$$
[Figure: the axis joining the class means has the larger distance between projected means, but another axis yields better class separability.]

42 Linear discriminant analysis (cont.)
This approach has a problem: the figure shows two classes that are well separated in the original two-dimensional space but that have considerable overlap when projected onto the line joining their means. This difficulty arises from the strongly non-diagonal covariances of the class distributions. The idea proposed by Fisher is to maximize a function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.

43 Linear discriminant analysis (cont.)
The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The projection $z = W^T x$ transforms the set of labeled data points in x into a labeled set in the one-dimensional space z. The within-class variance of the transformed data from class $C_j$ equals
$$s_j^2 = \sum_{i \in C_j} (z_i - m_j)^2, \qquad z_i = W^T x_i.$$
We define the total within-class variance for the whole data set to be $s_1^2 + s_2^2$. The Fisher criterion is defined as the ratio of the between-class variance to the within-class variance:
$$J(W) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}.$$

44 Linear discriminant analysis (cont.)
The between-class covariance matrix equals
$$S_B = (m_2 - m_1)(m_2 - m_1)^T.$$
The total within-class covariance matrix equals
$$S_W = \sum_{i \in C_1} (x_i - m_1)(x_i - m_1)^T + \sum_{i \in C_2} (x_i - m_2)(x_i - m_2)^T.$$
J(W) can then be written as
$$J(W) = \frac{W^T S_B W}{W^T S_W W}.$$
To maximize J(W), we differentiate with respect to W and obtain
$$(W^T S_B W)\, S_W W = (W^T S_W W)\, S_B W.$$
$S_B W$ is always in the direction of $(m_2 - m_1)$ (show it!), and we can drop the scalar factors $W^T S_B W$ and $W^T S_W W$ (why?). Multiplying both sides by $S_W^{-1}$, we then obtain (show it!)
$$W \propto S_W^{-1} (m_2 - m_1).$$
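A short sketch computing Fisher's discriminant direction $W \propto S_W^{-1}(m_2 - m_1)$ for two classes; the data, labels, and covariance values are hypothetical.

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher's linear discriminant for two classes:
    W is proportional to S_W^{-1} (m2 - m1)."""
    X1, X2 = X[y == 0], X[y == 1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    W = np.linalg.solve(S_W, m2 - m1)                          # S_W^{-1} (m2 - m1)
    return W / np.linalg.norm(W)

# Hypothetical two-class data with strongly non-diagonal covariance.
rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])
X = np.vstack([rng.multivariate_normal([0, 0], cov, 100),
               rng.multivariate_normal([2, 2], cov, 100)])
y = np.repeat([0, 1], 100)
W = fisher_direction(X, y)
z = X @ W            # one-dimensional projection z = W^T x
print(W, z.shape)
```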

45 Linear discriminant analysis (cont.)
The result $W \propto S_W^{-1} (m_2 - m_1)$ is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projecting the data down to one dimension. The projected data can subsequently be used to construct a discriminant function.
The above idea can be extended to multiple classes (read the corresponding section of Bishop). How can Fisher's linear discriminant be used for dimensionality reduction? (Show it!)
