Statistics 202: Data Mining. © Jonathan Taylor. Week 2. Based in part on slides from the textbook and slides of Susan Holmes. October 3, 2012.


Part I: Other datatypes, preprocessing

Other datatypes. Document data. You might start with a collection of n documents (e.g. all of Shakespeare's works in some digital format; all of the Wikileaks U.S. Embassy Cables). This is not a data matrix... Given p terms of interest {Al Qaeda, Iran, Iraq, etc.}, one can form a term-document matrix filled with counts.

Other datatypes. Document data.

                 Al Qaeda   Iran   Iraq   ...
    Document 1   0          0      3      ...
    ...
    Document 250000   2     1      1      ...
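As an illustration, here is a minimal Python sketch of building such a term-document count matrix. The toy corpus and term list below are made up for illustration, and the substring counting is deliberately naive.

```python
import numpy as np

# Hypothetical toy corpus and hand-picked term list; in practice the documents
# would be read from files and the terms chosen beforehand.
documents = [
    "iraq and iran were mentioned in the cable",
    "the cable discussed al qaeda and iraq",
]
terms = ["al qaeda", "iran", "iraq"]

# Count occurrences of each term in each document: rows = documents, columns = terms.
# str.count does naive substring matching; a real pipeline would tokenize first.
counts = np.array([[doc.count(term) for term in terms] for doc in documents])

print(counts)          # the n x p term-document count matrix
# [[0 1 1]
#  [1 0 1]]
```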

Other datatypes. Time series. Imagine recording the minute-by-minute prices of all stocks in the S&P 500 for the last 200 days of trading. The data can be represented by a 78000 × 500 matrix. BUT, there is definite structure across the rows of this matrix: they are not unrelated cases like they might be in other applications.

Transaction data

Social media data

Graph data

Other datatypes. Graph data. Nodes on the graph might be Facebook users, or public pages. Weights on the edges could be the number of messages sent in a prespecified period. If you take weekly intervals, this leads to a sequence of graphs $G_i$ = communication over the $i$-th week. How this graph changes is of obvious interest... Even the structure of just one graph is of interest; we'll come back to this when we talk about spectral methods...

Data quality. Some issues to keep in mind: Is the data experimental or observational? If observational, what do we know about the data generating mechanism? For example, although the S&P 500 example can be represented as a data matrix, there is clearly structure across rows. General quality issues: How much of the data is missing? Is missingness informative? Is it very noisy? Are there outliers? Are there a large number of duplicates?

Preprocessing. General procedures. Aggregation: combining features into a new feature. Example: pooling county-level data to state-level data. Discretization: breaking up a continuous variable (or a set of continuous variables) into an ordinal (or nominal) discrete variable. Transformation: a simple transformation of a feature (log or exp), or a mapping to a new space (Fourier transform / power spectrum, wavelet transform).

A continuous variable that could be discretized

Discretization by fixed width

Discretization by fixed quartile

Discretization by clustering
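The three discretization schemes pictured above can be sketched in a few lines of numpy/scipy; this is an illustrative sketch on synthetic data, not the code used to produce the figures.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
x = rng.normal(size=500)          # a continuous variable, roughly N(0, 1)
k = 4                             # number of discrete levels

# Fixed width: cut the observed range into k equal-width bins.
width_edges = np.linspace(x.min(), x.max(), k + 1)
width_labels = np.digitize(x, width_edges[1:-1])

# Fixed quantile: cut at empirical quantiles so bins hold (roughly) equal counts.
quant_edges = np.quantile(x, np.linspace(0, 1, k + 1))
quant_labels = np.digitize(x, quant_edges[1:-1])

# Clustering: 1-D k-means; bin boundaries follow the shape of the data.
_, cluster_labels = kmeans2(x.reshape(-1, 1), k, minit="++")

print(np.bincount(width_labels), np.bincount(quant_labels), np.bincount(cluster_labels))
```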

Variable transformation: bacterial decay

Variable transformation: bacterial decay

BMW daily returns (fEcofin package)

ACF of BMW daily returns

Discrete wavelet transform of BMW daily returns

Part II: Dimension reduction, PCA & eigenanalysis

Dimension reduction. Combinations of features. Given a data matrix $X_{n \times p}$ with $p$ fairly large, it can be difficult to visualize structure. Often useful to look at linear combinations of the features. Each $\beta \in \mathbb{R}^p$ determines a linear rule $f_\beta(x) = x^T \beta$. Evaluating this on each row $X_i$ of $X$ yields the vector $(f_\beta(X_i))_{1 \le i \le n} = X\beta$.

Dimension reduction. Dimension reduction. By choosing $\beta$ appropriately, we may find interesting new features. Suppose we take $k \ll p$ vectors $\beta$, which we write as a matrix $B_{p \times k}$. The new data matrix $XB$ has fewer columns than $X$ ($k$ instead of $p$). This is dimension reduction...
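As a minimal numpy sketch (with a synthetic data matrix, and random directions standing in for a carefully chosen $B$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 20, 3

X = rng.normal(size=(n, p))          # data matrix: n cases by p features
B = rng.normal(size=(p, k))          # k candidate directions (columns are betas)

XB = X @ B                           # reduced data matrix: n cases by k new features
print(X.shape, "->", XB.shape)       # (100, 20) -> (100, 3)
```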

Dimension reduction. Principal components. This method of dimension reduction seeks features that maximize sample variance... Specifically, for $\beta \in \mathbb{R}^p$, define $V_\beta = X\beta \in \mathbb{R}^n$ and its sample variance

$$\mathrm{Var}(V_\beta) = \frac{1}{n-1} \sum_{i=1}^n (V_{\beta,i} - \bar V_\beta)^2, \qquad \bar V_\beta = \frac{1}{n} \sum_{i=1}^n V_{\beta,i}.$$

Note that $\mathrm{Var}(V_{C\beta}) = C^2\, \mathrm{Var}(V_\beta)$, so we usually standardize $\|\beta\|_2^2 = \sum_{i=1}^p \beta_i^2 = 1$.

Dimension reduction. Principal components. Define the $n \times n$ matrix $H = I_{n\times n} - \frac{1}{n} 1 1^T$. This matrix removes means: $(Hv)_i = v_i - \bar v$. It is also a projection matrix: $H^T = H$, $H^2 = H$.
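A quick numpy check of these properties of $H$, on a small made-up vector:

```python
import numpy as np

n = 5
H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I - (1/n) 11^T

v = np.arange(1.0, 6.0)                    # [1, 2, 3, 4, 5]
print(H @ v)                               # v with its mean (3) removed: [-2, -1, 0, 1, 2]
print(np.allclose(H, H.T), np.allclose(H @ H, H))   # True True: a symmetric projection
```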

Dimension reduction. Principal components with matrices. With this matrix,

$$\mathrm{Var}(V_\beta) = \frac{1}{n-1}\, \beta^T X^T H X \beta.$$

So, maximizing the sample variance subject to $\|\beta\|_2 = 1$ is the problem

$$\text{maximize}_{\|\beta\|_2 = 1}\ \beta^T X^T H X \beta. \qquad (*)$$

This boils down to an eigenvalue problem...

Dimension reduction. Eigenanalysis. The matrix $X^T H X$ is symmetric, so it can be written as

$$X^T H X = V D V^T$$

where $D_{k\times k} = \mathrm{diag}(d_1, \dots, d_k)$ with $k = \mathrm{rank}(X^T H X)$, and $V_{p\times k}$ has orthonormal columns, i.e. $V^T V = I_{k\times k}$. We always have $d_i \ge 0$, and we take $d_1 \ge d_2 \ge \dots \ge d_k$.

Dimension reduction. Eigenanalysis & PCA. Suppose now that $\beta = \sum_{j=1}^k a_j v_j$ with $v_j$ the columns of $V$. Then

$$\|\beta\|^2 = \sum_{i=1}^k a_i^2, \qquad \beta^T X^T H X \beta = \sum_{i=1}^k a_i^2 d_i.$$

Choosing $a_1 = 1$ and $a_j = 0$ for $j \ge 2$ maximizes this quantity.

Dimension reduction. Eigenanalysis & PCA. Therefore $\beta_1 = v_1$, the first column of $V$, solves $(*)$:

$$\text{maximize}_{\|\beta\|_2 = 1}\ \beta^T X^T H X \beta.$$

This yields the scores $H X \beta_1 \in \mathbb{R}^n$. These are the 1st principal component scores.

Dimension reduction. Higher order components. Having found the direction with maximal sample variance, we might look for the second most variable direction by solving

$$\text{maximize}_{\|\beta\|_2 = 1,\ \beta^T v_1 = 0}\ \beta^T X^T H X \beta.$$

Note that we restricted the search so we would not just recover $v_1$ again. It is not hard to see that if $\beta = \sum_{j=1}^k a_j v_j$, this is solved by taking $a_2 = 1$ and $a_j = 0$ for $j \ne 2$.

Dimension reduction. Higher order components. In matrix terms, all of the principal component scores are $(H X V)_{n\times k}$, and the loadings are the columns of $V$. This information can be summarized in a biplot. The loadings describe how each feature contributes to each principal component score.
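Putting the last few slides together, here is a hedged numpy sketch that computes the loadings (columns of $V$) and the scores $HXV$ from an eigenanalysis of $X^T H X$; the data are synthetic and no standardization is applied yet:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # synthetic, correlated features

H = np.eye(n) - np.ones((n, n)) / n                      # centering matrix
C = X.T @ H @ X                                          # p x p, proportional to the covariance

d, V = np.linalg.eigh(C)                                 # eigenanalysis of a symmetric matrix
order = np.argsort(d)[::-1]                              # sort so that d1 >= d2 >= ...
d, V = d[order], V[:, order]

scores = H @ X @ V                                       # n x p matrix of PC scores
# the sample variance of the j-th score column is d_j / (n - 1)
print(np.allclose(np.var(scores, axis=0, ddof=1), d / (n - 1)))   # True
```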

Olympic data

Olympic data: screeplot

Dimension reduction. The importance of scale. The PCA scores are not invariant to scaling of the features: for a diagonal $Q_{p\times p}$, the PCA scores of $XQ$ are not the same as those of $X$. It is common to convert all variables to the same scale before applying PCA. Define the scalings to be the sample standard deviations of each feature. In matrix form, let

$$S^2 = \frac{1}{n-1}\,\mathrm{diag}\left(X^T H X\right).$$

Define $\tilde X = H X S^{-1}$. The normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$.

Olympic data

Olympic data: screeplot

Dimension reduction. PCA and the SVD. The singular value decomposition of a matrix tells us we can write

$$\tilde X = U \Delta V^T$$

with $\Delta_{k\times k} = \mathrm{diag}(\delta_1, \dots, \delta_k)$, $\delta_j \ge 0$, $k = \mathrm{rank}(\tilde X)$, and $U^T U = V^T V = I_{k\times k}$. Recall that the scores were $\tilde X V = (U\Delta V^T)V = U\Delta$. Also, $\tilde X^T \tilde X = V \Delta^2 V^T$, so $D = \Delta^2$.

Dimension reduction. PCA and the SVD. Recall $\tilde X = H X S^{-1}$. We saw that the normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$. It turns out that, via the SVD, the normalized PCA scores are given by an eigenanalysis of $\tilde X \tilde X^T$, because

$$(\tilde X V)_{n\times k} = U\Delta,$$

where the columns of $U$ are eigenvectors of $\tilde X \tilde X^T = U \Delta^2 U^T$.
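A small numpy check of this correspondence on synthetic standardized data (a sketch; the Olympic data are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 10.0, 0.1, 5.0])   # features on very different scales

H = np.eye(n) - np.ones((n, n)) / n
S = np.sqrt(np.diag(X.T @ H @ X) / (n - 1))       # sample standard deviations
Xt = (H @ X) / S                                  # standardized matrix  X~ = H X S^{-1}

U, delta, Vt = np.linalg.svd(Xt, full_matrices=False)

d = np.sort(np.linalg.eigvalsh(Xt.T @ Xt))[::-1]  # eigenvalues of X~^T X~, largest first

print(np.allclose(delta ** 2, d))                 # squared singular values = eigenvalues D
print(np.allclose(Xt @ Vt.T, U * delta))          # scores X~ V = U Delta
```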

Dimension reduction. Another characterization of the SVD. Given a data matrix $X$ (or its scaled, centered version $\tilde X$), we might try solving

$$\text{minimize}_{Z:\ \mathrm{rank}(Z)=k}\ \|\tilde X - Z\|_F^2$$

where $\|\cdot\|_F$ stands for the Frobenius norm on matrices,

$$\|A\|_F^2 = \sum_{i=1}^n \sum_{j=1}^p A_{ij}^2.$$

Dimension reduction. Another characterization of the SVD. It can be proven that the solution is $Z = S_{n\times k}\,(V^T)_{k\times p}$, where $S_{n\times k}$ is the matrix of the first $k$ PCA scores and the columns of $V$ are the first $k$ PCA loadings. This approximation is related to the screeplot: the height of each bar describes the additional drop in Frobenius norm as the rank of the approximation increases.
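A numpy sketch of this best rank-$k$ approximation on a synthetic matrix, checking that the squared Frobenius error equals the sum of the squared singular values that were dropped:

```python
import numpy as np

rng = np.random.default_rng(3)
Xt = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 8))   # stand-in for the centered, scaled matrix

U, delta, Vt = np.linalg.svd(Xt, full_matrices=False)

for k in range(1, 5):
    Z = (U[:, :k] * delta[:k]) @ Vt[:k, :]                # first k scores times first k loadings
    err = np.sum((Xt - Z) ** 2)                           # squared Frobenius error
    print(k, round(err, 2), round(np.sum(delta[k:] ** 2), 2))   # the last two numbers agree
```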

Olympic data: screeplot

Dimension reduction. Other types of dimension reduction. Instead of maximizing sample variance, we might try maximizing some other quantity... Independent Component Analysis tries to maximize the non-Gaussianity of $V_\beta$; in practice, it uses skewness, kurtosis or other moments to quantify non-Gaussianity. These are both unsupervised approaches. Often, they are combined with supervised approaches into an algorithm like: Feature creation: build some set of features using PCA. Validate: use the derived features to see if they are helpful in the supervised problem.

Olympic data: 1st component vs. total score

Part III: Distances and similarities

Distances and similarities. Similarities. Start with $\tilde X$, which we assume is centered and standardized. The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity. The first 2 (or any 2) PCA scores yield an $n \times 2$ matrix that can be visualized as a scatter plot. Similarly, the first 2 (or any 2) PCA loadings yield a $p \times 2$ matrix that can be visualized as a scatter plot.

Olympic data

Distances and similarities. Similarities. The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity. The visualization of the cases based on PCA scores was determined by the eigenvectors of $\tilde X \tilde X^T$, an $n \times n$ matrix. To see this, remember that $\tilde X = U \Delta V^T$, so

$$\tilde X \tilde X^T = U \Delta^2 U^T, \qquad \tilde X^T \tilde X = V \Delta^2 V^T.$$

Distances and similarities. Similarities. The matrix $\tilde X \tilde X^T$ is a measure of similarity between cases. The matrix $\tilde X^T \tilde X$ is a measure of similarity between features. Structure in the two similarity matrices yields insight into the set of cases, or the set of features...

Distances and similarities. Distances. Distances are inversely related to similarities. If A and B are similar, then d(A, B) should be small, i.e. they should be near. If A and B are distant, then they should not be similar. For a data matrix, there is a natural distance between cases:

$$d(X_i, X_k)^2 = \|X_i - X_k\|_2^2 = \sum_{j=1}^p (X_{ij} - X_{kj})^2 = (XX^T)_{ii} - 2(XX^T)_{ik} + (XX^T)_{kk}.$$
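A short numpy sketch of this identity, computing all pairwise distances from the Gram matrix $XX^T$ on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))                       # 6 cases, 3 features

G = X @ X.T                                       # Gram matrix of inner products (XX^T)
g = np.diag(G)
D2 = g[:, None] - 2 * G + g[None, :]              # squared distances via the identity above
D = np.sqrt(np.maximum(D2, 0))                    # clip tiny negatives from round-off

# check against a direct computation for one pair of cases
print(np.isclose(D[0, 1], np.linalg.norm(X[0] - X[1])))   # True
```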

Distances and similarities. Distances. This suggests a natural transformation from a similarity matrix $S$ to a distance matrix $D$:

$$D_{ik} = (S_{ii} - 2 S_{ik} + S_{kk})^{1/2}.$$

The reverse transformation is not so obvious. Some suggestions from your book:

$$S_{ik} = -D_{ik}, \qquad S_{ik} = e^{-D_{ik}}, \qquad S_{ik} = \frac{1}{1 + D_{ik}}.$$

Distances and similarities. Distances. A distance (or a metric) on a set $S$ is a function $d: S \times S \to [0, +\infty)$ that satisfies: $d(x, x) = 0$ and $d(x, y) = 0 \iff x = y$; $d(x, y) = d(y, x)$; $d(x, y) \le d(x, z) + d(z, y)$. If $d(x, y) = 0$ for some $x \ne y$, then $d$ is a pseudo-metric.

Distances and similarities. Similarities. A similarity on a set $S$ is a function $s: S \times S \to \mathbb{R}$ that should satisfy: $s(x, x) \ge s(x, y)$ for all $x \ne y$; $s(x, y) = s(y, x)$. By adding a constant, we can often assume that $s(x, y) \ge 0$.

Distances and similarities. Examples: nominal data. The simplest example for nominal data is just the discrete metric

$$d(x, y) = \begin{cases} 0 & x = y \\ 1 & \text{otherwise.} \end{cases}$$

The corresponding similarity would be

$$s(x, y) = \begin{cases} 1 & x = y \\ 0 & \text{otherwise} \end{cases} \;=\; 1 - d(x, y).$$

Distances and similarities. Examples: ordinal data. If $S$ is ordered, we can think of $S$ as (or identify $S$ with) a subset of the non-negative integers. If $|S| = m$, then a natural distance is

$$d(x, y) = \frac{|x - y|}{m - 1} \le 1.$$

The corresponding similarity would be $s(x, y) = 1 - d(x, y)$.

Distances and similarities. Examples: vectors of continuous data. If $S = \mathbb{R}^k$, there are lots of distances determined by norms. The Minkowski p or $\ell_p$ norm, for $p \ge 1$:

$$d(x, y) = \|x - y\|_p = \left( \sum_{i=1}^k |x_i - y_i|^p \right)^{1/p}.$$

Examples: $p = 2$, the usual Euclidean distance, $\ell_2$; $p = 1$, $d(x, y) = \sum_{i=1}^k |x_i - y_i|$, the taxicab distance, $\ell_1$; $p = \infty$, $d(x, y) = \max_{1 \le i \le k} |x_i - y_i|$, the sup norm, $\ell_\infty$.

Distances and similarities. Examples: vectors of continuous data. If $0 < p < 1$ this is not a norm, but $\sum_{i=1}^k |x_i - y_i|^p$ (without the $1/p$ power) still defines a metric. The preceding transformations can be used to construct similarities. Examples: binary vectors. If $S = \{0, 1\}^k \subset \mathbb{R}^k$, the vectors can be thought of as vectors of bits. The $\ell_1$ norm counts the number of mismatched bits; this is known as Hamming distance.
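A numpy sketch of these distances on a pair of made-up vectors:

```python
import numpy as np

x = np.array([1.0, 0.0, 3.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 5.0])

print(np.linalg.norm(x - y, ord=1))        # taxicab (l1) distance: 6.0
print(np.linalg.norm(x - y, ord=2))        # Euclidean (l2) distance: sqrt(14)
print(np.linalg.norm(x - y, ord=np.inf))   # sup norm (l-infinity): 3.0

# Hamming distance between binary vectors: the l1 norm counts mismatched bits.
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 0, 1, 0])
print(int(np.sum(np.abs(a - b))))          # 2 mismatched bits
```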

Distances and similarities. Example: Mahalanobis distance. Given a positive definite $\Sigma_{k\times k}$, we define the Mahalanobis distance on $\mathbb{R}^k$ by

$$d_\Sigma(x, y) = \left( (x - y)^T \Sigma^{-1} (x - y) \right)^{1/2}.$$

This is the usual Euclidean distance, with a change of basis given by a rotation and stretching of the axes. If $\Sigma$ is only non-negative definite, then we can replace $\Sigma^{-1}$ with $\Sigma^{+}$, its pseudo-inverse. This yields a pseudo-metric because it fails the test $d(x, y) = 0 \Rightarrow x = y$.
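A small numpy sketch, with an arbitrary (made-up) positive definite $\Sigma$:

```python
import numpy as np

Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])              # an arbitrary positive definite covariance
x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

diff = x - y
d_mahal = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)
print(d_mahal)

# If Sigma were only non-negative definite, use the pseudo-inverse instead:
d_pseudo = np.sqrt(diff @ np.linalg.pinv(Sigma) @ diff)
print(np.isclose(d_mahal, d_pseudo))        # True here, since Sigma is invertible
```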

Distances and similarities. Example: similarities for binary vectors. Given binary vectors $x, y \in \{0, 1\}^k$, we can summarize their agreement in a 2x2 table of counts:

                 x = 0     x = 1     total
        y = 0    f_{00}    f_{01}    f_{0.}
        y = 1    f_{10}    f_{11}    f_{1.}
        total    f_{.0}    f_{.1}    k

Distances and similarities. Example: similarities for binary vectors. We define the simple matching coefficient (SMC) similarity by

$$\mathrm{SMC}(x, y) = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{f_{00} + f_{11}}{k} = 1 - \frac{\|x - y\|_1}{k}.$$

The Jaccard coefficient ignores entries where $x_i = y_i = 0$:

$$J(x, y) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}.$$
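A numpy sketch computing both coefficients for a pair of made-up binary vectors, following the table's convention for $f_{01}$ and $f_{10}$:

```python
import numpy as np

x = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y = np.array([1, 1, 0, 1, 0, 0, 0, 0])

f00 = np.sum((x == 0) & (y == 0))        # both zero
f11 = np.sum((x == 1) & (y == 1))        # both one
f01 = np.sum((x == 1) & (y == 0))        # x = 1, y = 0
f10 = np.sum((x == 0) & (y == 1))        # x = 0, y = 1
k = len(x)

smc = (f00 + f11) / k                    # simple matching coefficient
jac = f11 / (f01 + f10 + f11)            # Jaccard: ignores the 0-0 matches
print(smc, jac)                          # 0.75 0.5
```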

Distances and similarities. Example: cosine similarity & correlation. Given vectors $x, y \in \mathbb{R}^k$, the cosine similarity is defined as

$$-1 \le \cos(x, y) = \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2} \le 1.$$

The correlation between two vectors is defined as

$$-1 \le \mathrm{cor}(x, y) = \frac{\langle x - \bar x 1,\; y - \bar y 1 \rangle}{\|x - \bar x 1\|_2 \, \|y - \bar y 1\|_2} \le 1.$$

Distances and similarities. Example: correlation. An alternative, perhaps more familiar definition:

$$\mathrm{cor}(x, y) = \frac{S_{xy}}{S_x S_y}, \qquad S_{xy} = \frac{1}{k-1} \sum_{i=1}^k (x_i - \bar x)(y_i - \bar y), \qquad S_x^2 = \frac{1}{k-1} \sum_{i=1}^k (x_i - \bar x)^2, \qquad S_y^2 = \frac{1}{k-1} \sum_{i=1}^k (y_i - \bar y)^2.$$
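A numpy sketch showing that the two definitions of correlation agree, and how correlation differs from cosine similarity, on synthetic vectors:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

cosine = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

xc, yc = x - x.mean(), y - y.mean()
cor_inner = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))        # inner-product definition
cor_alt = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))   # S_xy / (S_x S_y)

print(round(cosine, 4))
print(np.isclose(cor_inner, cor_alt), np.isclose(cor_inner, np.corrcoef(x, y)[0, 1]))   # True True
```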

Distances and similarities. Correlation & PCA. The matrix $\frac{1}{n-1}\tilde X^T \tilde X$ was actually the matrix of pairwise correlations of the features. Why? How?

$$\frac{1}{n-1}\left(\tilde X^T \tilde X\right)_{ij} = \frac{1}{n-1}\left(D^{-1/2}\, X^T H X\, D^{-1/2}\right)_{ij}$$

The diagonal entries of $D$ are the sample variances of each feature (so $D^{1/2} = S$, the scaling used earlier). The inner matrix multiplication computes the pairwise dot products of the columns of $HX$:

$$\left(X^T H X\right)_{ij} = \sum_{k=1}^n (X_{ki} - \bar X_i)(X_{kj} - \bar X_j).$$

High positive correlation

High negative correlation

No correlation

Small positive

Small negative

Distances and similarities. Combining similarities. In a given data set, each case may have many attributes or features. Example: see the health data set for HW 1. To compute similarities of cases, we must pool similarities across features. In a data set with M different features, we write $x_i = (x_{i1}, \dots, x_{iM})$, with each $x_{im} \in S_m$.

Distances and similarities. Combining similarities. Given similarities $s_m$ on each $S_m$, we can define an overall similarity between cases $x_i$ and $x_j$ by

$$s(x_i, x_j) = \sum_{m=1}^M w_m\, s_m(x_{im}, x_{jm}),$$

with optional weights $w_m$ for each feature. Your book modifies this to deal with asymmetric attributes, i.e. attributes for which the Jaccard similarity might be used.
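A hedged Python sketch of this weighted combination, reusing the simple per-feature similarities from the earlier slides; the feature types, values and weights below are made up for illustration:

```python
import numpy as np

# Two hypothetical cases with a nominal, an ordinal (levels 0..4), and a continuous feature.
case_i = {"color": "red", "rating": 3, "weight": 70.0}
case_j = {"color": "blue", "rating": 1, "weight": 80.0}

def s_nominal(a, b):
    return 1.0 if a == b else 0.0                      # discrete-metric similarity

def s_ordinal(a, b, m=5):
    return 1.0 - abs(a - b) / (m - 1)                  # 1 - |a - b| / (m - 1)

def s_continuous(a, b):
    return 1.0 / (1.0 + abs(a - b))                    # distance -> similarity via 1 / (1 + d)

sims = np.array([s_nominal(case_i["color"], case_j["color"]),
                 s_ordinal(case_i["rating"], case_j["rating"]),
                 s_continuous(case_i["weight"], case_j["weight"])])
weights = np.array([0.5, 0.3, 0.2])                    # optional per-feature weights

print(float(weights @ sims))                           # overall similarity s(x_i, x_j)
```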