Week 2
Based in part on slides from the textbook and slides by Susan Holmes
October 3, 2012
Part I
Other data types, preprocessing
Other data types
Document data

You might start with a collection of n documents (e.g. all of Shakespeare's works in some digital format; all of the WikiLeaks U.S. Embassy Cables). This is not a data matrix...
Given p terms of interest: {Al Qaeda, Iran, Iraq, etc.}, one can form a term-document matrix filled with counts.
Other data types
Document data

Document ID   Al Qaeda   Iran   Iraq   ...
1             0          0      3      ...
...           ...        ...    ...    ...
250000        2          1      1      ...
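To make this concrete, here is a minimal base R sketch of building such a term-document count matrix; the documents and terms are toy examples invented for illustration, and a real analysis would tokenize the text properly (e.g. with a text-mining package).

count_term <- function(term, doc) {
  # positions of the term in the document; gregexpr returns -1 when there is no match
  hits <- gregexpr(term, doc, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

docs  <- c("iran iraq iraq iraq", "al qaeda iran iraq")   # two toy "documents"
terms <- c("al qaeda", "iran", "iraq")

term_doc <- t(sapply(docs, function(d) sapply(terms, count_term, doc = d)))
term_doc   # rows = documents, columns = counts for each term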
Other data types
Time series

Imagine recording the minute-by-minute prices of all stocks in the S&P 500 for the last 200 days of trading. The data can be represented by a 78000 × 500 matrix.
BUT, there is definite structure across the rows of this matrix. They are not unrelated cases like they might be in other applications.
Transaction data

Social media data

Graph data
Other data types
Graph data

Nodes on the graph might be Facebook users or public pages. Weights on the edges could be the number of messages sent in a prespecified period.
If you take weekly intervals, this leads to a sequence of graphs $G_i$ = communication over the $i$-th week. How this graph changes is of obvious interest...
Even the structure of just one graph is of interest; we'll come back to this when we talk about spectral methods...
Data quality
Some issues to keep in mind

Is the data experimental or observational? If observational, what do we know about the data-generating mechanism? For example, although the S&P 500 example can be represented as a data matrix, there is clearly structure across rows.
General quality issues: How much of the data is missing? Is missingness informative? Is it very noisy? Are there outliers? Are there a large number of duplicates?
Preprocessing
General procedures

Aggregation: Combining features into a new feature. Example: pooling county-level data to state-level data.
Discretization: Breaking up a continuous variable (or a set of continuous variables) into an ordinal (or nominal) discrete variable.
Transformation: Simple transformation of a feature (log or exp) or mapping to a new space (Fourier transform / power spectrum, wavelet transform).
A continuous variable that could be discretized

Discretization by fixed width

Discretization by fixed quartile

Discretization by clustering
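The three discretization schemes pictured above can be carried out in a few lines of base R; the mixture variable below is synthetic, chosen only so the bins differ visibly between schemes.

set.seed(1)
x <- c(rnorm(100, -2), rnorm(100, 0), rnorm(100, 2))   # a synthetic continuous variable

# Fixed width: 4 bins of equal length over the range of x
fixed_width <- cut(x, breaks = 4)

# Fixed quantile: 4 bins with roughly equal numbers of observations
fixed_quantile <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 5)),
                      include.lowest = TRUE)

# Clustering: bin labels given by one-dimensional k-means cluster membership
clustered <- factor(kmeans(x, centers = 4)$cluster)

table(fixed_width); table(fixed_quantile); table(clustered)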
Variable transformation: bacterial decay

Variable transformation: bacterial decay

BMW daily returns (fecofin package)

ACF of BMW daily returns

Discrete wavelet transform of BMW daily returns
Part II
Dimension reduction, PCA & eigenanalysis
Dimension reduction
Combinations of features

Given a data matrix $X_{n \times p}$ with p fairly large, it can be difficult to visualize structure. Often useful to look at linear combinations of the features.
Each $\beta \in \mathbb{R}^p$ determines a linear rule $f_\beta(x) = x^T \beta$. Evaluating this on each row $X_i$ of $X$ yields a vector $(f_\beta(X_i))_{1 \le i \le n} = X\beta$.
Dimension reduction

By choosing $\beta$ appropriately, we may find interesting new features.
Suppose we take $k \ll p$ vectors $\beta$, which we write as a matrix $B_{p \times k}$. The new data matrix $XB$ has fewer dimensions than $X$. This is dimension reduction...
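A minimal R sketch of this idea, with a randomly generated X and B purely for illustration:

set.seed(1)
n <- 100; p <- 10; k <- 2
X <- matrix(rnorm(n * p), n, p)        # illustrative data matrix
B <- matrix(rnorm(p * k), p, k)        # each column of B is one beta in R^p

new_features <- X %*% B                # n x k matrix of derived features f_beta(X_i)
dim(new_features)                      # 100 2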
Dimension reduction
Principal Components

This method of dimension reduction seeks features that maximize sample variance...
Specifically, for $\beta \in \mathbb{R}^p$, define the sample variance of $V_\beta = X\beta \in \mathbb{R}^n$:
$$\mathrm{Var}(V_\beta) = \frac{1}{n-1}\sum_{i=1}^n \left(V_{\beta,i} - \bar V_\beta\right)^2, \qquad \bar V_\beta = \frac{1}{n}\sum_{i=1}^n V_{\beta,i}.$$
Note that $\mathrm{Var}(V_{C\beta}) = C^2\,\mathrm{Var}(V_\beta)$, so we usually standardize $\|\beta\|_2^2 = \sum_{i=1}^p \beta_i^2 = 1$.
Dimension reduction
Principal Components

Define the $n \times n$ matrix $H = I_{n \times n} - \frac{1}{n}\mathbf{1}\mathbf{1}^T$.
This matrix removes means: $(Hv)_i = v_i - \bar v$.
It is also a projection matrix: $H^T = H$, $H^2 = H$.
Dimension reduction
Principal Components with Matrices

With this matrix, $\mathrm{Var}(V_\beta) = \frac{1}{n-1}\,\beta^T X^T H X \beta$.
So, maximizing sample variance, with $\|\beta\|_2 = 1$, is
$$\underset{\|\beta\|_2 = 1}{\text{maximize}} \;\; \beta^T \left(X^T H X\right) \beta.$$
This boils down to an eigenvalue problem...
Dimension reduction
Eigenanalysis

The matrix $X^T H X$ is symmetric, so it can be written as $X^T H X = V D V^T$, where $D_{k \times k} = \mathrm{diag}(d_1, \ldots, d_k)$, $k = \mathrm{rank}(X^T H X)$, and $V_{p \times k}$ has orthonormal columns, i.e. $V^T V = I_{k \times k}$.
We always have $d_i \ge 0$, and we take $d_1 \ge d_2 \ge \ldots \ge d_k$.
Dimension reduction
Eigenanalysis & PCA

Suppose now that $\beta = \sum_{j=1}^k a_j v_j$ with $v_j$ the columns of $V$. Then
$$\|\beta\|^2 = \sum_{i=1}^k a_i^2, \qquad \beta^T \left(X^T H X\right)\beta = \sum_{i=1}^k a_i^2 d_i.$$
Choosing $a_1 = 1$, $a_j = 0$ for $j \ge 2$ maximizes this quantity.
Dimension reduction
Eigenanalysis & PCA

Therefore, $\beta_1 = v_1$, the first column of $V$, solves
$$\underset{\|\beta\|_2 = 1}{\text{maximize}} \;\; \beta^T \left(X^T H X\right) \beta.$$
This yields scores $H X \beta_1 \in \mathbb{R}^n$. These are the 1st principal component scores.
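As a check of this derivation, the sketch below (with simulated data) computes the first loading vector as the top eigenvector of $X^T H X$ and compares the resulting scores with base R's prcomp, which performs the same computation; the two agree up to sign.

set.seed(1)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)

H       <- diag(n) - matrix(1, n, n) / n           # centering matrix
ev      <- eigen(t(X) %*% H %*% X, symmetric = TRUE)
v1      <- ev$vectors[, 1]                         # first loading vector
scores1 <- H %*% X %*% v1                          # first principal component scores

pc <- prcomp(X)                                    # prcomp centers by default
max(abs(abs(scores1) - abs(pc$x[, 1])))            # ~ 0: same scores up to sign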
Dimension reduction
Higher order components

Having found the direction with maximal sample variance, we might look for the second most variable direction by solving
$$\underset{\|\beta\|_2 = 1,\; \beta^T v_1 = 0}{\text{maximize}} \;\; \beta^T \left(X^T H X\right) \beta.$$
Note we restricted our search so we would not just recover $v_1$ again.
It is not hard to see that if $\beta = \sum_{j=1}^k a_j v_j$, this is solved by taking $a_2 = 1$, $a_j = 0$ for $j \ne 2$.
Dimension reduction
Higher order components

In matrix terms, all of the principal component scores are $(HXV)_{n \times k}$ and the loadings are the columns of $V$. This information can be summarized in a biplot.
The loadings describe how each feature contributes to each principal component score.
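A short R illustration, on simulated data, of extracting all scores and loadings and drawing the biplot mentioned above:

set.seed(1)
X <- matrix(rnorm(50 * 4), 50, 4)
colnames(X) <- paste0("feature", 1:4)

pc       <- prcomp(X)       # centers the data by default
scores   <- pc$x            # n x k matrix of principal component scores (HXV)
loadings <- pc$rotation     # p x k matrix V of loadings
biplot(pc)                  # cases and feature loadings in a single display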
Olympic data

Olympic data: screeplot
Dimension reduction
The importance of scale

The PCA scores are not invariant to scaling of the features: for $Q_{p \times p}$ diagonal, the PCA scores of $XQ$ are not the same as those of $X$. It is common to convert all variables to the same scale before applying PCA.
Define the scalings to be the sample standard deviations of each feature. In matrix form, let
$$S^2 = \frac{1}{n-1}\,\mathrm{diag}\left(X^T H X\right).$$
Define $\tilde X = H X S^{-1}$. The normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$.
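An R sketch of why scale matters, on synthetic features with very different variances; it also verifies that prcomp with scale. = TRUE matches an eigenanalysis of $\tilde X^T \tilde X / (n-1)$:

set.seed(1)
X <- cbind(rnorm(100, sd = 100), rnorm(100, sd = 1), rnorm(100, sd = 0.01))

prcomp(X)$sdev                  # unscaled PCA: dominated by the large-variance feature
prcomp(X, scale. = TRUE)$sdev   # PCA after converting all features to the same scale

# By hand: Xtilde = H X S^{-1} is just column-wise standardization
Xtilde <- scale(X)
eigen(crossprod(Xtilde) / (nrow(X) - 1))$values   # = prcomp(X, scale. = TRUE)$sdev^2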
Olympic data

Olympic data: screeplot
Dimension reduction
PCA and the SVD

The singular value decomposition of a matrix tells us we can write $\tilde X = U \Delta V^T$ with $\Delta_{k \times k} = \mathrm{diag}(\delta_1, \ldots, \delta_k)$, $\delta_j \ge 0$, $k = \mathrm{rank}(\tilde X)$, $U^T U = V^T V = I_{k \times k}$.
Recall that the scores were $\tilde X V = (U \Delta V^T)V = U\Delta$.
Also, $\tilde X^T \tilde X = V \Delta^2 V^T$, so $D = \Delta^2$.
Dimension reduction
PCA and the SVD

Recall $\tilde X = H X S^{-1}$. We saw that the normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$.
It turns out that, via the SVD, the normalized PCA scores are given by an eigenanalysis of $\tilde X \tilde X^T$, because $(\tilde X V)_{n \times k} = U\Delta$, where the columns of $U$ are eigenvectors of $\tilde X \tilde X^T = U \Delta^2 U^T$.
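A small numerical check, on simulated data, that the SVD of $\tilde X$ reproduces the eigenanalysis scores (up to sign) and that $D = \Delta^2$:

set.seed(1)
X      <- matrix(rnorm(100 * 5), 100, 5)
Xtilde <- scale(X)                        # centered and standardized

s <- svd(Xtilde)                          # Xtilde = U %*% diag(d) %*% t(V)
scores_svd   <- s$u %*% diag(s$d)         # U Delta, i.e. Xtilde %*% s$v
scores_eigen <- Xtilde %*% eigen(crossprod(Xtilde), symmetric = TRUE)$vectors

max(abs(abs(scores_svd) - abs(scores_eigen)))        # ~ 0: columns agree up to sign
all.equal(s$d^2, eigen(crossprod(Xtilde))$values)    # D = Delta^2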
Dimension reduction
Another characterization of the SVD

Given a data matrix $X$ (or its scaled, centered version $\tilde X$), we might try solving
$$\underset{Z:\;\mathrm{rank}(Z) = k}{\text{minimize}} \;\; \|\tilde X - Z\|_F^2,$$
where $F$ stands for the Frobenius norm on matrices: $\|A\|_F^2 = \sum_{i=1}^n \sum_{j=1}^p A_{ij}^2$.
Dimension reduction
Another characterization of the SVD

It can be proven that the solution is $Z = S_{n \times k} (V^T)_{k \times p}$, where $S_{n \times k}$ is the matrix of the first $k$ PCA scores and the columns of $V$ are the first $k$ PCA loadings.
This approximation is related to the screeplot: the height of each bar describes the additional drop in squared Frobenius norm as the rank of the approximation increases.
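The sketch below (simulated data again) builds the best rank-k approximation from the SVD and confirms that the squared Frobenius errors are the tail sums of squared singular values, which is exactly the information the screeplot displays.

set.seed(1)
Xtilde <- scale(matrix(rnorm(100 * 6), 100, 6))
s <- svd(Xtilde)

rank_k_err2 <- function(k) {
  # best rank-k approximation: first k scores times first k loadings
  Z <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k, drop = FALSE])
  sum((Xtilde - Z)^2)                    # squared Frobenius norm of the residual
}

sapply(1:5, rank_k_err2)                 # errors of the best rank-k approximations
rev(cumsum(rev(s$d^2)))[-1]              # tail sums of squared singular values: the same values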
Olympic data: screeplot
Dimension reduction
Other types of dimension reduction

Instead of maximizing sample variance, we might try maximizing some other quantity...
Independent Component Analysis tries to maximize the non-Gaussianity of $V_\beta$. In practice, it uses skewness, kurtosis or other moments to quantify non-Gaussianity.
These are both unsupervised approaches. Often, they are combined with supervised approaches into an algorithm like:
Feature creation: Build some set of features using PCA.
Validate: Use the derived features to see if they are helpful in the supervised problem.
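The sketch below is not a full ICA algorithm; it only illustrates the idea of scoring directions by non-Gaussianity, here with a crude random search over unit vectors using excess kurtosis. The mixed sources are simulated, and a real analysis would use a dedicated routine such as the fastICA package.

set.seed(1)
n <- 500
sources <- cbind(runif(n, -1, 1), rexp(n) - 1)          # two non-Gaussian sources
X <- scale(sources %*% matrix(c(1, 2, 2, 1), 2, 2))     # mixed, then standardized

excess_kurtosis <- function(v) {
  m <- mean(v)
  mean((v - m)^4) / mean((v - m)^2)^2 - 3               # moment estimate of excess kurtosis
}

# Crude search: score many random unit directions beta by |excess kurtosis| of X beta
# and keep the most non-Gaussian one.
betas <- matrix(rnorm(2 * 2000), 2, 2000)
betas <- sweep(betas, 2, sqrt(colSums(betas^2)), "/")   # normalize each column
kurt  <- apply(X %*% betas, 2, function(v) abs(excess_kurtosis(v)))
beta_best <- betas[, which.max(kurt)]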
Olympic data: 1st component vs. total score
Part III
Distances and similarities
Distances and similarities
Similarities

Start with the centered and standardized matrix $\tilde X$. The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity.
The first 2 (or any 2) PCA scores yield an $n \times 2$ matrix that can be visualized as a scatter plot. Similarly, the first 2 (or any 2) PCA loadings yield a $p \times 2$ matrix that can be visualized as a scatter plot.
Olympic data
Distances and similarities
Similarities

The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity.
The visualization of the cases based on PCA scores was determined by the eigenvectors of $\tilde X \tilde X^T$, an $n \times n$ matrix. To see this, remember that $\tilde X = U \Delta V^T$, so
$$\tilde X \tilde X^T = U \Delta^2 U^T, \qquad \tilde X^T \tilde X = V \Delta^2 V^T.$$
Distances and similarities
Similarities

The matrix $\tilde X \tilde X^T$ is a measure of similarity between cases. The matrix $\tilde X^T \tilde X$ is a measure of similarity between features.
Structure in the two similarity matrices yields insight into the set of cases, or the set of features...
Distances and similarities
Distances

Distances are inversely related to similarities. If A and B are similar, then d(A, B) should be small, i.e. they should be near. If A and B are distant, then they should not be similar.
For a data matrix, there is a natural distance between cases:
$$d(X_i, X_k)^2 = \|X_i - X_k\|_2^2 = \sum_{j=1}^p (X_{ij} - X_{kj})^2 = (XX^T)_{ii} - 2(XX^T)_{ik} + (XX^T)_{kk}.$$
Distances and similarities
Distances

This suggests a natural transformation between a similarity matrix $S$ and a distance matrix $D$:
$$D_{ik} = (S_{ii} - 2 S_{ik} + S_{kk})^{1/2}.$$
The reverse transformation is not so obvious. Some suggestions from your book:
$$S_{ik} = -D_{ik}, \qquad S_{ik} = e^{-D_{ik}}, \qquad S_{ik} = \frac{1}{1 + D_{ik}}.$$
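Both transformations are easy to apply entrywise; the helper functions and the toy similarity matrix below are illustrative, not from the lecture.

# Similarity matrix -> distance matrix, entrywise, as above
sim_to_dist <- function(S) {
  D2 <- outer(diag(S), diag(S), "+") - 2 * S   # S_ii + S_kk - 2 S_ik
  sqrt(pmax(D2, 0))                            # guard against tiny negative values
}

# Distance matrix -> similarity matrix, using the book's three suggestions
dist_to_sim <- function(D, method = c("negative", "exponential", "inverse")) {
  method <- match.arg(method)
  switch(method,
         negative    = -D,
         exponential = exp(-D),
         inverse     = 1 / (1 + D))
}

S <- crossprod(scale(matrix(rnorm(30), 6, 5)))  # a toy similarity matrix between 5 features
D <- sim_to_dist(S)
round(dist_to_sim(D, "exponential"), 2)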
Distances and similarities
Distances

A distance (or a metric) on a set $S$ is a function $d: S \times S \to [0, +\infty)$ that satisfies
$d(x, x) = 0$; $d(x, y) = 0 \Rightarrow x = y$
$d(x, y) = d(y, x)$
$d(x, y) \le d(x, z) + d(z, y)$
If $d(x, y) = 0$ for some $x \ne y$, then $d$ is a pseudo-metric.
Distances and similarities
Similarities

A similarity on a set $S$ is a function $s: S \times S \to \mathbb{R}$ and should satisfy
$s(x, x) \ge s(x, y)$ for all $x \ne y$
$s(x, y) = s(y, x)$
By adding a constant, we can often assume that $s(x, y) \ge 0$.
Distances and similarities
Examples: nominal data

The simplest example for nominal data is just the discrete metric
$$d(x, y) = \begin{cases} 0 & x = y \\ 1 & \text{otherwise.} \end{cases}$$
The corresponding similarity would be
$$s(x, y) = \begin{cases} 1 & x = y \\ 0 & \text{otherwise} \end{cases} \;=\; 1 - d(x, y).$$
Distances and similarities
Examples: ordinal data

If $S$ is ordered, we can think of $S$ as (or identify $S$ with) a subset of the non-negative integers.
If $|S| = m$, then a natural distance is
$$d(x, y) = \frac{|x - y|}{m - 1} \le 1.$$
The corresponding similarity would be $s(x, y) = 1 - d(x, y)$.
Distances and similarities
Examples: vectors of continuous data

If $S = \mathbb{R}^k$, there are lots of distances determined by norms. The Minkowski $p$ or $\ell_p$ norm, for $p \ge 1$:
$$d(x, y) = \|x - y\|_p = \left(\sum_{i=1}^k |x_i - y_i|^p\right)^{1/p}.$$
Examples:
$p = 2$: the usual Euclidean distance, $\ell_2$
$p = 1$: $d(x, y) = \sum_{i=1}^k |x_i - y_i|$, the taxicab distance, $\ell_1$
$p = \infty$: $d(x, y) = \max_{1 \le i \le k} |x_i - y_i|$, the sup norm, $\ell_\infty$
Distances and similarities
Examples: vectors of continuous data

If $0 < p < 1$ this is not a norm, but $\sum_{i=1}^k |x_i - y_i|^p$ (without the outer $1/p$ power) still defines a metric. The preceding transformations can be used to construct similarities.
Examples: binary vectors
If $S = \{0, 1\}^k \subset \mathbb{R}^k$, the vectors can be thought of as vectors of bits. The $\ell_1$ norm counts the number of mismatched bits. This is known as the Hamming distance.
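These distances are available directly in base R's dist(); the vectors below are arbitrary examples.

x <- c(1, 0, 3)
y <- c(2, 2, 1)
M <- rbind(x, y)

dist(M, method = "euclidean")              # l_2, p = 2
dist(M, method = "manhattan")              # l_1, p = 1
dist(M, method = "minkowski", p = 4)       # general Minkowski distance, p >= 1
dist(M, method = "maximum")                # sup norm, p = infinity

# Hamming distance between binary vectors: the number of mismatched bits
b1 <- c(1, 0, 1, 1, 0)
b2 <- c(1, 1, 0, 1, 0)
sum(b1 != b2)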
Distances and similarities
Example: Mahalanobis distance

Given $\Sigma_{k \times k}$ that is positive definite, we define the Mahalanobis distance on $\mathbb{R}^k$ by
$$d_\Sigma(x, y) = \left((x - y)^T \Sigma^{-1} (x - y)\right)^{1/2}.$$
This is the usual Euclidean distance, with a change of basis given by a rotation and stretching of the axes.
If $\Sigma$ is only non-negative definite, then we can replace $\Sigma^{-1}$ with $\Sigma^{+}$, its pseudo-inverse. This yields a pseudo-metric because it fails the test $d(x, y) = 0 \Rightarrow x = y$.
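Base R provides mahalanobis(), which returns the squared distance; the covariance structure below is simulated for illustration.

set.seed(1)
X <- matrix(rnorm(200 * 3), 200, 3) %*% matrix(c(1, 0.5, 0, 0.5, 1, 0.3, 0, 0.3, 1), 3, 3)
Sigma  <- cov(X)
center <- colMeans(X)

d2 <- mahalanobis(X, center, Sigma)   # SQUARED Mahalanobis distances of each row to the mean
d  <- sqrt(d2)                        # the distances d_Sigma(X_i, center)
head(d)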
Distances and similarities
Example: similarities for binary vectors

Given binary vectors $x, y \in \{0, 1\}^k$, we can summarize their agreement in a 2 × 2 table:

          x = 0     x = 1     total
y = 0     f_00      f_01      f_0.
y = 1     f_10      f_11      f_1.
total     f_.0      f_.1      k
Distances and similarities
Example: similarities for binary vectors

We define the simple matching coefficient (SMC) similarity by
$$\mathrm{SMC}(x, y) = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{f_{00} + f_{11}}{k} = 1 - \frac{\|x - y\|_1}{k}.$$
The Jaccard coefficient ignores entries where $x_i = y_i = 0$:
$$J(x, y) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}.$$
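Both coefficients are one-liners in R; the two binary vectors are made-up examples.

smc <- function(x, y) mean(x == y)          # (f00 + f11) / k

jaccard <- function(x, y) {
  f11      <- sum(x == 1 & y == 1)
  mismatch <- sum(x != y)                   # f01 + f10
  f11 / (f11 + mismatch)
}

x <- c(1, 0, 0, 0, 1, 0, 1)
y <- c(1, 1, 0, 0, 0, 0, 1)
smc(x, y)       # 5/7
jaccard(x, y)   # 2/4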
Distances and similarities
Example: cosine similarity & correlation

Given vectors $x, y \in \mathbb{R}^k$, the cosine similarity is defined as
$$-1 \le \cos(x, y) = \frac{\langle x, y\rangle}{\|x\|_2 \|y\|_2} \le 1.$$
The correlation between two vectors is defined as
$$-1 \le \mathrm{cor}(x, y) = \frac{\langle x - \bar x \mathbf{1},\; y - \bar y \mathbf{1}\rangle}{\|x - \bar x \mathbf{1}\|_2 \, \|y - \bar y \mathbf{1}\|_2} \le 1.$$
Distances and similarities
Example: correlation

An alternative, perhaps more familiar definition:
$$\mathrm{cor}(x, y) = \frac{S_{xy}}{S_x S_y},$$
where
$$S_{xy} = \frac{1}{k-1}\sum_{i=1}^k (x_i - \bar x)(y_i - \bar y), \qquad S_x^2 = \frac{1}{k-1}\sum_{i=1}^k (x_i - \bar x)^2, \qquad S_y^2 = \frac{1}{k-1}\sum_{i=1}^k (y_i - \bar y)^2.$$
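A quick R check, on arbitrary vectors, that correlation is just the cosine similarity of the centered vectors:

cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

x <- c(3, 2, 0, 5)
y <- c(1, 0, 0, 2)
cosine_sim(x, y)
cosine_sim(x - mean(x), y - mean(y))   # cosine of the centered vectors ...
cor(x, y)                              # ... equals the correlation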
Distances and similarities
Correlation & PCA

The matrix $\frac{1}{n-1}\tilde X^T \tilde X$ was actually the matrix of pair-wise correlations of the features. Why? How?
$$\frac{1}{n-1}\left(\tilde X^T \tilde X\right)_{ij} = \frac{1}{n-1}\left(S^{-1}\, X^T H X\, S^{-1}\right)_{ij}.$$
The diagonal entries of $S^2$ are the sample variances of each feature. The inner matrix multiplication computes the pair-wise dot products of the columns of $HX$:
$$\left(X^T H X\right)_{ij} = \sum_{k=1}^n (X_{ki} - \bar X_i)(X_{kj} - \bar X_j).$$
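A numerical confirmation of this claim in R, using scale() to form $\tilde X = H X S^{-1}$ on simulated data:

set.seed(1)
X      <- matrix(rnorm(100 * 4), 100, 4)
Xtilde <- scale(X)                           # centered and standardized columns

all.equal(crossprod(Xtilde) / (nrow(X) - 1), cor(X), check.attributes = FALSE)   # TRUE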
High positive correlation

High negative correlation

No correlation

Small positive

Small negative
Distances and similarities
Combining similarities

In a given data set, each case may have many attributes or features. Example: see the health data set for HW 1.
To compute similarities of cases, we must pool similarities across features. In a data set with M different features, we write $x_i = (x_{i1}, \ldots, x_{iM})$, with each $x_{im} \in S_m$.
Distances and similarities
Combining similarities

Given similarities $s_m$ on each $S_m$, we can define an overall similarity between cases $x_i$ and $x_j$ by
$$s(x_i, x_j) = \sum_{m=1}^M w_m\, s_m(x_{im}, x_{jm}),$$
with optional weights $w_m$ for each feature.
Your book modifies this to deal with asymmetric attributes, i.e. attributes for which Jaccard similarity might be used.
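A sketch of such a pooled similarity for a small mixed-type example; the features (smoker, age, bmi), the per-feature similarities, and the ranges are all invented for illustration and are not the homework data.

# Per-feature similarities (illustrative choices)
sim_nominal <- function(a, b) as.numeric(a == b)                 # 0/1 match
sim_numeric <- function(a, b, range) 1 - abs(a - b) / range      # ordinal/interval style

combined_sim <- function(xi, xj, w = c(1, 1, 1)) {
  s <- c(sim_nominal(xi$smoker, xj$smoker),
         sim_numeric(xi$age, xj$age, range = 80),
         sim_numeric(xi$bmi, xj$bmi, range = 40))
  sum(w * s)                       # the weighted sum above; w could also be normalized
}

case1 <- list(smoker = "yes", age = 54, bmi = 27)
case2 <- list(smoker = "no",  age = 49, bmi = 31)
combined_sim(case1, case2)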