Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Size: px

Start display at page:

Download "Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes"

Moses Sharp
5 years ago
Views:

1 Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, / 1 2 / 1 Other datatypes Other datatypes Document data You might start with a collection of n documents (i.e. all of Shakespeare s works in some digital format; all of Wikileaks U.S. Embassy Cables). This is not a data matrix... Given p terms of interest: {Al Qaeda, Iran, Iraq, etc.} one can form a term-document matrix filled with counts. Document data Document ID Al Qaeda Iran Iraq / 1 4 / 1

The data can be represented by a 78000 500 matrix.

2 Other datatypes Transaction data Time series Imagine recording the minute-by-minute prices of all stocks in the S & P 500 for last 200 days of trading. The data can be represented by a matrix. BUT, there is definite structure across the rows of this matrix. They are not unrelated cases like they might be in other applications. 5 / 1 6 / 1 Social media data Graph data 7 / 1 8 / 1

3 Other data types Data quality Graph data Nodes on the graph might be facebook users, or public pages. Weights on the edges could be number of messages sent in a prespecified period. If you take weekly intervals, this leads to a sequence of graphs G i = communication over i-th week. How this graph changes is of obvious interest.... Even structure of just one graph is of interest we ll come back to this when we talk about spectral methods... Some issues to keep in mind Is the data experimental or observational? If observational, what do we know about the data generating mechanism? For example, although the S&P 500 example can be represented as a data matrix, there is clearly structure across rows. General quality issues: How much of the data missing? Is missingness informative? Is it very noisy? Are there outliers? Are there a large number of duplicates? 9 / 1 10 / 1 Preprocessing A continuous variable that could be discretized General procedures 1.0 Aggregation Combining features into a new feature. Example: pooling county-level data to state-level data. 0.8 Discretization Breaking up a continuous variable (or a set of continuous variables) into an ordinal (or nominal) discrete variable. Transformation Simple transformation of feature (log or exp) or mapping to a new space (Fourier transform / power spectrum, wavelet transform) / 1 12 / 1

4 Discretization by fixed width Discretization by fixed quartile / 1 14 / 1 Discretization by clustering Variable transformation: bacterial decay / 1 16 / 1

5 Variable transformation: bacterial decay BMW daily returns (fecofin package) 17 / 1 18 / 1 ACF of BMW daily returns Discrete wavelet transform of BMW daily returns 19 / 1 20 / 1

6 Part II Combinations of features Given a data matrix X n p with p fairly large, it can be difficult to visualize structure., PCA & eigenanalysis Often useful to look at linear combinations of the features. Each β R p, determines a linear rule f β (x) = x T β Evaluating this on each row X i of X yields a vector (f β (X i )) 1 i n = X β. 21 / 1 22 / 1 Principal Components By choosing β appropriately, we may find interesting new features. Suppose we take k much smaller than p vectors of β which we write as a matrix B p k This method of dimension reduction seeks features that maximize sample variance... Specifically, for β R p, define the sample variance of V β = X β R n Var(V β ) = 1 n 1 n n (V β,i V β ) 2 i=1 The new data matrix XB has fewer dimensions than X. This is dimension reduction... V β = 1 n i=1 V β,i Note that Var(V Cβ ) = C 2 Var(V β ) so we usually standardize β 2 = n i=1 β2 i = 1 23 / 1 24 / 1

7 Principal Components Define the n n matrix H = I n n 1 n 11T This matrix removes means: (Hv) i = v i v. It is also a projection matrix: Principal Components with Matrices With this matrix, Var(V β ) = 1 n 1 βt X T HX β. So, maximizing sample variance, with β 2 = 1 is ( ) maximize β T X T HX β. β 2 =1 H T = H This boils down to an eigenvalue problem... H 2 = H 25 / 1 26 / 1 Eigenanalysis The matrix X T HX is symmetric, so it can be written as X T HX = VDV T where D k k = diag(d 1,..., d k ), rank(x T HX ) = k and V p k has orthonormal columns, i.e. V T V = I k k. We always have d i 0 and we take d 1 d 2... d k. Eigenanalysis & PCA Suppose now that β = k j=1 a jv j with v j the columns of V. Then, β 2 = k ( ) β T X T HX β = i=1 a 2 i k ai 2 d i i=1 Choosing a 1 = 1, a j = 0, j 2 maximizes this quantity. 27 / 1 28 / 1

8 Eigenanalysis & PCA Therefore, β 1 = v 1 the first column of V solves ( ) maximize β T X T HX β. β 2 =1 Higher order components Having found the direction with maximal sample variance we might look for the second most variable direction by solving ( ) maximize β T X T HX β. β 2 =1,β T v 1 =0 This yields scores HX β 1 R n. These are the 1st principal component scores. Note we restricted our search so we would not just recover v 1 again. Not hard to see that if β = k j=1 a jv j this is solved by taking a 2 = 1, a j = 0, j / 1 30 / 1 Olympic data Higher order components In matrix terms, all the principal components scores are (HX V ) n k and the loadings are the columns of V. This information can be summarized in a biplot. The loadings describe how each feature contributes to each principal component score. 31 / 1 32 / 1

9 Olympic data: screeplot The importance of scale The PCA scores are not invariant to scaling of the features: for Q p p diagonal the PCA scores of X Q are not the same as X. Common to convert all variables to the same scale before applying PCA. Define the scalings to be the sample standard deviation of each feature. In matrix form, let S 2 = 1 ( ) n 1 diag X T HX Define X = HX S 1. The normalized PCA loadings are given by an eigenanalysis of X T X. 33 / 1 34 / 1 Olympic data Olympic data: screeplot 35 / 1 36 / 1

10 PCA and the SVD The singular value decomposition of a matrix tells us we can write X = U V T PCA and the SVD Recall X = HX S 1/2. We saw that normalized PCA loadings are given by an with k k = diag(δ 1,..., δ k ), δ j 0, k = rank( X ), U T U = V T V = I k k. Recall that the scores were X V = (U V T )V = U eigenanalysis of X T X. It turns out that, via the SVD, the normalized PCA scores are given by an eigenanalysis of X X T because ( X V ) n k = U Also, X T X = V 2 V T where U are eigenvectors of X X T = U 2 U T. so D = / 1 38 / 1 Another characterization of SVD Given a data matrix X (or its scaled centered version X ) we might try solving Another characterization of SVD It can be proven that minimize Z:rank(Z)=k X Z 2 F where F stands for Frobenius norm on matrices A 2 F = n p i=1 j=1 A 2 ij Z = S n k (V T ) k p where S n k is the matrix of the first k PCA scores and the columns of V are the first k PCA loadings. This approximation is related to the screeplot: the height of each bar describes the additional drop in Frobenius norm as the rank of the approximation increases. 39 / 1 40 / 1

11 Olympic data: screeplot Other types of dimension reduction Instead of maximizing sample variance, we might try maximizing some other quantity... Independent Component Analysis tries to maximize non-gaussianity of V β. In practice, it uses skewness, kurtosis or other moments to quantify non-gaussianity. These are both unsupervised approaches. Often, these are combined with supervised approaches into an algorithm like: Feature creation Build some set of features using PCA. Validate Use the derived features to see if they are helpful in the supervised problem. 41 / 1 42 / 1 Olympic data: 1st component vs. total score Part III 43 / 1 44 / 1

12 Olympic data Similarities Start with X which we assume is centered and standardized. The PCA loadings were given by eigenvectors of the correlation matrix which is a measure of similarity. The first 2 (or any 2) PCA scores yield an n 2 matrix that can be visualized as a scatter plot. Similarly, the first 2 (or any 2) PCA loadings yield an p 2 matrix that can be visualized as a scatter plot. 45 / 1 46 / 1 Similarities The PCA loadings were given by eigenvectors of the correlation matrix which is a measure of similarity. The visualization of the cases based on PCA scores were determined by the eigenvectors of X X T, an n n matrix. To see this, remember that Similarities The matrix X X T is a measure of similarity between cases. The matrix X T X is a measure of similarity between features. X = U V T Structure in the two similarity matrices yield insight into the set of cases, or the set of features... so X X T = U 2 U T X T X = V 2 V T 47 / 1 48 / 1

13 Distances Distances are inversely related to similarities. If A and B are similar, then d(a, B) should be small, i.e. they should be near. If A and B are distant, then they should not be similar. For a data matrix, there is a natural distance between cases: Distances Suggests a natural transformation between a similarity matrix S and a distance matrix D D ik = (S ii 2 S ik + S kk ) 1/2 The reverse transformation is not so obvious. Some suggestions from your book: d(x i, X k ) 2 = X i X k 2 2 p = X ij X kj 2 2 j=1 = (X X T ) ii 2(X X T ) ik + (X X T ) kk S ik = D ik = e D ik = D ik 49 / 1 50 / 1 Distances A distance (or a metric) on a set S is a function d : S S [0, + ) that satisfies d(x, x) = 0; d(x, y) = 0 x = y d(x, y) = d(y, x) d(x, y) d(x, z) + d(z, y) If d(x, y) = 0 for some x y then d is a pseudo-metric. Similarities A similarity on a set S is a function s : S S R and should satisfy s(x, x) s(x, y) for all x y s(x, y) = s(y, x) By adding a constant, we can often assume that s(x, y) / 1 52 / 1

14 Examples: nominal data The simplest example for nominal data is just the discrete metric { 0 x = y d(x, y) = 1 otherwise. The corresponding similarity would be { 1 x = y s(x, y) = 0 otherwise. = 1 d(x, y) Examples: ordinal data If S is ordered, we can think of S as (or identify S with) a subset of the non-negative integers. If S = m then a natural distance is d(x, y) = x y m 1 1 The corresponding similarity would be s(x, y) = 1 d(x, y) 53 / 1 54 / 1 Examples: vectors of continuous data If S = R k there are lots of distances determined by norms. The Minkowski p or l p norm, for p 1: ( k ) 1/p d(x, y) = x y p = x i y i p i=1 Examples: vectors of continuous data If 0 < p < 1 this is not a norm, but it still defines a metric. The preceding transformations can be used to construct similarities. Examples: binary vectors Examples: p = 2 the usual Euclidean distance, l 2 p = 1 d(x, y) = k i=1 x i y i, the taxicab distance, l 1 p = the d(x, y) = max 1 i k x i y i, the sup norm, l If S = {0, 1} k R k the vectors can be thought of as vectors of bits. The l 1 norm counts the number of mismatched bits. This is known as Hamming distance. 55 / 1 56 / 1

15 Example: Mahalanobis distance Given Σ k k that is positive definite, we define Mahalanobis distance on R k by d Σ (x, y) = ( ) 1/2 (x y) T Σ 1 (x y) Example: similarities for binary vectors Given binary vectors x, y {0, 1} k we can summarize their agreement in a 2 2 table This is the usual Euclidean distance, with a change of basis given by a rotation and stretching of the axes. If Σ is only non-negative definite, then we can replace Σ 1 with Σ, its pseudo-inverse. This yields a pseudo-metric because it fails the test x = 0 x = 1 total y y = 0 f 00 f 01 f 0 y = 1 f 10 f 11 f 1 total x f 0 f 1 k d(x, y) = 0 x = y. 57 / 1 58 / 1 Example: similarities for binary vectors We define the simple matching coefficient (SMC) similarity by Example: cosine similarity & correlation Given vectors x, y R k the cosine similarity is defined as SMC(x, y) = f 00 + f 11 = f 00 + f 11 f 00 + f 01 + f 10 + f 11 k 1 cos(x, y) = x, y x 2 y 2 1. = 1 x y 1 k The correlation between two vectors is defined as The Jaccard coefficient ignores entries where x i = y i = 0 J(x, y) = f 11 f 01 + f 10 + f cor(x, y) = x x 1, y ȳ 1 x x 1 2 y ȳ / 1 60 / 1

16 Example: correlation An alternative, perhaps more familiar definition: cor(x, y) = S xy S x S y S xy = 1 k 1 S 2 x = 1 k 1 S 2 y = 1 k 1 k (x i x)(y i ȳ) i=1 k (x i x) 2 i=1 k (y i ȳ) 2 i=1 Correlation & PCA The matrix 1 n 1 X T X was actually the matrix of pair-wise correlations of the features. Why? How? 1 ( X T ) X n 1 = 1 (D 1/2 X T HX D 1/2) ij n 1 The diagonal entries of D are the sample variances of each feature. The inner matrix multiplication computes the pair-wise dot-products of the columns of HX ( n X T HX )ij = (X ki X i )(X kj X j ) k=1 61 / 1 62 / 1 High positive correlation High negative correlation 63 / 1 64 / 1

17 No correlation Small positive 65 / 1 66 / 1 Small negative Combining similarities In a given data set, each case may have many attributes or features. Example: see the health data set for HW 1. To compute similarities of cases, we must pool similarities across features. In a data set with M different features, we write x i = (x i1,..., x im ), with each x im S m. 67 / 1 68 / 1

18 Combining similarities Given similarities s m on each S m we can define an overall similarity between case x i and x j by s(x i, x j ) = M w m s m (x im, x jm ) m=1 with optional weights w m for each feature. Your book modifies this to deal with asymmetric attributes, i.e. attributes for which Jaccard similarity might be used. 69 / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of