Distances and similarities. Statistics 202: Data Mining. Based in part on slides from textbook, slides of Susan Holmes. October 3, 2012.

Similar documents
Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Distances & Similarities

Descriptive Data Summarization

Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

proximity similarity dissimilarity distance Proximity Measures:

University of Florida CISE department Gator Engineering. Clustering Part 1

j=1 [We will show that the triangle inequality holds for each p-norm in Chapter 3 Section 6.] The 1-norm is A F = tr(a H A).

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Assignment 3: Chapter 2 & 3 (2.6, 3.8)

Mathematics for Graphics and Vision

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis

Stat 206: Sampling theory, sample moments, mahalanobis

2 Metric Spaces Definitions Exotic Examples... 3

Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes. November 9, Statistics 202: Data Mining

Linear Analysis Lecture 5

CISC 4631 Data Mining

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

Linear Algebra (Review) Volker Tresp 2017

Lecture 7. Econ August 18

Quadratic forms. Here. Thus symmetric matrices are diagonalizable, and the diagonalization can be performed by means of an orthogonal matrix.

Chapter II. Metric Spaces and the Topology of C

Data Mining: Concepts and Techniques

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Algorithms for Data Science: Lecture on Finding Similar Items

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1]

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

3d scatterplots. You can also make 3d scatterplots, although these are less common than scatterplot matrices.

Lecture 5 : Projections

Basic Math Review for CS4830

Fall TMA4145 Linear Methods. Exercise set Given the matrix 1 2

Notes on Linear Algebra and Matrix Theory

Notes. Relations. Introduction. Notes. Relations. Notes. Definition. Example. Slides by Christopher M. Bourke Instructor: Berthe Y.

Principal Component Analysis (PCA) Theory, Practice, and Examples

Data Preprocessing Tasks

Linear Algebra Review. Vectors

Linear Algebra Methods for Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

CS 246 Review of Linear Algebra 01/17/19

Professor Terje Haukaas University of British Columbia, Vancouver Notation

Introduction to Linear Regression

UC Berkeley Department of Electrical Engineering and Computer Sciences. EECS 126: Probability and Random Processes

Dimensionality Reduction Techniques (DRT)

Determine for which real numbers s the series n>1 (log n)s /n converges, giving reasons for your answer.

Lecture 2: Vector Spaces, Metric Spaces

12 CHAPTER 1. PRELIMINARIES Lemma 1.3 (Cauchy-Schwarz inequality) Let (; ) be an inner product in < n. Then for all x; y 2 < n we have j(x; y)j (x; x)

Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012

Linear Algebra & Geometry why is linear algebra useful in computer vision?

MASTER OF SCIENCE IN ANALYTICS 2014 EMPLOYMENT REPORT

This appendix provides a very basic introduction to linear algebra concepts.

= a. a = Let us now study what is a? c ( a A a )

Math 443 Differential Geometry Spring Handout 3: Bilinear and Quadratic Forms This handout should be read just before Chapter 4 of the textbook.

CS249: ADVANCED DATA MINING

MATH 431: FIRST MIDTERM. Thursday, October 3, 2013.

Introductory Analysis I Fall 2014 Homework #9 Due: Wednesday, November 19

PHY305: Notes on Entanglement and the Density Matrix

A Note on Two Different Types of Matrices and Their Applications

Data Mining 4. Cluster Analysis

Basic Math Review for CS1340

Linear Algebra and Eigenproblems

Metric Spaces Math 413 Honors Project

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr

Some practical remarks (recap) Statistical Natural Language Processing. Today s lecture. Linear algebra. Why study linear algebra?

Similarity and Dissimilarity

MA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2

(x, y) = d(x, y) = x y.

Functional Analysis MATH and MATH M6202

Archive of past papers, solutions and homeworks for. MATH 224, Linear Algebra 2, Spring 2013, Laurence Barker

Econ Lecture 14. Outline

Introduction to Machine Learning

PHYS 705: Classical Mechanics. Rigid Body Motion Introduction + Math Review

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Lecture 2. Econ August 11

Metric-based classifiers. Nuno Vasconcelos UCSD

EE263 Review Session 1

14 Singular Value Decomposition

1 Matrices and Systems of Linear Equations. a 1n a 2n

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Covariance to PCA. CS 510 Lecture #14 February 23, 2018

MATH 51H Section 4. October 16, Recall what it means for a function between metric spaces to be continuous:

Spazi vettoriali e misure di similaritá

Math 123, Week 2: Matrix Operations, Inverses

Measurement and Data

Analysis-3 lecture schemes

Image Registration Lecture 2: Vectors and Matrices

Information Retrieval

L5: Quadratic classifiers

Linear Algebra Review

Linear Algebra 2 Spectral Notes

A primer on matrices

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

Relations Graphical View

Lecture 2j Inner Product Spaces (pages )

Contravariant and Covariant as Transforms

Numerical Linear Algebra

MATH 320, WEEK 7: Matrices, Matrix Operations

Transcription:

Distances and similarities. Based in part on slides from textbook, slides of Susan Holmes. October 3, 2012. 1 / 1

Similarities Start with $X$, which we assume is centered and standardized. The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity. The first 2 (or any 2) PCA scores yield an $n \times 2$ matrix that can be visualized as a scatter plot. Similarly, the first 2 (or any 2) PCA loadings yield a $p \times 2$ matrix that can be visualized as a scatter plot. 2 / 1
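
To make this concrete, here is a minimal sketch (not from the slides) of how those two scatter plots might be produced with NumPy and Matplotlib, assuming X is an n x p array that has already been centered and standardized; the synthetic data and variable names are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy stand-in for the data: an (n, p) array, centered and standardized.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Thin SVD: X = U @ np.diag(d) @ Vt
U, d, Vt = np.linalg.svd(X, full_matrices=False)

scores = U * d            # n x p matrix of PCA scores
loadings = Vt.T           # p x p matrix of PCA loadings (eigenvectors of X^T X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(scores[:, 0], scores[:, 1])      # cases in the first two score coordinates
ax1.set_title("First two PCA scores (cases)")
ax2.scatter(loadings[:, 0], loadings[:, 1])  # features in the first two loading coordinates
ax2.set_title("First two PCA loadings (features)")
plt.show()
```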

Olympic data 3 / 1

Similarities The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity. The visualization of the cases based on PCA scores was determined by the eigenvectors of $X X^T$, an $n \times n$ matrix. To see this, remember that $X = U D V^T$ (the singular value decomposition, with $D$ diagonal), so
$$X X^T = U D^2 U^T, \qquad X^T X = V D^2 V^T.$$ 4 / 1
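
These two identities can be checked numerically; the following is a small sketch of my own with an arbitrary toy matrix, assuming NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))           # toy data matrix: n=6 cases, p=4 features
U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(d)

# X X^T = U D^2 U^T  (similarity between cases)
print(np.allclose(X @ X.T, U @ D**2 @ U.T))    # True
# X^T X = V D^2 V^T  (similarity between features)
print(np.allclose(X.T @ X, Vt.T @ D**2 @ Vt))  # True
```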

Similarities The matrix $X X^T$ is a measure of similarity between cases. The matrix $X^T X$ is a measure of similarity between features. Structure in the two similarity matrices yields insight into the set of cases, or the set of features... 5 / 1

Distances Distances are inversely related to similarities. If $A$ and $B$ are similar, then $d(A, B)$ should be small, i.e. they should be near. If $A$ and $B$ are distant, then they should not be similar. For a data matrix, there is a natural distance between cases:
$$d(X_i, X_k)^2 = \|X_i - X_k\|_2^2 = \sum_{j=1}^{p} (X_{ij} - X_{kj})^2 = (X X^T)_{ii} - 2 (X X^T)_{ik} + (X X^T)_{kk}$$ 6 / 1
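
The identity above gives a convenient way to get all pairwise squared distances from the case-similarity matrix $X X^T$ at once. A minimal NumPy sketch, with made-up data and my own variable names:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))      # 5 cases, 3 features

G = X @ X.T                          # similarity (Gram) matrix between cases
g = np.diag(G)
# D2[i, k] = G[i, i] - 2 G[i, k] + G[k, k]
D2 = g[:, None] - 2 * G + g[None, :]

# Agrees with directly computed squared Euclidean distances
direct = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
print(np.allclose(D2, direct))       # True
```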

Distances This suggests a natural transformation from a similarity matrix $S$ to a distance matrix $D$:
$$D_{ik} = (S_{ii} - 2 S_{ik} + S_{kk})^{1/2}$$
The reverse transformation is not so obvious. Some suggestions from your book:
$$S_{ik} = -D_{ik}, \qquad S_{ik} = e^{-D_{ik}}, \qquad S_{ik} = \frac{1}{1 + D_{ik}}$$ 7 / 1
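
A short sketch of the forward transform and the three suggested distance-to-similarity conversions; the helper names (similarity_to_distance, distance_to_similarity) are mine, not from the book.

```python
import numpy as np

def similarity_to_distance(S):
    """D_ik = (S_ii - 2 S_ik + S_kk)^(1/2); clamps at 0 in case of round-off."""
    s = np.diag(S)
    return np.sqrt(np.maximum(s[:, None] - 2 * S + s[None, :], 0.0))

def distance_to_similarity(D, how="inverse"):
    """Three options suggested in the text: negation, exponential, or 1/(1+D)."""
    if how == "neg":
        return -D
    if how == "exp":
        return np.exp(-D)
    return 1.0 / (1.0 + D)

S = np.array([[2.0, 1.0], [1.0, 3.0]])
D = similarity_to_distance(S)
print(D)
print(distance_to_similarity(D, how="exp"))
```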

Distances A distance (or a metric) on a set $S$ is a function $d : S \times S \to [0, +\infty)$ that satisfies
$d(x, x) = 0$;  $d(x, y) = 0 \iff x = y$
$d(x, y) = d(y, x)$
$d(x, y) \leq d(x, z) + d(z, y)$
If $d(x, y) = 0$ for some $x \neq y$, then $d$ is a pseudo-metric. 8 / 1

Similarities A similarity on a set $S$ is a function $s : S \times S \to \mathbb{R}$ and should satisfy
$s(x, x) \geq s(x, y)$ for all $x \neq y$
$s(x, y) = s(y, x)$
By adding a constant, we can often assume that $s(x, y) \geq 0$. 9 / 1

Examples: nominal data The simplest example for nominal data is just the discrete metric
$$d(x, y) = \begin{cases} 0 & x = y \\ 1 & \text{otherwise.} \end{cases}$$
The corresponding similarity would be
$$s(x, y) = \begin{cases} 1 & x = y \\ 0 & \text{otherwise} \end{cases} \; = \; 1 - d(x, y)$$ 10 / 1

Examples: ordinal data If $S$ is ordered, we can think of $S$ as (or identify $S$ with) a subset of the non-negative integers. If $|S| = m$, then a natural distance is
$$d(x, y) = \frac{|x - y|}{m - 1} \leq 1$$
The corresponding similarity would be $s(x, y) = 1 - d(x, y)$. 11 / 1
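
A tiny Python sketch of these two toy measures; the helper names and example values are mine, only the formulas come from the slides.

```python
def nominal_distance(x, y):
    """Discrete metric for nominal attributes: 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0

def ordinal_distance(x, y, m):
    """Ordinal attributes coded as 0, 1, ..., m-1, scaled into [0, 1]."""
    return abs(x - y) / (m - 1)

# Similarities are 1 minus the corresponding distance.
print(nominal_distance("red", "blue"), 1 - nominal_distance("red", "blue"))
print(ordinal_distance(1, 3, m=5), 1 - ordinal_distance(1, 3, m=5))
```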

Examples: vectors of continuous data If $S = \mathbb{R}^k$, there are lots of distances determined by norms. The Minkowski $p$ or $\ell_p$ norm, for $p \geq 1$:
$$d(x, y) = \|x - y\|_p = \left( \sum_{i=1}^{k} |x_i - y_i|^p \right)^{1/p}$$
Examples:
$p = 2$: the usual Euclidean distance, $\ell_2$
$p = 1$: $d(x, y) = \sum_{i=1}^{k} |x_i - y_i|$, the taxicab distance, $\ell_1$
$p = \infty$: $d(x, y) = \max_{1 \leq i \leq k} |x_i - y_i|$, the sup norm, $\ell_\infty$ 12 / 1

Examples: vectors of continuous data If $0 < p < 1$, this is no longer a norm (the triangle inequality fails), though $d(x, y) = \sum_{i=1}^{k} |x_i - y_i|^p$, without the $1/p$ power, still defines a metric. The preceding transformations can be used to construct similarities. Examples: binary vectors If $S = \{0, 1\}^k \subset \mathbb{R}^k$, the vectors can be thought of as vectors of bits. The $\ell_1$ norm counts the number of mismatched bits. This is known as the Hamming distance. 13 / 1
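
For concreteness, a sketch using SciPy's distance helpers (assuming scipy is installed; the example vectors are made up):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 1.0])

print(distance.minkowski(x, y, p=2))   # Euclidean (l2) distance
print(distance.cityblock(x, y))        # taxicab (l1) distance
print(distance.chebyshev(x, y))        # sup norm (l_infinity) distance

# Hamming distance for bit vectors: number of mismatched bits,
# which equals the l1 distance for 0/1 vectors.
a = np.array([0, 1, 1, 0, 1])
b = np.array([1, 1, 0, 0, 1])
print(int(np.sum(a != b)))              # 2 mismatches
print(distance.hamming(a, b) * len(a))  # scipy returns the fraction; rescale by k
```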

Example: Mahalanobis distance Given $\Sigma_{k \times k}$ that is positive definite, we define the Mahalanobis distance on $\mathbb{R}^k$ by
$$d_{\Sigma}(x, y) = \left( (x - y)^T \Sigma^{-1} (x - y) \right)^{1/2}$$
This is the usual Euclidean distance, with a change of basis given by a rotation and stretching of the axes. If $\Sigma$ is only non-negative definite, then we can replace $\Sigma^{-1}$ with $\Sigma^{\dagger}$, its pseudo-inverse. This yields a pseudo-metric because it fails the test $d(x, y) = 0 \implies x = y$. 14 / 1
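
A minimal NumPy sketch of this formula, with an arbitrary positive definite $\Sigma$ chosen only for illustration:

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """d_Sigma(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # positive definite, covariance-like matrix
x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

print(mahalanobis(x, y, Sigma))
# With Sigma = I this reduces to the usual Euclidean distance.
print(mahalanobis(x, y, np.eye(2)), np.linalg.norm(x - y))
```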

Example: similarities for binary vectors Given binary vectors $x, y \in \{0, 1\}^k$, we can summarize their agreement in a $2 \times 2$ table:

              x = 0     x = 1     total (y)
    y = 0     f_00      f_01      f_0.
    y = 1     f_10      f_11      f_1.
    total (x) f_.0      f_.1      k

15 / 1

Example: similarities for binary vectors We define the simple matching coefficient (SMC) similarity by
$$\mathrm{SMC}(x, y) = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{f_{00} + f_{11}}{k} = 1 - \frac{\|x - y\|_1}{k}$$
The Jaccard coefficient ignores entries where $x_i = y_i = 0$:
$$J(x, y) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}.$$ 16 / 1
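
A short sketch of both coefficients for 0/1 NumPy arrays; the function names and example bit vectors are my own.

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient: fraction of positions where x and y agree."""
    return np.mean(x == y)

def jaccard(x, y):
    """Jaccard coefficient: ignores positions where both x and y are 0."""
    f11 = np.sum((x == 1) & (y == 1))
    f01_f10 = np.sum(x != y)
    return f11 / (f01_f10 + f11)

x = np.array([1, 0, 0, 0, 1, 0, 1])
y = np.array([1, 1, 0, 0, 0, 0, 1])
print(smc(x, y))       # (f00 + f11) / k
print(jaccard(x, y))   # f11 / (f01 + f10 + f11)
```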

Example: cosine similarity & correlation Given vectors $x, y \in \mathbb{R}^k$, the cosine similarity is defined as
$$-1 \leq \cos(x, y) = \frac{\langle x, y \rangle}{\|x\|_2 \, \|y\|_2} \leq 1.$$
The correlation between two vectors is defined as
$$-1 \leq \mathrm{cor}(x, y) = \frac{\langle x - \bar{x} 1, \; y - \bar{y} 1 \rangle}{\|x - \bar{x} 1\|_2 \, \|y - \bar{y} 1\|_2} \leq 1.$$ 17 / 1

Example: correlation An alternative, perhaps more familiar definition:
$$\mathrm{cor}(x, y) = \frac{S_{xy}}{S_x S_y}$$
$$S_{xy} = \frac{1}{k - 1} \sum_{i=1}^{k} (x_i - \bar{x})(y_i - \bar{y})$$
$$S_x^2 = \frac{1}{k - 1} \sum_{i=1}^{k} (x_i - \bar{x})^2$$
$$S_y^2 = \frac{1}{k - 1} \sum_{i=1}^{k} (y_i - \bar{y})^2$$ 18 / 1
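
A brief NumPy sketch of both quantities, verifying that correlation is just the cosine similarity of the centered vectors (illustrative data only):

```python
import numpy as np

def cosine(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation(x, y):
    return cosine(x - x.mean(), y - y.mean())

rng = np.random.default_rng(3)
x = rng.standard_normal(50)
y = 0.5 * x + rng.standard_normal(50)

print(cosine(x, y))
print(correlation(x, y))
print(np.corrcoef(x, y)[0, 1])   # matches correlation(x, y)
```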

Correlation & PCA The matrix $\frac{1}{n-1} X^T X$ was actually the matrix of pair-wise correlations of the features. Why? How? In this computation write $X_0$ for the raw data, so that the centered and standardized matrix is $X = H X_0 D^{-1/2}$, where $H$ is the centering matrix and the diagonal entries of $D$ are the sample variances of each feature. Since $H$ is symmetric and idempotent,
$$\frac{1}{n-1} \left( X^T X \right)_{ij} = \frac{1}{n-1} \left( D^{-1/2} \, X_0^T H X_0 \, D^{-1/2} \right)_{ij}$$
The inner matrix multiplication computes the pair-wise dot-products of the columns of $H X_0$:
$$\left( X_0^T H X_0 \right)_{ij} = \sum_{k=1}^{n} \left( (X_0)_{ki} - \bar{X}_i \right)\left( (X_0)_{kj} - \bar{X}_j \right)$$ 19 / 1
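
A quick numerical check of this identity; this is a sketch of my own, and the toy data and names (X0, H, D) are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 4
X0 = rng.standard_normal((n, p)) * np.array([1.0, 2.0, 0.5, 3.0])  # raw data, p features

H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
D = np.diag(np.var(X0, axis=0, ddof=1))      # diagonal matrix of sample variances
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))

X = H @ X0 @ D_inv_sqrt                      # centered and standardized data
corr = (X.T @ X) / (n - 1)

# Both forms agree, and they match the usual correlation matrix of the raw data.
print(np.allclose(corr, D_inv_sqrt @ X0.T @ H @ X0 @ D_inv_sqrt / (n - 1)))  # True
print(np.allclose(corr, np.corrcoef(X0, rowvar=False)))                      # True
```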

High positive correlation 20 / 1

High negative correlation 21 / 1

No correlation 22 / 1

Small positive 23 / 1

Small negative 24 / 1

Combining similarities In a given data set, each case may have many attributes or features. Example: see the health data set for HW 1. To compute similarities of cases, we must pool similarities across features. In a data set with $M$ different features, we write $x_i = (x_{i1}, \ldots, x_{iM})$, with each $x_{im} \in S_m$. 25 / 1

Combining similarities Given similarities $s_m$ on each $S_m$, we can define an overall similarity between case $x_i$ and $x_j$ by
$$s(x_i, x_j) = \sum_{m=1}^{M} w_m \, s_m(x_{im}, x_{jm})$$
with optional weights $w_m$ for each feature. Your book modifies this to deal with asymmetric attributes, i.e. attributes for which Jaccard similarity might be used. 26 / 1
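
As a closing illustration, here is my own sketch (not the book's exact treatment of asymmetric attributes) of how such a weighted combination might look for a case with one nominal, one ordinal, and one binary asymmetric attribute; the per-feature similarity functions are toy choices.

```python
def combined_similarity(xi, xj, sims, weights):
    """s(xi, xj) = sum_m w_m * s_m(xi[m], xj[m]), following the slide's formula."""
    return sum(w * s(a, b) for w, s, a, b in zip(weights, sims, xi, xj))

# Per-feature similarity functions (toy choices).
s_nominal = lambda a, b: 1.0 if a == b else 0.0
s_ordinal = lambda a, b: 1.0 - abs(a - b) / 4           # ordinal scale 0..4
s_binary  = lambda a, b: 1.0 if (a == b == 1) else 0.0  # asymmetric: only 1/1 counts

xi = ("smoker", 3, 1)
xj = ("smoker", 1, 0)
print(combined_similarity(xi, xj, [s_nominal, s_ordinal, s_binary], weights=[1.0, 1.0, 1.0]))
```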
