Distances & Similarities

Similar documents
Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Descriptive Data Summarization

Statistics 202: Data Mining. © Jonathan Taylor. Week 2. Based in part on slides from textbook, slides of Susan Holmes. October 3,

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Notion of Distance. Metric Distance Binary Vector Distances Tangent Distance

Preprocessing & dimensionality reduction

proximity similarity dissimilarity distance Proximity Measures:

University of Florida CISE department Gator Engineering. Clustering Part 1

Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1]

Measurement and Data

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Vector spaces and similarity measures

Type of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr

CS249: ADVANCED DATA MINING

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

ICML - Kernels & RKHS Workshop. Distances and Kernels for Structured Objects

Data Mining: Concepts and Techniques

Similarity and Dissimilarity

Fundamentals of Similarity Search

Lecture 6: September 19

Data Mining 4. Cluster Analysis

Measurement and Data. Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality

Data Preprocessing. Cluster Similarity

Review Packet 1: $B = \begin{pmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \\ B_{41} & B_{42} & B_{43} \end{pmatrix}$

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Recent advances in Time Series Classification

Decoding linear codes via systems solving: complexity issues

10-701/ Recitation : Kernels

MATH 304 Linear Algebra Lecture 19: Least squares problems (continued). Norms and inner products.

Lecture 7. Econ August 18

Figure 1: Examples of univariate Gaussian pdfs $N(x; \mu, \sigma^2)$.

Statistical Machine Learning

Linear Algebra Methods for Data Mining

4 Divergence theorem and its consequences

Spacetime and 4 vectors

CS6220: DATA MINING TECHNIQUES

Lecture Notes 1: Vector spaces

Duke University, Department of Electrical and Computer Engineering. Optimization for Scientists and Engineers. © Alex Bronstein, 2014

Review (Probability & Linear Algebra)

Algorithms for Data Science: Lecture on Finding Similar Items

COMPLETE METRIC SPACES AND THE CONTRACTION MAPPING THEOREM

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Radial-Basis Function Networks

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

Lecture 2: Vector Spaces, Metric Spaces

Data Mining: Data. Lecture Notes for Chapter 2 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Admin. Assignment 1 is out (due next Friday, start early). Tutorials start Monday: Office hours: Sign up for the course page on Piazza.

Least squares problems

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points

Multivariate Random Variable

Supremum of simple stochastic processes

Distance Preservation - Part I

CS-E3210 Machine Learning: Basic Principles

CIS 520: Machine Learning Oct 09, Kernel Methods

Functional Analysis Exercise Class

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

CISC 4631 Data Mining

Multivariate Analysis

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Math 341: Convex Geometry. Xi Chen

Kernel Method: Data Analysis with Positive Definite Kernels

2. Isometries and Rigid Motions of the Real Line

PCA, Kernel PCA, ICA

Lecture 1.4: Inner products and orthogonality

CPSC 340: Machine Learning and Data Mining. Data Exploration Fall 2016

Course 212: Academic Year Section 1: Metric Spaces

MASTER OF SCIENCE IN ANALYTICS 2014 EMPLOYMENT REPORT

(x, y) = d(x, y) = x y.

Mid Term-1 : Practice problems

Table of Contents. Multivariate methods. Introduction II. Introduction I

Discriminative Direction for Kernel Classifiers

Statistical comparisons for the topological equivalence of proximity measures

Similarity Search. The String Edit Distance. Nikolaus Augsten. Free University of Bozen-Bolzano Faculty of Computer Science DIS. Unit 2 March 8, 2012

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis

MSCBD 5002/IT5210: Knowledge Discovery and Data Minig

University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries

Two dimensional oscillator and central forces

Advanced Machine Learning & Perception

Neuroimage Processing

Kernel Methods in Machine Learning

Getting To Know Your Data

Math General Topology Fall 2012 Homework 1 Solutions

MA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation

High Dimensional Search: Min-Hashing, Locality Sensitive Hashing

Announcements. Proposals graded

Feature selection and extraction Spectral domain quality estimation Alternatives

Similarity Search. The String Edit Distance. Nikolaus Augsten.

Linear Algebra Massoud Malek

Lecture 35: December The fundamental statistical distances

Lecture Notes DRE 7007 Mathematics, PhD

Machine Learning (CS 567) Lecture 5

Design and Implementation of Speech Recognition Systems

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Transcription:

Introduction to Data Mining: Distances & Similarities
CPSC/AMTH 445a/545a, Guy Wolf (guy.wolf@yale.edu), Yale University, Fall 2016

Outline
1. Distance metrics: Minkowski distances (Euclidean distance, Manhattan distance), normalization & standardization, Mahalanobis distance, Hamming distance
2. Similarities and dissimilarities: correlation, Gaussian affinities, cosine similarities, Jaccard index
3. Dynamic time-warp: comparing misaligned signals, computing DTW dissimilarity
4. Combining similarities

Distance metrics: Metric spaces

Consider a dataset X as an arbitrary collection of data points.

Distance metric: a distance metric is a function $d : X \times X \to [0, \infty)$ that satisfies three conditions for any $x, y, z \in X$:
1. $d(x, y) = 0 \iff x = y$
2. $d(x, y) = d(y, x)$
3. $d(x, y) \le d(x, z) + d(z, y)$

The set X of data points together with an appropriate distance metric $d(\cdot, \cdot)$ is called a metric space.

Distance metrics: Euclidean distance

When $X \subseteq \mathbb{R}^n$ we can consider Euclidean distances:

Euclidean distance: the distance between $x, y \in X$ is defined by $\|x - y\|_2 = \sqrt{\sum_{i=1}^n (x[i] - y[i])^2}$.

One of the classic and most common distance metrics, but often inappropriate in realistic settings without proper preprocessing & feature extraction. Also used for least mean square error optimizations. Proximity under this metric requires all attributes to have equally small differences.

Distance metrics: Manhattan distance

Manhattan distance: the Manhattan distance between $x, y \in X$ is defined by $\|x - y\|_1 = \sum_{i=1}^n |x[i] - y[i]|$. This distance is also called the taxicab or cityblock distance. [Illustration taken from Wikipedia]

Distance metrics: Minkowski ($\ell_p$) distance

Minkowski distance: the Minkowski distance between $x, y \in X \subseteq \mathbb{R}^n$ is defined by $\|x - y\|_p^p = \sum_{i=1}^n |x[i] - y[i]|^p$ for some $p > 0$. This is also called the $\ell_p$ distance.

Three popular Minkowski distances are:
p = 1, Manhattan distance: $\|x - y\|_1 = \sum_{i=1}^n |x[i] - y[i]|$
p = 2, Euclidean distance: $\|x - y\|_2 = \sqrt{\sum_{i=1}^n |x[i] - y[i]|^2}$
p = ∞, supremum/$\ell_{\max}$ distance: $\|x - y\|_\infty = \sup_{1 \le i \le n} |x[i] - y[i]|$
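
To make the three special cases concrete, here is a minimal NumPy sketch (mine, not from the slides; the function name and example vectors are illustrative choices):

```python
import numpy as np

def minkowski(x, y, p):
    # l_p distance; p = np.inf gives the supremum (l_max) distance
    d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return d.max()
    return (d ** p).sum() ** (1.0 / p)

x, y = np.array([0.0, 1.0]), np.array([1.5, 1.5])
print(minkowski(x, y, 1))       # Manhattan: 2.0
print(minkowski(x, y, 2))       # Euclidean: ~1.581
print(minkowski(x, y, np.inf))  # supremum: 1.5
```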

Distance metrics: Normalization & standardization

Minkowski distances require normalization to deal with varying magnitudes, scalings, distributions, or measurement units.

Min-max normalization: $\text{minmax}(x)[i] = \frac{x[i] - m_i}{r_i}$, where $m_i$ and $r_i$ are the min value and range of attribute $i$.
Z-score standardization: $\text{zscore}(x)[i] = \frac{x[i] - \mu_i}{\sigma_i}$, where $\mu_i$ and $\sigma_i$ are the mean and STD of attribute $i$.
Log attenuation: $\text{logatt}(x)[i] = \text{sgn}(x[i]) \log(|x[i]| + 1)$
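
The three transformations map directly to array operations; a small sketch under the assumption that data points are the rows of a matrix X (function names are mine):

```python
import numpy as np

def minmax(X):
    # scale each attribute (column) to [0, 1] by its min value and range
    m = X.min(axis=0)
    return (X - m) / (X.max(axis=0) - m)

def zscore(X):
    # center each attribute and scale it to unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

def logatt(X):
    # sign-preserving logarithmic attenuation of large magnitudes
    return np.sign(X) * np.log(np.abs(X) + 1)
```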

Distance metrics: Mahalanobis distance

Mahalanobis distance: the Mahalanobis distance is defined by $\text{mahal}(x, y) = \sqrt{(x - y) \Sigma^{-1} (x - y)^T}$, where $\Sigma$ is the covariance matrix of the data and data points are represented as row vectors.

When all attributes are independent with unit standard deviation (e.g., z-scored), then $\Sigma = \mathrm{Id}$ and we get the Euclidean distance.

When all attributes are independent with variances $\sigma_i^2$, then $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ and we get $\text{mahal}(x, y) = \sqrt{\sum_{i=1}^n \left(\frac{x[i] - y[i]}{\sigma_i}\right)^2}$, which is the Euclidean distance between z-scored data points.

Example: for $\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$ and the points $x = (0, 1)$, $y = (0.5, 0.5)$, $z = (1.5, 1.5)$, we get $d(x, y) = \sqrt{5}$ and $d(y, z) = \sqrt{4} = 2$. [Figure: the three points plotted against the covariance ellipse]
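
The example is easy to verify numerically; a short sketch (the helper name `mahalanobis` is mine):

```python
import numpy as np

def mahalanobis(x, y, cov):
    # Mahalanobis distance between row vectors x and y under covariance cov
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
x, y, z = np.array([0.0, 1.0]), np.array([0.5, 0.5]), np.array([1.5, 1.5])
print(mahalanobis(x, y, cov))  # sqrt(5) ~ 2.236
print(mahalanobis(y, z, cov))  # sqrt(4) = 2.0
```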

Distance metrics: Hamming distance

When the data contains nominal values, we can use Hamming distances:

Hamming distance: the Hamming distance is defined as $\text{hamm}(x, y) = \#\{1 \le i \le n : x[i] \neq y[i]\}$ for data points $x, y$ that contain $n$ nominal attributes. This distance is equivalent to the $\ell_1$ distance over a binary-flag representation.

Example: if x = ('big', 'black', 'cat'), y = ('small', 'black', 'rat'), and z = ('big', 'blue', 'bulldog'), then hamm(x, y) = hamm(x, z) = 2 and hamm(y, z) = 3.
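
A direct transcription of the definition into Python (my sketch, reproducing the example above):

```python
def hamming(x, y):
    # number of attributes on which two nominal tuples disagree
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

x = ('big', 'black', 'cat')
y = ('small', 'black', 'rat')
z = ('big', 'blue', 'bulldog')
print(hamming(x, y), hamming(x, z), hamming(y, z))  # 2 2 3
```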

Similarities and dissimilarities: Similarities / affinities

Similarities or affinities quantify whether, or how much, data points are similar.

Similarity/affinity measure: we will consider a similarity or affinity measure as a function $a : X \times X \to [0, 1]$ such that for every $x, y \in X$: $a(x, x) = a(y, y) = 1$ and $a(x, y) = a(y, x)$.

Dissimilarities quantify the opposite notion, and typically take values in $[0, \infty)$, although they are sometimes normalized to finite ranges. Distances can serve as a way to measure dissimilarities.

Similarities and dissimilarities: Simple similarity measures

Similarities and dissimilarities: Correlation

Similarities and dissimilarities: Gaussian affinities

Given a distance metric $d(x, y)$, we can use it to formulate Gaussian affinities:

Gaussian affinities: Gaussian affinities are defined as $k(x, y) = \exp\left(-\frac{d(x, y)^2}{2\varepsilon}\right)$ given a distance metric $d$.

Essentially, data points are similar if they are within the same spherical neighborhood w.r.t. the distance metric, whose radius is determined by $\varepsilon$. For Euclidean distances these are also known as RBF (radial basis function) affinities.
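
A minimal sketch of the Euclidean (RBF) case; the function name and the choice eps=0.5 are illustrative:

```python
import numpy as np

def gaussian_affinity(x, y, eps):
    # RBF affinity: near 1 inside the eps-scale neighborhood, near 0 outside it
    d2 = np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
    return np.exp(-d2 / (2.0 * eps))

print(gaussian_affinity([0, 0], [0.1, 0.1], eps=0.5))  # ~0.98, very similar
print(gaussian_affinity([0, 0], [3.0, 4.0], eps=0.5))  # ~1.4e-11, dissimilar
```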

Similarities and dissimilarities: Cosine similarities

Another similarity metric in Euclidean space is based on the inner product (i.e., dot product) $\langle x, y \rangle = \|x\| \|y\| \cos(\angle xy)$:

Cosine similarity: the cosine similarity between $x, y \in X \subseteq \mathbb{R}^n$ is defined as $\cos(x, y) = \frac{\langle x, y \rangle}{\|x\| \|y\|}$.
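
Rearranging the inner-product identity gives the implementation directly (my sketch):

```python
import numpy as np

def cosine_similarity(x, y):
    # cosine of the angle between two nonzero vectors
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal directions)
print(cosine_similarity([1, 1], [2, 2]))  # 1.0 (same direction, any magnitude)
```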

Similarities and dissimilarities: Jaccard index

For data with $n$ binary attributes we consider two similarity metrics:

Simple matching coefficient: $\text{SMC}(x, y) = \frac{\sum_{i=1}^n x[i] \wedge y[i] + \sum_{i=1}^n \neg x[i] \wedge \neg y[i]}{n}$
Jaccard coefficient: $J(x, y) = \frac{\sum_{i=1}^n x[i] \wedge y[i]}{\sum_{i=1}^n x[i] \vee y[i]}$

The Jaccard coefficient can be extended to continuous attributes:

Tanimoto (extended Jaccard) coefficient: $T(x, y) = \frac{\langle x, y \rangle}{\|x\|^2 + \|y\|^2 - \langle x, y \rangle}$
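
A sketch of all three coefficients under the assumption that binary attributes are stored as 0/1 arrays (function names are mine):

```python
import numpy as np

def smc(x, y):
    # fraction of attributes on which x and y agree (1-1 and 0-0 matches)
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return np.mean(x == y)

def jaccard(x, y):
    # shared 1s over attributes where either vector has a 1
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return (x & y).sum() / (x | y).sum()

def tanimoto(x, y):
    # extended Jaccard coefficient for continuous attributes
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dot = x @ y
    return dot / (x @ x + y @ y - dot)

x, y = [1, 0, 1, 1, 0], [1, 1, 1, 0, 0]
print(smc(x, y), jaccard(x, y))  # 0.6 0.5
```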

Dynamic time-warp: Comparing misaligned signals

[Figure: two signals with similar shapes that are shifted in time relative to one another]

Theoretically: use a time offset to align the signals. Realistically: which offset should we use?

Dynamic time-warp: Adaptive alignment

Dynamic time-warp: Computing DTW dissimilarity

[Figure: pairwise difference matrix between Signal x (index $i$) and Signal y (index $j$); each cell holds the difference between two signal entries, $x[i] - y[j]$]

[Figure: an alignment path through the matrix; to be valid it must get from the start to the end of both signals]

1:1 alignment: trivial, nothing is modified by the alignment. Aligned distance: $\sum_i (x[i] - y[i])^2 = \|x - y\|_2^2$.

Time offset: works sometimes, but is not always optimal. Aligned distance: $\sum (\cdot)^2 = ?$

Extreme offset: complete misalignment, the worst alignment alternative. Aligned distance: $\sum (\cdot)^2 = \|x\|^2 + \|y\|^2$.

Optimal alignment: optimize the alignment by minimizing the aligned distance. Aligned distance: $\sum (\cdot)^2 = \min$.

Dynamic time-warp: Dynamic programming algorithm

Dynamic programming: a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure, and gives better performance than naive methods that do not exploit the subproblem overlap.

DTW algorithm: for each signal-time $i$ and each signal-time $j$:
Set $\text{cost} \leftarrow (x[i] - y[j])^2$
Set the optimal distance at stage $[i, j]$ to $DTW[i, j] \leftarrow \text{cost} + \min\left(DTW[i, j-1],\; DTW[i-1, j-1],\; DTW[i-1, j]\right)$

Optimal distance: $DTW[m, n]$ (where $m$ & $n$ are the lengths of the signals). Optimal alignment: backtrack the path leading to $DTW[m, n]$ via the min-cost choices of the algorithm.
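
The recurrence translates directly into a few lines of Python; a minimal sketch (the infinity border used for initialization is my choice, as the slides do not specify boundary handling):

```python
import numpy as np

def dtw(x, y):
    # dynamic-programming DTW: D[i, j] = cost(i, j) + min of three predecessors
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i, j - 1], D[i - 1, j - 1], D[i - 1, j])
    return D[m, n]

x = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])  # same shape, shifted by one step
print(dtw(x, y))             # 0.0 -- the warp absorbs the time shift
print(np.sum((x - y) ** 2))  # 4.0 -- the rigid 1:1 alignment does not
```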

Dynamic time-warp: Remark about earth-mover distances (EMD)

What is the cost of transforming one distribution into another?

$EMD_p^p(x, y) = \min\left\{\sum_{i=1}^n \sum_{j=1}^n |i - j|^p \, \Omega_{ij} \;:\; \sum_{j=1}^n \Omega_{ij} = x[i], \; \sum_{i=1}^n \Omega_{ij} = y[j]\right\}$

where $\Omega$ is a moving strategy (transferring $\Omega_{ij}$ mass from $i$ to $j$). It can be solved with the Hungarian algorithm, but more efficient methods exist that rely on wavelets and mathematical analysis.
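
For the one-dimensional, p = 1 case there is a well-known shortcut not mentioned on the slide: the EMD equals the $\ell_1$ distance between cumulative sums, so no optimization over moving strategies is needed. A sketch assuming two histograms of equal total mass on bins $1, \ldots, n$:

```python
import numpy as np

def emd_1d(x, y):
    # 1-D earth-mover distance with p = 1, via cumulative sums
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    assert np.isclose(x.sum(), y.sum()), "histograms must carry equal mass"
    return np.abs(np.cumsum(x) - np.cumsum(y)).sum()

print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2.0: one unit of mass moves two bins
```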

Combining similarities

To combine similarities of different attributes we can consider several alternatives (a sketch of the weighted combinations follows below):
1. Transform all the attributes to conform to the same similarity/distance metric.
2. Use a weighted average to combine similarities, $a(x, y) = \sum_{i=1}^n w_i a_i(x, y)$, or distances, $d^2(x, y) = \sum_{i=1}^n w_i d_i^2(x, y)$, with $\sum_{i=1}^n w_i = 1$.
3. Consider asymmetric attributes by defining binary flags $\delta_i(x, y) \in \{0, 1\}$ that mark whether two data points share comparable information in affinity $i$, and then combine only the comparable information: $a(x, y) = \frac{\sum_{i=1}^n w_i \delta_i(x, y) a_i(x, y)}{\sum_{i=1}^n \delta_i(x, y)}$.
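
A small sketch of alternatives 2 and 3 (the function name and example numbers are mine):

```python
import numpy as np

def combine_affinities(affinities, weights, flags=None):
    # weighted average of per-attribute affinities a_i(x, y);
    # optional 0/1 flags delta_i(x, y) restrict it to comparable attributes
    a, w = np.asarray(affinities, dtype=float), np.asarray(weights, dtype=float)
    if flags is None:
        return float(np.sum(w * a))  # assumes the weights sum to 1
    f = np.asarray(flags, dtype=float)
    return float(np.sum(w * f * a) / np.sum(f))

a, w = [0.9, 0.2, 0.5], [0.5, 0.25, 0.25]
print(combine_affinities(a, w))             # 0.625
print(combine_affinities(a, w, [1, 0, 1]))  # 0.2875: attribute 2 ignored
```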

Summary

To compare data points we can either (1) quantify how similar they are with a similarity or affinity metric, or (2) quantify how different they are with a dissimilarity or a distance metric. There are many possible metrics (e.g., Euclidean, Mahalanobis, Hamming, Gaussian, cosine, Jaccard), and the choice of which one to use depends on both the task and the input data. It is sometimes useful to consider several different metrics and then combine them. Alternatively, data preprocessing can be done to transform all the data to conform with a single metric.