Multivariate distance 2017 Fall
Contents
1. Euclidean distance
   Definitions
   Standardization
2. Population distance
   Population mean and variance
   Definitions
3. Proportions
4. Presence-absence data
Multivariate distance

Example: consider three races: Korean, Japanese, African.
Korean and Japanese are closer to each other than Korean and African (or Japanese and African).
Why? Appearance, culture, geography, language, ...

Most multivariate problems can be viewed in terms of distances between single observations.
For each observation there are p measurements (p variables).
How should a (multivariate) distance between two observations be defined from the p measurements?
Euclidean distance

1. There are n observations: $X_1, X_2, \ldots, X_n$.
2. Each observation has p variables: $X_i^t = (X_{i1}, X_{i2}, \ldots, X_{ip})$.
3. Euclidean distance between two observations:

$$d_{ij} = d_{ij}(X_i, X_j) = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2}$$
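The definition above can be sketched in a few lines of Python. The function name `euclidean_distance` and the two observations are hypothetical, chosen only so the arithmetic is easy to follow; nothing here comes from the course data.

```python
import math


def euclidean_distance(x_i, x_j):
    """Euclidean distance between two observations with p measurements each."""
    if len(x_i) != len(x_j):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


# Hypothetical observations with p = 3 variables (illustrative values only).
x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 6.0, 3.0]

# Differences are 3, 4 and 0, so the distance is sqrt(9 + 16 + 0) = 5.
d12 = euclidean_distance(x1, x2)
```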
Standardization

Example: two measurements are obtained from each person, height and tooth dimension (both in mm).
The variation in height would be 20-30 mm, whereas the variation in tooth dimension would be 1-2 mm.

$$d_{ij} = \sqrt{(X_{i1} - X_{j1})^2 + (X_{i2} - X_{j2})^2}$$

The distance is affected mostly by height.

1. It is desirable for all variables to have about the same influence on the distance.
2. Standardization: divide each value by the standard deviation of its variable.

$$d_{ij} = \sqrt{\left(\frac{X_{i1} - X_{j1}}{\mathrm{sd}(x_1)}\right)^2 + \left(\frac{X_{i2} - X_{j2}}{\mathrm{sd}(x_2)}\right)^2}$$
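Standardization

Let $\bar{X}_k$ and $s_k$ be the sample mean and standard deviation of the kth variable, with data

$$\begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}$$

and

$$\bar{X}_k = \frac{1}{n} \sum_{i} X_{ik}, \qquad s_k^2 = \frac{1}{n-1} \sum_{i} (X_{ik} - \bar{X}_k)^2 .$$

The standardized variable $Z_{ik}$ is defined by

$$Z_{ik} = \frac{X_{ik} - \bar{X}_k}{s_k}$$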
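As a rough illustration of why standardization matters, the sketch below computes z-scores with the n - 1 standard deviation and compares the two coordinate differences before and after. The helper `standardize` and the four height/tooth values are hypothetical; they were picked so that height varies by tens of millimetres and tooth dimension by about one.

```python
import math


def standardize(data):
    """Return z-scores column by column, using the n - 1 sample standard deviation."""
    n = len(data)
    p = len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(p)]
    sds = [
        math.sqrt(sum((row[k] - means[k]) ** 2 for row in data) / (n - 1))
        for k in range(p)
    ]
    return [[(row[k] - means[k]) / sds[k] for k in range(p)] for row in data]


# Hypothetical data: (height, tooth dimension) in mm for four people.
data = [
    [1700.0, 10.0],
    [1720.0, 11.0],
    [1740.0, 12.0],
    [1760.0, 13.0],
]
z = standardize(data)

# Raw coordinate differences between persons 1 and 2: 20 mm versus 1 mm,
# so height dominates the raw Euclidean distance.
raw_diffs = [abs(a - b) for a, b in zip(data[0], data[1])]

# After standardization the two differences are on the same scale.
std_diffs = [abs(a - b) for a, b in zip(z[0], z[1])]
```

With these particular values both standardized differences come out equal, so each variable contributes the same amount to the distance.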
Populations and parameters

There are m populations and p variables.
$\mu_{ki}$ = the mean of the variable $X_k$ in the ith population
$V_k$ = the population variance of the variable $X_k$
$V_{rs}$ = the population covariance of the two variables $X_r$ and $X_s$

Assume that the population means differ among groups, but that the population variances are the same in every group.
Definition of population distances

Definition (Penrose)

$$P_{ij} = \sum_{k=1}^{p} \frac{(\mu_{ki} - \mu_{kj})^2}{p V_k}$$

Note: $P_{ij}$ ignores the correlations among the variables.

Definition (Mahalanobis)

$$D_{ij}^2 = \sum_{r=1}^{p} \sum_{s=1}^{p} (\mu_{ri} - \mu_{rj}) \, v^{rs} \, (\mu_{si} - \mu_{sj})$$

where $v^{rs}$ is the element in the rth row and sth column of the inverse of the population covariance matrix.
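A minimal sketch of the Penrose distance follows. The function name `penrose_distance`, the two mean vectors and the variances are all hypothetical, chosen so the result can be checked by hand.

```python
def penrose_distance(mu_i, mu_j, variances):
    """Penrose distance between two populations.

    mu_i, mu_j: mean vectors of the two populations (length p)
    variances:  common variances V_1, ..., V_p of the p variables
    """
    p = len(variances)
    return sum(
        (a - b) ** 2 / (p * v) for a, b, v in zip(mu_i, mu_j, variances)
    )


# Hypothetical population parameters with p = 2 variables.
mu_1 = [10.0, 20.0]
mu_2 = [12.0, 17.0]
variances = [4.0, 9.0]

# (10 - 12)^2 / (2 * 4) + (20 - 17)^2 / (2 * 9) = 0.5 + 0.5 = 1.0
p_12 = penrose_distance(mu_1, mu_2, variances)
```

Only the variances enter the calculation, which is why the measure cannot account for correlated variables.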
Mahalanobis distance

Quadratic form for the Mahalanobis distance between two populations:

$$D_{ij}^2 = (\mu_i - \mu_j)^t V^{-1} (\mu_i - \mu_j)$$

where $\mu_i$ is the population mean vector of the ith group, such that $\mu_i^t = (\mu_{1i}, \mu_{2i}, \ldots, \mu_{pi})$, and $V$ is the population covariance matrix.

Mahalanobis distance of one observation from the population center:

$$D^2 = (x - \mu)^t V^{-1} (x - \mu)$$

where $x$ is the observation vector, such that $x^t = (x_1, x_2, \ldots, x_p)$.
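The quadratic form can be evaluated as the double sum over the elements of the inverse covariance matrix. The sketch below is restricted to p = 2 so that the inverse can be written in closed form without a linear algebra library; the helpers `inverse_2x2` and `mahalanobis_sq` and all numbers are hypothetical.

```python
def inverse_2x2(v):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = v
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(mu_i, mu_j, v_inv):
    """Squared Mahalanobis distance as a double sum over the inverse covariance."""
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    p = len(diff)
    return sum(
        diff[r] * v_inv[r][s] * diff[s] for r in range(p) for s in range(p)
    )


# Hypothetical population parameters with p = 2 variables.
mu_1 = [10.0, 20.0]
mu_2 = [12.0, 17.0]

# Covariance matrix with a positive covariance between the two variables.
v_corr = [[4.0, 2.0], [2.0, 9.0]]
d2_corr = mahalanobis_sq(mu_1, mu_2, inverse_2x2(v_corr))  # 96 / 32 = 3.0

# Same variances but zero covariance: D^2 reduces to the sum of squared
# differences divided by the variances, 4 / 4 + 9 / 9 = 2.0.
v_diag = [[4.0, 0.0], [0.0, 9.0]]
d2_diag = mahalanobis_sq(mu_1, mu_2, inverse_2x2(v_diag))
```

When the covariances are zero, the result equals p times the Penrose distance for the same means and variances. Once the variables are correlated the two measures no longer agree.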
Mahalanobis distance

A large value of the Mahalanobis distance implies that the observation may be
1. a genuine but unlikely record,
2. an observation from another distribution, or
3. a record containing some mistake.

If $\mu$ and $V$ are unknown, they should be estimated:

$$D^2 = (x - \hat{\mu})^t \hat{V}^{-1} (x - \hat{\mu})$$

If the sample size is small, the estimate $\hat{V}$ is not stable and the Mahalanobis distance is then not reliable. Hence, it is better to use the Penrose distance if n < 100.
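The estimated version can be sketched by plugging the sample mean and the sample covariance matrix (with divisor n - 1) into the quadratic form. Everything below is hypothetical: the helper names, the two-variable restriction, and the five data points, which are far too few for a stable estimate and serve only to keep the arithmetic checkable.

```python
def sample_mean(data):
    """Column means of a list of observations."""
    n = len(data)
    return [sum(row[k] for row in data) / n for k in range(len(data[0]))]


def sample_covariance(data, mean):
    """Sample covariance matrix with divisor n - 1."""
    n, p = len(data), len(mean)
    return [
        [
            sum((row[r] - mean[r]) * (row[s] - mean[s]) for row in data) / (n - 1)
            for s in range(p)
        ]
        for r in range(p)
    ]


def inverse_2x2(v):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = v
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(x, mean, v_inv):
    """Squared Mahalanobis distance of one observation from the estimated center."""
    diff = [a - b for a, b in zip(x, mean)]
    p = len(diff)
    return sum(
        diff[r] * v_inv[r][s] * diff[s] for r in range(p) for s in range(p)
    )


# Hypothetical sample: n = 5 observations on p = 2 positively related variables.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 5.0], [10.0, 4.0]]
mu_hat = sample_mean(data)                 # [6.0, 3.0]
v_hat = sample_covariance(data, mu_hat)    # [[10.0, 4.0], [4.0, 2.5]]
v_hat_inv = inverse_2x2(v_hat)

# A point that follows the positive trend, and one that goes against it.
d2_typical = mahalanobis_sq([10.0, 4.0], mu_hat, v_hat_inv)  # 18 / 9 = 2
d2_unusual = mahalanobis_sq([10.0, 1.0], mu_hat, v_hat_inv)  # 144 / 9 = 16
```

The two test points are at a similar Euclidean distance from the center, but the one that contradicts the correlation between the variables gets a much larger value, which is the kind of record the list above asks to be checked.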
Distance with proportions

Example (the election poll): a survey was conducted for the presidential election. There are three candidates, and the results are reported for two regions.

Region / Candidate    1        2        3        ?
A                     $p_1$    $p_2$    $p_3$    $p_4$
B                     $q_1$    $q_2$    $q_3$    $q_4$

Note that the proportions within a region sum to 1 ($p_1 + p_2 + p_3 + p_4 = 1$).
How should the distance between the two groups be defined in terms of the proportions?
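Distance with proportions

Definition (Distance I)

$$d_1 = \sum_{i=1}^{K} \frac{|p_i - q_i|}{2}$$

Note that $0 \le d_1 \le 1$.

Definition (Distance II)

$$d_2 = 1 - \frac{\sum_{i=1}^{K} p_i q_i}{\left[ \sum_{i=1}^{K} p_i^2 \, \sum_{i=1}^{K} q_i^2 \right]^{1/2}}$$

Note that $0 \le d_2 \le 1$.

Definition (Similarity s)

$$s = 1 - d \quad \text{or} \quad s = 1/d \quad \text{or} \quad s = 1/(1 + d)$$

where d is any distance measure.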
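Both distances are straightforward to compute. In the sketch below the function names `distance_1` and `distance_2` and the poll proportions are hypothetical; the proportions are invented purely for illustration and are not survey results.

```python
import math


def distance_1(p, q):
    """Half the sum of absolute differences between two sets of proportions."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


def distance_2(p, q):
    """One minus the normalized cross-product of the two sets of proportions."""
    cross = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    return 1 - cross / norm


# Hypothetical proportions for regions A and B over K = 4 categories.
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]

d1 = distance_1(p, q)  # (0.3 + 0.1 + 0.1 + 0.3) / 2 = 0.4
d2 = distance_2(p, q)  # 1 - 0.2 / 0.3 = 1/3

# One of the similarity conversions from the definition above.
s_from_d1 = 1 - d1
```

Both measures are 0 for identical proportions and 1 when the two regions share no category, which matches the stated bounds.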
Distance with presence-absence data

Example: presences and absences of two species at ten sites

site         1   2   3   4   5   6   7   8   9   10
species 1    0   0   1   1   1   0   1   1   1   0
species 2    1   1   1   1   0   0   0   0   1   1

Note that 1 = presence, 0 = absence.

The data can be summarized by a two-by-two contingency table:

                     species 2
species 1    present    absent    total
present      a          b         a + b
absent       c          d         c + d
total        a + c      b + d     n
Distance with presence-absence data

Definition (Simple matching index)

$$s = \frac{a + d}{n}$$

Definition (Ochiai index)

$$s = \frac{a}{[(a + b)(a + c)]^{1/2}}$$

Definition (Dice-Sorensen index)

$$s = \frac{2a}{2a + b + c}$$

Definition (Jaccard index)

$$s = \frac{a}{a + b + c}$$
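As a worked illustration, the sketch below tabulates a, b, c and d for the ten-site example from the presence-absence slide and evaluates the four indices. The function names are hypothetical, and the code does not guard against empty denominators (for example a species absent from every site).

```python
import math


def contingency_counts(species_1, species_2):
    """Return (a, b, c, d) for two 0/1 presence-absence vectors.

    a: both present, b: only species 1 present,
    c: only species 2 present, d: both absent.
    """
    a = b = c = d = 0
    for s1, s2 in zip(species_1, species_2):
        if s1 and s2:
            a += 1
        elif s1:
            b += 1
        elif s2:
            c += 1
        else:
            d += 1
    return a, b, c, d


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)


def ochiai(a, b, c, d):
    return a / math.sqrt((a + b) * (a + c))


def dice_sorensen(a, b, c, d):
    return 2 * a / (2 * a + b + c)


def jaccard(a, b, c, d):
    return a / (a + b + c)


# Ten-site example data (1 = presence, 0 = absence).
species_1 = [0, 0, 1, 1, 1, 0, 1, 1, 1, 0]
species_2 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

counts = contingency_counts(species_1, species_2)  # (3, 3, 3, 1)
indices = {
    "simple matching": simple_matching(*counts),   # 4 / 10
    "Ochiai": ochiai(*counts),                     # 3 / 6
    "Dice-Sorensen": dice_sorensen(*counts),       # 6 / 12
    "Jaccard": jaccard(*counts),                   # 3 / 9
}
```

Each function takes d as an argument to keep the call signature uniform, but only `simple_matching` actually uses it.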
Distance with presence-absence data

Note that
1. All four similarity measures take values between zero (no similarity) and one (complete similarity).
2. The number of joint absences d is not used in all four definitions: it appears only in the simple matching index.
3. If two species are both absent from many sites, there is a danger of concluding that the two species are similar merely because of those joint absences. Hence, there is some debate about whether d should be used in the definition of a similarity measure.