Geog 20C: Phaedon C Kyriakidis

Setting

Data: pairs of two attributes X & Y, measured at N sampling units: $x_n$ and $y_n$; there are N pairs of attribute values $\{(x_n, y_n),\ n = 1, \ldots, N\}$.

Scatter plot: graph of y- versus x-values in attribute space: y-values serve as coordinates on the vertical axis, x-values as coordinates on the horizontal axis; the n-th point in the scatter plot has coordinates $(x_n, y_n)$.

[Figure: scatter plot of y- versus x-values]

Objective: provide a quantitative summary of the above scatter plot as a measure of association between x- and y-values.

Scatter Plot Quadrants

Scatter plot center: point $(\bar{x}, \bar{y})$ with coordinates equal to the data means: $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$ and $\bar{y} = \frac{1}{N}\sum_{n=1}^{N} y_n$.

[Figure: scatter plot split into quadrants I to IV around the center $(\bar{x}, \bar{y})$, labelled with the signs of the deviations]

Scatter plot quadrants: the line extending from $\bar{x}$ parallel to the y-axis and the line extending from $\bar{y}$ parallel to the x-axis define 4 quadrants in the scatter plot.

Deviations from the mean: any measure of association between X and Y should be independent of where the sample scatter plot is centered. Consequently, we will be looking at deviations of the data from their respective means: $x_n - \bar{x}$ and $y_n - \bar{y}$.
quadrant I: $x_n - \bar{x} > 0$ and $y_n - \bar{y} > 0$
quadrant II: $x_n - \bar{x} < 0$ and $y_n - \bar{y} > 0$
quadrant III: $x_n - \bar{x} < 0$ and $y_n - \bar{y} < 0$
quadrant IV: $x_n - \bar{x} > 0$ and $y_n - \bar{y} < 0$
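The quadrant assignment can be sketched in a few lines of Python. This is only an illustration: the function name `quadrant_counts`, the `on_axis` bucket for points with a zero deviation, and the toy data are assumptions, not part of the slides.

```python
def quadrant_counts(x, y):
    """Count scatter-plot points per quadrant around the center (x_bar, y_bar)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # "on_axis" is an assumed extra bucket for points with a zero deviation
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0, "on_axis": 0}
    for xn, yn in zip(x, y):
        dx, dy = xn - x_bar, yn - y_bar  # deviations from the means
        if dx > 0 and dy > 0:
            counts["I"] += 1
        elif dx < 0 and dy > 0:
            counts["II"] += 1
        elif dx < 0 and dy < 0:
            counts["III"] += 1
        elif dx > 0 and dy < 0:
            counts["IV"] += 1
        else:
            counts["on_axis"] += 1
    return counts


# Hypothetical toy data (both means equal 3)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 3.0, 5.0, 4.0]
counts = quadrant_counts(x, y)
print(counts)
```

For these toy values the points fall in quadrants I and III (plus one point at the center), which is the pattern one would expect for positively associated attributes.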
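Products of Data Deviations from their Means

Since we are after a measure of association, we compute products of data deviations from their means, e.g., $(x_n - \bar{x})(y_n - \bar{y})$.
A large positive product indicates large x- and y-deviations of the same sign.
A large negative product indicates large x- and y-deviations of different sign.

[Figure: scatter plot quadrants with the sign of the product in each: I > 0, II < 0, III > 0, IV < 0]

Product signs in different quadrants:
quadrant I: $x_n - \bar{x} > 0$ & $y_n - \bar{y} > 0$, so $(x_n - \bar{x})(y_n - \bar{y}) > 0$
quadrant II: $x_n - \bar{x} < 0$ & $y_n - \bar{y} > 0$, so $(x_n - \bar{x})(y_n - \bar{y}) < 0$
quadrant III: $x_n - \bar{x} < 0$ & $y_n - \bar{y} < 0$, so $(x_n - \bar{x})(y_n - \bar{y}) > 0$
quadrant IV: $x_n - \bar{x} > 0$ & $y_n - \bar{y} < 0$, so $(x_n - \bar{x})(y_n - \bar{y}) < 0$

Sample Covariance of a Scatter Plot

Products of deviations from means:
$(\mathbf{x} - \bar{x}\mathbf{1}) \circ (\mathbf{y} - \bar{y}\mathbf{1}) = [(x_1 - \bar{x})(y_1 - \bar{y}), \ldots, (x_n - \bar{x})(y_n - \bar{y}), \ldots, (x_N - \bar{x})(y_N - \bar{y})]^T$
where $\circ$ denotes element-by-element multiplication of two arrays; there are N such products $\{(x_n - \bar{x})(y_n - \bar{y}),\ n = 1, \ldots, N\}$.

Average of products:
$\frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y}) = \frac{1}{N}[\mathbf{x} - \bar{x}\mathbf{1}]^T [\mathbf{y} - \bar{y}\mathbf{1}]$
$\mathbf{1}$ denotes a $(N \times 1)$ vector of 1s; superscript T denotes transposition.

Sample covariance between data of attributes X and Y:
$\hat{\sigma}_{XY} = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})$
(essentially the) average of products of data deviations from their means, normalized by $N - 1$; a measure of joint variability (association) between two attributes X and Y.

Sample variance = covariance of an attribute with itself:
$\hat{\sigma}_{XX} = \hat{\sigma}^2_X = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})^2$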
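As a minimal sketch of the computation above, the covariance can be written as a sum of deviation products. The divisor $N - 1$ is an assumption carried over from the correlation formula in these notes; the function name and toy data are hypothetical.

```python
def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by N - 1 (assumed divisor)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # element-by-element products of the two centered arrays
    products = [(xn - x_bar) * (yn - y_bar) for xn, yn in zip(x, y)]
    return sum(products) / (n - 1)


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 3.0, 5.0, 4.0]

cov_xy = sample_covariance(x, y)  # joint variability of X and Y
var_x = sample_covariance(x, x)   # variance = covariance of X with itself
print(cov_xy, var_x)
```

Passing the same list twice recovers the sample variance, which mirrors the statement that the variance is the covariance of an attribute with itself.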
Interpreting The Sample Covariance

Sample covariance between data of attributes X and Y:
$\hat{\sigma}_{XY} = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})$
sum of products of data deviations from their means, divided by $N - 1$.

[Figure: scatter plot quadrants with the sign of the product in each: I > 0, II < 0, III > 0, IV < 0]

Interpretation:
large positive covariance indicates data pairs predominantly lying in quadrants I and III
large negative covariance indicates data pairs predominantly lying in quadrants II and IV
small covariance indicates data pairs lying in all quadrants, in which case positive and negative products $(x_n - \bar{x})(y_n - \bar{y})$ cancel out when one computes their mean

NOTE: The covariance is a measure of linear association between X and Y, and just a summary measure of the actual scatter plot.

Sample Covariance and Correlation Coefficient

Problems with sample covariance:
not easily interpretable, since x- and y-values can have different units and sample variances $\hat{\sigma}^2_X = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})^2$ and $\hat{\sigma}^2_Y = \frac{1}{N-1}\sum_{n=1}^{N} (y_n - \bar{y})^2$
sensitive to outliers; quantifies only linear relationships

Sample correlation coefficient (Pearson's product moment correlation):
$\hat{\rho}_{XY} = \frac{\hat{\sigma}_{XY}}{\hat{\sigma}_X \hat{\sigma}_Y} = \frac{[1/(N-1)]\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sqrt{[1/(N-1)]\sum_{n=1}^{N} (x_n - \bar{x})^2}\ \sqrt{[1/(N-1)]\sum_{n=1}^{N} (y_n - \bar{y})^2}}$
lies in $[-1, +1]$; sensitive to outliers; quantifies only linear relationships

Sample rank correlation coefficient (Spearman's correlation):
rank transform each sample data set, by assigning a rank of 1 to the smallest value and a rank of N to the largest one
transform each data pair $\{x_n, y_n\}$ into a rank pair $\{r(x_n), r(y_n)\}$, where $r(x_n)$ and $r(y_n)$ are the ranks of $x_n$ and $y_n$
compute the correlation coefficient of the rank pairs, as:
$\hat{\rho}^S_{XY} = 1 - \frac{6}{N(N^2 - 1)}\sum_{n=1}^{N} [r(x_n) - r(y_n)]^2$
can detect non-linear monotonic relationships
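The contrast between the two coefficients can be illustrated with a short sketch. The helper names (`pearson`, `ranks`, `spearman`) and the cubic toy data are assumptions for illustration, and the rank formula is applied under the assumption that there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation; the 1/(N-1) factors cancel, so raw sums are used."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank 1 for the smallest value, N for the largest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(x, y):
    """Rank correlation via 1 - 6 * sum(d^2) / (N (N^2 - 1)), no ties assumed."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical toy data: y is a monotonic but non-linear function of x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xn ** 3 for xn in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(r_pearson, r_spearman)
```

Because the relationship is monotonic, the ranks agree perfectly and the rank correlation reaches 1, while the Pearson coefficient stays below 1 since the relationship is not linear.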
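Moment of Inertia of a Scatter Plot

Motivation: Instead of looking at the average product of deviations from the mean, we could look at the moment of inertia of a scatter plot; that is, the average squared distance between any pair $(x_n, y_n)$ and the 45° line. Note: such a line does not always make sense, but so be it for now.

[Figure: point $(x_n, y_n)$ and its distance $d_n$ to the 45° line]

Distance of the n-th point from the 45° line:
$d_n = |x_n - y_n| \cos(45°) = \frac{1}{\sqrt{2}}|x_n - y_n|$, hence $d_n^2 = \frac{1}{2}(x_n - y_n)^2$

Moment of inertia = average squared deviation of scatter plot points from the 45° line:
$\hat{\gamma}_{XY} = \frac{1}{N}\sum_{n=1}^{N} d_n^2 = \frac{1}{2N}\sum_{n=1}^{N} (x_n - y_n)^2$

Note: The moment of inertia for a scatter plot aligned with the 45° line is always 0; that is, if $x_n = y_n\ \forall n$, then $\hat{\gamma}_{XX} = \hat{\gamma}_{YY} = 0$; alternatively, the dissimilarity of an attribute with itself, $\hat{\gamma}_X$, is 0.

Link Between Covariance and Moment of Inertia

Recall:
$\hat{\sigma}^2_X = \frac{1}{N}\sum_{n=1}^{N} x_n^2 - \hat{\mu}^2_X$, $\hat{\sigma}^2_Y = \frac{1}{N}\sum_{n=1}^{N} y_n^2 - \hat{\mu}^2_Y$, $\hat{\sigma}_{XY} = \frac{1}{N}\sum_{n=1}^{N} x_n y_n - \hat{\mu}_X \hat{\mu}_Y$
with $\hat{\mu}_X = \bar{x}$ and $\hat{\mu}_Y = \bar{y}$. (These identities hold exactly when variances and covariance are normalized by $1/N$; with a $1/(N-1)$ normalization they hold up to the factor $N/(N-1)$.)

Expanding:
$2\hat{\gamma}_{XY} = \frac{1}{N}\sum_{n=1}^{N} (x_n - y_n)^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n^2 + y_n^2 - 2 x_n y_n)$
$= \frac{1}{N}\sum_{n=1}^{N} x_n^2 + \frac{1}{N}\sum_{n=1}^{N} y_n^2 - \frac{2}{N}\sum_{n=1}^{N} x_n y_n$
$= \hat{\sigma}^2_X + \hat{\mu}^2_X + \hat{\sigma}^2_Y + \hat{\mu}^2_Y - 2\hat{\sigma}_{XY} - 2\hat{\mu}_X \hat{\mu}_Y$
$= \hat{\sigma}^2_X + \hat{\sigma}^2_Y - 2\hat{\sigma}_{XY} + [\hat{\mu}_X - \hat{\mu}_Y]^2$

What's the difference: To estimate the moment of inertia $\gamma_{XY}$ you do not need to know the mean values $\mu_X$ and $\mu_Y$; these two mean values are required for estimating the covariance $\sigma_{XY}$.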
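The link between the moment of inertia and the covariance can be checked numerically. The sketch below assumes the $1/N$ normalization, under which the identity is exact; the function names and toy data are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)


def cov_n(x, y):
    """Covariance with 1/N normalization (assumed, so the identity is exact)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / len(x)


def moment_of_inertia(x, y):
    """Average squared distance of the points (x_n, y_n) from the 45-degree line."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * len(x))


# Hypothetical toy data with different means (3 and 4)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 2.0, 4.0, 6.0, 5.0]

gamma = moment_of_inertia(x, y)  # needs no means
# right-hand side of the expansion: needs variances, covariance and both means
rhs = 0.5 * (cov_n(x, x) + cov_n(y, y) - 2 * cov_n(x, y) + (mean(x) - mean(y)) ** 2)
print(gamma, rhs)
```

Note that `moment_of_inertia` never touches the means, whereas the right-hand side requires both of them, which is the difference the slide points out.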
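Geometric Interpretation (I)

Vector length: $L_x = \|\mathbf{x}\| = \sqrt{\sum_{n=1}^{N} x_n^2}$; length = distance of the point with coordinates $\{x_1, \ldots, x_N\}$ from the origin.

Vector-scalar multiplication: multiplication of a vector $\mathbf{x}$ by a scalar c changes its length (and direction, depending on the sign of c): $L_{cx} = |c| L_x$; $|c| > 1$ expansion, $|c| < 1$ contraction; unit vector $\mathbf{u} = \mathbf{x} / L_x$.

Inner product of two vectors: $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y} = \sum_{n=1}^{N} x_n y_n$, a scalar quantity (could be negative, zero or positive).

Vector length = square root of the inner product of a vector with itself: $L_x = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle} = \sqrt{\mathbf{x}^T \mathbf{x}}$

Angle θ between two vectors $\mathbf{x}$ and $\mathbf{y}$:
$\cos(\theta) = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{L_x L_y} = \frac{\mathbf{x}^T \mathbf{y}}{\sqrt{\mathbf{x}^T \mathbf{x}}\ \sqrt{\mathbf{y}^T \mathbf{y}}}$
$\cos(90°) = \cos(270°) = 0$, so $\mathbf{x}^T \mathbf{y} = 0$ means the two vectors $\mathbf{x}$ and $\mathbf{y}$ are perpendicular, i.e., $\mathbf{x} \perp \mathbf{y}$.

Geometric Interpretation (II)

Let $\tilde{\mathbf{x}} = \mathbf{x} - \bar{x}\mathbf{1}$ denote the vector of deviations (centered vector).

Variance:
$\hat{\sigma}^2_X = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})^2 = \frac{1}{N-1}\tilde{\mathbf{x}}^T \tilde{\mathbf{x}} = \frac{1}{N-1} L^2_{\tilde{x}}$
Variance is proportional to the squared vector length.

Covariance:
$\hat{\sigma}_{YX} = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y}) = \frac{1}{N-1}\tilde{\mathbf{y}}^T \tilde{\mathbf{x}} = \frac{1}{N-1}\langle \tilde{\mathbf{y}}, \tilde{\mathbf{x}} \rangle$
Covariance is proportional to the inner product of the centered vectors $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}$.

Correlation coefficient $\hat{\rho}_{YX}$ between two attributes X and Y:
$\hat{\rho}_{YX} = \frac{\hat{\sigma}_{YX}}{\hat{\sigma}_Y \hat{\sigma}_X} = \frac{\langle \tilde{\mathbf{y}}, \tilde{\mathbf{x}} \rangle}{\sqrt{L^2_{\tilde{y}}}\ \sqrt{L^2_{\tilde{x}}}} = \frac{\langle \tilde{\mathbf{y}}, \tilde{\mathbf{x}} \rangle}{L_{\tilde{y}} L_{\tilde{x}}} = \cos(\theta)$
cosine of the angle θ between the two centered vectors $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}$ in an N-dimensional space.

Interpretation:
$\hat{\rho}_{YX} = 0$: vectors $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}$ are orthogonal; the two attributes are uncorrelated
$\hat{\rho}_{YX} = 1$: vectors $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}$ are co-linear and lie along the same direction; perfectly positively correlated attributes
$\hat{\rho}_{YX} = -1$: vectors $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}$ are co-linear but lie along opposite directions; perfectly negatively correlated attributes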
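The reading of the correlation coefficient as a cosine can be sketched directly from the inner product. The helper names and the three toy vectors below are assumptions chosen to show an intermediate, an orthogonal and an opposite-direction case.

```python
import math


def inner(a, b):
    """Inner product <a, b> = sum of element-wise products."""
    return sum(p * q for p, q in zip(a, b))


def length(a):
    """Vector length = square root of the inner product of a with itself."""
    return math.sqrt(inner(a, a))


def center(a):
    """Vector of deviations from the mean (centered vector)."""
    m = sum(a) / len(a)
    return [v - m for v in a]


def cos_angle(a, b):
    return inner(a, b) / (length(a) * length(b))


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 3.0, 5.0, 4.0]
z = [2.0, -1.0, -2.0, -1.0, 2.0]  # chosen so its centered vector is orthogonal to x's

rho = cos_angle(center(x), center(y))                        # intermediate angle
rho_orth = cos_angle(center(x), center(z))                   # orthogonal vectors
rho_opposite = cos_angle(center(x), center([-v for v in x])) # opposite directions
print(rho, rho_orth, rho_opposite)
```

The normalization constant of the variance and covariance never appears here, since it cancels in the ratio.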
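Geometric Interpretation (III)

Projection vector ("shadow") of vector $\mathbf{y}$ onto vector $\mathbf{x}$:
$P(\mathbf{y}, \mathbf{x}) = \frac{\mathbf{x}^T \mathbf{y}}{\mathbf{x}^T \mathbf{x}}\,\mathbf{x} = \frac{\langle \mathbf{y}, \mathbf{x} \rangle}{L_x}\,\frac{\mathbf{x}}{L_x}$
a new vector along the direction of $\mathbf{x}$; $\mathbf{x}/L_x$ has unit length.

Projection length:
$\|P(\mathbf{y}, \mathbf{x})\| = \frac{|\langle \mathbf{y}, \mathbf{x} \rangle|}{L_x} = \frac{L_y L_x |\cos(\theta)|}{L_x} = L_y |\cos(\theta)|$
when $\mathbf{y} \perp \mathbf{x}$, $P(\mathbf{y}, \mathbf{x}) = \mathbf{0}$.

Unit vector: Let $\mathbf{1}$ denote a $(N \times 1)$ vector of 1s, with length $L_1 = \sqrt{N}$. The vector $\mathbf{u} = \mathbf{1}/\sqrt{N}$ has length $L_u = 1$ and forms equal angles with each of the coordinate axes of an N-dimensional space.

The sample mean vector: vector $\bar{y}\mathbf{1} = [\bar{y},\ n = 1, \ldots, N]^T$ derived by projecting $\mathbf{y}$ onto the unit vector $\mathbf{u}$:
$P(\mathbf{y}, \mathbf{u}) = (\mathbf{u}^T \mathbf{y})\,\mathbf{u} = \left(\frac{1}{N}\sum_{n=1}^{N} y_n\right)\mathbf{1} = \bar{y}\mathbf{1}$
sample mean = multiple of $\mathbf{1}$ required to yield the projection of $\mathbf{y}$ onto the unit vector $\mathbf{u}$.
$\tilde{\mathbf{y}} = \mathbf{y} - \bar{y}\mathbf{1}$ = deviation vector; note that $\tilde{\mathbf{y}}$ is orthogonal to $\bar{y}\mathbf{1}$.

Regression = projection: The regression of a centered vector $\tilde{\mathbf{y}}$ on another centered vector $\tilde{\mathbf{x}}$ is the projection of the former on the latter:
$P(\tilde{\mathbf{y}}, \tilde{\mathbf{x}}) = \frac{\tilde{\mathbf{x}}^T \tilde{\mathbf{y}}}{\tilde{\mathbf{x}}^T \tilde{\mathbf{x}}}\,\tilde{\mathbf{x}} = \frac{\sigma_{YX}}{\sigma^2_X}\,\tilde{\mathbf{x}} = \frac{\rho_{YX}\,\sigma_X \sigma_Y}{\sigma_X \sigma_X}\,\tilde{\mathbf{x}} = \rho_{YX}\frac{\sigma_Y}{\sigma_X}\,\tilde{\mathbf{x}} = \beta_X \tilde{\mathbf{x}}$
$\beta_X$ = slope of variable X for simple linear regression.

Computing Multivariate Sample Statistics (I)

Multivariate data set: N measurements on K attributes $\{X_1, \ldots, X_K\}$ made at N sampling units and arranged in a $(N \times K)$ matrix:
$\mathbf{X} = \begin{bmatrix} x_{11} & \cdots & x_{1k} & \cdots & x_{1K} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} & \cdots & x_{nK} \\ \vdots & & \vdots & & \vdots \\ x_{N1} & \cdots & x_{Nk} & \cdots & x_{NK} \end{bmatrix}$
$x_{nk}$ = n-th measurement for the k-th variable $X_k$
the n-th row contains K measurements of different attributes at a single sampling unit
the k-th column contains N measurements of a single attribute at all sampling units

Multivariate sample mean: $(K \times 1)$ vector $\bar{\mathbf{x}} = [\bar{x}_k,\ k = 1, \ldots, K]^T$, with k-th entry $\bar{x}_k$ representing the mean of the k-th attribute $X_k$:
$\bar{\mathbf{x}} = \frac{1}{N}\mathbf{X}^T \mathbf{1}$

Conditional multivariate mean vector: $(K \times 1)$ vector of mean values for all K attributes, computed only from those rows of $\mathbf{X}$ whose entries satisfy some condition (or query).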
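Both projection results (the mean vector and the regression slope) can be reproduced with a small sketch. The function `project` and the toy data are hypothetical names and values chosen for illustration.

```python
def inner(a, b):
    """Inner product <a, b>."""
    return sum(p * q for p, q in zip(a, b))


def project(y, x):
    """Projection ("shadow") of vector y onto vector x: (<y, x> / <x, x>) x."""
    coef = inner(y, x) / inner(x, x)
    return [coef * v for v in x]


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 3.0, 5.0, 4.0]
ones = [1.0] * len(y)

# Projecting y onto the vector of 1s yields the sample mean vector (y_bar in every entry)
mean_vec = project(y, ones)

# Deviation vectors; y_tilde is orthogonal to the mean vector
y_tilde = [yn - m for yn, m in zip(y, mean_vec)]
x_tilde = [xn - m for xn, m in zip(x, project(x, ones))]

# Regression of y_tilde on x_tilde = projection; the coefficient is the slope
beta = inner(y_tilde, x_tilde) / inner(x_tilde, x_tilde)
fitted = project(y_tilde, x_tilde)  # equals beta * x_tilde
print(mean_vec, beta)
```

Projecting onto the vector of 1s or onto the unit vector $\mathbf{u}$ gives the same result, because a projection depends only on the direction of the target vector.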
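Computing Multivariate Sample Statistics (II)

Matrix of means: $\bar{\mathbf{X}}$ of size $(N \times K)$, each row equal to $[\bar{x}_1, \ldots, \bar{x}_k, \ldots, \bar{x}_K]$:
$\bar{\mathbf{X}} = \mathbf{1}\bar{\mathbf{x}}^T = \frac{1}{N}\mathbf{1}\mathbf{1}^T \mathbf{X}$
$\mathbf{1}\bar{\mathbf{x}}^T$ = outer product of vectors $\mathbf{1}$ and $\bar{\mathbf{x}}$; a $(N \times K)$ matrix; $\mathbf{1}\mathbf{1}^T$ = $(N \times N)$ matrix of 1s.

Matrix of deviations from means: $\tilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$ of size $(N \times K)$, with entries $x_{nk} - \bar{x}_k$:
$\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}^T = \mathbf{X} - \frac{1}{N}\mathbf{1}\mathbf{1}^T \mathbf{X}$

Sum of cross-products between variables $X_k$ & $X_{k'}$ (columns $k$, $k'$):
$\sum_{n=1}^{N} (x_{nk} - \bar{x}_k)(x_{nk'} - \bar{x}_{k'})$

Computing Multivariate Sample Statistics (III)

Matrix of squares and cross-products: $\tilde{\mathbf{X}}^T \tilde{\mathbf{X}}$ of size $(K \times K)$:
$\tilde{\mathbf{X}}^T \tilde{\mathbf{X}} = \left(\mathbf{X} - \frac{1}{N}\mathbf{1}\mathbf{1}^T \mathbf{X}\right)^T \left(\mathbf{X} - \frac{1}{N}\mathbf{1}\mathbf{1}^T \mathbf{X}\right) = \mathbf{X}^T \left(\mathbf{I} - \frac{1}{N}\mathbf{1}\mathbf{1}^T\right) \mathbf{X}$
where $\mathbf{I}$ denotes the $(N \times N)$ identity matrix.

Sample covariance matrix: $\hat{\boldsymbol{\Sigma}}_X$ of size $(K \times K)$:
$\hat{\boldsymbol{\Sigma}}_X = \frac{1}{N-1}(\mathbf{X} - \bar{\mathbf{X}})^T (\mathbf{X} - \bar{\mathbf{X}}) = \frac{1}{N-1}\mathbf{X}^T \left(\mathbf{I} - \frac{1}{N}\mathbf{1}\mathbf{1}^T\right) \mathbf{X}$

$kk'$-th entry of the covariance matrix:
$\hat{\sigma}_{X_k X_{k'}} = \frac{1}{N-1}(\mathbf{X} - \bar{\mathbf{X}})_{\cdot k}^T (\mathbf{X} - \bar{\mathbf{X}})_{\cdot k'} = \frac{1}{N-1}\sum_{n=1}^{N} (x_{nk} - \bar{x}_k)(x_{nk'} - \bar{x}_{k'})$
$\mathbf{A}_{k \cdot}$ = k-th row of matrix $\mathbf{A}$; $\mathbf{A}_{\cdot k}$ = k-th column of $\mathbf{A}$.

Note: In the presence of missing values, one should compute all variance and covariance values only from those $N' < N$ rows of matrix $\mathbf{X}$ with no missing values. This ensures that the resulting covariance matrix $\hat{\boldsymbol{\Sigma}}$ is a valid one.

Conditional covariance matrix: $(K \times K)$ covariance matrix between all $K^2$ pairs of attributes, computed only from those rows of $\mathbf{X}$ whose entries satisfy some condition (or query).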
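The steps above (mean vector, matrix of deviations, cross-products, covariance matrix, and the treatment of missing values) can be sketched with plain Python lists. Using `None` as the missing-value marker, the divisor $N' - 1$, the function name and the toy matrix are all assumptions made for illustration.

```python
def covariance_matrix(rows):
    """Mean vector and (K x K) covariance matrix from an (N x K) list of rows.

    Rows containing None (assumed missing-value marker) are dropped first,
    so every entry is computed from the same N' complete rows.
    """
    complete = [r for r in rows if all(v is not None for v in r)]
    n, k = len(complete), len(complete[0])
    # multivariate sample mean: one mean per column
    means = [sum(r[j] for r in complete) / n for j in range(k)]
    # matrix of deviations from means
    dev = [[r[j] - means[j] for j in range(k)] for r in complete]
    # sums of cross-products between columns a and b, divided by N' - 1 (assumed)
    cov = [[sum(d[a] * d[b] for d in dev) / (n - 1) for b in range(k)]
           for a in range(k)]
    return means, cov


# Hypothetical toy data: 6 sampling units, 2 attributes, one missing value
X = [
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, None],  # incomplete row, excluded from all statistics
    [3.0, 3.0],
    [4.0, 5.0],
    [5.0, 4.0],
]

means, cov = covariance_matrix(X)
print(means)
print(cov)
```

A conditional mean vector or covariance matrix could be obtained in the same way by first filtering `rows` with whatever condition (or query) is of interest.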