Multivariate distance 2017 Fall

Contents

Euclidean distance
    Definitions
    Standardization
Population distance
    Population mean and variance
    Definitions
Proportions
Presence-absence data

Multivariate distance

Example: Consider three races: Korean, Japanese, African.
Korean and Japanese are closer than Korean and African (or Japanese and African).
Why? Appearance, culture, geography, language, ...

Most multivariate problems can be viewed in terms of distances between single observations.
For each observation, there are p measurements (p variables).
How should the (multivariate) distance between two observations be defined from the p measurements?

Euclidean distance

1. There are n observations: $X_1, X_2, \ldots, X_n$.
2. Each observation has p variables: $X_i^t = (X_{i1}, X_{i2}, \ldots, X_{ip})$.
3. The Euclidean distance between two observations is

$$
d_{ij} = d_{ij}(X_i, X_j) = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2}
$$
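The Euclidean distance defined above can be sketched in a few lines of Python. The function name `euclidean_distance` and the two observations are hypothetical and chosen only for illustration.

```python
import math

def euclidean_distance(x_i, x_j):
    """Euclidean distance between two observations with p measurements each."""
    if len(x_i) != len(x_j):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Hypothetical observations with p = 3 variables (not from the slides).
x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 6.0, 3.0]
d12 = euclidean_distance(x1, x2)  # sqrt(3^2 + 4^2 + 0^2)
```

The distance is symmetric in its two arguments and is zero when an observation is compared with itself.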

Standardization

Example: Two measurements are obtained from a person: height and tooth dimension (mm).
Variation of height would be 20-30 mm.
Variation of tooth dimension would be 1-2 mm.

$$
d_{ij} = \sqrt{(X_{i1} - X_{j1})^2 + (X_{i2} - X_{j2})^2}
$$

The distance is affected mostly by height.

1. It is desirable for all variables to have about the same influence on the distance.
2. Standardization: divide each value by the standard deviation of its variable.

$$
d_{ij} = \sqrt{\left(\frac{X_{i1} - X_{j1}}{\mathrm{sd}(X_1)}\right)^2 + \left(\frac{X_{i2} - X_{j2}}{\mathrm{sd}(X_2)}\right)^2}
$$

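Standardization

Let $\bar{X}_k$ and $s_k$ be the sample mean and standard deviation of the kth variable, with data

$$
\begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{np}
\end{pmatrix}
$$

and

$$
\bar{X}_k = \frac{1}{n} \sum_{i} X_{ik}, \qquad s_k^2 = \frac{1}{n-1} \sum_{i} (X_{ik} - \bar{X}_k)^2 .
$$

The standardized variable $Z_{ik}$ is defined by

$$
Z_{ik} = \frac{X_{ik} - \bar{X}_k}{s_k} .
$$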
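As a minimal sketch of the standardization above, the following Python code computes column means and standard deviations with the n - 1 divisor and then the distance between two standardized observations. The helper names and the height and tooth values are hypothetical and are not taken from the slides.

```python
import math

def column_means_and_sds(data):
    """Sample mean and standard deviation (n - 1 divisor) of each column."""
    n = len(data)
    p = len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(p)]
    sds = [
        math.sqrt(sum((row[k] - means[k]) ** 2 for row in data) / (n - 1))
        for k in range(p)
    ]
    return means, sds

def standardize(data):
    """Replace each X_ik by Z_ik = (X_ik - mean_k) / s_k."""
    means, sds = column_means_and_sds(data)
    return [[(row[k] - means[k]) / sds[k] for k in range(len(row))] for row in data]

# Hypothetical data: height (mm) and tooth dimension (mm) for n = 4 people.
data = [
    [1700.0, 10.0],
    [1720.0, 12.0],
    [1740.0, 11.0],
    [1760.0, 13.0],
]
z = standardize(data)

raw_d = math.dist(data[0], data[1])  # dominated by the height difference
std_d = math.dist(z[0], z[1])        # both variables now contribute comparably
```

After standardization every column has sample mean 0 and sample standard deviation 1, so no single variable dominates the distance because of its scale.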

Populations and parameters

There are m populations.
There are p variables.
$\mu_{ki}$ = the mean of the variable $X_k$ in the ith population
$V_k$ = the population variance of the variable $X_k$
$V_{rs}$ = the population covariance of the two variables $X_r$ and $X_s$

Assume that the population means differ among groups, but that the population variance is the same among groups.

Definition of population distances

Definition (Penrose)

$$
P_{ij} = \sum_{k=1}^{p} \frac{(\mu_{ki} - \mu_{kj})^2}{p V_k}
$$

Note: $P_{ij}$ ignores the correlations among the variables.

Definition (Mahalanobis)

$$
D_{ij}^2 = \sum_{r=1}^{p} \sum_{s=1}^{p} (\mu_{ri} - \mu_{rj}) \, v^{rs} \, (\mu_{si} - \mu_{sj})
$$

where $v^{rs}$ is the element in the rth row and sth column of the inverse of the population covariance matrix.
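The two definitions above can be compared numerically. The sketch below is restricted to p = 2 so that the covariance matrix can be inverted by hand without any library. The mean vectors, the covariance matrix, and all function names are hypothetical.

```python
def penrose_distance(mu_i, mu_j, variances):
    """Penrose distance: sum over k of (mu_ki - mu_kj)^2 / (p * V_k)."""
    p = len(mu_i)
    return sum((a - b) ** 2 / (p * v) for a, b, v in zip(mu_i, mu_j, variances))

def invert_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_sq(mu_i, mu_j, v_inv):
    """Double sum over r and s of diff_r * v^{rs} * diff_s."""
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    p = len(diff)
    return sum(diff[r] * v_inv[r][s] * diff[s] for r in range(p) for s in range(p))

# Hypothetical population means and common covariance matrix (p = 2).
mu_1 = [10.0, 5.0]
mu_2 = [12.0, 6.0]
V = [[4.0, 1.0], [1.0, 1.0]]

P_12 = penrose_distance(mu_1, mu_2, [V[0][0], V[1][1]])
D2_12 = mahalanobis_sq(mu_1, mu_2, invert_2x2(V))

# With zero covariance the Mahalanobis distance reduces to p times Penrose.
V_diag = [[4.0, 0.0], [0.0, 1.0]]
D2_diag = mahalanobis_sq(mu_1, mu_2, invert_2x2(V_diag))
```

When the variables are uncorrelated, the inverse covariance matrix is diagonal and $D_{ij}^2 = p \, P_{ij}$, so the two measures differ only through the correlations that the Penrose distance ignores.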

Mahalanobis distance

Quadratic form for the Mahalanobis distance between two populations:

$$
D_{ij}^2 = (\mu_i - \mu_j)^t V^{-1} (\mu_i - \mu_j)
$$

where $\mu_i$ is the population mean vector of the ith group, such that $\mu_i^t = (\mu_{1i}, \mu_{2i}, \ldots, \mu_{pi})$, and $V$ is the population covariance matrix.

Mahalanobis distance of one observation from the population center:

$$
D^2 = (x - \mu)^t V^{-1} (x - \mu)
$$

where $x$ is the observation vector, such that $x^t = (x_1, x_2, \ldots, x_p)$.

Mahalanobis distance

A large value of the Mahalanobis distance implies that the observation may be

1. a genuine but unlikely record,
2. an observation from another distribution, or
3. a record containing some mistake.

If $\mu$ and $V$ are unknown, they should be estimated:

$$
D^2 = (x - \hat{\mu})^t \hat{V}^{-1} (x - \hat{\mu})
$$

If the sample size is small, the estimate $\hat{V}$ is not stable and the Mahalanobis distance is then not reliable. Hence, it is better to use the Penrose distance if n < 100.
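The estimated version of the distance can be sketched as follows, again for p = 2 so that no linear algebra library is needed. The data set is a hypothetical toy sample of four observations, used only to make the arithmetic easy to follow. It is far too small for a stable covariance estimate.

```python
def sample_mean_and_cov(data):
    """Sample mean vector and covariance matrix (n - 1 divisor)."""
    n, p = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(p)]
    cov = [
        [
            sum((row[r] - mean[r]) * (row[s] - mean[s]) for row in data) / (n - 1)
            for s in range(p)
        ]
        for r in range(p)
    ]
    return mean, cov

def invert_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_sq(x, center, v_inv):
    """Squared Mahalanobis distance (x - center)^t V^{-1} (x - center)."""
    diff = [a - b for a, b in zip(x, center)]
    p = len(diff)
    return sum(diff[r] * v_inv[r][s] * diff[s] for r in range(p) for s in range(p))

# Hypothetical toy sample: n = 4 observations on p = 2 variables.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
mu_hat, v_hat = sample_mean_and_cov(data)
v_hat_inv = invert_2x2(v_hat)

# Squared distance of every observation from the estimated center.
d2 = [mahalanobis_sq(x, mu_hat, v_hat_inv) for x in data]
```

When every observation is scored against the sample's own mean and covariance (n - 1 divisor), the squared distances sum to (n - 1)p, which gives a quick sanity check on an implementation.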

Distance with proportions

Example (The election poll): A survey was conducted for the presidential election. There are three candidates, and the results are shown for two regions.

Region    Candidate 1    Candidate 2    Candidate 3    ?
A         p_1            p_2            p_3            p_4
B         q_1            q_2            q_3            q_4

Note that the proportions sum to 1 ($p_1 + p_2 + p_3 + p_4 = 1$).
How should the distance between two groups be defined in terms of the proportions?

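Distance with proportions

Definition (Distance I)

$$
d_1 = \sum_{i=1}^{K} |p_i - q_i| / 2
$$

Note that $0 \le d_1 \le 1$.

Definition (Distance II)

$$
d_2 = 1 - \frac{\sum_{i=1}^{K} p_i q_i}{\left[ \sum_{i=1}^{K} p_i^2 \sum_{i=1}^{K} q_i^2 \right]^{1/2}}
$$

Note that $0 \le d_2 \le 1$.

Definition (Similarity s)

$$
s = 1 - d \quad \text{or} \quad s = 1/d \quad \text{or} \quad s = 1/(1 + d)
$$

where d is any distance measure.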
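A minimal sketch of the two proportion distances follows. The function names and the poll proportions for the two regions are hypothetical.

```python
import math

def distance_1(p, q):
    """Distance I: half the sum of absolute differences in proportions."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

def distance_2(p, q):
    """Distance II: one minus the normalized cross-product of proportions."""
    num = sum(a * b for a, b in zip(p, q))
    den = math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    return 1 - num / den

# Hypothetical proportions for regions A and B (each list sums to 1).
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]

d1 = distance_1(p, q)
d2 = distance_2(p, q)
s = 1 - d1  # one of the similarity conversions listed above
```

Both distances are 0 for identical proportion vectors and reach 1 when the two regions share no category, for example (1, 0) against (0, 1).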

Distance with presence-absence data

Example: Presences and absences of two species at ten sites

Site         1   2   3   4   5   6   7   8   9   10
Species 1    0   0   1   1   1   0   1   1   1   0
Species 2    1   1   1   1   0   0   0   0   1   1

Note that 1 = presence and 0 = absence.

The data can be summarized by a two-by-two contingency table:

                       Species 2
Species 1    Present    Absent    Total
Present      a          b         a + b
Absent       c          d         c + d
Total        a + c      b + d     n

Distance with presence-absence data

Definition (Simple matching index)

$$
s = (a + d)/n
$$

Definition (Ochiai index)

$$
s = \frac{a}{[(a + b)(a + c)]^{1/2}}
$$

Definition (Dice-Sorensen index)

$$
s = \frac{2a}{2a + b + c}
$$

Definition (Jaccard index)

$$
s = \frac{a}{a + b + c}
$$
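The four indices can be worked through on the ten-site example given earlier. The sketch below tallies the contingency counts a, b, c, d from the two presence-absence rows and then applies each definition. The function names are hypothetical.

```python
import math

def contingency_counts(sp1, sp2):
    """Return (a, b, c, d): both present, only species 1, only species 2, both absent."""
    a = sum(1 for x, y in zip(sp1, sp2) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(sp1, sp2) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(sp1, sp2) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(sp1, sp2) if x == 0 and y == 0)
    return a, b, c, d

def similarity_indices(a, b, c, d):
    """Simple matching, Ochiai, Dice-Sorensen and Jaccard indices."""
    n = a + b + c + d
    return {
        "simple_matching": (a + d) / n,
        "ochiai": a / math.sqrt((a + b) * (a + c)),
        "dice_sorensen": 2 * a / (2 * a + b + c),
        "jaccard": a / (a + b + c),
    }

# Presence (1) and absence (0) of the two species at the ten sites.
species_1 = [0, 0, 1, 1, 1, 0, 1, 1, 1, 0]
species_2 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

a, b, c, d = contingency_counts(species_1, species_2)
indices = similarity_indices(a, b, c, d)
```

For these data the counts are a = 3, b = 3, c = 3 and d = 1, so the simple matching index is 0.4, the Ochiai and Dice-Sorensen indices are both 0.5, and the Jaccard index is 1/3.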

Distance with presence-absence data

Note that

All four similarity measures take values between zero (no similarity) and one (complete similarity).
The number of joint absences d is not used in all four definitions: only the simple matching index includes it.
If two species are both absent from many sites, counting d risks the conclusion that the two species are similar.
Hence, there is some debate about whether d should be used in the definition of a similarity measure.