Machine Learning: Measuring Distance (several slides from Bryan Pardo)
Why measure distance? Nearest neighbor requires a distance measure. Also: local search methods require a measure of locality (Friday), clustering requires a distance measure (later), search engines require a measure of similarity, etc.
Euclidean Distance What people intuitively think of as distance: in two dimensions, d(x, y) = sqrt((x1 − y1)² + (x2 − y2)²)
Generalized Euclidean Distance d(x, y) = (Σ_{i=1..n} (x_i − y_i)²)^{1/2}, where x = <x1, ..., xn>, y = <y1, ..., yn>, x_i, y_i ∈ R, and n = the number of dimensions
L_p norms L_p norms are all special cases of this: d(x, y) = (Σ_{i=1..n} |x_i − y_i|^p)^{1/p} Changing p changes the norm: p = 1 gives the L_1 norm (Manhattan distance); p = 2 gives the L_2 norm (Euclidean distance); Hamming distance is p = 1 with x_i, y_i ∈ {0, 1}
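The L_p formula above can be sketched directly in Python; `lp_distance` is a hypothetical helper name, not from the slides:

```python
def lp_distance(x, y, p=2):
    """Minkowski (L_p) distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
print(lp_distance([0, 0], [3, 4], p=1))  # 7.0
print(lp_distance([0, 0], [3, 4], p=2))  # 5.0
```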
Weighting Dimensions Put a point in the cluster with the closest center of gravity? Which cluster should the red point go in? How do I measure distance in a way that gives the right answer for both situations?
Weighted Norms You can compensate by weighting your dimensions: d(x, y) = (Σ_{i=1..n} w_i |x_i − y_i|^p)^{1/p} This lets you turn your circle of equal distance into an ellipse with axes parallel to the dimensions of the vectors.
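A minimal sketch of the weighted norm, assuming a per-dimension weight vector `w` of the same length as `x` and `y` (the function name is illustrative):

```python
def weighted_lp(x, y, w, p=2):
    """Weighted L_p distance: each dimension's difference is scaled by w_i."""
    return sum(wi * abs(a - b) ** p for wi, a, b in zip(w, x, y)) ** (1 / p)

# uniform weights recover the ordinary Euclidean distance
print(weighted_lp([0, 0], [3, 4], w=[1, 1], p=2))     # 5.0
# down-weighting dimension 2 shrinks its contribution (the ellipse effect)
print(weighted_lp([0, 0], [3, 4], w=[1, 0.25], p=2))
```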
Mahalanobis Distance The region of constant Mahalanobis distance around the mean of a distribution forms an ellipsoid. The axes of this ellipsoid don't have to be parallel to the dimensions describing the vector. Images from: http://www.aiaccess.net/english/glossaries/glosmod/e_gm_mahalanobis.htm
Calculating Mahalanobis d(x, y) = sqrt((x − y)^T S^{-1} (x − y)) The matrix S is called the covariance matrix and is calculated from the data distribution.
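A dependency-free sketch for the two-dimensional case, inverting the 2×2 covariance matrix by hand (`mahalanobis_2d` is an illustrative name, not from the slides):

```python
def mahalanobis_2d(x, mu, S):
    """Mahalanobis distance of 2-D point x from mean mu, covariance matrix S."""
    # invert the 2x2 covariance matrix by hand
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    Sinv = [[ S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det,  S[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form d^T S^{-1} d
    q = (d[0] * (Sinv[0][0] * d[0] + Sinv[0][1] * d[1]) +
         d[1] * (Sinv[1][0] * d[0] + Sinv[1][1] * d[1]))
    return q ** 0.5

# with the identity covariance this reduces to Euclidean distance
print(mahalanobis_2d([3, 4], [0, 0], [[1, 0], [0, 1]]))  # 5.0
```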
Take-aways on Mahalanobis It is good for non-spherically symmetric distributions. It accounts for scaling of the coordinate axes. It can reduce to Euclidean distance.
What is a metric? A metric has these four qualities (otherwise, call it a measure): d(x, y) ≥ 0 (non-negativity); d(x, y) = 0 iff x = y (reflexivity); d(x, y) = d(y, x) (symmetry); d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Metric, or not? Driving distance with one-way streets? Categorical stuff: is distance(Jazz to Blues to Rock) no less than distance(Jazz to Rock)?
Categorical Variables Consider feature vectors for genre and vocals: Genre: {Blues, Jazz, Rock, Hip Hop} Vocals: {vocals, no vocals} s1 = {rock, vocals} s2 = {jazz, no vocals} s3 = {rock, no vocals} Which two songs are more similar?
One Solution: Hamming distance Encode each song as a binary vector over (Blues, Jazz, Rock, Hip Hop, Vocals): s1 = {rock, vocals} → (0, 0, 1, 0, 1); s2 = {jazz, no_vocals} → (0, 1, 0, 0, 0); s3 = {rock, no_vocals} → (0, 0, 1, 0, 0) Hamming distance = number of different bits in two binary vectors
Hamming Distance d(x, y) = Σ_{i=1..n} |x_i − y_i|, where x = <x1, ..., xn>, y = <y1, ..., yn>, and x_i, y_i ∈ {0, 1}
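The song example can be checked with a few lines of Python; the binary encodings follow the one-hot scheme sketched on the previous slide:

```python
def hamming(x, y):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

# (Blues, Jazz, Rock, Hip Hop, Vocals)
s1 = [0, 0, 1, 0, 1]  # rock, vocals
s2 = [0, 1, 0, 0, 0]  # jazz, no vocals
s3 = [0, 0, 1, 0, 0]  # rock, no vocals
print(hamming(s1, s2))  # 3
print(hamming(s1, s3))  # 1 -- s1 and s3 are the more similar pair
```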
Defining your own distance (an example) How often does artist x quote artist y? Quote frequency Q: Beethoven quotes → Beethoven 7, Beatles 0, Liz Phair 0; Beatles quotes → Beethoven 4, Beatles 5, Liz Phair 0; Liz Phair quotes → Beethoven ?, Beatles 1, Liz Phair 2 Let's build a distance measure!
Defining your own distance (an example) Using the quote-frequency table Q (where Q(x, y) = value in the table), define f(x, y) = Q(x, y) / Σ_{z ∈ Artists} Q(x, z), and distance d(x, y) = 1 − f(x, y)
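A small sketch of this hand-built measure, using only the rows of the table whose entries are all given (Liz Phair's unknown "?" entry is omitted); the names `f` and `quote_distance` are illustrative:

```python
# Quote-frequency table from the slides: Q[(x, y)] = how often x quotes y
Q = {
    ("Beethoven", "Beethoven"): 7, ("Beethoven", "Beatles"): 0, ("Beethoven", "Liz Phair"): 0,
    ("Beatles", "Beethoven"): 4, ("Beatles", "Beatles"): 5, ("Beatles", "Liz Phair"): 0,
}
artists = ["Beethoven", "Beatles", "Liz Phair"]

def f(x, y):
    """Fraction of artist x's quotes that are of artist y."""
    return Q[(x, y)] / sum(Q[(x, z)] for z in artists)

def quote_distance(x, y):
    return 1 - f(x, y)

print(quote_distance("Beatles", "Beethoven"))  # 1 - 4/9
```

Note the measure is not symmetric in general (the Beatles quote Beethoven, but not vice versa), so it is a measure rather than a metric.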
Missing Data What if, for some category, on some examples, there is no value given? Approaches: discard all examples missing the category; fill in the blanks with the mean value; only use a category in the distance measure if both examples give a value
Dealing with missing data Let w_i = 1 if both x_i and y_i are defined, and 0 otherwise. Then d(x, y) = (n / Σ_{i=1..n} w_i) Σ_{i=1..n} w_i φ(x_i, y_i)
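The missing-data formula above can be sketched as follows, assuming missing values are represented by `None` and φ is a per-dimension distance (absolute difference here, as an illustrative choice):

```python
def distance_with_missing(x, y, phi=lambda a, b: abs(a - b)):
    """Sum phi over dimensions where both values are defined,
    rescaled by n / (number of defined dimensions)."""
    n = len(x)
    w = [1 if (a is not None and b is not None) else 0 for a, b in zip(x, y)]
    total = sum(phi(a, b) for wi, a, b in zip(w, x, y) if wi)
    return (n / sum(w)) * total

# only dimension 0 is defined in both vectors: (3 / 1) * |1 - 2| = 3.0
print(distance_with_missing([1, None, 3], [2, 5, None]))
```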
Edit Distance Query = string from a finite alphabet Target = string from a finite alphabet Cost of edits = distance Target: C A G E D Query: C E A E D (with '-' marking gaps in the alignment)
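One standard instance of edit distance is Levenshtein distance, computed by dynamic programming; this sketch uses unit cost for insertions, deletions, and substitutions (a common but not the only choice of edit costs):

```python
def edit_distance(query, target):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions (unit cost each) to turn query into target."""
    m, n = len(query), len(target)
    # dp[i][j] = cost of editing query[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from query
                           dp[i][j - 1] + 1,         # insert into query
                           dp[i - 1][j - 1] + cost)  # substitute / match
    return dp[m][n]

print(edit_distance("CEAED", "CAGED"))  # 2
```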
Semantic Relatedness d(Portland, Hippies) << d(Portland, Monster Trucks)
Semantic Relatedness Several measures have been proposed. One that works well: Milne-Witten. SR_MW(x, y) is based on the fraction of Wikipedia in-links to either x or y that link to both.
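A simplified sketch of the in-link idea as the slide states it, i.e. the fraction of pages linking to either article that link to both (the published Milne-Witten measure uses a log-normalized variant of this); the in-link sets below are hypothetical page-id sets, not real Wikipedia data:

```python
def sr_inlinks(in_x, in_y):
    """Fraction of pages linking to either article that link to both."""
    both = in_x & in_y
    either = in_x | in_y
    return len(both) / len(either) if either else 0.0

# hypothetical in-link sets (page ids) for two articles
in_portland = {1, 2, 3, 4}
in_hippies = {3, 4, 5}
print(sr_inlinks(in_portland, in_hippies))  # 2/5 = 0.4
```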
One more distance measure Kullback-Leibler divergence Related to entropy and information gain It is not a metric, since it is not symmetric Take EECS 428: Information Theory to find out more
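The asymmetry is easy to demonstrate; this sketch computes D_KL(p || q) in bits for discrete distributions, assuming q is nonzero wherever p is:

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# the two directions differ, so KL divergence is not symmetric
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```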