Machine Learning. Measuring Distance. several slides from Bryan Pardo


Why measure distance? Nearest neighbor requires a distance measure. Also: local search methods require a measure of locality (Friday); clustering requires a distance measure (later); search engines require a measure of similarity; etc.

Euclidean Distance. What people intuitively think of as distance (illustrated on a two-dimensional plot, axes Dimension 1 and Dimension 2):

d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}

Generalized Euclidean Distance:

d(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}

where x = (x_1, x_2, ..., x_n), y = (y_1, y_2, ..., y_n), and n = the number of dimensions.

L_p norms. Manhattan, Euclidean, and Hamming distance are all special cases of this (p changes the norm):

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

L_1 norm: Manhattan Distance. L_2 norm: Euclidean Distance. Hamming Distance: the L_1 norm with x_i, y_i \in \{0, 1\}.
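As a quick sketch (the function name is mine, not from the slides), the whole L_p family fits in one Python function:

```python
def lp_distance(x, y, p):
    """L_p distance: (sum over i of |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# p selects the norm:
manhattan = lp_distance((0, 0), (3, 4), 1)            # L_1 norm
euclidean = lp_distance((0, 0), (3, 4), 2)            # L_2 norm
hamming = lp_distance((0, 0, 1, 0), (0, 1, 1, 1), 1)  # L_1 on binary vectors
```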

Weighting Dimensions (a think-pair-share exercise over two clustering figures). Put the point in the cluster with the closest center of gravity? Which cluster should the red point go in? How do I measure distance in a way that gives the right answer for both situations?


Weighted Norms. You can compensate by weighting your dimensions:

d(x, y) = \left( \sum_{i=1}^{n} w_i |x_i - y_i|^p \right)^{1/p}

This lets you turn your circle of equal distance into an ellipse with axes parallel to the dimensions of the vectors.
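Adding the weights is a one-line change to the L_p sketch (helper name mine):

```python
def weighted_lp_distance(x, y, w, p):
    """Weighted L_p distance: (sum over i of w_i * |x_i - y_i|^p)^(1/p)."""
    return sum(wi * abs(a - b) ** p for wi, a, b in zip(w, x, y)) ** (1.0 / p)

# With w = (1, 1) this is plain Euclidean distance; shrinking a weight
# stretches the circle of equal distance into an axis-aligned ellipse.
d = weighted_lp_distance((0, 0), (3, 4), (1.0, 0.25), 2)
```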

Mahalanobis distance. The region of constant Mahalanobis distance around the mean of a distribution forms an ellipsoid. The axes of this ellipsoid don't have to be parallel to the dimensions describing the vector. Images from: http://www.aiaccess.net/english/glossaries/glosmod/e_gm_mahalanobis.htm

Calculating Mahalanobis:

d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}

This matrix S is called the covariance matrix and is calculated from the data distribution.
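A minimal NumPy sketch of that formula (variable names mine; S is estimated from made-up sample data with np.cov):

```python
import numpy as np

def mahalanobis(x, y, S):
    """d(x, y) = sqrt((x - y)^T S^-1 (x - y))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

# Estimate the covariance matrix S from a data distribution (rows = samples):
data = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
S = np.cov(data, rowvar=False)
d = mahalanobis(data[0], data.mean(axis=0), S)
```

With S equal to the identity matrix, this reduces to the ordinary Euclidean distance.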

Take-away on Mahalanobis: it is good for non-spherically symmetric distributions; it accounts for scaling of the coordinate axes; and it can reduce to the Euclidean distance (when S is the identity matrix).

What is a metric? A metric has these four qualities (otherwise, call it a measure):

d(x, y) = 0 iff x = y (reflexivity)
d(x, y) \geq 0 (non-negativity)
d(x, y) = d(y, x) (symmetry)
d(x, z) \leq d(x, y) + d(y, z) (triangle inequality)
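The four axioms are easy to spot-check numerically for a candidate measure; a small sketch for the Euclidean case (helper names mine):

```python
import itertools
import math
import random

def euclid(x, y):
    return math.dist(x, y)  # Euclidean distance (Python 3.8+)

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(10)]

# Spot-check the four metric axioms on sample points:
for x, y, z in itertools.product(pts, repeat=3):
    assert euclid(x, x) == 0.0                   # reflexivity
    assert euclid(x, y) >= 0.0                   # non-negativity
    assert euclid(x, y) == euclid(y, x)          # symmetry
    assert euclid(x, z) <= euclid(x, y) + euclid(y, z) + 1e-12  # triangle
```

A check like this can only falsify an axiom on samples, not prove it, but it catches measures such as driving distance with one-way streets, which fails the symmetry test.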

Metric or not? Driving distance with 1-way streets. Categorical stuff: is distance(Jazz to Blues to Rock) no less than distance(Jazz to Rock)?

Categorical Variables. Consider feature vectors for genre & vocals: Genre: {Blues, Jazz, Rock, Hip Hop}; Vocals: {vocals, no vocals}.

s1 = {rock, vocals}
s2 = {jazz, no vocals}
s3 = {rock, no vocals}

Which two songs are more similar?

One Solution: Hamming distance.

                        Blues  Jazz  Rock  Hip Hop  Vocals
s1 = {rock, vocals}       0     0     1      0        1
s2 = {jazz, no_vocals}    0     1     0      0        0
s3 = {rock, no_vocals}    0     0     1      0        0

Hamming Distance = number of different bits in two binary vectors.

Hamming Distance:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

where x = (x_1, ..., x_n), y = (y_1, ..., y_n), and x_i, y_i \in \{0, 1\}.
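Applied to one-hot rows encoded as in the genre/vocals table, a Hamming sketch in Python:

```python
def hamming(x, y):
    """Number of positions at which two binary vectors differ."""
    return sum(a != b for a, b in zip(x, y))

# Columns: Blues, Jazz, Rock, Hip Hop, Vocals
s1 = (0, 0, 1, 0, 1)  # {rock, vocals}
s2 = (0, 1, 0, 0, 0)  # {jazz, no_vocals}
s3 = (0, 0, 1, 0, 0)  # {rock, no_vocals}
# s1 and s3 come out closest: they differ only in the Vocals bit.
```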

Defining your own distance (an example). How often does artist x quote artist y?

Quote Frequency Q    Beethoven  Beatles  Liz Phair
Beethoven                7         0         0
Beatles                  4         5         0
Liz Phair                ?         1         2

Let's build a distance measure!

Defining your own distance (an example), continued. With Q(x, y) = the value in the quote-frequency table above:

f(x, y) = \frac{Q(x, y)}{\sum_{z \in artists} Q(x, z)}

d(x, y) = 1 - f(x, y)

Missing data. What if, for some category on some examples, there is no value given? Approaches:
- Discard all examples missing the category
- Fill in the blanks with the mean value
- Only use a category in the distance measure if both examples give a value

Dealing with missing data:

d(x, y) = \frac{n}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i (x_i - y_i)^2

where w_i = 1 if both x_i and y_i are defined, else w_i = 0.
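A sketch of that formula, using None to mark a missing value (the function name and the None convention are mine, not from the slides):

```python
def distance_with_missing(x, y):
    """Squared distance over the dimensions defined in both examples,
    rescaled by n / sum(w_i) to stay comparable across pairs."""
    n = len(x)
    w = [1 if a is not None and b is not None else 0 for a, b in zip(x, y)]
    if sum(w) == 0:
        raise ValueError("no dimension is defined in both examples")
    total = sum((a - b) ** 2 for wi, a, b in zip(w, x, y) if wi)
    return (n / sum(w)) * total

# Dimension 2 is skipped; the rest is rescaled by 3/2:
d = distance_with_missing((1.0, None, 3.0), (2.0, 5.0, 3.5))
```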

Edit Distance. Query = string from a finite alphabet. Target = string from a finite alphabet. Cost of the edits = distance.

Target: C A G E D - -
Query:  C E A E D
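One standard way to compute it is the Wagner-Fischer dynamic program for Levenshtein distance. A sketch with unit edit costs (the slides do not commit to a specific cost model):

```python
def edit_distance(query, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn query into target (unit costs)."""
    m, n = len(query), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete everything
    for j in range(n + 1):
        d[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

d = edit_distance("CEAED", "CAGED")
```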

Semantic Relatedness: d(Portland, Hippies) << d(Portland, Monster trucks)

Semantic Relatedness. Several measures have been proposed. One that works well is Milne-Witten: SR_MW(x, y) = the fraction of Wikipedia in-links to either x or y that link to both.
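Reading that definition as set overlap of in-link sets, a toy sketch (the in-link sets here are invented for illustration, and the published Milne-Witten measure uses a log-based formula; this implements only the slide's verbal description):

```python
def sr_mw(inlinks_x, inlinks_y):
    """Fraction of pages that link to either article and link to both."""
    return len(inlinks_x & inlinks_y) / len(inlinks_x | inlinks_y)

# Hypothetical in-link sets (page ids), made up for the example:
portland = {1, 2, 3, 4, 5}
hippies = {3, 4, 5, 6}
monster_trucks = {7, 8}
# sr_mw(portland, hippies) is high; sr_mw(portland, monster_trucks) is 0.
```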

One more distance measure: Kullback-Leibler divergence. It is related to entropy & information gain, and it is not a metric, since it is not symmetric. Take EECS 428: Information Theory to find out more.
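A short sketch showing the asymmetry (discrete distributions, log base 2; a demonstration only, not a full treatment):

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(p || q) = sum over i of p_i * log2(p_i / q_i)."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = (0.5, 0.5)
q = (0.9, 0.1)
# kl_divergence(p, q) != kl_divergence(q, p): KL fails the symmetry axiom,
# so it is a divergence, not a metric.
```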