Machine Learning: Measuring Distance (several slides from Bryan Pardo)
Why measure distance?
Nearest neighbor requires a distance measure.
Also:
Local search methods require a measure of locality (Friday).
Clustering requires a distance measure (later).
Search engines require a measure of similarity, etc.
Euclidean Distance
What people intuitively think of as distance.
$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$
[Figure: two points in the plane; axes: Dimension 1, Dimension 2]
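Generalized Euclidean Distance
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^2 \right)^{1/2}$
where $x = \langle x_1, x_2, \ldots, x_n \rangle$ and $y = \langle y_1, y_2, \ldots, y_n \rangle$
n = the number of dimensions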
L_p norms
L_1 norm: Manhattan Distance (p = 1)
L_2 norm: Euclidean Distance (p = 2)
Hamming Distance: p = 1 and $x_i, y_i \in \{0, 1\}$
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
L_p norms are all special cases of this: p changes the norm.
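The L_p family above can be sketched in a few lines of Python. The function name `lp_distance` and the sample points are illustrative choices, not part of the slides.

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    # Sum the p-th powers of the per-dimension gaps, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y = (0, 0), (3, 4)
print(lp_distance(x, y, p=1))  # Manhattan distance
print(lp_distance(x, y, p=2))  # Euclidean distance
# With p = 1 and 0/1 entries, the same formula counts differing bits (Hamming).
print(lp_distance((0, 1, 1, 0), (1, 1, 0, 0), p=1))
```

Only the exponent changes between the three calls, which is the point of the slide: p selects the norm.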
Weighting Dimensions (think, pair, share)
Put the point in the cluster with the closest center of gravity?
Which cluster should the red point go in?
How do I measure distance in a way that gives the right answer for both situations?
Weighted Norms
You can compensate by weighting your dimensions.
$d(x, y) = \left( \sum_{i=1}^{n} w_i \, |x_i - y_i|^p \right)^{1/p}$
This lets you turn your circle of equal distance into an ellipse with axes parallel to the dimensions of the vectors.
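A minimal sketch of a weighted norm, assuming one non-negative weight per dimension applied to the p-th power of each gap. The function name, the weights, and the three points are made up for illustration.

```python
def weighted_lp_distance(x, y, w, p=2):
    """Weighted L_p distance: each dimension's gap is scaled by its weight."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must all have the same length")
    return sum(wi * abs(a - b) ** p for a, b, wi in zip(x, y, w)) ** (1.0 / p)


point = (0, 0)
center_a = (2, 0)  # differs from the point only along dimension 1
center_b = (0, 3)  # differs from the point only along dimension 2

# Equal weights: center_a is closer.
print(weighted_lp_distance(point, center_a, (1, 1)),
      weighted_lp_distance(point, center_b, (1, 1)))
# Weighting dimension 1 heavily: center_b becomes the closer one.
print(weighted_lp_distance(point, center_a, (9, 1)),
      weighted_lp_distance(point, center_b, (9, 1)))
```

Changing the weights changes which cluster center is nearest, without touching the data.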
Mahalanobis distance
The region of constant Mahalanobis distance around the mean of a distribution forms an ellipsoid.
The axes of this ellipsoid don't have to be parallel to the dimensions describing the vector.
Images from: http://www.aaccess.net/englsh/glossares/glosmod/e_gm_mahalanobs.htm
Calculating Mahalanobis
$d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$
This matrix S is called the covariance matrix and is calculated from the data distribution.
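A rough sketch of the calculation for two-dimensional data, in plain Python. It assumes the sample covariance (dividing by n - 1) and the usual square-root form of the distance; the helper names and the sample points are invented for illustration.

```python
import math


def covariance_2d(points):
    """Sample covariance matrix ((sxx, sxy), (sxy, syy)) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(x, y, S):
    """sqrt((x - y)^T S^-1 (x - y)) for 2-D vectors and a 2x2 matrix S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - y[0], x[1] - y[1])
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)


# Toy data stretched along the diagonal (made up for this example).
data = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8)]
S = covariance_2d(data)
center = (3, 3)
print(mahalanobis_2d((4, 4), center, S))  # along the spread of the data
print(mahalanobis_2d((4, 2), center, S))  # across the spread of the data
```

With the identity matrix in place of S the function returns the ordinary Euclidean distance.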
Take-away on Mahalanobis
Is good for non-spherically symmetric distributions.
Accounts for scaling of coordinate axes.
Can reduce to Euclidean.
What is a metric?
A metric has these four qualities; otherwise, call it a "measure".
$d(x, y) = 0$ iff $x = y$ (reflexivity)
$d(x, y) \ge 0$ (non-negative)
$d(x, y) = d(y, x)$ (symmetry)
$d(x, y) + d(y, z) \ge d(x, z)$ (triangle inequality)
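The four properties can be spot-checked on a finite set of points. The sketch below is only a falsification aid: finding a violation shows a function is not a metric, while finding none on a sample proves nothing. The function name, the tolerance, and the one-way-street travel costs are made up for illustration.

```python
from itertools import product


def metric_violations(points, d, tol=1e-9):
    """List the metric properties that d violates on the given points."""
    violations = []
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:
            violations.append(("non-negative", x, y))
        # d(x, y) should be zero exactly when x and y are the same point.
        if (abs(d(x, y)) <= tol) != (x == y):
            violations.append(("reflexivity", x, y))
        if abs(d(x, y) - d(y, x)) > tol:
            violations.append(("symmetry", x, y))
    for x, y, z in product(points, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:
            violations.append(("triangle inequality", x, y, z))
    return violations


# Hypothetical driving costs with a one-way street from A to B.
one_way = {("A", "A"): 0, ("A", "B"): 1, ("B", "A"): 3, ("B", "B"): 0}
print(metric_violations(["A", "B"], lambda x, y: one_way[(x, y)]))
print(metric_violations([0, 1, 5], lambda x, y: abs(x - y)))
```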
Metric or not?
Driving distance with 1-way streets
Categorical stuff: Is distance (Jazz to Blues to Rock) no less than distance (Jazz to Rock)?
Categorical Variables
Consider feature vectors for genre & vocals:
Genre: {Blues, Jazz, Rock, Hip Hop}
Vocals: {vocals, no vocals}
s1 = {rock, vocals}
s2 = {jazz, no vocals}
s3 = {rock, no vocals}
Which two songs are more similar?
One Solution: Hamming distance
s1 = {rock, vocals}
s2 = {jazz, no_vocals}
s3 = {rock, no_vocals}

      Blues  Jazz  Rock  Hip Hop  Vocals
s1      0     0     1       0       1
s2      0     1     0       0       0
s3      0     0     1       0       0

Hamming Distance = number of different bits in two binary vectors
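Hamming Distance
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
where $x = \langle x_1, x_2, \ldots, x_n \rangle$, $y = \langle y_1, y_2, \ldots, y_n \rangle$ and $x_i, y_i \in \{0, 1\}$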
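Applied to the three songs from the categorical-variables example, the calculation might look like the sketch below. The `encode` helper and its bit ordering (four genre bits, then one vocals bit) are assumptions chosen to match the slide's table.

```python
GENRES = ["blues", "jazz", "rock", "hip hop"]


def encode(genre, has_vocals):
    """One bit per genre, followed by one bit for vocals."""
    bits = [1 if genre == g else 0 for g in GENRES]
    bits.append(1 if has_vocals else 0)
    return bits


def hamming(x, y):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))


s1 = encode("rock", True)
s2 = encode("jazz", False)
s3 = encode("rock", False)

print(hamming(s1, s2))  # 3: two genre bits and the vocals bit differ
print(hamming(s1, s3))  # 1: only the vocals bit differs
print(hamming(s2, s3))  # 2: two genre bits differ
```

Note that a genre mismatch always costs two bits under this encoding, while a vocals mismatch costs one, so the encoding itself acts as an implicit weighting.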
Defining your own distance (an example)
How often does artist x quote artist y?

Quote Frequency
             Beethoven  Beatles  Liz Phair
Beethoven        7         0         0
Beatles          4         5         0
Liz Phair        ?         1         2

Let's build a distance measure!
Defining your own distance (an example)

Quote frequency
             Beethoven  Beatles  Liz Phair
Beethoven        7         0         0
Beatles          4         5         0
Liz Phair        ?         1         2

$Q_f(x, y)$ = value in table
Distance: $d(x, y) = 1 - \dfrac{Q_f(x, y)}{\sum_{z \in \text{artists}} Q_f(x, z)}$
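A small sketch of this home-made distance. It assumes one reading of the slide's formula: one minus the share of x's quotes that go to y, with table rows as the quoting artist and columns as the quoted artist. The "?" entry is kept as `None` rather than guessed, so distances from Liz Phair are left undefined here.

```python
# Q[x][y]: how often artist x quotes artist y; None marks the "?" in the table.
Q = {
    "Beethoven": {"Beethoven": 7, "Beatles": 0, "Liz Phair": 0},
    "Beatles": {"Beethoven": 4, "Beatles": 5, "Liz Phair": 0},
    "Liz Phair": {"Beethoven": None, "Beatles": 1, "Liz Phair": 2},
}


def quote_distance(x, y):
    """1 - Q(x, y) / sum over z of Q(x, z), under the assumed reading."""
    row = Q[x]
    if any(v is None for v in row.values()):
        raise ValueError(f"quote counts for {x} are incomplete")
    total = sum(row.values())
    if total == 0:
        raise ValueError(f"{x} quotes nobody, so the ratio is undefined")
    return 1 - row[y] / total


print(quote_distance("Beatles", "Beethoven"))  # 1 - 4/9
print(quote_distance("Beethoven", "Beatles"))  # 1 - 0/7
print(quote_distance("Beatles", "Beatles"))    # 1 - 5/9, not zero
```

Under this reading the measure is neither symmetric nor reflexive, so it is a measure rather than a metric.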
Missing data
What if, for some category, on some examples, there is no value given?
Approaches:
Discard all examples missing the category
Fill in the blanks with the mean value
Only use a category in the distance measure if both examples give a value
Dealing with missing data
$w_i = 1$ if both $x_i$ and $y_i$ are defined, else $w_i = 0$
$d(x, y) = \dfrac{n}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \, |x_i - y_i|$
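One way to sketch the "only use a category if both examples give a value" approach. This assumes an L_1 per-dimension gap and a rescaling by n over the number of usable dimensions, which is one plausible reading of the slide rather than a confirmed formula. Missing values are represented as `None`, and the sample vectors are made up.

```python
def distance_with_missing(x, y):
    """L_1 distance over dimensions defined in both vectors, rescaled to n."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    n = len(x)
    # Keep only the dimensions where both examples give a value (w_i = 1).
    gaps = [abs(a - b) for a, b in zip(x, y)
            if a is not None and b is not None]
    if not gaps:
        raise ValueError("no dimension is defined in both examples")
    # Scale up so vectors with many missing values are not unfairly 'close'.
    return n / len(gaps) * sum(gaps)


x = (1, None, 3, 4)
y = (2, 5, None, 6)
print(distance_with_missing(x, y))  # uses dimensions 1 and 4 only
```

Without the rescaling, two examples that share few defined dimensions would tend to look closer than two fully specified ones.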
Edit Distance
Query = string from finite alphabet
Target = string from finite alphabet
Cost of Edits = Distance
Target: C A G E D - -
Query: C E A E D
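A compact sketch of edit distance using the standard dynamic-programming recurrence. Unit costs for insertion, deletion and substitution are an assumption (the slide only says that the cost of edits is the distance), and the function and parameter names are illustrative.

```python
def edit_distance(query, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of edits that turn query into target."""
    # prev[j]: cost of turning the query prefix seen so far into target[:j].
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, q in enumerate(query, start=1):
        curr = [i * del_cost]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,                       # delete q
                curr[j - 1] + ins_cost,                   # insert t
                prev[j - 1] + (0 if q == t else sub_cost)  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("CEAED", "CAGED"))  # 2 with unit costs
```

Changing the three cost parameters changes the distance, so a domain-specific cost table (for example, cheaper substitutions between similar symbols) yields a domain-specific measure.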
Semantic Relatedness
d(Portland, Hippies) << d(Portland, Monster trucks)
Semantic Relatedness
Several measures have been proposed.
One that works well: Milne-Witten
$SR_{MW}(x, y)$: fraction of Wikipedia in-links to either x or y that link to both
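Taking the one-line description above literally, the quantity is the overlap between two sets of in-linking pages. The sketch below implements only that literal reading, not the full published Milne-Witten formula, and the in-link sets are toy data invented for the example, not real Wikipedia links.

```python
def inlink_overlap(inlinks_x, inlinks_y):
    """Fraction of pages linking to either article that link to both."""
    either = inlinks_x | inlinks_y
    if not either:
        return 0.0
    return len(inlinks_x & inlinks_y) / len(either)


# Hypothetical in-link sets (page titles are made up for illustration).
portland = {"Oregon", "Pacific Northwest", "Food carts", "Counterculture"}
hippies = {"Counterculture", "Pacific Northwest", "1960s"}
monster_trucks = {"Motorsport", "Pickup truck"}

print(inlink_overlap(portland, hippies))         # 2 shared of 5 pages
print(inlink_overlap(portland, monster_trucks))  # no shared pages
```

This is a relatedness score (higher means more related); something like one minus the score would turn it into a distance-like quantity.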
One more distance measure
Kullback-Leibler divergence
Related to entropy & information gain
Not a metric, since it is not symmetric
Take EECS 428: Information Theory to find out more
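The asymmetry is easy to see numerically. The sketch below uses the standard definition for discrete distributions, D(p || q) = sum of p_i log2(p_i / q_i), measured in bits; the two example distributions are arbitrary.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # by convention, 0 * log(0 / q) contributes nothing
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log2(pi / qi)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # roughly 0.74 bits
print(kl_divergence(q, p))  # roughly 0.53 bits
```

Swapping the arguments changes the result, so the symmetry property required of a metric fails.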