Statistical Analysis of Environmental Data - Academic Year Prof. Fernando Sansò CLUSTER ANALYSIS


1 Statistical Analysis of Environmental Data - Academic Year Prof. Fernando Sansò

EXERCISES - PART: CLUSTER ANALYSIS

Classification of clustering methods:

  Supervised
    Deterministic: Discriminant Analysis
    Stochastic: Bayesian
  Unsupervised
    Deterministic: Hierarchical methods (AGNES, DIANA)
    Stochastic: Optimization approach methods (PAM, FANNY)

Table of Contents:
  Hierarchical Methods
    AGNES
    AGNES example
    The silhouette index criterion
    DIANA
    DIANA example
  Optimization Method (PAM and FANNY)
    PAM
    PAM example
    FANNY
    FANNY example
  Supervised classification
    Bayesian approach
    Discriminant analysis
    Supervised classification example
  APPENDIX

Update: 6/0/08. Authors: A. Molteni, L. Pertusini, M. Reguzzoni

2 Hierarchical Methods: the idea is to group data in clusters without using training samples, but defining a similarity/dissimilarity table (see Fig. 1) based on a distance concept. These methods are called hierarchical because they require many steps, and the choices taken at each step are never changed in the subsequent steps.

A distance is a functional(1) d of two vectors x and y that fulfils the following properties:

  d(x, y) >= 0  and  d(x, y) = 0  <=>  x = y
  d(x, y) = d(y, x)
  d(x, y) <= d(x, z) + d(z, y)

Distance between points:

  Euclidean distance:   d(P1, P2) = sqrt( Σ_i (x_1i - x_2i)² )
  City-block distance:  d(P1, P2) = Σ_i |x_1i - x_2i|

Fig. 1a. Examples of distance between points P1 and P2.

Distance between clusters:

  Mean distance:     D(A1, A2) = (1 / (n_A1 · n_A2)) Σ_{P_i ∈ A1, P_j ∈ A2} d(x_i, x_j)
  Minimum distance:  D(A1, A2) = min_{P_i ∈ A1, P_j ∈ A2} d(x_i, x_j)

Fig. 1b. Examples of distance between clusters A1 and A2.

(1) A function on vectors whose values are scalars.
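The distances above can be sketched in a few lines of code. This is a minimal Python illustration (the course material uses Matlab; the function names here are our own):

```python
import math

def euclidean(x, y):
    # Euclidean distance between two points given as coordinate lists
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    # City-block (Manhattan) distance
    return sum(abs(a - b) for a, b in zip(x, y))

def mean_distance(A1, A2, d=euclidean):
    # Mean distance between clusters: average over all pairs of points
    return sum(d(x, y) for x in A1 for y in A2) / (len(A1) * len(A2))

def min_distance(A1, A2, d=euclidean):
    # Minimum distance between clusters: distance of the closest pair
    return min(d(x, y) for x in A1 for y in A2)
```

Any functional satisfying the three properties above can be passed as `d`; the Euclidean distance is the default here only for convenience.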

3 AGNES (AGglomerative NESting)

It is an agglomerative method, starting from as many clusters as the number of data and ending with a unique cluster.

Flowchart:
  - Starting point: define one cluster for each element
  - Create the dissimilarity table
  - Find the minimum distance and merge the closest clusters
  - Are all elements in a single cluster? If NO, update the dissimilarity table and repeat; if YES, choose the number of clusters. END.

AGNES example

Consider the following 8 observations (in one dimension), provided in ascending order: X1, X2, ..., X8. Group the data using AGNES and then choose a reasonable number of clusters.

At the beginning we define 8 clusters composed of one element only. Then we create the dissimilarity table on the basis of the Euclidean distance (see Fig. 1a), namely d_ij = |X_i - X_j|. Note that the dissimilarity table is a symmetric matrix with zero values on the diagonal.

Fig. 2. Dissimilarity table (1st step).

4 The minimum distance in the table above (Fig. 2) is d(X7, X8); thus the first cluster, formed by X7 and X8, is created. Now we update the dissimilarity table by computing distances with respect to the cluster {7,8} using the concept of mean distance (see Fig. 1b), for example d(X1, {7,8}) = (d(X1,X7) + d(X1,X8)) / 2.

Fig. 3. Dissimilarity table (2nd step).

The minimum distance in the table above (Fig. 3) is d(X6, {7,8}); thus the element X6 is included in the cluster {7,8}, creating the cluster {6,7,8}. Again we update the dissimilarity table by computing distances with respect to the cluster {6,7,8}.

Fig. 4. Dissimilarity table (3rd step).

The minimum distance in the table above (Fig. 4) is d(X1, X2); thus a new cluster, formed by X1 and X2, is created.

Fig. 5. Dissimilarity table (4th step).

The minimum distance in the table above (Fig. 5) is d(X3, X4); thus a new cluster, formed by X3 and X4, is created.

5 Fig. 6. Dissimilarity table (5th step).

The minimum distance in the table above (Fig. 6) is d(X5, {6,7,8}); thus the element X5 is included in the cluster {6,7,8}, creating the cluster {5,6,7,8}.

Fig. 7. Dissimilarity table (6th step).

The minimum distance in the table above (Fig. 7) is d({1,2}, {3,4}); thus the cluster {1,2,3,4} is created.

Fig. 8. Dissimilarity table (7th step).

Finally, a unique cluster is obtained: cluster {1,2,3,4,5,6,7,8}. In order to choose a reasonable number of clusters (a unique cluster with all the elements is obviously not meaningful), a stop criterion for the agglomerative procedure has to be defined, such as:
  - the maximum distance criterion, defining a threshold on the maximum distance between clusters (see flag diagram in Fig. 9a);
  - the maximum gradient criterion, defining a threshold on the maximum variation of the agglomerative distance (see graph in Fig. 9b);
  - the silhouette index criterion.
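The agglomerative loop of the example (repeatedly merge the two clusters at minimum mean distance until the chosen number of clusters is reached) can be sketched as follows. This is an illustrative Python version with made-up data, not the course's Matlab code:

```python
def agnes(points, n_clusters):
    """Agglomerative nesting on 1-D data with the mean (average-linkage)
    distance of Fig. 1b; a sketch of the procedure in the example."""
    clusters = [[p] for p in points]  # start: one cluster per element

    def mean_dist(a, b):
        return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # find the pair of clusters at minimum mean distance
        _, i, j = min((mean_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        # merge cluster j into cluster i (choices are never undone)
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

For instance, `agnes([0.0, 0.1, 1.0, 1.1, 5.0, 5.1, 5.2, 5.3], 2)` first merges the closest singletons and ends with a four-element cluster on each side of the large gap.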

6 Fig. 9a. Flag diagram (elements X1, ..., X8 versus agglomerative distance).

Fig. 9b. Maximum gradient criterion (acceptable versus unacceptable gradient).

7 The silhouette index criterion

The mean distance between the element x_i and the cluster A_m is generally defined as:

  d(x_i, A_m) = (1 / n_Am) Σ_{P_j ∈ A_m} d(x_i, x_j)

We have to define the parameters a_i, b_i and the silhouette index s_i for each element x_i ∈ A_l:

  a_i = d(x_i, A_l): distance within its own cluster (intra-cluster distance);
  b_i = min_{m ≠ l} d(x_i, A_m): distance with respect to the closest other cluster A_m (inter-cluster distance);
  s_i = (b_i - a_i) / max(a_i, b_i).

If s_i ≈ -1, then a_i >> b_i, thus the clustering of the element x_i is very bad. If s_i ≈ 1, then a_i << b_i, thus the clustering of the element x_i is optimal.

Moreover, it is possible to evaluate the quality of a cluster by computing the average of the silhouette indexes of all the elements forming the cluster:

  s(A_l) = (1/n) Σ_{x_i ∈ A_l} s_i,  n = n_Al.

As an example, let us consider the 6th step of the AGNES exercise, with clusters {X1,X2}, {X3,X4}, {X5,X6,X7,X8}, and evaluate the quality of the cluster {X1,X2} using the silhouette index:

  a(X1) = d(X1, {X2}),  b(X1) = min[ d(X1, {X3,X4}), d(X1, {X5,X6,X7,X8}) ],  s(X1) = (b(X1) - a(X1)) / max(a(X1), b(X1))

and similarly for X2. Since a << b for both elements, s(X1) and s(X2) are close to 1 and

  s({X1,X2}) = (s(X1) + s(X2)) / 2 ≈ 1: good cluster classification!
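The definitions of a_i, b_i and s_i translate directly into code. A minimal Python sketch (the names are ours, and the default distance is the 1-D absolute difference used in the example):

```python
def silhouette(element, own, others, d=lambda a, b: abs(a - b)):
    """Silhouette index s_i = (b_i - a_i) / max(a_i, b_i) for one element;
    `own` is its cluster, `others` is the list of the remaining clusters."""
    # a_i: mean distance to the other members of its own cluster
    a = sum(d(element, x) for x in own if x != element) / (len(own) - 1)
    # b_i: mean distance to the closest of the other clusters
    b = min(sum(d(element, x) for x in c) / len(c) for c in others)
    return (b - a) / max(a, b)
```

Averaging `silhouette` over the members of a cluster gives the cluster quality index s(A_l) of the text.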

8 DIANA (DIvisive ANAlysis)

It is a divisive method, starting from a unique cluster composed of all the data and ending with many clusters composed of one element only.

Flowchart:
  - Starting point: include all the elements in a unique cluster
  - Create the dissimilarity table
  - Find the element with the maximum distance and create a new cluster
  - Create the dissimilarity table of the remaining elements of the original cluster
  - Find the element with the maximum distance
  - Is the element closer to the new cluster? If YES, include the element into the new cluster and repeat; if NO, choose the cluster with the maximum diameter and divide it
  - Does each cluster contain one element only? If NO, repeat; if YES, choose the number of clusters. END.

9 DIANA example

Consider the same 8 observations of the AGNES example, but now group the data using DIANA. The starting point is a unique cluster composed of all the elements. The dissimilarity table is the same as in Fig. 2.

Fig. 10. Dissimilarity table.

The initialization can be summarized as follows:

  A0 = {1,2,3,4,5,6,7,8},  B0 = ∅

Fig. 11. Step 0.

The first step consists in finding the most dissimilar element within A0; therefore we calculate the mean value of each row of the dissimilarity table (Fig. 10), for example d(X1, A0) = (d_12 + d_13 + ... + d_18) / 7.

Fig. 12. Dissimilarity table within A0.

The maximum distance in the table above (Fig. 12) is d(X1, A0); thus the cluster {1} is created:

  A1 = {2,3,4,5,6,7,8},  B1 = {1}

Fig. 13. Step 1.

Then we have to find the most dissimilar element within the cluster A1; therefore we calculate the mean value of each row of the dissimilarity table (Fig. 10), disregarding the X1-column and the X1-row, for example d(X2, A1) = (d_23 + d_24 + ... + d_28) / 6.

10 Fig. 14. Dissimilarity table within A1.

The maximum distance in the table above (Fig. 14) is d(X2, A1). Now the following test is performed:
  - if d(X2, A1) > d(X2, B1), the element X2 is included in B;
  - if d(X2, A1) < d(X2, B1), the element X2 remains in A and the cluster division is stopped.

In this case d(X2, A1) > d(X2, B1) = d(X2, X1) (see Fig. 10), thus the element X2 is included in B:

  A2 = {3,4,5,6,7,8},  B2 = {1,2}

Fig. 15. Step 2.

NOTE that the elements can only shift from cluster A to cluster B and not vice versa!

The procedure is then repeated:

Fig. 16. Dissimilarity table within A2.

In this case d(X3, A2) > d(X3, B2) = (d_31 + d_32)/2, thus the element X3 is included in B:

  A3 = {4,5,6,7,8},  B3 = {1,2,3}

Fig. 17. Step 3.

Fig. 18. Dissimilarity table within A3.

In this case d(X4, A3) > d(X4, B3) = (d_41 + d_42 + d_43)/3, thus the element X4 is included in B:

  A4 = {5,6,7,8},  B4 = {1,2,3,4}

Fig. 19. Step 4.

Fig. 20. Dissimilarity table within A4.

In this case d(X5, A4) < d(X5, B4) = (d_51 + d_52 + d_53 + d_54)/4, thus the element X5 has not to be included in B, and the process stops.

11 In order to choose which cluster to divide, it is necessary to compute the diameters of A4 and B4, looking at the table in Fig. 10. The cluster to be divided is the one with the largest diameter:

  diam(A4) = max_{Xi,Xj ∈ A4} d(i,j) = d(X5, X8)
  diam(B4) = max_{Xi,Xj ∈ B4} d(i,j) = d(X1, X4)

Since diam(B4) > diam(A4), we are going to divide the cluster B4, and the cluster A4 is now frozen.

Fig. 21. Dissimilarity table within B4.

The maximum distance in the table above (Fig. 21) is d(X4, B4); thus the cluster {4} is created:

  A5 = {5,6,7,8},  B5 = {1,2,3},  C5 = {4}

Fig. 22. Step 5.

Fig. 23. Dissimilarity table within B5.

In this case d(X3, B5) > d(X3, C5) (see Fig. 10), thus the element X3 is included in C:

  A6 = {5,6,7,8},  B6 = {1,2},  C6 = {3,4}

Fig. 24. Step 6.

Fig. 25. Dissimilarity table within B6.

In this case d(X1, B6) = d(X2, B6), so the candidate element to be moved can be X1 or X2. We arbitrarily choose X2. Since d(X2, C6) > d(X2, B6), the element X2 cannot be included in C, and the process stops.

  diam(A6) = d(X5, X8),  diam(B6) = d(X1, X2),  diam(C6) = d(X3, X4)

diam(A6) > diam(C6) > diam(B6), thus we are going to divide the cluster A6, creating the cluster {5}, because the maximum distance in the dissimilarity table of Fig. 10 is d(X5, A6):

12   A7 = {6,7,8},  B7 = {1,2},  C7 = {3,4},  D7 = {5}

Fig. 26. Step 7.

Fig. 27. Dissimilarity table within A7.

The maximum distance is d(X6, A7). Since d(X6, D7) > d(X6, A7), the element X6 cannot be included in D, and the process stops. We have to compute again the diameters:

  diam(A7) = d(X6, X8),  diam(B7) = d(X1, X2),  diam(C7) = d(X3, X4)

diam(C7) > diam(A7) > diam(B7), thus we are going to divide the cluster C7, creating the clusters {3} and {4}:

  A8 = {6,7,8},  B8 = {1,2},  C8 = {3},  D8 = {5},  E8 = {4}

Fig. 28. Step 8.

In order to choose which cluster to divide (cluster A or B), it is necessary to calculate their diameters, looking at the table in Fig. 10:

  diam(A8) = d(X6, X8),  diam(B8) = d(X1, X2)

diam(A8) > diam(B8), thus we are going to divide the cluster A8, creating the cluster {6}, because the maximum distance in the dissimilarity table of Fig. 27 is d(X6, A8):

  A9 = {7,8},  B9 = {1,2},  C9 = {3},  D9 = {5},  E9 = {4},  F9 = {6}

Fig. 29. Step 9.

13 Fig. 30. Dissimilarity table within A9.

The maximum distance is d(X7, A9) = d(X8, A9). Since d(X7, F9) > d(X7, A9), the element X7 cannot be included in F, and the process stops. We have to compute again the diameters to choose which cluster has to be divided:

  diam(A9) = d(X7, X8),  diam(B9) = d(X1, X2)

diam(B9) > diam(A9), thus we are going to divide the cluster B9, creating the clusters {1} and {2}:

  A10 = {7,8},  B10 = {1},  C10 = {3},  D10 = {5},  E10 = {4},  F10 = {6},  G10 = {2}

Fig. 31. Step 10.

The final step is to divide the cluster A10 into two clusters of one element: cluster {7} and cluster {8}. So, the final result is eight clusters of one element each.

Fig. 32. Step 11.
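The splitting step repeated throughout the example (move the most dissimilar element of A into the splinter group B until the candidate is closer to A) can be sketched as follows. An illustrative Python version on 1-D data, not the course's Matlab code:

```python
def diana_split(cluster, d=lambda a, b: abs(a - b)):
    """One DIANA splitting step: repeatedly shift the element of A with the
    largest mean distance to the rest of A into the splinter group B, and
    stop as soon as the candidate is closer to A than to B."""
    A, B = list(cluster), []
    while len(A) > 1:
        # mean distance of each element of A to the rest of A
        def to_A(x):
            return sum(d(x, y) for y in A if y != x) / (len(A) - 1)
        cand = max(A, key=to_A)
        if B:
            to_B = sum(d(cand, y) for y in B) / len(B)
            if to_A(cand) <= to_B:  # candidate closer to A: stop the shift
                break
        A.remove(cand)  # elements shift only from A to B, never back
        B.append(cand)
    return A, B
```

The full DIANA procedure then repeatedly applies this step to the cluster with the largest diameter.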

14 We can use the Matlab code called DIANA.m in order to visualize the results in a better way:

Matlab code SYNTAX:
>> data = [ ]';
>> [label] = DIANA(data);

15 Optimization Method (PAM and FANNY)

The idea is to optimize the following target function:

  Φ_{α,p}(w, m) = Σ_i Σ_c w_ic^α d^p(x_i, m_c)

with fixed values for α and p, where m_c is the representative of the cluster c and w_ic is the membership, that is the ownership index of the element x_i with respect to the cluster c. The following conditions have to be satisfied:

  0 <= w_ic <= 1,  Σ_c w_ic = 1

In other words, the aim of the optimization method is to minimize the sum of the distances between each element and the cluster representative, weighted by the ownership index. The number of clusters has to be defined in advance.

PAM: Partitioning Around Medoids (α = 1 and p = 1)

We need to minimize Φ(w, m) = Σ_i Σ_c w_ic d(x_i, m_c):

a) fix m and minimize with respect to w: the minimum is reached for w_ic* = 1, with c* = argmin_c d(x_i, m_c), and w_ic = 0 for c ≠ c*. In other words this is a hard classification.

b) fix w and minimize with respect to m: the minimum is reached for m_c = X_i*, an element of the cluster. In other words the representative of the cluster is an element of the cluster (called medoid or centroid).

The algorithm for the clustering is divided into two phases: 1) Build, 2) Swap.

1) In this stage the clusters are built. The first medoid is the element with the minimum distance with respect to the other elements:

  i*: d(X_i*, A) = min_i d(X_i, A),  m_1 = X_i*

Each element is a candidate medoid, and the contribution C_ij to the target function of the generic element X_j if the element X_i is chosen as medoid is evaluated:

  Gain: C_ij = max[ d(X_j, m_1) - d(X_j, X_i), 0 ]
  Total gain: G_i = Σ_j C_ij

The maximum value indicates the new medoid:

  i**: G_i** = max_i G_i,  m_2 = X_i**

16 The second step is repeated, calculating the gains with respect to the previous medoids, with i ≠ i*, i** and j ≠ i*, i**:

  D_j = min[ d(X_j, m_1), d(X_j, m_2) ]   (distance of X_j from the nearest medoid)
  C_ij = max[ D_j - d(X_j, X_i), 0 ]
  G_i = Σ_j C_ij,  i***: G_i*** = max_i G_i,  m_3 = X_i***

2) In this stage the medoids are exchanged. The previous configuration is changed by swapping the medoids and evaluating the impact on the target function. The swap that produces the best gain is chosen.

  C_jih = contribution of the generic element X_j to the target function after the swap between the old medoid X_i and the new medoid X_h (chosen out of the non-medoid elements);
  D_j = minimum of d(X_j, ·) over the medoids (i.e. the distance of X_j from the nearest medoid);
  E_j = minimum of d(X_j, ·) over the medoids different from the nearest one (i.e. the distance of X_j from the second nearest medoid).

There are various cases:

1) d(X_j, m*) < d(X_j, X_i), d(X_j, X_h): there is a medoid m* that is nearer to X_j than both X_i and X_h:
     C_jih = 0
2) d(X_j, X_i) = D_j: the medoid X_i is the nearest to X_j:
   a) if d(X_j, X_h) < E_j:  C_jih = d(X_j, X_h) - d(X_j, X_i), which can be positive or negative;
   b) if d(X_j, X_h) > E_j:  C_jih = E_j - D_j > 0
3) D_j = d(X_j, m*) < d(X_j, X_i): X_i is not the medoid nearest to X_j, but X_h is nearer to X_j than the current medoid m*, i.e. d(X_j, X_h) < D_j:
     C_jih = d(X_j, X_h) - D_j < 0

In any case, we compute T_ih = Σ_j C_jih and we test the following condition: if min_{i,h} T_ih < 0, then SWAP X_i and X_h; else STOP.
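The two phases can be sketched compactly in Python (illustrative names, not the course's Matlab code). For brevity, the build phase here greedily minimizes the total cost directly, which has the same effect as maximizing the total gain G_i through the gain tables above, and the swap phase tries every medoid/non-medoid exchange instead of building the C_jih cost tables explicitly:

```python
def pam(points, k, d=lambda a, b: abs(a - b)):
    """Simplified sketch of PAM on 1-D data: greedy build, then swap
    medoids while the target function decreases."""
    def cost(meds):
        # target function: sum of distances to the nearest medoid
        return sum(min(d(x, m) for m in meds) for x in points)
    # build phase: first medoid minimizes the total distance to all elements
    medoids = [min(points, key=lambda c: sum(d(c, x) for x in points))]
    while len(medoids) < k:
        medoids.append(min((c for c in points if c not in medoids),
                           key=lambda c: cost(medoids + [c])))
    # swap phase: exchange a medoid with a non-medoid if the cost improves
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    return medoids
```

Since the cost strictly decreases at every accepted swap, the loop terminates; the result is a local minimum of the target function, as for PAM in general.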

17 PAM example

Consider the same 8 observations of the AGNES example, but now group the data into 2 clusters using PAM, with the Matlab code PAM_OU:

Matlab code SYNTAX:
>> data = [ ]';
>> [label, med] = PAM_OU(data, 2);

DISSIMILARITY TABLE: the same as in Fig. 2.

The first medoid is the element with the minimum distance ( min_i Σ_{P_j ∈ A} d(x_i, x_j) ) with respect to the other elements; here we can arbitrarily choose X4 or X5, with a minimum total distance of 6.0570: MEDOID X5 [i* = 5].

The other medoid is chosen by computing the gain table:

GAIN TABLE:  C_ij = max[ d(X_j, X_5) - d(X_j, X_i), 0 ],  G_i = Σ_j C_ij,

for example C_12 = max[ d(X_2, X_5) - d(X_2, X_1), 0 ].

The maximum total gain is obtained for i = 3: MEDOID X3 [i** = 3].

18 COST TABLE (iteration 1): the best improvement for the target function is reached by swapping the medoids 5 → 6.

COST TABLE (iteration 2): all the possible swaps do not produce any improvement to the target function; therefore the algorithm is stopped.

BUILD and SWAP summary: starting from the build medoids X5 and X3, the swap phase replaces X5 with X6.

The final medoids are m_1 = X6 and m_2 = X3. The final memberships are hard:

  w_ic = 1  if  d(x_i, m_c) = min_k d(x_i, m_k),  w_ic = 0  otherwise.

19 FANNY: Fuzzy Analysis Clustering (α = 2 and p = 2)

The target function is now a non-linear function. The conditions on the memberships remain the same:

  Φ(w, m) = Σ_i Σ_c w_ic² d²(x_i, m_c),  0 <= w_ic <= 1,  Σ_c w_ic = 1

Using the Lagrange multipliers for the search of the minimum, we build

  Φ~ = Σ_i Σ_c w_ic² d²(x_i, m_c) + Σ_i λ_i (Σ_c w_ic - 1)

and we obtain:

  ∂Φ~/∂m_c = 0  =>  m_c = Σ_i w_ic² x_i / Σ_i w_ic²

  ∂Φ~/∂w_ic = 0  =>  2 w_ic d²(x_i, m_c) + λ_i = 0  =>  w_ic = d⁻²(x_i, m_c) / Σ_l d⁻²(x_i, m_l)

where the last expression follows because we know that Σ_c w_ic = 1.

Using an iterative procedure we obtain a relative minimum that depends on the starting point (the result is not the absolute minimum!). The iterative procedure is stopped when the differences between the previous and the current estimates are negligible.
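One iteration of the two update equations just derived can be sketched in Python (illustrative names, 1-D data; a small guard avoids division by zero when an element coincides with a representative):

```python
def fanny_step(points, w):
    """One iteration of the FANNY updates (alpha = p = 2) on 1-D data:
    m_c = sum_i w_ic^2 x_i / sum_i w_ic^2, and w_ic proportional to
    1 / d^2(x_i, m_c), normalized so that the memberships sum to 1."""
    k = len(w[0])
    # update the cluster representatives from the squared memberships
    m = [sum(w[i][c] ** 2 * x for i, x in enumerate(points)) /
         sum(w[i][c] ** 2 for i in range(len(points)))
         for c in range(k)]
    # update the memberships from the inverse squared distances
    new_w = []
    for x in points:
        inv = [1.0 / max((x - m[c]) ** 2, 1e-12) for c in range(k)]
        s = sum(inv)
        new_w.append([v / s for v in inv])
    return m, new_w
```

Iterating `fanny_step` until the estimates stabilize reproduces the procedure of the example below.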

20 FANNY example

Consider the same observations of the AGNES example, but now group the data into 2 clusters using FANNY. In order to initialize the iterative process, consider as starting point the PAM classification, namely the clusters {X1,X2,X3,X4} and {X5,X6,X7,X8}, with hard memberships w_i1 = 1, w_i2 = 0 for the first four elements and w_i1 = 0, w_i2 = 1 for the others.

Compute the cluster representative values:

  m_c = Σ_i w_ic² x_i / Σ_i w_ic²

  m_1 = (X1 + X2 + X3 + X4) / 4,  m_2 = (X5 + X6 + X7 + X8) / 4

Compute the memberships for each element X_i:

  w_ic = d⁻²(x_i, m_c) / Σ_l d⁻²(x_i, m_l)

For example, in the case i = 1:

  w_11 = (X1 - m_1)⁻² / [ (X1 - m_1)⁻² + (X1 - m_2)⁻² ]

The results of the first iteration (step 1) are summarized in a table of the updated values of m_1, m_2 and of the memberships w_ic.

21 Repeating the iterative process, the estimates of m_1 and m_2 change less and less between step 2 and step 3; then STOP! The estimates are stabilized.

The following images are the result of the FANNY application (with the same data of our example and stopping the iterations when the differences in the estimates are smaller than 0.01) using the Matlab code FUZZY_OU.m (see also FUZZY.m). Note that the starting values of the memberships are random values sampled from a uniform distribution in the interval [0,1].

Matlab code SYNTAX:
>> data = [ ]';
>> [label, w, med] = FUZZY(data, 2, 0.01);

(Figures: hard classification and memberships of cluster 1 and cluster 2 at each step.)

22 Supervised classification: the idea is to group data in clusters using training samples, namely data for which the corresponding cluster is assumed to be known. Among the various supervised classification methods, we consider the Bayesian approach (based on an a-priori knowledge of the stochastic distribution of the data) and the discriminant analysis (which does not require any a-priori hypothesis on the data distribution).

Bayesian approach

The data are considered as a sample drawn from a random variable (X, L) where:
  X is a continuous n-dimensional variable;
  L is a discrete variable (the label): l = 1, 2, ..., m, where m = number of clusters.

The joint distribution is given by:

  f_{X,L}(x, l) = f(x | l) p(l)

What is really available is the marginal distribution of X (the labels are unknown):

  f_X(x) = Σ_{l=1}^{m} f(x | l) p(l)

For example, in the case m = 3, we have:

  f_X(x) = f_1(x) p_1 + f_2(x) p_2 + f_3(x) (1 - p_1 - p_2)

This marginal distribution is called mixture, and its meaning is represented in the diagrams of the individual densities f_1, f_2, f_3 (weighted by p_1, p_2, p_3) and of the resulting mixture.

The classification problem consists of two phases:
  1. the evaluation of the conditional distributions f(x | l) (e.g. considering a Gaussian distribution, it is possible to evaluate f(x | l) by estimating mean and variance values on the basis of training samples);
  2. the proper classification (i.e. the evaluation of the label values) by applying the Bayes theorem(2):

  p(l | x) = f(x | l) p(l) / f_X(x)

(2) P(A_i | B) = P(B | A_i) P(A_i) / Σ_j P(B | A_j) P(A_j)
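The Bayes classification rule just stated can be sketched as follows for a Gaussian mixture (a Python illustration; the parameters are assumed already estimated from the training samples, and the function names are ours):

```python
import math

def gauss(x, mu, var):
    # Gaussian density N(mu, var) evaluated at x
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior(x, mus, variances, priors):
    """Bayes theorem for a Gaussian mixture:
    p(l | x) = f(x | l) p(l) / f_X(x), for each label l."""
    joint = [gauss(x, m, v) * p for m, v, p in zip(mus, variances, priors)]
    total = sum(joint)  # the mixture density f_X(x)
    return [j / total for j in joint]
```

The returned vector is the soft classification; taking its argmax gives the hard classification discussed next.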

23 In the case m = 2:

  P(l = 1 | x) = f_1(x) p / [ f_1(x) p + f_2(x) (1 - p) ]
  P(l = 2 | x) = f_2(x) (1 - p) / [ f_1(x) p + f_2(x) (1 - p) ]

This is a soft (or fuzzy) classification. In order to strictly decide which label to assign to the value x (hard classification), it is also possible to compute the so-called likelihood ratio R:

  R(x) = f_1(x) p / [ f_2(x) (1 - p) ];  if R > 1 then l = 1, else l = 2.

Discriminant analysis

According to the Bayesian approach, the value x is assigned to the cluster l = 1 if

  G(x) = p(l = 1 | x) - p(l = 2 | x) > 0

The discriminant function G(x) is non-linear and depends on the chosen distributions f_1(x) and f_2(x). The idea of the discriminant analysis is to approximate G(x) with a linear function L(x):

  L(x) = a'x - b

in such a way that E{ [L(x) - G(x)]² } = min over a, b.

In the case m = 2, it results that (see the APPENDIX):

  â = 2 p (1 - p) C_XX⁻¹ (μ_1 - μ_2),  b̂ = â'μ - (2p - 1)

where μ and C_XX are the mean vector and the covariance matrix of the mixture distribution.

Supervised classification example

Consider the following 30 values. Part of the values are extracted from N[μ = 5, σ²], while the others are extracted from N[μ = 8, σ²], but this information can NOT be used in order to solve the exercise! The first 5 values {5.47, 4.09, 5.03, 4.38, 5.53} have to be considered as training samples for the first cluster, and the values from the 11th to the 15th {10.03, 4.93, 7.74, 8.77, 9.03} as training samples for the second cluster.

Assuming that the data of each cluster are extracted from a normal distribution, estimate the parameters of the mixture and classify the data:
  - using the likelihood ratio (Bayesian approach);
  - using the discriminant analysis.

Mean (μ_1) and standard deviation (σ_1) for the first cluster:

  μ̂_1 = (1/5) Σ_{i=1}^{5} x_i = 4.9

24   σ̂_1² = (1/4) Σ_{i=1}^{5} (x_i - μ̂_1)² = 0.645

Mean (μ_2) and standard deviation (σ_2) for the second cluster:

  μ̂_2 = (1/5) Σ_{i=11}^{15} x_i,  σ̂_2² = (1/4) Σ_{i=11}^{15} (x_i - μ̂_2)² = 1.950

The joint distribution of data and label is given by f_{X,L}(x, l) = f(x | l) p(l), while the mixture is:

  f_X(x) = N(μ_1, σ_1²)(x) p + N(μ_2, σ_2²)(x) (1 - p)

where X is the 1-D random variable from which the data are extracted and L is the discrete random variable of the labels. Therefore, on the basis of the training samples, we assume that:

  X | l=1 ~ N[μ̂_1 = 4.9, σ̂_1² = 0.645],  X | l=2 ~ N[μ̂_2, σ̂_2² = 1.950]

Now we need to estimate the parameter p_1 = p (and p_2 = 1 - p); we can do it by:
  a. the moment method;
  b. looking at the data histogram.

Using the first moment (mean value), we have:

  ∫ x f_X(x) dx = p ∫ x f_1(x) dx + (1 - p) ∫ x f_2(x) dx = p μ_1 + (1 - p) μ_2 = p (μ_1 - μ_2) + μ_2

Computing the sample mean of all data:

  μ̂ = (1/N) Σ_{i=1}^{30} x_i = 7.05

we can find the estimate of p by solving the following equation:

  μ̂ = p (μ̂_1 - μ̂_2) + μ̂_2  =>  p̂ = (μ̂ - μ̂_2) / (μ̂_1 - μ̂_2) = 0.38

Using the second moment, we have:

  σ² = ∫ (x - μ)² f_X(x) dx = p ∫ x² f_1(x) dx + (1 - p) ∫ x² f_2(x) dx - μ²

25 which leads to:

  σ² = p (σ_1² + μ_1²) + (1 - p) (σ_2² + μ_2²) - μ² = p σ_1² + (1 - p) σ_2² + p (1 - p) (μ_1 - μ_2)²

Computing the sample variance of all data:

  σ̂² = (1/N) Σ_{i=1}^{N} x_i² - μ̂²

we can find the estimate of p by solving the following equation, which is quadratic in p:

  σ̂² = p σ̂_1² + (1 - p) σ̂_2² + p (1 - p) (μ̂_1 - μ̂_2)²

  (μ̂_1 - μ̂_2)² p² - [ (μ̂_1 - μ̂_2)² + (σ̂_1² - σ̂_2²) ] p + (σ̂² - σ̂_2²) = 0

Of the two roots p_(1,2), only one is acceptable; note that it must be 0 <= p <= 1!

Data histogram

Separate the data into classes and build the histogram (N_i = absolute frequency of the class i). This histogram is an approximation of the mixture distribution.

  Class 1: values between 1 and 3
  Class 2: values between 3 and 5
  Class 3: values between 5 and 7
  Class 4: values between 7 and 9
  Class 5: values between 9 and 11
  Class 6: values between 11 and 13

26 The relative frequency N_i/N of the class i is an approximation of the probability of the data to fall in the corresponding interval [a_i, a_{i+1}]:

  N_i/N ≈ P{x ∈ [a_i, a_{i+1}]} = p P_1{x ∈ [a_i, a_{i+1}]} + (1 - p) P_2{x ∈ [a_i, a_{i+1}]}
        = p ∫_{a_i}^{a_{i+1}} N(μ̂_1, σ̂_1²) dx + (1 - p) ∫_{a_i}^{a_{i+1}} N(μ̂_2, σ̂_2²) dx

In order to solve these integrals, let us consider the standard normal variable z = (x - μ)/σ and fill a table with, for each class and for both clusters, the standardized bounds z_inf and z_sup and the corresponding probabilities P(z_inf < z < z_sup), read from a table of the standard normal distribution.

The previous equations based on the relative frequencies can be used to define the terms of the Least Squares problem:

  y_0 = A p + b + ν

where the observation vector y_0 collects the relative frequencies N_i/N of the six classes, the design matrix A collects the differences P_1{x ∈ [a_i, a_{i+1}]} - P_2{x ∈ [a_i, a_{i+1}]}, and the known term b collects the probabilities P_2{x ∈ [a_i, a_{i+1}]}:

  p̂ = (AᵀA)⁻¹ Aᵀ (y_0 - b)

27 Actually, there is a problem: separating the data into 6 classes with a different number of elements, the cofactor matrix Q of the observations is not equal to I (because the relative frequencies are observations of the probabilities with different accuracies). The solution is to build classes with the same number of elements (here 5 elements each):

  Class 1: values between 2.5 and 5
  Class 2: values between 5 and 5.54
  Class 3: values between 5.54 and 7.1
  Class 4: values between 7.1 and 7.8
  Class 5: values between 7.8 and 9
  Class 6: values between 9 and 11

Repeating the same reasoning as before (standardizing the class bounds and reading the probabilities from the table of the standard normal distribution), the terms of the new Least Squares problem (where the use of Q = I is correct) are:

  y_0 = A p + b + ν,  y_0 = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]ᵀ

  p̂ = (AᵀA)⁻¹ Aᵀ (y_0 - b)
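The two moment-method estimates of p derived above can be sketched as follows (a Python illustration with synthetic data and illustrative names; the least-squares histogram fit is omitted):

```python
def moment_estimates_p(data, mu1, mu2, var1, var2):
    """Estimates of the mixture weight p from the sample moments:
    first moment:  mean = p*mu1 + (1-p)*mu2;
    second moment: var = p*var1 + (1-p)*var2 + p*(1-p)*(mu1-mu2)**2,
    which is quadratic in p (only roots with 0 <= p <= 1 are acceptable)."""
    n = len(data)
    mean = sum(data) / n
    var = sum(x ** 2 for x in data) / n - mean ** 2
    p_first = (mean - mu2) / (mu1 - mu2)
    # second moment: solve d2*p^2 - (d2 + var1 - var2)*p + (var - var2) = 0
    d2 = (mu1 - mu2) ** 2
    A, B, C = d2, -(d2 + var1 - var2), var - var2
    disc = (B ** 2 - 4 * A * C) ** 0.5
    roots = [(-B - disc) / (2 * A), (-B + disc) / (2 * A)]
    p_second = [r for r in roots if 0 <= r <= 1]
    return p_first, p_second
```

Note that the second-moment equation may admit two acceptable roots, in which case additional information (e.g. the first-moment estimate or the histogram) is needed to choose between them.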

28 We can now classify the data, both using the likelihood ratio and the discriminant analysis.

Likelihood ratio

  R(x) = [ p / (√(2π) σ_1) ] exp( -(x - μ_1)² / (2σ_1²) ) / { [ (1 - p) / (√(2π) σ_2) ] exp( -(x - μ_2)² / (2σ_2²) ) } > 1  =>  cluster 1

By forcing the likelihood ratio to be equal to 1, one can get the thresholds to discriminate between the first and the second cluster:

  ln p - ln σ_1 - (x - μ_1)² / (2σ_1²) = ln(1 - p) - ln σ_2 - (x - μ_2)² / (2σ_2²)

which is a quadratic equation in x:

  x² (1/σ_1² - 1/σ_2²) - 2x (μ_1/σ_1² - μ_2/σ_2²) + μ_1²/σ_1² - μ_2²/σ_2² + 2 ln(σ_1/σ_2) - 2 ln( p/(1 - p) ) = 0

Substituting the estimated values of μ_1, μ_2, σ_1, σ_2 and p (from the first moment method), we get two roots x_(1,2); since σ_1 < σ_2, the values of x between the two thresholds are assigned to cluster 1, the others to cluster 2.

Discriminant analysis

  â = 2 p (1 - p) (μ̂_1 - μ̂_2) / σ̂²,  b̂ = â μ̂ - (2p - 1)

  L(x) = â x - b̂ = 0  =>  threshold x_0 = b̂ / â

Since μ̂_1 < μ̂_2, â < 0 and L(x) > 0 for x < x_0: the values x < x_0 are assigned to cluster 1, the values x > x_0 to cluster 2.
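The 1-D discriminant rule of the example can be sketched in a few lines (a Python illustration assuming the mixture parameters are known; the function names are ours):

```python
def linear_discriminant(mu1, mu2, sigma2, p):
    """1-D linear discriminant L(x) = a*x - b approximating
    G(x) = p(l=1|x) - p(l=2|x):
    a = 2*p*(1-p)*(mu1 - mu2)/sigma2, b = a*mu - (2p - 1),
    with mu the mixture mean and sigma2 the mixture variance."""
    mu = p * mu1 + (1 - p) * mu2
    a = 2 * p * (1 - p) * (mu1 - mu2) / sigma2
    b = a * mu - (2 * p - 1)
    return a, b

def classify(x, a, b):
    # L(x) > 0: cluster 1, otherwise cluster 2
    return 1 if a * x - b > 0 else 2
```

For equal priors (p = 0.5) the threshold b/a reduces to the midpoint between the two cluster means, as expected.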

29 APPENDIX: Discriminant Analysis Formula Proof

Given two clusters with labels l = 1 and l = 2, we can define the discriminant function G(x) of the data as:

  G(x) = p(l = 1 | x) - p(l = 2 | x)

so that the optimal classification according to the Bayesian approach is:

  G(x) > 0  =>  l = 1
  G(x) < 0  =>  l = 2

We want to approximate G(x) with a linear function L(x) = a'x - b, using the following principle:

  E{ [L(x) - G(x)]² } = E{ [a'x - b - G(x)]² } = min over a, b

In order to minimize the expression above, we compute the partial derivatives with respect to a and b, and then force them equal to zero:

  ∂/∂b E{ [a'x - b - G(x)]² } = -2 E{ a'x - b - G(x) } = 0   =>  a'E{x} - b - E{G(x)} = 0
  ∂/∂a E{ [a'x - b - G(x)]² } = 2 E{ x [a'x - b - G(x)] } = 0  =>  E{xx'}a - b E{x} - E{x G(x)} = 0

Recalling that the mixture distribution is given by f(x) = p_1 f_1(x) + p_2 f_2(x), with p_1 + p_2 = 1, we can write:

  E{x} = ∫ x f(x) dx = p_1 μ_1 + p_2 μ_2 = μ

so that the system becomes:

  μ'a - b - E{G(x)} = 0
  E{xx'}a - b μ - E{x G(x)} = 0

Multiplying the first equation by μ and subtracting it from the second equation, we get:

  [E{xx'} - μμ'] a = E{x G(x)} - μ E{G(x)} = E{ (x - μ) G(x) }

where E{xx'} - μμ' = C, because C = E{ (x - μ)(x - μ)' } = E{xx'} - μμ'.

30 Moreover:

  E{ (x - μ) G(x) } = E{ (x - μ) p(l = 1 | x) } - E{ (x - μ) p(l = 2 | x) }
                    = ∫ (x - μ) [ p_1 f_1(x) / f(x) ] f(x) dx - ∫ (x - μ) [ p_2 f_2(x) / f(x) ] f(x) dx
                    = p_1 ( ∫ x f_1(x) dx - μ ∫ f_1(x) dx ) - p_2 ( ∫ x f_2(x) dx - μ ∫ f_2(x) dx )
                    = p_1 (μ_1 - μ) - p_2 (μ_2 - μ) = 2 p_1 p_2 (μ_1 - μ_2)

since μ_1 - μ = p_2 (μ_1 - μ_2) and μ_2 - μ = p_1 (μ_2 - μ_1). Therefore the equation for the estimate of a is given by:

  C â = 2 p_1 p_2 (μ_1 - μ_2)  =>  â = 2 p_1 p_2 C⁻¹ (μ_1 - μ_2)

where the covariance matrix C of the mixture can be written as:

  C = ∫ x x' f(x) dx - μμ' = p_1 ∫ x x' f_1(x) dx + p_2 ∫ x x' f_2(x) dx - μμ'
    = p_1 (C_1 + μ_1 μ_1') + p_2 (C_2 + μ_2 μ_2') - μμ' = C̄ + p_1 p_2 (μ_1 - μ_2)(μ_1 - μ_2)'

with C̄ = p_1 C_1 + p_2 C_2.

From the first equation of the system we have:

  b̂ = μ'â - E{G(x)}

where:

  E{G(x)} = ∫ G(x) f(x) dx = ∫ [ p_1 f_1(x) - p_2 f_2(x) ] dx = p_1 - p_2

so that:

  b̂ = μ'â - (p_1 - p_2)


More information

Mean Field / Variational Approximations

Mean Field / Variational Approximations Mean Feld / Varatonal Appromatons resented by Jose Nuñez 0/24/05 Outlne Introducton Mean Feld Appromaton Structured Mean Feld Weghted Mean Feld Varatonal Methods Introducton roblem: We have dstrbuton but

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

JSM Survey Research Methods Section. Is it MAR or NMAR? Michail Sverchkov

JSM Survey Research Methods Section. Is it MAR or NMAR? Michail Sverchkov JSM 2013 - Survey Researh Methods Seton Is t MAR or NMAR? Mhal Sverhkov Bureau of Labor Statsts 2 Massahusetts Avenue, NE, Sute 1950, Washngton, DC. 20212, Sverhkov.Mhael@bls.gov Abstrat Most methods that

More information

Chapter 3 Differentiation and Integration

Chapter 3 Differentiation and Integration MEE07 Computer Modelng Technques n Engneerng Chapter Derentaton and Integraton Reerence: An Introducton to Numercal Computatons, nd edton, S. yakowtz and F. zdarovsky, Mawell/Macmllan, 990. Derentaton

More information

Mathematical Economics MEMF e ME. Filomena Garcia. Topic 2 Calculus

Mathematical Economics MEMF e ME. Filomena Garcia. Topic 2 Calculus Mathematcal Economcs MEMF e ME Flomena Garca Topc 2 Calculus Mathematcal Economcs - www.seg.utl.pt/~garca/economa_matematca . Unvarate Calculus Calculus Functons : X Y y ( gves or each element X one element

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

Sampling Theory MODULE V LECTURE - 17 RATIO AND PRODUCT METHODS OF ESTIMATION

Sampling Theory MODULE V LECTURE - 17 RATIO AND PRODUCT METHODS OF ESTIMATION Samplng Theory MODULE V LECTURE - 7 RATIO AND PRODUCT METHODS OF ESTIMATION DR. SHALABH DEPARTMENT OF MATHEMATICS AND STATISTICS INDIAN INSTITUTE OF TECHNOLOG KANPUR Propertes of separate rato estmator:

More information

APPROXIMATE OPTIMAL CONTROL OF LINEAR TIME-DELAY SYSTEMS VIA HAAR WAVELETS

APPROXIMATE OPTIMAL CONTROL OF LINEAR TIME-DELAY SYSTEMS VIA HAAR WAVELETS Journal o Engneerng Sene and ehnology Vol., No. (6) 486-498 Shool o Engneerng, aylor s Unversty APPROIAE OPIAL CONROL OF LINEAR IE-DELAY SYSES VIA HAAR WAVELES AKBAR H. BORZABADI*, SOLAYAN ASADI Shool

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

On Generalized Fractional Hankel Transform

On Generalized Fractional Hankel Transform Int. ournal o Math. nalss Vol. 6 no. 8 883-896 On Generaled Fratonal ankel Transorm R. D. Tawade Pro.Ram Meghe Insttute o Tehnolog & Researh Badnera Inda rajendratawade@redmal.om. S. Gudadhe Dept.o Mathemats

More information

Complex Variables. Chapter 18 Integration in the Complex Plane. March 12, 2013 Lecturer: Shih-Yuan Chen

Complex Variables. Chapter 18 Integration in the Complex Plane. March 12, 2013 Lecturer: Shih-Yuan Chen omplex Varables hapter 8 Integraton n the omplex Plane March, Lecturer: Shh-Yuan hen Except where otherwse noted, content s lcensed under a BY-N-SA. TW Lcense. ontents ontour ntegrals auchy-goursat theorem

More information

Correlation and Regression without Sums of Squares. (Kendall's Tau) Rudy A. Gideon ABSTRACT

Correlation and Regression without Sums of Squares. (Kendall's Tau) Rudy A. Gideon ABSTRACT Correlaton and Regson wthout Sums of Squa (Kendall's Tau) Rud A. Gdeon ABSTRACT Ths short pee provdes an ntroduton to the use of Kendall's τ n orrelaton and smple lnear regson. The error estmate also uses

More information

A linear imaging system with white additive Gaussian noise on the observed data is modeled as follows:

A linear imaging system with white additive Gaussian noise on the observed data is modeled as follows: Supplementary Note Mathematcal bacground A lnear magng system wth whte addtve Gaussan nose on the observed data s modeled as follows: X = R ϕ V + G, () where X R are the expermental, two-dmensonal proecton

More information

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s

More information

Population element: 1 2 N. 1.1 Sampling with Replacement: Hansen-Hurwitz Estimator(HH)

Population element: 1 2 N. 1.1 Sampling with Replacement: Hansen-Hurwitz Estimator(HH) Chapter 1 Samplng wth Unequal Probabltes Notaton: Populaton element: 1 2 N varable of nterest Y : y1 y2 y N Let s be a sample of elements drawn by a gven samplng method. In other words, s s a subset of

More information

The corresponding link function is the complementary log-log link The logistic model is comparable with the probit model if

The corresponding link function is the complementary log-log link The logistic model is comparable with the probit model if SK300 and SK400 Lnk funtons for bnomal GLMs Autumn 08 We motvate the dsusson by the beetle eample GLMs for bnomal and multnomal data Covers the followng materal from hapters 5 and 6: Seton 5.6., 5.6.3,

More information

Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis

Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis Statstcal analyss usng matlab HY 439 Presented by: George Fortetsanaks Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons Contnuous dstrbutons Contnuous random varable X

More information

9 : Learning Partially Observed GM : EM Algorithm

9 : Learning Partially Observed GM : EM Algorithm 10-708: Probablstc Graphcal Models 10-708, Sprng 2014 9 : Learnng Partally Observed GM : EM Algorthm Lecturer: Erc P. Xng Scrbes: Rohan Ramanath, Rahul Goutam 1 Generalzed Iteratve Scalng In ths secton,

More information

DOAEstimationforCoherentSourcesinBeamspace UsingSpatialSmoothing

DOAEstimationforCoherentSourcesinBeamspace UsingSpatialSmoothing DOAEstmatonorCoherentSouresneamspae UsngSpatalSmoothng YnYang,ChunruWan,ChaoSun,QngWang ShooloEletralandEletronEngneerng NanangehnologalUnverst,Sngapore,639798 InsttuteoAoustEngneerng NorthwesternPoltehnalUnverst,X

More information

MIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU

MIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU Group M D L M Chapter Bayesan Decson heory Xn-Shun Xu @ SDU School of Computer Scence and echnology, Shandong Unversty Bayesan Decson heory Bayesan decson theory s a statstcal approach to data mnng/pattern

More information

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2) 1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons

More information

Clustering through Mixture Models

Clustering through Mixture Models lusterng through Mxture Models General referenes: Lndsay B.G. 995 Mxture models: theory geometry and applatons FS- BMS Regonal onferene Seres n Probablty and Statsts. MLahlan G.J. Basford K.E. 988 Mxture

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

Spectral Clustering. Shannon Quinn

Spectral Clustering. Shannon Quinn Spectral Clusterng Shannon Qunn (wth thanks to Wllam Cohen of Carnege Mellon Unverst, and J. Leskovec, A. Raaraman, and J. Ullman of Stanford Unverst) Graph Parttonng Undrected graph B- parttonng task:

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Calculation of time complexity (3%)

Calculation of time complexity (3%) Problem 1. (30%) Calculaton of tme complexty (3%) Gven n ctes, usng exhaust search to see every result takes O(n!). Calculaton of tme needed to solve the problem (2%) 40 ctes:40! dfferent tours 40 add

More information

Numerical Methods Solution of Nonlinear Equations

Numerical Methods Solution of Nonlinear Equations umercal Methods Soluton o onlnear Equatons Lecture Soluton o onlnear Equatons Root Fndng Prolems Dentons Classcaton o Methods Analytcal Solutons Graphcal Methods umercal Methods Bracketng Methods Open

More information

Support Vector Machines

Support Vector Machines CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at

More information

General Tips on How to Do Well in Physics Exams. 1. Establish a good habit in keeping track of your steps. For example, when you use the equation

General Tips on How to Do Well in Physics Exams. 1. Establish a good habit in keeping track of your steps. For example, when you use the equation General Tps on How to Do Well n Physcs Exams 1. Establsh a good habt n keepng track o your steps. For example when you use the equaton 1 1 1 + = d d to solve or d o you should rst rewrte t as 1 1 1 = d

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

1 GSW Iterative Techniques for y = Ax

1 GSW Iterative Techniques for y = Ax 1 for y = A I m gong to cheat here. here are a lot of teratve technques that can be used to solve the general case of a set of smultaneous equatons (wrtten n the matr form as y = A), but ths chapter sn

More information

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede Fall 0 Analyss of Expermental easurements B. Esensten/rev. S. Errede We now reformulate the lnear Least Squares ethod n more general terms, sutable for (eventually extendng to the non-lnear case, and also

More information

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14 APPROXIMAE PRICES OF BASKE AND ASIAN OPIONS DUPON OLIVIER Prema 14 Contents Introducton 1 1. Framewor 1 1.1. Baset optons 1.. Asan optons. Computng the prce 3. Lower bound 3.1. Closed formula for the prce

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

b ), which stands for uniform distribution on the interval a x< b. = 0 elsewhere

b ), which stands for uniform distribution on the interval a x< b. = 0 elsewhere Fall Analyss of Epermental Measurements B. Esensten/rev. S. Errede Some mportant probablty dstrbutons: Unform Bnomal Posson Gaussan/ormal The Unform dstrbuton s often called U( a, b ), hch stands for unform

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased

More information

PHYS 705: Classical Mechanics. Calculus of Variations II

PHYS 705: Classical Mechanics. Calculus of Variations II 1 PHYS 705: Classcal Mechancs Calculus of Varatons II 2 Calculus of Varatons: Generalzaton (no constrant yet) Suppose now that F depends on several dependent varables : We need to fnd such that has a statonary

More information

Solutions to exam in SF1811 Optimization, Jan 14, 2015

Solutions to exam in SF1811 Optimization, Jan 14, 2015 Solutons to exam n SF8 Optmzaton, Jan 4, 25 3 3 O------O -4 \ / \ / The network: \/ where all lnks go from left to rght. /\ / \ / \ 6 O------O -5 2 4.(a) Let x = ( x 3, x 4, x 23, x 24 ) T, where the varable

More information

International Mathematical Olympiad. Preliminary Selection Contest 2012 Hong Kong. Outline of Solutions

International Mathematical Olympiad. Preliminary Selection Contest 2012 Hong Kong. Outline of Solutions Internatonal Mathematcal Olympad Prelmnary Selecton ontest Hong Kong Outlne of Solutons nswers: 7 4 7 4 6 5 9 6 99 7 6 6 9 5544 49 5 7 4 6765 5 6 6 7 6 944 9 Solutons: Snce n s a two-dgt number, we have

More information

ASSESSMENT OF UNCERTAINTY IN ESTIMATION OF STORED AND RECOVERABLE THERMAL ENERGY IN GEOTHERMAL RESERVOIRS BY VOLUMETRIC METHODS

ASSESSMENT OF UNCERTAINTY IN ESTIMATION OF STORED AND RECOVERABLE THERMAL ENERGY IN GEOTHERMAL RESERVOIRS BY VOLUMETRIC METHODS PROCEEDINGS, Thrty-Fourth Workshop on Geothermal Reservor Engneerng Stanord Unversty, Stanord, Calorna, February 9-11, 009 SGP-TR-187 ASSESSMENT OF UNCERTAINTY IN ESTIMATION OF STORED AND RECOVERABLE THERMAL

More information

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y)

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y) Secton 1.5 Correlaton In the prevous sectons, we looked at regresson and the value r was a measurement of how much of the varaton n y can be attrbuted to the lnear relatonshp between y and x. In ths secton,

More information

The Karush-Kuhn-Tucker. Nuno Vasconcelos ECE Department, UCSD

The Karush-Kuhn-Tucker. Nuno Vasconcelos ECE Department, UCSD e Karus-Kun-ucker condtons and dualt Nuno Vasconcelos ECE Department, UCSD Optmzaton goal: nd mamum or mnmum o a uncton Denton: gven unctons, g, 1,...,k and, 1,...m dened on some doman Ω R n mn w, w Ω

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Mr.Said Anwar Shah, Dr. Noor Badshah,

Mr.Said Anwar Shah, Dr. Noor Badshah, Internatonal Journal of Sentf & Engneerng Researh Volume 5 Issue 5 Ma-04 56 ISS 9-558 Level Set Method for Image Segmentaton and B Feld Estmaton usng Coeffent of Varaton wth Loal Statstal Informaton Mr.Sad

More information

Discriminative Estimation (Maxent models and perceptron)

Discriminative Estimation (Maxent models and perceptron) srmnatve Estmaton Maxent moels an pereptron Generatve vs. srmnatve moels Many sles are aapte rom sles by hrstopher Mannng Introuton So ar we ve looke at generatve moels Nave Bayes But there s now muh use

More information

GEL 446: Applied Environmental Geology

GEL 446: Applied Environmental Geology GE 446: ppled Envronmental Geology Watershed Delneaton and Geomorphology Watershed Geomorphology Watersheds are fundamental geospatal unts that provde a physal and oneptual framewor wdely used by sentsts,

More information

A MODIFIED METHOD FOR SOLVING SYSTEM OF NONLINEAR EQUATIONS

A MODIFIED METHOD FOR SOLVING SYSTEM OF NONLINEAR EQUATIONS Journal of Mathematcs and Statstcs 9 (1): 4-8, 1 ISSN 1549-644 1 Scence Publcatons do:1.844/jmssp.1.4.8 Publshed Onlne 9 (1) 1 (http://www.thescpub.com/jmss.toc) A MODIFIED METHOD FOR SOLVING SYSTEM OF

More information

: 5: ) A

: 5: ) A Revew 1 004.11.11 Chapter 1: 1. Elements, Varable, and Observatons:. Type o Data: Qualtatve Data and Quanttatve Data (a) Qualtatve data may be nonnumerc or numerc. (b) Quanttatve data are always numerc.

More information

Shuai Dong. Isaac Newton. Gottfried Leibniz

Shuai Dong. Isaac Newton. Gottfried Leibniz Computatonal pyscs Sua Dong Isaac Newton Gottred Lebnz Numercal calculus poston dervatve ntegral v velocty dervatve ntegral a acceleraton Numercal calculus Numercal derentaton Numercal ntegraton Roots

More information

Computational Biology Lecture 8: Substitution matrices Saad Mneimneh

Computational Biology Lecture 8: Substitution matrices Saad Mneimneh Computatonal Bology Lecture 8: Substtuton matrces Saad Mnemneh As we have ntroduced last tme, smple scorng schemes lke + or a match, - or a msmatch and -2 or a gap are not justable bologcally, especally

More information

XII.3 The EM (Expectation-Maximization) Algorithm

XII.3 The EM (Expectation-Maximization) Algorithm XII.3 The EM (Expectaton-Maxzaton) Algorth Toshnor Munaata 3/7/06 The EM algorth s a technque to deal wth varous types of ncoplete data or hdden varables. It can be appled to a wde range of learnng probles

More information

9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov

9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov 9.93 Class IV Part I Bayesan Decson Theory Yur Ivanov TOC Roadmap to Machne Learnng Bayesan Decson Makng Mnmum Error Rate Decsons Mnmum Rsk Decsons Mnmax Crteron Operatng Characterstcs Notaton x - scalar

More information

p 1 c 2 + p 2 c 2 + p 3 c p m c 2

p 1 c 2 + p 2 c 2 + p 3 c p m c 2 Where to put a faclty? Gven locatons p 1,..., p m n R n of m houses, want to choose a locaton c n R n for the fre staton. Want c to be as close as possble to all the house. We know how to measure dstance

More information

Image classification. Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing i them?

Image classification. Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing i them? Image classfcaton Gven te bag-of-features representatons of mages from dfferent classes ow do we learn a model for dstngusng tem? Classfers Learn a decson rule assgnng bag-offeatures representatons of

More information

Prof. Paolo Colantonio a.a

Prof. Paolo Colantonio a.a Pro. Paolo olantono a.a. 3 4 Let s consder a two ports network o Two ports Network o L For passve network (.e. wthout nternal sources or actve devces), a general representaton can be made by a sutable

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maxmum Lkelhood Estmaton INFO-2301: Quanttatve Reasonng 2 Mchael Paul and Jordan Boyd-Graber MARCH 7, 2017 INFO-2301: Quanttatve Reasonng 2 Paul and Boyd-Graber Maxmum Lkelhood Estmaton 1 of 9 Why MLE?

More information

CHAPTER 7 CONSTRAINED OPTIMIZATION 2: SQP AND GRG

CHAPTER 7 CONSTRAINED OPTIMIZATION 2: SQP AND GRG Chapter 7: Constraned Optmzaton CHAPER 7 CONSRAINED OPIMIZAION : SQP AND GRG Introducton In the prevous chapter we eamned the necessary and suffcent condtons for a constraned optmum. We dd not, however,

More information

Probability Density Function Estimation by different Methods

Probability Density Function Estimation by different Methods EEE 739Q SPRIG 00 COURSE ASSIGMET REPORT Probablty Densty Functon Estmaton by dfferent Methods Vas Chandraant Rayar Abstract The am of the assgnment was to estmate the probablty densty functon (PDF of

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

ORDINARY DIFFERENTIAL EQUATIONS EULER S METHOD

ORDINARY DIFFERENTIAL EQUATIONS EULER S METHOD Numercal Analss or Engneers German Jordanan Unverst ORDINARY DIFFERENTIAL EQUATIONS We wll eplore several metods o solvng rst order ordnar derental equatons (ODEs and we wll sow ow tese metods can be appled

More information

Bayesian Learning. Smart Home Health Analytics Spring Nirmalya Roy Department of Information Systems University of Maryland Baltimore County

Bayesian Learning. Smart Home Health Analytics Spring Nirmalya Roy Department of Information Systems University of Maryland Baltimore County Smart Home Health Analytcs Sprng 2018 Bayesan Learnng Nrmalya Roy Department of Informaton Systems Unversty of Maryland Baltmore ounty www.umbc.edu Bayesan Learnng ombnes pror knowledge wth evdence to

More information

COMPUTER SCIENCE 349A SAMPLE EXAM QUESTIONS WITH SOLUTIONS PARTS 1, 2

COMPUTER SCIENCE 349A SAMPLE EXAM QUESTIONS WITH SOLUTIONS PARTS 1, 2 COMPUTE SCIENCE 49A SAMPLE EXAM QUESTIONS WITH SOLUTIONS PATS, PAT.. a Dene he erm ll-ondoned problem. b Gve an eample o a polynomal ha has ll-ondoned zeros.. Consder evaluaon o anh, where e e anh. e e

More information

CHAPTER 4d. ROOTS OF EQUATIONS

CHAPTER 4d. ROOTS OF EQUATIONS CHAPTER 4d. ROOTS OF EQUATIONS A. J. Clark School o Engneerng Department o Cvl and Envronmental Engneerng by Dr. Ibrahm A. Assakka Sprng 00 ENCE 03 - Computaton Methods n Cvl Engneerng II Department o

More information

MULTICRITERION OPTIMIZATION OF LAMINATE STACKING SEQUENCE FOR MAXIMUM FAILURE MARGINS

MULTICRITERION OPTIMIZATION OF LAMINATE STACKING SEQUENCE FOR MAXIMUM FAILURE MARGINS MLTICRITERION OPTIMIZATION OF LAMINATE STACKING SEENCE FOR MAXIMM FAILRE MARGINS Petr Kere and Juhan Kos Shool of Engneerng, Natonal nversty of ruguay J. Herrera y Ressg 565, Montevdeo, ruguay Appled Mehans,

More information

Maximal Margin Classifier

Maximal Margin Classifier CS81B/Stat41B: Advanced Topcs n Learnng & Decson Makng Mamal Margn Classfer Lecturer: Mchael Jordan Scrbes: Jana van Greunen Corrected verson - /1/004 1 References/Recommended Readng 1.1 Webstes www.kernel-machnes.org

More information

Statistics and Probability Theory in Civil, Surveying and Environmental Engineering

Statistics and Probability Theory in Civil, Surveying and Environmental Engineering Statstcs and Probablty Theory n Cvl, Surveyng and Envronmental Engneerng Pro. Dr. Mchael Havbro Faber ETH Zurch, Swtzerland Contents o Todays Lecture Overvew o Uncertanty Modelng Random Varables - propertes

More information

Unified Subspace Analysis for Face Recognition

Unified Subspace Analysis for Face Recognition Unfed Subspace Analyss for Face Recognton Xaogang Wang and Xaoou Tang Department of Informaton Engneerng The Chnese Unversty of Hong Kong Shatn, Hong Kong {xgwang, xtang}@e.cuhk.edu.hk Abstract PCA, LDA

More information