Unsupervised Learning, K-means and Derivative Algorithms. Virginia de Sa (desa at cogsci)
1 Unsupervised Learning, K-means and Derivative Algorithms 1 Virginia de Sa, desa at cogsci
2 Unsupervised Learning 2 No target data required. Extract structure from the data: density estimates, cluster memberships, or a reduced-dimensional representation.
3 Unsupervised algorithms are often forms of Hebbian Learning 3 Hebbian learning refers to modifying the strength of a connection according to a function of the input and output activity (often simply their product). It is based on a rule specified by the Canadian psychologist Donald Hebb in his 1949 book The Organization of Behavior: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased" (Hebb, 1949).
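As a concrete illustration of the product form of the rule (an assumed example, not from the slides), here is a minimal NumPy sketch of one Hebbian step for a single linear unit; the learning rate eta and the linear output y = w.x are choices made for the example:

```python
import numpy as np

def hebbian_update(w, x, eta=0.01):
    """One Hebbian step: dw = eta * y * x (output times input)."""
    y = w @ x                 # output activity of a linear unit (assumed)
    return w + eta * y * x    # strengthen in proportion to input * output
```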
4 Data Compression 4 We might want to compress data from high-dimensional spaces for several reasons: to enable us (and also machine learning algorithms) to better see relationships, and for more efficient storage and transmission of information (gzip, jpg). We want to do this while preserving as much of the useful information as possible. (Of course, how "useful" is determined is critical.) Clustering and PCA are different methods of dimensionality reduction.
5 PCA and Clustering 5 PCA represents a point using a smaller number of dimensions; the directions kept are the directions of greatest variance in the data. Clustering represents a point using prototype points.
6 K-means 6 A simple but effective clustering algorithm that partitions the data into K disjoint sets (clusters). Iterative batch algorithm: start with an initial guess of the K centers $\mu^{(j)}$; let $S^{(j)}$ be the set of all points closest to $\mu^{(j)}$; update $\mu^{(j)} = \frac{1}{N_j} \sum_{n \in S^{(j)}} x^{(n)}$; repeat until there is no change in the means.
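A minimal NumPy sketch of the batch algorithm above; the random initialization from the data points and the handling of empty clusters (keeping the old center) are assumptions, not specified on the slide:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Batch K-means: assign each point to its closest center, then re-average."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()  # initial guess of K centers
    for _ in range(n_iter):
        # S(j): index of the closest center for every point, via squared distances
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # mu(j) = (1/N_j) * sum of points in S(j); keep old center if S(j) is empty
        new_mu = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else mu[j]
                           for j in range(K)])
        if np.allclose(new_mu, mu):  # stop when the means no longer change
            break
        mu = new_mu
    return mu, labels
```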
[Slides 7-16: K-means iterations illustrated on example data.]
17 K-means 17 A simple but effective clustering algorithm that partitions the data into K disjoint sets (clusters). Iterative batch algorithm: start with an initial guess of the K centers $\mu^{(j)}$; let $S^{(j)}$ be the set of all points closest to $\mu^{(j)}$; update $\mu^{(j)} = \frac{1}{N_j} \sum_{n \in S^{(j)}} x^{(n)}$ until no change in the means.
18 Stochastic K-means = Competitive Learning 18 Find the weight $w^{(j)}$ that minimizes $\|w^{(j)} - x^{(n)}\|$ (the weight closest to the pattern) and move it closer to the pattern: $\Delta w^{(j)} = \eta(t)\,(x^{(n)} - w^{(j)})$. Decrease the learning rate $\eta(t)$ with time. [Figure: network with weight matrix W over inputs x1-x4.]
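A sketch of one online (competitive) step; the particular decay schedule $\eta_0 / (1 + t/\tau)$ is an illustrative assumption, not from the slide:

```python
import numpy as np

def competitive_step(W, x, t, eta0=0.5, tau=100.0):
    """One stochastic K-means step: move only the winning weight toward x."""
    eta = eta0 / (1.0 + t / tau)                 # assumed schedule for eta(t)
    j = ((W - x) ** 2).sum(axis=1).argmin()      # winner: weight closest to the pattern
    W[j] += eta * (x - W[j])                     # dw(j) = eta(t) * (x - w(j))
    return W
```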
[Slides 19-36: competitive learning illustrated step by step on example data.]
37 Kohonen Feature Mapping 37 Update the neighbours (in the output topography) as well as the winner. If $y^*$ refers to the winning output neuron, then we update the weights as $\Delta w^{(k)} = \eta(t)\,\lambda(\|y^{(k)} - y^*\|, t)\,(x - w^{(k)})$, where the window function $\lambda$ decreases with time.
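A sketch of one Kohonen step using a Gaussian window over grid distance; the exponential decay schedules for $\eta(t)$ and the window width $\sigma(t)$ are assumptions chosen for the example:

```python
import numpy as np

def som_step(W, grid, x, t, eta0=0.5, sigma0=2.0, tau=200.0):
    """One SOM step: update the winner and its output-topography neighbours.

    W    : (n_units, d) weight vectors
    grid : (n_units, 2) coordinates of each unit in the output topography
    """
    eta = eta0 * np.exp(-t / tau)                  # assumed decay for learning rate
    sigma = sigma0 * np.exp(-t / tau)              # assumed decay for window width
    winner = ((W - x) ** 2).sum(axis=1).argmin()
    d2 = ((grid - grid[winner]) ** 2).sum(axis=1)  # grid distance to the winner
    lam = np.exp(-d2 / (2 * sigma ** 2))           # Gaussian window, shrinks with time
    W += eta * lam[:, None] * (x - W)              # dw(k) = eta * lambda(k) * (x - w(k))
    return W
```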
[Slides 38-42: Kohonen feature mapping illustrated with examples; the figures appear to be from Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification (captions garbled in transcription).]
43 More examples 43
44 Some SOM applets 44 applet from rfhs8012.fh-regensburg.de/~saj39122/jfroehl/diplom/e-sample.html (the URLs of two further applets did not survive the transcription)
45 Let's look at a visual cortex example 45 Obermayer1990.pdf
46 Neural Gas: Learn the Topology 46 fritzke/fuzzypaper/node6.html
47 Aside: related supervised algorithms (Kohonen's Learning Vector Quantization) 47 Supervised methods for moving cluster centers (they make use of the given class labels). Can have more than one center per class. Move the centers to reduce the number of misclassified patterns. There are various flavours; LVQ2.1 minimizes the number of misclassified patterns.
48 LVQ2.1 Learning rule 48 Let $w^{(i)}$ and $w^{(j)}$ be the two closest codebook vectors. Only if exactly one of $w^{(i)}$ and $w^{(j)}$ belongs to the correct class, and $\min(\|x - w^{(i)}\| / \|x - w^{(j)}\|,\ \|x - w^{(j)}\| / \|x - w^{(i)}\|) < s$ ($x$ lies within a window of the border region), do the following (the rules below assume $w^{(i)}$ is from the correct class; switch the rules if not): $w^{(i)} \leftarrow w^{(i)} + \epsilon\,(x - w^{(i)})$, $w^{(j)} \leftarrow w^{(j)} - \epsilon\,(x - w^{(j)})$.
49 Improved LVQ2.1 Learning rule 49 Let $w^{(i)}$ and $w^{(j)}$ be the two closest codebook vectors. Only if exactly one of $w^{(i)}$ and $w^{(j)}$ belongs to the correct class, and $\min(\|x - w^{(i)}\| / \|x - w^{(j)}\|,\ \|x - w^{(j)}\| / \|x - w^{(i)}\|) < s(t)$ ($x$ lies within a window of the border region that decreases with time), do we apply the following (the rules below assume $w^{(i)}$ is from the correct class; switch the rules if not): $w^{(i)} \leftarrow w^{(i)} + \epsilon\,\frac{x - w^{(i)}}{\|x - w^{(i)}\|}$, $w^{(j)} \leftarrow w^{(j)} - \epsilon\,\frac{x - w^{(j)}}{\|x - w^{(j)}\|}$.
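A sketch covering both variants of the rule; here the window parameter s is held fixed rather than decreased with time, and the specific values of eps and s are illustrative assumptions:

```python
import numpy as np

def lvq21_step(W, labels, x, y, eps=0.05, s=0.3, normalized=False):
    """One LVQ2.1 step; set normalized=True for the improved (normalized) rule."""
    d = np.sqrt(((W - x) ** 2).sum(axis=1))
    i, j = np.argsort(d)[:2]                        # the two closest codebook vectors
    if (labels[i] == y) == (labels[j] == y):        # exactly one must be the correct class
        return W
    if labels[j] == y:                              # make w(i) the correct-class vector
        i, j = j, i
    if min(d[i] / d[j], d[j] / d[i]) >= s:          # x must lie in the border-region window
        return W
    step_i = (x - W[i]) / d[i] if normalized else (x - W[i])
    step_j = (x - W[j]) / d[j] if normalized else (x - W[j])
    W[i] += eps * step_i                            # pull the correct class closer
    W[j] -= eps * step_j                            # push the wrong class away
    return W
```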
50 LVQ2.1 in 2-D 50 $w^{(i)} \leftarrow w^{(i)} + \epsilon\,\frac{x - w^{(i)}}{\|x - w^{(i)}\|}$, $w^{(j)} \leftarrow w^{(j)} - \epsilon\,\frac{x - w^{(j)}}{\|x - w^{(j)}\|}$, where $w^{(i)}$ is from the correct class and $w^{(j)}$ from an incorrect class. [Figure: panels a-e showing codebook vectors moving in the $(x_1, x_2)$ plane.]
51 LVQ in 1-D 51 [Figure: class-conditional densities $P(C_A)p(x|C_A)$ and $P(C_B)p(x|C_B)$ along $x$, with the class A / class B decision regions and the forces on the boundary (force to the left vs. force to the right), compared for LVQ 2.1 and LVQ 2.0.]
52 LVQ in 1-D, Separable Distributions 52 [Figure: the same comparison of LVQ 2.1 and LVQ 2.0 for separable class-conditional densities $P(C_A)p(x|C_A)$ and $P(C_B)p(x|C_B)$.]
53 Problem with K-means 53 What will happen here?
54 Solution 54 Model the clusters as Gaussians, learn the covariance ellipses from the data, and use the probabilities given by the Gaussian densities to determine membership.
55 Mixture of Gaussians (MOG) = A softer k-means 55 Model the data as coming from a mixture of Gaussians, where you don't know which Gaussian generated which data point. Each Gaussian cluster has an associated proportion or prior probability $\pi_k$: $p(x) = \sum_{k=1}^{c} \pi_k p_k(x)$. In the mixture-of-Gaussians case $p_k(x) \sim N(\mu^{(k)}, \Sigma_k)$: $p_k(x) = \frac{1}{|2\pi\Sigma_k|^{0.5}} \exp\!\left(-\frac{(x - \mu^{(k)})^T \Sigma_k^{-1} (x - \mu^{(k)})}{2}\right)$. Mixture models can be generalized beyond Gaussians.
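A sketch of evaluating the mixture density, leaning on scipy.stats.multivariate_normal for the Gaussian component densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x; mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))
```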
56 MOG Solution 56 Normalize the probabilities to determine the responsibility of each cluster for each data point (soft responsibility): $r_k(x^{(n)}) = \frac{\pi_k p_k(x^{(n)})}{\sum_i \pi_i p_i(x^{(n)})}$. Now solve, similarly to the k-means solution: recompute the mean, covariance, and overall weighting for each cluster, with each data point contributing weight according to its responsibility, and then iterate as in k-means. $\mu^{(k)} = \frac{\sum_n r_k(x^{(n)})\, x^{(n)}}{\sum_n r_k(x^{(n)})}$, $\Sigma_k = \frac{\sum_n r_k(x^{(n)})\,(x^{(n)} - \mu^{(k)})(x^{(n)} - \mu^{(k)})^T}{\sum_n r_k(x^{(n)})}$, $\pi_k = \frac{\sum_n r_k(x^{(n)})}{\sum_i \sum_n r_i(x^{(n)})} = \frac{1}{N}\sum_n r_k(x^{(n)})$.
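One full iteration of the updates above (responsibilities, then weighted mean, covariance, and proportion), as a minimal sketch; it reuses scipy's Gaussian pdf and includes no safeguards against collapsing components:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_em_step(X, pis, mus, Sigmas):
    """One soft (E + M) iteration for a mixture of Gaussians."""
    N, K = len(X), len(pis)
    # E-step: r_k(x_n) = pi_k p_k(x_n) / sum_i pi_i p_i(x_n)
    R = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    R /= R.sum(axis=1, keepdims=True)
    # M-step: each point contributes weight r_k(x_n) to cluster k
    Nk = R.sum(axis=0)
    mus = (R.T @ X) / Nk[:, None]
    Sigmas = [(R[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
              for k in range(K)]
    pis = Nk / N
    return pis, mus, Sigmas
```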
57 Issues with MOG 57 Quite sensitive to initial conditions (applet); it's a good idea to initialize with k-means. There are a large number of parameters. We can reduce the number of parameters by a) constraining the Gaussians to have diagonal covariance matrices, or b) constraining the Gaussians to share the same covariance matrix.
[Slides 58-65: EM fitting a mixture of Gaussians, shown after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations. Figures from Andrew W. Moore's "Clustering with Gaussian Mixtures" tutorial slides, Copyright 2001, 2004, Andrew W. Moore.]