Unsupervised Learning: Linear Dimension Reduction

1 Unsupervised Learning: Linear Dimension Reduction

2 Unsupervised Learning covers Clustering & Dimension Reduction (simplifying the complex) and Generation (creating something from nothing). In clustering and dimension reduction we only have the inputs of the function we want to learn; in generation we only have its outputs, and the input is some code. These slides cover Clustering & Dimension Reduction.

3 Clustering. Open question: how many clusters do we need? K-means clusters $X = \{x^1, \dots, x^n, \dots, x^N\}$ into K clusters: initialize the cluster centers $c^i$, $i = 1, 2, \dots, K$ (as K random $x^n$ from X); then repeat the following two steps. For all $x^n$ in X, set $b^n_i = 1$ if $x^n$ is closest to $c^i$ and $b^n_i = 0$ otherwise; then update every center as $c^i = \sum_n b^n_i x^n \,/\, \sum_n b^n_i$. A code sketch follows below.
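
The K-means loop above can be written in a few lines of numpy. This is a minimal sketch: the data matrix, K, the iteration count, and the random seed are all illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, n_iters=50, seed=0):
    """K-means: X is (N, d); returns centers (K, d) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # K random x^n from X
    for _ in range(n_iters):
        # b_i^n = 1 iff x^n is closest to center c^i
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # c^i = sum_n b_i^n x^n / sum_n b_i^n
        for i in range(K):
            members = X[assign == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, assign

# Toy usage: 100 random 2-D points grouped into 3 clusters
centers, assign = kmeans(np.random.default_rng(1).normal(size=(100, 2)), K=3)
```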

4 Clustering: Hierarchical Agglomerative Clustering (HAC). Step 1: build a tree by repeatedly merging the two most similar clusters, until everything is joined at a single root. Step 2: pick a threshold, and cut the tree at that level to obtain the clusters. A sketch using an off-the-shelf implementation follows below.
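
A minimal sketch of the two HAC steps; using scipy here is my own choice (the slides do not name a library), and the data and threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 5))   # 20 toy examples

# Step 1: build the tree (agglomerative merges, here with average linkage)
tree = linkage(X, method='average')

# Step 2: pick a threshold on the merge distance to cut the tree into clusters
labels = fcluster(tree, t=2.0, criterion='distance')
print(labels)
```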

5 Distributed Representation. With clustering, an object must belong to exactly one cluster (e.g. "Gon is an Enhancer"). A distributed representation instead describes the object with a vector of attribute strengths, e.g. Gon is: Enhancement 0.70, Emission 0.25, Transformation 0.05, Manipulation 0.00, Conjuration 0.00, Specialization 0.00. Dimension reduction produces this kind of lower-dimensional distributed representation.

6 Dimension Reduction. Data that looks 3-D may actually lie on a 2-D surface curled up in the 3-D space, so it can be unrolled and described in 2-D without losing information.

7 Dimension Reduction. In MNIST, a digit image is 28 x 28 dimensions, but most 28 x 28-dim vectors do not look like digits at all, so the digits actually occupy a much smaller region and can be described with far fewer dimensions.

8 Dimension Reduction: find a function mapping $x$ to $z$, where the dimension of $z$ is smaller than that of $x$. One option is feature selection: simply keep a subset of the original dimensions (e.g. drop $x_1$ and keep $x_2$), which only helps when some dimensions carry little information. Another is principal component analysis (PCA) [Bishop, Chapter 12], a linear map $z = Wx$.

9 Principal Component Analysis (PCA)

10 PCA. $z = Wx$. Reduce to 1-D: $z_1 = w^1 \cdot x$. Project all the data points $x$ onto $w^1$ to obtain a set of $z_1$. We want the variance of $z_1$ to be as large as possible: maximize $\mathrm{Var}(z_1) = \sum_{z_1} (z_1 - \bar{z}_1)^2$ subject to $\|w^1\|_2 = 1$.

11 PCA. $z = Wx$. Project all the data points $x$ onto $w^1$ and obtain a set of $z_1$; we want the variance of $z_1$ as large as possible. Reduce to 1-D: $z_1 = w^1 \cdot x$, maximize $\mathrm{Var}(z_1) = \sum_{z_1} (z_1 - \bar{z}_1)^2$ subject to $\|w^1\|_2 = 1$. For the second dimension, $z_2 = w^2 \cdot x$: maximize $\mathrm{Var}(z_2) = \sum_{z_2} (z_2 - \bar{z}_2)^2$ subject to $\|w^2\|_2 = 1$ and $w^1 \cdot w^2 = 0$. Stacking the rows, $W = \begin{bmatrix} (w^1)^T \\ (w^2)^T \\ \vdots \end{bmatrix}$ is an orthogonal matrix.

12 Warning of Math

13 PCA. $z_1 = w^1 \cdot x$, so $\bar{z}_1 = \frac{1}{N}\sum z_1 = \frac{1}{N}\sum w^1 \cdot x = w^1 \cdot \frac{1}{N}\sum x = w^1 \cdot \bar{x}$. Then $\mathrm{Var}(z_1) = \frac{1}{N}\sum_{z_1}(z_1 - \bar{z}_1)^2 = \frac{1}{N}\sum_{x}\big(w^1 \cdot x - w^1 \cdot \bar{x}\big)^2 = \frac{1}{N}\sum\big(w^1 \cdot (x - \bar{x})\big)^2$. Using $(a \cdot b)^2 = (a^T b)^2 = a^T b\,a^T b = a^T b\,(a^T b)^T = a^T b\,b^T a$, this becomes $\frac{1}{N}\sum (w^1)^T (x - \bar{x})(x - \bar{x})^T w^1 = (w^1)^T \Big[\frac{1}{N}\sum (x - \bar{x})(x - \bar{x})^T\Big] w^1 = (w^1)^T \mathrm{Cov}(x)\, w^1$. So: find $w^1$ maximizing $(w^1)^T S w^1$ subject to $\|w^1\|_2^2 = (w^1)^T w^1 = 1$, where $S = \mathrm{Cov}(x)$.

14 Find $w^1$ maximizing $(w^1)^T S w^1$ subject to $(w^1)^T w^1 = 1$. $S = \mathrm{Cov}(x)$ is symmetric and positive semi-definite (non-negative eigenvalues). Using a Lagrange multiplier [Bishop, Appendix E]: $g(w^1) = (w^1)^T S w^1 - \alpha\big((w^1)^T w^1 - 1\big)$. Setting $\partial g(w^1)/\partial w^1_1 = 0$, $\partial g(w^1)/\partial w^1_2 = 0$, ... gives $S w^1 - \alpha w^1 = 0$, i.e. $S w^1 = \alpha w^1$, so $w^1$ is an eigenvector of $S$. Then $(w^1)^T S w^1 = \alpha (w^1)^T w^1 = \alpha$; choosing the maximum one, $w^1$ is the eigenvector of the covariance matrix $S$ corresponding to the largest eigenvalue $\lambda_1$.
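
The conclusion above can be checked numerically: the direction of maximum variance is the top eigenvector of the covariance matrix, and the attained variance is its eigenvalue. A small numpy sketch with made-up data (the data matrix is an illustrative assumption):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])
S = np.cov(X, rowvar=False)             # S = Cov(x)
eigvals, eigvecs = np.linalg.eigh(S)    # eigh: S is symmetric PSD
w1 = eigvecs[:, np.argmax(eigvals)]     # eigenvector with the largest eigenvalue
z1 = (X - X.mean(axis=0)) @ w1          # z_1 = w^1 . (x - x_bar) for every example
print(z1.var(ddof=1), eigvals.max())    # Var(z_1) equals lambda_1
```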

15 Find $w^2$ maximizing $(w^2)^T S w^2$ subject to $(w^2)^T w^2 = 1$ and $(w^2)^T w^1 = 0$. $g(w^2) = (w^2)^T S w^2 - \alpha\big((w^2)^T w^2 - 1\big) - \beta\big((w^2)^T w^1 - 0\big)$. Setting $\partial g(w^2)/\partial w^2_1 = 0$, $\partial g(w^2)/\partial w^2_2 = 0$, ... gives $S w^2 - \alpha w^2 - \beta w^1 = 0$. Left-multiplying by $(w^1)^T$: $(w^1)^T S w^2 - \alpha (w^1)^T w^2 - \beta (w^1)^T w^1 = 0$. Here $(w^1)^T S w^2 = \big((w^1)^T S w^2\big)^T = (w^2)^T S^T w^1 = (w^2)^T S w^1 = \lambda_1 (w^2)^T w^1 = 0$ (using $S w^1 = \lambda_1 w^1$), $(w^1)^T w^2 = 0$, and $(w^1)^T w^1 = 1$, so $\beta = 0$. Therefore $S w^2 - \alpha w^2 = 0$, i.e. $S w^2 = \alpha w^2$: $w^2$ is the eigenvector of the covariance matrix $S$ corresponding to the 2nd largest eigenvalue $\lambda_2$.

16 PCA - decorrelation. $z = Wx$ gives $\mathrm{Cov}(z) = D$, a diagonal matrix, i.e. the components of $z$ are uncorrelated. Indeed $\mathrm{Cov}(z) = \frac{1}{N}\sum (z - \bar{z})(z - \bar{z})^T = W S W^T$ with $S = \mathrm{Cov}(x)$, and $W S W^T = W S\,[w^1 \cdots w^K] = W\,[S w^1 \cdots S w^K] = W\,[\lambda_1 w^1 \cdots \lambda_K w^K] = [\lambda_1 W w^1 \cdots \lambda_K W w^K] = [\lambda_1 e_1 \cdots \lambda_K e_K] = D$, a diagonal matrix (since $W w^k = e_k$, the k-th standard basis vector).
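
A quick numerical check of the decorrelation property: projecting centered data onto the eigenvectors of S gives a z whose covariance is (numerically) diagonal. The mixing matrix used to build the toy data is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2., 1., 0.],
                                          [0., 1., 1.],
                                          [0., 0., .5]])   # correlated toy data
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
W = eigvecs[:, ::-1].T                 # rows are w^1, w^2, ... (descending eigenvalues)
Z = (X - X.mean(axis=0)) @ W.T         # z = W(x - x_bar) for every example
print(np.round(np.cov(Z, rowvar=False), 3))   # ~ diagonal matrix D of the eigenvalues
```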

17 End of Warning

18 PCA: Another Point of View. Suppose there is a set of basic components $u^1, u^2, u^3, u^4, u^5, \dots$ (e.g. basic strokes in a digit image). A digit image $x$ can then be written as $x \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K + \bar{x}$, so the coefficient vector $(c_1, c_2, \dots, c_K)$ can represent the digit image in place of its raw pixels.

19 PCA: Another Point of View. Write $x - \bar{x} \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K = \hat{x}$, with reconstruction error $\|(x - \bar{x}) - \hat{x}\|_2$. Find $u^1, \dots, u^K$ minimizing the total error $L = \min_{u^1, \dots, u^K} \sum \big\|(x - \bar{x}) - \sum_{k=1}^{K} c_k u^k\big\|_2$. In PCA, $z = Wx$ where the rows of $W$ are $(w^1)^T, (w^2)^T, \dots, (w^K)^T$, and these PCA components $w^1, w^2, \dots, w^K$ are exactly the $u^1, u^2, \dots, u^K$ that minimize $L$. Proof in [Bishop, Chapter 12].

20 $x - \bar{x} \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K = \hat{x}$; reconstruction error $\|(x - \bar{x}) - \hat{x}\|_2$. Find $u^1, \dots, u^K$ minimizing the error. Writing one equation per example: $x^1 - \bar{x} \approx c^1_1 u^1 + c^1_2 u^2 + \cdots$, $x^2 - \bar{x} \approx c^2_1 u^1 + c^2_2 u^2 + \cdots$, $x^3 - \bar{x} \approx c^3_1 u^1 + c^3_2 u^2 + \cdots$, and so on. Collecting the centered examples as the columns of a matrix $X$, this is the factorization $X \approx [u^1 \; u^2 \; \cdots]\,[c^n_k]$, and we want the factors minimizing the error.

21 Minimizing the error of this factorization is solved by singular value decomposition (SVD): $X_{M \times N} \approx U_{M \times K}\,\Sigma_{K \times K}\,V_{K \times N}$, where the K columns of U are a set of orthonormal eigenvectors corresponding to the K largest eigenvalues of $X X^T$. This is exactly the solution of PCA.
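
A minimal numpy sketch of this SVD view of PCA; the toy data, the layout (centered examples as columns of X, matching the M x N convention above), and the value of K are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))   # 100 examples, 6 features
X = (data - data.mean(axis=0)).T                             # M x N: centered examples as columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
K = 2
components = U[:, :K]                  # orthonormal eigenvectors of X X^T (largest eigenvalues)
codes = components.T @ X               # c_k = (x - x_bar) . u^k for every example
X_hat = components @ codes             # rank-K reconstruction
print(np.linalg.norm(X - X_hat))       # the minimized reconstruction error
```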

22 PCA looks like a neural network with one hidden layer (linear activation function) — an autoencoder. If $w^1, w^2, \dots, w^K$ are the components $u^1, u^2, \dots, u^K$, then $\hat{x} = \sum_{k=1}^{K} c_k w^k$ approximates $x - \bar{x}$, and to minimize the reconstruction error the coefficients are $c_k = (x - \bar{x}) \cdot w^k$. For K = 2: the input $x - \bar{x}$ (components $x_1, x_2, x_3$) is mapped to the hidden unit $c_1$ through the weights $w^1_1, w^1_2, w^1_3$, and mapped back to the output through the same weights.

23 PCA looks like a neural network with one hidden layer (linear activation function) — an autoencoder. If $w^1, w^2, \dots, w^K$ are the components $u^1, u^2, \dots, u^K$, the coefficients minimizing the reconstruction error are $c_k = (x - \bar{x}) \cdot w^k$. For K = 2 the network computes $c_1$ and $c_2$ from $x - \bar{x}$ through the weights $w^1$ and $w^2$, and reconstructs $x - \bar{x}$ from them through the same weights. Could we instead minimize the reconstruction error with gradient descent? And the network can be made deep: a deep autoencoder.
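
Below is a minimal sketch of training such a linear one-hidden-layer autoencoder by gradient descent with numpy; the toy data, learning rate, iteration count, and the use of untied encoder/decoder weights are my own illustrative assumptions (the slide draws tied weights). Gradient descent on this objective finds a good K-dimensional subspace, but the learned weights are not forced to be the orthonormal PCA components.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200)) * np.array([[3.], [2.], [1.], [.5], [.1]])  # 5-D data, examples as columns
Xc = X - X.mean(axis=1, keepdims=True)            # subtract x_bar

d, N = Xc.shape
K = 2
W_enc = rng.normal(scale=0.1, size=(K, d))        # encoder weights
W_dec = rng.normal(scale=0.1, size=(d, K))        # decoder weights
lr = 0.01

for step in range(5000):
    C = W_enc @ Xc                                # hidden codes c
    R = Xc - W_dec @ C                            # residual (x - x_bar) - x_hat
    grad_dec = -(2 / N) * R @ C.T                 # d(mean squared error)/dW_dec
    grad_enc = -(2 / N) * W_dec.T @ R @ Xc.T      # d(mean squared error)/dW_enc
    W_dec -= lr * grad_dec                        # gradient descent steps
    W_enc -= lr * grad_enc

print(np.mean(R ** 2))                            # reconstruction error after training
```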

24 Weakness of PCA. PCA is unsupervised: with no labels, the direction of maximum variance may mix together points that actually belong to different classes (supervised LDA addresses this). PCA is also linear: it cannot unroll a non-linear manifold such as the S-curve, so non-linear dimension reduction is covered in the following lectures. (Figure reference: hapter7/fig_s_manifold_pca.html)

25 PCA - Pokémon. Inspired from: pal-component-analysis-of-pokemon-data. 800 Pokémon, 6 features for each (HP, Atk, Def, Sp Atk, Sp Def, Speed). How many principal components? Compute the ratio $\lambda_i / (\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6)$ for each of the eigenvalues $\lambda_1, \dots, \lambda_6$; the ratios suggest that using 4 components is good enough.
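
A sketch of that eigenvalue-ratio computation in numpy. The data here is a random placeholder standing in for the 800 x 6 Pokémon stats table referenced above.

```python
import numpy as np

# Placeholder for the 800 Pokemon x 6 stats (HP, Atk, Def, Sp Atk, Sp Def, Speed) table
X = np.random.default_rng(0).normal(size=(800, 6))

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]   # lambda_1 >= ... >= lambda_6
ratios = eigvals / eigvals.sum()                              # lambda_i / (lambda_1 + ... + lambda_6)
print(np.round(ratios, 3))   # keep enough components to cover most of the variance
```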

26 PCA - Pokémon. Each principal component is a weight vector over the 6 features (HP, Atk, Def, Sp Atk, Sp Def, Speed) and can be interpreted by reading off its weights: PC1 weights all the stats positively, so it measures overall strength; another component weights Def positively while weighting Speed negatively, i.e. defense at the cost of speed.

27 PCA - Pokémon. Continuing the interpretation of the components over (HP, Atk, Def, Sp Atk, Sp Def, Speed): one component weights Sp Def positively while weighting Atk and HP negatively, i.e. special defense at the cost of attack and HP, and another weights HP strongly, i.e. high vitality.

28 PCA - MNIST. Using 30 components, each digit image $\approx a_1 w^1 + a_2 w^2 + \cdots + a_{30} w^{30}$. The components $w^i$, reshaped back into images, are the "eigen-digits".
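
A small sketch of computing eigen-digits with scikit-learn. It uses the bundled 8 x 8 digits dataset as a stand-in for 28 x 28 MNIST, which is an assumption on my part.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                               # 8x8 digit images, flattened to 64 dims
pca = PCA(n_components=30)
codes = pca.fit_transform(digits.data)               # a_1, ..., a_30 for every image
eigen_digits = pca.components_                       # the components w^1, ..., w^30
reconstruction = pca.inverse_transform(codes[:1])    # a_1 w^1 + a_2 w^2 + ... + mean image
print(eigen_digits.shape, reconstruction.shape)      # (30, 64) (1, 64)
```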

29 PCA - Face. Using 30 components on face images gives the "eigen-faces". (Source: ing08/assignment3.html)

30 What happens in PCA? In $\approx a_1 w^1 + a_2 w^2 + \cdots$ the weights $a_i$ can be any real number, so PCA adds up and subtracts components (images); the components therefore need not look like parts of digits. Non-negative matrix factorization (NMF) instead forces $a_1, a_2, \dots$ to be non-negative, giving a purely additive combination, and forces $w^1, w^2, \dots$ to be non-negative, making the components look more like parts of digits. Ref: Daniel D. Lee and H. Sebastian Seung, "Algorithms for non-negative matrix factorization," Advances in Neural Information Processing Systems.
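
A minimal NMF sketch with scikit-learn; the random non-negative matrix below is a placeholder for a stack of flattened digit images, and the solver settings are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Rows: images (flattened, non-negative pixel values); columns: pixels
V = np.abs(np.random.default_rng(0).normal(size=(100, 64)))

model = NMF(n_components=30, init='nndsvd', max_iter=500)
A = model.fit_transform(V)           # non-negative weights a_1, a_2, ...
W = model.components_                # non-negative components w^1, w^2, ... ("parts")
print(np.linalg.norm(V - A @ W))     # error of the additive reconstruction
```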

31 NMF on MNIST

32 NMF on Face

33 Matrix Factorization

34 Matrix Factorization. The number in each cell of the table is how many figures of a given anime character each person (A, B, C, D, E) owns. There are some common factors behind the otakus and the characters that explain these counts.

35 Matrix Factorization. The factors are latent: they are not directly observable, and no one states them explicitly. For example, each otaku and each character may lean toward the tsundere (傲) type or the airhead (呆) type, and an otaku tends to buy figures of the characters whose type matches his or her own preference.

36 Give each otaku A, B, C, D, E a latent vector $r^A, r^B, r^C, r^D, r^E$ and each character 1, 2, 3, 4 a latent vector $r^1, r^2, r^3, r^4$ (e.g. along the tsundere/airhead axes). The counts form a matrix $X$ with M rows (number of otakus) and N columns (number of characters); the number of latent factors is K. Assume each count is an inner product: $r^A \cdot r^1 \approx 5$, $r^B \cdot r^1 \approx 4$, $r^C \cdot r^1 \approx 1$, and in general $n_{A1}, n_{A2}, n_{B1}, n_{B2}, \dots$ So the M x N matrix $X$ is approximated by the product of an M x K matrix (rows $r^A, r^B, \dots$) and a K x N matrix (columns $r^1, r^2, \dots$), minimizing the error; this can be solved by singular value decomposition (SVD).

37 Some entries of the table may be unknown (marked "?"): e.g. A owns 5, 3, ?, 1 figures of characters 1-4; B: 4, 3, ?, 1; C: 1, 1, ?, 5; D: 1, ?, 4, 4; and E's first entry is also ?. We still assume $n_{ij} \approx r^i \cdot r^j$ (e.g. $r^A \cdot r^1 \approx 5$, $r^B \cdot r^1 \approx 4$, $r^C \cdot r^1 \approx 1$), but SVD cannot skip missing entries, so instead minimize $L = \sum_{(i,j)} \big(r^i \cdot r^j - n_{ij}\big)^2$ over the defined values only, and find the $r^i$ and $r^j$ by gradient descent.

38 Assume the dimension of every $r$ is 2 (there are two latent factors). Learning by gradient descent gives a 2-D vector $r^A, \dots, r^E$ for each otaku and a 2-D vector $r^1, \dots, r^4$ for each character (春日, 炮姐, 姐寺, 小唯), and the missing entries "?" in the table can then be predicted as $r^i \cdot r^j$.

39 More about Matrix Factorization: considering the individual characteristics. Instead of $r^A \cdot r^1 \approx 5$, use $r^A \cdot r^1 + b_A + b_1 \approx 5$, where $b_A$ captures how much otaku A likes to buy figures in general and $b_1$ captures how popular character 1 is. Minimize $L = \sum_{(i,j)} \big(r^i \cdot r^j + b_i + b_j - n_{ij}\big)^2$ and find $r^i, r^j, b_i, b_j$ by gradient descent (regularization can be added). Ref: Matrix Factorization Techniques for Recommender Systems.
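
A minimal numpy sketch of this biased matrix factorization trained by stochastic gradient descent on the observed entries only. The toy table roughly follows slide 37 (with 0 standing in for the unknown "?" entries, and row E's values made up); K, the learning rate, and the regularization strength are also illustrative assumptions.

```python
import numpy as np

# Toy ratings table: rows = otakus A..E, columns = characters 1..4, 0 = unknown ("?")
table = [[5, 3, 0, 1],
         [4, 3, 0, 1],
         [1, 1, 0, 5],
         [1, 0, 4, 4],
         [0, 1, 5, 4]]
observed = [(i, j, v) for i, row in enumerate(table) for j, v in enumerate(row) if v > 0]

M, N, K = 5, 4, 2
rng = np.random.default_rng(0)
r_row = rng.normal(scale=0.1, size=(M, K))   # r^i for each otaku
r_col = rng.normal(scale=0.1, size=(N, K))   # r^j for each character
b_row, b_col = np.zeros(M), np.zeros(N)      # individual biases b_i, b_j
lr, lam = 0.05, 0.01

for epoch in range(500):
    for i, j, n_ij in observed:
        err = r_row[i] @ r_col[j] + b_row[i] + b_col[j] - n_ij
        # gradient descent step (factor of 2 folded into lr) with L2 regularization
        ri = r_row[i].copy()
        r_row[i] -= lr * (err * r_col[j] + lam * r_row[i])
        r_col[j] -= lr * (err * ri + lam * r_col[j])
        b_row[i] -= lr * err
        b_col[j] -= lr * err

# Predict an unknown entry, e.g. otaku A (row 0) and character 3 (column 2)
print(r_row[0] @ r_col[2] + b_row[0] + b_col[2])
```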

40 Matrix Factorization for topic analysis: latent semantic analysis (LSA). The table has terms as rows (e.g. investment, stock, president, election, legislator) and documents as columns (Doc 1-4), i.e. character → document and otaku → word. The number in the table is the term frequency, weighted by inverse document frequency. The latent factors are topics (e.g. finance, politics). Probabilistic variants: probabilistic latent semantic analysis (PLSA) — Thomas Hofmann, "Probabilistic Latent Semantic Indexing", SIGIR, 1999 — and latent Dirichlet allocation (LDA) — Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003), "Latent Dirichlet Allocation", Journal of Machine Learning Research, 3(4-5).
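
A brief LSA sketch with scikit-learn: TF-IDF weighted term counts factorized with a truncated SVD whose latent factors play the role of topics. The tiny English document set is a made-up stand-in for the finance/politics example in the table.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["stocks and investment news",
        "the president and the election",
        "election of the legislators",
        "investment in the stock market"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # term frequency weighted by inverse document frequency
lsa = TruncatedSVD(n_components=2)     # 2 latent factors = topics
doc_topics = lsa.fit_transform(X)      # each document expressed over the topics
print(np.round(doc_topics, 2))
```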

41 More Related Approaches (not introduced here): Multidimensional Scaling (MDS) [Alpaydin, Chapter 6.7] — only needs the distances between objects; Probabilistic PCA [Bishop, Chapter 12.2]; Kernel PCA [Bishop, Chapter 12.3] — a non-linear version of PCA; Canonical Correlation Analysis (CCA) [Alpaydin, Chapter 6.9]; Independent Component Analysis (ICA); Linear Discriminant Analysis (LDA) [Alpaydin, Chapter 6.8] — supervised.

42 Acknowledgement. Thanks to 彭冲 for finding an error in the cited material, and to Hsiang-Chih Cheng for finding errors on the slides.
