Review. Gibbs sampling. Relational learning (properties of sets of entities) MH with proposal. failure mode: lock-down


Milton Jordan

1 Review
Gibbs sampling: MH with proposal $Q(X' \mid X) = P(X'_{B(i)} \mid X_{\neg B(i)}) \, I(X'_{\neg B(i)} = X_{\neg B(i)}) \,/\, \#B$
failure mode: lock-down
Relational learning (properties of sets of entities): document clustering, recommender systems, eigenfaces
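A minimal sketch of the lock-down failure mode, assuming it refers to a Gibbs sampler getting stuck when variables are tightly coupled. The two-variable model, the `coupling` parameter, and the function name are illustrative assumptions, not taken from the slides.

```python
import math
import random

def gibbs_mode_switches(coupling, sweeps, seed=0):
    """Gibbs-sample two binary variables with P(x1, x2) proportional to
    exp(coupling * [x1 == x2]) (a hypothetical toy model).

    Returns how often the chain moves between the states (0, 0) and (1, 1).
    """
    rng = random.Random(seed)
    # Conditional probability that a variable copies the other one.
    p_agree = math.exp(coupling) / (math.exp(coupling) + 1.0)
    x = [0, 0]
    last_mode = 0   # shared value the last time the two variables agreed
    switches = 0
    for _ in range(sweeps):
        for a in (0, 1):            # resample each variable from its conditional
            b = 1 - a
            x[a] = x[b] if rng.random() < p_agree else 1 - x[b]
        if x[0] == x[1] and x[0] != last_mode:
            switches += 1
            last_mode = x[0]
    return switches

weak = gibbs_mode_switches(coupling=0.0, sweeps=2000)
strong = gibbs_mode_switches(coupling=10.0, sweeps=2000)
print("mode switches, weak coupling:", weak)
print("mode switches, strong coupling:", strong)
```

With strong coupling each single-variable update almost always copies the other variable, so the chain rarely leaves the mode it started in, even though both modes are equally probable.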

2 Review
Latent-variable models: PCA, pPCA, Bayesian PCA
everything Gaussian: $E(X \mid U, V) = U V^T$
MLE: use SVD
Mean subtraction, example weights

3 PageRank
SVD is pretty useful: it turns out to be the main computational step in other models too
A famous one: PageRank
Given: web graph (V, E)
Predict: which pages are important

4 PageRank: adjacency matrix

5 Random surfer model
With probability α:
With probability (1 − α):
Intuition: a page is important if a random surfer is likely to land there

6 Stationary distribution
(figure: graph with nodes A, B, C, D)
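The stationary distribution of the random-surfer chain can be approximated by power iteration. This is a sketch under stated assumptions: the edges of the four-node graph are made up for illustration, and `follow_prob` (the probability of following a link rather than jumping to a random page) is a hypothetical parameter name, since the slides do not say which of α or (1 − α) plays that role.

```python
def pagerank(links, follow_prob=0.85, iters=100):
    """Power iteration for a random-surfer chain.

    links[u] lists the pages u links to. With probability follow_prob the
    surfer follows a uniformly random out-link; otherwise (or if u has no
    out-links) the surfer jumps to a uniformly random page.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: 0.0 for u in nodes}
        for u in nodes:
            out = links[u]
            if out:
                jump = (1.0 - follow_prob) * rank[u] / n
                for v in out:
                    new[v] += follow_prob * rank[u] / len(out)
            else:
                jump = rank[u] / n   # dangling page: always jump
            for v in nodes:
                new[v] += jump
        rank = new
    return rank

# Hypothetical edges; the slide's figure only shows nodes A, B, C, D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph page D has no in-links, so its only mass comes from random jumps, while C collects links from every other page and ranks highest.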

7 Thought experiment
What if A is symmetric?
note: we're going to stop distinguishing A, A'
So, stationary dist'n for symmetric A is:
What do people do instead?
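For a plain random walk on a connected, non-bipartite undirected graph (symmetric A, no random jumps), a standard result is that the stationary distribution is proportional to node degree. The sketch below checks this numerically on a small made-up graph; the graph and function name are illustrative assumptions.

```python
def walk_distribution(adj, steps):
    """Distribution of a plain random walk (no random jumps) after `steps`
    steps, starting from uniform. adj is a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    dist = [1.0 / n] * n
    for _ in range(steps):
        dist = [sum(dist[i] * adj[i][j] / deg[i] for i in range(n))
                for j in range(n)]
    return dist

# Hypothetical undirected graph: a triangle 0-1-2 plus a pendant node 3.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
deg = [sum(row) for row in adj]
by_degree = [d / sum(deg) for d in deg]
limit = walk_distribution(adj, steps=200)
print("after 200 steps:    ", [round(p, 4) for p in limit])
print("degree / sum(degree):", [round(p, 4) for p in by_degree])
```

Under these assumptions the importance score carries no information beyond the degree, which is one way to read the slide's closing question.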

8 Spectral embedding
Another famous model: spectral embedding (and its cousin, spectral clustering)
Embedding: assign low-d coordinates to vertices (e.g., web pages) so that similar nodes in graph ⇒ nearby coordinates
A, B similar = random surfer tends to reach the same places when starting from A or B

9 Where does random surfer reach?
Given graph:
Start from distribution π
after 1 step: $P(k \mid \pi, \text{1-step}) =$
after 2 steps: $P(k \mid \pi, \text{2-step}) =$
after t steps:

10 Similarity
A, B similar = random surfer tends to reach the same places when starting from A or B
$P(k \mid \pi, t\text{-step}) =$
If π has all mass on i:
Compare i & j:
Role of $\Sigma^t$:
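The similarity notion above can be sketched directly: put all mass on a start node, take t steps of the walk, and compare the resulting distributions for two start nodes. The graph (two triangles joined by one edge), the Euclidean comparison, and the function names are assumptions chosen for illustration.

```python
def t_step_distribution(adj, start, t):
    """Where a random walk is after t steps when all initial mass is on
    `start`. The walk uses the row-normalized adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    dist = [1.0 if k == start else 0.0 for k in range(n)]
    for _ in range(t):
        dist = [sum(dist[i] * adj[i][k] / deg[i] for i in range(n))
                for k in range(n)]
    return dist

def dissimilarity(adj, i, j, t):
    """Euclidean distance between the t-step distributions from i and j."""
    p = t_step_distribution(adj, i, t)
    q = t_step_distribution(adj, j, t)
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
near = dissimilarity(adj, 0, 1, t=3)        # same triangle
far = dissimilarity(adj, 0, 5, t=3)         # opposite triangles
far_long = dissimilarity(adj, 0, 5, t=100)  # many steps
print(round(near, 4), round(far, 4), far_long)
```

Nodes in the same triangle end up closer than nodes in opposite triangles, and for large t every start node approaches the same distribution, so the choice of t controls how much structure the comparison retains.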

11 Role of $\Sigma^t$ (real data)
(figure panels: t=1, t=3, t=5, and one further value of t)

12 Example: dolphins
dolphin social network near Doubtful Sound, New Zealand (Lusseau et al., 2003)
$A_{ij} = 1$ if dolphin i friends dolphin j

13 Dolphin network
(figure)

14 Comparisons
(figure panels: spectral embedding of random data; random embedding of dolphin data)

15 Spectral clustering
(figure)
Use your favorite clustering algorithm on coordinates from spectral embedding

16 PCA: the good, the bad, and the ugly
The good: simple, successful
The bad: linear, Gaussian
$E(X) = U V^T$; X, U, V ~ Gaussian
The ugly: failure to generalize to new entities

17 Consistency
Linear & logistic regression are consistent
What would consistency mean for PCA? (forget about row/col means for now)
Consistency: #users, #movies, #ratings (= nnz(W))
numel(U), numel(V)
consistency =

18 Failure to generalize
What does this mean for generalization?
new user's rating of movie j: only info is
new movie rated by user i: only info is
all our carefully-learned factors give us:
Generalization is:

19 Hierarchical model
(figure; label: old, non-hierarchical model)

20 Benefit of hierarchy
Now: only k $\mu_U$ latents, k $\mu_V$ latents (and corresponding σs)
can get consistency for these if we observe more and more $X_{ij}$
For a new user or movie:

21 Mean subtraction
Can now see that mean subtraction is a special case of our hierarchical model
Fix $V_{j1} = 1$ for all j; then $U_{i1} =$
Fix $U_{i2} = 1$ for all i; then $V_{j2} =$
global mean:

22 What about the second rating for a new user?
Estimating $U_i$ from one rating:
knowing $\mu_U$:
result:
How should we fix?
Note: often we have only a few ratings per user

23 MCMC for PCA
Can do Bayesian inference by Gibbs sampling
for simplicity, assume σs known

24 Recognizing a Gaussian
Suppose $X \sim N(X \mid \mu, \sigma^2)$
$L = \log P(X = x \mid \mu, \sigma^2) =$
$dL/dx =$
$d^2L/dx^2 =$
So: if we see $d^2L/dx^2 = a$, $dL/dx = a(x - b)$
$\mu =$
$\sigma^2 =$
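The blanks on this slide can be worked out by differentiating the Gaussian log density. The derivation below is a sketch that takes the sign convention as the transcript shows it ($d^2L/dx^2 = a$, so $a < 0$); the extraction drops minus signs elsewhere, so if the original slide wrote $-a$ instead, the last line would read $\sigma^2 = 1/a$.

```latex
\begin{align*}
L &= \log P(X = x \mid \mu, \sigma^2)
   = -\frac{(x - \mu)^2}{2\sigma^2} - \tfrac{1}{2}\log(2\pi\sigma^2) \\
\frac{dL}{dx} &= -\frac{x - \mu}{\sigma^2} \\
\frac{d^2L}{dx^2} &= -\frac{1}{\sigma^2} \\
\text{Matching } \frac{d^2L}{dx^2} = a,\ \frac{dL}{dx} = a(x - b):
\quad \mu &= b, \qquad \sigma^2 = -\frac{1}{a}
\end{align*}
```

So a log density whose second derivative is a negative constant identifies a Gaussian, with the variance read off the curvature and the mean read off the zero of the first derivative.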

25 Gibbs step for an element of $\mu_U$

26 Gibbs step for an element of $U$
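A sketch of what such a step could look like, under an assumed model that the transcript does not spell out: observed entries satisfy $X_{ij} \sim N(U_i \cdot V_j, \sigma^2)$ and each $U_{ik}$ has prior $N(\mu_{U,k}, \sigma_U^2)$. The function name, argument names, and toy numbers are hypothetical.

```python
import random

def gibbs_step_u_element(i, k, X, U, V, mu_U, sigma_U, sigma, rng):
    """Resample U[i][k] from its Gaussian conditional, holding the rest fixed.

    Assumed model (illustrative, not taken from the slide):
      X[i][j] ~ N(U[i] . V[j], sigma^2) for observed entries (None = missing),
      U[i][k] ~ N(mu_U[k], sigma_U^2).
    Returns the conditional mean and variance that were sampled from.
    """
    precision = 1.0 / sigma_U ** 2          # prior precision
    weighted_sum = mu_U[k] / sigma_U ** 2   # prior precision times prior mean
    for j, x in enumerate(X[i]):
        if x is None:
            continue
        # Part of the prediction explained by every factor except k.
        others = sum(U[i][l] * V[j][l] for l in range(len(V[j])) if l != k)
        precision += V[j][k] ** 2 / sigma ** 2
        weighted_sum += V[j][k] * (x - others) / sigma ** 2
    mean, var = weighted_sum / precision, 1.0 / precision
    U[i][k] = rng.gauss(mean, var ** 0.5)
    return mean, var

# Toy data: one user, three movies, two factors; the middle rating is missing.
X = [[2.0, None, 4.0]]
U = [[0.0, 1.0]]
V = [[1.0, 0.5], [0.3, 0.2], [2.0, 1.0]]
mean, var = gibbs_step_u_element(0, 0, X, U, V, mu_U=[0.0, 0.0],
                                 sigma_U=1.0, sigma=1.0,
                                 rng=random.Random(0))
print(mean, var, U[0][0])
```

The conditional precision is the prior precision plus one term per observed rating, which is the scalar version of the normal-equation form mentioned for blocked updates.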

27 In reality
We'd do blocked Gibbs instead
Blocks contain entire rows of U or V
take gradient, Hessian to get mean, covariance
formulas look a lot like linear regression (normal equations)
And, we'd fit $\sigma_U$, $\sigma_V$ too
sample $1/\sigma^2$ from a Gamma (or $\Sigma^{-1}$ from a Wishart) distribution

28 Nonlinearity: conjunctive features
(figure: P(rent) plotted against Comedy and Foreign)

29 Disjunctive features
(figure: P(rent) plotted against Comedy and Foreign)

30 Other
(figure: P(rent) plotted against Comedy and Foreign)

31 Non-Gaussian
X, U, and V could each be non-Gaussian, e.g., binary!
rents(u, m), comedy(m), female(u)
For X: predicting −0.1 instead of 0 is only as bad as predicting +0.1 instead of 0
For U, V: might infer 17% comedy or 32% female

32 Logistic PCA
Regular PCA: $X_{ij} \sim N(U_i \cdot V_j, \sigma^2)$
Logistic PCA:

33 More generally
Can have
$X_{ij} \sim \text{Poisson}(\mu_{ij})$, $\mu_{ij} = \exp(U_i \cdot V_j)$
$X_{ij} \sim \text{Bernoulli}(\mu_{ij})$, $\mu_{ij} = \sigma(U_i \cdot V_j)$
Called exponential family PCA
Might expect optimization to be difficult
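The Bernoulli case can be sketched as maximum likelihood by plain gradient ascent on U and V. This is an illustrative fit on a tiny made-up binary matrix; the optimizer, step size, initialization, and function names are assumptions rather than the method used in the lecture.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_likelihood(X, U, V):
    """Bernoulli log-likelihood with mu_ij = sigmoid(U_i . V_j)."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            p = sigmoid(dot(U[i], V[j]))
            total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def fit_logistic_pca(X, k=2, steps=200, lr=0.1, seed=0):
    """Gradient ascent on the log-likelihood (hypothetical settings)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(m)]
    for _ in range(steps):
        gU = [[0.0] * k for _ in range(n)]
        gV = [[0.0] * k for _ in range(m)]
        for i in range(n):
            for j in range(m):
                err = X[i][j] - sigmoid(dot(U[i], V[j]))  # d loglik / d logit
                for l in range(k):
                    gU[i][l] += err * V[j][l]
                    gV[j][l] += err * U[i][l]
        for i in range(n):
            for l in range(k):
                U[i][l] += lr * gU[i][l]
        for j in range(m):
            for l in range(k):
                V[j][l] += lr * gV[j][l]
    return U, V

# Made-up binary data with two blocks of users and items.
X = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
U, V = fit_logistic_pca(X)
ll = log_likelihood(X, U, V)
print("log-likelihood:", ll)
print("P(X[0][0] = 1):", sigmoid(dot(U[0], V[0])))
print("P(X[0][2] = 1):", sigmoid(dot(U[0], V[2])))
```

Swapping `sigmoid` for `exp` and the Bernoulli log-likelihood for the Poisson one gives the other case on the slide; the gradient keeps the same "observed minus predicted mean" form.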

34 Application: fMRI
(figure: fMRI brain activity for stimuli dog, cat, hammer; data matrix Y, Stimulus × Voxels)
credit: Ajit Singh

35 Results (logistic PCA)
Y (fMRI data): Fold-in
(figure: mean squared error, lower is better, for HB-CMF, H-CMF, and CMF; augmenting fMRI data with word co-occurrence (Words + Voxels) vs. just using fMRI data (Voxels))
credit: Ajit Singh
