Data Mining and Matrices


1 Data Mining and Matrices
04 Matrix Completion
Rainer Gemulla, Pauli Miettinen
May 02, 2013

2 Recommender systems
Problem
- Set of users
- Set of items (movies, books, jokes, products, stories, ...)
- Feedback (ratings, purchases, click-through, tags, ...)
- Sometimes: metadata (user profiles, item properties, ...)
Goal: Predict preferences of users for items
Ultimate goal: Create item recommendations for each user
Example
            Avatar  The Matrix  Up
  Alice       ?         4        2
  Bob         3         2        ?
  Charlie     5         ?        3

3 Outline
1. Collaborative Filtering
2. Matrix Completion
3. Algorithms
4. Summary

4 Collaborative filtering
Key idea: Make use of past user behavior
- No domain knowledge required
- No expensive data collection needed
- Allows discovery of complex and unexpected patterns
- Widely adopted: Amazon, TiVo, Netflix, Microsoft
- Key techniques: neighborhood models, latent factor models
            Avatar  The Matrix  Up
  Alice       ?         4        2
  Bob         3         2        ?
  Charlie     5         ?        3
Leverage past behavior of other users and/or on other items.

5 A simple baseline
m users, n items, m x n rating matrix D
Revealed entries Ω = { (i, j) : rating D_ij is revealed }, N = |Ω|
Baseline predictor: b_ij = µ + b_i + b_j
- µ = (1/N) Σ_{(i,j) ∈ Ω} D_ij is the overall average rating
- b_i is a user bias (user's tendency to rate low/high)
- b_j is an item bias (item's tendency to be rated low/high)
Least squares estimates: argmin_b Σ_{(i,j) ∈ Ω} (D_ij − µ − b_i − b_j)²
Example (user and item biases in parentheses next to the names, baseline predictions in parentheses inside the matrix):
                   Avatar (1.01)  The Matrix (0.34)  Up (−1.32)
  Alice (0.32)        ? (4.5)          4 (3.8)         2 (2.1)
  Bob (−1.34)         3 (2.8)          2 (2.2)         ? (0.5)
  Charlie (0.99)      5 (5.2)          ? (4.5)         3 (2.8)
m = 3, n = 3, Ω = { (1, 2), (1, 3), (2, 1), ... }, N = 6, µ = 3.17
b̂_32 = µ + b_3 + b_2 = 3.17 + 0.99 + 0.34 = 4.5
Baseline does not account for personal tastes.
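
The slide only states the least squares problem; as a minimal sketch (not the lecture's prescribed procedure), the biases can be fit by a few sweeps of coordinate-wise least squares over users and items. The data layout (a list of (i, j, rating) triples) is an assumption for illustration.

```python
# Minimal sketch: fit the baseline predictor b_ij = mu + b_i + b_j on the
# revealed entries by alternating coordinate-wise least squares sweeps.
import numpy as np

def fit_baseline(ratings, m, n, sweeps=20):
    """ratings: list of (i, j, d_ij) triples for the revealed entries."""
    mu = np.mean([d for _, _, d in ratings])   # overall average rating
    b_user, b_item = np.zeros(m), np.zeros(n)
    for _ in range(sweeps):
        for i in range(m):                     # user biases, item biases fixed
            resid = [d - mu - b_item[j] for (u, j, d) in ratings if u == i]
            if resid:
                b_user[i] = np.mean(resid)
        for j in range(n):                     # item biases, user biases fixed
            resid = [d - mu - b_user[u] for (u, v, d) in ratings if v == j]
            if resid:
                b_item[j] = np.mean(resid)
    return mu, b_user, b_item

# toy data from the slide: rows Alice, Bob, Charlie; cols Avatar, Matrix, Up
ratings = [(0, 1, 4), (0, 2, 2), (1, 0, 3), (1, 1, 2), (2, 0, 5), (2, 2, 3)]
mu, bu, bi = fit_baseline(ratings, 3, 3)
print(round(mu, 2), np.round(bu, 2), np.round(bi, 2))   # mu is 3.17, as on the slide
```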

6 When does a user like an item?
Neighborhood models (kNN): when he likes similar items
- Find the top-k most similar items the user has rated
- Combine the ratings of these items (e.g., average)
- Requires a similarity measure (e.g., Pearson correlation coefficient)
[Illustration: an item unrated by Bob is similar to an item Bob rated 4, so predict 4.]
Latent factor models (LFM): when similar users like similar items
- More holistic approach
- Users and items are placed in the same latent factor space
- Position of a user and an item is related to preference (via dot products)
[Illustration from Koren et al., 2009: users and items embedded in a common latent factor space.]

7 Intuition behind latent factor models (1)
[Figure 2 from Koren et al., 2009: a simplified illustration of the latent factor approach. Movies and users are placed in a plane whose dimensions range from "geared toward females" to "geared toward males" and from "serious" to "escapist"; example movies include The Color Purple, Sense and Sensibility, The Princess Diaries, Amadeus, Ocean's 11, The Lion King, Dave, Braveheart, Lethal Weapon, Independence Day, and Dumb and Dumber. A user u's predicted rating for item i is the inner product of the item factor vector q_i and the user factor vector p_u, i.e., r̂_ui = q_i^T p_u.]

8 Intuition behind latent factor models (2)
Does user u like item v?
Quality: measured via direction from origin (cos∠(u, v))
- Same direction → attraction: cos∠(u, v) ≈ 1
- Opposite direction → repulsion: cos∠(u, v) ≈ −1
- Orthogonal direction → oblivious: cos∠(u, v) ≈ 0
Strength: measured via distance from origin (‖u‖, ‖v‖)
- Far from origin → strong relationship: ‖u‖ ‖v‖ large
- Close to origin → weak relationship: ‖u‖ ‖v‖ small
Overall preference: measured via dot product u · v = ‖u‖ ‖v‖ cos∠(u, v)
- Same direction, far out → strong attraction: u · v large positive
- Opposite direction, far out → strong repulsion: u · v large negative
- Orthogonal direction, any distance → oblivious: u · v ≈ 0
But how to select dimensions and where to place items and users?
Key idea: Pick dimensions that explain the known data well.

9 SVD and missing values
[Figure: four panels showing the input data, its rank-10 truncated SVD, a 10% sample of the input data, and the rank-10 truncated SVD of that sample.]
SVD treats missing entries as 0.
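
A minimal sketch of what this slide warns about: computing a rank-r truncated SVD after simply filling the unrevealed entries with 0. The function name and the boolean-mask data layout are assumptions for illustration.

```python
# Minimal sketch: rank-r truncated SVD of a partially observed matrix,
# with the unrevealed entries set to 0 (the pitfall the slide points out).
import numpy as np

def truncated_svd_fill_zero(D, mask, r):
    """D: m x n ratings; mask: boolean array marking revealed entries."""
    D0 = np.where(mask, D, 0.0)                 # missing entries treated as 0
    U, s, Vt = np.linalg.svd(D0, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]       # rank-r reconstruction
```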

10 Latent factor models and missing values
[Figure: four panels showing the input data, a rank-10 LFM fit, a 10% sample of the input data, and a rank-10 LFM fit of that sample.]
LFMs ignore missing entries.

11 Latent factor models (simple form)
Given rank r, find an m x r matrix L and an r x n matrix R such that D_ij ≈ [LR]_ij for (i, j) ∈ Ω
Least squares formulation:
  min_{L,R} Σ_{(i,j) ∈ Ω} (D_ij − [LR]_ij)²
Example (r = 1), with the entries of L and R and the resulting predictions in parentheses:
                 R:  Avatar (2.24)  The Matrix (1.92)  Up (1.18)
  L: Alice (1.98)       ? (4.4)          4 (3.8)         2 (2.3)
     Bob (1.21)         3 (2.7)          2 (2.3)         ? (1.4)
     Charlie (2.30)     5 (5.2)          ? (4.4)         3 (2.7)
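
A minimal sketch of the least squares objective above, evaluated for given factor matrices L and R; the function name and mask-based data layout are assumptions for illustration.

```python
# Minimal sketch: unregularized least squares objective
# sum over revealed entries of (D_ij - [LR]_ij)^2.
import numpy as np

def lfm_loss(D, mask, L, R):
    """L: m x r, R: r x n; mask: boolean array marking revealed entries."""
    E = (D - L @ R) * mask      # residuals on revealed entries only
    return np.sum(E ** 2)
```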

12 Example: Netflix prize data (≈ 500k users, 17k movies, 100M ratings)
[Figure from Koren et al., 2009: movies from the Netflix prize data plotted against the first two factor vectors; titles such as Sophie's Choice, Moonstruck, Kill Bill: Vol. 1, Natural Born Killers, Lost in Translation, Citizen Kane, Annie Hall, Maid in Manhattan, Runaway Bride, The Sound of Music, and The Waltons appear at characteristic positions in this space.]

13 Latent factor models (summation form)
Least squares formulation prone to overfitting
More general summation form:
  L = Σ_{(i,j) ∈ Ω} l_ij(L_i, R_j) + R(L, R)
- L is the global loss
- L_i and R_j are user and item parameters, resp.
- l_ij is a local loss, e.g., l_ij = (D_ij − [LR]_ij)²
- R is a regularization term, e.g., R = λ(‖L‖²_F + ‖R‖²_F)
Loss function can be more sophisticated:
- Improved predictors (e.g., include user and item bias)
- Additional feedback data (e.g., time, implicit feedback)
- Regularization terms (e.g., weighted depending on amount of feedback)
- Available metadata (e.g., demographics, genre of a movie)
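
A minimal sketch of the summation form with the squared local loss and Frobenius-norm regularization named on the slide; the function name and the list-of-index-pairs layout for Ω are assumptions for illustration.

```python
# Minimal sketch: global loss in summation form with squared local loss
# and regularizer R = lambda * (||L||_F^2 + ||R||_F^2).
import numpy as np

def summation_loss(D, omega, L, R, lam):
    """omega: iterable of revealed (i, j) index pairs; L: m x r, R: r x n."""
    loss = sum((D[i, j] - L[i, :] @ R[:, j]) ** 2 for i, j in omega)
    return loss + lam * (np.sum(L ** 2) + np.sum(R ** 2))
```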

14 Example: Netflix prize data
[Figure 4 from Koren et al., 2009: root mean square error of predictions plotted against the number of model parameters (in millions) for matrix factorization models of increasing complexity: plain, with biases, with implicit feedback, and with temporal dynamics (v.1 and v.2).]

15 Outline
1. Collaborative Filtering
2. Matrix Completion
3. Algorithms
4. Summary

16 The matrix completion problem
Complete these matrices!
  ?  ?  ?  ?  ?
  1  ?  ?  ?  ?
  1  ?  ?  ?  ?
  1  ?  ?  ?  ?
Matrix completion is impossible without additional assumptions!
Let's assume that the underlying full matrix is simple (here: rank 1).
When/how can we recover a low-rank matrix from a sample of its entries?

17 Rank minimization
Definition (rank minimization problem). Given an n x n data matrix D and an index set Ω of revealed entries, the rank minimization problem is
  minimize    rank(X)
  subject to  X_ij = D_ij for (i, j) ∈ Ω,
              X ∈ R^{n x n}.
- Seeks the simplest explanation fitting the data
- If the solution is unique and there are sufficient samples, recovers D (i.e., X = D)
- NP-hard
- Time complexity of existing rank minimization algorithms is doubly exponential in n (and they are also slow in practice)

18 Nuclear norm minimization
Rank: rank(D) = |{ k : σ_k(D) > 0, 1 ≤ k ≤ n }| = Σ_{k=1}^n I[σ_k(D) > 0]
Nuclear norm: ‖D‖_* = Σ_{k=1}^n σ_k(D)
Definition (nuclear norm minimization). Given an n x n data matrix D and an index set Ω of revealed entries, the nuclear norm minimization problem is
  minimize    ‖X‖_*
  subject to  X_ij = D_ij for (i, j) ∈ Ω,
              X ∈ R^{n x n}.
- A heuristic for rank minimization
- The nuclear norm is a convex function (thus a local optimum is a global optimum)
- Can be optimized (more) efficiently via semidefinite programming
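
A minimal sketch of the two quantities defined above, computed from the singular values; the function name and the zero-tolerance parameter are assumptions for illustration.

```python
# Minimal sketch: rank and nuclear norm of a matrix via its singular values,
# matching the definitions on this slide.
import numpy as np

def rank_and_nuclear_norm(D, tol=1e-10):
    s = np.linalg.svd(D, compute_uv=False)   # singular values sigma_k(D)
    rank = int(np.sum(s > tol))              # number of nonzero singular values
    nuclear = float(np.sum(s))               # ||D||_* = sum of singular values
    return rank, nuclear
```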

19 Why nuclear norm minimization?
- Consider the SVD of D = UΣV^T
- Unit nuclear norm ball = convex combinations (with weights σ_k) of rank-1 matrices U_k V_k^T of unit Frobenius norm
- Extreme points of the unit ball are rank-1 matrices of unit Frobenius norm
- Nuclear norm minimization: inflate the unit ball as little as possible until it reaches a matrix with X_ij = D_ij
- The solution lies at an extreme point of the inflated ball → (hopefully) low rank
[Figure 1 from Candès and Recht: unit ball of the nuclear norm for symmetric 2 x 2 matrices; a random one-dimensional affine subspace generically intersects a sufficiently large nuclear norm ball at a rank-one matrix.]
Candès and Recht.

20 Relationship to LFMs
Recall the regularized LFM (L is m x r, R is r x n):
  min_{L,R} Σ_{(i,j) ∈ Ω} (D_ij − [LR]_ij)² + λ(‖L‖²_F + ‖R‖²_F)
View as a matrix completion problem by enforcing D_ij = [LR]_ij:
  minimize    (1/2)(‖L‖²_F + ‖R‖²_F)
  subject to  X_ij = D_ij for (i, j) ∈ Ω,
              LR = X.
One can show: for r chosen larger than the rank of the nuclear norm optimum, this is equivalent to nuclear norm minimization.
For some intuition, suppose X = UΣV^T at the optimal L and R:
  (1/2)(‖L‖²_F + ‖R‖²_F) ≤ (1/2)(‖UΣ^{1/2}‖²_F + ‖Σ^{1/2}V^T‖²_F)
                         = (1/2) Σ_{i=1}^n Σ_{k=1}^r (U²_ik σ_k + V²_ik σ_k)
                         = Σ_{k=1}^r σ_k = ‖X‖_*
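
A minimal numerical check of the identity used in the last step above: for the particular factorization L = UΣ^{1/2}, R = Σ^{1/2}V^T, half the sum of the squared Frobenius norms equals the nuclear norm of X. The random test matrix is an assumption for illustration.

```python
# Minimal sketch: verify (1/2)(||U S^{1/2}||_F^2 + ||S^{1/2} V^T||_F^2) = ||X||_*.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4)) @ rng.standard_normal((4, 6))   # rank <= 4
U, s, Vt = np.linalg.svd(X, full_matrices=False)
L = U * np.sqrt(s)              # U Sigma^{1/2}
R = np.sqrt(s)[:, None] * Vt    # Sigma^{1/2} V^T
lhs = 0.5 * (np.sum(L ** 2) + np.sum(R ** 2))
print(np.isclose(lhs, np.sum(s)))   # True: equals the nuclear norm ||X||_*
```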

21 When can we hope to recover D? (1)
Assume D is the 5 x 5 all-ones matrix (rank 1).
[Four sampling patterns of revealed entries: two for which the completion is unique ("Ok"), one that misses an entire column ("Not unique: column missed"), and one with too few revealed entries ("Not unique: insufficient samples").]
Sampling strategy and sample size matter.

22 When can we hope to recover D? (2)
Consider the following rank-1 matrices and assume few revealed entries:
[Four example matrices: two marked Ok ("incoherent"), two marked Bad ("coherent"); for the coherent examples, the first row (resp. the (1, 1)-entry) must be revealed to recover D.]
Properties of D matter.

23 When can we hope to recover D? (3)
Exact conditions under which matrix completion works are an active research area:
- Which sampling schemes? (e.g., random, WR/WOR, active)
- Which sample size?
- Which matrices? (e.g., incoherent matrices)
- Noise (e.g., independent, normally distributed noise)
Theorem (Candès and Recht, 2009). Let D = UΣV^T. If D is incoherent in that
  max_ij U²_ij ≤ µ_B / n  and  max_ij V²_ij ≤ µ_B / n
for some µ_B = O(1), and if rank(D) ≤ µ_B^{-1} n^{1/5}, then O(n^{6/5} r log n) random samples without replacement suffice to recover D exactly with high probability.
Candès and Recht, 2009.
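
A minimal sketch of how the incoherence condition in the theorem can be checked for a given square matrix: compute the smallest µ such that the squared singular-vector entries are bounded by µ/n. The function name is an assumption for illustration.

```python
# Minimal sketch: smallest mu with max_ij U_ij^2 <= mu/n and max_ij V_ij^2 <= mu/n,
# for a square n x n matrix D as in the theorem.
import numpy as np

def coherence_parameter(D, tol=1e-10):
    n = D.shape[0]                                 # D assumed n x n
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    r = int(np.sum(s > tol))
    U, V = U[:, :r], Vt[:r, :].T                   # singular vectors of D
    return n * max(np.max(U ** 2), np.max(V ** 2))
```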

24 Outline
1. Collaborative Filtering
2. Matrix Completion
3. Algorithms
4. Summary

25 Overview
Latent factor models in practice:
- Millions of users and items
- Billions of ratings
- Sometimes quite complex models
Many algorithms have been applied to large-scale problems:
- Gradient descent and quasi-Newton methods
- Coordinate-wise gradient descent
- Stochastic gradient descent
- Alternating least squares

26 Continuous gradient descent
Find minimum θ* of function L:
- Pick a starting point θ_0
- Compute gradient ∇L(θ_0)
- Walk downhill
Differential equation:
  ∂θ(t)/∂t = −∇L(θ(t))  with boundary condition θ(0) = θ_0
Under certain conditions, θ(t) → θ*
[Plot: distance θ(t) − θ* decreasing as a function of t.]

27 Discrete gradient descent
Find minimum θ* of function L:
- Pick a starting point θ_0
- Compute gradient ∇L(θ_0)
- Jump downhill
Difference equation:
  θ_{n+1} = θ_n − ɛ_n ∇L(θ_n)
Under certain conditions, approximates continuous gradient descent: the interpolation of the iterates taken with steps of size t satisfies the ODE as n → ∞
[Plot: distance θ_n − θ* decreasing as a step function over t.]
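
A minimal sketch of the difference equation above for a generic differentiable loss; the function names, step size, and the toy quadratic example are assumptions for illustration.

```python
# Minimal sketch of the update theta_{n+1} = theta_n - eps_n * grad L(theta_n),
# here with a constant step size.
import numpy as np

def gradient_descent(loss_grad, theta0, eps=0.1, steps=100):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eps * loss_grad(theta)   # jump downhill
    return theta

# example: minimize L(theta) = ||theta||^2 / 2, whose gradient is theta
print(gradient_descent(lambda th: th, np.array([3.0, -2.0])))
```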

28 Gradient descent for LFMs
Set θ = (L, R) and write L(θ) = Σ_{(i,j) ∈ Ω} L_ij(L_i, R_j)
Gradient with respect to row L_i:
  ∇_{L_i} L(θ) = Σ_{j : (i,j) ∈ Ω} ∇_{L_i} L_ij(L_i, R_j)
GD epoch:
1. Compute gradient
   - Initialize zero matrices ∇L and ∇R
   - For each entry (i, j) ∈ Ω, update gradients
       ∇L_i ← ∇L_i + ∇_{L_i} L_ij(L_i, R_j)
       ∇R_j ← ∇R_j + ∇_{R_j} L_ij(L_i, R_j)
2. Update parameters
   L ← L − ɛ_n ∇L
   R ← R − ɛ_n ∇R
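
A minimal sketch of one such epoch for the unregularized squared local loss L_ij = (D_ij − L_i R_j)² used on the next slide; the function name and the list-of-index-pairs layout for Ω are assumptions for illustration.

```python
# Minimal sketch of one gradient descent epoch for squared local loss.
import numpy as np

def gd_epoch(D, omega, L, R, eps):
    """omega: list of revealed (i, j) pairs; L: m x r, R: r x n."""
    gL, gR = np.zeros_like(L), np.zeros_like(R)
    for i, j in omega:                        # accumulate local gradients
        err = D[i, j] - L[i, :] @ R[:, j]
        gL[i, :] += -2.0 * err * R[:, j]      # gradient of L_ij w.r.t. L_i
        gR[:, j] += -2.0 * err * L[i, :]      # gradient of L_ij w.r.t. R_j
    L -= eps * gL                             # update parameters
    R -= eps * gR
    return L, R
```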

29 Computing the gradient (example)
Simplest form (unregularized):
  L_ij(L_i, R_j) = (D_ij − L_i R_j)²
Gradient computation:
  ∂L_ij(L_i, R_j) / ∂L_i'k = 0                          if i' ≠ i
                           = −2 R_kj (D_ij − L_i R_j)    if i' = i
Local gradient of entry (i, j) ∈ Ω is nonzero only on row L_i and column R_j.

30 Stochastic gradient descent
Find minimum θ* of function L:
- Pick a starting point θ_0
- Approximate gradient ∇̂L(θ_0)
- Jump approximately downhill
Stochastic difference equation:
  θ_{n+1} = θ_n − ɛ_n ∇̂L(θ_n)
Under certain conditions, asymptotically approximates (continuous) gradient descent
[Plot: distance θ_n − θ* over t for a noisy step-function trajectory.]

31 Stochastic gradient descent for LFMs
Set θ = (L, R) and use
  L(θ) = Σ_{(i,j) ∈ Ω} L_ij(L_i, R_j)
  ∇L(θ) = Σ_{(i,j) ∈ Ω} ∇L_ij(L_i, R_j)
  ∇̂L(θ, z) = N ∇L_{i_z j_z}(L_{i_z}, R_{j_z}),  where N = |Ω| and z = (i_z, j_z) ∈ Ω
SGD epoch:
1. Pick a random entry z ∈ Ω
2. Compute approximate gradient ∇̂L(θ, z)
3. Update parameters: θ_{n+1} = θ_n − ɛ_n ∇̂L(θ_n, z)
4. Repeat N times
SGD step affects only the current row and column.
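
A minimal sketch of one SGD epoch for the squared local loss; the function name and data layout are assumptions for illustration, and the factor N from the slide's gradient estimate is folded into the step size here, as is common in practice.

```python
# Minimal sketch of one SGD epoch: N single-entry updates, each touching
# only row L_i and column R_j of the current entry.
import numpy as np

def sgd_epoch(D, omega, L, R, eps, rng=np.random.default_rng()):
    """omega: list of revealed (i, j) pairs; L: m x r, R: r x n."""
    N = len(omega)
    for _ in range(N):
        i, j = omega[rng.integers(N)]          # pick a random revealed entry
        err = D[i, j] - L[i, :] @ R[:, j]
        Li_old = L[i, :].copy()                # old L_i needed for the R_j update
        L[i, :] += eps * 2.0 * err * R[:, j]   # step affects only row L_i ...
        R[:, j] += eps * 2.0 * err * Li_old    # ... and column R_j
    return L, R
```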

32 SGD in practice
Step size sequence { ɛ_n } needs to be chosen carefully:
- Pick initial step size based on a sample (of some rows and columns)
- Reduce step size gradually
Bold driver heuristic: after every epoch,
- Increase step size slightly (by, say, 5%) when the loss decreased
- Decrease step size sharply (by, say, 50%) when the loss increased
[Plot: mean loss per epoch on Netflix data (unregularized) for L-BFGS, SGD, and ALS.]
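
A minimal sketch of the bold driver heuristic described above; the function name and the default adjustment factors are assumptions matching the example percentages on the slide.

```python
# Minimal sketch of the bold driver heuristic: adapt the step size after
# every epoch depending on whether the training loss went down or up.
def bold_driver(eps, prev_loss, curr_loss, inc=1.05, dec=0.5):
    """Return the step size to use for the next epoch."""
    if curr_loss < prev_loss:
        return eps * inc   # loss decreased: increase step size slightly
    return eps * dec       # loss increased: decrease step size sharply
```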

33 Outline
1. Collaborative Filtering
2. Matrix Completion
3. Algorithms
4. Summary

34 Lessons learned
- Collaborative filtering methods learn from past user behavior
- Latent factor models are the best-performing single approach for collaborative filtering
  - But often combined with other methods
- Users and items are represented in a common latent factor space
  - Holistic matrix-factorization approach
  - Similar users/items placed at similar positions
  - Low-rank assumption = few factors influence user preferences
- Close relationship to the matrix completion problem
  - Reconstruct a partially observed low-rank matrix
- SGD is a simple and practical algorithm to solve LFMs in summation form

35 Suggested reading
- Y. Koren, R. Bell, C. Volinsky: Matrix factorization techniques for recommender systems. IEEE Computer, 42(8), 2009.
- E. Candès, B. Recht: Exact matrix completion via convex optimization. Communications of the ACM, 55(6), 2012.
- And references in the above articles
