INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

1 INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 8: Evaluation & SVD
Paul Ginsparg, Cornell University, Ithaca, NY
20 Sep

2 Administrativa
Assignment 2 to be posted 24 Sep, due Sat 8 Oct, 1 p.m. (late submission permitted until Sun 9 Oct at 11 p.m.)
No class Tue 11 Oct (midterm break)
The Midterm Examination is on Thu Oct 13 from 11:40 to 12:55, in Kimball B11. It will be open book. Topics examined include assignments, lectures and discussion class readings before the midterm break.

3 Overview
1 Recap
2 SVD Intuition, cont'd
3 Incremental Numerics
4 Discussion 2

5 Netflix challenge
Next 9 slides adapted from simon/journal/ html (Simon Funk = Brandyn Webb). See also popular article:
- Netflix provided 100M ratings (from 1 to 5) of 17K movies by 500K users, i.e., 100 million (User, Movie, Rating) triples of the form (105932, 14002, 3).
- Predict (User, Movie, ?) entries not in the database: how would the given User rate the given Movie?
- $50k incentive to the best each year, and $1M to the first to beat a set target (10% better than Netflix).

6 User-Movie Rating Matrix $R_{um}$
Visualize as a large sparse 500k × 17k user-movie matrix $R_{um}$, with the $(u,m)$th matrix element containing the rating (1 to 5) by user $u$ for movie $m$.
About 8.5B entries total, so data in only 1 of 85 = 1.2%.
Certain specified "?" elements constitute a quiz: make the best guess $P_{um}$ at the missing ratings.
Use mean squared error (mse) as the measure of accuracy: if the guess is 1.5 and the actual rating is 2, the penalty is $(2 - 1.5)^2 = 0.25$.
Then sum over the penalties for all guesses (including an optional sqrt) to get the rmse:
$$E = \sqrt{\sum_{u,m} (R_{um} - P_{um})^2}$$
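
As a minimal sketch of the error measure above, the following sums squared penalties over the known entries only and takes the square root at the end. The names `total_error`, `known`, and `guess` are illustrative assumptions, and the single toy rating is the slide's own example (guess 1.5, actual 2).

```python
import math

def total_error(known, predict):
    """Square root of the summed squared error over the known ratings.

    known   : dict mapping (user, movie) -> actual rating R_um
    predict : function (user, movie) -> predicted rating P_um
    """
    total = 0.0
    for (u, m), r in known.items():
        total += (r - predict(u, m)) ** 2
    return math.sqrt(total)

# Toy data (hypothetical indices): one quiz entry whose actual rating is 2.
known = {(0, 0): 2.0}
guess = lambda u, m: 1.5

# Squaring E recovers the single squared penalty (2 - 1.5)^2.
penalty = total_error(known, guess) ** 2
print(penalty)
```

Storing only the known entries in a dictionary mirrors the sparsity of the rating matrix: empty cells contribute nothing to the sum.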

7 Linear Dependencies
If one had the full 8.5 billion ratings (and many weary users), they would contain many regularities, i.e., they would not consist of 8.5B independent and unrelated ratings.
Describe each movie in terms of some basic attributes such as
- overall quality
- action or comedy
- actors
- ...
Describe user preferences in terms of complementary attributes or preferences:
- they rate high or low
- prefer action or comedy
- preferred actors
- ...

8 Model the data
Explain 8.5 billion ratings by far fewer than 8.5 billion numbers (e.g., a single number specifying a movie's action content can explain its attraction to a few million action buffs).
Define a model for the data with a smaller number of parameters, and infer the parameters from the data.
SVD (= singular value decomposition) reduces in this case to the assumption that a user's overall rating is composed of a sum of preferences over movie features.

9 Example: Just one Feature
Suppose only 1 feature, overall quality, and 1 corresponding user tendency to rate high/low.
Three users: $U_u = (1,2,3)$
Five movies: $V_m = (1,1,3,2,1)$
Predicted rating matrix:
$$P_{um} = U_u V_m = \begin{pmatrix} 1 & 1 & 3 & 2 & 1 \\ 2 & 2 & 6 & 4 & 2 \\ 3 & 3 & 9 & 6 & 3 \end{pmatrix}$$
Explain 15 data points with only 7 parameters (3 + 5 = 8 values, less one overall scale).
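
The one-feature prediction is just an outer product of the user and movie vectors. A short sketch using the slide's values (the variable names are illustrative):

```python
# One feature: each user has one number, each movie has one number.
U = [1, 2, 3]          # three users
V = [1, 1, 3, 2, 1]    # five movies

# P[u][m] = U[u] * V[m]: 15 predicted ratings from 3 + 5 stored values.
P = [[u * v for v in V] for u in U]

for row in P:
    print(row)
```

Rescaling `U` by a constant and `V` by its inverse leaves `P` unchanged, which is why only 7 of the 8 stored values are independent.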

10 More Features
Now suppose 40 features:
- Each movie is described by 40 values, specifying for each feature the degree to which it is contained in the movie.
- Each user is described by 40 values, specifying the degree to which each feature is preferred by the user.
To calculate a rating, sum the products of each user preference multiplied by the corresponding movie feature.
E.g., movie Terminator might be (action=1.2, chickflick=-1, ...), and user Joe might be (action=3, chickflick=-1, ...). Combine to find Joe likes Terminator with rating $3 \times 1.2 + (-1) \times (-1) + \ldots = 4.6 + \ldots$
(Negative numbers are OK: Terminator is anti-chickflick, Joe has an aversion to chickflicks, so Terminator actively scores positive points with Joe for being decidedly un-chickflicky.)
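
A small sketch of the sum of products for the two features that the slide spells out. The remaining 38 features are elided on the slide, so the result is only the partial sum; the dictionary layout and the function name `predicted_rating` are assumptions made for illustration.

```python
# Only the two features listed on the slide; the other 38 are not given.
terminator = {"action": 1.2, "chickflick": -1.0}
joe = {"action": 3.0, "chickflick": -1.0}

def predicted_rating(user, movie):
    """Sum over features of (user preference) * (movie feature value)."""
    return sum(user[f] * movie[f] for f in user)

partial = predicted_rating(joe, terminator)   # 3*1.2 + (-1)*(-1)
print(round(partial, 2))
```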

12 Concise Model
Model requires roughly 40 × (500K + 17K) values, or about 20M: less than the original 8.5B by a factor of 400.
Predicted ratings:
$$P_{um} = \sum_{f=1}^{r} U_u^f V_m^f$$
$U_u^f$ is the preference of user $u$ for feature $f$, and $V_m^f$ is the degree to which movie $m$ contains feature $f$ (up to $r = 40$).
The original matrix has been decomposed into a product of two rectangular matrices: the 500,000 × 40 user preference matrix $U_u^f$, and the 40 × 17,000 movie feature matrix $V_m^f$.
(Matrix multiplication just performs the products and sums described above, resulting in an approximation to the original 500,000 × 17,000 rating matrix.)
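
The parameter count and the matrix product can be checked with a few lines. This is a sketch in plain Python lists, with `predict_all` and the tiny 3 × 2 and 2 × 3 matrices as made-up illustrations rather than anything from the slides.

```python
# Parameter count for r = 40 features, using the slide's rounded sizes.
n_users, n_movies, r = 500_000, 17_000, 40
n_params = r * (n_users + n_movies)   # values stored by the model
n_cells = n_users * n_movies          # cells in the full rating matrix
ratio = n_cells / n_params            # roughly a factor of 400

def predict_all(U, V):
    """Multiply an (n x r) user matrix by an (r x m) movie matrix.

    U[u][f] is the preference of user u for feature f,
    V[f][m] is the amount of feature f in movie m.
    """
    n_features = len(V)
    n_cols = len(V[0])
    return [[sum(row[f] * V[f][j] for f in range(n_features))
             for j in range(n_cols)]
            for row in U]

# Tiny hypothetical example with n = 3 users, r = 2 features, m = 3 movies.
U_small = [[1, 0], [0, 1], [1, 1]]
V_small = [[1, 2, 3], [4, 5, 6]]
P_small = predict_all(U_small, V_small)
print(n_params, round(ratio), P_small)
```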

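13 Matrix shapes
One feature (rank 1): a column of user values times a row of movie values,
$$P_{um} = U_u V_m = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \begin{pmatrix} 1 & 1 & 3 & 2 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 3 & 2 & 1 \\ 2 & 2 & 6 & 4 & 2 \\ 3 & 3 & 9 & 6 & 3 \end{pmatrix}$$
General case with $r$ features: an $n \times r$ matrix times an $r \times m$ matrix gives an $n \times m$ matrix,
$$P_{um} = \sum_{f=1}^{r} U_u^f V_m^f, \qquad \underbrace{P}_{n \times m} = \underbrace{U}_{n \times r} \, \underbrace{V}_{r \times m}$$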

14 How to calculate model parameters
Singular value decomposition (SVD) is the mathematical method for finding the two smaller matrices which minimize the resulting approximation error (rmse) to the original matrix.
The rank-40 SVD of the 8.5B matrix gives the best approximation within the framework of the 40-feature user-movie-rating model.
It is difficult to calculate the SVD of a large matrix. Moreover, we don't have all 8.5B entries (instead we have 100M entries and 8.4B empty cells).
But we can train the parameters by following the derivative of the approximation error (steepest descent).
(This also means the unknown error on the 8.4B empty matrix elements can be ignored. For a fully known matrix, the end result coincides exactly with the SVD.)

15 Summary
End result of SVD = list of inferred categories, sorted by relevance.
Each category is expressed by the extent to which each user and movie belong (or anti-belong), as read off from the columns of the user matrix U, or the rows of the movie matrix V.
Sorted by value, a category might represent action movies (movies with a lot of action at the top, slow movies at the bottom), and correspondingly users who like action movies (at the top, with those who prefer slow movies at the bottom).
The procedure discovers whatever the data implies: the algorithm itself has no inherent concept of action (it uses neither titles nor descriptions).
It uses only a hundred million examples of the form: user gives movie 4819 a rating of 3 (and 84 of 85 ratings are missing).

17 Incremental SVD method (from simon/journal/ html)
Recall:
$R_{um}$ = known rating by user $u$ for item $m$
$P_{um}$ = predicted rating for user $u$ for item $m$
Singular vectors indexed by $f = 1, \ldots, r$
$U_u^f$ = element of the $f$th singular user vector for the $u$th user
$V_m^f$ = element of the $f$th singular item vector for the $m$th movie
SVD computes the prediction as:
$$P_{um} = \sum_{f=1}^{r} U_u^f V_m^f$$

18 Error Gradient
The error in the prediction for user $u$'s rating of movie $m$ is $e_{um} = R_{um} - P_{um}$, and the total rms error $E$ for all predictions is given by
$$E^2 = \sum_{u,m} e_{um}^2$$
For gradient descent, take the partial derivative of the squared error with respect to each of the parameters $U_u^f$ and $V_m^f$:
$$\frac{\partial E^2}{\partial U_u^f} = -\sum_m 2 e_{um} \frac{\partial P_{um}}{\partial U_u^f} = -2 \sum_m e_{um} V_m^f = -2 \sum_m (R_{um} - P_{um}) V_m^f$$
(the derivative for $U_u^f$ is just the sum over all the ratings by user $u$). Similarly
$$\frac{\partial E^2}{\partial V_m^f} = -\sum_u 2 e_{um} \frac{\partial P_{um}}{\partial V_m^f} = -2 \sum_u e_{um} U_u^f = -2 \sum_u (R_{um} - P_{um}) U_u^f$$
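
One way to sanity-check the derivative formulas is to compare them with a finite difference on a toy problem. The sketch below assumes the gradient is $-2 \sum e_{um} V_m^f$ for users (and the analogous sum for movies), restricted to known ratings; all names and the toy numbers are hypothetical.

```python
def predict(U, V, u, m):
    """P_um = sum over f of U[u][f] * V[f][m]."""
    return sum(U[u][f] * V[f][m] for f in range(len(V)))

def squared_error(R, U, V):
    """E^2 summed over the known ratings only."""
    return sum((r - predict(U, V, u, m)) ** 2 for (u, m), r in R.items())

def grad_U(R, U, V, u, f):
    """dE^2/dU_u^f = -2 * sum over movies rated by u of e_um * V_m^f."""
    return -2.0 * sum((r - predict(U, V, uu, m)) * V[f][m]
                      for (uu, m), r in R.items() if uu == u)

def grad_V(R, U, V, m, f):
    """dE^2/dV_m^f = -2 * sum over users who rated m of e_um * U_u^f."""
    return -2.0 * sum((r - predict(U, V, u, mm)) * U[u][f]
                      for (u, mm), r in R.items() if mm == m)

# Toy problem: 2 users, 2 movies, 2 features, 3 known ratings.
R = {(0, 0): 4.0, (0, 1): 1.0, (1, 1): 5.0}
U = [[0.5, -0.2], [0.1, 0.3]]
V = [[0.4, 0.7], [-0.6, 0.2]]

# Forward finite difference in U[0][1] versus the analytic gradient.
h = 1e-6
base = squared_error(R, U, V)
U[0][1] += h
numeric = (squared_error(R, U, V) - base) / h
U[0][1] -= h
analytic = grad_U(R, U, V, 0, 1)
print(numeric, analytic)
```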

19 Gradient Descent
Starts at point $P_0$ and moves from $P_i$ to $P_{i+1}$ by minimizing along the line extending from $P_i$ in the direction of $-\nabla f(P_i)$, the local downhill gradient.
For a 1d function $f(x)$, this takes the form of iterating
$$x_i = x_{i-1} - \epsilon f'(x_{i-1})$$
for small $\epsilon > 0$, from starting point $x_0$ until a fixed point is reached.
[Figure: $f(x) = x^3 - 2x$ with $\epsilon = 0.1$ and starting points $x_0 = 2$, ...]
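
The 1d iteration can be tried directly. This sketch assumes the slide's example function is $f(x) = x^3 - 2x$ (so $f'(x) = 3x^2 - 2$), with step $\epsilon = 0.1$ and starting point $x_0 = 2$; the function name, tolerance, and iteration cap are illustrative choices.

```python
import math

def gradient_descent_1d(df, x0, eps=0.1, tol=1e-12, max_iter=10_000):
    """Iterate x <- x - eps * df(x) until successive values differ by < tol."""
    x = x0
    for _ in range(max_iter):
        x_next = x - eps * df(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Assumed example: f(x) = x^3 - 2x, whose derivative is 3x^2 - 2.
df = lambda x: 3 * x ** 2 - 2

x_min = gradient_descent_1d(df, x0=2.0)
print(x_min, math.sqrt(2 / 3))   # fixed point vs. the local minimum of f
```

The fixed point is where $f'(x) = 0$, here the local minimum at $x = \sqrt{2/3}$. Because a cubic is unbounded below, a start to the left of the local maximum at $-\sqrt{2/3}$ would run away instead of converging.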

20 Inner Loop
In the simple backpropagation algorithm for gradient descent, the parameter step is the learning rate $l = 2\epsilon$ multiplied by the gradient:
$$\Delta U_u^f = -\epsilon \frac{\partial E^2}{\partial U_u^f} = l \sum_m e_{um} V_m^f$$
$$\Delta V_m^f = -\epsilon \frac{\partial E^2}{\partial V_m^f} = l \sum_u e_{um} U_u^f$$
This translates to the inner loop of the code as
    real err = l * (rating(user,movie) - predictrating(user,movie));  /* l * e_um */
    real uv = uservalue[f][user];  /* keep the old value so both updates use the same point */
    uservalue[f][user] += err * movievalue[f][movie];
    movievalue[f][movie] += err * uv;
(sum the former over movies, the latter over users, and iterate to the minimum)
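
To see the inner loop at work, here is a rough Python sketch that applies the per-rating update repeatedly to a toy set of known ratings. It is a simplification under stated assumptions: all features are updated together with one error value per rating (the original journal post trains features one at a time), and the function name `train`, the learning rate, the epoch count, and the random initialization are choices made for this example.

```python
import random

def train(ratings, n_users, n_movies, r=1, lrate=0.01, epochs=2000, seed=0):
    """Fit uservalue[f][user] and movievalue[f][movie] to the known ratings.

    ratings : dict mapping (user, movie) -> known rating R_um
    Returns a function predict(user, movie) -> P_um.
    """
    rng = random.Random(seed)
    uservalue = [[rng.uniform(0.5, 1.5) for _ in range(n_users)]
                 for _ in range(r)]
    movievalue = [[rng.uniform(0.5, 1.5) for _ in range(n_movies)]
                  for _ in range(r)]

    def predict(user, movie):
        return sum(uservalue[f][user] * movievalue[f][movie]
                   for f in range(r))

    for _ in range(epochs):
        for (user, movie), rating in ratings.items():
            err = lrate * (rating - predict(user, movie))   # l * e_um
            for f in range(r):
                uv = uservalue[f][user]                     # old user value
                uservalue[f][user] += err * movievalue[f][movie]
                movievalue[f][movie] += err * uv
    return predict

# Rank-1 toy data from the one-feature example, with one rating held out.
U_true, V_true = [1, 2, 3], [1, 1, 3, 2, 1]
ratings = {(u, m): float(U_true[u] * V_true[m])
           for u in range(3) for m in range(5)}
held_out = ratings.pop((2, 2))           # true value 3 * 3 = 9

predict = train(ratings, n_users=3, n_movies=5)
print(held_out, round(predict(2, 2), 3))
```

Only the 14 known ratings enter the updates, yet the held-out cell is recovered because the data are exactly rank 1 and every user and movie still has at least one known rating.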

22 Discussion 2
K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation 28, 11-21.
Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28.

23 Exhaustivity and specificity
What are the semantic and statistical interpretations of specificity?
- Semantic: tea, coffee, cocoa (more specific, smaller # docs) vs. beverage (less specific, larger # docs)
- Statistical: specificity is a function of term usage; a frequently used term is non-specific (even if it has a specific meaning).
Exhaustivity of a document description is determined by the number of controlled vocabulary terms assigned.
Reject frequently occurring terms?
- via conjunction (but according to item C of table I, the average number of matched terms is smaller than the request, so this would reduce recall)
- remove them entirely (again hurts recall, since they are needed for many relevant documents)
What is graphed in figure 1 and what does it illustrate? (Why aren't the axes labelled?)

24 idf weight
Spärck Jones defines $f(n) = m$ such that $2^{m-1} < n \le 2^m$.
(In other words $f(n) = \lceil \log_2 n \rceil$, where $\lceil x \rceil$ denotes the smallest integer not less than $x$, equivalent to one plus the greatest integer less than $x$.)
She suggests weight $= f(N) - f(n) + 1$,
e.g., for $N = 200$ documents, $f(N) = 8$ ($2^8 = 256$):
- $n = 90$: $f(n) = 7$ ($2^7 = 128$), hence weight $= 8 - 7 + 1 = 2$
- $n = 3$: $f(n) = 2$ ($2^2 = 4$), hence weight $= 8 - 2 + 1 = 7$
The overall weight for a query containing both terms is then $2 + 7 = 9$.
The $+1$ ensures that terms occurring in more than roughly half the documents in the corpus are not given zero weight (for $N = 200$, anything in more than 128 documents).
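
The worked numbers can be reproduced with integer arithmetic. In this sketch, `f` uses Python's `int.bit_length` to find the $m$ with $2^{m-1} < n \le 2^m$, which avoids floating-point logarithms; the function names are illustrative.

```python
def f(n):
    """Return m such that 2**(m-1) < n <= 2**m, for an integer n >= 1."""
    return (n - 1).bit_length()

def weight(N, n):
    """Term weight f(N) - f(n) + 1 for a term occurring in n of N documents."""
    return f(N) - f(n) + 1

N = 200
w_common = weight(N, 90)   # term in 90 of 200 documents
w_rare = weight(N, 3)      # term in 3 of 200 documents
print(f(N), w_common, w_rare, w_common + w_rare)
```

A term occurring in 129 or more of the 200 documents gets `f(n) = 8` and therefore weight 1 rather than 0, which is the role of the added 1.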

25 idf weight, modified
Robertson: approximate the Spärck Jones weight by a logarithm,
$$f(N) - f(n) + 1 \approx \log_2(N/n) + 1$$
Note that $n/N$ is the probability that an item chosen (at random) will contain the term.
Suppose an item contains terms a, b, c in common with the query, and the probabilities are $p_a$, $p_b$, $p_c$. Then the weight assigned to the document is
$$\log(1/p_a) + \log(1/p_b) + \log(1/p_c) = \log(1/(p_a p_b p_c))$$
($p_a p_b p_c$ is the probability that a doc will randomly contain all three terms a, b, c, under what assumption?)
This quantifies the statement: the less likely it is that a given combination of terms occurs, the more likely the document is relevant to the query (a theoretical justification for logarithmic idf weights).
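
The identity behind the logarithmic weight is easy to check numerically. In this sketch the document counts for terms a, b, c are made up for illustration, and the additive 1 is left out so that the sum matches the slide's expression $\log(1/(p_a p_b p_c))$.

```python
import math

def log_weight(N, n):
    """log2(N/n) = log2(1/p), where p = n/N is the term's document probability."""
    return math.log2(N / n)

N = 200
counts = {"a": 90, "b": 3, "c": 10}   # hypothetical document counts

# Sum of per-term weights for a document matching all three query terms.
doc_weight = sum(log_weight(N, n) for n in counts.values())

# Probability of containing all three terms if they occur independently.
p_all = math.prod(n / N for n in counts.values())

print(doc_weight, math.log2(1 / p_all))
```

The two printed values agree because the product $p_a p_b p_c$ is the joint probability only when term occurrences are treated as independent.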
