Learning Topical Transition Probabilities in Click Through Data with Regression

Xiao Zhang, Prasenjit Mitra
Department of Computer Science and Engineering / College of Information Sciences and Technology
The Pennsylvania State University
xiazhang@cse.psu.edu, pmitra@ist.psu.edu

ABSTRACT

The transition of search engine users' intents has been studied for a long time. The knowledge of intent transition, once discovered, can yield a better understanding of how different topics are related and can be used in many applications, such as building recommender systems and ranking. In this paper, we study the problem of finding the transition probabilities of digital library users' intents among different topics. We use the click-through data from CiteSeerX and extract the click chains. Each document in a click chain is represented by a topical vector generated by LDA models. We then model the task of finding the topical transition probabilities as a multiple-output linear regression problem, in which the input and output are two consecutive topical vectors in a click chain and the elements of the weight matrix correspond to the transition probabilities. Given the constraints of our task, we propose a new algorithm based on the exponentiated gradient. Our algorithm provides good interpretability as well as a small sum-of-squares error comparable to existing regression methods. We are particularly interested in the off-diagonal elements of the learned weight matrix, since they represent the transition probabilities between different topics. The authors' interpretation of these transitions is given at the end of the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WebDB '10 Indianapolis, IN, USA
Copyright 2010 ACM 978-1-4503-0186-2/10/06 $10.00

1 INTRODUCTION

The search intent of a search engine user may switch to a different topic even within the same search session. According to previous studies, 10-30% of search engine users perform multitasking [15] [12], and multiple topics exist in over 80% of search sessions that have two or more queries [16]. Thus, detecting the transition of users' intents has become a crucial task for search engines. The knowledge of users' intent transition, once detected, can help researchers understand how different topics are related, and can further be used to improve query suggestion, ranking, recommendation and the overall search quality.

Previous studies on user intent transition have focused mainly on Web search. However, nowadays more and more vertical search engines have emerged and gained attention from both industry and academia, such as digital libraries, chemical search engines and travel search engines. Discovering the knowledge of intent transition in such search engines is equally important as in Web search. Vertical search engines have different characteristics from Web search engines. The digital library search engine CiteSeerX^1, for example, has a closed corpus: most of its documents are research papers in the computer science domain, so the number of topics in its dataset is limited, while documents on the Web cover a much larger range of topics. This difference makes the study of intent transition between each pair of topics possible.

1 http://citeseerx.ist.psu.edu/
Users' information needs are expressed implicitly through sequences of querying, clicking and reformulating behaviors. Existing work focused on the analysis of such behaviors [9] [6]. Various features, such as the time interval between queries, time thresholding, and shared common terms between adjacent queries, are used to determine whether a transition happens. However, these features are not suitable in vertical search engines. Take CiteSeerX for example: documents are connected through citation links, and users often start with one search query to visit one document, then follow the citation links to visit other documents. Intent transition may happen when users traverse the citation graph. The cosine similarity between the documents retrieved in response to the queries was also used as a feature [9] [13] to determine the boundary of a search task. However, such similarities can only measure the closeness of two documents, not the topical distributions of the documents. Besides, they are used to determine whether an intent transition happens, not to discover the knowledge of intent transition between a pair of topics. If we can instead find how intents shift among all pairs of topics (in other words, the transition probabilities), we can get a clearer picture of the connections between topics.

In this paper, we study the problem of user intent transition among topics, discover the transition probabilities and find the relatedness among the topics. We focus on the CiteSeerX vertical search engine, the documents of which are mainly in the computer science domain. We train Latent Dirichlet Allocation (LDA) [3] models on all the documents in CiteSeerX. The trained LDA model is then applied to the documents in the click chains extracted from the query log and generates the topical distribution (in the form of a topical vector) for each of these documents. We find the transition

probabilities by solving a multiple-output linear regression problem, in which the inputs and outputs are the topical vectors of the two documents in each click pair, and the weight matrix to be learned is the transition probability matrix. Since each weight bears the meaning of a transition probability, we have two constraints on the regression problem: (1) all the weights must be non-negative, and (2) the transition probabilities from one topic to all other topics must sum up to 1. Based on the two constraints, we propose a new regression algorithm based on the exponentiated gradient algorithm. It minimizes the sum-of-squares error under the constraints. The result leads to the discovery of connections between some topics; for example, people who are interested in graph problems may switch their intent to the techniques and fundamental issues needed to solve the graph problems, such as optimization techniques, computational cost and efficiency. Such knowledge can be used to recommend related topics to people who show interest in some topics.

2 RELATED WORK

2.1 User Intent Transition

Existing work on user intent transition focused on finding a timeout to cut off between queries. Different threshold values for timeouts were proposed, ranging from 5 to 120 minutes [4] [11] [14] [6] [1]. However, Jones and Klinkner [9] examined all the timeout thresholds, applied them to real search engine data, and claimed that no time threshold is effective at identifying task boundaries. Another set of work used the query contents. Jansen et al. [8] used common words in adjacent queries as a feature to segment sessions. He et al. [6] studied the adding and deleting of terms in queries to detect topic shifts in user query streams. Ozmutlu et al. [15] [12] used various features to study user intent shifts. The features include (1) the change of terms of consecutive queries within a session, (2) the time interval, which is the difference of the arrival times of two consecutive queries, and (3) the order of each query in the user session. They evaluated different statistical models and used them to predict user intent transition in a Web search session. However, these works rely on the search queries. In digital libraries such as CiteSeerX, users frequently follow the citation links, not the search results, to visit other documents. Therefore, these works do not fit well in our application.

2.2 Linear Regression

Regression models predict the value of one or more continuous target variables y given the value of a d-dimensional vector x. Linear regression models learn the weights from a set of observations by minimizing a loss function. If the sum-of-squares error

L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

is chosen as the loss function, the linear regression problem can be solved by the ordinary least squares (OLS) technique. Although the OLS technique minimizes the sum-of-squares error, it does not fit well in our scenario because it does not satisfy the constraints set forth in our application. As will be discussed in detail in the Problem Formulation section, our problem is a multiple-output linear regression problem with two constraints: (1) all the weights represent probabilities and thus should be non-negative, and (2) each row of the learned weight matrix should sum up to 1. OLS can satisfy neither of them. Regularized linear regression places constraints on the weights by adding a penalty term to the loss function. Two very important regularized linear regression models are Ridge [7] and Lasso [17] regression.
Lasso regression has a constraint on the sum of the absolute values of the weights. Recently, Efron et al. gave an efficient solution [5]. However, neither of them satisfies both constraints in our application simultaneously. The Exponentiated Gradient (EG) algorithm was proposed by Kivinen and Warmuth [10]. This online algorithm takes a single observation each time and updates the weight vector. The EG algorithm uses the components of the gradient in the exponents of factors that are used to update the weight vector multiplicatively. It satisfies the constraints that all the weights are positive and sum up to 1, which is very close to our constraints. However, as will be discussed in Section 4, our application requires normalization in a different direction in the weight matrix (along rows, not columns). Our proposed new algorithm is based on the Exponentiated Gradient method.

3 PROBLEM FORMULATION

Topical Intent Representation: Researchers have created topic hierarchies to define topics for papers. However, the topics assigned to papers this way are not suitable in our scenario for the following reasons: (1) one research paper can cover multiple topics, while human-assigned topics cover only the primary topic of the paper; (2) new topics emerge over time, which may not be captured by human-generated topic hierarchies; (3) people use different terminologies for papers on similar topics; (4) not all papers have an assigned topic. Taking these reasons into consideration, we use LDA models to automatically find the topical distribution of each document over k topics. A topical vector is generated for each document to represent, with probabilities, the topics covered by the paper. In CiteSeerX, a user is given various information about a document, such as its title, snippets, authors and publishing year, before actually visiting it. We assume the user can judge the relevance of the document to his interests well. Therefore we use the topical vector of the document visited by the user to represent the user's intent at that moment.

Topical Intent Transition: Since a transition of intents happens between successive visits to two documents having a citation connection, we pulled the query log from CiteSeerX and extracted all pairs of successively visited documents by each user in each search session, then used the pairs as our observations. We call such a pair of documents a click pair. We assume a user's interest in each topic switches to any one of all topics in the next click with some probability. The transition probability is defined below:

Definition 1 (Topical Intent Transition Probability). The topical intent transition probability p_{ij} = \mathrm{prob}(z_{t+1} = j \mid z_t = i) (1 \le i, j \le k) is the transition probability of a user's

intent from Topic i to Topic j in successive visits to two documents. The change in the topical vectors of the two documents in a click pair is the result of such a transition.

Putting all the transition probabilities together, we have a transition probability matrix P:

P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1k} \\ p_{21} & p_{22} & \cdots & p_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k1} & p_{k2} & \cdots & p_{kk} \end{pmatrix}

Note that in each row, the probabilities are from one topic to all topics. Therefore, we have the constraint that each row should sum up to 1, i.e. \sum_{j=1}^{k} p_{ij} = 1 for all i.

The goal of our work: given the query log and the documents visited by search engine users, find the topical vectors (x, y) for each click pair. Furthermore, find the transition probability matrix P such that the predicted topical vector \hat{y} = P^T x is close to y, where closeness is measured by the sum-of-squares error. We model the problem as a multiple-output linear regression problem, in which the first vector x is the input, y is a multi-dimensional output vector, and P is the weight matrix to learn. Given N observed pairs of x and y, the learned weight matrix should minimize the total sum-of-squares error.

4 MODEL DESCRIPTION

4.1 Notations

Given an input vector x = (x_1, x_2, \ldots, x_k)^T, a linear regression model generates a single-value output y by linearly combining the input values: \hat{y} = \sum_{i=1}^{k} w_i x_i = w^T x. The weights w = (w_1, w_2, \ldots, w_k)^T of the linear combination are learned from the training data. Note that in our application we do not introduce the constant term x_0 = 1 into the model, since the weights bear the meaning of probabilities. We use X and Y to denote the set of observed input and output vectors, respectively:

X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \quad Y = \begin{pmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_N^T \end{pmatrix}

The weight matrix W and the transition probability matrix P are equivalent and will be used interchangeably. Note that w_{ij} = p_{ji} for all i, j \in [1, k]:

W = (w_1, w_2, \ldots, w_k) = \begin{pmatrix} w_{11} & w_{21} & \cdots & w_{k1} \\ w_{12} & w_{22} & \cdots & w_{k2} \\ \vdots & \vdots & & \vdots \\ w_{1k} & w_{2k} & \cdots & w_{kk} \end{pmatrix} = P

Table 1 gives the notations.

Table 1: Notations
  k     dimension of the input/output vectors x, y
  N     number of observations
  x, y  input and output vectors (k x 1 column vectors)
  w     weight vector (k x 1 column vector)
  W     weight matrix (k x k matrix)
  η     learning rate

4.2 Linear Regression Model

We first consider the linear regression model which generates a single-value output. Given X and Y (in Y, each y_i, i \in [1, N], is a single value), the linear regression model finds the weight vector that minimizes the sum-of-squares error over the training set D:

L_D = \mathrm{SSE} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - w^T x_i)^2

Consider minimizing the sum-of-squares error with respect to w. Setting the gradient of the error function to zero, we obtain the OLS solution:

w = (X^T X)^{-1} X^T Y

In our application, the output is a k-dimensional vector instead of a single value, and the weights form a matrix W instead of a vector. In this case y = W^T x. Given the training data X and Y (each y_i in Y, i \in [1, N], is a k-dimensional vector, so Y is an N x k matrix), the multiple-output linear regression model uses the following equation to compute the weight matrix W:

W = (X^T X)^{-1} X^T Y

This is equivalent to taking each column of Y as the output of an individual single-output linear regression problem, solving each of them, and putting the solutions together [2]. Although the OLS solution minimizes the sum-of-squares error for both the single- and multiple-output linear regression problems, it does not satisfy the constraints on the weights in our application, so the weights cannot be interpreted as probabilities. We modify the OLS solution by setting the negative values to zero and normalizing the rows of the matrix so that each row sums up to 1. We call this modified method normalized linear regression (nlr).
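As a concrete illustration of the nlr baseline just described, the following is a minimal sketch in Python with numpy (our own code, not the authors'; function and variable names are ours). It computes the multiple-output OLS solution W = (X^T X)^{-1} X^T Y and then applies the two post-processing steps: clipping negative weights to zero and normalizing each row.

    import numpy as np

    def fit_nlr(X, Y):
        """Multiple-output OLS followed by the nlr post-processing:
        clip negative weights to zero, then normalize each row to sum
        to 1. X: (N, k) inputs; Y: (N, k) outputs; returns a (k, k)
        row-stochastic matrix."""
        # OLS solution W = (X^T X)^{-1} X^T Y, computed stably via lstsq
        W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
        P = np.clip(W, 0.0, None)            # constraint (1): non-negative weights
        P /= P.sum(axis=1, keepdims=True)    # constraint (2): each row sums to 1
        return P

Note that if clipping zeroed out an entire row, the row normalization would divide by zero; a small epsilon in the denominator would guard against that corner case.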
4.3 Exponentiated Gradient

The exponentiated gradient (EG) algorithm for solving the linear regression problem was first proposed by Kivinen and Warmuth [10]. It starts from an initial guess of the weight vector. After receiving each observation, it updates the weight vector w. EG guarantees that each weight in the weight vector is positive and that the weights sum up to 1. This is very close to our constraints but slightly different, as will be shown later. The updating rule for the EG algorithm is given by Kivinen and Warmuth [10]; we present it here for clarity. Algorithm 1 gives the updating rule applied when receiving a new observation at the t-th iteration.

Algorithm 1: updating rule of the Exponentiated Gradient algorithm
Input: weight vector w_t = (w_{t,1}, w_{t,2}, ..., w_{t,p});
       learning rate η > 0;
       an observation (x_t, y_t), where x_t = (x_{t,1}, x_{t,2}, ..., x_{t,p})
Output: weight vector w_{t+1}
Procedure:
1: ŷ_t ← w_t^T x_t
2: for i ← 1 to p do
3:   w_{t+1,i} = w_{t,i} r_i / Σ_{j=1}^{p} w_{t,j} r_j, where r_i = e^{-2η(ŷ_t - y_t) x_{t,i}}
4: end for
5: return w_{t+1}

The EG algorithm starts with an initial guess w_1 = (w_{1,1}, w_{1,2}, ..., w_{1,p}) which satisfies Σ_i w_{1,i} = 1 and w_{1,i} > 0 for all i. The normalization in the updating rule guarantees that the weights in the new weight vector sum up to 1 (i.e. w_{t+1,1} + w_{t+1,2} + ... + w_{t+1,p} = 1). The usual choice for w_1 is the uniform probability vector (1/p, 1/p, ..., 1/p) (i.e. w_{1,i} = 1/p for all i). A typical learning rate is η = 2/(3R^2), where R = max_t(max_i x_{t,i} - min_i x_{t,i}) is an upper bound on the maximum difference between the components x_{t,i} of an instance x_t.
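The following is a minimal sketch of the Algorithm 1 update in Python with numpy (an illustration under our own naming, not code from the paper):

    import numpy as np

    def eg_update(w, x, y, eta):
        """One Exponentiated Gradient step for the squared loss:
        a multiplicative update followed by normalization, so the
        weights stay positive and sum to 1."""
        y_hat = w @ x                              # step 1: prediction
        r = np.exp(-2.0 * eta * (y_hat - y) * x)   # factors r_i
        w_new = w * r                              # multiplicative update
        return w_new / w_new.sum()                 # normalize to sum to 1

    # Uniform initial guess and the typical learning rate mentioned above:
    # p = 20; w = np.full(p, 1.0 / p); eta = 2.0 / (3.0 * R**2)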

We follow the same framework as in the previous section to solve the multiple-output linear regression problem using the exponentiated gradient. Given the observed output matrix Y, we extract each of the k columns of Y, solve k single-output linear regression problems using the basic EG algorithm, and put the results together to form the weight matrix. The basic EG guarantees that each weight column vector sums up to 1. However, each row of the weight matrix consists of the transition probabilities from one topic to all topics; therefore, each row of the weight matrix should sum up to 1, instead of each column. We modify the basic EG according to our constraints: we remove the normalization step when updating each column weight vector and add a normalization step along each row after the entire weight matrix has been updated. Algorithm 2 gives the details. Note that in the algorithm description, w_{t,i,j} means the element at the i-th row and j-th column of the matrix W_t.

Algorithm 2: updating rule of the multiple-output, normalized Exponentiated Gradient algorithm
Input: weight matrix W_t = (w_{t,1}, w_{t,2}, ..., w_{t,k});
       learning rate η > 0;
       an observation (x_t, y_t), where x_t = (x_{t,1}, x_{t,2}, ..., x_{t,p}) and y_t = (y_{t,1}, y_{t,2}, ..., y_{t,k})
Output: weight matrix W_{t+1}
Procedure:
1: for j ← 1 to k do
2:   y ← y_{t,j}  {the j-th component of the vector y_t}
3:   ŷ ← w_{t,j}^T x_t
4:   for i ← 1 to p do
5:     r_i = e^{-2η(ŷ - y) x_{t,i}}
6:     w_{t+1,i,j} = w_{t,i,j} r_i
7:   end for
8:   w_{t+1,j} = (w_{t+1,1,j}, w_{t+1,2,j}, ..., w_{t+1,p,j})^T
9: end for
10: W_{t+1} = (w_{t+1,1}, w_{t+1,2}, ..., w_{t+1,k})
11: normalize each row of W_{t+1}
12: return W_{t+1}
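A sketch of the Algorithm 2 update in the same style (numpy; our own code, with the input and output dimensions coinciding, p = k). W[i, j] holds w_{t,i,j}, so each row i carries the transition probabilities out of topic i, and the single normalization at the end runs along rows:

    import numpy as np

    def seg_update(W, x, y, eta):
        """One step of the multiple-output, row-normalized EG update:
        a multiplicative EG update per output column (steps 1-9) with
        no per-column normalization, then one normalization along each
        row of the whole matrix (step 11)."""
        for j in range(W.shape[1]):                   # one sub-problem per output
            y_hat = W[:, j] @ x                       # prediction for output j
            r = np.exp(-2.0 * eta * (y_hat - y[j]) * x)
            W[:, j] *= r                              # update column j
        W /= W.sum(axis=1, keepdims=True)             # each row sums to 1
        return W

Run over the observations (x_t, y_t) in sequence, starting from the uniform matrix W = np.full((k, k), 1.0 / k), this yields the transition probability matrix reported in the experiments below.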
5 EXPERIMENT RESULTS

5.1 Data Set

We used the documents and query logs of the CiteSeerX search engine. We pulled out 1,143,971 documents with their titles, abstracts and keywords, and trained LDA models on this data set. We also extracted a query log of 10 months (Feb 2008 to Nov 2008) from the CiteSeerX search engine to extract the click pairs. The entries in the query log record the users' behaviors. The entries are grouped by users and sessions (sessions are automatically determined by the search engine). Users often start by searching keywords, then visit documents in the result list and follow the citation links to visit other documents. We focus only on 8 interesting user behaviors, because they indicate that the user is interested in a document. These behaviors include: downloading a paper, viewing the summary of a paper, adding a paper to a collection, viewing related papers, correcting mistakes in a paper, monitoring changes to a paper and viewing different versions of a paper. Each of them corresponds to a document, indicating that the user is interested in this document.

5.2 Data Processing

We used Phan and Nguyen's implementation of LDA^2 on the documents from CiteSeerX. We removed the traditional stop words such as "a" and "the", as well as additional stop words which are very common in research papers but do not bear topical meaning, such as "author", "abstract" and "copyright". We identified a total of 222 such stop words. In the query log, there are 25,41 sessions which contain visits to multiple documents. In these sessions, 63,89 unique documents were visited. A total of 97,28 click pairs were extracted.

2 http://gibbslda.sourceforge.net/
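The paper uses Phan and Nguyen's GibbsLDA++ implementation linked above; purely for illustration, here is a rough equivalent of the preprocessing-and-topical-vector pipeline in Python with the gensim library (an assumption of ours, not the authors' tooling; the stop-word set and toy documents below are placeholders):

    from gensim import corpora, models

    # placeholders standing in for the paper's 222 domain stop words and corpus
    extra_stopwords = {"author", "abstract", "copyright"}
    docs = [["graph", "vertex", "author"], ["optimization", "cost", "abstract"]]

    texts = [[t for t in d if t not in extra_stopwords] for d in docs]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)

    # topical vector of a document: a probability for each of the 20 topics
    vec = lda.get_document_topics(dictionary.doc2bow(texts[0]),
                                  minimum_probability=0.0)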

5.3 Model Comparison

We implemented and compared 5 models: (1) basic linear regression with multiple outputs (blr); (2) normalized linear regression with multiple outputs (nlr), where normalization means resetting negative weights to zero and normalizing each row; (3) basic Exponentiated Gradient linear regression with multiple outputs (beg); (4) basic Exponentiated Gradient normalized at the final step (feg), which takes the weight matrix generated by beg as input and normalizes each row; (5) step-wise normalized Exponentiated Gradient (seg), which is our proposed algorithm, described in Algorithm 2.

We set 6 different values for the number of topics when training the LDA model: 20, 30, 40, 50, 100 and 150. After the training, we generated the topical vector for each document in the click pairs. Since we have 97,28 click pairs, we have 97,28 observations. We used 10-fold cross validation to evaluate and compare the regression models.

Figure 1 shows the comparison of the mean squared error (MSE) of each model, with one panel per setting of the number of topics. For each model, it shows the mean, max and min MSE obtained from the 10-fold cross validation: the cross in the middle of each line represents the mean of the errors, while the points at the top and bottom of the line correspond to the max and min errors, respectively.

[Figure 1: Comparison of the mean squared errors of the different regression models; panels (a)-(f) correspond to 20, 30, 40, 50, 100 and 150 topics.]

Table 2 gives the averaged testing MSE from 10-fold cross validation for each model under each setting of the number of topics.

Table 2: Averaged MSE from 10-fold cross validation
           blr     nlr     beg     feg     seg
  n=20     0.3899  0.3899  0.3952  0.3934  0.3975
  n=30     0.3670  0.3670  0.3702  0.3691  0.3706
  n=40     0.3618  0.3619  0.3661  0.3646  0.3652
  n=50     0.3506  0.3507  0.3539  0.3529  0.3533
  n=100    0.3247  0.3253  0.3271  0.3277  0.3266
  n=150    0.3105  0.3115  0.3129  0.3147  0.3124

As can be seen from the figure and the table, the performances of the different models are very close. blr, which is based on OLS, always gives the minimum MSE, since it has no constraints; nlr comes second. Among the three models based on EG, when the number of topics n is 100 or 150, seg gives the best performance; when n = 50 or n = 30, feg gives the best performance; when n = 20, beg beats the other two. The performance of the seg model, compared with the best model blr, drops by 1.95%, 0.99%, 0.94%, 0.78%, 0.58% and 0.62% for n = 20, 30, 40, 50, 100 and 150, respectively. Although the seg model cannot beat the blr model in terms of mean squared error, the decrease in performance is small, and seg provides good interpretability, as discussed before.

5.4 Discussion

We give some of our observations in this subsection. We counted the number of click pairs in which the largest elements of the topical vectors of the two documents belong to different topics: of the 97,28 click pairs, 62,646 have their largest elements on different topics.

Based on our observation of the learned weight matrices, the largest element in each row and column is on the diagonal, meaning that the transition probability of user intent from one particular topic to itself is always the largest compared to the transition probabilities to other topics. The values of the diagonal elements range from 14.24% to 69.11%, while initially all the weights were set to 5%. Besides the diagonal elements, we found the 10 largest off-diagonal probabilities. Table 3 gives these transition probabilities and their corresponding topic transitions.^3 The number of topics is set to 20.

Table 3: Top 10 Largest Off-Diagonal Transition Probabilities
  Transition   Probability
  19 -> 13     16.84%
  12 -> 3      11.05%
  16 -> 15     11.01%
  5  -> 2      10.00%
  8  -> 2       9.71%
  5  -> 12      9.04%
  12 -> 17      8.93%
  11 -> 2       8.83%
  13 -> 19      8.33%
  15 -> 16      8.15%

3 Due to space limitations, the list of the top representative terms of each topic is not presented here but is available in the full version of this paper.
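Reading a table like Table 3 off a learned matrix amounts to masking the diagonal and taking the largest remaining entries; a small sketch, assuming a learned k x k numpy matrix P with rows as source topics:

    import numpy as np

    def top_offdiagonal(P, n=10):
        """Return the n largest off-diagonal transition probabilities
        as (from_topic, to_topic, probability) triples."""
        Q = P.copy()
        np.fill_diagonal(Q, -np.inf)                # ignore self-transitions
        flat = np.argsort(Q, axis=None)[::-1][:n]   # indices of largest entries
        rows, cols = np.unravel_index(flat, P.shape)
        return [(i, j, P[i, j]) for i, j in zip(rows, cols)]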

5.5 Interpretation of Intent Transitions

In this section we provide some interpretations of the knowledge discovered in our experiments. The learned weights are those obtained by the seg model with the number of topics set to 20.

The first transition in the top-10 list is from Topic 19 to Topic 13. Topic 19 is on problems and applications of graphs and trees; its representative terms include: graph(s), tree(s), degree, vertex(-ices), point(s), distance, etc. Topic 13 is on optimization techniques, algorithms, computational costs and efficiency; its representative terms include: algorithm(s), optimization/optimal, constraints, computational, cost, efficient, etc. This transition shows that people who are interested in problems and applications of graphs and trees may switch to the techniques and methods used to solve these problems. It is a transition from problems to solutions.

The second transition is from Topic 12 to Topic 3. Topic 12 is on programming and the design of languages; its representative terms include: software, programming/program, language, design, object, oriented, java, etc. Topic 3 is on issues involved in designing a language; its representative terms include: type(s), logic, semantics, rules, terms, calculus, reasoning, etc. It shows that people's interest in programming languages may shift to more fundamental issues, such as typed versus untyped languages, rules for induction, and checking the semantics of the language.

The third transition is from Topic 16 to Topic 15. Again, it shows that people who are interested in certain problems will also be interested in the techniques to solve them. The representative terms for Topic 16 include: energy, field, flow, high, phase, mass, temperature, density, surface, etc. The representative terms for Topic 15 include: linear, space(s), finite, matrix, equation(s), numerical, differential, etc.

One interesting thing to notice is that for the 9th and 10th transitions (13 -> 19 and 15 -> 16), the reversed transitions are the first and third in the list, respectively. This shows the strong relatedness of these two pairs of topics: the transition goes not only in one direction, but also the other way around.

The blr model produced 8 negative elements, and thus the nlr model produced 8 zero elements, which makes its results harder to interpret.

6 CONCLUSIONS AND FUTURE WORK

In this paper, we studied the problem of finding user intent transitions among different topics. We used the topical vectors of documents generated by a trained LDA model to represent users' intents. Given the pairs of documents visited successively by users and their topical vectors, we proposed to model our problem as multiple-output linear regression, in which the transition probabilities between all pairs of topics form the weight matrix. We then proposed a new algorithm based on the exponentiated gradient to efficiently solve this regression problem. Our proposed method satisfies the constraints we set forth in the regression problem. It gives good performance in terms of the sum-of-squares error compared with the ordinary least squares technique, and it provides good interpretability compared with the other regression models. The effectiveness of our method was demonstrated in the experiments we conducted.

In the future, we would like to extend our work by considering longer click chains instead of click pairs. Besides, we can incorporate personal information so that the learned transition probability matrix is optimized for individuals or groups to satisfy their information needs. Finally, we would like to consider temporal information, since new topics may emerge over time and people's interest in a topic may change over time. We will continue to report our progress in future work.

7 REFERENCES
[1] P. Anick. Using terminological feedback for web search refinement: a log-based study. In SIGIR, 2003.
[2] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2007.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 2003.
[4] L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies in the world-wide web. Comput. Netw. ISDN Syst., pages 1065-1073, 1995.
[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 2004.
[6] D. He, A. Göker, and D. J. Harper. Combining evidence for automatic web session identification. Inf. Process. Manage., 2002.
[7] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 1970.
[8] B. J. Jansen, A. Spink, C. Blakely, and S. Koshman. Defining a session on web search engines. J. Am. Soc. Inf. Sci. Technol., 2007.
[9] R. Jones and K. L. Klinkner. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In CIKM, 2008.
[10] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
[11] A. L. Montgomery and C. Faloutsos. Identifying web browsing trends and patterns. Computer, 2001.
[12] S. Ozmutlu, H. C. Ozmutlu, and A. Spink. Multitasking web searching and implications for design. Proceedings of the Annual Meeting of the American Society for Information Science and Technology, 2003.
[13] F. Radlinski and T. Joachims. Query chains: learning to rank from implicit feedback. In KDD, 2005.
[14] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, pages 6-12, 1999.
[15] A. Spink, H. C. Ozmutlu, and S. Ozmutlu. Multitasking information seeking and searching processes. J. Am. Soc. Inf. Sci. Technol., 2002.
[16] A. Spink, M. Park, B. J. Jansen, and J. Pedersen. Multitasking during web search sessions. Inf. Process. Manage., 2006.
[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 1994.