Online Limited-Memory BFGS for Click-Through Rate Prediction


Mitchell Stern, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Aryan Mokhtari, Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA

Alejandro Ribeiro, Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract

We study the problem of click-through rate (CTR) prediction, where the goal is to predict the probability that a user will click on a search advertisement given information about the issued query and the user's account. In this paper, we formulate a model for CTR prediction using logistic regression, then assess the performance of stochastic gradient descent (SGD) and online limited-memory BFGS (oLBFGS) for training the corresponding classifier. We demonstrate empirically that oLBFGS provides faster convergence and requires fewer training examples than SGD to achieve comparable performance, confirming the benefits of using second-order information in stochastic optimization.

Keywords: click-through rate prediction, large-scale optimization, logistic regression, stochastic optimization, quasi-Newton methods

1. Introduction

Nearly all major search engines rely on advertisements as a primary source of revenue, so the problem of building an effective system for advertisement click-through rate (CTR) prediction has garnered widespread interest in recent years. Briefly, the goal of CTR prediction is to build a model that captures the behavior of users when presented with a small set of advertisements; armed with such a model, companies can propose the ads with the highest probability of being clicked, thereby better serving their advertiser clients. Along with related tasks, CTR prediction belongs to a growing class of problems that have motivated the development of algorithms able to efficiently handle large data sets, often containing hundreds of millions of training instances with tens of millions of features.

In the literature, the CTR prediction problem has been approached in a number of ways. Regelson and Fain hypothesize that different terms within a search have different CTR probabilities Regelson and Fain (2006); accordingly, they incorporate a combination of historical rates for individual terms as well as aggregate rates for clusters of keywords into their model. By using a back-off approach, they are able to predict the CTR probabilities both of terms which have sufficient historical data and of terms which are completely novel to their system. Graepel et al. instead use a graphical model for the problem Graepel et al. (2010), using probit regression to map combinations of discrete and real-valued features derived from advertisement, query, and user data to CTR probabilities.

Another popular model choice for the CTR prediction problem (and indeed other problems in machine learning with heterogeneous feature sets) is logistic regression, which easily accommodates a wide range of features and is amenable to methods developed in the field of convex optimization Mitchell (1997); Hosmer and Lemeshow (2004); Kleinbaum and Klein (2010). First-order methods such as gradient descent have seen great success on smaller-scale problems, but become prohibitively expensive when the number of features and training examples becomes large. Stochastic (online) versions of gradient descent resolve this issue by approximating the gradient of the objective function using a small subsample of the data set Zhang (2004); Shalev-Shwartz et al. (2007); Nemirovski et al. (2009); Bottou (2010); LeRoux et al. (2012); Konecny and Richtarik (2013); Schmidt et al. (2013); Zhang et al. (2013). Such algorithms have been applied to the problem of training a logistic regression classifier for CTR prediction, reducing the cost of individual updates McMahan et al. (2013). Chakrabarti et al. instead develop and implement an approximate logistic regression model, and use parallelized iterative scaling for training Chakrabarti et al. (2008). Although this method suffers from slightly slower convergence than comparable approaches, it provides benefits in the form of reduced iteration cost by updating only one component at a time.

An issue common to each of these latter approaches, however, is that the algorithms they use for training are either offline, making individual iterations slow, or do not incorporate second-order information, making convergence slow. In a deterministic setting, Newton's method can be used to accelerate the convergence of first-order methods. One may consider the use of stochastic Newton to accelerate the convergence of SGD Birge et al. (1995), but computing an unbiased estimate of the objective Hessian and its inverse is costly in large-scale problems. Quasi-Newton methods, which incorporate second-order information by approximating the curvature of the objective function using two consecutive gradients, can also be used to improve the convergence of gradient descent algorithms Powell (1971); Dennis and Moré (1974); Byrd et al. (1987); Nocedal and Wright (1999). The most successful quasi-Newton methods are BFGS, which has super-linear convergence Byrd et al. (1987), and limited-memory BFGS (LBFGS), which exhibits the improved convergence of BFGS while requiring less memory to implement Liu and Nocedal (1989). The use of LBFGS to train a logistic regression model for CTR prediction shows a reduction in convergence times relative to first-order methods Richardson et al. (2007). However, both BFGS and LBFGS require computation of the full gradient, which is not suitable at larger scales.

Although each of these approaches offers significant improvements over standard gradient descent, none of them combines the advantages in cost per iteration obtained by processing inputs online with the advantages in convergence obtained by incorporating curvature estimates of the objective function. Stochastic (online) quasi-Newton methods arise as a natural alternative that captures second-order information by approximating the Hessian of the objective to accelerate convergence, while using stochastic gradients in lieu of full gradients to more easily accommodate larger problems Schraudolph et al. (2007); Mokhtari and Ribeiro (2014a,b).
We therefore propose the use of online limited-memory BFGS (oLBFGS) to train a logistic regression model for CTR prediction, as this method possesses both of these beneficial qualities. We apply this approach to the click-through rate prediction problem on the Soso data set, which consists of several hundred million search logs and their associated advertisement impressions and clicks. We study the advantages of oLBFGS relative to the first-order online algorithm SGD and show that oLBFGS significantly decreases the number of sample points required to train a logistic model.

Our paper is organized as follows. We first give an overview of our target data set and the process of generating feature vectors (Section 2). We then provide a mathematical formulation of the CTR problem and show that it can be cast in the framework of logistic regression. Further, we explain the SGD and oLBFGS algorithms and discuss their respective implementations (Section 3). We proceed by describing the setup and results of our numerical experiments. On a portion of a data set containing 240 million instances, each with 56 million features, oLBFGS reaches the same negative log-likelihood value as SGD in an order of magnitude less time. Moreover, the test set error rate achieved by oLBFGS after one pass over the subsampled training set is 26%, a substantial improvement over the error rate of 38% achieved by SGD (Section 4).

2. Data Set

We use the Soso search engine data set from Tencent Sun (2014), which contains information on several hundred million searches. As provided, each data point can roughly be divided into two groups of features, one of which concerns the user, and the other of which concerns the proposed advertisement.

In the former group, numerical identifiers for the user and the query are provided. The data set contains 22 million unique users, whose IDs can be referenced in an external file to obtain information from the user's profile: gender is given as male, female, or unknown, and age is given as one of the six intervals (0, 12], (12, 18], (18, 24], (24, 30], (30, 40], or (40, ∞). Similarly, the query ID can be located in a separate file to obtain the list of tokens in the query, which are given as hash values for purposes of anonymity.

Offering slightly more information, the advertisement feature group includes both standalone features and references to other data files. Fields which are directly available include a unique identifier for the ad, a unique identifier for the advertiser, a hashed version of the URL linked to by the ad, the total number of ads displayed on the results page ("depth"), and the relative position of the ad within the list of ads. Three numeric keys are provided as well, which can be looked up in additional data files to obtain the tokens that comprise the ad's keywords, title, and main content ("description"). As with the query, the tokens in these files have been hashed to preserve anonymity.

2.1 Feature Vector Generation

The raw features described above cannot be incorporated directly into standard statistical models. Therefore, we first pre-process the raw data to obtain usable numerical features. To begin, suppose that a field x may take on one of n categorical values {v_1, v_2, ..., v_n}. For a particular instance of the problem, we can encode the value v_i assumed by this field as a list of n binary features f = [f_1, ..., f_n] by setting f_j = 1 if i = j and 0 otherwise. In words, the vector f acts as an n-dimensional indicator vector for x. This technique can be applied to each of the categorical raw features, namely the user's gender and age, the hashed advertisement URL, the depth and position of the ad, and each of the numerical IDs. Interpreting the depth and position of the ad as a single unit, this pair can be encoded as a list of binary features as well. As an example, a user's gender can be specified as male, female, or unknown, meaning there are three different possibilities for this field. To encode this information we would assign three binary components of the feature vector to gender, encoding a male user as [1, 0, 0], a female user as [0, 1, 0], and a user of unknown gender as [0, 0, 1].

The tokens which comprise the user's query and the ad's keywords, title, and description can be transformed into lists of binary features using a similar technique. Let {t_1, ..., t_v} denote the union of tokens that occur across all user queries, where v denotes the size of the vocabulary. We can encode the contents of a specific query as a list of v binary features f = [f_1, ..., f_v] by setting f_j = 1 if token t_j occurs at least once in the query and 0 otherwise. Binary features for the ad's keywords, title, and description can be generated in an analogous manner.
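To make the encoding concrete, the following minimal Python sketch (our illustration, not code from the paper) builds the indicator and token-presence features described above; the category lists and vocabulary are placeholder assumptions.

```python
import numpy as np

def one_hot(value, categories):
    """Encode a categorical value as an indicator vector over `categories`."""
    f = np.zeros(len(categories))
    f[categories.index(value)] = 1.0
    return f

def token_indicators(tokens, vocabulary):
    """Binary vector marking which vocabulary tokens occur at least once."""
    index = {t: j for j, t in enumerate(vocabulary)}
    f = np.zeros(len(vocabulary))
    for t in set(tokens):
        if t in index:
            f[index[t]] = 1.0
    return f

# Gender over three categories, and query tokens over a tiny vocabulary.
print(one_hot("female", ["male", "female", "unknown"]))               # [0. 1. 0.]
print(token_indicators(["cheap", "phone"],
                       ["cheap", "electronics", "laptop", "phone"]))  # [1. 0. 0. 1.]
```

In practice these vectors would be stored in a sparse format, since only a handful of components are non-zero per instance.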
Importantly, each of the resulting vectors should be kept separate so as not to conflate the presence of a token in one textual feature with the presence of the same token in another. As a simple example, suppose that the user's query is "cheap phone" and that the full vocabulary across all users' queries is {cheap, electronics, laptop, phone}. We would then encode this particular query as the 4-dimensional vector [1, 0, 0, 1]. In practice, our data set of interest has a query vocabulary size of 1 million words, and distinct ad keyword, title, and description vocabularies of around 100,000 words each.

Once the binary vectors have been generated for each of the token lists for a particular instance, real-valued similarities between each pair of lists can additionally be computed. Letting S and T denote two sets of tokens that have been derived from token lists by collapsing repeated occurrences, we define the cosine similarity of the corresponding messages as

CosineSimilarity(S, T) = \frac{|S \cap T|}{\sqrt{|S|}\,\sqrt{|T|}}.  (1)

This value ranges between 0 and 1, with 0 indicating no overlap, and 1 indicating that the messages contain exactly the same set of tokens. Note that the ordering of the tokens within each message is not taken into account in this metric.
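A small sketch of the set-based cosine similarity in (1), again illustrative rather than taken from the paper:

```python
import math

def cosine_similarity(s, t):
    """Cosine similarity between two token sets, as in (1)."""
    s, t = set(s), set(t)
    if not s or not t:
        return 0.0
    return len(s & t) / (math.sqrt(len(s)) * math.sqrt(len(t)))

# The worked example from the text: "cheap phone" vs. "cheap laptop".
print(cosine_similarity({"cheap", "phone"}, {"cheap", "laptop"}))  # 0.5
```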

Table 1: Derived features for click-through rate prediction. For each type of feature, the table reports the total number of components occupied in the feature vector by that type, along with the maximum and mean number of non-zero components across all instances. The feature types are: gender; age; ad URL; depth; position; the depth-and-position pair; advertisement ID; advertiser ID; query ID; keyword ID; title ID; description ID; user ID; query tokens; ad keyword tokens; ad title tokens; ad description tokens; cosine similarities; numerical depth and position; and the number of tokens in the query and in the ad's keywords, title, and description. In total, the feature vectors contain roughly 56 million components.

As an extension of the example from above, the cosine similarity between the queries "cheap phone" and "cheap laptop" would be computed as |{cheap, phone} ∩ {cheap, laptop}| / (\sqrt{|{cheap, phone}|} \sqrt{|{cheap, laptop}|}) = 1/(\sqrt{2} \cdot \sqrt{2}) = 1/2. Since each instance contains 4 sets of tokens, namely the user query and the advertisement keywords, title, and description, we compute a total of \binom{4}{2} = 6 real-valued similarity features.

An additional group of simple numerical features can also be generated for each instance. The depth and position of the ad can be used directly as numerical values. Moreover, the number of tokens in the user's query and each of the advertisement's three textual fields can be included as numerical features as well. For example, if the depth, which can take on values in the set {1, 2, 3}, were equal to 2 in a particular training instance, then it would be encoded both in binary by the vector [0, 1, 0] and numerically as the number 2. Therefore, we use 4 components of each feature vector to encode the depth of the ad.

Lastly, for each of the raw categorical features, we can compute the average click-through rate of the training instances for which a particular variable takes on a particular value. Storing these rates in a dictionary, the average click-through rate associated with the value of each categorical variable in a new instance can be efficiently determined and included as an additional feature.

The set of features derived from the raw data is summarized in Table 1. By construction, some of the categorical and numerical feature subvectors contain exactly one non-zero component per instance, and depending on the number of possible values, range in sparsity from fully dense to nearly completely sparse.
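As a small illustration of the historical click-through-rate features just described, the sketch below tabulates per-value rates in a dictionary and looks them up for new instances. The helper name and the fallback to a global rate for unseen values are our own assumptions; the paper only specifies that the rates are stored in a dictionary.

```python
from collections import defaultdict

def average_ctr_table(instances):
    """Map each observed value of a categorical field to its historical CTR.

    `instances` is assumed to be an iterable of (field_value, clicked) pairs,
    where clicked is 1 for a click and 0 otherwise.
    """
    clicks, shows = defaultdict(float), defaultdict(int)
    for value, clicked in instances:
        clicks[value] += clicked
        shows[value] += 1
    return {v: clicks[v] / shows[v] for v in shows}

# Look up the feature for a new instance; fall back to a global rate for unseen values.
table = average_ctr_table([("ad_7", 1), ("ad_7", 0), ("ad_9", 0)])
global_rate = 1.0 / 3.0
print(table.get("ad_7", global_rate))   # 0.5
print(table.get("ad_42", global_rate))  # unseen value -> global rate
```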

Other types of features, such as those generated from tokens, exhibit varying numbers of non-zero components across the data set. There are approximately 56 million features in all. The critical observation is that the feature vectors are very sparse: on average, only 59.8 components are non-zero. This observation implies that the computation of SGD and oLBFGS steps can be executed quickly by exploiting the sparsity of the feature vectors.

3. Problem Formulation and Algorithms

In this section, we discuss the statistical model that characterizes our mathematical formulation of the click-through rate prediction problem, and give an overview of the stochastic optimization techniques used to learn this model in our experiments.

3.1 Logistic Regression

We use logistic regression as our model for CTR prediction. Specifically, let x ∈ R^p be a feature vector, w ∈ R^p be a weight vector dictating how features are to be combined, and y ∈ {-1, 1} be an indicator variable that takes on the value -1 when the advertisement is not clicked by the user and 1 when the advertisement is clicked. We assume the following functional form for the click-through rate, which by definition is the probability that y = 1:

CTR(x; w) := P[y = 1 | x; w] = \frac{1}{1 + \exp(-w^T x)}.  (2)

Subsequently, the probability that a sample point x has label y = -1 is given by 1 - P[y = 1 | x; w] = 1/(1 + exp(w^T x)). Given a training set S = {(x_i, y_i)}_{i=1}^n consisting of n pairs of feature vectors x_i and their associated labels y_i, our goal is then to learn the optimal classifier w* as the maximum likelihood estimate of w according to the model in (2) and the training data S.

To introduce the optimization problem that corresponds to finding the optimal classifier, let S_- = {(x_i, y_i) : y_i = -1} be the set of negative training examples and S_+ = {(x_i, y_i) : y_i = 1} be the set of positive training examples. The CTR prediction problem can then be formalized as

\max_w \; \frac{1}{n} \left[ \sum_{x_i \in S_-} \frac{1}{1 + \exp(w^T x_i)} + \sum_{x_i \in S_+} \frac{1}{1 + \exp(-w^T x_i)} \right].  (3)

Notice that the arguments of the exponentials in each sum can be written as -y_i w^T x_i. Therefore, the two sums over sample points with labels y_i = -1 and y_i = 1 can be combined into a single sum over all sample points if we use -y_i w^T x_i as the exponential argument. Moreover, observe that maximizing the likelihood probability in (3) is equivalent to minimizing the negative log-likelihood of the objective in (3). Hence, the optimal classifier can be found by minimizing the regularized negative logistic log-likelihood function

\frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i w^T x_i)\right),  (4)

where the l_2-norm regularization term (λ/2)||w||^2 has been added to prevent overfitting. Hence, define the regularized log-likelihood term for the ith training example as

f_i(w) = \frac{\lambda}{2} \|w\|^2 + \log\left(1 + \exp(-y_i w^T x_i)\right).  (5)

With the functions f_i(w) defined as in (5), we can write the log-likelihood minimization in (4) as

w^* = \operatorname*{argmin}_w \; \frac{1}{n} \sum_{i=1}^{n} f_i(w) = \operatorname*{argmin}_w \; F(w),  (6)

where F(w) = (1/n) \sum_{i=1}^n f_i(w) is an abbreviation for the full objective function. In the rest of the paper we let f_i(w) denote the instantaneous function that corresponds to sample point i, and F(w) denote the aggregate function that captures the error of classifier w in classifying all the sample points in the data set.
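As a sanity check on the objective in (4)-(6), the following sketch (our illustration, using dense toy data rather than the sparse CTR features) evaluates the instantaneous loss f_i(w) and the aggregate F(w).

```python
import numpy as np

def instantaneous_loss(w, x, y, lam):
    """Regularized logistic loss f_i(w) from (5); y is +1 or -1."""
    return 0.5 * lam * np.dot(w, w) + np.log1p(np.exp(-y * np.dot(w, x)))

def objective(w, X, Y, lam):
    """Aggregate objective F(w) = (1/n) sum_i f_i(w) from (6)."""
    return np.mean([instantaneous_loss(w, x, y, lam) for x, y in zip(X, Y)])

# Toy check on dense random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = rng.choice([-1.0, 1.0], size=100)
w = np.zeros(5)
print(objective(w, X, Y, lam=1e-7))  # log(2) ~ 0.693 at w = 0
```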

3.2 Stochastic Gradient Descent

The instantaneous functions f_i(w) are strongly convex with parameter λ, which implies the strong convexity of the aggregate function F(w). In light of this observation, the optimal parameter w* can be found using stochastic gradient descent. Let us introduce a time index t = 0, 1, ..., step sizes ε_t, and random subsets I_t of the n training indices {1, ..., n}, each of which satisfies |I_t| = L for a fixed batch size L ≪ n. The stochastic loss function at time t is then F̂_t(w) = (1/L) Σ_{i ∈ I_t} f_i(w). Though this is an approximation to the true loss function F(w), choosing L ≪ n when n is large can put previously intractable calculations within the realm of modern computation. We now define the stochastic gradient over the subsample of training instances at time t as

\hat{s}(w_t; I_t) := \nabla_w \hat{F}_t(w_t) = \lambda w_t - \frac{1}{L} \sum_{i \in I_t} \frac{y_{t,i}\, x_{t,i}}{1 + \exp(y_{t,i} w_t^T x_{t,i})}.  (7)

The iterates of stochastic gradient descent can then be written succinctly as

w_{t+1} = w_t - \epsilon_t \hat{s}(w_t; I_t),  (8)

where ε_t is the positive step size at time t. Note that the stochastic gradient ŝ(w; I_t) = ∇_w F̂_t(w) is an unbiased estimator of the exact gradient s(w) = ∇_w F(w), since E[∇_w F̂_t(w)] = ∇_w F(w). This observation implies that the stochastic gradient direction is on average a descent direction. Note also that when the sample points (feature vectors) x_i are sparse, the product w_t^T x_{t,i} in (7) is not costly and the stochastic gradient ŝ(w_t; I_t) can be computed in a reasonable amount of time.

3.3 Online Limited-Memory BFGS

Though widely used, the number of iterations required by SGD for convergence becomes prohibitively large for high-dimensional problems. This motivates the use of the stochastic quasi-Newton method known as online limited-memory BFGS (oLBFGS), which incorporates second-order information to realize a significant improvement in convergence speed. In general, all second-order methods are characterized by the introduction of a positive definite matrix B̂_t ≻ 0 into the iteration

w_{t+1} = w_t - \epsilon_t \hat{B}_t^{-1} \hat{s}_t(w_t; I_t) = w_t - \epsilon_t d_t.  (9)

Setting B̂_t equal to the Hessian of the objective function ∇²F(w_t) gives Newton's algorithm. In quasi-Newton methods, B̂_t is instead selected to be an approximation of the Hessian that is less expensive to compute. The popular BFGS method uses gradient evaluations of the objective function to compute low-rank updates to the initial Hessian approximation Nocedal and Wright (1999). An online variant, oBFGS, generalizes this approach by using stochastic gradients in place of deterministic gradients Schraudolph et al. (2007). The limited-memory variant of BFGS stores the approximate Hessian matrix implicitly through a fixed number of representative vectors rather than as a dense matrix, thereby improving performance in exchange for a slight decrease in accuracy Liu and Nocedal (1989). Combining these attributes, online limited-memory BFGS (oLBFGS) both stores the Hessian estimate implicitly through a collection of vectors and makes updates based on stochastic gradients rather than deterministic gradients Schraudolph et al. (2007). In addition to superior performance, this final method also has strong theoretical guarantees, exhibiting global convergence to optimal arguments with probability 1 under mild conditions on the Hessian eigenvalues of the stochastic objective functions Mokhtari and Ribeiro (2014b). A brief description of its implementation is given below.
Let v_t := w_{t+1} - w_t be the variable variation at time t and r̂_t := ŝ_t(w_{t+1}; I_t) - ŝ_t(w_t; I_t) be the stochastic gradient variation at time t. In oLBFGS, a history of the past τ curvature information pairs (v_{t-τ}, r̂_{t-τ}), ..., (v_{t-1}, r̂_{t-1}) is maintained. With these, an approximation to the descent direction d_t = B̂_t^{-1} ŝ_t(w_t; I_t) can be computed efficiently using Algorithm 1, where the initial Hessian inverse approximation B̂_{t,0}^{-1} is set to the identity matrix I if t = 0, and to ((v_{t-1}^T r̂_{t-1})/(r̂_{t-1}^T r̂_{t-1})) I in subsequent iterations.
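For concreteness, here is a minimal Python sketch of this two-loop computation (our illustration of Algorithm 1 below, not code from the paper); the curvature pairs are assumed to be stored oldest first, and `gamma` plays the role of the scalar in the initial Hessian inverse approximation.

```python
import numpy as np

def olbfgs_direction(grad, pairs, gamma=1.0):
    """Two-loop recursion of Algorithm 1: returns d = B^{-1} * grad.

    `pairs` holds up to tau tuples (v, r) of variable and stochastic-gradient
    variations, oldest first; the initial inverse Hessian approximation is gamma * I.
    """
    p = grad.copy()
    rhos, alphas = [], []
    for v, r in reversed(pairs):           # newest pair first
        rho = 1.0 / np.dot(v, r)
        alpha = rho * np.dot(v, p)
        p -= alpha * r
        rhos.append(rho)
        alphas.append(alpha)
    q = gamma * p                          # apply the initial approximation gamma * I
    for (v, r), rho, alpha in zip(pairs, reversed(rhos), reversed(alphas)):
        beta = rho * np.dot(r, q)
        q += (alpha - beta) * v
    return q

# One quasi-Newton step as in (9), assuming `w`, `eps`, a stochastic gradient `g`,
# and the pair history `pairs` are maintained elsewhere:
#   gamma = 1.0 if not pairs else np.dot(pairs[-1][0], pairs[-1][1]) / np.dot(pairs[-1][1], pairs[-1][1])
#   w = w - eps * olbfgs_direction(g, pairs, gamma)
```

With an empty history and gamma = 1, the step reduces to the SGD iterate in (8).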

Algorithm 1: Computation of the oLBFGS descent direction d_t = B̂_t^{-1} ŝ_t(w_t; I_t).

Require: The stochastic gradient ŝ_t(w_t; I_t), the initial Hessian inverse approximation B̂_{t,0}^{-1}, and the last τ curvature information pairs {v_{t-u}, r̂_{t-u}}_{u=1}^{τ}.
1: Let p_0 ← ŝ_t(w_t; I_t) be the stochastic gradient at time t.
2: for u = 0, 1, ..., τ - 1 do
3:   Compute scalar ρ̂_{t-u-1} ← 1/(v_{t-u-1}^T r̂_{t-u-1}).
4:   Compute and store scalar α_u ← ρ̂_{t-u-1} v_{t-u-1}^T p_u.
5:   Update vector p_{u+1} ← p_u - α_u r̂_{t-u-1}.
6: end for
7: Let q_0 ← B̂_{t,0}^{-1} p_τ.
8: for u = 0, 1, ..., τ - 1 do
9:   Compute scalar β_u ← ρ̂_{t-τ+u} r̂_{t-τ+u}^T q_u.
10:  Update vector q_{u+1} ← q_u + (α_{τ-u-1} - β_u) v_{t-τ+u}.
11: end for
12: return d_t = q_τ

Note that this implementation of the computation of the oLBFGS descent direction d_t has complexity of order O(τp), which is not expensive relative to the computational complexity of the SGD descent direction of order O(p).

4. Numerical Experiments

For our numerical experiments, we use the data set provided for Track 2 of the 2012 KDD Cup, wherein the training and test instances are derived from session logs of the Soso search engine. Each data point in the training set summarizes the behavior of a specific user when shown a particular ad after issuing a particular query, including both the total number of impressions under these conditions as well as the number of times the user clicked on the ad. As a first pre-processing step, we expand each of these summaries into individual impressions, (# of clicks) of which are given the label +1, and the remainder of which are given the label -1. After this expansion process, there are approximately 240 million instances. We randomly select 1/10 of these sample points to create the training and test sets. Out of these samples, we create a training set by picking 9/10 of the samples, and we use the remaining instances to create the test set. In our training set, there were roughly 27 negative instances for every positive instance. To account for this discrepancy, we replicate each positive instance 26 times to balance the ratio. Without this adjustment, a classifier could obtain a near-optimal objective value simply by assigning the label -1 to every instance, which is clearly undesirable behavior. This adjustment approximately doubles the size of the training set, giving roughly 40 million effective samples. We then generate feature vectors for our training instances as described in Section 2, and implement SGD and oLBFGS as described in Section 3.

4.1 Parameter Selection

The initial weight vector w_0 is chosen by generating each component of this vector at random from the normal distribution N(0, 1) with zero mean and unit variance. We then set the regularization parameter to λ = 10^{-7} in each of our experiments so that the regularization and log-likelihood terms are of the same order of magnitude. To choose the method-specific parameters, we run SGD and oLBFGS on the training set for 1 hour each under a variety of conditions, and select the parameter settings which lead to the best performance. The step sizes for both algorithms are selected according to the formula ε_t = ε_0 T_0/(T_0 + t), with the decay factor T_0 being fixed in each group of experiments. The history size for oLBFGS is fixed at τ = 5 due to memory constraints.
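The decaying step-size schedule and the grid search over (T_0, L, ε_0) described above can be sketched as follows; `train_for_one_hour` is a hypothetical routine standing in for a fixed-budget training run, and the grids shown in Section 4.1 would be passed in as arguments.

```python
import itertools

def step_size(t, eps0, T0):
    """Decaying step size eps_t = eps0 * T0 / (T0 + t)."""
    return eps0 * T0 / (T0 + t)

def select_parameters(train_for_one_hour, T0_grid, L_grid, eps0_grid):
    """Run each configuration for a fixed budget and keep the one with the
    lowest final objective value F(w)."""
    best, best_value = None, float("inf")
    for T0, L, eps0 in itertools.product(T0_grid, L_grid, eps0_grid):
        value = train_for_one_hour(T0=T0, L=L, eps0=eps0)
        if value < best_value:
            best, best_value = (T0, L, eps0), value
    return best
```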

Figure 1: Negative log-likelihood F(w) versus run time (seconds) for oLBFGS after one hour of training under its four best parameter settings, (T_0, L, ε_0) = (10^2, 10^4, 0.5), (10^3, 10^4, 0.5), (10^4, 10^4, 0.5), and (10^5, 10^4, 0.1). The accuracy of these top four parameter settings is close, with (T_0, L, ε_0) = (10^2, 10^4, 0.5) holding a marginal advantage over the other three.

Figure 2: Negative log-likelihood F(w) versus run time (seconds) for SGD after one hour of training under its four best parameter settings, (T_0, L, ε_0) = (10^3, 10^2, 0.025), (10^4, 10^3, 0.075), (10^5, 10^2, 0.075), and (10^6, 10^3, 0.075). The best performance among these top four parameter settings belongs to (T_0, L, ε_0) = (10^4, 10^3, 0.075).

To be more precise, for the oLBFGS method, all combinations of parameter settings with step size parameters T_0 ∈ {10^2, 10^3, 10^4, 10^5}, mini-batch sizes L ∈ {10^2, 10^3, 10^4, 10^5}, and a grid of initial step sizes ε_0 including 10^{-2} and 10^{-1} are considered. For each decay factor T_0, the best choice of mini-batch size L and step size parameter ε_0 is determined to be (T_0, L, ε_0) = (10^2, 10^4, 0.5), (10^3, 10^4, 0.5), (10^4, 10^4, 0.5), and (10^5, 10^4, 0.1). To obtain the best choice of parameters we compare these four settings as shown in Fig. 1. The best performance among the different combinations of parameters for oLBFGS after one hour of training is achieved by (T_0, L, ε_0) = (10^2, 10^4, 0.5).

To find the best combination of parameters for SGD, we repeat the same process by running an analogous set of experiments with parameters T_0 ∈ {10^3, 10^4, 10^5, 10^6}, L ∈ {10^2, 10^3, 10^4, 10^5}, and a grid of initial step sizes ε_0 including 10^{-2} and 10^{-1}. A comparison of different combinations for fixed values of T_0 shows that the best settings for SGD are (T_0, L, ε_0) = (10^3, 10^2, 0.025), (10^4, 10^3, 0.075), (10^5, 10^2, 0.075), and (10^6, 10^3, 0.075). The performance of these parameters is illustrated in Fig. 2. A comparison of these top four combinations of parameters (T_0, L, ε_0) shows that the best performance for SGD after one hour of training is achieved by the parameters (T_0, L, ε_0) = (10^4, 10^3, 0.075).

Observe that all of the top four parameter settings for oLBFGS achieve an objective function value of about F(w_t) = 4.05 after 2400 seconds of training, while the best choice of parameters for SGD, (T_0, L, ε_0) = (10^4, 10^3, 0.075), achieves an objective function value of F(w_t) = 4.42 after running for the same amount of time. Note that each choice of parameters for oLBFGS reaches this best final SGD objective function value of F(w_t) = 4.42 in just 1150 seconds. In the following section, we use the optimal choices of parameters for oLBFGS and SGD to train the classifier for a longer period of time, with the expectation that we will observe an even larger gap between the performance of these algorithms.

4.2 Full Experiments

Recall that the optimal combinations of oLBFGS and SGD parameters after training for 2400 seconds are (T_0, L, ε_0) = (10^2, 10^4, 0.5) and (T_0, L, ε_0) = (10^4, 10^3, 0.075), respectively. With these parameter settings, each algorithm is run until 40 million feature vectors have been processed, corresponding to roughly one pass over our adjusted training set. We then perform one additional pass over the training set with the best pairing of algorithm and parameters to obtain an approximation of the optimal classifier w*. Since we use different values of L for SGD and oLBFGS, we plot the relative distance to optimality F(w_t) - F(w*) versus Lt, where the latter quantity is the number of processed feature vectors after iteration t. Figure 3 illustrates the objective function value relative to the approximate optimal value, F(w_t) - F(w*), versus the number of feature vectors processed for oLBFGS and SGD under their best respective parameter settings.
After processing Lt = 4 × 10^7 feature vectors (requiring t = 4 × 10^3 iterations), the relative objective function value for oLBFGS is F(w_t) - F(w*) = 0.11, while SGD remains noticeably farther from the optimal objective function value after processing the same Lt = 4 × 10^7 feature vectors (requiring t = 4 × 10^4 iterations). Notice that after processing just Lt = 2.5 × 10^6 feature vectors, or about 6% of the training set, oLBFGS achieves the same relative objective function value that SGD does after processing the entire data set. The results in Figure 3 also allow us to compare the performance of oLBFGS and SGD in terms of the number of iterations t. After t = 4 × 10^3 iterations the relative objective function value for oLBFGS is F(w_t) - F(w*) = 0.11, while SGD reaches an objective function accuracy of F(w_t) - F(w*) = 1.16 after the same number of iterations.

Figure 4 shows the same metric, but with cumulative run time on the x-axis as opposed to the number of feature vectors processed. Under the chosen parameters, oLBFGS uses a mini-batch size of L = 10^4, which is 10 times as large as the mini-batch size of 10^3 used for SGD. We recall from above that the dense vector operations involving w dominate the run time of both oLBFGS and SGD due to the sparsity of the data set. Since oLBFGS performs 4τ + 1 = 21 such operations per iteration versus 2 for SGD, the time per iteration for oLBFGS is approximately 10 times as long as it is for SGD. Together, these factors roughly cancel out, meaning we can draw nearly the same conclusions for run time that we could for the processed feature vector count. Specifically, after 9 hours of run time, the relative objective function values for oLBFGS and SGD are 0.12 and 0.80, respectively. Moreover, oLBFGS reaches in 40 minutes the level of performance exhibited by SGD after the full 9 hours of processing.

Figure 3: Regularized negative log-likelihood relative to that of the optimal classifier, F(w_t) - F(w*), versus the number of processed feature vectors Lt for oLBFGS and SGD. Note that oLBFGS achieves the same relative objective function value after processing 2.5 million feature vectors that SGD does after processing 40 million.

Figure 4: Regularized negative log-likelihood relative to that of the optimal classifier, F(w_t) - F(w*), versus run time (hours) for oLBFGS and SGD over 9 hours of training. Note that oLBFGS achieves the same relative objective function value as SGD in roughly an order of magnitude less time.

To better understand the advantages of oLBFGS relative to SGD, we also compare their prediction error on the test set, where prediction error is defined to be the average magnitude of the difference between the logistic regression output and the true label (substituting 0 for -1), taking on values between 0 and 1. More formally, the prediction error of a weight vector w is defined as

e(w) = \frac{1}{n} \sum_{(x_i, y_i) \in S} \left| \frac{1 + y_i}{2} - \frac{1}{1 + \exp(-w^T x_i)} \right|.  (10)

Figure 5 illustrates the prediction error versus the number of feature vectors processed for oLBFGS and SGD after processing Lt = 4 × 10^7 feature vectors, which is equivalent to t = 4 × 10^3 iterations for oLBFGS and t = 4 × 10^4 iterations for SGD. As shown in Figure 5, the prediction error for SGD after processing Lt = 4 × 10^7 samples of the training set is e = 0.38, while the error of the classifier trained by oLBFGS is e = 0.26 after processing the same number of feature vectors. Moreover, oLBFGS makes fairly steady progress towards its final prediction error of 0.26 over the course of the run, but SGD oscillates around an error of e = 0.4 without showing any convergent tendencies. The gap between the performance of oLBFGS and SGD is more significant if we compare the prediction error versus the number of iterations t: the prediction error of oLBFGS after t = 4 × 10^3 iterations is e = 0.26, while after the same number of iterations SGD still oscillates around an error of e ≈ 0.4.

Figure 6 instead shows prediction error versus run time. Due to the previously discussed run time considerations, this plot largely shows the same information that Figure 5 contains. To be more precise, after 9 hours of run time, the prediction errors for SGD and oLBFGS are e = 0.39 and e = 0.27, respectively.
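A short sketch of the evaluation metrics used here, covering the prediction error of (10) and the thresholded classification accuracy discussed later (our illustration; dense arrays are used for simplicity):

```python
import numpy as np

def logistic_output(w, X):
    """CTR estimates 1 / (1 + exp(-w^T x)) for each row of X."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def prediction_error(w, X, y):
    """Mean absolute difference between the logistic output and the 0/1 label,
    as in (10); y is given in {-1, +1} and mapped to {0, 1}."""
    return np.mean(np.abs((1 + y) / 2.0 - logistic_output(w, X)))

def classification_accuracy(w, X, y):
    """Fraction of test points whose thresholded output matches the label."""
    predicted = np.where(logistic_output(w, X) >= 0.5, 1.0, -1.0)
    return np.mean(predicted == y)
```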

Figure 5: Mean prediction error versus the number of processed feature vectors for oLBFGS and SGD. The prediction error for oLBFGS decreases consistently over the course of the experiment, whereas the prediction error for SGD oscillates and fails to exhibit steady improvement.

Figure 6: Mean prediction error versus run time (hours) for oLBFGS and SGD over 9 hours of training. The prediction error for oLBFGS decreases consistently over the course of the experiment, whereas the prediction error for SGD oscillates and fails to exhibit steady improvement.

Moreover, oLBFGS reaches in 14 minutes the level of performance exhibited by SGD after the full 9 hours of processing, and remains below this level from 1 hour 41 minutes onwards. In addition, the classifier obtained using oLBFGS achieves a higher area-under-curve (AUC) score on the validation set than the classifier obtained using SGD, with the AUC score of the initial weight vector w_0 serving as a reference point. Lastly, if we map logistic outputs between 0 and 1/2 to a predicted label of -1 and outputs between 1/2 and 1 to a predicted label of 1, oLBFGS obtains a test set classification accuracy of 77.56%, whereas SGD obtains an accuracy of just 63.18%. These results all lend credence to the theoretical claims of the superiority of oLBFGS over SGD, demonstrating that oLBFGS exhibits faster convergence than SGD and requires fewer samples to converge to a classifier of equivalent accuracy.

5. Conclusion

To conclude, we demonstrate that online limited-memory BFGS (oLBFGS) is empirically superior to stochastic gradient descent (SGD) for the click-through rate prediction problem. We validate this claim through a comparison of these algorithms across the iterations, feature vectors, and run time required for convergence, using the evaluation metrics of log-likelihood, continuous prediction error, discrete accuracy, and AUC. In future work, we intend to investigate the use of oLBFGS on even larger data sets in order to evaluate the effects of scale on its performance.

References

J. R. Birge, X. Chen, L. Qi, and Z. Wei. A stochastic Newton method for stochastic quadratic programs with recourse. Technical report, University of Michigan, Ann Arbor, MI, 1995.

L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, Physica-Verlag HD, 2010.

R. H. Byrd, J. Nocedal, and Y. Yuan. Global convergence of a class of quasi-Newton methods on convex problems. SIAM Journal on Numerical Analysis, 24(5), October 1987.

D. Chakrabarti, D. Agarwal, and V. Josifovski. Contextual advertising by combining relevance with click feedback. In Proceedings of the 17th International Conference on World Wide Web, ACM, 2008.

T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 13-20, 2010.

D. W. Hosmer, Jr. and S. Lemeshow. Applied Logistic Regression. John Wiley & Sons, 2004.

J. E. Dennis, Jr. and J. J. Moré. A characterization of superlinear convergence and its application to quasi-Newton methods. Mathematics of Computation, 28(126), 1974.

D. G. Kleinbaum and M. Klein. Logistic Regression: A Self-Learning Text. Springer, 2010.

J. Konecny and P. Richtarik. Semi-stochastic gradient descent methods. arXiv preprint, 2013.

N. LeRoux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint, 2012.

D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3), 1989.

H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013.

T. M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 1997.

A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Transactions on Signal Processing, 62(23), December 2014a.

A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. arXiv preprint, 2014b.

A. Nemirovski, A. Juditsky, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 2009.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, New York, NY, 2nd edition, 1999.

M. J. D. Powell. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Academic Press, London, UK, 1971.

M. Regelson and D. Fain. Predicting click-through rate using keyword clusters. In Proceedings of the Second Workshop on Sponsored Search Auctions, 2006.

M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, ACM, 2007.

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013.

N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-Newton method for online convex optimization. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), Society for Artificial Intelligence and Statistics, 2007.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, ACM, 2007.

G. Sun. KDD Cup 2012, Track 2: Soso.com advertisement prediction challenge. 2014. Accessed August 1, 2014.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004.


A Stochastic Quasi-Newton Method for Online Convex Optimization A Stochastic Quasi-Newton Method for Online Convex Optimization Nicol N. Schraudolph nic.schraudolph@nicta.com.au Jin Yu jin.yu@rsise.anu.edu.au Statistical Machine Learning, National ICT Australia Locked

More information

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Improving the Convergence of Back-Propogation Learning with Second Order Methods the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Deep Learning & Neural Networks Lecture 4

Deep Learning & Neural Networks Lecture 4 Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly

More information

Stochastic Optimization Methods for Machine Learning. Jorge Nocedal

Stochastic Optimization Methods for Machine Learning. Jorge Nocedal Stochastic Optimization Methods for Machine Learning Jorge Nocedal Northwestern University SIAM CSE, March 2017 1 Collaborators Richard Byrd R. Bollagragada N. Keskar University of Colorado Northwestern

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Why should you care about the solution strategies?

Why should you care about the solution strategies? Optimization Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms makes you more effectively choose which algorithm to run Understanding the

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Second-Order Methods for Stochastic Optimization

Second-Order Methods for Stochastic Optimization Second-Order Methods for Stochastic Optimization Frank E. Curtis, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

Conjugate Directions for Stochastic Gradient Descent

Conjugate Directions for Stochastic Gradient Descent Conjugate Directions for Stochastic Gradient Descent Nicol N Schraudolph Thore Graepel Institute of Computational Science ETH Zürich, Switzerland {schraudo,graepel}@infethzch Abstract The method of conjugate

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms * Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms

More information

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday! Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification

A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification JMLR: Workshop and Conference Proceedings 1 16 A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification Chih-Yang Hsia r04922021@ntu.edu.tw Dept. of Computer Science,

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Dreem Challenge report (team Bussanati)

Dreem Challenge report (team Bussanati) Wavelet course, MVA 04-05 Simon Bussy, simon.bussy@gmail.com Antoine Recanati, arecanat@ens-cachan.fr Dreem Challenge report (team Bussanati) Description and specifics of the challenge We worked on the

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China) Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu

More information

CS260: Machine Learning Algorithms

CS260: Machine Learning Algorithms CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

IFT Lecture 7 Elements of statistical learning theory

IFT Lecture 7 Elements of statistical learning theory IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and

More information

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017 CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Algorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Language Modeling III Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Office hours on website but no OH for Taylor until next week. Efficient Hashing Closed address

More information

Parallel Coordinate Optimization

Parallel Coordinate Optimization 1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a

More information

Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

Sub-Sampled Newton Methods

Sub-Sampled Newton Methods Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1

More information