Online Limited-Memory BFGS for Click-Through Rate Prediction


Mitchell Stern, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Aryan Mokhtari, Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA

Alejandro Ribeiro, Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract

We study the problem of click-through rate (CTR) prediction, where the goal is to predict the probability that a user will click on a search advertisement given information about the issued query and the user's account. In this paper, we formulate a model for CTR prediction using logistic regression, then assess the performance of stochastic gradient descent (SGD) and online limited-memory BFGS (oLBFGS) for training the corresponding classifier. We demonstrate empirically that oLBFGS provides faster convergence and requires fewer training examples than SGD to achieve comparable performance, confirming the benefits of using second-order information in stochastic optimization.

Keywords: click-through rate prediction, large-scale optimization, logistic regression, stochastic optimization, quasi-Newton methods

1. Introduction

Nearly all major search engines rely on advertisements as a primary source of revenue, so the problem of building an effective system for advertisement click-through rate (CTR) prediction has garnered widespread interest in recent years. Briefly, the goal of CTR prediction is to build a model that captures the behavior of users when presented with a small set of advertisements; armed with such a model, companies can propose the ads with the highest probability of being clicked, thereby better serving their advertiser clients. Along with related tasks, CTR prediction belongs to a growing class of problems that have motivated the development of algorithms able to efficiently handle large data sets, often containing hundreds of millions of training instances with tens of millions of features.

In the literature, the CTR prediction problem has been approached in a number of ways. Regelson and Fain hypothesize that different terms within a search have different CTR probabilities Regelson and Fain (2006); accordingly, they incorporate a combination of historical rates for individual terms as well as aggregate rates for clusters of keywords into their model. By using a back-off approach, they are able to predict the CTR probabilities both of terms which have sufficient historical data and of terms which are completely novel to their system. Graepel et al. instead use a graphical model for the problem Graepel et al. (2010), using probit regression to map combinations of discrete and real-valued features derived from advertisement, query, and user data to CTR probabilities.

Another popular model choice for the CTR prediction problem (and indeed other problems in machine learning with heterogeneous feature sets) is logistic regression, which easily accommodates a wide range of features and is amenable to methods developed in the field of convex optimization Mitchell (1997); Hosmer and Lemeshow (2004); Kleinbaum and Klein (2010). First-order methods such as gradient descent have seen great success on smaller-scale problems, but become prohibitively expensive when the number of features and training examples becomes large. Stochastic (online) versions of gradient descent resolve this issue by approximating the gradient of the objective function using a small subsample of the data set Zhang (2004); Shalev-Shwartz et al. (2007); Nemirovski et al. (2009); Bottou (2010); LeRoux et al. (2012); Konecny and Richtarik (2013); Schmidt et al. (2013); Zhang et al. (2013). Such algorithms have been applied to the problem of training a logistic regression classifier for CTR prediction, reducing the cost of individual updates McMahan et al. (2013). Chakrabarti et al. instead develop and implement an approximate logistic regression model, and use parallelized iterative scaling for training Chakrabarti et al. (2008). Although this method suffers from slightly slower convergence than comparable approaches, it provides benefits in the form of reduced iteration cost by updating only one component at a time.

An issue common to each of these latter approaches, however, is that the algorithms they use for training are either offline, making individual iterations slow, or do not incorporate second-order information, making convergence slow. In a deterministic setting, Newton's method can be used to accelerate the convergence of first-order methods. One may consider the use of stochastic Newton to accelerate the convergence of SGD Birge et al. (1995), but computing an unbiased estimate of the objective Hessian and its inverse is costly in large-scale problems. Quasi-Newton methods, which incorporate second-order information by approximating the curvature of the objective function using two consecutive gradients, can also be used to improve the convergence of gradient descent algorithms Powell (1971); Dennis and Moré (1974); Byrd et al. (1987); Nocedal and Wright (1999). The most successful quasi-Newton methods are BFGS, which has super-linear convergence Byrd et al. (1987), and limited-memory BFGS (LBFGS), which exhibits the improved convergence of BFGS while requiring less memory to implement Liu and Nocedal (1989). The use of LBFGS to train a logistic regression model for CTR prediction shows a reduction in convergence times relative to first-order methods Richardson et al. (2007). However, both BFGS and LBFGS require computation of the full gradient, which is not suitable at larger scales.

Although each of these approaches offers significant improvements over standard gradient descent, none of them combines the advantages in cost per iteration obtained by processing inputs online with the advantages in convergence obtained by incorporating curvature estimates of the objective function. Stochastic (online) quasi-Newton methods arise as a natural alternative that captures second-order information by approximating the Hessian of the objective to accelerate convergence, while using stochastic gradients in lieu of full gradients to more easily accommodate larger problems Schraudolph et al. (2007); Mokhtari and Ribeiro (2014a,b).
We therefore propose the use of online limited-memory BFGS (oLBFGS) to train a logistic regression model for CTR prediction, as this method possesses both of these beneficial qualities. We apply this approach to the click-through rate prediction problem on the Soso data set, which consists of several hundred million search logs and their associated advertisement impressions and clicks. We study the advantages of oLBFGS relative to the first-order online algorithm SGD and show that oLBFGS significantly decreases the number of sample points required to train a logistic model.

Our paper is organized as follows. We first give an overview of our target data set and the process of generating feature vectors (Section 2). We then provide a mathematical formulation of the CTR problem and show that it can be cast in the framework of logistic regression. Further, we explain the SGD and oLBFGS algorithms and discuss their respective implementations (Section 3). We proceed by describing the setup and results of our numerical experiments. On a portion of a data set containing 240 million instances, each with 56 million features, oLBFGS reaches the same negative log-likelihood value as SGD in an order of magnitude less time. Moreover, the test set error rate achieved by oLBFGS after one pass over the subsampled training set is 26%, a substantial improvement over the error rate of 38% achieved by SGD (Section 4).

2. Data Set

We use the Soso search engine data set from Tencent Sun (2014), which contains information on several hundred million searches. As provided, each data point can roughly be divided into two groups of features, one of which concerns the user, and the other of which concerns the proposed advertisement.

In the former group, numerical identifiers for the user and the query are provided. The data set contains 22 million unique users, whose IDs can be referenced in an external file to obtain information from the user's profile: gender is given as male, female, or unknown, and age is given as one of the six intervals (0, 12], (12, 18], (18, 24], (24, 30], (30, 40], or (40, ∞). Similarly, the query ID can be located in a separate file to obtain the list of tokens in the query, which are given as hash values for purposes of anonymity.

Offering slightly more information, the advertisement feature group includes both standalone features and references to other data files. Fields which are directly available include a unique identifier for the ad, a unique identifier for the advertiser, a hashed version of the URL linked to by the ad, the total number of ads displayed on the results page ("depth"), and the relative position of the ad within the list of ads. Three numeric keys are provided as well, which can be looked up in additional data files to obtain the tokens that comprise the ad's keywords, title, and main content ("description"). As with the query, the tokens in these files have been hashed to preserve anonymity.

2.1 Feature Vector Generation

The raw features described above cannot be incorporated directly into standard statistical models. Therefore, we first pre-process the raw data to obtain usable numerical features. To begin, suppose that a field x may take on one of n categorical values {v_1, v_2, ..., v_n}. For a particular instance of the problem, we can encode the value v_i assumed by this field as a list of n binary features f = [f_1, ..., f_n] by setting f_j = 1 if i = j and 0 otherwise. In words, the vector f acts as an n-dimensional indicator vector for x. This technique can be applied to each of the categorical raw features, namely the user's gender and age, the hashed advertisement URL, the depth and position of the ad, and each of the numerical IDs. Interpreting the depth and position of the ad as a single unit, this pair can be encoded as a list of binary features as well. As an example, a user's gender can be specified as male, female, or unknown, meaning there are three different possibilities for this field. To encode this information we would assign three binary components of the feature vector to gender, encoding a male user as [1, 0, 0], a female user as [0, 1, 0], and a user of unknown gender as [0, 0, 1].

The tokens which comprise the user's query and the ad's keywords, title, and description can be transformed into lists of binary features using a similar technique. Let {t_1, ..., t_v} denote the union of tokens that occur across all user queries, where v denotes the size of the vocabulary. We can encode the contents of a specific query as a list of v binary features f = [f_1, ..., f_v] by setting f_j = 1 if token t_j occurs at least once in the query and 0 otherwise. Binary features for the ad's keywords, title, and description can be generated in an analogous manner.
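To make the encoding concrete, the following minimal Python sketch (our illustration, not code from the paper) builds the indicator and token-presence features described above; the category lists and vocabulary are placeholder assumptions.

```python
import numpy as np

def one_hot(value, categories):
    """Encode a categorical value as an indicator vector over `categories`."""
    f = np.zeros(len(categories))
    f[categories.index(value)] = 1.0
    return f

def token_indicators(tokens, vocabulary):
    """Binary vector marking which vocabulary tokens occur at least once."""
    index = {t: j for j, t in enumerate(vocabulary)}
    f = np.zeros(len(vocabulary))
    for t in set(tokens):
        if t in index:
            f[index[t]] = 1.0
    return f

# Gender over three categories, and query tokens over a tiny vocabulary.
print(one_hot("female", ["male", "female", "unknown"]))               # [0. 1. 0.]
print(token_indicators(["cheap", "phone"],
                       ["cheap", "electronics", "laptop", "phone"]))  # [1. 0. 0. 1.]
```

In practice these vectors would be stored in a sparse format, since only a handful of components are non-zero per instance.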
Importantly, each of the resulting vectors should be kept separate so as not to conflate the presence of a token in one textual feature with the presence of the same token in another. As a simple example, suppose that the user's query is "cheap phone" and that the full vocabulary across all users' queries is {cheap, electronics, laptop, phone}. We would then encode this particular query as the 4-dimensional vector [1, 0, 0, 1]. In practice, our data set of interest has a query vocabulary size of 1 million words, and distinct ad keyword, title, and description vocabularies of around 100,000 words each.

Once the binary vectors have been generated for each of the token lists for a particular instance, real-valued similarities between each pair of lists can additionally be computed. Letting S and T denote two sets of tokens that have been derived from token lists by collapsing repeated occurrences, we define the cosine similarity of the corresponding messages as

CosineSimilarity(S, T) = \frac{|S \cap T|}{\sqrt{|S|}\,\sqrt{|T|}}.  (1)

This value ranges between 0 and 1, with 0 indicating no overlap, and 1 indicating that the messages contain exactly the same set of tokens. Note that the ordering of the tokens within each message is not taken into account in this metric.
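A small sketch of the set-based cosine similarity in (1), again illustrative rather than taken from the paper:

```python
import math

def cosine_similarity(s, t):
    """Cosine similarity between two token sets, as in (1)."""
    s, t = set(s), set(t)
    if not s or not t:
        return 0.0
    return len(s & t) / (math.sqrt(len(s)) * math.sqrt(len(t)))

# The worked example from the text: "cheap phone" vs. "cheap laptop".
print(cosine_similarity({"cheap", "phone"}, {"cheap", "laptop"}))  # 0.5
```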

Table 1: Derived features for click-through rate prediction. For each type of feature, the table reports the total number of components occupied in the feature vector by that type, along with the maximum and mean number of non-zero components across all instances. The feature types are: gender; age; ad URL; depth; position; the depth-and-position pair; advertisement ID; advertiser ID; query ID; keyword ID; title ID; description ID; user ID; query tokens; ad keyword tokens; ad title tokens; ad description tokens; cosine similarities; numerical depth and position; and the number of tokens in the query and in the ad's keywords, title, and description. In total, the feature vectors contain roughly 56 million components.

As an extension of the example from above, the cosine similarity between the queries "cheap phone" and "cheap laptop" would be computed as |{cheap, phone} ∩ {cheap, laptop}| / (\sqrt{|{cheap, phone}|} \sqrt{|{cheap, laptop}|}) = 1/(\sqrt{2} \cdot \sqrt{2}) = 1/2. Since each instance contains 4 sets of tokens, namely the user query and the advertisement keywords, title, and description, we compute a total of \binom{4}{2} = 6 real-valued similarity features.

An additional group of simple numerical features can also be generated for each instance. The depth and position of the ad can be used directly as numerical values. Moreover, the number of tokens in the user's query and each of the advertisement's three textual fields can be included as numerical features as well. For example, if the depth, which can take on values in the set {1, 2, 3}, were equal to 2 in a particular training instance, then it would be encoded both in binary by the vector [0, 1, 0] and numerically as the number 2. Therefore, we use 4 components of each feature vector to encode the depth of the ad.

Lastly, for each of the raw categorical features, we can compute the average click-through rate of the training instances for which a particular variable takes on a particular value. Storing these rates in a dictionary, the average click-through rate associated with the value of each categorical variable in a new instance can be efficiently determined and included as an additional feature.

The set of features derived from the raw data is summarized in Table 1. By construction, some of the categorical and numerical feature subvectors contain exactly one non-zero component per instance, and depending on the number of possible values, range in sparsity from fully dense to nearly completely sparse.
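As a small illustration of the historical click-through-rate features just described, the sketch below tabulates per-value rates in a dictionary and looks them up for new instances. The helper name and the fallback to a global rate for unseen values are our own assumptions; the paper only specifies that the rates are stored in a dictionary.

```python
from collections import defaultdict

def average_ctr_table(instances):
    """Map each observed value of a categorical field to its historical CTR.

    `instances` is assumed to be an iterable of (field_value, clicked) pairs,
    where clicked is 1 for a click and 0 otherwise.
    """
    clicks, shows = defaultdict(float), defaultdict(int)
    for value, clicked in instances:
        clicks[value] += clicked
        shows[value] += 1
    return {v: clicks[v] / shows[v] for v in shows}

# Look up the feature for a new instance; fall back to a global rate for unseen values.
table = average_ctr_table([("ad_7", 1), ("ad_7", 0), ("ad_9", 0)])
global_rate = 1.0 / 3.0
print(table.get("ad_7", global_rate))   # 0.5
print(table.get("ad_42", global_rate))  # unseen value -> global rate
```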

Other types of features, such as those generated from tokens, exhibit varying numbers of non-zero components across the data set. There are approximately 56 million features in all. The critical observation is that the feature vectors are very sparse: on average, only 59.8 components are non-zero. This observation implies that the computation of SGD and oLBFGS steps can be executed quickly by exploiting the sparsity of the feature vectors.

3. Problem Formulation and Algorithms

In this section, we discuss the statistical model that characterizes our mathematical formulation of the click-through rate prediction problem, and give an overview of the stochastic optimization techniques used to learn this model in our experiments.

3.1 Logistic Regression

We use logistic regression as our model for CTR prediction. Specifically, let x ∈ R^p be a feature vector, w ∈ R^p be a weight vector dictating how features are to be combined, and y ∈ {-1, 1} be an indicator variable that takes on the value -1 when the advertisement is not clicked by the user and 1 when the advertisement is clicked. We assume the following functional form for the click-through rate, which by definition is the probability that y = 1:

CTR(x; w) := P[y = 1 | x; w] = \frac{1}{1 + \exp(-w^T x)}.  (2)

Subsequently, the probability that a sample point x has label y = -1 is given by 1 - P[y = 1 | x; w] = 1/(1 + exp(w^T x)). Given a training set S = {(x_i, y_i)}_{i=1}^n consisting of n pairs of feature vectors x_i and their associated labels y_i, our goal is then to learn the optimal classifier w* as the maximum likelihood estimate of w according to the model in (2) and the training data S.

To introduce the optimization problem that corresponds to finding the optimal classifier, let S_- = {(x_i, y_i) : y_i = -1} be the set of negative training examples and S_+ = {(x_i, y_i) : y_i = 1} be the set of positive training examples. The CTR prediction problem can then be formalized as

\max_w \; \frac{1}{n} \left[ \sum_{x_i \in S_-} \frac{1}{1 + \exp(w^T x_i)} + \sum_{x_i \in S_+} \frac{1}{1 + \exp(-w^T x_i)} \right].  (3)

Notice that the arguments of the exponentials in each sum can be written as -y_i w^T x_i. Therefore, the two sums over sample points with labels y_i = -1 and y_i = 1 can be combined into a single sum over all sample points if we use -y_i w^T x_i as the exponential argument. Moreover, observe that maximizing the likelihood probability in (3) is equivalent to minimizing the negative log-likelihood of the objective in (3). Hence, the optimal classifier can be found by minimizing the regularized negative logistic log-likelihood function

\frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i w^T x_i)\right),  (4)

where the l_2-norm regularization term (λ/2)||w||^2 has been added to prevent overfitting. Hence, define the regularized log-likelihood term for the ith training example as

f_i(w) = \frac{\lambda}{2} \|w\|^2 + \log\left(1 + \exp(-y_i w^T x_i)\right).  (5)

With the functions f_i(w) defined as in (5), we can write the log-likelihood minimization in (4) as

w^* = \operatorname*{argmin}_w \; \frac{1}{n} \sum_{i=1}^{n} f_i(w) = \operatorname*{argmin}_w \; F(w),  (6)

where F(w) = (1/n) \sum_{i=1}^n f_i(w) is an abbreviation for the full objective function. In the rest of the paper we let f_i(w) denote the instantaneous function that corresponds to sample point i, and F(w) denote the aggregate function that captures the error of classifier w in classifying all the sample points in the data set.
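As a sanity check on the objective in (4)-(6), the following sketch (our illustration, using dense toy data rather than the sparse CTR features) evaluates the instantaneous loss f_i(w) and the aggregate F(w).

```python
import numpy as np

def instantaneous_loss(w, x, y, lam):
    """Regularized logistic loss f_i(w) from (5); y is +1 or -1."""
    return 0.5 * lam * np.dot(w, w) + np.log1p(np.exp(-y * np.dot(w, x)))

def objective(w, X, Y, lam):
    """Aggregate objective F(w) = (1/n) sum_i f_i(w) from (6)."""
    return np.mean([instantaneous_loss(w, x, y, lam) for x, y in zip(X, Y)])

# Toy check on dense random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = rng.choice([-1.0, 1.0], size=100)
w = np.zeros(5)
print(objective(w, X, Y, lam=1e-7))  # log(2) ~ 0.693 at w = 0
```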

3.2 Stochastic Gradient Descent

The instantaneous functions f_i(w) are strongly convex with parameter λ, which implies the strong convexity of the aggregate function F(w). In light of this observation, the optimal parameter w* can be found using stochastic gradient descent. Let us introduce a time index t = 0, 1, ..., step sizes ε_t, and random subsets I_t of the n training indices {1, ..., n}, each of which satisfies |I_t| = L for a fixed batch size L ≪ n. The stochastic loss function at time t is then F̂_t(w) = (1/L) Σ_{i ∈ I_t} f_i(w). Though this is an approximation to the true loss function F(w), choosing L ≪ n when n is large can put previously intractable calculations within the realm of modern computation. We now define the stochastic gradient over the subsample of training instances at time t as

\hat{s}(w_t; I_t) := \nabla_w \hat{F}_t(w_t) = \lambda w_t - \frac{1}{L} \sum_{i \in I_t} \frac{y_{t,i}\, x_{t,i}}{1 + \exp(y_{t,i} w_t^T x_{t,i})}.  (7)

The iterates of stochastic gradient descent can then be written succinctly as

w_{t+1} = w_t - \epsilon_t \hat{s}(w_t; I_t),  (8)

where ε_t is the positive step size at time t. Note that the stochastic gradient ŝ(w; I_t) = ∇_w F̂_t(w) is an unbiased estimator of the exact gradient s(w) = ∇_w F(w), since E[∇_w F̂_t(w)] = ∇_w F(w). This observation implies that the stochastic gradient direction is on average a descent direction. Note also that when the sample points (feature vectors) x_i are sparse, the product w_t^T x_{t,i} in (7) is not costly and the stochastic gradient ŝ(w_t; I_t) can be computed in a reasonable amount of time.

3.3 Online Limited-Memory BFGS

Though widely used, the number of iterations required by SGD for convergence becomes prohibitively large for high-dimensional problems. This motivates the use of the stochastic quasi-Newton method known as online limited-memory BFGS (oLBFGS), which incorporates second-order information to realize a significant improvement in convergence speed. In general, all second-order methods are characterized by the introduction of a positive definite matrix B̂_t ≻ 0 into the iteration

w_{t+1} = w_t - \epsilon_t \hat{B}_t^{-1} \hat{s}_t(w_t; I_t) = w_t - \epsilon_t d_t.  (9)

Setting B̂_t equal to the Hessian of the objective function ∇²F(w_t) gives Newton's algorithm. In quasi-Newton methods, B̂_t is instead selected to be an approximation of the Hessian that is less expensive to compute. The popular BFGS method uses gradient evaluations of the objective function to compute low-rank updates to the initial Hessian approximation Nocedal and Wright (1999). An online variant, oBFGS, generalizes this approach by using stochastic gradients in place of deterministic gradients Schraudolph et al. (2007). The limited-memory variant of BFGS stores the approximate Hessian matrix implicitly through a fixed number of representative vectors rather than as a dense matrix, thereby improving performance in exchange for a slight decrease in accuracy Liu and Nocedal (1989). Combining these attributes, online limited-memory BFGS (oLBFGS) both stores the Hessian estimate implicitly through a collection of vectors and makes updates based on stochastic gradients rather than deterministic gradients Schraudolph et al. (2007). In addition to superior performance, this final method also has strong theoretical guarantees, exhibiting global convergence to optimal arguments with probability 1 under mild conditions on the Hessian eigenvalues of the stochastic objective functions Mokhtari and Ribeiro (2014b). A brief description of its implementation is given below.
Let v_t := w_{t+1} - w_t be the variable variation at time t and r̂_t := ŝ_t(w_{t+1}; I_t) - ŝ_t(w_t; I_t) be the stochastic gradient variation at time t. In oLBFGS, a history of the past τ curvature information pairs (v_{t-τ}, r̂_{t-τ}), ..., (v_{t-1}, r̂_{t-1}) is maintained. With these, an approximation to the descent direction d_t = B̂_t^{-1} ŝ_t(w_t; I_t) can be computed efficiently using Algorithm 1, where the initial Hessian inverse approximation B̂_{t,0}^{-1} is set to the identity matrix I if t = 0, and to ((v_{t-1}^T r̂_{t-1})/(r̂_{t-1}^T r̂_{t-1})) I in subsequent iterations.
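For concreteness, here is a minimal Python sketch of this two-loop computation (our illustration of Algorithm 1 below, not code from the paper); the curvature pairs are assumed to be stored oldest first, and `gamma` plays the role of the scalar in the initial Hessian inverse approximation.

```python
import numpy as np

def olbfgs_direction(grad, pairs, gamma=1.0):
    """Two-loop recursion of Algorithm 1: returns d = B^{-1} * grad.

    `pairs` holds up to tau tuples (v, r) of variable and stochastic-gradient
    variations, oldest first; the initial inverse Hessian approximation is gamma * I.
    """
    p = grad.copy()
    rhos, alphas = [], []
    for v, r in reversed(pairs):           # newest pair first
        rho = 1.0 / np.dot(v, r)
        alpha = rho * np.dot(v, p)
        p -= alpha * r
        rhos.append(rho)
        alphas.append(alpha)
    q = gamma * p                          # apply the initial approximation gamma * I
    for (v, r), rho, alpha in zip(pairs, reversed(rhos), reversed(alphas)):
        beta = rho * np.dot(r, q)
        q += (alpha - beta) * v
    return q

# One quasi-Newton step as in (9), assuming `w`, `eps`, a stochastic gradient `g`,
# and the pair history `pairs` are maintained elsewhere:
#   gamma = 1.0 if not pairs else np.dot(pairs[-1][0], pairs[-1][1]) / np.dot(pairs[-1][1], pairs[-1][1])
#   w = w - eps * olbfgs_direction(g, pairs, gamma)
```

With an empty history and gamma = 1, the step reduces to the SGD iterate in (8).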

Algorithm 1: Computation of the oLBFGS descent direction d_t = B̂_t^{-1} ŝ_t(w_t; I_t).

Require: The stochastic gradient ŝ_t(w_t; I_t), the initial Hessian inverse approximation B̂_{t,0}^{-1}, and the last τ curvature information pairs {v_{t-u}, r̂_{t-u}}_{u=1}^{τ}.
1: Let p_0 ← ŝ_t(w_t; I_t) be the stochastic gradient at time t.
2: for u = 0, 1, ..., τ - 1 do
3:   Compute scalar ρ̂_{t-u-1} ← 1/(v_{t-u-1}^T r̂_{t-u-1}).
4:   Compute and store scalar α_u ← ρ̂_{t-u-1} v_{t-u-1}^T p_u.
5:   Update vector p_{u+1} ← p_u - α_u r̂_{t-u-1}.
6: end for
7: Let q_0 ← B̂_{t,0}^{-1} p_τ.
8: for u = 0, 1, ..., τ - 1 do
9:   Compute scalar β_u ← ρ̂_{t-τ+u} r̂_{t-τ+u}^T q_u.
10:  Update vector q_{u+1} ← q_u + (α_{τ-u-1} - β_u) v_{t-τ+u}.
11: end for
12: return d_t = q_τ

Note that this implementation of the computation of the oLBFGS descent direction d_t has complexity of order O(τp), which is not expensive relative to the computational complexity of the SGD descent direction of order O(p).

4. Numerical Experiments

For our numerical experiments, we use the data set provided for Track 2 of the 2012 KDD Cup, wherein the training and test instances are derived from session logs of the Soso search engine. Each data point in the training set summarizes the behavior of a specific user when shown a particular ad after issuing a particular query, including both the total number of impressions under these conditions as well as the number of times the user clicked on the ad. As a first pre-processing step, we expand each of these summaries into individual impressions, (# of clicks) of which are given the label +1, and the remainder of which are given the label -1. After this expansion process, there are approximately 240 million instances. We randomly select 1/10 of these sample points to create the training and test sets. Out of these samples, we create a training set by picking 9/10 of the samples, and we use the remaining instances to create the test set. In our training set, there were roughly 27 negative instances for every positive instance. To account for this discrepancy, we replicate each positive instance 26 times to balance the ratio. Without this adjustment, a classifier could obtain a near-optimal objective value simply by assigning the label -1 to every instance, which is clearly undesirable behavior. This adjustment approximately doubles the size of the training set, giving roughly 40 million effective samples. We then generate feature vectors for our training instances as described in Section 2, and implement SGD and oLBFGS as described in Section 3.

4.1 Parameter Selection

The initial weight vector w_0 is chosen by generating each component of this vector at random from the normal distribution N(0, 1) with zero mean and unit variance. We then set the regularization parameter to λ = 10^{-7} in each of our experiments so that the regularization and log-likelihood terms are of the same order of magnitude. To choose the method-specific parameters, we run SGD and oLBFGS on the training set for 1 hour each under a variety of conditions, and select the parameter settings which lead to the best performance. The step sizes for both algorithms are selected according to the formula ε_t = ε_0 T_0/(T_0 + t), with the decay factor T_0 being fixed in each group of experiments. The history size for oLBFGS is fixed at τ = 5 due to memory constraints.
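The decaying step-size schedule and the grid search over (T_0, L, ε_0) described above can be sketched as follows; `train_for_one_hour` is a hypothetical routine standing in for a fixed-budget training run, and the grids shown in Section 4.1 would be passed in as arguments.

```python
import itertools

def step_size(t, eps0, T0):
    """Decaying step size eps_t = eps0 * T0 / (T0 + t)."""
    return eps0 * T0 / (T0 + t)

def select_parameters(train_for_one_hour, T0_grid, L_grid, eps0_grid):
    """Run each configuration for a fixed budget and keep the one with the
    lowest final objective value F(w)."""
    best, best_value = None, float("inf")
    for T0, L, eps0 in itertools.product(T0_grid, L_grid, eps0_grid):
        value = train_for_one_hour(T0=T0, L=L, eps0=eps0)
        if value < best_value:
            best, best_value = (T0, L, eps0), value
    return best
```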

Figure 1: Negative log-likelihood F(w) versus run time (seconds) for oLBFGS after one hour of training under its four best parameter settings, (T_0, L, ε_0) = (10^2, 10^4, 0.5), (10^3, 10^4, 0.5), (10^4, 10^4, 0.5), and (10^5, 10^4, 0.1). The accuracy of these top four parameter settings is close, with (T_0, L, ε_0) = (10^2, 10^4, 0.5) holding a marginal advantage over the other three.

Figure 2: Negative log-likelihood F(w) versus run time (seconds) for SGD after one hour of training under its four best parameter settings, (T_0, L, ε_0) = (10^3, 10^2, 0.025), (10^4, 10^3, 0.075), (10^5, 10^2, 0.075), and (10^6, 10^3, 0.075). The best performance among these top four parameter settings belongs to (T_0, L, ε_0) = (10^4, 10^3, 0.075).

To be more precise, for the oLBFGS method, all combinations of parameter settings with step size parameters T_0 ∈ {10^2, 10^3, 10^4, 10^5}, mini-batch sizes L ∈ {10^2, 10^3, 10^4, 10^5}, and a grid of initial step sizes ε_0 including 10^{-2} and 10^{-1} are considered. For each decay factor T_0, the best choice of mini-batch size L and step size parameter ε_0 is determined to be (T_0, L, ε_0) = (10^2, 10^4, 0.5), (10^3, 10^4, 0.5), (10^4, 10^4, 0.5), and (10^5, 10^4, 0.1). To obtain the best choice of parameters we compare these four settings as shown in Fig. 1. The best performance among the different combinations of parameters for oLBFGS after one hour of training is achieved by (T_0, L, ε_0) = (10^2, 10^4, 0.5).

To find the best combination of parameters for SGD, we repeat the same process by running an analogous set of experiments with parameters T_0 ∈ {10^3, 10^4, 10^5, 10^6}, L ∈ {10^2, 10^3, 10^4, 10^5}, and a grid of initial step sizes ε_0 including 10^{-2} and 10^{-1}. A comparison of different combinations for fixed values of T_0 shows that the best settings for SGD are (T_0, L, ε_0) = (10^3, 10^2, 0.025), (10^4, 10^3, 0.075), (10^5, 10^2, 0.075), and (10^6, 10^3, 0.075). The performance of these parameters is illustrated in Fig. 2. A comparison of these top four combinations of parameters (T_0, L, ε_0) shows that the best performance for SGD after one hour of training is achieved by the parameters (T_0, L, ε_0) = (10^4, 10^3, 0.075).

Observe that all of the top four parameter settings for oLBFGS achieve an objective function value of about F(w_t) = 4.05 after 2400 seconds of training, while the best choice of parameters for SGD, (T_0, L, ε_0) = (10^4, 10^3, 0.075), achieves an objective function value of F(w_t) = 4.42 after running for the same amount of time. Note that each choice of parameters for oLBFGS reaches this best final SGD objective function value of F(w_t) = 4.42 in just 1150 seconds. In the following section, we use the optimal choices of parameters for oLBFGS and SGD to train the classifier for a longer period of time, with the expectation that we will observe an even larger gap between the performance of these algorithms.

4.2 Full Experiments

Recall that the optimal combinations of oLBFGS and SGD parameters after training for 2400 seconds are (T_0, L, ε_0) = (10^2, 10^4, 0.5) and (T_0, L, ε_0) = (10^4, 10^3, 0.075), respectively. With these parameter settings, each algorithm is run until 40 million feature vectors have been processed, corresponding to roughly one pass over our adjusted training set. We then perform one additional pass over the training set with the best pairing of algorithm and parameters to obtain an approximation of the optimal classifier w*. Since we use different values of L for SGD and oLBFGS, we plot the relative distance to optimality F(w_t) - F(w*) versus Lt, where the latter quantity is the number of processed feature vectors after iteration t. Figure 3 illustrates the objective function value relative to the approximate optimal value, F(w_t) - F(w*), versus the number of feature vectors processed for oLBFGS and SGD under their best respective parameter settings.
After processing Lt = 4 × 10^7 feature vectors (requiring t = 4 × 10^3 iterations), the relative objective function value for oLBFGS is F(w_t) - F(w*) = 0.11, while SGD remains noticeably farther from the optimal objective function value after processing the same Lt = 4 × 10^7 feature vectors (requiring t = 4 × 10^4 iterations). Notice that after processing just Lt = 2.5 × 10^6 feature vectors, or about 6% of the training set, oLBFGS achieves the same relative objective function value that SGD does after processing the entire data set. The results in Figure 3 also allow us to compare the performance of oLBFGS and SGD in terms of the number of iterations t. After t = 4 × 10^3 iterations the relative objective function value for oLBFGS is F(w_t) - F(w*) = 0.11, while SGD reaches an objective function accuracy of F(w_t) - F(w*) = 1.16 after the same number of iterations.

Figure 4 shows the same metric, but with cumulative run time on the x-axis as opposed to the number of feature vectors processed. Under the chosen parameters, oLBFGS uses a mini-batch size of L = 10^4, which is 10 times as large as the mini-batch size of 10^3 used for SGD. We recall from above that the dense vector operations involving w dominate the run time of both oLBFGS and SGD due to the sparsity of the data set. Since oLBFGS performs 4τ + 1 = 21 such operations per iteration versus 2 for SGD, the time per iteration for oLBFGS is approximately 10 times as long as it is for SGD. Together, these factors roughly cancel out, meaning we can draw nearly the same conclusions for run time that we could for the processed feature vector count. Specifically, after 9 hours of run time, the relative objective function values for oLBFGS and SGD are 0.12 and 0.80, respectively. Moreover, oLBFGS reaches in 40 minutes the level of performance exhibited by SGD after the full 9 hours of processing.

Figure 3: Regularized negative log-likelihood relative to that of the optimal classifier, F(w_t) - F(w*), versus the number of processed feature vectors Lt for oLBFGS and SGD. Note that oLBFGS achieves the same relative objective function value after processing 2.5 million feature vectors that SGD does after processing 40 million.

Figure 4: Regularized negative log-likelihood relative to that of the optimal classifier, F(w_t) - F(w*), versus run time (hours) for oLBFGS and SGD over 9 hours of training. Note that oLBFGS achieves the same relative objective function value as SGD in roughly an order of magnitude less time.

To better understand the advantages of oLBFGS relative to SGD, we also compare their prediction error on the test set, where prediction error is defined to be the average magnitude of the difference between the logistic regression output and the true label (substituting 0 for -1), taking on values between 0 and 1. More formally, the prediction error of a weight vector w is defined as

e(w) = \frac{1}{n} \sum_{(x_i, y_i) \in S} \left| \frac{1 + y_i}{2} - \frac{1}{1 + \exp(-w^T x_i)} \right|.  (10)

Figure 5 illustrates the prediction error versus the number of feature vectors processed for oLBFGS and SGD after processing Lt = 4 × 10^7 feature vectors, which is equivalent to t = 4 × 10^3 iterations for oLBFGS and t = 4 × 10^4 iterations for SGD. As shown in Figure 5, the prediction error for SGD after processing Lt = 4 × 10^7 samples of the training set is e = 0.38, while the error of the classifier trained by oLBFGS is e = 0.26 after processing the same number of feature vectors. Moreover, oLBFGS makes fairly steady progress towards its final prediction error of 0.26 over the course of the run, but SGD oscillates around an error of e = 0.4 without showing any convergent tendencies. The gap between the performance of oLBFGS and SGD is more significant if we compare the prediction error versus the number of iterations t: the prediction error of oLBFGS after t = 4 × 10^3 iterations is e = 0.26, while after the same number of iterations SGD still oscillates around an error of e ≈ 0.4.

Figure 6 instead shows prediction error versus run time. Due to the previously discussed run time considerations, this plot largely shows the same information that Figure 5 contains. To be more precise, after 9 hours of run time, the prediction errors for SGD and oLBFGS are e = 0.39 and e = 0.27, respectively.
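A short sketch of the evaluation metrics used here, covering the prediction error of (10) and the thresholded classification accuracy discussed later (our illustration; dense arrays are used for simplicity):

```python
import numpy as np

def logistic_output(w, X):
    """CTR estimates 1 / (1 + exp(-w^T x)) for each row of X."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def prediction_error(w, X, y):
    """Mean absolute difference between the logistic output and the 0/1 label,
    as in (10); y is given in {-1, +1} and mapped to {0, 1}."""
    return np.mean(np.abs((1 + y) / 2.0 - logistic_output(w, X)))

def classification_accuracy(w, X, y):
    """Fraction of test points whose thresholded output matches the label."""
    predicted = np.where(logistic_output(w, X) >= 0.5, 1.0, -1.0)
    return np.mean(predicted == y)
```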

Figure 5: Mean prediction error versus the number of processed feature vectors for oLBFGS and SGD. The prediction error for oLBFGS decreases consistently over the course of the experiment, whereas the prediction error for SGD oscillates and fails to exhibit steady improvement.

Figure 6: Mean prediction error versus run time (hours) for oLBFGS and SGD over 9 hours of training. The prediction error for oLBFGS decreases consistently over the course of the experiment, whereas the prediction error for SGD oscillates and fails to exhibit steady improvement.

Moreover, oLBFGS reaches in 14 minutes the level of performance exhibited by SGD after the full 9 hours of processing, and remains below this level from 1 hour 41 minutes onwards. In addition, the classifier obtained using oLBFGS achieves a higher area-under-curve (AUC) score on the validation set than the classifier obtained using SGD, with the AUC score of the initial weight vector w_0 serving as a reference point. Lastly, if we map logistic outputs between 0 and 1/2 to a predicted label of -1 and outputs between 1/2 and 1 to a predicted label of 1, oLBFGS obtains a test set classification accuracy of 77.56%, whereas SGD obtains an accuracy of just 63.18%. These results all lend credence to the theoretical claims of the superiority of oLBFGS over SGD, demonstrating that oLBFGS exhibits faster convergence than SGD and requires fewer samples to converge to a classifier of equivalent accuracy.

5. Conclusion

To conclude, we demonstrate that online limited-memory BFGS (oLBFGS) is empirically superior to stochastic gradient descent (SGD) for the click-through rate prediction problem. We validate this claim through a comparison of these algorithms across the iterations, feature vectors, and run time required for convergence, using the evaluation metrics of log-likelihood, continuous prediction error, discrete accuracy, and AUC. In future work, we intend to investigate the use of oLBFGS on even larger data sets in order to evaluate the effects of scale on its performance.

References

J. R. Birge, X. Chen, L. Qi, and Z. Wei. A stochastic Newton method for stochastic quadratic programs with recourse. Technical report, University of Michigan, Ann Arbor, MI, 1995.

L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, Physica-Verlag HD, 2010.

R. H. Byrd, J. Nocedal, and Y. Yuan. Global convergence of a class of quasi-Newton methods on convex problems. SIAM Journal on Numerical Analysis, 24(5), October 1987.

D. Chakrabarti, D. Agarwal, and V. Josifovski. Contextual advertising by combining relevance with click feedback. In Proceedings of the 17th International Conference on World Wide Web, ACM, 2008.

T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 13-20, 2010.

D. W. Hosmer, Jr. and S. Lemeshow. Applied Logistic Regression. John Wiley & Sons, 2004.

J. E. Dennis, Jr. and J. J. Moré. A characterization of superlinear convergence and its application to quasi-Newton methods. Mathematics of Computation, 28(126), 1974.

D. G. Kleinbaum and M. Klein. Logistic Regression: A Self-Learning Text. Springer, 2010.

J. Konecny and P. Richtarik. Semi-stochastic gradient descent methods. arXiv preprint, 2013.

N. LeRoux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. arXiv preprint, 2012.

D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3), 1989.

H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013.

T. M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 1997.

A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Transactions on Signal Processing, 62(23), December 2014a.

A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. arXiv preprint, 2014b.

A. Nemirovski, A. Juditsky, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 2009.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, New York, NY, 2nd edition, 1999.

M. J. D. Powell. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Academic Press, London, UK, 1971.

M. Regelson and D. Fain. Predicting click-through rate using keyword clusters. In Proceedings of the Second Workshop on Sponsored Search Auctions, 2006.

M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, ACM, 2007.

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013.

N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-Newton method for online convex optimization. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), Society for Artificial Intelligence and Statistics, 2007.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, ACM, 2007.

G. Sun. KDD Cup 2012, Track 2: Soso.com advertisement prediction challenge. 2014. Accessed August 1, 2014.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004.


A Stochastic Quasi-Newton Method for Online Convex Optimization A Stochastic Quasi-Newton Method for Online Convex Optimization Nicol N. Schraudolph nic.schraudolph@nicta.com.au Jin Yu jin.yu@rsise.anu.edu.au Statistical Machine Learning, National ICT Australia Locked

More information

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Improving the Convergence of Back-Propogation Learning with Second Order Methods the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Deep Learning & Neural Networks Lecture 4

Deep Learning & Neural Networks Lecture 4 Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly

More information

Stochastic Optimization Methods for Machine Learning. Jorge Nocedal

Stochastic Optimization Methods for Machine Learning. Jorge Nocedal Stochastic Optimization Methods for Machine Learning Jorge Nocedal Northwestern University SIAM CSE, March 2017 1 Collaborators Richard Byrd R. Bollagragada N. Keskar University of Colorado Northwestern

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Why should you care about the solution strategies?

Why should you care about the solution strategies? Optimization Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms makes you more effectively choose which algorithm to run Understanding the

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Second-Order Methods for Stochastic Optimization

Second-Order Methods for Stochastic Optimization Second-Order Methods for Stochastic Optimization Frank E. Curtis, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

Conjugate Directions for Stochastic Gradient Descent

Conjugate Directions for Stochastic Gradient Descent Conjugate Directions for Stochastic Gradient Descent Nicol N Schraudolph Thore Graepel Institute of Computational Science ETH Zürich, Switzerland {schraudo,graepel}@infethzch Abstract The method of conjugate

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms * Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms

More information

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday! Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification

A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification JMLR: Workshop and Conference Proceedings 1 16 A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification Chih-Yang Hsia r04922021@ntu.edu.tw Dept. of Computer Science,

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Dreem Challenge report (team Bussanati)

Dreem Challenge report (team Bussanati) Wavelet course, MVA 04-05 Simon Bussy, simon.bussy@gmail.com Antoine Recanati, arecanat@ens-cachan.fr Dreem Challenge report (team Bussanati) Description and specifics of the challenge We worked on the

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China) Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu

More information

CS260: Machine Learning Algorithms

CS260: Machine Learning Algorithms CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

IFT Lecture 7 Elements of statistical learning theory

IFT Lecture 7 Elements of statistical learning theory IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and

More information

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017 CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Algorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Language Modeling III Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Office hours on website but no OH for Taylor until next week. Efficient Hashing Closed address

More information

Parallel Coordinate Optimization

Parallel Coordinate Optimization 1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a

More information

Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

Sub-Sampled Newton Methods

Sub-Sampled Newton Methods Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1

More information