L1-Regularized Continuous Conditional Random Fields


Xishun Wang^1, Fenghui Ren^1, Chen Liu^2, and Minjie Zhang^1

1 School of Computing and Information Technology, University of Wollongong, Australia
xw357@uowmail.edu.au, fren@uow.edu.au, minjie@uow.edu.au
2 School of Science, RMIT University, Australia
s @student.rmit.edu.au

Abstract. Continuous Conditional Random Fields (CCRF) has been widely applied to various research domains as an efficient approach to structural regression. In previous studies, the weights of CCRF were constrained to be positive from a theoretical perspective. This paper extends the definition domains of the weights of CCRF and thus introduces the L1 norm to regularize CCRF, which enables CCRF to perform feature selection. We provide a practical learning method for L1-Regularized CCRF (L1-CCRF) and verify its effectiveness. Moreover, we demonstrate that the proposed L1-CCRF performs well in selecting key features related to various customers' power usages in a Smart Grid.

Keywords: Continuous Conditional Random Fields; Regularization; Feature selection

1 Introduction

Conditional Random Fields (CRF) [5] is an efficient structural learning tool which has been used in image recognition, natural language processing, bioinformatics and other areas. CRF is an undirected graphical model that supplies a flexible structural learning framework. There are two kinds of potentials in a CRF: state potentials and edge potentials. The state potentials model how the inner states are influenced by a series of outside factors, while the edge potentials model the interactions of the inside states (under the influence of the outside factors). With these two families of potentials, CRF is capable of simultaneously modeling outside influences and inside interactions for the structural outputs to be predicted. Owing to these advantages, CRF has advanced various research domains in computational intelligence.

In 2009, Qin et al. [9] proposed Continuous CRF (CCRF), which extends CRF to output continuous values. Since then, CCRF has been widely applied to different research fields. Qin et al. [9] applied CCRF to learning to rank and gained performance superior to the baseline learning-to-rank algorithms.

Baltrusaitis et al. [2] used CCRF as a structural regression tool for expression recognition: a facial expression was defined by four measurements, and CCRF was employed to regress an unseen expression onto these four measurements. Xin et al. [11] developed a multi-scale CCRF to build a social recommendation framework. Guo [3] used CCRF for energy load forecasting (the usage of electricity and gas) of a building; his experimental results demonstrated that CCRF performed better than state-of-the-art regression methods.

In previous work [2,9], the weight parameters of CCRF were stipulated to be positive from a theoretical point of view. These constraints in fact limit the applications of CCRF. In real-world problems, some features (in the whole feature set) may not be related to the final predictions, or there may even be sparse-feature problems in which the predictions are determined by a small fraction of all the features. For such problems, feature selection should be done before using CCRF. To surmount the limitation of CCRF caused by the positive weights, we do not constrain the weights of CCRF, arguing from a practical perspective. In this scenario, the L1 norm can be introduced to regularize CCRF, which enables CCRF to select effective features. However, as the L1 norm is not differentiable at zero, it brings extra difficulty into L1-CCRF learning. We therefore provide a practical learning method for the proposed L1-CCRF using the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm [1].

The paper has three major contributions:

1. The definition domains of the weights of CCRF are extended from a practical view. This enables CCRF to perform feature selection with an L1 norm and consequently extends the applicable areas of CCRF.
2. We provide a practical learning method for L1-CCRF and demonstrate its effectiveness. Experimental results show that the new learning method is more efficient than the previous learning method.
3. Experimental results demonstrate that the proposed L1-CCRF is effective for feature selection in the power market domain.

The rest of the paper is organised as follows. Section 2 gives a brief introduction to CCRF. Section 3 proposes L1-CCRF and describes its learning and inference processes. Experiments are reported in Section 4 to verify the learning process of L1-CCRF and its feature selection capacity. Finally, the conclusion is given in Section 5.

2 Introduction to CCRF

Continuous Conditional Random Fields (CCRF) [9] was initially proposed for structural regression; it extends CRF to output real-valued predictions. The chain-structured CCRF, as illustrated in Figure 1, is widely used. Assume X = {x_1, x_2, ..., x_m} is the given sequence of observations, and Y = {y_1, y_2, ..., y_n} is the value sequence to be predicted. CCRF defines the conditional probability P(Y|X) in Equation 1,

P(Y|X) = \frac{1}{Z(X)} \exp(\Psi),    (1)

where \Psi is the energy function and Z(X) is the partition function that normalizes P(Y|X).

Fig. 1. An illustration of a CCRF with a chain structure: node potentials connect each output y_i to the observations X, and edge potentials connect adjacent outputs y_{i-1}, y_i, y_{i+1}.

The energy function \Psi is further defined as

\Psi = \sum_i \sum_{k=1}^{K_1} \alpha_k f_k(y_i, X) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j, X),    (2)

where the function f_k(y_i, X) is called a node potential, the function g_k(y_i, y_j, X) is called an edge potential, and \alpha_k and \beta_k are the corresponding weight parameters. In the energy function, the node potentials capture the associations between inputs and outputs, and the edge potentials capture the interactions between related outputs. The partition function Z(X) is defined in Equation 3:

Z(X) = \int_Y \exp(\Psi) \, dY.    (3)

CCRF explicitly defines P(Y|X), which means that Y is determined by the whole observation X. Therefore, CCRF gains the capacity of considering the whole observed sequence for the output.

To learn a CCRF model, maximum log-likelihood is used to find the best-fitting weights \alpha and \beta. Given training data D = \{(X_q, Y_q)\}_{q=1}^{Q}, where Q is the total number of training samples, the log-likelihood L(\alpha, \beta) is maximized:

(\hat{\alpha}, \hat{\beta}) = \arg\max_{(\alpha,\beta)} L(\alpha, \beta),    (4)

where

L(\alpha, \beta) = \sum_{q=1}^{Q} \log P(Y_q | X_q).    (5)

After the weights \alpha and \beta are obtained, inference for a CCRF is to find the most likely values Y_k for an observed sequence X_k:

\hat{Y}_k = \arg\max_{Y_k} P(Y_k | X_k).    (6)

In practical use, we usually use a vector y to represent the array of values to be predicted, and a matrix X to represent the array of features.
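As a concrete illustration of the chain CCRF energy in Equation 2, the following minimal Python sketch (ours, not the authors' implementation) evaluates \Psi for given weights; the particular feature functions are placeholders chosen to match the quadratic potentials used later in Equation 7.

```python
import numpy as np

def chain_ccrf_energy(y, X, alpha, beta, node_feats, edge_feats):
    """Evaluate the energy Psi of a chain-structured CCRF (Equation 2).

    y          : (n,) array of output values y_1..y_n
    X          : observation matrix, passed through to the feature functions
    alpha      : (K1,) node-potential weights
    beta       : (K2,) edge-potential weights
    node_feats : list of K1 functions f_k(i, y, X) -> float
    edge_feats : list of K2 functions g_k(i, j, y, X) -> float
    """
    n = len(y)
    psi = 0.0
    # Node potentials: sum_i sum_k alpha_k * f_k(y_i, X)
    for i in range(n):
        for k, f_k in enumerate(node_feats):
            psi += alpha[k] * f_k(i, y, X)
    # Edge potentials over adjacent outputs in the chain:
    # sum_{(i,j)} sum_k beta_k * g_k(y_i, y_j, X)
    for i in range(n - 1):
        j = i + 1
        for k, g_k in enumerate(edge_feats):
            psi += beta[k] * g_k(i, j, y, X)
    return psi

# Quadratic potentials of the kind used later in the paper (Equation 7).
node_feats = [lambda i, y, X: -(y[i] - X[i, 0]) ** 2]
edge_feats = [lambda i, j, y, X: -(y[i] - y[j]) ** 2]

y = np.array([0.2, 0.4, 0.5])
X = np.array([[0.1], [0.5], [0.6]])
print(chain_ccrf_energy(y, X, alpha=[1.0], beta=[0.5],
                        node_feats=node_feats, edge_feats=edge_feats))
```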

Radosavljevic et al. [10] have shown that P(y|X) with quadratic potentials can be transformed into a multivariate Gaussian, which facilitates the learning and inference processes. The quadratic form is widely used in CCRF, and consequently a commonly used partition function is

Z(X) = \int_y \exp\Big( \sum_i \sum_{k=1}^{K_1} -\alpha_k (y_i - X_{i,k})^2 + \sum_{i,j} \sum_{k=1}^{K_2} -\delta_k \beta_k (y_i - y_j)^2 \Big) \, dy.    (7)

In Equation 7, \delta_k is an indicator function: when a certain assertion holds, \delta_k takes the value 1, and otherwise it is 0. In this scenario, P(y|X) can be transformed into a multivariate Gaussian form, resulting in a concise inference process [10]. This is further shown in Subsection 3.3.

3 L1-Regularized Continuous Conditional Random Fields

We propose L1-Regularized Continuous Conditional Random Fields (L1-CCRF) in this section. We first extend the definition domains of the weights of CCRF from the perspective of practical use. As a consequence, the L1 norm can be introduced to regularize CCRF. We then provide a practical learning method for L1-CCRF based on the OWL-QN algorithm [1].

3.1 Introducing the L1 norm to regularize CCRF

We first extend the definition domains of the weights of CCRF. In previous CCRF [9,10], the partition function had the form shown in Equation 7. In this equation, we do not pay attention to any specific parameters but focus on the quadratic terms. When the variables X and y are defined on infinite domains, both \alpha and \beta are required to be positive to ensure that the partition function is integrable. However, in practical use, data preprocessing is a necessary step before learning: the observed features X are preprocessed to lie within a certain domain, and y is also targeted to a finite range. Thus, the partition function is integrable regardless of the domains of \alpha and \beta. Therefore, we do not have to constrain \alpha and \beta.
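This finite-domain argument can be checked numerically. The sketch below (our illustration, with arbitrary values) integrates a single quadratic node term over a bounded output range and shows that the result stays finite even for a negative weight, whereas over an infinite range a negative weight would make the integral diverge.

```python
import numpy as np

# A single node term exp(-alpha * (y - x)^2), integrated over the output range.
# After preprocessing, y is restricted to a bounded interval (here [0, 1]),
# so the integral stays finite for any real alpha, including negative values.
x = 0.3                              # an arbitrary preprocessed feature value
y = np.linspace(0.0, 1.0, 10001)
dy = y[1] - y[0]

for alpha in (2.0, -2.0):
    z = np.sum(np.exp(-alpha * (y - x) ** 2)) * dy   # simple Riemann sum
    print(f"alpha = {alpha:+.1f}  ->  bounded-domain partition term = {z:.4f}")

# Over an infinite domain the alpha = -2.0 term would diverge, which is why
# earlier CCRF formulations constrained the weights to be positive.
```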

In machine learning, regularization is commonly used in the learning process to achieve a model that generalizes to unseen data. L2-norm regularization has been used in [2,3,10] for learning CCRF. The L1-norm regularizer is theoretically studied by Ng [8], and in practice it achieves roughly the same accuracy as the L2-norm regularizer [6]. Besides, the L1 norm has the favorable property of being able to select effective features, which can be utilized to analyze customer behaviours in our research. Therefore, we introduce the L1 norm to regularize CCRF in the learning procedure. We introduce \lambda = \langle \alpha, \beta \rangle to compactly represent the weights. The objective function to be minimized for L1-CCRF is given in Equation 8,

F(\lambda) = L(\lambda) + \rho \|\lambda\|_1,    (8)

where \|\cdot\|_1 denotes the L1 norm. In the objective function, the first term is the loss function, which is the negative log-likelihood of the training set (see Equation 5), while the second term is the L1 norm of \lambda, used as a regularization term. The parameter \rho balances the loss and the regularization term.

3.2 Learning L1-CCRF

As the L1 norm is not differentiable at zero, the previous learning method for CCRF [9,10] no longer applies to L1-CCRF learning. Special methods have been proposed to tackle learning with an L1-norm regularizer [1,12]. Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), proposed by Andrew and Gao [1], has been shown to be an advantageous algorithm for L1-regularized log-linear models [6]. We therefore introduce the OWL-QN algorithm to learn an L1-CCRF.

For the purpose of comparison and experimental evaluation, we briefly review the previous learning process for CCRF. In CCRF, each weight parameter \lambda_i is constrained to be positive. The authors of [9,10] maximize L(\lambda) with respect to \log \lambda_i; with this transform, the constrained optimization problem becomes unconstrained. As a consequence, the Stochastic Gradient Descent (SGD) [13] algorithm can be used to optimize the unconstrained problem. In CCRF learning using SGD, each iteration takes two major steps: 1) compute the gradient \nabla_{\log \lambda_i} (the gradient with respect to \log \lambda_i) of the objective P(y|X) on a random training sample; 2) update the weight parameter: \log \lambda_i \leftarrow \log \lambda_i + \eta \nabla_{\log \lambda_i}, where \eta is the learning rate. After T iterations, SGD outputs the optimized weights \lambda.
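The log-reparameterization used by this earlier CCRF learning scheme can be sketched as follows; this is our illustration with a toy stand-in gradient, not the CCRF likelihood itself. Optimizing with respect to log λ_i keeps every weight positive by construction.

```python
import numpy as np

def sgd_log_space(grad_loglik_sample, samples, lam0, eta=0.01, T=100, rng=None):
    """Previous CCRF learning: SGD on log(lambda) keeps all weights positive.

    grad_loglik_sample(lam, sample) -> d logP / d lambda for one training sample.
    """
    rng = rng or np.random.default_rng(0)
    log_lam = np.log(lam0)
    for _ in range(T):
        sample = samples[rng.integers(len(samples))]
        lam = np.exp(log_lam)
        # Chain rule: d logP / d log(lambda_i) = lambda_i * d logP / d lambda_i
        grad_log = lam * grad_loglik_sample(lam, sample)
        log_lam += eta * grad_log        # gradient ascent on the log-likelihood
    return np.exp(log_lam)

# Toy stand-in gradient (not the CCRF likelihood): pushes lambda toward the sample.
toy_grad = lambda lam, s: -(lam - s)
samples = [np.array([1.0, 0.5]), np.array([1.2, 0.4])]
print(sgd_log_space(toy_grad, samples, lam0=np.array([0.3, 0.3])))
```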

For L1-CCRF learning, we introduce the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm, which extends the L-BFGS [7] algorithm to optimize convex functions with an L1-norm term. Quasi-Newton algorithms gain a second-order convergence rate at a small computational cost: they construct an approximation of the second-order Taylor expansion of the objective function and then minimize the approximation, where the Hessian matrix is built from first-order information gathered in previous steps. OWL-QN, which modifies L-BFGS, is motivated by the following basic idea. For the L1 norm, once the orthant is fixed, the sign of each coordinate is determined and the L1 term becomes differentiable; furthermore, the L1 norm contributes nothing to the Hessian, which can therefore be approximated from the loss term alone. Thus, OWL-QN in effect imitates L-BFGS steps within a chosen orthant.

The process of using OWL-QN to learn L1-CCRF is briefly described as follows. OWL-QN iteratively updates \lambda to obtain an optimized value. In the k-th iteration, OWL-QN first computes the pseudo-gradient of F(\lambda) as a basis for choosing the appropriate orthant and search direction, then chooses an orthant \xi^k, computes the search direction p^k, and searches for the next point \lambda^{k+1}. In each step, a displacement s_k = \lambda^{k+1} - \lambda^k and the change in gradient r_k = \nabla L(\lambda^{k+1}) - \nabla L(\lambda^k) are recorded. The previous m displacements and gradient changes are used to construct an approximation H_k of the inverse Hessian of F(\lambda), which is the key to computing the search direction. After T iterations, OWL-QN converges and outputs the optimal weights \lambda. The learning procedure for L1-CCRF using OWL-QN is summarized in Algorithm 1.

Algorithm 1: L1-CCRF learning using OWL-QN
Input: training samples D = \{(X_q, Y_q)\}_{q=1}^{Q}
Output: weight parameter \lambda
1: Initialize: initial point \lambda^0; S \leftarrow \{\}, R \leftarrow \{\}
2: for k = 0 to T do
3:   Compute the pseudo-gradient of F(\lambda^k)
4:   Choose an orthant \xi^k
5:   Construct H_k using S and R
6:   Compute the search direction p^k
7:   Find \lambda^{k+1} with a constrained line search
8:   if the termination condition is satisfied then
9:     Stop and return \lambda^{k+1}
10:  end if
11:  Update S with s_k = \lambda^{k+1} - \lambda^k
12:  Update R with r_k = \nabla L(\lambda^{k+1}) - \nabla L(\lambda^k)
13: end for

Before explaining Algorithm 1, we introduce two special functions [1] for convenience. The first is the sign function \sigma: \sigma(x) takes a value in \{-1, 0, 1\} according to whether x is negative, zero or positive. The second is the projection function \pi: \pi(\cdot; y): R^n \to R^n is parameterized by y \in R^n, where

\pi_i(x; y) = \begin{cases} x_i & \text{if } \sigma(x_i) = \sigma(y_i) \\ 0 & \text{otherwise} \end{cases}    (9)

It can be interpreted as projecting x onto the orthant defined by y.

In Algorithm 1, Step 1 chooses an initial \lambda and initializes the sets S and R. Steps 2-13 form the main iteration loop. Step 3 calculates the pseudo-gradient of F(\lambda) at \lambda, denoted \diamond F(\lambda), according to

\diamond_i F(\lambda) = \begin{cases} \partial_i^- F(\lambda) & \text{if } \partial_i^- F(\lambda) > 0 \\ \partial_i^+ F(\lambda) & \text{if } \partial_i^+ F(\lambda) < 0 \\ 0 & \text{otherwise} \end{cases}    (10)

where

\partial_i^{\pm} F(\lambda) = \frac{\partial}{\partial \lambda_i} L(\lambda) + \begin{cases} \rho\,\sigma(\lambda_i) & \text{if } \lambda_i \neq 0 \\ \pm\rho & \text{if } \lambda_i = 0 \end{cases}    (11)

In Equation 11, the term \partial L(\lambda)/\partial \lambda_i is derived with respect to the specific model.
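The sign function, the projection of Equation 9 and the pseudo-gradient of Equations 10-11 translate almost directly into code. The sketch below is our rendering of those definitions; the loss gradient is supplied externally and the toy quadratic used at the end is only a stand-in.

```python
import numpy as np

def sigma(x):
    """Sign function: -1, 0, or +1 elementwise."""
    return np.sign(x)

def project(x, y):
    """pi(x; y): zero out coordinates of x that leave the orthant defined by y (Eq. 9)."""
    return np.where(sigma(x) == sigma(y), x, 0.0)

def pseudo_gradient(lam, grad_loss, rho):
    """Pseudo-gradient of F(lambda) = loss + rho * ||lambda||_1 (Eqs. 10-11)."""
    g = grad_loss(lam)
    left  = g + np.where(lam != 0, rho * sigma(lam), -rho)  # left partial derivative
    right = g + np.where(lam != 0, rho * sigma(lam),  rho)  # right partial derivative
    pg = np.zeros_like(lam)
    pg = np.where(left > 0, left, pg)
    pg = np.where(right < 0, right, pg)
    return pg

# Toy check with a quadratic loss gradient (a stand-in for the CCRF loss).
grad = lambda lam: lam - np.array([1.0, -0.2, 0.0])
print(pseudo_gradient(np.array([0.5, 0.0, 0.0]), grad, rho=0.1))
```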

In Step 4, an orthant \xi^k is chosen based on \diamond F(\lambda):

\xi_i^k = \begin{cases} \sigma(\lambda_i^k) & \text{if } \lambda_i^k \neq 0 \\ \sigma(-\diamond_i F(\lambda^k)) & \text{if } \lambda_i^k = 0 \end{cases}    (12)

Step 5 constructs the approximate inverse Hessian H_k in the same way as traditional L-BFGS [7], so the construction is not repeated here. Step 6 then determines the search direction p^k, formulated as

p^k = \pi(H_k v^k; v^k),    (13)

where v^k = -\diamond F(\lambda^k). Steps 7-10 find the next point \lambda^{k+1} using a constrained line search, in which each explored point is projected back onto the chosen orthant: \lambda^{k+1} = \pi(\lambda^k + \alpha p^k; \xi^k), where \alpha here controls the search step. Steps 11 and 12 update the sets S and R, respectively.

The proposed learning method for L1-CCRF has a semi-second-order convergence rate in theory. We will show that it converges faster than the previous CCRF learning method in Subsection 4.1.

3.3 Inference

To make a prediction with a learned L1-CCRF model, we find the most likely y given the observed features X, as formulated in Equation 6. We first derive P(y|X) in multivariate Gaussian form:

P(y|X) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (y - \mu(X))^T \Sigma^{-1} (y - \mu(X)) \Big)    (14)

In this Gaussian form, the inverse of the covariance matrix, \Sigma^{-1}, is the sum of two n \times n matrices:

\Sigma^{-1} = 2(M^1 + M^2), \quad
M^1_{i,j} = \begin{cases} \sum_{k=1}^{K_1} \alpha_k & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}, \quad
M^2_{i,j} = \begin{cases} \sum_{j=1}^{n} \sum_{k=1}^{K_2} \delta_k \beta_k - \sum_{k=1}^{K_2} \delta_k \beta_k & \text{if } i = j \\ -\sum_{k=1}^{K_2} \delta_k \beta_k & \text{if } i \neq j \end{cases}    (15)

The diagonal matrix M^1 represents the contribution of the \alpha terms (node potentials), and the symmetric matrix M^2 represents the contribution of the \beta terms (edge potentials). The mean \mu(X) is computed by

\mu(X) = \Sigma \theta,    (16)

where \theta is an n-dimensional vector whose elements are calculated by

\theta_i = 2 \sum_{k=1}^{K_1} \alpha_k X_{i,k}.    (17)

Benefiting from the multivariate Gaussian form, the inference becomes quite tractable: to maximize P(y|X) in Equation 14, we simply set y equal to \mu(X),

\hat{y} = \arg\max_y P(y|X) = \mu(X) = \Sigma \theta.    (18)
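Under the quadratic potentials of Equation 7 and a chain structure, the Gaussian inference of Equations 14-18 can be sketched as follows. This is our reading of the formulas, simplified to a single effective edge weight, with the chain adjacency playing the role of the indicator δ_k.

```python
import numpy as np

def ccrf_gaussian_inference(X, alpha, beta):
    """Mean prediction of a chain CCRF with quadratic potentials (Eqs. 14-18).

    X     : (n, K1) matrix of node features, one column per node potential
    alpha : (K1,) node weights
    beta  : scalar edge weight shared by adjacent outputs in the chain
    """
    n, K1 = X.shape
    # M1: diagonal contribution of the node potentials (Eq. 15).
    M1 = np.eye(n) * np.sum(alpha)
    # M2: chain-Laplacian contribution of the edge potentials (Eq. 15):
    # each adjacent pair (i, i+1) adds beta to both diagonal entries
    # and -beta to the off-diagonal entries.
    M2 = np.zeros((n, n))
    for i in range(n - 1):
        M2[i, i] += beta; M2[i + 1, i + 1] += beta
        M2[i, i + 1] -= beta; M2[i + 1, i] -= beta
    Sigma = np.linalg.inv(2.0 * (M1 + M2))
    theta = 2.0 * X @ np.asarray(alpha)   # theta_i = 2 * sum_k alpha_k X_{i,k}  (Eq. 17)
    return Sigma @ theta                  # y_hat = mu(X) = Sigma * theta        (Eq. 18)

X = np.array([[0.2, 0.3], [0.5, 0.4], [0.7, 0.6]])
print(ccrf_gaussian_inference(X, alpha=np.array([1.0, 0.5]), beta=0.8))
```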

4 Experiment

We conducted two experiments with the proposed L1-CCRF. Experiment 1 evaluated the learning process of L1-CCRF; it aimed to demonstrate that the proposed learning method for L1-CCRF is correct and more efficient than the previous learning method for CCRF. Experiment 2 aimed to verify the feature selection capacity of L1-CCRF; it used L1-CCRF to predict the hourly load of some specific customers in a Smart Grid to demonstrate that L1-CCRF is effective in feature selection and load forecasting. In the following, the two experiments are reported in turn.

4.1 Experiment 1: Learning of L1-CCRF

In this experiment, we compared the proposed learning method for L1-CCRF with conventional CCRF learning. The experimental data are from the work of Baltrusaitis et al. [2], whose source code is publicly available. In their work, CCRF was used as a structural regression tool to regress a test expression onto a four-dimensional measurement defined for facial expressions. Owing to page limits, their model is not introduced here; interested readers may refer to [2]. In their CCRF learning, they used the conventional SGD algorithm (introduced in Subsection 3.2) with L2-norm regularization. We first reproduced their results, then built the same CCRF model and used the proposed L1-CCRF learning algorithm (see Algorithm 1) to learn from the same data. We repeated the dimensional expression recognition experiment using the plain CCRF model, not the CA-CCRF variant [2]. For the experimental settings, the convergence criterion for both methods was that the weight change \Delta\lambda = \|\lambda^{i+1} - \lambda^i\| fall below the same small threshold. The other parameters of the CCRF learning process used the default settings of [2]. In L1-CCRF learning, the parameter \rho was set to 100, chosen by cross-validation. The convergence curves of the two methods for the valence dimension are shown in Figure 2.

We first make some explanatory remarks on Figure 2. The convergence rate is measured by the total number of iterations over all the training samples.

Fig. 2. The convergence curves of CCRF and L1-CCRF learning (log-likelihood versus number of iterations).

In CCRF learning, SGD computes the gradient from one sample at a time, while L1-CCRF computes the gradient from all the training samples. We therefore divided the total number of SGD loops by the number of training samples to obtain an iteration count for CCRF learning that is comparable with the total number of L1-CCRF iterations. CCRF takes 624 iterations to converge, while L1-CCRF reaches convergence in 68 iterations. From Figure 2, we can see that the convergence rate of L1-CCRF is much faster than that of CCRF. From a theoretical point of view, the convergence rate of SGD is first-order, while that of OWL-QN is semi-second-order; thus, the learning process of L1-CCRF takes far fewer iterations than that of CCRF. When converged, the final log-likelihood of CCRF is , and that of L1-CCRF is .

Table 1. The learning performance comparison of CCRF and L1-CCRF

Method   | learning time (seconds) | precision (average correlations)
CCRF     | 1532                    |
L1-CCRF  | 257                     |

We compared the learning time and prediction results of CCRF and L1-CCRF in Table 1. In Table 1, the learning time is measured in seconds. For the prediction results, both methods calculate the correlation of the predicted values with each of the four dimensions of an expression, and we use the average correlation to measure the prediction precision of both methods. On our desktop with a dual-core 3.33 GHz CPU and 4 GB RAM, CCRF learning takes 1532 seconds to reach convergence, while L1-CCRF learning converges in 257 seconds.

For the prediction precision, we can see that the two methods show similar results. The prediction results demonstrate two points: 1) the L1-CCRF learning is correct, and 2) L1-norm and L2-norm regularization result in similar precision. From this experiment, we draw two major conclusions:

1. The convergence rate of L1-CCRF learning is faster than that of CCRF.
2. The L1 norm shows performance similar to the L2 norm in prediction precision.

4.2 Experiment 2: Feature Selection using L1-CCRF

This experiment was conducted on the platform of the Power Trading Agent Competition (Power TAC) [4]. Power TAC has drawn wide attention and has become a benchmark in the Smart Grid research community. We utilized this platform to evaluate the performance of L1-CCRF in feature selection and short-term load forecasting (the future hourly power usage). Power TAC simulates a variety of customers with various behaviours in a Smart Grid. Moreover, it offers rich features, including real-world weather conditions and real-time market status. Besides, the Power TAC server supplies rich logs of the customers' hourly power usages, which are regarded as the ground truth to evaluate the prediction precision of L1-CCRF.

Table 2. Features used in Experiment 2

Feature group    | Content        | Index
Temporal feature | hour of a day  | t1
                 | day of a week  | t2
Weather feature  | temperature    | w1
                 | wind strength  | w2
                 | wind direction | w3
                 | cloudiness     | w4
Market feature   | lowest price   | m1
                 | average price  | m2

Representative features that may relate to the customers' power consumption behaviours are studied in our work. These features include temporal features, weather features and market features. Table 2 lists the contents of the three feature groups; the indexes are used in the later discussion.

We configured the Power TAC server and the weather data server, and utilized Power TAC games to generate training and test data. Six games were run to roughly cover a full year; the server logs, which contained the features and the customers' energy usages, were used as the training data. Another six games were run for a different year, and the logged customers' usages were regarded as the ground truth for the loads to be predicted.
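For illustration, the eight features of Table 2 can be assembled into the hourly design matrix that a chain CCRF over 24 hours would consume. The record format below is a hypothetical encoding of ours, not the Power TAC log format, with made-up values.

```python
import numpy as np

# Hypothetical hourly records carrying the eight features of Table 2
# (t1, t2, w1..w4, m1, m2); values here are invented for illustration only.
FEATURE_KEYS = ["t1", "t2", "w1", "w2", "w3", "w4", "m1", "m2"]

def build_design_matrix(hourly_records):
    """Stack one 8-feature row per hour -> X of shape (24, 8)."""
    return np.array([[rec[k] for k in FEATURE_KEYS] for rec in hourly_records])

hourly_records = [
    {"t1": h / 23.0, "t2": 3 / 6.0, "w1": 0.6, "w2": 0.2,
     "w3": 0.5, "w4": 0.3, "m1": 0.4, "m2": 0.5}
    for h in range(24)
]
X = build_design_matrix(hourly_records)
print(X.shape)   # (24, 8): 24 hourly outputs, 8 node features each
```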

The L1-CCRF model for load forecasting is set up as follows. L1-CCRF modeled the 24 hourly power usages of a day under the influence of the corresponding 24 sets of hourly features. For the features of each hour, 8 node potentials were generated. There are 23 intervals between adjacent hours in a day, and edge potentials were generated over every interval. To ensure the convergence of L1-CCRF, we set 50 iterations for OWL-QN. Four typical customers were selected to test the performance of L1-CCRF: Customer 1 is a village householder, Customer 2 is a photovoltaic energy generator, Customer 3 is an office building, and Customer 4 is a cold storage company with power storage capacity. We first analyze the feature selection capacity of L1-CCRF. Table 3 shows the effective features selected by L1-CCRF for the four customers.

Table 3. Feature selection for the four typical customers

Customer | t1 | t2 | w1 | w2 | w3 | w4 | m1 | m2
C1       |    |    |    |    |    |    |    |
C2       |    |    |    |    |    |    |    |
C3       |    |    |    |    |    |    |    |
C4       |    |    |    |    |    |    |    |

In Table 3, 1 indicates that the feature is related to the power usage of the customer, and 0 indicates otherwise. Customer 1 is a village user; intuitively, the power usage of Customer 1 may be influenced by time, temperature and market price, and the feature selection result for Customer 1 in Table 3 is quite reasonable according to this intuition. Customer 2, a photovoltaic energy producer, is affected by the hour of the day, the cloudiness and the temperature; the result shown in Table 3 is also correct. Similarly, the feature selection result for Customer 3 is reasonable. Customer 4 is a cold storage company, and its power usage is not likely to be affected by wind strength; in this case, the wind strength feature might have been selected improperly. In contrast, conventional CCRF learning, whose parameters are confined to be positive, cannot perform feature selection at all. Thus, L1-CCRF gains the advantage of building a model that is both concise (using only effective features) and explanatory (in harmony with intuition or common sense).

At the same time, L1-CCRF predicts the hourly power usages of the four customers. The Mean Absolute Percentage Error (MAPE) for each hour is calculated to measure the prediction results. The average MAPEs for the four customers are 7.32%, 5.63%, 6.41% and 3.36%, respectively. L1-CCRF shows good performance in short-term load forecasting.

Two points are verified in this experiment:

1. L1-CCRF is effective in feature selection for customers' power consumption behaviours in the Smart Grid market.
2. L1-CCRF shows good performance in load forecasting.

Thus, the proposed L1-CCRF can serve as an efficient structural regression tool with feature selection capacity.
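The MAPE scores reported above follow the usual definition of Mean Absolute Percentage Error; a minimal sketch of the computation (ours, with made-up hourly loads) is:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error over the hourly forecasts, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Toy example with invented hourly loads (kWh).
print(mape([10.0, 12.0, 9.5, 11.0], [10.5, 11.4, 9.8, 10.6]))
```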

5 Conclusion

This work first used the L1 norm to regularize CCRF and proposed L1-CCRF. The definition domains of the weights of CCRF were extended, so that the L1 norm could be introduced for CCRF regularization in the learning process. We then provided a practical learning method for L1-CCRF, and experimental results demonstrated its correctness and efficiency. We used L1-CCRF for feature selection for the customers in a Smart Grid and demonstrated that it is effective in selecting key features in the power market domain. In a nutshell, the proposed L1-CCRF is effective in learning and feature selection, and we suggest that it can be applied to a wide range of research domains.

References

1. G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning (ICML), ACM, 2007.
2. T. Baltrusaitis, N. Banda, and P. Robinson. Dimensional affect recognition using continuous conditional random fields. In IEEE Automatic Face and Gesture Recognition, pages 1-8, 2013.
3. H. Guo. Accelerated continuous conditional random fields for load forecasting. IEEE Transactions on Knowledge and Data Engineering, 27(8), 2015.
4. W. Ketter, J. Collins, P. Reddy, and M. Weerdt. The 2014 Power Trading Agent Competition. ERIM Report Series Reference No. ERS-LIS, 2014.
5. J. Lafferty. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001), 2001.
6. T. Lavergne, O. Cappé, and F. Yvon. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010.
7. D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503-528, 1989.
8. A. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the 21st International Conference on Machine Learning (ICML), ACM, 2004.
9. T. Qin, T. Liu, X. Zhang, D. Wang, and H. Li. Global ranking using continuous conditional random fields. In Advances in Neural Information Processing Systems, 2008.
10. V. Radosavljevic, S. Vucetic, and Z. Obradovic. Continuous conditional random fields for regression in remote sensing. In ECAI, 2010.

11. X. Xin, I. King, H. Deng, and M. R. Lyu. A social recommendation framework based on multi-scale continuous conditional random fields. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), ACM, 2009.
12. J. Yu, S. Vishwanathan, S. Günter, and N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11, 2010.
13. T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), ACM, 2004.
