L1-Regularized Continuous Conditional Random Fields


Xishun Wang^1, Fenghui Ren^1, Chen Liu^2, and Minjie Zhang^1

1 School of Computing and Information Technology, University of Wollongong, Australia
xw357@uowmail.edu.au, fren@uow.edu.au, minjie@uow.edu.au
2 School of Science, RMIT University, Australia
s @student.rmit.edu.au

Abstract. Continuous Conditional Random Fields (CCRF) has been widely applied to various research domains as an efficient approach to structural regression. In previous studies, the weights of CCRF were constrained to be positive from a theoretical perspective. This paper extends the definition domains of the weights of CCRF and thus introduces the L1 norm to regularize CCRF, which enables CCRF to perform feature selection. We provide a practical learning method for L1-Regularized CCRF (L1-CCRF) and verify its effectiveness. Moreover, we demonstrate that the proposed L1-CCRF performs well in selecting key features related to various customers' power usages in a Smart Grid.

Keywords: Continuous Conditional Random Fields; Regularization; Feature selection

1 Introduction

Conditional Random Fields (CRF) [5] is an efficient structural learning tool which has been used in image recognition, natural language processing, bioinformatics and other areas. CRF is an undirected graphical model that supplies a flexible structural learning framework. There are two kinds of potentials in a CRF: state potentials and edge potentials. The state potentials model how the inner states are influenced by a series of outside factors, while the edge potentials model the interactions of the inside states (under the influence of the outside factors). With these two families of potentials, CRF is capable of simultaneously modeling outside influences and inside interactions for the structural outputs to be predicted. Owing to these advantages, CRF has advanced various research domains in computational intelligence.

In 2009, Qin et al. [9] proposed Continuous CRF (CCRF), which extends CRF to output continuous values. Since then, CCRF has been widely applied to different research fields. Qin et al. [9] applied CCRF to learning to rank and gained performance superior to the baseline learning-to-rank algorithms.

Baltrusaitis et al. [2] used CCRF as a structural regression tool for expression recognition: a facial expression was defined by four measurements, and CCRF was employed to regress an unseen expression onto these four measurements. Xin et al. [11] developed a multi-scale CCRF to build a social recommendation framework. Guo [3] used CCRF for energy load forecasting (the usage of electricity and gas) of a building; his experimental results demonstrated that CCRF performed better than state-of-the-art regression methods.

In previous work [2,9], the weight parameters of CCRF were stipulated to be positive from a theoretical point of view. These constraints in fact limit the applications of CCRF. In real-world problems, some features (in the whole feature set) may not be related to the final predictions, or there may even be sparse-feature problems in which the predictions are determined by a small fraction of all the features. For such problems, feature selection should be done before using CCRF. To surmount the limitation of CCRF caused by the positive weights, we do not constrain the weights of CCRF, arguing from a practical perspective. In this scenario, the L1 norm can be introduced to regularize CCRF, which enables CCRF to select effective features. However, as the L1 norm is not differentiable at zero, it brings extra difficulty into L1-CCRF learning. We therefore provide a practical learning method for the proposed L1-CCRF using the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm [1].

The paper has three major contributions:

1. The definition domains of the weights of CCRF are extended from a practical view. This enables CCRF to perform feature selection with an L1 norm and consequently extends the applicable areas of CCRF.
2. We provide a practical learning method for L1-CCRF and demonstrate its effectiveness. Experimental results show that the new learning method is more efficient than the previous learning method.
3. Experimental results demonstrate that the proposed L1-CCRF is effective for feature selection in the power market domain.

The rest of the paper is organised as follows. Section 2 gives a brief introduction to CCRF. Section 3 proposes L1-CCRF and describes its learning and inference processes. Experiments are reported in Section 4 to verify the learning process of L1-CCRF and its feature selection capacity. Finally, the conclusion is given in Section 5.

2 Introduction to CCRF

Continuous Conditional Random Fields (CCRF) [9] was initially proposed for structural regression; it extends CRF to output real-valued predictions. The chain-structured CCRF, as illustrated in Figure 1, is widely used. Assume X = {x_1, x_2, ..., x_m} is the given sequence of observations, and Y = {y_1, y_2, ..., y_n} is the value sequence to be predicted. CCRF defines the conditional probability P(Y|X) in Equation 1,

P(Y|X) = \frac{1}{Z(X)} \exp(\Psi),    (1)

where \Psi is the energy function and Z(X) is the partition function that normalizes P(Y|X).

Fig. 1. An illustration of a CCRF with a chain structure: node potentials connect each output y_i to the observations X, and edge potentials connect adjacent outputs y_{i-1}, y_i, y_{i+1}.

The energy function \Psi is further defined as

\Psi = \sum_i \sum_{k=1}^{K_1} \alpha_k f_k(y_i, X) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j, X),    (2)

where the function f_k(y_i, X) is called a node potential, the function g_k(y_i, y_j, X) is called an edge potential, and \alpha_k and \beta_k are the corresponding weight parameters. In the energy function, the node potentials capture the associations between inputs and outputs, and the edge potentials capture the interactions between related outputs. The partition function Z(X) is defined in Equation 3:

Z(X) = \int_Y \exp(\Psi) \, dY.    (3)

CCRF explicitly defines P(Y|X), which means that Y is determined by the whole observation X. Therefore, CCRF gains the capacity of considering the whole observed sequence for the output.

To learn a CCRF model, maximum log-likelihood is used to find the best-fitting weights \alpha and \beta. Given training data D = \{(X_q, Y_q)\}_{q=1}^{Q}, where Q is the total number of training samples, the log-likelihood L(\alpha, \beta) is maximized:

(\hat{\alpha}, \hat{\beta}) = \arg\max_{(\alpha,\beta)} L(\alpha, \beta),    (4)

where

L(\alpha, \beta) = \sum_{q=1}^{Q} \log P(Y_q | X_q).    (5)

After the weights \alpha and \beta are obtained, inference for a CCRF is to find the most likely values Y_k for an observed sequence X_k:

\hat{Y}_k = \arg\max_{Y_k} P(Y_k | X_k).    (6)

In practical use, we usually use a vector y to represent the array of values to be predicted, and a matrix X to represent the array of features.
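As a concrete illustration of the chain CCRF energy in Equation 2, the following minimal Python sketch (ours, not the authors' implementation) evaluates \Psi for given weights; the particular feature functions are placeholders chosen to match the quadratic potentials used later in Equation 7.

```python
import numpy as np

def chain_ccrf_energy(y, X, alpha, beta, node_feats, edge_feats):
    """Evaluate the energy Psi of a chain-structured CCRF (Equation 2).

    y          : (n,) array of output values y_1..y_n
    X          : observation matrix, passed through to the feature functions
    alpha      : (K1,) node-potential weights
    beta       : (K2,) edge-potential weights
    node_feats : list of K1 functions f_k(i, y, X) -> float
    edge_feats : list of K2 functions g_k(i, j, y, X) -> float
    """
    n = len(y)
    psi = 0.0
    # Node potentials: sum_i sum_k alpha_k * f_k(y_i, X)
    for i in range(n):
        for k, f_k in enumerate(node_feats):
            psi += alpha[k] * f_k(i, y, X)
    # Edge potentials over adjacent outputs in the chain:
    # sum_{(i,j)} sum_k beta_k * g_k(y_i, y_j, X)
    for i in range(n - 1):
        j = i + 1
        for k, g_k in enumerate(edge_feats):
            psi += beta[k] * g_k(i, j, y, X)
    return psi

# Quadratic potentials of the kind used later in the paper (Equation 7).
node_feats = [lambda i, y, X: -(y[i] - X[i, 0]) ** 2]
edge_feats = [lambda i, j, y, X: -(y[i] - y[j]) ** 2]

y = np.array([0.2, 0.4, 0.5])
X = np.array([[0.1], [0.5], [0.6]])
print(chain_ccrf_energy(y, X, alpha=[1.0], beta=[0.5],
                        node_feats=node_feats, edge_feats=edge_feats))
```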

Radosavljevic et al. [10] have shown that P(y|X) with quadratic potentials can be transformed into a multivariate Gaussian, which facilitates the learning and inference processes. The quadratic form is widely used in CCRF, and consequently a commonly used partition function is

Z(X) = \int_y \exp\Big( \sum_i \sum_{k=1}^{K_1} -\alpha_k (y_i - X_{i,k})^2 + \sum_{i,j} \sum_{k=1}^{K_2} -\delta_k \beta_k (y_i - y_j)^2 \Big) \, dy.    (7)

In Equation 7, \delta_k is an indicator function: when a certain assertion holds, \delta_k takes the value 1, and otherwise it is 0. In this scenario, P(y|X) can be transformed into a multivariate Gaussian form, resulting in a concise inference process [10]. This is further shown in Subsection 3.3.

3 L1-Regularized Continuous Conditional Random Fields

We propose L1-Regularized Continuous Conditional Random Fields (L1-CCRF) in this section. We first extend the definition domains of the weights of CCRF from the perspective of practical use. As a consequence, the L1 norm can be introduced to regularize CCRF. We then provide a practical learning method for L1-CCRF based on the OWL-QN algorithm [1].

3.1 Introducing the L1 norm to regularize CCRF

We first extend the definition domains of the weights of CCRF. In previous CCRF [9,10], the partition function had the form shown in Equation 7. In this equation, we do not pay attention to any specific parameters but focus on the quadratic terms. When the variables X and y are defined on infinite domains, both \alpha and \beta are required to be positive to ensure that the partition function is integrable. However, in practical use, data preprocessing is a necessary step before learning: the observed features X are preprocessed to lie within a certain domain, and y is also targeted to a finite range. Thus, the partition function is integrable regardless of the domains of \alpha and \beta. Therefore, we do not have to constrain \alpha and \beta.
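This finite-domain argument can be checked numerically. The sketch below (our illustration, with arbitrary values) integrates a single quadratic node term over a bounded output range and shows that the result stays finite even for a negative weight, whereas over an infinite range a negative weight would make the integral diverge.

```python
import numpy as np

# A single node term exp(-alpha * (y - x)^2), integrated over the output range.
# After preprocessing, y is restricted to a bounded interval (here [0, 1]),
# so the integral stays finite for any real alpha, including negative values.
x = 0.3                              # an arbitrary preprocessed feature value
y = np.linspace(0.0, 1.0, 10001)
dy = y[1] - y[0]

for alpha in (2.0, -2.0):
    z = np.sum(np.exp(-alpha * (y - x) ** 2)) * dy   # simple Riemann sum
    print(f"alpha = {alpha:+.1f}  ->  bounded-domain partition term = {z:.4f}")

# Over an infinite domain the alpha = -2.0 term would diverge, which is why
# earlier CCRF formulations constrained the weights to be positive.
```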

In machine learning, regularization is commonly used in the learning process to achieve a model that generalizes to unseen data. L2-norm regularization has been used in [2,3,10] for learning CCRF. The L1-norm regularizer is theoretically studied by Ng [8], and in practice it achieves roughly the same accuracy as the L2-norm regularizer [6]. Besides, the L1 norm has the favorable property of being able to select effective features, which can be utilized to analyze customer behaviours in our research. Therefore, we introduce the L1 norm to regularize CCRF in the learning procedure. We introduce \lambda = \langle \alpha, \beta \rangle to compactly represent the weights. The objective function to be minimized for L1-CCRF is given in Equation 8,

F(\lambda) = L(\lambda) + \rho \|\lambda\|_1,    (8)

where \|\cdot\|_1 denotes the L1 norm. In the objective function, the first term is the loss function, which is the negative log-likelihood of the training set (see Equation 5), while the second term is the L1 norm of \lambda, used as a regularization term. The parameter \rho balances the loss and the regularization term.

3.2 Learning L1-CCRF

As the L1 norm is not differentiable at zero, the previous learning method for CCRF [9,10] no longer applies to L1-CCRF learning. Special methods have been proposed to tackle learning with an L1-norm regularizer [1,12]. Orthant-Wise Limited-memory Quasi-Newton (OWL-QN), proposed by Andrew and Gao [1], has been shown to be an advantageous algorithm for L1-regularized log-linear models [6]. We therefore introduce the OWL-QN algorithm to learn an L1-CCRF.

For the purpose of comparison and experimental evaluation, we briefly review the previous learning process for CCRF. In CCRF, each weight parameter \lambda_i is constrained to be positive. The authors of [9,10] maximize L(\lambda) with respect to \log \lambda_i; with this transform, the constrained optimization problem becomes unconstrained. As a consequence, the Stochastic Gradient Descent (SGD) [13] algorithm can be used to optimize the unconstrained problem. In CCRF learning using SGD, each iteration takes two major steps: 1) compute the gradient \nabla_{\log \lambda_i} (the gradient with respect to \log \lambda_i) of the objective P(y|X) on a random training sample; 2) update the weight parameter: \log \lambda_i \leftarrow \log \lambda_i + \eta \nabla_{\log \lambda_i}, where \eta is the learning rate. After T iterations, SGD outputs the optimized weights \lambda.
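The log-reparameterization used by this earlier CCRF learning scheme can be sketched as follows; this is our illustration with a toy stand-in gradient, not the CCRF likelihood itself. Optimizing with respect to log λ_i keeps every weight positive by construction.

```python
import numpy as np

def sgd_log_space(grad_loglik_sample, samples, lam0, eta=0.01, T=100, rng=None):
    """Previous CCRF learning: SGD on log(lambda) keeps all weights positive.

    grad_loglik_sample(lam, sample) -> d logP / d lambda for one training sample.
    """
    rng = rng or np.random.default_rng(0)
    log_lam = np.log(lam0)
    for _ in range(T):
        sample = samples[rng.integers(len(samples))]
        lam = np.exp(log_lam)
        # Chain rule: d logP / d log(lambda_i) = lambda_i * d logP / d lambda_i
        grad_log = lam * grad_loglik_sample(lam, sample)
        log_lam += eta * grad_log        # gradient ascent on the log-likelihood
    return np.exp(log_lam)

# Toy stand-in gradient (not the CCRF likelihood): pushes lambda toward the sample.
toy_grad = lambda lam, s: -(lam - s)
samples = [np.array([1.0, 0.5]), np.array([1.2, 0.4])]
print(sgd_log_space(toy_grad, samples, lam0=np.array([0.3, 0.3])))
```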

For L1-CCRF learning, we introduce the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm, which extends the L-BFGS [7] algorithm to optimize convex functions with an L1-norm term. Quasi-Newton algorithms gain a second-order convergence rate at a small computational cost: they construct an approximation of the second-order Taylor expansion of the objective function and then minimize the approximation, where the Hessian matrix is built from first-order information gathered in previous steps. OWL-QN, which modifies L-BFGS, is motivated by the following basic idea. For the L1 norm, once the orthant is fixed, the sign of each coordinate is determined and the L1 term becomes differentiable; furthermore, the L1 norm contributes nothing to the Hessian, which can therefore be approximated from the loss term alone. Thus, OWL-QN in effect imitates L-BFGS steps within a chosen orthant.

The process of using OWL-QN to learn L1-CCRF is briefly described as follows. OWL-QN iteratively updates \lambda to obtain an optimized value. In the k-th iteration, OWL-QN first computes the pseudo-gradient of F(\lambda) as a basis for choosing the appropriate orthant and search direction, then chooses an orthant \xi^k, computes the search direction p^k, and searches for the next point \lambda^{k+1}. In each step, a displacement s_k = \lambda^{k+1} - \lambda^k and the change in gradient r_k = \nabla L(\lambda^{k+1}) - \nabla L(\lambda^k) are recorded. The previous m displacements and gradient changes are used to construct an approximation H_k of the inverse Hessian of F(\lambda), which is the key to computing the search direction. After T iterations, OWL-QN converges and outputs the optimal weights \lambda. The learning procedure for L1-CCRF using OWL-QN is summarized in Algorithm 1.

Algorithm 1: L1-CCRF learning using OWL-QN
Input: training samples D = \{(X_q, Y_q)\}_{q=1}^{Q}
Output: weight parameter \lambda
1: Initialize: initial point \lambda^0; S \leftarrow \{\}, R \leftarrow \{\}
2: for k = 0 to T do
3:   Compute the pseudo-gradient of F(\lambda^k)
4:   Choose an orthant \xi^k
5:   Construct H_k using S and R
6:   Compute the search direction p^k
7:   Find \lambda^{k+1} with a constrained line search
8:   if the termination condition is satisfied then
9:     Stop and return \lambda^{k+1}
10:  end if
11:  Update S with s_k = \lambda^{k+1} - \lambda^k
12:  Update R with r_k = \nabla L(\lambda^{k+1}) - \nabla L(\lambda^k)
13: end for

Before explaining Algorithm 1, we introduce two special functions [1] for convenience. The first is the sign function \sigma: \sigma(x) takes a value in \{-1, 0, 1\} according to whether x is negative, zero or positive. The second is the projection function \pi: \pi(\cdot; y): R^n \to R^n is parameterized by y \in R^n, where

\pi_i(x; y) = \begin{cases} x_i & \text{if } \sigma(x_i) = \sigma(y_i) \\ 0 & \text{otherwise} \end{cases}    (9)

It can be interpreted as projecting x onto the orthant defined by y.

In Algorithm 1, Step 1 chooses an initial \lambda and initializes the sets S and R. Steps 2-13 form the main iteration loop. Step 3 calculates the pseudo-gradient of F(\lambda) at \lambda, denoted \diamond F(\lambda), according to

\diamond_i F(\lambda) = \begin{cases} \partial_i^- F(\lambda) & \text{if } \partial_i^- F(\lambda) > 0 \\ \partial_i^+ F(\lambda) & \text{if } \partial_i^+ F(\lambda) < 0 \\ 0 & \text{otherwise} \end{cases}    (10)

where

\partial_i^{\pm} F(\lambda) = \frac{\partial}{\partial \lambda_i} L(\lambda) + \begin{cases} \rho\,\sigma(\lambda_i) & \text{if } \lambda_i \neq 0 \\ \pm\rho & \text{if } \lambda_i = 0 \end{cases}    (11)

In Equation 11, the term \partial L(\lambda)/\partial \lambda_i is derived with respect to the specific model.
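The sign function, the projection of Equation 9 and the pseudo-gradient of Equations 10-11 translate almost directly into code. The sketch below is our rendering of those definitions; the loss gradient is supplied externally and the toy quadratic used at the end is only a stand-in.

```python
import numpy as np

def sigma(x):
    """Sign function: -1, 0, or +1 elementwise."""
    return np.sign(x)

def project(x, y):
    """pi(x; y): zero out coordinates of x that leave the orthant defined by y (Eq. 9)."""
    return np.where(sigma(x) == sigma(y), x, 0.0)

def pseudo_gradient(lam, grad_loss, rho):
    """Pseudo-gradient of F(lambda) = loss + rho * ||lambda||_1 (Eqs. 10-11)."""
    g = grad_loss(lam)
    left  = g + np.where(lam != 0, rho * sigma(lam), -rho)  # left partial derivative
    right = g + np.where(lam != 0, rho * sigma(lam),  rho)  # right partial derivative
    pg = np.zeros_like(lam)
    pg = np.where(left > 0, left, pg)
    pg = np.where(right < 0, right, pg)
    return pg

# Toy check with a quadratic loss gradient (a stand-in for the CCRF loss).
grad = lambda lam: lam - np.array([1.0, -0.2, 0.0])
print(pseudo_gradient(np.array([0.5, 0.0, 0.0]), grad, rho=0.1))
```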

In Step 4, an orthant \xi^k is chosen based on \diamond F(\lambda):

\xi_i^k = \begin{cases} \sigma(\lambda_i^k) & \text{if } \lambda_i^k \neq 0 \\ \sigma(-\diamond_i F(\lambda^k)) & \text{if } \lambda_i^k = 0 \end{cases}    (12)

Step 5 constructs the approximate inverse Hessian H_k in the same way as traditional L-BFGS [7], so the construction is not repeated here. Step 6 then determines the search direction p^k, formulated as

p^k = \pi(H_k v^k; v^k),    (13)

where v^k = -\diamond F(\lambda^k). Steps 7-10 find the next point \lambda^{k+1} using a constrained line search, in which each explored point is projected back onto the chosen orthant: \lambda^{k+1} = \pi(\lambda^k + \alpha p^k; \xi^k), where \alpha here controls the search step. Steps 11 and 12 update the sets S and R, respectively.

The proposed learning method for L1-CCRF has a semi-second-order convergence rate in theory. We will show that it converges faster than the previous CCRF learning method in Subsection 4.1.

3.3 Inference

To make a prediction with a learned L1-CCRF model, we find the most likely y given the observed features X, as formulated in Equation 6. We first derive P(y|X) in multivariate Gaussian form:

P(y|X) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} (y - \mu(X))^T \Sigma^{-1} (y - \mu(X)) \Big)    (14)

In this Gaussian form, the inverse of the covariance matrix, \Sigma^{-1}, is the sum of two n \times n matrices:

\Sigma^{-1} = 2(M^1 + M^2), \quad
M^1_{i,j} = \begin{cases} \sum_{k=1}^{K_1} \alpha_k & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}, \quad
M^2_{i,j} = \begin{cases} \sum_{j=1}^{n} \sum_{k=1}^{K_2} \delta_k \beta_k - \sum_{k=1}^{K_2} \delta_k \beta_k & \text{if } i = j \\ -\sum_{k=1}^{K_2} \delta_k \beta_k & \text{if } i \neq j \end{cases}    (15)

The diagonal matrix M^1 represents the contribution of the \alpha terms (node potentials), and the symmetric matrix M^2 represents the contribution of the \beta terms (edge potentials). The mean \mu(X) is computed by

\mu(X) = \Sigma \theta,    (16)

where \theta is an n-dimensional vector whose elements are calculated by

\theta_i = 2 \sum_{k=1}^{K_1} \alpha_k X_{i,k}.    (17)

Benefiting from the multivariate Gaussian form, the inference becomes quite tractable: to maximize P(y|X) in Equation 14, we simply set y equal to \mu(X),

\hat{y} = \arg\max_y P(y|X) = \mu(X) = \Sigma \theta.    (18)
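Under the quadratic potentials of Equation 7 and a chain structure, the Gaussian inference of Equations 14-18 can be sketched as follows. This is our reading of the formulas, simplified to a single effective edge weight, with the chain adjacency playing the role of the indicator δ_k.

```python
import numpy as np

def ccrf_gaussian_inference(X, alpha, beta):
    """Mean prediction of a chain CCRF with quadratic potentials (Eqs. 14-18).

    X     : (n, K1) matrix of node features, one column per node potential
    alpha : (K1,) node weights
    beta  : scalar edge weight shared by adjacent outputs in the chain
    """
    n, K1 = X.shape
    # M1: diagonal contribution of the node potentials (Eq. 15).
    M1 = np.eye(n) * np.sum(alpha)
    # M2: chain-Laplacian contribution of the edge potentials (Eq. 15):
    # each adjacent pair (i, i+1) adds beta to both diagonal entries
    # and -beta to the off-diagonal entries.
    M2 = np.zeros((n, n))
    for i in range(n - 1):
        M2[i, i] += beta; M2[i + 1, i + 1] += beta
        M2[i, i + 1] -= beta; M2[i + 1, i] -= beta
    Sigma = np.linalg.inv(2.0 * (M1 + M2))
    theta = 2.0 * X @ np.asarray(alpha)   # theta_i = 2 * sum_k alpha_k X_{i,k}  (Eq. 17)
    return Sigma @ theta                  # y_hat = mu(X) = Sigma * theta        (Eq. 18)

X = np.array([[0.2, 0.3], [0.5, 0.4], [0.7, 0.6]])
print(ccrf_gaussian_inference(X, alpha=np.array([1.0, 0.5]), beta=0.8))
```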

4 Experiment

We conducted two experiments with the proposed L1-CCRF. Experiment 1 evaluated the learning process of L1-CCRF; it aimed to demonstrate that the proposed learning method for L1-CCRF is correct and more efficient than the previous learning method for CCRF. Experiment 2 aimed to verify the feature selection capacity of L1-CCRF; it used L1-CCRF to predict the hourly load of some specific customers in a Smart Grid to demonstrate that L1-CCRF is effective in feature selection and load forecasting. In the following, the two experiments are reported in turn.

4.1 Experiment 1: Learning of L1-CCRF

In this experiment, we compared the proposed learning method for L1-CCRF with conventional CCRF learning. The experimental data are from the work of Baltrusaitis et al. [2], whose source code is publicly available. In their work, CCRF was used as a structural regression tool to regress a test expression onto a four-dimensional measurement defined for facial expressions. Owing to page limits, their model is not introduced here; interested readers may refer to [2]. In their CCRF learning, they used the conventional SGD algorithm (introduced in Subsection 3.2) with L2-norm regularization. We first reproduced their results, then built the same CCRF model and used the proposed L1-CCRF learning algorithm (see Algorithm 1) to learn from the same data. We repeated the dimensional expression recognition experiment using the plain CCRF model, not the CA-CCRF variant [2]. For the experimental settings, the convergence criterion for both methods was that the weight change \Delta\lambda = \|\lambda^{i+1} - \lambda^i\| fall below the same small threshold. The other parameters of the CCRF learning process used the default settings of [2]. In L1-CCRF learning, the parameter \rho was set to 100, chosen by cross-validation. The convergence curves of the two methods for the valence dimension are shown in Figure 2.

We first make some explanatory remarks on Figure 2. The convergence rate is measured by the total number of iterations over all the training samples.

Fig. 2. The convergence curves of CCRF and L1-CCRF learning (log-likelihood versus number of iterations).

In CCRF learning, SGD computes the gradient from one sample at a time, while L1-CCRF computes the gradient from all the training samples. We therefore divided the total number of SGD loops by the number of training samples to obtain an iteration count for CCRF learning that is comparable with the total number of L1-CCRF iterations. CCRF takes 624 iterations to converge, while L1-CCRF reaches convergence in 68 iterations. From Figure 2, we can see that the convergence rate of L1-CCRF is much faster than that of CCRF. From a theoretical point of view, the convergence rate of SGD is first-order, while that of OWL-QN is semi-second-order; thus, the learning process of L1-CCRF takes far fewer iterations than that of CCRF. When converged, the final log-likelihood of CCRF is , and that of L1-CCRF is .

Table 1. The learning performance comparison of CCRF and L1-CCRF

Method   | learning time (seconds) | precision (average correlations)
CCRF     | 1532                    |
L1-CCRF  | 257                     |

We compared the learning time and prediction results of CCRF and L1-CCRF in Table 1. In Table 1, the learning time is measured in seconds. For the prediction results, both methods calculate the correlation of the predicted values with each of the four dimensions of an expression, and we use the average correlation to measure the prediction precision of both methods. On our desktop with a dual-core 3.33 GHz CPU and 4 GB RAM, CCRF learning takes 1532 seconds to reach convergence, while L1-CCRF learning converges in 257 seconds.

For the prediction precision, we can see that the two methods show similar results. The prediction results demonstrate two points: 1) the L1-CCRF learning is correct, and 2) L1-norm and L2-norm regularization result in similar precision. From this experiment, we draw two major conclusions:

1. The convergence rate of L1-CCRF learning is faster than that of CCRF.
2. The L1 norm shows performance similar to the L2 norm in prediction precision.

4.2 Experiment 2: Feature Selection using L1-CCRF

This experiment was conducted on the platform of the Power Trading Agent Competition (Power TAC) [4]. Power TAC has drawn wide attention and has become a benchmark in the Smart Grid research community. We utilized this platform to evaluate the performance of L1-CCRF in feature selection and short-term load forecasting (the future hourly power usage). Power TAC simulates a variety of customers with various behaviours in a Smart Grid. Moreover, it offers rich features, including real-world weather conditions and real-time market status. Besides, the Power TAC server supplies rich logs of the customers' hourly power usages, which are regarded as the ground truth to evaluate the prediction precision of L1-CCRF.

Table 2. Features used in Experiment 2

Feature group    | Content        | Index
Temporal feature | hour of a day  | t1
                 | day of a week  | t2
Weather feature  | temperature    | w1
                 | wind strength  | w2
                 | wind direction | w3
                 | cloudiness     | w4
Market feature   | lowest price   | m1
                 | average price  | m2

Representative features that may relate to the customers' power consumption behaviours are studied in our work. These features include temporal features, weather features and market features. Table 2 lists the contents of the three feature groups; the indexes are used in the later discussion.

We configured the Power TAC server and the weather data server, and utilized Power TAC games to generate training and test data. Six games were run to roughly cover a full year; the server logs, which contained the features and the customers' energy usages, were used as the training data. Another six games were run for a different year, and the logged customers' usages were regarded as the ground truth for the loads to be predicted.
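For illustration, the eight features of Table 2 can be assembled into the hourly design matrix that a chain CCRF over 24 hours would consume. The record format below is a hypothetical encoding of ours, not the Power TAC log format, with made-up values.

```python
import numpy as np

# Hypothetical hourly records carrying the eight features of Table 2
# (t1, t2, w1..w4, m1, m2); values here are invented for illustration only.
FEATURE_KEYS = ["t1", "t2", "w1", "w2", "w3", "w4", "m1", "m2"]

def build_design_matrix(hourly_records):
    """Stack one 8-feature row per hour -> X of shape (24, 8)."""
    return np.array([[rec[k] for k in FEATURE_KEYS] for rec in hourly_records])

hourly_records = [
    {"t1": h / 23.0, "t2": 3 / 6.0, "w1": 0.6, "w2": 0.2,
     "w3": 0.5, "w4": 0.3, "m1": 0.4, "m2": 0.5}
    for h in range(24)
]
X = build_design_matrix(hourly_records)
print(X.shape)   # (24, 8): 24 hourly outputs, 8 node features each
```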

The L1-CCRF model for load forecasting is set up as follows. L1-CCRF modeled the 24 hourly power usages of a day under the influence of the corresponding 24 sets of hourly features. For the features of each hour, 8 node potentials were generated. There are 23 intervals between adjacent hours in a day, and edge potentials were generated over every interval. To ensure the convergence of L1-CCRF, we set 50 iterations for OWL-QN. Four typical customers were selected to test the performance of L1-CCRF: Customer 1 is a village householder, Customer 2 is a photovoltaic energy generator, Customer 3 is an office building, and Customer 4 is a cold storage company with power storage capacity. We first analyze the feature selection capacity of L1-CCRF. Table 3 shows the effective features selected by L1-CCRF for the four customers.

Table 3. Feature selection for the four typical customers

Customer | t1 | t2 | w1 | w2 | w3 | w4 | m1 | m2
C1       |    |    |    |    |    |    |    |
C2       |    |    |    |    |    |    |    |
C3       |    |    |    |    |    |    |    |
C4       |    |    |    |    |    |    |    |

In Table 3, 1 indicates that the feature is related to the power usage of the customer, and 0 indicates otherwise. Customer 1 is a village user; intuitively, the power usage of Customer 1 may be influenced by time, temperature and market price, and the feature selection result for Customer 1 in Table 3 is quite reasonable according to this intuition. Customer 2, a photovoltaic energy producer, is affected by the hour of the day, the cloudiness and the temperature; the result shown in Table 3 is also correct. Similarly, the feature selection result for Customer 3 is reasonable. Customer 4 is a cold storage company, and its power usage is not likely to be affected by wind strength; in this case, the wind strength feature might have been selected improperly. In contrast, conventional CCRF learning, whose parameters are confined to be positive, cannot perform feature selection at all. Thus, L1-CCRF gains the advantage of building a model that is both concise (using only effective features) and explanatory (in harmony with intuition or common sense).

At the same time, L1-CCRF predicts the hourly power usages of the four customers. The Mean Absolute Percentage Error (MAPE) for each hour is calculated to measure the prediction results. The average MAPEs for the four customers are 7.32%, 5.63%, 6.41% and 3.36%, respectively. L1-CCRF shows good performance in short-term load forecasting.

Two points are verified in this experiment:

1. L1-CCRF is effective in feature selection for customers' power consumption behaviours in the Smart Grid market.
2. L1-CCRF shows good performance in load forecasting.

Thus, the proposed L1-CCRF can serve as an efficient structural regression tool with feature selection capacity.
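The MAPE scores reported above follow the usual definition of Mean Absolute Percentage Error; a minimal sketch of the computation (ours, with made-up hourly loads) is:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error over the hourly forecasts, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Toy example with invented hourly loads (kWh).
print(mape([10.0, 12.0, 9.5, 11.0], [10.5, 11.4, 9.8, 10.6]))
```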

5 Conclusion

This work first used the L1 norm to regularize CCRF and proposed L1-CCRF. The definition domains of the weights of CCRF were extended, so that the L1 norm could be introduced for CCRF regularization in the learning process. We then provided a practical learning method for L1-CCRF, and experimental results demonstrated its correctness and efficiency. We used L1-CCRF for feature selection for the customers in a Smart Grid and demonstrated that it is effective in selecting key features in the power market domain. In a nutshell, the proposed L1-CCRF is effective in learning and feature selection, and we suggest that it can be applied to a wide range of research domains.

References

1. G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning (ICML), ACM, 2007.
2. T. Baltrusaitis, N. Banda, and P. Robinson. Dimensional affect recognition using continuous conditional random fields. In IEEE Automatic Face and Gesture Recognition, pages 1-8, 2013.
3. H. Guo. Accelerated continuous conditional random fields for load forecasting. IEEE Transactions on Knowledge and Data Engineering, 27(8), 2015.
4. W. Ketter, J. Collins, P. Reddy, and M. Weerdt. The 2014 Power Trading Agent Competition. ERIM Report Series Reference No. ERS-LIS, 2014.
5. J. Lafferty. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001), 2001.
6. T. Lavergne, O. Cappé, and F. Yvon. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010.
7. D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503-528, 1989.
8. A. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the 21st International Conference on Machine Learning (ICML), ACM, 2004.
9. T. Qin, T. Liu, X. Zhang, D. Wang, and H. Li. Global ranking using continuous conditional random fields. In Advances in Neural Information Processing Systems, 2008.
10. V. Radosavljevic, S. Vucetic, and Z. Obradovic. Continuous conditional random fields for regression in remote sensing. In ECAI, 2010.

11. X. Xin, I. King, H. Deng, and M. R. Lyu. A social recommendation framework based on multi-scale continuous conditional random fields. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), ACM, 2009.
12. J. Yu, S. Vishwanathan, S. Günter, and N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11, 2010.
13. T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), ACM, 2004.
