# EM-algorithm for Training of State-space Models with Application to Time Series Prediction

Elia Liitiäinen, Nima Reyhani and Amaury Lendasse
Helsinki University of Technology - Neural Networks Research Center
P.O. Box 5400, FI Finland
{elia, lendasse,

**Abstract.** In this paper, an improvement to the E-step of the EM-algorithm for nonlinear state-space models is presented. We also propose strategies for model structure selection when the EM-algorithm and state-space models are used for time series prediction. Experiments on the Poland electricity load time series show that the method gives good short-term predictions and can also be used for long-term prediction.

## 1 Introduction

Time series prediction is an important problem in many fields. Applications include finance, prediction of electricity load and ecology. Time series prediction is a problem of system identification. Takens' theorem [1] ensures that if the dynamics of the underlying system are deterministic, a regression approach can be used for prediction. However, many real-world phenomena are stochastic, and more complicated methods might be necessary.

One possibility is to use a state-space model for predicting the behavior of a time series. This kind of model is very general and is able to model many kinds of phenomena. Choosing the structure and estimating the parameters of a state-space model can be done in many ways. One approach is to use the EM-algorithm and Radial Basis Function (RBF) networks [2]. The main problem is that the E-step of the algorithm is difficult due to the nonlinearity of the model. In this paper, we apply a Gaussian filter which handles nonlinearities better than the ordinary Extended Kalman Filter (EKF) used in [2]. The problem of choosing the dimension of the state-space and the number of neurons is also investigated. This problem is important, as too complicated a model can lead to numerical instability, high computational complexity and overfitting.
We propose the use of training error curves for model structure selection. The experimental results show that the method works well in practice. We also show that the EM-algorithm maximizes the short-term prediction performance but is not optimal for long-term prediction.

## 2 Gaussian Linear Regression Filters

Consider the nonlinear state-space model

$$x_k = f(x_{k-1}) + w_k \qquad (1)$$
$$y_k = g(x_k) + v_k, \qquad (2)$$

where $w_k \sim N(0, Q)$ and $v_k \sim N(0, r)$ are independent Gaussian random variables, $x_k \in \mathbb{R}^n$ and $y_k \in \mathbb{R}$. Here, $N(0, Q)$ denotes the normal distribution with covariance matrix $Q$; correspondingly, $r$ is the variance of the observation noise.

The filtering problem is stated as calculating $p(x_k \mid y_1, \ldots, y_k)$. If $f$ and $g$ were linear, this could be done using the Kalman filter. Linear filtering theory can be applied to nonlinear problems by using Gaussian linear regression filters (LRKF), recursive algorithms based on statistical linearization of the model equations. The linearization can be done in various ways, resulting in slightly different algorithms.

Denote by $\hat{p}(x_k \mid y_1, \ldots, y_k)$ a Gaussian approximation of $p(x_k \mid y_1, \ldots, y_k)$. The first phase in a recursive step of the algorithm is calculating $\hat{p}(x_{k+1} \mid y_1, \ldots, y_k)$, which can be done by linearizing $f(x_k) \approx A_k x_k + b_k$ so that the error

$$\operatorname{tr}(e_k) = \operatorname{tr}\left( \int_{\mathbb{R}^n} \left(f(x_k) - A_k x_k - b_k\right)\left(f(x_k) - A_k x_k - b_k\right)^T \, \hat{p}(x_k \mid y_1, \ldots, y_k) \, dx_k \right) \qquad (3)$$

is minimized. Here, $\operatorname{tr}$ denotes the sum of the diagonal elements of a matrix. In addition, $Q$ is replaced by $\tilde{Q}_k = Q + e_k$. The linearized model is used for calculating $\hat{p}(x_{k+1} \mid y_1, \ldots, y_k)$ using the theory of linear filters. The measurement update $\hat{p}(x_{k+1} \mid y_1, \ldots, y_k) \to \hat{p}(x_{k+1} \mid y_1, \ldots, y_{k+1})$ is done by a similar linearization.

Approximating equation 3 using central differences would lead to the central difference filter (CDF), which is related to the unscented Kalman filter. However, by using Gaussian nonlinearities as described in the following sections, no numerical integration is needed in the linearization. The smoothed density $p(x_l \mid y_1, \ldots, y_k)$ ($l < k$) is also of interest. In our experiments we use the Rauch-Tung-Striebel smoother [4] with the linearized model.
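The minimizer of equation 3 is the classical statistical-linearization solution: $A_k$ is the cross-covariance of $f(x_k)$ and $x_k$ times the inverse state covariance, and $b_k$ matches the means. As a minimal sketch of this idea, assuming a generic nonlinearity and sample-based moment estimates (the paper instead evaluates the Gaussian integrals in closed form for RBF nonlinearities):

```python
import numpy as np

def statistical_linearization(f, m, P, n_samples=20000, rng=None):
    """Approximate f(x) ~= A x + b in the mean-square sense w.r.t. N(m, P).

    Returns (A, b, E), where E estimates the covariance of the
    linearization error, used to inflate the process noise Q.
    """
    rng = np.random.default_rng(rng)
    x = rng.multivariate_normal(m, P, size=n_samples)   # samples of x_k
    fx = np.apply_along_axis(f, 1, x)                   # f evaluated per sample
    f_mean = fx.mean(axis=0)
    # Cross-covariance Cov(f(x), x) gives the mean-square-optimal gain A.
    C_fx = (fx - f_mean).T @ (x - m) / n_samples
    A = C_fx @ np.linalg.inv(P)
    b = f_mean - A @ m
    # Covariance of the residual f(x) - A x - b.
    r = fx - x @ A.T - b
    E = (r - r.mean(axis=0)).T @ (r - r.mean(axis=0)) / n_samples
    return A, b, E
```

For an already-linear $f$ the procedure recovers $A$ and $b$ exactly (up to sampling noise) with a vanishing error covariance, which is a convenient sanity check.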
## 3 Expectation Maximization (EM)-algorithm for Training of Radial Basis Function Networks

In this section, a training algorithm introduced in [2] is described. The EM-algorithm is used for parameter estimation in nonlinear state-space models.

### 3.1 Parameter Estimation

Suppose a sequence of observations $(y_k)_{k=1}^N$ is available. The underlying system that produced the observations is modelled as a state-space model. The functions

$f$ and $g$ in equations 1 and 2 are parametrized using RBF networks:

$$f = w_f^T \Phi_f \qquad (4)$$
$$g = w_g^T \Phi_g, \qquad (5)$$

where $\Phi_f = [\rho_1^f(x), \ldots, \rho_l^f(x),\ x^T,\ 1]^T$ and $\Phi_g = [\rho_1^g(x), \ldots, \rho_j^g(x),\ x^T,\ 1]^T$ are the neurons of the RBF networks. The nonlinear neurons are of the form

$$\rho(x) = |2\pi S|^{-1/2} \exp\left(-\tfrac{1}{2}(x - c)^T S^{-1} (x - c)\right). \qquad (6)$$

The free parameters of the model are the weights of the neurons, $w_f$ and $w_g$ in equations 4 and 5, and the noise covariances $Q$ and $r$ in equations 1 and 2. In addition, the initial condition for the states is chosen Gaussian, and the parameters of this distribution are optimized.

The EM-algorithm is a standard method for handling missing data. Denoting by $\theta$ the free parameters of the model, the EM-algorithm is used for maximizing $p(y_1, \ldots, y_N \mid \theta)$. The EM-algorithm for learning nonlinear state-space models is derived in [2]. The algorithm is iterative and each iteration consists of two steps. In the E-step, the density $p(x_0, \ldots, x_N \mid y_1, \ldots, y_N, \theta)$ is approximated, and in the M-step this approximation is used for updating the parameters. Due to the linearity of the model with respect to the parameters, the M-step can be solved analytically; the update formulas for the weights $w_f$ and $w_g$ and the covariances $Q$ and $r$ can be found in [2]. In our implementation $Q$ is chosen diagonal. The E-step is more difficult, and an approximative method must be used. We propose the use of the smoother described in section 2 instead of the extended Kalman smoother used in [2].

### 3.2 Initialization

The state variables must be initialized to some meaningful values before the algorithm can be used. Consider a scalar time series $(y_t)$. First, the vectors

$$z_t = [y_{t-L}, \ldots, y_t, \ldots, y_{t+L}] \qquad (7)$$

are formed, where $L$ is chosen large enough so that the vectors contain enough information. Next, the dimension of the hidden state-space is chosen. Once the dimension is known, the vectors $z_t$ are projected onto this lower-dimensional space. This is done with the PCA mapping [5].
The rough estimates for the hidden states are used to obtain an initial guess for the parameters of the network.
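The initialization above, delay vectors $z_t$ from equation 7 projected by PCA, can be sketched as follows. The function name and the eigendecomposition-based PCA are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def init_states_pca(y, L, dim):
    """Rough hidden-state estimates: delay vectors z_t (Eq. 7)
    projected onto `dim` principal components.

    y   : 1-D observation sequence
    L   : half-width of the delay window, so z_t has length 2L + 1
    dim : chosen state-space dimension
    """
    N = len(y)
    # Stack windows z_t = [y_{t-L}, ..., y_t, ..., y_{t+L}] for all valid t.
    Z = np.array([y[t - L:t + L + 1] for t in range(L, N - L)])
    Zc = Z - Z.mean(axis=0)
    # PCA via eigendecomposition of the sample covariance matrix.
    cov = Zc.T @ Zc / len(Zc)
    eigval, eigvec = np.linalg.eigh(cov)
    components = eigvec[:, ::-1][:, :dim]   # top-`dim` principal directions
    return Zc @ components                  # rough state estimates
```

The returned matrix has one row per valid time index; the columns are ordered by decreasing explained variance, matching the usual PCA convention.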

### 3.3 Choosing Kernel Means and Widths

The centers of the neurons ($c$ in equation 6) are chosen with the k-means algorithm [5]. The widths $S_j$ are chosen according to the formula (see [6])

$$S_j = \frac{1}{l} \sum_{i=1}^{l} \left( \| c_j - c_{N(j,i)} \|^2 \right)^{\frac{1}{2}} I, \qquad (8)$$

where $I$ is the identity matrix and $N(j,i)$ is the $i$th nearest neighbor of $c_j$. In the experiments, we use a fixed value of $l$.

### 3.4 Choosing the Number of Neurons and the Dimension of the State-Space

To estimate the dimension of the state-space, we propose the use of a validation set to estimate the generalization error. For each dimension, the model is calculated for different numbers of neurons, and the one which gives the lowest validation error is chosen. This procedure is repeated to get averaged estimates, which are used for dimension selection.

For choosing the number of neurons, we propose the use of the training error, which has the advantage that the whole data set is used for training. Usually there is a bend after which the training error decreases only slowly (see figure 1c). This kind of bend is used as an estimate of the point after which overfitting becomes a problem. To avoid local minima, averaging is used. Because iterative prediction is used, the one-step prediction error is used for model structure selection; alternatively, the likelihood could have been used.

## 4 Experiments

The experiments are made with a time series that represents the daily electricity consumption in Poland during 1400 days in the 1990s [7]; see figure 1a. Part of the values are used for training and the rest for testing; for the dimension selection, a subset of the training values is kept for validation. The model is tested using dimensions 2 to 8 for the state-space. For each dimension, the validation errors are calculated for different numbers of neurons, going from 0 to 36 in steps of 2 (we use the same number of neurons for both networks).
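As a concrete illustration of the kernel shape in equation 6 and the width rule in equation 8, a small sketch follows. The example assumes the centers have already been produced by k-means (the centers and the value of $l$ below are hypothetical):

```python
import numpy as np

def rbf_neuron(x, c, S):
    """Gaussian neuron rho(x) = |2 pi S|^(-1/2) exp(-(x-c)^T S^-1 (x-c) / 2) (Eq. 6)."""
    d = x - c
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / np.sqrt(np.linalg.det(2 * np.pi * S))

def rbf_widths(centers, l):
    """Widths S_j (Eq. 8): mean distance from c_j to its l nearest
    neighboring centers, times the identity matrix."""
    centers = np.asarray(centers, dtype=float)
    n = centers.shape[1]
    widths = []
    for j, c in enumerate(centers):
        dists = np.linalg.norm(centers - c, axis=1)
        dists[j] = np.inf                       # exclude c_j itself
        mean_dist = np.sort(dists)[:l].mean()   # average over N(j, 1), ..., N(j, l)
        widths.append(mean_dist * np.eye(n))
    return widths
```

With centers at (0,0), (1,0) and (3,0) and $l = 2$, the first width matrix is $2I$ (mean of distances 1 and 3), which makes the nearest-neighbor averaging in equation 8 concrete.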
To calculate the prediction errors, the state at the beginning of each prediction is estimated with the linear-regression filter. The prediction is made by approximating the future states and observations with the linear-regression approach. For $L$ in formula 7, we choose 10; the choice is based on the results in [8], which imply that a window size of 10 contains enough information for prediction. For each number of neurons, the training is stopped if the likelihood decreases twice in a row or more than 40 iterations have been made. In figure 1b are the validation MSE curves, calculated by averaging the values for each number of

Fig. 1: a) The Poland electricity consumption time series. b) 1-step prediction validation errors for dimension selection. c) 1-step prediction errors for dimension 7; the solid line is the error on the test set and the dashed line the error on the training set. d) 5-step test error for dimension 7. e) An example of prediction.

neurons over four simulations and choosing the minimum for each dimension. It can be seen that a good choice for the dimension of the state-space is 7, even though the validation error for dimension 6 is nearly as good. The errors are averaged over four simulations. The training error curve for 1-step prediction shows clearly that a good choice for the number of neurons is 6. The test error shows that the model chosen based on the training error gives good results for short-term prediction. The chosen model gives quite good, but not optimal, 5-step prediction results.

We claim that the EM-algorithm optimizes mainly the short-term prediction performance, which can be seen by examining the test error curves: the 1-step test error curve is much smoother. Thus, an algorithm designed for long-term prediction would certainly give much better results. In table 1 are the results corresponding to the averaged prediction over four simulations. For comparison, we have calculated the corresponding results using linear regression, lazy learning [9] and LS-SVM [8]. As expected, the 1-step MSE of the EM-algorithm is very good, but the 5-step MSE is less good.
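The iterative multi-step prediction behind the 5-step errors can be sketched as follows, propagating only the state mean; the full linear-regression approach also propagates covariances through the statistically linearized model, and `f`, `g` here stand in for the trained RBF networks:

```python
import numpy as np

def iterative_predict(f, g, x0, steps):
    """Predict `steps` observations ahead by iterating x_{k+1} = f(x_k)
    from the filtered state x0 and reading out y_{k+1} = g(x_{k+1}).

    The noise terms have zero mean, so they drop out of the mean prediction.
    """
    x = np.asarray(x0, dtype=float)
    preds = []
    for _ in range(steps):
        x = f(x)              # propagate the state mean one step
        preds.append(g(x))    # predicted observation at this horizon
    return preds
```

This makes explicit why a model trained to minimize 1-step error need not be optimal here: the multi-step forecast is the 1-step model composed with itself, so 1-step errors compound over the horizon.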

| Method | 1-step MSE | 5-step MSE |
|---|---|---|
| EM | | |
| linear | | |
| lazy learning | | |
| LS-SVM | | |

Table 1: Test errors with averaged predictions.

## 5 Conclusion

In this paper, a new approach to the E-step of the EM-algorithm for nonlinear state-space models is proposed. We also propose strategies for model structure selection. The experimental results show that the optimization of the model for short-term prediction works well, but on the other hand the EM-algorithm is not optimal for long-term prediction. This is because the prediction is iterative, with a model optimized for 1-step prediction.

The linear regression filter uses the structure of the neural network to propagate nonlinearities. Our experiments confirm that this leads to a stable algorithm; unfortunately, it is difficult to prove theoretical convergence or stability bounds. In the future, the cost function should be modified to give better results for long-term prediction. In addition, our approach to the E-step should be compared to other filtering methods.

## References

[1] D. A. Rand and L. S. Young, editors. Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, chapter "Detecting strange attractors in turbulence". Springer-Verlag.

[2] Z. Ghahramani and S. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems, volume 11.

[3] S. Haykin, editor. Kalman Filtering and Neural Networks. Wiley Series on Adaptive and Learning Systems for Signal Processing. John Wiley & Sons, Inc.

[4] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition.

[5] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation.

[6] M. Cottrell, B. Girard, and P. Rousset. Forecasting of curves using classification. Journal of Forecasting, 17(5-6).

[7] Y. Ji, J. Hao, N. Reyhani, and A. Lendasse. Direct and recursive prediction of time series using mutual information selection.
In Computational Intelligence and Bioinspired Systems: 8th International Workshop on Artificial Neural Networks, IWANN 2005, Vilanova i la Geltrú, Barcelona, Spain, June 8-10, 2005.

[8] A. Sorjamaa, A. Lendasse, and M. Verleysen. Pruned lazy learning models for time series prediction. In ESANN 2005, European Symposium on Artificial Neural Networks, Bruges (Belgium).

### Time series prediction

Chapter 12 Time series prediction Amaury Lendasse, Yongnan Ji, Nima Reyhani, Jin Hao, Antti Sorjamaa 183 184 Time series prediction 12.1 Introduction Amaury Lendasse What is Time series prediction? Time

### Analysis of Fast Input Selection: Application in Time Series Prediction

Analysis of Fast Input Selection: Application in Time Series Prediction Jarkko Tikka, Amaury Lendasse, and Jaakko Hollmén Helsinki University of Technology, Laboratory of Computer and Information Science,

### Non-parametric Residual Variance Estimation in Supervised Learning

Non-parametric Residual Variance Estimation in Supervised Learning Elia Liitiäinen, Amaury Lendasse, and Francesco Corona Helsinki University of Technology - Lab. of Computer and Information Science P.O.

### Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing

### Direct and Recursive Prediction of Time Series Using Mutual Information Selection

Direct and Recursive Prediction of Time Series Using Mutual Information Selection Yongnan Ji, Jin Hao, Nima Reyhani, and Amaury Lendasse Neural Network Research Centre, Helsinki University of Technology,

### Factor Analysis and Kalman Filtering (11/2/04)

CS281A/Stat241A: Statistical Learning Theory Factor Analysis and Kalman Filtering (11/2/04) Lecturer: Michael I. Jordan Scribes: Byung-Gon Chun and Sunghoon Kim 1 Factor Analysis Factor analysis is used

### Prediction of ESTSP Competition Time Series by Unscented Kalman Filter and RTS Smoother

Prediction of ESTSP Competition Time Series by Unscented Kalman Filter and RTS Smoother Simo Särkkä, Aki Vehtari and Jouko Lampinen Helsinki University of Technology Department of Electrical and Communications

### Dual Estimation and the Unscented Transformation

Dual Estimation and the Unscented Transformation Eric A. Wan ericwan@ece.ogi.edu Rudolph van der Merwe rudmerwe@ece.ogi.edu Alex T. Nelson atnelson@ece.ogi.edu Oregon Graduate Institute of Science & Technology

### Heterogeneous mixture-of-experts for fusion of locally valid knowledge-based submodels

ESANN'29 proceedings, European Symposium on Artificial Neural Networks - Advances in Computational Intelligence and Learning. Bruges Belgium), 22-24 April 29, d-side publi., ISBN 2-9337-9-9. Heterogeneous

### Input Selection for Long-Term Prediction of Time Series

Input Selection for Long-Term Prediction of Time Series Jarkko Tikka, Jaakko Hollmén, and Amaury Lendasse Helsinki University of Technology, Laboratory of Computer and Information Science, P.O. Box 54,

### Lecture 6: Bayesian Inference in SDE Models

Lecture 6: Bayesian Inference in SDE Models Bayesian Filtering and Smoothing Point of View Simo Särkkä Aalto University Simo Särkkä (Aalto) Lecture 6: Bayesian Inference in SDEs 1 / 45 Contents 1 SDEs

### output dimension input dimension Gaussian evidence Gaussian Gaussian evidence evidence from t +1 inputs and outputs at time t x t+2 x t-1 x t+1

To appear in M. S. Kearns, S. A. Solla, D. A. Cohn, (eds.) Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 999. Learning Nonlinear Dynamical Systems using an EM Algorithm Zoubin

### PATTERN CLASSIFICATION

PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

### Using the Delta Test for Variable Selection

Using the Delta Test for Variable Selection Emil Eirola 1, Elia Liitiäinen 1, Amaury Lendasse 1, Francesco Corona 1, and Michel Verleysen 2 1 Helsinki University of Technology Lab. of Computer and Information

### Fast Bootstrap for Least-square Support Vector Machines

ESA'2004 proceedings - European Symposium on Artificial eural etworks Bruges (Belgium), 28-30 April 2004, d-side publi., ISB 2-930307-04-8, pp. 525-530 Fast Bootstrap for Least-square Support Vector Machines

### Lecture 7: Optimal Smoothing

Department of Biomedical Engineering and Computational Science Aalto University March 17, 2011 Contents 1 What is Optimal Smoothing? 2 Bayesian Optimal Smoothing Equations 3 Rauch-Tung-Striebel Smoother

### ABSTRACT INTRODUCTION

ABSTRACT Presented in this paper is an approach to fault diagnosis based on a unifying review of linear Gaussian models. The unifying review draws together different algorithms such as PCA, factor analysis,

### Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method

Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method Antti Honkela 1, Stefan Harmeling 2, Leo Lundqvist 1, and Harri Valpola 1 1 Helsinki University of Technology,

### Gaussian Fitting Based FDA for Chemometrics

Gaussian Fitting Based FDA for Chemometrics Tuomas Kärnä and Amaury Lendasse Helsinki University of Technology, Laboratory of Computer and Information Science, P.O. Box 54 FI-25, Finland tuomas.karna@hut.fi,

### Bayesian ensemble learning of generative models

Chapter Bayesian ensemble learning of generative models Harri Valpola, Antti Honkela, Juha Karhunen, Tapani Raiko, Xavier Giannakopoulos, Alexander Ilin, Erkki Oja 65 66 Bayesian ensemble learning of generative

### The Kalman Filter ImPr Talk

The Kalman Filter ImPr Talk Ged Ridgway Centre for Medical Image Computing November, 2006 Outline What is the Kalman Filter? State Space Models Kalman Filter Overview Bayesian Updating of Estimates Kalman

### NON-LINEAR NOISE ADAPTIVE KALMAN FILTERING VIA VARIATIONAL BAYES

2013 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING NON-LINEAR NOISE ADAPTIVE KALMAN FILTERING VIA VARIATIONAL BAYES Simo Särä Aalto University, 02150 Espoo, Finland Jouni Hartiainen

### IN neural-network training, the most well-known online

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 1, JANUARY 1999 161 On the Kalman Filtering Method in Neural-Network Training and Pruning John Sum, Chi-sing Leung, Gilbert H. Young, and Wing-kay Kan

### Unsupervised Variational Bayesian Learning of Nonlinear Models

Unsupervised Variational Bayesian Learning of Nonlinear Models Antti Honkela and Harri Valpola Neural Networks Research Centre, Helsinki University of Technology P.O. Box 5400, FI-02015 HUT, Finland {Antti.Honkela,

### NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

### Gaussian Processes (10/16/13)

STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

### Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

### Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

### ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,

Sparse Kernel Canonical Correlation Analysis Lili Tan and Colin Fyfe 2, Λ. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. 2. School of Information and Communication

### RECURSIVE OUTLIER-ROBUST FILTERING AND SMOOTHING FOR NONLINEAR SYSTEMS USING THE MULTIVARIATE STUDENT-T DISTRIBUTION

1 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 3 6, 1, SANTANDER, SPAIN RECURSIVE OUTLIER-ROBUST FILTERING AND SMOOTHING FOR NONLINEAR SYSTEMS USING THE MULTIVARIATE STUDENT-T

### CS325 Artificial Intelligence Chs. 18 & 4 Supervised Machine Learning (cont)

CS325 Artificial Intelligence Cengiz Spring 2013 Model Complexity in Learning f(x) x Model Complexity in Learning f(x) x Let s start with the linear case... Linear Regression Linear Regression price =

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

### Learning Nonlinear Dynamical Systems using an EM Algorithm

Learning Nonlinear Dynamical Systems using an EM Algorithm Zoubin Ghahramani and Sam T. Roweis Gatsby Computational Neuroscience Unit University College London London WC1N 3AR, U.K. http://www.gatsby.ucl.ac.uk/

### Time Series Prediction by Kalman Smoother with Cross-Validated Noise Density

Time Series Prediction by Kalman Smoother with Cross-Validated Noise Density Simo Särkkä E-mail: simo.sarkka@hut.fi Aki Vehtari E-mail: aki.vehtari@hut.fi Jouko Lampinen E-mail: jouko.lampinen@hut.fi Abstract

### Computational statistics

Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

### 1 Machine Learning Concepts (16 points)

CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

### Old painting digital color restoration

Old painting digital color restoration Michail Pappas Ioannis Pitas Dept. of Informatics, Aristotle University of Thessaloniki GR-54643 Thessaloniki, Greece Abstract Many old paintings suffer from the

### Week 5: Logistic Regression & Neural Networks

Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and

### Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

### Gaussian Processes. 1 What problems can be solved by Gaussian Processes?

Statistical Techniques in Robotics (16-831, F1) Lecture#19 (Wednesday November 16) Gaussian Processes Lecturer: Drew Bagnell Scribe:Yamuna Krishnamurthy 1 1 What problems can be solved by Gaussian Processes?

### Linear and Non-Linear Dimensionality Reduction

Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections

### Machine Learning 4771

Machine Learning 4771 Instructor: ony Jebara Kalman Filtering Linear Dynamical Systems and Kalman Filtering Structure from Motion Linear Dynamical Systems Audio: x=pitch y=acoustic waveform Vision: x=object

### Lecture 9. Time series prediction

Lecture 9 Time series prediction Prediction is about function fitting To predict we need to model There are a bewildering number of models for data we look at some of the major approaches in this lecture

### Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian

### Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

### A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models. Isabelle Rivals and Léon Personnaz

In Neurocomputing 2(-3): 279-294 (998). A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models Isabelle Rivals and Léon Personnaz Laboratoire d'électronique,

### Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

### Functional Preprocessing for Multilayer Perceptrons

Functional Preprocessing for Multilayer Perceptrons Fabrice Rossi and Brieuc Conan-Guez Projet AxIS, INRIA, Domaine de Voluceau, Rocquencourt, B.P. 105 78153 Le Chesnay Cedex, France CEREMADE, UMR CNRS

### Expectation Propagation in Dynamical Systems

Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex

### Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

### Long-Term Prediction of Time Series by combining Direct and MIMO Strategies

Long-Term Prediction of Series by combining Direct and MIMO Strategies Souhaib Ben Taieb, Gianluca Bontempi, Antti Sorjamaa and Amaury Lendasse Abstract Reliable and accurate prediction of time series

### Feature Extraction with Weighted Samples Based on Independent Component Analysis

Feature Extraction with Weighted Samples Based on Independent Component Analysis Nojun Kwak Samsung Electronics, Suwon P.O. Box 105, Suwon-Si, Gyeonggi-Do, KOREA 442-742, nojunk@ieee.org, WWW home page:

### Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan 1, M. Koval and P. Parashar 1 Applications of Gaussian

### Immediate Reward Reinforcement Learning for Projective Kernel Methods

ESANN'27 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), 25-27 April 27, d-side publi., ISBN 2-9337-7-2. Immediate Reward Reinforcement Learning for Projective Kernel Methods

### Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

### Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### Advanced Machine Learning Practical 4b Solution: Regression (BLR, GPR & Gradient Boosting)

Advanced Machine Learning Practical 4b Solution: Regression (BLR, GPR & Gradient Boosting) Professor: Aude Billard Assistants: Nadia Figueroa, Ilaria Lauzana and Brice Platerrier E-mails: aude.billard@epfl.ch,

### State-Space Inference and Learning with Gaussian Processes

Ryan Turner 1 Marc Peter Deisenroth 1, Carl Edward Rasmussen 1,3 1 Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB 1PZ, UK Department of Computer Science & Engineering,

### Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

### An Ensemble Learning Approach to Nonlinear Dynamic Blind Source Separation Using State-Space Models

An Ensemble Learning Approach to Nonlinear Dynamic Blind Source Separation Using State-Space Models Harri Valpola, Antti Honkela, and Juha Karhunen Neural Networks Research Centre, Helsinki University

### STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

### Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (6-83, F) Lecture# (Monday November ) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan Applications of Gaussian Processes (a) Inverse Kinematics

### Gear Health Monitoring and Prognosis

Gear Health Monitoring and Prognosis Matej Gas perin, Pavle Bos koski, -Dani Juiric ic Department of Systems and Control Joz ef Stefan Institute Ljubljana, Slovenia matej.gasperin@ijs.si Abstract Many

### ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures

### Forecasting Wind Ramps

Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. Srihari srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### CMU-Q 15-381 Lecture 24: Supervised Learning 2

CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

### Replacing eligibility trace for action-value learning with function approximation

Replacing eligibility trace for action-value learning with function approximation Kary FRÄMLING Helsinki University of Technology PL 5500, FI-02015 TKK - Finland Abstract. The eligibility trace is one

### CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 Douglas Lanman dlanman@brown.edu 17 September Regression Problem Show that the prediction errors y − f(x; ŵ) are necessarily uncorrelated with any linear function of

### Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

### Gaussian process for nonstationary time series prediction

Computational Statistics & Data Analysis 47 (2004) 705 712 www.elsevier.com/locate/csda Gaussian process for nonstationary time series prediction Sofiane Brahim-Belhouari, Amine Bermak EEE Department, Hong

### MODEL BASED LEARNING OF SIGMA POINTS IN UNSCENTED KALMAN FILTERING. Ryan Turner and Carl Edward Rasmussen

MODEL BASED LEARNING OF SIGMA POINTS IN UNSCENTED KALMAN FILTERING Ryan Turner and Carl Edward Rasmussen University of Cambridge Department of Engineering Trumpington Street, Cambridge CB2 1PZ, UK ABSTRACT

### DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

### Content. Learning. Regression vs Classification. Regression a.k.a. function approximation and Classification a.k.a. pattern recognition

Content Andrew Kusiak Intelligent Systems Laboratory 2139 Seamans Center The University of Iowa Iowa City, IA 52242-1527 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Introduction to learning

### CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

### Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes Chuong B. Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

### Linear Models for Regression

Linear Models for Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

### L11: Pattern recognition principles

L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

### COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

### Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

Lecture 10 Neural networks and optimization Machine Learning and Data Mining November 2009 Nando de Freitas UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function

### Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

### Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

### Support Vector Regression (SVR) Descriptions of SVR in this discussion follow that in Refs. (2, 6, 7, 8, 9). The literature

Support Vector Regression (SVR) Descriptions of SVR in this discussion follow that in Refs. (2, 6, 7, 8, 9). The literature suggests the design variables should be normalized to a range of [-1,1] or [0,1].

### Gaussian Process Approximations of Stochastic Differential Equations

Gaussian Process Approximations of Stochastic Differential Equations Cédric Archambeau Dan Cornford Manfred Opper John Shawe-Taylor May 2006 1 Introduction Some of the most complex models routinely run

### Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling

Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling Nakul Gopalan IAS, TU Darmstadt nakul.gopalan@stud.tu-darmstadt.de Abstract Time series data of high dimensions

### Lecture 4: Probabilistic Learning

DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

### State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53

State-space Model Eduardo Rossi University of Pavia November 2014 Rossi State-space Model Fin. Econometrics - 2014 1 / 53 Outline 1 Motivation 2 Introduction 3 The Kalman filter 4 Forecast errors 5 State

### Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

### Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we knew: P(y) Prior P(x|y) Likelihood x2 x features y {ω 1,..., ω K

### The Kalman Filter. Data Assimilation & Inverse Problems from Weather Forecasting to Neuroscience. Sarah Dance

The Kalman Filter Data Assimilation & Inverse Problems from Weather Forecasting to Neuroscience Sarah Dance School of Mathematical and Physical Sciences, University of Reading s.l.dance@reading.ac.uk July

### CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal

### A Bayesian Perspective on Residential Demand Response Using Smart Meter Data

A Bayesian Perspective on Residential Demand Response Using Smart Meter Data Datong-Paul Zhou, Maximilian Balandat, and Claire Tomlin University of California, Berkeley [datong.zhou, balandat, tomlin]@eecs.berkeley.edu

### Pattern Recognition and Machine Learning

Christopher M. Bishop Pattern Recognition and Machine Learning Springer Contents Preface vii Mathematical notation xi Contents xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

### Lecture 3: Pattern Classification

EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures