How New Information Criteria WAIC and WBIC Worked for MLP Model Selection

Similar documents
Multilayer Perceptron Learning Utilizing Singular Regions and Search Pruning

Eigen Vector Descent and Line Search for Multilayer Perceptron

Second-order Learning Algorithm with Squared Penalty Term

Algebraic Information Geometry for Learning Machines with Singularities

Stochastic Complexity of Variational Bayesian Hidden Markov Models

Model Comparison. Course on Bayesian Inference, WTCN, UCL, February Model Comparison. Bayes rule for models. Linear Models. AIC and BIC.

Data Mining (Mineria de Dades)

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Stochastic Complexities of Reduced Rank Regression in Bayesian Estimation

Pattern Recognition and Machine Learning

Resolution of Singularities and Stochastic Complexity of Complete Bipartite Graph-Type Spin Model in Bayesian Estimation

Introduction to Machine Learning

A Bayesian Local Linear Wavelet Neural Network

Algebraic Geometry and Model Selection

Deep unsupervised learning

CPSC 540: Machine Learning

LTI Systems, Additive Noise, and Order Estimation

A NEW INFORMATION THEORETIC APPROACH TO ORDER ESTIMATION PROBLEM. Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

Deep Feedforward Networks

Machine Learning 2017

Deep Neural Networks

Machine Learning Lecture 5

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

PATTERN CLASSIFICATION

Neural Networks and the Back-propagation Algorithm

Bayesian Machine Learning

Statistical Machine Learning from Data

FEEDBACK GMDH-TYPE NEURAL NETWORK AND ITS APPLICATION TO MEDICAL IMAGE ANALYSIS OF LIVER CANCER. Tadashi Kondo and Junji Ueno

Reading Group on Deep Learning Session 1

The connection of dropout and Bayesian statistics

Unsupervised Learning

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Density Estimation. Seungjin Choi

Final Examination CS 540-2: Introduction to Artificial Intelligence

Bayesian ensemble learning of generative models

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Multilayer Perceptron

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Weight Quantization for Multi-layer Perceptrons Using Soft Weight Sharing

TDA231. Logistic regression

Artificial Neural Networks

CS:4420 Artificial Intelligence

p(d θ ) l(θ ) 1.2 x x x

Memoirs of the Faculty of Engineering, Okayama University, Vol. 42, pp , January Geometric BIC. Kenichi KANATANI

Expectation Propagation for Approximate Bayesian Inference

Ways to make neural networks generalize better

STA 4273H: Sta-s-cal Machine Learning

Part 1: Expectation Propagation

Principles of Bayesian Inference

Introduction to Machine Learning

Non-Parametric Bayes

Multi-layer Neural Networks

PROBABILISTIC PROGRAMMING: BAYESIAN MODELLING MADE EASY

Using Kernel PCA for Initialisation of Variational Bayesian Nonlinear Blind Source Separation Method

CSE 473: Artificial Intelligence Autumn Topics

Advanced statistical methods for data analysis Lecture 2

Bayesian Networks BY: MOHAMAD ALSABBAGH

Introduction to Neural Networks

Keywords- Source coding, Huffman encoding, Artificial neural network, Multilayer perceptron, Backpropagation algorithm

Bayesian methods in economics and finance

An Introduction to Statistical and Probabilistic Linear Models

An artificial neural networks (ANNs) model is a functional abstraction of the

Probabilistic Machine Learning

Lecture 6: Model Checking and Selection

Neural networks. Chapter 20. Chapter 20 1

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

Nonparametric Bayesian Methods (Gaussian Processes)

Graphical Models for Collaborative Filtering

Ch 4. Linear Models for Classification

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

From perceptrons to word embeddings. Simon Šuster University of Groningen

Introduction to Machine Learning

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Recent Advances in Bayesian Inference Techniques

Machine Learning Summer School

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Heterogeneous mixture-of-experts for fusion of locally valid knowledge-based submodels

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

4. Multilayer Perceptrons

SRNDNA Model Fitting in RL Workshop

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Variational Methods in Bayesian Deconvolution

Slow Dynamics Due to Singularities of Hierarchical Learning Machines

I D I A P. A New Margin-Based Criterion for Efficient Gradient Descent R E S E A R C H R E P O R T. Ronan Collobert * Samy Bengio * IDIAP RR 03-16

Bayesian Regression Linear and Logistic Regression

Probabilistic & Bayesian deep learning. Andreas Damianou

Sampling Methods (11/30/04)

Neural Network Training

Bayesian Machine Learning

10-701/ Machine Learning, Fall

Lecture 6: Graphical Models: Learning

Approximate Inference using MCMC

Linear Models for Classification

CPSC 540: Machine Learning

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Multilayer Neural Networks

Transcription:

How ew Information Criteria WAIC and WBIC Worked for MLP Model Selection Seiya Satoh and Ryohei akano ational Institute of Advanced Industrial Science and Tech, --7 Aomi, Koto-ku, Tokyo, 5-6, Japan Chubu University, Matsumoto-cho, Kasugai, 87-85, Japan seiya.satoh@aist.go.jp, nakano@cs.chubu.ac.jp Keywords: Abstract: Information Criteria, Model Selection, Multilayer Perceptron, Singular Model. The present paper evaluates newly invented information criteria for singular models. Well-known criteria such as AIC and BIC are valid for regular statistical models, but their validness for singular models is not guaranteed. Statistical models such as multilayer perceptrons (MLPs), RBFs, HMMs are singular models. Recently WAIC and WBIC have been proposed as new information criteria for singular models. They are developed on a strict mathematical basis, and need empirical evaluation. This paper experimentally evaluates how WAIC and WBIC work for MLP model selection using conventional and new learning methods. ITRODUCTIO A statistical model is called regular if the mapping from a parameter vector to a probability distribution is one-to-one and its Fisher information matrix is always positive definite; otherwise, it is called singular. Many useful statistical models such as multilayer perceptrons (MLPs), RBFs, HMMs, Gaussian mixtures, are all singular. Given data, we sometimes have to select the best statistical model that has the optimum trade-off between goodness of fit and model complexity. This task is called model selection, and many information criteria have been proposed as measures for this task. Most information criteria such as AIC (Akaike s information criterion) (Akaike, 97), BIC (Bayesian information criterion) (Schwarz, 978), and BPIC (Bayesian predictive information criterion) (Ando, 7) are for regular models. These criteria assume the asymptotic normality of maximum likelihood estimator; however, in singular models this assumption does not hold any more. Recently Watanabe established singular learning theory (Watanabe, 9), and proposed new criteria WAIC (widely applicable information criteria) (Watanabe, ) and WBIC (widely applicable Bayesian information criterion) (Watanabe, ), applicable to singular models. WAIC and WBIC have been developed on a strict mathematical basis, and how they work for singular models needs to be investigated hereafter. Let MLP(J) be an MLP having J hidden units; note that an MLP model is determined by the number J. When evaluating MLP model selection experimentally, we have to run learning methods for different MLP models. There can be two ways to perform this learning: independent learning and successive learning. In the former, we run a learning method repeatedly and independently for each MLP(J), whereas in the latter MLP(J) learning inherits solutions from MLP(J ) learning. As J gets larger, a model gets more complex having more fitting capability. This means training error should monotonically decrease as J gets larger. However, independent learning will not guarantee this monotonicity. A new learning method called SSF (Singularity Stairs Following) (Satoh and akano, a; Satoh and akano, b) realizes successive learning by utilizing singular regions to stably find excellent solutions, and can guarantee the monotonicity. This paper experimentally evaluates how new criteria WAIC and WBIC work for MLP model selection, compared with conventional criteria AIC and BIC, using a conventional learning method called (Back Propagation based on Quasi-ewton) (Saito and akano, 997) and the new learning method SSF for search and sampling. is a kind of quasi-ewton method with BFGS (Broyden- Fletcher-Goldfarb-Shanno) update. 5 Satoh, S. and akano, R. How ew Information Criteria WAIC and WBIC Worked for MLP Model Selection. DOI:.5/65 In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods (ICPRAM 7), pages 5- ISB: 978-989-758--6 Copyright c 7 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved

ICPRAM 7-6th International Conference on Pattern Recognition Applications and Methods IFORMATIO CRITERIA FOR MODEL SELECTIO Let a statistical model be p(x w), where x is an input vector and w is a parameter vector. Let given data be D = {x µ, µ =,,}, where indicates data size, the number of data points. AIC and BIC: AIC (Akaike information criterion) (Akaike, 97) and BIC (Bayesian information criterion) (Schwarz, 978) are famous information criteria for regular models. Both deal with the trade-off between goodness of fit and model complexity. The log-likelihood is defined as follow: L (w) = log p(x µ w). () Let ŵ be a maximum likelihood estimator. AIC is given below as an estimator of a compensated loglikelihood using the asymptotic normality of ŵ. Here M is the number of parameters. AIC = L (ŵ) + M = log p(x µ ŵ) + M () BIC is obtained as an estimator of free energy F(D) shown below. Here p(d) is called evidence and p(w) is a prior distribution of w. F(D) = log p(d), () p(d) = p(w) p(x µ w) dw () BIC is derived using the asymptotic normality and Laplace approximation. BIC = L (ŵ) + M log = log p(x µ ŵ) + M log (5) AIC and BIC can be calculated using only one point estimator ŵ. WAIC and WBIC: WAIC and WBIC are derived from Watanabe s singular learning theory (Watanabe, 9) as new information criteria for singular models. Watanabe introduced the following four quantities: Bayes generalization loss BL g, Bayes training loss BL t, Gibbs generalization loss GL g, and Gibbs training loss GL t. BL g = p (x) log p(x D)dx (6) BL t = log p(x µ D) (7) ( ) GL g = p (x)log p(x w)dx p(w D)dw (8) ( ) GL t = log p(x µ w) p(w D)dw (9) Here p (x) is the true distribution, p(w D) is a posterior distribution, and p(x D) is a predictive distribution. p(w D) = p(x D) = p(d) p(w) p(x µ w) () p(x w) p(w D) dw () WAIC and WAIC are given as estimators of BL g and GL g respectively (Watanabe, ). WAIC reduces to AIC for regular models. WAIC = BL t + (GL t BL t ) () WAIC = GL t + (GL t BL t ) () WBIC is given as an estimator of free energy F(D) for singular models (Watanabe, ), where p β (w D) is a posterior distribution under the inverse temperature β. In WBIC context, β is set to be /log(). WBIC reduces to BIC for regular models. ( WBIC = log p(x µ w) p β (w D) = ) p β (w D)dw () p β (D) p(w) p(x µ w) β (5) There are two ways to calculate WAIC and WBIC: analytic approach and empirical one. We employ the latter, which requires a set of weights {w t } which approximates a posterior distribution (Watanabe, 9). SSF: EW LEARIG METHOD SSF (Singularity Stairs Following) is briefly explained; for details, refer to (Satoh and akano, a; Satoh and akano, b). SSF finds solutions of MLP(J) successively from J= until J max making good use of singular regions of each MLP(J). Singular regions of MLP(J) are created by utilizing the optimum of MLP(J ). Gradient is zero all over the region. How to create singular regions is explained below. Consider MLP(J) with just one output unit which outputs f J (x;θ J ) = w + J j= w jz j, where θ J 6

How ew Information Criteria WAIC and WBIC Worked for MLP Model Selection = {w,w j,w j, j =,,J}, z j g(w T j x), and g(h) is an activation function. Given data {(x µ,y µ ),µ =,,}, we try to find MLP(J) which minimizes an error function. We also consider MLP(J ) with θ J = {u,u j,u j, j =,,J}. The output is f J (x;θ J ) = u + J j= u jv j, where v j g(u T j x) ow consider the following reducibility mappings α, β, and γ. Then apply α, β, and γ to the optimum θj to get regions ϑ α J, ϑ β J, and ϑ γ J respectively. θj α ϑ α β J, θj ϑ β J, γ θj ϑ γ J ϑ α J {θ J w =û, w =, w j =û j,w j =û j, j =,,J} ϑ β J {θ J w +w g(w )=û, w =[w,,...,] T, w j =û j,w j = û j, j =,...,J} ϑ γ J {θ J w =û,w +w m =û m, w =w m =û m, w j =û j,w j =û j, j {,...,J}\m} ow two singular regions can be formed. One is αβ ϑ J, the intersection of ϑ α J and ϑ β J. The parameters are as follows, where only w is free: w = û, w =, w = [w,,,] T,w j = û j, w j = û j, j =,,J. The other is ϑ γ J, which has the restriction: w + w m = û m. SSF starts search from MLP(J=) and then gradually increases J one by one until J max. When starting from the singular region, the method employs eigenvector descent (Satoh and akano, ), which finds descending directions, and from then on employs (Saito and akano, 997), a quasi-ewton method. SSF finds excellent solution of MLP(J) one after another for J=,,J max. Thus, SSF guarantees that training error decreases monotonically as J gets larger, which will be quite preferable for model selection. EXPERIMETS Experimental Conditions: We used artificial data since they are easy to control and their true nature is obvious. The structure of an MLP is defined as follows: the numbers of input, hidden, and output units are K, J, and I respectively. Both input and hidden layers have a bias. Values of input data were randomly selected from the range [, ]. Artificial data and data were generated using MLP(K = 5, J =, I = ) and MLP(K =, J =, I = ) respectively. Weights between input and hidden layers were set to be integers randomly selected from the range [, +], whereas weights between hidden and output layers were integers randomly selected from [, +]. A small Gaussian noise with mean zero and standard deviation. was added to each MLP output. Size of training data was set to be = 8, whereas test data size was set to be,. WAIC and WBIC were compared with AIC and BIC. The empirical approach needs a sampling method; however, usual MCMC (Markov chain Monte Carlo) methods such as Metropolis algorithm will not work at all (eal, 996) since MLP search space is quite hard to search. Thus, we employ powerful learning methods and SSF as sampling methods. For AIC and BIC a learning method runs without any regularizer, whereas WAIC and WBIC need a weight decay regularizer whose regularization coefficient λ depends on temperature T. The temperature T was set as suggested in (Watanabe, ; Watanabe, ): T = for WAIC and T = log() for WBIC. The regularization coefficient λ of WAIC is smaller than that of WBIC. WAIC and WBIC were calculated using a set of weights {w t } approximating a posterior distribution. Test error was calculated using test data. Our various previous experiments have shown that (Saito and akano, 997) finds much better solutions than BP (Back Propagation) does, mainly because is a quasi-ewton, a nd-order method. Thus, we employ as a conventional learning method. We performed independently times changing initial weights for each J. Moreover, we employ a newly invented learning method called SSF as well. For SSF, the maximum number of search routes was set to be for each J; J was changed from until. Each run of a learning method was terminated when the number of sweeps exceeded, or the step length got smaller than 6. Experimental Results: Figures to 6 show a set of results for artificial data. Figure shows minimum training error obtained by each learning method for each J. Although SSF guarantees the monotonic decrease of minimum training error, does not in general. However, showed the monotonic decrease for this data. Figure shows test error for ŵ of the best model obtained by each learning method for each J. with λ =, with λ for WAIC, and with λ for WBIC got the minimum test error at J =,, and respectively. SSF with λ =, SSF with λ for WAIC, and SSF with λ for WBIC found the minimum test error at J = 8, 9, and respectively. Figure shows AIC values obtained by each 7

ICPRAM 7-6th International Conference on Pattern Recognition Applications and Methods Training Error 5 (λ =) SSF. (λ =) (WAIC) SSF. (WAIC) (WBIC) SSF. (WBIC) Criterion AIC 9 8 7 6 5 SSF. - Figure : Training Error for Data. - Figure : AIC for Data. Test Error 6 5 (λ =) SSF. (λ =) (WAIC) SSF. (WAIC) (WBIC) SSF. (WBIC) Criterion BIC 5 5 5 5 SSF. 5 Figure : Test Error for Data. -5 Figure : BIC for Data. learning method for each J. AIC of both methods selected J as the best model, which is not suitable at all. Figure shows BIC values obtained by each learning method for each J. BIC of selected J = 8 as the best model, whereas BIC of SSF selected J = 9. Thus, BIC selected a bit smaller models than the true one for this data. Figure 5 shows WAIC and WAIC values obtained by each learning method for each J. WAIC and WAIC of selected J as the best model, which is not suitable. WAIC and WAIC of SSF selected J = 9 as the best model, which is very close to the true one (J = ). WAIC and WAIC selected the same model for each method. Figure 6 shows WBIC values obtained by each learning method for each J. WBIC of selected J, which is not suitable, whereas WBIC of SSF selected J =, which is right. Figures 7 to show the results for artificial data. Figure 7 shows minimum training error. SSF showed the monotonic decrease, whereas did not for WBIC. Figure 8 shows test error for ŵ of the best model obtained by each learning method. with λ =, λ for WAIC, and λ for WBIC got the minimum test error at J =,, and 9 respectively. SSF with λ =, λ for WAIC, and λ for WBIC found the minimum test error at J =,, and respectively. Figure 9 shows AIC values obtained by each learning method for each J. AIC of and SSF selected J = and J respectively as the best model, which is not acceptable. Figure shows BIC values obtained by each learning method for each J. BIC of and SSF selected J = and J = respectively as the best model. BIC selected a bit larger models for this data. Figure shows WAIC and WAIC values obtained by each learning method for each J. WAIC and WAIC of selected J = as the best model, which is very close to the true model. WAIC and WAIC of SSF selected J = as the best model, which is exactly the true one. For this data WAIC and WAIC again selected the same model for each method. Figure shows WBIC values obtained by each learning method for each J. WBIC of selected J =, which is quite unacceptable, whereas 8

How ew Information Criteria WAIC and WBIC Worked for MLP Model Selection Criterion WAIC - - (WAIC ) SSF. (WAIC ) (WAIC ) SSF. (WAIC ) Test Error 7 6 5 (λ =) SSF. (λ =) (WAIC) SSF. (WAIC) (WBIC) SSF. (WBIC) Criterion WBIC Training Error - Figure 5: WAIC for Data. - - SSF. - Figure 6: WBIC for Data. 6 5 - (λ =) SSF. (λ =) (WAIC) SSF. (WAIC) (WBIC) SSF. (WBIC) - Figure 7: Training Error for Data. WBIC of SSF selected J =, which is just the same as the true one. Tables and summarize our results of model selection using and SSF for artificial data and respectively. Criterion AIC Criterion BIC Figure 8: Test Error for Data. 9 8 7 6 5 SSF. - Figure 9: AIC for Data. 5 5 5 5 5 SSF. Figure : BIC for Data. Considerations: The results of our experiments may suggest the following. ote, however, that since our experiments are quite limited, more intensive investigation will be needed to make the tendencies more reliable. 9

ICPRAM 7-6th International Conference on Pattern Recognition Applications and Methods Criterion WAIC Criterion WBIC - (WAIC ) SSF. (WAIC ) (WAIC ) SSF. (WAIC ) - Figure : WAIC for Data. 5 - SSF. - Figure : WBIC for Data. Table : Comparison of Selected Models for Data. learning method criterion SSF AIC BIC 8 9 WAIC 9 WAIC 9 WBIC (a) Independent learning of does not guarantee monotonic decrease of training error along with the increase of J, whereas successive learning of SSF does guarantee the monotonic decrease. For MLP model selection, independent learning sometimes did not work well, showing an up-and-down curve of training error and leading to wrong selection, whereas successive learning seems suited for MLP Table : Comparison of Selected Models for Data. learning method criterion SSF AIC BIC WAIC WAIC WBIC model selection due to the monotonic decrease of training error. (b) In our experiments AIC had the tendency to select the largest (J ) among the candidates for any learning method. This is probably because the penalty for model complexity is too small. BIC worked relatively well, having the tendency to select a bit smaller or larger models than the true one. (c) WAIC and WBIC of SSF worked very well, selecting the true model or models very close to the true one. However, WAIC and WBIC of sometimes didn t work well. Moreover, there was little difference between WAIC and WAIC for each learning method in our experiments. 5 COCLUSIO WAIC and WBIC are new information criteria for singular models. This paper evaluates how they work for MLP model selection using artificial data. We compared them with AIC and BIC using sampling methods. For this sampling, we used independent learning of a conventional learning method and successive learning of a newly invented SSF. Our experiments showed that WAIC and WBIC of SSF worked very well, selecting the true model or very close models for each data, although WAIC and WBIC of sometimes did not work well. AIC did not work well selecting larger models, and BIC had the tendency to select a bit smaller or larger models. In the future we plan to do more intensive investigation on WAIC and WBIC.

How ew Information Criteria WAIC and WBIC Worked for MLP Model Selection ACKOWLEDGEMET This work was supported by Grants-in-Aid for Scientific Research (C) 6K. REFERECES Akaike, H. (97). A new look at the statistical model identification. IEEE Trans. on Automatic Control, AC- 9:76 7. Ando, T. (7). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 9: 58. eal, R. (996). Bayesian learning for neural networks. Springer. Saito, K. and akano, R. (997). Partial BFGS update and efficient step-length calculation for three-layer neural networks. eural Comput., 9():9 57. Satoh, S. and akano, R. (). Eigen vector descent and line search for multilayer perceptron. In IAEG Int. Conf. on Artificial Intelligence & Applications (ICAIA ), volume, pages 6. Satoh, S. and akano, R. (a). Fast and stable learning utilizing singular regions of multilayer perceptron. eural Processing Letters, 8():99 5. Satoh, S. and akano, R. (b). Multilayer perceptron learning utilizing singular regions and search pruning. In Proc. Int. Conf. on Machine Learning and Data Analysis, pages 79 795. Schwarz, G. (978). Estimating the dimension of a model. Annals of Statistics, 6:6 6. Watanabe, S. (9). Algebraic geometry and statistical learning theory. Cambridge University Press, Cambridge. Watanabe, S. (). Equations of states in singular statistical estimation. eural etworks, :. Watanabe, S. (). A widely applicable Bayesian information criterion. Journal of Machine Learning Research, :867 897.