Complexity Bounds of Radial Basis Functions and Multi-Objective Learning


Illya Kokshenev and Antônio P. Braga
Universidade Federal de Minas Gerais - Depto. Engenharia Eletrônica
Av. Antônio Carlos, 6.627 - Campus UFMG Pampulha, 30.161-970, Belo Horizonte, MG - Brasil

Abstract. In this paper, the problem of multi-objective (MOBJ) learning is discussed, and in particular the problem of obtaining an apparent (effective) complexity measure, which is one of the objectives. For the specific case of RBF networks, bounds on a smoothness-based complexity measure are proposed. As shown in the experimental part, the bounds can be used for Pareto set approximation.

1 Introduction

Similarly to other universal approximators, neural networks are capable of yielding good generalization by minimizing a fitting criterion when the amount of observations is sufficiently large. However, real data are usually finite and disturbed by noise, so single-objective error-minimization learning often leads to poor generalization and overfitting. This kind of behaviour led to the development of generalization-control methods, which aim at obtaining a proper balance between bias and variance by also minimizing network complexity [1]. Pruning [2] and regularization [3] methods are commonly used to achieve that goal. While pruning algorithms control complexity by manipulating the network structure, regularization aims at controlling the network output response with penalty functions. However, in both approaches error and network complexity (structural or apparent) are treated as a single-objective problem. Although this may result in models with good generalization, they are highly dependent on user-defined training parameters. In addition, it is well known that error and complexity are conflicting objectives and, similarly to bias and variance, demand balancing instead of joint minimization. This viewpoint of learning demands multi-objective (MOBJ) methods [4], which treat the empirical and structural risks as two independent objectives.

The MOBJ problem can be defined by introducing error and complexity objective functions φ_e(ω) and φ_c(ω), respectively. With the first representing the empirical risk and the second the structural risk, MOBJ learning can be formulated as the vector optimization problem

    min_{ω ∈ Ω} (φ_e(ω), φ_c(ω)),    (1)

where ω is the vector of network parameters in the parameter space Ω. Since the objectives are conflicting in the region of interest, the solution of (1) is a Pareto-optimal front Ω* ⊂ Ω, whose elements ω* ∈ Ω* satisfy the condition

    ∄ ω ∈ Ω : φ_e(ω) ≤ φ_e(ω*), φ_c(ω) ≤ φ_c(ω*), with at least one inequality strict.

In other words, the optimization problem results in the optimal solutions that represent the best compromise between the two objectives. It means that, for every solution ω ∉ Ω*, there is a solution in Ω* that is at least as good in both objectives and strictly better in one of them.
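As a small illustration (not part of the original paper), the Pareto-optimality condition above can be checked directly on any finite set of candidate networks once both objectives have been evaluated. A minimal Python sketch with hypothetical objective values:

```python
import numpy as np

def pareto_front(phi_e, phi_c):
    """Return the indices of the non-dominated (Pareto-optimal) candidates.

    phi_e, phi_c: arrays with the empirical-risk and complexity values of each
    candidate. A candidate is kept if no other candidate is at least as good in
    both objectives and strictly better in one of them.
    """
    phi_e, phi_c = np.asarray(phi_e), np.asarray(phi_c)
    keep = []
    for i in range(len(phi_e)):
        dominated = np.any(
            (phi_e <= phi_e[i]) & (phi_c <= phi_c[i])
            & ((phi_e < phi_e[i]) | (phi_c < phi_c[i]))
        )
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical objective values: the second candidate is dominated by the first.
print(pareto_front([0.1, 0.2, 0.3], [3.0, 3.5, 1.0]))  # -> [0, 2]
```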

Usually, the squared error criterion and the norm of the network weights ‖w‖ are taken as the error and complexity measures for MLP networks [5]. A general definition for assessing the apparent complexity of other network types is not known, which is an obstacle for MOBJ learning. In this paper we present a smoothness-based apparent complexity measure and propose bounds on it for RBF networks. We show that the apparent complexity is bounded by σ⁻¹‖w‖₁, where σ is the radius of the hidden-layer functions and ‖·‖₁ is the Manhattan norm (ℓ1 norm). It is also shown that this result can be used to control the generalization of RBF networks through MOBJ learning.

2 Apparent complexity measure for RBFN

Considering a neural network as a function f in the Sobolev space W^{k,p}, its smoothness can be represented in terms of the Sobolev norm [6]

    ‖f‖_{k,p} = Σ_{i=0}^{k} ‖f^(i)‖_p = Σ_{i=0}^{k} ( ∫ |f^(i)(t)|^p dt )^{1/p}.    (2)

Since only the second- and higher-order terms of the sum (2) capture nonlinear properties, the term f^(2) can be chosen as the nonlinear smoothness criterion; it is also used as a penalty function in regularization [3, 7]. The apparent complexity measure based on f^(2) may therefore be expressed as

    φ_c(ω) = ‖f^(2)(x, ω)‖_x = ∫_x ‖∇²_x f‖ dx,    (3)

where f(x, ω) is the mapping function of the neural network with parameters ω.

Let us consider the particular case of an n-input, single-output RBF network containing m Gaussian functions of the same radius σ and the prototype matrix c = (c_1, c_2, ..., c_m). Introducing the centered input vector δ_j = (x − c_j)/σ with respect to the j-th prototype and the centered kernel function K(δ) = exp(−‖δ‖²/2), the network output can be written as

    f(x, w, σ, c) = Σ_{j=1}^{m} w_j K(δ_j),    (4)

where w is the (m × 1) vector of output weights. Since radial basis functions with spherical receptive fields are considered, the Hessian of (4) is diagonal and its Euclidean norm in (3) is

    ‖∇²_x f‖ = (1/σ²) ( Σ_{i=1}^{n} ( Σ_{j=1}^{m} w_j K(δ_j) (δ_{ji}² − 1) )² )^{1/2}.
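As a numerical illustration only (a sketch, not the authors' implementation), the network output (4) and the Hessian-norm expression above can be evaluated directly in Python/NumPy; the prototypes, weights, radius, and grid below are hypothetical values chosen for the example, and the integral in (3) is approximated by the trapezoidal rule:

```python
import numpy as np

def rbf_output(X, c, w, sigma):
    """Gaussian RBF network (4): f(x) = sum_j w_j K(delta_j), K(d) = exp(-||d||^2 / 2)."""
    delta = (X[:, None, :] - c[None, :, :]) / sigma        # (N, m, n) centered inputs
    K = np.exp(-0.5 * np.sum(delta ** 2, axis=2))          # (N, m) kernel activations
    return K @ w                                           # (N,) network outputs

def hessian_norm(X, c, w, sigma):
    """Euclidean norm of the Hessian diagonal of (4) at each input:
    d2f/dx_i^2 = (1/sigma^2) * sum_j w_j K(delta_j) (delta_ji^2 - 1)."""
    delta = (X[:, None, :] - c[None, :, :]) / sigma
    K = np.exp(-0.5 * np.sum(delta ** 2, axis=2))
    diag = np.einsum('nm,nmi->ni', K * w, delta ** 2 - 1.0) / sigma ** 2
    return np.linalg.norm(diag, axis=1)

# Hypothetical 1-D example: approximate phi_c of (3) on a grid and print the
# network-dependent factor ||w||_1 / sigma of bound (6) below (the constant
# Theta(n) is omitted here).
c = np.linspace(0.0, 1.0, 5)[:, None]                      # 5 equidistant prototypes
w = np.array([0.2, -0.5, 1.0, -0.5, 0.2])
sigma = 0.15
X = np.linspace(-0.5, 1.5, 2001)[:, None]
print(rbf_output(np.array([[0.5]]), c, w, sigma))          # f at x = 0.5
print(np.trapz(hessian_norm(X, c, w, sigma), X[:, 0]))     # numerical phi_c
print(np.sum(np.abs(w)) / sigma)                           # sigma^{-1} ||w||_1
```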

Applying the triangle inequality of the norm operator to the expression for ‖∇²_x f‖ above, we obtain

    ‖∇²_x f‖ ≤ (1/σ²) Σ_{j=1}^{m} ( Σ_{i=1}^{n} ( w_j K(δ_j) (δ_{ji}² − 1) )² )^{1/2},

or, after simplification,

    ‖∇²_x f‖ ≤ (1/σ²) Σ_{j=1}^{m} |w_j| K(δ_j) ( ‖δ_j‖₄⁴ − 2‖δ_j‖₂² + n )^{1/2}.    (5)

Both sides of (5) are positive functions, so the inequality is preserved under integration, providing

    φ_c(w, σ, c) ≤ (1/σ²) Σ_{j=1}^{m} |w_j| ∫_x K(δ_j) ( ‖δ_j‖₄⁴ − 2‖δ_j‖₂² + n )^{1/2} dx.

Introducing Ψ(δ_j) = K(δ_j) ( ‖δ_j‖₄⁴ − 2‖δ_j‖₂² + n )^{1/2} and passing to integration over δ_j = σ⁻¹(x − c_j), we obtain

    φ_c(w, σ, c) ≤ σ⁻¹ Σ_{j=1}^{m} |w_j| ∫ Ψ(δ_j) dδ_j.

Since Ψ(δ_j) does not depend on the network parameters, ∫ Ψ(δ_j) dδ_j can be treated as a function of n alone and taken out of the sum. Accordingly, we obtain the bound on the apparent complexity

    φ_c(w, σ, c) ≤ σ⁻¹ ‖w‖₁ Θ(n),    (6)

where Θ(n) = ∫ Ψ(δ) dδ and ‖·‖₁ is the Manhattan norm (ℓ1 norm).

3 MOBJ learning

Usually, the vector optimization problem (1) does not have an analytical solution; in practice, however, it is sufficient to approximate the Pareto front Ω* with a finite number of solutions. According to the MOBJ learning concept, once the approximation is obtained, the best solution must be selected from Ω* in a decision-making step, according to an a posteriori criterion such as the validation error, maximum entropy, etc.

When nothing is known about the convexity of the Pareto front, the ɛ-constraint method can be used to obtain the approximation of Ω*. The apparent complexity bound (6) holds regardless of the prototype locations. Here, we consider the case in which the prototypes are determined once, by an appropriate strategy, and do not participate in the parameter search. Thus, inequality (6) leads to the ɛ-constrained approximation

    [w*_i, σ*_i] = arg min_{w, σ} φ_e(w, σ, c),  subject to  σ⁻¹ ‖w‖₁ ≤ ɛ_i.    (7)

In view of that, the constraint imposed through the bound on φ_c(ω) is more important than its actual magnitude, and Θ(n) is therefore neglected, under the assumption of its convergence.
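Computationally, (7) can be read as a sweep: for each restriction magnitude ɛ_i, search over the radius σ and, for each σ, fit the output weights under the constraint ‖w‖₁ ≤ σɛ_i, keeping the pair with the lowest training error. The Python sketch below illustrates this loop under those assumptions; it is not the authors' code, and it relies on a hypothetical helper l1_constrained_lsq for the ℓ1-constrained least-squares fit (one possible version is sketched at the end of the Experiments section below):

```python
import numpy as np

def epsilon_constraint_sweep(X, y, c, eps_grid, sigma_grid, l1_constrained_lsq):
    """Approximate the Pareto set via the epsilon-constraint problem (7).

    For each eps in eps_grid, the training error is minimised over (w, sigma)
    subject to ||w||_1 <= sigma * eps, with the prototypes c kept fixed.
    Returns one candidate (w, sigma, training MSE) per eps value.
    """
    candidates = []
    for eps in eps_grid:
        best = None
        for sigma in sigma_grid:
            # design matrix of kernel activations, h_kj = K(delta_j(x(k)))
            delta = (X[:, None, :] - c[None, :, :]) / sigma
            H = np.exp(-0.5 * np.sum(delta ** 2, axis=2))
            w = l1_constrained_lsq(H, y, radius=sigma * eps)
            mse = np.mean((y - H @ w) ** 2)
            if best is None or mse < best[2]:
                best = (w, sigma, mse)
        candidates.append(best)
    return candidates
```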

4 Experiments

The experiment is concerned with the regression of the sinc function

    y(x) = sin(πx) / (πx) + γ,    (8)

where γ is a normally distributed noise component with zero mean and constant variance. Since the problem is simple in itself, we treat it under a high noise level and a small number of observations in order to make good generalization difficult to obtain. We choose m = 30 radial basis functions and a noise variance of σ²_noise = 0.2. The training and validation sets were generated by selecting, respectively, 100 and 50 samples of (8) on the interval x ∈ (0, 4π], normalized to [0, 1]. The test sample of (8) was taken without noise. Since the problem is one-dimensional, the prototype centers were spaced equidistantly over the input range.

The Pareto front was approximated for complexity-bound restriction magnitudes ɛ_i ∈ [0, 80]. The candidate solution for each ɛ_i was selected with respect to the minimal training-error value calculated over radii σ ∈ [0.1, 0.5]. According to (7), the w corresponding to a solution under the chosen restriction must satisfy ‖w‖₁ ≤ σɛ_i. Therefore, for given ɛ_i and σ, the output-layer weight vector w was obtained by solving the constrained least-squares problem

    min_w E = ½ ‖Y − Hw‖²,  subject to  ‖w‖₁ ≤ σɛ_i,    (9)

over the training set of N samples (x(k), y_d(k)) using the ellipsoid method. Here H is the (N × m) matrix of radial basis function values h_kj = K(δ_j(x(k))) and Y = (y_d(1), y_d(2), ..., y_d(N))^T is the desired output vector. The final solution, the one of minimum validation error, is picked from the obtained candidate set.

The results of the Pareto front approximation are shown in Figure 2, where each auxiliary curve is also a Pareto front of the subproblem (9). The regression results are presented in Figure 1 and Table 1, together with a comparison against the ridge regression (RR) method under various model selection criteria: the Bayesian information criterion (BIC), generalized cross-validation (GCV), and maximum marginal likelihood (MML).
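The paper solves (9) with the ellipsoid method; any solver for least squares over an ℓ1 ball can play the same role. The Python sketch below is a substitute illustration, not the original implementation: projected gradient descent with the standard sorting-based Euclidean projection onto the ℓ1 ball.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the ball {w : ||w||_1 <= radius}."""
    if radius <= 0.0:
        return np.zeros_like(v)
    if np.sum(np.abs(v)) <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def l1_constrained_lsq(H, y, radius, n_iter=5000):
    """Minimise 0.5 * ||y - H w||^2 subject to ||w||_1 <= radius, as in (9)."""
    w = np.zeros(H.shape[1])
    step = 1.0 / np.linalg.norm(H, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = H.T @ (H @ w - y)                  # gradient of the squared error
        w = project_l1_ball(w - step * grad, radius)
    return w
```

Combined with the sweep sketched in Section 3, this yields one candidate per ɛ_i; the final model is then the candidate with the lowest validation error, as described above.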

[Figure 1 — panel (a): y versus x, showing the training set, validation set, real function, and MOBJ solution; panel (b): weight magnitude versus weight number for the MOBJ and RR (MML) solutions.]
Fig. 1: The regression results for the MOBJ solution (a) and the weight magnitudes (b) in comparison with the RR results.

Solution    σ      ‖w‖₁    σ⁻¹‖w‖₁    MSE (train / valid. / test)
RR (GCV)    0.2    3.84    17.6       0.0405 / 0.0343 / 0.0048
RR (BIC)    0.2    3.97    17.6       0.0407 / 0.0349 / 0.0054
RR (MML)    0.16   1.95    11.9       0.0406 / 0.039  / 0.0044
MOBJ        0.14   1.41     9.7       0.0403 / 0.0339 / 0.004

Table 1: The experimental results.

[Figure 2 — complexity (σ⁻¹‖w‖₁) versus training MSE, showing the candidate set and the MOBJ, RR (GCV), RR (BIC), and RR (MML) solutions.]
Fig. 2: The results of the Pareto front approximation and the candidate set.

5 Conclusions

The experimental results confirmed the efficiency of generalization control based on the proposed bounds. The solutions obtained by RR learning are close to the MOBJ results. In contrast to the RR results, most of the MOBJ weights have exactly zero magnitude (see Figure 1(b)), so the network structure can be simplified without any loss. Hence, we can conclude that the proposed MOBJ approach combines pruning and regularization properties.

From the experimental results it was observed that, under certain conditions, the Pareto front presents non-convex intervals; moreover, the best solution may belong to them. It is noteworthy that learning with regularization in a general nonlinear form is equivalent to solving the multi-objective problem (1) by the weighting method, that is, by minimizing the convex combination of the objectives

    min_ω φ_e(ω) + λ φ_c(ω),

where λ ≥ 0 is the regularization parameter. For various values of λ, the solutions always form a convex front, which coincides with the Pareto front only on its convex intervals, as shown in the example of Figure 3. Hence, we infer that regularization learning based on the smoothness penalty (3) cannot reach the best possible solutions located in the non-convex regions of the Pareto front, so MOBJ learning is needed in such situations.

[Figure 3 — the Pareto front in the (φ_e, φ_c) plane, with its convex regions and the possible solutions produced by the weighting method for various values of λ.]
Fig. 3: An example of regularization in the case of a non-convex Pareto front.

The proposed bounds (6) on the apparent complexity of RBF networks provide a new possibility for MOBJ learning. The concept can also be extended to more general RBF networks and to other architectures, which is of interest for further research.

References

[1] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58, 1992.
[2] R. Reed. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740-746, 1993.
[3] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
[4] A. P. Braga, R. H. C. Takahashi, M. A. Costa, and R. A. Teixeira. Multi-objective algorithms for neural-networks learning. In Y. Jin, editor, Multi-Objective Machine Learning (Studies in Computational Intelligence, volume 16), pages 151-171. Springer-Verlag, Heidelberg, 2006.
[5] R. A. Teixeira, A. P. Braga, R. H. C. Takahashi, and R. R. Saldanha. Improving generalization of MLPs with multi-objective optimization. Neurocomputing, 35:189-194, 2000.
[6] R. A. Adams and J. J. F. Fournier. Sobolev Spaces. Academic Press, New York, second edition, 2003.
[7] M. A. Kon and L. Plaskota. Complexity of regularization RBF networks. In Proceedings of the International Joint Conference on Neural Networks, INNS, pages 342-346, Washington, 2001.