Radial Basis Function Network for Multi-task Learning


Xuejun Liao
Department of ECE, Duke University
Durham, NC, USA

Lawrence Carin
Department of ECE, Duke University
Durham, NC, USA

Abstract

We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network's generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions.

1 Introduction

In practical applications, one is frequently confronted with situations in which multiple tasks must be solved. Often these tasks are not independent, implying that what is learned from one task is transferable to another correlated task. By making use of this transferability, each task becomes easier to solve. In machine learning, the concept of explicitly exploiting the transferability of expertise between tasks, by learning the tasks simultaneously under a unified representation, is formally referred to as multi-task learning [1].

In this paper we extend radial basis function (RBF) networks [4,5] to the scenario of multi-task learning and present the corresponding learning algorithms. Our primary interest is to learn the regression models of several data sets, where any given data set may be correlated with some other sets but not necessarily with all of them. The advantage of multi-task learning is usually manifested when the training set of each individual task is weak, i.e., when it does not generalize well to the test data. Our algorithms aim to enhance, in a mutually beneficial way, the weak training sets of multiple tasks by learning them simultaneously. Multi-task learning becomes superfluous when the data sets all come from the same generating distribution, since in that case we can simply take their union and treat the union as a single task. In the other extreme, when all the tasks are independent, there is no correlation to utilize and we learn each task separately.

The paper is organized as follows. We define the structure of the multi-task RBF network in Section 2 and present the supervised learning algorithm in Section 3. In Section 4 we show how to learn the network structure in an unsupervised manner, and based on this we demonstrate how to actively select the training data, with the goal of improving the generalization to test data.

We perform experimental studies in Section 5 and conclude the paper in Section 6.

2 Multi-Task Radial Basis Function Network

Figure 1 schematizes the radial basis function (RBF) network structure customized to multi-task learning. The network consists of an input layer, a hidden layer, and an output layer. The input layer receives a data point x = [x_1, ..., x_d]^T in R^d and submits it to the hidden layer. Each node at the hidden layer has a localized activation φ_n(x) = φ(‖x - c_n‖, σ_n), n = 1, ..., N, where ‖·‖ denotes the vector norm and φ_n(·) is a radial basis function (RBF) localized around c_n, with the degree of localization parameterized by σ_n. Choosing φ(z, σ) = exp(-z²/σ²) gives the Gaussian RBF. The activations of all hidden nodes are weighted and sent to the output layer. Each output node represents a unique task and has its own hidden-to-output weights. The weighted activations of the hidden nodes are summed at each output node to produce the output for the associated task. Denoting by w_k = [w_{0k}, w_{1k}, ..., w_{Nk}]^T the weights connecting the hidden nodes to the k-th output node, the output for the k-th task, in response to input x, takes the form

f_k(x) = w_k^T φ(x)    (1)

where φ(x) = [φ_0(x), φ_1(x), ..., φ_N(x)]^T is a column containing N + 1 basis functions, with φ_0(x) = 1 a dummy basis accounting for the bias in Figure 1.

Figure 1: A multi-task structure of the RBF network. The input layer, specified by the data dimensionality, receives the data point x = [x_1, ..., x_d]^T; the hidden layer consists of the bias node and the basis functions φ_1(x), ..., φ_N(x), which are to be learned by the algorithms; the output layer, specified by the number of tasks, produces the network responses f_1(x), ..., f_K(x) for tasks 1 through K via the hidden-to-output weights w_1, ..., w_K. Each of the K output nodes represents a unique task. Each task has its own hidden-to-output weights, but all the tasks share the same hidden nodes. The activation of hidden node n is characterized by a basis function φ_n(x) = φ(‖x - c_n‖, σ_n). A typical choice is φ(z, σ) = exp(-z²/σ²), which gives the Gaussian RBF.
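To make the structure concrete, here is a minimal NumPy sketch of the forward computation in (1); the Gaussian choice of φ, the array layout, and the function names (rbf_design, multitask_rbf_forward) are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def rbf_design(X, centers, widths):
    """Rows of the returned matrix are φ(x)^T = [1, φ_1(x), ..., φ_N(x)] for each row x of X.

    X: (J, d) inputs; centers: (N, d) RBF centers c_n; widths: (N,) localization parameters σ_n.
    Gaussian RBF as in Figure 1: φ_n(x) = exp(-||x - c_n||^2 / σ_n^2).
    """
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (J, N) squared distances
    Phi = np.exp(-d2 / widths ** 2)                                  # hidden-node activations
    return np.hstack([np.ones((X.shape[0], 1)), Phi])                # prepend the bias basis φ_0 = 1

def multitask_rbf_forward(X, centers, widths, W):
    """f_k(x) = w_k^T φ(x) for every task; W stacks the weight vectors w_k as columns, shape (N+1, K)."""
    return rbf_design(X, centers, widths) @ W                        # (J, K): one output column per task
```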

3 Supervised Learning

Suppose we have K tasks and the data set of the k-th task is D_k = {(x_{1k}, y_{1k}), ..., (x_{J_k k}, y_{J_k k})}, where y_{ik} is the target (desired output) of x_{ik}. By definition, a given data point x_{ik} is said to be supervised if the associated target y_{ik} is provided, and unsupervised if y_{ik} is not provided. The definition extends similarly to a set of data points. We are interested in learning the functions f_k(x) for the K tasks, based on the union of the D_k, k = 1, ..., K. The learning is based on minimizing the squared error

e(φ, w) = Σ_{k=1}^K { Σ_{i=1}^{J_k} (w_k^T φ_{ik} - y_{ik})² + ρ ‖w_k‖² }    (2)

where φ_{ik} = φ(x_{ik}) for notational simplicity. The regularization terms ρ‖w_k‖², k = 1, ..., K, are used to prevent singularity of the A_k matrices defined in (3), and ρ is typically set to a small positive number. For fixed φ, the w_k are solved by minimizing e(φ, w) with respect to w_k, yielding

w_k = A_k^{-1} Σ_{i=1}^{J_k} y_{ik} φ_{ik}   and   A_k = Σ_{i=1}^{J_k} φ_{ik} φ_{ik}^T + ρI,   k = 1, ..., K    (3)

In a multi-task RBF network, the input layer and the output layer are respectively specified by the data dimensionality and the number of tasks. We now discuss how to determine the hidden layer (the basis functions φ). Substituting the solutions of the w_k in (3) into (2) gives

e(φ) = Σ_{k=1}^K Σ_{i=1}^{J_k} (y_{ik}² - y_{ik} w_k^T φ_{ik})    (4)

where e(φ) is a function of φ only, because the w_k are now functions of φ as given by (3). By minimizing e(φ), we can determine φ. Recalling that φ_{ik} is an abbreviation of φ(x_{ik}) = [1, φ_1(x_{ik}), ..., φ_N(x_{ik})]^T, this amounts to determining N, the number of basis functions, and the functional form of each basis function φ_n(·), n = 1, ..., N.

Consider the candidate functions {φ^{nm}(x) = φ(‖x - x_{nm}‖, σ) : n = 1, ..., J_m, m = 1, ..., K}. We learn the RBF network structure by selecting φ(·) from these candidate functions such that e(φ) in (4) is minimized. The following theorem tells us how to perform the selection in a sequential way; the proof is given in the Appendix.

Table 1: Learning Algorithm of the Multi-Task RBF Network

Input: {(x_{1k}, y_{1k}), ..., (x_{J_k k}, y_{J_k k})}_{k=1:K}, the RBF form φ(·, σ), an initial width σ, and ρ; Output: φ(·) and {w_k}_{k=1}^K.
1. For m = 1:K, for n = 1:J_m, for k = 1:K, for i = 1:J_k, compute φ^{nm}_{ik} = φ(‖x_{nm} - x_{ik}‖, σ);
2. Let N = 0, φ(·) = 1, e_0 = Σ_{k=1}^K [ Σ_{i=1}^{J_k} y_{ik}² - (J_k + ρ)^{-1} (Σ_{i=1}^{J_k} y_{ik})² ]; for k = 1:K, compute A_k = J_k + ρ and w_k = (J_k + ρ)^{-1} Σ_{i=1}^{J_k} y_{ik};
3. For m = 1:K, for n = 1:J_m: if φ^{nm} is not marked as "deleted", then for k = 1:K compute c_k = Σ_{i=1}^{J_k} φ^{nm}_{ik} φ_{ik} and q_k = Σ_{i=1}^{J_k} (φ^{nm}_{ik})² + ρ - c_k^T A_k^{-1} c_k; if there exists k such that q_k = 0, mark φ^{nm} as "deleted"; else, compute δe(φ, φ^{nm}) using (5);
4. If the candidates {φ^{nm}}_{n=1:J_m, m=1:K} are all marked as "deleted", go to 10;
5. Let (n*, m*) = arg max over φ^{nm} not marked as "deleted" of δe(φ, φ^{nm}); mark φ^{n*m*} as "deleted";
6. Tune the RBF parameter: σ_{N+1} = arg max_σ δe(φ, φ(‖· - x_{n*m*}‖, σ));
7. Let φ_{N+1}(·) = φ(‖· - x_{n*m*}‖, σ_{N+1}); update φ(·) to [φ^T(·), φ_{N+1}(·)]^T;
8. For k = 1:K, compute A_k^{new} and w_k^{new} respectively by (A-1) and (A-3) in the Appendix; update A_k to A_k^{new} and w_k to w_k^{new};
9. Let e_{N+1} = e_N - δe(φ, φ_{N+1}); if the sequence {e_n}_{n=0:(N+1)} has converged, go to 10, else update N to N + 1 and go back to 3;
10. Exit and output φ(·) and {w_k}_{k=1}^K.
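Step 2 of Table 1 initializes, and step 8 incrementally updates, the per-task quantities A_k and w_k; at any point they coincide with the regularized least-squares solution (3), which decouples across the K tasks once the basis functions are fixed. A minimal NumPy sketch of that per-task solve follows; the function and array names are hypothetical.

```python
import numpy as np

def task_weights(Phi_k, y_k, rho):
    """Solve (3) for one task: w_k = A_k^{-1} Σ_i y_ik φ_ik, with A_k = Σ_i φ_ik φ_ik^T + ρI.

    Phi_k: (J_k, N+1) design matrix whose rows are φ(x_ik)^T; y_k: (J_k,) targets; rho > 0.
    """
    A_k = Phi_k.T @ Phi_k + rho * np.eye(Phi_k.shape[1])
    w_k = np.linalg.solve(A_k, Phi_k.T @ y_k)
    return w_k, A_k

# Because the tasks share the basis functions but not the weights, the K solves are independent:
# W = np.column_stack([task_weights(Phi[k], y[k], rho)[0] for k in range(K)])
```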

Theorem 1. Let φ(x) = [1, φ_1(x), ..., φ_N(x)]^T and let φ_{N+1}(x) be a single basis function. Assume the A_k matrices corresponding to φ and to [φ^T, φ_{N+1}]^T are all non-degenerate. Then

δe(φ, φ_{N+1}) = e(φ) - e([φ^T, φ_{N+1}]^T) = Σ_{k=1}^K (c_k^T w_k - Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik})² q_k^{-1}    (5)

where φ^{N+1}_{ik} = φ_{N+1}(x_{ik}), w_k and A_k are the same as in (3), and

c_k = Σ_{i=1}^{J_k} φ_{ik} φ^{N+1}_{ik},   d_k = Σ_{i=1}^{J_k} (φ^{N+1}_{ik})² + ρ,   q_k = d_k - c_k^T A_k^{-1} c_k    (6)

By the conditions of the theorem, A_k^{new} is full rank, and since it is positive semi-definite by construction, it is positive definite. By (A-2) in the Appendix, q_k^{-1} is a diagonal element of (A_k^{new})^{-1}; therefore q_k^{-1} is positive, and by (5) δe(φ, φ_{N+1}) > 0, which means that adding φ_{N+1} to φ generally makes the squared error decrease. The decrease δe(φ, φ_{N+1}) depends on φ_{N+1}. By sequentially selecting the basis functions that bring the maximum error reduction, we achieve the goal of minimizing e(φ). The details of the learning algorithm are summarized in Table 1.

4 Active Learning

In the previous section, the data in the D_k are supervised (provided with the targets). In this section, we assume the data in the D_k are initially unsupervised (only x is available, without access to the associated y), and we select a subset from the D_k to be supervised (targets acquired), such that the resulting network generalizes well to the remaining data in the D_k. The approach is generally known as active learning [6]. We first learn the basis functions φ from the unsupervised data, and based on φ we select data to be supervised. Both of these steps are based on the following theorem, the proof of which is given in the Appendix.

Theorem 2. Let there be K tasks and let the data set of the k-th task be D_k ∪ D̃_k, where D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k} and D̃_k = {(x_{ik}, y_{ik})}_{i=J_k+1}^{J_k+J̃_k}. Let there be two multi-task RBF networks, whose output nodes are characterized by f̃_k(·) and f_k(·), respectively, for task k = 1, ..., K. The two networks have the same given basis functions (hidden nodes) φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, but different hidden-to-output weights. The weights of f̃_k(·) are trained with D_k ∪ D̃_k, while the weights of f_k(·) are trained using D̃_k. Then for k = 1, ..., K, the squared errors committed on D_k by f̃_k(·) and f_k(·) are related by

0 ≤ [det Γ_k]^{-1} ≤ λ_{max,k}^{-1} ≤ [Σ_{i=1}^{J_k} (y_{ik} - f̃_k(x_{ik}))²] / [Σ_{i=1}^{J_k} (y_{ik} - f_k(x_{ik}))²] ≤ λ_{min,k}^{-1} ≤ 1    (7)

where Γ_k = [I + Φ_k^T (ρI + Φ̃_k Φ̃_k^T)^{-1} Φ_k]², with Φ_k = [φ(x_{1k}), ..., φ(x_{J_k k})] and Φ̃_k = [φ(x_{J_k+1,k}), ..., φ(x_{J_k+J̃_k,k})], and λ_{max,k} and λ_{min,k} are respectively the largest and smallest eigenvalues of Γ_k.

Specializing Theorem 2 to the case J̃_k = 0, we have

Corollary 1. Let there be K tasks and let the data set of the k-th task be D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}. Let the RBF network, whose output nodes are characterized by f_k(·) for task k = 1, ..., K, have given basis functions (hidden nodes) φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, and let the hidden-to-output weights of task k be trained with D_k. Then for k = 1, ..., K, the squared error committed on D_k by f_k(·) is bounded as 0 ≤ [det Γ_k]^{-1} ≤ λ_{max,k}^{-1} ≤ [Σ_{i=1}^{J_k} y_{ik}²]^{-1} Σ_{i=1}^{J_k} (y_{ik} - f_k(x_{ik}))² ≤ λ_{min,k}^{-1} ≤ 1, where Γ_k = (I + ρ^{-1} Φ_k^T Φ_k)² with Φ_k = [φ(x_{1k}), ..., φ(x_{J_k k})], and λ_{max,k} and λ_{min,k} are respectively the largest and smallest eigenvalues of Γ_k.

It is evident from the properties of the matrix determinant [7] and the definition of Φ_k that det Γ_k = [det(ρI + Φ_k^T Φ_k)]² / [det(ρI)]² = [det(ρI + Σ_{i=1}^{J_k} φ_{ik} φ_{ik}^T)]² / [det(ρI)]².
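This determinant identity (essentially Sylvester's determinant theorem applied to Φ_k) is easy to confirm numerically; the short check below uses random matrices purely as a sanity test and is not part of the learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
J_k, N, rho = 12, 4, 0.1
Phi_k = rng.normal(size=(N + 1, J_k))            # columns are φ(x_ik), as in Corollary 1

Gamma_k = np.linalg.matrix_power(np.eye(J_k) + Phi_k.T @ Phi_k / rho, 2)
A_k = Phi_k @ Phi_k.T + rho * np.eye(N + 1)      # A_k from (3)

lhs = np.linalg.det(Gamma_k)
rhs = (np.linalg.det(A_k) / np.linalg.det(rho * np.eye(N + 1))) ** 2
assert np.isclose(lhs, rhs)                      # det Γ_k = [det A_k]^2 / [det(ρI)]^2
```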

Using (3), we can write this succinctly as det Γ_k = [det A_k]² / [det(ρI)]². We are interested in selecting the basis functions φ that minimize the error, before seeing the y's. By Corollary 1 and the equation det Γ_k = [det A_k]² / [det(ρI)]², the squared error is lower bounded by Σ_{i=1}^{J_k} y_{ik}² [det(ρI)]² [det A_k]^{-2}. Instead of minimizing the error directly, we minimize its lower bound. As [det(ρI)]² Σ_{i=1}^{J_k} y_{ik}² does not depend on φ, this amounts to selecting φ to minimize (det A_k)^{-2}. To minimize the errors for all tasks k = 1, ..., K, we select φ to minimize Π_{k=1}^K (det A_k)^{-2}.

The selection proceeds in a sequential manner. Suppose we have selected the basis functions φ = [1, φ_1, ..., φ_N]^T. The associated A matrices are A_k = Σ_{i=1}^{J_k} φ_{ik} φ_{ik}^T + ρI_{(N+1)×(N+1)}, k = 1, ..., K. Augmenting the basis functions to [φ^T, φ_{N+1}]^T, the A matrices change to A_k^{new} = Σ_{i=1}^{J_k} [φ_{ik}^T, φ^{N+1}_{ik}]^T [φ_{ik}^T, φ^{N+1}_{ik}] + ρI_{(N+2)×(N+2)}. Using the determinant formula for block matrices [7], we get Π_{k=1}^K (det A_k^{new})^{-2} = Π_{k=1}^K (q_k det A_k)^{-2}, where q_k is the same as in (6). As A_k does not depend on φ_{N+1}, the left-hand side is minimized by maximizing Π_{k=1}^K q_k². The selection is easily implemented by making the following two minor modifications in Table 1: (a) in step 2, compute e_0 = -Σ_{k=1}^K ln(J_k + ρ)²; (b) in step 3, compute δe(φ, φ^{nm}) = Σ_{k=1}^K ln q_k². Employing the logarithm is for gaining additivity, and it does not affect the maximization.

Based on the basis functions φ determined above, we proceed to selecting the data to be supervised and determining the hidden-to-output weights w_k from the supervised data using the equations in (3). The selection of the data is based on an iterative use of the following corollary, which is a specialization of Theorem 2 and was originally given in [8].

Corollary 2. Let there be K tasks and let the data set of the k-th task be D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}. Let there be two RBF networks, whose output nodes are characterized by f_k(·) and f_k^+(·), respectively, for task k = 1, ..., K. The two networks have the same given basis functions φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, but different hidden-to-output weights. The weights of f_k(·) are trained with D_k, while the weights of f_k^+(·) are trained using D_k^+ = D_k ∪ {(x_{J_k+1,k}, y_{J_k+1,k})}. Then for k = 1, ..., K, the squared errors committed on (x_{J_k+1,k}, y_{J_k+1,k}) by f_k(·) and f_k^+(·) are related by

[f_k^+(x_{J_k+1,k}) - y_{J_k+1,k}]² = [γ(x_{J_k+1,k})]^{-1} [f_k(x_{J_k+1,k}) - y_{J_k+1,k}]²

where γ(x_{J_k+1,k}) = [1 + φ^T(x_{J_k+1,k}) A_k^{-1} φ(x_{J_k+1,k})]² and A_k = ρI + Σ_{i=1}^{J_k} φ(x_{ik}) φ^T(x_{ik}) is the same as in (3).

Two observations are made from Corollary 2. First, if γ(x_{J_k+1,k}) ≈ 1, seeing y_{J_k+1,k} does not affect the error on x_{J_k+1,k}, indicating that D_k already contains sufficient information about (x_{J_k+1,k}, y_{J_k+1,k}). Second, if γ(x_{J_k+1,k}) ≫ 1, seeing y_{J_k+1,k} greatly decreases the error on x_{J_k+1,k}, indicating that x_{J_k+1,k} is significantly dissimilar (novel) to D_k and must be supervised to reduce the error.

Based on Corollary 2, the selection proceeds sequentially. Suppose we have selected the data D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}, from which we compute A_k. We select the next data point as x_{J_k+1,k} = arg max_{i>J_k, k=1,...,K} γ(x_{ik}) = arg max_{i>J_k, k=1,...,K} [1 + φ^T(x_{ik}) A_k^{-1} φ(x_{ik})]². After x_{J_k+1,k} is selected, A_k is updated and the next selection begins. As the iteration advances, γ decreases until it reaches convergence. We use (3) to compute w_k from the selected x's and their associated targets y's, completing the learning of the RBF network.
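A compact sketch of this sequential selection is given below; the function name select_supervised, the pool layout Phi_pool, and the stopping count n_select are placeholders, while the score itself follows Corollary 2 with A_k maintained per task as in (3).

```python
import numpy as np

def select_supervised(Phi_pool, n_select, rho):
    """Greedily pick the pool points with the largest novelty score γ(x) = [1 + φ(x)^T A_k^{-1} φ(x)]^2.

    Phi_pool: list of K arrays; Phi_pool[k] has shape (J_k_pool, N+1), rows are φ(x_ik)^T.
    Returns, for each task, the indices chosen to be supervised, in the order they were selected.
    """
    K, N1 = len(Phi_pool), Phi_pool[0].shape[1]
    A = [rho * np.eye(N1) for _ in range(K)]           # A_k before any data are supervised
    chosen = [[] for _ in range(K)]
    for _ in range(n_select):
        best = None
        for k in range(K):
            for i in range(Phi_pool[k].shape[0]):
                if i in chosen[k]:
                    continue
                phi = Phi_pool[k][i]
                gamma = (1.0 + phi @ np.linalg.solve(A[k], phi)) ** 2
                if best is None or gamma > best[0]:
                    best = (gamma, k, i)
        if best is None:                               # pool exhausted
            break
        _, k, i = best
        phi = Phi_pool[k][i]
        A[k] += np.outer(phi, phi)                     # acquiring the target updates A_k as in (3)
        chosen[k].append(i)
    return chosen
```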

5 Experimental Results

In this section we compare the multi-task RBF network against single-task RBF networks via experimental studies. We consider three types of RBF networks to learn K tasks, each with its own data set D_k. In the first, which we call the "one RBF network," we let the K tasks share both the basis functions φ (hidden nodes) and the hidden-to-output weights w; thus we do not distinguish the K tasks and design a single RBF network to learn their union. The second is the multi-task RBF network, where the K tasks share the same φ but each has its own w_k. In the third, we have K independent networks, each designed for a single task.

We use the school data set from the Inner London Education Authority, consisting of examination records of 15362 students from 139 secondary schools. The data are publicly available. This data set was originally used to study the effectiveness of schools and has recently been used to evaluate multi-task algorithms [2,3]. The goal is to predict the exam scores of the students based on 9 variables: year of exam (1985, 1986, or 1987), school code (1-139), FSM (percentage of students eligible for free school meals), VR1 band (percentage of students in the school in VR band one), gender, VR band of student (3 categories), ethnic group of student (11 categories), school gender (male, female, or mixed), and school denomination (3 categories). We consider each school a task, leading to 139 tasks in total. The remaining 8 variables are used as inputs to the RBF network. Following [2,3], we converted each categorical variable to a number of binary variables, resulting in a total of 27 input variables, i.e., x in R^27. The exam score is the target to be predicted.

The three types of RBF networks defined above are designed as follows. The multi-task RBF network is implemented with the structure shown in Figure 1 and trained with the learning algorithm in Table 1. The one RBF network is implemented as a special case of Figure 1, with a single output node, and trained using the union of the supervised data from all 139 schools. We design 139 independent RBF networks, each of which is implemented with a single output node and trained using the supervised data from a single school. We use the Gaussian RBF φ_n(x) = exp(-‖x - c_n‖²/σ_n²), where the c_n are selected from the training data points and the σ_n are initialized to a common initial width and optimized as described in Table 1. The main role of the regularization parameter ρ is to prevent the A_k matrices from being singular, and it does not affect the results seriously; in the results reported here, ρ is set to a small positive value.

Following [2,3], we randomly take 75% of the 15362 data points as training (supervised) data and the remaining 25% as test data. The generalization performance is measured by the squared error (f_k(x_{ik}) - y_{ik})², averaged over all test data x_{ik} of tasks k = 1, ..., K. We made 10 independent trials of randomly splitting the data into training and test sets; the squared error, averaged over the test data of all 139 schools and over the trials, is shown in Table 2 for the three types of RBF networks.

Table 2: Squared error averaged over the test data of all 139 schools and the 10 independent trials of randomly splitting the school data into training (75%) and test (25%) sets.

    Multi-task RBF network      Independent RBF networks      One RBF network
              ±                             ±                      ± .8093
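The numbers in Table 2 come from the protocol just described: binary expansion of the categorical attributes, ten random 75%/25% splits, and the squared error averaged over all test points. A sketch of one such trial follows; the data-frame column names, the one_trial helper, and the fit_and_score routine are placeholders, not part of the original paper.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the actual field layout of the school data is not reproduced here.
CATEGORICAL = ["year", "gender", "vr_band", "ethnic_group", "school_gender", "school_denomination"]
NUMERIC = ["fsm_percent", "vr1_percent"]

def one_trial(df, fit_and_score, seed):
    """One of the ten random splits: expand the categorical attributes into binary indicators,
    hold out 25% of the records as test data, and return the squared error averaged over all
    test points of all 139 schools. fit_and_score(train, test) stands for training one of the
    three RBF variants and returning the per-point squared test errors."""
    X = pd.get_dummies(df[CATEGORICAL + NUMERIC], columns=CATEGORICAL)   # binary expansion of categoricals
    data = X.assign(score=df["exam_score"], school=df["school_code"])
    in_train = np.random.default_rng(seed).random(len(data)) < 0.75
    return np.mean(fit_and_score(data[in_train], data[~in_train]))

# errors = [one_trial(df, fit_and_score, s) for s in range(10)]
# np.mean(errors), np.std(errors)    # the mean and standard deviation reported in Table 2
```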
Table 2 clearly shows that the multi-task RBF network outperforms the other two types of RBF networks by a considerable margin. The one RBF network ignores the differences between the tasks, and the independent RBF networks ignore the correlations between the tasks; therefore they both perform inferiorly. The multi-task RBF network uses the shared hidden nodes (basis functions) to capture the common internal representation of the tasks and meanwhile uses the independent hidden-to-output weights to learn the statistics specific to each task.

We now demonstrate the results of active learning. We use the method in Section 4 to actively split the data into training and test sets using a two-step procedure. First we learn the basis functions φ of the multi-task RBF network using all 15362 data points (unsupervised). Based on φ, we then select the data to be supervised and use them as training data to learn the hidden-to-output weights w_k.

To make the results comparable, we use the same training data to learn the other two types of RBF networks (including learning their own φ and w). The networks are then tested on the remaining data. Figure 2 shows the results of active learning. Each curve is the squared error averaged over the test data of all 139 schools, as a function of the number of training data. It is clear that the multi-task RBF network maintains its superior performance all the way down to 5000 training data points, whereas the independent RBF networks have their performance degraded seriously as the training data diminish. This demonstrates the increasing advantage of multi-task learning as the number of training data decreases. The one RBF network also seems insensitive to the number of training data, but it ignores the inherent dissimilarity between the tasks, which makes its performance inferior.

Figure 2: Squared error averaged over the test data of all 139 schools, as a function of the number of training (supervised) data, for the multi-task RBF network, the independent RBF networks, and the one RBF network. The data are split into training and test sets via active learning.

6 Conclusions

We have presented the structure and learning algorithms for multi-task learning with the radial basis function (RBF) network. By letting multiple tasks share the basis functions (hidden nodes), we impose a common internal representation for correlated tasks. Exploiting the inter-task correlation yields a more compact network structure that has enhanced generalization ability. Unsupervised learning of the network structure enables us to actively split the data into training and test sets. As the data most novel to the previously selected ones are selected next, what finally remain unselected, and are to be tested on, are all similar to the selected data constituting the training set. This improves the generalization of the resulting network to the test data. These conclusions are substantiated by results on real multi-task data.

References

[1] R. Caruana (1997). Multitask learning. Machine Learning, 28:41-75.

[2] B. Bakker and T. Heskes (2003). Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83-99.

[3] T. Evgeniou, C. A. Micchelli, and M. Pontil (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637.

[4] M. Powell (1987). Radial basis functions for multivariable interpolation: a review. In J. C. Mason and M. G. Cox, editors, Algorithms for Approximation.

[5] S. Chen, C. F. N. Cowan, and P. M. Grant (1991). Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks, 2(2):302-309.

[6] D. A. Cohn, Z. Ghahramani, and M. I. Jordan (1995). Active learning with statistical models. Advances in Neural Information Processing Systems 7.

[7] V. Fedorov (1972). Theory of Optimal Experiments. Academic Press.

[8] M. Stone (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111-147.

Appendix

Proof of Theorem 1. Let φ^{new} = [φ^T, φ_{N+1}]^T. By (3), the A matrices corresponding to φ^{new} are

A_k^{new} = Σ_{i=1}^{J_k} [φ_{ik}^T, φ^{N+1}_{ik}]^T [φ_{ik}^T, φ^{N+1}_{ik}] + ρI_{(N+2)×(N+2)} = [ A_k, c_k ; c_k^T, d_k ]    (A-1)

where c_k and d_k are as in (6). By the conditions of the theorem, the matrices A_k and A_k^{new} are all non-degenerate. Using the block matrix inversion formula [7] we get

(A_k^{new})^{-1} = [ A_k^{-1} + A_k^{-1} c_k q_k^{-1} c_k^T A_k^{-1}, -A_k^{-1} c_k q_k^{-1} ; -q_k^{-1} c_k^T A_k^{-1}, q_k^{-1} ]    (A-2)

where q_k is as in (6). By (3), the weights w_k^{new} corresponding to [φ^T, φ_{N+1}]^T are

w_k^{new} = (A_k^{new})^{-1} [ Σ_{i=1}^{J_k} y_{ik} φ_{ik} ; Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik} ] = [ w_k + A_k^{-1} c_k q_k^{-1} g_k ; -q_k^{-1} g_k ]    (A-3)

with g_k = c_k^T w_k - Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik}. Hence, (φ^{new}_{ik})^T w_k^{new} = φ_{ik}^T w_k + (φ_{ik}^T A_k^{-1} c_k - φ^{N+1}_{ik}) g_k q_k^{-1}, which is put into (4) to get

e(φ^{new}) = Σ_{k=1}^K Σ_{i=1}^{J_k} [ y_{ik}² - y_{ik} (φ^{new}_{ik})^T w_k^{new} ] = Σ_{k=1}^K Σ_{i=1}^{J_k} [ y_{ik}² - y_{ik} φ_{ik}^T w_k - y_{ik} (φ_{ik}^T A_k^{-1} c_k - φ^{N+1}_{ik}) g_k q_k^{-1} ] = e(φ) - Σ_{k=1}^K (c_k^T w_k - Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik})² q_k^{-1},

where in arriving at the last equality we have used (3) and (4) and g_k = c_k^T w_k - Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik}. The theorem is proved.

Proof of Theorem 2. The proof applies to each k = 1, ..., K. For any given k, define Φ_k = [φ(x_{1k}), ..., φ(x_{J_k k})], Φ̃_k = [φ(x_{J_k+1,k}), ..., φ(x_{J_k+J̃_k,k})], y_k = [y_{1k}, ..., y_{J_k k}]^T, ỹ_k = [y_{J_k+1,k}, ..., y_{J_k+J̃_k,k}]^T, f̃_k = [f̃_k(x_{1k}), ..., f̃_k(x_{J_k k})]^T, f_k = [f_k(x_{1k}), ..., f_k(x_{J_k k})]^T, and Ã_k = ρI + Φ̃_k Φ̃_k^T. By (1), (3), and the conditions of the theorem,

f̃_k = Φ_k^T (Ã_k + Φ_k Φ_k^T)^{-1} (Φ_k y_k + Φ̃_k ỹ_k)
(a) = Φ_k^T [ Ã_k^{-1} - Ã_k^{-1} Φ_k (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} Φ_k^T Ã_k^{-1} ] (Φ_k y_k + Φ̃_k ỹ_k)
= (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} [ Φ_k^T Ã_k^{-1} Φ̃_k ỹ_k + Φ_k^T Ã_k^{-1} Φ_k y_k ]
(b) = (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} f_k + (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} (Φ_k^T Ã_k^{-1} Φ_k + I - I) y_k
= y_k + (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} (f_k - y_k),

where equation (a) is due to the Sherman-Morrison-Woodbury formula and equation (b) results because f_k = Φ_k^T Ã_k^{-1} Φ̃_k ỹ_k. Hence, f̃_k - y_k = (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} (f_k - y_k), which gives

Σ_{i=1}^{J_k} (y_{ik} - f̃_k(x_{ik}))² = (f̃_k - y_k)^T (f̃_k - y_k) = (f_k - y_k)^T Γ_k^{-1} (f_k - y_k)    (A-4)

where Γ_k = [I + Φ_k^T Ã_k^{-1} Φ_k]² = [I + Φ_k^T (ρI + Φ̃_k Φ̃_k^T)^{-1} Φ_k]². By construction, Γ_k has all its eigenvalues no less than 1, i.e., Γ_k = E_k diag[λ_{1,k}, ..., λ_{J_k,k}] E_k^T with E_k^T E_k = I and λ_{1,k}, ..., λ_{J_k,k} ≥ 1, which makes the first, second, and last inequalities in (7) hold. Using this expansion of Γ_k in (A-4) we get

Σ_{i=1}^{J_k} (f̃_k(x_{ik}) - y_{ik})² = (f_k - y_k)^T E_k diag[λ_{1,k}^{-1}, ..., λ_{J_k,k}^{-1}] E_k^T (f_k - y_k) ≤ (f_k - y_k)^T E_k [λ_{min,k}^{-1} I] E_k^T (f_k - y_k) = λ_{min,k}^{-1} Σ_{i=1}^{J_k} (f_k(x_{ik}) - y_{ik})²    (A-5)

where the inequality results because λ_{min,k} = min(λ_{1,k}, ..., λ_{J_k,k}). From (A-5) follows the fourth inequality in (7). The third inequality in (7) can be proven in a similar way.
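The incremental quantities in (A-1)-(A-3) are what make the greedy selection in Table 1 cheap: adding a basis function only requires c_k, d_k, and q_k, never a refit from scratch. The following self-contained check (a single task with random data and a random candidate basis, both hypothetical) verifies numerically that the update (A-3) reproduces the brute-force solution of (3) and that the error drop matches (5).

```python
import numpy as np

rng = np.random.default_rng(1)
J, N1, rho = 30, 4, 1e-2
Phi = np.hstack([np.ones((J, 1)), rng.normal(size=(J, N1 - 1))])   # current design, rows φ(x_i)^T
y = rng.normal(size=J)
phi_new = rng.normal(size=J)                                       # candidate basis evaluated at the data

A = Phi.T @ Phi + rho * np.eye(N1)                                 # A_k and w_k of (3)
w = np.linalg.solve(A, Phi.T @ y)

c = Phi.T @ phi_new                                                # quantities of (6)
d = phi_new @ phi_new + rho
q = d - c @ np.linalg.solve(A, c)

g = c @ w - phi_new @ y                                            # update (A-3)
w_new = np.append(w + np.linalg.solve(A, c) * (g / q), -g / q)

Phi_new = np.hstack([Phi, phi_new[:, None]])                       # brute-force refit for comparison
A_new = Phi_new.T @ Phi_new + rho * np.eye(N1 + 1)
w_bf = np.linalg.solve(A_new, Phi_new.T @ y)
assert np.allclose(w_new, w_bf)

e_old = y @ y - y @ (Phi @ w)                                      # e(φ) and e([φ, φ_{N+1}]) from (4)
e_new = y @ y - y @ (Phi_new @ w_bf)
assert np.isclose(e_old - e_new, g**2 / q)                         # error reduction matches (5)
```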
