Radial Basis Function Network for Multi-task Learning


Xuejun Liao
Department of ECE, Duke University
Durham, NC, USA

Lawrence Carin
Department of ECE, Duke University
Durham, NC, USA

Abstract

We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network's generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions.

1 Introduction

In practical applications, one is frequently confronted with situations in which multiple tasks must be solved. Often these tasks are not independent, implying that what is learned from one task is transferable to another correlated task. By making use of this transferability, each task becomes easier to solve. In machine learning, the concept of explicitly exploiting the transferability of expertise between tasks, by learning the tasks simultaneously under a unified representation, is formally referred to as multi-task learning [1].

In this paper we extend radial basis function (RBF) networks [4,5] to the scenario of multi-task learning and present the corresponding learning algorithms. Our primary interest is to learn the regression models of several data sets, where any given data set may be correlated with some other sets but not necessarily with all of them. The advantage of multi-task learning is usually manifested when the training set of each individual task is weak, i.e., when it does not generalize well to the test data. Our algorithms aim to enhance, in a mutually beneficial way, the weak training sets of multiple tasks by learning them simultaneously. Multi-task learning becomes superfluous when the data sets all come from the same generating distribution, since in that case we can simply take their union and treat the union as a single task. In the other extreme, when all the tasks are independent, there is no correlation to utilize and we learn each task separately.

The paper is organized as follows. We define the structure of the multi-task RBF network in Section 2 and present the supervised learning algorithm in Section 3. In Section 4 we show how to learn the network structure in an unsupervised manner, and based on this we demonstrate how to actively select the training data, with the goal of improving the generalization to test data.

We perform experimental studies in Section 5 and conclude the paper in Section 6.

2 Multi-Task Radial Basis Function Network

Figure 1 schematizes the radial basis function (RBF) network structure customized to multi-task learning. The network consists of an input layer, a hidden layer, and an output layer. The input layer receives a data point x = [x_1, ..., x_d]^T in R^d and submits it to the hidden layer. Each node at the hidden layer has a localized activation φ_n(x) = φ(‖x - c_n‖, σ_n), n = 1, ..., N, where ‖·‖ denotes the vector norm and φ_n(·) is a radial basis function (RBF) localized around c_n, with the degree of localization parameterized by σ_n. Choosing φ(z, σ) = exp(-z²/σ²) gives the Gaussian RBF. The activations of all hidden nodes are weighted and sent to the output layer. Each output node represents a unique task and has its own hidden-to-output weights. The weighted activations of the hidden nodes are summed at each output node to produce the output for the associated task. Denoting by w_k = [w_{0k}, w_{1k}, ..., w_{Nk}]^T the weights connecting the hidden nodes to the k-th output node, the output for the k-th task, in response to input x, takes the form

f_k(x) = w_k^T φ(x)    (1)

where φ(x) = [φ_0(x), φ_1(x), ..., φ_N(x)]^T is a column containing N + 1 basis functions, with φ_0(x) = 1 a dummy basis accounting for the bias in Figure 1.

Figure 1: A multi-task structure of the RBF network. The input layer, specified by the data dimensionality, receives the data point x = [x_1, ..., x_d]^T; the hidden layer consists of the bias node and the basis functions φ_1(x), ..., φ_N(x), which are to be learned by the algorithms; the output layer, specified by the number of tasks, produces the network responses f_1(x), ..., f_K(x) for tasks 1 through K via the hidden-to-output weights w_1, ..., w_K. Each of the K output nodes represents a unique task. Each task has its own hidden-to-output weights, but all the tasks share the same hidden nodes. The activation of hidden node n is characterized by a basis function φ_n(x) = φ(‖x - c_n‖, σ_n). A typical choice is φ(z, σ) = exp(-z²/σ²), which gives the Gaussian RBF.
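To make the structure concrete, here is a minimal NumPy sketch of the forward computation in (1); the Gaussian choice of φ, the array layout, and the function names (rbf_design, multitask_rbf_forward) are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def rbf_design(X, centers, widths):
    """Rows of the returned matrix are φ(x)^T = [1, φ_1(x), ..., φ_N(x)] for each row x of X.

    X: (J, d) inputs; centers: (N, d) RBF centers c_n; widths: (N,) localization parameters σ_n.
    Gaussian RBF as in Figure 1: φ_n(x) = exp(-||x - c_n||^2 / σ_n^2).
    """
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (J, N) squared distances
    Phi = np.exp(-d2 / widths ** 2)                                  # hidden-node activations
    return np.hstack([np.ones((X.shape[0], 1)), Phi])                # prepend the bias basis φ_0 = 1

def multitask_rbf_forward(X, centers, widths, W):
    """f_k(x) = w_k^T φ(x) for every task; W stacks the weight vectors w_k as columns, shape (N+1, K)."""
    return rbf_design(X, centers, widths) @ W                        # (J, K): one output column per task
```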

3 Supervised Learning

Suppose we have K tasks and the data set of the k-th task is D_k = {(x_{1k}, y_{1k}), ..., (x_{J_k k}, y_{J_k k})}, where y_{ik} is the target (desired output) of x_{ik}. By definition, a given data point x_{ik} is said to be supervised if the associated target y_{ik} is provided, and unsupervised if y_{ik} is not provided. The definition extends similarly to a set of data points. We are interested in learning the functions f_k(x) for the K tasks, based on the union of the D_k, k = 1, ..., K. The learning is based on minimizing the squared error

e(φ, w) = Σ_{k=1}^K { Σ_{i=1}^{J_k} (w_k^T φ_{ik} - y_{ik})² + ρ ‖w_k‖² }    (2)

where φ_{ik} = φ(x_{ik}) for notational simplicity. The regularization terms ρ‖w_k‖², k = 1, ..., K, are used to prevent singularity of the A_k matrices defined in (3), and ρ is typically set to a small positive number. For fixed φ, the w_k are solved by minimizing e(φ, w) with respect to w_k, yielding

w_k = A_k^{-1} Σ_{i=1}^{J_k} y_{ik} φ_{ik}   and   A_k = Σ_{i=1}^{J_k} φ_{ik} φ_{ik}^T + ρI,   k = 1, ..., K    (3)

In a multi-task RBF network, the input layer and the output layer are respectively specified by the data dimensionality and the number of tasks. We now discuss how to determine the hidden layer (the basis functions φ). Substituting the solutions of the w_k in (3) into (2) gives

e(φ) = Σ_{k=1}^K Σ_{i=1}^{J_k} (y_{ik}² - y_{ik} w_k^T φ_{ik})    (4)

where e(φ) is a function of φ only, because the w_k are now functions of φ as given by (3). By minimizing e(φ), we can determine φ. Recalling that φ_{ik} is an abbreviation of φ(x_{ik}) = [1, φ_1(x_{ik}), ..., φ_N(x_{ik})]^T, this amounts to determining N, the number of basis functions, and the functional form of each basis function φ_n(·), n = 1, ..., N.

Consider the candidate functions {φ^{nm}(x) = φ(‖x - x_{nm}‖, σ) : n = 1, ..., J_m, m = 1, ..., K}. We learn the RBF network structure by selecting φ(·) from these candidate functions such that e(φ) in (4) is minimized. The following theorem tells us how to perform the selection in a sequential way; the proof is given in the Appendix.

Table 1: Learning Algorithm of the Multi-Task RBF Network

Input: {(x_{1k}, y_{1k}), ..., (x_{J_k k}, y_{J_k k})}_{k=1:K}, the RBF form φ(·, σ), an initial width σ, and ρ; Output: φ(·) and {w_k}_{k=1}^K.
1. For m = 1:K, for n = 1:J_m, for k = 1:K, for i = 1:J_k, compute φ^{nm}_{ik} = φ(‖x_{nm} - x_{ik}‖, σ);
2. Let N = 0, φ(·) = 1, e_0 = Σ_{k=1}^K [ Σ_{i=1}^{J_k} y_{ik}² - (J_k + ρ)^{-1} (Σ_{i=1}^{J_k} y_{ik})² ]; for k = 1:K, compute A_k = J_k + ρ and w_k = (J_k + ρ)^{-1} Σ_{i=1}^{J_k} y_{ik};
3. For m = 1:K, for n = 1:J_m: if φ^{nm} is not marked as "deleted", then for k = 1:K compute c_k = Σ_{i=1}^{J_k} φ^{nm}_{ik} φ_{ik} and q_k = Σ_{i=1}^{J_k} (φ^{nm}_{ik})² + ρ - c_k^T A_k^{-1} c_k; if there exists k such that q_k = 0, mark φ^{nm} as "deleted"; else, compute δe(φ, φ^{nm}) using (5);
4. If the candidates {φ^{nm}}_{n=1:J_m, m=1:K} are all marked as "deleted", go to 10;
5. Let (n*, m*) = arg max over φ^{nm} not marked as "deleted" of δe(φ, φ^{nm}); mark φ^{n*m*} as "deleted";
6. Tune the RBF parameter: σ_{N+1} = arg max_σ δe(φ, φ(‖· - x_{n*m*}‖, σ));
7. Let φ_{N+1}(·) = φ(‖· - x_{n*m*}‖, σ_{N+1}); update φ(·) to [φ^T(·), φ_{N+1}(·)]^T;
8. For k = 1:K, compute A_k^{new} and w_k^{new} respectively by (A-1) and (A-3) in the Appendix; update A_k to A_k^{new} and w_k to w_k^{new};
9. Let e_{N+1} = e_N - δe(φ, φ_{N+1}); if the sequence {e_n}_{n=0:(N+1)} has converged, go to 10, else update N to N + 1 and go back to 3;
10. Exit and output φ(·) and {w_k}_{k=1}^K.
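Step 2 of Table 1 initializes, and step 8 incrementally updates, the per-task quantities A_k and w_k; at any point they coincide with the regularized least-squares solution (3), which decouples across the K tasks once the basis functions are fixed. A minimal NumPy sketch of that per-task solve follows; the function and array names are hypothetical.

```python
import numpy as np

def task_weights(Phi_k, y_k, rho):
    """Solve (3) for one task: w_k = A_k^{-1} Σ_i y_ik φ_ik, with A_k = Σ_i φ_ik φ_ik^T + ρI.

    Phi_k: (J_k, N+1) design matrix whose rows are φ(x_ik)^T; y_k: (J_k,) targets; rho > 0.
    """
    A_k = Phi_k.T @ Phi_k + rho * np.eye(Phi_k.shape[1])
    w_k = np.linalg.solve(A_k, Phi_k.T @ y_k)
    return w_k, A_k

# Because the tasks share the basis functions but not the weights, the K solves are independent:
# W = np.column_stack([task_weights(Phi[k], y[k], rho)[0] for k in range(K)])
```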

Theorem 1. Let φ(x) = [1, φ_1(x), ..., φ_N(x)]^T and let φ_{N+1}(x) be a single basis function. Assume the A_k matrices corresponding to φ and to [φ^T, φ_{N+1}]^T are all non-degenerate. Then

δe(φ, φ_{N+1}) = e(φ) - e([φ^T, φ_{N+1}]^T) = Σ_{k=1}^K (c_k^T w_k - Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik})² q_k^{-1}    (5)

where φ^{N+1}_{ik} = φ_{N+1}(x_{ik}), w_k and A_k are the same as in (3), and

c_k = Σ_{i=1}^{J_k} φ_{ik} φ^{N+1}_{ik},   d_k = Σ_{i=1}^{J_k} (φ^{N+1}_{ik})² + ρ,   q_k = d_k - c_k^T A_k^{-1} c_k    (6)

By the conditions of the theorem, A_k^{new} is full rank, and since it is positive semi-definite by construction, it is positive definite. By (A-2) in the Appendix, q_k^{-1} is a diagonal element of (A_k^{new})^{-1}; therefore q_k^{-1} is positive, and by (5) δe(φ, φ_{N+1}) > 0, which means that adding φ_{N+1} to φ generally makes the squared error decrease. The decrease δe(φ, φ_{N+1}) depends on φ_{N+1}. By sequentially selecting the basis functions that bring the maximum error reduction, we achieve the goal of minimizing e(φ). The details of the learning algorithm are summarized in Table 1.

4 Active Learning

In the previous section, the data in the D_k are supervised (provided with the targets). In this section, we assume the data in the D_k are initially unsupervised (only x is available, without access to the associated y), and we select a subset from the D_k to be supervised (targets acquired), such that the resulting network generalizes well to the remaining data in the D_k. The approach is generally known as active learning [6]. We first learn the basis functions φ from the unsupervised data, and based on φ we select data to be supervised. Both of these steps are based on the following theorem, the proof of which is given in the Appendix.

Theorem 2. Let there be K tasks and let the data set of the k-th task be D_k ∪ D̃_k, where D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k} and D̃_k = {(x_{ik}, y_{ik})}_{i=J_k+1}^{J_k+J̃_k}. Let there be two multi-task RBF networks, whose output nodes are characterized by f̃_k(·) and f_k(·), respectively, for task k = 1, ..., K. The two networks have the same given basis functions (hidden nodes) φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, but different hidden-to-output weights. The weights of f̃_k(·) are trained with D_k ∪ D̃_k, while the weights of f_k(·) are trained using D̃_k. Then for k = 1, ..., K, the squared errors committed on D_k by f̃_k(·) and f_k(·) are related by

0 ≤ [det Γ_k]^{-1} ≤ λ_{max,k}^{-1} ≤ [Σ_{i=1}^{J_k} (y_{ik} - f̃_k(x_{ik}))²] / [Σ_{i=1}^{J_k} (y_{ik} - f_k(x_{ik}))²] ≤ λ_{min,k}^{-1} ≤ 1    (7)

where Γ_k = [I + Φ_k^T (ρI + Φ̃_k Φ̃_k^T)^{-1} Φ_k]², with Φ_k = [φ(x_{1k}), ..., φ(x_{J_k k})] and Φ̃_k = [φ(x_{J_k+1,k}), ..., φ(x_{J_k+J̃_k,k})], and λ_{max,k} and λ_{min,k} are respectively the largest and smallest eigenvalues of Γ_k.

Specializing Theorem 2 to the case J̃_k = 0, we have

Corollary 1. Let there be K tasks and let the data set of the k-th task be D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}. Let the RBF network, whose output nodes are characterized by f_k(·) for task k = 1, ..., K, have given basis functions (hidden nodes) φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, and let the hidden-to-output weights of task k be trained with D_k. Then for k = 1, ..., K, the squared error committed on D_k by f_k(·) is bounded as 0 ≤ [det Γ_k]^{-1} ≤ λ_{max,k}^{-1} ≤ [Σ_{i=1}^{J_k} y_{ik}²]^{-1} Σ_{i=1}^{J_k} (y_{ik} - f_k(x_{ik}))² ≤ λ_{min,k}^{-1} ≤ 1, where Γ_k = (I + ρ^{-1} Φ_k^T Φ_k)² with Φ_k = [φ(x_{1k}), ..., φ(x_{J_k k})], and λ_{max,k} and λ_{min,k} are respectively the largest and smallest eigenvalues of Γ_k.

It is evident from the properties of the matrix determinant [7] and the definition of Φ_k that det Γ_k = [det(ρI + Φ_k^T Φ_k)]² / [det(ρI)]² = [det(ρI + Σ_{i=1}^{J_k} φ_{ik} φ_{ik}^T)]² / [det(ρI)]².
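This determinant identity (essentially Sylvester's determinant theorem applied to Φ_k) is easy to confirm numerically; the short check below uses random matrices purely as a sanity test and is not part of the learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
J_k, N, rho = 12, 4, 0.1
Phi_k = rng.normal(size=(N + 1, J_k))            # columns are φ(x_ik), as in Corollary 1

Gamma_k = np.linalg.matrix_power(np.eye(J_k) + Phi_k.T @ Phi_k / rho, 2)
A_k = Phi_k @ Phi_k.T + rho * np.eye(N + 1)      # A_k from (3)

lhs = np.linalg.det(Gamma_k)
rhs = (np.linalg.det(A_k) / np.linalg.det(rho * np.eye(N + 1))) ** 2
assert np.isclose(lhs, rhs)                      # det Γ_k = [det A_k]^2 / [det(ρI)]^2
```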

Using (3), we can write this succinctly as det Γ_k = [det A_k]² / [det(ρI)]². We are interested in selecting the basis functions φ that minimize the error, before seeing the y's. By Corollary 1 and the equation det Γ_k = [det A_k]² / [det(ρI)]², the squared error is lower bounded by Σ_{i=1}^{J_k} y_{ik}² [det(ρI)]² [det A_k]^{-2}. Instead of minimizing the error directly, we minimize its lower bound. As [det(ρI)]² Σ_{i=1}^{J_k} y_{ik}² does not depend on φ, this amounts to selecting φ to minimize (det A_k)^{-2}. To minimize the errors for all tasks k = 1, ..., K, we select φ to minimize Π_{k=1}^K (det A_k)^{-2}.

The selection proceeds in a sequential manner. Suppose we have selected the basis functions φ = [1, φ_1, ..., φ_N]^T. The associated A matrices are A_k = Σ_{i=1}^{J_k} φ_{ik} φ_{ik}^T + ρI_{(N+1)×(N+1)}, k = 1, ..., K. Augmenting the basis functions to [φ^T, φ_{N+1}]^T, the A matrices change to A_k^{new} = Σ_{i=1}^{J_k} [φ_{ik}^T, φ^{N+1}_{ik}]^T [φ_{ik}^T, φ^{N+1}_{ik}] + ρI_{(N+2)×(N+2)}. Using the determinant formula for block matrices [7], we get Π_{k=1}^K (det A_k^{new})^{-2} = Π_{k=1}^K (q_k det A_k)^{-2}, where q_k is the same as in (6). As A_k does not depend on φ_{N+1}, the left-hand side is minimized by maximizing Π_{k=1}^K q_k². The selection is easily implemented by making the following two minor modifications in Table 1: (a) in step 2, compute e_0 = -Σ_{k=1}^K ln(J_k + ρ)²; (b) in step 3, compute δe(φ, φ^{nm}) = Σ_{k=1}^K ln q_k². Employing the logarithm is for gaining additivity, and it does not affect the maximization.

Based on the basis functions φ determined above, we proceed to selecting the data to be supervised and determining the hidden-to-output weights w_k from the supervised data using the equations in (3). The selection of the data is based on an iterative use of the following corollary, which is a specialization of Theorem 2 and was originally given in [8].

Corollary 2. Let there be K tasks and let the data set of the k-th task be D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}. Let there be two RBF networks, whose output nodes are characterized by f_k(·) and f_k^+(·), respectively, for task k = 1, ..., K. The two networks have the same given basis functions φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, but different hidden-to-output weights. The weights of f_k(·) are trained with D_k, while the weights of f_k^+(·) are trained using D_k^+ = D_k ∪ {(x_{J_k+1,k}, y_{J_k+1,k})}. Then for k = 1, ..., K, the squared errors committed on (x_{J_k+1,k}, y_{J_k+1,k}) by f_k(·) and f_k^+(·) are related by

[f_k^+(x_{J_k+1,k}) - y_{J_k+1,k}]² = [γ(x_{J_k+1,k})]^{-1} [f_k(x_{J_k+1,k}) - y_{J_k+1,k}]²

where γ(x_{J_k+1,k}) = [1 + φ^T(x_{J_k+1,k}) A_k^{-1} φ(x_{J_k+1,k})]² and A_k = ρI + Σ_{i=1}^{J_k} φ(x_{ik}) φ^T(x_{ik}) is the same as in (3).

Two observations are made from Corollary 2. First, if γ(x_{J_k+1,k}) ≈ 1, seeing y_{J_k+1,k} does not affect the error on x_{J_k+1,k}, indicating that D_k already contains sufficient information about (x_{J_k+1,k}, y_{J_k+1,k}). Second, if γ(x_{J_k+1,k}) ≫ 1, seeing y_{J_k+1,k} greatly decreases the error on x_{J_k+1,k}, indicating that x_{J_k+1,k} is significantly dissimilar (novel) to D_k and must be supervised to reduce the error.

Based on Corollary 2, the selection proceeds sequentially. Suppose we have selected the data D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}, from which we compute A_k. We select the next data point as x_{J_k+1,k} = arg max_{i>J_k, k=1,...,K} γ(x_{ik}) = arg max_{i>J_k, k=1,...,K} [1 + φ^T(x_{ik}) A_k^{-1} φ(x_{ik})]². After x_{J_k+1,k} is selected, A_k is updated and the next selection begins. As the iteration advances, γ decreases until it reaches convergence. We use (3) to compute w_k from the selected x's and their associated targets y's, completing the learning of the RBF network.
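A compact sketch of this sequential selection is given below; the function name select_supervised, the pool layout Phi_pool, and the stopping count n_select are placeholders, while the score itself follows Corollary 2 with A_k maintained per task as in (3).

```python
import numpy as np

def select_supervised(Phi_pool, n_select, rho):
    """Greedily pick the pool points with the largest novelty score γ(x) = [1 + φ(x)^T A_k^{-1} φ(x)]^2.

    Phi_pool: list of K arrays; Phi_pool[k] has shape (J_k_pool, N+1), rows are φ(x_ik)^T.
    Returns, for each task, the indices chosen to be supervised, in the order they were selected.
    """
    K, N1 = len(Phi_pool), Phi_pool[0].shape[1]
    A = [rho * np.eye(N1) for _ in range(K)]           # A_k before any data are supervised
    chosen = [[] for _ in range(K)]
    for _ in range(n_select):
        best = None
        for k in range(K):
            for i in range(Phi_pool[k].shape[0]):
                if i in chosen[k]:
                    continue
                phi = Phi_pool[k][i]
                gamma = (1.0 + phi @ np.linalg.solve(A[k], phi)) ** 2
                if best is None or gamma > best[0]:
                    best = (gamma, k, i)
        if best is None:                               # pool exhausted
            break
        _, k, i = best
        phi = Phi_pool[k][i]
        A[k] += np.outer(phi, phi)                     # acquiring the target updates A_k as in (3)
        chosen[k].append(i)
    return chosen
```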

5 Experimental Results

In this section we compare the multi-task RBF network against single-task RBF networks via experimental studies. We consider three types of RBF networks to learn K tasks, each with its own data set D_k. In the first, which we call the "one RBF network," we let the K tasks share both the basis functions φ (hidden nodes) and the hidden-to-output weights w; thus we do not distinguish the K tasks and design a single RBF network to learn their union. The second is the multi-task RBF network, where the K tasks share the same φ but each has its own w_k. In the third, we have K independent networks, each designed for a single task.

We use the school data set from the Inner London Education Authority, consisting of examination records of 15362 students from 139 secondary schools. The data are publicly available. This data set was originally used to study the effectiveness of schools and has recently been used to evaluate multi-task algorithms [2,3]. The goal is to predict the exam scores of the students based on 9 variables: year of exam (1985, 1986, or 1987), school code (1-139), FSM (percentage of students eligible for free school meals), VR1 band (percentage of students in the school in VR band one), gender, VR band of student (3 categories), ethnic group of student (11 categories), school gender (male, female, or mixed), and school denomination (3 categories). We consider each school a task, leading to 139 tasks in total. The remaining 8 variables are used as inputs to the RBF network. Following [2,3], we converted each categorical variable to a number of binary variables, resulting in a total of 27 input variables, i.e., x in R^27. The exam score is the target to be predicted.

The three types of RBF networks defined above are designed as follows. The multi-task RBF network is implemented with the structure shown in Figure 1 and trained with the learning algorithm in Table 1. The one RBF network is implemented as a special case of Figure 1, with a single output node, and trained using the union of the supervised data from all 139 schools. We design 139 independent RBF networks, each of which is implemented with a single output node and trained using the supervised data from a single school. We use the Gaussian RBF φ_n(x) = exp(-‖x - c_n‖²/σ_n²), where the c_n are selected from the training data points and the σ_n are initialized to a common initial width and optimized as described in Table 1. The main role of the regularization parameter ρ is to prevent the A_k matrices from being singular, and it does not affect the results seriously; in the results reported here, ρ is set to a small positive value.

Following [2,3], we randomly take 75% of the 15362 data points as training (supervised) data and the remaining 25% as test data. The generalization performance is measured by the squared error (f_k(x_{ik}) - y_{ik})², averaged over all test data x_{ik} of tasks k = 1, ..., K. We made 10 independent trials of randomly splitting the data into training and test sets; the squared error, averaged over the test data of all 139 schools and over the trials, is shown in Table 2 for the three types of RBF networks.

Table 2: Squared error averaged over the test data of all 139 schools and the 10 independent trials of randomly splitting the school data into training (75%) and test (25%) sets.

    Multi-task RBF network      Independent RBF networks      One RBF network
              ±                             ±                      ± .8093
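The numbers in Table 2 come from the protocol just described: binary expansion of the categorical attributes, ten random 75%/25% splits, and the squared error averaged over all test points. A sketch of one such trial follows; the data-frame column names, the one_trial helper, and the fit_and_score routine are placeholders, not part of the original paper.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the actual field layout of the school data is not reproduced here.
CATEGORICAL = ["year", "gender", "vr_band", "ethnic_group", "school_gender", "school_denomination"]
NUMERIC = ["fsm_percent", "vr1_percent"]

def one_trial(df, fit_and_score, seed):
    """One of the ten random splits: expand the categorical attributes into binary indicators,
    hold out 25% of the records as test data, and return the squared error averaged over all
    test points of all 139 schools. fit_and_score(train, test) stands for training one of the
    three RBF variants and returning the per-point squared test errors."""
    X = pd.get_dummies(df[CATEGORICAL + NUMERIC], columns=CATEGORICAL)   # binary expansion of categoricals
    data = X.assign(score=df["exam_score"], school=df["school_code"])
    in_train = np.random.default_rng(seed).random(len(data)) < 0.75
    return np.mean(fit_and_score(data[in_train], data[~in_train]))

# errors = [one_trial(df, fit_and_score, s) for s in range(10)]
# np.mean(errors), np.std(errors)    # the mean and standard deviation reported in Table 2
```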
Table 2 clearly shows that the multi-task RBF network outperforms the other two types of RBF networks by a considerable margin. The one RBF network ignores the differences between the tasks, and the independent RBF networks ignore the correlations between the tasks; therefore they both perform inferiorly. The multi-task RBF network uses the shared hidden nodes (basis functions) to capture the common internal representation of the tasks and meanwhile uses the independent hidden-to-output weights to learn the statistics specific to each task.

We now demonstrate the results of active learning. We use the method in Section 4 to actively split the data into training and test sets using a two-step procedure. First we learn the basis functions φ of the multi-task RBF network using all 15362 data points (unsupervised). Based on φ, we then select the data to be supervised and use them as training data to learn the hidden-to-output weights w_k.

To make the results comparable, we use the same training data to learn the other two types of RBF networks (including learning their own φ and w). The networks are then tested on the remaining data. Figure 2 shows the results of active learning. Each curve is the squared error averaged over the test data of all 139 schools, as a function of the number of training data. It is clear that the multi-task RBF network maintains its superior performance all the way down to 5000 training data points, whereas the independent RBF networks have their performance degraded seriously as the training data diminish. This demonstrates the increasing advantage of multi-task learning as the number of training data decreases. The one RBF network also seems insensitive to the number of training data, but it ignores the inherent dissimilarity between the tasks, which makes its performance inferior.

Figure 2: Squared error averaged over the test data of all 139 schools, as a function of the number of training (supervised) data, for the multi-task RBF network, the independent RBF networks, and the one RBF network. The data are split into training and test sets via active learning.

6 Conclusions

We have presented the structure and learning algorithms for multi-task learning with the radial basis function (RBF) network. By letting multiple tasks share the basis functions (hidden nodes), we impose a common internal representation for correlated tasks. Exploiting the inter-task correlation yields a more compact network structure that has enhanced generalization ability. Unsupervised learning of the network structure enables us to actively split the data into training and test sets. As the data most novel to the previously selected ones are selected next, what finally remain unselected, and are to be tested on, are all similar to the selected data constituting the training set. This improves the generalization of the resulting network to the test data. These conclusions are substantiated by results on real multi-task data.

References

[1] R. Caruana (1997). Multitask learning. Machine Learning, 28:41-75.

[2] B. Bakker and T. Heskes (2003). Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83-99.

[3] T. Evgeniou, C. A. Micchelli, and M. Pontil (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637.

[4] M. Powell (1987). Radial basis functions for multivariable interpolation: a review. In J. C. Mason and M. G. Cox, editors, Algorithms for Approximation.

[5] S. Chen, C. F. N. Cowan, and P. M. Grant (1991). Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks, 2(2):302-309.

[6] D. A. Cohn, Z. Ghahramani, and M. I. Jordan (1995). Active learning with statistical models. Advances in Neural Information Processing Systems 7.

[7] V. Fedorov (1972). Theory of Optimal Experiments. Academic Press.

[8] M. Stone (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111-147.

Appendix

Proof of Theorem 1. Let φ^{new} = [φ^T, φ_{N+1}]^T. By (3), the A matrices corresponding to φ^{new} are

A_k^{new} = Σ_{i=1}^{J_k} [φ_{ik}^T, φ^{N+1}_{ik}]^T [φ_{ik}^T, φ^{N+1}_{ik}] + ρI_{(N+2)×(N+2)} = [ A_k, c_k ; c_k^T, d_k ]    (A-1)

where c_k and d_k are as in (6). By the conditions of the theorem, the matrices A_k and A_k^{new} are all non-degenerate. Using the block matrix inversion formula [7] we get

(A_k^{new})^{-1} = [ A_k^{-1} + A_k^{-1} c_k q_k^{-1} c_k^T A_k^{-1}, -A_k^{-1} c_k q_k^{-1} ; -q_k^{-1} c_k^T A_k^{-1}, q_k^{-1} ]    (A-2)

where q_k is as in (6). By (3), the weights w_k^{new} corresponding to [φ^T, φ_{N+1}]^T are

w_k^{new} = (A_k^{new})^{-1} [ Σ_{i=1}^{J_k} y_{ik} φ_{ik} ; Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik} ] = [ w_k + A_k^{-1} c_k q_k^{-1} g_k ; -q_k^{-1} g_k ]    (A-3)

with g_k = c_k^T w_k - Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik}. Hence, (φ^{new}_{ik})^T w_k^{new} = φ_{ik}^T w_k + (φ_{ik}^T A_k^{-1} c_k - φ^{N+1}_{ik}) g_k q_k^{-1}, which is put into (4) to get

e(φ^{new}) = Σ_{k=1}^K Σ_{i=1}^{J_k} [ y_{ik}² - y_{ik} (φ^{new}_{ik})^T w_k^{new} ] = Σ_{k=1}^K Σ_{i=1}^{J_k} [ y_{ik}² - y_{ik} φ_{ik}^T w_k - y_{ik} (φ_{ik}^T A_k^{-1} c_k - φ^{N+1}_{ik}) g_k q_k^{-1} ] = e(φ) - Σ_{k=1}^K (c_k^T w_k - Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik})² q_k^{-1},

where in arriving at the last equality we have used (3) and (4) and g_k = c_k^T w_k - Σ_{i=1}^{J_k} y_{ik} φ^{N+1}_{ik}. The theorem is proved.

Proof of Theorem 2. The proof applies to each k = 1, ..., K. For any given k, define Φ_k = [φ(x_{1k}), ..., φ(x_{J_k k})], Φ̃_k = [φ(x_{J_k+1,k}), ..., φ(x_{J_k+J̃_k,k})], y_k = [y_{1k}, ..., y_{J_k k}]^T, ỹ_k = [y_{J_k+1,k}, ..., y_{J_k+J̃_k,k}]^T, f̃_k = [f̃_k(x_{1k}), ..., f̃_k(x_{J_k k})]^T, f_k = [f_k(x_{1k}), ..., f_k(x_{J_k k})]^T, and Ã_k = ρI + Φ̃_k Φ̃_k^T. By (1), (3), and the conditions of the theorem,

f̃_k = Φ_k^T (Ã_k + Φ_k Φ_k^T)^{-1} (Φ_k y_k + Φ̃_k ỹ_k)
(a) = Φ_k^T [ Ã_k^{-1} - Ã_k^{-1} Φ_k (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} Φ_k^T Ã_k^{-1} ] (Φ_k y_k + Φ̃_k ỹ_k)
= (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} [ Φ_k^T Ã_k^{-1} Φ̃_k ỹ_k + Φ_k^T Ã_k^{-1} Φ_k y_k ]
(b) = (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} f_k + (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} (Φ_k^T Ã_k^{-1} Φ_k + I - I) y_k
= y_k + (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} (f_k - y_k),

where equation (a) is due to the Sherman-Morrison-Woodbury formula and equation (b) results because f_k = Φ_k^T Ã_k^{-1} Φ̃_k ỹ_k. Hence, f̃_k - y_k = (I + Φ_k^T Ã_k^{-1} Φ_k)^{-1} (f_k - y_k), which gives

Σ_{i=1}^{J_k} (y_{ik} - f̃_k(x_{ik}))² = (f̃_k - y_k)^T (f̃_k - y_k) = (f_k - y_k)^T Γ_k^{-1} (f_k - y_k)    (A-4)

where Γ_k = [I + Φ_k^T Ã_k^{-1} Φ_k]² = [I + Φ_k^T (ρI + Φ̃_k Φ̃_k^T)^{-1} Φ_k]². By construction, Γ_k has all its eigenvalues no less than 1, i.e., Γ_k = E_k diag[λ_{1,k}, ..., λ_{J_k,k}] E_k^T with E_k^T E_k = I and λ_{1,k}, ..., λ_{J_k,k} ≥ 1, which makes the first, second, and last inequalities in (7) hold. Using this expansion of Γ_k in (A-4) we get

Σ_{i=1}^{J_k} (f̃_k(x_{ik}) - y_{ik})² = (f_k - y_k)^T E_k diag[λ_{1,k}^{-1}, ..., λ_{J_k,k}^{-1}] E_k^T (f_k - y_k) ≤ (f_k - y_k)^T E_k [λ_{min,k}^{-1} I] E_k^T (f_k - y_k) = λ_{min,k}^{-1} Σ_{i=1}^{J_k} (f_k(x_{ik}) - y_{ik})²    (A-5)

where the inequality results because λ_{min,k} = min(λ_{1,k}, ..., λ_{J_k,k}). From (A-5) follows the fourth inequality in (7). The third inequality in (7) can be proven in a similar way.
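The incremental quantities in (A-1)-(A-3) are what make the greedy selection in Table 1 cheap: adding a basis function only requires c_k, d_k, and q_k, never a refit from scratch. The following self-contained check (a single task with random data and a random candidate basis, both hypothetical) verifies numerically that the update (A-3) reproduces the brute-force solution of (3) and that the error drop matches (5).

```python
import numpy as np

rng = np.random.default_rng(1)
J, N1, rho = 30, 4, 1e-2
Phi = np.hstack([np.ones((J, 1)), rng.normal(size=(J, N1 - 1))])   # current design, rows φ(x_i)^T
y = rng.normal(size=J)
phi_new = rng.normal(size=J)                                       # candidate basis evaluated at the data

A = Phi.T @ Phi + rho * np.eye(N1)                                 # A_k and w_k of (3)
w = np.linalg.solve(A, Phi.T @ y)

c = Phi.T @ phi_new                                                # quantities of (6)
d = phi_new @ phi_new + rho
q = d - c @ np.linalg.solve(A, c)

g = c @ w - phi_new @ y                                            # update (A-3)
w_new = np.append(w + np.linalg.solve(A, c) * (g / q), -g / q)

Phi_new = np.hstack([Phi, phi_new[:, None]])                       # brute-force refit for comparison
A_new = Phi_new.T @ Phi_new + rho * np.eye(N1 + 1)
w_bf = np.linalg.solve(A_new, Phi_new.T @ y)
assert np.allclose(w_new, w_bf)

e_old = y @ y - y @ (Phi @ w)                                      # e(φ) and e([φ, φ_{N+1}]) from (4)
e_new = y @ y - y @ (Phi_new @ w_bf)
assert np.isclose(e_old - e_new, g**2 / q)                         # error reduction matches (5)
```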
