A Maximum-likelihood connectionist model for unsupervised learning over graphical domains
Edmondo Trentin and Leonardo Rigutini
DII - Università di Siena, V. Roma, 56 Siena (Italy)

Abstract. Supervised relational learning over labeled graphs, e.g. via recursive neural nets, has received considerable attention from the connectionist community. Surprisingly, with the exception of recursive self-organizing maps, unsupervised paradigms have been far less investigated. In particular, no algorithms for density estimation over graphs are found in the literature. This paper first introduces a formal notion of probability density function (pdf) over graphical spaces. It then proposes a maximum-likelihood pdf estimation technique, relying on the joint optimization of a recursive encoding network and a constrained radial basis function-like net. Preliminary experiments on synthetically generated samples of labeled graphs are analyzed and tested statistically.

Key words: Density estimation, unsupervised relational learning, recursive network

1 Introduction

Two major instances of unsupervised learning have long been considered in statistical pattern recognition, namely the estimation of probability density functions (pdf), and clustering algorithms [2]. An approximate borderline between the two setups can be traced by saying that the former focuses on the probabilistic properties of the data sample, whilst the latter is rather topology-oriented, in the sense that it concentrates on certain topological properties (e.g., distance measures among patterns and/or centroids). Several unsupervised training algorithms for neural networks were also introduced (e.g., competitive neural nets and self-organizing maps). Most of these neural networks are rooted in the topological framework (i.e., clustering, topologically consistent mappings, etc.), although a few exceptions aimed at pdf estimation can be found [8].
It is rather surprising that, despite the amount of work accomplished in the community on relational and graphical learners in the last few decades, only limited attention has been paid to unsupervised relational learning (a remarkable exception, in the topological framework, is [5]), and to pdf estimation in the first place. This is even more surprising considering that the original motivations for studying unsupervised algorithms are often strong in graphical domains (such as the World Wide Web, or biological and chemical data that have a natural representation in terms of variable-size labeled graphs), where amounts of unlabeled samples are
available. These motivations include: (i) the need for a compact description of the overall distribution of a sample; (ii) the need for a measure of likelihood of a certain model given a graphical structure to be assigned to a certain group ("cluster"); (iii) the need for techniques that can deal with large amounts of unlabeled data in order to ease the design of semi-supervised classifiers or to facilitate the adaptation of previously trained machines to new data or new environmental conditions.

This paper is a first attempt to introduce a model for the estimation of pdfs over graphical domains. It exploits the encoding capabilities of recursive neural nets (RNN) [7], combined with a constrained radial basis function (RBF)-like network. A gradient-ascent, maximum-likelihood (ML) training algorithm is proposed, which jointly optimizes the encoding network [7] and the RBF in order to obtain the pdf estimate from an unsupervised sample of graphs. Constraints are introduced in both neural nets such that the resulting estimate can be interpreted as a pdf (non-negative, unit integral over its definition domain), and such that the encoding of the graphs does not lead to singular solutions.

In order to introduce the model, it is first necessary to give a formal definition of a pdf over graphs, via the notion of generalized random graph (an extension of traditional random graphs [3]). Let $\mathcal{V}$ be a given discrete or continuous-valued set (vertex universe), and let $\Omega$ be any given sample space. We define a generalized random graph (GRG) over $\mathcal{V}$ and $\Omega$ as a function $G : \Omega \to \{(V, E) \mid V \subseteq \mathcal{V},\ E \subseteq V \times V\}$ (note that labels in the form of real-valued vectors associated with vertices and/or edges are easily encapsulated within the definition). Let then $\mathcal{G} = \{g \mid g = (V, E),\ V \subseteq \mathcal{V},\ E \subseteq V \times V\}$. We define a probability density function (pdf) for GRGs over $\mathcal{V}$ as a function $p : \mathcal{G} \to \mathbb{R}$ such that (1) $p(g) \geq 0$, $\forall g \in \mathcal{G}$, and (2) $\int_{\mathcal{G}} p(g)\,dg = 1$.
Note that the integral in (2) has a mathematical meaning, since the (Lebesgue-)measurability of the space of graphs defined over measurable domains (and with measurable labels) like countable sets or real vectors is shown in [4]. The extension of notions from traditional probability theory (conditional pdf, joint pdf, statistical independence) to GRGs is a straightforward exercise. Now, suppose that a sample $T = \{g_1, \ldots, g_n \mid g_i \in \mathcal{G},\ i = 1, \ldots, n\}$ of $n$ graphs has been collected. The pdf estimation problem faced in this paper can be stated as follows: assuming that all the GRGs in $T$ have been independently drawn from a certain pdf $p(g)$, how can the dataset be used in order to estimate a reasonable model of $p(g)$?

2 A plausible neural answer to the question

We assume that $p(g)$ is a function having fixed and known parametric form, being determined uniquely by the specific value of a set of parameters $\theta = (\theta_1, \ldots, \theta_k)$. To render this dependency on $\theta$ more explicit, we will modify our notation slightly by writing $p(g)$ as $p(g \mid \theta)$. Given the assumption, the question posed at the end of the previous Section can be restated as: how can we use the sample $T$ in order to obtain estimates for $\theta$ that are meaningful according to a certain optimality criterion? A sound answer
to the question may be found in the adoption of the ML criterion, along with a suitable method for maximizing the likelihood $p(T \mid \theta)$ of the parameters given the sample. Since $g_1, \ldots, g_n$ are assumed to be i.i.d., the likelihood $p(T \mid \theta)$ can be written as $p(T \mid \theta) = \prod_{i=1}^{n} p(g_i \mid \theta)$. Before attempting the maximization of the likelihood, it is necessary to specify a well-defined form for the pdf $p(g \mid \theta)$. Let us assume the existence of an integer $d$ and of two functions, $\phi : \mathcal{G} \to \mathbb{R}^d$ and $\hat{p} : \mathbb{R}^d \to \mathbb{R}$, s.t. $p(g \mid \theta)$ can be decomposed as:

$$p(g \mid \theta) = \hat{p}(\phi(g)). \tag{1}$$

It is seen that there exist infinitely many choices for $\phi(\cdot)$ and $\hat{p}(\cdot)$ that satisfy Eq. (1), the most trivial being $\phi(g) = p(g \mid \theta)$, $\hat{p}(x) = x$. We call $\phi(\cdot)$ the encoding, while $\hat{p}(\cdot)$ is simply referred to as the likelihood. Again, we assume parametric forms $\phi(g \mid \theta_\phi)$ and $\hat{p}(x \mid \theta_{\hat{p}})$ for the encoding and for the likelihood, respectively, and we set $\theta = (\theta_\phi, \theta_{\hat{p}})$. The ML estimation of $\theta$ given $T$ now requires finding parameter vectors $\theta_\phi$ and $\theta_{\hat{p}}$ that maximize the quantity

$$p(T \mid \theta_\phi, \theta_{\hat{p}}) = \prod_{i=1}^{n} \hat{p}(\phi(g_i \mid \theta_\phi) \mid \theta_{\hat{p}}). \tag{2}$$

We propose a two-block connectionist/statistical model for $p(g \mid \theta)$ as follows. The function $\phi(g \mid \theta_\phi)$ is realized via an encoding network, suitable to map directed acyclic graphs (DAG) $g$ into real vectors $x$, as described in [7] for supervised training of recursive neural networks (RNN) over structured domains. The weights of the encoding network become the parameters $\theta_\phi$. A radial basis function (RBF)-like neural net is then used to model the likelihood function $\hat{p}(x \mid \theta_{\hat{p}})$, where $\theta_{\hat{p}}$ are the parameters of the RBF. In order to ensure that a pdf is obtained, specific constraints have to be placed on the nature of the RBF kernels, as well as on the hidden-to-output connection weights.
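The two-block decomposition of Eqs. (1)-(2) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `encode` function below is a hypothetical, hand-written stand-in for the trained recursive encoding network, the graph representation (a list of vertex labels plus a list of edges) is an assumption, and the mixture parameters are arbitrary illustrative values. The sketch only shows how the logarithm of Eq. (2) is assembled from an encoding followed by a mixture of diagonal Gaussian kernels whose weights are non-negative and sum to one.

```python
import math

def encode(graph):
    """Hypothetical stand-in for phi(g): maps a labeled graph to a vector in R^2.
    The paper uses a trained recursive neural network here instead."""
    labels, edges = graph
    n = len(labels)
    return (sum(labels) / n, len(edges) / n)

def normal_pdf(x, mu, sigma):
    """Univariate Normal density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2.0 * math.pi * sigma ** 2)

def mixture_pdf(x, weights, means, sigmas):
    """p_hat(x): convex combination of diagonal Gaussian kernels over R^d."""
    total = 0.0
    for c, mu, sg in zip(weights, means, sigmas):
        kernel = 1.0
        for xj, mj, sj in zip(x, mu, sg):
            kernel *= normal_pdf(xj, mj, sj)
        total += c * kernel
    return total

def log_likelihood(sample, weights, means, sigmas):
    """Logarithm of Eq. (2): sum over the sample of log p_hat(phi(g_i))."""
    return sum(math.log(mixture_pdf(encode(g), weights, means, sigmas)) for g in sample)

# Toy sample of two labeled DAGs: (vertex labels, edges as (source, target) pairs).
sample = [
    ([4.8, 5.1, 5.3], [(0, 1), (1, 2)]),
    ([5.0, 4.9, 5.2, 5.4], [(0, 1), (0, 2), (2, 3)]),
]
weights = [0.5, 0.5]                  # mixing coefficients, sum to one
means = [(5.0, 0.7), (4.0, 0.5)]      # one mean vector per kernel
sigmas = [(0.5, 0.2), (1.0, 0.3)]     # one vector of standard deviations per kernel
ll = log_likelihood(sample, weights, means, sigmas)
```

Working with the logarithm of the product in Eq. (2) is a common numerical convenience; it has the same maximizers as the likelihood itself.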
It is crucial to underline that we are not going to compute a rough encoding of graphs via a standard RNN, followed by a separate, standard ML estimation of a mixture of Normal densities defined over the encoded space. On the contrary, we propose a joint optimization of all model parameters, $\theta_\phi$ and $\theta_{\hat{p}}$, to increase the overall likelihood. In other words, the encoding and the likelihood are jointly optimized to maximize $p(T \mid \theta_\phi, \theta_{\hat{p}})$. Occasionally, the ML principle for this general class of mixtures may lead to singular solutions. This fact is well known from classical statistical theory; but, as pointed out in [2] (sec , page 199), it is an empirical fact that meaningful solutions can still be obtained.

A hill-climbing algorithm to carry out ML estimation of the parameters $\theta$ can be obtained as an instance of the gradient-ascent method over $p(T \mid \theta_\phi, \theta_{\hat{p}})$ in two steps: (i) initialization, i.e., start with some initial, e.g. random, assignment of values to the model parameters $\theta$; (ii) gradient ascent, i.e., repeatedly apply a learning rule in the form $\Delta\theta = \eta \nabla_\theta \{\prod_{i=1}^{n} \hat{p}(\phi(g_i \mid \theta_\phi) \mid \theta_{\hat{p}})\}$ with $\eta \in \mathbb{R}^+$. This is a batch learning setup. In practice, neural network learning may be simplified, and even improved, by adopting an on-line training scheme that prescribes $\Delta\theta = \eta \nabla_\theta \{\hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})\}$ upon presentation of each individual
training example $g$. Three distinct families of adaptive parameters $\theta$ have to be considered:

(1) Mixing parameters $c_1, \ldots, c_n$, i.e. the hidden-to-output weights of the RBF network. Constraints have to be placed on these parameters during the ML estimation process, in order to ensure that they are in $[0, 1]$ and that they sum to one. A simple way to satisfy the requirements is to introduce $n$ hidden parameters $\gamma_1, \ldots, \gamma_n$, which are unconstrained, and to set

$$c_i = \frac{\varsigma(\gamma_i)}{\sum_{j=1}^{n} \varsigma(\gamma_j)}, \quad i = 1, \ldots, n \tag{3}$$

where $\varsigma(x) = 1/(1 + e^{-x})$. Each $\gamma_i$ is then treated as an unknown parameter $\theta$ to be estimated via ML.

(2) The $d$-dimensional mean vector $\mu_i$ and $d \times d$ covariance matrix $\Sigma_i$ for each of the Gaussian kernels $K_i(x) = N(x; \mu_i, \Sigma_i)$, $i = 1, \ldots, n$ of the RBF, where $N(x; \mu_i, \Sigma_i)$ denotes a multivariate Normal pdf having mean vector $\mu_i$, covariance matrix $\Sigma_i$, and evaluated over the random vector $x$. A common (yet effective) simplification is to consider diagonal covariance matrices, i.e. independence among the components of the input vector $x$. This assumption leads to the following three major consequences: (i) modeling properties are not affected significantly, according to [6]; (ii) generalization capabilities of the overall model may turn out to be improved, since the number of free parameters is reduced; (iii) the $i$-th multivariate kernel $K_i$ may be expressed in the form of a product of $d$ univariate Normal densities as:

$$K_i(x) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left\{ -\frac{1}{2} \left( \frac{x_j - \mu_{ij}}{\sigma_{ij}} \right)^2 \right\} \tag{4}$$

i.e., the free parameters to be estimated are the means $\mu_{ij}$ and the standard deviations $\sigma_{ij}$, for each kernel $i = 1, \ldots, n$ and for each component $j = 1, \ldots, d$ of the input space.

(3) The weights $U$ of the encoding network. The learning rule has to rely on partial derivatives of the likelihood which are backpropagated down to the RBF inputs and, in turn, through the encoding net. In order to discourage singular solutions, e.g.
the tendency to map all the input graphs onto a single point in the encoded space by developing close-to-zero weights, the learning rule for $U$ shall include an additional regularization term which treats the network weights as random variables distributed according to a pdf whose modes are far from zero. The likelihood of the network weights is then taken into account in the optimization procedure.

In the following, we will derive explicit formulations for $\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial \theta}$ for each of the three families of free parameters $\theta$ within the proposed model. As regards a generic mixing parameter $c_i$, $i = 1, \ldots, n$, from Eq. (3), and since $p(g) = \sum_{k=1}^{n} c_k K_k(x)$, we have
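$$
\begin{aligned}
\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial \gamma_i}
&= \sum_{j=1}^{n} \frac{\partial p(g)}{\partial c_j} \frac{\partial c_j}{\partial \gamma_i} \\
&= \sum_{j=1}^{n} K_j(x) \frac{\partial}{\partial \gamma_i} \left( \frac{\varsigma(\gamma_j)}{\sum_k \varsigma(\gamma_k)} \right) \\
&= K_i(x) \left\{ \frac{\varsigma'(\gamma_i) \sum_k \varsigma(\gamma_k) - \varsigma(\gamma_i)\varsigma'(\gamma_i)}{\left[\sum_k \varsigma(\gamma_k)\right]^2} \right\} + \sum_{j \neq i} K_j(x) \left\{ \frac{-\varsigma(\gamma_j)\varsigma'(\gamma_i)}{\left[\sum_k \varsigma(\gamma_k)\right]^2} \right\} \\
&= K_i(x) \frac{\varsigma'(\gamma_i)}{\sum_k \varsigma(\gamma_k)} - \sum_{j} K_j(x) \frac{\varsigma(\gamma_j)\varsigma'(\gamma_i)}{\left[\sum_k \varsigma(\gamma_k)\right]^2} \\
&= K_i(x) \frac{\varsigma'(\gamma_i)}{\sum_k \varsigma(\gamma_k)} - \frac{\varsigma'(\gamma_i)}{\sum_k \varsigma(\gamma_k)} \sum_{j} c_j K_j(x) \\
&= \frac{\varsigma'(\gamma_i)}{\sum_k \varsigma(\gamma_k)} \left\{ K_i(x) - p(g) \right\}.
\end{aligned}
\tag{5}
$$

For the means $\mu_{ij}$ and the standard deviations $\sigma_{ij}$ we proceed as follows. Let $\theta_{ij}$ denote the free parameter, i.e. $\mu_{ij}$ or $\sigma_{ij}$, to be estimated. It is seen that:

$$\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial \theta_{ij}} = c_i \frac{\partial K_i(x)}{\partial \theta_{ij}} \tag{6}$$

where the calculation of $\frac{\partial K_i(x)}{\partial \theta_{ij}}$ can be accomplished as follows. First of all, let us observe that for any positive real-valued, differentiable function $f(\cdot)$ this property holds true: $\frac{\partial f(\cdot)}{\partial x} = f(\cdot) \frac{\partial \log[f(\cdot)]}{\partial x}$. As a consequence, from Eq. (4) we can write

$$
\begin{aligned}
\frac{\partial K_i(x)}{\partial \theta_{ij}}
&= K_i(x) \frac{\partial \log K_i(x)}{\partial \theta_{ij}} \\
&= K_i(x) \frac{\partial}{\partial \theta_{ij}} \left\{ -\sum_{k=1}^{d} \left[ \frac{1}{2} \log(2\pi\sigma_{ik}^2) + \frac{1}{2} \left( \frac{x_k - \mu_{ik}}{\sigma_{ik}} \right)^2 \right] \right\}.
\end{aligned}
\tag{7}
$$

For the means, i.e. $\theta_{ij} = \mu_{ij}$, Eq. (7) yields

$$\frac{\partial K_i(x)}{\partial \mu_{ij}} = K_i(x) \frac{x_j - \mu_{ij}}{\sigma_{ij}^2}. \tag{8}$$

For the standard deviations, i.e. $\theta_{ij} = \sigma_{ij}$, Eq. (7) takes the form:

$$
\begin{aligned}
\frac{\partial K_i(x)}{\partial \sigma_{ij}}
&= K_i(x) \frac{\partial}{\partial \sigma_{ij}} \left\{ -\frac{1}{2} \log(2\pi\sigma_{ij}^2) - \frac{1}{2} \left( \frac{x_j - \mu_{ij}}{\sigma_{ij}} \right)^2 \right\} \\
&= \frac{K_i(x)}{\sigma_{ij}} \left\{ \left( \frac{x_j - \mu_{ij}}{\sigma_{ij}} \right)^2 - 1 \right\}.
\end{aligned}
\tag{9}
$$

Finally, let us consider the connection weights $U = \{v_1, \ldots, v_s\}$ within the encoding network. For a generic $v \in U$, application of the chain rule yields: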
$$\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial v} = \frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial y} \frac{\partial y}{\partial v} \tag{10}$$

where $y$ is the output from the unit (in the encoding net) which is fed from connection $v$. The quantity $\frac{\partial y}{\partial v}$ can be easily computed by taking the partial derivative of the activation function associated with the unit itself, as usual. As regards the quantity $\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial y}$, we proceed as follows. First of all, let us assume that $v$ feeds the output layer, i.e. it connects a certain hidden unit with the $j$-th output unit of the encoding net. In this case, we have $y = x_j$. It is easy to see that:

$$
\begin{aligned}
\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial x_j}
&= \frac{\partial}{\partial x_j} \sum_{i=1}^{n} c_i K_i(x) \\
&= \sum_{i=1}^{n} c_i K_i(x) \frac{\partial \log K_i(x)}{\partial x_j} \\
&= \sum_{i=1}^{n} c_i K_i(x) \frac{\partial}{\partial x_j} \left\{ -\sum_{k=1}^{d} \left[ \frac{1}{2} \log(2\pi\sigma_{ik}^2) + \frac{1}{2} \left( \frac{x_k w_{ik} - \mu_{ik}}{\sigma_{ik}} \right)^2 \right] \right\} \\
&= \sum_{i=1}^{n} c_i K_i(x) \frac{\partial}{\partial x_j} \left\{ -\frac{1}{2} \left( \frac{x_j w_{ij} - \mu_{ij}}{\sigma_{ij}} \right)^2 \right\} \\
&= -\sum_{i=1}^{n} c_i \frac{K_i(x)}{\sigma_{ij}^2} (x_j w_{ij} - \mu_{ij})\, w_{ij}.
\end{aligned}
\tag{11}
$$

Here $w_{ik}$, which does not appear in Eq. (4), scales the $k$-th input of the $i$-th kernel; Eq. (4) corresponds to $w_{ik} = 1$. On the contrary, whenever $v$ is a hidden weight, the quantity $\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial y}$ can be obtained by applying the usual backpropagation through structures (BPTS) algorithm [7], once the deltas to be backpropagated have been initialized at the output layer via Eq. (11).

Unconstrained ML training of the weights of the encoding net may lead to singular solutions. To tackle the problem, we assume that the weights are random variables, independently drawn from a certain probability distribution $p(U) = \prod_{i=1}^{s} p(v_i)$. The pdf $p(v)$ is defined so as to encourage non-degenerate solutions, and the new criterion function $C$ to be maximized during gradient-ascent training is in the form of a joint pdf, namely:

$$C = \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})\, p(U). \tag{12}$$

Extremization of such a criterion results in weight values that yield high likelihood of the sample, and that are highly likely themselves. If the weights $U$ are randomly initialized in a uniform manner over the interval $(-\rho, \rho)$, an effective choice for $p(v)$ is a mixture of two Gaussian components in the form

$$p(v) = \frac{1}{2} N\!\left(v; \frac{\rho}{2}, \sigma^2\right) + \frac{1}{2} N\!\left(v; -\frac{\rho}{2}, \sigma^2\right). \tag{13}$$

Whenever $\sigma^2$ is chosen to be sufficiently small, two benefits are expected as training proceeds: (1) weights are encouraged to move toward non-degenerate solutions in the weight space; (2) a form of regularization of the learning process emerges, since complex solutions (i.e., weights too large in size) are discouraged.
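Two ingredients of the training rule lend themselves to a quick numerical check: the closed-form derivative with respect to the hidden mixing parameters, Eq. (5), and the weight prior of Eq. (13). The Python sketch below is illustrative only. The kernel values, the parameters `gammas`, and the settings `rho` and `sigma` are arbitrary assumptions rather than values from the paper, and the function names are hypothetical. The analytic expressions are compared against central finite differences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mixing_coefficients(gammas):
    """Eq. (3): c_i = sigmoid(gamma_i) / sum_j sigmoid(gamma_j)."""
    s = [sigmoid(g) for g in gammas]
    total = sum(s)
    return [si / total for si in s]

def mixture_value(gammas, kernels):
    """p(g) = sum_k c_k K_k(x), for fixed kernel values K_k(x)."""
    return sum(c * k for c, k in zip(mixing_coefficients(gammas), kernels))

def grad_gamma(gammas, kernels):
    """Eq. (5): sigmoid'(gamma_i) / sum_k sigmoid(gamma_k) * (K_i(x) - p(g))."""
    s = [sigmoid(g) for g in gammas]
    total = sum(s)
    p = mixture_value(gammas, kernels)
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    return [si * (1.0 - si) / total * (k - p) for si, k in zip(s, kernels)]

def normal(v, mu, sigma):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def weight_prior(v, rho, sigma):
    """Eq. (13): equal-weight mixture of N(rho/2, sigma^2) and N(-rho/2, sigma^2)."""
    return 0.5 * normal(v, rho / 2.0, sigma) + 0.5 * normal(v, -rho / 2.0, sigma)

def weight_prior_grad(v, rho, sigma):
    """Derivative of Eq. (13) with respect to v."""
    left = normal(v, rho / 2.0, sigma) * (v - rho / 2.0)
    right = normal(v, -rho / 2.0, sigma) * (v + rho / 2.0)
    return -0.5 * (left + right) / sigma ** 2

# Illustrative values (assumptions, not taken from the paper).
gammas = [0.3, -1.2, 0.8]
kernels = [0.05, 0.20, 0.01]   # hypothetical kernel values K_i(x) at one encoded point
rho, sigma = 1.0, 0.2
eps = 1e-6

analytic = grad_gamma(gammas, kernels)
numeric = []
for i in range(len(gammas)):
    up = list(gammas)
    up[i] += eps
    down = list(gammas)
    down[i] -= eps
    numeric.append((mixture_value(up, kernels) - mixture_value(down, kernels)) / (2 * eps))

v = 0.1
prior_numeric = (weight_prior(v + eps, rho, sigma) - weight_prior(v - eps, rho, sigma)) / (2 * eps)
prior_analytic = weight_prior_grad(v, rho, sigma)
```

With these settings the prior is symmetric around zero and its derivative vanishes at $v = 0$, so a weight sitting exactly at zero receives no push from the prior term alone; the push away from zero appears as soon as the weight is slightly perturbed.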
Given a generic weight $v \in U$, gradient ascent requires computing the partial derivative of the proposed criterion $C$ w.r.t. $v$, i.e.,

$$\frac{\partial C}{\partial v} = p(U) \frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial v} + \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}}) \frac{\partial p(U)}{\partial v}. \tag{14}$$

The quantity $\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial v}$ is computed as above, while the term $\frac{\partial p(U)}{\partial v}$ in Eq. (14) can be rewritten as follows:

$$\frac{\partial p(U)}{\partial v} = p(U) \frac{\partial \log p(U)}{\partial v} = p(U) \frac{\partial}{\partial v} \sum_{i=1}^{s} \log p(v_i) = \frac{p(U)}{p(v)} \frac{\partial p(v)}{\partial v} \tag{15}$$

which is computed in a straightforward manner by taking the derivative of Eq. (13) w.r.t. $v$.

3 Demonstration

Since there are no other approaches to pdf estimation over graphical domains, comparisons are impracticable. Consequently, we analyze the behavior of the model and we evaluate it via statistical tests on a synthetic task. Two different samples of GRGs were synthesized under controlled probabilistic conditions.

A first sample of 300 independent DAGs was randomly generated. These DAGs had a random number of vertices (between 5 and 20), a uniform distribution of edge connectivity (as in the classic Erdös and Rényi model [3]), and a real-valued label for vertices drawn from the Laplacian pdf $\frac{1}{2\theta} \exp\left(-\frac{|x - \mu|}{\theta}\right)$ with location $\mu = 5.0$ and smoothness $\theta = \frac{1}{2}$. Let us call $Q$ this collection of GRGs. $Q$ was partitioned into three equally-sized subsamples, $Q_0$ (training set), $Q_1$ (validation set) and $Q_2$ (test set).

Another sample $P$ of 200 independent DAGs was likewise obtained, each DAG having: a random number of vertices (between 5 and 20), a power-law preferential-attachment random connectivity (as in the Barabási and Albert model [1]) according to the value of the node labels (the "relevance", or "authority"), which were independently drawn from an exponential distribution $\lambda e^{-\lambda x}$, with inverse scale $\lambda = \frac{1}{3}$. The collection $P$ was split into equally-sized subsamples, $P_1$ (reference set) and $P_2$ (reference test set), as well.

Fig. 1. Learning, generalization and reference curves.
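A sample resembling $Q$ could be generated along the following lines. This is a rough Python sketch under stated assumptions, not the authors' generator: the edge probability `edge_prob` is not given in the text and is chosen arbitrarily, acyclicity is obtained by orienting every edge from the lower to the higher vertex index, the vertex count is drawn uniformly between 5 and 20, the smoothness is read as 1/2, and all function names are hypothetical.

```python
import random

def sample_laplace(rng, mu, theta):
    """Draw from (1 / (2 theta)) * exp(-|x - mu| / theta).
    A Laplace variate is mu plus a random sign times an exponential with mean theta."""
    magnitude = rng.expovariate(1.0 / theta)
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return mu + sign * magnitude

def sample_dag(rng, edge_prob=0.3, min_vertices=5, max_vertices=20, mu=5.0, theta=0.5):
    """Random labeled DAG: each pair (i, j) with i < j becomes an edge i -> j
    independently with probability edge_prob (an assumed value)."""
    n = rng.randint(min_vertices, max_vertices)
    labels = [sample_laplace(rng, mu, theta) for _ in range(n)]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < edge_prob]
    return labels, edges

rng = random.Random(0)                       # fixed seed for reproducibility
Q = [sample_dag(rng) for _ in range(300)]
Q0, Q1, Q2 = Q[:100], Q[100:200], Q[200:]    # training, validation and test subsamples
```

Because every edge points from a lower to a higher index, the vertex ordering is a topological order and no directed cycle can arise.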
Fig. 2. Learning and generalization curves magnified (left). Difference between learning and generalization curves (right).

Fig. 3. Final average log-likelihoods as a function of the training set size.

The pdf underlying the distribution of GRGs in $Q$ was then estimated, relying on the training subsample $Q_0$. An encoding net having 10 sigmoid hidden units and 2 linear encoding neurons was used, while the RBF used 4 Gaussian kernels. All the parameters were initialized at random. Different learning rates (chosen, along with the neural architectures, by evaluating the variations of the likelihood on training and validation sets) were applied for the different families of parameters, namely $\eta_\gamma = 1.0 \times 10^{-4}$, $\eta_\mu = 5.0 \times 10^{-5}$, $\eta_\sigma = 5.0 \times 10^{-6}$, $\eta_v = 1.0 \times 10^{-7}$ (the notation implicitly refers to the symbols used in Section 2).

Fig. 1 shows the learning curve (log-likelihood on $Q_0$), generalization curve (log-likelihood on $Q_1$), and reference curve (log-likelihood on $P_1$). It is seen that the criterion function is increased during training, as expected. All three curves exhibit a steep growth during early training, due to the fact that the RBF kernels quickly move from their random initial position toward the region in $\mathbb{R}^2$ where all the graphs are initially randomly encoded. Learning and generalization curves continue to grow smoothly, getting closer to each other. This fact, magnified in Fig. 2 (left), indicates that the estimated pdf model explains (i.e., assigns high likelihood to) independent samples of GRGs drawn from the same distribution equally well. On the contrary, the reference curve is constantly and significantly lower than the others, and starts dropping early (i.e., the model does not cover samples drawn from a different pdf). This is due to the constrained training of the RBF (which is forced to have a unit integral over its definition domain, i.e.
it peaks around the GRGs in $Q$ at the expense of those in $P$), and to the regularized training of the encoding network (whose weights are discouraged from moving toward solutions that could map all the GRGs onto a compact cluster, regardless of their original distribution). Early stopping of training was applied once the generalization curve began decreasing (after 1824 iterations), whilst the learning curve continued to grow (overfitting the training data). This is also seen in Fig. 2 (right), which plots the difference between the two curves: it decreases to a minimum at epoch 1824 before inverting its trend. The final average log-likelihood over $Q_0$, $Q_2$ and $P_2$, respectively, is shown in Fig. 3 as a function of the number of GRGs in the training set (results are averaged w.r.t. the cardinality of the training set; the experiment was repeated, accordingly, for the different cardinalities).

Let us now call $p(\cdot)$ the pdf model estimated from $Q_0$ (i.e., using 100 GRGs). Its capability to describe the statistical properties of the corresponding distribution $Q$ (but not those of the other, $P$) may be quantified by evaluating how well it explains the test samples $Q_2$ and $P_2$, according to some statistical criteria. First of all, in the spirit of the likelihood-ratio test, the overall log-likelihoods $L(Q_2)$ and $L(P_2)$ of the model given $Q_2$ and $P_2$, respectively, were computed. Let
us define $l(g) = \log p(g)$ for any given graph $g$, and let $\Lambda = \log(L(Q_2)/L(P_2))$ be the (log)likelihood-ratio. Table 1 reports the statistics.

Table 1. Log-likelihoods and likelihood-ratio ($\Lambda$) test of the estimated pdf model. Columns: $L(Q_2) = \sum_{g \in Q_2} l(g)$; $L(P_2) = \sum_{g \in P_2} l(g)$; $\Lambda$.

Roughly speaking, the model is highly likely to express the probabilistic law underlying $Q$ (but not $P$), as sought. The value of $\Lambda$ (i.e., the likelihood-ratio is $\gg 1$) confirms the high statistical significance of the test. These values express global statistics. Let us now evaluate, from an analytical point of view, the distribution of individual log-likelihoods yielded by $p(\cdot)$ over each graph in the test samples $Q_2$ and $P_2$ (100 values for each subsample). To this end, the Kolmogorov-Smirnov (KS) test is a popular choice for the evaluation of pdf models. Two (independent) null hypotheses were formed, namely: (1) the distribution of individual log-likelihoods yielded by $p(\cdot)$ when applied to $Q_2$ coincides with the analogous distribution yielded by the same model on $P_2$; (2) the distributions of individual log-likelihoods evaluated via $p(\cdot)$ on the samples $Q_0$ and $Q_2$ do not coincide. The KS test pointed out that both null hypotheses are rejected at a confidence level of 99.9%. That is, the model explains well the distribution of independent samples drawn from $Q$, but is highly unlikely to explain GRGs having a different underlying distribution.

4 Conclusion and On-Going Work

This paper was a first attempt to introduce pdf estimation over graphical domains. It gave a formal notion of GRG, and proposed a combined connectionist model with joint, gradient-ascent constrained optimization of the parameters over the ML criterion. The model was evaluated in terms of statistical tests (likelihood-ratio, KS) on synthetic distributions of GRGs.
On-going work focuses on applications to real-world tasks (e.g., classification and clustering of relational data).

References

1. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286, October.
2. R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, N.Y.
3. P. Erdös and A. Rényi. On random graphs. Publ. Math. Debrecen, 6.
4. B. Hammer, A. Micheli, and A. Sperduti. Universal approximation capability of cascade correlation for structures. Neural Computation, 17(5).
5. B. Hammer, A. Micheli, A. Sperduti, and M. Strickert. Recursive self-organizing network models. Neural Networks, 17(8-9).
6. G.J. McLachlan and K.E. Basford, editors. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, USA, 1988.
7. A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3), May.
8. E. Trentin and M. Gori. Robust combination of neural networks and hidden Markov models for speech recognition. IEEE Trans. on Neural Networks, 14(6), 2003.
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationWhat is semi-supervised learning?
What is semi-supervised learning? In many practical learning domains, there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate text processing, video-indexing,
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationModeling High-Dimensional Discrete Data with Multi-Layer Neural Networks
Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks Yoshua Bengio Dept. IRO Université de Montréal Montreal, Qc, Canada, H3C 3J7 bengioy@iro.umontreal.ca Samy Bengio IDIAP CP 592,
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationDiscriminative Direction for Kernel Classifiers
Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering
More informationLecture 16 Deep Neural Generative Models
Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed
More informationLecture 3: Pattern Classification
EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures
More informationBayesian Learning. Two Roles for Bayesian Methods. Bayes Theorem. Choosing Hypotheses
Bayesian Learning Two Roles for Bayesian Methods Probabilistic approach to inference. Quantities of interest are governed by prob. dist. and optimal decisions can be made by reasoning about these prob.
More informationStatistical NLP for the Web
Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks
More informationArtificial Neural Networks. Edward Gatt
Artificial Neural Networks Edward Gatt What are Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning Very
More informationNN V: The generalized delta learning rule
NN V: The generalized delta learning rule We now focus on generalizing the delta learning rule for feedforward layered neural networks. The architecture of the two-layer network considered below is shown
More informationUNSUPERVISED LEARNING
UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More informationFeedforward Neural Nets and Backpropagation
Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23 Supervised Learning Roadmap Supervised Learning: Assume that we are given the features
More informationIn the Name of God. Lectures 15&16: Radial Basis Function Networks
1 In the Name of God Lectures 15&16: Radial Basis Function Networks Some Historical Notes Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationEEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1
EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationArtificial Neural Networks
Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples
More informationPattern Classification
Pattern Classification All materials in these slides were taen from Pattern Classification (2nd ed) by R. O. Duda,, P. E. Hart and D. G. Stor, John Wiley & Sons, 2000 with the permission of the authors
More informationMLPR: Logistic Regression and Neural Networks
MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition Amos Storkey Amos Storkey MLPR: Logistic Regression and Neural Networks 1/28 Outline 1 Logistic Regression 2 Multi-layer
More informationLearning Methods for Linear Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2011/2012 Lesson 20 27 April 2012 Contents Learning Methods for Linear Detectors Learning Linear Detectors...2
More informationOutline. MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition. Which is the correct model? Recap.
Outline MLPR: and Neural Networks Machine Learning and Pattern Recognition 2 Amos Storkey Amos Storkey MLPR: and Neural Networks /28 Recap Amos Storkey MLPR: and Neural Networks 2/28 Which is the correct
More informationLecture 3 Feedforward Networks and Backpropagation
Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)
More information3 Undirected Graphical Models
Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 3 Undirected Graphical Models In this lecture, we discuss undirected
More informationNeural Networks Learning the network: Backprop , Fall 2018 Lecture 4
Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:
More informationLogistic Regression & Neural Networks
Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability
More informationLecture 2: Learning with neural networks
Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for
More informationHMM and IOHMM Modeling of EEG Rhythms for Asynchronous BCI Systems
HMM and IOHMM Modeling of EEG Rhythms for Asynchronous BCI Systems Silvia Chiappa and Samy Bengio {chiappa,bengio}@idiap.ch IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland Abstract. We compare the use
More informationLearning Vector Quantization
Learning Vector Quantization Neural Computation : Lecture 18 John A. Bullinaria, 2015 1. SOM Architecture and Algorithm 2. Vector Quantization 3. The Encoder-Decoder Model 4. Generalized Lloyd Algorithms
More information10-701/ Machine Learning, Fall
0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training
More informationA Simple Algorithm for Learning Stable Machines
A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationBeyond the Point Cloud: From Transductive to Semi-Supervised Learning
Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Vikas Sindhwani, Partha Niyogi, Mikhail Belkin Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of
More informationArtificial Neural Networks
0 Artificial Neural Networks Based on Machine Learning, T Mitchell, McGRAW Hill, 1997, ch 4 Acknowledgement: The present slides are an adaptation of slides drawn by T Mitchell PLAN 1 Introduction Connectionist
More informationSummary and discussion of: Dropout Training as Adaptive Regularization
Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationHeuristics for The Whitehead Minimization Problem
Heuristics for The Whitehead Minimization Problem R.M. Haralick, A.D. Miasnikov and A.G. Myasnikov November 11, 2004 Abstract In this paper we discuss several heuristic strategies which allow one to solve
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationDiversity Regularization of Latent Variable Models: Theory, Algorithm and Applications
Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationIntroduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis
Introduction to Natural Computation Lecture 9 Multilayer Perceptrons and Backpropagation Peter Lewis 1 / 25 Overview of the Lecture Why multilayer perceptrons? Some applications of multilayer perceptrons.
More informationARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92
ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationECE662: Pattern Recognition and Decision Making Processes: HW TWO
ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationNotation. Pattern Recognition II. Michal Haindl. Outline - PR Basic Concepts. Pattern Recognition Notions
Notation S pattern space X feature vector X = [x 1,...,x l ] l = dim{x} number of features X feature space K number of classes ω i class indicator Ω = {ω 1,...,ω K } g(x) discriminant function H decision
More informationNeutron inverse kinetics via Gaussian Processes
Neutron inverse kinetics via Gaussian Processes P. Picca Politecnico di Torino, Torino, Italy R. Furfaro University of Arizona, Tucson, Arizona Outline Introduction Review of inverse kinetics techniques
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationDeep Feedforward Networks. Seung-Hoon Na Chonbuk National University
Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationLecture 3: Pattern Classification. Pattern classification
EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and
More informationApproximating the Covariance Matrix with Low-rank Perturbations
Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu
More information