A Maximum-likelihood connectionist model for unsupervised learning over graphical domains


Edmondo Trentin and Leonardo Rigutini
DII - Università di Siena, V. Roma 56, Siena (Italy)

Abstract. Supervised relational learning over labeled graphs, e.g. via recursive neural nets, has received considerable attention from the connectionist community. Surprisingly, with the exception of recursive self-organizing maps, unsupervised paradigms have been far less investigated. In particular, no algorithms for density estimation over graphs are found in the literature. This paper first introduces a formal notion of probability density function (pdf) over graphical spaces. It then proposes a maximum-likelihood pdf estimation technique, relying on the joint optimization of a recursive encoding network and a constrained radial basis functions-like net. Preliminary experiments on synthetically generated samples of labeled graphs are analyzed and tested statistically.

Key words: Density estimation, unsupervised relational learning, recursive network

1 Introduction

Two major instances of unsupervised learning have long been considered in statistical pattern recognition, namely the estimation of probability density functions (pdf) and clustering algorithms [2]. An approximate borderline between the two setups can be traced by saying that the former focuses on the probabilistic properties of the data sample, whilst the latter is rather topology-oriented, in the sense that it concentrates on certain topological properties (e.g., distance measures among patterns and/or centroids). Several unsupervised training algorithms for neural networks were also introduced (e.g., competitive neural nets and self-organizing maps). Most of these neural networks are rooted in the topological framework (i.e., clustering, topologically consistent mappings, etc.), although a few exceptions aimed at pdf estimation can be found [8]. It is rather surprising to realize that, despite the amount of work accomplished in the community on relational and graphical learners in the last few decades, only limited attention has been paid to unsupervised relational learning (a remarkable exception, in the topological framework, is [5]), and to pdf estimation in the first place. This is even more surprising if we consider that the original motivations for undertaking the study of unsupervised algorithms are often strong in graphical domains (such as the World Wide Web, or biological and chemical data, which have a natural representation in terms of variable-size labeled graphs), where large amounts of unlabeled samples are

available. These motivations include: (i) the need for a compact description of the overall distribution of a sample; (ii) the need for a measure of the likelihood of a certain model given a graphical structure, to be used for assigning the structure to a certain group ("cluster"); (iii) the need for techniques that can deal with large amounts of unlabeled data, in order to ease the design of semi-supervised classifiers, or to facilitate the adaptation of previously trained machines to new data or new environmental conditions.

This paper is a first attempt to introduce a model for the estimation of pdfs over graphical domains. It exploits the encoding capabilities of recursive neural nets (RNN) [7], combined with a constrained radial basis function (RBF)-like network. A gradient-ascent, maximum-likelihood (ML) training algorithm is proposed, which jointly optimizes the encoding network [7] and the RBF in order to obtain the pdf estimate from an unsupervised sample of graphs. Constraints are introduced in both neural nets such that the resulting estimate can be interpreted as a pdf (non-negative, unit integral over its definition domain), and such that the encoding of the graphs does not lead to singular solutions.

In order to introduce the model, it is necessary to give a formal definition of a pdf over graphs in the first place, via the notion of generalized random graph (an extension of traditional random graphs [3]). Let $\mathcal{V}$ be a given discrete or continuous-valued set (the vertex universe), and let $\Omega$ be any given sample space. We define a generalized random graph (GRG) over $\mathcal{V}$ and $\Omega$ as a function $G : \Omega \rightarrow \{(V,E) \mid V \subseteq \mathcal{V}, E \subseteq V \times V\}$ (note that labels in the form of real-valued vectors associated with vertices and/or edges are easily encapsulated within the definition). Let then $\mathcal{G} = \{g \mid g = (V,E), V \subseteq \mathcal{V}, E \subseteq V \times V\}$. We define a probability density function (pdf) for GRGs over $\mathcal{V}$ as a function $p : \mathcal{G} \rightarrow \mathbb{R}$ such that (1) $p(g) \geq 0$ for all $g \in \mathcal{G}$, and (2) $\int_{\mathcal{G}} p(g)\,dg = 1$. The integral in (2) has a mathematical meaning, since the (Lebesgue-)measurability of the space of graphs defined over measurable domains (and with measurable labels), like countable sets or real vectors, is shown in [4]. The extension of notions from traditional probability theory (conditional pdf, joint pdf, statistical independence) to GRGs is a straightforward exercise.

Now, suppose that a sample $\mathcal{T} = \{g_1, \ldots, g_n \mid g_i \in \mathcal{G}, i = 1, \ldots, n\}$ of $n$ graphs has been collected. The pdf estimation problem faced in this paper can be stated as follows: assuming that all the GRGs in $\mathcal{T}$ have been independently drawn from a certain pdf $p(g)$, how can the dataset be used in order to estimate a reasonable model of $p(g)$?
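For concreteness, a vertex-labeled DAG from such a sample can be represented in code as a simple structure of label vectors and directed edges. The following Python sketch is purely illustrative: the paper prescribes no data format, and all names and fields here are assumptions.

```python
# Illustrative representation of a vertex-labeled DAG (one element g of the sample T).
# This is an assumption made for concreteness; the paper prescribes no data format.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class LabeledDAG:
    labels: Dict[int, List[float]] = field(default_factory=dict)  # vertex id -> real-valued label vector
    edges: List[Tuple[int, int]] = field(default_factory=list)    # directed (parent, child) pairs

g = LabeledDAG()
g.labels = {0: [5.2], 1: [4.7], 2: [5.9]}
g.edges = [(0, 1), (0, 2)]
T = [g]  # a sample of GRGs is then simply a collection of such structures
```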

2 A plausible neural answer to the question

We assume that $p(g)$ is a function having a fixed and known parametric form, determined uniquely by the specific value of a set of parameters $\theta = (\theta_1, \ldots, \theta_k)$. To render this dependency on $\theta$ more explicit, we slightly modify our notation by writing $p(g)$ as $p(g|\theta)$. Under this assumption, the question posed at the end of the previous Section can be restated as: how can we use the sample $\mathcal{T}$ in order to obtain estimates of $\theta$ that are meaningful according to a certain optimality criterion? A sound answer may be found in the adoption of the ML criterion, along with a suitable method for maximizing the likelihood $p(\mathcal{T}|\theta)$ of the parameters given the sample. Since $g_1, \ldots, g_n$ are assumed to be i.i.d., the likelihood $p(\mathcal{T}|\theta)$ can be written as $p(\mathcal{T}|\theta) = \prod_{i=1}^{n} p(g_i|\theta)$.

Before attempting the maximization of the likelihood, it is necessary to specify a well-defined form for the pdf $p(g|\theta)$. Let us assume the existence of an integer $d$ and of two functions, $\phi : \mathcal{G} \rightarrow \mathbb{R}^d$ and $\hat{p} : \mathbb{R}^d \rightarrow \mathbb{R}$, such that $p(g|\theta)$ can be decomposed as:

$$p(g|\theta) = \hat{p}(\phi(g)). \qquad (1)$$

There exist infinitely many choices of $\phi(\cdot)$ and $\hat{p}(\cdot)$ that satisfy Eq. (1), the most trivial being $\phi(g) = p(g|\theta)$, $\hat{p}(x) = x$. We call $\phi(\cdot)$ the encoding, while $\hat{p}(\cdot)$ is simply referred to as the likelihood. Again, we assume parametric forms $\phi(g|\theta_\phi)$ and $\hat{p}(x|\theta_{\hat{p}})$ for the encoding and for the likelihood, respectively, and we set $\theta = (\theta_\phi, \theta_{\hat{p}})$. The ML estimation of $\theta$ given $\mathcal{T}$ now requires finding parameter vectors $\theta_\phi$ and $\theta_{\hat{p}}$ that maximize the quantity

$$p(\mathcal{T}|\theta_\phi, \theta_{\hat{p}}) = \prod_{i=1}^{n} \hat{p}(\phi(g_i|\theta_\phi)|\theta_{\hat{p}}). \qquad (2)$$

We propose a two-block connectionist/statistical model for $p(g|\theta)$ as follows. The function $\phi(g|\theta_\phi)$ is realized via an encoding network, suitable for mapping directed acyclic graphs (DAG) $g$ into real vectors $x$, as described in [7] for supervised training of recursive neural networks (RNN) over structured domains. The weights of the encoding network become the parameters $\theta_\phi$. A radial basis functions (RBF)-like neural net is then used to model the likelihood function $\hat{p}(x|\theta_{\hat{p}})$, where $\theta_{\hat{p}}$ are the parameters of the RBF. In order to ensure that a pdf is obtained, specific constraints have to be placed on the nature of the RBF kernels, as well as on the hidden-to-output connection weights. It is crucial to underline that we are not going to compute a rough encoding of graphs via a standard RNN followed by a separate, standard ML estimation of a mixture of Normal densities defined over the encoded space. On the contrary, we propose a joint optimization of all model parameters, $\theta_\phi$ and $\theta_{\hat{p}}$, to increase the overall likelihood. In other words, the encoding and the likelihood are jointly optimized to maximize $p(\mathcal{T}|\theta_\phi, \theta_{\hat{p}})$. Occasionally, the ML principle for this general class of mixtures may lead to singular solutions. This fact is well known from classical statistical theory; but, as pointed out in [2] (page 199), it is an empirical fact that meaningful solutions can still be obtained.

A hill-climbing algorithm for the ML estimation of the parameters $\theta$ can be obtained as an instance of the gradient-ascent method over $p(\mathcal{T}|\theta_\phi, \theta_{\hat{p}})$ in two steps: (i) initialization, i.e., start with some initial (e.g., random) assignment of values to the model parameters $\theta$; (ii) gradient ascent, i.e., repeatedly apply a learning rule of the form $\Delta\theta = \eta \frac{\partial}{\partial\theta}\left\{\prod_{i=1}^{n} \hat{p}(\phi(g_i|\theta_\phi)|\theta_{\hat{p}})\right\}$ with $\eta \in \mathbb{R}^+$. This is a batch learning setup. In practice, neural network learning may be simplified, and even improved, with the adoption of an on-line training scheme that prescribes $\Delta\theta = \eta \frac{\partial}{\partial\theta}\left\{\hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})\right\}$ upon presentation of each individual training example $g$.
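To make the on-line rule concrete, the following self-contained Python toy (not the authors' code) applies $\Delta\theta = \eta\,\partial\hat{p}/\partial\theta$ with a single univariate Gaussian standing in for the RBF mixture and plain real numbers standing in for the encoded graphs $\phi(g)$; the data, learning rate and initialization are illustrative assumptions.

```python
# Self-contained toy illustration (not the authors' code) of the on-line rule
# delta_theta = eta * d p_hat / d theta, with a single univariate Gaussian in place of the
# RBF mixture and real numbers in place of encoded graphs phi(g).
import math
import random

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

random.seed(0)
data = [random.gauss(5.0, 1.0) for _ in range(200)]   # stand-ins for the encodings phi(g_i)
mu, sigma, eta = 3.0, 1.0, 0.1                        # initial mean, fixed std, learning rate

for epoch in range(30):
    random.shuffle(data)
    for x in data:
        like = gaussian(x, mu, sigma)                 # p_hat(phi(g))
        mu += eta * like * (x - mu) / sigma ** 2      # eta * d p_hat / d mu (cf. Eq. (8) below)
print(round(mu, 2))                                   # typically ends up near 5.0, the data location
```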

Three distinct families of adaptive parameters $\theta$ have to be considered:

(1) The mixing parameters $c_1, \ldots, c_n$, i.e. the hidden-to-output weights of the RBF network. Constraints have to be placed on these parameters during the ML estimation process, in order to ensure that they lie in $[0,1]$ and that they sum to one. A simple way to satisfy these requirements is to introduce $n$ hidden parameters $\gamma_1, \ldots, \gamma_n$, which are unconstrained, and to set

$$c_i = \frac{\varsigma(\gamma_i)}{\sum_{j=1}^{n} \varsigma(\gamma_j)}, \quad i = 1, \ldots, n \qquad (3)$$

where $\varsigma(x) = 1/(1 + e^{-x})$. Each $\gamma_i$ is then treated as an unknown parameter $\theta$ to be estimated via ML.

(2) The $d$-dimensional mean vector $\mu_i$ and the $d \times d$ covariance matrix $\Sigma_i$ for each of the Gaussian kernels $K_i(x) = N(x; \mu_i, \Sigma_i)$, $i = 1, \ldots, n$ of the RBF, where $N(x; \mu_i, \Sigma_i)$ denotes a multivariate Normal pdf having mean vector $\mu_i$ and covariance matrix $\Sigma_i$, evaluated over the random vector $x$. A common (yet effective) simplification is to consider diagonal covariance matrices, i.e. independence among the components of the input vector $x$. This assumption has three major consequences: (i) modeling properties are not affected significantly, according to [6]; (ii) the generalization capabilities of the overall model may turn out to be improved, since the number of free parameters is reduced; (iii) the $i$-th multivariate kernel $K_i$ may be expressed as a product of $d$ univariate Normal densities:

$$K_i(x) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left\{-\frac{1}{2}\left(\frac{x_j - \mu_{ij}}{\sigma_{ij}}\right)^2\right\} \qquad (4)$$

i.e., the free parameters to be estimated are the means $\mu_{ij}$ and the standard deviations $\sigma_{ij}$, for each kernel $i = 1, \ldots, n$ and for each component $j = 1, \ldots, d$ of the input space (Eqs. (3) and (4) are illustrated in the code sketch following this list).

(3) The weights $U$ of the encoding network. The learning rule has to rely on partial derivatives of the likelihood, which are backpropagated down to the RBF inputs and, in turn, through the encoding net. In order to discourage singular solutions, e.g. the tendency to map all the input graphs onto a single point in the encoded space by developing close-to-zero weights, the learning rule for $U$ includes an additional regularization term which treats the network weights as random variables distributed according to a pdf whose modes are far from zero. The likelihood of the network weights is then taken into account in the optimization procedure.
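As a concrete (and purely illustrative) rendering of items (1) and (2), the following numpy sketch evaluates the constrained mixing weights of Eq. (3) and the diagonal-covariance kernels of Eq. (4); the logistic form of $\varsigma$ and the array shapes are assumptions, and this is not the authors' implementation.

```python
# Hedged numpy sketch of the constrained RBF output: mixing weights from Eq. (3),
# diagonal-covariance Gaussian kernels from Eq. (4). Assumed shapes: x is (d,),
# gamma is (n,), mu and sigma are (n, d).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mixing_coefficients(gamma):
    """Eq. (3): c_i in [0, 1] with sum_i c_i = 1, from unconstrained gamma_i."""
    s = sigmoid(gamma)
    return s / s.sum()

def diagonal_gaussian_kernels(x, mu, sigma):
    """Eq. (4): K_i(x) as a product of d univariate Normal densities, for all i at once."""
    z = (x - mu) / sigma
    return np.prod(np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma), axis=1)

def rbf_density(x, gamma, mu, sigma):
    """The model's likelihood p_hat(x) = sum_i c_i K_i(x) at the encoding x = phi(g)."""
    return float(np.dot(mixing_coefficients(gamma), diagonal_gaussian_kernels(x, mu, sigma)))
```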

In the following, we derive explicit formulations of $\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial \theta}$ for each of the three families of free parameters $\theta$ within the proposed model.

As regards a generic mixing parameter $c_i$, $i = 1, \ldots, n$, from Eq. (3), and since $p(g) = \sum_{k=1}^{n} c_k K_k(x)$, we have

$$\begin{aligned}
\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial \gamma_i} &= \sum_{j=1}^{n} \frac{\partial p(g)}{\partial c_j}\,\frac{\partial c_j}{\partial \gamma_i} \qquad (5) \\
&= \sum_{j=1}^{n} K_j(x)\,\frac{\partial}{\partial \gamma_i}\left(\frac{\varsigma(\gamma_j)}{\sum_{k=1}^{n}\varsigma(\gamma_k)}\right) \\
&= K_i(x)\,\frac{\varsigma'(\gamma_i)\sum_k \varsigma(\gamma_k) - \varsigma(\gamma_i)\varsigma'(\gamma_i)}{\left[\sum_k \varsigma(\gamma_k)\right]^2} - \sum_{j \neq i} K_j(x)\,\frac{\varsigma(\gamma_j)\varsigma'(\gamma_i)}{\left[\sum_k \varsigma(\gamma_k)\right]^2} \\
&= K_i(x)\,\frac{\varsigma'(\gamma_i)}{\sum_k \varsigma(\gamma_k)} - \frac{\varsigma'(\gamma_i)}{\sum_k \varsigma(\gamma_k)}\sum_{j} c_j K_j(x) \\
&= \frac{\varsigma'(\gamma_i)}{\sum_k \varsigma(\gamma_k)}\,\{K_i(x) - p(g)\}.
\end{aligned}$$

For the means $\mu_{ij}$ and the standard deviations $\sigma_{ij}$ we proceed as follows. Let $\theta_{ij}$ denote the free parameter, i.e. $\mu_{ij}$ or $\sigma_{ij}$, to be estimated. It is seen that

$$\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial \theta_{ij}} = c_i\,\frac{\partial K_i(x)}{\partial \theta_{ij}} \qquad (6)$$

where the calculation of $\frac{\partial K_i(x)}{\partial \theta_{ij}}$ can be accomplished as follows. First of all, observe that for any real-valued, differentiable function $f(\cdot)$ the property $\frac{\partial f(\cdot)}{\partial x} = f(\cdot)\,\frac{\partial}{\partial x}\log[f(\cdot)]$ holds true. As a consequence, from Eq. (4) we can write

$$\frac{\partial K_i(x)}{\partial \theta_{ij}} = K_i(x)\,\frac{\partial \log K_i(x)}{\partial \theta_{ij}} = K_i(x)\,\frac{\partial}{\partial \theta_{ij}}\left\{-\sum_{k=1}^{d}\left[\frac{1}{2}\log(2\pi\sigma_{ik}^2) + \frac{1}{2}\left(\frac{x_k - \mu_{ik}}{\sigma_{ik}}\right)^2\right]\right\}. \qquad (7)$$

For the means, i.e. $\theta_{ij} = \mu_{ij}$, Eq. (7) yields

$$\frac{\partial K_i(x)}{\partial \mu_{ij}} = K_i(x)\,\frac{x_j - \mu_{ij}}{\sigma_{ij}^2}. \qquad (8)$$

For the standard deviations, i.e. $\theta_{ij} = \sigma_{ij}$, Eq. (7) takes the form

$$\frac{\partial K_i(x)}{\partial \sigma_{ij}} = K_i(x)\,\frac{\partial}{\partial \sigma_{ij}}\left\{-\frac{1}{2}\log(2\pi\sigma_{ij}^2) - \frac{1}{2}\left(\frac{x_j - \mu_{ij}}{\sigma_{ij}}\right)^2\right\} = \frac{K_i(x)}{\sigma_{ij}}\left\{\left(\frac{x_j - \mu_{ij}}{\sigma_{ij}}\right)^2 - 1\right\}. \qquad (9)$$
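A compact numpy sketch of the RBF-side gradients in Eqs. (5)-(9) is given below. It assumes the logistic squashing function $\varsigma$ (so that $\varsigma'(\gamma) = \varsigma(\gamma)(1-\varsigma(\gamma))$) and the array shapes noted in the comments; it is an illustration, not the authors' implementation.

```python
# Illustrative numpy implementation of the RBF-side gradients in Eqs. (5)-(9).
# Assumed shapes: x is (d,), gamma is (n,), mu and sigma are (n, d).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbf_value_and_gradients(x, gamma, mu, sigma):
    s = sigmoid(gamma)
    c = s / s.sum()                                                   # mixing weights, Eq. (3)
    z = (x - mu) / sigma
    K = np.prod(np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma), axis=1)  # kernels, Eq. (4)
    p = float(np.dot(c, K))                                           # p_hat(x) = sum_i c_i K_i(x)
    d_gamma = s * (1.0 - s) / s.sum() * (K - p)                       # Eq. (5), with sigmoid' = s(1 - s)
    d_mu = (c * K)[:, None] * (x - mu) / sigma ** 2                   # Eqs. (6) and (8)
    d_sigma = (c * K)[:, None] * (z ** 2 - 1.0) / sigma               # Eqs. (6) and (9)
    return p, d_gamma, d_mu, d_sigma
```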

Finally, let us consider the connection weights $U = \{v_1, \ldots, v_s\}$ of the encoding network. For a generic $v \in U$, application of the chain rule yields

$$\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial v} = \frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial y}\,\frac{\partial y}{\partial v} \qquad (10)$$

where $y$ is the output of the unit (in the encoding net) which is fed by connection $v$. The quantity $\frac{\partial y}{\partial v}$ is easily computed by taking the partial derivative of the activation function associated with the unit itself, as usual. As regards the quantity $\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial y}$, we proceed as follows. First, assume that $v$ feeds the output layer, i.e. it connects a certain hidden unit with the $j$-th output unit of the encoding net. In this case $y = x_j$, and it is easy to see that

$$\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial x_j} = \frac{\partial}{\partial x_j}\sum_{i=1}^{n} c_i K_i(x) = \sum_{i=1}^{n} c_i K_i(x)\,\frac{\partial \log K_i(x)}{\partial x_j} = -\sum_{i=1}^{n} c_i\,\frac{K_i(x)}{\sigma_{ij}^2}\,(x_j - \mu_{ij}). \qquad (11)$$

On the contrary, whenever $v$ is a hidden weight, the quantity $\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial v}$ can be obtained by applying the usual backpropagation through structures (BPTS) algorithm [7], once the deltas to be backpropagated have been initialized at the output layer via Eq. (11).

Unconstrained ML training of the weights of the encoding net may lead to singular solutions. To tackle the problem, we assume that the weights are random variables, independently drawn from a certain probability distribution $p(U) = \prod_{i=1}^{s} p(v_i)$. The pdf $p(v)$ is defined so as to encourage non-degenerate solutions, and the new criterion function $C$ to be maximized during gradient-ascent training takes the form of a joint pdf, namely:

$$C = \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})\,p(U). \qquad (12)$$

Maximization of this criterion results in weight values that yield a high likelihood of the sample, and that are highly likely themselves. If the weights $U$ are randomly initialized in a uniform manner over the interval $(-\rho, \rho)$, an effective choice for $p(v)$ is a mixture of two Gaussian components of the form

$$p(v) = \frac{1}{2} N\!\left(v; -\frac{\rho}{2}, \sigma^2\right) + \frac{1}{2} N\!\left(v; \frac{\rho}{2}, \sigma^2\right). \qquad (13)$$

Whenever $\sigma^2$ is chosen to be sufficiently small, two benefits are expected as training proceeds: (1) weights are encouraged to move toward non-degenerate solutions in the weight space; (2) a form of regularization of the learning process emerges, since overly complex solutions (i.e., weights too large in size) are discouraged.
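A small illustrative sketch of the weight prior in Eq. (13) and of its derivative (the ingredient needed below for the regularized gradient) follows; the values of $\rho$ and $\sigma^2$ are arbitrary assumptions, not taken from the paper.

```python
# Hedged sketch of the bimodal weight prior of Eq. (13) and its derivative with respect to a
# single weight v. rho and var (= sigma^2) are illustrative values, not taken from the paper.
import numpy as np

def normal_pdf(v, mean, var):
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def weight_prior(v, rho=1.0, var=0.01):
    """Eq. (13): p(v) = 0.5 N(v; -rho/2, var) + 0.5 N(v; +rho/2, var)."""
    return 0.5 * normal_pdf(v, -rho / 2, var) + 0.5 * normal_pdf(v, rho / 2, var)

def weight_prior_grad(v, rho=1.0, var=0.01):
    """d p(v) / d v, using d/dv N(v; m, var) = -N(v; m, var) * (v - m) / var."""
    return (-0.5 * normal_pdf(v, -rho / 2, var) * (v + rho / 2) / var
            - 0.5 * normal_pdf(v, rho / 2, var) * (v - rho / 2) / var)

# The full prior over the encoding weights is p(U) = prod_i p(v_i), so log p(U) = sum_i log p(v_i).
```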

Given a generic weight $v \in U$, gradient ascent requires the partial derivative of the proposed criterion $C$ with respect to $v$, i.e.,

$$\frac{\partial C}{\partial v} = p(U)\,\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial v} + \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})\,\frac{\partial p(U)}{\partial v}. \qquad (14)$$

The quantity $\frac{\partial \hat{p}(\phi(g|\theta_\phi)|\theta_{\hat{p}})}{\partial v}$ is computed as above, while the term $\frac{\partial p(U)}{\partial v}$ in Eq. (14) can be rewritten as follows:

$$\frac{\partial p(U)}{\partial v} = p(U)\,\frac{\partial \log p(U)}{\partial v} = p(U)\,\frac{\partial}{\partial v}\sum_{i=1}^{s}\log p(v_i) = \frac{p(U)}{p(v)}\,\frac{\partial p(v)}{\partial v} \qquad (15)$$

which is computed in a straightforward manner by taking the derivative of Eq. (13) with respect to $v$.

3 Demonstration

Since there are no other approaches to pdf estimation over graphical domains, comparisons are impracticable. Consequently, we analyze the behavior of the model and evaluate it via statistical tests on a synthetic task. Two different samples of GRGs were synthesized under controlled probabilistic conditions. A first sample of 300 independent DAGs was randomly generated. These DAGs had a random number of vertices (between 5 and 20), a uniform distribution of edge connectivity (as in the classic Erdös and Rényi model [3]), and a real-valued vertex label drawn from the Laplacian pdf $\frac{1}{2\theta}\exp\left(-\frac{|x-\mu|}{\theta}\right)$ with location $\mu = 5.0$ and smoothness $\theta = \frac{1}{2}$. Let us call Q this collection of GRGs. Q was partitioned into three equally-sized subsamples, $Q_0$ (training set), $Q_1$ (validation set) and $Q_2$ (test set). Another sample P of 200 independent DAGs was likewise obtained, each DAG having a random number of vertices (between 5 and 20) and a power-law preferential-attachment random connectivity (as in the Barabási and Albert model [1]) driven by the value of the node labels (the relevance, or "authority"), which were independently drawn from an exponential distribution $\lambda e^{-\lambda x}$ with inverse scale $\lambda = \frac{1}{3}$. The collection P was likewise split into two equally-sized subsamples, $P_1$ (reference set) and $P_2$ (reference test set).
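As an illustration of how the first sample Q might be reproduced, consider the following numpy sketch; it is an assumption, since the paper does not publish its generator, and the edge probability `p_edge` in particular is not reported.

```python
# Hedged sketch of a generator for the first synthetic sample Q: random DAGs with 5-20 vertices,
# uniform edge connectivity, and Laplacian vertex labels with location 5.0 and smoothness 0.5.
# The edge probability p_edge is an assumption; the paper does not report it.
import numpy as np

rng = np.random.default_rng(0)

def random_labeled_dag(p_edge=0.3, mu=5.0, theta=0.5):
    n = int(rng.integers(5, 21))                          # number of vertices
    labels = rng.laplace(loc=mu, scale=theta, size=n)     # one real-valued label per vertex
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < p_edge]                    # edges respect vertex order -> acyclic
    return labels, edges

Q = [random_labeled_dag() for _ in range(300)]            # 300 independent DAGs, as in the text
```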

Fig. 1. Learning, generalization and reference curves.

Fig. 2. Learning and generalization curves magnified (left). Difference between learning and generalization curves (right).

Fig. 3. Final average log-likelihoods as a function of the training set size.

The pdf underlying the distribution of GRGs in Q was then estimated, relying on the training subsample $Q_0$. An encoding net having 10 sigmoid hidden units and 2 linear encoding neurons was used, while the RBF used 4 Gaussian kernels. All the parameters were initialized at random. Different learning rates (chosen, along with the neural architectures, by evaluating the variations of the likelihood on the training and validation sets) were applied to the different families of parameters, namely $\eta_\gamma = 1.0 \cdot 10^{-4}$, $\eta_\mu = 5.0 \cdot 10^{-5}$, $\eta_\sigma = 5.0 \cdot 10^{-6}$, $\eta_v = 1.0 \cdot 10^{-7}$ (the notation implicitly refers to the symbols used in Section 2). Fig. 1 shows the learning curve (log-likelihood on $Q_0$), the generalization curve (log-likelihood on $Q_1$), and the reference curve (log-likelihood on $P_1$). It is seen that the criterion function increases during training, as expected. All three curves exhibit a steep growth early in training, due to the fact that the RBF kernels quickly move from their random initial position toward the region in $\mathbb{R}^2$ where all the graphs are initially randomly encoded. The learning and generalization curves continue to grow smoothly, getting closer to each other. This fact, magnified in Fig. 2 (left), indicates that the estimated pdf model explains (i.e., has high likelihood on) independent samples of GRGs drawn from the same distribution equally well. On the contrary, the reference curve is constantly and significantly lower than the others, and starts dropping early (i.e., the model does not cover samples drawn from a different pdf). This is due to the constrained training of the RBF (which is forced to have a unit integral over its definition domain, i.e. it peaks around the GRGs in Q at the expense of those in P), and to the regularized training of the encoding network (whose weights are discouraged from moving toward solutions that could map all the GRGs onto a compact cluster, regardless of their original distribution). Early stopping of training was applied once the generalization curve began decreasing (after 1824 iterations), whilst the learning curve continued to grow (overfitting the training data). This is also seen in Fig. 2 (right), which plots the difference between the two curves: it decreases to a minimum at epoch 1824 before inverting its trend. The final average log-likelihood over $Q_0$, $Q_2$ and $P_2$, respectively, is shown in Fig. 3 as a function of the number of GRGs in the training set (results are averaged w.r.t. the cardinality of the training set; the experiment was repeated, accordingly, for the different cardinalities).
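The early-stopping rule described above can be stated in a few lines of code. The sketch below is an illustrative assumption (the paper gives no implementation); it monitors one (training, validation) average log-likelihood pair per epoch.

```python
# Illustrative early-stopping rule: stop when the validation (generalization) log-likelihood
# stops improving, even if the training log-likelihood keeps growing. Not code from the paper.

def early_stop_epoch(loglik_history, patience=1):
    """loglik_history: list of (train_ll, val_ll) per epoch; returns the best epoch index."""
    best_epoch, best_val, waited = 0, float("-inf"), 0
    for epoch, (_train_ll, val_ll) in enumerate(loglik_history):
        if val_ll > best_val:
            best_epoch, best_val, waited = epoch, val_ll, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch
```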

Let us now call $p(\cdot)$ the pdf model estimated from $Q_0$ (i.e., using 100 GRGs). Its capability to describe the statistical properties of the corresponding distribution Q (but not those of the other, P) may be quantified by evaluating how well it explains the test samples $Q_2$ and $P_2$, according to some statistical criteria. First of all, in the spirit of the likelihood-ratio test, the overall log-likelihoods $L(Q_2)$ and $L(P_2)$ of the model given $Q_2$ and $P_2$, respectively, were computed. Let us define $l(g) = \log p(g)$ for any given graph $g$, and let $\Lambda = \log(L(Q_2)/L(P_2))$ be the (log-)likelihood ratio. Table 1 reports the statistics.

Table 1. Log-likelihoods and likelihood-ratio ($\Lambda$) test of the estimated pdf model, where $L(Q_2) = \sum_{g \in Q_2} l(g)$ and $L(P_2) = \sum_{g \in P_2} l(g)$.

Roughly speaking, the model is highly likely to express the probabilistic law underlying Q (but not P), as sought. The value of $\Lambda$ (i.e., a likelihood ratio $\gg 1$) confirms the high statistical significance of the test. These values express global statistics. Let us now evaluate the distribution of the individual log-likelihoods yielded by $p(\cdot)$ over each graph in the test samples $Q_2$ and $P_2$ (100 values for each subsample) from an analytical point of view. To this end, the Kolmogorov-Smirnov (KS) test is a popular choice for the evaluation of pdf models. Two (independent) null hypotheses were formed, namely: (1) the distribution of individual log-likelihoods yielded by $p(\cdot)$ when applied to $Q_2$ coincides with the analogous distribution yielded by the same model on $P_2$; (2) the distributions of individual log-likelihoods evaluated via $p(\cdot)$ on the samples $Q_0$ and $Q_2$ do not coincide. The KS test pointed out that both null hypotheses are rejected at a significance level $\alpha$ of 0.001 (confidence 99.9%). That is, the model explains well the distribution of independent samples drawn from Q, but is highly unlikely to explain GRGs having a different underlying distribution.
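A two-sample KS check of this kind on the per-graph log-likelihoods can be run as sketched below; the variable names are illustrative assumptions, and `scipy.stats.ks_2samp` is used as an off-the-shelf two-sample KS test.

```python
# Illustrative two-sample Kolmogorov-Smirnov check on per-graph log-likelihoods l(g) = log p(g).
# loglik_Q2 and loglik_P2 are assumed sequences of individual log-likelihood values
# (100 each in the experiment above); this is not code from the paper.
from scipy.stats import ks_2samp

def same_distribution_test(loglik_Q2, loglik_P2, alpha=0.001):
    """Return (KS statistic, p-value, reject?) for the null hypothesis of equal distributions."""
    statistic, p_value = ks_2samp(loglik_Q2, loglik_P2)
    return statistic, p_value, p_value < alpha
```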

4 Conclusion and On-Going Work

This paper was a first attempt to introduce pdf estimation over graphical domains. It gave a formal notion of GRG, and proposed a combined connectionist model with joint, constrained gradient-ascent optimization of the parameters under the ML criterion. The model was evaluated in terms of statistical tests (likelihood ratio, KS) on synthetic distributions of GRGs. On-going work focuses on applications to real-world tasks (e.g., classification and clustering of relational data).

References

1. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, October 1999.
2. R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, N.Y., 1973.
3. P. Erdös and A. Rényi. On random graphs. Publ. Math. Debrecen, 6, 1959.
4. B. Hammer, A. Micheli, and A. Sperduti. Universal approximation capability of cascade correlation for structures. Neural Computation, 17(5), 2005.
5. B. Hammer, A. Micheli, A. Sperduti, and M. Strickert. Recursive self-organizing network models. Neural Networks, 17(8-9), 2004.
6. G.J. McLachlan and K.E. Basford, editors. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, USA, 1988.
7. A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3), May 1997.
8. E. Trentin and M. Gori. Robust combination of neural networks and hidden Markov models for speech recognition. IEEE Transactions on Neural Networks, 14(6), 2003.
