A Maximum-likelihood connectionist model for unsupervised learning over graphical domains


Edmondo Trentin and Leonardo Rigutini
DII - Università di Siena, V. Roma, 56 Siena (Italy)

Abstract. Supervised relational learning over labeled graphs, e.g. via recursive neural nets, has received considerable attention from the connectionist community. Surprisingly, with the exception of recursive self-organizing maps, unsupervised paradigms have been far less investigated. In particular, no algorithms for density estimation over graphs are found in the literature. This paper first introduces a formal notion of probability density function (pdf) over graphical spaces. It then proposes a maximum-likelihood pdf estimation technique, relying on the joint optimization of a recursive encoding network and a constrained radial basis function (RBF)-like net. Preliminary experiments on synthetically generated samples of labeled graphs are analyzed and tested statistically.

Key words: Density estimation, unsupervised relational learning, recursive network

1 Introduction

Two major instances of unsupervised learning have long been considered in statistical pattern recognition, namely the estimation of probability density functions (pdf), and clustering algorithms [2]. An approximate borderline between the two setups can be traced by saying that the former focuses on the probabilistic properties of the data sample, whilst the latter is rather topology-oriented, in the sense that it concentrates on certain topological properties (e.g., distance measures among patterns and/or centroids). Several unsupervised training algorithms for neural networks were also introduced (e.g., competitive neural nets and self-organizing maps). Most of these neural networks are rooted in the topological framework (i.e., clustering, topologically consistent mappings, etc.), although a few exceptions aimed at pdf estimation can be found [8].
It is rather surprising that, despite the amount of work accomplished in the community on relational and graphical learners in the last few decades, only limited attention has been paid to unsupervised relational learning (a notable exception, in the topological framework, is [5]), and to pdf estimation in the first place. This is even more surprising if we consider that the original motivations for undertaking the study of unsupervised algorithms are often strong in graphical domains (such as the World Wide Web, or biological and chemical data which have a natural representation in terms of variable-size labeled graphs), where amounts of unlabeled samples are

available. These motivations include: (i) the need for a compact description of the overall distribution of a sample; (ii) the need for a measure of the likelihood of a certain model given a graphical structure to be assigned to a certain group ("cluster"); (iii) the need for techniques that can deal with large amounts of unlabeled data in order to ease the design of semi-supervised classifiers, or to facilitate the adaptation of previously trained machines to new data or new environmental conditions.

This paper is a first attempt to introduce a model for the estimation of pdfs over graphical domains. It exploits the encoding capabilities of recursive neural nets (RNN) [7], combined with a constrained radial basis function (RBF)-like network. A gradient-ascent, maximum-likelihood (ML) training algorithm is proposed, which jointly optimizes the encoding network [7] and the RBF in order to obtain the pdf estimate from an unsupervised sample of graphs. Constraints are introduced in both neural nets such that the resulting estimate can be interpreted as a pdf (non-negative, unit integral over its definition domain), and such that the encoding of the graphs does not lead to singular solutions.

In order to introduce the model, it is necessary to give a formal definition of a pdf over graphs in the first place. We do so via the notion of generalized random graph (an extension of traditional random graphs [3]). Let $\mathcal{V}$ be a given discrete or continuous-valued set (vertex universe), and let $\Omega$ be any given sample space. We define a generalized random graph (GRG) over $\mathcal{V}$ and $\Omega$ as a function $G : \Omega \to \{(V, E) \mid V \subseteq \mathcal{V},\ E \subseteq V \times V\}$ (note that labels in the form of real-valued vectors associated with vertices and/or edges are easily encapsulated within the definition). Let then $\mathcal{G} = \{g \mid g = (V, E),\ V \subseteq \mathcal{V},\ E \subseteq V \times V\}$. We define a probability density function (pdf) for GRGs over $\mathcal{V}$ as a function $p : \mathcal{G} \to \mathbb{R}$ such that (1) $p(g) \geq 0,\ \forall g \in \mathcal{G}$, and (2) $\int_{\mathcal{G}} p(g)\,dg = 1$.
Note that the integral in (2) has a mathematical meaning, since the (Lebesgue-)measurability of the space of graphs defined over measurable domains (and with measurable labels) like countable sets or real vectors is shown in [4]. The extension of notions from traditional probability theory (conditional pdf, joint pdf, statistical independence) to GRGs is a straightforward exercise. Now, suppose that a sample $T = \{g_1, \ldots, g_n \mid g_i \in \mathcal{G},\ i = 1, \ldots, n\}$ of $n$ graphs has been collected. The pdf estimation problem faced in this paper can be stated as follows: assuming that all the GRGs in $T$ have been independently drawn from a certain pdf $p(g)$, how can the dataset be used in order to estimate a reasonable model of $p(g)$?

2 A plausible neural answer to the question

We assume that $p(g)$ is a function having a fixed and known parametric form, being determined uniquely by the specific value of a set of parameters $\theta = (\theta_1, \ldots, \theta_k)$. To render this dependency on $\theta$ more explicit, we modify our notation slightly by writing $p(g)$ as $p(g \mid \theta)$. Given the assumption, the question posed at the end of the previous Section can be restated as: how can we use the sample $T$ in order to obtain estimates for $\theta$ that are meaningful according to a certain optimality criterion? A sound answer

to the question may be found in the adoption of the ML criterion, along with a suitable method for maximizing the likelihood $p(T \mid \theta)$ of the parameters given the sample. Since $g_1, \ldots, g_n$ are assumed to be i.i.d., the likelihood can be written as $p(T \mid \theta) = \prod_{i=1}^{n} p(g_i \mid \theta)$. Before attempting the maximization of the likelihood, it is necessary to specify a well-defined form for the pdf $p(g \mid \theta)$. Let us assume the existence of an integer $d$ and of two functions, $\phi : \mathcal{G} \to \mathbb{R}^d$ and $\hat{p} : \mathbb{R}^d \to \mathbb{R}$, s.t. $p(g \mid \theta)$ can be decomposed as:

\[
p(g \mid \theta) = \hat{p}(\phi(g)). \tag{1}
\]

It is seen that there exist infinitely many choices for $\phi(\cdot)$ and $\hat{p}(\cdot)$ that satisfy Eq. (1), the most trivial being $\phi(g) = p(g \mid \theta)$, $\hat{p}(x) = x$. We call $\phi(\cdot)$ the encoding, while $\hat{p}(\cdot)$ is simply referred to as the likelihood. Again, we assume parametric forms $\phi(g \mid \theta_\phi)$ and $\hat{p}(x \mid \theta_{\hat{p}})$ for the encoding and for the likelihood, respectively, and we set $\theta = (\theta_\phi, \theta_{\hat{p}})$. The ML estimation of $\theta$ given $T$ now requires finding parameter vectors $\theta_\phi$ and $\theta_{\hat{p}}$ that maximize the quantity

\[
p(T \mid \theta_\phi, \theta_{\hat{p}}) = \prod_{i=1}^{n} \hat{p}(\phi(g_i \mid \theta_\phi) \mid \theta_{\hat{p}}). \tag{2}
\]

We propose a two-block connectionist/statistical model for $p(g \mid \theta)$ as follows. The function $\phi(g \mid \theta_\phi)$ is realized via an encoding network, suitable to map directed acyclic graphs (DAG) $g$ into real vectors $x$, as described in [7] for supervised training of recursive neural networks (RNN) over structured domains. The weights of the encoding network become the parameters $\theta_\phi$. A radial basis function (RBF)-like neural net is then used to model the likelihood function $\hat{p}(x \mid \theta_{\hat{p}})$, where $\theta_{\hat{p}}$ are the parameters of the RBF. In order to ensure that a pdf is obtained, specific constraints have to be placed on the nature of the RBF kernels, as well as on the hidden-to-output connection weights.
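The two-block decomposition of Eqs. (1)-(2) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the recursive encoder below is a deliberately tiny, hypothetical stand-in for the encoding network of [7], and all function names, toy graphs and parameter values are assumptions made for illustration. The likelihood block uses sigmoid-normalised mixing weights and diagonal Gaussian kernels, which is one way to meet the constraints mentioned above (non-negative output, unit integral).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode(graph, root, w_label, w_child):
    """Toy recursive encoder (hypothetical stand-in for the encoding net).

    `graph` maps each vertex to (label, list_of_children). The state of a
    vertex is tanh(w_label * label + w_child * sum of child states),
    computed independently for each of the d components.
    """
    label, children = graph[root]
    child_states = [encode(graph, c, w_label, w_child) for c in children]
    state = []
    for j in range(len(w_label)):
        s = w_label[j] * label + w_child[j] * sum(cs[j] for cs in child_states)
        state.append(math.tanh(s))
    return state

def mixing_weights(gammas):
    """Map unconstrained gammas to weights in [0, 1] that sum to one."""
    s = [sigmoid(g) for g in gammas]
    total = sum(s)
    return [v / total for v in s]

def diag_gaussian(x, mu, sigma):
    """Product of d univariate Normal densities (diagonal covariance)."""
    k = 1.0
    for xj, mj, sj in zip(x, mu, sigma):
        k *= math.exp(-0.5 * ((xj - mj) / sj) ** 2) / math.sqrt(2.0 * math.pi * sj ** 2)
    return k

def p_hat(x, gammas, mus, sigmas):
    """Likelihood block: constrained mixture of Gaussian kernels."""
    c = mixing_weights(gammas)
    return sum(ci * diag_gaussian(x, mu, sg) for ci, mu, sg in zip(c, mus, sigmas))

def log_likelihood(sample, theta_phi, theta_p):
    """log p(T | theta) = sum_i log p_hat(phi(g_i | theta_phi) | theta_p)."""
    w_label, w_child = theta_phi
    gammas, mus, sigmas = theta_p
    total = 0.0
    for graph, root in sample:
        x = encode(graph, root, w_label, w_child)
        total += math.log(p_hat(x, gammas, mus, sigmas))
    return total

# Two toy labeled DAGs, each given as (graph, root vertex).
g1 = {"a": (5.1, ["b", "c"]), "b": (4.7, []), "c": (5.3, [])}
g2 = {"r": (4.9, ["s"]), "s": (5.2, [])}
sample = [(g1, "a"), (g2, "r")]

theta_phi = ([0.1, -0.2], [0.5, 0.3])                      # encoder weights, d = 2
theta_p = ([0.0, 1.0], [[0.5, -0.5], [0.0, 0.0]], [[1.0, 1.0], [0.5, 0.5]])
ll = log_likelihood(sample, theta_phi, theta_p)
print("log-likelihood of the toy sample:", ll)
```

Working with the logarithm of Eq. (2) turns the product over graphs into a sum, which is numerically safer; the maximizer is the same because the logarithm is monotonic.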
It is crucial to underline that we are not going to compute a rough encoding of graphs via a standard RNN, followed by a separate, standard ML estimation of a mixture of Normal densities defined over the encoded space. On the contrary, we propose a joint optimization of all model parameters, $\theta_\phi$ and $\theta_{\hat{p}}$, to increase the overall likelihood. In other words, the encoding and the likelihood are jointly optimized to maximize $p(T \mid \theta_\phi, \theta_{\hat{p}})$. Occasionally, the ML principle for this general class of mixtures may lead to singular solutions. This fact is well known from classical statistical theory; but, as pointed out in [2] (sec , page 199), it is an empirical fact that meaningful solutions can still be obtained.

A hill-climbing algorithm to carry out ML estimation of the parameters $\theta$ can be obtained as an instance of the gradient-ascent method over $p(T \mid \theta_\phi, \theta_{\hat{p}})$ in two steps: (i) initialization, i.e., start with some initial (e.g. random) assignment of values to the model parameters $\theta$; (ii) gradient ascent, i.e., repeatedly apply a learning rule of the form $\Delta\theta = \eta \frac{\partial}{\partial\theta}\left\{\prod_{i=1}^{n} \hat{p}(\phi(g_i \mid \theta_\phi) \mid \theta_{\hat{p}})\right\}$ with $\eta \in \mathbb{R}^+$. This is a batch learning setup. In practice, neural network learning may be simplified, and even improved, with the adoption of an on-line training scheme that prescribes $\Delta\theta = \eta \frac{\partial}{\partial\theta}\left\{\hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})\right\}$ upon presentation of each individual

training example $g$. Three distinct families of adaptive parameters $\theta$ have to be considered:

(1) Mixing parameters $c_1, \ldots, c_n$, i.e. the hidden-to-output weights of the RBF network. Constraints have to be placed on these parameters during the ML estimation process, in order to ensure that they are in $[0, 1]$ and that they sum to one. A simple way to satisfy the requirements is to introduce $n$ hidden parameters $\gamma_1, \ldots, \gamma_n$, which are unconstrained, and to set

\[
c_i = \frac{\varsigma(\gamma_i)}{\sum_{j=1}^{n} \varsigma(\gamma_j)}, \quad i = 1, \ldots, n \tag{3}
\]

where $\varsigma(x) = 1/(1 + e^{-x})$. Each $\gamma_i$ is then treated as an unknown parameter $\theta$ to be estimated via ML.

(2) The $d$-dimensional mean vector $\mu_i$ and the $d \times d$ covariance matrix $\Sigma_i$ for each of the Gaussian kernels $K_i(x) = N(x; \mu_i, \Sigma_i)$, $i = 1, \ldots, n$ of the RBF, where $N(x; \mu_i, \Sigma_i)$ denotes a multivariate Normal pdf having mean vector $\mu_i$, covariance matrix $\Sigma_i$, and evaluated over the random vector $x$. A common (yet effective) simplification is to consider diagonal covariance matrices, i.e. independence among the components of the input vector $x$. This assumption leads to the following three major consequences: (i) modeling properties are not affected significantly, according to [6]; (ii) generalization capabilities of the overall model may turn out to be improved, since the number of free parameters is reduced; (iii) the $i$-th multivariate kernel $K_i$ may be expressed in the form of a product of $d$ univariate Normal densities as:

\[
K_i(x) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left\{-\frac{1}{2}\left(\frac{x_j - \mu_{ij}}{\sigma_{ij}}\right)^2\right\} \tag{4}
\]

i.e., the free parameters to be estimated are the means $\mu_{ij}$ and the standard deviations $\sigma_{ij}$, for each kernel $i = 1, \ldots, n$ and for each component $j = 1, \ldots, d$ of the input space.

(3) The weights $U$ of the encoding network. The learning rule has to rely on partial derivatives of the likelihood which are backpropagated down to the RBF inputs and, in turn, through the encoding net. In order to discourage singular solutions, e.g.
the tendency to map all the input graphs onto a single point in the encoded space by developing close-to-zero weights, the learning rule for $U$ shall include an additional regularization term which treats the network weights as random variables distributed according to a pdf whose modes are far from zero. The likelihood of the network weights is then taken into account in the optimization procedure.

In the following, we derive explicit formulations for $\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial \theta}$ for each of the three families of free parameters $\theta$ within the proposed model. As regards a generic mixing parameter $c_i$, $i = 1, \ldots, n$, from Eq. (3), and since $p(g) = \sum_{k=1}^{n} c_k K_k(x)$, we have

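\[
\begin{aligned}
\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial \gamma_i}
&= \sum_{j=1}^{n} \frac{\partial p(g)}{\partial c_j}\,\frac{\partial c_j}{\partial \gamma_i} \\
&= \sum_{j=1}^{n} K_j(x)\,\frac{\partial}{\partial \gamma_i}\left(\frac{\varsigma(\gamma_j)}{\sum_{k=1}^{n} \varsigma(\gamma_k)}\right) \\
&= K_i(x)\left\{\frac{\varsigma'(\gamma_i)\sum_{k} \varsigma(\gamma_k) - \varsigma(\gamma_i)\varsigma'(\gamma_i)}{\left[\sum_{k} \varsigma(\gamma_k)\right]^2}\right\} + \sum_{j \neq i} K_j(x)\left\{\frac{-\varsigma(\gamma_j)\varsigma'(\gamma_i)}{\left[\sum_{k} \varsigma(\gamma_k)\right]^2}\right\} \\
&= K_i(x)\,\frac{\varsigma'(\gamma_i)}{\sum_{k} \varsigma(\gamma_k)} - \sum_{j} K_j(x)\,\frac{\varsigma(\gamma_j)\varsigma'(\gamma_i)}{\left[\sum_{k} \varsigma(\gamma_k)\right]^2} \\
&= K_i(x)\,\frac{\varsigma'(\gamma_i)}{\sum_{k} \varsigma(\gamma_k)} - \frac{\varsigma'(\gamma_i)}{\sum_{k} \varsigma(\gamma_k)} \sum_{j} c_j K_j(x) \\
&= \frac{\varsigma'(\gamma_i)}{\sum_{k} \varsigma(\gamma_k)}\,\{K_i(x) - p(g)\}.
\end{aligned}
\tag{5}
\]

For the means $\mu_{ij}$ and the standard deviations $\sigma_{ij}$ we proceed as follows. Let $\theta_{ij}$ denote the free parameter, i.e. $\mu_{ij}$ or $\sigma_{ij}$, to be estimated. It is seen that:

\[
\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial \theta_{ij}} = c_i\,\frac{\partial K_i(x)}{\partial \theta_{ij}} \tag{6}
\]

where the calculation of $\frac{\partial K_i(x)}{\partial \theta_{ij}}$ can be accomplished as follows. First of all, let us observe that for any real-valued, positive, differentiable function $f(\cdot)$ the following property holds true: $\frac{\partial f(\cdot)}{\partial x} = f(\cdot)\,\frac{\partial \log[f(\cdot)]}{\partial x}$. As a consequence, from Eq. (4) we can write

\[
\begin{aligned}
\frac{\partial K_i(x)}{\partial \theta_{ij}}
&= K_i(x)\,\frac{\partial \log K_i(x)}{\partial \theta_{ij}} \\
&= K_i(x)\,\frac{\partial}{\partial \theta_{ij}}\left\{-\sum_{k=1}^{d}\left[\frac{1}{2}\log(2\pi\sigma_{ik}^2) + \frac{1}{2}\left(\frac{x_k - \mu_{ik}}{\sigma_{ik}}\right)^2\right]\right\}.
\end{aligned}
\tag{7}
\]

For the means, i.e. $\theta_{ij} = \mu_{ij}$, Eq. (7) yields

\[
\frac{\partial K_i(x)}{\partial \mu_{ij}} = K_i(x)\,\frac{x_j - \mu_{ij}}{\sigma_{ij}^2}. \tag{8}
\]

For the standard deviations, i.e. $\theta_{ij} = \sigma_{ij}$, Eq. (7) takes the form:

\[
\begin{aligned}
\frac{\partial K_i(x)}{\partial \sigma_{ij}}
&= K_i(x)\,\frac{\partial}{\partial \sigma_{ij}}\left\{-\frac{1}{2}\log(2\pi\sigma_{ij}^2) - \frac{1}{2}\left(\frac{x_j - \mu_{ij}}{\sigma_{ij}}\right)^2\right\} \\
&= \frac{K_i(x)}{\sigma_{ij}}\left\{\left(\frac{x_j - \mu_{ij}}{\sigma_{ij}}\right)^2 - 1\right\}.
\end{aligned}
\tag{9}
\]

Finally, let us consider the connection weights $U = \{v_1, \ldots, v_s\}$ within the encoding network. For a generic $v \in U$, application of the chain rule yields: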

\[
\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial v} = \frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial y}\,\frac{\partial y}{\partial v} \tag{10}
\]

where $y$ is the output from the unit (in the encoding net) which is fed from connection $v$. The quantity $\frac{\partial y}{\partial v}$ can be easily computed by taking the partial derivative of the activation function associated with the unit itself, as usual. As regards the quantity $\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial y}$, we proceed as follows. First of all, let us assume that $v$ feeds the output layer, i.e. it connects a certain hidden unit with the $j$-th output unit of the encoding net. In this case, we have $y = x_j$. It is easy to see that:

\[
\begin{aligned}
\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial x_j}
&= \frac{\partial}{\partial x_j} \sum_{i=1}^{n} c_i K_i(x) \\
&= \sum_{i=1}^{n} c_i K_i(x)\,\frac{\partial \log K_i(x)}{\partial x_j} \\
&= \sum_{i=1}^{n} c_i K_i(x)\,\frac{\partial}{\partial x_j}\left\{-\sum_{k=1}^{d}\left[\frac{1}{2}\log(2\pi\sigma_{ik}^2) + \frac{1}{2}\left(\frac{x_k w_{ik} - \mu_{ik}}{\sigma_{ik}}\right)^2\right]\right\} \\
&= \sum_{i=1}^{n} c_i K_i(x)\,\frac{\partial}{\partial x_j}\left\{-\frac{1}{2}\left(\frac{x_j w_{ij} - \mu_{ij}}{\sigma_{ij}}\right)^2\right\} \\
&= -\sum_{i=1}^{n} c_i\,\frac{K_i(x)}{\sigma_{ij}^2}\,(x_j w_{ij} - \mu_{ij})\,w_{ij}.
\end{aligned}
\tag{11}
\]

On the contrary, whenever $v$ is a hidden weight, the quantity $\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial y}$ can be obtained by applying the usual backpropagation through structures (BPTS) algorithm [7], once the deltas to be backpropagated have been initialized at the output layer via Eq. (11).

Unconstrained ML training of the weights of the encoding net may lead to singular solutions. To tackle the problem, we assume that the weights are random variables, independently drawn from a certain probability distribution $p(U) = \prod_{i=1}^{s} p(v_i)$. The pdf $p(v)$ is defined so as to encourage non-degenerate solutions, and the new criterion function $C$ to be maximized during gradient-ascent training is in the form of a joint pdf, namely:

\[
C = \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})\,p(U). \tag{12}
\]

Maximization of such a criterion results in weight values that yield a high likelihood of the sample, and that are highly likely themselves. If the weights $U$ are randomly initialized in a uniform manner over the interval $(-\rho, \rho)$, an effective choice for $p(v)$ is a mixture of two Gaussian components in the form

\[
p(v) = \frac{1}{2} N\left(v; -\frac{\rho}{2}, \sigma^2\right) + \frac{1}{2} N\left(v; \frac{\rho}{2}, \sigma^2\right). \tag{13}
\]

Whenever $\sigma^2$ is chosen to be sufficiently small, two benefits are expected as training proceeds: (1) weights are encouraged to move toward non-degenerate solutions in the weight space; (2) a form of regularization of the learning process emerges, since complex solutions (i.e., weights too large in size) are discouraged.
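As a hedged illustration of the weight prior of Eq. (13), the following Python sketch evaluates the two-component mixture and its derivative with respect to a single weight $v$. It assumes that the two modes sit at $-\rho/2$ and $+\rho/2$ (the signs are not legible in every copy of the text), and the function names and the values of $\rho$ and $\sigma^2$ are hypothetical choices made only for this example.

```python
import math

def normal_pdf(v, mean, var):
    """Univariate Normal density N(v; mean, var)."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def weight_prior(v, rho, var):
    """Mixture prior on one weight, with modes assumed at -rho/2 and +rho/2."""
    return 0.5 * normal_pdf(v, -rho / 2.0, var) + 0.5 * normal_pdf(v, rho / 2.0, var)

def weight_prior_grad(v, rho, var):
    """Analytic derivative dp(v)/dv of the mixture prior."""
    left = normal_pdf(v, -rho / 2.0, var) * (-(v + rho / 2.0) / var)
    right = normal_pdf(v, rho / 2.0, var) * (-(v - rho / 2.0) / var)
    return 0.5 * left + 0.5 * right

def log_prior_grad(v, rho, var):
    """d log p(v) / dv, i.e. the prior's pull on the weight v."""
    return weight_prior_grad(v, rho, var) / weight_prior(v, rho, var)

rho, var = 1.0, 0.01  # illustrative values only
for v in (0.0, 0.1, 0.45, 0.5, 0.7):
    print(v, weight_prior(v, rho, var), log_prior_grad(v, rho, var))
```

With a small variance, the log-prior gradient is positive for a small positive weight (pushing it away from zero, toward $\rho/2$) and negative for a weight larger than $\rho/2$ (pulling it back), which matches the two benefits listed above.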

Given a generic weight $v \in U$, gradient ascent requires computing the partial derivative of the proposed criterion $C$ w.r.t. $v$, i.e.,

\[
\frac{\partial C}{\partial v} = p(U)\,\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial v} + \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})\,\frac{\partial p(U)}{\partial v}. \tag{14}
\]

The quantity $\frac{\partial \hat{p}(\phi(g \mid \theta_\phi) \mid \theta_{\hat{p}})}{\partial v}$ is computed as above, while the term $\frac{\partial p(U)}{\partial v}$ in Eq. (14) can be rewritten as follows:

\[
\frac{\partial p(U)}{\partial v} = p(U)\,\frac{\partial \log p(U)}{\partial v} = p(U)\,\frac{\partial}{\partial v}\sum_{i=1}^{s} \log p(v_i) = \frac{p(U)}{p(v)}\,\frac{\partial p(v)}{\partial v} \tag{15}
\]

which is computed in a straightforward manner by taking the derivative of Eq. (13) w.r.t. $v$.

3 Demonstration

Since there are no other approaches to pdf estimation over graphical domains, comparisons are impracticable. Consequently, we analyze the behavior of the model and we evaluate it via statistical tests on a synthetic task. Two different samples of GRGs were synthesized under controlled probabilistic conditions. A first sample of 300 independent DAGs was randomly generated. These DAGs had a random number of vertices (between 5 and 20), a uniform distribution of edge connectivity (as in the classic Erdős and Rényi model [3]), and a real-valued label for vertices drawn from the Laplacian pdf $\frac{1}{2\theta}\exp\left(-\frac{|x - \mu|}{\theta}\right)$ with location $\mu = 5.0$ and smoothness $\theta = \frac{1}{2}$. Let us call $Q$ this collection of GRGs. $Q$ was partitioned into three equally-sized subsamples, $Q_0$ (training set), $Q_1$ (validation set) and $Q_2$ (test set). Another sample $P$ of 200 independent DAGs was likewise obtained, each DAG having: a random number of vertices (between 5 and 20), and a power-law preferential-attachment random connectivity (as in the Barabási and Albert model [1]) according to the value of the node labels (the relevance, or "authority"), which were independently drawn from an exponential distribution $\lambda e^{-\lambda x}$, with inverse scale $\lambda = \frac{1}{3}$. The collection $P$ was likewise split into equally-sized subsamples, $P_1$ (reference set) and $P_2$ (reference test set).

Fig. 1. Learning, generalization and reference curves.
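The generation of the first sample $Q$ described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the edge probability `edge_prob` is hypothetical (the text does not report it), acyclicity is obtained by only allowing edges from a lower-indexed to a higher-indexed vertex, and the function names and random seed are choices made for this example.

```python
import random

def sample_laplace(rng, mu, theta):
    """Draw from the Laplacian pdf (1 / (2*theta)) * exp(-|x - mu| / theta).

    Uses the fact that the difference of two independent Exp(1) variables
    follows a standard Laplace distribution.
    """
    return mu + theta * (rng.expovariate(1.0) - rng.expovariate(1.0))

def sample_dag(rng, edge_prob=0.3, min_nodes=5, max_nodes=20, mu=5.0, theta=0.5):
    """Return (labels, edges) for one random labeled DAG.

    Each ordered pair (i, j) with i < j becomes an edge independently with
    probability edge_prob (an assumed value), so no cycle can arise.
    """
    n = rng.randint(min_nodes, max_nodes)
    labels = [sample_laplace(rng, mu, theta) for _ in range(n)]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < edge_prob]
    return labels, edges

rng = random.Random(0)  # fixed seed, for reproducibility of the sketch
Q = [sample_dag(rng) for _ in range(300)]
Q0, Q1, Q2 = Q[:100], Q[100:200], Q[200:]  # training, validation, test
print(len(Q0), len(Q1), len(Q2))
```

The sample $P$ would need a preferential-attachment generator driven by exponentially distributed labels instead; that part is not sketched here.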

Fig. 2. Learning and generalization curves magnified (left). Difference between learning and generalization curves (right).

Fig. 3. Final average log-likelihoods as a function of the training set size.

The pdf underlying the distribution of GRGs in $Q$ was then estimated, relying on the training subsample $Q_0$. An encoding net having 10 sigmoid hidden units and 2 linear encoding neurons was used, while the RBF used 4 Gaussian kernels. All the parameters were initialized at random. Different learning rates (chosen, along with the neural architectures, by evaluating the variations of the likelihood on training and validation sets) were applied for the different families of parameters, namely $\eta_\gamma = 1.0 \times 10^{-4}$, $\eta_\mu = 5.0 \times 10^{-5}$, $\eta_\sigma = 5.0 \times 10^{-6}$, $\eta_v = 1.0 \times 10^{-7}$ (the notation implicitly refers to the symbols used in Section 2). Fig. 1 shows the learning curve (log-likelihood on $Q_0$), the generalization curve (log-likelihood on $Q_1$), and the reference curve (log-likelihood on $P_1$). It is seen that the criterion function is increased during training, as expected. All three curves exhibit a steep growth during early training, due to the fact that the RBF kernels quickly move from their random initial position toward the region in $\mathbb{R}^2$ where all the graphs are initially randomly encoded. Learning and generalization curves continue to grow smoothly, getting closer to each other. This fact, magnified in Fig. 2 (left), indicates that the estimated pdf model explains (i.e., has high likelihood) equally well independent samples of GRGs drawn from the same distribution. On the contrary, the reference curve is constantly and significantly lower than the others, and starts dropping early (i.e., the model does not cover samples drawn from a different pdf). This is due to the constrained training of the RBF (which is forced to have a unit integral over its definition domain, i.e.
it peaks around the GRGs in $Q$ at the expense of those in $P$), and to the regularized training of the encoding network (whose weights are discouraged from moving toward solutions that could map all the GRGs onto a compact cluster, regardless of their original distribution). Early stopping of training was accomplished once the generalization curve began decreasing (after 1824 iterations), whilst the learning curve continued to grow (overfitting the training data). This is also seen in Fig. 2 (right), which plots the difference between the two curves: it decreases to a minimum at epoch 1824 before inverting its trend. The final average log-likelihood over $Q_0$, $Q_2$ and $P_2$, respectively, is shown in Fig. 3 as a function of the number of GRGs in the training set (results are averaged w.r.t. the cardinality of the training set; the experiment was repeated, accordingly, for the different cardinalities).

Let us now call $p(\cdot)$ the pdf model estimated from $Q_0$ (i.e., using 100 GRGs). Its capability to describe the statistical properties of the corresponding distribution $Q$ (but not those of the other, $P$) may be quantified by evaluating how well it explains the test samples $Q_2$ and $P_2$, according to some statistical criteria. First of all, in the spirit of the likelihood-ratio test, the overall log-likelihoods $L(Q_2)$ and $L(P_2)$ of the model given $Q_2$ and $P_2$, respectively, were computed. Let

us define $l(g) = \log p(g)$ for any given graph $g$, and let $\Lambda = \log(L(Q_2)/L(P_2))$ be the (log)likelihood-ratio. Table 1 reports the statistics.

Table 1. Log-likelihoods and likelihood-ratio ($\Lambda$) test of the estimated pdf model.
$L(Q_2) = \sum_{g \in Q_2} l(g)$ | $L(P_2) = \sum_{g \in P_2} l(g)$ | $\Lambda$

Roughly speaking, the model is highly likely to express the probabilistic law underlying $Q$ (but not $P$), as sought. The value of $\Lambda$ (i.e., the likelihood-ratio is $\gg 1$) confirms the high statistical significance of the test. These values express global statistics. Let us now evaluate the distribution of individual log-likelihoods yielded by $p(\cdot)$ over each graph in the test samples $Q_2$ and $P_2$ (100 values for each subsample) from an analytical point of view. To this end, the Kolmogorov-Smirnov (KS) test is a popular choice for the evaluation of pdf models. Two (independent) null-hypotheses were formed, namely: (1) the distribution of individual log-likelihoods yielded by $p(\cdot)$ when applied to $Q_2$ coincides with the analogous distribution yielded by the same model on $P_2$; (2) the distributions of individual log-likelihoods evaluated via $p(\cdot)$ on the samples $Q_0$ and $Q_2$ do not coincide. The KS test pointed out that both null-hypotheses are rejected at a level $\alpha$ of at least (confidence 99.9%). That is, the model explains well the distribution of independent samples drawn from $Q$, but is highly unlikely to explain GRGs having a different underlying distribution.

4 Conclusion and On-Going Work

This paper was a first attempt to introduce pdf estimation over graphical domains. It gave a formal notion of GRG, and proposed a combined connectionist model with joint, gradient-ascent constrained optimization of the parameters over the ML criterion. The model was evaluated in terms of statistical tests (likelihood-ratio, KS) on synthetic distributions of GRGs.
On-going work focuses on applications to real-world tasks (e.g., classification and clustering of relational data).

References

1. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286: , October.
2. R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, N.Y.
3. P. Erdős and A. Rényi. On random graphs. Publ. Math. Debrecen, 6: .
4. B. Hammer, A. Micheli, and A. Sperduti. Universal approximation capability of cascade correlation for structures. Neural Computation, 17(5): .
5. B. Hammer, A. Micheli, A. Sperduti, and M. Strickert. Recursive self-organizing network models. Neural Networks, 17(8-9): .
6. G.J. McLachlan and K.E. Basford, editors. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, USA, 1988.

7. A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3): , May.
8. E. Trentin and M. Gori. Robust combination of neural networks and hidden Markov models for speech recognition. IEEE Trans. on Neural Networks, 14(6), 2003.
