When does a mixture of products contain a product of mixtures?
1  When does a mixture of products contain a product of mixtures?
Jason Morton (Penn State)
May 19, 2014, Algebraic Statistics 2014, IIT
Joint work with Guido Montúfar
Supported by DARPA FA
2  Outline
1. Motivation and Introduction: Definitions; Problems
2. Inference functions and regions
3. Geometric Perspectives: Perfect reconstructibility; Modes; Zonosets and hyperplane arrangements; Linear threshold functions; Implications among properties
3  Restricted Boltzmann machines: building block of deep learning
Restricted Boltzmann machines are the canonical layer in deep learning models. A deep learning model is (approximately) a stack of RBMs. Deep learning models are machines for printing money:
Geoff Hinton (Google); Yann LeCun and Rob Fergus (Facebook); Andrew Ng (Baidu); DeepMind ($500M acquisition by Google to play Space Invaders).
"Last year, the cost of a top, world-class deep learning expert was about the same as a top NFL quarterback prospect." - Peter Lee, head of Microsoft Research, 1/27/2014
4  But does it work in theory?
Dominant technology for speech recognition, image recognition, and an expanding range of applications. Deployed at huge scale at Google, Facebook, Microsoft, etc. State-of-the-art results on many tasks. But we can't really prove much about how they work.
"The theory of deep learning is a wide open field. Everything is up for the taking. Go for it." - Yann LeCun, 5/15/2014, on Reddit Machine Learning
5  Mixture of products M_{n,k}
Probability distributions on n binary variables X_1, ..., X_n.
A product distribution factors: p(x) = p(x_1) ... p(x_n).
A k-mixture of products M_{n,k}, a naïve Bayes model, is a convex combination of k product distributions.
The set of product distributions is the closure of the n-dimensional exponential family p_B(x) = (1/Z(B)) exp(B·x).
Picture: k pockets, each with n different-color biased coins.
The Zariski closure of M_{n,k} in complex projective space is the kth secant variety of the nth Segre product of P^1's.
dim(M_{n,k}) = dim(its closure) = min{nk + (k-1), 2^n - 1} = number of parameters, or the ambient dimension, except in the degenerate case (n,k) = (4,3) [Catalisano et al. 2011].
Even a full-dimensional M_{n,k} has a large complement in the simplex until k = 2^{n-1} [Montúfar 2010].
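As a sketch (not from the talk), M_{n,k} can be tabulated by brute force for small n; the function names here are illustrative:

```python
import itertools
import numpy as np

def product_distribution(b):
    # probability table of p_b(x) = exp(b.x) / Z(b) over x in {0,1}^n
    xs = np.array(list(itertools.product([0, 1], repeat=len(b))))
    p = np.exp(xs @ b)
    return p / p.sum()

def mixture_of_products(lams, B):
    # convex combination of k product distributions (rows of B)
    return sum(lam * product_distribution(b) for lam, b in zip(lams, B))

# a 2-mixture on {0,1}^2 concentrating mass near (0,0) and (1,1)
p = mixture_of_products([0.5, 0.5], np.array([[2.0, 2.0], [-2.0, -2.0]]))
```

Each component is a product distribution since exp(b.x) factors coordinatewise; the mixture itself generally does not factor.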
6  Product of mixtures of products: RBM_{n,m}
Mixture of experts: Hadamard (pointwise) product of m different M_{n,2}'s.
The RBM model with n visible and m hidden binary units is the set RBM_{n,m} of distributions on {0,1}^n which are limits of
p(v) = (1/Z(W,B,C)) sum_{h in {0,1}^m} exp(h^T W v + B^T v + C^T h),
with visible units v in {0,1}^n and hidden units h in {0,1}^m, where
W in R^{m x n} is the matrix of interaction weights,
B in R^n is the vector of visible bias weights,
C in R^m is the vector of hidden bias weights, and
Z = sum_{v,h} exp(h^T W v + B^T v + C^T h) normalizes.
Usually has the expected dimension [Cueto-Morton-Sturmfels].
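A minimal sketch of this formula (not from the talk), tabulating an RBM distribution by enumerating all hidden and visible states:

```python
import itertools
import numpy as np

def rbm_distribution(W, B, C):
    # p(v) proportional to sum_h exp(h^T W v + B^T v + C^T h)
    m, n = W.shape
    vs = np.array(list(itertools.product([0, 1], repeat=n)))
    hs = np.array(list(itertools.product([0, 1], repeat=m)))
    # unnormalized log-joint: rows index hidden states, columns visible states
    E = hs @ W @ vs.T + (vs @ B)[None, :] + (hs @ C)[:, None]
    p = np.exp(E).sum(axis=0)  # marginalize out h
    return p / p.sum()

# with all parameters zero the RBM is the uniform distribution
p = rbm_distribution(np.zeros((3, 2)), np.zeros(2), np.zeros(3))
```

This is exponential in n and m, so it is only a check of the definition, not a training procedure.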
7  How are RBMs better?
[Figure: mixture of products M_{6,k} and product of mixtures RBM_{6,4} drawn as graphical models, with parameters W, B, C labeled. Dark nodes represent hidden units, light nodes represent visible units.]
8  Problem
(1) When does the mixture of product distributions M_{n,k} contain the product of mixtures of product distributions RBM_{n,m}, and vice versa?
A result: the number of parameters of the smallest mixture of products M_{n,k} containing RBM_{n,m} grows exponentially in the number of parameters of the RBM, for any fixed ratio 0 < m/n < infinity.
Reason: the RBM can represent distributions with many more Hamming-local maxima.
The proof comes from polyhedral approximations of the sets of probability distributions representable by each model.
9  Smallest mixture of products representing an RBM
[Figure: heat map of log_2 of k(n,m) = min{k in N : M_{n,k} contains RBM_{n,m}} over the (n,m)-plane, with the lines m = n and m = (3/4)n and three labeled regions relating log_2(k) to (3/4)n, n-1, and m.]
An RBM of dimension c has hyperparameters n and m satisfying nm + n + m = c (dashed hyperbola). Fixing the dimension, the RBMs which are hardest to represent as mixtures of product distributions are those with m/n near 1.
10  Many related problems
Problem. What sets of length-n binary vectors are
(2) perfectly reconstructible by an RBM with m hidden units?
(3) the outputs of n linear threshold functions with m input bits?
(4) the modes or strong modes (Hamming-local maxima) of distributions represented by an RBM with m hidden units?
Probability distributions with many strong modes are written much more compactly as RBMs than as mixtures of products, e.g. those supported on even-parity bitstrings.
The key to understanding how these problems relate, and to proving the superiority of RBMs at modeling complex distributions, is thinking about modes and inference functions.
11  Modes and hyperplanes
Modes are described by linear inequalities of the form p(x) > p(x') and define polyhedral approximations of probability models. They are closely related to binary classification problems and to separating vertex sets of hypercubes by hyperplane arrangements, leading to problems such as:
Problem. (5) What is the smallest arrangement of hyperplanes, if one exists, that slices each edge of a hypercube a given number of times?
Goal: provide partial answers to these five problems by explaining the connections between them.
12  Outline
1. Motivation and Introduction: Definitions; Problems
2. Inference functions and regions
3. Geometric Perspectives: Perfect reconstructibility; Modes; Zonosets and hyperplane arrangements; Linear threshold functions; Implications among properties
13  Inference functions of RBMs
The inference function of a probability model p_θ(v, h) with parameter θ in Ω explains each value v by the most likely value of h:
up_θ: v -> argmax_h p_θ(h | v),
up_θ: {0,1}^n -> {0,1}^m, extending to up_θ: R^n -> {0,1}^m.
Each of the m RBM hidden units linearly divides the input space R^n according to its preferred state given the input.
14  Inference regions and distributed representations
up_θ: {0,1}^n -> {0,1}^m, up_θ: R^n -> {0,1}^m.
Together, the m hidden units partition the input space into inference regions where different joint hidden states are most likely. This defines a distributed encoding or distributed representation [Bengio 2009] of the input vector into the hidden nodes. Such a distributed representation is speculated to be a key to the model's efficacy.
15  Inference regions of mixture and RBM, for some θ
[Figure: inference regions of M_{2,4}(θ) and RBM_{2,3}(θ') in the input plane, labeled by the inferred mixture component or joint hidden state.]
Both have 7 parameters. Both are universal approximators of distributions on {0,1}^2. They define very different inference regions.
16  RBM combines linear threshold inference functions
Definition. A linear threshold function (LTF) with m (binary) inputs is a function
f: {0,1}^m -> {-, +};  y -> sgn((sum_{j in [m]} w_j y_j) + b),
where w in R^m is called the weight vector and b in R the bias. Identify -/+ and 0/1 vectors via - ↔ 0 and + ↔ 1.
Choosing parameters W in R^{m x n}, B in R^n, C in R^m, our model RBM_{n,m} defines the inference function
up_{W,B,C}: R^n ⊇ {0,1}^n -> {0,1}^m;  v -> argmax_{h in {0,1}^m} h^T (Wv + C).
The log of the number of LTFs with m inputs is asymptotically of order m^2; the exact number is known only for m <= 9.
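The argmax above decouples across hidden units, so the inference function is just a coordinatewise threshold; a small sketch (names illustrative):

```python
import numpy as np

def up(W, C, v):
    # argmax_h h.(Wv + C) over h in {0,1}^m: each hidden unit fires
    # exactly when its net input (Wv + C)_i is positive
    return (W @ v + C > 0).astype(int)

W = np.array([[1.0, -1.0], [1.0, 1.0]])
C = np.array([0.0, -1.5])
h = up(W, C, np.array([1.0, 0.0]))  # net inputs (1.0, -0.5) -> h = (1, 0)
```

Each row of W, together with the corresponding entry of C, is exactly one linear threshold function on the input.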
17  Inference functions: linear threshold functions
The partition is given by the intersection of an affine space with the normal fan of an m-cube (the orthants of R^m): the inference regions are the preimages of the orthants of R^m under the affine map ψ: R^n -> R^m; v -> Wv + C.
Number of inference regions <= sum_{i=0}^{rank(W)} (m choose i) = the number of orthants of R^m intersected by a generic d-dimensional affine subspace, d = rank(W).
When rank(W) < m (e.g. if m > n), the image of the map ψ does not intersect all orthants of R^m, so there are empty inference regions, i.e., states h which are not the explanation of any input vector v.
18  Inference functions of mixture model
The mixture model M_{n,k} has a simpler inference function
up_{λ,B}: v -> argmax_{i in [k]} (B_i^T v - log(Z(B_i)) + log(λ_i)),   (1)
with mixture weights λ_i, parameters B_i in R^n of each mixture component for i in [k], and Z(B_i) = sum_{v in {0,1}^n} exp(B_i^T v).
The input space R^n is partitioned into at most k regions of linearity of the function v -> max{B_i^T v - log(Z(B_i)) + log(λ_i) : i in [k]}.
The partition is given by the intersection of an affine space with the normal fan of a (k-1)-simplex.
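A brute-force sketch of (1) for small n (illustrative names, not the talk's code):

```python
import itertools
import numpy as np

def mixture_up(lams, B, v):
    # argmax_i  B_i . v - log Z(B_i) + log(lambda_i)
    n = B.shape[1]
    xs = np.array(list(itertools.product([0, 1], repeat=n)))
    logZ = np.log(np.exp(xs @ B.T).sum(axis=0))  # log Z(B_i), one per component
    return int(np.argmax(B @ v - logZ + np.log(lams)))

# two components pulling toward (1,1) and (0,0); the input (1,1)
# is explained by the first component
B = np.array([[4.0, 4.0], [-4.0, -4.0]])
i = mixture_up([0.5, 0.5], B, np.array([1.0, 1.0]))
```

Since the score of component i is affine in v, the k components cut R^n into at most k linearity regions, matching the slide.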
19  Comparing inference functions
Fixing the input space dimension n, the number of inference regions in R^n realizable by RBM_{n,m} is of order Θ((m choose min{n,m})): exponential in the number of parameters of the model. In contrast, the number of inference regions realizable by the mixture M_{n,k} is linear in the number of parameters of the model.
So distributed representations can assign different explanations to a number of observations that is exponential in the number of model parameters.
Next we examine more closely the codes (sets of bitstrings) an RBM prefers.
20  Outline
1. Motivation and Introduction: Definitions; Problems
2. Inference functions and regions
3. Geometric Perspectives: Perfect reconstructibility; Modes; Zonosets and hyperplane arrangements; Linear threshold functions; Implications among properties
21  Six 3-parameter properties of sets of binary vectors
Let n, m be non-negative integers and C a subset of {0,1}^n.
LTC: The set C is an (n,m)-linear threshold code, i.e., the image of n linear threshold functions with m inputs.
HP: There exists an arrangement A of n hyperplanes in R^m such that the vertices of the m-dimensional unit cube intersect exactly the C-cells of A.
ZP: There is an affine image of the vertices of an m-cube (an m-zonoset) in R^n which intersects exactly the C-orthants of R^n.
SM: An RBM with n visible and m hidden nodes can represent a distribution with set of strong modes C.
PR: The set C is the set of perfectly reconstructible inputs of an RBM with n visible and m hidden nodes.
SP: An RBM with n visible and m hidden nodes can represent a distribution which is strictly positive on C and zero elsewhere.
22  Perfect reconstructibility
Similarly to the up_θ inference function, a model p_θ(v, h) defines a down_θ inference function: down_θ outputs the most likely visible state argmax_v p_θ(v | h) given a hidden state h.
Definition. Given a probability model p_θ(v, h) on v in X and h in Y, a collection of states C ⊆ X is perfectly reconstructible if there is a choice of the parameter θ for which down_θ(up_θ(v)) = v for all v in C.
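For an RBM both inference directions are coordinatewise thresholds, so the definition can be checked directly; a sketch under illustrative parameters:

```python
import numpy as np

def up(W, C, v):
    # most likely hidden state given v: h_i = 1 iff (Wv + C)_i > 0
    return (W @ v + C > 0).astype(int)

def down(W, B, h):
    # most likely visible state given h: v_j = 1 iff (W^T h + B)_j > 0
    return (W.T @ h + B > 0).astype(int)

def perfectly_reconstructs(W, B, C, code):
    return all((down(W, B, up(W, C, np.array(v))) == np.array(v)).all()
               for v in code)

# strong "copy" weights: up and down are both the identity on {0,1}^2,
# so every length-2 bitstring is reconstructed
W = 4.0 * np.eye(2)
ok = perfectly_reconstructs(W, -2.0 * np.ones(2), -2.0 * np.ones(2),
                            [(0, 0), (0, 1), (1, 0), (1, 1)])
```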
23-25  Is C reconstructible for some θ?
[Figure sequence over three slides: a code C ⊆ {0,1}^n, its image up_θ(C) among the hidden states, and the reconstruction down_θ(up_θ(C)) back among the visible states.]
26  Perfect reconstructibility
The ability to reconstruct input vectors is used to evaluate the performance of RBMs in practice: it is intuitive and can be tested more cheaply than probability distributions. Taking this seriously leads to autoencoder-like training algorithms, where we minimize reconstruction error.
Which subsets of {0,1}^n can be reconstructed? Write the joint distribution on hidden and visible states {p_θ(v, h)}_{v,h} as a matrix with rows labeled by h and columns by v. Then C is perfectly reconstructible iff for some θ, the entry p_θ(v, up_θ(v)) is the unique maximal entry in the up_θ(v)-row (and in the v-column) for all v in C.
27  Represent distributions with many modes?
Let d_H(x, y) be the Hamming distance between binary strings x and y: the number of bits that must be flipped to turn x into y.
Definition. A binary vector x is a mode of a distribution p if p(x) > p(y) for all y in X with d_H(y, x) = 1, and a strong mode if p(x) > sum_{y in X: d_H(y,x)=1} p(y).
The modes of a distribution are the Hamming-locally most likely events in the space of possible events. Modes are closely related to the support sets and boundaries of statistical models, which have been studied especially for hierarchical and graphical models without hidden variables. A complex distribution has many modes.
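The definitions can be checked by enumeration; a small sketch (illustrative, not from the talk):

```python
import itertools

def neighbors(x):
    # all bitstrings at Hamming distance one from x
    return [tuple(1 - b if j == i else b for j, b in enumerate(x))
            for i in range(len(x))]

def strong_modes(p):
    # x is a strong mode if p(x) exceeds the total probability
    # of its Hamming-distance-one neighbors
    return sorted(x for x in p if p[x] > sum(p[y] for y in neighbors(x)))

# a distribution on {0,1}^2 with strong modes on the even-parity strings
p = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
```

Here p(0,0) = 0.45 exceeds its neighbor mass 0.05 + 0.05, and likewise for (1,1), so both even-parity strings are strong modes.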
28  Polyhedral approximation with (strong) modes
Consider the set G_C of distributions with modes, and the set H_C of distributions with strong modes, exactly at the bitstrings in a code C (so the codewords are Hamming distance >= 2 apart). The closures (G̅_C and H̅_C respectively) are convex mode polytopes inscribed in the probability simplex.
The sets of modes not realizable by a probability model give a full-dimensional polyhedral approximation of the model's complement.
Test cases: binary strings with an even or odd number of ones. These are the maximal such subsets of {0,1}^n, with cardinality 2^{n-1} and minimum distance two.
29  Mode polytopes on {0,1}^2
[Figure: the probability simplex on {0,1}^2 with vertices δ_(00), δ_(11), δ_(01), δ_(10), the model M_{2,1}, and the mode polytopes G_2^+ and G_2^-.]
The polytopes at the top and bottom contain the distributions with two modes, on even- or odd-parity bitstrings respectively.
30  Mode polytopes on {0,1}^3
The set of distributions on {0,1}^3 with modes on the even-weight strings is the intersection of the probability simplex and 12 open half-spaces defined by, for each even-weight x, p(x) > p(y) for all y with d_H(x,y) = 1. The closure of this set is a convex polytope G_3^+ with 19 vertices, representing 1.79% of the simplex.
The set H_3^+ of distributions with strong modes at the four even-parity bitstrings is the intersection of the probability simplex and the 4 half-spaces defined by p(x) > sum_{y: d_H(x,y)=1} p(y) for x of even parity.
We use strong mode polytopes because they have fewer facets.
31  Modes of mixtures of products
Theorem. The sets C of strong modes of distributions in M_{n,k} are exactly the sets of strings in X_1 x ... x X_n of minimum Hamming distance at least two and cardinality at most k. Furthermore, if p in M_{n,k} has strong modes C, then every c in C is the mode of one mixture component of p.
For example, although M_{3,3} is full dimensional in Δ_{2^3-1}, its complement contains points arbitrarily close to the uniform distribution!
32  Modes of RBMs
Problem. What is the smallest m in N for which RBM_{n,m} contains a distribution with l (strong) modes? In particular, what is the smallest m for which the model RBM_{n,m} can represent the parity function?
Characterizing the sets of modes realizable by RBMs is a more complex problem than for mixtures of product distributions. It is useful to describe it in terms of point configurations called zonosets, hyperplane arrangements, or linear threshold functions.
33  Zonosets
Zonotopes are, equivalently, images of cubes under affine projection, or Minkowski sums of line segments, and they encode hyperplane arrangements. Zonosets remember the generating points (interior points, multiplicity). The parameters W, B define the projection.
Definition. Let m >= 0, n > 0, W_i in R^n for all i in [m], and B in R^n. The multiset Z = {sum_{i in I} W_i + B}_{I ⊆ [m]} is called an m-zonoset.
The convex hull of a zonoset is a zonotope.
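A direct sketch of the definition (illustrative names): the subset sums over I ⊆ [m] are exactly the affine images of the m-cube's vertices.

```python
import itertools
import numpy as np

def zonoset(W, B):
    # the m-zonoset { sum_{i in I} W_i + B : I subset of [m] }, realized as
    # the affine image of the vertices of the m-cube, h -> h^T W + B
    m = W.shape[0]
    hs = np.array(list(itertools.product([0, 1], repeat=m)))
    return hs @ W + B  # 2^m points in R^n, one per subset I

# with W_i the standard basis vectors of R^2 and B = 0, the 2-zonoset
# is just the vertex set of the unit square
Z = zonoset(np.eye(2), np.zeros(2))
```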
34  Theorem: Connecting properties of codes
Let C ⊆ {0,1}^n be a binary code of minimum Hamming distance at least two.
If the model RBM_{n,m} contains a distribution with strong modes C, or C has cardinality 2^m and is perfectly reconstructible by RBM_{n,m}, then there is an m-zonoset with a point in each C-orthant of R^n.
If there is an m-zonoset intersecting exactly the C-orthants of R^n at points of equal l_1-norm, then RBM_{n,m} contains a distribution with strong modes C and, furthermore, C is perfectly reconstructible.
35  Hyperplane arrangements
A hyperplane arrangement A in R^n is a finite set of (affine) hyperplanes {H_i}_{i in [k]} in R^n. Choosing an orientation for each hyperplane, each vector x in R^n gets a sign vector sgn_A(x) in {-, 0, +}^k, indicating whether x lies on the negative side, inside, or on the positive side of each H_i. The set of all vectors in R^n with the same sign vector is called a cell of A.
36  Linear threshold codes
Recall the definition: a linear threshold function (LTF) with m (binary) inputs is a function f: {0,1}^m -> {-, +}; y -> sgn((sum_{j in [m]} w_j y_j) + b), where w in R^m is called the weight vector and b in R the bias.
When f(-x_1, ..., -x_m) = -f(x_1, ..., x_m) for all x in {-, +}^m, the LTF is self-dual. An LTF with an equal number of positive and negative points separates every input from its opposite and is self-dual.
37  Linear threshold codes
Definition. A subset C ⊆ {0,1}^n = {-, +}^n is an (n, m)-linear threshold code (LTC) if there exist n linear threshold functions f_i: {0,1}^m -> {0,1}, i in [n], with
{(f_1(y), f_2(y), ..., f_n(y)) in {0,1}^n : y in {0,1}^m} = C.
Equivalently, C is an (n, m)-LTC if it is the image of a down inference function of RBM_{n,m}. If all f_i can be chosen self-dual, then C is called homogeneous.
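This definition enumerates directly for small m; a sketch (illustrative names) computing the image of n LTFs on {0,1}^m:

```python
import itertools

def ltf(w, b, y):
    # sgn((sum_j w_j y_j) + b), with - and + identified with 0 and 1
    return int(sum(wj * yj for wj, yj in zip(w, y)) + b > 0)

def ltc(weights, biases):
    # the (n, m)-linear threshold code: image of {0,1}^m under n LTFs
    m = len(weights[0])
    return {tuple(ltf(w, b, y) for w, b in zip(weights, biases))
            for y in itertools.product([0, 1], repeat=m)}

# two copies of the same LTF on one input bit: the code is the
# diagonal {(0,0), (1,1)}
C = ltc([(1.0,), (1.0,)], [-0.5, -0.5])
```

Different choices of the n weight/bias pairs give the different columns of the code, exactly as in the examples on the next slides.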
38  LTC Example: n = 3 and m = 2
There are only two ways to linearly separate the vertices of the 2-cube into sets of cardinality two (up to opposites). These are the only possible columns of a homogeneous LTC with two inputs (up to opposites). But the even- or odd-parity code is not a (3,2)-LTC.
So there does not exist a 2-zonoset with points in the four even (or odd) orthants of R^3, and RBM_{3,2} does not contain any distributions with four strong modes.
But there are 104 ways to linearly separate the vertices of the 3-cube. [Figure: a good one.]
39  LTC Example: n = 4 and m = 3
The 8 vertices of the 3-cube lie in the 8 even-parity cells of an arrangement of four hyperplanes corresponding to a (4,3)-LTC. [The four LTFs, and the matrices w, b = (1/2)(...), and Z realizing the corresponding 3-zonoset with points in the 8 even orthants of R^4, are shown on the slide.]
40  LTC Example: n = 4 and m =
[Figure: the four slicings of the 3-cube from the previous slide, repeated to obtain a larger code.]
41  Smallest mixture of products representing an RBM
[Figure: the heat map of log_2 of k(n,m) = min{k in N : M_{n,k} contains RBM_{n,m}} from slide 9, repeated.]
To improve on our result, just find an RBM_{n,m} where m/n is closer to one. Hint: 5/6 fails.
42  Six 3-parameter properties of sets of binary vectors
Let n, m be non-negative integers and C a subset of {0,1}^n.
LTC: The set C is an (n,m)-linear threshold code, i.e., the image of n linear threshold functions with m inputs.
HP: There exists an arrangement A of n hyperplanes in R^m such that the vertices of the m-dimensional unit cube intersect exactly the C-cells of A.
ZP: There is an affine image of the vertices of an m-cube (an m-zonoset) in R^n which intersects exactly the C-orthants of R^n.
SM: An RBM with n visible and m hidden nodes can represent a distribution with set of strong modes C.
PR: The set C is the set of perfectly reconstructible inputs of an RBM with n visible and m hidden nodes.
SP: An RBM with n visible and m hidden nodes can represent a distribution which is strictly positive on C and zero elsewhere.
43  Implications among properties
[Diagram: SP, SM, PR, HP, LTC, ZP connected by implication arrows, with side conditions d_H(C) >= 2 and an l_1 condition.]
Theorem. Fix integers n and m. For any C ⊆ {0,1}^n, the following hold.
1. The properties LTC, HP, and ZP are equivalent.
2. If C satisfies PR or SM, then it is contained in an LTC set.
3. If the vectors in C are at least Hamming distance 2 apart, then SP implies both SM and PR.
4. If the vectors in C are at least Hamming distance 2 apart and C satisfies an l_1 property, then LTC implies SP.
More informationAnd for polynomials with coefficients in F 2 = Z/2 Euclidean algorithm for gcd s Concept of equality mod M(x) Extended Euclid for inverses mod M(x)
Outline Recall: For integers Euclidean algorithm for finding gcd s Extended Euclid for finding multiplicative inverses Extended Euclid for computing Sun-Ze Test for primitive roots And for polynomials
More informationError Detection and Correction: Hamming Code; Reed-Muller Code
Error Detection and Correction: Hamming Code; Reed-Muller Code Greg Plaxton Theory in Programming Practice, Spring 2005 Department of Computer Science University of Texas at Austin Hamming Code: Motivation
More informationMATH3302. Coding and Cryptography. Coding Theory
MATH3302 Coding and Cryptography Coding Theory 2010 Contents 1 Introduction to coding theory 2 1.1 Introduction.......................................... 2 1.2 Basic definitions and assumptions..............................
More informationChapter 1. Preliminaries
Introduction This dissertation is a reading of chapter 4 in part I of the book : Integer and Combinatorial Optimization by George L. Nemhauser & Laurence A. Wolsey. The chapter elaborates links between
More informationLecture 17: Perfect Codes and Gilbert-Varshamov Bound
Lecture 17: Perfect Codes and Gilbert-Varshamov Bound Maximality of Hamming code Lemma Let C be a code with distance 3, then: C 2n n + 1 Codes that meet this bound: Perfect codes Hamming code is a perfect
More informationA graph contains a set of nodes (vertices) connected by links (edges or arcs)
BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationAn Algebraic Perspective on Deep Learning
An Algebraic Perspective on Deep Learning Jason Morton Penn State July 19-20, 2012 IPAM Supported by DARPA FA8650-11-1-7145. Jason Morton (Penn State) Algebraic Deep Learning 7/19/2012 1 / 103 Motivating
More informationNeural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ]
Neural Networks William Cohen 10-601 [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] WHAT ARE NEURAL NETWORKS? William s notation Logis;c regression + 1
More informationMa/CS 6b Class 25: Error Correcting Codes 2
Ma/CS 6b Class 25: Error Correcting Codes 2 By Adam Sheffer Recall: Codes V n the set of binary sequences of length n. For example, V 3 = 000,001,010,011,100,101,110,111. Codes of length n are subsets
More informationConditional Independence and Factorization
Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationFrom the Zonotope Construction to the Minkowski Addition of Convex Polytopes
From the Zonotope Construction to the Minkowski Addition of Convex Polytopes Komei Fukuda School of Computer Science, McGill University, Montreal, Canada Abstract A zonotope is the Minkowski addition of
More informationSpring 2017 CO 250 Course Notes TABLE OF CONTENTS. richardwu.ca. CO 250 Course Notes. Introduction to Optimization
Spring 2017 CO 250 Course Notes TABLE OF CONTENTS richardwu.ca CO 250 Course Notes Introduction to Optimization Kanstantsin Pashkovich Spring 2017 University of Waterloo Last Revision: March 4, 2018 Table
More informationReading Group on Deep Learning Session 4 Unsupervised Neural Networks
Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationWHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,
WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING Sanjeev Arora, Yingyu Liang & Tengyu Ma Department of Computer Science Princeton University Princeton, NJ 08540, USA {arora,yingyul,tengyu}@cs.princeton.edu
More informationLecture B04 : Linear codes and singleton bound
IITM-CS6845: Theory Toolkit February 1, 2012 Lecture B04 : Linear codes and singleton bound Lecturer: Jayalal Sarma Scribe: T Devanathan We start by proving a generalization of Hamming Bound, which we
More informationDecidability of consistency of function and derivative information for a triangle and a convex quadrilateral
Decidability of consistency of function and derivative information for a triangle and a convex quadrilateral Abbas Edalat Department of Computing Imperial College London Abstract Given a triangle in the
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationBL-Functions and Free BL-Algebra
BL-Functions and Free BL-Algebra Simone Bova bova@unisi.it www.mat.unisi.it/ bova Department of Mathematics and Computer Science University of Siena (Italy) December 9, 008 Ph.D. Thesis Defense Outline
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Variational Inference III: Variational Principle I Junming Yin Lecture 16, March 19, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3 X 3 Reading: X 4
More informationWhy is Deep Learning so effective?
Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap
More informationAuerbach bases and minimal volume sufficient enlargements
Auerbach bases and minimal volume sufficient enlargements M. I. Ostrovskii January, 2009 Abstract. Let B Y denote the unit ball of a normed linear space Y. A symmetric, bounded, closed, convex set A in
More information3 : Representation of Undirected GM
10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:
More informationLinear Classifiers and the Perceptron
Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers Let s assume that every instance is an n-dimensional vector of real numbers x R n, and there are only two possible
More informationThe Triangle Closure is a Polyhedron
The Triangle Closure is a Polyhedron Amitabh Basu Robert Hildebrand Matthias Köppe January 8, 23 Abstract Recently, cutting planes derived from maximal lattice-free convex sets have been studied intensively
More informationCS446: Machine Learning Fall Final Exam. December 6 th, 2016
CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationPart of the slides are adapted from Ziko Kolter
Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,
More information3. Linear Programming and Polyhedral Combinatorics
Massachusetts Institute of Technology 18.433: Combinatorial Optimization Michel X. Goemans February 28th, 2013 3. Linear Programming and Polyhedral Combinatorics Summary of what was seen in the introductory
More informationTUTORIAL PART 1 Unsupervised Learning
TUTORIAL PART 1 Unsupervised Learning Marc'Aurelio Ranzato Department of Computer Science Univ. of Toronto ranzato@cs.toronto.edu Co-organizers: Honglak Lee, Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationIntroduction to Convolutional Neural Networks (CNNs)
Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei
More informationSpectral Hashing: Learning to Leverage 80 Million Images
Spectral Hashing: Learning to Leverage 80 Million Images Yair Weiss, Antonio Torralba, Rob Fergus Hebrew University, MIT, NYU Outline Motivation: Brute Force Computer Vision. Semantic Hashing. Spectral
More informationProbabilistic Machine Learning
Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent
More informationTRISTRAM BOGART AND REKHA R. THOMAS
SMALL CHVÁTAL RANK TRISTRAM BOGART AND REKHA R. THOMAS Abstract. We introduce a new measure of complexity of integer hulls of rational polyhedra called the small Chvátal rank (SCR). The SCR of an integer
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework
More informationFrom Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018
From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction
More informationThe Triangle Closure is a Polyhedron
The Triangle Closure is a Polyhedron Amitabh Basu Robert Hildebrand Matthias Köppe November 7, 21 Abstract Recently, cutting planes derived from maximal lattice-free convex sets have been studied intensively
More informationIntroduction to Machine Learning (67577)
Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks
More informationPartial cubes: structures, characterizations, and constructions
Partial cubes: structures, characterizations, and constructions Sergei Ovchinnikov San Francisco State University, Mathematics Department, 1600 Holloway Ave., San Francisco, CA 94132 Abstract Partial cubes
More informationLearning Deep Architectures
Learning Deep Architectures Yoshua Bengio, U. Montreal Microsoft Cambridge, U.K. July 7th, 2009, Montreal Thanks to: Aaron Courville, Pascal Vincent, Dumitru Erhan, Olivier Delalleau, Olivier Breuleux,
More information2. Polynomials. 19 points. 3/3/3/3/3/4 Clearly indicate your correctly formatted answer: this is what is to be graded. No need to justify!
1. Short Modular Arithmetic/RSA. 16 points: 3/3/3/3/4 For each question, please answer in the correct format. When an expression is asked for, it may simply be a number, or an expression involving variables
More informationNonlinear Discrete Optimization
Nonlinear Discrete Optimization Technion Israel Institute of Technology http://ie.technion.ac.il/~onn Billerafest 2008 - conference in honor of Lou Billera's 65th birthday (Update on Lecture Series given
More informationIntroduction to Real Analysis Alternative Chapter 1
Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces
More informationBoolean Autoencoders and Hypercube Clustering Complexity
Designs, Codes and Cryptography manuscript No. (will be inserted by the editor) Boolean Autoencoders and Hypercube Clustering Complexity P. Baldi Received: 1 November 2010 / Revised: 5 June 2012 / Accepted:
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationShort Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning
Short Course Robust Optimization and 3. Optimization in Supervised EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012 Outline Overview of Supervised models and variants
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More information