When does a mixture of products contain a product of mixtures?
1  When does a mixture of products contain a product of mixtures?
Jason Morton (Penn State)
May 19, 2014, Algebraic Statistics 2014, IIT
Joint work with Guido Montúfar
Supported by DARPA FA
2  Outline
1. Motivation and Introduction: Definitions; Problems
2. Inference functions and regions
3. Geometric Perspectives: Perfect reconstructibility; Modes; Zonosets and hyperplane arrangements; Linear threshold functions; Implications among properties
3  Restricted Boltzmann machines: building block of deep learning
Restricted Boltzmann machines are the canonical layer in deep learning models. A deep learning model is (approximately) a stack of RBMs. Deep learning models are machines for printing money:
Geoff Hinton (Google); Yann LeCun and Rob Fergus (Facebook); Andrew Ng (Baidu); DeepMind ($500M acquisition by Google to play Space Invaders).
"Last year, the cost of a top, world-class deep learning expert was about the same as a top NFL quarterback prospect." - Peter Lee, head of Microsoft Research, 1/27/2014
4  But does it work in theory?
Dominant technology for speech recognition, image recognition, and an expanding range of applications. Deployed at huge scale at Google, Facebook, Microsoft, etc. State-of-the-art results on many tasks. But we can't really prove much about how they work.
"The theory of deep learning is a wide open field. Everything is up for the taking. Go for it." - Yann LeCun, 5/15/2014, on Reddit Machine Learning
5  Mixture of products M_{n,k}
Probability distributions on n binary variables X_1, ..., X_n.
A product distribution factors: p(x) = p(x_1) ... p(x_n).
A k-mixture of products M_{n,k}, a naïve Bayes model, is a convex combination of k product distributions.
The set of product distributions is the closure of the n-dimensional exponential family p_B(x) = (1/Z(B)) exp(B·x).
Picture: k pockets, each with n different-color biased coins.
The Zariski closure of M_{n,k} in complex projective space is the kth secant variety of the nth Segre product of P^1's.
dim(M_{n,k}) = dim(its closure) = min{nk + (k-1), 2^n - 1} = number of parameters, or the ambient dimension, except in the degenerate case (n,k) = (4,3) [Catalisano et al. 2011].
Even a full-dimensional M_{n,k} has a large complement in the simplex until k = 2^{n-1} [Montúfar 2010].
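As a sketch (not from the talk), M_{n,k} can be tabulated by brute force for small n; the function names here are illustrative:

```python
import itertools
import numpy as np

def product_distribution(b):
    # probability table of p_b(x) = exp(b.x) / Z(b) over x in {0,1}^n
    xs = np.array(list(itertools.product([0, 1], repeat=len(b))))
    p = np.exp(xs @ b)
    return p / p.sum()

def mixture_of_products(lams, B):
    # convex combination of k product distributions (rows of B)
    return sum(lam * product_distribution(b) for lam, b in zip(lams, B))

# a 2-mixture on {0,1}^2 concentrating mass near (0,0) and (1,1)
p = mixture_of_products([0.5, 0.5], np.array([[2.0, 2.0], [-2.0, -2.0]]))
```

Each component is a product distribution since exp(b.x) factors coordinatewise; the mixture itself generally does not factor.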
6  Product of mixtures of products: RBM_{n,m}
Mixture of experts: Hadamard (pointwise) product of m different M_{n,2}'s.
The RBM model with n visible and m hidden binary units is the set RBM_{n,m} of distributions on {0,1}^n which are limits of
p(v) = (1/Z(W,B,C)) sum_{h in {0,1}^m} exp(h^T W v + B^T v + C^T h),
with visible units v in {0,1}^n and hidden units h in {0,1}^m, where
W in R^{m x n} is the matrix of interaction weights,
B in R^n is the vector of visible bias weights,
C in R^m is the vector of hidden bias weights, and
Z = sum_{v,h} exp(h^T W v + B^T v + C^T h) normalizes.
Usually has the expected dimension [Cueto-Morton-Sturmfels].
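A minimal sketch of this formula (not from the talk), tabulating an RBM distribution by enumerating all hidden and visible states:

```python
import itertools
import numpy as np

def rbm_distribution(W, B, C):
    # p(v) proportional to sum_h exp(h^T W v + B^T v + C^T h)
    m, n = W.shape
    vs = np.array(list(itertools.product([0, 1], repeat=n)))
    hs = np.array(list(itertools.product([0, 1], repeat=m)))
    # unnormalized log-joint: rows index hidden states, columns visible states
    E = hs @ W @ vs.T + (vs @ B)[None, :] + (hs @ C)[:, None]
    p = np.exp(E).sum(axis=0)  # marginalize out h
    return p / p.sum()

# with all parameters zero the RBM is the uniform distribution
p = rbm_distribution(np.zeros((3, 2)), np.zeros(2), np.zeros(3))
```

This is exponential in n and m, so it is only a check of the definition, not a training procedure.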
7  How are RBMs better?
[Figure: mixture of products M_{6,k} and product of mixtures RBM_{6,4} drawn as graphical models, with parameters W, B, C labeled. Dark nodes represent hidden units, light nodes represent visible units.]
8  Problem
(1) When does the mixture of product distributions M_{n,k} contain the product of mixtures of product distributions RBM_{n,m}, and vice versa?
A result: the number of parameters of the smallest mixture of products M_{n,k} containing RBM_{n,m} grows exponentially in the number of parameters of the RBM, for any fixed ratio 0 < m/n < infinity.
Reason: the RBM can represent distributions with many more Hamming-local maxima.
The proof comes from polyhedral approximations of the sets of probability distributions representable by each model.
9  Smallest mixture of products representing an RBM
[Figure: heat map of log_2 of k(n,m) = min{k in N : M_{n,k} contains RBM_{n,m}} over the (n,m)-plane, with the lines m = n and m = (3/4)n and three labeled regions relating log_2(k) to (3/4)n, n-1, and m.]
An RBM of dimension c has hyperparameters n and m satisfying nm + n + m = c (dashed hyperbola). Fixing the dimension, the RBMs which are hardest to represent as mixtures of product distributions are those with m/n near 1.
10  Many related problems
Problem. What sets of length-n binary vectors are
(2) perfectly reconstructible by an RBM with m hidden units?
(3) the outputs of n linear threshold functions with m input bits?
(4) the modes or strong modes (Hamming-local maxima) of distributions represented by an RBM with m hidden units?
Probability distributions with many strong modes are written much more compactly as RBMs than as mixtures of products, e.g. those supported on even-parity bitstrings.
The key to understanding how these problems relate, and to proving the superiority of RBMs at modeling complex distributions, is thinking about modes and inference functions.
11  Modes and hyperplanes
Modes are described by linear inequalities of the form p(x) > p(x') and define polyhedral approximations of probability models. They are closely related to binary classification problems and to separating vertex sets of hypercubes by hyperplane arrangements, leading to problems such as:
Problem. (5) What is the smallest arrangement of hyperplanes, if one exists, that slices each edge of a hypercube a given number of times?
Goal: provide partial answers to these five problems by explaining the connections between them.
12  Outline
1. Motivation and Introduction: Definitions; Problems
2. Inference functions and regions
3. Geometric Perspectives: Perfect reconstructibility; Modes; Zonosets and hyperplane arrangements; Linear threshold functions; Implications among properties
13  Inference functions of RBMs
The inference function of a probability model p_θ(v, h) with parameter θ in Ω explains each value v by the most likely value of h:
up_θ: v -> argmax_h p_θ(h | v),
up_θ: {0,1}^n -> {0,1}^m, extending to up_θ: R^n -> {0,1}^m.
Each of the m RBM hidden units linearly divides the input space R^n according to its preferred state given the input.
14  Inference regions and distributed representations
up_θ: {0,1}^n -> {0,1}^m, up_θ: R^n -> {0,1}^m.
Together, the m hidden units partition the input space into inference regions where different joint hidden states are most likely. This defines a distributed encoding or distributed representation [Bengio 2009] of the input vector into the hidden nodes. Such a distributed representation is speculated to be a key to the model's efficacy.
15  Inference regions of mixture and RBM, for some θ
[Figure: inference regions of M_{2,4}(θ) and RBM_{2,3}(θ') in the input plane, labeled by the inferred mixture component or joint hidden state.]
Both have 7 parameters. Both are universal approximators of distributions on {0,1}^2. They define very different inference regions.
16  RBM combines linear threshold inference functions
Definition. A linear threshold function (LTF) with m (binary) inputs is a function
f: {0,1}^m -> {-, +};  y -> sgn((sum_{j in [m]} w_j y_j) + b),
where w in R^m is called the weight vector and b in R the bias. Identify -/+ and 0/1 vectors via - ↔ 0 and + ↔ 1.
Choosing parameters W in R^{m x n}, B in R^n, C in R^m, our model RBM_{n,m} defines the inference function
up_{W,B,C}: R^n ⊇ {0,1}^n -> {0,1}^m;  v -> argmax_{h in {0,1}^m} h^T (Wv + C).
The log of the number of LTFs with m inputs is asymptotically of order m^2; the exact number is known only for m <= 9.
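The argmax above decouples across hidden units, so the inference function is just a coordinatewise threshold; a small sketch (names illustrative):

```python
import numpy as np

def up(W, C, v):
    # argmax_h h.(Wv + C) over h in {0,1}^m: each hidden unit fires
    # exactly when its net input (Wv + C)_i is positive
    return (W @ v + C > 0).astype(int)

W = np.array([[1.0, -1.0], [1.0, 1.0]])
C = np.array([0.0, -1.5])
h = up(W, C, np.array([1.0, 0.0]))  # net inputs (1.0, -0.5) -> h = (1, 0)
```

Each row of W, together with the corresponding entry of C, is exactly one linear threshold function on the input.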
17  Inference functions: linear threshold functions
The partition is given by the intersection of an affine space with the normal fan of an m-cube (the orthants of R^m): the inference regions are the preimages of the orthants of R^m under the affine map ψ: R^n -> R^m; v -> Wv + C.
Number of inference regions <= sum_{i=0}^{rank(W)} (m choose i) = the number of orthants of R^m intersected by a generic d-dimensional affine subspace, d = rank(W).
When rank(W) < m (e.g. if m > n), the image of the map ψ does not intersect all orthants of R^m, so there are empty inference regions, i.e., states h which are not the explanation of any input vector v.
18  Inference functions of mixture model
The mixture model M_{n,k} has a simpler inference function
up_{λ,B}: v -> argmax_{i in [k]} (B_i^T v - log(Z(B_i)) + log(λ_i)),   (1)
with mixture weights λ_i, parameters B_i in R^n of each mixture component for i in [k], and Z(B_i) = sum_{v in {0,1}^n} exp(B_i^T v).
The input space R^n is partitioned into at most k regions of linearity of the function v -> max{B_i^T v - log(Z(B_i)) + log(λ_i) : i in [k]}.
The partition is given by the intersection of an affine space with the normal fan of a (k-1)-simplex.
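A brute-force sketch of (1) for small n (illustrative names, not the talk's code):

```python
import itertools
import numpy as np

def mixture_up(lams, B, v):
    # argmax_i  B_i . v - log Z(B_i) + log(lambda_i)
    n = B.shape[1]
    xs = np.array(list(itertools.product([0, 1], repeat=n)))
    logZ = np.log(np.exp(xs @ B.T).sum(axis=0))  # log Z(B_i), one per component
    return int(np.argmax(B @ v - logZ + np.log(lams)))

# two components pulling toward (1,1) and (0,0); the input (1,1)
# is explained by the first component
B = np.array([[4.0, 4.0], [-4.0, -4.0]])
i = mixture_up([0.5, 0.5], B, np.array([1.0, 1.0]))
```

Since the score of component i is affine in v, the k components cut R^n into at most k linearity regions, matching the slide.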
19  Comparing inference functions
Fixing the input space dimension n, the number of inference regions in R^n realizable by RBM_{n,m} is of order Θ((m choose min{n,m})): exponential in the number of parameters of the model. In contrast, the number of inference regions realizable by the mixture M_{n,k} is linear in the number of parameters of the model.
So distributed representations can assign different explanations to a number of observations that is exponential in the number of model parameters.
Next we examine more closely the codes (sets of bitstrings) an RBM prefers.
20  Outline
1. Motivation and Introduction: Definitions; Problems
2. Inference functions and regions
3. Geometric Perspectives: Perfect reconstructibility; Modes; Zonosets and hyperplane arrangements; Linear threshold functions; Implications among properties
21  Six 3-parameter properties of sets of binary vectors
Let n, m be non-negative integers and C a subset of {0,1}^n.
LTC: The set C is an (n,m)-linear threshold code, i.e., the image of n linear threshold functions with m inputs.
HP: There exists an arrangement A of n hyperplanes in R^m such that the vertices of the m-dimensional unit cube intersect exactly the C-cells of A.
ZP: There is an affine image of the vertices of an m-cube (an m-zonoset) in R^n which intersects exactly the C-orthants of R^n.
SM: An RBM with n visible and m hidden nodes can represent a distribution with set of strong modes C.
PR: The set C is the set of perfectly reconstructible inputs of an RBM with n visible and m hidden nodes.
SP: An RBM with n visible and m hidden nodes can represent a distribution which is strictly positive on C and zero elsewhere.
22  Perfect reconstructibility
Similarly to the up_θ inference function, a model p_θ(v, h) defines a down_θ inference function: down_θ outputs the most likely visible state argmax_v p_θ(v | h) given a hidden state h.
Definition. Given a probability model p_θ(v, h) on v in X and h in Y, a collection of states C ⊆ X is perfectly reconstructible if there is a choice of the parameter θ for which down_θ(up_θ(v)) = v for all v in C.
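For an RBM both inference directions are coordinatewise thresholds, so the definition can be checked directly; a sketch under illustrative parameters:

```python
import numpy as np

def up(W, C, v):
    # most likely hidden state given v: h_i = 1 iff (Wv + C)_i > 0
    return (W @ v + C > 0).astype(int)

def down(W, B, h):
    # most likely visible state given h: v_j = 1 iff (W^T h + B)_j > 0
    return (W.T @ h + B > 0).astype(int)

def perfectly_reconstructs(W, B, C, code):
    return all((down(W, B, up(W, C, np.array(v))) == np.array(v)).all()
               for v in code)

# strong "copy" weights: up and down are both the identity on {0,1}^2,
# so every length-2 bitstring is reconstructed
W = 4.0 * np.eye(2)
ok = perfectly_reconstructs(W, -2.0 * np.ones(2), -2.0 * np.ones(2),
                            [(0, 0), (0, 1), (1, 0), (1, 1)])
```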
23-25  Is C reconstructible for some θ?
[Figure sequence over three slides: a code C ⊆ {0,1}^n, its image up_θ(C) among the hidden states, and the reconstruction down_θ(up_θ(C)) back among the visible states.]
26  Perfect reconstructibility
The ability to reconstruct input vectors is used to evaluate the performance of RBMs in practice: it is intuitive and can be tested more cheaply than probability distributions. Taking this seriously leads to autoencoder-like training algorithms, where we minimize reconstruction error.
Which subsets of {0,1}^n can be reconstructed? Write the joint distribution on hidden and visible states {p_θ(v, h)}_{v,h} as a matrix with rows labeled by h and columns by v. Then C is perfectly reconstructible iff for some θ, the entry p_θ(v, up_θ(v)) is the unique maximal entry in the up_θ(v)-row (and in the v-column) for all v in C.
27  Represent distributions with many modes?
Let d_H(x, y) be the Hamming distance between binary strings x and y: the number of bits that must be flipped to turn x into y.
Definition. A binary vector x is a mode of a distribution p if p(x) > p(y) for all y in X with d_H(y, x) = 1, and a strong mode if p(x) > sum_{y in X: d_H(y,x)=1} p(y).
The modes of a distribution are the Hamming-locally most likely events in the space of possible events. Modes are closely related to the support sets and boundaries of statistical models, which have been studied especially for hierarchical and graphical models without hidden variables. A complex distribution has many modes.
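The definitions can be checked by enumeration; a small sketch (illustrative, not from the talk):

```python
import itertools

def neighbors(x):
    # all bitstrings at Hamming distance one from x
    return [tuple(1 - b if j == i else b for j, b in enumerate(x))
            for i in range(len(x))]

def strong_modes(p):
    # x is a strong mode if p(x) exceeds the total probability
    # of its Hamming-distance-one neighbors
    return sorted(x for x in p if p[x] > sum(p[y] for y in neighbors(x)))

# a distribution on {0,1}^2 with strong modes on the even-parity strings
p = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
```

Here p(0,0) = 0.45 exceeds its neighbor mass 0.05 + 0.05, and likewise for (1,1), so both even-parity strings are strong modes.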
28  Polyhedral approximation with (strong) modes
Consider the set G_C of distributions with modes, and the set H_C of distributions with strong modes, exactly at the bitstrings in a code C (so the codewords are Hamming distance >= 2 apart). The closures (G̅_C and H̅_C respectively) are convex mode polytopes inscribed in the probability simplex.
The sets of modes not realizable by a probability model give a full-dimensional polyhedral approximation of the model's complement.
Test cases: binary strings with an even or odd number of ones. These are the maximal such subsets of {0,1}^n, with cardinality 2^{n-1} and minimum distance two.
29  Mode polytopes on {0,1}^2
[Figure: the probability simplex on {0,1}^2 with vertices δ_(00), δ_(11), δ_(01), δ_(10), the model M_{2,1}, and the mode polytopes G_2^+ and G_2^-.]
The polytopes at the top and bottom contain the distributions with two modes, on even- or odd-parity bitstrings respectively.
30  Mode polytopes on {0,1}^3
The set of distributions on {0,1}^3 with modes on the even-weight strings is the intersection of the probability simplex and 12 open half-spaces defined by, for each even-weight x, p(x) > p(y) for all y with d_H(x,y) = 1. The closure of this set is a convex polytope G_3^+ with 19 vertices, representing 1.79% of the simplex.
The set H_3^+ of distributions with strong modes at the four even-parity bitstrings is the intersection of the probability simplex and the 4 half-spaces defined by p(x) > sum_{y: d_H(x,y)=1} p(y) for x of even parity.
We use strong mode polytopes because they have fewer facets.
31  Modes of mixtures of products
Theorem. The sets C of strong modes of distributions in M_{n,k} are exactly the sets of strings in X_1 x ... x X_n of minimum Hamming distance at least two and cardinality at most k. Furthermore, if p in M_{n,k} has strong modes C, then every c in C is the mode of one mixture component of p.
For example, although M_{3,3} is full dimensional in Δ_{2^3-1}, its complement contains points arbitrarily close to the uniform distribution!
32  Modes of RBMs
Problem. What is the smallest m in N for which RBM_{n,m} contains a distribution with l (strong) modes? In particular, what is the smallest m for which the model RBM_{n,m} can represent the parity function?
Characterizing the sets of modes realizable by RBMs is a more complex problem than for mixtures of product distributions. It is useful to describe it in terms of point configurations called zonosets, hyperplane arrangements, or linear threshold functions.
33  Zonosets
Zonotopes are, equivalently, images of cubes under affine projection, or Minkowski sums of line segments, and they encode hyperplane arrangements. Zonosets remember the generating points (interior points, multiplicity). The parameters W, B define the projection.
Definition. Let m >= 0, n > 0, W_i in R^n for all i in [m], and B in R^n. The multiset Z = {sum_{i in I} W_i + B}_{I ⊆ [m]} is called an m-zonoset.
The convex hull of a zonoset is a zonotope.
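A direct sketch of the definition (illustrative names): the subset sums over I ⊆ [m] are exactly the affine images of the m-cube's vertices.

```python
import itertools
import numpy as np

def zonoset(W, B):
    # the m-zonoset { sum_{i in I} W_i + B : I subset of [m] }, realized as
    # the affine image of the vertices of the m-cube, h -> h^T W + B
    m = W.shape[0]
    hs = np.array(list(itertools.product([0, 1], repeat=m)))
    return hs @ W + B  # 2^m points in R^n, one per subset I

# with W_i the standard basis vectors of R^2 and B = 0, the 2-zonoset
# is just the vertex set of the unit square
Z = zonoset(np.eye(2), np.zeros(2))
```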
34  Theorem: Connecting properties of codes
Let C ⊆ {0,1}^n be a binary code of minimum Hamming distance at least two.
If the model RBM_{n,m} contains a distribution with strong modes C, or C has cardinality 2^m and is perfectly reconstructible by RBM_{n,m}, then there is an m-zonoset with a point in each C-orthant of R^n.
If there is an m-zonoset intersecting exactly the C-orthants of R^n at points of equal l_1-norm, then RBM_{n,m} contains a distribution with strong modes C and, furthermore, C is perfectly reconstructible.
35  Hyperplane arrangements
A hyperplane arrangement A in R^n is a finite set of (affine) hyperplanes {H_i}_{i in [k]} in R^n. Choosing an orientation for each hyperplane, each vector x in R^n gets a sign vector sgn_A(x) in {-, 0, +}^k, indicating whether x lies on the negative side, inside, or on the positive side of each H_i. The set of all vectors in R^n with the same sign vector is called a cell of A.
36  Linear threshold codes
Recall the definition: a linear threshold function (LTF) with m (binary) inputs is a function f: {0,1}^m -> {-, +}; y -> sgn((sum_{j in [m]} w_j y_j) + b), where w in R^m is called the weight vector and b in R the bias.
When f(-x_1, ..., -x_m) = -f(x_1, ..., x_m) for all x in {-, +}^m, the LTF is self-dual. An LTF with an equal number of positive and negative points separates every input from its opposite and is self-dual.
37  Linear threshold codes
Definition. A subset C ⊆ {0,1}^n = {-, +}^n is an (n, m)-linear threshold code (LTC) if there exist n linear threshold functions f_i: {0,1}^m -> {0,1}, i in [n], with
{(f_1(y), f_2(y), ..., f_n(y)) in {0,1}^n : y in {0,1}^m} = C.
Equivalently, C is an (n, m)-LTC if it is the image of a down inference function of RBM_{n,m}. If all f_i can be chosen self-dual, then C is called homogeneous.
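This definition enumerates directly for small m; a sketch (illustrative names) computing the image of n LTFs on {0,1}^m:

```python
import itertools

def ltf(w, b, y):
    # sgn((sum_j w_j y_j) + b), with - and + identified with 0 and 1
    return int(sum(wj * yj for wj, yj in zip(w, y)) + b > 0)

def ltc(weights, biases):
    # the (n, m)-linear threshold code: image of {0,1}^m under n LTFs
    m = len(weights[0])
    return {tuple(ltf(w, b, y) for w, b in zip(weights, biases))
            for y in itertools.product([0, 1], repeat=m)}

# two copies of the same LTF on one input bit: the code is the
# diagonal {(0,0), (1,1)}
C = ltc([(1.0,), (1.0,)], [-0.5, -0.5])
```

Different choices of the n weight/bias pairs give the different columns of the code, exactly as in the examples on the next slides.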
38  LTC Example: n = 3 and m = 2
There are only two ways to linearly separate the vertices of the 2-cube into sets of cardinality two (up to opposites). These are the only possible columns of a homogeneous LTC with two inputs (up to opposites). But the even- or odd-parity code is not a (3,2)-LTC.
So there does not exist a 2-zonoset with points in the four even (or odd) orthants of R^3, and RBM_{3,2} does not contain any distributions with four strong modes.
But there are 104 ways to linearly separate the vertices of the 3-cube. [Figure: a good one.]
39  LTC Example: n = 4 and m = 3
The 8 vertices of the 3-cube lie in the 8 even-parity cells of an arrangement of four hyperplanes corresponding to a (4,3)-LTC. [The four LTFs, and the matrices w, b = (1/2)(...), and Z realizing the corresponding 3-zonoset with points in the 8 even orthants of R^4, are shown on the slide.]
40  LTC Example: n = 4 and m =
[Figure: the four slicings of the 3-cube from the previous slide, repeated to obtain a larger code.]
41  Smallest mixture of products representing an RBM
[Figure: the heat map of log_2 of k(n,m) = min{k in N : M_{n,k} contains RBM_{n,m}} from slide 9, repeated.]
To improve on our result, just find an RBM_{n,m} where m/n is closer to one. Hint: 5/6 fails.
42  Six 3-parameter properties of sets of binary vectors
Let n, m be non-negative integers and C a subset of {0,1}^n.
LTC: The set C is an (n,m)-linear threshold code, i.e., the image of n linear threshold functions with m inputs.
HP: There exists an arrangement A of n hyperplanes in R^m such that the vertices of the m-dimensional unit cube intersect exactly the C-cells of A.
ZP: There is an affine image of the vertices of an m-cube (an m-zonoset) in R^n which intersects exactly the C-orthants of R^n.
SM: An RBM with n visible and m hidden nodes can represent a distribution with set of strong modes C.
PR: The set C is the set of perfectly reconstructible inputs of an RBM with n visible and m hidden nodes.
SP: An RBM with n visible and m hidden nodes can represent a distribution which is strictly positive on C and zero elsewhere.
43  Implications among properties
[Diagram: SP, SM, PR, HP, LTC, ZP connected by implication arrows, with side conditions d_H(C) >= 2 and an l_1 condition.]
Theorem. Fix integers n and m. For any C ⊆ {0,1}^n, the following hold.
1. The properties LTC, HP, and ZP are equivalent.
2. If C satisfies PR or SM, then it is contained in an LTC set.
3. If the vectors in C are at least Hamming distance 2 apart, then SP implies both SM and PR.
4. If the vectors in C are at least Hamming distance 2 apart and C satisfies an l_1 property, then LTC implies SP.
More informationAnd for polynomials with coefficients in F 2 = Z/2 Euclidean algorithm for gcd s Concept of equality mod M(x) Extended Euclid for inverses mod M(x)
Outline Recall: For integers Euclidean algorithm for finding gcd s Extended Euclid for finding multiplicative inverses Extended Euclid for computing Sun-Ze Test for primitive roots And for polynomials
More informationError Detection and Correction: Hamming Code; Reed-Muller Code
Error Detection and Correction: Hamming Code; Reed-Muller Code Greg Plaxton Theory in Programming Practice, Spring 2005 Department of Computer Science University of Texas at Austin Hamming Code: Motivation
More informationMATH3302. Coding and Cryptography. Coding Theory
MATH3302 Coding and Cryptography Coding Theory 2010 Contents 1 Introduction to coding theory 2 1.1 Introduction.......................................... 2 1.2 Basic definitions and assumptions..............................
More informationChapter 1. Preliminaries
Introduction This dissertation is a reading of chapter 4 in part I of the book : Integer and Combinatorial Optimization by George L. Nemhauser & Laurence A. Wolsey. The chapter elaborates links between
More informationLecture 17: Perfect Codes and Gilbert-Varshamov Bound
Lecture 17: Perfect Codes and Gilbert-Varshamov Bound Maximality of Hamming code Lemma Let C be a code with distance 3, then: C 2n n + 1 Codes that meet this bound: Perfect codes Hamming code is a perfect
More informationA graph contains a set of nodes (vertices) connected by links (edges or arcs)
BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationAn Algebraic Perspective on Deep Learning
An Algebraic Perspective on Deep Learning Jason Morton Penn State July 19-20, 2012 IPAM Supported by DARPA FA8650-11-1-7145. Jason Morton (Penn State) Algebraic Deep Learning 7/19/2012 1 / 103 Motivating
More informationNeural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ]
Neural Networks William Cohen 10-601 [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] WHAT ARE NEURAL NETWORKS? William s notation Logis;c regression + 1
More informationMa/CS 6b Class 25: Error Correcting Codes 2
Ma/CS 6b Class 25: Error Correcting Codes 2 By Adam Sheffer Recall: Codes V n the set of binary sequences of length n. For example, V 3 = 000,001,010,011,100,101,110,111. Codes of length n are subsets
More informationConditional Independence and Factorization
Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationFrom the Zonotope Construction to the Minkowski Addition of Convex Polytopes
From the Zonotope Construction to the Minkowski Addition of Convex Polytopes Komei Fukuda School of Computer Science, McGill University, Montreal, Canada Abstract A zonotope is the Minkowski addition of
More informationSpring 2017 CO 250 Course Notes TABLE OF CONTENTS. richardwu.ca. CO 250 Course Notes. Introduction to Optimization
Spring 2017 CO 250 Course Notes TABLE OF CONTENTS richardwu.ca CO 250 Course Notes Introduction to Optimization Kanstantsin Pashkovich Spring 2017 University of Waterloo Last Revision: March 4, 2018 Table
More informationReading Group on Deep Learning Session 4 Unsupervised Neural Networks
Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationWHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,
WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING Sanjeev Arora, Yingyu Liang & Tengyu Ma Department of Computer Science Princeton University Princeton, NJ 08540, USA {arora,yingyul,tengyu}@cs.princeton.edu
More informationLecture B04 : Linear codes and singleton bound
IITM-CS6845: Theory Toolkit February 1, 2012 Lecture B04 : Linear codes and singleton bound Lecturer: Jayalal Sarma Scribe: T Devanathan We start by proving a generalization of Hamming Bound, which we
More informationDecidability of consistency of function and derivative information for a triangle and a convex quadrilateral
Decidability of consistency of function and derivative information for a triangle and a convex quadrilateral Abbas Edalat Department of Computing Imperial College London Abstract Given a triangle in the
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationBL-Functions and Free BL-Algebra
BL-Functions and Free BL-Algebra Simone Bova bova@unisi.it www.mat.unisi.it/ bova Department of Mathematics and Computer Science University of Siena (Italy) December 9, 008 Ph.D. Thesis Defense Outline
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Variational Inference III: Variational Principle I Junming Yin Lecture 16, March 19, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3 X 3 Reading: X 4
More informationWhy is Deep Learning so effective?
Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap
More informationAuerbach bases and minimal volume sufficient enlargements
Auerbach bases and minimal volume sufficient enlargements M. I. Ostrovskii January, 2009 Abstract. Let B Y denote the unit ball of a normed linear space Y. A symmetric, bounded, closed, convex set A in
More information3 : Representation of Undirected GM
10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:
More informationLinear Classifiers and the Perceptron
Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers Let s assume that every instance is an n-dimensional vector of real numbers x R n, and there are only two possible
More informationThe Triangle Closure is a Polyhedron
The Triangle Closure is a Polyhedron Amitabh Basu Robert Hildebrand Matthias Köppe January 8, 23 Abstract Recently, cutting planes derived from maximal lattice-free convex sets have been studied intensively
More informationCS446: Machine Learning Fall Final Exam. December 6 th, 2016
CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationPart of the slides are adapted from Ziko Kolter
Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,
More information3. Linear Programming and Polyhedral Combinatorics
Massachusetts Institute of Technology 18.433: Combinatorial Optimization Michel X. Goemans February 28th, 2013 3. Linear Programming and Polyhedral Combinatorics Summary of what was seen in the introductory
More informationTUTORIAL PART 1 Unsupervised Learning
TUTORIAL PART 1 Unsupervised Learning Marc'Aurelio Ranzato Department of Computer Science Univ. of Toronto ranzato@cs.toronto.edu Co-organizers: Honglak Lee, Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationIntroduction to Convolutional Neural Networks (CNNs)
Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei
More informationSpectral Hashing: Learning to Leverage 80 Million Images
Spectral Hashing: Learning to Leverage 80 Million Images Yair Weiss, Antonio Torralba, Rob Fergus Hebrew University, MIT, NYU Outline Motivation: Brute Force Computer Vision. Semantic Hashing. Spectral
More informationProbabilistic Machine Learning
Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent
More informationTRISTRAM BOGART AND REKHA R. THOMAS
SMALL CHVÁTAL RANK TRISTRAM BOGART AND REKHA R. THOMAS Abstract. We introduce a new measure of complexity of integer hulls of rational polyhedra called the small Chvátal rank (SCR). The SCR of an integer
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework
More informationFrom Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018
From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction
More informationThe Triangle Closure is a Polyhedron
The Triangle Closure is a Polyhedron Amitabh Basu Robert Hildebrand Matthias Köppe November 7, 21 Abstract Recently, cutting planes derived from maximal lattice-free convex sets have been studied intensively
More informationIntroduction to Machine Learning (67577)
Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks
More informationPartial cubes: structures, characterizations, and constructions
Partial cubes: structures, characterizations, and constructions Sergei Ovchinnikov San Francisco State University, Mathematics Department, 1600 Holloway Ave., San Francisco, CA 94132 Abstract Partial cubes
More informationLearning Deep Architectures
Learning Deep Architectures Yoshua Bengio, U. Montreal Microsoft Cambridge, U.K. July 7th, 2009, Montreal Thanks to: Aaron Courville, Pascal Vincent, Dumitru Erhan, Olivier Delalleau, Olivier Breuleux,
More information2. Polynomials. 19 points. 3/3/3/3/3/4 Clearly indicate your correctly formatted answer: this is what is to be graded. No need to justify!
1. Short Modular Arithmetic/RSA. 16 points: 3/3/3/3/4 For each question, please answer in the correct format. When an expression is asked for, it may simply be a number, or an expression involving variables
More informationNonlinear Discrete Optimization
Nonlinear Discrete Optimization Technion Israel Institute of Technology http://ie.technion.ac.il/~onn Billerafest 2008 - conference in honor of Lou Billera's 65th birthday (Update on Lecture Series given
More informationIntroduction to Real Analysis Alternative Chapter 1
Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces
More informationBoolean Autoencoders and Hypercube Clustering Complexity
Designs, Codes and Cryptography manuscript No. (will be inserted by the editor) Boolean Autoencoders and Hypercube Clustering Complexity P. Baldi Received: 1 November 2010 / Revised: 5 June 2012 / Accepted:
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationShort Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning
Short Course Robust Optimization and 3. Optimization in Supervised EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012 Outline Overview of Supervised models and variants
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More information