Algorithms Reading Group Notes: Provable Bounds for Learning Deep Representations

Joshua R. Wang
November 1, 2016

1 Model and Results

Continuing from last week, we again examine provable algorithms for learning neural networks. This time, we consider the following problem: we draw a random ground truth network and choose some random inputs for it. Running these inputs through the model generates outputs. Given just these outputs, can we efficiently compute a network which is statistically (in $\ell_1$ norm) indistinguishable from the ground truth network? It turns out that we can, by leveraging the structure of the random graphs between layers of the network.

The work considers several cases for the ground truth network distribution. In one simple case, there are $\ell$ hidden layers $h^{(\ell)}, \dots, h^{(1)}$ and one observed layer $y$, each with exactly $n$ nodes. Edges between consecutive layers are chosen randomly subject to the expected degree $d$ being $n^{\gamma}$, with $\gamma < 1/(\ell + 1)$. Edge weights are random in $\{\pm 1\}$. Hidden nodes take boolean values, while the output layer is real-valued. We impose the restriction that the expected density of the last layer, $\rho_\ell d^\ell / 2^\ell$, is $O(1)$. We denote this distribution over ground truth networks as $\mathcal{D}$. (This distribution can be generalized, but the resulting description and analysis become more complex; we cover this simple case today, which is the focus of the short version of the paper.) The input distribution is uniform over $\rho_\ell n$-sparse boolean vectors.

Theorem 1.1. When the degree is $d = n^{\gamma}$ for $0 < \gamma \leq 0.2$ and the density satisfies $\rho_\ell (d/2)^\ell = C$ for some large constant $C$, the network model $\mathcal{D}$ can be learned using $O(\log n / \rho_\ell^2)$ samples and $O(n^2 \ell)$ time. If the weights are instead drawn from $[-1, +1]$, then we require $O(n^3 \ell^2 \log n / \eta^2)$ samples and polynomial time, where $\eta$ bounds the total variation distance between the distribution generated by the learned network and the true distribution for all but the last layer.

2 Denoising Autoencoders

It is often convenient if a deep net (or at least several of its layers) approximately preserves information. This naturally leads to the concept of an autoencoder, which aims to capture one way this reversibility property can occur.

Definition. An autoencoder consists of a decoding function $D(h) = s(Wh + b)$ and an encoding function $E(y) = s(W'y + b')$, where $W, W'$ are linear transformations, $b, b'$ are fixed vectors, and $s$ is a nonlinear function that acts identically on each coordinate. The autoencoder is said to use weight tying if $W' = W^T$.

Possible choices for $s$ include logistic functions and softmax. Today, $s$ will be a simple threshold function.
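To make the threshold autoencoder concrete, here is a minimal numpy sketch (our own illustration, not code from the paper): a single random $\pm 1$ layer with weight tying, where the exactly-$d$-children construction, the parameter values, and the $0.2d$ threshold (anticipating the lemma in the next section) are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 128          # nodes per layer, degree of each hidden node
rho = 4 / n               # hidden-layer density, so rho * n = 4 active nodes

# Random +/-1 bipartite layer: each hidden node gets exactly d random children.
W = np.zeros((n, n))      # W[v, u]: weight of the edge from hidden u to observed v
for u in range(n):
    children = rng.choice(n, size=d, replace=False)
    W[children, u] = rng.choice([-1.0, 1.0], size=d)

def decode(h):
    """D(h) = sgn(W h): boolean hidden vector -> boolean observed vector."""
    return (W @ h > 0).astype(float)

def encode(y, theta=0.2 * d):
    """E(y) = sgn(W^T y - theta): weight-tied encoder with a per-node threshold."""
    return (W.T @ y > theta).astype(float)

# Round trip on a random rho*n-sparse hidden vector.
h = np.zeros(n)
h[rng.choice(n, size=int(rho * n), replace=False)] = 1.0
print("exact recovery:", np.array_equal(h, encode(decode(h))))
```

In the sparse regime $\rho d \ll 1$ the round trip succeeds with high probability, which is exactly the reversibility the next lemma formalizes.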

For more robust reversibility, we can force the autoencoder to reconstruct the input from a corrupted version of the output. This results in a denoising autoencoder:

Definition. An autoencoder is denoising if $E(D(h) + \xi) = h$ with high probability, where $h$ is drawn from the distribution of the hidden layer and $\xi$ is a noise vector drawn from the noise distribution.

Note that normally we would want to minimize the reconstruction error $\|y - D(E(y + \xi))\|$, but for today our $y$ will be exactly $D(h)$.

Our ground truth distribution $\mathcal{D}$ makes each layer of the network a denoising autoencoder with high probability. We now prove this fact.

Lemma 2.1 (Layers are Denoising Autoencoders). If the input layer is a random $\rho n$-sparse vector and noise flips every output bit independently with probability 0.1, then a single layer of a network in $\mathcal{D}$ is a denoising autoencoder with high probability. It uses weight tying.

Before proving our Layers are Denoising Autoencoders Lemma, we first cover some notation for and properties of random bipartite graphs.

Definition. Let $G = (U, V, E, W)$ be a bipartite graph with vertex sets $U$ and $V$, edges $E$, and weights $W$. For any node $u \in U$, let $F(u)$ be its neighbors in $V$. For any subset $S \subseteq U$, let $UF(u, S)$ be the unique neighbors of $u$ with respect to $S$, i.e.,
$$UF(u, S) = \{ v \in V : v \in F(u),\ v \notin F(S \setminus \{u\}) \}.$$
A node $u \in U$ has the $(1 - \epsilon)$-unique neighbor property with respect to a set $S \subseteq U$ if
$$\sum_{v \in UF(u, S)} |W(u, v)| \;\geq\; (1 - \epsilon) \sum_{v \in F(u)} |W(u, v)|.$$
A set $S \subseteq U$ has the $(1 - \epsilon)$-strong unique neighbor property if every node in $U$ has the $(1 - \epsilon)$-unique neighbor property with respect to $S$.

Fix a set $S \subseteq U$ of size $\rho n$. We will use, without proof, the fact that with high probability over the randomness of the graph, $S$ has the $(1 - \epsilon)$-strong unique neighbor property. The intuition is that it follows from Hoeffding's Inequality and a union bound.

We can now prove our Layers are Denoising Autoencoders Lemma.

Proof. The decoder is $D(h) = \operatorname{sgn}(Wh)$ and the encoder is $E(y) = \operatorname{sgn}(W^T y + b')$ with $b' = -0.2d \cdot \vec{1}$. To see why these two functions work, consider the set of nodes which are 1 in the input layer. A majority of their neighbors in the output layer are unique neighbors. Any unique neighbor reached by a positive edge will be 1, because there is no negative edge to cancel it. Hence, by looking at the set of nodes which are 1 in the output, we can infer the correct assignment to the input by applying a threshold of $0.2d$ at each node.
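As a quick empirical check of the lemma (again our own sketch under the same toy assumptions, with all parameter values chosen for convenience), we can corrupt $D(h)$ by flipping each output bit independently with probability 0.1 and verify that the $0.2d$-threshold encoder still recovers $h$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4000, 128
rho = 4 / n

# Same toy layer model: each hidden node gets exactly d random +/-1 edges.
W = np.zeros((n, n))
for u in range(n):
    children = rng.choice(n, size=d, replace=False)
    W[children, u] = rng.choice([-1.0, 1.0], size=d)

h = np.zeros(n)
h[rng.choice(n, size=int(rho * n), replace=False)] = 1.0

y = (W @ h > 0).astype(float)                    # D(h) = sgn(W h)
flips = rng.random(n) < 0.1                      # corrupt each output bit w.p. 0.1
y_noisy = np.where(flips, 1.0 - y, y)
h_hat = (W.T @ y_noisy > 0.2 * d).astype(float)  # E(y) = sgn(W^T y - 0.2 d)
print("denoised recovery:", np.array_equal(h, h_hat))
```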

3 Learning a Single Layer

The plan for learning a single layer $h^{(i)} \to h^{(i-1)}$ proceeds as follows:

1. Construct a correlation graph using samples of $h^{(i-1)}$.
2. Use RecoverGraph to learn the positive edges $E^+_i$ from the correlation graph.
3. Use PartialEncoder to encode all samples of $h^{(i-1)}$ to $h^{(i)}$.
4. Use LearnGraph to learn the negative edges $E^-_i$.

Step 1: Construct the Correlation Graph

Our plan is to first identify related nodes in the observed layer: two nodes which have a common neighbor in the hidden layer to which they are both attached via a +1 edge. To do so, we begin by constructing a correlation graph over output nodes. In particular, we connect pairs of nodes $u, v \in V$ that are both one in at least a $\rho/3$ fraction of our samples.

Proposition 3.1 (Correlation Graph). In a random sample of the output layer, a related pair $u, v$ is both 1 with probability at least $0.9\rho$, while an unrelated pair is both 1 with probability at most $(\rho d)^2$.

Proof. Suppose that $u, v$ are a related pair and $z$ is a vertex with +1 edges to both. Let $S$ be the set of other neighbors of $u, v$; $S$ cannot be much larger than $2d$. By our choice of parameters $\rho d \ll 1$, so the probability that $S$ is all zeros, conditioned on $z$ being 1, is at least 0.9. The probability that $z$ is 1 is $\rho$, so the probability that both $u$ and $v$ are 1 is at least $0.9\rho$.

Now suppose that $u, v$ are an unrelated pair. For both $u, v$ to be 1 there must be nodes $y$ and $z$, both of which are 1, connected via +1 edges to $u$ and $v$ respectively. By a union bound, this event has probability at most $(\rho d)^2$.

Hence if $(\rho d)^2 < 0.1\rho$, then $O(\log n / \rho)$ samples suffice to recover all related pairs w.h.p. This assumption can be weakened if we instead use 3-wise correlations: call a triple of observed nodes related if they are connected to a common hidden node via +1 edges. This allows us to prove an analogous claim about recovering all related triples correctly, but we end up with a 3-uniform hypergraph instead of a graph. RecoverGraph can be modified to work correctly with triples.
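Here is a small numerical illustration of Proposition 3.1 (our own sketch; the exactly-$d$-children graph, the sampling scheme, and all parameter values are assumptions): we estimate how often one related and one unrelated pair of observed nodes are simultaneously 1 and compare against the $\rho/3$ threshold.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 2000, 8
rho = 4 / n                  # density of the hidden layer
num_samples = 50000

# Random +/-1 bipartite layer: each hidden node gets exactly d children.
W = np.zeros((n, n))         # W[v, u]: observed node v, hidden node u
for u in range(n):
    children = rng.choice(n, size=d, replace=False)
    W[children, u] = rng.choice([-1.0, 1.0], size=d)

# A related pair: two positive children of the hidden node with the most +1 edges.
parent = int(np.argmax((W > 0).sum(axis=0)))
u, v = np.flatnonzero(W[:, parent] > 0)[:2]
# An unrelated pair: an observed node sharing no +1 parent with u.
unrel = next(a for a in range(n) if not np.any((W[a] > 0) & (W[u] > 0)))

# Sample rho*n-sparse hidden vectors (stored as their active indices) and
# record how often a pair of observed nodes is simultaneously 1.
active = np.array([rng.choice(n, size=int(rho * n), replace=False)
                   for _ in range(num_samples)])

def on_fraction(a, b):
    ya = W[a][active].sum(axis=1) > 0    # y_a = sgn(sum of active weights into a)
    yb = W[b][active].sum(axis=1) > 0
    return float(np.mean(ya & yb))

print("related pair    :", on_fraction(u, v))
print("unrelated pair  :", on_fraction(u, unrel))
print("threshold rho/3 :", rho / 3)
```

Building the full correlation graph simply repeats this estimate for every pair of observed nodes and keeps the pairs whose frequency exceeds $\rho/3$.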

Step 2: Recover Positive Edges

Notation note: in this subsection we pretend that only positive edges exist, and $d$ refers to the positive degree only. We want to recover the positive edges of the bipartite graph $(U, V, E, W)$ from the correlation graph. To do so, we rely on the following properties, all of which are satisfied by random graphs w.h.p. when $d^3/n \ll 1$:

1. For any two observed nodes $v_1, v_2 \in V$, consider their common neighbors in the correlation graph. At most $d/20$ of these do not share a common positive parent.
2. Any two hidden nodes $u_1, u_2 \in U$ together have at least $1.5d$ distinct positive children (not much overlap).
3. For any hidden $u \in U$ and observed $v \in V$ which is not a positive child of $u$, there are at most $d/20$ neighbors of $v$ in the correlation graph which are positive children of $u$.
4. For any hidden $u \in U$, at least a 0.1 fraction of the pairs $v_1, v_2$ of its positive children have no common neighbor other than $u$.

The algorithm repeats the following until all edges of the correlation graph have been marked:

- Pick a random edge $(v_1, v_2)$ in the correlation graph.
- Compute $S$, the set of common neighbors of $v_1$ and $v_2$ in the correlation graph. Abort this iteration if $|S| \geq 1.3d$.
- Compute $S'$, the set of nodes in $S$ which are neighbors of at least $0.8d - 1$ other nodes in $S$. Note: $S'$ should be a clique in the correlation graph.
- Create a hidden node $u$ which is connected to every node in $S'$ by a positive edge.
- Mark all the edges between nodes in $S'$.

Lemma 3.2. If the graph of positive edges satisfies the properties above, then our algorithm recovers it in $O(n^2)$ expected time.

Proof. By Property 2, we abort if $(v_1, v_2)$ have more than one common positive parent, since then $S$ has size at least $1.5d$. By Property 1, if $(v_1, v_2)$ have a unique common positive parent, then we do not toss out any children of this parent when pruning $S$ to $S'$. By Property 3, if $(v_1, v_2)$ have a unique common positive parent, then we toss out every node which is not a child of this parent when pruning $S$ to $S'$. By Property 4, it takes 10 iterations in expectation to identify a new node $u \in U$. Since each iteration takes at most $O(n)$ time, the algorithm runs in $O(n^2)$ expected time.

Step 3: Recover Hidden Variables

We now want to recover the hidden variables from the observed variables and the positive edges. Let $M$ be the adjacency matrix of the positive edges. We can compute $h$ as follows:

Lemma 3.3 (PartialEncoder). If the support of $h$ satisfies the $11/12$-strong unique neighbor property and $y = \operatorname{sgn}(Wh)$, then $\hat{h} = \operatorname{sgn}(M^T y - \theta \vec{1})$ with threshold $\theta = 0.3d$ satisfies $\hat{h} = h$.

Proof. Any hidden node $z$ with $h_z = 1$ has at least $0.4d$ unique neighbors that are connected to it by +1 edges. All of these neighbors are 1, so $[M^T y]_z \geq 0.4d$. On the other hand, if $h_z = 0$, then $z$ has few common neighbors with the 1's of $h$ (at most $0.2d$, say). Taken together, a threshold of $0.3d$ suffices.

Step 4: Recover Negative Edges

All that remains is to compute the negative edges. Consider the following situation in a sample $(h, y)$. Suppose that for some observed node $v$ we have $y_v = 1$, but there is exactly one hidden node $u$ with $h_u = 1$ and a +1 edge to $v$. In other words, in this sample $v$ receives exactly +1 of positive weight in total, so it cannot receive any negative weight. Hence any other hidden node which is 1 in this sample must not have a $-1$ edge to $v$. Our algorithm simply goes through all samples $(h, y)$ and rules out candidate edges as above. We claim that this suffices to recover the negative edges:

Lemma 3.4 (LearnGraph). Given $O(\log n / (\rho^2 d))$ samples of pairs $(h, y)$, with high probability (over the random graph and the samples) ruling out candidates as above leaves exactly the correct set of negative edges $E^-$ (together with the already-known positive edges $E^+$).

Proof. Consider a non-edge $(x, u)$, where $x$ is hidden and $u$ is observed. We would like three events to happen simultaneously: (i) $h_x = 1$, (ii) $u$ has exactly one positive edge to a hidden node which is 1 in $h$, and (iii) $u$ has no negative edge to a hidden node which is 1 in $h$. These events are almost independent and occur with probabilities roughly $\rho$, $\rho d$, and 1, respectively. Hence each sample rules out a given non-edge with probability roughly $\rho^2 d$, and $O(\log n / (\rho^2 d))$ samples suffice to rule out all non-edges w.h.p.
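The PartialEncoder step of the plan is a one-line threshold. Here is a small numpy check (our own sketch; the exactly-$d$-children graph and the parameter values are convenient assumptions, and $M$ is taken from the ground truth in place of the Step 2 output):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4000, 128
rho = 4 / n

# Ground-truth +/-1 layer: each hidden node gets exactly d random children.
W = np.zeros((n, n))                 # W[v, u]: observed v, hidden u
for u in range(n):
    children = rng.choice(n, size=d, replace=False)
    W[children, u] = rng.choice([-1.0, 1.0], size=d)
M = (W > 0).astype(float)            # positive edges only (as recovered by Step 2)

h = np.zeros(n)
h[rng.choice(n, size=int(rho * n), replace=False)] = 1.0
y = (W @ h > 0).astype(float)        # observed sample y = sgn(W h)

# PartialEncoder: back-project along positive edges and threshold at 0.3 d.
h_hat = (M.T @ y > 0.3 * d).astype(float)
print("PartialEncoder recovers h:", np.array_equal(h, h_hat))
```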

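Similarly, a hedged sketch of the LearnGraph step (again our own illustration: the true hidden samples stand in for PartialEncoder output, Bernoulli-$\rho$ hidden vectors approximate exactly $\rho n$-sparse ones, and all parameters are toy values):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 300, 24
rho = 3 / n
num_samples = 20000

# Ground-truth +/-1 layer: each hidden node gets exactly d children.
W = np.zeros((n, n))                 # W[v, u]: observed v, hidden u
for u in range(n):
    children = rng.choice(n, size=d, replace=False)
    W[children, u] = rng.choice([-1.0, 1.0], size=d)
pos = W > 0                          # positive edges, known after Step 2
neg_true = W < 0

# Hidden samples (approximately rho*n-sparse) and the observed outputs y = sgn(W h).
H = (rng.random((num_samples, n)) < rho).astype(float)
Y = (H @ W.T > 0)

# LearnGraph: if y_v = 1 and v has exactly one active +1 parent, then no other
# active hidden node can have a -1 edge to v; rule those pairs out.
ruled_out = np.zeros((n, n), dtype=bool)     # ruled_out[v, u]
posf = pos.astype(float)
for h, y in zip(H, Y):                       # takes a few seconds
    witness = y & (posf @ h == 1)            # observed nodes receiving exactly +1
    ruled_out[witness] |= (h == 1)[None, :] & ~pos[witness]

neg_hat = ~ruled_out & ~pos                  # what survives, outside E+, is E-
print("recovered E- exactly:", np.array_equal(neg_hat, neg_true))
```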
4 Learning Multiple Layers

As it turns out, we cannot simply apply this single-layer algorithm repeatedly. The problem is that we have assumed that each hidden layer is a random $\rho n$-sparse vector, when in fact there are correlations between nodes due to the layers above. Note that by the unique neighbor property, we expect roughly a $\rho_\ell (d/2)$ fraction of $h^{(\ell-1)}$ to be 1, and hence roughly a $\rho_\ell (d/2)^{\ell - i}$ fraction of $h^{(i)}$.

Lemma 4.1 (Multilayer Correlations). Consider a network from $\mathcal{D}$. With high probability (over the random graphs between layers), for any two nodes $u, v$ in layer $h^{(1)}$,
$$\Pr\!\left[h^{(1)}_u = h^{(1)}_v = 1\right] \;\begin{cases} \geq \rho_2/2 & \text{if } u, v \text{ are related,} \\ \leq \rho_2/4 & \text{otherwise.} \end{cases}$$

Proof. We say a node $s$ at the topmost layer is an ancestor of $u$ if there is a path of length $\ell - 1$ from $s$ to $u$; $u$ has roughly $2^{\ell - 1} \rho_1 / \rho_\ell$ ancestors. With good probability, at most one ancestor of $u$ has value 1. A node is a positive ancestor of $u$ if, when only that node is 1 at the topmost layer, $u$ is 1. The number of positive ancestors of $u$ is roughly $\rho_1 / \rho_\ell$, and the probability that $u$ is on is roughly proportional to its number of positive ancestors.

For a pair of nodes to both be on, either one of their common positive ancestors is on, or each of them has its own positive ancestor that is on. The probability of the latter is $O(\rho_1^2)$, which by assumption is much smaller than $\rho_2$. If $u, v$ have a common positive parent $z$, then the positive ancestors of $z$ are all common positive ancestors of $u, v$, so there are at least $\rho_2 / \rho_\ell$ of them; the probability that exactly one of them is on is at least $\rho_2/2$. If $u, v$ do not have a common parent, then the number of common positive ancestors depends on the number of common grandparents. With high probability over the graph structure (not shown here), this number is small, and hence the probability that both nodes are 1 is small.

This Multilayer Correlations Lemma is what we actually need to guarantee that our threshold of $\rho_2/3$ works to give us the correlation graph. Once we have that, the other steps proceed as before. We can then repeat this argument by peeling off the bottom layer.

5 Learning the Final Layer

Recall that the final layer is real-valued, and hence does not directly work with these algorithms for boolean-valued layers. However, the same plan of computing the correlation graph and using it to recover the positive edges, hidden variables, and negative edges still works.

Lemma 5.1. The encoder $E(y) = \operatorname{sgn}(W^T y - (0.4d) \vec{1})$ and the linear decoder $D(h) = Wh$ form a denoising autoencoder for noise vectors $\gamma$ with independent components, each having variance at most $O(d / \log^2 n)$.

Theorem 5.2 (Final Layer Correlation Graph). When $\rho_1 d = O(1)$ and $d = \Omega(\log^2 n)$, with high probability over the choice of weights and the choice of the graph, for any three nodes $u, v, s$ the assignment $y$ to the bottom layer satisfies:

1. If $u, v, s$ have no common neighbor, then $E_h[y_u y_v y_s] \leq \rho_1/3$.
2. If $u, v, s$ have a unique common neighbor, then $E_h[y_u y_v y_s] \geq 2\rho_1/3$.

Note that there is a more complex statement of the Final Layer Correlation Graph Theorem which corresponds to the more nuanced analysis of Section 4.
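A rough numerical illustration of the final-layer correlation test (our own sketch, not from the paper: parameters are toy values, the common parent is deliberately attached by +1 edges so the third moment comes out positive, and exactly sparse $h^{(1)}$ samples stand in for the true multilayer distribution):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 1000, 32
rho1 = 0.01                      # density of the layer h^(1) feeding the real-valued output
num_samples = 50000

# Random +/-1 layer; the last-layer decoder is linear: y = W h (no sign).
W = np.zeros((n, n))             # W[v, u]: output v, hidden u
for u in range(n):
    children = rng.choice(n, size=d, replace=False)
    W[children, u] = rng.choice([-1.0, 1.0], size=d)

# A triple attached to one common parent by +1 edges (signal of order rho1) ...
parent = int(np.argmax((W > 0).sum(axis=0)))
u, v, s = np.flatnonzero(W[:, parent] > 0)[:3]

# ... and a triple of outputs with pairwise-disjoint parent sets (no common neighbor).
triple = [0]
for a in range(1, n):
    if all(not np.any((W[a] != 0) & (W[b] != 0)) for b in triple):
        triple.append(a)
    if len(triple) == 3:
        break
p, q, r = triple

# Estimate E[y_a y_b y_c] from rho1*n-sparse samples of h^(1) (stored as indices).
active = np.array([rng.choice(n, size=int(rho1 * n), replace=False)
                   for _ in range(num_samples)])

def third_moment(a, b, c):
    ya = W[a][active].sum(axis=1)      # real-valued output y = W h at node a
    yb = W[b][active].sum(axis=1)
    yc = W[c][active].sum(axis=1)
    return float(np.mean(ya * yb * yc))

print("unique common neighbor :", third_moment(u, v, s))
print("no common neighbor     :", third_moment(p, q, r))
print("thresholds rho1/3, 2*rho1/3:", rho1 / 3, 2 * rho1 / 3)
```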

6 Experimental Insights

The authors tested their algorithm on synthetic data and concluded the following:

- The most expensive step was finding the negative edges, but it was possible to stop early and get a reasonable over-estimate of the negative edges.
- Learning a network with two hidden layers ($\ell = 2$), $n = 5000$, $d = 16$, and $\rho_\ell n = 4$ took 500,000 samples and less than one hour to learn the first layer and the positive edges of the second layer. Learning the negative edges of the second layer requires more samples according to the analysis, but the algorithm made only 10 errors with 5,000,000 samples.

References

[ABGM14] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In ICML, pages 584–592, 2014.