Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection
Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem
Greedy Algorithms, Frank-Wolfe and Friends: A Modern Perspective, Lake Tahoe, December 2013
Based on joint work with Ohad Shamir
Neural Networks
A single neuron with activation function σ : ℝ → ℝ takes inputs x₁, ..., x₅ with weights v₁, ..., v₅ and outputs σ(⟨v, x⟩)
Usually, σ is taken to be a sigmoidal function
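As a minimal sketch (the weight and input values are made up for illustration), the neuron above can be written as:

```python
import numpy as np

def sigmoid(a):
    # the usual sigmoidal activation: sigma(a) = 1 / (1 + e^(-a))
    return 1.0 / (1.0 + np.exp(-a))

def neuron(v, x, activation=sigmoid):
    # a single neuron: apply the activation to the inner product <v, x>
    return activation(np.dot(v, x))

v = np.array([0.5, -1.0, 0.25, 0.0, 2.0])   # illustrative weights v_1..v_5
x = np.array([1.0, 0.0, 2.0, 3.0, -1.0])    # illustrative inputs x_1..x_5
y = neuron(v, x)                             # here <v, x> = -1, so y = sigmoid(-1)
```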
Neural Networks
A multilayer neural network of depth 3 and size 6: input layer (x₁, ..., x₅), two hidden layers, output layer
Why are Deep Neural Networks Great?
Because A used it to do B
Classic explanation: neural networks are universal approximators; every Lipschitz function f : [−1, 1]^d → [−1, 1] can be approximated by a neural network
Not convincing, because:
- it can be shown that the size of the network must be exponential in d, so why should we care about such large networks?
- many other universal approximators exist (nearest neighbor, boosting with decision stumps, SVM with RBF kernels), so why should we prefer neural networks?
Why are Deep Neural Networks Great? A Statistical Learning Perspective
Goal: learn a function h : X → Y based on training examples S = ((x₁, y₁), ..., (x_m, y_m)) ∈ (X × Y)^m
No-Free-Lunch theorem: for any algorithm A and any sample size m, there exists a distribution D over X × Y and a function h such that h is perfect w.r.t. D, but with high probability over S ∼ D^m the output of A is very bad
Prior knowledge: we must bias the learner toward reasonable functions, i.e. a hypothesis class H ⊆ Y^X
What should H be?
Why are Deep Neural Networks Great? A Statistical Learning Perspective
Consider all functions over {0, 1}^d that can be computed in time at most T(d)
Theorem: the class H_NN of neural networks of depth O(T(d)) and size O(T(d)²) contains all functions that can be computed in time at most T(d)
A great hypothesis class: with sufficiently large network depth and size, we can express all functions we would ever want to learn
Sample complexity behaves nicely and is well understood (see Anthony & Bartlett 1999)
End of story? The computational barrier: but how do we train neural networks?
Neural Networks: The Computational Barrier
It is NP-hard to implement ERM for a depth-2 network with k ≥ 3 hidden neurons whose activation function is sigmoidal or sign (Blum and Rivest 1992, Bartlett and Ben-David 2002)
Current approaches: backpropagation, possibly with unsupervised pre-training and other bells and whistles
No theoretical guarantees, and often requires manual tweaking
Outline: How to Circumvent Hardness?
1. Over-specification
   - Extreme over-specification eliminates local (non-global) minima
   - Hardness of improperly learning a two-layer network with k = ω(1) hidden neurons
2. Change the activation function (sum-product networks)
   - Efficiently learning sum-product networks of depth 2 using forward greedy selection
   - Hardness of learning deep sum-product networks
Circumventing Hardness using Over-specification
Yann LeCun: fix a network architecture and generate data according to it; backpropagation fails to recover the parameters. However, if we enlarge the network size, backpropagation works just fine.
Maybe we can efficiently learn neural networks using over-specification?
Extremely Over-specified Networks Have No Local (Non-global) Minima
Let X ∈ ℝ^{d×m} be a data matrix of m examples. Consider a network with:
- N internal neurons
- v: the weights of all but the last layer
- F(v; X): the evaluations of the internal neurons over the data matrix X
- w: the weights connecting the internal neurons to the output neuron
The output of the network is wᵀ F(v; X)
Theorem: if N ≥ m, and under mild conditions on F, the optimization problem min_{w,v} ‖wᵀ F(v; X) − y‖² has no local (non-global) minima
Proof idea: w.h.p. over a perturbation of v, F(v; X) has full rank. For such v, if we are not at a global minimum, we can decrease the objective just by changing w.
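A small numerical sketch of the proof idea (the sizes and the tanh internal activation are illustrative choices, not part of the theorem): when N ≥ m, a randomly perturbed F(v; X) has full column rank w.h.p., so optimizing w alone already drives the squared loss to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 5, 20, 30                        # N >= m: extreme over-specification

X = rng.standard_normal((d, m))            # data matrix, one example per column
y = rng.standard_normal(m)                 # arbitrary real targets

V = rng.standard_normal((N, d))            # randomly perturbed internal weights v
F = np.tanh(V @ X)                         # F(v; X): the N x m internal activations

# w.h.p. F has full column rank, so min_w ||w^T F - y||^2 reaches zero
w, *_ = np.linalg.lstsq(F.T, y, rcond=None)
```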
Is Over-specification Enough?
But such large networks will lead to overfitting. Maybe there is a clever trick that circumvents overfitting (regularization, dropout, ...)?
Theorem (Daniely, Linial, S.): even if the data is perfectly generated by a neural network of depth 2 with only k = ω(1) neurons in the hidden layer, there is no efficient algorithm that can achieve small test error
Corollary: over-specification alone is not enough for efficient learnability
Proof Idea: Hardness of Improper Learning
Improper learning: the learner tries to learn some hypothesis h ∈ H but is not restricted to output a hypothesis from H
How to show hardness?
Technical novelty: a new method for deriving lower bounds for improper learning, which relies on average-case complexity assumptions
The technique yields new hardness results for improper learning of:
- DNFs (open problem since Kearns & Valiant 1989)
- intersections of ω(1) halfspaces (Klivans & Sherstov 2006 showed hardness for d^c halfspaces)
- constant approximation ratio for agnostically learning halfspaces (previously, only hardness of exact learning was known)
Can also be used to establish computational-statistical tradeoffs (Daniely, Linial, S., NIPS 2013)
Circumventing Hardness: Sum-Product Networks
Simpler non-linearity: replace the sigmoidal activation function by the square function σ(a) = a²
The network then implements polynomials, where the depth corresponds to the degree
The size of the network (number of neurons) determines generalization properties and evaluation time
Can we efficiently learn the class of polynomial networks of small size?
Depth-2 Polynomial Network
Input layer: x₁, ..., x₅; hidden layer: ⟨v₁, x⟩², ⟨v₂, x⟩², ⟨v₃, x⟩²; output layer: ∑_i w_i ⟨v_i, x⟩²
Depth-2 Polynomial Networks
Corresponding hypothesis class:
H = { x ↦ ∑_{i=1}^r w_i ⟨v_i, x⟩² : ‖w‖ = O(1), ∀i ‖v_i‖ = 1 }
ERM is still NP-hard. But here, a slight over-specification works!
Using d² hidden neurons suffices (trivial). Can we do better?
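The over-specification claim rests on the identity ∑_i w_i ⟨v_i, x⟩² = xᵀ M x with M = ∑_i w_i v_i v_iᵀ, so every hypothesis in H is a quadratic form determined by a d × d symmetric matrix. A quick numerical check (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 5, 3
w = rng.standard_normal(r)
V = rng.standard_normal((r, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)      # each ||v_i|| = 1

def h(x):
    # the depth-2 polynomial network: sum_i w_i <v_i, x>^2
    return sum(wi * np.dot(vi, x) ** 2 for wi, vi in zip(w, V))

# the same hypothesis as a quadratic form x^T M x, with M = sum_i w_i v_i v_i^T
M = sum(wi * np.outer(vi, vi) for wi, vi in zip(w, V))

x = rng.standard_normal(d)
```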
Forward Greedy Selection
Goal: minimize R(w) s.t. ‖w‖₀ ≤ r
Define, for I ⊆ [d]: w_I = argmin_{w : supp(w) ⊆ I} R(w)
Forward Greedy Selection (FGS):
- Start with I = {}
- For t = 1, 2, ...
  - Pick j s.t. |∇_j R(w_I)| ≥ (1 − τ) max_i |∇_i R(w_I)|
  - Let J = I ∪ {j}
  - Replacement step: let I be any set s.t. R(w_I) ≤ R(w_J) and |I| ≤ |J|
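The scheme above can be sketched for the squared loss R(w) = ‖Aw − y‖² / (2m), omitting the replacement step and using the exact greedy pick (τ = 0). The orthogonal design and the sparse target below are illustrative choices that make exact recovery easy to verify:

```python
import numpy as np

def fgs(A, y, rounds):
    # forward greedy selection for min R(w) = ||A w - y||^2 / (2m)
    # s.t. ||w||_0 <= rounds (replacement step omitted for brevity)
    m, d = A.shape
    I, w = [], np.zeros(d)
    for _ in range(rounds):
        grad = A.T @ (A @ w - y) / m
        j = int(np.argmax(np.abs(grad)))     # exact pick, i.e. tau = 0
        if j not in I:
            I.append(j)
        w = np.zeros(d)                      # re-fit w over the support I
        w[I], *_ = np.linalg.lstsq(A[:, I], y, rcond=None)
    return w, I

rng = np.random.default_rng(2)
m, d = 50, 10
Q, _ = np.linalg.qr(rng.standard_normal((m, d)))
A = np.sqrt(m) * Q                           # orthogonal design: A^T A = m * I
w_true = np.zeros(d)
w_true[[1, 4, 7]] = [1.0, -2.0, 0.5]
y = A @ w_true                               # noiseless 3-sparse targets
w_hat, I = fgs(A, y, rounds=3)
```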
Analysis of Forward Greedy Selection
Theorem: assume that R is β-smooth w.r.t. some norm ‖·‖ for which ‖e_j‖ = 1 for all j. Then, for every ε and w̃, if FGS is run for k ≥ 2β‖w̃‖₁² / ((1 − τ)² ε) iterations, it outputs w with R(w) ≤ R(w̃) + ε and ‖w‖₀ ≤ k.
Remarks:
- the dimension of w has no effect on the theorem!
- w̃ is any vector (not necessarily the optimum)
- if ‖w̃‖₂ = O(1) then ‖w̃‖₁² ≤ ‖w̃‖₀ ‖w̃‖₂² = O(‖w̃‖₀)
- the bound depends nicely on τ
Forward Greedy Selection for Sum-Product Networks
Let S be the Euclidean sphere of ℝ^d. Think of w as a mapping from S to ℝ.
Each hypothesis in H corresponds to some w with ‖w‖₀ = r and ‖w‖₂ = O(1); R(w̃) is the training loss of the hypothesis in H corresponding to w̃
Applying FGS for k = Ω(r/ε) iterations would yield a network with O(r/ε) hidden neurons and loss of at most R(w̃) + ε
Main caveat: at each iteration we need to find v s.t. |∇_v R(w)| ≥ (1 − τ) max_{u ∈ S} |∇_u R(w)|
Luckily, this is an eigenvalue problem:
∇_v R(w) = vᵀ E_{(x,y)} [ ℓ′( ∑_{u ∈ supp(w)} w_u ⟨u, x⟩², y ) x xᵀ ] v
The Resulting Algorithm
Greedy Efficient Component Optimization (GECO):
- Initialize V = [ ], w = [ ]
- For t = 1, 2, ..., T
  - Let M = E_{(x,y)} [ ℓ′( ∑_i w_i ⟨v_i, x⟩², y ) x xᵀ ]
  - V = [V v], where v is an approximate leading eigenvector of M
  - Let B = argmin_{B ∈ ℝ^{t×t}} E_{(x,y)} ℓ( (xᵀ V) B (Vᵀ x), y )
  - Update w = eigenvalues(B) and V = V · eigenvectors(B)
Remarks:
- finding an approximate leading eigenvector takes linear time
- the overall complexity depends linearly on the size of the data
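A simplified sketch of GECO for the squared loss; for brevity this assumes two shortcuts that are not the algorithm on the slide: the t × t matrix B re-optimization is replaced by re-fitting scalar weights over the features ⟨v_i, x⟩², and the approximate leading eigenvector is found by plain power iteration:

```python
import numpy as np

def leading_eigvec(M, iters=200):
    # plain power iteration on a symmetric matrix: converges to an
    # eigenvector of the eigenvalue that is largest in magnitude
    v = np.ones(M.shape[0]) / np.sqrt(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

def geco_sq(X, y, T):
    # X is d x m (one example per column), y holds m real targets
    m = X.shape[1]
    V, w = [], np.zeros(0)
    for _ in range(T):
        pred = sum(wi * (vi @ X) ** 2 for wi, vi in zip(w, V)) if V else np.zeros(m)
        resid = pred - y                      # l'(pred, y) up to a factor of 2
        M = (X * resid) @ X.T / m             # empirical E[ l'(...) x x^T ]
        V.append(leading_eigvec(M))
        # re-fit the weights over all selected directions (simplified B-step)
        Phi = np.stack([(vi @ X) ** 2 for vi in V], axis=1)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return V, w

rng = np.random.default_rng(3)
d, m = 4, 200
X = rng.standard_normal((d, m))
v1, v2 = rng.standard_normal(d), rng.standard_normal(d)
v1, v2 = v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2)
y = (v1 @ X) ** 2 - 0.5 * (v2 @ X) ** 2       # data from a rank-2 network
V, w = geco_sq(X, y, T=4)
pred = sum(wi * (vi @ X) ** 2 for wi, vi in zip(w, V))
base = np.mean(y ** 2)                        # loss of the all-zero predictor
loss = np.mean((pred - y) ** 2)
```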
Deeper Sum-Product Networks?
Learning depth-2 sigmoidal networks is hard even if we allow over-specification. Learning depth-2 sum-product networks is tractable if we allow slight over-specification. What about higher degrees?
Theorem (Livni, Shamir, S.): it is hard to learn polynomial networks of depth poly(d), even if their size is poly(d)
Proof idea: it is possible to approximate the sigmoid function with a polynomial of degree poly(d)
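The flavor of the proof idea can be shown numerically; the degree 20 and the interval [−5, 5] below are illustrative choices (the actual construction in the theorem uses degree poly(d) over the relevant input range):

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# a degree-20 Chebyshev interpolant of the sigmoid on [-5, 5]
p = Chebyshev.interpolate(sigmoid, deg=20, domain=[-5, 5])
grid = np.linspace(-5.0, 5.0, 1001)
err = np.max(np.abs(p(grid) - sigmoid(grid)))   # uniform error on the interval
```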
Deeper Sum-Product Networks?
Distributional assumptions: is it easier to learn under certain distributional assumptions?
Vanishing Component Analysis (Livni et al. 2013):
- efficiently finds the generating ideal of the data
- can be used to efficiently construct deep features
- theoretical guarantees are still not satisfactory
Summary
- Sigmoidal deep networks are great statistically but cannot be trained efficiently
- Sum-product networks of depth 2 can be trained efficiently using forward greedy selection
- Very deep sum-product networks cannot be trained efficiently
Open problems:
- Is it possible to train sum-product networks of depth 3? What about depth log(d)?
- Find a combination of network architecture and distributional assumptions that is useful in practice and leads to efficient algorithms
Thanks! Collaborators
- Search for efficient algorithms for deep learning: Ohad Shamir
- GECO: Alon Gonen and Ohad Shamir; based on a previous paper with Tong Zhang and Nati Srebro
- Lower bounds: Amit Daniely and Nati Linial
- VCA: Livni, Lehavi, Nachlieli, Schein, Globerson
Shameless Advertisement