Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection

Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem
Greedy Algorithms, Frank-Wolfe and Friends: A Modern Perspective, Lake Tahoe, December 2013
Based on joint work with Ohad Shamir

Neural Networks
A single neuron with activation function $\sigma : \mathbb{R} \to \mathbb{R}$ computes $x \mapsto \sigma(\langle v, x \rangle)$.
[Figure: a neuron with inputs $x_1, \dots, x_5$, weights $v_1, \dots, v_5$, and output $\sigma(\langle v, x \rangle)$.]
Usually, $\sigma$ is taken to be a sigmoidal function.

Neural Networks
A multilayer neural network of depth 3 and size 6.
[Figure: input layer $x_1, \dots, x_5$, two hidden layers, and an output layer.]
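The two slides above only describe the architecture. As a concrete illustration, here is a minimal NumPy sketch (mine, not from the talk) of a single sigmoid neuron and of a depth-3 network obtained by composing such neurons; the specific layer widths are arbitrary assumptions, chosen only to loosely match the figures.

```python
import numpy as np

def sigmoid(a):
    # sigmoidal activation sigma: R -> R
    return 1.0 / (1.0 + np.exp(-a))

def neuron(v, x):
    # a single neuron: x -> sigma(<v, x>)
    return sigmoid(np.dot(v, x))

def feedforward(x, weight_matrices):
    # a multilayer network: apply sigma(<v, .>) neuron-wise, layer by layer
    a = x
    for W in weight_matrices:      # each row of W is one neuron's weight vector v
        a = sigmoid(W @ a)
    return a

rng = np.random.default_rng(0)
x = rng.standard_normal(5)                   # 5 inputs, as in the figure
layers = [rng.standard_normal((3, 5)),       # hidden layer 1 (assumed width)
          rng.standard_normal((2, 3)),       # hidden layer 2 (assumed width)
          rng.standard_normal((1, 2))]       # output layer -> depth 3
print(neuron(layers[0][0], x), feedforward(x, layers))
```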

Why Are Deep Neural Networks Great?
Because A used them to do B.
Classic explanation: neural networks are universal approximators; every Lipschitz function $f : [-1, 1]^d \to [-1, 1]$ can be approximated by a neural network.
Not convincing, because:
- It can be shown that the size of the network must be exponential in $d$, so why should we care about such large networks?
- Many other universal approximators exist (nearest neighbor, boosting with decision stumps, SVM with RBF kernels), so why should we prefer neural networks?

Why Are Deep Neural Networks Great? A Statistical Learning Perspective
Goal: learn a function $h : \mathcal{X} \to \mathcal{Y}$ based on training examples $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$.
No-Free-Lunch Theorem: for any algorithm $A$ and any sample size $m$, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ and a function $h$ such that $h$ is perfect w.r.t. $\mathcal{D}$, but with high probability over $S \sim \mathcal{D}^m$ the output of $A$ is very bad.
Prior knowledge: we must bias the learner toward reasonable functions, i.e. choose a hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$.
What should $\mathcal{H}$ be?

Why Are Deep Neural Networks Great? A Statistical Learning Perspective
Consider all functions over $\{0,1\}^d$ that can be computed in time at most $T(d)$.
Theorem: the class $\mathcal{H}_{NN}$ of neural networks of depth $O(T(d))$ and size $O(T(d)^2)$ contains all functions that can be computed in time at most $T(d)$.
A great hypothesis class: with sufficiently large network depth and size, we can express all functions we would ever want to learn.
Sample complexity behaves nicely and is well understood (see Anthony & Bartlett, 1999).
End of story?
The computational barrier: but how do we train neural networks?

Neural Networks: The Computational Barrier
It is NP-hard to implement ERM for a depth-2 network with $k \ge 3$ hidden neurons whose activation function is sigmoidal or sign (Blum & Rivest 1992, Bartlett & Ben-David 2002).
Current approaches: backpropagation, possibly with unsupervised pre-training and other bells and whistles.
No theoretical guarantees, and often requires manual tweaking.

Outline: How to Circumvent Hardness?
1. Over-specification
   - Extreme over-specification eliminates local (non-global) minima
   - Hardness of improperly learning a two-layer network with $k = \omega(1)$ hidden neurons
2. Change the activation function (sum-product networks)
   - Efficiently learning sum-product networks of depth 2 using forward greedy selection
   - Hardness of learning deep sum-product networks

Circumventing Hardness Using Over-specification
Yann LeCun: fix a network architecture and generate data according to it.
- Backpropagation fails to recover the parameters.
- However, if we enlarge the network size, backpropagation works just fine.
Maybe we can efficiently learn neural networks using over-specification?

Extremely Over-specified Networks Have No Local (Non-global) Minima
Let $X \in \mathbb{R}^{d \times m}$ be a data matrix of $m$ examples. Consider a network with:
- $N$ internal neurons
- $v$: the weights of all but the last layer
- $F(v; X)$: the evaluations of the internal neurons over the data matrix $X$
- $w$: the weights connecting the internal neurons to the output neuron
The output of the network is $w^\top F(v; X)$.
Theorem: if $N \ge m$, and under mild conditions on $F$, the optimization problem
$$\min_{w, v} \; \big\| w^\top F(v; X) - y \big\|^2$$
has no local (non-global) minima.
Proof idea: w.h.p. over a perturbation of $v$, $F(v; X)$ has full rank. For such $v$, if we are not at a global minimum, we can decrease the objective just by changing $w$.
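A small numerical illustration of this proof idea (my own sketch; the choice of tanh internal neurons and all constants are assumptions for concreteness): when $N \ge m$, a random choice of $v$ typically makes $F(v;X)$ full rank, and the objective can then be driven to zero by solving a linear system in $w$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 10, 50, 60          # N >= m: "extremely over-specified"

X = rng.standard_normal((d, m))      # data matrix, one example per column
y = rng.standard_normal(m)           # arbitrary targets

V = rng.standard_normal((N, d))      # random internal weights v
F = np.tanh(V @ X)                   # F(v; X): N x m evaluations of the internal neurons

# W.h.p. F has rank m, so some w satisfies w^T F(v; X) = y exactly:
w, *_ = np.linalg.lstsq(F.T, y, rcond=None)
print(np.linalg.matrix_rank(F))          # typically m
print(np.linalg.norm(F.T @ w - y))       # ~0: a global minimum is reached by changing w alone
```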

Is Over-specification Enough?
But such large networks will lead to overfitting.
Maybe there is a clever trick that circumvents overfitting (regularization, dropout, ...)?
Theorem (Daniely, Linial, S.): even if the data is perfectly generated by a neural network of depth 2 with only $k = \omega(1)$ neurons in the hidden layer, there is no efficient algorithm that can achieve small test error.
Corollary: over-specification alone is not enough for efficient learnability.

Proof Idea: Hardness of Improper Learning
Improper learning: the learner tries to learn some hypothesis $h \in \mathcal{H}$ but is not restricted to output a hypothesis from $\mathcal{H}$.
How do we show hardness?
Technical novelty: a new method for deriving lower bounds for improper learning, which relies on average-case complexity assumptions.
The technique yields new hardness results for improper learning of:
- DNFs (open problem since Kearns & Valiant, 1989)
- Intersections of $\omega(1)$ halfspaces (Klivans & Sherstov, 2006, showed hardness for $d^c$ halfspaces)
- Constant approximation ratio for agnostically learning halfspaces (previously, only hardness of exact learning was known)
It can also be used to establish computational-statistical tradeoffs (Daniely, Linial, S., NIPS 2013).

Outline: How to Circumvent Hardness?
1. Over-specification
   - Extreme over-specification eliminates local (non-global) minima
   - Hardness of improperly learning a two-layer network with $k = \omega(1)$ hidden neurons
2. Change the activation function (sum-product networks)
   - Efficiently learning sum-product networks of depth 2 using forward greedy selection
   - Hardness of learning deep sum-product networks

Circumventing Hardness: Sum-Product Networks
Simpler non-linearity: replace the sigmoidal activation function by the square function $\sigma(a) = a^2$.
The network then implements polynomials, where the depth corresponds to the degree.
The size of the network (number of neurons) determines the generalization properties and the evaluation time.
Can we efficiently learn the class of polynomial networks of small size?

Depth-2 Polynomial Network
[Figure: input layer $x_1, \dots, x_5$; hidden layer computing $\langle v_1, x \rangle^2, \langle v_2, x \rangle^2, \langle v_3, x \rangle^2$; output layer computing $\sum_i w_i \langle v_i, x \rangle^2$.]
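As a concrete illustration (mine, not from the talk), the hypothesis computed by such a network is $x \mapsto \sum_i w_i \langle v_i, x \rangle^2$, which in code is just:

```python
import numpy as np

def poly_net(x, V, w):
    # depth-2 sum-product / polynomial network:
    # hidden layer computes <v_i, x>^2, output layer sums them with weights w_i
    return float(w @ (V @ x) ** 2)

rng = np.random.default_rng(0)
d, r = 5, 3
V = rng.standard_normal((r, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize so ||v_i|| = 1
w = rng.standard_normal(r)
x = rng.standard_normal(d)
print(poly_net(x, V, w))       # a degree-2 polynomial in x
```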

Depth-2 Polynomial Networks
The corresponding hypothesis class:
$$\mathcal{H} = \left\{ x \mapsto \sum_{i=1}^{r} w_i \langle v_i, x \rangle^2 \;:\; \|w\| = O(1), \ \forall i \ \|v_i\| = 1 \right\}.$$
ERM is still NP-hard. But here, a slight over-specification works!
Using $d^2$ hidden neurons suffices (trivial). Can we do better?

Forward Greedy Selection
Goal: minimize $R(w)$ s.t. $\|w\|_0 \le r$.
Define, for $I \subseteq [d]$: $w_I = \operatorname{argmin}_{w : \operatorname{supp}(w) \subseteq I} R(w)$.
Forward Greedy Selection (FGS):
- Start with $I = \{\}$
- For $t = 1, 2, \ldots$
  - Pick $j$ s.t. $|\nabla_j R(w_I)| \ge (1 - \tau) \max_i |\nabla_i R(w_I)|$
  - Let $J = I \cup \{j\}$
  - Replacement step: let $I$ be any set s.t. $R(w_I) \le R(w_J)$ and $|I| \le |J|$
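To make the scheme concrete, here is a minimal sketch (mine, not from the talk) instantiating FGS for the quadratic loss $R(w) = \|Aw - b\|^2 / (2m)$, with $\tau = 0$ (exact greedy choice) and the optional replacement step omitted:

```python
import numpy as np

def fgs(A, b, r):
    """Forward greedy selection: min R(w) s.t. ||w||_0 <= r,
    for R(w) = ||A w - b||^2 / (2 m)."""
    m, d = A.shape
    support, w = [], np.zeros(d)
    for _ in range(r):
        grad = A.T @ (A @ w - b) / m            # gradient of R at w_I
        grad[support] = 0.0                     # only consider new coordinates
        j = int(np.argmax(np.abs(grad)))        # tau = 0: exact greedy choice
        support.append(j)
        # fully corrective step: w_J = argmin over all w with supp(w) in J
        w_J, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        w = np.zeros(d)
        w[support] = w_J
        # (the optional replacement step from the slide is omitted here)
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
w_true = np.zeros(20); w_true[[2, 7, 11]] = [1.0, -2.0, 0.5]
b = A @ w_true
print(np.nonzero(fgs(A, b, r=3))[0])    # typically recovers the support {2, 7, 11}
```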

Analysis of Forward Greedy Selection
Theorem: assume that $R$ is $\beta$-smooth w.r.t. some norm for which $\|e_j\| = 1$ for all $j$. Then, for every $\epsilon$ and $\bar w$, if FGS is run for
$$k \;\ge\; \frac{2 \beta \, \|\bar w\|_1^2}{(1 - \tau)^2 \, \epsilon}$$
iterations, it outputs $w$ with $R(w) \le R(\bar w) + \epsilon$ and $\|w\|_0 \le k$.
Remarks:
- The dimension of $w$ has no effect on the theorem!
- $\bar w$ is any vector (not necessarily the optimum).
- If $\|\bar w\|_2 = O(1)$ then $\|\bar w\|_1^2 \le \|\bar w\|_0 \, \|\bar w\|_2^2 = O(\|\bar w\|_0)$.
- The bound depends nicely on $\tau$.
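As a quick sanity check (my own instantiation, not on the slide), combining the theorem with the third remark: for a target $\bar w$ with $\|\bar w\|_0 = r$ and $\|\bar w\|_2 = O(1)$, and a fixed approximation level such as $\tau = 1/2$,
$$k \;\ge\; \frac{2 \beta \, \|\bar w\|_0 \, \|\bar w\|_2^2}{(1 - \tau)^2 \, \epsilon} \;=\; O\!\left(\frac{\beta \, r}{\epsilon}\right) \quad \text{iterations suffice,}$$
which matches the $k = \Omega(r/\epsilon)$ iteration count quoted on the next slide.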

Forward Greedy Selection for Sum-Product Networks
- Let $S$ be the Euclidean sphere of $\mathbb{R}^d$. Think of $w$ as a mapping from $S$ to $\mathbb{R}$.
- Each hypothesis in $\mathcal{H}$ corresponds to some $w$ with $\|w\|_0 = r$ and $\|w\|_2 = O(1)$.
- $R(\bar w)$ = training loss of the hypothesis in $\mathcal{H}$ corresponding to $\bar w$.
- Applying FGS for $k = \Omega(r/\epsilon)$ iterations yields a network with $O(r/\epsilon)$ hidden neurons and loss at most $R(\bar w) + \epsilon$.
- Main caveat: at each iteration we need to find $v$ such that $|\nabla_v R(w)| \ge (1 - \tau) \max_{u \in S} |\nabla_u R(w)|$.
- Luckily, this is an eigenvalue problem:
$$\nabla_v R(w) \;=\; v^\top \, \mathbb{E}_{(x,y)} \Big[ \ell'\Big( \sum_{u \in \operatorname{supp}(w)} w_u \langle u, x \rangle^2, \, y \Big) \, x x^\top \Big] \, v$$

The Resulting Algorithm
Greedy Efficient Component Optimization (GECO):
- Initialize $V = [\,]$, $w = [\,]$
- For $t = 1, 2, \ldots, T$
  - Let $M = \mathbb{E}_{(x,y)} \big[ \ell'\big( \sum_i w_i \langle v_i, x \rangle^2, \, y \big) \, x x^\top \big]$
  - $V = [V \; v]$, where $v$ is an approximate leading eigenvector of $M$
  - Let $B = \operatorname{argmin}_{B \in \mathbb{R}^{t \times t}} \mathbb{E}_{(x,y)} \, \ell\big( (x^\top V) B (V^\top x), \, y \big)$
  - Update $w = \operatorname{eigenvalues}(B)$ and $V = V \cdot \operatorname{eigenvectors}(B)$
Remarks:
- Finding an approximate leading eigenvector takes linear time.
- The overall complexity depends linearly on the size of the data.
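The slide states GECO for a general loss $\ell$; below is a minimal runnable sketch (my own instantiation for the squared loss, not the authors' code). The exact eigendecomposition here stands in for the approximate linear-time leading-eigenvector computation mentioned on the slide.

```python
import numpy as np

def geco(X, y, T):
    """GECO sketch for the squared loss l(p, y) = (p - y)^2.
    X: m x d data matrix, y: m targets."""
    m, d = X.shape
    V = np.zeros((d, 0))      # columns are the v_i
    w = np.zeros(0)
    for t in range(1, T + 1):
        pred = ((X @ V) ** 2) @ w
        # M = E[ l'(sum_i w_i <v_i, x>^2, y) x x^T ];  l'(p, y) = 2 (p - y)
        resid = 2.0 * (pred - y)
        M = (X * resid[:, None]).T @ X / m
        # leading eigenvector of M (computed exactly here; an approximate
        # power-iteration step would suffice and runs in linear time)
        eigval, eigvec = np.linalg.eigh(M)
        v = eigvec[:, np.argmax(np.abs(eigval))]
        V = np.column_stack([V, v])
        # fully corrective step: fit B in (x^T V) B (V^T x) by least squares
        Z = X @ V                                             # m x t
        feats = np.einsum('mi,mj->mij', Z, Z).reshape(m, -1)  # vec(z z^T) per example
        coef, *_ = np.linalg.lstsq(feats, y, rcond=None)
        B = coef.reshape(t, t)
        B = (B + B.T) / 2.0
        # re-express the network: w = eigenvalues(B), V = V @ eigenvectors(B)
        w, U = np.linalg.eigh(B)
        V = V @ U
    return V, w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
v_star = np.array([1.0, 0, 0, 0, 0, 0])
y = (X @ v_star) ** 2                            # generated by a 1-neuron polynomial network
V, w = geco(X, y, T=2)
print(np.linalg.norm(((X @ V) ** 2) @ w - y))    # small training loss
```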

Deeper Sum-Product Networks?
- Learning depth-2 sigmoidal networks is hard even if we allow over-specification.
- Learning depth-2 sum-product networks is tractable if we allow slight over-specification.
- What about higher degrees?
Theorem (Livni, Shamir, S.): it is hard to learn polynomial networks of depth poly(d) even if their size is poly(d).
Proof idea: it is possible to approximate the sigmoid function with a polynomial of degree poly(d).
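To make the proof idea concrete, here is a small sketch (mine, not from the paper) of approximating the sigmoid on a bounded interval by polynomial fits of increasing degree; the interval width and degrees are arbitrary choices for illustration.

```python
import numpy as np

# Approximate the sigmoid on [-B, B] by Chebyshev polynomial fits of increasing degree
# (illustration only; the reduction in the talk needs degree poly(d)).
B = 5.0
a = np.linspace(-B, B, 2000)
sigmoid = 1.0 / (1.0 + np.exp(-a))

for degree in (3, 7, 15):
    coeffs = np.polynomial.chebyshev.chebfit(a, sigmoid, degree)
    approx = np.polynomial.chebyshev.chebval(a, coeffs)
    print(degree, float(np.max(np.abs(approx - sigmoid))))
# The max error drops quickly with the degree, suggesting how a sigmoidal network
# can be emulated by a sum-product (polynomial) network at the cost of larger depth.
```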

Deeper Sum-Product Networks?
Distributional assumptions: is it easier to learn under certain distributional assumptions?
Vanishing Component Analysis (Livni et al., 2013):
- Efficiently finds the generating ideal of the data.
- Can be used to efficiently construct deep features.
- Theoretical guarantees are still not satisfactory.

Summary
- Sigmoidal deep networks are great statistically but cannot be trained efficiently.
- Sum-product networks of depth 2 can be trained efficiently using forward greedy selection.
- Very deep sum-product networks cannot be trained efficiently.
Open problems:
- Is it possible to train sum-product networks of depth 3? What about depth $\log(d)$?
- Find a combination of network architecture and distributional assumptions that is useful in practice and leads to efficient algorithms.

Thanks! Collaborators
- Search for efficient algorithms for deep learning: Ohad Shamir
- GECO: Alon Gonen and Ohad Shamir; based on a previous paper with Tong Zhang and Nati Srebro
- Lower bounds: Amit Daniely and Nati Linial
- VCA: Livni, Lehavi, Nachlieli, Schein, Globerson

Shameless Advertisement