Expressive Efficiency and Inductive Bias of Convolutional Networks: Analysis & Design via Hierarchical Tensor Decompositions

Transcription:

Expressive Efficiency and Inductive Bias of Convolutional Networks: Analysis & Design via Hierarchical Tensor Decompositions
Nadav Cohen, The Hebrew University of Jerusalem
AAAI Spring Symposium Series 2017
Science of Intelligence: Computational Principles of Natural and Artificial Intelligence

Sources
Deep SimNets. N. Cohen, O. Sharir and A. Shashua. Computer Vision and Pattern Recognition (CVPR) 2016.
On the Expressive Power of Deep Learning: A Tensor Analysis. N. Cohen, O. Sharir and A. Shashua. Conference on Learning Theory (COLT) 2016.
Convolutional Rectifier Networks as Generalized Tensor Decompositions. N. Cohen and A. Shashua. International Conference on Machine Learning (ICML) 2016.
Inductive Bias of Deep Convolutional Networks through Pooling Geometry. N. Cohen and A. Shashua. International Conference on Learning Representations (ICLR) 2017.
Tractable Generative Convolutional Arithmetic Circuits. O. Sharir, R. Tamari, N. Cohen and A. Shashua. arXiv preprint 2017.
On the Expressive Power of Overlapping Operations of Deep Networks. O. Sharir and A. Shashua. arXiv preprint 2017.
Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions. N. Cohen, R. Tamari and A. Shashua. arXiv preprint 2017.
Connections Between Deep Learning and Graph Theory with Application to Network Design. Y. Levine, D. Yakira, N. Cohen and A. Shashua. arXiv preprint 2017.

Collaborators
Prof. Amnon Shashua, Or Sharir, David Yakira, Ronen Tamari, Yoav Levine

Classic vs. State of the Art Deep Learning
Classic: Multilayer Perceptron (MLP). Architectural choices: depth, layer widths, activation types.
State of the art: Convolutional Networks (ConvNets). Architectural choices: depth, layer widths, activation types, pooling types, convolution/pooling windows, convolution/pooling strides, dilation factors, connectivity, and more.
Can the architectural choices of state of the art ConvNets be theoretically analyzed?

Outline
1 Expressiveness
2 Expressiveness of Convolutional Networks: Questions
3 Convolutional Arithmetic Circuits
4 Efficiency of Depth (Cohen+Sharir+Shashua@COLT 16, Cohen+Shashua@ICML 16)
5 Inductive Bias of Pooling Geometry (Cohen+Shashua@ICLR 17)
6 Efficiency of Overlapping Operations (Sharir+Shashua@arXiv 17)
7 Efficiency of Interconnectivity (Cohen+Tamari+Shashua@arXiv 17)
8 Inductive Bias of Layer Widths (Levine+Yakira+Cohen+Shashua@arXiv 17)

Expressiveness
Expressiveness: the ability to compactly represent rich and effective classes of functions. It is the driving force behind deep networks.
Fundamental theoretical questions:
What kinds of functions can different network architectures represent?
Why are these functions suitable for real-world tasks?
What is the representational benefit of depth?
Can other architectural features deliver representational benefits?

Efficiency
Expressive efficiency compares network architectures in terms of their ability to compactly represent functions.
Let H_A be the space of functions compactly representable by network architecture A, and H_B the corresponding space for network architecture B.
A is efficient w.r.t. B if H_B is a strict subset of H_A.
A is completely efficient w.r.t. B if H_B has zero volume inside H_A.

Efficiency: Formal Definition
Network architecture A is efficient w.r.t. network architecture B if:
(1) any function realized by B with size r_B can be realized by A with size r_A ∈ O(r_B);
(2) there exist functions realized by A with size r_A that require B to have size r_B ∈ Ω(f(r_A)), where f(·) is super-linear.
A is completely efficient w.r.t. B if (2) holds for all of its functions but a set of Lebesgue measure zero (in weight space).

Inductive Bias
Networks of reasonable size can only realize a fraction of all possible functions.
Efficiency does not explain why this fraction is effective. Why are these functions interesting?
To explain the effectiveness, one must consider the inductive bias:
Not all functions are equally useful for a given task.
A network only needs to represent the useful functions.

Outline (section divider): entering 2 Expressiveness of Convolutional Networks: Questions

Expressiveness of Convolutional Networks, Questions: Efficiency of Depth
Longstanding conjecture, proven for MLPs: deep networks are efficient w.r.t. shallow ones.
Q: Can this be proven for ConvNets?
Q: Is their efficiency of depth complete?

Expressiveness of Convolutional Networks, Questions: Inductive Bias of Convolution/Pooling Geometry
ConvNets typically employ square conv/pool windows. Recently, dilated windows have also become popular.
Q: What is the inductive bias of conv/pool window geometry?
Q: Can the geometries be tailored for a given task?

Expressiveness of Convolutional Networks, Questions: Efficiency of Overlapping Operations
Modern ConvNets employ both overlapping and non-overlapping conv/pool operations.
Q: Do overlapping operations introduce efficiency?

Expressiveness of Convolutional Networks, Questions: Efficiency of Connectivity Schemes
Nearly all state of the art ConvNets employ elaborate connectivity schemes: Inception (GoogLeNet), ResNet, DenseNet.
Q: Can this be justified in terms of efficiency?

Expressiveness of Convolutional Networks, Questions: Inductive Bias of Layer Widths
There is no clear principle for setting the widths (number of channels) of ConvNet layers.
Q: What is the inductive bias of one layer's width vs. another's?
Q: Can the widths be tailored for a given task?

Outline (section divider): entering 3 Convolutional Arithmetic Circuits

Convolutional Arithmetic Circuits
To address the questions raised, we consider a surrogate (special case) of ConvNets: Convolutional Arithmetic Circuits (ConvACs).
ConvACs are equivalent to hierarchical tensor decompositions, allowing theoretical analysis with mathematical tools from various fields, e.g. functional analysis, measure theory, matrix algebra, graph theory.
ConvACs are superior to ReLU ConvNets in terms of expressiveness [1] and deliver promising results in practice: they excel in computationally constrained settings [2] and classify optimally under missing data [3].
[1] Convolutional Rectifier Networks as Generalized Tensor Decompositions, ICML 16
[2] Deep SimNets, CVPR 16
[3] Tractable Generative Convolutional Arithmetic Circuits, arXiv 17

Baseline Architecture
Baseline ConvAC architecture: a 2D ConvNet with linear activation (σ(z) = z), product pooling (P{c_j} = Π_j c_j), 1×1 convolution windows, and non-overlapping pooling windows.
Computation, from input X (patches x_1, ..., x_N) through hidden layers 0, ..., L-1 to the dense output:
rep(i, d) = f_{θ_d}(x_i)   (representation layer, d = 1, ..., M)
conv_0(j, γ) = ⟨a^{0,j,γ}, rep(j, :)⟩   (1×1 conv, γ = 1, ..., r_0)
pool_0(j, γ) = Π_{j' ∈ window j} conv_0(j', γ)   (product pooling)
(and similarly for layers 1, ..., L-1, with pool_{L-1} covering the entire space)
out(y) = ⟨a^{L,1,y}, pool_{L-1}(:)⟩   (dense output, y = 1, ..., Y)

Grid Tensors
ConvNets realize functions over many local structures: f(x_1, x_2, ..., x_N), where the x_i are image patches (2D network) or sequence samples (1D network).
f(·) may be studied by discretizing each x_i into one of M templates {v^(1), ..., v^(M)}:
A_{d_1 ... d_N} = f(v^(d_1), ..., v^(d_N)),   d_1, ..., d_N ∈ {1, ..., M}
The lookup table A is an N-dimensional array (tensor) with length M in each axis, referred to as the grid tensor of f(·).
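To make the definition concrete, here is a minimal numpy sketch that tabulates a grid tensor by brute force; the function f and the templates below are toy stand-ins, not anything from the talk:

```python
import itertools
import numpy as np

# Build the grid tensor of a toy function f(x_1, ..., x_N), discretizing each
# input x_i into one of M template vectors v^(1), ..., v^(M).
N, M, patch_dim = 4, 3, 2                        # number of inputs, templates, patch size
templates = np.random.randn(M, patch_dim)        # v^(1), ..., v^(M) (random placeholders)

def f(*patches):
    # toy multiplicative function over the N patches (stand-in for a network's output)
    return np.prod([p.sum() for p in patches])

grid_tensor = np.empty([M] * N)                  # N-dimensional array, length M per axis
for indices in itertools.product(range(M), repeat=N):
    grid_tensor[indices] = f(*(templates[d] for d in indices))

print(grid_tensor.shape)                         # (3, 3, 3, 3)
```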

Tensor Decompositions: Compact Parameterizations
High-dimensional tensors (arrays) are exponentially large and cannot be used directly.
They may be represented and manipulated via tensor decompositions: compact algebraic parameterizations that generalize low-rank matrix decompositions (a large matrix represented by a compact parameterization).
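As a point of reference for the matrix case the slide alludes to, a small sketch with arbitrary sizes:

```python
import numpy as np

# A rank-r matrix A = U @ V.T is stored with (n + m) * r numbers instead of n * m.
n, m, r = 1000, 800, 5
U, V = np.random.randn(n, r), np.random.randn(m, r)
A = U @ V.T                                       # full matrix: 800,000 entries
print((n + m) * r, "parameters instead of", n * m)
# Tensor decompositions (CP, Hierarchical Tucker, tensor train, ...) generalize this
# idea from matrices (order-2 tensors) to order-N tensors.
```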

Hierarchical Tensor Decompositions
Hierarchical tensor decompositions represent high-dimensional tensors by incrementally generating intermediate tensors of increasing dimension.
The generation process can be described by a tree over tensor modes (axes). The slide illustrates a balanced binary mode tree over 16 modes: the leaves {1}, {2}, ..., {16} (1D tensors) are merged into pairs {1,2}, {3,4}, ..., {15,16} (2D tensors), then into {1,2,3,4}, ... (4D tensors), then 8D tensors, up to the root {1, ..., 16}, which holds the 16D decomposition output.

Convolutional Arithmetic Circuits ↔ Hierarchical Tensor Decompositions
Observation: grid tensors of functions realized by ConvACs are given by hierarchical tensor decompositions:
network structure (depth, width, pooling, etc.) ↔ decomposition type (dimension tree, internal ranks, etc.)
network weights ↔ decomposition parameters
We can study networks through the corresponding decompositions!

Example 1: Shallow Network ↔ CP Decomposition
A shallow network (single hidden layer: 1×1 conv, global product pooling, dense output) corresponds to the classic CP decomposition:
A^y = Σ_{γ=1}^{r_0} a^{1,1,y}_γ · a^{0,1,γ} ⊗ a^{0,2,γ} ⊗ ... ⊗ a^{0,N,γ}
(⊗ denotes the outer product)
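A minimal numpy sketch of this CP form, with toy sizes and random parameters standing in for trained weights:

```python
import numpy as np

# CP decomposition underlying a shallow ConvAC:
# A^y = sum_gamma a^{1,1,y}_gamma * a^{0,1,gamma} (x) ... (x) a^{0,N,gamma}
N, M, r0, Y = 4, 3, 5, 2                      # inputs, templates, hidden channels, classes
a0 = np.random.randn(N, r0, M)                # a^{0,j,gamma}, one vector per input location
a1 = np.random.randn(Y, r0)                   # output weights a^{1,1,y}

def cp_grid_tensor(y):
    A = np.zeros([M] * N)
    for gamma in range(r0):
        term = np.array(1.0)
        for j in range(N):                    # outer product over the N input locations
            term = np.multiply.outer(term, a0[j, gamma])
        A += a1[y, gamma] * term
    return A

print(cp_grid_tensor(0).shape)                # (3, 3, 3, 3)
```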

Example 2: Deep Network ↔ HT Decomposition
A deep network with size-2 pooling (L = log_2 N hidden layers) corresponds to the Hierarchical Tucker (HT) decomposition:
φ^{1,j,γ} = Σ_{α=1}^{r_0} a^{1,j,γ}_α · a^{0,2j-1,α} ⊗ a^{0,2j,α}
φ^{l,j,γ} = Σ_{α=1}^{r_{l-1}} a^{l,j,γ}_α · φ^{l-1,2j-1,α} ⊗ φ^{l-1,2j,α}
A^y = Σ_{α=1}^{r_{L-1}} a^{L,1,y}_α · φ^{L-1,1,α} ⊗ φ^{L-1,2,α}
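A corresponding numpy sketch of the HT form (toy sizes, all internal ranks equal, random parameters in place of trained weights):

```python
import numpy as np

# Hierarchical Tucker decomposition realized by a deep ConvAC with size-2 pooling.
N, M, r, Y = 8, 3, 4, 2                        # N must be a power of 2
L = int(np.log2(N))
a0 = np.random.randn(N, r, M)                  # level-0 vectors a^{0,i,alpha}
a = [np.random.randn(N // 2 ** l, r, r) for l in range(1, L)]   # a^{l,j,gamma}_alpha
aL = np.random.randn(Y, r)                     # output weights a^{L,1,y}

def merge(coeff, left, right):
    # phi[gamma, ...] = sum_alpha coeff[gamma, alpha] * left[alpha, ...] (x) right[alpha, ...]
    lf, rf = left.reshape(left.shape[0], -1), right.reshape(right.shape[0], -1)
    out = np.einsum('ga,ai,aj->gij', coeff, lf, rf)
    return out.reshape((coeff.shape[0],) + left.shape[1:] + right.shape[1:])

def ht_grid_tensor(y):
    phis = list(a0)                            # level 0: one (r, M) tensor per input location
    for l in range(1, L):                      # levels 1 .. L-1: merge neighboring pairs
        phis = [merge(a[l - 1][j], phis[2 * j], phis[2 * j + 1])
                for j in range(len(phis) // 2)]
    # output: A^y = sum_alpha a^{L,1,y}_alpha * phi^{L-1,1,alpha} (x) phi^{L-1,2,alpha}
    return merge(aL[y:y + 1], phis[0], phis[1])[0]

print(ht_grid_tensor(0).shape)                 # (3, 3, 3, 3, 3, 3, 3, 3)
```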

Outline (section divider): entering 4 Efficiency of Depth (Cohen+Sharir+Shashua@COLT 16, Cohen+Shashua@ICML 16)

Tensor Matricization (Cohen+Sharir+Shashua@COLT 16, Cohen+Shashua@ICML 16)
Let A be a tensor of order (dimension) N, and let (I, J) be a partition of [N] := {1, ..., N}, i.e. I ∪ J = [N].
[A]_{I,J}, the matricization of A w.r.t. (I, J), is the arrangement of A as a matrix: rows correspond to the modes (axes) indexed by I, columns to the modes indexed by J.
(The slide illustrates this on a small order-3 tensor, matricized w.r.t. I = {1, 3}, J = {2}.)
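A small numpy sketch of matricization (0-based mode indices; the sizes are arbitrary):

```python
import numpy as np

# Matricization of a tensor A w.r.t. a partition (I, J) of its modes:
# rows are indexed by the modes in I, columns by the modes in J.
def matricize(A, I, J):
    rows = int(np.prod([A.shape[i] for i in I]))
    return np.transpose(A, list(I) + list(J)).reshape(rows, -1)

A = np.arange(2 * 2 * 3).reshape(2, 2, 3)          # an order-3 tensor
M = matricize(A, I=[0, 2], J=[1])                  # I = {1, 3}, J = {2} in the slide's 1-based notation
print(M.shape)                                     # (6, 2)
```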

Exponential & Complete Efficiency of Depth (Cohen+Sharir+Shashua@COLT 16, Cohen+Shashua@ICML 16)
Claim: tensors generated by a CP decomposition with r_0 terms, when matricized under any partition (I, J), have rank r_0 or less.
Theorem: consider the partition I_odd = {1, 3, ..., N-1}, J_even = {2, 4, ..., N}. Besides a set of measure zero, all parameter settings of the HT decomposition give tensors that, when matricized w.r.t. (I_odd, J_even), have exponential ranks.
Since the number of terms in the CP decomposition corresponds to the number of hidden channels in a shallow ConvAC:
Corollary: almost all functions realizable by a deep ConvAC cannot be replicated by a shallow ConvAC with less than exponentially many hidden channels.
With ConvACs, efficiency of depth is exponential and complete!
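The claim about CP tensors is easy to check numerically; a small sketch with toy sizes (the HT side of the theorem is not reproduced here):

```python
import numpy as np

# A tensor generated by a CP decomposition with r0 terms has matricization rank <= r0
# under any partition of its modes.
np.random.seed(0)
N, M, r0 = 6, 3, 4
A = np.zeros([M] * N)
for _ in range(r0):                                # sum of r0 rank-1 (outer-product) terms
    term = np.array(1.0)
    for _ in range(N):
        term = np.multiply.outer(term, np.random.randn(M))
    A += term

I, J = [0, 2, 4], [1, 3, 5]                        # odd/even partition (0-based)
mat = np.transpose(A, I + J).reshape(M ** len(I), -1)
print(np.linalg.matrix_rank(mat))                  # <= r0 = 4
```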

From Convolutional Arithmetic Circuits to Convolutional Rectifier Networks (Cohen+Sharir+Shashua@COLT 16, Cohen+Shashua@ICML 16)
Transform ConvACs into convolutional rectifier networks (R-ConvNets) by replacing, in the same baseline architecture:
linear activation → ReLU activation: σ(z) = max{z, 0}
product pooling → max/average pooling: P{c_j} = max_j{c_j} or mean_j{c_j}
R-ConvNets are the most successful deep learning architecture to date!

Generalized Tensor Decompositions (Cohen+Sharir+Shashua@COLT 16, Cohen+Shashua@ICML 16)
ConvACs correspond to tensor decompositions based on the tensor product ⊗:
(A ⊗ B)_{d_1,...,d_{P+Q}} = A_{d_1,...,d_P} · B_{d_{P+1},...,d_{P+Q}}
For an operator g : R × R → R, the generalized tensor product ⊗_g is defined by:
(A ⊗_g B)_{d_1,...,d_{P+Q}} := g(A_{d_1,...,d_P}, B_{d_{P+1},...,d_{P+Q}})
(the same as ⊗, but with g(·,·) instead of multiplication)
Generalized tensor decompositions are obtained by replacing ⊗ with ⊗_g.

Convolutional Rectifier Networks ↔ Generalized Tensor Decompositions
Define the activation-pooling operator ρ_{σ/P}(a, b) := P{σ(a), σ(b)}.
Grid tensors of functions realized by R-ConvNets are given by generalized tensor decompositions with g(·,·) ≡ ρ_{σ/P}(·,·):
Shallow R-ConvNet ↔ generalized CP decomposition with g(·,·) ≡ ρ_{σ/P}(·,·)
Deep R-ConvNet ↔ generalized HT decomposition with g(·,·) ≡ ρ_{σ/P}(·,·)
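A minimal numpy sketch of the generalized tensor product, instantiated with ρ for ReLU activation and max pooling (toy inputs):

```python
import numpy as np

def generalized_tensor_product(A, B, g):
    # (A (x)_g B)[d_1..d_P, d_{P+1}..d_{P+Q}] = g(A[d_1..d_P], B[d_{P+1}..d_{P+Q}])
    A_exp = A.reshape(A.shape + (1,) * B.ndim)      # broadcast A's entries over B's modes
    return g(A_exp, B)

def rho_relu_max(a, b):
    # activation-pooling operator: P{sigma(a), sigma(b)} with sigma = ReLU, P = max
    return np.maximum(np.maximum(a, 0), np.maximum(b, 0))

A, B = np.random.randn(3, 3), np.random.randn(3)
standard = generalized_tensor_product(A, B, np.multiply)    # ordinary tensor product
generalized = generalized_tensor_product(A, B, rho_relu_max)
print(standard.shape, generalized.shape)                    # (3, 3, 3) (3, 3, 3)
```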

Exponential But Incomplete Efficiency of Depth
By analyzing matricization ranks of tensors realized by generalized CP and HT decompositions with g(·,·) ≡ ρ_{σ/P}(·,·), we show:
Claim: there exist functions realizable by a deep R-ConvNet that require a shallow R-ConvNet to be exponentially large.
On the other hand:
Claim: a non-negligible (positive measure) set of the functions realizable by a deep R-ConvNet can be replicated by a shallow R-ConvNet with few hidden channels.
With R-ConvNets, efficiency of depth is exponential but incomplete!
Developing optimization methods for ConvACs may give rise to an architecture that is provably superior but has so far been overlooked.

Outline (section divider): entering 5 Inductive Bias of Pooling Geometry (Cohen+Shashua@ICLR 17)

Separation Rank: A Measure of Input Correlations (Cohen+Shashua@ICLR 17)
ConvNets realize functions over many local structures: f(x_1, x_2, ..., x_N), where the x_i are image patches (2D network) or sequence samples (1D network).
An important feature of f(·) is the correlations it models between the x_i's. The separation rank is a formal measure of these correlations.
The separation rank of f(·) w.r.t. an input partition (I, J) measures its distance from separability: the higher the separation rank, the more correlation is modeled between (x_i)_{i∈I} and (x_j)_{j∈J}.

Deep Networks Favor Some Correlations Over Others
Claim: with a ConvAC, the separation rank w.r.t. (I, J) equals the rank of the grid tensor A^y matricized w.r.t. (I, J).
Theorem: the maximal rank of a tensor generated by the HT decomposition, when matricized w.r.t. (I, J), is exponential for interleaved partitions and polynomial for coarse partitions.
Corollary: a deep ConvAC can realize exponential separation ranks (correlations) for favored partitions, but only polynomial ones for others.

Pooling Geometry Controls the Preference
Contiguous pooling favors local correlations; alternative pooling geometries favor alternative correlations.
The pooling geometry of a deep ConvAC determines which partitions are favored, and thereby controls the correlation profile (inductive bias)!

Outline (section divider): entering 6 Efficiency of Overlapping Operations (Sharir+Shashua@arXiv 17)

Overlapping Operations (Sharir+Shashua@arXiv 17)
The baseline ConvAC architecture has non-overlapping conv and pool windows.
We replace these by (possibly) overlapping generalized convolution (GC) layers.

Exponential Efficiency (Sharir+Shashua@arXiv 17)
Theorem: various ConvACs with overlapping GC layers realize functions that require a ConvAC with no overlaps to be exponentially large.
Examples: a network that starts with a GC layer whose receptive field is larger than half the input (stride 1×1), followed by arbitrary layers; the typical scheme of alternating B×B convolutions and 2×2 pooling.
With ConvACs, overlaps lead to exponential efficiency!

Outline (section divider): entering 7 Efficiency of Interconnectivity (Cohen+Tamari+Shashua@arXiv 17)

Dilated Convolutional Networks (Cohen+Tamari+Shashua@arXiv 17)
We study efficiency of interconnectivity with dilated convolutional networks:
1D ConvNets (sequence data) over N := 2^L time points
Dilated (gapped) size-2 conv windows, with dilations 1, 2, ..., 2^{L-1} across the L hidden layers
No pooling
These networks underlie Google's WaveNet & ByteNet, state of the art for audio & text!
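A minimal numpy sketch of this wiring: a stack of size-2 causal convolutions with dilations 1, 2, 4, ...; channel sizes and weights are toy placeholders:

```python
import numpy as np

def dilated_conv_stack(x, weights):
    # x: (T, channels); weights[l]: (out_ch, in_ch, 2), the size-2 kernel of layer l
    h = x
    for l, w in enumerate(weights):
        dilation = 2 ** l
        out = np.zeros((h.shape[0], w.shape[0]))
        for t in range(dilation, h.shape[0]):       # combine h[t - dilation] and h[t]
            out[t] = w[:, :, 0] @ h[t - dilation] + w[:, :, 1] @ h[t]
        h = out
    return h

L, T, ch = 3, 16, 4
x = np.random.randn(T, ch)
weights = [np.random.randn(ch, ch, 2) for _ in range(L)]
print(dilated_conv_stack(x, weights).shape)         # (16, 4); receptive field is 2^L = 8 steps
```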

Mixing Tensor Decompositions ↔ Interconnectivity (Cohen+Tamari+Shashua@arXiv 17)
With dilated ConvNets, the mode (axes) tree underlying the corresponding tensor decomposition determines the dilation scheme. The slide shows two trees over 16 time points: the balanced tree merging {1,2}, {3,4}, ... at the bottom corresponds to dilations 1, 2, 4, 8 from the first layer up, while a tree merging {1,3}, {2,4}, ... at the bottom corresponds to a different ordering of the dilations.
A mixed tensor decomposition, blending different mode trees, corresponds to interconnected networks with different dilations.

Efficiency of Interconnectivity (Cohen+Tamari+Shashua@arXiv 17)
Theorem: a mixed tensor decomposition generates tensors that can only be realized by the individual decompositions if these grow quadratically.
Corollary: interconnected dilated ConvNets realize functions that cannot be realized by the individual networks unless these are quadratically larger.
With dilated ConvNets, interconnectivity brings efficiency!

Outline (section divider): entering 8 Inductive Bias of Layer Widths (Levine+Yakira+Cohen+Shashua@arXiv 17)

Convolutional Arithmetic Circuits ↔ Contraction Graphs (Levine+Yakira+Cohen+Shashua@arXiv 17)
The computation of a ConvAC can be cast as a contraction graph G, where:
edge weights hold layer widths (number of channels)
degree-1 nodes correspond to input patches

Correlations ↔ Min-Cut over Layer Widths (Levine+Yakira+Cohen+Shashua@arXiv 17)
Theorem: for an input partition (I, J), the min-cut in G separating the degree-1 nodes of I from those of J equals the rank of the grid tensor matricized w.r.t. (I, J).
Corollary: separation ranks (correlations) of functions realized by a ConvAC are equal to minimal cuts in G (whose edge weights are layer widths).
ConvAC layer widths can be tailored to maximize the correlations required for the task at hand (inductive bias)!
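An illustrative sketch of the min-cut computation on a tiny hypothetical graph (not the paper's exact contraction-graph construction), using networkx with layer widths as edge capacities:

```python
import networkx as nx

# Toy graph: inputs x1..x4 feed intermediate nodes h1, h2, which feed the output.
# Edge capacities stand for layer widths (number of channels).
G = nx.DiGraph()
edges = [("x1", "h1", 8), ("x2", "h1", 8),
         ("x3", "h2", 4), ("x4", "h2", 4),
         ("h1", "out", 16), ("h2", "out", 16)]
for u, v, width in edges:
    G.add_edge(u, v, capacity=width)
    G.add_edge(v, u, capacity=width)

# Separate the inputs of I = {x1, x3} from those of J = {x2, x4} via a super-source/sink.
for node in ["x1", "x3"]:
    G.add_edge("S", node, capacity=float("inf"))
for node in ["x2", "x4"]:
    G.add_edge(node, "T", capacity=float("inf"))

cut_value, _ = nx.minimum_cut(G, "S", "T")
print(cut_value)    # minimum cut separating I from J in this toy graph
```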

Outline (recap): 1 Expressiveness; 2 Expressiveness of Convolutional Networks: Questions; 3 Convolutional Arithmetic Circuits; 4 Efficiency of Depth; 5 Inductive Bias of Pooling Geometry; 6 Efficiency of Overlapping Operations; 7 Efficiency of Interconnectivity; 8 Inductive Bias of Layer Widths

Conclusion
Expressiveness is the driving force behind deep networks.
Formal concepts for treating expressiveness:
Efficiency: a network architecture realizes functions requiring an alternative architecture to be much larger.
Inductive bias: prioritization of some functions over others given prior knowledge on the task at hand.
We analyzed the efficiency and inductive bias of ConvNet architectural features: depth, pooling geometry, overlapping operations, interconnectivity, layer widths.
The fundamental tool underlying all of our analyses: ConvNets ↔ hierarchical tensor decompositions.

Thank You