Margin Preservation of Deep Neural Networks

Jure Sokolić¹, Raja Giryes², Guillermo Sapiro³, Miguel R. D. Rodrigues¹
¹ Department of E&EE, University College London, London, UK
² School of EE, Faculty of Engineering, Tel-Aviv University, Tel Aviv, Israel
³ Department of ECE, Duke University, Durham, North Carolina, USA

(arXiv preprint, v1 [stat.ML], 26 May 2016)

Abstract

The generalization error of deep neural networks is studied in this work via their classification margin, providing novel generalization error bounds that are independent of the network depth and thereby avoiding the common exponential depth dependency, which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.

I. INTRODUCTION

In recent years deep neural networks (DNN) have been used to achieve state-of-the-art results in image recognition, speech recognition and many other fields [8, 10, 11]. DNN are constructed as a series of non-linear signal transformations that are applied one after another, where the parameters of each layer are estimated from the data [11]. Typically, a layer is formed by a linear (or affine) transformation of the input signal followed by a point-wise non-linearity such as a sigmoid function, a hyperbolic tangent function or a rectified linear unit (ReLU) [14]. Many DNN also include pooling layers, which act as down-sampling operators and may be linear or non-linear.

With the remarkable success of DNN, there have been multiple attempts to provide theoretical foundations for the representation power and learning complexity of DNN [2, 4, 5, 6, 13, 20]. An important theoretical aspect of DNN is the effect of their architecture and depth on their generalization error (GE). Various measures such as the VC-dimension [17, 21] and the Rademacher or Gaussian complexities [3] have been used to bound the GE. For example, the VC-dimension of DNN with a hard-threshold non-linearity is equal to the number of parameters in the network, which implies that the sample complexity is linear in the number of parameters of the network. The GE can also be bounded independently of the number of parameters, provided that the norms of the weight matrices (the network's linear components) are constrained appropriately. Such constraints are usually enforced by training networks with weight decay regularization, which is simply the $\ell_1$- or $\ell_2$-norm of all the weights in the network. For example, the work [15] studies the GE of DNN with ReLUs under constraints on the norms of the weight matrices and shows that the GE scales exponentially with the network depth. Similar behaviour is also depicted in [18]. However, while in practice networks with ReLUs of depth greater than 100 do generalize [7], the current bounds would require a number of training samples that grows exponentially with the depth in order to train such networks successfully. Therefore, a different strategy is required to provide theoretical foundations for standard DNN.
Contributions- In this work we focus on the GE of DNN with ReLUs by studying their classification margin. This allows us to use bounds that do not scale exponentially with the network depth. Our strategy is to treat DNN as transforms that map signals from the input space to the feature space. We assume that the transformed signals are classified by a linear classifier in the feature space, and we introduce the concepts of input and output margins:

- The input margin of a training sample is the distance of the training sample to the classification boundary in the input space. Note that the classification boundary in the input space is piecewise linear for DNN with ReLUs and cannot be optimized directly.
- The output margin of a training sample is the distance of the training sample, transformed by the deep neural network, to the classification boundary induced by the linear classifier in the feature space. The output margin is relevant because in practice it is much easier to compute and to optimize than the input margin.

The work of Jure Sokolić and Miguel R. D. Rodrigues was supported in part by EPSRC under grant EP/K033166/1. The work of Guillermo Sapiro was supported in part by NSF, ONR, ARO, and NGA.

As we show in this work, the input margin can be used to bound the GE. However, in practice we can only optimize the output margin. This work shows that DNN that achieve a large output margin on a training set also achieve a large input margin on that training set and therefore generalize, provided that these DNN preserve distances in the neighbourhood of the training samples in the direction normal to the decision boundary in the corresponding space. Moreover, the GE depends on the classification margin, which suggests that deeper networks, which can implement more complex decision boundaries, can achieve a larger margin and generalize better. We characterize the distance preservation property by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples and propose a novel practical regularization method that enforces distance preservation in the direction normal to the corresponding decision boundary. Our results imply that constraints on the $\ell_2$-norm of the weight matrices also guarantee distance preservation, but in a looser fashion compared to our proposed regularization strategy. This suggests that weight decay regularization is inferior to our proposed method, as demonstrated here also in practice.

Related Work- The GE of DNN has been studied via the algorithmic robustness framework in [22], and the resulting bounds are based on the per-unit $\ell_1$-norm of the weight matrices. The work in [9] is related to ours in the sense that the authors propose transforms that are locally isometric and have good generalization properties. However, our study focuses on the classification margin and provides a significantly more detailed characterization of the generalization bounds for DNN. Moreover, it shows that only the distance preservation in the direction normal to the decision boundary is important for bounding the GE, and it suggests a different approach to DNN regularization. The authors in [1] observed that contractive DNN preserve the output margin at the input and proposed a training algorithm for large margin DNN using contractive DNN. However, they do not provide any GE bounds. The training of DNN by promoting a large output margin has also been explored empirically in [19]. Our work provides a theoretical explanation for the success of their training strategy. The work in [16] is related to ours in the sense that it proposes to regularize auto-encoders by constraining the Frobenius norm of the encoder's Jacobian matrix. However, their work is not concerned with the classification margin or GE bounds, and our use of the Jacobian matrix for regularization of DNN is significantly different. Complementing these and other advances in the theoretical foundations of DNN, our work is the only one that provides GE bounds that can leverage the benefits of the network's depth, by exploring the detailed geometrical properties of DNN that lead to classification with a large margin.

Paper organization- Section 2 reviews the algorithmic robustness framework and describes the network architecture. The geometry of DNN is described in Section 3. Margin preservation and GE of DNN are given in Section 4. Section 5 presents experimental results.
The paper is concluded in Section 6. The proofs appear in the Appendix.

II. PRELIMINARIES

Here we review the notion of the GE and present the algorithmic robustness framework that we will use to provide bounds on the GE. We also present the DNN architecture studied in this paper. We focus on DNN without the pooling stage, as often done in the literature, e.g. [15, 23], and defer the study of DNN with pooling layers to future work.

A. Generalization error

We consider a supervised learning task, where a set of training samples is given, and the goal is to find a classifier $g$ with the best performance. We denote the sample space by $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the set of observations, $\mathcal{Y}$ is the set of labels, and the elements of $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ are denoted by $x$, $y$ and $z = (x, y)$, respectively. We assume a binary classification task, so that $\mathcal{Y} = \{-1, 1\}$. Assume that the samples $z$ are drawn from a probability distribution $P$ defined on $\mathcal{Z}$, and denote by $Z_m = \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$ the set of $m$ training samples drawn independently from $P$.

The classification performance of $g$ on a sample $z$ is measured by a loss function $\ell(z, g)$, which might be the 0-1 indicator function or a surrogate for it such as the hinge loss. The empirical loss associated with the training set and the expected loss are defined as
$$\ell_{\mathrm{emp}}(g) = \frac{1}{m} \sum_{z_i \in Z_m} \ell(z_i, g) \quad \text{and} \quad \ell_{\exp}(g) = \mathbb{E}_{z \sim P}\left[\ell(z, g)\right], \quad (1)$$
respectively. An important question, which occupies us throughout this work, is how well $\ell_{\mathrm{emp}}(g)$ predicts $\ell_{\exp}(g)$. The measure we use for quantifying the prediction quality is the difference between $\ell_{\exp}(g)$ and $\ell_{\mathrm{emp}}(g)$, which is called the generalization error. There are various frameworks that allow us to obtain bounds on the GE. In this work we leverage the robustness framework proposed in [22].

1) Algorithmic robustness: The algorithmic robustness framework provides bounds on the GE based on the robustness of a learning algorithm that learns a classifier $g$ from the training set $Z_m$:

Definition 1 ([22]). A learning algorithm is $(K, \epsilon(Z_m))$-robust if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted $C_i$, $i = 1, \ldots, K$, such that for all $z_i \in Z_m$ and all $z \in \mathcal{Z}$ the following holds:
$$z_i, z \in C_i \implies |\ell(z_i, g) - \ell(z, g)| \le \epsilon(Z_m). \quad (2)$$

Note that $z_i$ is an element of the training set and $z$ is an arbitrary element of the partition cell $C_i$ of the sample space. Therefore, a robust learning algorithm chooses a classifier $g$ for which the loss of any $z$ in the neighbourhood of each training sample $z_i \in Z_m$ is bounded by $\epsilon(Z_m)$. The following theorem provides GE bounds for robust algorithms.

Theorem 1 (Theorem 3 in [22]). If a learning algorithm is $(K, \epsilon(Z_m))$-robust and $\ell(z, g) \le M$ for any $z$, $g$, then for any $\delta > 0$, with probability at least $1 - \delta$,
$$\left|\ell_{\exp}(g) - \ell_{\mathrm{emp}}(g)\right| \le \epsilon(Z_m) + M \sqrt{\frac{2K \log(2) + 2 \log(1/\delta)}{m}}. \quad (3)$$

Additional variants of this theorem are provided in [22]. Partitioning of the space $\mathcal{Z}$ is central to the notion of algorithmic robustness. A generic partitioning of $\mathcal{Z}$ can be achieved by covering $\mathcal{Z}$ with $\ell_2$-balls of radius $\epsilon$. The smallest number of balls needed to cover $\mathcal{Z}$ is the covering number $N_\epsilon(\mathcal{Z})$. Note that the covering number of $\mathcal{Z}$ provides a measure of the intrinsic dimension of the distribution that we learn from. For example, a Gaussian mixture model (GMM) with $L$ Gaussians and covariance matrices of rank at most $k$ leads to a covering number $N_\epsilon = L(1 + 2/\epsilon)^k$ [12], and $k$-sparse representable signals in a dictionary with $L$ atoms have a covering number $N_\epsilon = \binom{L}{k}(1 + 2/\epsilon)^k$ [5]. We conjecture that the effective covering number might be even smaller due to the property of DNN that causes the merging of the input subspaces [13].

B. Deep neural networks with rectified linear units

Finally, we describe the binary classifiers based on DNN. The classifier is given as
$$g(x) = v^T f(x), \quad (4)$$
where $v \in \mathbb{R}^{M_L}$ represents the linear classifier operating on the output of a network $f(\cdot)$ with an input vector $x \in \mathbb{R}^N$.¹ The function $f : \mathbb{R}^N \to \mathbb{R}^{M_L}$ represents a deep neural network with $L$ layers:
$$f(x) = f_L(x) = \left[W_L^T f_{L-1}(x) + b_L\right]_+, \qquad f_i(x) = \left[W_i^T f_{i-1}(x) + b_i\right]_+, \quad i = 1, \ldots, L-1, \quad (5)$$
where $f_0(x) = x$, $[\cdot]_+ = \max(\cdot, 0)$ represents the element-wise ReLU non-linearity, $W_i \in \mathbb{R}^{M_{i-1} \times M_i}$, $i = 1, \ldots, L$, are the weight matrices and $b_i \in \mathbb{R}^{M_i}$, $i = 1, \ldots, L$, are the bias vectors. Note that $M_0 = N$.

III. GEOMETRY OF DEEP NEURAL NETWORKS

We now describe how DNN with ReLUs transform the input space as a function of the properties of the weight matrices $W_i$ and the bias vectors $b_i$, $i = 1, \ldots, L$. This leads to a simple bound for the GE of DNN via their input margin.

¹ Without affecting the generality of our results, we omit the classifier bias for simplicity.
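To fix ideas, the following minimal NumPy sketch implements the forward map in (5) and the classifier score in (4) for a small random network. The layer sizes, weights and input below are illustrative assumptions only, not the architectures used in the experiments of Section 5.

```python
# Minimal sketch of the classifier g(x) = v^T f(x) from (4)-(5), using a small
# random ReLU network; sizes, weights and input are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                                  # M_0 = N, M_1, ..., M_L
Ws = [rng.standard_normal((m, n)) / np.sqrt(m)        # W_i in R^{M_{i-1} x M_i}
      for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]   # b_i in R^{M_i}
v = rng.standard_normal(sizes[-1])                    # linear classifier on the output

def relu_net(x, Ws, bs):
    """f(x) as in (5): f_i(x) = [W_i^T f_{i-1}(x) + b_i]_+ with f_0(x) = x."""
    for W, b in zip(Ws, bs):
        x = np.maximum(W.T @ x + b, 0.0)              # element-wise ReLU
    return x

x = rng.standard_normal(sizes[0])
print(np.sign(v @ relu_net(x, Ws, bs)))               # predicted label sign(v^T f(x))
```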

A. Input space partitioning

We first observe that DNN with ReLUs partition the input space, i.e. the space $\mathcal{X}$, into polygons. Furthermore, within each polygon the network behaves as a linear function. Consider the first layer, with the weight matrix $W_1 = [w_{11}, w_{12}, \ldots, w_{1M_1}]$ and the bias vector $b_1 = [b_{11}, \ldots, b_{1M_1}]^T$. Each column of $W_1$ and its corresponding bias element define a hyperplane $\{x \in \mathbb{R}^N : w_{1i}^T x + b_{1i} = 0\}$, $i = 1, \ldots, M_1$. The ReLU acts as a multiplication by 1 if $x$ is on the positive side of the hyperplane and as a multiplication by 0 if $x$ is on the negative side. Therefore, we can represent the ReLU by a diagonal matrix of ones and zeros, where the values on the diagonal depend on $x$. We denote this matrix at layer $l$ by $S_l(x)$. Note that each layer induces a partitioning of the output space of the previous layer. Therefore, the network's output can be written as
$$f(x) = \Big(\prod_{i=1}^{L} S_i(x) W_i^T\Big) x + \sum_{i=1}^{L} \Big( \prod_{j=i+1}^{L} S_j(x) W_j^T \Big) S_i(x) b_i = (F(x))^T x + d(x). \quad (6)$$

Note that the number of unique matrices $S_i(x)$ is finite and is upper bounded by $2^{M_i}$, which is the maximum number of possible different binary codes of $\mathrm{sign}(W_i^T f_{i-1}(x) + b_i)$. Therefore, in a deep network there are at most $\prod_{i=1}^{L} 2^{M_i}$ unique values of the matrix $F(x)$. Moreover, the value of the matrix $F(x)$ is the same within each input partition defined by the set of inequalities of the form $w_{ij}^T f_{i-1}(x) + b_{ij} > 0$ or $w_{ij}^T f_{i-1}(x) + b_{ij} < 0$, $i = 1, \ldots, L$ and $j = 1, \ldots, M_i$.

An important property of the network $f(x)$, which aids us in deriving the GE, is its Jacobian matrix evaluated at $x'$:
$$J(x') = \frac{df(x)}{dx}\bigg|_{x = x'} = \prod_{i=1}^{L} S_i(x') W_i^T. \quad (7)$$
As the derivative of $\max(x, 0)$ is not defined for $x = 0$, we would need to use subderivatives (or subgradients) to define this Jacobian matrix. We avoid this technical complication and simply take the derivative of $\max(x, 0)$ to be 0 when $x = 0$. Note that this does not change the results in any way, because the subset of $\mathcal{X}$ for which the derivatives are not defined has zero measure.

An input space partitioning induced by a two-layer network is shown in Figure 1. Figure 1(a) shows a simple dataset in a two-dimensional input space, and the decision regions and classification boundaries of a given two-layer network. The input space partitioning is shown in Figure 1(b), where the black dotted lines mark the boundaries between different partitions. Within each partition the Jacobian matrix is visualized by an ellipsoid. The semi-major axis of the ellipsoid is proportional to the first singular value of the Jacobian matrix and is oriented in the direction of the first singular vector. The semi-minor axis of the ellipsoid is proportional to the second singular value of the Jacobian and is oriented in the direction of the second singular vector.

B. Distance preservation

Next, we explore how DNN change the geometry of a pair of points $x$ and $x'$. We define the line that connects $x$ and $x'$ as
$$x(t) = x + t(x' - x), \quad t \in [0, 1]. \quad (8)$$
As $t$ goes from 0 to 1, the vector $x(t)$ is contained in different input space partitions. Assume that $x(t)$ passes through $K$ partitions in total. We denote the intervals of $[0, 1]$ that correspond to different partitions by $T_i$, $i = 1, \ldots, K$. Note that the intervals $T_i$ are a function of $x$, $x'$ and the network parameters $W_i$, $b_i$, $i = 1, \ldots, L$. The width of $T_i$ is denoted by $|T_i|$, and we have $|T_1| + |T_2| + \cdots + |T_K| = 1$. The value of the Jacobian is constant for all $t \in T_i$ and is denoted by
$$J_{x,x'}(T_i) \triangleq J(x + t(x' - x)), \quad t \in T_i. \quad (9)$$
The average Jacobian matrix on the line between $x$ and $x'$ is then defined as
$$\bar{J}_{x,x'} \triangleq \sum_{i=1}^{K} |T_i| \, J_{x,x'}(T_i). \quad (10)$$
It is a weighted sum of the Jacobian matrices in all the input space partitions visited by $x(t)$, where the weights correspond to the widths of the intervals $T_i$, $i = 1, \ldots, K$. The average Jacobian matrix is used to relate the difference $x' - x$ to the difference $f(x') - f(x)$, as shown by the following theorem (again, all proofs are provided in the Appendix).

[Figure 1 appears here, with four panels: (a) Input domain; (b) Input space partitioning and Jacobians; (c) Feature domain; (d) Input margin bounds.]

Fig. 1. Plot (a) shows samples of class 1 and 2 and the decision regions produced by a two-layer network. The input space partitioning is shown in Plot (b). The black dotted lines denote the boundaries between input space partitions. The blue ellipsoids in each partition represent the Jacobian matrix: the width and the height of the ellipsoid correspond to the singular values of the Jacobian, and their orientation corresponds to the orientation of the singular vectors. Plot (c) shows the samples transformed by the network and the decision boundary of the linear classifier at the output. Plot (d) shows the boundaries of various sets defined in Section IV and used to bound the input margin.

Theorem 2. For any $x, x' \in \mathcal{X}$, we have
$$f(x') - f(x) = \left( \int_0^1 J(x + t(x' - x)) \, dt \right)(x' - x) = \bar{J}_{x,x'}(x' - x). \quad (11)$$

Therefore, the difference $f(x') - f(x)$ is a function of the difference $x' - x$ and of the average Jacobian on the line segment between $x$ and $x'$. To gain some intuition, see Figure 1(b), where the Jacobian matrix is visualized. The points $x_a$, $x_b$, $x_c$, $x_d$ are in the same partition, where the local Jacobian matrix is denoted by $J_3$. One can see that $J_3$ is (approximately) low rank, and that it contracts distances in the direction of axis $x_1$ and preserves distances in the direction of axis $x_2$. Therefore, one would expect that $f(x_a) \approx f(x_c)$ and $f(x_b) \approx f(x_d)$. This is indeed the case, as depicted in Figure 1(c). One can also observe that different local Jacobians in Figure 1(b) preserve and contract distances in different directions. We now provide bounds on the distance $\|f(x') - f(x)\|_2$; before stating them, the sketch below checks the identity (11) numerically.
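The sketch builds the Jacobian (7) from the ReLU activation patterns $S_i(x)$ and approximates the integral in (11) by a fine Riemann sum along the segment between two points. The small random network and the two endpoints are illustrative assumptions.

```python
# Sketch: Jacobian J(x) from (7) and a numerical check of Theorem 2 / (11).
# The small random network and the two points are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]

def relu_net(x):
    for W, b in zip(Ws, bs):
        x = np.maximum(W.T @ x + b, 0.0)
    return x

def jacobian(x):
    """J(x) = S_L(x) W_L^T ... S_1(x) W_1^T, with S_i(x) the 0/1 ReLU pattern;
    the derivative of max(., 0) is taken to be 0 at 0, as in the text."""
    J, z = np.eye(sizes[0]), x
    for W, b in zip(Ws, bs):
        pre = W.T @ z + b
        J = ((pre > 0).astype(float)[:, None] * W.T) @ J   # left-multiply by S_i(x) W_i^T
        z = np.maximum(pre, 0.0)
    return J

x, x2 = rng.standard_normal(4), rng.standard_normal(4)
# Average Jacobian (10), approximated by a fine Riemann sum over t in [0, 1].
ts = np.linspace(0.0, 1.0, 4001)
J_bar = np.mean([jacobian(x + t * (x2 - x)) for t in ts], axis=0)
err = np.max(np.abs(relu_net(x2) - relu_net(x) - J_bar @ (x2 - x)))
print(err)   # small, and shrinks as the Riemann sum is refined
```

The residual printed at the end is small and decreases as the Riemann sum is refined, since the integrand in (11) is piecewise constant with finitely many breakpoints along the segment.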

Theorem 3. For any $x, x' \in \mathcal{X}$ and a network $f(\cdot)$, we have
$$\|f(x') - f(x)\|_2 = \|\bar{J}_{x,x'}(x' - x)\|_2 \le \prod_{i=1}^{L} \|W_i\|_2 \, \|x' - x\|_2 \le \prod_{i=1}^{L} \|W_i\|_F \, \|x' - x\|_2. \quad (12)$$

The first equality in (12) follows directly from Theorem 2. Therefore, the distance $\|f(x') - f(x)\|_2$ is a function of the singular values of the average Jacobian $\bar{J}_{x,x'}$, but also of the alignment of $x' - x$ with the singular vectors of $\bar{J}_{x,x'}$. On the other hand, the bounds on $\|f(x') - f(x)\|_2$ provided by the two inequalities in (12) are functions only of the distance between $x$ and $x'$, and are expressed in terms of the spectral norms and the Frobenius norms of the weight matrices.

IV. MARGIN PRESERVATION OF DEEP NEURAL NETWORKS

This section characterizes the margin preservation of DNN and provides GE bounds.

A. Output margin, input margin and margin preservation

We start by defining the output margin $\gamma_{\mathrm{out}}(x_i)$ of the sample $x_i$ with the label $y_i$:
$$\gamma_{\mathrm{out}}(x_i) = \sup\{c : \|f(x_i) - f(x)\|_2 \le c \implies y_i v^T f(x) > 0 \;\; \forall x\} = \max\!\left( \frac{y_i v^T f(x_i)}{\|v\|_2}, \, 0 \right). \quad (13)$$
The output margin of sample $x_i$ corresponds to the radius of the largest $\ell_2$-ball centered at $f(x_i)$ that is still contained in the decision region labeled $y_i$. Provided that $\gamma_{\mathrm{out}}(x_i) > 0$, $\gamma_{\mathrm{out}}(x_i)$ is the distance between the point $f(x_i)$ and the hyperplane given by the classifier $v$. The output margin is visualized in Figure 1(c). A large margin classifier $v$ can be found by solving an SVM problem with $(f(x_i), y_i)$, $i = 1, \ldots, m$, as training samples. This has been successfully applied in practice [1, 19].

In order to provide GE bounds we need to understand whether a large output margin implies a large input margin. The input margin $\gamma_{\mathrm{in}}(x_i)$ of the sample $x_i$ with the label $y_i$ is defined as
$$\gamma_{\mathrm{in}}(x_i) = \sup\{c : \|x_i - x\|_2 \le c \implies y_i v^T f(x) > 0 \;\; \forall x\} \quad (14)$$
and it corresponds to the radius of the largest $\ell_2$-ball centered at $x_i$ that is still contained in the decision region labeled $y_i$. The input margin, visualized in Figure 1(a), is a crucial property that determines the GE of a deep neural network classifier, as stated next:

Theorem 4 (Adapted from Example 9 in [22]). If there exists $\gamma$ such that
$$\gamma_{\mathrm{in}}(x_i) > \gamma > 0 \quad \forall (x_i, y_i) \in Z_m, \quad (15)$$
then the classifier $g(x) = v^T f(x)$ is $(2 N_{\gamma/2}(\mathcal{X}), 0)$-robust, provided that $N_{\gamma/2}(\mathcal{X}) < \infty$.

Therefore, provided that the training samples are classified with an input margin larger than $\gamma$, the GE behaves as $c/\sqrt{m}$, where $c$ depends on the covering number $N_{\gamma/2}$. However, maximizing the input margin is hard, as it relies on the decision boundaries of DNN, which are non-linear in general and cannot be expressed in closed form or optimized directly. Thus, we turn to look for a convenient lower bound on the input margin that may be used in practice. We use the output margin together with the network properties. We first define the function $r_i(x)$ associated with the $i$-th training sample $(x_i, y_i)$:
$$r_i(x) = \frac{\gamma_{\mathrm{out}}(x_i)}{\frac{y_i v^T}{\|v\|_2}\left(f(x_i) - f(x)\right)} = \frac{\gamma_{\mathrm{out}}(x_i)}{\frac{y_i v^T}{\|v\|_2}\,\bar{J}_{x_i,x}(x_i - x)}. \quad (16)$$
Assuming that $\gamma_{\mathrm{out}}(x_i) > 0$, $r_i(x) > 1$ implies that $x$ lies in the same decision region as $x_i$, $r_i(x) = 1$ implies that $x$ lies on the decision boundary, and $r_i(x) < 1$ implies that $x$ lies in a different decision region than $x_i$. It is easy to verify, via Theorem 3, that $r_i(x) \ge r_i^J(x) \ge r_i^F(x)$, where
$$r_i^J(x) = \frac{\gamma_{\mathrm{out}}(x_i)}{\left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 \|x_i - x\|_2} \quad \text{and} \quad r_i^F(x) = \frac{\gamma_{\mathrm{out}}(x_i)}{\prod_{l=1}^{L}\|W_l\|_F \, \|x_i - x\|_2}. \quad (17)$$
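As a concrete illustration of (12) and (13), the sketch below computes the output margin of a labelled sample and the spectral- and Frobenius-norm products appearing in Theorem 3; the network, classifier and sample are again illustrative assumptions.

```python
# Sketch: output margin (13) and the norm products from Theorem 3 / (12).
# Random network, classifier and sample are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
sizes = [4, 8, 8, 3]
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]
v = rng.standard_normal(sizes[-1])

def relu_net(x):
    for W, b in zip(Ws, bs):
        x = np.maximum(W.T @ x + b, 0.0)
    return x

x_i, y_i = rng.standard_normal(4), 1.0   # a labelled training sample

# Output margin (13): distance of f(x_i) to the hyperplane v^T f = 0, if positive.
gamma_out = max(y_i * v @ relu_net(x_i) / np.linalg.norm(v), 0.0)

# Factors bounding ||f(x') - f(x)||_2 / ||x' - x||_2 in (12).
spec_prod = np.prod([np.linalg.norm(W, 2) for W in Ws])       # product of spectral norms
frob_prod = np.prod([np.linalg.norm(W, 'fro') for W in Ws])   # product of Frobenius norms

print(gamma_out, spec_prod, frob_prod)   # spec_prod <= frob_prod always holds
```

The ratio of the output margin to the Frobenius-norm product is precisely the closed-form lower bound on the input margin given in (20) below.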

We denote by $R_i = \{x \in \mathbb{R}^N : r_i(x) > 1\}$ the set of all points $x$ contained in the same decision region as $x_i$. The set $R_i$ may be approximated by the set $R_i^J = \{x \in \mathbb{R}^N : r_i^J(x) > 1\}$ or by the set $R_i^F = \{x \in \mathbb{R}^N : r_i^F(x) > 1\}$, such that $R_i^F \subseteq R_i^J \subseteq R_i$, where these inclusions follow from (17). The boundaries of these sets, i.e. $r_i(x) = 1$, $r_i^J(x) = 1$ and $r_i^F(x) = 1$, are visualized in Figure 1(d). By using the definitions of the sets $R_i$, $R_i^J$ and $R_i^F$, we bound the input margin $\gamma_{\mathrm{in}}(x_i)$:

Theorem 5. Assume that the network $f(\cdot)$, the classifier $v$, a training sample $x_i$ and the output margin $\gamma_{\mathrm{out}}(x_i) > 0$ are given. Then the following holds:
$$\gamma_{\mathrm{in}}(x_i) = \sup\{c : \|x_i - x\|_2 \le c \implies x \in R_i \;\; \forall x\} \quad (18)$$
$$\ge \sup\{c : \|x_i - x\|_2 \le c \implies x \in R_i^J \;\; \forall x\} \quad (19)$$
$$\ge \sup\{c : \|x_i - x\|_2 \le c \implies x \in R_i^F \;\; \forall x\} = \frac{\gamma_{\mathrm{out}}(x_i)}{\prod_{l=1}^{L}\|W_l\|_F}. \quad (20)$$

Eq. (18) is a restatement of (14) and is equal to the radius of the largest $\ell_2$-ball centered at $x_i$ that can be inscribed in $R_i$. Similarly, (19) and (20) consider the largest $\ell_2$-ball contained in $R_i^J$ and $R_i^F$, respectively. The lower bound (20), which is expressed in closed form, only rescales the output margin by the factor $\prod_{l=1}^{L}\|W_l\|_F$. On the other hand, though (19) cannot be expressed in closed form, it can provide a much sharper bound on the input margin than (20), because it takes into account the Jacobian matrix of $f(\cdot)$ in the neighbourhood of $x_i$. This is also suggested by Figure 1(d), where it is shown that $r_i^J(x) = 1$ is a much better approximation of $r_i(x) = 1$ than $r_i^F(x) = 1$. By combining Theorems 4 and 5 the GE bounds can be easily derived.

Consequences for very deep networks- It is instructive to observe the result of [15], which suggests that the GE of DNN with ReLUs behaves as
$$\frac{1}{\sqrt{m}} \, 2^L \, \|v\|_2 \prod_{i=1}^{L} \|W_i\|_F, \quad (21)$$
provided that the energy of the training samples is bounded. The bounds based on Theorems 1 and 4 behave as
$$\frac{1}{\sqrt{m}} \sqrt{N_{\gamma/2}(\mathcal{X})}, \quad (22)$$
where $\gamma$ represents the classification margin. The behaviour of the bound (21) suggests that the GE grows exponentially with the network depth even if the product of the Frobenius norms of all the weight matrices is fixed, due to the term $2^L$. The bound (22), based on the robustness framework, on the other hand, implies that the GE can improve with the network depth, since deeper networks can implement more complex decision boundaries, as suggested by recent works [4, 20], and may therefore achieve a better classification margin, leading to a lower GE.

B. Large margin regularization

The proposed framework suggests that DNN with a low GE can be trained by enforcing a large output margin on the training set and by constraining the DNN to be margin preserving. A large margin linear classifier can be obtained by optimizing the hinge loss with an $\ell_2$-norm constraint on the classifier $v$. Margin preservation of the network, on the other hand, can be enforced by constraining the Frobenius norms of the weight matrices, as suggested by (20). Therefore, the popular $\ell_2$ weight decay regularizer, which is usually implemented as
$$\|v\|_2^2 + \sum_{i=1}^{L} \|W_i\|_F^2, \quad (23)$$
leads to a large output margin via the term $\|v\|_2^2$, and it leads to the preservation of the output margin at the input via the term $\sum_{i=1}^{L}\|W_i\|_F^2$, which controls the product of the Frobenius norms. However, Theorem 5 also suggests that a sharper way to control the margin preservation of the network is possible, by constraining the behaviour of the network's Jacobian matrix. We discuss a potential way to implement such a regularizer next. Eq. (19) shows that a training sample $x_i$ will achieve input margin $\gamma$ provided that
$$\|x_i - x\|_2 < \gamma \implies r_i^J(x) > 1 \quad \forall x.$$
Assuming that $\gamma_{\mathrm{out}}(x_i) > 0$ is given, we constrain the denominator of $r_i^J(x)$ as
$$\left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 \|x_i - x\|_2 \le \gamma_{\mathrm{out}}(x_i), \quad \forall x \ \text{with} \ \|x_i - x\|_2 < \gamma. \quad (24)$$

[Figure 2 appears here: four panels plotting accuracy [%] against the number of layers L, for MNIST and CIFAR10 with different numbers of training samples, comparing the LM and WD regularizers.]

Fig. 2. The plots show accuracies for DNN with a different number of layers trained on MNIST and CIFAR10 with large margin regularization (solid line) and weight decay (dashed line). The number next to the dataset name represents the number of training samples.

This formulation is still not feasible for practical implementation, and we impose a stricter condition than (24) by bounding the left-hand side of (24), using the fact that $\|x_i - x\|_2 < \gamma$ and the definitions (9) and (10), to obtain (a detailed derivation is provided in the Appendix)
$$\sup_{x \in \mathcal{X}} \left\|\frac{v^T}{\|v\|_2}\,J(x)\right\|_2 \le \frac{\gamma_{\mathrm{out}}(x_i)}{\gamma}. \quad (25)$$
Therefore, $\gamma_{\mathrm{in}}(x_i) > \gamma$, provided that (25) holds. In practice we promote a large $\gamma_{\mathrm{out}}(x_i)$ by training the network using the hinge loss and by constraining the norm of the classifier $v$. In order to constrain the left-hand side of (25), we assume that the training set is a good approximation of $\mathcal{X}$ and only constrain $\left\|\frac{v^T}{\|v\|_2}\,J(x_i)\right\|_2$, $i = 1, \ldots, m$, which leads to the regularizer
$$\sum_{i=1}^{m} \left\|\frac{v^T}{\|v\|_2}\,J(x_i)\right\|_2. \quad (26)$$
The next section shows that this regularizer outperforms the popular weight decay.

V. EXPERIMENTAL RESULTS

In this section we empirically validate the theoretical results by showing that our novel large margin (LM) regularizer (26) outperforms weight decay (WD) (23). First, we use networks with a different number of layers, where the first layer always has 784 nodes and all subsequent layers have 392 nodes. At the end of the network we use a 10-class classifier and train the network with the multi-class hinge loss. We use MNIST and CIFAR10; for the latter we reduce the dimension to 784 by principal component analysis. Additional training details are provided in the Appendix. We report the results for smaller training sets, where the difference between LM regularization and WD is more significant. The results are reported in Figure 2. We observe that the proposed LM regularization always outperforms WD.

Second, we demonstrate the use of LM regularization with convolutional neural networks (CNN). We choose MNIST for these experiments, since the $\ell_2$-margin is more suitable for this dataset than for CIFAR10. We use a 2-layer CNN with the following architecture: (32, 5, 5)-conv, (2, 2)-max-pool, (32, 5, 5)-conv, (2, 2)-max-pool, followed by a linear classifier, and a 3-layer CNN that has an additional (32, 5, 5)-conv layer before the linear classifier. We also compare the multi-class hinge loss and the categorical cross entropy (CCE) loss. The results for training with no regularization, WD and LM regularization are reported in Table I. We observe that CNN trained with the hinge loss always outperform the networks trained with the CCE loss, provided that WD or LM regularization is used. We also observe that WD always outperforms or is at least as good as no regularization, and LM regularization always outperforms WD, independently of the loss function used.

VI. CONCLUSIONS

This paper studies the generalization error of deep networks based on their classification margin. Generalization error bounds based on the classification margin do not suffer from the exponential dependence on the network depth exhibited by some recent bounds in the literature, which renders those bounds unrealistic for the very deep networks (hundreds of layers) currently in use.
Moreover, the paper explains how DNN that achieve a large classification margin can be trained by using a large margin linear classifier at the output of the DNN and by constraining the DNN to preserve distances in the direction normal to the decision boundary, which is achieved by constraining the Jacobian matrix of the network. The presented results show that such a strategy outperforms the popular weight decay.

TABLE I
CLASSIFICATION ACCURACY [%] OF CNNS ON MNIST.

[The table reports, for the 2- and 3-layer CNNs trained with the hinge and CCE losses, the accuracy obtained with no regularization, with WD and with LM regularization, for training sets of 256, 512 and 1024 samples.]

Future work will include extensions of the theory to DNN with pooling and to other DNN architectures such as Deep Residual Networks. Another important direction is the consideration of metrics other than $\ell_2$ to measure the classification margin, which are more suitable for datasets where the Euclidean distance is not appropriate.

REFERENCES

[1] S. An, M. Hayat, S. H. Khan, M. Bennamoun, F. Boussaid, and F. Sohel. Contractive rectifier networks for nonlinear maximum margin classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[2] F. Bach. Breaking the curse of dimensionality with convex neural networks. arXiv preprint, 2014.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. The Journal of Machine Learning Research (JMLR), 3:463–482, 2002.
[4] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: a tensor analysis. arXiv preprint, 2015.
[5] R. Giryes, G. Sapiro, and A. M. Bronstein. Deep neural networks with random Gaussian weights: a universal classification strategy? arXiv preprint, 2015.
[6] B. D. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint, 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, Dec. 2015.
[8] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Oct. 2012.
[9] J. Huang, Q. Qiu, G. Sapiro, and R. Calderbank. Discriminative robust transformation learning. In Advances in Neural Information Processing Systems (NIPS), 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[11] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[12] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3), Dec. 2008.
[13] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
[14] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
[15] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.
[16] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
[17] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[18] S. Sun, W. Chen, L. Wang, and T.-Y. Liu. Large margin deep neural networks: theory and algorithms. arXiv preprint, 2015.
[19] Y. Tang. Deep learning using linear support vector machines. In Workshop on Representation Learning, ICML, 2013.
[20] M. Telgarsky. Benefits of depth in neural networks. arXiv preprint, 2016.
[21] V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), Sep. 1999.
[22] H. Xu and S. Mannor. Robustness and generalization. Machine Learning, 86(3), 2012.
[23] Y. Zhang, J. D. Lee, M. J. Wainwright, and M. I. Jordan. Learning halfspaces and neural networks with random initialization. arXiv preprint, 2015.

APPENDIX

Proof of Theorem 2

We first note that the line between $x$ and $x'$ is given by $x + t(x' - x)$, $t \in [0, 1]$. We define the function $F(t) = f(x + t(x' - x))$ and observe that $\frac{dF(t)}{dt} = J(x + t(x' - x))(x' - x)$. Now, by the generalized fundamental theorem of calculus (or the Lebesgue differentiation theorem), we write
$$f(x') - f(x) = F(1) - F(0) = \int_0^1 \frac{dF(t)}{dt}\, dt = \left( \int_0^1 J(x + t(x' - x)) \, dt \right)(x' - x). \quad (27)$$
The integral on the right-hand side can be written as the weighted sum (10) because $J(x)$ is piecewise constant along the line. This concludes the proof.

Proof of Theorem 3

The equality in (12) follows directly from Theorem 2. The first inequality in (12) follows from the bound $\|\bar{J}_{x,x'}(x - x')\|_2 \le \|\bar{J}_{x,x'}\|_2 \, \|x - x'\|_2$; from the fact that $\|\bar{J}_{x,x'}\|_2$ can be upper bounded by $\max_i \|J_{x,x'}(T_i)\|_2$, since $\bar{J}_{x,x'}$ is a weighted sum of Jacobians with the sum of the weights equal to 1; and from the fact that $J(x)$ is a product of the weight matrices $W_i$ and the ReLU matrices $S_i(x)$. The spectral norm of a matrix product is bounded by the product of the spectral norms of the matrices, and since $\|S_i(x)\|_2 \le 1$ the inequality holds. Finally, the second inequality is obtained from the first inequality by noting that the Frobenius norm always bounds the spectral norm, $\|W\|_2 \le \|W\|_F$.

Proof of Theorem 5

Recall the definition of the input margin in (14) and note that
$$y_i v^T f(x) = y_i \left( v^T f(x_i) + v^T (f(x) - f(x_i)) \right) = \|v\|_2 \, \gamma_{\mathrm{out}}(x_i) + y_i v^T (f(x) - f(x_i)),$$
where we have leveraged the assumption $\gamma_{\mathrm{out}}(x_i) > 0$. Therefore,
$$y_i v^T f(x) > 0 \iff \gamma_{\mathrm{out}}(x_i) > \frac{y_i v^T}{\|v\|_2}\left(f(x_i) - f(x)\right) \iff r_i(x) > 1. \quad (28)$$
This leads to (18). Eqs. (19) and (20) lower bound (18) because $R_i^J$ and $R_i^F$ are subsets of $R_i$, which implies that the solutions of the optimization problems in (19) and (20) can only be smaller than or equal to the solution of (18). The closed-form expression in (20) is obtained by solving (20) for $c$.

Derivation of equation (25)

We start from (24):
$$\left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 \|x_i - x\|_2 \le \left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 \, \gamma. \quad (29)$$
Note now that
$$\left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 = \left\|\frac{v^T}{\|v\|_2}\sum_{k=1}^{K}|T_k|\,J_{x_i,x}(T_k)\right\|_2 \le \sum_{k=1}^{K}|T_k|\left\|\frac{v^T}{\|v\|_2}\,J_{x_i,x}(T_k)\right\|_2 \le \max_k \left\|\frac{v^T}{\|v\|_2}\,J_{x_i,x}(T_k)\right\|_2, \quad (30)$$
where the first equality is due to (10), the second inequality follows from the triangle inequality, and the third inequality is due to the fact that the $|T_k|$, $k = 1, \ldots, K$, sum to 1. Note that, by the definition in (9), $J_{x_i,x}(T_k) = J(x^\star)$ is the Jacobian matrix evaluated at some point $x^\star$ in the input space. Therefore, we can further bound
$$\max_k \left\|\frac{v^T}{\|v\|_2}\,J_{x_i,x}(T_k)\right\|_2 \le \sup_{x \in \mathcal{X}} \left\|\frac{v^T}{\|v\|_2}\,J(x)\right\|_2. \quad (31)$$
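The chain (29)-(31) can also be illustrated numerically: by the triangle inequality, the directional norm of a (sampled) average Jacobian never exceeds its maximum along the segment. The following sketch confirms this for an illustrative random network, classifier and pair of points.

```python
# Numerical illustration of the bound (30): the directional norm of the sampled
# average Jacobian never exceeds its maximum along the segment. The random
# network, classifier and points are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
sizes = [4, 8, 8, 3]
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]
v = rng.standard_normal(sizes[-1])
u = v / np.linalg.norm(v)                       # v^T / ||v||_2

def jacobian(x):
    J, z = np.eye(sizes[0]), x
    for W, b in zip(Ws, bs):
        pre = W.T @ z + b
        J = ((pre > 0).astype(float)[:, None] * W.T) @ J
        z = np.maximum(pre, 0.0)
    return J

x_i, x = rng.standard_normal(4), rng.standard_normal(4)
ts = np.linspace(0.0, 1.0, 2001)
Js = [jacobian(x_i + t * (x - x_i)) for t in ts]

lhs = np.linalg.norm(u @ np.mean(Js, axis=0))   # || v^T/||v||_2 * average Jacobian ||_2
rhs = max(np.linalg.norm(u @ J) for J in Js)    # max along the segment, cf. (30)
print(lhs <= rhs + 1e-12)                       # True, by the triangle inequality
```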

Experimental procedure details

All the networks were trained using stochastic gradient descent (SGD) with momentum, which was set to 0.9. Results are reported for the best test set performance achieved. Since we use a multi-class classifier, the Frobenius norm of the matrix of classification vectors is constrained instead of the norm of $v$.

MNIST and CIFAR10 DNN- The networks contain 784 units in the first layer and 392 units in the higher layers. The batch size was set to 128, and the networks were trained for 110 epochs with a step-wise learning rate schedule. The weight decay penalty was chosen from the set $\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$, the classification matrix penalty was chosen from the set $\{0, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, and the large margin regularization penalty was chosen from the set $\{0, 10^{-2}, 10^{-1}, 1, 2\}$ and then divided by the batch size. Since the regularization (26) assumes a single classification vector $v$, we took for $v$ one of the multiple classification vectors, where the choice was random for each $x_i$ in each mini-batch.

MNIST CNN- The 2-layer CNN architecture is the following: (32, 5, 5)-conv, (2, 2)-max-pool, (32, 5, 5)-conv, (2, 2)-max-pool, followed by a linear classifier. The 3-layer CNN has an additional (32, 5, 5)-conv layer before the linear classifier. For the CCE loss the linear classifier is followed by the softmax non-linearity. The batch size was set to 32, and the networks were trained for 100 epochs with a step-wise learning rate schedule. The weight decay regularization penalty was chosen from the set $\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, the classification matrix penalty was chosen from the set $\{0, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, and the large margin regularization penalty was chosen from the set $\{0, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$ and then divided by the batch size. Since the regularization (26) assumes a single classification vector $v$, here we sum the regularization term (26) over all possible classification vectors for each sample $x_i$ in each mini-batch.
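The following rough NumPy sketch shows how the mini-batch large margin regularizer (26) described above could be assembled in the multi-class setting, with a random classification vector chosen per sample as in the MNIST and CIFAR10 DNN experiments. The sizes, weights and batch below are illustrative assumptions; in actual training the term is added to the hinge loss and differentiated through by the optimizer, whereas here it is only evaluated.

```python
# Rough sketch of the mini-batch large margin regularizer (26) in the multi-class
# setting: for each sample a random classification vector is used, as in the DNN
# experiments above. All sizes, weights and the batch are illustrative assumptions;
# in training this term would be added to the hinge loss and differentiated through.
import numpy as np

rng = np.random.default_rng(4)
sizes = [16, 12, 12, 8]                              # small stand-in for the 784/392 layers
num_classes, batch_size = 10, 5
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]
V = rng.standard_normal((sizes[-1], num_classes))    # matrix of classification vectors

def jacobian(x):
    """J(x) of the ReLU network as in (7), built from the activation patterns S_i(x)."""
    J, z = np.eye(sizes[0]), x
    for W, b in zip(Ws, bs):
        pre = W.T @ z + b
        J = ((pre > 0).astype(float)[:, None] * W.T) @ J
        z = np.maximum(pre, 0.0)
    return J

X = rng.standard_normal((batch_size, sizes[0]))      # a mini-batch of inputs

reg = 0.0
for x_i in X:
    c = rng.integers(num_classes)                    # random classification vector per sample
    u = V[:, c] / np.linalg.norm(V[:, c])            # v^T / ||v||_2 for that class
    reg += np.linalg.norm(u @ jacobian(x_i))         # || v^T/||v||_2 J(x_i) ||_2, cf. (26)
reg /= batch_size                                     # the penalty is divided by the batch size
print(reg)
```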


More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

arxiv: v2 [stat.ml] 7 Jun 2014

arxiv: v2 [stat.ml] 7 Jun 2014 On the Number of Linear Regions of Deep Neural Networks arxiv:1402.1869v2 [stat.ml] 7 Jun 2014 Guido Montúfar Max Planck Institute for Mathematics in the Sciences montufar@mis.mpg.de Kyunghyun Cho Université

More information

CLOSE-TO-CLEAN REGULARIZATION RELATES

CLOSE-TO-CLEAN REGULARIZATION RELATES Worshop trac - ICLR 016 CLOSE-TO-CLEAN REGULARIZATION RELATES VIRTUAL ADVERSARIAL TRAINING, LADDER NETWORKS AND OTHERS Mudassar Abbas, Jyri Kivinen, Tapani Raio Department of Computer Science, School of

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Negative Momentum for Improved Game Dynamics

Negative Momentum for Improved Game Dynamics Negative Momentum for Improved Game Dynamics Gauthier Gidel Reyhane Askari Hemmat Mohammad Pezeshki Gabriel Huang Rémi Lepriol Simon Lacoste-Julien Ioannis Mitliagkas Mila & DIRO, Université de Montréal

More information

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

arxiv: v2 [cs.sd] 7 Feb 2018

arxiv: v2 [cs.sd] 7 Feb 2018 AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang ong*, Yong Xu*, Wenwu Wang, Mark D. Plumbley Center for Vision, Speech and Signal Processing, University of Surrey, U

More information

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Goal: approximate some function f e.g., a classifier, maps input to a class y = f (x) x y Defines a mapping

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

arxiv: v1 [cs.lg] 4 Mar 2019

arxiv: v1 [cs.lg] 4 Mar 2019 A Fundamental Performance Limitation for Adversarial Classification Abed AlRahman Al Makdah, Vaibhav Katewa, and Fabio Pasqualetti arxiv:1903.01032v1 [cs.lg] 4 Mar 2019 Abstract Despite the widespread

More information

Expressiveness of Rectifier Networks

Expressiveness of Rectifier Networks Xingyuan Pan Vivek Srikumar The University of Utah, Salt Lake City, UT 84112, USA XPAN@CS.UTAH.EDU SVIVEK@CS.UTAH.EDU From the learning point of view, the choice of an activation function is driven by

More information

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

More information

Deep Learning (CNNs)

Deep Learning (CNNs) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Deep Convolutional Neural Networks for Pairwise Causality

Deep Convolutional Neural Networks for Pairwise Causality Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, Delhi Tata Consultancy Services Ltd. {karamjit.singh,

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17 3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural

More information

arxiv: v1 [cs.lg] 25 Sep 2018

arxiv: v1 [cs.lg] 25 Sep 2018 Utilizing Class Information for DNN Representation Shaping Daeyoung Choi and Wonjong Rhee Department of Transdisciplinary Studies Seoul National University Seoul, 08826, South Korea {choid, wrhee}@snu.ac.kr

More information

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley Collaborators Joint work with Samy Bengio, Moritz Hardt, Michael Jordan, Jason Lee, Max Simchowitz,

More information

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang Supervised v.s. Unsupervised Math formulation for supervised learning Given training data x i, y i

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

More information

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University.

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University. Nonlinear Models Numerical Methods for Deep Learning Lars Ruthotto Departments of Mathematics and Computer Science, Emory University Intro 1 Course Overview Intro 2 Course Overview Lecture 1: Linear Models

More information

CSC 576: Variants of Sparse Learning

CSC 576: Variants of Sparse Learning CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

Sample width for multi-category classifiers

Sample width for multi-category classifiers R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Robustness of classifiers: from adversarial to random noise

Robustness of classifiers: from adversarial to random noise Robustness of classifiers: from adversarial to random noise Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard École Polytechnique Fédérale de Lausanne Lausanne, Switzerland {alhussein.fawzi,

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

arxiv: v1 [cs.lg] 30 Sep 2018

arxiv: v1 [cs.lg] 30 Sep 2018 Deep, Skinny Neural Networks are not Universal Approximators arxiv:1810.00393v1 [cs.lg] 30 Sep 2018 Jesse Johnson Sanofi jejo.math@gmail.com October 2, 2018 Abstract In order to choose a neural network

More information

COR-OPT Seminar Reading List Sp 18

COR-OPT Seminar Reading List Sp 18 COR-OPT Seminar Reading List Sp 18 Damek Davis January 28, 2018 References [1] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank Solutions of Linear Matrix Equations via Procrustes

More information