Margin Preservation of Deep Neural Networks

Jure Sokolić¹, Raja Giryes², Guillermo Sapiro³, Miguel R. D. Rodrigues¹
¹ Department of E&EE, University College London, London, UK
² School of EE, Faculty of Engineering, Tel-Aviv University, Tel Aviv, Israel
³ Department of ECE, Duke University, Durham, North Carolina, USA

(arXiv preprint, v1 [stat.ML], 26 May 2016)

Abstract

The generalization error of deep neural networks is studied in this work via their classification margin, providing novel generalization error bounds that are independent of the network depth and thereby avoiding the common exponential depth dependency, which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.

I. INTRODUCTION

In recent years deep neural networks (DNN) have been used to achieve state-of-the-art results in image recognition, speech recognition and many other fields [8, 10, 11]. DNN are constructed as a series of non-linear signal transformations that are applied one after another, where the parameters of each layer are estimated from the data [11]. Typically, a layer is formed by a linear (or affine) transformation of the input signal followed by a point-wise non-linearity such as a sigmoid function, a hyperbolic tangent function or a rectified linear unit (ReLU) [14]. Many DNN also include pooling layers, which act as down-sampling operators and may be linear or non-linear.

With the remarkable success of DNN, there have been multiple attempts to provide theoretical foundations for the representation power and learning complexity of DNN [2, 4, 5, 6, 13, 20]. An important theoretical aspect of DNN is the effect of their architecture and depth on their generalization error (GE). Various measures such as the VC-dimension [17, 21] and the Rademacher or Gaussian complexities [3] have been used to bound the GE. For example, the VC-dimension of DNN with a hard-threshold non-linearity is equal to the number of parameters in the network, which implies that the sample complexity is linear in the number of parameters of the network. The GE can also be bounded independently of the number of parameters, provided that the norms of the weight matrices (the network's linear components) are constrained appropriately. Such constraints are usually enforced by training networks with weight decay regularization, which is simply the $\ell_1$- or $\ell_2$-norm of all the weights in the network. For example, the work [15] studies the GE of DNN with ReLUs under constraints on the norms of the weight matrices and shows that the GE scales exponentially with the network depth. Similar behaviour is also depicted in [18]. However, while in practice networks with ReLUs of depth greater than 100 do generalize [7], the current bounds would require a number of training samples that grows exponentially with the depth in order to train such networks successfully. Therefore, a different strategy is required to provide theoretical foundations for standard DNN.
Contributions- In this work we focus on the GE of DNN with ReLUs by studying their classification margin. This allows us to use bounds that do not scale exponentially with the network depth. Our strategy is to treat DNN as transforms that map signals from the input space to the feature space. We assume that the transformed signals are classified by a linear classifier in the feature space, and we introduce the concepts of input and output margins:

- The input margin of a training sample is the distance of the training sample to the classification boundary in the input space. Note that the classification boundary in the input space is piecewise linear for DNN with ReLUs and cannot be optimized directly.
- The output margin of a training sample is the distance of the training sample, transformed by the deep neural network, to the classification boundary induced by the linear classifier in the feature space. The output margin is relevant because in practice it is much easier to compute and to optimize than the input margin.

The work of Jure Sokolić and Miguel R. D. Rodrigues was supported in part by EPSRC under grant EP/K033166/1. The work of Guillermo Sapiro was supported in part by NSF, ONR, ARO, and NGA.

As we show in this work, the input margin can be used to bound the GE. However, in practice we can only optimize the output margin. This work shows that DNN that achieve a large output margin on a training set also achieve a large input margin on that training set and therefore generalize, provided that these DNN preserve distances in the neighbourhood of the training samples in the direction normal to the decision boundary in the corresponding space. Moreover, the GE depends on the classification margin, which suggests that deeper networks, which can implement more complex decision boundaries, can achieve a larger margin and generalize better. We characterize the distance preservation property by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples and propose a novel practical regularization method that enforces distance preservation in the direction normal to the corresponding decision boundary. Our results imply that constraints on the $\ell_2$-norm of the weight matrices also guarantee distance preservation, but in a looser fashion compared to our proposed regularization strategy. This suggests that weight decay regularization is inferior to our proposed method, as demonstrated here also in practice.

Related Work- The GE of DNN has been studied via the algorithmic robustness framework in [22], and the resulting bounds are based on the per-unit $\ell_1$-norm of the weight matrices. The work in [9] is related to ours in the sense that the authors propose transforms that are locally isometric and have good generalization properties. However, our study focuses on the classification margin and provides a significantly more detailed characterization of the generalization bounds for DNN. Moreover, it shows that only the distance preservation in the direction normal to the decision boundary is important for bounding the GE, and it suggests a different approach to DNN regularization. The authors in [1] observed that contractive DNN preserve the output margin at the input and proposed a training algorithm for large margin DNN using contractive DNN. However, they do not provide any GE bounds. The training of DNN by promoting a large output margin has also been explored empirically in [19]. Our work provides a theoretical explanation for the success of their training strategy. The work in [16] is related to ours in the sense that it proposes to regularize auto-encoders by constraining the Frobenius norm of the encoder's Jacobian matrix. However, their work is not concerned with the classification margin or GE bounds, and our use of the Jacobian matrix for regularization of DNN is significantly different. Complementing these and other advances in the theoretical foundations of DNN, our work is the only one that provides GE bounds that can leverage the benefits of the network's depth, by exploring the detailed geometrical properties of DNN that lead to classification with a large margin.

Paper organization- Section 2 reviews the algorithmic robustness framework and describes the network architecture. The geometry of DNN is described in Section 3. Margin preservation and GE of DNN are given in Section 4. Section 5 presents experimental results.
The paper is concluded in Section 6. The proofs appear in the Appendix.

II. PRELIMINARIES

Here we review the notion of the GE and present the algorithmic robustness framework that we will use to provide bounds on the GE. We also present the DNN architecture studied in this paper. We focus on DNN without the pooling stage, as often done in the literature, e.g. [15, 23], and defer the study of DNN with pooling layers to future work.

A. Generalization error

We consider a supervised learning task, where a set of training samples is given, and the goal is to find a classifier $g$ with the best performance. We denote the sample space by $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the set of observations, $\mathcal{Y}$ is the set of labels, and the elements of $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ are denoted by $x$, $y$ and $z = (x, y)$, respectively. We assume a binary classification task, so that $\mathcal{Y} = \{-1, 1\}$. Assume that the samples $z$ are drawn from a probability distribution $P$ defined on $\mathcal{Z}$, and denote by $Z_m = \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$ the set of $m$ training samples drawn independently from $P$.

The classification performance of $g$ on a sample $z$ is measured by a loss function $\ell(z, g)$, which might be the 0-1 indicator function or a surrogate for it such as the hinge loss. The empirical loss associated with the training set and the expected loss are defined as
$$\ell_{\mathrm{emp}}(g) = \frac{1}{m} \sum_{z_i \in Z_m} \ell(z_i, g) \quad \text{and} \quad \ell_{\exp}(g) = \mathbb{E}_{z \sim P}\left[\ell(z, g)\right], \quad (1)$$
respectively. An important question, which occupies us throughout this work, is how well $\ell_{\mathrm{emp}}(g)$ predicts $\ell_{\exp}(g)$. The measure we use for quantifying the prediction quality is the difference between $\ell_{\exp}(g)$ and $\ell_{\mathrm{emp}}(g)$, which is called the generalization error. There are various frameworks that allow us to obtain bounds on the GE. In this work we leverage the robustness framework proposed in [22].

1) Algorithmic robustness: The algorithmic robustness framework provides bounds on the GE based on the robustness of a learning algorithm that learns a classifier $g$ from the training set $Z_m$:

Definition 1 ([22]). A learning algorithm is $(K, \epsilon(Z_m))$-robust if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted $C_i$, $i = 1, \ldots, K$, such that for all $z_i \in Z_m$ and all $z \in \mathcal{Z}$ the following holds:
$$z_i, z \in C_i \implies |\ell(z_i, g) - \ell(z, g)| \le \epsilon(Z_m). \quad (2)$$

Note that $z_i$ is an element of the training set and $z$ is an arbitrary element of the partition cell $C_i$ of the sample space. Therefore, a robust learning algorithm chooses a classifier $g$ for which the loss of any $z$ in the neighbourhood of each training sample $z_i \in Z_m$ is bounded by $\epsilon(Z_m)$. The following theorem provides GE bounds for robust algorithms.

Theorem 1 (Theorem 3 in [22]). If a learning algorithm is $(K, \epsilon(Z_m))$-robust and $\ell(z, g) \le M$ for any $z$, $g$, then for any $\delta > 0$, with probability at least $1 - \delta$,
$$\left|\ell_{\exp}(g) - \ell_{\mathrm{emp}}(g)\right| \le \epsilon(Z_m) + M \sqrt{\frac{2K \log(2) + 2 \log(1/\delta)}{m}}. \quad (3)$$

Additional variants of this theorem are provided in [22]. Partitioning of the space $\mathcal{Z}$ is central to the notion of algorithmic robustness. A generic partitioning of $\mathcal{Z}$ can be achieved by covering $\mathcal{Z}$ with $\ell_2$-balls of radius $\epsilon$. The smallest number of balls needed to cover $\mathcal{Z}$ is the covering number $N_\epsilon(\mathcal{Z})$. Note that the covering number of $\mathcal{Z}$ provides a measure of the intrinsic dimension of the distribution that we learn from. For example, a Gaussian mixture model (GMM) with $L$ Gaussians and covariance matrices of rank at most $k$ leads to a covering number $N_\epsilon = L(1 + 2/\epsilon)^k$ [12], and $k$-sparse representable signals in a dictionary with $L$ atoms have a covering number $N_\epsilon = \binom{L}{k}(1 + 2/\epsilon)^k$ [5]. We conjecture that the effective covering number might be even smaller due to the property of DNN that causes the merging of the input subspaces [13].

B. Deep neural networks with rectified linear units

Finally, we describe the binary classifiers based on DNN. The classifier is given as
$$g(x) = v^T f(x), \quad (4)$$
where $v \in \mathbb{R}^{M_L}$ represents the linear classifier operating on the output of a network $f(\cdot)$ with an input vector $x \in \mathbb{R}^N$.¹ The function $f : \mathbb{R}^N \to \mathbb{R}^{M_L}$ represents a deep neural network with $L$ layers:
$$f(x) = f_L(x) = \left[W_L^T f_{L-1}(x) + b_L\right]_+, \qquad f_i(x) = \left[W_i^T f_{i-1}(x) + b_i\right]_+, \quad i = 1, \ldots, L-1, \quad (5)$$
where $f_0(x) = x$, $[\cdot]_+ = \max(\cdot, 0)$ represents the element-wise ReLU non-linearity, $W_i \in \mathbb{R}^{M_{i-1} \times M_i}$, $i = 1, \ldots, L$, are the weight matrices and $b_i \in \mathbb{R}^{M_i}$, $i = 1, \ldots, L$, are the bias vectors. Note that $M_0 = N$.

III. GEOMETRY OF DEEP NEURAL NETWORKS

We now describe how DNN with ReLUs transform the input space as a function of the properties of the weight matrices $W_i$ and the bias vectors $b_i$, $i = 1, \ldots, L$. This leads to a simple bound for the GE of DNN via their input margin.

¹ Without affecting the generality of our results, we omit the classifier bias for simplicity.
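To fix ideas, the following minimal NumPy sketch implements the forward map in (5) and the classifier score in (4) for a small random network. The layer sizes, weights and input below are illustrative assumptions only, not the architectures used in the experiments of Section 5.

```python
# Minimal sketch of the classifier g(x) = v^T f(x) from (4)-(5), using a small
# random ReLU network; sizes, weights and input are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                                  # M_0 = N, M_1, ..., M_L
Ws = [rng.standard_normal((m, n)) / np.sqrt(m)        # W_i in R^{M_{i-1} x M_i}
      for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]   # b_i in R^{M_i}
v = rng.standard_normal(sizes[-1])                    # linear classifier on the output

def relu_net(x, Ws, bs):
    """f(x) as in (5): f_i(x) = [W_i^T f_{i-1}(x) + b_i]_+ with f_0(x) = x."""
    for W, b in zip(Ws, bs):
        x = np.maximum(W.T @ x + b, 0.0)              # element-wise ReLU
    return x

x = rng.standard_normal(sizes[0])
print(np.sign(v @ relu_net(x, Ws, bs)))               # predicted label sign(v^T f(x))
```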

A. Input space partitioning

We first observe that DNN with ReLUs partition the input space, i.e. the space $\mathcal{X}$, into polygons. Furthermore, within each polygon the network behaves as a linear function. Consider the first layer, with the weight matrix $W_1 = [w_{11}, w_{12}, \ldots, w_{1M_1}]$ and the bias vector $b_1 = [b_{11}, \ldots, b_{1M_1}]^T$. Each column of $W_1$ and its corresponding bias element define a hyperplane $\{x \in \mathbb{R}^N : w_{1i}^T x + b_{1i} = 0\}$, $i = 1, \ldots, M_1$. The ReLU acts as a multiplication by 1 if $x$ is on the positive side of the hyperplane and as a multiplication by 0 if $x$ is on the negative side. Therefore, we can represent the ReLU by a diagonal matrix of ones and zeros, where the values on the diagonal depend on $x$. We denote this matrix at layer $l$ by $S_l(x)$. Note that each layer induces a partitioning of the output space of the previous layer. Therefore, the network's output can be written as
$$f(x) = \Big(\prod_{i=1}^{L} S_i(x) W_i^T\Big) x + \sum_{i=1}^{L} \Big( \prod_{j=i+1}^{L} S_j(x) W_j^T \Big) S_i(x) b_i = (F(x))^T x + d(x). \quad (6)$$

Note that the number of unique matrices $S_i(x)$ is finite and is upper bounded by $2^{M_i}$, which is the maximum number of possible different binary codes of $\mathrm{sign}(W_i^T f_{i-1}(x) + b_i)$. Therefore, in a deep network there are at most $\prod_{i=1}^{L} 2^{M_i}$ unique values of the matrix $F(x)$. Moreover, the value of the matrix $F(x)$ is the same within each input partition defined by the set of inequalities of the form $w_{ij}^T f_{i-1}(x) + b_{ij} > 0$ or $w_{ij}^T f_{i-1}(x) + b_{ij} < 0$, $i = 1, \ldots, L$ and $j = 1, \ldots, M_i$.

An important property of the network $f(x)$, which aids us in deriving the GE, is its Jacobian matrix evaluated at $x'$:
$$J(x') = \frac{df(x)}{dx}\bigg|_{x = x'} = \prod_{i=1}^{L} S_i(x') W_i^T. \quad (7)$$
As the derivative of $\max(x, 0)$ is not defined for $x = 0$, we would need to use subderivatives (or subgradients) to define this Jacobian matrix. We avoid this technical complication and simply take the derivative of $\max(x, 0)$ to be 0 when $x = 0$. Note that this does not change the results in any way, because the subset of $\mathcal{X}$ for which the derivatives are not defined has zero measure.

An input space partitioning induced by a two-layer network is shown in Figure 1. Figure 1(a) shows a simple dataset in a two-dimensional input space, and the decision regions and classification boundaries of a given two-layer network. The input space partitioning is shown in Figure 1(b), where the black dotted lines mark the boundaries between different partitions. Within each partition the Jacobian matrix is visualized by an ellipsoid. The semi-major axis of the ellipsoid is proportional to the first singular value of the Jacobian matrix and is oriented in the direction of the first singular vector. The semi-minor axis of the ellipsoid is proportional to the second singular value of the Jacobian and is oriented in the direction of the second singular vector.

B. Distance preservation

Next, we explore how DNN change the geometry of a pair of points $x$ and $x'$. We define the line that connects $x$ and $x'$ as
$$x(t) = x + t(x' - x), \quad t \in [0, 1]. \quad (8)$$
As $t$ goes from 0 to 1, the vector $x(t)$ is contained in different input space partitions. Assume that $x(t)$ passes through $K$ partitions in total. We denote the intervals of $[0, 1]$ that correspond to different partitions by $T_i$, $i = 1, \ldots, K$. Note that the intervals $T_i$ are a function of $x$, $x'$ and the network parameters $W_i$, $b_i$, $i = 1, \ldots, L$. The width of $T_i$ is denoted by $|T_i|$, and we have $|T_1| + |T_2| + \cdots + |T_K| = 1$. The value of the Jacobian is constant for all $t \in T_i$ and is denoted by
$$J_{x,x'}(T_i) \triangleq J(x + t(x' - x)), \quad t \in T_i. \quad (9)$$
The average Jacobian matrix on the line between $x$ and $x'$ is then defined as
$$\bar{J}_{x,x'} \triangleq \sum_{i=1}^{K} |T_i| \, J_{x,x'}(T_i). \quad (10)$$
It is a weighted sum of the Jacobian matrices in all the input space partitions visited by $x(t)$, where the weights correspond to the widths of the intervals $T_i$, $i = 1, \ldots, K$. The average Jacobian matrix is used to relate the difference $x' - x$ to the difference $f(x') - f(x)$, as shown by the following theorem (again, all proofs are provided in the Appendix).

[Figure 1 appears here, with four panels: (a) Input domain; (b) Input space partitioning and Jacobians; (c) Feature domain; (d) Input margin bounds.]

Fig. 1. Plot (a) shows samples of class 1 and 2 and the decision regions produced by a two-layer network. The input space partitioning is shown in Plot (b). The black dotted lines denote the boundaries between input space partitions. The blue ellipsoids in each partition represent the Jacobian matrix: the width and the height of the ellipsoid correspond to the singular values of the Jacobian, and their orientation corresponds to the orientation of the singular vectors. Plot (c) shows the samples transformed by the network and the decision boundary of the linear classifier at the output. Plot (d) shows the boundaries of various sets defined in Section IV and used to bound the input margin.

Theorem 2. For any $x, x' \in \mathcal{X}$, we have
$$f(x') - f(x) = \left( \int_0^1 J(x + t(x' - x)) \, dt \right)(x' - x) = \bar{J}_{x,x'}(x' - x). \quad (11)$$

Therefore, the difference $f(x') - f(x)$ is a function of the difference $x' - x$ and of the average Jacobian on the line segment between $x$ and $x'$. To gain some intuition, see Figure 1(b), where the Jacobian matrix is visualized. The points $x_a$, $x_b$, $x_c$, $x_d$ are in the same partition, where the local Jacobian matrix is denoted by $J_3$. One can see that $J_3$ is (approximately) low rank, and that it contracts distances in the direction of axis $x_1$ and preserves distances in the direction of axis $x_2$. Therefore, one would expect that $f(x_a) \approx f(x_c)$ and $f(x_b) \approx f(x_d)$. This is indeed the case, as depicted in Figure 1(c). One can also observe that different local Jacobians in Figure 1(b) preserve and contract distances in different directions. We now provide bounds on the distance $\|f(x') - f(x)\|_2$; before stating them, the sketch below checks the identity (11) numerically.
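The sketch builds the Jacobian (7) from the ReLU activation patterns $S_i(x)$ and approximates the integral in (11) by a fine Riemann sum along the segment between two points. The small random network and the two endpoints are illustrative assumptions.

```python
# Sketch: Jacobian J(x) from (7) and a numerical check of Theorem 2 / (11).
# The small random network and the two points are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]

def relu_net(x):
    for W, b in zip(Ws, bs):
        x = np.maximum(W.T @ x + b, 0.0)
    return x

def jacobian(x):
    """J(x) = S_L(x) W_L^T ... S_1(x) W_1^T, with S_i(x) the 0/1 ReLU pattern;
    the derivative of max(., 0) is taken to be 0 at 0, as in the text."""
    J, z = np.eye(sizes[0]), x
    for W, b in zip(Ws, bs):
        pre = W.T @ z + b
        J = ((pre > 0).astype(float)[:, None] * W.T) @ J   # left-multiply by S_i(x) W_i^T
        z = np.maximum(pre, 0.0)
    return J

x, x2 = rng.standard_normal(4), rng.standard_normal(4)
# Average Jacobian (10), approximated by a fine Riemann sum over t in [0, 1].
ts = np.linspace(0.0, 1.0, 4001)
J_bar = np.mean([jacobian(x + t * (x2 - x)) for t in ts], axis=0)
err = np.max(np.abs(relu_net(x2) - relu_net(x) - J_bar @ (x2 - x)))
print(err)   # small, and shrinks as the Riemann sum is refined
```

The residual printed at the end is small and decreases as the Riemann sum is refined, since the integrand in (11) is piecewise constant with finitely many breakpoints along the segment.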

Theorem 3. For any $x, x' \in \mathcal{X}$ and a network $f(\cdot)$, we have
$$\|f(x') - f(x)\|_2 = \|\bar{J}_{x,x'}(x' - x)\|_2 \le \prod_{i=1}^{L} \|W_i\|_2 \, \|x' - x\|_2 \le \prod_{i=1}^{L} \|W_i\|_F \, \|x' - x\|_2. \quad (12)$$

The first equality in (12) follows directly from Theorem 2. Therefore, the distance $\|f(x') - f(x)\|_2$ is a function of the singular values of the average Jacobian $\bar{J}_{x,x'}$, but also of the alignment of $x' - x$ with the singular vectors of $\bar{J}_{x,x'}$. On the other hand, the bounds on $\|f(x') - f(x)\|_2$ provided by the two inequalities in (12) are functions only of the distance between $x$ and $x'$, and are expressed in terms of the spectral norms and the Frobenius norms of the weight matrices.

IV. MARGIN PRESERVATION OF DEEP NEURAL NETWORKS

This section characterizes the margin preservation of DNN and provides GE bounds.

A. Output margin, input margin and margin preservation

We start by defining the output margin $\gamma_{\mathrm{out}}(x_i)$ of the sample $x_i$ with the label $y_i$:
$$\gamma_{\mathrm{out}}(x_i) = \sup\{c : \|f(x_i) - f(x)\|_2 \le c \implies y_i v^T f(x) > 0 \;\; \forall x\} = \max\!\left( \frac{y_i v^T f(x_i)}{\|v\|_2}, \, 0 \right). \quad (13)$$
The output margin of sample $x_i$ corresponds to the radius of the largest $\ell_2$-ball centered at $f(x_i)$ that is still contained in the decision region labeled $y_i$. Provided that $\gamma_{\mathrm{out}}(x_i) > 0$, $\gamma_{\mathrm{out}}(x_i)$ is the distance between the point $f(x_i)$ and the hyperplane given by the classifier $v$. The output margin is visualized in Figure 1(c). A large margin classifier $v$ can be found by solving an SVM problem with $(f(x_i), y_i)$, $i = 1, \ldots, m$, as training samples. This has been successfully applied in practice [1, 19].

In order to provide GE bounds we need to understand whether a large output margin implies a large input margin. The input margin $\gamma_{\mathrm{in}}(x_i)$ of the sample $x_i$ with the label $y_i$ is defined as
$$\gamma_{\mathrm{in}}(x_i) = \sup\{c : \|x_i - x\|_2 \le c \implies y_i v^T f(x) > 0 \;\; \forall x\} \quad (14)$$
and it corresponds to the radius of the largest $\ell_2$-ball centered at $x_i$ that is still contained in the decision region labeled $y_i$. The input margin, visualized in Figure 1(a), is a crucial property that determines the GE of a deep neural network classifier, as stated next:

Theorem 4 (Adapted from Example 9 in [22]). If there exists $\gamma$ such that
$$\gamma_{\mathrm{in}}(x_i) > \gamma > 0 \quad \forall (x_i, y_i) \in Z_m, \quad (15)$$
then the classifier $g(x) = v^T f(x)$ is $(2 N_{\gamma/2}(\mathcal{X}), 0)$-robust, provided that $N_{\gamma/2}(\mathcal{X}) < \infty$.

Therefore, provided that the training samples are classified with an input margin larger than $\gamma$, the GE behaves as $c/\sqrt{m}$, where $c$ depends on the covering number $N_{\gamma/2}$. However, maximizing the input margin is hard, as it relies on the decision boundaries of DNN, which are non-linear in general and cannot be expressed in closed form or optimized directly. Thus, we turn to look for a convenient lower bound on the input margin that may be used in practice. We use the output margin together with the network properties. We first define the function $r_i(x)$ associated with the $i$-th training sample $(x_i, y_i)$:
$$r_i(x) = \frac{\gamma_{\mathrm{out}}(x_i)}{\frac{y_i v^T}{\|v\|_2}\left(f(x_i) - f(x)\right)} = \frac{\gamma_{\mathrm{out}}(x_i)}{\frac{y_i v^T}{\|v\|_2}\,\bar{J}_{x_i,x}(x_i - x)}. \quad (16)$$
Assuming that $\gamma_{\mathrm{out}}(x_i) > 0$, $r_i(x) > 1$ implies that $x$ lies in the same decision region as $x_i$, $r_i(x) = 1$ implies that $x$ lies on the decision boundary, and $r_i(x) < 1$ implies that $x$ lies in a different decision region than $x_i$. It is easy to verify, via Theorem 3, that $r_i(x) \ge r_i^J(x) \ge r_i^F(x)$, where
$$r_i^J(x) = \frac{\gamma_{\mathrm{out}}(x_i)}{\left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 \|x_i - x\|_2} \quad \text{and} \quad r_i^F(x) = \frac{\gamma_{\mathrm{out}}(x_i)}{\prod_{l=1}^{L}\|W_l\|_F \, \|x_i - x\|_2}. \quad (17)$$
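As a concrete illustration of (12) and (13), the sketch below computes the output margin of a labelled sample and the spectral- and Frobenius-norm products appearing in Theorem 3; the network, classifier and sample are again illustrative assumptions.

```python
# Sketch: output margin (13) and the norm products from Theorem 3 / (12).
# Random network, classifier and sample are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
sizes = [4, 8, 8, 3]
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]
v = rng.standard_normal(sizes[-1])

def relu_net(x):
    for W, b in zip(Ws, bs):
        x = np.maximum(W.T @ x + b, 0.0)
    return x

x_i, y_i = rng.standard_normal(4), 1.0   # a labelled training sample

# Output margin (13): distance of f(x_i) to the hyperplane v^T f = 0, if positive.
gamma_out = max(y_i * v @ relu_net(x_i) / np.linalg.norm(v), 0.0)

# Factors bounding ||f(x') - f(x)||_2 / ||x' - x||_2 in (12).
spec_prod = np.prod([np.linalg.norm(W, 2) for W in Ws])       # product of spectral norms
frob_prod = np.prod([np.linalg.norm(W, 'fro') for W in Ws])   # product of Frobenius norms

print(gamma_out, spec_prod, frob_prod)   # spec_prod <= frob_prod always holds
```

The ratio of the output margin to the Frobenius-norm product is precisely the closed-form lower bound on the input margin given in (20) below.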

We denote by $R_i = \{x \in \mathbb{R}^N : r_i(x) > 1\}$ the set of all points $x$ contained in the same decision region as $x_i$. The set $R_i$ may be approximated by the set $R_i^J = \{x \in \mathbb{R}^N : r_i^J(x) > 1\}$ or by the set $R_i^F = \{x \in \mathbb{R}^N : r_i^F(x) > 1\}$, such that $R_i^F \subseteq R_i^J \subseteq R_i$, where these inclusions follow from (17). The boundaries of these sets, i.e. $r_i(x) = 1$, $r_i^J(x) = 1$ and $r_i^F(x) = 1$, are visualized in Figure 1(d). By using the definitions of the sets $R_i$, $R_i^J$ and $R_i^F$, we bound the input margin $\gamma_{\mathrm{in}}(x_i)$:

Theorem 5. Assume that the network $f(\cdot)$, the classifier $v$, a training sample $x_i$ and the output margin $\gamma_{\mathrm{out}}(x_i) > 0$ are given. Then the following holds:
$$\gamma_{\mathrm{in}}(x_i) = \sup\{c : \|x_i - x\|_2 \le c \implies x \in R_i \;\; \forall x\} \quad (18)$$
$$\ge \sup\{c : \|x_i - x\|_2 \le c \implies x \in R_i^J \;\; \forall x\} \quad (19)$$
$$\ge \sup\{c : \|x_i - x\|_2 \le c \implies x \in R_i^F \;\; \forall x\} = \frac{\gamma_{\mathrm{out}}(x_i)}{\prod_{l=1}^{L}\|W_l\|_F}. \quad (20)$$

Eq. (18) is a restatement of (14) and is equal to the radius of the largest $\ell_2$-ball centered at $x_i$ that can be inscribed in $R_i$. Similarly, (19) and (20) consider the largest $\ell_2$-ball contained in $R_i^J$ and $R_i^F$, respectively. The lower bound (20), which is expressed in closed form, only rescales the output margin by the factor $\prod_{l=1}^{L}\|W_l\|_F$. On the other hand, though (19) cannot be expressed in closed form, it can provide a much sharper bound on the input margin than (20), because it takes into account the Jacobian matrix of $f(\cdot)$ in the neighbourhood of $x_i$. This is also suggested by Figure 1(d), where it is shown that $r_i^J(x) = 1$ is a much better approximation of $r_i(x) = 1$ than $r_i^F(x) = 1$. By combining Theorems 4 and 5 the GE bounds can be easily derived.

Consequences for very deep networks- It is instructive to observe the result of [15], which suggests that the GE of DNN with ReLUs behaves as
$$\frac{1}{\sqrt{m}} \, 2^L \, \|v\|_2 \prod_{i=1}^{L} \|W_i\|_F, \quad (21)$$
provided that the energy of the training samples is bounded. The bounds based on Theorems 1 and 4 behave as
$$\frac{1}{\sqrt{m}} \sqrt{N_{\gamma/2}(\mathcal{X})}, \quad (22)$$
where $\gamma$ represents the classification margin. The behaviour of the bound (21) suggests that the GE grows exponentially with the network depth even if the product of the Frobenius norms of all the weight matrices is fixed, due to the term $2^L$. The bound (22), based on the robustness framework, on the other hand, implies that the GE can improve with the network depth, since deeper networks can implement more complex decision boundaries, as suggested by recent works [4, 20], and may therefore achieve a better classification margin, leading to a lower GE.

B. Large margin regularization

The proposed framework suggests that DNN with a low GE can be trained by enforcing a large output margin on the training set and by constraining the DNN to be margin preserving. A large margin linear classifier can be obtained by optimizing the hinge loss with an $\ell_2$-norm constraint on the classifier $v$. Margin preservation of the network, on the other hand, can be enforced by constraining the Frobenius norms of the weight matrices, as suggested by (20). Therefore, the popular $\ell_2$ weight decay regularizer, which is usually implemented as
$$\|v\|_2^2 + \sum_{i=1}^{L} \|W_i\|_F^2, \quad (23)$$
leads to a large output margin via the term $\|v\|_2^2$, and it leads to the preservation of the output margin at the input via the term $\sum_{i=1}^{L}\|W_i\|_F^2$, which controls the product of the Frobenius norms. However, Theorem 5 also suggests that a sharper way to control the margin preservation of the network is possible, by constraining the behaviour of the network's Jacobian matrix. We discuss a potential way to implement such a regularizer next. Eq. (19) shows that a training sample $x_i$ will achieve input margin $\gamma$ provided that
$$\|x_i - x\|_2 < \gamma \implies r_i^J(x) > 1 \quad \forall x.$$
Assuming that $\gamma_{\mathrm{out}}(x_i) > 0$ is given, we constrain the denominator of $r_i^J(x)$ as
$$\left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 \|x_i - x\|_2 \le \gamma_{\mathrm{out}}(x_i), \quad \forall x \ \text{with} \ \|x_i - x\|_2 < \gamma. \quad (24)$$

[Figure 2 appears here: four panels plotting accuracy [%] against the number of layers L, for MNIST and CIFAR10 with different numbers of training samples, comparing the LM and WD regularizers.]

Fig. 2. The plots show accuracies for DNN with a different number of layers trained on MNIST and CIFAR10 with large margin regularization (solid line) and weight decay (dashed line). The number next to the dataset name represents the number of training samples.

This formulation is still not feasible for practical implementation, and we impose a stricter condition than (24) by bounding the left-hand side of (24), using the fact that $\|x_i - x\|_2 < \gamma$ and the definitions (9) and (10), to obtain (a detailed derivation is provided in the Appendix)
$$\sup_{x \in \mathcal{X}} \left\|\frac{v^T}{\|v\|_2}\,J(x)\right\|_2 \le \frac{\gamma_{\mathrm{out}}(x_i)}{\gamma}. \quad (25)$$
Therefore, $\gamma_{\mathrm{in}}(x_i) > \gamma$, provided that (25) holds. In practice we promote a large $\gamma_{\mathrm{out}}(x_i)$ by training the network using the hinge loss and by constraining the norm of the classifier $v$. In order to constrain the left-hand side of (25), we assume that the training set is a good approximation of $\mathcal{X}$ and only constrain $\left\|\frac{v^T}{\|v\|_2}\,J(x_i)\right\|_2$, $i = 1, \ldots, m$, which leads to the regularizer
$$\sum_{i=1}^{m} \left\|\frac{v^T}{\|v\|_2}\,J(x_i)\right\|_2. \quad (26)$$
The next section shows that this regularizer outperforms the popular weight decay.

V. EXPERIMENTAL RESULTS

In this section we empirically validate the theoretical results by showing that our novel large margin (LM) regularizer (26) outperforms weight decay (WD) (23). First, we use networks with a different number of layers, where the first layer always has 784 nodes and all subsequent layers have 392 nodes. At the end of the network we use a 10-class classifier and train the network with the multi-class hinge loss. We use MNIST and CIFAR10; for the latter we reduce the dimension to 784 by principal component analysis. Additional training details are provided in the Appendix. We report the results for smaller training sets, where the difference between LM regularization and WD is more significant. The results are reported in Figure 2. We observe that the proposed LM regularization always outperforms WD.

Second, we demonstrate the use of LM regularization with convolutional neural networks (CNN). We choose MNIST for these experiments, since the $\ell_2$-margin is more suitable for this dataset than for CIFAR10. We use a 2-layer CNN with the following architecture: (32, 5, 5)-conv, (2, 2)-max-pool, (32, 5, 5)-conv, (2, 2)-max-pool, followed by a linear classifier, and a 3-layer CNN that has an additional (32, 5, 5)-conv layer before the linear classifier. We also compare the multi-class hinge loss and the categorical cross entropy (CCE) loss. The results for training with no regularization, WD and LM regularization are reported in Table I. We observe that CNN trained with the hinge loss always outperform the networks trained with the CCE loss, provided that WD or LM regularization is used. We also observe that WD always outperforms or is at least as good as no regularization, and LM regularization always outperforms WD, independently of the loss function used.

VI. CONCLUSIONS

This paper studies the generalization error of deep networks based on their classification margin. Generalization error bounds based on the classification margin do not suffer from the exponential dependence on the network depth exhibited by some recent bounds in the literature, which renders those bounds unrealistic for the very deep networks (hundreds of layers) currently in use.
Moreover, the paper explains how DNN that achieve a large classification margin can be trained by using a large margin linear classifier at the output of the DNN and by constraining the DNN to preserve distances in the direction normal to the decision boundary, which is achieved by constraining the Jacobian matrix of the network. The presented results show that such a strategy outperforms the popular weight decay.

TABLE I
CLASSIFICATION ACCURACY [%] OF CNNS ON MNIST.

[The table reports, for the 2- and 3-layer CNNs trained with the hinge and CCE losses, the accuracy obtained with no regularization, with WD and with LM regularization, for training sets of 256, 512 and 1024 samples.]

Future work will include extensions of the theory to DNN with pooling and to other DNN architectures such as Deep Residual Networks. Another important direction is the consideration of metrics other than $\ell_2$ to measure the classification margin, which are more suitable for datasets where the Euclidean distance is not appropriate.

REFERENCES

[1] S. An, M. Hayat, S. H. Khan, M. Bennamoun, F. Boussaid, and F. Sohel. Contractive rectifier networks for nonlinear maximum margin classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[2] F. Bach. Breaking the curse of dimensionality with convex neural networks. arXiv preprint, 2014.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. The Journal of Machine Learning Research (JMLR), 3:463–482, 2002.
[4] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: a tensor analysis. arXiv preprint, 2015.
[5] R. Giryes, G. Sapiro, and A. M. Bronstein. Deep neural networks with random Gaussian weights: a universal classification strategy? arXiv preprint, 2015.
[6] B. D. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint, 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, Dec. 2015.
[8] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Oct. 2012.
[9] J. Huang, Q. Qiu, G. Sapiro, and R. Calderbank. Discriminative robust transformation learning. In Advances in Neural Information Processing Systems (NIPS), 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[11] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[12] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3), Dec. 2008.
[13] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
[14] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
[15] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.
[16] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
[17] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[18] S. Sun, W. Chen, L. Wang, and T.-Y. Liu. Large margin deep neural networks: theory and algorithms. arXiv preprint, 2015.
[19] Y. Tang. Deep learning using linear support vector machines. In Workshop on Representation Learning, ICML, 2013.
[20] M. Telgarsky. Benefits of depth in neural networks. arXiv preprint, 2016.
[21] V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), Sep. 1999.
[22] H. Xu and S. Mannor. Robustness and generalization. Machine Learning, 86(3), 2012.
[23] Y. Zhang, J. D. Lee, M. J. Wainwright, and M. I. Jordan. Learning halfspaces and neural networks with random initialization. arXiv preprint, 2015.

APPENDIX

Proof of Theorem 2

We first note that the line between $x$ and $x'$ is given by $x + t(x' - x)$, $t \in [0, 1]$. We define the function $F(t) = f(x + t(x' - x))$ and observe that $\frac{dF(t)}{dt} = J(x + t(x' - x))(x' - x)$. Now, by the generalized fundamental theorem of calculus (or the Lebesgue differentiation theorem), we write
$$f(x') - f(x) = F(1) - F(0) = \int_0^1 \frac{dF(t)}{dt}\, dt = \left( \int_0^1 J(x + t(x' - x)) \, dt \right)(x' - x). \quad (27)$$
The integral on the right-hand side can be written as the weighted sum (10) because $J(x)$ is piecewise constant along the line. This concludes the proof.

Proof of Theorem 3

The equality in (12) follows directly from Theorem 2. The first inequality in (12) follows from the bound $\|\bar{J}_{x,x'}(x - x')\|_2 \le \|\bar{J}_{x,x'}\|_2 \, \|x - x'\|_2$; from the fact that $\|\bar{J}_{x,x'}\|_2$ can be upper bounded by $\max_i \|J_{x,x'}(T_i)\|_2$, since $\bar{J}_{x,x'}$ is a weighted sum of Jacobians with the sum of the weights equal to 1; and from the fact that $J(x)$ is a product of the weight matrices $W_i$ and the ReLU matrices $S_i(x)$. The spectral norm of a matrix product is bounded by the product of the spectral norms of the matrices, and since $\|S_i(x)\|_2 \le 1$ the inequality holds. Finally, the second inequality is obtained from the first inequality by noting that the Frobenius norm always bounds the spectral norm, $\|W\|_2 \le \|W\|_F$.

Proof of Theorem 5

Recall the definition of the input margin in (14) and note that
$$y_i v^T f(x) = y_i \left( v^T f(x_i) + v^T (f(x) - f(x_i)) \right) = \|v\|_2 \, \gamma_{\mathrm{out}}(x_i) + y_i v^T (f(x) - f(x_i)),$$
where we have leveraged the assumption $\gamma_{\mathrm{out}}(x_i) > 0$. Therefore,
$$y_i v^T f(x) > 0 \iff \gamma_{\mathrm{out}}(x_i) > \frac{y_i v^T}{\|v\|_2}\left(f(x_i) - f(x)\right) \iff r_i(x) > 1. \quad (28)$$
This leads to (18). Eqs. (19) and (20) lower bound (18) because $R_i^J$ and $R_i^F$ are subsets of $R_i$, which implies that the solutions of the optimization problems in (19) and (20) can only be smaller than or equal to the solution of (18). The closed-form expression in (20) is obtained by solving (20) for $c$.

Derivation of equation (25)

We start from (24):
$$\left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 \|x_i - x\|_2 \le \left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 \, \gamma. \quad (29)$$
Note now that
$$\left\|\frac{v^T}{\|v\|_2}\,\bar{J}_{x_i,x}\right\|_2 = \left\|\frac{v^T}{\|v\|_2}\sum_{k=1}^{K}|T_k|\,J_{x_i,x}(T_k)\right\|_2 \le \sum_{k=1}^{K}|T_k|\left\|\frac{v^T}{\|v\|_2}\,J_{x_i,x}(T_k)\right\|_2 \le \max_k \left\|\frac{v^T}{\|v\|_2}\,J_{x_i,x}(T_k)\right\|_2, \quad (30)$$
where the first equality is due to (10), the second inequality follows from the triangle inequality, and the third inequality is due to the fact that the $|T_k|$, $k = 1, \ldots, K$, sum to 1. Note that, by the definition in (9), $J_{x_i,x}(T_k) = J(x^\star)$ is the Jacobian matrix evaluated at some point $x^\star$ in the input space. Therefore, we can further bound
$$\max_k \left\|\frac{v^T}{\|v\|_2}\,J_{x_i,x}(T_k)\right\|_2 \le \sup_{x \in \mathcal{X}} \left\|\frac{v^T}{\|v\|_2}\,J(x)\right\|_2. \quad (31)$$
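The chain (29)-(31) can also be illustrated numerically: by the triangle inequality, the directional norm of a (sampled) average Jacobian never exceeds its maximum along the segment. The following sketch confirms this for an illustrative random network, classifier and pair of points.

```python
# Numerical illustration of the bound (30): the directional norm of the sampled
# average Jacobian never exceeds its maximum along the segment. The random
# network, classifier and points are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
sizes = [4, 8, 8, 3]
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]
v = rng.standard_normal(sizes[-1])
u = v / np.linalg.norm(v)                       # v^T / ||v||_2

def jacobian(x):
    J, z = np.eye(sizes[0]), x
    for W, b in zip(Ws, bs):
        pre = W.T @ z + b
        J = ((pre > 0).astype(float)[:, None] * W.T) @ J
        z = np.maximum(pre, 0.0)
    return J

x_i, x = rng.standard_normal(4), rng.standard_normal(4)
ts = np.linspace(0.0, 1.0, 2001)
Js = [jacobian(x_i + t * (x - x_i)) for t in ts]

lhs = np.linalg.norm(u @ np.mean(Js, axis=0))   # || v^T/||v||_2 * average Jacobian ||_2
rhs = max(np.linalg.norm(u @ J) for J in Js)    # max along the segment, cf. (30)
print(lhs <= rhs + 1e-12)                       # True, by the triangle inequality
```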

Experimental procedure details

All the networks were trained using stochastic gradient descent (SGD) with momentum, which was set to 0.9. Results are reported for the best test set performance achieved. Since we use a multi-class classifier, the Frobenius norm of the matrix of classification vectors is constrained instead of the norm of $v$.

MNIST and CIFAR10 DNN- The networks contain 784 units in the first layer and 392 units in the higher layers. The batch size was set to 128, and the networks were trained for 110 epochs with a step-wise learning rate schedule. The weight decay penalty was chosen from the set $\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$, the classification matrix penalty was chosen from the set $\{0, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, and the large margin regularization penalty was chosen from the set $\{0, 10^{-2}, 10^{-1}, 1, 2\}$ and then divided by the batch size. Since the regularization (26) assumes a single classification vector $v$, we took for $v$ one of the multiple classification vectors, where the choice was random for each $x_i$ in each mini-batch.

MNIST CNN- The 2-layer CNN architecture is the following: (32, 5, 5)-conv, (2, 2)-max-pool, (32, 5, 5)-conv, (2, 2)-max-pool, followed by a linear classifier. The 3-layer CNN has an additional (32, 5, 5)-conv layer before the linear classifier. For the CCE loss the linear classifier is followed by the softmax non-linearity. The batch size was set to 32, and the networks were trained for 100 epochs with a step-wise learning rate schedule. The weight decay regularization penalty was chosen from the set $\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, the classification matrix penalty was chosen from the set $\{0, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, and the large margin regularization penalty was chosen from the set $\{0, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$ and then divided by the batch size. Since the regularization (26) assumes a single classification vector $v$, here we sum the regularization term (26) over all possible classification vectors for each sample $x_i$ in each mini-batch.
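The following rough NumPy sketch shows how the mini-batch large margin regularizer (26) described above could be assembled in the multi-class setting, with a random classification vector chosen per sample as in the MNIST and CIFAR10 DNN experiments. The sizes, weights and batch below are illustrative assumptions; in actual training the term is added to the hinge loss and differentiated through by the optimizer, whereas here it is only evaluated.

```python
# Rough sketch of the mini-batch large margin regularizer (26) in the multi-class
# setting: for each sample a random classification vector is used, as in the DNN
# experiments above. All sizes, weights and the batch are illustrative assumptions;
# in training this term would be added to the hinge loss and differentiated through.
import numpy as np

rng = np.random.default_rng(4)
sizes = [16, 12, 12, 8]                              # small stand-in for the 784/392 layers
num_classes, batch_size = 10, 5
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [0.1 * rng.standard_normal(n) for n in sizes[1:]]
V = rng.standard_normal((sizes[-1], num_classes))    # matrix of classification vectors

def jacobian(x):
    """J(x) of the ReLU network as in (7), built from the activation patterns S_i(x)."""
    J, z = np.eye(sizes[0]), x
    for W, b in zip(Ws, bs):
        pre = W.T @ z + b
        J = ((pre > 0).astype(float)[:, None] * W.T) @ J
        z = np.maximum(pre, 0.0)
    return J

X = rng.standard_normal((batch_size, sizes[0]))      # a mini-batch of inputs

reg = 0.0
for x_i in X:
    c = rng.integers(num_classes)                    # random classification vector per sample
    u = V[:, c] / np.linalg.norm(V[:, c])            # v^T / ||v||_2 for that class
    reg += np.linalg.norm(u @ jacobian(x_i))         # || v^T/||v||_2 J(x_i) ||_2, cf. (26)
reg /= batch_size                                     # the penalty is divided by the batch size
print(reg)
```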


More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

arxiv: v2 [stat.ml] 7 Jun 2014

arxiv: v2 [stat.ml] 7 Jun 2014 On the Number of Linear Regions of Deep Neural Networks arxiv:1402.1869v2 [stat.ml] 7 Jun 2014 Guido Montúfar Max Planck Institute for Mathematics in the Sciences montufar@mis.mpg.de Kyunghyun Cho Université

More information

CLOSE-TO-CLEAN REGULARIZATION RELATES

CLOSE-TO-CLEAN REGULARIZATION RELATES Worshop trac - ICLR 016 CLOSE-TO-CLEAN REGULARIZATION RELATES VIRTUAL ADVERSARIAL TRAINING, LADDER NETWORKS AND OTHERS Mudassar Abbas, Jyri Kivinen, Tapani Raio Department of Computer Science, School of

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Negative Momentum for Improved Game Dynamics

Negative Momentum for Improved Game Dynamics Negative Momentum for Improved Game Dynamics Gauthier Gidel Reyhane Askari Hemmat Mohammad Pezeshki Gabriel Huang Rémi Lepriol Simon Lacoste-Julien Ioannis Mitliagkas Mila & DIRO, Université de Montréal

More information

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

arxiv: v2 [cs.sd] 7 Feb 2018

arxiv: v2 [cs.sd] 7 Feb 2018 AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang ong*, Yong Xu*, Wenwu Wang, Mark D. Plumbley Center for Vision, Speech and Signal Processing, University of Surrey, U

More information

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Goal: approximate some function f e.g., a classifier, maps input to a class y = f (x) x y Defines a mapping

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

arxiv: v1 [cs.lg] 4 Mar 2019

arxiv: v1 [cs.lg] 4 Mar 2019 A Fundamental Performance Limitation for Adversarial Classification Abed AlRahman Al Makdah, Vaibhav Katewa, and Fabio Pasqualetti arxiv:1903.01032v1 [cs.lg] 4 Mar 2019 Abstract Despite the widespread

More information

Expressiveness of Rectifier Networks

Expressiveness of Rectifier Networks Xingyuan Pan Vivek Srikumar The University of Utah, Salt Lake City, UT 84112, USA XPAN@CS.UTAH.EDU SVIVEK@CS.UTAH.EDU From the learning point of view, the choice of an activation function is driven by

More information

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

More information

Deep Learning (CNNs)

Deep Learning (CNNs) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Deep Convolutional Neural Networks for Pairwise Causality

Deep Convolutional Neural Networks for Pairwise Causality Deep Convolutional Neural Networks for Pairwise Causality Karamjit Singh, Garima Gupta, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal TCS Research, Delhi Tata Consultancy Services Ltd. {karamjit.singh,

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17 3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural

More information

arxiv: v1 [cs.lg] 25 Sep 2018

arxiv: v1 [cs.lg] 25 Sep 2018 Utilizing Class Information for DNN Representation Shaping Daeyoung Choi and Wonjong Rhee Department of Transdisciplinary Studies Seoul National University Seoul, 08826, South Korea {choid, wrhee}@snu.ac.kr

More information

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley

what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley what can deep learning learn from linear regression? Benjamin Recht University of California, Berkeley Collaborators Joint work with Samy Bengio, Moritz Hardt, Michael Jordan, Jason Lee, Max Simchowitz,

More information

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang Supervised v.s. Unsupervised Math formulation for supervised learning Given training data x i, y i

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

More information

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University.

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University. Nonlinear Models Numerical Methods for Deep Learning Lars Ruthotto Departments of Mathematics and Computer Science, Emory University Intro 1 Course Overview Intro 2 Course Overview Lecture 1: Linear Models

More information

CSC 576: Variants of Sparse Learning

CSC 576: Variants of Sparse Learning CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

Sample width for multi-category classifiers

Sample width for multi-category classifiers R u t c o r Research R e p o r t Sample width for multi-category classifiers Martin Anthony a Joel Ratsaby b RRR 29-2012, November 2012 RUTCOR Rutgers Center for Operations Research Rutgers University

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Robustness of classifiers: from adversarial to random noise

Robustness of classifiers: from adversarial to random noise Robustness of classifiers: from adversarial to random noise Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard École Polytechnique Fédérale de Lausanne Lausanne, Switzerland {alhussein.fawzi,

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

arxiv: v1 [cs.lg] 30 Sep 2018

arxiv: v1 [cs.lg] 30 Sep 2018 Deep, Skinny Neural Networks are not Universal Approximators arxiv:1810.00393v1 [cs.lg] 30 Sep 2018 Jesse Johnson Sanofi jejo.math@gmail.com October 2, 2018 Abstract In order to choose a neural network

More information

COR-OPT Seminar Reading List Sp 18

COR-OPT Seminar Reading List Sp 18 COR-OPT Seminar Reading List Sp 18 Damek Davis January 28, 2018 References [1] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank Solutions of Linear Matrix Equations via Procrustes

More information