arXiv: v1 [cs.LG] 24 Jan 2019


Cross-Entropy Loss and Low-Rank Features Have Responsibility for Adversarial Examples

Kamil Nar, Orhan Ocal, S. Shankar Sastry, Kannan Ramchandran

Abstract

State-of-the-art neural networks are vulnerable to adversarial examples; they can easily misclassify inputs that are imperceptibly different from their training and test data. In this work, we establish that the use of the cross-entropy loss function and the low-rank features of the training data are responsible for the existence of these inputs. Based on this observation, we suggest that addressing adversarial examples requires rethinking the use of the cross-entropy loss function and looking for an alternative that is better suited for minimization with low-rank features. In this direction, we present a training scheme called differential training, which uses a loss function defined on the differences between the features of points from opposite classes. We show that differential training can ensure a large margin between the decision boundary of the neural network and the points in the training dataset. This larger margin increases the amount of perturbation needed to flip the prediction of the classifier and makes it harder to find an adversarial example with small perturbations. We test differential training on a binary classification task with the CIFAR-10 dataset and demonstrate that it drastically reduces the fraction of images for which an adversarial example could be found, not only in the training dataset but in the test dataset as well.

1. Introduction

Despite their high accuracy on training and test datasets, state-of-the-art neural networks are vulnerable to adversarial examples: they can easily misclassify inputs that are indistinguishable from the training and test data, and they express very high confidence in their wrong predictions (Szegedy et al., 2013).
Authors are with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.

Several methods have recently been introduced to generate these adversarial inputs (Goodfellow et al., 2015; Carlini & Wagner, 2017; Moosavi-Dezfooli et al., 2017; Athalye et al., 2018), and the simplicity and effectiveness of these methods have reinforced the concerns about the use of neural networks in many tasks.

The presence of adversarial examples was initially attributed to the high nonlinearity of deep neural networks (Szegedy et al., 2013). Later, however, it was shown that a network with few layers and a high-dimensional input space could also suffer from this problem (Goodfellow et al., 2015). Support vector machines with radial basis functions, on the other hand, were robust to these malicious inputs: their accuracies on test datasets and on adversarial examples were comparable. Based on these observations, it was claimed that neural networks, unlike support vector machines, failed to introduce adequate nonlinearity as a feature mapping, and this was suggested to be the main explanation for the existence of adversarial examples (Goodfellow et al., 2015).

It is correct that neural networks and support vector machines differ in their level of nonlinearity and in their level of robustness against adversarial examples, but this fact on its own does not suffice to establish a causal relation between adversarial examples and the nonlinearity of the classifier. Neural networks and support vector machines differ in many other aspects, and any of these may also be responsible for the presence of adversarial examples.

A major one of these factors is the training procedure. Training a support vector machine involves solving a convex optimization problem defined with the hinge loss function (Hastie et al.). Due to the convexity of the problem, the choice of optimization algorithm has no influence on the classifier obtained at the end of training.
In contrast, training a neural network requires solving a nonconvex problem, and the dynamics of the optimization algorithm become critical for the solution. They determine the local optimum obtained, and hence, the decision boundary of the trained network.

The existence of adversarial examples is the manifestation of a poor margin between the decision boundary of the network and the points in the training and test datasets (Fawzi et al., 2017). What is interesting is the closeness of the training points to the decision boundary: for some reason, the decision boundary resides extremely close to the training points even after the training is complete, although the main purpose of training is to find a boundary that is reasonably far away from these points. We seek a reason for this poor margin among the ingredients of neural network training that are widely taken for granted: the gradient methods and the cross-entropy loss function.

1.1. Our contributions

1. We show that if a linear classifier is trained by minimizing the cross-entropy loss function via the gradient descent algorithm, and if the features of the training points lie on a low-dimensional affine subspace, then the margin between the decision boundary of the classifier and the training points could become much smaller than the optimal value.

2. We show that the penultimate layers of neural networks are very likely to produce low-rank features, and we provide empirical evidence for this on a binary classification task with the CIFAR-10 dataset. Combined with the first contribution, this suggests that neural networks could have a poor margin in their penultimate layer, and consequently, very small perturbations in this layer can easily flip the decision of the classifier.

3. In order to improve the margin, we put forward a training scheme called differential training, which uses a loss function defined on the differences between the features of the points from opposite classes. We show that this training scheme allows finding the solution with the largest hard margin for linear classifiers while still using the gradient descent algorithm.

4. We introduce a loss function that improves the margin for nonlinear classifiers and display its effectiveness on a synthetic problem. Then we test this loss function on a binary classification task with the CIFAR-10 dataset, and show that it prevents the Projected Gradient Descent Attack (Madry et al., 2018; Kurakin et al., 2016) from being able to find an adversarial example for most of the training and test data.

5. On the CIFAR-10 dataset, we empirically show that the network produced by differential training generalizes well over the adversarial examples. That is, the accuracy of the network is virtually the same on adversarial examples generated from the training dataset and on those generated from the test dataset. This result is critical given that the networks trained with robust optimization were shown not to generalize on adversarial examples (Schmidt et al., 2018).

1.2. Related Works

The minimization of the cross-entropy loss function via the gradient descent algorithm has recently been studied for linear classifiers, and its solution has been shown to be equivalent to a support vector machine (Soudry et al., 2018). However, it has not been emphasized that the separating hyperplane produced by the cross-entropy minimization is constrained to pass through the origin in an augmented space. We show that this fact could cause the margin of the classifier to be drastically small if the features of the dataset lie in a low-dimensional affine subspace of a high-dimensional feature space. We also show that this case is not atypical when a neural network is trained with the gradient descent algorithm, and we build a connection between this fact and the existence of adversarial examples.

It is known that if a support vector machine is formulated to find a separating hyperplane passing through the origin, the margin of the classifier will be smaller than the optimal value. In order to overcome this problem and to speed up online learning algorithms for support vector machines, the idea of using the differences between the points from opposite classes has previously been suggested in (Ishibashi et al., 2008; Keerthi et al., 1999). We show that a similar idea in differential training also improves the margin when a neural network is being trained with a gradient-based method.

Differential training uses the differences between the features of the training points from opposite classes.
This training scheme has been intentionally introduced to improve the dynamics of the gradient descent algorithm on the training cost function, and we treat it as the use of an alternative cost function in the sequel, since the choice of cost function is critical. However, the procedure could also be considered as using an identical pair of networks in the network architecture, which is closely related to Siamese networks (Bromley et al., 1993; Chopra et al.). These networks were previously shown to perform well when limited data were available from any of the classes in a classification task (Koch et al., 2015). Our work shows that this architecture can also provide a large margin between the decision boundary of the classifier and the training points, and consequently, be more robust to adversarial examples if the network is trained with the cost function we suggest in Section 4.

2. Cross-Entropy Loss on Low-Rank Features Leads to Poor Margins

The cross-entropy loss function is almost the sole choice for classification tasks in practice. Its prevalent use is backed theoretically by its association with the minimization of the Kullback-Leibler divergence between the empirical distribution of a dataset and the confidence of the classifier for that dataset. Given the particular success of neural networks for classification tasks (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016), there seems to be little motivation to search for alternatives to this loss function, and most of the software developed for neural networks incorporates an efficient implementation of it, thereby facilitating its further use.

Figure 1. Decision boundary obtained with cross-entropy minimization. Orange and blue points lie on a low-dimensional affine subspace in $\mathbb{R}^2$, and they represent the data from two different classes. Cross-entropy minimization for a linear classifier on these points leads to the decision boundary shown with the solid line, which attains an extremely poor margin.

Nevertheless, there is a typical case where the use of the cross-entropy loss function can create a problem for the classifier, as shown in Figure 1. The source of this problem is pointed out in Theorem 1.

Theorem 1. Assume that the points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ are linearly separable and lie in an affine subspace; that is, there exist a set of orthonormal vectors $\{r_k\}_{k \in K}$ and a set of scalars $\{\Delta_k\}_{k \in K}$ such that

$$\langle r_k, x_i \rangle = \langle r_k, y_j \rangle = \Delta_k \quad \forall i \in I,\ \forall j \in J,\ \forall k \in K.$$

Let $\langle w, \cdot \rangle + B = 0$ denote the decision boundary obtained by minimizing the cross-entropy loss function

$$-\sum_{i \in I} \log \frac{e^{w^\top x_i + b}}{1 + e^{w^\top x_i + b}} - \sum_{j \in J} \log \frac{1}{1 + e^{w^\top y_j + b}},$$

and assume that $w$ and $B$ are scaled such that $\min_{i \in I, j \in J} \langle w, x_i \rangle - \langle w, y_j \rangle = 2$. Then the minimization of the cross-entropy loss yields a margin smaller than or equal to

$$\frac{1}{\sqrt{\dfrac{1}{\gamma^2} + B^2 \sum_{k \in K} \Delta_k^2}},$$

where $\gamma$ denotes the optimal hard margin given by the SVM solution.

Remark 1. Theorem 1 shows that if the training points lie on an affine subspace, and if the cross-entropy loss is minimized with the gradient descent algorithm, then the margin of the classifier will be smaller than the optimal margin value. As the dimension of this affine subspace decreases, the cardinality of the set $K$ increases and the term $\sum_{k \in K} \Delta_k^2$ could become much larger than $1/\gamma^2$. Therefore, as the dimension of the subspace containing the training points gets smaller compared to the dimension of the input space, cross-entropy minimization with a gradient method becomes more likely to yield a poor margin.

The next corollary relaxes the condition of Theorem 1 and allows the training points to be near an affine subspace instead of being exactly on it.

Corollary 1. Assume that the points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ in $\mathbb{R}^d$ are linearly separable and there exist a set of orthonormal vectors $\{r_k\}_{k \in K}$ and a set of scalars $\{\Delta_k\}_{k \in K}$ such that

$$\langle r_k, x_i \rangle \ge \Delta_k, \quad \langle r_k, y_j \rangle \ge \Delta_k \quad \forall i \in I,\ \forall j \in J,\ \forall k \in K.$$

Let $\langle w, \cdot \rangle + B = 0$ denote the decision boundary obtained by minimizing the cross-entropy loss, as in Theorem 1. Then the minimization of the cross-entropy loss yields a margin smaller than or equal to

$$\frac{1}{\sqrt{B^2 \sum_{k \in K} \Delta_k^2}}.$$

Note that the ability to compare the margin obtained by cross-entropy minimization with the optimal value is lost. Nevertheless, the corollary highlights the fact that the same set of points could be assigned a substantially different margin by cross-entropy minimization if all of them are shifted away from the origin by the same amount in the same direction.

3. Penultimate Layers of Neural Networks Contain Low-Rank Features

The results in the previous section were for linear classifiers, and correspondingly, the features of the training points were the points themselves. In this section, we consider neural networks and regard the outputs of their penultimate layer as the features of the training points. The following proposition shows that these features can have a very low rank if the network is trained with a gradient method.

Proposition 1. Given a set of points $\{x_i\}_{i \in I}$, assume that an $L$-layer network is trained by minimizing the cross-entropy loss function:

$$\min_{w, \theta} \sum_{i \in I} -\log \frac{e^{w^\top \phi_\theta(x_i)}}{1 + e^{w^\top \phi_\theta(x_i)}}$$

where $\phi_\theta(x_i)$ is the output of the penultimate layer of the network and represents the features for point $x_i$. Assume that $\phi_\theta$ ends with a linear layer, i.e., $\phi_\theta = W h_\theta$, where $W$ is a matrix and $h_\theta$ is the first $L-2$ layers of the network. If the gradient descent algorithm is initialized with $W[0] = 0$, then the rank of the set $\{\phi_{\hat\theta}(x_i)\}_{i \in I}$ is at most 1 whenever the algorithm is terminated.

The assumption on the initialization of the matrix $W$ could be removed if the network has a certain structure, for example, if the last layer of $h_\theta$ ends with a squashing function such as arctan or tanh. In this case, the points in $\{\phi_\theta(x_i)\}_{i \in I}$ keep growing in the same direction if the algorithm is run for long enough, and consequently, this set converges to a set with rank 1 as well. More detail on this case is provided in Appendix B.

Note that the only strong assumption in Proposition 1 is the requirement that $\phi_\theta$ ends with a linear layer. Otherwise, $\phi_\theta$ is allowed to contain any type of nonlinear activation functions and convolutional layers.

To empirically verify whether the features in a neural network are still low-rank even when the penultimate layer is nonlinear, we trained a standard network with ReLU activations for a binary classification task on the CIFAR-10 dataset. The cross-entropy loss function was minimized with three different optimization schemes to train the network. Even though all parameters of the network were initialized as in (He et al., 2015), the features in the penultimate layer had rank 2 if the training cost was minimized via the gradient method with momentum. When the optimization algorithm was changed to Adam or when batch normalization was used during training, the rank of the features still remained much lower than the dimension of the feature space, as shown in Figure 2.

Figure 2 (axes: number of principal components used vs. variance explained; curves: Adam+BatchNorm, Adam, momentum). The outputs of the penultimate layer of a neural network can be considered as the features of the training points. A four-layer convolutional network is trained by minimizing the cross-entropy loss function via three different optimization schemes. The plot shows the cumulative variance explained for these features as a function of the number of principal components used. The features lie in a two-dimensional subspace if the gradient method with momentum is used. For the other two algorithms, almost all the variance in the features is captured by the first 20 principal components out of 84.

Remark 2. Proposition 1, along with the empirical observations on the CIFAR-10 dataset, shows that the low-rankness of the features of the training dataset is not an exceptional case; on the contrary, it can arise in most cases. This has also been supported recently by Martin & Mahoney (2018). Along with the main result of Section 2, the fact that the penultimate layer of the network contains low-rank features indicates a small margin between the decision boundary of the classifier and the features in this layer. In other words, small perturbations in the penultimate layer can easily flip the decision of the classifier.

4. Differential Training Improves Margin

In previous sections, we saw that the combination of the cross-entropy loss function, low-rank features of the training dataset, and the gradient descent algorithm could lead to a poor margin. We change the training cost function in the following subsections in order to increase the margin of the classifier.

4.1. Differential Training for Linear Classifiers

Consider the binary classification problem with only two training points, $x$ and $y$, from two different classes.
If we use the cross-entropy loss function to find a linear classifier by minimizing

$$-\log \frac{e^{w^\top x + b}}{1 + e^{w^\top x + b}} - \log \frac{1}{1 + e^{w^\top y + b}},$$

the gradient descent algorithm gives the update rule

$$w \leftarrow w + \eta \left( x \, \frac{e^{-w^\top x - b}}{1 + e^{-w^\top x - b}} - y \, \frac{e^{w^\top y + b}}{1 + e^{w^\top y + b}} \right), \tag{1}$$

where $\eta$ is the learning rate of the algorithm. The update rule for $w$ reveals a critical fact: even though the optimal direction for $w$ is $x - y$, the increments in $w$ are usually not in this direction, because the coefficients multiplying $x$ and $y$ are in general different.

Now consider the problem of finding a separating hyperplane for a linearly separable dataset. If the dataset is low rank, the differences between the training points span a low-dimensional subspace. However, at each iteration of the gradient descent algorithm, the increments on the normal vector of the decision boundary will usually contain components outside of this subspace, as can be seen in (1). These increments could be forced to lie in the same subspace by feeding the differences of the points from opposite classes, instead of the points themselves, into the loss function. In fact, a loss function of this form enables finding the separating hyperplane with the largest margin with the gradient descent algorithm.

Theorem 2. Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, if we solve

$$\min_{w \in \mathbb{R}^d} \sum_{i \in I} \sum_{j \in J} \log\left(1 + e^{-w^\top (x_i - y_j)}\right) \tag{2}$$

by using the gradient descent algorithm with a sufficiently small learning rate, then the direction of $w$ converges to the direction of the maximum-margin solution, i.e.,

$$\lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{w_{\mathrm{SVM}}}{\|w_{\mathrm{SVM}}\|}, \tag{3}$$

where $w_{\mathrm{SVM}}$ is the solution to the hard-margin SVM problem.

Minimization of the cost function (2) provides only the weight parameter $\hat w$ of the decision boundary. The bias parameter, $\hat b$, could be chosen by plotting the histogram of the inner products $\{\langle \hat w, x_i \rangle\}_{i \in I}$ and $\{\langle \hat w, y_j \rangle\}_{j \in J}$ and fixing a value for $\hat b$ such that

$$\langle \hat w, x_i \rangle + \hat b \ge 0 \quad \forall i \in I, \tag{4a}$$

$$\langle \hat w, y_j \rangle + \hat b \le 0 \quad \forall j \in J. \tag{4b}$$

The largest hard margin is achieved by

$$\hat b = -\frac{1}{2} \min_{i \in I} \langle \hat w, x_i \rangle - \frac{1}{2} \max_{j \in J} \langle \hat w, y_j \rangle. \tag{5}$$

However, by choosing a larger or smaller value for $\hat b$, it is possible to make a tradeoff between the Type-I and Type-II errors.

The cost function (2) includes a loss defined on every pair of data points from the two classes. There are two aspects of this fact:

1. When standard loss functions are used for classification tasks, we need to oversample or undersample either of the classes if the training dataset contains different numbers of points from different classes. This problem does not arise when we use the cost function (2).

2. The number of pairs, $|I||J|$, will usually be much larger than the size of the original dataset, which contains $|I| + |J|$ points. Therefore, the minimization of (2) might appear computationally more expensive than the minimization of the standard cross-entropy loss. However, if the points in different classes are well separated and the stochastic gradient method is used to minimize (2), the algorithm could achieve zero training error after using only a few pairs, which is formalized in Theorem 3. Further computation is needed only to improve the margin of the classifier. In addition, in our experiments to train a neural network to classify two classes from the CIFAR-10 dataset, only a few percent of the $|I||J|$ pairs were observed to be sufficient to reach an accuracy on the test dataset that is comparable to the accuracy of the cross-entropy loss minimization.

Theorem 3. Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, assume the cost function (2) is minimized with the stochastic gradient method. Define

$$R_x = \max\{\|x_i - x_{i'}\| : i, i' \in I\}, \quad R_y = \max\{\|y_j - y_{j'}\| : j, j' \in J\},$$

and let $\gamma$ denote the hard margin that would be obtained with the SVM:

$$2\gamma = \max_{u \in \mathbb{R}^d} \min_{i \in I, j \in J} \langle x_i - y_j, u / \|u\| \rangle.$$

If $2\gamma \ge 5 \max(R_x, R_y)$, then the stochastic gradient algorithm produces a weight parameter, $\hat w$, in only one iteration, which satisfies the inequalities (4a)-(4b) along with the bias, $\hat b$, given by (5).

4.2. Differential Training for Nonlinear Classifiers

When a neural network is used to find a nonlinear classifier, a candidate cost function analogous to (2) for differential training would be

$$\sum_{i \in I} \sum_{j \in J} \log\left(1 + e^{-w^\top (\phi_\theta(x_i) - \phi_\theta(y_j))}\right), \tag{6}$$

where $\phi_\theta$ is the output of the penultimate layer of the network and represents the features of the points. However, in our experiments, minimization of (6) failed to provide a large margin in the input space. One reason for this is that the minimization of (6) does not guarantee a small Lipschitz constant for the mapping $\phi_\theta$. Therefore, even if the margin is large in the penultimate layer, the margin in the input space could still be very small.

A cost function that does provide a large margin in the input space is

$$\sum_{i \in I} \sum_{j \in J} \left( w^\top \phi_\theta(x_i) - w^\top \phi_\theta(y_j) - 1 \right)^2. \tag{7}$$

A partial explanation for the different behavior of this function is that the gradient descent algorithm is more likely to converge to a solution with a small Lipschitz constant if the network is trained with the squared error loss (Nar & Sastry, 2018). Consequently, the gradient method is more likely to produce a $\phi_\theta$ which has a small Lipschitz constant, and this implies that the input of $\phi_\theta$ needs to change by a large amount in order for its output to move across the decision boundary.
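As an illustration of the pairwise costs in this section, the following is a minimal NumPy sketch and not the authors' implementation. It evaluates the logistic pair cost (2)/(6) and the squared pair cost (7) on given features, runs plain gradient descent on (2) for a toy linear problem, and picks the bias with rule (5). The function names, the toy data, the step size, and the iteration count are all assumptions made for this example.

```python
import numpy as np

def pairwise_diffs(feat_x, feat_y):
    """All differences feat_x[i] - feat_y[j], shape (|I|*|J|, d)."""
    return (feat_x[:, None, :] - feat_y[None, :, :]).reshape(-1, feat_x.shape[1])

def logistic_pair_cost(w, feat_x, feat_y):
    """Cost (2)/(6): sum over pairs of log(1 + exp(-w.(phi(x_i) - phi(y_j))))."""
    return np.sum(np.logaddexp(0.0, -pairwise_diffs(feat_x, feat_y) @ w))

def squared_pair_cost(w, feat_x, feat_y):
    """Cost (7): sum over pairs of (w.phi(x_i) - w.phi(y_j) - 1)^2."""
    return np.sum((pairwise_diffs(feat_x, feat_y) @ w - 1.0) ** 2)

def train_linear_differential(x, y, lr=0.1, steps=2000):
    """Gradient descent on cost (2) with identity features, started at w = 0.

    lr and steps are illustrative choices, not values from the paper.
    """
    d = pairwise_diffs(x, y)
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        # d/dw log(1 + exp(-w.d)) = -d / (1 + exp(w.d))
        pair_weights = 1.0 / (1.0 + np.exp(d @ w))
        w += lr * (pair_weights @ d)
    return w

def bias_from_scores(w, x, y):
    """Bias rule (5): midpoint between the closest scores of the two classes."""
    return -0.5 * np.min(x @ w) - 0.5 * np.max(y @ w)

# Hypothetical toy data: both classes lie on the line {second coordinate = 1}.
x = np.array([[3.0, 1.0], [4.0, 1.0]])
y = np.array([[1.0, 1.0], [0.0, 1.0]])

w_hat = train_linear_differential(x, y)
b_hat = bias_from_scores(w_hat, x, y)
margin = min(np.min(x @ w_hat + b_hat), np.min(-(y @ w_hat + b_hat))) / np.linalg.norm(w_hat)
print("w_hat =", w_hat, "b_hat =", b_hat, "margin =", margin)
```

On this toy set every difference $x_i - y_j$ points along the first axis, so each increment of $w$ stays in that direction, which is also the direction of the hard-margin solution described in Theorem 2. For a network, the same two cost functions would be applied to the penultimate-layer features instead of the raw points.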

Figure 3 (axes: x, y; legend: differential training, cross-entropy min.). A two-layer neural network is trained with two different cost functions. Cross-entropy minimization marks the region between the dotted lines as the class of blue points, whereas the same class is assigned to the region inside the solid curve when differential training is used. Note that the decision boundaries obtained with cross-entropy minimization have extremely small margins.

The effect of training with the cost function (7) on the margin of a nonlinear classifier is demonstrated in Figure 3. A neural network with one hidden layer was trained with two different training cost functions: the cross-entropy loss and the differential training cost (7). The minimization of the cross-entropy loss provided an extremely poor margin in the input space, whereas the use of (7) led to a decision boundary with large margins.

5. Experiment on CIFAR-10: Differential Training Removes Adversarial Examples

A large margin between the decision boundary of the classifier and the points in the training dataset is expected to make it harder to find adversarial examples for these points. In order to verify whether this is the case, we trained a four-layer convolutional neural network for a binary classification task on the CIFAR-10 dataset by using only the images of planes and horses. Both cross-entropy minimization and differential training achieved zero error on the training dataset, and the accuracies of both training schemes were comparable on the test dataset: cross-entropy minimization led to 93.65% while differential training yielded 94.65%.

We generated adversarial examples for the images in the training dataset using the Projected Gradient Descent (PGD) Attack implemented by Rauber et al. (2017). The robustness of the neural network against these adversarial examples was substantially different based on whether the network was trained with the cross-entropy loss or the differential training cost (7). As shown in Figure 4, PGD was able to find adversarial examples for the images in the training dataset with small perturbations if the network was trained with the cross-entropy loss. In contrast, if the network was trained with differential training, PGD failed to find adversarial examples for the training dataset without disturbing the images by a large amount. Note that PGD was considered to be the most powerful first-order gradient-based attack in (Madry et al., 2018).

Figure 4 (title: PGD Attack; axes: norm of the disturbance vs. percentage of samples fooled; curves: Cross-entropy Min. on test, Cross-entropy Min. on train, Differential Training on test, Differential Training on train). A four-layer convolutional neural network is trained for a binary classification task on the CIFAR-10 dataset with two different training schemes: cross-entropy minimization and differential training. If the network is trained with differential training, the accuracy of the network is much higher for the adversarial examples generated from the training and test datasets with the PGD Attack. Moreover, the accuracy of the network on the adversarial examples generated from the training dataset is almost the same as its accuracy on those generated from the test dataset. Solid lines denote the accuracy on adversarial examples generated from the training dataset, and dashed lines denote the accuracy on adversarial examples generated from the test dataset.

Somewhat surprisingly, the same behavior was observed on the test dataset as well. As displayed in Figure 4, PGD failed to find adversarial examples for most of the images in the test dataset when the network was trained via differential training. Moreover, the accuracy of the network was almost the same for adversarial examples generated from the training dataset and for those generated from the test dataset.

We also tested the network under the Carlini-Wagner Attack (Carlini & Wagner, 2017) implemented by Rauber et al. (2017).
Similar to its performance under the PGD Attack, the accuracy of the network trained with differential training remained much higher compared to the network trained with cross-entropy minimization, as shown in Figure 5.

6. Discussion

Low-dimensionality of the training dataset. As stated in Remark 1, as the dimension of the affine subspace containing the training dataset gets very small compared to the dimension of the input space, the training algorithm will become more likely to yield a small margin for the classifier. This observation confirms the results of Marzi et al. (2018), which showed that if the training dataset is projected onto a low-dimensional subspace before being fed into a neural network, the performance of the network against adversarial examples is improved, since projecting the inputs onto a low-dimensional domain corresponds to decreasing the dimension of the input space. Even though this method is effective, it requires knowledge of the domain in which the training points are low-dimensional. Because this knowledge will not always be available a priori, finding alternative training algorithms and loss functions that are suited for low-dimensional data is still an important direction for future research.

Figure 5 (title: Carlini-Wagner Attack; axes: norm of the disturbance vs. percentage of samples fooled; curves: Cross-entropy Min. on test, Differential Training on test). A four-layer convolutional network is trained with two different schemes: cross-entropy minimization and differential training. If the network is trained with differential training, the accuracy of the network is much higher on the adversarial examples generated from the test dataset with the Carlini-Wagner Attack.

Robust optimization. Using robust optimization to train neural networks has been shown to be effective against adversarial examples (Madry et al., 2018; Athalye et al., 2018). Note that these techniques could be considered as inflating the training points by a presumed amount and training the classifier with these inflated points. Nevertheless, as long as the cross-entropy loss is involved, the decision boundaries of the neural network will still be in the vicinity of the inflated points. Therefore, even though the classifier is robust against disturbances of the presumed magnitude, the margin of the classifier could still be much smaller than what it could potentially be.
Differential training. We introduced differential training, which allows the feature mapping to remain trainable while ensuring a large margin between different classes of points. By doing so, this method combines the benefits of neural networks with those of support vector machines. Even though moving from $2N$ training points to $N^2$ pairs might seem prohibitive, it points out that a true classification should in fact be able to differentiate between the pairs that are hardest to differentiate, and this search will necessarily require an $N^2$ term. Some heuristic methods are likely to be effective, such as considering only a smaller subset of points closer to the boundary and updating this set of points as needed during training. If a neural network is trained with this procedure, the network will be forced to find features that are able to tell the hardest pairs apart.

Generalization of differential training, and its connection to one-shot learning. It has been shown that if a neural network is trained with robust optimization, the accuracy of the network on adversarial examples generated from the test dataset could be very low even though the accuracy on adversarial examples produced from the training dataset is high (Schmidt et al., 2018). Consequently, it has been claimed that robust optimization requires a large amount of data in order to make a network robust against adversarial perturbations on unseen images. Our empirical results on the CIFAR-10 dataset suggest that differential training does not suffer from this problem. That is, differential training provides neural networks with robustness while still using less data. This is in agreement with the main premise of Koch et al. (2015), which showed that Siamese networks, with an identical pair of networks in their architecture, perform well with few training points. Please see Section 1.2 for further comments on the relation between differential training and Siamese networks.
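The subset heuristic mentioned under "Differential training" above could be read in several ways; one possible reading is sketched below in NumPy. This is an assumption about how such a subset might be chosen, not a procedure described in the text: the function `hardest_pairs`, its parameter `k`, and the use of the current scores $w^\top \phi_\theta(\cdot)$ to rank points are all hypothetical.

```python
import numpy as np

def hardest_pairs(scores_x, scores_y, k):
    """Pair the k lowest-scoring x points with the k highest-scoring y points.

    scores_x[i] and scores_y[j] stand for the current classifier scores of the
    two classes (x should score high, y should score low). Returns an array of
    (i, j) index pairs, shape (k_x * k_y, 2). Hypothetical helper.
    """
    k_x = min(k, len(scores_x))
    k_y = min(k, len(scores_y))
    idx_x = np.argsort(scores_x)[:k_x]        # x points closest to the boundary
    idx_y = np.argsort(scores_y)[::-1][:k_y]  # y points closest to the boundary
    ii, jj = np.meshgrid(idx_x, idx_y, indexing="ij")
    return np.stack([ii.ravel(), jj.ravel()], axis=1)

# Made-up scores for three points per class.
scores_x = np.array([5.0, 1.0, 3.0])
scores_y = np.array([-4.0, -1.0, -2.0])
pairs = hardest_pairs(scores_x, scores_y, k=2)
print(pairs)
```

In a training loop, a selection of this kind would be recomputed periodically, since the set of points near the boundary changes as the parameters are updated.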
Why not empirical risk minimization with a well-known loss function? Consider the standard problem of empirical risk minimization as the proxy for finding a classifier:

$$\min_{w, \theta} \sum_{i \in I} \ell\left(w, \phi_\theta(x_i); z_i\right), \tag{8}$$

where $z_i$ denotes the label of the point $x_i$, and $w$, $\theta$ are the parameters of the classifier. If the features of the training points $\{\phi_\theta(x_i)\}_{i \in I}$ lie in a low-dimensional subspace, the cost function (8) will likely not be strictly convex; and more importantly, there will be directions in which the parameters are not penalized. Normally, the remedy would be to introduce a regularization term into the cost function. However, the effectiveness of well-known regularization terms is dubious for neural networks: they do not prevent the spectral norms of weight matrices from growing unboundedly (Bartlett et al., 2017), nor do they noticeably influence the generalization gap of networks (Zhang et al., 2017). Therefore, even if a regularization term is added externally, the gradient descent algorithm will have the potential to drive the parameters in the directions that are not penalized and cause the decision boundary to reside in the vicinity of the training points. Note that the loss function $\ell$ need not be the cross-entropy loss for this to happen. This is why the problem of poor margins is in fact not peculiar to the cross-entropy loss, and this is why other well-known loss functions will likely also fail in addressing adversarial examples.
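The statement that some parameter directions are not penalized can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the logistic loss with a linear score $w^\top \phi$, synthetic features confined to a two-dimensional subspace of $\mathbb{R}^{10}$, and made-up names and sizes. It shows that moving $w$ along a direction orthogonal to the span of the features leaves the empirical risk unchanged.

```python
import numpy as np

def logistic_risk(w, feats, labels):
    """Sum of log(1 + exp(-z_i * w.phi_i)) with labels z_i in {+1, -1}."""
    return np.sum(np.logaddexp(0.0, -labels * (feats @ w)))

rng = np.random.default_rng(0)

# Synthetic low-rank features: 50 points in a 2-dimensional subspace of R^10.
basis = rng.normal(size=(2, 10))
coeffs = rng.normal(size=(50, 2))
feats = coeffs @ basis
labels = np.where(coeffs[:, 0] > 0, 1.0, -1.0)

w = rng.normal(size=10)

# Build a direction orthogonal to the span of the features.
q, _ = np.linalg.qr(basis.T)          # columns of q: orthonormal basis of the span
v = rng.normal(size=10)
v_perp = v - q @ (q.T @ v)            # remove the component inside the span

risk_before = logistic_risk(w, feats, labels)
risk_after = logistic_risk(w + 100.0 * v_perp, feats, labels)
print(risk_before, risk_after)
```

The two printed values agree up to rounding error even though the parameter vector has moved a long way, which is the sense in which the cost is flat along such directions.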

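A. Proof of Theorem 1 and Corollary 1

Lemma 1 (adapted from Theorem 3 of Soudry et al., 2018). Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, let $\tilde{x}_i$ and $\tilde{y}_j$ denote $[x_i^\top \; 1]^\top$ and $[y_j^\top \; 1]^\top$, respectively, for all $i \in I$, $j \in J$. Then the iterate of the gradient descent algorithm, $w(t)$, on the cross-entropy loss function
\[ \min_{w \in \mathbb{R}^{d+1}} \sum_{i \in I} \log\left(1 + e^{-w^\top \tilde{x}_i}\right) + \sum_{j \in J} \log\left(1 + e^{w^\top \tilde{y}_j}\right) \]
with a sufficiently small step size will converge in direction:
\[ \lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{\bar{w}}{\|\bar{w}\|}, \]
where $\bar{w}$ is the solution to
\[ \underset{z \in \mathbb{R}^{d+1}}{\text{minimize}} \;\; \|z\|^2 \quad \text{subject to} \quad \langle z, \tilde{x}_i \rangle \ge 1 \;\; \forall i \in I, \quad \langle z, \tilde{y}_j \rangle \le -1 \;\; \forall j \in J. \tag{9} \]

Proof of Theorem 1. Assume that $\bar{w} = u + \sum_{k=1}^{m} \alpha_k r_k$, where $u \in \mathbb{R}^d$ and $\langle u, r_k \rangle = 0$ for all $k \in K$. By denoting $z = [w^\top \; b]^\top$, the Lagrangian of the problem (9) can be written as
\[ \frac{1}{2}\|w\|^2 + \frac{1}{2} b^2 + \sum_{i \in I} \mu_i \big(1 - \langle w, x_i \rangle - b\big) + \sum_{j \in J} \nu_j \big(1 + \langle w, y_j \rangle + b\big), \]
where $\mu_i \ge 0$ for all $i \in I$ and $\nu_j \ge 0$ for all $j \in J$. KKT conditions for the optimality of $\bar{w}$ and $B$ require that
\[ \bar{w} = \sum_{i \in I} \mu_i x_i - \sum_{j \in J} \nu_j y_j, \qquad B = \sum_{i \in I} \mu_i - \sum_{j \in J} \nu_j, \]
and consequently, for each $k \in K$,
\[ \langle \bar{w}, r_k \rangle = \sum_{i \in I} \mu_i \langle x_i, r_k \rangle - \sum_{j \in J} \nu_j \langle y_j, r_k \rangle = \sum_{i \in I} \Delta_k \mu_i - \sum_{j \in J} \Delta_k \nu_j = B \Delta_k. \]
Then, we can write $\bar{w}$ as
\[ \bar{w} = u + \sum_{k \in K} B \Delta_k r_k. \]
Let $\langle w_{\mathrm{SVM}}, \cdot \rangle + b_{\mathrm{SVM}} = 0$ denote the hyperplane obtained as the solution of SVM. Then $w_{\mathrm{SVM}}$ solves
\[ \underset{w}{\text{minimize}} \;\; \|w\|^2 \quad \text{subject to} \quad \langle w, x_i - y_j \rangle \ge 2 \;\; \forall i \in I, \; \forall j \in J. \tag{10} \]
Since the vector $u$ also satisfies $\langle u, x_i - y_j \rangle = \langle \bar{w}, x_i - y_j \rangle \ge 2$ for all $i \in I$, $j \in J$, we have $\|u\| \ge \|w_{\mathrm{SVM}}\| = 1/\gamma$. As a result, the margin obtained by minimizing the cross-entropy loss is
\[ \frac{1}{\|\bar{w}\|} = \frac{1}{\sqrt{\|u\|^2 + \sum_{k \in K} \|B \Delta_k r_k\|^2}} \le \frac{1}{\sqrt{\frac{1}{\gamma^2} + B^2 \sum_{k \in K} \Delta_k^2}}. \]

Proof of Corollary 1. If $B < 0$, we could consider the hyperplane $\langle \bar{w}, \cdot \rangle - B = 0$ for the points $\{-x_i\}_{i \in I}$ and $\{-y_j\}_{j \in J}$, which would have the identical margin due to symmetry. Therefore, without loss of generality, assume $B \ge 0$. As in the proof of Theorem 1, KKT conditions for the optimality of $\bar{w}$ and $B$ require
\[ \bar{w} = \sum_{i \in I} \mu_i x_i - \sum_{j \in J} \nu_j y_j, \qquad B = \sum_{i \in I} \mu_i - \sum_{j \in J} \nu_j, \]
where $\mu_i \ge 0$ and $\nu_j \ge 0$ for all $i \in I$, $j \in J$. Note that for each $k \in K$,
\[ \langle \bar{w}, r_k \rangle = \sum_{i \in I} \mu_i \langle x_i, r_k \rangle - \sum_{j \in J} \nu_j \langle y_j, r_k \rangle = B \Delta_k + \sum_{i \in I} \mu_i \big(\langle x_i, r_k \rangle - \Delta_k\big) - \sum_{j \in J} \nu_j \big(\langle y_j, r_k \rangle - \Delta_k\big) \ge B \Delta_k. \]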
Since $\{r_k\}_{k \in K}$ is an orthonormal set of vectors,
\[ \|\bar{w}\|^2 \ge \sum_{k \in K} \langle \bar{w}, r_k \rangle^2 \ge \sum_{k \in K} B^2 \Delta_k^2. \]
The result follows from the fact that $\|\bar{w}\|^{-1}$ is an upper bound on the margin.

B. Proposition 1 and Nonzero Initialization

Gradient descent on the loss
\[ \sum_{i \in I} \log\left(1 + e^{-w^\top W h_\theta(x_i)}\right) \]
leads to the dynamics
\[ \dot{W} = w v^\top, \qquad \dot{w} = W v, \]
where
\[ v = \sum_{i \in I} h_\theta(x_i) \, \frac{e^{-w^\top W h_\theta(x_i)}}{1 + e^{-w^\top W h_\theta(x_i)}}. \]
If $W(0) = 0$, then $w$ preserves its direction and $w(t) = w(0)\alpha(t)$ for all $t \ge 0$, where $\alpha : [0, \infty) \to \mathbb{R}$. Consequently, the column space of $W(t)$ is spanned by only $w(0)$, and $W(t)$ has rank 1 or 0 for every $t \ge 0$. This completes the proof of Proposition 1. In order to make a statement without the condition on $W(0)$, we need Lemma 2.
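The rank claim for $W(0) = 0$ can be checked numerically. The sketch below is illustrative only: it assumes the features $h_\theta(x_i)$ are held fixed (the vectors in `hs` are made up), replaces the continuous-time dynamics with a forward Euler discretization, and uses hypothetical helper names. It tracks whether $W$ stays rank one and whether $w$ keeps the direction of $w(0)$.

```python
import math

def v_vector(W, w, hs):
    # v = sum_i h_i * e^{-s_i} / (1 + e^{-s_i}), with s_i = w^T W h_i.
    m = len(hs[0])
    v = [0.0] * m
    for h in hs:
        Wh = [sum(row[c] * h[c] for c in range(m)) for row in W]
        s = sum(wk * x for wk, x in zip(w, Wh))
        weight = 1.0 / (1.0 + math.exp(s))  # same as e^{-s} / (1 + e^{-s})
        for c in range(m):
            v[c] += weight * h[c]
    return v

def euler_step(W, w, hs, eta):
    # Forward Euler step for dW/dt = w v^T and dw/dt = W v.
    v = v_vector(W, w, hs)
    new_W = [[W[k][c] + eta * w[k] * v[c] for c in range(len(v))]
             for k in range(len(w))]
    new_w = [w[k] + eta * sum(W[k][c] * v[c] for c in range(len(v)))
             for k in range(len(w))]
    return new_W, new_w

def max_minor(W):
    # Largest absolute 2x2 minor of W; it is zero exactly when rank(W) <= 1.
    best = 0.0
    for a in range(len(W)):
        for b in range(a + 1, len(W)):
            for c in range(len(W[0])):
                for d in range(c + 1, len(W[0])):
                    best = max(best, abs(W[a][c] * W[b][d] - W[a][d] * W[b][c]))
    return best

# Illustrative fixed features and initialization (assumptions, not from the paper).
hs = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.8, 0.2, 1.0]]
w0 = [0.7, -1.3]
W = [[0.0] * 3 for _ in range(2)]  # W(0) = 0
w = list(w0)

for _ in range(500):
    W, w = euler_step(W, w, hs, 0.05)

scale = max(abs(x) for row in W for x in row)
rank_defect = max_minor(W) / scale ** 2
direction_defect = abs(w[0] * w0[1] - w[1] * w0[0]) / (math.hypot(*w) * math.hypot(*w0))
print("relative 2x2 minor:", rank_defect)
print("deviation of w from the direction of w(0):", direction_defect)
```

Both quantities should stay at the level of floating-point rounding, since the Euler update preserves the structure $W = w(0)\beta^\top$, $w = \alpha\, w(0)$ step by step.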

Lemma 2. Consider the $n \times n$ matrix
\[ \begin{bmatrix} 0 & v \\ v^\top & 0 \end{bmatrix}, \]
where $v \in \mathbb{R}^{n-1}$, and assume $n \ge 2$. It has only one positive eigenvalue, $\|v\|_2$, with the eigenvector $[v^\top \; \|v\|_2]^\top$.

Proof. The matrix is at most rank 2, so it has at most 2 nonzero eigenvalues. The vectors $[v^\top \; \|v\|_2]^\top$ and $[v^\top \; -\|v\|_2]^\top$ are its eigenvectors corresponding to the eigenvalues $\|v\|_2$ and $-\|v\|_2$, respectively.

In the dynamics of $W$ and $w$ above, if we consider $v(t)$ as an exogenous signal, the system described becomes a linear time-varying system of the states $(W, w)$. Moreover, the dynamics of each row of the pair $(W, w)$ is independent of the other rows, but is governed by the same matrix. For example, the $k$-th row of the pair $(W, w)$ satisfies:
\[ \begin{bmatrix} \dot{W}_{k1} \\ \vdots \\ \dot{W}_{kn} \\ \dot{w}_k \end{bmatrix} = \begin{bmatrix} 0 & v(t) \\ v(t)^\top & 0 \end{bmatrix} \begin{bmatrix} W_{k1} \\ \vdots \\ W_{kn} \\ w_k \end{bmatrix}. \tag{12} \]
If the last layer of $h_\theta$ ends with a squashing function such as arctan or tanh, and if all training points are classified correctly during training, the dynamics of $v$ becomes
\[ \dot{v} \approx -\sum_{i \in I} h_\theta(x_i) \, e^{-w^\top W h_\theta(x_i)} \left( v^\top W^\top W + \|w\|_2^2 \, v^\top \right) h_\theta(x_i) \]
if the network is trained for long enough. Then the change in $v$ becomes exponentially slower than those in $W$ and $w$ as the training continues. Consequently, the vector $v(t)$ in (12) acts as a constant vector; and from Lemma 2, each row of the matrix $W$ grows in the direction $v(t)$ by the same ratio. As a result, if the algorithm is run for long, all rows of $W$ converge to the same direction. Correspondingly, all of its columns converge to a set with rank 1 or 0.

C. Proof of Theorem 2

Apply Lemma 1 by replacing the sets $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ with $\{x_i - y_j\}_{i \in I, j \in J}$ and the empty set, respectively. Then the minimization of the loss function (2) with the gradient descent algorithm leads to
\[ \lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{\bar{w}}{\|\bar{w}\|}, \]
where $\bar{w}$ satisfies
\[ \bar{w} = \arg\min_{w} \|w\|^2 \quad \text{s.t.} \quad \langle w, x_i - y_j \rangle \ge 1 \;\; \forall i \in I, \; \forall j \in J. \]
Since $w_{\mathrm{SVM}}$ is the solution of (10), we obtain $\bar{w} = \frac{1}{2} w_{\mathrm{SVM}}$, and the claim of the theorem holds.

D. Proof of Theorem 3

In order to achieve zero training error in one iteration of the stochastic gradient algorithm, it is sufficient to have
\[ \min_{i' \in I} \langle x_{i'}, x_i - y_j \rangle > \max_{j' \in J} \langle y_{j'}, x_i - y_j \rangle \quad \forall i \in I, \; \forall j \in J, \]
or equivalently,
\[ \langle x_{i'} - y_{j'}, x_i - y_j \rangle > 0 \quad \forall i, i' \in I, \; \forall j, j' \in J. \tag{13} \]
By definition of the margin, there exists a vector $w_{\mathrm{SVM}} \in \mathbb{R}^d$ with unit norm which satisfies
\[ 2\gamma = \min_{i \in I, j \in J} \langle x_i - y_j, w_{\mathrm{SVM}} \rangle. \]
Note that $w_{\mathrm{SVM}}$ is orthogonal to the decision boundary given by the SVM. Then we can write every $x_i - y_j$ as
\[ x_i - y_j = 2\gamma w_{\mathrm{SVM}} + \delta^x_i + \delta^y_j, \]
where $\delta^x_i, \delta^y_j \in \mathbb{R}^d$ and $\|\delta^x_i\| \le R_x$ and $\|\delta^y_j\| \le R_y$. Then, condition (13) is satisfied if
\[ \langle 2\gamma w_{\mathrm{SVM}} + \delta^x_i + \delta^y_j, \; 2\gamma w_{\mathrm{SVM}} + \delta^x_{i'} + \delta^y_{j'} \rangle > 0 \]
for all $i, i' \in I$ and for all $j, j' \in J$; or equivalently if
\[ 4\gamma^2 + 2\gamma \langle w_{\mathrm{SVM}}, \delta^x_i + \delta^y_j + \delta^x_{i'} + \delta^y_{j'} \rangle + \langle \delta^x_i + \delta^y_j, \delta^x_{i'} + \delta^y_{j'} \rangle > 0 \tag{14} \]
for all $i, i' \in I$ and for all $j, j' \in J$. If we choose $\gamma > \frac{5}{2} \max(R_x, R_y)$, we have
\[ 4\gamma^2 - 2\gamma (2R_x + 2R_y) - (R_x + R_y)^2 > 0, \]
which guarantees (14) and completes the proof.

References

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL press/v80/athalye8a.html.
Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.
Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Sackinger, E., and Shah, R. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7:25, doi: 0.42/S

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.
Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, June.
Fawzi, A., Moosavi-Dezfooli, S., and Frossard, P. The robustness of deep networks: A geometrical perspective. IEEE Signal Processing Magazine, 34(6):50–62, Nov 2017.
Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference and prediction. Springer, 2 edition. URL tibs/ElemStatLearn/.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Ishibashi, K., Hatano, K., and Takeda, M. Online learning of maximum p-norm margin classifiers with bias. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9–12, 2008. URL fi/papers/48-ishibashi.pdf.
Keerthi, S., Shevade, S. K., Bhattacharyya, C., and Murthy, K. A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Transactions on Neural Networks, 11(1):124–136, 1999.
Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. CoRR, 2018.
Marzi, Z., Gopalakrishnan, S., Madhow, U., and Pedarsani, R. Sparsity-based defense against adversarial attacks on linear classifiers. ArXiv e-prints, 2018.
Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Nar, K. and Sastry, S. Step size matters in deep learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018. URL step-size-matters-in-deep-learning.pdf.
Rauber, J., Brendel, W., and Bethge, M. Foolbox: a Python toolbox to benchmark the robustness of machine learning models. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint, 2018.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. ArXiv e-prints, 2018.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
