arxiv: v1 [cs.lg] 24 Jan 2019


Cross-Entropy Loss and Low-Rank Features Have Responsibility for Adversarial Examples

Kamil Nar, Orhan Ocal, S. Shankar Sastry, Kannan Ramchandran

Authors are with Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.

Abstract

State-of-the-art neural networks are vulnerable to adversarial examples; they can easily misclassify inputs that are imperceptibly different from their training and test data. In this work, we establish that the use of the cross-entropy loss function and the low-rank features of the training data are responsible for the existence of these inputs. Based on this observation, we suggest that addressing adversarial examples requires rethinking the use of the cross-entropy loss function and looking for an alternative that is better suited to minimization with low-rank features. In this direction, we present a training scheme called differential training, which uses a loss function defined on the differences between the features of points from opposite classes. We show that differential training can ensure a large margin between the decision boundary of the neural network and the points in the training dataset. This larger margin increases the amount of perturbation needed to flip the prediction of the classifier and makes it harder to find an adversarial example with small perturbations. We test differential training on a binary classification task with the CIFAR-10 dataset and demonstrate that it radically reduces the ratio of images for which an adversarial example could be found, not only in the training dataset but in the test dataset as well.

1. Introduction

Despite their high accuracy on training and test datasets, state-of-the-art neural networks are vulnerable to adversarial examples: they can easily misclassify inputs that are indistinguishable from the training and test data, and they express very high confidence in their wrong predictions (Szegedy et al., 2013).

Several methods have recently been introduced to generate these adversarial inputs (Goodfellow et al., 2015; Carlini & Wagner, 2017; Moosavi-Dezfooli et al., 2017; Athalye et al., 2018), and the simplicity and effectiveness of these methods have reinforced the concerns about the use of neural networks in many tasks.

The presence of adversarial examples was initially attributed to the high nonlinearity of deep neural networks (Szegedy et al., 2013). Later, however, it was shown that a network with few layers and a high-dimensional input space could also suffer from this problem (Goodfellow et al., 2015). Support vector machines with radial basis function kernels, on the other hand, were robust to these malicious inputs: their accuracies on test datasets and on adversarial examples were comparable. Based on these observations, it was claimed that neural networks, unlike support vector machines, failed to introduce adequate nonlinearity as a feature mapping, and this was suggested to be the main explanation for the existence of adversarial examples (Goodfellow et al., 2015).

It is correct that neural networks and support vector machines differ in their level of nonlinearity and in their level of robustness against adversarial examples, but this fact on its own does not suffice to establish a causal relation between adversarial examples and the nonlinearity of the classifier. Neural networks and support vector machines differ in many other aspects, and any of these factors may also be responsible for the presence of adversarial examples.

A major one of these factors is the training procedure. Training a support vector machine involves solving a convex optimization problem defined with the hinge loss function (Hastie et al.). Due to the convexity of the problem, the choice of optimization algorithm has no influence on the classifier obtained at the end of training.
In contrast, training a neural network requires solving a nonconvex problem, and the dynamics of the optimization algorithm become critical for the solution. They determine the local optimum obtained, and hence, the decision boundary of the trained network. The existence of adversarial examples is the manifestation of a poor margin between the decision boundary of the network and the points in the training and test datasets (Fawzi et al., 2017). What is interesting is the closeness of the training points to the decision boundary: for some reason,

the decision boundary resides extremely close to the training points even after the training is complete, although the main purpose of training is to find a boundary that is reasonably far away from these points. We seek out a reason for this poor margin among the ingredients of neural network training that are widely taken for granted: the gradient methods and the cross-entropy loss function.

1.1. Our contributions

1. We show that if a linear classifier is trained by minimizing the cross-entropy loss function via the gradient descent algorithm, and if the features of the training points lie on a low-dimensional affine subspace, then the margin between the decision boundary of the classifier and the training points could become much smaller than the optimal value.

2. We show that the penultimate layers of neural networks are very likely to produce low-rank features, and we provide empirical evidence for this on a binary classification task with the CIFAR-10 dataset. Combined with the first contribution, this suggests that neural networks could have a poor margin in their penultimate layer, and consequently, very small perturbations in this layer can easily flip the decision of the classifier.

3. In order to improve the margin, we put forward a training scheme called differential training, which uses a loss function defined on the differences between the features of the points from opposite classes. We show that this training scheme allows finding the solution with the largest hard margin for linear classifiers while still using the gradient descent algorithm.

4. We introduce a loss function that improves the margin for nonlinear classifiers and display its effectiveness on a synthetic problem. Then we test this loss function on a binary classification task with the CIFAR-10 dataset, and show that it prevents the Projected Gradient Descent Attack (Madry et al., 2018; Kurakin et al., 2016) from finding an adversarial example for most of the training and test data.

5. On the CIFAR-10 dataset, we empirically show that the network produced by differential training generalizes well over the adversarial examples. That is, the accuracy of the network is virtually the same on adversarial examples generated from the training dataset and on those generated from the test dataset. This result is critical given that the networks trained with robust optimization were shown not to generalize on adversarial examples (Schmidt et al., 2018).

1.2. Related Works

The minimization of the cross-entropy loss function via the gradient descent algorithm has recently been studied for linear classifiers, and its solution has been shown to be equivalent to a support vector machine (Soudry et al., 2018). However, it has not been emphasized that the separating hyperplane produced by the cross-entropy minimization is constrained to pass through the origin in an augmented space. We show that this fact could cause the margin of the classifier to be drastically small if the features of the dataset lie in a low-dimensional affine subspace of a high-dimensional feature space. We also show that this case is not atypical when a neural network is trained with the gradient descent algorithm, and we build a connection between this fact and the existence of adversarial examples.

It is known that if a support vector machine is formulated to find a separating hyperplane passing through the origin, the margin of the classifier will be smaller than the optimal value. In order to overcome this problem and to speed up online learning algorithms for support vector machines, the idea of using the differences between the points from opposite classes has previously been suggested (Ishibashi et al., 2008; Keerthi et al., 1999). We show that a similar idea in differential training also improves the margin when a neural network is being trained with a gradient-based method.

Differential training uses the differences between the features of the training points from opposite classes.
This training scheme has been introduced deliberately to improve the dynamics of the gradient descent algorithm on the training cost function, and we treat it in the sequel as the use of an alternative cost function, since the choice of cost function is critical. However, the procedure could also be considered as using an identical pair of networks in the network architecture, which is closely related to Siamese networks (Bromley et al., 1993; Chopra et al.). These networks were previously shown to perform well if limited data were available from any of the classes in a classification task (Koch et al., 2015). Our work shows that this architecture can also provide a large margin between the decision boundary of the classifier and the training points, and consequently, be more robust to adversarial examples if the network is trained with the cost function we suggest in Section 4.2.

2. Cross-Entropy Loss on Low-Rank Features Leads to Poor Margins

The cross-entropy loss function is almost the sole choice for classification tasks in practice. Its prevalent use is backed theoretically by its association with the minimization of the Kullback-Leibler divergence between the empirical distribution of a dataset and the confidence of the classifier for that dataset. Given the particular success of neural networks for classification tasks (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016), there seems to be little motivation to search for alternatives to this loss function, and most of the software developed for neural networks incorporates an efficient implementation of it, thereby facilitating its further use.

Nevertheless, there is a typical case where the use of the cross-entropy loss function can create a problem for the classifier, as shown in Figure 1. The source of this problem is pointed out in Theorem 1.

[Figure 1 plot: decision boundary obtained with cross-entropy minimization]
Figure 1. Orange and blue points lie on a low-dimensional affine subspace in $\mathbb{R}^2$, and they represent the data from two different classes. Cross-entropy minimization for a linear classifier on these points leads to the decision boundary shown with the solid line, which attains an extremely poor margin.

Theorem 1. Assume that the points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ are linearly separable and lie in an affine subspace; that is, there exist a set of orthonormal vectors $\{r_k\}_{k \in K}$ and a set of scalars $\{\Delta_k\}_{k \in K}$ such that

$$\langle r_k, x_i \rangle = \langle r_k, y_j \rangle = \Delta_k \quad \forall i \in I,\ \forall j \in J,\ \forall k \in K.$$

Let $\langle \overline{w}, \cdot \rangle + B = 0$ denote the decision boundary obtained by minimizing the cross-entropy loss function

$$-\sum_{i \in I} \log \frac{e^{w^\top x_i + b}}{1 + e^{w^\top x_i + b}} - \sum_{j \in J} \log \frac{1}{1 + e^{w^\top y_j + b}},$$

and assume that $\overline{w}$ and $B$ are scaled such that $\min_{i \in I, j \in J} \langle \overline{w}, x_i \rangle - \langle \overline{w}, y_j \rangle = 2$. Then the minimization of the cross-entropy loss yields a margin smaller than or equal to

$$\frac{1}{\sqrt{\gamma^{-2} + B^2 \sum_{k \in K} \Delta_k^2}},$$

where $\gamma$ denotes the optimal hard margin given by the SVM solution.

Remark 1. Theorem 1 shows that if the training points lie on an affine subspace, and if the cross-entropy loss is minimized with the gradient descent algorithm, then the margin of the classifier will be smaller than the optimal margin value.
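The effect described in Theorem 1 can be reproduced on a toy problem. The following is a minimal sketch, not the authors' experiment: it assumes a hypothetical two-point dataset on the line $\{x : x_1 = 1\}$ in $\mathbb{R}^2$, plain gradient descent from zero initialization, and an illustrative step size and iteration count. All function names are made up for this example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def train_cross_entropy(xs, ys, lr=0.01, steps=20000):
    """Gradient descent from w = 0, b = 0 on the cross-entropy loss of Theorem 1.

    Points in xs should receive positive scores, points in ys negative scores.
    """
    d = len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x in xs:
            c = -sigmoid(-(dot(w, x) + b))   # derivative of log(1 + exp(-s))
            grad_w = [g + c * xk for g, xk in zip(grad_w, x)]
            grad_b += c
        for y in ys:
            c = sigmoid(dot(w, y) + b)       # derivative of log(1 + exp(s))
            grad_w = [g + c * yk for g, yk in zip(grad_w, y)]
            grad_b += c
        w = [wk - lr * g for wk, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def geometric_margin(w, b, xs, ys):
    """Smallest signed distance from a training point to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(dot(w, w))
    scores = [dot(w, x) + b for x in xs] + [-(dot(w, y) + b) for y in ys]
    return min(scores) / norm_w

# Assumed toy data: both points lie on the line x[0] = 1 (r_1 = (1, 0), Delta_1 = 1).
xs = [(1.0, 11.0)]
ys = [(1.0, 9.0)]
optimal_margin = 0.5 * math.dist(xs[0], ys[0])  # hard-margin value for a single pair

w_ce, b_ce = train_cross_entropy(xs, ys)
margin_ce = geometric_margin(w_ce, b_ce, xs, ys)
print(optimal_margin, margin_ce)
```

In this setup the first weight and the bias receive identical updates, so the learned normal vector picks up a large component along $r_1$, and the resulting margin should come out well below the optimal value of 1.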
As the dimension of this affine subspace decreases, the cardinality of the set $K$ increases and the term $\sum_{k \in K} \Delta_k^2$ could become much larger than $1/\gamma^2$. Therefore, as the dimension of the subspace containing the training points gets smaller compared to the dimension of the input space, cross-entropy minimization with a gradient method becomes more likely to yield a poor margin.

The next corollary relaxes the condition of Theorem 1 and allows the training points to be near an affine subspace instead of being exactly on it.

Corollary 1. Assume that the points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ in $\mathbb{R}^d$ are linearly separable and there exist a set of orthonormal vectors $\{r_k\}_{k \in K}$ and a set of scalars $\{\Delta_k\}_{k \in K}$ such that

$$\langle r_k, x_i \rangle \ge \Delta_k, \quad \langle r_k, y_j \rangle \le \Delta_k \quad \forall i \in I,\ \forall j \in J,\ \forall k \in K.$$

Let $\langle \overline{w}, \cdot \rangle + B = 0$ denote the decision boundary obtained by minimizing the cross-entropy loss, as in Theorem 1. Then the minimization of the cross-entropy loss yields a margin smaller than or equal to

$$\frac{1}{\sqrt{B^2 \sum_{k \in K} \Delta_k^2}}.$$

Note that the ability to compare the margin obtained by cross-entropy minimization with the optimal value is lost. Nevertheless, the corollary highlights the fact that the same set of points could be assigned a substantially different margin by cross-entropy minimization if all of them are shifted away from the origin by the same amount in the same direction.

3. Penultimate Layers of Neural Networks Contain Low-Rank Features

The results in the previous section were for linear classifiers, and correspondingly, the features of the training points were the points themselves. In this section, we consider neural networks and regard the outputs of their penultimate layer as the features of the training points. The following proposition shows that these features can have a very low rank if the network is trained with a gradient method.

Proposition 1. Given a set of points $\{x_i\}_{i \in I}$, assume that an $L$-layer network is trained by minimizing the cross-entropy loss function:

$$\min_{w, \theta} \sum_{i \in I} -\log \frac{e^{w^\top \phi_\theta(x_i)}}{1 + e^{w^\top \phi_\theta(x_i)}},$$

where $\phi_\theta(x_i)$ is the output of the penultimate layer of the network and represents the features for point $x_i$. Assume that $\phi_\theta$ ends with a linear layer, i.e., $\phi_\theta = W h_\theta$ where $W$ is a matrix and $h_\theta$ is the first $L - 2$ layers of the network. If the gradient descent algorithm is initialized with $W[0] = 0$, then the rank of the set $\{\phi_{\hat\theta}(x_i)\}_{i \in I}$ is at most 1 whenever the algorithm is terminated.

The assumption on the initialization of the matrix $W$ could be removed if the network has a certain structure, for example, if the last layer of $h_\theta$ ends with a squashing function such as arctan or tanh. In this case, the points in $\{\phi_\theta(x_i)\}_{i \in I}$ keep growing in the same direction if the algorithm is run for long enough, and consequently, this set converges to a set with rank 1 as well. More detail on this case is provided in Appendix B.

Note that the only strong assumption in Proposition 1 is the requirement that $\phi_\theta$ ends with a linear layer. Otherwise, $\phi_\theta$ is allowed to contain any type of nonlinear activation functions and convolutional layers.

To empirically verify whether the features in a neural network are still low-rank even when the penultimate layer is nonlinear, we trained a standard network with ReLU activations for a binary classification task on the CIFAR-10 dataset. The cross-entropy loss function was minimized with three different optimization schemes to train the network. Even though all parameters of the network were initialized as in He et al. (2015), the features in the penultimate layer had rank 2 if the training cost was minimized via the gradient method with momentum. When the optimization algorithm was changed to Adam or when batch normalization was used during training, the rank of the features still remained much lower than the dimension of the feature space, as shown in Figure 2.

Remark 2.
Proposition 1, along with the empirical observations on the CIFAR-10 dataset, shows that the low-rankness of the features of the training dataset is not an exceptional case; on the contrary, it can arise in most cases. This has recently been supported by Martin & Mahoney (2018) as well. Along with the main result of Section 2, the fact that the penultimate layer of the network contains low-rank features indicates a small margin between the decision boundary of the classifier and the features in this layer. In other words, small perturbations in the penultimate layer can easily flip the decision of the classifier.

4. Differential Training Improves Margin

In previous sections, we saw that the combination of the cross-entropy loss function, low-rank features of the training dataset, and the gradient descent algorithm could lead to a poor margin. We change the training cost function in the following subsections in order to increase the margin of the classifier.

[Figure 2 plot: variance explained versus number of principal components used; curves: Adam+BatchNorm, Adam, momentum]
Figure 2. The outputs of the penultimate layer of a neural network can be considered as the features of the training points. A four-layer convolutional network is trained by minimizing the cross-entropy loss function via three different optimization schemes. The plot shows the cumulative variance explained for these features as a function of the number of principal components used. The features lie in a two-dimensional subspace if the gradient method with momentum is used. For the other two algorithms, almost all the variance in the features is captured by the first 20 principal components out of 84.

4.1. Differential Training for Linear Classifiers

Consider the binary classification problem with only two training points, x and y, from two different classes.
If we use the cross-entropy loss function to find a linear classifier by minimizing

$$-\log \frac{e^{w^\top x + b}}{1 + e^{w^\top x + b}} - \log \frac{1}{1 + e^{w^\top y + b}},$$

the gradient descent algorithm gives the update rule

$$\begin{bmatrix} w \\ b \end{bmatrix} \leftarrow \begin{bmatrix} w \\ b \end{bmatrix} + \eta \left( \begin{bmatrix} x \\ 1 \end{bmatrix} \frac{e^{-w^\top x - b}}{1 + e^{-w^\top x - b}} - \begin{bmatrix} y \\ 1 \end{bmatrix} \frac{e^{w^\top y + b}}{1 + e^{w^\top y + b}} \right) \qquad (1)$$

where $\eta$ is the learning rate of the algorithm. The update rule for $w$ reveals a critical fact: even though the optimal direction for $w$ is $x - y$, the increments in $w$ are usually not in this direction.

Now consider the problem of finding a separating hyperplane for a linearly separable dataset. If the dataset is low rank, the differences between the training points span a low-dimensional subspace. However, at each iteration of the gradient descent algorithm, the increments on the normal vector of the decision boundary will usually contain components outside of this subspace, as can be seen in (1). These increments could be forced to lie in the same subspace by feeding the differences of the points from opposite classes, instead of the points themselves, into the loss function. In

fact, a loss function of this form enables finding the separating hyperplane with the largest margin with the gradient descent algorithm.

Theorem 2. Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, if we solve

$$\min_{w \in \mathbb{R}^d} \sum_{i \in I} \sum_{j \in J} \log\left(1 + e^{-w^\top (x_i - y_j)}\right) \qquad (2)$$

by using the gradient descent algorithm with a sufficiently small learning rate, then the direction of $w$ converges to the direction of the maximum-margin solution, i.e.,

$$\lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{w_{SVM}}{\|w_{SVM}\|}, \qquad (3)$$

where $w_{SVM}$ is the solution to the hard-margin SVM problem.

Minimization of the cost function (2) provides only the weight parameter $\hat{w}$ of the decision boundary. The bias parameter, $b$, could be chosen by plotting the histogram of the inner products $\{\langle \hat{w}, x_i \rangle\}_{i \in I}$ and $\{\langle \hat{w}, y_j \rangle\}_{j \in J}$ and fixing a value for $\hat{b}$ such that

$$\langle \hat{w}, x_i \rangle + \hat{b} \ge 0 \quad \forall i \in I, \qquad (4a)$$
$$\langle \hat{w}, y_j \rangle + \hat{b} \le 0 \quad \forall j \in J. \qquad (4b)$$

The largest hard margin is achieved by

$$\hat{b} = -\frac{1}{2} \min_{i \in I} \langle \hat{w}, x_i \rangle - \frac{1}{2} \max_{j \in J} \langle \hat{w}, y_j \rangle. \qquad (5)$$

However, by choosing a larger or smaller value for $\hat{b}$, it is possible to make a tradeoff between the Type-I and Type-II errors.

The cost function (2) includes a loss defined on every pair of data points from the two classes. There are two aspects of this fact:

1. When standard loss functions are used for classification tasks, we need to oversample or undersample either of the classes if the training dataset contains different numbers of points from different classes. This problem does not arise when we use the cost function (2).

2. The number of pairs, $|I||J|$, will usually be much larger than the size of the original dataset, which contains $|I| + |J|$ points. Therefore, the minimization of (2) might appear computationally more expensive than the minimization of the standard cross-entropy loss. However, if the points in different classes are well separated and the stochastic gradient method is used to minimize (2), the algorithm could achieve zero training error after using only a few pairs, which is formalized in Theorem 3.
Further computation is needed only to improve the margin of the classifier. In addition, in our experiments to train a neural network to classify two classes from the CIFAR-10 dataset, only a few percent of the $|I||J|$ pairs were observed to be sufficient to reach an accuracy on the test dataset that is comparable to the accuracy of the cross-entropy loss minimization.

Theorem 3. Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, assume the cost function (2) is minimized with the stochastic gradient method. Define

$$R_x = \max\{\|x_i - x_{i'}\| : i, i' \in I\}, \quad R_y = \max\{\|y_j - y_{j'}\| : j, j' \in J\},$$

and let $\gamma$ denote the hard margin that would be obtained with the SVM:

$$2\gamma = \max_{u \in \mathbb{R}^d} \min_{i \in I, j \in J} \langle x_i - y_j, u / \|u\| \rangle.$$

If $2\gamma \ge 5 \max(R_x, R_y)$, then the stochastic gradient algorithm produces, in only one iteration, a weight parameter $\hat{w}$ which satisfies the inequalities (4a)-(4b) along with the bias $\hat{b}$ given by (5).

4.2. Differential Training for Nonlinear Classifiers

When a neural network is used to find a nonlinear classifier, a candidate cost function analogous to (2) for differential training would be

$$\sum_{i \in I} \sum_{j \in J} \log\left(1 + e^{-w^\top (\phi_\theta(x_i) - \phi_\theta(y_j))}\right) \qquad (6)$$

where $\phi_\theta$ is the output of the penultimate layer of the network and represents the features of the points. However, in our experiments, minimization of (6) failed to provide a large margin in the input space. One reason for this is that the minimization of (6) does not guarantee a small Lipschitz constant for the mapping $\phi_\theta$. Therefore, even if the margin is large in the penultimate layer, the margin in the input space could still be very small. A cost function that does provide a large margin in the input space is

$$\sum_{i \in I} \sum_{j \in J} \left(w^\top \phi_\theta(x_i) - w^\top \phi_\theta(y_j) - 1\right)^2. \qquad (7)$$

A partial explanation for the different behavior of this function is that the gradient descent algorithm is more likely to converge to a solution with a small Lipschitz constant if the network is trained with the squared error loss (Nar & Sastry, 2018). Consequently, the gradient method is more likely to produce a $\phi_\theta$ which has a small Lipschitz constant, and this implies that the input of $\phi_\theta$ needs to change by a large amount in order for its output to move across the decision boundary.
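For the linear case of Section 4.1, the recipe of minimizing the pairwise cost (2) and then setting the bias with (5) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the six-point dataset, the step size, the iteration count, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def train_differential(xs, ys, lr=0.01, steps=2000):
    """Gradient descent from w = 0 on cost (2), defined on all pairwise differences."""
    diffs = [[a - c for a, c in zip(x, y)] for x in xs for y in ys]
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for dv in diffs:
            c = -sigmoid(-dot(w, dv))        # derivative of log(1 + exp(-s))
            grad = [g + c * dk for g, dk in zip(grad, dv)]
        w = [wk - lr * g for wk, g in zip(w, grad)]
    return w

def bias_from_eq5(w, xs, ys):
    """Bias placing the boundary midway between the two classes, as in (5)."""
    return -0.5 * min(dot(w, x) for x in xs) - 0.5 * max(dot(w, y) for y in ys)

def geometric_margin(w, b, xs, ys):
    """Smallest signed distance from a training point to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(dot(w, w))
    scores = [dot(w, x) + b for x in xs] + [-(dot(w, y) + b) for y in ys]
    return min(scores) / norm_w

# Assumed toy data on the line x[0] = 1; the optimal hard margin is 1.
xs = [(1.0, 11.0), (1.0, 12.0), (1.0, 13.0)]
ys = [(1.0, 9.0), (1.0, 8.0), (1.0, 7.0)]

w_dt = train_differential(xs, ys)
b_dt = bias_from_eq5(w_dt, xs, ys)
margin_dt = geometric_margin(w_dt, b_dt, xs, ys)
print(w_dt, b_dt, margin_dt)
```

Because every pairwise difference here has a zero first coordinate, the gradient never leaves the subspace spanned by the differences, so the weight vector stays aligned with the maximum-margin direction throughout training.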

[Figure 3 plot: axes x and y; curves: differential training, cross-entropy min.]
Figure 3. A two-layer neural network is trained with two different cost functions. Cross-entropy minimization marks the region between the dotted lines as the class of blue points, whereas the same class is assigned to the region inside the solid curve when differential training is used. Note that the decision boundaries obtained with cross-entropy minimization have extremely small margins.

The effect of training with the cost function (7) on the margin of a nonlinear classifier is demonstrated in Figure 3. A neural network with one hidden layer was trained with two different training cost functions: the cross-entropy loss and the differential training cost (7). The minimization of the cross-entropy loss provided an extremely poor margin in the input space, whereas the use of (7) led to a decision boundary with large margins.

5. Experiment on CIFAR-10: Differential Training Removes Adversarial Examples

A large margin between the decision boundary of the classifier and the points in the training dataset is expected to make it harder to find adversarial examples for these points. In order to verify whether this is the case, we trained a four-layer convolutional neural network for a binary classification task on the CIFAR-10 dataset using only the images of planes and horses. Both cross-entropy minimization and differential training achieved zero error on the training dataset, and the accuracies of both training schemes were comparable on the test dataset: cross-entropy minimization led to 93.65% while differential training yielded 94.65%.

We generated adversarial examples for the images in the training dataset using the Projected Gradient Descent Attack (PGD) implemented by Rauber et al. (2017). The robustness of the neural network against these adversarial examples was substantially different based on whether the network was trained with the cross-entropy loss or the differential training cost (7).
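The link between margin and attack budget can be illustrated on a linear classifier. The sketch below is a simplified L2 variant of projected gradient descent written for this example; it is not the implementation by Rauber et al. used in the experiments, and the choice of the L2 norm, the step size, the iteration count, and the two hand-picked classifiers are all assumptions.

```python
import math

def score(w, b, x):
    return sum(wk * xk for wk, xk in zip(w, x)) + b

def pgd_l2_linear(w, b, x0, label, eps, step=0.1, iters=50):
    """L2 projected gradient attack on the linear classifier sign(w.x + b).

    label is +1 or -1; the attack tries to make label * score negative
    while keeping the perturbation within an L2 ball of radius eps around x0.
    """
    norm_w = math.sqrt(sum(wk * wk for wk in w))
    x = list(x0)
    for _ in range(iters):
        # The gradient of label * score with respect to x is label * w: step against it.
        x = [xk - step * label * wk / norm_w for xk, wk in zip(x, w)]
        # Project back onto the L2 ball of radius eps around x0.
        delta = [xk - x0k for xk, x0k in zip(x, x0)]
        norm_delta = math.sqrt(sum(dk * dk for dk in delta))
        if norm_delta > eps:
            delta = [dk * eps / norm_delta for dk in delta]
        x = [x0k + dk for x0k, dk in zip(x0, delta)]
    return x

def is_fooled(w, b, x0, label, eps):
    x_adv = pgd_l2_linear(w, b, x0, label, eps)
    return label * score(w, b, x_adv) < 0

x0, label = (1.0, 11.0), +1
large_margin = ([0.0, 1.0], -10.0)   # hypothetical classifier; x0 is at distance 1
small_margin = ([-5.0, 1.0], -5.0)   # hypothetical classifier; x0 is at distance 1/sqrt(26)

fooled_large = is_fooled(*large_margin, x0, label, eps=0.5)
fooled_small = is_fooled(*small_margin, x0, label, eps=0.5)
print(fooled_large, fooled_small)
```

Both classifiers label `x0` correctly, but with a budget of 0.5 only the small-margin one can be flipped; the large-margin classifier needs a budget above 1 before the attack succeeds.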
As shown in Figure 4, PGD was able to find adversarial examples for the images in the training dataset with small perturbations if the network was trained with the cross-entropy loss. In contrast, if the network was trained with differential training, PGD failed to find adversarial examples for the training dataset without disturbing the images by a large amount. Note that PGD was considered to be the most powerful first-order gradient-based attack in Madry et al. (2018).

[Figure 4 plot, titled PGD Attack: percentage of samples fooled versus norm of the disturbance; curves: Cross-entropy Min. on test, Cross-entropy Min. on train, Differential Training on test, Differential Training on train]
Figure 4. A four-layer convolutional neural network is trained for a binary classification task on the CIFAR-10 dataset with two different training schemes: cross-entropy minimization and differential training. If the network is trained with differential training, the accuracy of the network is much higher for the adversarial examples generated from the training and test datasets with the PGD Attack. Moreover, the accuracy of the network on the adversarial examples generated from the training dataset is almost the same as its accuracy on those generated from the test dataset. Solid lines denote the accuracy on adversarial examples generated from the training dataset, and dashed lines denote the accuracy on adversarial examples generated from the test dataset.

Somewhat surprisingly, the same behavior was observed on the test dataset as well. As displayed in Figure 4, PGD failed to find adversarial examples for most of the images in the test dataset when the network was trained via differential training. Moreover, the accuracy of the network was almost the same for adversarial examples generated from the training dataset and for those generated from the test dataset.

We also tested the network under the Carlini-Wagner Attack (Carlini & Wagner, 2017) implemented by Rauber et al. (2017).
Similar to its performance under the PGD Attack, the accuracy of the network trained with differential training remained much higher compared to the network trained with cross-entropy minimization, as shown in Figure 5.

[Figure 5 plot, titled Carlini-Wagner Attack: percentage of samples fooled versus norm of the disturbance; curves: Cross-entropy Min. on test, Differential Training on test]
Figure 5. A four-layer convolutional network is trained with two different schemes: cross-entropy minimization and differential training. If the network is trained with differential training, the accuracy of the network is much higher on the adversarial examples generated from the test dataset with the Carlini-Wagner Attack.

6. Discussion

Low-dimensionality of the training dataset. As stated in Remark 1, as the dimension of the affine subspace containing the training dataset gets very small compared to the dimension of the input space, the training algorithm will become more likely to yield a small margin for the classifier. This observation agrees with the results of Marzi et al. (2018), which showed that if the training dataset is projected onto a low-dimensional subspace before being fed into a neural network, the performance of the network against adversarial examples is improved, since projecting the inputs onto a low-dimensional domain corresponds to decreasing the dimension of the input space. Even though this method is effective, it requires knowledge of the domain in which the training points are low-dimensional. Because this knowledge will not always be available a priori, finding alternative training algorithms and loss functions that are suited for low-dimensional data is still an important direction for future research.

Robust optimization. Using robust optimization to train neural networks has been shown to be effective against adversarial examples (Madry et al., 2018; Athalye et al., 2018). Note that these techniques could be considered as inflating the training points by a presumed amount and training the classifier with these inflated points. Nevertheless, as long as the cross-entropy loss is involved, the decision boundaries of the neural network will still be in the vicinity of the inflated points. Therefore, even though the classifier is robust against disturbances of the presumed magnitude, the margin of the classifier could still be much smaller than what it could potentially be.
Differential training. We introduced differential training, which allows the feature mapping to remain trainable while ensuring a large margin between different classes of points. By doing so, this method combines the benefits of neural networks with those of support vector machines. Even though moving from $2N$ training points to $N^2$ pairs might seem prohibitive, it points out that a true classification should in fact be able to differentiate between the pairs that are hardest to differentiate, and this search will necessarily require an $N^2$ term. Some heuristic methods are likely to be effective, such as considering only a smaller subset of points closer to the boundary and updating this set of points as needed during training. If a neural network is trained with this procedure, the network will be forced to find features that are able to tell apart the hardest pairs.

Generalization of differential training, and its connection to one-shot learning. It has been shown that if a neural network is trained with robust optimization, the accuracy of the network on adversarial examples generated from the test dataset could be very low even though the accuracy on adversarial examples produced from the training dataset is high (Schmidt et al., 2018). Consequently, it has been claimed that robust optimization requires a large amount of data in order to make a network robust against adversarial perturbations on unseen images. Our empirical results on the CIFAR-10 dataset suggest that differential training does not suffer from this problem. That is, differential training provides neural networks with robustness while still using less data. This is in congruence with the main premise of Koch et al. (2015), which showed that Siamese networks with an identical pair of networks in their architecture perform well with few training points. Please see Section 1.2 for further comments on the relation between differential training and Siamese networks.
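One possible reading of the subset heuristic mentioned above is to form pairs only from the points that currently sit closest to the boundary. The sketch below is an assumption about how this could look, not a procedure described in the paper: `hardest_pairs` and its parameter `k` are hypothetical, and the scores stand for $w^\top \phi_\theta(\cdot)$ evaluated with the current parameters.

```python
def hardest_pairs(scores_x, scores_y, k):
    """Index pairs (i, j) built from the k lowest-scoring points of the positive
    class and the k highest-scoring points of the negative class."""
    low_x = sorted(range(len(scores_x)), key=lambda i: scores_x[i])[:k]
    high_y = sorted(range(len(scores_y)), key=lambda j: scores_y[j], reverse=True)[:k]
    return [(i, j) for i in low_x for j in high_y]

# Hypothetical current scores for three points per class.
pairs = hardest_pairs([3.0, 0.5, 2.0], [-1.0, 0.2, -3.0], k=2)
print(pairs)
```

The first returned pair is the one with the smallest score difference, and the selection would be recomputed periodically as the scores change during training.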
Why not empirical risk minimization with a well-known loss function? Consider the standard problem of empirical risk minimization as the proxy for finding a classifier:

$$\min_{w, \theta} \sum_{i \in I} \ell\left(w, \phi_\theta(x_i); z_i\right) \qquad (8)$$

where $z_i$ denotes the label of the point $x_i$, and $w$, $\theta$ are the parameters of the classifier. If the features of the training points $\{\phi_\theta(x_i)\}_{i \in I}$ lie in a low-dimensional subspace, the cost function (8) will likely not be strictly convex; and more importantly, there will be directions in which the parameters are not penalized. Normally, the remedy would be to introduce a regularization term into the cost function. However, the effectiveness of well-known regularization terms is dubious for neural networks: they do not prevent spectral norms of weight matrices from growing unboundedly (Bartlett et al., 2017), nor do they influence the generalization gap of networks noticeably (Zhang et al., 2017). Therefore, even if a regularization term is added externally, the gradient descent algorithm will have the potential to drive the parameters in the directions that are not penalized and cause the decision boundary to reside in the vicinity of the training points. Note that the loss function $\ell$ need not be the cross-entropy loss for this to happen. This is why the problem of poor margins is in fact not peculiar to the cross-entropy loss, and this is why other well-known loss functions will likely also fail in addressing adversarial examples.

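A. Proof of Theorem 1 and Corollary 1

Lemma 1 (Adapted from Theorem 3 of Soudry et al., 2018). Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, let $\tilde{x}_i$ and $\tilde{y}_j$ denote $[x_i^\top\ 1]^\top$ and $[y_j^\top\ 1]^\top$, respectively, for all $i \in I$, $j \in J$. Then the iterate of the gradient descent algorithm, $\tilde{w}(t)$, on the cross-entropy loss function

$$\min_{\tilde{w} \in \mathbb{R}^{d+1}} \sum_{i \in I} \log\left(1 + e^{-\tilde{w}^\top \tilde{x}_i}\right) + \sum_{j \in J} \log\left(1 + e^{\tilde{w}^\top \tilde{y}_j}\right)$$

with a sufficiently small step size will converge in direction:

$$\lim_{t \to \infty} \frac{\tilde{w}(t)}{\|\tilde{w}(t)\|} = \frac{\overline{w}}{\|\overline{w}\|},$$

where $\overline{w}$ is the solution to

$$\underset{z \in \mathbb{R}^{d+1}}{\text{minimize}} \ \|z\|^2 \quad \text{subject to} \quad \langle z, \tilde{x}_i \rangle \ge 1 \ \forall i \in I, \quad \langle z, \tilde{y}_j \rangle \le -1 \ \forall j \in J. \qquad (9)$$

Proof of Theorem 1. Assume that $\overline{w} = u + \sum_{k=1}^{m} \alpha_k r_k$, where $u \in \mathbb{R}^d$ and $\langle u, r_k \rangle = 0$ for all $k \in K$. By denoting $z = [w^\top\ b]^\top$, the Lagrangian of the problem (9) can be written as

$$\frac{1}{2}\|w\|^2 + \frac{1}{2} b^2 + \sum_{i \in I} \mu_i \left(1 - \langle w, x_i \rangle - b\right) + \sum_{j \in J} \nu_j \left(1 + \langle w, y_j \rangle + b\right),$$

where $\mu_i \ge 0$ for all $i \in I$ and $\nu_j \ge 0$ for all $j \in J$. KKT conditions for the optimality of $\overline{w}$ and $B$ require that

$$\overline{w} = \sum_{i \in I} \mu_i x_i - \sum_{j \in J} \nu_j y_j, \quad B = \sum_{i \in I} \mu_i - \sum_{j \in J} \nu_j,$$

and consequently, for each $k \in K$,

$$\langle \overline{w}, r_k \rangle = \sum_{i \in I} \mu_i \langle x_i, r_k \rangle - \sum_{j \in J} \nu_j \langle y_j, r_k \rangle = \sum_{i \in I} \Delta_k \mu_i - \sum_{j \in J} \Delta_k \nu_j = B \Delta_k.$$

Then, we can write $\overline{w}$ as

$$\overline{w} = u + \sum_{k \in K} B \Delta_k r_k.$$

Let $\langle w_{SVM}, \cdot \rangle + b_{SVM} = 0$ denote the hyperplane obtained as the solution of SVM. Then $w_{SVM}$ solves

$$\underset{w}{\text{minimize}} \ \|w\|^2 \quad \text{subject to} \quad \langle w, x_i - y_j \rangle \ge 2 \ \forall i \in I,\ \forall j \in J. \qquad (10)$$

Since the vector $u$ also satisfies $\langle u, x_i - y_j \rangle = \langle \overline{w}, x_i - y_j \rangle \ge 2$ for all $i \in I$, $j \in J$, we have $\|u\| \ge \|w_{SVM}\| = 1/\gamma$. As a result, the margin obtained by minimizing the cross-entropy loss is

$$\frac{1}{\|\overline{w}\|} = \frac{1}{\sqrt{\|u\|^2 + \sum_{k \in K} \|B \Delta_k r_k\|^2}} \le \frac{1}{\sqrt{\gamma^{-2} + B^2 \sum_{k \in K} \Delta_k^2}}.$$

Proof of Corollary 1. If $B < 0$, we could consider the hyperplane $\langle \overline{w}, \cdot \rangle - B = 0$ for the points $\{-x_i\}_{i \in I}$ and $\{-y_j\}_{j \in J}$, which would have the identical margin due to symmetry. Therefore, without loss of generality, assume $B \ge 0$. As in the proof of Theorem 1, KKT conditions for the optimality of $\overline{w}$ and $B$ require

$$\overline{w} = \sum_{i \in I} \mu_i x_i - \sum_{j \in J} \nu_j y_j, \quad B = \sum_{i \in I} \mu_i - \sum_{j \in J} \nu_j,$$

where $\mu_i \ge 0$ and $\nu_j \ge 0$ for all $i \in I$, $j \in J$. Note that for each $k \in K$,

$$\langle \overline{w}, r_k \rangle = \sum_{i \in I} \mu_i \langle x_i, r_k \rangle - \sum_{j \in J} \nu_j \langle y_j, r_k \rangle = B \Delta_k + \sum_{i \in I} \mu_i \left(\langle x_i, r_k \rangle - \Delta_k\right) - \sum_{j \in J} \nu_j \left(\langle y_j, r_k \rangle - \Delta_k\right) \ge B \Delta_k.$$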
Since $\{r_k\}_{k \in K}$ is an orthonormal set of vectors,

$$\|\bar{w}\|^2 \ge \sum_{k \in K} \langle \bar{w}, r_k \rangle^2 \ge \sum_{k \in K} B^2 \Delta_k^2.$$

The result follows from the fact that $1/\|\bar{w}\|$ is an upper bound on the margin.

B. Proposition 1 and Nonzero Initialization

The gradient descent algorithm on

$$\sum_{i \in I} \log\left(1 + e^{-w^\top W h_\theta(x_i)}\right)$$

leads to the dynamics

$$\dot{W} = w v^\top, \qquad \dot{w} = W v, \tag{11}$$

where

$$v = \sum_{i \in I} h_\theta(x_i) \, \frac{e^{-w^\top W h_\theta(x_i)}}{1 + e^{-w^\top W h_\theta(x_i)}}.$$

If $W(0) = 0$, then $w$ preserves its direction and $w(t) = w(0)\alpha(t)$ for all $t \ge 0$, where $\alpha : [0, \infty) \to \mathbb{R}$. Consequently, the column space of $W(t)$ is spanned by only $w(0)$, and $W(t)$ has rank 1 or 0 for every $t \ge 0$. This completes the proof of Proposition 1.

In order to make a statement without the condition on $W(0)$, we need the following lemma.

Lemma 2. Consider the $n \times n$ matrix

$$\begin{bmatrix} 0 & v \\ v^\top & 0 \end{bmatrix},$$

where $v \in \mathbb{R}^{n-1}$ and assume $n \ge 2$. It has only one positive eigenvalue, $\|v\|_2$, with the eigenvector $[v^\top \ \|v\|_2]^\top$.

Proof. The matrix is at most rank 2, so it has at most 2 nonzero eigenvalues. The vectors $[v^\top \ \|v\|_2]^\top$ and $[v^\top \ -\|v\|_2]^\top$ are its eigenvectors corresponding to the eigenvalues $\|v\|_2$ and $-\|v\|_2$, respectively.

In the dynamics (11), if we consider $v(t)$ as an exogenous signal, the system described becomes a linear time-varying system of the states $(W, w)$. Moreover, the dynamics of each row of the pair $(W, w)$ is independent of the other rows, but is governed by the same matrix. For example, the $k$-th row of the pair $(W, w)$ satisfies:

$$\begin{bmatrix} \dot{W}_{k1} \\ \vdots \\ \dot{W}_{kn} \\ \dot{w}_k \end{bmatrix} = \begin{bmatrix} 0 & v(t) \\ v(t)^\top & 0 \end{bmatrix} \begin{bmatrix} W_{k1} \\ \vdots \\ W_{kn} \\ w_k \end{bmatrix}. \tag{12}$$

If the last layer of $h_\theta$ ends with a squashing function such as arctan or tanh, and if all training points are classified correctly during training, the dynamics of $v$ becomes

$$\dot{v} \approx -\sum_{i \in I} h_\theta(x_i) \, e^{-w^\top W h_\theta(x_i)} \left(v^\top W^\top W + \|w\|^2 v^\top\right) h_\theta(x_i)$$

if the network is trained for long enough. Then the change in $v$ becomes exponentially slower than those in $W$ and $w$ as the training continues. Consequently, the vector $v(t)$ in (12) acts as a constant vector; and from Lemma 2, each row of the matrix $W$ grows in the direction $v(t)$ by the same ratio. As a result, if the algorithm is run for long, all rows of $W$ converge to the same direction. Correspondingly, all of its columns converge to a set with rank 1 or 0.

C. Proof of Theorem 2

Apply Lemma 1 by replacing the sets $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ with $\{x_i - y_j\}_{i \in I, j \in J}$ and the empty set, respectively. Then the minimization of the loss function (2) with the gradient descent algorithm leads to

$$\lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{\bar{w}}{\|\bar{w}\|},$$

where $\bar{w}$ satisfies

$$\bar{w} = \arg\min_{w} \|w\|^2 \quad \text{s.t.} \quad \langle w, x_i - y_j \rangle \ge 1 \quad \forall i \in I, \ \forall j \in J.$$

Since $w_{\mathrm{SVM}}$ is the solution of (10), we obtain $\bar{w} = \frac{1}{2} w_{\mathrm{SVM}}$, and the claim of the theorem holds.

D. Proof of Theorem 3

In order to achieve zero training error in one iteration of the stochastic gradient algorithm, it is sufficient to have

$$\min_{i' \in I} \langle x_{i'}, x_i - y_j \rangle > \max_{j' \in J} \langle y_{j'}, x_i - y_j \rangle \quad \forall i \in I, \ \forall j \in J,$$

or equivalently,

$$\langle x_{i'} - y_{j'}, x_i - y_j \rangle > 0 \quad \forall i, i' \in I, \ \forall j, j' \in J. \tag{13}$$

By definition of the margin, there exists a vector $w_{\mathrm{SVM}} \in \mathbb{R}^d$ with unit norm which satisfies

$$2\gamma = \min_{i \in I, j \in J} \langle x_i - y_j, w_{\mathrm{SVM}} \rangle.$$

Note that $w_{\mathrm{SVM}}$ is orthogonal to the decision boundary given by the SVM. Then we can write every $x_i - y_j$ as

$$x_i - y_j = 2\gamma w_{\mathrm{SVM}} + \delta^x_i + \delta^y_j,$$

where $\delta^x_i, \delta^y_j \in \mathbb{R}^d$ and $\|\delta^x_i\| \le R_x$ and $\|\delta^y_j\| \le R_y$. Then, condition (13) is satisfied if

$$\langle 2\gamma w_{\mathrm{SVM}} + \delta^x_i + \delta^y_j, \ 2\gamma w_{\mathrm{SVM}} + \delta^x_{i'} + \delta^y_{j'} \rangle > 0$$

for all $i, i' \in I$ and for all $j, j' \in J$; or equivalently if

$$4\gamma^2 + 2\gamma \langle w_{\mathrm{SVM}}, \delta^x_i + \delta^y_j + \delta^x_{i'} + \delta^y_{j'} \rangle + \langle \delta^x_i + \delta^y_j, \delta^x_{i'} + \delta^y_{j'} \rangle > 0 \tag{14}$$

for all $i, i' \in I$ and for all $j, j' \in J$. If we choose $\gamma > \frac{5}{2} \max(R_x, R_y)$, we have

$$4\gamma^2 - 2\gamma(2R_x + 2R_y) - (R_x + R_y)^2 > 0,$$

which guarantees (14) and completes the proof.

References

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL press/v80/athalye8a.html.
Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.
Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Sackinger, E., and Shah, R. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7:25, doi: 10.1142/S

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.
Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June.
Fawzi, A., Moosavi-Dezfooli, S., and Frossard, P. The robustness of deep networks: A geometrical perspective. IEEE Signal Processing Magazine, 34(6):50-62, Nov 2017.
Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference and prediction. Springer, 2nd edition. URL tibs/ ElemStatLearn/.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Ishibashi, K., Hatano, K., and Takeda, M. Online learning of maximum p-norm margin classifiers with bias. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008. URL fi/papers/48-ishibashi.pdf.
Keerthi, S., Shevade, S. K., Bhattacharyya, C., and Murthy, K. A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Transactions on Neural Networks, 11(1):124-136, 1999.
Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. CoRR, 2018.
Marzi, Z., Gopalakrishnan, S., Madhow, U., and Pedarsani, R. Sparsity-based defense against adversarial attacks on linear classifiers. ArXiv e-prints, 2018.
Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Nar, K. and Sastry, S. Step size matters in deep learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018. URL step-size-matters-in-deep-learning.pdf.
Rauber, J., Brendel, W., and Bethge, M. Foolbox: a Python toolbox to benchmark the robustness of machine learning models. 2017.
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint, 2018.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. ArXiv e-prints, 2018.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.


Normalization Techniques in Training of Deep Neural Networks Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,

More information

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING Sanjeev Arora, Yingyu Liang & Tengyu Ma Department of Computer Science Princeton University Princeton, NJ 08540, USA {arora,yingyul,tengyu}@cs.princeton.edu

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

ECE 595: Machine Learning I Adversarial Attack 1

ECE 595: Machine Learning I Adversarial Attack 1 ECE 595: Machine Learning I Adversarial Attack 1 Spring 2019 Stanley Chan School of Electrical and Computer Engineering Purdue University 1 / 32 Outline Examples of Adversarial Attack Basic Terminology

More information

arxiv: v1 [cs.cv] 21 Jul 2017

arxiv: v1 [cs.cv] 21 Jul 2017 CONFIDENCE ESTIMATION IN DEEP NEURAL NETWORKS VIA DENSITY MODELLING Akshayvarun Subramanya Suraj Srinivas R.Venkatesh Babu Video Analytics Lab, Department of Computational and Data Sciences Indian Institute

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Theories of Deep Learning

Theories of Deep Learning Theories of Deep Learning Lecture 02 Donoho, Monajemi, Papyan Department of Statistics Stanford Oct. 4, 2017 1 / 50 Stats 385 Fall 2017 2 / 50 Stats 285 Fall 2017 3 / 50 Course info Wed 3:00-4:20 PM in

More information

Swapout: Learning an ensemble of deep architectures

Swapout: Learning an ensemble of deep architectures Swapout: Learning an ensemble of deep architectures Saurabh Singh, Derek Hoiem, David Forsyth Department of Computer Science University of Illinois, Urbana-Champaign {ss1, dhoiem, daf}@illinois.edu Abstract

More information

arxiv: v1 [cs.lg] 30 Jan 2019

arxiv: v1 [cs.lg] 30 Jan 2019 A Simple Explanation for the Existence of Adversarial Examples with Small Hamming Distance Adi Shamir 1, Itay Safran 1, Eyal Ronen 2, and Orr Dunkelman 3 arxiv:1901.10861v1 [cs.lg] 30 Jan 2019 1 Computer

More information

arxiv: v1 [stat.ml] 27 Nov 2018

arxiv: v1 [stat.ml] 27 Nov 2018 Robust Classification of Financial Risk arxiv:1811.11079v1 [stat.ml] 27 Nov 2018 Suproteem K. Sarkar suproteemsarkar@g.harvard.edu Daniel Giebisch * danielgiebisch@college.harvard.edu Abstract Kojin Oshiba

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

The Perceptron Algorithm 1

The Perceptron Algorithm 1 CS 64: Machine Learning Spring 5 College of Computer and Information Science Northeastern University Lecture 5 March, 6 Instructor: Bilal Ahmed Scribe: Bilal Ahmed & Virgil Pavlu Introduction The Perceptron

More information

ECE 595: Machine Learning I Adversarial Attack 1

ECE 595: Machine Learning I Adversarial Attack 1 ECE 595: Machine Learning I Adversarial Attack 1 Spring 2019 Stanley Chan School of Electrical and Computer Engineering Purdue University 1 / 32 Outline Examples of Adversarial Attack Basic Terminology

More information

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

arxiv: v1 [cs.lg] 9 Oct 2018

arxiv: v1 [cs.lg] 9 Oct 2018 The Adversarial Attack and Detection under the Fisher Information Metric Chenxiao Zhao East China Normal University 51174506043@stu.ecnu.edu.cn P. Thomas Fletcher University of Utah fletcher@sci.utah.edu

More information

ADVERSARIAL SPHERES ABSTRACT 1 INTRODUCTION. Workshop track - ICLR 2018

ADVERSARIAL SPHERES ABSTRACT 1 INTRODUCTION. Workshop track - ICLR 2018 ADVERSARIAL SPHERES Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, & Ian Goodfellow Google Brain {gilmer,lmetz,schsam,maithra,wattenberg,goodfellow}@google.com

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Stochastic Optimization Methods for Machine Learning. Jorge Nocedal

Stochastic Optimization Methods for Machine Learning. Jorge Nocedal Stochastic Optimization Methods for Machine Learning Jorge Nocedal Northwestern University SIAM CSE, March 2017 1 Collaborators Richard Byrd R. Bollagragada N. Keskar University of Colorado Northwestern

More information

arxiv: v3 [cs.cv] 28 Feb 2018

arxiv: v3 [cs.cv] 28 Feb 2018 Defense against Universal Adversarial Perturbations arxiv:1711.05929v3 [cs.cv] 28 Feb 2018 Naveed Akhtar* Jian Liu* Ajmal Mian *The authors contributed equally to this work. School of Computer Science

More information

Implicit Optimization Bias

Implicit Optimization Bias Implicit Optimization Bias as a key to Understanding Deep Learning Nati Srebro (TTIC) Based on joint work with Behnam Neyshabur (TTIC IAS), Ryota Tomioka (TTIC MSR), Srinadh Bhojanapalli, Suriya Gunasekar,

More information