arxiv: v1 [cs.lg] 24 Jan 2019
Cross-Entropy Loss and Low-Rank Features Have Responsibility for Adversarial Examples

Kamil Nar, Orhan Ocal, S. Shankar Sastry, Kannan Ramchandran

Abstract

State-of-the-art neural networks are vulnerable to adversarial examples; they can easily misclassify inputs that are imperceptibly different from their training and test data. In this work, we establish that the use of the cross-entropy loss function and the low-rank features of the training data have responsibility for the existence of these inputs. Based on this observation, we suggest that addressing adversarial examples requires rethinking the use of the cross-entropy loss function and looking for an alternative that is better suited for minimization with low-rank features. In this direction, we present a training scheme called differential training, which uses a loss function defined on the differences between the features of points from opposite classes. We show that differential training can ensure a large margin between the decision boundary of the neural network and the points in the training dataset. This larger margin increases the amount of perturbation needed to flip the prediction of the classifier and makes it harder to find an adversarial example with small perturbations. We test differential training on a binary classification task with the CIFAR-10 dataset and demonstrate that it radically reduces the ratio of images for which an adversarial example could be found, not only in the training dataset but in the test dataset as well.

1. Introduction

Despite their high accuracy on training and test datasets, state-of-the-art neural networks are vulnerable to adversarial examples: they can easily misclassify inputs that are indistinguishable from the training and test data and express very high confidence in their wrong predictions (Szegedy et al., 2013).
Several methods have recently been introduced to generate these adversarial inputs (Goodfellow et al., 2015; Carlini & Wagner, 2017; Moosavi-Dezfooli et al., 2017; Athalye et al., 2018), and the simplicity and effectiveness of these methods have reinforced the concerns about the use of neural networks in many tasks.

Authors are with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.

The presence of adversarial examples was initially attributed to the high nonlinearity of deep neural networks (Szegedy et al., 2013). Later, however, it was shown that a network with few layers and a high-dimensional input space could also suffer from this problem (Goodfellow et al., 2015). Support vector machines with radial basis functions, on the other hand, were robust to these malicious inputs: their accuracies on test datasets and on adversarial examples were comparable. Based on these observations, it was claimed that neural networks, unlike support vector machines, failed to introduce adequate nonlinearity as a feature mapping, and this was suggested to be the main explanation for the existence of adversarial examples (Goodfellow et al., 2015).

It is correct that neural networks and support vector machines differ in their level of nonlinearity and their level of robustness against adversarial examples, but this fact on its own does not suffice to build a causal relation between the adversarial examples and the nonlinearity of the classifier. Neural networks and support vector machines differ in many other aspects, and any of these factors may also have responsibility for the presence of adversarial examples.

A major one of these factors is the training procedure. Training a support vector machine involves solving a convex optimization problem defined with the hinge loss function (Hastie et al.). Due to the convexity of the problem, the choice of optimization algorithm has no influence on the classifier obtained at the end of training.
In contrast, training a neural network requires solving a nonconvex problem, and the dynamics of the optimization algorithm become critical for the solution. They determine the local optimum obtained, and hence, the decision boundary of the trained network.

The existence of adversarial examples is the manifestation of a poor margin between the decision boundary of the network and the points in the training and test datasets (Fawzi et al., 2017). What is interesting is the closeness of the training points to the decision boundary: for some reason,
the decision boundary resides extremely close to the training points even after the training is complete, although the main purpose of training is to find a boundary that is reasonably far away from these points. We seek a reason for this poor margin among the ingredients of neural network training that are widely taken for granted: the gradient methods and the cross-entropy loss function.

1.1. Our contributions

1. We show that if a linear classifier is trained by minimizing the cross-entropy loss function via the gradient descent algorithm, and if the features of the training points lie on a low-dimensional affine subspace, then the margin between the decision boundary of the classifier and the training points could become much smaller than the optimal value.

2. We show that the penultimate layers of neural networks are very likely to produce low-rank features, and we provide empirical evidence for this on a binary classification task with the CIFAR-10 dataset. Combined with the first contribution, this suggests that neural networks could have a poor margin in their penultimate layer, and consequently, very small perturbations in this layer can easily flip the decision of the classifier.

3. In order to improve the margin, we put forward a training scheme called differential training, which uses a loss function defined on the differences between the features of the points from opposite classes. We show that this training scheme allows finding the solution with the largest hard margin for linear classifiers while still using the gradient descent algorithm.

4. We introduce a loss function that improves the margin for nonlinear classifiers and display its effectiveness on a synthetic problem. Then we test this loss function on a binary classification task with the CIFAR-10 dataset, and show that it prevents the Projected Gradient Descent Attack (Madry et al., 2018; Kurakin et al., 2016) from being able to find an adversarial example for most of the training and test data.

5.
On the CIFAR-10 dataset, we empirically show that the network produced by differential training generalizes well over the adversarial examples. That is, the accuracy of the network is virtually the same on adversarial examples generated from the training dataset and on those generated from the test dataset. This result is critical given that the networks trained with robust optimization were shown not to generalize on adversarial examples (Schmidt et al., 2018).

1.2. Related Works

The minimization of the cross-entropy loss function via the gradient descent algorithm has recently been studied for linear classifiers, and its solution has been shown to be equivalent to a support vector machine (Soudry et al., 2018). However, it has not been emphasized that the separating hyperplane produced by the cross-entropy minimization is constrained to pass through the origin in an augmented space. We show that this fact could cause the margin of the classifier to be drastically small if the features of the dataset lie in a low-dimensional affine subspace in a high-dimensional feature space. We also show that this case is not atypical when a neural network is trained with the gradient descent algorithm, and we build a connection between this fact and the existence of adversarial examples.

It is known that if a support vector machine is formulated to find a separating hyperplane passing through the origin, the margin of the classifier will be smaller than the optimal value. In order to overcome this problem and to speed up online learning algorithms for support vector machines, the idea of using the differences between the points from opposite classes has previously been suggested (Ishibashi et al., 2008; Keerthi et al., 1999). We show that a similar idea in differential training also improves the margin when a neural network is being trained with a gradient-based method.

Differential training uses the differences between the features of the training points from opposite classes.
This training scheme has been intentionally introduced to improve the dynamics of the gradient descent algorithm on the training cost function, and we treat it in the sequel as the use of an alternative cost function, since the choice of cost function is critical. However, the procedure could also be considered as using an identical pair of networks in the network architecture, which is closely related to Siamese networks (Bromley et al., 1993; Chopra et al.). These networks were previously shown to perform well if limited data were available from any of the classes in a classification task (Koch et al., 2015). Our work shows that this architecture can also provide a large margin between the decision boundary of the classifier and the training points, and consequently, be more robust to adversarial examples if the network is trained with the cost function we suggest in Section 4.2.

2. Cross-Entropy Loss on Low-Rank Features Leads to Poor Margins

The cross-entropy loss function is almost the sole choice for classification tasks in practice. Its prevalent use is backed theoretically by its association with the minimization of the Kullback-Leibler divergence between the empirical distribution of a dataset and the confidence of the classifier for that dataset. Given the particular success of neural networks for classification tasks (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016), there seems to be little motivation to search for alternatives to this loss function, and most of the software developed for neural networks incorporates an efficient implementation of it, thereby facilitating its further use.

Nevertheless, there seems to be a typical case where the use of the cross-entropy loss function can create a problem for the classifier, as shown in Figure 1. The source of this problem is pointed out in Theorem 1.

Figure 1. Orange and blue points lie on a low-dimensional affine subspace in $\mathbb{R}^2$, and they represent the data from two different classes. Cross-entropy minimization for a linear classifier on these points leads to the decision boundary shown with the solid line, which attains an extremely poor margin. (Legend: decision boundary obtained with cross-entropy minimization.)

Theorem 1. Assume that the points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ are linearly separable and lie in an affine subspace; that is, there exist a set of orthonormal vectors $\{r_k\}_{k \in K}$ and a set of scalars $\{\Delta_k\}_{k \in K}$ such that

$$\langle r_k, x_i \rangle = \langle r_k, y_j \rangle = \Delta_k \quad \forall i \in I,\ \forall j \in J,\ \forall k \in K.$$

Let $\langle \bar{w}, \cdot \rangle + \bar{B} = 0$ denote the decision boundary obtained by minimizing the cross-entropy loss function

$$-\sum_{i \in I} \log \frac{e^{w^\top x_i + b}}{1 + e^{w^\top x_i + b}} - \sum_{j \in J} \log \frac{1}{1 + e^{w^\top y_j + b}},$$

and assume that $\bar{w}$ and $\bar{B}$ are scaled such that $\min_{i \in I, j \in J} \langle \bar{w}, x_i \rangle - \langle \bar{w}, y_j \rangle = 2$. Then the minimization of the cross-entropy loss yields a margin smaller than or equal to

$$\frac{1}{\sqrt{\dfrac{1}{\gamma^2} + \bar{B}^2 \sum_{k \in K} \Delta_k^2}},$$

where $\gamma$ denotes the optimal hard margin given by the SVM solution.

Remark 1. Theorem 1 shows that if the training points lie on an affine subspace, and if the cross-entropy loss is minimized with the gradient descent algorithm, then the margin of the classifier will be smaller than the optimal margin value.
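As a minimal numerical sketch of this effect (not the authors' experiment), the snippet below runs plain gradient descent on the cross-entropy loss for two hypothetical points in $\mathbb{R}^3$ whose last two coordinates are identical, so they lie on a one-dimensional affine subspace. The data, step size, iteration count, and helper names (`train_cross_entropy`, `margin`) are assumptions chosen for illustration; the optimal margin is computed with the two-point shortcut of half the distance between the points.

```python
import numpy as np

def train_cross_entropy(X, Y, lr=0.05, steps=20000):
    """Plain gradient descent on the cross-entropy loss, started from zero."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        px = 1.0 / (1.0 + np.exp(X @ w + b))     # residuals of class-x points
        py = 1.0 / (1.0 + np.exp(-(Y @ w + b)))  # residuals of class-y points
        w += lr * (px @ X - py @ Y)              # negative gradient w.r.t. w
        b += lr * (px.sum() - py.sum())          # negative gradient w.r.t. b
    return w, b

def margin(w, b, X, Y):
    """Smallest signed distance from the training points to the boundary."""
    return min((X @ w + b).min(), -(Y @ w + b).max()) / np.linalg.norm(w)

# Hypothetical toy data: one point per class. Coordinates 2 and 3 are shared,
# i.e. r_1 = e_2, r_2 = e_3 with offsets delta = (0.6, 0.8).
delta = np.array([0.6, 0.8])
X = np.array([[6.0, *delta]])
Y = np.array([[4.0, *delta]])

w, b = train_cross_entropy(X, Y)
gamma = 0.5 * np.linalg.norm(X[0] - Y[0])  # optimal hard margin for two points
m = margin(w, b, X, Y)

# Rescale so that min <w, x> - max <w, y> = 2, as in the theorem statement.
scale = 2.0 / ((X @ w).min() - (Y @ w).max())
B = scale * b
bound = 1.0 / np.sqrt(1.0 / gamma**2 + B**2 * np.sum(delta**2))

print(f"optimal margin {gamma:.3f}, cross-entropy margin {m:.3f}, bound {bound:.3f}")
```

In this sketch the components of `w` along the shared coordinates stay equal to `b * delta` at every iteration, which is the mechanism behind the extra term in the bound: the weight vector picks up components orthogonal to the subspace that contains the data.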
As the dimension of this affine subspace decreases, the cardinality of the set $K$ increases and the term $\sum_{k \in K} \Delta_k^2$ could become much larger than $1/\gamma^2$. Therefore, as the dimension of the subspace containing the training points gets smaller compared to the dimension of the input space, cross-entropy minimization with a gradient method becomes more likely to yield a poor margin.

The next corollary relaxes the condition of Theorem 1 and allows the training points to be near an affine subspace instead of being exactly on it.

Corollary 1. Assume that the points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ in $\mathbb{R}^d$ are linearly separable and there exist a set of orthonormal vectors $\{r_k\}_{k \in K}$ and a set of scalars $\{\Delta_k\}_{k \in K}$ such that

$$\langle r_k, x_i \rangle \ge \Delta_k, \quad \langle r_k, y_j \rangle \le \Delta_k \quad \forall i \in I,\ \forall j \in J,\ \forall k \in K.$$

Let $\langle \bar{w}, \cdot \rangle + \bar{B} = 0$ denote the decision boundary obtained by minimizing the cross-entropy loss, as in Theorem 1. Then the minimization of the cross-entropy loss yields a margin smaller than or equal to

$$\frac{1}{\sqrt{\bar{B}^2 \sum_{k \in K} \Delta_k^2}}.$$

Note that the ability to compare the margin obtained by cross-entropy minimization with the optimal value is lost. Nevertheless, the corollary highlights the fact that the same set of points could be assigned a substantially different margin by cross-entropy minimization if all of them are shifted away from the origin by the same amount in the same direction.

3. Penultimate Layers of Neural Networks Contain Low-Rank Features

The results in the previous section were for linear classifiers, and correspondingly, the features of the training points were the points themselves. In this section, we consider neural networks and regard the outputs of their penultimate layer as the features of the training points. The following proposition shows that these features can have a very low rank if the network is trained with a gradient method.

Proposition 1. Given a set of points $\{x_i\}_{i \in I}$, assume that an $L$-layer network is trained by minimizing the cross-entropy loss function:

$$\min_{w, \theta} \sum_{i \in I} -\log \frac{e^{w^\top \phi_\theta(x_i)}}{1 + e^{w^\top \phi_\theta(x_i)}},$$
where $\phi_\theta(x_i)$ is the output of the penultimate layer of the network and represents the features for point $x_i$. Assume that $\phi_\theta$ ends with a linear layer, i.e., $\phi_\theta = W h_\theta$, where $W$ is a matrix and $h_\theta$ is the first $L - 2$ layers of the network. If the gradient descent algorithm is initialized with $W[0] = 0$, then the rank of the set $\{\phi_{\hat{\theta}}(x_i)\}_{i \in I}$ is at most 1 whenever the algorithm is terminated.

The assumption on the initialization of the matrix $W$ could be removed if the network has a certain structure (for example, if the last layer of $h_\theta$ ends with a squashing function such as arctan or tanh). In this case, the points in $\{\phi_\theta(x_i)\}_{i \in I}$ keep growing in the same direction if the algorithm is run for long enough, and consequently, this set converges to a set with rank 1 as well. More detail on this case is provided in Appendix B.

Note that the only strong assumption in Proposition 1 is the requirement that $\phi_\theta$ ends with a linear layer. Otherwise, $\phi_\theta$ is allowed to contain any type of nonlinear activation functions and convolutional layers.

To empirically verify whether the features in a neural network are still low-rank even when the penultimate layer is nonlinear, we trained a standard network with ReLU activations for a binary classification task on the CIFAR-10 dataset. The cross-entropy loss function was minimized with three different optimization schemes to train the network. Even though all parameters of the network were initialized as in He et al. (2015), the features in the penultimate layer had rank 2 if the training cost was minimized via the gradient method with momentum. When the optimization algorithm was changed to Adam or when batch normalization was used during training, the rank of the features still remained much lower than the dimension of the feature space, as shown in Figure 2.

Remark 2.
Proposition 1, along with the empirical observations on the CIFAR-10 dataset, shows that the low-rankness of the features of the training dataset is not an exceptional case; on the contrary, it can arise in most cases. This has recently been supported by Martin & Mahoney (2018) as well. Along with the main result of Section 2, the fact that the penultimate layer of the network contains low-rank features indicates a small margin between the decision boundary of the classifier and the features in this layer. In other words, small perturbations in the penultimate layer can easily flip the decision of the classifier.

Figure 2. [Plot: variance explained versus number of principal components used; curves: Adam+BatchNorm, Adam, momentum.] The outputs of the penultimate layer of a neural network can be considered as the features of the training points. A four-layer convolutional network is trained by minimizing the cross-entropy loss function via three different optimization schemes. The plot shows the cumulative variance explained for these features as a function of the number of principal components used. The features lie in a two-dimensional subspace if the gradient method with momentum is used. For the other two algorithms, almost all the variance in the features is captured by the first 20 principal components out of 84.

4. Differential Training Improves Margin

In previous sections, we saw that the combination of the cross-entropy loss function, low-rank features of the training dataset, and the gradient descent algorithm could lead to a poor margin. We change the training cost function in the following subsections in order to increase the margin of the classifier.

4.1. Differential Training for Linear Classifiers

Consider the binary classification problem with only two training points, $x$ and $y$, from two different classes.
If we use the cross-entropy loss function to find a linear classifier by minimizing

$$-\log \frac{e^{w^\top x + b}}{1 + e^{w^\top x + b}} - \log \frac{1}{1 + e^{w^\top y + b}},$$

the gradient descent algorithm gives the update rule:

$$w \leftarrow w + \eta \left( x \, \frac{e^{-w^\top x - b}}{1 + e^{-w^\top x - b}} - y \, \frac{e^{w^\top y + b}}{1 + e^{w^\top y + b}} \right), \quad (1)$$

where $\eta$ is the learning rate of the algorithm. The update rule for $w$ reveals a critical fact: even though the optimal direction for $w$ is $x - y$, the increments in $w$ are usually not in this direction.

Now consider the problem of finding a separating hyperplane for a linearly separable dataset. If the dataset is low rank, the differences between the training points span a low-dimensional subspace. However, at each iteration of the gradient descent algorithm, the increments on the normal vector of the decision boundary will usually contain components outside of this subspace, as can be seen in (1). These increments could be forced to lie in the same subspace by feeding the differences of the points from opposite classes instead of the points themselves into the loss function. In
fact, a loss function of this form enables finding the separating hyperplane with the largest margin with the gradient descent algorithm.

Theorem 2. Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, if we solve

$$\min_{w \in \mathbb{R}^d} \sum_{i \in I} \sum_{j \in J} \log\left(1 + e^{-w^\top (x_i - y_j)}\right) \quad (2)$$

by using the gradient descent algorithm with a sufficiently small learning rate, then the direction of $w$ converges to the direction of the maximum-margin solution, i.e.

$$\lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{w_{\mathrm{SVM}}}{\|w_{\mathrm{SVM}}\|}, \quad (3)$$

where $w_{\mathrm{SVM}}$ is the solution to the hard-margin SVM problem.

Minimization of the cost function (2) provides only the weight parameter $\hat{w}$ of the decision boundary. The bias parameter, $b$, could be chosen by plotting the histogram of the inner products $\{\langle \hat{w}, x_i \rangle\}_{i \in I}$ and $\{\langle \hat{w}, y_j \rangle\}_{j \in J}$ and fixing a value for $\hat{b}$ such that

$$\langle \hat{w}, x_i \rangle + \hat{b} \ge 0 \quad \forall i \in I, \quad (4a)$$
$$\langle \hat{w}, y_j \rangle + \hat{b} \le 0 \quad \forall j \in J. \quad (4b)$$

The largest hard margin is achieved by

$$\hat{b} = -\frac{1}{2} \min_{i \in I} \langle \hat{w}, x_i \rangle - \frac{1}{2} \max_{j \in J} \langle \hat{w}, y_j \rangle. \quad (5)$$

However, by choosing a larger or smaller value for $\hat{b}$, it is possible to make a tradeoff between the Type-I and Type-II errors.

The cost function (2) includes a loss defined on every pair of data points from the two classes. There are two aspects of this fact:

1. When standard loss functions are used for classification tasks, we need to oversample or undersample either of the classes if the training dataset contains different numbers of points from different classes. This problem does not arise when we use the cost function (2).

2. The number of pairs, $|I||J|$, will usually be much larger than the size of the original dataset, which contains $|I| + |J|$ points. Therefore, the minimization of (2) might appear computationally more expensive than the minimization of the standard cross-entropy loss. However, if the points in different classes are well separated and the stochastic gradient method is used to minimize (2), the algorithm could achieve zero training error after using only a few pairs, which is formalized in Theorem 3.
Further computation is needed only to improve the margin of the classifier. In addition, in our experiments to train a neural network to classify two classes from the CIFAR-10 dataset, only a few percent of the $|I||J|$ pairs were observed to be sufficient to reach an accuracy on the test dataset that is comparable to the accuracy of the cross-entropy loss minimization.

Theorem 3. Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, assume the cost function (2) is minimized with the stochastic gradient method. Define

$$R_x = \max\{\|x_i - x_{i'}\| : i, i' \in I\}, \quad R_y = \max\{\|y_j - y_{j'}\| : j, j' \in J\},$$

and let $\gamma$ denote the hard margin that would be obtained with the SVM:

$$2\gamma = \max_{u \in \mathbb{R}^d} \min_{i \in I, j \in J} \langle x_i - y_j, u / \|u\| \rangle.$$

If $2\gamma \ge 5 \max(R_x, R_y)$, then the stochastic gradient algorithm produces, in only one iteration, a weight parameter $\hat{w}$ which satisfies the inequalities (4a)-(4b) along with the bias $\hat{b}$ given by (5).

4.2. Differential Training for Nonlinear Classifiers

When a neural network is used to find a nonlinear classifier, a candidate cost function analogous to (2) for differential training would be

$$\sum_{i \in I} \sum_{j \in J} \log\left(1 + e^{-w^\top (\phi_\theta(x_i) - \phi_\theta(y_j))}\right), \quad (6)$$

where $\phi_\theta$ is the output of the penultimate layer of the network and represents the features of the points. However, in our experiments, minimization of (6) has been observed to fail in providing a large margin in the input space. One reason for this is that the minimization of (6) does not guarantee a small Lipschitz constant for the mapping $\phi_\theta$. Therefore, even if the margin is large in the penultimate layer, the margin in the input space could still be very small.

A cost function that does provide a large margin in the input space is

$$\sum_{i \in I} \sum_{j \in J} \left( w^\top \phi_\theta(x_i) - w^\top \phi_\theta(y_j) - 1 \right)^2. \quad (7)$$

A partial explanation for the different behavior of this function is that the gradient descent algorithm is more likely to converge to a solution with a small Lipschitz constant if the network is trained with the squared error loss (Nar & Sastry, 2018).
Consequently, the gradient method is more likely to produce a $\phi_\theta$ which has a small Lipschitz constant, and this implies that the input of $\phi_\theta$ needs to change by a large amount in order for its output to move across the decision boundary.
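The pairwise costs of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy data, the step size, and the function names are hypothetical, the linear routine minimizes the pairwise logistic cost over differences $x_i - y_j$ with plain gradient descent, the bias is set to the midpoint between the two classes' inner products, and the unit target in the squared pairwise cost is an assumption of this sketch.

```python
import numpy as np

def pairwise_differences(X, Y):
    """All differences x_i - y_j, stacked into shape (|I| * |J|, d)."""
    return (X[:, None, :] - Y[None, :, :]).reshape(-1, X.shape[1])

def differential_training_linear(X, Y, lr=0.05, steps=2000):
    """Gradient descent on sum_ij log(1 + exp(-w.(x_i - y_j))), from w = 0."""
    D = pairwise_differences(X, Y)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(D @ w))  # sigma(-w.d) for every pair
        w += lr * (p @ D)                # increments lie in the span of the differences
    return w

def midpoint_bias(w, X, Y):
    """Bias halfway between the closest inner products of the two classes."""
    return -0.5 * (X @ w).min() - 0.5 * (Y @ w).max()

def squared_pairwise_cost(sx, sy, target=1.0):
    """Squared pairwise cost on scores sx = w.phi(x_i), sy = w.phi(y_j)."""
    diff = sx[:, None] - sy[None, :]
    return float(((diff - target) ** 2).sum())

# Hypothetical low-rank data: only the first coordinate varies.
X = np.array([[6.0, 0.6, 0.8], [7.0, 0.6, 0.8]])
Y = np.array([[4.0, 0.6, 0.8], [2.0, 0.6, 0.8]])

w_hat = differential_training_linear(X, Y)
b_hat = midpoint_bias(w_hat, X, Y)
margin_hat = min((X @ w_hat + b_hat).min(),
                 -(Y @ w_hat + b_hat).max()) / np.linalg.norm(w_hat)
print(w_hat / np.linalg.norm(w_hat), margin_hat)
```

Because every difference in this toy dataset points along the first axis, the weight vector never acquires a component outside the subspace spanned by the differences, and the resulting margin equals half the gap between the two classes. For a network, `sx` and `sy` would be the scores of the two classes produced by the current parameters.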
Figure 3. [Plot: axes x and y; curves: differential training, cross-entropy min.] A two-layer neural network is trained with two different cost functions. Cross-entropy minimization marks the region between the dotted lines as the class of blue points, whereas the same class is assigned to the region inside the solid curve when differential training is used. Note that the decision boundaries obtained with cross-entropy minimization have extremely small margins.

The effect of training with the cost function (7) on the margin of a nonlinear classifier is demonstrated in Figure 3. A neural network with one hidden layer was trained with two different training cost functions: the cross-entropy loss and the differential training cost (7). The minimization of the cross-entropy loss provided an extremely poor margin in the input space, whereas the use of (7) led to a decision boundary with large margins.

5. Experiment on CIFAR-10: Differential Training Removes Adversarial Examples

A large margin between the decision boundary of the classifier and the points in the training dataset is expected to make it harder to find adversarial examples for these points. In order to verify whether this is the case, we trained a four-layer convolutional neural network for a binary classification task on the CIFAR-10 dataset by using only the images for planes and horses. Both cross-entropy minimization and differential training achieved zero error on the training dataset, and the accuracies of both training schemes were comparable on the test dataset: cross-entropy minimization led to 93.65% while differential training yielded 94.65%.

We generated adversarial examples for the images in the training dataset using the Projected Gradient Descent Attack (PGD) implemented by Rauber et al. (2017). The robustness of the neural network against these adversarial examples was substantially different based on whether the network was trained with the cross-entropy loss or the differential training cost (7).
As shown in Figure 4, PGD was able to find adversarial examples for the images in the training dataset with small perturbations if the network was trained with the cross-entropy loss. In contrast, if the network was trained with differential training, PGD failed to find adversarial examples for the training dataset without disturbing the images by a large amount. Note that PGD was considered to be the most powerful first-order gradient-based attack in Madry et al. (2018).

Figure 4. [Plot titled PGD Attack: percentage of samples fooled versus norm of the disturbance; curves: Cross-entropy Min. on test, Cross-entropy Min. on train, Differential Training on test, Differential Training on train.] A four-layer convolutional neural network is trained for a binary classification task on the CIFAR-10 dataset with two different training schemes: cross-entropy minimization and differential training. If the network is trained with differential training, the accuracy of the network is much higher for the adversarial examples generated from the training and test datasets with the PGD Attack. Moreover, the accuracy of the network on the adversarial examples generated from the training dataset is almost the same as its accuracy on those generated from the test dataset. Solid lines denote the accuracy on adversarial examples generated from the training dataset, and dashed lines denote the accuracy on adversarial examples generated from the test dataset.

Somewhat surprisingly, the same behavior was observed on the test dataset as well. As displayed in Figure 4, PGD failed to find adversarial examples for most of the images in the test dataset when the network was trained via differential training. Moreover, the accuracy of the network was almost the same for adversarial examples generated from the training dataset and for those generated from the test dataset.

We also tested the network under the Carlini-Wagner Attack (Carlini & Wagner, 2017) implemented by Rauber et al. (2017).
Similar to its performance under the PGD Attack, the accuracy of the network trained with differential training remained much higher compared to the network trained with cross-entropy minimization, as shown in Figure 5.

6. Discussion

Low-dimensionality of the training dataset. As stated in Remark 1, as the dimension of the affine subspace containing the training dataset gets very small compared to the
dimension of the input space, the training algorithm will become more likely to yield a small margin for the classifier. This observation confirms the results of Marzi et al. (2018), which showed that if the training dataset is projected onto a low-dimensional subspace before being fed into a neural network, the performance of the network against adversarial examples is improved, since projecting the inputs onto a low-dimensional domain corresponds to decreasing the dimension of the input space. Even though this method is effective, it requires knowledge of the domain in which the training points are low-dimensional. Because this knowledge will not always be available a priori, finding alternative training algorithms and loss functions that are suited for low-dimensional data is still an important direction for future research.

Figure 5. [Plot titled Carlini-Wagner Attack: percentage of samples fooled versus norm of the disturbance; curves: Cross-entropy Min. on test, Differential Training on test.] A four-layer convolutional network is trained with two different schemes: cross-entropy minimization and differential training. If the network is trained with differential training, the accuracy of the network is much higher on the adversarial examples generated from the test dataset with the Carlini-Wagner Attack.

Robust optimization. Using robust optimization to train neural networks has been shown to be effective against adversarial examples (Madry et al., 2018; Athalye et al., 2018). Note that these techniques could be considered as inflating the training points by a presumed amount and training the classifier with these inflated points. Nevertheless, as long as the cross-entropy loss is involved, the decision boundaries of the neural network will still be in the vicinity of the inflated points. Therefore, even though the classifier is robust against disturbances of the presumed magnitude, the margin of the classifier could still be much smaller than what it could potentially be.
Differential training. We introduced differential training, which allows the feature mapping to remain trainable while ensuring a large margin between different classes of points. By doing so, this method combines the benefits of neural networks with those of support vector machines. Even though moving from $2N$ training points to $N^2$ pairs might seem prohibitive, it points out that a true classifier should in fact be able to differentiate between the pairs that are hardest to differentiate, and this search will necessarily require an $N^2$ term. Some heuristic methods are likely to be effective, such as considering only a smaller subset of points closer to the boundary and updating this set of points as needed during training. If a neural network is trained with this procedure, the network will be forced to find features that are able to tell apart the hardest pairs.

Generalization of differential training, and its connection to one-shot learning. It has been shown that if a neural network is trained with robust optimization, the accuracy of the network on adversarial examples generated from the test dataset could be very low even though the accuracy on adversarial examples produced from the training dataset is high (Schmidt et al., 2018). Consequently, it has been claimed that robust optimization requires a large amount of data in order to make a network robust against adversarial perturbations on unseen images. Our empirical results on the CIFAR-10 dataset suggest that differential training does not suffer from this problem. That is, differential training provides neural networks with robustness while still using less data. This is in congruence with the main premise of Koch et al. (2015), which showed that Siamese networks with an identical pair of networks in their architecture perform well with few training points. Please see Section 1.2 for further comments on the relation between differential training and Siamese networks.
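The subset heuristic mentioned in the discussion of differential training could take many forms. The sketch below is one hypothetical reading of it, not a procedure specified by the authors: given the current scores of the two classes, it returns the `k` pairs whose score difference is smallest, which are the pairs the classifier currently separates worst. The function name `hardest_pairs` and the choice of ranking by score difference are assumptions.

```python
import numpy as np

def hardest_pairs(sx, sy, k):
    """Return the k index pairs (i, j) with the smallest score gap sx[i] - sy[j].

    sx, sy: 1-D arrays of classifier scores for the two classes.
    """
    diff = sx[:, None] - sy[None, :]                    # gap for every pair
    flat = np.argsort(diff, axis=None, kind="stable")[:k]
    i, j = np.unravel_index(flat, diff.shape)
    return list(zip(i.tolist(), j.tolist()))

# Hypothetical scores: the pair (1, 1) is misordered, so it is ranked first.
print(hardest_pairs(np.array([3.0, 1.0]), np.array([0.0, 2.0]), 2))
```

Re-running such a selection every few epochs would keep the working set of pairs small while still tracking the pairs near the boundary; how often to refresh it is left open here.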
Why not empirical risk minimization with a well-known loss function? Consider the standard problem of empirical risk minimization as the proxy for finding a classifier:

$$\min_{w, \theta} \sum_{i \in I} \ell\left(w, \phi_\theta(x_i); z_i\right), \quad (8)$$

where $z_i$ denotes the label of the point $x_i$, and $w$, $\theta$ are the parameters of the classifier. If the features of the training points $\{\phi_\theta(x_i)\}_{i \in I}$ lie in a low-dimensional subspace, the cost function (8) will likely not be strictly convex; and more importantly, there will be directions in which the parameters are not penalized. Normally, the remedy would be to introduce a regularization term into the cost function. However, the effectiveness of well-known regularization terms is dubious for neural networks: they do not prevent spectral norms of weight matrices from growing unboundedly (Bartlett et al., 2017), nor do they influence the generalization gap of networks noticeably (Zhang et al., 2017). Therefore, even if a regularization term is added externally, the gradient descent algorithm will have the potential to drive the parameters in the directions that are not penalized and cause the decision boundary to reside in the vicinity of the training points. Note that the loss function $\ell$ need not be the cross-entropy loss for this to happen. This is why the problem of poor margins is in fact not peculiar to the cross-entropy loss, and this is why other well-known loss functions will likely also fail in addressing adversarial examples.
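One way to see why unpenalized directions appear is the following short derivation. It rests on an added assumption, namely that the loss depends on $w$ only through the inner product $\langle w, \phi_\theta(x_i) \rangle$, as with a linear last layer; the symbols $S$ and $v$ are introduced here for illustration only.

```latex
Let $S = \operatorname{span}\{\phi_\theta(x_i)\}_{i \in I}$ and take any $v \perp S$. Then
\[
  \langle w + v, \phi_\theta(x_i) \rangle
  = \langle w, \phi_\theta(x_i) \rangle
    + \underbrace{\langle v, \phi_\theta(x_i) \rangle}_{=\,0}
  = \langle w, \phi_\theta(x_i) \rangle
  \qquad \forall i \in I,
\]
so every term of the cost takes the same value at $w$ and at $w + v$.
The cost is therefore constant along $v$: it cannot be strictly convex in $w$,
and the training data place no penalty on the component of $w$ orthogonal to $S$.
```

Under this assumption, the component of $w$ outside $S$ does not affect the training cost, yet it does affect where the decision boundary lies for inputs whose features fall outside $S$.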
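A. Proof of Theorem 1 and Corollary 1

Lemma 1 (Adapted from Theorem 3 of Soudry et al., 2018). Given two sets of points $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ that are linearly separable in $\mathbb{R}^d$, let $\tilde{x}_i$ and $\tilde{y}_j$ denote $[x_i^\top\ 1]^\top$ and $[y_j^\top\ 1]^\top$, respectively, for all $i \in I$, $j \in J$. Then the iterate of the gradient descent algorithm, $w(t)$, on the cross-entropy loss function

$$\min_{w \in \mathbb{R}^{d+1}} \sum_{i \in I} \log\left(1 + e^{-w^\top \tilde{x}_i}\right) + \sum_{j \in J} \log\left(1 + e^{w^\top \tilde{y}_j}\right)$$

with a sufficiently small step size will converge in direction:

$$\lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{\bar{w}}{\|\bar{w}\|},$$

where $\bar{w}$ is the solution to

$$\underset{z \in \mathbb{R}^{d+1}}{\text{minimize}}\ \|z\|^2 \quad \text{subject to} \quad \langle z, \tilde{x}_i \rangle \ge 1\ \ \forall i \in I, \quad \langle z, \tilde{y}_j \rangle \le -1\ \ \forall j \in J. \quad (9)$$

Proof of Theorem 1. Assume that $\bar{w} = u + \sum_{k \in K} \alpha_k r_k$, where $u \in \mathbb{R}^d$ and $\langle u, r_k \rangle = 0$ for all $k \in K$. By denoting $z = [w^\top\ b]^\top$, the Lagrangian of the problem (9) can be written as

$$\frac{1}{2}\|w\|^2 + \frac{1}{2} b^2 + \sum_{i \in I} \mu_i \left(1 - \langle w, x_i \rangle - b\right) + \sum_{j \in J} \nu_j \left(1 + \langle w, y_j \rangle + b\right),$$

where $\mu_i \ge 0$ for all $i \in I$ and $\nu_j \ge 0$ for all $j \in J$. The KKT conditions for the optimality of $\bar{w}$ and $\bar{B}$ require that

$$\bar{w} = \sum_{i \in I} \mu_i x_i - \sum_{j \in J} \nu_j y_j, \quad \bar{B} = \sum_{i \in I} \mu_i - \sum_{j \in J} \nu_j,$$

and consequently, for each $k \in K$,

$$\langle \bar{w}, r_k \rangle = \sum_{i \in I} \mu_i \langle x_i, r_k \rangle - \sum_{j \in J} \nu_j \langle y_j, r_k \rangle = \sum_{i \in I} \Delta_k \mu_i - \sum_{j \in J} \Delta_k \nu_j = \bar{B} \Delta_k.$$

Then, we can write $\bar{w}$ as

$$\bar{w} = u + \sum_{k \in K} \bar{B} \Delta_k r_k.$$

Let $\langle w_{\mathrm{SVM}}, \cdot \rangle + b_{\mathrm{SVM}} = 0$ denote the hyperplane obtained as the solution of the SVM. Then $w_{\mathrm{SVM}}$ solves

$$\underset{w}{\text{minimize}}\ \|w\|^2 \quad \text{subject to} \quad \langle w, x_i - y_j \rangle \ge 2 \quad \forall i \in I,\ \forall j \in J. \quad (10)$$

Since the vector $u$ also satisfies $\langle u, x_i - y_j \rangle = \langle \bar{w}, x_i - y_j \rangle \ge 2$ for all $i \in I$, $j \in J$, we have $\|u\| \ge \|w_{\mathrm{SVM}}\| = 1/\gamma$. As a result, the margin obtained by minimizing the cross-entropy loss is

$$\frac{1}{\|\bar{w}\|} = \frac{1}{\sqrt{\|u\|^2 + \sum_{k \in K} \|\bar{B} \Delta_k r_k\|^2}} \le \frac{1}{\sqrt{\dfrac{1}{\gamma^2} + \bar{B}^2 \sum_{k \in K} \Delta_k^2}}.$$

Proof of Corollary 1. If $\bar{B} < 0$, we could consider the hyperplane $\langle \bar{w}, \cdot \rangle - \bar{B} = 0$ for the points $\{-x_i\}_{i \in I}$ and $\{-y_j\}_{j \in J}$, which would have the identical margin due to symmetry. Therefore, without loss of generality, assume $\bar{B} \ge 0$. As in the proof of Theorem 1, the KKT conditions for the optimality of $\bar{w}$ and $\bar{B}$ require

$$\bar{w} = \sum_{i \in I} \mu_i x_i - \sum_{j \in J} \nu_j y_j, \quad \bar{B} = \sum_{i \in I} \mu_i - \sum_{j \in J} \nu_j,$$

where $\mu_i \ge 0$ and $\nu_j \ge 0$ for all $i \in I$, $j \in J$. Note that for each $k \in K$,

$$\langle \bar{w}, r_k \rangle = \sum_{i \in I} \mu_i \langle x_i, r_k \rangle - \sum_{j \in J} \nu_j \langle y_j, r_k \rangle = \bar{B} \Delta_k + \sum_{i \in I} \mu_i \left(\langle x_i, r_k \rangle - \Delta_k\right) - \sum_{j \in J} \nu_j \left(\langle y_j, r_k \rangle - \Delta_k\right) \ge \bar{B} \Delta_k.$$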
Since $\{r_k\}_{k \in K}$ is an orthonormal set of vectors,

$$\|\bar{w}\|^2 \ge \sum_{k \in K} \langle \bar{w}, r_k \rangle^2 \ge \sum_{k \in K} B^2 \Delta_k^2.$$

The result follows from the fact that $\|\bar{w}\|^{-1}$ is an upper bound on the margin.

B. Proposition 1 and Nonzero Initialization

Gradient descent algorithm on

$$\sum_{i \in I} \log\big(1 + e^{-w^\top W h_\theta(x_i)}\big)$$

leads to the dynamics

$$\dot{W} = w v^\top, \qquad \dot{w} = W v, \tag{11}$$

where

$$v = \sum_{i \in I} h_\theta(x_i)\, \frac{e^{-w^\top W h_\theta(x_i)}}{1 + e^{-w^\top W h_\theta(x_i)}}.$$

If $W(0) = 0$, then $w$ preserves its direction and $w(t) = w(0)\alpha(t)$ for all $t \ge 0$, where $\alpha : [0, \infty) \to \mathbb{R}$. Consequently, the column space of $W(t)$ is spanned by only $w(0)$, and $W(t)$ has rank 1 or 0 for every $t \ge 0$. This completes the proof of Proposition 1.

In order to make a statement without the condition on $W(0)$, we need the following lemma.
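As a numerical aside on the zero-initialization argument, the rank claim can be checked with a small simulation. This is a sketch under assumptions, not the authors' experiment: the vectors $h_\theta(x_i)$ are replaced by fixed random rows of a hypothetical matrix `H`, the continuous-time dynamics are replaced by plain gradient descent with an assumed step size, and all dimensions are arbitrary. Starting from $W = 0$, every update adds a multiple of $w v^\top$, so $W$ should stay rank one and $w$ should keep the direction of its initial value.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, num_points = 4, 6, 30                 # hypothetical sizes
H = rng.standard_normal((num_points, n))    # rows stand in for h_theta(x_i), held fixed
W = np.zeros((m, n))                        # W(0) = 0
w0 = rng.standard_normal(m)
w = w0.copy()
step = 1e-3                                 # assumed step size

for _ in range(3000):
    scores = H @ (W.T @ w)                  # w^T W h_theta(x_i) for every i
    # e^{-s} / (1 + e^{-s}) = 1 / (1 + e^{s}), computed in a numerically stable way
    v = H.T @ np.exp(-np.logaddexp(0.0, scores))
    # Discrete counterpart of dW/dt = w v^T and dw/dt = W v (simultaneous update)
    W, w = W + step * np.outer(w, v), w + step * (W @ v)

singular_values = np.linalg.svd(W, compute_uv=False)
rank_ratio = singular_values[1] / singular_values[0]
cosine = abs(w @ w0) / (np.linalg.norm(w) * np.linalg.norm(w0))
print("second / first singular value of W:", rank_ratio)
print("|cos| of the angle between w(t) and w(0):", cosine)
```

The singular-value ratio is used instead of a hard rank test because accumulated round-off can leave tiny nonzero singular values; a ratio near machine precision is what the rank-one structure predicts.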
Lemma 2. Consider the $n \times n$ matrix

$$\begin{bmatrix} 0 & v \\ v^\top & 0 \end{bmatrix}$$

where $v \in \mathbb{R}^{n-1}$ and assume $n \ge 2$. It has only one positive eigenvalue, $\|v\|_2$, with the eigenvector $[v^\top\ \|v\|_2]^\top$.

Proof. The matrix is at most rank 2, so it has at most 2 nonzero eigenvalues. The vectors $[v^\top\ \|v\|_2]^\top$ and $[v^\top\ -\|v\|_2]^\top$ are its eigenvectors corresponding to the eigenvalues $\|v\|_2$ and $-\|v\|_2$, respectively.

In the dynamics of $W$ and $w$ above, if we consider $v(t)$ as an exogenous signal, the system described becomes a linear time-varying system of the states $(W, w)$. Moreover, the dynamics of each row of the pair $(W, w)$ is independent of the other rows, but is governed by the same matrix. For example, the $k$-th row of the pair $(W, w)$ satisfies:

$$\begin{bmatrix} \dot{W}_{k1} \\ \vdots \\ \dot{W}_{kn} \\ \dot{w}_k \end{bmatrix} = \begin{bmatrix} 0 & v(t) \\ v(t)^\top & 0 \end{bmatrix} \begin{bmatrix} W_{k1} \\ \vdots \\ W_{kn} \\ w_k \end{bmatrix}. \tag{12}$$

If the last layer of $h_\theta$ ends with a squashing function such as arctan or tanh, and if all training points are classified correctly during training, the dynamics of $v$ becomes

$$\dot{v} \approx -\sum_{i \in I} h_\theta(x_i)\, e^{-w^\top W h_\theta(x_i)} \big(v^\top W^\top W + \|w\|^2 v^\top\big) h_\theta(x_i)$$

if the network is trained for long enough. Then the change in $v$ becomes exponentially slower than those in $W$ and $w$ as the training continues. Consequently, the vector $v(t)$ in (12) acts as a constant vector; and from Lemma 2, each row of the matrix $W$ grows in the direction $v(t)$ by the same ratio. As a result, if the algorithm is run for long, all rows of $W$ converge to the same direction. Correspondingly, all of its columns converge to a set with rank 1 or 0.

C. Proof of Theorem 2

Apply Lemma 1 by replacing the sets $\{x_i\}_{i \in I}$ and $\{y_j\}_{j \in J}$ with $\{x_i - y_j\}_{i \in I, j \in J}$ and the empty set, respectively. Then the minimization of the loss function (2) with the gradient descent algorithm leads to

$$\lim_{t \to \infty} \frac{w(t)}{\|w(t)\|} = \frac{\bar{w}}{\|\bar{w}\|},$$

where $\bar{w}$ satisfies

$$\bar{w} = \arg\min_{w} \|w\|^2 \quad \text{s.t.} \quad \langle w, x_i - y_j \rangle \ge 1 \quad \forall i \in I,\ \forall j \in J.$$

Since $w_{\mathrm{SVM}}$ is the solution of (10), we obtain $\bar{w} = \frac{1}{2} w_{\mathrm{SVM}}$, and the claim of the theorem holds.

D.
Proof of Theorem 3

In order to achieve zero training error in one iteration of the stochastic gradient algorithm, it is sufficient to have

$$\min_{i' \in I} \langle x_{i'}, x_i - y_j \rangle > \max_{j' \in J} \langle y_{j'}, x_i - y_j \rangle \quad \forall i \in I,\ \forall j \in J,$$

or equivalently,

$$\langle x_{i'} - y_{j'}, x_i - y_j \rangle > 0 \quad \forall i, i' \in I,\ \forall j, j' \in J. \tag{13}$$

By definition of the margin, there exists a vector $w_{\mathrm{SVM}} \in \mathbb{R}^d$ with unit norm which satisfies

$$2\gamma = \min_{i \in I,\, j \in J} \langle x_i - y_j, w_{\mathrm{SVM}} \rangle.$$

Note that $w_{\mathrm{SVM}}$ is orthogonal to the decision boundary given by the SVM. Then we can write every $x_i - y_j$ as

$$x_i - y_j = 2\gamma w_{\mathrm{SVM}} + \delta^x_i + \delta^y_j,$$

where $\delta^x_i, \delta^y_j \in \mathbb{R}^d$ and $\|\delta^x_i\| \le R_x$ and $\|\delta^y_j\| \le R_y$. Then, condition (13) is satisfied if

$$\langle 2\gamma w_{\mathrm{SVM}} + \delta^x_i + \delta^y_j,\ 2\gamma w_{\mathrm{SVM}} + \delta^x_{i'} + \delta^y_{j'} \rangle > 0$$

for all $i, i' \in I$ and for all $j, j' \in J$; or equivalently if

$$4\gamma^2 + 2\gamma \langle w_{\mathrm{SVM}},\ \delta^x_i + \delta^y_j + \delta^x_{i'} + \delta^y_{j'} \rangle + \langle \delta^x_i + \delta^y_j,\ \delta^x_{i'} + \delta^y_{j'} \rangle > 0 \tag{14}$$

for all $i, i' \in I$ and for all $j, j' \in J$. If we choose $\gamma > \frac{5}{2} \max(R_x, R_y)$, we have

$$4\gamma^2 - 2\gamma (2R_x + 2R_y) - (R_x + R_y)^2 > 0,$$

which guarantees (14) and completes the proof.

References

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL press/v80/athalye8a.html.
Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.
Bromley, J., W. Bentz, J., Bottou, L., Guyon, I., Lecun, Y., Moore, C., Sackinger, E., and Shah, R. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7:25, doi: 0.42/S
Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.
Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June.
Fawzi, A., Moosavi-Dezfooli, S., and Frossard, P. The robustness of deep networks: A geometrical perspective. IEEE Signal Processing Magazine, 34(6):50-62, Nov 2017.
Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference and prediction. Springer, 2nd edition. URL tibs/ ElemStatLearn/.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Ishibashi, K., Hatano, K., and Takeda, M. Online learning of maximum p-norm margin classifiers with bias. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008. URL fi/papers/48-ishibashi.pdf.
Keerthi, S., Shevade, S. K., Bhattacharyya, C., and Murthy, K. A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Transactions on Neural Networks, 11(1):124-136, 1999.
Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. CoRR, abs/ , 2018. URL org/abs/
Marzi, Z., Gopalakrishnan, S., Madhow, U., and Pedarsani, R. Sparsity-based defense against adversarial attacks on linear classifiers. ArXiv e-prints, 2018.
Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Nar, K. and Sastry, S. Step size matters in deep learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018. URL step-size-matters-in-deep-learning.pdf.
Rauber, J., Brendel, W., and Bethge, M. Foolbox: A Python toolbox to benchmark the robustness of machine learning models, 2017. URL org/abs/
Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. arXiv preprint arXiv: , 2018.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/ , 2014.
Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. ArXiv e-prints, 2018.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization.
In International Conference on Learning Representations, 2017.
More informationADVERSARIAL SPHERES ABSTRACT 1 INTRODUCTION. Workshop track - ICLR 2018
ADVERSARIAL SPHERES Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, & Ian Goodfellow Google Brain {gilmer,lmetz,schsam,maithra,wattenberg,goodfellow}@google.com
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationStochastic Optimization Methods for Machine Learning. Jorge Nocedal
Stochastic Optimization Methods for Machine Learning Jorge Nocedal Northwestern University SIAM CSE, March 2017 1 Collaborators Richard Byrd R. Bollagragada N. Keskar University of Colorado Northwestern
More informationarxiv: v3 [cs.cv] 28 Feb 2018
Defense against Universal Adversarial Perturbations arxiv:1711.05929v3 [cs.cv] 28 Feb 2018 Naveed Akhtar* Jian Liu* Ajmal Mian *The authors contributed equally to this work. School of Computer Science
More informationImplicit Optimization Bias
Implicit Optimization Bias as a key to Understanding Deep Learning Nati Srebro (TTIC) Based on joint work with Behnam Neyshabur (TTIC IAS), Ryota Tomioka (TTIC MSR), Srinadh Bhojanapalli, Suriya Gunasekar,
More information