arxiv: v1 [cs.lg] 6 Dec 2018


Max-Margin Adversarial (MMA) Training: Direct Input Space Margin Maximization through Adversarial Training

Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, and Ruitong Huang
Borealis AI

Preprint. Work in progress.

Abstract

We propose Max-Margin Adversarial (MMA) training for directly maximizing the input space margin. This margin maximization is direct, in the sense that the margin's gradient w.r.t. model parameters can be shown to be parallel with the loss gradient at the minimal length perturbation, thus gradient ascent on margins can be performed by gradient descent on losses. We further propose a specific formulation of MMA training to maximize the average margin of training examples in order to train models that are robust to adversarial perturbations. It is implemented by performing adversarial training on a novel adaptive norm projected gradient descent (AN-PGD) attack. Preliminary experimental results demonstrate that our method outperforms the existing state-of-the-art (SOTA) methods. In particular, testing against both white-box and transfer projected gradient descent attacks on MNIST, our trained model improves the SOTA $\ell_\infty$ $\epsilon = 0.3$ robust accuracy by 2%, while maintaining the SOTA clean accuracy. Furthermore, the same model is, to the best of our knowledge, the first model that is robust at $\ell_\infty$ $\epsilon = 0.4$, with a robust accuracy of 86.51%.

1 Introduction

Despite their impressive performance on various learning tasks, neural networks have been shown to be vulnerable. An otherwise highly accurate network can be completely fooled by an artificially constructed perturbation that is imperceptible to humans, known as an adversarial attack (Szegedy et al., 2013; Biggio et al., 2013). Not surprisingly, numerous algorithms for defending against adversarial attacks have already been proposed in the literature which, arguably, can be interpreted as different ways of increasing the margins, i.e. the smallest distance from the sample point to the decision boundary induced by the network. By definition, an example is robust to all perturbations of length $\epsilon$ exactly when its margin exceeds $\epsilon$, so adversarial robustness amounts to having large margins.

One type of algorithm uses regularization during learning to control the Lipschitz constant of the network (Cisse et al., 2017; Ross and Doshi-Velez, 2017; Hein and Andriushchenko, 2017; Tsuzuku et al., 2018), so that a sample point with small loss has a large margin, since the loss cannot increase too fast. If the Lipschitz constant is regularized only at the data points, it is usually too local and not accurate in a neighborhood; if it is controlled globally, the constraint on the model is often so strong that it harms accuracy. So far, such methods have not been able to achieve very robust models.

There are also efforts using first-order approximation to estimate and maximize the input space margin (Elsayed et al., 2018; Sokolic et al., 2017; Matyasko and Chau, 2017). As with local Lipschitz regularization, the reliance on local information might not provide accurate margin estimation and efficient maximization. Indeed, many defense methods have been broken after their proposal (Carlini and Wagner, 2016, 2017a; Athalye et al., 2018).

Figure 1: Illustration of decision boundary, minimal length perturbation, margin, etc.

Adversarial training (Szegedy et al., 2013; Huang et al., 2015; Madry et al., 2017) is one of the few defense methods that has stood the test of time, and in fact represents the current SOTA results on several datasets including MNIST. However, as illustrated in Section 1.2, the update in adversarial training does not necessarily increase the margins, which in turn diminishes the effectiveness of adversarial training.

In this paper, we propose to directly maximize the margin, i.e. the distance to the decision boundary, of each training sample. We call this Max-Margin Adversarial (MMA) training, and its aim is to achieve the greatest possible robustness. Interestingly, we show that MMA training has a close relationship with adversarial training. More specifically, we prove that the margin's gradient w.r.t. the model parameters is parallel with the loss gradient at the minimal length perturbation. This property makes SGD possible for the MMA objective, even though the model parameters are involved in the constraints. We further show that MMA training is an improvement over adversarial training, in the sense that it picks the right perturbation magnitude for adversarial training.

We test our algorithms on MNIST and CIFAR10, demonstrating that our method outperforms adversarial training. Moreover, MMA training achieves 86.51% robust accuracy on MNIST against the $\ell_\infty$ $\epsilon = 0.4$ PGD attack. To the best of our knowledge, this is the first model in the literature that maintains good robustness against such a strong attack.

1.1 Notations

We focus on the $K$-class classification problem in this paper, and assume that the output of the model is a score function $f_\theta(x) = (f_1(x), \ldots, f_K(x))$ which assigns score $f_i(x)$ to the $i$-th class. The predicted label of $x$ is then decided by $\arg\max_i f_i(x)$.
We use the logit margin loss¹ $L_\theta(x, y) = \max_{j \neq y} f_j(x) - f_y(x)$. Note that when $L_\theta(x, y) < 0$, the classification is correct, and when $L_\theta(x, y) > 0$, the classification is wrong. Given a model $f_\theta$ and a data pair $z = (x, y)$, one can compute the minimal length perturbation $\delta^*$ by

$\delta^* = \arg\min_{\delta} \|\delta\| \quad \text{s.t.} \quad L_\theta(x + \delta, y) \geq 0,$ (1)

as illustrated in Figure 1. Denote the margin of example $z$ by $d_\theta(z) = \|\delta^*\|$, which is the (minimal) distance from the input $x$ to the decision boundary $L_\theta(z) = 0$ in the input space. Note that Eq. (1) is also compatible with misclassified samples: for a sample $z$ that is misclassified by $f$, the optimal $\delta^*$ is $0$, hence $d_\theta(z) = 0$. Throughout this paper, we use the single word "margin" to denote the input space margin. For other notions of margin, we will use specific phrases, e.g. output space margin or logit space margin. All the proofs are postponed to Appendix A.

¹ The name reflects that the scores $(f_1(x), \ldots, f_K(x))$ output by $f$ are also called logits in the neural network literature.

1.2 Adversarial Training Does NOT Necessarily Increase Margin

We use an imaginary example in this section to illustrate that one update step of adversarial training does not necessarily increase the margin. Adversarial training (Huang et al., 2015) is based on the min-max formulation

$\min_\theta \max_{\|\delta\| \leq \epsilon} L_\theta(x + \delta, y),$ (2)

where the worst-case loss is minimized under a fixed perturbation magnitude $\epsilon$.

Figure 2: Fixed-norm adversarial training does not necessarily improve robustness.

Consider the one-dimensional example in Figure 2, where the input example $x$ is a scalar. We perturb $x$ in the positive direction with perturbation $\delta$. We assume $L(\delta) = L_\theta(x + \delta, y)$ is monotonically increasing in $\delta$, which means that a larger perturbation is always stronger. Let $L_0(\cdot)$ denote the original function before an update step, and $\delta_0^*$ denote the corresponding margin (minimal length perturbation). Next, an update is made to the parameter of $L_0(\cdot)$, such that $L_0(\epsilon)$ is reduced. After this update, we imagine two different possibilities for the updated loss function: $L_\alpha(\cdot)$, the blue dash-dotted line, and $L_\beta(\cdot)$, the red dashed line. $L_\alpha(\epsilon)$ and $L_\beta(\epsilon)$ have the same decreased value at $\epsilon$. However, $L_\alpha(\cdot)$ has an increased margin $\delta_\alpha^* > \delta_0^*$, while $L_\beta(\cdot)$ has a decreased margin $\delta_\beta^* < \delta_0^*$. The second case is undesirable, since the model becomes less robust as the margin decreases.

2 Max-Margin Adversarial Training

This section presents our method, MMA training. We first calculate the gradient of the margin in Section 2.1, then present MMA training in Section 2.2, and finally discuss the relationship between MMA training and adversarial training in Section 2.3.
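The one-dimensional argument of Section 1.2 can be reproduced numerically. The sketch below is only an illustration: the three linear loss curves, the value `EPS = 1.0`, and the helper names `logit_margin_loss` and `margin_1d` are assumptions chosen for this example, not quantities from the paper. Both updated losses take the same reduced value at `EPS`, yet one margin grows and the other shrinks.

```python
def logit_margin_loss(scores, y):
    """L(x, y) = max_{j != y} f_j(x) - f_y(x); negative iff x is classified as y."""
    return max(s for j, s in enumerate(scores) if j != y) - scores[y]


def margin_1d(loss, upper=10.0, tol=1e-9):
    """Smallest delta >= 0 with loss(delta) >= 0, for a loss increasing in delta.

    Assumes loss(upper) >= 0, so a zero crossing exists in [0, upper].
    """
    if loss(0.0) >= 0.0:
        return 0.0  # misclassified without any perturbation: margin is 0
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if loss(mid) >= 0.0:
            hi = mid
        else:
            lo = mid
    return hi


EPS = 1.0  # fixed perturbation magnitude used by adversarial training (assumed)


def loss_0(d):      # hypothetical loss before the update: zero crossing at 0.5
    return d - 0.5


def loss_alpha(d):  # hypothetical update 1: zero crossing moves out to 0.8
    return d - 0.8


def loss_beta(d):   # hypothetical update 2: zero crossing moves in to 0.3
    return (2.0 / 7.0) * (d - 0.3)


for name, loss in [("L_0", loss_0), ("L_alpha", loss_alpha), ("L_beta", loss_beta)]:
    print(name, "loss at eps:", round(loss(EPS), 3), "margin:", round(margin_1d(loss), 3))
```

Here the loss at `EPS` drops from 0.5 to 0.2 under both updates, so a fixed-norm objective cannot tell them apart, while the margins differ.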

2.1 Gradients of the Margins

In this section, we show how to calculate gradients of the margins, $d_\theta(z)$, on individual examples, as a preparation for presenting MMA training. Computing such gradients for a linear model, e.g. an SVM, is relatively easy because a closed-form solution exists. But it is not clear how to do so for general functions such as neural networks.

Recall that $d_\theta(z) = \|\delta^*\| = \min_{L(\delta, \theta) \geq 0} \|\delta\|$, where $L(\delta, \theta)$ is shorthand for $L_\theta(x + \delta, y)$. It is easy to see that for a wrongly classified example $z$, $\delta^*$ is achieved at $0$ and thus $\nabla_\theta d_\theta(z) = 0$. Therefore in this section, we focus on correctly classified examples. Denote the Lagrangian by $\mathcal{L}(\delta, \lambda) = \|\delta\| + \lambda L(\delta, \theta)$. Also, for a fixed $\theta$, denote the optimizers of $\delta$ and $\lambda$ by $\delta^*$ and $\lambda^*$. The following theorem provides an efficient way to compute $\nabla_\theta d_\theta(z)$. The exact implementation of our algorithm is presented in Section 2.2.

Theorem 2.1. Assume that $\epsilon(\delta) = \|\delta\|$ and $L(\delta, \theta)$ are $C^2$ functions almost everywhere². Also assume that the matrix $M$ is full rank almost everywhere, where

$M = \begin{pmatrix} \dfrac{\partial^2 \epsilon(\delta^*)}{\partial \delta^2} + \lambda^* \dfrac{\partial^2 L(\delta^*, \theta)}{\partial \delta^2} & \dfrac{\partial L(\delta^*, \theta)}{\partial \delta} \\ \dfrac{\partial L(\delta^*, \theta)}{\partial \delta}^{\top} & 0 \end{pmatrix}.$

Then,

$\nabla_\theta d_\theta(z) \propto \dfrac{\partial L(\delta^*, \theta)}{\partial \theta}.$

Remark 2.1. The condition on the matrix $M$ is a technical condition that guarantees the implicit function theorem is applicable, so that the gradient can be computed.

2.2 Max-Margin Adversarial Training

Motivated by the observation in Section 1.2, MMA training directly uses margins in its objective:

$\max_\theta \Big\{ \sum_{i \in S_\theta^+} d_\theta(z_i) - \sum_{j \in S_\theta^-} J_\theta(z_j) \Big\},$ (3)

where $J_\theta(\cdot)$ is a regular classification loss function, e.g. cross-entropy loss, which can be different from $L_\theta(\cdot)$. $S_\theta^+ = \{i : L_\theta(z_i) < 0\}$ is the set of correctly classified examples, and $S_\theta^- = \{i : L_\theta(z_i) \geq 0\}$ is the set of wrongly classified examples. Intuitively, MMA training minimizes the classification loss on wrongly classified examples in $S_\theta^-$ until they are correctly classified. For already correctly classified examples, MMA training maximizes the margin $d_\theta(z_i)$.
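Theorem 2.1 can be checked by hand on a model where the margin has a closed form. The sketch below assumes a two-dimensional binary linear classifier with score $w \cdot x + b$, label $+1$, and loss $L = -(w \cdot x + b)$ (the logit margin loss when the other class has score 0); the numbers and function names are hypothetical. It compares a finite-difference gradient of the closed-form $\ell_2$ margin with the loss gradient evaluated at $x + \delta^*$, with $\delta^*$ held fixed.

```python
import math


def margin(w, b, x):
    """Closed-form l2 margin of a correctly classified point with label +1."""
    return (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(w[0], w[1])


def margin_grad_fd(w, b, x, h=1e-6):
    """Central finite differences of the margin w.r.t. theta = (w1, w2, b)."""
    theta = [w[0], w[1], b]
    grad = []
    for i in range(3):
        up, dn = theta[:], theta[:]
        up[i] += h
        dn[i] -= h
        grad.append((margin(up[:2], up[2], x) - margin(dn[:2], dn[2], x)) / (2 * h))
    return grad


def loss_grad_at_min_perturbation(w, b, x):
    """Gradient w.r.t. theta of L = -(w.(x + delta) + b), with delta fixed at delta*."""
    s = w[0] * x[0] + w[1] * x[1] + b
    n2 = w[0] ** 2 + w[1] ** 2
    delta = [-s * w[0] / n2, -s * w[1] / n2]  # projection of x onto the boundary
    return [-(x[0] + delta[0]), -(x[1] + delta[1]), -1.0]


w, b, x = [3.0, 4.0], 1.0, [2.0, 1.0]
g_margin = margin_grad_fd(w, b, x)
g_loss = loss_grad_at_min_perturbation(w, b, x)
ratios = [gm / gl for gm, gl in zip(g_margin, g_loss)]
print(ratios)  # the three ratios agree, so the two gradients are parallel
```

In this linear case the common ratio equals $-1/\|w\|$, which is the multiplier $\lambda^*$ from the first-order conditions; its negative sign is why descending the loss at $\delta^*$ ascends the margin.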
Note that we do not maximize margins on wrongly classified examples, since the gradients of their margins are 0, as discussed in Section 2.1.

Given Eq. (3) and Theorem 2.1, we can implement MMA training using SGD. It is straightforward to calculate $\nabla_\theta J_\theta(z_j)$. Based on Theorem 2.1, we can calculate $\nabla_\theta d_\theta(z_i)$, up to scale, as the gradient of the loss at the minimal length perturbation $\delta^*$. Finding the exact $\delta^*$ is intractable in general settings³. Here we propose an algorithm that gives an approximate solution of $\delta^*$, the Adaptive Norm Projected Gradient Descent (AN-PGD) attack.

In AN-PGD, we apply PGD (Madry et al., 2017) twice to find an approximation. In particular, if we knew the true margin $\epsilon^* = \|\delta^*\|$ of a data point $x$ in advance, we could use PGD with length $\epsilon^*$, $\mathrm{PGD}(x, y, \epsilon^*)$, to find an approximation of $\delta^*$. Since we do not know $\epsilon^*$, we need to approximate it first. Therefore, we first perform a PGD attack on $x$ with a fixed perturbation length $\epsilon_{init}$ to get $\delta_1$, then we search along the direction of $\delta_1$ to find a scaled perturbation that gives $L = 0$, whose length is used to approximate $\epsilon^*$. Algorithm 1 describes the details of AN-PGD.⁴

² All $\ell_p$ norms with $p \geq 0$, including $p = \infty$, satisfy this condition.
³ Exceptions exist, for example the linear SVM.

Algorithm 1 Adaptive Norm Projected Gradient Descent Attack.
Inputs: $(x, y)$ is the data example. $\epsilon_{init}$ is the initial norm constraint used in the first PGD attack.
Outputs: $\delta^*$ is the approximate minimal length perturbation.
Parameters: $\epsilon_{inc}$ is the increment of perturbation length. $\epsilon_{max}$ is the maximum perturbation length. $\mathrm{PGD}(x, y, \epsilon)$ represents the PGD (Madry et al., 2017) perturbation $\delta$ with maximum perturbation length $\epsilon$.
1: Adversarial example $\delta_1 = \mathrm{PGD}(x, y, \epsilon_{init})$
2: Unit perturbation $\delta_{unit} = \delta_1 / \|\delta_1\|$
3: if prediction on $x + \delta_1$ is correct then
4:   $\epsilon_{new} = \arg\min_\eta L(x + \eta \delta_{unit}, y)$, $\eta \in [\epsilon_{init}, \min\{\epsilon_{init} + \epsilon_{inc}, \epsilon_{max}\}]$
5: else
6:   $\epsilon_{new} = \arg\min_\eta L(x + \eta \delta_{unit}, y)$, $\eta \in [0, \epsilon_{init})$
7: end if
8: $\delta^* = \mathrm{PGD}(x, y, \epsilon_{new})$

With the proposed AN-PGD attack, we can implement MMA training as described in Algorithm 2. Note that AN-PGD here only serves as an algorithm that gives an approximate solution of $\delta^*$, and it can be decoupled from the remaining parts of MMA training. Other attacks that serve a similar purpose also fit into our MMA training framework, for example the Carlini-Wagner attack (Carlini and Wagner, 2017b)⁵ and the Decoupled Direction and Norm (DDN) attack (Rony et al., 2018), which is concurrent with our work.

2.3 Relation to Adversarial Training

As suggested by Theorem 2.1, MMA training is closely related to adversarial training. In this section, we prove that MMA training is actually an improvement on adversarial training, in the sense that it picks the right perturbation magnitude $\epsilon$ for adversarial training.
Given a sample $z = (x, y)$ and a fixed length $\epsilon$, recall that adversarial training learns a model $f$ by minimizing the adversarial loss,

$\min_\theta J_\theta(\epsilon) = \min_\theta \max_{\|\delta\| \leq \epsilon} L_\theta(x + \delta, y).$

For simplicity, we denote $L_\theta(x + \delta, y)$ by $L_\theta(\delta)$ in this section. The next theorem shows that the optimal $f$ that maximizes the margin also minimizes the adversarial loss, provided the perturbation magnitude is set to the resulting margin.

Theorem 2.2. Assume that $l \geq L_\theta(0)$ with $l \in \mathrm{Range}(L_\theta)$, and let $\theta_{MMA} = \arg\max_\theta \min_{L_\theta(\delta) \geq l} \|\delta\|$. Also let $\epsilon^* = \min_{L_{\theta_{MMA}}(\delta) \geq l} \|\delta\|$, and $\theta_{adv} = \arg\min_\theta \max_{\|\delta\| \leq \epsilon^*} L_\theta(\delta)$. Then $\theta_{MMA}$ also minimizes the adversarial loss, i.e.

$\max_{\|\delta\| \leq \epsilon^*} L_{\theta_{MMA}}(\delta) = \max_{\|\delta\| \leq \epsilon^*} L_{\theta_{adv}}(\delta).$

⁴ The zero-crossing binary search is denoted as $\arg\min_\eta L(\eta)$ for brevity.
⁵ The Carlini-Wagner attack specifically might suffer from computational issues due to the very high number of iterations it requires.
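The control flow of Algorithm 1 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: `pgd_l2` is a simplified $\ell_2$ PGD without random start, `zero_crossing` is a bisection standing in for the binary search of footnote 4 (clamping to an interval endpoint when no sign change exists), and the toy loss is a linear classifier whose true margin, 2.2, is known in closed form. All function names and numbers are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def pgd_l2(loss_grad, x, eps, steps=20):
    """Simplified l2 PGD: normalized gradient ascent projected onto the eps-ball."""
    delta = [0.0] * len(x)
    step = 2.5 * eps / steps
    for _ in range(steps):
        g = loss_grad([xi + di for xi, di in zip(x, delta)])
        gn = norm(g) or 1.0
        delta = [di + step * gi / gn for di, gi in zip(delta, g)]
        dn = norm(delta)
        if dn > eps:  # project back onto the ball of radius eps
            delta = [di * eps / dn for di in delta]
    return delta


def zero_crossing(f, lo, hi, tol=1e-10):
    """Bisection for f(eta) = 0 on [lo, hi]; clamps if there is no sign change."""
    if f(lo) >= 0.0:
        return lo
    if f(hi) < 0.0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) >= 0.0:
            hi = mid
        else:
            lo = mid
    return hi


def an_pgd(loss, loss_grad, x, eps_init, eps_inc, eps_max):
    d1 = pgd_l2(loss_grad, x, eps_init)                      # step 1
    n1 = norm(d1)
    if n1 == 0.0:
        return d1
    unit = [d / n1 for d in d1]                              # step 2

    def along(eta):
        return loss([xi + eta * ui for xi, ui in zip(x, unit)])

    if loss([xi + di for xi, di in zip(x, d1)]) < 0.0:       # step 3: still correct
        lo, hi = eps_init, min(eps_init + eps_inc, eps_max)  # step 4
    else:
        lo, hi = 0.0, eps_init                               # step 6
    eps_new = zero_crossing(along, lo, hi)
    return pgd_l2(loss_grad, x, eps_new)                     # step 8


def loss(x):       # toy linear loss, L = -(3*x1 + 4*x2 + 1), label +1
    return -(3.0 * x[0] + 4.0 * x[1] + 1.0)


def loss_grad(x):  # gradient of the toy loss w.r.t. the input
    return [-3.0, -4.0]


x = [2.0, 1.0]
delta_small = an_pgd(loss, loss_grad, x, eps_init=1.0, eps_inc=2.0, eps_max=5.0)
delta_large = an_pgd(loss, loss_grad, x, eps_init=3.0, eps_inc=2.0, eps_max=5.0)
print(norm(delta_small), norm(delta_large))
```

Both an initial norm that is too small and one that is too large end at a perturbation of length close to the true margin, which is the purpose of the adaptive search.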

Algorithm 2 Max-Margin Adversarial Training.
Inputs: The training set $\{(x_i, y_i)\}$.
Outputs: The trained neural network model $f_\theta(\cdot)$.
Parameters: $\boldsymbol{\epsilon}$ contains the perturbation lengths of the training data. $\epsilon_{inc}$ is the increment of perturbation length. $\epsilon_{min}$ is the minimum perturbation length. $\epsilon_{max}$ is the maximum perturbation length. $A(x, y, \epsilon_{init})$ represents the approximate minimal length perturbation returned by the algorithm $A$ on the data example $(x, y)$ at the initial norm $\epsilon_{init}$.
1: Randomly initialize the parameter $\theta$ of model $f$, and initialize every element of $\boldsymbol{\epsilon}$ as $\epsilon_{min}$
2: repeat
3:   Read minibatch $B = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ from the training set
4:   Make predictions on $B$ and split it into two: wrongly predicted $B_0$ and correctly predicted $B_1$
5:   for each $(x_i, y_i)$ in $B_1$ do
6:     Retrieve perturbation length $\epsilon_i$ from $\boldsymbol{\epsilon}$
7:     $\delta_i^* = A(x_i, y_i, \epsilon_i)$
8:     Update the $\epsilon_i$ in $\boldsymbol{\epsilon}$ as $\|\delta_i^*\|$, and put $(x_i + \delta_i^*, y_i)$ into $B_1^{adv}$
9:   end for
10:  Calculate the gradient of the margin on $B_1^{adv}$ and the gradient of the classification loss on $B_0$
11:  Perform one gradient step update on $\theta$
12: until training has converged or the maximum number of steps is reached

Note that the converse of Theorem 2.2, namely that the optimal $f$ that minimizes the adversarial loss also maximizes the margin given the corresponding threshold $l$, does not hold in general, because a chosen perturbation magnitude $\epsilon$ may overshoot the margin. The next theorem shows that if $\epsilon$ is picked tightly, then the converse is true. Therefore, MMA training can also be interpreted as a way of improving adversarial training by finding the proper $\epsilon$ for each training sample, which is exactly the flavor of Algorithm 2.

Theorem 2.3. Given $\epsilon \geq 0$, let $\theta_{adv} = \arg\min_\theta \max_{\|\delta\| \leq \epsilon} L_\theta(\delta)$. Also let $l = \max_{\|\delta\| \leq \epsilon} L_{\theta_{adv}}(\delta)$, and $\theta_{MMA} = \arg\max_\theta \min_{L_\theta(\delta) \geq l} \|\delta\|$. Further assume that

$\min_{L_{\theta_{adv}}(\delta) \geq l} \|\delta\| = \epsilon.$ (No overshooting assumption)

Then $\theta_{adv}$ also maximizes the margin distance, i.e.

$\min_{L_{\theta_{adv}}(\delta) \geq l} \|\delta\| = \min_{L_{\theta_{MMA}}(\delta) \geq l} \|\delta\|.$

Remark 2.2. The above theorems only cover the case of a single sample, and they cannot be directly extended to the whole training set. In practice, the behaviour of the algorithm could be much more complicated.

3 Preliminary Results on Adversarial Robustness

We test max-margin adversarial (MMA) training on the MNIST and CIFAR10 datasets under $\ell_\infty$ and $\ell_2$ perturbations. We compare our models with PGD trained models (Madry et al., 2017), which represent the SOTA robust models in the literature⁶. MMA trained models outperform PGD trained models. In particular, MMA training is, to the best of our knowledge, the first method in the literature that can defend against $\ell_\infty$ PGD attacks with length 0.4 on MNIST.

⁶ All the models are trained and tested under the same type of norm.
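Returning to Algorithm 2, its update loop can be sketched on a toy problem. The sketch below makes several simplifying assumptions that are not from the paper: the model is a two-dimensional linear classifier, the attack $A$ is replaced by the exact projection onto the linear decision boundary (so the stored $\epsilon_i$ is only recorded, not used as a search hint), the minibatch is the full dataset, and the classification loss $J$ is the logistic loss. The data and all names are hypothetical.

```python
import math


def score(w, b, x):
    return w[0] * x[0] + w[1] * x[1] + b


def min_perturbation(w, b, x, eps_hint):
    """Stand-in for A(x, y, eps_init): exact l2 projection onto the boundary."""
    s = score(w, b, x)
    n2 = w[0] ** 2 + w[1] ** 2
    return [-s * w[0] / n2, -s * w[1] / n2]


def mma_step(w, b, data, eps, lr):
    """One full-batch update following steps 4 to 11 of Algorithm 2."""
    gw, gb = [0.0, 0.0], 0.0
    for i, (x, y) in enumerate(data):
        if y * score(w, b, x) > 0:  # correctly classified: L < 0
            d = min_perturbation(w, b, x, eps[i])
            eps[i] = math.hypot(d[0], d[1])  # store the new perturbation length
            xa = [x[0] + d[0], x[1] + d[1]]
            # gradient of L = -y * (w.(x + d) + b) with d held fixed
            gw[0] -= y * xa[0]
            gw[1] -= y * xa[1]
            gb -= y
        else:  # misclassified: logistic loss on the clean input
            p = 1.0 / (1.0 + math.exp(y * score(w, b, x)))
            gw[0] -= p * y * x[0]
            gw[1] -= p * y * x[1]
            gb -= p * y
    n = len(data)
    return [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n], b - lr * gb / n


def average_margin(w, b, data):
    """Mean l2 margin; misclassified points contribute 0."""
    n = math.hypot(w[0], w[1])
    return sum(max(0.0, y * score(w, b, x)) / n for x, y in data) / len(data)


data = [([2.0, 1.0], 1), ([3.0, 2.0], 1), ([2.0, 3.0], 1),
        ([-1.0, -1.0], -1), ([-2.0, 0.0], -1), ([0.0, -2.0], -1)]
w, b = [1.0, 0.2], 0.0
eps = [0.0] * len(data)
before = average_margin(w, b, data)
for _ in range(200):
    w, b = mma_step(w, b, data, eps, lr=0.1)
after = average_margin(w, b, data)
print(before, after)
```

On this toy data every point stays correctly classified, so each step is gradient descent on the loss at the boundary points, and the average margin rises accordingly, in line with Theorem 2.1.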

Figure 3: Robust accuracy vs perturbation length for MNIST and CIFAR10 under $\ell_\infty$ and $\ell_2$ perturbations. Panels: (a) MNIST $\ell_\infty$, (b) MNIST $\ell_2$, (c) CIFAR10 $\ell_\infty$, (d) CIFAR10 $\ell_2$.

MNIST under $\ell_\infty$ perturbation. We follow a test protocol similar to Madry et al. (2017) using the $\ell_\infty$ PGD attack. Our baseline is the PGD trained model with a fixed $\epsilon$.⁷ Figure 3a shows the robust accuracies under white-box PGD $\ell_\infty$ perturbations with different norm constraints, all using 100 iterations. We can see that the MMA trained model is better than the PGD trained model at all values of $\epsilon$. When $\epsilon > 0.3$, the MMA trained model is significantly better than the PGD trained model and achieves good robustness for very large $\epsilon$. To the best of our knowledge, this is so far the only algorithm in the literature that can defend against $\ell_\infty$ PGD attacks with length 0.4.

We further perform extensive additional experiments to attack the MMA trained model. We test our models with 50 random restarts under both white-box PGD attacks and transfer attacks generated by attacking a PGD model and a standardly trained model. Note that we always pick the strongest attack among random restarts. We also test using 40 iterations in the transfer attack case, to avoid the situation where the attacks overfit to the source model. Table 1 shows all the results. Under these attacks, when $\epsilon = 0.3$, the MMA model yields a 2% robust accuracy improvement over the PGD model (92.86% vs 90.65%). When $\epsilon = 0.4$, MMA training still achieves a robust accuracy of 86.51%, while the PGD model is completely broken. Notice that the strongest attack on the MMA trained model is the transfer attack generated from the PGD model. This could indicate that the white-box robustness of the MMA model is partially due to a strong degree of flatness in the loss surface around the input data point. Despite this, the MMA model still manages to defend against the various transfer attacks tested. We are continuing to benchmark MMA models in ongoing work.

⁷ We have also attempted to train PGD models with $\epsilon = 0.4$, but training didn't converge.
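The rule "always pick the strongest attack among random restarts" means an example counts as robust only if it survives every restart. The sketch below is one way to compute robust accuracy under that rule; the function name and the small outcome matrix are hypothetical illustration data, not results from the paper.

```python
def robust_accuracy(correct_under_attack):
    """Fraction of examples that stay correctly classified under every restart.

    correct_under_attack[r][i] is True if example i is still classified
    correctly after the attack from restart r.
    """
    n = len(correct_under_attack[0])
    survived = [all(run[i] for run in correct_under_attack) for i in range(n)]
    return sum(survived) / n


# Hypothetical outcomes: 3 restarts (rows) on 4 examples (columns).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
per_restart = [sum(run) / len(run) for run in runs]
print(per_restart, robust_accuracy(runs))
```

Because the worst case is taken per example, the result can only be at or below the accuracy of the single strongest restart, which is consistent with accuracies under 50 restarts being lower than under a single attack.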

Another notable phenomenon is that the MMA trained model is even robust to the white-box $\ell_\infty$ PGD attack with $\epsilon = 0.5$, when PGD is performed on the MMA trained model itself. It has a robust accuracy of 89.76% under a single attack, and 70.30% under 50 random restarts. This is surprising, in the sense that a well-crafted perturbation at $\epsilon = 0.5$ should decrease the accuracy to approximately 10%, which a gray image with constant pixel value 0.5 can already achieve. However, as shown in Table 1, when attacked by transfer attacks from the standardly trained model, the MMA trained model's robust accuracy can be reduced to 10.96%. These observations combined suggest that, when $\epsilon$ is large, the PGD attack gets stuck in poor local optima on the MMA trained models. We are further investigating this phenomenon.

MNIST under $\ell_2$ perturbation. We compare our MMA trained model with models trained on fixed length PGD perturbations at $\epsilon = 1$, 3 and 5, as shown in Figure 3b. We benchmark using the boundary attack (Brendel et al., 2017), as we observe that it is stronger than the PGD $\ell_2$ attack, which is consistent with the observation in Anonymous (2019). The MMA model is not always better than the best PGD model at a given perturbation length, but it provides performance similar to the best PGD model at all values of $\epsilon$. Also, the MMA model outperforms PGD models on large perturbations, e.g. $\epsilon = 5$, even when the PGD model is specifically trained on perturbations with that distortion level. Robustness against the gradient-free boundary attack also indicates, to some degree, that the $\ell_2$ robustness of the MMA trained model is likely not purely due to flatness in the loss surface around the input data point.

CIFAR10. For $\ell_\infty$ perturbation, we compare the MMA model with models trained with PGD $\epsilon = 8/255$ and $\epsilon = 16/255$,⁸ as shown in Figure 3c. For $\ell_2$ perturbation, we compare the MMA trained model with models trained with PGD $\epsilon = 0.5$ and $\epsilon = 1.5$,⁹ as shown in Figure 3d. For both the $\ell_\infty$ and $\ell_2$ cases, when MMA models have clean accuracy similar to PGD models ($\ell_2$: PGD $\epsilon = 0.5$ vs MMA; $\ell_\infty$: PGD $\epsilon = 8/255$ vs MMA), they have similar robust accuracy for small perturbation lengths, and MMA training outperforms PGD training when the perturbation length becomes larger. Models trained with a larger PGD attack may be more robust in a certain band of $\epsilon$, but this comes at the cost of clean accuracy.

⁸ Training with $\ell_\infty$ $\epsilon = 32/255$ did not converge.
⁹ Training with $\ell_2$ $\epsilon = 4.5$ did not converge.

4 Other Related Works

Several other related works also exist in the literature, as reviewed in this section.

First-order Large Margin. Previous works (Elsayed et al., 2018; Sokolic et al., 2017; Matyasko and Chau, 2017) have attempted to use first-order approximation to estimate the input space margin. For first-order methods, the margin is accurately estimated when the classification function is linear. MMA's margin estimation is exact whenever the minimal length perturbation $\delta^*$ can be solved for, which holds not only for linear models but also for a broader range of models, e.g. models that are convex w.r.t. the input $x$. This relaxed condition could potentially enable more accurate margin estimation, which improves the performance of MMA training.

(Cross-)Lipschitz Regularization. At the other end of the spectrum, Tsuzuku et al. (2018) enlarge the margin by controlling the global Lipschitz constant, which in turn places a strong constraint on the model and harms its learning capabilities. Instead, our method, like adversarial training, uses adversarial attacks to estimate the margin to the decision boundary. With a strong attack, our estimate is much more precise in the neighborhood around the data point, while being much more flexible because it does not rely on a global Lipschitz constraint.

Hard-Margin SVM (Vapnik, 2013) in the separable case. Assuming that all the training examples are correctly classified and using our notation for general classifiers, the hard-margin SVM objective can be written as

$\arg\max_\theta \Big\{ \min_i d_\theta(z_i) \Big\} \quad \text{s.t.} \quad L_\theta(z_i) < 0, \ \forall i.$ (4)

On the other hand, under the same separability and correctness assumptions, the MMA training formulation in Eq. (3) can be written as

$\arg\max_\theta \Big\{ \sum_i d_\theta(z_i) \Big\} \quad \text{s.t.} \quad L_\theta(z_i) < 0, \ \forall i,$ (5)

which maximizes the average margin rather than the minimum margin as in the SVM. Note that the theorem on the gradient of the margin in Section 2.1 also applies to the SVM formulation for differentiable functions. Because of this, we can also use SGD to solve the following SVM-style formulation, which we leave to future work:

$\max_\theta \Big\{ \min_{i \in S_\theta^+} d_\theta(z_i) - \sum_{j \in S_\theta^-} J_\theta(z_j) \Big\}.$ (6)

5 Conclusion

In this paper, we propose max-margin adversarial (MMA) training. We show that the input space margin's gradient w.r.t. the model parameters can be calculated from the loss gradient at the minimal length perturbation. This enables gradient descent on objectives that involve the input space margin. We propose a specific formulation of max-margin adversarial training that maximizes the average margin, motivated by improving adversarial robustness.
Our preliminary results on MNIST and CIFAR10 show that MMA training is very promising for training robust neural networks. Notably, the MMA trained model achieves decent robustness (86.51%) against the $\ell_\infty$ PGD attack with $\epsilon = 0.4$ which, to the best of our knowledge, is shown here for the first time in the literature.

References

Anonymous (2019). Towards the first adversarially robust neural network model on MNIST. Submitted to International Conference on Learning Representations. Under review.

Athalye, A., Carlini, N., and Wagner, D. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. (2013). Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer.

Brendel, W., Rauber, J., and Bethge, M. (2017). Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint.

Carlini, N. and Wagner, D. (2016). Defensive distillation is not robust to adversarial examples. arXiv preprint.

Carlini, N. and Wagner, D. (2017a). Adversarial examples are not easily detected: Bypassing ten detection methods. arXiv preprint.

Carlini, N. and Wagner, D. (2017b). Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on. IEEE.

Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. (2017). Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning.

Elsayed, G. F., Krishnan, D., Mobahi, H., Regan, K., and Bengio, S. (2018). Large margin deep networks for classification. arXiv preprint.

Hein, M. and Andriushchenko, M. (2017). Formal guarantees on the robustness of a classifier against adversarial manipulation. arXiv preprint.

Huang, R., Xu, B., Schuurmans, D., and Szepesvári, C. (2015). Learning with a strong adversary. arXiv preprint.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint.

Matyasko, A. and Chau, L.-P. (2017). Margin maximization for robust classification using deep learning. In Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE.

Rony, J., Hafemann, L. G., Oliveira, L. S., Ayed, I. B., Sabourin, R., and Granger, E. (2018). Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses. arXiv preprint.

Ross, A. S. and Doshi-Velez, F. (2017). Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv preprint.

Sokolic, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. (2017). Robust large margin deep neural networks. IEEE Transactions on Signal Processing.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint.

Tsuzuku, Y., Sato, I., and Sugiyama, M. (2018). Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. arXiv preprint.

Vapnik, V. (2013). The nature of statistical learning theory. Springer Science & Business Media.

Table 1: MNIST: Performance of the adversarially trained models against different $\ell_\infty$ adversaries for $\epsilon = 0.3, 0.4, 0.5$. For each $\epsilon$, the most successful attack is the one with the lowest accuracy in each model column. The source networks used for the attack are: the MMA trained model (MMA) (white-box attack for MMA, transfer attack for PGD), the PGD trained model (PGD) (white-box attack for PGD, transfer attack for MMA), and the standardly trained model (STD) (transfer attack for both MMA and PGD).

Attack Method | ɛ   | Steps | Restarts | Source | Accuracy MMA Model | Accuracy PGD Model
Natural       |     |       |          |        |                    | 98.63%
PGD           | 0.3 |       |          | MMA    | 96.64%             | 97.99%
PGD           | 0.3 |       |          | MMA    | 93.64%             | 96.71%
FGSM          | 0.3 |       |          | PGD    | 95.5%              | 94.62%
PGD           | 0.3 |       |          | PGD    | 94.65%             | 92.40%
PGD           | 0.3 |       |          | PGD    | 92.86%             | 90.65%
PGD           | 0.3 |       |          | PGD    | 94.65%             | 92.37%
PGD           | 0.3 |       |          | PGD    | 92.87%             | 90.64%
FGSM          | 0.3 |       |          | STD    | 95.46%             | 93.45%
PGD           | 0.3 |       |          | STD    | 95.86%             | 93.97%
PGD           | 0.3 |       |          | STD    | 94.38%             | 92.53%
PGD           | 0.3 |       |          | STD    | 95.86%             | 93.97%
PGD           | 0.3 |       |          | STD    | 94.41%             | 92.55%
PGD           | 0.4 |       |          | MMA    | 95.33%             | 96.14%
PGD           | 0.4 |       |          | MMA    | 89.99%             | 89.74%
FGSM          | 0.4 |       |          | PGD    | 94.16%             | 88.18%
PGD           | 0.4 |       |          | PGD    | 91.18%             | 9.89%
PGD           | 0.4 |       |          | PGD    | 86.69%             | 2.83%
PGD           | 0.4 |       |          | PGD    | 91.16%             | 8.72%
PGD           | 0.4 |       |          | PGD    | 86.51%             | 2.42%
FGSM          | 0.4 |       |          | STD    | 91.51%             | 38.23%
PGD           | 0.4 |       |          | STD    | 93.21%             | 15.38%
PGD           | 0.4 |       |          | STD    | 89.12%             | 5.23%
PGD           | 0.4 |       |          | STD    | 93.18%             | 15.17%
PGD           | 0.4 |       |          | STD    | 89.14%             | 5.07%
PGD           | 0.5 |       |          | MMA    | 89.76%             | 60.36%
PGD           | 0.5 |       |          | MMA    | 70.30%             | 26.02%
FGSM          | 0.5 |       |          | PGD    | 88.39%             | 51.67%
PGD           | 0.5 |       |          | PGD    | 48.05%             | 0.00%
PGD           | 0.5 |       |          | PGD    | 15.50%             | 0.00%
PGD           | 0.5 |       |          | PGD    | 45.75%             | 0.00%
PGD           | 0.5 |       |          | PGD    | 14.20%             | 0.00%
FGSM          | 0.5 |       |          | STD    | 32.96%             | 4.45%
PGD           | 0.5 |       |          | STD    | 34.05%             | 0.00%
PGD           | 0.5 |       |          | STD    | 10.96%             | 0.00%
PGD           | 0.5 |       |          | STD    | 33.66%             | 0.00%
PGD           | 0.5 |       |          | STD    | 11.01%             | 0.00%

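A Proofs

A.1 Proof of Theorem 2.1

Proof. Recall that $z = (x, y)$ and $\epsilon(\delta) = \|\delta\|$. Here we compute the gradient of $d_\theta(z)$ in its general form. Write $g(\theta, \delta) = L_\theta(x + \delta, y)$ and consider the following optimization problem:

$d_\theta(z) = \min_{\delta \in \Delta(\theta)} \epsilon(\delta),$

where $\Delta(\theta) = \{\delta : L_\theta(x + \delta, y) = 0\}$, and $\epsilon$ and $g$ are both $C^2$ functions¹⁰. Denote its Lagrangian by $\mathcal{L}(\delta, \lambda)$, where

$\mathcal{L}(\delta, \lambda) = \epsilon(\delta) + \lambda L_\theta(x + \delta, y).$

For a fixed $\theta$, the optimizers $\delta^*$ and $\lambda^*$ must satisfy the first-order conditions (FOC)

$\left. \dfrac{\partial \epsilon(\delta)}{\partial \delta} + \lambda \dfrac{\partial L_\theta(x + \delta, y)}{\partial \delta} \right|_{\delta = \delta^*, \lambda = \lambda^*} = 0,$ (7)

$\left. L_\theta(x + \delta, y) \right|_{\delta = \delta^*} = 0.$

Put the FOC equations in vector form,

$G((\delta, \lambda), \theta) = \left. \begin{pmatrix} \dfrac{\partial \epsilon(\delta)}{\partial \delta} + \lambda \dfrac{\partial L_\theta(x + \delta, y)}{\partial \delta} \\ L_\theta(x + \delta, y) \end{pmatrix} \right|_{\delta = \delta^*, \lambda = \lambda^*} = 0.$

Note that $G$ is $C^1$ (continuously differentiable) since $\epsilon$ and $g$ are $C^2$ functions. Furthermore, the Jacobian matrix of $G$ w.r.t. $(\delta, \lambda)$ is

$\nabla_{(\delta, \lambda)} G((\delta^*, \lambda^*), \theta) = \begin{pmatrix} \dfrac{\partial^2 \epsilon(\delta^*)}{\partial \delta^2} + \lambda^* \dfrac{\partial^2 g(\theta, \delta^*)}{\partial \delta^2} & \dfrac{\partial g(\theta, \delta^*)}{\partial \delta} \\ \dfrac{\partial g(\theta, \delta^*)}{\partial \delta}^{\top} & 0 \end{pmatrix},$

which by assumption is full rank. Therefore, by the implicit function theorem, $\delta^*$ and $\lambda^*$ can be expressed as functions of $\theta$, denoted by $\delta^*(\theta)$ and $\lambda^*(\theta)$.

To further compute $\nabla_\theta d_\theta(z)$, note that $d_\theta(z) = \epsilon(\delta^*(\theta))$. Thus,

$\nabla_\theta d_\theta(z) = \dfrac{\partial \epsilon(\delta^*)}{\partial \delta} \dfrac{\partial \delta^*(\theta)}{\partial \theta} = -\lambda^* \dfrac{\partial g(\theta, \delta^*)}{\partial \delta} \dfrac{\partial \delta^*(\theta)}{\partial \theta},$ (8)

where the second equality is by Eq. (7). The implicit function theorem also provides a way of computing $\partial \delta^*(\theta) / \partial \theta$, which is complicated and involves taking the inverse of the matrix $\nabla_{(\delta, \lambda)} G((\delta^*, \lambda^*), \theta)$. Here we present a relatively simple way to compute this gradient. Note that

$g(\theta, \delta^*(\theta)) = 0.$

Differentiating w.r.t. $\theta$ on both sides gives

$\dfrac{\partial g(\theta, \delta^*)}{\partial \theta} + \dfrac{\partial g(\theta, \delta^*)}{\partial \delta} \dfrac{\partial \delta^*(\theta)}{\partial \theta} = 0.$ (9)

Combining Eq. (8) and Eq. (9),

$\nabla_\theta d_\theta(z) = \lambda^*(\theta) \dfrac{\partial g(\theta, \delta^*)}{\partial \theta}.$ (10)

¹⁰ Note that a simple application of Danskin's theorem would not be valid, as the constraint set $\Delta(\theta)$ depends on the parameter $\theta$.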

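A.2 A lemma for later proofs

The following lemma helps relate the objective of adversarial training with that of our MMA training. Here, we denote $L_\theta(x + \delta, y)$ as $L_\theta(\delta)$ for simplicity.

Lemma A.1. Given $(x, y)$ and $\theta$, assume that $L_\theta(\delta)$ is continuous in $\delta$. Then for $\epsilon \geq 0$, and $l \geq L_\theta(0)$ with $l \in \mathrm{Range}(L_\theta)$, it holds that

$\min_{L_\theta(\delta) \geq l} \|\delta\| = \epsilon \implies \max_{\|\delta\| \leq \epsilon} L_\theta(\delta) = l;$ (11)

$\max_{\|\delta\| \leq \epsilon} L_\theta(\delta) = l \implies \min_{L_\theta(\delta) \geq l} \|\delta\| \leq \epsilon.$ (12)

Proof. Eq. (11). We prove this by contradiction. Suppose $\max_{\|\delta\| \leq \epsilon} L_\theta(\delta) > l$. When $\epsilon = 0$, this violates our assumption $l \geq L_\theta(0)$ in the lemma. So assume $\epsilon > 0$. Since $L_\theta(\delta)$ is a continuous function defined on a compact set, the maximum is attained by some $\bar\delta$ such that $\|\bar\delta\| \leq \epsilon$ and $L_\theta(\bar\delta) > l$. Since $L_\theta$ is continuous and $l \geq L_\theta(0)$, there exists $\tilde\delta \in \langle 0, \bar\delta \rangle$, i.e. on the line segment connecting $0$ and $\bar\delta$, such that $\|\tilde\delta\| < \epsilon$ and $L_\theta(\tilde\delta) = l$. This follows from the intermediate value theorem by restricting $L_\theta(\delta)$ to $\langle 0, \bar\delta \rangle$. This contradicts $\min_{L_\theta(\delta) \geq l} \|\delta\| = \epsilon$.

If $\max_{\|\delta\| \leq \epsilon} L_\theta(\delta) < l$, then $\{\delta : \|\delta\| \leq \epsilon\} \subseteq \{\delta : L_\theta(\delta) < l\}$. Every point $p \in \{\delta : \|\delta\| \leq \epsilon\}$ is in the open set $\{\delta : L_\theta(\delta) < l\}$, so there exists an open ball with some radius $r_p$ centered at $p$ such that $B_{r_p} \subseteq \{\delta : L_\theta(\delta) < l\}$. These balls form an open cover of $\{\delta : \|\delta\| \leq \epsilon\}$. Since $\{\delta : \|\delta\| \leq \epsilon\}$ is compact, there is a finite open subcover $U_\epsilon$ such that $\{\delta : \|\delta\| \leq \epsilon\} \subseteq U_\epsilon \subseteq \{\delta : L_\theta(\delta) < l\}$. Since $U_\epsilon$ is finite, there exists $h > 0$ such that $\{\delta : \|\delta\| \leq \epsilon + h\} \subseteq \{\delta : L_\theta(\delta) < l\}$. Thus $\{\delta : L_\theta(\delta) \geq l\} \subseteq \{\delta : \|\delta\| > \epsilon + h\}$, again contradicting $\min_{L_\theta(\delta) \geq l} \|\delta\| = \epsilon$.

Eq. (12). Assume that $\min_{L_\theta(\delta) \geq l} \|\delta\| > \epsilon$. Then $\{\delta : L_\theta(\delta) \geq l\} \subseteq \{\delta : \|\delta\| > \epsilon\}$. Taking the complement of both sides, $\{\delta : \|\delta\| \leq \epsilon\} \subseteq \{\delta : L_\theta(\delta) < l\}$. Therefore, by the compactness of $\{\delta : \|\delta\| \leq \epsilon\}$, $\max_{\|\delta\| \leq \epsilon} L_\theta(\delta) < l$, a contradiction.

A.3 Proof of Theorem 2.2

Proof. By the definition of $\theta_{adv}$,

$\max_{\|\delta\| \leq \epsilon^*} L_{\theta_{MMA}}(\delta) \geq \max_{\|\delta\| \leq \epsilon^*} L_{\theta_{adv}}(\delta).$

We further prove that it is not possible that

$\max_{\|\delta\| \leq \epsilon^*} L_{\theta_{MMA}}(\delta) > \max_{\|\delta\| \leq \epsilon^*} L_{\theta_{adv}}(\delta).$

If so, by the first implication in Lemma A.1 and the definition of $\epsilon^*$, we have $l = \max_{\|\delta\| \leq \epsilon^*} L_{\theta_{MMA}}(\delta) > \max_{\|\delta\| \leq \epsilon^*} L_{\theta_{adv}}(\delta)$, and thus, with arguments similar to the proof of Lemma A.1,

$\min_{L_{\theta_{adv}}(\delta) \geq l} \|\delta\| > \epsilon^* = \|\delta^*(\theta_{MMA})\|,$

contradicting the definition of $\theta_{MMA}$. Therefore,

$\max_{\|\delta\| \leq \epsilon^*} L_{\theta_{MMA}}(\delta) = \max_{\|\delta\| \leq \epsilon^*} L_{\theta_{adv}}(\delta).$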
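Lemma A.1 can be sanity-checked on scalar perturbations with a grid search. The sketch below is only an illustration under assumed inputs: the two loss curves `tight` and `bump`, the grid resolution, and the helper names are hypothetical. The first curve shows Eq. (11) holding with equality; the second shows why Eq. (12) gives only an inequality, which is the overshooting case excluded by the assumption in Theorem 2.3.

```python
def min_norm_reaching(loss, level, grid):
    """min |delta| over grid points with loss(delta) >= level."""
    return min(abs(d) for d in grid if loss(d) >= level)


def max_loss_within(loss, eps, grid):
    """max loss(delta) over grid points with |delta| <= eps."""
    return max(loss(d) for d in grid if abs(d) <= eps)


GRID = [i / 1000.0 for i in range(-3000, 3001)]


def tight(d):  # hypothetical loss with L(0) = -1, increasing in |d|
    return d * d - 1.0


def bump(d):   # hypothetical loss with L(0) = 0 that peaks at |d| = 1
    return 1.0 - (abs(d) - 1.0) ** 2


# Eq. (11): with l = 0, the margin eps gives a worst-case loss of exactly l.
eps_tight = min_norm_reaching(tight, 0.0, GRID)
worst_tight = max_loss_within(tight, eps_tight, GRID)

# Eq. (12): choosing eps = 2 overshoots; the level l is reached at norm 1 < 2.
eps_chosen = 2.0
level = max_loss_within(bump, eps_chosen, GRID)
margin_at_level = min_norm_reaching(bump, level, GRID)

print(eps_tight, worst_tight, level, margin_at_level)
```

In the second case the adversarial loss at the chosen radius says nothing about perturbations between the margin and that radius, so minimizing it need not maximize the margin.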

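A.4 Proof of Theorem 2.3

Proof. By the definition of $\theta_{MMA}$,

$\min_{L_{\theta_{MMA}}(\delta) \geq l} \|\delta\| \geq \min_{L_{\theta_{adv}}(\delta) \geq l} \|\delta\|.$

We further prove that it is not possible that

$\min_{L_{\theta_{MMA}}(\delta) \geq l} \|\delta\| > \min_{L_{\theta_{adv}}(\delta) \geq l} \|\delta\|.$

If so, by the no overshooting assumption,

$\min_{L_{\theta_{MMA}}(\delta) \geq l} \|\delta\| > \epsilon,$

and thus, with arguments similar to the proof of the second implication in Lemma A.1,

$\max_{\|\delta\| \leq \epsilon} L_{\theta_{MMA}}(\delta) < l = \max_{\|\delta\| \leq \epsilon} L_{\theta_{adv}}(\delta),$

contradicting the definition of $\theta_{adv}$. Therefore,

$\min_{L_{\theta_{MMA}}(\delta) \geq l} \|\delta\| = \min_{L_{\theta_{adv}}(\delta) \geq l} \|\delta\|.$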


Max-Margin Adversarial (MMA) Training: Direct Input Space Margin Maximization through Adversarial Training Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, and Ruitong Huang Borealis AI arxiv:1812.02637v2