Universal Adversarial Networks
Jamie Hayes
University College London

Abstract

Neural networks are known to be vulnerable to adversarial examples: inputs that have been intentionally perturbed to remain visually similar to the source input, but that cause a misclassification. It was recently shown that, given a target dataset and classifier, there exist so-called universal adversarial perturbations, a single perturbation that causes a misclassification when applied to any input. In this work, we introduce universal adversarial networks, a generative network that is capable of fooling a target classifier when its generated output is combined with a clean sample. We show that this technique improves on known universal adversarial attacks.

I. INTRODUCTION

Machine learning models are increasingly relied upon for safety- and business-critical tasks such as medical [12], [16], [24], robotics and automotive [15], [18], [23], security [1], [8], [22], and financial [6], [9], [20] applications. Recent research shows that machine learning models trained on entirely uncorrupted data are still vulnerable to adversarial examples [4], [5], [13], [14], [19], [21]: samples that have been maliciously altered so as to be misclassified by a target model while appearing unaltered to the human eye. Most work focuses on generating perturbations that cause a specific input to be misclassified; however, it has been shown that adversarial perturbations generalize across many inputs [19]. Moosavi-Dezfooli et al. [10] showed, in the most extreme case, that given a target model and a dataset, it is possible to compute a single perturbation that, when applied to any input, will cause a misclassification with high likelihood. These are referred to as universal adversarial perturbations (UAPs).

In this work, we study the capacity of generative models to learn to craft UAPs; we refer to these networks as universal adversarial networks (UANs). We show that a UAN is able to sample from noise and generate a perturbation such that, when applied to any input from the dataset, it results in a misclassification by the target model. Furthermore, perturbations produced by UANs transfer robustly to other models.

II. BACKGROUND

We define adversarial examples and UAPs along with some terminology and notation. We then introduce the threat model considered, and the datasets we use to evaluate the attacks.

A. Adversarial Examples

Szegedy et al. [19] cast the construction of adversarial examples as an optimization problem. Given a target model, f, and a source input x, sometimes referred to in the literature as a clean or original input, the input x is classified correctly by f as y. The attacker aims to find an adversarial input, x + c, which is perceptually identical to x but for which f(x + c) ≠ y. The adversary attempts to find argmin_c ||c|| such that f(x + c) = l, x + c is a legitimate input, and f(x) ≠ l. The problem can be extended to force a specific misclassification, referred to as a targeted attack, or any misclassification, referred to as a non-targeted attack. In the absence of a distance measure that accurately captures the perceptual difference between a source input and an adversarial example, the ℓp distance metric is used in [19] as a measure of similarity. The ℓ0 distance measures the number of pixels that have changed, the ℓ2 distance measures the Euclidean distance between two images, while the ℓ∞ distance measures the maximum perturbation applied to any single pixel.
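As a concrete illustration of these distance measures (not code from the paper; the tensors and values are ours), the ℓ0, ℓ2, and ℓ∞ distances between a clean and an adversarial image can be computed as follows:

```python
import torch

def perturbation_norms(x, x_adv):
    """Distance measures between a clean image x and an adversarial image x_adv.

    Both tensors are assumed to share a shape such as (3, H, W), with pixel
    values in [0, 1].
    """
    c = (x_adv - x).flatten()
    l0 = (c != 0).sum().item()      # number of values that changed
    l2 = c.norm(p=2).item()         # Euclidean distance between the two images
    linf = c.abs().max().item()     # largest change applied to any single value
    return l0, l2, linf

# Example: a random image with a small random perturbation.
x = torch.rand(3, 32, 32)
x_adv = (x + 0.01 * torch.randn_like(x)).clamp(0, 1)
print(perturbation_norms(x, x_adv))
```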
Given a target model, f, and a dataset, X, a UAP is a perturbation, c, such that for all x ∈ X, x + c is a valid input and Pr(f(x + c) ≠ f(x)) = 1 − δ, where 0 < δ ≪ 1.

B. Threat Model

We consider an attacker who wishes to craft UAPs such that any input to which they are applied will be incorrectly classified by a target model with high likelihood. We also constrain the adversarial input crafted by the attacker to be visually indistinguishable from the source input, evaluated through the ℓ2 and ℓ∞ distance measures. Our attacks assume white-box access to the target model, as we backpropagate the error of the target model back to the UAN. The attacker samples an input from noise, generates a mask, and applies this to an input which is then given to the target model. In almost all experiments, we consider a worst-case scenario with respect to data access, assuming that the attacker has knowledge of, and shares access to, any training data samples.

C. Datasets

We evaluate attacks using two popular datasets in adversarial machine learning research, ImageNet and CIFAR-10 [7]. Due to training time constraints, we restrict ImageNet experiments to its validation dataset, which consists of 50,000 RGB images, scaled to 224 × 224, covering 1,000 classes. The 50,000 images are split into 40,000 training set images and 10,000 validation set images. We ensure classes are balanced, such that each class contains 40 images in the training set and 10 images in the validation set. Our pretrained models, ResNet-152, VGG-19 [17], and InceptionV3, used as the target models, score 78.4%, 73.1%, and 77.3% top-1 test accuracy, respectively.

The CIFAR-10 dataset consists of 60,000 32 × 32 RGB images of objects in ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. This is split into 50,000 training images and 10,000 validation images. Our pretrained models, ResNet-101, VGG-19 [17], and DenseNet, used as the target models, score 93.75%, 94.01%, and 95.00% test accuracy, respectively. State-of-the-art results on CIFAR-10 are approximately 95% accuracy.
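Before describing the attack, the following sketch shows how the fooling rate Pr(f(x + c) ≠ f(x)) from the UAP definition above can be estimated on a held-out set (our illustration, not code from the paper; `model` and `loader` stand for a pretrained classifier and a validation DataLoader):

```python
import torch

@torch.no_grad()
def fooling_rate(model, loader, c, device="cpu"):
    """Fraction of inputs whose predicted class changes when the UAP c is added."""
    model.eval()
    fooled, total = 0, 0
    for x, _ in loader:
        x = x.to(device)
        clean_pred = model(x).argmax(dim=1)
        adv_pred = model((x + c).clamp(0, 1)).argmax(dim=1)
        fooled += (adv_pred != clean_pred).sum().item()
        total += x.size(0)
    return fooled / total
```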
Fig. 1: Overview of the attack.

III. UNIVERSAL ADVERSARIAL NETWORKS

A. Attack Description

An overview of the attack is given in Figure 1. An attacker is provided access to a target model and attempts to train a new network, called a universal adversarial network (UAN), that learns to generate perturbations such that, when applied to any input, they lead to a misclassification by the target model. The UAN is trained to minimize the distance between the source and adversarial image while causing a misclassification in the target model.

The UAN accepts noise as input and outputs a candidate for a UAP, δ. To constrain δ to be of minimal norm, we apply a weighting coefficient before adding it to a dataset sample; this rescales δ to the desired distance. The best method for minimizing the distance between an adversarial and clean input is for the UAN to minimize the norm of its output; however, this will clearly lead to both the clean and adversarial image being assigned the same, correct, label by the target model, i.e., no attack at all. Therefore we train the UAN via a cost function that reflects the level of success in fooling the target model while, in parallel, updating the weighting coefficient to minimize the norm of candidate UAPs.

We adapt the loss functions defined by Carlini and Wagner [2] and Chen et al. [3] to learn to craft UAPs. Given a UAN U, target model f, a candidate UAP δ, an input x ∈ X, and the output layer of f, F, we define the loss of U as:

L = log(F(x + δ)_{f(x)}) − max_{i ≠ f(x)} log(F(x + δ)_i) + α · ||δ||

L can be thought of as a success measure of the UAN: reducing L means the gap between the originally predicted class and the next highest class shrinks. The logarithmic term is necessary since most target models have a skewed probability distribution, with one class prediction dominating all others; the logarithm reduces the effect of this dominance. In addition to the weighting coefficient, to further constrain the norm of the adversarial perturbation we include ℓ2-regularization in the UAN and a loss term constraining the maximum perturbation.

The full description of the attack is given in Algorithm 1, and the UAN model architecture and hyperparameters are given in Table I and Table II. We define the relative perturbation as the ratio of the norm of the UAP over the norm of the original image. This is set to 4 in all experiments.

Code available at

TABLE I: UAN model architecture. IS refers to the image size: 32 for CIFAR-10 experiments and 299 or 224 for ImageNet experiments. The layers are: input noise (100), five blocks of Deconv + Batch Norm + ReLU, FC + Batch Norm + ReLU (512), FC + Batch Norm + ReLU (1024), and a final FC layer (3 × IS × IS).

TABLE II: UAN hyperparameters for the CIFAR-10 and ImageNet experiments: learning rate, beta parameters, batch size, epochs, weight coefficient, weight coefficient update, ℓ2 regularization factor, and ℓ∞ loss weight (α).
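To make the training objective concrete, here is a minimal sketch of how the loss L above could be implemented (our reading of the formula rather than the paper's released code; the target model is assumed to output logits, and the norm term is taken here as ℓ∞):

```python
import torch
import torch.nn.functional as F

def uan_loss(target_model, x, delta, alpha):
    """Adapted C&W-style fooling loss for a batch x and a candidate UAP delta.

    Minimizing the first two terms pushes the log-probability of the originally
    predicted class below that of the runner-up class; the last term keeps the
    perturbation small.
    """
    log_probs_clean = F.log_softmax(target_model(x), dim=1)
    orig_class = log_probs_clean.argmax(dim=1)                    # f(x)

    log_probs_adv = F.log_softmax(target_model((x + delta).clamp(0, 1)), dim=1)
    orig_logp = log_probs_adv.gather(1, orig_class.unsqueeze(1)).squeeze(1)

    # Mask out the original class before taking the max over the remaining classes.
    masked = log_probs_adv.scatter(1, orig_class.unsqueeze(1), float("-inf"))
    runner_up_logp = masked.max(dim=1).values

    fool_term = (orig_logp - runner_up_logp).mean()
    return fool_term + alpha * delta.norm(p=float("inf"))
```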
IV. ATTACK EVALUATION

A. Comparison with previous work

We now compare our method for crafting UAPs with two state-of-the-art methods. Moosavi-Dezfooli et al. [10] construct the perturbation iteratively: at each step an input is combined with the current constructed UAP, and if the combination does not fool the target model, a new perturbation with minimal norm is found that does fool the target model. The attack terminates when a threshold fooling rate is met. Mopuri et al. [11] develop a method for finding a UAP for a target model that is independent of the dataset. They construct a UAP by starting with a random noise UAP and iteratively altering it to over-saturate features learned at successive layers in the target model, essentially causing neurons at each layer to output useless information and so cause the desired misclassification. They optimize the UAP by adjusting it with respect to the loss term:

L = log(∏_{i=1}^{K} l_i(δ)), such that ||δ||_∞ < α,

where l_i(δ) is the average of the output at layer i for perturbation δ, and α is the maximum permitted perturbation.

Table III compares our UAN method of generating UAPs against the two attacks described above for both CIFAR-10 and ImageNet. We see that we consistently outperform both attack methods.
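For intuition only, a rough sketch of a data-free objective in this spirit, written as a quantity to minimize (our paraphrase of the description above rather than the reference implementation of [11]; which layers to hook and the bound α are assumptions):

```python
import torch

def datafree_uap_loss(model, layers, delta):
    """Drive the mean activations of the chosen layers as high as possible,
    using only the perturbation itself as input (no data samples needed)."""
    acts = []
    hooks = [l.register_forward_hook(lambda mod, inp, out: acts.append(out)) for l in layers]
    model(delta.unsqueeze(0))   # delta is assumed to be a single image-shaped tensor
    for h in hooks:
        h.remove()
    # log of a product of mean activations = sum of logs; negate so that
    # minimizing this loss maximizes the activations.
    return -sum(torch.log(a.abs().mean() + 1e-12) for a in acts)

# After each gradient step, delta would be projected back into the allowed
# region, e.g. delta.data.clamp_(-alpha, alpha) for an l-infinity bound alpha.
```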
TABLE III: Comparison of fooling rates (%) for UAN and three other universal adversarial perturbation methods (Moosavi-Dezfooli et al. [10], ℓ2 and ℓ∞ variants, and Mopuri et al. [11]) on CIFAR-10 and ImageNet.

TABLE IV: Success of fooling (%). UAPs are constructed using row models and tested against pre-trained column models.

Fig. 2: Non-targeted attacks. From left to right: original image, UAP, adversarial image.

V. TRANSFERABILITY, GENERALIZABILITY & TARGETED ATTACKS

A. Transferability

An adversarial image is transferable if it successfully fools a model that was not its original target. Transferability is a yardstick for the robustness of adversarial examples, and is the main property used by Papernot et al. [13], [14] to construct black-box adversarial examples. They mount a white-box attack on a local model that has been trained to replicate the intended target model's decision boundaries, and show that the resulting adversarial examples successfully transfer to fool the black-box target model.

We create 10,000 adversarial images, one for each image in the CIFAR-10 validation set, and apply them to a different target model. Table IV gives results for transferability of a non-targeted attack on three target models: ResNet, DenseNet, and VGG-19. We find that UAPs crafted using a UAN do transfer to other models. For a UAN trained on one of these models and evaluated on ResNet, fooling success is 61.2%, a drop of just 7.7% from evaluating on the original target model.

We also measure the capacity of a UAN to learn to fool an ensemble of target models. We trained a UAN against VGG-19, DenseNet, and ResNet on CIFAR-10, where the UAN loss function is a linear combination of the losses on each target model. From Table IV, we see that a UAN trained against an ensemble of target models is able to fool at rates comparable to single target models.

B. Generalizability

Moosavi-Dezfooli et al. [10] have shown that UAPs are not unique; there exist many candidates that perform equally well at fooling images in a dataset. If a UAN is truly modeling the distribution of UAPs, its output should not be unique. In Figure 4, we measure the MSE (mean squared error) and SSIM (structural similarity index) between outputs of the UAN throughout training. We see that as we continue to train, UAPs have larger differences between one another and so diversify over time.

Fig. 4: MSE and SSIM scores of UAPs throughout training a UAN for the CIFAR-10 dataset.
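A small sketch of how such a diversity measurement could be computed (our construction, not the paper's code; it assumes the UAN maps 100-dimensional noise vectors to image-shaped perturbations and that scikit-image is available for SSIM):

```python
import torch
from skimage.metrics import structural_similarity

def uap_diversity(uan, n=8, noise_dim=100):
    """Average pairwise MSE and SSIM between perturbations sampled from a UAN."""
    with torch.no_grad():
        uaps = uan(torch.randn(n, noise_dim)).cpu().numpy()   # shape (n, 3, H, W) assumed
    mses, ssims = [], []
    for i in range(n):
        for j in range(i + 1, n):
            a, b = uaps[i], uaps[j]
            mses.append(float(((a - b) ** 2).mean()))
            ssims.append(structural_similarity(a, b, channel_axis=0,
                                               data_range=float(a.max() - a.min())))
    return sum(mses) / len(mses), sum(ssims) / len(ssims)
```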
Fig. 3: CIFAR-10 targeted universal attack. Each sub-figure shows, for one chosen class (plane, car, bird, cat, deer, dog, frog, horse, ship, truck), the success probability of transforming any image in CIFAR-10 into that class as the relative perturbation is increased. We define the relative perturbation as the ratio of the ℓ2-norm of the UAP over the ℓ2-norm of the original image.

C. Targeted Attacks

We follow the same experimental set-up as in Section IV-A, however now the attacker chooses a class, c, that they would like the target model to assign to an adversarial example, and success is calculated as the probability that an adversarial example is classified as c. Figure 3 shows, for each class in CIFAR-10, the success probability as we allow larger perturbations. Interestingly, all targeted attacks follow a sigmoidal curve.
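The targeted setting can be expressed by a small change to the loss sketched after Section III; the following is again our illustration rather than the paper's code (`target_class` is the chosen class c):

```python
import torch
import torch.nn.functional as F

def targeted_uan_loss(target_model, x, delta, target_class, alpha):
    """Targeted variant of the fooling loss: push every input towards target_class."""
    log_probs_adv = F.log_softmax(target_model((x + delta).clamp(0, 1)), dim=1)
    target_logp = log_probs_adv[:, target_class]

    masked = log_probs_adv.clone()
    masked[:, target_class] = float("-inf")
    best_other_logp = masked.max(dim=1).values

    fool_term = (best_other_logp - target_logp).mean()
    return fool_term + alpha * delta.norm(p=float("inf"))

@torch.no_grad()
def targeted_success_rate(model, loader, delta, target_class, device="cpu"):
    """Probability that an adversarial example is classified as the chosen class."""
    hits, total = 0, 0
    for x, _ in loader:
        pred = model((x.to(device) + delta).clamp(0, 1)).argmax(dim=1)
        hits += (pred == target_class).sum().item()
        total += x.size(0)
    return hits / total
```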
VI. CONCLUSION

We presented a first-of-its-kind universal adversarial example attack that uses machine learning at the heart of its construction. We comprehensively evaluated the attack under many different settings, showing that it produces high-quality adversarial examples capable of fooling a target model in both targeted and non-targeted attacks. The attack transfers to many different target models, and improves on other state-of-the-art universal adversarial perturbation methods.

REFERENCES

[1] A. L. Buczak and E. Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2), 2016.
[2] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on. IEEE, 2017.
[3] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. arXiv e-prints, 2017.
[4] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint, 2014.
[5] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel. Adversarial attacks on neural network policies. arXiv preprint, 2017.
[6] S.-J. Kim and S. Boyd. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM Journal on Optimization, 19(3), 2008.
[7] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[8] T. D. Lane. Machine learning techniques for the computer security domain of anomaly detection.
[9] W.-Y. Lin, Y.-H. Hu, and C.-F. Tsai. Machine learning in financial crisis prediction: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):421–436, 2012.
[10] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. arXiv preprint, 2016.
[11] K. R. Mopuri, U. Garg, and R. V. Babu. Fast feature fool: A data independent approach to universal adversarial perturbations. arXiv preprint, 2017.
[12] Z. Obermeyer and E. J. Emanuel. Predicting the future: big data, machine learning, and clinical medicine. The New England Journal of Medicine, 375(13):1216, 2016.
[13] N. Papernot, P. McDaniel, and I. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint, 2016.
[14] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint, 2016.
[15] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Computer Vision - ECCV 2006, 2006.
[16] M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. S. Pinkus, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1):68–74, 2002.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.
[18] S. Sivaraman and M. M. Trivedi. Active learning for on-road vehicle detection: A comparative study. Machine Vision and Applications, pages 1–13, 2014.
[19] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint, 2013.
[20] T. B. Trafalis and H. Ince. Support vector machine for regression and applications to financial forecasting. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), volume 6. IEEE, 2000.
[21] F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. The space of transferable adversarial examples. arXiv preprint, 2017.
[22] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, 2016.
[23] X. Wen, L. Shao, Y. Xue, and W. Fang. A rapid learning algorithm for vehicle classification. Information Sciences, 295, 2015.
[24] Q.-H. Ye, L.-X. Qin, M. Forgues, P. He, J. W. Kim, A. C. Peng, R. Simon, Y. Li, A. I. Robles, Y. Chen, et al. Predicting hepatitis B virus-positive metastatic hepatocellular carcinomas using gene expression profiling and supervised machine learning. Nature Medicine, 9(4):416, 2003.

Algorithm 1: UAN training scheme.
Input: dataset X, weight factor w, weight factor increment r, epochs n, UAN U, target model T, i = 0
Output: A trained UAN, U
1  while i < n do
2      P ← U(z)                                   // compute adversarial perturbations such that |P| = |X|
3      A ← P · w + X                              // generate adversarial examples
4      s ← |{x ∈ X : T(a) ≠ T(x), a = p · w + x, p ∈ P}|   // number of successful adversarial examples
5      find loss and update U
6      if i = 0 then
7          curr ← s
8      prev ← curr
9      curr ← s
10     if curr ≤ prev then
11         w ← w + r
12     i ← i + 1
13 return U
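A compact PyTorch-style sketch of this training scheme (our interpretation of Algorithm 1, reusing the uan_loss sketch given after Section III; the optimizer choice and all hyperparameter names are assumptions):

```python
import torch

def train_uan(uan, target_model, loader, epochs, w, r, alpha, noise_dim=100, device="cpu"):
    """Train a UAN so that w * UAN(z) fools target_model on any input (Algorithm 1 sketch)."""
    opt = torch.optim.Adam(uan.parameters())
    prev_success = 0
    for epoch in range(epochs):
        successes = 0
        for x, _ in loader:
            x = x.to(device)
            z = torch.randn(x.size(0), noise_dim, device=device)
            delta = uan(z)                                       # one candidate UAP per sample
            x_adv = (x + w * delta).clamp(0, 1)

            loss = uan_loss(target_model, x, w * delta, alpha)   # fooling loss from the Section III sketch
            opt.zero_grad()
            loss.backward()
            opt.step()

            with torch.no_grad():
                clean = target_model(x).argmax(dim=1)
                adv = target_model(x_adv).argmax(dim=1)
                successes += (clean != adv).sum().item()

        # If the number of fooled samples did not improve, allow a larger perturbation.
        if epoch > 0 and successes <= prev_success:
            w += r
        prev_success = successes
    return uan
```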