Bayesian Games for Adversarial Regression Problems (Online Appendix)
Michael Großhans, Christoph Sawade, Michael Brückner, Tobias Scheffer

University of Potsdam, Department of Computer Science, August-Bebel-Straße 89, 14482 Potsdam, Germany
SoundCloud Ltd., Rosenthaler Straße, Berlin, Germany

A. Proofs

Proof of Lemma 1. The partial derivative of Equation 4 with respect to a transformed instance $\bar{x}_i$ is given by

$$\frac{\partial}{\partial \bar{x}_i}\hat{\theta}_d(w, \bar{X}, c_d) = 2(\bar{x}_i^{\mathsf T} w - z_i)\,w + 2 c_{d,i}(\bar{x}_i - x_i).$$

Due to the convexity of $\hat{\theta}_d$, we can compute the minimum by setting this derivative to zero and solving for $\bar{x}_i$ using the Sherman–Morrison formula. This results in

$$\bar{x}_i = x_i - \frac{x_i^{\mathsf T} w - z_i}{c_{d,i} + w^{\mathsf T} w}\, w.$$

The claim follows, since the optimal transformations of the individual instances are independent of one another.

Proof of Lemma 2. The partial derivative of Equation 5 is given by

$$\begin{aligned}
\frac{\partial}{\partial w}\int \hat{\theta}_l(w, \phi(c_d), c_l)\,\mathrm{d}q(c_d)
&= \frac{\partial}{\partial w}\left[\int (\phi(c_d)w - y)^{\mathsf T}\operatorname{diag}(c_l)\,(\phi(c_d)w - y)\,\mathrm{d}q(c_d) + w^{\mathsf T} w\right]\\
&= 2\left(\int \phi(c_d)^{\mathsf T}\operatorname{diag}(c_l)\,\phi(c_d)\,\mathrm{d}q(c_d)\right)w - 2\left(\int \phi(c_d)\,\mathrm{d}q(c_d)\right)^{\mathsf T}\operatorname{diag}(c_l)\,y + 2w.
\end{aligned}$$

Setting this gradient equal to zero and solving for $w$ yields

$$w^*[\phi] = \left(I_m + \int \phi(c_d)^{\mathsf T}\operatorname{diag}(c_l)\,\phi(c_d)\,\mathrm{d}q(c_d)\right)^{-1}\left(\int \phi(c_d)\,\mathrm{d}q(c_d)\right)^{\mathsf T}\operatorname{diag}(c_l)\,y.$$

The matrix $\phi(c_d)^{\mathsf T}\operatorname{diag}(c_l)\,\phi(c_d)$, and hence the expectation $\int \phi(c_d)^{\mathsf T}\operatorname{diag}(c_l)\,\phi(c_d)\,\mathrm{d}q(c_d)$, is positive semidefinite for any adversary strategy $\phi(c_d)$; thus, all its eigenvalues $\lambda_i$ are non-negative. The corresponding eigenvectors are additionally eigenvectors of $I_m + \int \phi(c_d)^{\mathsf T}\operatorname{diag}(c_l)\,\phi(c_d)\,\mathrm{d}q(c_d)$ with eigenvalues $1 + \lambda_i > 0$. Thus, the inverse matrix exists independently of the choice of $\phi$. This proves the claim.

Proof of Lemma 3. We define the action spaces of the restricted game $G'$ as the closure of the convex hull of optimal responses,

$$\Phi' = \operatorname{cl}\big(\{\phi \in \Phi \mid \phi = \tau\,\phi^*[w_1] + (1-\tau)\,\phi^*[w_2],\ \tau \in [0,1],\ w_1, w_2 \in W\}\big)$$
$$W' = \operatorname{cl}\big(\{w \in W \mid w = \tau\,w^*[\phi_1] + (1-\tau)\,w^*[\phi_2],\ \tau \in [0,1],\ \phi_1, \phi_2 \in \Phi\}\big),$$

where $\operatorname{cl}(M)$ denotes the closure of set $M$. Since a Bayesian equilibrium is a pair of optimal responses, each equilibrium of the original game $G$ is a member of the restricted space $W' \times \Phi'$. Since both games have identical loss functions and cost distributions, each equilibrium point of $G$ is an equilibrium point of $G'$ and vice versa.
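The data generator's closed-form response derived above lends itself to a quick numerical sanity check. The sketch below is an assumption-laden reconstruction: the per-instance objective $(\bar{x}_i^{\mathsf T} w - z_i)^2 + c_{d,i}\|\bar{x}_i - x_i\|^2$ and all symbol names are taken from the derivative shown in the proof, not from code published with the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 4                       # instances, attributes
X = rng.normal(size=(n, m))       # original data
w = rng.normal(size=m)            # fixed model parameters
z = rng.normal(size=n)            # target values preferred by the generator
c_d = rng.uniform(0.5, 2.0, n)    # data generator's per-instance costs

# Closed-form optimal transformation (Sherman-Morrison):
#   xbar_i = x_i - (x_i^T w - z_i) / (c_{d,i} + w^T w) * w
Xbar = X - ((X @ w - z) / (c_d + w @ w))[:, None] * w

# The per-instance gradient 2(xbar_i^T w - z_i) w + 2 c_{d,i}(xbar_i - x_i)
# must vanish at the optimum.
grad = 2 * (Xbar @ w - z)[:, None] * w + 2 * c_d[:, None] * (Xbar - X)
print(np.allclose(grad, 0))       # True

# Cross-check: no random perturbation improves the (strictly convex) objective.
def obj(Xb):
    return np.sum((Xb @ w - z) ** 2) + np.sum(c_d[:, None] * (Xb - X) ** 2)

best = obj(Xbar)
trials = [obj(Xbar + 0.1 * rng.normal(size=(n, m))) for _ in range(100)]
print(best <= min(trials))        # True
```

Because the objective is strictly convex in each $\bar{x}_i$, the gradient check and the perturbation check together pin down the closed form as the unique minimizer.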
It remains to be shown that the spaces $\Phi'$ and $W'$ of optimal responses are bounded. We start by showing that any optimal response of the data generator is bounded. Using the triangle inequality, the spectral norm of an optimal response to a model parameter $w$ (see Lemma 1) can be upper-bounded by

$$\|\phi^*[w](c_d)\|_2 \le \|X\|_2 + \big\|(\operatorname{diag}(c_d) + w^{\mathsf T} w\, I_n)^{-1} z w^{\mathsf T}\big\|_2 + \big\|(\operatorname{diag}(c_d) + w^{\mathsf T} w\, I_n)^{-1} X w w^{\mathsf T}\big\|_2. \quad (4)$$

The third summand of the right-hand side of Inequality 4 can be upper-bounded as follows:

$$\big\|(\operatorname{diag}(c_d) + w^{\mathsf T} w\, I_n)^{-1} X w w^{\mathsf T}\big\|_2 \le \big\|(\operatorname{diag}(c_d) + w^{\mathsf T} w\, I_n)^{-1}\big\|_2\, \|X\|_2\, \|w w^{\mathsf T}\|_2 \quad (5)$$
$$= \frac{w^{\mathsf T} w}{w^{\mathsf T} w + \min_i c_{d,i}}\, \|X\|_2 \quad (6)$$
$$< \|X\|_2. \quad (7)$$

Inequality 5 follows from the sub-multiplicativity of the spectral norm. The spectral norm of a symmetric positive definite matrix is given by its largest eigenvalue, that is, for the positive diagonal matrix, by its maximal diagonal entry (see Equation 6). Inequality 7 holds since $c_d > 0$.

Analogously, the second summand (see Inequality 4) can be bounded using the sub-multiplicativity of the spectral norm (Inequality 8) and evaluating the largest eigenvalue of the diagonal matrix, which we split up into two factors (see Equation 9). In Inequality 10 we bound each denominator by one of its positive summands:

$$\big\|(\operatorname{diag}(c_d) + w^{\mathsf T} w\, I_n)^{-1} z w^{\mathsf T}\big\|_2 \le \big\|(\operatorname{diag}(c_d) + w^{\mathsf T} w\, I_n)^{-1}\big\|_2\, \|z\|_2\, \|w\|_2 \quad (8)$$
$$= \|z\|_2\, \frac{\|w\|_2}{\sqrt{w^{\mathsf T} w + \min_i c_{d,i}}}\, \frac{1}{\sqrt{w^{\mathsf T} w + \min_i c_{d,i}}} \quad (9)$$
$$< \|z\|_2\, \frac{\|w\|_2}{\sqrt{w^{\mathsf T} w}}\, \frac{1}{\sqrt{\min_i c_{d,i}}} = \|z\|_2\, \max_i c_{d,i}^{-1/2}. \quad (10)$$

Finally, using Bounds 7 and 10, the function space $\Phi'$ is bounded, because

$$\int \|\phi^*[w](c_d)\|_2\,\mathrm{d}q(c_d) < 2\|X\|_2 + \|z\|_2 \int \max_i c_{d,i}^{-1/2}\,\mathrm{d}q(c_d)$$

and

$$\int \max_i c_{d,i}^{-1/2}\,\mathrm{d}q(c_d) \le \sum_i \int c_{d,i}^{-1/2}\,\mathrm{d}q(c_d) \le \sum_i \int \big(1 + c_{d,i}^{-1}\big)\,\mathrm{d}q(c_d) = n + \sum_i \int c_{d,i}^{-1}\,\mathrm{d}q(c_d) < \infty,$$

since the moments $\int c_{d,i}^{-1}\,\mathrm{d}q(c_d)$ exist by assumption.

We now show that if the data generator's action space is convex and compact, the space of optimal responses $W'$ of the learner is compact and convex as well. Let $\phi(c_d)$ be a data generator's strategy given costs $c_d$. Then the learner's optimal response given by Lemma 2 can be bounded using the sub-multiplicativity of the $\ell_2$-norm:

$$\|w^*[\phi]\|_2 \le \Big\|\Big(I_m + \int \phi(c_d)^{\mathsf T}\operatorname{diag}(c_l)\,\phi(c_d)\,\mathrm{d}q(c_d)\Big)^{-1}\Big\|_2\, \Big\|\int \phi(c_d)\,\mathrm{d}q(c_d)\Big\|_2\, \|\operatorname{diag}(c_l)\, y\|_2 \le \sup_{\phi' \in \Phi}\Big\{\int \|\phi'(c_d)\|_2\,\mathrm{d}q(c_d)\Big\}\, \|\operatorname{diag}(c_l)\, y\|_2. \quad (11)$$

For any symmetric positive definite matrix, the spectral norm equals its largest eigenvalue. Since $\int \phi(c_d)^{\mathsf T}\operatorname{diag}(c_l)\,\phi(c_d)\,\mathrm{d}q(c_d)$ is positive semidefinite, all its eigenvalues $\lambda_i$ are non-negative.
The corresponding eigenvectors are additionally eigenvectors of $I_m + \int \phi(c_d)^{\mathsf T}\operatorname{diag}(c_l)\,\phi(c_d)\,\mathrm{d}q(c_d)$ with eigenvalues $1 + \lambda_i \ge 1$. Thus, the inverse matrix exists and its spectral norm can be upper-bounded by one (see Inequality 11). The claim then follows, since the data generator's action space is bounded.

Proof of Theorem 1. Let $a = (a_1, \ldots, a_n)^{\mathsf T}$ be an arbitrary vector. Then, following Taylor's Theorem, the data generator's optimal response can be written as

$$\phi^*[w](c_d) = \phi^*_{t;a}[w](c_d) + R_{t;a}(c_d) = \sum_{r=0}^{t} \operatorname{diag}(c_d - a)^r\, C_r(a) + R_{t;a}(c_d),$$

with Cauchy remainder

$$R_{t;a}(c_d) = (t+1)\operatorname{diag}(c_d - \xi)^t \operatorname{diag}(c_d - a)\, C_{t+1}(\xi),$$

for some $\xi = (\xi_1, \ldots, \xi_n)^{\mathsf T}$ with $\xi_i \in [c_{d,i}, a_i]$ or $\xi_i \in [a_i, c_{d,i}]$, respectively. It remains to show that $a$ can be chosen such that $R_{t;a}(c_d)$ converges to the null matrix, or, equivalently, that $\lim_{t \to \infty} \|R_{t;a}(c_d)\|_2 = 0$ holds.
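The eigenvalue-shift argument used above (for positive semidefinite $M$, the matrix $I_m + M$ has eigenvalues $1 + \lambda_i \ge 1$, hence exists an inverse with spectral norm at most one) can be illustrated numerically; the shapes and cost values below are arbitrary choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 6
phi = rng.normal(size=(n, m))         # one realization of phi(c_d)
c_l = rng.uniform(0.1, 3.0, size=n)   # learner's per-instance costs

# M = phi^T diag(c_l) phi is symmetric positive semidefinite:
M = phi.T @ (c_l[:, None] * phi)
lam = np.linalg.eigvalsh(M)
print(lam.min() >= -1e-10)            # all eigenvalues non-negative: True

# Hence I_m + M has eigenvalues 1 + lam >= 1; it is invertible, and the
# spectral norm of its inverse is 1 / (1 + lam_min) <= 1.
inv_norm = np.linalg.norm(np.linalg.inv(np.eye(m) + M), ord=2)
print(inv_norm <= 1 + 1e-10)          # True
```

The same two facts are exactly what make the learner's closed-form response well defined for every adversary strategy and what yields the norm bound in Inequality 11.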
Figure 3. Evaluation against an adversary that follows a Bayesian equilibrium strategy for varying cost parameters (expected costs $\mu$ and variance of costs $\sigma^2$). Error bars show standard errors.

The limit of the Cauchy remainder can be expressed as

$$\lim_{t\to\infty} \|R_{t;a}(c_d)\|_2 = \lim_{t\to\infty} \big\|(t+1)\operatorname{diag}(c_d - \xi)^t \operatorname{diag}(c_d - a)\, C_{t+1}(\xi)\big\|_2 \quad (12)$$
$$\le \lim_{t\to\infty} \big\|(t+1)\operatorname{diag}(c_d - \xi)^t \operatorname{diag}(c_d - a)\,\big(\operatorname{diag}(\xi) + w^{\mathsf T} w\, I_n\big)^{-(t+2)}\big\|_2\, \big\|(Xw - z)w^{\mathsf T}\big\|_2 \quad (13)$$
$$= \lim_{t\to\infty} \max_i\, (t+1)\, \frac{|c_{d,i} - \xi_i|^t\, |c_{d,i} - a_i|}{(\xi_i + w^{\mathsf T} w)^{t+2}}\, \big\|(Xw - z)w^{\mathsf T}\big\|_2. \quad (14)$$

In Inequality 13 we dismiss the two terms $(-1)^{t+2}$ and $(Xw - z)w^{\mathsf T}$ from $C_{t+1}(\xi)$. This leaves a diagonal matrix whose spectral norm is given by its maximal absolute diagonal entry (see Equation 14). If $\|w\|_2 = 0$ the claim holds, so let $\|w\|_2 > 0$. Then, Equation 14 can be factorized as

$$\lim_{t\to\infty} \|R_{t;a}(c_d)\|_2 \le \big\|(Xw - z)w^{\mathsf T}\big\|_2\, \lim_{t\to\infty} \max_i\, (t+1) \left(\frac{|c_{d,i} - \xi_i|}{\xi_i + w^{\mathsf T} w}\right)^{t} \frac{|c_{d,i} - a_i|}{(\xi_i + w^{\mathsf T} w)^{2}}. \quad (15)$$

A sufficient condition for the geometric sequence, and thus for Equation 15, to tend to zero is that

$$\frac{|c_{d,i} - \xi_i|}{\xi_i + w^{\mathsf T} w} \le k < 1 \quad (16)$$

holds for all $i$ and a fixed $k$. In the following we show by case differentiation that Condition 16 is fulfilled for $a_i = \|\sup\{c_d \mid q(c_d) > 0\}\|_\infty$, where $\|c\|_\infty = \max_i |c_i|$ is the maximum norm.

Let $c_{d,i} \le a_i$. Since $\xi_i \in [c_{d,i}, a_i]$ and $c_{d,i} \ge 0$, the quotient can be upper-bounded by

$$\frac{|c_{d,i} - \xi_i|}{\xi_i + w^{\mathsf T} w} \le \frac{\xi_i}{\xi_i + w^{\mathsf T} w} \le \frac{a_i}{a_i + w^{\mathsf T} w} = k.$$

Since $\|w\|_2 > 0$, it holds that $k < 1$. Now assume that $c_{d,i} > a_i$. Since the data generator's costs are bounded from below by zero, it follows that $\xi_i \ge a_i$. Hence, the quotient can be upper-bounded by

$$\frac{|c_{d,i} - \xi_i|}{\xi_i + w^{\mathsf T} w} \le \frac{c_{d,i} - a_i}{a_i + w^{\mathsf T} w} = k.$$

This proves the claim.

B. Comprehensive Empirical Results

In Section 6 we studied the behavior of the Bayesian regression model in the context of spam filtering. We compared the Bayesian game regression
model (denoted Bayes) to the Nash game regression model (denoted Nash), the robust ridge regression (denoted Minimax), and a regular ridge regression (denoted Ridge).

Depiction of Theorem 1. In Theorem 1 we derived sufficient conditions for unique equilibrium points. The uniqueness depends on both players' costs $c_d$ and $c_l$. Figure 6 (left) depicts the optimal responses of the learner (horizontal axis) and the data generator (vertical axis) for different costs $c_l = c_d$ in a one-dimensional regression game with fixed scalar values of $X$, $y$, and $z$, without uncertainty. Their intersections constitute the equilibria of the corresponding games. If the costs become too large, such that the uniqueness condition is violated for any $(w, \phi) \in [0,1] \times [0,1]$, the game $G$ is no longer locally convex (indicated by the dashed black line).

Playing against a Bayesian adversary. In a first experiment, we evaluated how the methods perform against an adversary that chooses its strategy according to a Bayesian equilibrium. We chose $q(\cdot)$ to be a gamma distribution and varied its mean $\mu$ and variance $\sigma^2$. We set the Nash model's conjecture, for all values of $\sigma^2$, to the mean $\mu$; this is optimal if the data generator's costs are drawn from a single-point distribution $q(\cdot) = \delta(\cdot = \mu)$. Figure 3 (left) shows the root mean squared error (RMSE) for Bayes, Nash, and Ridge as a function of $\mu$ and $\sigma^2$. Figure 3 (right) shows a sectional view along $\mu$ (top) and along $\sigma^2$ (bottom). We observe that all methods coincide if the data generator's costs vanish. The advantage of Bayes over Nash grows with the variance of $q$.

Playing against actual adversaries. In a second experiment, we evaluate all methods over time into the future. Here, the models play against actual spammers. Additionally, in order to artificially create a mismatch to our modeling assumptions, we also evaluate the models on test data from the past. The training sample of instances is drawn from month $k$.
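The one-dimensional regression game described above can be solved by alternating the two players' best responses until their curves intersect. The closed forms below are specialized from the lemmas in Appendix A under illustrative values $X = 1$, $y = 1$, $z = 0$ (the paper's exact scalars did not survive extraction, so these are assumptions):

```python
# One-dimensional game without cost uncertainty, with equal costs
# c = c_l = c_d and illustrative values X = 1, y = 1, z = 0 (assumptions).
# Best responses specialized from the closed forms in Appendix A:
#   learner:        w*[phi]  = c * phi / (1 + c * phi**2)
#   data generator: phi*[w]  = c / (c + w**2)
def find_equilibrium(c, iters=200):
    phi, w = 1.0, 0.0
    for _ in range(iters):              # alternate best responses
        w = c * phi / (1.0 + c * phi ** 2)
        phi = c / (c + w ** 2)
    return w, phi

w, phi = find_equilibrium(c=0.5)
# At a fixed point, the two responses are mutually optimal (an equilibrium):
print(abs(w - 0.5 * phi / (1 + 0.5 * phi ** 2)) < 1e-9)   # True
print(abs(phi - 0.5 / (0.5 + w ** 2)) < 1e-9)             # True
```

The fixed point found this way corresponds to a crossing of the two best-response curves plotted in Figure 6 (left).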
To evaluate into the future, the regularization parameters of all learners (for Bayes, we use a single cost parameter for all instances) are tuned on instances from month $k+1$; for evaluation into the past, tuning data are drawn from month $k-1$. Test data are then drawn from months $k+1$ to $k+6$ and from months $k-1$ to $k-6$, respectively. This process is repeated, and measurements are averaged over ten resampling iterations of the training set and, in an outer loop, over four training months $k$ (March to June 2007). The data generator's cost parameters are set to fixed values of $\mu$ and $\sigma^2$ for Bayes and of $\mu$ for Nash. Figure 4 (left) shows the RMSE over time for this fixed mean $\mu$ and variance $\sigma^2$. Figure 4 (center) depicts the RMSE for a fixed point in the past over a range of different values for $\mu$. Figure 4 (right) shows the training times of the Bayesian equilibrium model and the reference models for a varying number of attributes.

Additionally, we study the actual shift of spam mails over time and compare it with the computed equilibrium point. For depiction, we choose the two most discriminative principal components with respect to spam and non-spam. We train separate Nash models on March 2007 (see Section 6), April 2007, May 2007, and June 2007. Again we use training instances from the respective month, and the data generator's costs are set to a fixed $\mu$. The learner's costs are tuned on the subsequent month. Given the Nash model, we can extract the optimally transformed data for the training month according to Lemma 1. Figure 5 depicts training samples (blue), test samples from the six subsequent months (green/yellow), and the computed equilibrium data (red) for three different training months (April to June 2007); the fourth (for the March model) is shown in the main paper. If the changes in spam mails are gradual, they can be predicted well (see, e.g., the upper right corner). If instead the changes are more volatile (see, e.g., the lower right corner), the Nash model is not able to predict the appearance of new spam mails.
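The temporal evaluation protocol described above (train on one month, tune on the adjacent month, test on the six months after or before) can be sketched generically. All function and argument names below are placeholders for illustration, not the paper's implementation:

```python
from typing import Callable, Dict, List, Sequence, Tuple

def rolling_evaluation(
    data_by_month: Dict[int, Tuple],   # month index -> (X, y)
    train_month: int,
    fit: Callable,                     # fit(X, y, param) -> model
    score: Callable,                   # score(model, X, y) -> error, e.g. RMSE
    params: Sequence[float],           # candidate regularization parameters
    horizon: int = 6,
    into_future: bool = True,
) -> List[float]:
    """Train on month k, tune on month k+1 (or k-1), and test on the
    `horizon` months after (or before) the training month."""
    step = 1 if into_future else -1
    X_tr, y_tr = data_by_month[train_month]
    X_tu, y_tu = data_by_month[train_month + step]
    # Pick the parameter that scores best on the adjacent tuning month.
    best = min(params, key=lambda p: score(fit(X_tr, y_tr, p), X_tu, y_tu))
    model = fit(X_tr, y_tr, best)
    return [score(model, *data_by_month[train_month + step * h])
            for h in range(1, horizon + 1)]
```

Averaging the returned error curves over resampled training sets and over several training months $k$ reproduces the structure of the curves in Figure 4 (left).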
Approximation of the adversary's responses. Finally, we study the impact of the degree $t$ of the Taylor approximation on the accuracy and execution time of Bayes. Figure 6 (center) shows the RMSE evaluated over time for three different values of the Taylor degree $t$. Figure 6 (right, top) shows the execution time depending on the number of training mails (for a fixed number of attributes $m$). Figure 6 (right, bottom) shows the execution time depending on the number of attributes (for a fixed number of training mails $n$).
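The accuracy–runtime trade-off in $t$ traces back to the truncated expansion of the theorem in Appendix A. A scalar sketch (with arbitrary illustrative values for the cost $c$, the expansion point $a$, and $s$ standing in for $w^{\mathsf T} w$) shows the error of the degree-$t$ expansion of $1/(c+s)$ shrinking geometrically:

```python
# Degree-t Taylor expansion of f(c) = 1/(c + s) around a; the remainder
# decays like ((c - a) / (a + s))^t, mirroring the geometric bound in the
# theorem above. The values of c, a, s are arbitrary illustrations.
def taylor_inverse(c, a, s, t):
    # 1/(c+s) = sum_{r=0}^{t} (-1)^r (c-a)^r / (a+s)^(r+1) + remainder
    return sum((-1) ** r * (c - a) ** r / (a + s) ** (r + 1)
               for r in range(t + 1))

c, a, s = 0.4, 1.0, 1.0
exact = 1.0 / (c + s)
errors = [abs(taylor_inverse(c, a, s, t) - exact) for t in (0, 2, 5, 10)]
print(all(e2 < e1 for e1, e2 in zip(errors, errors[1:])))   # True
```

With the ratio $|c - a| / (a + s) = 0.3$ here, each additional degree multiplies the error by roughly 0.3, which is why small values of $t$ already suffice in the experiments while keeping the runtime low.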
Figure 4. Evaluation of regression models with fixed expected costs into the past and future (left) and with varying expected costs into the past (center). Execution time (right). Error bars show standard errors.

Figure 5. Shift in spam mails over time for three different training months. Training data are shown in blue; the imaginary data are shown in red.

Figure 6. Optimal responses for one-dimensional regression games (left). Evaluation of the Bayesian model with different Taylor degrees t over time (center). Execution time (right).
More informationMachine Learning And Applications: Supervised Learning-SVM
Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 12, DECEMBER
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 12, DECEMBER 2010 6433 Bayesian Minimax Estimation of the Normal Model With Incomplete Prior Covariance Matrix Specification Duc-Son Pham, Member,
More informationInference For High Dimensional M-estimates. Fixed Design Results
: Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and
More informationMajorization Minimization - the Technique of Surrogate
Majorization Minimization - the Technique of Surrogate Andersen Ang Mathématique et de Recherche opérationnelle Faculté polytechnique de Mons UMONS Mons, Belgium email: manshun.ang@umons.ac.be homepage:
More informationLINEAR AND NONLINEAR PROGRAMMING
LINEAR AND NONLINEAR PROGRAMMING Stephen G. Nash and Ariela Sofer George Mason University The McGraw-Hill Companies, Inc. New York St. Louis San Francisco Auckland Bogota Caracas Lisbon London Madrid Mexico
More informationDecision trees COMS 4771
Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationA priori bounds on the condition numbers in interior-point methods
A priori bounds on the condition numbers in interior-point methods Florian Jarre, Mathematisches Institut, Heinrich-Heine Universität Düsseldorf, Germany. Abstract Interior-point methods are known to be
More informationNon Convex-Concave Saddle Point Optimization
Non Convex-Concave Saddle Point Optimization Master Thesis Leonard Adolphs April 09, 2018 Advisors: Prof. Dr. Thomas Hofmann, M.Sc. Hadi Daneshmand Department of Computer Science, ETH Zürich Abstract
More informationLINEAR ALGEBRA - CHAPTER 1: VECTORS
LINEAR ALGEBRA - CHAPTER 1: VECTORS A game to introduce Linear Algebra In measurement, there are many quantities whose description entirely rely on magnitude, i.e., length, area, volume, mass and temperature.
More informationFunctional Analysis F3/F4/NVP (2005) Homework assignment 3
Functional Analysis F3/F4/NVP (005 Homework assignment 3 All students should solve the following problems: 1. Section 4.8: Problem 8.. Section 4.9: Problem 4. 3. Let T : l l be the operator defined by
More informationA Strict Stability Limit for Adaptive Gradient Type Algorithms
c 009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional A Strict Stability Limit for Adaptive Gradient Type Algorithms
More informationInverse Power Method for Non-linear Eigenproblems
Inverse Power Method for Non-linear Eigenproblems Matthias Hein and Thomas Bühler Anubhav Dwivedi Department of Aerospace Engineering & Mechanics 7th March, 2017 1 / 30 OUTLINE Motivation Non-Linear Eigenproblems
More information1 Overview. 2 A Characterization of Convex Functions. 2.1 First-order Taylor approximation. AM 221: Advanced Optimization Spring 2016
AM 221: Advanced Optimization Spring 2016 Prof. Yaron Singer Lecture 8 February 22nd 1 Overview In the previous lecture we saw characterizations of optimality in linear optimization, and we reviewed the
More informationBasic Properties of Metric and Normed Spaces
Basic Properties of Metric and Normed Spaces Computational and Metric Geometry Instructor: Yury Makarychev The second part of this course is about metric geometry. We will study metric spaces, low distortion
More informationGTR # VLTs GTR/VLT/Day %Δ:
MARYLAND CASINOS: MONTHLY REVENUES TOTAL REVENUE, GROSS TERMINAL REVENUE, WIN/UNIT/DAY, TABLE DATA, AND MARKET SHARE CENTER FOR GAMING RESEARCH, DECEMBER 2017 Executive Summary Since its 2010 casino debut,
More informationDot Products. K. Behrend. April 3, Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem.
Dot Products K. Behrend April 3, 008 Abstract A short review of some basic facts on the dot product. Projections. The spectral theorem. Contents The dot product 3. Length of a vector........................
More informationAssignment 1: From the Definition of Convexity to Helley Theorem
Assignment 1: From the Definition of Convexity to Helley Theorem Exercise 1 Mark in the following list the sets which are convex: 1. {x R 2 : x 1 + i 2 x 2 1, i = 1,..., 10} 2. {x R 2 : x 2 1 + 2ix 1x
More informationConvex optimization COMS 4771
Convex optimization COMS 4771 1. Recap: learning via optimization Soft-margin SVMs Soft-margin SVM optimization problem defined by training data: w R d λ 2 w 2 2 + 1 n n [ ] 1 y ix T i w. + 1 / 15 Soft-margin
More informationIndex. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.)
page 121 Index (Page numbers set in bold type indicate the definition of an entry.) A absolute error...26 componentwise...31 in subtraction...27 normwise...31 angle in least squares problem...98,99 approximation
More informationLecture 8 : Eigenvalues and Eigenvectors
CPS290: Algorithmic Foundations of Data Science February 24, 2017 Lecture 8 : Eigenvalues and Eigenvectors Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Hermitian Matrices It is simpler to begin with
More informationMaximising the number of induced cycles in a graph
Maximising the number of induced cycles in a graph Natasha Morrison Alex Scott April 12, 2017 Abstract We determine the maximum number of induced cycles that can be contained in a graph on n n 0 vertices,
More informationMulti-Agent Learning with Policy Prediction
Multi-Agent Learning with Policy Prediction Chongjie Zhang Computer Science Department University of Massachusetts Amherst, MA 3 USA chongjie@cs.umass.edu Victor Lesser Computer Science Department University
More information