STATISTICAL LEARNING IN PRACTICE
Part III / Lent 2018
Example Sheet 3 (of 4)
By Dr T. Wang

You have the option to submit your answers to Questions 1 and 4 to be marked. If you want your answers to be marked, please leave them in my pigeon hole in the central core of the CMS by 2pm on 7 Mar.

1. Suppose we have observations $x_1, \ldots, x_n \in \mathbb{R}^p$ with associated class labels $y_1, \ldots, y_n \in \{-1, 1\}$. Assume that the two classes are completely separable.

(i) Write down the optimisation problem solved by the support vector machine (SVM) classifier with hard margin and linear kernel.

(ii) What are support vectors for this SVM? Let $S = \{i : x_i \text{ is a support vector}\}$. Show that the decision boundary of this SVM will not change if we use only $\{(x_i, y_i) : i \in S\}$ as our training data.

(iii) Suppose $S = \{1, 2\}$. Describe the decision boundary in terms of $x_1, x_2$. What if $S = \{1, 2, 3\}$?

Solution. (i) The optimisation problem solved is
\[
\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \frac{1}{2}\|\beta\|_2^2 \quad \text{subject to} \quad y_i(x_i^\top \beta + \beta_0) \ge 1, \quad i = 1, \ldots, n. \tag{1}
\]

(ii) Let $\beta^*$ and $\beta_0^*$ be the optimiser of the above optimisation problem. The set of support vectors is $\{x_i : y_i(x_i^\top \beta^* + \beta_0^*) = 1\}$. Let $\tilde\beta$ and $\tilde\beta_0$ be the optimiser of the SVM problem using $\{(x_i, y_i) : i \in S\}$ as training data; we want to show that $\tilde\beta = \beta^*$ and $\tilde\beta_0 = \beta_0^*$. Assume this is not true, and we prove by contradiction. Note that $(\beta^*, \beta_0^*)$ satisfies the constraints in the SVM problem using $\{(x_i, y_i) : i \in S\}$ as the training data. Thus, by uniqueness of the optimiser (note strong convexity), we must have
\[
\frac{1}{2}\|\tilde\beta\|_2^2 < \frac{1}{2}\|\beta^*\|_2^2.
\]
Let $\beta^{(h)} = h\tilde\beta + (1-h)\beta^*$ and $\beta_0^{(h)} = h\tilde\beta_0 + (1-h)\beta_0^*$. Then by convexity of the squared $L_2$ norm, we have $\frac{1}{2}\|\beta^{(h)}\|_2^2 < \frac{1}{2}\|\beta^*\|_2^2$ for all $h \in (0, 1)$. Moreover, for any support vector $x_i$, we have $y_i(x_i^\top \beta^* + \beta_0^*) \ge 1$ and $y_i(x_i^\top \tilde\beta + \tilde\beta_0) \ge 1$, and thus $y_i(x_i^\top \beta^{(h)} + \beta_0^{(h)}) \ge 1$. For any non-support vector $x_i$, we have $y_i(x_i^\top \beta^* + \beta_0^*) > 1$, and thus $y_i(x_i^\top \beta^{(h)} + \beta_0^{(h)}) \ge 1$ for $h$ sufficiently close to 0. Therefore, for sufficiently small $h$, we have shown that $(\beta^{(h)}, \beta_0^{(h)})$ also satisfies the constraints in (1) but has smaller objective function value, contradicting the optimality of $\beta^*$ and $\beta_0^*$. Thus, we must have $\tilde\beta = \beta^*$ and $\tilde\beta_0 = \beta_0^*$.

(iii) Since $(\tilde\beta, \tilde\beta_0) = (\beta^*, \beta_0^*)$, both the decision boundary and the set of support vectors remain the same if we remove all non-support vectors from the training set. If $S = \{1, 2\}$, then $y_1$ and $y_2$ are necessarily distinct, and the decision boundary, which separates $x_1$ and $x_2$ and maximises the distance to the closer of the two points, must be the perpendicular bisector of the segment joining $x_1$ and $x_2$. Similarly, for $S = \{1, 2, 3\}$, we may assume that $y_1 = y_2 \ne y_3$. Then the decision boundary must be parallel to the line through $x_1$ and $x_2$. Thus, the decision boundary is the perpendicular bisector of the segment connecting $x_3$ to the projection of $x_3$ onto the line through $x_1$ and $x_2$.
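
As a numerical sanity check of (ii), the following R sketch (assuming the e1071 package is available; a hard margin is approximated by a very large cost) fits a linear SVM on simulated separable data, refits it using only the support vectors, and confirms that the fitted $(\beta, \beta_0)$ are essentially unchanged.

# Sanity check for Question 1(ii): refitting on the support vectors only
# should leave the decision boundary unchanged.
# Assumes the e1071 package; a very large 'cost' approximates a hard margin.
library(e1071)

set.seed(1)
n <- 50
x <- rbind(matrix(rnorm(n * 2, mean =  2), ncol = 2),
           matrix(rnorm(n * 2, mean = -2), ncol = 2))   # two well-separated clouds
y <- factor(rep(c(1, -1), each = n))

fit_full <- svm(x, y, kernel = "linear", cost = 1e6, scale = FALSE)

S <- fit_full$index                                     # indices of support vectors
fit_sv <- svm(x[S, , drop = FALSE], y[S], kernel = "linear", cost = 1e6, scale = FALSE)

# normal vector beta = t(coefs) %*% SV; intercept beta_0 = -rho
rbind(full  = as.vector(t(fit_full$coefs) %*% fit_full$SV),
      refit = as.vector(t(fit_sv$coefs)   %*% fit_sv$SV))   # should agree
c(full = -fit_full$rho, refit = -fit_sv$rho)                 # should agree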

2. (i) Let $\mathcal{X}$ be an arbitrary set. What are positive definite kernels on $\mathcal{X}$?

(ii) Let $K_1$ and $K_2$ be two positive definite kernels on $\mathcal{X}$ and $g : \mathcal{X} \to \mathbb{R}$ any function. Define $K_{\mathrm{sum}}(x, x') = K_1(x, x') + K_2(x, x')$, $K_{\mathrm{prod}}(x, x') = K_1(x, x')K_2(x, x')$ and $K_{\mathrm{conj}}(x, x') = g(x)K_1(x, x')g(x')$. Show that $K_{\mathrm{sum}}$, $K_{\mathrm{prod}}$ and $K_{\mathrm{conj}}$ are also positive definite kernels.

(iii) Let $\mathcal{X} = \mathbb{R}^p$. Using Part (ii) or otherwise, show that $K(x, x') = (c + x^\top x')^d$ and $K(x, x') = \exp\{-\frac{\|x - x'\|_2^2}{2\sigma^2}\}$ are both positive definite kernels for any $c > 0$, $d \in \mathbb{N}$ and $\sigma > 0$.

Solution. (i) A positive definite kernel on $\mathcal{X}$ is a real-valued symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying that for any $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in \mathcal{X}$, the matrix $(k(x_i, x_j))_{i,j=1,\ldots,n}$ is positive semidefinite.

(ii) Take any $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in \mathcal{X}$, and let $v \in \mathbb{R}^n$ be an arbitrary vector. We have
\[
v^\top (K_{\mathrm{sum}}(x_i, x_j))_{i,j}\, v = v^\top (K_1(x_i, x_j))_{i,j}\, v + v^\top (K_2(x_i, x_j))_{i,j}\, v \ge 0.
\]
Let $u = (u_1, \ldots, u_n)^\top$ be such that $u_j = v_j g(x_j)$. Then $v^\top (K_{\mathrm{conj}}(x_i, x_j))_{i,j}\, v = u^\top (K_1(x_i, x_j))_{i,j}\, u \ge 0$. Thus, $K_{\mathrm{sum}}$ and $K_{\mathrm{conj}}$ are both positive definite kernels. To show that $K_{\mathrm{prod}}$ is positive definite, it suffices to show that if $A, B \in \mathbb{R}^{n \times n}$ are positive semidefinite, then their pointwise (Hadamard) product $A \circ B$ is also positive semidefinite. By eigendecomposition, we may write $A = A_1 + \cdots + A_n$ and $B = B_1 + \cdots + B_n$, where each $A_i$ and $B_j$ is a rank-one positive semidefinite matrix. Since the pointwise product satisfies the distributive law, it suffices to prove that $A_r \circ B_s$ is positive semidefinite for any $r, s$. If $A_r = aa^\top$ and $B_s = bb^\top$, then $A_r \circ B_s = cc^\top$, where $c = (c_1, \ldots, c_n)^\top$ satisfies $c_i = a_i b_i$ for $i = 1, \ldots, n$. This proves that $K_{\mathrm{prod}}$ is a positive definite kernel.

(iii) We first check that the linear kernel $(x, x') \mapsto x^\top x'$ is a positive definite kernel. This is true since $v^\top (x_i^\top x_j)_{i,j}\, v = \|Xv\|_2^2 \ge 0$, where $X$ is the matrix with columns $x_1, \ldots, x_n$. By Part (ii), positive definite kernels are closed under addition and multiplication, so $(x, x') \mapsto (c + x^\top x')^d$ is positive definite. If $K(x, x') = \exp\{-\frac{1}{2\sigma^2}\|x - x'\|_2^2\}$, then for $g(x) = e^{-\frac{1}{2\sigma^2}\|x\|_2^2}$ we have
\[
K(x, x') = g(x) \exp\Big\{\frac{x^\top x'}{\sigma^2}\Big\} g(x').
\]
Hence it suffices to show that $(x, x') \mapsto e^{x^\top x'/\sigma^2}$ is positive definite. This is a (convergent) infinite sum of non-negative multiples of powers of $x^\top x'$, and hence must also be a positive definite kernel.
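
The closure properties in (ii) can be checked numerically. The R sketch below (the base kernels, the function $g$ and the sample size are arbitrary illustrative choices) builds the Gram matrices of $K_{\mathrm{sum}}$, $K_{\mathrm{prod}}$ and $K_{\mathrm{conj}}$ on random points and verifies that their smallest eigenvalues are non-negative up to rounding error.

# Numerical illustration of Question 2(ii): Gram matrices of K_sum, K_prod
# and K_conj should be positive semidefinite (smallest eigenvalue >= 0,
# up to floating point error). All choices below are illustrative.
set.seed(2)
n <- 20; p <- 3
X <- matrix(rnorm(n * p), n, p)

gram <- function(kern, X) outer(1:nrow(X), 1:nrow(X),
                                Vectorize(function(i, j) kern(X[i, ], X[j, ])))

k1 <- function(x, xp) sum(x * xp)                 # linear kernel
k2 <- function(x, xp) exp(-sum((x - xp)^2) / 2)   # Gaussian kernel, sigma = 1
g  <- function(x) sin(sum(x))                     # arbitrary function g

K1 <- gram(k1, X); K2 <- gram(k2, X)
gvec <- apply(X, 1, g)

K_sum  <- K1 + K2
K_prod <- K1 * K2                                 # Hadamard product
K_conj <- diag(gvec) %*% K1 %*% diag(gvec)        # g(x) K1(x, x') g(x')

sapply(list(sum = K_sum, prod = K_prod, conj = K_conj),
       function(K) min(eigen(K, symmetric = TRUE)$values))
# all three minimum eigenvalues should be >= -1e-10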

3. Let $G$ be a fully-connected feedforward neural network with input layer $S_1$, hidden layers $S_2, \ldots, S_{m-1}$, output layer $S_m$ and identity activation. Suppose there are $k_i$ non-bias nodes and 1 bias node in each of $S_1, \ldots, S_{m-1}$, and $k_m$ output nodes in $S_m$. Let $\tilde S_i = \{v \in S_i : v \text{ is a non-bias node}\}$.

(i) Let $x_i = (x_v : v \in \tilde S_i)$ be the vector of output values of non-bias nodes in layer $S_i$. Let $B_i = (\beta_{u,v})_{u \in \tilde S_i, v \in \tilde S_{i+1}}$ and $b_i = (\beta_{v_i^{\mathrm{bias}}, v})_{v \in \tilde S_{i+1}}$, where $v_i^{\mathrm{bias}} = S_i \setminus \tilde S_i$ is the bias node in $S_i$. Describe in matrix notation how the final output $x_m$ is computed from the input $x_1$ in a forward pass through the network.

(ii) Using the same notation as in part (i), describe how backpropagation can be carried out in this network.

(iii) Assume $k_1 \ge k_2 \ge \cdots \ge k_m$ and the input vector has distribution $x_1 \sim N_{k_1}(0, I_{k_1})$. Show that if we initialise the weights by choosing each $B_i$ to be an orthonormal matrix (with the bias weights initialised to zero), then at the end of the first forward pass after initialisation, we have marginally $x_v \sim N(0, 1)$ for all non-bias nodes $v$.

(iv) Suppose we initialise $\beta_{u,v} \overset{\mathrm{iid}}{\sim} N(0, 1/k_i)$ for all $(u, v) \in \tilde S_i \times \tilde S_{i+1}$ (this is known as the Xavier initialisation). Show that $B_i$ is approximately orthonormal in the sense that $E(B_i^\top B_i) = I_{k_{i+1}}$.

Solution. (i) The forward pass can be described by
\[
x_{i+1} = B_i^\top x_i + b_i, \qquad i = 1, \ldots, m-1,
\]
with the input layer initialised by the training data. Thus, the output layer values can be obtained as $x_m = B_{m-1}^\top(\cdots(B_2^\top(B_1^\top x_1 + b_1) + b_2)\cdots) + b_{m-1}$; in particular, when all bias weights are zero, $x_m = (B_1 B_2 \cdots B_{m-1})^\top x_1$.

(ii) Let $y$ be the actual response given an input $x_1$, and let $L(x_m, y)$ be the objective function. Backpropagation starts with the computation of $\partial L/\partial x_m$ and then propagates backwards through the recursion, for $i = m-1, \ldots, 1$:
\[
\frac{\partial L}{\partial B_i} = \Big(\frac{\partial L}{\partial \beta_{u,v}}\Big)_{(u,v) \in \tilde S_i \times \tilde S_{i+1}} = \Big(x_u \frac{\partial L}{\partial x_v}\Big)_{(u,v) \in \tilde S_i \times \tilde S_{i+1}} = x_i \Big(\frac{\partial L}{\partial x_{i+1}}\Big)^\top,
\]
\[
\frac{\partial L}{\partial b_i} = \Big(\frac{\partial L}{\partial \beta_{v_i^{\mathrm{bias}},v}}\Big)_{v \in \tilde S_{i+1}} = \Big(\frac{\partial L}{\partial x_v}\Big)_{v \in \tilde S_{i+1}} = \frac{\partial L}{\partial x_{i+1}},
\qquad
\frac{\partial L}{\partial x_i} = B_i \frac{\partial L}{\partial x_{i+1}}.
\]

(iii) We prove that $x_i \sim N_{k_i}(0, I_{k_i})$ for all $i = 1, \ldots, m$ by induction. We are given that this is true for $i = 1$. Inductively, if $x_i \sim N_{k_i}(0, I_{k_i})$, then since the bias weights are zero at initialisation and $B_i$ has orthonormal columns, we have $x_{i+1} = B_i^\top x_i \sim N_{k_{i+1}}(0, B_i^\top B_i) = N_{k_{i+1}}(0, I_{k_{i+1}})$, as desired. Thus, marginally, $x_v \sim N(0, 1)$ for all non-bias nodes $v$ at initialisation.

(iv) We have
\[
\big[E(B_i^\top B_i)\big]_{v,v'} = \sum_{u \in \tilde S_i} E(\beta_{u,v}\beta_{u,v'}) = \begin{cases} 0 & \text{if } v \ne v', \\ 1 & \text{if } v = v'. \end{cases}
\]
Thus, $E(B_i^\top B_i) = I_{k_{i+1}}$. (In fact, one can prove much stronger results about how $B_i^\top B_i$ concentrates around the identity matrix.)
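
Parts (iii) and (iv) can be illustrated by simulation. The following R sketch (layer widths, sample sizes and the use of a QR factorisation to produce orthonormal matrices are illustrative choices, not part of the question) propagates standard normal inputs through a linear network with orthonormal weights and zero biases, checks that the marginal output variances stay near 1, and then approximates $E(B_i^\top B_i)$ under the Xavier initialisation by Monte Carlo.

# Illustration of Question 3(iii)-(iv). Layer widths below are illustrative.
set.seed(3)
k <- c(50, 30, 20, 10)          # k_1 >= k_2 >= ... >= k_m
N <- 5000                        # number of simulated inputs

# orthonormal B_i (columns orthonormal), biases zero
B <- lapply(seq_len(length(k) - 1),
            function(i) qr.Q(qr(matrix(rnorm(k[i] * k[i + 1]), k[i], k[i + 1]))))

x <- matrix(rnorm(N * k[1]), N, k[1])    # rows are inputs x_1 ~ N(0, I)
for (Bi in B) x <- x %*% Bi              # forward pass: x_{i+1} = B_i^T x_i
round(range(apply(x, 2, var)), 2)        # marginal variances of output nodes, close to 1

# Xavier initialisation: beta_{u,v} ~ N(0, 1/k_i); check E(B_i^T B_i) = I by Monte Carlo
ki <- 50; ki1 <- 20; reps <- 2000
acc <- matrix(0, ki1, ki1)
for (r in seq_len(reps)) {
  Bi <- matrix(rnorm(ki * ki1, sd = sqrt(1 / ki)), ki, ki1)
  acc <- acc + crossprod(Bi)             # B_i^T B_i
}
round(acc[1:4, 1:4] / reps, 2)           # top-left corner, close to the identity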

4. An advertiser wanted to understand how different web design decisions influence the effectiveness of an online advertisement. He showed the advertisement to all users visiting his website while varying the font typeface (categorical variable font, with two levels sanserif and serif), the display style (variable display, with two levels banner and popup) and the seriousness of writing (variable writing, numerical, taking values between 1 and 10). For each user, he recorded whether the advertisement was clicked.

head(ad)
#   click     font display writing
# 1   yes    serif  banner       3
# 2    no sanserif  banner       8
# 3    no sanserif  banner       1
# 4    no    serif   popup       7
# 5    no    serif   popup       2
# 6    no    serif   popup       9
x <- model.matrix(~font*display, data=ad)[, -1]
y <- model.matrix(~click-1, data=ad)
layer_relu <- layer_dense(units = 2, activation = 'relu', input_shape = dim(x)[2])
layer_softmax <- layer_dense(units = 2, activation = 'softmax')
ad.nnet <- keras_model_sequential(list(layer_relu, layer_softmax))
compile(ad.nnet, optimizer='sgd', loss='categorical_crossentropy', metrics='acc')
fit(ad.nnet, x, y, batch_size=1, epochs=5)

(i) Write down algebraically the neural network model fitted in ad.nnet. Sketch a diagram of the neural network. Let $\beta = (\beta_{u,v} : u \to v)$ be the vector of all coefficients in model ad.nnet. Show that the maximum likelihood estimator for $\beta$ is not unique.

(ii) Assume that all weights in the neural network are initialised to be equal to 1 and a constant learning rate $\alpha = 0.1$ is used. Write down explicitly how the weights are updated after one training step in the first epoch. [You may assume that the data are not randomly shuffled, so that the stochastic gradient is computed using the first row of the dataset.]

(iii) Suppose we maximise instead the regularised log-likelihood $\ell(\beta) - \frac{\lambda}{2}\|\beta\|_2^2$ for some $\lambda > 0$. How would your stochastic gradient step in Part (ii) change?

Solution. (i) Architecture of this neural network: there are 4 input layer nodes (labelled 1, 2, 3, 0, where 0 denotes the bias node), 3 hidden layer nodes (labelled 4, 5, 0) and 2 output layer nodes (labelled 6, 7). Sketch omitted.

Let $(x^i, y^i)$ be the $i$th observation, where the entries of $x^i = (x^i_1, x^i_2)$ are coded by $x^i_1 = 0$ if the font for the $i$th visitor is sanserif and $x^i_1 = 1$ otherwise, and $x^i_2 = 0$ if the display is banner and $x^i_2 = 1$ otherwise. Then, algebraically, the neural network model fitted is
\[
y^i \sim \mathrm{Bern}(p^i), \qquad p^i = \frac{e^{s_6}}{e^{s_6} + e^{s_7}},
\]
where $s_6$ and $s_7$ are recursively defined via (we view $x_1, x_2, x_3, x_4, x_5, s_6, s_7$ as functions of $x^i$ and $\beta$, and drop the dependence on $i$ for simplicity)
\[
x_1 = x^i_1, \quad x_2 = x^i_2, \quad x_3 = x^i_1 x^i_2,
\]
\[
x_4 = \max\{\beta_{1,4}x_1 + \beta_{2,4}x_2 + \beta_{3,4}x_3 + \beta_{0,4},\, 0\}, \qquad
x_5 = \max\{\beta_{1,5}x_1 + \beta_{2,5}x_2 + \beta_{3,5}x_3 + \beta_{0,5},\, 0\},
\]
\[
s_6 = \beta_{4,6}x_4 + \beta_{5,6}x_5 + \beta_{0,6}, \qquad s_7 = \beta_{4,7}x_4 + \beta_{5,7}x_5 + \beta_{0,7}.
\]
The MLE for $\beta$ is not unique because the model is not identifiable: if we replace $\beta_{0,6}$ and $\beta_{0,7}$ by $\beta_{0,6} + c$ and $\beta_{0,7} + c$ respectively, for any $c \in \mathbb{R}$, the probability output $p^i$ is unchanged and hence the log-likelihood is unchanged.

(ii) We have $x^1 = (1, 0)$ and $y^1 = 1$. Thus, in the forward pass, we have
\[
x_1 = 1, \quad x_2 = 0, \quad x_3 = 0,
\]
\[
x_4 = \max\{x_1 + x_2 + x_3 + 1,\, 0\} = 2, \qquad x_5 = \max\{x_1 + x_2 + x_3 + 1,\, 0\} = 2,
\]
\[
s_6 = x_4 + x_5 + 1 = 5, \qquad s_7 = x_4 + x_5 + 1 = 5, \qquad p^1 = 1/2.
\]
The contribution of the first observation to the log-likelihood is $\ell_1 = y^1 \log p^1 + (1 - y^1)\log(1 - p^1)$. Thus, the partial derivatives of $\ell_1$ with respect to all weights can be computed via backpropagation as follows.

Output probability:
\[
\frac{\partial \ell_1}{\partial p^1} = \frac{1}{p^1} = 2.
\]
Output layer:
\[
\frac{\partial \ell_1}{\partial s_6} = \frac{\partial \ell_1}{\partial p^1} \cdot \frac{e^{s_6+s_7}}{(e^{s_6} + e^{s_7})^2} = \frac{1}{2}, \qquad
\frac{\partial \ell_1}{\partial s_7} = -\frac{\partial \ell_1}{\partial p^1} \cdot \frac{e^{s_6+s_7}}{(e^{s_6} + e^{s_7})^2} = -\frac{1}{2},
\]
\[
\frac{\partial \ell_1}{\partial \beta_{4,6}} = \frac{\partial \ell_1}{\partial s_6}\, x_4 = 1, \quad \frac{\partial \ell_1}{\partial \beta_{5,6}} = 1, \quad
\frac{\partial \ell_1}{\partial \beta_{4,7}} = \frac{\partial \ell_1}{\partial s_7}\, x_4 = -1, \quad \frac{\partial \ell_1}{\partial \beta_{5,7}} = -1,
\]
\[
\frac{\partial \ell_1}{\partial \beta_{0,6}} = \frac{\partial \ell_1}{\partial s_6} = \frac{1}{2}, \qquad \frac{\partial \ell_1}{\partial \beta_{0,7}} = \frac{\partial \ell_1}{\partial s_7} = -\frac{1}{2}.
\]
Hidden layer:
\[
\frac{\partial \ell_1}{\partial x_4} = \frac{\partial \ell_1}{\partial s_6}\,\beta_{4,6} + \frac{\partial \ell_1}{\partial s_7}\,\beta_{4,7} = 0, \qquad \frac{\partial \ell_1}{\partial s_4} = \frac{\partial \ell_1}{\partial x_4}\, f'(s_4) = 0,
\]
\[
\frac{\partial \ell_1}{\partial x_5} = \frac{\partial \ell_1}{\partial s_6}\,\beta_{5,6} + \frac{\partial \ell_1}{\partial s_7}\,\beta_{5,7} = 0, \qquad \frac{\partial \ell_1}{\partial s_5} = \frac{\partial \ell_1}{\partial x_5}\, f'(s_5) = 0,
\]
where $f$ denotes the ReLU activation, and hence $\frac{\partial \ell_1}{\partial \beta_{u,v}} = 0$ for all $(u, v) \in \{1, 2, 3, 0\} \times \{4, 5\}$.

Thus, after one (gradient ascent) training step, the weights are updated as
\[
\beta_{u,v} = \begin{cases}
1 + 0.1 \times 0 = 1 & \text{if } (u, v) \in \{1, 2, 3, 0\} \times \{4, 5\}, \\
1 + 0.1 \times 1 = 1.1 & \text{if } (u, v) \in \{4, 5\} \times \{6\}, \\
1 - 0.1 \times 1 = 0.9 & \text{if } (u, v) \in \{4, 5\} \times \{7\}, \\
1 + 0.1 \times 0.5 = 1.05 & \text{if } (u, v) = (0, 6), \\
1 - 0.1 \times 0.5 = 0.95 & \text{if } (u, v) = (0, 7).
\end{cases}
\]
(Remark: we see from this calculation that symmetry-breaking is needed for more than just all-zero initial weights.)

(iii) We can write
\[
\ell(\beta) - \frac{\lambda}{2}\|\beta\|_2^2 = \sum_{i=1}^n \Big( \ell_i(\beta) - \frac{\lambda}{2n}\|\beta\|_2^2 \Big).
\]
All we need to do is to additionally subtract $\alpha \frac{\lambda}{n}\beta_{u,v}$ (evaluated at the current weights, so here $0.1\lambda/n$) from each weight update computed in Part (ii). (Remark: this is known as weight decay in the neural networks literature.)
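
The hand calculation in (ii) can be reproduced numerically. The sketch below implements the forward and backward pass for the first observation directly in base R (it does not call keras; all variable names are illustrative) and prints the updated weight matrices, which match the values derived above.

# Numerical check of Solution 4(ii), in plain R.
# Weights all start at 1; learning rate alpha = 0.1; first observation has
# x = (1, 0) (so the interaction x3 = 0) and y = 1 (click = yes).
relu  <- function(z) pmax(z, 0)
alpha <- 0.1
x <- c(1, 0, 0)                              # (x1, x2, x3)
y <- 1

w_h <- matrix(1, 4, 2)                       # weights into nodes 4, 5 (last row = bias)
w_o <- matrix(1, 3, 2)                       # weights into nodes 6, 7 (last row = bias)

# forward pass
h_in  <- c(x, 1) %*% w_h                     # pre-activations s4, s5
h_out <- relu(h_in)                          # x4, x5
s     <- c(h_out, 1) %*% w_o                 # s6, s7
p     <- exp(s[1]) / sum(exp(s))             # P(click = yes), equals 1/2 here

# backward pass (gradient of the log-likelihood contribution l_1)
d_s <- c(y - p, -(y - p))                                 # dl/ds6 = 1/2, dl/ds7 = -1/2
g_o <- outer(c(h_out, 1), d_s)                            # gradients for beta_{4,6}, ..., beta_{0,7}
d_h <- as.vector(w_o[1:2, ] %*% d_s) * as.numeric(h_in > 0)  # dl/ds4, dl/ds5 (both 0 here)
g_h <- outer(c(x, 1), d_h)                                # gradients for input-to-hidden weights

# one gradient-ascent step
w_h + alpha * g_h                            # unchanged (all equal to 1)
w_o + alpha * g_o                            # 1.1 / 0.9 and 1.05 / 0.95, as in the text
# for Part (iii), each update would additionally subtract alpha * (lambda / n) * (current weight)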

5. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{1, 2\}$ be a training dataset. Let $\hat\psi^{\mathrm{bnn}}_{B,m}$ be the bootstrap aggregation nearest neighbour classifier using $B$ $m$-out-of-$n$ bootstrap samples.

(i) Describe how the bootstrap aggregation nearest neighbour classifier is constructed.

(ii) Show that if $m > n\log 2$, then for any $x \in \mathbb{R}^p$, we have
\[
P\big( \hat\psi^{\mathrm{bnn}}_{B,m}(x) \ne \hat\psi^{\mathrm{NN}}(x) \big) \le \exp\Big\{-2B\Big(\frac{1}{2} - e^{-m/n}\Big)^2\Big\}.
\]
[Hint: you may use Hoeffding's inequality, which says in particular that if $Z \sim \mathrm{Bin}(B, p)$, then $P(Z/B < p - t) \le e^{-2Bt^2}$.]

(iii) Suppose we modify the bootstrap scheme in $\hat\psi^{\mathrm{bnn}}_{B,m}$ to resample instead $m$ data points from the training set without replacement in each bootstrap sample. Call this modified bootstrap aggregation classifier $\hat\psi^{\mathrm{bnn,w/o}}_{B,m}$. Show that as $B \to \infty$, $\hat\psi^{\mathrm{bnn,w/o}}_{B,m}$ converges to a weighted nearest neighbour classifier with appropriate weights.

Solution. (i) Omitted.

(ii) We have by definition that $\hat\psi^{\mathrm{NN}}(x) = Y_{(1)}$, the label of the training point nearest to $x$. Let $\hat\psi^{(b)}(\cdot)$ be the nearest neighbour classifier in the $b$th bootstrap sample. Then the indicators $\mathbb{1}\{\hat\psi^{(b)}(x) = Y_{(1)}\}$, $b = 1, \ldots, B$, are iid Bernoulli random variables with success probability $p$ at least the probability that $X_{(1)}$ is selected in an $m$-out-of-$n$ bootstrap sample, i.e.
\[
p \ge 1 - \Big(1 - \frac{1}{n}\Big)^m \ge 1 - e^{-m/n} > \frac{1}{2},
\]
where the last inequality uses $m > n\log 2$. The bagged classifier can disagree with $\hat\psi^{\mathrm{NN}}(x)$ only if at most half of the $B$ bootstrap classifiers predict $Y_{(1)}$, so by Hoeffding's inequality,
\[
P\big(\hat\psi^{\mathrm{bnn}}_{B,m}(x) \ne \hat\psi^{\mathrm{NN}}(x)\big) \le e^{-2B(\frac{1}{2} - e^{-m/n})^2},
\]
as required.

(iii) Under an $m$-out-of-$n$ without-replacement resampling scheme, the probability that $X_{(i)}$, the $i$th nearest training point to $x$, becomes the nearest point to $x$ in a bootstrap sample is
\[
w^{\mathrm{bnn,w/o}}_i := P(X_{(i)} \text{ is nearest to } x) = \begin{cases} \binom{n-i}{m-1}\big/\binom{n}{m} & \text{if } 1 \le i \le n - m + 1, \\ 0 & \text{if } n - m + 2 \le i \le n. \end{cases}
\]
Let $\hat\psi^{\mathrm{bnn,w/o}}_{B,m}$ be the bagged nearest neighbour classifier without replacement. Then, by the law of large numbers,
\[
\lim_{B \to \infty} \hat\psi^{\mathrm{bnn,w/o}}_{B,m}(x)
= \lim_{B \to \infty} \operatorname*{arg\,max}_{l=1,\ldots,L} \frac{1}{B}\sum_{b=1}^B \mathbb{1}\{\hat\psi^{(b)}(x) = l\}
= \operatorname*{arg\,max}_{l=1,\ldots,L} P\{\hat\psi^{(1)}(x) = l\}
= \operatorname*{arg\,max}_{l=1,\ldots,L} \sum_{i=1}^n w^{\mathrm{bnn,w/o}}_i \mathbb{1}\{Y_{(i)} = l\},
\]
establishing the claim that the infinite, without-replacement bagged nearest neighbour classifier is also a weighted nearest neighbour classifier.
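
The bound in (ii) can be examined by simulation. The R sketch below (the training set, test point and the values of $n$, $m$ and $B$ are arbitrary illustrative choices) repeatedly draws $m$-out-of-$n$ bootstrap samples, forms the bagged one-nearest-neighbour vote at a fixed test point, and compares the observed disagreement frequency with $\hat\psi^{\mathrm{NN}}$ against the Hoeffding bound.

# Simulation for Question 5(ii): disagreement between the bagged 1-NN classifier
# and the plain 1-NN classifier at a fixed test point, compared with the bound
# exp(-2 B (1/2 - exp(-m/n))^2). Parameters are illustrative.
set.seed(5)
n <- 100; p <- 2; m <- 80; B <- 50         # note m > n log 2 (about 69.3)
X <- matrix(runif(n * p), n, p)
Y <- sample(1:2, n, replace = TRUE)
x0 <- c(0.5, 0.5)                          # fixed test point

nn_label <- function(Xs, Ys, x) Ys[which.min(colSums((t(Xs) - x)^2))]
psi_nn <- nn_label(X, Y, x0)               # plain 1-NN prediction

bagged_nn <- function() {
  votes <- replicate(B, {
    idx <- sample(n, m, replace = TRUE)    # m-out-of-n bootstrap sample
    nn_label(X[idx, , drop = FALSE], Y[idx], x0)
  })
  as.integer(names(which.max(table(votes))))
}

reps <- 500
disagree <- mean(replicate(reps, bagged_nn()) != psi_nn)
c(observed = disagree, bound = exp(-2 * B * (1/2 - exp(-m / n))^2))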

6. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{1, 2\}$ be a training dataset.

(i) Suppose $\phi : \mathbb{R}^p \to \mathcal{H}$ is a feature map into some Hilbert space $\mathcal{H}$ such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{H}} = k(x, x')$ for some positive definite kernel $k$. Show that nearest neighbour classification in the feature space depends on $\phi$ only through the kernel $k$.

(ii) Show that if $k$ is a Gaussian kernel, then the $k$-nearest neighbour classifier in the feature space gives the same classification outcome as the $k$-nearest neighbour classifier in the original space $\mathbb{R}^p$.

(iii) Nearest neighbours in high dimensions can be very far away. Assume $x_1, \ldots, x_n$ are drawn uniformly from the unit ball $B(0, 1)$ in $\mathbb{R}^p$. Let $R = \min_{i=1,\ldots,n} \|x_i\|_2$ be the nearest neighbour distance at the origin. Show that $\mathrm{median}(R) = (1 - 2^{-1/n})^{1/p}$. Sketch how $\mathrm{median}(R)$ varies as $n$ increases, for $p \in \{1, 10, 100\}$.

Solution. (i) Nearest neighbour classification in the feature space requires sorting $\{\phi(x_i) : i = 1, \ldots, n\}$ according to their distance in $\mathcal{H}$ to the embedded test point $\phi(x)$. This distance computation can be carried out using the kernel $k$ only, via
\[
\|\phi(x) - \phi(x_i)\|_{\mathcal{H}}^2 = k(x, x) + k(x_i, x_i) - 2k(x, x_i).
\]
Hence the classifier depends on the feature map only through $k$.

(ii) If $k(x, x') = e^{-\frac{1}{2\sigma^2}\|x - x'\|_2^2}$, then
\[
\|\phi(x) - \phi(x_i)\|_{\mathcal{H}}^2 = 2 - 2e^{-\frac{1}{2\sigma^2}\|x - x_i\|_2^2}.
\]
Note that the right-hand side is increasing in $\|x - x_i\|_2$. Let $\pi$ be a permutation of $\{1, \ldots, n\}$ such that $\|x - x_{\pi(1)}\|_2 \le \cdots \le \|x - x_{\pi(n)}\|_2$. Then we also have $\|\phi(x) - \phi(x_{\pi(1)})\|_{\mathcal{H}} \le \cdots \le \|\phi(x) - \phi(x_{\pi(n)})\|_{\mathcal{H}}$. Therefore, the Gaussian-kernel nearest neighbour classifier gives identical outcomes to the vanilla $k$-NN classifier.

(iii) We have
\[
P(R \ge r) = P(\{x_1, \ldots, x_n\} \cap B(0, r) = \emptyset) = (1 - r^p)^n.
\]
Thus, the median of $R$ satisfies $(1 - \mathrm{median}(R)^p)^n = 1/2$, which is equivalent to the desired result.
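
The sketch requested in (iii) can be produced directly from the closed-form expression for the median; a minimal R snippet:

# Question 6(iii): median nearest-neighbour distance at the origin,
# median(R) = (1 - 2^(-1/n))^(1/p), as a function of n for p in {1, 10, 100}.
n <- 1:1000
med <- function(n, p) (1 - 2^(-1 / n))^(1 / p)
plot(n, med(n, 1), type = "l", ylim = c(0, 1), log = "x",
     xlab = "n", ylab = "median(R)")
lines(n, med(n, 10), lty = 2)
lines(n, med(n, 100), lty = 3)
legend("topright", legend = c("p = 1", "p = 10", "p = 100"), lty = 1:3)
# For p = 100 the median distance stays close to 1 even at n = 1000,
# illustrating that nearest neighbours in high dimensions are far away.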

7. (Exercise with R) Download the letter dataset from the course website:

train <- read.table('http://www.statslab.cam.ac.uk/~tw389/teaching/slp8/data/letter', header=TRUE)

Each row of the data contains 16 different attributes of a pixel image of one of the 26 capital letters in the English alphabet, together with the letter label itself. More information about this dataset can be found at https://archive.ics.uci.edu/ml/datasets/letter+recognition. Construct a classifier using one of the methods covered in this course. Then test your classifier on the test dataset letter_test from the course website. What is your test error? (Note that all design decisions of your classifier must be based on the training data.)
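
One possible (non-definitive) approach is sketched below in R, using $k$-nearest neighbours from the class package with $k$ chosen by cross-validation on the training data only. The path of the test file and the rule used to locate the label column are assumptions and may need adjusting to the actual files on the course website.

# A possible solution sketch for Question 7: k-NN via the 'class' package,
# with k chosen by 5-fold cross-validation on the training data only.
library(class)

train <- read.table('http://www.statslab.cam.ac.uk/~tw389/teaching/slp8/data/letter',
                    header = TRUE)
# assumed path for the test file; adjust if it is located elsewhere
test  <- read.table('http://www.statslab.cam.ac.uk/~tw389/teaching/slp8/data/letter_test',
                    header = TRUE)

lab_col <- which(!sapply(train, is.numeric))[1]   # assume the non-numeric column is the label
x_tr <- scale(train[, -lab_col])
y_tr <- factor(train[[lab_col]])
x_te <- scale(test[, -lab_col],
              center = attr(x_tr, "scaled:center"),
              scale  = attr(x_tr, "scaled:scale"))
y_te <- factor(test[[lab_col]])

# choose k by 5-fold cross-validation on the training data
set.seed(7)
ks <- c(1, 3, 5, 7, 9)
folds <- sample(rep(1:5, length.out = nrow(x_tr)))
cv_err <- sapply(ks, function(k) mean(sapply(1:5, function(f) {
  pred <- knn(x_tr[folds != f, ], x_tr[folds == f, ], y_tr[folds != f], k = k)
  mean(pred != y_tr[folds == f])
})))
k_best <- ks[which.min(cv_err)]

test_err <- mean(knn(x_tr, x_te, y_tr, k = k_best) != y_te)
test_err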