STATISTICAL LEARNING IN PRACTICE
Part III / Lent 2018
Example Sheet 3 (of 4)
By Dr T. Wang

You have the option to submit your answers to Questions 1 and 4 to be marked. If you want your answers to be marked, please leave them in my pigeon hole in the central core of the CMS by 2pm on 7 Mar.

1. Suppose we have observations $x_1, \ldots, x_n \in \mathbb{R}^p$ with associated class labels $y_1, \ldots, y_n \in \{-1, 1\}$. Assume that the two classes are completely separable.

(i) Write down the optimisation problem solved by the support vector machine (SVM) classifier with hard margin and linear kernel.

(ii) What are support vectors for this SVM? Let $S = \{i : x_i \text{ is a support vector}\}$. Show that the decision boundary of this SVM will not change if we use only $\{(x_i, y_i) : i \in S\}$ as our training data.

(iii) Suppose $S = \{1, 2\}$. Describe the decision boundary in terms of $x_1, x_2$. What if $S = \{1, 2, 3\}$?

Solution. (i) The optimisation problem solved is
\[
\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \frac{1}{2}\|\beta\|_2^2 \quad \text{subject to} \quad y_i(x_i^\top \beta + \beta_0) \ge 1, \quad i = 1, \ldots, n. \tag{1}
\]

(ii) Let $\beta^*$ and $\beta_0^*$ be the optimiser of the above optimisation problem. The set of support vectors is $\{x_i : y_i(x_i^\top \beta^* + \beta_0^*) = 1\}$. Let $\tilde\beta$ and $\tilde\beta_0$ be the optimiser of the SVM problem using $\{(x_i, y_i) : i \in S\}$ as training data; we want to show that $\tilde\beta = \beta^*$ and $\tilde\beta_0 = \beta_0^*$. Assume this is not true, and we prove by contradiction. Note that $(\beta^*, \beta_0^*)$ satisfies the constraints in the SVM problem using $\{(x_i, y_i) : i \in S\}$ as the training data. Thus, by uniqueness of the optimiser (note strong convexity), we must have
\[
\frac{1}{2}\|\tilde\beta\|_2^2 < \frac{1}{2}\|\beta^*\|_2^2.
\]
Let $\beta^{(h)} = h\tilde\beta + (1-h)\beta^*$ and $\beta_0^{(h)} = h\tilde\beta_0 + (1-h)\beta_0^*$. Then by convexity of the squared $L_2$ norm, we have $\frac{1}{2}\|\beta^{(h)}\|_2^2 < \frac{1}{2}\|\beta^*\|_2^2$ for all $h \in (0, 1)$. Moreover, for any support vector $x_i$, we have $y_i(x_i^\top \beta^* + \beta_0^*) \ge 1$ and $y_i(x_i^\top \tilde\beta + \tilde\beta_0) \ge 1$, and thus $y_i(x_i^\top \beta^{(h)} + \beta_0^{(h)}) \ge 1$. For any non-support vector $x_i$, we have $y_i(x_i^\top \beta^* + \beta_0^*) > 1$, and thus $y_i(x_i^\top \beta^{(h)} + \beta_0^{(h)}) \ge 1$ for $h$ sufficiently close to 0. Therefore, for sufficiently small $h$, we have shown that $(\beta^{(h)}, \beta_0^{(h)})$ also satisfies the constraints in (1) but has smaller objective function value, contradicting the optimality of $\beta^*$ and $\beta_0^*$. Thus, we must have $\tilde\beta = \beta^*$ and $\tilde\beta_0 = \beta_0^*$.

(iii) Since $(\tilde\beta, \tilde\beta_0) = (\beta^*, \beta_0^*)$, both the decision boundary and the set of support vectors remain the same if we remove all non-support vectors from the training set. If $S = \{1, 2\}$, then $y_1$ and $y_2$ are necessarily distinct, and the decision boundary, which separates $x_1$ and $x_2$ and maximises the distance to the closer of the two points, must be the perpendicular bisector of the segment joining $x_1$ and $x_2$. Similarly, for $S = \{1, 2, 3\}$, we may assume that $y_1 = y_2 \ne y_3$. Then the decision boundary must be parallel to the line through $x_1$ and $x_2$. Thus, the decision boundary is the perpendicular bisector of the segment connecting $x_3$ to the projection of $x_3$ onto the line through $x_1$ and $x_2$.
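
As a numerical sanity check of (ii), the following R sketch (assuming the e1071 package is available; a hard margin is approximated by a very large cost) fits a linear SVM on simulated separable data, refits it using only the support vectors, and confirms that the fitted $(\beta, \beta_0)$ are essentially unchanged.

# Sanity check for Question 1(ii): refitting on the support vectors only
# should leave the decision boundary unchanged.
# Assumes the e1071 package; a very large 'cost' approximates a hard margin.
library(e1071)

set.seed(1)
n <- 50
x <- rbind(matrix(rnorm(n * 2, mean =  2), ncol = 2),
           matrix(rnorm(n * 2, mean = -2), ncol = 2))   # two well-separated clouds
y <- factor(rep(c(1, -1), each = n))

fit_full <- svm(x, y, kernel = "linear", cost = 1e6, scale = FALSE)

S <- fit_full$index                                     # indices of support vectors
fit_sv <- svm(x[S, , drop = FALSE], y[S], kernel = "linear", cost = 1e6, scale = FALSE)

# normal vector beta = t(coefs) %*% SV; intercept beta_0 = -rho
rbind(full  = as.vector(t(fit_full$coefs) %*% fit_full$SV),
      refit = as.vector(t(fit_sv$coefs)   %*% fit_sv$SV))   # should agree
c(full = -fit_full$rho, refit = -fit_sv$rho)                 # should agree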

2. (i) Let $\mathcal{X}$ be an arbitrary set. What are positive definite kernels on $\mathcal{X}$?

(ii) Let $K_1$ and $K_2$ be two positive definite kernels on $\mathcal{X}$ and $g : \mathcal{X} \to \mathbb{R}$ any function. Define $K_{\mathrm{sum}}(x, x') = K_1(x, x') + K_2(x, x')$, $K_{\mathrm{prod}}(x, x') = K_1(x, x')K_2(x, x')$ and $K_{\mathrm{conj}}(x, x') = g(x)K_1(x, x')g(x')$. Show that $K_{\mathrm{sum}}$, $K_{\mathrm{prod}}$ and $K_{\mathrm{conj}}$ are also positive definite kernels.

(iii) Let $\mathcal{X} = \mathbb{R}^p$. Using Part (ii) or otherwise, show that $K(x, x') = (c + x^\top x')^d$ and $K(x, x') = \exp\{-\frac{\|x - x'\|_2^2}{2\sigma^2}\}$ are both positive definite kernels for any $c > 0$, $d \in \mathbb{N}$ and $\sigma > 0$.

Solution. (i) A positive definite kernel on $\mathcal{X}$ is a real-valued symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying that for any $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in \mathcal{X}$, the matrix $(k(x_i, x_j))_{i,j=1,\ldots,n}$ is positive semidefinite.

(ii) Take any $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in \mathcal{X}$, and let $v \in \mathbb{R}^n$ be an arbitrary vector. We have
\[
v^\top (K_{\mathrm{sum}}(x_i, x_j))_{i,j}\, v = v^\top (K_1(x_i, x_j))_{i,j}\, v + v^\top (K_2(x_i, x_j))_{i,j}\, v \ge 0.
\]
Let $u = (u_1, \ldots, u_n)^\top$ be such that $u_j = v_j g(x_j)$. Then $v^\top (K_{\mathrm{conj}}(x_i, x_j))_{i,j}\, v = u^\top (K_1(x_i, x_j))_{i,j}\, u \ge 0$. Thus, $K_{\mathrm{sum}}$ and $K_{\mathrm{conj}}$ are both positive definite kernels. To show that $K_{\mathrm{prod}}$ is positive definite, it suffices to show that if $A, B \in \mathbb{R}^{n \times n}$ are positive semidefinite, then their pointwise (Hadamard) product $A \circ B$ is also positive semidefinite. By eigendecomposition, we may write $A = A_1 + \cdots + A_n$ and $B = B_1 + \cdots + B_n$, where each $A_i$ and $B_j$ is a rank-one positive semidefinite matrix. Since the pointwise product satisfies the distributive law, it suffices to prove that $A_r \circ B_s$ is positive semidefinite for any $r, s$. If $A_r = aa^\top$ and $B_s = bb^\top$, then $A_r \circ B_s = cc^\top$, where $c = (c_1, \ldots, c_n)^\top$ satisfies $c_i = a_i b_i$ for $i = 1, \ldots, n$. This proves that $K_{\mathrm{prod}}$ is a positive definite kernel.

(iii) We first check that the linear kernel $(x, x') \mapsto x^\top x'$ is a positive definite kernel. This is true since $v^\top (x_i^\top x_j)_{i,j}\, v = \|Xv\|_2^2 \ge 0$, where $X$ is the matrix with columns $x_1, \ldots, x_n$. By Part (ii), positive definite kernels are closed under addition and multiplication, so $(x, x') \mapsto (c + x^\top x')^d$ is positive definite. If $K(x, x') = \exp\{-\frac{1}{2\sigma^2}\|x - x'\|_2^2\}$, then for $g(x) = e^{-\frac{1}{2\sigma^2}\|x\|_2^2}$ we have
\[
K(x, x') = g(x) \exp\Big\{\frac{x^\top x'}{\sigma^2}\Big\} g(x').
\]
Hence it suffices to show that $(x, x') \mapsto e^{x^\top x'/\sigma^2}$ is positive definite. This is a (convergent) infinite sum of non-negative multiples of powers of $x^\top x'$, and hence must also be a positive definite kernel.
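
The closure properties in (ii) can be checked numerically. The R sketch below (the base kernels, the function $g$ and the sample size are arbitrary illustrative choices) builds the Gram matrices of $K_{\mathrm{sum}}$, $K_{\mathrm{prod}}$ and $K_{\mathrm{conj}}$ on random points and verifies that their smallest eigenvalues are non-negative up to rounding error.

# Numerical illustration of Question 2(ii): Gram matrices of K_sum, K_prod
# and K_conj should be positive semidefinite (smallest eigenvalue >= 0,
# up to floating point error). All choices below are illustrative.
set.seed(2)
n <- 20; p <- 3
X <- matrix(rnorm(n * p), n, p)

gram <- function(kern, X) outer(1:nrow(X), 1:nrow(X),
                                Vectorize(function(i, j) kern(X[i, ], X[j, ])))

k1 <- function(x, xp) sum(x * xp)                 # linear kernel
k2 <- function(x, xp) exp(-sum((x - xp)^2) / 2)   # Gaussian kernel, sigma = 1
g  <- function(x) sin(sum(x))                     # arbitrary function g

K1 <- gram(k1, X); K2 <- gram(k2, X)
gvec <- apply(X, 1, g)

K_sum  <- K1 + K2
K_prod <- K1 * K2                                 # Hadamard product
K_conj <- diag(gvec) %*% K1 %*% diag(gvec)        # g(x) K1(x, x') g(x')

sapply(list(sum = K_sum, prod = K_prod, conj = K_conj),
       function(K) min(eigen(K, symmetric = TRUE)$values))
# all three minimum eigenvalues should be >= -1e-10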

3. Let $G$ be a fully-connected feedforward neural network with input layer $S_1$, hidden layers $S_2, \ldots, S_{m-1}$, output layer $S_m$ and identity activation. Suppose there are $k_i$ non-bias nodes and 1 bias node in each of $S_1, \ldots, S_{m-1}$, and $k_m$ output nodes in $S_m$. Let $\tilde S_i = \{v \in S_i : v \text{ is a non-bias node}\}$.

(i) Let $x_i = (x_v : v \in \tilde S_i)$ be the vector of output values of non-bias nodes in layer $S_i$. Let $B_i = (\beta_{u,v})_{u \in \tilde S_i, v \in \tilde S_{i+1}}$ and $b_i = (\beta_{v_i^{\mathrm{bias}}, v})_{v \in \tilde S_{i+1}}$, where $v_i^{\mathrm{bias}} = S_i \setminus \tilde S_i$ is the bias node in $S_i$. Describe in matrix notation how the final output $x_m$ is computed from the input $x_1$ in a forward pass through the network.

(ii) Using the same notation as in part (i), describe how backpropagation can be carried out in this network.

(iii) Assume $k_1 \ge k_2 \ge \cdots \ge k_m$ and the input vector has distribution $x_1 \sim N_{k_1}(0, I_{k_1})$. Show that if we initialise the weights by choosing each $B_i$ to be an orthonormal matrix (with the bias weights initialised to zero), then at the end of the first forward pass after initialisation, we have marginally $x_v \sim N(0, 1)$ for all non-bias nodes $v$.

(iv) Suppose we initialise $\beta_{u,v} \overset{\mathrm{iid}}{\sim} N(0, 1/k_i)$ for all $(u, v) \in \tilde S_i \times \tilde S_{i+1}$ (this is known as the Xavier initialisation). Show that $B_i$ is approximately orthonormal in the sense that $E(B_i^\top B_i) = I_{k_{i+1}}$.

Solution. (i) The forward pass can be described by
\[
x_{i+1} = B_i^\top x_i + b_i, \qquad i = 1, \ldots, m-1,
\]
with the input layer initialised by the training data. Thus, the output layer values can be obtained as $x_m = B_{m-1}^\top(\cdots(B_2^\top(B_1^\top x_1 + b_1) + b_2)\cdots) + b_{m-1}$; in particular, when all bias weights are zero, $x_m = (B_1 B_2 \cdots B_{m-1})^\top x_1$.

(ii) Let $y$ be the actual response given an input $x_1$, and let $L(x_m, y)$ be the objective function. Backpropagation starts with the computation of $\partial L/\partial x_m$ and then propagates backwards through the recursion, for $i = m-1, \ldots, 1$:
\[
\frac{\partial L}{\partial B_i} = \Big(\frac{\partial L}{\partial \beta_{u,v}}\Big)_{(u,v) \in \tilde S_i \times \tilde S_{i+1}} = \Big(x_u \frac{\partial L}{\partial x_v}\Big)_{(u,v) \in \tilde S_i \times \tilde S_{i+1}} = x_i \Big(\frac{\partial L}{\partial x_{i+1}}\Big)^\top,
\]
\[
\frac{\partial L}{\partial b_i} = \Big(\frac{\partial L}{\partial \beta_{v_i^{\mathrm{bias}},v}}\Big)_{v \in \tilde S_{i+1}} = \Big(\frac{\partial L}{\partial x_v}\Big)_{v \in \tilde S_{i+1}} = \frac{\partial L}{\partial x_{i+1}},
\qquad
\frac{\partial L}{\partial x_i} = B_i \frac{\partial L}{\partial x_{i+1}}.
\]

(iii) We prove that $x_i \sim N_{k_i}(0, I_{k_i})$ for all $i = 1, \ldots, m$ by induction. We are given that this is true for $i = 1$. Inductively, if $x_i \sim N_{k_i}(0, I_{k_i})$, then since the bias weights are zero at initialisation and $B_i$ has orthonormal columns, we have $x_{i+1} = B_i^\top x_i \sim N_{k_{i+1}}(0, B_i^\top B_i) = N_{k_{i+1}}(0, I_{k_{i+1}})$, as desired. Thus, marginally, $x_v \sim N(0, 1)$ for all non-bias nodes $v$ at initialisation.

(iv) We have
\[
\big[E(B_i^\top B_i)\big]_{v,v'} = \sum_{u \in \tilde S_i} E(\beta_{u,v}\beta_{u,v'}) = \begin{cases} 0 & \text{if } v \ne v', \\ 1 & \text{if } v = v'. \end{cases}
\]
Thus, $E(B_i^\top B_i) = I_{k_{i+1}}$. (In fact, one can prove much stronger results about how $B_i^\top B_i$ concentrates around the identity matrix.)
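
Parts (iii) and (iv) can be illustrated by simulation. The following R sketch (layer widths, sample sizes and the use of a QR factorisation to produce orthonormal matrices are illustrative choices, not part of the question) propagates standard normal inputs through a linear network with orthonormal weights and zero biases, checks that the marginal output variances stay near 1, and then approximates $E(B_i^\top B_i)$ under the Xavier initialisation by Monte Carlo.

# Illustration of Question 3(iii)-(iv). Layer widths below are illustrative.
set.seed(3)
k <- c(50, 30, 20, 10)          # k_1 >= k_2 >= ... >= k_m
N <- 5000                        # number of simulated inputs

# orthonormal B_i (columns orthonormal), biases zero
B <- lapply(seq_len(length(k) - 1),
            function(i) qr.Q(qr(matrix(rnorm(k[i] * k[i + 1]), k[i], k[i + 1]))))

x <- matrix(rnorm(N * k[1]), N, k[1])    # rows are inputs x_1 ~ N(0, I)
for (Bi in B) x <- x %*% Bi              # forward pass: x_{i+1} = B_i^T x_i
round(range(apply(x, 2, var)), 2)        # marginal variances of output nodes, close to 1

# Xavier initialisation: beta_{u,v} ~ N(0, 1/k_i); check E(B_i^T B_i) = I by Monte Carlo
ki <- 50; ki1 <- 20; reps <- 2000
acc <- matrix(0, ki1, ki1)
for (r in seq_len(reps)) {
  Bi <- matrix(rnorm(ki * ki1, sd = sqrt(1 / ki)), ki, ki1)
  acc <- acc + crossprod(Bi)             # B_i^T B_i
}
round(acc[1:4, 1:4] / reps, 2)           # top-left corner, close to the identity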

4. An advertiser wanted to understand how different web design decisions influence the effectiveness of an online advertisement. He showed the advertisement to all users visiting his website while varying the font typeface (categorical variable font, with two levels sanserif and serif), the display style (variable display, with two levels banner and popup) and the seriousness of writing (variable writing, numerical, taking values between 1 and 10). For each user, he recorded whether the advertisement was clicked.

head(ad)
#   click     font display writing
# 1   yes    serif  banner       3
# 2    no sanserif  banner       8
# 3    no sanserif  banner       1
# 4    no    serif   popup       7
# 5    no    serif   popup       2
# 6    no    serif   popup       9
x <- model.matrix(~font*display, data=ad)[, -1]
y <- model.matrix(~click-1, data=ad)
layer_relu <- layer_dense(units = 2, activation = 'relu', input_shape = dim(x)[2])
layer_softmax <- layer_dense(units = 2, activation = 'softmax')
ad.nnet <- keras_model_sequential(list(layer_relu, layer_softmax))
compile(ad.nnet, optimizer='sgd', loss='categorical_crossentropy', metrics='acc')
fit(ad.nnet, x, y, batch_size=1, epochs=5)

(i) Write down algebraically the neural network model fitted in ad.nnet. Sketch a diagram of the neural network. Let $\beta = (\beta_{u,v} : u \to v)$ be the vector of all coefficients in model ad.nnet. Show that the maximum likelihood estimator for $\beta$ is not unique.

(ii) Assume that all weights in the neural network are initialised to be equal to 1 and a constant learning rate $\alpha = 0.1$ is used. Write down explicitly how the weights are updated after one training step in the first epoch. [You may assume that the data are not randomly shuffled, so that the stochastic gradient is computed using the first row of the dataset.]

(iii) Suppose we maximise instead the regularised log-likelihood $\ell(\beta) - \frac{\lambda}{2}\|\beta\|_2^2$ for some $\lambda > 0$. How would your stochastic gradient step in Part (ii) change?

Solution. (i) Architecture of this neural network: there are 4 input layer nodes (labelled 1, 2, 3, 0, where 0 denotes the bias node), 3 hidden layer nodes (labelled 4, 5, 0) and 2 output layer nodes (labelled 6, 7). Sketch omitted.

Let $(x^i, y^i)$ be the $i$th observation, where the entries of $x^i = (x^i_1, x^i_2)$ are coded by $x^i_1 = 0$ if the font for the $i$th visitor is sanserif and $x^i_1 = 1$ otherwise, and $x^i_2 = 0$ if the display is banner and $x^i_2 = 1$ otherwise. Then, algebraically, the neural network model fitted is
\[
y^i \sim \mathrm{Bern}(p^i), \qquad p^i = \frac{e^{s_6}}{e^{s_6} + e^{s_7}},
\]
where $s_6$ and $s_7$ are recursively defined via (we view $x_1, x_2, x_3, x_4, x_5, s_6, s_7$ as functions of $x^i$ and $\beta$, and drop the dependence on $i$ for simplicity)
\[
x_1 = x^i_1, \quad x_2 = x^i_2, \quad x_3 = x^i_1 x^i_2,
\]
\[
x_4 = \max\{\beta_{1,4}x_1 + \beta_{2,4}x_2 + \beta_{3,4}x_3 + \beta_{0,4},\, 0\}, \qquad
x_5 = \max\{\beta_{1,5}x_1 + \beta_{2,5}x_2 + \beta_{3,5}x_3 + \beta_{0,5},\, 0\},
\]
\[
s_6 = \beta_{4,6}x_4 + \beta_{5,6}x_5 + \beta_{0,6}, \qquad s_7 = \beta_{4,7}x_4 + \beta_{5,7}x_5 + \beta_{0,7}.
\]
The MLE for $\beta$ is not unique because the model is not identifiable: if we replace $\beta_{0,6}$ and $\beta_{0,7}$ by $\beta_{0,6} + c$ and $\beta_{0,7} + c$ respectively, for any $c \in \mathbb{R}$, the probability output $p^i$ is unchanged and hence the log-likelihood is unchanged.

(ii) We have $x^1 = (1, 0)$ and $y^1 = 1$. Thus, in the forward pass, we have
\[
x_1 = 1, \quad x_2 = 0, \quad x_3 = 0,
\]
\[
x_4 = \max\{x_1 + x_2 + x_3 + 1,\, 0\} = 2, \qquad x_5 = \max\{x_1 + x_2 + x_3 + 1,\, 0\} = 2,
\]
\[
s_6 = x_4 + x_5 + 1 = 5, \qquad s_7 = x_4 + x_5 + 1 = 5, \qquad p^1 = 1/2.
\]
The contribution of the first observation to the log-likelihood is $\ell_1 = y^1 \log p^1 + (1 - y^1)\log(1 - p^1)$. Thus, the partial derivatives of $\ell_1$ with respect to all weights can be computed via backpropagation as follows.

Output probability:
\[
\frac{\partial \ell_1}{\partial p^1} = \frac{1}{p^1} = 2.
\]
Output layer:
\[
\frac{\partial \ell_1}{\partial s_6} = \frac{\partial \ell_1}{\partial p^1} \cdot \frac{e^{s_6+s_7}}{(e^{s_6} + e^{s_7})^2} = \frac{1}{2}, \qquad
\frac{\partial \ell_1}{\partial s_7} = -\frac{\partial \ell_1}{\partial p^1} \cdot \frac{e^{s_6+s_7}}{(e^{s_6} + e^{s_7})^2} = -\frac{1}{2},
\]
\[
\frac{\partial \ell_1}{\partial \beta_{4,6}} = \frac{\partial \ell_1}{\partial s_6}\, x_4 = 1, \quad \frac{\partial \ell_1}{\partial \beta_{5,6}} = 1, \quad
\frac{\partial \ell_1}{\partial \beta_{4,7}} = \frac{\partial \ell_1}{\partial s_7}\, x_4 = -1, \quad \frac{\partial \ell_1}{\partial \beta_{5,7}} = -1,
\]
\[
\frac{\partial \ell_1}{\partial \beta_{0,6}} = \frac{\partial \ell_1}{\partial s_6} = \frac{1}{2}, \qquad \frac{\partial \ell_1}{\partial \beta_{0,7}} = \frac{\partial \ell_1}{\partial s_7} = -\frac{1}{2}.
\]
Hidden layer:
\[
\frac{\partial \ell_1}{\partial x_4} = \frac{\partial \ell_1}{\partial s_6}\,\beta_{4,6} + \frac{\partial \ell_1}{\partial s_7}\,\beta_{4,7} = 0, \qquad \frac{\partial \ell_1}{\partial s_4} = \frac{\partial \ell_1}{\partial x_4}\, f'(s_4) = 0,
\]
\[
\frac{\partial \ell_1}{\partial x_5} = \frac{\partial \ell_1}{\partial s_6}\,\beta_{5,6} + \frac{\partial \ell_1}{\partial s_7}\,\beta_{5,7} = 0, \qquad \frac{\partial \ell_1}{\partial s_5} = \frac{\partial \ell_1}{\partial x_5}\, f'(s_5) = 0,
\]
where $f$ denotes the ReLU activation, and hence $\frac{\partial \ell_1}{\partial \beta_{u,v}} = 0$ for all $(u, v) \in \{1, 2, 3, 0\} \times \{4, 5\}$.

Thus, after one (gradient ascent) training step, the weights are updated as
\[
\beta_{u,v} = \begin{cases}
1 + 0.1 \times 0 = 1 & \text{if } (u, v) \in \{1, 2, 3, 0\} \times \{4, 5\}, \\
1 + 0.1 \times 1 = 1.1 & \text{if } (u, v) \in \{4, 5\} \times \{6\}, \\
1 - 0.1 \times 1 = 0.9 & \text{if } (u, v) \in \{4, 5\} \times \{7\}, \\
1 + 0.1 \times 0.5 = 1.05 & \text{if } (u, v) = (0, 6), \\
1 - 0.1 \times 0.5 = 0.95 & \text{if } (u, v) = (0, 7).
\end{cases}
\]
(Remark: we see from this calculation that symmetry-breaking is needed for more than just all-zero initial weights.)

(iii) We can write
\[
\ell(\beta) - \frac{\lambda}{2}\|\beta\|_2^2 = \sum_{i=1}^n \Big( \ell_i(\beta) - \frac{\lambda}{2n}\|\beta\|_2^2 \Big).
\]
All we need to do is to additionally subtract $\alpha \frac{\lambda}{n}\beta_{u,v}$ (evaluated at the current weights, so here $0.1\lambda/n$) from each weight update computed in Part (ii). (Remark: this is known as weight decay in the neural networks literature.)
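
The hand calculation in (ii) can be reproduced numerically. The sketch below implements the forward and backward pass for the first observation directly in base R (it does not call keras; all variable names are illustrative) and prints the updated weight matrices, which match the values derived above.

# Numerical check of Solution 4(ii), in plain R.
# Weights all start at 1; learning rate alpha = 0.1; first observation has
# x = (1, 0) (so the interaction x3 = 0) and y = 1 (click = yes).
relu  <- function(z) pmax(z, 0)
alpha <- 0.1
x <- c(1, 0, 0)                              # (x1, x2, x3)
y <- 1

w_h <- matrix(1, 4, 2)                       # weights into nodes 4, 5 (last row = bias)
w_o <- matrix(1, 3, 2)                       # weights into nodes 6, 7 (last row = bias)

# forward pass
h_in  <- c(x, 1) %*% w_h                     # pre-activations s4, s5
h_out <- relu(h_in)                          # x4, x5
s     <- c(h_out, 1) %*% w_o                 # s6, s7
p     <- exp(s[1]) / sum(exp(s))             # P(click = yes), equals 1/2 here

# backward pass (gradient of the log-likelihood contribution l_1)
d_s <- c(y - p, -(y - p))                                 # dl/ds6 = 1/2, dl/ds7 = -1/2
g_o <- outer(c(h_out, 1), d_s)                            # gradients for beta_{4,6}, ..., beta_{0,7}
d_h <- as.vector(w_o[1:2, ] %*% d_s) * as.numeric(h_in > 0)  # dl/ds4, dl/ds5 (both 0 here)
g_h <- outer(c(x, 1), d_h)                                # gradients for input-to-hidden weights

# one gradient-ascent step
w_h + alpha * g_h                            # unchanged (all equal to 1)
w_o + alpha * g_o                            # 1.1 / 0.9 and 1.05 / 0.95, as in the text
# for Part (iii), each update would additionally subtract alpha * (lambda / n) * (current weight)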

5. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{1, 2\}$ be a training dataset. Let $\hat\psi^{\mathrm{bnn}}_{B,m}$ be the bootstrap aggregation nearest neighbour classifier using $B$ $m$-out-of-$n$ bootstrap samples.

(i) Describe how the bootstrap aggregation nearest neighbour classifier is constructed.

(ii) Show that if $m > n\log 2$, then for any $x \in \mathbb{R}^p$, we have
\[
P\big( \hat\psi^{\mathrm{bnn}}_{B,m}(x) \ne \hat\psi^{\mathrm{NN}}(x) \big) \le \exp\Big\{-2B\Big(\frac{1}{2} - e^{-m/n}\Big)^2\Big\}.
\]
[Hint: you may use Hoeffding's inequality, which says in particular that if $Z \sim \mathrm{Bin}(B, p)$, then $P(Z/B < p - t) \le e^{-2Bt^2}$.]

(iii) Suppose we modify the bootstrap scheme in $\hat\psi^{\mathrm{bnn}}_{B,m}$ to resample instead $m$ data points from the training set without replacement in each bootstrap sample. Call this modified bootstrap aggregation classifier $\hat\psi^{\mathrm{bnn,w/o}}_{B,m}$. Show that as $B \to \infty$, $\hat\psi^{\mathrm{bnn,w/o}}_{B,m}$ converges to a weighted nearest neighbour classifier with appropriate weights.

Solution. (i) Omitted.

(ii) We have by definition that $\hat\psi^{\mathrm{NN}}(x) = Y_{(1)}$, the label of the training point nearest to $x$. Let $\hat\psi^{(b)}(\cdot)$ be the nearest neighbour classifier in the $b$th bootstrap sample. Then the indicators $\mathbb{1}\{\hat\psi^{(b)}(x) = Y_{(1)}\}$, $b = 1, \ldots, B$, are iid Bernoulli random variables with success probability $p$ at least the probability that $X_{(1)}$ is selected in an $m$-out-of-$n$ bootstrap sample, i.e.
\[
p \ge 1 - \Big(1 - \frac{1}{n}\Big)^m \ge 1 - e^{-m/n} > \frac{1}{2},
\]
where the last inequality uses $m > n\log 2$. The bagged classifier can disagree with $\hat\psi^{\mathrm{NN}}(x)$ only if at most half of the $B$ bootstrap classifiers predict $Y_{(1)}$, so by Hoeffding's inequality,
\[
P\big(\hat\psi^{\mathrm{bnn}}_{B,m}(x) \ne \hat\psi^{\mathrm{NN}}(x)\big) \le e^{-2B(\frac{1}{2} - e^{-m/n})^2},
\]
as required.

(iii) Under an $m$-out-of-$n$ without-replacement resampling scheme, the probability that $X_{(i)}$, the $i$th nearest training point to $x$, becomes the nearest point to $x$ in a bootstrap sample is
\[
w^{\mathrm{bnn,w/o}}_i := P(X_{(i)} \text{ is nearest to } x) = \begin{cases} \binom{n-i}{m-1}\big/\binom{n}{m} & \text{if } 1 \le i \le n - m + 1, \\ 0 & \text{if } n - m + 2 \le i \le n. \end{cases}
\]
Let $\hat\psi^{\mathrm{bnn,w/o}}_{B,m}$ be the bagged nearest neighbour classifier without replacement. Then, by the law of large numbers,
\[
\lim_{B \to \infty} \hat\psi^{\mathrm{bnn,w/o}}_{B,m}(x)
= \lim_{B \to \infty} \operatorname*{arg\,max}_{l=1,\ldots,L} \frac{1}{B}\sum_{b=1}^B \mathbb{1}\{\hat\psi^{(b)}(x) = l\}
= \operatorname*{arg\,max}_{l=1,\ldots,L} P\{\hat\psi^{(1)}(x) = l\}
= \operatorname*{arg\,max}_{l=1,\ldots,L} \sum_{i=1}^n w^{\mathrm{bnn,w/o}}_i \mathbb{1}\{Y_{(i)} = l\},
\]
establishing the claim that the infinite, without-replacement bagged nearest neighbour classifier is also a weighted nearest neighbour classifier.
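
The bound in (ii) can be examined by simulation. The R sketch below (the training set, test point and the values of $n$, $m$ and $B$ are arbitrary illustrative choices) repeatedly draws $m$-out-of-$n$ bootstrap samples, forms the bagged one-nearest-neighbour vote at a fixed test point, and compares the observed disagreement frequency with $\hat\psi^{\mathrm{NN}}$ against the Hoeffding bound.

# Simulation for Question 5(ii): disagreement between the bagged 1-NN classifier
# and the plain 1-NN classifier at a fixed test point, compared with the bound
# exp(-2 B (1/2 - exp(-m/n))^2). Parameters are illustrative.
set.seed(5)
n <- 100; p <- 2; m <- 80; B <- 50         # note m > n log 2 (about 69.3)
X <- matrix(runif(n * p), n, p)
Y <- sample(1:2, n, replace = TRUE)
x0 <- c(0.5, 0.5)                          # fixed test point

nn_label <- function(Xs, Ys, x) Ys[which.min(colSums((t(Xs) - x)^2))]
psi_nn <- nn_label(X, Y, x0)               # plain 1-NN prediction

bagged_nn <- function() {
  votes <- replicate(B, {
    idx <- sample(n, m, replace = TRUE)    # m-out-of-n bootstrap sample
    nn_label(X[idx, , drop = FALSE], Y[idx], x0)
  })
  as.integer(names(which.max(table(votes))))
}

reps <- 500
disagree <- mean(replicate(reps, bagged_nn()) != psi_nn)
c(observed = disagree, bound = exp(-2 * B * (1/2 - exp(-m / n))^2))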

6. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{1, 2\}$ be a training dataset.

(i) Suppose $\phi : \mathbb{R}^p \to \mathcal{H}$ is a feature map into some Hilbert space $\mathcal{H}$ such that $\langle \phi(x), \phi(x') \rangle_{\mathcal{H}} = k(x, x')$ for some positive definite kernel $k$. Show that nearest neighbour classification in the feature space depends on $\phi$ only through the kernel $k$.

(ii) Show that if $k$ is a Gaussian kernel, then the $k$-nearest neighbour classifier in the feature space gives the same classification outcome as the $k$-nearest neighbour classifier in the original space $\mathbb{R}^p$.

(iii) Nearest neighbours in high dimensions can be very far away. Assume $x_1, \ldots, x_n$ are drawn uniformly from the unit ball $B(0, 1)$ in $\mathbb{R}^p$. Let $R = \min_{i=1,\ldots,n} \|x_i\|_2$ be the nearest neighbour distance at the origin. Show that $\mathrm{median}(R) = (1 - 2^{-1/n})^{1/p}$. Sketch how $\mathrm{median}(R)$ varies as $n$ increases, for $p \in \{1, 10, 100\}$.

Solution. (i) Nearest neighbour classification in the feature space requires sorting $\{\phi(x_i) : i = 1, \ldots, n\}$ according to their distance in $\mathcal{H}$ to the embedded test point $\phi(x)$. This distance computation can be carried out using the kernel $k$ only, via
\[
\|\phi(x) - \phi(x_i)\|_{\mathcal{H}}^2 = k(x, x) + k(x_i, x_i) - 2k(x, x_i).
\]
Hence the classifier depends on the feature map only through $k$.

(ii) If $k(x, x') = e^{-\frac{1}{2\sigma^2}\|x - x'\|_2^2}$, then
\[
\|\phi(x) - \phi(x_i)\|_{\mathcal{H}}^2 = 2 - 2e^{-\frac{1}{2\sigma^2}\|x - x_i\|_2^2}.
\]
Note that the right-hand side is increasing in $\|x - x_i\|_2$. Let $\pi$ be a permutation of $\{1, \ldots, n\}$ such that $\|x - x_{\pi(1)}\|_2 \le \cdots \le \|x - x_{\pi(n)}\|_2$. Then we also have $\|\phi(x) - \phi(x_{\pi(1)})\|_{\mathcal{H}} \le \cdots \le \|\phi(x) - \phi(x_{\pi(n)})\|_{\mathcal{H}}$. Therefore, the Gaussian-kernel nearest neighbour classifier gives identical outcomes to the vanilla $k$-NN classifier.

(iii) We have
\[
P(R \ge r) = P(\{x_1, \ldots, x_n\} \cap B(0, r) = \emptyset) = (1 - r^p)^n.
\]
Thus, the median of $R$ satisfies $(1 - \mathrm{median}(R)^p)^n = 1/2$, which is equivalent to the desired result.
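
The sketch requested in (iii) can be produced directly from the closed-form expression for the median; a minimal R snippet:

# Question 6(iii): median nearest-neighbour distance at the origin,
# median(R) = (1 - 2^(-1/n))^(1/p), as a function of n for p in {1, 10, 100}.
n <- 1:1000
med <- function(n, p) (1 - 2^(-1 / n))^(1 / p)
plot(n, med(n, 1), type = "l", ylim = c(0, 1), log = "x",
     xlab = "n", ylab = "median(R)")
lines(n, med(n, 10), lty = 2)
lines(n, med(n, 100), lty = 3)
legend("topright", legend = c("p = 1", "p = 10", "p = 100"), lty = 1:3)
# For p = 100 the median distance stays close to 1 even at n = 1000,
# illustrating that nearest neighbours in high dimensions are far away.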

7. (Exercise with R) Download the letter dataset from the course website:

train <- read.table('http://www.statslab.cam.ac.uk/~tw389/teaching/slp8/data/letter', header=TRUE)

Each row of the data contains 16 different attributes of a pixel image of one of the 26 capital letters in the English alphabet, together with the letter label itself. More information about this dataset can be found at https://archive.ics.uci.edu/ml/datasets/letter+recognition. Construct a classifier using one of the methods covered in this course. Then test your classifier on the test dataset letter_test from the course website. What is your test error? (Note that all design decisions of your classifier must be based on the training data.)
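
One possible (non-definitive) approach is sketched below in R, using $k$-nearest neighbours from the class package with $k$ chosen by cross-validation on the training data only. The path of the test file and the rule used to locate the label column are assumptions and may need adjusting to the actual files on the course website.

# A possible solution sketch for Question 7: k-NN via the 'class' package,
# with k chosen by 5-fold cross-validation on the training data only.
library(class)

train <- read.table('http://www.statslab.cam.ac.uk/~tw389/teaching/slp8/data/letter',
                    header = TRUE)
# assumed path for the test file; adjust if it is located elsewhere
test  <- read.table('http://www.statslab.cam.ac.uk/~tw389/teaching/slp8/data/letter_test',
                    header = TRUE)

lab_col <- which(!sapply(train, is.numeric))[1]   # assume the non-numeric column is the label
x_tr <- scale(train[, -lab_col])
y_tr <- factor(train[[lab_col]])
x_te <- scale(test[, -lab_col],
              center = attr(x_tr, "scaled:center"),
              scale  = attr(x_tr, "scaled:scale"))
y_te <- factor(test[[lab_col]])

# choose k by 5-fold cross-validation on the training data
set.seed(7)
ks <- c(1, 3, 5, 7, 9)
folds <- sample(rep(1:5, length.out = nrow(x_tr)))
cv_err <- sapply(ks, function(k) mean(sapply(1:5, function(f) {
  pred <- knn(x_tr[folds != f, ], x_tr[folds == f, ], y_tr[folds != f], k = k)
  mean(pred != y_tr[folds == f])
})))
k_best <- ks[which.min(cv_err)]

test_err <- mean(knn(x_tr, x_te, y_tr, k = k_best) != y_te)
test_err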