STATISTICAL LEARNING IN PRACTICE
Part III / Lent 2018
Example Sheet 3 (of 4)
By Dr T. Wang

You have the option to submit your answers to Questions 1 and 4 to be marked. If you want your answers to be marked, please leave them in my pigeon hole in the central core of the CMS by 2pm on 7 Mar.

1. Suppose we have observations $x_1, \dots, x_n \in \mathbb{R}^p$ with associated class labels $y_1, \dots, y_n \in \{-1, 1\}$. Assume that the two classes are completely separable.
(i) Write down the optimisation problem solved by the support vector machine (SVM) classifier with hard margin and linear kernel.
(ii) What are support vectors for this SVM? Let $S = \{i : x_i \text{ is a support vector}\}$. Show that the decision boundary of this SVM will not change if we use only $\{(x_i, y_i) : i \in S\}$ as our training data.
(iii) Suppose $S = \{1, 2\}$. Describe the decision boundary in terms of $x_1, x_2$. What if $S = \{1, 2, 3\}$?

Solution. (i) The optimisation problem solved is
$$\min_{\beta \in \mathbb{R}^p,\ \beta_0 \in \mathbb{R}} \tfrac{1}{2}\|\beta\|_2^2 \quad \text{subject to} \quad y_i(x_i^\top \beta + \beta_0) \ge 1, \quad i = 1, \dots, n. \tag{1}$$

(ii) Let $\beta^*$ and $\beta_0^*$ be the optimiser of (1). The set of support vectors is $\{x_i : y_i(x_i^\top \beta^* + \beta_0^*) = 1\}$. Let $\tilde\beta$ and $\tilde\beta_0$ be the optimiser of the SVM problem using $\{(x_i, y_i) : i \in S\}$ as training data. We want to show that $\tilde\beta = \beta^*$ and $\tilde\beta_0 = \beta_0^*$. Assume this is not true; we argue by contradiction.

Note that $(\beta^*, \beta_0^*)$ satisfies the constraints in the SVM problem using $\{(x_i, y_i) : i \in S\}$ as the training data. Thus, by uniqueness of the optimiser (note strong convexity), we must have
$$\tfrac{1}{2}\|\tilde\beta\|_2^2 < \tfrac{1}{2}\|\beta^*\|_2^2.$$
Let $\beta^{(h)} = h\tilde\beta + (1-h)\beta^*$ and $\beta_0^{(h)} = h\tilde\beta_0 + (1-h)\beta_0^*$. Then by convexity of the squared $\ell_2$ norm,
$$\tfrac{1}{2}\|\beta^{(h)}\|_2^2 \le h \cdot \tfrac{1}{2}\|\tilde\beta\|_2^2 + (1-h) \cdot \tfrac{1}{2}\|\beta^*\|_2^2 < \tfrac{1}{2}\|\beta^*\|_2^2 \quad \text{for all } h \in (0, 1).$$
Moreover, for any support vector $x_i$, we have $y_i(x_i^\top \beta^* + \beta_0^*) \ge 1$ and $y_i(x_i^\top \tilde\beta + \tilde\beta_0) \ge 1$, and thus $y_i(x_i^\top \beta^{(h)} + \beta_0^{(h)}) \ge 1$. For any non-support vector $x_i$, we have $y_i(x_i^\top \beta^* + \beta_0^*) > 1$, and thus $y_i(x_i^\top \beta^{(h)} + \beta_0^{(h)}) \ge 1$ for $h$ sufficiently close to 0.

Therefore, for sufficiently small $h$, we have shown that $(\beta^{(h)}, \beta_0^{(h)})$ also satisfies the constraints in (1) but has a smaller objective function value, contradicting the optimality of $\beta^*$ and $\beta_0^*$. Thus, we must have $\tilde\beta = \beta^*$ and $\tilde\beta_0 = \beta_0^*$.
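As a numerical sanity check of part (ii), the sketch below solves the hard-margin problem on a small made-up data set in $\mathbb{R}^2$, first with all points and then with the support vectors only. The data values, the starting point and the helper name `fit_hard_margin` are assumptions chosen for illustration. The solver is base R's `constrOptim`, a barrier method, so the two solutions should agree only up to numerical tolerance.

```r
# Toy separable data in R^2 (hypothetical values chosen for illustration).
X <- rbind(c(0, 0), c(2, 0), c(3, 1), c(-1, 2))
y <- c(-1, 1, 1, -1)

# Solve min 0.5 * ||beta||^2 subject to y_i (x_i' beta + beta0) >= 1.
# theta = (beta_1, beta_2, beta0); 'start' must satisfy every constraint strictly.
fit_hard_margin <- function(X, y, start) {
  ui <- cbind(y * X, y)                 # row i is y_i * (x_i, 1)
  ci <- rep(1, nrow(X))
  obj  <- function(theta) 0.5 * sum(theta[1:2]^2)
  grad <- function(theta) c(theta[1:2], 0)
  constrOptim(start, obj, grad, ui = ui, ci = ci)$par
}

start <- c(2, 0, -2)                    # strictly feasible for the toy data
theta_all <- fit_hard_margin(X, y, start)

# Support vectors: points whose margin y_i (x_i' beta + beta0) is numerically 1.
margins <- drop(y * (X %*% theta_all[1:2] + theta_all[3]))
S <- which(margins < 1 + 0.1)

# Refit using the support vectors only and compare the two solutions.
theta_S <- fit_hard_margin(X[S, , drop = FALSE], y[S], start)
print(round(rbind(all_data = theta_all, support_only = theta_S), 3))
print(S)
```

For this toy data the first two constraints alone force $\beta_0 \le -1$ and $2\beta_1 + \beta_0 \ge 1$, hence $\beta_1 \ge 1$, so the solution can also be confirmed by hand as $\beta = (1, 0)$, $\beta_0 = -1$.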
(iii) Since $(\tilde\beta, \tilde\beta_0) = (\beta^*, \beta_0^*)$, both the decision boundary and the set of support vectors remain the same if we remove all non-support vectors from the training set. If $S = \{1, 2\}$, then $y_1$ and $y_2$ are necessarily distinct, and the decision boundary, which separates $x_1$ and $x_2$ and maximises the distance to the closer of the two points, must be the perpendicular bisector of the segment joining $x_1$ and $x_2$. Similarly, for $S = \{1, 2, 3\}$, we may assume that $y_1 = y_2 \ne y_3$. Then the decision boundary must be parallel to the line connecting $x_1$ and $x_2$. Thus, the decision boundary is the perpendicular bisector of the segment connecting $x_3$ to the projection of $x_3$ onto the line through $x_1$ and $x_2$.

2. (i) Let $\mathcal{X}$ be an arbitrary set. What are positive definite kernels on $\mathcal{X}$?
(ii) Let $K_1$ and $K_2$ be two positive definite kernels on $\mathcal{X}$ and $g : \mathcal{X} \to \mathbb{R}$ any function. Define $K_{\mathrm{sum}}(x, x') = K_1(x, x') + K_2(x, x')$, $K_{\mathrm{prod}}(x, x') = K_1(x, x') K_2(x, x')$ and $K_{\mathrm{conj}}(x, x') = g(x) K_1(x, x') g(x')$. Show that $K_{\mathrm{sum}}$, $K_{\mathrm{prod}}$ and $K_{\mathrm{conj}}$ are also positive definite kernels.
(iii) Let $\mathcal{X} = \mathbb{R}^p$. Using Part (ii) or otherwise, show that $K(x, x') = (c + x^\top x')^d$ and $K(x, x') = \exp\{-\|x - x'\|_2^2 / (2\sigma^2)\}$ are both positive definite kernels for any $c > 0$, $d \in \mathbb{N}$ and $\sigma > 0$.

Solution. (i) A positive definite kernel on $\mathcal{X}$ is a real-valued symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for any $n \in \mathbb{N}$ and $x_1, \dots, x_n \in \mathcal{X}$, the matrix $(k(x_i, x_j))_{i,j=1,\dots,n}$ is positive semidefinite.

(ii) Take any $n \in \mathbb{N}$ and $x_1, \dots, x_n \in \mathcal{X}$. Let $v \in \mathbb{R}^n$ be an arbitrary vector. We have
$$v^\top (K_{\mathrm{sum}}(x_i, x_j))_{i,j}\, v = v^\top (K_1(x_i, x_j))_{i,j}\, v + v^\top (K_2(x_i, x_j))_{i,j}\, v \ge 0.$$
Let $u = (u_1, \dots, u_n)^\top$ be such that $u_j = v_j g(x_j)$. Then
$$v^\top (K_{\mathrm{conj}}(x_i, x_j))_{i,j}\, v = u^\top (K_1(x_i, x_j))_{i,j}\, u \ge 0.$$
Thus, $K_{\mathrm{sum}}$ and $K_{\mathrm{conj}}$ are both positive definite kernels (symmetry is immediate in each case). To show that $K_{\mathrm{prod}}$ is positive definite, it suffices to show that if $A, B \in \mathbb{R}^{n \times n}$ are positive semidefinite, then their pointwise product $A \circ B$ (the Hadamard product) is also positive semidefinite.
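By eigendecomposition, we may write $A = A_1 + \dots + A_n$ and $B = B_1 + \dots + B_n$, where each $A_i$ and $B_j$ is a positive semidefinite matrix of rank at most 1. Since the pointwise product is distributive over addition, it suffices to prove that $A_r \circ B_s$ is positive semidefinite for any $r, s$. If $A_r = aa^\top$ and $B_s = bb^\top$, then $A_r \circ B_s = cc^\top$, where $c = (c_1, \dots, c_n)^\top$ satisfies $c_i = a_i b_i$ for $i = 1, \dots, n$. This proves that $K_{\mathrm{prod}}$ is a positive definite kernel.

(iii) We first check that the linear kernel $(x, x') \mapsto x^\top x'$ is a positive definite kernel. This is true since $v^\top (x_i^\top x_j)_{i,j}\, v = \|Xv\|_2^2$, where $X$ is the matrix with columns $x_1, \dots, x_n$. The constant kernel $(x, x') \mapsto c$ is also positive definite for $c > 0$. By Part (ii), positive definite kernels are closed under addition and multiplication, so $(x, x') \mapsto (c + x^\top x')^d$ is positive definite.

If $K(x, x') = \exp\{-\|x - x'\|_2^2 / (2\sigma^2)\}$, then for $g(x) = e^{-\|x\|_2^2 / (2\sigma^2)}$ we have
$$K(x, x') = g(x) \exp\Big\{\frac{x^\top x'}{\sigma^2}\Big\} g(x').$$
Hence it suffices to show that $(x, x') \mapsto e^{x^\top x' / \sigma^2}$ is positive definite. This is a (convergent) infinite sum of powers of $x^\top x'$ with nonnegative coefficients, and hence, as a limit of positive semidefinite matrices, its kernel matrix must also be positive semidefinite.

3. Let $G$ be a fully-connected feedforward neural network with input layer $S_1$, hidden layers $S_2, \dots, S_{m-1}$, output layer $S_m$ and identity activation. Suppose there are $k_i$ non-bias nodes and 1 bias node in each of $S_1, \dots, S_{m-1}$, and $k_m$ output nodes in $S_m$. Let $\tilde S_i = \{v \in S_i : v \text{ is a non-bias node}\}$.
(i) Let $x_i = (x_v : v \in \tilde S_i)$ be the vector of output values of non-bias nodes in layer $S_i$. Let $B_i = (\beta_{u,v})_{u \in \tilde S_i, v \in \tilde S_{i+1}}$ and $b_i = (\beta_{v_i^{\mathrm{bias}}, v})_{v \in \tilde S_{i+1}}$, where $v_i^{\mathrm{bias}}$, the single element of $S_i \setminus \tilde S_i$, is the bias node in $S_i$. Describe in matrix notation how the final output $x_m$ is computed from the input $x_1$ in a forward pass through the network.
(ii) Using the same notation as in part (i), describe how backpropagation can be carried out in this network.
(iii) Assume $k_1 \ge k_2 \ge \dots \ge k_m$ and the input vector has distribution $x_1 \sim N_{k_1}(0, I_{k_1})$. Show that if we initialise the weights by choosing $B_i$ to be an orthonormal matrix for each $i$, then at the end of the first forward pass after initialisation, we have marginally $x_v \sim N(0, 1)$ for all non-bias nodes $v$.
(iv) Suppose we initialise $\beta_{u,v} \overset{\mathrm{iid}}{\sim} N(0, 1/k_i)$ for all $(u, v) \in \tilde S_i \times \tilde S_{i+1}$ (this is known as the Xavier initialisation). Show that $B_i$ is approximately orthonormal in the sense that $E(B_i^\top B_i) = I_{k_{i+1}}$.

Solution. (i) The forward pass can be described by
$$x_{i+1} = B_i^\top x_i + b_i, \quad i = 1, \dots, m-1,$$
with the input layer initialised by the training data. Unrolling the recursion, the output layer values are
$$x_m = (B_1 B_2 \cdots B_{m-1})^\top x_1 + \sum_{i=1}^{m-1} (B_{i+1} \cdots B_{m-1})^\top b_i,$$
where the empty product (for $i = m-1$) is the identity. When the biases are zero this reduces to $x_m = (B_1 B_2 \cdots B_{m-1})^\top x_1$.

(ii) Let $y$ be the actual response given an input $x_1$, and let $L(x_m, y)$ be the objective function. Then backpropagation starts with the computation of $\partial L / \partial x_m$ and propagates backwards through the recursion
$$\frac{\partial L}{\partial B_i} = \Big(\frac{\partial L}{\partial \beta_{u,v}}\Big)_{(u,v) \in \tilde S_i \times \tilde S_{i+1}} = \Big(x_u \frac{\partial L}{\partial x_v}\Big)_{(u,v) \in \tilde S_i \times \tilde S_{i+1}} = x_i \Big(\frac{\partial L}{\partial x_{i+1}}\Big)^\top,$$
$$\frac{\partial L}{\partial b_i} = \Big(\frac{\partial L}{\partial \beta_{v_i^{\mathrm{bias}}, v}}\Big)_{v \in \tilde S_{i+1}} = \Big(\frac{\partial L}{\partial x_v}\Big)_{v \in \tilde S_{i+1}} = \frac{\partial L}{\partial x_{i+1}},$$
$$\frac{\partial L}{\partial x_i} = B_i \frac{\partial L}{\partial x_{i+1}},$$
for $i = m-1, \dots, 1$.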
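The forward recursion in (i) and the backpropagation recursion in (ii) can be checked numerically. The sketch below uses a three-layer linear network with made-up widths and random weights, and assumes a squared-error objective $L(x_m, y) = \tfrac{1}{2}\|x_m - y\|_2^2$ (the question leaves $L$ general). It compares one backpropagated derivative with a central finite difference and compares the forward pass with the unrolled expression $x_3 = (B_1 B_2)^\top x_1 + B_2^\top b_1 + b_2$. All names in the code are illustrative.

```r
set.seed(1)
k <- c(4, 3, 2)                          # hypothetical widths k_1, k_2, k_3
m <- length(k)
B <- lapply(1:(m - 1), function(i) matrix(rnorm(k[i] * k[i + 1]), k[i], k[i + 1]))
b <- lapply(1:(m - 1), function(i) rnorm(k[i + 1]))
x1 <- rnorm(k[1])
y  <- rnorm(k[m])

# Forward pass: x_{i+1} = B_i' x_i + b_i; returns the list x_1, ..., x_m.
forward <- function(B, b, x1) {
  x <- list(x1)
  for (i in seq_along(B)) {
    x[[i + 1]] <- drop(crossprod(B[[i]], x[[i]])) + b[[i]]
  }
  x
}

# Assumed objective for this check: L(x_m, y) = 0.5 * ||x_m - y||^2.
loss <- function(B, b) 0.5 * sum((forward(B, b, x1)[[m]] - y)^2)

# Backpropagation: start from dL/dx_m and apply the recursion for i = m-1, ..., 1.
x <- forward(B, b, x1)
g <- x[[m]] - y                          # dL/dx_m for the assumed loss
grad_B <- vector("list", m - 1)
grad_b <- vector("list", m - 1)
for (i in (m - 1):1) {
  grad_B[[i]] <- outer(x[[i]], g)        # dL/dB_i = x_i (dL/dx_{i+1})'
  grad_b[[i]] <- g                       # dL/db_i = dL/dx_{i+1}
  g <- drop(B[[i]] %*% g)                # dL/dx_i = B_i dL/dx_{i+1}
}

# Central finite difference for the single weight B_i[u, v].
fd_weight <- function(i, u, v, h = 1e-5) {
  Bp <- B; Bm <- B
  Bp[[i]][u, v] <- Bp[[i]][u, v] + h
  Bm[[i]][u, v] <- Bm[[i]][u, v] - h
  (loss(Bp, b) - loss(Bm, b)) / (2 * h)
}
fd <- fd_weight(1, 2, 3)

# Unrolled forward pass for m = 3, including the bias terms.
closed_form <- drop(crossprod(B[[1]] %*% B[[2]], x1)) +
  drop(crossprod(B[[2]], b[[1]])) + b[[2]]

print(c(backprop = grad_B[[1]][2, 3], finite_diff = fd))
print(all.equal(x[[m]], closed_form))
```

Because the assumed loss is quadratic in each individual weight, the central difference is exact up to rounding, which makes it a convenient check for the matrix form of the recursion.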
(iii) We prove that $x_i \sim N_{k_i}(0, I_{k_i})$ for all $i = 1, \dots, m$ by induction, taking the bias weights to be zero at initialisation. We are given that this is true for $i = 1$. Inductively, if $x_i \sim N_{k_i}(0, I_{k_i})$, then by orthonormality of $B_i$ we have
$$x_{i+1} = B_i^\top x_i \sim N_{k_{i+1}}(0, B_i^\top B_i) = N_{k_{i+1}}(0, I_{k_{i+1}}),$$
as desired. Thus, marginally, $x_v \sim N(0, 1)$ for all non-bias nodes $v$ at initialisation.

(iv) We have
$$\big[E(B_i^\top B_i)\big]_{v,v'} = \sum_{u \in \tilde S_i} E(\beta_{u,v} \beta_{u,v'}) = \begin{cases} 0 & \text{if } v \ne v' \\ 1 & \text{if } v = v'. \end{cases}$$
Thus, $E(B_i^\top B_i) = I_{k_{i+1}}$. (In fact, you can prove much stronger results about how $B_i^\top B_i$ concentrates around the identity matrix.)

4. An advertiser wanted to understand how different web design decisions influence the effectiveness of an online advertisement. He showed the advertisement to all users visiting his website while varying the font typeface (categorical variable font, with two levels sanserif and serif), display style (variable display, with two levels banner and popup) and seriousness of writing (writing, numerical variable taking values between 1 and 10). For each user, he recorded whether the advertisement was clicked.

    head(ad)
    #   click     font display writing
    # 1   yes    serif  banner       3
    # 2    no sanserif  banner       8
    # 3    no sanserif  banner       1
    # 4    no    serif   popup       7
    # 5    no    serif   popup       2
    # 6    no    serif   popup       9
    x <- model.matrix(~font*display, data=ad)[, -1]
    y <- model.matrix(~click-1, data=ad)
    layer_relu <- layer_dense(units = 2, activation = 'relu', input_shape = dim(x)[2])
    layer_softmax <- layer_dense(units = 2, activation = 'softmax')
    ad.nnet <- keras_model_sequential(list(layer_relu, layer_softmax))
    compile(ad.nnet, optimizer='sgd', loss='categorical_crossentropy', metrics='acc')
    fit(ad.nnet, x, y, batch_size=1, epochs=5)

(i) Write down algebraically the neural network model fitted in ad.nnet. Sketch a diagram of the neural network. Let $\beta = (\beta_{uv} : u \to v)$ be the vector of all coefficients in model ad.nnet. Show that the maximum likelihood estimator for $\beta$ is not unique.
(ii) Assume that all weights in the neural network are initialised to be equal to 1 and a constant learning rate $\alpha = 0.1$ is used. Write down explicitly how the weights are updated after one training step in the first epoch. [You may assume that data are not randomly shuffled, so that the stochastic gradient is computed using the first row of the dataset.]
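(iii) Suppose we maximise instead the regularised log-likelihood $l(\beta) - \frac{\lambda}{2}\|\beta\|_2^2$ for some $\lambda > 0$. How would your stochastic gradient step in Part (ii) change?

Solution. (i) Architecture of this neural network: there are 4 input layer nodes (labelled 1, 2, 3, 0), 3 hidden layer nodes (labelled 4, 5, 0) and 2 output layer nodes (labelled 6, 7), where 0 denotes the bias node of a layer. Sketch omitted.

Let $(x^{(i)}, y^{(i)})$ be the $i$th observation, where the entries of $x^{(i)} = (x^{(i)}_1, x^{(i)}_2)$ are coded by $x^{(i)}_1 = 0$ if font for the $i$th visitor is sanserif and 1 otherwise, and $x^{(i)}_2 = 0$ if display is banner and 1 otherwise. Then algebraically, the neural network model fitted is
$$y^{(i)} \sim \mathrm{Bern}(p^{(i)}), \qquad p^{(i)} = \frac{e^{s_6}}{e^{s_6} + e^{s_7}},$$
where $s_6$ and $s_7$ are recursively defined via (we view $x_1, x_2, x_3, x_4, x_5, s_6, s_7$ as functions of $x^{(i)}$ and $\beta$, and drop the dependence on $i$ for simplicity)
$$x_1 = x^{(i)}_1, \quad x_2 = x^{(i)}_2, \quad x_3 = x^{(i)}_1 x^{(i)}_2,$$
$$x_4 = \max\{\beta_{1,4} x_1 + \beta_{2,4} x_2 + \beta_{3,4} x_3 + \beta_{0,4},\ 0\},$$
$$x_5 = \max\{\beta_{1,5} x_1 + \beta_{2,5} x_2 + \beta_{3,5} x_3 + \beta_{0,5},\ 0\},$$
$$s_6 = \beta_{4,6} x_4 + \beta_{5,6} x_5 + \beta_{0,6},$$
$$s_7 = \beta_{4,7} x_4 + \beta_{5,7} x_5 + \beta_{0,7}.$$
The MLE for $\beta$ is not unique because the model is not identifiable. Note that if we replace $\beta_{0,6}$ and $\beta_{0,7}$ by $\beta_{0,6} + c$ and $\beta_{0,7} + c$ respectively for any $c \in \mathbb{R}$, the probability output $p^{(i)}$ is unchanged and hence the log-likelihood is unchanged.

(ii) We have $x^{(1)} = (1, 0)$ and $y^{(1)} = 1$. Thus, in the forward pass, we have
$$x_1 = 1, \quad x_2 = 0, \quad x_3 = 0,$$
$$x_4 = \max\{x_1 + x_2 + x_3 + 1,\ 0\} = 2, \quad x_5 = \max\{x_1 + x_2 + x_3 + 1,\ 0\} = 2,$$
$$s_6 = x_4 + x_5 + 1 = 5, \quad s_7 = x_4 + x_5 + 1 = 5, \quad p^{(1)} = 1/2.$$
The contribution of the first observation to the log-likelihood is $l_1 = y^{(1)} \log p^{(1)} + (1 - y^{(1)}) \log(1 - p^{(1)})$. Thus, the partial derivatives of $l_1$ with respect to all weights can be computed via backpropagation as follows. Write $p = p^{(1)}$, let $s_4, s_5$ be the inputs to the hidden nodes 4 and 5, and let $f(s) = \max\{s, 0\}$.

Output probability:
$$\frac{\partial l_1}{\partial p} = \frac{1}{p} = 2.$$
Output layer:
$$\frac{\partial l_1}{\partial s_6} = \frac{\partial l_1}{\partial p} \cdot \frac{e^{s_6 + s_7}}{(e^{s_6} + e^{s_7})^2} = \frac{1}{2}, \qquad \frac{\partial l_1}{\partial s_7} = -\frac{\partial l_1}{\partial p} \cdot \frac{e^{s_6 + s_7}}{(e^{s_6} + e^{s_7})^2} = -\frac{1}{2},$$
$$\frac{\partial l_1}{\partial \beta_{4,6}} = -\frac{\partial l_1}{\partial \beta_{4,7}} = \frac{\partial l_1}{\partial \beta_{5,6}} = -\frac{\partial l_1}{\partial \beta_{5,7}} = \frac{1}{2} \cdot 2 = 1, \qquad \frac{\partial l_1}{\partial \beta_{0,6}} = -\frac{\partial l_1}{\partial \beta_{0,7}} = \frac{1}{2}.$$
Hidden layer:
$$\frac{\partial l_1}{\partial x_4} = \frac{\partial l_1}{\partial s_6} \beta_{4,6} + \frac{\partial l_1}{\partial s_7} \beta_{4,7} = 0, \qquad \frac{\partial l_1}{\partial s_4} = \frac{\partial l_1}{\partial x_4} f'(s_4) = 0,$$
$$\frac{\partial l_1}{\partial x_5} = \frac{\partial l_1}{\partial s_6} \beta_{5,6} + \frac{\partial l_1}{\partial s_7} \beta_{5,7} = 0, \qquad \frac{\partial l_1}{\partial s_5} = \frac{\partial l_1}{\partial x_5} f'(s_5) = 0,$$
$$\frac{\partial l_1}{\partial \beta_{u,v}} = 0 \quad \text{for } (u, v) \in \{1, 2, 3, 0\} \times \{4, 5\}.$$
Thus, after one training step $\beta_{u,v} \leftarrow \beta_{u,v} + \alpha\, \partial l_1 / \partial \beta_{u,v}$, the weights are updated as
$$\beta_{u,v} = \begin{cases} 1 & \text{if } (u, v) \in \{1, 2, 3, 0\} \times \{4, 5\} \\ 1 + 0.1 \times 1 = 1.1 & \text{if } (u, v) \in \{4, 5\} \times \{6\} \\ 1 - 0.1 \times 1 = 0.9 & \text{if } (u, v) \in \{4, 5\} \times \{7\} \\ 1 + 0.1 \times 0.5 = 1.05 & \text{if } (u, v) = (0, 6) \\ 1 - 0.1 \times 0.5 = 0.95 & \text{if } (u, v) = (0, 7). \end{cases}$$
(Remark: we see from this calculation that symmetry-breaking is needed for more than just all-zero initial weights.)

(iii) We can write
$$l(\beta) - \frac{\lambda}{2}\|\beta\|_2^2 = \sum_{i=1}^n \Big( l_i(\beta) - \frac{\lambda}{2n}\|\beta\|_2^2 \Big).$$
All we need to do is to subtract $\frac{\lambda}{n}\beta_{u,v}$ from the $\partial l_1 / \partial \beta_{u,v}$ computed in Part (ii), so that the update becomes $\beta_{u,v} \leftarrow \beta_{u,v} + \alpha \big( \partial l_1 / \partial \beta_{u,v} - \frac{\lambda}{n}\beta_{u,v} \big)$. (Remark: this is known as weight decay in the neural networks literature.)

5. Let $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^p \times \{1, 2\}$ be a training dataset. Let $\psi^{\mathrm{bnn}}_{B,m}$ be the bootstrap aggregation nearest neighbour classifier using $B$ $m$-out-of-$n$ bootstrap samples.
(i) Describe how the bootstrap aggregation nearest neighbour classifier is constructed.
(ii) Show that if $m > n \log 2$, then for any $x \in \mathbb{R}^p$, we have
$$P\big(\psi^{\mathrm{bnn}}_{B,m}(x) \ne \psi^{\mathrm{NN}}(x)\big) \le \exp\Big\{-\frac{1}{2} B (1 - 2e^{-m/n})^2\Big\}.$$
[Hint: you may use Hoeffding's inequality, which says in particular that if $Z \sim \mathrm{Bin}(B, p)$, then $P(Z/B < p - t) \le e^{-2Bt^2}$.]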
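(iii) Suppose we modify the bootstrap scheme in $\psi^{\mathrm{bnn}}_{B,m}$ to resample instead $m$ data points from the training set without replacement in each bootstrap sample. Call this modified bootstrap aggregation classifier $\psi^{\mathrm{bnn,w/o}}_{B,m}$. Show that as $B \to \infty$, $\psi^{\mathrm{bnn,w/o}}_{B,m}$ converges to a weighted nearest neighbour classifier with appropriate weights.

Solution. (i) Omitted.

(ii) Let $X_{(1)}, \dots, X_{(n)}$ be the training points ordered by increasing distance from $x$, with labels $Y_{(1)}, \dots, Y_{(n)}$. We have by definition that $\psi^{\mathrm{NN}}(x) = Y_{(1)}$. Let $\psi^{(b)}(\cdot)$ be the NN classifier in the $b$th bootstrap sample. Then $\mathbb{1}\{\psi^{(b)}(x) = Y_{(1)}\}$ for $b = 1, \dots, B$ are iid Bernoulli random variables with probability $p$ at least the probability that $X_{(1)}$ is selected in an $m$-out-of-$n$ bootstrap sample, i.e.
$$p \ge 1 - \Big(1 - \frac{1}{n}\Big)^m > 1 - e^{-m/n} > 1/2.$$
Then by Hoeffding's inequality,
$$P\big(\psi^{\mathrm{bnn}}_{B,m}(x) \ne \psi^{\mathrm{NN}}(x)\big) \le e^{-2B(\frac{1}{2} - e^{-m/n})^2},$$
as required.

(iii) Under an $m$-out-of-$n$ without replacement resampling scheme, the probability that $X_{(i)}$ becomes the nearest point to $x$ in a bootstrap sample is
$$w^{\mathrm{bnn,w/o}}_i := P(X_{(i)} \text{ is nearest to } x) = \begin{cases} \binom{n-i}{m-1} \big/ \binom{n}{m} & \text{if } i \le n - m + 1 \\ 0 & \text{if } n - m + 2 \le i \le n. \end{cases}$$
Let $\psi^{\mathrm{bnn,w/o}}_{B,m}$ be the bagged nearest neighbour classifier without replacement. Then, writing $L$ for the number of classes (here $L = 2$),
$$\lim_{B \to \infty} \psi^{\mathrm{bnn,w/o}}_{B,m}(x) = \lim_{B \to \infty} \operatorname*{arg\,max}_{l = 1, \dots, L} \frac{1}{B} \sum_{b=1}^B \mathbb{1}\{\psi^{(b)}(x) = l\} = \operatorname*{arg\,max}_{l = 1, \dots, L} \frac{1}{B} \sum_{b=1}^B P\{\psi^{(b)}(x) = l\} = \operatorname*{arg\,max}_{l = 1, \dots, L} \sum_{i=1}^n w^{\mathrm{bnn,w/o}}_i \mathbb{1}\{Y_{(i)} = l\},$$
where the second equality uses the law of large numbers (the bootstrap samples are iid given the training data). This establishes the claim that the infinite without-replacement bagged nearest neighbour classifier is also a weighted nearest neighbour classifier.

6. Let $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^p \times \{1, 2\}$ be a training dataset.
(i) Suppose $\varphi : \mathbb{R}^p \to \mathcal{H}$ is a feature map into some Hilbert space $\mathcal{H}$ such that $\langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}} = k(x, x')$ for some positive definite kernel $k$. Show that nearest neighbour classification in the feature space depends on $\varphi$ only through the kernel $k$.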
(ii) Show that if $k$ is a Gaussian kernel, then the k-nearest neighbour classifier in the feature space gives the same classification outcome as the k-nearest neighbour classifier in the original space $\mathbb{R}^p$.
(iii) Nearest neighbours in high dimensions can be very far away. Assume $x_1, \dots, x_n$ are drawn uniformly from the unit ball $B(0, 1)$ in $\mathbb{R}^p$. Let $R = \min_{i=1,\dots,n} \|x_i\|_2$ be the nearest neighbour distance at the origin. Show that $\mathrm{median}(R) = (1 - 2^{-1/n})^{1/p}$. Sketch how $\mathrm{median}(R)$ varies as $n$ increases for $p \in \{1, 10, 100\}$.

Solution. (i) Nearest neighbour classification in the feature space requires sorting $\{\varphi(x_i) : i = 1, \dots, n\}$ according to their $\ell_2$ distance to a test point $\varphi(x)$ in $\mathcal{H}$. The distance computation can be carried out using the kernel $k$ only, via
$$\|\varphi(x) - \varphi(x_i)\|_2^2 = k(x, x) + k(x_i, x_i) - 2k(x, x_i).$$
Hence the classifier depends on the feature space only through $k$.

(ii) If $k(x, x') = e^{-\|x - x'\|_2^2 / (2\sigma^2)}$, then
$$\|\varphi(x) - \varphi(x_i)\|_2^2 = 2 - 2e^{-\|x - x_i\|_2^2 / (2\sigma^2)}.$$
Note that the right-hand side is increasing in $\|x - x_i\|_2$. Let $\pi$ be a permutation of $\{1, \dots, n\}$ such that $\|x - x_{\pi(1)}\|_2 \le \dots \le \|x - x_{\pi(n)}\|_2$. Then we also have $\|\varphi(x) - \varphi(x_{\pi(1)})\|_2 \le \dots \le \|\varphi(x) - \varphi(x_{\pi(n)})\|_2$. Therefore, the Gaussian kernel nearest neighbour classifier gives the same outcome as the vanilla kNN classifier.

(iii) We have
$$P(R \ge r) = P(\{x_1, \dots, x_n\} \cap B(0, r) = \emptyset) = (1 - r^p)^n.$$
Thus, the median of $R$ satisfies
$$\big(1 - \mathrm{median}(R)^p\big)^n = 1/2,$$
which is equivalent to the desired result.

7. (Exercise with R) Download the letter dataset from the course website:

    train <- read.table(' header=TRUE)

Each row of the data contains 16 different attributes of a pixel image of one of the 26 capital letters in the English alphabet, together with the letter label itself. More information about this dataset can be found at

Construct a classifier using one of the methods covered in this course. Then test your classifier on the test dataset letter_test from the course website. What is your test error? (Note that all design decisions of your classifier must be based on the training data.)
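Returning to Question 6(iii), the formula $\mathrm{median}(R) = (1 - 2^{-1/n})^{1/p}$ can be tabulated to see how slowly the nearest neighbour distance shrinks with $n$ when $p$ is large. The sketch below is illustrative only: the grid of $n$ values, the helper names `median_R` and `sim_median`, and the number of Monte Carlo replications are assumptions. The simulation uses the fact, which follows from $P(\|x_i\|_2 \le r) = r^p$, that $\|x_i\|_2$ has the same distribution as $U^{1/p}$ with $U$ uniform on $(0, 1)$.

```r
# Closed-form median of the nearest neighbour distance at the origin.
median_R <- function(n, p) (1 - 2^(-1 / n))^(1 / p)

# Tabulate the formula on a hypothetical grid of sample sizes.
n_grid <- c(10, 100, 1000, 10000)
p_grid <- c(1, 10, 100)
tab <- sapply(p_grid, function(p) median_R(n_grid, p))
dimnames(tab) <- list(paste0("n=", n_grid), paste0("p=", p_grid))
print(round(tab, 3))

# Monte Carlo check: R = min_i ||x_i||_2 has the law of (min_i U_i)^(1/p).
set.seed(1)
sim_median <- function(n, p, reps = 20000) {
  median(replicate(reps, min(runif(n))^(1 / p)))
}
print(c(formula = median_R(100, 10), simulated = sim_median(100, 10)))
```

Each column of the table decreases in $n$ and each row increases in $p$, which is the qualitative shape the sketch in part (iii) asks for.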
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationMachine Learning Basics
Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationCSC411: Final Review. James Lucas & David Madras. December 3, 2018
CSC411: Final Review James Lucas & David Madras December 3, 2018 Agenda 1. A brief overview 2. Some sample questions Basic ML Terminology The final exam will be on the entire course; however, it will be
More informationThe Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationMachine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)
Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:
More informationNeural Networks. Nicholas Ruozzi University of Texas at Dallas
Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify
More informationSupport Vector Machines
Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationA summary of Deep Learning without Poor Local Minima
A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given
More informationFinal Exam, Machine Learning, Spring 2009
Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationFeed-forward Network Functions
Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationMulti-layer Neural Networks
Multi-layer Neural Networks Steve Renals Informatics 2B Learning and Data Lecture 13 8 March 2011 Informatics 2B: Learning and Data Lecture 13 Multi-layer Neural Networks 1 Overview Multi-layer neural
More information10-701/ Machine Learning, Fall
0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training
More informationDeep Feedforward Networks. Seung-Hoon Na Chonbuk National University
Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections
More informationMultilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)
Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate
More informationHomework 4. Convex Optimization /36-725
Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationAlgorithm-Independent Learning Issues
Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More information10-701/15-781, Machine Learning: Homework 4
10-701/15-781, Machine Learning: Homewor 4 Aarti Singh Carnegie Mellon University ˆ The assignment is due at 10:30 am beginning of class on Mon, Nov 15, 2010. ˆ Separate you answers into five parts, one
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationIntroduction to Machine Learning Spring 2018 Note 18
CS 189 Introduction to Machine Learning Spring 2018 Note 18 1 Gaussian Discriminant Analysis Recall the idea of generative models: we classify an arbitrary datapoint x with the class label that maximizes
More informationA Tutorial on Support Vector Machine
A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More informationLecture 6. Regression
Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More information