STATISTICAL LEARNING IN PRACTICE
Part III / Lent 2018, Example Sheet 3 (of 4), by Dr T. Wang

You have the option to submit your answers to Questions 1 and 4 to be marked. If you want your answers to be marked, please leave them in my pigeon hole in the central core of the CMS by 2pm on 7 Mar.

1. Suppose we have observations x_1, ..., x_n ∈ R^p with associated class labels y_1, ..., y_n ∈ {−1, 1}. Assume that the two classes are completely separable.

(i) Write down the optimisation problem solved by the support vector machine (SVM) classifier with hard margin and linear kernel.

(ii) What are support vectors for this SVM? Let S = {i : x_i is a support vector}. Show that the decision boundary of this SVM will not change if we use only {(x_i, y_i) : i ∈ S} as our training data.

(iii) Suppose S = {1, 2}. Describe the decision boundary in terms of x_1, x_2. What if S = {1, 2, 3}?

Solution. (i) The optimisation problem solved is

    min_{β ∈ R^p, β_0 ∈ R} (1/2)‖β‖_2^2   subject to   y_i(x_i^⊤ β + β_0) ≥ 1,  i = 1, ..., n.    (1)

(ii) Let β* and β*_0 be the optimiser of the above optimisation problem. The set of support vectors is {x_i : y_i(x_i^⊤ β* + β*_0) = 1}. Let β' and β'_0 be the optimiser of the SVM problem using only {(x_i, y_i) : i ∈ S} as training data; we want to show that β' = β* and β'_0 = β*_0. Assume this is not true, and we derive a contradiction. Note that (β*, β*_0) satisfies the constraints of the SVM problem using {(x_i, y_i) : i ∈ S} as the training data. Thus, by uniqueness of the optimiser (note strong convexity), we must have

    (1/2)‖β'‖_2^2 < (1/2)‖β*‖_2^2.

Let β^(h) = hβ' + (1 − h)β* and β_0^(h) = hβ'_0 + (1 − h)β*_0. Then by convexity of the squared ℓ_2 norm, we have (1/2)‖β^(h)‖_2^2 < (1/2)‖β*‖_2^2 for all h ∈ (0, 1]. Moreover, for any support vector x_i, we have y_i(x_i^⊤ β* + β*_0) ≥ 1 and y_i(x_i^⊤ β' + β'_0) ≥ 1, and thus y_i(x_i^⊤ β^(h) + β_0^(h)) ≥ 1. For any non-support vector x_i, we have y_i(x_i^⊤ β* + β*_0) > 1, and thus y_i(x_i^⊤ β^(h) + β_0^(h)) ≥ 1 for h sufficiently close to 0. Therefore, for sufficiently small h, we have shown that (β^(h), β_0^(h)) also satisfies the constraints in (1) but has a smaller objective value, contradicting the optimality of β* and β*_0. Thus, we must have β' = β* and β'_0 = β*_0.

(iii) Since (β', β'_0) = (β*, β*_0), both the decision boundary and the set of support vectors remain the same if we remove all non-support vectors from the training set. If S = {1, 2}, then y_1 and y_2 are necessarily distinct, and the decision boundary, which separates x_1 and x_2 and maximises the distance to the closer of the two points, must be the perpendicular bisector of the segment joining x_1 and x_2. Similarly, for S = {1, 2, 3}, we may assume that y_1 = y_2 ≠ y_3. Then the decision boundary must be parallel to the line through x_1 and x_2. Thus, the decision boundary is the perpendicular bisector of the segment connecting x_3 to the orthogonal projection of x_3 onto the line through x_1 and x_2.
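The claim in (ii) can be checked numerically. Below is a minimal R sketch, assuming the e1071 package is available; the hard margin is approximated by a very large cost, the SVM is refitted on the support vectors alone, and the two fitted (β, β_0) should essentially coincide.

# Numerical check of Question 1(ii); assumes the e1071 package.
library(e1071)
set.seed(1)
n <- 40
x <- matrix(rnorm(2 * n), n, 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, 1, -1))
x <- x + outer(as.numeric(as.character(y)), c(1, 1))  # push the two classes apart

fit_all <- svm(x, y, type = "C-classification", kernel = "linear",
               cost = 1e5, scale = FALSE)             # large cost ~ hard margin
S <- fit_all$index                                    # indices of the support vectors
fit_sv <- svm(x[S, , drop = FALSE], y[S], type = "C-classification",
              kernel = "linear", cost = 1e5, scale = FALSE)

# In e1071's parametrisation, beta = sum_i alpha_i y_i x_i and beta_0 = -rho.
beta_all <- drop(t(fit_all$coefs) %*% fit_all$SV); b0_all <- -fit_all$rho
beta_sv  <- drop(t(fit_sv$coefs)  %*% fit_sv$SV);  b0_sv  <- -fit_sv$rho
rbind(full = c(beta_all, b0_all), sv_only = c(beta_sv, b0_sv))  # rows nearly coincide

The shift by ±(1, 1) guarantees a positive margin, so the hard-margin problem above is feasible.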

2. (i) Let X be an arbitrary set. What are positive definite kernels on X?

(ii) Let K_1 and K_2 be two positive definite kernels on X and let g : X → R be any function. Define K_sum(x, x') = K_1(x, x') + K_2(x, x'), K_prod(x, x') = K_1(x, x')K_2(x, x') and K_conj(x, x') = g(x)K_1(x, x')g(x'). Show that K_sum, K_prod and K_conj are also positive definite kernels.

(iii) Let X = R^p. Using Part (ii) or otherwise, show that K(x, x') = (c + x^⊤x')^d and K(x, x') = exp{−‖x − x'‖_2^2/(2σ^2)} are both positive definite kernels for any c > 0, d ∈ N and σ > 0.

Solution. (i) A positive definite kernel on X is a real-valued symmetric function k : X × X → R satisfying that for any n ∈ N and any x_1, ..., x_n ∈ X, the matrix (k(x_i, x_j))_{i,j=1,...,n} is positive semidefinite.

(ii) Take any n ∈ N and x_1, ..., x_n ∈ X, and let v ∈ R^n be an arbitrary vector. We have

    v^⊤ (K_sum(x_i, x_j))_{i,j} v = v^⊤ (K_1(x_i, x_j))_{i,j} v + v^⊤ (K_2(x_i, x_j))_{i,j} v ≥ 0.

Let u = (u_1, ..., u_n)^⊤ be such that u_j = v_j g(x_j). Then

    v^⊤ (K_conj(x_i, x_j))_{i,j} v = u^⊤ (K_1(x_i, x_j))_{i,j} u ≥ 0.

Thus K_sum and K_conj are both positive definite kernels. To show that K_prod is positive definite, it suffices to show that if A, B ∈ R^{n×n} are positive semidefinite, then their pointwise (Hadamard) product A ∘ B is also positive semidefinite. By the eigendecomposition, we may write A = A_1 + ⋯ + A_n and B = B_1 + ⋯ + B_n, where each A_r and B_s is a rank-one positive semidefinite matrix. Since the pointwise product satisfies the distributive law, it suffices to prove that A_r ∘ B_s is positive semidefinite for any r, s. If A_r = aa^⊤ and B_s = bb^⊤, then A_r ∘ B_s = cc^⊤, where c = (c_1, ..., c_n)^⊤ satisfies c_i = a_i b_i for i = 1, ..., n. This proves that K_prod is a positive definite kernel.

(iii) We first check that the linear kernel (x, x') ↦ x^⊤x' is a positive definite kernel. This is true since v^⊤ (x_i^⊤ x_j)_{i,j} v = ‖Xv‖_2^2 ≥ 0, where X is the matrix with columns x_1, ..., x_n. The constant kernel (x, x') ↦ c is also positive definite, since v^⊤ (c)_{i,j} v = c (Σ_i v_i)^2 ≥ 0. By Part (ii), positive definite kernels are closed under addition and multiplication, so (x, x') ↦ (c + x^⊤x')^d is positive definite. If K(x, x') = exp{−‖x − x'‖_2^2/(2σ^2)}, then for g(x) = e^{−‖x‖_2^2/(2σ^2)} we have

    K(x, x') = g(x) exp{x^⊤x'/σ^2} g(x').

Hence, by the result on K_conj, it suffices to show that (x, x') ↦ e^{x^⊤x'/σ^2} is positive definite. This kernel is a (convergent) infinite sum of nonnegative multiples of powers of x^⊤x', so each of its Gram matrices is a limit of positive semidefinite matrices and is therefore positive semidefinite.
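As a quick sanity check, one can verify numerically that the Gram matrices of these kernels have no negative eigenvalues. A minimal base-R sketch, with arbitrary choices c = 1, d = 3 and σ = 1:

# Check that polynomial and Gaussian Gram matrices are positive semidefinite.
set.seed(2)
n <- 30; p <- 5
X <- matrix(rnorm(n * p), n, p)
c0 <- 1; d <- 3; sigma <- 1

K_poly  <- (c0 + X %*% t(X))^d        # (c + x^T x')^d, applied elementwise
D2      <- as.matrix(dist(X))^2       # squared Euclidean distances
K_gauss <- exp(-D2 / (2 * sigma^2))

min(eigen(K_poly,  symmetric = TRUE, only.values = TRUE)$values)   # >= 0 up to rounding
min(eigen(K_gauss, symmetric = TRUE, only.values = TRUE)$values)   # >= 0 up to rounding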

3. Let G be a fully-connected feedforward neural network with input layer S_1, hidden layers S_2, ..., S_{m−1}, output layer S_m and identity activation. Suppose there are k_i non-bias nodes and 1 bias node in S_i, for i = 1, ..., m−1, and k_m output nodes in S_m. Let S̄_i = {v ∈ S_i : v is a non-bias node}.

(i) Let x_i = (x_v : v ∈ S̄_i) be the vector of output values of the non-bias nodes in layer S_i. Let B_i = (β_{u,v})_{u ∈ S̄_i, v ∈ S̄_{i+1}} and b_i = (β_{v_i^bias, v})_{v ∈ S̄_{i+1}}, where v_i^bias = S_i \ S̄_i is the bias node in S_i. Describe in matrix notation how the final output x_m is computed from the input x_1 in a forward pass through the network.

(ii) Using the same notation as in part (i), describe how backpropagation can be carried out in this network.

(iii) Assume k_1 ≥ k_2 ≥ ⋯ ≥ k_m and that the input vector has distribution x_1 ~ N_{k_1}(0, I_{k_1}). Show that if we initialise the weights by choosing B_i to be an orthonormal matrix for each i, then at the end of the first forward pass after initialisation, we have marginally x_v ~ N(0, 1) for all non-bias nodes v.

(iv) Suppose we initialise β_{u,v} ~ N(0, 1/k_i), independently for all (u, v) ∈ S̄_i × S̄_{i+1} (this is known as the Xavier initialisation). Show that B_i is approximately orthonormal in the sense that E(B_i^⊤ B_i) = I_{k_{i+1}}.

Solution. (i) The forward pass can be described by

    x_{i+1} = B_i^⊤ x_i + b_i,   i = 1, ..., m − 1,

with the input layer x_1 initialised by the training data. Unrolling the recursion, the output layer values are

    x_m = (B_1 B_2 ⋯ B_{m−1})^⊤ x_1 + Σ_{j=1}^{m−1} (B_{j+1} ⋯ B_{m−1})^⊤ b_j,

where the empty product (for j = m − 1) is interpreted as the identity.

(ii) Let y be the actual response given an input x_1, and let L(x_m, y) be the objective function. Backpropagation starts with the computation of ∂L/∂x_m and proceeds backwards through the recursion, for i = m − 1, ..., 1:

    ∂L/∂B_i = (∂L/∂β_{u,v})_{(u,v) ∈ S̄_i × S̄_{i+1}} = (x_u · ∂L/∂x_v)_{(u,v) ∈ S̄_i × S̄_{i+1}} = x_i (∂L/∂x_{i+1})^⊤,
    ∂L/∂b_i = (∂L/∂β_{v_i^bias, v})_{v ∈ S̄_{i+1}} = (∂L/∂x_v)_{v ∈ S̄_{i+1}} = ∂L/∂x_{i+1},
    ∂L/∂x_i = B_i ∂L/∂x_{i+1}.

(iii) We prove that x_i ~ N_{k_i}(0, I_{k_i}) for all i = 1, ..., m by induction. We are given that this is true for i = 1. Inductively, if x_i ~ N_{k_i}(0, I_{k_i}), then, with the biases initialised to zero and by orthonormality of B_i, we have x_{i+1} = B_i^⊤ x_i ~ N_{k_{i+1}}(0, B_i^⊤ B_i) = N_{k_{i+1}}(0, I_{k_{i+1}}), as desired. Thus, marginally, x_v ~ N(0, 1) for all non-bias nodes v at initialisation.

(iv) We have

    [E(B_i^⊤ B_i)]_{v,v'} = Σ_{u ∈ S̄_i} E(β_{u,v} β_{u,v'}),

which equals 0 if v ≠ v' (independence and zero means) and equals k_i · (1/k_i) = 1 if v = v'. Thus, E(B_i^⊤ B_i) = I_{k_{i+1}}. (In fact, you can prove much stronger results about how B_i^⊤ B_i concentrates around the identity matrix.)
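Parts (iii) and (iv) are easy to illustrate by simulation. A minimal base-R sketch, with arbitrary layer widths: the pushed-forward coordinates stay approximately standard normal under an orthonormal B, and a Xavier-initialised B satisfies B^⊤B ≈ I on average.

# Illustration of Question 3(iii) and (iv); base R only.
set.seed(3)
k1 <- 50; k2 <- 20; nsim <- 5000

# (iii): first k2 columns of a random orthogonal matrix, so B^T B = I_{k2}.
B_orth <- qr.Q(qr(matrix(rnorm(k1 * k1), k1, k1)))[, 1:k2]
x1 <- matrix(rnorm(nsim * k1), nsim, k1)   # rows are draws from N(0, I_{k1})
x2 <- x1 %*% B_orth                        # each row is B^T x1 for that draw
c(mean(x2[, 1]), var(x2[, 1]))             # approximately 0 and 1

# (iv): Xavier initialisation beta_{u,v} ~ N(0, 1/k1); average B^T B over draws.
BtB <- matrix(0, k2, k2)
for (r in 1:2000) {
  B <- matrix(rnorm(k1 * k2, sd = sqrt(1 / k1)), k1, k2)
  BtB <- BtB + crossprod(B) / 2000
}
range(BtB - diag(k2))                      # all entries close to 0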

4. An advertiser wanted to understand how different web design decisions influence the effectiveness of an online advertisement. He showed the advertisement to all users visiting his website while varying the font typeface (categorical variable font, with two levels sanserif and serif), the display style (variable display, with two levels banner and popup) and the seriousness of the writing (writing, a numerical variable taking values between 1 and 10). For each user, he recorded whether the advertisement was clicked.

head(ad)
#   click     font display writing
# 1   yes    serif  banner       3
# 2    no sanserif  banner       8
# 3    no sanserif  banner       1
# 4    no    serif   popup       7
# 5    no    serif   popup       2
# 6    no    serif   popup       9
x <- model.matrix(~font*display, data=ad)[, -1]
y <- model.matrix(~click-1, data=ad)
layer_relu <- layer_dense(units = 2, activation = 'relu', input_shape = dim(x)[2])
layer_softmax <- layer_dense(units = 2, activation = 'softmax')
ad.nnet <- keras_model_sequential(list(layer_relu, layer_softmax))
compile(ad.nnet, optimizer='sgd', loss='categorical_crossentropy', metrics='acc')
fit(ad.nnet, x, y, batch_size=1, epochs=5)

(i) Write down algebraically the neural network model fitted in ad.nnet. Sketch a diagram of the neural network. Let β = (β_{uv} : u → v) be the vector of all coefficients in the model ad.nnet. Show that the maximum likelihood estimator for β is not unique.

(ii) Assume that all weights in the neural network are initialised to be equal to 1 and that a constant learning rate α = 0.1 is used. Write down explicitly how the weights are updated after one training step in the first epoch. [You may assume that the data are not randomly shuffled, so that the stochastic gradient is computed using the first row of the dataset.]

(iii) Suppose we maximise instead the regularised log-likelihood l(β) − (λ/2)‖β‖_2^2 for some λ > 0. How would your stochastic gradient step in Part (ii) change?
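Before turning to the solution, it may help to inspect what the two model.matrix calls produce. The base-R sketch below rebuilds the click, font and display columns of the six printed rows (writing is not needed for either design matrix): x has three columns, namely the font dummy, the display dummy and their interaction, while y has one indicator column per click level.

# Rebuild the printed rows and inspect the design matrices.
ad <- data.frame(
  click   = c("yes", "no", "no", "no", "no", "no"),
  font    = c("serif", "sanserif", "sanserif", "serif", "serif", "serif"),
  display = c("banner", "banner", "banner", "popup", "popup", "popup")
)
x <- model.matrix(~ font * display, data = ad)[, -1]
colnames(x)   # "fontserif" "displaypopup" "fontserif:displaypopup"
y <- model.matrix(~ click - 1, data = ad)
colnames(y)   # "clickno" "clickyes"

This matches the coding used in the solution below: the first row gives input (x_1, x_2, x_3) = (1, 0, 0).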

Solution. (i) Architecture of this neural network: there are 4 input layer nodes (labelled 1, 2, 3 and a bias node 0), 3 hidden layer nodes (labelled 4, 5 and a bias node 0) and 2 output layer nodes (labelled 6, 7). Sketch omitted.

Let (x^i, y^i) be the ith observation, where the entries of x^i = (x^i_1, x^i_2) are coded by x^i_1 = 0 if the font for the ith visitor is sanserif and 1 otherwise, and x^i_2 = 0 if the display is banner and 1 otherwise. Then, algebraically, the neural network model fitted is

    y^i ~ Bern(p^i),   p^i = e^{s_6} / (e^{s_6} + e^{s_7}),

where s_6 and s_7 are recursively defined via (we view x_1, x_2, x_3, x_4, x_5, s_6, s_7 as functions of x^i and β, and drop the dependence on i for simplicity)

    x_1 = x^i_1,  x_2 = x^i_2,  x_3 = x^i_1 x^i_2,
    x_4 = max{β_{1,4} x_1 + β_{2,4} x_2 + β_{3,4} x_3 + β_{0,4}, 0},
    x_5 = max{β_{1,5} x_1 + β_{2,5} x_2 + β_{3,5} x_3 + β_{0,5}, 0},
    s_6 = β_{4,6} x_4 + β_{5,6} x_5 + β_{0,6},
    s_7 = β_{4,7} x_4 + β_{5,7} x_5 + β_{0,7}.

The MLE for β is not unique because the model is not identifiable. Note that if we replace β_{0,6} and β_{0,7} by β_{0,6} + c and β_{0,7} + c respectively, for any c ∈ R, the probability output p^i is unchanged and hence the log-likelihood is unchanged.

(ii) We have x^1 = (1, 0) and y^1 = 1. Thus, in the forward pass, we have

    x_1 = 1,  x_2 = 0,  x_3 = 0,
    x_4 = max{x_1 + x_2 + x_3 + 1, 0} = 2,  x_5 = max{x_1 + x_2 + x_3 + 1, 0} = 2,
    s_6 = x_4 + x_5 + 1 = 5,  s_7 = x_4 + x_5 + 1 = 5,  p^1 = 1/2.

The contribution of the first observation to the log-likelihood is l_1 = y^1 log p^1 + (1 − y^1) log(1 − p^1). Thus, the partial derivatives of l_1 with respect to all the weights can be computed via backpropagation as follows.

Output probability:

    ∂l_1/∂p^1 = 1/p^1 = 2.

Output layer:

    ∂l_1/∂s_6 = (∂l_1/∂p^1)(∂p^1/∂s_6) = (1/p^1) · e^{s_6 + s_7}/(e^{s_6} + e^{s_7})^2 = 1/2,
    ∂l_1/∂s_7 = (∂l_1/∂p^1)(∂p^1/∂s_7) = −(1/p^1) · e^{s_6 + s_7}/(e^{s_6} + e^{s_7})^2 = −1/2,
    ∂l_1/∂β_{4,6} = ∂l_1/∂β_{5,6} = (∂l_1/∂s_6) · x_4 = (1/2) · 2 = 1,
    ∂l_1/∂β_{4,7} = ∂l_1/∂β_{5,7} = (∂l_1/∂s_7) · x_4 = −(1/2) · 2 = −1,
    ∂l_1/∂β_{0,6} = ∂l_1/∂s_6 = 1/2,  ∂l_1/∂β_{0,7} = ∂l_1/∂s_7 = −1/2.

Hidden layer (writing s_4, s_5 for the pre-activation values of nodes 4 and 5, and f(s) = max{s, 0}):

    ∂l_1/∂x_4 = (∂l_1/∂s_6) β_{4,6} + (∂l_1/∂s_7) β_{4,7} = 1/2 − 1/2 = 0,
    ∂l_1/∂s_4 = (∂l_1/∂x_4) f'(s_4) = 0,  and hence  ∂l_1/∂β_{u,v} = 0 for all (u, v) ∈ {1, 2, 3, 0} × {4, 5}.

Similarly, ∂l_1/∂x_5 = (∂l_1/∂s_6) β_{5,6} + (∂l_1/∂s_7) β_{5,7} = 0 and ∂l_1/∂s_5 = (∂l_1/∂x_5) f'(s_5) = 0.

Thus, after one training step (β_{u,v} ← β_{u,v} + α ∂l_1/∂β_{u,v}, with α = 0.1), the weights are updated as

    β_{u,v} = 1                        if (u, v) ∈ {1, 2, 3, 0} × {4, 5},
    β_{u,v} = 1 + 0.1 × 1 = 1.1        if (u, v) ∈ {4, 5} × {6},
    β_{u,v} = 1 − 0.1 × 1 = 0.9        if (u, v) ∈ {4, 5} × {7},
    β_{0,6} = 1 + 0.1 × 1/2 = 1.05,
    β_{0,7} = 1 − 0.1 × 1/2 = 0.95.

(Remark: we see from this calculation that symmetry breaking is needed for more than just all-zero initial weights.)

(iii) We can write

    l(β) − (λ/2)‖β‖_2^2 = Σ_{i=1}^n ( l_i(β) − (λ/(2n))‖β‖_2^2 ).

All we need to do is to subtract (λ/n) β_{u,v} from the partial derivative ∂l_1/∂β_{u,v} computed in Part (ii). (Remark: this is known as weight decay in the neural networks literature.)
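The hand computation in Part (ii) can be replicated in a few lines. A minimal base-R sketch: it evaluates the gradient of the first observation's log-likelihood at the all-ones weights and applies one ascent step with α = 0.1.

# One manual SGD step for the first observation (x = (1, 0, 0), y = 1),
# all weights initialised to 1, learning rate alpha = 0.1; base R only.
relu <- function(s) pmax(s, 0)
x <- c(1, 0, 0); y <- 1; alpha <- 0.1

x4 <- relu(sum(x) + 1); x5 <- relu(sum(x) + 1)   # hidden outputs: 2 and 2
s6 <- x4 + x5 + 1;      s7 <- x4 + x5 + 1        # output scores: 5 and 5
p  <- exp(s6) / (exp(s6) + exp(s7))              # P(click = yes) = 1/2

dl_ds6 <- (y / p) * p * (1 - p)                  #  1/2
dl_ds7 <- -(y / p) * p * (1 - p)                 # -1/2
grad <- c(b46 = dl_ds6 * x4, b56 = dl_ds6 * x5, b06 = dl_ds6,
          b47 = dl_ds7 * x4, b57 = dl_ds7 * x5, b07 = dl_ds7)
1 + alpha * grad                                 # 1.10 1.10 1.05 0.90 0.90 0.95
# The gradient flowing into nodes 4 and 5 is 1/2 - 1/2 = 0, so all weights
# feeding the hidden layer remain equal to 1, matching the update above.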

5. Let (x_1, y_1), ..., (x_n, y_n) ∈ R^p × {1, 2} be a training dataset. Let ψ^{bnn}_{B,m} be the bootstrap aggregation nearest neighbour classifier using B m-out-of-n bootstrap samples.

(i) Describe how the bootstrap aggregation nearest neighbour classifier is constructed.

(ii) Show that if m > n log 2, then for any x ∈ R^p, we have

    P( ψ^{bnn}_{B,m}(x) ≠ ψ^{NN}(x) ) ≤ exp{ −2B (1/2 − e^{−m/n})^2 }.

[Hint: you may use Hoeffding's inequality, which says in particular that if Z ~ Bin(B, p), then P(Z/B < p − t) ≤ e^{−2Bt^2}.]

(iii) Suppose we modify the bootstrap scheme in ψ^{bnn}_{B,m} to resample instead m data points from the training set without replacement in each bootstrap sample. Call this modified bootstrap aggregation classifier ψ^{bnn,w/o}_{B,m}. Show that as B → ∞, ψ^{bnn,w/o}_{B,m} converges to a weighted nearest neighbour classifier with appropriate weights.

Solution. (i) Omitted.

(ii) We have by definition that ψ^{NN}(x) = Y_{(1)}, the label of the nearest neighbour of x. Let ψ^{(b)}(·) be the nearest neighbour classifier in the bth bootstrap sample. Then 1{ψ^{(b)}(x) = Y_{(1)}}, for b = 1, ..., B, are iid Bernoulli random variables with success probability p at least the probability that X_{(1)} is selected in an m-out-of-n bootstrap sample, i.e.

    p ≥ 1 − (1 − 1/n)^m ≥ 1 − e^{−m/n} > 1/2.

Then by Hoeffding's inequality (applied with t = p − 1/2 ≥ 1/2 − e^{−m/n}),

    P( ψ^{bnn}_{B,m}(x) ≠ ψ^{NN}(x) ) ≤ e^{−2B(1/2 − e^{−m/n})^2},

as required.

(iii) Under an m-out-of-n without-replacement resampling scheme, the probability that X_{(i)}, the ith nearest point to x, becomes the nearest point to x in a bootstrap sample is

    w^{bnn,w/o}_i := P(X_{(i)} is nearest to x in the subsample)
                   = (n−i choose m−1) / (n choose m)   if 1 ≤ i ≤ n − m + 1,
                   = 0                                  if n − m + 2 ≤ i ≤ n,

since the subsample must contain X_{(i)} and none of X_{(1)}, ..., X_{(i−1)}. Then

    lim_{B→∞} ψ^{bnn,w/o}_{B,m}(x) = lim_{B→∞} argmax_{l=1,...,L} (1/B) Σ_{b=1}^B 1{ψ^{(b)}(x) = l}
                                   = argmax_{l=1,...,L} P{ψ^{(b)}(x) = l}
                                   = argmax_{l=1,...,L} Σ_{i=1}^n w^{bnn,w/o}_i 1{Y_{(i)} = l},

using the law of large numbers in the second equality, establishing the claim that the infinite without-replacement bagged nearest neighbour classifier is a weighted nearest neighbour classifier.
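The weights in Part (iii) are easy to check by simulation: across many without-replacement subsamples, the frequency with which the ith ranked point ends up nearest to the test point should match w_i. A minimal base-R sketch, with arbitrary n and m:

# Empirical check of the without-replacement weights w_i.
set.seed(5)
n <- 20; m <- 5; B <- 20000
nearest_rank <- replicate(B, min(sample(n, m)))   # rank of nearest retained point
w_hat <- tabulate(nearest_rank, nbins = n) / B
w     <- c(choose(n - (1:(n - m + 1)), m - 1) / choose(n, m), rep(0, m - 1))
round(cbind(w_hat, w), 3)[1:8, ]                  # the two columns agree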

6. Let (x_1, y_1), ..., (x_n, y_n) ∈ R^p × {1, 2} be a training dataset.

(i) Suppose φ : R^p → H is a feature map into some Hilbert space H such that ⟨φ(x), φ(x')⟩_H = k(x, x') for some positive definite kernel k. Show that nearest neighbour classification in the feature space depends on φ only through the kernel k.

(ii) Show that if k is a Gaussian kernel, then the k-nearest neighbour classifier in the feature space gives the same classification outcome as the k-nearest neighbour classifier in the original space R^p.

(iii) Nearest neighbours in high dimensions can be very far away. Assume x_1, ..., x_n are drawn uniformly from the unit ball B(0, 1) in R^p. Let R = min_{i=1,...,n} ‖x_i‖_2 be the nearest neighbour distance at the origin. Show that median(R) = (1 − 2^{−1/n})^{1/p}. Sketch how median(R) varies as n increases for p ∈ {1, 10, 100}.

Solution. (i) Nearest neighbour classification in the feature space requires sorting {φ(x_i) : i = 1, ..., n} according to their distance in H to (the embedding of) a test point x. This distance computation can be carried out using the kernel k only, via

    ‖φ(x) − φ(x_i)‖_H^2 = k(x, x) + k(x_i, x_i) − 2k(x, x_i).

Hence the classifier depends on the feature map only through k.

(ii) If k(x, x') = e^{−‖x − x'‖_2^2/(2σ^2)}, then

    ‖φ(x) − φ(x_i)‖_H^2 = 2 − 2e^{−‖x − x_i‖_2^2/(2σ^2)}.

Note that the right-hand side is increasing in ‖x − x_i‖_2. Let π be a permutation of {1, ..., n} such that ‖x − x_{π(1)}‖_2 ≤ ⋯ ≤ ‖x − x_{π(n)}‖_2. Then we also have ‖φ(x) − φ(x_{π(1)})‖_H ≤ ⋯ ≤ ‖φ(x) − φ(x_{π(n)})‖_H. Therefore, the Gaussian kernel nearest neighbour classifier gives identical outcomes to the vanilla kNN.

(iii) We have

    P(R ≥ r) = P( {x_1, ..., x_n} ∩ B(0, r) = ∅ ) = (1 − r^p)^n.

Thus, the median of R satisfies

    (1 − median(R)^p)^n = 1/2,

which is equivalent to the desired result.

7. (Exercise with R) Download the letter dataset from the course website:

train <- read.table('...', header=TRUE)

Each row of the data contains 16 different attributes of a pixel image of one of the 26 capital letters in the English alphabet, together with the letter label itself. More information about this dataset can be found at ... Construct a classifier using one of the methods covered in this course. Then test your classifier on the test dataset letter_test from the course website. What is your test error? (Note that all design decisions of your classifier must be based on the training data.)
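The sketch requested in Question 6(iii) can be produced with a short simulation: for x uniform on the unit ball in R^p, ‖x‖_2 has distribution function r^p, so radii can be simulated as U^{1/p} with U uniform on (0, 1). The base-R sketch below compares the empirical median of R with the formula (1 − 2^{−1/n})^{1/p}; plotting median_R_formula against n for each p gives the requested picture.

# Compare simulated median(R) with (1 - 2^(-1/n))^(1/p); base R only.
set.seed(6)
median_R_sim <- function(n, p, nrep = 2000) {
  median(replicate(nrep, min(runif(n)^(1 / p))))
}
median_R_formula <- function(n, p) (1 - 2^(-1 / n))^(1 / p)

n_grid <- c(10, 100, 1000)
for (p in c(1, 10, 100)) {
  cat("p =", p, ": simulated", round(sapply(n_grid, median_R_sim, p = p), 3),
      "| formula", round(median_R_formula(n_grid, p), 3), "\n")
}
# Even with n = 1000 sample points, median(R) remains close to 1 when p = 100.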
