Midterm exam CS 189/289, Fall 2015

You have 80 minutes for the exam. Total 100 points:
1. True/False: 36 points (18 questions, 2 points each).
2. Multiple-choice questions: 24 points (8 questions, 3 points each).
3. Three descriptive questions worth 10, 15, and 15 points.
The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
For true/false questions, fill in the True/False bubble. For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be more than one). NO PARTIAL CREDIT: all correct answers must be checked.

First name / Last name / SID
First and last name of student to your left
First and last name of student to your right

For staff only:
T/F _____ /36
Multiple choice _____ /24
Problem I _____ /15
Problem II _____ /15
Problem III _____ /10
Total _____ /100

Notation:
X: the training data matrix of dimension (N, d), with N rows representing samples and d columns representing features.
x: an input data vector of dimension (1, d) with components x_i, i = 1:d.
x_k: a training example of dimension (1, d), a row of X, k = 1:N.
w: weight vector of a linear model, of dimension (1, d), such that f(x) = w xᵀ = x wᵀ = Σ_{i=1:d} w_i x_i.
y: target vector of dimension (N, 1) with components y_k.
a: weight vector of dimension (N, 1) of a kernel method, f(x) = Σ_{k=1:N} a_k k(x, x_k).
k(u, v): a kernel function (a similarity measure between two samples u and v).
(A short illustrative sketch of the linear and kernel forms of f(x) follows the True/False section.)

True/False (36 points):
1. Stochastic gradient descent performs less computation per update than batch gradient descent. *
2. A function is convex if its Hessian is negative semidefinite. *
3. If N < d, the solution to X wᵀ = y is unique. **
4. A support vector machine computes P(y | x). **

5. Adding a ridge to XᵀX guarantees that it is invertible. *
6. Grid search is less prone to being trapped in a local minimum than other heuristic search methods. ***
7. The bootstrap method involves sampling without replacement. *
8. A non-linearly-separable training set in a given feature space can always be made linearly separable in another space. **
9. Using the kernel trick, one can get non-linear decision boundaries using algorithms designed originally for linear models. *
10. Logistic regression cannot be kernelized. *
11. Ridge regression, weight decay, and Gaussian processes use the same regularizer: ‖w‖². *
12. Hebb's rule computes the centroid method solution if the target values are +1/N₁ and −1/N₀ (N₁ and N₀ are the numbers of examples of each class). **

13. Any kernel method can be thought of as a parametric method in a possibly infinite-dimensional space. *
14. Nearest neighbors is a parametric method. *
15. A symmetric matrix is positive semidefinite if all its eigenvalues are positive or zero. **
16. Zero correlation between any two random variables implies that the two random variables are independent. ***
17. The Linear Discriminant Analysis (LDA) classifier computes the direction maximizing the ratio of between-class variance to within-class variance. ***
18. If we repeat an experiment twice and get p-values p1 and p2, the minimum of the two p-values is the p-value of the overall experiment. ***
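To make the notation above concrete, here is a minimal NumPy sketch (not part of the exam) of the two forms of f(x): the linear model f(x) = Σ_{i=1:d} w_i x_i and the kernel expansion f(x) = Σ_{k=1:N} a_k k(x, x_k). The RBF kernel and the random data are illustrative assumptions, not taken from the exam.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 5                      # illustrative sizes (assumptions)
X = rng.normal(size=(N, d))       # training matrix, one sample per row
w = rng.normal(size=d)            # weights of a linear model
a = rng.normal(size=N)            # weights of a kernel method

def f_linear(x, w):
    """Linear model: f(x) = sum_i w_i x_i = w . x."""
    return float(np.dot(w, x))

def rbf_kernel(u, v, gamma=1.0):
    """k(u, v) = exp(-gamma ||u - v||^2), one possible similarity measure."""
    return float(np.exp(-gamma * np.sum((u - v) ** 2)))

def f_kernel(x, a, X, kernel=rbf_kernel):
    """Kernel method: f(x) = sum_k a_k k(x, x_k), summed over training samples."""
    return float(sum(a_k * kernel(x, x_k) for a_k, x_k in zip(a, X)))

x = rng.normal(size=d)            # a new input vector
print(f_linear(x, w), f_kernel(x, a, X))
```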

Multiple-choice questions (24 points)
1. You trained a binary classifier model which gives very high accuracy on the training data, but much lower accuracy on validation data. The following may be true: *
o This is an instance of overfitting.
o This is an instance of underfitting.
o The training was not well regularized.
o The training and testing examples are sampled from different distributions.
2. Ockham in the 14th century is credited with stating that one should shave off unnecessary parameters of a model. Which of the following implement that principle? **
o Regularization.
o Maximum likelihood estimation.
o Shrinkage.
o Empirical risk minimization.
o Feature selection.
3. Good practices to avoid overfitting include: **
o Using a two-part cost function which includes a regularizer to penalize model complexity.
o Using a good optimizer to minimize error on training data.
o Building a structure of nested subsets of models and training learning machines in each subset, starting from the innermost subset, and stopping when the cross-validation error starts increasing.
o Discarding 50% of randomly chosen samples.

4. Wrapper methods are hyper-parameter selection methods that: **
o Should be used whenever possible because they are computationally efficient.
o Should be avoided unless there are no other options because they are always prone to overfitting.
o Are useful mainly when the learning machines are black boxes.
o Should be avoided altogether.
5. Three different classifiers are trained on the same data. Their decision boundaries are shown below (figure not reproduced in this transcription). Which of the following statements are true?
o The leftmost classifier has high robustness, poor fit.
o The leftmost classifier has poor robustness, high fit.
o The rightmost classifier has poor robustness, high fit.
o The rightmost classifier has high robustness, poor fit.
6. What are support vectors? ***
o The examples farthest from the decision boundary.
o The only examples necessary to compute f(x) in an SVM.
o The class centroids.
o All the examples that have a non-zero weight a_k in an SVM.

7. Which of the following can only be used when the training data are linearly separable? *
o Linear hard-margin SVM.
o Linear logistic regression.
o Linear soft-margin SVM.
o The centroid method.
o Parzen windows.
8. The number of test examples needed to get statistically significant results should be: ***
o Larger if the error rate is larger.
o Larger if the error rate is smaller.
o It does not matter.

Three descriptive problems

Problem I: Gradient descent (15 points).
Given N training data points {(x_k, y_k)}, k = 1:N, with x_k in R^d and labels y_k in {−1, +1}, we seek a linear discriminant function f(x) = w·x optimizing the loss function L(z) = e^(−z), for z = y f(x).

Question I.1 (3 points). Is L(z) a large-margin loss function? Justify your answer (a graphical justification may be useful).
Answer: Yes. The loss penalizes even examples that are correctly classified, but the penalty keeps decreasing as the example moves farther from the decision boundary on the correct side, so minimizing it keeps pushing for a larger margin.
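As a quick numerical illustration of Question I.1 (not part of the exam), the sketch below evaluates L(z) = e^(−z) at a few margins z = y f(x); the specific z values are arbitrary assumptions. The penalty is nonzero even for well-classified points (z > 0) and keeps shrinking as z grows, unlike the 0/1 loss, which is the large-margin behavior the answer refers to.

```python
import numpy as np

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 4.0])   # margins y*f(x); values chosen arbitrarily
exp_loss = np.exp(-z)                            # L(z) = e^(-z)
zero_one = (z <= 0).astype(float)                # 0/1 loss for comparison

for zi, le, l01 in zip(z, exp_loss, zero_one):
    print(f"z = {zi:+.1f}   e^(-z) = {le:7.4f}   0/1 loss = {l01:.0f}")
# e^(-z) still rewards pushing z beyond 0, whereas the 0/1 loss is flat for z > 0.
```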

Question I.2 (4 points). Derive the stochastic gradient descent update Δw for L(z).
Answer: For a learning rate η > 0, and for z = y f(x) = y Σ_{i=1:d} w_i x_i:
Δw_i = −η ∂L/∂w_i = −η (∂L/∂z)(∂z/∂w_i) = η e^(−z) y x_i,
hence Δw = η e^(−z) y x.

Question I.3 (3 points). We call R_emp(w) = Σ_{k=1:N} L(z_k), where z_k = y_k f(x_k), the empirical risk. Derive the batch gradient update Δw for the empirical risk.
Answer: Δw = η Σ_{k=1:N} e^(−z_k) y_k x_k.

Question I.4 (3 points). Suppose you also want to add a penalty term λ‖w‖² to the risk functional that you wish to minimize. Derive the batch gradient update for the regularized risk R_reg(w) = R_emp(w) + λ‖w‖².
Answer: Δw = η Σ_{k=1:N} e^(−z_k) y_k x_k − 2ηλw, i.e. w ← w(1 − 2ηλ) + η Σ_{k=1:N} e^(−z_k) y_k x_k.

Question I.5 (2 points). How do you estimate λ (answer in at most 3 words)?
Answer: By cross-validation.
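A minimal NumPy sketch of the updates derived in Questions I.2 to I.4 (not part of the exam; the learning rate, regularization strength, and random data are illustrative assumptions):

```python
import numpy as np

def sgd_update(w, x, y, eta):
    """Question I.2: stochastic update on one example, dw = eta * e^(-z) * y * x with z = y w.x."""
    z = y * np.dot(w, x)
    return w + eta * np.exp(-z) * y * x

def batch_update(w, X, y, eta, lam=0.0):
    """Questions I.3/I.4: dw = eta * sum_k e^(-z_k) y_k x_k - 2 eta lam w (lam=0 is unregularized)."""
    z = y * (X @ w)                          # margins z_k = y_k w.x_k
    grad_loss = (np.exp(-z) * y) @ X         # sum_k e^(-z_k) y_k x_k
    return w + eta * grad_loss - 2.0 * eta * lam * w

# Illustrative usage on random data (assumptions, not the exam's data).
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=N))
w = np.zeros(d)
for _ in range(200):
    w = batch_update(w, X, y, eta=0.01, lam=0.1)
print("learned weights:", w)
```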

Problem II. Classification concept review (15 points).

Question II.1. Centroid method. Now consider a 2-class classification problem in a 2-dimensional feature space x = [x1, x2] with target variable y = ±1. The training data comprise 7 samples as shown in Figure 1 (4 black diamonds for the positive class and 3 white diamonds for the negative class).

[Figure 1: Data for Problem II, centroid method question — 7 numbered samples plotted in the (x1, x2) plane.]

Question II.1.A (2 points). Draw on Figure 1 the centroids of the two classes (mark them with a circled (+) for the positive class and a circled (−) for the negative class). Join the centroids with a thick dashed line. Draw the decision boundary of the centroid method with a thick solid line.

Question II.1.B (1 point). What is the training error rate? 1/7.

Question II.1.C (2 points). Is there any sample such that, upon its removal, the decision boundary changes in a manner that the removed sample goes to the other side (answer "yes" or "no")? NO.

Question II.1.D (2 points). What is the leave-one-out error rate? 1/7.
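A minimal sketch of the centroid method from Question II.1 and of leave-one-out evaluation (not part of the exam; the data below are illustrative, not the points of Figure 1). Classifying to the nearest class centroid is equivalent to a linear decision boundary through the midpoint of the two centroids, perpendicular to the line joining them.

```python
import numpy as np

def centroid_predict(X_train, y_train, X_test):
    """Centroid method: assign each test point to the class of the nearest class centroid."""
    mu_pos = X_train[y_train == +1].mean(axis=0)
    mu_neg = X_train[y_train == -1].mean(axis=0)
    # Equivalent linear rule: f(x) = w.x + b with w = mu_pos - mu_neg,
    # b chosen so the boundary passes through the midpoint of the centroids.
    w = mu_pos - mu_neg
    b = -0.5 * (np.dot(mu_pos, mu_pos) - np.dot(mu_neg, mu_neg))
    return np.sign(X_test @ w + b)

def loo_error(X, y, predict):
    """Leave-one-out error rate: train on all samples but one, test on the held-out one."""
    errors = 0
    for k in range(len(y)):
        mask = np.arange(len(y)) != k
        errors += predict(X[mask], y[mask], X[k:k+1])[0] != y[k]
    return errors / len(y)

# Illustrative usage (random 2-D data, not the exam's figure).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[+1, 0], size=(4, 2)), rng.normal(loc=[-1, 0], size=(3, 2))])
y = np.array([+1, +1, +1, +1, -1, -1, -1])
train_err = np.mean(centroid_predict(X, y, X) != y)
print("training error:", train_err, "  LOO error:", loo_error(X, y, centroid_predict))
```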

Question II.2: Support Vector Machine (SVM). Consider again the same training data as in Question II.1, replicated in Figure 2 for your convenience. The maximum-margin classifier (also called linear hard-margin SVM) is a classifier that leaves the largest possible margin on either side of the decision boundary. The samples lying on the margin are called support vectors.

[Figure 2: Data for Problem II, SVM question — the same 7 numbered samples as in Figure 1.]

Question II.2.A (2 points). Draw on Figure 2 the decision boundary obtained by the linear hard-margin SVM with a thick solid line. Draw the margins on either side with thinner dashed lines. Circle the support vectors.

Question II.2.B (1 point). What is the training error rate? Zero.

Question II.2.C (1 point). The removal of which sample will change the decision boundary? Number 5.

Question II.2.D (2 points). What is the leave-one-out error rate? 1/7.

Question II.2.E (1 point). A method is more robust if the difference between its training error and its leave-one-out error is smaller. Which method (centroid or SVM) is more robust? The centroid method.

Question II.2.F (1 point). A method has a better fit if it has fewer training errors. Which method has the better fit? The SVM method.

Problem III. Newton-Raphson for least-square regression (10 points)
In this problem, we will derive an optimization algorithm which we did not study in class, called the Newton-Raphson algorithm. The algorithm makes updates in a manner that often allows reaching the solution faster than regular gradient descent. Suppose we start with an initial value of a (1, d) vector w; let us call this initial value w^(0). We know that the first-order Taylor approximation of ∇_w R(w^(1)) at the point w^(0) is:
∇_w R(w^(1)) ≈ ∇_w R(w^(0)) + (w^(1) − w^(0)) ∇²_w R(w^(0))

Question III.1 (3 points). We want to minimize R(w^(1)) using this approximation of ∇_w R(w^(1)). Find the update equation for the value of w^(1). This is called the Newton-Raphson update. Note: this is not a trick question; you just have to solve for w^(1) after setting ∇_w R(w^(1)) to 0. You can assume that the (d, d) Hessian matrix ∇²_w R(w^(0)) is invertible.

Answer: Let us call H = ∇²_w R(w^(0)) the Hessian matrix. Then
∇_w R(w^(1)) = 0 ⇔ ∇_w R(w^(0)) + (w^(1) − w^(0)) H = 0 ⇔ (w^(1) − w^(0)) H = −∇_w R(w^(0)) ⇔ w^(1) = w^(0) − ∇_w R(w^(0)) H^(−1) (if H is invertible).

Question III.2 (4 points). Consider now the linear regression problem: we are given a training data matrix X of dimension (N, d) and a target vector y of dimension (N, 1), and want to find a weight vector w of dimension (1, d) such that f(x) = x wᵀ approximates y best, in the least-square sense. The risk functional is R(w) = (Xwᵀ − y)ᵀ(Xwᵀ − y). We will assume that we are in the regression case N > d and that the Hessian is invertible. Find the Newton-Raphson update for w^(1).
Answer:
∇_w R = 2 (w XᵀX − yᵀX)
∇²_w R = H = 2 XᵀX
w^(1) = w^(0) − ∇_w R(w^(0)) H^(−1) = w^(0) − (w^(0) XᵀX − yᵀX)(XᵀX)^(−1) = yᵀX (XᵀX)^(−1)

Question III.3 (3 points). Recall the solution to the problem we found in class using the normal equations, or the solution found by solving ∇_w R(w) = 0 directly. Compare with the solution obtained in Question III.2. How many iterations of the Newton-Raphson update do we need to perform for linear regression?
Answer: One iteration. ∇_w R = 2 (w XᵀX − yᵀX) = 0 ⇔ w XᵀX = yᵀX ⇔ w = yᵀX (XᵀX)^(−1), if XᵀX is invertible. The Newton-Raphson update w^(1) = yᵀX (XᵀX)^(−1) is identical, so a single iteration reaches the minimizer regardless of w^(0).
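A minimal NumPy sketch (not part of the exam; the random data and sizes are illustrative assumptions) checking the Problem III result: for the quadratic least-squares risk, a single Newton-Raphson step from any starting point w^(0) lands on the normal-equation solution yᵀX(XᵀX)^(−1).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 4                              # regression case N > d (illustrative sizes)
X = rng.normal(size=(N, d))
y = rng.normal(size=(N, 1))
w0 = rng.normal(size=(1, d))              # arbitrary starting point w^(0)

grad = 2.0 * (w0 @ X.T @ X - y.T @ X)     # grad R(w^(0)) = 2 (w^(0) X'X - y'X), shape (1, d)
H = 2.0 * X.T @ X                         # Hessian, shape (d, d)
w1 = w0 - grad @ np.linalg.inv(H)         # Newton-Raphson step: w^(1) = w^(0) - grad H^(-1)

w_normal = y.T @ X @ np.linalg.inv(X.T @ X)   # normal-equation solution y'X (X'X)^(-1)
print(np.allclose(w1, w_normal))              # True: one step suffices for a quadratic risk
```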