CS 189 Spring 2015 Introduction to Machine Learning Midterm

You have 80 minutes for the exam.
The exam is closed book, closed notes except your one-page crib sheet.
No calculators or electronic items.
Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
For true/false questions, fill in the True/False bubble.
For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be more than one). We have introduced a negative penalty for false positives on the multiple-choice questions, such that the expected value of random guessing is 0. Don't worry: your score for this section will be the maximum of your raw score and 0, so you cannot incur a negative score.

First name:
Last name:
SID:
First and last name of student to your left:
First and last name of student to your right:

For staff use only:
Q1. True or False                                     /26
Q2. Multiple Choice                                   /36
Q3. Parameter Estimation                              /10
Q4. Dual Solution for Ridge Regression                 /8
Q5. Regularization and Priors for Linear Regression    /8
Total                                                 /88

Q1. [26 pts] True or False

(a) [2 pts] If the data is not linearly separable, then there is no solution to the hard-margin SVM.

(b) [2 pts] Logistic Regression can be used for classification.

(c) [2 pts] In logistic regression, two ways to prevent β vectors from getting too large are using a small step size and using a small regularization value.

(d) [2 pts] The L2 norm is often used because it produces sparse results, as opposed to the L1 norm which does not.

(e) [2 pts] For a Multivariate Gaussian, the eigenvalues of the covariance matrix are inversely proportional to the lengths of the ellipsoid axes that determine the isocontours of the density.

(f) [2 pts] In a generative binary classification model where we assume the class conditionals are distributed as Poisson, and the class priors are Bernoulli, the posterior assumes a logistic form.

(g) [2 pts] Maximum likelihood estimation gives us not only a point estimate, but a distribution over the parameters that we are estimating.

(h) [2 pts] Penalized maximum likelihood estimators and Bayesian estimators for parameters are better used in the setting of low-dimensional data with many training examples, as opposed to the setting of high-dimensional data with few training examples.

(i) [2 pts] It is not a good machine learning practice to use the test set to help adjust the hyperparameters of your learning algorithm.

(j) [2 pts] A symmetric positive semi-definite matrix always has nonnegative elements.

(k) [2 pts] For a valid kernel function K, the corresponding feature mapping φ can map a finite-dimensional vector into an infinite-dimensional vector.

(l) [2 pts] The more features that we use to represent our data, the better the learning algorithm will generalize to new data points.

(m) [2 pts] A discriminative classifier explicitly models P(Y | X).
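A quick numerical probe can help when reasoning about statements like (j): positive semi-definiteness depends only on the eigenvalues of the matrix being nonnegative, not on the signs of its individual entries. The sketch below is illustrative only and not part of the exam; the 2×2 matrix is made up.

```python
# Not part of the exam: probe a symmetric matrix's definiteness numerically.
# np.linalg.eigvalsh returns the eigenvalues of a symmetric matrix; the matrix
# is positive semi-definite exactly when all of them are >= 0.
import numpy as np

A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])    # symmetric, with negative off-diagonal entries

print(np.linalg.eigvalsh(A))    # [1. 3.] -> all nonnegative, so A is PSD
```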

Q2. [36 pts] Multiple Choice

(a) [3 pts] Which of the following algorithms can you use kernels with?
- Support Vector Machines
- Perceptrons

(b) [3 pts] Cross validation:
- Is often used to select hyperparameters
- Is guaranteed to prevent overfitting
- Does nothing to prevent overfitting

(c) [3 pts] In linear regression, L2 regularization is equivalent to imposing a:
- Logistic prior
- Gaussian prior
- Laplace prior
- Gaussian class-conditional

(d) [3 pts] Say we have two 2-dimensional Gaussian distributions representing two different classes. Which of the following conditions will result in a linear decision boundary?
- Same mean for both classes
- Same covariance matrix for both classes
- Different covariance matrix for each class
- Linearly separable data

(e) [3 pts] The normal equations can be derived from:
- Minimizing empirical risk
- Assuming that Y = β^T x + ε, where ε ~ N(0, σ²)
- Assuming that P(Y | X = x) is distributed normally with mean β^T x and variance σ²
- Finding a linear combination of the rows of the design matrix that minimizes the distance to our vector of labels Y

(f) [3 pts] Logistic regression can be motivated from:
- Generative models with uniform class conditionals
- Generative models with Gaussian class conditionals
- Log odds being equated to an affine function of x

(g) [3 pts] The perceptron algorithm will converge:
- If the data is linearly separable
- Even if the data is linearly inseparable
- As long as you initialize θ to all 0's
- Always

(h) [3 pts] Which of the following is true?
- Newton's Method is typically more expensive to calculate than gradient descent, per iteration
- For quadratic equations, Newton's Method typically requires fewer iterations than gradient descent
- Gradient descent can be viewed as iteratively reweighted least squares

(i) [3 pts] Which of the following statements about duality and SVMs is (are) true?
- Complementary slackness implies that every training point that is misclassified by a soft-margin SVM is a support vector.
- When we solve the SVM with the dual problem, we need only the dot products x_i · x_j for all i, j, and no other information about the x_i.
- We use Lagrange multipliers in an optimization problem with inequality (≤) constraints.

(j) [3 pts] Which of the following distance metrics can be computed exclusively with inner products, assuming Φ(x) and Φ(y) are feature mappings of x and y, respectively?
- ‖Φ(x) − Φ(y)‖
- ‖Φ(x) − Φ(y)‖_1
- ‖Φ(x) − Φ(y)‖_2^2

(k) [3 pts] Strong duality holds for:
- Hard Margin SVM
- Soft Margin SVM
- Constrained optimization problems in general

(l) [3 pts] Which of the following facts about the C in SVMs is (are) true?
- As C approaches 0, the soft margin SVM is equal to the hard margin SVM
- C can be negative, as long as each of the slack variables is nonnegative
- A larger C tends to create a larger margin
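For Q2(j), the relevant identity is that squared distances in feature space reduce to kernel evaluations: ‖Φ(x) − Φ(y)‖_2^2 = K(x, x) − 2K(x, y) + K(y, y), where K(a, b) = Φ(a)·Φ(b). The sketch below is not part of the exam; it checks this numerically with a degree-2 polynomial kernel, chosen only because its feature map Φ can be written out explicitly.

```python
# Not part of the exam: verify that the squared feature-space distance can be
# computed purely from kernel (inner-product) evaluations.
import numpy as np

def phi(v):
    # Explicit feature map: degree-2 monomials of a 2-d input (arbitrary choice).
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def k(a, b):
    # Kernel corresponding to phi: K(a, b) = (a . b)^2.
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

direct = np.sum((phi(x) - phi(y)) ** 2)           # ||phi(x) - phi(y)||_2^2
via_kernel = k(x, x) - 2 * k(x, y) + k(y, y)      # same quantity, kernels only
print(np.isclose(direct, via_kernel))             # True
```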

Q3. [10 pts] Parameter Estimation

In this problem, we have n trials with k possible types of outcomes {1, 2, ..., k}. Suppose we observe X_1, ..., X_k, where each X_i is the number of outcomes of type i. If p_i refers to the probability that a trial has outcome i, then (X_1, ..., X_k) is said to have a multinomial distribution with parameters p_1, ..., p_k, denoted (X_1, ..., X_k) ~ Multinomial(p_1, ..., p_k). It may be useful to know that the probability mass function of the multinomial distribution is

    P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! x_2! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}.

We want to find the maximum likelihood estimators for p_1, ..., p_k. You may assume that p_i > 0 for all i.

(a) [4 pts] What is the log-likelihood function, ℓ(p_1, ..., p_k | X_1, ..., X_k)?

(b) [6 pts] You might notice that unconstrained maximization of this function leads to an answer in which we set each p_i = ∞. But this is wrong: we must add a constraint so that the probabilities sum to 1. Now we have the following optimization problem:

    \max_{p_1, \ldots, p_k} \; \ell(p_1, \ldots, p_k \mid X_1, \ldots, X_k) \quad \text{s.t.} \quad \sum_{i=1}^{k} p_i = 1

Recall that we can use the method of Lagrange multipliers to solve an optimization problem with equality constraints. Using this method, find the maximum likelihood estimators for p_1, ..., p_k.
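Though not part of the exam, a quick numerical sanity check can accompany part (b): the sketch below maximizes the constrained log-likelihood with SciPy for some made-up counts, so the numerical optimum can be compared against whatever closed form the Lagrange-multiplier derivation gives.

```python
# Not part of the exam: numerically maximize the multinomial log-likelihood
# subject to sum(p) = 1, then compare against your analytic MLE.
import numpy as np
from scipy.optimize import minimize

counts = np.array([5.0, 3.0, 2.0])            # observed X_1, ..., X_k (hypothetical)

def neg_log_likelihood(p):
    # The constant n! / (x_1! ... x_k!) is dropped: it does not affect the argmax.
    return -np.sum(counts * np.log(p))

constraint = {"type": "eq", "fun": lambda p: p.sum() - 1.0}
bounds = [(1e-9, 1.0)] * len(counts)          # enforce p_i > 0
p0 = np.full(len(counts), 1.0 / len(counts))  # start from the uniform distribution

# With an equality constraint present, minimize uses the SLSQP solver.
result = minimize(neg_log_likelihood, p0, bounds=bounds, constraints=[constraint])
print(result.x)   # compare with your analytic MLE evaluated at these counts
```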

Q4. [8 pts] Dual Solution for Ridge Regression

Recall that ridge regression minimizes the objective function

    L(w) = \|Xw - y\|_2^2 + \lambda \|w\|_2^2,

where X is an n-by-d design matrix, w is a d-dimensional vector, and y is an n-dimensional vector. We already know that L(w) is minimized by w^* = (X^T X + \lambda I)^{-1} X^T y. Alternatively, the minimizer can be represented as a linear combination of the rows of the design matrix. That is, there exists an n-dimensional vector α such that L(w) is minimized by w = X^T α. The vector α is called the dual solution to the linear regression problem.

(a) [2 pts] Using the relation w = X^T α, define the objective function L in terms of α.

(b) [3 pts] Show that α = (XX^T + \lambda I)^{-1} y is a dual solution.

(c) [3 pts] For the solution in part (b) to be well defined, the matrix XX^T + \lambda I has to be invertible. Assuming λ > 0, show that XX^T + \lambda I is invertible. (Hint: positive definite matrices are invertible.)
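The claim in part (b) can also be checked numerically. The sketch below is not from the exam: it builds a random design matrix and verifies that the primal minimizer (X^T X + λI)^{-1} X^T y coincides with the dual form X^T α for α = (XX^T + λI)^{-1} y.

```python
# Not part of the exam: numerical check that the primal and dual ridge
# solutions agree on random data (sizes and lambda are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 8, 3, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # primal solution
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)            # candidate dual solution
w_dual = X.T @ alpha                                             # w expressed via the rows of X

print(np.allclose(w_primal, w_dual))   # True (up to floating-point error)
```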

Q5. [8 pts] Regularization and Priors for Linear Regression

Linear regression is a model of the form P(y | x) ~ N(w^T x, σ²), where w is a d-dimensional vector. Recall that in ridge regression, we add an ℓ2 regularization term to our least squares objective function to prevent overfitting, so that our loss function becomes

    J(w) = \sum_{i=1}^{n} (Y_i - w^T X_i)^2 + \lambda w^T w.    (*)

We can arrive at the same objective function in a Bayesian setting if we consider a MAP (maximum a posteriori probability) estimate, where w has the prior distribution N(0, f(λ, σ) I).

(a) [3 pts] What is the conditional density of w given the data?

(b) [5 pts] What f(λ, σ) makes this MAP estimate the same as the solution to (*)?
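To test a candidate answer to part (b), the hedged sketch below compares the closed-form minimizer of (*) with the closed-form MAP estimate under the prior N(0, τ²I), where τ² is your candidate f(λ, σ). The MAP closed form (X^T X + (σ²/τ²)I)^{-1} X^T y used here is standard background for a Gaussian likelihood with a Gaussian prior, supplied as an assumption rather than derived in the exam; the data and constants are arbitrary.

```python
# Not part of the exam: helper for testing a candidate f(lambda, sigma) from part (b).
import numpy as np

def ridge_matches_map(tau2, lam=2.0, sigma=0.7, n=20, d=4, seed=1):
    """True iff the ridge minimizer of (*) equals the MAP estimate with prior
    variance tau2, on one randomly generated dataset."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    # Minimizer of (*): ordinary ridge regression with regularization lam.
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    # MAP estimate for likelihood N(w^T x, sigma^2) and prior N(0, tau2 * I):
    # w_MAP = (X^T X + (sigma^2 / tau2) I)^{-1} X^T y.
    w_map = np.linalg.solve(X.T @ X + (sigma**2 / tau2) * np.eye(d), X.T @ y)
    return np.allclose(w_ridge, w_map)

# Example usage: evaluate your derived f at lam=2.0, sigma=0.7 and pass it in.
# print(ridge_matches_map(tau2=...))
```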