EXAM IN STATISTICAL MACHINE LEARNING
STATISTISK MASKININLÄRNING

DATE AND TIME: June 9, 2018, 09.00-14.00
RESPONSIBLE TEACHER: Andreas Svensson
NUMBER OF PROBLEMS: 5
AIDING MATERIAL: Calculator, mathematical handbooks, 1 hand-written sheet of paper (A4, front and back) with notes and formulas
PRELIMINARY GRADES: grade 3: 23 points, grade 4: 33 points, grade 5: 43 points

Some general instructions and information:
- Your solutions can be given in Swedish or in English.
- Only write on one page of the paper.
- Write your exam code and a page number on all pages.
- Do not use a red pen.
- Use separate sheets of paper for the different problems (i.e. the numbered problems, 1-5).
- With the exception of Problem 1, all your answers must be clearly motivated! A correct answer without a proper motivation will score zero points!

Good luck!

Some useful formulas

Pages 1-3 contain some expressions that may or may not be useful for solving the exam problems. This is not a complete list of formulas used in the course! Consequently, some of the problems may require knowledge about certain expressions not listed here. Furthermore, the formulas listed below are not all self-explanatory, meaning that you need to be familiar with the expressions to be able to interpret them. Thus, the list should be viewed as a support for solving the problems, rather than as a comprehensive collection of formulas.

Marginalization and conditioning of probability densities: For a partitioned random vector Z = (Z_1^T Z_2^T)^T with joint probability density function p(z) = p(z_1, z_2), the marginal probability density function of Z_1 is

    p(z_1) = \int_{\mathcal{Z}_2} p(z_1, z_2) \, dz_2

and the conditional probability density function for Z_1 given Z_2 = z_2 is

    p(z_1 \mid z_2) = \frac{p(z_1, z_2)}{p(z_2)} = \frac{p(z_2 \mid z_1) \, p(z_1)}{p(z_2)}.

The Gaussian distribution: The probability density function of the p-dimensional Gaussian distribution is

    \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right), \quad x \in \mathbb{R}^p.

Sum of identically distributed variables: For identically distributed random variables \{Z_i\}_{i=1}^{n} with mean \mu, variance \sigma^2 and average correlation between distinct variables \rho, it holds that

    E\left[ \frac{1}{n} \sum_{i=1}^{n} Z_i \right] = \mu  and  Var\left( \frac{1}{n} \sum_{i=1}^{n} Z_i \right) = \frac{1 - \rho}{n} \sigma^2 + \rho \sigma^2.

Linear regression and regularization: The least-squares estimate of \beta in the linear regression model Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \epsilon is given by

    \hat{\beta}_{LS} = (X^T X)^{-1} X^T y,

where

    X = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{bmatrix},  y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},  and each  x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}.
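As an illustration (not part of the exam paper), the least-squares expression above can be evaluated numerically. The sketch below is Python with NumPy; the data values are made up for the example.

import numpy as np

# Made-up training data: n = 5 points with p = 2 inputs.
x = np.array([[1.0, 8.0],
              [6.0, 4.0],
              [7.0, 9.0],
              [9.0, 8.0],
              [1.0, 2.0]])
y = np.array([3.0, 2.5, 4.1, 4.0, 0.9])

# Design matrix with a leading column of ones for the intercept,
# matching the definition of X on the formula sheet.
X = np.column_stack([np.ones(len(x)), x])

# Least-squares estimate beta_LS = (X^T X)^{-1} X^T y, computed with a
# least-squares solver instead of an explicit matrix inverse.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ls)  # [beta_0, beta_1, beta_2]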

Ridge regression uses the regularization term \lambda \|\beta\|_2^2 = \lambda \sum_{j=0}^{p} \beta_j^2. The ridge regression estimate is

    \hat{\beta}_{RR} = (X^T X + \lambda I)^{-1} X^T y.

LASSO uses the regularization term \lambda \|\beta\|_1 = \lambda \sum_{j=0}^{p} |\beta_j|. (The LASSO estimate does not admit a simple closed-form expression.)

Maximum likelihood: The maximum likelihood estimate is given by \hat{\beta}_{ML} = \arg\max_{\beta} \log \ell(\beta), where

    \log \ell(\beta) = \sum_{i=1}^{n} \log p(y_i \mid \beta)

is the log-likelihood function (the last equality holds when the n training data points are independent).

Logistic regression: The logistic regression model uses a linear regression for the log-odds. In the binary classification context we thus have

    \log \frac{q(X; \beta)}{1 - q(X; \beta)} = \beta_0 + \sum_{j=1}^{p} \beta_j X_j,

where q(X; \beta) = \Pr(Y = 1 \mid X; \beta).

For multi-class logistic regression there are two common parameterizations. The first approach uses the Kth class as reference and models the K - 1 log-odds as

    \log \frac{q_k(X; \theta)}{q_K(X; \theta)} = \beta_{0k} + \sum_{j=1}^{p} \beta_{jk} X_j,

for k = 1, ..., K - 1. Here, q_k(X; \theta) = \Pr(Y = k \mid X; \theta) and \theta is a vector of all (p + 1)(K - 1) model parameters, \theta = \{\beta_{jk} : 0 \le j \le p, 1 \le k \le K - 1\}.

The second approach is based on the softmax function and models

    \log q_k(X; \theta) \propto \beta_{0k} + \sum_{j=1}^{p} \beta_{jk} X_j.

The probabilities over the K possible classes are normalized to sum to one. This is an over-parameterization with a total of (p + 1) K parameters, \theta = \{\beta_{jk} : 0 \le j \le p, 1 \le k \le K\}.

Discriminant Analysis: The linear discriminant analysis (LDA) classifier assigns a test input X = x to class k for which

    \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k
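As a small illustration of the softmax parameterization above (not part of the exam paper), the sketch below computes the K class probabilities for one input. The parameter values and the input are made up, and K = 3, p = 2 are arbitrary choices.

import numpy as np

# Hypothetical parameters beta_{jk}, 0 <= j <= p, 1 <= k <= K, with p = 2 and K = 3.
beta = np.array([[ 0.2, -0.1,  0.0],   # intercepts beta_{0k}, one column per class
                 [ 0.5,  0.3, -0.4],   # beta_{1k}
                 [-0.2,  0.1,  0.6]])  # beta_{2k}

x = np.array([1.0, 2.0, -1.0])  # input (X_1, X_2) = (2, -1), with a leading 1 for the intercept

# Un-normalized log-probabilities, one per class: beta_{0k} + sum_j beta_{jk} X_j.
logits = x @ beta

# Softmax: exponentiate and normalize so the K probabilities sum to one.
q = np.exp(logits - logits.max())   # subtracting the max is for numerical stability
q = q / q.sum()
print(q, q.sum())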

is largest, where \pi_k = n_k / n and \mu_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i for k = 1, ..., K, and

    \Sigma = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T.

For quadratic discriminant analysis (QDA) we instead use the discriminant functions

    \delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k,

where \Sigma_k = \frac{1}{n_k - 1} \sum_{i: y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T for k = 1, ..., K.

Classification trees: The cost function for tree splitting is

    C(T) = \sum_{m=1}^{|T|} n_m Q_m,

where T is the tree, |T| the number of terminal nodes, n_m the number of training data points falling in node m, and Q_m the impurity of node m. Three common impurity measures for splitting classification trees are:

    Misclassification error:  Q_m = 1 - \max_k \hat{p}_{mk}
    Gini index:               Q_m = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})
    Entropy/deviance:         Q_m = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}

where \hat{p}_{mk} = \frac{1}{n_m} \sum_{i: x_i \in R_m} I(y_i = k).

Loss functions for classification: For a binary classifier expressed as G(X) = \mathrm{sign}\{C(X)\}, for some real-valued function C(X), the margin is defined as Y C(X) (note the convention Y \in \{-1, 1\} here). A few common loss functions expressed in terms of the margin, L(Y, C(X)), are:

    Exponential loss:       L(y, c) = \exp(-yc)
    Hinge loss:             L(y, c) = 1 - yc for yc < 1, and 0 otherwise
    Binomial deviance:      L(y, c) = \log(1 + \exp(-yc))
    Huber-like loss:        L(y, c) = -yc for yc < -1, \frac{1}{4}(1 - yc)^2 for -1 \le yc \le 1, and 0 otherwise
    Misclassification loss: L(y, c) = 1 for yc < 0, and 0 otherwise
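To show how the LDA quantities above fit together, here is a short Python/NumPy sketch (not part of the exam paper). The training data and the test point are made up, and the class labels are assumed to be 0, ..., K-1.

import numpy as np

# Made-up training data and class labels.
x = np.array([[1.0, 8.0], [6.0, 4.0], [7.0, 9.0],
              [1.0, 2.0], [4.0, 1.0], [4.0, 6.0]])
y = np.array([1, 1, 1, 0, 0, 0])
classes = np.unique(y)            # assumed to be 0, ..., K-1
n, K = len(y), len(classes)

# Class priors pi_k = n_k / n, class means mu_k, and pooled covariance Sigma.
pi = np.array([np.mean(y == k) for k in classes])
mu = np.array([x[y == k].mean(axis=0) for k in classes])
Sigma = sum((x[y == k] - mu[k]).T @ (x[y == k] - mu[k]) for k in classes) / (n - K)

# LDA discriminants delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log pi_k.
Sigma_inv = np.linalg.inv(Sigma)
x_test = np.array([5.0, 5.0])
delta = np.array([x_test @ Sigma_inv @ mu[k]
                  - 0.5 * mu[k] @ Sigma_inv @ mu[k]
                  + np.log(pi[k]) for k in classes])
print(classes[np.argmax(delta)])  # class with the largest discriminant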

1. This problem is composed of 10 true-or-false statements. You only have to classify these as either true or false. For this problem (only!) no motivation is required. Each correct answer scores 1 point and each incorrect answer scores -1 point (capped at 0 for the whole problem).

i. A model that performs substantially better on training data than on test data has most likely been overfitted.

ii. A model that has underfitted (= has a high bias) performs substantially better on test data than on training data.

iii. In a neural network it is a good habit to initialize all the parameters with zeros prior to training.

iv. A linear regression problem always has a unique solution if there are more data points n than there are parameters p, i.e. if n > p.

v. In discriminant analysis, the probability density of X for an observation that comes from the kth class, p(X | Y = k), is modeled as normally distributed, N(\mu_k, \Sigma_k).

vi. In linear discriminant analysis the covariances of the class distributions are assumed to be all equal: \Sigma = \Sigma_1 = ... = \Sigma_K.

vii. k-fold cross-validation cannot be used to tune non-parametric models since they don't have any tuning parameters.

viii. Logistic regression is equivalent to linear regression if all outputs in the training data take the value 0 or 1.

ix. In random forests, for each tree a random selection of leaf nodes is discarded in order to decorrelate the trees.

x. In boosting, the ensemble members are trained sequentially, where one ensemble member attempts to compensate for the errors made by the previous ensemble members.

(10p)

2. Consider the following training data

    i    1    2    3    4    5    6    7    8
    X_1  1.0  6.0  7.0  9.0  1.0  4.0  4.0  9.0
    X_2  8.0  4.0  9.0  8.0  2.0  1.0  6.0  2.0
    Y    1    1    1    1    0    0    0    0

where X_1 and X_2 are the input variables, Y is the output and i is the data point index.

(a) Illustrate the training data points in a graph with X_1 and X_2 on the two axes. Represent the points belonging to class Y = 0 with a circle and those belonging to class Y = 1 with a cross. Further, annotate the data points with their data point indices.

(b) Based on the training data we want to construct bagging classification trees with three ensemble members. For this we draw three new datasets by bootstrapping the training data (sampling with replacement). The following data point indices have been drawn for each of the three bootstrapped datasets:

               Data point indices i
    Dataset 1  1  2  3  4  5  5  8  8
    Dataset 2  1  4  5  5  6  7  7  8
    Dataset 3  2  2  3  5  5  6  7  7

Construct three classification trees, one for each of the three bootstrapped datasets. Each tree shall consist of one single binary split that minimizes the misclassification error. (6p)

(c) The final classifier predicts according to the majority vote of the three classification trees. Sketch the decision boundary of the final classifier.
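As a way to cross-check part (b) (not part of the exam paper), the single best binary split for a bootstrapped dataset can be found by exhaustive search over split variables and candidate split points, minimizing the misclassification error. The Python/NumPy sketch below uses Dataset 1; using midpoints between consecutive distinct values as split candidates is an implementation choice.

import numpy as np

# Training data from Problem 2 (rows i = 1, ..., 8).
X = np.array([[1.0, 8.0], [6.0, 4.0], [7.0, 9.0], [9.0, 8.0],
              [1.0, 2.0], [4.0, 1.0], [4.0, 6.0], [9.0, 2.0]])
Y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Bootstrapped Dataset 1 (1-based indices in the exam, converted to 0-based).
idx = np.array([1, 2, 3, 4, 5, 5, 8, 8]) - 1
Xb, Yb = X[idx], Y[idx]

def node_error(y):
    """Misclassification error when a node predicts its majority class."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y)
    return min(p1, 1.0 - p1)

# Exhaustive search: try each input variable j and each midpoint between
# consecutive distinct values as the split point s.
best = None
for j in range(Xb.shape[1]):
    values = np.unique(Xb[:, j])
    for s in (values[:-1] + values[1:]) / 2:
        left, right = Yb[Xb[:, j] <= s], Yb[Xb[:, j] > s]
        err = (len(left) * node_error(left) + len(right) * node_error(right)) / len(Yb)
        if best is None or err < best[0]:
            best = (err, j, s)

print(best)  # (misclassification error, split variable index, split point)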

3. Consider a classification problem with two quantitative input variables X_1 and X_2, and one qualitative binary output Y \in \{0, 1\}. A logistic regression classifier

    \hat{G}(X) = I(q(X; \beta) > r),  where  q(X; \beta) = \frac{e^{\beta_0 + X_1 \beta_1 + X_2 \beta_2}}{1 + e^{\beta_0 + X_1 \beta_1 + X_2 \beta_2}},

has been trained on a training data set. Here q(X; \beta) = P(Y = 1 \mid X) is the class-1 probability and r is a chosen threshold. This classifier is displayed in Figure 1 for r = 0.5, where the blue region represents the prediction \hat{G}(X) = 0 and the red region represents the prediction \hat{G}(X) = 1. The model is to be evaluated on a test dataset with n = 9 data points, which are displayed in Figure 1, where blue dots are test points belonging to the class Y = 0 and red crosses are test points belonging to the class Y = 1.

Figure 1: Logistic regression classifier together with test data
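Purely as an illustration of the decision rule above (not part of the exam paper; the exam does not give numerical values for beta), here is a short Python sketch of the classifier with hypothetical coefficients:

import numpy as np

beta0, beta1, beta2 = -4.0, 0.5, 0.3   # hypothetical coefficients
r = 0.5                                # decision threshold

def q(x1, x2):
    """Class-1 probability e^z / (1 + e^z), written as a logistic function of z."""
    z = beta0 + x1 * beta1 + x2 * beta2
    return 1.0 / (1.0 + np.exp(-z))

def G_hat(x1, x2):
    """Predict 1 if the class-1 probability exceeds the threshold r, else 0."""
    return int(q(x1, x2) > r)

print(q(5.0, 5.0), G_hat(5.0, 5.0))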

(a) Consider Y = 0 to be the negative class and Y = 1 to be the positive class. Fill in a confusion matrix for the classifier based on the test dataset.

Hint: The confusion matrix is of the following form:

                            Predicted condition
                            \hat{G}(X) = 0       \hat{G}(X) = 1
    True       Y = 0        #true negatives      #false positives
    condition  Y = 1        #false negatives     #true positives

(b) Sketch a ROC curve for the classifier, computed based on the test data set, i.e. a plot of the true positive rate (TPR) vs. the false positive rate (FPR) as the threshold r varies from 0 to 1, where

    TPR = #true positives / #positives,    FPR = #false positives / #negatives.

Sketch the curve on the empty figure axes handed out together with the exam, marked "Problem 3b: Sketch the curves here". Don't forget to attach the figure to the exam that you hand in! (5p)

Hint: The decision boundaries for the different values of r will all be parallel.

(c) Area Under Curve (AUC) is a condensed performance measure for the classifier, taking all possible thresholds into account. What is the AUC for this classifier? (1p)

(d) Assume you have an application where it is more important to keep the #false negatives low than to keep the #false positives low. Suggest a linear decision boundary (not necessarily based on the logistic regression model above) that primarily minimizes the #false negatives and secondarily minimizes the #false positives. Don't forget to motivate your answer.

Note: This question can be answered independently of questions 3a-3c.
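For readers who want to verify a hand-drawn ROC curve numerically, here is a Python/NumPy sketch (not part of the exam paper). The predicted probabilities and labels below are made up; the actual values must be read off Figure 1.

import numpy as np

# Hypothetical class-1 probabilities and true labels for the 9 test points.
q_test = np.array([0.10, 0.20, 0.35, 0.45, 0.55, 0.60, 0.70, 0.80, 0.95])
y_test = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1])

def confusion_counts(r):
    """(#TP, #FP, #TN, #FN) for the classifier I(q > r) at threshold r."""
    pred = (q_test > r).astype(int)
    tp = np.sum((pred == 1) & (y_test == 1))
    fp = np.sum((pred == 1) & (y_test == 0))
    tn = np.sum((pred == 0) & (y_test == 0))
    fn = np.sum((pred == 0) & (y_test == 1))
    return tp, fp, tn, fn

# Trace the ROC curve by sweeping the threshold r from 0 to 1.
for r in np.linspace(0.0, 1.0, 11):
    tp, fp, tn, fn = confusion_counts(r)
    tpr = tp / (tp + fn)   # TPR = #true positives / #positives
    fpr = fp / (fp + tn)   # FPR = #false positives / #negatives
    print(f"r={r:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")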

4. Consider a dataset with 10 000 grayscale images of pets, each consisting of 20 × 20 scalar pixels. Each image is labeled with one of the four classes Y \in \{cat, dog, rabbit, hamster\}. For this data we design a small convolutional neural network with one convolutional layer and one dense layer. The convolutional layer is parameterized with a weight tensor W^(1) and a bias vector b^(1), producing a hidden layer H. The convolutional layer has the following design:

    Number of kernels/output channels:  8
    Kernel rows and columns:            (5 × 5)
    Stride:                             [2, 2]

The stride [2, 2] means that the kernel moves by two steps (both row- and column-wise) during the convolution, such that the hidden layer has half as many rows and columns as the input image. The dense layer is parameterized with the weight matrix W^(2) and bias vector b^(2), producing the logits Z. The logits Z are pushed through a softmax function to produce the four class probabilities q_1(X), ..., q_4(X).

(a) What are the sizes of the weight tensor W^(1), offset vector b^(1) and the hidden layer H?

(b) What are the sizes of the weight matrix W^(2), offset vector b^(2) and the logits Z?

(c) What is the total number of parameters used to parameterize this network? (1p)

(d) In the context of neural networks, describe, using a few sentences, the difference between a dense layer and a convolutional layer.

Note: This question can be answered independently of questions 4a-4c.

(e) Describe how mini-batch gradient descent works and what the main advantage is in comparison to gradient descent. (3p)

Note: This question can be answered independently of questions 4a-4d.
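One possible way to sanity-check the sizes asked for in 4a-4c (not part of the exam paper) is the small Python calculation below. It assumes the padding is chosen so that a stride of 2 gives exactly half the rows and columns, as stated in the problem text, and that the dense layer acts on the flattened hidden layer H.

# Convolutional layer: 8 kernels of size 5 x 5 over 1 input channel, plus one bias per kernel.
rows, cols, in_channels = 20, 20, 1
kernel, out_channels = 5, 8
stride = 2

H_rows, H_cols = rows // stride, cols // stride            # spatial size of H
W1_params = kernel * kernel * in_channels * out_channels   # entries of the weight tensor W(1)
b1_params = out_channels                                   # entries of the bias vector b(1)

# Dense layer: from the flattened hidden layer to the 4 logits.
num_classes = 4
hidden_units = H_rows * H_cols * out_channels
W2_params = hidden_units * num_classes                     # entries of the weight matrix W(2)
b2_params = num_classes                                    # entries of the bias vector b(2)

total = W1_params + b1_params + W2_params + b2_params
print((H_rows, H_cols, out_channels), total)               # size of H and total parameter count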

5. (a) Consider the simple linear regression model with one input variable X,

    Y = f(X) + \varepsilon = \beta_0 + \beta_1 X + \varepsilon,    \varepsilon \sim \mathcal{N}(0, 1^2).    (1)

Assume that we have a training data set consisting of two data points T = \{(x_1, y_1), (x_2, y_2)\} = \{(-2, 1), (1, 2)\}. What is the least-squares estimate \hat{\beta}_{LS} of \beta?

(b) Assuming that the model assumption in (1) is correct, compute the variance of the estimate \hat{\beta}. (3p)

Hint 1: Note that we need to take into account the fact that the training outputs y_1 and y_2 (and thus the estimate \hat{\beta}) are random variables, whereas x_1 and x_2 are fixed.

Hint 2: You may use Var[Ay + b] = A \Sigma A^T, where \Sigma = Var[y].

(c) Assuming that the model assumption in (1) is correct, consider the model for prediction \hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 X. Compute the bias and variance of \hat{f}(x_\star) for a test point x_\star = 1, when estimating \beta using least squares. (3p)

Hint 3: The model bias is E[\hat{f}(x_\star) - f(x_\star)] and the model variance is E[(\hat{f}(x_\star) - E[\hat{f}(x_\star)])^2].

(d) What is the ridge regression estimate \hat{\beta}_{RR} of \beta? Express your solution in terms of the regularization parameter \lambda.
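For illustration only (not part of the exam paper, and using the training points as transcribed above, whose signs may have been distorted in the extraction), the least-squares estimate, its variance, and a ridge estimate for a chosen lambda can be computed numerically in Python:

import numpy as np

# Training points as transcribed: (x1, y1) = (-2, 1), (x2, y2) = (1, 2).
x = np.array([-2.0, 1.0])
y = np.array([1.0, 2.0])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate beta_LS = (X^T X)^{-1} X^T y.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Under model (1), Var[beta_LS] = sigma^2 (X^T X)^{-1}; here sigma = 1.
var_beta = np.linalg.inv(X.T @ X)

# Ridge estimate beta_RR = (X^T X + lambda I)^{-1} X^T y for an arbitrary lambda.
lam = 1.0
beta_rr = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(beta_ls, var_beta, beta_rr, sep="\n")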


[Handout sheet for Problem 3b ("Sketch the curves here"): empty axes with the false positive rate (FPR) from 0 to 1 on the horizontal axis and the true positive rate (TPR) from 0 to 1 on the vertical axis.]