EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

DATE AND TIME: June 9, 2018, 09.00-14.00
RESPONSIBLE TEACHER: Andreas Svensson
NUMBER OF PROBLEMS: 5
AIDING MATERIAL: Calculator, mathematical handbooks, 1 hand-written sheet of paper (A4, front and back) with notes and formulas
PRELIMINARY GRADES:
  grade 3: 23 points
  grade 4: 33 points
  grade 5: 43 points

Some general instructions and information:
- Your solutions can be given in Swedish or in English.
- Only write on one page of the paper.
- Write your exam code and a page number on all pages.
- Do not use a red pen.
- Use separate sheets of paper for the different problems (i.e. the numbered problems, 1-5).

With the exception of Problem 1, all your answers must be clearly motivated! A correct answer without a proper motivation will score zero points!

Good luck!

Some useful formulas

Pages 1-3 contain some expressions that may or may not be useful for solving the exam problems. This is not a complete list of formulas used in the course! Consequently, some of the problems may require knowledge about certain expressions not listed here. Furthermore, the formulas listed below are not all self-explanatory, meaning that you need to be familiar with the expressions to be able to interpret them. Thus, the list should be viewed as a support for solving the problems, rather than as a comprehensive collection of formulas.

Marginalization and conditioning of probability densities: For a partitioned random vector $Z = (Z_1^T \; Z_2^T)^T$ with joint probability density function $p(z) = p(z_1, z_2)$, the marginal probability density function of $Z_1$ is

$$p(z_1) = \int_{Z_2} p(z_1, z_2) \, dz_2$$

and the conditional probability density function for $Z_1$ given $Z_2 = z_2$ is

$$p(z_1 \mid z_2) = \frac{p(z_1, z_2)}{p(z_2)} = \frac{p(z_2 \mid z_1) \, p(z_1)}{p(z_2)}.$$

The Gaussian distribution: The probability density function of the p-dimensional Gaussian distribution is

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right), \quad x \in \mathbb{R}^p.$$

Sum of identically distributed variables: For identically distributed random variables $\{Z_i\}_{i=1}^n$ with mean $\mu$, variance $\sigma^2$ and average correlation between distinct variables $\rho$, it holds that

$$\mathrm{E}\left[ \frac{1}{n} \sum_{i=1}^n Z_i \right] = \mu \quad \text{and} \quad \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^n Z_i \right) = \frac{1 - \rho}{n} \sigma^2 + \rho \sigma^2.$$

Linear regression and regularization: The least-squares estimate of $\beta$ in the linear regression model

$$Y = \beta_0 + \sum_{j=1}^p \beta_j X_j + \varepsilon$$

is given by $\hat{\beta}_{\mathrm{LS}} = (X^T X)^{-1} X^T y$, where

$$X = \begin{pmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \text{and each } x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}.$$
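As an illustration of the least-squares formula above, the following is a minimal sketch for the special case of a single input variable (p = 1), where the normal equations reduce to a 2 x 2 system. The function name `simple_least_squares` and the demo data are hypothetical and are not part of the exam.

```python
def simple_least_squares(xs, ys):
    """Least-squares estimate (beta_0, beta_1) for Y = beta_0 + beta_1 * X + eps.

    Solves the 2x2 normal equations (X^T X) beta = X^T y, where each row
    of X is [1, x_i].
    """
    n = len(xs)
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, s_x], [s_x, s_xx]] and X^T y = [s_y, s_xy]
    det = n * s_xx - s_x * s_x
    if det == 0:
        raise ValueError("X^T X is singular; the estimate is not unique")
    beta_0 = (s_xx * s_y - s_x * s_xy) / det
    beta_1 = (n * s_xy - s_x * s_y) / det
    return beta_0, beta_1


# Hypothetical data lying exactly on the line y = 1 + 2x
xs_demo = [0.0, 1.0, 2.0, 3.0]
ys_demo = [1.0, 3.0, 5.0, 7.0]
beta_0_hat, beta_1_hat = simple_least_squares(xs_demo, ys_demo)
print(beta_0_hat, beta_1_hat)
```

For more than one input variable the same normal equations apply, but a general linear solver would be needed instead of the explicit 2 x 2 inverse used here.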

Ridge regression uses the regularization term $\lambda \|\beta\|_2^2 = \lambda \sum_{j=0}^p \beta_j^2$. The ridge regression estimate is $\hat{\beta}_{\mathrm{RR}} = (X^T X + \lambda I)^{-1} X^T y$.

LASSO uses the regularization term $\lambda \|\beta\|_1 = \lambda \sum_{j=0}^p |\beta_j|$. (The LASSO estimate does not admit a simple closed form expression.)

Maximum likelihood: The maximum likelihood estimate is given by

$$\hat{\beta}_{\mathrm{ML}} = \arg\max_{\beta} \log \ell(\beta)$$

where $\log \ell(\beta) = \sum_{i=1}^n \log p(y_i \mid \beta)$ is the log-likelihood function (the last equality holds when the n training data points are independent).

Logistic regression: The logistic regression model uses a linear regression for the log-odds. In the binary classification context we thus have

$$\log \left( \frac{q(X; \beta)}{1 - q(X; \beta)} \right) = \beta_0 + \sum_{j=1}^p \beta_j X_j,$$

where $q(X; \beta) = \Pr(Y = 1 \mid X; \beta)$.

For multi-class logistic regression there are two common parameterizations. The first approach uses the Kth class as reference and models the K - 1 log-odds as

$$\log \frac{q_k(X; \theta)}{q_K(X; \theta)} = \beta_{0k} + \sum_{j=1}^p \beta_{jk} X_j, \quad \text{for } k = 1, \ldots, K - 1.$$

Here, $q_k(X; \theta) = \Pr(Y = k \mid X; \theta)$ and $\theta$ is a vector of all $(p + 1)(K - 1)$ model parameters, $\theta = \{\beta_{jk} : 0 \le j \le p, \ 1 \le k \le K - 1\}$.

The second approach is based on the softmax function and models

$$\log q_k(X; \theta) \propto \beta_{0k} + \sum_{j=1}^p \beta_{jk} X_j.$$

The probabilities over the K possible classes are normalized to sum to one. This is an over-parameterization with a total of $(p + 1)K$ parameters, $\theta = \{\beta_{jk} : 0 \le j \le p, \ 1 \le k \le K\}$.

Discriminant Analysis: The linear discriminant analysis (LDA) classifier assigns a test input $X = x$ to the class k for which

$$\hat{\delta}_k(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k$$

is largest, where $\hat{\pi}_k = n_k / n$ and $\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i$ for $k = 1, \ldots, K$, and

$$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^K \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T.$$

For quadratic discriminant analysis (QDA) we instead use the discriminant functions

$$\hat{\delta}_k(x) = -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) + \log \hat{\pi}_k,$$

where $\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$ for $k = 1, \ldots, K$.

Classification trees: The cost function for tree splitting is

$$C(T) = \sum_{m=1}^{|T|} n_m Q_m$$

where T is the tree, |T| the number of terminal nodes, $n_m$ the number of training data points falling in node m, and $Q_m$ the impurity of node m. Three common impurity measures for splitting classification trees are:

Misclassification error: $Q_m = 1 - \max_k \hat{p}_{mk}$
Gini index: $Q_m = \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk})$
Entropy/deviance: $Q_m = -\sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}$

where $\hat{p}_{mk} = \frac{1}{n_m} \sum_{i : x_i \in R_m} I(y_i = k)$.

Loss functions for classification: For a binary classifier expressed as $\hat{G}(X) = \mathrm{sign}\{C(X)\}$, for some real-valued function $C(X)$, the margin is defined as $Y \cdot C(X)$ (note the convention $Y \in \{-1, 1\}$ here). A few common loss functions expressed in terms of the margin, $L(Y, C(X))$, are:

Exponential loss: $L(y, c) = \exp(-yc)$.

Hinge loss: $L(y, c) = \begin{cases} 1 - yc & \text{for } yc < 1, \\ 0 & \text{otherwise.} \end{cases}$

Binomial deviance: $L(y, c) = \log(1 + \exp(-yc))$.

Huber-like loss: $L(y, c) = \begin{cases} -yc & \text{for } yc < -1, \\ \frac{1}{4}(1 - yc)^2 & \text{for } -1 \le yc \le 1, \\ 0 & \text{otherwise.} \end{cases}$

Misclassification loss: $L(y, c) = \begin{cases} 1 & \text{for } yc < 0, \\ 0 & \text{otherwise.} \end{cases}$
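The three impurity measures listed above for classification trees, together with the splitting cost C(T), can be sketched as follows. This is a minimal illustration under the assumption that a node is represented simply as a list of class labels. All function names and the example node are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportions p_mk of each class present among the labels in one node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def misclassification_error(labels):
    return 1.0 - max(class_proportions(labels))


def gini_index(labels):
    return sum(p * (1.0 - p) for p in class_proportions(labels))


def entropy(labels):
    # Classes absent from the node contribute nothing, so only present classes are summed.
    return -sum(p * math.log(p) for p in class_proportions(labels))


def tree_cost(nodes, impurity):
    """C(T): sum over terminal nodes of n_m * Q_m for a chosen impurity measure."""
    return sum(len(labels) * impurity(labels) for labels in nodes)


# Hypothetical terminal node with three points of class 1 and one of class 0
node = [1, 1, 1, 0]
print(misclassification_error(node), gini_index(node), entropy(node))
```

A candidate split can then be scored by calling `tree_cost` on the two lists of labels it produces. The split with the lowest cost is preferred.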

1. This problem is composed of 10 true-or-false statements. You only have to classify these as either true or false. For this problem (only!) no motivation is required. Each correct answer scores 1 point and each incorrect answer scores -1 point (capped at 0 for the whole problem).

i. A model that performs substantially better on training data than on test data has most likely been overfitted.
ii. A model that has underfitted (= has a high bias) performs substantially better on test data than on training data.
iii. In a neural network it is a good habit to initialize all the parameters with zeros prior to training.
iv. A linear regression problem always has a unique solution if there are more data points n than there are parameters p, i.e. if n > p.
v. In discriminant analysis the probability density of X for an observation that comes from the kth class, $p(x \mid Y = k)$, is modeled as a normal distribution $\mathcal{N}(\mu_k, \Sigma_k)$.
vi. In linear discriminant analysis the covariances of the class distributions are assumed to be all equal, $\Sigma = \Sigma_1 = \cdots = \Sigma_K$.
vii. k-fold cross validation cannot be used to tune non-parametric models since they don't have any tuning parameters.
viii. Logistic regression is equivalent to linear regression if all outputs in the training data take the value 0 or 1.
ix. In random forest, for each tree a random selection of leaf nodes are discarded in order to decorrelate the trees.
x. In boosting the ensemble members are trained sequentially where one ensemble member attempts to compensate for the errors made by the previous ensemble members.

(10p)

2. Consider the following training data

  i     1    2    3    4    5    6    7    8
  X1   1.0  6.0  7.0  9.0  1.0  4.0  4.0  9.0
  X2   8.0  4.0  9.0  8.0  2.0  1.0  6.0  2.0
  Y     1    1    1    1    0    0    0    0

where X1 and X2 are the input variables, Y is the output and i is the data point index.

(a) Illustrate the training data points in a graph with X1 and X2 on the two axes. Represent the points belonging to class Y = 0 with a circle and those belonging to class Y = 1 with a cross. Further, annotate the data points with their data point indices.

(b) Based on the training data we want to construct bagging classification trees with three ensemble members. For this we draw three new datasets by bootstrapping the training data (sampling with replacement). The following data point indices have been drawn for each of the three bootstrapped datasets

              Data point indices i
  Dataset 1   1  2  3  4  5  5  8  8
  Dataset 2   1  4  5  5  6  7  7  8
  Dataset 3   2  2  3  5  5  6  7  7

Construct three classification trees, one for each of the three bootstrapped datasets. Each tree shall consist of one single binary split that minimizes the misclassification error. (6p)

(c) The final classifier predicts according to the majority vote of the three classification trees. Sketch the decision boundary of the final classifier.
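The two bagging ingredients used in this problem, bootstrapping by sampling with replacement and prediction by majority vote, can be sketched in general terms as follows. This is a hedged illustration only: the seed, the function names and the example votes are hypothetical, and the indices it draws are unrelated to the three datasets given in (b).

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n data point indices from 1..n uniformly, with replacement."""
    return sorted(rng.choices(range(1, n + 1), k=n))


def majority_vote(predictions):
    """Return the class predicted by most ensemble members."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # hypothetical seed, chosen only for reproducibility
datasets = [bootstrap_indices(8, rng) for _ in range(3)]

# Hypothetical predictions from three ensemble members for one test input
vote = majority_vote([1, 0, 1])
print(datasets, vote)
```

Because the sampling is with replacement, a bootstrapped dataset typically contains some indices more than once and omits others, as in the table in (b).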

3. Consider a classification problem with two quantitative input variables $X_1$ and $X_2$, and one qualitative binary output $Y \in \{0, 1\}$. A logistic regression classifier $\hat{G}(X) = I(q(X; \beta) > r)$, where

$$q(X; \beta) = \frac{e^{\beta_0 + X_1 \beta_1 + X_2 \beta_2}}{1 + e^{\beta_0 + X_1 \beta_1 + X_2 \beta_2}},$$

has been trained on a training data set. Here $q(X; \beta) = P(Y = 1 \mid X)$ is the class-1 probability and r is a chosen threshold. This classifier is displayed in Figure 1 for r = 0.5, where the blue region represents the prediction $\hat{G}(X) = 0$ and the red region represents the prediction $\hat{G}(X) = 1$. The model is to be evaluated on a test dataset with n = 9 data points, which are displayed in Figure 1, where blue dots are test points belonging to the class Y = 0 and red crosses are test points belonging to the class Y = 1.

Figure 1: Logistic regression classifier together with test data

(a) Consider Y = 0 to be the negative class and Y = 1 to be the positive class. Fill in a confusion matrix for the classifier based on the test dataset.

Hint: The confusion matrix has the following form:

                            Predicted condition
                            Ĝ(X) = 0            Ĝ(X) = 1
  True condition   Y = 0    #true negatives     #false positives
                   Y = 1    #false negatives    #true positives

(b) Sketch a ROC curve for the classifier, computed based on the test data set, i.e. a plot of the true positive rate (TPR) vs. the false positive rate (FPR) as the threshold r varies from 0 to 1, where

$$\mathrm{TPR} = \frac{\#\text{true positives}}{\#\text{positives}}, \quad \mathrm{FPR} = \frac{\#\text{false positives}}{\#\text{negatives}}.$$

Sketch the curve on the empty figure axes handed out together with the exam, marked "Problem 3b: Sketch the curves here". Don't forget to attach the figure to the exam that you hand in! (5p)

Hint: The decision boundaries for the different values of r will all be parallel.

(c) Area Under Curve (AUC) is a condensed performance measure for the classifier, taking all possible thresholds into account. What is AUC for this classifier? (1p)

(d) Assume you have an application where it is more important to keep the #false negatives low than to keep the #false positives low. Suggest a linear decision boundary (not necessarily based on the logistic regression model above) that primarily minimizes the #false negatives and secondarily minimizes the #false positives. Don't forget to motivate your answer.

Note: This question can be answered independently of questions 3a-3c.
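The definitions of TPR and FPR in (b) follow directly from the four confusion-matrix counts in (a). The sketch below shows that relation for one threshold r. The labels and class-1 probabilities are made up for illustration and are not read from Figure 1, and the function names are hypothetical.

```python
def confusion_counts(y_true, scores, r):
    """Return (tn, fp, fn, tp) for the rule: predict 1 if score > r, else 0."""
    tn = fp = fn = tp = 0
    for y, q in zip(y_true, scores):
        predicted = 1 if q > r else 0
        if y == 0 and predicted == 0:
            tn += 1
        elif y == 0 and predicted == 1:
            fp += 1
        elif y == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


def roc_point(y_true, scores, r):
    """Return (FPR, TPR) at threshold r; assumes both classes are present."""
    tn, fp, fn, tp = confusion_counts(y_true, scores, r)
    return fp / (fp + tn), tp / (tp + fn)


# Hypothetical test labels and predicted class-1 probabilities
y_demo = [0, 0, 0, 1, 1]
q_demo = [0.1, 0.4, 0.7, 0.6, 0.9]
print(confusion_counts(y_demo, q_demo, 0.5), roc_point(y_demo, q_demo, 0.5))
```

Evaluating `roc_point` for a range of thresholds between 0 and 1 gives the points that a ROC curve connects.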

4. Consider a dataset with 10 000 grayscale images of pets, each consisting of 20 × 20 scalar pixels. Each image is labeled with one of the four classes $Y \in \{\text{cat}, \text{dog}, \text{rabbit}, \text{hamster}\}$. For this data we design a small convolutional neural network with one convolutional layer and one dense layer.

The convolutional layer is parameterized with a weight tensor $W^{(1)}$ and a bias vector $b^{(1)}$, producing a hidden layer H. The convolutional layer has the following design

  Number of kernels/output channels   8
  Kernel rows and columns             (5 × 5)
  Stride                              [2,2]

The stride [2,2] means that the kernel moves by two steps (both row- and column-wise) during the convolution, such that the hidden layer has half as many rows and columns as the input image.

The dense layer is parameterized with the weight matrix $W^{(2)}$ and bias vector $b^{(2)}$, producing the logits Z. The logits Z are pushed through a softmax function to produce the four class probabilities $q_1(X), \ldots, q_4(X)$.

(a) What are the sizes of the weight tensor $W^{(1)}$, offset vector $b^{(1)}$ and the hidden layer H?

(b) What are the sizes of the weight tensor $W^{(2)}$, offset vector $b^{(2)}$ and the logits Z?

(c) What is the total number of parameters used to parameterize this network? (1p)

(d) In the context of neural networks, describe, using a few sentences, the difference between a dense layer and a convolutional layer.

Note: This question can be answered independently of questions 4a-4c.

(e) Describe how mini-batch gradient descent works and what the main advantage is in comparison to gradient descent. (3p)

Note: This question can be answered independently of questions 4a-??.
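The softmax step mentioned in the network description, which turns the logits Z into class probabilities that sum to one, can be sketched as below. The logit values are hypothetical and chosen only for illustration. Subtracting the largest logit before exponentiating is a common numerical safeguard and does not change the result.

```python
import math


def softmax(logits):
    """Map a list of real-valued logits to probabilities that sum to one."""
    largest = max(logits)
    # Shifting by the largest logit avoids overflow in exp without changing the ratios.
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four classes
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs)
```

The class with the largest logit always receives the largest probability, so the predicted class can be read from either the logits or the probabilities.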

5. (a) Consider the simple linear regression model with one input variable X,

$$Y = f(X) + \varepsilon = \beta_0 + \beta_1 X + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1^2). \tag{1}$$

Assume that we have a training data set consisting of two data points $\mathcal{T} = \{(x_1, y_1), (x_2, y_2)\}$ = {( 2, 1), (1, 2)}. What is the least-squares estimate $\hat{\beta}_{\mathrm{LS}}$ of $\beta$?

(b) Assuming that the model assumption in (1) is correct, compute the variance of the estimate $\hat{\beta}$. (3p)

Hint 1: Note that we need to take into account the fact that the training outputs $y_1$ and $y_2$ (and thus the estimate $\hat{\beta}$) are random variables, whereas $x_1$ and $x_2$ are fixed.

Hint 2: You may use $\mathrm{Var}[Ay + b] = A \Sigma A^T$, where $\Sigma = \mathrm{Var}[y]$.

(c) Assuming that the model assumption in (1) is correct, consider the model for prediction $\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 X$. Compute the bias and variance in $\hat{f}(x_\star)$ for a test point $x_\star$ = 1, when estimating $\beta$ using least-squares. (3p)

Hint 3: The model bias is $\mathrm{E}[\hat{f}(x_\star) - f(x_\star)]$ and the model variance is $\mathrm{E}[(\hat{f}(x_\star) - \mathrm{E}[\hat{f}(x_\star)])^2]$.

(d) What is the ridge regression estimate $\hat{\beta}_{\mathrm{RR}}$ of $\beta$? Express your solution in terms of the regularization parameter $\lambda$.


Problem 3b: Sketch the curves here

[Empty figure axes for the ROC sketch: True positive rate (TPR) on the vertical axis and False positive rate (FPR) on the horizontal axis, both ranging from 0 to 1.]