EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

DATE AND TIME: August 30, 2018, 14.00-19.00
RESPONSIBLE TEACHER: Niklas Wahlström
NUMBER OF PROBLEMS: 5
AIDING MATERIAL: Calculator, mathematical handbooks, 1 hand-written sheet of paper (A4, front and back) with notes and formulas
PRELIMINARY GRADES: grade 3: 23 points, grade 4: 33 points, grade 5: 43 points

Some general instructions and information:
- Your solutions can be given in Swedish or in English.
- Only write on one page of the paper.
- Write your exam code and a page number on all pages.
- Do not use a red pen.
- Use separate sheets of paper for the different problems (i.e. the numbered problems, 1-5).

With the exception of Problem 1, all your answers must be clearly motivated! A correct answer without a proper motivation will score zero points!

Good luck!

Some useful formulas

Pages 1-3 contain some expressions that may or may not be useful for solving the exam problems. This is not a complete list of formulas used in the course! Consequently, some of the problems may require knowledge about certain expressions not listed here. Furthermore, the formulas listed below are not all self-explanatory, meaning that you need to be familiar with the expressions to be able to interpret them. Thus, the list should be viewed as a support for solving the problems, rather than as a comprehensive collection of formulas.

Marginalization and conditioning of probability densities: For a partitioned random vector $Z = (Z_1^T \; Z_2^T)^T$ with joint probability density function $p(z) = p(z_1, z_2)$, the marginal probability density function of $Z_1$ is

$$p(z_1) = \int_{Z_2} p(z_1, z_2) \, dz_2$$

and the conditional probability density function for $Z_1$ given $Z_2 = z_2$ is

$$p(z_1 \mid z_2) = \frac{p(z_1, z_2)}{p(z_2)} = \frac{p(z_2 \mid z_1) \, p(z_1)}{p(z_2)}.$$

The Gaussian distribution: The probability density function of the p-dimensional Gaussian distribution is

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right), \quad x \in \mathbb{R}^p.$$

Sum of identically distributed variables: For identically distributed random variables $\{Z_i\}_{i=1}^n$ with mean $\mu$, variance $\sigma^2$ and average correlation between distinct variables $\rho$, it holds that

$$E\left[ \frac{1}{n} \sum_{i=1}^n Z_i \right] = \mu \quad \text{and} \quad \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^n Z_i \right) = \frac{1 - \rho}{n} \sigma^2 + \rho \sigma^2.$$

Linear regression and regularization: The least-squares estimate of $\beta$ in the linear regression model

$$Y = \beta_0 + \sum_{j=1}^p \beta_j X_j + \epsilon$$

is given by $\hat{\beta}_{LS} = (X^T X)^{-1} X^T y$, where

$$X = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \text{and each } x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}.$$
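As a minimal illustration of the least-squares formula above, the following Python sketch builds the matrix X with a leading column of ones and solves the normal equations. The toy data and the function name `least_squares` are made up for this example and are not part of the exam; NumPy is assumed to be available.

```python
import numpy as np


def least_squares(X, y):
    """Return the least-squares estimate by solving (X^T X) beta = X^T y."""
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Hypothetical toy data (not from the exam): n = 4 points, p = 1 input,
# generated exactly from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Each row of X is [1, x_i], so beta_ls[0] is the intercept beta_0
# and beta_ls[1] is the slope beta_1.
X = np.column_stack([np.ones_like(x), x])
beta_ls = least_squares(X, y)
print(beta_ls)
```

Because the toy outputs lie exactly on a line, the estimate recovers the intercept 1 and slope 2 up to rounding.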

Ridge regression uses the regularization term $\lambda \|\beta\|_2^2 = \lambda \sum_{j=0}^p \beta_j^2$. The ridge regression estimate is $\hat{\beta}_{RR} = (X^T X + \lambda I)^{-1} X^T y$.

LASSO uses the regularization term $\lambda \|\beta\|_1 = \lambda \sum_{j=0}^p |\beta_j|$. (The LASSO estimate does not admit a simple closed form expression.)

Maximum likelihood: The maximum likelihood estimate is given by

$$\hat{\beta}_{ML} = \arg\max_{\beta} \log \ell(\beta)$$

where $\log \ell(\beta) = \sum_{i=1}^n \log p(y_i \mid \beta)$ is the log-likelihood function (the sum form holds when the n training data points are independent).

Logistic regression: The logistic regression model uses a linear regression for the log-odds. In the binary classification context we thus have

$$\log\left( \frac{q(X; \beta)}{1 - q(X; \beta)} \right) = \beta_0 + \sum_{j=1}^p \beta_j X_j,$$

where $q(X; \beta) = \Pr(Y = 1 \mid X; \beta)$.

For multi-class logistic regression there are two common parameterizations. The first approach uses the Kth class as reference and models the K - 1 log-odds as

$$\log \frac{q_k(X; \theta)}{q_K(X; \theta)} = \beta_{0k} + \sum_{j=1}^p \beta_{jk} X_j, \quad \text{for } k = 1, \dots, K - 1.$$

Here, $q_k(X; \theta) = \Pr(Y = k \mid X; \theta)$ and $\theta$ is a vector of all $(p + 1) \times (K - 1)$ model parameters, $\theta = \{\beta_{jk} : 0 \le j \le p, \; 1 \le k \le K - 1\}$.

The second approach is based on the softmax function and models

$$\log q_k(X; \theta) \propto \beta_{0k} + \sum_{j=1}^p \beta_{jk} X_j.$$

The probabilities over the K possible classes are normalized to sum to one. This is an over-parameterization with a total of $(p + 1) \times K$ parameters, $\theta = \{\beta_{jk} : 0 \le j \le p, \; 1 \le k \le K\}$.

Discriminant Analysis: The linear discriminant analysis (LDA) classifier assigns a test input $X = x$ to the class k for which

$$\hat{\delta}_k(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k$$

is largest, where $\hat{\pi}_k = n_k / n$ and $\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i$ for $k = 1, \dots, K$, and

$$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^K \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T.$$

For quadratic discriminant analysis (QDA) we instead use the discriminant functions

$$\hat{\delta}_k(x) = -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) + \log \hat{\pi}_k,$$

where $\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$ for $k = 1, \dots, K$.

Classification trees: The cost function for tree splitting is $C(T) = \sum_{m=1}^{|T|} n_m Q_m$, where $T$ is the tree, $|T|$ the number of terminal nodes, $n_m$ the number of training data points falling in node m, and $Q_m$ the impurity of node m. Three common impurity measures for splitting classification trees are:

Misclassification error: $Q_m = 1 - \max_k \hat{p}_{mk}$
Gini index: $Q_m = \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk})$
Entropy/deviance: $Q_m = -\sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}$

where $\hat{p}_{mk} = \frac{1}{n_m} \sum_{i : x_i \in R_m} I(y_i = k)$.

Loss functions for classification: For a binary classifier expressed as $G(X) = \mathrm{sign}\{C(X)\}$, for some real-valued function $C(X)$, the margin is defined as $Y \cdot C(X)$ (note the convention $Y \in \{-1, 1\}$ here). A few common loss functions expressed in terms of the margin, $L(Y, C(X))$, are:

Exponential loss: $L(y, c) = \exp(-yc)$.
Hinge loss: $L(y, c) = 1 - yc$ for $yc < 1$, and $0$ otherwise.
Binomial deviance: $L(y, c) = \log(1 + \exp(-yc))$.
Huber-like loss: $L(y, c) = -yc$ for $yc < -1$, $\frac{1}{4}(1 - yc)^2$ for $-1 \le yc \le 0$, and $0$ otherwise.
Misclassification loss: $L(y, c) = 1$ for $yc < 0$, and $0$ otherwise.
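The three impurity measures listed under classification trees can be sketched in Python as follows. This is only an illustration: the helper names and the toy node are hypothetical, class labels are assumed to be coded as integers 0, ..., K - 1, and the convention 0 log 0 = 0 is assumed for the entropy.

```python
import numpy as np


def class_proportions(labels, num_classes):
    """Proportion of each class among the labels falling in one node."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=num_classes)
    return counts / labels.size


def misclassification_error(p):
    """Q_m = 1 - max_k p_mk."""
    return float(1.0 - np.max(p))


def gini_index(p):
    """Q_m = sum_k p_mk * (1 - p_mk)."""
    return float(np.sum(p * (1.0 - p)))


def entropy(p):
    """Q_m = -sum_k p_mk * log(p_mk), skipping empty classes."""
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))


# Hypothetical node with two points of each of K = 2 classes.
p = class_proportions([0, 0, 1, 1], num_classes=2)
print(misclassification_error(p), gini_index(p), entropy(p))
```

For this evenly mixed node all three measures attain their maximum for K = 2; a pure node would give zero for each.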

1. This problem is composed of 10 true-or-false statements. You only have to classify these as either true or false. For this problem (only!) no motivation is required. Each correct answer scores 1 point and each incorrect answer scores -1 point. No answer scores zero points. Total score for the whole problem is capped at 0.

i. Linear discriminant analysis is a parametric model with K mean values $\mu_1, \dots, \mu_K$ (one for each of the K classes) and one single covariance matrix $\Sigma$ as its parameters.
ii. k-NN is a parametric model with the number of neighbors k as its parameter.
iii. Growing a tree deep reduces the model variance and increases the model bias.
iv. The purpose of using bagging is to reduce the model variance (compared to the base model used).
v. A k-nearest neighbor classifier can by design only handle classification problems with K = 2 classes.
vi. Linear discriminant analysis (LDA) is a classification method which cannot be used for regression.
vii. LASSO is a regularization method which adds squared penalty term $\gamma \|\beta\|_2^2$ to the cost function.
viii. Consider Y = 1 to be the positive class and Y = 0 to be the negative class. Then the true positive rate (TPR) is defined as TPR = #true positive / #all, where #true positive is the number of data points that are positive (Y = 1) and have been correctly classified as positive ($\hat{Y} = 1$), and where #all is the total number of data points.
ix. Let X be an integer valued quantity denoting the age of a person in years since birth. It is wise to handle X as a qualitative variable since it only can take integer values.
x. The model $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$ (where $\beta_0$, $\beta_1$ and $\beta_2$ are model parameters) is a linear regression model.

(10p)

2. (a) Consider the scatter plots in Figure 1 which depict three different training data sets for three binary classification problems. In which dataset(s) could the classes be well separated by...
- ...an LDA classifier?
- ...a QDA classifier?
(In both cases we assume that $X_1$ and $X_2$ are the only inputs to the classifiers.) (2p)

[Figure 1 placeholder: three scatter-plot panels titled Dataset i, Dataset ii and Dataset iii, each with horizontal axis $X_1$, vertical axis $X_2$, and a legend for the classes Y = 0 and Y = 1.]
Figure 1: Scatter plots of training data for Problems 2a and 2b.

(b) Consider again the scatter plots in Figure 1. The LDA and QDA classifiers are based on different assumptions about the properties of the data. Which dataset(s) in Figure 1 appear to correspond well to the assumptions made by LDA and QDA, respectively? (4p)

(c) Suppose we want to predict whether or not a student will pass an exam, based on the time spent studying. Historical data shows that the average study time of the students who passed the exam was $\bar{X}_{pass} = 40$ (in some unspecified unit of time). For the students who failed, the average study time was $\bar{X}_{fail} = 25$. Furthermore, the variances within these two groups were $\hat{\sigma}_{pass}^2 = 10^2$ and $\hat{\sigma}_{fail}^2 = 7^2$, respectively. Finally, 60% of the students passed the exam.

Construct a QDA classifier for predicting pass or fail based on the time spent studying X. Specifically, what is the decision boundary of the QDA classifier? What is the prediction (fail or pass) for a student who has studied 33 time units? (4p)

Note: This question can be answered independently of 2a and 2b.

3. Consider a small training dataset with only n = 3 data points, with X as input and Y as output:

    X | -1.0 | 0.0 |  1.0
    Y |  0.0 | 0.0 | -1.0

We want to model the output Y based on the input X. We choose between two regression models to fit the data,

$$Y = \alpha_0 + \epsilon \quad (1)$$
$$Y = \beta_0 + \beta_1 X + \epsilon \quad (2)$$

where $\epsilon$ is a measurement error.

(a) For each of the two models (1) and (2), find the parameters $\hat{\alpha}_0$ and $\hat{\beta} = [\hat{\beta}_0 \; \hat{\beta}_1]^T$ that minimize the mean-squared error (MSE) on the training data above. (3p)

(b) The training MSE for model (2) will always be equal to or smaller than the training MSE for model (1), regardless of the training data. Explain why! (2p)

(c) Evaluate the two linear regression models (1) and (2) by performing leave-one-out cross-validation based on MSE on each of the two models. Which model performs the best?
Hint: leave-one-out cross-validation is the same as k-fold cross-validation where k = n. (5p)

4. (a) In the context of stochastic gradient descent, what is an epoch? (1p)

(b) Consider a dataset with n = 100 000 data points for which we train a model with stochastic gradient descent where each mini-batch has the size b = 100. We run the algorithm for 10 epochs. How many iterations have been completed during training, i.e. how many times have the parameters in the network been updated? (1p)

(c) What is the main advantage of stochastic gradient descent where $b \ll n$ in comparison to normal gradient descent where $b = n$? (2p)

(d) In gradient descent, describe the impact of the learning rate $\gamma$ on the learning. What happens if the learning rate is too low? What happens if the learning rate is too high? (2p)

(e) In AdaBoost, the exponential loss function

$$L(C(X), Y) = e^{-Y C(X)}$$

is used, mainly due to computational simplicity in the AdaBoost algorithm. Another nice property is that minimizing the exponential loss forces $C(X)$ to model the log-odds, similar to logistic regression. Consider Y to be a random variable which can take either the value 1 or -1. Show that the $C^*(X)$ minimizing the exponential loss,

$$C^*(X) = \arg\min_{C(X)} E_{Y \mid X}\left[ L(C(X), Y) \right] \quad \text{where} \quad L(C(X), Y) = e^{-Y C(X)},$$

is equal to half the log-odds,

$$C^*(X) = \frac{1}{2} \log\left( \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = -1 \mid X)} \right),$$

if $\Pr(Y = 1 \mid X) \neq 0$ and $\Pr(Y = -1 \mid X) \neq 0$. (4p)

5. Consider the following dataset of n = 5 samples of bell pepper with three attributes, Domestic, Weight and Color.

    Domestic    | No    | Yes    | No     | Yes | No
    Weight (kg) | 0.3   | 0.4    | 0.3    | 0.3 | 0.2
    Color       | Green | Yellow | Yellow | Red | Red

Based on this dataset, we want to train a binary classifier to predict the attribute Domestic based on the other two attributes. We consider a logistic regression model

$$\Pr(Y = 1 \mid X; \beta) = \frac{e^{\beta^T X}}{1 + e^{\beta^T X}},$$

where the parameter $\beta$ is estimated by maximizing the log-likelihood,

$$\hat{\beta} = \arg\max_{\beta} \left( \log \ell(\beta) \right), \quad \text{where} \quad \ell(\beta) = \prod_{i=1}^n \Pr(Y = y_i \mid X = x_i; \beta).$$

(a) Which of the three attributes Domestic, Weight and Color are quantitative and qualitative, respectively? (1p)

(b) Provide numerical values of the training data $\mathcal{T} = \{x_i, y_i\}_{i=1}^n$ to be used in the formulas above. (3p)
Note: Multiple correct answers possible.

(c) Write down an explicit expression for the log-likelihood function $\log \ell(\beta)$ for the data set provided above. (2p)
Note: You don't have to do the actual maximization, only provide an explicit expression of the function $\log \ell(\beta)$ to be maximized.

(d) Consider a regression problem where we want to predict the Weight based on the other two attributes Domestic and Color. Use linear regression to find such a prediction model. What weight do you predict for a green and domestic bell pepper? (4p)