EXAM IN STATISTICAL MACHINE LEARNING / STATISTISK MASKININLÄRNING

DATE AND TIME: August 30, 2018, 14.00-19.00
RESPONSIBLE TEACHER: Niklas Wahlström
NUMBER OF PROBLEMS: 5
AIDING MATERIAL: Calculator, mathematical handbooks, 1 hand-written sheet of paper (A4, front and back) with notes and formulas
PRELIMINARY GRADES: grade 3: 23 points, grade 4: 33 points, grade 5: 43 points

Some general instructions and information:
- Your solutions can be given in Swedish or in English.
- Only write on one page of the paper.
- Write your exam code and a page number on all pages.
- Do not use a red pen.
- Use separate sheets of paper for the different problems (i.e. the numbered problems, 1-5).
- With the exception of Problem 1, all your answers must be clearly motivated! A correct answer without a proper motivation will score zero points!

Good luck!

Some useful formulas

Pages 1-3 contain some expressions that may or may not be useful for solving the exam problems. This is not a complete list of formulas used in the course! Consequently, some of the problems may require knowledge about certain expressions not listed here. Furthermore, the formulas listed below are not all self-explanatory, meaning that you need to be familiar with the expressions to be able to interpret them. Thus, the list should be viewed as a support for solving the problems, rather than as a comprehensive collection of formulas.

Marginalization and conditioning of probability densities: For a partitioned random vector $Z = (Z_1^T\ Z_2^T)^T$ with joint probability density function $p(z) = p(z_1, z_2)$, the marginal probability density function of $Z_1$ is
$$p(z_1) = \int_{\mathcal{Z}_2} p(z_1, z_2)\, dz_2$$
and the conditional probability density function for $Z_1$ given $Z_2 = z_2$ is
$$p(z_1 \mid z_2) = \frac{p(z_1, z_2)}{p(z_2)} = \frac{p(z_2 \mid z_1)\, p(z_1)}{p(z_2)}.$$

The Gaussian distribution: The probability density function of the $p$-dimensional Gaussian distribution is
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}\sqrt{\det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right), \quad x \in \mathbb{R}^p.$$

Sum of identically distributed variables: For identically distributed random variables $\{Z_i\}_{i=1}^n$ with mean $\mu$, variance $\sigma^2$ and average correlation between distinct variables $\rho$, it holds that
$$\mathrm{E}\left[\frac{1}{n}\sum_{i=1}^n Z_i\right] = \mu \quad \text{and} \quad \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n Z_i\right) = \frac{1-\rho}{n}\sigma^2 + \rho\sigma^2.$$

Linear regression and regularization: The least-squares estimate of $\beta$ in the linear regression model $Y = \beta_0 + \sum_{j=1}^p \beta_j X_j + \epsilon$ is given by
$$\hat{\beta}_{LS} = (X^T X)^{-1} X^T y,$$
where
$$X = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \text{and each } x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}.$$
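For concreteness, the least-squares estimate above can also be computed numerically; a minimal numpy sketch (not part of the original formula sheet; the data values are made up for illustration):

```python
import numpy as np

# Illustrative data: n = 4 points with p = 2 inputs (made-up numbers).
x = np.array([[0.0, 1.0],
              [1.0, 0.5],
              [2.0, -1.0],
              [3.0, 0.0]])
y = np.array([1.0, 1.5, 2.0, 3.5])

# Design matrix with a leading column of ones for the intercept beta_0.
X = np.column_stack([np.ones(len(y)), x])

# Least-squares estimate beta_LS = (X^T X)^{-1} X^T y
# (solve() instead of an explicit inverse, for numerical stability).
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ls)  # [beta_0, beta_1, beta_2]
```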

Ridge regression uses the regularization term $\lambda\|\beta\|_2^2 = \lambda\sum_{j=0}^p \beta_j^2$. The ridge regression estimate is
$$\hat{\beta}_{RR} = (X^T X + \lambda I)^{-1} X^T y.$$
LASSO uses the regularization term $\lambda\|\beta\|_1 = \lambda\sum_{j=0}^p |\beta_j|$. (The LASSO estimate does not admit a simple closed-form expression.)

Maximum likelihood: The maximum likelihood estimate is given by
$$\hat{\beta}_{ML} = \arg\max_\beta \log l(\beta),$$
where $\log l(\beta) = \sum_{i=1}^n \log p(y_i \mid \beta)$ is the log-likelihood function (this equality holds when the $n$ training data points are independent).

Logistic regression: The logistic regression model uses a linear regression for the log-odds. In the binary classification context we thus have
$$\log\left(\frac{q(X;\beta)}{1 - q(X;\beta)}\right) = \beta_0 + \sum_{j=1}^p \beta_j X_j,$$
where $q(X;\beta) = \Pr(Y = 1 \mid X; \beta)$.

For multi-class logistic regression there are two common parameterizations. The first approach uses the $K$th class as reference and models the $K-1$ log-odds as
$$\log\frac{q_k(X;\theta)}{q_K(X;\theta)} = \beta_{0k} + \sum_{j=1}^p \beta_{jk} X_j,$$
for $k = 1, \dots, K-1$. Here, $q_k(X;\theta) = \Pr(Y = k \mid X; \theta)$ and $\theta$ is a vector of all $(p+1)(K-1)$ model parameters, $\theta = \{\beta_{jk} : 0 \le j \le p,\ 1 \le k \le K-1\}$.

The second approach is based on the softmax function and models
$$\log q_k(X;\theta) \propto \beta_{0k} + \sum_{j=1}^p \beta_{jk} X_j.$$
The probabilities over the $K$ possible classes are normalized to sum to one. This is an over-parameterization with a total of $(p+1)K$ parameters, $\theta = \{\beta_{jk} : 0 \le j \le p,\ 1 \le k \le K\}$.

Discriminant Analysis: The linear discriminant analysis (LDA) classifier assigns a test input $X = x$ to the class $k$ for which
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$

is largest, where $\pi_k = n_k/n$ and $\mu_k = \frac{1}{n_k}\sum_{i:y_i=k} x_i$ for $k = 1, \dots, K$, and
$$\Sigma = \frac{1}{n-K}\sum_{k=1}^K \sum_{i:y_i=k} (x_i - \mu_k)(x_i - \mu_k)^T.$$

For quadratic discriminant analysis (QDA) we instead use the discriminant functions
$$\delta_k(x) = -\frac{1}{2}\log\det\Sigma_k - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k,$$
where $\Sigma_k = \frac{1}{n_k - 1}\sum_{i:y_i=k} (x_i - \mu_k)(x_i - \mu_k)^T$ for $k = 1, \dots, K$.

Classification trees: The cost function for tree splitting is
$$C(T) = \sum_{m=1}^{|T|} n_m Q_m,$$
where $T$ is the tree, $|T|$ the number of terminal nodes, $n_m$ the number of training data points falling in node $m$, and $Q_m$ the impurity of node $m$. Three common impurity measures for splitting classification trees are:

Misclassification error: $Q_m = 1 - \max_k p_{mk}$
Gini index: $Q_m = \sum_{k=1}^K p_{mk}(1 - p_{mk})$
Entropy/deviance: $Q_m = -\sum_{k=1}^K p_{mk}\log p_{mk}$

where $p_{mk} = \frac{1}{n_m}\sum_{i:x_i \in R_m} I(y_i = k)$.

Loss functions for classification: For a binary classifier expressed as $G(X) = \mathrm{sign}\{C(X)\}$, for some real-valued function $C(X)$, the margin is defined as $Y C(X)$ (note the convention $Y \in \{-1, 1\}$ here). A few common loss functions expressed in terms of the margin, $L(Y, C(X))$, are:

Exponential loss: $L(y, c) = \exp(-yc)$.
Hinge loss: $L(y, c) = \begin{cases} 1 - yc & \text{for } yc < 1, \\ 0 & \text{otherwise.} \end{cases}$
Binomial deviance: $L(y, c) = \log(1 + \exp(-yc))$.
Huber-like loss: $L(y, c) = \begin{cases} -yc & \text{for } yc < -1, \\ \frac{1}{4}(1 - yc)^2 & \text{for } -1 \le yc \le 1, \\ 0 & \text{otherwise.} \end{cases}$
Misclassification loss: $L(y, c) = \begin{cases} 1 & \text{for } yc < 0, \\ 0 & \text{otherwise.} \end{cases}$
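To make the margin-loss definitions concrete, here is a small numpy sketch (not part of the original formula sheet) that evaluates them on a grid of margins; the Huber-like breakpoints follow the definition reconstructed above:

```python
import numpy as np

def exponential_loss(m):
    return np.exp(-m)

def hinge_loss(m):
    return np.where(m < 1, 1 - m, 0.0)

def binomial_deviance(m):
    return np.log(1 + np.exp(-m))

def huber_like_loss(m):
    # -yc for yc < -1, (1 - yc)^2 / 4 for -1 <= yc <= 1, 0 otherwise.
    return np.where(m < -1, -m,
           np.where(m <= 1, 0.25 * (1 - m) ** 2, 0.0))

def misclassification_loss(m):
    return np.where(m < 0, 1.0, 0.0)

# Margin m = y * C(x), with the convention y in {-1, 1}.
margins = np.linspace(-2, 2, 9)
for name, loss in [("exponential", exponential_loss),
                   ("hinge", hinge_loss),
                   ("binomial deviance", binomial_deviance),
                   ("huber-like", huber_like_loss),
                   ("misclassification", misclassification_loss)]:
    print(name, np.round(loss(margins), 3))
```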

1. This problem is composed of 10 true-or-false statements. You only have to classify these as either true or false. For this problem (only!) no motivation is required. Each correct answer scores 1 point and each incorrect answer scores -1 point. No answer scores zero points. The total score for the whole problem will not be lower than 0.

i. Linear discriminant analysis is a parametric model with $K$ mean values $\mu_1, \dots, \mu_K$ (one for each of the $K$ classes) and one single covariance matrix $\Sigma$ as its parameters.

ii. k-NN is a parametric model with the number of neighbors $k$ as its parameter.

iii. Growing a tree deep reduces the model variance and increases the model bias.

iv. The purpose of using bagging is to reduce the model variance (compared to the base model used).

v. A k-nearest neighbor classifier can by design only handle classification problems with $K = 2$ classes.

vi. Linear discriminant analysis (LDA) is a classification method which cannot be used for regression.

vii. LASSO is a regularization method which adds the squared penalty term $\gamma\|\beta\|_2^2$ to the cost function.

viii. Consider $Y = 1$ to be the positive class and $Y = 0$ to be the negative class. Then the true positive rate (TPR) is defined as TPR = #true positive / #all, where #true positive is the number of data points that are positive ($Y = 1$) and have been correctly classified as positive ($\hat{Y} = 1$), and where #all is the total number of data points.

ix. Let $X$ be an integer-valued quantity denoting the age of a person in years since birth. It is wise to handle $X$ as a qualitative variable since it can only take integer values.

x. The model $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$ (where $\beta_0$, $\beta_1$ and $\beta_2$ are model parameters) is a linear regression model. (10p)

2. (a) Consider the scatter plots in Figure 1, which depict three different training data sets for three binary classification problems. In which dataset(s) could the classes be well separated by...
... an LDA classifier?
... a QDA classifier?
(In both cases we assume that $X_1$ and $X_2$ are the only inputs to the classifiers.) (2p)

[Figure 1: Scatter plots of training data for Problems 2a and 2b; three panels (Dataset i, Dataset ii, Dataset iii) showing $X_2$ versus $X_1$ with the two classes $Y = 0$ and $Y = 1$ marked.]

(b) Consider again the scatter plots in Figure 1. The LDA and QDA classifiers are based on different assumptions about the properties of the data. Which dataset(s) in Figure 1 appear to correspond well to the assumptions made by LDA and QDA, respectively? (4p)

(c) Suppose we want to predict whether or not a student will pass an exam, based on the time spent studying. Historical data shows that the average study time of the students who passed the exam was $\bar{X}_{\text{pass}} = 40$ (in some unspecified unit of time). For the students who failed, the average study time was $\bar{X}_{\text{fail}} = 25$. Furthermore, the variances within these two groups were $\hat{\sigma}^2_{\text{pass}} = 10^2$ and $\hat{\sigma}^2_{\text{fail}} = 7^2$, respectively. Finally, 60% of the students passed the exam. Construct a QDA classifier for predicting pass or fail based on the time spent studying $X$. Specifically, what is the decision boundary of the QDA classifier? What is the prediction (fail or pass) for a student who has studied 33 time units? (4p)

Note: This question can be answered independently of 2a and 2b.
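For Problem 2c, the univariate QDA discriminants can also be evaluated numerically; a minimal sketch under the stated class means, variances, and priors (an illustration, not the intended pen-and-paper solution):

```python
import numpy as np

# Summary statistics from Problem 2c.
mu = {"pass": 40.0, "fail": 25.0}
sigma2 = {"pass": 10.0 ** 2, "fail": 7.0 ** 2}
prior = {"pass": 0.6, "fail": 0.4}

def qda_discriminant(x, k):
    """Univariate QDA: -1/2 log(sigma_k^2) - (x - mu_k)^2 / (2 sigma_k^2) + log(pi_k)."""
    return (-0.5 * np.log(sigma2[k])
            - 0.5 * (x - mu[k]) ** 2 / sigma2[k]
            + np.log(prior[k]))

def predict(x):
    return "pass" if qda_discriminant(x, "pass") > qda_discriminant(x, "fail") else "fail"

print(predict(33.0))  # prediction for a student who studied 33 time units

# The decision boundary is where the two discriminants are equal;
# a crude numerical search over a grid of (non-negative) study times:
grid = np.linspace(0.0, 60.0, 60001)
diff = qda_discriminant(grid, "pass") - qda_discriminant(grid, "fail")
boundary = grid[np.argmin(np.abs(diff))]
print(boundary)
```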

3. Consider a small training dataset with only n = 3 data points, with X as input and Y as output:

X: -1.0  0.0  1.0
Y:  0.0  0.0 -1.0

We want to model the output Y based on the input X. We choose between two regression models to fit the data:

$Y = \alpha_0 + \epsilon$   (1)
$Y = \beta_0 + \beta_1 X + \epsilon$   (2)

where $\epsilon$ is a measurement error.

(a) For both of the two models (1) and (2), find the parameters $\hat{\alpha}_0$ and $\hat{\beta} = [\hat{\beta}_0\ \hat{\beta}_1]^T$ that minimize the mean-squared error (MSE) on the training data above. (3p)

(b) The training MSE for model (2) will always be equal to or smaller than the training MSE for model (1), regardless of the training data. Explain why! (2p)

(c) Evaluate the two linear regression models (1) and (2) by performing leave-one-out cross-validation based on MSE for each of the two models. Which model performs best?
Hint: leave-one-out cross-validation is the same as k-fold cross-validation where k = n. (5p)
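A small numpy sketch of the leave-one-out procedure in Problem 3c (illustrative only, not the intended pen-and-paper solution): fit each model on two of the three points, predict the held-out point, and average the squared errors.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])
y = np.array([0.0, 0.0, -1.0])

def loo_mse(fit_predict):
    """Leave-one-out CV: average squared error over the n held-out points."""
    errors = []
    for i in range(len(x)):
        train = np.arange(len(x)) != i
        y_hat = fit_predict(x[train], y[train], x[i])
        errors.append((y[i] - y_hat) ** 2)
    return np.mean(errors)

def model1(x_tr, y_tr, x_new):
    # Model (1): constant alpha_0; the least-squares fit is the mean of the training outputs.
    return np.mean(y_tr)

def model2(x_tr, y_tr, x_new):
    # Model (2): straight line beta_0 + beta_1 * x, fitted by least squares.
    beta_1, beta_0 = np.polyfit(x_tr, y_tr, deg=1)
    return beta_0 + beta_1 * x_new

print("LOO MSE, model (1):", loo_mse(model1))
print("LOO MSE, model (2):", loo_mse(model2))
```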

4. (a) In the context of stochastic gradient descent, what is an epoch? (1p)

(b) Consider a dataset with n = 100 000 data points for which we train a model with stochastic gradient descent, where each mini-batch has the size b = 100. We run the algorithm for 10 epochs. How many iterations have been completed during training, i.e. how many times have the parameters in the network been updated? (1p)

(c) What is the main advantage of stochastic gradient descent with $b \ll n$ in comparison to normal gradient descent, where $b = n$? (2p)

(d) In gradient descent, describe the impact of the learning rate $\gamma$ on the learning. What happens if the learning rate is too low? What happens if the learning rate is too high? (2p)

(e) In AdaBoost, the exponential loss function $L(C(X), Y) = e^{-Y C(X)}$ is used, mainly due to computational simplicity in the AdaBoost algorithm. Another nice property is that minimizing the exponential loss forces $C(X)$ to model the log-odds, similar to logistic regression. Consider $Y$ to be a random variable which can take either the value 1 or -1. Show that the $C^*(X)$ minimizing the exponential loss,
$$C^*(X) = \arg\min_{C(X)} \mathrm{E}_{Y \mid X}\left[L(C(X), Y)\right], \quad \text{where } L(C(X), Y) = e^{-Y C(X)},$$
is equal to half the log-odds,
$$C^*(X) = \frac{1}{2}\log\left(\frac{\Pr(Y = 1 \mid X)}{\Pr(Y = -1 \mid X)}\right),$$
if $\Pr(Y = 1 \mid X) \neq 0$ and $\Pr(Y = -1 \mid X) \neq 0$. (4p)
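A quick numerical check of the claim in (e), not the requested proof: for a fixed input with class probability $p = \Pr(Y = 1 \mid X = x)$, the expected exponential loss $p\,e^{-c} + (1-p)\,e^{c}$ should be minimized at $c = \frac{1}{2}\log\frac{p}{1-p}$. A minimal sketch, with $p = 0.7$ chosen arbitrarily:

```python
import numpy as np

p = 0.7  # illustrative value of Pr(Y = 1 | X = x)

# Grid search over candidate values c = C(x).
c_grid = np.linspace(-3, 3, 100001)
expected_loss = p * np.exp(-c_grid) + (1 - p) * np.exp(c_grid)

c_star_numeric = c_grid[np.argmin(expected_loss)]
c_star_formula = 0.5 * np.log(p / (1 - p))
print(c_star_numeric, c_star_formula)  # both approximately 0.424
```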

5. Consider the following dataset of n = 5 samples of bell pepper with three attributes: Domestic, Weight and Color.

Domestic:     No     Yes     No      Yes    No
Weight (kg):  0.3    0.4     0.3     0.3    0.2
Color:        Green  Yellow  Yellow  Red    Red

Based on this dataset, we want to train a binary classifier to predict the attribute Domestic based on the other two attributes. We consider a logistic regression model
$$\Pr(Y = 1 \mid X; \beta) = \frac{e^{\beta^T X}}{1 + e^{\beta^T X}},$$
where the parameter $\beta$ is estimated by maximizing the log-likelihood
$$\hat{\beta} = \arg\max_\beta \log l(\beta), \quad \text{where} \quad l(\beta) = \prod_{i=1}^n \Pr(Y = y_i \mid X = x_i; \beta).$$

(a) Which of the three attributes Domestic, Weight and Color are quantitative and qualitative, respectively? (1p)

(b) Provide numerical values of the training data $T = \{x_i, y_i\}_{i=1}^n$ to be used in the formulas above. (3p)
Note: Multiple correct answers possible.

(c) Write down an explicit expression for the log-likelihood function $\log l(\beta)$ for the dataset provided above. (2p)
Note: You don't have to do the actual maximization, only provide an explicit expression of the function $\log l(\beta)$ to be maximized.

(d) Consider a regression problem where we want to predict the Weight based on the other two attributes, Domestic and Color. Use linear regression to find such a prediction model. What weight do you predict for a green and domestic bell pepper? (4p)
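One possible numerical encoding for Problem 5b, together with the resulting log-likelihood of 5c, as a sketch; the dummy-variable choice for Color (Green as reference level) and the 0/1 coding of Domestic are one of several valid options, not the only correct answer:

```python
import numpy as np

# One possible encoding (others are equally valid):
# y: Domestic, with Yes = 1, No = 0.
# x: [1 (intercept), Weight, Color=Yellow dummy, Color=Red dummy],
#    with Green as the reference level for Color.
y = np.array([0, 1, 0, 1, 0])
X = np.array([
    [1, 0.3, 0, 0],   # Green
    [1, 0.4, 1, 0],   # Yellow
    [1, 0.3, 1, 0],   # Yellow
    [1, 0.3, 0, 1],   # Red
    [1, 0.2, 0, 1],   # Red
])

def log_likelihood(beta):
    """log l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ] for y_i in {0, 1}."""
    z = X @ beta
    return np.sum(y * z - np.log(1 + np.exp(z)))

print(log_likelihood(np.zeros(4)))  # equals 5 * log(0.5) at beta = 0
```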