

Introduction to Machine Learning 10-701 Midterm, Tues April 8

[1 point] Name:                      Andrew ID:

Instructions:
You are allowed a (two-sided) sheet of notes.
Exam ends at 2:45pm.
Take a deep breath and don't spend too long on any one question!

                      Max    Score
Name & Andrew ID        1
True/False             40
Short Questions        59
Total                 100

1 True or False questions [40 pts = 2 × 20]

Answer True or False. Justify your answer very briefly in 1-2 sentences.

1. Gaussian Mixture Model is essentially a probability distribution.
True. A GMM specifies the distribution p(x) = ∑_{i=1}^{K} π_i N(x | µ_i, Σ_i).

2. When the feature space is larger, overfitting is less likely.
False. The more features there are, the higher the complexity of the model and hence the greater its ability to overfit the training data.

3. The EM iterations never decrease the likelihood function of the data.
True. Both the E and M steps maximize a lower bound on the likelihood function of the data, and hence never decrease it.

4. Non-parametric models do not have parameters.
False. Non-parametric models can have parameters, e.g. kernel regression has the bandwidth parameter; rather, the number of parameters scales with the size of the dataset.

5. In kernel density estimation, a large kernel bandwidth will result in low bias.
False. A large kernel bandwidth results in more smoothing and a poorer approximation, resulting in higher bias.

6. Non-parametric models are usually more efficient than parametric models in terms of model storage.
False. Non-parametric models either need to look at the entire dataset to predict the label of test points, or require the number of parameters to scale with the dataset size, and hence require more storage.

7. Increasing the regularization parameter λ in lasso regression leads to sparser regression coefficients.
True. A larger regularization parameter penalizes non-zero coefficients more, leading to a sparser solution.

8. Boosting decision stumps can result in a quadratic decision boundary.
False. The sign of a finite linear combination of decision stumps always results in a piecewise linear decision boundary.

9. Decision trees are generative classifiers.
False. Decision trees do not assume a model for the input feature distribution, hence are not generative.

10. The following statement always holds for any joint probability distribution:
p(X_1, X_2, X_3, ..., X_N) = p(X_1) p(X_2 | X_1) p(X_3 | X_1, X_2) ... p(X_N | X_1, X_2, ..., X_{N-1})
True. This simply follows from the chain rule.

11. For a fixed size of the training and test set, increasing the complexity of the model always leads to a reduction of the test error.
False. Increasing the complexity of the model for a fixed training and test set leads to overfitting the training data and a reduction in training error, but not in test error.

12. Suppose you wish to predict the age of a person from his/her brain scan using regression, but you only have 10 subjects and each subject is represented by the brain activity at 20,000 regions in the brain. You would prefer to use least squares regression instead of ridge regression.
False. When the number of data points (subjects) is less than the number of features, the least squares solution needs to be regularized to prevent overfitting, hence we prefer ridge regression.

13. The Local Markov Property states that a node is independent of its non-descendants given at least one of its parents.
False. The Local Markov Property states that a node is independent of its non-descendants given all of its parents.

14. HMM is a generative model.
True. An HMM assumes a model for the data generating process.

15. The goal of the Viterbi algorithm is to learn the parameters of the Hidden Markov Model.
False. The Baum-Welch algorithm is used to learn the parameters of an HMM; Viterbi is used to infer the most likely state assignment.

16. When doing kernel regression on a memory-constrained device, you should prefer to use a box kernel instead of a Gaussian kernel.
True. A box kernel only uses a few data points for prediction and hence does not need to load the entire dataset into memory, unlike a Gaussian kernel which assigns non-zero weight to all training data points.

17. To predict the chance that the Steelers football team will win the Super Bowl Championship next year, you should prefer to use logistic regression instead of decision trees.
True. Logistic regression will characterize the probability (chance) of the label being win or loss, whereas a decision tree will simply output the decision (win or loss).

18. The kmeans algorithm finds the global optimum of the kmeans cost function.
False. The kmeans cost function is non-convex and the algorithm is only guaranteed to converge to a local optimum.

19. Let X ∈ R^{n×m} be the data matrix of m n-dimensional data instances. Let X = UDV^T denote the singular value decomposition of X, where D is a diagonal matrix containing the singular values ordered from largest to smallest. The 1st principal component of X is the first column vector of V.
False. The 1st principal component of X is the first column vector of U (see the sketch after this section).

20. The goal of independent component analysis is to transform the datapoints with a linear transformation in such a way that they become linearly independent.
False. ICA transforms the datapoints with a linear transformation in such a way that they become statistically independent, not linearly independent.
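The claim in question 19 can be checked numerically. The following numpy sketch is an editorial addition (not part of the original exam solutions): it builds a data matrix whose columns are centered data instances, takes its SVD, and confirms that the first column of U matches, up to sign, the top eigenvector of the feature-space covariance. The variable names and random data are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 200                              # n features, m data instances
X = rng.standard_normal((n, m))            # columns are data points
X = X - X.mean(axis=1, keepdims=True)      # center each feature

# SVD: X = U D V^T, singular values ordered from largest to smallest
U, D, Vt = np.linalg.svd(X, full_matrices=False)

# Top eigenvector of the feature-space covariance X X^T / m
_, eigvecs = np.linalg.eigh(X @ X.T / m)   # eigenvalues in ascending order
top = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

# The 1st principal component is the first column of U (up to sign)
print(np.isclose(abs(U[:, 0] @ top), 1.0))  # True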

2 Short questions [59 pts]

2.1 Loss functions [5 pts]

Match each predictor to the loss function it typically minimizes:

boosting: exponential loss
logistic regression: log loss
k-NN classifier: 0/1 loss
linear regression: squared loss
SVM classification: hinge loss

2.2 Decision boundaries [10 pts]

For each classifier, circle the type(s) of decision boundary it can yield for a binary classification problem. In some cases more than one option may be correct; circle all options that you think are correct.

decision trees: linear, piecewise linear
(non-kernel) SVM: linear
logistic regression: linear
Gaussian Naive Bayes: linear, quadratic
boosting: linear, piecewise linear, quadratic

2.3 Independence [6 pts]

Give an example of a 2D distribution where the marginals are uncorrelated, but they are statistically dependent by construction.

There are many good solutions. Possible solutions: a 2D uniform distribution rotated by an angle α with 0 < α < π/2, or the uniform distribution on a disk. Discrete solutions can also be accepted if it is proved that the marginals are uncorrelated but not independent. A 2D Gaussian is not a good solution, since for jointly Gaussian variables uncorrelated implies independent.
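As a sanity check of the uniform-on-a-disk example above, here is a small numpy sketch (an editorial addition, not part of the original solutions). It samples points uniformly from the unit disk and verifies empirically that X and Y are essentially uncorrelated yet dependent, since E[X^2 Y^2] differs from E[X^2] E[Y^2].

import numpy as np

rng = np.random.default_rng(1)

# Sample uniformly from the unit disk by rejection sampling
pts = rng.uniform(-1, 1, size=(200000, 2))
pts = pts[(pts ** 2).sum(axis=1) <= 1.0]
x, y = pts[:, 0], pts[:, 1]

# Uncorrelated: the empirical covariance is close to 0
print("cov(X, Y)      ~", np.cov(x, y)[0, 1])

# Dependent: E[X^2 Y^2] (~ 1/24) differs from E[X^2] E[Y^2] (~ 1/16)
print("E[X^2 Y^2]     ~", np.mean(x ** 2 * y ** 2))
print("E[X^2] E[Y^2]  ~", np.mean(x ** 2) * np.mean(y ** 2))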

2.4 Leave-one-out error [6 pts]

What is the leave-one-out error on the following dataset (Figure 1) of
1. 1-NN: 2 (or 2/4)
2. 3-NN: 1 (or 1/4)

[Figure 1: Figure for Leave-one-out Cross-validation]

2.5 SVM kernels [10 pts]

The following figure (a) shows 6 labeled training data points from two classes.

[Figures (a) and (b) for the SVM kernels question]

1. Let's consider the following kernel: K(x, z) = z^T x / (‖x‖ ‖z‖). Prove that K(x, z) is a kernel.
Let φ(x) = x / ‖x‖. Then K(x, z) = φ(x)^T φ(z), therefore K is a kernel function with feature map φ (see the sketch after this question).

2. Map each data point in the original feature space shown in figure (a) to the new feature space representation implied by K(x, z) in figure (b).
Drawn (each point is projected onto the unit circle).

3. Is the training data linearly separable in the new feature space? [Yes or No?]
Yes.

4. In figure (b), draw the decision boundary for the maximum margin separator in this implicit feature space. A qualitative drawing is sufficient.
Drawn.

5. In figure (a), draw the decision boundary in the original feature space resulting from this kernel. A qualitative drawing is sufficient.
Drawn.
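The kernel above is read here as the cosine-similarity kernel (the norms were lost in transcription, so this is a reconstruction). Under that assumption, the following numpy sketch, an editorial addition, checks numerically that K(x, z) = φ(x)^T φ(z) with φ(x) = x/‖x‖ and that the resulting Gram matrix is positive semidefinite, as a kernel's Gram matrix must be.

import numpy as np

def phi(x):
    # Assumed feature map: project x onto the unit circle/sphere
    return x / np.linalg.norm(x)

def K(x, z):
    # Assumed kernel: cosine similarity between x and z
    return (z @ x) / (np.linalg.norm(x) * np.linalg.norm(z))

rng = np.random.default_rng(2)
pts = rng.standard_normal((6, 2))          # 6 illustrative 2-D points

# K agrees with the inner product of the mapped points
print(all(np.isclose(K(x, z), phi(x) @ phi(z)) for x in pts for z in pts))

# The Gram matrix is positive semidefinite
G = np.array([[K(x, z) for z in pts] for x in pts])
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))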

2.6 Graphical Model [8 pts]

[Figure 2: Bayesian network for the Graphical Model question, over nodes A, B, C, D, E, F, G, H]

For this question, refer to the graphical model in Figure 2.

What is the factorization of the joint probability distribution p(A, B, C, D, E, F, G, H) according to this graphical model?

p(A) p(B) p(C | A, B) p(D | C) p(E | C) p(F | C) p(G | D, E) p(H | E, F)

Assume that every random variable is a discrete random variable with K possible values. How many parameters are needed to represent this graphical model?

O(2K + 3K^3 + 3K^2): a table of size O(K) each for p(A) and p(B), of size O(K^3) each for p(C | A, B), p(G | D, E) and p(H | E, F), and of size O(K^2) each for p(D | C), p(E | C) and p(F | C) (a counting sketch follows this question).

Is the following statement true or false? Explain why.

A ⊥ H
False. The path A → C → F → H does not meet the d-separation criterion (other solutions are possible too).

A ⊥ H | {B, D, F}
False. The path A → C → E → H does not meet the d-separation criterion when conditioned on the set {B, D, F}.
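To make the parameter count above concrete, here is a short Python sketch (an editorial addition, not part of the original solutions) that derives the exact number of free parameters from the parent sets read off the factorization, using K - 1 free values per configuration of the parents; the big-O count quoted in the solution ignores this -1.

# Parent sets read off the factorization above
parents = {
    "A": [], "B": [],
    "C": ["A", "B"],
    "D": ["C"], "E": ["C"], "F": ["C"],
    "G": ["D", "E"],
    "H": ["E", "F"],
}

def num_parameters(parents, K):
    # Each CPT p(X | pa(X)) needs (K - 1) free values per parent configuration
    return sum((K - 1) * K ** len(pa) for pa in parents.values())

for K in (2, 3, 10):
    print(K, num_parameters(parents, K))
# The counts grow as 2K + 3K^2 + 3K^3, matching the solution's big-O answer.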

2.7 Hidden Markov Models [6 pts]

[Figure 3: Figure for the HMM question, showing states X_{t-2}, X_{t-1}, X_t with corresponding observations Y_{t-2}, Y_{t-1}, Y_t]

Answer true or false. Explain your reasoning in 1-2 sentences.

Refer to Figure 3. Suppose that we know the parameters of this HMM. Having observed X_{t-2} and X_{t-1}, we can make a better prediction for X_t than observing only X_{t-1}.
False. X_t is conditionally independent of X_{t-2} given X_{t-1}, therefore X_{t-2} cannot provide more information for predicting X_t once X_{t-1} is given.

Refer to Figure 3. Suppose that we don't know the parameters of this HMM. Having observed Y_{t-2} and Y_{t-1}, we can make a better prediction for Y_t than observing only Y_{t-1}.
True. More training data can help to learn better parameters, which can lead to better prediction.

Exact inference with variable elimination is intractable in (first-order) HMMs in general.
False. The graph is a tree, so exact inference is tractable (see the forward-algorithm sketch below).
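To illustrate why exact inference in a first-order HMM is tractable, here is a minimal numpy sketch of the forward algorithm, an editorial addition (the transition, emission, and initial distributions are made-up toy values). It computes the likelihood of an observation sequence exactly in O(T K^2) time.

import numpy as np

# Toy HMM with K = 2 hidden states and 3 observation symbols (made-up values)
A = np.array([[0.7, 0.3],            # A[i, j] = p(next state j | current state i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # B[i, k] = p(observation k | state i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])            # initial state distribution

def forward_likelihood(obs):
    # Exact p(obs) via the forward algorithm: O(T K^2) for T observations
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 2, 1, 1]))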

2.8 MLE [8 pts]

We have a random variable X drawn from a Poisson distribution. The Poisson distribution is a discrete distribution and X can be any non-negative integer. The probability of X at a point x is p(x) = λ^x e^{-λ} / x!. Given points x_1, ..., x_n, write down the MLE estimate of λ.

p(x_1, ..., x_n | λ) = ∏_{i=1}^{n} λ^{x_i} e^{-λ} / x_i!

log p(x_1, ..., x_n | λ) = (∑_{i=1}^{n} x_i) log λ - nλ - ∑_{i=1}^{n} log(x_i!)

0 = ∂/∂λ log p(x_1, ..., x_n | λ) = (∑_{i=1}^{n} x_i) (1/λ) - n

λ̂ = (1/n) ∑_{i=1}^{n} x_i
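As a quick numerical check of the closed form λ̂ = (1/n) ∑ x_i, the following numpy sketch (an editorial addition, not part of the original solutions) draws synthetic Poisson samples and confirms that the sample mean maximizes the log-likelihood over a grid of candidate values for λ.

import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(lam=4.2, size=10000)       # synthetic data, true λ = 4.2
n = len(x)

def log_likelihood(lam):
    # Poisson log-likelihood up to the additive constant -sum(log(x_i!))
    return x.sum() * np.log(lam) - n * lam

lam_hat = x.mean()                          # closed-form MLE: the sample mean

grid = np.linspace(0.1, 10.0, 10000)        # grid search agrees with it
lam_grid = grid[np.argmax(log_likelihood(grid))]

print(lam_hat, lam_grid)                    # both close to 4.2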