UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

Exam policy: This exam allows two one-page, two-sided cheat sheets; no other materials. Time: 2 hours.

Be sure to write your name and Penn student ID (the 8 bigger digits on your ID card) on the scantron form and fill in the associated bubbles in pencil. If you are taking this as a WPE, then enter only your WPE exam number. If you think a question is ambiguous, mark what you think is the best answer. The questions seek to test your general understanding; they are not intentionally trick questions. As always, we will consider written regrade requests if your interpretation of a question differed from what we intended. We will only grade the scantron forms.

For the TRUE or FALSE questions, note that TRUE is (a) and FALSE is (b). For the multiple choice questions, select exactly one answer. The exam is 10 pages long and has 93 questions.

Name:

1. [0 points] This is version A of the exam. Please fill in the bubble for that letter.

2. [1 points] True or False? Ridge regression finds the global optimum for minimizing its loss function (squared error plus an appropriate penalty).

3. [2 points] Ridge regression minimizes which of the following? (Assume, as usual, n observations.)
(a) $\sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2$
(b) $\sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_2$
(c) $(1/n) \sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2$
(d) $(1/n) \sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_2$
(e) $\sum_i (y_i - w^\top x_i)^2 - \lambda \|w\|_2$
A
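
A brief aside on the objective in question 3, option (a): it has the closed-form minimizer $w = (X^\top X + \lambda I)^{-1} X^\top y$. The numpy sketch below is my own illustration, not part of the exam; the synthetic data and the helper name `ridge_objective` are assumptions chosen purely for demonstration.

```python
import numpy as np

# Ridge objective from question 3(a):  sum_i (y_i - w^T x_i)^2 + lam * ||w||_2^2
rng = np.random.default_rng(0)
n, p, lam = 100, 5, 1.0
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form minimizer: w = (X^T X + lam * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_objective(w):
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# The closed-form solution should score no worse than a nearby perturbation of it.
print(ridge_objective(w_ridge), ridge_objective(w_ridge + 0.01))
```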

4. [2 points] When doing linear regression with n = 1,000,000 observations and p = 10,000 features, if one expects around 500 or 1,000 features to enter the model, the best penalty to use is
(a) AIC penalty
(b) BIC penalty
(c) RIC penalty
(d) no penalty
B

5. [1 points] True or False? The conjugate prior to the Gaussian distribution is a Gaussian distribution.

6. [1 points] True or False? Lasso finds the global optimum for minimizing its loss function (squared error plus an appropriate penalty).

7. [1 points] True or False? Stepwise regression with a BIC penalty term finds the global optimum for minimizing its loss function (squared error plus an appropriate penalty).

8. [1 points] True or False? The EM algorithm finds the global optimum for clustering data from a mixture of Gaussians, assuming the number of clusters K is set to the correct value.

9. [1 points] True or False? When using k-nearest neighbors, the implementation choice that most impacts prediction accuracy is usually the choice of k.
False: the distance metric is more important than k.

10. [2 points] Which of the following loss functions is most sensitive to outliers?
(a) Hinge loss
(b) Squared loss
(c) Exponential loss
(d) 0-1 loss
C

11. [1 points] True or False? For small training sets, Naive Bayes generally is more accurate than logistic regression.

12. [1 points] True or False? Ordinary least squares (linear regression) can be formulated either as minimizing an $L_2$ loss function or as maximizing a likelihood function.

13. [1 points] True or False? Naive Bayes can be formulated either as minimizing an $L_2$ loss function or as maximizing a likelihood function.

14. [1 points] True or False? Naive Bayes, as used in practice, is generally an MLE algorithm.
False - usually MAP.

15. [1 points] True or False? Ridge regression can be formulated as an MLE algorithm.

16. [1 points] True or False? SVMs are generally formulated as MLE algorithms.

17. [1 points] True or False? One can make a good argument that minimizing an $L_1$ loss penalty in regression gives better results than the more traditional $L_2$ loss function minimized by ordinary least squares.

18. [1 points] True or False? $L_2$ penalized linear regression is, in general, more sensitive to outliers than $L_1$ penalized linear regression. (I.e., one point that is far from the predicted regression line will have more effect on the regression coefficients.)

19. [1 points] True or False? Large margin methods like SVMs tend to be slightly less accurate in predictions (when measured with an $L_0$ loss function) than logistic regression.

20. [1 points] True or False? One can do kernelized logistic regression to get many of the same benefits one would get using kernels in SVMs.

21. [1 points] True or False? It is not possible to both reduce bias and reduce variance by changing model forms (e.g. by adding a kernel to an SVM).

22. [1 points] True or False? It is sometimes useful to do linear regression in the dual space.

23. [1 points] True or False? Linear SVMs tend to overfit more than standard logistic regression.

24. [2 points] You are doing ridge regression. You first estimate a regression model with some data. You then estimate a second model with four times as many observations (but the same ridge penalty). Roughly how do you expect the regression coefficients to change when more data is used?
(a) The coefficients should on average shrink towards zero (become smaller in absolute value).
(b) The coefficients should on average move away from zero (become larger in absolute value).
(c) The coefficients will not change.
B
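
To build intuition for why (b) is the answer to question 24, here is a small simulation sketch. It is my own illustration, not exam material: the data-generating process, the penalty value, and the helper name `ridge_fit` are arbitrary assumptions. With the ridge penalty held fixed, quadrupling the sample size makes the data term dominate the penalty, so the fitted coefficients are shrunk relatively less.

```python
import numpy as np

rng = np.random.default_rng(1)
p, lam = 10, 50.0
w_true = rng.normal(size=p)

def ridge_fit(n):
    X = rng.normal(size=(n, p))
    y = X @ w_true + rng.normal(size=n)
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_small = ridge_fit(200)
w_large = ridge_fit(800)   # four times as many observations, same penalty

# With the same penalty, the larger sample is shrunk relatively less,
# so its coefficients are typically larger in absolute value on average.
print(np.mean(np.abs(w_small)), np.mean(np.abs(w_large)))
```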

25. [1 points] True or False? Suppose we have two datasets $X_1$ and $X_2$ which each represent the same observations (and the same Y labels) except that $X_1$'s features are a subset of the features of $X_2$. (I.e., $X_2$ is $X_1$ with extra columns added.) Stepwise linear regression with an $L_0$ penalty will always add at least as many features when trained on the bigger feature set, $X_2$.

26. [1 points] True or False? RIC (Risk Inflation Criterion) can be viewed as an MDL method.

27. [1 points] True or False? If you expect a tiny fraction of the features to enter a model, BIC is a better penalty to use than RIC.

28. [1 points] True or False? If you expect a tiny fraction of the features to enter a model, an appropriate $L_0$ penalty will probably work better than an $L_1$ penalty.

29. [1 points] True or False? The elastic net tends to select fewer features than well-optimized $L_0$ penalty methods.

30. [1 points] True or False? Estimating the elastic net is a convex optimization problem, and hence relatively fast.

31. [1 points] True or False? The appropriate penalty in $L_1$-penalized linear regression can be determined by theory, e.g. using an MDL approach.

32. [1 points] True or False? The larger (in magnitude) coefficients in a linear regression correspond to (multiply) the more important features (those that when changed have a larger effect on y).
False: it depends on how the features are scaled.

33. [1 points] True or False? KL divergence can be used to measure (dis)similarity when doing k-means clustering.

34. [1 points] True or False? KL divergence is a valid distance metric.

35. [1 points] True or False? KL divergence is generally used to see how well one distribution approximates another one.

36. [1 points] True or False? Given a label drawn from the following probability distribution P(Apple) = 0.5, P(Banana) = 0.25, P(Carrot) = 0.25, its entropy (in bits) is $-0.5\log_2(0.5) - 0.5\log_2(0.5) - 0.25\log_2(0.25) - 0.75\log_2(0.75) - 0.25\log_2(0.25) - 0.75\log_2(0.75)$.
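
For question 36, the entropy itself is quick to compute: $H(P) = -\sum_k p_k \log_2 p_k = 0.5 + 0.5 + 0.5 = 1.5$ bits for this distribution. A one-line check (my own illustration, not part of the exam):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])         # P(Apple), P(Banana), P(Carrot)
entropy_bits = -np.sum(p * np.log2(p))  # H(P) = -sum_k p_k log2(p_k)
print(entropy_bits)                     # 1.5
```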

37. [1 points] True or False? The theory behind boosting is based on there being a weak classifier $h_t$ whose error on the weighted dataset is always less than (or equal to) 0.5.

38. [1 points] True or False? At each round t of boosting, the minimum error of the classifier on the weighted dataset $\epsilon_t$ is a monotone non-decreasing function, i.e. $\epsilon_i \le \epsilon_j$ for all $i < j$.

39. [1 points] True or False? Hinge loss is upper bounded by the loss function of boosting.

40. [1 points] True or False? Boosting usually averages over many decision trees, each of which is learned without pruning.

41. [1 points] True or False? In general, if you have multiple methods for making predictions, it is better to pick the best one rather than using the majority vote of the methods as the prediction.

42. [1 points] True or False? Perceptrons are an online alternative to SVMs that generally converge to the same solution as SVMs.

43. [1 points] True or False? Averaged and voted perceptrons result in models that contain similar numbers of parameters and hence are of similar accuracy and similar computational cost to use in making predictions.

44. [2 points] Which of the following classifiers has the lowest 0-1 error ($L_0$ loss) given a training set with an infinite number of observations?
(a) Standard SVM with optimal regularization
(b) Logistic regression with optimal regularization
(c) Naive Bayes
(d) Bayesian classifier (one that classifies using an estimated model $p(y \mid x; \theta)$ using the true distributional form, i.e. the correct equation for $p(y \mid x)$)
D

45. [2 points] An SVM using a Gaussian kernel gives the same separating hyperplane as a linear SVM when σ in the denominator of the exponential in the Gaussian approaches:
(a) 0
(b) ∞
(c) the Gaussian kernel SVM does not give the same hyperplane as the linear SVM in any limit
C: As sigma goes to infinity, the Gaussian kernel becomes $\exp(-\|a - b\|/\sigma^2) \to \exp(0) = 1$: a singular kernel which is not the same as the linear kernel. Therefore, it doesn't become the linear SVM under any circumstance.
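
A quick numerical companion to the answer above. This is my own illustration, not exam material: I use the common squared-distance form $k(a, b) = \exp(-\|a - b\|^2/\sigma^2)$, and the helper name `gaussian_kernel` is an assumption. As σ grows, every entry of the Gaussian kernel matrix approaches 1, i.e. the kernel becomes a constant and carries no information about the data.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

X = np.random.default_rng(2).normal(size=(5, 3))
for sigma in [1.0, 10.0, 100.0]:
    K = gaussian_kernel(X, sigma)
    # As sigma grows, all entries tend to 1 (a constant, rank-one kernel matrix).
    print(sigma, round(K.min(), 4), round(K.max(), 4))
```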

46. [1 points] True or False? SVMs are, in general, more sensitive to the actual distribution of the data than logistic regression is.

47. [2 points] Which model generally has the highest bias?
(a) linear SVM
(b) Gaussian SVM
(c) perceptron with constant update
C

48. [2 points] It is best to use the primal SVM when... (Choose the best answer.)
(a) p >> n
(b) n >> p
(c) we don't have a kernel
(d) the dual doesn't exist
B

49. [2 points] What is the objective for $L_2$ regularized $L_1$-loss SVM?
(a) $\min \|w\|_1 + C \sum_i \xi_i$ subject to $y_i(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1 \ldots n$
(b) $\min\, (1/2)\|w\|_2^2 + C \sum_i \xi_i$ subject to $y_i(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1 \ldots n$
(c) $\min \|w\|_1 + C \sum_i \xi_i^2$ subject to $y_i(w^\top x_i + b) = 1 - \xi_i$ for $i = 1 \ldots n$
(d) $\min\, (1/2)\|w\|_2^2 + C \sum_i \xi_i^2$ subject to $y_i(w^\top x_i + b) = 1 - \xi_i$ for $i = 1 \ldots n$
(e) None of the above
B
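
As a side note on question 49: the constrained form in option (b) is equivalent to the unconstrained hinge-loss objective $(1/2)\|w\|_2^2 + C\sum_i \max(0,\, 1 - y_i(w^\top x_i + b))$, obtained by solving for the optimal slack variables. A minimal sketch that evaluates it (my own illustration; the toy data and the helper name `svm_objective` are assumptions, not exam material):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """L2-regularized, L1 (hinge) loss SVM objective:
    0.5 * ||w||_2^2 + C * sum_i max(0, 1 - y_i * (w^T x_i + b))."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

# Tiny toy problem (labels in {-1, +1}) just to show the call.
X = np.array([[1.0, 2.0], [-1.5, 0.5], [0.3, -2.0]])
y = np.array([1, -1, -1])
print(svm_objective(np.array([0.5, -0.5]), 0.1, X, y, C=1.0))
```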

50. [2 points] Which type(s) of SVM will tend to have the largest number of support vectors for any given dataset?
(a) $L_1$ regularized $L_1$ loss SVM
(b) $L_1$ regularized $L_2$ loss SVM
(c) $L_2$ regularized $L_1$ loss SVM
(d) $L_2$ regularized $L_2$ loss SVM
(e) Two of the above
E

51. [1 points] True or False? In the limit of infinite data ($n \to \infty$), $L_1$ regularized $L_1$ loss SVM and $L_2$ regularized $L_1$ loss SVM will give the same hyperplane.

52. [1 points] True or False? In the limit of infinite data ($n \to \infty$), $L_2$ regularized $L_1$ loss SVM and $L_2$ regularized $L_2$ loss SVM will give the same hyperplane.

53. [2 points] The effect of increasing the loss penalty, C, on $L_2$ regularized $L_1$ loss SVM is:
(a) Increased bias, decreased variance
(b) Increased bias, increased variance
(c) Decreased bias, decreased variance
(d) Decreased bias, increased variance
(e) Not enough information to tell
D: As C increases, the relative weight of the $(1/2)\|w\|_2^2$ term goes down, and therefore the margin shrinks. Since the margin shrinks, the model has higher variance but lower bias.

54. [2 points] Suppose we have two datasets $X_1$, $X_2$ representing the same observations (with the same Y labels) except that $X_1$'s features are a subset of the features of $X_2$. If we apply $L_2$ regularized $L_1$ loss SVM to this dataset then which is true?
(a) The number of support vectors for $X_1$ ≤ the number of support vectors for $X_2$.
(b) The number of support vectors for $X_1$ ≥ the number of support vectors for $X_2$.
(c) Not enough information to tell.
A: For intuition: in a very high dimensional space, almost all points are support vectors, so if you keep fewer features, there will be relatively fewer support vectors.

55. [2 points] Kernels work naturally with which of the following: I. PCA II. linear regression III. SVMs IV. decision trees
(a) I and III
(b) II and III
(c) I, II and III
(d) all of them
C

56. [1 points] True or False? $k(x, y) = \exp(-\|x - y\|)$ is a valid kernel for any norm $\|z\|_p$.

57. [1 points] True or False? $k(x, y) = \|x - y\|_2^2 + c^2$ is a valid kernel.

58. [1 points] True or False? $k(x, y) = \sum_{i=1}^n \max(x_i, y_i)$ is a valid kernel.
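
One practical way to probe kernel-validity questions like 56-58: build the Gram matrix on a random sample of points and check whether it is positive semi-definite. This is only a necessary condition (a single negative eigenvalue disproves validity, while passing the check proves nothing in general), and the sketch below is my own illustration with made-up helper names, not part of the exam.

```python
import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-8):
    """Check whether the Gram matrix K[i, j] = kernel(x_i, x_j) is PSD on a sample.
    A valid kernel must yield a PSD Gram matrix for every possible sample."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.all(np.linalg.eigvalsh(K) >= -tol)

X = np.random.default_rng(3).normal(size=(20, 4))
print(is_psd_on_sample(lambda a, b: np.exp(-np.linalg.norm(a - b)), X))    # candidate from question 56 (Euclidean norm)
print(is_psd_on_sample(lambda a, b: np.linalg.norm(a - b) ** 2 + 1.0, X))  # candidate from question 57 (with c = 1)
```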

59. [2 points] The number of parameters needed to specify a Gaussian Mixture Model with 4 clusters, data of dimension 5, and diagonal covariances is:
(a) Between 5 and 15
(b) Between 16 and 26
(c) Between 27 and 37
(d) Between 38 and 48
(e) More than 49
D: 4*5 centers, 4*5 covariances, 4-1 cluster probabilities; 3 + 20 + 20 = 43 (the clusters can have different diagonals).

60. [2 points] The number of parameters needed to specify a Gaussian Mixture Model with 3 clusters, data of dimension 4, and spherical covariances is:
(a) Between 5 and 15
(b) Between 16 and 26
(c) Between 27 and 37
(d) Between 38 and 48
(e) More than 49
B: 3*4 centers, 3 covariances, 3-1 cluster probabilities; 12 + 3 + 2 = 17 (the clusters can be different sizes).
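
The counting used in the two answers above generalizes directly: K·d mean parameters, K·d (diagonal) or K (spherical) variance parameters, and K−1 free mixing weights. A small helper reproducing the arithmetic (my own sketch; the function name `gmm_param_count` is an assumption, not exam material):

```python
def gmm_param_count(k, d, covariance="diagonal"):
    """Free parameters in a Gaussian Mixture Model with k clusters in d dimensions:
    means + covariance parameters + (k - 1) mixing weights."""
    means = k * d
    if covariance == "diagonal":
        covs = k * d                    # one variance per dimension per cluster
    elif covariance == "spherical":
        covs = k                        # one shared variance per cluster
    elif covariance == "full":
        covs = k * d * (d + 1) // 2     # symmetric covariance matrix per cluster
    else:
        raise ValueError(covariance)
    return means + covs + (k - 1)

print(gmm_param_count(4, 5, "diagonal"))   # 43, as in question 59
print(gmm_param_count(3, 4, "spherical"))  # 17, as in question 60
```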

61. [2 points] Suppose you are given an EM algorithm that finds maximum likelihood estimates for a model with latent variables. You are asked to modify the algorithm so that it finds MAP estimates instead. Which step or steps do you need to modify?
(a) Expectation
(b) Maximization
(c) No modification is necessary
(d) Both
B

62. [1 points] True or False? EM is a search algorithm for finding maximum likelihood (or sometimes MAP) estimates. Thus, it can be replaced by other search algorithms that also maximize the same likelihood function.

63. [1 points] True or False? PCA can be formulated as an optimization problem that finds the set of k basis functions that best represent a set of observations X, in the sense of minimizing the $L_2$ reconstruction error.

64. [1 points] True or False? PCA can be formulated as an optimization problem that finds the (orthogonal) directions of maximum covariance of a set of observations X.

65. [1 points] True or False? A positive definite symmetric real square matrix has only positive entries.

66. [1 points] True or False? The Frobenius norm of a matrix is equal to the sum of the squares of its eigenvalues.

67. [1 points] True or False? The Frobenius norm of a matrix is equal to the sum of the squares of all of its elements.

68. [1 points] True or False? PCA is often used for preprocessing images.

69. [1 points] True or False? PCA (Principal Component Analysis) is a supervised learning method.

70. [1 points] True or False? It is often possible to get a higher testing accuracy by training the same model on the PCA'ed dataset than on the original dataset.

71. [1 points] True or False? Using principal components (PCs) as the features when training Naive Bayes assures (since the principal components are orthogonal) that the Naive Bayes assumption is met.

72. [1 points] True or False? Thin SVD of a rectangular matrix $X = U_1 D_1 V_1^\top$ (where X is n*p with n > p and $D_1$ is k*k) and that of its transpose $X^\top = U_2 D_2 V_2^\top$ (again with $D_2$ also k*k) always yields $D_1 = D_2$ as long as there are no repeated singular values.
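
A quick numerical companion to questions 66, 67, and 72 (my own illustration on an arbitrary random matrix, using numpy's standard `norm` and `svd` routines): it compares the squared Frobenius norm against the sum of squared entries and the sum of squared singular values, and checks that the thin SVDs of $X$ and $X^\top$ share the same singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))              # rectangular matrix with n > p

fro = np.linalg.norm(X, "fro")
U1, D1, Vt1 = np.linalg.svd(X, full_matrices=False)     # thin SVD of X
U2, D2, Vt2 = np.linalg.svd(X.T, full_matrices=False)   # thin SVD of X^T

# Squared Frobenius norm vs. sum of squared entries vs. sum of squared singular values.
print(fro ** 2, np.sum(X ** 2), np.sum(D1 ** 2))

# The thin SVDs of X and X^T have identical singular values.
print(np.allclose(D1, D2))
```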

73. [1 points] True or False? Thin SVD of a rectangular matrix $X = U_1 D_1 V_1^\top$ (where X is n*p with n > p and $D_1$ is k*k) and that of its transpose $X^\top = U_2 D_2 V_2^\top$ (again with $D_2$ also k*k) always yields $U_1 = V_2$ as long as there are no repeated singular values.

74. [1 points] True or False? A k*k*k tensor Γ can be viewed as a mapping from three length-k vectors ($x_1$, $x_2$, and $x_3$) to a scalar (y) such that the mapping is linear in each of the input vectors. Thus, given a bunch of observations of the form (y, $x_1$, $x_2$, $x_3$) the elements of Γ can be estimated using linear regression.

75. [1 points] True or False? Deep Neural Networks are usually semi-parametric models.

76. [1 points] True or False? Deep neural networks are almost always supervised learning methods.

77. [1 points] True or False? Random Forests is a generative model.

78. [1 points] True or False? Random Forests tend to work well for problems like image classification.

79. [2 points] Radial basis functions ___ the dimensionality of the feature space.
(a) only increase
(b) only decrease
(c) can either increase, decrease, or not change
(d) don't change
C

80. [1 points] True or False? Latent Dirichlet Allocation is a latent variable model that can be estimated using EM-style optimization.

The following questions refer to a figure showing a Bayesian network over the nodes A, B, C, D, E, F, G, H, I, J, K, and L (the figure is not reproduced here).

81. [1 points] True or False? $(A \perp B \mid H)$

82. [1 points] True or False? $(C \perp I \mid E, L, G)$

83. [1 points] True or False? $(A \perp L \mid J, K, L)$

84. [1 points] True or False? $(B \perp K \mid H)$

85. [1 points] True or False? $(G \perp H \mid A, B, C)$

86. [1 points] True or False? J d-separates F and L

87. [2 points] What is the minimum number of parameters needed to represent the full joint distribution P(A, B, C, D, E, F, G, H, I, J, K, L) in the network, given that all variables are binary? Hint: Parameters here refer to the value of each probability. For example, we need 1 parameter for P(X) and 3 parameters for P(X, Y) if X and Y are binary.
(a) 12
(b) 4095
(c) 24
(d) 29
D

88. [2 points] How many edges need to be added to moralize the network?
(a) 3
(b) 4
(c) 5
(d) 6
B

89. [1 points] True or False? Randomized search procedures, although slow, can be used to reliably learn causal belief networks.

90. [1 points] True or False? The Markov transition matrix in an HMM gives the probability $p(h_t \mid h_{t+1})$.

91. [1 points] True or False? Hidden Markov Models are generative models and hence are usually learned using EM algorithms.

92. [2 points] In EM models, which of the following equations do you use to derive the parameters θ?
(a) $\log P(D \mid \theta) = \sum_i \log \sum_z q(z_i = z \mid x_i) \frac{p_\theta(z_i = z, x_i)}{q(z_i = z \mid x_i)}$
(b) $F(q, \theta) = \sum_i \left( H(q(Z_i \mid x_i)) + \sum_z q(Z_i = z \mid x_i) \log p_\theta(Z_i = z, x_i) \right)$
(c) $F(q, \theta) = \sum_i \left( -KL(q(Z_i \mid x_i) \,\|\, p_\theta(Z_i \mid x_i)) + \log p_\theta(x_i) \right)$
(d) $\log P(D \mid \theta) = \sum_i \sum_z q(z_i = z \mid x_i) \log \frac{p_\theta(z_i = z, x_i)}{q(z_i = z \mid x_i)}$
B is the best answer (the first term drops out), but I gave everyone full credit on this just to be generous.

For the next question, consider the Bayes Net below with parameter values labeled. This is an instance of an HMM (similar to homework 8), with hidden chain $X_1 \to X_2$ and observations $O_1$, $O_2$:
$P(X_1 = a) = 0.4$
$P(X_2 = a \mid X_1 = a) = 0.7$
$P(X_2 = a \mid X_1 = b) = 0.2$
$P(O_i = 0 \mid X_i = a) = 0.6$
$P(O_i = 0 \mid X_i = b) = 0.3$

93. [3 points] Suppose you have the observation sequence $O_1 = 1, O_2 = 0$. What is the prediction of Viterbi decoding? (Maximize $P(X_1, X_2 \mid O_1 = 1, O_2 = 0)$.)
(a) $X_1 = a, X_2 = a$
(b) $X_1 = a, X_2 = b$
(c) $X_1 = b, X_2 = a$
(d) $X_1 = b, X_2 = b$
D
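
As a closing check on question 93 (my own sketch, not part of the exam; the dictionaries and the helper name `joint` are just one way to encode the model): the chain has only two hidden states per step, so brute-force enumeration of all four state sequences reproduces the Viterbi answer.

```python
import itertools

# Model from question 93.
p_x1 = {"a": 0.4, "b": 0.6}
p_x2_given_x1 = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.2, "b": 0.8}}
p_obs0_given_x = {"a": 0.6, "b": 0.3}    # P(O_i = 0 | X_i); P(O_i = 1 | X_i) is the complement

def joint(x1, x2, o1, o2):
    """P(X1, X2, O1, O2); maximizing this also maximizes P(X1, X2 | O1, O2)."""
    def emit(x, o):
        return p_obs0_given_x[x] if o == 0 else 1.0 - p_obs0_given_x[x]
    return p_x1[x1] * emit(x1, o1) * p_x2_given_x1[x1][x2] * emit(x2, o2)

best = max(itertools.product("ab", repeat=2), key=lambda s: joint(s[0], s[1], 1, 0))
print(best, joint(*best, 1, 0))          # ('b', 'b') with joint probability 0.1008
```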