UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e., 4 sides); no other materials. Time: 2 hours.

Be sure to write your name and Penn student ID (the 8 bigger digits on your ID card) on the bubble form and fill in the associated bubbles in pencil. If you are taking this as a WPE, then enter only your WPE exam number.

If you think a question is ambiguous, mark what you think is the best answer. The questions seek to test your general understanding; they are not intentionally trick questions. As always, we will consider written regrade requests if your interpretation of a question differed from what we intended. We will only grade the scantron forms.

For the TRUE or FALSE questions, note that TRUE is (a) and FALSE is (b). For the multiple choice questions, select exactly one answer.

The exam is 9 pages long and has 78 questions.

Name:

1. [0 points] This is version A of the exam. Please fill in the bubble for that letter.

2. [1 point] True or False? Under the usual assumptions ridge regression is consistent.

3. [1 point] True or False? Under the usual assumptions ridge regression is unbiased.

4. [1 point] True or False? Stepwise regression finds the global optimum minimizing its loss function (squared error plus the usual L0 penalty).

5. [1 point] True or False? k-means clustering finds the global optimum minimizing its loss function.

6. [2 points] When doing linear regression with n = 10,000 observations and p = 1,000,000 features, if one expects around 500 or 1,000 features to enter the model, the best penalty to use is
(a) AIC penalty
(b) BIC penalty
(c) RIC penalty
(d) This problem is hopeless: you couldn't possibly find a model that reliably beats just using a constant.
C

7. [2 points] When doing linear regression, if we expect a very small fraction of the features to enter the model, we should use an
(a) L0 penalty
(b) L1 penalty
(c) L2 penalty
A

8. [1 point] True or False? In general, in machine learning, we prefer to use unbiased algorithms due to their better accuracy.
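
To see the idea behind question 7 concretely, here is a short Python sketch (not from the exam; synthetic data, scikit-learn assumed available): with many irrelevant features, an L1 (Lasso) penalty drives most coefficients exactly to zero, while an L2 (ridge) penalty only shrinks them.

# Sketch: L1 (Lasso) vs L2 (ridge) penalties when the true model is sparse.
# Assumes scikit-learn and NumPy are installed; data are synthetic.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5                      # n observations, p features, k truly relevant
X = rng.standard_normal((n, p))
true_w = np.zeros(p)
true_w[:k] = 3.0                          # only the first k features matter
y = X @ true_w + rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)        # L1 penalty -> sparse coefficients
ridge = Ridge(alpha=1.0).fit(X, y)        # L2 penalty -> shrinkage, no exact zeros

print("nonzero coefficients, Lasso:", np.sum(lasso.coef_ != 0))
print("nonzero coefficients, Ridge:", np.sum(ridge.coef_ != 0))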

9. [1 point] True or False? L1 penalized regression (Lasso) solves a convex optimization problem.

10. [2 points] Which of the following loss functions is least sensitive to outliers?
(a) Hinge loss
(b) L1 loss
(c) Squared (L2) loss
(d) Exponential loss
A

11. [1 point] True or False? For small training sets, Naive Bayes generally is more accurate than logistic regression.

12. [1 point] True or False? Naive Bayes, as used in practice, is generally a MAP algorithm.

13. [1 point] True or False? One can make a good argument that minimizing an L1 loss penalty in regression gives better results than the more traditional L2 loss function minimized by ordinary least squares.

14. [1 point] True or False? Linear SVMs tend to be slower, but more accurate, than logistic regression.
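
A quick numeric illustration of question 10 (my own sketch, in Python): for a grossly misclassified point, squared and exponential losses grow much faster than hinge or L1 loss, which is why the latter two are less sensitive to outliers.

# Sketch: how different losses grow as a point becomes an increasingly bad outlier.
# Here m = y * f(x) is the signed margin; a large negative m is a gross error.
import numpy as np

margins = np.array([-1.0, -2.0, -5.0, -10.0])
hinge = np.maximum(0.0, 1.0 - margins)      # grows linearly
l1 = np.abs(1.0 - margins)                  # grows linearly
squared = (1.0 - margins) ** 2              # grows quadratically
exponential = np.exp(-margins)              # grows exponentially

for name, vals in [("hinge", hinge), ("L1", l1),
                   ("squared", squared), ("exponential", exponential)]:
    print(f"{name:12s}", np.round(vals, 1))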

15. [2 points] You estimate a ridge regression model with some data taken from your robot, and find (using cross validation) an optimal ridge penalty λ1. You then buy a new sensor which has noise with 1/4 the variance (half the standard deviation) as before. Using the same number of observations as before you collect new data, and find a new optimal ridge penalty λ2. Which of the following will be closest to true?
(a) λ1/λ2 = 1/4
(b) λ1/λ2 = 1/2
(c) λ1/λ2 = 1
(d) λ1/λ2 = 2
(e) λ1/λ2 = 4
E

16. [1 point] True or False? BIC can be viewed as an MDL method.

17. [1 point] True or False? If you expect half of the features to enter a model, and have n >> p, BIC is a better penalty to use than RIC.

18. [1 point] True or False? The elastic net tends to select fewer features than well-optimized L0 penalty methods.

19. [1 point] True or False? The elastic net generally gives at least as good a model (in terms of test error) as Lasso.

20. [1 point] True or False? The appropriate penalty in L0-penalized linear regression can be determined by theory, e.g. using an MDL approach.
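
Question 15 can be explored with a rough simulation (a sketch only, assuming scikit-learn's RidgeCV and synthetic data): when the noise variance drops, the cross-validated ridge penalty tends to drop with it.

# Sketch: how the cross-validated ridge penalty tracks the noise level.
# Synthetic data; exact numbers vary run to run, but the selected penalty
# tends to shrink as the sensor noise variance shrinks.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.standard_normal((n, p))
w = rng.standard_normal(p)
alphas = np.logspace(-3, 3, 50)

for sigma in [2.0, 1.0]:                      # new sensor: half the std dev
    y = X @ w + sigma * rng.standard_normal(n)
    lam = RidgeCV(alphas=alphas).fit(X, y).alpha_
    print(f"noise std {sigma}: selected ridge penalty ~ {lam:.3f}")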

21. [1 point] True or False? Ridge regression is scale invariant in the sense that test set prediction accuracy is unchanged if one rescales the features, x.

22. [1 point] True or False? The fact that a coefficient in a penalized linear regression is kept or killed (removed from the model) is generally a good indicator of the importance of the corresponding feature; features that are highly correlated with y will be kept and those with low correlation will be dropped.

Consider an L0 penalized linear regression with three different penalties. The resulting models have the following properties:

             bits to code residual    bits to code the model
  model 1            400                       300
  model 2            300                       400
  model 3            320                       350

23. [1 point] True or False? Method 1 is overfitting.

24. [1 point] True or False? Method 2 is overfitting.

25. [1 point] True or False? The KL divergence between a true distribution (p(a) = 0.5, p(b) = 0.5, p(c) = 0) and an approximating distribution (q(a) = 0.3, q(b) = 0.3, q(c) = 0.4) will be infinite.

26. [1 point] True or False? Decision trees select features to add based on their expected information gain. This has the effect that features which take on many possible values tend to be preferentially added, since they tend to have higher information gain.
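
Questions 23-25 can be checked with a few lines of Python (a sketch, using the numbers given above and the convention that terms with p(x) = 0 contribute nothing to the KL divergence):

# Sketch: total description length (bits for residual + bits for model)
# for the three models in the table above.
import numpy as np

totals = {"model 1": 400 + 300, "model 2": 300 + 400, "model 3": 320 + 350}
print(totals)                   # model 3 has the shortest total description length

# Sketch: KL(p || q) for the distributions in question 25.
p = np.array([0.5, 0.5, 0.0])   # true distribution over {a, b, c}
q = np.array([0.3, 0.3, 0.4])   # approximating distribution

mask = p > 0
kl = np.sum(p[mask] * np.log2(p[mask] / q[mask]))
print("KL(p || q) =", round(float(kl), 3), "bits")   # finite: q > 0 wherever p > 0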

27. [1 point] True or False? Boosting can be shown to optimize weights for an exponential loss function.

28. [1 point] True or False? Key to the boosting algorithm is the fact that at each iteration more weight is given to points that were misclassified. This often enables test set accuracy to continue to improve even after training set error goes to zero.

29. [1 point] True or False? Perceptrons are a form of stagewise regression.

30. [1 point] True or False? Voted perceptrons are generally more accurate than regular ("simple") perceptrons.

31. [1 point] True or False? Voted perceptrons are generally faster (at test time) than averaged perceptrons.

32. [1 point] True or False? Perceptrons (approximately) optimize a hinge loss.

33. [1 point] True or False? For a linearly separable problem, standard perceptrons are guaranteed to find a linearly separating hyperplane if there are no repeated x's with inconsistent labels.

34. [1 point] Which of the following classifiers has the lowest 0-1 error (L0 loss) given a training set with an infinite number of observations?
(a) Logistic regression
(b) Naive Bayes
A
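
For question 33, here is a minimal perceptron training loop in Python (a sketch on a tiny hand-made linearly separable set, not part of the exam) showing it reaches a separating hyperplane after finitely many updates:

# Sketch: the standard perceptron update on a small linearly separable set.
# With separable data it converges to some separating hyperplane (not a
# max-margin one); labels are +/-1 and a bias term is folded into the weights.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
Xb = np.hstack([X, np.ones((len(X), 1))])     # append constant feature for the bias

w = np.zeros(Xb.shape[1])
for _ in range(100):                          # epochs
    mistakes = 0
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:                # misclassified (or on the boundary)
            w += yi * xi                      # perceptron update
            mistakes += 1
    if mistakes == 0:
        break

print("weights:", w, "| training errors:", np.sum(np.sign(Xb @ w) != y))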

35. [1 point] True or False? If we consider a linear SVM as a kernel SVM, then the kernel function is the inner product between the x's.

36. [1 point] True or False? Radial Basis Functions (RBFs) can be used either to reduce or to increase the effective dimensionality p of a regression problem.

37. [1 point] True or False? Any SVM problem can be made linearly separable with the right selection of a kernel function.

38. [1 point] True or False? The number of support vectors found by an SVM depends upon the size of the penalty on the slack variables.

39. [0 points] True or False? Because SVMs already seek large margin solutions, they do not, in general, require inclusion of a separate regularization penalty. (This was badly phrased and was thrown out.)

40. [1 point] True or False? An SVM with a Gaussian kernel, exp(−||x − y||^2 / C), will have a lower expected variance when C = 1 than when C = 10.
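
Question 38 is easy to see empirically. The following Python sketch (synthetic data; scikit-learn's SVC assumed available) fits an RBF-kernel SVM with several slack penalties C and prints how many support vectors each keeps:

# Sketch: the number of support vectors an SVM keeps depends on the slack
# penalty C (and, for an RBF kernel, on the bandwidth). Synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in [0.1, 1.0, 10.0]:
    model = SVC(kernel="rbf", C=C, gamma=1.0).fit(X, y)
    print(f"C = {C:5}: {model.n_support_.sum()} support vectors")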

41. [2 points] In the figure below, which points are support vectors? A, B, and D are in class 1; C and E are in class 2.

[Figure: points A, B, C, D, E plotted in the (x1, x2) plane.]

(a) A, C
(b) A, C, D
(c) Not enough information was provided to tell.
C

42. [2 points] The primal problem for non-negative weighted regression is:

    min_w Σ_i (y_i − w^T x_i)^2   s.t. w_j ≥ 0 for j = 1...p

The dual problem is to solve
(a) max_λ Σ_i (y_i − w^T x_i)^2 + Σ_j λ_j w_j   s.t. λ_j ≥ 0
(b) max_λ Σ_i (y_i − w^T x_i)^2 + λ^T w   s.t. λ ≥ 0
(c) max_λ Σ_i (y_i − w^T x_i)^2 − Σ_j λ_j w_j   s.t. λ_j ≥ 0
(d) min_λ Σ_i (y_i − w^T x_i)^2 + Σ_j λ_j w_j   s.t. λ_j ≥ 0
(e) min_λ Σ_i (y_i − w^T x_i)^2 − Σ_j λ_j w_j   s.t. λ_j ≥ 0
C

43. [1 point] True or False? For the above optimization problem, the constraint corresponding to each weight is binding if and only if the weight is zero.

44. [2 points] Which of the following methods cannot be kernelized?
(a) k-NN
(b) linear regression
(c) perceptrons
(d) PCA
(e) All of the above methods can be kernelized.
E
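
Questions 42-43 concern the KKT conditions for this constrained least-squares problem. The sketch below (using scipy's nnls on synthetic data; an illustration, not a derivation) compares the unconstrained solution with the non-negative one: coordinates the unconstrained fit drives negative typically end up pinned at zero, i.e. with their constraints binding.

# Sketch: non-negative least squares and complementary slackness.
# At the optimum, a constraint w_j >= 0 is binding exactly when w_j = 0;
# those are typically the coordinates the unconstrained fit drives negative.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))
w_true = np.array([2.0, -0.5, 1.5, -0.5, 3.0])        # some truly negative weights
y = X @ w_true + 0.1 * rng.standard_normal(100)

w_nnls, _ = nnls(X, y)                                 # solves min ||y - Xw|| s.t. w >= 0
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]           # unconstrained solution

print("unconstrained OLS :", np.round(w_ols, 2))       # coordinates 1 and 3 go negative
print("non-negative LS   :", np.round(w_nnls, 2))      # those coordinates are pinned at 0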

45. [1 point] True or False? Any function φ(x) can be used to generate a kernel using k(x, y) = φ(x)^T φ(y).

46. [1 point] True or False? If there exists a pair of points x and y such that k(x, y) < 0, then k() cannot be a kernel.

47. [1 point] True or False? All entries in a kernel matrix must be non-negative.

48. [1 point] True or False? A kernel matrix must be symmetric.

49. [1 point] True or False? Any norm ||x|| can be used to define a distance by defining d(x, y) = ||x − y||.

50. [1 point] True or False? k(x, y) = exp(−||x − y||_2^2) is a legitimate kernel function.

51. [2 points] The number of parameters needed to specify a Gaussian Mixture Model with 4 clusters, data of dimension 3, and a single (full) covariance matrix shared across all 4 clusters is:
(a) fewer than 16
(b) between 16 and 20 (inclusive)
(c) 21 or 22
(d) 23
(e) 24 or more
C
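
Questions 45-48 can be checked numerically. This Python sketch builds a kernel matrix from a hypothetical feature map φ and verifies it is symmetric and positive semi-definite even though some entries may be negative:

# Sketch: a kernel matrix built from any feature map phi is symmetric and
# positive semi-definite, even though individual entries can be negative
# (e.g., a plain inner product between roughly opposing vectors).
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 3))

def phi(x):
    # a hypothetical feature map: original coordinates plus their squares
    return np.concatenate([x, x ** 2])

Phi = np.array([phi(x) for x in X])
K = Phi @ Phi.T                               # k(x, y) = phi(x)^T phi(y)

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", round(float(np.linalg.eigvalsh(K).min()), 6))  # >= 0 up to rounding
print("has negative entries:", bool((K < 0).any()))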

52. [1 point] True or False? EM is a search algorithm for finding maximum likelihood (or sometimes MAP) estimates. Thus, it can, in theory, be replaced by another search algorithm that also maximizes the same likelihood function.

53. [1 point] True or False? The L2 reconstruction error from using k-component PCA to approximate a set of observations X can be characterized in terms of the k largest eigenvalues of X^T X.

54. [1 point] True or False? A positive definite symmetric real square matrix has only positive eigenvalues.

55. [1 point] True or False? For real world data sets, the principal components of a matrix X are preferably found using the SVD of X rather than actually finding the eigenvectors of X^T X.

56. [1 point] True or False? The singular values of X are equal to the eigenvalues of X^T X.

57. [1 point] True or False? For an N×P matrix X (N > P), the k largest right singular vectors will be the same as the loadings of X.

58. [1 point] True or False? Principal component regression (PCR) in effect does regularization, and thus offers a partial, if not exact, replacement for ridge regression.

59. [1 point] True or False? Principal component regression (PCR), like linear regression, is scale invariant.
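
Questions 55-56 come down to the relationship between the SVD of X and the eigendecomposition of X^T X. A short NumPy sketch (random matrix, illustration only):

# Sketch: the singular values of X are the square roots of the eigenvalues
# of X^T X (not the eigenvalues themselves), and the right singular vectors
# match the eigenvectors of X^T X up to sign.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((8, 4))

singular_values = np.linalg.svd(X, compute_uv=False)
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]            # sorted descending

print("singular values        :", np.round(singular_values, 3))
print("sqrt(eigvals of X^T X) :", np.round(np.sqrt(eigvals), 3))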

60. [2 points] The dominant cost of linear regression, when n >> p, scales as
(a) np
(b) np^2
(c) n^2 p
(d) p^3
B

61. [1 point] True or False? Deep neural networks are almost always supervised learning methods.

62. [1 point] True or False? Deep neural networks most often use an architecture in which all nodes on each layer are connected to all nodes on the following layer, but have no connections to other, deeper layers.

63. [1 point] True or False? Deep neural networks currently hold the records for best machine learning performance in problems ranging from speech and vision to natural language processing and brain image modeling (MRI).

Consider the following confusion matrix:

                        correct answer
                         +       -
  predicted      +       8       2
  answer         -      12      11

64. [1 point] For the above confusion matrix the precision is
(a) 2/10
(b) 8/20
(c) 19/33
(d) none of the above
D
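
For questions 64-65, here is the arithmetic spelled out in Python (assuming, as in the table above, that the first row and column correspond to the positive class):

# Sketch: precision and recall computed directly from the confusion matrix above.
tp, fp = 8, 2        # predicted positive: 8 actually positive, 2 actually negative
fn, tn = 12, 11      # predicted negative: 12 actually positive, 11 actually negative

precision = tp / (tp + fp)    # 8 / 10
recall = tp / (tp + fn)       # 8 / 20
print(f"precision = {precision:.2f}, recall = {recall:.2f}")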

65. [1 point] For the above confusion matrix the recall is
(a) 2/10
(b) 8/20
(c) 19/33
(d) none of the above
B

66. [1 point] True or False? L2 loss (sometimes with a regularization penalty) is widely used because it usually reflects the actual loss function for applications in business and science.

67. [1 point] True or False? One can compute a description length (for the model plus residual) for a belief net and the data it represents, and a causal belief net is likely to have a shorter description length than one that just captures the conditional independence structure of the same data.

The following questions refer to the following figure; the notation (X ⊥ Y | Z) means X is conditionally independent of Y given Z.

[Figure: a Bayesian network over the nodes B, C, D, E, F, G, I, J, K, L.]

68. [1 point] True or False? (B ⊥ C | D)

69. [1 point] True or False? (J ⊥ K | G)

70. [1 point] True or False? (J ⊥ K | L)

71. [1 point] True or False? (B ⊥ J | F, G)

72. [1 point] True or False? (F ⊥ I | G, L)

73. [1 point] True or False? G d-separates D and J.

74. [2 points] What is the minimum number of parameters needed to represent the full joint distribution P(B, C, D, E, F, G, I, J, K, L) in the network, given that all variables are binary? Hint: "parameters" here refers to the value of each probability. For example, we need 1 parameter for P(X) and 3 parameters for P(X, Y) if X and Y are binary.
(a) < 20
(b) 20-30
(c) 31-99
(d) more than 100
B

75. [1 point] True or False? When building a belief net from a set of observations where A, B are Boolean variables, if P(A = True | B = True) = P(A = True), then we know that there should not be a link from A to B.

76. [1 point] True or False? When building a belief net from a set of observations where A, B are Boolean variables, if P(A = True | B = True) = P(A = True), then we know that there should not be a link from B to A.

77. [1 point] True or False? The emission matrix in an HMM represents the probability of the state, given an observation.

For the next question, consider the Bayes net below with parameter values labeled. This is an instance of an HMM. (Similar to homework 8.)

[Figure: hidden states X1 → X2 with observations O1 and O2, where each Xi emits Oi.]

P(X1 = a) = 0.3
P(X2 = a | X1 = a) = 0.5
P(X2 = a | X1 = b) = 0.2
P(Oi = 0 | Xi = a) = 0.3
P(Oi = 0 | Xi = b) = 0.6

78. [3 points] Suppose you have the observation sequence O1 = 1, O2 = 0. What is the prediction of Viterbi decoding? (Maximize P(X1, X2 | O1 = 1, O2 = 0).)
(a) X1 = a, X2 = a
(b) X1 = a, X2 = b
(c) X1 = b, X2 = a
(d) X1 = b, X2 = b
D
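
Question 78 is small enough to solve by brute force. The Python sketch below (my own illustration, not the course's solution code) enumerates all four hidden-state sequences and maximizes the joint probability, which gives the same answer as maximizing the posterior since the observations are fixed:

# Sketch: brute-force "Viterbi" for the two-step HMM above; with only two
# hidden states and two time steps we can enumerate every state pair and
# maximize the joint probability P(X1, X2, O1 = 1, O2 = 0).
from itertools import product

p_x1 = {"a": 0.3, "b": 0.7}
p_x2_given_x1 = {("a", "a"): 0.5, ("b", "a"): 0.5,      # keys are (x2, x1)
                 ("a", "b"): 0.2, ("b", "b"): 0.8}
p_o_given_x = {(0, "a"): 0.3, (1, "a"): 0.7,            # keys are (o, x)
               (0, "b"): 0.6, (1, "b"): 0.4}

o1, o2 = 1, 0
best = max(
    product("ab", repeat=2),
    key=lambda s: p_x1[s[0]] * p_o_given_x[(o1, s[0])]
                  * p_x2_given_x1[(s[1], s[0])] * p_o_given_x[(o2, s[1])],
)
print("Viterbi decoding (X1, X2):", best)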