STA 414/2104, Spring 2014, Practice Problem Set #1

Note: these problems are not for credit, and not to be handed in.

Question 1: Consider a classification problem in which there are two real-valued inputs, x1 and x2, and a binary (0/1) target (class) variable, y. There are … training cases, plotted below. Cases where y = 1 are plotted as black dots, cases where y = 0 as white dots, with the location of the dot giving the inputs, x1 and x2, for that training case.

[Plot of the training cases omitted.]

A) Estimate the error rate of the one-nearest-neighbor (1-NN) classifier for this problem using leave-one-out cross validation. (That is, using S-fold cross validation with S equal to the number of training cases, in which each training case is predicted using all the other training cases.)

B) Suppose we use the three-nearest-neighbor (3-NN) method to estimate the probability that a test case is in class 1. For test cases with each of the following sets of input values, find the estimated probability of class 1.

x1 = …, x2 = …
x1 = …, x2 = …
x1 = 3, x2 = …
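The coordinates in the figure did not survive, but the mechanics of parts (A) and (B) can be sketched numerically. In the Python sketch below, the arrays X and y are made-up placeholder data, not the actual training cases:

import numpy as np

# Hypothetical stand-in for the plotted training set (the real coordinates
# and labels are in the missing figure).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0],
              [2.5, 3.5], [1.0, 3.0], [3.5, 1.5], [2.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

def knn_predict_proba(X_train, y_train, x_test, k):
    """Fraction of the k nearest training cases that are in class 1."""
    d = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(d)[:k]
    return y_train[nearest].mean()

# Part A: leave-one-out cross-validation error of the 1-NN classifier.
errors = 0
for i in range(len(X)):
    keep = np.arange(len(X)) != i                 # all cases except case i
    p1 = knn_predict_proba(X[keep], y[keep], X[i], k=1)
    errors += int((p1 >= 0.5) != y[i])
print("LOO 1-NN error rate:", errors / len(X))

# Part B: 3-NN estimate of P(class 1) at a test point.
print("3-NN P(class 1) at (3, 1):", knn_predict_proba(X, y, np.array([3.0, 1.0]), k=3))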

Question 2: Here is a plot of 10 training cases for a binary classification problem with two input variables, x1 and x2, with points in class 0 in white and points in class 1 in black:

[Plot of the 10 training cases omitted.]

We wish to compare three variations on the K-nearest-neighbor method for this problem, using 10-fold cross validation (i.e., we leave out each training case in turn and try to predict it from the other nine). We use the fraction of cases that are misclassified as the error measure. We set K = 1 in all methods, so we just predict the class in a test case from the class of its nearest neighbor.

A) The first method looks only at x1, so the distance between cases with input vectors x and x′ is |x1 − x1′|. What is the cross-validation error for this method?

B) The second method looks only at x2, so the distance between cases with input vectors x and x′ is |x2 − x2′|. What is the cross-validation error for this method?

C) The third method looks at both inputs, and uses Euclidean distance, so the distance between cases with input vectors x and x′ is √((x1 − x1′)² + (x2 − x2′)²). What is the cross-validation error for this method?

D) If we use the method (from among these three) that is best according to 10-fold cross-validation, what will be the predicted class for a test case with inputs (x1, x2) = (.5, .5)?
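A sketch of how the three distance measures can be compared by leave-one-out cross-validation; again the data below are placeholders, since the plotted cases were not recoverable:

import numpy as np

# Placeholder data standing in for the 10 plotted cases (the real ones are in
# the missing figure); labels are class 0 or class 1.
X = np.array([[0.5, 2.0], [1.0, 1.0], [1.5, 2.5], [2.0, 0.5], [2.5, 2.0],
              [0.5, 0.5], [1.0, 2.0], [2.0, 2.5], [2.5, 1.0], [1.5, 0.5]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# The three candidate distances from parts A, B, and C.
distances = {
    "x1 only":   lambda a, b: abs(a[0] - b[0]),
    "x2 only":   lambda a, b: abs(a[1] - b[1]),
    "Euclidean": lambda a, b: np.sqrt(((a - b) ** 2).sum()),
}

def loocv_error(dist):
    """10-fold (= leave-one-out) error of 1-NN under the given distance."""
    wrong = 0
    for i in range(len(X)):
        d = [dist(X[i], X[j]) if j != i else np.inf for j in range(len(X))]
        wrong += int(y[int(np.argmin(d))] != y[i])
    return wrong / len(X)

for name, dist in distances.items():
    print(name, "cross-validation error:", loocv_error(dist))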

Question 3: Consider a linear basis function regression model, with one input x and the following three basis functions:

φ0(x) = 1,    φ1(x) = { 1 if x < …; 0 if x ≥ … },    φ2(x) = x

The model for the target variable, y, is that P(y | x, β) = N(y | f(x, β), 1), where

f(x, β) = Σ_j β_j φ_j(x)

Suppose we have four data points, as plotted below:

[Plot of the four data points omitted.]

What is the maximum likelihood (least squares) estimate for the parameters β0, β1, and β2? Elaborate calculations should not be necessary.

Question 4: Below is a plot of a dataset of n = 3 observations of (x_i, y_i) pairs.

[Plot omitted.] In other words, the data points are (…, …), (…, 3), (4, …).

Suppose we model this data with a linear basis function model with m = 2 basis functions given by φ0(x) = 1 and φ1(x) = x. We use a quadratic penalty of the form λβ1², which penalizes only the regression coefficient for φ1(x), not that for φ0(x). Suppose we use squared error from three-fold cross-validation (i.e., with each validation set having only one case) to choose the value of λ. Suppose we consider only two values for λ: one very close to zero, and one very large. For the data above, will we choose λ near zero, or λ that is very big?
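For Question 4, the choice between the two values of λ can be checked numerically once the data points are known. The sketch below uses placeholder points and a hypothetical pair of λ values; as in the question, only the coefficient of φ1(x) = x is penalized:

import numpy as np

# Question 4 mechanics: penalized least squares with the penalty applied only
# to the coefficient of phi1(x) = x, and lambda chosen by 3-fold (leave-one-out)
# cross-validation.  The data below are placeholders; the actual three points
# did not survive extraction.
data = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 1.0]])   # columns: x, y

def fit(x, y, lam):
    """Minimize sum (y - b0 - b1*x)^2 + lam * b1^2 (no penalty on the intercept b0)."""
    Phi = np.column_stack([np.ones_like(x), x])    # columns: phi0(x) = 1, phi1(x) = x
    P = np.diag([0.0, lam])                        # penalty matrix, zero for phi0
    return np.linalg.solve(Phi.T @ Phi + P, Phi.T @ y)

def cv_error(lam):
    err = 0.0
    for i in range(len(data)):
        train = np.delete(data, i, axis=0)
        b = fit(train[:, 0], train[:, 1], lam)
        pred = b[0] + b[1] * data[i, 0]
        err += (data[i, 1] - pred) ** 2
    return err / len(data)

for lam in [1e-6, 1e6]:                            # "near zero" vs "very large"
    print("lambda =", lam, " CV squared error =", cv_error(lam))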

Question 5: Consider a linear basis function model for a regression problem with response y and a single scalar input, x, in which the basis functions are φ0(x) = 1, φ1(x) = x, and φ2(x) = x². Below is a plot of four training cases to be fit with this model:

[Plot of the four training cases omitted.]

A) Suppose we fit this linear basis function model by least squares. What will be the estimated coefficients for the three basis functions, β0, β1, and β2?

B) Suppose we fit this linear basis function model by penalized least squares, with a penalty of λβ2² (note that the penalty does not depend on β0 and β1). What will be the estimated coefficients for the three basis functions, β0, β1, and β2, in the limit as λ goes to infinity?

C) Suppose we use the form of the penalty as in part (B), but with λ = …. Will the penalized least squares estimate for β2 be exactly zero? Show why or why not.

Question 6: Suppose that we observe a binary (0/1) variable, Y1. We do not know the probability, θ, that Y1 will be 1, but we have a prior distribution for θ, that has the following density function on the interval (0, 1):

P(θ) = …

A) Find as simple a formula as you can for the density function of the posterior distribution of θ given that we observe Y1 = 1. Your formula should give the correctly normalized density.

B) Suppose that Y2 is a future observation, that is independent of Y1 given θ. Find the predictive probability that Y2 = 1 given that Y1 = 1, i.e., find P(Y2 = 1 | Y1 = 1).
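For Question 6, the posterior and the predictive probability can be obtained by direct numerical integration once the prior density is known. The sketch below assumes an example prior, P(θ) = 2(1 − θ), purely for illustration, since the actual prior density was not recoverable:

import numpy as np

# Question 6 mechanics on a grid: posterior density for theta after observing
# Y1 = 1, and the predictive probability P(Y2 = 1 | Y1 = 1).  The prior below,
# P(theta) = 2*(1 - theta), is only an assumed example.
theta = np.linspace(0.0, 1.0, 10001)
dtheta = theta[1] - theta[0]
prior = 2.0 * (1.0 - theta)                      # assumed example prior on (0, 1)

likelihood = theta                               # P(Y1 = 1 | theta) = theta
unnorm = prior * likelihood
posterior = unnorm / (unnorm.sum() * dtheta)     # normalized posterior density

# Predictive probability: E[theta | Y1 = 1] under the posterior.
pred = (theta * posterior).sum() * dtheta
print("P(Y2 = 1 | Y1 = 1) =", pred)              # 1/2 for this assumed prior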

Question 7: Let X1, X2, X3, ... be a sequence of binary (0/1) random variables. Given a value for θ, these random variables are independent, and P(X_i = 1) = θ for all i. Suppose that we are sure that θ is at least 1/2, and that our prior distribution for θ for values 1/2 and above is uniform on the interval [1/2, 1]. We have observed that X1 = 1, but don't know the values of any other X_i.

A) Write down the likelihood function for θ, based on the observation X1 = 1.

B) Find an expression for the posterior probability density function of θ given X1 = 1, simplified as much as possible, with the correct normalizing constant included.

C) Find the predictive probability that X2 = 1 given that X1 = 1.

D) Find the probability that X2 = X3 given that X1 = 1.

Question 8: Consider a binary classification problem in which the probability that the class, y, of an item is 1 depends on a single real-valued input, x, with the classes for different cases being independent, given a parameter φ and x. We use the following model for this class probability in terms of the unknown parameter φ:

P(y = 1 | x, φ) = { 1/2 if x ≤ φ; … if x > φ

We have a training set consisting of the following six (x, y) pairs:

(…, …), (.3, …), (.4, …), (.6, …), (.7, …), (.8, …)

A) Draw a graph of the likelihood function for φ based on the six training cases above.

B) Compute the marginal likelihood for this model with this data (i.e., the prior probability of the observed training data with this model and prior distribution), assuming that the prior distribution of φ is uniform on the interval [.5, 1].

C) Find the posterior distribution of φ given the six training cases above, and the prior from part (B). Display this posterior distribution by drawing a graph of its probability density function.

D) Find the predictive probability that y = 1 for each of three test cases in which x has the values …, .6, and .7, based on the posterior distribution you found in part (C).
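For Question 8, the likelihood, marginal likelihood, posterior, and predictive probabilities can all be approximated on a grid over φ. In the sketch below, the class labels, the first x value, and the model's value of P(y = 1 | x > φ) are assumptions made only so the code runs; they are not the question's actual values:

import numpy as np

# Question 8 mechanics on a grid over the threshold parameter phi.
x = np.array([0.2, 0.3, 0.4, 0.6, 0.7, 0.8])   # first value is a guess
y = np.array([1,   0,   1,   0,   0,   0])     # assumed labels

def p_y1(x, phi):
    """Assumed model: P(y = 1 | x, phi) = 1/2 if x <= phi, else 0."""
    return np.where(x <= phi, 0.5, 0.0)

phi_grid = np.linspace(0.5, 1.0, 5001)          # prior is uniform on [0.5, 1]
dphi = phi_grid[1] - phi_grid[0]
prior = np.full_like(phi_grid, 1.0 / 0.5)       # uniform density 2 on [0.5, 1]

# Likelihood of the six cases for each phi on the grid.
lik = np.array([np.prod(np.where(y == 1, p_y1(x, p), 1.0 - p_y1(x, p)))
                for p in phi_grid])

marginal = np.sum(lik * prior) * dphi           # part B: marginal likelihood
posterior = lik * prior / marginal              # part C: posterior density
print("marginal likelihood:", marginal)

# Part D: predictive P(y = 1 | x_test) = integral of p_y1(x_test, phi) * posterior.
for x_test in [0.2, 0.6, 0.7]:
    pred = np.sum(p_y1(x_test, phi_grid) * posterior) * dphi
    print("P(y = 1 | x =", x_test, ") =", round(pred, 3))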

Question 9: Below is a plot of six training cases for a regression problem with two inputs. The location of the circle for a training case gives the values of the two inputs, x1 and x2, for that case, and the number in the circle is the value of the response, y, for that case.

[Plot of the six training cases omitted; a question mark in the lower right marks a test case.]

Suppose we use a linear basis function model for this data, with the following basis functions:

φ0(x) = 1,    φ1(x) = { 1 if x1 > …; 0 if x1 ≤ … },    φ2(x) = { 1 if x1 + x2 > …; 0 if x1 + x2 ≤ … }

A) What will be the least squares estimates of the regression coefficients, β0, β1, and β2, based on these six training cases (ignore the question mark in the lower right for now)? No elaborate matrix computations should be needed to answer this.

B) Based on the least squares estimates for part (A), what will be the prediction for the value of the response in a test case whose x1 and x2 values are given by the location of the question mark in the plot above?

C) For this training set, find an estimate of the average squared error of prediction using least squares estimates, by applying leave-one-out cross-validation (which is the same as six-fold cross validation here, since there are six training cases).

D) What will be the prediction for the test case at the question mark based on penalized least squares estimates, if the penalty has the form λ(β1² + β2²), and λ is very large?
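Part (C) asks for a leave-one-out estimate of the squared prediction error; the sketch below shows that computation with placeholder inputs, responses, and indicator thresholds, since the figure's values were not recoverable:

import numpy as np

# Question 9, part C mechanics: leave-one-out estimate of squared prediction
# error for a linear basis function model fit by least squares.
X = np.array([[0.5, 0.5], [0.5, 1.5], [1.5, 0.5], [1.5, 1.5], [2.5, 1.0], [2.5, 2.0]])
y = np.array([0.3, 0.2, 0.5, 0.5, 0.7, 0.9])     # placeholder responses

def design(X, t1=1.0, t2=3.0):
    """phi0 = 1, phi1 = 1[x1 > t1], phi2 = 1[x1 + x2 > t2] (thresholds assumed)."""
    return np.column_stack([np.ones(len(X)),
                            (X[:, 0] > t1).astype(float),
                            (X[:, 0] + X[:, 1] > t2).astype(float)])

sq_err = 0.0
for i in range(len(X)):
    keep = np.arange(len(X)) != i                # leave case i out
    Phi = design(X[keep])
    beta, *_ = np.linalg.lstsq(Phi, y[keep], rcond=None)
    pred = design(X[i:i+1]) @ beta
    sq_err += (y[i] - pred[0]) ** 2
print("LOO estimate of average squared error:", sq_err / len(X))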

Question 10: Let Y1, Y2, Y3, ... be random quantities that are independent given a parameter θ, with each Y_t having the value 1, 2, or 3, with probabilities

P(Y_t = y | θ) = { θ if y = 1; 2θ if y = 2; 1 − 3θ if y = 3

The prior distribution for the model parameter θ is uniform on the interval [0, 1/3]. For the questions below, suppose we observe that Y1 = … and Y2 = 3, but do not observe Y3, Y4, ...

A) Find the marginal likelihood for this data (that is, the probability that Y1 = … and Y2 = 3, integrating over the prior distribution of θ).

B) Find the posterior probability density function for θ (that is, the density for θ given Y1 = … and Y2 = 3), with the correct normalizing constant.

C) Find the predictive distribution for Y3 given the observed data (that is, give a table of P(Y3 = y | Y1 = …, Y2 = 3) for y = 1, 2, 3).

Question 11: Answer the following questions about Bayesian inference for linear basis function models. Recall that if the noise variance is σ², and the prior distribution for β is Gaussian with mean zero and covariance matrix S_0, the posterior distribution for β is Gaussian with mean m_n and covariance matrix S_n that can be written as follows:

S_n = [ S_0^(-1) + (1/σ²) Φ^T Φ ]^(-1),    m_n = S_n Φ^T y / σ²

and the log of the marginal likelihood for the model is

−(n/2) log(2π) − (n/2) log(σ²) + (1/2) log(|S_n| / |S_0|) − (y − Φ m_n)^T (y − Φ m_n) / (2σ²) − (1/2) m_n^T S_0^(-1) m_n

For the questions below, assume that S_0 = ω I, for some positive ω.

A) Suppose we set the noise variance, σ², to be bigger and bigger, while fixing other aspects of the model. What will be the limiting values of the posterior mean and covariance matrix?

B) Suppose we set ω, the prior variance of the β_j, to be bigger and bigger, while fixing other aspects of the model. What will be the limiting values of the posterior mean, m_n, and covariance matrix, S_n?

C) Suppose we set ω to be bigger and bigger while fixing other aspects of the model. What will be the limiting value of the marginal likelihood?

D) Suppose there is only one input (so x is a scalar), and the basis functions are φ_j(x) = x^j, for j = 0, ..., m. The Bayesian mean prediction for the value of y in a test case with input x is found by integrating the prediction based on β (i.e., the expected value of y given x and β) with respect to the posterior distribution of β. Will this final mean prediction be a polynomial function of x?
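The formulas in Question 11 can be checked numerically. The sketch below evaluates m_n, S_n, and the log marginal likelihood from the expressions above on a small made-up dataset, so the limits asked about in parts (A) to (C) can be explored; the data and parameter settings are placeholders chosen only for illustration:

import numpy as np

# Question 11 mechanics: posterior mean m_n, covariance S_n, and log marginal
# likelihood, evaluated on a tiny synthetic dataset.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(8)
Phi = np.column_stack([np.ones_like(x), x])            # basis functions: 1 and x
n, m = Phi.shape

def posterior_and_evidence(sigma2, omega):
    S0 = omega * np.eye(m)                             # prior covariance S_0 = omega * I
    Sn = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma2)
    mn = Sn @ Phi.T @ y / sigma2
    resid = y - Phi @ mn
    log_ev = (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2)
              + 0.5 * np.log(np.linalg.det(Sn) / np.linalg.det(S0))
              - resid @ resid / (2 * sigma2)
              - 0.5 * mn @ np.linalg.inv(S0) @ mn)
    return mn, Sn, log_ev

# Part A: as sigma^2 grows, the data are effectively ignored and m_n shrinks
# toward the prior mean of zero.
for sigma2 in [0.01, 1.0, 1e6]:
    mn, Sn, _ = posterior_and_evidence(sigma2, omega=1.0)
    print("sigma^2 =", sigma2, " m_n =", np.round(mn, 3))

# Part C: as omega grows, the log marginal likelihood keeps decreasing.
for omega in [1.0, 1e3, 1e6]:
    print("omega =", omega, " log evidence =", round(posterior_and_evidence(0.01, omega)[2], 2))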