DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

Data Provided: None

DEPARTMENT OF COMPUTER SCIENCE
Autumn Semester 2013-2014

MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

2 hours

Answer THREE of the four questions. All questions carry equal weight. Figures in square brackets indicate the percentage of available marks allocated to each part of a question.

Registration number from U-Card (9 digits) to be completed by student.

1. This question concerns general concepts in machine learning.

a) Overfitting is a common problem in machine learning, in both regression and classification. Explain the problem, when it might occur, and how you can measure the extent of overfitting. With reference to the regression or classification models you have studied, describe two different regularisation techniques for addressing the overfitting problem. Provide a sentence or two on each, explaining how they limit overfitting.

b) Model training for probabilistic models often involves taking point estimates for the model parameters, such as the maximum a posteriori (MAP) estimate.

(i) Define the objective the MAP estimate is optimising, making reference to Bayes' rule.

(ii) Bayesian inference is an alternative inference technique which also makes use of a prior. Outline the Bayesian inference technique, and contrast it with the use of point estimates. [20%]

(iii) Why might you choose to use Bayesian inference instead of a point estimate, or vice versa? In what circumstances would Bayesian inference be preferable?

c) Several pairs of distributions are said to be conjugate, such as the Binomial and Beta; the Multinomial and Dirichlet; and the Normal (mean) and Normal. Explain the notion of conjugacy, and how this might be practically important in a classification or regression scenario.

d) Logistic regression is a probabilistic model for binary classification. It defines the probability of class C_1 as

p(C_1 | x) = 1 / (1 + exp(-w^T x))

and p(C_2 | x) = 1 - p(C_1 | x). Show how this gives rise to a linear discriminant function. You may want to start by formulating the log-odds ratio for predicting C_1 versus C_2.
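The following is an illustrative sketch for part d), not part of the original paper: it checks numerically that the log-odds log(p(C_1 | x) / p(C_2 | x)) under the logistic model equals w^T x, which is what makes the decision boundary linear. The weight vector and input points are arbitrary made-up values, and NumPy is assumed to be available.

```python
import numpy as np

def sigmoid(a):
    """Logistic function: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Arbitrary weight vector and input points, chosen only for illustration.
w = np.array([2.0, -1.0])
for x in (np.array([0.5, 0.2]), np.array([-0.3, 0.8])):
    p1 = sigmoid(w @ x)          # p(C_1 | x)
    p2 = 1.0 - p1                # p(C_2 | x)
    log_odds = np.log(p1 / p2)   # equals w @ x up to floating-point error
    print(log_odds, w @ x)       # the sign of w @ x determines the predicted class
```

Because the log-odds is linear in x, the boundary where p(C_1 | x) = p(C_2 | x) is the hyperplane w^T x = 0, i.e. a linear discriminant.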

2. This question is based on classifying the gender of Olympic 100m sprint winners. Shown below is a plot of the winning times in each year for the women's and men's events.

[Figure: winning time in seconds (vertical axis, roughly 9.5 to 12.5) plotted against year (horizontal axis, 1880 to 2020).]

The women's results are shown with plus symbols (+) and the men's results with crosses (x). We would now like to develop a classifier to predict the gender (t = male or female) automatically based on a two-dimensional data point, x = (year, winning time). Note that we are seeking to model this as a classification dataset, not regression as used for your class work.

a) We decide to model this data using a linear binary classifier. Draw a rough diagram to illustrate a decision boundary that you might hope to learn from this data with a linear classifier. Label the regions corresponding to the two classes, C_1 = female, C_2 = male. Is a linear classifier an appropriate choice for this data? You should consider both interpolation and extrapolation settings, and explain these terms in your answer.

b) Basis functions can be used to develop non-linear models. Give an example of a basis function that would be appropriate for this dataset, and explain why. Based on your choice, state the basis vector φ(x) used to represent a data point x.

c) Various techniques can be used for validation, such as using a fixed held-out validation set, or cross-validation. Explain the purpose of validation. Define fixed held-out validation and cross-validation. Imagine we were to perform k-fold cross-validation on this data. We do so by assigning the data points from the 100m dataset to folds based on the year of each race, such that each 20-year period forms a fold. Explain why this form of evaluation might not give a reliable estimate of the generalisation error, and how you might fix this.
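A minimal sketch related to part c), not part of the original paper: it uses entirely synthetic stand-in data rather than the real 100m results, and assumes NumPy and scikit-learn are available. It contrasts folds built from contiguous blocks of years (as in the 20-year scheme described above) with folds drawn from a random shuffle; era-based folds make the test data systematically different from the training data, since times drift over the decades, which is one reason such an evaluation can misestimate the generalisation error.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in for the 100m data: two downward-trending time series, one per class.
years = np.arange(1900, 2020, 4).astype(float)
men = 10.8 - 0.008 * (years - 1900) + rng.normal(0.0, 0.05, years.shape)
women = 12.0 - 0.010 * (years - 1900) + rng.normal(0.0, 0.05, years.shape)
X = np.column_stack([np.concatenate([years, years]) - 1900,   # centre the year feature
                     np.concatenate([men, women])])
t = np.concatenate([np.zeros_like(years), np.ones_like(years)])  # 0 = male, 1 = female

def cv_accuracy(order):
    """Mean held-out accuracy over 5 folds taken in the given row order."""
    scores = []
    for tr, te in KFold(n_splits=5).split(order):
        model = LogisticRegression(max_iter=5000).fit(X[order[tr]], t[order[tr]])
        scores.append(model.score(X[order[te]], t[order[te]]))
    return float(np.mean(scores))

chronological = np.argsort(X[:, 0])   # contiguous 'era' folds, like the 20-year blocks
shuffled = rng.permutation(len(X))    # folds that mix all years together
print(cv_accuracy(chronological), cv_accuracy(shuffled))
```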

d) Kernel methods extend the basis function technique to allow for richer and more flexible data representations.

(i) Explain what is meant by a kernel function, and how kernel functions relate to basis functions.

(ii) Using a linear perceptron with basis function φ, the perceptron update takes the form

w ← w + η t_i φ(x_i)

after an error is made on the i-th training instance. Show how the weights can be represented as a linear combination of the training samples,

w = Σ_i α_i t_i φ(x_i),

and show the dual form of the update rule, in terms of α.

(iii) Using the above reparameterisation, derive the kernel perceptron. This requires you to prove that the perceptron discriminant function y(x) = w^T φ(x) can be expressed such that the basis functions occur solely as inner products φ(x_i)^T φ(x_j). Justify why this is important.
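The following sketch illustrates the dual (kernel) perceptron asked for in part d): the weights are never formed explicitly, and the basis functions enter only through the kernel k(x_i, x_j) = φ(x_i)^T φ(x_j). It is not part of the original paper; the RBF kernel, the XOR-style points and the learning rate η = 1 are arbitrary choices for demonstration, and NumPy is assumed.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2); any valid kernel could be substituted."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train_kernel_perceptron(X, t, kernel, epochs=20):
    """Dual perceptron: store one alpha per training point instead of a weight vector.
    Labels t must be in {-1, +1}."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # y(x_i) = sum_j alpha_j t_j k(x_j, x_i): basis functions appear only inside k
            y_i = np.sum(alpha * t * K[:, i])
            if t[i] * y_i <= 0:      # mistake: the dual update alpha_i <- alpha_i + 1
                alpha[i] += 1.0
    return alpha

# Tiny XOR-style dataset (made up): not linearly separable, but separable with an RBF kernel.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([-1.0, 1.0, 1.0, -1.0])
alpha = train_kernel_perceptron(X, t, rbf_kernel)
predictions = [np.sign(np.sum(alpha * t * np.array([rbf_kernel(xj, x) for xj in X]))) for x in X]
print(alpha, predictions)
```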

3. This question concerns regression and maximum likelihood fits of regression models with basis functions.

a) What role do the basis functions play in a regression model?

b) The polynomial basis with degree d computed for a one-dimensional input has the form

φ(x_i) = [1, x_i, x_i², x_i³, ..., x_i^d].

Give a disadvantage of the polynomial basis. Suggest a potential fix for this disadvantage and propose an alternative basis. [20%]

c) The likelihood of a single data point in a regression model is given by

p(y_i | w, x_i, σ²) = (1 / √(2πσ²)) exp( -(y_i - φ(x_i)^T w)² / (2σ²) ).

Assuming that each data point is independent and identically distributed, derive a suitable error function that should be minimized to recover w and σ². Explain your reasoning at each step. [25%]

d) Now show that this error function is minimized with respect to the vector w at the following point,

w = (Φ^T Φ)^{-1} Φ^T y,

where Φ is a design matrix containing all the basis vectors, and y is a vector of regression targets. [30%]

e) What problem will arise as the number of basis functions we use increases to become larger than the number of data points we are given? How can we perform a regression in this case?
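As a sketch of the result in part d), not part of the original paper (the data are synthetic and NumPy is assumed): the closed-form maximum likelihood solution can be computed by solving the normal equations Φ^T Φ w = Φ^T y, and the maximum likelihood estimate of σ² is then the mean squared residual.

```python
import numpy as np

def poly_design_matrix(x, degree):
    """Design matrix Phi with rows [1, x_i, x_i^2, ..., x_i^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(1)

# Synthetic one-dimensional regression data, made up purely for illustration.
x = np.linspace(-1.0, 1.0, 25)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, x.shape)

Phi = poly_design_matrix(x, degree=3)

# w = (Phi^T Phi)^{-1} Phi^T y, computed with solve() rather than an explicit inverse.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Maximum likelihood noise variance: the mean squared residual under the fitted w.
residuals = y - Phi @ w
sigma2 = float(np.mean(residuals ** 2))
print(w, sigma2)
```

When the number of basis functions exceeds the number of data points, Φ^T Φ becomes singular and the solve step fails, which is the situation part e) asks about; regularisation, or a Bayesian prior as in question 4, restores a unique solution.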

4. This question deals with Bayesian approaches to machine learning problems.

a) Machine learning deals with data. What do we need to combine with the data in order to make predictions?

b) Bayes' rule relates four terms: the likelihood, the prior, the posterior and the marginal likelihood or evidence. Describe the role of each of these terms when modelling data. [20%] Write down the relationship between these four terms as given by Bayes' rule.

c) In a regression problem we are given a vector of real-valued targets, y, consisting of N observations y_1, ..., y_N which are associated with multidimensional inputs x_1, ..., x_N. We assume a linear relationship between y_i and x_i where the data is corrupted by independent Gaussian noise, giving a likelihood of the form

p(y | x, w, σ²) = (1 / (2πσ²)^{N/2}) exp( -(1/(2σ²)) Σ_{i=1}^{N} (y_i - w^T x_i)² ).

Consider the following Gaussian prior density for the k-dimensional vector of parameters, w,

p(w) = (1 / (2πα)^{k/2}) exp( -(1/(2α)) w^T w ).

This prior and likelihood can be combined to form the posterior.

(i) Explain why the resulting posterior will be Gaussian distributed.

(ii) Show that the covariance of the posterior density for w is given by

C_w = [ (1/σ²) X^T X + (1/α) I ]^{-1},

where X is a design matrix of the input data. [35%]

(iii) Show that the mean of the posterior density for w is given by

μ_w = (1/σ²) C_w X^T y.

END OF QUESTION PAPER
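A brief numerical sketch of the posterior formulas in part c) of question 4, appended here and not part of the original paper: the data, noise variance σ² and prior variance α are invented, NumPy is assumed, and the code forms C_w and μ_w exactly as stated, checking that the posterior mean lands near the weights used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic design matrix and targets generated from a made-up 'true' weight vector.
N, k = 30, 3
X = rng.normal(size=(N, k))                # one row per input x_i
w_true = np.array([1.5, -0.7, 0.3])
sigma2, alpha = 0.25, 2.0                  # noise variance and prior variance
y = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), size=N)

# Posterior covariance: C_w = [ (1/sigma^2) X^T X + (1/alpha) I ]^{-1}
C_w = np.linalg.inv(X.T @ X / sigma2 + np.eye(k) / alpha)

# Posterior mean: mu_w = (1/sigma^2) C_w X^T y
mu_w = C_w @ X.T @ y / sigma2
print(mu_w)   # should lie close to w_true for this amount of data and this noise level
```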