Mathematical Tools for Neuroscience (NEU 314)
Princeton University, Spring 2016
Jonathan Pillow

Homework 8: Logistic Regression & Information Theory


Due: Tuesday, April 26, 9:59am

Optimization Toolbox

One of the things to learn in this problem set is how to use the optimization toolbox. Specifically, we'll use the function fminunc, which performs unconstrained minimization of a function. It allows you to pass in a function and a starting point (a vector), and it performs gradient descent on the function to find the location of a local minimum.

The optimization toolbox functions generally do minimization instead of maximization. Since we generally want to maximize a likelihood function, you'll actually end up writing a function to compute the negative log-likelihood (since the maximum of the log-likelihood is in the same place as the minimum of the negative log-likelihood).

Before diving into the problem set, it will help to familiarize yourself with how to use fminunc. Open up the script example_optimizationscript.m, which shows the basic steps involved in using fminunc to minimize an example function I wrote (which happens to be a quadratic) called examplelossfunction.m. Open up this function too, just to see what it computes (and what good commenting practice looks like when you write your own functions!). You should be able to use example_optimizationscript.m as a starting point for coding up maximum likelihood estimation of a logistic regression model.
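If you do not have the example script in front of you, the basic fminunc workflow looks roughly like the sketch below. This is only an illustration, not the contents of example_optimizationscript.m or examplelossfunction.m; the quadratic loss and starting point here are made up.

    % Illustrative sketch of the fminunc workflow (not the course's example files).
    lossfun = @(v) sum((v - [1; 2]).^2);    % a made-up quadratic loss with minimum at [1; 2]
    v0 = zeros(2, 1);                       % starting point for the search
    [vhat, fval] = fminunc(lossfun, v0);    % vhat: location of the minimum; fval: loss value there

In the problems below, the function you hand to fminunc will instead be your own negative log-likelihood, with the training data bound in via an anonymous function.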

Logistic regression

Logistic regression is an extremely popular and important model for binary classification. It comes up when we have a stack of vectors with labels 0 and 1 and we'd like to find a linear hyperplane that optimally separates them (or, equivalently, a 1D linear projection that captures the probability of any vector being a 0 or a 1).

Assume we have data vectors and labels {(x_i, y_i)} for i = 1, ..., N, where x_i ∈ R^d is a d-dimensional input vector and y_i ∈ {0, 1} is the corresponding label ("0" or "1") for that vector. For our purposes, think of x as a stimulus for a single time bin (e.g., an image represented as a length-d vector), and y as indicating whether a neuron spiked or not.

The logistic regression model has a parameter vector w, which we can think of as the neuron's receptive field (i.e., a linear set of weights telling us how the neuron integrates each element of the stimulus), and a constant offset b, which tells us the neuron's bias (towards firing or silence) in response to a zero-vector input.

To compute the probability of a spike under the logistic regression model, we take a dot product between x and the weight vector w, and add b. Then we pass this value through a logistic function:

    P(y = 1 | x) = 1 / (1 + exp(-x·w - b)).     (1)

The probability of a spike plus the probability of no spike has to sum to 1, so we have

    P(y = 0 | x) = 1 - P(y = 1 | x) = 1 / (1 + exp(x·w + b)).     (2)

Verify for yourself that these two probabilities add to 1. We can express them more compactly as

    P(y | x) = exp(y(x·w + b)) / (1 + exp(x·w + b)).     (3)

Verify for yourself that this agrees with the probabilities given in equations 1 and 2 above.

Your goal for the first set of problems will be to write a function that evaluates the negative log-likelihood of a dataset. Taking the negative log of (eq. 3) and summing across all stimulus-response pairs, we can show this to be equal to

    -log P(Y | X, {w, b}) = -sum_{i=1}^{N} y_i (x_i·w + b) + sum_{i=1}^{N} log(1 + exp(x_i·w + b)).     (4)

You will hand this function to fminunc and ask it to find the values of w and b that minimize it (i.e., the maximum-likelihood estimates ŵ_ML and b̂_ML). To do this, you will want to create a single parameter vector that contains both w and b inside it, e.g., params = [w;b];
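As a concrete illustration of eq. 4, one possible vectorized form of the negative log-likelihood is sketched below, assuming the parameters are packed as params = [w; b]. The function and variable names (neglogli_logistic, prs, X, y) are placeholders, not required ones.

    % Sketch of a vectorized negative log-likelihood for logistic regression (eq. 4).
    function nll = neglogli_logistic(prs, X, y)
    % prs = [w; b] -- weight vector stacked on top of the scalar offset
    % X   = N x d matrix of stimuli (one row per trial)
    % y   = N x 1 vector of binary labels (0 or 1)
    w = prs(1:end-1);
    b = prs(end);
    z = X*w + b;                           % N x 1 vector of projections x_i·w + b
    nll = -y'*z + sum(log(1 + exp(z)));    % eq. 4
    end

You would then hand fminunc an anonymous-function wrapper, e.g., lossfun = @(prs) neglogli_logistic(prs, xtrain, ytrain); followed by prshat = fminunc(lossfun, zeros(d+1,1)); where d is the number of stimulus dimensions. (If exp(z) overflows for very large projections, log(1 + exp(z)) can be computed more stably, but the simple form above matches eq. 4 directly.)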

1. Load and plot the raw data

Load the dataset neuralencodingdataset1.mat. This contains 1000 stimulus-response pairs from a (simulated) neural experiment. The matrix x is a 1000 × 2 matrix, each row of which is a 2-component stimulus vector for a single trial (or "time bin"). The vector y holds the corresponding spike count (or "label") observed for each stimulus.

Now create a training set consisting of the first 800 datapoints, e.g., xtrain = x(1:800,:); ytrain = y(1:800); Set aside the final 200 points as a test dataset that you'll use later when evaluating the quality of your model fit.

Make two scatter plots of the raw data, using different symbols for the "spike" and "no-spike" stimuli. One plot should show the training data, the other should show the test data.

2. Warmup: least-squares regression estimate

(a) Compute the least-squares estimate ŵ_LS for the weight vector w under the linear-Gaussian regression model of the data, given by y_i = x_i·w + n_i, where n_i is an independent Gaussian random variable with mean zero and variance σ². Use the training data for this. (You should be able to do this with a single line of matlab. The formula should be second nature to you by now!) (Note: we know that a linear-Gaussian model is the wrong model for this data, since Gaussian noise will not produce outputs with the values 0 and 1. But still, it's (perhaps) not a bad thing to start with.)

(b) Re-generate the scatter plot of the training data and add a unit vector pointing in the direction of the estimated ŵ_LS vector. (That is, compute a unit vector pointing in the same direction as ŵ_LS and add a line from 0 to the corresponding point on the unit circle.) Does it look like it points in the "correct" direction, i.e., the direction that will preserve information about the probability of spiking if we project the stimuli onto it?

3. Logistic regression: maximum likelihood estimation

(a) Write a function to evaluate the negative log-likelihood of the dataset under a logistic regression model given parameters w and b, as described above. It would be great if you can avoid for loops! (2 points)

(b) Use fminunc to find the maximum likelihood estimate of w and b under the logistic model given the training data. (See the notes above on the optimization toolbox and the logistic regression model.)

(c) Re-generate your scatter plot of the training data and add a unit vector pointing in the direction of ŵ_ML.

(d) Now add a line showing the linear "separatrix", that is, the line dividing the region of x space where P(y = 1 | x) < 0.5 from the region where P(y = 1 | x) > 0.5. (You will need to do a little bit of math to figure out what this line is; your two hints are that it corresponds to the points for which P(y = 1 | x) = 0.5, and that it lies orthogonal to ŵ_ML.) (2 points)

(e) Make a separate plot showing the projected training datapoints x·w + b on the x axis and the probability P(y = 1 | x) under the fitted model on the y axis. Use a different symbol/color for the points with y = 0 than for those with y = 1. Now add a smooth plot of the fitted logistic function using a grid of x values extending over the range of the data.

(f) Classify both the training and test datasets by assigning labels to each point: assign z = 0 if P(y = 1 | x) < 0.5 and z = 1 otherwise. Now compute the classification error on the training data and the test data by comparing the true labels y with the model-assigned labels z (a minimal sketch of this step appears after this problem). Did your model do better on the training or the test data?

(g) Compute the (positive) log-likelihood per sample for both the training and test sets. That is, evaluate the negative log-likelihood of the training set using parameters {ŵ_ML, b̂_ML}, then divide by 800, the number of training points, and multiply by (-1). Do the same for the test set (using the same parameters), but divide by the number of test points. These quantities are known as the training log-likelihood and test log-likelihood per sample. Which is higher? Why do you think this is?
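As referenced in part (f), a minimal sketch of the thresholding and error computation is given below, assuming the fitted parameters have been unpacked into what and bhat (placeholder names) and that xtest and ytest hold the held-out data; the last line uses the neglogli_logistic sketch from above for part (g).

    % Sketch of the classification error (3f) and per-sample log-likelihood (3g); placeholder names.
    pspike  = 1 ./ (1 + exp(-(xtest*what + bhat)));    % P(y = 1 | x) under the fitted model
    z       = double(pspike > 0.5);                    % model-assigned labels
    testerr = mean(z ~= ytest);                        % fraction of test points misclassified
    testLL  = -neglogli_logistic([what; bhat], xtest, ytest) / numel(ytest);  % test log-likelihood per sample

The training-set versions are identical with xtrain and ytrain substituted (and a divisor of 800).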

4. Nonlinear classification using logistic regression

Load the dataset neuralencodingdataset2.mat, which contains x and y in the same format as the previous problem.

(a) Make a scatter plot of the raw data, using different symbols for points labelled 0 and 1. What do you notice about the separability of these data? (Could they be well separated with a line?)

(b) Repeat steps (b), (c), (d), and (f) from the previous problem on the new dataset (using the first 800 samples for training and the last 200 for testing, just as before). You should be able to copy the code from the previous problem over to a new file and just call it with the new data.

(c) Augment the x matrix with three additional columns, given by quadratic functions of the original input vectors: x(:,1).^2, x(:,2).^2, and x(:,1).*x(:,2). These columns provide a basis set for all quadratic functions of the original x, and will allow us to fit an arbitrary quadratic classification boundary between the points labelled y = 0 and y = 1. The new design matrix will have 5 columns (see the sketch after this problem). Run logistic regression on this expanded data, using the same division between training and test data. (That is, find the maximum likelihood estimators for w and b under the logistic regression model, where w is now a vector with 5 components.)

(d) Compute the training and test error of the quadratic logistic regression model. How does it compare to the training and test error of the original model?

(e) Make a pair of scatter plots of the test data using a distinct symbol (or color) for the 4 different kinds of points: points correctly labelled 0 ("correct rejects"), points incorrectly labelled 0 ("misses"), points correctly labelled 1 ("hits"), and points incorrectly labelled 1 ("false alarms"). Make one plot using the labels from the original (linear) logistic regression model, and another plot using labels from the quadratic logistic regression model.
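As referenced in part (c), building the augmented design matrix is a one-liner; the variable names below (xaug, xtrain2, and so on) are placeholders.

    % Sketch of the quadratic feature expansion for problem 4(c); placeholder names.
    xaug    = [x, x(:,1).^2, x(:,2).^2, x(:,1).*x(:,2)];   % 1000 x 5 design matrix
    xtrain2 = xaug(1:800,:);    ytrain2 = y(1:800);         % same train/test split as before
    xtest2  = xaug(801:end,:);  ytest2  = y(801:end);

The same negative log-likelihood function can be reused unchanged; w simply has 5 components now, so the packed parameter vector has 6 entries including b.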

(a)compute the joint distribution P (R, S) and marginal distributions P (R) and P (S). (You might wish to use the function hist3). Plot the two marginals, and use imagesc to display the joint. Now use these sample distributions to compute plug-in estimates of: (b) the entropy H(S); (c) the entropy H(R); (d) the mutual information I(S, R). 6. Generate a 2D discrete probability table (of dimension 50 50) for the joint distribution P (X, Y ) of two random variables X and Y. Generate a joint distribution such that X and Y are completely uncorrelated but have high mutual information. Make a plot of P (X, Y ) and compute the correlation coefficient and the mutual information between X and Y. 5