Week 5: Logistic Regression & Neural Networks

Instructor: Sergey Levine

1 Summary: Logistic Regression

In the previous lecture, we covered logistic regression. To recap, logistic regression models and optimizes the conditional log likelihood

L(w) = \sum_i \log p(y_i | x_i, w) = \sum_i \left( y_i h(x_i)^\top w - \log[\exp(h(x_i)^\top w) + 1] \right).

This likelihood cannot be optimized analytically, so instead we compute a gradient, according to

\nabla L(w) = \sum_i h(x_i) \left[ y_i - p(y = 1 | x_i, w) \right].

Then, we use an algorithm called gradient ascent to optimize L(w), by repeatedly performing the following operation:

w^{(j+1)} = w^{(j)} + \alpha \nabla L(w^{(j)}).

This performs MLE to determine w. We can also use MAP, where we put a prior on w. For example, a Gaussian prior produces the objective

L(w) = \sum_i \left( y_i h(x_i)^\top w - \log[\exp(h(x_i)^\top w) + 1] \right) - \lambda \sum_{k=1}^K w_k^2.

We can also use logistic regression to classify multiple different classes. If y \in \{1, \dots, L_y\}, we can use

p(y = j | x, W) = \frac{\exp(h(x)^\top w_j)}{\sum_{j'=1}^{L_y} \exp(h(x)^\top w_{j'})},

where the parameters now correspond to a weight matrix W = [w_1, \dots, w_{L_y}].
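To make the update rule concrete, here is a minimal NumPy sketch of batch gradient ascent for this objective. The function name, the step size alpha, and the iteration count are placeholder choices for illustration, not values from the lecture.

    import numpy as np

    def sigmoid(z):
        # the logistic function: 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(H, y, alpha=0.1, num_iters=1000, lam=0.0):
        # H: N x K matrix whose rows are the features h(x_i); y: length-N array of 0/1 labels.
        # Maximizes L(w) = sum_i [ y_i h(x_i)^T w - log(exp(h(x_i)^T w) + 1) ] by gradient ascent.
        # Setting lam > 0 adds the Gaussian-prior (MAP) penalty -lam * sum_k w_k^2.
        N, K = H.shape
        w = np.zeros(K)
        for _ in range(num_iters):
            p = sigmoid(H @ w)        # p(y = 1 | x_i, w) for every datapoint
            grad = H.T @ (y - p)      # sum_i h(x_i) [ y_i - p(y = 1 | x_i, w) ]
            grad -= 2.0 * lam * w     # gradient of the (optional) MAP penalty
            w = w + alpha * grad      # w^(j+1) = w^(j) + alpha * grad L(w^(j))
        return w

Calling fit_logistic_regression(H, y) performs plain MLE, while passing lam > 0 gives the MAP variant with a Gaussian prior.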

2 Comparison with Naïve Bayes

Logistic regression solves the same kind of problem as naïve Bayes, but uses a different objective and a different estimation procedure. The two algorithms have slightly different strengths and weaknesses, and present some interesting tradeoffs.

First, in the case where x \in R^K (and h(x) = x) [1], that is, when all features are real-valued, it can be proven that naïve Bayes and logistic regression are representationally equivalent, meaning that for every logistic regression classifier, we can find a naïve Bayes classifier that always outputs the same answer, and vice-versa. This requires naïve Bayes to use a Gaussian conditional distribution for each p(x_k | y), and for all of these Gaussians to have the same variance (but a different mean). This representational equivalence makes it convenient to compare naïve Bayes and logistic regression. We won't derive the representational equivalence in detail: it's quite straightforward to obtain it from Bayes' rule, and you may see this on a homework exercise or (hint!) an exam question.

[1] In this section, I will drop the convention of using h(x) to represent the features and simply denote them with x. As we've learned before, this doesn't change any of the math; it's just done here to match the notation in naïve Bayes.

What assumptions does naïve Bayes make that logistic regression does not?

Answer. Naïve Bayes assumes that the features are independent of one another given the class. This is extremely convenient, because it allows us to estimate each conditional p(x_k | y) separately and in closed form. However, this assumption can be quite simplistic.

As the number of features K increases, which model is likely to overfit more, logistic regression or naïve Bayes?

Answer. One way to think about this question is to imagine what happens if a certain feature is replicated with a bit of noise added to it. For example, suppose we have K features that are all equal in expectation, but corrupted with different Gaussian noise. Naïve Bayes estimates each feature distribution independently, so no matter how many features we have, naïve Bayes will perform about the same. However, logistic regression will get worse and worse as the number of features increases: as the number of features K exceeds the number of datapoints N, it becomes easy for logistic regression to fit the noise in the data. In the extreme case, if we imagine that each feature is on average 0 but randomly +1 or -1 with some small (and equal) probability, then as K >> N we can find, for each sample, a feature that is +1 for only that sample, making it trivially easy to overfit.
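As a rough empirical illustration of this thought experiment, the sketch below replicates one weak signal across many noisy features so that K approaches N, and compares the two classifiers. It assumes NumPy and scikit-learn are available; the constants (N = 60, K = 50, the noise scale, and the weak regularization C = 1e6) are arbitrary illustrative choices, and the exact numbers will vary with the random seed. Typically logistic regression shows a much larger gap between training and test accuracy in this regime.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    N, K = 60, 50                    # few datapoints, many features
    y = rng.integers(0, 2, size=N)
    # every feature equals the same weak signal, corrupted by independent Gaussian noise
    X = y[:, None] + rng.normal(scale=2.0, size=(N, K))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    lr = LogisticRegression(C=1e6, max_iter=5000).fit(X_tr, y_tr)   # weakly regularized
    nb = GaussianNB().fit(X_tr, y_tr)

    print("logistic regression train/test accuracy:", lr.score(X_tr, y_tr), lr.score(X_te, y_te))
    print("naive Bayes         train/test accuracy:", nb.score(X_tr, y_tr), nb.score(X_te, y_te))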

Based on this intuition, we can make a few conclusions about the relative performance of logistic regression vs. naïve Bayes. Indeed, these conclusions can be proven to hold theoretically, though proving this is outside the scope of this class. Naïve Bayes will tend to exhibit more bias, since the independence assumption corresponds to a simpler model. That means that with less data, naïve Bayes will overfit less, but as the amount of data increases, logistic regression will eventually perform better, since it doesn't make the simplifying assumption of feature independence. Correspondingly, naïve Bayes will continue to perform well as the number of features increases, while logistic regression will become more vulnerable to overfitting. So, in short: naïve Bayes generally needs less data, while logistic regression will generally find a better solution if it doesn't overfit. In the case that the features are truly independent and the amount of data is near infinite, it can be shown that both methods will find the same solution if naïve Bayes uses Gaussian conditionals with input-independent variance.

The other consideration, which is more obvious from inspecting the learning algorithms, is that naïve Bayes classifiers can be learned extremely efficiently and quickly, while logistic regression requires a comparatively more complex gradient ascent procedure. This is typically not a big deal, but naïve Bayes sometimes has a bit more flexibility than logistic regression. For example, it's easy with naïve Bayes to handle missing data: if some record is missing a certain feature x_k, we simply ignore that record when estimating p(x_k | y), and during inference we can easily ignore features that are unknown. The same is not possible with logistic regression.

3 From Logistic Regression to Neural Networks

Although we can often choose very expressive features h(x) for logistic regression, we sometimes don't know exactly which features are needed to solve a given classification problem. If we just use h(x) = x, the linear logistic regression classifier can't solve certain otherwise simple problems. For example, imagine that x = [x_1, x_2], where x_1, x_2 \in \{0, 1\}, and y = x_1 XOR x_2, such that y = 0 if x_1 = x_2 and y = 1 if x_1 \neq x_2. If we have h(x) = x, no logistic regression classifier can solve this task. To understand why, it's enough to draw a plot with x_1 and x_2 on the axes: the positive and negative examples cannot be separated by a single line. However, we can observe that we can write the XOR function as the following logical expression:

y = (x_1 = 1 and x_2 = 0) or (x_1 = 0 and x_2 = 1).

Note that the "or" operator can actually be implemented by a linear classifier, so if we had two features h_1 = (x_1 = 1 and x_2 = 0) and h_2 = (x_1 = 0 and x_2 = 1), we could set

p(y = 1 | x) = \frac{\exp(h_1 + h_2 - 0.5)}{\exp(h_1 + h_2 - 0.5) + 1},

which can be implemented using logistic regression (though we need a bias feature). So can we learn to extract h_1 and h_2 automatically? This turns out to be quite easy if we have another layer of logistic regression that attempts to output h_1 and h_2 (separately) from x_1 and x_2. For example, we can use:

h_1 = \frac{\exp(8 x_1 - 4 x_2 - 6)}{\exp(8 x_1 - 4 x_2 - 6) + 1}.
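A quick numerical check of this hand-designed feature (a sketch; the printed values are approximate):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # h1 should act as a detector for "x1 = 1 and x2 = 0"
    for x1 in (0, 1):
        for x2 in (0, 1):
            h1 = sigmoid(8 * x1 - 4 * x2 - 6)
            print((x1, x2), round(h1, 3))
    # prints roughly 0.002, 0.0, 0.881, 0.119 for (0,0), (0,1), (1,0), (1,1)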

Note that h_1 > 0.5 when (x_1 = 1 and x_2 = 0) and less than 0.25 otherwise, which means that h_1 + h_2 > 0.5 only when (x_1 = 1 and x_2 = 0) or (x_1 = 0 and x_2 = 1). This idea of using intermediate layers of logistic regression to automatically construct features results in a model called a neural network. Each layer in a neural network has weights W^{(l)} = [w^{(l)}_1, \dots, w^{(l)}_{h^{(l)}}]^\top, where each weight matrix has a number of rows equal to the number of features (referred to as hidden units or neurons) at that layer. Note that by convention, the weight vectors are rows of this matrix, so that

h^{(l)} = \sigma(W^{(l)} h^{(l-1)}),

where \sigma(z) is a nonlinearity, which in the case of logistic regression is the logistic function

\sigma(z) = \frac{\exp(z)}{\exp(z) + 1} = \frac{1}{1 + \exp(-z)}.

By convention, we automatically add a bias feature to each layer and denote it separately from the weight matrix as b^{(l)}, such that

h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)}).

How many hidden units does the XOR network have?

Answer. The XOR network has two hidden units, h_1 and h_2.

What are the first layer weights in the XOR network?

Answer. Since there are two hidden units, we need a weight matrix W^{(1)} with two rows, and a 2D bias vector b^{(1)}. Based on the equations for h_1 and h_2, we have:

W^{(1)} = \begin{bmatrix} 8 & -4 \\ -4 & 8 \end{bmatrix}, \quad b^{(1)} = \begin{bmatrix} -6 \\ -6 \end{bmatrix}.

What are the second layer weights in the XOR network?

Answer. There is only one output, and therefore only one weight vector:

W^{(2)} = [1 \;\; 1], \quad b^{(2)} = -0.5.

In general, we could have different nonlinearity functions \sigma on the hidden layers in the network. The sigmoid or logistic function is a popular choice. Other popular choices include the hyperbolic tangent and, more recently, the rectified linear unit, given simply by ReLU(z) = max(z, 0). We will discuss this briefly later in the week.
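Putting the pieces together, here is a small sketch of the two-layer forward pass in the matrix convention above, using the XOR weights from the answers (the helper function name is just for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer(W, b, h_prev):
        # h^(l) = sigma(W^(l) h^(l-1) + b^(l)); the rows of W are the weight vectors
        return sigmoid(W @ h_prev + b)

    W1 = np.array([[8.0, -4.0],
                   [-4.0, 8.0]])
    b1 = np.array([-6.0, -6.0])
    W2 = np.array([[1.0, 1.0]])
    b2 = np.array([-0.5])

    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        h = layer(W1, b1, np.array(x, dtype=float))   # hidden units h1 and h2
        p = layer(W2, b2, h)                          # p(y = 1 | x)
        print(x, p.round(3))
    # the output probability exceeds 0.5 exactly when x1 XOR x2 = 1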

For the neural network presented above, what is the objective, and what is the model modeling?

Answer. Just like with logistic regression, we can optimize the neural network with respect to the conditional likelihood

L(\theta) = \sum_i \log p(y_i | x_i, \theta),

where the parameters \theta are given by \theta = \{W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}\}. We can also use neural networks for multiclass classification, regression, and other estimation problems. Typically, neural networks optimize a conditional objective, though it is also possible to build generative neural networks (which will not be covered in this class). We'll discuss algorithms for training neural networks next time, along with some discussion of their strengths and weaknesses.
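To make this objective concrete, the sketch below evaluates L(\theta) for the hand-built XOR network from earlier in the lecture on the four XOR datapoints. The helper names are invented for illustration, and actually improving L(\theta) by gradient-based training is the topic of the next lecture.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_prob(theta, X):
        # two-layer forward pass: h = sigma(W1 x + b1), p(y = 1 | x) = sigma(W2 h + b2)
        W1, b1, W2, b2 = theta
        H = sigmoid(X @ W1.T + b1)
        return sigmoid(H @ W2.T + b2).ravel()

    def conditional_log_likelihood(theta, X, y):
        # L(theta) = sum_i log p(y_i | x_i, theta)
        p = predict_prob(theta, X)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    # the XOR network from this lecture, evaluated on the XOR dataset
    theta = (np.array([[8.0, -4.0], [-4.0, 8.0]]), np.array([-6.0, -6.0]),
             np.array([[1.0, 1.0]]), np.array([-0.5]))
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])
    print(conditional_log_likelihood(theta, X, y))   # higher is better; training would increase this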