Lecture 2: Logistic Regression and Neural Networks

1/23 Lecture 2: Logistic Regression and Neural Networks. Pedro Savarese, TTI 2018

2/23 Table of Contents
1. Naive Bayes
2. Logistic Regression
3. Input Encoding and Spam Detection
4. Neural Networks

3/23 Naive Bayes
- Learn p(x, y) = p(y) p(x | y)
- Training: Maximum Likelihood Estimation
- Issues?
  - Why learn p(x, y) if we only use p(y | x)?
  - Is the naive assumption realistic? Are words independent given y?
  - Is the Bernoulli assumption realistic? What about repeated words?

4/23 Naive Bayes
Let's analyze the prediction rule. Denote spam as 1 and not spam as 0. We predict 1 if:
\[
\begin{aligned}
p(1 \mid x) &> p(0 \mid x) \\
p(1)\, p(x \mid 1) &> p(0)\, p(x \mid 0) \\
\frac{p(x \mid 1)}{p(x \mid 0)} &> \frac{p(0)}{p(1)} \\
\log \frac{p(x \mid 1)}{p(x \mid 0)} &> \log \frac{p(0)}{p(1)} \\
\log \frac{\prod_{v \in V} p(v \mid 1)^{\mathbb{1}\{v \in x\}} \, (1 - p(v \mid 1))^{\mathbb{1}\{v \notin x\}}}{\prod_{v \in V} p(v \mid 0)^{\mathbb{1}\{v \in x\}} \, (1 - p(v \mid 0))^{\mathbb{1}\{v \notin x\}}} &> \log \frac{p(0)}{p(1)}
\end{aligned}
\]

5/23 Naive Bayes
Denote \(\mathbb{1}\{v \in x\}\) as \(\phi_v(x)\):
\[
\begin{aligned}
\log \frac{\prod_{v \in V} p(v \mid 1)^{\phi_v(x)} \, (1 - p(v \mid 1))^{1 - \phi_v(x)}}{\prod_{v \in V} p(v \mid 0)^{\phi_v(x)} \, (1 - p(v \mid 0))^{1 - \phi_v(x)}} &> \log \frac{p(0)}{p(1)} \\
\sum_{v \in V} \left( \phi_v(x) \log \frac{p(v \mid 1)}{p(v \mid 0)} + (1 - \phi_v(x)) \log \frac{1 - p(v \mid 1)}{1 - p(v \mid 0)} \right) &> \log \frac{p(0)}{p(1)} \\
\sum_{v \in V} \left( \phi_v(x) \log \frac{p(v \mid 1)(1 - p(v \mid 0))}{p(v \mid 0)(1 - p(v \mid 1))} + \log \frac{1 - p(v \mid 1)}{1 - p(v \mid 0)} \right) &> \log \frac{p(0)}{p(1)} \\
\sum_{v \in V} \phi_v(x)\, w_v + c_1 &> c_2 \\
\sum_{v \in V} \phi_v(x)\, w_v &> b
\end{aligned}
\]

6/23 Naive Bayes
So why all this math? Because
\[
\sum_{v \in V} \phi_v(x)\, w_v + b > 0
\]
is just a linear function! A Naive Bayes model is, in reality, equivalent to learning to separate spam / not spam with a line.
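A minimal NumPy sketch of this equivalence, assuming the Bernoulli Naive Bayes parameters p(v | 1), p(v | 0) and class priors p(1), p(0) have already been estimated; the three-word vocabulary and all numbers below are made up for illustration:

```python
import numpy as np

# Illustrative Bernoulli Naive Bayes parameters for a 3-word vocabulary.
# p_v1[v] = p(v | spam), p_v0[v] = p(v | not spam); p1, p0 are the class priors.
p_v1 = np.array([0.8, 0.6, 0.1])
p_v0 = np.array([0.2, 0.3, 0.4])
p1, p0 = 0.4, 0.6

# Weights and bias of the equivalent linear classifier, read off the derivation above:
#   w_v = log[ p(v|1)(1 - p(v|0)) / (p(v|0)(1 - p(v|1))) ]
#   b   = sum_v log[ (1 - p(v|1)) / (1 - p(v|0)) ] - log[ p(0) / p(1) ]
w = np.log(p_v1 * (1 - p_v0) / (p_v0 * (1 - p_v1)))
b = np.sum(np.log((1 - p_v1) / (1 - p_v0))) - np.log(p0 / p1)

def nb_log_odds(phi):
    """log p(1|x) - log p(0|x) for binary features phi = [phi_v(x)]_v."""
    return (np.log(np.where(phi == 1, p_v1, 1 - p_v1)).sum()
            - np.log(np.where(phi == 1, p_v0, 1 - p_v0)).sum()
            + np.log(p1 / p0))

phi = np.array([1, 0, 1])                            # example e-mail: words 1 and 3 present
print(np.isclose(nb_log_odds(phi), w @ phi + b))     # True: same decision rule
```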

7/23 Table of Contents (Section 2: Logistic Regression)

8/23 We only care about p(y | x), so why learn p(x, y)?
- Learn p(y | x) directly, through a linear model
- Spam detection: learn f : x ↦ p(spam | x)
- First, learn a linear score function s : x → ℝ
  - We want s(x) to be high if p(spam | x) ≈ 1
  - And s(x) to be low if p(spam | x) ≈ 0
- Linear model for s: s(x) = ⟨w, x⟩ + b
- Remaining: map s(x) to p(spam | x). Use the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ).
- Facts (checked numerically in the sketch below): σ(+∞) = 1, σ(−∞) = 0, σ(z) > 0.5 for z > 0, σ(z) < 0.5 for z < 0
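A quick numeric check of these facts, as a minimal Python sketch (the particular inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(50.0), sigmoid(-50.0))   # ~1.0 and ~0.0: sigma(+inf) = 1, sigma(-inf) = 0
print(sigmoid(0.3) > 0.5)              # True: sigma(z) > 0.5 for z > 0
print(sigmoid(-0.3) < 0.5)             # True: sigma(z) < 0.5 for z < 0
```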

9/23 Sigmoid function σ: [figure: plot of σ(z) = 1 / (1 + e⁻ᶻ), an S-shaped curve rising from 0 to 1]

10/23 Model: p(spam | x) = σ(⟨w, x⟩ + b)
- Parameters: w and b
- Learning: Maximum Likelihood Estimation (of the conditional likelihood!)
\[
\begin{aligned}
L &= \prod_{(x,y) \in (X,Y)} p(y \mid x) \\
  &= \prod_{(x,y) \in (X,Y)} p(\mathrm{spam} \mid x)^{\mathbb{1}\{y=\mathrm{spam}\}} \, (1 - p(\mathrm{spam} \mid x))^{\mathbb{1}\{y=\mathrm{not\ spam}\}} \\
  &= \prod_{(x,y) \in (X,Y)} \sigma(\langle w, x\rangle + b)^{\mathbb{1}\{y=\mathrm{spam}\}} \, (1 - \sigma(\langle w, x\rangle + b))^{\mathbb{1}\{y=\mathrm{not\ spam}\}}
\end{aligned}
\]

11/23 Log-likelihood:
\[
\log L = \sum_{(x,y) \in (X,Y)} \Big( \mathbb{1}\{y=\mathrm{spam}\} \log \sigma(\langle w, x\rangle + b) + \mathbb{1}\{y=\mathrm{not\ spam}\} \log\big(1 - \sigma(\langle w, x\rangle + b)\big) \Big)
\]
Gradients:
\[
\frac{\partial \log L}{\partial w} = \sum_{(x,y) \in (X,Y)} x \,\big( \mathbb{1}\{y=\mathrm{spam}\} - \sigma(\langle w, x\rangle + b) \big),
\qquad
\frac{\partial \log L}{\partial b} = \sum_{(x,y) \in (X,Y)} \big( \mathbb{1}\{y=\mathrm{spam}\} - \sigma(\langle w, x\rangle + b) \big)
\]
It is not possible to solve ∂ log L / ∂w = 0 analytically: it is a non-linear system.
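A minimal NumPy sketch of these quantities, assuming labels y in {0, 1} with 1 = spam and a feature matrix X with one e-mail per row; the function and variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_and_grads(w, b, X, y):
    """log L and its gradients for X of shape (n, d) and y in {0, 1} (1 = spam)."""
    p = sigmoid(X @ w + b)                     # p(spam | x_i) for every row x_i
    eps = 1e-12                                # guard against log(0) for saturated predictions
    logL = np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad_w = X.T @ (y - p)                     # sum_i x_i (1{y_i = spam} - p_i)
    grad_b = np.sum(y - p)                     # sum_i (1{y_i = spam} - p_i)
    return logL, grad_w, grad_b
```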

12/23 Alternative: gradient ascent
- ∂ log L / ∂w is the direction (in w) that increases log L the most
- Gradient ascent: move w in the direction of ∂ log L / ∂w, iteratively:
\[
\begin{aligned}
w &\leftarrow w + \eta \frac{\partial \log L}{\partial w} = w + \eta \sum_{(x,y) \in (X,Y)} x \,\big( \mathbb{1}\{y=\mathrm{spam}\} - \sigma(\langle w, x\rangle + b) \big) \\
b &\leftarrow b + \eta \frac{\partial \log L}{\partial b} = b + \eta \sum_{(x,y) \in (X,Y)} \big( \mathbb{1}\{y=\mathrm{spam}\} - \sigma(\langle w, x\rangle + b) \big)
\end{aligned}
\]
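A minimal batch gradient-ascent loop implementing these updates; the learning rate, step count, and names are illustrative choices, not the lecture's reference code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_steps=1000):
    """X: (n, d) feature matrix, y: (n,) labels in {0, 1} with 1 = spam."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        residual = y - sigmoid(X @ w + b)   # 1{y = spam} - sigma(<w, x> + b), per example
        w = w + lr * (X.T @ residual)       # w <- w + eta * dlogL/dw
        b = b + lr * residual.sum()         # b <- b + eta * dlogL/db
    return w, b
```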

13/23 Classifying e-mails: predict spam if
p(spam | x) > p(not spam | x)
⟺ p(spam | x) > 1 − p(spam | x)
⟺ p(spam | x) > 0.5
⟺ σ(⟨w, x⟩ + b) > 0.5
⟺ ⟨w, x⟩ + b > 0
Look familiar?

14/23 Table of Contents (Section 3: Input Encoding and Spam Detection)

15/23 Input Encoding
- x and w must have the same dimensionality
- Must represent the data as a fixed-dimensional vector
- Common encoding in Natural Language Processing (see the sketch below):
  x(e-mail)ᵢ = 1{word vᵢ of the vocabulary V is in the e-mail} = φ_{vᵢ}(x)
- Then x(e-mail) is always |V|-dimensional
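A minimal sketch of this encoding; the five-word vocabulary below is a made-up toy example (in practice V would be built from the training corpus):

```python
import numpy as np

VOCAB = ["free", "money", "meeting", "project", "winner"]

def encode(email_text):
    """phi_v(x) = 1{word v of the vocabulary V is in the e-mail}; returns a |V|-dim vector."""
    words = set(email_text.lower().split())
    return np.array([1.0 if v in words else 0.0 for v in VOCAB])

print(encode("You are a WINNER claim your free money"))   # [1. 1. 0. 0. 1.]
```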

16/23 Spam Detection
Quick coding session:
- Implement the input encoding: transform e-mails into fixed-length vectors
- Implement the logistic model p(spam | x) = σ(s(x))
- Implement the gradient computation for w and b
- Implement gradient ascent and train the model
(An end-to-end sketch combining these steps follows below.)
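An end-to-end sketch of the coding-session steps, under the same assumptions as the snippets above (binary bag-of-words features, label 1 = spam, 0 = not spam); the tiny dataset and vocabulary are invented purely for illustration:

```python
import numpy as np

VOCAB = ["free", "money", "meeting", "project", "winner"]

def encode(text):
    words = set(text.lower().split())
    return np.array([1.0 if v in words else 0.0 for v in VOCAB])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

emails = ["free money winner", "claim your free money now",
          "project meeting at noon", "notes from the project meeting"]
y = np.array([1.0, 1.0, 0.0, 0.0])                 # 1 = spam, 0 = not spam
X = np.stack([encode(e) for e in emails])

w, b = np.zeros(len(VOCAB)), 0.0
for _ in range(500):                               # gradient ascent on log L
    residual = y - sigmoid(X @ w + b)
    w += 0.5 * X.T @ residual
    b += 0.5 * residual.sum()

print(sigmoid(encode("free money") @ w + b))       # close to 1: predicted spam
print(sigmoid(encode("project meeting") @ w + b))  # close to 0: predicted not spam
```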

17/23 Table of Contents (Section 4: Neural Networks)

18/23 Sigmoid of a linear score: the prediction is given by a linear function
- Many simple problems are not linearly separable. Example:
  - V = {human, dog}
  - Possible combinations: {}, {human}, {dog}, {human, dog}
  - Task: classify single-word e-mails versus non-single-word e-mails (no line can do this; see the argument below)
- Idea: model p(spam | x) with a non-linear function instead
- That non-linear function: a composition of logistic (sigmoid) models
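To see why no linear score can solve this toy task, here is the standard XOR argument spelled out, using the features of the example above. A rule "predict 1 iff w₁φ_human(x) + w₂φ_dog(x) + b > 0" would need
\[
\begin{aligned}
\{\text{human}\}&:\; w_1 + b > 0, &\qquad \{\text{dog}\}&:\; w_2 + b > 0, \\
\{\,\}&:\; b < 0, &\qquad \{\text{human, dog}\}&:\; w_1 + w_2 + b < 0.
\end{aligned}
\]
Adding the first two inequalities gives \(w_1 + w_2 + 2b > 0\), i.e. \(w_1 + w_2 + b > -b > 0\), contradicting the last one. Hence no linear classifier ⟨w, x⟩ + b solves this task, regardless of how it is trained.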

19/23 Feedforward Neural Network: [figure: network diagram with an input layer, a layer of hidden sigmoid units, and a sigmoid output unit]

20/23 Feedforward Neural Network:
\[
h_i = \sigma\big(\langle w_i^{(h)}, x\rangle + b_i^{(h)}\big),
\qquad
p(y \mid x) = \sigma\big(\langle w^{(y)}, h\rangle + b^{(y)}\big)
\]
Each hᵢ is called a hidden neuron. Note that we can have as many hidden neurons as desired.
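A minimal NumPy sketch of this forward pass for one hidden layer of sigmoid units; the layer sizes and random initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, n_hidden = 5, 3                      # input dimension |V| and number of hidden neurons
W_h = rng.normal(size=(n_hidden, d))    # rows are the w_i^(h)
b_h = np.zeros(n_hidden)                # b_i^(h)
w_y = rng.normal(size=n_hidden)         # w^(y)
b_y = 0.0                               # b^(y)

def forward(x):
    """p(spam | x) for a feedforward network with one hidden sigmoid layer."""
    h = sigmoid(W_h @ x + b_h)          # h_i = sigma(<w_i^(h), x> + b_i^(h))
    return sigmoid(w_y @ h + b_y)       # p(y | x) = sigma(<w^(y), h> + b^(y))

print(forward(np.array([1.0, 0.0, 1.0, 0.0, 1.0])))   # a probability in (0, 1)
```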

21/23 Advantages of Neural Networks:
- Cybenko '89: given enough hidden neurons, neural networks can approximate any continuous function arbitrarily well
- Easy to train: gradient ascent / descent
- In practice: hidden neurons act as feature extractors
- Can control model complexity by adding more hidden neurons or more layers (Deep Learning)

22/23 Quick Coding Session:
- Understand the neural network code, and how it differs from logistic regression
- Train the neural network. Do the same settings from logistic regression work?
- Add more layers and train the network again

23/23 Is data often not linearly separable?
- Problem: vocabulary of 10 words
- Task: given an e-mail, classify it as 1 if it has 5 or 6 words
- Is this problem easy?