Logistic Regression & Neural Networks

Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein

Logistic Regression

Perceptron & Probabilities. What if we want a probability p(y|x)? The perceptron only gives us a prediction y. Let's illustrate this with binary classification. (Illustrations: Graham Neubig)

The logistic function: a softer function than the perceptron's step function. It can account for uncertainty, and it is differentiable.
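A minimal sketch of this in Python (the sigmoid helper, the list-based weight and feature vectors, and the example numbers are illustrative choices of mine, not taken from the slides):

import math

def sigmoid(z):
    # logistic function: smoothly maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, phi):
    # P(y = 1 | x) = sigmoid(w . phi(x)); w and phi are parallel lists of floats
    z = sum(wi * fi for wi, fi in zip(w, phi))
    return sigmoid(z)

print(predict_proba([0.5, -1.0], [1.0, 2.0]))  # prints roughly 0.18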

Logistic regression: how to train? Train based on the conditional likelihood: find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i, i.e. w* = argmax_w Π_i P(y_i | x_i; w).
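As a rough illustration (not from the slides), this objective is usually computed as a log-likelihood; the helper names, the {+1, -1} label convention, and the toy data below are my own assumptions:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_conditional_likelihood(w, data):
    # data is a list of (phi, y) pairs with y in {+1, -1}; higher is better
    total = 0.0
    for phi, y in data:
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, phi)))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

toy = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
print(log_conditional_likelihood([2.0, -2.0], toy))  # roughly -0.25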

Stochastic gradient ascent (or descent): an online training algorithm for logistic regression and other probabilistic models. Update the weights after every training example, moving in the direction given by the gradient, with the size of the update step scaled by the learning rate.

Gradient of the logistic function

Example: person/not-person classification problem. Given an introductory sentence in Wikipedia, predict whether the article is about a person.

Example: initial update

Example: second update

How to set the learning rate? Various strategies. Decay over time: α = 1 / (C + t), where C is a parameter and t is the number of samples seen so far. Or use a held-out test set and increase the learning rate when the likelihood on it increases.
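A quick sketch of that decay schedule (the constant C = 10 and the sample counts are arbitrary values chosen for illustration):

def learning_rate(C, t):
    # alpha = 1 / (C + t): the more samples seen, the smaller the step
    return 1.0 / (C + t)

for t in [0, 10, 100, 1000]:
    print(t, learning_rate(10.0, t))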

Multiclass version

Some models are better than others. Consider these 2 examples: which of the 2 models below is better? Classifier 2 will probably generalize better, because it does not include irrelevant information, so the smaller model is better.

Regularization: a penalty on adding extra weights. L2 regularization penalizes ‖w‖²: a big penalty on large weights, a small penalty on small weights. L1 regularization penalizes ‖w‖₁: the penalty grows uniformly whether weights are large or small, which will cause many weights to become exactly zero.
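A rough sketch of how the two penalties show up in a gradient-based update (the per-weight gradient form and the strength lam are illustrative assumptions, not taken from the slides):

def regularization_gradient(w, lam, kind="l2"):
    # gradient of the penalty term with respect to each weight
    if kind == "l2":
        # derivative of lam * wi^2: proportional to the weight itself
        return [2.0 * lam * wi for wi in w]
    if kind == "l1":
        # derivative of lam * |wi|: same magnitude regardless of weight size
        return [lam * (1.0 if wi > 0 else -1.0 if wi < 0 else 0.0) for wi in w]
    raise ValueError(kind)

w = [0.5, -2.0, 0.0]
print(regularization_gradient(w, 0.1, "l2"))  # small weights get small penalty gradients
print(regularization_gradient(w, 0.1, "l1"))  # all nonzero weights get the same-size push toward zero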

L1 regularization in online learning

What you should know: the standard supervised learning set-up for text classification; the difference between train and test data; how to evaluate; 3 examples of supervised linear classifiers (Naïve Bayes, Perceptron, Logistic Regression); learning as optimization (what is the objective function being optimized?); the difference between generative and discriminative classifiers; smoothing and regularization; overfitting and underfitting.

Neural networks

Person/not-person classification problem. Given an introductory sentence in Wikipedia, predict whether the article is about a person.

Formalizing binary prediction

The Perceptron: a machine to calculate a weighted sum. For the example sentence, the features are φ_A = 1, φ_site = 1, φ_located = 1, φ_Maizuru = 1, φ_, = 2, φ_in = 1, φ_Kyoto = 1, φ_priest = 0, φ_black = 0; each feature has a weight, and the prediction is the sign of the weighted sum: y = sign( Σ_i w_i · φ_i(x) ).
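A small sketch of that computation, with features stored in a dict keyed by word (the dict representation and the example weights are my own illustrative choices, not the values from the slide):

def sign(z):
    return 1 if z >= 0 else -1

def perceptron_predict(w, phi):
    # weighted sum of feature values, then take the sign
    score = sum(w.get(name, 0.0) * value for name, value in phi.items())
    return sign(score)

phi = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2, "in": 1, "Kyoto": 1}
w = {"site": -3.0, "Kyoto": 2.0}   # illustrative weights
print(perceptron_predict(w, phi))  # prints -1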

The Perceptron: geometric interpretation. (Figure: O and X examples in feature space, separated by a line the perceptron learns.)

Limitation of the perceptron: it can only find linear separations between positive and negative examples. (Figure: an XOR-like arrangement of X and O points that no single line can separate.)

Neural Networks: connect together multiple perceptrons. (Figure: the same features φ_A = 1, φ_site = 1, ..., φ_black = 0 feeding into several units arranged in layers.) Motivation: neural networks can represent non-linear functions!

Neural Networks: key terms. (Figure: the same feature-to-network diagram, annotated with the following terms.) Input (aka features), output, nodes, layers, hidden layers, activation function (non-linear), multi-layer perceptron.

Example: create two classifiers over the four points φ_0(x_1) = {-1, 1} (X), φ_0(x_2) = {1, 1} (O), φ_0(x_3) = {-1, -1} (O), φ_0(x_4) = {1, -1} (X). Classifier 1: φ_1[0] = sign(1·φ_0[0] + 1·φ_0[1] - 1), with weights w_{0,0} = {1, 1} and bias b_{0,0} = -1. Classifier 2: φ_1[1] = sign(-1·φ_0[0] - 1·φ_0[1] - 1), with weights w_{0,1} = {-1, -1} and bias b_{0,1} = -1.

Example: these classifiers map the points to a new space. The original points φ_0(x_1) = {-1, 1}, φ_0(x_2) = {1, 1}, φ_0(x_3) = {-1, -1}, φ_0(x_4) = {1, -1} become φ_1(x_1) = {-1, -1}, φ_1(x_2) = {1, -1}, φ_1(x_3) = {-1, 1}, φ_1(x_4) = {-1, -1}.

Example: in the new space, the examples are linearly separable! A single output unit φ_2[0] = y = sign(1·φ_1[0] + 1·φ_1[1] + 1) separates the O points from the X points.

Example wrap-up: forward propagation. The final net replaces sign with tanh: φ_1[0] = tanh(1·φ_0[0] + 1·φ_0[1] - 1), φ_1[1] = tanh(-1·φ_0[0] - 1·φ_0[1] - 1), and the output φ_2[0] = tanh(1·φ_1[0] + 1·φ_1[1] + 1).
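A runnable sketch of that forward pass, using the weights as reconstructed above (function and variable names are mine):

import math

def forward(phi0):
    # hidden layer: two tanh units over the 2-dimensional input
    h0 = math.tanh(1.0 * phi0[0] + 1.0 * phi0[1] - 1.0)
    h1 = math.tanh(-1.0 * phi0[0] - 1.0 * phi0[1] - 1.0)
    # output layer: one tanh unit over the hidden activations
    return math.tanh(1.0 * h0 + 1.0 * h1 + 1.0)

for point in [(-1, 1), (1, 1), (-1, -1), (1, -1)]:
    print(point, round(forward(point), 3))  # positive for the O points, negative for the X points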

Softmax function for multiclass classification: the sigmoid function generalized to multiple classes. P(y | x) = exp(w · φ(x, y)) / Σ_{y'} exp(w · φ(x, y')), where the numerator scores the current class and the denominator sums over the classes. This can be expressed using matrix/vector operations: r = exp(W φ(x, y)), p = r / Σ r.
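A minimal sketch of the normalization step (plain Python lists stand in for the vector operations; the class scores are made up):

import math

def softmax(scores):
    # exponentiate each class score, then normalize so the probabilities sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# scores w . phi(x, y) for three classes
print(softmax([2.0, 1.0, 0.1]))  # roughly [0.66, 0.24, 0.10]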

Stochastic Gradient Descent: an online training algorithm for probabilistic models.
w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α · dP(y|x)/dw
In other words: for every training example, calculate the gradient (the direction that will increase the probability of y), then move in that direction, multiplied by the learning rate α.
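A runnable version of that loop for binary logistic regression (a sketch under my own assumptions: the {+1, -1} label convention, the toy data, and the gradient expression p·(1 - p)·φ(x), which is the sigmoid gradient from the next slide):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(data, iterations=100, alpha=0.1):
    # data: list of (phi, y) pairs with y in {+1, -1}; weights start at zero
    w = [0.0] * len(data[0][0])
    for _ in range(iterations):
        for phi, y in data:
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, phi)))
            # gradient of P(y | x) w.r.t. w: phi(x) * p * (1 - p),
            # with the sign flipped when the correct label is the negative class
            grad = p * (1.0 - p) if y == 1 else -p * (1.0 - p)
            w = [wi + alpha * grad * fi for wi, fi in zip(w, phi)]
    return w

toy = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
print(train_sgd(toy))  # first weight pushed positive, second pushed negative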

Gradient of the Sigmoid Function. Take the derivative of the probability:
d/dw P(y = 1 | x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ] = φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²
d/dw P(y = -1 | x) = d/dw [ 1 - e^{w·φ(x)} / (1 + e^{w·φ(x)}) ] = -φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²
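A quick numerical sanity check of that derivative in one dimension (the finite-difference comparison is my own addition, not from the slides):

import math

def p1(w, phi):
    # P(y = 1 | x) with a single feature: e^{w*phi} / (1 + e^{w*phi})
    return math.exp(w * phi) / (1.0 + math.exp(w * phi))

w, phi, eps = 0.3, 2.0, 1e-6
analytic = phi * math.exp(w * phi) / (1.0 + math.exp(w * phi)) ** 2
numeric = (p1(w + eps, phi) - p1(w - eps, phi)) / (2 * eps)
print(analytic, numeric)  # the two values agree closely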

Learning: we don't know the derivative for hidden units! For neural networks, we only know the correct tag for the last layer. With hidden activations h(x) computed from φ(x) through weights w_1, w_2, w_3, and output weights w_4, the last-layer gradient is known: dP(y = 1 | x)/dw_4 = h(x) · e^{w_4·h(x)} / (1 + e^{w_4·h(x)})². But dP(y = 1 | x)/dw_1, dP(y = 1 | x)/dw_2, and dP(y = 1 | x)/dw_3 = ?

Answer: back-propagation. Calculate the derivative with the chain rule:
dP(y = 1 | x)/dw_1 = [dP(y = 1 | x)/d(w_4·h(x))] · [d(w_4·h(x))/dh_1(x)] · [dh_1(x)/dw_1]
Here the first factor, e^{w_4·h(x)} / (1 + e^{w_4·h(x)})², is the error of the next unit (δ_4); the second factor is the weight w_{4,1}; and the third is the gradient of this unit. In general, calculate the gradient for unit i based on the next units j: dP(y = 1 | x)/dw_i = (dh_i(x)/dw_i) · Σ_j δ_j · w_{i,j}.
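A compact sketch of this chain rule on the small tanh network from the earlier example, computing how the network output changes with each weight (the scalar parameter layout, the names, and the focus on the raw output rather than a probability are my own simplifications):

import math

def forward(x, params):
    w00, w01, b0, w10, w11, b1, v0, v1, c = params
    h0 = math.tanh(w00 * x[0] + w01 * x[1] + b0)
    h1 = math.tanh(w10 * x[0] + w11 * x[1] + b1)
    out = math.tanh(v0 * h0 + v1 * h1 + c)
    return h0, h1, out

def backward(x, params):
    # chain rule: delta at the output unit, then propagate it back through each hidden unit
    w00, w01, b0, w10, w11, b1, v0, v1, c = params
    h0, h1, out = forward(x, params)
    delta_out = 1.0 - out ** 2                   # d out / d (output pre-activation), since d tanh(z)/dz = 1 - tanh(z)^2
    delta_h0 = delta_out * v0 * (1.0 - h0 ** 2)  # error passed back to hidden unit 0
    delta_h1 = delta_out * v1 * (1.0 - h1 ** 2)  # error passed back to hidden unit 1
    return {
        "v0": delta_out * h0, "v1": delta_out * h1, "c": delta_out,
        "w00": delta_h0 * x[0], "w01": delta_h0 * x[1], "b0": delta_h0,
        "w10": delta_h1 * x[0], "w11": delta_h1 * x[1], "b1": delta_h1,
    }

# weights of the example network: hidden {1, 1, -1} and {-1, -1, -1}, output {1, 1, 1}
params = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
print(backward((1.0, 1.0), params))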

Backpropagation = Gradient descent + Chain rule

Feed-forward neural nets: all connections point forward, from the input φ(x) to the output y. The network is a directed acyclic graph (DAG).

Neural networks: non-linear classification. Prediction: forward propagation (vector/matrix operations plus non-linearities). Training: backpropagation plus stochastic gradient descent. For more details, see CIML Chapter 7.