ECE521 Lecture 9: Fully Connected Neural Networks


Outline
Multi-class classification
Learning multi-layer neural networks

Measuring distance in probability space
We learned that the squared L2 distance captures the natural distance measure between two points in Euclidean space. The Kullback-Leibler divergence is an equally important concept: it is an appropriate distance measure between two probability distributions:
KL(Q || P) = Σ_x Q(x) log [ Q(x) / P(x) ]
KL divergence is an asymmetric distance function: in general KL(Q || P) ≠ KL(P || Q).
KL divergence is also non-negative for any two distributions, and it is zero only when the two distributions are identical.
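As an illustration (not from the slides), here is a short numpy sketch that computes the KL divergence between two discrete distributions and shows that it is asymmetric and non-negative; the distributions q and p are made-up examples.

import numpy as np

def kl_divergence(q, p):
    # KL(q || p) = sum_x q(x) * log(q(x) / p(x)); terms with q(x) = 0 contribute 0
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

q = np.array([0.7, 0.2, 0.1])   # example distribution over 3 outcomes
p = np.array([0.4, 0.4, 0.2])

print(kl_divergence(q, p))   # positive
print(kl_divergence(p, q))   # a different value: KL is asymmetric
print(kl_divergence(q, q))   # 0 when the two distributions are identical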

Binary cross-entropy loss
Consider the Q distribution to be the discrete probability distribution of the observed binary label in the training dataset: Q(t = 1) is either zero or one depending on the training example. (The dataset is observed, so there is no randomness.)
Choose the P distribution to be the model's prediction: P(t = 1) = y(x).
The KL divergence between Q and P is in fact the cross-entropy loss:
KL(Q || P) = Σ_t Q(t) log Q(t) − Σ_t Q(t) log P(t) = −[ t log y + (1 − t) log(1 − y) ]
The first (entropy) term is zero because Q puts all of its mass on the observed label, using the convention 0·log 0 = 0.
The cross-entropy loss function measures the distance between the empirical data distribution and the model's predictive distribution.
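A minimal numeric check (my own example, not from the slides): when Q is the degenerate empirical distribution over a single binary label t, KL(Q || P) coincides with the binary cross-entropy −[t log y + (1 − t) log(1 − y)]. The label t and prediction y below are hypothetical.

import numpy as np

def binary_cross_entropy(t, y):
    # t is the observed label (0 or 1), y is the model's predicted P(t = 1)
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

def kl_divergence(q, p):
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

t, y = 1, 0.8                           # hypothetical label and model prediction
q = np.array([1 - t, t], dtype=float)   # empirical distribution: all mass on the observed label
p = np.array([1 - y, y], dtype=float)   # model's predictive distribution

print(binary_cross_entropy(t, y))   # 0.2231...
print(kl_divergence(q, p))          # same value: the entropy of Q is zero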

Multi-class cross-entropy loss and softmax
It is easy to use the KL divergence interpretation to generalize the cross-entropy loss to a multi-class scenario. Let there be K classes, with class labels t ∈ {1, …, K}. The multi-class cross-entropy loss can be written using the indicator function I(·):
E = − Σ_{k=1..K} I(t = k) log P(t = k | x)
Similarly, the multi-class generalization of the sigmoid function is the softmax function. The multi-class predictive distribution becomes:
P(t = k | x) = exp(z_k) / Σ_{j=1..K} exp(z_j)
Here z_1, …, z_K are the outputs of a neural network or of linear regression.
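A short numpy sketch of the softmax and the multi-class cross-entropy for one training example; the logits z and label t are arbitrary illustrative values.

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def multiclass_cross_entropy(t, z):
    # t is the integer class label; equivalent to -sum_k I(t = k) * log p_k
    p = softmax(z)
    return -np.log(p[t])

z = np.array([2.0, -1.0, 0.5])   # outputs of a neural network or linear model (K = 3)
t = 0                            # observed class label
print(softmax(z))                # predictive distribution over the 3 classes
print(multiclass_cross_entropy(t, z))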

Outline
Multi-class classification
Learning multi-layer neural networks

Example 1: representing digital circuits with neural networks
Let us look at a simple example of a soft OR gate simulated by a neural network: use a single sigmoid neuron σ with two inputs and a bias unit.
OR truth table:
          x2 = 0   x2 = 1
x1 = 0      0        1
x1 = 1      1        1

Example 1: representing digital circuits with neural networks
One possible solution is to use the bias as a (negative) threshold while setting w1 and w2 to be large positive values. When either of the inputs is non-zero, the sigmoid neuron is turned on and the output is close to 1; when both inputs are zero, the output stays close to 0.
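A tiny sketch of the soft OR gate described above. The particular values w1 = w2 = 10 and bias b = −5 are illustrative; any sufficiently large positive weights combined with a smaller negative bias behave the same way.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([10.0, 10.0])   # large positive input weights
b = -5.0                     # the bias acts as a threshold

for x1 in (0, 1):
    for x2 in (0, 1):
        y = sigmoid(w @ np.array([x1, x2]) + b)
        print(x1, x2, round(y, 3))   # close to 0 only when both inputs are 0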

Learning fully connected multi-layer neural networks
There are many choices when crafting the architecture of a neural network. The fully connected multi-layer NN is the most general multi-layer NN: each neuron has its incoming weights connected to all the neurons in the previous layer and its outgoing weights connected to all the neurons in the next layer.
The fully connected network is the go-to architecture choice if we do not have any additional information about the dataset.
After choosing the network architecture, there are a few more engineering choices: the number of hidden units, the number of layers, and the type of activation function. The type of output unit (linear, logistic, or softmax) is determined by the task, i.e. regression or classification.
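As a concrete, hypothetical illustration of these choices, the sketch below fixes the input dimension, hidden-layer sizes, and number of softmax output classes for a fully connected classifier and initializes the corresponding weight matrices (the specific sizes and the Gaussian initialization scale are assumptions, not from the slides).

import numpy as np

rng = np.random.default_rng(0)

N = 784                          # input dimension
hidden_sizes = [500, 300, 100]   # H1, H2, H3: an engineering choice
K = 10                           # number of classes -> softmax output units

layer_sizes = [N] + hidden_sizes + [K]
# one weight matrix per pair of consecutive layers (4 matrices here)
weights = [rng.normal(0, 0.01, size=(fan_out, fan_in))
           for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])]

for l, W in enumerate(weights, start=1):
    print(f"W^{l} shape: {W.shape}")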

Learning fully connected multi-layer neural networks
Consider a fully connected neural network with 3 hidden layers:
The input to the neural network is an N-dimensional vector x.
There are H1, H2, and H3 hidden units in the three hidden layers.
We use a superscript to index the layers. Counting the input and output layers, there are four weight matrices, e.g. W^2 (of size H2 × H1) connects the first and second hidden layers. The jth row of the weight matrix is denoted w_j^2.
The hidden activation of the jth hidden unit in the second hidden layer is the weighted sum of the first hidden layer:
z_j^2 = Σ_i w_ji^2 h_i^1,   h_j^2 = g(z_j^2)
We can use vector notation to express the hidden vector:
z^2 = W^2 h^1,   h^2 = g(z^2)

Learning fully connected multi-layer neural networks
For a single data point, we can write the hidden activations of the fully connected neural network as a recursive computation using the vector notation (forward propagation):
h^1 = g(W^1 x),   h^2 = g(W^2 h^1),   h^3 = g(W^3 h^2),   y = f(W^4 h^3)
where f(·) is the output activation function. The output of the network is then used to compute the loss function on the training data.
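A minimal numpy sketch of this forward-propagation recursion for a 3-hidden-layer network. The sigmoid hidden activation and softmax output are illustrative choices, the layer sizes are made up, and biases are omitted to match the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward(weights, x, g=sigmoid, f=softmax):
    # weights = [W1, W2, W3, W4]; returns all hidden vectors and the output
    hiddens = [x]
    for W in weights[:-1]:
        hiddens.append(g(W @ hiddens[-1]))   # h^l = g(W^l h^{l-1})
    y = f(weights[-1] @ hiddens[-1])         # output layer with activation f
    return hiddens, y

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3, 2]   # N, H1, H2, H3, K (illustrative)
weights = [rng.normal(0, 0.1, (o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
hiddens, y = forward(weights, rng.normal(size=4))
print(y, y.sum())         # the predictive distribution sums to 1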

Learning fully connected multi-layer neural networks
Learning neural networks using stochastic gradient descent requires the gradient of the loss w.r.t. the weight matrices of each hidden layer. Let us consider the gradient of the loss E for a single training example.
The gradient w.r.t. the incoming weights of the jth hidden unit in the second layer is the product of the hidden activation from layer 1 and the partial derivative w.r.t. z_j^2:
∂E/∂w_ji^2 = h_i^1 · ∂E/∂z_j^2        (remember: z_j^2 = Σ_i w_ji^2 h_i^1)
The partial derivative w.r.t. z_j^2 in the second hidden layer is the weighted sum of the partial derivatives from the third layer, weighted by the outgoing weights of the jth hidden unit (and scaled by the local derivative of the activation function):
∂E/∂z_j^2 = g'(z_j^2) Σ_k w_kj^3 ∂E/∂z_k^3
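A small numeric sanity check of these two statements, using made-up layer widths, random weights, and an assumed vector of layer-3 partial derivatives; the sigmoid is used so that g'(z) = σ(z)(1 − σ(z)).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
H1, H2, H3 = 4, 3, 5                 # illustrative layer widths
h1 = rng.random(H1)                  # activations from layer 1
W2 = rng.normal(size=(H2, H1))       # weights into layer 2
W3 = rng.normal(size=(H3, H2))       # weights out of layer 2
delta3 = rng.normal(size=H3)         # dE/dz for layer 3 (assumed known)

z2 = W2 @ h1
# dE/dz_j^2: weighted sum of the layer-3 partials through the outgoing weights w_kj^3,
# scaled by the local derivative of the sigmoid activation
delta2 = sigmoid(z2) * (1 - sigmoid(z2)) * (W3.T @ delta3)

j, i = 1, 2
grad_w2_ji = h1[i] * delta2[j]       # dE/dw_ji^2 = h_i^1 * dE/dz_j^2
print(grad_w2_ji)
print(np.outer(delta2, h1)[j, i])    # same value, taken from the full gradient matrix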

Learning fully connected multi-layer neural networks
Similar to the forward hidden-activation computation, the weighted sum of the partial derivatives can be rewritten using vector notation:
Σ_k w_kj^3 ∂E/∂z_k^3 = (w_:j^3)ᵀ ∂E/∂z^3
Here w_:j^3 is the jth column of the weight matrix W^3.
To express the partial derivatives w.r.t. z for the entire second hidden layer, we can use a matrix-vector product:
∂E/∂z^2 = g'(z^2) ⊙ (W^3)ᵀ ∂E/∂z^3
where ⊙ denotes element-wise multiplication.

Learning fully connected multi-layer neural networks
For a single training datum, computing the gradients w.r.t. the weight matrices is also a recursive procedure (back-propagation):
∂E/∂z^3 = g'(z^3) ⊙ (W^4)ᵀ ∂E/∂z^4
∂E/∂z^2 = g'(z^2) ⊙ (W^3)ᵀ ∂E/∂z^3
∂E/∂z^1 = g'(z^1) ⊙ (W^2)ᵀ ∂E/∂z^2
∂E/∂W^l = (∂E/∂z^l) (h^{l−1})ᵀ,   with h^0 = x        (remember: z^l = W^l h^{l−1}, h^l = g(z^l))
Back-propagation is similar to running the neural network backwards using the transpose of the weight matrices.
What about the expression for the bias units?
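A compact numpy sketch of the full backward recursion for the 3-hidden-layer network above (sigmoid hidden units, softmax output with cross-entropy loss, biases omitted as in the slides). The layer sizes and data are made up; the output-layer shortcut ∂E/∂z^4 = y − onehot(t) is the standard softmax-plus-cross-entropy gradient.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward_backward(weights, x, t):
    # forward pass, storing the activations h^0 = x, h^1, h^2, h^3
    hiddens = [x]
    for W in weights[:-1]:
        hiddens.append(sigmoid(W @ hiddens[-1]))
    y = softmax(weights[-1] @ hiddens[-1])

    # backward pass: for softmax + cross-entropy, dE/dz at the output is y - onehot(t)
    delta = y.copy()
    delta[t] -= 1.0
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(delta, hiddens[l])              # dE/dW^l = delta^l (h^{l-1})^T
        if l > 0:
            h = hiddens[l]
            delta = h * (1 - h) * (weights[l].T @ delta)    # delta^{l-1}: run backwards via W^T
    return grads

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3, 2]
weights = [rng.normal(0, 0.1, (o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
grads = forward_backward(weights, rng.normal(size=4), t=1)
print([g.shape for g in grads])   # matches the shapes of the weight matrices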