Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann


Feedforward networks

Linear separability [two plots over the features x_1 and x_2: the left dataset is linearly separable, the right (XOR-like) dataset is not linearly separable]

New features to the rescue! [the XOR data plotted over x_1 and x_2, with a new feature x_3 added]

New features to the rescue! [with the new feature x_3 = xor(x_1, x_2), the data becomes linearly separable]

How do we get new features? We want to apply the linear model not to x directly but to a representation φ(x) of x. How do we get this representation? Option 1: manually engineer φ using expert knowledge (feature engineering). Option 2: parameterise φ itself, so that learning those parameters identifies a good representation (feature learning).
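As a small illustration of Option 1 (a sketch of my own, not part of the original slides), the hand-engineered feature x_3 = xor(x_1, x_2) from the previous slide makes the XOR data linearly separable, so a linear threshold on the new feature alone classifies it correctly; the weights below are chosen by hand for illustration.

```python
import numpy as np

# XOR data: inputs (x1, x2) and labels y = xor(x1, x2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-engineered representation phi(x) with the extra feature x3 = xor(x1, x2)
def phi(x):
    return np.array([x[0], x[1], int(x[0] != x[1])])

# A linear model applied to phi(x): only the new feature matters here
w = np.array([0.0, 0.0, 1.0])
b = -0.5

predictions = [int(w @ phi(x) + b > 0) for x in X]
print(predictions)  # [0, 1, 1, 0] -- matches y
```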

From linear models to neural networks [two diagrams: a linear model mapping the inputs x_1, x_2 directly to the output y, and a network mapping x_1, x_2 to hidden units h_1, h_2 and then to y]

Function composition Neural networks are called networks because they can be understood in terms of function composition: (f ∘ g)(x) = f(g(x)). In essence, a neural network is an acyclic directed graph that describes how a collection of functions is composed. The length of the composition chain is the depth of the model. The compositional structure of neural networks is important for the success of gradient-based optimisation (chain rule of derivatives).
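A minimal sketch (my own illustration, not from the slides) of a two-layer network written literally as a composition of two functions g and f; the specific weights and the tanh nonlinearity are arbitrary choices for the example.

```python
import numpy as np

def g(x):
    # first function: maps the 2-dimensional input to 3 hidden units
    W = np.array([[1.0, -1.0, 0.5],
                  [0.5,  1.0, -1.0]])
    return np.tanh(x @ W)

def f(h):
    # second function: maps the 3 hidden units to a single output
    w = np.array([1.0, -1.0, 0.5])
    return w @ h

def network(x):
    # the network is the composition (f . g)(x) = f(g(x))
    return f(g(x))

print(network(np.array([0.5, -0.5])))
```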

Functions, types, compositions [diagram: the inputs x_1, x_2 are mapped to the hidden units h_1, h_2, h_3 by a function g: R^2 → R^3, and the hidden units are mapped to the output y by a function f: R^3 → R^1]

Shapes of the parameter matrices [same diagram as before; the parameter matrix H of the hidden layer has shape (2, 3), and the parameter matrix W of the output layer has shape (3, 1)]
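A quick NumPy sketch (my own; the matrix names H and W follow the slide, the random values and tanh are assumptions) that checks these shapes by pushing a 2-dimensional input through the network.

```python
import numpy as np

rng = np.random.default_rng(0)

H = rng.standard_normal((2, 3))   # hidden-layer parameters, shape (2, 3)
W = rng.standard_normal((3, 1))   # output-layer parameters, shape (3, 1)

x = np.array([1.0, -1.0])         # input with 2 components

h = np.tanh(x @ H)                # hidden activations, shape (3,)
y = h @ W                         # output, shape (1,)

print(h.shape, y.shape)           # (3,) (1,)
```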

Feedforward networks Information flows through the network from the input layer x, through the intermediate layers, to the output layer y. There are no feedback connections in which (possibly intermediate) outputs of the model are fed back to itself. When feedforward networks are extended to include feedback connections, they are called recurrent networks.

A simple feedforward network [diagram: input layer x_1, x_2; hidden layer h_1, h_2, h_3; output layer y]

Artificial neuron [diagram: inputs x_0, …, x_n with weights θ_0, …, θ_n are summed and passed through an activation function f to produce the output h(x) = f(Σ_i θ_i x_i)]

The rules of the game Choose the activation functions that will be used at each layer (sigmoid, tanh, rectified linear units, …). Choose an error function (a function of the predicted output and the target output). Choose a regulariser to prevent the network from overfitting (it encodes preferences over the choices of parameters). Choose an optimisation procedure to minimise the training loss (typically a variant of stochastic gradient descent).

Activation functions

Logistic function [plots of the logistic function σ(x) = 1/(1 + e^(−x)) over the range −6 to 6]

Logistic function The output of a logistic unit is a number between 0 and 1. Therefore, the output can be interpreted as a conditional probability P(y = 1 | x) for a binary random variable y. This makes logistic units ideal as output units for binary classification problems.

Softmax function The softmax function takes a k-dimensional vector z as its input and returns a k-dimensional vector y such that y_i = exp(z_i) / Σ_j exp(z_j). The softmax function generalises the logistic function in that it yields a probability distribution over k possible classes. In particular, each of the output components is a number between 0 and 1, and the sum of all output components is 1.
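A minimal NumPy sketch of the softmax function (my own illustration; the max-subtraction is a standard trick for numerical stability, not something stated on the slide).

```python
import numpy as np

def softmax(z):
    # subtract the maximum for numerical stability; this does not
    # change the result because softmax is shift-invariant
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = softmax(np.array([2.0, 1.0, 0.1]))
print(y)          # each component is between 0 and 1
print(y.sum())    # 1.0
```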

Softmax layer [diagram: hidden units h_1, …, h_4 feed into pre-activations z_1, z_2, z_3, which the softmax turns into outputs y_1, y_2, y_3]

Hyperbolic tangent [plots of the tanh function over the range −6 to 6]

Problems with sigmoidal units Sigmoidal units saturate across most of their domain, which can make gradient-based learning very difficult (the gradient is close to zero both for large negative and large positive inputs). For this reason, their use as hidden units in feedforward networks is now discouraged. Sigmoidal units can still be used as output units when the cost function can undo the saturation (not the case with squared loss!).

Rectified linear units [plots of the rectified linear function relu(x) = max(0, x) over the range −6 to 6]

Comparison of activation functions [two panels over the range −6 to 6: the left shows the sigmoid, tanh and relu activation functions, the right shows their gradients]
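A small sketch (my own, mirroring the two panels above) that evaluates the three activation functions and their gradients on a grid, for example to reproduce the plots.

```python
import numpy as np

x = np.linspace(-6, 6, 7)

sigmoid = 1 / (1 + np.exp(-x))
tanh = np.tanh(x)
relu = np.maximum(0.0, x)

# gradients of the three activation functions
d_sigmoid = sigmoid * (1 - sigmoid)   # saturates: close to 0 for large |x|
d_tanh = 1 - tanh**2                  # saturates as well
d_relu = (x > 0).astype(float)        # 0 or 1, no saturation for x > 0

print(d_sigmoid.round(3))
print(d_tanh.round(3))
print(d_relu)
```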

Error functions

Maximum likelihood estimation Consider a family of probability distributions P(X; θ) that assign a probability to any sequence X of N examples. The maximum likelihood estimator for θ is then defined as θ_ML = argmax_θ P(X; θ). If we assume that the examples are mutually independent and identically distributed, this can be rewritten as θ_ML = argmax_θ ∏_i P(x_i; θ), or equivalently (taking logarithms) as argmax_θ Σ_i log P(x_i; θ).
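As a concrete example (my own, not from the slides), the maximum likelihood estimate of the parameter of a Bernoulli distribution is the empirical frequency; the sketch below finds it numerically by maximising the log-likelihood over a grid.

```python
import numpy as np

# observed coin flips (1 = heads)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# log-likelihood of an i.i.d. Bernoulli sample under parameter theta
def log_likelihood(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
scores = [log_likelihood(t, x) for t in thetas]
theta_ml = thetas[int(np.argmax(scores))]

print(theta_ml, x.mean())   # both approximately 0.75
```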

Properties of the Maximum Likelihood Estimator The maximum likelihood estimator has two desirable properties. Consistency: the mean squared error between the estimated and the true parameters decreases as N increases. Efficiency: no consistent estimator has a lower mean squared error with respect to the parameters.

Conditional log-likelihood In supervised learning, we want to learn a conditional probability distribution over target values y, given features x. The assumption that the samples are i.i.d. gives us θ_ML = argmax_θ Σ_i log P(y_i | x_i; θ). Maximising the likelihood is the same as minimising the cross-entropy between the empirical distribution and the model (derivation in GBC, Section 5.7).

Conditional log-likelihood The maximum likelihood principle gives us a principled way to derive the cost function for a supervised learning problem: J(θ) = −E_(x, y) ~ p̂_data [log P(y | x; θ)], the negative conditional log-likelihood averaged over the training data. In the case of linear regression, minimising this expression is equivalent to minimising the mean squared error (GBC, Section 5.7.1).
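A hedged sketch of this cost for a classifier that outputs class probabilities (the numbers are made up): the negative conditional log-likelihood is the average negative log-probability assigned to the correct targets.

```python
import numpy as np

# predicted class probabilities for 3 examples (rows) over 4 classes
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
targets = np.array([0, 1, 3])   # correct class indices

# negative conditional log-likelihood, averaged over the examples
nll = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print(nll)   # about 0.469
```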

Negative log-likelihood [plot of −log p against p over the interval (0, 1]]

Logistic function [the plots of the logistic function, shown again]

Cross-entropy error function The output of a logistic unit can be interpreted as the conditional probability P(y_i = 1 | x) for a binary random variable y_i. The natural error function for a logistic unit is the negative log probability of the correct output: E = −log P(y_i | x). This is usually written as E = −[y_i log h(x) + (1 − y_i) log(1 − h(x))].
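A minimal sketch of this error function for a single prediction h(x) and binary target y (my own illustration; the example values are arbitrary).

```python
import numpy as np

def cross_entropy(h, y):
    # negative log probability of the correct output under a logistic unit
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(cross_entropy(0.9, 1))   # small error: confident and correct
print(cross_entropy(0.9, 0))   # large error: confident and wrong
```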

Cross-entropy cost function [two plots of the error as a function of h(x): the left for target y = 1, the right for target y = 0]

Sigmoid and cross-entropy balance each other [diagram: pre-activation z_k, logistic unit f, output y_k, error E] The steepness of the cross-entropy error exactly balances the flatness of the logistic function.
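A hedged numeric check of this claim (my own illustration, assuming the sigmoid/cross-entropy combination described on the slide): for a logistic output unit trained with cross-entropy, the gradient of the error with respect to the pre-activation z simplifies to h(x) − y, which does not vanish even when the unit saturates. The sketch verifies this with a finite-difference approximation.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def error(z, y):
    h = sigmoid(z)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

z, y, eps = 4.0, 0.0, 1e-6
numeric = (error(z + eps, y) - error(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y   # the combined gradient: h(x) - y

print(numeric, analytic)    # both about 0.982, even though the sigmoid is nearly flat at z = 4
```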

Regularisation

Norm-based regularisation We can regularise the training of a neural network by adding an additional term to the error function. L2-regularisation: give preference to parameter vectors with smaller Euclidean norms (lengths): Ẽ(θ) = E(θ) + λ‖θ‖₂². L1-regularisation: give preference to parameter vectors with smaller absolute-value norms: Ẽ(θ) = E(θ) + λ‖θ‖₁.
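A minimal sketch of the two penalties (my own; the weighting λ and the example numbers are assumptions, not taken from the slide).

```python
import numpy as np

def l2_penalty(theta, lam=0.01):
    # preference for parameter vectors with small Euclidean norm
    return lam * np.sum(theta ** 2)

def l1_penalty(theta, lam=0.01):
    # preference for parameter vectors with small absolute-value norm
    return lam * np.sum(np.abs(theta))

theta = np.array([2.0, -1.0, 0.5])
error = 1.3                       # some unregularised training error
print(error + l2_penalty(theta))  # regularised error with the L2 term
print(error + l1_penalty(theta))  # regularised error with the L1 term
```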

Selected regularisation techniques Dataset augmentation: generate new training data by systematically transforming the existing data (for example, rotating and scaling images). Early stopping: stop the training when the validation set error goes up and backtrack to the previous set of parameters. Bagging: train several different models separately, then have all of the models vote on the output.

Dropout Randomly set a fraction of units to zero during training (for example, 50% of all units in a given layer). Intuition: damaging random parts of the network prevents it from becoming oversensitive to idiosyncratic patterns in the data.

Dropout [two diagrams: the unmodified neural net, and the net after applying dropout]
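A minimal sketch of dropout during training (my own illustration; the rescaling by the keep probability is the common "inverted dropout" convention, not something stated on the slide).

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, drop_prob=0.5):
    # randomly zero out a fraction of the units in a layer during training;
    # rescale the survivors so the expected activation stays the same
    keep = (rng.random(h.shape) >= drop_prob).astype(h.dtype)
    return h * keep / (1.0 - drop_prob)

h = np.ones(10)          # activations of a hidden layer
print(dropout(h))        # roughly half of the entries are zero
```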

Backpropagation

Backpropagation Feedforward networks can be trained using gradient descent (a feedforward network is a chain of differentiable functions). The computational problem is how to efficiently compute the gradients for all layers of the network at the same time. The standard algorithm for this is called backpropagation.

Network structure [diagram: units y_i in one layer are connected by weights w_ij to the next layer, which in turn is connected by weights w_jk to the layer above; each unit applies the activation function f]

Forward pass [diagram: z_j = Σ_i w_ij y_i and y_j = f(z_j); z_k = Σ_j w_jk y_j and y_k = f(z_k); the outputs y_k determine the error E]

What do we want? [the gradient ∂E/∂w_ij of the error with respect to each weight in the network]

Computing the errors [the same diagram as for the forward pass; the error signals are propagated backwards from E through y_k and z_k to y_j and z_j]

Error in the output layer [diagram: for an output unit, δ_k = ∂E/∂z_k = ∂E/∂y_k · f′(z_k)]

Error in a hidden layer [diagram: for a hidden unit, δ_j = f′(z_j) Σ_k w_jk δ_k]

Computing the weight gradients [diagram: ∂E/∂w_ij = δ_j y_i]
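A compact end-to-end sketch of these equations for a one-hidden-layer network (my own illustration with made-up data; a sigmoid activation and squared error are assumed to keep the derivatives simple).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):                      # sigmoid activation
    return 1 / (1 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1 - s)

# a tiny network: 2 inputs -> 3 hidden units -> 1 output
W_ij = rng.standard_normal((2, 3))
W_jk = rng.standard_normal((3, 1))

x = np.array([0.5, -1.0])      # input (the y_i of the diagrams)
t = np.array([1.0])            # target output

# forward pass
z_j = x @ W_ij
y_j = f(z_j)
z_k = y_j @ W_jk
y_k = f(z_k)
E = 0.5 * np.sum((y_k - t) ** 2)

# backward pass
delta_k = (y_k - t) * f_prime(z_k)          # error in the output layer
delta_j = f_prime(z_j) * (W_jk @ delta_k)   # error in the hidden layer

grad_W_jk = np.outer(y_j, delta_k)          # dE/dw_jk = delta_k * y_j
grad_W_ij = np.outer(x, delta_j)            # dE/dw_ij = delta_j * y_i

print(E, grad_W_ij.shape, grad_W_jk.shape)  # shapes (2, 3) and (3, 1)
```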

Lab: Handwritten digit recognition

Handwritten digit recognition You are to build a feedforward net that takes in a greyscale image of a handwritten digit and outputs the digit (an integer). This is a supervised learning problem.

Basic network architecture [diagram: an input layer with 1 + 28 × 28 neurons (one for each pixel, plus one extra unit), a hidden layer whose size is left open (?), and an output layer with 10 neurons (one for each digit)]

How to use the network Translate each image to a vector x with 1 + 28 × 28 components, where component x_i is the greyscale value for pixel i in the image. The greyscale value is a fraction k/255 between 0 (black) and 1 (white). Feed the image to the network. Find the neuron y_i in the output layer that has the highest activation and predict the digit i. Bonus: implement a softmax layer!
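A hedged sketch of this recipe (my own; the network and the stand-in weights are hypothetical placeholders, only the pre-processing and the argmax prediction follow the slide).

```python
import numpy as np

def predict(network, pixels):
    # pixels: 28 x 28 array of greyscale values k in 0..255
    x = np.concatenate(([1.0], pixels.reshape(-1) / 255.0))  # 1 + 28*28 components
    y = network(x)                 # forward pass; returns 10 activations
    return int(np.argmax(y))       # the digit with the highest activation

# hypothetical stand-in for a trained network
dummy_network = lambda x: np.random.default_rng(0).random(10)
print(predict(dummy_network, np.zeros((28, 28))))
```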

What does the net learn? [image omitted; source: Kylin-Xu]

How to train the network To train the network we use the MNIST database, which consists of 70,000 labelled handwritten digits. Each target is translated into a vector y with 10 components, where y_i is 1 if the target equals i and 0 otherwise. Example: if the target is 3, then y_3 = 1 and all other components are zero.
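A minimal sketch of this target encoding (my own illustration).

```python
import numpy as np

def one_hot(target, num_classes=10):
    # vector y with 10 components: y_i = 1 if the target equals i, else 0
    y = np.zeros(num_classes)
    y[target] = 1.0
    return y

print(one_hot(3))   # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```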