Neural Networks William Cohen 10-601 [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Honglak Lee - NIPS 2010 tutorial]

WHAT ARE NEURAL NETWORKS?

William's notation for logistic regression:

$$P(y_i \mid x_i, w) = \begin{cases} \dfrac{1}{1+\exp(-x_i \cdot w)} & \text{if } y_i = 1 \\ 1 - \dfrac{1}{1+\exp(-x_i \cdot w)} & \text{if } y_i = 0 \end{cases} \qquad \text{where } \operatorname{logistic}(u) = \dfrac{1}{1+e^{-u}}$$

Predict positive on (x1, x2) iff a·x1 + b·x2 + c > 0, i.e., iff S(a·x1 + b·x2 + c) > 0.5.
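To make the equivalence concrete, here is a minimal sketch (plain Python/numpy, with made-up coefficients a, b, c) showing that thresholding S(a·x1 + b·x2 + c) at 0.5 gives the same prediction as checking a·x1 + b·x2 + c > 0:

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

# Illustrative coefficients, not learned from any data.
a, b, c = 2.0, -1.0, 0.5

for x1, x2 in [(0.0, 0.0), (1.0, 3.0), (-2.0, 0.5)]:
    score = a * x1 + b * x2 + c
    # The two decision rules agree: score > 0  <=>  S(score) > 0.5
    print((x1, x2), score > 0, logistic(score) > 0.5)
```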

On beyond linear classifiers: what about data that's not linearly separable?

One powerful extension to logistic regression: multilayer networks. The classifier is a multilayer network of logistic units. Each unit takes some inputs and produces one output using a logistic classifier, and the output of one unit can be the input of another. In the example network, an input layer (x1, x2) feeds a hidden layer of logistic units v1 = S(w^T x) and v2 = S(w^T x), each with a bias weight w_0, which in turn feeds an output unit z1 = S(w^T v).
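A minimal sketch of the forward pass for that small 2-2-1 network (plain numpy; all weights, including the bias weights standing in for w_0, are made up for illustration):

```python
import numpy as np

def S(u):                                  # the logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-u))

# Made-up weights; the b terms play the role of the bias weights w_{0,1}, w_{0,2}.
W_hidden = np.array([[0.3, -1.2],          # weights into v1
                     [0.8,  0.4]])         # weights into v2
b_hidden = np.array([0.1, -0.5])           # bias weights into v1, v2
w_out = np.array([1.5, -0.7])              # weights from (v1, v2) into z1
b_out = 0.2

x = np.array([1.0, 2.0])                   # one input example (x1, x2)
v = S(W_hidden @ x + b_hidden)             # hidden-layer outputs v1, v2
z = S(w_out @ v + b_out)                   # output-layer value z1
print(v, z)
```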

Multilayer Logistic Regression: one type of Artificial Neural Network (ANN). A neuron is a cell in the brain. It is highly connected to other neurons, and performs computations by combining signals from other neurons. The outputs of these computations may be transmitted to one or more other neurons.

Does AI need ANNs? Neurons take ~0.001 second to change state. We can do complex scene recognition tasks in ~0.1 sec, or about 100 steps of computation. ANNs are similar: lots of parallelism, not many steps of computation.

BACKPROP: LEARNING FOR NEURAL NETWORKS

Ideas from linear and logistic regression: define a loss which is squared error, but over a network of logistic units, and minimize the loss with gradient descent:

$$J_{X,y}(w) = \sum_i (y_i - \hat{y}_i)^2$$

but the output $\hat{y}_i$ is the network output. Quick review: the math for linear regression and logistic regression.

Gradient Descent for Linear Regression (simplified). Predict with $\hat{y}_i = \sum_j w_j x_{ij}$ and differentiate the loss function:

$$\frac{\partial}{\partial w_j} J_{X,y}(w) = \frac{\partial}{\partial w_j} \frac{1}{2}\sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - \hat{y}_i)\,\frac{\partial}{\partial w_j}\Bigl(y_i - \sum_{j'} w_{j'} x_{ij'}\Bigr) = -\sum_i (y_i - \hat{y}_i)\, x_{ij}$$
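A minimal sketch of gradient descent with this gradient on synthetic data (the data, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 examples, 3 features (made up)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.005                                  # illustrative learning rate
for _ in range(500):
    y_hat = X @ w                           # predict with y_hat_i = sum_j w_j x_ij
    grad = -X.T @ (y - y_hat)               # dJ/dw_j = -sum_i (y_i - y_hat_i) x_ij
    w = w - lr * grad                       # gradient descent step
print(w)                                     # close to true_w
```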

Gradient Descent for Logistic Regression (simplified). Output of a logistic unit:

$$S(g) = \frac{1}{1+e^{-g}}$$

After some math (using the chain rule, e.g. $(f^n)' = n f^{n-1} f'$):

$$\frac{\partial S(g)}{\partial g} = S(g)\,\bigl(1 - S(g)\bigr)$$
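A quick numerical check of that derivative identity (the test point g and the finite-difference step h are arbitrary):

```python
import numpy as np

def S(g):
    return 1.0 / (1.0 + np.exp(-g))

g, h = 0.7, 1e-6
numeric = (S(g + h) - S(g - h)) / (2 * h)   # finite-difference estimate of dS/dg
analytic = S(g) * (1.0 - S(g))              # the identity dS/dg = S(g)(1 - S(g))
print(numeric, analytic)                     # the two agree closely
```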

DERIVATION OF BACKPROP

Notation: y becomes a target t; a is an activation and net is a weighted sum of inputs, i.e., a = S(net); S is the sigmoid function; w_kj is a weight from hidden unit j to output unit k; w_ji is a weight from input unit i to hidden unit j. The loss is

$$J(w) = \frac{1}{2}\sum_k (t_k - a_k)^2$$

Derivation: gradient for output-layer weights. By the chain rule,

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial a_k}\,\frac{\partial a_k}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}}$$

with

$$\frac{\partial net_k}{\partial w_{kj}} = \frac{\partial}{\partial w_{kj}}\sum_{j'} w_{kj'}\, a_{j'} = a_j, \qquad \frac{\partial J}{\partial a_k} = -(t_k - a_k), \qquad \frac{\partial a_k}{\partial net_k} = a_k (1 - a_k) \;\text{(from the logistic regression result)}.$$

Defining the sensitivity $\delta_k \equiv (t_k - a_k)\, a_k (1 - a_k)$ (error times slope), this gives

$$\frac{\partial J}{\partial w_{kj}} = -\,\delta_k\, a_j.$$

Derivation: gradient for hidden-layer weights. Writing the loss all the way down to the inputs,

$$J(w) = \frac{1}{2}\sum_k (t_k - a_k)^2 = \frac{1}{2}\sum_k \Bigl(t_k - S\bigl(\textstyle\sum_j w_{kj}\, a_j\bigr)\Bigr)^2 = \frac{1}{2}\sum_k \Bigl(t_k - S\bigl(\textstyle\sum_j w_{kj}\, S(\sum_i w_{ji}\, a_i)\bigr)\Bigr)^2$$

so by the chain rule

$$\frac{\partial J}{\partial w_{ji}} = \sum_k \frac{\partial J}{\partial a_k}\,\frac{\partial a_k}{\partial net_k}\,\frac{\partial net_k}{\partial a_j}\,\frac{\partial a_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}} = -\Bigl(\sum_k \delta_k\, w_{kj}\Bigr)\, a_j (1 - a_j)\, a_i.$$

Defining $\delta_j \equiv \bigl(\sum_k \delta_k\, w_{kj}\bigr)\, a_j (1 - a_j)$, this gives

$$\frac{\partial J}{\partial w_{ji}} = -\,\delta_j\, a_i.$$

Computing the weight update. For nodes k in the output layer: $\delta_k \equiv (t_k - a_k)\, a_k (1 - a_k)$. For nodes j in the hidden layer: $\delta_j \equiv \bigl(\sum_k \delta_k\, w_{kj}\bigr)\, a_j (1 - a_j)$. For all weights: $w_{kj} \leftarrow w_{kj} + \varepsilon\, \delta_k\, a_j$ and $w_{ji} \leftarrow w_{ji} + \varepsilon\, \delta_j\, a_i$. Propagate the errors backward: BACKPROP. You can carry this recursion out further if you have multiple hidden layers.
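A minimal sketch of these updates on a tiny XOR-style problem (plain numpy; the network size, learning rate, epoch count, and the trick of appending a constant 1 to implement the bias weights are illustrative choices, not from the slides):

```python
import numpy as np

def S(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
# XOR-style data (not linearly separable); a constant 1 is appended to the
# inputs, and to the hidden activations below, to act as bias weights.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W_ji = rng.normal(scale=0.5, size=(3, 4))    # input (+bias) -> 4 hidden units
W_kj = rng.normal(scale=0.5, size=(5, 1))    # hidden (+bias) -> 1 output unit
eps = 0.5

for epoch in range(10000):
    for x, t in zip(X, T):
        a_j = np.append(S(x @ W_ji), 1.0)    # hidden activations, plus bias unit
        a_k = S(a_j @ W_kj)                  # output activation
        delta_k = (t - a_k) * a_k * (1 - a_k)                     # output sensitivity
        delta_j = (W_kj[:4] @ delta_k) * a_j[:4] * (1 - a_j[:4])  # hidden sensitivity
        W_kj += eps * np.outer(a_j, delta_k)   # w_kj <- w_kj + eps * delta_k * a_j
        W_ji += eps * np.outer(x, delta_j)     # w_ji <- w_ji + eps * delta_j * a_i

hidden = np.hstack([S(X @ W_ji), np.ones((4, 1))])
print(S(hidden @ W_kj).round(2))             # typically close to [0, 1, 1, 0]
```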

Some extensions to BackProp. Different units: can replace $(1+e^{-x})^{-1}$ with tanh, … Different architectures: can extend this method to any directed graph. Weight sharing / parameter tying: can have the same weights appear in multiple locations in the graph (more later).

EXPRESSIVENESS OF ANNS

Networks of logistic units can do a lot. One logistic unit can implement an AND or an OR of a subset of inputs, e.g., (x3 AND x5 AND … AND x19). Every boolean function can be expressed as an OR of ANDs, e.g., (x3 AND x5) OR (x7 AND x19) OR … So one hidden layer can express any boolean function (but it might need lots and lots of hidden units).
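A minimal sketch of the AND/OR claim: with large enough hand-picked weights (the weight 20 and the thresholds here are made up), a single logistic unit's output is essentially the boolean value:

```python
import numpy as np

def S(u):
    return 1.0 / (1.0 + np.exp(-u))

# One logistic unit computing AND of x3 and x5 (all other inputs get weight 0):
# the threshold sits between "one input on" and "both inputs on".
def and_unit(x3, x5):
    return S(20 * x3 + 20 * x5 - 30)

# One logistic unit computing OR of x3 and x5:
# the threshold sits between "no inputs on" and "one input on".
def or_unit(x3, x5):
    return S(20 * x3 + 20 * x5 - 10)

for x3 in (0, 1):
    for x5 in (0, 1):
        print(x3, x5, round(and_unit(x3, x5)), round(or_unit(x3, x5)))
```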

Networks of logistic units can do a lot. An ANN with one hidden layer can also approximate every bounded continuous function (but it might need lots and lots of hidden units).

Limitations of Backprop. Learning is slow: thousands of iterations. Learning is often ineffective on deep networks with more than 1-2 hidden layers (exception: convolutional nets). There are lots of parameters to tweak: the structure of the network; regularization, learning rate, momentum terms; the number of iterations/convergence. A common trick is early stopping: train till the error on a held-out validation set stops decreasing (similar to L2-regularization). BP finds a local minimum, not a global one, so the starting point matters (multiple restarts); warning: you usually want to randomize the weights when you start. It is only useful for supervised learning tasks: or, is it?

DEEP NEURAL NETWORKS

Observations and motivations. Slow is not scary anymore: Moore's law, GPUs, … Brains seem to be hierarchical: deeper, more abstract features are computed from simpler ones. 100 steps != 3 steps.

CONVOLUTIONAL NETS

Weight-sharing and convolutional networks. Convolutional neural networks for vision tasks: these can be trained with Backprop, even with many hidden layers. Every unit is a feature detector for part of the retina, with a receptive field that defines its inputs and (conceptually) a location on the retina. There are two types of units: detection units, which learn to recognize some feature (copied to multiple places via weight-sharing), and pooling units, which combine detectors from multiple nearby locations together (e.g., max, sampling, …).
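A minimal sketch of the two unit types in plain numpy (the image, the 3x3 filter values, and the 2x2 pooling size are made up; no learning is shown):

```python
import numpy as np

def S(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
image = rng.random((8, 8))                 # a tiny fake "retina"
filt = np.array([[ 1.,  0., -1.],          # one shared feature detector
                 [ 1.,  0., -1.],          # (a vertical-edge-like filter, made up)
                 [ 1.,  0., -1.]])

# Detection layer: the same weights (filt) are applied at every location
# (weight sharing); each output unit sees only a 3x3 receptive field.
det = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        det[i, j] = S(np.sum(filt * image[i:i+3, j:j+3]))

# Pooling layer: combine nearby detectors with a max over 2x2 blocks.
pooled = det.reshape(3, 2, 3, 2).max(axis=(1, 3))
print(det.shape, pooled.shape)             # (6, 6) -> (3, 3)
```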

Model of vision in animals

Vision with ANNs

Vision with ANNs

Similar technique applies to audio

DENSITY MODELLING WITH ANNS

Neural network auto-encoding. Assume we would like to learn the following (trivial?) output function, using the following network:

Input      Output
00000001   00000001
00000010   00000010
00000100   00000100
00001000   00001000
00010000   00010000
00100000   00100000
01000000   01000000
10000000   10000000

Can this be done?

Learned parameters (note that each value is assigned to the edge from the corresponding input).

Hypothetical example (not actually from an autoencoder): data and reconstruction pairs.

Neural network autoencoding. The hidden layer is a compressed version of the data. Reconstruction error on a vector x is related to P(x) under the distribution that the auto-encoder was trained on. Denoising auto-encoders are trained to reconstruct x from a noisy version of x.
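A minimal sketch of an auto-encoder for the 8-bit patterns in the table above, reusing the backprop updates derived earlier (the choice of 3 hidden units, the bias trick, learning rate, and epoch count are assumptions for illustration):

```python
import numpy as np

def S(u):
    return 1.0 / (1.0 + np.exp(-u))

# The 8 one-hot patterns from the table; a constant 1 is appended as a bias input.
X = np.hstack([np.eye(8), np.ones((8, 1))])
T = np.eye(8)                               # target = the input itself
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.3, size=(9, 3))     # encoder: 8 inputs + bias -> 3 hidden
W2 = rng.normal(scale=0.3, size=(4, 8))     # decoder: 3 hidden + bias -> 8 outputs
eps = 0.5

for epoch in range(20000):
    for x, t in zip(X, T):
        h = np.append(S(x @ W1), 1.0)       # compressed code (plus bias unit)
        y = S(h @ W2)                       # reconstruction of the input
        delta_out = (t - y) * y * (1 - y)
        delta_hid = (W2[:3] @ delta_out) * h[:3] * (1 - h[:3])
        W2 += eps * np.outer(h, delta_out)
        W1 += eps * np.outer(x, delta_hid)

codes = S(X @ W1)
recon = S(np.hstack([codes, np.ones((8, 1))]) @ W2)
print(codes.round(1))                        # the hidden layer: a compressed code
print(recon.argmax(axis=1))                  # ideally 0..7 (may vary with seed/epochs)
```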

Density Modeling with ANNs. Sigmoid belief nets (Neal, 1992) explicitly model a generative process; top-layer nodes are generated by a binomial. Sampling from (generating data with) an SBN: sample h_k from P(h_k); sample h_{k-1} from P(h_{k-1} | h_k); …; sample x from P(x | h_1). What about learning?

Recall the gradient for logistic regression. It can be interpreted as a difference between the expected value of y·x_j in the data and the expected value of y·x_j as predicted by the model:

$$\frac{\partial}{\partial w_j}\,\ell(w) = E_{x,y\sim \text{data}}[\,y\, x_j\,] - E_{x\sim \text{data},\; y\sim \hat{P}(y|x)}[\,y\, x_j\,]$$

Gradient ascent tries to make those equal. So: if we can sample from the model, we can approximate the gradient.
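A minimal sketch of that idea for logistic regression on made-up binary data: the gradient component for one feature, computed exactly, versus the same quantity estimated as a difference between a data expectation and a model-sample expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.integers(0, 2, size=(n, 3)).astype(float)   # made-up binary features
true_w = np.array([1.5, -2.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(n) < p).astype(float)                # observed labels

w = np.array([0.5, 0.5, 0.5])                        # some current parameter value
p_hat = 1.0 / (1.0 + np.exp(-(X @ w)))               # model's P_hat(y=1|x)

j = 0                                                # gradient component for feature j
exact = np.mean((y - p_hat) * X[:, j])               # average-gradient form

y_model = (rng.random(n) < p_hat).astype(float)      # sample y from the model
sampled = np.mean(y * X[:, j]) - np.mean(y_model * X[:, j])
print(exact, sampled)                                # close: sampling approximates it
```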

Density Modeling with ANNs. Sigmoid belief nets (Neal, 1992) explicitly model a generative process; top-layer nodes are generated by a binomial. Sampling from (generating data with) an SBN: sample h_k from P(h_k); sample h_{k-1} from P(h_{k-1} | h_k); …; sample x from P(x | h_1).

Density Modeling with ANNs. Notation: x is now the visible units v. A Restricted Boltzmann machine is equivalent to a very deep sigmoid belief network with lots of weights tied; it is usually shown as a two-layer network with one hidden layer and symmetric weights. RBMs can compute Pr(v | h) or Pr(h | v) directly.

Computing probabilities with an RBM via Gibbs sampling. (Figure: an alternating chain of hidden and visible updates at t = 0, 1, 2, …, ∞, going from ⟨v_i h_j⟩^0 on the data to ⟨v_i h_j⟩^∞, "a fantasy".) Start with any value on the visible units. Then alternate between updating the hidden units (in parallel) and the visible units (in parallel). The system will converge to a stream of ⟨v, h⟩ pairs drawn from Pr(v, h). The goal of learning is to push the system's Pr(v, h) toward the observed probabilities: start with an observed x, let the system drift away, then train it to stay closer to the observed x. Variant: clamp some units to a fixed value c, and sample from Pr(v_1, h | v_2 = c).
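A minimal sketch of the alternating Gibbs updates with a contrastive-divergence-style learning signal (one Gibbs step rather than running out to t = ∞) for a tiny binary RBM; the sizes, data, learning rate, and the omission of bias terms are all simplifications for illustration:

```python
import numpy as np

def S(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(scale=0.1, size=(n_v, n_h))   # symmetric weights between v and h
# A few made-up binary training vectors (biases omitted for brevity).
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1],
                 [1, 1, 0, 0, 0, 1]], dtype=float)
eps = 0.05

for epoch in range(2000):
    for v0 in data:
        # Update hidden units in parallel given the visible units ...
        ph0 = S(v0 @ W)
        h0 = (rng.random(n_h) < ph0).astype(float)
        # ... then visible units in parallel given the hidden units (one Gibbs step).
        v1 = (rng.random(n_v) < S(W @ h0)).astype(float)
        ph1 = S(v1 @ W)
        # Push Pr(v, h) toward the data: <v h> on data minus <v h> after drifting away.
        W += eps * (np.outer(v0, ph0) - np.outer(v1, ph1))

# After training, reconstruction probabilities are pushed toward the training vectors.
print(S(S(data @ W) @ W.T).round(2))
```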

Training a stack of RBMs.

Training a stack of RBMs. Step 1: train an RBM on the inputs. Step 2: fix it and train a second RBM on its hidden-layer activations.

Training a stack of RBMs gives a deep Boltzmann machine. Steps 3, …: repeat as needed. Then fine-tune with BP on a discriminative task: clamp n-1 of the inputs to x and use Gibbs sampling to sample from Pr(y | x).

DBN examples: learning the digit 2. 16 x 16 = 256 pixel images; 50 hidden units (50 x 256 weights).

Example: reconstructing digits with a network trained on different copies of 2. Data and reconstructions from activated binary features are shown both for new test images from the digit class that the model was trained on and for images from an unfamiliar digit class (the network tries to see every image as a 2).

Samples generated after clamping the label y.

PSD features learned from all digits [LeCun].

DENSITY MODELLING COMBINED WITH BACKPROP

Training a stack of RBMs gives a deep Boltzmann machine. Steps 3, …: repeat as needed. Then fine-tune with BP on a discriminative task.

Deep learning methods. Greedily learn a deep model of the unlabeled data (deep Boltzmann machines, denoising autoencoders, PSDs, …): learn a hidden layer from the inputs; fix this layer and learn a deeper layer from it; then fine-tune with BP. Another approach for RBMs: unroll the stack of RBMs into a multi-layer network, add appropriate output nodes, and use it as a starting point for BP.
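A minimal sketch of the greedy recipe using the denoising-autoencoder flavour (plain numpy; sizes, data, noise level, and the omission of biases and of the final BP fine-tuning step are simplifications):

```python
import numpy as np

def S(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_denoising_layer(H, n_hidden, epochs=2000, eps=0.1, noise=0.2, seed=0):
    """Learn one layer: reconstruct H from a corrupted copy of H (biases omitted)."""
    rng = np.random.default_rng(seed)
    n_in = H.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for _ in range(epochs):
        for x in H:
            x_noisy = x * (rng.random(n_in) > noise)    # randomly zero some inputs
            h = S(x_noisy @ W_enc)
            y = S(h @ W_dec)
            d_out = (x - y) * y * (1 - y)               # reconstruct the *clean* x
            d_hid = (W_dec @ d_out) * h * (1 - h)
            W_dec += eps * np.outer(h, d_out)
            W_enc += eps * np.outer(x_noisy, d_hid)
    return W_enc

rng = np.random.default_rng(1)
X = (rng.random((20, 16)) < 0.3).astype(float)          # made-up binary "data"

W1 = train_denoising_layer(X, n_hidden=8)               # layer 1 from the inputs
H1 = S(X @ W1)                                          # fix layer 1; use its codes
W2 = train_denoising_layer(H1, n_hidden=4, seed=1)      # layer 2 from layer 1's codes
H2 = S(H1 @ W2)
print(X.shape, H1.shape, H2.shape)                      # 16 -> 8 -> 4 features
```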


Error reduced from 16% to 6% by using the deep features.

Modeling documents using top 2000 words.

What you should know: what a neural network is and what sorts of things it can compute; why it's not linear; Backprop: loss function, updates, etc.; limitations of Backprop without pre-training on deep networks.