Deep Feedforward Networks

Similar documents
Deep Feedforward Networks. Sargur N. Srihari

Introduction to Machine Learning Spring 2018 Note Neural Networks

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Neural Networks and Deep Learning

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Neural Network Training

Neural networks. Chapter 20. Chapter 20 1

Deep Feedforward Networks

Artificial Neural Networks

CSC321 Lecture 5: Multilayer Perceptrons

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Lecture 17: Neural Networks and Deep Learning

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Gradient-Based Learning. Sargur N. Srihari

Feed-forward Network Functions

Neural networks COMS 4771

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Neural networks. Chapter 19, Sections 1 5 1

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Course 395: Machine Learning - Lectures

Machine Learning Basics III

Artificial Neural Networks

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Neural networks. Chapter 20, Section 5 1

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Neural Networks. Intro to AI Bert Huang Virginia Tech

Feedforward Neural Nets and Backpropagation

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen

Neural networks and support vector machines

Multilayer Neural Networks

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

CS:4420 Artificial Intelligence

Grundlagen der Künstlichen Intelligenz

Artificial Neural Networks. MGS Lecture 2

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

arxiv: v3 [cs.lg] 14 Jan 2018

Feedforward Neural Networks

Neural Networks: Introduction

Machine Learning (CSE 446): Neural Networks

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang

Deep Learning for NLP

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Reading Group on Deep Learning Session 1

11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1)

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

1 What a Neural Network Computes

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Intro to Neural Networks and Deep Learning

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks

The Perceptron. Volker Tresp Summer 2016

Multilayer Perceptrons (MLPs)

CSC 411 Lecture 10: Neural Networks

Lecture 4: Perceptrons and Multilayer Perceptrons

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

Deep Feedforward Networks

Nonlinear Classification

Multilayer Neural Networks

18.6 Regression and Classification with Linear Models

Linear Models for Regression. Sargur Srihari

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

Neural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav

The Perceptron. Volker Tresp Summer 2014

9 Classification. 9.1 Linear Classifiers

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

Artificial Intelligence

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

SGD and Deep Learning

CMSC 421: Neural Computation. Applications of Neural Networks

1 Machine Learning Concepts (16 points)

Neural Networks (and Gradient Ascent Again)

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

AI Programming CS F-20 Neural Networks

Machine Learning Lecture 12

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation

Multilayer Perceptron

From perceptrons to word embeddings. Simon Šuster University of Groningen

Artificial Neural Networks

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Multiclass Logistic Regression

Bits of Machine Learning Part 1: Supervised Learning

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture

ECE521 Lecture 7/8. Logistic Regression

CSCI567 Machine Learning (Fall 2018)

Neural Networks: Backpropagation

Machine Learning Basics: Maximum Likelihood Estimation

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Plan. Perceptron Linear discriminant. Associative memories Hopfield networks Chaotic networks. Multilayer perceptron Backpropagation

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Machine Learning Lecture 10

Transcription:

Deep Feedforward Networks Yongjin Park 1

Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function f * E.g., a classifier y = f * (x) maps an input x to a category y Feedforward Network defines a mapping y = f * (x ; θ ) and learns the values of the parameters θ that result in the best function approximation 2

Feedforward Network Models are called Feedforward because: Information flows through function being evaluated from x through intermediate computations used to define f and finally to output y There are no feedback connections No outputs of the model are fed back US Presidential Election Inputs are raw votes cast Hidden layer is electoral college Output are candidates 3

Feedforward vs. Recurrent When extended to include feedback connections they are called Recurrent Neural Networks 4

Srihari Importance of Feedforward Networks They are extremely important to ML practice Many commercial applications E.g., convolutional networks used for recognizing objects from photos They are a conceptual stepping stone on path to recurrent networks Which power many NLP applications 5

Srihari Feedforward Neural Network Structures Called networks because they are composed of many different functions Model is associated with a directed acyclic graph describing how functions composed E.g., functions f (1), f (2), f (3) connected in a chain to form f (x)= f (3) [ f (2) [ f (1) (x)]] f (1) is called the first layer of the network f (2) is called the second layer, etc These chain structures are the most commonly used structures of neural networks 6

Definition of Depth Overall length of the chain is the depth of the model The name deep learning arises from this terminology Final layer of a feedforward network is called the output layer 7

A Feed-forward Neural Network K outputs y 1,..y K for a given input x Hidden layer consists of M units M D (2) (1) (1) (2) y k (x,w) = σ w kj h w ji x i + w j 0 + w k 0 j =1 i=1 f m (1) =z m = h(x,w (1) ), m=1,..m f k (2) = σ (z,w (2) ), k=1,..k 8

Example of Feedforward Network Optical Character Recognition (OCR) Hidden layer compares raw pixel inputs to component patterns 9

Training the Network In network training we drive f(x) to match f*(x) Training data provides us with noisy, approximate examples of f*(x) evaluated at different training points Each example accompanied by label y f*(x) Training examples specify directly what the output layer must do at each point x It must produce a value that is close to y 10

What are Hidden Layers? Behavior of other layers is not directly specified by the data Learning algorithm must decide how to use those layers to produce value that is close to y Training data does not say what individual layers should do Since the desired output for these layers is not shown, they are called hidden layers 11

Why are they called neural? These networks are loosely inspired by neuroscience Each hidden layer is typically vector-valued Dimensionality of hidden layer is width of the model Each element of vector viewed as a neuron Instead of thinking of it as a vector-vector function, they are regarded as units in parallel Each unite receives inputs from many other units and computes its own activation value 12

Neural networks and Brains Choice of functions f (i) (x): Loosely guided by neuroscientific observations about biological neurons Modern neural networks are guided by many mathematical and engineering disciplines Not perfectly model the brain Think of feedforward networks as function approximation machines Designed to achieve statistical generalization Occasionally draw insights from what we know about the brain Rather than as models of brain function 13

Machine Extending Learning Linear Models To represent non-linear functions of x apply linear model not to x but to a transformed input ϕ(x) where ϕ is non-linear Equivalently kernel trick obtains a nonlinearity SVM: f (x)=w T x+b written as b + Σ i α i ϕ(x) ϕ(x (i) ) Choose k(x,x (i) )=ϕ(x) ϕ(x (i) ) Use linear regression on Lagrangian for weights α i Evaluate f over samples for non-zero (support vectors) ϕ provides a set of features describing x Replace x by function ϕ(x) 14

View as Extension of Linear Models Begin with linear models and see limitations Linear regression: y(x,w) = w j φ j (x) = w T φ(x) Simple closed form solutions: Or solved with convex optimization: M 1 j = 0 E D (w) = 1 2 w ML = (Φ T Φ) 1 Φ T t N n=1 { t n w T φ(x n ) } 2 φ 0(x1) φ 0(x 2) Φ = φ 0(x N ) φ (x ) 1 1... φ M 1(x 1) φ M 1(x N ) N ( τ + 1) ( τ ) w = w η E E n n = { t n w T φ(x n ) } Logistic regression: y(x,w)= σ (w T φ (x)) No closed-form solution Convex Optimization: φ(x n ) T n=1 w τ+1 = w τ η E n E n = (y n t n )φ(x n ) If ϕ (x) =x model capacity is limited to linear functions and model has no understanding of interaction between any two input variables 15

Three methods to choose ϕ 1. Generic feature function ϕ (x) RBF: N(x; x (i), σ 2 I) centered at x (i) 2. Manually engineer ϕ Dominant approach until arrival of deep learning Requires decades of effort e.g., speech recognition, computer vision Laborious, non-transferable between domains 3. Principle of Deep Learning: Learn ϕ Approach used in deep learning 16

Approach 3: Learn Features Model is y=f (x;θ,w) = ϕ (x;θ) T w θ used to learn ϕ from broad class of functions Parameters w map from ϕ (x) to output Defines FFN where ϕ define a hidden layer Unlike other two (basis functions, manual engineering), this approach gives-up on convexity of training But its benefits outweigh harms 17

Extend Linear Methods to Learn ϕ θ MD ϕ M w KM K outputs y 1,..y K for a given input x Hidden layer consists of M units M D y k (x; θ,w) = w kj φ j θ ji x i + θ j 0 j=1 i=1 + w k 0 ϕ 1 w 10 y k = f k (x;θ,w) = ϕ (x;θ) T w ϕ 0 Can be viewed as a generalization of linear models Nonlinear function f k with M+1 parameters w k = (w k0,..w km ) with M basis functions, ϕ j j=1,..m each with D parameters θ j = (θ j1,..θ jd ) Both w k and θ j are learnt from data 18

Approaches to Learning ϕ Parameterize the basis functions as ϕ(x;θ) Use optimization to find θ that corresponds to a good representation Approach can capture benefit of first approach (fixed basis functions) by being highly generic By using a broad family for ϕ(x;θ) Can also capture benefits of second approach Human practitioners design families of ϕ(x;θ) that will perform well Need only find right function family rather than 19 precise right function

Importance of Learning ϕ Learning ϕ is discussed beyond this first introduction to feed-forward networks It is a recurring theme throughout deep learning applicable to all kinds of models Feedforward networks are application of this principle to learning deterministic mappings form x to y without feedback Applicable to learning stochastic mappings functions with feedback 20 learning probability distributions over a single vector

Plan of Discussion: Feedforward Networks 1. A simple example: learning XOR 2. Design decisions for a feedforward network Many are same as for designing a linear model Basics of gradient descent Choosing the optimizer, Cost function, Form of output units Some are unique Concept of hidden layer Makes it necessary to have activation functions Architecture of network How many layers, How are they connected to each other, How many units in each later Learning requires gradients of complicated functions Backpropagation and modern generalizations 21

Ex: XOR problem XOR: an operation on binary variables x 1 and x 2 When exactly one value equals 1 it returns 1 otherwise it returns 0 Target function is y=f *(x) that we want to learn Our model is y =f ([x 1, x 2 ] ; θ) which we learn, i.e., adapt parameters θ to make it similar to f * Not concerned with statistical generalization Perform correctly on four training points: X={[0,0] T, [0,1] T,[1,0] T, [1,1] T } Challenge is to fit the training set We want f ([0,0] T ; θ) = f ([1,1] T ; θ) = 0 f ([0,1] T ; θ) = f ([1,0] T ; θ) = 1 22

ML for XOR: linear model doesn t fit Treat it as regression with MSE loss function J(θ) = 1 f *(x) f (x;θ) = 1 4 4 x X ( ) 2 Usually not used for binary data But math is simple We must choose the form of the model n=1 Consider a linear model with θ ={w,b} where Minimize t n x T n w - b) to get closed-form solution Differentiate wrt w and b to obtain w = 0 and b=½ Then the linear model f(x;w,b)=½ simply outputs 0.5 everywhere Why does this happen? 4 f(x;w,b) = x T w +b J(θ) = 1 4 4 n=1 ( ) 2 ( f *(x n ) f (x n ;θ)) 2 Alternative is Cross-entropy J(θ) J(θ) = ln p(t θ) N { } = t n lny n +(1 t n )ln(1 y n ) n=1 y n = σ (θ T x n ) 23

Linear model cannot solve XOR Bold numbers are values system must output When x 1 =0, output has to increase with x 2 When x 1 =1, output has to decrease with x 2 Linear model f (x;w,b)= x 1 w 1 +x 2 w 2 +b has to assign a single weight to x 2, so it cannot solve this problem A better solution: use a model to learn a different representation in which a linear model is able to represent the solution We use a simple feedforward network one hidden layer containing two hidden units 24

Feedforward Network for XOR Introduce a simple feedforward network with one hidden layer containing two units Same network drawn in two different styles Matrix W describes mapping from x to h Vector w describes mapping from h to y Intercept parameters b are omitted 25

Functions computed by Network Layer 1 (hidden layer): vector of hidden units h computed by function f (1) (x; W,c) c are bias variables Layer 2 (output layer) computes f (2) (h; w,b) w are linear regression weights Output is linear regression applied to h rather than to x Complete model is f (x; W,c,w,b)=f (2) (f (1) (x)) 26

Linear vs Nonlinear functions If we choose both f (1) and f (2) to be linear, the total function will still be linear f (x)=x T w Suppose f (1) (x)= W T x and f (2) (h)=h T w Then we could represent this function as f (x)=x T w where w =Ww Since linear is insufficient, we must use a nonlinear function to describe the features We use the strategy of neural networks by using a nonlinear activation function f (x)=x T w h=g(w T x+c) 27

Activation Function In linear regression we used a vector of weights w and scalar bias b f(x;w,b) = x T w +b to describe an affine transformation from an input vector to an output scalar Now we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed Activation function g is typically chosen to be applied element-wise h i =g(x T W :, i +c i ) 28

Default Activation Function Activation: g(z)=max{0,z} Applying this to the output of a linear transformation yields a nonlinear transformation However function remains close to linear Piecewise linear with two pieces Therefore preserve properties that make linear models easy to optimize with gradient-based methods Preserve many properties that make linear models generalize well A principle of CS: Build complicated systems from minimal components. A Turing Machine Memory needs only 0 and 1 states. We can build Universal Function approximator from ReLUs

Specifying the Network using ReLU Activation: g(z)=max{0,z} We can now specify the complete network as f (x; W,c,w,b)=f (2) (f (1) (x))=w T max {0,W T x+c}+b

We can now specify XOR Solution Let W = 1 1 1 1, c = Now walk through how model processes a batch of inputs Design matrix X of all four points: First step is XW: In this space all points lie Adding c: along a line with slope 1. Cannot be implemented by a linear model Compute h Using ReLU Finish by multiplying by w: Network has obtained 0 1, w = Has changed relationship among examples. They no longer lie on a single line. A linear model suffices correct answer for all 4 examples 1 2, b = 0 max{0,xw +c} = f(x) = 0 1 1 0 XW +c = 0 0 1 0 1 0 2 1 f (x; W,c,w,b)= w T max {0,W T x+c}+b 0 1 1 0 1 0 2 1 XW = 0 0 1 1 1 1 2 2 X = 0 0 0 1 1 0 1 1 31

Learned representation for XOR Two points that must have output 1 have been collapsed into one Points x=[0,1] T and x=[1,0] T have been mapped into h=[0,1] T When x 1 =0, output has to increase with x 2 When x 1 =1, output has to decrease with x 2 Can be described in linear model When h 1 =0, output is constant 0 with h 2 When h 1 =1, output is constant 1 with h 2 When h 1 =2, output is constant 0 For fixed h 2, output with h 32 increases in h 2 1

About the XOR example We simply specified the solution Then showed that it achieves zero error In real situations there might be billions of parameters and billions of training examples So one cannot simply guess the solution Instead gradient descent optimization can find parameters that produce very little error The solution described is at the global minimum Gradient descent could converge to this solution Convergence depends on initial values Would not always find easily understood integer solutions 33