Introduction to Neural Networks


Introduction to Neural Networks. Philipp Koehn, 4 April 2015

Linear Models

Before, we used a weighted linear combination of feature values h_j and weights λ_j:

score(λ, d_i) = Σ_j λ_j h_j(d_i)

Such models can be illustrated as a network.

Limits of Linearity

We can give each feature a weight, but not more complex value relationships, e.g.:
- any value in the range [0;5] is equally good
- values over 8 are bad
- higher than 10 is not worse

XOR

Linear models cannot model XOR.

[Figure: four points arranged in a square, labeled good, bad, bad, good; no straight line separates the good points from the bad ones.]

Multiple Layers

Add an intermediate ("hidden") layer of processing (each arrow is a weight). Have we gained anything so far?

Non-Linearity

Instead of computing a linear combination

score(λ, d_i) = Σ_j λ_j h_j(d_i)

add a non-linear function:

score(λ, d_i) = f(Σ_j λ_j h_j(d_i))

Popular choices for f: tanh(x), sigmoid(x) = 1 / (1 + e^(-x))

(sigmoid is also called the logistic function)
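
As a small illustration of the difference (an addition to the slides, not part of the lecture), the sketch below computes a weighted linear combination and then passes it through the two activation functions mentioned above; the weights and feature values here are made up:

```python
import math

def linear_score(weights, features):
    # weighted linear combination: score = sum_j lambda_j * h_j
    return sum(w * h for w, h in zip(weights, features))

def sigmoid(x):
    # logistic function: 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

weights = [0.5, -1.2, 2.0]   # made-up lambda_j
features = [1.0, 0.3, 0.7]   # made-up h_j(d_i)

s = linear_score(weights, features)
print("linear score:", s)             # unbounded linear combination
print("tanh(score):", math.tanh(s))   # squashed to (-1, 1)
print("sigmoid(score):", sigmoid(s))  # squashed to (0, 1)
```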

Deep Learning

More layers = deep learning.

Example

Simple Neural Network

[Figure: a feed-forward network with two input nodes, two hidden nodes, and one output node. Weights into the hidden layer: 3.7, 3.7 and 2.9, 2.9, with bias weights -1.5 and -4.5; weights into the output node: 4.5 and -5.2, with bias weight -2.0.]

One innovation: bias units (no inputs, always value 1).

Sample Input

[Figure: the network from the previous slide with input values 1.0 and 0.0 filled in.]

Try out two input values.

Hidden unit computation:
sigmoid(1.0 × 3.7 + 0.0 × 3.7 - 1.5) = sigmoid(2.2) = 1 / (1 + e^(-2.2)) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 - 4.5) = sigmoid(-1.6) = 1 / (1 + e^(1.6)) = 0.17

Computed Hidden

[Figure: the same network with the computed hidden node values 0.90 and 0.17 filled in.]

Hidden unit computation as above: sigmoid(2.2) = 0.90, sigmoid(-1.6) = 0.17

Compute Output

Output unit computation:
sigmoid(0.90 × 4.5 + 0.17 × (-5.2) - 2.0) = sigmoid(1.17) = 1 / (1 + e^(-1.17)) = 0.76

Computed Output

[Figure: the network with the computed output value 0.76 filled in.]

Output unit computation as above: sigmoid(1.17) = 0.76
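
For readers who want to check the arithmetic, here is a minimal Python sketch of this forward pass (the function names are made up for this example; the weights, bias weights, and always-1 bias units are the ones used in the computations above):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x1, x2):
    # hidden layer: weights 3.7/3.7 and 2.9/2.9, bias weights -1.5 and -4.5
    h1 = sigmoid(3.7 * x1 + 3.7 * x2 - 1.5)
    h2 = sigmoid(2.9 * x1 + 2.9 * x2 - 4.5)
    # output layer: weights 4.5 and -5.2, bias weight -2.0
    y = sigmoid(4.5 * h1 - 5.2 * h2 - 2.0)
    return h1, h2, y

print(forward(1.0, 0.0))   # approx (0.90, 0.17, 0.76), matching the slides
```

Evaluating all four binary input combinations gives roughly 0.22, 0.76, 0.76, 0.17, so this particular set of weights approximates the XOR pattern from the earlier slide.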

Why Neural Networks?

Neuron in the Brain

The human brain is made up of about 100 billion neurons.

[Figure: a biological neuron, with dendrites, soma, nucleus, axon, and axon terminal labeled.]

Neurons receive electric signals at the dendrites and send them to the axon.

Neural Communication

The axon of the neuron is connected to the dendrites of many other neurons.

[Figure: a synapse, with axon terminal, synaptic vesicles, neurotransmitters, neurotransmitter transporter, voltage-gated Ca++ channel, synaptic cleft, receptors, postsynaptic density, and dendrite labeled.]

The Brain vs. Artificial Neural Networks

Similarities:
- neurons, connections between neurons
- learning = change of connections, not change of neurons
- massive parallel processing

But artificial neural networks are much simpler:
- computation within a neuron vastly simplified
- discrete time steps
- typically some form of supervised learning with a massive number of stimuli

Back-Propagation Training

Error

[Figure: the example network with input values 1.0, 0.0, hidden values 0.90, 0.17, and output value 0.76.]

Computed output: y = 0.76
Correct output: t = 1.0

How do we adjust the weights?

Key Concepts

Gradient descent:
- error is a function of the weights
- we want to reduce the error
- gradient descent: move towards the error minimum
- compute the gradient to get the direction to the error minimum
- adjust weights towards the direction of lower error

Back-propagation:
- first adjust the last set of weights
- propagate the error back to each previous layer
- adjust their weights

Derivative of Sigmoid

Sigmoid: sigmoid(x) = 1 / (1 + e^(-x))

Reminder, quotient rule: (f(x) / g(x))' = (g(x) f'(x) - f(x) g'(x)) / g(x)^2

Derivative:
d sigmoid(x) / dx
= d/dx [1 / (1 + e^(-x))]
= (0 × (1 + e^(-x)) - (-e^(-x)) × 1) / (1 + e^(-x))^2
= e^(-x) / (1 + e^(-x))^2
= (1 / (1 + e^(-x))) × (e^(-x) / (1 + e^(-x)))
= sigmoid(x) × (1 - sigmoid(x))
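
A quick numerical sanity check of this identity, added here as an illustration: the finite-difference slope of the sigmoid should match sigmoid(x)(1 - sigmoid(x)) at any point x.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in [-2.0, 0.0, 1.17, 3.0]:
    eps = 1e-6
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # finite-difference slope
    analytic = sigmoid(x) * (1.0 - sigmoid(x))                    # sigmoid(x)(1 - sigmoid(x))
    print(f"x={x:5.2f}  numeric={numeric:.6f}  analytic={analytic:.6f}")
```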

Final Layer Update

Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy × dy/ds × ds/dw_k

Final Layer Update (1)

Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy × dy/ds × ds/dw_k

Error E is defined with respect to y:
dE/dy = d/dy [1/2 (t - y)^2] = -(t - y)

Final Layer Update (2)

Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy × dy/ds × ds/dw_k

y with respect to s is sigmoid(s):
dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 - sigmoid(s)) = y(1 - y)

Final Layer Update (3)

Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy × dy/ds × ds/dw_k

s is the weighted linear combination of the hidden node values h_k:
ds/dw_k = d/dw_k Σ_k w_k h_k = h_k

Putting it All Together

Derivative of error with regard to one weight w_k:
dE/dw_k = dE/dy × dy/ds × ds/dw_k = -(t - y) × y(1 - y) × h_k

where (t - y) is the error and y(1 - y) is the derivative of the sigmoid, y'.

The weight adjustment moves against the gradient, scaled by a fixed learning rate µ:
Δw_k = µ (t - y) y' h_k
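
A minimal sketch of this final-layer update rule in Python (the function and variable names are chosen here, not taken from the slides): given hidden values h_k, current weights w_k, and target t, it computes Δw_k = µ (t - y) y(1 - y) h_k.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def final_layer_update(weights, hidden, target, mu):
    # forward: s = sum_k w_k h_k, y = sigmoid(s)
    s = sum(w * h for w, h in zip(weights, hidden))
    y = sigmoid(s)
    # delta = (t - y) * y * (1 - y): error times derivative of sigmoid
    delta = (target - y) * y * (1.0 - y)
    # weight adjustment: mu * delta * h_k for each weight
    return [mu * delta * h for h in hidden]

# hidden node values from the running example (0.90, 0.17) plus the bias unit 1
updates = final_layer_update([4.5, -5.2, -2.0], [0.90, 0.17, 1.0], target=1.0, mu=10.0)
print(updates)   # approx [0.39, 0.073, 0.43], matching the worked example below up to rounding
```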

Multiple Output Nodes

Our example only had one output node. Typically, neural networks have multiple output nodes.

Error is computed over all j output nodes:
E = Σ_j 1/2 (t_j - y_j)^2

Weights k → j are adjusted according to the node they point to:
Δw_{j←k} = µ (t_j - y_j) y_j' h_k

Hidden Layer Update

In a hidden layer, we do not have a target output value. But we can compute how much each node contributed to downstream error.

Definition of the error term of each node:
δ_j = (t_j - y_j) y_j'

Back-propagate the error term (why this way? there is math to back it up...):
δ_i = (Σ_j w_{j←i} δ_j) y_i'

Universal update formula:
Δw_{j←k} = µ δ_j h_k

Our Example

[Figure: the example network with nodes labeled A, B (inputs, values 1.0 and 0.0), C (bias unit), D, E (hidden nodes, values 0.90 and 0.17), F (bias unit), and G (output, value 0.76).]

Computed output: y = 0.76
Correct output: t = 1.0

Final layer weight updates (learning rate µ = 10):
δ_G = (t - y) y' = (1 - 0.76) × 0.18 = 0.0434
Δw_GD = µ δ_G h_D = 10 × 0.0434 × 0.90 = 0.39
Δw_GE = µ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
Δw_GF = µ δ_G h_F = 10 × 0.0434 × 1 = 0.434

Our Example

[Figure: the same network with the updated final layer weights shown: 4.5 → 4.89, -5.2 → -5.126, -2.0 → -1.566.]

The final layer weight updates from the previous slide, applied:
w_GD: 4.5 + 0.39 = 4.89
w_GE: -5.2 + 0.074 = -5.126
w_GF: -2.0 + 0.434 = -1.566

Hidden Layer Updates

[Figure: the network with the updated final layer weights from the previous slide.]

Hidden node D:
δ_D = (Σ_j w_{j←i} δ_j) y_D' = w_GD δ_G y_D' = 4.5 × 0.0434 × 0.0898 = 0.0175
Δw_DA = µ δ_D h_A = 10 × 0.0175 × 1.0 = 0.175
Δw_DB = µ δ_D h_B = 10 × 0.0175 × 0.0 = 0
Δw_DC = µ δ_D h_C = 10 × 0.0175 × 1 = 0.175

Hidden node E:
δ_E = (Σ_j w_{j←i} δ_j) y_E' = w_GE δ_G y_E' = -5.2 × 0.0434 × 0.2055 = -0.0464
Δw_EA = µ δ_E h_A = 10 × (-0.0464) × 1.0 = -0.464
etc.
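
The hidden-layer step for node D can be sketched the same way (again a rough illustration with assumed variable names, not code from the lecture): back-propagate δ_G through the weight connecting D to G and scale by the sigmoid derivative at D.

```python
mu = 10.0                     # learning rate from the worked example

# values from the running example
y_D = 0.90                    # output of hidden node D
delta_G = 0.0434              # error term of output node G (previous slides)
w_GD = 4.5                    # weight from D to G
inputs = {"A": 1.0, "B": 0.0, "C": 1.0}   # input values A, B plus bias unit C

# delta_i = (sum_j w_{j<-i} delta_j) * y_i'; G is the only node downstream of D
delta_D = w_GD * delta_G * y_D * (1.0 - y_D)
print("delta_D:", delta_D)    # about 0.0176, matching the slide's 0.0175 up to rounding

# universal update formula: delta-w_{j<-k} = mu * delta_j * h_k
for name, h in inputs.items():
    print(f"delta-w_D{name}:", mu * delta_D * h)   # about 0.175, 0.0, 0.175 for A, B, C
```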

Some Additional Aspects

Initialization of Weights

Weights are initialized randomly, e.g., uniformly from the interval [-0.01, 0.01].

Glorot and Bengio (2010) suggest:
- for shallow neural networks: [-1/√n, 1/√n], where n is the size of the previous layer
- for deep neural networks: [-√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})], where n_j is the size of the previous layer and n_{j+1} the size of the next layer
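
A sketch of these initialization ranges using NumPy; the function name and the orientation of the weight matrix are choices made for this example, not prescribed by the slides:

```python
import numpy as np

def init_weights(n_prev, n_next, deep=True, rng=np.random.default_rng(0)):
    """Randomly initialize a weight matrix mapping n_prev units to n_next units."""
    if deep:
        # Glorot and Bengio (2010) range for deep networks: +/- sqrt(6) / sqrt(n_j + n_{j+1})
        limit = np.sqrt(6.0) / np.sqrt(n_prev + n_next)
    else:
        # shallow networks: +/- 1 / sqrt(n), with n the size of the previous layer
        limit = 1.0 / np.sqrt(n_prev)
    return rng.uniform(-limit, limit, size=(n_next, n_prev))

W = init_weights(200, 200)
print(W.shape, W.min(), W.max())   # all values fall inside the chosen interval
```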

Neural Networks for Classification

Predict class: one output node per class.

Training data output: one-hot vector, e.g., y = (0, 0, 1)^T

Prediction:
- the predicted class is the output node y_i with the highest value
- obtain a posterior probability distribution by softmax:

softmax(y_i) = e^(y_i) / Σ_j e^(y_j)
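
A small sketch of the softmax computation; subtracting the maximum before exponentiating is a common numerical-stability trick added here, not something stated on the slide:

```python
import numpy as np

def softmax(y):
    # subtract the maximum before exponentiating to avoid overflow (stability trick)
    e = np.exp(y - np.max(y))
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.5])   # raw output node values
probs = softmax(scores)
print(probs, probs.sum())            # posterior distribution, sums to 1
print(int(np.argmax(probs)))         # predicted class = node with highest value
```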

Speedup: Momentum Term

Updates may move a weight slowly in one direction. To speed this up, we can keep a memory of prior updates Δw_{j←k}(n-1) and add these to any new updates (with decay factor ρ):

Δw_{j←k}(n) = µ δ_j h_k + ρ Δw_{j←k}(n-1)
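
A sketch of the momentum update as a running loop (hypothetical names and values; the plain update is just the µ δ_j h_k term from the formula above):

```python
def momentum_update(prev_update, delta_j, h_k, mu=0.01, rho=0.9):
    # new update = plain update (mu * delta_j * h_k) plus decayed previous update
    return mu * delta_j * h_k + rho * prev_update

update = 0.0
for step in range(5):
    update = momentum_update(update, delta_j=0.1, h_k=1.0)
    print(step, update)   # the update grows as repeated moves in the same direction accumulate
```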

Computational Aspects

Vector and Matrix Multiplications

- Forward computation: s = W h
- Activation function: y = sigmoid(s)
- Error term: δ = (t - y) · sigmoid'(s)
- Propagation of error term: δ_i = (W^T δ_{i+1}) · sigmoid'(s_i)
- Weight updates: ΔW = µ δ h^T

(here s, h, y, δ, t are vectors, W is a matrix, and · is elementwise multiplication)
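
In matrix form, one training step for the example network can be sketched with NumPy as below; the weight layout (one matrix per layer, with the bias weight folded in as an extra column) is an assumption of this sketch, not given on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# example network: weight matrices with the bias weight as the last column
W1 = np.array([[3.7, 3.7, -1.5],
               [2.9, 2.9, -4.5]])       # input -> hidden
W2 = np.array([[4.5, -5.2, -2.0]])      # hidden -> output
x = np.array([1.0, 0.0])
t = np.array([1.0])
mu = 10.0

# forward computation: s = W h, y = sigmoid(s)
h0 = np.append(x, 1.0)                  # input values plus bias unit
s1 = W1 @ h0
h1 = np.append(sigmoid(s1), 1.0)        # hidden values plus bias unit
s2 = W2 @ h1
y = sigmoid(s2)

# error term: delta = (t - y) * sigmoid'(s)
delta2 = (t - y) * sigmoid(s2) * (1.0 - sigmoid(s2))
# propagation of error term: delta_i = (W^T delta_{i+1}) * sigmoid'(s_i)
delta1 = (W2[:, :-1].T @ delta2) * sigmoid(s1) * (1.0 - sigmoid(s1))

# weight updates: delta-W = mu * delta h^T
W2 += mu * np.outer(delta2, h1)
W1 += mu * np.outer(delta1, h0)
print(y)    # approx [0.76] before the update
print(W2)   # roughly [[4.88, -5.13, -1.58]], close to the rounded values in the worked example
```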

GPU

Neural network layers may have, say, 200 nodes. Computations such as W h then require 200 × 200 = 40,000 multiplications.

Graphics Processing Units (GPUs) are designed for such computations:
- image rendering requires such vector and matrix operations
- massively multi-core, but lean processing units
- example: the NVIDIA Tesla K20c GPU provides 2496 thread processors

Extensions to C, such as CUDA, support programming of GPUs.

Theano

GPU library for Python. Homepage: http://deeplearning.net/software/theano/

See the web site for a sample implementation of back-propagation training.

Used to implement:
- neural network language models
- neural machine translation (Bahdanau et al., 2015)