Introduction to Neural Networks


Introduction to Neural Networks
Philipp Koehn
3 October 2017

Linear Models

So far we have used a weighted linear combination of feature values h_j and weights λ_j:

  score(λ, d_i) = Σ_j λ_j h_j(d_i)

Such models can be illustrated as a network.
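
As a minimal sketch (the feature values and weights below are made up for illustration), this score is just a dot product:

    import numpy as np

    # hypothetical feature values h_j(d_i) for one document, and their weights lambda_j
    h = np.array([1.0, 0.0, 3.0])
    lam = np.array([0.5, -1.2, 0.25])

    score = np.dot(lam, h)  # score(lambda, d_i) = sum_j lambda_j * h_j(d_i)
    print(score)            # 1.25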

Limits of Linearity

We can give each feature a weight, but we cannot model more complex value relationships, e.g.:
- any value in the range [0;5] is equally good
- values over 8 are bad
- higher than 10 is not worse

XOR

Linear models cannot model XOR.

(Figure: four points labeled good, bad, bad, good; the two "good" points lie on opposite corners, so no straight line separates good from bad.)

Multiple Layers

Add an intermediate ("hidden") layer of processing (each arrow is a weight).

Have we gained anything so far?

Non-Linearity

Instead of computing a linear combination

  score(λ, d_i) = Σ_j λ_j h_j(d_i)

add a non-linear function:

  score(λ, d_i) = f( Σ_j λ_j h_j(d_i) )

Popular choices:

  tanh(x)
  sigmoid(x) = 1 / (1 + e^{-x})
  relu(x) = max(0, x)

(sigmoid is also called the logistic function)
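
A small sketch of these activation functions (using NumPy; the function names are my own):

    import numpy as np

    def sigmoid(x):
        # logistic function: 1 / (1 + e^(-x))
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return np.tanh(x)

    def relu(x):
        # rectified linear unit: max(0, x), applied elementwise
        return np.maximum(0.0, x)

    x = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(x))  # [0.119 0.5   0.881]
    print(tanh(x))     # [-0.964  0.     0.964]
    print(relu(x))     # [0. 0. 2.]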

Deep Learning

More layers = deep learning

What Depths Holds

- Each layer is a processing step
- Having multiple processing steps allows complex functions
- Metaphor: NN and computing circuits
  - computer = sequence of Boolean gates
  - neural computer = sequence of layers
- Deep neural networks can implement complex functions, e.g., sorting on input values

Example

Simple Neural Network

(Figure: a network with two input nodes, two hidden nodes, and one output node. Weights into the hidden layer: 3.7, 3.7 and 2.9, 2.9, with bias weights -1.5 and -4.5; weights into the output node: 4.5, -5.2, with bias weight -2.0.)

One innovation: bias units (no inputs, always value 1)

Sample Input

(Figure: the same network, now with input values 1.0 and 0.0.)

Try out two input values. Hidden unit computation:

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 - 1.5) = sigmoid(2.2) = 1 / (1 + e^{-2.2}) = 0.90
  sigmoid(1.0 × 2.9 + 0.0 × 2.9 - 4.5) = sigmoid(-1.6) = 1 / (1 + e^{1.6}) = 0.17

Computed Hidden

(Figure: the same network, with the hidden unit values 0.90 and 0.17 filled in.)

Hidden unit computation:

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 - 1.5) = sigmoid(2.2) = 1 / (1 + e^{-2.2}) = 0.90
  sigmoid(1.0 × 2.9 + 0.0 × 2.9 - 4.5) = sigmoid(-1.6) = 1 / (1 + e^{1.6}) = 0.17

Compute Output

Output unit computation:

  sigmoid(0.90 × 4.5 + 0.17 × (-5.2) - 2.0) = sigmoid(1.17) = 1 / (1 + e^{-1.17}) = 0.76

Computed Output

(Figure: the same network, with the output value 0.76 filled in.)

Output unit computation:

  sigmoid(0.90 × 4.5 + 0.17 × (-5.2) - 2.0) = sigmoid(1.17) = 1 / (1 + e^{-1.17}) = 0.76
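
A minimal NumPy sketch of this forward pass, using the example weights above (the variable names are my own); it reproduces the hidden values 0.90, 0.17 and the output 0.76:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # weights of the example network
    W_hidden = np.array([[3.7, 3.7],    # weights into hidden unit h0
                         [2.9, 2.9]])   # weights into hidden unit h1
    b_hidden = np.array([-1.5, -4.5])   # bias weights of the hidden units
    w_out = np.array([4.5, -5.2])       # weights into the output unit
    b_out = -2.0                        # bias weight of the output unit

    x = np.array([1.0, 0.0])            # sample input

    h = sigmoid(W_hidden @ x + b_hidden)   # hidden layer: about [0.90, 0.17]
    y = sigmoid(w_out @ h + b_out)         # output: about 0.76
    print(h, y)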

Output for all Binary Inputs

  Input x0 | Input x1 | Hidden h0 | Hidden h1 | Output y0
     0     |    0     |   0.12    |   0.02    |   0.18
     1     |    0     |   0.88    |   0.27    |   0.74
     0     |    1     |   0.73    |   0.12    |   0.74
     1     |    1     |   0.99    |   0.73    |   0.33

The network implements XOR:
- hidden node h0 is OR
- hidden node h1 is AND
- final layer operation is h0 - h1

Power of deep neural networks: chaining of processing steps, just as more Boolean circuits make more complex computations possible.

Why Neural Networks?

Neuron in the Brain

The human brain is made up of about 100 billion neurons.

(Figure: a neuron, with soma, nucleus, dendrites, axon, and axon terminals labeled.)

Neurons receive electric signals at the dendrites and send them to the axon.

Neural Communication

The axon of the neuron is connected to the dendrites of many other neurons.

(Figure: a synapse, with axon terminal, synaptic vesicles, neurotransmitters, voltage-gated Ca++ channel, neurotransmitter transporter, synaptic cleft, receptors, postsynaptic density, and dendrite labeled.)

The Brain vs. Artificial Neural Networks

Similarities:
- neurons, connections between neurons
- learning = change of connections, not change of neurons
- massive parallel processing

But artificial neural networks are much simpler:
- computation within a neuron vastly simplified
- discrete time steps
- typically some form of supervised learning with a massive number of stimuli

Back-Propagation Training

Error

(Figure: the example network with input 1.0, 0.0, hidden values 0.90, 0.17, and output 0.76.)

Computed output: y = 0.76
Correct output: t = 1.0

How do we adjust the weights?

Key Concepts

Gradient descent:
- error is a function of the weights
- we want to reduce the error
- gradient descent: move towards the error minimum
- compute gradient → get direction to the error minimum
- adjust weights towards direction of lower error

Back-propagation:
- first adjust last set of weights
- propagate error back to each previous layer
- adjust their weights

Gradient Descent

(Figure: error(λ) plotted as a function of a single weight λ; the gradient at the current λ points towards the optimal λ.)

Gradient Descent

(Figure: error surface over two weights w1 and w2; at the current point, the gradients for w1 and w2 combine into a single gradient pointing towards the optimum.)

Derivative of Sigmoid

Sigmoid:

  sigmoid(x) = 1 / (1 + e^{-x})

Reminder: quotient rule

  ( f(x) / g(x) )' = ( g(x) f'(x) - f(x) g'(x) ) / g(x)^2

Derivative:

  d sigmoid(x) / dx = d/dx [ 1 / (1 + e^{-x}) ]
                    = ( 0 × (1 + e^{-x}) - (-e^{-x}) × 1 ) / (1 + e^{-x})^2
                    = e^{-x} / (1 + e^{-x})^2
                    = ( 1 / (1 + e^{-x}) ) × ( e^{-x} / (1 + e^{-x}) )
                    = sigmoid(x) ( 1 - sigmoid(x) )
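
As a quick numerical sanity check (my own sketch, not from the slides), the identity sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) can be verified with a central finite difference:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = 0.7
    eps = 1e-6
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
    analytic = sigmoid(x) * (1.0 - sigmoid(x))
    print(numeric, analytic)   # both approximately 0.2217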

Final Layer Update

Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:

  dE/dw_k = dE/dy × dy/ds × ds/dw_k

Final Layer Update (1)

Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:

  dE/dw_k = dE/dy × dy/ds × ds/dw_k

Error E is defined with respect to y:

  dE/dy = d/dy [ 1/2 (t - y)^2 ] = -(t - y)

Final Layer Update (2)

Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:

  dE/dw_k = dE/dy × dy/ds × ds/dw_k

y with respect to s is sigmoid(s):

  dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 - sigmoid(s)) = y(1 - y)

Final Layer Update (3)

Linear combination of weights: s = Σ_k w_k h_k
Activation function: y = sigmoid(s)
Error (L2 norm): E = 1/2 (t - y)^2

Derivative of error with regard to one weight w_k:

  dE/dw_k = dE/dy × dy/ds × ds/dw_k

s is the weighted linear combination of hidden node values h_k:

  ds/dw_k = d/dw_k Σ_k w_k h_k = h_k

Putting it All Together

Derivative of error with regard to one weight w_k:

  dE/dw_k = dE/dy × dy/ds × ds/dw_k = -(t - y) × y(1 - y) × h_k

- error: (t - y)
- derivative of sigmoid: y' = y(1 - y)

Weight adjustment will be scaled by a fixed learning rate µ (the minus sign disappears because we move against the gradient):

  Δw_k = µ (t - y) y' h_k
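
A minimal sketch of this update rule (my own variable names), applied to the example's output layer:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    mu = 10.0                        # learning rate used in the example below
    h = np.array([0.90, 0.17, 1.0])  # hidden values, with the bias unit as a fixed 1.0
    w = np.array([4.5, -5.2, -2.0])  # weights into the output node (last one is the bias weight)
    t = 1.0                          # target output

    s = w @ h                        # linear combination
    y = sigmoid(s)                   # computed output, about 0.76
    delta = (t - y) * y * (1 - y)    # error term of the output node
    w = w + mu * delta * h           # gradient descent step: w_k += mu * delta * h_k
    print(y, delta, w)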

Multiple Output Nodes

Our example only had one output node. Typically, neural networks have multiple output nodes.

Error is computed over all j output nodes:

  E = Σ_j 1/2 (t_j - y_j)^2

Weights k → j are adjusted according to the node they point to:

  Δw_{j←k} = µ (t_j - y_j) y_j' h_k

Hidden Layer Update

In a hidden layer, we do not have a target output value. But we can compute how much each node contributed to downstream error.

Definition of the error term of each node:

  δ_j = (t_j - y_j) y_j'

Back-propagate the error term (why this way? there is math to back it up...):

  δ_i = ( Σ_j w_{j←i} δ_j ) y_i'

Universal update formula:

  Δw_{j←k} = µ δ_j h_k

Our Example

(Figure: the example network with nodes labeled A, B, C (inputs, C = bias), D, E, F (hidden layer, F = bias), and G (output); input 1.0, 0.0, hidden values 0.90, 0.17, output 0.76.)

Computed output: y = 0.76
Correct output: t = 1.0

Final layer weight updates (learning rate µ = 10):

  δ_G = (t - y) y' = (1 - 0.76) × 0.18 = 0.0434
  Δw_GD = µ δ_G h_D = 10 × 0.0434 × 0.90 = 0.39
  Δw_GE = µ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
  Δw_GF = µ δ_G h_F = 10 × 0.0434 × 1 = 0.434

Our Example

(Figure: the same network with the updated output-layer weights shown: w_GD = 4.5 → 4.89, w_GE = -5.2 → -5.126, w_GF = -2.0 → -1.566.)

Computed output: y = 0.76
Correct output: t = 1.0

Final layer weight updates (learning rate µ = 10):

  δ_G = (t - y) y' = (1 - 0.76) × 0.18 = 0.0434
  Δw_GD = µ δ_G h_D = 10 × 0.0434 × 0.90 = 0.39
  Δw_GE = µ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
  Δw_GF = µ δ_G h_F = 10 × 0.0434 × 1 = 0.434

Hidden Layer Updates

(Figure: the same network, with the updated output-layer weights.)

Hidden node D:

  δ_D = ( Σ_j w_{j←i} δ_j ) y_D' = w_GD δ_G y_D' = 4.5 × 0.0434 × 0.0898 = 0.0175
  Δw_DA = µ δ_D h_A = 10 × 0.0175 × 1.0 = 0.175
  Δw_DB = µ δ_D h_B = 10 × 0.0175 × 0.0 = 0
  Δw_DC = µ δ_D h_C = 10 × 0.0175 × 1 = 0.175

Hidden node E:

  δ_E = ( Σ_j w_{j←i} δ_j ) y_E' = w_GE δ_G y_E' = -5.2 × 0.0434 × 0.2055 = -0.0464
  Δw_EA = µ δ_E h_A = 10 × (-0.0464) × 1.0 = -0.464
  etc.
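
A sketch of one full backward pass for this example (my own code; it follows the update formulas above, though the exact numbers differ slightly from the slide's because of rounding in the intermediate values):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    mu = 10.0
    x = np.array([1.0, 0.0, 1.0])       # inputs A, B and bias unit C
    W1 = np.array([[3.7, 3.7, -1.5],    # weights into hidden node D
                   [2.9, 2.9, -4.5]])   # weights into hidden node E
    w2 = np.array([4.5, -5.2, -2.0])    # weights into output node G (last entry: bias unit F)
    t = 1.0                             # target

    h = sigmoid(W1 @ x)                 # hidden values D, E: about [0.90, 0.17]
    h_ext = np.append(h, 1.0)           # append the bias unit F
    y = sigmoid(w2 @ h_ext)             # output: about 0.76

    delta_G = (t - y) * y * (1 - y)            # output error term
    delta_h = (w2[:2] * delta_G) * h * (1 - h) # hidden error terms delta_D, delta_E

    w2 += mu * delta_G * h_ext          # final layer update
    W1 += mu * np.outer(delta_h, x)     # hidden layer update
    print(delta_G, delta_h)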

Some Additional Aspects

Initialization of Weights

Weights are initialized randomly, e.g., uniformly from the interval [-0.01, 0.01].

Glorot and Bengio (2010) suggest
- for shallow neural networks:

    [ -1/√n, 1/√n ]

  where n is the size of the previous layer
- for deep neural networks:

    [ -√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1}) ]

  where n_j is the size of the previous layer and n_{j+1} the size of the next layer
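
A sketch of this initialization scheme in NumPy (the function and variable names are my own):

    import numpy as np

    def glorot_uniform(n_in, n_out):
        # Glorot/Xavier initialization for a weight matrix of shape (n_out, n_in)
        limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
        return np.random.uniform(-limit, limit, size=(n_out, n_in))

    W = glorot_uniform(200, 200)   # e.g., a 200-node layer fed by a 200-node layer
    print(W.shape, W.min(), W.max())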

Neural Networks for Classification

Predict class: one output node per class

Training data output: one-hot vector, e.g., y = (0, 0, 1)^T

Prediction:
- predicted class is the output node y_i with the highest value
- obtain a posterior probability distribution by soft-max:

    softmax(y_i) = e^{y_i} / Σ_j e^{y_j}
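
A small sketch of the soft-max (the max-subtraction for numerical stability is my addition, not part of the formula above; it does not change the result):

    import numpy as np

    def softmax(y):
        # subtract the maximum before exponentiating, for numerical stability
        z = np.exp(y - np.max(y))
        return z / z.sum()

    scores = np.array([1.0, 2.0, 0.5])   # raw output node values (made up)
    p = softmax(scores)
    print(p, p.sum())   # probabilities summing to 1; the second class is the most likely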

Problems with Gradient Descent Training

(Figure: error(λ) curve with updates overshooting the minimum.)

Too high learning rate

Problems with Gradient Descent Training

(Figure: error(λ) curve illustrating a bad starting point.)

Bad initialization

Problems with Gradient Descent Training

(Figure: error(λ) curve with a local optimum and a global optimum.)

Local optimum

Speedup: Momentum Term

Updates may move a weight slowly in one direction.

To speed this up, we can keep a memory of prior updates Δw_{j←k}(n-1) ... and add these to any new updates (with decay factor ρ):

  Δw_{j←k}(n) = µ δ_j h_k + ρ Δw_{j←k}(n-1)
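
A minimal sketch of a momentum update on a toy objective (my own code; it is written in terms of a generic gradient rather than δ_j h_k, but the update has the same form):

    import numpy as np

    mu = 0.1    # learning rate
    rho = 0.9   # decay factor for the momentum term

    w = np.zeros(3)          # weights
    velocity = np.zeros(3)   # memory of prior updates, Delta w(n-1)

    def gradient(w):
        # stand-in for the real gradient; here: gradient of a toy quadratic error
        return 2 * (w - np.array([1.0, -2.0, 0.5]))

    for step in range(100):
        update = -mu * gradient(w) + rho * velocity   # Delta w(n) = step + rho * Delta w(n-1)
        w = w + update
        velocity = update
    print(w)   # converges towards [1.0, -2.0, 0.5]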

Adagrad

Typically, the learning rate µ is reduced over time:
- at the beginning, things have to change a lot
- later, just fine-tuning

Adapting the learning rate per parameter: the Adagrad update is based on the gradient of the error E with respect to the weight w at time t, g_t = dE/dw:

  Δw_t = µ / √( Σ_{τ=1}^{t} g_τ² ) × g_t
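
A sketch of the Adagrad rule (my own code; eps is a small constant I add to avoid division by zero and is not part of the formula above):

    import numpy as np

    mu = 0.1
    eps = 1e-8
    w = np.zeros(3)
    sum_sq_grad = np.zeros(3)   # running sum of squared gradients, per parameter

    def gradient(w):
        # stand-in gradient of a toy quadratic error
        return 2 * (w - np.array([1.0, -2.0, 0.5]))

    for step in range(500):
        g = gradient(w)
        sum_sq_grad += g ** 2
        w -= mu / (np.sqrt(sum_sq_grad) + eps) * g   # per-parameter scaled step
    print(w)   # moves towards [1.0, -2.0, 0.5]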

Dropout

A general problem of machine learning: overfitting to training data (very good on train, bad on unseen test).

Solution: regularization, e.g., keeping weights from having extreme values.

Dropout: randomly remove some hidden units during training
- mask: set of hidden units dropped
- randomly generate, say, 10-20 masks
- alternate between the masks during training

Why does that work? → bagging, ensemble, ...
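
A minimal sketch of applying a dropout mask to a hidden layer (my own code; the drop rate of 0.5 is an assumption):

    import numpy as np

    drop_rate = 0.5
    h = np.array([0.90, 0.17, 0.64, 0.33])           # hidden unit values (made up)

    mask = np.random.rand(h.shape[0]) > drop_rate    # True = keep the unit
    h_dropped = h * mask                             # dropped units contribute 0 downstream
    print(mask, h_dropped)                           # (at test time, no units are dropped)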

Mini Batches

Each training example yields a set of weight updates Δw_i.

Batch up several training examples:
- sum up their updates
- apply the sum to the model

Mostly done for speed reasons.
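
A sketch of batching updates, reusing the final-layer update rule from earlier (everything here is illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def example_update(w, h, t, mu=0.1):
        # per-example update: Delta w_k = mu * (t - y) * y(1 - y) * h_k
        y = sigmoid(w @ h)
        return mu * (t - y) * y * (1 - y) * h

    def minibatch_step(w, batch, mu=0.1):
        # sum the updates of all examples in the batch, then apply the sum once
        total = np.zeros_like(w)
        for h, t in batch:
            total += example_update(w, h, t, mu)
        return w + total

    w = np.array([0.5, -0.3, 0.1])
    batch = [(np.array([1.0, 0.0, 1.0]), 1.0),
             (np.array([0.0, 1.0, 1.0]), 0.0)]
    w = minibatch_step(w, batch)
    print(w)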

Computational Aspects

Vector and Matrix Multiplications

Forward computation: s = W h

Activation function: y = sigmoid(s)

Error term: δ = (t - y) · sigmoid'(s)

Propagation of error term: δ_i = W^T δ_{i+1} · sigmoid'(s)

Weight updates: ΔW = µ δ h^T

(Here h, s, y, t, δ are vectors and W is a weight matrix.)
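
A compact NumPy sketch of these vectorized operations for a network with one hidden layer (my own code, following the formulas above):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden weights
    W2 = rng.normal(scale=0.1, size=(2, 4))   # hidden -> output weights
    x = np.array([1.0, 0.0, 1.0])             # input vector (last entry could be a bias unit)
    t = np.array([0.0, 1.0])                  # target vector
    mu = 0.5

    # forward computation
    s1 = W1 @ x;  h = sigmoid(s1)             # hidden layer
    s2 = W2 @ h;  y = sigmoid(s2)             # output layer

    # error terms
    delta2 = (t - y) * y * (1 - y)            # output layer: (t - y) * sigmoid'(s2)
    delta1 = (W2.T @ delta2) * h * (1 - h)    # propagated: W^T delta * sigmoid'(s1)

    # weight updates: Delta W = mu * delta * h^T
    W2 += mu * np.outer(delta2, h)
    W1 += mu * np.outer(delta1, x)
    print(y)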

GPU

Neural network layers may have, say, 200 nodes.

Computations such as W h require 200 × 200 = 40,000 multiplications.

Graphics Processing Units (GPUs) are designed for such computations:
- image rendering requires such vector and matrix operations
- massively multi-core, but lean processing units
- example: NVIDIA Tesla K20c GPU provides 2496 thread processors

Extensions to C to support programming of GPUs, such as CUDA.

Toolkits

- Theano
- Tensorflow (Google)
- PyTorch (Facebook)
- MXNet (Amazon)
- DyNet