Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)


Machine Learning Neural Networks (slides from Domingos, Pardo, others)

Human Brain

Neurons

Input-Output Transformation: input spikes (arriving as Excitatory Post-Synaptic Potentials) are transformed into an output spike (a spike is a brief pulse).

Human Learning. Number of neurons: ~10^11. Connections per neuron: ~10^3 to 10^5. Neuron switching time: ~0.001 second. Scene recognition time: ~0.1 second. 100 inference steps doesn't seem like much.

Machine Learning Abstraction

Artificial Neural Networks. Typically, machine learning ANNs are very artificial, ignoring: time, space, and biological learning processes. More realistic neural models exist; Hodgkin & Huxley (1952) won a Nobel Prize for theirs (in 1963). Nonetheless, very artificial ANNs have been useful in many ML applications.

Perceptrons. The first wave in neural networks. Big in the 1960s. McCulloch & Pitts (1943), Widrow & Hoff (1960), Rosenblatt (1962).

Problem definition: Perceptrons. Let f be a target function from X = <x_1, x_2, ...>, where x_i ∈ {0, 1}, to y ∈ {0, 1}. Given training data {(X_1, y_1), (X_2, y_2), ...}, learn h(X), an approximation of f(X).

A single perceptron. Inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a bias input x_0 = 1 (always) with weight w_0. Output is 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0.
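A minimal sketch of this unit in Python (the function name and the use of NumPy are my own choices, not from the slides; the example inputs and weights are illustrative):

```python
import numpy as np

def perceptron_output(x, w):
    """Threshold unit: returns 1 if sum_i w_i * x_i > 0, else 0.

    x and w include the bias term: x[0] is always 1, w[0] is the bias weight.
    """
    return 1 if np.dot(w, x) > 0 else 0

# Example: 2 real inputs plus the always-on bias input x0 = 1.
x = np.array([1.0, 0.7, -1.2])   # [x0, x1, x2]
w = np.array([-0.5, 1.0, 0.8])   # [w0, w1, w2]
print(perceptron_output(x, w))   # 0, since -0.5 + 0.7 - 0.96 < 0
```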

Logical Operators. With the same threshold unit (output 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0): AND uses weights w_1 = 0.5, w_2 = 0.5 with bias weight w_0 = -0.8; OR uses w_1 = 0.5, w_2 = 0.5 with w_0 = -0.3; NOT uses w_1 = -1.0 with w_0 = 0.1.
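A quick self-contained check of these truth tables in Python (a sketch; the weights are the ones given on the slide, the helper name is mine):

```python
import itertools

def step_unit(x, w):
    # Threshold unit: 1 if sum_i w_i * x_i > 0, else 0 (x[0] = 1 is the bias input).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

AND_w = [-0.8, 0.5, 0.5]   # [w0 (bias), w1, w2]
OR_w  = [-0.3, 0.5, 0.5]
NOT_w = [ 0.1, -1.0]       # [w0 (bias), w1]

for x1, x2 in itertools.product([0, 1], repeat=2):
    print(f"x1={x1} x2={x2}  AND={step_unit([1, x1, x2], AND_w)}  OR={step_unit([1, x1, x2], OR_w)}")
for x1 in (0, 1):
    print(f"x1={x1}  NOT={step_unit([1, x1], NOT_w)}")
```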

Learning Weights. How do we find the weights (including the bias weight) of a threshold unit? Two approaches covered here: the Perceptron Training Rule and Gradient Descent (other approaches exist, e.g. Genetic Algorithms).

Perceptron Training Rule. Weights are modified for each training example. Update rule: w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i, with η the learning rate, t the target value, o the perceptron output, and x_i the input value.
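A minimal sketch of this rule in Python (the function name, learning rate, and epoch count are my own illustrative choices):

```python
import numpy as np

def perceptron_train(examples, eta=0.1, epochs=20):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

    examples: list of (x, t) pairs, where x already includes the bias input x[0] = 1
    and t is the 0/1 target.
    """
    w = np.zeros(len(examples[0][0]))
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if np.dot(w, x) > 0 else 0    # current perceptron output
            w += eta * (t - o) * np.array(x)    # update every weight
    return w

# Learning OR: converges because OR is linearly separable.
or_examples = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
print(perceptron_train(or_examples))
```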

Perceptron Training for NOT. Work through the rule Δw_i = η (t − o) x_i on a one-input perceptron (bias input x_0 and input x_1, weights w_0 and w_1), updating the weights from their starting values until the unit computes NOT.

What weights make XOR? For a single threshold unit, no combination of weights works: perceptrons can only represent linearly separable functions.

Linear Separability. OR in the (x_1, x_2) plane: the positive and negative examples can be separated by a single line.

Linear Separability. AND in the (x_1, x_2) plane: also linearly separable.

Linear Separability. XOR in the (x_1, x_2) plane: no single line separates the positive from the negative examples.

Perceptron Training Rule. Converges to the correct classification IF the cases are linearly separable and the learning rate is slow enough. Proved by Minsky and Papert in 1969; their analysis killed widespread interest in perceptrons until the 1980s.

XOR. A two-layer network of threshold units computes XOR: one hidden unit with weights (−0.6, 0.6) on (x_1, x_2) fires for input (0, 1), another with weights (0.6, −0.6) fires for (1, 0), and an output threshold unit ORs the two hidden outputs.

What's wrong with perceptrons? You can always plug multiple perceptrons together to calculate any function. BUT who decides what the weights are? Assignment of error to parental inputs becomes a problem.

Perceptrons use a step function. The threshold output, 1 if Σ_{i=0}^{n} w_i x_i > 0 else 0, is a step function: small changes in the inputs produce either no change or a large change in the output.

Solution: a differentiable function. A simple linear unit outputs Σ_{i=0}^{n} w_i x_i. Varying any input a little creates a perceptible change in the output, so we can now characterize how the error changes with each w_i, even in the multi-layer case.

Measuring error for linear units. Output function: o(x) = w · x. Error measure: E(w) = 1/2 Σ_{d∈D} (t_d − o_d)^2, where D is the training data, t_d the target value, and o_d the linear unit output.

Gradient Descent. Gradient: ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]. Training rule: Δw = −η ∇E(w), i.e. Δw_i = −η ∂E/∂w_i.

Gradient Descent Rule. ∂E/∂w_i = ∂/∂w_i [1/2 Σ_{d∈D} (t_d − o_d)^2] = Σ_{d∈D} (t_d − o_d)(−x_{i,d}). Update rule: w_i ← w_i + η Σ_{d∈D} (t_d − o_d) x_{i,d}.
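A minimal sketch of this batch rule for a single linear unit (function name, learning rate, epoch count, and the toy data are my own illustrative choices):

```python
import numpy as np

def linear_unit_gd(X, t, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o(x) = w . x.

    X has shape (num_examples, num_inputs) with X[:, 0] = 1 as the bias input;
    t holds the target values. Implements w_i <- w_i + eta * sum_d (t_d - o_d) * x_{i,d}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                    # linear unit outputs for all examples
        w += eta * X.T @ (t - o)     # gradient step on E(w) = 1/2 sum_d (t_d - o_d)^2
    return w

# Toy data generated by t = 1 + 2 * x1, which a linear unit can fit exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
print(linear_unit_gd(X, t))          # approximately [1.0, 2.0]
```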

Gradient Descent for Multiple Layers. With differentiable units arranged in layers (e.g. a two-layer network computing XOR, with weights w_ij), we can compute ∂E/∂w_ij for every weight in the network.

Gradient Descent vs. Perceptrons. Perceptron Rule & threshold units: the learner converges on an answer ONLY IF the data is linearly separable, and we can't assign proper error to parent nodes. Gradient Descent: (locally) minimizes error even if the examples are not linearly separable, and works for multi-layer networks. But linear units only make linear decision surfaces (they can't learn XOR even with many layers), and the step function isn't differentiable.

A compromise function. Perceptron: output = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0. Linear: output = net = Σ_{i=0}^{n} w_i x_i. Sigmoid (logistic), the compromise: output = σ(net) = 1 / (1 + e^(−net)).

The sigmoid (logistic) unit. Has a differentiable activation function, which allows gradient descent, and can be used to learn non-linear functions. Output = 1 / (1 + e^(−Σ_{i=0}^{n} w_i x_i)).
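A small sketch of the sigmoid unit in Python (function names are mine; the printed derivative anticipates the backpropagation slides below):

```python
import numpy as np

def sigmoid(net):
    """Logistic function: sigma(net) = 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_unit(x, w):
    """Sigmoid unit: squashes the weighted sum of the inputs into (0, 1)."""
    return sigmoid(np.dot(w, x))

# The derivative used later by backpropagation: sigma'(net) = sigma(net) * (1 - sigma(net)).
net = 0.5
print(sigmoid(net), sigmoid(net) * (1 - sigmoid(net)))
```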

Logistic function. Example: inputs Age = 34, Gender = 1, Stage = 4 (the independent variables), multiplied by coefficients (.5, .4, .8 in the figure) and passed through 1 / (1 + e^(−Σ w_i x_i)), give the prediction 0.6, the probability of being alive.

Neural Network Model. The same inputs (Age = 34, Gender = 2, Stage = 4, the independent variables) now feed, through a first set of weights, a hidden layer of sigmoid units, whose outputs feed, through a second set of weights, an output sigmoid unit; the output, 0.6, is the predicted probability of being alive (the dependent variable).

Getting an answer from a NN. Three steps through the same network: compute each hidden unit's output from the weighted independent variables (Age, Gender, Stage), then combine the hidden outputs through the output unit's weights to produce the prediction, 0.6, the probability of being alive.
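A forward-pass sketch of this network in Python. The weights below are read off the garbled figure text as best as possible and may not be exact; the figure's 0.6 output is schematic (these numbers give roughly 0.79):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, w_out):
    """Forward pass: inputs -> sigmoid hidden layer -> sigmoid output."""
    h = sigmoid(W_hidden @ x)          # one activation per hidden unit
    return sigmoid(np.dot(w_out, h))   # single output in (0, 1)

x = np.array([34.0, 1.0, 4.0])          # Age, Gender, Stage
W_hidden = np.array([[0.6, 0.2, 0.1],   # weights into hidden unit 1
                     [0.3, 0.7, 0.2]])  # weights into hidden unit 2
w_out = np.array([0.5, 0.8])            # hidden -> output weights
print(forward(x, W_hidden, w_out))      # the predicted "probability of being alive"
```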

Minimizing the Error. Figure: the error surface over a weight w. Starting at w_initial (initial error), gradient descent follows the negative derivative, making positive changes to w until it reaches w_trained (final error), a local minimum.

Differentiability is key! The sigmoid is easy to differentiate: ∂σ(y)/∂y = σ(y) (1 − σ(y)). For gradient descent on multiple layers, a little dynamic programming helps: compute the errors at each output node, use these to compute the errors at each hidden node, and use those to compute the weight gradient.

The Backpropagation Algorithm. For each training example <x, t>: 1. Input the instance x to the network and compute the output o_u of every unit u in the network. 2. For each output unit k, calculate its error term δ_k = o_k (1 − o_k)(t_k − o_k). 3. For each hidden unit h, calculate its error term δ_h = o_h (1 − o_h) Σ_{k∈outputs} w_{hk} δ_k. 4. Update each network weight: w_{ji} ← w_{ji} + Δw_{ji}, where Δw_{ji} = η δ_j x_{ji}.
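A sketch of one such update for a single-hidden-layer sigmoid network in Python (names, shapes, learning rate, random seed, and the XOR demo are my own illustrative choices, not the slides'):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hidden, W_out, eta=0.5):
    """One stochastic backpropagation update for a single-hidden-layer sigmoid network.

    x: input vector (x[0] = 1 acts as the bias input), t: target output vector.
    W_hidden: (n_hidden, n_inputs); W_out: (n_outputs, n_hidden + 1). Updated in place.
    """
    # 1. Forward pass: compute every unit's output.
    h = sigmoid(W_hidden @ x)
    h_b = np.concatenate(([1.0], h))                  # bias unit for the output layer
    o = sigmoid(W_out @ h_b)
    # 2. Output error terms: delta_k = o_k (1 - o_k) (t_k - o_k).
    delta_out = o * (1 - o) * (t - o)
    # 3. Hidden error terms: delta_h = o_h (1 - o_h) * sum_k w_hk delta_k.
    delta_hidden = h * (1 - h) * (W_out[:, 1:].T @ delta_out)
    # 4. Weight updates: w_ji <- w_ji + eta * delta_j * x_ji.
    W_out += eta * np.outer(delta_out, h_b)
    W_hidden += eta * np.outer(delta_hidden, x)
    return o

# Illustrative training run on XOR (4 hidden units; exact results vary with the random seed).
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.5, size=(4, 3))
W_out = rng.normal(scale=0.5, size=(1, 5))
data = [([1, 0, 0], [0]), ([1, 0, 1], [1]), ([1, 1, 0], [1]), ([1, 1, 1], [0])]
for _ in range(10000):
    for x, t in data:
        backprop_step(np.array(x, float), np.array(t, float), W_hidden, W_out)
for x, t in data:
    h = sigmoid(W_hidden @ np.array(x, float))
    print(x, t, sigmoid(W_out @ np.concatenate(([1.0], h))))  # should move toward the targets
```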

Learning Weights. Backpropagation adjusts both sets of weights in the Age/Gender/Stage network (inputs → hidden layer → output) to reduce the error between the prediction (0.6, the probability of being alive) and the target.

The fine print. Don't implement back-propagation yourself; use a package (second-order and variable step-size optimization techniques exist). Feature normalization: it is typical to normalize inputs to lie in [0, 1] (and outputs must be normalized too). Problems with NN training: slow training times (though this is getting better) and local minima.

Minimizing the Error (revisited). The error-surface figure again: gradient descent from w_initial to w_trained can end in a local minimum rather than the global one.

Expressive Power of ANNs Universal Function Approximator: Given enough hidden units, can approximate any continuous function f Need 2+ hidden units to learn XOR Why not use millions of hidden units? Efficiency (training is slow) Overfitting

Overfitting. Figure: the real distribution vs. an overfitted model.

Combating Overfitting in Neural Nets. Many techniques exist; two popular ones: early stopping (use a lot of hidden units, just don't over-train) and cross-validation (test different architectures to choose the right number of hidden units).

Early Stopping. Figure: error vs. epochs for the training set (error_b, which keeps falling) and a validation set (error_a). The stopping criterion is the epoch where the validation error error_a is at its minimum; training past that point yields an overfitted model.

Cross-validation Cross-validation: general-purpose technique for model selection E.g., how many hidden units should I use? More extensive version of validation-set approach.

Cross-validation. Break the training set into k sets. For each model M: for i = 1..k, train M on all but set i and test on set i. Output the M with the highest average test score, trained on the full training set.
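A sketch of this procedure in Python (function and parameter names are mine; it assumes each candidate model is a factory returning an object with scikit-learn-style fit/score methods):

```python
import numpy as np

def k_fold_cv(models, X, y, k=5, seed=0):
    """Pick the model with the highest average held-out score, then retrain it on all data.

    models: dict mapping a model name to a factory returning an object with
    fit(X, y) and score(X, y) methods. X and y are NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    avg_scores = {}
    for name, make_model in models.items():
        scores = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            m = make_model()
            m.fit(X[train_idx], y[train_idx])                  # train on all but set i
            scores.append(m.score(X[test_idx], y[test_idx]))   # test on set i
        avg_scores[name] = float(np.mean(scores))
    best = max(avg_scores, key=avg_scores.get)
    final = models[best]()
    final.fit(X, y)                                            # retrain winner on the full set
    return best, avg_scores, final

# Example (assumes scikit-learn is available), choosing the number of hidden units:
# from sklearn.neural_network import MLPClassifier
# models = {f"{h} hidden units": (lambda h=h: MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000))
#           for h in (2, 5, 10)}
# best, scores, model = k_fold_cv(models, X, y)
```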

Summary of Neural Networks. When are neural networks useful? When instances are represented by attribute-value pairs (particularly when attributes are real-valued); when the target function is discrete-valued, real-valued, or vector-valued; when training examples may contain errors; and when fast evaluation times are necessary. When not? When fast training times are necessary or understandability of the learned function is required.

Summary of Neural Networks Non-linear regression technique that is trained with gradient descent. Question: How important is the biological metaphor?

Advanced Topics in Neural Nets. Batch mode vs. incremental; Auto-Encoders; Deep Belief Nets (briefly); Neural Networks on Silicon; Neural Network language models.

Incremental vs. Batch Mode

Incremental vs. Batch Mode. In batch mode we minimize the error over the entire training set D. This is the same as computing Δw_D = Σ_{d∈D} Δw_d and then setting w ← w + Δw_D.
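A sketch contrasting the two update schedules for a linear unit (names are mine; this continues the earlier gradient descent sketch):

```python
import numpy as np

def batch_epoch(w, X, t, eta):
    """Batch mode: accumulate Delta w_d over the whole training set D, then update once."""
    o = X @ w
    delta_w_D = eta * X.T @ (t - o)     # Delta w_D = sum_d Delta w_d
    return w + delta_w_D

def incremental_epoch(w, X, t, eta):
    """Incremental (stochastic) mode: update the weights after every single example."""
    for x_d, t_d in zip(X, t):
        o_d = np.dot(w, x_d)
        w = w + eta * (t_d - o_d) * x_d
    return w
```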

Advanced Topics in Neural Nets. Batch mode vs. incremental; Auto-Encoders; Deep Belief Nets (briefly); Neural Networks on Silicon; Neural Network language models.

Hidden Layer Representations. The input→hidden layer mapping is a representation of the input vectors tailored to the task. It can also be exploited for dimensionality reduction: a form of unsupervised learning in which we output a more compact representation of the input vectors, <x_1, ..., x_n> → <x'_1, ..., x'_m> where m < n. Useful for visualization, problem simplification, data compression, etc.
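A sketch of this idea as an auto-encoder in Python: train a network to reproduce its own input through a smaller hidden layer, then keep the input→hidden mapping as the compact representation. Names, learning rate, epoch count, and the classic 8-3-8 one-hot example are my own illustrative choices (no bias units, to keep the sketch short):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, m, eta=0.5, epochs=2000, seed=0):
    """Train a network to reproduce its own input through a hidden layer of size m < n.

    Returns (W_enc, W_dec); the compact code for an input x is sigmoid(W_enc @ x).
    """
    n = X.shape[1]
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(scale=0.5, size=(m, n))
    W_dec = rng.normal(scale=0.5, size=(n, m))
    for _ in range(epochs):
        for x in X:
            h = sigmoid(W_enc @ x)                      # compact code, length m
            o = sigmoid(W_dec @ h)                      # reconstruction, length n
            delta_out = o * (1 - o) * (x - o)           # the target is the input itself
            delta_hid = h * (1 - h) * (W_dec.T @ delta_out)
            W_dec += eta * np.outer(delta_out, h)
            W_enc += eta * np.outer(delta_hid, x)
    return W_enc, W_dec

# Classic 8-3-8 style example: 8 one-hot inputs squeezed through 3 hidden units.
W_enc, W_dec = train_autoencoder(np.eye(8), m=3)
print(np.round(sigmoid(W_enc @ np.eye(8)[0]), 2))       # 3-dimensional code for the first input
```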

Dimensionality Reduction. Model: an auto-encoder network. Function to learn: reproduce the input at the output through a smaller hidden layer.

Dimensionality Reduction: Example (a series of figures).

Advanced Topics in Neural Nets. Batch mode vs. incremental; Auto-encoders; Deep Belief Nets (briefly); Neural Networks on Silicon; Neural Network language models.

Restricted Boltzmann Machine

How similar are auto-encoders and RBMs? The auto-encoder (AE) goal is to reconstruct the input in two steps, input → hidden → output. An RBM instead defines a probability distribution P(x); the goal is to assign high likelihood to the observed training examples. Determining the likelihood of a given x actually requires summing over all possible settings of the hidden nodes, rather than just computing a single activation as in an AE. Take EECS 395/495 Probabilistic Graphical Models to learn more.

Deep Belief Nets

Advanced Topics in Neural Nets. Batch mode vs. incremental; Auto-Encoders; Deep Belief Nets (briefly); Neural Networks on Silicon; Neural Network language models.

Neural Networks on Silicon. Currently we start from continuous device physics (voltage), build a digital computational model (thresholding) on top of it, and then use that to simulate continuous device physics again (neural networks). Why not skip the digital step?

Example: Silicon Retina Simulates function of biological retina Single-transistor synapses adapt to luminance, temporal contrast Modeling retina directly on chip => requires 100x less power!

Example: Silicon Retina Synapses modeled with single transistors

Luminance Adaptation

Comparison with Mammal Data. Figures: responses of the real (mammal) retina vs. the artificial silicon retina.

Graphics and results taken from:

General NN learning in silicon? People seem more excited about / satisfied with GPUs

Advanced Topics in Neural Nets. Batch mode vs. incremental; Hidden Layer Representations; Hopfield Nets; Neural Networks on Silicon; Neural Network language models.

Neural Network Language Models. Statistical language modeling: predict the probability of the next word in a sequence. "I was headed to Madrid, ___": P(___ = "Spain") = 0.5, P(___ = "but") = 0.2, etc. Used in speech recognition, machine translation, and (more recently) information extraction.

Formally, estimate: P(w_j | w_{j−1}, w_{j−2}, ..., w_{j−n+1}) = P(w_j | h_j), where h_j is the history of the previous n − 1 words.

Optimizations. Key idea: learn simultaneously (1) vector representations of each word (here 120-dimensional) and (2) a predictor of the next word based on the previous words' vectors. Short-lists: much of the complexity is in the hidden → output layer, because the number of possible next words is large; so only predict a subset of words and use a standard probabilistic model for the rest. Words can also be binned into fixed classes.
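A schematic sketch of this architecture in Python: jointly-learned word vectors feed a hidden layer that scores only a short-list of candidate next words. All sizes, names, and the random parameters are illustrative assumptions, not values from the paper (only the 120-dimensional word vectors come from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, context, hidden, shortlist = 10_000, 120, 3, 200, 2_000

# Parameters that would be learned jointly: word vectors and the next-word predictor.
E = rng.normal(scale=0.1, size=(vocab_size, dim))           # one 120-dim vector per word
W_h = rng.normal(scale=0.1, size=(hidden, context * dim))   # context vectors -> hidden layer
W_o = rng.normal(scale=0.1, size=(shortlist, hidden))       # hidden -> short-list scores

def next_word_probs(history_ids):
    """P(next word | last `context` words), restricted to the short-list.

    Words outside the short-list would be handled by a separate standard probabilistic model.
    """
    h_in = np.concatenate([E[i] for i in history_ids])      # look up and concatenate word vectors
    h = np.tanh(W_h @ h_in)                                 # hidden layer
    scores = W_o @ h
    exp = np.exp(scores - scores.max())                     # softmax over the short-list only
    return exp / exp.sum()

probs = next_word_probs([17, 42, 7])   # arbitrary word ids standing in for "I was headed ..."
print(probs.shape, probs.sum())        # (2000,) 1.0
```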

Design Decisions (1) Number of hidden units

Design Decisions (2). Word representation (number of dimensions): they chose 120; more recent work uses up to 1000s.

Comparison vs. state of the art Circa 2005 Schwenk, Holger, and Jean-Luc Gauvain. "Training neural network language models on very large corpora." Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.

Latest Results. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).