Machine Learning: Neural Networks
CSE6740/CS7641/ISYE6740, Fall 2012
Le Song
Lecture 7, September 11, 2012
Based on slides from Eric Xing, CMU
Reading: Chap. 5, C. Bishop

Learning highly non-linear functions f: X → Y
- f might be a non-linear function
- X: (vector of) continuous and/or discrete variables
- Y: (vector of) continuous and/or discrete variables
Examples: the XOR gate, speech recognition

Perceptron and Neural Nets
From biological neuron to artificial neuron (perceptron): the dendrites, synapses, and soma correspond to the inputs x_1, ..., x_n, the weights w_1, ..., w_n, and a linear combiner; the axon carries the output Y of a hard limiter with threshold θ:

X = Σ_i x_i w_i,    Y = +1 if X ≥ θ,  Y = −1 if X < θ

From artificial neurons to artificial neural networks: supervised learning by gradient descent, with input signals flowing through an input layer, a middle (hidden) layer, and an output layer to produce output signals.
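As a minimal sketch of this threshold unit in code (Python; the inputs, weights, and threshold below are illustrative, not taken from the slide):

```python
import numpy as np

def perceptron_output(x, w, theta):
    """Hard limiter: +1 if the weighted sum reaches the threshold, -1 otherwise."""
    X = np.dot(w, x)                   # linear combiner: X = sum_i x_i * w_i
    return 1 if X >= theta else -1     # hard-limiter output Y

# Illustrative values: two inputs, made-up weights and threshold.
x = np.array([1.0, 0.0])
w = np.array([0.7, 0.7])
print(perceptron_output(x, w, theta=0.5))   # -> 1
```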

Connectionist Models
Consider humans:
- Neuron switching time: ~0.001 second
- Number of neurons: ~10^10
- Connections per neuron: ~10^4 to 10^5
- Scene recognition time: ~0.1 second
- 100 inference steps doesn't seem like enough → much parallel computation

Properties of artificial neural nets (ANNs):
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed processing

Abdominal Pain Perceptron
[Figure: a perceptron for diagnosing abdominal pain. Inputs (Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1) connect through adjustable weights to output units for Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, and Pancreatitis; for this input, exactly one output unit fires (outputs 0 0 0 1 0 0 0).]

Myocardial Infarction Network
[Figure: a network for predicting myocardial infarction. Inputs (Pain Duration = 2, Pain Intensity = 4, ECG: ST Elevation = 1, Smoker = 1, Age = 50, Male = 1) feed a Myocardial Infarction output unit, which outputs 0.8, the probability of MI.]

The "Driver" Network

Perceptrons
[Figure: input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis). The Δ rule changes the weights to decrease the error, where error = what we got − what we wanted.]

Perceptrons
- Input to unit i: a_i = measured value of variable i
- Input to unit j: a_j = Σ_i w_ij a_i
- Output of unit j: o_j = 1 / (1 + e^−(a_j + θ_j))
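A small sketch of this sigmoid unit (Python; the measured inputs, weights, and bias θ_j below are illustrative, not from the slides):

```python
import numpy as np

def sigmoid_unit_output(a, w, theta):
    """o_j = 1 / (1 + exp(-(a_j + theta_j))), where a_j = sum_i w_ij * a_i."""
    a_j = np.dot(w, a)                           # input to unit j
    return 1.0 / (1.0 + np.exp(-(a_j + theta)))

# Illustrative values: three measured inputs, made-up weights and bias.
a = np.array([34.0, 1.0, 4.0])     # e.g. Age, Gender, Stage
w = np.array([0.01, 0.2, 0.1])
print(sigmoid_unit_output(a, w, theta=-0.5))   # ~0.61
```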

Jargon Pseudo-Correspondence
- Independent variable = input variable
- Dependent variable = output variable
- Coefficients = weights
- Estimates = targets

Logistic Regression Model (the sigmoid unit)
[Figure: inputs (Age = 34, Gender = 1, Stage = 4) feed through weights into a sigmoid unit (Σ) whose output is 0.6, the probability of being alive. Independent variables x1, x2, x3; coefficients a, b, c; dependent variable p; prediction.]

The perceptron learning algorithm
Recall the nice property of the sigmoid function. Consider the regression problem f: X → Y, for scalar Y: let's maximize the conditional data likelihood.
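For reference, a standard statement of the sigmoid's "nice property" and of the conditional data likelihood being maximized (my notation, with binary targets t_d and unit output o_d = σ(wᵀx_d); not copied from the slide):

```latex
\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad
\frac{d\sigma(a)}{da} = \sigma(a)\bigl(1 - \sigma(a)\bigr)

\ell(\mathbf{w}) \;=\; \ln \prod_{d} P(t_d \mid x_d, \mathbf{w})
\;=\; \sum_{d} \Bigl[\, t_d \ln o_d + (1 - t_d)\ln(1 - o_d) \,\Bigr],
\qquad o_d = \sigma(\mathbf{w}^{\top} x_d)
```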

Gradient Descent
Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i
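With this notation, the squared-error objective and the gradient-descent update for a linear unit o_d = w · x_d take their standard forms (a reference sketch, not copied from the slide):

```latex
E[\mathbf{w}] = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2, \qquad
\Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}
           = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}
```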

The perceptron learning rules
Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i

Incremental mode: do until converged,
  for each training example d in D:
  1. Compute the gradient ∇E_d[w]
  2. Update w by stepping down ∇E_d[w]

Batch mode: do until converged,
  1. Compute the gradient ∇E_D[w]
  2. Update w by stepping down ∇E_D[w]
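A compact sketch of the two modes for a linear unit with squared error (Python; the data, learning rate, and epoch count are illustrative):

```python
import numpy as np

def batch_gd(X, t, w, eta=0.1, epochs=200):
    """Batch mode: compute the gradient of E_D[w] over all of D, then update once."""
    for _ in range(epochs):
        o = X @ w                      # outputs for every training example
        grad = -(t - o) @ X            # gradient of E_D[w] = 1/2 * sum_d (t_d - o_d)^2
        w = w - eta * grad
    return w

def incremental_gd(X, t, w, eta=0.1, epochs=200):
    """Incremental (stochastic) mode: update after each training example d."""
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = x_d @ w
            w = w + eta * (t_d - o_d) * x_d   # step down the gradient of E_d[w]
    return w

# Illustrative data: targets generated by t = 2*x1 - x2, so both modes recover w ~ (2, -1).
X = np.array([[1., 0.], [0., 1.], [1., 1.], [2., 1.]])
t = np.array([2., -1., 1., 3.])
print(batch_gd(X, t, np.zeros(2)))
print(incremental_gd(X, t, np.zeros(2)))
```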

MLE vs. MAP
- Maximum conditional likelihood estimate
- Maximum a posteriori estimate
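In symbols (standard definitions; the zero-mean Gaussian prior matches the regularization view on the summary slide at the end):

```latex
\hat{\mathbf{w}}_{\mathrm{MLE}} \;=\; \arg\max_{\mathbf{w}} \sum_{d} \ln P(t_d \mid x_d, \mathbf{w})

\hat{\mathbf{w}}_{\mathrm{MAP}} \;=\; \arg\max_{\mathbf{w}} \Bigl[\, \sum_{d} \ln P(t_d \mid x_d, \mathbf{w}) + \ln P(\mathbf{w}) \,\Bigr]
\;=\; \arg\max_{\mathbf{w}} \Bigl[\, \sum_{d} \ln P(t_d \mid x_d, \mathbf{w}) - \lambda \lVert \mathbf{w} \rVert^{2} \,\Bigr]
\quad \text{for } P(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma^{2} I)
```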

What decision surface does a perceptron define?

The NAND truth table:
x  y  z
0  0  1
0  1  1
1  0  1
1  1  0

A single threshold unit computes y = f(x1·w1 + x2·w2), where f is a hard-threshold (step) function with threshold θ = 0.5:
f(0·w1 + 0·w2) = 1
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

Some possible values for (w1, w2): (0.20, 0.35), (0.40, 0.20), (0.25, 0.40), (0.30, 0.20).

What decision surface does a perceptron define?

The XOR truth table:
x  y  z
0  0  0
0  1  1
1  0  1
1  1  0

The same threshold unit would have to satisfy:
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

Some possible values for (w1, w2): none. XOR is not linearly separable, so no single perceptron can represent it.

What decision surface does a perceptron define?

The XOR truth table above can, however, be represented by a two-layer network: two hidden threshold units feed one output unit, with θ = 0.5 for all units and f(a) = 1 for a > θ, 0 otherwise. A possible set of values for (w1, w2, w3, w4, w5, w6): (0.6, −0.6, −0.7, 0.8, 1, 1).
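The weight set above can be checked directly; the wiring below (w1 and w3 into the first hidden unit, w2 and w4 into the second, w5 and w6 into the output) is my assumption about the figure:

```python
def f(a, theta=0.5):
    """Hard-threshold activation: 1 if a > theta, else 0."""
    return 1 if a > theta else 0

# (w1, ..., w6) as given on the slide; the wiring is assumed.
w1, w2, w3, w4, w5, w6 = 0.6, -0.6, -0.7, 0.8, 1, 1

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = f(x * w1 + y * w3)        # first hidden unit
    h2 = f(x * w2 + y * w4)        # second hidden unit
    out = f(h1 * w5 + h2 * w6)     # output unit
    print(x, y, '->', out)         # prints the XOR truth table: 0, 1, 1, 0
```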

Non-Linear Separation
[Figure: with two binary symptoms (cough, headache), the four input corners are No disease (00: no cough, no headache), Pneumonia (10: cough, no headache), Meningitis (01: no cough, headache), and Flu (11: cough, headache); a decision boundary separates Treatment from No treatment. A second panel shows the analogous three-input cube with corners 000 through 111.]

Neural Network Model
[Figure: inputs (Age = 34, Gender = 2, Stage = 4) connect through a first layer of weights (.1, .3, .6, .2, .7, .2) to two hidden sigmoid units (Σ), which connect through a second layer of weights (.4, .2, .5, .8) to a sigmoid output unit. Output: 0.6, the probability of being alive. Independent variables → weights → hidden layer → weights → dependent variable (prediction).]
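A forward-pass sketch of such a 3-input, 2-hidden-unit, 1-output sigmoid network (the slide's exact edge-by-edge weight assignment is not recoverable, so the weight matrices below are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Inputs -> weights -> hidden sigmoid layer -> weights -> sigmoid output."""
    h = sigmoid(W1 @ x)      # hidden layer activations (2 units)
    return sigmoid(W2 @ h)   # output: predicted probability

x  = np.array([34.0, 2.0, 4.0])            # Age, Gender, Stage (from the slide)
W1 = np.array([[0.01, 0.3, 0.6],           # illustrative first-layer weights
               [0.02, 0.7, 0.2]])
W2 = np.array([0.4, 0.2])                  # illustrative second-layer weights
print(forward(x, W1, W2))                  # a probability in (0, 1); ~0.64 here
```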

Combined Logistic Models
[Figures: the same network redrawn as a combination of logistic models. Each hidden unit is itself a logistic regression on the inputs (Age, Gender, Stage), and the output unit is a logistic regression on the hidden units; successive slides highlight individual weight paths (e.g. .1, .7, .5, .8 and .3, .2, .2, .5, .8), each view producing the same output, 0.6, the probability of being alive.]

So could we simply fit each logistic piece on its own? Not really: there is no target for the hidden units, so their values are not available for training them directly.


Hidden Units and Backpropagation

Backpropagation Algorithm
Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i

Initialize all weights to small random numbers. Until convergence, do, for each training example:
1. Input the training example to the network and compute the network outputs.
2. For each output unit k: δ_k ← o_k (1 − o_k)(t_k − o_k).
3. For each hidden unit h: δ_h ← o_h (1 − o_h) Σ_{k ∈ outputs} w_{h,k} δ_k.
4. Update each network weight: w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}.
(A code sketch of these updates follows.)
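A minimal sketch of these update rules for a one-hidden-layer sigmoid network trained with stochastic gradient descent (Python; the architecture, data, learning rate, and seed are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, w2):
    h = np.append(sigmoid(W1 @ x), 1.0)   # hidden outputs plus a constant 1 (output-unit bias)
    return h, sigmoid(w2 @ h)

def backprop(X, T, n_hidden=3, eta=0.5, epochs=20000, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize all weights to small random numbers.
    W1 = rng.normal(0.0, 0.1, (n_hidden, X.shape[1]))
    w2 = rng.normal(0.0, 0.1, n_hidden + 1)
    for _ in range(epochs):
        for x, t in zip(X, T):
            # 1. Input the training example and compute the network outputs.
            h, o = forward(x, W1, w2)
            # 2. For the output unit: delta_o = o(1 - o)(t - o).
            delta_o = o * (1 - o) * (t - o)
            # 3. For each hidden unit: delta_h = o_h(1 - o_h) * w_{h,out} * delta_o.
            delta_h = h[:-1] * (1 - h[:-1]) * w2[:-1] * delta_o
            # 4. Update each network weight: w_ij <- w_ij + eta * delta_j * x_ij.
            w2 += eta * delta_o * h
            W1 += eta * np.outer(delta_h, x)
    return W1, w2

# Illustrative run: learn XOR; each input carries a constant 1 as a bias feature.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])
W1, w2 = backprop(X, T)
print([round(forward(x, W1, w2)[1], 2) for x in X])   # typically close to [0, 1, 1, 0]
```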

More on Backpropagation
- Performs gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum; in practice it often works well (can run multiple times)
- Often includes a weight momentum term α
- Minimizes error over the training examples; will it generalize well to subsequent test examples?
- Training can take thousands of iterations and is very slow; using the network after training is very fast

Learning Hidden Layer Representations
[Figure: a network and a target function. Can this be learned?]

Learning Hidden Layer Representations
[Figure: the same network and the learned hidden layer representation.]

Training
[Figures only: training plots.]

Expressive Capabilities of ANNs
Boolean functions:
- Every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs.

Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Application: ANN for Face Recognition
[Figures: the model and the learned hidden unit weights.]

Regression vs. Neural Networks
A regression model must enumerate its interaction terms explicitly:
Y = a(X1) + b(X2) + c(X3) + d(X1·X2) + ...
over the (2^3 − 1) possible combinations X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3. A neural network instead captures such non-linear interactions through its hidden units.
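The count of candidate terms grows as 2^n − 1 in the number of inputs; a quick enumeration check (Python, hypothetical variable names):

```python
from itertools import combinations

variables = ["X1", "X2", "X3"]
terms = [c for r in range(1, len(variables) + 1)
           for c in combinations(variables, r)]
print(len(terms), terms)   # 7 terms, i.e. 2**3 - 1 possible combinations
```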

Minimizing the Error
[Figure: an error surface over a weight w. Starting from w_initial with some initial error, the negative derivative gives a positive change in w; descending the surface leads to w_trained at a local minimum with the final error.]

Overfitting in Neural Nets
[Figures: left, an overfitted model vs. the real model of CHD as a function of age; right, error vs. training cycles for the training set and a holdout set, with training error falling while holdout error rises once the model overfits.]
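The holdout curve motivates early stopping, mentioned again in the summary below: keep training while holdout error improves and return the best weights seen. A sketch with hypothetical train_one_cycle and holdout_error helpers:

```python
import copy

def early_stopping(model, train_one_cycle, holdout_error, max_cycles=1000, patience=20):
    """Stop when holdout error has not improved for `patience` cycles; return best weights."""
    best_err, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for cycle in range(max_cycles):
        train_one_cycle(model)         # one pass of gradient descent on the training set
        err = holdout_error(model)     # error on the held-out set
        if err < best_err:
            best_err, best_model, since_best = err, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:  # holdout error stopped improving: likely overfitting
                break
    return best_model
```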

Alternative Error Functions
- Penalize large weights (weight decay; see the form below)
- Train on target slopes as well as values
- Tie weights together, e.g., in phoneme recognition
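The first bullet corresponds to the standard weight-decay error (stated here for reference, in my notation):

```latex
E(\mathbf{w}) \;=\; \frac{1}{2}\sum_{d \in D}\;\sum_{k \in \mathrm{outputs}} \bigl(t_{kd} - o_{kd}\bigr)^{2}
\;+\; \gamma \sum_{i,j} w_{ij}^{2}
```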

Artificial neural networks: what you should know
- Highly expressive non-linear functions
- Highly parallel networks of logistic-function units
- Minimizing the sum of squared training errors gives the MLE of the network weights, if we assume zero-mean Gaussian noise on the output values
- Minimizing the sum of squared errors plus the sum of squared weights (regularization) gives the MAP estimate, assuming a zero-mean Gaussian prior on the weights
- Gradient descent as the training procedure; how to derive your own gradient descent procedure
- Hidden units discover useful representations
- Local minima are the greatest problem
- Overfitting, regularization, early stopping