Neural Networks (DWML, 2007)

DWML, 2007 1/25

Neural networks: Biological and artificial

Consider humans:
- Neuron switching time: 0.001 second
- Number of neurons: $10^{10}$
- Connections per neuron: $10^{4}$ to $10^{5}$
- Scene recognition time: 0.1 second
- 100 inference steps doesn't seem like enough, which suggests much parallel computation

Properties of artificial neural nets (ANNs):
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed process
- Emphasis on tuning weights automatically

Neural Network Structure

[Figure: layered network with an input layer, a hidden layer, and an output layer]

- Layered circuit of neurons.
- Neighboring layers are completely connected; there are no other connections (feedforward network).
- An arbitrary number of hidden layers is allowed, but usually 0 or 1.

Model of biological neurons: A Single Neuron

[Figure: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ feed a summation unit $\Sigma$, followed by an activation function $af$ that produces the output $o$]

Perceptron: The inputs are combined linearly: $w_1 x_1 + \dots + w_n x_n = \mathbf{w} \cdot \mathbf{x}$ (vector notation). The output is non-linear.

We have different activation functions $af$:
- Sigmoid: $af(x) = \sigma(x) = 1/(1 + e^{-x})$
- Sign: $af(x) = \mathrm{sign}(x)$

[Figure: plots of the sigmoid function and the sign function for $x$ between $-5$ and $5$]
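A minimal Python sketch of the single neuron described above. The function names `sigmoid`, `sign` and `neuron_output` are illustrative and do not come from the slides, and returning 1 for `sign(0)` is an assumption of this sketch.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sign(x):
    """Sign activation; returning 1 at x == 0 is an assumption of this sketch."""
    return 1.0 if x >= 0 else -1.0

def neuron_output(weights, inputs, af):
    """Combine the inputs linearly (w . x), then apply the activation function af."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return af(net)
```

A bias weight can be handled by appending a constant input of 1, for example `neuron_output([0.1, 0.1, 0.1], [1, 0, 1], sigmoid)`.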

Neural Network Semantics

Given
- the network structure,
- the weights associated with links/nodes,
- the activation function (usually the same for all hidden/output nodes),

a neural network with $n$ input and $k$ output nodes defines $k$ real-valued functions on continuous input attributes:

$o_i(a_1, \dots, a_n) \in \mathbb{R} \quad (i = 1, \dots, k)$.

Propagation in Neural Networks

[Figure: input nodes $I_1$ and $I_2$ feed the hidden node $H$ through the weights $w_{1H}$ and $w_{2H}$; $H$ feeds the output node $O$ through the weight $w_{HO}$; $H$ and $O$ have the bias weights $w_H$ and $w_O$]

The input nodes are set to 1 and 0, respectively. All five weights are 0.1.

The output of neuron H is: $o_H = \sigma(1 \cdot 0.1 + 0 \cdot 0.1 + 0.1) = 0.5498$.

The output of neuron O is: $o_O = \sigma(0.5498 \cdot 0.1 + 0.1) = 0.53867$.
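The propagation above can be sketched in Python as a generic layer-by-layer forward pass. The helper `layer_forward` and its argument layout are assumptions made for illustration; the inputs (1, 0) and the weights of 0.1 are the values used in the example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weight_rows, biases):
    """Output of one fully connected layer.

    weight_rows[j] holds the weights into node j; biases[j] is its bias weight.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]

# Network from the example: inputs (1, 0), every weight 0.1.
hidden = layer_forward([1, 0], [[0.1, 0.1]], [0.1])   # node H
output = layer_forward(hidden, [[0.1]], [0.1])        # node O
print(round(hidden[0], 4), round(output[0], 5))       # 0.5498 0.53867
```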

Neural Networks for Regression

[Figure: network with the inputs Calories, Protein, Sugars, Vitamins and the output Rating]

Inputs are continuous! Discrete attributes can be represented by 0/1-valued indicator nodes: e.g. for States(A) = {red, blue, green}, introduce 3 input nodes is_red, is_blue, is_green, and represent an instance with A = blue by the input is_red = 0, is_blue = 1, is_green = 0.
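The indicator-node encoding can be sketched as follows. The function name `indicator_encode` is hypothetical; the states and the example instance are the ones from the slide.

```python
def indicator_encode(value, states):
    """Return a 0/1 indicator vector with a 1 at the position of `value`."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if s == value else 0 for s in states]

# is_red, is_blue, is_green for an instance with A = blue
print(indicator_encode("blue", ["red", "blue", "green"]))  # [0, 1, 0]
```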

Neural Networks for Classification

Use one output node for each class label. Classify an instance by the class label associated with the output node with the highest output value.
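A short sketch of this decision rule. The function name, the class labels and the output values below are made up for illustration; ties are resolved in favor of the first node, which is an assumption.

```python
def classify(outputs, class_labels):
    """Return the label whose output node has the highest value."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return class_labels[best]

# Hypothetical output values for three class nodes.
print(classify([0.12, 0.81, 0.40], ["A", "B", "C"]))  # B
```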

The Task of Learning

Given: structure and activation functions.
To be learned: weights.
Goal: given the training examples

Input                              Output
A_1      A_2      ...  A_n         Y_1      Y_2      ...  Y_m
a_{1,1}  a_{2,1}  ...  a_{n,1}     y_{1,1}  y_{2,1}  ...  y_{m,1}
a_{1,2}  a_{2,2}  ...  a_{n,2}     y_{1,2}  y_{2,2}  ...  y_{m,2}
...      ...      ...  ...         ...      ...      ...  ...
a_{1,N}  a_{2,N}  ...  a_{n,N}     y_{1,N}  y_{2,N}  ...  y_{m,N}

find the weights that minimize the sum of squared errors (SSE)

$\sum_{i=1}^{N} \sum_{j=1}^{m} \left( y_{j,i} - o_j(\mathbf{a}_i) \right)^2$
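The sum of squared errors can be computed as in the sketch below. It assumes the targets and the network outputs are given as lists of rows, one row per training example and one entry per output node; this layout and the name `sse` are choices made here, not something the slides prescribe.

```python
def sse(targets, outputs):
    """Sum over examples i and output nodes j of (y_ji - o_j(a_i))^2."""
    total = 0.0
    for y_row, o_row in zip(targets, outputs):
        for y, o in zip(y_row, o_row):
            total += (y - o) ** 2
    return total

# One example, one output node: target 1, network output 0.53867.
print(round(sse([[1.0]], [[0.53867]]), 5))  # 0.21283
```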

Learning

Basic principle: SSE is a differentiable function of the weights (for differentiable activation functions such as the sigmoid function!). Use gradient descent to optimize SSE:

[Figure: surface plot of $SSE(w_1, w_2)$ over $w_1$ and $w_2$]

$\nabla SSE(\mathbf{w}) = \left( \frac{\partial SSE}{\partial w_0}, \dots, \frac{\partial SSE}{\partial w_n} \right)$

specifies the direction of steepest increase in SSE. Hence, our new training rule becomes:

$w_i := w_i + \Delta w_i$, where $\Delta w_i = -\eta \frac{\partial SSE}{\partial w_i}$.

In practice: use the back propagation algorithm (approximation of gradient descent).
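The training rule can be illustrated with a small gradient descent loop. This is a sketch on a made-up quadratic error surface, not on a neural network; the function `grad_f`, the starting point, the learning rate and the number of steps are all assumptions chosen for the illustration.

```python
def gradient_descent(grad, w, eta=0.1, steps=100):
    """Repeatedly apply w_i := w_i - eta * dE/dw_i."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def grad_f(w):
    """Gradient of the toy error E(w) = (w0 - 1)^2 + (w1 + 2)^2."""
    return [2 * (w[0] - 1.0), 2 * (w[1] + 2.0)]

w_min = gradient_descent(grad_f, [0.0, 0.0])
print([round(wi, 4) for wi in w_min])  # close to the minimum at (1, -2)
```

With a learning rate that is too large (for this toy function, anything above 1.0) the same loop diverges instead of converging.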

The Principle of Back Propagation

Training examples provide target values only for the network outputs, so no target values are directly available for indicating the error of the hidden units' values.

[Figure: inputs $I_1$, $I_2$ are fully connected to the hidden units $H_1$, $H_2$, which feed the output $O$; the error term $\delta_O$ at the output is propagated back to give $\delta_{H_1}(\delta_O)$ and $\delta_{H_2}(\delta_O)$; $SSE = \frac{1}{2}(y - o)^2$]

Idea: Calculate an error term $\delta_h$ for a hidden unit by taking the weighted sum of the error terms, $\delta_k$, for each output unit it influences.

Updating Rules

When using a sigmoid activation function we can derive the following updating rule:

$w_{ij}^{new} := w_{ij}^{current} + \eta \, \delta_j \, x_{ij}$,

where $\eta$ is the learning rate, $\delta_j$ is the error term, $x_{ij}$ is the input, and

$\delta_j = \begin{cases} o_j (1 - o_j)(y - o_j) & \text{for output nodes,} \\ o_j (1 - o_j) \sum_{k=1}^{m} w_{jk} \delta_k & \text{for hidden nodes.} \end{cases}$
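The output-node case can be checked with a short chain-rule derivation. This is a sketch under the assumption that the per-example error is $E = \frac{1}{2}(y - o_j)^2$ and that $o_j = \sigma(net_j)$ with $net_j = \sum_i w_{ij} x_{ij}$; the symbols $E$ and $net_j$ are introduced here for illustration. It uses the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.

```latex
\begin{align*}
\frac{\partial E}{\partial w_{ij}}
  &= \frac{\partial E}{\partial o_j}\,
     \frac{\partial o_j}{\partial net_j}\,
     \frac{\partial net_j}{\partial w_{ij}} \\
  &= -(y - o_j)\; o_j (1 - o_j)\; x_{ij}
   = -\delta_j \, x_{ij}, \\
\Delta w_{ij}
  &= -\eta \, \frac{\partial E}{\partial w_{ij}}
   = \eta \, \delta_j \, x_{ij}.
\end{align*}
```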

Back Propagation Example

[Figure: the network from the propagation example, annotated with the old and new weights, the error terms $\delta_O$ and $\delta_H$, and the SSE]

Assume that we have the training example ($I_1 = 1$, $I_2 = 0$, $O = 1$). All weights are initially 0.1 and the learning rate is $\eta = 0.3$.

The output of neuron H is: $o_H = \sigma(1 \cdot 0.1 + 0 \cdot 0.1 + 0.1) = 0.5498$.

The output of neuron O is: $o_O = \sigma(0.5498 \cdot 0.1 + 0.1) = 0.53867$.

The SSE value is: $SSE = (1 - 0.53867)^2 = 0.21283$.

The error term for node O is: $\delta_O = 0.53867 \cdot (1 - 0.53867) \cdot (1 - 0.53867) = 0.1146$.
Recall: $\delta_O = o_O (1 - o_O)(y - o_O)$, with the target $y = 1$.

The updated weights into O are:
$w_O = 0.1 + [0.3 \cdot 0.1146 \cdot 1] \approx 0.1344$,
$w_{HO} = 0.1 + [0.3 \cdot 0.1146 \cdot 0.5498] \approx 0.1189$.
Recall: $w_{ij}^{new} := w_{ij}^{current} + [\eta \, \delta_j \, x_{ij}]$.

The error term for node H is: $\delta_H = 0.5498 \cdot (1 - 0.5498) \cdot 0.1 \cdot 0.1146 = 0.002836$.
Recall: $\delta_H = o_H (1 - o_H) \sum_{k=1}^{m} w_{Hk} \delta_k$. The weight used here is $w_{HO} = 0.1$, the value before the update.

The updated weights into H are:
$w_{1H} = 0.1 + [0.3 \cdot 0.00283 \cdot 1] = 0.100849$,
$w_{2H} = 0.1 + [0.3 \cdot 0.00283 \cdot 0] = 0.1$,
$w_H = 0.1 + [0.3 \cdot 0.00283 \cdot 1] = 0.100849$.
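The whole example can be replayed with a few lines of Python. This is a sketch for this particular two-input, one-hidden-node network only; the dictionary keys (`w1H`, `w2H`, `wH`, `wHO`, `wO`) and the function name `backprop_step` are naming choices made here. Because the code does not round intermediate results, the last digits can differ slightly from the hand calculation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(i1, i2, y, w, eta):
    """One back propagation update for the example network."""
    # Forward pass.
    o_h = sigmoid(i1 * w["w1H"] + i2 * w["w2H"] + w["wH"])
    o_o = sigmoid(o_h * w["wHO"] + w["wO"])
    # Error terms; delta_h uses w_HO from before the update.
    delta_o = o_o * (1 - o_o) * (y - o_o)
    delta_h = o_h * (1 - o_h) * w["wHO"] * delta_o
    # Weight updates: w_new = w_current + eta * delta_j * x_ij (bias input is 1).
    new_w = {
        "wO": w["wO"] + eta * delta_o * 1,
        "wHO": w["wHO"] + eta * delta_o * o_h,
        "wH": w["wH"] + eta * delta_h * 1,
        "w1H": w["w1H"] + eta * delta_h * i1,
        "w2H": w["w2H"] + eta * delta_h * i2,
    }
    return new_w, delta_o, delta_h

start = {"w1H": 0.1, "w2H": 0.1, "wH": 0.1, "wHO": 0.1, "wO": 0.1}
new_w, delta_o, delta_h = backprop_step(1, 0, 1, start, eta=0.3)
print(round(delta_o, 4), round(new_w["wO"], 4), round(new_w["wHO"], 4))
```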

Learning Rate and Momentum

What should the learning rate be?
- If it is too small, the convergence time will be unacceptable.
- If it is too large, the algorithm may overshoot the optimal solution or start to oscillate.

A possible solution might be to let the learning rate decrease over time, or to introduce a momentum term into the weight adjustments:

$w_{ij}^{new} := w_{ij}^{current} + \eta \, \delta_j \, x_{ij} + \alpha \, \Delta w_{ij}^{previous}$.
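A sketch of the momentum update for a single weight. The function name, the value $\alpha = 0.9$, and holding $\delta_j$ and $x_{ij}$ fixed over two steps are assumptions made to show how the previous change is carried forward.

```python
def momentum_step(w, eta, delta_j, x_ij, prev_change, alpha):
    """Return the new weight and the change that was applied.

    change = eta * delta_j * x_ij + alpha * (previous change)
    """
    change = eta * delta_j * x_ij + alpha * prev_change
    return w + change, change

# Two consecutive updates with the same error term (hypothetical values).
w1, change1 = momentum_step(0.1, 0.3, 0.1146, 1.0, prev_change=0.0, alpha=0.9)
w2, change2 = momentum_step(w1, 0.3, 0.1146, 1.0, prev_change=change1, alpha=0.9)
print(change1 < change2)  # True: the second step is larger because of momentum
```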

Pros and Cons

+ Often very good results in continuous domains, e.g. pattern recognition
+ Can represent complex, non-linear decision boundaries
+ Fast for classification
- No explanatory power
- Slow for learning

K Nearest Neighbor

[Figure: labeled training data in instance space (class labels: red, green, blue) and a new instance x]

A new instance x should be classified.
- The nearest neighbor is green, hence with K = 1, x is classified as green.
- Two of x's three nearest neighbors are red, hence with K = 3, x is classified as red.

K Nearest Neighbor: Distance Measures

Distance Measures in Instance Space

Some classification and almost all clustering methods require a distance measure $d(\mathbf{a}_1, \mathbf{a}_2)$ between any pair $\mathbf{a}_1 = (a_{1,1}, \dots, a_{1,k})$, $\mathbf{a}_2 = (a_{2,1}, \dots, a_{2,k})$ of instances.

Common distance measures are:

(I) for instances with continuous attributes $A_1, \dots, A_k$:
- $d_2(\mathbf{a}_1, \mathbf{a}_2) = \sqrt{\sum_{j=1}^{k} (a_{1,j} - a_{2,j})^2}$ (Euclidean or $L_2$ distance)
- $d_1(\mathbf{a}_1, \mathbf{a}_2) = \sum_{j=1}^{k} |a_{1,j} - a_{2,j}|$ (Manhattan or $L_1$ distance)
- $d_\infty(\mathbf{a}_1, \mathbf{a}_2) = \max\{ |a_{1,j} - a_{2,j}| : j = 1, \dots, k \}$ ($L_\infty$ distance)

(II) for instances with binary attributes $A_1, \dots, A_k$:
- $d(\mathbf{a}_1, \mathbf{a}_2) = |\{ j : a_{1,j} \neq a_{2,j} \}|$ (Hamming or edit distance)
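The four distance measures translate directly into Python. The function names are chosen here for illustration, and instances are assumed to be equal-length sequences of numbers (or of 0/1 values for the Hamming distance).

```python
import math

def euclidean(a1, a2):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def manhattan(a1, a2):
    """L1 distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a1, a2))

def l_infinity(a1, a2):
    """L-infinity distance: largest absolute difference."""
    return max(abs(x - y) for x, y in zip(a1, a2))

def hamming(a1, a2):
    """Number of attributes on which the two instances differ."""
    return sum(1 for x, y in zip(a1, a2) if x != y)

print(euclidean([0, 0], [3, 4]), manhattan([0, 0], [3, 4]), l_infinity([0, 0], [3, 4]))
# 5.0 7 4
```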

K Nearest Neighbor: Distance Measures

(III) for instances with discrete attributes $A_1, \dots, A_k$:

$d(\mathbf{a}_1, \mathbf{a}_2) = \sum_{j=1}^{k} d_j(a_{1,j}, a_{2,j})$

where $d_j$ is a separately defined distance function for attribute $A_j$, e.g.

States(A_j)  low  medium  high
low          0    1       2
medium       1    0       1
high         2    1       0

States(A_j)  red  blue  green
red          0    1     1
blue         1    0     1
green        1    1     0

If all attributes have 0-1 distance (as in the second matrix), then this is the same as edit distance.
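A sketch of the per-attribute distance. The low/medium/high matrix is reproduced here by the difference in rank, which gives the same entries; the function names and the two-attribute example instance are assumptions made for illustration.

```python
def ordinal_distance(u, v, order=("low", "medium", "high")):
    """Distance between ordered states: difference of their ranks."""
    return abs(order.index(u) - order.index(v))

def zero_one_distance(u, v):
    """0 if the states are equal, 1 otherwise."""
    return 0 if u == v else 1

def discrete_distance(a1, a2, attribute_distances):
    """Sum the separately defined distances d_j over all attributes."""
    return sum(d(u, v) for d, u, v in zip(attribute_distances, a1, a2))

d = discrete_distance(("low", "red"), ("high", "blue"),
                      [ordinal_distance, zero_one_distance])
print(d)  # 3
```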

K Nearest Neighbor: Distance Measures

Normalization

Continuous attributes: using Euclidean distance on continuous attributes may cause one attribute to dominate the distance measure. E.g.: $A_k$ = height in inches, $A_l$ = income in dollars.

Methods for providing a common scale for all attributes:

Min-Max Normalization: replace $A_i$ with

$\frac{A_i - \min(A_i)}{\max(A_i) - \min(A_i)}$

($\min(A_i)$, $\max(A_i)$ are the min/max values of $A_i$ appearing in the data).

[Figure: normalized values plotted against the original values for the attributes A1 and A2]
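Min-max normalization of one attribute column can be sketched as follows. The function name and the sample heights are hypothetical, and mapping a constant column to 0.0 is an assumption made to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale the values of one attribute to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([60, 70, 80]))  # [0.0, 0.5, 1.0]
```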

K Nearest Neighbor: Distance Measures

Z-score Standardization: replace $A_i$ with

$\frac{A_i - \mathrm{mean}(A_i)}{\mathrm{standard\ deviation}(A_i)}$

where

$\mathrm{mean}(A_i) = \frac{1}{n} \sum_{j=1}^{n} a_{j,i}$

$\mathrm{standard\ deviation}(A_i) = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( a_{j,i} - \mathrm{mean}(A_i) \right)^2}$

[Figure: standardized values plotted against the original values for the attributes A1 and A2]
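A sketch of z-score standardization for one attribute column. It assumes the standard deviation divides by $n$ (the population form); if the intended definition divides by $n - 1$, the denominator in the code changes accordingly. The function name and the sample data are made up.

```python
import math

def z_score_standardize(values):
    """Subtract the mean and divide by the standard deviation (dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        # Assumption: a constant attribute is mapped to 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Sample with mean 5 and standard deviation 2.
print(z_score_standardize([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```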

K Nearest Neighbor Classifier

Model = (Training) Data

Required: distance function on instances.
Model = labeled training data $(\mathbf{a}_1, c_1), \dots, (\mathbf{a}_N, c_N)$.
Classify a new instance $\mathbf{a}_{new}$ as follows:
- Let $(\mathbf{a}_{j_1}, c_{j_1}), \dots, (\mathbf{a}_{j_K}, c_{j_K})$ be the K training instances whose attributes are closest to $\mathbf{a}_{new}$.
- Define $C(\mathbf{a}_{new})$ as the class label that occurs most frequently among $c_{j_1}, \dots, c_{j_K}$.
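The two classification steps can be sketched in Python as below. Euclidean distance is an assumed default, ties in the vote go to the label seen first among the nearest neighbors, and the training points are made up so that the nearest neighbor is green while two of the three nearest are red.

```python
import math
from collections import Counter

def euclidean(a1, a2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def knn_classify(training, a_new, k, distance=euclidean):
    """training: list of (attributes, class_label) pairs."""
    # Step 1: the K training instances closest to a_new.
    nearest = sorted(training, key=lambda pair: distance(pair[0], a_new))[:k]
    # Step 2: the most frequent class label among them.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data.
training = [((1.0, 0.0), "green"), ((0.0, 1.5), "red"),
            ((-1.5, 0.0), "red"), ((5.0, 5.0), "blue")]
print(knn_classify(training, (0.0, 0.0), k=1))  # green
print(knn_classify(training, (0.0, 0.0), k=3))  # red
```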

K Nearest Neighbor Classifier

Dependence on K

[Figure: decision regions (approximately) for 1-nearest neighbor (left) and 5-nearest neighbor (right)]

There is a possibility of overfitting for small values of K. Cross-validation can be used to find a suitable value for K.
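One way to pick K is leave-one-out cross-validation, sketched below. The slides only say that cross-validation can be used; the leave-one-out variant, the helper names, the candidate values and the tiny one-dimensional data set are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_classify(training, a_new, k):
    """Majority vote among the k nearest training instances (Euclidean distance)."""
    def dist(a):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, a_new)))
    nearest = sorted(training, key=lambda pair: dist(pair[0]))[:k]
    return Counter(c for _, c in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Classify each instance using all the others as training data."""
    correct = 0
    for i, (a, c) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_classify(rest, a, k) == c:
            correct += 1
    return correct / len(data)

def choose_k(data, candidates):
    """Return the candidate with the highest accuracy (first one wins ties)."""
    return max(candidates, key=lambda k: loo_accuracy(data, k))

# Hypothetical data: two well separated classes on a line.
data = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"),
        ((10.0,), "b"), ((11.0,), "b"), ((12.0,), "b")]
print(choose_k(data, [1, 3, 5]))  # 1
```

On this data K = 5 fails on every held-out instance, because the five remaining points always contain three instances of the other class.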

K Nearest Neighbor Classifier

Weighted voting

We can give a higher weight to neighbors close to $\mathbf{x}$ than to neighbors far away. Calculate a weight for each label $c$:

$v(c) = \sum_{i=1:\, c_i = c}^{k} \frac{1}{d(\mathbf{x}, \mathbf{a}_i)}$

and label $\mathbf{x}$ with the class having the highest weight.
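A sketch of weighted voting, assuming inverse-distance weights $1/d(\mathbf{x}, \mathbf{a}_i)$ summed per label. The function name, the rule that a neighbor at distance 0 decides the label outright, and the example neighbors are assumptions of this sketch.

```python
from collections import defaultdict

def weighted_vote(neighbors, x, distance):
    """neighbors: the k nearest (attributes, class_label) pairs of x."""
    v = defaultdict(float)
    for a_i, c_i in neighbors:
        d = distance(x, a_i)
        if d == 0:
            # Assumption: an exact match decides the label directly.
            return c_i
        v[c_i] += 1.0 / d
    return max(v, key=v.get)

def abs_distance(a1, a2):
    """Distance between one-dimensional instances."""
    return abs(a1[0] - a2[0])

# Two of the three neighbors are red, but the close green one outweighs them.
neighbors = [((1.0,), "green"), ((2.0,), "red"), ((2.5,), "red")]
print(weighted_vote(neighbors, (0.0,), abs_distance))  # green
```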

K Nearest Neighbor Classifier

Pros and Cons

+ Can represent complex decision boundaries
+ Trivial to learn
- High memory requirement (but can sometimes just use a subset of the data)
- Classification time increases with the size of the training data
- Does not explain the data
- Dependence on an appropriate distance function