Machine Learning CS-527A: Artificial Neural Networks (ANN)

Machine Learning CS-527A Artificial Neural Networks. Burchan (bourch-khan) Bayazit. http://www.cse.wustl.edu/~bayazit/courses/cs527a/ Mailing list: cs-527a@cse.wustl.edu

Artificial Neural Networks (ANN)
Neural networks are inspired by biological nervous systems, such as our brain. They are useful for learning real-valued, discrete-valued or vector-valued functions, and have been applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies. They work well with noisy, complex sensor data such as inputs from cameras and microphones.

ANN: Inspiration from Neurobiology
A neuron is a many-inputs / one-output unit. The cell body is 5-10 microns in diameter. Incoming signals from other neurons determine whether the neuron will excite ("fire"). The axon turns the processed inputs into outputs. Synapses are the electrochemical contacts between neurons.

In the human brain, approximately 10^11 neurons are densely interconnected and arranged in networks. Each neuron is connected to about 10^4 others on average, and the fastest neuron switching time is about 10^-3 seconds. ANNs are motivated by biological neuron systems; however, many of their features are inconsistent with biological systems.

ANN Short History
McCulloch & Pitts (1943) are generally recognized as the designers of the first neural network. Their ideas, such as a threshold and many simple units combining to give increased computational power, are still in use today. In the 50s and 60s, many researchers worked on the perceptron. In 1969, Minsky and Papert showed that perceptrons were limited, so neural network research died down for about 15 years. In the mid 80s, interest revived (Parker and LeCun).

Neural Network Architectures
Hopfield networks, one-layer perceptrons, two-layer perceptrons. There are many kinds of structures; the main distinction is made between two classes:
a) feed-forward: a directed acyclic graph (DAG); links are unidirectional and there are no cycles. There is no internal state other than the weights.
b) recurrent: links form arbitrary topologies, e.g. Hopfield networks and Boltzmann machines. Recurrent networks can be unstable, oscillate, or exhibit chaotic behavior; given some input values they can take a long time to compute a stable output, and learning is made more difficult. However, they can implement more complex agent designs and can model systems with state.

The McCulloch-Pitts model
The perceptron calculates a weighted sum of its inputs and compares it to a threshold. If the sum is higher than the threshold, the output is set to 1, otherwise to -1. Learning is finding the weights w_i.

Activation functions for units
- Step function (linear threshold unit): step(x) = 1 if x >= threshold, 0 if x < threshold
- Sign function: sign(x) = +1 if x >= 0, -1 if x < 0
- Sigmoid function: sigmoid(x) = 1 / (1 + e^-x)

Perceptron Mathematical Representation
Adding an extra input with activation a_0 = -1 and weight w_{0,j} = t is equivalent to having a threshold at t. This way we can always assume a threshold of 0:

o(x_1, x_2, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and -1 otherwise.

Equivalently, o(x) = sgn(w . x), where sgn(y) = 1 if y > 0 and -1 otherwise, and the hypothesis space is H = { w : w in R^(n+1) }.
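The following is a minimal sketch (not from the original slides) of the perceptron unit and the three activation functions listed above; the function names are illustrative.

```python
import math

def step(x, threshold=0.0):
    # Linear threshold unit: 1 if x >= threshold, else 0
    return 1 if x >= threshold else 0

def sign(x):
    # Sign function: +1 if x >= 0, else -1
    return 1 if x >= 0 else -1

def sigmoid(x):
    # Sigmoid function: 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_output(weights, inputs):
    # o(x) = sgn(w . x); a constant input x_0 = 1 lets weights[0] play the
    # role of the threshold term w_0 in the formula above.
    x = [1.0] + list(inputs)
    return sign(sum(w * xi for w, xi in zip(weights, x)))
```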

Perceptron
o(x) defines an N-dimensional space and an (N-1)-dimensional hyperplane. The perceptron returns 1 for data points lying on one side of the hyperplane and -1 for data points lying on the other side. If the positive and negative examples are separated by a hyperplane, they are called linearly separable sets of examples. But that is not always the case.

The equation below describes a (hyper-)plane in the input space consisting of real-valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class:

decision region for C1: w_1 x_1 + w_2 x_2 + w_0 >= 0
decision boundary: w_1 x_1 + w_2 x_2 + w_0 = 0

Perceptron Learning
We have either -1 or +1 as the output and the inputs are either 0 or 1. There are 4 cases:
1. The output is supposed to be +1 and the perceptron returns +1.
2. The output is supposed to be -1 and the perceptron returns -1.
3. The output is supposed to be +1 and the perceptron returns -1.
4. The output is supposed to be -1 and the perceptron returns +1.

For each training example <x, t> in D, find o = o(x) and update each weight w_i = w_i + Δw_i. In cases 1 or 2, do nothing, since the perceptron returns the right result. In case 3, w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n must become greater than 0, so the weights need to be increased. In case 4, the weights must be decreased. The following update rule satisfies this:

w_i ← w_i + Δw_i, where Δw_i = η (t - o) x_i

Here t is the target output, o is the output generated by the perceptron, and η is a positive constant known as the learning rate.

Learning the AND function
Training data (x_1, x_2, target): (0,1,0), (0,0,0), (1,0,0), (1,1,1). A separating perceptron is, for example, w_0 = -1, w_1 = 0.6, w_2 = 0.6; in the input space, the line it defines separates (1,1) from (0,0), (0,1) and (1,0).
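A minimal sketch of the update rule above applied to the AND data, assuming the perceptron_output helper from the earlier sketch. The AND targets are encoded as -1/+1 so they match the sign output; the function name and constants are illustrative.

```python
def train_perceptron(data, eta=0.1, epochs=20):
    n = len(data[0][0])                 # number of inputs
    w = [0.0] * (n + 1)                 # w[0] is the threshold weight (input x_0 = 1)
    for _ in range(epochs):
        for x, t in data:
            o = perceptron_output(w, x)      # defined in the earlier sketch
            for i, xi in enumerate([1.0] + list(x)):
                w[i] += eta * (t - o) * xi   # cases 1 and 2 change nothing (t == o)
    return w

# AND: (x1, x2) -> target, with target 0 mapped to -1 and 1 mapped to +1
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_data)
print([perceptron_output(w, x) for x, _ in and_data])   # expected: [-1, -1, -1, 1]
```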

Learning the AND function
Output space for the AND gate: with the training data above, the learned decision boundary w_0 + w_1 x_1 + w_2 x_2 = 0 separates (1,1) from (0,0), (0,1) and (1,0).

Limitations of the Perceptron
- Only binary input-output values
- Only two layers
- Separates the space linearly: w_0 + w_1 x_1 + w_2 x_2 = 0

Learning XOR
Minsky and Papert (1969) showed that a two-layer perceptron cannot represent certain logical functions. Some of these are very fundamental, in particular the exclusive or (XOR): "Do you want coffee XOR tea?" The XOR examples over (0,0), (0,1), (1,0), (1,1) are not linearly separable, as the sketch below illustrates.

Solution to Linear Inseparability
- Use another training rule (the delta rule)
- Backpropagation
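To illustrate the limitation (this demonstration is my addition, not part of the slides), running the same trainer from the previous sketch on XOR never finds weights that classify all four points, because no single line separates them:

```python
# XOR targets: output +1 exactly when the two inputs differ
xor_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
w = train_perceptron(xor_data, epochs=100)
print([perceptron_output(w, x) for x, _ in xor_data])   # never equals [-1, 1, 1, -1]
```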

ANN: Gradient Descent and the Delta Rule
The delta rule is designed to converge even on examples that are not linearly separable. It uses gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples.

Gradient Descent
Define an error function based on the target concepts and the network output. The goal is to change the weights so that the error is reduced: (w_1, w_2) → (w_1 + Δw_1, w_2 + Δw_2). The training error of a hypothesis is

E(w) = 1/2 Σ_{d in D} (t_d - o_d)^2

where D is the set of training examples, t_d is the target output for training example d, and o_d is the output of the linear unit for training example d.

Derivation of the Gradient Descent Rule
The gradient ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n] points in the direction of steepest ascent along the error surface. The update is w ← w + Δw with Δw = -η ∇E(w); the negative sign is present because we want to go in the direction that decreases E. For the i-th component: w_i ← w_i + Δw_i, where Δw_i = -η ∂E/∂w_i.

How to find Δw_i? Hence

∂E/∂w_i = ∂/∂w_i [ 1/2 Σ_{d in D} (t_d - o_d)^2 ]
        = 1/2 Σ_{d in D} 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
        = Σ_{d in D} (t_d - o_d) ∂/∂w_i (t_d - w · x_d)
        = Σ_{d in D} (t_d - o_d) (-x_{id})

so that Δw_i = η Σ_{d in D} (t_d - o_d) x_{id}, where x_{id} is the single input component x_i for training example d.

Gradient-Descent Algorithm
Each training example is a pair <x, t>, where x is the vector of input values and t is the target output value; η is the learning rate (e.g. 0.5).
- Initialize each w_i to some small random value.
- Until the termination condition is met, do:
  - Initialize each Δw_i to zero.
  - For each <x, t> in the training examples, do:
    - Input the instance x to the unit and compute the output o.
    - For each linear unit weight w_i, do: Δw_i ← Δw_i + η (t - o) x_i.
  - For each linear unit weight w_i, do: w_i ← w_i + Δw_i.
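A minimal sketch of the batch gradient-descent algorithm above for a single linear unit o = w · x; the function name, zero initialization, and default constants are my choices rather than the slides'.

```python
def gradient_descent(examples, eta=0.05, iterations=100):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight (input x_0 = 1)
    for _ in range(iterations):
        delta = [0.0] * (n + 1)              # initialize each Δw_i to zero
        for x, t in examples:
            xx = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xx))    # linear unit output
            for i in range(n + 1):
                delta[i] += eta * (t - o) * xx[i]        # accumulate Δw_i
        for i in range(n + 1):
            w[i] += delta[i]                 # apply the batch update
    return w
```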

Training Strategies
- Online training: update the weights after each sample.
- Offline (batch) training: compute the error over all samples, then update the weights.
Online training is noisy and sensitive to individual instances; however, it may escape local minima.

Example: Learning Addition
Goal: learn binary addition, i.e. 0+0 = 0, 0+1 = 1, 1+0 = 1, 1+1 = 10, with a two-bit output (carry and sum). The network has an input layer (the two bits), a hidden layer, and an output layer, with sigmoid activation functions. Training data: inputs (0,0), (0,1), (1,0), (1,1); target concept (0,0), (0,1), (0,1), (1,0).

First find the outputs O_1, O_2. In order to do this, propagate the inputs forward: first find the outputs of the neurons of the hidden layer, then find the outputs of the neurons of the output layer.
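A minimal sketch of this forward pass for a small two-input, two-hidden, two-output sigmoid network. The weight layout (one row per neuron, bias first) is my assumption, and it reuses the sigmoid helper from the earlier sketch.

```python
def layer_forward(inputs, weights):
    # weights: one row per neuron, row = [bias, w_1, ..., w_n]
    return [sigmoid(row[0] + sum(w * x for w, x in zip(row[1:], inputs)))
            for row in weights]

def forward(x, hidden_w, output_w):
    h = layer_forward(x, hidden_w)     # step 1: hidden-layer outputs
    o = layer_forward(h, output_w)     # step 2: output-layer outputs O_1, O_2
    return h, o
```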

Example: Learning Addition (continued)
Now propagate the errors back. To do that, first find the errors for the output layer and update the weights between the hidden layer and the output layer. Then backpropagate the errors to the hidden layer. Finally, update the weights of the hidden layer (a sketch of one such step follows below).

Importance of the Learning Rate
(Figure.)

Generalization of the Backpropagation
(Figure.)
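A minimal sketch of one online backpropagation step for the small sigmoid network above, in the order just described: output-layer errors, hidden-layer errors, then the weight updates. It assumes the forward helper from the previous sketch and the usual sigmoid error terms; all names and the default learning rate are mine.

```python
def backprop_step(x, target, hidden_w, output_w, eta=0.5):
    h, o = forward(x, hidden_w, output_w)

    # Errors for the output layer: delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, target)]

    # Backpropagate to the hidden layer: delta_j = h_j (1 - h_j) sum_k w_kj delta_k
    delta_hid = [hj * (1 - hj) * sum(output_w[k][j + 1] * delta_out[k]
                                     for k in range(len(o)))
                 for j, hj in enumerate(h)]

    # Finally, update the weights (index 0 is the bias weight, whose input is 1)
    for k, row in enumerate(output_w):
        for j, hj in enumerate([1.0] + h):
            row[j] += eta * delta_out[k] * hj
    for j, row in enumerate(hidden_w):
        for i, xi in enumerate([1.0] + list(x)):
            row[i] += eta * delta_hid[j] * xi
    return o
```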

Backpropagation Using Gradient Descent
Advantages: relatively simple implementation; a standard method that generally works well.
Disadvantages: slow and inefficient; can get stuck in local minima, resulting in sub-optimal solutions.

Local Minima
(Figure: error surface with a local minimum and the global minimum.)

Alternatives to Gradient Descent
Simulated Annealing
- Advantages: can guarantee an optimal solution (global minimum).
- Disadvantages: may be slower than gradient descent; much more complicated implementation.
Genetic Algorithms / Evolutionary Strategies
- Advantages: faster than simulated annealing; less likely to get stuck in local minima.
- Disadvantages: slower than gradient descent; memory intensive for large nets.
Simplex Algorithm
- Advantages: similar to gradient descent but faster; easy to implement.
- Disadvantages: does not guarantee a global minimum.

Enhancements to Gradient Descent
Momentum: adds a percentage of the last movement to the current movement.

Enhancements to Gradient Descent
Momentum is useful to get over small bumps in the error function and often finds a minimum in fewer steps:

Δw_ij(t) = -η δ_j x_ij + α Δw_ij(t-1)

(a code sketch of this update appears at the end of this block).

Backpropagation Drawbacks
- Slow convergence. Improve by increasing the learning rate? (Figure.)
- Bias: hard to characterize; smooth interpolation between data points.
- Overfitting: use a validation set and keep the weights from the most accurate learning; decay the weights; use several networks and use voting.

K-fold cross validation:
1. Divide the input set into K small sets.
2. For k = 1..K:
3.   Use set k as the validation set and the remaining sets for training.
4.   Find the number of iterations i_k that gives optimal learning for this set.
5. Find the average number of iterations over all sets.
6. Train the network with that number of iterations.

Despite its popularity, backpropagation has some disadvantages:
- Learning is slow.
- New learning will rapidly overwrite old representations unless these are interleaved (i.e., repeated) with the new patterns. This makes it hard to keep networks up-to-date with new information (e.g., the dollar rate), and it also makes backpropagation very implausible as a psychological model of human memory.

Good points:
- Easy to use: few parameters to set, and the algorithm is easy to implement.
- Can be applied to a wide range of data.
- Very popular; it has contributed greatly to the new connectionism (the second wave).
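A minimal sketch of the momentum update above, applied to one unit's weights. The sign convention follows the backprop sketch earlier (where the δ term already points downhill, so it is added), and the names and default constants are mine.

```python
def momentum_update(w, delta, x, prev_dw, eta=0.5, alpha=0.9):
    # w: the unit's weights; delta: its error term; x: its inputs (bias input 1 first);
    # prev_dw: the weight changes from the previous step, i.e. Δw(t-1)
    new_dw = [eta * delta * xi + alpha * pdw for xi, pdw in zip(x, prev_dw)]
    for i, dw in enumerate(new_dw):
        w[i] += dw
    return new_dw          # keep this for the next step's momentum term
```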

Deficiencies of BP Nets
- Learning often takes a long time to converge; complex functions often need hundreds or thousands of epochs.
- The net is essentially a black box. It may provide a desired mapping between input and output vectors (x, y), but it does not have the information of why a particular x is mapped to a particular y, and it thus cannot provide an intuitive (e.g., causal) explanation for the computed result. This is because the hidden units and the learned weights do not have a semantics: what can be learned are operational parameters, not general, abstract knowledge of a domain.
- The gradient descent approach only guarantees reducing the total error to a local minimum (E may not be reduced to zero), and it cannot escape from that local minimum error state, so not every function that is representable can be learned. How bad this is depends on the shape of the error surface: too many valleys/wells make it easy to be trapped in local minima. Possible remedies: try nets with different numbers of hidden layers and hidden units (they may lead to different error surfaces, some of which might be better than others); try different initial weights (different starting points on the surface); force escape from local minima by random perturbation (e.g., simulated annealing).
- Generalization is not guaranteed even if the error is reduced to zero. Over-fitting/over-training problem: the trained net fits the training samples perfectly (E reduced to 0) but does not give accurate outputs for inputs not in the training set.
- Unlike many statistical methods, there is no theoretically well-founded way to assess the quality of BP learning: what confidence level can one have in a trained BP net with its final E (which may or may not be close to zero)?

Kohonen Maps
For each training example: find the winner neuron, then update the weights of the winner and its neighbors. Every neuron of the output layer is connected with every neuron of the input layer. While learning, the neuron closest to the input data (the one for which the distance between its weights and the input vector is minimum) and its neighbors (see below) update their weights. The distance is the Euclidean distance between the input vector x and the unit's weight vector w_j, and the Kohonen update rule tends to bring the connections closer to the input data:

w_j ← w_j + η (x - w_j)   for the winner and its neighbors.

The input x is given to all the units at the same time.
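A minimal sketch of the Kohonen learning step just described; the neighbourhood function and the constants are placeholders I chose, not part of the slides.

```python
import math

def kohonen_step(x, weights, neighbours, eta=0.1):
    # weights: one weight vector per output unit
    # neighbours(j): indices of unit j's neighbourhood (including j itself)
    dists = [math.dist(x, w) for w in weights]      # Euclidean distance to each unit
    winner = dists.index(min(dists))                # the closest unit wins
    for j in neighbours(winner):
        # pull the weights of the winner and its neighbours toward the input
        weights[j] = [wj + eta * (xi - wj) for wj, xi in zip(weights[j], x)]
    return winner
```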

Kohonen Maps
The weights of the winner unit are updated together with the weights of its neighborhood.

NETtalk (Sejnowski & Rosenberg, 1987): a killer application
The task is to learn to pronounce English text from examples. The training data is 1024 words from a side-by-side English/phoneme source. Input: 7 consecutive characters from written text presented in a moving window that scans the text. Output: a phoneme code giving the pronunciation of the letter at the center of the input window. Network topology: 7x29 inputs (26 characters + punctuation marks), 80 hidden units and 26 output units (phoneme code), with sigmoid units in the hidden and output layers.

NETtalk (contd.)
Training protocol: 95% accuracy on the training set after 50 epochs of training by full gradient descent, and 78% accuracy on a set-aside test set. Comparison against DECtalk (a rule-based expert system): DECtalk performs better, but it represents a decade of analysis by linguists, whereas NETtalk learns from examples alone and was constructed with little knowledge of the task.

Steering an Automobile
The ALVINN system [Pomerleau 1991, 1993] uses an artificial neural network with a 30x32 TV image as input (960 input nodes), 5 hidden nodes and 30 output nodes. Training regime: modified "on-the-fly"; a human driver drives the car, and his actual steering angles are taken as correct labels for the corresponding inputs. Shifted and rotated images were also used for training. ALVINN has driven autonomously for long stretches of highway at speeds up to 100 km/h.

Voice Recognition
Task: learn to discriminate between two different voices saying "Hello". Data sources: Steve Simpson and David Raubenheimer. Format: frequency distribution (60 bins).

Network architecture: a feed-forward network with 60 inputs (one for each frequency bin), 6 hidden units and 2 outputs (0-1 for Steve, 1-0 for David).

Presenting the data: (figure of the frequency distributions for Steve and David).