The Perceptron. Volker Tresp Summer 2014

The Perceptron Volker Tresp Summer 2014 1

Introduction
- The Perceptron was one of the first serious learning machines
- The most important elements in learning tasks:
  - Collection and preprocessing of training data
  - Definition of a class of learning models, often defined by the free parameters in a learning model with a fixed structure (e.g., a Perceptron)
  - Selection of a cost function
  - A learning rule to find the best model in the class of learning models; often this means learning the optimal parameters 2

Prototypical Learning Task
- Classification of printed or handwritten digits
- Application: automatic reading of Zip codes
- More general: OCR (optical character recognition) 3

Transformation of the Raw Data (2-D) into Pattern Vectors (1-D) as part of a Learning Matrix 4

Binary Classification 5

Data Matrix for Supervised Learning
- M: number of inputs
- N: number of training patterns
- x_i = (x_{i,0}, ..., x_{i,M-1})^T: i-th input
- x_{i,j}: j-th component of x_i
- X = (x_1, ..., x_N)^T: design matrix
- y_i: i-th target for x_i
- y = (y_1, ..., y_N)^T: vector of targets
- ŷ_i: prediction for x_i
- d_i = (x_{i,0}, ..., x_{i,M-1}, y_i)^T: i-th pattern
- D = {d_1, ..., d_N}: training data
- z: test input
- t: unknown target for z 6
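As a concrete illustration of this notation, here is a minimal sketch (assuming NumPy; the data values and the name X_raw are purely illustrative) that builds the design matrix X with the constant input x_{i,0} = 1 prepended to each pattern, so that w_0 can later act as the bias:

```python
import numpy as np

# Hypothetical raw inputs: N = 4 patterns with two measured features each.
X_raw = np.array([[0.5, 1.2],
                  [1.0, -0.3],
                  [-0.7, 0.8],
                  [0.1, -1.5]])
y = np.array([1, 1, -1, -1])               # targets y_i in {+1, -1}

N = X_raw.shape[0]
# Prepend the constant input x_{i,0} = 1; the design matrix X then has shape (N, M).
X = np.hstack([np.ones((N, 1)), X_raw])

print(X.shape)                             # (4, 3), i.e. N = 4, M = 3
```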

Model 7

A Biologically Motivated Model 8

Input-Output Models
- A biological system needs to make a decision based on available sensor information
- An OCR system classifies a handwritten digit
- A prognostic system predicts tomorrow's energy consumption 9

Supervised Learning
- In supervised learning one assumes that in training both inputs and outputs are available
- For example, an input pattern might reflect the attributes of an object and the target is the class membership of this object
- The goal is the correct classification of new patterns
- Linear classifier: one of the simplest but surprisingly powerful classifiers
- A linear classifier is particularly suitable when the number of inputs M is large; if this is not the case, one can transform the input data into a high-dimensional space, where a linear classifier might be able to solve the problem; this idea is central to a large portion of the lecture (basis functions, neural networks, kernel models)
- A linear classifier can be realized through a Perceptron, a single formalized neuron! 10

Supervised Learning and Learning of Decisions
- One might argue that learning is only of interest if it changes (future) behavior, at least for a biological system
- Many decisions can be reduced to a supervised learning problem: if I can read a Zip code correctly, I know where the letter should be sent
- Decision tasks can often be reduced to an intermediate supervised learning problem
- But who produces the targets for the intermediate task?
- For biological systems a hotly debated issue: is supervised learning biologically relevant? Is only reinforcement learning, based on rewards and punishment, biologically plausible? 11

The Perceptron: A Learning Machine
- The activation function of the Perceptron is the weighted sum of the inputs
  h_i = \sum_{j=0}^{M-1} w_j x_{i,j}
  (Note: x_{i,0} = 1 is a constant input, so that w_0 can be thought of as a bias)
- The binary classification ŷ_i ∈ {+1, -1} is calculated as
  ŷ_i = sign(h_i)
- The linear classification boundary (separating hyperplane) is defined by h_i = 0 12
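A minimal sketch of this forward computation (NumPy; the weight values below are arbitrary and for illustration only, not learned):

```python
import numpy as np

def perceptron_predict(X, w):
    """Compute the activations h_i = sum_j w_j * x_{i,j} and the labels sign(h_i)."""
    h = X @ w                       # one activation per pattern
    return np.where(h >= 0, 1, -1)  # sign(h_i), with ties on the boundary mapped to +1

# M = 3 inputs per pattern; the first column is the constant input x_{i,0} = 1.
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -0.7, 0.8]])
w = np.array([0.1, 0.4, -0.3])      # illustrative weights
print(perceptron_predict(X, w))     # predicted class labels, one per pattern
```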

Perceptron as a Weighted Voting Machine
- The Perceptron is often displayed as a graphical model with one input node for each input variable and with one output node for the target
- The bias w_0 determines the class when all inputs are zero
- When x_{i,j} = 1, the j-th input votes with weight |w_j| for the class sign(w_j)
- Thus, the response of the Perceptron can be thought of as a weighted voting for a class 13

2-D Representation of the Decision Boundary
- The class boundaries are often displayed graphically with M = 3 (next slide)
- This provides some intuition
- But note that this 2-D picture can be misleading, since the Perceptron is typically employed in high-dimensional problems (M >> 1) 14

Two classes that are Linearly Separable 15

Perceptron Learning Rule
- We now need a learning rule to find optimal parameters w_0, ..., w_{M-1}
- We define a cost function that is dependent on the training data and the parameters
- In the learning process (training), one attempts to find parameters that minimize the cost function 16

The Perceptron Cost Function
- Goal: correct classification of the N training samples {y_1, ..., y_N}
- The Perceptron cost function is
  cost = -\sum_{i \in \mathcal{M}} y_i h_i = \sum_{i=1}^{N} |-y_i h_i|_+
  where \mathcal{M} \subseteq \{1, ..., N\} is the index set of the currently misclassified patterns, x_{i,j} is the value of the j-th input in the i-th pattern, and |arg|_+ = max(arg, 0)
- Obviously, we get cost = 0 only when all patterns are correctly classified (then \mathcal{M} = \emptyset); otherwise cost > 0, since y_i and h_i have different signs for misclassified patterns 17
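Under these definitions, the cost can be computed as in the following sketch (NumPy; the function name is illustrative):

```python
import numpy as np

def perceptron_cost(X, y, w):
    """Perceptron cost = -sum_{i in M} y_i * h_i over the misclassified set M,
    or equivalently sum_i max(-y_i * h_i, 0): only patterns where y_i and h_i
    have different signs contribute, so the cost is 0 iff all are correct."""
    h = X @ w
    return np.sum(np.maximum(-y * h, 0.0))
```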

Contribution to the Cost Function of one Data Point 18

Gradient Descent
- Initialize the parameters (typically with small random values)
- In each learning step, change the parameters such that the cost function decreases
- Gradient descent: adapt the parameters in the direction of the negative gradient
- The partial derivative of the cost function with respect to a parameter (example: w_j) is
  \frac{\partial cost}{\partial w_j} = -\sum_{i \in \mathcal{M}} y_i x_{i,j}
- Thus, a sensible adaptation rule is
  w_j \leftarrow w_j + \eta \sum_{i \in \mathcal{M}} y_i x_{i,j} 19
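A single batch gradient-descent step under this rule might look as follows (a sketch, assuming NumPy; the learning rate and stopping criterion are left to the caller):

```python
import numpy as np

def batch_gradient_step(X, y, w, eta=0.1):
    """One step of w_j <- w_j + eta * sum_{i in M} y_i * x_{i,j}."""
    h = X @ w
    misclassified = (y * h) <= 0                     # index set M (boundary counted as misclassified)
    grad = -(y[misclassified] @ X[misclassified])    # d cost / d w_j = -sum_{i in M} y_i x_{i,j}
    return w - eta * grad                            # move in the direction of the negative gradient
```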

Gradient Descent with One Parameter (Conceptual) 20

Gradient Descent with Two Parameters (Conceptual) 21

The Perceptron Learning Rule
- In the actual Perceptron learning rule, one presents a randomly selected, currently misclassified pattern and adapts the weights using only that pattern. This is biologically more plausible and also leads to faster convergence.
- Let x_t and y_t be the training pattern selected in the t-th step. For t = 1, 2, ... one adapts
  w_j \leftarrow w_j + \eta y_t x_{t,j}    for j = 0, ..., M-1
- A weight increases when the (postsynaptic) y_t and the (presynaptic) x_{t,j} have the same sign; different signs lead to a weight decrease (compare: Hebbian learning)
- η > 0 is the learning rate, typically 0 < η ≪ 1
- Pattern-based learning is also called stochastic gradient descent (SGD) 22
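Putting the pieces together, here is a minimal sketch of the pattern-based (stochastic) learning rule; all names are illustrative and the toy data at the bottom are made up for the example:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_updates=1000, seed=0):
    """Pattern-based Perceptron learning: repeatedly pick one currently
    misclassified pattern at random and apply w_j <- w_j + eta * y_t * x_{t,j}."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])      # small random initialization
    for _ in range(max_updates):
        misclassified = np.flatnonzero(y * (X @ w) <= 0)
        if misclassified.size == 0:                  # all patterns correct: converged
            break
        t = rng.choice(misclassified)                # randomly selected misclassified pattern
        w = w + eta * y[t] * X[t]                    # Perceptron update
    return w

# Toy linearly separable data; the first column is the constant input x_{i,0} = 1.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.5, 2.0],
              [1.0, -1.0, -0.5],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(np.sign(X @ w) == y)   # expected: all True for this separable toy example
```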

Stochastic Gradient Descent (Conceptual) 23

Comments
- Convergence proof: with a sufficiently small learning rate η and when the problem is linearly separable, the algorithm converges and terminates after a finite number of steps
- If the classes are not linearly separable and η is finite, there is no convergence 24

Example: Perceptron Learning Rule, η = 0.1 25

Linearly Separable Classes 26

Convergence and Degeneracy 27

Classes that Cannot be Separated with a Linear Classifier 28

The classical Example for Linearly Non-Separable Classes: XOR 29

Classes are Separable (Convergence) 30

Classes are not Separable (no Convergence) 31

Comments on the Perceptron
- Convergence can be very fast
- A linear classifier is a very important basic building block: with M → ∞ most problems become linearly separable!
- In some cases, the data are already high-dimensional with M > 10000 (e.g., the number of possible keywords in a text)
- In other cases, one first transforms the input data into a high-dimensional (sometimes even infinite-dimensional) space and applies the linear classifier in that space: kernel trick, neural networks
- Considering the power of a single formalized neuron: how much computational power might 100 billion neurons possess?
- Are there grandmother cells in the brain? Or grandmother areas? 32

Comments on the Perceptron (cont'd)
- The Perceptron learning rule is not used much anymore
- No convergence when classes are not separable
- The classification boundary is not unique
- Alternative learning rules:
  - Linear Support Vector Machine
  - Fisher Linear Discriminant
  - Logistic Regression 33

Application for a Linear Classifier: Analysis of fMRI Brain Scans (Tom Mitchell et al., CMU)
- Goal: based on the image slices, determine whether someone thinks of tools, buildings, food, or one of a large set of other semantic concepts
- The trained linear classifier is 90% correct and can, e.g., predict whether someone reads about tools or buildings
- The figure shows the voxels that are most important for the classification task; all three test persons display similar regions 34

Pattern Recognition Paradigm
- von Neumann: "... the brain uses a peculiar statistical language unlike that employed in the operation of man-made computers..."
- A classification decision is made by considering the complete input pattern, and neither as a logical decision based on a small number of attributes nor as a complex logical program
- The linearly weighted sum corresponds more to a voting: each input has either a positive or a negative influence on the classification decision
- Robustness: in high dimensions a single, possibly incorrect, input has little influence 35

Afterword 36

Why Pattern Recognition?
- Alternative approach to pattern recognition: learning of simple, close-to-deterministic rules (the naive expectation)
- One of the big mysteries in machine learning is why rule learning is not very successful
- Problems: the learned rules are either trivial, known, or extremely complex and very difficult to interpret
- This is in contrast to the general impression that the world is governed by simple rules
- Also: computer programs, machines, ... follow simple deterministic if-then rules? 37

Example: Birds Fly
- Define flying: using its own force, at least 20 m far, at least 1 m high, at least once every day in its adult life, ...
- A bird can fly if
  - it is not a penguin, or ...
  - it is not seriously injured or dead
  - it is not too old
  - its wings have not been clipped
  - it does not have a number of diseases
  - it does not live only in a stable
  - it does not carry heavy weights
  - ... 38

Pattern Recognition
- 90% of all birds fly
- Of all birds which do not belong to a flightless class, 94% fly
- ... and which are not domesticated, 96% fly
- Basic problem: complexity of the underlying (deterministic) system; incomplete information
- Thus: the success of statistical machine learning! 39

Example: Predicting Buying Pattern 40

Where Rule-Learning Works Technical human generated worlds ( Engine A always goes with transmission B ). 41