Supervised Learning Part I


Supervised Learning Part I http://www.lps.ens.fr/~nadal/cours/mva Jean-Pierre Nadal, CNRS & EHESS. Laboratoire de Physique Statistique (LPS, UMR 8550 CNRS - ENS - UPMC - Univ. Paris Diderot), Ecole Normale Supérieure (ENS), & Centre d'Analyse et de Mathématique Sociales (CAMS, UMR 8557 CNRS - EHESS), Ecole des Hautes Etudes en Sciences Sociales (EHESS). nadal@lps.ens.fr

Supervised learning - Menu. Intro: F. Rosenblatt; the Perceptron as a linear separator. Capacity & information capacity: Cover's geometrical approach; Vapnik, beyond the perceptron. Learning a rule from examples: Gardner's statistical physics approach. The perceptron algorithm: the perceptron algorithm (Rosenblatt); max stability / optimal margin; Support Vector Machines (SVM): back to the original Perceptron?; alternatives: MLP, deep learning. Modeling the Cerebellum: Purkinje cells as Perceptrons; efficient Hebbian learning.

The Perceptron Frank Rosenblatt, «The perceptron: a probabilistic model for information storage and organization in the brain», Psychological Review, Vol. 65:6 (1958) F. Rosenblatt (1962). Principles of neurodynamics. New York: Spartan. Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969. Thomas M. Cover. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers, EC-14(3):326--334, June 1965

The Perceptron

Supervised learning paradigm: learning a set of associations. Perceptron = linear separator. Space of patterns, space of couplings (→ blackboard).

The Perceptron: learning capacity Frank Rosenblatt, «The perceptron: a probabilistic model for information storage and organization in the brain», Psychological Review, Vol. 65:6 (1958) Frank Rosenblatt, Principles of neurodynamics, New York: Spartan (1962) Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969. Thomas M. Cover. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers, EC-14(3):326--334, June 1965

Supervised learning paradigm: learning a set of associations. Perceptron capacity: growth function = number of dichotomies (number of domains in the space of couplings).

(Theorem 1) Perceptron capacity, from Schläfli 1950 to Cover 1965

Perceptron capacity Cover 1965 (null threshold)

Number of dichotomies (Cover 1965). Space of dimension N (N inputs), hyperplanes passing through the origin (threshold set to zero): C(p, N) = 2 Σ_{k=0}^{N−1} (p−1 choose k). Reminder (binomial coefficient): (p−1 choose k) = (p−1)! / [k! (p−1−k)!]. With a threshold the count is C(p, N+1) (→ proof). The Vapnik-Chervonenkis dimension of the Perceptron is N (N+1 with a threshold).

Perceptron capacity - Cover 1965. Probability that the p associations can be learned by a perceptron with N inputs, plotted as a function of α = p/N for increasing N: the transition sharpens around the critical capacity α = 2 as N grows.
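
As a small numerical illustration of Cover's counting formula (a sketch, not course material): the probability that a random dichotomy of p points in general position is realizable by a zero-threshold perceptron, showing the transition sharpening around α = p/N = 2 as N grows.

```python
from math import comb

def n_dichotomies(p, N):
    """Cover (1965): number of dichotomies of p points in general position
    realizable by a hyperplane through the origin in dimension N."""
    return 2 * sum(comb(p - 1, k) for k in range(min(N, p)))

def prob_separable(p, N):
    """Probability that a random dichotomy of p points is linearly separable."""
    return n_dichotomies(p, N) / 2 ** p

# The transition at alpha = p/N = 2 sharpens as N grows.
for N in (5, 20, 100):
    row = [round(prob_separable(int(a * N), N), 3) for a in (1.0, 1.5, 2.0, 2.5, 3.0)]
    print(N, row)
```

At α = 2 the probability is exactly 1/2 for every N, while for α below (above) 2 it tends to 1 (to 0) as N grows.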

f = fraction of «1»s. Entropy for exactly pf «1»s: H = p s(f), where s(f) is the entropy of a binary variable equal to 1 with probability f and 0 with probability 1 − f: s(f) = − f ln f − (1 − f) ln(1 − f). With logarithms in base 2 (information in bits): s_2(f) = − f log_2 f − (1 − f) log_2(1 − f), with log_2(.) = ln(.)/ln 2. Properties: s_2(f) = s_2(1 − f), s_2(f = 0) = s_2(f = 1) = 0, and s_2 ≤ 1 with s_2 = 1 bit for f = 1/2. (Figure: plot of s_2(f) for f between 0 and 1.)
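
A minimal sketch of the binary entropy defined above, for reference:

```python
import math

def s2(f):
    """Binary entropy in bits: s2(f) = -f*log2(f) - (1-f)*log2(1-f), with s2(0) = s2(1) = 0."""
    if f in (0.0, 1.0):
        return 0.0
    return -f * math.log2(f) - (1 - f) * math.log2(1 - f)

print(s2(0.5))            # 1.0 bit, the maximum
print(s2(0.1), s2(0.9))   # symmetric: s2(f) = s2(1 - f)
```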

Information stored = difference of entropies. A large number p of objects; object type τ (two possible types), with f = probability to have an object of the first type; box number σ ∈ {1, 2}. (Classification, data analysis, signal processing, encoding.) Entropy (Shannon information) before sorting: H = p [− f ln f − (1 − f) ln(1 − f)]. If noise or errors ("drunk Maxwell's demon"), the entropies of the two boxes after sorting remain positive: H_1 > 0, H_2 > 0. Information gain = decrease in entropy: I = H − H_1 − H_2 = mutual information between τ and σ.

Information capacity (bits per synapse) as a function of α = p/N, for a perceptron with N inputs. G. Toulouse 1989; N. Brunel, JPN & G. Toulouse 1992

Beyond critical capacity: minimum fraction of errors ε (bits per synapse, as a function of α = p/N). Information sent (p bits = desired dichotomy of the p patterns) = capacity (maximum information that can be transmitted) + information loss (entropy corresponding to εp errors randomly distributed). This is Fano's inequality from information theory (1950s): the minimum information loss in a noisy channel. Reminder, binary entropy in bits: s_2(ε) = − ε log_2 ε − (1 − ε) log_2(1 − ε). G. Toulouse 1989; N. Brunel, JPN & G. Toulouse 1992
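
A small numerical sketch of this Fano-type balance (not from the course): given an assumed information capacity c_max in bits per synapse (the value 2.0 below is purely illustrative; the actual curve is derived in the cited papers), the minimum error fraction ε is obtained from the per-pattern balance 1 = c_max/α + s_2(ε).

```python
import math

def s2(f):
    """Binary entropy in bits."""
    return 0.0 if f in (0.0, 1.0) else -f * math.log2(f) - (1 - f) * math.log2(1 - f)

def min_error_fraction(alpha, c_max):
    """Fano-type balance per pattern: 1 <= c_max/alpha + s2(eps).
    `c_max` (bits per synapse) is an assumed illustrative parameter."""
    deficit = 1.0 - c_max / alpha          # bits that must be absorbed as errors
    if deficit <= 0:
        return 0.0                         # below capacity: no errors required
    lo, hi = 0.0, 0.5                      # invert s2 on [0, 1/2] by bisection
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if s2(mid) < deficit else (lo, mid)
    return 0.5 * (lo + hi)

for alpha in (1.0, 2.0, 4.0, 8.0):
    print(alpha, round(min_error_fraction(alpha, c_max=2.0), 4))
```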

Supervised learning - Menu. Intro: F. Rosenblatt; the Perceptron as a linear separator. Capacity & information capacity: Cover's geometrical approach; Vapnik, beyond the perceptron. Learning a rule from examples: Gardner's statistical physics approach. The perceptron algorithm: the perceptron algorithm (Rosenblatt); max stability / optimal margin; Support Vector Machines (SVM): back to the original Perceptron?; alternatives: MLP, deep learning. Modeling the Cerebellum: Purkinje cells as Perceptrons; efficient Hebbian learning.

Perceptron algorithm and beyond. The Perceptron algorithm; variants: minover, optimal margin; from the perceptron to the SVM (and back); Multi-Layer Perceptrons, deep learning.

Perceptron algorithm and beyond. The Perceptron algorithm; variants: minover, optimal margin; from the perceptron to the SVM (and back); Multi-Layer Perceptrons, deep learning. (→ blackboard)
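
Since the algorithm itself is worked out on the blackboard, here is a minimal sketch of the standard Rosenblatt perceptron update on a synthetic, linearly separable task (the random "teacher" data below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 30                      # input dimension, number of patterns (alpha = 1.5 < 2)
teacher = rng.standard_normal(N)   # hypothetical teacher vector defining a separable task
X = rng.standard_normal((p, N))
y = np.sign(X @ teacher)           # desired outputs +/- 1

w = np.zeros(N)                    # zero-threshold perceptron: output = sign(w . x)
for sweep in range(1000):
    updated = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:     # pattern misclassified (or on the boundary)
            w += yi * xi           # Rosenblatt update: move w toward yi * xi
            updated = True
    if not updated:                # all patterns correctly classified: stop
        break

print("all patterns learned:", np.all(np.sign(X @ w) == y))
```

Convergence is guaranteed whenever the patterns are linearly separable, which is the case here by construction.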

The Perceptron (Rosenblatt) vs SVM. Choice of the kernel = choice of the feature space.
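
To make "choice of the kernel = choice of the feature space" concrete, a sketch of a kernelized perceptron (an illustration, not the lecture's construction): the weight vector is kept implicitly as coefficients on the training patterns, so only kernel evaluations are needed, and an RBF kernel separates an XOR-like task that is not linearly separable in input space.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    """Gaussian kernel: inner product in an implicit feature space."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, kernel, sweeps=50):
    """Perceptron in the feature space defined by `kernel`; w = sum_i a_i y_i phi(x_i)."""
    a = np.zeros(len(X))
    for _ in range(sweeps):
        for i, (xi, yi) in enumerate(zip(X, y)):
            s = sum(a[j] * y[j] * kernel(X[j], xi) for j in range(len(X)))
            if yi * s <= 0:
                a[i] += 1.0        # same Rosenblatt update, expressed on the coefficients
    return a

# XOR-like toy data: not linearly separable in input space, separable with the RBF kernel.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
a = kernel_perceptron(X, y, rbf)
pred = [np.sign(sum(a[j] * y[j] * rbf(X[j], x) for j in range(len(X)))) for x in X]
print(pred)   # matches y
```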

Perceptron algorithm and beyond. The Perceptron algorithm; variants: minover, optimal margin; from the perceptron to the SVM (and back); Multi-Layer Perceptrons, deep learning.

Deep learning. Prior knowledge → specific architecture. Hinton, G. E., Osindero, S. and Teh, Y. (2006); http://www.cs.toronto.edu/~hinton/ ; http://www.deeplearning.net/tutorial/ . Approach further developed by Hinton, Bengio, LeCun and others. Unsupervised learning phase → initialization of parameters; supervised gradient descent → fine tuning. For each layer, a companion feed-back layer tries to reconstruct the layer input from its output (efficient coding). Most recent versions: purely supervised approaches. Figure from Bengio & LeCun, in Large-Scale Kernel Machines, Bottou et al. (Eds.), MIT Press 2007.
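
A minimal numpy sketch in the spirit of the layer-wise scheme described here (one layer with a companion feed-back/decoder layer trained to reconstruct the layer input, whose weights then initialize the supervised network); the data, sizes and learning rate are made up, and this is not the cited authors' implementation:

```python
import numpy as np

# Hypothetical toy data: 200 input vectors of dimension 20 (stand-in for real inputs).
rng = np.random.default_rng(0)
X = rng.random((200, 20))

n_in, n_hid = X.shape[1], 10
W1 = 0.1 * rng.standard_normal((n_in, n_hid))   # encoder (the layer being pre-trained)
b1 = np.zeros(n_hid)
W2 = 0.1 * rng.standard_normal((n_hid, n_in))   # companion feed-back layer (decoder)
b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for epoch in range(200):
    # encode, then try to reconstruct the layer input from its output
    H = sigmoid(X @ W1 + b1)
    Xhat = sigmoid(H @ W2 + b2)
    # gradients of the mean squared reconstruction error (plain backpropagation)
    dZ2 = (Xhat - X) * Xhat * (1 - Xhat)
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dZ2 / len(X)
    b2 -= lr * dZ2.mean(axis=0)
    W1 -= lr * X.T @ dZ1 / len(X)
    b1 -= lr * dZ1.mean(axis=0)

# W1, b1 now initialize the first layer of the supervised network; the same procedure
# is repeated on H for the next layer, and the stack is then fine-tuned by supervised
# gradient descent.
print("final reconstruction error:",
      np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - X) ** 2))
```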

Applications. MNIST database: handwritten digits, 60000 training examples and 10000 test examples. Current best result: error rate of 0.23%, Ciresan et al. 2012; human performance ~ 0.2%. (Figure: best performance of the year, from results collected by Y. LeCun, http://yann.lecun.com/exdb/mnist/.) Automatic speech recognition: TIMIT database, phonemically and lexically transcribed speech of American English speakers of different sexes and dialects. Caltech 101 dataset: 101 natural object categories with up to 30 training instances per class; M. A. Ranzato et al.: average accuracy 54% (M. A. Ranzato, http://www.cs.nyu.edu/~ranzato/).

Supplementary slides

The Perceptron - Supplementary material. Points in general position vs. points not in general position. Definition: p points in dimension N are in general position iff every subset of at most N points is linearly independent. This is the generic case, typically true for points chosen at random.

In terms of hyperplanes (figure: a configuration in general position vs. one not in general position). Definition: p points in dimension N are in general position iff every subset of at most N points is linearly independent. This is the generic case, typically true for points chosen at random.
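
A small sketch checking this definition numerically (illustration only): for points drawn at random every subset of at most N points is, with probability 1, linearly independent, whereas a deliberately degenerate configuration fails the test.

```python
import numpy as np
from itertools import combinations

def in_general_position(points):
    """True iff every subset of at most N points is linearly independent (N = dimension)."""
    p, N = points.shape
    for size in range(1, min(N, p) + 1):
        for idx in combinations(range(p), size):
            if np.linalg.matrix_rank(points[list(idx)]) < size:
                return False
    return True

rng = np.random.default_rng(0)
print(in_general_position(rng.standard_normal((6, 4))))   # random points: True

bad = rng.standard_normal((6, 4))
bad[1] = 2 * bad[0]                                       # two points collinear with the origin
print(in_general_position(bad))                           # False
```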

Cover's result: C(p, N) = number of dichotomies of p points in dimension N. Recursion: one shows that C(p + 1, N) = C(p, N) + C(p, N − 1). In the space of patterns, consider p + 1 points. Among the dichotomies of the first p points, let A = those which can be realized by a hyperplane passing through the new point, and B = those for which this is not the case. Clearly C(p, N) = A + B, while C(p + 1, N) = 2A + B = C(p, N) + A, and one shows A = C(p, N − 1). In the case of zero threshold (hyperplanes passing through the origin), one then obtains: C(p, N) = 2 Σ_{k=0}^{N−1} (p−1 choose k).
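
A small check of the recursion against the closed form (zero-threshold case), as a sketch:

```python
from math import comb

def C(p, N):
    """Closed form: C(p, N) = 2 * sum_{k=0}^{N-1} binom(p-1, k)."""
    return 2 * sum(comb(p - 1, k) for k in range(N))

# Verify the recursion C(p+1, N) = C(p, N) + C(p, N-1) on a small grid.
assert all(C(p + 1, N) == C(p, N) + C(p, N - 1)
           for p in range(1, 30) for N in range(1, 15))
print("recursion verified")
```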

Proof in the case of zero threshold (hyperplanes passing through the origin). Add a new, (p+1)-th point; every hyperplane in the set A goes through the origin and the (p+1)-th point. Project each such hyperplane and each of the p points onto the (N−1)-dimensional space orthogonal to the direction [O, (p+1)-th point]. Each projection is a linear separation (passing through the origin) of the projected p points, hence A = C(p, N−1). Figure from Hertz, Krogh, and Palmer, 1991.

Generalization Learning curves

Learning from examples. A given Learning Machine (not necessarily a neural network) with N-dimensional inputs and binary outputs, and a set of adaptable parameters θ. Data: a set of input patterns given with their desired output (classification task). Hypothesis: the desired output is some unknown function of the input. Wanted: after learning, good performance on a new input pattern. Standard method: learning on part of the data (training set), test on what remains (test set).
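
A minimal sketch of this standard method (train on part of the data, test on the rest), using the zero-threshold perceptron as the learning machine; the random "teacher" rule generating the data is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 20, 200
teacher = rng.standard_normal(N)              # hypothetical unknown rule
X = rng.standard_normal((p, N))
y = np.sign(X @ teacher)

# split: training set / test set
X_train, y_train = X[:150], y[:150]
X_test,  y_test  = X[150:], y[150:]

# learning machine: zero-threshold perceptron trained on the training set only
w = np.zeros(N)
for sweep in range(1000):
    updated = False
    for xi, yi in zip(X_train, y_train):
        if yi * (w @ xi) <= 0:
            w += yi * xi
            updated = True
    if not updated:
        break

# generalization performance: accuracy on the held-out test set
print("test accuracy:", np.mean(np.sign(X_test @ w) == y_test))
```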

Consistent learning: learning curves in the learnable and not-learnable cases. Cortes et al. 1993; Amari, Fujita & Shinomoto 1992; Amari & Murata 1993; Seung et al. 1992.

Cortes et al., 1993

VC dimension. Vapnik V. N. & Chervonenkis A. (1968, 1971, 1974; book in Russian: V. Vapnik, A. Chervonenkis, Pattern Recognition Theory, Statistical Learning Problems, Nauka, Moskva, 1974); Vapnik V. N., The Nature of Statistical Learning Theory, Springer-Verlag, 1995, 2nd ed. 1998. A given Learning Machine (not necessarily a neural network) with N-dimensional inputs and binary outputs, and a set of adaptable parameters W. Data: a set of input patterns given with their desired output (classification task). Growth function: the maximum, over all sets of p input patterns, of the number of dichotomies the machine can realize; the VC dimension is the largest p for which this number equals 2^p. The Vapnik-Chervonenkis dimension of the Perceptron is N (N + 1 with threshold). The Vapnik-Chervonenkis dimension of the Perceptron with margin κ is at most min(R²/κ², N) + 1, where R is the radius of the smallest sphere containing all the input patterns (Vapnik 1998). In most cases (with important exceptions), the VC dimension is of the order of the number of parameters.
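
A sketch computing the margin-based bound in the form quoted above, min(R²/κ², N) + 1; for simplicity the enclosing sphere is taken centred at the origin, which is an extra assumption of this sketch.

```python
import numpy as np

def vc_dim_margin_bound(X, kappa):
    """Margin-based bound on the VC dimension of the perceptron, in the form
    min(R^2 / kappa^2, N) + 1, with R the radius of a sphere containing the
    patterns (here simply centred at the origin for the sketch)."""
    N = X.shape[1]
    R = np.max(np.linalg.norm(X, axis=1))
    return min(R ** 2 / kappa ** 2, N) + 1

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
for kappa in (0.5, 2.0, 8.0):
    print(kappa, vc_dim_margin_bound(X, kappa))   # larger margin -> smaller bound
```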

Vapnik: structural risk minimization. Vapnik V. N. & Chervonenkis A. (1968, 1971, 1974; book in Russian: V. Vapnik, A. Chervonenkis, Pattern Recognition Theory, Statistical Learning Problems, Nauka, Moskva, 1974); Vapnik V. N., The Nature of Statistical Learning Theory, Springer-Verlag, 1995, 2nd ed. 1998. A given Learning Machine (not necessarily a neural network) with N-dimensional inputs and binary outputs, and a set of adaptable parameters W. Data: a set of input patterns given with their desired output (classification task). Bounds on the generalization error (worst-case analysis): with probability 1 − η, the generalization error is bounded by a term of order (h/p) ln(p/h) plus a term of order ln(1/η)/p, where h is the VC dimension and p the number of examples (here in the case of zero training error).

Generalization for p larger than the capacity (of order N). The meaning of generalization: generalization vs. learning by heart. As long as any set of associations is learnable (p below the capacity), learning the examples is mere learning by heart; generalization only becomes meaningful beyond that regime.
