Supervised Learning, Part I. http://www.lps.ens.fr/~nadal/cours/mva. Jean-Pierre Nadal, CNRS & EHESS. Laboratoire de Physique Statistique (LPS, UMR 8550, CNRS - ENS - UPMC - Univ. Paris Diderot), Ecole Normale Supérieure (ENS) & Centre d'Analyse et de Mathématique Sociales (CAMS, UMR 8557, CNRS - EHESS), Ecole des Hautes Etudes en Sciences Sociales (EHESS). nadal@lps.ens.fr
Supervised learning: Menu
Intro: F. Rosenblatt; the Perceptron as a linear separator
Capacity & information capacity: Cover's geometrical approach; Vapnik, beyond the perceptron
Learning a rule from examples: Gardner's statistical physics approach
The perceptron algorithm: the perceptron algorithm (Rosenblatt); max stability / optimal margin
Support Vector Machines (SVM): back to the original Perceptron?
Alternatives: MLP, deep learning
Modeling the Cerebellum: Purkinje cells as Perceptrons; efficient Hebbian learning
The Perceptron Frank Rosenblatt, «The perceptron: a probabilistic model for information storage and organization in the brain», Psychological Review, Vol. 65:6 (1958) F. Rosenblatt (1962). Principles of neurodynamics. New York: Spartan. Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969. Thomas M. Cover. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers, EC-14(3):326--334, June 1965
The Perceptron
Supervised learning paradigm: learning a set of associations. The Perceptron as a linear separator. Space of patterns / space of couplings (blackboard).
The Perceptron: learning capacity Frank Rosenblatt, «The perceptron: a probabilistic model for information storage and organization in the brain», Psychological Review, Vol. 65:6 (1958) Frank Rosenblatt, Principles of neurodynamics, New York: Spartan (1962) Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, 1969. Thomas M. Cover. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers, EC-14(3):326--334, June 1965
Supervised learning paradigm: learning a set of associations. Perceptron capacity: growth function = number of dichotomies (number of domains in the space of couplings).
(Theorem 1) Perceptron capacity: from Schläfli 1950 to Cover 1965
Perceptron capacity Cover 1965 (null threshold)
Number of dichotomies (Cover 1965). Space of dimension N (N inputs), hyperplanes passing through the origin (threshold set to zero):
C(p, N) = 2 Σ_{k=0}^{N−1} (p−1 choose k)
Reminder (binomial coefficient): (n choose k) = n! / [k! (n−k)!]
With a threshold, the count becomes C(p, N+1) (proof: supplementary slides).
The Vapnik-Chervonenkis dimension of the Perceptron is N (N+1 with threshold).
Perceptron capacity (Cover 1965). Probability that the p associations can be learned by a perceptron with N inputs:
P(p, N) = C(p, N) / 2^p = 2^{1−p} Σ_{k=0}^{N−1} (p−1 choose k)
Plotted against α = p/N for increasing N, this probability approaches a step function: P → 1 for α < 2, P → 0 for α > 2, with P = 1/2 at the critical capacity α_c = 2.
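Cover's formula can be checked numerically. Below is a minimal Python sketch (function names are illustrative, not from the course material) that evaluates C(p, N) and the probability P(p, N), showing the transition sharpening around α = 2 as N grows.

```python
from math import comb

def cover_count(p, N):
    """Cover's counting function: number of linearly separable dichotomies of
    p points in general position in R^N (hyperplanes through the origin)."""
    return 2 * sum(comb(p - 1, k) for k in range(N))

def prob_separable(p, N):
    """Probability that a random dichotomy of p points is linearly separable."""
    return cover_count(p, N) / 2.0 ** p

# The transition at alpha = p/N = 2 sharpens as N grows (step function as N -> infinity).
for N in (5, 20, 100):
    probs = [round(prob_separable(int(a * N), N), 3) for a in (1.0, 1.5, 2.0, 2.5, 3.0)]
    print(f"N = {N:3d}   P at alpha = 1, 1.5, 2, 2.5, 3 : {probs}")
```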
f = fraction of "1"s; σ = 1 with probability f, 0 with probability 1 − f.
Entropy: s(f) = − f ln f − (1 − f) ln(1 − f). For exactly p f "1"s among p: H = p s(f).
With logarithms in base 2 (information in bits): s_2(f) = − f log_2 f − (1 − f) log_2(1 − f), where log_2(.) = ln(.) / ln 2.
Properties: s_2(f) = s_2(1 − f) ≤ 1; s_2(f = 0) = s_2(f = 1) = 0; for f = 1/2, s_2 = 1 bit.
(Plot: binary entropy s_2(f) for f between 0 and 1, maximum of 1 bit at f = 1/2.)
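For reference, a minimal Python sketch of the binary entropy in bits (an illustrative helper, not part of the course material):

```python
import numpy as np

def s2(f):
    """Binary entropy in bits: s2(f) = -f log2(f) - (1-f) log2(1-f),
    with the convention s2(0) = s2(1) = 0."""
    f = np.asarray(f, dtype=float)
    out = np.zeros_like(f)
    inside = (f > 0) & (f < 1)
    fi = f[inside]
    out[inside] = -fi * np.log2(fi) - (1 - fi) * np.log2(1 - fi)
    return out

# Symmetric around f = 1/2, where it reaches its maximum of 1 bit.
print(s2([0.0, 0.1, 0.5, 0.9, 1.0]))
```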
Information stored = difference of entropies ("drunk Maxwell's demon"). A large number p of objects; object type τ (two possible types), f = probability to have an object of the first type; box number σ ∈ {1, 2}. The same picture applies to classification, data analysis, signal processing, encoding. If noise: errors (the demon is drunk).
Entropy (Shannon information) before sorting: H = p [ − f ln f − (1 − f) ln(1 − f) ]. After sorting, residual entropies H_1 > 0, H_2 > 0.
Information gain = decrease in entropy: I = H − H_1 − H_2 = mutual information between τ and σ.
Information capacity (bits per synapse) as a function of α = p/N, for large N. G. Toulouse 1989; N. Brunel, JPN & G. Toulouse 1992.
Beyond the critical capacity: minimum fraction of errors ε (bits per synapse, α = p/N, large N).
Information sent (p bits = desired dichotomy of the p patterns) = Capacity (maximum information that can be transmitted) + Information loss (entropy corresponding to εp errors randomly distributed), hence p ≤ C_max + p s_2(ε).
This is Fano's inequality from information theory (1950s): the minimum information loss in a noisy channel.
Reminder, binary entropy in bits: s_2(ε) = − ε log_2 ε − (1 − ε) log_2(1 − ε).
G. Toulouse 1989; N. Brunel, JPN & G. Toulouse 1992.
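As an illustration of this Fano-type bound, the sketch below inverts the binary entropy to get the smallest error fraction compatible with the balance p ≤ C + p s_2(ε). The helper names are hypothetical, and the assumed capacity of about 2 bits per synapse is only an illustrative choice.

```python
from math import log2

def s2(f):
    """Binary entropy in bits."""
    if f <= 0.0 or f >= 1.0:
        return 0.0
    return -f * log2(f) - (1 - f) * log2(1 - f)

def min_error_fraction(info_sent_bits, capacity_bits, tol=1e-10):
    """Smallest error fraction eps in [0, 1/2] compatible with Fano's bound:
    information sent <= capacity + (information sent) * s2(eps)."""
    deficit = 1.0 - capacity_bits / info_sent_bits   # required value of s2(eps)
    if deficit <= 0.0:
        return 0.0                                    # below capacity: no errors forced
    lo, hi = 0.0, 0.5                                 # s2 is increasing on [0, 1/2]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if s2(mid) < deficit:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative numbers: a perceptron with N synapses asked to store p = alpha*N bits,
# assuming a capacity of about 2 bits per synapse (alpha_c = 2).
N = 1000
for alpha in (1.5, 2.0, 3.0, 4.0):
    eps = min_error_fraction(info_sent_bits=alpha * N, capacity_bits=2.0 * N)
    print(f"alpha = {alpha}: minimal error fraction ~ {eps:.4f}")
```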
Supervised learning: Menu
Intro: F. Rosenblatt; the Perceptron as a linear separator
Capacity & information capacity: Cover's geometrical approach; Vapnik, beyond the perceptron
Learning a rule from examples: Gardner's statistical physics approach
The perceptron algorithm: the perceptron algorithm (Rosenblatt); max stability / optimal margin
Support Vector Machines (SVM): back to the original Perceptron?
Alternatives: MLP, deep learning
Modeling the Cerebellum: Purkinje cells as Perceptrons; efficient Hebbian learning
Perceptron algorithm and beyond. Perceptron algorithm. Variants: minover, optimal margin. From the perceptron to the SVM (and back). Multi-Layer Perceptrons, deep learning.
Perceptron algorithm and beyond. Perceptron algorithm. Variants: minover, optimal margin. From the perceptron to the SVM (and back). Multi-Layer Perceptrons, deep learning. (blackboard)
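As a complement to the blackboard presentation, here is a minimal numpy sketch of Rosenblatt's perceptron rule (±1 targets, zero threshold; variable names and the toy data are illustrative, not from the course):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """Rosenblatt's perceptron rule: whenever a pattern is misclassified
    (y_mu * w.x_mu <= 0), update w <- w + y_mu * x_mu."""
    n_patterns, n_inputs = X.shape
    w = np.zeros(n_inputs)
    for _ in range(max_epochs):
        errors = 0
        for mu in range(n_patterns):
            if y[mu] * np.dot(w, X[mu]) <= 0:
                w += y[mu] * X[mu]
                errors += 1
        if errors == 0:                   # all p associations learned
            return w
    return w                              # may not converge if not separable

# Toy check on a linearly separable set (labels given by a random teacher perceptron).
rng = np.random.default_rng(0)
N, p = 20, 30                             # alpha = p/N = 1.5 < 2: separable with high probability
X = rng.standard_normal((p, N))
teacher = rng.standard_normal(N)
y = np.sign(X @ teacher)
w = perceptron_train(X, y)
print("training errors:", int(np.sum(np.sign(X @ w) != y)))
```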
The Perceptron (Rosenblatt) vs the SVM: choice of the kernel = choice of the feature space.
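To illustrate "choice of the kernel = choice of the feature space", here is a minimal sketch of a kernel (dual) perceptron with a degree-2 polynomial kernel. The XOR-like example is not separable in input space but is in the implicit quadratic feature space. Names and data are illustrative, not from the course.

```python
import numpy as np

def poly_kernel(x1, x2, degree=2):
    """K(x1, x2) = (1 + x1.x2)^degree: inner product in an implicit
    feature space of all monomials up to the given degree."""
    return (1.0 + np.dot(x1, x2)) ** degree

def kernel_perceptron_train(X, y, kernel, max_epochs=200):
    """Dual perceptron: the weight vector lives in feature space,
    represented as w = sum_mu alpha_mu * y_mu * phi(x_mu)."""
    p = len(y)
    alpha = np.zeros(p)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for _ in range(max_epochs):
        errors = 0
        for mu in range(p):
            if y[mu] * np.sum(alpha * y * K[:, mu]) <= 0:
                alpha[mu] += 1
                errors += 1
        if errors == 0:
            break
    return alpha

# XOR-like problem: not linearly separable in input space, separable with the
# degree-2 polynomial kernel (i.e. in the quadratic feature space).
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])
alpha = kernel_perceptron_train(X, y, poly_kernel)
pred = [float(np.sign(np.sum(alpha * y * np.array([poly_kernel(x, b) for b in X])))) for x in X]
print("predictions:", pred, " targets:", list(y))
```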
Perceptron algorithm and beyond. Perceptron algorithm. Variants: minover, optimal margin. From the perceptron to the SVM (and back). Multi-Layer Perceptrons, deep learning.
Deep learning. Prior knowledge → specific architecture. Hinton, G. E., Osindero, S. and Teh, Y. (2006). http://www.cs.toronto.edu/~hinton/ , http://www.deeplearning.net/tutorial/ . Approach further developed by Hinton, Bengio, LeCun and others. Unsupervised learning phase: initialization of the parameters; supervised gradient descent: fine tuning. For each layer, a companion feed-back layer tries to reconstruct the layer input from its output (efficient coding). Most recent versions: purely supervised approaches. Figure from Bengio & LeCun, in Large-Scale Kernel Machines, Bottou et al. (Eds.), MIT Press, 2007.
Applications. MNIST database: handwritten digits, 60,000 training examples and 10,000 test examples. Current best result: error rate of 0.23% (Ciresan et al. 2012); human performance ~ 0.2%. (Plot: best error rate of the year, 1998-2018, from results collected by Y. LeCun, http://yann.lecun.com/exdb/mnist/.)
Automatic speech recognition, TIMIT database: phonemically and lexically transcribed speech of American English speakers of different sexes and dialects.
Caltech 101 dataset: 101 natural object categories with up to 30 training instances per class. M. A. Ranzato et al.: average accuracy 54%. M. A. Ranzato, http://www.cs.nyu.edu/~ranzato/
Supplementary slides
The Perceptron - supplementary material. Points in general position vs points not in general position. Definition: p points in dimension N are in general position iff every subset of at most N points is linearly independent. This is the generic case, typically true for points chosen at random.
In terms of hyperplanes: general position vs not in general position. Definition (recalled): p points in dimension N are in general position iff every subset of at most N points is linearly independent. This is the generic case, typically true for points chosen at random.
Cover's result: C(p, N) = number of dichotomies of p points in general position in dimension N.
Recursion: one shows that C(p+1, N) = C(p, N) + C(p, N−1).
In the space of patterns, consider p + 1 points. Among the dichotomies of the first p points:
A = those which can be realized by a hyperplane passing through the new point;
B = those for which this is not the case.
Clearly C(p, N) = A + B. Each dichotomy in A extends to two dichotomies of the p+1 points (the hyperplane can be tilted to put the new point on either side), while each dichotomy in B extends to only one, so C(p+1, N) = 2A + B = C(p, N) + A.
In the case of zero threshold (hyperplanes passing through the origin), one shows A = C(p, N−1), which gives the recursion; together with C(1, N) = 2 this yields C(p, N) = 2 Σ_{k=0}^{N−1} (p−1 choose k).
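The recursion and the closed form can be cross-checked numerically; a minimal Python sketch (illustrative, not from the course material):

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def C_rec(p, N):
    """Number of dichotomies of p points in general position realizable by
    hyperplanes through the origin in R^N, via the recursion
    C(p, N) = C(p-1, N) + C(p-1, N-1)."""
    if N == 0:
        return 0
    if p == 1:
        return 2          # a single (non-zero) point: two half-spaces
    return C_rec(p - 1, N) + C_rec(p - 1, N - 1)

def C_closed(p, N):
    """Closed form: C(p, N) = 2 * sum_{k=0}^{N-1} binom(p-1, k)."""
    return 2 * sum(comb(p - 1, k) for k in range(N))

for p in range(1, 12):
    for N in range(1, 8):
        assert C_rec(p, N) == C_closed(p, N)
print("recursion and closed form agree; e.g. C(10, 3) =", C_rec(10, 3))
```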
Proof in the case of zero threshold (hyperplanes passing through the origin). Every hyperplane in the set A goes through the origin and through the new, (p+1)-th point. Project each such hyperplane and each of the p points onto the (N−1)-dimensional space orthogonal to the direction [O, (p+1)-th point]. Each projection is a linear separation (passing through the origin) of the projected p points, and conversely; hence A = C(p, N−1). Figure from Hertz, Krogh, and Palmer, 1991.
Generalization Learning curves
Learning from examples. A given Learning Machine (not necessarily a neural network) with N-dimensional inputs and binary outputs, and a set of adaptable parameters θ. Data: a set of input patterns given with their desired outputs (classification task). Hypothesis: the desired output is some unknown function of the input. Wanted: after learning, good performance on a new input pattern. Standard method: learning on part of the data (the training set), testing on what remains (the test set), as in the sketch below.
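A minimal sketch of this standard method, using the perceptron rule as the learning machine and a random teacher perceptron as the unknown rule (all names and parameter values are illustrative): it produces a crude learning curve, i.e. test error vs number of training examples.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=500):
    """Rosenblatt's rule on the training set (same update as in the earlier sketch).
    Convergence within max_epochs is not guaranteed for small margins."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for mu in range(len(y)):
            if y[mu] * np.dot(w, X[mu]) <= 0:
                w += y[mu] * X[mu]
                updated = True
        if not updated:
            break
    return w

rng = np.random.default_rng(1)
N = 50
teacher = rng.standard_normal(N)              # the unknown rule to be learned

# Generalization error vs number of training examples (simple learning curve).
for p in (25, 50, 100, 200, 400):
    X_train = rng.standard_normal((p, N))
    y_train = np.sign(X_train @ teacher)
    w = perceptron_train(X_train, y_train)
    X_test = rng.standard_normal((2000, N))   # held-out test set
    y_test = np.sign(X_test @ teacher)
    err = np.mean(np.sign(X_test @ w) != y_test)
    print(f"p = {p:4d}  test error ~ {err:.3f}")
```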
Consistent learning (learning curves: generalization error vs number of examples; learnable vs not learnable rules). Cortes et al. 1993; Amari, Fujita & Shinomoto 1992; Amari & Murata 1993; Seung et al. 1992.
Cortes et al, 1993
VC dimension. Vapnik V. N. & Chervonenkis A. (1968, 1971, 1974; book in Russian: V. Vapnik, A. Chervonenkis, Pattern Recognition Theory, Statistical Learning Problems, Nauka, Moskva, 1974). Vapnik V. N., The Nature of Statistical Learning Theory, Springer-Verlag, 1995, 2nd ed. 1998.
A given Learning Machine (not necessarily a neural network) with N-dimensional inputs and binary outputs, and a set of adaptable parameters W. Data: a set of input patterns given with their desired outputs (classification task).
Growth function: the maximum number of dichotomies of p patterns realizable by the machine; the VC dimension d_VC is the largest p for which this number equals 2^p.
The Vapnik-Chervonenkis dimension of the Perceptron is N + 1 (N for zero threshold).
The Vapnik-Chervonenkis dimension of the Perceptron with margin κ is at most min(N, R²/κ²) + 1, where R is the radius of the smallest sphere containing all the input patterns (Vapnik 1998).
In most cases (with important exceptions): d_VC ~ number of parameters.
Vapnik: structural risk minimization. Vapnik V. N. & Chervonenkis A. (1968, 1971, 1974; book in Russian: V. Vapnik, A. Chervonenkis, Pattern Recognition Theory, Statistical Learning Problems, Nauka, Moskva, 1974). Vapnik V. N., The Nature of Statistical Learning Theory, Springer-Verlag, 1995, 2nd ed. 1998.
A given Learning Machine (not necessarily a neural network) with N-dimensional inputs and binary outputs, and a set of adaptable parameters W. Data: a set of input patterns given with their desired outputs (classification task).
Bounds on the generalization error (worst-case analysis): with probability at least 1 − η,
ε_gen ≤ ε_train + sqrt{ [ d_VC (ln(2p/d_VC) + 1) − ln(η/4) ] / p }
(here, case with zero training error: ε_train = 0, and only the complexity term remains).
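To get a feel for how such a bound behaves, the sketch below evaluates one common textbook form of the VC bound as a function of the number of examples; the exact constants vary between statements, and the parameter values are only illustrative.

```python
from math import log, sqrt

def vc_bound(p, d_vc, eta=0.05, train_error=0.0):
    """One standard form of the VC bound: with probability at least 1 - eta,
    generalization error <= training error
        + sqrt( (d_vc * (ln(2p/d_vc) + 1) + ln(4/eta)) / p ).
    Worst-case bound; the constants differ between textbook statements."""
    complexity = sqrt((d_vc * (log(2.0 * p / d_vc) + 1.0) + log(4.0 / eta)) / p)
    return train_error + complexity

# Illustration: perceptron with N = 100 inputs (d_vc = N + 1 with threshold),
# zero training error.
d_vc = 101
for p in (1_000, 10_000, 100_000, 1_000_000):
    print(f"p = {p:8d}  bound on generalization error <= {vc_bound(p, d_vc):.3f}")
```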
Generalization for p > N. The meaning of generalization: for p up to N, any set of associations is learnable, so learning amounts to learning by heart and says nothing about the underlying rule; generalization becomes meaningful only for p > N.