Ventral Visual Stream and Deep Networks

Ma191b Winter 2017 Geometry of Neuroscience

References for this lecture:
Tomaso A. Poggio and Fabio Anselmi, Visual Cortex and Deep Networks, MIT Press, 2016.
F. Cucker, S. Smale, On the mathematical foundations of learning, Bulletin of the American Mathematical Society 39 (2001), N.1, 1-49.

Modeling the Ventral Visual Stream via Deep Neural Networks
- the ventral visual stream is considered responsible for object recognition abilities (figure: dorsal stream in green, ventral stream in purple)
- it is responsible for roughly the first 100 msec of processing of visual information, from the initial visual stimulus to the activation of inferior temporal cortex neurons

A mathematical model describing learning of invariant representations in the Ventral Visual Stream:
- working hypothesis: the main computational goal of the Ventral Visual Stream is to compute neural representations of images that are invariant with respect to certain groups of transformations (mostly affine transformations: translations, rotations, scaling)
- the model is based on unsupervised learning: far fewer examples are needed to train a classifier for recognition if an invariant representation is used
- Gabor functions and frames arise as the optimal templates for simultaneously maximizing invariance with respect to translations and scaling
- architecture: a hierarchy of Hubel-Wiesel modules

A significant difference between (supervised) learning algorithms and the functioning of the brain is that learning in the brain seems to require a very small number of labelled examples.
- conjecture: the key to reducing the sample complexity of object recognition is invariance under transformations
- two aspects: recognition and categorization
- for recognition it is clear that complexity is greatly increased by transformations (the same object seen from different perspectives, in different lighting conditions, etc.)
- for categorization as well (distinguishing between different classes of objects: cats/dogs, etc.), transformations can hide intrinsic characteristics of an object
- empirical evidence: the accuracy of a classifier, as a function of the number of examples, is greatly improved in the presence of an oracle that factors out transformations (figure: solid curve, rectified; dashed curve, non-rectified)

- the order of magnitude of the number of different object categorizations (e.g. distinguishable types of dogs) is smaller than the order of magnitude of the number of different viewpoints generated by group actions
- factoring out the variability due to transformations therefore greatly reduces the complexity of the learning task
- sample complexity refers to the number of examples needed to estimate a target function within an assigned error rate
- this transforms the problem of distinguishing images into the problem of distinguishing orbits under a given group action

Feedforward architecture in the ventral stream: two main stages
1. retinotopic areas computing a representation that is invariant under affine transformations
2. approximate invariance to other object-specific dependencies, not described by group actions (parallel pathways)
The first stage is realized through Gabor frame analysis; the overall model relies on a mathematical model of learning (Cucker-Smale).

Architecture layers (figure): each red circle is the vector computed by one of the modules, and the double arrow indicates its receptive field; the image sits at level zero (bottom), and the vector computed at the top layer consists of invariant features (fed as input to a supervised learning classifier).

A biologically plausible algorithm (Hubel-Wiesel modules): two types of neurons with distinct roles
- simple cells: compute an inner product with a template $t \in \mathcal{H}$, a Hilbert space; a further non-linear operation (a threshold) is also applied
- complex cells: aggregate the outputs of several simple cells
Steps (assume $G$ is a finite subgroup of the group of affine transformations):
1. unsupervised learning of the group $G$ by storing in memory the orbit $G t = \{ g t : g \in G \}$ of a set of templates $t \in \mathcal{H}$
2. computation of the invariant representation: for a new image $I \in \mathcal{H}$, compute $\langle g t^k, I \rangle$ for $g \in G$ and templates $t^k$, $k = 1, \dots, K$, and
$$ \mu^k_h(I) = \frac{1}{\# G} \sum_{g \in G} \sigma_h\big( \langle g t^k, I \rangle \big), $$
where $\sigma_h$ is a set of nonlinear functions (e.g. threshold functions)
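A minimal numerical sketch of the two steps above, assuming images and templates are vectors in $\mathbb{R}^d$ and using cyclic shifts as a stand-in for the finite group of affine transformations; the threshold nonlinearities $\sigma_h$ and all parameter values are illustrative, not part of the original model:

```python
import numpy as np

def orbit(template, G):
    """Store the orbit {g t : g in G}; here g acts by cyclic shift (a stand-in for affine maps)."""
    return np.stack([np.roll(template, g) for g in G])

def signature(image, template_orbits, thresholds):
    """Hubel-Wiesel module: simple cells compute <g t^k, I>, complex cells pool over G.

    Returns mu[k, h] = (1/#G) sum_g sigma_h(<g t^k, I>), with sigma_h(x) = 1 if x > h else 0.
    """
    mu = np.zeros((len(template_orbits), len(thresholds)))
    for k, orb in enumerate(template_orbits):
        dots = orb @ image                     # simple cells: inner products over the orbit
        for h, thr in enumerate(thresholds):
            mu[k, h] = np.mean(dots > thr)     # complex cell: pooled thresholded responses
    return mu

rng = np.random.default_rng(0)
d, K = 64, 8
G = range(d)                                   # the finite group: all cyclic shifts
templates = rng.standard_normal((K, d))
orbits = [orbit(t, G) for t in templates]
thresholds = np.linspace(-1.0, 1.0, 5)

I = rng.standard_normal(d)
gI = np.roll(I, 17)                            # a transformed copy of the same image
print(np.allclose(signature(I, orbits, thresholds),
                  signature(gI, orbits, thresholds)))   # True: the signature is G-invariant
```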

The computed $\mu^k_h(I)$ is called the signature of $I$; the signature $\mu^k_h(I)$ is clearly $G$-invariant.
Selectivity question: how well does $\mu^k_h(I)$ distinguish different objects? Here "different" means distinct orbits, $G I \neq G I'$.
Main Selectivity Result (Poggio-Anselmi): one wants to distinguish images within a given set of $N$ images, with an error of at most a given $\epsilon > 0$; the signatures $\mu^k_h(I)$ can $\epsilon$-approximate the distance between pairs among the $N$ images with probability $1 - \delta$, provided that the number of templates used is at least
$$ K > \frac{c}{\epsilon^2} \log \frac{N}{\delta}. $$
A more detailed discussion of this statement is given below; the main point is that one needs of the order of $\log(N)$ templates to distinguish $N$ images.
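As a rough numerical illustration of this scaling (with an illustrative constant $c = 1$, not taken from the source): for $N = 10^4$ images, accuracy $\epsilon = 0.1$ and confidence $\delta = 0.01$, the bound asks for $K > \frac{1}{(0.1)^2} \log\frac{10^4}{0.01} = 100 \cdot \log(10^6) \approx 1.4 \times 10^3$ templates, and doubling $N$ adds only about $100 \log 2 \approx 69$ more.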

General problem: two random variables $x, y$ are probabilistically related, the relation being described by a probability distribution $P(x,y)$; one considers the square loss minimization problem
$$ E(f) = \int (y - f(x))^2 \, P(x,y) \, dx \, dy. $$
The distribution itself is unknown, so one minimizes instead the empirical error
$$ E_N(f) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i))^2 $$
over a set of randomly sampled data points $\{(x_i, y_i)\}_{i=1,\dots,N}$. If $f_N$ minimizes the empirical error, one wants the probability $P(|E(f_N) - E_N(f_N)| > \epsilon)$ to be sufficiently small. The problem depends on the function space where $f_N$ lives.
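A small sketch of the gap between $E(f_N)$ and $E_N(f_N)$, assuming a synthetic data distribution and a polynomial hypothesis class (both choices are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw (x, y) with y = sin(x) + noise; sin is the (unknown) illustrative target."""
    x = rng.uniform(-np.pi, np.pi, n)
    return x, np.sin(x) + 0.3 * rng.standard_normal(n)

# empirical error minimization over degree-5 polynomials (least squares fit)
x_train, y_train = sample(50)
f_N = np.poly1d(np.polyfit(x_train, y_train, deg=5))

E_N = np.mean((y_train - f_N(x_train)) ** 2)   # empirical error E_N(f_N) on the sample
x_test, y_test = sample(200_000)               # a large fresh sample approximates the true error E(f_N)
E = np.mean((y_test - f_N(x_test)) ** 2)
print(f"E_N(f_N) = {E_N:.3f}, E(f_N) ≈ {E:.3f}, gap ≈ {abs(E - E_N):.3f}")
```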

General setting (F. Cucker, S. Smale, On the mathematical foundations of learning, Bulletin of the American Mathematical Society 39 (2001), N.1, 1-49):
$X$ a compact manifold, $Y = \mathbb{R}^k$ (for simplicity $k = 1$), $Z = X \times Y$ with a Borel measure $\rho$.
For a real-valued random variable $\xi$ on the probability space $(Z, \rho)$, expectation value and variance are
$$ E(\xi) = \int_Z \xi \, d\rho, \qquad \sigma^2(\xi) = E((\xi - E(\xi))^2) = E(\xi^2) - E(\xi)^2. $$
For a function $f : X \to Y$, the least squares error of $f$ is
$$ E(f) = \int_Z (f(x) - y)^2 \, d\rho, $$
which measures the average error incurred in using $f(x)$ as a model of the dependence of $y$ on $x$. Problem: how to minimize the error?

Conditional probability $\rho(y|x)$ (a probability measure on $Y$) and marginal probability $\rho_X(S) = \rho(\pi^{-1}(S))$ on $X$, with $\pi : Z = X \times Y \to X$ the projection. The relation between these measures is
$$ \int_Z \varphi(x,y) \, d\rho = \int_X \left( \int_Y \varphi(x,y) \, d\rho(y|x) \right) d\rho_X. $$
The breaking of $\rho(x,y)$ into $\rho(y|x)$ and $\rho_X$ corresponds to the breaking of $Z$ into input $X$ and output $Y$.

Regression function $f_\rho : X \to Y$,
$$ f_\rho(x) = \int_Y y \, d\rho(y|x). $$
Assumption: $f_\rho$ is bounded. For fixed $x \in X$, map $Y$ to $\mathbb{R}$ via $y \mapsto y - f_\rho(x)$; its expectation value is zero, so the variance is
$$ \sigma^2(x) = \int_Y (y - f_\rho(x))^2 \, d\rho(y|x). $$
The averaged variance
$$ \sigma_\rho^2 = \int_X \sigma^2(x) \, d\rho_X = E(f_\rho) $$
measures how well conditioned $\rho$ is. Note: in general $\rho$ and $f_\rho$ are not known, but $\rho_X$ is known.

Error, regression, and variance:
$$ E(f) = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X + \sigma_\rho^2. $$
What this says: $\sigma_\rho^2$ is a lower bound for the error $E(f)$ for all $f$, and $f = f_\rho$ has the smallest possible error (which depends only on $\rho$).
Why the identity holds: write $f(x) - y = (f(x) - f_\rho(x)) + (f_\rho(x) - y)$ and expand the square; the cross term integrates to zero because $\int_Y (f_\rho(x) - y) \, d\rho(y|x) = 0$ by definition of $f_\rho$, and the two remaining terms give $\int_X (f - f_\rho)^2 \, d\rho_X$ and $\sigma_\rho^2$.

Goal: learn (= find a good approximation of) $f_\rho$ from random samples of $Z$. A sample $z \in Z^N$,
$$ z = ((x_1, y_1), \dots, (x_N, y_N)), $$
is a set of points $(x_i, y_i)$ drawn independently with probability $\rho$. The empirical error is
$$ E_z(f) = \frac{1}{N} \sum_{i=1}^N (f(x_i) - y_i)^2, $$
and for a random variable $\xi$ the empirical mean is
$$ E_z(\xi) = \frac{1}{N} \sum_{i=1}^N \xi(z_i), \qquad z_i = (x_i, y_i). $$
Given $f : X \to Y$, take $f_Y : Z \to Y$ to be $f_Y : (x,y) \mapsto f(x) - y$; then
$$ E(f) = E(f_Y^2), \qquad E_z(f) = E_z(f_Y^2). $$

Facts of probability theory (quantitative versions of the law of large numbers): let $\xi$ be a random variable on a probability space $Z$ with mean $E(\xi) = \mu$ and variance $\sigma^2(\xi) \le \sigma^2$.
Chebyshev: for all $\epsilon > 0$,
$$ P\Big\{ z \in Z^m : \Big| \frac{1}{m} \sum_{i=1}^m \xi(z_i) - \mu \Big| \ge \epsilon \Big\} \le \frac{\sigma^2}{m \epsilon^2}. $$
Bernstein: if $|\xi(z) - E(\xi)| \le M$ for almost all $z \in Z$, then for all $\epsilon > 0$,
$$ P\Big\{ z \in Z^m : \Big| \frac{1}{m} \sum_{i=1}^m \xi(z_i) - \mu \Big| \ge \epsilon \Big\} \le 2 \exp\left( - \frac{m \epsilon^2}{2 (\sigma^2 + \frac{1}{3} M \epsilon)} \right). $$
Hoeffding:
$$ P\Big\{ z \in Z^m : \Big| \frac{1}{m} \sum_{i=1}^m \xi(z_i) - \mu \Big| \ge \epsilon \Big\} \le 2 \exp\left( - \frac{m \epsilon^2}{2 M^2} \right). $$
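A quick simulation comparing the empirical deviation probability with the Hoeffding bound for an illustrative bounded variable ($\xi$ uniform on $[-1,1]$, so $\mu = 0$ and $M = 1$); the sample size and threshold are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
m, M, eps, trials = 200, 1.0, 0.2, 20_000

# xi uniform on [-1, 1]: mean mu = 0 and |xi - mu| <= M = 1 almost surely
samples = rng.uniform(-1.0, 1.0, size=(trials, m))
deviations = np.abs(samples.mean(axis=1))        # |1/m sum xi(z_i) - mu| for each trial

empirical = np.mean(deviations >= eps)           # observed P(|mean - mu| >= eps)
hoeffding = 2 * np.exp(-m * eps**2 / (2 * M**2)) # the bound 2 exp(-m eps^2 / (2 M^2))
print(f"empirical: {empirical:.4f}  <=  Hoeffding bound: {hoeffding:.4f}")
```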

Defect function of $f : X \to Y$:
$$ L_z(f) := E(f) - E_z(f), $$
the discrepancy between the error and the empirical error (only $E_z(f)$ is measured directly).
Estimate of the defect: if $|f(x) - y| \le M$ almost everywhere, then for all $\epsilon > 0$, with $\sigma^2$ the variance of $f_Y^2$,
$$ P\{ z \in Z^m : |L_z(f)| \le \epsilon \} \ge 1 - 2 \exp\left( - \frac{m \epsilon^2}{2 (\sigma^2 + \frac{1}{3} M^2 \epsilon)} \right), $$
from the previous Bernstein estimate with $\xi = f_Y^2$.
When is $|f(x) - y| \le M$ a.e. satisfied? E.g. for $M = M_\rho + P$ with
$$ M_\rho = \inf\{ M : \{ (x,y) \in Z : |y - f_\rho(x)| \ge M \} \text{ has measure zero} \}, \qquad P \ge \sup_{x \in X} |f(x) - f_\rho(x)|. $$

Hypothesis space: a learning process requires the datum of a class of functions (the hypothesis space) within which the best approximation of $f_\rho$ is sought. Let $C(X)$ be the algebra of continuous functions on the topological space $X$, and $H \subset C(X)$ a compact subset (not necessarily a subalgebra). Look for a minimizer (not necessarily unique)
$$ f_H = \operatorname{argmin}_{f \in H} \int_Z (f(x) - y)^2 \, d\rho. $$
Because $E(f) = \int_X (f - f_\rho)^2 \, d\rho_X + \sigma_\rho^2$, this is also a minimizer
$$ f_H = \operatorname{argmin}_{f \in H} \int_X (f - f_\rho)^2 \, d\rho_X. $$
Continuity: if for $f \in H$ one has $|f(x) - y| \le M$ a.e., then $|E(f_1) - E(f_2)| \le 2M \, \| f_1 - f_2 \|_\infty$, and similarly for $E_z$, so $E$ and $E_z$ are continuous. Compactness of $H$ ensures existence of the minimizer but not uniqueness (there is a uniqueness result when $H$ is convex).

Empirical target function: $f_{H,z}$ is a minimizer (in general not unique)
$$ f_{H,z} = \operatorname{argmin}_{f \in H} \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2. $$
Normalized error: $E_H(f) = E(f) - E(f_H)$, so $E_H(f) \ge 0$, vanishing at $f_H$.
Sample error: $E_H(f_{H,z})$. Then
$$ E(f_{H,z}) = E_H(f_{H,z}) + E(f_H) = \int_X (f_{H,z} - f_\rho)^2 \, d\rho_X + \sigma_\rho^2, $$
so one estimates $E(f_{H,z})$ by estimating the sample error $E_H(f_{H,z})$ and the approximation error $E(f_H)$: the first depends on $H$ and on the sample $z$, while the second is independent of the sample $z$.

Bias-variance trade-off: bias = approximation error, variance = sample error.
- fix $H$: the sample error $E_H(f_{H,z})$ decreases when increasing the number $m$ of samples
- fix $m$: the approximation error $E(f_H)$ decreases when enlarging $H$
Procedure:
1. estimate how close $f_{H,z}$ and $f_H$ are, depending on $m$
2. decide how to choose $\dim H$ when $m$ is fixed
First problem: how many examples does one need to draw to be able to say, with confidence $1 - \delta$, that $\int_X (f_{H,z} - f_H)^2 \, d\rho_X \le \epsilon$?

Uniformity estimate (Vapnik's statistical learning theory).
Covering number: for a metric space $S$ and $s > 0$, $\mathcal{N}(S, s)$ is the minimal $\ell \in \mathbb{N}$ such that $S$ can be covered by $\ell$ disks of radius $s$; for $S$ compact, $\mathcal{N}(S, s)$ is finite.
Uniform estimate: for $H \subset C(X)$ compact, if $|f(x) - y| \le M$ a.e. for all $f \in H$, then for all $\epsilon > 0$
$$ P\Big\{ z \in Z^m : \sup_{f \in H} |L_z(f)| \le \epsilon \Big\} \ge 1 - \mathcal{N}\Big( H, \frac{\epsilon}{8M} \Big) \, 2 \exp\left( - \frac{m \epsilon^2}{4 (2\sigma^2 + \frac{1}{3} M^2 \epsilon)} \right), $$
with $\sigma^2 = \sup_{f \in H} \sigma^2(f_Y^2)$.
Main idea: as in the previous estimate of the defect, but passing from a single function to a family of functions, using a uniformity argument based on the covering number.

Estimate of the sample error: for $H \subset C(X)$ compact, with $|f(x) - y| \le M$ a.e. for all $f \in H$ and $\sigma^2 = \sup_{f \in H} \sigma^2(f_Y^2)$, for all $\epsilon > 0$
$$ P\{ z \in Z^m : E_H(f_{H,z}) \le \epsilon \} \ge 1 - \mathcal{N}\Big( H, \frac{\epsilon}{16M} \Big) \, 2 \exp\left( - \frac{m \epsilon^2}{8 (4\sigma^2 + \frac{1}{3} M^2 \epsilon)} \right), $$
obtained from the previous estimate using $L_z(f) = E(f) - E_z(f)$.
This answers the first question: to ensure probability above $1 - \delta$ one needs to take at least
$$ m \ge \frac{8 (4\sigma^2 + \frac{1}{3} M^2 \epsilon)}{\epsilon^2} \left( \log\Big( 2\, \mathcal{N}\Big( H, \frac{\epsilon}{16M} \Big) \Big) + \log\Big( \frac{1}{\delta} \Big) \right), $$
obtained by setting
$$ \delta = \mathcal{N}\Big( H, \frac{\epsilon}{16M} \Big) \, 2 \exp\left( - \frac{m \epsilon^2}{8 (4\sigma^2 + \frac{1}{3} M^2 \epsilon)} \right). $$
Various techniques are needed for estimating the covering numbers $\mathcal{N}(H, s)$, depending on the choice of the compact set $H$.
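A small helper evaluating the sample-size bound above; the covering number, $M$, $\sigma^2$, $\epsilon$ and $\delta$ below are illustrative placeholders, not values from the source:

```python
import math

def required_samples(cov_number, M, sigma2, eps, delta):
    """m >= 8(4 sigma^2 + M^2 eps / 3) / eps^2 * (log(2 N(H, eps/(16M))) + log(1/delta))."""
    return (8 * (4 * sigma2 + M**2 * eps / 3) / eps**2
            * (math.log(2 * cov_number) + math.log(1 / delta)))

# illustrative numbers: covering number 10^6, M = 1, sigma^2 = 1, eps = 0.1, delta = 0.01
print(math.ceil(required_samples(1e6, M=1.0, sigma2=1.0, eps=0.1, delta=0.01)))
```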

Second question: estimating the approximation error. In
$$ E(f_{H,z}) = E_H(f_{H,z}) + E(f_H), $$
focus on $E(f_H)$, which depends on $H$ and $\rho$:
$$ E(f_H) = \int_X (f_H - f_\rho)^2 \, d\rho_X + \sigma_\rho^2. $$
The second term is independent of $H$, so focus on the first; $f_\rho$ is bounded, but in general neither in $H$ nor in $C(X)$.
Main idea: use a finite dimensional hypothesis space $H$ and estimate in terms of the growth of eigenvalues of an operator. Main technique: Fourier analysis, Hilbert spaces.

Fourier series: start with the case of the torus $X = T^n = (S^1)^n$. In the Hilbert space $L^2(X)$ (Lebesgue measure) take the complete orthonormal system
$$ \varphi_\alpha(x) = (2\pi)^{-n/2} \exp(i \alpha \cdot x), \qquad \alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{Z}^n, $$
with Fourier series expansion $f = \sum_{\alpha \in \mathbb{Z}^n} c_\alpha \varphi_\alpha$. Consider the finite dimensional subspaces $H_N \subset L^2(X)$ spanned by the $\varphi_\alpha$ with $|\alpha| \le B$, of dimension $N = N(B)$, the number of lattice points in the ball of radius $B$ in $\mathbb{R}^n$ (of the order of $B^n$). The hypothesis space is then $H = H_{N,R}$, the ball of radius $R$ in $H_N$.
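A 1-D illustration ($n = 1$, so $X = S^1$) of the spaces $H_N$: project a function onto the modes with $|\alpha| \le B$ and watch the $L^2$ approximation error decrease as $B$ grows; the target function is an arbitrary example, not from the source:

```python
import numpy as np

n_grid = 1024
x = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
f = np.abs(np.sin(x)) ** 3                       # illustrative target function on the circle

c = np.fft.fft(f) / n_grid                       # Fourier coefficients c_alpha
freqs = np.fft.fftfreq(n_grid, d=1.0 / n_grid)   # integer frequencies alpha

for B in (1, 2, 4, 8, 16):
    c_trunc = np.where(np.abs(freqs) <= B, c, 0) # keep only modes with |alpha| <= B (the space H_N)
    f_B = np.fft.ifft(c_trunc).real              # projection of f onto H_N
    err = np.sqrt(np.mean((f - f_B) ** 2))       # discrete L^2 approximation error
    print(f"B = {B:2d}  ||f - f_B||_2 ≈ {err:.5f}")
```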

Laplacian on the torus $X = T^n$: $\Delta : C^\infty(X) \to C^\infty(X)$,
$$ \Delta(f) = - \sum_{i=1}^n \frac{\partial^2 f}{\partial x_i^2}. $$
The Fourier basis functions $\varphi_\alpha$ are eigenfunctions of $\Delta$ with eigenvalue $|\alpha|^2$.
More general $X$: a bounded domain $X \subset \mathbb{R}^n$ with smooth boundary $\partial X$, and a complete orthonormal system $\{\varphi_k\}$ of $L^2(X)$ (Lebesgue measure) consisting of eigenfunctions of the Laplacian,
$$ \Delta(\varphi_k) = \zeta_k \varphi_k, \qquad \varphi_k|_{\partial X} \equiv 0, \quad k \ge 1, \qquad 0 < \zeta_1 \le \zeta_2 \le \cdots \le \zeta_k \le \cdots $$
Take the subspace $H_N$ of $L^2(X)$ generated by $\{\varphi_1, \dots, \varphi_N\}$ and the hypothesis space $H = H_{N,R}$, the ball of radius $R$ in $H_N$.

Construction of $f_H$: take the Lebesgue measure $\mu$ on $X$ and the measure $\rho_X$ (the marginal probability induced on $X$ by the measure $\rho$ on $Z = X \times Y$), and consider the regression function
$$ f_\rho(x) = \int_Y y \, d\rho(y|x). $$
Assumption: $f_\rho$ is bounded on $X$, hence in $L^2_\rho(X)$ and in $L^2_\mu(X)$.
Choice of $R$: assume also that $R \ge \|f_\rho\|_\infty$, which implies $R \ge \|f_\rho\|_\rho$. Then $f_H$ is the orthogonal projection of $f_\rho$ onto $H_N$ with respect to the inner product of $L^2_\rho(X)$.
Goal: estimate the approximation error $E(f_H)$ for this $f_H$.

Distortion factor: the identity map on bounded functions extends to an operator
$$ J : L^2_\mu(X) \to L^2_\rho(X). $$
The distortion of $\rho$ with respect to $\mu$ is the operator norm $D_{\rho\mu} = \|J\|$: it measures how much $\rho$ distorts the ambient measure $\mu$. Reasonable assumption: the distortion is finite. In general $\rho$ is not known, but $\rho_X$ is known, so $D_{\rho\mu}$ can be computed.

Weyl law on the rate of growth of the eigenvalues of the Laplacian (acting on functions vanishing on the boundary of a domain $X \subset \mathbb{R}^n$):
$$ \lim_{\lambda \to \infty} \frac{N(\lambda)}{\lambda^{n/2}} = (2\pi)^{-n} \, B_n \, \mathrm{Vol}(X), $$
where $B_n$ is the volume of the unit ball in $\mathbb{R}^n$ and $N(\lambda)$ is the number of eigenvalues (with multiplicity) up to $\lambda$.
Weyl law, Li-Yau version:
$$ \zeta_k \ge \frac{n}{n+2} \, 4\pi^2 \left( \frac{k}{B_n \, \mathrm{Vol}(X)} \right)^{2/n} $$
(P. Li and S.-T. Yau, On the parabolic kernel of the Schrödinger operator, Acta Math. 156 (1986), 153-201). From this, using the explicit volume $B_n$, one gets the weaker estimate
$$ \zeta_k \ge \left( \frac{k}{\mathrm{Vol}(X)} \right)^{2/n}. $$

Approximation error and Weyl law. Define the norm $\|\cdot\|_K$: for $f = \sum_{k=1}^\infty c_k \varphi_k$, with $\varphi_k$ the eigenfunctions of the Laplacian,
$$ \|f\|_K := \left( \sum_{k=1}^\infty c_k^2 \, \zeta_k \right)^{1/2}, $$
like an $L^2$-norm, but with the $\ell^2$ norm of $c = (c_k)$ weighted by the eigenvalues of the Laplacian.
Approximation error estimate: for $H$ and $f_H$ as above,
$$ E(f_H) \le D_{\rho\mu}^2 \, \|f_\rho\|_K^2 \left( \frac{\mathrm{Vol}(X)}{N+1} \right)^{2/n} + \sigma_\rho^2, $$
proved using the Weyl law and the estimates
$$ \|f_\rho - f_H\|_\rho = d_\rho(f_\rho, H_N) \le \|J\| \, d_\mu(f_\rho, H_N), $$
$$ d_\mu(f_\rho, H_N)^2 = \Big\| \sum_{k=N+1}^\infty c_k \varphi_k \Big\|_\mu^2 = \sum_{k=N+1}^\infty c_k^2 = \sum_{k=N+1}^\infty c_k^2 \, \zeta_k \, \frac{1}{\zeta_k} \le \frac{1}{\zeta_{N+1}} \|f_\rho\|_K^2, $$
where $f_\rho = \sum_k c_k \varphi_k$.

Solution of the bias-variance problem: minimize $E(f_{H,z})$ by minimizing both the sample error and the approximation error, as a function of $N \in \mathbb{N}$ (for the choice of hypothesis space $H = H_{N,R}$). Select the integer $N \in \mathbb{N}$ that minimizes
$$ A(N) + \epsilon(N), $$
where $\epsilon = \epsilon(N)$ is as in the previous estimate of the sample error and
$$ A(N) = D_{\rho\mu}^2 \, \|f_\rho\|_K^2 \left( \frac{\mathrm{Vol}(X)}{N+1} \right)^{2/n} + \sigma_\rho^2. $$
From the previous relation between $m$, $R = \|f_\rho\|$, $\delta$ and $\epsilon$ one obtains the constraint
$$ \epsilon - \frac{288 M^2}{m} \left( N \log\Big( \frac{96 R M}{\epsilon} + 1 \Big) + \log\Big( \frac{1}{\delta} \Big) \right) \ge 0, $$
and looks for the $N$ that minimizes $\epsilon$ subject to this constraint. There is no explicit closed form solution for the $N$ minimizing $A(N) + \epsilon(N)$, but it can be estimated numerically in specific cases.
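A sketch of the numerical minimization of $A(N) + \epsilon(N)$, with $\epsilon(N)$ obtained from the constraint above by fixed-point iteration; all constants ($D_{\rho\mu}$, $\|f_\rho\|_K$, $\mathrm{Vol}(X)$, $M$, $R$, $m$, $\delta$) are illustrative placeholders, not values from the source:

```python
import math

# Illustrative constants: distortion D, K-norm of f_rho, domain volume, dimension, etc.
D, fK, vol, n = 1.0, 1.0, 1.0, 2
M, R, m, delta, sigma_rho2 = 1.0, 1.0, 10_000, 0.01, 0.0

def A(N):
    """Approximation error bound A(N) = D^2 ||f_rho||_K^2 (Vol(X)/(N+1))^(2/n) + sigma_rho^2."""
    return D**2 * fK**2 * (vol / (N + 1)) ** (2 / n) + sigma_rho2

def eps(N, iters=200):
    """Smallest eps with eps >= (288 M^2/m)(N log(96RM/eps + 1) + log(1/delta)), by fixed-point iteration."""
    e = 1.0
    for _ in range(iters):
        e = 288 * M**2 / m * (N * math.log(96 * R * M / e + 1) + math.log(1 / delta))
    return e

best_N = min(range(1, 500), key=lambda N: A(N) + eps(N))
print(best_N, A(best_N) + eps(best_N))
```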

Back to the visual cortex model (Poggio-Anselmi): the stored templates $t^k$, $k = 1, \dots, K$, and the new images $I$ live in some finite dimensional approximation $H_N$ of a Hilbert space, and the simple cells compute the inner products $\langle g t^k, I \rangle$ in $H_N$.
Estimate in terms of 1D projections: $I \in \mathbb{R}^d$ for some (in general large) $d$; projections $\langle t, I \rangle$ for a set of normalized vectors $t \in S^{d-1}$ (the unit sphere). Set
$$ Z : S^{d-1} \to \mathbb{R}_+, \qquad Z(t) = | \mu_t(I) - \mu_t(I') |, $$
and think of the distance $d(I, I')$ between images as a distance between two probability distributions $P_I, P_{I'}$ on $\mathbb{R}^d$, measured in terms of
$$ d(P_I, P_{I'}) = \int_{S^{d-1}} Z(t) \, d\mathrm{vol}(t). $$

Model this in terms of
$$ \hat{d}(P_I, P_{I'}) := \frac{1}{K} \sum_{k=1}^K Z(t^k), $$
and evaluate the error incurred in using $\hat{d}(P_I, P_{I'})$ (1D projections onto the templates) to estimate $d(P_I, P_{I'})$. As in the Cucker-Smale setting, evaluate the error and the probability of error in terms of the Hoeffding estimate:
$$ | d(P_I, P_{I'}) - \hat{d}(P_I, P_{I'}) | = \Big| \frac{1}{K} \sum_{k=1}^K Z(t^k) - E(Z) \Big|, $$
with probability of error
$$ P\Big( \Big| \frac{1}{K} \sum_{k=1}^K Z(t^k) - E(Z) \Big| > \epsilon \Big) \le 2 \exp\Big( - \frac{K \epsilon^2}{2 M^2} \Big) $$
if $|Z(t) - E(Z)| \le M$ almost everywhere.
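A sketch of the Monte Carlo estimator $\hat d$, where for illustration $Z(t)$ is taken to be the difference of the mean 1-D projections of two Gaussian point clouds and the templates $t^k$ are drawn uniformly on the sphere (these modeling choices are not from the source):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 100
cloud_I = rng.standard_normal((5000, d))          # samples representing P_I
cloud_J = rng.standard_normal((5000, d)) + 0.5    # samples representing P_{I'}

def Z(t):
    """Illustrative Z(t): |mu_t(I) - mu_t(I')| with mu_t the mean 1-D projection onto t."""
    return abs((cloud_I @ t).mean() - (cloud_J @ t).mean())

def d_hat(K):
    """hat d = (1/K) sum_k Z(t^k) over K random unit templates t^k in S^{d-1}."""
    T = rng.standard_normal((K, d))
    T /= np.linalg.norm(T, axis=1, keepdims=True)
    return np.mean([Z(t) for t in T])

for K in (10, 100, 1000):
    print(K, round(d_hat(K), 4))                  # the estimate stabilizes as K grows
```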

One wants this estimate to hold uniformly over a set of $N$ images, i.e. the same bound for each pair, so the error probability is at most
$$ N(N-1) \exp\Big( - \frac{K \epsilon^2}{2 M_{\min}^2} \Big) \le N^2 \exp\Big( - \frac{K \epsilon^2}{2 M_{\min}^2} \Big), $$
with $M_{\min}$ the smallest constant $M$ that works for all pairs. This is at most a given $\delta^2$ whenever
$$ K \ge \frac{4 M_{\min}^2}{\epsilon^2} \log \frac{N}{\delta}. $$

Group actions and orbits: $\{t^k\}_{k=1,\dots,K}$ given templates, $G$ a finite subgroup of the affine group (translations, rotations, scaling); $G$ acts on the set of images $I$, with orbit $GI$. Let $P : \mathbb{R}^d \to \mathbb{R}^K$ be the projection of images $I$ onto the span of the templates $t^k$.
Johnson-Lindenstrauss lemma (low distortion embeddings of sets of points from a high-dimensional into a low-dimensional Euclidean space; here the special case where the map is an orthogonal projection): given $0 < \epsilon < 1$ and a finite set $X$ of $n$ points in $\mathbb{R}^d$, take $K > 8 \log(n)/\epsilon^2$; then there is a linear map $f$, given by a multiple of an orthogonal projection onto a (random) subspace of dimension $K$, such that for all $u, v \in X$
$$ (1 - \epsilon) \, \| u - v \|^2_{\mathbb{R}^d} \le \| f(u) - f(v) \|^2_{\mathbb{R}^K} \le (1 + \epsilon) \, \| u - v \|^2_{\mathbb{R}^d}. $$
The result depends on the concentration of measure phenomenon.
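A quick numerical check of the distortion bound, using a Gaussian random projection (a standard variant of the orthogonal projection in the statement); the number of points, ambient dimension and $\epsilon$ are illustrative:

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

rng = np.random.default_rng(4)
n, d, eps = 200, 10_000, 0.5
K = int(np.ceil(8 * np.log(n) / eps**2))          # K > 8 log(n) / eps^2, here K = 170

X = rng.standard_normal((n, d))                   # n points in R^d
proj = rng.standard_normal((d, K)) / np.sqrt(K)   # scaled Gaussian random map f : R^d -> R^K
Y = X @ proj

i, j = np.triu_indices(n, k=1)
ratios = pairwise_sq_dists(Y)[i, j] / pairwise_sq_dists(X)[i, j]
within = np.mean((ratios > 1 - eps) & (ratios < 1 + eps))
print(f"fraction of pairs with squared-distance distortion within ±{eps}: {within:.4f}")
```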

Up to a scaling, for a good choice of the subspace spanned by the templates, one can take $P$ to satisfy the Johnson-Lindenstrauss lemma. Starting from a finite set $X = \{u\}$ of images, one can generate another set by including all group translates,
$$ X_G = \{ g \cdot u : g \in G, \ u \in X \}. $$
For the Johnson-Lindenstrauss lemma, the required accuracy for $X_G$ then needs
$$ K > 8 \, \frac{\log(n \, \# G)}{\epsilon^2}, $$
so one can estimate sufficiently well the distance between images in $\mathbb{R}^d$ using the distance between the projections $\langle t^k, g I \rangle$ of their group orbits onto the space of templates, with $\langle t^k, g I \rangle = \langle g^{-1} t^k, I \rangle$ for unitary representations. It would seem that one needs to increase the number of templates from $K$ to $\# G \cdot K$ in order to distinguish orbits, but in fact, by the argument above, an increase
$$ K \mapsto K + 8 \log(\# G)/\epsilon^2 $$
suffices.

Given $\langle t^k, g I \rangle = \langle g^{-1} t^k, I \rangle$ computed by the simple cells, the pooling performed by the complex cells computes
$$ \mu^k_h(I) = \frac{1}{\# G} \sum_{g \in G} \sigma_h\big( \langle g t^k, I \rangle \big), $$
with $\sigma_h$ a set of nonlinear functions. Examples:
$$ \mu^k_{\mathrm{average}}(I) = \frac{1}{\# G} \sum_{g \in G} \langle g t^k, I \rangle, \qquad \mu^k_{\mathrm{energy}}(I) = \frac{1}{\# G} \sum_{g \in G} \langle g t^k, I \rangle^2, \qquad \mu^k_{\max}(I) = \max_{g \in G} \langle g t^k, I \rangle. $$
Other nonlinear functions can be used; an especially useful case is when $\sigma_h : \mathbb{R} \to \mathbb{R}_+$ is injective. Note: the stored knowledge of $g t^k$ for $g \in G$ makes the system automatically invariant with respect to the $G$-action on images $I$.
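The three pooling variants above, in the same toy setup as the earlier sketch (cyclic shifts standing in for the group action; all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 64
t = rng.standard_normal(d)
orbit_t = np.stack([np.roll(t, g) for g in range(d)])   # {g t : g in G}, G = cyclic shifts

def pool(image):
    dots = orbit_t @ image                               # simple-cell responses <g t, I>
    return {"average": dots.mean(), "energy": (dots ** 2).mean(), "max": dots.max()}

I = rng.standard_normal(d)
print(pool(I))
print(pool(np.roll(I, 9)))                               # same pooled values: G-invariance
```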

Localization and the uncertainty principle: one would like the templates $t(x)$ to be localized in $x$ (small outside of some interval $\Delta x$) and would also like $\hat{t}$ to be localized in frequency (small outside an interval $\Delta \omega$); but the uncertainty principle (localized in $x$ / delocalized in $\omega$) imposes
$$ \Delta x \, \Delta \omega \ge 1. $$

Optimal localization: the best possible localization, $\Delta x \, \Delta \omega = 1$, is realized by the Gabor functions
$$ t(x) = e^{i \omega_0 x} \, e^{ - \frac{x^2}{2 \sigma^2} }. $$
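A short sketch constructing the Gabor template above and numerically checking its joint localization, measuring the spreads $\Delta x$ and $\Delta \omega$ as standard deviations of $|t|^2$ and $|\hat t|^2$ (in this convention the Gaussian envelope gives the minimal product $1/2$); $\omega_0$ and $\sigma$ are illustrative:

```python
import numpy as np

omega0, sigma = 5.0, 1.0                           # illustrative carrier frequency and width
x = np.linspace(-10, 10, 4001)
t = np.exp(1j * omega0 * x) * np.exp(-x**2 / (2 * sigma**2))   # the Gabor template t(x)

p_x = np.abs(t) ** 2
p_x /= p_x.sum()
dx = np.sqrt(np.sum(p_x * x**2) - np.sum(p_x * x) ** 2)        # spread in x

t_hat = np.fft.fftshift(np.fft.fft(t))
omega = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(len(x), d=x[1] - x[0]))
p_w = np.abs(t_hat) ** 2
p_w /= p_w.sum()
dw = np.sqrt(np.sum(p_w * omega**2) - np.sum(p_w * omega) ** 2)  # spread in frequency

print(f"Δx ≈ {dx:.3f}, Δω ≈ {dw:.3f}, Δx·Δω ≈ {dx * dw:.3f}")   # ≈ 0.5, the Gaussian minimum in this convention
```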

(Figure: each cell applies a Gabor filter; the plot shows the $n_y/n_x$ anisotropy ratios.)