Diffeomorphic Warping. Ben Recht, August 17, 2006. Joint work with Ali Rahimi (Intel).

What Manifold Learning Isn't. Common features of manifold learning algorithms: 1-1 charting, dense sampling, geometric assumptions.

What Manifold Learning Might Be. Sample data in a low-dimensional space; pass each data point through the same nonlinearity. How do we recover the data?

Probabilistic Model. Data x_1, ..., x_n in R^d are sampled from a joint distribution p(x). Each x is passed through a nonlinear function f: R^d → R^D, giving y_i = f(x_i). The distribution of Y is then given by the change-of-variables formula sketched below.
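The displayed density (an image in the original slides) is presumably the change-of-variables formula; a sketch for the square case D = d, writing g for the inverse of f:

```latex
p_Y(y) \;=\; p_X\bigl(g(y)\bigr)\,\bigl|\det \nabla g(y)\bigr| .
```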

Diffeomorphic Warping. If we assume that f is a diffeomorphism, then there exists an inverse function g in a neighborhood of the image such that g(f(x)) = x for all x, i.e., g ∘ f = I. Taking a logarithm, we may search for the maximum-likelihood g.
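Summing the log of that density over the sample gives the maximum-likelihood problem the slide alludes to; a hedged reconstruction (again assuming D = d):

```latex
\hat{g} \;\in\; \arg\max_{g}\; \sum_{i=1}^{n}
  \Bigl[\, \log p_X\bigl(g(y_i)\bigr) \;+\; \log\bigl|\det \nabla g(y_i)\bigr| \,\Bigr].
```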

Benefits of this Perspective: asymptotic convergence, out-of-sample extension, no neighborhood estimates, incorporates prior knowledge, easy to make semi-supervised.

Asymptotic Convergence. If the y_i are sampled i.i.d., the normalized negative log-likelihood converges to a cross entropy, which is minimized when p_Y(y; g) = p_Y. Similarly, if the joint distribution is a stationary, ergodic, k-th order Markov sequence, the normalized log p_Y converges to the cross-entropy rate (Shannon-McMillan-Breiman theorem).
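A sketch of the i.i.d. case, assuming the objective is the normalized negative log-likelihood (notation mine):

```latex
% As n grows (almost surely, by the law of large numbers):
-\frac{1}{n}\sum_{i=1}^{n} \log p_Y(y_i; g)
  \;\longrightarrow\;
  \mathbb{E}_{p_Y}\!\left[-\log p_Y(y; g)\right]
  = H(p_Y) + D_{\mathrm{KL}}\!\left(p_Y \,\middle\|\, p_Y(\cdot\,; g)\right),
% which is minimized exactly when p_Y(y; g) = p_Y(y).
```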

Diffeomorphic Warping Ingredients. Set of functions: RKHS. Prior on X: dynamics. Optimization: duality.

Kernels. Let k be a symmetric function of two variables such that the sum over i, j of c_i c_j k(x_i, x_j) is nonnegative for all points x_i and coefficients c_i. Such a function is called a positive definite kernel.

Examples of Kernels: linear, polynomial, Gaussian (sketched below).
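A minimal Python sketch of the three kernels named here (the degree, offset, and bandwidth parameters are mine):

```python
import numpy as np

def linear_kernel(x, z):
    """k(x, z) = <x, z>."""
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, c=1.0):
    """k(x, z) = (<x, z> + c)^degree."""
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    x, z = np.asarray(x, float), np.asarray(z, float)
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```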

Aside: Checking Positivity. When is a symmetric function a kernel? Polynomials: trivial to check (positive definite form on monomials). Gaussians: Fourier transform of a positive function. Sums and mixtures. Pointwise products (Schur product theorem). What else? And what algorithmic tools can we develop to check whether a kernel is positive?
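One simple numerical tool, a necessary check rather than a proof of positivity: sample points, build the Gram matrix, and verify that its smallest eigenvalue is nonnegative. A sketch (the Gaussian kernel here is just an example):

```python
import numpy as np

def min_gram_eigenvalue(kernel, X):
    """Smallest eigenvalue of the Gram matrix K_ij = kernel(x_i, x_j)."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.eigvalsh(K).min()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                       # 50 random points in R^3
gauss = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
print(min_gram_eigenvalue(gauss, X) >= -1e-10)         # True up to round-off
```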

Reproducing Kernel Hilbert Spaces. Theorem: let X be a compact set and H a Hilbert space of functions from X to R. Then all of the evaluation functionals f ↦ f(x) are bounded iff there is a unique positive definite kernel k(x_1, x_2) such that f(x) = ⟨f, k(x, ·)⟩_H for all f ∈ H and x ∈ X. k is called the reproducing kernel of H.

RKHS (converse). If k(x_1, x_2) is a positive definite kernel on X, consider the set of functions {k(x, ·) : x ∈ X} and define the inner product ⟨k(x_1, ·), k(x_2, ·)⟩ = k(x_1, x_2). Then the span of these functions is an inner product space, and its completion is an RKHS with reproducing kernel k.
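Extending the inner product by bilinearity to finite spans, which is the step the slide compresses, gives the reproducing property:

```latex
\Bigl\langle \sum_i a_i\,k(x_i,\cdot),\; \sum_j b_j\,k(x_j,\cdot) \Bigr\rangle
  \;=\; \sum_{i,j} a_i b_j\, k(x_i, x_j),
\qquad
\langle f,\, k(x,\cdot)\rangle \;=\; f(x).
```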

Properties of RKHS. Let f = Σ_i c_i k(x_i, ·) and K_ij = k(x_i, x_j). Then the RKHS norm of f and its value at any x ∈ X take the finite forms sketched below.
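A reconstruction of the displayed identities, assuming f is the finite expansion written above:

```latex
\|f\|_{\mathcal{H}}^2 \;=\; \sum_{i,j} c_i c_j\, k(x_i, x_j) \;=\; c^{\top} K c,
\qquad
f(x) \;=\; \sum_i c_i\, k(x_i, x) \quad \text{for } x \in X .
```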

Duality and RKHS. λ_i is the Lagrange multiplier associated with constraint i. No duality gap.

Duality and RKHS (the dual problem).

Duality and RKHS. Optimizations involving the norm of f and the values of f on the data admit a finite representation. The nonlinearity of the kernel gives nonlinear functions. Extends to gradients of f. Hugely successful in approximation and learning.
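The finite representation referred to here is the representer theorem; in sketch form, for a loss V and regularization parameter λ > 0:

```latex
\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} V\bigl(f(x_i), y_i\bigr) + \lambda\,\|f\|_{\mathcal{H}}^2
\quad\Longrightarrow\quad
f^{\star}(\cdot) \;=\; \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot)
\ \text{ for some } \alpha \in \mathbb{R}^n .
```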

Linear Regression. Find the best linear model agreeing with the data. [Plot: y vs. x.]

Linear Regression. Find the best linear model agreeing with the data by trading off a data-fit term ("agree with data") against a smoothness/complexity penalty, with f linear and the Euclidean norm as the regularizer. Solution: a regularized least-squares problem; see the sketch below.
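A minimal regularized least-squares sketch consistent with this slide; the regularization weight and variable names are mine:

```python
import numpy as np

def ridge_fit(X, y, lam=1e-2):
    """Minimize ||X w - y||^2 + lam ||w||^2; closed form w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Usage: fit a noisy linear model; prediction is just an inner product with w.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = X @ np.array([1.5, -0.7]) + 0.1 * rng.standard_normal(100)
w = ridge_fit(X, y)
y_hat = X[:5] @ w
```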

Linear Regression: Morals. Can be solved with least squares. The solution is a linear combination of the data. Computing f(x) only involves inner products with the data. How about nonlinear models?

Regression (nonlinear). Search over an RKHS. Evgeniou et al. (1999), Poggio and Smale (2003).

Nonlinear Regression. Find the best model agreeing with the data: a data-fit term plus a smoothness/complexity penalty, now with f in an RKHS and the RKHS norm as the regularizer. Solution: see the sketch below.
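The solution shown on the slide is presumably the standard Tikhonov/kernel ridge form; a hedged sketch, up to the scaling convention for λ:

```latex
f^{\star}(x) \;=\; \sum_{i=1}^{n} c_i\, k(x_i, x),
\qquad
c \;=\; (K + \lambda I)^{-1} y,
\qquad K_{ij} = k(x_i, x_j).
```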

Nonlinear Regression. Find the best model agreeing with the data: data-fit plus smoothness/complexity, comparing the linear and the nonlinear (RKHS) formulations.

Nonlinear Regression: Morals. Can be solved with least squares. The solution is a linear combination of kernels centered at the data. Computing f(x) only involves kernel evaluations against the data. The RKHS is often dense in L_2.
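A minimal kernel ridge regression sketch illustrating these morals (Gaussian kernel; parameter names are mine): fitting is a linear solve, the model is a combination of kernels centered at the data, and prediction needs only kernel evaluations.

```python
import numpy as np

def gram(X, Z, sigma=1.0):
    """Gaussian kernel matrix: K_ij = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1e-2, sigma=1.0):
    """Solve (K + lam I) c = y for the expansion coefficients c."""
    K = gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, c, X_new, sigma=1.0):
    """f(x) = sum_i c_i k(x_i, x): only kernel evaluations against the training data."""
    return gram(X_new, X_train, sigma) @ c

# Usage: fit a 1-D nonlinear function.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)
c = kernel_ridge_fit(X, y)
y_hat = kernel_ridge_predict(X, c, X[:5])
```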

Generalization and Stability. Regularizing with the norm makes algorithms robust to changes in the training data. Models generalize if they predict novel data as well as they predict the training data. Theorem: a model generalizes if and only if it is robust to changes in the data [Poggio et al. 2004]. The RKHS norm is meaningful: it penalizes complexity.

Diffeomorphic Warping with RKHS. We know the log-likelihood in terms of the RKHS coefficients, but log det is not convex in (a, c). Construct a dual problem as an approximation. If p_X is a zero-mean Gaussian, we get a determinant maximization problem [Vandenberghe et al., 1998].
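For reference, the determinant-maximization (MAXDET) problem of Vandenberghe, Boyd, and Wu (1998) has the standard form below, with G(x) and F(x) affine in x; the slide states that the zero-mean Gaussian prior leads to a problem of this class.

```latex
\begin{aligned}
\text{minimize}   \quad & c^{\top} x + \log\det G(x)^{-1} \\
\text{subject to} \quad & G(x) \succ 0, \qquad F(x) \succeq 0 .
\end{aligned}
```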

Eigenvalue Approximation. Ω^{-1} is an optimal solution if and only if it satisfies a condition that follows from the KKT conditions of MAXDET. The eigenvalues of Ω give the coefficients for the expansion of g.

Remarks. The dual can be solved using an interior-point method. It provides a lower bound on the log-likelihood. We can approximate with a spectral method. Easy to extend to any log-concave prior on X. Performs quite well in experiments.

Diffeomorphic Warping Ingredients. Set of functions: RKHS. Prior on X: dynamics. Optimization: duality.

Dynamics. Assume the data is generated by a linear time-invariant Gaussian (LTIG) system. For the experiments, this model can be very dumb!
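A hedged sketch of what a linear time-invariant Gaussian (LTIG) prior on the latent trajectory typically looks like; the exact matrices used in the talk are not shown in the transcript:

```latex
x_{t+1} = A\,x_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q), \qquad y_t = f(x_t) .
```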

The Sensetable (James Patten, me). RFID tag; signal strength measurements from the tag. [Figure: photos and a signal-strength plot.]

Sensetable: Manifold Learning. [Panels: LLE, KPCA, Ground Truth, Isomap, ST-Isomap.]

Sensetable: DW. [Panels: Ground Truth, Diffeomorphic Warping.]

Tracking. [Figure: tracking results.]

Tracking with KPCA. [Figure: KPCA tracking results.]

Video

Representation. A big mess of numbers for each frame: raw pixels, no image processing.

Annotations from user or detection algorithms

Assume that the output time series is smooth.

Video

Future Work. Speeding up the log-det. Optimizing over families of priors p_X. Estimating the duality gap. Learning manifolds that need more than one chart. Understanding why nonparametric ID is easy while parametric ID is hard.