Advanced Machine Learning & Perception

1 Advanced Machine Learning & Perception Instructor: Tony Jebara

2 Topic 1 Introduction: a researchy course built around the latest papers; going beyond simple machine learning to perception, strange spaces, images, time, behavior. Course info, policies, texts, web page. Syllabus, overview, review. The Gaussian as the simplest distribution / signal handling. Representation & appearance-based methods. Least squares, correlation, Gaussians, bases. Principal Components Analysis and its shortcomings.

3 About me Tony Jebara, Associate Professor in Computer Science Started at Columbia in 2002 PhD from MIT in Machine Learning Thesis: Discriminative, Generative and Imitative Learning (2001) Research Areas: (Columbia Machine Learning Lab, CEPSR 6LE5) Machine Learning Some Computer Vision

4 Course Web Page Some material & announcements will be online But, many things will be handed out in class such as photocopies of papers for readings, etc. Check EWS link to see deadlines, homework, etc. Available online, see TA info, etc. Follow the policies, homework, deadlines, readings closely please!

5 Syllabus Week 1 Introduction, Review of Basic Concepts, Representation Issues, Vector and Appearance-Based Models, Correlation and Least Squared Error Methods, Bases, Eigenspace Recognition, Principal Components Analysis Week 2 Nonlinear Dimensionality Reduction, Manifolds, kernel PCA, Locally Linear Embedding, Maximum Variance Unfolding, Minimum Volume Embedding Week 3 Support Vector Machines and related machines, VC Dimension, Large Margin, Large Relative Margin Week 4 Kernel Methods, Reproducing Kernel Hilbert Space, Probabilistic Kernel Approaches, Kernel Principal Components Analysis, Bag of Vectors/Pixel Kernels Week 5 Maximum Entropy, Iterative Scaling, Maximum Entropy Discrimination, Large Margin Probability Models Week 6 SVM Extensions, Multi-Class Classification, Error-Correcting Codes, Feature Selection, Kernel Selection, Meta-Learning, Transduction

6 Syllabus Week 7 Bayesian Networks, Belief Propagation, Hidden Markov Models, Gesture Recognition Week 8 Kalman Filtering, Structure from Motion, Parameter Estimation, Coupled and Linked Hidden Markov Models, Variational and Mean-Field Methods Week 9 Factorial Hidden Markov Models, Switched Kalman Filters, Dynamical Bayesian Networks, Structured Mean-Field Week 10 Graph Learning, b-matching, Loopy Belief Propagation, Perfect Graphs Week 11 Spectral Clustering, Random Walks, Ncuts Methods, Stability, Image Segmentation Week 12 Boosting, Mixtures of Experts, AdaBoost, Online Learning Week 13 TBD Week 14 Project Presentations

7 Beyond Canned Learning Latest research methods, high dimensions, nonlinearities, dynamics, manifolds, invariance, unlabeled, feature selection Modeling Images / People / Activity / Time Series / MoCAP

8 Representation & Vectorization How do we represent our data (images, time series, genes)? Vectorization: the poor man's representation. Almost anything can be written as a long vector. E.g. an image is read lexicographically and the RGB of each pixel is appended to a vector. Or, a gene sequence can be written as a binary vector, GATACAC = [ ]. For images, this is called the appearance-based representation. But we lose many important properties this way; we will fix this in later lectures. For now, it is an easy way to proceed.

9 Least Squares Detection
How to find a face in an image given a template? Template matching and sum-of-squared-differences (SSD).
Naïve approach: try all positions/scales and find the least-squares distance:
$i^* = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}\|\mu - x_i\|^2 = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}(\mu - x_i)^T(\mu - x_i)$
Correlation and normalized correlation: could normalize the length of all vectors (fixes lighting), $\hat\mu = \mu/\|\mu\|$, $\hat{x}_i = x_i/\|x_i\|$, so that
$i^* = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}\|\hat\mu - \hat{x}_i\|^2 = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}\left(\hat\mu^T\hat\mu - 2\hat\mu^T\hat{x}_i + \hat{x}_i^T\hat{x}_i\right) = \arg\max_{i \in 1,\ldots,T} \hat\mu^T\hat{x}_i$
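Below is a small NumPy sketch (not from the course materials) of this brute-force search at a single scale; `image` and `template` are assumed grayscale arrays, and both the SSD minimizer and the normalized-correlation maximizer are returned.

```python
import numpy as np

def match_template(image, template):
    """Return the (row, col) minimizing SSD and the (row, col)
    maximizing normalized correlation over all positions."""
    H, W = image.shape
    h, w = template.shape
    mu = template.ravel().astype(float)
    mu_hat = mu / np.linalg.norm(mu)                 # unit-length template

    best_ssd, best_corr = np.inf, -np.inf
    pos_ssd = pos_corr = (0, 0)
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            x = image[r:r+h, c:c+w].ravel().astype(float)
            ssd = 0.5 * np.sum((mu - x) ** 2)        # least-squares distance
            corr = mu_hat @ (x / (np.linalg.norm(x) + 1e-12))  # normalized correlation
            if ssd < best_ssd:
                best_ssd, pos_ssd = ssd, (r, c)
            if corr > best_corr:
                best_corr, pos_corr = corr, (r, c)
    return pos_ssd, pos_corr
```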

10 Least Squares & Gaussians
Minimum squared error is equivalent to maximum likelihood under a Gaussian:
$i^* = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}\|\mu - x_i\|^2 = \arg\max_{i \in 1,\ldots,T} \log\left[\tfrac{1}{(2\pi)^{D/2}}\exp\left(-\tfrac{1}{2}\|\mu - x_i\|^2\right)\right]$
Can now treat it as a probabilistic problem: find the most likely position (or sub-image) in the search image given the Gaussian model of the template.
Define the log of the likelihood as $\log p(x|\theta)$. For a Gaussian, the probability or likelihood is:
$p(x_i|\theta) = \tfrac{1}{(2\pi)^{D/2}}\exp\left(-\tfrac{1}{2}\|\mu - x_i\|^2\right)$

11 Gaussian Distribution
Recall the 1-dimensional Gaussian: the mean parameter µ translates the Gaussian left & right:
$p(x|\mu) = \tfrac{1}{\sqrt{2\pi}}\exp\left(-\tfrac{1}{2}(x-\mu)^2\right)$
Can also have a variance parameter σ² that widens or narrows the Gaussian:
$p(x|\mu,\sigma) = \tfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\tfrac{(x-\mu)^2}{2\sigma^2}\right)$
Note: $\int_{-\infty}^{\infty} p(x)\,dx = 1$

12 Multivariate Gaussian
The Gaussian extends to D dimensions: the mean parameter µ is a vector (translates), and the covariance matrix Σ stretches and rotates the bump:
$p(x|\mu,\Sigma) = \tfrac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right), \quad x,\mu \in \mathbb{R}^D,\ \Sigma \in \mathbb{R}^{D\times D}$
Max and expectation = µ. The mean is any real vector; the variance is now the matrix Σ. The covariance matrix is positive semi-definite and symmetric. Need the matrix inverse (inv), matrix determinant (det) and matrix trace operator (trace).
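A minimal NumPy sketch of this log-density (a hedged illustration, not course code); it uses a linear solve and slogdet rather than an explicit inverse and determinant for numerical stability.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """log N(x | mu, Sigma) for x, mu of shape (D,) and Sigma of shape (D, D)."""
    D = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)         # log |Sigma|
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)
```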

13 Multivariate Gaussian
Spherical covariance: $\Sigma = \sigma^2 I = \mathrm{diag}(\sigma^2, \ldots, \sigma^2)$
Diagonal covariance: the dimensions of x are independent, so the density is a product of D one-dimensional Gaussians:
$p(x|\mu,\Sigma) = \prod_{d=1}^{D} \tfrac{1}{\sqrt{2\pi}\,\sigma_d}\exp\left(-\tfrac{(x_d-\mu_d)^2}{2\sigma_d^2}\right), \quad \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2)$

14 Max Likelihood Gaussian
How to make the face detector work on all faces? Why use a single template? How can we use many templates?
Have N IID samples of template vectors $x_1, x_2, x_3, \ldots, x_N$. Represent the IID samples with their parameters as a network; this is more efficiently drawn using a replicator plate over $i = 1,\ldots,N$.
Let us get a good Gaussian from these many templates. Standard approach: maximum likelihood:
$\log \prod_{i=1}^{N} p(x_i|\mu,\Sigma) = \sum_{i=1}^{N} \log \tfrac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right)$

15 Max Likelihood Gaussian
Max over µ: set the gradient of the log-likelihood to zero,
$\frac{\partial}{\partial\mu} \sum_{i=1}^{N}\left[-\tfrac{D}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right] = 0$
Using $\frac{\partial}{\partial x} x^T x = 2x^T$ (see Jordan Ch. 12), this gives $\sum_i (x_i-\mu)^T\Sigma^{-1} = 0$, i.e. the sample mean $\mu = \tfrac{1}{N}\sum_{i=1}^{N} x_i$.
For Σ we need the trace operator $\mathrm{tr}(A) = \sum_{d=1}^{D} A_{dd}$ and several of its properties:
$\mathrm{tr}(A) = \mathrm{tr}(A^T)$, $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, $\mathrm{tr}(BAB^{-1}) = \mathrm{tr}(A)$, $\mathrm{tr}(xx^T A) = \mathrm{tr}(x^T A x) = x^T A x$

16 Max Likelihood Gaussian
Likelihood rewritten in trace notation:
$l = \sum_{i=1}^{N}\left[-\tfrac{D}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right]$
$= -\tfrac{ND}{2}\log 2\pi - \tfrac{N}{2}\log|\Sigma| - \tfrac{1}{2}\sum_{i=1}^{N}\mathrm{tr}\left((x_i-\mu)(x_i-\mu)^T\Sigma^{-1}\right)$
$= -\tfrac{ND}{2}\log 2\pi + \tfrac{N}{2}\log|A| - \tfrac{1}{2}\sum_{i=1}^{N}\mathrm{tr}\left((x_i-\mu)(x_i-\mu)^T A\right)$ with $A = \Sigma^{-1}$
Max over $A = \Sigma^{-1}$ using the properties $\frac{\partial \log|A|}{\partial A} = (A^{-1})^T$ and $\frac{\partial\,\mathrm{tr}(BA)}{\partial A} = B^T$:
$\frac{\partial l}{\partial A} = \tfrac{N}{2}\Sigma - \tfrac{1}{2}\sum_{i=1}^{N}(x_i-\mu)(x_i-\mu)^T = 0$
Get the sample covariance: $\Sigma = \tfrac{1}{N}\sum_{i=1}^{N}(x_i-\mu)(x_i-\mu)^T$
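The resulting ML estimates are one line each in NumPy; this sketch (mine, not the course's) assumes the data are the rows of an (N, D) array `X`.

```python
import numpy as np

def fit_gaussian_ml(X):
    """Return (mu, Sigma): maximum-likelihood mean and covariance of the rows of X."""
    N = X.shape[0]
    mu = X.mean(axis=0)              # mu = (1/N) sum_i x_i
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / N          # Sigma = (1/N) sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma
```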

17 Principal Components Analysis Gaussians: for classification, regression & compression! Data can be constant in some directions and change in others. Use the Gaussian to find directions of high/low variance.

18 Principal Components Analysis
Idea: instead of writing data in all its dimensions, only write it as the mean plus steps along one direction: $x_i \approx \mu + c_i v$.
More generally, keep a subset of C dimensions out of D (e.g. 2 of 3): $x_i \approx \mu + \sum_{j=1}^{C} c_{ij} v_j$.
Compression method: store $x_i$ as its coefficients $c_i$.
Optimal directions: along the eigenvectors of the covariance. Which directions to keep: those with the highest eigenvalues (variances).

19 Principal Components Analysis
If we have the eigenvectors, mean and coefficients: $x_i \approx \mu + \sum_{j=1}^{C} c_{ij} v_j$.
Getting the eigenvectors (i.e. approximating the covariance): $\Sigma = V \Lambda V^T$, where the columns of V are the eigenvectors $v_1, v_2, v_3, \ldots$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \ldots)$.
Eigenvectors are orthonormal: $v_i^T v_j = \delta_{ij}$. In the coordinates of V, the Gaussian is diagonal with covariance Λ.
All eigenvalues are non-negative, $\lambda_i \geq 0$. Higher eigenvalues mean higher variance; use those directions first.
To compute the coefficients: $c_{ij} = (x_i - \mu)^T v_j$
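A short NumPy sketch of PCA exactly as laid out above: eigendecompose the covariance, keep the C largest-variance directions, compute the coefficients and reconstruct. Function and variable names are mine, not from the course materials.

```python
import numpy as np

def pca_fit_transform(X, C):
    """X: (N, D) data. Keep C principal directions; return mean, basis,
    eigenvalues, coefficients and the reconstruction."""
    N = X.shape[0]
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / N
    lam, V = np.linalg.eigh(Sigma)            # ascending eigenvalues, orthonormal columns
    order = np.argsort(lam)[::-1][:C]         # C largest-variance directions
    V_C, lam_C = V[:, order], lam[order]
    coeffs = (X - mu) @ V_C                   # c_ij = (x_i - mu)^T v_j
    X_hat = mu + coeffs @ V_C.T               # x_i ~ mu + sum_j c_ij v_j
    return mu, V_C, lam_C, coeffs, X_hat
```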

20 Snapshot Method
Assume N=2000 images, each containing D=10,000 pixels. How big is the covariance matrix? It is D×D. Finding the eigenvectors or inverting a D×D matrix requires O(D³).
First compute the mean of all the data (easy) and subtract it from each point.
Instead of $\Sigma = \tfrac{1}{N}\sum_i x_i x_i^T$, compute the N×N matrix Φ where $\Phi_{ij} = x_i^T x_j$.
Then find the eigenvectors/eigenvalues of $\Phi = \tilde{V}\tilde{\Lambda}\tilde{V}^T$. The eigenvectors of Σ are then $v_i = \sum_j (\tilde{v}_i)_j\, x_j$ (up to normalization).
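A hedged NumPy sketch of this snapshot trick, assuming N is much smaller than D; the names are illustrative rather than taken from the course.

```python
import numpy as np

def snapshot_pca(X, C):
    """X: (N, D) with N << D. Return the top-C eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)                  # subtract the mean first
    Phi = Xc @ Xc.T                          # N x N Gram matrix, Phi_ij = x_i^T x_j
    lam, U = np.linalg.eigh(Phi)             # O(N^3) instead of O(D^3)
    order = np.argsort(lam)[::-1][:C]
    V = Xc.T @ U[:, order]                   # v_k = sum_j (u_k)_j x_j, back in R^D
    V /= np.linalg.norm(V, axis=0)           # renormalize each eigenvector to unit length
    return V
```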

21 Eigenfaces
(Figure: the face dataset $\{x_1, \ldots, x_N\}$, its mean face µ, and each face reconstructed as $\mu + \sum_j c_{ij} v_j$ from the leading eigenfaces.)

22 Eigenface Detection
Use a Gaussian model, but approximate the huge covariance matrix with its M leading eigenvectors plus a constant spherical component for the remaining directions:
$p(x|\mu,\Sigma) = \tfrac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
$\Sigma \approx \sum_{k=1}^{M} \lambda_k v_k v_k^T + \rho \sum_{k=M+1}^{D} \delta_k \delta_k^T, \qquad \Sigma^{-1} \approx \sum_{k=1}^{M} \tfrac{1}{\lambda_k} v_k v_k^T + \tfrac{1}{\rho}\sum_{k=M+1}^{D} \delta_k \delta_k^T$
The eigenvalues $\lambda_k$ are used for the first M directions; the spherical value ρ is used for the rest.
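One way this low-rank-plus-spherical approximation can be evaluated, as a NumPy sketch; names like `V_M`, `lam_M` and `rho` are mine and stand for the M kept eigenvectors, their eigenvalues, and the spherical value above.

```python
import numpy as np

def approx_gaussian_logpdf(x, mu, V_M, lam_M, rho):
    """Approximate log N(x | mu, Sigma) with Sigma ~ V_M diag(lam_M) V_M^T + rho * (rest)."""
    D = mu.shape[0]
    diff = x - mu
    proj = V_M.T @ diff                              # coordinates in the kept subspace
    maha = np.sum(proj**2 / lam_M)                   # Mahalanobis part inside the subspace
    resid = diff @ diff - proj @ proj                # energy left outside the subspace
    maha += resid / rho                              # spherical part at scale rho
    logdet = np.sum(np.log(lam_M)) + (D - len(lam_M)) * np.log(rho)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)
```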

23 Gaussian Face Finding
Instead of minimizing squared error, use the Gaussian model:
$p(x|\mu,\Sigma) = \tfrac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
Euclidean distance: $\tfrac{1}{2}(x-\mu)^T(x-\mu)$ (assumes Σ = I).
Mahalanobis distance: $\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)$ (shrinks distances in directions with high variance).
This was the state-of-the-art face finder as evaluated by NIST/DARPA (Moghaddam et al.). The top performer after 2000 is Viola-Jones (boosted cascade).

24 Gaussian Face Recognition
Instead of modeling face images with a Gaussian, model the difference of two face images with a Gaussian.
Each difference between a pair of images in our data is represented as a D-dimensional vector x with a binary label y (y=1 means the same person, y=0 means not): $x \in \mathbb{R}^D$, $y \in \{0,1\}$. E.g. $y_i = 1$ when $x_i$ is the difference of two images of the same person, and $y_j = 0$ when $x_j$ is the difference of images of two different people.
One Gaussian for the same-face deltas, another for the different-people deltas.
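A small sketch of how such a pairwise training set might be built; `faces` (an (N, D) array of vectorized faces) and `ids` (the person identity of each row) are assumed inputs, not names from the course.

```python
import numpy as np
from itertools import combinations

def make_pair_deltas(faces, ids):
    """Return delta vectors X and labels y (1 = same person, 0 = different)."""
    X, y = [], []
    for i, j in combinations(range(len(ids)), 2):
        X.append(faces[i] - faces[j])             # difference image as a D-vector
        y.append(1 if ids[i] == ids[j] else 0)    # intrapersonal vs extrapersonal
    return np.asarray(X), np.asarray(y)
```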

25 Gaussian Classification
Have two classes, each with its own Gaussian: $p(x|\mu,\Sigma,y)$.
Generation: 1) flip a coin, get y; 2) pick Gaussian y, sample x from it.
Data: $\{(x_1,y_1), \ldots, (x_N,y_N)\}$ with $x \in \mathbb{R}^D$, $y \in \{0,1\}$.
Maximum likelihood (with priors the joint would be $p(\alpha)p(\mu)p(\Sigma)\prod_i p(y_i|\alpha)p(x_i|y_i,\mu,\Sigma)$; here we keep only the data terms):
$l = \sum_i \log p(x_i, y_i | \alpha, \mu, \Sigma) = \sum_i \log p(y_i|\alpha) + \sum_i \log p(x_i|y_i,\mu,\Sigma)$
with $p(y|\alpha) = \alpha^y (1-\alpha)^{1-y}$ and $p(x|y,\mu,\Sigma) = \mathcal{N}(x|\mu_y, \Sigma_y)$, so
$l = \sum_i \log p(y_i|\alpha) + \sum_{i: y_i=0} \log p(x_i|\mu_0,\Sigma_0) + \sum_{i: y_i=1} \log p(x_i|\mu_1,\Sigma_1)$

26 Gaussian Classification
Maximum likelihood can be done separately for the 3 terms:
$l = \sum_i \log p(y_i|\alpha) + \sum_{y_i=0} \log p(x_i|\mu_0,\Sigma_0) + \sum_{y_i=1} \log p(x_i|\mu_1,\Sigma_1)$
Count the # of positive & negative examples (class prior): $\alpha = N_1 / N$.
Get the mean & covariance of the negatives and the mean & covariance of the positives:
$\mu_0 = \tfrac{1}{N_0}\sum_{y_i=0} x_i, \quad \Sigma_0 = \tfrac{1}{N_0}\sum_{y_i=0}(x_i-\mu_0)(x_i-\mu_0)^T$
$\mu_1 = \tfrac{1}{N_1}\sum_{y_i=1} x_i, \quad \Sigma_1 = \tfrac{1}{N_1}\sum_{y_i=1}(x_i-\mu_1)(x_i-\mu_1)^T$
Given an (x,y) pair, can now compute the likelihood p(x,y).
To make a classification, a bit of decision theory: without x we can compute the prior guess for y, p(y); given x and wanting y, we need the posterior p(y|x).
Bayes optimal decision: $\hat{y} = \arg\max_{y \in \{0,1\}} p(y|x)$. Optimal iff we have the true probability.

27 Gaussian Classification
Example cases, plotting the decision boundary where $p(y=1|x) = 0.5$:
$p(y=1|x) = \frac{p(x,y=1)}{p(x,y=0) + p(x,y=1)} = \frac{\alpha\,\mathcal{N}(x|\mu_1,\Sigma_1)}{(1-\alpha)\,\mathcal{N}(x|\mu_0,\Sigma_0) + \alpha\,\mathcal{N}(x|\mu_1,\Sigma_1)}$
If the covariances are equal: linear decision boundary. If the covariances are different: quadratic decision boundary.
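Putting the last two slides together, here is a hedged sketch (not the course's reference implementation) of fitting the class prior and the two Gaussians by maximum likelihood and scoring the posterior, using SciPy's Gaussian log-density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classifier(X, y):
    """Class prior alpha = p(y=1) and per-class (mu, Sigma) by maximum likelihood."""
    alpha = y.mean()
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)
        params[c] = (mu, Sigma)
    return alpha, params

def posterior_y1(x, alpha, params):
    """p(y=1 | x) via Bayes rule; the decision boundary is where this crosses 0.5."""
    log_p1 = np.log(alpha) + multivariate_normal.logpdf(x, mean=params[1][0], cov=params[1][1])
    log_p0 = np.log(1 - alpha) + multivariate_normal.logpdf(x, mean=params[0][0], cov=params[0][1])
    m = max(log_p0, log_p1)                      # subtract the max for numerical stability
    return np.exp(log_p1 - m) / (np.exp(log_p0 - m) + np.exp(log_p1 - m))
```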

28 Intra-Extra Personal Gaussians
Intrapersonal Gaussian model $\mathcal{N}(x|\mu_1,\Sigma_1)$: its covariance is approximated by these eigenvectors (figure).
Extrapersonal Gaussian model $\mathcal{N}(x|\mu_0,\Sigma_0)$: its covariance is approximated by these eigenvectors (figure).
Question: what are the Gaussian means?
Probability a pair is the same person:
$p(y=1|x) = \frac{\alpha\,\mathcal{N}(x|\mu_1,\Sigma_1)}{(1-\alpha)\,\mathcal{N}(x|\mu_0,\Sigma_0) + \alpha\,\mathcal{N}(x|\mu_1,\Sigma_1)}$

29 Other Standard Bases
There are other choices for the eigenvectors, not just PCA. Could pick the eigenvectors without looking at the data, just for their interesting properties, and still write $x_i \approx \mu + \sum_{j=1}^{C} c_{ij} v_j$.
Fourier basis: denoises, only keeps the smooth parts of the image.
Wavelet basis: localized or windowed Fourier.
PCA: optimal least-squares linear reconstruction of the dataset.

30 Basis for Natural Images What happens if we do PCA on all natural images instead of just faces? We get difference-of-Gaussian bases, like a Gabor or wavelet basis. Not specific like faces; multi-scale and multi-orientation. Also called steerable filters. Similar to the visual cortex.

31 Problems with Linear Bases
The coefficient representation $c_{ij} = (x_i - \mu)^T v_j$ changes wildly if the image rotates, and so does p(x). The eigenspace is sensitive to rotations, translations and transformations of the image.
Simple linear/Gaussian/PCA models are not enough: what worked for aligned faces breaks for general image datasets. Most of the PCA eigenvectors and spectrum energy is wasted due to NONLINEAR EFFECTS.
