Advanced Machine Learning & Perception


Advanced Machine Learning & Perception Instructor: Tony Jebara

Topic 1: Introduction
Researchy course, latest papers
Going beyond simple machine learning: perception, strange spaces, images, time, behavior
Info, policies, texts, web page
Syllabus, overview, review
Gaussian distributions
Representation & appearance-based methods
Least squares, correlation, Gaussians, bases
Principal Components Analysis and its shortcomings

About me
Tony Jebara, Associate Professor in Computer Science
Started at Columbia in 2002
PhD from MIT in Machine Learning; thesis: Discriminative, Generative and Imitative Learning (2001)
Research areas: machine learning, some computer vision (Columbia Machine Learning Lab, CEPSR 6LE5) www.cs.columbia.edu/learning

Course Web Page
http://www.cs.columbia.edu/~jebara/4772
http://www.cs.columbia.edu/~jebara/6772
Some material & announcements will be online, but many things will be handed out in class, such as photocopies of papers for readings.
Check the NEWS link online to see deadlines, homework, TA info, etc.
Please follow the policies, homework, deadlines, and readings closely!

Syllabus
Week 1: Introduction, Review of Basic Concepts, Representation Issues, Vector and Appearance-Based Models, Correlation and Least-Squared-Error Methods, Bases, Eigenspace Recognition, Principal Components Analysis
Week 2: Nonlinear Dimensionality Reduction, Manifolds, Kernel PCA, Locally Linear Embedding, Maximum Variance Unfolding, Minimum Volume Embedding
Week 3: Maximum Entropy, Exponential Families, Maximum Entropy Discrimination, Large Margin Probability Models
Week 4: Conditional Random Fields and Linear Models, Iterative Scaling and Majorization
Week 5: Graphical Models, Multi-Class Support Vector Machines, Structured Support Vector Machines, Cutting Plane Algorithms
Week 6: Kernels and Probabilistic Kernels

Syllabus
Week 7: Feature Selection and Kernel Selection, Support Vector Machine Extensions
Week 8: Meta-Learning and Multi-Task Support Vector Machines
Week 9: Semi-Supervised Learning and Graph-Based Semi-Supervised Learning
Week 10: High-Tree-Width Graphical Models, Approximate Inference, Graph Structure Learning
Week 11: Clustering, Spectral Clustering, Normalized Cuts
Week 12: Boosting, Mixtures of Experts, AdaBoost, Online Learning
Week 13: Project Presentations
Week 14: Project Presentations

Beyond Canned Learning
Latest research methods: high dimensions, nonlinearities, dynamics, manifolds, invariance, unlabeled data, feature selection
Modeling images / people / activity / time series / MoCap

Representation & Vectorization
How do we represent our data? Images, time series, genes.
Vectorization: the poor man's representation. Almost anything can be written as one long vector.
E.g., an image is read lexicographically and the RGB values of each pixel are appended to a vector.
Or a gene sequence can be written as a binary vector: GATACAC = [0100 0001 1000 0001 0010 0001 0010]
For images this is called the appearance-based representation.
But we lose many important properties this way; we will fix this in later lectures. For now it is an easy way to proceed.
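
A minimal numpy sketch of this vectorization step (the function names are illustrative; the base-to-bit mapping follows the GATACAC example above):

```python
import numpy as np

# Lexicographic vectorization: scan the image row by row and append the
# RGB values of each pixel into one long vector of length D = H * W * 3.
def vectorize_image(img):                 # img: (H, W, 3) array
    return img.reshape(-1).astype(float)

# Binary vectorization of a gene sequence, one 4-bit code per base,
# using the same bit pattern as the GATACAC example on this slide.
CODE = {'T': [1, 0, 0, 0], 'G': [0, 1, 0, 0], 'C': [0, 0, 1, 0], 'A': [0, 0, 0, 1]}

def vectorize_gene(seq):
    return np.array([bit for base in seq for bit in CODE[base]], dtype=float)

# vectorize_gene("GATACAC") reproduces [0100 0001 1000 0001 0010 0001 0010]
```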

Least Squares Detection
How do we find a face in an image given a template? Template matching and sum-of-squared-differences (SSD).
Naïve approach: try all positions/scales and find the least-squares distance to the template µ:
i^* = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}\|\mu - x_i\|^2
Correlation and normalized correlation: we could normalize the length of all vectors (fixes lighting), \hat\mu = \mu / \|\mu\| and \hat{x}_i = x_i / \|x_i\|, giving
i^* = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}\|\hat\mu - \hat{x}_i\|^2 = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}\left( \hat\mu^T\hat\mu - 2\hat\mu^T\hat{x}_i + \hat{x}_i^T\hat{x}_i \right) = \arg\max_{i \in 1,\ldots,T} \hat\mu^T \hat{x}_i
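
A small numpy sketch of both detectors; `template` is a flattened D-dimensional patch and `candidates` holds the T flattened sub-images (names are illustrative):

```python
import numpy as np

def ssd_detect(template, candidates):
    """Index of the candidate minimizing 1/2 ||mu - x_i||^2 (sum of squared differences)."""
    diff = candidates - template                      # (T, D) broadcast against (D,)
    return int(np.argmin(0.5 * np.sum(diff**2, axis=1)))

def ncc_detect(template, candidates):
    """Normalized correlation: after unit-normalizing, minimizing SSD is the
    same as maximizing the inner product mu_hat^T x_hat_i."""
    mu_hat = template / np.linalg.norm(template)
    x_hat = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(x_hat @ mu_hat))
```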

Least Squares as Gaussian Model
Minimum squared error is equivalent to maximum likelihood under a Gaussian:
i^* = \arg\min_{i \in 1,\ldots,T} \tfrac{1}{2}\|\mu - x_i\|^2 = \arg\max_{i \in 1,\ldots,T} \log\left[ \frac{1}{(2\pi)^{D/2}} \exp\left( -\tfrac{1}{2}\|\mu - x_i\|^2 \right) \right]
We can now treat detection as a probabilistic problem: find the most likely position (or sub-image) in the search image given the Gaussian model of the template.
Define the log of the likelihood as \log p(x \mid \theta). For a Gaussian, the probability or likelihood is
p(x_i \mid \theta) = \frac{1}{(2\pi)^{D/2}} \exp\left( -\tfrac{1}{2}\|\mu - x_i\|^2 \right)
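
A quick numeric check of this equivalence on random synthetic data: the SSD minimizer and the Gaussian log-likelihood maximizer pick the same candidate.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=16)                      # template, D = 16
X = rng.normal(size=(50, 16))                 # T = 50 candidate sub-images

ssd = 0.5 * np.sum((X - mu)**2, axis=1)                 # squared-error scores
loglik = -0.5 * mu.size * np.log(2 * np.pi) - ssd       # identity-covariance Gaussian log-likelihood
assert np.argmin(ssd) == np.argmax(loglik)              # same winner either way
```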

Multivariate Gaussian
Gaussians extend to D dimensions and have two parameters:
Mean vector µ (translates) and covariance matrix Σ (stretches and rotates):
p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right), \quad x, \mu \in \mathbb{R}^D,\ \Sigma \in \mathbb{R}^{D \times D}
Maximum and expectation are both at µ. The mean is any real vector; the variance is now the matrix Σ.
The covariance matrix is symmetric and positive semi-definite.
Need the matrix inverse (inv), the matrix determinant (det), and the matrix trace operator (trace).
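
A small sketch of evaluating this density with numpy, using a linear solve and slogdet rather than an explicit inverse and determinant (the usual numerically safer route):

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """log N(x | mu, Sigma) for x, mu in R^D and Sigma a D x D covariance."""
    D = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)       # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(Sigma)             # log |Sigma|
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
```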

Max Likelihood Gaussian
How do we make a face detector work on all faces? Why use a single template? How can we use many templates?
Have N IID samples of template vectors: x_1, x_2, x_3, \ldots, x_N
Represent the IID samples together with the parameters as a network, drawn more efficiently using a replicator plate over i = 1, \ldots, N.
Let us fit a good Gaussian to these many templates. Standard approach: maximum likelihood,
\log \prod_{i=1}^{N} p(x_i \mid \mu, \Sigma) = \sum_{i=1}^{N} \log\left[ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2}(x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right) \right]

Max Likelihood Gaussian
Maximize over µ: set the gradient of the log-likelihood to zero,
\frac{\partial}{\partial \mu} \sum_{i=1}^{N} \left[ -\tfrac{D}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] = 0
\Rightarrow \sum_{i=1}^{N} \Sigma^{-1}(x_i - \mu) = 0 \Rightarrow \mu = \frac{1}{N}\sum_{i=1}^{N} x_i
(using \frac{\partial}{\partial x} x^T x = 2x^T; see Jordan Ch. 12) — we get the sample mean.
For Σ we need the trace operator, \mathrm{tr}(A) = \sum_{d=1}^{D} A_{dd}, and several of its properties:
\mathrm{tr}(A) = \mathrm{tr}(A^T), \quad \mathrm{tr}(AB) = \mathrm{tr}(BA), \quad \mathrm{tr}(BAB^{-1}) = \mathrm{tr}(A), \quad \mathrm{tr}(xx^T A) = \mathrm{tr}(x^T A x) = x^T A x

Max Likelihood Gaussian
Likelihood rewritten in trace notation:
l = \sum_{i=1}^{N}\left[ -\tfrac{D}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu) \right]
  = -\tfrac{ND}{2}\log 2\pi + \tfrac{N}{2}\log|\Sigma^{-1}| - \tfrac{1}{2}\sum_{i=1}^{N}\mathrm{tr}\left( (x_i-\mu)(x_i-\mu)^T \Sigma^{-1} \right)
Maximize over A = \Sigma^{-1} using the properties \frac{\partial \log|A|}{\partial A} = (A^{-1})^T and \frac{\partial\,\mathrm{tr}(BA)}{\partial A} = B^T:
\frac{\partial l}{\partial A} = \tfrac{N}{2}\Sigma^T - \tfrac{1}{2}\sum_{i=1}^{N}(x_i-\mu)(x_i-\mu)^T = 0
Get the sample covariance:
\Sigma = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)(x_i-\mu)^T
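
The two closed-form estimates in code, a minimal sketch for an (N, D) data matrix:

```python
import numpy as np

def fit_gaussian_ml(X):
    """Maximum-likelihood Gaussian fit: sample mean and sample covariance."""
    N = X.shape[0]
    mu = X.mean(axis=0)                 # mu = (1/N) sum_i x_i
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / N             # Sigma = (1/N) sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma
```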

Principal Components Analysis
Problem: for high-dimensional data, D is large. Storing Σ, inverting it to get Σ^{-1}, and computing its determinant |Σ| are all expensive!
Idea: limit the Gaussian model to the directions of high variance. Use Principal Components Analysis to mimic Σ.

Principal Components Analysis
PCA approximates each datapoint as the mean vector plus steps along eigenvector directions, e.g. c_i steps along v starting from µ.
More generally, PCA uses a set of M eigenvectors (where M << D):
x_i \approx \hat{x}_i = \mu + \sum_{j=1}^{M} c_{ij} v_j
PCA selects \{\mu, c_{ij}, v_j\} to minimize \sum_i \| x_i - \hat{x}_i \|^2.
The optimal directions are eigenvectors of the covariance. Which directions to keep: those with the highest eigenvalues (variances).

Principal Components Analysis
Use the eigenvectors, mean, and coefficients to approximate the data: x_i \approx \mu + \sum_{j=1}^{M} c_{ij} v_j
PCA finds the eigenvectors by decomposing the covariance matrix:
\Sigma = V \Lambda V^T = \sum_{i=1}^{D} \lambda_i v_i v_i^T
Eigenvectors are orthonormal: v_i^T v_j = \delta_{ij}
Eigenvalues are non-negative and non-increasing: \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0
PCA keeps the M eigenvectors with the largest eigenvalues. Truncating gives an approximate covariance:
\hat\Sigma = \sum_{i=1}^{M} \lambda_i v_i v_i^T
PCA finds the coefficients by projection: c_{ij} = (x_i - \mu)^T v_j
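
A compact numpy sketch of PCA exactly as described: eigendecompose the sample covariance, keep the M largest eigenvalues, project to get coefficients, and reconstruct.

```python
import numpy as np

def pca(X, M):
    """PCA of an (N, D) data matrix via the covariance eigendecomposition."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / X.shape[0]
    lam, V = np.linalg.eigh(Sigma)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1][:M]         # keep the M largest
    lam, V = lam[order], V[:, order]
    C = Xc @ V                                # c_ij = (x_i - mu)^T v_j
    X_hat = mu + C @ V.T                      # x_hat_i = mu + sum_j c_ij v_j
    return mu, V, lam, C, X_hat
```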

PCA via the Snapshot Method
Careful: how big is the covariance matrix? Assume N = 1000 images, each containing D = 20,000 pixels.
Σ is D×D — that is unstorable! Also, finding the eigenvectors or inverting a D×D matrix requires O(D^3) work.
First compute the mean of all the data (easy) and subtract it from each point.
Instead of \Sigma = \frac{1}{N}\sum_i x_i x_i^T, compute the N×N Gram matrix Φ with entries \Phi_{ij} = x_i^T x_j.
Then find the eigendecomposition of the Gram matrix, \Phi = \tilde{V} \tilde\Lambda \tilde{V}^T.
The eigenvectors of Σ are then recovered as linear combinations of the images: v_i \propto \sum_{j=1}^{N} \tilde{V}_{ji}\, x_j.
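
A sketch of the snapshot trick (assuming N << D): eigendecompose the N×N Gram matrix, then map its eigenvectors back through the data matrix.

```python
import numpy as np

def snapshot_pca(X, M):
    """Leading eigenvectors of the covariance of an (N, D) matrix with D >> N,
    without ever forming the D x D covariance."""
    N = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Phi = Xc @ Xc.T                            # N x N Gram matrix, Phi_ij = (x_i - mu)^T (x_j - mu)
    lam, U = np.linalg.eigh(Phi)
    order = np.argsort(lam)[::-1][:M]
    lam, U = lam[order], U[:, order]
    V = Xc.T @ U                               # each v_i is a combination of the centered images
    V /= np.linalg.norm(V, axis=0)             # renormalize to unit length
    return mu, V, lam / N                      # eigenvalues of Sigma are eigenvalues of Phi / N
```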

PCA via the Snapshot Method
[Figure: the face images \{x_1, \ldots, x_N\} are summarized by the mean µ and reconstructed as \mu + \sum_j c_{ij} v_j using the snapshot eigenvectors.]

Truncated Gaussian Detection
Approximate Σ with PCA plus a spherical term: keep the specific leading eigenvalues & eigenvectors and replace the discarded ones with a spherical residual.
Exact covariance: \Sigma = \sum_{k=1}^{D} \lambda_k v_k v_k^T
Approximation: \hat\Sigma = \sum_{k=1}^{M} \lambda_k v_k v_k^T + \sum_{k=M+1}^{D} \rho\, v_k v_k^T = \sum_{k=1}^{M} \lambda_k v_k v_k^T + \rho\left( I - \sum_{k=1}^{M} v_k v_k^T \right)
Inverse: \hat\Sigma^{-1} = \sum_{k=1}^{M} \frac{1}{\lambda_k} v_k v_k^T + \sum_{k=M+1}^{D} \frac{1}{\rho} v_k v_k^T
Spherical residual variance: the average of the discarded eigenvalues,
\rho = \frac{1}{D-M}\sum_{k=M+1}^{D} \lambda_k = \frac{1}{D-M}\left( \sum_{k=1}^{D} \lambda_k - \sum_{k=1}^{M} \lambda_k \right) = \frac{1}{D-M}\left( \sum_{k=1}^{D} \Sigma_{kk} - \sum_{k=1}^{M} \lambda_k \right)
where \sum_{k=1}^{D} \Sigma_{kk} = \frac{1}{N}\sum_{i=1}^{N} \sum_{k=1}^{D} \left( x_i(k) - \mu(k) \right)^2 can be computed directly from the data.
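
A sketch of using this structure without ever building the D×D matrix: ρ from the discarded eigenvalue mass, and the quadratic form split into the kept subspace plus the spherical remainder (function names are illustrative).

```python
import numpy as np

def residual_variance(trace_Sigma, lam, D):
    """rho = average discarded eigenvalue; trace_Sigma = sum_k Sigma_kk, which can
    be computed directly from the data as (1/N) sum_i ||x_i - mu||^2."""
    M = lam.size
    return (trace_Sigma - lam.sum()) / (D - M)

def truncated_gaussian_logpdf(x, mu, V, lam, rho):
    """log N(x | mu, Sigma_hat) with Sigma_hat = V diag(lam) V^T + rho * (I - V V^T).
    V: (D, M) leading eigenvectors, lam: (M,) leading eigenvalues."""
    D, M = V.shape
    diff = x - mu
    c = V.T @ diff                                             # coefficients in the kept subspace
    quad = np.sum(c**2 / lam) + (diff @ diff - np.sum(c**2)) / rho
    logdet = np.sum(np.log(lam)) + (D - M) * np.log(rho)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)
```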

Truncated Gaussian Detection
Instead of minimizing squared error, use the Gaussian model:
p(x \mid \mu, \hat\Sigma) = \frac{1}{(2\pi)^{D/2} |\hat\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2}(x-\mu)^T \hat\Sigma^{-1} (x-\mu) \right)
Use snapshot PCA to efficiently store the big covariance.
This maximum-likelihood Gaussian model achieved state-of-the-art face finding as evaluated by NIST/DARPA (Moghaddam et al., 2000). The top performer after 2000 is Viola-Jones (boosted cascade).

Gaussian Face Recognition
Instead of modeling face images with a Gaussian, model the difference of two face images with a Gaussian.
Each difference between a pair of images in our data is represented as a D-dimensional vector x \in \mathbb{R}^D.
Each pair also has a binary label y \in \{0,1\}: y = 1 if both images show the same person, y = 0 if not.
[Figure: example pairs, e.g. y_i = 1 with x_i the difference of two images of the same person, and y_j = 0 with x_j the difference of images of two different people.]
One Gaussian models the same-face deltas, another models the different-person deltas.

Gaussian Classification
Have two classes, each with its own Gaussian. Data: \{(x_1,y_1),\ldots,(x_N,y_N)\}, x \in \mathbb{R}^D, y \in \{0,1\}.
Generation: 1) flip a coin with bias α to get y; 2) pick Gaussian y and sample x from it:
p(y \mid \alpha) = \alpha^{y}(1-\alpha)^{1-y}, \qquad p(x \mid y, \mu, \Sigma) = \mathcal{N}(x \mid \mu_y, \Sigma_y)
Maximum likelihood:
l = \sum_{i=1}^{N} \log p(x_i, y_i \mid \alpha, \mu, \Sigma) = \sum_{i=1}^{N} \log p(y_i \mid \alpha) + \sum_{i:\,y_i=0} \log p(x_i \mid \mu_0, \Sigma_0) + \sum_{i:\,y_i=1} \log p(x_i \mid \mu_1, \Sigma_1)

Gaussian Classification
Maximum likelihood can be done separately for the three terms:
l = \sum_{i} \log p(y_i \mid \alpha) + \sum_{i:\,y_i=0} \log p(x_i \mid \mu_0, \Sigma_0) + \sum_{i:\,y_i=1} \log p(x_i \mid \mu_1, \Sigma_1)
Count the number of positive and negative examples (class prior): \alpha = \frac{N_1}{N_0 + N_1}
Get the mean & covariance of the negatives and the mean & covariance of the positives:
\mu_0 = \frac{1}{N_0}\sum_{i:\,y_i=0} x_i, \quad \Sigma_0 = \frac{1}{N_0}\sum_{i:\,y_i=0} (x_i-\mu_0)(x_i-\mu_0)^T, \quad \mu_1 = \frac{1}{N_1}\sum_{i:\,y_i=1} x_i, \quad \Sigma_1 = \frac{1}{N_1}\sum_{i:\,y_i=1} (x_i-\mu_1)(x_i-\mu_1)^T
Given an (x, y) pair we can now compute the joint likelihood p(x, y).
To make a classification we need a bit of decision theory. Without x, we can compute a prior guess for y, p(y). Given x, we want y, so we need the posterior p(y \mid x).
Bayes-optimal decision: \hat{y} = \arg\max_{y \in \{0,1\}} p(y \mid x), which is optimal iff we have the true probability.
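
A minimal end-to-end sketch of this classifier for an (N, D) numpy array X and a 0/1 label array y (full covariances here; in the face setting the truncated covariances from the previous slides would be substituted):

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    D = mu.size
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def fit_gaussian_classifier(X, y):
    """Class prior alpha plus per-class ML mean and covariance."""
    alpha = y.mean()                                 # alpha = N1 / (N0 + N1)
    params = {}
    for label in (0, 1):
        Xc = X[y == label]
        mu = Xc.mean(axis=0)
        d = Xc - mu
        params[label] = (mu, d.T @ d / len(Xc))      # ML covariance for this class
    return alpha, params

def bayes_decision(x, alpha, params):
    """y_hat = argmax_y p(y) p(x | y), i.e. the Bayes-optimal decision."""
    s0 = np.log(1 - alpha) + gaussian_logpdf(x, *params[0])
    s1 = np.log(alpha) + gaussian_logpdf(x, *params[1])
    return int(s1 > s0)
```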

Gaussian Classification
Example cases, plotting the decision boundary where p(y=1 \mid x) = 0.5:
p(y=1 \mid x) = \frac{p(x, y=1)}{p(x, y=0) + p(x, y=1)} = \frac{\alpha\,\mathcal{N}(x \mid \mu_1, \Sigma_1)}{(1-\alpha)\,\mathcal{N}(x \mid \mu_0, \Sigma_0) + \alpha\,\mathcal{N}(x \mid \mu_1, \Sigma_1)}
If the covariances are equal: linear decision boundary. If the covariances are different: quadratic decision boundary.
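
To see where the linear vs. quadratic behavior comes from, write the boundary as the zero set of the log-odds:
\log\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \log\frac{\alpha}{1-\alpha} - \tfrac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_0|} - \tfrac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_0)^T\Sigma_0^{-1}(x-\mu_0) = 0
When \Sigma_0 = \Sigma_1 = \Sigma the two x^T\Sigma^{-1}x terms cancel, leaving an equation linear in x (a hyperplane); when the covariances differ, the \tfrac{1}{2}x^T(\Sigma_0^{-1}-\Sigma_1^{-1})x term survives and the boundary is a quadric.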

Intra-Extra Personal Gaussians
Intrapersonal Gaussian model \mathcal{N}(x \mid \mu_1, \Sigma_1): its covariance is approximated by its own set of eigenvectors [figure: intrapersonal eigenvectors].
Extrapersonal Gaussian model \mathcal{N}(x \mid \mu_0, \Sigma_0): its covariance is approximated by its own set of eigenvectors [figure: extrapersonal eigenvectors].
Question: what are the Gaussian means?
Probability that a pair is the same person:
p(y=1 \mid x) = \frac{\alpha\,\mathcal{N}(x \mid \mu_1, \Sigma_1)}{(1-\alpha)\,\mathcal{N}(x \mid \mu_0, \Sigma_0) + \alpha\,\mathcal{N}(x \mid \mu_1, \Sigma_1)}

Other Standard Bases
There are other choices for the eigenvectors, not just PCA. We could pick the eigenvectors without looking at the data, just for their interesting properties:
x_i \approx \mu + \sum_{j=1}^{C} c_{ij} v_j
Fourier basis: denoises, keeps only the smooth parts of the image.
Wavelet basis: a localized or windowed Fourier basis.
PCA: the optimal least-squares linear reconstruction of the dataset.
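
A sketch contrasting a fixed basis with PCA: build an orthonormal DCT (cosine, Fourier-type) basis without looking at any data and use it for the same project-and-reconstruct step. The choice of DCT here is only illustrative of a data-independent basis.

```python
import numpy as np

def dct_basis(D):
    """Orthonormal DCT-II basis for R^D (columns are basis vectors), chosen for
    its smoothness properties alone, not learned from the data."""
    n = np.arange(D)
    V = np.cos(np.pi * (n[:, None] + 0.5) * n[None, :] / D)
    V[:, 0] *= np.sqrt(1.0 / D)
    V[:, 1:] *= np.sqrt(2.0 / D)
    return V                                    # satisfies V.T @ V = I

def reconstruct(x, mu, V, C):
    """Keep the first C coefficients: x ~ mu + sum_{j<=C} c_j v_j."""
    c = V[:, :C].T @ (x - mu)
    return mu + V[:, :C] @ c
```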

Basis for Natural Images
What happens if we do PCA on all natural images instead of just faces?
We get difference-of-Gaussian bases, like a Gabor or wavelet basis.
Not specific like the face bases; multi-scale & multi-orientation.
Also called steerable filters; similar to the visual cortex.

Problems with Linear Bases
The coefficient representation c_{ij} = (x_i - \mu)^T v_j changes wildly if the image rotates, and so does p(x).
The eigenspace is sensitive to rotations, translations, and transformations of the image.
Simple linear/Gaussian/PCA models are not enough: what worked for aligned faces breaks for general image datasets.
Most of the PCA eigenvectors and spectrum energy is wasted due to NONLINEAR EFFECTS.