

Regression and PCA

Classification. The goal: map from an input X to a label Y, where Y takes values in a discrete set. We focused on binary Y (values 0 or 1), but we also discussed a larger number of classes (e.g., Y in {0,...,9} for digit classification).

Regression. Sometimes we want to map an input X to a real number Y. This is the regression problem. Examples: X is high-school exam scores and Y is the university final score; X is a website layout and Y is its traffic; X is user data and Y is the expected electricity consumption.

Formal Definition. In the spirit of supervised classification, we map X to Y via a function f(x). Prediction loss: $(y - f(x))^2$. f will come from a hypothesis class $\mathcal{F}$. Assume (x, y) is sampled from a distribution D. We would like to minimize $E[(Y - f(X))^2]$.

ERM for Regression. We want to find f that minimizes $E[(Y - f(X))^2]$, but we don't know this expected value. We do have $(x_1, y_1), \ldots, (x_n, y_n)$; use it to approximate the expected value: $E[(Y - f(X))^2] \approx \frac{1}{n}\sum_i (y_i - f(x_i))^2$. This is the empirical risk. Minimize it!
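For concreteness, a minimal NumPy sketch of the empirical risk for a linear predictor $f(x) = a \cdot x$ (the data and vectors below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample (x_1, y_1), ..., (x_n, y_n) with a noisy linear relation.
n, d = 100, 3
X = rng.normal(size=(n, d))
a_true = np.array([1.0, -2.0, 0.5])
y = X @ a_true + 0.1 * rng.normal(size=n)

def empirical_risk(a, X, y):
    """(1/n) * sum_i (y_i - a . x_i)^2, the empirical version of E[(Y - f(X))^2]."""
    residuals = y - X @ a
    return np.mean(residuals ** 2)

print(empirical_risk(a_true, X, y))       # small: close to the noise variance
print(empirical_risk(np.zeros(d), X, y))  # large: the zero predictor fits poorly
```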

ERM for Regression. The ERM problem: $\min_{f \in \mathcal{F}} \frac{1}{n}\sum_i (y_i - f(x_i))^2$. How does this relate to $\min_{f \in \mathcal{F}} E[(Y - f(X))^2]$? As in classification: it gets closer the more data we have, and the difference is larger for more complex hypothesis classes. We won't discuss this further in the course.

Linear Regression. Which functions $\mathcal{F}$ should we use? It would be nice if they are simple and ERM is tractable. Start with the simplest case, linear functions. Assume $x \in \mathbb{R}^d$. Then $\mathcal{F}$ contains the functions $f(x) = a \cdot x$ for $a \in \mathbb{R}^d$.

Adding Bias. For $f(x) = x \cdot a$ it will always hold that $f(0) = 0$. We can also add a bias term: $y = x \cdot a + b$. We can represent this by adding a constant feature 1: $y = [x, 1] \cdot [a, b]$. We therefore don't add a bias explicitly.
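A minimal sketch of this constant-feature trick, assuming NumPy arrays (the numbers are arbitrary); appending a column of ones lets the same solver learn the bias b as the last coordinate of [a, b]:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # n x d data matrix

# Append a constant feature of 1s, so that [x, 1] . [a, b] = x . a + b.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
print(X_aug)
```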

Solving Linear Regression

Solving Linear Regression. Solve: $\min_a \sum_{i=1}^n (y_i - a \cdot x_i)^2 = \min_a \ell(a)$. Define $y = (y_1, \ldots, y_n)^T \in \mathbb{R}^{n \times 1}$ and $X \in \mathbb{R}^{n \times d}$, the matrix whose rows are $x_1, \ldots, x_n$. Then $\ell(a) = \|y - Xa\|_2^2 = (y - Xa)^T(y - Xa) = \|y\|_2^2 - 2y^T Xa + a^T X^T X a$.

Solving Linear Regression. Minimize $\ell(a) = \|y\|_2^2 - 2y^T Xa + a^T X^T X a$. Take the gradient: $\nabla \ell(a) = 2X^T Xa - 2X^T y$, using $\nabla_a(v \cdot a) = v$ and $\nabla_a(a^T C a) = 2Ca$ (for symmetric C). Set the gradient to zero and get $X^T X a = X^T y$. If $X^T X$ is invertible, we solve $a = (X^T X)^{-1} X^T y$.
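A minimal NumPy sketch of this closed-form solution (the data is synthetic, and in practice one solves the linear system rather than forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
a_true = rng.normal(size=d)
y = X @ a_true + 0.05 * rng.normal(size=n)

# Solve the normal equations X^T X a = X^T y (assumes X^T X is invertible).
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent but numerically more robust route:
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(a_hat, a_lstsq))  # True
```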

The Correlation Matrix. Recall the solution $a = (X^T X)^{-1} X^T y$. Define $C = \frac{1}{n} X^T X$. Then $C_{i,j} = \frac{1}{n}\sum_k x_{ki} x_{kj} \approx E[X_i X_j]$, which measures the correlation between features i and j. For zero-mean variables, it is the covariance matrix.

The Singular Case. Recall $a = (X^T X)^{-1} X^T y$. What happens if the correlation matrix is not invertible? When is it not invertible? Assume $d < n$. The singularity happens when $\mathrm{rank}(X) < d$, i.e., one feature (column of $X \in \mathbb{R}^{n \times d}$) is a linear combination of the others.

The Singular Case. Example: the first two columns of $X \in \mathbb{R}^{n \times d}$ are identical, e.g. rows $(1, 1, \ldots), (0.5, 0.5, \ldots), (2, 2, \ldots)$. This implies many equivalent solutions: $a = [a_1, a_2, a_3, \ldots, a_d]$, $a = [a_1 + a_2, 0, a_3, \ldots, a_d]$, $a = [a_1 + 10^9, a_2 - 10^9, a_3, \ldots, a_d]$.

The Singular Case. The optimum satisfies $X^T X a = X^T y$. In the singular case, this has infinitely many solutions: for any v such that $X^T X v = 0$, if a is a solution then so is $a + v$. How can we avoid this? Regularization!
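A small numerical illustration of the singular case (my own construction): when one feature is a multiple of another, $X^T X$ loses rank and different coefficient vectors give exactly the same predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, 0.5 * x1, rng.normal(size=n)])  # feature 2 = 0.5 * feature 1

print(np.linalg.matrix_rank(X.T @ X))  # 2 < d = 3, so X^T X is singular

# Two different coefficient vectors with identical predictions X @ a:
a1 = np.array([1.0, 0.0, 2.0])
a2 = np.array([0.0, 2.0, 2.0])  # weight moved from feature 1 to feature 2
print(np.allclose(X @ a1, X @ a2))  # True
```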

Regularized Regression. How do we choose between a set of solutions? Add a regularization term $R(a)$ that is low for solutions we prefer, and high for those we don't. Recall the SVM, where we add $\ell_2$ regularization. Indeed, two popular choices for regression are: add $\|a\|_2^2$ (ridge regression), or add $\|a\|_1 = \sum_i |a_i|$ (lasso).

Ridge Regression. Goal: $\arg\min_a \|y - Xa\|_2^2 + \lambda\|a\|_2^2$. A derivation nearly identical to the standard case gives $a = (X^T X + \lambda I_d)^{-1} X^T y$. Note that the inversion above is always possible! (Why?) If $X^T X = I$ then regular regression gives $a = X^T y$, while ridge gives $a = \frac{1}{1+\lambda} X^T y$.
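A minimal NumPy sketch of this ridge closed form (the data and the value of λ are illustrative); adding $\lambda I$ makes the matrix positive definite, so the system is always solvable:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0  # regularization strength lambda

# Ridge solution: a = (X^T X + lambda * I_d)^{-1} X^T y
a_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(a_ridge)
```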

Lasso (Tibshirani '96). Goal: $\arg\min_a \|y - Xa\|_2^2 + \lambda\|a\|_1$. Results in sparse solutions (zero weights for features that are not "too important").
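The lasso has no closed form, but standard solvers exist. A hedged sketch using scikit-learn's Lasso on made-up data shows the sparsity effect; note that scikit-learn scales the squared-error term by 1/(2n), so its alpha differs from the λ above by a constant factor:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, d = 100, 20
X = rng.normal(size=(n, d))
a_true = np.zeros(d)
a_true[:3] = [2.0, -1.5, 1.0]  # only 3 features actually matter
y = X @ a_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1, fit_intercept=False)
lasso.fit(X, y)
print(lasso.coef_)  # most coefficients come out exactly 0
```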

Regression Extensions. Non-linear predictors: kernels, neural nets. Generalization analysis. Fancy regularizers where the desired a has more structure (e.g., $a_2$ and $a_3$ should be close).

Supervised Learning. Labeled data: each example is a feature vector x with a label y (the slide shows images of quail, apples, and corn with their feature vectors and labels). Model class: consider classifiers of the form y = f(x; w). Learning: find w that works well on the training data.

Unsupervised Learning. But life is more like this: many images, very few labels. What can we do?

Unsupervised Learning. Data is millions of points, each with 20K features. What can we do with it? Understand its structure; learn useful new features/representations; use it together with some labeled data (this is known as semi-supervised learning).

Understanding Structure. What can you say about these points? How can we use these clusters? They may correspond to something useful (groups in a population), or could be used as features for learning. How do we find them? Clustering algorithms. Next classes!

Understanding Structure. What can you say about these points (plotted in the $x_1$-$x_2$ plane)? How can we use this? The variables are dependent, which could be meaningful. The data is really 1D: it can be represented by a single number instead of 2. How do we find this structure? Principal Component Analysis. Now!

Unsupervised Learning. Two key goals: clustering and dimensionality reduction.

Linear Subspaces. Suppose our data lies on a low-dimensional linear subspace. How do we find this subspace? Principal Component Analysis.

Linear Subspaces. An r-dimensional linear subspace is defined via a basis $v_1, \ldots, v_r \in \mathbb{R}^d$. The subspace is all points $x \in \mathbb{R}^d$ such that there exist $a_1, \ldots, a_r$ with $x = \sum_i a_i v_i$. The $a_i$ are an r-dimensional representation (encoding) of x. Denote $V = [v_1\ v_2\ \ldots\ v_r]$ and $a = [a_1, \ldots, a_r]$. Then $x = Va$.

Linear Subspaces. An r-dimensional linear subspace is defined via a basis $v_1, \ldots, v_r \in \mathbb{R}^d$. Assume w.l.o.g. that the basis vectors are orthonormal, namely $v_i \cdot v_j = \delta_{i,j}$ (we can always arrange this via Gram-Schmidt). So $V^T V = I_r$, $x = Va$, and $a = V^T x$.

Encoding-Decoding. Encode: $a = V^T x \in \mathbb{R}^r$. Decode: $x = Va \in \mathbb{R}^d$.

Projection to Subspace. Now say we have a point x not in the subspace. What is the closest point $x'$ in the subspace? Given $x \in \mathbb{R}^d$, the closest point is $\arg\min_{x' = Va} \|x' - x\|_2^2 = VV^T x$.
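A minimal sketch of this projection (my own construction): build an orthonormal basis V for a random 2-dimensional subspace of $\mathbb{R}^5$ via QR (in place of Gram-Schmidt) and check that $VV^T x$ beats other points of the subspace:

```python
import numpy as np

rng = np.random.default_rng(5)
d, r = 5, 2

# Orthonormal basis for a random r-dimensional subspace of R^d.
V, _ = np.linalg.qr(rng.normal(size=(d, r)))
print(np.allclose(V.T @ V, np.eye(r)))  # V^T V = I_r

x = rng.normal(size=d)       # a point generally not in the subspace
x_proj = V @ (V.T @ x)       # closest point in the subspace: V V^T x

# The residual is orthogonal to the subspace, and no other V a is closer to x.
print(np.allclose(V.T @ (x - x_proj), 0))
for _ in range(3):
    a = rng.normal(size=r)
    assert np.linalg.norm(V @ a - x) >= np.linalg.norm(x_proj - x)
```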

Encoding-Decoding. $\arg\min_{x' = Va} \|x' - x\|_2^2 = VV^T x$. Encoding: $x \in \mathbb{R}^d$ (not in the subspace) $\mapsto a = V^T x \in \mathbb{R}^r$ (subspace coordinates). Decoding: $a \mapsto x' = Va \in \mathbb{R}^d$ (in the subspace).

The PCA Problem. Goal: find the subspace that is closest to all the data points: $\min_{V : V^T V = I} \sum_i \|x_i - VV^T x_i\|_2^2$. This is the PCA optimization problem. There are other equivalent formulations.

Mean Removal. Note: before running PCA, remove the mean so that the data has mean zero. Several reasons for doing this: it is optimal for finding affine subspaces, and it makes the correlation matrix the covariance matrix. Formally: set $\mu = \frac{1}{n}\sum_i x_i$ and set the new data to $x_i \leftarrow x_i - \mu$.

The PCA Solution. Denote by C the covariance matrix $C = X^T X$. Denote its eigenvalues by $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$, with corresponding eigenvectors $u_1, u_2, \ldots, u_d$. Then the PCA solution is $v_1 = u_1, \ldots, v_r = u_r$. Namely, take the r eigenvectors with the largest eigenvalues. See the proof in the writeup.
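A minimal NumPy sketch of this recipe on synthetic data (variable names are my own): remove the mean, eigendecompose $X^T X$, keep the top-r eigenvectors, and use them to encode and decode:

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data that is approximately 2-dimensional inside R^5.
n, d, r = 200, 5, 2
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d)) + 0.05 * rng.normal(size=(n, d))

# 1. Remove the mean.
mu = X.mean(axis=0)
Xc = X - mu

# 2. Covariance matrix and its eigendecomposition (eigh: C is symmetric).
C = Xc.T @ Xc
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order

# 3. The top-r eigenvectors form the PCA projection matrix V (d x r).
V = eigvecs[:, ::-1][:, :r]

# Encode and decode.
A = Xc @ V              # n x r low-dimensional representation
X_rec = A @ V.T + mu    # reconstruction in R^d
print(np.mean((X - X_rec) ** 2))  # small reconstruction error
```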

Dim. Reduction with PCA. The PCA projection matrix is $V = [u_1\ u_2\ \ldots\ u_r]$. Map $x \in \mathbb{R}^d$ to $a = V^T x \in \mathbb{R}^r$, and map back via $x' = Va$. Denote the lower-dimensional points by $a_1, \ldots, a_n$, and the matrix with the $a_i$ as rows by A. Then $A^T A = V^T X^T X V = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$ is diagonal: the new features are uncorrelated!

Encoding-Decoding. Encoding: $x \in \mathbb{R}^d$ (not in the subspace) $\mapsto a = V^T x \in \mathbb{R}^r$ (subspace coordinates). Decoding: $a \mapsto x' = Va \in \mathbb{R}^d$ (in the subspace).

Toy Example (figure): 2D data points together with the principal directions $v_1$ and $v_2$.

PCA on Faces

The Eigenfaces (the $u_i$)

Decoded Faces. When using 10, 30, ..., 310 eigenfaces.

PCA as Linear Autoencoder. PCA can be used to encode and decode an input. But is it optimal among linear encoder-decoders ($x' = Va$, $a = Wx$)? Yes: one can show that PCA solves $\min_{W \in \mathbb{R}^{r \times d},\ V \in \mathbb{R}^{d \times r}} \sum_i \|x_i - VWx_i\|_2^2$.
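A small numerical check of this claim (my own, not from the slides): the PCA choice, V = top-r eigenvectors and $W = V^T$, achieves a reconstruction error no larger than randomly chosen rank-r linear encoder-decoders of the same shape:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, r = 300, 6, 2
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d)) + 0.1 * rng.normal(size=(n, d))
X = X - X.mean(axis=0)  # PCA assumes centered data

def reconstruction_error(V, W, X):
    """sum_i ||x_i - V W x_i||^2 for a linear encoder W (r x d) and decoder V (d x r)."""
    return np.sum((X - X @ W.T @ V.T) ** 2)

# PCA choice: V = top-r eigenvectors of X^T X, W = V^T.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
V_pca = eigvecs[:, ::-1][:, :r]
err_pca = reconstruction_error(V_pca, V_pca.T, X)

# Compare against random linear encoder-decoders of the same rank.
for _ in range(5):
    assert err_pca <= reconstruction_error(rng.normal(size=(d, r)),
                                           rng.normal(size=(r, d)), X)
print("PCA reconstruction error:", err_pca)
```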

PCA Extensions. Kernels: apply a non-linear transformation to x, where the kernel trick can be used. Non-linear autoencoders: replace the linear encoding/decoding with neural networks; harder to train, but can give nice results. Figure: PCA (left) vs. a neural autoencoder (right), from Hinton & Salakhutdinov, 2006.