Regression and PCA
Classification. The goal: map an input X to a label Y, where Y has a discrete set of possible values. We focused on binary Y (values 0 or 1), but we also discussed a larger number of classes (e.g., Y in 0, ..., 9 for digit classification).
Regression. Sometimes we want to map an input X into a real number Y. This is the regression problem. Examples: X is high school exam scores and Y is the university final score; X is a website layout and Y is its traffic; X is user data and Y is expected electricity consumption.
Formal Definition. In the spirit of supervised classification, we will map X to Y via a function f(x). Prediction loss: (y - f(x))^2. f will come from a hypothesis class F. Assume (x, y) is sampled from a distribution D. We would like to minimize: E[(Y - f(X))^2].
ERM for Regression. Want to find f that minimizes E[(Y - f(X))^2], but we don't know this expected value. We do have (x_1, y_1), ..., (x_n, y_n). Use it to approximate the expected value: E[(Y - f(X))^2] ≈ (1/n) Σ_i (y_i - f(x_i))^2. This is the empirical risk. Minimize it!
ERM for Regression. The ERM problem: min_{f ∈ F} (1/n) Σ_i (y_i - f(x_i))^2. How does this relate to min_{f ∈ F} E[(Y - f(X))^2]? As in classification: the two get closer the more data we have, and the difference is larger for more complex classes. We won't discuss this further in the course.
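As a concrete illustration, here is a minimal sketch of evaluating the empirical risk for one candidate predictor; the toy data and the particular predictor f are made up for the example.

```python
import numpy as np

# Toy data: n pairs (x_i, y_i); the values are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.9, 4.1, 6.2, 7.8])

def f(x, a=2.0):
    # A candidate linear predictor f(x) = a * x from our hypothesis class.
    return a * x

# Empirical risk: (1/n) * sum_i (y_i - f(x_i))^2
empirical_risk = np.mean((y - f(x)) ** 2)
print(empirical_risk)
```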
Linear Regression. Which class of functions F should we use? It would be nice if they are simple and ERM is tractable. Start with the simplest case: linear functions. Assume x ∈ R^d. Then F contains the functions f(x) = a · x for a ∈ R^d.
Adding Bias. For f(x) = x · a it will always hold that f(0) = 0. We can also add a bias term: y = x · a + b. We can represent this by adding a constant feature 1: y = [x, 1] · [a, b]. We therefore don't add a bias explicitly.
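A small sketch of folding the bias into the weights by appending a constant-1 feature; the matrices and parameter values here are illustrative.

```python
import numpy as np

X = np.array([[0.5, 1.2],
              [1.0, -0.3],
              [2.0, 0.7]])                         # n x d data matrix
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append a constant feature 1

a, b = np.array([2.0, -1.0]), 0.5                  # weights and bias
ab = np.concatenate([a, [b]])                      # stacked parameter vector [a, b]

# x . a + b  equals  [x, 1] . [a, b]
assert np.allclose(X @ a + b, X_aug @ ab)
```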
Solving Linear Regression
Solving Linear Regression. Solve: min_a Σ_{i=1}^n (y_i - a · x_i)^2 = min_a ℓ(a). Define y = (y_1, ..., y_n)^T ∈ R^{n×1} and X ∈ R^{n×d} as the matrix whose rows are x_1, ..., x_n. Then ℓ(a) = ||y - Xa||_2^2 = (y - Xa)^T (y - Xa) = ||y||_2^2 - 2 y^T X a + a^T X^T X a.
Solving Linear Regression. Minimize: ℓ(a) = ||y||_2^2 - 2 y^T X a + a^T X^T X a. Take the gradient: ∇ℓ(a) = 2 X^T X a - 2 X^T y. Used: ∇(v · a) = v and ∇(a^T C a) = 2 C a. Set the gradient to zero and get: X^T X a = X^T y. If X^T X is invertible, we solve: a = (X^T X)^{-1} X^T y.
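A minimal sketch of the closed-form solution on synthetic data; the true weights and noise level are arbitrary choices for the example. Rather than forming the inverse explicitly, it solves the linear system X^T X a = X^T y directly, which is the numerically preferred route.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
a_true = np.array([1.0, -2.0, 0.5])
y = X @ a_true + 0.1 * rng.normal(size=n)

# Normal equations: X^T X a = X^T y
a_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(a_hat)  # close to a_true
```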
The Correlation Matrix. Recall the solution a = (X^T X)^{-1} X^T y. Define: C = (1/n) X^T X. Then: C_{i,j} = (1/n) Σ_k x_{k,i} x_{k,j} ≈ E[X_i X_j]. This measures the correlation between features i and j. For zero-mean variables, it is the covariance matrix.
The Singular Case. Recall: a = (X^T X)^{-1} X^T y. What happens if the correlation matrix is not invertible? When is it not invertible? Assume d < n. The singularity happens when rank[X] < d, i.e., one feature is a linear combination of the others (where X ∈ R^{n×d} has rows x_1, ..., x_n).
The Singular Case. Example: X ∈ R^{n×d} with rows (1, 1, ...), (0.5, 0.5, ...), (2, 2, ...), ..., so its first two columns are identical. This implies many equivalent solutions: a = [a_1, a_2, a_3, ..., a_d], a = [a_1 + a_2, 0, a_3, ..., a_d], a = [a_1 + 1e9, a_2 - 1e9, a_3, ..., a_d].
The Singular Case. The optimum satisfies: X^T X a = X^T y. In the singular case, this has infinitely many solutions: for any v such that X^T X v = 0, if a is a solution then so is a + v. How can we avoid this? Regularization!
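A sketch of the singular case on made-up data: with a duplicated feature, adding a null-space vector v to a solution leaves the predictions, and hence the loss, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1, rng.normal(size=n)])  # first two columns identical
y = X @ np.array([1.0, 0.0, 2.0]) + 0.05 * rng.normal(size=n)

# One least-squares solution (lstsq handles the singular X^T X via the pseudo-inverse).
a = np.linalg.lstsq(X, y, rcond=None)[0]

# v = [1, -1, 0] satisfies X v = 0, hence X^T X v = 0, so a + v is also optimal.
v = np.array([1.0, -1.0, 0.0])
print(np.allclose(X @ a, X @ (a + v)))  # True: same predictions, same loss
```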
Regularized Regression. How do we choose between a set of solutions? Add a regularization term f(a) that is low for solutions we prefer and high for those we don't. Recall SVM, where we added l2 regularization. Indeed, two popular choices for regression are: add ||a||_2^2 (Ridge regression), or add ||a||_1 = Σ_i |a_i| (Lasso).
Ridge Regression. Goal: arg min_a ||y - Xa||_2^2 + λ||a||_2^2. A derivation nearly identical to the standard case gives: a = (X^T X + λ I_d)^{-1} X^T y. Note that the inversion above is always possible! (Why?) If X^T X = I then: Regular: a = X^T y; Ridge: a = (1/(1 + λ)) X^T y.
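A sketch of the ridge closed form; the data is synthetic and the value of λ is an arbitrary example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: a = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=1.0))
```

For λ > 0 the matrix X^T X + λI is positive definite, which is why the inversion is always possible.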
Lasso (Tibshirani '96). Goal: arg min_a ||y - Xa||_2^2 + λ||a||_1. Results in sparse solutions (zero weights for features that are not "too important").
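A hedged sketch of the sparsity effect using scikit-learn's Lasso; the data and the alpha value are arbitrary, and note that sklearn's objective scales the squared-error term, so its alpha is not numerically identical to the λ above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
a_true = np.zeros(10)
a_true[:3] = [2.0, -1.5, 1.0]            # only the first 3 features truly matter
y = X @ a_true + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # typically sparse: most entries driven to exactly zero
```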
Regression Extensions. Non-linear predictors: kernels, neural nets. Generalization analysis. Fancy regularizers where the desired a has more structure (e.g., a_2 and a_3 should be close).
Supervised Learning. Labeled data: each example is a feature vector x with a label y (e.g., quail, apple, corn). Model class: consider classifiers of the form y = f(x; w). Learning: find w that works well on the training data.
Unsupervised Learning But life is more like this: Many images. Very few labels. What can we do?
Unsupervised Learning. Data is millions of points, each with 20K features. What can we do with it? Understand its structure; learn useful new features/representations; use it together with some labeled data (this is known as semi-supervised learning).
Understanding Structure. What can you say about these points? How can we use these clusters? They may correspond to something useful (groups in the population), or could be used as features for learning. How do we find them? Clustering algorithms. Next classes!
Understanding Structure. What can you say about these points (plotted in the (x_1, x_2) plane)? How can we use this? The variables are dependent, which could be meaningful. The data is really 1D: it can be represented by a single number instead of 2. How do we find this structure? Principal Component Analysis. Now!
Unsupervised Learning Two key goals: Clustering Dimensionality reduction
Linear Subspaces Suppose our data lies on a low dimensional linear subspace. How do we find this subspace? Principal Component Analysis
Linear Subspaces. An r-dimensional linear subspace is defined via a basis v_1, ..., v_r ∈ R^d. The subspace is all points x ∈ R^d such that there exist a_1, ..., a_r with x = Σ_i a_i v_i. The a_i are an r-dimensional representation (encoding) of x. Denote V = [v_1 v_2 ... v_r] and a = [a_1, ..., a_r]. Then: x = V a.
Linear Subspaces. An r-dimensional linear subspace is defined via a basis v_1, ..., v_r ∈ R^d. Assume w.l.o.g. that the basis vectors are orthonormal, namely v_i · v_j = δ_{i,j}; we can always get such a basis via Gram-Schmidt. So: V^T V = I_r, x = V a, and a = V^T x.
Encoding-Decoding (diagram): x ∈ R^d is encoded as a = V^T x ∈ R^r and decoded back as x = V a ∈ R^d.
Projection to Subspace. Now say we have a point x not in the subspace. What is the closest point x' in the subspace? Given x ∈ R^d, the closest point in the subspace is arg min_{x' = V a} ||x' - x||_2^2 = V V^T x.
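A small sketch of projecting a point onto a subspace spanned by an orthonormal basis; the basis and the point are chosen purely for illustration.

```python
import numpy as np

# Orthonormal basis of a 2D subspace inside R^3 (columns of V).
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

x = np.array([2.0, -1.0, 3.0])   # a point not in the subspace

a = V.T @ x          # encoding: subspace coordinates in R^r
x_proj = V @ a       # decoding: closest point in the subspace, V V^T x
print(x_proj)        # [ 2. -1.  0.]
```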
Encoding-Decoding. arg min_{x' = V a} ||x' - x||_2^2 = V V^T x. Encoding: x ∈ R^d (not in the subspace) ↦ subspace coordinates a = V^T x ∈ R^r. Decoding: a ↦ x' = V a ∈ R^d (in the subspace).
The PCA Problem. Goal: find the subspace that is closest to all the data points: min_{V : V^T V = I} Σ_i ||x_i - V V^T x_i||_2^2. This is the PCA optimization problem. There are other equivalent formulations.
Mean Removal. Note: before running PCA, remove the mean so that the data has mean zero. Several reasons for doing this: it is optimal for finding affine subspaces, and it makes the correlation matrix the covariance matrix. Formally: set µ = (1/n) Σ_i x_i and set the new data to x_i ← x_i - µ.
The PCA Solution. Denote by C the covariance matrix: C = X^T X. Denote its eigenvalues by λ_1 ≥ λ_2 ≥ ... ≥ λ_d, with corresponding eigenvectors u_1, u_2, ..., u_d. Then the PCA solution is: v_1 = u_1, ..., v_r = u_r. Namely, take the r eigenvectors with the largest eigenvalues. See proof in the writeup.
Dim. Reduction with PCA. The PCA projection matrix is: V = [u_1 u_2 ... u_r]. Map x ∈ R^d to a = V^T x in R^r. Map back via x' = V a. Denote the lower-dimensional points by a_1, ..., a_n, and the matrix with the a_i as rows by A. Then A^T A = V^T X^T X V is diagonal: the new features are uncorrelated!
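A minimal PCA sketch following these steps (remove the mean, eigendecompose the covariance, keep the top-r eigenvectors); the synthetic data and dimensions are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 500, 5, 2
# Synthetic data that lies close to a 2D subspace of R^5.
Z = rng.normal(size=(n, r))
X = Z @ rng.normal(size=(r, d)) + 0.05 * rng.normal(size=(n, d))

X = X - X.mean(axis=0)               # mean removal
C = X.T @ X / n                      # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C) # eigh: ascending eigenvalues for symmetric C
V = eigvecs[:, ::-1][:, :r]          # top-r eigenvectors as columns

A = X @ V            # encode: n x r low-dimensional representation
X_recon = A @ V.T    # decode: map back to R^d
print(np.mean((X - X_recon) ** 2))   # small reconstruction error
```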
Encoding-Decoding (recap of the earlier diagram): encode a = V^T x ∈ R^r, decode x' = V a ∈ R^d.
Toy Example: 2D data with the principal directions v_1 and v_2 shown.
PCA on Faces
The Eigenfaces (u_i)
Decoded Faces. When using 10, 30, ..., 310 eigenfaces.
PCA as Linear Autoencoder. PCA can be used to encode and decode an input. But is it optimal among linear encoder-decoders (encoder a = W x, decoder x' = V a)? Yes. One can show that PCA solves: min_{W ∈ R^{r×d}, V ∈ R^{d×r}} Σ_i ||x_i - V W x_i||_2^2.
PCA - Extensions. Kernels: apply a non-linear transformation to x, where the kernel trick can be used. Non-linear autoencoders: replace the linear encoding/decoding by neural networks; harder to train but can give nice results. Figure: PCA (left), neural autoencoder (right), from Hinton & Salakhutdinov, 2006.