STATS 306B: Unsupervised Learning    Spring 2014
Lecture 8: April 23
Lecturer: Lester Mackey    Scribes: Kexin Nie, Nan Bi

8.1 Principal Component Analysis

Last time we introduced the mathematical framework underlying Principal Component Analysis (PCA); next we will consider some of its applications. Please refer to the accompanying slides.

8.1.1 Examples

Example 1. Digit data

(Slide 2:) Here is an example taken from the textbook. This set of handwritten digit images contains 130 threes, and each three is a 16 x 16 greyscale image. Hence we may represent each datapoint as a vector of 256 greyscale pixel intensities.

(Slide 3:) The figure on the left shows the first two principal component scores of these images. The rectangular grid is formed from selected quantiles of the two principal components, and the circled points mark the images whose projected coordinates are closest to the vertices of the grid. The figure on the right displays the threes corresponding to the circled points. The vertical component appears to capture changes in line thickness / darkness, while the horizontal component appears to capture changes in the length of the bottom of the three.

(Slide 4:) This is a visual representation of the learned two-component PCA model. The first term is the mean of all images, and the following v_1 and v_2 are the two visualized principal directions (the loadings), which can also be called "eigen-threes."

Example 2. Eigen-faces

(Slide 5:) PCA is widely used in face recognition. Suppose X ∈ R^{d x n} is the pixel-image matrix, where each column is a face image, d is the number of pixels, and x_ji is the intensity of the j-th pixel in image i. The loadings returned by PCA are linear combinations of faces, which can be called eigen-faces. The working assumption is that the PC scores z_i, obtained by projecting the original image onto the eigen-face space, give a more meaningful and compact representation of the i-th face than the raw pixel representation. The z_i can be used in place of x_i for nearest-neighbor classification. Since the dimension of face-space has decreased from d to k, the computational complexity becomes O(dk + nk) instead of O(nd). This is a substantial saving when n, d >> k.

Example 3. Latent semantic analysis

(Slide 6:) Another application of PCA is in text analysis. Let d be the total number of words in the vocabulary; then each document x_i ∈ R^d is a vector of word counts, and x_ji is the frequency of word j in document i. After we apply PCA, the similarity between two documents is the inner product of their scores, z_i^T z_j, which is often more informative than the raw measure x_i^T x_j. Notice that there may not be significant computational savings, since the original word-document matrix was sparse, while the reduced representation is typically dense.
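In the spirit of the latent semantic analysis example, here is a minimal numpy sketch of computing PC scores for a word-document count matrix and comparing documents via z_i^T z_j. The synthetic Poisson counts, the choice k = 10, and all variable names are illustrative assumptions rather than part of the lecture.

```python
import numpy as np

# Toy word-document count matrix: d vocabulary words (rows) by n documents (columns),
# matching the X in R^{d x n} convention of the eigen-faces and LSA examples.
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(1000, 50)).astype(float)   # d = 1000 words, n = 50 docs

k = 10                                    # number of principal components to keep
Xc = X - X.mean(axis=1, keepdims=True)    # center each word's counts across documents

# PCA via the SVD of the centered matrix: the leading left singular vectors are the
# loadings, and Z holds each document's k-dimensional scores.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = U[:, :k]                            # d x k matrix of loadings
Z = V_k.T @ Xc                            # k x n matrix of PC scores

# Document similarity in the reduced space versus the raw count space.
sim_reduced = Z.T @ Z                     # z_i^T z_j for all document pairs
sim_raw = X.T @ X                         # x_i^T x_j for all document pairs
print(sim_reduced.shape, sim_raw.shape)   # both (50, 50)
```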

Example 4. Anomaly detection

(Slide 7:) PCA can be used in network anomaly detection. In the time-link matrix X, x_ji represents the amount of traffic on link j in the network during time interval i. In the two pictures on the left, the traffic projected onto the selected principal component appears periodic and reasonably deterministic, suggesting normal behavior. In contrast, the traffic spikes in the pictures on the right, indicating anomalous behavior in this flow.

Example 5. Part-of-speech tagging

(Slide 8:) Unsupervised part-of-speech tagging is a common task in natural language processing, as manually tagging a large corpus is expensive and time-consuming. Here it is common to model each word in a vocabulary by its context distribution, i.e., x_ji is the number of times that word i appears in context j. The key idea of unsupervised POS tagging is that words appearing in similar contexts tend to have the same POS tags. Hence, a typical tagging technique is to cluster words according to their contexts. However, in any given corpus, any given context may occur rather infrequently (the vectors x_i are too sparse), so PCA has been used to find a more suitable, comparable representation for each word before clustering is applied.

Example 6. Multi-task learning

(Slide 9:) In multi-task learning, one attempts to solve related learning tasks simultaneously, e.g., classifying documents as relevant or not for each of n users. Often task i reduces to learning a weight vector x_i which defines, for example, the classification rule. Our goal is to exploit the similarities amongst these tasks to do more effective learning overall. One way to accomplish this is to use PCA to identify a small set of eigen-classifiers among the learned rules x_1, ..., x_n. Then the classifiers can be retrained with an added regularization term encouraging each x_i to lie near the subspace spanned by the principal directions. These two steps of PCA and retraining are iterated until convergence. In this way, a low-dimensional representation of the classifiers can help to detect structure shared across otherwise independent tasks.

8.1.2 Choosing the number of components

(Slide 10:) As in the clustering setting, we face a model selection question: how do we choose the number of principal components? While there is no agreed-upon solution to this problem, here are some guidelines. The number of principal components might be constrained by the problem goal, by your computational or storage resources, or by the minimum fraction of variance to be explained. For example, it is common to choose 3 or fewer principal components for visualization. Recall that the eigenvalue magnitudes determine the explained variance. In the accompanying figure, the first 5 principal components already explain nearly all of the variance, so a small number of principal components may be sufficient (although one must use care in drawing this conclusion, since small differences in reconstruction error may still be semantically significant; consider face recognition, for example). Furthermore, we may look for an elbow in the plot of explained variance or compare the explained variance with that obtained under a reference distribution.
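As a concrete illustration of the explained-variance guideline, here is a minimal numpy sketch that picks the smallest k capturing a target fraction of variance. The synthetic Gaussian data and the 90% threshold are illustrative assumptions, not prescriptions from the lecture.

```python
import numpy as np

# Eigenvalues of the sample covariance give the variance captured by each principal
# component; their cumulative sum gives the fraction explained by the first k of them.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))                  # n = 200 observations, p = 30 features
Xc = X - X.mean(axis=0)                             # center the columns
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))   # ascending order
eigvals = eigvals[::-1]                             # sort largest first

frac_explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(frac_explained, 0.90)) + 1  # smallest k explaining >= 90%
print(k, frac_explained[:k])
```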

8.1.3 PCA limitations and extensions

While PCA has a great number of applications, it has its limitations as well:

- Squared Euclidean reconstruction error is not appropriate for all data types. Various extensions, such as exponential family PCA, have been developed for binary, categorical, count, and nonnegative data.
- PCA can only find linear compressions of the data. Kernel PCA is an important generalization designed for non-linear dimensionality reduction.

8.2 Non-linear dimensionality reduction with kernel PCA

8.2.1 Intuition

Figure 8.1. Data lying near a linear subspace
Figure 8.2. Data lying near a parabola

Figure 8.1 displays a 2D example in which PCA is effective because the data lie near a linear subspace. However, in Figure 8.2 PCA is ineffective, because the data lie near a parabola. In this case, the PCA compression of the data might project all points onto the orange line, which is far from ideal. Let us consider the differences between these two settings mathematically.

Linear subspace (Figure 8.1): In this example we have ambient dimension p = 2 and component dimension k = 1. Since the blue line is a k-dimensional linear subspace of R^p, we know that there is some matrix U ∈ R^{p x k} such that the subspace S takes the form

    S = {x ∈ R^p : x = Uz, z ∈ R^k}
      = {(x_1, x_2) : x_1 = u_1 z, x_2 = u_2 z}
      = {(x_1, x_2) : x_2 = (u_2 / u_1) x_1},

where U = [u_1; u_2], since (p, k) = (2, 1) in our example.
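The contrast between the two figures can be checked numerically: with one component, PCA reconstructs data near a line almost perfectly but leaves a large residual on data near a parabola. The following numpy sketch is a hypothetical illustration with synthetic data; the slope 2, the noise level 0.05, and the sample size are arbitrary choices.

```python
import numpy as np

def pca_recon_error(X, k=1):
    """Mean squared reconstruction error of a k-component PCA fit (rows = observations)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                      # p x k matrix of loadings
    recon = Xc @ V_k @ V_k.T            # project onto the k-dimensional subspace
    return np.mean(np.sum((Xc - recon) ** 2, axis=1))

rng = np.random.default_rng(2)
t = rng.uniform(-1, 1, size=500)
noise = 0.05 * rng.standard_normal((500, 2))
line = np.column_stack([t, 2 * t]) + noise            # data near a linear subspace (Fig. 8.1)
parabola = np.column_stack([t, 2 * t ** 2]) + noise   # data near a parabola (Fig. 8.2)

# The first error is on the order of the noise; the second is orders of magnitude larger.
print(pca_recon_error(line), pca_recon_error(parabola))
```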

Parabola (Figure 8.2): In this example we again have ambient dimension p = 2 and component dimension k = 1. Moreover, there is some fixed matrix U ∈ R^{p x k} such that the underlying blue parabola takes the form

    S = {(x_1, x_2) : x_2 = (u_2 / u_1) x_1^2},

which is similar to the representation derived in the linear model. Indeed, if we introduce an auxiliary variable z, we get

    S = {(x_1, x_2) : x_1^2 = u_1 z, x_2 = u_2 z, for z ∈ R}
      = {x ∈ R^p : φ(x) = Uz, z ∈ R^k},

where φ(x) = (x_1^2, x_2) is a non-linear function of x. In this final representation, U is still a linear mapping of the latent components z, but the representation being reconstructed linearly is no longer x itself but rather a potentially non-linear mapping φ of x.

8.2.2 Take-away

We should be able to capture non-linear dimensionality reduction in x space by performing linear dimensionality reduction in φ(x) space (we often call φ(x) the feature space). Of course we still need to find the right feature space in which to perform dimensionality reduction. One option is to hand-design the feature mapping φ explicitly, coordinate by coordinate, e.g., φ(x) = (x_1, x_2^2, x_1 x_2, sin(x_1), ...). However, this process quickly becomes tedious and ad hoc. Moreover, working in feature space becomes expensive if φ(x) is very large; for example, the number of all quadratic terms x_i x_j alone is O(p^2). An alternative, which we will explore next, is to encode φ implicitly via its inner products using the kernel trick.

8.2.3 The Kernel Trick

Our path to the kernel trick begins with an interesting claim: the PCA solution depends on the data matrix

    X = [x_1, x_2, ..., x_n]^T ∈ R^{n x p}

only through the Gram matrix (a.k.a. the kernel matrix) K = X X^T ∈ R^{n x n}, whose entries are the inner products K_ij = <x_i, x_j>.

Proof. Each principal component loading u_j is an eigenvector of X^T X:

    X^T X u_j = λ_j u_j for some eigenvalue λ_j,

so that

    u_j = (1 / λ_j) X^T X u_j = X^T α_j = Σ_{i=1}^n α_{ji} x_i for some weights α_j ∈ R^n.

That is, u_j is a linear combination of the datapoints. This is called a representer theorem for the PCA solution. It is analogous to representer theorems you may have seen for Support Vector Machines or ridge regression.
Therefore one can restrict attention to candidate loadings u_j of this form. Now consider the PCA objective:

    max_{u_j} u_j^T X^T X u_j    s.t. ||u_j||_2 = 1,  u_j^T X^T X u_l = 0 for l < j.

Substituting u_j = X^T α_j, this becomes

    max_{α_j} α_j^T X (X^T X) X^T α_j    s.t. α_j^T X X^T α_j = 1,  α_j^T X (X^T X) X^T α_l = 0 for l < j,

that is,

    max_{α_j} α_j^T K^2 α_j    s.t. α_j^T K α_j = 1,  α_j^T K^2 α_l = 0 for l < j,    (8.1)

which only depends on the data through K!

The final representation of PCA in kernel form (8.1) is an example of a generalized eigenvalue problem, so we know how to compute its solution. However, we will give a more explicit derivation of its solution by converting this problem into an equivalent eigenvalue problem. Hereafter we will assume K is non-singular. Let β_j = K^{1/2} α_j so that α_j = K^{-1/2} β_j. Now the problem (8.1) becomes

    max_{β_j} β_j^T K β_j    s.t. β_j^T β_j = 1,  β_j^T K β_l = 0 for l < j.

This is an eigenvalue problem whose solution is given by taking β_j to be the j-th leading eigenvector of K, and hence α_j = K^{-1/2} β_j = β_j / sqrt(λ_j(K)). Furthermore, we can recover the principal component scores from this representation as

    Z = U^T X^T = [α_1, ..., α_k]^T X X^T = [α_1, ..., α_k]^T K.

The punchline is that we can solve PCA by finding the eigenvectors and eigenvalues of K; this is kernel PCA, the kernelized form of the PCA algorithm (note that the solution is equivalent to the original PCA solution when K = X X^T). Hence, the inner products of X are sufficient, and we do not need additional access to the explicit datapoints.

Why is this relevant? Suppose we want to run kernel PCA on a non-linear mapping of the data,

    Φ = [φ(x_1), φ(x_2), ..., φ(x_n)]^T.

Then we do not need to compute or store Φ explicitly; K^φ = Φ Φ^T suffices to run kernel PCA. Moreover, we can often compute the entries K^φ_ij = <φ(x_i), φ(x_j)> via a kernel function K(x_i, x_j) without ever forming φ(x_i) explicitly. This is the kernel trick.
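As a small sanity check of the trick, here is a hypothetical numpy sketch verifying that the quadratic kernel (1 + <x, y>)^2 equals the inner product of explicit quadratic feature vectors, written with the usual sqrt(2) scalings that the informal table below omits; the dimension p = 5 and the random inputs are arbitrary.

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Explicit feature map whose inner products give the quadratic kernel (1 + <x, y>)^2."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(3)
x, y = rng.standard_normal(5), rng.standard_normal(5)

lhs = (1.0 + x @ y) ** 2                    # kernel evaluation: O(p) work
rhs = phi_quadratic(x) @ phi_quadratic(y)   # explicit feature space: O(p^2) coordinates
print(np.isclose(lhs, rhs))                 # True
```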

Here are a few common examples:

    Kernel                              K(x_i, x_j)                        φ(x)
    Linear                              <x_i, x_j>                         x
    Quadratic                           (1 + <x_i, x_j>)^2                 (1, x_1, ..., x_p, x_1^2, ..., x_p^2, x_1 x_2, ..., x_{p-1} x_p)
    Polynomial                          (1 + <x_i, x_j>)^d                 all monomials of order d or less
    Gaussian / radial basis function    exp(-||x_i - x_j||_2^2 / (2σ^2))   infinite-dimensional feature vector

A principal advantage of the kernel trick is that one can carry out non-linear dimensionality reduction with little dependence on the dimension of the non-linear feature space. However, one has to form and operate on an n x n matrix, which can be quite expensive. It is common to approximate the kernel matrix when n is large using randomized (e.g., the Nyström method of Williams & Seeger, 2000) or deterministic (e.g., the incomplete Cholesky decomposition of Fine & Scheinberg, 2001) low-rank approximations.
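To tie the derivation together, here is a minimal numpy sketch of kernel PCA as derived above: build K from a kernel function, take its leading eigenvectors β_j, rescale to get α_j = β_j / sqrt(λ_j), and read off the scores [α_1, ..., α_k]^T K. Feature-space centering is omitted for simplicity, and the RBF bandwidth, the parabola-shaped data, and k = 2 are illustrative assumptions rather than choices made in the lecture.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian / RBF kernel matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T      # pairwise squared distances
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(4)
t = rng.uniform(-1, 1, size=200)
X = np.column_stack([t, 2 * t ** 2 + 0.05 * rng.standard_normal(200)])   # near a parabola

K = rbf_kernel(X, sigma=0.5)                          # n x n Gram matrix in feature space
eigvals, eigvecs = np.linalg.eigh(K)                  # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # reorder largest first

k = 2
alphas = eigvecs[:, :k] / np.sqrt(eigvals[:k])        # alpha_j = beta_j / sqrt(lambda_j(K))
Z = alphas.T @ K                                      # k x n matrix of kernel PC scores
print(Z.shape)                                        # (2, 200)
```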