Exercise Sheet 1

1 Probability revision 1: Student-t as an infinite mixture of Gaussians

Show that an infinite mixture of Gaussian distributions, with a Gamma distribution over the precision λ as the mixing distribution,

    p(x) = \int_0^\infty \mathcal{N}(x \mid 0, 1/\lambda)\,\mathrm{Ga}(\lambda \mid \nu/2, \nu/2)\,d\lambda    (1)

is equal to a t-distribution:

    p(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}} = \mathcal{T}(x \mid 0, 1, \nu).    (2)

I recommend that you look at Section 2.4.2 of Kevin Murphy's book for a visualization of these distributions. You can also look them up on Wikipedia.

Hint: consider a change of variable z = λ.

This popular integration trick appears in the formulation of numerous statistical and neuroscience models, for example:

- Martin Wainwright and Eero Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. NIPS, 1999.
- Gavin Cawley, Nicola Talbot, and Mark Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. NIPS, 2007.
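Although the exercise asks for an analytic derivation, it can be reassuring to check the identity numerically. Below is a minimal Python sketch (not part of the original sheet) that integrates the Gaussian-Gamma mixture by quadrature and compares it with scipy's built-in Student-t density; it assumes numpy and scipy are available.

    # Numerical sanity check (not a proof): integrate the mixture in (1) over
    # lambda and compare with the Student-t density in (2).
    import numpy as np
    from scipy import integrate, stats

    def mixture_density(x, nu):
        # N(x | 0, 1/lam) has standard deviation 1/sqrt(lam);
        # Ga(lam | nu/2, nu/2) has shape nu/2 and rate nu/2, i.e. scipy scale 2/nu.
        integrand = lambda lam: (stats.norm.pdf(x, loc=0.0, scale=1.0 / np.sqrt(lam))
                                 * stats.gamma.pdf(lam, a=nu / 2.0, scale=2.0 / nu))
        val, _ = integrate.quad(integrand, 0.0, np.inf)
        return val

    nu = 3.0
    for x in [-2.0, 0.0, 0.5, 4.0]:
        print(x, mixture_density(x, nu), stats.t.pdf(x, df=nu))  # the two columns should agree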

2 Reducing the cost of linear regression for large d, small n

The ridge method is a regularized version of least squares, with objective function

    \min_{\theta \in \mathbb{R}^d} \|y - X\theta\|_2^2 + \delta^2 \|\theta\|_2^2.

Here, δ is a scalar, the input matrix is X ∈ R^{n×d} and the output vector is y ∈ R^n. The parameter vector θ ∈ R^d is obtained by differentiating the above cost function and setting the derivative to zero, yielding the normal equations

    (X^T X + \delta^2 I_d)\,\theta = X^T y,

where I_d is the d × d identity matrix. The predictions ŷ = ŷ(X_*) for new test points X_* ∈ R^{n×d} are obtained by evaluating the hyperplane

    \hat{y} = X_* \theta = X_* (X^T X + \delta^2 I_d)^{-1} X^T y = H y.    (3)

The matrix H is known as the hat matrix because it "puts a hat" on y.

1. Show that the solution can be written as θ = X^T α, where α = δ^{-2}(y − Xθ).

2. Show that α can also be written as α = (X X^T + δ^2 I_n)^{-1} y and, hence, that the predictions can be written as

    \hat{y} = X_* \theta = X_* X^T \alpha = [X_* X^T]\,([X X^T] + \delta^2 I_n)^{-1} y.    (4)

(This is an awesome trick: if we have n = 20 patients with d = 10,000 gene measurements each, the computation of α only requires inverting an n × n matrix, while the direct computation of θ would have required inverting a d × d matrix.)
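As a quick numerical illustration of the trick in question 2 (again, not part of the original sheet), the following numpy sketch checks that the primal and dual ridge solutions give identical predictions. The dimensions are scaled down from the d = 10,000 example so that the d × d solve stays cheap to run.

    # Primal vs. dual ridge regression: both routes give the same predictions.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, delta = 20, 2000, 1.0
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    X_star = rng.standard_normal((5, d))  # a few test points

    # Primal: solve the d x d normal equations (expensive when d is large).
    theta = np.linalg.solve(X.T @ X + delta**2 * np.eye(d), X.T @ y)

    # Dual: solve an n x n system instead; theta is recovered as X^T alpha.
    alpha = np.linalg.solve(X @ X.T + delta**2 * np.eye(n), y)

    print(np.allclose(X_star @ theta, X_star @ (X.T @ alpha)))  # True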

3 Linear algebra revision: Eigen-decompositions

Once you start looking at raw data, one of the first things you notice is how redundant it often is. In images, it's often not necessary to keep track of the exact value of every pixel; in text, you don't always need the counts of every word. Correlations among variables also create redundancy. For example, if another gene B is expressed every time a gene A is expressed, then to build a tool that predicts patient recovery rate from gene expression data, it seems reasonable to remove either A or B. Most situations are not as clear-cut. In this question, we'll look at eigenvalue methods for factoring and projecting data matrices (images, document collections, image collections), with an eye to one of their most common uses: converting a high-dimensional data matrix to a lower-dimensional one, while minimizing the loss of information.

The Singular Value Decomposition (SVD) is a matrix factorization that has many applications in information retrieval, collaborative filtering, least-squares problems and image processing. Let X be an n × n matrix of real numbers, that is X ∈ R^{n×n}, and assume that X has n eigenvalue-eigenvector pairs (λ_i, q_i):

    X q_i = \lambda_i q_i,    i = 1, ..., n.

If we place the eigenvalues λ_i ∈ R into a diagonal matrix Λ and gather the eigenvectors q_i ∈ R^n into a matrix Q, then the eigenvalue decomposition of X is given by

    X \begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix} = \begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix} \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{bmatrix}    (5)

or, equivalently, X = Q Λ Q^{-1}. For a symmetric matrix, i.e. X = X^T, one can show that X = Q Λ Q^T.

But what if X is not a square matrix? Then the SVD comes to the rescue. Given X ∈ R^{m×n}, the SVD of X is a factorization of the form X = U Σ V^T. These matrices have some interesting properties:

- Σ ∈ R^{n×n} is diagonal with non-negative entries (the singular values σ_j on the diagonal).
- U ∈ R^{m×n} has orthonormal columns: u_i^T u_j = 1 when i = j and 0 otherwise.
- V ∈ R^{n×n} has orthonormal columns and rows. That is, V is an orthogonal matrix, so V^{-1} = V^T.

Often, U is taken to be m-by-m rather than m-by-n. The extra columns are added by a process of orthogonalization, and to ensure that the dimensions still match, a block of zeros is added to Σ. For our purposes, however, we will only consider the version where U is m-by-n, which is known as the thin SVD.

It will turn out to be useful to introduce the vector notation

    X v_j = \sigma_j u_j,    j = 1, 2, ..., n,

where the u_j ∈ R^m are the left singular vectors, the σ_j ∈ [0, ∞) are the singular values and the v_j ∈ R^n are the right singular vectors. That is,

    X \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix}    (6)

or X V = U Σ. Note that there is no assumption that m ≥ n or that X has full rank. In addition, all diagonal elements of Σ are non-negative and in non-increasing order:

    \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0,    where p = min(m, n).

Question: Outline a procedure for computing the SVD of a matrix X. Hint: assume you can find the eigenvalue decompositions of the symmetric matrices X^T X and X X^T.
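The sketch below does not answer the question above; it only uses numpy's built-in thin SVD to verify the properties just listed, which can help fix the notation in your head. It assumes numpy is available.

    # Check the stated properties of the thin SVD on a random matrix.
    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 7, 4
    X = rng.standard_normal((m, n))

    U, s, Vt = np.linalg.svd(X, full_matrices=False)    # thin SVD: U is m x n

    print(np.allclose(U.T @ U, np.eye(n)))              # U has orthonormal columns
    print(np.allclose(Vt @ Vt.T, np.eye(n)))            # V is orthogonal
    print(np.allclose(X @ Vt.T, U * s))                 # X v_j = sigma_j u_j
    print(np.all(np.diff(s) <= 0) and np.all(s >= 0))   # sigma_1 >= ... >= sigma_n >= 0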

4 Representing data with eigen-decompositions

It is instructive to think of the SVD as a sum of rank-one matrices,

    X_r = \sum_{j=1}^{r} \sigma_j u_j v_j^T,

with r ≤ min(m, n). What is so useful about this expansion is that the k-th partial sum captures as much of the energy of X as possible. In this case, energy is defined by the 2-norm of a matrix,

    \|X\|_2 = \sigma_1,

which can also be written as follows:

    \|X\|_2 = \max_{\|y\|_2 = 1} \|X y\|_2.

This definition of the norm of a matrix builds upon our standard definition of the 2-norm of a vector, \|y\|_2 = \sqrt{y^T y}. Essentially, you can think of the norm of a matrix as how much it expands the unit circle \|y\|_2 = 1 when applied to it.

Question: Using the SVD, prove that the two definitions of the 2-norm of a matrix are equivalent.

Figure 1: Image compression with SVD, showing reconstructions with different values of k (panels: full, k = 25, k = 50, k = 100).

If we only use k ≤ r terms in the expansion,

    X_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T,

a measure of the error in approximating X is

    \|X - X_k\|_2 = \sigma_{k+1}.
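As an optional check (not part of the original sheet), the following numpy sketch forms the rank-k partial sum X_k and verifies the two facts above: the 2-norm of X equals σ_1, and the approximation error equals σ_{k+1}.

    # Rank-k truncation of the SVD and its spectral-norm error.
    import numpy as np

    rng = np.random.default_rng(2)
    m, n, k = 50, 30, 5
    X = rng.standard_normal((m, n))

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # sum of the first k rank-one terms

    print(np.isclose(np.linalg.norm(X, 2), s[0]))        # ||X||_2 = sigma_1
    print(np.isclose(np.linalg.norm(X - X_k, 2), s[k]))  # ||X - X_k||_2 = sigma_{k+1}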

We will illustrate the value of this with an image compression example. Figure 1 shows the original image X and three SVD compressions X_k. We can see that when k is low, we get a coarse version of the image. As more rank-one components are added, we recover more of the high-frequency details: the text becomes legible and the pieces of the Go game become circular instead of blocky. The memory savings are dramatic, as we only need to store the singular values and singular vectors up to order k in order to generate the compressed image X_k. To see this, let the original image be of size 961 pixels by 1143 pixels, stored as a 961-by-1143 array of 32-bit floats. The total size of the data is over 17 MB (the number of entries in the array times 4 bytes per entry). Using the SVD, we can compress the image and obtain the following numbers:

    compression    bytes         reduction
    original       17 574 768     0%
    k = 100           842 000    95.2%
    k = 50                  ?    97.6%
    k = 25                  ?    98.8%

Question: Replace the question marks in the table above with the actual numbers of bytes.

Figure 2 shows another image example. It illustrates how the image of a clown can be reconstructed from the first few rank-one components of its SVD.

Figure 2: SVD reconstruction. Large singular values are associated with the low-frequency components of the image and small singular values with the high-frequency components.
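If you want to reproduce Figure 1-style reconstructions yourself, here is a minimal sketch; the file name is a placeholder (any grayscale image will do), and numpy and matplotlib are assumed.

    # Rank-k image reconstructions in the spirit of Figure 1.
    import numpy as np
    import matplotlib.pyplot as plt

    img = plt.imread("go_board.png").astype(np.float32)  # placeholder file name
    if img.ndim == 3:                                     # collapse RGB(A) to grayscale
        img = img[..., :3].mean(axis=2)

    U, s, Vt = np.linalg.svd(img, full_matrices=False)

    for k in (25, 50, 100, len(s)):
        img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k reconstruction
        plt.figure()
        plt.imshow(img_k, cmap="gray")
        plt.title(f"k = {k}")
    plt.show()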