Weighted Low Rank Approximations


Weighted Low Rank Approximations. Nathan Srebro and Tommi Jaakkola, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology.

Weighted Low Rank Approximations: What is a weighted low rank approximation? What is a low rank approximation? Why weighted low rank approximations? How do we find a weighted low rank approximation? What can we do with weighted low rank approximations?

Low Rank Approximation. [figure: the titles × users data matrix A written as a sum of rank-one terms, A ≈ u1 v1' + u2 v2' + u3 v3'] A column of A holds the preferences of a specific user (a real-valued preference level for each title), expressed through a few characteristics (e.g. comic value, dramatic value, violence).

Low Rank Approximation. [figure: the data (target) matrix A, of size titles × users, approximated by the rank-k product of U (titles × k) and V' (k × users)]

Low Rank Approximation. [figure: A (data/target) ≈ UV', rank k] Uses: compression (mostly to reduce processing time); prediction, e.g. collaborative filtering; reconstructing a latent signal, e.g. biological processes through gene expression; capturing structure in a corpus of documents, images, etc.; and as a basic building block, e.g. for non-linear dimensionality reduction.

Low Rank Approximation. Model: A = X + Z, where A is the data (target), X is rank k and Z is i.i.d. Gaussian noise. Then log L(X; A) = log p(A | X) = −(1/2σ²) Σij (Aij − Xij)² + const, so the maximum-likelihood low-rank matrix under i.i.d. Gaussian noise is the low-rank matrix minimizing the sum-squared error, which is given explicitly in terms of the SVD.
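As a quick illustration of this claim (my own sketch, not part of the talk): the maximum-likelihood rank-k matrix is just the truncated SVD of the data. The helper name lra and the planted-signal setup are illustrative assumptions.

```python
# Minimal numpy sketch (assumed setup): the ML rank-k matrix under i.i.d.
# Gaussian noise is the truncated SVD of A, i.e. the least-squares LRA.
import numpy as np

def lra(A, k):
    """Rank-k approximation of A minimizing the (unweighted) sum-squared error."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
X_true = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # planted rank-3 signal
A = X_true + 0.1 * rng.standard_normal(X_true.shape)                  # i.i.d. Gaussian noise
print(np.sum((X_true - lra(A, 3)) ** 2))  # reconstruction error to the planted signal
```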

Weighted Low Rank Approximation. Model: A = X + Z, where A is the data (target), X is rank k and Zij ~ N(0, σij²). Then log L(X; A) = log p(A | X) = −(1/2) Σij wij (Aij − Xij)² + const, with wij = 1/σij². Find the low-rank matrix minimizing the weighted sum-squared error [Young 1940].

Weighted Low Rank Approximation: a low-rank matrix minimizing the weighted sum-squared error. Why weights? External information about the noise variance of each measurement, e.g. in gene expression analysis; missing data (0/1 weights), e.g. collaborative filtering; differing numbers of samples per entry, e.g. separating style and content [Tenenbaum & Freeman 00]; varying importance of different entries, e.g. 2-D filter design [Shpak 90, Lu et al 97]; and as a subroutine for further generalizations.

How? Given A and W, find the rank-k matrix X minimizing the weighted sum-squared difference Σij Wij (Aij − Xij)².
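For concreteness, a tiny sketch of the objective just stated; the function name weighted_error is mine, not from the slides.

```python
# Hypothetical helper: the weighted sum-squared difference to be minimized
# over rank-k matrices X (here X would typically be U @ V.T).
import numpy as np

def weighted_error(A, W, X):
    """sum_ij W_ij * (A_ij - X_ij)**2"""
    return float(np.sum(W * (A - X) ** 2))
```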

(Unweighted) Low Rank Approximation

(Unweighted) Low Rank Approximation. Objective: J(U,V) = ||A − UV'||²_Fro, where A is the n × d target and UV' has rank k (U is n × k, V is d × k).

(Unweighted) Low Rank Approximation. Objective: J(U,V) = ||A − UV'||²_Fro. Setting the gradients to zero: ∂J/∂U = (UV' − A)V = 0 and ∂J/∂V = (VU' − A')U = 0, i.e. UV'V = AV and VU'U = A'U.

(Unweighted) Low Rank Approximation. At a stationary point (∂J/∂U = (UV' − A)V = 0, ∂J/∂V = (VU' − A')U = 0), U and V are correspondingly spanned by eigenvectors of AA' and A'A: without loss of generality take V orthonormal (V'V = I) and U orthogonal with U'U = Γ diagonal; the conditions become U = AV and ΓV' = U'A, so that A'AV = VΓ and AA'U = UΓ.

(Unweighted) Low Rank Approximation. At a stationary point (∂J/∂U = ∂J/∂V = 0), U and V are correspondingly spanned by eigenvectors of AA' and A'A. At the global minimum, U and V are spanned by the leading eigenvectors, and J(U,V) = ||A − UV'||² equals the sum of the unselected eigenvalues (the discarded squared singular values of A).
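A quick numeric check of the last claim (my own sketch, with an arbitrary random A): the rank-k error equals the sum of the discarded squared singular values.

```python
# Illustrative check: ||A - A_k||_Fro^2 equals the sum of the unselected
# squared singular values of A, i.e. the trailing eigenvalues of A A'.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(np.sum((A - A_k) ** 2), np.sum(s[k:] ** 2))  # the two values agree
```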

(Unweighted) Low Rank Approximation. At a stationary point, U and V are correspondingly spanned by eigenvectors of AA' and A'A; at the global minimum, they are spanned by the leading eigenvectors. All other fixed points are saddle points: there are no non-global local minima.

Weighted Low Rank Approximation. J(U,V) = Σij Wij (Aij − (UV')ij)². For row i, ∂J/∂Ui = ((UiV' − Ai) ⊙ Wi)V = 0 gives Ui = Ai W̃i V (V'W̃iV)⁻¹, where W̃i = diag(Wi). If rank(W) = 1, the matrices V'W̃iV can be simultaneously diagonalized and eigen-methods apply [Irani & Anandan 2000]. Otherwise, eigen-methods cannot be used and the solutions are not incremental.

WLRA: Optimization. J(U,V) = Σij Wij (Aij − (UV')ij)². Alternating optimization: for fixed V, find the optimal U; for fixed U, find the optimal V. Alternatively, define J*(V) = min_U J(U,V); then ∇_V J*(V) = ((U*V' − A) ⊙ W)'U*, and conjugate gradient descent on J* optimizes the k·d parameters of V instead of all k(d + n) parameters.
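Below is a sketch of the alternating scheme just described, under my own assumptions (random initialization, a fixed number of sweeps, invertible V'W̃iV matrices); it is not the authors' implementation. Each row of U is solved in closed form for fixed V, and by symmetry each row of V for fixed U.

```python
# Alternating WLRA sketch (assumed details, not the authors' code).
import numpy as np

def optimal_rows(A, W, V):
    """For fixed V, the optimal U: row i solves U_i = A_i W~_i V (V' W~_i V)^{-1}."""
    U = np.zeros((A.shape[0], V.shape[1]))
    for i in range(A.shape[0]):
        WV = V * W[i][:, None]                      # W~_i V, with W~_i = diag(W_i)
        U[i] = np.linalg.solve(WV.T @ V, WV.T @ A[i])
    return U

def wlra_alternate(A, W, k, sweeps=50, seed=0):
    """Alternate 'fixed V -> optimal U' and 'fixed U -> optimal V'."""
    V = np.random.default_rng(seed).standard_normal((A.shape[1], k))
    for _ in range(sweeps):
        U = optimal_rows(A, W, V)
        V = optimal_rows(A.T, W.T, U)
    return U, V
```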

Local Minima in WLRA. [figure: a small 2 × 2 example with target entries 1 and 1.1 and weights 1 and 1 + α, plotting the weighted objective against the angle θ parameterizing the rank-one approximation; the objective exhibits distinct local minima]

WLRA: An EM Approach. For 0/1 weights: A = X + Z, where the observed entries of A are the observations, Z is N(0,1) noise, and the low-rank X are the parameters (the unobserved entries are hidden).

WLRA: An EM Approach. Consider N targets A1, …, AN, each modeled as Ar = X + Zr, with the same low-rank X for all targets and independent noise Zr for each target Ar. Expectation step: fill in each missing entry Ar[i,j] with X[i,j]. Maximization step: X ← LRA((1/N) Σr Ar).

WLRA: An EM Approach. For integer weights, WLRA(A, W) with W[i,j] = w[i,j]/N. Expectation step: Ar[i,j] = A[i,j], or missing (filled in with X[i,j]) if w[i,j] < r. Maximization step: X ← LRA((1/N) Σr Ar) = LRA(W⊙A + (1 − W)⊙X).
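The EM update on this slide can be sketched in a few lines; the zero initialization and fixed iteration count below are my assumptions, and lra is the plain truncated-SVD routine.

```python
# EM sketch for WLRA: E-step fills unobserved entries with the current X,
# M-step takes an ordinary rank-k SVD, i.e. X <- LRA(W*A + (1-W)*X).
import numpy as np

def lra(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def wlra_em(A, W, k, iters=100):
    X = np.zeros_like(A, dtype=float)
    for _ in range(iters):
        X = lra(W * A + (1 - W) * X, k)   # fill-in, then unweighted LRA
    return X
```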

WLRA: Optimization. J(U,V) = Σij Wij (Aij − (UV')ij)². For fixed V, find the optimal U; for fixed U, find the optimal V. Or minimize J*(V) = min_U J(U,V) directly, with ∇_V J*(V) = ((U*V' − A) ⊙ W)'U*, via conjugate gradient descent. EM update: X ← LRA(W⊙A + (1 − W)⊙X).

Estimation of a Planted Signal with Non-Identical Noise Variances. [figure: reconstruction error (normalized sum-squared difference to the planted matrix) versus signal/noise ratio (weighted sum-squared low-rank signal over weighted sum-squared noise), on log scales, for unweighted and weighted low rank approximation]

Collaborative Filtering of Joke Preferences (Jester). [figure: a users × jokes matrix of real-valued joke ratings, with many unobserved entries marked '?'] [Goldberg 2001]

Collaborative Filtering of Joke Preferences (Jester). [figure: test error versus rank for WLRA, compared with: SVD with unobserved entries set to 0 and observed entries rescaled to preserve the mean [Azar+ 2001]; SVD on the always-observed columns [Goldberg+ 2001]; and SVD with unobserved entries set to 0 [Billsus & Pazzani 98]]

WLRA as a Subroutine for Other Cost Functions. Low Rank Approximation (PCA): Σij (Aij − Xij)². Weighted Low Rank Approximation: Σij Wij (Aij − Xij)². More generally: Σij Loss(Aij, Xij).

Logistic Low Rank Regression [Collins+ 01, Gordon 03, Schein+ 03]. Pr(Yij = +1) = g(Xij), with Y observed, X of rank k, and g(x) = 1/(1 + e^{−x}) the logistic function. Loss(Yij, Xij) = −log g(Yij Xij). Expanding the loss to second order about the current estimate X^(t−1) gives a weighted low rank approximation problem, with weights W^(t)ij = g(Yij X^(t−1)ij) g(−Yij X^(t−1)ij) (the curvature of the loss) and targets A^(t)ij = X^(t−1)ij + Yij / g(Yij X^(t−1)ij), up to a constant.
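The Newton-style reduction to WLRA on this slide can be written as below; the exact weight/target formulas are my reconstruction from a second-order expansion of the logistic loss, and wlra_solver is a placeholder for any WLRA routine (such as the EM or alternating sketches above).

```python
# Logistic low rank regression via repeated WLRA (reconstructed Newton step).
import numpy as np

def g(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def newton_wlra_step(Y, X, wlra_solver, k):
    """One quadratic-approximation step: Y in {-1,+1}, loss -log g(Y*X)."""
    W = g(Y * X) * g(-Y * X)     # per-entry curvature of the loss at X
    A = X + Y / g(Y * X)         # X minus gradient/curvature
    return wlra_solver(A, W, k)  # rank-k matrix minimizing sum W*(A - X_new)^2
```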

Maximum Likelihood Estimation with Gaussian Mixture Noise. Y (observed) = X (rank k) + Z (noise), where Z is drawn from a mixture of Gaussians with components c described by {p_c, µ_c, σ_c} and hidden assignments C. E-step: calculate the posteriors Pr(Cij = c | Y). M-step: WLRA with weights Wij = Σ_c Pr(Cij = c | Y)/σ_c² and targets Aij = (Σ_c Pr(Cij = c | Y)(Yij − µ_c)/σ_c²) / Wij.
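A sketch of the E-step and of the weights/targets handed to WLRA in the M-step, as I read this slide; the use of scipy.stats.norm and the function names are my own choices.

```python
# Gaussian-mixture noise: E-step responsibilities and M-step WLRA inputs.
import numpy as np
from scipy.stats import norm

def e_step(Y, X, p, mu, sigma):
    """Posteriors Pr(C_ij = c | Y) for each mixture component c, shape (C, n, d)."""
    resid = Y - X                                  # current estimate of the noise Z
    like = np.stack([pc * norm.pdf(resid, m, s) for pc, m, s in zip(p, mu, sigma)])
    return like / like.sum(axis=0)

def m_step_inputs(Y, post, mu, sigma):
    """Weights W_ij = sum_c q_c/sigma_c^2, targets A_ij = (sum_c q_c (Y-mu_c)/sigma_c^2)/W_ij."""
    inv_var = 1.0 / np.asarray(sigma, dtype=float) ** 2
    W = sum(post[c] * inv_var[c] for c in range(len(mu)))
    A = sum(post[c] * (Y - mu[c]) * inv_var[c] for c in range(len(mu))) / W
    return W, A   # feed into any WLRA routine to update the rank-k X
```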

Weighted Low Rank Approximations, Nathan Srebro and Tommi Jaakkola. Summary: weights are often appropriate in low rank approximation. WLRA is more complicated than unweighted LRA: eigen-methods (SVD) do not apply, the solutions are non-incremental, and there are local minima. Optimization approaches: alternating optimization; gradient methods on J*(V); the EM method X ← LRA(W⊙A + (1 − W)⊙X). WLRA is useful as a subroutine for more general loss functions. www.csail.mit.edu/~nati/lowrank/

END

Local Minima in WLRA. [figure: for a small example A, W, the weighted objective J plotted over rank-one factors parameterized by V[1,·] and V[2,·], with contour values marked; the surface has more than one local minimum]

(Unweighted) Low Rank Approximation. At a stationary point (∂J/∂U = ∂J/∂V = 0), U and V are correspondingly spanned by eigenvectors of AA' and A'A. Writing the SVD A = U0 S V0' = Σα sα uα vα', every such point can be written as U = U0 S Q_U and V = V0 Q_V, where Q_U and Q_V are 0/1 permuted-diagonal selection matrices with Q_U'Q_V = I. The approximation is then UV' = Σ_{selected α} sα uα vα', so J(U,V) = ||A − UV'||² = Σ_{unselected α} sα². The global minimum is obtained when the k leading singular values are selected, i.e. U and V are spanned by the leading eigenvectors of AA' and A'A.