Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Tapani Raiko, Helsinki University of Technology, Finland, Adaptive Informatics Research Center

The Data Explosion
- We are facing an enormous challenge in the ever-increasing amount of data in electronic form
- First wave: text; second wave: real-world data
- Basically, any information that may have value will be made available, e.g., through the Web
- We need adaptive informatics, which adds intelligence at the access point

Adaptive Informatics
- A field of research where automated learning algorithms are used to discover informative concepts, components, and their mutual relations from large amounts of real-world data
- The goal is to understand the underlying phenomena, structures, and patterns buried in the large data sets, in order to make the information usable

Retrieval of multimodal objects

Proactive Information Retrieval

Principal Component Analysis
- Data $X$ consists of $n$ $d$-dimensional vectors
- Matrix $X$ is decomposed into a product of smaller matrices such that the squared reconstruction error is minimized:

$X \approx AS, \qquad C = \|X - AS\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{n} \Big( x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj} \Big)^2$
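As a concrete illustration (not from the talk), here is a minimal numpy sketch of this reconstruction cost; the toy matrix sizes are arbitrary assumptions for the example:

```python
import numpy as np

def pca_cost(X, A, S):
    """Squared Frobenius reconstruction error C = ||X - A S||_F^2."""
    E = X - A @ S  # residuals e_ij = x_ij - sum_k a_ik * s_kj
    return float(np.sum(E ** 2))

# Toy usage: d = 5 dimensions, n = 100 samples, c = 2 components.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))
A = rng.standard_normal((5, 2))
S = rng.standard_normal((2, 100))
print(pca_cost(X, A, S))
```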

Algorithms for PCA
- Eigenvalue decomposition (standard approach): compute the covariance matrix and its eigenvectors
- EM algorithm, which iterates between
  $A \leftarrow X S^T (S S^T)^{-1}, \qquad S \leftarrow (A^T A)^{-1} A^T X$
- Minimization of cost $C$ (Oja's subspace rule):
  $A \leftarrow A + \gamma (X - AS) S^T, \qquad S \leftarrow S + \gamma A^T (X - AS)$
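A hedged sketch of the EM-style alternation above for fully observed data (the function name, random initialization, and fixed iteration count are assumptions; mean subtraction and convergence checks are omitted):

```python
import numpy as np

def pca_em(X, c, n_iter=100, seed=0):
    """Alternate the two least-squares updates from the slide."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((c, n))
    for _ in range(n_iter):
        # A <- X S^T (S S^T)^-1
        A = np.linalg.solve(S @ S.T, S @ X.T).T
        # S <- (A^T A)^-1 A^T X
        S = np.linalg.solve(A.T @ A, A.T @ X)
    return A, S
```

Each step is the least-squares solution for one factor while the other is held fixed.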

PCA with Missing Values
(Figure: red and blue data points are reconstructed based on only one of the two dimensions.)

Adapting the Algorithms for Missing Values
- Iterative imputation: alternately 1) fill in the missing values and 2) solve normal PCA with the standard approach
- EM algorithm becomes computationally heavier:
  $\mathbf{s}_{:j} \leftarrow (A_j^T A_j)^{-1} A_j^T \mathbf{x}_{:j}, \quad j = 1, \dots, n$
  $\mathbf{a}_{i:} \leftarrow \mathbf{x}_{i:} S_i^T (S_i S_i^T)^{-1}, \quad i = 1, \dots, d$
  where $A_j$ and $\mathbf{x}_{:j}$ (respectively $S_i$ and $\mathbf{x}_{i:}$) are restricted to the entries observed in column $j$ (row $i$)
- Minimization of cost $C$ is easy to adapt: take the error over the observed values only
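Below is a minimal sketch of this heavier per-column/per-row EM update, under the assumption that the pattern of observed entries is available as a boolean mask M (the mask, function name, and initialization are illustrative, not from the talk):

```python
import numpy as np

def pca_em_missing(X, M, c, n_iter=50, seed=0):
    """EM-style PCA updates using only the observed entries.
    X: d x n data matrix (missing entries can hold anything),
    M: d x n boolean mask, True where x_ij is observed."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, c))
    S = rng.standard_normal((c, n))
    for _ in range(n_iter):
        for j in range(n):        # s_:j from the rows of A observed in column j
            o = M[:, j]
            S[:, j] = np.linalg.lstsq(A[o, :], X[o, j], rcond=None)[0]
        for i in range(d):        # a_i: from the columns of S observed in row i
            o = M[i, :]
            A[i, :] = np.linalg.lstsq(S[:, o].T, X[i, o], rcond=None)[0]
    return A, S
```

The least-squares problems now have a different design matrix for every column j and row i, which is what makes this variant so much heavier than the fully observed case.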

Speeding up Gradient Descent
- Newton's method is known to converge fast, but it requires computing the Hessian matrix, which is computationally too demanding in high-dimensional problems
- We propose using only the diagonal part of the Hessian
- We also include a control parameter $\alpha$ to interpolate between standard gradient descent ($\alpha = 0$) and the diagonal Newton's method ($\alpha = 1$)

The cost function:
$C = \sum_{(i,j) \in O} e_{ij}^2, \qquad e_{ij} = x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj}$

Its partial derivatives:
$\frac{\partial C}{\partial a_{il}} = -2 \sum_{j \mid (i,j) \in O} e_{ij} s_{lj}, \qquad \frac{\partial C}{\partial s_{lj}} = -2 \sum_{i \mid (i,j) \in O} e_{ij} a_{il}$

Update rules:
$a_{il} \leftarrow a_{il} - \gamma \left( \frac{\partial^2 C}{\partial a_{il}^2} \right)^{-\alpha} \frac{\partial C}{\partial a_{il}} = a_{il} + \gamma \, \frac{\sum_{j \mid (i,j) \in O} e_{ij} s_{lj}}{\big( \sum_{j \mid (i,j) \in O} s_{lj}^2 \big)^{\alpha}}$

$s_{lj} \leftarrow s_{lj} - \gamma \left( \frac{\partial^2 C}{\partial s_{lj}^2} \right)^{-\alpha} \frac{\partial C}{\partial s_{lj}} = s_{lj} + \gamma \, \frac{\sum_{i \mid (i,j) \in O} e_{ij} a_{il}}{\big( \sum_{i \mid (i,j) \in O} a_{il}^2 \big)^{\alpha}}$
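A hedged numpy sketch of one sweep of these update rules; the learning rate gamma, the interpolation parameter alpha, and the small eps added for numerical safety are illustrative choices, not values from the talk:

```python
import numpy as np

def speedup_sweep(X, M, A, S, gamma=0.1, alpha=0.6, eps=1e-8):
    """One sweep of the alpha-interpolated updates on observed entries only.
    alpha = 0 gives plain gradient descent, alpha = 1 the diagonal Newton step."""
    Mf = M.astype(float)
    E = np.where(M, X - A @ S, 0.0)   # e_ij on observed entries, 0 elsewhere
    # a_il <- a_il + gamma * (sum_j e_ij s_lj) / (sum_j s_lj^2)^alpha, sums over observed j
    A = A + gamma * (E @ S.T) / (Mf @ (S ** 2).T + eps) ** alpha
    # Symmetric update for s_lj, using the refreshed A
    E = np.where(M, X - A @ S, 0.0)
    S = S + gamma * (A.T @ E) / ((A ** 2).T @ Mf + eps) ** alpha
    return A, S
```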

Overfitting in Case of Sparse Data
(Figure: an overfitted solution vs. a regularized solution.)

Regularization against Overfitting
- Penalizing the use of large parameter values
- Estimating the distribution of unknown parameters (variational Bayesian learning)
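For the first option, a minimal sketch of a penalized cost, assuming a simple quadratic (weight-decay) penalty with an illustrative strength lam; the second option, VB learning, is covered later in the talk:

```python
import numpy as np

def regularized_cost(X, M, A, S, lam=1.0):
    """Observed-data reconstruction error plus a penalty on large parameter values."""
    E = np.where(M, X - A @ S, 0.0)
    return float(np.sum(E ** 2) + lam * (np.sum(A ** 2) + np.sum(S ** 2)))
```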

Experiments with Netflix Data (www.netflixprize.com)
- Collaborative filtering task: predict people's preferences based on other people's preferences
- $d$ = 18 000 movies, $n$ = 500 000 customers, $N$ = 100 000 000 movie ratings from 1 to 5
- 98.8% of the values are missing
- Find $c$ = 15 principal components

Computational Performance: summary of the computational performance of the different methods.

Method        | Complexity      | Seconds/Iter | Hours to E_O = 0.85
------------- | --------------- | ------------ | -------------------
Gradient      | O(Nc + nc)      | 58           | 1.9
Speed-up      | O(Nc + nc)      | 110          | 0.22
Natural Grad. | O(Nc + nc^2)    | 75           | 3.5
Imputation    | O(nd^2)         | 110 000      | 64
EM            | O(Nc^2 + nc^3)  | 45 000       | 58

Here N = 100 000 000 (number of ratings), c = 15 (number of components), n = 500 000 (number of people), d = 18 000 (number of movies).

(Figure: error on training data against computation time in hours, for Gradient, Speed-up, Natural Grad., Imputation, and EM.)

(Figure: error on validation data against computation time in hours, for Gradient, Speed-up, Natural Grad., Regularized, VB1, and VB2.)

Variational Bayesian Learning
- The main issue in probabilistic machine learning models is to find the posterior distribution over the model parameters and latent variables
- Using a point estimate might overfit
- Sampling is prohibitively slow for large latent variable models
- Variational Bayesian (VB) learning is a good compromise

Overfitting
- An overfitted model explains the current data but does not generalize well to new data
- (Figure: a 6th-order polynomial is fitted to 10 points by maximum likelihood and by sampling.)

Posterior Mass Matters
- You want to make predictions about new data $Y$ based on existing data $X$
- This is solved by fitting a model to the data and then predicting based on that:
  $p(Y \mid X) = \int p(Y \mid X, Z, \theta)\, p(Z, \theta \mid X)\, dZ\, d\theta$
- Note how you need to integrate over the posterior $p(Z, \theta \mid X)$
- If you need to select a single solution $Z, \theta$, it should represent the posterior mass well

Why early stopping might help

Variational Bayes
- VB works by fitting a distribution $q$ over the unknown variables to the true posterior by minimizing the KL divergence:
  $\mathrm{KL}\big(q(Z, \theta) \,\|\, p(Z, \theta \mid X)\big) = \mathbb{E}_{q(Z, \theta)}\left[ \ln \frac{q(Z, \theta)}{p(Z, \theta \mid X)} \right]$
- The form of $q$ can be chosen such that the expectations are tractable
- For instance, $q(Z, \theta) = q(Z)\, q(\theta)$ is assumed almost always, allowing the VB-EM algorithm
- The KL divergence can also be used for model comparison
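One standard identity, not written out on the slide but implicit in it, shows why minimizing this KL divergence is feasible even though the posterior $p(Z, \theta \mid X)$ itself is intractable:

$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}\big(q(Z,\theta) \,\|\, p(Z,\theta \mid X)\big), \qquad \mathcal{L}(q) = \mathbb{E}_{q(Z,\theta)}\left[ \ln \frac{p(X, Z, \theta)}{q(Z, \theta)} \right]$

Since $\ln p(X)$ does not depend on $q$, minimizing the KL divergence is equivalent to maximizing the lower bound $\mathcal{L}(q)$, which involves only the joint distribution $p(X, Z, \theta)$ and $q$; this bound is also what is typically compared across models.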

VB-EM Algorithm
- The VB-EM algorithm alternates between updates for the latent variables and the parameters
- The steps are symmetric and they resemble the E-step of the EM algorithm
- VB-E step: $q(Z) \leftarrow \arg\min_{q(Z)} \; \mathbb{E}_{q(\theta)}\big\{ \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big) \big\}$
- VB-M step: $q(\theta) \leftarrow \arg\min_{q(\theta)} \; \mathbb{E}_{q(Z)}\big\{ \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid X, Z)\big) \big\}$
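To make the alternation concrete, here is a minimal VB-EM-style sketch for a deliberately tiny toy model (a univariate Gaussian with unknown mean and precision under a Gaussian-Gamma prior), not the PCA model of the talk; the prior values and function name are illustrative assumptions:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Coordinate updates of a factorized posterior q(mu) q(tau) for
    x_n ~ N(mu, 1/tau), mu ~ N(mu0, 1/(lam0 tau)), tau ~ Gamma(a0, b0)."""
    N, xbar = len(x), float(np.mean(x))
    E_tau = a0 / b0                      # initial guess for E_q[tau]
    a_N = a0 + (N + 1) / 2.0             # this shape parameter is fixed by N
    for _ in range(n_iter):
        # Update q(mu) = N(mu_N, 1/lam_N) with q(tau) held fixed
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # Update q(tau) = Gamma(a_N, b_N) with q(mu) held fixed
        E_sq = np.sum((x - mu_N) ** 2) + N / lam_N          # E_q[sum_n (x_n - mu)^2]
        E_pr = lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N)     # E_q[lam0 (mu - mu0)^2]
        b_N = b0 + 0.5 * (E_sq + E_pr)
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N

# Usage: approximate posterior over the mean and precision of 20 noisy observations.
rng = np.random.default_rng(1)
print(vb_gaussian(rng.normal(2.0, 0.5, size=20)))
```

Each factor is refit while the other is held fixed, exactly the alternation described on the slide, and each refit only requires expectations under the other factor.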

(Figure: two example posterior distributions of x and y, shown as black contours. The maximum a posteriori estimate is plotted as a red star, the Bayes estimate (the expectation over the posterior) as a red circle, and the variational Bayesian solution, a Gaussian posterior approximation with diagonal covariance, as a blue dot surrounded by ellipses. Left: model p(z) = N(z; xy, 0.02), observation z = 1, priors p(x) = N(x; 0, 1), p(y) = N(y; 0, 1). Right: model p(z) = N(z; y, exp(x)), observation z = 2, priors p(x) = N(x; 1, 5), p(y) = N(y; 0, 5). The models are not particularly meaningful, having just two unknown variables, but they are chosen to highlight differences between the various posterior approximations.)

By restricting the form of $q(Z)$, the inference (E-step) can be made faster.
(Figure: three posterior approximations over variables A–G: completely factorized, tree-structured, and exact.)

Pros and Cons of VB
+ Robust against overfitting
+ Fast (compared to sampling)
+ Applicable to a large family of models
- Intensive formulae (lots of integrals)
- Prone to bad but locally optimal solutions (a lot of work goes into arranging good initializations and other tricks to avoid them)

Bayes Blocks Software Package
- Bayes Blocks by Valpola et al.
- Concentrates on continuous values
- Fully factorial posterior approximation
- Includes nonlinearities
- Allows for variance modelling
- Algorithm: message passing with line searches for speed-up