Predictive Discrete Latent Factor Models for large incomplete dyadic data


Predictive Discrete Latent Factor Models for large incomplete dyadic data. Deepak Agarwal, Srujana Merugu, Abhishek Agarwal (Y! Research). MMDS Workshop, Stanford University, 6/25/2008.

Agenda
- Motivating applications
- Problem definition
- Classic approaches
- Our approach (PDLF): building local models via co-clustering
- Enhancing PDLF via factorization
- Discussion

Motivating Applications
- Movie (music) recommendations (Netflix, Y! Music): personalized, based on historical ratings
- Product recommendation (Y! Shopping): top products based on browse behavior
- Online advertising: what ads to show on a page?
- Traffic quality of a publisher: what is the conversion rate?

DATA: a dyadic matrix with rows i (users, web pages) and columns j (movies, music, products, ads).

Problem Definition
GOAL: given training data of responses observed on dyads (i, j), predict the response for unseen dyads.
CHALLENGES
- Scalability: large dyadic matrix
- Missing data: only a small fraction of dyads is observed
- Noise: low SNR; data are heterogeneous, but there are strong interactions

Classical Approaches
SUPERVISED LEARNING
- Non-parametric function estimation
- Random effects models to estimate interactions
UNSUPERVISED LEARNING
- Co-clustering, low-rank factorization, ...
OUR MAIN CONTRIBUTION
- Blend supervised & unsupervised learning in a model-based way, with scalable fitting

Non-parametric function estimation
y_{ij} = h(x_{ij}) + noise
E.g., trees, neural nets, boosted trees, kernel learning, ...: capture the entire structure through covariates.
Dyadic data: covariate-only models show lack of fit; better estimates of interactions are possible by using information on the dyads themselves.
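
To make the lack-of-fit point concrete, here is a tiny illustrative experiment (all names and settings are my own assumptions, not from the talk): data are generated with a hidden co-cluster interaction, a covariate-only least-squares model is fit, and the residual block means still recover the hidden interaction that the covariates cannot explain.

```python
import numpy as np

# Sketch: covariate-only models leave dyadic structure in the residuals.
rng = np.random.default_rng(0)
m, n, p = 100, 80, 3
rows, cols = rng.integers(2, size=m), rng.integers(2, size=n)
block = np.array([[1.0, -1.0], [-1.0, 1.0]])        # hidden interaction
X = rng.normal(size=(m, n, p))                       # dyad covariates x_ij
beta = rng.normal(size=p)
Y = X @ beta + block[rows[:, None], cols[None, :]] + rng.normal(size=(m, n))

# Fit y_ij ~ x_ij^T b by least squares over all dyads (covariates only).
b_hat, *_ = np.linalg.lstsq(X.reshape(-1, p), Y.ravel(), rcond=None)
R = Y - X @ b_hat                                    # residuals

# Residual block means recover the hidden +/- 1 interaction.
print([[R[np.ix_(rows == I, cols == J)].mean() for J in range(2)]
       for I in range(2)])
```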

Random effects model
A specific term per observed cell:
f(y_{ij}; z_{ij}, x_{ij}) = \int f(y_{ij}; z_{ij}^T \beta + x_{ij}^T \delta_{ij}) \, d\Pi(\delta_{ij}; G)
where z_{ij}^T \beta is the global (covariate) effect and x_{ij}^T \delta_{ij} is the dyad-specific effect.
- Smooth dyad-specific effects using a prior (shrinkage); e.g., Gaussian mixture, Dirichlet process, ...
- Main goal is hypothesis testing; not suited to prediction: the prediction for a new cell is based only on the estimated prior.
Our approach: co-cluster the matrix and fit local models in each cluster, with the co-clustering chosen to obtain the best model fit.
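
To make the shrinkage and the prediction point concrete, here is the Gaussian special case (a standard derivation, not from the slides): with Gaussian noise and a Gaussian prior on a scalar dyad effect, the posterior mean shrinks the raw residual toward zero, and an unobserved dyad falls back to the prior mean.

```latex
% Illustrative Gaussian case: y_{ij} = z_{ij}^T\beta + \delta_{ij} + \varepsilon_{ij},
% with \delta_{ij} \sim N(0, \tau^2) and \varepsilon_{ij} \sim N(0, \sigma^2).
\mathbb{E}[\delta_{ij} \mid y_{ij}]
  = \frac{\tau^2}{\tau^2 + \sigma^2}\,\bigl(y_{ij} - z_{ij}^T\beta\bigr)
  \quad\text{(observed dyad: residual shrunk toward 0)}
\qquad
\mathbb{E}[\delta_{ij}] = 0
  \quad\text{(new dyad: prior mean only)}
```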

Classic Co-clustering
- Exclusively captures interactions; no covariates included!
- Goal: prediction by matrix approximation
- Scalable: iteratively cluster rows & columns into homogeneous blocks

[Illustration: raw data → after row clustering → after column clustering. Smoothing rows using column clusters reduces variance; iterate until convergence. A sketch of the procedure follows below.]
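
A minimal sketch of the alternating procedure, assuming a fully observed matrix, squared-error loss, and block-mean approximation (function and variable names are illustrative):

```python
import numpy as np

def co_cluster(Y, K, L, n_iter=20, seed=0):
    """Plain co-clustering by block-mean matrix approximation (a sketch)."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(K, size=Y.shape[0])   # row cluster ids
    cols = rng.integers(L, size=Y.shape[1])   # column cluster ids
    for _ in range(n_iter):
        # Block means mu[I, J] under the current assignments.
        mu = np.zeros((K, L))
        for I in range(K):
            for J in range(L):
                block = Y[np.ix_(rows == I, cols == J)]
                mu[I, J] = block.mean() if block.size else 0.0
        # Reassign each row to the row cluster with least squared error.
        err = np.stack([((Y - mu[I][cols]) ** 2).sum(axis=1) for I in range(K)])
        rows = err.argmin(axis=0)
        # Reassign each column, conditional on the new row assignments.
        err = np.stack([((Y - mu[rows, J][:, None]) ** 2).sum(axis=0) for J in range(L)])
        cols = err.argmin(axis=0)
    return rows, cols, mu
```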

Our Generative Model
f(y_{ij}; x_{ij}, z_{ij}) = \sum_{I=1}^{K} \sum_{J=1}^{L} P(I, J) \, f(y_{ij}; z_{ij}^T \beta + x_{ij}^T \beta_{IJ})
where I is the cluster id for row i (I ∈ {1, ..., K}) and J is the cluster id for column j (J ∈ {1, ..., L}).
- Sparse, flexible approach to learning dyad-specific coefficients: borrow strength across rows and columns
- Captures interactions by co-clustering; a local model in each co-cluster
- Convergence is fast and the procedure is scalable
- Completely model-based, easy to generalize
- We consider x_{ij} = 1 in this talk
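
A generative sketch of the Gaussian case with x_ij = 1, so the co-cluster term reduces to a scalar effect β_{IJ}; all sizes and variable names are illustrative assumptions:

```python
import numpy as np

# Generative sketch of a Gaussian PDLF with x_ij = 1.
rng = np.random.default_rng(0)
m, n, K, L, p, sigma = 200, 100, 3, 4, 5, 0.5

row_cluster = rng.integers(K, size=m)    # latent row cluster ids I_i
col_cluster = rng.integers(L, size=n)    # latent column cluster ids J_j
beta_global = rng.normal(size=p)         # global coefficients for z_ij
beta_cc = rng.normal(size=(K, L))        # co-cluster effects beta_{IJ}

Z = rng.normal(size=(m, n, p))           # dyad-level covariates z_ij
mean = Z @ beta_global + beta_cc[row_cluster[:, None], col_cluster[None, :]]
Y = mean + rng.normal(scale=sigma, size=(m, n))   # observed responses y_ij
```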

Scalable Model Fitting
- EM algorithm with hard ("winner-take-all") assignments: each row/column is assigned to the cluster that maximizes the conditional log-likelihood of its observed dyads:
I_i = arg max_I \sum_{j:(i,j) \text{ observed}} \log f(y_{ij}; x_{ij}^T \beta_{I J_j}),  J_j = arg max_J \sum_{i:(i,j) \text{ observed}} \log f(y_{ij}; x_{ij}^T \beta_{I_i J})
- Conditional on cluster assignments, parameters are estimated via the usual statistical procedures
- Complexity: O(N(K + L)) per iteration
- Easily done in parallel; we use Map-Reduce
- Several million dyads, thousands of rows/columns: takes a few hours
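
A compact sketch of the hard-EM loop for the Poisson case (anticipating the click-count application later in the talk), with x_ij = 1 so each co-cluster's local model is a single rate; the masking of unobserved dyads and all names are my own illustrative choices:

```python
import numpy as np

def fit_poisson_pdlf(Y, mask, K, L, n_iter=20, seed=0):
    """Hard ("winner-take-all") EM for a Poisson model with x_ij = 1:
    each co-cluster (I, J) has one rate lam[I, J], and rows/columns are
    reassigned to maximize the Poisson log-likelihood."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(K, size=Y.shape[0])
    cols = rng.integers(L, size=Y.shape[1])
    W = mask.astype(float)                 # 1 for observed dyads, else 0
    for _ in range(n_iter):
        # M-step: MLE of each co-cluster rate = mean of its observed counts.
        lam = np.full((K, L), 1e-6)
        for I in range(K):
            for J in range(L):
                w = W[np.ix_(rows == I, cols == J)]
                if w.sum() > 0:
                    y = Y[np.ix_(rows == I, cols == J)]
                    lam[I, J] = max((w * y).sum() / w.sum(), 1e-6)
        # Hard E-step: log f(y; lam) = y*log(lam) - lam (dropping log y!).
        ll = np.stack([(W * (Y * np.log(lam[I][cols]) - lam[I][cols])).sum(axis=1)
                       for I in range(K)])
        rows = ll.argmax(axis=0)           # winner-take-all row assignment
        ll = np.stack([(W * (Y * np.log(lam[rows, J])[:, None]
                             - lam[rows, J][:, None])).sum(axis=0)
                       for J in range(L)])
        cols = ll.argmax(axis=0)           # ... then column assignment
    return rows, cols, lam
```

Both update steps decompose over rows (respectively columns), which is what makes the per-iteration cost O(N(K + L)) and the whole loop easy to parallelize in a Map-Reduce style.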

Simulation on MovieLens
- User-movie ratings. Covariates: user demographics, genres.
- Simulated (200 data sets); estimated co-cluster structure; response assumed Gaussian.

            β0            β1             β2              β3             β4             σ²
truth       3.78          0.51           -0.28           0.14           0.24           1.16
95% c.i.    (3.66, 3.84)  (-0.63, 0.62)  (-0.58, -0.16)  (-0.09, 0.18)  (-0.68, 1.05)  (0.90, 0.99)

Regression on MovieLens
Rating > 3 treated as positive; 23 covariates. [Plot comparing logistic regression, PDLF, and pure co-clustering.]

Click Count Data
- Goal: model click activity on publisher pages from ip-domains.
- Dataset: 47,903 ip-domains, 585 web sites, 125,208 click-count observations; two covariates: ip-location (country-province) and routing type (e.g., aol pop, anonymizer, mobile-gateway), plus row-column effects.
- Model: PDLF based on Poisson distributions, with the number of row/column clusters set to 5.
- We thank Nicolas Eddy Mayoraz for discussions and data.

Co-cluster Interactions: Plain Co-clustering
[Interaction map between publisher clusters and ip-domain clusters.]
Publisher clusters: internet-related (e.g., buydomains), shopping search (e.g., netguide), AOL & Yahoo, smaller publishers (e.g., blogs), smaller portals (e.g., MSN, Netzero).
IP-domain clusters: non-clicking IPs / US, AOL / unknown, edu / European, Japanese, Korean.

Co-cluster Interactions: PDLF
[Interaction map between publisher clusters and ip-domain clusters, after adjusting for covariates.] Recovered cluster labels include web media (e.g., usatoday), tech companies, ISPs, and online portals (MSN, Yahoo).

Smoothing via Factorization
Cluster sizes vary in PDLF; smoothing across local models works better.
Row profile: u_i = (u_{i1}, u_{i2}, ..., u_{ir}); column profile: v_j = (v_{j1}, v_{j2}, ..., v_{jr}).
Regularized Weighted Factorization (RWF): \delta_{ij} = u_i^T v_j, with u_i, v_j drawn from a Gaussian prior.
Squashed Matrix Factorization (SMF): co-cluster, then factorize the cluster profiles: \delta_{IJ} = U_I^T V_J.
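
A minimal sketch of RWF under these assumptions: the Gaussian prior on the profiles acts as an L2 penalty, and the profiles are fit by alternating ridge updates over the observed dyads only (function name, `reg`, and defaults are illustrative):

```python
import numpy as np

def rwf(Y, mask, r=5, reg=1.0, n_iter=20, seed=0):
    """Regularized Weighted Factorization (a sketch): interaction effects
    delta_ij ~ u_i^T v_j, Gaussian prior on profiles = ridge penalty."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U = 0.1 * rng.normal(size=(m, r))    # row profiles u_i
    V = 0.1 * rng.normal(size=(n, r))    # column profiles v_j
    for _ in range(n_iter):
        for i in range(m):               # ridge update for each row profile
            Vo = V[mask[i]]
            U[i] = np.linalg.solve(Vo.T @ Vo + reg * np.eye(r),
                                   Vo.T @ Y[i, mask[i]])
        for j in range(n):               # ridge update for each column profile
            Uo = U[mask[:, j]]
            V[j] = np.linalg.solve(Uo.T @ Uo + reg * np.eye(r),
                                   Uo.T @ Y[mask[:, j], j])
    return U, V
```

SMF would apply the same factorization at the co-cluster level, i.e., to the K × L matrix of cluster effects rather than to the full dyadic matrix.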

Synthetic example (moderately sparse data)
[Heatmaps of estimated interactions; finer interactions are recovered. Missing values are shown as dark regions.]

Synthetic example (highly sparse data)

Movie Lens

Estimating conversion rates
Back to the ip-domain × publisher example; now model conversion rates: Prob(click results in a sale). Detecting important interactions helps in traffic-quality estimation.

Summary
- Covariate-only models often fail to capture residual dependence in dyadic data.
- Model-based co-clustering is an attractive and scalable approach to estimating interactions.
- Factorization of cluster effects smooths the local models and leads to better performance.
- The models are widely applicable in many scenarios.

Ongoing work
- Fast co-clustering through DP mixtures (Richard Hahn, David Dunson): a few sequential scans over the data; initial results extremely promising.
- Model-based hierarchical co-clustering (Inderjit Dhillon): multi-resolution local models; smoothing by borrowing strength across resolutions.