Supplemental Material for TKDE


Supplemental Material for TKDE-05-05-035
Kijung Shin, Lee Sael, U Kang

1 PROPOSED METHODS

1.1 Proofs of Update Rules

In this section, we present the proofs of the update rules in Section 3.5 of the main paper. Specifically, we prove the CDTF update rule for L1 regularization (Theorem 7), the CDTF update rule for the nonnegativity constraint (Theorem 8), the SALS update rule for coupled tensor factorization (Theorem 9), and the update rule for the bias terms, commonly used by CDTF and SALS in the bias model (Theorem 10).

Lemma: Partial Derivative in CDTF. For a parameter $a^{(n)}_{i_n k}$, let

$$\hat{r}_{i_1 \dots i_N} = x_{i_1 \dots i_N} - \sum_{s \neq k} \prod_{l=1}^{N} a^{(l)}_{i_l s}, \qquad g = \sum_{(i_1,\dots,i_N)\in\Omega^{(n)}_{i_n}} \hat{r}_{i_1 \dots i_N} \prod_{l \neq n} a^{(l)}_{i_l k}, \qquad d = \sum_{(i_1,\dots,i_N)\in\Omega^{(n)}_{i_n}} \prod_{l \neq n} \big(a^{(l)}_{i_l k}\big)^2,$$

as in the main paper. Then

$$\frac{\partial}{\partial a^{(n)}_{i_n k}} \sum_{(i_1,\dots,i_N)\in\Omega} \Big(x_{i_1 \dots i_N} - \sum_{s=1}^{K} \prod_{l=1}^{N} a^{(l)}_{i_l s}\Big)^2 = -2g + 2d\, a^{(n)}_{i_n k}.$$

Proof: Only the entries in $\Omega^{(n)}_{i_n}$ depend on $a^{(n)}_{i_n k}$, so the sum over $\Omega$ reduces to a sum over $\Omega^{(n)}_{i_n}$:

$$\frac{\partial}{\partial a^{(n)}_{i_n k}} \sum_{(i_1,\dots,i_N)\in\Omega} \Big(x_{i_1 \dots i_N} - \sum_{s=1}^{K} \prod_{l=1}^{N} a^{(l)}_{i_l s}\Big)^2 = -2 \sum_{(i_1,\dots,i_N)\in\Omega^{(n)}_{i_n}} \Big(x_{i_1 \dots i_N} - \sum_{s=1}^{K} \prod_{l=1}^{N} a^{(l)}_{i_l s}\Big) \prod_{l \neq n} a^{(l)}_{i_l k}$$
$$= -2 \sum_{(i_1,\dots,i_N)\in\Omega^{(n)}_{i_n}} \Big(\hat{r}_{i_1 \dots i_N} - a^{(n)}_{i_n k} \prod_{l \neq n} a^{(l)}_{i_l k}\Big) \prod_{l \neq n} a^{(l)}_{i_l k} = -2g + 2d\, a^{(n)}_{i_n k}. \qquad\square$$

Theorem 7: Correctness of CDTF with L1 Regularization. The corresponding update rule in the main paper minimizes the loss function with respect to the updated parameter. That is, for an updated parameter $a^{(n)}_{i_n k}$, with $\hat{r}_{i_1\dots i_N}$, $g$, and $d$ defined as in the Lemma above,

$$\arg\min_{a^{(n)}_{i_n k}} L_{\text{Lasso}}\big(A^{(1)},\dots,A^{(N)}\big) = \begin{cases} (g-\lambda)/d & \text{if } g > \lambda \\ (g+\lambda)/d & \text{if } g < -\lambda \\ 0 & \text{otherwise.} \end{cases}$$

Proof: Here the loss function is

$$L_{\text{Lasso}}\big(A^{(1)},\dots,A^{(N)}\big) = \sum_{(i_1,\dots,i_N)\in\Omega} \Big(x_{i_1\dots i_N} - \sum_{s=1}^{K}\prod_{l=1}^{N} a^{(l)}_{i_l s}\Big)^2 + 2\lambda \sum_{l=1}^{N} \big\|A^{(l)}\big\|_1.$$

By the Lemma,

$$\frac{\partial L_{\text{Lasso}}}{\partial a^{(n)}_{i_n k}} = \begin{cases} -2g + 2d\, a^{(n)}_{i_n k} + 2\lambda & \text{if } a^{(n)}_{i_n k} > 0 \\ -2g + 2d\, a^{(n)}_{i_n k} - 2\lambda & \text{if } a^{(n)}_{i_n k} < 0. \end{cases}$$

Case 1: If $g > \lambda > 0$, then for $a^{(n)}_{i_n k} < 0$ the derivative is $-2g + 2d\, a^{(n)}_{i_n k} - 2\lambda < 0$, so $L_{\text{Lasso}}$ decreases as $a^{(n)}_{i_n k}$ increases toward zero. For $a^{(n)}_{i_n k} > 0$, the derivative is zero at $a^{(n)}_{i_n k} = (g-\lambda)/d > 0$, and since $\partial^2 L_{\text{Lasso}}/\partial (a^{(n)}_{i_n k})^2 = 2d > 0$, this point is the minimum on that region. Hence $a^{(n)}_{i_n k} = (g-\lambda)/d$ minimizes $L_{\text{Lasso}}(A^{(1)},\dots,A^{(N)})$ with respect to $a^{(n)}_{i_n k}$.

Case 2: Likewise, if $g < -\lambda < 0$, $L_{\text{Lasso}}$ increases for $a^{(n)}_{i_n k} > 0$, and on the region $a^{(n)}_{i_n k} < 0$ the derivative is zero at $a^{(n)}_{i_n k} = (g+\lambda)/d < 0$; since $\partial^2 L_{\text{Lasso}}/\partial (a^{(n)}_{i_n k})^2 = 2d > 0$, $a^{(n)}_{i_n k} = (g+\lambda)/d$ minimizes $L_{\text{Lasso}}(A^{(1)},\dots,A^{(N)})$ with respect to $a^{(n)}_{i_n k}$.

Case 3: On the other hand, if $-\lambda \le g \le \lambda$,

$$\frac{\partial L_{\text{Lasso}}}{\partial a^{(n)}_{i_n k}} \;\begin{cases} = -2g + 2d\, a^{(n)}_{i_n k} + 2\lambda \;\ge\; 2d\, a^{(n)}_{i_n k} \;\ge\; 0 & \text{if } a^{(n)}_{i_n k} > 0 \\ = -2g + 2d\, a^{(n)}_{i_n k} - 2\lambda \;\le\; 2d\, a^{(n)}_{i_n k} \;\le\; 0 & \text{if } a^{(n)}_{i_n k} < 0. \end{cases}$$

That is, $L_{\text{Lasso}}(A^{(1)},\dots,A^{(N)})$ decreases while $a^{(n)}_{i_n k} < 0$ and increases while $a^{(n)}_{i_n k} > 0$, given the other parameters. Thus $L_{\text{Lasso}}(A^{(1)},\dots,A^{(N)})$, which is a continuous function, is minimized with respect to $a^{(n)}_{i_n k}$ at $a^{(n)}_{i_n k} = 0$. Combining the three cases,

$$\arg\min_{a^{(n)}_{i_n k}} L_{\text{Lasso}}\big(A^{(1)},\dots,A^{(N)}\big) = \begin{cases} (g-\lambda)/d & \text{if } g > \lambda \\ (g+\lambda)/d & \text{if } g < -\lambda \\ 0 & \text{otherwise.} \end{cases} \qquad\square$$
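To make the soft-thresholding rule of Theorem 7 concrete, below is a minimal illustrative sketch (not from the main paper) of a single CDTF coordinate update under L1 regularization, written in Python with NumPy. It assumes the observed entries of X are available as a list of (index tuple, value) pairs and the factor matrices as NumPy arrays; the function name and data layout are hypothetical, and r̂, g, and d are recomputed from scratch rather than maintained incrementally as an efficient implementation would.

    import numpy as np

    def cdtf_update_l1(entries, factors, n, i_n, k, lam):
        """One CDTF coordinate update of a^(n)_{i_n, k} under L1 regularization
        (Theorem 7): returns (g - lam)/d if g > lam, (g + lam)/d if g < -lam, else 0.

        entries : list of ((i_1, ..., i_N), x) pairs, the observed cells of X
        factors : list of N arrays; factors[l] has shape (I_l, K)
        """
        N, K = len(factors), factors[0].shape[1]
        g = d = 0.0
        for idx, x in entries:
            if idx[n] != i_n:                       # only Omega^(n)_{i_n} contributes
                continue
            comp = [np.prod([factors[l][idx[l], s] for l in range(N)]) for s in range(K)]
            r_hat = x - (sum(comp) - comp[k])       # residual excluding component k
            p = np.prod([factors[l][idx[l], k] for l in range(N) if l != n])
            g += r_hat * p
            d += p * p
        if d == 0.0:                                # row with no usable entries
            return 0.0
        if g > lam:
            return (g - lam) / d
        if g < -lam:
            return (g + lam) / d
        return 0.0

Replacing the last three branches with g / (d + lam) recovers the plain L2-regularized CDTF update, and max(g / (d + lam), 0.0) recovers the nonnegative variant of Theorem 8 below.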

Theorem 8: Correctness of CDTF with the Nonnegativity Constraint. The corresponding update rule in the main paper minimizes the loss function with respect to the updated parameter under the nonnegativity constraint. That is, for an updated parameter $a^{(n)}_{i_n k}$, with $\hat{r}_{i_1\dots i_N}$, $g$, and $d$ defined as in the Lemma above,

$$\arg\min_{a^{(n)}_{i_n k} \ge 0} L\big(A^{(1)},\dots,A^{(N)}\big) = \max\!\Big(\frac{g}{d+\lambda},\, 0\Big).$$

Proof: Here the loss function is

$$L\big(A^{(1)},\dots,A^{(N)}\big) = \sum_{(i_1,\dots,i_N)\in\Omega}\Big(x_{i_1\dots i_N} - \sum_{s=1}^{K}\prod_{l=1}^{N} a^{(l)}_{i_l s}\Big)^2 + \lambda \sum_{l=1}^{N}\big\|A^{(l)}\big\|_F^2.$$

By the Lemma,

$$\frac{\partial L}{\partial a^{(n)}_{i_n k}} = -2g + 2(d+\lambda)\, a^{(n)}_{i_n k} \;\begin{cases} > 0 & \text{if } a^{(n)}_{i_n k} > g/(d+\lambda) \\ = 0 & \text{if } a^{(n)}_{i_n k} = g/(d+\lambda) \\ < 0 & \text{otherwise.} \end{cases}$$

Case 1: If $g/(d+\lambda) \ge 0$, then since $\partial^2 L/\partial (a^{(n)}_{i_n k})^2 = 2(d+\lambda) > 0$, $L(A^{(1)},\dots,A^{(N)})$ is minimized with respect to $a^{(n)}_{i_n k}$ at $a^{(n)}_{i_n k} = g/(d+\lambda)$, which satisfies the nonnegativity constraint.

Case 2: On the other hand, if $g/(d+\lambda) < 0$, then under the constraint that $a^{(n)}_{i_n k} \ge 0$, $L(A^{(1)},\dots,A^{(N)})$ is minimized with respect to $a^{(n)}_{i_n k}$ at $a^{(n)}_{i_n k} = 0$. This is because $L(A^{(1)},\dots,A^{(N)})$, which is a continuous function, monotonically increases on the feasible region, i.e., $a^{(n)}_{i_n k} \ge 0 > g/(d+\lambda)$ implies $\partial L/\partial a^{(n)}_{i_n k} > 0$. In both cases,

$$\arg\min_{a^{(n)}_{i_n k} \ge 0} L\big(A^{(1)},\dots,A^{(N)}\big) = \max\!\Big(\frac{g}{d+\lambda},\, 0\Big). \qquad\square$$

Theorem 9: Correctness of SALS in Coupled Tensor Factorization. The corresponding update rule in the main paper minimizes the coupled loss function $L_{\text{Coupled}}$ with respect to the updated parameters. Let ${}^{x}R$ and ${}^{y}R$ be the residual tensors for X and Y, respectively. That is, for the C updated parameters $a^{(1)}_{i_1 k_1},\dots,a^{(1)}_{i_1 k_C}$,

$${}^{x}\hat{r}_{i_1\dots i_{N_x}} = x_{i_1\dots i_{N_x}} - \sum_{k=1}^{K}\prod_{n=1}^{N_x} {}^{x}a^{(n)}_{i_n k} + \sum_{c=1}^{C} a^{(1)}_{i_1 k_c}\prod_{n=2}^{N_x} {}^{x}a^{(n)}_{i_n k_c}, \qquad {}^{y}\hat{r}_{i_1\dots i_{N_y}} = y_{i_1\dots i_{N_y}} - \sum_{k=1}^{K}\prod_{n=1}^{N_y} {}^{y}a^{(n)}_{i_n k} + \sum_{c=1}^{C} a^{(1)}_{i_1 k_c}\prod_{n=2}^{N_y} {}^{y}a^{(n)}_{i_n k_c}.$$

Likewise, let ${}^{x}\Omega$ and ${}^{y}\Omega$ be the sets of indices of the observable entries in X and Y, respectively. Then,

$$\arg\min_{[a^{(1)}_{i_1 k_1},\dots,a^{(1)}_{i_1 k_C}]^T} L_{\text{Coupled}}\big({}^{x}A^{(1)},\dots,{}^{x}A^{(N_x)},{}^{y}A^{(1)},\dots,{}^{y}A^{(N_y)}\big) = \big({}^{x}B_{i_1} + {}^{y}B_{i_1} + \lambda I_C\big)^{-1}\big({}^{x}c_{i_1} + {}^{y}c_{i_1}\big),$$

where ${}^{x}B_{i_1}$ and ${}^{y}B_{i_1}$ are C-by-C matrices whose entries are

$$\big[{}^{x}B_{i_1}\big]_{c_1 c_2} = \sum_{(i_1,\dots,i_{N_x})\in {}^{x}\Omega^{(1)}_{i_1}} \Big(\prod_{n\neq 1}{}^{x}a^{(n)}_{i_n k_{c_1}}\Big)\Big(\prod_{n\neq 1}{}^{x}a^{(n)}_{i_n k_{c_2}}\Big), \qquad \big[{}^{y}B_{i_1}\big]_{c_1 c_2} = \sum_{(i_1,\dots,i_{N_y})\in {}^{y}\Omega^{(1)}_{i_1}} \Big(\prod_{n\neq 1}{}^{y}a^{(n)}_{i_n k_{c_1}}\Big)\Big(\prod_{n\neq 1}{}^{y}a^{(n)}_{i_n k_{c_2}}\Big),$$

${}^{x}c_{i_1}$ and ${}^{y}c_{i_1}$ are length-C vectors whose entries are

$$\big[{}^{x}c_{i_1}\big]_{c} = \sum_{(i_1,\dots,i_{N_x})\in {}^{x}\Omega^{(1)}_{i_1}} {}^{x}\hat{r}_{i_1\dots i_{N_x}}\prod_{n\neq 1}{}^{x}a^{(n)}_{i_n k_c}, \qquad \big[{}^{y}c_{i_1}\big]_{c} = \sum_{(i_1,\dots,i_{N_y})\in {}^{y}\Omega^{(1)}_{i_1}} {}^{y}\hat{r}_{i_1\dots i_{N_y}}\prod_{n\neq 1}{}^{y}a^{(n)}_{i_n k_c},$$

and $I_C$ is the C-by-C identity matrix.

Proof: Applying the same argument as in the Lemma to each of the two squared-error terms of $L_{\text{Coupled}}$ gives, for every $c$ ($1 \le c \le C$),

$$\frac{\partial L_{\text{Coupled}}}{\partial a^{(1)}_{i_1 k_c}} = -2\!\!\sum_{(i_1,\dots,i_{N_x})\in {}^{x}\Omega^{(1)}_{i_1}}\!\!\Big({}^{x}\hat{r}_{i_1\dots i_{N_x}} - \sum_{s=1}^{C} a^{(1)}_{i_1 k_s}\prod_{n\neq 1}{}^{x}a^{(n)}_{i_n k_s}\Big)\prod_{n\neq 1}{}^{x}a^{(n)}_{i_n k_c} - 2\!\!\sum_{(i_1,\dots,i_{N_y})\in {}^{y}\Omega^{(1)}_{i_1}}\!\!\Big({}^{y}\hat{r}_{i_1\dots i_{N_y}} - \sum_{s=1}^{C} a^{(1)}_{i_1 k_s}\prod_{n\neq 1}{}^{y}a^{(n)}_{i_n k_s}\Big)\prod_{n\neq 1}{}^{y}a^{(n)}_{i_n k_c} + 2\lambda\, a^{(1)}_{i_1 k_c}.$$

Setting these C partial derivatives to zero yields the linear system

$$\big({}^{x}B_{i_1} + {}^{y}B_{i_1} + \lambda I_C\big)\,\big[a^{(1)}_{i_1 k_1},\dots,a^{(1)}_{i_1 k_C}\big]^T = {}^{x}c_{i_1} + {}^{y}c_{i_1}.$$

Since ${}^{x}B_{i_1}$ and ${}^{y}B_{i_1}$ are positive semidefinite and $\lambda > 0$, the coefficient matrix is positive definite, and therefore

$$\arg\min_{[a^{(1)}_{i_1 k_1},\dots,a^{(1)}_{i_1 k_C}]^T} L_{\text{Coupled}}\big({}^{x}A^{(1)},\dots,{}^{x}A^{(N_x)},{}^{y}A^{(1)},\dots,{}^{y}A^{(N_y)}\big) = \big({}^{x}B_{i_1} + {}^{y}B_{i_1} + \lambda I_C\big)^{-1}\big({}^{x}c_{i_1} + {}^{y}c_{i_1}\big). \qquad\square$$
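The C-by-C system of Theorem 9 can be assembled and solved directly for each row of the shared mode. The following is a small illustrative sketch (not the paper's implementation), assuming the residuals xr̂ and yr̂ of the observed entries have already been computed and are stored with their index tuples, and that the first factor matrix in each list is the shared A; the function name and data layout are hypothetical.

    import numpy as np

    def sals_coupled_row_update(x_entries, y_entries, xA, yA, i1, ks, lam):
        """Solve (xB_{i1} + yB_{i1} + lam*I_C) a = xc_{i1} + yc_{i1}  (Theorem 9)
        for the C shared parameters a^(1)_{i1, k_1}, ..., a^(1)_{i1, k_C}.

        x_entries, y_entries : lists of ((i_1, ..., i_N), r_hat) pairs, where r_hat
            is the residual with the contribution of the columns in ks added back
        xA, yA : factor-matrix lists for X and Y (xA[0] and yA[0] are the shared A)
        ks     : the C column indices currently being updated
        """
        C = len(ks)
        B = lam * np.eye(C)            # accumulates xB + yB + lam * I_C
        c = np.zeros(C)                # accumulates xc + yc
        for entries, A in ((x_entries, xA), (y_entries, yA)):
            for idx, r_hat in entries:
                if idx[0] != i1:       # only entries whose first index is i1 contribute
                    continue
                # products over all modes except the shared (first) one, per column k_c
                p = np.array([np.prod([A[l][idx[l], k] for l in range(1, len(A))])
                              for k in ks])
                B += np.outer(p, p)
                c += r_hat * p
        return np.linalg.solve(B, c)   # new values of a^(1)_{i1, k_1..k_C}

Because the coefficient matrix is symmetric positive definite, a Cholesky-based solver could be used in place of the generic np.linalg.solve.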

Theorem 10: Correctness of the Update Rule for Bias Terms. The corresponding update rule in the main paper minimizes the bias-model loss function $L_{\text{Bias}}$ with respect to the updated parameter. For an updated parameter $b^{(n)}_{i_n}$, let

$$r_{i_1\dots i_N} = x_{i_1\dots i_N} - \sum_{k=1}^{K}\prod_{l=1}^{N} a^{(l)}_{i_l k} - \sum_{l\neq n} b^{(l)}_{i_l} - \mu,$$

as in the main paper. Then,

$$\arg\min_{b^{(n)}_{i_n}} L_{\text{Bias}}\big(A^{(1)},\dots,A^{(N)}, b^{(1)},\dots,b^{(N)}\big) = \frac{\sum_{(i_1,\dots,i_N)\in\Omega^{(n)}_{i_n}} r_{i_1\dots i_N}}{\lambda_b + \big|\Omega^{(n)}_{i_n}\big|}.$$

Proof: Here the loss function is

$$L_{\text{Bias}}\big(A^{(1)},\dots,A^{(N)}, b^{(1)},\dots,b^{(N)}\big) = \sum_{(i_1,\dots,i_N)\in\Omega}\Big(x_{i_1\dots i_N} - \sum_{s=1}^{K}\prod_{l=1}^{N} a^{(l)}_{i_l s} - \sum_{l=1}^{N} b^{(l)}_{i_l} - \mu\Big)^2 + \lambda_A\sum_{l=1}^{N}\big\|A^{(l)}\big\|_F^2 + \lambda_b\sum_{l=1}^{N}\big\|b^{(l)}\big\|_2^2.$$

Since only the entries in $\Omega^{(n)}_{i_n}$ depend on $b^{(n)}_{i_n}$,

$$\frac{\partial L_{\text{Bias}}}{\partial b^{(n)}_{i_n}} = -2\sum_{(i_1,\dots,i_N)\in\Omega^{(n)}_{i_n}}\big(r_{i_1\dots i_N} - b^{(n)}_{i_n}\big) + 2\lambda_b\, b^{(n)}_{i_n} = -2\sum_{(i_1,\dots,i_N)\in\Omega^{(n)}_{i_n}} r_{i_1\dots i_N} + 2\big(\lambda_b + \big|\Omega^{(n)}_{i_n}\big|\big)\, b^{(n)}_{i_n}.$$

Since $\partial^2 L_{\text{Bias}}/\partial (b^{(n)}_{i_n})^2 = 2(\lambda_b + |\Omega^{(n)}_{i_n}|) > 0$, $L_{\text{Bias}}(A^{(1)},\dots,A^{(N)}, b^{(1)},\dots,b^{(N)})$ is minimized with respect to $b^{(n)}_{i_n}$ where this partial derivative is zero, which entails

$$\arg\min_{b^{(n)}_{i_n}} L_{\text{Bias}}\big(A^{(1)},\dots,A^{(N)}, b^{(1)},\dots,b^{(N)}\big) = \frac{\sum_{(i_1,\dots,i_N)\in\Omega^{(n)}_{i_n}} r_{i_1\dots i_N}}{\lambda_b + \big|\Omega^{(n)}_{i_n}\big|}. \qquad\square$$

1.2 Pseudocodes

We present the pseudocodes of the SALS variants described in Section 3.5 of the main paper.

1.2.1 SALS for Coupled Tensor Factorization

Algorithm 6 describes SALS for coupled tensor factorization, where two tensors, denoted by X and Y, share their first mode without loss of generality. We denote the residual tensors for X and Y by xR and yR, respectively. The lengths of the n-th modes of X and Y are denoted by xI_n and yI_n, respectively.

Algorithm 6: SALS for Coupled Tensor Factorization
    Input : X, Y, K, λ
    Output: A^(1); xA^(n) for 2 ≤ n ≤ N_x; yA^(n) for 2 ≤ n ≤ N_y
     1: initialize xR, yR, A^(1), and xA^(n), yA^(n) for all n
     2: for outer iter = 1..T_out do
     3:   for split iter = 1..K/C do
     4:     choose (k_1, ..., k_C) from the columns not updated yet
     5:     compute the residuals xR̂ and yR̂
     6:     for inner iter = 1..T_in do
     7:       for i_1 = 1..xI_1 (= yI_1) do
     8:         update a^(1)_{i_1 k_1}, ..., a^(1)_{i_1 k_C} using the rule of Theorem 9
     9:       for n = 2..N_x do
    10:         for i_n = 1..xI_n do
    11:           update xa^(n)_{i_n k_1}, ..., xa^(n)_{i_n k_C}
    12:       for n = 2..N_y do
    13:         for i_n = 1..yI_n do
    14:           update ya^(n)_{i_n k_1}, ..., ya^(n)_{i_n k_C}
    15:     update xR and yR

1.2.2 SALS for Bias Model

SALS for the bias model is described in Algorithm 7, where each $(i_1,\dots,i_N)$-th entry of R is

$$r_{i_1\dots i_N} = x_{i_1\dots i_N} - \mu - \sum_{n=1}^{N} b^{(n)}_{i_n} - \sum_{k=1}^{K}\prod_{n=1}^{N} a^{(n)}_{i_n k},$$

as explained in Section 3.5.5 of the main paper.

Algorithm 7: SALS for Bias Model
    Input : X, K, λ_A, λ_b
    Output: A^(n) for all n; b^(n) for all n; µ
     1: compute µ, the mean of the observable entries of X
     2: initialize R, A^(n) for all n, and b^(n) for all n
     3: for outer iter = 1..T_out do
     4:   for split iter = 1..K/C do
     5:     choose (k_1, ..., k_C) from the columns not updated yet
     6:     compute the residual R̂
     7:     for inner iter = 1..T_in do
     8:       for n = 1..N do
     9:         for i_n = 1..I_n do
    10:           update a^(n)_{i_n k_1}, ..., a^(n)_{i_n k_C}
    11:     update R
    12:   for n = 1..N do
    13:     for i_n = 1..I_n do
    14:       update b^(n)_{i_n} using the rule of Theorem 10
    15:     update R
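As a concrete illustration of line 14 of Algorithm 7, i.e., the bias update of Theorem 10, the sketch below updates one bias term. It assumes the maintained residuals r (which include all bias terms) are supplied with their index tuples, so the old value of the bias being updated is added back; the function name and data layout are hypothetical, and λ_b is assumed positive.

    def bias_update(entries, n, i_n, b_old, lam_b):
        """Update of b^(n)_{i_n} (Theorem 10):
        sum of residuals over Omega^(n)_{i_n}, divided by (lam_b + |Omega^(n)_{i_n}|).

        entries : list of ((i_1, ..., i_N), r) pairs, where r is the maintained residual
                  x - mu - sum_l b^(l)_{i_l} - sum_k prod_l a^(l)_{i_l k}
        b_old   : current value of b^(n)_{i_n}; added back because the residual in
                  Theorem 10 excludes the bias being updated
        """
        total, count = 0.0, 0
        for idx, r in entries:
            if idx[n] == i_n:
                total += r + b_old
                count += 1
        return total / (lam_b + count)   # equals 0 when the row has no observed entries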

2 OPTIMIZATION ON MAPREDUCE

In this section, we present the details of the optimization techniques described in Section 4 of the main paper.

2.1 Local Disk Caching

As explained in Section 4 of the main paper, in our MAPREDUCE implementation of CDTF and SALS with local disk caching, the entries of X are distributed across machines and cached in their local disks during the map and reduce stages. The data-distribution algorithm below gives the details of the map and reduce stages. The rest of CDTF and SALS runs in the close stage (the cleanup stage in Hadoop) using the cached data.

Algorithm: Data Distribution in CDTF and SALS with Local Disk Caching
    Input : X; mS^(n) for all m and n
    Output: the mΩ^(n) entries of R (initialized to X), cached on each machine m, for all n
    Map(Key k, Value v):
      ((i_1, ..., i_N), x_{i_1...i_N}) ← v
      for n = 1..N do
        find m such that i_n ∈ mS^(n)
        emit <(m, n), ((i_1, ..., i_N), x_{i_1...i_N})>
    Partitioner(Key k, Value v):
      (m, n) ← k
      assign <k, v> to machine m
    Reduce(Key k, Values v[1..|v|]):
      (m, n) ← k
      create a file on the local disk to cache the mΩ^(n) entries of R
      foreach ((i_1, ..., i_N), x_{i_1...i_N}) in v do
        write ((i_1, ..., i_N), x_{i_1...i_N}) to the file

2.2 Direct Communication

In the main paper, we introduce direct communication between reducers using the distributed file system to overcome the rigidity of the MAPREDUCE model. The broadcast algorithm below describes the implementation of the parameter broadcast in CDTF (used in Algorithm 3 of the main paper) based on this communication method: each machine writes the parameters it has updated to a data file on the distributed file system and then creates a dummy file signaling that the data file is complete; the other machines poll the list of dummy files and read the corresponding data files.

Algorithm: Parameter Broadcast in CDTF
    Input : the parameters updated by machine m (to broadcast)
    Output: the parameters received from the other machines
     1: create a data file mA on the distributed file system (DFS)
     2: write the parameters to broadcast to mA
     3: create a dummy file mD on DFS
     4: while not all data files have been read do
     5:   get the list of dummy files from DFS
     6:   foreach m'D in the list do
     7:     if m'A has not been read then
     8:       read the parameters from m'A

2.3 Greedy Row Assignment

Our MAPREDUCE implementations of SALS and CDTF use the greedy row assignment explained in Section 3.4.3 of the main paper. In this section, we explain our MAPREDUCE implementation of the greedy row assignment. We assume that X is stored on the distributed file system. At the first stage, $|\Omega^{(n)}_{i_n}|$ for all n and $i_n$ is computed. Specifically, the mappers output <(n, i_n), 1> for every mode n of each entry $x_{i_1\dots i_N}$, and the reducers output <(n, i_n), $|\Omega^{(n)}_{i_n}|$> for all n and $i_n$ by counting the number of values for each key. At the second stage, these outputs are aggregated to a single reducer, which runs the rest of Algorithm 5 in the main paper.
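The first (counting) stage described above follows the standard word-count pattern. As a minimal sketch, the loop below simulates it in plain Python rather than on Hadoop, folding the mapper emits and the reducer counts into one pass; the helper name and data layout are hypothetical.

    from collections import Counter

    def count_row_entries(entries, N):
        """First stage of the greedy row assignment, simulated locally:
        a mapper would emit <(n, i_n), 1> for each mode n of every observed entry,
        and a reducer would count the values per key, yielding |Omega^(n)_{i_n}|.

        entries : iterable of ((i_1, ..., i_N), x) pairs
        """
        counts = Counter()
        for idx, _x in entries:
            for n in range(N):
                counts[(n, idx[n])] += 1   # mapper emit + reducer count, folded together
        # in the second stage, these counts go to a single reducer that runs
        # the rest of Algorithm 5 of the main paper
        return counts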

3 EXPERIMENTS

In this section, we design and conduct additional experiments to answer the following questions: (1) How do different numbers of inner iterations (T_in) affect the convergence of SALS? (2) How do different numbers of columns updated at a time (C) affect the running time of SALS?

3.1 Experimental Settings

We ran experiments on a Hadoop cluster in which each node had an Intel Xeon E3-1230 3.3GHz CPU. Other experimental settings, including the datasets and the parameter values (λ and K), were the same as those in the main paper. As in the main paper, we used the root mean square error (RMSE) on a held-out test set, which is commonly used in recommender systems, to measure accuracy.

3.2 Effects of the Number of Inner Iterations (T_in) on the Convergence of SALS

[Fig. 4: Effects of T_in (i.e., the number of inner iterations) on the convergence of SALS when C (i.e., the number of columns updated at a time) has large enough values. Panels (a)-(b): Netflix; panels (c)-(d): Yahoo-music; each panel uses a different value of C (up to C = 40). The effects of T_in on convergence speed and the quality of converged solutions are marginal.]

We compared the convergence properties of SALS with different T_in values. In particular, we focused on cases where C (i.e., the number of columns updated at a time) has large enough values; the effect of T_in when C is set to one (and thus SALS is equivalent to CDTF) can be found in the main paper. As seen in Figure 4, the effects of T_in on convergence speed and the quality of converged solutions are neither distinct nor consistent. When C is set to one, however, high T_in values are preferred (see Section 5.7 of the main paper for detailed experimental results).

3.3 Effects of the Number of Columns Updated at a Time (C) on the Running Time of SALS

[Fig. 5: Effects of the number of columns updated at a time (C) on the running time of SALS. Panels: (a) Netflix, (b) Yahoo-music. The running time per iteration decreased as C grew up to an intermediate value and then started to increase.]

We measured the running time per iteration of SALS as we increased C from 1 to K. As seen in Figure 5, the running time per iteration decreased until C reached an intermediate value and then started to increase. As C increases, the amount of disk I/O declines since it depends on the number of times that the entries of R or R̂ are streamed from disk, which is inversely proportional to C. Conversely, the computational cost increases quadratically with regard to C. At small C values, the decrease in the amount of disk I/O is greater and leads to a downward trend in the running time per iteration; the opposite happens at large C values.
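The qualitative tradeoff described in Section 3.3 can be illustrated with a toy cost model (for intuition only; it is not fitted to the measurements behind Figure 5). Under the reading that R or R̂ is streamed once per rank split, i.e., K/C times per iteration, while each split builds and solves C-by-C systems, the total cost is U-shaped in C; all weights below are made-up.

    def cost_per_iteration(C, K, io=1.0, accum=0.01, solve=1e-4):
        """Toy per-iteration cost model of the disk-I/O vs. computation tradeoff.

        disk I/O   : proportional to the K/C streams of R / R-hat per iteration
        accumulate : (K/C) splits x O(C^2) work per observed entry -> grows like K*C
        solves     : (K/C) splits x O(C^3) work per row            -> grows like K*C^2
        """
        disk_io = io * (K / C)
        accumulate = accum * K * C
        solves = solve * K * C ** 2
        return disk_io + accumulate + solves

    # Example: with the made-up weights above and K = 1000, the toy cost is minimized
    # around C = 10, mirroring the U-shaped trend reported for Figure 5.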