Matrix and Tensor Factorization from a Machine Learning Perspective

1 Matrix and Tensor Factorization from a Machine Learning Perspective. Christoph Freudenthaler, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim. Research Seminar, Vienna University of Economics and Business, January 13, 2012.

2-3 Matrix Factorization - SVD
SVD: decompose the p_1 x p_2 matrix Y := V_1 D V_2^T, where the columns of V_1 are eigenvectors of Y Y^T, the columns of V_2 are eigenvectors of Y^T Y, and D = diag(d_1, ..., d_k) holds the singular values (the square roots of the eigenvalues of Y Y^T). (Figure: the p_1 x p_2 matrix Y written as the product of V_1, the diagonal D, and V_2^T.)
Properties:
- a method from linear algebra that works for arbitrary Y
- a tool for descriptive and exploratory data analysis
- the decomposition optimizes the Frobenius norm under orthogonality constraints
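As a small illustration (not part of the original slides), a truncated SVD can be computed with NumPy; the matrix values and the rank k are arbitrary choices:

```python
import numpy as np

# a small example matrix Y (p1 x p2); the values are arbitrary
Y = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])

k = 2  # number of latent dimensions to keep (illustrative choice)

# full SVD: Y = V1 @ diag(d) @ V2t
V1, d, V2t = np.linalg.svd(Y, full_matrices=False)

# rank-k truncation: the Frobenius-norm-optimal rank-k approximation of Y
Y_k = V1[:, :k] @ np.diag(d[:k]) @ V2t[:k, :]
print(np.linalg.norm(Y - Y_k, ord="fro"))
```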

4-5 Tensor Factorization - Tucker Decomposition
Tucker decomposition: decompose the p_1 x p_2 x p_3 tensor Y := D x_1 V_1 x_2 V_2 x_3 V_3, where V_1 holds the k_1 eigenvectors of the mode-1 unfolded Y, V_2 the k_2 eigenvectors of the mode-2 unfolded Y, V_3 the k_3 eigenvectors of the mode-3 unfolded Y, and D ∈ R^{k_1 x k_2 x k_3} is a non-diagonal core tensor. (Figure: Y as the core tensor D multiplied along each mode by V_1, V_2, and V_3.)
Properties:
- an extension of the SVD method to an arbitrary tensor Y, with an orthogonal representation
- a tool for descriptive and exploratory data analysis
- the decomposition optimizes the Frobenius norm
- expensive inference: sequences of SVDs plus the core tensor D
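A minimal higher-order SVD (HOSVD) sketch, one standard way to obtain a Tucker decomposition; the function names, toy tensor, and ranks below are illustrative, not taken from the slides:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(Y, ranks):
    """Higher-order SVD: factor matrices from the mode-n unfoldings plus core tensor."""
    factors = []
    for mode, k in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(Y, mode), full_matrices=False)
        factors.append(U[:, :k])  # k leading left singular vectors of the mode-n unfolding
    # core tensor D = Y x_1 V1^T x_2 V2^T x_3 V3^T
    D = Y
    for mode, V in enumerate(factors):
        D = np.moveaxis(np.tensordot(V.T, D, axes=(1, mode)), 0, mode)
    return D, factors

Y = np.random.rand(4, 5, 6)            # toy tensor with arbitrary values
D, (V1, V2, V3) = hosvd(Y, (2, 2, 2))  # ranks are an illustrative choice

# reconstruct: Y_hat = D x_1 V1 x_2 V2 x_3 V3
Y_hat = D
for mode, V in enumerate((V1, V2, V3)):
    Y_hat = np.moveaxis(np.tensordot(V, Y_hat, axes=(1, mode)), 0, mode)
print(np.linalg.norm(Y - Y_hat))       # reconstruction error of the truncated model
```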

6-7 Tensor Factorization - Canonical Decomposition
Canonical Decomposition (CD): decompose the p_1 x p_2 x p_3 tensor Y := D x_1 V_1 x_2 V_2 x_3 V_3 with a diagonal identity tensor D, i.e., as a sum of rank-one tensors: Y = sum_{f=1}^k v_{1,f} ∘ v_{2,f} ∘ v_{3,f}. (Figure: Y as the sum of k outer products of corresponding columns of V_1, V_2, and V_3.)
Properties:
- an extension of the SVD method to an arbitrary tensor Y
- a tool for descriptive and exploratory data analysis
- the decomposition optimizes the Frobenius norm
- fast inference due to fewer parameters
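A short sketch (not from the slides) of how a tensor is assembled from rank-one components in the CD/CP form; all sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, p3, k = 4, 5, 6, 3
V1, V2, V3 = (rng.random((p, k)) for p in (p1, p2, p3))

# Y[i1, i2, i3] = sum_f V1[i1, f] * V2[i2, f] * V3[i3, f]  (sum of k rank-one tensors)
Y = np.einsum('af,bf,cf->abc', V1, V2, V3)
print(Y.shape)  # (4, 5, 6)
```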

8-10 Machine Learning Perspective
Machine learning is the task of learning from (noisy) experience E with respect to some class of tasks T and a performance measure P.
- Experience E: data
- Task T: predictions
- Performance P: root mean square error (RMSE), misclassification, ...
The goal is to generalize from noisy data in order to accurately predict new cases. In other words: prediction accuracy on new cases = machine learning; no hypothesis generation/testing = data mining.

11 Applications of the Machine Learning Perspective
Many different applications: social network analysis, recommender systems, graph analysis, image/video analysis, ...

12-13 Applications of the Machine Learning Perspective
Overall task: prediction of missing friendships, interesting items, corrupted pixels, missing edges, etc. Data representation: a matrix or tensor with missing entries. (Figure: a partially observed matrix Y with users u_1, ..., u_{p_1} as rows and items i_1, ..., i_{p_2} as columns.)
Further properties:
- unobserved heterogeneity, e.g., different consumption preferences of different users and items
- large scale: millions of interactions
- poor data quality: high noise due to indirect data collection
2 Factorization Models

14 Factorization Models from a Machine Learning Perspective
Common usage of matrix (and tensor) factorization:
- identification/interpretation of unobserved heterogeneity, i.e., latent dependencies between instances of a mode (rows, columns, ...)
- data compression: e.g., instead of p_1 * p_2 values, store only (p_1 + p_2) * k values
- data preprocessing: decorrelate the p predictor variables of a design matrix X

15-18 Factorization Models from a Machine Learning Perspective
Machine learning perspective on factorization models: factorization models are seen as predictive models, but in their classical form
- there is no probabilistic embedding, e.g. for Bayesian analysis; implicitly, a Gaussian likelihood Y = V_1 D V_2^T + E with each e_l ∈ E distributed as e_l ∼ N(0, σ² = 1)
- there is no missing value treatment; missing entries are often treated as 0
- the prior is non-informative, p(V_1, D, V_2) ∝ 1, and does not distinguish between signal and noise
- there is no interest in the latent representation itself, e.g. its interpretation, so the orthonormality constraint can be eliminated: Y = V_1 V_2^T + E, and for general tensors Y = sum_{f=1}^k v_{1,f} ∘ v_{2,f} ∘ ... ∘ v_{m,f} + E

19-22 Related Work
Existing extensions of factorization models:
- more general likelihoods for multinomial data, count data, etc.: Exponential Family PCA [1]
- inference only on the observed tensor entries
- introduction of prior distributions [2]
- scalable learning algorithms for large-scale data: gradient descent models [3]
Missing extensions:
- extensions of the predictive models themselves, i.e., SVD, CD
- comparison to standard predictive models
[1] Mohamed, S., Heller, K. A., Ghahramani, Z.: Exponential Family PCA. NIPS 2008.
[2] Tipping, M. E., Bishop, C. M.: Probabilistic PCA. Journal of the Royal Statistical Society.
[3] Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. KDD 2008.

23-24 Outline: Generalized Factorization Model; Relations to Standard Models; Empirical Evaluation; Conclusion and Discussion

25-28 Probabilistic SVD
Frobenius-norm optimal: argmin_{θ={V_1,V_2}} ||Y - V_1 V_2^T||_F^2 = argmin_{θ} sum_{i=1}^{p_1} sum_{j=1}^{p_2} (y_{i,j} - v_{1,i}^T v_{2,j})^2. (Figure: each entry y_{i,j} of Y is the dot product of row i of V_1 and row j of V_2, i.e., v_{1,i}^T v_{2,j} = sum_{f=1}^k v_{1,i,f} v_{2,j,f}.)
Gaussian maximum likelihood: argmax_{θ={V_1,V_2}} prod_{i=1}^{p_1} prod_{j=1}^{p_2} exp(-1/2 (y_{i,j} - v_{1,i}^T v_{2,j})^2)
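As a small numeric check (not in the slides): with unit noise variance, the Gaussian log-likelihood and the squared Frobenius loss differ only by a constant factor and offset, so they share the same optimum. All values below are arbitrary toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
p1, p2, k = 6, 8, 3
Y = rng.normal(size=(p1, p2))    # toy fully observed matrix
V1 = rng.normal(size=(p1, k))    # row factors
V2 = rng.normal(size=(p2, k))    # column factors

residual = Y - V1 @ V2.T
frobenius_loss = np.sum(residual ** 2)

# Gaussian log-likelihood with sigma^2 = 1, up to an additive constant:
# maximizing it is equivalent to minimizing the squared Frobenius norm
log_lik = -0.5 * frobenius_loss
print(frobenius_loss, log_lik)
```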

29-32 Probabilistic CD - Extend SVD to arbitrary m
Frobenius-norm optimal: argmin_{θ={V_1,V_2,...,V_m}} sum_{i_1=1}^{p_1} sum_{i_2=1}^{p_2} ... sum_{i_m=1}^{p_m} (y_{i_1,i_2,...,i_m} - sum_{f=1}^k v_{1,i_1,f} v_{2,i_2,f} ... v_{m,i_m,f})^2, where the model term is abbreviated y^{CD}_{i_1,i_2,...,i_m}. (Figure: for m = 3, a tensor entry y_{i_1,i_2,i_3} is the generalized dot product of rows of V_1, V_2, and V_3: y^{CD}_{i_1,i_2,i_3} = sum_{f=1}^k v_{1,i_1,f} v_{2,i_2,f} v_{3,i_3,f}.)
Gaussian maximum likelihood: argmax_{θ={V_1,V_2,...}} prod_{i_1=1}^{p_1} prod_{i_2=1}^{p_2} ... prod_{i_m=1}^{p_m} exp(-1/2 (y_{i_1,i_2,...,i_m} - y^{CD}_{i_1,i_2,...,i_m})^2)
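A minimal sketch (not from the slides) of the CD model value for a single tensor entry with an arbitrary number of modes m; the function name and toy sizes are illustrative:

```python
import numpy as np

def y_cd(factors, idx):
    """CD/CP model value for one tensor entry: sum_f prod_j V_j[idx[j], f]."""
    prod = np.ones(factors[0].shape[1])
    for V, i in zip(factors, idx):
        prod *= V[i]            # elementwise product over the k latent dimensions
    return prod.sum()

rng = np.random.default_rng(2)
factors = [rng.normal(size=(p, 2)) for p in (3, 4, 5)]  # m = 3 modes, k = 2
print(y_cd(factors, (0, 1, 2)))
```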

33-36 GFM I: Predictive Model Interpretation
Adapt the notation:
- Introduce, for each tensor element l = 1, ..., n, indicator vectors x_{l,j}, j = 1, ..., m, of length p_j.
- Form for each tensor element y_{i_1,...,i_m} a predictor vector x_l ∈ R^p, p = p_1 + p_2 + ... + p_m, by concatenating the x_{l,j}: x^{CD}_l = (0, ..., 1, ..., 0 | 0, ..., 1, ..., 0 | ...), where the first block is x_{l,1}, the second x_{l,2}, and so on.
- Denote by V = {V_1, ..., V_m} the set of all predictive model parameters.
- Rewrite y^{CD}_{i_1,...,i_m} as y^{CD}_l = f(x^{CD}_l | V) with the general model f(x_l | V) = sum_{f=1}^k prod_{j=1}^m x_{l,j}^T v_{j,f} = sum_{f=1}^k v_{1,i_1,f} v_{2,i_2,f} ... v_{m,i_m,f}.
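A sketch (not from the slides) of this notation change: building the concatenated one-hot predictor vector and evaluating f(x_l | V) block by block. The helper names, 0-based indices, and toy sizes are assumptions for illustration:

```python
import numpy as np

def x_cd(idx, mode_sizes):
    """Concatenated one-hot predictor vector for a CD tensor entry (0-based indices)."""
    blocks = []
    for i, p in zip(idx, mode_sizes):
        x = np.zeros(p)
        x[i] = 1.0
        blocks.append(x)
    return np.concatenate(blocks)

def f_gfm(x, factors, mode_sizes):
    """f(x_l | V) = sum_f prod_j x_{l,j}^T v_{j,f}."""
    prod = np.ones(factors[0].shape[1])
    offset = 0
    for V, p in zip(factors, mode_sizes):
        x_j = x[offset:offset + p]    # sub-vector for mode j
        prod *= x_j @ V               # x_{l,j}^T v_{j,f} for all latent dimensions f
        offset += p
    return prod.sum()

rng = np.random.default_rng(3)
sizes = (3, 4, 5)
V = [rng.normal(size=(p, 2)) for p in sizes]
x = x_cd((0, 1, 2), sizes)
print(f_gfm(x, V, sizes))   # equals the CD model value for entry (0, 1, 2)
```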

37-42 GFM I: Predictive Model Interpretation
More intuitive interpretation: a 2-level hierarchical representation. (Figure: graphical model with the per-mode linear models x_{l,j}^T v_{j,f} and parameters V in level 1, and the observations y_l ∼ N(µ_l, α = 1), l = 1, ..., n, in level 2.)
f(x_l | V) = µ_l = sum_{f=1}^k prod_{j=1}^m x_{l,j}^T v_{j,f}, where each factor x_{l,j}^T v_{j,f} is denoted µ_{j,f}(x_l).
- Each pair of mode j and latent dimension f has a different linear model µ_{j,f}(x_l) = x_{l,j}^T v_{j,f}, with a predictor vector x_{l,j} describing mode j and p_j model parameters v_{j,f} per mode j and latent dimension f.
- Factorization models are therefore 2-level hierarchical (multi-)linear models: the upper level has one µ_{j,f}(x_l) per latent dimension f and mode j, modeled as a linear function of the mode-dependent predictors x_{l,j}; these are merged by the generalized dot product sum_{f=1}^k prod_{j=1}^m µ_{j,f} to obtain f(x_l | V), and y^{CD}_l = f(x^{CD}_l | V) using x^{CD}_l.

43-44 Important example: Matrix Factorization
Level 1: µ_{1,f}(x_l) = x_{l,1}^T v_{1,f} (≈ row means) and µ_{2,f}(x_l) = x_{l,2}^T v_{2,f} (≈ column means).
Level 2: f(x^{CD}_l | V) = sum_{f=1}^k µ_{1,f}(x_l) µ_{2,f}(x_l) = v_{1,i}^T v_{2,j}.
(Figure: the two-level graphical model instantiated for matrix factorization.)

45-47 GFM II: Predictive Model Extension
From x^{CD}_l to arbitrary x_l:
- x_l = (x_{l,1}, ..., x_{l,m}) still consists of m sub-vectors, one per mode.
- While each x_{l,j} of x^{CD}_l has exactly one active position, and thus p_j - 1 zero values, e.g. x_{l,1} = (1, 0, ..., 0) for the 1st row and x_{l,2} = (0, 1, 0, ..., 0) for the 2nd column, ...
- ... the mode-vectors x_{l,j} may be defined arbitrarily, e.g., to take into account more complex conditional dependencies.

48-49 Understanding induced dependencies
Overall independence assumption: y_l | x_l ∼ N(µ_l = f(x_l | V), 1).
Simplest case, SVD: correlation only through identical mode indices, e.g. the row or column index; rows/columns are independent of other rows/columns. (Figure: y_{i,j} = v_{1,i}^T v_{2,j} = sum_{f=1}^k v_{1,i,f} v_{2,j,f}.)
Example with p_1 = 3 rows and p_2 = 8 columns: x_{l,1} = (1, 0, 0), x_{l,2} = (0, 1, 0, 0, 0, 0, 0, 0), so
y_l = sum_{f=1}^k µ_{1,f}(x_l) µ_{2,f}(x_l) + ε_l = sum_{f=1}^k (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) + ε_l = sum_{f=1}^k v_{1,1,f} v_{2,2,f} + ε_l.

50-52 Creating new dependencies
Overall independence assumption: y_l | x_l ∼ N(µ_l = f(x_l | V), 1).
Dependencies between different rows/columns for SVD: more than one active indicator per mode. With p_1 = 3 rows and p_2 = 8 columns, e.g. x_{l,1} = (1, 0, 0), x_{l,2} = (0, 1, 1, 0, 1, 0, 0, 0):
f(x_l | V) = sum_{f=1}^k (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) = sum_{f=1}^k v_{1,1,f} (v_{2,2,f} + v_{2,3,f} + v_{2,5,f})
Example from recommender systems: SVD++ [4].
[4] Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. KDD 2008.
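A self-contained numeric check (not from the slides) of the expansion above, with random factors and the slide's predictor vectors; 0-based indexing shifts the active positions by one:

```python
import numpy as np

rng = np.random.default_rng(4)
p1, p2, k = 3, 8, 2
V1, V2 = rng.normal(size=(p1, k)), rng.normal(size=(p2, k))

# more than one active indicator in the column mode, as in the slide
x1 = np.array([1., 0., 0.])
x2 = np.array([0., 1., 1., 0., 1., 0., 0., 0.])

f = np.sum((x1 @ V1) * (x2 @ V2))                  # sum_f (x1^T v_{1,f})(x2^T v_{2,f})
check = np.sum(V1[0] * (V2[1] + V2[2] + V2[4]))    # v_{1,1,f}(v_{2,2,f}+v_{2,3,f}+v_{2,5,f})
print(np.allclose(f, check))   # True
```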

53-54 Creating new dependencies
Overall independence assumption: y_l | x_l ∼ N(µ_l = f(x_l | V), 1).
Dependencies beyond rows and/or columns for SVD: p_1 = 3 rows and p_2 = 8 columns plus additional rating information, e.g.
x_{l,1} = (1, 0, 0) (users), x_{l,2} = (0, 1, 0, 0, 0, 0, 0, 0 | 0, 0, 3, 0, 4, 0, 0, 0) (items | ratings for co-rated items):
f(x_l | V) = sum_{f=1}^k (x_{l,1}^T v_{1,f})(x_{l,2}^T v_{2,f}) = sum_{f=1}^k v_{1,1,f} (3 v_{2,3,f} + 4 v_{2,5,f})
Example from recommender systems: Factorized Transition Tensors [5].
[5] Rendle, S., Freudenthaler, C., Schmidt-Thieme, L.: Factorizing personalized Markov chains for next-basket recommendation. WWW 2010.

55-57 GFM II: Predictive Model Extension II
f(x_l | V) = µ_l = sum_{f=1}^k prod_{j=1}^m x_{l,j}^T v_{j,f} = sum_{f=1}^k prod_{j=1}^m sum_{i_j=1}^{p_j} x_{l,j,i_j} v_{j,i_j,f}
- Only one predictive model per mode m. To increase predictive accuracy, combine several reasonable predictive models per mode:
- Introduce mode-defining selection vectors d_j ∈ {0, 1}^p on x_l = (x_1, ..., x_m) ∈ R^p, j = 1, ..., m. One set of selection vectors {d_1, ..., d_m} defines one model: f(x_l | V) = µ_l = sum_{f=1}^k prod_{j=1}^m sum_{i=1}^p x_{l,i} d_{j,i} v_{i,f}. (Figure: selection vectors picking sub-blocks of the parameter matrix V.)
- Collect several selection sets {d_1, ..., d_m} ∈ D: f_GFM(x_l | V) = µ_l = sum_{f=1}^k sum_{{d_1,...,d_m} ∈ D} prod_{j=1}^m sum_{i=1}^p x_{l,i} d_{j,i} v_{i,f}

58 GFM II: Predictive Model Learning?
f_GFM(x_l | V) = sum_D sum_{i_1=1}^p ... sum_{i_m=1}^p x_{l,i_1} ... x_{l,i_m} (d_{1,i_1} ... d_{m,i_m}) (sum_{f=1}^k v_{i_1,f} ... v_{i_m,f}), where the first bracket acts as interaction weight I and the second as interaction weight II.
Infer the model selection D by Bayesian model averaging?
+ also captures negative interaction effects of even order
- redundant parameterization
- expensive model prediction, O(|D| m p k)

59-61 GFM II: Predictive Model Selection
Specify D, i.e. select a specific model.
Extended Matrix Factorization: m = 2 different modes, p different mode-models, |D| = m p = 2p:
D = { {(d_{1,1}, ..., d_{1,p}), (d_{2,1}, ..., d_{2,p})} : d_{1,i_1} = 1, d_{2,i_2} = 1 for i_2 > i_1, 0 else }
ex. 1: d_1 = (1, 0, 0, ..., 0), d_2 = (0, 1, 1, 1, ..., 1)
ex. 2: d_1 = (0, 1, 0, ..., 0), d_2 = (0, 0, 1, 1, ..., 1)
f(x_l | V) = µ_l = sum_{i_1=1}^p sum_{i_2 > i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f}
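A naive O(p² k) sketch (not from the slides) of this extended matrix factorization model; the function name and toy values are illustrative:

```python
import numpy as np

def f_pairwise(x, V):
    """Extended matrix factorization from the slide:
       f(x | V) = sum_{i1} sum_{i2 > i1} x_{i1} x_{i2} <v_{i1}, v_{i2}>."""
    p = len(x)
    total = 0.0
    for i1 in range(p):
        for i2 in range(i1 + 1, p):
            total += x[i1] * x[i2] * (V[i1] @ V[i2])
    return total

rng = np.random.default_rng(5)
x, V = rng.random(5), rng.normal(size=(5, 2))   # p = 5 predictors, k = 2
print(f_pairwise(x, V))
```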

62-66 Selecting a prior distribution
So far: y_l | x_l ∼ N(µ_l = f(x_l | V), 1); the more general x_l unifies recent factorization model enhancements.
Next step: select a prior distribution for the model parameters V, using prior knowledge:
- all tensor elements lie in R ⇒ k linearly independent, real-valued latent dimensions
- each tensor element is a sum over linear functions of the same x_l ⇒ different variances σ_f^2 along the latent dimensions
- centered at some dimension-dependent mean µ_f
- center µ_f and variance σ_f^2 may differ between modes

67-68 Selecting a prior distribution
Conjugate Gaussian prior distribution: each latent k-dimensional representation v_i ∈ R^k, i = 1, ..., p, is an independent draw v_i ∼ N((µ_1, ..., µ_k), diag(σ_1^2, ..., σ_k^2)).
Conjugate Gaussian hyperprior: each prior mean µ_f, f = 1, ..., k, is an independent realization µ_f ∼ N(µ_0, (c_0 λ_f)^{-1}).
Conjugate Gamma hyperprior: each precision λ_f = σ_f^{-2}, f = 1, ..., k, is an independent realization λ_f ∼ G(α_0, β_0).
(Figure: graphical model with hyperparameters µ_0, c_0, α_0, β_0, per-dimension µ_f and λ_f, latent vectors v_i, i = 1, ..., p, and observations y_l, l = 1, ..., n.)
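A minimal sketch (not from the slides) of drawing parameters from this hierarchical prior. The hyperparameter values are arbitrary, and the variance form (c_0 λ_f)^{-1} for µ_f is assumed here as the standard conjugate choice:

```python
import numpy as np

rng = np.random.default_rng(6)
p, k = 10, 4                 # p rows of one mode, k latent dimensions
alpha0, beta0 = 1.0, 1.0     # Gamma hyperparameters (illustrative values)
mu0, c0 = 0.0, 1.0           # Gaussian hyperparameters (illustrative values)

# hyperpriors: one precision and one mean per latent dimension f
lam = rng.gamma(alpha0, 1.0 / beta0, size=k)      # lambda_f ~ Gamma(alpha0, beta0)
mu = rng.normal(mu0, np.sqrt(1.0 / (c0 * lam)))   # mu_f ~ N(mu0, (c0 * lambda_f)^-1), assumed conjugate form

# prior: each latent row v_i drawn independently from N(mu, diag(1 / lambda))
V = rng.normal(mu, np.sqrt(1.0 / lam), size=(p, k))
print(V.shape)
```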

69-72 Learning Generalized Factorization Models
- Stochastic gradient descent: scalable, simple (a minimal sketch for the matrix case follows below)
- Alternating least squares
- Bayesian analysis, i.e. standard Gibbs sampling: easy to derive for multi-linear models like f(x_l | V) = sum_{i_1=1}^p sum_{i_2 > i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f}
- Learning on large datasets: block size = 1 ⇒ O(N_z k), with N_z the number of observed (non-zero) entries
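A minimal stochastic gradient descent sketch for matrix factorization on observed entries only (not the slides' implementation); learning rate, regularization, and the toy data are illustrative:

```python
import numpy as np

def sgd_epoch(obs, V1, V2, lr=0.01, reg=0.05):
    """One SGD pass over observed entries (i, j, y_ij) of a partially observed matrix."""
    for i, j, y in obs:
        err = y - V1[i] @ V2[j]
        v1_old = V1[i].copy()
        V1[i] += lr * (err * V2[j] - reg * V1[i])
        V2[j] += lr * (err * v1_old - reg * V2[j])

rng = np.random.default_rng(7)
V1 = rng.normal(0.0, 0.1, (5, 3))   # row factors
V2 = rng.normal(0.0, 0.1, (8, 3))   # column factors
obs = [(0, 1, 4.0), (2, 5, 1.0), (4, 7, 3.5)]   # toy observed entries
for _ in range(200):
    sgd_epoch(obs, V1, V2)
print(V1[0] @ V2[1])   # moves toward the observed value 4.0 during training
```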

73 Outline: Generalized Factorization Model; Relations to Standard Models; Empirical Evaluation; Conclusion and Discussion

74-77 Polynomial Regression
f(x_l) = sum_{i_1=1}^p x_{l,i_1} β_{i_1} + sum_{i_1=1}^p sum_{i_2 ≥ i_1}^p x_{l,i_1} x_{l,i_2} β_{i_1,i_2} + ... + sum_{i_1=1}^p ... sum_{i_o ≥ i_{o-1}}^p x_{l,i_1} ... x_{l,i_o} β_{i_1,...,i_o}: an order-o polynomial regression model.
f_GFM(x_l | V) = sum_{f=1}^k sum_{{d_1,...,d_m} ∈ D} prod_{j=1}^m sum_{i=1}^p x_{l,i} d_{j,i} v_{i,f}: the Generalized Factorization Model of an m-mode tensor.
Define one of the p predictor variables as constant, e.g. x_{l,1} = 1 for all l = 1, ..., n, and fix the corresponding parameter vector v_1 ∈ V to (1, ..., 1) of length k.
Selecting D accordingly gives
f_GFM(x_l | V) = sum_{i_1=1}^p x_{l,i_1} sum_{f=1}^k v_{i_1,f} + sum_{i_1=1}^p sum_{i_2 ≥ i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f} + ... + sum_{i_1=1}^p ... sum_{i_m ≥ i_{m-1}}^p x_{l,i_1} ... x_{l,i_m} sum_{f=1}^k v_{i_1,f} ... v_{i_m,f},
where the factorized sums play the roles of β_{i_1}, β_{i_1,i_2}, ..., β_{i_1,...,i_m}.

78 Polynomial Regression vs. GFM
Generalized Factorization Models include polynomial regression:
- for factorized parameters, e.g. β_{i_1,i_2} = sum_{f=1}^k v_{i_1,f} v_{i_2,f}
- if the number of modes m equals the order o

79-80 Factorization Machines [6] vs. GFM
Very similar to the factorized polynomial regression model above:
f_GFM(x_l | V) = sum_{i_1=1}^p x_{l,i_1} sum_{f=1}^k v_{i_1,f} + sum_{i_1=1}^p sum_{i_2 ≥ i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f} + ...
f_FM(x_l | V) = sum_{i_1=1}^p x_{l,i_1} β_{i_1} + sum_{i_1=1}^p sum_{i_2 ≥ i_1}^p x_{l,i_1} x_{l,i_2} sum_{f=1}^k v_{i_1,f} v_{i_2,f}
(in the GFM the first-order weights are factorized, β_{i_1} = sum_f v_{i_1,f}; in the FM they are free parameters).
Factorization Machines using simple Gibbs sampling have been successfully applied in several challenges.
[6] Rendle, S.: Factorization Machines. ICDM 2010.
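A sketch (not from the slides) of a second-order factorization machine predictor in the common formulation with strictly ordered pairs i < j, using the O(k p) reformulation of the pairwise term from the FM paper; the names and toy data are illustrative:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
       with the pairwise term computed in O(k p)."""
    linear = w0 + w @ x
    xv = x @ V                                        # shape (k,)
    pair = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pair

rng = np.random.default_rng(8)
p, k = 6, 2
x = rng.random(p)
w0, w, V = 0.1, rng.normal(size=p), rng.normal(size=(p, k))
print(fm_predict(x, w0, w, V))
```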

81-85 Neural Networks vs. GFM
(Figure: a feed-forward network with input nodes x_{l,1}, ..., x_{l,4}, two hidden nodes with activations f(x_l, v_1) and f(x_l, v_2), weights v_{i,f} on the input-to-hidden connections, and output weights w_1, w_2 leading to the output y_l.)
- Input: predictor vector x_l ∈ R^p, l = 1, ..., n (here p = 4)
- Hidden: the hidden nodes correspond to the k latent dimensions (here k = 2)
- Activation function: f(x_l, v_f) := sum_{{d_1,...,d_m} ∈ D} prod_{j=1}^m sum_{i=1}^p x_{l,i} d_{j,i} v_{i,f}
- Output: weights w_f = 1
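A small sketch (not from the slides) of this network view for the simplest CD-type case with a single selection set: one hidden node per latent dimension, the multiplicative activation above, and output weights fixed to 1. Function name, mode layout, and values are illustrative:

```python
import numpy as np

def gfm_network_forward(x, V, mode_slices):
    """Network view of the CD-type GFM: hidden node f computes prod_j (x_{l,j}^T v_{j,f}),
       and the output layer sums the hidden nodes with weights w_f = 1."""
    hidden = np.ones(V.shape[1])
    for sl in mode_slices:          # slice of input positions belonging to mode j
        hidden *= x[sl] @ V[sl]     # x_{l,j}^T v_{j,f} for every hidden node f
    return hidden.sum()

p1, p2, k = 3, 4, 2
rng = np.random.default_rng(9)
V = rng.normal(size=(p1 + p2, k))               # weights on the input-to-hidden edges
x = np.array([1., 0., 0., 0., 1., 0., 0.])      # one-hot blocks for the two modes
print(gfm_network_forward(x, V, [slice(0, p1), slice(p1, p1 + p2)]))
```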

86 Outline: Generalized Factorization Model; Relations to Standard Models; Empirical Evaluation; Conclusion and Discussion

87 Empirical Evaluation on Recommender System Data
Task: movie rating prediction. Data [7]: 100M ratings of 480K users on 18K items (Netflix).
(Figure: RMSE on Netflix versus the number of latent dimensions for SGD KNN256, SGD KNN256++, BFM KNN256, BFM KNN256++, SGD PMF, BPMF, BFM (u,i), BFM (u,i,t,f), BFM (u,i,ut,it,if), and BFM Ensemble.)

88 Bayesian Factorization Machines - Kaggle
Task: student performance prediction on GMAT (Graduate Management Admission Test), SAT, and ACT (college admission tests). 118 contestants. Data [8].

89 Bayesian Factorization Machines - KDD Cup 11
Task: music rating prediction. 1000 contestants. Data [9]: 250M ratings of 1M users on 625K items (songs, tracks, albums, or artists).

90 Outline: Generalized Factorization Model; Relations to Standard Models; Empirical Evaluation; Conclusion and Discussion

91-95 In a nutshell
- GFMs, and thus matrix and tensor factorization, are types of regression models.
- Generalized Factorization Models relate to polynomial regression (for factorized parameters) and to feed-forward neural networks (for a given activation function).
- Factorization Machines with simple Gibbs sampling show decent predictive performance.
Open questions:
- Bayesian learning for models with non-linear functions of V?
- Bayesian model averaging for GFM?
- Efficient Bayesian inference for non-Gaussian likelihoods?


More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University 2018 EE448, Big Data Mining, Lecture 10 Recommender Systems Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html Content of This Course Overview of

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa VS model in practice Document and query are represented by term vectors Terms are not necessarily orthogonal to each other Synonymy: car v.s. automobile Polysemy:

More information

Dimension Reduction. David M. Blei. April 23, 2012

Dimension Reduction. David M. Blei. April 23, 2012 Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Collaborative Filtering

Collaborative Filtering Collaborative Filtering Nicholas Ruozzi University of Texas at Dallas based on the slides of Alex Smola & Narges Razavian Collaborative Filtering Combining information among collaborating entities to make

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College

More information

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007 Recommender Systems Dipanjan Das Language Technologies Institute Carnegie Mellon University 20 November, 2007 Today s Outline What are Recommender Systems? Two approaches Content Based Methods Collaborative

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Information retrieval LSI, plsi and LDA. Jian-Yun Nie Information retrieval LSI, plsi and LDA Jian-Yun Nie Basics: Eigenvector, Eigenvalue Ref: http://en.wikipedia.org/wiki/eigenvector For a square matrix A: Ax = λx where x is a vector (eigenvector), and

More information

Probabilistic Latent-Factor Database Models

Probabilistic Latent-Factor Database Models Probabilistic Latent-Factor Database Models Denis Krompaß 1, Xueyian Jiang 1, Maximilian Nicel 2, and Voler Tresp 1,3 1 Ludwig Maximilian University of Munich, Germany 2 Massachusetts Institute of Technology,

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset

More information

CS60021: Scalable Data Mining. Dimensionality Reduction

CS60021: Scalable Data Mining. Dimensionality Reduction J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Dimensionality Reduction Sourangshu Bhattacharya Assumption: Data lies on or near a

More information

Background Mathematics (2/2) 1. David Barber

Background Mathematics (2/2) 1. David Barber Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Matrix Factorization Techniques for Recommender Systems

Matrix Factorization Techniques for Recommender Systems Matrix Factorization Techniques for Recommender Systems By Yehuda Koren Robert Bell Chris Volinsky Presented by Peng Xu Supervised by Prof. Michel Desmarais 1 Contents 1. Introduction 4. A Basic Matrix

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Personalized Ranking for Non-Uniformly Sampled Items

Personalized Ranking for Non-Uniformly Sampled Items JMLR: Workshop and Conference Proceedings 18:231 247, 2012 Proceedings of KDD-Cup 2011 competition Personalized Ranking for Non-Uniformly Sampled Items Zeno Gantner Lucas Drumond Christoph Freudenthaler

More information

Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA

Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inria.fr http://perception.inrialpes.fr/

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang Relational Stacked Denoising Autoencoder for Tag Recommendation Hao Wang Dept. of Computer Science and Engineering Hong Kong University of Science and Technology Joint work with Xingjian Shi and Dit-Yan

More information

Matrix decompositions

Matrix decompositions Matrix decompositions Zdeněk Dvořák May 19, 2015 Lemma 1 (Schur decomposition). If A is a symmetric real matrix, then there exists an orthogonal matrix Q and a diagonal matrix D such that A = QDQ T. The

More information

Linear Algebra Background

Linear Algebra Background CS76A Text Retrieval and Mining Lecture 5 Recap: Clustering Hierarchical clustering Agglomerative clustering techniques Evaluation Term vs. document space clustering Multi-lingual docs Feature selection

More information

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning Michalis K. Titsias Department of Informatics Athens University of Economics and Business

More information

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine Nitish Srivastava nitish@cs.toronto.edu Ruslan Salahutdinov rsalahu@cs.toronto.edu Geoffrey Hinton hinton@cs.toronto.edu

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Machine Learning Techniques

Machine Learning Techniques Machine Learning Techniques ( 機器學習技法 ) Lecture 13: Deep Learning Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan University ( 國立台灣大學資訊工程系

More information

Graphical Models 359

Graphical Models 359 8 Graphical Models Probabilities play a central role in modern pattern recognition. We have seen in Chapter 1 that probability theory can be expressed in terms of two simple equations corresponding to

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Wavelet Transform And Principal Component Analysis Based Feature Extraction

Wavelet Transform And Principal Component Analysis Based Feature Extraction Wavelet Transform And Principal Component Analysis Based Feature Extraction Keyun Tong June 3, 2010 As the amount of information grows rapidly and widely, feature extraction become an indispensable technique

More information