ECS289: Scalable Machine Learning


1 ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016

2 Paper presentations and final project proposal
Send me the names of your group members (2 or 3 students) before October 15 (this Friday).
Send me the paper you want to present before October 21.
Choose a paper from ICML, NIPS, KDD, ICDM, AISTATS, or JMLR.
Final project proposal (one page): due October 30 (11:59pm).

3 Outline
Matrix Completion (Background)
Alternating Least Squares (ALS)
Stochastic Gradient method (SG)
Coordinate Descent (CD)

4 Recommender Systems

5 Matrix Factorization Approach
$A \approx WH^T$

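7 Matrix Factorization Approach
$$\min_{W \in \mathbb{R}^{m \times k},\ H \in \mathbb{R}^{n \times k}} \ \sum_{(i,j) \in \Omega} (A_{ij} - w_i^T h_j)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)$$
$\Omega = \{(i, j) \mid A_{ij} \text{ is observed}\}$
The regularization terms $\lambda(\|W\|_F^2 + \|H\|_F^2)$ are there to avoid over-fitting.
Matrix factorization maps users/items to the latent feature space $\mathbb{R}^k$:
the $i$-th user $\Rightarrow$ the $i$-th row of $W$, $w_i^T$;
the $j$-th item $\Rightarrow$ the $j$-th row of $H$, $h_j^T$.
$w_i^T h_j$ measures the interaction between the $i$-th user and the $j$-th item.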
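
The objective on this slide can be evaluated directly once the observed set is stored as a boolean mask. The following is a minimal sketch, not code from the lecture; the function name `mf_objective`, the `mask` representation of $\Omega$, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def mf_objective(A, mask, W, H, lam):
    """Squared error over observed entries plus Frobenius regularization.

    mask[i, j] is True exactly when (i, j) is in Omega (assumed representation).
    """
    residual = (A - W @ H.T) * mask          # unobserved entries contribute zero
    return np.sum(residual ** 2) + lam * (np.sum(W ** 2) + np.sum(H ** 2))

# Toy data: a rank-2 matrix with roughly 60% of its entries observed.
rng = np.random.default_rng(0)
m, n, k = 6, 5, 2
W_true = rng.standard_normal((m, k))
H_true = rng.standard_normal((n, k))
A = W_true @ H_true.T
mask = rng.random((m, n)) < 0.6

print(mf_objective(A, mask, W_true, H_true, lam=0.0))   # data-fit term only
print(mf_objective(A, mask, W_true, H_true, lam=0.1))   # with regularization
```

With the true factors plugged in, the data-fit term vanishes and only the regularization term remains, which is a quick sanity check on the masking.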

8 Latent Feature Space

10 Other Factorizations
Nonnegative Matrix Factorization
$$\min_{W \ge 0,\ H \ge 0} \ \|A - WH^T\|_F^2 + \lambda \|W\|_F^2 + \lambda \|H\|_F^2$$
Each entry of $W$ and $H$ is constrained to be nonnegative.
$A$ is either fully or partially observed.
Goal: find nonnegative latent factors.

11 NMF vs PCA

12 Optimization for Matrix Completion: Alternating Least Squares

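13 Properties of the Objective Function
Nonconvex problem (why?)
Example: $f(x, y) = \frac{1}{2}(xy - 1)^2$
$\nabla f(0, 0) = 0$, but clearly $[0, 0]$ is not a global optimum: $f(0, 0) = \frac{1}{2}$ while $f(1, 1) = 0$. For a differentiable convex function every stationary point is a global minimum, so $f$ cannot be convex.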
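
The nonconvexity example can be checked numerically. This is a small sketch written for these notes (the helper names `f` and `grad_f` are assumptions): it confirms that the origin is a stationary point, that it is not a global minimum, and that the midpoint of two global minimizers has a larger value than either endpoint, which a convex function would not allow.

```python
def f(x, y):
    """The example function f(x, y) = 1/2 (xy - 1)^2."""
    return 0.5 * (x * y - 1.0) ** 2

def grad_f(x, y):
    """Gradient of f: ((xy - 1) y, (xy - 1) x)."""
    r = x * y - 1.0
    return (r * y, r * x)

print("gradient at origin:", grad_f(0.0, 0.0))
print("f at origin:", f(0.0, 0.0))
print("f at (1, 1) and (-1, -1):", f(1.0, 1.0), f(-1.0, -1.0))

# (0, 0) is the midpoint of (1, 1) and (-1, -1); convexity would require
# f(midpoint) <= average of the endpoint values.
midpoint_violation = f(0.0, 0.0) > 0.5 * (f(1.0, 1.0) + f(-1.0, -1.0))
print("midpoint convexity violated:", midpoint_violation)
```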

14 ALS: Alternating Least Squares
Objective function:
$$\min_{W,H} \ \frac{1}{2} \sum_{(i,j) \in \Omega} \left( A_{ij} - (WH^T)_{ij} \right)^2 + \frac{\lambda}{2} \|W\|_F^2 + \frac{\lambda}{2} \|H\|_F^2 := f(W, H)$$
Iteratively fix either $H$ or $W$ and optimize the other:
Input: partially observed matrix $A$, initial values of $W$, $H$
For $t = 1, 2, \dots$
    Fix $W$ and update $H$: $H \leftarrow \arg\min_H f(W, H)$
    Fix $H$ and update $W$: $W \leftarrow \arg\min_W f(W, H)$

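16 ALS: Alternating Least Squares
Define: $\Omega_j := \{ i \mid (i, j) \in \Omega \}$
$w_i$: the $i$-th row of $W$; $h_j$: the $j$-th row of $H$
The subproblem:
$$\arg\min_H \ \frac{1}{2} \sum_{(i,j) \in \Omega} \left( A_{ij} - (WH^T)_{ij} \right)^2 + \frac{\lambda}{2} \|H\|_F^2 = \arg\min_H \ \sum_{j=1}^{n} \underbrace{\left( \frac{1}{2} \sum_{i \in \Omega_j} (A_{ij} - w_i^T h_j)^2 + \frac{\lambda}{2} \|h_j\|^2 \right)}_{\text{ridge regression problem}}$$
$n$ ridge regression problems, each with $k$ variables
Time complexity: $O(|\Omega| k^2 + n k^3)$
Easy to parallelize ($n$ independent ridge regression subproblems)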
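
The ridge-regression view of the subproblem translates almost line by line into NumPy. What follows is a minimal dense sketch under assumptions of my own (a boolean `mask` for $\Omega$, random initialization, a fixed iteration count, and the helper names `update_H` and `als`); it is not the lecture's implementation. It requires `lam > 0` so that every $k \times k$ system is invertible even when a column has no observed entries.

```python
import numpy as np

def update_H(A, mask, W, lam):
    """Fix W and solve the n independent ridge regressions for the rows of H."""
    n, k = A.shape[1], W.shape[1]
    H = np.zeros((n, k))
    for j in range(n):
        rows = np.flatnonzero(mask[:, j])         # Omega_j
        Wj = W[rows]                              # |Omega_j| x k design matrix
        G = Wj.T @ Wj + lam * np.eye(k)           # O(|Omega_j| k^2) to form
        H[j] = np.linalg.solve(G, Wj.T @ A[rows, j])   # O(k^3) to solve
    return H

def als(A, mask, k, lam=0.1, iters=20, seed=0):
    """Alternate between the H-subproblem and the W-subproblem."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.standard_normal((m, k))
    H = rng.standard_normal((n, k))
    for _ in range(iters):
        H = update_H(A, mask, W, lam)             # fix W, update H
        W = update_H(A.T, mask.T, H, lam)         # fix H, update W (same problem on A^T)
    return W, H
```

The loop over `j` is where the slide's parallelism comes from: no iteration reads anything another iteration writes, so the rows of `H` could be computed by independent workers.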

17 ALS: Alternating Least Squares
[Figure: the 3 x 3 matrix with entries $A_{11}, \dots, A_{33}$, drawn with the row factors $w_1^T, w_2^T, w_3^T$ at the side and $H^T$ on top; shown twice.]

18 Optimization for Matrix Completion: Stochastic Gradient Method

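20 Stochastic Gradient Method
$n_i^W$: number of nonzeros in the $i$-th row of $A$
$n_j^H$: number of nonzeros in the $j$-th column of $A$
Decompose the problem into $|\Omega|$ components:
$$f(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} (A_{ij} - w_i^T h_j)^2 + \frac{\lambda}{2} \|W\|_F^2 + \frac{\lambda}{2} \|H\|_F^2 = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \underbrace{\left( \frac{|\Omega|}{2} (A_{ij} - w_i^T h_j)^2 + \frac{\lambda |\Omega|}{2 n_i^W} \|w_i\|^2 + \frac{\lambda |\Omega|}{2 n_j^H} \|h_j\|^2 \right)}_{f_{i,j}(W, H)}$$
The gradient of each component:
$$\nabla_{w_i} f_{i,j}(W, H) = |\Omega| (w_i^T h_j - A_{ij}) h_j + \frac{\lambda |\Omega|}{n_i^W} w_i$$
$$\nabla_{h_j} f_{i,j}(W, H) = |\Omega| (w_i^T h_j - A_{ij}) w_i + \frac{\lambda |\Omega|}{n_j^H} h_j$$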

22 Stochastic Gradient Method
SG algorithm:
Input: partially observed matrix $A$, initial values of $W$, $H$
For $t = 1, 2, \dots$
    Randomly pick a pair $(i, j) \in \Omega$
    $w_i \leftarrow \left( 1 - \frac{\eta_t \lambda}{n_i^W} \right) w_i - \eta_t (w_i^T h_j - A_{ij}) h_j$
    $h_j \leftarrow \left( 1 - \frac{\eta_t \lambda}{n_j^H} \right) h_j - \eta_t (w_i^T h_j - A_{ij}) w_i$
Time complexity: $O(k)$ per iteration; $O(|\Omega| k)$ for one pass over all observed entries.
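
The two update rules can be sketched as a short loop. This is an illustrative version under assumptions of my own: a constant step size `eta` in place of the schedule $\eta_t$, the observed pairs stored as an array `obs`, and the function name `sg_epoch`. The common factor $|\Omega|$ from the component gradients is treated as absorbed into the step size, as in the update rules on the slide.

```python
import numpy as np

def sg_epoch(A, obs, W, H, n_row, n_col, lam, eta, rng):
    """Run |Omega| SG steps, each on a pair (i, j) drawn uniformly from Omega.

    obs   : array of shape (|Omega|, 2) listing the observed (i, j) pairs
    n_row : n_row[i] = number of observed entries in row i    (n_i^W)
    n_col : n_col[j] = number of observed entries in column j (n_j^H)
    W and H are updated in place; each step costs O(k).
    """
    for _ in range(len(obs)):
        i, j = obs[rng.integers(len(obs))]
        err = W[i] @ H[j] - A[i, j]               # w_i^T h_j - A_ij
        w_old = W[i].copy()                       # h_j's update uses the old w_i
        W[i] = (1 - eta * lam / n_row[i]) * W[i] - eta * err * H[j]
        H[j] = (1 - eta * lam / n_col[j]) * H[j] - eta * err * w_old
    return W, H

# Usage on toy data (sizes and hyperparameters are arbitrary choices).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((15, 3)).T
mask = rng.random(A.shape) < 0.5
obs = np.argwhere(mask)
n_row, n_col = mask.sum(axis=1), mask.sum(axis=0)
W = 0.5 * rng.standard_normal((20, 3))
H = 0.5 * rng.standard_normal((15, 3))
for _ in range(50):
    sg_epoch(A, obs, W, H, n_row, n_col, lam=0.01, eta=0.02, rng=rng)
```

Only row `i` of `W` and row `j` of `H` are touched per step, which is the property the parallel variants of SG rely on.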

23 Stochastic Gradient Method
[Figure: the 3 x 3 matrix with entries $A_{11}, \dots, A_{33}$, drawn with the row factors $w_1^T, w_2^T, w_3^T$ at the side and the column factors $h_1, h_2, h_3$ on top; shown twice.]

24 Optimization for Matrix Completion: Distributed Stochastic Gradient Descent (DSGD)

25 How to parallelize SG?
Two SG updates on $(i_1, j_1)$ and $(i_2, j_2)$ at the same time:
$(i_1, j_1)$: update $w_{i_1}$ and $h_{j_1}$
$(i_2, j_2)$: update $w_{i_2}$ and $h_{j_2}$
A conflict occurs when $i_1 = i_2$ or $j_1 = j_2$.
How can conflicts be avoided?
Gemulla et al., Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. In KDD 2011.

26 DSGD: Distributed SGD [Gemulla et al., 2011]
[Figure: $A$ partitioned into 3 x 3 blocks $A_{11}, \dots, A_{33}$, with row factors $w_1^T, w_2^T, w_3^T$, column factors $h_1, h_2, h_3$, and processors $P_1, P_2, P_3$.]

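
The block partition in the DSGD figures avoids conflicts by letting the processors work only on blocks that share no row block and no column block. The sketch below illustrates one such schedule, a cyclic shift in which processor `p` takes block `(p, (p + s) % P)` in round `s`. The function names are hypothetical and the schedule is an illustration of the idea rather than a transcription of the paper's pseudocode.

```python
def dsgd_strata(P):
    """Return P rounds; in round s, processor p is assigned block (p, (p + s) % P)."""
    return [[(p, (p + s) % P) for p in range(P)] for s in range(P)]

def conflict_free(stratum):
    """True if no two blocks in the round share a row block or a column block."""
    rows = [r for r, _ in stratum]
    cols = [c for _, c in stratum]
    return len(set(rows)) == len(rows) and len(set(cols)) == len(cols)

for s, stratum in enumerate(dsgd_strata(3)):
    print("round", s, "blocks", stratum, "conflict-free:", conflict_free(stratum))
```

Within a round, each processor can run ordinary SG on its own block because the blocks touch disjoint rows of $W$ and disjoint rows of $H$; after $P$ rounds every block has been visited once.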

29 Optimization for Matrix Completion: Coordinate Descent

30 Coordinate Descent
Update one variable at a time:
$$w_{it} \leftarrow \frac{\sum_{j \in \Omega_i} (A_{ij} - w_i^T h_j + w_{it} h_{jt}) h_{jt}}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}$$
The subproblem is just a univariate quadratic problem.
$\Omega_i = \{ j : (i, j) \in \Omega \}$
Can be done in $O(|\Omega_i|)$.
Update sequence:
Item/user-wise update: pick a user $i$ or an item $j$, then update the $i$-th row of $W$ or the $j$-th row of $H$.
Feature-wise update: pick a feature index $t \in \{1, \dots, k\}$, then update the $t$-th columns of $W$ and $H$ alternately.
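
The single-variable update has a closed form, so it fits in a few lines. This is a sketch under stated assumptions: the boolean `mask` for $\Omega$ and the function name `cd_update_w` are mine, and `lam > 0` is assumed so the denominator is never zero. For clarity it recomputes the residuals of row $i$, which costs $O(|\Omega_i| k)$; reaching the $O(|\Omega_i|)$ bound on the slide requires keeping the residuals $A_{ij} - w_i^T h_j$ stored and updated between steps.

```python
import numpy as np

def cd_update_w(A, mask, W, H, lam, i, t):
    """Return the value of W[i, t] that minimizes f with all other variables fixed."""
    cols = np.flatnonzero(mask[i])                    # Omega_i
    h_t = H[cols, t]                                  # h_jt for j in Omega_i
    # Residual of row i with the contribution of W[i, t] added back in.
    r = A[i, cols] - H[cols] @ W[i] + W[i, t] * h_t
    return (r @ h_t) / (lam + h_t @ h_t)

# Usage: update one coordinate of a random starting point.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))
mask = rng.random((5, 4)) < 0.7
W = rng.standard_normal((5, 2))
H = rng.standard_normal((4, 2))
W[2, 1] = cd_update_w(A, mask, W, H, lam=0.1, i=2, t=1)
```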

31 Feature-wise Update: CCD++
[Slides 31 to 43 are successive frames of one animated figure captioned "When T = 2"; the frames themselves are not recoverable from the transcription.]
Final frame: Netflix with k = 40.
Cycle through the k feature dimensions.
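
The feature-wise update can be sketched as follows. This is a dense illustration and not the authors' implementation. It assumes that `T` on the slides denotes the number of inner alternations between the $t$-th column of $W$ and the $t$-th column of $H$, that $\Omega$ is given as a boolean `mask`, and that `lam > 0`; the function name `ccdpp_sweep` is hypothetical. A residual matrix is kept so that each column update is an elementwise application of the single-variable rule from the Coordinate Descent slide.

```python
import numpy as np

def ccdpp_sweep(A, mask, W, H, lam, T=2):
    """Cycle once through the k features, updating W and H in place."""
    M = mask.astype(float)
    R = (A - W @ H.T) * M                        # residual on observed entries
    for t in range(W.shape[1]):
        u, v = W[:, t].copy(), H[:, t].copy()
        R_t = R + np.outer(u, v) * M             # add feature t back into the residual
        for _ in range(T):                       # alternate between the two columns
            u = (R_t @ v) / (lam + M @ (v ** 2))
            v = (R_t.T @ u) / (lam + M.T @ (u ** 2))
        W[:, t], H[:, t] = u, v
        R = R_t - np.outer(u, v) * M             # remove the refreshed feature again
    return W, H

# Usage on toy data (sizes and hyperparameters are arbitrary choices).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
mask = rng.random((8, 6)) < 0.7
W = rng.standard_normal((8, 3))
H = rng.standard_normal((6, 3))
for _ in range(10):
    ccdpp_sweep(A, mask, W, H, lam=0.1, T=2)
```

Each column update minimizes the objective exactly over that column, so the objective value never increases from one sweep to the next.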

44 Related papers
Alternating Minimization and SGD: used in the Netflix Prize. (Koren et al., Matrix Factorization Techniques for Recommender Systems. Computer.)
Coordinate Descent: (Yu et al., Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. ICDM.)
DSGD (block partition for distributed SGD): (Gemulla et al., Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. KDD, 2011.)
NOMAD (improved distributed SGD): (NOMAD: Non-locking, stochastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. VLDB.)
Software: LIBPMF (multi-core), LIBMF (multi-core), GraphLab, NOMAD (multi-core and distributed), MLlib (Spark), (more?)

45 Provable Guarantees for the Non-convex Form
ALS with re-sampling can recover the underlying low-rank matrix: (Jain et al., Low-rank matrix completion using alternating minimization. STOC.)
ALS without re-sampling, with a similar guarantee: (Sun and Luo, Guaranteed Matrix Completion via Non-convex Factorization. FOCS.)
(Ge et al., Matrix Completion has No Spurious Local Minimum. NIPS.)

46 Coming up Next class: other matrix completion topics Questions?


More information

Quick Introduction to Nonnegative Matrix Factorization

Quick Introduction to Nonnegative Matrix Factorization Quick Introduction to Nonnegative Matrix Factorization Norm Matloff University of California at Davis 1 The Goal Given an u v matrix A with nonnegative elements, we wish to find nonnegative, rank-k matrices

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 22 1 / 21 Overview

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Algorithmic Foundations Of Data Sciences: Lectures explaining convex and non-convex optimization

Algorithmic Foundations Of Data Sciences: Lectures explaining convex and non-convex optimization Algorithmic Foundations Of Data Sciences: Lectures explaining convex and non-convex optimization INSTRUCTOR: CHANDRAJIT BAJAJ ( bajaj@cs.utexas.edu ) http://www.cs.utexas.edu/~bajaj Problems A. Sparse

More information

Machine Learning Techniques

Machine Learning Techniques Machine Learning Techniques ( 機器學習技法 ) Lecture 15: Matrix Factorization Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan University

More information

CPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2017

CPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2017 CPSC 340: Machine Learning and Data Mining Sparse Matrix Factorization Fall 2017 Admin Assignment 4: Due Friday. Assignment 5: Posted, due Monday of last week of classes Last Time: PCA with Orthogonal/Sequential

More information

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY OUTLINE Why We Need Matrix Decomposition SVD (Singular Value Decomposition) NMF (Nonnegative

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

Linear Regression. S. Sumitra

Linear Regression. S. Sumitra Linear Regression S Sumitra Notations: x i : ith data point; x T : transpose of x; x ij : ith data point s jth attribute Let {(x 1, y 1 ), (x, y )(x N, y N )} be the given data, x i D and y i Y Here D

More information

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods Techniques for Dimensionality Reduction PCA and Other Matrix Factorization Methods Outline Principle Compoments Analysis (PCA) Example (Bishop, ch 12) PCA as a mixture model variant With a continuous latent

More information

Accelerate Subgradient Methods

Accelerate Subgradient Methods Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods

More information

Data Mining and Matrices

Data Mining and Matrices Data Mining and Matrices 6 Non-Negative Matrix Factorization Rainer Gemulla, Pauli Miettinen May 23, 23 Non-Negative Datasets Some datasets are intrinsically non-negative: Counters (e.g., no. occurrences

More information

Fast Nonnegative Matrix Factorization with Rank-one ADMM

Fast Nonnegative Matrix Factorization with Rank-one ADMM Fast Nonnegative Matrix Factorization with Rank-one Dongjin Song, David A. Meyer, Martin Renqiang Min, Department of ECE, UCSD, La Jolla, CA, 9093-0409 dosong@ucsd.edu Department of Mathematics, UCSD,

More information

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis

More information

Lecture Notes 10: Matrix Factorization

Lecture Notes 10: Matrix Factorization Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

Simple Optimization, Bigger Models, and Faster Learning. Niao He

Simple Optimization, Bigger Models, and Faster Learning. Niao He Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture

More information

arxiv: v3 [cs.lg] 18 Mar 2013

arxiv: v3 [cs.lg] 18 Mar 2013 Hierarchical Data Representation Model - Multi-layer NMF arxiv:1301.6316v3 [cs.lg] 18 Mar 2013 Hyun Ah Song Department of Electrical Engineering KAIST Daejeon, 305-701 hyunahsong@kaist.ac.kr Abstract Soo-Young

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information