Fast, Cheap and Deep: Scaling Machine Learning
Transcription
1 Fast, Cheap and Deep: Scaling Machine Learning (SFW). Alexander Smola, CMU Machine Learning, and github.com/dmlc
2 Many thanks to Mu Li, Dave Andersen, Chris Dyer, Li Zhou, Ziqi Liu, Manzil Zaheer, Qicong Chen, Amr Ahmed (Google), Yu-Xiang Wang, Jay Yoon Lee, Ha Loc Do (SMU); the CXXNET team: Tianqi Chen (UW), Bing Xu, Naiyang Wang; and the Minerva team: Minjie Wang, Tianjun Xiao, Jianpeng Li, Jiaxing Zhang.
3 This talk in 3 slides
4 Parameter Server [architecture diagram: several server nodes; each worker reads its data (local or cloud), keeps a local state (or copy) of the parameters, and exchanges write/update messages with the servers]
5 Multicore [diagram: the same pattern on a single machine; workers read data (local or cloud), keep a local state (or copy), and exchange write/update traffic with the parameter server]
6 GPUs (for Deep Learning) [diagram: as above, with GPUs as the workers reading data (local or cloud) and exchanging write/update traffic with the parameter server]
7 Details. Parameter Server Basics: Logistic Regression (Classification). Large Distributed State: Factorization Machines (CTR). Memory Subsystem: Matrix Factorization (Recommender). GPUs: Deep Learning (Images).
8 Estimate Click-Through Rate: $p(\underbrace{\text{click}}_{=:y} \mid \underbrace{\text{ad, query}}_{=:x}, w)$
9 Click-Through Rate (CTR). Linear function class: $f(x) = \langle w, x \rangle$. Logistic regression: $p(y \mid x, w) = \frac{1}{1 + \exp(-y \langle w, x \rangle)}$. Optimization problem: $\text{minimize}_w \sum_{i=1}^m \log(1 + \exp(-y_i \langle w, x_i \rangle)) + \lambda \|w\|_1$. Solve distributed over many machines (typically 1 TB to 1 PB of data); the $\ell_1$ penalty gives sparse models for advertising.
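To make the objective concrete, here is a minimal dense NumPy sketch of the loss and the gradient of its smooth part (names are mine; in practice the data is sparse and the sum is sharded across machines):

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    # sum_i log(1 + exp(-y_i <w, x_i>)) + lam * ||w||_1
    margins = -y * (X @ w)
    return np.logaddexp(0.0, margins).sum() + lam * np.abs(w).sum()

def logistic_gradient(w, X, y):
    # gradient of the smooth part only; the l1 term is handled
    # by the proximal operator on the next slide
    sigma = 1.0 / (1.0 + np.exp(y * (X @ w)))   # per-example weight
    return X.T @ (-y * sigma)
```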
10 Optimization Algorithm. Compute the gradient on the data. The $\ell_1$ norm is nonsmooth, hence a proximal operator: $w_{t+1} = \text{argmin}_w\, \lambda \|w\|_1 + \frac{1}{2\eta_t} \|w - (w_t - \eta_t g_t)\|^2$. The resulting $\ell_1$ updates are very simple soft thresholding: $w_i \leftarrow \text{sgn}(w_i) \max(0, |w_i| - \eta_t \lambda)$. All steps decompose by coordinates, so they can be solved in parallel (and asynchronously).
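A minimal sketch of the per-coordinate proximal step (the step size eta and regularizer lam are assumed names):

```python
import numpy as np

def prox_l1_step(w, g, eta, lam):
    # gradient step followed by coordinate-wise soft thresholding;
    # every coordinate is independent, which is what allows the
    # server to apply updates in parallel and asynchronously
    z = w - eta * g
    return np.sign(z) * np.maximum(0.0, np.abs(z) - eta * lam)
```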
11 Parameter Server Template (Smola & Narayanamurthy, 2010, VLDB; Gonzalez et al., 2012, WSDM; Dean et al., 2012, NIPS; Shervashidze et al., 2013, WWW; used at Google, Baidu, Facebook, Amazon, Yahoo, Microsoft). Compute the gradient on a subset of the data on each client. Send the gradient from client to server asynchronously: push(key_list, value_list, timestamp). Apply the proximal gradient update on the server, per coordinate. The server returns parameters: pull(key_list, value_list, timestamp).
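A hypothetical client loop against the push/pull interface named above; the ps and shard objects and their helpers are illustrative assumptions, not the actual dmlc API:

```python
def worker_loop(ps, shard, num_steps):
    for t in range(num_steps):
        batch = shard.next_minibatch()
        keys = batch.nonzero_feature_ids()        # sparse data: few coordinates touched
        weights = ps.pull(keys, timestamp=t)      # fetch the current parameters
        grads = local_gradient(weights, batch)    # compute on the minibatch
        ps.push(keys, grads, timestamp=t)         # server runs the proximal update
```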
12 Solving it at scale (Li et al., OSDI 2014). First setup: … TB of data, … variables; a local file system stores the files; 1000 servers (corporate cloud), about 1 h runtime, learning at 140 MB/s [plot comparing an online solver, System A, System B, and the Parameter Server over time (hours)]. Second setup: 1.1 TB (Criteo), … variables, … samples; S3 stores the files (no preprocessing), with a better IO library; 5 machines (c4.8xlarge), about 1000 s runtime, learning at 220 MB/s.
13 Details. Parameter Server Basics: Logistic Regression (Classification). Large Distributed State: Factorization Machines (CTR). Memory Subsystem: Matrix Factorization (Recommender). GPUs: Deep Learning (Images).
14 $p(y \mid x, w)$: a linear model is not enough.
15 Factorization Machines. Linear model: $f(x) = \langle w, x \rangle$ (memory hog). Polynomial expansion (Rendle, 2012): $f(x) = \langle w, x \rangle + \sum_{i<j} x_i x_j \langle V^{(2)}_i, V^{(2)}_j \rangle + \sum_{i<j<k} x_i x_j x_k \sum_l V^{(3)}_{il} V^{(3)}_{jl} V^{(3)}_{kl} + \dots$ (too large for an individual machine).
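For the second-order term, the standard trick avoids the quadratic pairwise sum; a sketch with assumed names (V holds one k-dimensional embedding per feature, the input is given as sparse index/value pairs):

```python
import numpy as np

def fm_score(w, V, idx, val):
    # f(x) = <w, x> + sum_{i<j} x_i x_j <V_i, V_j>, computed via the identity
    # sum_{i<j} x_i x_j <V_i, V_j> = 0.5 * (||sum_i x_i V_i||^2 - sum_i ||x_i V_i||^2)
    linear = sum(w[i] * v for i, v in zip(idx, val))
    s = np.zeros(V.shape[1])
    sq = 0.0
    for i, v in zip(idx, val):
        vi = v * V[i]
        s += vi
        sq += vi @ vi
    return linear + 0.5 * (s @ s - sq)
```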
16 Prefetching to the rescue. Most keys are infrequent (power-law distribution). Prefetch the embedding vectors for a minibatch from the parameter server [pipeline diagram: the fetch for step t overlaps with computation for earlier steps], then compute gradients and push them to the server. Further ingredients: variable-dimensionality embeddings; enforcing sparsity (ANOVA style); adaptive gradient normalization; frequency-adaptive regularization (CF style).
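A minimal sketch of how such prefetching overlaps communication with computation, reusing the hypothetical ps interface from above:

```python
from concurrent.futures import ThreadPoolExecutor

def prefetching_worker(ps, batches):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ps.pull, batches[0].nonzero_feature_ids())
    for t, batch in enumerate(batches):
        weights = future.result()                 # prefetched embeddings arrive
        if t + 1 < len(batches):                  # start the next fetch early
            future = pool.submit(ps.pull, batches[t + 1].nonzero_feature_ids())
        grads = local_gradient(weights, batch)    # compute overlaps with comms
        ps.push(batch.nonzero_feature_ids(), grads, timestamp=t)
```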
17 Better Models [plot on Criteo 1 TB: relative logloss (%) vs. model size k; curves: what everyone else does, no memory adaptation, frequency, frequency + $\ell_1$ shrinkage]
18 Faster Solver (small Criteo) [plot: test logloss vs. seconds; curves: LibFM, DiFacto with 1 worker, DiFacto with 10 workers]. LibFM died on large models.
19 Multiple Machines (Li, Wang, Liu, Smola, WSDM 16, submitted). …× LibFM speed on 16 machines [plot: speedup (×) vs. number of machines on Criteo2 and CTR]
20 Details. Parameter Server Basics: Logistic Regression (Classification). Large Distributed State: Factorization Machines (CTR). Memory Subsystem: Matrix Factorization (Recommender). GPUs: Deep Learning (Images).
21 Recommender Systems. Users $u$, movies $m$ (or projects). Function class: $r_{um} = \langle v_u, w_m \rangle + b_u + b_m$. Loss function for recommendation (Yelp, Netflix): $\sum_{u \sim m} (\langle v_u, w_m \rangle + b_u + b_m - y_{um})^2$
22 Recommender Systems. Regularized objective: $\sum_{u \sim m} (\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um})^2 + \lambda \big( \|U\|^2_{\text{Frob}} + \|V\|^2_{\text{Frob}} \big)$. Update operations: $v_u \leftarrow (1 - \eta_t \lambda) v_u - \eta_t w_m (\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um})$ and $w_m \leftarrow (1 - \eta_t \lambda) w_m - \eta_t v_u (\langle v_u, w_m \rangle + b_u + b_m + b_0 - r_{um})$. A very simple SGD algorithm (random pairs); this should be cheap for the memory subsystem.
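The inner loop is tiny; a sketch with assumed names (eta is the step size $\eta_t$, lam the regularizer $\lambda$):

```python
import numpy as np

def mf_sgd_step(V, W, bu, bm, b0, u, m, r, eta, lam):
    # one update for a randomly sampled observed rating r of the pair (u, m)
    err = V[u] @ W[m] + bu[u] + bm[m] + b0 - r    # prediction residual
    vu = V[u].copy()                              # keep old v_u for W[m]'s update
    V[u] = (1 - eta * lam) * V[u] - eta * err * W[m]
    W[m] = (1 - eta * lam) * W[m] - eta * err * vu
    bu[u] -= eta * err                            # bias updates (unregularized here)
    bm[m] -= eta * err
```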
23 This should be cheap: $O(md)$ burst reads and $O(m)$ random reads. Netflix dataset: m = 100 million ratings, d = 2048 dimensions, 30 steps. The runtime should be > 4500 s: at 60 GB/s memory bandwidth the sequential traffic alone costs about 3300 s, and at 100 ns per random read the random accesses cost another 1200 s. We get 560 s. Why? (Liu, Wang, Smola, RecSys 2015)
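The slide only states the totals; the per-rating transfer counts below are reverse-engineered assumptions chosen to reproduce them:

```python
ratings, d, steps = 100_000_000, 2048, 30
vec_bytes = d * 4                      # one fp32 embedding vector

seq_transfers = 8                      # sequential vector transfers per rating (assumed)
rand_jumps = 4                         # random jumps per rating, one per vector op (assumed)

t_bw = ratings * steps * seq_transfers * vec_bytes / 60e9   # 60 GB/s bandwidth
t_lat = ratings * steps * rand_jumps * 100e-9               # 100 ns per random read
print(round(t_bw), round(t_lat))       # ~3277 s and 1200 s, matching the slide's numbers
```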
24 Power law in Collaborative Filtering [log-log plot on the Netflix dataset: number of movies vs. number of ratings]
25 Key Ideas. Stratify ratings by user (only 1 cache miss/read per user, even out of core). Keep frequent movies in cache (stratify by blocks of movie popularity). Avoid false sharing between sockets (a key cached on the wrong CPU causes misses).
26 Key Ideas [diagram: GraphChi partitioning]
27 Key Ideas [diagram: SC-SGD partitioning]
28 Speed [two plots: seconds vs. number of dimensions. Left, Netflix (100M ratings, 15 iterations) on c4.8xlarge: C-SGD, Fast SGLD, GraphChi, GraphLab, BIDMach. Right, Yahoo (250M ratings, 30 iterations) on g2.8xlarge: C-SGD, Fast SGLD, GraphChi, GraphLab.]
29 Convergence [plot: test RMSE vs. runtime in seconds, 30 epochs in total; curves: C-SGD, GraphChi SGD, and Fast SGLD, all at k = 2048]. GraphChi blocks (users, movies) into random groups, which gives poor mixing and slow convergence.
30 Details. Parameter Server Basics: Logistic Regression (Classification). Large Distributed State: Factorization Machines (CTR). Memory Subsystem: Matrix Factorization (Recommender). GPUs: Deep Learning (Images). github.com/dmlc
31 The Challenge. There are multiple good single-machine toolkits: Caffe (convolution-optimized, images); CXXNET (good tensor library); Minerva (scheduler and layout on CPU/GPU); Torch (Lua plus an interesting C preprocessor; very, very popular, though); Theano (deep network compiler built by ML researchers). Don't reinvent the wheel for deep learning; integrate with the parameter server.
32 Minerva (dmlc/minerva). Tensor interface in Python (similar to numpy). Dataflow engine with automatic parallel execution, on multi-core CPU and on multi-GPU. Optimizes layout automatically. Zhang et al., 2014 (NIPS workshop).
33 Minerva Scaling [bar chart: images/second for AlexNet, VGGNet, and GoogLeNet as the number of GPU cards grows from 1]
34 Distributed Deep Learning [diagram]
35 Distributed Deep Learning [diagram, continued]
36 Scaling on AWS (g2.2xlarge) [plot: AlexNet scaling; the 1 Gbit network is the limit]
37 Amazon just released the g2.8xlarge. 12 instances (48 GPUs) at $0.50/h spot. Minibatch size 512. BSP with 1 delay between machines. 2 GB/s bandwidth between machines (awful). [list of instance addresses, all over the place in the availability zone]. Compressing to 1 byte per coordinate helps a bit but adds latency due to an extra pass (need to fix). 37× speedup on 48 GPUs: the ImageNet 12 dataset is trained in 4 h, i.e. $24 (with AlexNet; GoogLeNet is even better for the network).
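The talk does not spell out the compression scheme; here is a minimal sketch of one plausible 1-byte-per-coordinate linear quantizer, where the min/max scan is the extra pass that adds latency:

```python
import numpy as np

def quantize(g):
    lo, hi = float(g.min()), float(g.max())    # the extra pass over the gradient
    scale = (hi - lo) / 255.0 or 1.0           # guard against a constant vector
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, lo, scale                        # 1 byte/coordinate plus two floats

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo
```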
38 Summary [parameter server diagram as before]. Parameter Server Basics: Logistic Regression. Large Distributed State: Factorization Machines. Memory Subsystem: Matrix Factorization. GPUs: Deep Learning. Much more: Topic Models, NLP, Docker, Sketches, Fault Tolerance. We are hiring!