Algorithms other than SGD. CS6787 Lecture 10 Fall 2017


Machine learning is not just SGD
- Once a model is trained, we need to use it to classify new examples. This inference task is not computed with SGD.
- There are other algorithms for optimizing objectives besides SGD:
  - Stochastic coordinate descent
  - Derivative-free optimization
- There are other common tasks, such as sampling from a distribution:
  - Gibbs sampling and other Markov chain Monte Carlo methods
  - We sometimes use this together with SGD → called contrastive divergence

Why understand these algorithms?
- They represent a significant fraction of machine learning computations. Inference in particular is huge.
- You may want to use them instead of SGD, but you don't want to suddenly pay a computational penalty for doing so because you don't know how to make them fast.
- Intuition from SGD can be used to make these algorithms faster too, and vice versa.

Inference

Inference
Suppose that our training loss function looks like

f(w) = \frac{1}{N} \sum_{i=1}^{N} l(\hat{y}(w; x_i), y_i)

Inference is the problem of computing the prediction \hat{y}(w; x_i).

How important is inference?
- Train once, infer many times.
- Many production machine learning systems just do inference: image recognition, voice recognition, translation. All are just applications of inference once they're trained.
- Need to get responses to users quickly: on the web, users won't wait more than a second.

Inference on linear models
- Computational cost: relatively low. Just a matrix-vector multiply.
- But it can still be more costly in some settings, for example if we need to compute a random kernel feature map.
- What is the cost of this? Which methods can we use to speed up inference in this setting?
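As an illustration of where that cost comes from, here is a minimal sketch (not from the lecture) of inference on a linear model with a random Fourier feature map approximating an RBF kernel; the dimensions, weights, and the predict helper are all hypothetical.

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Map inputs X (n, d) to random kernel features using projection W (d, D) and phases b (D,)."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

# Hypothetical setup: d input features, D random features, a trained weight vector w.
rng = np.random.default_rng(0)
d, D = 20, 500
W = rng.normal(scale=1.0, size=(d, D))   # frequencies for an RBF kernel approximation
b = rng.uniform(0, 2 * np.pi, size=D)
w = rng.normal(size=D)                   # stand-in for a trained linear model

def predict(X):
    # Inference cost: one (n, d) x (d, D) multiply for the feature map,
    # plus one (n, D) x (D,) multiply for the linear model itself.
    return random_fourier_features(X, W, b) @ w

X_new = rng.normal(size=(3, d))
print(predict(X_new))
```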

Inference on neural networks
- Computational cost: relatively high. Several matrix-vector multiplies and non-linear elements.
- Which methods can we use to speed up inference in this setting?
- Compression: find an easier-to-compute network with similar accuracy by fine-tuning. This is the subject of this week's paper.
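For concreteness, a minimal sketch of why neural-network inference is pricier: a forward pass is a chain of matrix-vector multiplies and nonlinearities. The layer sizes and ReLU activations here are illustrative assumptions, not the network from this week's paper.

```python
import numpy as np

def forward(x, layers):
    """One inference pass: each layer is a matrix-vector multiply followed by a nonlinearity."""
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)   # ReLU hidden layers
    W, b = layers[-1]
    return W @ x + b                     # linear output layer

# Hypothetical three-layer network; the dominant cost is the sum of the W @ x multiplies.
rng = np.random.default_rng(0)
sizes = [256, 512, 512, 10]
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
print(forward(rng.normal(size=sizes[0]), layers).shape)   # (10,)
```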

Other techniques for speeding up inference
- Train a fast model, and run it most of the time. If it's uncertain, then run a more accurate, slower model (a sketch of this follows below).
- For video and time-series data, re-use some of the computation from previous frames. For example, only update some of the activations in the network at each frame, or have a more heavyweight network run less frequently. This rests on the notion that the objects in the scene do not change frequently in most video streams.
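A minimal sketch of the first idea (fast model with a slow fallback), assuming both models expose a scikit-learn-style predict_proba / predict interface; the threshold and model objects are hypothetical.

```python
def cascade_predict(x, fast_model, slow_model, confidence_threshold=0.9):
    """Run the cheap model first; fall back to the expensive model only when it is uncertain.
    Assumes both models expose a scikit-learn-style interface (predict_proba / predict)."""
    probs = fast_model.predict_proba([x])[0]
    if probs.max() >= confidence_threshold:
        return probs.argmax()            # confident: answer with the fast model only
    return slow_model.predict([x])[0]    # uncertain: pay for the slower, more accurate model

# Usage would look like: cascade_predict(example, small_logistic_regression, big_network)
```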

Other Techniques for Training, Besides SGD

Coordinate Descent
Start with the objective

minimize f(x_1, x_2, \ldots, x_n)

Choose a random index i, and update

x_i = \arg\min_{\hat{x}_i} f(x_1, x_2, \ldots, \hat{x}_i, \ldots, x_n)

And repeat in a loop.
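A minimal sketch of this loop for least squares, where the arg min over a single coordinate has a closed form; the objective and helper name are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def coordinate_descent_least_squares(A, b, iters=1000, seed=0):
    """Minimize f(x) = 0.5 * ||A x - b||^2 by exactly minimizing over one random coordinate at a time."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = np.zeros(n)
    col_sq_norms = (A ** 2).sum(axis=0)
    for _ in range(iters):
        i = rng.integers(n)                    # choose a random index i
        r = b - A @ x + A[:, i] * x[i]         # residual with coordinate i's contribution removed
        x[i] = A[:, i] @ r / col_sq_norms[i]   # arg min over x_i alone has a closed form here
    return x

A = np.random.default_rng(1).normal(size=(50, 5))
b_vec = A @ np.arange(1.0, 6.0)
print(coordinate_descent_least_squares(A, b_vec))   # should approach [1, 2, 3, 4, 5]
```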

Variants
- Coordinate descent with derivative and step size (sketched below)
- Stochastic coordinate descent
- How do these compare to SGD?
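For the first variant, a sketch of a coordinate update that uses a partial derivative and a step size instead of an exact arg min; the quadratic objective and step size are hypothetical.

```python
import numpy as np

def coordinate_gradient_step(grad_i, x, i, alpha):
    """Gradient-style coordinate update: step only along coordinate i, using its partial derivative."""
    x = x.copy()
    x[i] -= alpha * grad_i(x, i)
    return x

# Hypothetical quadratic objective f(x) = 0.5 * ||x - c||^2, whose partial derivative is (x - c)[i].
c = np.array([1.0, -2.0, 3.0])
grad_i = lambda x, i: (x - c)[i]

x = np.zeros(3)
rng = np.random.default_rng(0)
for _ in range(200):
    x = coordinate_gradient_step(grad_i, x, rng.integers(3), alpha=0.5)
print(x)   # approaches c, touching one coordinate per step
```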

Derivative Free Optimization (DFO)
- Optimization methods that don't require differentiation.
- Basic coordinate descent is actually an example of this.
- Another example, for normally distributed ε: x_{t+1} = x_t - \epsilon \, \frac{f(x_t + \epsilon) - f(x_t - \epsilon)}{2}
- Applications?
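A minimal sketch of this kind of update, using only evaluations of f along a random Gaussian direction; the objective, the scale of ε, and the iteration count are illustrative assumptions.

```python
import numpy as np

def dfo_step(f, x, rng, sigma=0.1):
    """One derivative-free update: probe a random Gaussian direction eps and move
    along it, scaled by a two-point difference of f. No gradients are used."""
    eps = sigma * rng.normal(size=x.shape)
    return x - eps * (f(x + eps) - f(x - eps)) / 2.0

# Hypothetical smooth objective: f(x) = ||x - c||^2, minimized at c.
c = np.array([2.0, -1.0])
f = lambda x: float(np.sum((x - c) ** 2))

rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(2000):
    x = dfo_step(f, x, rng)
print(x)   # drifts toward c using only evaluations of f
```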

Another Task: Sampling

Focus problem for this setting: statistical inference
- Major class of machine learning applications. Goal: draw conclusions from data using a statistical model.
- Formally: find the marginal distribution of the unobserved variables given the observations.
- Example: decide whether a coin is biased from a series of flips (a worked sketch follows below).
- Applications: LDA, recommender systems, text extraction, etc.
- De facto algorithm used for inference at scale: Gibbs sampling.
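A minimal sketch of the coin example, assuming a Beta(1, 1) prior on the bias (the prior and the particular flips are not from the lecture); with a conjugate prior the posterior is available in closed form, so no sampling is needed yet.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical data: infer the coin's bias p from observed flips, using a Beta(1, 1) prior.
flips = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])   # 1 = heads
heads, tails = flips.sum(), len(flips) - flips.sum()

posterior = beta(1 + heads, 1 + tails)   # Beta prior is conjugate to the Bernoulli likelihood
print(posterior.mean())                  # posterior mean of the bias
print(1 - posterior.cdf(0.5))            # posterior probability that the coin favors heads
```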

Graphical models
- A graphical way to describe a probability distribution.
- Common in machine learning applications, especially applications that deal with uncertainty.

What types of inference exist here?
- Maximum-a-posteriori (MAP) inference: find the state with the highest probability. Often reduces to an optimization problem. What is the most likely state of the world?
- Marginal inference: compute the marginal distributions of some variables. What does our model of the world tell us about this object or event?
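A toy illustration (not from the lecture) of how the two differ, on a made-up joint distribution over two binary variables:

```python
import numpy as np

# Hypothetical joint distribution over two binary variables (rows: a, columns: b).
P = np.array([[0.30, 0.25],
              [0.25, 0.20]])

# MAP inference: the single most probable joint state.
a_map, b_map = np.unravel_index(P.argmax(), P.shape)
print("MAP state:", (a_map, b_map))   # (0, 0)

# Marginal inference: the distribution of one variable, summing out the other.
print("P(a):", P.sum(axis=1))         # [0.55, 0.45]
print("P(b):", P.sum(axis=0))         # [0.55, 0.45]
```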

What is Gibbs Sampling?

Algorithm 1: Gibbs sampling
Require: Variables x_i for 1 ≤ i ≤ n, and distribution P.
loop
  Choose s by sampling uniformly from {1, ..., n}.   (Choose a variable to update at random.)
  Re-sample x_s from P(x_s | x_{{1,...,n}\{s}}).   (Update the variable by sampling from its conditional distribution given the other variables.)
  output x   (Output the current state as a sample.)
end loop

[Figure: a graphical model over variables x_1, ..., x_7, illustrating one variable (x_5) being re-sampled from its conditional distribution, with conditional probabilities 0.7 and 0.3 for its two values.]
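A minimal sketch of the algorithm on a toy Ising chain; the model, coupling strength, and chain length are illustrative assumptions, not the graphical model in the figure.

```python
import numpy as np

def gibbs_ising_chain(n=10, coupling=0.5, steps=5000, seed=0):
    """Gibbs sampler for a toy Ising chain: P(x) ∝ exp(coupling * sum_i x_i * x_{i+1}), x_i ∈ {-1, +1}.
    Each step re-samples one randomly chosen variable from its conditional given its neighbors."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=n)
    samples = []
    for _ in range(steps):
        s = rng.integers(n)                                       # choose a variable uniformly at random
        field = (x[s - 1] if s > 0 else 0) + (x[s + 1] if s < n - 1 else 0)
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * coupling * field))    # P(x_s = +1 | neighbors)
        x[s] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())                                  # output the current state as a sample
    return np.array(samples)

samples = gibbs_ising_chain()
print(samples.mean(axis=0))                      # per-variable marginals estimated from the chain
print((samples[:, 0] * samples[:, 1]).mean())    # nearest-neighbor correlation
```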

Learning graphical models
- Contrastive divergence: SGD on top of Gibbs sampling.
- The de facto way of training:
  - Restricted Boltzmann machines (RBM)
  - Deep belief networks (DBN)
  - Knowledge-base construction (KBC) applications
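A minimal sketch of contrastive divergence (CD-1) for a tiny binary RBM, assuming the usual sigmoid units; the sizes, learning rate, and training pattern are hypothetical. Each update runs one step of Gibbs sampling and then applies an SGD-style parameter update, which is the "SGD on top of Gibbs sampling" structure above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, b_v, b_h, v0, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) step for a binary RBM: a single Gibbs sweep
    v0 -> h0 -> v1 -> h1 gives the negative phase, then an SGD-style parameter update."""
    rng = rng or np.random.default_rng()
    # Positive phase: sample hidden units given the data.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling back to the visibles and hiddens.
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    # Stochastic-gradient-style update from the difference of the two phases.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b_v += lr * (v0 - v1)
    b_h += lr * (ph0 - ph1)
    return W, b_v, b_h

# Hypothetical tiny RBM: 6 visible units, 3 hidden units, trained on one binary pattern.
rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(6, 3))
b_v, b_h = np.zeros(6), np.zeros(3)
pattern = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
for _ in range(500):
    W, b_v, b_h = cd1_update(W, b_v, b_h, pattern, rng=rng)
print(sigmoid(sigmoid(pattern @ W + b_h) @ W.T + b_v))   # reconstruction probabilities
```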

What do all these algorithms look like?
Stochastic iterative algorithms: given an immutable input dataset and a model we want to output, repeat:
1. Pick a data point at random
2. Update the model
3. Iterate
Same structure → same systems properties → same techniques.

Questions?
Upcoming things:
- Paper Review #9 due today
- Paper Presentation #10 on Wednesday
- Project proposals due the following Monday