Learning Representations for Hyperparameter Transfer Learning

1 Learning Representations for Hyperparameter Transfer Learning. Cédric Archambeau, DALI 2018, Lanzarote, Spain.

2 My co-authors: Rodolphe Jenatton, Valerio Perrone, Matthias Seeger.

4 Tuning deep neural nets for optimal performance (LeNet5 [LBBH98]). The search space X is large and diverse:
- Architecture: # hidden layers, activation functions, ...
- Model complexity: regularization, dropout, ...
- Optimisation parameters: learning rates, momentum, batch size, ...

5 Two straightforward approaches (Figure by Bergstra and Bengio, 2012): exhaustive search on a regular or random grid.
- Complexity is exponential in p.
- Wasteful of resources, but easy to parallelise.
- Memoryless: no draw uses information from earlier evaluations (see the sketch below).
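
To make the baseline concrete, here is a minimal random-search sketch in Python. The search space and the evaluate function are illustrative placeholders, not taken from the talk:

    import random

    # Illustrative mixed search space for the tuning example above.
    space = {
        "learning_rate": lambda: 10 ** random.uniform(-5, -1),   # log-uniform
        "dropout":       lambda: random.uniform(0.0, 0.8),
        "batch_size":    lambda: random.choice([32, 64, 128, 256]),
    }

    def random_search(evaluate, space, budget=50):
        """Draw independent configurations and keep the best one. Memoryless:
        no draw depends on earlier evaluations, which is also why random
        search parallelises trivially."""
        best_x, best_y = None, float("inf")
        for _ in range(budget):
            x = {name: sample() for name, sample in space.items()}
            y = evaluate(x)   # e.g. validation loss; assumed expensive
            if y < best_y:
                best_x, best_y = x, y
        return best_x, best_y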

6 Can we do better?

7 Hyperparameter transfer learning

11 Democratising machine learning:
- Abstract away training algorithms
- Abstract away representation (feature engineering)
- Abstract away computing infrastructure
- Abstract away memory constraints
- Abstract away network architecture

12 Black-box global optimisation. Our goal is to solve the following optimisation problem: $x^\star = \arg\min_{x \in \mathcal{X}} f(x)$.
- The function f to optimise can be non-convex.
- The number of hyperparameters p is moderate (typically < 20).
- Evaluating f(x) is expensive.
- No analytical form or gradient.
- Evaluations may be noisy.

14 Example: tuning deep neural nets [SLA12, SRS+15, KFB+16] (LeNet5 [LBBH98]). f(x) is the validation loss of the neural net as a function of its hyperparameters x. Evaluating f(x) is very costly (up to weeks!).

15 Bayesian (black-box) optimisation [MTZ78, SSW+16]: solve $x^\star = \arg\min_{x \in \mathcal{X}} f(x)$.
Canonical algorithm (a schematic code version follows below):
- Build a surrogate model M of f. # cheaper to evaluate
- Initialise the set of evaluated candidates C = {}.
- While some BUDGET is available:
  - Select a candidate x_new in X using M and C. # exploration/exploitation
  - Collect the evaluation y_new of f at x_new. # time-consuming
  - Update C = C ∪ {(x_new, y_new)}.
  - Update M with C. # update surrogate model
  - Update BUDGET.
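
The loop above translates almost line for line into code. This is a schematic sketch only: fit_surrogate and suggest are placeholders for, e.g., Gaussian process regression and an acquisition function such as expected improvement:

    def bayesian_optimisation(f, fit_surrogate, suggest, initial_candidates, budget):
        """Schematic version of the canonical algorithm on the slide.
        f: expensive black-box objective.
        fit_surrogate: fits a cheap model M of f to the evaluations so far.
        suggest: picks x_new using M and C (exploration/exploitation)."""
        C = [(x, f(x)) for x in initial_candidates]   # evaluated candidates
        M = fit_surrogate(C)                          # surrogate model of f
        for _ in range(budget):
            x_new = suggest(M, C)     # exploration/exploitation trade-off
            y_new = f(x_new)          # time-consuming evaluation
            C.append((x_new, y_new))
            M = fit_surrogate(C)      # update the surrogate model
        return min(C, key=lambda pair: pair[1])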

20 Bayesian (black-box) optimisation with Gaussian processes [JSW98]
1. Learn a probabilistic model of f, which is cheap to evaluate: $y_i \mid f(x_i) \sim \mathcal{N}(f(x_i), \varsigma^2)$, with $f(x) \sim \mathcal{GP}(0, K)$.
2. Given the observations $y = (y_1, \ldots, y_n)$, compute the predictive mean and the predictive standard deviation: $\mu(x_\star) = k(x_\star)^\top (K + \varsigma^2 I)^{-1} y$ and $\sigma^2(x_\star) = k(x_\star, x_\star) - k(x_\star)^\top (K + \varsigma^2 I)^{-1} k(x_\star)$.
3. Repeatedly query f by balancing exploitation against exploration!
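
As a concrete illustration of step 2, here is a minimal numpy sketch of the predictive equations, assuming a squared-exponential kernel with unit prior variance (a choice made here for illustration, not prescribed by the talk):

    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0):
        """Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-0.5 * sq_dists / lengthscale ** 2)

    def gp_predict(X, y, X_star, noise_var=1e-3):
        """Predictive mean and std of f ~ GP(0, K) under Gaussian noise ς²."""
        K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
        K_star = rbf_kernel(X_star, X)
        L = np.linalg.cholesky(K)          # the O(N^3) step (see next slide)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        mu = K_star @ alpha                # predictive mean
        v = np.linalg.solve(L, K_star.T)
        var = 1.0 - (v ** 2).sum(axis=0)   # k(x*, x*) = 1 for this kernel
        return mu, np.sqrt(np.maximum(var, 0.0))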

23 Bayesian optimisation in practice (Image credit: Javier González)

24 What is wrong with the Gaussian process surrogate? Scaling is $\mathcal{O}(N^3)$ in the number of evaluations N.

25 Adaptive Bayesian linear regression (ABLR) [Bis06]
The model: $P(y \mid w, z, \beta) = \prod_n \mathcal{N}(y_n \mid \phi_z(x_n)^\top w, \beta^{-1})$, $P(w \mid \alpha) = \mathcal{N}(0, \alpha^{-1} I_D)$.
The predictive distribution: $P(y_\star \mid x_\star, \mathcal{D}) = \int P(y_\star \mid x_\star, w)\, P(w \mid \mathcal{D})\, dw = \mathcal{N}(\mu_t(x_\star), \sigma_t^2(x_\star))$.
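
The predictive distribution is available in closed form and is cheap to compute. A minimal numpy sketch, assuming fixed α and β and a precomputed feature matrix (the learned features φ_z are kept abstract):

    import numpy as np

    def ablr_predict(Phi, y, Phi_star, alpha=1.0, beta=1.0):
        """Posterior predictive of Bayesian linear regression in feature space.
        Phi: (n, D) matrix of features φ_z(x_n); Phi_star: (m, D) test features."""
        D = Phi.shape[1]
        # Posterior over weights w: N(m_w, S), with S^{-1} = α I_D + β Φᵀ Φ.
        S_inv = alpha * np.eye(D) + beta * Phi.T @ Phi
        L = np.linalg.cholesky(S_inv)      # O(D^3), independent of n
        m_w = beta * np.linalg.solve(S_inv, Phi.T @ y)
        mu = Phi_star @ m_w                           # μ(x*)
        v = np.linalg.solve(L, Phi_star.T)
        var = 1.0 / beta + (v ** 2).sum(axis=0)       # σ²(x*) = 1/β + φᵀ S φ
        return mu, np.sqrt(var)

The cubic cost is now in the number of features D rather than in the number of evaluations N, which is what makes this surrogate scale where the GP does not.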

26 Multi-task ABLR for transfer learning
1. Multi-task extension of the model: $P(y_t \mid w_t, z, \beta_t) = \prod_{n_t} \mathcal{N}(\phi_z(x_{n_t})^\top w_t, \beta_t^{-1})$, $P(w_t \mid \alpha_t) = \mathcal{N}(0, \alpha_t^{-1} I_D)$.
2. Shared features $\phi_z(x)$:
   - Explicit feature set (e.g., RBF)
   - Random kitchen sinks [RR+07]
   - Learned by a feedforward neural net
3. Multi-task objective (see the sketch after this list): $\rho\big(z, \{\alpha_t, \beta_t\}_{t=1}^T\big) = \sum_{t=1}^T \log P(y_t \mid z, \alpha_t, \beta_t)$
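
A sketch of the random-kitchen-sinks choice of shared features, together with the per-task log marginal likelihood log P(y_t | z, α_t, β_t) that the objective sums. Hyperparameters are held fixed here for clarity; in the talk, z and the {α_t, β_t} are optimised jointly:

    import numpy as np

    def rks_features(X, D=64, lengthscale=1.0, seed=0):
        """Random kitchen sinks: random Fourier features approximating an RBF kernel."""
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=1.0 / lengthscale, size=(X.shape[1], D))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    def log_marginal_likelihood(Phi, y, alpha, beta):
        """log P(y_t | z, α_t, β_t) for one task, with the weights w_t integrated out."""
        n, D = Phi.shape
        S_inv = alpha * np.eye(D) + beta * Phi.T @ Phi
        L = np.linalg.cholesky(S_inv)
        m_w = beta * np.linalg.solve(S_inv, Phi.T @ y)
        data_fit = beta * (y @ y - y @ (Phi @ m_w))   # 2 E(m_w), after completing the square
        log_det = 2.0 * np.log(np.diag(L)).sum()
        return 0.5 * (D * np.log(alpha) + n * np.log(beta)
                      - data_fit - log_det - n * np.log(2.0 * np.pi))

    # Multi-task objective ρ: sum the per-task terms over t = 1, ..., T, e.g.
    # rho = sum(log_marginal_likelihood(rks_features(X_t), y_t, a_t, b_t)
    #           for (X_t, y_t), (a_t, b_t) in zip(tasks, task_hypers))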

28 Warm-start procedure for hyperparameter optimisation (HPO): leave-one-task-out (sketched below).
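
Schematically, the leave-one-task-out protocol might look as follows; all function names are hypothetical placeholders for the components above:

    def warm_start_hpo(tasks, target_task, fit_multi_task, run_bo):
        """Leave-one-task-out: learn the representation on every task except
        the target, then run Bayesian optimisation on the target task with a
        surrogate built on that representation."""
        source_tasks = [t for t in tasks if t is not target_task]
        representation = fit_multi_task(source_tasks)   # learns φ_z from sources
        return run_bo(target_task, representation)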

29 A representation to optimise parametrised quadratic functions $f_t(x) = a_{2,t}\, x^2 + a_{1,t}\, x + a_{0,t}$. [Figures: best value found vs. iteration. Left: transfer learning with baselines [KO11], comparing random search, ABLR NN, ABLR RKS, ABLR NN transfer, ABLR NN transfer (ctx), GP, and GP transfer (ctx). Right: transfer learning with neural nets [SRS+15, SKFH16], comparing DNGO, DNGO transfer (ctx), BOHAMIANN, BOHAMIANN transfer (ctx), ABLR NN, ABLR NN transfer, and ABLR NN transfer (ctx).]

30 A representation to warm-start HPO across OpenML data sets. [Figures: (1 - AUC) vs. iteration, comparing random search, ABLR NN, ABLR RKS, ABLR NN transfer, GP, and GP transfer (ctx, L1). Left: transfer learning in SVM. Right: transfer learning in XGBoost.]

31 Learning better representations by jointly modelling multiple signals. Tasks are now auxiliary signals related to the target function to optimise; the target is still the validation loss f(x). Examples of auxiliary signals are the training cost or the training loss.

32 A representation for transfer learning across OpenML data sets. [Figure: test error vs. iteration for transfer learning across LIBSVM data sets, comparing random search, ABLR, ABLR train loss (2), GP, ABLR cost (2), ABLR cost + train loss (3), and ABLR cost + epochwise train loss (21).]

33 Conclusion. Bayesian optimisation automates machine learning:
- Algorithm tuning
- Model tuning
- Pipeline tuning
Bayesian optimisation is a model-based approach that can leverage auxiliary information:
- It can exploit dependency structures [JAGS17]
- It can be extended to warm-start HPO jobs [PJSA17]
- It is a key building block of meta-learning
Effective representations for HPO transfer learning can be learned.

37 References I
[BB12] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, 2012.
[Bis06] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[JAGS17] Rodolphe Jenatton, Cédric Archambeau, Javier González, and Matthias Seeger. Bayesian optimization with tree-structured dependencies. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia, August 2017. PMLR.
[JSW98] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.

38 References II
[KFB+16] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. Technical report, arXiv preprint, 2016.
[KO11] Andreas Krause and Cheng S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NIPS), 2011.
[LBBH98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[MTZ78] Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117-129, 1978.
[PJSA17] V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau. Multiple adaptive Bayesian linear regression for scalable Bayesian optimization with warm start. arXiv e-prints, December 2017.

39 References III
[RR+07] Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS) 20, pages 1177-1184, 2007.
[SHDL17] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, and Neil D. Lawrence. Auto-differentiating linear algebra. Technical report, arXiv preprint, 2017.
[SKFH16] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
[SLA12] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2951-2959, 2012.

40 References IV
[SRS+15] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 2171-2180, 2015.
[SSW+16] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.
[VvRBT14] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49-60, 2014.
