CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing

CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing (Slides based on Lectures 16-18 and selected readings) Yue Li Email: yueli@cs.toronto.edu Wed 11-12 March 19, Fri 10-11 March 21

Outline
- Boltzmann Machines
- Simulated Annealing
- Restricted Boltzmann Machines
- Deep learning using stacked RBMs

General Boltzmann Machines [1]

(Figure: a small network with visible units v1, v2 and hidden units h1, h2.)

- The network is symmetrically connected
- Connections are allowed between visible and hidden units
- Each binary unit makes a stochastic decision to be either on or off
- The configuration of the network dictates its energy
- At the equilibrium state of the network, the likelihood is defined as the exponentiated negative energy, known as the Boltzmann distribution

Boltzmann Distribution

(Figure: the same network with visible units v1, v2 and hidden units h1, h2.)

$$E(\mathbf{v}, \mathbf{h}) = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij} \qquad (1)$$

$$P(\mathbf{v}) = \frac{\sum_{\mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h}))}{\sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h}))} \qquad (2)$$

where $\mathbf{v}$ and $\mathbf{h}$ are the visible and hidden units, the $w_{ij}$ are connection weights between visible-visible, hidden-hidden, and visible-hidden units, and $E(\mathbf{v}, \mathbf{h})$ is the energy function.

Two problems:
1. Given $w$, how to achieve thermal equilibrium of $P(\mathbf{v}, \mathbf{h})$ over all possible network configurations, including visible & hidden units
2. Given $\mathbf{v}$, learn $w$ to maximize $P(\mathbf{v})$
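To make Eqs. (1)-(2) concrete, here is a brute-force sketch (my own illustration, not from the slides) that enumerates every configuration of a tiny Boltzmann machine and computes P(v) exactly; the exponential number of configurations is precisely why this is infeasible for realistic networks.

```python
import itertools
import numpy as np

def boltzmann_p_v(W, b, n_visible):
    """Exact P(v) for a toy Boltzmann machine via Eqs. (1)-(2).
    W: symmetric weight matrix over all units (visible units listed first),
    b: bias vector. Enumerating all 2^n states is only feasible for toy sizes."""
    n = len(b)
    unnorm = {}
    for bits in itertools.product([0, 1], repeat=n):
        s = np.array(bits, dtype=float)
        # Energy E(s) = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij  (Eq. 1)
        E = -(s @ b) - sum(s[i] * s[j] * W[i, j]
                           for i in range(n) for j in range(i + 1, n))
        v = bits[:n_visible]
        unnorm[v] = unnorm.get(v, 0.0) + np.exp(-E)   # sum over hidden states
    Z = sum(unnorm.values())                          # partition function
    return {v: p / Z for v, p in unnorm.items()}      # Eq. (2)
```

For a network with 2 visible and 2 hidden units this sums over only 16 configurations; every additional unit doubles that count, which is the root of problem 1 above.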

Thermal equilibrium is a difficult concept (Lec 16): it does not mean that the system has settled down into the lowest energy configuration. The thing that settles down is the probability distribution over configurations.

(Figure: unnormalized transition probabilities between states A, B, and C at a high temperature T and, after reaching thermal equilibrium, at a low temperature T; illustration only.)

Simulated annealing [2]

Scale the Boltzmann factor by a "temperature" T:

$$P(s) = \frac{\exp(-E(s)/T)}{\sum_{s'} \exp(-E(s')/T)} \qquad (3)$$

where $s = \{\mathbf{v}, \mathbf{h}\}$. At step $t+1$, a proposed state $s'$ is compared with the current state $s_t$:

$$\frac{P(s')}{P(s_t)} = \exp\left(-\frac{E(s') - E(s_t)}{T}\right) = \exp\left(-\frac{\Delta E}{T}\right) \qquad (4)$$

$$s_{t+1} = \begin{cases} s' & \text{if } \Delta E < 0 \text{ or } \exp(-\Delta E/T) > \text{rand}(0,1) \\ s_t & \text{otherwise} \end{cases} \qquad (5)$$

NB: T controls the stochasticity of the transition: when $\Delta E > 0$, raising T increases $\exp(-\Delta E/T)$ (more uphill moves accepted), while lowering T decreases $\exp(-\Delta E/T)$.
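Eqs. (4)-(5) translate directly into a Metropolis-style acceptance rule. Below is a minimal Python sketch; the `energy` and `propose` callables, the geometric cooling schedule, and the toy usage at the end are my own assumptions for illustration, not part of the slides.

```python
import math
import random

def simulated_annealing(energy, propose, s0, T0=10.0, alpha=0.95, steps=2000):
    """Minimize `energy` by simulated annealing (sketch of Eqs. 4-5)."""
    s, T = s0, T0
    for _ in range(steps):
        s_new = propose(s)                    # candidate state s'
        dE = energy(s_new) - energy(s)        # Delta E from Eq. (4)
        # Always accept downhill moves; accept uphill moves with prob exp(-dE/T) (Eq. 5)
        if dE < 0 or math.exp(-dE / T) > random.random():
            s = s_new
        T *= alpha                            # slowly lower the temperature
    return s

# Toy usage: minimize E(x) = (x - 3)^2 starting from x = 0
x_best = simulated_annealing(lambda x: (x - 3) ** 2,
                             lambda x: x + random.uniform(-1, 1), 0.0)
```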

A nice demo of simulated annealing from Wikipedia: http://www.cs.utoronto.ca/~yueli/csc321_utm_2014_files/hill_climbing_with_simulated_annealing.gif

Note: simulated annealing is not used in the Restricted Boltzmann Machine algorithm discussed below; Gibbs sampling is used instead. Nonetheless, it is still a nice concept and has been used in many other applications (the paper by Kirkpatrick et al. (1983) [2] has been cited over 30,000 times according to Google Scholar!).

Learning weights in Boltzmann Machines is difficult

$$P(\mathbf{v}) = \frac{\sum_{\mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h}))}{\sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h}))} = \prod_{n=1}^{N} \frac{\sum_{\mathbf{h}} \exp\big(\sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}\big)}{\sum_{\mathbf{v},\mathbf{h}} \exp\big(\sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}\big)}$$

$$\log P(\mathbf{v}) = \sum_{n=1}^{N} \left( \log \sum_{\mathbf{h}} \exp\Big(\sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}\Big) - \log \sum_{\mathbf{v},\mathbf{h}} \exp\Big(\sum_i s_i b_i + \sum_{i<j} s_i s_j w_{ij}\Big) \right)$$

$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = \sum_{n=1}^{N} \left( \sum_{s_i, s_j} s_i s_j\, P(\mathbf{h} \mid \mathbf{v}) - \sum_{s_i, s_j} s_i s_j\, P(\mathbf{v}, \mathbf{h}) \right) = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$$

where $\langle x \rangle$ is the expected value of $x$, and $s_i, s_j \in \{\mathbf{v}, \mathbf{h}\}$. $\langle s_i s_j \rangle_{\text{model}}$ is difficult or takes a long time to compute.

Restricted Boltzmann Machine (RBM) [3]
- A simple unsupervised learning module;
- Only one layer of hidden units and one layer of visible units;
- No connections between hidden units nor between visible units (i.e. a special case of the Boltzmann Machine);
- Edges are still undirected or bi-directional

e.g., an RBM with 2 visible and 3 hidden units:

(Figure: hidden layer h1, h2, h3 fully connected to the input/visible layer v1, v2.)

Objective function of RBM - maximum likelihood:

$$E(\mathbf{v}, \mathbf{h} \mid \theta) = -\sum_{ij} w_{ij} v_i h_j - \sum_i b_i v_i - \sum_j b_j h_j$$

$$p(\mathbf{v} \mid \theta) = \prod_{n=1}^{N} \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h} \mid \theta) = \prod_{n=1}^{N} \frac{\sum_{\mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h} \mid \theta))}{\sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h} \mid \theta))}$$

$$\log p(\mathbf{v} \mid \theta) = \sum_{n=1}^{N} \left( \log \sum_{\mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h} \mid \theta)) - \log \sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h} \mid \theta)) \right)$$

$$\frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial w_{ij}} = \sum_{n=1}^{N} \left( \sum_{\mathbf{h}} v_i h_j\, p(\mathbf{h} \mid \mathbf{v}) - \sum_{\mathbf{v}, \mathbf{h}} v_i h_j\, p(\mathbf{v}, \mathbf{h}) \right) = E_{\text{data}}[v_i h_j] - E_{\text{model}}[\hat{v}_i \hat{h}_j] \equiv \langle v_i h_j \rangle_{\text{data}} - \langle \hat{v}_i \hat{h}_j \rangle_{\text{model}}$$

But $\langle \hat{v}_i \hat{h}_j \rangle_{\text{model}}$ is still too expensive to estimate exactly, so we apply Markov Chain Monte Carlo (MCMC) (i.e., Gibbs sampling) to estimate it.
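As a small concreteness check, the RBM energy above is a one-liner in numpy. This is a sketch under my own assumed array layout (a visible-by-hidden weight matrix), which is not specified on the slide.

```python
import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """E(v, h | theta) = -sum_ij w_ij v_i h_j - sum_i b_i v_i - sum_j b_j h_j.
    v: (n_visible,), h: (n_hidden,), W: (n_visible, n_hidden)."""
    return -(v @ W @ h) - (b_v @ v) - (b_h @ h)
```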

(Figure: an alternating Gibbs sampling chain. $\langle v_i h_j \rangle^0$ at $t = 0$ (data), $\langle v_i h_j \rangle^1$ at $t = 1$ (reconstruction), ..., $\langle v_i h_j \rangle^\infty$ at $t = \infty$ (a "fantasy"); the shortcut stops the chain after one full step, at the reconstruction.)

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}} = \langle h_j^0 (v_i^0 - v_i^1) \rangle + \langle v_i^1 (h_j^0 - h_j^1) \rangle + \langle h_j^1 (v_i^1 - v_i^2) \rangle + \dots = \langle v_i^0 h_j^0 \rangle - \langle v_i^\infty h_j^\infty \rangle \approx \langle v_i^0 h_j^0 \rangle - \langle v_i^1 h_j^1 \rangle$$

How Gibbs sampling works

(Figure: $\langle v_i h_j \rangle^0$ at $t = 0$ (data) and $\langle v_i h_j \rangle^1$ at $t = 1$ (reconstruction).)

1. Start with a training vector on the visible units
2. Update all the hidden units in parallel
3. Update all the visible units in parallel to get a reconstruction
4. Update the hidden units again

$$\Delta w_{ij} = \epsilon \big( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \big) \qquad (6)$$

Approximate maximum likelihood learning

$$\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ v_i^{(n)} h_j^{(n)} - \hat{v}_i^{(n)} \hat{h}_j^{(n)} \right] \qquad (7)$$

where $v_i^{(n)}$ is the value of the $i$-th visible (input) unit for the $n$-th training case; $h_j^{(n)}$ is the value of the $j$-th hidden unit; $\hat{v}_i^{(n)}$ is the sampled value for the $i$-th visible unit, or the "negative data", generated based on $h_j^{(n)}$ and $w_{ij}$; $\hat{h}_j^{(n)}$ is the sampled value for the $j$-th hidden unit, or the "negative hidden activities", generated based on $\hat{v}_i^{(n)}$ and $w_{ij}$.

Still, how exactly are the negative data and negative hidden activities generated?

wake-sleep algorithm (Lec 18, p. 5)

1. Positive ("wake") phase (clamp the visible units with data):
   Use the input data to generate hidden activities:
   $$h_j = \frac{1}{1 + \exp(-\sum_i v_i w_{ij} - b_j)}$$
   Sample the hidden state from a Bernoulli distribution:
   $$h_j \leftarrow \begin{cases} 1, & \text{if } h_j > \text{rand}(0,1) \\ 0, & \text{otherwise} \end{cases}$$

2. Negative ("sleep") phase (unclamp the visible units from the data):
   Use $h_j$ to generate negative data:
   $$\hat{v}_i = \frac{1}{1 + \exp(-\sum_j w_{ij} h_j - b_i)}$$
   Use the negative data $\hat{v}_i$ to generate negative hidden activities:
   $$\hat{h}_j = \frac{1}{1 + \exp(-\sum_i \hat{v}_i w_{ij} - b_j)}$$
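A minimal numpy sketch of the positive and negative phases above, run on a whole mini-batch at once; the array shapes, the seeded generator, and the function name are my own assumptions rather than the slides'.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_negative_phases(v0, W, b_v, b_h):
    """One wake/sleep pass. v0: (N, n_vis) data batch, W: (n_vis, n_hid)."""
    # Positive ("wake") phase: hidden activities from the data, then a Bernoulli sample
    h0 = sigmoid(v0 @ W + b_h)
    h0_sample = (h0 > rng.random(h0.shape)).astype(float)
    # Negative ("sleep") phase: negative data, then negative hidden activities
    v_hat = sigmoid(h0_sample @ W.T + b_v)
    h_hat = sigmoid(v_hat @ W + b_h)
    return h0, v_hat, h_hat
```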

RBM learning algorithm (cont'd) - Learning

$$\Delta w_{ij}^{(t)} = \eta\, \Delta w_{ij}^{(t-1)} + \epsilon_w \left( \frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial w_{ij}} - \lambda\, w_{ij}^{(t-1)} \right)$$

$$\Delta b_i^{(t)} = \eta\, \Delta b_i^{(t-1)} + \epsilon_{vb}\, \frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial b_i}$$

$$\Delta b_j^{(t)} = \eta\, \Delta b_j^{(t-1)} + \epsilon_{hb}\, \frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial b_j}$$

where

$$\frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial w_{ij}} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ v_i^{(n)} h_j^{(n)} - \hat{v}_i^{(n)} \hat{h}_j^{(n)} \right]$$

$$\frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial b_i} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ v_i^{(n)} - \hat{v}_i^{(n)} \right]$$

$$\frac{\partial \log p(\mathbf{v} \mid \theta)}{\partial b_j} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ h_j^{(n)} - \hat{h}_j^{(n)} \right]$$
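Putting the update rules together, here is a hedged sketch of one parameter update with momentum and weight decay. It takes the positive/negative statistics from the previous sketch as numpy-array inputs; the dict keys, rate names, and hyperparameter values are my own illustrative assumptions, not values given on the slides.

```python
def rbm_parameter_update(v0, h0, v_hat, h_hat, params, velocity,
                         eta=0.9, eps_w=0.05, eps_vb=0.05, eps_hb=0.05, lam=1e-4):
    """One momentum + weight-decay update (sketch of the slide's rules).
    v0/h0: positive-phase statistics, v_hat/h_hat: negative-phase statistics,
    params and velocity: dicts of numpy arrays with keys 'W', 'b_v', 'b_h'."""
    N = v0.shape[0]
    grads = {
        'W':   (v0.T @ h0 - v_hat.T @ h_hat) / N,   # Eq. (7)
        'b_v': (v0 - v_hat).mean(axis=0),
        'b_h': (h0 - h_hat).mean(axis=0),
    }
    rates = {'W': eps_w, 'b_v': eps_vb, 'b_h': eps_hb}
    for k in params:
        decay = lam * params[k] if k == 'W' else 0.0   # weight decay on W only
        velocity[k] = eta * velocity[k] + rates[k] * (grads[k] - decay)
        params[k] += velocity[k]                        # momentum update
    return params, velocity
```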

Deep learning using stacked RBMs on images [3]

(Figure: network architecture with a 28 x 28 pixel image layer, two layers of 500 units, and 2000 top-level units connected to 10 label units; "this could be the top level of another sensory pathway".)

A greedy learning algorithm:
- The bottom layer encodes the 28 x 28 handwritten image
- The adjacent upper layer of 500 hidden units is used for a distributed representation of the images
- The next 500-unit layer and the top layer of 2000 units are called the associative memory layers, which have undirected connections between them
- The very top layer encodes the class labels with a softmax

The network, trained on 60,000 training cases, achieved a 1.25% test error when classifying the 10,000 MNIST test cases. On the right are the incorrectly classified images, with the predictions shown in the top-left corner (Figure 6, Hinton et al., 2006).

Letting the model generate 28 x 28 images for a specific class label: each row shows 10 samples from the generative model with a particular label clamped on. The top-level associative memory is run for 1000 iterations of alternating Gibbs sampling (Figure 8, Hinton et al., 2006).

Looking into the mind of the network: each row shows 10 samples from the generative model with a particular label clamped on. ... Subsequent columns are produced by 20 iterations of alternating Gibbs sampling in the associative memory (Figure 9, Hinton et al., 2006).

Deep learning using stacked RBMs on handwritten images (Hinton et al., 2006). A real-time demo from Prof. Hinton's webpage: http://www.cs.toronto.edu/~hinton/digits.html

Further References

[1] David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147-169, 1985.

[2] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, 1983.

[3] G. Hinton and S. Osindero. A fast learning algorithm for deep belief nets. Neural Computation, 2006.