
Renormalization Group and Information Theory
Kyle Reing, University of Southern California
April 18, 2018

Overview
- Renormalization Group Overview
- Information Theoretic Preliminaries
- Real Space Mutual Information
- Better ways to do this?
- The ML and Physics Friendship

Renormalization Group Example: Block Spin Renormalization Group

Renormalization Group
- Describe the system in terms of block variables (ex: average spin of the block); a concrete sketch of one such step follows below
- Use the same form for the Hamiltonian, except defined over blocks, with different values for the parameters (ex: coupling strength)
- The change in parameters after renormalization defines an RG flow
- The limit of the flow often leads to fixed-point characterizations (ex: Ising ferromagnetic/paramagnetic)
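As a concrete illustration (not taken from the slides), here is a minimal sketch of one block-spin coarse-graining step on a 2D Ising configuration, using a majority rule over 3x3 blocks; the block size, lattice size, and function names are illustrative choices.

```python
import numpy as np

def block_spin_step(spins, b=3):
    """One block-spin renormalization step: replace each b x b block of
    +/-1 spins by its majority sign (ties broken toward +1)."""
    L = spins.shape[0]
    assert L % b == 0, "lattice size must be divisible by the block size"
    blocks = spins.reshape(L // b, b, L // b, b)
    block_sums = blocks.sum(axis=(1, 3))      # total spin within each block
    return np.where(block_sums >= 0, 1, -1)   # majority rule / block variable

# Example: coarse-grain a random 81 x 81 configuration twice (81 -> 27 -> 9)
rng = np.random.default_rng(0)
config = rng.choice([-1, 1], size=(81, 81))
coarse = block_spin_step(block_spin_step(config))
print(coarse.shape)  # (9, 9)
```

In a full block-spin analysis one would then refit the couplings of the Hamiltonian on the coarse lattice; the flow of those couplings across steps is the RG flow described above.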

Another Perspective
- Sufficient statistic: a statistic (function) of the data that contains all of the information with respect to some model (ex: mean/variance for Gaussians)
- Can think of RG as a (lossy) compression: for some coarse-graining operation R, data X, and measure of macro-scale properties Y, we want R(X) to act as a sufficient statistic for Y, i.e. I(Y; R(X)) = I(Y; X), or as close to it as possible

Super Quick Info Theory Definitions
- (Shannon) Entropy: a measure of uncertainty in a random variable; how spread out is the distribution? Defined for both discrete and continuous (differential entropy) variables
- Mutual Information: a measure of the information shared between two random variables; the reduction in uncertainty about one when we know the other
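For reference, the standard definitions of these quantities are:

```latex
% Shannon entropy (discrete) and differential entropy (continuous)
H(X) = -\sum_{x} p(x) \log p(x), \qquad h(X) = -\int p(x) \log p(x)\, dx

% Mutual information: reduction in uncertainty about X once Y is known
I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = H(X) - H(X \mid Y)
```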

Learning Relevant Degrees of Freedom?
- For a given model, the RG procedure is often explicitly defined: what are the fast/slow degrees of freedom, what is the coarse-graining procedure, how many steps to take, what is the relevant macro-scale property?
- What if we wanted to learn the relevant degrees of freedom and the coarse-graining automatically, in a data-driven way?

Real Space Mutual Information
Algorithm:
- Estimate the data marginals using RBMs and contrastive divergence training
- Use these estimates to optimize another RBM between the hidden degrees of freedom (a function of some small subset of sites) and the environment (the remaining sites) via gradient descent
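In symbols, the objective (as described on the following slides) is to choose the coarse-graining so that the hidden variables H, computed from a small block of visible sites V, retain as much information as possible about the environment E. The map \Lambda and its notation here are illustrative, not taken from the slides:

```latex
\Lambda^{*} = \arg\max_{\Lambda} \; I_{\Lambda}(H : E),
\qquad H \sim \Lambda(h \mid V), \quad E = \text{the sites outside } V
```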

Restricted Boltzmann Machines
- Hamiltonian (energy function), for visible units v and hidden units h:
  E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j
- Boltzmann distribution:
  p(v, h) = e^{-E(v, h)} / Z, with partition function Z = \sum_{v, h} e^{-E(v, h)}
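Because there are no connections within a layer, the conditionals factorize, which is what makes the block Gibbs sampling on the next slide cheap. For binary 0/1 units these are the standard RBM identities (not taken from the slide):

```latex
p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big), \qquad
p(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big), \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```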

Contrastive Divergence
- Goal: speed up sampling from Restricted Boltzmann Machines
- To sample from P(X) exactly, we would run a Markov chain to convergence using Gibbs sampling
- Gibbs sampling in RBMs: fix the hidden units and sample each visible unit from its conditional given the hiddens; then fix the visible units and sample each hidden unit from its conditional given the visibles; repeat many times
- Contrastive Divergence: initialize the Markov chain at the data (close to the target distribution) rather than randomly; samples of P(X) are gathered after only k steps of Gibbs sampling (often k = 1), not after chain convergence (a code sketch follows below)
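A minimal sketch of CD-k training for a binary RBM (numpy only). The dimensions, learning rate, and variable names are illustrative, and the update is the generic CD-k approximation rather than anything specific to the Real Space Mutual Information procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, a, b, k=1, lr=0.01):
    """One CD-k parameter update for a binary RBM on a batch of visible data v0."""
    # Positive phase: hidden activations driven by the data
    ph0 = sigmoid(v0 @ W + b)
    h = (rng.random(ph0.shape) < ph0).astype(float)
    # k steps of block Gibbs sampling, starting from the data
    for _ in range(k):
        pv = sigmoid(h @ W.T + a)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
    # Approximate gradient: data statistics minus k-step model statistics
    W += lr * (v0.T @ ph0 - v.T @ ph) / len(v0)
    a += lr * (v0 - v).mean(axis=0)
    b += lr * (ph0 - ph).mean(axis=0)
    return W, a, b

# Toy usage: 16 visible units (0/1), 4 hidden units, random "data"
n_v, n_h = 16, 4
W = 0.01 * rng.standard_normal((n_v, n_h))
a, b = np.zeros(n_v), np.zeros(n_h)
data = (rng.random((32, n_v)) < 0.5).astype(float)
for _ in range(100):
    W, a, b = cd_k_update(data, W, a, b, k=1)
```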

Real Space Mutual Information
Intuition behind I(H:E) as a measure of macro-scale information:
- The hidden variables H should capture the information in V that overlaps with E
- If E contained all of the information in V, H would simply try to copy V
- By restricting the support/number of variables in H, we can ensure that V can never be fully copied, which introduces some potential information loss at each renormalization step

Possible Issues
- Each renormalization step is computationally expensive: it relies on many Monte Carlo samples to construct the data distribution, and maximizes I(H:E) at every step for some partition
- Depending on the information present in the subset V, a renormalization step may result in undesirable loss of information about the macro scale

Possible Alternatives (Shameless Plug): Multivariate Mutual Information
- Mutual information: how much information is shared between two random variables?
- Multivariate mutual information: how much information is shared among a group of random variables?
- Information among a group of random variables can be shared in different ways (redundantly, synergistically)!

Redundancy as Macro Scale
- One interpretation of the macro-scale properties of a system is the information that is consistently repeated across the system
- Extracting/preserving redundant information (if present) would have a similar effect to a large number of renormalization coarse-graining steps
- There are measures of multivariate mutual information that are maximized by redundancy (e.g. total correlation)
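For reference, total correlation (also called multi-information) sums the individual entropies and subtracts the joint entropy, so it is largest when the variables all carry the same (redundant) information:

```latex
TC(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n)
 = D_{\mathrm{KL}}\!\left( p(x_1, \dots, x_n) \,\Big\|\, \prod_{i=1}^{n} p(x_i) \right)
```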

Redundancy as Macro Scale
- There are efficient algorithms (O(n) time!) for extracting redundancy in a similar fashion to Real Space Mutual Information: https://github.com/gregversteeg
- Next step: run it and see if it works!

Final Note
Physics helps ML; ML helps Physics?