Overview
1. Brief tensor introduction
2. Stein's lemma
3. Score and score matching for fitting models
4. Bringing it all together for supervised deep learning

Tensor intro [1]
Tensors are multidimensional arrays. The dimensionality of a tensor is referred to as its order: vectors are first-order and matrices are second-order tensors, and tensors of order 3 or higher are called higher-order tensors. Each dimension of a tensor is called a mode: matrices have 2 modes, rows and columns.
[1] Tensor Decompositions and Applications, Kolda, T.G. and Bader, B.W., SIAM Rev., 51(3), 2009.
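As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of order and modes; the array shapes are arbitrary examples.

```python
# Minimal sketch: order = number of indices, each mode is one index direction.
import numpy as np

v = np.zeros(4)           # first-order tensor (vector): 1 mode
M = np.zeros((4, 5))      # second-order tensor (matrix): 2 modes (rows, columns)
X = np.zeros((4, 5, 6))   # third-order tensor: 3 modes

print(v.ndim, M.ndim, X.ndim)   # orders: 1 2 3
print(X.shape)                  # sizes of the three modes: (4, 5, 6)
```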

Fibers and slices

Rank-one tensors
An $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is of rank 1 if its entries can be expressed as
$$x_{i_1, i_2, \ldots, i_N} = a^{(1)}_{i_1} a^{(2)}_{i_2} \cdots a^{(N)}_{i_N}$$
where $a^{(n)} \in \mathbb{R}^{I_n}$ and $i_n \in \{1, \ldots, I_n\}$ for $n \in \{1, \ldots, N\}$. More succinctly,
$$\mathcal{X} = a^{(1)} \circ a^{(2)} \circ \cdots \circ a^{(N)}.$$
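A hedged NumPy sketch of a rank-1 third-order tensor as an outer product of three vectors; the vector names a, b, c and their sizes are illustrative.

```python
# Rank-1 third-order tensor: X[i, j, k] = a[i] * b[j] * c[k]
import numpy as np

a, b, c = np.random.randn(4), np.random.randn(5), np.random.randn(6)
X = np.einsum('i,j,k->ijk', a, b, c)   # outer product a ∘ b ∘ c

# entry-wise check of x_{ijk} = a_i b_j c_k
assert np.allclose(X[1, 2, 3], a[1] * b[2] * c[3])
```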

Canonical Decomposition / Parallel Factors (CP decomposition)
Represent a tensor as a sum of rank-1 tensors. The rank of a tensor $\mathcal{X}$ is the smallest number of rank-1 tensors whose sum generates $\mathcal{X}$.
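A minimal sketch, assuming factor matrices A, B, C whose columns are the rank-1 components, of how a rank-R CP model reconstructs a tensor as a sum of R rank-1 tensors (the decomposition algorithm itself is not shown; all names are illustrative).

```python
# CP model: X = sum_r A[:, r] ∘ B[:, r] ∘ C[:, r]
import numpy as np

I, J, K, R = 4, 5, 6, 3
A, B, C = np.random.randn(I, R), np.random.randn(J, R), np.random.randn(K, R)

X = np.einsum('ir,jr,kr->ijk', A, B, C)   # sum of R rank-1 tensors

# The tensor rank of X is the smallest R for which such a sum exists (here at most 3).
print(X.shape)
```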

Score
In the context of recent deep learning literature, given a density $p(x)$, the score [2] is
$$\nabla_x \log p(x \mid \theta).$$
This function reflects how the log density $\log p(x)$ changes with the data vector $x$.
[2] A very similar and closely related function, $\nabla_\theta \log p(x \mid \theta)$, is also called a score: the Fisher score.
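For a concrete instance (not from the slides), the score has a closed form for an isotropic Gaussian; the helper name gaussian_score and its arguments are illustrative.

```python
# Score of N(mu, sigma^2 I): log p(x) = -||x - mu||^2 / (2 sigma^2) + const,
# so ∇_x log p(x) = -(x - mu) / sigma^2.
import numpy as np

def gaussian_score(x, mu, sigma):
    return -(x - mu) / sigma**2

x = np.array([1.0, -2.0, 0.5])
mu = np.zeros(3)
print(gaussian_score(x, mu, sigma=1.0))   # points back toward the mean, i.e. toward higher density
```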

Stein's lemma
A critical piece of machinery for working with score functions is Stein's lemma:
$$\int_x p(x)\, f(x)\, \nabla_x \log p(x)\, dx = -\int_x p(x)\, \nabla_x f(x)\, dx$$
$$\mathbb{E}\big[\, f(x) \underbrace{\nabla_x \log p(x)}_{\text{score}} \,\big] = -\mathbb{E}\left[\nabla_x f(x)\right]$$
In words: shift the gradient to $f$ and drop the score. The left side is for telling stories about scores, the right side is for algorithms. In practice, you'd plug in the empirical distribution:
$$\frac{1}{T} \sum_{i=1}^{T} f(x_i) \underbrace{\nabla_x \log p(x_i)}_{\text{score}} \approx -\frac{1}{T} \sum_{i=1}^{T} \nabla_x f(x_i)$$
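A small Monte Carlo check of the identity (my own illustration, not the speaker's): for a 1-D standard normal and the test function f(x) = x³, both sides should come out close to −3. The sample size and names are arbitrary.

```python
# Numerical check of Stein's lemma for p(x) = N(0, 1):
#   E[f(x) * d/dx log p(x)]  should equal  -E[f'(x)].
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

score = -x                 # d/dx log p(x) for the standard normal
f, df = x**3, 3 * x**2     # test function f(x) = x^3 and its derivative

lhs = np.mean(f * score)   # E[f(x) * score(x)]
rhs = -np.mean(df)         # -E[f'(x)]
print(lhs, rhs)            # both approximately -3
```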

Learning by score matching [3]
Fitting a pdf of the form
$$p(\xi \mid \theta) = \frac{1}{Z(\theta)} f(\xi \mid \theta)$$
can be intractable if $Z(\cdot)$ is hard to compute. The idea: the true data density and the model density should increase and drop off in the same way. More formally, score matching fits the parameters $\theta$ so that the score of the model, $\nabla_\xi \log p(\xi \mid \theta)$, matches the score of the data density, $\nabla_\xi \log p_x(\xi)$.
[3] Estimation of Non-Normalized Statistical Models by Score Matching, Aapo Hyvärinen, JMLR 2005.

Score matching
The objective is
$$\int_\xi p_x(\xi)\, \Big\|\underbrace{\nabla_\xi \log p_x(\xi)}_{\text{data distribution}} - \underbrace{\nabla_\xi \log p(\xi \mid \theta)}_{\text{model distribution}}\Big\|^2 \, d\xi$$
Use Stein's lemma and plug in the empirical distribution to obtain
$$\frac{1}{T} \sum_{t=1}^{T} \sum_i \Big[\underbrace{\partial^2_{\xi_i} \log f(\xi_t \mid \theta)}_{\text{model}} + \tfrac{1}{2} \big(\underbrace{\partial_{\xi_i} \log f(\xi_t \mid \theta)}_{\text{model}}\big)^2\Big]$$
No partition function $Z(\theta)$ and no true data density $p_x(\xi)$.
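As a worked toy instance (not in the original slides), the empirical objective above can be written in closed form for an unnormalized 1-D Gaussian model; a crude grid search then recovers the sample mean and variance. All names and the data-generating parameters are illustrative.

```python
# Score matching for an unnormalized 1-D Gaussian f(xi | mu, s) = exp(-(xi - mu)^2 / (2 s)),
# with s = sigma^2.  Here d/dxi log f = -(xi - mu)/s and d^2/dxi^2 log f = -1/s, so the
# empirical objective becomes  J(mu, s) = mean_t[ -1/s + 0.5 * ((xi_t - mu)/s)^2 ],
# whose minimizer is the sample mean and sample variance.
import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(loc=2.0, scale=1.5, size=5000)          # "data"

def sm_objective(mu, s, xi):
    return np.mean(-1.0 / s + 0.5 * ((xi - mu) / s) ** 2)

mus = np.linspace(0.0, 4.0, 81)
ss = np.linspace(0.5, 5.0, 91)
J = np.array([[sm_objective(m, s, xi) for s in ss] for m in mus])
i, j = np.unravel_index(J.argmin(), J.shape)
print(mus[i], ss[j])          # close to the sample mean (~2.0) and variance (~2.25)
print(xi.mean(), xi.var())
```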

Score matching and denoising autoencoders [5]
Training a denoising autoencoder can be seen as minimizing the distance between the score of the model
$$p(x \mid \theta) = \frac{1}{Z(\theta)} \exp\big(-E(x \mid \theta)\big),$$
$$E(x \mid W, b, c) = -\frac{c^T x - \tfrac{1}{2}\|x\|^2 + \sum_j \log\{1 + \exp(W_j x + b_j)\}}{\sigma^2},$$
with $\theta = (W, b, c)$, and the score of a mixture of Gaussians, each centered on a data point [4].
[4] Parzen window density estimator.
[5] A Connection Between Score Matching and Denoising Autoencoders, P. Vincent, Neural Computation, 2011.

Higher-order scores [6]
We can obtain higher-order scores:
$$S_m(x) := (-1)^m \frac{\nabla_x^{(m)} p(x)}{p(x)}$$
Recursive formula:
$$S_m(x) = -S_{m-1}(x) \otimes \nabla_x \log p(x) - \nabla_x S_{m-1}(x), \qquad S_0(x) = 1$$
Note that $S_1(x) = -\nabla_x \log p(x)$.
[6] Score Function Features for Discriminative Learning, M. Janzamin, H. Sedghi, A. Anandkumar, 2015.
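For a standard normal input density the recursion yields closed-form score tensors; the sketch below (my addition, with illustrative helper names S1, S2, S3) implements $S_1(x) = x$, $S_2(x) = xx^T - I$, and the corresponding third-order formula.

```python
# Closed-form score tensors for x ~ N(0, I), following the recursion with S_1(x) = x:
#   S_2(x) = x x^T - I
#   S_3(x)_{ijk} = x_i x_j x_k - x_i d_{jk} - x_j d_{ik} - x_k d_{ij}   (d = Kronecker delta)
import numpy as np

def S1(x):
    return x

def S2(x):
    return np.outer(x, x) - np.eye(x.shape[0])

def S3(x):
    I = np.eye(x.shape[0])
    return (np.einsum('i,j,k->ijk', x, x, x)
            - np.einsum('i,jk->ijk', x, I)
            - np.einsum('j,ik->ijk', x, I)
            - np.einsum('k,ij->ijk', x, I))

x = np.array([0.5, -1.0, 2.0])
print(S1(x), S2(x).shape, S3(x).shape)
```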

Supervised learning [7, 8]
So far we only looked at fitting densities $p(x \mid \theta)$, i.e., unsupervised learning. What about the discriminative setting?
[7] Score Function Features for Discriminative Learning: Matrix and Tensor Frameworks, M. Janzamin, H. Sedghi, A. Anandkumar, 2015.
[8] Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods, M. Janzamin, H. Sedghi, A. Anandkumar, 2016.

Sigmoidal net
$$f(x) := \mathbb{E}[\tilde{y} \mid x] = \langle a_2,\, \sigma(A_1^T x + b_1) \rangle + b_2$$
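A minimal forward-pass sketch of this one-hidden-layer net, assuming random illustrative parameters A1, b1, a2, b2 and the logistic sigmoid:

```python
# f(x) = <a2, sigma(A1^T x + b1)> + b2, with d inputs and k hidden units
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, k = 10, 3
rng = np.random.default_rng(2)
A1 = rng.standard_normal((d, k))    # columns (A1)_j are the weight vectors to be recovered
b1 = rng.standard_normal(k)
a2 = rng.standard_normal(k)
b2 = rng.standard_normal()

def f(x):
    return a2 @ sigmoid(A1.T @ x + b1) + b2

print(f(rng.standard_normal(d)))
```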

Learning the weights ($A_1$) of the network
A series of papers by Sedghi and Anandkumar culminates in the following observation (heavily paraphrased): using Stein's lemma, the expectation of the product between the label $y$ and the third-order score is a rank-$k$ tensor built from the weight vectors $(A_1)_j$:
$$\mathbb{E}[y\, S_3(x)] = \mathbb{E}\big[\nabla_x^{(3)} f(x)\big] = \sum_j \lambda_j\, (A_1)_j \otimes (A_1)_j \otimes (A_1)_j$$
Practical version:
$$\frac{1}{T} \sum_{t=1}^{T} y_t\, S_3(x_t) \approx \sum_{j=1}^{k} \lambda_{3,j}\, (\tilde{A}_1)_j \otimes (\tilde{A}_1)_j \otimes (\tilde{A}_1)_j$$
Algorithm: compute the left-hand side of the equation and perform a tensor decomposition up to rank $k$. [9]
[9] Using $\mathbb{E}[y\, S_2(x)] = \mathbb{E}\big[\nabla_x^{(2)} f(x)\big] = \sum_j \lambda_j (A_1)_j \otimes (A_1)_j$ does not lead to a unique solution.
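A hedged sketch of the "practical version" under the extra assumption x ~ N(0, I), for which $S_3$ has the closed form given on the previous slide; labels come from a synthetic sigmoidal net, and the tensor-decomposition step itself is omitted. All names (A1, M3, etc.) are illustrative, not from the papers.

```python
# Empirical moment tensor (1/T) sum_t y_t S_3(x_t) for Gaussian inputs, where
# S_3(x)_{ijk} = x_i x_j x_k - x_i d_{jk} - x_j d_{ik} - x_k d_{ij}.
import numpy as np

rng = np.random.default_rng(3)
d, k, T = 8, 3, 100_000
A1 = rng.standard_normal((d, k))
b1, a2, b2 = rng.standard_normal(k), rng.standard_normal(k), rng.standard_normal()

X = rng.standard_normal((T, d))                              # Gaussian inputs
Y = (1 / (1 + np.exp(-(X @ A1 + b1)))) @ a2 + b2             # y_t = f(x_t), noiseless

I = np.eye(d)
XY = X * Y[:, None]
xxx = np.einsum('ti,tj,tk->ijk', XY, X, X) / T               # (1/T) sum_t y_t x ⊗ x ⊗ x
corr = np.einsum('ti,jk->ijk', XY, I) / T                    # (1/T) sum_t y_t x_i d_{jk}
M3 = xxx - corr - corr.transpose(1, 0, 2) - corr.transpose(2, 1, 0)

# M3 should be close to sum_j lambda_j (A1)_j ⊗ (A1)_j ⊗ (A1)_j, i.e. approximately rank k;
# a rank-k CP decomposition of M3 would then recover the columns of A1 up to scaling.
print(M3.shape)
```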

Training algorithm

Crucial requirements and payoff
Crucial requirements:
- the data must have come from a sigmoidal net of the same structure (realizable setting)
- the usual story: you must get the score of $p(x)$, $S_1$; use a denoising autoencoder
In return you get a guarantee on the mean square error between the output of the true net and the learned net.

What I did not tell you
You have to learn $b_1$, $A_2$, $b_2$:
- $b_1$: uses a Fourier transform
- $A_2$, $b_2$: uses plain regression

Practical concerns
1. $S_3$ is large, (# features)³; need to use an approximation (sketch the tensor)
2. getting the score $S_1$ may be fragile (local minima in training the DAE)
3. the result is for shallow nets
4. the non-realizable setting needs additional assumptions on the Fourier decomposition of the target function
Still, it might be worth implementing. [10]
[10] Read the arXiv versions of these papers; the ICLR ones are short on details. http://arxiv.org/pdf/1506.08473.pdf http://arxiv.org/pdf/1412.2863v2.pdf