Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning
Anton Rodomanov, Higher School of Economics, Russia
Bayesian methods research group (http://bayesgroup.ru)
14 March 2016, HSE Seminar on Applied Linear Algebra, Moscow, Russia

Outline
1. Tensor Train Format
2. ML Application 1: Markov Random Fields
3. ML Application 2: TensorNet

What is a tensor? A tensor is a multidimensional array: A = [A(i_1, ..., i_d)], i_k \in {1, ..., n_k}. Terminology: dimensionality = d (the number of indices); size = n_1 x ... x n_d (the number of nodes along each axis). The case d = 1 is a vector, d = 2 a matrix.
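As a small illustration (not from the slides; a NumPy sketch with made-up mode sizes), the dimensionality and size of a d = 3 tensor:

import numpy as np

n1, n2, n3 = 3, 4, 5                # mode sizes n_1, n_2, n_3 (assumed for the example)
A = np.random.rand(n1, n2, n3)      # a d = 3 tensor A(i_1, i_2, i_3)
print(A.ndim)                       # dimensionality d = 3
print(A.shape)                      # size (3, 4, 5)
print(A[0, 1, 2])                   # the element A(1, 2, 3) in the slide's 1-based notation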

Curse of dimensionality. The number of elements is n^d (exponential in d). When n = 2 and d = 100: 2^100 > 10^30 (about 10^18 PB of memory). One cannot work with such tensors using standard dense methods.

Tensor rank decomposition [Hitchcock, 1927]. Recall the rank decomposition for matrices:
A(i_1, i_2) = \sum_{\alpha=1}^{r} U(i_1, \alpha) V(i_2, \alpha).
This can be generalized to tensors. Tensor rank decomposition (canonical decomposition):
A(i_1, ..., i_d) = \sum_{\alpha=1}^{R} U_1(i_1, \alpha) ... U_d(i_d, \alpha).
The minimal possible R is called the (canonical) rank of the tensor A.
(+) No curse of dimensionality.
(-) Ill-posed problem [de Silva, Lim, 2008].
(-) The rank R must be known in advance for many methods.
(-) Computing R is NP-hard [Hillar, Lim, 2013].

Unfolding matrices: definition. Every tensor A has d - 1 unfolding matrices:
A_k := [A(i_1 ... i_k; i_{k+1} ... i_d)], where A(i_1 ... i_k; i_{k+1} ... i_d) := A(i_1, ..., i_d).
Here i_1 ... i_k and i_{k+1} ... i_d are row and column (multi)indices; A_k is a matrix of size M_k x N_k with M_k = \prod_{s=1}^{k} n_s and N_k = \prod_{s=k+1}^{d} n_s. This is just a reshape.

Unfolding matrices: example. Consider A = [A(i, j, k)] given by its elements:
A(1, 1, 1) = 111, A(2, 1, 1) = 211, A(1, 2, 1) = 121, A(2, 2, 1) = 221,
A(1, 1, 2) = 112, A(2, 1, 2) = 212, A(1, 2, 2) = 122, A(2, 2, 2) = 222.
Then
A_1 = [A(i; jk)] = [ 111 121 112 122 ; 211 221 212 222 ],
A_2 = [A(ij; k)] = [ 111 112 ; 211 212 ; 121 122 ; 221 222 ].
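A quick NumPy check of this example (a sketch, not from the slides): the slide's multi-index convention has the first index varying fastest, which corresponds to Fortran-order reshapes.

import numpy as np

A = np.zeros((2, 2, 2))
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            A[i - 1, j - 1, k - 1] = 100 * i + 10 * j + k   # A(i, j, k) = "ijk"

A1 = A.reshape(2, 4, order='F')   # rows: i, columns: the multi-index (j, k)
A2 = A.reshape(4, 2, order='F')   # rows: the multi-index (i, j), columns: k
print(A1)   # [[111. 121. 112. 122.], [211. 221. 212. 222.]]
print(A2)   # [[111. 112.], [211. 212.], [121. 122.], [221. 222.]]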

Tensor Train decomposition: motivation. Main idea: variable splitting. Consider a rank decomposition of an unfolding matrix:
A(i_1 i_2; i_3 i_4 i_5 i_6) = \sum_{\alpha_2} U(i_1 i_2; \alpha_2) V(i_3 i_4 i_5 i_6; \alpha_2).
On the left: a 6-dimensional tensor; on the right: a 3-dimensional and a 5-dimensional one. The dimensionality has been reduced; proceed recursively.

Tensor Train decomposition [Oseledets, 2011]. TT-format for a tensor A:
A(i_1, ..., i_d) = \sum_{\alpha_0, ..., \alpha_d} G_1(\alpha_0, i_1, \alpha_1) G_2(\alpha_1, i_2, \alpha_2) ... G_d(\alpha_{d-1}, i_d, \alpha_d).
This can be written compactly as a matrix product:
A(i_1, ..., i_d) = G_1[i_1] G_2[i_2] ... G_d[i_d],
where the factors have sizes 1 x r_1, r_1 x r_2, ..., r_{d-1} x 1.
Terminology: the G_k are TT-cores (collections of matrices), the r_k are TT-ranks, and r = max r_k is the maximal TT-rank.
The TT-format uses O(d n r^2) memory to store O(n^d) elements. It is efficient only if the ranks are small.
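As a concrete illustration (a minimal NumPy sketch, not the ttpy API), one element of a TT-tensor can be evaluated by multiplying the corresponding core slices; here each core is assumed to be stored as an array of shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1.

import numpy as np

def tt_element(cores, index):
    """Return A(i_1, ..., i_d) = G_1[i_1] G_2[i_2] ... G_d[i_d] for 0-based indices."""
    result = np.ones((1, 1))
    for G, i in zip(cores, index):
        result = result @ G[:, i, :]   # multiply by the r_{k-1} x r_k slice G_k[i_k]
    return result[0, 0]

The loop performs d small matrix products, so one element costs O(d r^2) operations instead of touching all n^d entries.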

TT-format: example. Consider the tensor
A(i_1, i_2, i_3) := i_1 + i_2 + i_3, with i_1 \in {1, 2, 3}, i_2 \in {1, 2, 3, 4}, i_3 \in {1, 2, 3, 4, 5}.
Its TT-format is
A(i_1, i_2, i_3) = G_1[i_1] G_2[i_2] G_3[i_3],
where
G_1[i_1] := [ i_1  1 ],  G_2[i_2] := [ 1 0 ; i_2 1 ],  G_3[i_3] := [ 1 ; i_3 ].
Check:
A(i_1, i_2, i_3) = [ i_1  1 ] [ 1 0 ; i_2 1 ] [ 1 ; i_3 ] = [ i_1 + i_2  1 ] [ 1 ; i_3 ] = i_1 + i_2 + i_3.

TT-format: example (continued). For the same tensor, the TT-cores written out as collections of matrices are
G_1 = ([ 1 1 ], [ 2 1 ], [ 3 1 ]),
G_2 = ([ 1 0 ; 1 1 ], [ 1 0 ; 2 1 ], [ 1 0 ; 3 1 ], [ 1 0 ; 4 1 ]),
G_3 = ([ 1 ; 1 ], [ 1 ; 2 ], [ 1 ; 3 ], [ 1 ; 4 ], [ 1 ; 5 ]).
The tensor has 3 * 4 * 5 = 60 elements; the TT-format uses 32 numbers to describe it.
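A quick numerical check of the analytic cores above (a sketch, not from the slides): the product G_1[i_1] G_2[i_2] G_3[i_3] indeed equals i_1 + i_2 + i_3 for every index.

import numpy as np

for i1 in range(1, 4):
    for i2 in range(1, 5):
        for i3 in range(1, 6):
            G1 = np.array([[i1, 1]])            # 1 x 2
            G2 = np.array([[1, 0], [i2, 1]])    # 2 x 2
            G3 = np.array([[1], [i3]])          # 2 x 1
            assert (G1 @ G2 @ G3)[0, 0] == i1 + i2 + i3
# Parameter count: 3*2 + 4*4 + 5*2 = 32 numbers versus 60 entries of the full tensor.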

Finding a TT-representation of a tensor. General ways of building a TT-representation of a given tensor:
- Analytical formulas for the TT-cores.
- TT-SVD algorithm [Oseledets, 2011]: an exact, quasi-optimal method, but suitable only for small tensors (those that fit into memory); a sketch is given below.
- Interpolation algorithms: AMEn-cross [Dolgov & Savostyanov, 2013], DMRG [Khoromskij & Oseledets, 2010], TT-cross [Oseledets, 2010]. These are approximate, heuristically-based methods that can be applied to large tensors; they have no strong guarantees but work well in practice.
- Operations on tensors that are already in the TT-format: addition, element-wise product, etc.
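A compact sketch of TT-SVD (NumPy only, with a simple singular-value threshold eps; ttpy provides the reference implementation). It uses row-major multi-indices for simplicity, which differs from the Fortran-order convention used in the unfolding example above but is internally consistent.

import numpy as np

def tt_svd(A, eps=1e-10):
    """Decompose a full tensor A into TT-cores of shape (r_{k-1}, n_k, r_k)."""
    shape = A.shape
    d = len(shape)
    cores = []
    C = A.reshape(shape[0], -1)                  # unfold: first mode vs the rest
    r_prev = 1
    for k in range(d - 1):
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(S > eps * S[0])))  # drop relatively small singular values
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        C = (S[:r, None] * Vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, shape[d - 1], 1))
    return cores

Applied to the 3 x 4 x 5 example above, this recovers TT-ranks (2, 2).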

TT-interpolation: problem formulation. Input: a procedure A(i_1, ..., i_d) for computing an arbitrary element of A. Output: a TT-representation of an approximation B of A:
B(i_1, ..., i_d) = G_1[i_1] ... G_d[i_d].

Matrix interpolation: matrix-cross. Let A \in R^{m x n} have rank r. It admits a skeleton decomposition [Goreinov et al., 1997]:
A = C Â^{-1} R,
where Â = A(I, J) is an r x r non-singular submatrix, C = A(:, J) is the m x r matrix of columns containing Â, and R = A(I, :) is the r x n matrix of rows containing Â.
Q: Which Â should one choose? A: If rank A = r, any non-singular submatrix gives an exact decomposition.
Q: What if A is only approximately of rank r? Different Â give different errors. A: Choose a submatrix of maximal volume [Goreinov, Tyrtyshnikov, 2001]; it can be found with the maxvol algorithm [Goreinov et al., 2008].
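A small sketch of the skeleton decomposition for an exactly low-rank matrix, assuming the index sets I, J are already chosen (the maxvol algorithm itself is not shown):

import numpy as np

def skeleton(A, I, J):
    """Reconstruct A from the columns A[:, J], the rows A[I, :] and the core A[I, J]."""
    C = A[:, J]                        # m x r columns
    R = A[I, :]                        # r x n rows
    A_hat = A[np.ix_(I, J)]            # r x r intersection (must be non-singular)
    return C @ np.linalg.solve(A_hat, R)

# Example: a random rank-2 matrix is recovered exactly from 2 rows and 2 columns.
m, n, r = 6, 5, 2
A = np.random.rand(m, r) @ np.random.rand(r, n)
I, J = [0, 1], [0, 1]                  # almost surely gives a non-singular A_hat here
print(np.allclose(skeleton(A, I, J), A))   # True (up to round-off)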

TT interpolation with TT-cross: example. Hilbert tensor:
A(i_1, i_2, ..., i_d) := 1 / (i_1 + i_2 + ... + i_d).
TT-rank | Time | Iterations | Relative accuracy
2  |  1.37 | 5 | 1.897278e+00
3  |  4.22 | 7 | 5.949094e-02
4  |  7.19 | 7 | 2.226874e-02
5  | 15.42 | 9 | 2.706828e-03
6  | 21.82 | 9 | 1.782433e-04
7  | 29.62 | 9 | 2.151107e-05
8  | 38.12 | 9 | 4.650634e-06
9  | 48.97 | 9 | 5.233465e-07
10 | 59.14 | 9 | 6.552869e-08
11 | 72.14 | 9 | 7.915633e-09
12 | 75.27 | 8 | 2.814507e-09
[Oseledets & Tyrtyshnikov, 2009]

Operations: addition and multiplication by a number. Let C = A + B, i.e. C(i_1, ..., i_d) = A(i_1, ..., i_d) + B(i_1, ..., i_d). The TT-cores of C are as follows:
C_k[i_k] = [ A_k[i_k] 0 ; 0 B_k[i_k] ], k = 2, ..., d - 1,
C_1[i_1] = [ A_1[i_1]  B_1[i_1] ],  C_d[i_d] = [ A_d[i_d] ; B_d[i_d] ].
The ranks are doubled.
Multiplication by a number, C = A * const: multiply only one core by const. The ranks do not increase.
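A sketch of TT addition by block-wise stacking of cores (NumPy, cores of shape (r_{k-1}, n_k, r_k), as in the earlier sketches):

import numpy as np

def tt_add(a_cores, b_cores):
    d = len(a_cores)
    c_cores = []
    for k, (G, H) in enumerate(zip(a_cores, b_cores)):
        if k == 0:
            # first core: horizontal concatenation [ A_1[i_1]  B_1[i_1] ]
            c_cores.append(np.concatenate([G, H], axis=2))
        elif k == d - 1:
            # last core: vertical concatenation [ A_d[i_d] ; B_d[i_d] ]
            c_cores.append(np.concatenate([G, H], axis=0))
        else:
            # middle cores: block-diagonal [ A_k[i_k] 0 ; 0 B_k[i_k] ]
            ra1, n, ra2 = G.shape
            rb1, _, rb2 = H.shape
            block = np.zeros((ra1 + rb1, n, ra2 + rb2))
            block[:ra1, :, :ra2] = G
            block[ra1:, :, ra2:] = H
            c_cores.append(block)
    return c_cores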

Operations: element-wise product. Let C = A o B, i.e. C(i_1, ..., i_d) = A(i_1, ..., i_d) * B(i_1, ..., i_d). The TT-cores of C can be computed as
C_k[i_k] = A_k[i_k] \otimes B_k[i_k],
where \otimes denotes the Kronecker product. Consequently, rank(C) = rank(A) * rank(B).
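A sketch of the element-wise (Hadamard) product in the TT-format (NumPy; same core convention as above): each new core slice is the Kronecker product of the corresponding slices, so the ranks multiply.

import numpy as np

def tt_hadamard(a_cores, b_cores):
    c_cores = []
    for G, H in zip(a_cores, b_cores):
        ra1, n, ra2 = G.shape
        rb1, _, rb2 = H.shape
        C = np.zeros((ra1 * rb1, n, ra2 * rb2))
        for i in range(n):
            C[:, i, :] = np.kron(G[:, i, :], H[:, i, :])   # C_k[i_k] = A_k[i_k] (x) B_k[i_k]
        c_cores.append(C)
    return c_cores

The correctness follows from the mixed-product property (A (x) B)(C (x) D) = (AC) (x) (BD), which turns the product of Kronecker slices into the product of the two original matrix chains.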

TT-format: efficient operations.
Operation | Output rank | Complexity
A * const | r_A         | O(d r_A)
A + const | r_A + 1     | O(d n r_A^2)
A + B     | r_A + r_B   | O(d n (r_A + r_B)^2)
A o B     | r_A r_B     | O(d n r_A^2 r_B^2)
sum(A)    | --          | O(d n r_A^2)
...

TT-rounding. The TT-rounding procedure Ã = round(A, ε), where ε is the prescribed accuracy:
1. maximally reduces the TT-ranks while ensuring that ||A - Ã||_F <= ε ||A||_F;
2. uses the SVD for compression (like TT-SVD).
Rounding allows one to trade off accuracy against the maximal rank of the TT-representation. Example: round(A + A, ε_mach) = 2A, with the original ranks of A, within machine accuracy ε_mach.
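A condensed sketch of TT-rounding following the structure of the algorithm in [Oseledets, 2011]: a right-to-left orthogonalization sweep followed by a left-to-right truncated-SVD sweep (NumPy only; cores shaped (r_{k-1}, n_k, r_k); ttpy's round() is the reference implementation).

import numpy as np

def tt_round(cores, eps=1e-10):
    """Reduce TT-ranks so that the relative Frobenius error stays within eps."""
    cores = [G.copy() for G in cores]
    d = len(cores)
    # 1) Right-to-left sweep: make cores 2..d right-orthogonal (LQ via QR of the transpose).
    for k in range(d - 1, 0, -1):
        r1, n, r2 = cores[k].shape
        Qt, Rt = np.linalg.qr(cores[k].reshape(r1, n * r2).T)
        r_new = Qt.shape[1]
        cores[k] = Qt.T.reshape(r_new, n, r2)
        cores[k - 1] = np.einsum('anb,bc->anc', cores[k - 1], Rt.T)  # absorb the L factor
    # 2) Left-to-right sweep: truncated SVD of each core, carrying the remainder forward.
    delta = eps * np.linalg.norm(cores[0]) / max(1.0, np.sqrt(d - 1))
    for k in range(d - 1):
        r1, n, r2 = cores[k].shape
        U, S, Vt = np.linalg.svd(cores[k].reshape(r1 * n, r2), full_matrices=False)
        discarded = np.sqrt(np.cumsum(S[::-1] ** 2))[::-1]  # discarded[j] = ||S[j:]||_2
        r = max(1, int(np.sum(discarded > delta)))          # smallest rank within the budget
        cores[k] = U[:, :r].reshape(r1, n, r)
        cores[k + 1] = np.einsum('ab,bnc->anc', S[:r, None] * Vt[:r], cores[k + 1])
    return cores

For example, applying tt_round to the output of tt_add(cores, cores) from the sketches above brings the doubled ranks back down to the original ones.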

Example: multivariate integration. Compute
I(d) := \int_{[0,1]^d} sin(x_1 + x_2 + ... + x_d) dx_1 dx_2 ... dx_d = Im( ((e^i - 1) / i)^d ).
Use Chebyshev quadrature with n = 11 nodes + TT-cross with r = 2.
d     | I(d)          | Relative accuracy | Time
10    | -6.299353e-01 | 1.409952e-15      | 0.14
100   | -3.926795e-03 | 2.915654e-13      | 0.77
500   | -7.287664e-10 | 2.370536e-12      | 4.64
1 000 | -2.637513e-19 | 3.482065e-11      | 11.70
2 000 |  2.628834e-37 | 8.905594e-12      | 33.05
4 000 |  9.400335e-74 | 2.284085e-10      | 105.49
[Oseledets & Tyrtyshnikov, 2009]

Outline
1. Tensor Train Format
2. ML Application 1: Markov Random Fields
3. ML Application 2: TensorNet

Motivational example: image segmentation. Task: assign a label y_i to each pixel of an M x N image. Let P(y) be the joint probability of a labelling y. Two extreme cases:
- No independence assumptions: O(K^{MN}) parameters (K = total number of labels); represents every distribution; intractable in general.
- Everything is independent: P(y) = p_1(y_1) ... p_{MN}(y_{MN}); O(MNK) parameters; represents only a small class of distributions; tractable.

Graphical models. Graphical models provide a convenient way to define probabilistic models using graphs. There are two types: directed graphical models and Markov random fields. We will consider only (discrete) Markov random fields. The edges represent dependencies between the variables. E.g., for image segmentation one uses a grid graph over the pixel labels y_i: a variable y_i is independent of the rest given its immediate neighbours.

Markov random fields. The model:
P(y) = (1/Z) \prod_{c \in C} Ψ_c(y_c),
where Z is the normalization constant, C is the set of all (maximal) cliques in the graph, and the Ψ_c are non-negative functions called factors.
Example (a cycle on y_1, y_2, y_3, y_4):
P(y_1, y_2, y_3, y_4) = (1/Z) Ψ_1(y_1) Ψ_2(y_2) Ψ_3(y_3) Ψ_4(y_4) Ψ_12(y_1, y_2) Ψ_24(y_2, y_4) Ψ_34(y_3, y_4) Ψ_13(y_1, y_3).
The pairwise factors Ψ_ij measure compatibility between the variables y_i and y_j.

Main problems of interest. Probabilistic model:
P(y) = (1/Z) \prod_{c \in C} Ψ_c(y_c) = (1/Z) exp(-E(y)),
where E is the energy function:
E(y) = \sum_{c \in C} Θ_c(y_c), Θ_c(y_c) = -ln Ψ_c(y_c).
Problems:
- Maximum a posteriori (MAP) inference: y* = argmax_y P(y) = argmin_y E(y).
- Estimation of the normalization constant: Z = \sum_y \prod_{c \in C} Ψ_c(y_c).
- Estimation of the marginal distributions: P(y_i) = \sum_{y \ y_i} P(y).
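A toy brute-force sketch of the three problems (made-up pairwise factors on a 4-variable binary chain, not from the talk). Its cost is exponential in the number of variables, which is exactly what the TT approach addresses.

import itertools
import numpy as np

np.random.seed(0)
n_vars, K = 4, 2
pairwise = [np.random.rand(K, K) + 0.1 for _ in range(n_vars - 1)]   # factors Psi_{i,i+1}

def p_tilde(y):
    """Unnormalized probability: product of the pairwise factors along the chain."""
    return np.prod([pairwise[i][y[i], y[i + 1]] for i in range(n_vars - 1)])

states = list(itertools.product(range(K), repeat=n_vars))   # all K^n configurations
Z = sum(p_tilde(y) for y in states)                          # normalization constant
y_map = max(states, key=p_tilde)                             # MAP configuration
marg_y1 = [sum(p_tilde(y) for y in states if y[0] == v) / Z for v in range(K)]
print(Z, y_map, marg_y1)                                     # Z, MAP, marginal of y_1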

Tensorial perspective. The energy and the unnormalized probability are tensors:
E(y_1, ..., y_n) = \sum_{c=1}^{m} Θ_c(y_c), P(y_1, ..., y_n) = \prod_{c=1}^{m} Ψ_c(y_c),
where y_i \in {1, ..., d}. In this language:
- MAP inference = finding the minimal element of E;
- the normalization constant = the sum of all elements of P.
Details: Putting MRFs on a Tensor Train [Novikov et al., ICML 2014].

Outline
1. Tensor Train Format
2. ML Application 1: Markov Random Fields
3. ML Application 2: TensorNet

Classification problem

Neural networks. Use a composite function:
g(x) = f( W_2 f(W_1 x) ), where h = f(W_1 x) \in R^m, h_k = f((W_1 x)_k),
with the sigmoid nonlinearity
f(x) = 1 / (1 + exp(-x)).
[Figure: a fully-connected network with inputs x_1, ..., x_n, hidden units h_1, ..., h_m, and output g(x).]
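A minimal NumPy sketch of this two-layer network; the layer sizes below are made up for illustration.

import numpy as np

def f(x):                            # the sigmoid nonlinearity from the slide
    return 1.0 / (1.0 + np.exp(-x))

n, m, n_out = 8, 16, 3               # input, hidden and output sizes (assumed)
W1 = np.random.randn(m, n)
W2 = np.random.randn(n_out, m)

x = np.random.randn(n)
h = f(W1 @ x)                        # hidden layer, h_k = f((W_1 x)_k)
g = f(W2 @ h)                        # g(x) = f(W_2 f(W_1 x))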

TensorNet: motivation Goal: compress fully-connected layers. Why care about memory? State-of-the-art deep neural networks don t fit into the memory of mobile devices. Up to 95% of the parameters are in the fully-connected layers. A shallow network with a huge fully-connected layer can achieve almost the same accuracy as an ensemble of deep CNNs [Ba and Caruana, 2014].

Tensor Train layer. Let x be the input and y the output of a fully-connected layer: y = Wx. The matrix W is represented in the TT-format:
y(i_1, ..., i_d) = \sum_{j_1, ..., j_d} G_1[i_1, j_1] ... G_d[i_d, j_d] x(j_1, ..., j_d).
The parameters of the layer are the TT-cores {G_k}_{k=1}^{d}.
Details: Tensorizing Neural Networks [Novikov et al., NIPS 2015].
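A sketch of the TT matrix-by-vector product behind such a layer (not the TensorNet code). It assumes the TT-matrix cores are stored as arrays of shape (r_{k-1}, n_k, m_k, r_k), so that W((i_1,...,i_d), (j_1,...,j_d)) = G_1[i_1, j_1] ... G_d[i_d, j_d], and that the input x is already reshaped into a tensor of shape (m_1, ..., m_d).

import numpy as np

def tt_matvec(cores, x):
    """Apply a TT-matrix to a tensorized input x of shape (m_1, ..., m_d)."""
    T = x[np.newaxis, ...]                              # shape (1, m_1, ..., m_d)
    for G in cores:                                     # G_k has shape (r_{k-1}, n_k, m_k, r_k)
        # Contract the current rank index and the input mode j_k with the core,
        # producing the output mode i_k and the next rank index.
        T = np.tensordot(G, T, axes=([0, 2], [0, 1]))   # -> (n_k, r_k, remaining j's, done i's)
        T = np.moveaxis(T, 0, -1)                       # push the new output index i_k to the back
    return T.reshape([c.shape[1] for c in cores])       # y of shape (n_1, ..., n_d)

The point of the layer is that y = Wx is computed from the d small cores without ever forming the full (n_1...n_d) x (m_1...m_d) weight matrix W.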

Conclusions. The TT-format and the corresponding algorithms are a good way to work with huge tensors. Memory and complexity depend on d linearly, so there is no curse of dimensionality. The TT-format is efficient only if the TT-ranks are small; this is the case in many applications. Code is available:
Python: https://github.com/oseledets/ttpy
MATLAB: https://github.com/oseledets/tt-toolbox
Thanks for attention!