The simple essence of automatic differentiation

Similar documents
The simple essence of automatic differentiation

The simple essence of automatic differentiation

The essence and origins of FRP

Transformation Matrices; Geometric and Otherwise As examples, consider the transformation matrices of the C 3v

NOTES WEEK 15 DAY 1 SCOT ADAMS

1. Examples. We did most of the following in class in passing. Now compile all that data.

String diagrams. a functorial semantics of proofs and programs. Paul-André Melliès. CNRS, Université Paris Denis Diderot.

8-1: Backpropagation Prof. J.C. Kao, UCLA. Backpropagation. Chain rule for the derivatives Backpropagation graphs Examples

Chapter 2: Vector Geometry

L11: Algebraic Path Problems with applications to Internet Routing Lecture 15. Path Weight with functions on arcs?

Categorical Databases

Beautiful Differentiation

Linear Algebra Basics

NOTES WEEK 10 DAY 2. Unassigned HW: Let V and W be finite dimensional vector spaces and let x P V. Show, for all f, g : V W, that

Teooriaseminar. TTÜ Küberneetika Instituut. May 10, Categorical Models. for Two Intuitionistic Modal Logics. Wolfgang Jeltsch.

L E C T U R E 2 1 : M AT R I X T R A N S F O R M AT I O N S. Monday, November 14

4 Relativistic kinematics

Very quick introduction to the conformal group and cft

C*-algebras - a case study

MTH 505: Number Theory Spring 2017

1 Differentiable manifolds and smooth maps

NOTES WEEK 04 DAY 1 SCOT ADAMS

DS-GA 3001: PROBLEM SESSION 2 SOLUTIONS VLADIMIR KOBZAR

RELATIVE THEORY IN SUBCATEGORIES. Introduction

arxiv: v1 [math.ra] 24 Jun 2013

COMPLEX NUMBERS LAURENT DEMONET

t-deformations of Grothendieck rings as quantum cluster algebras

PRODUCT MEASURES AND FUBINI S THEOREM

Category Theory. Categories. Definition.

Correlation to the Common Core State Standards

NOTES WEEK 01 DAY 1 SCOT ADAMS

Exact solutions of (0,2) Landau-Ginzburg models

NOTES WEEK 02 DAY 1. THEOREM 0.3. Let A, B and C be sets. Then

K. P. R. Rao, K. R. K. Rao UNIQUE COMMON FIXED POINT THEOREMS FOR PAIRS OF HYBRID MAPS UNDER A NEW CONDITION IN PARTIAL METRIC SPACES

Math (P)Review Part II:

Exam 1 Solutions. Solution: The 16 contributes 5 to the total and contributes 2. All totaled, there are 5 ˆ 2 10 abelian groups.

Entropy and Ergodic Theory Notes 22: The Kolmogorov Sinai entropy of a measure-preserving system

MTH 505: Number Theory Spring 2017

Review++ of linear algebra continues

b. Induction (LfP 50 3)

Approximation Algorithms using Allegories and Coq

Rank-one LMIs and Lyapunov's Inequality. Gjerrit Meinsma 4. Abstract. We describe a new proof of the well-known Lyapunov's matrix inequality about

Complex Numbers Year 12 QLD Maths C

MATH 260 Class notes/questions January 10, 2013

MATH 360 Final Exam Thursday, December 14, a n 2. a n 1 1

1.1.1 Algebraic Operations

3.3.1 Linear functions yet again and dot product In 2D, a homogenous linear scalar function takes the general form:

MATH 101B: ALGEBRA II PART A: HOMOLOGICAL ALGEBRA 23

3. Signatures Problem 27. Show that if K` and K differ by a crossing change, then σpk`q

ADVANCE TOPICS IN ANALYSIS - REAL. 8 September September 2011

PRINCIPLES OF ANALYSIS - LECTURE NOTES

Review of Coordinate Systems

A categorical model for a quantum circuit description language

Structures in tangent categories

NOTES WEEK 13 DAY 2 SCOT ADAMS

Relational Algebra by Way of Adjunctions. Jeremy Gibbons (joint work with Fritz Henglein, Ralf Hinze, Nicolas Wu) DBPL, October 2015

NOTES WEEK 14 DAY 2 SCOT ADAMS

Mariusz Jurkiewicz, Bogdan Przeradzki EXISTENCE OF SOLUTIONS FOR HIGHER ORDER BVP WITH PARAMETERS VIA CRITICAL POINT THEORY

J. D. H. Smith ONE-SIDED QUANTUM QUASIGROUPS AND LOOPS

From Lay, 5.4. If we always treat a matrix as defining a linear transformation, what role does diagonalisation play?

Ohio s Learning Standards-Extended. Mathematics. The Real Number System Complexity a Complexity b Complexity c

APPROXIMATE HOMOMORPHISMS BETWEEN THE BOOLEAN CUBE AND GROUPS OF PRIME ORDER

Approximation in the Zygmund Class

Vectors. January 13, 2013

Ground States are generically a periodic orbit

Flag Varieties. Matthew Goroff November 2, 2016

Linear Algebra and Robot Modeling

Singular integral operators and the Riesz transform

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 13 Mar 1, 2018

A primer on matrices

A Full RNS Implementation of Fan and Vercauteren Somewhat Homomorphic Encryption Scheme

Chapter 2. Matrix Arithmetic. Chapter 2

Homework for MATH 4604 (Advanced Calculus II) Spring 2017

Entropy and Ergodic Theory Lecture 28: Sinai s and Ornstein s theorems, II

CBSE 2018 ANNUAL EXAMINATION DELHI

CSCI 239 Discrete Structures of Computer Science Lab 6 Vectors and Matrices

Quantifier elimination

Existence of weak adiabatic limit in almost all models of perturbative QFT

Gap probabilities in tiling models and Painlevé equations

Remark 3.2. The cross product only makes sense in R 3.

Introduction to Applied Algebraic Topology

Exotic Crossed Products

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018

Programming Languages in String Diagrams. [ one ] String Diagrams. Paul-André Melliès. Oregon Summer School in Programming Languages June 2011

A primer on matrices

Whitehead problems for words in Z m ˆ F n

CSE 1400 Applied Discrete Mathematics Proofs

A Language of Case Differences

Duality and Automata Theory

DISCRETE DIFFERENTIAL GEOMETRY: AN APPLIED INTRODUCTION Keenan Crane CMU /858B Fall 2017

7.1 Truth tables from the perpective of evidence semantics

Towards a Flowchart Diagrammatic Language for Monad-based Semantics

Review of Vector Analysis in Cartesian Coordinates

13th International Conference on Relational and Algebraic Methods in Computer Science (RAMiCS 13)

Notes on multivariable calculus

STOKES THEOREM ON MANIFOLDS

Extraction from classical proofs using game models

Sets. We discuss an informal (naive) set theory as needed in Computer Science. It was introduced by G. Cantor in the second half of the nineteenth

Exercises on chapter 0

Math (P)Review Part I:

Transcription:

The simple essence of automatic differentiation Conal Elliott Target January/June 2018 Conal Elliott January/June 2018 1 / 51

Differentiable programming made easy Current AI revolution runs on large data, speed, and AD, but AD algorithm (backprop) is complex and stateful. Graph APIs are complex and semantically dubious. Solutions in this paper: AD: Simple, calculated, efficient, parallel-friendly, generalized. API: derivative. Conal Elliott Simple essence of AD January/June 2018 2 / 51

What s a derivative? Number Vector Covector Matrix Higher derivatives Chain rule for each. Conal Elliott Simple essence of AD January/June 2018 3 / 51

What s a derivative? D :: pa Ñ bq Ñ pa Ñ pa bqq A local linear (affine) approximation: f pa ` εq pf a ` D f a εq lim 0 εñ0 ε See Calculus on Manifolds by Michael Spivak. Conal Elliott Simple essence of AD January/June 2018 4 / 51

Composition Sequential: p q :: pb Ñ cq Ñ pa Ñ bq Ñ pa Ñ cq pg f q a g pf aq D pg f q a D g pf aq D f a -- chain rule Parallel: pÿq :: pa Ñ cq Ñ pa Ñ dq Ñ pa Ñ c ˆ dq pf Ÿ gq a pf a, g aq D pf Ÿ gq a D f a Ÿ D g a Conal Elliott Simple essence of AD January/June 2018 5 / 51

Linear functions Linear functions are their own derivatives everywhere. D id a id D fst a fst D snd a snd... Conal Elliott Simple essence of AD January/June 2018 6 / 51

Compositionality Chain rule: D pg f q a D g pf aq D f a -- non-compositional To fix, combine regular result with derivative: ˆD :: pa Ñ bq Ñ pa Ñ pb ˆ pa bqqq ˆD f f Ÿ D f -- specification Often much work in common to f and D f. Conal Elliott Simple essence of AD January/June 2018 7 / 51

Abstract algebra for functions class Category p q where id :: a a p q :: pb cq Ñ pa bq Ñ pa cq class Category p q ñ Cartesian p q where exl :: pa ˆ bq a exr :: pa ˆ bq b pÿq :: pa cq Ñ pa dq Ñ pa pc ˆ dqq Plus laws and classes for arithmetic etc. Conal Elliott Simple essence of AD January/June 2018 8 / 51

Automatic differentiation newtype D a b D pa Ñ b ˆ pa bqq ˆD :: pa Ñ bq Ñ D a b ˆD f D pf Ÿ D f q -- not computable Specification: ˆD preserves Category and Cartesian structure: ˆD id id ˆD pg f q ˆD g ˆD f ˆD exl exl ˆD exr exr ˆD pf Ÿ gq ˆD f Ÿ ˆD g The game: solve these equations for the RHS operations. Conal Elliott Simple essence of AD January/June 2018 9 / 51

Solution: simple automatic differentiation newtype D a b D pa Ñ b ˆ pa bqq lineard f D pλa Ñ pf a, f qq instance Category D where id lineard id D g D f D pλa Ñ let tpb, f 1 q f a; pc, g 1 q g b u in pc, g 1 f 1 qq instance Cartesian D where exl lineard exl exr lineard exr D f Ÿ D g D pλa Ñ let tpb, f 1 q f a; pc, g 1 q g a u in ppb, cq, f 1 Ÿ g 1 qq instance NumCat D where negate lineard negate add lineard add mul D pmul Ÿ pλpa, bq Ñ λpda, dbq Ñ b da ` a dbqq Conal Elliott Simple essence of AD January/June 2018 10 / 51

Running examples sqr :: Num a ñ a Ñ a sqr a a a magsqr :: Num a ñ a ˆ a Ñ a magsqr pa, bq sqr a ` sqr b cossinprod :: Floating a ñ a ˆ a Ñ a ˆ a cossinprod px, yq pcos z, sin zq where z x y In categorical vocabulary: sqr mul pid Ÿ idq magsqr add pmul pexl Ÿ exlq Ÿ mul pexr Ÿ exrqq cossinprod pcos Ÿ sinq mul Conal Elliott Simple essence of AD January/June 2018 11 / 51

Visualizing computations magsqr pa, bq sqr a ` sqr b magsqr add pmul pexl Ÿ exlq Ÿ mul pexr Ÿ exrqq In + Auto-generated from Haskell code. See Compiling to categories. Conal Elliott Simple essence of AD January/June 2018 12 / 51

AD example In sqr a a a sqr mul pid Ÿ idq In In + Conal Elliott Simple essence of AD January/June 2018 13 / 51

AD example In + magsqr pa, bq sqr a ` sqr b magsqr add pmul pexl Ÿ exlq Ÿ mul pexr Ÿ exrqq In + + + In + Conal Elliott Simple essence of AD January/June 2018 14 / 51

AD example In cos sin cossinprod px, yq pcos z, sin zq where z x y cossinprod pcos Ÿ sinq mul negate In + In sin cos Conal Elliott Simple essence of AD January/June 2018 15 / 51

Generalizing AD newtype D a b D pa Ñ b ˆ pa bqq lineard f D pλa Ñ pf a, f qq instance Category D where id lineard id D g D f D pλa Ñ let tpb, f 1 q f a; pc, g 1 q g b u in pc, g 1 f 1 qq instance Cartesian D where exl lineard exl exr lineard exr D f Ÿ D g D pλa Ñ let tpb, f 1 q f a; pc, g 1 q g a u in ppb, cq, f 1 Ÿ g 1 qq Each D operation just uses corresponding p q operation. Generalize from p q to other cartesian categories. Conal Elliott Simple essence of AD January/June 2018 16 / 51

Generalized AD newtype D p q a b D pa Ñ b ˆ pa bqq lineard f f 1 D pλa Ñ pf a, f 1 qq instance Category p q ñ Category D p q where id lineard id id D g D f D pλa Ñ let tpb, f 1 q f a; pc, g 1 q g b u in pc, g 1 f 1 qq instance Cartesian p q ñ Cartesian D p q where exl lineard exl exl exr lineard exr exr D f Ÿ D g D pλa Ñ let tpb, f 1 q f a; pc, g 1 q g a u in ppb, cq, f 1 Ÿ g 1 qq instance... ñ NumCat D where negate lineard negate negate add lineard add add mul?? Conal Elliott Simple essence of AD January/June 2018 17 / 51

Numeric operations Specific to (linear) functions: mul D pmul Ÿ pλpa, bq Ñ λpda, dbq Ñ b da ` a dbqq Rephrase: Now scale :: Multiplicative a ñ a Ñ pa aq scale u λv Ñ u v pźq :: pa cq Ñ pb cq Ñ ppa ˆ bq cq f Ź g λpa, bq Ñ f a ` g b mul D pmul Ÿ pλpa, bq Ñ scale b Ź scale aqq Conal Elliott Simple essence of AD January/June 2018 18 / 51

Linear arrow (biproduct) vocabulary class Category p q where id :: a a p q :: pb cq Ñ pa bq Ñ pa cq class Category p q ñ Cartesian p q where exl :: pa ˆ bq a exr :: pa ˆ bq b pÿq :: pa cq Ñ pa dq Ñ pa pc ˆ dqq class Category p q ñ Cocartesian p q where inl :: a pa ˆ bq inr :: b pa ˆ bq pźq :: pa cq Ñ pb cq Ñ ppa ˆ bq cq class ScalarCat p q a where scale :: a Ñ pa aq Conal Elliott Simple essence of AD January/June 2018 19 / 51

Linear transformations as functions newtype a Ñ`b AddFun pa Ñ bq instance Category pñ`q where id AddFun id p q innew 2 p q instance Cartesian pñ`q where exl AddFun exl exr AddFun exr pÿq innew 2 pÿq instance Cocartesian pñ`q where inl AddFun p, 0q inr AddFun p0, q pźq innew 2 pλf g px, yq Ñ f x ` g yq instance Multiplicative s ñ ScalarCat pñ`q s where scale s AddFun ps q Conal Elliott Simple essence of AD January/June 2018 20 / 51

Extracting a data representation How to extract a matrix or gradient vector? Sample over a domain basis (rows of identity matrix). For n-dimensional domain, Make n passes. Each pass works on n-d sparse ( one-hot ) input. Very inefficient. For gradient-based optimization, High-dimensional domain. Very low-dimensional (1-D) codomain. Conal Elliott Simple essence of AD January/June 2018 21 / 51

Generalized matrices newtype M s a b L pv s b pv s a sqq applyl :: M s a b Ñ pa Ñ bq Require applyl to preserve structure. Solve for methods. Conal Elliott Simple essence of AD January/June 2018 22 / 51

Core vocabulary Sufficient to build arbitrary matrices : scale :: a Ñ pa aq -- 1 ˆ 1 pźq :: pa cq Ñ pb cq Ñ ppa ˆ bq cq -- horizontal juxt pÿq :: pa cq Ñ pa dq Ñ pa pc ˆ dqq -- vertical juxt Types guarantee rectangularity. Conal Elliott Simple essence of AD January/June 2018 23 / 51

Efficiency of composition Arrow composition is associative. Some associations are more efficient than others, so Associate optimally. Equivalent to matrix chain multiplication Opn log nq. Choice determined by types, i.e., compile-time information. All-right: forward mode AD (FAD). All-left: reverse mode AD (RAD). RAD is much better for gradient-based optimization. Conal Elliott Simple essence of AD January/June 2018 24 / 51

Left-associating composition (RAD) CPS-like category: Represent a b by pb rq Ñ pa rq. Meaning: f ÞÑ p f q. Results in left-composition. Initialize with id :: r r. Construct h D f a directly, without D f a. Conal Elliott Simple essence of AD January/June 2018 25 / 51

Continuation category newtype Cont r p q a b Cont ppb rq Ñ pa rqq cont :: Category p q ñ pa bq Ñ Cont r p q a b cont f Cont p f q Require cont to preserve structure. Solve for methods. We ll use an isomorphism: join :: Cocartesian p q ñ pc aq ˆ pd aq Ñ ppc ˆ dq aq unjoin :: Cocartesian p q ñ ppc ˆ dq aq Ñ pc aq ˆ pd aq join pf, gq f Ź g unjoin h ph inl, h inrq Conal Elliott Simple essence of AD January/June 2018 26 / 51

Continuation category (solution) instance Category p q ñ Category Cont r p q where id Cont id Cont g Cont f Cont pf gq instance Cartesian p q ñ Cartesian Cont r p q where exl Cont pjoin inlq exr Cont pjoin inrq pÿq innew 2 pλf g Ñ pf Ź gq unjoinq instance Cocartesian p q ñ Cocartesian Cont r p q where inl Cont pexl unjoinq inr Cont pexr unjoinq pźq innew 2 pλf g Ñ join pf Ÿ gqq instance ScalarCat p q a ñ ScalarCat Cont r p q a where scale s Cont pscale sq Conal Elliott Simple essence of AD January/June 2018 27 / 51

Reverse-mode AD without tears D Cont r Ms Conal Elliott Simple essence of AD January/June 2018 28 / 51

Left-associating composition (RAD) CPS-like category: Represent a b by pb rq Ñ pa rq. Meaning: f ÞÑ p f q. Results in left-composition. Initialize with id :: r r. Construct h D f a directly, without D f a. We ve seen this trick before: Transforming naive reverse from quadratic to linear. List generalizes to monoids, and monoids to categories. Conal Elliott Simple essence of AD January/June 2018 29 / 51

One of my favorite papers Continuation-Based Program Transformation Strategies Mitch Wand, 1980, JACM. Introduce a continuation argument, e.g., ra s Ñ ra s. Notice the continuations that arise, e.g., p ` asq. Find a data representation, e.g., as :: ra s Identify associative operation that represents composition, e.g., p `q, since p ` bsq p ` asq p ` pas ` bsqq. Conal Elliott Simple essence of AD January/June 2018 30 / 51

Duality Vector space dual: u s, with u a vector space over s. If u has finite dimension, then u s u. For f :: u s, f dot v for some v :: u. Gradients are derivatives of functions with scalar codomain. Represent a b by pb sq Ñ pa sq by b Ñ a. Ideal for extracting gradient vector. Just apply to 1 (id). Conal Elliott Simple essence of AD January/June 2018 31 / 51

Duality newtype Dual p q a b Dual pb aq asdual :: Cont s p q a b Ñ Dual p q a b asdual pcont f q Dual pdot 1 f dotq where dot :: u Ñ pu sq dot 1 :: pu sq Ñ u Require asdual to preserve structure. Solve for methods. Conal Elliott Simple essence of AD January/June 2018 32 / 51

Duality (solution) newtype Dual p q a b Dual pb aq instance Category p q ñ Category Dual p q where id Dual id p q innew 2 pflip p qq instance Cocartesian p q ñ Cartesian Dual p q where exl Dual inl exr Dual inr pÿq innew 2 pźq instance Cartesian p q ñ Cocartesian Dual p q where inl Dual exl inr Dual exr pźq innew 2 pÿq instance ScalarCat p q s ñ ScalarCat Dual p q s where scale s Dual pscale sq Conal Elliott Simple essence of AD January/June 2018 33 / 51

Backpropagation D Dual Ñ` Conal Elliott Simple essence of AD January/June 2018 34 / 51

RAD example (dual function) In + In + In Conal Elliott Simple essence of AD January/June 2018 35 / 51

RAD example (dual vector) In + In + 1.0 Conal Elliott Simple essence of AD January/June 2018 36 / 51

RAD example (dual function) In In In + Conal Elliott Simple essence of AD January/June 2018 37 / 51

RAD example (vector) In In 1.0 Conal Elliott Simple essence of AD January/June 2018 38 / 51

RAD example (dual function) In In In 0.0 Conal Elliott Simple essence of AD January/June 2018 39 / 51

RAD example (dual vector) In In 1.0 0.0 Conal Elliott Simple essence of AD January/June 2018 40 / 51

RAD example (dual function) In In In + Conal Elliott Simple essence of AD January/June 2018 41 / 51

RAD example (dual vector) In In + Conal Elliott Simple essence of AD January/June 2018 42 / 51

RAD example (dual function) In + In + + In + Conal Elliott Simple essence of AD January/June 2018 43 / 51

RAD example (dual vector) In + + In + + Conal Elliott Simple essence of AD January/June 2018 44 / 51

RAD example (dual function) In cos sin cos sin cos In In sin Conal Elliott Simple essence of AD January/June 2018 45 / 51

RAD example (matrix) In cos sin In cos sin negate negate Conal Elliott Simple essence of AD January/June 2018 46 / 51

Incremental evaluation In + if In if + if if if if + + In + Conal Elliott Simple essence of AD January/June 2018 47 / 51

Conclusions Simple AD algorithm, specializing to forward, reverse, mixed. No graphs, tapes, tags, partial derivatives, or mutation. Parallel-friendly and low memory use. Calculated from simple, regular algebra problems. Generalizes to derivative categories other than linear maps. Differentiate regular Haskell code (via plugin). More details in my ICFP 2018 paper. Conal Elliott Simple essence of AD January/June 2018 48 / 51

Reflections: recipe for success Key principles: Capture main concepts as first-class values. Focus on abstract notions, not specific representations. Calculate efficient implementation from simple specification. Not previously applied to AD (afaik). Quandary: Most programming languages poor for function-like things. Solution: Compiling to categories. Conal Elliott Simple essence of AD January/June 2018 49 / 51

Symbolic vs automatic differentiation Often described as opposing techniques: Symbolic: Apply differentiation rules symbolically. Can duplicate much work. Needs algebraic manipulation. Automatic: FAD: easy to implement but often inefficient. RAD: efficient but tricky to implement. My view: AD is SD done by a compiler. Compilers already work symbolically and preserve sharing. Conal Elliott Simple essence of AD January/June 2018 50 / 51