On Computational Thinking, Inferential Thinking and Data Science

1 On Computational Thinking, Inferential Thinking and Data Science Michael I. Jordan University of California, Berkeley December 17, 2016

2-6 A Job Description, circa 2015
Your Boss: I need a Big Data system that will replace our classic service with a personalized service
It should work reasonably well for anyone and everyone; I can tolerate a few errors, but not too many dumb ones that will embarrass us
It should run just as fast as our classic service
It should only improve as we collect more data; in particular, it shouldn't slow down
There are serious privacy concerns, of course, and they vary across the clients

7-9 Some Challenges Driven by Big Data
Big Data analysis requires a thorough blending of computational thinking and inferential thinking
What I mean by computational thinking: abstraction, modularity, scalability, robustness, etc.
Inferential thinking means (1) considering the real-world phenomenon behind the data, (2) considering the sampling pattern that gave rise to the data, and (3) developing procedures that will go backwards from the data to the underlying phenomenon

10 The Challenges Are Daunting
The core theories in computer science and statistics developed separately, and there is an oil-and-water problem:
Core statistical theory doesn't have a place for runtime and other computational resources
Core computational theory doesn't have a place for statistical risk

11 Outline Inference under privacy constraints Inference under communication constraints Lower bounds, the variational perspective and symplectic integration

12 Part I: Inference and Privacy with John Duchi and Martin Wainwright

13 Privacy and Data Analysis
Individuals are not generally willing to allow their personal data to be used without control over how it will be used and how much privacy loss they will incur
Privacy loss can be quantified via differential privacy
We want to trade privacy loss against the value we obtain from data analysis
The question becomes that of quantifying such value and juxtaposing it with privacy loss

14-19 Privacy
[diagram: a query applied to the raw database returns an answer θ; a privatization channel Q produces a privatized database, and the same query applied to it returns a private answer θ̂]
Classical problem in differential privacy: show that θ̂ and θ are close under constraints on the channel Q

20-23 Inference
[diagram: a population P is sampled via a sampling mechanism S to produce the database; a query applied to the database returns an estimate θ̂ of the population quantity θ]
Classical problem in statistical theory: show that θ̂ and θ are close under constraints on S

24 Privacy and Inference
[diagram: population P, sampling S, database, privatization channel Q, privatized database, query answer θ̂]
The privacy-meets-inference problem: show that θ̂ and θ are close under constraints on Q and on S

25-26 Background on Inference
In the 1930s, Wald laid the foundations of statistical decision theory
Given a family of distributions P, a parameter θ(P) for each P ∈ P, an estimator θ̂, and a loss ℓ(θ̂, θ(P)), define the risk:
R_P(θ̂) := E_P[ ℓ(θ̂, θ(P)) ]
Minimax principle [Wald, '39, '43]: choose the estimator minimizing the worst-case risk:
sup over P ∈ P of E_P[ ℓ(θ̂, θ(P)) ]
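Wald's risk can be made concrete with a small Monte Carlo sketch. The setup here is my own illustration, not from the talk: squared-error loss ℓ(θ̂, θ) = (θ̂ − θ)², a Gaussian location family P = {N(θ, 1)}, and two competing estimators.

```python
import random
import statistics

def monte_carlo_risk(estimator, theta=0.0, n=50, trials=2000, seed=0):
    """Approximate R_P(theta_hat) = E_P[ l(theta_hat, theta(P)) ]
    under squared-error loss, for P = N(theta, 1) with n i.i.d. draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(sample) - theta) ** 2
    return total / trials

risk_mean = monte_carlo_risk(statistics.fmean)     # ~ 1/n = 0.02 for the sample mean
risk_median = monte_carlo_risk(statistics.median)  # ~ (pi/2)/n, strictly larger here
```

For a single Gaussian the sample mean beats the median; the minimax principle then asks for the estimator whose worst such risk over the whole family P is smallest.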

27-28 Local Privacy
[diagram: individuals with private data communicate through a private channel to an estimator]

29 Differential Privacy
Definition: a channel Q is ε-differentially private if, for every output z and every pair of inputs x, x′,
Q(z | x) / Q(z | x′) ≤ e^ε
[Dwork, McSherry, Nissim, Smith 06]
Given the output Z, one cannot reliably discriminate between x and x′
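The canonical example of such a channel (a sketch I am adding; the slide itself names no mechanism) is randomized response on a single bit. The helper names are mine.

```python
import math
import random

def randomized_response(x, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    For every output z and inputs x, x': Q(z|x)/Q(z|x') <= e^eps,
    so given the output one cannot reliably discriminate between x and x'."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return x if rng.random() < p_truth else 1 - x

def debiased_proportion(private_bits, epsilon):
    """Invert the channel's known bias to get an unbiased estimate
    of the true proportion of 1-bits."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    raw = sum(private_bits) / len(private_bits)
    return (raw - (1.0 - p)) / (2.0 * p - 1.0)
```

The debiasing step is where the statistical price of privacy shows up: the variance of the estimate blows up as ε shrinks.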

30 Private Minimax Risk
Ingredients: a parameter θ(P) of the distribution, a family of distributions P, a loss ℓ measuring error, and a family Q_ε of ε-private channels
Minimax risk under the privacy constraint (optimizing over the best ε-private channel):
inf over Q ∈ Q_ε of inf over estimators θ̂ of sup over P ∈ P of E[ ℓ(θ̂, θ(P)) ]

31 Vignette: Private Mean Estimation
Example: estimate reasons for hospital visits
Patients admitted to hospital for substance abuse; estimate the prevalence of different substances
One patient's record vs. population proportions:
Alcohol 1 (proportion .45)
Cocaine 1 (.32)
Heroin 0 (.16)
Cannabis 0 (.20)
LSD 0 (.00)
Amphetamines 0 (.02)

32-34 Vignette: Mean Estimation
Consider estimation of the mean, with errors measured in the ℓ∞-norm
Proposition: minimax rate (achieved by the sample mean)
Proposition: private minimax rate
Note: effective sample size

35 Optimal Mechanism?
Idea 1: add independent noise to the non-private observation (e.g., the Laplace mechanism) [Dwork et al. 06]
Problem: the noise magnitude is much too large (and this is unavoidable: the approach is provably sub-optimal)
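Idea 1 can be sketched in a few lines. This is a hedged illustration: the scale d/ε below assumes each record is a 0/1 vector over d substances, so the ℓ1 sensitivity between any two records is at most d.

```python
import math
import random

def laplace_mechanism(x, epsilon, rng):
    """Release x plus independent Laplace noise on each coordinate.
    For x in {0,1}^d the L1 sensitivity is d, forcing per-coordinate noise
    of scale d/epsilon -- a magnitude that swamps the unit-scale signal
    as the dimension d grows (the slide's 'much too large' problem)."""
    scale = len(x) / epsilon
    def lap():
        u = rng.random() - 0.5                               # uniform on [-1/2, 1/2)
        sign = 1.0 if u >= 0 else -1.0
        return -scale * sign * math.log(1.0 - 2.0 * abs(u))  # inverse-CDF sampling
    return [xi + lap() for xi in x]
```

With d = 6 and ε = 1 the noise standard deviation per coordinate is 6·√2 ≈ 8.5, against a signal taking values in {0, 1}.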

36 Empirical Evidence
[plot: estimation error versus sample size n]
Estimate the proportion of emergency-room visits involving different substances
Data source: Drug Abuse Warning Network

37 Additional Examples
Fixed-design regression
Convex risk minimization
Multinomial estimation
Nonparametric density estimation
Almost always, the same effective sample size reduction appears

38 Part II: Inference and Compression
with Yuchen Zhang, John Duchi and Martin Wainwright

39-40 Computation and Inference
How does inferential quality trade off against classical computational resources such as time and space?
Hard!

41 Computation and Inference: Mechanisms and Bounds
Tradeoffs via convex relaxations: linking runtime to convex geometry and risk to convex geometry
Tradeoffs via concurrency control: optimistic concurrency control
Bounds via optimization oracles: number of accesses to a gradient as a surrogate for computation
Bounds via communication complexity
Tradeoffs via subsampling: bag of little bootstraps, variational consensus Monte Carlo

42 A Variational Framework for Accelerated Methods in Optimization with Andre Wibisono and Ashia Wilson July 12, 2016

43-45 Accelerated Gradient Descent
Setting: unconstrained convex optimization
min over x ∈ R^d of f(x)
Classical gradient descent:
x_{k+1} = x_k − β ∇f(x_k)
obtains a convergence rate of O(1/k)
Accelerated gradient descent:
y_{k+1} = x_k − β ∇f(x_k)
x_{k+1} = (1 − λ_k) y_{k+1} + λ_k y_k
obtains the (optimal) convergence rate of O(1/k²)
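The two updates above can be run side by side. A minimal sketch (the Python and the test function are mine; taking λ_k = −(k−1)/(k+2) recovers the usual Nesterov momentum form):

```python
def gradient_descent(grad, x0, beta, iters):
    """x_{k+1} = x_k - beta * grad f(x_k): the O(1/k) method."""
    x = list(x0)
    for _ in range(iters):
        x = [xi - beta * gi for xi, gi in zip(x, grad(x))]
    return x

def accelerated_gradient_descent(grad, x0, beta, iters):
    """y_{k+1} = x_k - beta * grad f(x_k);
    x_{k+1} = (1 - lam_k) y_{k+1} + lam_k y_k with lam_k = -(k-1)/(k+2),
    i.e. Nesterov momentum: the O(1/k^2) method."""
    x, y_old = list(x0), list(x0)
    for k in range(1, iters + 1):
        y_new = [xi - beta * gi for xi, gi in zip(x, grad(x))]
        lam = -(k - 1) / (k + 2)
        x = [(1 - lam) * yn + lam * yo for yn, yo in zip(y_new, y_old)]
        y_old = y_new
    return y_old
```

On an ill-conditioned quadratic, the accelerated iterate reaches a markedly lower objective in the same number of gradient evaluations, despite not decreasing f monotonically.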

46-47 The Acceleration Phenomenon
Two classes of algorithms:
Gradient methods: gradient descent, mirror descent, cubic-regularized Newton's method (Nesterov and Polyak 06), etc. Greedy descent methods, relatively well understood
Accelerated methods: Nesterov's accelerated gradient descent, accelerated mirror descent, accelerated cubic-regularized Newton's method (Nesterov 08), etc. Important for both theory (optimal rate for first-order methods) and practice (many extensions: FISTA, stochastic setting, etc.) Not descent methods, faster than gradient methods, still mysterious

48-50 Accelerated Methods
Analysis using Nesterov's estimate sequence technique
Common interpretation as momentum methods (Euclidean case)
Many proposed explanations:
Chebyshev polynomials (Hardt 13)
Linear coupling (Allen-Zhu, Orecchia 14)
Optimized first-order method (Drori, Teboulle 14; Kim, Fessler 15)
Geometric shrinking (Bubeck, Lee, Singh 15)
Universal catalyst (Lin, Mairal, Harchaoui 15)...
But only for strongly convex functions, or first-order methods
Question: What is the underlying mechanism that generates acceleration (including for higher-order methods)?

51-53 Accelerated Methods: Continuous-Time Perspective
Gradient descent is a discretization of gradient flow
Ẋ_t = −∇f(X_t)
(and mirror descent is a discretization of natural gradient flow)
Su, Boyd, Candes 14: the continuous-time limit of accelerated gradient descent is a second-order ODE
Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0
These ODEs are obtained by taking continuous-time limits. Is there a deeper generative mechanism?
Our work: a general variational approach to acceleration, and a systematic discretization methodology
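The second-order ODE can be integrated directly. A sketch: the semi-implicit Euler scheme, step sizes, and quadratic test function below are my own choices, not from the talk.

```python
def integrate_accelerated_ode(grad, x0, t0=0.1, t_end=20.0, dt=1e-3):
    """Semi-implicit Euler integration of  X'' + (3/t) X' + grad f(X) = 0,
    started from rest at t0 > 0 (the damping coefficient 3/t blows up at t = 0)."""
    x = list(x0)
    v = [0.0] * len(x0)  # X'(t0) = 0
    t = t0
    while t < t_end:
        g = grad(x)
        # update velocity first, then position with the new velocity
        v = [vi + dt * (-(3.0 / t) * vi - gi) for vi, gi in zip(v, g)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x
```

For f(x) = ½‖x‖² the trajectory oscillates toward the minimizer under the vanishing 3/t damping, with the objective shrinking like O(1/t²).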

54-60 Bregman Lagrangian
Define the Bregman Lagrangian:
L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )
A function of position x, velocity ẋ, and time t
D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence
h is the convex distance-generating function
f is the convex objective function
α_t, β_t, γ_t ∈ R are arbitrary smooth functions
In the Euclidean setting, it simplifies to the damped Lagrangian
L(x, ẋ, t) = e^{γ_t − α_t} ( ½ ‖ẋ‖² − e^{2α_t + β_t} f(x) )
Ideal scaling conditions: β̇_t ≤ e^{α_t} and γ̇_t = e^{α_t}
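As a consistency check on the Euclidean claim: taking h(x) = ½‖x‖² gives D_h(y, x) = ½‖y − x‖², and the Bregman Lagrangian collapses step by step to the damped form:

```latex
L(x,\dot{x},t)
  = e^{\gamma_t+\alpha_t}\left(\tfrac{1}{2}\,e^{-2\alpha_t}\|\dot{x}\|^2
      - e^{\beta_t} f(x)\right)
  = e^{\gamma_t-\alpha_t}\left(\tfrac{1}{2}\|\dot{x}\|^2
      - e^{2\alpha_t+\beta_t} f(x)\right),
```

i.e., kinetic energy minus a time-weighted potential f: the classical damped-oscillator structure.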

61-62 Bregman Lagrangian
L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )
Variational problem over curves:
min over curves X of ∫ L(X_t, Ẋ_t, t) dt
The optimal curve is characterized by the Euler-Lagrange equation:
d/dt { ∂L/∂ẋ (X_t, Ẋ_t, t) } = ∂L/∂x (X_t, Ẋ_t, t)
E-L equation for the Bregman Lagrangian under ideal scaling:
Ẍ_t + (e^{α_t} − α̇_t) Ẋ_t + e^{2α_t + β_t} [∇²h(X_t + e^{−α_t} Ẋ_t)]^{−1} ∇f(X_t) = 0

63-64 General Convergence Rate
Theorem: Under ideal scaling, the E-L equation has convergence rate
f(X_t) − f(x*) ≤ O(e^{−β_t})
Proof: Exhibit a Lyapunov function for the dynamics:
E_t = D_h(x*, X_t + e^{−α_t} Ẋ_t) + e^{β_t} (f(X_t) − f(x*))
Ė_t = −e^{α_t + β_t} D_f(x*, X_t) + (β̇_t − e^{α_t}) e^{β_t} (f(X_t) − f(x*)) ≤ 0
Note: this only requires convexity and differentiability of f and h

65-66 Polynomial Convergence Rate
For p > 0, choose parameters:
α_t = log p − log t
β_t = p log t + log C
γ_t = p log t
The E-L equation has an O(e^{−β_t}) = O(1/t^p) convergence rate:
Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0
For p = 2: recover the result of Krichene et al. with O(1/t²) convergence rate
In the Euclidean case, recover the ODE of Su et al.:
Ẍ_t + (3/t) Ẋ_t + ∇f(X_t) = 0

67-68 Time Dilation Property (Reparameterizing Time)
(p = 2: accelerated gradient descent)
O(1/t²): Ẍ_t + (3/t) Ẋ_t + 4C [∇²h(X_t + (t/2) Ẋ_t)]^{−1} ∇f(X_t) = 0
speed up time: Y_t = X_{t^{3/2}}
O(1/t³): Ÿ_t + (4/t) Ẏ_t + 9C t [∇²h(Y_t + (t/3) Ẏ_t)]^{−1} ∇f(Y_t) = 0
(p = 3: accelerated cubic-regularized Newton's method)
All accelerated methods are traveling the same curve in space-time at different speeds
Gradient methods don't have this property
From gradient flow to rescaled gradient flow: replace ∇f(X_t) with ∇f(X_t) / ‖∇f(X_t)‖^{(p−2)/(p−1)}

69-70 Time Dilation for the General Bregman Lagrangian
O(e^{−β_t}): E-L equation for the Lagrangian L_{α,β,γ}
speed up time: Y_t = X_{τ(t)}
O(e^{−β_{τ(t)}}): E-L equation for the Lagrangian L_{α̃,β̃,γ̃}, where
α̃_t = α_{τ(t)} + log τ̇(t)
β̃_t = β_{τ(t)}
γ̃_t = γ_{τ(t)}
Question: How to discretize the E-L equation while preserving the convergence rate?

71-72 Discretizing the Dynamics (Naive Approach)
Write the E-L equation as a system of first-order equations:
Z_t = X_t + (t/p) Ẋ_t
d/dt ∇h(Z_t) = −C p t^{p−1} ∇f(X_t)
Euler discretization with time step δ > 0 (i.e., set x_k = X_t, x_{k+1} = X_{t+δ}):
x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) x_k
z_k = argmin over z of { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ɛ) D_h(z, z_{k−1}) }
with step size ɛ = δ^p, where k^{(p−1)} = k(k+1)···(k+p−2) is the rising factorial

73 Naive Discretization Doesn't Work
x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) x_k
z_k = argmin over z of { C p k^{(p−1)} ⟨∇f(x_k), z⟩ + (1/ɛ) D_h(z, z_{k−1}) }
Cannot obtain a convergence guarantee, and empirically unstable
[plot: f(x_k) versus k]

74-75 Modified Discretization
Introduce an auxiliary sequence y_k:
x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k
z_k = argmin over z of { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ɛ) D_h(z, z_{k−1}) }
Sufficient condition: ⟨∇f(y_k), x_k − y_k⟩ ≥ M ɛ^{1/(p−1)} ‖∇f(y_k)‖^{p/(p−1)}
Assume h is uniformly convex: D_h(y, x) ≥ (1/p) ‖y − x‖^p
Theorem: f(y_k) − f(x*) ≤ O(1/(ɛ k^p))
Note: matching convergence rates, 1/(ɛ k^p) = 1/(δk)^p = 1/t^p
Proof using a generalization of Nesterov's estimate sequence technique

76 Accelerated Higher-Order Gradient Method
x_{k+1} = (p/(k+p)) z_k + (k/(k+p)) y_k
y_k = argmin over y of { f_{p−1}(y; x_k) + (2/(ɛp)) ‖y − x_k‖^p }
z_k = argmin over z of { C p k^{(p−1)} ⟨∇f(y_k), z⟩ + (1/ɛ) D_h(z, z_{k−1}) }
(f_{p−1}(y; x_k) is the (p−1)-st-order Taylor expansion of f around x_k)
If ∇^{p−1} f is (1/ɛ)-Lipschitz and h is uniformly convex of order p, then:
f(y_k) − f(x*) ≤ O(1/(ɛ k^p)) (accelerated rate)
p = 2: accelerated gradient/mirror descent
p = 3: accelerated cubic-regularized Newton's method (Nesterov 08)
p ≥ 2: accelerated higher-order methods
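For p = 2 in the Euclidean setting (h = ½‖·‖², so the mirror step becomes a plain gradient step and f_1(y; x_k) is the first-order Taylor expansion), the three sequences above reduce to a concrete method. This is a hedged sketch: the choice C = 1/4 and the variable names are mine.

```python
def accelerated_method_p2(grad, x0, eps, iters):
    """p = 2, Euclidean instance of the three-sequence scheme:
       y_k     = x_k - (eps/2) grad f(x_k)           (gradient step)
       z_k     = z_{k-1} - (eps k / 2) grad f(y_k)   (weighted step; C = 1/4)
       x_{k+1} = 2/(k+2) z_k + k/(k+2) y_k           (coupling)"""
    x, z, y = list(x0), list(x0), list(x0)
    for k in range(1, iters + 1):
        y = [xi - 0.5 * eps * gi for xi, gi in zip(x, grad(x))]
        z = [zi - 0.5 * eps * k * gi for zi, gi in zip(z, grad(y))]
        x = [(2.0 * zi + k * yi) / (k + 2) for zi, yi in zip(z, y)]
    return y
```

With ɛ = 1/L for an L-smooth objective, this runs at the accelerated O(1/(ɛk²)) rate; it is essentially Nesterov's method written in linear-coupling form.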

77-78 Recap: Gradient vs. Accelerated Methods
How to design dynamics for minimizing a convex function f?
Rescaled gradient flow: Ẋ_t = −∇f(X_t) / ‖∇f(X_t)‖^{(p−2)/(p−1)}, rate O(1/t^{p−1})
Higher-order gradient method: O(1/(ɛ k^{p−1})) when ∇^{p−1} f is (1/ɛ)-Lipschitz; matching rate with ɛ = δ^{p−1}, i.e., δ = ɛ^{1/(p−1)}
Polynomial Euler-Lagrange equation: Ẍ_t + ((p+1)/t) Ẋ_t + C p² t^{p−2} [∇²h(X_t + (t/p) Ẋ_t)]^{−1} ∇f(X_t) = 0, rate O(1/t^p)
Accelerated higher-order method: O(1/(ɛ k^p)) when ∇^{p−1} f is (1/ɛ)-Lipschitz; matching rate with ɛ = δ^p, i.e., δ = ɛ^{1/p}

79 Summary: Bregman Lagrangian
Bregman Lagrangian family with a general convergence guarantee:
L(x, ẋ, t) = e^{γ_t + α_t} ( D_h(x + e^{−α_t} ẋ, x) − e^{β_t} f(x) )
The polynomial subfamily generates accelerated higher-order methods: O(1/t^p) convergence rate via higher-order smoothness
Exponential subfamily: O(e^{−ct}) rate via uniform convexity
Understand the structure and properties of the Bregman Lagrangian: gauge invariance, symmetry, gradient flows as limit points, etc.
Bregman Hamiltonian:
H(x, p, t) = e^{α_t + γ_t} ( D_{h*}(∇h(x) + e^{−γ_t} p, ∇h(x)) + e^{β_t} f(x) )

80 Discussion
Many conceptual and mathematical challenges arise in taking seriously the problem of Big Data
Facing these challenges will require a rapprochement between computational thinking and inferential thinking: bringing computational and inferential fields together at the level of their foundations


More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization

A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization A Second-Order Path-Following Algorithm for Unconstrained Convex Optimization Yinyu Ye Department is Management Science & Engineering and Institute of Computational & Mathematical Engineering Stanford

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization

This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization x = prox_f(x)+prox_{f^*}(x) use to get prox of norms! PROXIMAL METHODS WHY PROXIMAL METHODS Smooth

More information

Large Deviations for Small-Noise Stochastic Differential Equations

Large Deviations for Small-Noise Stochastic Differential Equations Chapter 22 Large Deviations for Small-Noise Stochastic Differential Equations This lecture is at once the end of our main consideration of diffusions and stochastic calculus, and a first taste of large

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Large Deviations for Small-Noise Stochastic Differential Equations

Large Deviations for Small-Noise Stochastic Differential Equations Chapter 21 Large Deviations for Small-Noise Stochastic Differential Equations This lecture is at once the end of our main consideration of diffusions and stochastic calculus, and a first taste of large

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

An Optimal Affine Invariant Smooth Minimization Algorithm.

An Optimal Affine Invariant Smooth Minimization Algorithm. An Optimal Affine Invariant Smooth Minimization Algorithm. Alexandre d Aspremont, CNRS & École Polytechnique. Joint work with Martin Jaggi. Support from ERC SIPA. A. d Aspremont IWSL, Moscow, June 2013,

More information

Duality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Duality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725 Duality in Linear Programs Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: proximal gradient descent Consider the problem x g(x) + h(x) with g, h convex, g differentiable, and

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

GRADIENT = STEEPEST DESCENT

GRADIENT = STEEPEST DESCENT GRADIENT METHODS GRADIENT = STEEPEST DESCENT Convex Function Iso-contours gradient 0.5 0.4 4 2 0 8 0.3 0.2 0. 0 0. negative gradient 6 0.2 4 0.3 2.5 0.5 0 0.5 0.5 0 0.5 0.4 0.5.5 0.5 0 0.5 GRADIENT DESCENT

More information

Fast Algorithms for Segmented Regression

Fast Algorithms for Segmented Regression Fast Algorithms for Segmented Regression Jayadev Acharya 1 Ilias Diakonikolas 2 Jerry Li 1 Ludwig Schmidt 1 1 MIT 2 USC June 21, 2016 1 / 21 Statistical vs computational tradeoffs? General Motivating Question

More information

ADMM and Accelerated ADMM as Continuous Dynamical Systems

ADMM and Accelerated ADMM as Continuous Dynamical Systems Guilherme França 1 Daniel P. Robinson 1 René Vidal 1 Abstract Recently, there has been an increasing interest in using tools from dynamical systems to analyze the behavior of simple optimization algorithms

More information

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent

CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi. Gradient Descent CS 435, 2018 Lecture 3, Date: 8 March 2018 Instructor: Nisheeth Vishnoi Gradient Descent This lecture introduces Gradient Descent a meta-algorithm for unconstrained minimization. Under convexity of the

More information

Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra

Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra Michael W. Mahoney ICSI and Dept of Statistics, UC Berkeley ( For more info,

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Answering Many Queries with Differential Privacy

Answering Many Queries with Differential Privacy 6.889 New Developments in Cryptography May 6, 2011 Answering Many Queries with Differential Privacy Instructors: Shafi Goldwasser, Yael Kalai, Leo Reyzin, Boaz Barak, and Salil Vadhan Lecturer: Jonathan

More information

Lecture 14: Newton s Method

Lecture 14: Newton s Method 10-725/36-725: Conve Optimization Fall 2016 Lecturer: Javier Pena Lecture 14: Newton s ethod Scribes: Varun Joshi, Xuan Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. 1. Prediction with expert advice. 2. With perfect

More information

How hard is this function to optimize?

How hard is this function to optimize? How hard is this function to optimize? John Duchi Based on joint work with Sabyasachi Chatterjee, John Lafferty, Yuancheng Zhu Stanford University West Coast Optimization Rumble October 2016 Problem minimize

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

Generalized Conditional Gradient and Its Applications

Generalized Conditional Gradient and Its Applications Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

Logistic Regression Trained with Different Loss Functions. Discussion

Logistic Regression Trained with Different Loss Functions. Discussion Logistic Regression Trained with Different Loss Functions Discussion CS640 Notations We restrict our discussions to the binary case. g(z) = g (z) = g(z) z h w (x) = g(wx) = + e z = g(z)( g(z)) + e wx =

More information

arxiv: v1 [cs.lg] 27 May 2014

arxiv: v1 [cs.lg] 27 May 2014 Private Empirical Risk Minimization, Revisited Raef Bassily Adam Smith Abhradeep Thakurta May 29, 2014 arxiv:1405.7085v1 [cs.lg] 27 May 2014 Abstract In this paper, we initiate a systematic investigation

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Accelerated gradient methods

Accelerated gradient methods ELE 538B: Large-Scale Optimization for Data Science Accelerated gradient methods Yuxin Chen Princeton University, Spring 018 Outline Heavy-ball methods Nesterov s accelerated gradient methods Accelerated

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

Information-theoretic foundations of differential privacy

Information-theoretic foundations of differential privacy Information-theoretic foundations of differential privacy Darakhshan J. Mir Rutgers University, Piscataway NJ 08854, USA, mir@cs.rutgers.edu Abstract. We examine the information-theoretic foundations of

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

arxiv: v1 [math.oc] 20 Jul 2018

arxiv: v1 [math.oc] 20 Jul 2018 Continuous-Time Accelerated Methods via a Hybrid Control Lens ARMAN SHARIFI KOLARIJANI, PEYMAN MOHAJERIN ESFAHANI, TAMÁS KEVICZKY arxiv:1807.07805v1 [math.oc] 20 Jul 2018 Abstract. Treating optimization

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information

Converse Lyapunov theorem and Input-to-State Stability

Converse Lyapunov theorem and Input-to-State Stability Converse Lyapunov theorem and Input-to-State Stability April 6, 2014 1 Converse Lyapunov theorem In the previous lecture, we have discussed few examples of nonlinear control systems and stability concepts

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Integration Methods and Accelerated Optimization Algorithms

Integration Methods and Accelerated Optimization Algorithms Integration Methods and Accelerated Optimization Algorithms Damien Scieur Vincent Roulet Francis Bach Alexandre d Aspremont INRIA - Sierra Project-team Département d Informatique de l Ecole Normale Supérieure

More information