Discrete Optimization in Machine Learning. Colorado Reed
1 Discrete Optimization in Machine Learning Colorado Reed [ML-RCC] 31 Jan
2 Acknowledgements Some slides/animations are based on: Krause et al. tutorials; a Pushmeet Kohli tutorial; Jeff Bilmes' class. Please consult the original authors before reusing slides marked AK, PK, or JB. 2
3 What is Discrete Optimization? An optimization problem with a finite or countably infinite set of solutions. Applications: image segmentation, games. Common examples: TSP, Set Cover, network flows, vertex coloring, the knapsack problem. 3
4 Discrete Optimization in ML. Given RVs Y, X_1, …, X_n, predict Y from X_{i1}, …, X_{ik} — the k most informative features. Example: Y = Sick; X_1 = Fever, X_2 = Age, X_3 = Biopsy Result. A* := argmax_A I(X_A; Y) s.t. |A| ≤ k, where I(X_A; Y) = H(Y) − H(Y | X_A). AK 4
5 Discrete Optimization in ML. Given RVs X_1, …, X_n with index set V, partition V into sets that are as independent as possible, e.g. A = {X_1, X_3, X_4} and V \ A = {X_2, X_5, X_6}. Formally: A* := argmin_A I(X_A; X_{V\A}) s.t. |A| ≤ n, where I(X_A; X_{V\A}) = H(X_{V\A}) − H(X_{V\A} | X_A). 5
6 This Tutorial. Given a finite set V and a function F: 2^V → R, solve A* = argopt_A F(A) s.t. constraints(A) are satisfied. Methods not covered here: relaxations, mixed integer programming, POMDPs, heuristics. Focus: submodularity — scalable algorithms, occurs naturally, rising interest in ML, performance guarantees! Goal: foster submodular intuition. 6
7 Submodular set functions. A set function F: 2^V → R on a finite set V = {1, 2, …, n} is called submodular if for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B). Equivalent diminishing-returns characterization: adding an element to a small set yields a large improvement, adding it to a large set a small one. For A ⊆ B, s ∉ B: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B). AK 7
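Both characterizations on this slide can be checked mechanically on a tiny ground set. The following is a minimal sketch (not from the slides) that brute-forces the inequality F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) over all subset pairs; the coverage function used as an example is a standard submodular function:

```python
from itertools import combinations

def subsets(V):
    """All subsets of V as frozensets."""
    return [frozenset(c) for r in range(len(V) + 1)
            for c in combinations(V, r)]

def is_submodular(F, V):
    """Check F(A) + F(B) >= F(A | B) + F(A & B) for every pair A, B subset of V."""
    for A in subsets(V):
        for B in subsets(V):
            if F(A) + F(B) < F(A | B) + F(A & B) - 1e-9:
                return False
    return True

# Coverage function: F(A) = size of the union of the regions indexed by A.
regions = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}}
F_cover = lambda A: len(set().union(*(regions[i] for i in A))) if A else 0
print(is_submodular(F_cover, {1, 2, 3}))          # coverage is submodular
print(is_submodular(lambda A: len(A) ** 2, {1, 2, 3}))  # |A|^2 is not (supermodular)
```

Checking all pairs A, B is exponential, so this only works for toy ground sets — but it makes the definition concrete.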
8 Intuitive Example: Shopping. F(A) = amount of time spent shopping for the items in A. [Figure: cart contents omitted — the time added by one more item to a small cart A exceeds the time it adds to a larger cart B ⊇ A.] For A ⊆ B, s ∉ B: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B). 8
9 Example: Entropy. Given random variables X_1, …, X_N with index set V: F(A) = H(X_A) = −Σ_{x_A} p(x_A) log p(x_A). Proof: let A ⊆ B ⊆ V and s ∈ V \ B; need to show H(X_{A∪{s}}) − H(X_A) ≥ H(X_{B∪{s}}) − H(X_B), i.e. H(X_s | X_A) ≥ H(X_s | X_B) — "information never hurts". JB 9
10 Example: Entropy. Proof #2 (mutual information): let A, B ⊆ V. Since I(X_A; X_B | X_{A∩B}) ≥ 0 and I(X_A; X_B | X_{A∩B}) = H(X_A) + H(X_B) − H(X_{A∪B}) − H(X_{A∩B}), we get H(X_A) + H(X_B) ≥ H(X_{A∪B}) + H(X_{A∩B}). 10
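The two proofs above can be sanity-checked numerically. A minimal sketch (my own, not from the slides): draw a random joint distribution over three binary variables, compute marginal entropies, and verify the submodular inequality on every pair of subsets:

```python
import itertools, math, random

random.seed(0)
# Random joint distribution over three binary variables X0, X1, X2.
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
Z = sum(weights)
p = {o: w / Z for o, w in zip(outcomes, weights)}

def H(A):
    """Entropy of the marginal distribution of the variables indexed by A."""
    marg = {}
    for o, pr in p.items():
        key = tuple(o[i] for i in sorted(A))
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(pr * math.log2(pr) for pr in marg.values() if pr > 0)

V = {0, 1, 2}
subs = [frozenset(c) for r in range(4) for c in itertools.combinations(V, r)]
# Submodularity: H(A) + H(B) >= H(A u B) + H(A n B) for all A, B.
ok = all(H(A) + H(B) >= H(A | B) + H(A & B) - 1e-9 for A in subs for B in subs)
print(ok)  # True: joint entropy is submodular
```

Any joint distribution passes this check, which is exactly what the proofs guarantee.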
11 Example: Mutual information. Given random variables X_1, …, X_N with index set V: F(A) = I(X_A; X_{V\A}) = H(X_{V\A}) − H(X_{V\A} | X_A). The marginal gain is F(A ∪ {s}) − F(A) = H(X_s | X_A) − H(X_s | X_{V\(A∪{s})}). The first term is nonincreasing in A and the second is nondecreasing in A, so F(A ∪ {s}) − F(A) is monotonically nonincreasing, hence F is submodular. 11
12 Quick Definition: [Super]modular. A set function F: 2^V → R on V = {1, 2, …, n} is supermodular iff for all A, B ⊆ V: F(A) + F(B) ≤ F(A ∪ B) + F(A ∩ B) — increasing returns or "synergy"; equivalently, −F is submodular. F is modular if it is both submodular and supermodular. Dr. Steven Pinker (Harvard psychologist), answering the NYT question "What scientific concept would improve everybody's cognitive toolkit?": Emergent systems are ones in which many different elements interact. The pattern of interaction then produces a new element that is greater than the sum of the parts, which then exercises a top-down influence on the constituent elements. JB 12
13 Quiz. V = {1, …, n}, A ⊆ V. Define the characteristic vector w^A = (w^A_1, …, w^A_n), where w^A_i = 1 if i ∈ A and 0 otherwise. Let F(A) = (w^A)^T r with r_i ∈ R, r ∈ R^n. Prove F(A) is submodular. Submodular definition: for A ⊆ B, s ∉ B: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B). 13
14 Closedness properties. Let F_1, …, F_m be submodular functions on V and λ_1, …, λ_m > 0. Then F(A) = Σ_i λ_i F_i(A) is submodular: submodularity is closed under nonnegative linear combinations. Very useful: if F_θ(A) is submodular for each θ, then Σ_θ P(θ) F_θ(A) is submodular. 14
15 Submodularity and convexity. V = {1, …, n}, A ⊆ V; characteristic vector w^A = (w^A_1, …, w^A_n) where w^A_i = 1 if i ∈ A, 0 otherwise. Very important Theorem [Lovász 1983]: every submodular function F(·) induces a function g(·) on R^n s.t.: g(w) is convex; F(A) = g(w^A) for all A ⊆ V; min_A F(A) = min_w g(w) s.t. w ∈ [0, 1]^n. But what is g(w)? 15
16 Submodular Polytope. P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = Σ_{i∈A} x_i. Example with V = {a, b}: F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0. [Figure: P_F in the (x_{a}, x_{b}) plane, carved out by x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b}).] AK 16
17 Lovász extension [Lovász 1983]. g(w) = max_{x∈P_F} w^T x, where P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}. Writing x_w = argmax_{x∈P_F} w^T x, we have g(w) = w^T x_w. Quiz: we defined g(w), but evaluating g(w) involves solving an LP with exponentially many constraints — why exponentially many? AK, JB 17
18 Evaluating the Lovász extension. g(w) = max_{x∈P_F} w^T x, P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}. Very Important Theorem [Edmonds 71, Lovász 83]: for any given w, an optimal solution x_w to the LP is obtained by the following greedy algorithm. Order V = {e_1, …, e_n} such that w(e_1) ≥ … ≥ w(e_n). Let x_w(e_i) = F({e_1, …, e_i}) − F({e_1, …, e_{i−1}}). Then w^T x_w = g(w) = max_{x∈P_F} w^T x. 18
19 Example: Lovász extension. Same F as before: F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0. Take w = [0, 1]; we want g(w) = max_{x∈P_F} w^T x. Greedy ordering: e_1 = b, e_2 = a; x_w(e_1) = F({b}) − F(∅) = 2, x_w(e_2) = F({b, a}) − F({b}) = −2. So g([0, 1]) = [0, 1]^T [−2, 2] = 2 = F({b}). Similarly g([1, 1]) = [1, 1]^T [−1, 1] = 0 = F({a, b}). AK 19
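The Edmonds greedy rule above is only a few lines of code. A minimal sketch (my own names, using the toy F from this example) that reproduces g([0,1]) = 2 and g([1,1]) = 0:

```python
def lovasz(F, V, w):
    """Evaluate the Lovasz extension g(w) via the Edmonds greedy algorithm."""
    order = sorted(V, key=lambda e: -w[e])   # sort elements by descending weight
    g, prefix, prev = 0.0, set(), F(frozenset())
    for e in order:
        prefix.add(e)
        cur = F(frozenset(prefix))
        g += w[e] * (cur - prev)             # x_w(e_i) = F(prefix) - F(previous prefix)
        prev = cur
    return g

# Toy function from the slides: F({})=0, F({a})=-1, F({b})=2, F({a,b})=0.
table = {frozenset(): 0, frozenset('a'): -1,
         frozenset('b'): 2, frozenset('ab'): 0}
F = table.__getitem__
print(lovasz(F, 'ab', {'a': 0, 'b': 1}))  # 2.0 = F({b})
print(lovasz(F, 'ab', {'a': 1, 'b': 1}))  # 0.0 = F({a,b})
```

Note how the exponential LP collapses to one sort plus n evaluations of F, which is the whole point of the theorem.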
20 Lovász Extension: useful? Theorem [Lovász 1983]: g(w) attains its minimum over [0, 1]^n at a corner of the cube. Translation: the corners correspond to the characteristic vectors, so minimizing g(w) minimizes F(A)! How to minimize? 20
21 21 slide credit: Satoru Iwata
22 Submodular Minimization in Practice. Base polytope: B_F = P_F ∩ {x : x(V) = F(V)}. Minimum-norm algorithm: 1. Find x* = argmin ‖x‖² s.t. x ∈ B_F. 2. Return A* = {i : x*(i) < 0}. For the running example (F(∅) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0): x* = [−1, 1], so A* = {a}. Theorem [Fujishige 1991]: A* is an optimal solution. Note: step 1 is solved via Wolfe's algorithm; runtime is finite but unknown. AK 22
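For a ground set of two elements the min-norm answer is easy to cross-check by exhaustive enumeration. A minimal sketch (my own, exponential and only for tiny V) confirming that A* = {a} from the slide is indeed the minimizer:

```python
from itertools import combinations

def brute_min(F, V):
    """Exhaustive submodular minimization (exponential; fine for tiny V)."""
    best = min((frozenset(c) for r in range(len(V) + 1)
                for c in combinations(V, r)), key=F)
    return best, F(best)

# Running example: F({})=0, F({a})=-1, F({b})=2, F({a,b})=0.
table = {frozenset(): 0, frozenset('a'): -1,
         frozenset('b'): 2, frozenset('ab'): 0}
A_star, val = brute_min(table.__getitem__, 'ab')
print(sorted(A_star), val)  # ['a'] -1  -- matches A* = {a} from the min-norm run
```

The min-norm algorithm gets the same answer in polynomial time per iteration, which is what makes it usable beyond toy sizes.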
24 Empirical Justification [Fujishige et al 2006]. [Figure: running time (seconds) of the min-norm algorithm vs. problem size.] AK DEMO 24
25 MNP: Super-fast! Blue: [Fujishige 2006] ; Red: [Garcia 2012] 25
26 Symmetric Submodular Functions. If F(A) = F(V \ A) then we can minimize in O(n³) (NB: over nontrivial solutions, i.e. excluding ∅ and V). Queyranne's Algorithm: a straightforward split/merge algorithm over element sets; takes around 10 minutes to explain; implemented in Krause's Matlab toolbox. Summary: general submodular functions can be exactly minimized with the min-norm algorithm; symmetric submodular functions can be exactly minimized with Queyranne's algorithm in O(n³). 26
27 Example: Q-Clustering [Narasimhan, Jojic, Bilmes NIPS 2005]. [Figure: data points V partitioned into clusters A_1 and A_2.] Group data points V into homogeneous clusters: find a partition that minimizes F(A_1, …, A_k) = Σ_i E(A_i), e.g. with E the entropy H(A) or a cut function. Theorem: F(P) ≤ (2 − 2/k) F(P_opt) for the returned partition P. For k = 2 and symmetric submodular F(A): use Queyranne's algorithm and obtain the optimum! First algorithm for finding the optimal MDL clustering for k = 2 — and it provides optimality guarantees for all symmetric submodular partitions! AK 27
28 Example: MAP for Pairwise MRFs. E(x) = Σ_p θ_p(x_p) + Σ_{p,q} θ_pq(x_p, x_q): the x_p are discrete variables (e.g., x_p ∈ {0,1}); the θ_p(·) are unary potentials (data terms); the θ_pq(·,·) are pairwise potentials (coherence terms). E is submodular iff θ_pq(0,0) + θ_pq(1,1) ≤ θ_pq(0,1) + θ_pq(1,0) for every pair (p, q). Pairwise submodular functions can be minimized in O(n³) — O(n) in practice! (via the st-mincut algorithm) 28
29 Example: MAP for Pairwise MRFs. [Figure: the energy E(x) encoded as an st-mincut between source S and sink T, and the resulting solution.] PK 29
30 Sub-example: Image Segmentation PK 30
31 Sub-example: Image Segmentation. E(x | θ) = Σ_p θ_p x_p + Σ_{p,q} θ_pq x_p (1 − x_q), with x_p ∈ {0,1} (0 = fg, 1 = bg); the first sum is the likelihood term, the second the regularity term. E: {0,1}^n → R. Can find the global minimum using the min-cut/max-flow algorithm. PK 31
32 Sub-example: Image Segmentation. E(a_1, a_2) = 2a_1 + 5ā_1 + 9a_2 + 4ā_2 + 2a_1ā_2 + ā_1a_2, encoded as an st-graph over nodes a_1, a_2. Augmenting Paths Algorithm: 1. Find a path from source to sink with positive capacity. 2. Push the maximum possible flow through this path. 3. Repeat until no path can be found. The cut shown has value 11 = E(1,1). 32
33 Sub-example: Image Segmentation. E(a_1, a_2) = 2a_1 + 5ā_1 + 9a_2 + 4ā_2 + 2a_1ā_2 + ā_1a_2. [Figure: max flow pushed through the st-graph from source to sink.] Augmenting Paths Algorithm: 1. Find a path from source to sink with positive capacity. 2. Push the maximum possible flow through this path. 3. Repeat until no path can be found. 33
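The two-variable energy above is small enough to solve end to end. A minimal sketch (my own graph construction and a plain Edmonds–Karp, not the slides' code): each unary term becomes a source- or sink-side edge and each pairwise term an edge between variable nodes, so the min st-cut equals min_{a1,a2} E(a_1, a_2):

```python
from collections import deque, defaultdict

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting paths."""
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:          # BFS for an augmenting path
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t                        # recover the path s -> t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)  # bottleneck capacity
        for u, v in path:                      # update residual capacities
            cap[u][v] -= push
            cap[v][u] += push
        flow += push

# E(a1,a2) = 2*a1 + 5*(1-a1) + 9*a2 + 4*(1-a2) + 2*a1*(1-a2) + (1-a1)*a2
cap = defaultdict(lambda: defaultdict(int))
for u, v, c in [('s', 1, 5), (1, 't', 2),     # unary terms for a1
                ('s', 2, 4), (2, 't', 9),     # unary terms for a2
                (1, 2, 2), (2, 1, 1)]:        # pairwise terms
    cap[u][v] += c
flow_val = max_flow(cap, 's', 't')
print(flow_val)  # 8 = min over {0,1}^2 of E(a1, a2), attained at (a1, a2) = (1, 0)
```

Here a node ends up on the source side of the cut exactly when its variable takes value 1; brute-forcing the four assignments confirms the minimum energy is 8.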
34 History of Maxflow Algorithms. Augmenting-path and push-relabel methods (n: #nodes, m: #edges, U: maximum edge weight). Predates the Edmonds & Lovász work! [Slide credit: Andrew Goldberg; from PK] 34
35 Sub-example: Image denoising image credit: 35
36 Sub-example: Image denoising. Pairwise Markov Random Field over observed pixels X_i and true pixels Y_i: P(x_1, …, x_n, y_1, …, y_n) = Π_{i,j} ψ_{i,j}(y_i, y_j) Π_i φ_i(x_i, y_i). Want argmax_y P(y | x) = argmax_y log P(x, y) = argmin_y Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i), where E_{i,j}(y_i, y_j) = −log ψ_{i,j}(y_i, y_j). Suppose the y_i are binary; let F(A) = E(y^A), where y^A_i = 1 iff i ∈ A. Then min_y E(y) = min_A F(A). AK 36
37 maximizing submodular functions 37
38 Concave or Convex? Theorem: suppose g: N → R and F(A) = g(|A|). Then F(A) is submodular if and only if g is concave. (Recall: for A ⊆ B, s ∉ B, F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B).) 38
39 Maximizing convex functions: NP-hard? Maximizing submodular functions: NP-hard? Yes, but in some cases we have approximability guarantees. 39
40 Maximizing Submodular Functions: approximability under various constraints. Monotone with cardinality constraints: 1 − 1/e (greedy). Monotone with matroid constraints: 1/2 (greedy). Nonnegative symmetric submodular: 1/2. Nonnegative submodular: 1/3. 40
41 Example: Max Cover (NP-Hard). Place sensors in a building: we want to cover the floorplan with discs. Possible locations V; for A ⊆ V, F(A) = area covered by sensors placed at A. Goal: place k sensors that cover as much area as possible. AK 41
42 Example: max cover is submodular. [Figure: adding sensor S' to A = {S_1, S_2} covers more new area than adding it to B = {S_1, S_2, S_3, S_4} ⊇ A, i.e. F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B).] AK 42
43 Greedy Max Cover Algorithm. Repeat: pick the location that covers the maximum uncovered area; mark the area covered by the new sensor as covered; until k sensors are placed. GREEDY ≥ (1 − 1/e) OPT: best possible [Feige 1998]. 43
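The loop on this slide is a one-screen program. A minimal sketch (my own data — discrete cells standing in for covered area, names like `regions` are illustrative):

```python
def greedy_max_cover(sets, k):
    """Pick k sets greedily, each time maximizing the newly covered area."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(sets, key=lambda i: len(sets[i] - covered))  # max marginal gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

# Each candidate sensor location covers a set of floorplan cells.
regions = {'S1': {1, 2, 3, 4}, 'S2': {3, 4, 5}, 'S3': {5, 6}, 'S4': {7}}
chosen, covered = greedy_max_cover(regions, k=2)
print(chosen, len(covered))  # ['S1', 'S3'] 6
```

Note that greedy skips S2 in the second round: although S2 is the second-largest set, most of it is already covered by S1 — exactly the diminishing-returns effect submodularity formalizes.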
44 Submodularity embraces Greed. Theorem [Nemhauser et al 1978]: Greedy gives a (1 − 1/e)-approximation for maximizing monotone submodular functions,* (*subject to cardinality constraints): F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A). Common scenario: best possible guarantee unless NP = P. 44
45 Theorem [Nemhauser et al 1978]: Greedy gives a (1 − 1/e)-approximation for maximizing monotone submodular functions, subject to cardinality constraints. Proof: let S_i be the first i elements selected by greedy and C a set with F(C) = OPT, |C| ≤ k. Show by induction: F(C) − F(S_i) ≤ (1 − 1/k)^i F(C). Case i = 0: F(C) − F(S_0) ≤ F(C). In step i > 0, greedy selects the element e_i maximizing the marginal gain F_{S_{i−1}}(e) := F(S_{i−1} ∪ {e}) − F(S_{i−1}). By submodularity (and monotonicity): F(C) − F(S_{i−1}) ≤ Σ_{e ∈ C \ S_{i−1}} F_{S_{i−1}}(e), implying F_{S_{i−1}}(e_i) ≥ (1/|C \ S_{i−1}|) Σ_{e ∈ C \ S_{i−1}} F_{S_{i−1}}(e) ≥ (1/k)(F(C) − F(S_{i−1})). 45
46 Proof continued: S_i are the first i elements selected by greedy, F(C) = OPT; we show F(C) − F(S_i) ≤ (1 − 1/k)^i F(C). In step i > 0, greedy selects e_i with F_{S_{i−1}}(e_i) ≥ (1/k)(F(C) − F(S_{i−1})), so F(C) − F(S_i) = F(C) − F(S_{i−1}) − F_{S_{i−1}}(e_i) ≤ F(C) − F(S_{i−1}) − (1/k)(F(C) − F(S_{i−1})) = (1 − 1/k)(F(C) − F(S_{i−1})) ≤ (1 − 1/k)^i F(C) ≤ (1/e) F(C) at i = k, i.e. F(S_k) ≥ (1 − 1/e) F(C). 46
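The bound just proved can be witnessed empirically: run greedy on a random monotone submodular (coverage) instance and compare against the brute-force optimum. A minimal sketch (my own random instance; sizes chosen so the exhaustive OPT is cheap):

```python
from itertools import combinations
import math, random

random.seed(1)
# 8 random sets over a universe of 20 cells; F(A) = |union of chosen sets|.
ground = [set(random.sample(range(20), 6)) for _ in range(8)]
F = lambda A: len(set().union(*(ground[i] for i in A))) if A else 0
k = 3

# Greedy: repeatedly add the element with the largest marginal gain.
S = []
for _ in range(k):
    e = max(set(range(8)) - set(S), key=lambda i: F(S + [i]))
    S.append(e)

# Brute-force optimum over all subsets of size k.
OPT = max(F(list(c)) for c in combinations(range(8), k))
print(F(S) >= (1 - 1/math.e) * OPT)  # True: the Nemhauser et al. guarantee holds
```

In practice greedy is often far closer to OPT than the worst-case (1 − 1/e) factor; the theorem only pins down the floor.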
47 Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD 03]. [Figure: graph over Alice, Bob, Charlie, Dorothy, Eric, Fiona; each edge carries a probability of influencing, e.g. 0.2, 0.5.] Who should get free cell phones? V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}; F(A) = expected number of people influenced when targeting A. Key idea: flip coins in advance → "live" edges. slide credit: AK 47
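The "flip coins in advance" idea can be sketched directly: sample a live-edge graph, then the influence of a seed set is just reachability. Below is a minimal sketch using the names from the slide; the edge set and most probabilities are my own illustrative assumptions (the slide only shows 0.2 and 0.5 on two edges):

```python
import random

# Hypothetical influence graph; edge weights are influence probabilities.
edges = {('Alice', 'Bob'): 0.5, ('Alice', 'Charlie'): 0.3,
         ('Bob', 'Dorothy'): 0.2, ('Charlie', 'Eric'): 0.4,
         ('Dorothy', 'Fiona'): 0.5}

def reachable(seeds, live):
    """Nodes reachable from the seed set through live edges."""
    out, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for (a, b) in live:
            if a == u and b not in out:
                out.add(b)
                stack.append(b)
    return out

def spread(seeds, n_samples=2000, seed=0):
    """Monte-Carlo estimate of expected influence: flip all coins in advance."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        live = {e for e, p in edges.items() if rng.random() < p}  # live-edge graph
        total += len(reachable(seeds, live))
    return total / n_samples

print(spread({'Alice'}) > spread(set()))  # targeting someone beats targeting no one
```

Each sampled live-edge graph makes F a coverage-like (hence submodular) function of the seed set, so averaging over samples keeps F submodular — which is why the greedy (1 − 1/e) guarantee applies here.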
48 Adaptive Submodularity [Golovin & Krause 2010]. Key idea: optimize over policies (decision trees) instead of sets. Objective function F(A, x_V), where A is the set of actions taken and x_V the realization of a set of random variables. Expected marginal benefit conditioned on the observations: Δ(e | x_A) = Σ_{x_V} P(x_V | x_A) [F(A ∪ {e}, x_V) − F(A, x_V)]. Adaptive submodularity: Δ(e | x_A) ≥ Δ(e | x_B) for A ⊆ B. Adaptive monotonicity: Δ(e | x_A) ≥ 0 for all e, x_A. Theorem: the adaptive greedy algorithm returns a policy π_greedy with G(π_greedy) ≥ (1 − 1/e) G(π_opt), where G(π) = E[F(π(x_V), x_V)]. (There are many more policies than sets.) 48
49 Adaptive Submodularity Example. Prior over diseases P(Y); likelihood of outcomes P(X_V | Y). Suppose P(X_V | Y) is deterministic (noise-free); each test eliminates hypotheses y. How should we test to eliminate all incorrect hypotheses? Generalized binary search — equivalent to maximizing info-gain. Example: Y = Sick; X_1 = Fever, X_2 = Age, X_3 = Biopsy Result. 49
50 Summary: Submodularity in ML Minimization Clustering MAP inference MRF (computer vision) Structure learning* Maximization Active learning Ranking* Feature Selection 50
51 DEMO See toolbox: 51
52 Resources. "Beyond convexity: submodularity in machine learning" (Krause); Jeff Bilmes' Submodular Optimization class: ee595a_spring_2011/; Pushmeet Kohli's video lecture on MAP in MRFs; Coursera courses on discrete optimization.
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationA Note on the Budgeted Maximization of Submodular Functions
A Note on the udgeted Maximization of Submodular Functions Andreas Krause June 2005 CMU-CALD-05-103 Carlos Guestrin School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract Many
More informationTufts COMP 135: Introduction to Machine Learning
Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)
More informationStructure Learning: the good, the bad, the ugly
Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 Structure Learning: the good, the bad, the ugly Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 29 th, 2006 1 Understanding the uniform
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,
More informationAPPROXIMATION ALGORITHMS RESOLUTION OF SELECTED PROBLEMS 1
UNIVERSIDAD DE LA REPUBLICA ORIENTAL DEL URUGUAY IMERL, FACULTAD DE INGENIERIA LABORATORIO DE PROBABILIDAD Y ESTADISTICA APPROXIMATION ALGORITHMS RESOLUTION OF SELECTED PROBLEMS 1 STUDENT: PABLO ROMERO
More informationCSC 411 Lecture 3: Decision Trees
CSC 411 Lecture 3: Decision Trees Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 03-Decision Trees 1 / 33 Today Decision Trees Simple but powerful learning
More informationRevisiting the Greedy Approach to Submodular Set Function Maximization
Submitted to manuscript Revisiting the Greedy Approach to Submodular Set Function Maximization Pranava R. Goundan Analytics Operations Engineering, pranava@alum.mit.edu Andreas S. Schulz MIT Sloan School
More informationGuaranteeing Solution Quality for SAS Optimization Problems by being Greedy
Guaranteeing Solution Quality for SAS Optimization Problems by being Greedy Ulrike Stege University of Victoria ustege@uvic.ca Preliminary results of this in: S. Balasubramanian, R.J. Desmarais, H.A. Müller,
More informationAdaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization
Adaptive Submodularity: A New Approach to Active Learning and Stochastic Optimization Daniel Golovin California Institute of Technology Pasadena, CA 91125 dgolovin@caltech.edu Andreas Krause California
More informationProbability Theory for Machine Learning. Chris Cremer September 2015
Probability Theory for Machine Learning Chris Cremer September 2015 Outline Motivation Probability Definitions and Rules Probability Distributions MLE for Gaussian Parameter Estimation MLE and Least Squares
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationMassachusetts Institute of Technology 6.854J/18.415J: Advanced Algorithms Friday, March 18, 2016 Ankur Moitra. Problem Set 6
Massachusetts Institute of Technology 6.854J/18.415J: Advanced Algorithms Friday, March 18, 2016 Ankur Moitra Problem Set 6 Due: Wednesday, April 6, 2016 7 pm Dropbox Outside Stata G5 Collaboration policy:
More informationCS 188: Artificial Intelligence. Machine Learning
CS 188: Artificial Intelligence Review of Machine Learning (ML) DISCLAIMER: It is insufficient to simply study these slides, they are merely meant as a quick refresher of the high-level ideas covered.
More informationCSCE 478/878 Lecture 6: Bayesian Learning
Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell
More informationRandomized Algorithms
Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours
More informationQualifying Exam in Machine Learning
Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts
More informationMaximization of Submodular Set Functions
Northeastern University Department of Electrical and Computer Engineering Maximization of Submodular Set Functions Biomedical Signal Processing, Imaging, Reasoning, and Learning BSPIRAL) Group Author:
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationEnergy minimization via graph-cuts
Energy minimization via graph-cuts Nikos Komodakis Ecole des Ponts ParisTech, LIGM Traitement de l information et vision artificielle Binary energy minimization We will first consider binary MRFs: Graph
More informationActive Learning and Optimized Information Gathering
Active Learning and Optimized Information Gathering Lecture 7 Learning Theory CS 101.2 Andreas Krause Announcements Project proposal: Due tomorrow 1/27 Homework 1: Due Thursday 1/29 Any time is ok. Office
More informationJunction Tree, BP and Variational Methods
Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,
More informationComputational Cognitive Science
Computational Cognitive Science Lecture 8: Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk Based on slides by Sharon Goldwater October 14, 2016 Frank Keller Computational
More informationSubmodular Functions, Optimization, and Applications to Machine Learning
Submodular Functions, Optimization, and Applications to Machine Learning Spring Quarter, Lecture 4 http://www.ee.washington.edu/people/faculty/bilmes/classes/ee563_spring_2018/ Prof. Jeff Bilmes University
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationCS675: Convex and Combinatorial Optimization Fall 2014 Combinatorial Problems as Linear Programs. Instructor: Shaddin Dughmi
CS675: Convex and Combinatorial Optimization Fall 2014 Combinatorial Problems as Linear Programs Instructor: Shaddin Dughmi Outline 1 Introduction 2 Shortest Path 3 Algorithms for Single-Source Shortest
More informationarxiv: v1 [math.oc] 1 Jun 2015
NEW PERFORMANCE GUARANEES FOR HE GREEDY MAXIMIZAION OF SUBMODULAR SE FUNCIONS JUSSI LAIILA AND AE MOILANEN arxiv:1506.00423v1 [math.oc] 1 Jun 2015 Abstract. We present new tight performance guarantees
More information1 Matroid intersection
CS 369P: Polyhedral techniques in combinatorial optimization Instructor: Jan Vondrák Lecture date: October 21st, 2010 Scribe: Bernd Bandemer 1 Matroid intersection Given two matroids M 1 = (E, I 1 ) and
More information