Lecture 8: Inference. Kai-Wei Chang University of Virginia

Lecture 8: Inference
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Some slides are adapted from Vivek Srikumar's course on Structured Prediction.
Advanced ML: Inference

So far what we learned
- Thinking about structures
  - A graph, a collection of parts that are labeled jointly, a collection of decisions
  - [Figure: two example graphs over the nodes A through G]
- Next: Prediction
  - Sets structured prediction apart from binary/multiclass classification

The bigger picture
- The goal of structured prediction: predicting a graph
- Modeling: defining probability distributions over the random variables
  - Involves making independence assumptions
- Inference: the computational step that actually constructs the output
  - Also called decoding
- Learning creates functions that score predictions (e.g., learning model parameters)

Computational issues
- Data annotation difficulty
- Model definition: What are the parts of the output? What are the inter-dependencies?
- How to train the model?
- Background knowledge about the domain
- How to do inference?
- Semi-supervised/indirectly supervised?
Inference: deriving the probability of one or more random variables based on the model

Inference
- What is inference?
  - An overview of what we have seen before
  - Combinatorial optimization
  - Different views of inference
- Integer programming
- Graph algorithms
  - Sum-product, max-sum
- Heuristics for inference
  - LP relaxation
  - Sampling

Remember sequence prediction
- Goal: find the most probable/highest-scoring state sequence
  - $\arg\max_y \text{score}(y) = \arg\max_y w^\top \phi(x, y)$
- Computationally: discrete optimization
- The naïve algorithm
  - Enumerate all sequences, score each one, and pick the max
  - Terrible idea!
- We can do better
  - Scores decompose over edges

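The Viterbi algorithm: Recurrence
Goal: find $\arg\max_y w^\top \phi(x, y)$, where $y = (y_1, y_2, \ldots, y_n)$
[Figure: chain $y_1, y_2, y_3, \ldots, y_n$]
Idea
1. If I know the score of all sequences $y_1$ to $y_{n-1}$, then I could decide $y_n$ easily
2. Recurse to get the score up to $y_{n-1}$
$\text{score}_i(s) = \max_{y_{i-1}} \left[ \text{score}_{i-1}(y_{i-1}) + \text{score\_local}_i(y_{i-1}, s) \right]$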
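The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the argument names `start` and `trans` and the list-of-tables layout are assumptions made for the example.

```python
def viterbi(start, trans):
    """Max-sum Viterbi on a chain (illustrative sketch).

    start[s]       : score of label s at the first position (assumed layout)
    trans[i][p][s] : local score of label s at position i+1 given label p at position i
    Returns (best_score, best_label_sequence).
    """
    n_labels = len(start)
    score = list(start)
    backptrs = []
    for step in trans:
        new_score, ptr = [], []
        for s in range(n_labels):
            # Best previous label for current label s.
            best_p = max(range(n_labels), key=lambda p: score[p] + step[p][s])
            ptr.append(best_p)
            new_score.append(score[best_p] + step[best_p][s])
        score = new_score
        backptrs.append(ptr)
    # Decide the last label, then follow the back pointers.
    last = max(range(n_labels), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path
```

The work is linear in the sequence length and quadratic in the number of labels, instead of exponential as in full enumeration.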

Inference questions
- This class: mostly we use inference to mean "What is the highest-scoring assignment to the output random variables for a given input?"
  - Maximum a posteriori (MAP) inference (if the score is probabilistic)
- Other inference questions
  - What is the highest-scoring assignment to some of the output variables given the input?
  - Sample from the posterior distribution over Y
  - Loss-augmented inference: which structure most violates the margin for a given scoring function?
  - Computing marginal probabilities over Y

MAP inference is discrete optimization
- A combinatorial problem
- Computational complexity depends on
  - The size of the input
  - The factorization of the scores
  - More complex factors generally lead to more expensive inference
- A generally bad strategy in all but the simplest cases: enumerate all possible structures and pick the highest-scoring one

MAP inference is search
- We want the graph that has the highest-scoring structure: $\arg\max_y w^\top \phi(x, y)$
- Without assumptions, no algorithm can find the max without considering every possible structure
- How can we solve this computational problem?
  - Exploit the structure of the search space and the cost function
  - That is, exploit the decomposition of the scoring function
  - Usually stronger assumptions lead to easier inference
  - E.g., consider 10 independent random variables
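To make the "10 independent random variables" remark concrete, here is a small sketch. It assumes the total score is a sum of per-variable scores; the function names and the `scores[i][label]` layout are hypothetical.

```python
from itertools import product

def argmax_joint(scores):
    """Enumerate every joint assignment: n_labels ** n_variables candidates."""
    n_labels = len(scores[0])
    return max(product(range(n_labels), repeat=len(scores)),
               key=lambda y: sum(scores[i][y[i]] for i in range(len(y))))

def argmax_independent(scores):
    """With full independence, each variable can be maximized on its own."""
    return tuple(max(range(len(row)), key=row.__getitem__) for row in scores)

def total(scores, y):
    """Total score of a joint assignment under the additive assumption."""
    return sum(scores[i][label] for i, label in enumerate(y))
```

With 10 variables and 3 labels, the first function scores 59,049 assignments while the second makes 10 three-way comparisons, and both reach the same maximum.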

Approaches for inference
- Exact vs. approximate inference
  - Should the maximization be performed exactly?
  - Or is a close-to-highest-scoring structure good enough?
  - Exact: search, dynamic programming, integer linear programming, ...
  - Heuristic (called approximate inference): Gibbs sampling, belief propagation, beam search, linear programming relaxations, ...
- Randomized vs. deterministic
  - Relevant for approximate inference: if I run the inference program twice, will I get the same answer?

Coming up
- Formulating general inference as integer linear programs
  - And variants of this idea
- Graph algorithms, dynamic programming, greedy search
  - We have seen the Viterbi algorithm: it uses a cleverly defined ordering to decompose the output into a sequence of decisions
  - We will talk about a general algorithm: max-product, sum-product
- Heuristics for inference
  - Sampling, Gibbs sampling
  - Approximate graph search, beam search
  - LP relaxation

Inference: Integer Linear Programs Advanced ML: Inference 18

The big picture
- MAP inference is combinatorial optimization
- Combinatorial optimization problems can be written as integer linear programs (ILPs)
  - The conversion is not always trivial
  - Allows injection of knowledge in the form of constraints
- Different ways of solving ILPs
  - Commercial solvers: CPLEX, Gurobi, etc.
  - Specialized solvers if you know something about your problem
    - Lagrangian relaxation, amortized inference, etc.
  - Can relax to linear programs and hope for the best
- Solving integer linear programs is NP-hard in general
  - No free lunch

Detour: Linear programming
- Minimizing a linear objective function subject to a finite number of linear constraints (equality or inequality)
- Very widely applicable
  - Operations research, micro-economics, management
- Historical note
  - Developed during World War II to reduce army costs
  - "Programming" is not the same as computer programming

Example: The diet problem
A student wants to spend as little money on food as possible while getting a sufficient amount of vitamin Z and nutrient X. Her options are:

Item                 Cost/100g   Vitamin Z   Nutrient X
Carrots              2           4           0.4
Sunflower seeds      6           10          4
Double cheeseburger  0.3         0.01        2

How should she spend her money to get at least 5 units of vitamin Z and 3 units of nutrient X?
Let c, s and d denote how much of each item is purchased.
- Minimize total cost: $\min_{c, s, d} \; 2c + 6s + 0.3d$
- At least 5 units of vitamin Z: $4c + 10s + 0.01d \ge 5$
- At least 3 units of nutrient X: $0.4c + 4s + 2d \ge 3$
- The number of units purchased is not negative: $c, s, d \ge 0$

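Linear programming
- In general: optimize a linear objective $c^\top x$ subject to linear constraints $Ax \le b$
  - Both the objective and the constraints are linear in $x$
  - Minimization and maximization are interchangeable by negating $c$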

Linear programming
- In general
  - This is a continuous optimization problem
  - And yet, there is only a finite set of candidate solutions
- For example: suppose we had to maximize any $c^\top x$ on this region
  - [Figure: the triangular region cut out of the plane $a_1 x_1 + a_2 x_2 + a_3 x_3 = b$ by the axes $x_1$, $x_2$, $x_3$, with an objective direction $c$]
  - These three vertices are the only possible solutions!

Linear programming
- In general
  - This is a continuous optimization problem
  - And yet, there is only a finite set of candidate solutions
  - The constraint matrix defines a polytope
  - Only the vertices or faces of the polytope can be solutions
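As an illustration of the vertex property, the sketch below solves the diet problem from earlier in the lecture by brute-force vertex enumeration. The objective and constraints are read off the diet table and are a reconstruction; the helper names are hypothetical, and real LP solvers do not enumerate vertices this way.

```python
from fractions import Fraction as Fr
from itertools import combinations

COST = [Fr(2), Fr(6), Fr("0.3")]            # carrots, seeds, cheeseburger
# Each constraint is (a, b), meaning a . x >= b.
CONSTRAINTS = [
    ([Fr(4), Fr(10), Fr("0.01")], Fr(5)),   # vitamin Z
    ([Fr("0.4"), Fr(4), Fr(2)], Fr(3)),     # nutrient X
    ([Fr(1), Fr(0), Fr(0)], Fr(0)),         # c >= 0
    ([Fr(0), Fr(1), Fr(0)], Fr(0)),         # s >= 0
    ([Fr(0), Fr(0), Fr(1)], Fr(0)),         # d >= 0
]

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(rows, rhs):
    """Solve a 3x3 linear system with Cramer's rule; None if singular."""
    d = det3(rows)
    if d == 0:
        return None
    sol = []
    for k in range(3):
        m = [list(r) for r in rows]
        for r in range(3):
            m[r][k] = rhs[r]
        sol.append(det3(m) / d)
    return sol

def feasible_vertices():
    """Points where three constraints are tight and all others hold."""
    for trio in combinations(CONSTRAINTS, 3):
        x = solve3([a for a, _ in trio], [b for _, b in trio])
        if x is not None and all(
                sum(ai * xi for ai, xi in zip(a, x)) >= b
                for a, b in CONSTRAINTS):
            yield x

def diet_optimum():
    def cost(x):
        return sum(ci * xi for ci, xi in zip(COST, x))
    best = min(feasible_vertices(), key=cost)
    return best, cost(best)
```

Because all costs are positive and all quantities are non-negative, the minimum exists, so checking the finitely many feasible vertices is enough for this small instance.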

Geometry of linear programming
- The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed)
- The objective defines a cost for every point in the space
- [Figure: polytope shaded by objective value; darker is higher]

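Integer linear programming
- In general: the same linear objective $c^\top x$ and linear constraints $Ax \le b$ as a linear program, with the additional requirement that every $x_i$ is an integer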

Geometry of integer linear programming
- The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed)
- The objective defines a cost for every point in the space
- Only integer points are allowed

Integer linear programming
- In general
  - Solving integer linear programs can be NP-hard!
  - LP relaxation: drop the integer constraints and hope for the best

0-1 integer linear programming
- In general
  - An instance of integer linear programs
  - Still NP-hard
- Geometry: we are only considering points that are vertices of the Boolean hypercube
  - Constraints prohibit certain vertices, e.g., only points within this region are allowed
  - The solution can be an interior point of the constraint set defined by $Ax \le b$
Questions?

Back to structured prediction
- Recall that we are solving $\arg\max_y w^\top \phi(x, y)$
  - The goal is to produce a graph
- The set of possible values that y can take is finite, but large
- General idea: frame the argmax problem as a 0-1 integer linear program
  - Allows addition of arbitrary constraints

Thinking in ILPs
Let's start with multi-class classification:
$\arg\max_{y \in \{A, B, C\}} w^\top \phi(x, y) = \arg\max_{y \in \{A, B, C\}} \text{score}(y)$
Introduce decision variables for each label
- $z_A = 1$ if output = A, 0 otherwise
- $z_B = 1$ if output = B, 0 otherwise
- $z_C = 1$ if output = C, 0 otherwise
Maximize the score: $\max_z \; \text{score}(A) z_A + \text{score}(B) z_B + \text{score}(C) z_C$
Pick exactly one label: $z_A + z_B + z_C = 1$, with $z_A, z_B, z_C \in \{0, 1\}$
We have taken a trivial problem (finding the highest-scoring element of a list) and converted it into a representation that is NP-hard in the worst case!
Lesson: don't solve multiclass classification with an ILP solver
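The 0-1 formulation can be checked on a toy instance. The sketch below enumerates all 0-1 vectors instead of calling an ILP solver, and the scores in the usage note are made-up values; it only shows that the constrained maximization recovers the plain argmax.

```python
from itertools import product

def multiclass_as_01_program(scores):
    """Maximize sum_y score(y) * z_y subject to sum_y z_y == 1, z_y in {0, 1}.

    scores: dict mapping label -> score (hypothetical input format).
    Returns (assignment of z, objective value).
    """
    labels = list(scores)
    best_z, best_val = None, float("-inf")
    for z in product((0, 1), repeat=len(labels)):
        if sum(z) != 1:          # constraint: pick exactly one label
            continue
        val = sum(scores[label] * zi for label, zi in zip(labels, z))
        if val > best_val:
            best_z, best_val = z, val
    return dict(zip(labels, best_z)), best_val
```

For example, `multiclass_as_01_program({"A": 0.4, "B": -1.0, "C": 1.3})` sets only the variable for C to 1, the same label that `max(scores, key=scores.get)` returns directly.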

ILP for general conditional models
[Figure: graph over inputs $x_1, x_2, x_3$ and outputs $y_1, y_2, y_3$]
- Suppose each $y_i$ can be A, B or C
- Introduce one decision variable for each part being assigned labels
  - $z_{1A}, z_{1B}, z_{1C}$
  - $z_{2A}, z_{2B}, z_{2C}$
  - $z_{13AA}, z_{13AB}, z_{13AC}, z_{13BA}, z_{13BB}, z_{13BC}, z_{13CA}, z_{13CB}, z_{13CC}$
  - $z_{23AA}, z_{23AB}, z_{23AC}, z_{23BA}, z_{23BB}, z_{23BC}, z_{23CA}, z_{23CB}, z_{23CC}$
- Each of these decision variables is associated with a score
Our goal: $\max_y \; w^\top \phi(x_1, y_1) + w^\top \phi(y_1, y_2, y_3) + w^\top \phi(x_3, y_2, y_3) + w^\top \phi(x_1, x_2, y_2)$
Questions?

ILP for general conditional models
[Figure: graph over inputs $x_1, x_2, x_3$ and outputs $y_1, y_2, y_3$]
- Suppose each $y_i$ can be A, B or C
- Introduce one decision variable for each part being assigned labels
- Each of these decision variables is associated with a score
- Not all decisions can exist together, e.g., $z_{13AB}$ implies $z_{1A}$ and $z_{3B}$
Our goal: $\max_y \; w^\top \phi(x_1, y_1) + w^\top \phi(y_1, y_2, y_3) + w^\top \phi(x_3, y_2, y_3) + w^\top \phi(x_1, x_2, y_2)$

Writing constraints as linear inequalities
- Exactly one of $z_{1A}, z_{1B}, z_{1C}$ can be true: $z_{1A} + z_{1B} + z_{1C} = 1$
- At least m of $z_{1A}, z_{1B}, z_{1C}$ should be true: $z_{1A} + z_{1B} + z_{1C} \ge m$
- At most m of $z_{1A}, z_{1B}, z_{1C}$ should be true: $z_{1A} + z_{1B} + z_{1C} \le m$
- Implication: $z_i \Rightarrow z_j$
  - Convert to a disjunction: $\neg z_i \lor z_j$ (at least one of "not $z_i$" or "$z_j$")
  - $1 - z_i + z_j \ge 1$
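A quick way to trust these encodings is to compare each logical statement with its linear inequality on every 0-1 assignment. The helper names below are hypothetical, and the check is exhaustive only for the small sizes it enumerates.

```python
from itertools import combinations, product

def implies(zi, zj):
    """Logical implication z_i -> z_j."""
    return bool((not zi) or zj)

def implies_linear(zi, zj):
    """Linear encoding: (1 - z_i) + z_j >= 1."""
    return (1 - zi) + zj >= 1

def at_most(zs, m):
    """Logical form: no group of m + 1 variables is all true."""
    return not any(all(sub) for sub in combinations(zs, m + 1))

def at_most_linear(zs, m):
    """Linear encoding: sum of the variables <= m."""
    return sum(zs) <= m

def encodings_agree(n=3):
    """Compare logical and linear forms on all assignments of n variables."""
    for zs in product((0, 1), repeat=n):
        for m in range(n + 1):
            if at_most(zs, m) != at_most_linear(zs, m):
                return False
    for zi, zj in product((0, 1), repeat=2):
        if implies(zi, zj) != implies_linear(zi, zj):
            return False
    return True
```

The same pattern extends to "exactly one" and "at least m" by swapping in the corresponding pair of functions.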

Integer linear programming for inference
- Easy to add additional knowledge
  - Specify it as Boolean formulas
  - Examples
    - If $y_1$ is an A, then $y_2$ or $y_3$ should be a B or C
    - No more than two A's allowed in the output
- Many inference problems have standard mappings to ILPs
  - Sequences, parsing, dependency parsing
- The encoding of the problem makes a difference in solving time
  - The mechanical encoding may not be efficient to solve
  - Generally: more complex constraints make solving harder

Exercise: Sequence labeling
Goal: find $\arg\max_y w^\top \phi(x, y)$, where $y = (y_1, y_2, \ldots, y_n)$
[Figure: chain $y_1, y_2, y_3, \ldots, y_n$]
How can this be written as an ILP?

ILP for inference: Remarks
- Many combinatorial optimization problems can be written as an ILP
  - Even the "easy"/polynomial ones
  - Given an ILP, checking whether it represents a polynomial problem is intractable in general
- ILPs are a general language for thinking about combinatorial optimization
  - The representation allows us to make general statements about inference
- Off-the-shelf solvers for ILPs are quite good
  - Gurobi, CPLEX
  - Use an off-the-shelf solver only if you can't solve your inference problem otherwise

Inference: Graph algorithms Belief Propagation Advanced ML: Inference 62

Variable elimination (motivation)
- Remember: we have a collection of inference variables that need to be assigned, $y = (y_1, y_2, \ldots)$
- General algorithm
  - First fix an ordering of the variables, say $(y_1, y_2, \ldots)$
  - Iteratively: find the best value for $y_i$ given the values of the previous neighbors
  - Use back pointers to find the final answer
- Viterbi is an instance of max-product variable elimination

Variable elimination example (max-sum)
[Figure: chain $y_1, y_2, y_3, \ldots, y_n$ where each variable takes a label in {A, B, C, D}, with a table of local scores (score_local) on each edge]
$\text{score}_1(s) = \text{score\_local}_1(\text{START}, s)$
$\text{score}_i(s) = \max_{y_{i-1}} \left[ \text{score}_{i-1}(y_{i-1}) + \text{score\_local}_i(y_{i-1}, s) \right]$
- First eliminate $y_1$
- Next eliminate $y_2$
- Next eliminate $y_3$
- ...
- We then have all the information to make a decision for $y_n$
Questions?

Two types of inference problems
- Marginals: find $P(x_i) = \sum_{x \setminus x_i} P(x)$
- Maximizer: find $x^* = \arg\max_x P(x)$
Probability in different domains

Belief Propagation
- BP provides an exact solution when there are no loops in the graph
  - Viterbi is a special case
- Otherwise, loopy BP provides an approximate solution
We use sum-product BP as the running example, where we want to compute the normalizer Z in
$P(x) = \frac{1}{Z} \prod_{(i,j)} f_{ij}(x_i, x_j) \prod_i g_i(x_i)$

Intuition
- An iterative process in which neighboring variables pass messages to each other: "I (variable $x_3$) think that you (variable $x_2$) belong in these states with various likelihoods"
- After enough iterations, the conversation is likely to converge to a consensus that determines the marginal probabilities of all the variables

Message
- Message from node i to node j: $m_{ij}(x_j)$
- A message is not a probability
  - It may not sum to 1
- A high value of $m_{ij}(x_j)$ means that node i believes the marginal value $P(x_j)$ to be high

Beliefs
- Estimated marginal probabilities are called beliefs
- Algorithm
  - Update messages until convergence
  - Then calculate beliefs

Message update
- To update the message from i to j, consider all messages flowing into i (except the one from j):
  $m_{ij}^{\text{new}}(x_j) = \sum_{x_i} f_{ij}(x_i, x_j) \, g_i(x_i) \prod_{k \in \text{Nbd}(i) \setminus j} m_{ki}^{\text{old}}(x_i)$

Sum-product vs. max-product
- The standard BP we just described is sum-product, used to estimate marginals
- A variant called max-product (or max-sum in log space) is used to estimate the MAP assignment

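Max-product
- Message update is the same as before, except that the sum is replaced by a max:
  $m_{ij}^{\text{new}}(x_j) = \max_{x_i} f_{ij}(x_i, x_j) \, g_i(x_i) \prod_{k \in \text{Nbd}(i) \setminus j} m_{ki}^{\text{old}}(x_i)$
- Belief: estimates the most likely states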

Recap Advanced ML: Inference 80

Example (Sum-Product)
Edge scores for a message from a node in the first state to a neighbor in the second state: A -> A: 1, A -> B: 2, B -> A: 3, B -> B: 4
For the sake of simplicity, let's assume all the transition and emission scores are the same. The arithmetic below uses the node scores $g(A) = 2$ and $g(B) = 1$.
How many possible assignments?
[Figure: a tree whose root has three children: two leaves and one internal node that has two leaf children of its own]
- Message from each leaf: A: 1·2 + 3·1 = 5, B: 2·2 + 4·1 = 8
- At the internal node, multiply the node score by the two leaf messages: A: 2·(5·5) = 50, B: 1·(8·8) = 64
- Let's verify the value for A, written as (node, (leaf, leaf)):
  - (A,(A,A)) = 2·1·2·1·2 = 8
  - (A,(A,B)) = 2·1·2·3·1 = 12
  - (A,(B,A)) = 2·3·1·1·2 = 12
  - (A,(B,B)) = 2·3·1·3·1 = 18
  - Total: 8 + 12 + 12 + 18 = 50 = 2·(2+3)·(2+3)
- Message from the internal node to the root: A: 1·50 + 3·64 = 242, B: 2·50 + 4·64 = 356
- At the root, which also receives (A: 5, B: 8) from each of its two leaves: A: 2·242·5·5 = 12,100, B: 1·356·8·8 = 22,784
- Sum over the root states: 12,100 + 22,784 = 34,884
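The sum-product numbers in this example can be reproduced with a short recursive sketch. The tree shape (a root with two leaves and one internal child that has two leaves) and the node scores g(A) = 2, g(B) = 1 are inferred from the slide's arithmetic rather than stated on it, and all names below are illustrative.

```python
from itertools import product

LABELS = ("A", "B")
NODE = {"A": 2, "B": 1}                      # node scores (inferred)
EDGE = {("A", "A"): 1, ("A", "B"): 2,        # EDGE[sender state, receiver state]
        ("B", "A"): 3, ("B", "B"): 4}
TREE = {"root": ["inner", "leaf3", "leaf4"], # node -> children (inferred shape)
        "inner": ["leaf1", "leaf2"],
        "leaf1": [], "leaf2": [], "leaf3": [], "leaf4": []}

def belief(node):
    """Unnormalized belief: node score times all messages from the children."""
    b = dict(NODE)
    for child in TREE[node]:
        msg = message(child)
        for x in LABELS:
            b[x] *= msg[x]
    return b

def message(child):
    """Sum-product message from a child to its parent."""
    h = belief(child)
    return {xp: sum(EDGE[xc, xp] * h[xc] for xc in LABELS) for xp in LABELS}

def partition_function():
    return sum(belief("root").values())

def brute_force_z():
    """Sum the score of every joint assignment, for verification."""
    nodes = list(TREE)
    total = 0
    for labels in product(LABELS, repeat=len(nodes)):
        y = dict(zip(nodes, labels))
        score = 1
        for node in nodes:
            score *= NODE[y[node]]
            for child in TREE[node]:
                score *= EDGE[y[child], y[node]]
        total += score
    return total
```

Message passing touches each edge once, whereas the brute-force check sums over all 64 joint assignments; both give 34,884 on this tree.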

Example (max-product)
Edge scores: A -> A: 1, A -> B: 2, B -> A: 3, B -> B: 4
For the sake of simplicity, let's assume all the transition and emission scores are the same. The arithmetic below uses the node scores $g(A) = 2$ and $g(B) = 1$.
How many possible assignments?
$m_{ij}^{\text{new}}(x_j) = \max_{x_i} f_{ij}(x_i, x_j) \, h(x_i)$, where $h(x_i) = g_i(x_i) \prod_{k \in \text{Nbd}(i) \setminus j} m_{ki}^{\text{old}}(x_i)$
[Figure: the same tree as in the sum-product example; back pointers are shown in parentheses]
- Message from each leaf: A: max(1·2, 3·1) = 3 (B), B: max(2·2, 4·1) = 4 (A; both options tie)
- At the internal node: A: 2·3·3 = 18, B: 1·4·4 = 16
- Let's verify the value for A, written as (node, (leaf, leaf)):
  - (A,(A,A)) = 2·1·2·1·2 = 8
  - (A,(A,B)) = 2·1·2·3·1 = 12
  - (A,(B,A)) = 2·3·1·1·2 = 12
  - (A,(B,B)) = 2·3·1·3·1 = 18
  - Maximum: 18 = 2·3·3
- Message from the internal node to the root: A: max(1·18, 3·16) = 48 (B), B: max(2·18, 4·16) = 64 (B)
- At the root: A: 2·48·3·3 = 864, B: 1·64·4·4 = 1,024
- The best score is 1,024 with the root labeled B; following the back pointers recovers the rest of the assignment
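Replacing the sum with a max gives the max-product numbers. As before, the tree shape and the node scores g(A) = 2, g(B) = 1 are inferred from the slide's arithmetic, and the names are illustrative; back pointers are omitted to keep the sketch short.

```python
from itertools import product

LABELS = ("A", "B")
NODE = {"A": 2, "B": 1}                      # node scores (inferred)
EDGE = {("A", "A"): 1, ("A", "B"): 2,        # EDGE[sender state, receiver state]
        ("B", "A"): 3, ("B", "B"): 4}
TREE = {"root": ["inner", "leaf3", "leaf4"], # node -> children (inferred shape)
        "inner": ["leaf1", "leaf2"],
        "leaf1": [], "leaf2": [], "leaf3": [], "leaf4": []}

def max_belief(node):
    """Best score of the subtree below `node`, for each state of `node`."""
    b = dict(NODE)
    for child in TREE[node]:
        msg = max_message(child)
        for x in LABELS:
            b[x] *= msg[x]
    return b

def max_message(child):
    """Max-product message from a child to its parent."""
    h = max_belief(child)
    return {xp: max(EDGE[xc, xp] * h[xc] for xc in LABELS) for xp in LABELS}

def brute_force_max():
    """Score every joint assignment and keep the best, for verification."""
    nodes = list(TREE)
    best = 0
    for labels in product(LABELS, repeat=len(nodes)):
        y = dict(zip(nodes, labels))
        score = 1
        for node in nodes:
            score *= NODE[y[node]]
            for child in TREE[node]:
                score *= EDGE[y[child], y[node]]
        best = max(best, score)
    return best
```

`max(max_belief("root").values())` returns 1,024, matching the exhaustive check over all 64 assignments.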

Inference: Graph algorithms General Search Advanced ML: Inference 105

Inference as search: General setting
- Predicting a graph as a sequence of decisions
- General data structures:
  - State: encodes a partial structure
  - Transitions: move from one partial structure to another
  - Start state
  - End state: we have a full structure
    - There may be more than one end state
- Each transition is scored with the learned model
- Goal: find an end state that has the highest total score
Questions?

Example
[Figure: graph over inputs $x_1, x_2, x_3$ and outputs $y_1, y_2, y_3$]
- Suppose each y can be one of A, B or C
- State: triples $(y_1, y_2, y_3)$, all possibly unknown, e.g., (A, -, -), (-, A, A), (-, -, -), ...
- Transition: fill in one of the unknowns
- Start state: (-, -, -)
- End state: all three y's are assigned
- Note: here we have assumed an ordering $(y_1, y_2, y_3)$
- How do the transitions get scored?
[Figure: search tree from (-,-,-) to (A,-,-), (B,-,-), (C,-,-), then (A,A,-), ..., (C,C,-), down to the end states (A,A,A), ..., (C,C,C)]
Questions?

Graph search algorithms
- Breadth/depth-first search
  - Keep a stack/queue/priority queue of "open" states
- The good: guaranteed to be correct
  - Explores every option
- The bad?
  - Explores every option: could be slow for any non-trivial graph
[Figure: search tree from (-,-,-) through (A,-,-), (B,-,-), (C,-,-) and (A,A,-), ..., (C,C,-) to (A,A,A), ..., (C,C,C)]

Greedy search
- At each state, choose the highest-scoring next transition
  - Keep only one state in memory: the current state
- What is the problem?
  - Local decisions may miss the global optimum
  - Does not explore the full search space
- Greedy algorithms can give the true optimum for special classes of problems
  - E.g., maximum spanning tree algorithms are greedy
Questions?
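The gap between greedy and exhaustive search shows up even on a two-step chain. The sketch below uses made-up scores and hypothetical names (`start`, `trans`); it assumes an additive score over a fixed left-to-right ordering.

```python
from itertools import product

def sequence_score(y, start, trans):
    """Total score: start score plus one local score per transition."""
    return start[y[0]] + sum(trans[i][y[i]][y[i + 1]] for i in range(len(trans)))

def greedy_decode(start, trans):
    """Commit to the best-looking label at each step, never looking back."""
    y = [max(range(len(start)), key=start.__getitem__)]
    for step in trans:
        row = step[y[-1]]
        y.append(max(range(len(row)), key=row.__getitem__))
    return y

def exhaustive_decode(start, trans):
    """Score every complete sequence and return the best one."""
    n = len(trans) + 1
    return list(max(product(range(len(start)), repeat=n),
                    key=lambda y: sequence_score(y, start, trans)))
```

With `start = [2, 1]` and `trans = [[[0, 0], [0, 5]]]`, greedy commits to label 0 first and ends with a total of 2, while the exhaustive search finds the sequence `[1, 1]` with a total of 6.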

Example (greedy path)
- Same search space as before: states are triples $(y_1, y_2, y_3)$, the start state is (-, -, -), and each transition fills in one of the unknowns in the order $(y_1, y_2, y_3)$
- How do the transitions get scored?
[Figure: from (-,-,-), expand (A,-,-), (B,-,-), (C,-,-); from (A,-,-), expand (A,A,-), (A,B,-), (A,C,-); the path ends at (A,B,C)]
Questions?

Example (greedy)
Edge scores: A -> A: 1, A -> B: 2, B -> A: 3, B -> B: 4
[Figure: greedy decoding on the example tree; the intermediate scores shown are 8 and 16, and the candidate totals are 384 and 1,024]

Beam search: A compromise
- Keep a size-limited priority queue of states
  - Called the beam, sorted by total score for the state
- At each step:
  - Explore all transitions from the current states
  - Add all to the beam and trim the size
- The good: explores more than greedy search
- The bad: a good state might fall out of the beam
- In general, easy to implement, very popular
  - No guarantees
Questions?
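A minimal beam search over a chain might look like the following. The score layout (`start`, `trans`) and the function name are assumptions for illustration; with `beam_size=1` it behaves like greedy search.

```python
import heapq

def beam_search(start, trans, beam_size):
    """Keep the `beam_size` best partial sequences at every step.

    start[s]       : score of label s at the first position (assumed layout)
    trans[i][p][s] : local score of label s at position i+1 given label p at i
    Returns (score, sequence) for the best hypothesis left in the beam.
    """
    beam = [(start[s], [s]) for s in range(len(start))]
    beam = heapq.nlargest(beam_size, beam, key=lambda h: h[0])
    for step in trans:
        # Expand every hypothesis in the beam by every possible next label.
        candidates = [(score + step[y[-1]][s], y + [s])
                      for score, y in beam
                      for s in range(len(step[y[-1]]))]
        # Trim back to the beam size.
        beam = heapq.nlargest(beam_size, candidates, key=lambda h: h[0])
    return max(beam, key=lambda h: h[0])
```

On the made-up scores `start = [2, 1]`, `trans = [[[0, 0], [0, 5]]]`, a beam of size 1 ends with a total of 2, while a beam of size 2 keeps the weaker first label alive and finds the sequence `[1, 1]` with a total of 6.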

Example: beam size = 3 (credit: Graham Neubig)
[Figure: step-by-step beam search with a beam of three hypotheses]
- Calculate scores, but ignore removed hypotheses
- Keep only the best three

Structured prediction approaches based on search
- Learning to search approaches
  - Assume the complex decision is incrementally constructed by a sequence of decisions
  - E.g., DAgger, Searn, transition-based methods
  - Learn how to make decisions at each branch

Example: Dependency Parsing
- Identifying relations between words
[Figure: the sentence "I ate a cake with a fork", shown without and with dependency arcs]

Learning to search (L2S) approaches
1. Define a search space and features
- Example: dependency parsing [Nivre03, NIPS16]
  - Maintain a buffer and a stack
  - Make predictions from left to right
  - Three (four) types of actions: Shift, Reduce-Left, Reduce-Right
Credit: Google research blog

Learning to search approaches: Shift-Reduce parser
- Maintain a buffer and a stack
- Make predictions from left to right
- Three (four) types of actions: Shift, Reduce-Left, Reduce-Right
[Figure: parsing "I ate a cake"; Shift moves the next word from the buffer onto the stack, and Reduce-Left then attaches "I" as a dependent of "ate"]
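The stack-and-buffer mechanics can be sketched as follows. The exact semantics of the two reduce actions are an assumption here (the arc-standard convention: Reduce-Left makes the top of the stack the head of the item below it, Reduce-Right does the opposite), and the function name is hypothetical.

```python
def shift_reduce(n_words, actions):
    """Apply a sequence of parser actions; return a dict dependent -> head.

    Words are referred to by position 0..n_words-1.
    Assumed action semantics (arc-standard style):
      Shift        : move the next buffer word onto the stack
      Reduce-Left  : second-from-top becomes a dependent of the top; pop it
      Reduce-Right : top becomes a dependent of second-from-top; pop it
    """
    buffer = list(range(n_words))
    stack, heads = [], {}
    for action in actions:
        if action == "Shift":
            stack.append(buffer.pop(0))
        elif action == "Reduce-Left":
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif action == "Reduce-Right":
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"unknown action: {action}")
    return heads
```

For the words `["I", "ate", "a", "cake"]`, the action sequence Shift, Shift, Reduce-Left, Shift, Shift, Reduce-Left, Reduce-Right attaches "I" to "ate", "a" to "cake", and "cake" to "ate", leaving "ate" on the stack as the root.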

Learning to search (L2S) approaches
1. Define a search space and features
2. Construct a reference policy (Ref) based on the gold label
3. Learn a policy that imitates Ref

Policies
- A policy maps observations to actions: $\pi(\text{obs.}) = a$
- An observation can include:
  - the input x
  - the timestep t
  - the partial trajectory τ
  - anything else

Imitation learning for joint prediction
Challenges:
- There is a combinatorial number of search states
- How does a sub-decision affect the final decision?

Credit Assignment Problem
- When something goes wrong, which decision should be blamed?

Imitation learning for joint prediction
- Searn: [Langford, Daumé & Marcu]
- DAgger: [Ross, Gordon & Bagnell]
- AggreVaTe: [Ross & Bagnell]
- LOLS: [Chang, Krishnamurthy, Agarwal, Daumé, Langford]

Learning a policy [ICML 15, Ross+15]
- At a given state, we construct a cost-sensitive multi-class example (state, [0, .2, .8])
[Figure: a roll-in reaches the state; each one-step deviation is followed by a roll-out to an end state E, with losses 0, .2 and .8]

Example: Sequence Labeling
Receive input:
  x = the monster ate the sandwich
  y = Dt  Nn      Vb  Dt  Nn
Make a sequence of predictions:
  x = the monster ate the sandwich
  ŷ = Dt  Dt      Dt  Dt  Dt
Pick a timestep and try all perturbations there:
  x    = the monster ate the sandwich
  ŷ_Dt = Dt  Dt      Vb  Dt  Nn    l = 1
  ŷ_Nn = Dt  Nn      Vb  Dt  Nn    l = 0
  ŷ_Vb = Dt  Vb      Vb  Dt  Nn    l = 1
Compute losses and construct the example:
  ( { w=monster, p=Dt, ... }, [1, 0, 1] )
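The cost vector in this example can be computed mechanically. The sketch below assumes Hamming loss and a roll-out that follows the reference labels after the deviation, which is what the perturbed sequences on the slide suggest; the function names are hypothetical.

```python
def hamming_loss(pred, gold):
    """Number of positions where the prediction differs from the gold labels."""
    return sum(p != g for p, g in zip(pred, gold))

def one_step_deviation_costs(gold, rollin, t, labels):
    """Cost of each candidate label at timestep t.

    Roll in with the current predictions before t, try each label at t,
    then roll out with the reference (gold) labels after t.
    """
    costs = []
    for label in labels:
        trajectory = rollin[:t] + [label] + gold[t + 1:]
        costs.append(hamming_loss(trajectory, gold))
    return costs
```

With `gold = ["Dt", "Nn", "Vb", "Dt", "Nn"]`, a roll-in of five `"Dt"` predictions, `t = 1` (the word "monster") and candidate labels `["Dt", "Nn", "Vb"]`, the costs are `[1, 0, 1]`, as on the slide.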

Learning to search approaches: Credit Assignment Compiler [NIPS16]
- Write the decoder, providing some side information for training
- The library translates this piece of program, together with data, into the update rules of the model
- Applied to dependency parsing, named entity recognition, relation extraction, POS tagging
- Implementation: Vowpal Wabbit

Approximate Inference Inference by sampling Advanced ML: Inference 132

Inference by sampling
- Monte Carlo methods: a large class of algorithms
  - Origins in physics
- Basic idea:
  - Repeatedly sample from a distribution
  - Compute aggregate statistics from the samples
  - E.g., the marginal distribution
- Useful when we have many, many interacting variables

Why sampling works
- Suppose we have some probability distribution P(z)
  - Might be a cumbersome function
- We want to answer questions about this distribution
  - What is the mean?
- Approximate with samples from the distribution $\{z_1, z_2, \ldots, z_n\}$
  - E.g., expectation: $E_P[f(z)] \approx \frac{1}{n} \sum_{i=1}^{n} f(z_i)$
- Theory tells us that this is a good estimator
  - Chernoff-Hoeffding style bounds
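The sample-average estimator is short enough to write out. This sketch uses a fair six-sided die as a stand-in distribution (an assumption chosen because its mean, 3.5, is known exactly); the function name and the fixed seed are illustrative.

```python
import random

def monte_carlo_expectation(sample, f, n, seed=0):
    """Estimate E[f(z)] by averaging f over n independent samples.

    sample : function taking a random.Random and returning one draw z
    f      : function whose expectation we want
    """
    rng = random.Random(seed)
    return sum(f(sample(rng)) for _ in range(n)) / n

def roll_die(rng):
    """One draw from a fair six-sided die."""
    return rng.randint(1, 6)
```

Calling `monte_carlo_expectation(roll_die, lambda z: z, 20000)` gives a value close to 3.5, and the typical error shrinks roughly like one over the square root of the number of samples.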

Key idea: rejection sampling
- Works well when the number of variables is small

The Markov Chain Monte Carlo revolution v Goal: To generate samples from a distribution P(y | x) v The target distribution could be intractable to sample from v Idea: Draw examples in such a way that, in the long run, their distribution is close to P(y | x) v Formally: Construct a Markov chain of structures whose stationary distribution is P v An iterative process that constructs examples v Initially, samples might not be from the target distribution v But after a long enough time, the samples are from a distribution that is increasingly close to P

A detour. Recall: Markov Chains v A collection of random variables $y_0, y_1, y_2, \ldots, y_t$ forms a Markov chain if the i-th state depends only on the previous one [Figure: state-transition diagram over states A to F with edge probabilities 0.1, 0.8, and 0.9; example trajectories A → B → C → D → E → F and F → A → A → E → F → B → C]

Temporal dynamics of a Markov chain v What is the probability that a chain is in a state z at time t+1? $P(y_{t+1} = z) = \sum_{y} P(y_t = y)\, P(y_{t+1} = z \mid y_t = y)$ v Exercise: Suppose a Markov chain for these transition probabilities starts at A. What is the distribution over states after two steps? [Figure: the state-transition diagram over states A to F]
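
The same kind of two-step calculation can be sketched in Python. The 3-state transition table below is a hypothetical stand-in (it is not the chain from the slide's figure), and `step` is an illustrative helper name.

```python
def step(dist, T):
    """One transition: new[z] = sum over y of dist[y] * T[y][z]."""
    states = list(T)
    return {z: sum(dist[y] * T[y][z] for y in states) for z in states}


# Hypothetical chain: T[y][z] is the probability of moving from y to z.
T = {
    "A": {"A": 0.1, "B": 0.8, "C": 0.1},
    "B": {"A": 0.1, "B": 0.1, "C": 0.8},
    "C": {"A": 0.8, "B": 0.1, "C": 0.1},
}

dist = {"A": 1.0, "B": 0.0, "C": 0.0}   # the chain starts at A
for _ in range(2):
    dist = step(dist, T)
print(dist)  # approximately A: 0.17, B: 0.17, C: 0.66
```

Each call to `step` is one vector-matrix product, so the distribution after k steps only requires k such products, not an enumeration of all paths.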

Stationary distributions v Informally, if the set of states is {A, B, C, D, E, F}: a distribution π over the states such that after a transition, the distribution over the states is still π v How do we get to a stationary distribution? v A regular Markov chain: There is a non-zero probability of getting from any state to any other in a finite number of steps v If the transition matrix is regular, just run the chain for a long time v Steady-state behavior is the stationary distribution
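
A minimal sketch of "just run it for a long time": apply the transition repeatedly until the distribution stops changing. The transition table and the names `step` and `run_until_stationary` are assumptions made for this illustration.

```python
def step(dist, T):
    """One transition: new[z] = sum over y of dist[y] * T[y][z]."""
    states = list(T)
    return {z: sum(dist[y] * T[y][z] for y in states) for z in states}


def run_until_stationary(dist, T, tol=1e-12, max_steps=10000):
    """Apply transitions until the distribution changes by less than tol."""
    for _ in range(max_steps):
        new = step(dist, T)
        if max(abs(new[s] - dist[s]) for s in T) < tol:
            return new
        dist = new
    return dist


# Hypothetical regular chain: every transition probability is non-zero.
T = {
    "A": {"A": 0.5, "B": 0.4, "C": 0.1},
    "B": {"A": 0.2, "B": 0.6, "C": 0.2},
    "C": {"A": 0.1, "B": 0.3, "C": 0.6},
}
pi = run_until_stationary({"A": 1.0, "B": 0.0, "C": 0.0}, T)
pi_other = run_until_stationary({"A": 0.0, "B": 0.0, "C": 1.0}, T)
print(pi)
print(pi_other)
```

For this chain both starting points end at the same distribution, and one more transition leaves it unchanged, which is the defining property of π.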

Back to inference: Markov Chain Monte Carlo for inference v Design a Markov chain such that v Every state is a structure v The stationary distribution of the chain is the probability distribution we care about, P(y | x) v How to do inference? v Run the Markov chain for a long time, until we think it has reached steady state v Let the chain wander around the space and collect samples v We have samples from P(y | x)

MCMC for inference [Figure: the chain is run for many steps and a sample is recorded; this is repeated, keeping a count for each sampled structure (final counts 3, 1, 1)] v With sufficient samples, we can answer inference questions like calculating the partition function (just sum over the samples)

MCMC algorithms v Metropolis-Hastings algorithm v Gibbs sampling v An instance of the Metropolis-Hastings algorithm v Many variants exist v Remember: We are sampling from an exponentially large state space v All possible assignments to the random variables

Metropolis-Hastings [Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953] [Hastings 1970] v Proposal distribution $q(y \to y')$ v Proposes changes to the state v Could propose large changes to the state v Acceptance probability α v Should the proposal be accepted or not? v If yes, move to the proposed state, else remain in the previous state

Metropolis-Hastings Algorithm v The distribution we care about is P(y | x) v Start with an initial guess $y_0$ v Loop for t = 0, 1, …, N-1 v Propose next state y' v Calculate acceptance probability α v With probability α, accept proposal v If accepted: $y_{t+1} \leftarrow y'$, else $y_{t+1} \leftarrow y_t$ v Return $\{y_0, y_1, \ldots, y_N\}$
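
The loop above can be sketched as follows. Everything specific here is an assumption for illustration: the states are short binary tuples, `score` is a made-up unnormalized log-score, and the proposal flips one randomly chosen bit. Because that proposal is symmetric, the standard acceptance probability $\alpha = \min\left(1, \frac{P(y')\, q(y' \to y)}{P(y)\, q(y \to y')}\right)$ reduces to $\min(1, P(y')/P(y))$, so the normalization constant is never needed.

```python
import math
import random
from collections import Counter


def score(y):
    """Hypothetical unnormalized log-score: rewards equal neighbouring bits."""
    return sum(1.0 for a, b in zip(y, y[1:]) if a == b)


def propose(y, rng):
    """Symmetric local proposal: flip one uniformly chosen bit."""
    i = rng.randrange(len(y))
    return y[:i] + (1 - y[i],) + y[i + 1:]


def metropolis_hastings(y0, n_steps, rng):
    samples = [y0]
    y = y0
    for _ in range(n_steps):
        y_new = propose(y, rng)
        # Symmetric proposal, so q cancels and alpha = min(1, P(y')/P(y)).
        alpha = min(1.0, math.exp(score(y_new) - score(y)))
        if rng.random() < alpha:
            y = y_new          # accept the proposal
        samples.append(y)      # on rejection the previous state is repeated
    return samples


rng = random.Random(0)
samples = metropolis_hastings((0, 1, 0), 50000, rng)
counts = Counter(samples[1000:])   # discard an initial burn-in portion
freq_all_equal = (counts[(0, 0, 0)] + counts[(1, 1, 1)]) / sum(counts.values())
print(freq_all_equal)
```

With only eight states the exact probability of the two all-equal sequences can be computed by enumeration and compared with `freq_all_equal`; in a real structured model that enumeration is exactly what is intractable.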

Proposal functions for Metropolis v A design choice v Different possibilities v Only make local changes to the factor graph v But the chain might not explore widely v Make big jumps in the state space v But the chain might move very slowly, since large jumps are often rejected v Doesn't have to depend on the size of the graph

Gibbs Sampling v Start with an initial guess $y = (y_1, y_2, \ldots, y_n)$ v Loop several times v For i = 1 to n: v Sample $y_i$ from $P(y_i \mid y_1, y_2, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n, x)$ v We now have a complete sample v The ordering is arbitrary v A specific instance of the Metropolis-Hastings algorithm; no proposal needs to be designed
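
A minimal Gibbs sampler for a toy model might look like the sketch below. The binary sequence model and its `score` function are hypothetical; the conditional for each variable is obtained by scoring every value of that variable with all the others held fixed, and marginals are then estimated by counting.

```python
import math
import random


def score(y):
    """Hypothetical unnormalized log-score of a binary sequence."""
    agree = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return agree + 0.5 * y[0]     # small bias toward y_1 = 1


def gibbs(y0, n_sweeps, rng, labels=(0, 1)):
    y = list(y0)
    samples = []
    for _ in range(n_sweeps):
        for i in range(len(y)):
            # P(y_i = v | all other variables) is proportional to exp(score).
            weights = []
            for v in labels:
                y[i] = v
                weights.append(math.exp(score(y)))
            y[i] = rng.choices(labels, weights=weights)[0]
        samples.append(tuple(y))   # one complete sample per sweep
    return samples


rng = random.Random(0)
samples = gibbs((0, 0, 0), 20000, rng)
kept = samples[500:]               # discard an initial burn-in portion
marginal_y1 = sum(s[0] for s in kept) / len(kept)
print(marginal_y1)                 # estimate of P(y_1 = 1)
```

Every draw is accepted, which is why no proposal distribution appears in the code. In a factor graph only the factors touching $y_i$ would need to be rescored, rather than the whole structure as done here for brevity.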

MAP inference with MCMC v So far we have only seen how to collect samples v Marginal inference with samples is easy v Compute the marginal probabilities from the samples v MAP inference: v Find the sample with highest probability v To help convergence to the maximum, the acceptance condition is modified to include a temperature parameter T that increases with every step v Similar to simulated annealing
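
One way such a sampler could look is sketched below. The exact acceptance rule is an assumption here: the sketch uses $\alpha = \min\left(1, (P(y')/P(y))^{T}\right)$ with a symmetric single-bit proposal, so that a growing T makes downhill moves increasingly unlikely. The model, the schedule for T, and the name `annealed_map` are all hypothetical.

```python
import math
import random


def score(y):
    """Hypothetical unnormalized log-score; the maximum is at all ones."""
    agree = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return agree + 0.5 * sum(y)


def annealed_map(y0, n_steps, rng):
    y, best = y0, y0
    for t in range(1, n_steps + 1):
        T = 1.0 + 0.01 * t                       # assumed schedule: T grows with t
        i = rng.randrange(len(y))
        y_new = y[:i] + (1 - y[i],) + y[i + 1:]  # symmetric single-bit flip
        # Assumed acceptance rule: alpha = min(1, (P(y')/P(y)) ** T).
        alpha = min(1.0, math.exp(T * (score(y_new) - score(y))))
        if rng.random() < alpha:
            y = y_new
        if score(y) > score(best):
            best = y                             # keep the best sample seen
    return best


rng = random.Random(0)
best = annealed_map((0, 0, 0, 0, 0), 5000, rng)
print(best, score(best))
```

Early on, with T near 1, the chain still accepts some downhill moves and can leave the local optimum at all zeros; later it behaves more and more like greedy hill climbing. As with the other MCMC methods, nothing guarantees that the returned structure is the exact maximum.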

Summary of MCMC methods v A different approach for inference v No guarantee of exactness v General idea v Set up a Markov chain whose stationary distribution is the probability distribution that we care about v Run the chain, collect samples, aggregate v Metropolis-Hastings, Gibbs sampling v Many, many, many variants abound! v Useful when exact inference is intractable v Typically low memory costs, local changes only for Gibbs sampling v Questions?

Inference v What is inference? The prediction step v More broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum v Different flavors: MAP, marginal, loss-augmented v Many algorithms, solution strategies v One size doesn't fit all v Next steps: v How can we take advantage of domain knowledge in inference? v How can we deal with making predictions about latent variables for which we don't have data? v Questions?