Lecture 8: Inference
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Some slides are adapted from Vivek Srikumar's course on Structured Prediction
Advanced ML: Inference

So far what we learned
- Thinking about structures
  - A graph, a collection of parts that are labeled jointly, a collection of decisions
  [Figure: two example graphs over nodes A to G]
- Next: Prediction
  - Sets structured prediction apart from binary/multiclass classification

The bigger picture
- The goal of structured prediction: predicting a graph
- Modeling: defining probability distributions over the random variables
  - Involves making independence assumptions
- Inference: the computational step that actually constructs the output
  - Also called decoding
- Learning: creates functions that score predictions (e.g., learning model parameters)

Computational issues
- Data annotation difficulty
- Model definition: What are the parts of the output? What are the inter-dependencies?
- How to train the model?
- Background knowledge about the domain
- How to do inference?
- Semi-supervised/indirectly supervised?
Inference: deriving the probability of one or more random variables based on the model

Inference
- What is inference?
  - An overview of what we have seen before
  - Combinatorial optimization
- Different views of inference
  - Integer programming
  - Graph algorithms: sum-product, max-sum
- Heuristics for inference
  - LP relaxation
  - Sampling

Remember sequence prediction
- Goal: find the most probable/highest-scoring state sequence
  - $\arg\max_y \mathrm{score}(y) = \arg\max_y w^T \phi(x, y)$
- Computationally: discrete optimization
- The naïve algorithm
  - Enumerate all sequences, score each one, and pick the max
  - Terrible idea!
- We can do better
  - Scores decompose over edges

The Viterbi algorithm: Recurrence
Goal: find $\arg\max_y w^T \phi(x, y)$, where $y = (y_1, y_2, \ldots, y_n)$
[Figure: chain $y_1, y_2, y_3, \ldots, y_n$]
Idea
1. If I know the best score of the sequences $y_1$ to $y_{n-1}$ for each value of $y_{n-1}$, then I can decide $y_n$ easily
2. Recurse to get the score up to $y_{n-1}$

Inference questions
- This class: mostly we use inference to mean "What is the highest-scoring assignment to the output random variables for a given input?"
  - Maximum A Posteriori (MAP) inference (if the score is probabilistic)
- Other inference questions
  - What is the highest-scoring assignment to some of the output variables given the input?
  - Sample from the posterior distribution over Y
  - Loss-augmented inference: which structure most violates the margin for a given scoring function?
  - Computing marginal probabilities over Y

MAP inference is discrete optimization
- A combinatorial problem
- Computational complexity depends on
  - The size of the input
  - The factorization of the scores
  - More complex factors generally lead to more expensive inference
- A generally bad strategy in all but the simplest cases: enumerate all possible structures and pick the highest-scoring one

MAP inference is search
- We want the highest-scoring structure: $\arg\max_y w^T \phi(x, y)$
- Without assumptions, no algorithm can find the max without considering every possible structure
- How can we solve this computational problem?
  - Exploit the structure of the search space and the cost function
  - That is, exploit the decomposition of the scoring function
  - Usually, stronger assumptions lead to easier inference
  - E.g., consider 10 independent random variables: each one can be maximized on its own, so there is no need to enumerate joint assignments

Approaches for inference
- Exact vs. approximate inference
  - Should the maximization be performed exactly?
  - Or is a close-to-highest-scoring structure good enough?
  - Exact: search, dynamic programming, integer linear programming, ...
  - Heuristic (called approximate inference): Gibbs sampling, belief propagation, beam search, linear programming relaxations, ...
- Randomized vs. deterministic
  - Relevant for approximate inference: if I run the inference program twice, will I get the same answer?

Coming up
- Formulating general inference as integer linear programs
  - And variants of this idea
- Graph algorithms, dynamic programming, greedy search
  - We have seen the Viterbi algorithm: it uses a cleverly defined ordering to decompose the output into a sequence of decisions
  - We will talk about a general algorithm: max-product, sum-product
- Heuristics for inference
  - Sampling, Gibbs sampling
  - Approximate graph search, beam search
  - LP relaxation

Inference: Integer Linear Programs

The big picture
- MAP inference is combinatorial optimization
- Combinatorial optimization problems can be written as integer linear programs (ILPs)
  - The conversion is not always trivial
  - Allows injection of knowledge in the form of constraints
- Different ways of solving ILPs
  - Commercial solvers: CPLEX, Gurobi, etc.
  - Specialized solvers if you know something about your problem: Lagrangian relaxation, amortized inference, etc.
  - Can relax to a linear program and hope for the best
- Solving integer linear programs is NP-hard in general
  - No free lunch

Detour: Linear programming
- Minimizing a linear objective function subject to a finite number of linear constraints (equality or inequality)
- Very widely applicable
  - Operations research, micro-economics, management
- Historical note
  - Developed during World War II to reduce army costs
  - "Programming" here is not the same as computer programming

Example: The diet problem
A student wants to spend as little money on food as possible while getting a sufficient amount of vitamin Z and nutrient X. Her options are:

Item                 Cost/100g  Vitamin Z  Nutrient X
Carrots              2          4          0.4
Sunflower seeds      6          10         4
Double cheeseburger  0.3        0.01       2

How should she spend her money to get at least 5 units of vitamin Z and 3 units of nutrient X?
Let c, s and d denote how much of each item is purchased (in units of 100 g, assuming the nutrient columns are also per 100 g).
- Minimize total cost: $\min_{c,s,d}\; 2c + 6s + 0.3d$
- At least 5 units of vitamin Z: $4c + 10s + 0.01d \ge 5$
- At least 3 units of nutrient X: $0.4c + 4s + 2d \ge 3$
- The amount purchased is not negative: $c \ge 0,\; s \ge 0,\; d \ge 0$

Linear programming
- In general: $\min_x c^T x$ (linear objective) subject to $Ax \le b$ (linear constraints)

Linear programming
- In general
  - This is a continuous optimization problem
  - And yet, only a finite set of points needs to be considered as possible solutions
- For example: suppose we had to maximize any $c^T x$ on the region cut out of the plane $a_1 x_1 + a_2 x_2 + a_3 x_3 = b$
  [Figure: a triangular region on this plane, drawn on the axes $x_1, x_2, x_3$, with the objective direction $c$]
  - The three vertices of this region are the only possible solutions!

Geometry of linear programming
- The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed)
- The objective defines a cost for every point in the space
  [Figure: polytope shaded by cost; darker is higher]
- Even though all points in the region are allowed, the vertices maximize/minimize the cost

Linear programming
- In general
  - This is a continuous optimization problem
  - And yet, only a finite set of points needs to be considered as possible solutions
  - The constraint matrix defines a polytope
  - Only the vertices or faces of the polytope can be solutions
- Linear programs can be solved in polynomial time
Questions?
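
To make the vertex property concrete, here is a minimal sketch that solves the diet problem from earlier by brute force: it intersects every choice of three constraints, keeps the feasible intersection points (the vertices), and returns the cheapest one. The helper names (`det3`, `solve3`, `best_vertex`) are illustrative, the nutrient values are assumed to be per 100 g, and the approach only works here because the problem is tiny and has a bounded optimum; real LP solvers use simplex or interior-point methods instead.

```python
from itertools import combinations

# Each constraint is (coefficients, bound), meaning coefficients . x >= bound.
# Variables x = (c, s, d): carrots, sunflower seeds, cheeseburger (units of 100 g).
CONSTRAINTS = [
    ((4.0, 10.0, 0.01), 5.0),  # at least 5 units of vitamin Z
    ((0.4, 4.0, 2.0), 3.0),    # at least 3 units of nutrient X
    ((1.0, 0.0, 0.0), 0.0),    # c >= 0
    ((0.0, 1.0, 0.0), 0.0),    # s >= 0
    ((0.0, 0.0, 1.0), 0.0),    # d >= 0
]
COST = (2.0, 6.0, 0.3)


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(rows, rhs):
    """Solve a 3x3 linear system with Cramer's rule; return None if singular."""
    det = det3(rows)
    if abs(det) < 1e-12:
        return None
    solution = []
    for col in range(3):
        replaced = [list(row) for row in rows]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / det)
    return tuple(solution)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_vertex():
    """Enumerate candidate vertices and return (cost, point) for the cheapest."""
    best = None
    for active in combinations(CONSTRAINTS, 3):
        point = solve3([a for a, _ in active], [b for _, b in active])
        if point is None:
            continue
        # Keep the point only if it satisfies every constraint (small tolerance).
        if all(dot(a, point) >= b - 1e-9 for a, b in CONSTRAINTS):
            cost = dot(COST, point)
            if best is None or cost < best[0]:
                best = (cost, point)
    return best


cost, (c, s, d) = best_vertex()
print(f"cost={cost:.4f} carrots={c:.4f} seeds={s:.4f} burger={d:.4f}")
```

Only ten constraint triples are examined here, which is the "finite set of possible solutions" the slide refers to; the number of vertices grows quickly with problem size, so this is an illustration rather than a practical method.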
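
Integer linear programming
- In general: the same linear objective and linear constraints as a linear program, with the additional requirement that the variables take integer values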

Geometry of integer linear programming
- The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed)
- The objective defines a cost for every point in the space
- Only integer points are allowed

Integer linear programming
- In general
  - Solving integer linear programs is NP-hard in general!
  - LP relaxation: drop the integer constraints and hope for the best

0-1 integer linear programming
- In general
  - An instance of integer linear programs
  - Still NP-hard
- Geometry: we are only considering points that are vertices of the Boolean hypercube
  - Constraints prohibit certain vertices, e.g., only points within the region defined by $Ax \le b$ are allowed
  - The solution can be an interior point of the constraint set defined by $Ax \le b$
Questions?

Back to structured prediction
- Recall that we are solving $\arg\max_y w^T \phi(x, y)$
  - The goal is to produce a graph
- The set of possible values that y can take is finite, but large
- General idea: frame the argmax problem as a 0-1 integer linear program
  - Allows addition of arbitrary constraints
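
Thinking in ILPs
Let's start with multi-class classification:
$\arg\max_{y \in \{A,B,C\}} w^T \phi(x, y) = \arg\max_{y \in \{A,B,C\}} \mathrm{score}(y)$
Introduce decision variables for each label
- $z_A = 1$ if output = A, 0 otherwise
- $z_B = 1$ if output = B, 0 otherwise
- $z_C = 1$ if output = C, 0 otherwise
The resulting 0-1 ILP
- Maximize the score: $\max_z\; \mathrm{score}(A)\, z_A + \mathrm{score}(B)\, z_B + \mathrm{score}(C)\, z_C$
- Pick exactly one label: $z_A + z_B + z_C = 1$
- $z_A, z_B, z_C \in \{0, 1\}$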

Thinking in ILPs (continued)
- We have taken a trivial problem (finding a highest-scoring element of a list) and converted it into a representation that is NP-hard in the worst case!
- Lesson: don't solve multiclass classification with an ILP solver
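
As a small illustration of the multiclass formulation, the sketch below enumerates every 0-1 vector $(z_A, z_B, z_C)$, keeps those that pick exactly one label, and compares the best one with a plain argmax. The score values are made up for the example, and brute-force enumeration stands in for a real ILP solver.

```python
from itertools import product

# Hypothetical values of score(y) for each label.
scores = {"A": 1.5, "B": -0.3, "C": 2.1}
labels = list(scores)

best_value, best_z = None, None
for z in product((0, 1), repeat=len(labels)):
    if sum(z) != 1:  # constraint: pick exactly one label
        continue
    value = sum(scores[y] * z_y for y, z_y in zip(labels, z))  # linear objective
    if best_value is None or value > best_value:
        best_value, best_z = value, z

ilp_label = labels[best_z.index(1)]
argmax_label = max(scores, key=scores.get)
print(ilp_label, argmax_label)
```

Both routes select the same label; the ILP view only pays off once there are constraints that a simple argmax cannot express.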

ILP for a general conditional model
[Figure: factor graph over outputs $y_1, y_2, y_3$ and inputs $x_1, x_2, x_3$]
Suppose each $y_i$ can be A, B or C.
Our goal: $\max_y\; w^T\phi(x_1, y_1) + w^T\phi(y_1, y_2, y_3) + w^T\phi(x_3, y_2, y_3) + w^T\phi(x_1, x_2, y_2)$
- Introduce one decision variable for each part being assigned labels, for example:
  - $z_{1A}, z_{1B}, z_{1C}$
  - $z_{2A}, z_{2B}, z_{2C}$
  - $z_{13AA}, z_{13AB}, z_{13AC}, z_{13BA}, z_{13BB}, z_{13BC}, z_{13CA}, z_{13CB}, z_{13CC}$
  - $z_{23AA}, z_{23AB}, z_{23AC}, z_{23BA}, z_{23BB}, z_{23BC}, z_{23CA}, z_{23CB}, z_{23CC}$
- Each of these decision variables is associated with a score
- Not all decisions can exist together, e.g., $z_{13AB}$ implies $z_{1A}$ and $z_{3B}$
Questions?

Writing constraints as linear inequalities
- Exactly one of $z_{1A}, z_{1B}, z_{1C}$ can be true: $z_{1A} + z_{1B} + z_{1C} = 1$
- At least m of $z_{1A}, z_{1B}, z_{1C}$ should be true: $z_{1A} + z_{1B} + z_{1C} \ge m$
- At most m of $z_{1A}, z_{1B}, z_{1C}$ should be true: $z_{1A} + z_{1B} + z_{1C} \le m$
- Implication: $z_i \Rightarrow z_j$
  - Convert to a disjunction: $\lnot z_i \lor z_j$ (at least one of "not $z_i$" or "$z_j$")
  - $1 - z_i + z_j \ge 1$
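
One way to gain confidence in such encodings is to check them against the truth table. The sketch below does this for the implication rule and for a consistency constraint of the kind "$z_{13AB}$ implies $z_{1A}$ and $z_{3B}$", written as two inequalities $z_p \le z_q$ and $z_p \le z_r$ (that is, the implication rule applied twice). The function names are illustrative.

```python
from itertools import product


def implication_matches():
    """(z_i => z_j) holds exactly when 1 - z_i + z_j >= 1."""
    return all(
        ((not zi) or zj) == (1 - int(zi) + int(zj) >= 1)
        for zi, zj in product((False, True), repeat=2)
    )


def pair_consistency_matches():
    """(z_p => z_q and z_r) holds exactly when z_p <= z_q and z_p <= z_r."""
    return all(
        ((not p) or (q and r)) == (int(p) <= int(q) and int(p) <= int(r))
        for p, q, r in product((False, True), repeat=3)
    )


print(implication_matches(), pair_consistency_matches())
```

The same exhaustive check extends to any small Boolean formula before it is handed to a solver.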

Integer linear programming for inference
- Easy to add additional knowledge
  - Specify it as Boolean formulas
  - Examples
    - If $y_1$ is an A, then $y_2$ or $y_3$ should be a B or C
    - No more than two A's allowed in the output
- Many inference problems have standard mappings to ILPs
  - Sequences, parsing, dependency parsing
- The encoding of the problem makes a difference in solving time
  - The mechanical encoding may not be efficient to solve
  - Generally: more complex constraints make solving harder

Exercise: Sequence labeling
Goal: find $\arg\max_y w^T \phi(x, y)$, where $y = (y_1, y_2, \ldots, y_n)$
[Figure: chain $y_1, y_2, y_3, \ldots, y_n$]
How can this be written as an ILP?

ILP for inference: Remarks
- Many combinatorial optimization problems can be written as an ILP
  - Even the "easy"/polynomial ones
  - Given an ILP, checking whether it represents a polynomial problem is intractable in general
- ILPs are a general language for thinking about combinatorial optimization
  - The representation allows us to make general statements about inference
- Off-the-shelf solvers for ILPs are quite good
  - Gurobi, CPLEX
  - Use an off-the-shelf solver only if you can't solve your inference problem otherwise

Inference: Graph algorithms
Belief Propagation

Variable elimination (motivation)
- Remember: we have a collection of inference variables that need to be assigned, $y = (y_1, y_2, \ldots)$
- General algorithm
  - First fix an ordering of the variables, say $(y_1, y_2, \ldots)$
  - Iteratively: find the best value for $y_i$ given the values of the previous neighbors
  - Use back pointers to find the final answer
- Viterbi is an instance of max-product variable elimination

Variable elimination example (max-sum)
[Figure: chain $y_1, y_2, y_3, \ldots, y_n$, each variable taking a label in {A, B, C, D}, with a table of local scores between neighboring variables]
- First eliminate $y_1$: $\mathrm{Score}_1(s) = \mathrm{score\_local}_1(\mathrm{START}, s)$
- Next eliminate $y_2$, then $y_3$, and so on: $\mathrm{Score}_i(s) = \max_{y_{i-1}} \left[ \mathrm{Score}_{i-1}(y_{i-1}) + \mathrm{score\_local}_i(y_{i-1}, s) \right]$
- At the end, we have all the information to make a decision for $y_n$
Questions?
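
The recurrence can be sketched in a few lines of code. The version below is a minimal illustration, not the lecture's implementation: `toy_local_score` is a made-up scoring function, positions are 0-based, and the result is compared against brute-force enumeration on a chain of length 4.

```python
from itertools import product

LABELS = ["A", "B", "C", "D"]


def toy_local_score(i, prev, cur):
    """Hypothetical local score; prev is "START" at position 0."""
    unary = {"A": 0, "B": 4, "C": 1, "D": 0}[cur]
    return unary - (3 if prev == cur else 0)  # penalize repeating a label


def max_sum_chain(labels, local_score, n):
    # Score_1(s) = score_local_1(START, s)
    score = {s: local_score(0, "START", s) for s in labels}
    back_pointers = []
    for i in range(1, n):
        new_score, pointers = {}, {}
        for s in labels:
            # Score_i(s) = max over y_{i-1} of Score_{i-1}(y_{i-1}) + score_local_i(y_{i-1}, s)
            best_prev = max(labels, key=lambda p: score[p] + local_score(i, p, s))
            new_score[s] = score[best_prev] + local_score(i, best_prev, s)
            pointers[s] = best_prev
        score = new_score
        back_pointers.append(pointers)
    # Decide the last variable, then follow back pointers to recover the rest.
    last = max(labels, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


def sequence_score(seq, local_score):
    return sum(local_score(i, seq[i - 1] if i else "START", y) for i, y in enumerate(seq))


best_score, best_path = max_sum_chain(LABELS, toy_local_score, 4)
brute_force = max(sequence_score(seq, toy_local_score) for seq in product(LABELS, repeat=4))
print(best_score, best_path, brute_force)
```

The elimination loop touches each pair of neighboring labels once per position, whereas the brute-force check enumerates every label sequence.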
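
Two types of inference problems
- Marginals: find the marginal probability of each variable, $P(x_i) = \sum_{x \setminus x_i} P(x)$
- Maximizer: find the most probable joint assignment, $x^* = \arg\max_x P(x)$
Probability in different domains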

Belief Propagation
- BP provides the exact solution when there are no loops in the graph
  - Viterbi is a special case
- Otherwise, "loopy" BP provides an approximate solution
We use sum-product BP as the running example, where we want to compute the normalizer Z of the distribution.

Intuition
- An iterative process in which neighboring variables pass messages to each other: "I (variable $x_3$) think that you (variable $x_2$) belong in these states with various likelihoods"
- After enough iterations, the conversation is likely to converge to a consensus that determines the marginal probabilities of all the variables

Message
- Message from node i to node j: $m_{ij}(x_j)$
- A message is not a probability
  - It may not sum to 1
- A high value of $m_{ij}(x_j)$ means that node i "believes" the marginal value $P(x_j)$ to be high

Beliefs
- Estimated marginal probabilities are called beliefs
- Algorithm
  - Update messages until convergence
  - Then calculate beliefs
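
Message update
- To update the message from i to j, consider all messages flowing into i (other than the one from j):
  $m_{ij}^{\mathrm{new}}(x_j) = \sum_{x_i} f_{ij}(x_i, x_j)\, g_i(x_i) \prod_{k \in \mathrm{Nbd}(i) \setminus j} m_{ki}^{\mathrm{old}}(x_i)$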

Sum-product vs. max-product
- The standard BP we just described is sum-product, used to estimate marginals
- A variant called max-product (or max-sum in log space) is used to estimate the MAP assignment
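
Max-product
- The message update is the same as before, except that the sum is replaced by a max:
  $m_{ij}^{\mathrm{new}}(x_j) = \max_{x_i} f_{ij}(x_i, x_j)\, g_i(x_i) \prod_{k \in \mathrm{Nbd}(i) \setminus j} m_{ki}^{\mathrm{old}}(x_i)$
- Belief: estimates the most likely states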

Example (Sum-Product)
Edge scores: A -> A: 1, A -> B: 2, B -> A: 3, B -> B: 4
For the sake of simplicity, let's assume all the transition and emission scores are the same at every node. The arithmetic below uses emission scores of 2 for A and 1 for B, and reads "X -> Y" as child state X, parent state Y.
How many possible assignments?
[Figure: a tree with six binary nodes; as the arithmetic implies, the root has two leaf children and one internal child, which in turn has two leaf children]
1. Message from each leaf to its parent
   - Parent = A: 1 × 2 + 3 × 1 = 5
   - Parent = B: 2 × 2 + 4 × 1 = 8
2. Internal node: emission times the two incoming leaf messages
   - A: 2 × (5 × 5) = 50
   - B: 1 × (8 × 8) = 64
   Let's verify the value for A by enumerating its two leaves:
   - (A,(A,A)) = 2 × 1 × 2 × 1 × 2 = 8
   - (A,(A,B)) = 2 × 1 × 2 × 3 × 1 = 12
   - (A,(B,A)) = 2 × 3 × 1 × 1 × 2 = 12
   - (A,(B,B)) = 2 × 3 × 1 × 3 × 1 = 18
   - Total: 8 + 12 + 12 + 18 = 50 = 2 × (2 + 3) × (2 + 3)
3. Message from the internal node to the root
   - Root = A: 1 × 50 + 3 × 64 = 242
   - Root = B: 2 × 50 + 4 × 64 = 356
4. Root: emission times the three incoming messages
   - A: 2 × 242 × 5 × 5 = 12,100
   - B: 1 × 356 × 8 × 8 = 22,784
5. Sum over the root: 12,100 + 22,784 = 34,884
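
The numbers in this example can be reproduced with a short recursive sketch. It assumes the tree shape and the emission scores (2 for A, 1 for B) implied by the slide arithmetic, and the node names (`root`, `mid`, `leaf1`, ...) are made up for illustration. The result is checked against brute-force enumeration of all joint assignments.

```python
from itertools import product

STATES = "AB"
# Edge score indexed by (child state, parent state), as read off the slides.
EDGE = {("A", "A"): 1, ("A", "B"): 2, ("B", "A"): 3, ("B", "B"): 4}
# Emission scores inferred from the slide arithmetic (an assumption).
NODE = {"A": 2, "B": 1}
# Assumed tree: the root has two leaves and one internal child with two leaves.
CHILDREN = {"root": ["leaf1", "leaf2", "mid"], "mid": ["leaf3", "leaf4"]}


def belief(node):
    """Unnormalized sum over the subtree rooted at `node`, per state of `node`."""
    b = dict(NODE)
    for child in CHILDREN.get(node, []):
        m = message(child)
        for s in STATES:
            b[s] *= m[s]
    return b


def message(child):
    """Sum-product message from `child` to its parent, per parent state."""
    b = belief(child)
    return {ps: sum(EDGE[(cs, ps)] * b[cs] for cs in STATES) for ps in STATES}


def brute_force_z():
    """Sum the score of every joint assignment of the six nodes."""
    nodes = ["root", "leaf1", "leaf2", "mid", "leaf3", "leaf4"]
    edges = [(c, p) for p, cs in CHILDREN.items() for c in cs]
    total = 0
    for states in product(STATES, repeat=len(nodes)):
        x = dict(zip(nodes, states))
        weight = 1
        for n in nodes:
            weight *= NODE[x[n]]
        for c, p in edges:
            weight *= EDGE[(x[c], x[p])]
        total += weight
    return total


z = sum(belief("root").values())
print(message("leaf1"), belief("mid"), belief("root"), z, brute_force_z())
```

Message passing touches each edge once, while the brute-force check enumerates all joint assignments.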

Example (max-product)
Edge scores: A -> A: 1, A -> B: 2, B -> A: 3, B -> B: 4
For the sake of simplicity, let's assume all the transition and emission scores are the same at every node. The arithmetic below uses emission scores of 2 for A and 1 for B, and reads "X -> Y" as child state X, parent state Y.
How many possible assignments?
Message update: $m_{ij}^{\mathrm{new}}(x_j) = \max_{x_i} f_{ij}(x_i, x_j)\, g_i(x_i) \prod_{k \in \mathrm{Nbd}(i) \setminus j} m_{ki}^{\mathrm{old}}(x_i)$
[Figure: the same six-node tree as in the sum-product example; the maximizing child state is recorded in parentheses as a back pointer]
1. Message from each leaf to its parent
   - Parent = A: max(1 × 2, 3 × 1) = 3 (B)
   - Parent = B: max(2 × 2, 4 × 1) = 4 (A; both options tie at 4)
2. Internal node: emission times the two incoming leaf messages
   - A: 2 × (3 × 3) = 18
   - B: 1 × (4 × 4) = 16
   Let's verify the value for A by enumerating its two leaves:
   - (A,(A,A)) = 2 × 1 × 2 × 1 × 2 = 8
   - (A,(A,B)) = 2 × 1 × 2 × 3 × 1 = 12
   - (A,(B,A)) = 2 × 3 × 1 × 1 × 2 = 12
   - (A,(B,B)) = 2 × 3 × 1 × 3 × 1 = 18
   - Maximum: 18 = 2 × 3 × 3
3. Message from the internal node to the root
   - Root = A: max(1 × 18, 3 × 16) = 48 (B)
   - Root = B: max(2 × 18, 4 × 16) = 64 (B)
4. Root: emission times the three incoming messages
   - A: 2 × 48 × 3 × 3 = 864
   - B: 1 × 64 × 4 × 4 = 1,024
5. The best score is 1,024 with root = B; following the back pointers recovers the states of the remaining nodes
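
A minimal sketch of the same computation in code, under the same assumptions as the worked numbers (tree shape and emission scores of 2 for A and 1 for B are inferred from the slide arithmetic; node names are illustrative). Instead of explicit back pointers, each table entry carries the best assignment of its subtree.

```python
STATES = "AB"
# Edge score indexed by (child state, parent state), as read off the slides.
EDGE = {("A", "A"): 1, ("A", "B"): 2, ("B", "A"): 3, ("B", "B"): 4}
# Emission scores inferred from the slide arithmetic (an assumption).
NODE = {"A": 2, "B": 1}
# Assumed tree: the root has two leaves and one internal child with two leaves.
CHILDREN = {"root": ["leaf1", "leaf2", "mid"], "mid": ["leaf3", "leaf4"]}


def best_subtree(node):
    """Map each state of `node` to (best subtree score, best subtree assignment)."""
    table = {s: (NODE[s], {node: s}) for s in STATES}
    for child in CHILDREN.get(node, []):
        child_table = best_subtree(child)
        for s in STATES:
            # max replaces the sum of sum-product; remember the maximizing child state
            cs = max(STATES, key=lambda c: EDGE[(c, s)] * child_table[c][0])
            score, assignment = table[s]
            table[s] = (
                score * EDGE[(cs, s)] * child_table[cs][0],
                {**assignment, **child_table[cs][1]},
            )
    return table


root_table = best_subtree("root")
best_state = max(STATES, key=lambda s: root_table[s][0])
best_score, best_assignment = root_table[best_state]
print(best_score, best_assignment)
```

Ties (such as the leaf message for a parent in state B) are broken here by taking the first maximizer, so a different but equally good assignment is possible with another tie-breaking rule.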

Inference: Graph algorithms
General Search

Inference as search: General setting
- Predicting a graph as a sequence of decisions
- General data structures
  - State: encodes a partial structure
  - Transitions: move from one partial structure to another
  - Start state
  - End state: we have a full structure
    - There may be more than one end state
- Each transition is scored with the learned model
- Goal: find an end state that has the highest total score
Questions?

Example
[Figure: outputs $y_1, y_2, y_3$ over inputs $x_1, x_2, x_3$]
Suppose each y can be one of A, B or C.
- State: triples $(y_1, y_2, y_3)$, all possibly unknown, e.g., (A, -, -), (-, A, A), (-, -, -), ...
- Transition: fill in one of the unknowns
- Start state: (-, -, -)
- End state: all three y's are assigned
[Figure: search tree from (-,-,-) to (A,-,-), (B,-,-), (C,-,-), then (A,A,-), ..., (C,C,-), down to the end states (A,A,A), ..., (C,C,C)]
Note: here we have assumed an ordering $(y_1, y_2, y_3)$.
How do the transitions get scored?
Questions?

Graph search algorithms
- Breadth/depth-first search
  - Keep a stack/queue/priority queue of "open" states
- The good: guaranteed to be correct
  - Explores every option
- The bad?
  - Explores every option: could be slow for any non-trivial graph
[Figure: the full search tree from (-,-,-) down to the end states (A,A,A), ..., (C,C,C)]

Greedy search
- At each state, choose the highest-scoring next transition
  - Keep only one state in memory: the current state
- What is the problem?
  - Locally best decisions may miss the global optimum
  - Does not explore the full search space
- Greedy algorithms can give the true optimum for special classes of problems
  - E.g., maximum spanning tree algorithms are greedy
Questions?
Example
Same setup as before: each of y_1, y_2, y_3 is one of A, B, or C, with the ordering (y_1, y_2, y_3) assumed.
- Start state: (-, -, -); end state: all three y's are assigned
- Search tree: (-,-,-) → (A,-,-), (B,-,-), (C,-,-); (A,-,-) → (A,A,-), (A,B,-), (A,C,-); one path ends at (A,B,C)
- How do the transitions get scored?
Questions?

Example (greedy)
Transition scores:
  A -> A: 1
  A -> B: 2
  B -> A: 3
  B -> B: 4
[Figure: lattice over states A and B with multiplicative path scores; the paths shown score 8, 16, 384, and 1,024.]

Beam search: A compromise
- Keep a size-limited priority queue of states
  - Called the beam, sorted by total score for the state
- At each step:
  - Explore all transitions from the states in the beam
  - Add all to the beam and trim it to size
- The good: explores more than greedy search
- The bad: a good state might fall out of the beam
- In general, easy to implement, very popular
  - No guarantees
Questions?

Example: beam = 3 (Credit: Graham Neubig)
[Figure: step-by-step beam search example.]
- Calculate scores, but ignore removed hypotheses
- Keep only the best three

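The expand-then-trim loop of beam search can be sketched as follows. The scores in `SCORES` are hypothetical (they are not the numbers from the figure), and `beam_search` is an illustrative name.

```python
import heapq

LABELS = ("A", "B")

# Hypothetical additive scores keyed by (previous label, next label);
# None marks the start state.
SCORES = {
    (None, "A"): 3.0, (None, "B"): 2.0,
    ("A", "A"): 1.0, ("A", "B"): 1.0,
    ("B", "A"): 5.0, ("B", "B"): 1.0,
}

def beam_search(n, beam_size):
    """Return (score, sequence) of the best hypothesis left in the beam."""
    beam = [(0.0, ())]                     # (total score, partial sequence)
    for _ in range(n):
        candidates = []
        for total, seq in beam:            # expand every state in the beam
            prev = seq[-1] if seq else None
            for label in LABELS:
                candidates.append((total + SCORES[(prev, label)], seq + (label,)))
        beam = heapq.nlargest(beam_size, candidates)   # trim to the beam size
    return max(beam)

print(beam_search(2, beam_size=1))   # behaves like greedy search
print(beam_search(2, beam_size=2))   # keeps the hypothesis that starts with B
```

With a beam of size 1 the hypothesis starting with B is dropped after the first step; with size 2 it survives and wins.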
Structured prediction approaches based on search
- Learning-to-search approaches
  - Assume the complex decision is incrementally constructed by a sequence of decisions
  - E.g., DAgger, Searn, transition-based methods
  - Learn how to make decisions at each branch

Example: Dependency Parsing
- Identifying relations between words
[Figure: dependency arcs over the sentence "I ate a cake with a fork".]

Learning to search (L2S) approaches
1. Define a search space and features
   - Example: dependency parsing [Nivre03, NIPS16]
   - Maintain a buffer and a stack
   - Make predictions from left to right
   - Three (four) types of actions: Shift, Reduce-Left, Reduce-Right
(Credit: Google research blog)

Learning to search approaches: Shift-Reduce parser
- Maintain a buffer and a stack
- Make predictions from left to right
- Three (four) types of actions: Shift, Reduce-Left, Reduce-Right
[Figure: on "I ate a cake", Shift moves words from the buffer onto the stack, and Reduce-Left then attaches "I" under "ate".]

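A minimal sketch of a shift-reduce parser that replays a given action sequence. The exact semantics of the reduce actions are an assumption here (an arc-standard-style convention): Reduce-Left makes the second-from-top stack item a dependent of the top item, and Reduce-Right makes the top item a dependent of the one below it.

```python
def parse(words, actions):
    """Apply actions to a (stack, buffer) pair; return the stack and (head, dependent) arcs."""
    stack, buffer, arcs = [], list(words), []
    for action in actions:
        if action == "SHIFT":              # move the next buffer word to the stack
            stack.append(buffer.pop(0))
        elif action == "REDUCE-LEFT":      # second-from-top becomes dependent of top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "REDUCE-RIGHT":     # top becomes dependent of the item below
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return stack, arcs

actions = ["SHIFT", "SHIFT", "REDUCE-LEFT",            # I <- ate
           "SHIFT", "SHIFT", "REDUCE-LEFT",            # a <- cake
           "REDUCE-RIGHT"]                             # ate -> cake
stack, arcs = parse(["I", "ate", "a", "cake"], actions)
print(stack, arcs)
```

In a learning-to-search setup, the hard-coded `actions` list would be replaced by a learned policy that picks the next action from the current stack and buffer.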
Learning to search (L2S) approaches
1. Define a search space and features
2. Construct a reference policy (Ref) based on the gold label
3. Learn a policy that imitates Ref

Policies
- A policy maps observations to actions: π(obs.) = a
- The observation can include the input x, the timestep t, the partial trajectory τ, and anything else

Imitation learning for joint prediction
Challenges:
- There is a combinatorial number of search states
- How does a sub-decision affect the final decision?

Credit Assignment Problem
- When something goes wrong, which decision should be blamed?

Imitation learning for joint prediction
- Searn: [Langford, Daumé & Marcu]
- DAgger: [Ross, Gordon & Bagnell]
- AggreVaTe: [Ross & Bagnell]
- LOLS: [Chang, Krishnamurthy, Agarwal, Daumé, Langford]

Learning a Policy [ICML 15, Ross+15]
- At each state, we construct a cost-sensitive multiclass example (state, [0, .2, .8])
[Figure: roll-in to the state, one-step deviations over the available actions, and a roll-out from each deviation to the end; the three roll-outs end with losses 0, .2, and .8.]

Example: Sequence Labeling
Receive input:
  x = the monster ate the sandwich
  y = Dt Nn Vb Dt Nn
Make a sequence of predictions:
  x = the monster ate the sandwich
  ŷ = Dt Dt Dt Dt Dt
Pick a timestep and try all perturbations there:
  x    = the monster ate the sandwich
  ŷ_Dt = Dt Dt Vb Dt Nn   l = 1
  ŷ_Nn = Dt Nn Vb Dt Nn   l = 0
  ŷ_Vb = Dt Vb Vb Dt Nn   l = 1
Compute losses and construct example:
  ({w=monster, p=Dt, ...}, [1, 0, 1])

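The cost vector in this example can be reproduced with a short sketch. It assumes Hamming loss and a roll-out that follows the gold labels after the deviation, which matches the perturbed sequences shown; the function names are illustrative.

```python
def hamming_loss(pred, gold):
    """Number of positions where the prediction differs from the gold labels."""
    return sum(p != g for p, g in zip(pred, gold))

def one_step_deviation_costs(gold, rollin, t, labels):
    """Cost of each action at step t: keep the roll-in prefix, take the action,
    then roll out with the reference (gold) labels."""
    costs = []
    for action in labels:
        sequence = list(rollin[:t]) + [action] + list(gold[t + 1:])
        costs.append(hamming_loss(sequence, gold))
    return costs

gold = ["Dt", "Nn", "Vb", "Dt", "Nn"]
rollin = ["Dt", "Dt", "Dt", "Dt", "Dt"]      # the learner's own predictions
costs = one_step_deviation_costs(gold, rollin, t=1, labels=["Dt", "Nn", "Vb"])
print(costs)
```

Each cost vector, paired with the features of the state at step `t`, becomes one cost-sensitive multiclass training example.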
Learning to search approaches: Credit Assignment Compiler [NIPS16]
- Write the decoder, providing some side information for training
- The library translates this program, together with the data, into the update rules of the model
- Applied to dependency parsing, named entity recognition, relation extraction, and POS tagging
- Implementation: Vowpal Wabbit

Approximate Inference: Inference by sampling

Inference by sampling
- Monte Carlo methods: a large class of algorithms
  - Origins in physics
- Basic idea:
  - Repeatedly sample from a distribution
  - Compute aggregate statistics from the samples
  - E.g., the marginal distribution
- Useful when we have many, many interacting variables

Why sampling works
- Suppose we have some probability distribution P(z)
  - Might be a cumbersome function
- We want to answer questions about this distribution
  - What is the mean?
- Approximate with samples {z_1, z_2, ..., z_n} from the distribution
  - E.g., expectation: $E[f(z)] \approx \frac{1}{n} \sum_{i=1}^{n} f(z_i)$
- Theory tells us that this is a good estimator
  - Chernoff-Hoeffding style bounds

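A minimal sketch of estimating a mean from samples. The fair six-sided die is a hypothetical stand-in for P(z); its true mean is 3.5, so the quality of the estimate is easy to check.

```python
import random

def monte_carlo_mean(sample, n, seed=0):
    """Average n independent draws produced by `sample(rng)`."""
    rng = random.Random(seed)
    return sum(sample(rng) for _ in range(n)) / n

def roll_die(rng):
    """Hypothetical distribution: a fair six-sided die."""
    return rng.randint(1, 6)

estimate = monte_carlo_mean(roll_die, n=100_000)
print(estimate)
```

The error of the estimate shrinks roughly as one over the square root of the number of samples, which is what Chernoff-Hoeffding style bounds make precise for bounded variables.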
Key idea: rejection sampling
- Works well when the number of variables is small

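A minimal sketch of textbook rejection sampling, which may differ in detail from the variant the slide illustrates. It assumes a small discrete space with hypothetical unnormalized weights, a uniform proposal, and acceptance with probability weight / max weight.

```python
import random
from collections import Counter

# Hypothetical unnormalized target: P(A) = 0.25, P(B) = 0.5, P(C) = 0.25.
WEIGHTS = {"A": 1.0, "B": 2.0, "C": 1.0}

def rejection_sample(weights, rng):
    """Propose uniformly; accept z with probability weights[z] / max weight."""
    states = list(weights)
    bound = max(weights.values())
    while True:
        z = rng.choice(states)
        if rng.random() < weights[z] / bound:
            return z                       # accepted
        # otherwise reject and propose again

rng = random.Random(0)
samples = [rejection_sample(WEIGHTS, rng) for _ in range(20_000)]
frequencies = {z: c / len(samples) for z, c in Counter(samples).items()}
print(frequencies)
```

With many variables, a uniform proposal over all joint assignments is accepted only rarely, which is one reason this approach is practical mainly for small problems.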
The Markov Chain Monte Carlo revolution
- Goal: to generate samples from a distribution P(y | x)
  - The target distribution could be intractable to sample from
- Idea: draw examples in such a way that, in the long run, their distribution is close to P(y | x)
- Formally: construct a Markov chain over structures whose stationary distribution is P
  - An iterative process that constructs examples
  - Initially, samples might not be from the target distribution
  - But after a long enough time, the samples come from a distribution that is increasingly close to P

A detour. Recall: Markov chains
A collection of random variables y_0, y_1, y_2, ..., y_t forms a Markov chain if the i-th state depends only on the previous one.
[Figure: transition diagram over states A to F with edge probabilities of 0.1, 0.8, and 0.9.]
Example trajectory: A → B → C → D → E → F → F → A → A → E → F → B → C

Temporal dynamics of a Markov chain
What is the probability that a chain is in state z at time t+1?
  $P(y_{t+1} = z) = \sum_{y} P(y_t = y)\, P(y_{t+1} = z \mid y_t = y)$
Exercise: Suppose a Markov chain with these transition probabilities starts at A. What is the distribution over states after two steps?
[Figure: transition diagram over states A to F with edge probabilities of 0.1, 0.8, and 0.9.]

Stationary distributions
- Informally: if the set of states is {A, B, C, D, E, F}, a stationary distribution π is a distribution over the states such that, after a transition, the distribution over the states is still π
- How do we get to a stationary distribution?
  - A regular Markov chain: there is a non-zero probability of getting from any state to any other in a finite number of steps
  - If the transition matrix is regular, just run the chain for a long time
  - The steady-state behavior is the stationary distribution

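A small sketch of "just run it for a long time". The three-state transition matrix `P` is hypothetical (it is not the six-state chain from the figure); the same `step` function also answers the "distribution after two steps" kind of question.

```python
def step(dist, P):
    """One transition: new[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical regular chain; row i holds the probabilities of leaving state i.
P = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]

start = [1.0, 0.0, 0.0]                  # the chain starts in state 0
two_steps = step(step(start, P), P)      # distribution after two steps

stationary = start
for _ in range(200):                     # run the chain for a long time
    stationary = step(stationary, P)

print(two_steps)
print(stationary)
```

Applying `step` to `stationary` once more leaves it unchanged up to floating-point error, which is the defining property of a stationary distribution.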
Back to inference: Markov Chain Monte Carlo for inference
- Design a Markov chain such that
  - Every state is a structure
  - The stationary distribution of the chain is the probability distribution we care about, P(y | x)
- How to do inference?
  - Run the Markov chain for a long time, until we think it has reached steady state
  - Let the chain wander around the space and collect samples
  - We then have samples from P(y | x)

MCMC for inference
[Figure: the chain is run in stages, with "After many steps" between them, and the structures it visits are counted as samples.]
- With sufficient samples, we can answer inference questions like calculating the partition function (just sum over the samples)

MCMC algorithms
- Metropolis-Hastings algorithm
- Gibbs sampling
  - An instance of the Metropolis-Hastings algorithm
- Many variants exist
- Remember: we are sampling from an exponential state space
  - All possible assignments to the random variables

Metropolis-Hastings [Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953] [Hastings 1970]
- Proposal distribution q(y → y')
  - Proposes changes to the state
  - Could propose large changes to the state
- Acceptance probability α
  - Should the proposal be accepted or not?
  - If yes, move to the proposed state; else remain in the previous state

Metropolis-Hastings Algorithm
The distribution we care about is P(y | x).
- Start with an initial guess y_0
- Loop for t = 0, 1, ..., N-1
  - Propose the next state y' (sample from q(y_t → y'))
  - Calculate the acceptance probability α
  - With probability α, accept the proposal
  - If accepted: y_{t+1} ← y'; else y_{t+1} ← y_t
- Return {y_0, y_1, ..., y_N}
- Note that we don't need to compute the partition function. Why?
- Idea: after enough iterations, the distribution of the samples is invariant

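A minimal sketch of the loop above on a toy problem. Everything model-specific is an assumption: the structures are binary tuples, `unnormalized_score` is a hypothetical model, and the proposal flips one uniformly chosen bit. Because that proposal is symmetric, the standard Metropolis-Hastings acceptance probability reduces to min(1, score(y') / score(y_t)); only a ratio of scores appears, so the partition function cancels.

```python
import math
import random
from collections import Counter

def unnormalized_score(y):
    """Hypothetical model: exp(number of adjacent positions that agree)."""
    return math.exp(sum(a == b for a, b in zip(y, y[1:])))

def propose(y, rng):
    """Symmetric proposal: flip one uniformly chosen bit."""
    i = rng.randrange(len(y))
    return y[:i] + (1 - y[i],) + y[i + 1:]

def metropolis_hastings(y0, num_steps, rng):
    samples, y = [y0], y0
    for _ in range(num_steps):
        y_new = propose(y, rng)
        # Ratio of unnormalized scores: the partition function cancels.
        alpha = min(1.0, unnormalized_score(y_new) / unnormalized_score(y))
        if rng.random() < alpha:
            y = y_new                      # accept the proposal
        samples.append(y)                  # on rejection, y_t is repeated
    return samples

samples = metropolis_hastings((0, 0, 0), 100_000, random.Random(0))
freq_000 = Counter(samples)[(0, 0, 0)] / len(samples)
print(freq_000)
```

For this toy model the exact probability of (0, 0, 0) can be computed by enumerating all eight states, which gives a simple check that the chain samples from the intended distribution.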
Proposal functions for Metropolis
- A design choice
- Different possibilities
  - Only make local changes to the factor graph
    - But the chain might not explore widely
  - Make big jumps in the state space
    - But the chain might move very slowly
  - Doesn't have to depend on the size of the graph

Gibbs Sampling
- Start with an initial guess y = (y_1, y_2, ..., y_n)
- Loop several times
  - For i = 1 to n:
    - Sample y_i from P(y_i | y_1, y_2, ..., y_{i-1}, y_{i+1}, ..., y_n, x)
  - We now have a complete sample
- The ordering is arbitrary
- A specific instance of the Metropolis-Hastings algorithm; no proposal needs to be designed

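A minimal sketch of one Gibbs sweep on a toy problem with binary variables. The model `unnormalized_score` is hypothetical; the conditional for y_i is obtained by scoring the two completions that differ only in position i and normalizing over them.

```python
import math
import random
from collections import Counter

def unnormalized_score(y):
    """Hypothetical model: exp(number of adjacent positions that agree)."""
    return math.exp(sum(a == b for a, b in zip(y, y[1:])))

def gibbs_sweep(y, rng):
    """Resample every variable once from its conditional given all the others."""
    y = list(y)
    for i in range(len(y)):
        weights = []
        for value in (0, 1):               # score both settings of y_i
            y[i] = value
            weights.append(unnormalized_score(y))
        p_one = weights[1] / (weights[0] + weights[1])
        y[i] = 1 if rng.random() < p_one else 0
    return tuple(y)

rng = random.Random(0)
y, samples = (0, 0, 0), []
for _ in range(20_000):                    # one complete sample per sweep
    y = gibbs_sweep(y, rng)
    samples.append(y)

freq_000 = Counter(samples)[(0, 0, 0)] / len(samples)
print(freq_000)
```

Only the two local scores are needed for each update, so the normalization is over a single variable's values and never over the full output space.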
MAP inference with MCMC
- So far we have only seen how to collect samples
- Marginal inference with samples is easy
  - Compute the marginal probabilities from the samples
- MAP inference:
  - Find the sample with the highest probability
  - To help convergence to the maximum, the acceptance condition is modified with a temperature parameter T that increases with every step
  - Similar to simulated annealing

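A sketch of annealed MCMC for MAP inference under explicit assumptions, since the exact acceptance formula is not given here. It assumes the score ratio is raised to the power T, so that an increasing T makes downhill moves ever less likely; classical simulated annealing states the same idea with exponent 1/T and a decreasing temperature. The model `log_score` and the schedule are hypothetical.

```python
import math
import random

def log_score(y):
    """Hypothetical log-score: adjacent agreements plus a bonus of 0.5 per 1."""
    return sum(a == b for a, b in zip(y, y[1:])) + 0.5 * sum(y)

def annealed_map(y0, num_steps, rng):
    """Run an annealed chain and return the best state it ever visits."""
    y = best = y0
    for t in range(1, num_steps + 1):
        T = 1.0 + 0.01 * t                 # assumed schedule: T grows each step
        i = rng.randrange(len(y))
        y_new = y[:i] + (1 - y[i],) + y[i + 1:]    # flip one bit
        diff = log_score(y_new) - log_score(y)
        # Accept with probability min(1, (score(y') / score(y)) ** T).
        if diff >= 0 or rng.random() < math.exp(T * diff):
            y = y_new
        if log_score(y) > log_score(best):
            best = y
    return best

best = annealed_map((0,) * 6, 2_000, random.Random(0))
print(best, log_score(best))
```

Early on the chain still accepts some downhill moves and can leave the all-zeros local optimum; as T grows it behaves more and more like hill climbing.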
Summary of MCMC methods
- A different approach for inference
  - No guarantee of exactness
- General idea
  - Set up a Markov chain whose stationary distribution is the probability distribution that we care about
  - Run the chain, collect samples, aggregate
- Metropolis-Hastings, Gibbs sampling
  - Many, many, many variants abound!
- Useful when exact inference is intractable
- Typically low memory costs; local changes only for Gibbs sampling
Questions?

Inference
- What is inference? The prediction step
  - More broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum
  - Different flavors: MAP, marginal, loss-augmented
- Many algorithms, solution strategies
  - One size doesn't fit all
- Next steps:
  - How can we take advantage of domain knowledge in inference?
  - How can we deal with making predictions about latent variables for which we don't have data?
Questions?