Lecture 8: Inference. Kai-Wei Chang University of Virginia
1 Lecture 8: Inference Kai-Wei Chang University of Virginia kw@kwchang.net Some slides are adapted from Vivek Srikumar's course on Structured Prediction Advanced ML: Inference 1
2 So far what we learned v Thinking about structures v A graph, a collection of parts that are labeled jointly, a collection of decisions [figure: two example graphs over nodes A through G] v Next: Prediction v This is what sets structured prediction apart from binary/multiclass Advanced ML: Inference 2
3 The bigger picture v The goal of structured prediction: Predicting a graph v Modeling: Defining probability distributions over the random variables v Involves making independence assumptions v Inference: The computational step that actually constructs the output v Also called decoding v Learning creates functions that score predictions (e.g., learning model parameters) Advanced ML: Inference 3
4 Computational issues [figure: a map of the design questions] v Model definition: What are the parts of the output? What are the inter-dependencies? v How to train the model? Data annotation difficulty; semi-supervised/indirectly supervised? v Background knowledge about the domain v How to do inference? v Inference: deriving the probability of one or more random variables based on the model Advanced ML: Inference 4
5 Inference v What is inference? v An overview of what we have seen before v Combinatorial optimization v Different views of inference v Integer programming v Graph algorithms v Sum-product, max-sum v Heuristics for inference v LP relaxation v Sampling Advanced ML: Inference 5
6 Remember sequence prediction v Goal: Find the most probable/highest scoring state sequence v argmax_y score(y) = argmax_y w^T φ(x, y) v Computationally: discrete optimization v The naive algorithm v Enumerate all sequences, score each one and pick the max v Terrible idea! v We can do better v Scores decompose over edges Advanced ML: Inference 6
9 The Viterbi algorithm: Recurrence. Goal: Find argmax_y w^T φ(x, y), where y = (y_1, y_2, …, y_n). [figure: chain y_1 → y_2 → y_3 → … → y_n] Idea: 1. If I know the score of all sequences y_1 to y_{n-1}, then I could decide y_n easily. 2. Recurse to get the score up to y_{n-1}. Advanced ML: Inference 9
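To make the recurrence concrete, here is a minimal Python sketch; the toy emission/transition scores are made-up illustrations, not values from the slides, and the score is assumed to decompose additively over edges.

```python
# A minimal sketch of the Viterbi recurrence, assuming the total score
# decomposes over edges; the toy scores below are made up.
import numpy as np

def viterbi(emit, trans):
    """emit: (n, K) per-position label scores; trans: (K, K) edge scores."""
    n, K = emit.shape
    score = np.zeros((n, K))
    back = np.zeros((n, K), dtype=int)
    score[0] = emit[0]
    for i in range(1, n):
        # cand[a, b] = score of reaching label b at position i from label a
        cand = score[i - 1][:, None] + trans + emit[i]
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0)
    y = [int(score[-1].argmax())]          # best final label ...
    for i in range(n - 1, 0, -1):          # ... then follow back pointers
        y.append(int(back[i, y[-1]]))
    return y[::-1], float(score[-1].max())

emit = np.array([[2., 1.], [1., 3.], [2., 2.]])   # toy values
trans = np.array([[0., 1.], [1., 0.]])
print(viterbi(emit, trans))
```

Note that this runs in O(nK^2) time instead of enumerating all K^n sequences, exactly because the score decomposes over edges.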
10 Inference questions v This class: v Mostly we use inference to mean: "What is the highest scoring assignment to the output random variables for a given input?" v Maximum A Posteriori (MAP) inference (if the score is probabilistic) v Other inference questions: v What is the highest scoring assignment to some of the output variables given the input? v Sample from the posterior distribution over Y v Loss-augmented inference: Which structure most violates the margin for a given scoring function? v Computing marginal probabilities over Y Advanced ML: Inference 10
14 MAP inference is discrete optimization v A combinatorial problem: argmax_y w^T φ(x, y) v Computational complexity depends on v The size of the input v The factorization of the scores v More complex factors generally lead to expensive inference v A generally bad strategy in all but the simplest cases: Enumerate all possible structures and pick the highest scoring one Advanced ML: Inference 14
15 MAP inference is search v We want the graph with the highest scoring structure: argmax_y w^T φ(x, y) v Without assumptions, no algorithm can find the max without considering every possible structure v How can we solve this computational problem? v Exploit the structure of the search space and the cost function v That is, exploit the decomposition of the scoring function v Usually, stronger assumptions lead to easier inference v E.g., consider 10 independent random variables Advanced ML: Inference 15
16 Approaches for inference v Exact vs. approximate inference v Should the maximization be performed exactly? v Or is a close-to-highest-scoring structure good enough? v Exact: Search, dynamic programming, integer linear programming, … v Heuristic (called approximate inference): Gibbs sampling, belief propagation, beam search, linear programming relaxations, … v Randomized vs. deterministic v Relevant for approximate inference: If I run the inference program twice, will I get the same answer? Advanced ML: Inference 16
17 Coming up v Formulating general inference as integer linear programs v And variants of this idea v Graph algorithms, dynamic programming, greedy search v We have seen the Viterbi algorithm: it uses a cleverly defined ordering to decompose the output into a sequence of decisions. We will talk about a general algorithm: max-product/sum-product v Heuristics for inference v Sampling, Gibbs sampling v Approximate graph search, beam search v LP-relaxation Advanced ML: Inference 17
18 Inference: Integer Linear Programs Advanced ML: Inference 18
19 The big picture v MAP inference is combinatorial optimization v Combinatorial optimization problems can be written as integer linear programs (ILPs) v The conversion is not always trivial v Allows injection of knowledge in the form of constraints v Different ways of solving ILPs v Commercial solvers: CPLEX, Gurobi, etc. v Specialized solvers if you know something about your problem v Lagrangian relaxation, amortized inference, etc. v Can relax to linear programs and hope for the best v Integer linear programs are NP-hard in general v No free lunch Advanced ML: Inference 19
20 Detour: Linear programming v Minimizing a linear objective function subject to a finite number of linear constraints (equality or inequality) v Very widely applicable v Operations research, micro-economics, management v Historical note v Developed during World War II to reduce army costs v "Programming" here is not the same as computer programming Advanced ML: Inference 20
22 Example: The diet problem. A student wants to spend as little money as possible on food while getting sufficient amounts of vitamin Z and nutrient X. Her options: [table: Item / Cost per 100g / Vitamin Z / Nutrient X, with rows for Carrots, Sunflower seeds, Double cheeseburger] How should she spend her money to get at least 5 units of vitamin Z and 3 units of nutrient X? Let c, s and d denote how much of each item is purchased. Minimize the total cost, subject to: at least 5 units of vitamin Z, at least 3 units of nutrient X, and the number of units purchased is not negative. Advanced ML: Inference 22
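This LP can be handed to a solver directly. Below is a minimal sketch with scipy; the costs and nutrient contents are hypothetical stand-ins, since the table's numeric values are not shown here.

```python
# A minimal sketch of the diet LP with scipy, using HYPOTHETICAL costs
# and nutrient contents for carrots (c), sunflower seeds (s), and a
# double cheeseburger (d), measured per 100g.
import numpy as np
from scipy.optimize import linprog

cost = np.array([0.5, 1.0, 2.0])         # hypothetical cost per 100g
vitamin_z = np.array([2.0, 1.0, 0.5])    # hypothetical vitamin Z per 100g
nutrient_x = np.array([0.5, 2.0, 1.0])   # hypothetical nutrient X per 100g

# linprog solves min c^T x s.t. A_ub @ x <= b_ub, so each ">= requirement"
# is negated into a "<=" constraint. Bounds keep purchases non-negative.
res = linprog(
    c=cost,
    A_ub=-np.stack([vitamin_z, nutrient_x]),
    b_ub=-np.array([5.0, 3.0]),
    bounds=[(0, None)] * 3,
)
print(res.x, res.fun)   # optimal amounts and minimum total cost
```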
27 Linear programming v In general: minimize c^T x subject to Ax ≤ b (both the objective and the constraints are linear in x) Advanced ML: Inference 27
35 Linear programming v In general: minimize c^T x subject to Ax ≤ b v This is a continuous optimization problem v And yet, there is only a finite set of possible solutions v For example: suppose we had to maximize any c^T x on the region a_1 x_1 + a_2 x_2 + a_3 x_3 = b, x ≥ 0 [figure: a triangular feasible region in (x_1, x_2, x_3) with the objective direction c] These three vertices are the only possible solutions! Advanced ML: Inference 35
36 Linear programming v In general: minimize c^T x subject to Ax ≤ b v This is a continuous optimization problem v And yet, there is only a finite set of possible solutions v The constraint matrix defines a polytope v Only the vertices or faces of the polytope can be solutions Advanced ML: Inference 36
40 Geometry of linear programming. The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed). The objective defines a cost for every point in the space [figure: darker is higher]. Even though all points in the region are allowed, the vertices maximize/minimize the cost. Advanced ML: Inference 40
41 Linear programming v In general: minimize c^T x subject to Ax ≤ b v This is a continuous optimization problem v And yet, there is only a finite set of possible solutions v The constraint matrix defines a polytope v Only the vertices or faces of the polytope can be solutions v Linear programs can be solved in polynomial time Questions? Advanced ML: Inference 41
42 Integer linear programming v In general: minimize c^T x subject to Ax ≤ b, with x ∈ Z^n (every x_i an integer) Advanced ML: Inference 42
43 Geometry of integer linear programming. The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed). The objective defines a cost for every point in the space. Only integer points are allowed. Advanced ML: Inference 43
44 Integer linear programming v In general: minimize c^T x subject to Ax ≤ b, x ∈ Z^n v Solving integer linear programs in general can be NP-hard! v LP-relaxation: Drop the integer constraints and hope for the best Advanced ML: Inference 44
45 0-1 integer linear programming v In general: minimize c^T x subject to Ax ≤ b, x ∈ {0, 1}^n v An instance of integer linear programming v Still NP-hard v Geometry: We are only considering points that are vertices of the Boolean hypercube Advanced ML: Inference 45
46 0-1 integer linear programming v In general: minimize c^T x subject to Ax ≤ b, x ∈ {0, 1}^n v An instance of integer linear programming v Still NP-hard v Geometry: We are only considering points that are vertices of the Boolean hypercube v Constraints prohibit certain vertices [figure: only points within the region defined by Ax ≤ b are allowed; the relaxed solution can be an interior point of that constraint set] Questions? Advanced ML: Inference 46
47 Back to structured prediction v Recall that we are solving argmax_y w^T φ(x, y) v The goal is to produce a graph v The set of possible values that y can take is finite, but large v General idea: Frame the argmax problem as a 0-1 integer linear program v Allows addition of arbitrary constraints Advanced ML: Inference 47
48 Thinking in ILPs. Let's start with multi-class classification: argmax_{y ∈ {A,B,C}} w^T φ(x, y) = argmax_{y ∈ {A,B,C}} score(y). Introduce decision variables for each label: v z_A = 1 if output = A, 0 otherwise v z_B = 1 if output = B, 0 otherwise v z_C = 1 if output = C, 0 otherwise Maximize the score: max Σ_y score(y) z_y. Pick exactly one label: Σ_y z_y = 1. Advanced ML: Inference 48
51 Thinking in ILPs. We have taken a trivial problem (finding the highest scoring element of a list) and converted it into a representation that is NP-hard in the worst case! Lesson: Don't solve multiclass classification with an ILP solver. Advanced ML: Inference 51
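To make the encoding concrete (even though, per the lesson above, you would not do this in practice), here is a minimal sketch using the PuLP library; the label scores are made up for illustration.

```python
# A minimal sketch of the multiclass argmax written as a 0-1 ILP with
# PuLP (any ILP solver would do); the label scores are hypothetical.
import pulp

scores = {"A": 2.0, "B": 3.5, "C": 1.0}   # hypothetical label scores

prob = pulp.LpProblem("multiclass", pulp.LpMaximize)
z = {y: pulp.LpVariable(f"z_{y}", cat="Binary") for y in scores}

# Objective: maximize the score of the chosen label.
prob += pulp.lpSum(scores[y] * z[y] for y in scores)
# Constraint: pick exactly one label.
prob += pulp.lpSum(z.values()) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([y for y in scores if z[y].value() == 1])   # -> ['B']
```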
54 ILP for general conditional models. Suppose each y_i can be A, B or C. [figure: factor graph over y_1, y_2, y_3 and x_1, x_2, x_3] Introduce one decision variable for each part being assigned labels: unary variables z_1A, z_1B, z_1C and z_2A, z_2B, z_2C, plus pairwise variables z_13AA, z_13AB, …, z_13CC and z_23AA, …, z_23CC. Each of these decision variables is associated with a score. Our goal: max_y W^T φ(x_1, y_1) + W^T φ(y_1, y_2, y_3) + W^T φ(x_3, y_2, y_3) + W^T φ(x_1, x_2, y_2) Questions? Advanced ML: Inference 54
55 ILP for general conditional models. Not all decisions can exist together. E.g., z_13AB implies z_1A and z_3B. Our goal: max_y W^T φ(x_1, y_1) + W^T φ(y_1, y_2, y_3) + W^T φ(x_3, y_2, y_3) + W^T φ(x_1, x_2, y_2) Advanced ML: Inference 55
56 Writing constraints as linear inequalities v Exactly one of z_1A, z_1B, z_1C can be true: z_1A + z_1B + z_1C = 1 v At least m of z_1A, z_1B, z_1C should be true: z_1A + z_1B + z_1C ≥ m v At most m of z_1A, z_1B, z_1C should be true: z_1A + z_1B + z_1C ≤ m v Implication: z_i ⇒ z_j v Convert to a disjunction: ¬z_i ∨ z_j (at least one of ¬z_i or z_j): (1 − z_i) + z_j ≥ 1 Advanced ML: Inference 56
58 Integer linear programming for inference v Easy to add additional knowledge v Specify it as Boolean formulas v Examples (encoded as inequalities below): v If y_1 is an A, then y_2 or y_3 should be a B or C v No more than two A's allowed in the output v Many inference problems have standard mappings to ILPs v Sequences, parsing, dependency parsing v The encoding of the problem makes a difference in solving time v The mechanical encoding may not be efficient to solve v Generally: more complex constraints make solving harder Advanced ML: Inference 58
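For instance, the two example constraints above admit the encodings below, written with the z variables from the earlier slides; this is one possible encoding, not the only one.

```latex
% "If y_1 is an A, then y_2 or y_3 should be a B or C":
z_{1A} \le z_{2B} + z_{2C} + z_{3B} + z_{3C}
% "No more than two A's allowed in the output":
\sum_i z_{iA} \le 2
```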
59 Exercise: Sequence labeling. Goal: Find argmax_y W^T φ(x, y), y = (y_1, y_2, …, y_n). [figure: chain y_1 → y_2 → … → y_n] How can this be written as an ILP? Advanced ML: Inference 59
60 ILP for inference: Remarks v Many combinatorial optimization problems can be written as an ILP v Even the "easy"/polynomial ones v Given an ILP, checking whether it represents a polynomial problem is intractable in general v ILPs are a general language for thinking about combinatorial optimization v The representation allows us to make general statements about inference v Off-the-shelf solvers for ILPs are quite good v Gurobi, CPLEX v Use an off-the-shelf solver only if you can't solve your inference problem otherwise Advanced ML: Inference 60
61 Coming up v Formulating general inference as integer linear programs v And variants of this idea v Graph algorithms, dynamic programming, greedy search v We have seen the Viterbi algorithm: it uses a cleverly defined ordering to decompose the output into a sequence of decisions. We will talk about a general algorithm: max-product/sum-product v Heuristics for inference v Sampling, Gibbs sampling v Approximate graph search, beam search v LP-relaxation Advanced ML: Inference 61
62 Inference: Graph algorithms Belief Propagation Advanced ML: Inference 62
64 Variable elimination (motivation) v Remember: We have a collection of inference variables that need to be assigned: y = (y_1, y_2, …) v General algorithm: v First fix an ordering of the variables, say (y_1, y_2, …) v Iteratively: v Find the best value for y_i given the values of the previous neighbors v Use back pointers to find the final answer v Viterbi is an instance of max-product variable elimination Advanced ML: Inference 64
66 Variable elimination example (max-sum). [figure: chain y_1, y_2, y_3, …, y_n over labels {A, B, C, D}, with local scores such as B: 4, C: 1, D: 0] First eliminate y_1: Score_1(s) = score_local_1(START, s); Score_i(s) = max_{y_{i−1}} [Score_{i−1}(y_{i−1}) + score_local_i(y_{i−1}, s)] Advanced ML: Inference 66
67 Variable elimination example, continued: next eliminate y_2, then y_3, and so on, applying the same recurrence Score_i(s) = max_{y_{i−1}} [Score_{i−1}(y_{i−1}) + score_local_i(y_{i−1}, s)]. Once everything up to y_{n−1} is eliminated, we have all the information to make a decision for y_n. Questions? Advanced ML: Inference 69
70 Two types of inference problems v Marginals: find P(y_i | x), summing the joint over all the other variables v Maximizer: find argmax_y P(y | x) [figure: probability in different domains] Advanced ML: Inference 70
71 Belief Propagation v BP provides the exact solution when there are no loops in the graph v Viterbi is a special case v Otherwise, "loopy" BP provides an approximate solution. We use sum-product BP as the running example, where we want to compute the partition function Z in P(x) = (1/Z) ∏_{(i,j)} f_ij(x_i, x_j) ∏_i g_i(x_i) Advanced ML: Inference 71
72 Intuition v An iterative process in which neighboring variables pass messages to each other: "I (variable x_3) think that you (variable x_2) belong in these states with various likelihoods." v After enough iterations, this conversation is likely to converge to a consensus that determines the marginal probabilities of all the variables. Advanced ML: Inference 72
73 Message v Message from node i to node j: m_ij(x_j) v A message is not a probability v It may not sum to 1 v A high value of m_ij(x_j) means that node i believes the marginal value P(x_j) to be high. Advanced ML: Inference 73
74 Beliefs v Estimated marginal probabilities are called beliefs. v Algorithm: v Update messages until convergence v Then calculate beliefs Advanced ML: Inference 74
75 Message update v To update message from i to j, consider all messages flowing into i Advanced ML: Inference 75
76 Message update: m_ij(x_j) = Σ_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ N(i)∖j} m_ki(x_i). Belief: b_i(x_i) ∝ g_i(x_i) ∏_{k ∈ N(i)} m_ki(x_i) Advanced ML: Inference 77
78 Sum-product vs. max-product v The standard BP we just described is sum-product, used to estimate marginals v A variant called max-product (or max-sum in log space) is used to estimate the MAP assignment Advanced ML: Inference 78
79 Max-product v The message update is the same as before, except that the sum is replaced by a max: m_ij(x_j) = max_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ N(i)∖j} m_ki(x_i) v Belief: estimate the most likely states Advanced ML: Inference 79
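A toy sketch of both updates on a chain follows; it assumes pairwise scores only (unaries fixed to 1 for brevity), with made-up numbers in the style of the running example rather than the slides' exact graph.

```python
# Sum-product vs. max-product message passing on a toy chain with
# made-up pairwise scores and unary scores fixed to 1.
import numpy as np

f = np.array([[1.0, 2.0],      # f[i, j] = score of transition i -> j,
              [3.0, 4.0]])     # states 0 = A, 1 = B
n = 4                          # chain length

# Sum-product: m_new(x_j) = sum_{x_i} f(x_i, x_j) * m(x_i)
m = np.ones(2)
for _ in range(n - 1):
    m = f.T @ m
Z = m.sum()                    # partition function

# Max-product: the same recursion with sum replaced by max
m = np.ones(2)
for _ in range(n - 1):
    m = (f * m[:, None]).max(axis=0)
map_score = m.max()            # score of the best assignment

print(Z, map_score)            # -> 290.0 and 64.0 for this chain
```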
80 Recap Advanced ML: Inference 80
81 Example (Sum-Product). Pairwise scores: A→A: 1, A→B: 2, B→A: 3, B→B: 4. For the sake of simplicity, let's assume all the transition and emission scores are the same everywhere in the graph. How many possible assignments? Advanced ML: Inference 81
82 Example (Sum-Product), continued. [figure: a small tree of variables over labels {A, B}] Each leaf sends a message to its neighbor: A: 5, B: 8. A node with two incoming leaf messages combines them (5·5 for A, 8·8 for B) with its own local score, giving the message A: 50, B: 64. Advanced ML: Inference 86
87 Example (Sum-Product). Let's verify, for the sub-tree with the parent labeled A: (A,(A,A)) = 8, (A,(A,B)) = 12, (A,(B,A)) = 12, (A,(B,B)) = 18. These sum to 8 + 12 + 12 + 18 = 50, matching the message computed for A. Advanced ML: Inference 88
89 Example (Sum-Product). The message passed onward from that node: A: 1·50 + 3·64 = 242, B: 2·50 + 4·64 = 356. Advanced ML: Inference 89
90 Example (Sum-Product). At the last node, combine the incoming message with the local leaf messages: A: 242·50 = 12,100, B: 356·64 = 22,784. Summing gives the partition function Z = 12,100 + 22,784 = 34,884. Advanced ML: Inference 91
92 Example (max-product). Same pairwise scores: A→A: 1, A→B: 2, B→A: 3, B→B: 4, and the same graph as before. How many possible assignments? Advanced ML: Inference 92
93 Example (max-product). Message update: m_ij^new(x_j) = max_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ N(i)∖j} m_ki^old(x_i). With unary scores g(A) = 2 and g(B) = 1, each leaf sends: A: max(1·2, 3·1) = 3, B: max(2·2, 4·1) = 4. Advanced ML: Inference 93
94 Example (max-product), continued. Each leaf message is A: 3, B: 4. A node with two incoming leaf messages combines them with its local score: A: 3·3·2 = 18, B: 4·4·1 = 16. Advanced ML: Inference 97
98 Example (max-product). Let's verify, for the sub-tree with the parent labeled A: (A,(A,A)) = 8, (A,(A,B)) = 12, (A,(B,A)) = 12, (A,(B,B)) = 18. The max is 18, matching the message computed for A. Advanced ML: Inference 99
100 Example (max-product). The message passed onward: A: max(1·18, 3·16) = 48, B: max(2·18, 4·16) = 64. Advanced ML: Inference 100
101 Example (max-product). At the last node: A: 18·48 = 864, B: 16·64 = 1,024. The MAP score is therefore 1,024. Advanced ML: Inference 102
103 Example (max-product). Each max also records a backpointer (shown in parentheses on the slide, e.g., A 48 (B) means the best predecessor of A is B); following the backpointers from the best final value, B with 1,024, recovers a maximizing assignment. Advanced ML: Inference 104
105 Inference: Graph algorithms General Search Advanced ML: Inference 105
106 Inference as search: General setting v Predicting a graph as a sequence of decisions v General data structures: v State: Encodes partial structure v Transitions: Move from one partial structure to another v Start state v End state: We have a full structure v There may be more than one end state v Each transition is scored with the learned model v Goal: Find an end state that has the highest total score Questions? Advanced ML: Inference 106
110 Example. Suppose each y can be one of A, B or C. [figure: factor graph over y_1, y_2, y_3 and x_1, x_2, x_3] Note: Here we have assumed an ordering (y_1, y_2, y_3). How do the transitions get scored? State: Triples (y_1, y_2, y_3), all possibly unknown: (A, -, -), (-, A, A), (-, -, -), … Transition: Fill in one of the unknowns. Start state: (-,-,-). End state: All three y's are assigned. [figure: search tree from (-,-,-) through partial states such as (A,-,-), (B,-,-), (C,-,-), (A,A,-), …, down to complete states (A,A,A), …, (C,C,C)] Questions? Advanced ML: Inference 110
111 Graph search algorithms v Breadth/depth first search v Keep a stack/queue/priority queue of open states v The good: Guaranteed to be correct v Explores every option v The bad? v Explores every option: could be slow for any non-trivial graph [figure: the full search tree from (-,-,-) down to the complete assignments] Advanced ML: Inference 111
112 Greedy search v At each state, choose the highest scoring next transition v Keep only one state in memory: the current state v What is the problem? v Local decisions may rule out the global optimum v Does not explore the full search space v Greedy algorithms can give the true optimum for special classes of problems v E.g., maximum-spanning-tree algorithms are greedy Questions? Advanced ML: Inference 112
113 Example. [figure: greedy search expands only one branch of the search tree, e.g., (-,-,-) → (A,-,-) → (A,B,-) → (A,B,C), while the siblings at each level are discarded] Questions? Advanced ML: Inference 113
115 Example (greedy). Same pairwise scores: A→A: 1, A→B: 2, B→A: 3, B→B: 4. Greedy local choices can end with a total score far below the optimum of 1,024 found earlier by max-product. Advanced ML: Inference 115
116 Beam search: A compromise v Keep a size-limited priority queue of states v Called the beam; sorted by total score for the state v At each step: v Explore all transitions from the current states v Add all to the beam and trim it back to size v The good: Explores more than greedy search v The bad: A good state might fall out of the beam v In general, easy to implement and very popular v No guarantees (a sketch follows below) Questions? Advanced ML: Inference 116
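Here is a short sketch of that loop; the additive step scorer is a made-up illustration, not a learned model.

```python
# A minimal sketch of beam search over partial label sequences,
# assuming an additive scoring function; the scorer is made up.
import heapq

LABELS = ["A", "B", "C"]

def step_score(prefix, label):
    # Toy local score: prefer alternating labels.
    return 1.0 if (not prefix or prefix[-1] != label) else 0.2

def beam_search(n, beam_size=3):
    beam = [(0.0, [])]                 # (total score, partial sequence)
    for _ in range(n):
        candidates = []
        for score, prefix in beam:
            for label in LABELS:
                candidates.append((score + step_score(prefix, label),
                                   prefix + [label]))
        # Keep only the top `beam_size` states; a good state can fall out!
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beam[0]   # highest scoring complete sequence found

print(beam_search(5))
```

Setting beam_size to 1 recovers greedy search; letting it grow without bound recovers exhaustive search.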
117 Example beam = 3 Credit: Graham Neubig Advanced ML: Inference 117
118 v Calculate score, but ignore removed hypotheses Advanced ML: Inference 118
119 v Keep only best three Advanced ML: Inference 119
120 Structured prediction approaches based on search v Learning to search approaches v Assume the complex decision is incrementally constructed by a sequence of decisions v E.g., DAgger, SEARN, transition-based methods v Learn how to make decisions at each branch. Advanced ML: Inference 120
121 Example: Dependency Parsing v Identifying relations between words [figure: two candidate dependency parses of the sentence "I ate a cake with a fork"] Advanced ML: Inference 121
122 Learning to search (L2S) approaches 1. Define a search space and features v Example: dependency parsing [Nivre03,NIPS16] v Maintain a buffer and a stack v Make predictions from left to right v Three (four) types of actions: Shift, Reduce-Left, Reduce-Right Advanced ML: Inference Credit: Google research blog 122
123 Learning to search approaches: Shift-Reduce parser v Maintain a buffer and a stack v Make predictions from left to right v Three (four) types of actions: Shift, Reduce-Left, Reduce-Right v E.g., starting from the buffer "I ate a cake": Shift moves "I" onto the stack, another Shift moves "ate", and Reduce-Left then makes "ate" the head of "I", leaving "ate" on the stack and "a cake" in the buffer. Advanced ML: Inference 123
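A bare-bones sketch of this transition system follows; the action sequence is hard-coded here purely to trace the example, whereas a real parser would pick each action with a learned policy.

```python
# A bare-bones arc-standard shift-reduce sketch; actions are hard-coded
# for illustration instead of being chosen by a learned policy.
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def reduce_left(stack, buffer, arcs):
    # The top of the stack becomes the head of the word below it.
    dep, head = stack[-2], stack[-1]
    arcs.append((head, dep))
    del stack[-2]

def reduce_right(stack, buffer, arcs):
    head, dep = stack[-2], stack[-1]
    arcs.append((head, dep))
    del stack[-1]

stack, buffer, arcs = [], "I ate a cake".split(), []
for action in [shift, shift, reduce_left, shift, shift, reduce_left,
               reduce_right]:
    action(stack, buffer, arcs)
print(arcs)   # [('ate', 'I'), ('cake', 'a'), ('ate', 'cake')]
```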
124 Learning to search (L2S) approaches 1. Define a search space and features 2. Construct a reference policy (Ref) based on the gold label 3. Learn a policy that imitates Ref Advanced ML: Inference 124
125 Policies v A policy maps an observation to an action: π(o) = a v The observation can include: the input x, the timestep t, the partial trajectory τ, and anything else Advanced ML: Inference 125
126 Imitation learning for joint prediction. Challenges: v There are a combinatorial number of search states v How does a sub-decision affect the final decision? Advanced ML: Inference 126
127 Credit Assignment Problem v When something goes wrong, which decision should be blamed? Advanced ML: Inference 127
128 Imitation learning for joint prediction. SEARN: [Langford, Daumé & Marcu] DAgger: [Ross, Gordon & Bagnell] AggreVaTe: [Ross & Bagnell] LOLS: [Chang, Krishnamurthy, Agarwal, Daumé, Langford] Advanced ML: Inference 128
129 Learning a Policy [ICML 15, Ross+15] v At a given state, we construct a cost-sensitive multi-class example, e.g., (state, [0, .2, .8]) [figure: roll-in to the state, one-step deviations with losses 0, .2, and .8, followed by roll-outs] Advanced ML: Inference 129
130 Example: Sequence Labeling Receive input: x = the monster ate the sandwich y = Dt Nn Vb Dt Nn Make a sequence of predictions: x = the monster ate the sandwich ŷ = Dt Dt Dt Dt Dt Pick a timestep and try all perturbations there: x = the monster ate the sandwich ŷ Dt = Dt Dt Vb Dt Nn l=1 ŷ Nn = Dt Nn Vb Dt Nn l=0 ŷ Vb = Dt Vb Vb Dt Nn l=1 Compute losses and construct example: ( { w=monster, p=dt, }, [1,0,1] ) Advanced ML: Inference 130
131 Learning to search approaches: Credit Assignment Compiler [NIPS16] v Write the decoder, providing some side information for training v The library translates this piece of program, together with data, into the update rules of the model v Applied to dependency parsing, named entity recognition, relation extraction, POS tagging v Implementation: Vowpal Wabbit Advanced ML: Inference 131
132 Approximate Inference Inference by sampling Advanced ML: Inference 132
133 Inference by sampling v Monte Carlo methods: A large class of algorithms v Origins in physics v Basic idea: v Repeatedly sample from a distribution v Compute aggregate statistics from samples v E.g.: The marginal distribution v Useful when we have many, many interacting variables 133
134 Why sampling works v Suppose we have some probability distribution P(z) v Might be a cumbersome function v We want to answer questions about this distribution v What is the mean? v Approximate with samples {z_1, z_2, …, z_n} from the distribution. E.g., the expectation: E[z] ≈ (1/n) Σ_i z_i v Theory tells us that this is a good estimator v Chernoff-Hoeffding style bounds 134
136 Key idea: rejection sampling. Sample from a simpler proposal distribution and reject samples inconsistent with the target; this works well when the number of variables is small. Advanced ML: Inference 136
138 The Markov Chain Monte Carlo revolution v Goal: To generate samples from a distribution P(y|x) v The target distribution could be intractable to sample from directly v Idea: draw examples in a way that, in the long run, the distribution of draws is close to P(y|x) v Formally: Construct a Markov chain of structures whose stationary distribution converges to P v An iterative process that constructs examples v Initially, samples might not be from the target distribution v But after a long enough time, the samples are from a distribution that is increasingly close to P 138
140 A detour. Recall: Markov chains. A collection of random variables y_0, y_1, y_2, …, y_t forms a Markov chain if the i-th state depends only on the previous one. [figure: a six-state transition diagram over {A, B, C, D, E, F} with edge probabilities, and a sample trajectory A→B→C→D→E→F→F→A→A→E→F→B→C→D→E→D]
142 Temporal dynamics of a Markov chain. What is the probability that the chain is in state z at time t+1? P(y_{t+1} = z) = Σ_{z′} P(y_t = z′) P(z′ → z). Exercise (see the sketch below): Suppose a Markov chain with these transition probabilities starts at A. What is the distribution over states after two steps?
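The two-step computation is just two matrix-vector products. A small numpy sketch follows; the transition matrix is hypothetical, since the diagram's edge probabilities are not shown here.

```python
# Distribution after two steps of a Markov chain, with a HYPOTHETICAL
# transition matrix standing in for the slide's diagram.
import numpy as np

states = ["A", "B", "C", "D", "E", "F"]
# T[i, j] = P(next = states[j] | current = states[i]); rows sum to 1.
T = np.array([
    [0.1, 0.4, 0.0, 0.0, 0.3, 0.2],
    [0.0, 0.2, 0.5, 0.0, 0.0, 0.3],
    [0.3, 0.0, 0.1, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.7, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.1, 0.4],
    [0.2, 0.3, 0.0, 0.0, 0.0, 0.5],
])

p0 = np.zeros(6)
p0[0] = 1.0            # start at A
p2 = p0 @ T @ T        # distribution over states after two steps
print(dict(zip(states, p2.round(3))))
```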
143 Stationary distributions v Informally: if the set of states is {A, B, C, D, E, F}, a stationary distribution is a distribution π over the states such that after a transition, the distribution over the states is still π v How do we get to a stationary distribution? v A regular Markov chain: there is a non-zero probability of getting from any state to any other in a finite number of steps v If the transition matrix is regular, just run the chain for a long time v The steady-state behavior is the stationary distribution 143
145 Back to inference Markov Chain Monte Carlo for inference v Design a Markov chain such that v Every state is a structure v The stationary distribution of the chain is the probability distribution we care about P(y x) v How to do inference? v Run the Markov chain for a long time till we think it gets to steady state v Let the chain wander around the space and collect samples v We have samples from P(y x) 145
151 MCMC for inference. [figure: a sequence of structures produced by the chain, with many steps between collected samples] With sufficient samples, we can answer inference questions like estimating the partition function (just sum over the samples). 151
154 MCMC algorithms v Metropolis-Hastings algorithm v Gibbs sampling v An instance of the Metropolis-Hastings algorithm v Many variants exist v Remember: We are sampling from an exponentially large state space v All possible assignments to the random variables 154
155 Metropolis-Hastings [Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953] [Hastings 1970] v Proposal distribution q(y′ | y) v Proposes changes to the state v Could propose large changes to the state v Acceptance probability α = min(1, P(y′) q(y | y′) / (P(y) q(y′ | y))) v Should the proposal be accepted or not? v If yes, move to the proposed state; else remain in the previous state 155
156 Metropolis-Hastings Algorithm v Start with an initial guess y_0 v Loop for t = 1, 2, …, N: v Propose the next state y′ (sampled from q(y′ | y_t)) v Calculate the acceptance probability α v With probability α, accept the proposal v If accepted: y_{t+1} ← y′; else y_{t+1} ← y_t v Return {y_0, y_1, …, y_N} The distribution we care about is P(y | x). Note that we don't need to compute the partition function. Why? (α only involves ratios of probabilities, so Z cancels.) Idea: with enough iterations, the target distribution is invariant under the chain; a sketch follows below. 160
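The following Python sketch instantiates the loop for a toy unnormalized distribution over label sequences; the score function and proposal are made up for illustration, and the proposal is symmetric so the q terms cancel in α.

```python
# A minimal Metropolis-Hastings sketch over toy label sequences; the
# unnormalized score and the symmetric proposal are made up.
import math
import random

LABELS = ["A", "B", "C"]

def score(y):
    # Made-up unnormalized log-score favoring repeated labels.
    return sum(2.0 if a == b else 0.5 for a, b in zip(y, y[1:]))

def propose(y):
    # Symmetric proposal: resample one random position.
    i = random.randrange(len(y))
    y2 = list(y)
    y2[i] = random.choice(LABELS)
    return y2

def metropolis_hastings(n_steps, length=5):
    y = [random.choice(LABELS) for _ in range(length)]
    samples = []
    for _ in range(n_steps):
        y_new = propose(y)
        # Symmetric q, so alpha = min(1, P(y') / P(y)); Z cancels.
        alpha = min(1.0, math.exp(score(y_new) - score(y)))
        if random.random() < alpha:
            y = y_new
        samples.append(list(y))
    return samples

samples = metropolis_hastings(10_000)
```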
161 Proposal functions for Metropolis v A design choice v Different possibilities v Only make local changes to the factor graph v But the chain might not explore widely v Make big jumps in the state space v But the chain might move very slowly v Doesn t have to depend on the size of the graph 161
162 Gibbs Sampling v Start with an initial guess y = (y_1, y_2, …, y_n) v Loop several times: v For i = 1 to n: v Sample y_i from P(y_i | y_1, y_2, …, y_{i−1}, y_{i+1}, …, y_n, x) v We now have a complete sample v The ordering is arbitrary v A specific instance of the Metropolis-Hastings algorithm; no proposal needs to be designed 162
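A toy sketch for a chain model is shown below; it reuses the running example's pairwise scores and, as an assumption for brevity, omits unary scores, so each conditional is proportional to the factors touching position i.

```python
# A toy Gibbs sampler for a chain with pairwise scores only; the
# transition table mirrors the running example, unaries are omitted.
import random

LABELS = ["A", "B"]
f = {("A", "A"): 1.0, ("A", "B"): 2.0, ("B", "A"): 3.0, ("B", "B"): 4.0}

def conditional(y, i):
    # P(y_i = l | all other y) is proportional to the factors touching i.
    weights = []
    for l in LABELS:
        w = 1.0
        if i > 0:
            w *= f[(y[i - 1], l)]
        if i < len(y) - 1:
            w *= f[(l, y[i + 1])]
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]

def gibbs(n_sweeps, length=5):
    y = [random.choice(LABELS) for _ in range(length)]
    samples = []
    for _ in range(n_sweeps):
        for i in range(length):
            y[i] = random.choices(LABELS, weights=conditional(y, i))[0]
        samples.append(list(y))
    return samples

samples = gibbs(1000)
```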
163 MAP inference with MCMC v So far we have only seen how to collect samples v Marginal inference with samples is easy v Compute the marginal probabilities from the samples v MAP inference: v Find the sample with the highest probability v To help convergence to the maximum, the acceptance condition becomes α = min(1, (P(y′|x)/P(y|x))^T), where T is a temperature parameter that increases with every step v Similar to simulated annealing 163
164 Summary of MCMC methods v A different approach for inference v No guarantee of exactness v General idea v Set up a Markov chain whose stationary distribution is the probability distribution that we care about v Run the chain, collect samples, aggregate v Metropolis-Hastings, Gibbs sampling v Many, many, many variants abound! v Useful when exact inference is intractable v Typically low memory costs, local changes only for Gibbs sampling Questions? 164
165 Inference v What is inference? The prediction step v More broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum v Different flavors: MAP, marginal, loss-augmented v Many algorithms and solution strategies v One size doesn't fit all v Next steps: v How can we take advantage of domain knowledge in inference? v How can we deal with making predictions about latent variables for which we don't have data? Questions? 165
More informationResults: MCMC Dancers, q=10, n=500
Motivation Sampling Methods for Bayesian Inference How to track many INTERACTING targets? A Tutorial Frank Dellaert Results: MCMC Dancers, q=10, n=500 1 Probabilistic Topological Maps Results Real-Time
More informationMachine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017
1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationMarkov Networks.
Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function
More informationProbabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm
Probabilistic Graphical Models 10-708 Homework 2: Due February 24, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 4-8. You must complete all four problems to
More informationBayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks Matt Gormley Lecture 24 April 9, 2018 1 Homework 7: HMMs Reminders
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationLearning MN Parameters with Alternative Objective Functions. Sargur Srihari
Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient
More informationProbability and Structure in Natural Language Processing
Probability and Structure in Natural Language Processing Noah Smith, Carnegie Mellon University 2012 Interna@onal Summer School in Language and Speech Technologies Quick Recap Yesterday: Bayesian networks
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationFrom Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018
From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction
More informationAnnouncements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic
CS 188: Artificial Intelligence Fall 2008 Lecture 16: Bayes Nets III 10/23/2008 Announcements Midterms graded, up on glookup, back Tuesday W4 also graded, back in sections / box Past homeworks in return
More information4 : Exact Inference: Variable Elimination
10-708: Probabilistic Graphical Models 10-708, Spring 2014 4 : Exact Inference: Variable Elimination Lecturer: Eric P. ing Scribes: Soumya Batra, Pradeep Dasigi, Manzil Zaheer 1 Probabilistic Inference
More information6.045: Automata, Computability, and Complexity (GITCS) Class 17 Nancy Lynch
6.045: Automata, Computability, and Complexity (GITCS) Class 17 Nancy Lynch Today Probabilistic Turing Machines and Probabilistic Time Complexity Classes Now add a new capability to standard TMs: random
More informationLecture 1 : Data Compression and Entropy
CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for
More informationNeural Networks in Structured Prediction. November 17, 2015
Neural Networks in Structured Prediction November 17, 2015 HWs and Paper Last homework is going to be posted soon Neural net NER tagging model This is a new structured model Paper - Thursday after Thanksgiving
More information1 Closest Pair of Points on the Plane
CS 31: Algorithms (Spring 2019): Lecture 5 Date: 4th April, 2019 Topic: Divide and Conquer 3: Closest Pair of Points on a Plane Disclaimer: These notes have not gone through scrutiny and in all probability
More informationIntroduction To Graphical Models
Peter Gehler Introduction to Graphical Models Introduction To Graphical Models Peter V. Gehler Max Planck Institute for Intelligent Systems, Tübingen, Germany ENS/INRIA Summer School, Paris, July 2013
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).
More informationLecture 8: PGM Inference
15 September 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I 1 Variable elimination Max-product Sum-product 2 LP Relaxations QP Relaxations 3 Marginal and MAP X1 X2 X3 X4
More informationMCMC and Gibbs Sampling. Kayhan Batmanghelich
MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationMarkov chain Monte Carlo Lecture 9
Markov chain Monte Carlo Lecture 9 David Sontag New York University Slides adapted from Eric Xing and Qirong Ho (CMU) Limitations of Monte Carlo Direct (unconditional) sampling Hard to get rare events
More informationThe Ising model and Markov chain Monte Carlo
The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte
More information17 : Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo
More informationMath 456: Mathematical Modeling. Tuesday, April 9th, 2018
Math 456: Mathematical Modeling Tuesday, April 9th, 2018 The Ergodic theorem Tuesday, April 9th, 2018 Today 1. Asymptotic frequency (or: How to use the stationary distribution to estimate the average amount
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Variational Inference IV: Variational Principle II Junming Yin Lecture 17, March 21, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3 X 3 Reading: X 4
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationBayes Nets III: Inference
1 Hal Daumé III (me@hal3.name) Bayes Nets III: Inference Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 10 Apr 2012 Many slides courtesy
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 8: Sequence Labeling Jimmy Lin University of Maryland Thursday, March 14, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationCS540 ANSWER SHEET
CS540 ANSWER SHEET Name Email 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 1 2 Final Examination CS540-1: Introduction to Artificial Intelligence Fall 2016 20 questions, 5 points
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationEssential facts about NP-completeness:
CMPSCI611: NP Completeness Lecture 17 Essential facts about NP-completeness: Any NP-complete problem can be solved by a simple, but exponentially slow algorithm. We don t have polynomial-time solutions
More informationStatistical Sequence Recognition and Training: An Introduction to HMMs
Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with
More informationProbabilistic Graphical Networks: Definitions and Basic Results
This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical
More informationData Structures in Java
Data Structures in Java Lecture 21: Introduction to NP-Completeness 12/9/2015 Daniel Bauer Algorithms and Problem Solving Purpose of algorithms: find solutions to problems. Data Structures provide ways
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationBayesian Learning in Undirected Graphical Models
Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul
More informationBayesian Inference and MCMC
Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the
More information10-701/ Machine Learning: Assignment 1
10-701/15-781 Machine Learning: Assignment 1 The assignment is due September 27, 2005 at the beginning of class. Write your name in the top right-hand corner of each page submitted. No paperclips, folders,
More informationReminder of some Markov Chain properties:
Reminder of some Markov Chain properties: 1. a transition from one state to another occurs probabilistically 2. only state that matters is where you currently are (i.e. given present, future is independent
More information