Lecture 8: Inference. Kai-Wei Chang University of Virginia
1 Lecture 8: Inference Kai-Wei Chang University of Virginia kw@kwchang.net Some slides are adapted from Vivek Srikumar's course on Structured Prediction Advanced ML: Inference 1
2 So far what we learned v Thinking about structures v A graph, a collection of parts that are labeled jointly, a collection of decisions [figure: two example graphs over nodes A through G] v Next: Prediction v This is what sets structured prediction apart from binary/multiclass Advanced ML: Inference 2
3 The bigger picture v The goal of structured prediction: Predicting a graph v Modeling: Defining probability distributions over the random variables v Involves making independence assumptions v Inference: The computational step that actually constructs the output v Also called decoding v Learning creates functions that score predictions (e.g., learning model parameters) Advanced ML: Inference 3
4 Computational issues [figure: a map of the design questions] v Model definition: What are the parts of the output? What are the inter-dependencies? v How to train the model? Data annotation difficulty; semi-supervised/indirectly supervised? v Background knowledge about the domain v How to do inference? v Inference: deriving the probability of one or more random variables based on the model Advanced ML: Inference 4
5 Inference v What is inference? v An overview of what we have seen before v Combinatorial optimization v Different views of inference v Integer programming v Graph algorithms v Sum-product, max-sum v Heuristics for inference v LP relaxation v Sampling Advanced ML: Inference 5
6 Remember sequence prediction v Goal: Find the most probable/highest scoring state sequence v argmax_y score(y) = argmax_y w^T φ(x, y) v Computationally: discrete optimization v The naive algorithm v Enumerate all sequences, score each one and pick the max v Terrible idea! v We can do better v Scores decompose over edges Advanced ML: Inference 6
9 The Viterbi algorithm: Recurrence. Goal: Find argmax_y w^T φ(x, y), where y = (y_1, y_2, …, y_n). [figure: chain y_1 → y_2 → y_3 → … → y_n] Idea: 1. If I know the score of all sequences y_1 to y_{n-1}, then I could decide y_n easily. 2. Recurse to get the score up to y_{n-1}. Advanced ML: Inference 9
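To make the recurrence concrete, here is a minimal Python sketch; the toy emission/transition scores are made-up illustrations, not values from the slides, and the score is assumed to decompose additively over edges.

```python
# A minimal sketch of the Viterbi recurrence, assuming the total score
# decomposes over edges; the toy scores below are made up.
import numpy as np

def viterbi(emit, trans):
    """emit: (n, K) per-position label scores; trans: (K, K) edge scores."""
    n, K = emit.shape
    score = np.zeros((n, K))
    back = np.zeros((n, K), dtype=int)
    score[0] = emit[0]
    for i in range(1, n):
        # cand[a, b] = score of reaching label b at position i from label a
        cand = score[i - 1][:, None] + trans + emit[i]
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0)
    y = [int(score[-1].argmax())]          # best final label ...
    for i in range(n - 1, 0, -1):          # ... then follow back pointers
        y.append(int(back[i, y[-1]]))
    return y[::-1], float(score[-1].max())

emit = np.array([[2., 1.], [1., 3.], [2., 2.]])   # toy values
trans = np.array([[0., 1.], [1., 0.]])
print(viterbi(emit, trans))
```

Note that this runs in O(nK^2) time instead of enumerating all K^n sequences, exactly because the score decomposes over edges.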
10 Inference questions v This class: v Mostly we use inference to mean: "What is the highest scoring assignment to the output random variables for a given input?" v Maximum A Posteriori (MAP) inference (if the score is probabilistic) v Other inference questions: v What is the highest scoring assignment to some of the output variables given the input? v Sample from the posterior distribution over Y v Loss-augmented inference: Which structure most violates the margin for a given scoring function? v Computing marginal probabilities over Y Advanced ML: Inference 10
14 MAP inference is discrete optimization v A combinatorial problem: argmax_y w^T φ(x, y) v Computational complexity depends on v The size of the input v The factorization of the scores v More complex factors generally lead to expensive inference v A generally bad strategy in all but the simplest cases: Enumerate all possible structures and pick the highest scoring one Advanced ML: Inference 14
15 MAP inference is search v We want the graph with the highest scoring structure: argmax_y w^T φ(x, y) v Without assumptions, no algorithm can find the max without considering every possible structure v How can we solve this computational problem? v Exploit the structure of the search space and the cost function v That is, exploit the decomposition of the scoring function v Usually, stronger assumptions lead to easier inference v E.g., consider 10 independent random variables Advanced ML: Inference 15
16 Approaches for inference v Exact vs. approximate inference v Should the maximization be performed exactly? v Or is a close-to-highest-scoring structure good enough? v Exact: Search, dynamic programming, integer linear programming, … v Heuristic (called approximate inference): Gibbs sampling, belief propagation, beam search, linear programming relaxations, … v Randomized vs. deterministic v Relevant for approximate inference: If I run the inference program twice, will I get the same answer? Advanced ML: Inference 16
17 Coming up v Formulating general inference as integer linear programs v And variants of this idea v Graph algorithms, dynamic programming, greedy search v We have seen the Viterbi algorithm: it uses a cleverly defined ordering to decompose the output into a sequence of decisions. We will talk about a general algorithm: max-product/sum-product v Heuristics for inference v Sampling, Gibbs sampling v Approximate graph search, beam search v LP-relaxation Advanced ML: Inference 17
18 Inference: Integer Linear Programs Advanced ML: Inference 18
19 The big picture v MAP inference is combinatorial optimization v Combinatorial optimization problems can be written as integer linear programs (ILPs) v The conversion is not always trivial v Allows injection of knowledge in the form of constraints v Different ways of solving ILPs v Commercial solvers: CPLEX, Gurobi, etc. v Specialized solvers if you know something about your problem v Lagrangian relaxation, amortized inference, etc. v Can relax to linear programs and hope for the best v Integer linear programs are NP-hard in general v No free lunch Advanced ML: Inference 19
20 Detour: Linear programming v Minimizing a linear objective function subject to a finite number of linear constraints (equality or inequality) v Very widely applicable v Operations research, micro-economics, management v Historical note v Developed during World War II to reduce army costs v "Programming" here is not the same as computer programming Advanced ML: Inference 20
22 Example: The diet problem. A student wants to spend as little money as possible on food while getting sufficient amounts of vitamin Z and nutrient X. Her options: [table: Item / Cost per 100g / Vitamin Z / Nutrient X, with rows for Carrots, Sunflower seeds, Double cheeseburger] How should she spend her money to get at least 5 units of vitamin Z and 3 units of nutrient X? Let c, s and d denote how much of each item is purchased. Minimize the total cost, subject to: at least 5 units of vitamin Z, at least 3 units of nutrient X, and the number of units purchased is not negative. Advanced ML: Inference 22
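This LP can be handed to a solver directly. Below is a minimal sketch with scipy; the costs and nutrient contents are hypothetical stand-ins, since the table's numeric values are not shown here.

```python
# A minimal sketch of the diet LP with scipy, using HYPOTHETICAL costs
# and nutrient contents for carrots (c), sunflower seeds (s), and a
# double cheeseburger (d), measured per 100g.
import numpy as np
from scipy.optimize import linprog

cost = np.array([0.5, 1.0, 2.0])         # hypothetical cost per 100g
vitamin_z = np.array([2.0, 1.0, 0.5])    # hypothetical vitamin Z per 100g
nutrient_x = np.array([0.5, 2.0, 1.0])   # hypothetical nutrient X per 100g

# linprog solves min c^T x s.t. A_ub @ x <= b_ub, so each ">= requirement"
# is negated into a "<=" constraint. Bounds keep purchases non-negative.
res = linprog(
    c=cost,
    A_ub=-np.stack([vitamin_z, nutrient_x]),
    b_ub=-np.array([5.0, 3.0]),
    bounds=[(0, None)] * 3,
)
print(res.x, res.fun)   # optimal amounts and minimum total cost
```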
27 Linear programming v In general: minimize c^T x subject to Ax ≤ b (both the objective and the constraints are linear in x) Advanced ML: Inference 27
35 Linear programming v In general: minimize c^T x subject to Ax ≤ b v This is a continuous optimization problem v And yet, there is only a finite set of possible solutions v For example: suppose we had to maximize any c^T x on the region a_1 x_1 + a_2 x_2 + a_3 x_3 = b, x ≥ 0 [figure: a triangular feasible region in (x_1, x_2, x_3) with the objective direction c] These three vertices are the only possible solutions! Advanced ML: Inference 35
36 Linear programming v In general: minimize c^T x subject to Ax ≤ b v This is a continuous optimization problem v And yet, there is only a finite set of possible solutions v The constraint matrix defines a polytope v Only the vertices or faces of the polytope can be solutions Advanced ML: Inference 36
40 Geometry of linear programming. The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed). The objective defines a cost for every point in the space [figure: darker is higher]. Even though all points in the region are allowed, the vertices maximize/minimize the cost. Advanced ML: Inference 40
41 Linear programming v In general: minimize c^T x subject to Ax ≤ b v This is a continuous optimization problem v And yet, there is only a finite set of possible solutions v The constraint matrix defines a polytope v Only the vertices or faces of the polytope can be solutions v Linear programs can be solved in polynomial time Questions? Advanced ML: Inference 41
42 Integer linear programming v In general: minimize c^T x subject to Ax ≤ b, with x ∈ Z^n (every x_i an integer) Advanced ML: Inference 42
43 Geometry of integer linear programming. The constraint matrix defines a polytope that contains the allowed solutions (possibly not closed). The objective defines a cost for every point in the space. Only integer points are allowed. Advanced ML: Inference 43
44 Integer linear programming v In general: minimize c^T x subject to Ax ≤ b, x ∈ Z^n v Solving integer linear programs in general can be NP-hard! v LP-relaxation: Drop the integer constraints and hope for the best Advanced ML: Inference 44
45 0-1 integer linear programming v In general: minimize c^T x subject to Ax ≤ b, x ∈ {0, 1}^n v An instance of integer linear programming v Still NP-hard v Geometry: We are only considering points that are vertices of the Boolean hypercube Advanced ML: Inference 45
46 0-1 integer linear programming v In general: minimize c^T x subject to Ax ≤ b, x ∈ {0, 1}^n v An instance of integer linear programming v Still NP-hard v Geometry: We are only considering points that are vertices of the Boolean hypercube v Constraints prohibit certain vertices [figure: only points within the region defined by Ax ≤ b are allowed; the relaxed solution can be an interior point of that constraint set] Questions? Advanced ML: Inference 46
47 Back to structured prediction v Recall that we are solving argmax_y w^T φ(x, y) v The goal is to produce a graph v The set of possible values that y can take is finite, but large v General idea: Frame the argmax problem as a 0-1 integer linear program v Allows addition of arbitrary constraints Advanced ML: Inference 47
48 Thinking in ILPs. Let's start with multi-class classification: argmax_{y ∈ {A,B,C}} w^T φ(x, y) = argmax_{y ∈ {A,B,C}} score(y). Introduce decision variables for each label: v z_A = 1 if output = A, 0 otherwise v z_B = 1 if output = B, 0 otherwise v z_C = 1 if output = C, 0 otherwise Maximize the score: max Σ_y score(y) z_y. Pick exactly one label: Σ_y z_y = 1. Advanced ML: Inference 48
51 Thinking in ILPs. We have taken a trivial problem (finding the highest scoring element of a list) and converted it into a representation that is NP-hard in the worst case! Lesson: Don't solve multiclass classification with an ILP solver. Advanced ML: Inference 51
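To make the encoding concrete (even though, per the lesson above, you would not do this in practice), here is a minimal sketch using the PuLP library; the label scores are made up for illustration.

```python
# A minimal sketch of the multiclass argmax written as a 0-1 ILP with
# PuLP (any ILP solver would do); the label scores are hypothetical.
import pulp

scores = {"A": 2.0, "B": 3.5, "C": 1.0}   # hypothetical label scores

prob = pulp.LpProblem("multiclass", pulp.LpMaximize)
z = {y: pulp.LpVariable(f"z_{y}", cat="Binary") for y in scores}

# Objective: maximize the score of the chosen label.
prob += pulp.lpSum(scores[y] * z[y] for y in scores)
# Constraint: pick exactly one label.
prob += pulp.lpSum(z.values()) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([y for y in scores if z[y].value() == 1])   # -> ['B']
```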
54 ILP for general conditional models. Suppose each y_i can be A, B or C. [figure: factor graph over y_1, y_2, y_3 and x_1, x_2, x_3] Introduce one decision variable for each part being assigned labels: unary variables z_1A, z_1B, z_1C and z_2A, z_2B, z_2C, plus pairwise variables z_13AA, z_13AB, …, z_13CC and z_23AA, …, z_23CC. Each of these decision variables is associated with a score. Our goal: max_y W^T φ(x_1, y_1) + W^T φ(y_1, y_2, y_3) + W^T φ(x_3, y_2, y_3) + W^T φ(x_1, x_2, y_2) Questions? Advanced ML: Inference 54
55 ILP for general conditional models. Not all decisions can exist together. E.g., z_13AB implies z_1A and z_3B. Our goal: max_y W^T φ(x_1, y_1) + W^T φ(y_1, y_2, y_3) + W^T φ(x_3, y_2, y_3) + W^T φ(x_1, x_2, y_2) Advanced ML: Inference 55
56 Writing constraints as linear inequalities v Exactly one of z_1A, z_1B, z_1C can be true: z_1A + z_1B + z_1C = 1 v At least m of z_1A, z_1B, z_1C should be true: z_1A + z_1B + z_1C ≥ m v At most m of z_1A, z_1B, z_1C should be true: z_1A + z_1B + z_1C ≤ m v Implication: z_i ⇒ z_j v Convert to a disjunction: ¬z_i ∨ z_j (at least one of ¬z_i or z_j): (1 − z_i) + z_j ≥ 1 Advanced ML: Inference 56
58 Integer linear programming for inference v Easy to add additional knowledge v Specify it as Boolean formulas v Examples (encoded as inequalities below): v If y_1 is an A, then y_2 or y_3 should be a B or C v No more than two A's allowed in the output v Many inference problems have standard mappings to ILPs v Sequences, parsing, dependency parsing v The encoding of the problem makes a difference in solving time v The mechanical encoding may not be efficient to solve v Generally: more complex constraints make solving harder Advanced ML: Inference 58
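For instance, the two example constraints above admit the encodings below, written with the z variables from the earlier slides; this is one possible encoding, not the only one.

```latex
% "If y_1 is an A, then y_2 or y_3 should be a B or C":
z_{1A} \le z_{2B} + z_{2C} + z_{3B} + z_{3C}
% "No more than two A's allowed in the output":
\sum_i z_{iA} \le 2
```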
59 Exercise: Sequence labeling. Goal: Find argmax_y W^T φ(x, y), y = (y_1, y_2, …, y_n). [figure: chain y_1 → y_2 → … → y_n] How can this be written as an ILP? Advanced ML: Inference 59
60 ILP for inference: Remarks v Many combinatorial optimization problems can be written as an ILP v Even the "easy"/polynomial ones v Given an ILP, checking whether it represents a polynomial problem is intractable in general v ILPs are a general language for thinking about combinatorial optimization v The representation allows us to make general statements about inference v Off-the-shelf solvers for ILPs are quite good v Gurobi, CPLEX v Use an off-the-shelf solver only if you can't solve your inference problem otherwise Advanced ML: Inference 60
61 Coming up v Formulating general inference as integer linear programs v And variants of this idea v Graph algorithms, dynamic programming, greedy search v We have seen the Viterbi algorithm: it uses a cleverly defined ordering to decompose the output into a sequence of decisions. We will talk about a general algorithm: max-product/sum-product v Heuristics for inference v Sampling, Gibbs sampling v Approximate graph search, beam search v LP-relaxation Advanced ML: Inference 61
62 Inference: Graph algorithms Belief Propagation Advanced ML: Inference 62
64 Variable elimination (motivation) v Remember: We have a collection of inference variables that need to be assigned: y = (y_1, y_2, …) v General algorithm: v First fix an ordering of the variables, say (y_1, y_2, …) v Iteratively: v Find the best value for y_i given the values of the previous neighbors v Use back pointers to find the final answer v Viterbi is an instance of max-product variable elimination Advanced ML: Inference 64
66 Variable elimination example (max-sum). [figure: chain y_1, y_2, y_3, …, y_n over labels {A, B, C, D}, with local scores such as B: 4, C: 1, D: 0] First eliminate y_1: Score_1(s) = score_local_1(START, s); Score_i(s) = max_{y_{i−1}} [Score_{i−1}(y_{i−1}) + score_local_i(y_{i−1}, s)] Advanced ML: Inference 66
67 Variable elimination example, continued: next eliminate y_2, then y_3, and so on, applying the same recurrence Score_i(s) = max_{y_{i−1}} [Score_{i−1}(y_{i−1}) + score_local_i(y_{i−1}, s)]. Once everything up to y_{n−1} is eliminated, we have all the information to make a decision for y_n. Questions? Advanced ML: Inference 69
70 Two types of inference problems v Marginals: find P(y_i | x), summing the joint over all the other variables v Maximizer: find argmax_y P(y | x) [figure: probability in different domains] Advanced ML: Inference 70
71 Belief Propagation v BP provides the exact solution when there are no loops in the graph v Viterbi is a special case v Otherwise, "loopy" BP provides an approximate solution. We use sum-product BP as the running example, where we want to compute the partition function Z in P(x) = (1/Z) ∏_{(i,j)} f_ij(x_i, x_j) ∏_i g_i(x_i) Advanced ML: Inference 71
72 Intuition v An iterative process in which neighboring variables pass messages to each other: "I (variable x_3) think that you (variable x_2) belong in these states with various likelihoods." v After enough iterations, this conversation is likely to converge to a consensus that determines the marginal probabilities of all the variables. Advanced ML: Inference 72
73 Message v Message from node i to node j: m_ij(x_j) v A message is not a probability v It may not sum to 1 v A high value of m_ij(x_j) means that node i believes the marginal value P(x_j) to be high. Advanced ML: Inference 73
74 Beliefs v Estimated marginal probabilities are called beliefs. v Algorithm: v Update messages until convergence v Then calculate beliefs Advanced ML: Inference 74
75 Message update v To update message from i to j, consider all messages flowing into i Advanced ML: Inference 75
76 Message update: m_ij(x_j) = Σ_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ N(i)∖j} m_ki(x_i). Belief: b_i(x_i) ∝ g_i(x_i) ∏_{k ∈ N(i)} m_ki(x_i) Advanced ML: Inference 77
78 Sum-product vs. max-product v The standard BP we just described is sum-product, used to estimate marginals v A variant called max-product (or max-sum in log space) is used to estimate the MAP assignment Advanced ML: Inference 78
79 Max-product v The message update is the same as before, except that the sum is replaced by a max: m_ij(x_j) = max_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ N(i)∖j} m_ki(x_i) v Belief: estimate the most likely states Advanced ML: Inference 79
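A toy sketch of both updates on a chain follows; it assumes pairwise scores only (unaries fixed to 1 for brevity), with made-up numbers in the style of the running example rather than the slides' exact graph.

```python
# Sum-product vs. max-product message passing on a toy chain with
# made-up pairwise scores and unary scores fixed to 1.
import numpy as np

f = np.array([[1.0, 2.0],      # f[i, j] = score of transition i -> j,
              [3.0, 4.0]])     # states 0 = A, 1 = B
n = 4                          # chain length

# Sum-product: m_new(x_j) = sum_{x_i} f(x_i, x_j) * m(x_i)
m = np.ones(2)
for _ in range(n - 1):
    m = f.T @ m
Z = m.sum()                    # partition function

# Max-product: the same recursion with sum replaced by max
m = np.ones(2)
for _ in range(n - 1):
    m = (f * m[:, None]).max(axis=0)
map_score = m.max()            # score of the best assignment

print(Z, map_score)            # -> 290.0 and 64.0 for this chain
```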
80 Recap Advanced ML: Inference 80
81 Example (Sum-Product). Pairwise scores: A→A: 1, A→B: 2, B→A: 3, B→B: 4. For the sake of simplicity, let's assume all the transition and emission scores are the same everywhere in the graph. How many possible assignments? Advanced ML: Inference 81
82 Example (Sum-Product), continued. [figure: a small tree of variables over labels {A, B}] Each leaf sends a message to its neighbor: A: 5, B: 8. A node with two incoming leaf messages combines them (5·5 for A, 8·8 for B) with its own local score, giving the message A: 50, B: 64. Advanced ML: Inference 86
87 Example (Sum-Product). Let's verify, for the sub-tree with the parent labeled A: (A,(A,A)) = 8, (A,(A,B)) = 12, (A,(B,A)) = 12, (A,(B,B)) = 18. These sum to 8 + 12 + 12 + 18 = 50, matching the message computed for A. Advanced ML: Inference 88
89 Example (Sum-Product). The message passed onward from that node: A: 1·50 + 3·64 = 242, B: 2·50 + 4·64 = 356. Advanced ML: Inference 89
90 Example (Sum-Product). At the last node, combine the incoming message with the local leaf messages: A: 242·50 = 12,100, B: 356·64 = 22,784. Summing gives the partition function Z = 12,100 + 22,784 = 34,884. Advanced ML: Inference 91
92 Example (max-product). Same pairwise scores: A→A: 1, A→B: 2, B→A: 3, B→B: 4, and the same graph as before. How many possible assignments? Advanced ML: Inference 92
93 Example (max-product). Message update: m_ij^new(x_j) = max_{x_i} f_ij(x_i, x_j) g_i(x_i) ∏_{k ∈ N(i)∖j} m_ki^old(x_i). With unary scores g(A) = 2 and g(B) = 1, each leaf sends: A: max(1·2, 3·1) = 3, B: max(2·2, 4·1) = 4. Advanced ML: Inference 93
94 Example (max-product), continued. Each leaf message is A: 3, B: 4. A node with two incoming leaf messages combines them with its local score: A: 3·3·2 = 18, B: 4·4·1 = 16. Advanced ML: Inference 97
98 Example (max-product). Let's verify, for the sub-tree with the parent labeled A: (A,(A,A)) = 8, (A,(A,B)) = 12, (A,(B,A)) = 12, (A,(B,B)) = 18. The max is 18, matching the message computed for A. Advanced ML: Inference 99
100 Example (max-product). The message passed onward: A: max(1·18, 3·16) = 48, B: max(2·18, 4·16) = 64. Advanced ML: Inference 100
101 Example (max-product). At the last node: A: 18·48 = 864, B: 16·64 = 1,024. The MAP score is therefore 1,024. Advanced ML: Inference 102
103 Example (max-product). Each max also records a backpointer (shown in parentheses on the slide, e.g., A 48 (B) means the best predecessor of A is B); following the backpointers from the best final value, B with 1,024, recovers a maximizing assignment. Advanced ML: Inference 104
105 Inference: Graph algorithms General Search Advanced ML: Inference 105
106 Inference as search: General setting v Predicting a graph as a sequence of decisions v General data structures: v State: Encodes partial structure v Transitions: Move from one partial structure to another v Start state v End state: We have a full structure v There may be more than one end state v Each transition is scored with the learned model v Goal: Find an end state that has the highest total score Questions? Advanced ML: Inference 106
110 Example. Suppose each y can be one of A, B or C. [figure: factor graph over y_1, y_2, y_3 and x_1, x_2, x_3] Note: Here we have assumed an ordering (y_1, y_2, y_3). How do the transitions get scored? State: Triples (y_1, y_2, y_3), all possibly unknown: (A, -, -), (-, A, A), (-, -, -), … Transition: Fill in one of the unknowns. Start state: (-,-,-). End state: All three y's are assigned. [figure: search tree from (-,-,-) through partial states such as (A,-,-), (B,-,-), (C,-,-), (A,A,-), …, down to complete states (A,A,A), …, (C,C,C)] Questions? Advanced ML: Inference 110
111 Graph search algorithms v Breadth/depth first search v Keep a stack/queue/priority queue of open states v The good: Guaranteed to be correct v Explores every option v The bad? v Explores every option: could be slow for any non-trivial graph [figure: the full search tree from (-,-,-) down to the complete assignments] Advanced ML: Inference 111
112 Greedy search v At each state, choose the highest scoring next transition v Keep only one state in memory: the current state v What is the problem? v Local decisions may rule out the global optimum v Does not explore the full search space v Greedy algorithms can give the true optimum for special classes of problems v E.g., maximum-spanning-tree algorithms are greedy Questions? Advanced ML: Inference 112
113 Example. [figure: greedy search expands only one branch of the search tree, e.g., (-,-,-) → (A,-,-) → (A,B,-) → (A,B,C), while the siblings at each level are discarded] Questions? Advanced ML: Inference 113
115 Example (greedy). Same pairwise scores: A→A: 1, A→B: 2, B→A: 3, B→B: 4. Greedy local choices can end with a total score far below the optimum of 1,024 found earlier by max-product. Advanced ML: Inference 115
116 Beam search: A compromise v Keep a size-limited priority queue of states v Called the beam; sorted by total score for the state v At each step: v Explore all transitions from the current states v Add all to the beam and trim it back to size v The good: Explores more than greedy search v The bad: A good state might fall out of the beam v In general, easy to implement and very popular v No guarantees (a sketch follows below) Questions? Advanced ML: Inference 116
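Here is a short sketch of that loop; the additive step scorer is a made-up illustration, not a learned model.

```python
# A minimal sketch of beam search over partial label sequences,
# assuming an additive scoring function; the scorer is made up.
import heapq

LABELS = ["A", "B", "C"]

def step_score(prefix, label):
    # Toy local score: prefer alternating labels.
    return 1.0 if (not prefix or prefix[-1] != label) else 0.2

def beam_search(n, beam_size=3):
    beam = [(0.0, [])]                 # (total score, partial sequence)
    for _ in range(n):
        candidates = []
        for score, prefix in beam:
            for label in LABELS:
                candidates.append((score + step_score(prefix, label),
                                   prefix + [label]))
        # Keep only the top `beam_size` states; a good state can fall out!
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beam[0]   # highest scoring complete sequence found

print(beam_search(5))
```

Setting beam_size to 1 recovers greedy search; letting it grow without bound recovers exhaustive search.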
117 Example beam = 3 Credit: Graham Neubig Advanced ML: Inference 117
118 v Calculate score, but ignore removed hypotheses Advanced ML: Inference 118
119 v Keep only best three Advanced ML: Inference 119
120 Structured prediction approaches based on search v Learning to search approaches v Assume the complex decision is incrementally constructed by a sequence of decisions v E.g., DAgger, SEARN, transition-based methods v Learn how to make decisions at each branch. Advanced ML: Inference 120
121 Example: Dependency Parsing v Identifying relations between words [figure: two candidate dependency parses of the sentence "I ate a cake with a fork"] Advanced ML: Inference 121
122 Learning to search (L2S) approaches 1. Define a search space and features v Example: dependency parsing [Nivre03,NIPS16] v Maintain a buffer and a stack v Make predictions from left to right v Three (four) types of actions: Shift, Reduce-Left, Reduce-Right Advanced ML: Inference Credit: Google research blog 122
123 Learning to search approaches: Shift-Reduce parser v Maintain a buffer and a stack v Make predictions from left to right v Three (four) types of actions: Shift, Reduce-Left, Reduce-Right v E.g., starting from the buffer "I ate a cake": Shift moves "I" onto the stack, another Shift moves "ate", and Reduce-Left then makes "ate" the head of "I", leaving "ate" on the stack and "a cake" in the buffer. Advanced ML: Inference 123
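A bare-bones sketch of this transition system follows; the action sequence is hard-coded here purely to trace the example, whereas a real parser would pick each action with a learned policy.

```python
# A bare-bones arc-standard shift-reduce sketch; actions are hard-coded
# for illustration instead of being chosen by a learned policy.
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def reduce_left(stack, buffer, arcs):
    # The top of the stack becomes the head of the word below it.
    dep, head = stack[-2], stack[-1]
    arcs.append((head, dep))
    del stack[-2]

def reduce_right(stack, buffer, arcs):
    head, dep = stack[-2], stack[-1]
    arcs.append((head, dep))
    del stack[-1]

stack, buffer, arcs = [], "I ate a cake".split(), []
for action in [shift, shift, reduce_left, shift, shift, reduce_left,
               reduce_right]:
    action(stack, buffer, arcs)
print(arcs)   # [('ate', 'I'), ('cake', 'a'), ('ate', 'cake')]
```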
124 Learning to search (L2S) approaches 1. Define a search space and features 2. Construct a reference policy (Ref) based on the gold label 3. Learn a policy that imitates Ref Advanced ML: Inference 124
125 Policies v A policy maps an observation to an action: π(o) = a v The observation can include: the input x, the timestep t, the partial trajectory τ, and anything else Advanced ML: Inference 125
126 Imitation learning for joint prediction. Challenges: v There are a combinatorial number of search states v How does a sub-decision affect the final decision? Advanced ML: Inference 126
127 Credit Assignment Problem v When something goes wrong, which decision should be blamed? Advanced ML: Inference 127
128 Imitation learning for joint prediction. SEARN: [Langford, Daumé & Marcu] DAgger: [Ross, Gordon & Bagnell] AggreVaTe: [Ross & Bagnell] LOLS: [Chang, Krishnamurthy, Agarwal, Daumé, Langford] Advanced ML: Inference 128
129 Learning a Policy [ICML 15, Ross+15] v At a given state, we construct a cost-sensitive multi-class example, e.g., (state, [0, .2, .8]) [figure: roll-in to the state, one-step deviations with losses 0, .2, and .8, followed by roll-outs] Advanced ML: Inference 129
130 Example: Sequence Labeling Receive input: x = the monster ate the sandwich y = Dt Nn Vb Dt Nn Make a sequence of predictions: x = the monster ate the sandwich ŷ = Dt Dt Dt Dt Dt Pick a timestep and try all perturbations there: x = the monster ate the sandwich ŷ Dt = Dt Dt Vb Dt Nn l=1 ŷ Nn = Dt Nn Vb Dt Nn l=0 ŷ Vb = Dt Vb Vb Dt Nn l=1 Compute losses and construct example: ( { w=monster, p=dt, }, [1,0,1] ) Advanced ML: Inference 130
131 Learning to search approaches: Credit Assignment Compiler [NIPS16] v Write the decoder, providing some side information for training v The library translates this piece of program, together with data, into the update rules of the model v Applied to dependency parsing, named entity recognition, relation extraction, POS tagging v Implementation: Vowpal Wabbit Advanced ML: Inference 131
132 Approximate Inference Inference by sampling Advanced ML: Inference 132
133 Inference by sampling v Monte Carlo methods: A large class of algorithms v Origins in physics v Basic idea: v Repeatedly sample from a distribution v Compute aggregate statistics from samples v E.g.: The marginal distribution v Useful when we have many, many interacting variables 133
134 Why sampling works v Suppose we have some probability distribution P(z) v Might be a cumbersome function v We want to answer questions about this distribution v What is the mean? v Approximate with samples {z_1, z_2, …, z_n} from the distribution. E.g., the expectation: E[z] ≈ (1/n) Σ_i z_i v Theory tells us that this is a good estimator v Chernoff-Hoeffding style bounds 134
136 Key idea: rejection sampling. Sample from a simpler proposal distribution and reject samples inconsistent with the target; this works well when the number of variables is small. Advanced ML: Inference 136
138 The Markov Chain Monte Carlo revolution v Goal: To generate samples from a distribution P(y|x) v The target distribution could be intractable to sample from directly v Idea: draw examples in a way that, in the long run, the distribution of draws is close to P(y|x) v Formally: Construct a Markov chain of structures whose stationary distribution converges to P v An iterative process that constructs examples v Initially, samples might not be from the target distribution v But after a long enough time, the samples are from a distribution that is increasingly close to P 138
140 A detour. Recall: Markov chains. A collection of random variables y_0, y_1, y_2, …, y_t forms a Markov chain if the i-th state depends only on the previous one. [figure: a six-state transition diagram over {A, B, C, D, E, F} with edge probabilities, and a sample trajectory A→B→C→D→E→F→F→A→A→E→F→B→C→D→E→D]
142 Temporal dynamics of a Markov chain. What is the probability that the chain is in state z at time t+1? P(y_{t+1} = z) = Σ_{z′} P(y_t = z′) P(z′ → z). Exercise (see the sketch below): Suppose a Markov chain with these transition probabilities starts at A. What is the distribution over states after two steps?
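The two-step computation is just two matrix-vector products. A small numpy sketch follows; the transition matrix is hypothetical, since the diagram's edge probabilities are not shown here.

```python
# Distribution after two steps of a Markov chain, with a HYPOTHETICAL
# transition matrix standing in for the slide's diagram.
import numpy as np

states = ["A", "B", "C", "D", "E", "F"]
# T[i, j] = P(next = states[j] | current = states[i]); rows sum to 1.
T = np.array([
    [0.1, 0.4, 0.0, 0.0, 0.3, 0.2],
    [0.0, 0.2, 0.5, 0.0, 0.0, 0.3],
    [0.3, 0.0, 0.1, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.7, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.1, 0.4],
    [0.2, 0.3, 0.0, 0.0, 0.0, 0.5],
])

p0 = np.zeros(6)
p0[0] = 1.0            # start at A
p2 = p0 @ T @ T        # distribution over states after two steps
print(dict(zip(states, p2.round(3))))
```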
143 Stationary distributions v Informally: if the set of states is {A, B, C, D, E, F}, a stationary distribution is a distribution π over the states such that after a transition, the distribution over the states is still π v How do we get to a stationary distribution? v A regular Markov chain: there is a non-zero probability of getting from any state to any other in a finite number of steps v If the transition matrix is regular, just run the chain for a long time v The steady-state behavior is the stationary distribution 143
145 Back to inference Markov Chain Monte Carlo for inference v Design a Markov chain such that v Every state is a structure v The stationary distribution of the chain is the probability distribution we care about P(y x) v How to do inference? v Run the Markov chain for a long time till we think it gets to steady state v Let the chain wander around the space and collect samples v We have samples from P(y x) 145
151 MCMC for inference. [figure: a sequence of structures produced by the chain, with many steps between collected samples] With sufficient samples, we can answer inference questions like estimating the partition function (just sum over the samples). 151
154 MCMC algorithms v Metropolis-Hastings algorithm v Gibbs sampling v An instance of the Metropolis-Hastings algorithm v Many variants exist v Remember: We are sampling from an exponentially large state space v All possible assignments to the random variables 154
155 Metropolis-Hastings [Metropolis, Rosenbluth, Rosenbluth, Teller & Teller 1953] [Hastings 1970] v Proposal distribution q(y′ | y) v Proposes changes to the state v Could propose large changes to the state v Acceptance probability α = min(1, P(y′) q(y | y′) / (P(y) q(y′ | y))) v Should the proposal be accepted or not? v If yes, move to the proposed state; else remain in the previous state 155
156 Metropolis-Hastings Algorithm v Start with an initial guess y_0 v Loop for t = 1, 2, …, N: v Propose the next state y′ (sampled from q(y′ | y_t)) v Calculate the acceptance probability α v With probability α, accept the proposal v If accepted: y_{t+1} ← y′; else y_{t+1} ← y_t v Return {y_0, y_1, …, y_N} The distribution we care about is P(y | x). Note that we don't need to compute the partition function. Why? (α only involves ratios of probabilities, so Z cancels.) Idea: with enough iterations, the target distribution is invariant under the chain; a sketch follows below. 160
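The following Python sketch instantiates the loop for a toy unnormalized distribution over label sequences; the score function and proposal are made up for illustration, and the proposal is symmetric so the q terms cancel in α.

```python
# A minimal Metropolis-Hastings sketch over toy label sequences; the
# unnormalized score and the symmetric proposal are made up.
import math
import random

LABELS = ["A", "B", "C"]

def score(y):
    # Made-up unnormalized log-score favoring repeated labels.
    return sum(2.0 if a == b else 0.5 for a, b in zip(y, y[1:]))

def propose(y):
    # Symmetric proposal: resample one random position.
    i = random.randrange(len(y))
    y2 = list(y)
    y2[i] = random.choice(LABELS)
    return y2

def metropolis_hastings(n_steps, length=5):
    y = [random.choice(LABELS) for _ in range(length)]
    samples = []
    for _ in range(n_steps):
        y_new = propose(y)
        # Symmetric q, so alpha = min(1, P(y') / P(y)); Z cancels.
        alpha = min(1.0, math.exp(score(y_new) - score(y)))
        if random.random() < alpha:
            y = y_new
        samples.append(list(y))
    return samples

samples = metropolis_hastings(10_000)
```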
161 Proposal functions for Metropolis v A design choice v Different possibilities v Only make local changes to the factor graph v But the chain might not explore widely v Make big jumps in the state space v But the chain might move very slowly v Doesn t have to depend on the size of the graph 161
162 Gibbs Sampling v Start with an initial guess y = (y_1, y_2, …, y_n) v Loop several times: v For i = 1 to n: v Sample y_i from P(y_i | y_1, y_2, …, y_{i−1}, y_{i+1}, …, y_n, x) v We now have a complete sample v The ordering is arbitrary v A specific instance of the Metropolis-Hastings algorithm; no proposal needs to be designed 162
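A toy sketch for a chain model is shown below; it reuses the running example's pairwise scores and, as an assumption for brevity, omits unary scores, so each conditional is proportional to the factors touching position i.

```python
# A toy Gibbs sampler for a chain with pairwise scores only; the
# transition table mirrors the running example, unaries are omitted.
import random

LABELS = ["A", "B"]
f = {("A", "A"): 1.0, ("A", "B"): 2.0, ("B", "A"): 3.0, ("B", "B"): 4.0}

def conditional(y, i):
    # P(y_i = l | all other y) is proportional to the factors touching i.
    weights = []
    for l in LABELS:
        w = 1.0
        if i > 0:
            w *= f[(y[i - 1], l)]
        if i < len(y) - 1:
            w *= f[(l, y[i + 1])]
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]

def gibbs(n_sweeps, length=5):
    y = [random.choice(LABELS) for _ in range(length)]
    samples = []
    for _ in range(n_sweeps):
        for i in range(length):
            y[i] = random.choices(LABELS, weights=conditional(y, i))[0]
        samples.append(list(y))
    return samples

samples = gibbs(1000)
```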
163 MAP inference with MCMC v So far we have only seen how to collect samples v Marginal inference with samples is easy v Compute the marginal probabilities from the samples v MAP inference: v Find the sample with the highest probability v To help convergence to the maximum, the acceptance condition becomes α = min(1, (P(y′|x)/P(y|x))^T), where T is a temperature parameter that increases with every step v Similar to simulated annealing 163
164 Summary of MCMC methods v A different approach for inference v No guarantee of exactness v General idea v Set up a Markov chain whose stationary distribution is the probability distribution that we care about v Run the chain, collect samples, aggregate v Metropolis-Hastings, Gibbs sampling v Many, many, many variants abound! v Useful when exact inference is intractable v Typically low memory costs, local changes only for Gibbs sampling Questions? 164
165 Inference v What is inference? The prediction step v More broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum v Different flavors: MAP, marginal, loss-augmented v Many algorithms and solution strategies v One size doesn't fit all v Next steps: v How can we take advantage of domain knowledge in inference? v How can we deal with making predictions about latent variables for which we don't have data? Questions? 165
More informationResults: MCMC Dancers, q=10, n=500
Motivation Sampling Methods for Bayesian Inference How to track many INTERACTING targets? A Tutorial Frank Dellaert Results: MCMC Dancers, q=10, n=500 1 Probabilistic Topological Maps Results Real-Time
More informationMachine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017
1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationMarkov Networks.
Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function
More informationProbabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm
Probabilistic Graphical Models 10-708 Homework 2: Due February 24, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 4-8. You must complete all four problems to
More informationBayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks Matt Gormley Lecture 24 April 9, 2018 1 Homework 7: HMMs Reminders
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationLearning MN Parameters with Alternative Objective Functions. Sargur Srihari
Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient
More informationProbability and Structure in Natural Language Processing
Probability and Structure in Natural Language Processing Noah Smith, Carnegie Mellon University 2012 Interna@onal Summer School in Language and Speech Technologies Quick Recap Yesterday: Bayesian networks
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationFrom Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018
From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction
More informationAnnouncements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic
CS 188: Artificial Intelligence Fall 2008 Lecture 16: Bayes Nets III 10/23/2008 Announcements Midterms graded, up on glookup, back Tuesday W4 also graded, back in sections / box Past homeworks in return
More information4 : Exact Inference: Variable Elimination
10-708: Probabilistic Graphical Models 10-708, Spring 2014 4 : Exact Inference: Variable Elimination Lecturer: Eric P. ing Scribes: Soumya Batra, Pradeep Dasigi, Manzil Zaheer 1 Probabilistic Inference
More information6.045: Automata, Computability, and Complexity (GITCS) Class 17 Nancy Lynch
6.045: Automata, Computability, and Complexity (GITCS) Class 17 Nancy Lynch Today Probabilistic Turing Machines and Probabilistic Time Complexity Classes Now add a new capability to standard TMs: random
More informationLecture 1 : Data Compression and Entropy
CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for
More informationNeural Networks in Structured Prediction. November 17, 2015
Neural Networks in Structured Prediction November 17, 2015 HWs and Paper Last homework is going to be posted soon Neural net NER tagging model This is a new structured model Paper - Thursday after Thanksgiving
More information1 Closest Pair of Points on the Plane
CS 31: Algorithms (Spring 2019): Lecture 5 Date: 4th April, 2019 Topic: Divide and Conquer 3: Closest Pair of Points on a Plane Disclaimer: These notes have not gone through scrutiny and in all probability
More informationIntroduction To Graphical Models
Peter Gehler Introduction to Graphical Models Introduction To Graphical Models Peter V. Gehler Max Planck Institute for Intelligent Systems, Tübingen, Germany ENS/INRIA Summer School, Paris, July 2013
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).
More informationLecture 8: PGM Inference
15 September 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I 1 Variable elimination Max-product Sum-product 2 LP Relaxations QP Relaxations 3 Marginal and MAP X1 X2 X3 X4
More informationMCMC and Gibbs Sampling. Kayhan Batmanghelich
MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationMarkov chain Monte Carlo Lecture 9
Markov chain Monte Carlo Lecture 9 David Sontag New York University Slides adapted from Eric Xing and Qirong Ho (CMU) Limitations of Monte Carlo Direct (unconditional) sampling Hard to get rare events
More informationThe Ising model and Markov chain Monte Carlo
The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte
More information17 : Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo
More informationMath 456: Mathematical Modeling. Tuesday, April 9th, 2018
Math 456: Mathematical Modeling Tuesday, April 9th, 2018 The Ergodic theorem Tuesday, April 9th, 2018 Today 1. Asymptotic frequency (or: How to use the stationary distribution to estimate the average amount
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Variational Inference IV: Variational Principle II Junming Yin Lecture 17, March 21, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3 X 3 Reading: X 4
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationBayes Nets III: Inference
1 Hal Daumé III (me@hal3.name) Bayes Nets III: Inference Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 10 Apr 2012 Many slides courtesy
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 8: Sequence Labeling Jimmy Lin University of Maryland Thursday, March 14, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationCS540 ANSWER SHEET
CS540 ANSWER SHEET Name Email 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 1 2 Final Examination CS540-1: Introduction to Artificial Intelligence Fall 2016 20 questions, 5 points
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationEssential facts about NP-completeness:
CMPSCI611: NP Completeness Lecture 17 Essential facts about NP-completeness: Any NP-complete problem can be solved by a simple, but exponentially slow algorithm. We don t have polynomial-time solutions
More informationStatistical Sequence Recognition and Training: An Introduction to HMMs
Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with
More informationProbabilistic Graphical Networks: Definitions and Basic Results
This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical
More informationData Structures in Java
Data Structures in Java Lecture 21: Introduction to NP-Completeness 12/9/2015 Daniel Bauer Algorithms and Problem Solving Purpose of algorithms: find solutions to problems. Data Structures provide ways
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationBayesian Learning in Undirected Graphical Models
Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul
More informationBayesian Inference and MCMC
Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the
More information10-701/ Machine Learning: Assignment 1
10-701/15-781 Machine Learning: Assignment 1 The assignment is due September 27, 2005 at the beginning of class. Write your name in the top right-hand corner of each page submitted. No paperclips, folders,
More informationReminder of some Markov Chain properties:
Reminder of some Markov Chain properties: 1. a transition from one state to another occurs probabilistically 2. only state that matters is where you currently are (i.e. given present, future is independent
More information