Better! Faster! Stronger*!

1 Better! Faster! Stronger*! (*theorems)
Learning to balance accuracy and efficiency when predicting linguistic structures
Hal Daumé III, UMD CS, UMIACS
Jason Eisner, Jiarong Jiang, He He

2 NLP as transduction

Task: Machine Translation
Input: Ces deux principes se tiennent à la croisée de la philosophie, de la politique, de l'économie, de la sociologie et du droit.
Output: Both principles lie at the crossroads of philosophy, politics, economics, sociology, and law.

Task: Document Summarization
Input: Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.
Output: The Falkland islands war, in 1982, was fought between Britain and Argentina.

Task: Syntactic Analysis
Input: The man ate a big sandwich.
Output: [parse tree over] The man ate a big sandwich

...many more...

3 Why are complex predictions slow?
Parsing: # trees ~ O(2^|sentence|)
Translation: # trans ~ O(2^(|foreign| x |english|))
Summarization: # sums ~ O(2^|document|)
What about dynamic programming...?
- Often not possible (features are too complicated)
- Even when possible, polynomial time is too painful: parsing is O(|grammar| x |sentence|^3), but |grammar| is ...
Concretely, n^3 = ..., |grammar| = ½ million
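To make the two growth rates on this slide concrete, here is a small arithmetic sketch. The grammar size of half a million comes from the slide; the sentence length of 25 words is an assumption chosen only for illustration, and the function names are hypothetical.

```python
def tree_count_bound(sentence_len):
    """Exponential bound on the number of candidate trees, 2^|sentence|."""
    return 2 ** sentence_len


def cky_work(grammar_size, sentence_len):
    """Polynomial work estimate for chart parsing, |grammar| x |sentence|^3."""
    return grammar_size * sentence_len ** 3


GRAMMAR_SIZE = 500_000   # "|grammar| = 1/2 million" from the slide
SENTENCE_LEN = 25        # assumed sentence length, not from the slide

print(tree_count_bound(SENTENCE_LEN))        # exhaustive enumeration
print(cky_work(GRAMMAR_SIZE, SENTENCE_LEN))  # dynamic programming
```

Under these assumptions the polynomial estimate is actually the larger number, which is one way to read the slide's remark that polynomial time can still be too painful when the grammar is huge.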

4 Case study: prioritized parsing
[Figure: parsing time in seconds (0 to 2500) against sentence length, with reference marks at 1 minute and 10 minutes. A second build of the slide adds an example parse (S, VP, NP over Pro Aux Vrb Noun) labelled Input and Output.]

6 Learning to be fast!
Training Data -> Fancy Learning Algorithm -> Slow Predictor (+ Hand-built Heuristics)
Training Data -> Fancier Learning Algorithm -> Fast Predictor
Quality = tradeoff(accuracy, time)

7 Prioritized search (e.g., parsing) (Jiang+Teichert+Eisner+D, NIPS 2012)
[Figure: a parse chart over the tag sequence Pro Aux Vrb Noun, filled in with NP, VP, and S constituents as items are popped from the agenda.]
Agenda at each step, as priority and item, highest priority first (slides 7 to 12):
7 VP[2,4]   2 VP[1,3]   1 S[0,2]
2 VP[1,3]   2 VP[1,4]   1 S[0,2]
2 VP[1,4]   1 S[1,3]   1 S[0,2]
5 S[0,4]   1 S[1,3]   1 S[0,2]
9 stop   1 S[1,3]   1 S[0,2]
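The agenda trace on these slides can be reproduced with an ordinary priority queue. The following is a minimal sketch, not the authors' parser: the `prioritized_search` function and the `TOY_EXPANSIONS` table are hypothetical, and the table simply hard-codes which new item appears on the slides after each pop. Ties are broken in favour of the item that entered the agenda first, which is an assumption that happens to match the slide order.

```python
import heapq
from itertools import count


def prioritized_search(initial, expand):
    """Pop the highest-priority item until 'stop' is popped or the agenda is empty."""
    tie = count()  # insertion counter: earlier items win ties
    agenda = []
    for priority, item in initial:
        # heapq is a min-heap, so priorities are negated.
        heapq.heappush(agenda, (-priority, next(tie), item))
    popped = []
    while agenda:
        _, _, item = heapq.heappop(agenda)
        popped.append(item)
        if item == "stop":
            break  # stopping early leaves items on the agenda
        for priority, new_item in expand(item):
            heapq.heappush(agenda, (-priority, next(tie), new_item))
    remaining = [item for _, _, item in sorted(agenda)]
    return popped, remaining


# Hypothetical expansion table read off the slide builds:
# popping the key adds the listed (priority, item) pairs.
TOY_EXPANSIONS = {
    "VP[2,4]": [(2, "VP[1,4]")],
    "VP[1,3]": [(1, "S[1,3]")],
    "VP[1,4]": [(5, "S[0,4]")],
    "S[0,4]": [(9, "stop")],
}


def toy_expand(item):
    return TOY_EXPANSIONS.get(item, [])


popped, remaining = prioritized_search(
    [(7, "VP[2,4]"), (2, "VP[1,3]"), (1, "S[0,2]")], toy_expand
)
print(popped)     # order in which items were popped
print(remaining)  # items never popped because search stopped early
```

In a real parser `expand` would combine the popped constituent with chart neighbours and the priorities would come from a heuristic or learned function; only the control loop is illustrated here.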

13 What do better priorities buy us? (Jiang+Teichert+Eisner+D, NIPS 2012)
Ideally, run until the queue is empty. Ordering only matters because you want to:
- Only pop items that have a real impact on the result
- Avoid premature pops
- Stop early
Goal: learn a priority function that optimizes the accuracy/speed trade-off
Typical solution: hand-built heuristics
Our solution: learning to optimize accuracy - λ x time (λ is a user parameter)
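The objective accuracy - λ x time can be illustrated with a few lines of code. This is only a sketch: the three operating points and their numbers are invented for illustration and are not results from the talk, and the function names are hypothetical.

```python
def quality(accuracy, time, lam):
    """Trade-off objective from the slide: accuracy minus lambda times time."""
    return accuracy - lam * time


def best_operating_point(points, lam):
    """Pick the (name, accuracy, time) triple that maximizes the objective."""
    return max(points, key=lambda p: quality(p[1], p[2], lam))


# Invented operating points: (name, accuracy, time in seconds).
POINTS = [
    ("exhaustive", 0.92, 10.0),
    ("pruned", 0.90, 2.0),
    ("greedy", 0.80, 0.5),
]

for lam in (0.0, 0.01, 0.1):
    print(lam, best_operating_point(POINTS, lam)[0])
```

With λ = 0 only accuracy counts; as λ grows, the same objective prefers progressively faster and less accurate settings, which is how a single user parameter selects a point on the trade-off.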

14 Prioritized search (e.g., parsing) (Jiang+Teichert+Eisner+D, NIPS 2012)
Learning a heuristic: priority(x) = θ · φ(chart, item)
Example features:
- Inside score of VP
- Are 2-word VPs good?
- p(VP | I) and p(cans | VP)
- Could VP combine with NP?
- Does VP compete with other spans?
- Crossing constituents?
Agenda: 2 VP[1,3]   2 VP[1,4]   1 S[0,2]
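A linear priority function of the form θ · φ(chart, item) is just a dot product between a weight vector and a feature vector. The sketch below assumes sparse features stored in dictionaries; the feature names and weights are hypothetical stand-ins for the kinds of features listed on the slide, and extracting φ from a real chart is not shown.

```python
def priority(theta, phi):
    """Dot product of weights theta and sparse features phi (both dicts)."""
    return sum(theta.get(name, 0.0) * value for name, value in phi.items())


# Hypothetical weights; a learner would set these.
THETA = {"inside_score": 0.5, "two_word_vp": 1.0, "crosses_constituent": -2.0}

# Hypothetical feature vectors for two agenda items.
PHI_CLEAN = {"inside_score": 4.0, "two_word_vp": 1.0}
PHI_CROSSING = {"inside_score": 4.0, "two_word_vp": 1.0, "crosses_constituent": 1.0}

print(priority(THETA, PHI_CLEAN))
print(priority(THETA, PHI_CROSSING))
```

Features that are absent from `theta` contribute nothing, so new feature templates can be added without touching the scoring code.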

15 Learning priorities
Challenges:
- The world is non-deterministic
- State space is huge (25 words -> 5x10^14 states)
- The oracle is way too good (25 actions vs 25k actions)
- The oracle does not experience a trade-off between speed and accuracy
Agenda: 2 VP[1,3]   2 VP[1,4]   1 S[0,2]
Oracle: Pssst! You should choose the 2nd one!

16 Parsing Trajectories (I)
[Figure: policy gradient trajectories compared with ground truth (oracle) trajectories.]

17 Preliminary Result
Method                                        Recall   Relative # of pops
Apprenticeship Learning                       …        …x
  with Reward Shaping                         …        …x
Policy Gradient with Boltzmann Exploration    …        …x
Uniform cost search                           …        …x
Pruned Uniform cost search                    …        …x
Failure causes: too hard to imitate the oracle with our features!
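For readers unfamiliar with the Boltzmann exploration named in the table, the idea is to pop agenda items at random with probability proportional to exp(priority / temperature) instead of always popping the best one. The sketch below shows only that sampling step under assumed names (`boltzmann_probs`, `sample_pop`, `temperature`); it is not the training procedure from the talk.

```python
import math
import random


def boltzmann_probs(priorities, temperature=1.0):
    """Softmax over priorities; subtracting the max keeps exp() stable."""
    top = max(priorities)
    exps = [math.exp((p - top) / temperature) for p in priorities]
    total = sum(exps)
    return [e / total for e in exps]


def sample_pop(items, priorities, temperature=1.0, rng=random):
    """Sample one agenda item according to the Boltzmann distribution."""
    probs = boltzmann_probs(priorities, temperature)
    return rng.choices(items, weights=probs, k=1)[0]


items = ["VP[1,3]", "VP[1,4]", "S[0,2]"]
priorities = [2.0, 2.0, 1.0]
probs = boltzmann_probs(priorities)
print(probs)
print(sample_pop(items, priorities, rng=random.Random(0)))
```

A low temperature concentrates probability on the highest-priority items, while a high temperature approaches uniform sampling.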

18 Parsing Trajectories (II)
[Figure: policy gradient trajectories, oracle trajectories, good trajectories, and oracle-infused trajectories.]

19 Oracle-Infused Policy Gradient
Method                                        Recall   Relative # of pops
Oracle-Infused Policy Gradient                91.2     0.…x
Apprenticeship Learning                       …        …x
  with Reward Shaping                         …        …x
Policy Gradient with Boltzmann Exploration    …        …x
Uniform cost search                           …        …x
Pruned Uniform cost search                    …        …x

20 Pareto Frontier
[Figure: accuracy/speed Pareto frontier. Annotations on the comparison system: has access to multiple levels of grammar; multiple pruning thresholds.]
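A Pareto frontier over speed and accuracy keeps only the operating points that no other point beats on both axes. Here is a minimal sketch of that computation; the `(time, accuracy)` pairs are invented for illustration and are not the talk's measurements.

```python
def pareto_frontier(points):
    """Return the non-dominated (time, accuracy) points.

    Lower time and higher accuracy are better. Points are scanned in order
    of increasing time, and a point is kept only if it is more accurate
    than every faster point seen so far.
    """
    frontier = []
    best_accuracy = float("-inf")
    for time, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((time, accuracy))
            best_accuracy = accuracy
    return frontier


# Invented operating points: (time in seconds, accuracy).
POINTS = [(1.0, 0.80), (2.0, 0.78), (2.0, 0.85), (3.0, 0.85), (4.0, 0.90)]
print(pareto_frontier(POINTS))
```

Sweeping a pruning threshold or the trade-off parameter λ produces one point per setting, and the frontier is what remains after discarding settings that are both slower and no more accurate than another.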

21 Take-away messages...
- Many AI tasks can be cast as prioritized search (parsing, planning, inference...)
- Non-determinism is a unique property of such sequential decision-making processes
- It allows us to reason about trajectories, and learn to trade speed for accuracy

22 Dependency parsing
Example sentence: NLP algorithms use a kitchen sink of features
[Figure: dependency tree over the sentence, rooted at [root], with arcs labelled subject, object, n-mod, and p-mod. Later builds of the slide rearrange the words into the tree layout.]

25 Dependency parsing
[Figure: the edge use -> sink in the tree over "NLP algorithms use a kitchen sink of features".]
Edge features:
- Lex(use sink)
- POS(verb noun)
- skip(2), skip(det), skip(noun), skip(det Noun)
- dist=3
- various regexps...
- spelling features
- etc...
Three steps:
1. Compute POS tags
2. Compute k n^2 features
3. Run directed MST
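The k n^2 cost of step 2 can be seen by scoring every candidate head/modifier edge with every feature template. The sketch below assumes three hypothetical templates modelled on the slide's Lex, POS, and dist features, hand-assigned POS tags, and made-up weights. Step 1 (tagging) is replaced by the fixed tag list and step 3 (directed MST decoding) is not implemented.

```python
def lex(h, m, words, tags):
    return f"Lex({words[h]} {words[m]})"


def pos(h, m, words, tags):
    return f"POS({tags[h]} {tags[m]})"


def dist(h, m, words, tags):
    return f"dist={abs(h - m)}"


def score_edges(words, tags, templates, weights):
    """Score every head -> modifier edge; index 0 is the artificial [root]."""
    n = len(words) - 1  # number of real words
    scores, evaluations = {}, 0
    for m in range(1, n + 1):        # each word needs a head
        for h in range(0, n + 1):    # the head is [root] or another word
            if h == m:
                continue
            feats = [template(h, m, words, tags) for template in templates]
            evaluations += len(feats)
            scores[(h, m)] = sum(weights.get(f, 0.0) for f in feats)
    return scores, evaluations


WORDS = ["[root]", "NLP", "algorithms", "use", "a", "kitchen", "sink", "of", "features"]
# Assumed tags, standing in for the output of a POS tagger.
TAGS = ["root", "noun", "noun", "verb", "det", "noun", "noun", "prep", "noun"]
# Made-up weights for the three features that fire on the edge use -> sink.
WEIGHTS = {"Lex(use sink)": 1.5, "POS(verb noun)": 1.0, "dist=3": -0.5}

scores, evaluations = score_edges(WORDS, TAGS, [lex, pos, dist], WEIGHTS)
print(evaluations)     # k * n^2 template evaluations
print(scores[(3, 6)])  # score of the edge use -> sink
```

With n words there are n^2 candidate edges once [root] is allowed as a head, so the number of template evaluations grows as k n^2 even before decoding starts, which motivates computing features selectively.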

26 Case study: dependency parsing
[Figure: time in seconds (axis tick at 0.12) for the average sentence and by sentence length (<10, <20, <30, <40, <50, >=50), broken down into Tagging, Features, and Parsing.]

27 Dynamic feature selection

28 The oracle is too good! (He+Eisner+D, NIPS 2012)
- The oracle knows the label
- Picks the feature with the highest y*value*weight
- Ends after selecting one feature
Coach says how to improve, not the best thing to do
- Oracle: Pssst! You should choose argmin_a E[l(a)]
- Coach: Pssst! You should choose argmin_a E[l(a)] - η f(a)
If N = T log T, then L(π_n) < T ε_N + O(1) for some n
Provably smaller than DAgger's epsilon!
[Figure: overall accuracy (70% to 90%) against average cost/example for Oracle, Coaching, DAgger, and Forward selection.]
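The difference between the oracle's advice and the coach's advice can be sketched as two argmin rules. This is an illustration under assumptions: the minus sign in E[l(a)] - η f(a) is reconstructed from the garbled slide text, f(a) is taken to be the learner's current score for action a, and the losses and scores below are invented.

```python
def oracle_action(expected_loss):
    """Oracle: choose the action with the lowest expected loss."""
    return min(expected_loss, key=expected_loss.get)


def coach_action(expected_loss, learner_score, eta):
    """Coach: trade expected loss against what the learner already prefers.

    Minimizes E[l(a)] - eta * f(a), where f(a) is assumed to be the
    learner's current score for action a.
    """
    return min(expected_loss, key=lambda a: expected_loss[a] - eta * learner_score[a])


# Invented expected losses and learner scores for three actions.
EXPECTED_LOSS = {"a1": 0.0, "a2": 0.1, "a3": 0.9}
LEARNER_SCORE = {"a1": -2.0, "a2": 1.0, "a3": 2.0}

print(oracle_action(EXPECTED_LOSS))
print(coach_action(EXPECTED_LOSS, LEARNER_SCORE, eta=0.5))
```

In this toy case the oracle insists on the best action even though the learner scores it very poorly, while the coach suggests a nearly-as-good action that the learner can reach more easily. Setting `eta` to zero recovers the oracle.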

29 Results across languages

30 Thanks! Questions?
- We can build systems that learn to trade off speed vs accuracy (it's hard)
- Requires new algorithms
- Important to model the problem right
- Imitation/reinforcement learning applied to non-deterministic search
- Orthogonal improvements to most speedup-type papers
Collaborators: Jason Eisner, Jiarong Jiang, Adam Teichert, He He, Tim Vieira
"I'm going to be on the job market soon!"

Hal Daumé III, University of Maryland, me@hal3.name, @haldaume3