Language and Statistics II
Transcription
1 Language and Statistics II Lecture 19: EM for Models of Structure Noah Smith
2 Expectation-Maximization
E step: for all i and y,
    q(y \mid x_i) \leftarrow p_{\theta^{(t)}}(y \mid x_i) = \frac{p_{\theta^{(t)}}(x_i, y)}{\sum_{y'} p_{\theta^{(t)}}(x_i, y')}
    (soft assignment, or voting)
M step:
    \theta^{(t+1)} \leftarrow \arg\max_{\theta} \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_{\theta}(x, y)
    (pretend \tilde{p}(x)\, q(y \mid x) is fully observed data; do MLE)
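The two updates above are easy to see in a tiny concrete case. Below is a minimal Python sketch (a toy example of my own, not from the lecture): EM for a mixture of two biased coins, where each x_i is the number of heads in 10 flips and the hidden y is which coin was used. The E step computes the soft assignments q(y | x_i); the M step re-estimates the mixing weights and coin biases from the resulting expected counts. The marginal log-likelihood is printed each iteration; as the next few slides prove, it never decreases.

```python
import math

# Toy data: number of heads observed in 10 flips, for each of 10 observations.
N_FLIPS = 10
xs = [9, 8, 2, 1, 9, 3, 8, 2, 7, 1]

def joint(x, y, pi, theta):
    """p_theta(x_i, y): mixing weight times the binomial likelihood of coin y."""
    return pi[y] * math.comb(N_FLIPS, x) * theta[y] ** x * (1 - theta[y]) ** (N_FLIPS - x)

pi = [0.5, 0.5]      # mixing weights
theta = [0.6, 0.4]   # head probabilities of the two coins (asymmetric start)

for t in range(20):
    # E step: q(y | x_i) is proportional to p_theta(x_i, y)  ("soft assignment or voting").
    q, loglik = [], 0.0
    for x in xs:
        px = sum(joint(x, y, pi, theta) for y in (0, 1))
        q.append([joint(x, y, pi, theta) / px for y in (0, 1)])
        loglik += math.log(px)
    # M step: fully-observed MLE, pretending each (x_i, y) occurred with weight q(y | x_i).
    for y in (0, 1):
        soft_n = sum(qi[y] for qi in q)                       # expected number of uses of coin y
        soft_heads = sum(qi[y] * x for qi, x in zip(q, xs))   # expected number of heads from coin y
        pi[y] = soft_n / len(xs)
        theta[y] = soft_heads / (soft_n * N_FLIPS)
    print(f"iter {t:2d}  loglik {loglik:8.4f}  pi[0] {pi[0]:.3f}  theta {theta[0]:.3f}/{theta[1]:.3f}")
```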
3 Proof that EM = Partial-Data MLE
Claim: EM iterations improve likelihood, converging to a local optimum.
Maximizing likelihood means maximizing
    \sum_{i=1}^{n} \log p_{\theta}(x_i) = \sum_{i=1}^{n} \log \sum_{y} p_{\theta}(x_i, y) \propto \sum_{x} \tilde{p}(x) \log \sum_{y} p_{\theta}(x, y)
The M step maximizes
    \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(x, y) = \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_{\theta}(x, y)
4 MLE Objective minus M-Step Objective
What MLE wants maximized:
    \sum_{x} \tilde{p}(x) \log \sum_{y} p_{\theta}(x, y)
What the M step maximizes:
    \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(x, y)
Their difference:
    \sum_{x} \tilde{p}(x) \log \sum_{y} p_{\theta}(x, y) - \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(x, y)
    = \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log \sum_{y'} p_{\theta}(x, y') - \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(x, y)
    = \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log \frac{\sum_{y'} p_{\theta}(x, y')}{p_{\theta}(x, y)}
    = - \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(y \mid x)
So what MLE wants maximized equals what the M step maximizes, minus \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(y \mid x).
5 Central Claim
Write \Phi(\theta, q) = \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(x, y) (what the M step maximizes) and \Delta(\theta, q) = \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(y \mid x), so that what MLE wants maximized is L(\theta) = \Phi(\theta, q) - \Delta(\theta, q).
Claim:
    L(\theta^{(t+1)}) = \Phi(\theta^{(t+1)}, q) - \Delta(\theta^{(t+1)}, q) \ge \Phi(\theta^{(t)}, q) - \Delta(\theta^{(t)}, q) = L(\theta^{(t)})
Part 1 (M step): \Phi(\theta^{(t+1)}, q) = \max_{\theta} \Phi(\theta, q) \ge \Phi(\theta^{(t)}, q).
Part 2 (E step): with q(y \mid x) = p_{\theta^{(t)}}(y \mid x),
    \Delta(\theta^{(t+1)}, q) - \Delta(\theta^{(t)}, q) = \sum_{x} \tilde{p}(x) \sum_{y} p_{\theta^{(t)}}(y \mid x) \log \frac{p_{\theta^{(t+1)}}(y \mid x)}{p_{\theta^{(t)}}(y \mid x)} = \mathrm{E}_{\tilde{p}}\left[ -D\left( p_{\theta^{(t)}}(\cdot \mid x) \,\|\, p_{\theta^{(t+1)}}(\cdot \mid x) \right) \right] \le 0
6 Central Claim (continued)
    L(\theta^{(t+1)}) = \Phi(\theta^{(t+1)}, q) - \Delta(\theta^{(t+1)}, q) \ge \Phi(\theta^{(t)}, q) - \Delta(\theta^{(t)}, q) = L(\theta^{(t)})
The M step guarantees an increase in \Phi; the E step guarantees a decrease in \Delta. (Recall: what MLE wants maximized is \sum_{x} \tilde{p}(x) \log \sum_{y} p_{\theta}(x, y); what the M step maximizes is \Phi(\theta, q) = \sum_{x} \tilde{p}(x) \sum_{y} q(y \mid x) \log p_{\theta}(x, y).)
7 Convergence
EM iterations will never decrease likelihood. Under some conditions, EM converges to a saddle point; generally it is assumed that EM will converge to a local maximum. Convergence is linear (i.e., slow), and the rate depends on how much information is missing.
8 Expectation-Maximization
E step: for all i and y,
    q(y \mid x_i) \leftarrow p_{\theta^{(t)}}(y \mid x_i) = \frac{p_{\theta^{(t)}}(x_i, y)}{\sum_{y'} p_{\theta^{(t)}}(x_i, y')}
    (soft assignment, or voting)
M step:
    \theta^{(t+1)} \leftarrow \arg\max_{\theta} \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_{\theta}(x, y)
    (pretend \tilde{p}(x)\, q(y \mid x) is fully observed data; do MLE)
9 Toward Structure Models
There are way too many values of Y to sum over! Two key points: we never need to sum over Y by enumeration, and we never need q to be computed explicitly.
10 Consider the M Step
    \theta^{(t+1)} \leftarrow \arg\max_{\theta} \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_{\theta}(x, y)
    (pretend \tilde{p}(x)\, q(y \mid x) is fully observed data)
To maximize likelihood, what do we need? For multinomial-based models (HMMs, PCFGs, etc.), we need counts. For log-linear models in general, we need counts (of features).
11 Simplifying the M Step (multinomials)
    \theta^{(t+1)} \leftarrow \arg\max_{\theta} \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_{\theta}(x, y)
For a multinomial-based model, \log p_{\theta}(x, y) = \sum_{e} \mathrm{count}(e; x, y) \log p_{e}, so
    \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_{\theta}(x, y)
    = \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \sum_{e} \mathrm{count}(e; x, y) \log p_{e}
    = \sum_{e} (\log p_{e}) \sum_{x, y} \tilde{p}(x)\, q(y \mid x)\, \mathrm{count}(e; x, y)
    = \frac{1}{n} \sum_{e} (\log p_{e}) \sum_{i=1}^{n} \sum_{y} q(y \mid x_i)\, \mathrm{count}(e; x_i, y)
The inner sum is the expected count of event e given x_i.
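In other words, for multinomial parameters the M step reduces to collecting expected event counts under q and renormalizing within each distribution. A minimal sketch (Python; the event keys and the grouping function are invented for illustration):

```python
from collections import defaultdict

def m_step_multinomial(expected_counts, group_of):
    """expected_counts: maps an event e to its expected count under q.
    group_of(e): which multinomial e belongs to (e.g., all rewrites of one
    nonterminal, or all emissions from one tag).
    Returns p_e = E[count(e)] / (total expected count of e's group)."""
    totals = defaultdict(float)
    for e, c in expected_counts.items():
        totals[group_of(e)] += c
    return {e: c / totals[group_of(e)] for e, c in expected_counts.items()}

# Example: expected counts of PCFG rules, grouped by their left-hand side.
expected = {("NP", "DT NN"): 7.3, ("NP", "NNP"): 2.7, ("VP", "V NP"): 5.0}
print(m_step_multinomial(expected, group_of=lambda rule: rule[0]))
# NP rules sum to 1.0; VP rules sum to 1.0.
```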
12 Simplifying the M Step (log-linear models)
    \theta^{(t+1)} \leftarrow \arg\max_{\theta} \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_{\theta}(x, y)
    \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_{\theta}(x, y)
    = \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \left( \theta \cdot \mathbf{f}(x, y) - \log Z_{\theta} \right)
    = \frac{1}{n} \sum_{i=1}^{n} \sum_{y} q(y \mid x_i)\; \theta \cdot \mathbf{f}(x_i, y) - \log Z_{\theta}
where Z_{\theta} = \sum_{x', y'} \exp\left( \theta \cdot \mathbf{f}(x', y') \right). (More on Z_{\theta} next time!)
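For a log-linear model the M-step objective is no longer maximized by normalized counts, but its gradient is still built from feature counts: expected features under \tilde{p}(x)\, q(y \mid x), minus expected features under the model. A small sketch, assuming a toy unstructured model small enough that Z_\theta can be computed by brute-force enumeration (the feature functions and numbers are made up; computing Z_\theta for structured models is the "next time" part):

```python
import math
from itertools import product

# A tiny globally normalized log-linear model over (x, y) pairs.
X, Y = ["a", "b"], ["0", "1"]
FEATS = [lambda x, y: 1.0 if (x, y) == ("a", "0") else 0.0,
         lambda x, y: 1.0 if y == "1" else 0.0]

def score(x, y, theta):
    return sum(t * f(x, y) for t, f in zip(theta, FEATS))

def log_Z(theta):
    return math.log(sum(math.exp(score(x, y, theta)) for x, y in product(X, Y)))

def m_step_gradient(theta, p_tilde, q):
    """Gradient of  sum_x p~(x) sum_y q(y|x) [theta . f(x, y) - log Z_theta]
    = E_{p~ q}[f]  -  E_{p_theta(x, y)}[f]."""
    lZ = log_Z(theta)
    grad = []
    for f in FEATS:
        observed = sum(p_tilde[x] * q[x][y] * f(x, y) for x in X for y in Y)
        expected = sum(math.exp(score(x, y, theta) - lZ) * f(x, y) for x, y in product(X, Y))
        grad.append(observed - expected)
    return grad

p_tilde = {"a": 0.5, "b": 0.5}
q = {"a": {"0": 0.9, "1": 0.1}, "b": {"0": 0.2, "1": 0.8}}   # soft assignments from the E step
print(m_step_gradient([0.0, 0.0], p_tilde, q))
```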
13 Sufficient Statistics
A statistic S is sufficient for a parameter \theta when p(\mathrm{data} \mid S(\mathrm{data}), \theta) = p(\mathrm{data} \mid S(\mathrm{data})). The M step only requires sufficient statistics under q. For NLP models, this usually means expected counts.
14 HMM Forward and Backward Probabilities
    \alpha(i, c) = p(s_1 \cdots s_i, C_i = c)   (forward probability)
    \beta(i, c) = p(s_{i+1} \cdots s_n \mid C_i = c)   (backward probability)
    \alpha(i, c)\, \beta(i, c) = p(s_1 \cdots s_n, C_i = c)
    \frac{\alpha(i, c)\, \beta(i, c)}{\alpha(n+1, \mathrm{stop})} = p(C_i = c \mid s_1 \cdots s_n)   (posterior probability that s_i is labeled with class c)
    \sum_{i=1}^{n} \frac{\alpha(i, c)\, \beta(i, c)}{\alpha(n+1, \mathrm{stop})} = \mathrm{E}\left[\, |\{ i : C_i = c \}| \mid s_1 \cdots s_n \right]   (expected count of class c)
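A minimal sketch of these quantities for a two-state HMM (Python with NumPy; the states, probabilities, and the way the stop probability is folded into beta are illustrative assumptions, not the lecture's model):

```python
import numpy as np

states = ["N", "V"]
start = np.array([0.8, 0.2])           # p(C_1 = c)
trans = np.array([[0.5, 0.2],          # p(C_{i+1} = c' | C_i = c); each row plus stop sums to 1
                  [0.4, 0.1]])
stop = np.array([0.3, 0.5])            # p(stop | C_n = c)
vocab = {"dogs": 0, "bark": 1}
emit = np.array([[0.7, 0.3],           # p(s_i | C_i = c)
                 [0.2, 0.8]])

def forward_backward(sent):
    n = len(sent)
    obs = [vocab[w] for w in sent]
    alpha = np.zeros((n, len(states)))  # alpha[i, c] = p(s_1 .. s_{i+1}, C_{i+1} = c)
    beta = np.zeros((n, len(states)))   # beta[i, c]  = p(s_{i+2} .. s_n, stop | C_{i+1} = c)
    alpha[0] = start * emit[:, obs[0]]
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]
    beta[n - 1] = stop
    for i in range(n - 2, -1, -1):
        beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])
    Z = alpha[n - 1] @ stop             # p(s_1 .. s_n, stop): plays the role of alpha(n+1, stop)
    posterior = alpha * beta / Z        # posterior[i, c] = p(C_{i+1} = c | s_1 .. s_n)
    return posterior, posterior.sum(axis=0)

post, class_counts = forward_backward(["dogs", "bark"])
print(post)          # each row sums to 1
print(class_counts)  # expected number of positions labeled N and labeled V
```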
15 CKY Inside and Outside Probabilities
Write N_{ij} for the event that nonterminal N spans positions i through j.
    \alpha(i, j, N) = p(s_1 \cdots s_{i-1},\, N_{ij},\, s_{j+1} \cdots s_n)   (outside probability)
    \beta(i, j, N) = p(s_i \cdots s_j \mid N_{ij})   (inside probability)
    \alpha(i, j, N)\, \beta(i, j, N) = p(s_1 \cdots s_n,\, N_{ij})
    \frac{\alpha(i, j, N)\, \beta(i, j, N)}{\beta(1, n, S)} = p(N_{ij} \mid s_1 \cdots s_n)
    \frac{\beta(i, j, N')\, \beta(j+1, k, N'')\, p(N \to N' N'')\, \alpha(i, k, N)}{\beta(1, n, S)} = p(N_{ik} \to N'_{ij}\, N''_{(j+1)k} \mid s_1 \cdots s_n)
    \sum_{i=1}^{n} \sum_{j=i}^{n} \sum_{k=j+1}^{n} p(N_{ik} \to N'_{ij}\, N''_{(j+1)k} \mid s_1 \cdots s_n) = \mathrm{E}\left[\, |\{ (i, j, k) : N_{ik} \to N'_{ij}\, N''_{(j+1)k} \}| \mid s_1 \cdots s_n \right]   (expected count of the rule)
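A compact sketch of the same computation for a toy PCFG in Chomsky normal form (Python; the grammar and its probabilities are made up). It fills the inside and outside charts and then accumulates the expected binary-rule counts, which are exactly the sufficient statistics the M step needs:

```python
from collections import defaultdict

binary = {("S", "NP", "VP"): 1.0,
          ("NP", "DT", "NN"): 0.7, ("NP", "NP", "PP"): 0.3,
          ("VP", "V", "NP"): 1.0,
          ("PP", "P", "NP"): 1.0}
lexical = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("NN", "park"): 0.5,
           ("V", "saw"): 1.0, ("P", "in"): 1.0}

def inside_outside(words, root="S"):
    n = len(words)
    beta = defaultdict(float)    # beta[i, j, A]  = p(A =>* w_i .. w_j)                 (inside)
    alpha = defaultdict(float)   # alpha[i, j, A] = p(S =>* w_0..w_{i-1} A w_{j+1}..)   (outside)
    for i, w in enumerate(words):                          # inside: lexical cells
        for (A, word), p in lexical.items():
            if word == w:
                beta[i, i, A] += p
    for width in range(2, n + 1):                          # inside: bottom-up
        for i in range(n - width + 1):
            k = i + width - 1
            for (A, B, C), p in binary.items():
                for j in range(i, k):
                    beta[i, k, A] += p * beta[i, j, B] * beta[j + 1, k, C]
    alpha[0, n - 1, root] = 1.0
    for width in range(n, 1, -1):                          # outside: top-down
        for i in range(n - width + 1):
            k = i + width - 1
            for (A, B, C), p in binary.items():
                out = alpha[i, k, A]
                if out == 0.0:
                    continue
                for j in range(i, k):
                    alpha[i, j, B] += p * out * beta[j + 1, k, C]
                    alpha[j + 1, k, C] += p * out * beta[i, j, B]
    Z = beta[0, n - 1, root]                               # p(sentence)
    counts = defaultdict(float)                            # expected rule counts
    for (A, B, C), p in binary.items():
        for i in range(n):
            for k in range(i + 1, n):
                for j in range(i, k):
                    counts[A, B, C] += alpha[i, k, A] * p * beta[i, j, B] * beta[j + 1, k, C] / Z
    return Z, counts

Z, counts = inside_outside("the dog saw the dog in the park".split())
print(Z)
print(dict(counts))
```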
16 In General
Don't compute q directly in the E step; just get the sufficient statistics. Inside and outside algorithms can help for some models! Other alternatives (less common in NLP): sample from q(y | x) to get sufficient statistics; use a variational approximation to q(y | x), which should remind you of the factored dual in structured maximum-margin training; or use statistics on structure pieces instead of whole structures.
17 Pereira and Schabes (1992)
Suppose you have a partially bracketed corpus and want to constrain re-estimation to respect the known bracketings; everything else is hidden.
(Democrats took control of both houses)   [no information]
((Democrats) took control of (both houses))   [base NPs]
((Democrats) took (control of (both houses)))   [all NPs]
(Democrats (took control of both houses))   [VP]
18 CKY
(Chart diagram: a lexical rule X → w_i fills cell (i, i); a binary rule X → Y Z combines Y over span (i, j) and Z over (j+1, k) into X over (i, k); the goal item is S over (1, n).)
19 CKY
(Same chart diagram, but an item X over span (i, k) is built only if (i, k) doesn't cross any known constituents.)
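The constraint in the constrained chart above is just a span-compatibility test: an item over (i, k) is admitted only if (i, k) does not cross any known bracket. A minimal sketch (Python; `crosses` is a hypothetical helper name):

```python
def crosses(i, k, brackets):
    """True if the span (i, k) crosses some known bracket (a, b): the two spans
    overlap but neither contains the other. Spans are inclusive word indices."""
    return any((a < i <= b < k) or (i < a <= k < b) for a, b in brackets)

# Known base-NP brackets for "(Democrats) took control of (both houses)":
known = [(0, 0), (4, 5)]
print(crosses(1, 4, known))   # True: a span over words 1..4 would split the NP (4, 5)
print(crosses(1, 5, known))   # False: compatible, so the chart item is allowed
```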
20 Pereira and Schabes (1992)
When compared with unconstrained EM: better fit to the data (cross-entropy), better accuracy, and faster convergence. A later result (Hwa, 1999): high-level structure helps more than low-level structure.
21 Merialdo (1994)
Suppose you have some tagged text and some untagged text. You could train a tagger on the tagged text; can you use the untagged text to help? Merialdo: vary the amount of tagged text, use the tagged text to initialize, and run EM.
22 Merialdo (1994)
23 Merialdo (1994)
Similar results by Elworthy (1994). Another way to combine the data:
    \max_{\theta} \sum_{i \in L} \log p_{\theta}(x_i, y_i) + \sum_{i \in U} \log p_{\theta}(x_i)
Equivalently, augment the E-step counts with the observed counts before each M step. (Same effect, anecdotally.)
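A sketch of that count-augmentation recipe (Python; the event names are illustrative): counts observed in the labeled data are added to the expected counts from the E step on the unlabeled data, and the usual M-step normalization then runs on the sum.

```python
from collections import Counter

def combined_m_step_counts(observed_counts, expected_counts):
    """Add counts observed in tagged text to expected counts from forward-backward
    on untagged text; the M step then normalizes the combined counts."""
    total = Counter(observed_counts)
    for event, c in expected_counts.items():
        total[event] += c
    return total

# e.g. tag-emission events (tag, word):
observed = {("DT", "the"): 120.0, ("NN", "dog"): 15.0}                      # from the tagged text
expected = {("DT", "the"): 33.2, ("NN", "dog"): 4.1, ("VB", "dog"): 0.6}    # E step on untagged text
print(combined_m_step_counts(observed, expected))
```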
24 The Two Main Problems with EM
Marginal likelihood ≠ accuracy. Local optima. (Plot from Smith (2006); similar results in Charniak (1993).)
25 Variants
MAP instead of MLE: add a prior (smoothing). For speed: the Viterbi approximation (mode instead of expectation in the E step), or incremental EM (an M step after every example). To improve search quality: deterministic annealing (gradually relax an entropy constraint on q; affects the E step only), random restarts, random reweighting of examples, really good initialization, or an alternative objective (next week).
26 Klein & Manning (2002)
A highly deficient grammatical model that predicts POS tag sequences: the constituent-context model (CCM). Best published unsupervised parsing results on WSJ-10 (in 2002). Trained using EM with an interesting initializer, Pr_split.
27 Constituent-Context Model
t = (t_1 ... t_n) is the tag sequence. Let C_{i,j} be a random variable equal to 1 if t_i ... t_j is a constituent and 0 if not. C is the set of all C_{i,j} (an upper-triangular matrix). The valid values of C are the ones that form binary trees.
    Pr(t, C) = Pr(C) Pr(t | C)
             = Pr(C) Π_{i,j: 1 ≤ i ≤ j ≤ n} Pr(t_i ... t_j | C_{i,j}) Pr(t_{i-1}, t_{j+1} | C_{i,j})
             = (1/τ(n)) Π_{i,j: 1 ≤ i ≤ j ≤ n} Pr(t_i ... t_j | C_{i,j}) Pr(t_{i-1}, t_{j+1} | C_{i,j})
Here 1/τ(n) is a uniform distribution over (binary) trees, Pr(t_i ... t_j | C_{i,j}) is the constituent factor, and Pr(t_{i-1}, t_{j+1} | C_{i,j}) is the context factor.
28 Example (tag sequence t_1 ... t_6, padded with <S> and </S>): the whole-sentence span as a constituent contributes Pr(t_1 t_2 t_3 t_4 t_5 t_6 | 1) and its context Pr(<S>, </S> | 1).
29 Example: the span t_1 ... t_5 as a non-constituent contributes Pr(t_1 t_2 t_3 t_4 t_5 | 0) and Pr(<S>, t_6 | 0).
30 Example: the span t_1 ... t_4 as a constituent contributes Pr(t_1 t_2 t_3 t_4 | 1) and Pr(<S>, t_5 | 1).
31 Example: the span t_5 t_6 as a constituent contributes Pr(t_5 t_6 | 1) and Pr(t_4, </S> | 1).
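Putting the formula and the examples together, here is a minimal sketch of the CCM score for one tag sequence and one bracketing (Python). The probability tables and backoff value below are placeholders: in the actual model the four multinomials are exactly what EM estimates, and the constant 1/τ(n) is omitted.

```python
from math import log

P_span = {1: {("DT", "NN"): 0.10}, 0: {}}        # Pr(t_i .. t_j | C_ij)
P_context = {1: {("<S>", "VBD"): 0.20}, 0: {}}   # Pr(t_{i-1}, t_{j+1} | C_ij)
BACKOFF = 1e-3                                   # placeholder probability for unseen events

def ccm_log_prob(tags, constituents):
    """log Pr(t, C) up to the constant 1/tau(n): every span (i, j) contributes a
    span factor and a context factor, conditioned on whether it is a constituent."""
    n = len(tags)
    padded = ["<S>"] + list(tags) + ["</S>"]     # padded[i] is the tag to the left of span (i, j)
    lp = 0.0
    for i in range(n):
        for j in range(i, n):
            c = 1 if (i, j) in constituents else 0
            span = tuple(tags[i:j + 1])
            context = (padded[i], padded[j + 2])
            lp += log(P_span[c].get(span, BACKOFF)) + log(P_context[c].get(context, BACKOFF))
    return lp

tags = ["DT", "NN", "VBD", "DT", "NN"]
# Spans of one binary tree over the 5 tags (single tags count as constituents too).
tree = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (0, 1), (3, 4), (2, 4), (0, 4)}
print(ccm_log_prob(tags, tree))
```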
32 About CCM
Only four multinomials to estimate, but two features are needed for every substring of tags! (This is why they used WSJ-10.) Highly deficient!
33 Pr_split
Given the length of the sentence, n: choose N_1 uniformly at random (u.a.r.) from {1, 2, ..., n}; choose N_2 u.a.r. from {1, 2, ..., N_1 - 1}; choose N_3 u.a.r. from {1, 2, ..., n - N_1 - 1}; continue until no further splits can be made. (Diagram: the sentence is split into spans of lengths N_1 and n - N_1, which are in turn split at N_2 and N_3, and so on.) This model gives a closed form for the expectations; K&M use it to generate initial expectations, then start with an M step. (See also Cover & Thomas, problem 4.3, p. 72.)
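A minimal sampler for the recursive uniform-split process described above (Python; my own reconstruction). K&M use the closed-form expectations rather than samples, but a Monte-Carlo estimate of the same span probabilities shows what the Pr_split initializer looks like:

```python
import random
from collections import Counter

def sample_split_tree(i, j, spans=None):
    """Sample a binary bracketing of the (0-indexed, inclusive) span [i, j] by
    choosing every split point uniformly at random, then recursing."""
    if spans is None:
        spans = set()
    spans.add((i, j))
    if j > i:                              # spans of length >= 2 can still be split
        k = random.randint(i, j - 1)       # split between positions k and k+1
        sample_split_tree(i, k, spans)
        sample_split_tree(k + 1, j, spans)
    return spans

# Monte-Carlo estimate of p(span is a constituent) under Pr_split for a length-6 sentence;
# these expectations (in closed form) are what K&M feed to the first M step.
n, trials = 6, 20000
counts = Counter()
for _ in range(trials):
    counts.update(sample_split_tree(0, n - 1))
for span, c in sorted(counts.items()):
    print(span, round(c / trials, 3))
```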
34 Importance of Pr_split
(Results table comparing right-branching trees, EM initialized with Pr_split (CCM), EM with uniform initialization (CCM), and an upper bound for binary trees; columns: algorithm, model, unlabeled precision UP (%), unlabeled recall UR (%), average crossing brackets, % perfect, 0 CB, 2 CB, iterations, and cross-entropy in bits.)
35 Next Time
Contrastive estimation as an alternative to EM; application to unsupervised tagging and parsing.