Language Models. Hongning Wang
1 Language Models Hongning Wang
2 Notion of Relevance Relevance can be modeled in three broad ways: as similarity between representations (Rep(q), Rep(d)); as the probability of relevance P(r=1|q,d), r ∈ {0,1}; or via probabilistic inference P(d→q) or P(q→d).
- Different rep & similarity: Vector space model (Salton et al., 75); Regression model (Fox, 83); Prob. distr. model (Wong & Yao, 89)
- Generative model, doc generation: Classical prob. model (Robertson & Sparck Jones, 76)
- Generative model, query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
- Different inference system: Prob. concept space model (Wong & Yao, 95); Inference network model (Turtle & Croft, 91)
CS@UVa CS 4501: Information Retrieval 2
3 Query generation models
O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0)
           = [P(Q|D,R=1) P(D|R=1)] / [P(Q|D,R=0) P(D|R=0)]
           ∝ P(Q|D,R=1) · [P(D|R=1) / P(D|R=0)]    (assume P(Q|D,R=0) = P(Q|R=0))
Here P(Q|D,R=1) is the query likelihood p(q|θ_d), and P(D|R=1)/P(D|R=0) is the document prior. Assuming a uniform document prior, we have O(R=1|Q,D) ∝ P(Q|D,R=1). Now, the question is how to compute P(Q|D,R=1). Generally this involves two steps: (1) estimate a language model based on D; (2) compute the query likelihood according to the estimated model. Language models — we will cover them now!
CS@UVa CS 4501: Information Retrieval 3
4 What is a statistical LM? A model specifying a probability distribution over word sequences, e.g.,
- p("Today is Wednesday")
- p("Today Wednesday is")
- p("The eigenvalue is positive")
It can be regarded as a probabilistic mechanism for generating text, and is thus also called a generative model.
CS@UVa CS 4501: Information Retrieval 4
5 Why is a LM useful? It provides a principled way to quantify the uncertainties associated with natural language, and allows us to answer questions like:
- Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition)
- Given that we observe "baseball" three times and "game" once in a news article, how likely is it about sports? (text categorization, information retrieval)
- Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)
CS@UVa CS 4501: Information Retrieval 5
6 Source-Channel framework [Shannon 48]
Source P(X) → Transmitter (encoder) → Noisy Channel P(Y|X) → Receiver (decoder) P(X|Y) → Destination
X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)    (Bayes rule)
When X is text, p(X) is a language model. Many examples:
- Speech recognition: X = word sequence, Y = speech signal
- Machine translation: X = English sentence, Y = Chinese sentence
- OCR error correction: X = correct word, Y = erroneous word
- Information retrieval: X = document, Y = query
- Summarization: X = summary, Y = document
CS@UVa CS 4501: Information Retrieval 6
7 Language model for text Probability distribution over word sequences, via the chain rule (from conditional probability to joint probability):
p(w_1 w_2 … w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1, w_2) … p(w_n|w_1, w_2, …, w_{n-1})
Complexity: O(V^n), where n is the maximum document/sentence length. We need independence assumptions! With 475,000 main headwords in Webster's Third New International Dictionary and an average English sentence length of 14.3 words, a rough estimate is O(475,000^14.3) — astronomically large.
CS@UVa CS 4501: Information Retrieval 7
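The rough estimate above can be sanity-checked with logarithms; a minimal sketch (the 475,000 headwords and 14.3-word average sentence length are the figures from the slide):

```python
import math

V = 475_000   # main headwords in Webster's Third (figure from the slide)
n = 14.3      # average English sentence length in words (figure from the slide)

# The number of distinct word sequences is on the order of V^n;
# compute its base-10 magnitude via logarithms rather than the raw power.
log10_sequences = n * math.log10(V)
print(f"V^n is on the order of 10^{log10_sequences:.0f}")  # prints 10^81
```

With one parameter per sequence, no corpus could ever cover this space, which is why the independence assumptions below are unavoidable.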
8 Unigram language model Generate a piece of text by generating each word independently:
p(w_1 w_2 … w_n) = p(w_1) p(w_2) … p(w_n),  s.t.  Σ_{i=1}^{N} p(w_i) = 1,  p(w_i) ≥ 0
This is essentially a multinomial distribution over the vocabulary — the simplest and most popular choice!
CS@UVa CS 4501: Information Retrieval 8
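A minimal sketch of the independence assumption, using a hypothetical toy vocabulary and made-up probabilities (they just need to sum to 1):

```python
import math

# Toy unigram model: hypothetical probabilities over a 4-word vocabulary.
p = {"text": 0.2, "mining": 0.3, "data": 0.4, "food": 0.1}

def unigram_log_prob(words, model):
    """log p(w1 ... wn) = sum_i log p(wi), by the independence assumption."""
    return sum(math.log(model[w]) for w in words)

lp = unigram_log_prob(["data", "mining"], p)
print(math.exp(lp))  # 0.4 * 0.3 = 0.12
```

Log probabilities are used instead of raw products to avoid floating-point underflow on long sequences.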
9 More sophisticated LMs N-gram language models. In general,
p(w_1 w_2 … w_n) = p(w_1) p(w_2|w_1) … p(w_n|w_1 … w_{n-1})
An N-gram model conditions only on the past N-1 words. E.g., bigram:
p(w_1 … w_n) = p(w_1) p(w_2|w_1) p(w_3|w_2) … p(w_n|w_{n-1})
Other options include remote-dependence language models (e.g., maximum entropy models) and structured language models (e.g., probabilistic context-free grammars).
CS@UVa CS 4501: Information Retrieval 9
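A bigram model can be estimated by MLE from adjacent word pairs; a minimal sketch on a hypothetical toy corpus:

```python
from collections import defaultdict

# Hypothetical toy corpus for illustration.
corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams (prev, cur) and how often each word appears as a history.
bigram, history = defaultdict(int), defaultdict(int)
for prev, cur in zip(corpus, corpus[1:]):
    bigram[(prev, cur)] += 1
    history[prev] += 1

def p_bigram(cur, prev):
    """MLE of p(cur | prev) = c(prev, cur) / c(prev as a history)."""
    return bigram[(prev, cur)] / history[prev]

print(p_bigram("cat", "the"))  # "the" is followed by "cat" 2 of 3 times -> 2/3
```

Note the zero-probability problem already appears here: any pair never seen in the corpus, e.g. p("ran"|"the"), comes out as exactly 0, which motivates the smoothing methods later in the deck.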
10 Why just unigram models? There is difficulty in moving toward more complex models:
- They involve more parameters, so they need more data to estimate
- They increase the computational complexity significantly, both in time and space
- Capturing word order or structure may not add much value for topical inference
But using more sophisticated models can still be expected to improve performance...
CS 4501: Information Retrieval 10
11 Language models for IR [Ponte & Croft SIGIR 98] Document language models. Query: "data mining algorithms"
- Text mining paper: text? mining? association? clustering? food?
- Food nutrition paper: food? nutrition? healthy? diet?
Which model would most likely have generated this query?
CS 4501: Information Retrieval 11
12 Ranking docs by query likelihood Each document d_i is associated with a language model θ_{d_i}, and documents are ranked by the likelihood of the query under each model:
d_1 → θ_{d_1} → p(q|θ_{d_1})
d_2 → θ_{d_2} → p(q|θ_{d_2})
...
d_N → θ_{d_N} → p(q|θ_{d_N})
Justification: PRP, since O(R=1|Q,D) ∝ P(Q|D,R=1)
CS@UVa CS 4501: Information Retrieval 12
13 Justification from PRP (query generation)
O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0)
           = [P(Q|D,R=1) P(D|R=1)] / [P(Q|D,R=0) P(D|R=0)]
           ∝ P(Q|D,R=1) · [P(D|R=1) / P(D|R=0)]    (assume P(Q|D,R=0) = P(Q|R=0))
P(Q|D,R=1) is the query likelihood p(q|θ_d), and P(D|R=1)/P(D|R=0) is the document prior. Assuming a uniform document prior, we have O(R=1|Q,D) ∝ P(Q|D,R=1).
CS@UVa CS 4501: Information Retrieval 13
14 Retrieval as language model estimation Document ranking based on query likelihood:
log p(q|d) = Σ_{i=1}^{n} log p(w_i|d),  where q = w_1 w_2 … w_n
Here p(w_i|d) is the document language model. The retrieval problem thus becomes the estimation of p(w_i|d); the common approach is maximum likelihood estimation (MLE).
CS@UVa CS 4501: Information Retrieval 14
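The ranking rule above is a single sum of log probabilities; a minimal sketch with two hypothetical, already-smoothed document models (all numbers made up for illustration):

```python
import math

def log_query_likelihood(query_words, doc_model):
    """Score a document by log p(q|d) = sum over query words of log p(w_i|d)."""
    return sum(math.log(doc_model[w]) for w in query_words)

# Hypothetical smoothed document models (every query word has nonzero prob).
theta_d1 = {"data": 0.05, "mining": 0.03, "algorithms": 0.01}
theta_d2 = {"data": 0.02, "mining": 0.01, "algorithms": 0.01}

q = ["data", "mining"]
s1 = log_query_likelihood(q, theta_d1)
s2 = log_query_likelihood(q, theta_d2)
print(s1 > s2)  # True: d1 gives the query words higher probability, so it ranks first
```

The models must already be smoothed here; with raw MLE, a single query word absent from the document would make the score log 0 = -inf.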
15 Recap: maximum likelihood estimation Data: a document d with counts c(w_1), …, c(w_N). Model: multinomial distribution p(W|θ) with parameters θ_i = p(w_i). Maximum likelihood estimator: θ̂ = argmax_θ p(W|θ), where
p(W|θ) = ( N! / (c(w_1)! … c(w_N)!) ) Π_{i=1}^{N} θ_i^{c(w_i)}  ∝  Π_{i=1}^{N} θ_i^{c(w_i)}
log p(W|θ) = Σ_{i=1}^{N} c(w_i) log θ_i
Using the Lagrange multiplier approach with the requirement from probability Σ_{i=1}^{N} θ_i = 1, we tune θ_i to maximize
L(W, θ) = Σ_{i=1}^{N} c(w_i) log θ_i + λ ( Σ_{i=1}^{N} θ_i − 1 )
Setting partial derivatives to zero: ∂L/∂θ_i = c(w_i)/θ_i + λ = 0  ⇒  θ_i = −c(w_i)/λ
Since Σ_{i=1}^{N} θ_i = 1, we have λ = −Σ_{i=1}^{N} c(w_i), giving the ML estimate
θ_i = c(w_i) / Σ_{j=1}^{N} c(w_j)
CS@UVa CS 4501: Information Retrieval 15
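The closed form derived above is just normalized counts; a minimal sketch on a hypothetical toy document:

```python
from collections import Counter

# MLE for the multinomial: theta_i = c(w_i) / sum_j c(w_j),
# the closed form obtained with the Lagrange multiplier above.
doc = "text mining text data data data".split()  # hypothetical toy document
counts = Counter(doc)
total = sum(counts.values())
theta = {w: c / total for w, c in counts.items()}

print(theta["data"])  # 3/6 = 0.5
```

The estimate is a proper distribution over the words that occur in d, but assigns exactly zero to every other word, which is the problem the next slides address.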
16 Problem with MLE Unseen events: on a large collection of Yelp reviews with 440K tokens, only 30,000 unique words occurred, and only 0.04% of all possible bigrams occurred. Word frequency follows a Zipf distribution, f(k; s, N) ∝ 1/k^s, where k is the word's rank by frequency (cf. the plot of word frequency in Wikipedia, Nov 27, 2006). This means any word/n-gram that does not occur in the collection has a zero probability with MLE — so no future queries/documents could contain those unseen words/n-grams!
CS@UVa CS 4501: Information Retrieval 16
17 Problem with MLE What probability should we give a word that has not been observed in the document? log 0 = −∞! If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words. This is so-called smoothing.
CS@UVa CS 4501: Information Retrieval 17
18 Illustration of language model smoothing The maximum likelihood estimate p_ML(w) = count of w / count of all words is discounted for the seen words, and the freed probability mass is assigned as nonzero probabilities to the unseen words, yielding the smoothed LM p(w|d).
CS@UVa CS 4501: Information Retrieval 18
19 General idea of smoothing All smoothing methods try to 1. Discount the probability of words seen in a document 2. Re-allocate the extra counts such that unseen words will have a non-zero count CS@UVa CS 4501: Information Retrieval 19
20 Smoothing methods Method 1: Additive smoothing — add a constant δ to the count of each word:
p(w|d) = ( c(w,d) + 1 ) / ( |d| + |V| )    (add one, i.e., Laplace smoothing)
where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size. Problems? Hint: are all words equally important?
CS@UVa CS 4501: Information Retrieval 20
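A minimal sketch of additive smoothing with a hypothetical three-word vocabulary (δ generalizes the add-one case):

```python
from collections import Counter

def additive_smoothing(doc_words, vocab, delta=1.0):
    """p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|); delta=1 is Laplace."""
    counts = Counter(doc_words)
    denom = len(doc_words) + delta * len(vocab)
    return {w: (counts[w] + delta) / denom for w in vocab}

vocab = {"data", "mining", "food"}                 # hypothetical vocabulary
p = additive_smoothing("data mining data".split(), vocab)
print(p["food"])  # unseen word gets (0 + 1) / (3 + 3) = 1/6
```

Note how much mass the unseen word receives: 1/6 for a word that never occurred, which illustrates the "too much to unseen events" problem on the next slides.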
21 Add one smoothing for bigrams CS 4501: Information Retrieval 21
22 After smoothing: giving too much probability to the unseen events
CS 4501: Information Retrieval 22
23 Refine the idea of smoothing Should all unseen words get equal probabilities? We can use a reference language model to discriminate among unseen words:
p(w|d) = p_seen(w|d)        if w is seen in d    (discounted ML estimate)
       = α_d · p(w|REF)     otherwise
where α_d = ( 1 − Σ_{w seen} p_seen(w|d) ) / ( Σ_{w unseen} p(w|REF) )
CS@UVa CS 4501: Information Retrieval 23
24 Smoothing methods Method 2: Absolute discounting — subtract a constant δ from the count of each word:
p(w|d) = ( max(c(w;d) − δ, 0) + δ |d|_u · p(w|REF) ) / |d|
where |d|_u is the number of unique words in d. Problems? Hint: varied document length?
CS@UVa CS 4501: Information Retrieval 24
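A minimal sketch of absolute discounting; the reference model probabilities are hypothetical toy numbers:

```python
from collections import Counter

def absolute_discounting(doc_words, p_ref, delta=0.5):
    """p(w|d) = (max(c(w;d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|"""
    counts = Counter(doc_words)
    d_len = len(doc_words)
    n_uniq = len(counts)  # |d|_u: number of unique words in d
    return {w: (max(counts[w] - delta, 0) + delta * n_uniq * p_ref[w]) / d_len
            for w in p_ref}

p_ref = {"data": 0.5, "mining": 0.3, "food": 0.2}   # hypothetical collection LM
p = absolute_discounting("data mining data".split(), p_ref)
print(p["food"])  # unseen, but nonzero: proportional to p("food"|REF)
```

Each seen word gives up at most δ counts; the δ·|d|_u mass removed is redistributed in proportion to the reference model, so the result still sums to 1.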
25 Smoothing methods Method 3: Linear interpolation (Jelinek-Mercer) — shrink uniformly toward p(w|REF):
p(w|d) = (1 − λ) · c(w,d)/|d| + λ · p(w|REF)
where c(w,d)/|d| is the MLE and λ is a parameter. Problems? Hint: what is missing?
CS@UVa CS 4501: Information Retrieval 25
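A minimal sketch of Jelinek-Mercer interpolation, again with a hypothetical reference model:

```python
from collections import Counter

def jelinek_mercer(doc_words, p_ref, lam=0.1):
    """p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)"""
    counts = Counter(doc_words)
    d_len = len(doc_words)
    return {w: (1 - lam) * counts[w] / d_len + lam * p_ref[w] for w in p_ref}

p_ref = {"data": 0.5, "mining": 0.3, "food": 0.2}   # hypothetical collection LM
p = jelinek_mercer("data mining data".split(), p_ref)
print(p["food"])  # 0.9 * 0 + 0.1 * 0.2 = 0.02
```

The hint on the slide points at what is missing: λ is fixed for every document, so a 10-word document and a 10,000-word document are smoothed equally hard, even though the long document's MLE deserves more trust.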
26 Smoothing methods Method 4: Dirichlet prior / Bayesian — assume pseudo counts μ p(w|REF):
p(w|d) = ( c(w;d) + μ p(w|REF) ) / ( |d| + μ )
       = ( |d| / (|d| + μ) ) · c(w,d)/|d| + ( μ / (|d| + μ) ) · p(w|REF)
where μ is a parameter. Problems?
CS@UVa CS 4501: Information Retrieval 26
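A minimal sketch of Dirichlet prior smoothing with hypothetical toy numbers (μ=10 is unrealistically small; values in the thousands are typical in practice):

```python
from collections import Counter

def dirichlet_smoothing(doc_words, p_ref, mu=2000):
    """p(w|d) = (c(w;d) + mu * p(w|REF)) / (|d| + mu)"""
    counts = Counter(doc_words)
    d_len = len(doc_words)
    return {w: (counts[w] + mu * p_ref[w]) / (d_len + mu) for w in p_ref}

p_ref = {"data": 0.5, "mining": 0.3, "food": 0.2}   # hypothetical collection LM
p = dirichlet_smoothing("data mining data".split(), p_ref, mu=10)
print(p["data"])  # (2 + 10*0.5) / (3 + 10) = 7/13
```

Unlike Jelinek-Mercer, the effective interpolation weight |d|/(|d|+μ) grows with document length, so longer documents rely more on their own counts and less on the reference model.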
27 Dirichlet prior smoothing Bayesian estimator. Posterior of the LM: p(θ|d) ∝ p(d|θ) p(θ), where p(d|θ) is the likelihood of the document given the model and p(θ) is the prior over models. With a conjugate prior, the posterior has the same form as the prior, and the prior can be interpreted as extra/pseudo data. The Dirichlet distribution is a conjugate prior for the multinomial distribution:
Dir(θ | α_1, …, α_N) = ( Γ(α_1 + … + α_N) / (Γ(α_1) … Γ(α_N)) ) Π_{i=1}^{N} θ_i^{α_i − 1}
The α_i act as extra/pseudo word counts; we set α_i = μ p(w_i|REF).
CS@UVa CS 4501: Information Retrieval 27
28 Some background knowledge A conjugate prior yields a posterior distribution in the same family as the prior, e.g., Gaussian → Gaussian, Beta → Binomial, Dirichlet → Multinomial. The Dirichlet distribution is continuous, and samples from it are the parameters of a multinomial distribution.
CS@UVa CS 4501: Information Retrieval 28
29 Dirichlet prior smoothing (cont) Posterior distribution of the parameters:
p(θ|d) = Dir(θ | c(w_1) + α_1, …, c(w_N) + α_N)
Property: if θ ~ Dir(θ|α), then E(θ_i) = α_i / Σ_{j=1}^{N} α_j. The predictive distribution is the same as the posterior mean:
p(w_i|θ̂) = ∫ p(w_i|θ) Dir(θ|α′) dθ = ( c(w_i) + α_i ) / ( Σ_{j=1}^{N} (c(w_j) + α_j) ) = ( c(w_i) + μ p(w_i|REF) ) / ( |d| + μ )
— which is exactly Dirichlet prior smoothing.
CS@UVa CS 4501: Information Retrieval 29
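The identity above can be checked numerically: the posterior mean of the Dirichlet (pseudo counts α_i = μ p(w_i|REF) plus observed counts) matches the smoothing formula term by term. A minimal sketch with hypothetical toy counts:

```python
# Toy document counts and a hypothetical reference LM.
counts = {"data": 2, "mining": 1, "food": 0}
p_ref = {"data": 0.5, "mining": 0.3, "food": 0.2}
mu = 10
alpha = {w: mu * p_ref[w] for w in p_ref}   # pseudo counts alpha_i = mu * p(w_i|REF)

# Posterior mean of Dir(theta | c(w_1)+alpha_1, ..., c(w_N)+alpha_N).
total = sum(counts[w] + alpha[w] for w in counts)
posterior_mean = {w: (counts[w] + alpha[w]) / total for w in counts}

# Dirichlet smoothing formula: (c(w) + mu * p(w|REF)) / (|d| + mu).
d_len = sum(counts.values())
smoothed = {w: (counts[w] + mu * p_ref[w]) / (d_len + mu) for w in counts}

print(all(abs(posterior_mean[w] - smoothed[w]) < 1e-12 for w in counts))  # True
```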
30 Estimating μ using leave-one-out [Zhai & Lafferty 02] Hold out each word occurrence in turn and measure how well the smoothed model built from the rest predicts it: P(w_1|d_{-w_1}), P(w_2|d_{-w_2}), …, P(w_n|d_{-w_n}). The leave-one-out log-likelihood is
L_{-1}(μ|C) = Σ_{i=1}^{N} Σ_{w∈V} c(w, d_i) log( ( c(w, d_i) − 1 + μ p(w|C) ) / ( |d_i| − 1 + μ ) )
and μ is set by the maximum likelihood estimator μ̂ = argmax_μ L_{-1}(μ|C).
CS@UVa CS 4501: Information Retrieval 30
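A minimal sketch of the leave-one-out objective on a hypothetical two-document toy collection. The original paper optimizes μ with Newton's method; a crude grid search stands in for it here:

```python
import math
from collections import Counter

def loo_loglik(mu, docs, p_ref):
    """Leave-one-out log-likelihood for Dirichlet smoothing:
    L_{-1}(mu|C) = sum_i sum_w c(w,d_i) * log((c(w,d_i) - 1 + mu*p(w|C)) / (|d_i| - 1 + mu))
    """
    total = 0.0
    for doc in docs:
        counts = Counter(doc)
        d_len = len(doc)
        for w, c in counts.items():
            total += c * math.log((c - 1 + mu * p_ref[w]) / (d_len - 1 + mu))
    return total

# Hypothetical toy collection; p(w|C) is its pooled MLE.
docs = ["data mining data".split(), "food data food food".split()]
coll = Counter(w for d in docs for w in d)
n_tokens = sum(coll.values())
p_ref = {w: c / n_tokens for w, c in coll.items()}

grid = [0.1, 0.5, 1, 2, 5, 10]
best_mu = max(grid, key=lambda m: loo_loglik(m, docs, p_ref))
```

Subtracting 1 from the count of the held-out word is what makes this a genuine held-out criterion: the model is scored on a token it did not get to see.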
31 Understanding smoothing Query: "the algorithms for data mining" (topical words: "algorithms", "data", "mining"). Under the unsmoothed ML estimates p_ML(w|d1) and p_ML(w|d2), suppose p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), and p("mining"|d1) < p("mining"|d2). Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2) because of the common words "the" and "for". So we should make p("the") and p("for") less different across docs, and smoothing helps achieve this goal: after smoothing with p(w|d) = 0.1 p_DML(w|d) + 0.9 p(w|REF), we get p(q|d1) < p(q|d2)!
CS@UVa CS 4501: Information Retrieval 31
32 Two-stage smoothing [Zhai & Lafferty 02] Stage 1 explains unseen words with a Dirichlet prior (Bayesian, parameter μ, collection LM p(w|C)); Stage 2 explains noise in the query with a two-component mixture (parameter λ, user background model p(w|U)):
p(w|d) = (1 − λ) · ( c(w,d) + μ p(w|C) ) / ( |d| + μ ) + λ · p(w|U)
CS@UVa CS 4501: Information Retrieval 32
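A minimal sketch of the two-stage formula, composing the Dirichlet stage and the mixture stage; the collection and user background models are hypothetical toy numbers:

```python
from collections import Counter

def two_stage(w, doc_words, p_coll, p_user, mu=10, lam=0.2):
    """p(w|d) = (1 - lam) * (c(w,d) + mu*p(w|C)) / (|d| + mu) + lam * p(w|U)"""
    counts = Counter(doc_words)
    dirichlet = (counts[w] + mu * p_coll[w]) / (len(doc_words) + mu)
    return (1 - lam) * dirichlet + lam * p_user[w]

p_coll = {"data": 0.5, "mining": 0.3, "food": 0.2}  # hypothetical collection LM
p_user = {"data": 0.6, "mining": 0.3, "food": 0.1}  # hypothetical user background LM
p = two_stage("data", "data mining data".split(), p_coll, p_user)
print(p)  # 0.8 * (7/13) + 0.2 * 0.6
```

The two parameters play different roles: μ (Stage 1) accounts for the document's unseen words, while λ (Stage 2) accounts for query words the user drew from general background language rather than the topic.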
33 Smoothing & TF-IDF weighting Retrieval formula using the general smoothing scheme, where p(w|d) = p_seen(w|d) if w is seen in d (smoothed ML estimate) and α_d p(w|C) otherwise (reference language model), with α_d = ( 1 − Σ_{w seen} p_seen(w|d) ) / ( Σ_{w unseen} p(w|C) ):
log p(q|d) = Σ_{w: c(w,q)>0} c(w,q) log p(w|d)
           = Σ_{w: c(w,d)>0, c(w,q)>0} c(w,q) log p_seen(w|d) + Σ_{w: c(w,d)=0, c(w,q)>0} c(w,q) log( α_d p(w|C) )
           = Σ_{w: c(w,d)>0, c(w,q)>0} c(w,q) log( p_seen(w|d) / (α_d p(w|C)) ) + |q| log α_d + Σ_{w: c(w,q)>0} c(w,q) log p(w|C)
The key rewriting step adds and subtracts the α_d p(w|C) term for the seen words (where did we see it before?). Similar rewritings are very common when using probabilistic models for IR.
CS@UVa CS 4501: Information Retrieval 33
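The rewriting can be verified numerically: the direct query log-likelihood and the rewritten three-term form must agree exactly. A minimal sketch with hypothetical toy numbers, using Dirichlet smoothing (for which α_d = μ/(|d|+μ)):

```python
import math
from collections import Counter

doc = "data mining data".split()          # hypothetical toy document
query = "data food".split()               # "food" is unseen in the document
p_ref = {"data": 0.5, "mining": 0.3, "food": 0.2}   # hypothetical collection LM
mu = 10

counts, d_len = Counter(doc), len(doc)
alpha_d = mu / (d_len + mu)               # unseen words get alpha_d * p(w|C)

def p_w_d(w):
    """Smoothed p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)."""
    return (counts[w] + mu * p_ref[w]) / (d_len + mu)

q_counts = Counter(query)
direct = sum(c * math.log(p_w_d(w)) for w, c in q_counts.items())

rewritten = (
    sum(c * math.log(p_w_d(w) / (alpha_d * p_ref[w]))
        for w, c in q_counts.items() if counts[w] > 0)      # seen-word "TF-IDF" part
    + len(query) * math.log(alpha_d)                        # doc-length part
    + sum(c * math.log(p_ref[w]) for w, c in q_counts.items())  # doc-independent part
)
print(abs(direct - rewritten) < 1e-9)  # True
```

Only the first two terms depend on the document, which is why the third can be dropped at ranking time, as the next slide notes.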
34 Understanding smoothing Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain
log p(q|d) = Σ_{w_i ∈ d∩q} c(w_i,q) log( p_seen(w_i|d) / (α_d p(w_i|C)) ) + |q| log α_d + Σ_{w_i ∈ q} c(w_i,q) log p(w_i|C)
The first term combines TF weighting (in the numerator) with IDF weighting (the p(w_i|C) in the denominator); the |q| log α_d term provides doc-length normalization (a longer doc is expected to have a smaller α_d); the last term is doc-independent and can be ignored for ranking. Smoothing with p(w|C) thus yields TF-IDF weighting plus doc-length normalization.
CS@UVa CS 4501: Information Retrieval 34
35 What you should know How to estimate a language model General idea and different ways of smoothing Effect of smoothing CS@UVa CS 4501: Information Retrieval 35
36 Today's reading: Introduction to Information Retrieval, Chapter 12: Language models for information retrieval
CS@UVa CS 4501: Information Retrieval 36
More informationLanguage Modeling. Introduction to N-grams. Many Slides are adapted from slides by Dan Jurafsky
Language Modeling Introduction to N-grams Many Slides are adapted from slides by Dan Jurafsky Probabilistic Language Models Today s goal: assign a probability to a sentence Why? Machine Translation: P(high
More informationForward algorithm vs. particle filtering
Particle Filtering ØSometimes X is too big to use exact inference X may be too big to even store B(X) E.g. X is continuous X 2 may be too big to do updates ØSolution: approximate inference Track samples
More informationN-gram N-gram Language Model for Large-Vocabulary Continuous Speech Recognition
2010 11 5 N-gram N-gram Language Model for Large-Vocabulary Continuous Speech Recognition 1 48-106413 Abstract Large-Vocabulary Continuous Speech Recognition(LVCSR) system has rapidly been growing today.
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationLanguage Processing with Perl and Prolog
Language Processing with Perl and Prolog Chapter 5: Counting Words Pierre Nugues Lund University Pierre.Nugues@cs.lth.se http://cs.lth.se/pierre_nugues/ Pierre Nugues Language Processing with Perl and
More informationLearning Bayesian network : Given structure and completely observed data
Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution
More informationCS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Vector Data: Clustering: Part II Instructor: Yizhou Sun yzsun@cs.ucla.edu May 3, 2017 Methods to Learn: Last Lecture Classification Clustering Vector Data Text Data Recommender
More informationCSA4050: Advanced Topics Natural Language Processing. Lecture Statistics III. Statistical Approaches to NLP
University of Malta BSc IT (Hons)Year IV CSA4050: Advanced Topics Natural Language Processing Lecture Statistics III Statistical Approaches to NLP Witten-Bell Discounting Unigrams Bigrams Dept Computer
More informationTopic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up
Much of this material is adapted from Blei 2003. Many of the images were taken from the Internet February 20, 2014 Suppose we have a large number of books. Each is about several unknown topics. How can
More informationComputational Cognitive Science
Computational Cognitive Science Lecture 8: Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk Based on slides by Sharon Goldwater October 14, 2016 Frank Keller Computational
More informationCSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado
CSCI 5822 Probabilistic Model of Human and Machine Learning Mike Mozer University of Colorado Topics Language modeling Hierarchical processes Pitman-Yor processes Based on work of Teh (2006), A hierarchical
More informationClassification & Information Theory Lecture #8
Classification & Information Theory Lecture #8 Introduction to Natural Language Processing CMPSCI 585, Fall 2007 University of Massachusetts Amherst Andrew McCallum Today s Main Points Automatically categorizing
More informationLoss Functions, Decision Theory, and Linear Models
Loss Functions, Decision Theory, and Linear Models CMSC 678 UMBC January 31 st, 2018 Some slides adapted from Hamed Pirsiavash Logistics Recap Piazza (ask & answer questions): https://piazza.com/umbc/spring2018/cmsc678
More informationLecture 3. STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2018 Prof. Allie Fletcher
Lecture 3 STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2018 Prof. Allie Fletcher Previous lectures What is machine learning? Objectives of machine learning Supervised and
More informationLanguage Models. Philipp Koehn. 11 September 2018
Language Models Philipp Koehn 11 September 2018 Language models 1 Language models answer the question: How likely is a string of English words good English? Help with reordering p LM (the house is small)
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationLanguage Modeling. Michael Collins, Columbia University
Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationLecture 8: Graphical models for Text
Lecture 8: Graphical models for Text 4F13: Machine Learning Joaquin Quiñonero-Candela and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/
More informationDocument and Topic Models: plsa and LDA
Document and Topic Models: plsa and LDA Andrew Levandoski and Jonathan Lobo CS 3750 Advanced Topics in Machine Learning 2 October 2018 Outline Topic Models plsa LSA Model Fitting via EM phits: link analysis
More informationRetrieval models II. IN4325 Information Retrieval
Retrieval models II IN4325 Information Retrieval 1 Assignment 1 Deadline Wednesday means as long as it is Wednesday you are still on time In practice, any time before I make it to the office on Thursday
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,
More information