NATURAL LANGUAGE PROCESSING Dr. G. Bharadwaja Kumar

Sentence Boundary Markers Many natural language processing (NLP) systems take a sentence as their input unit: part-of-speech (POS) tagging, chunking, parsing, machine translation (MT), information retrieval (IR), and so forth. When these downstream processes rely on accurate sentence segmentation, errors of the sentence boundary detection (SBD) system propagate into them and the overall system performance is negatively affected.

Sentence Boundary Markers In some contexts the period is not a sentence boundary marker. The full stop "." has many different uses (decimal point, ellipsis, abbreviation, email and Internet addresses, ...): 27.5, etc., google.com, Mr. Examples: "The President lives in Washington D.C. He likes that place." "Is K.H. Smith here?" "I bought the apples, pears, lemons, etc. Did you eat them?"

Some complex Examples for SBD

Punctuation can be present in an embedded quotation: Now the burning question is "How can we identify the right hyper-plane?". The marks ! and ? are less ambiguous than the period, but they may appear in proper names (Yahoo!) or may be repeated to stress their meaning (go out!!!).

Semicolon as boundary marker It is a reality that pluralism emerges from the very nature of our country; it is a choice made inevitable by India's geography, re-affirmed by its history and reflected in its ethnography. I had a ton of homework for math last night; I stayed up late to finish it all.

Ambiguous examples Often people ask questions such as "Why not have an Operating System run in Tamil or Bengali?", just as there is an Arabic or Japanese Windows.

Ambiguous examples What is a sentence here?
A few expected insights from the graph are:
1. Males in our population have a higher average height.
2. Females in our population have longer scalp hairs.
Consider the following statements:
1. Deputy Speaker and Speaker may resign by writing to each other.
2. Attorney General and Solicitor General may resign by writing to each other.
Which among the above statements is / are correct?

Sentence Segments Language is a word that may be used by extension:
The language of a community or country.
The ability of speech.
Formal language in mathematics, logic and computing.
Sign language for deaf people (people who cannot hear).
A type of school subject.

In the Wall Street Journal (WSJ) corpus, about 42% of the periods denote abbreviations and decimal points, while the corresponding percentage for the Brown corpus is only 11%. https://www.hindawi.com/journals/tswj/2014/196574/

The colon ":", semicolon ";", and comma "," can each be either a separator of grammatical sub-sentences or a delimiter of sentences. About 19% and 14% of colons are used as sentence boundary delimiters in the WSJ and Brown corpora, respectively.

The list of abbreviations can be built from the training data. An abbreviation is a token that contains a "." but is not a sentence boundary in the training corpus. Examples: Lib. for Library / abbr. for abbreviation / approx. for approximate. http://abbreviations.yourdictionary.com/
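A minimal sketch of this procedure in Python, assuming the training data is available as a stream of (token, is_sentence_boundary) pairs; the data format and names are illustrative only.

```python
# Build an abbreviation list from labelled training data: an abbreviation is
# a token that contains "." but is not marked as a sentence boundary.

def build_abbreviation_list(training_tokens):
    abbreviations = set()
    for token, is_boundary in training_tokens:
        if "." in token and not is_boundary:
            abbreviations.add(token.lower())
    return abbreviations

# Toy labelled stream for illustration:
toy_stream = [("Mr.", False), ("Smith", False), ("arrived.", True),
              ("approx.", False), ("5", False), ("km.", True)]
print(build_abbreviation_list(toy_stream))  # {'mr.', 'approx.'}
```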

Feature Extraction for SBD

Entropy & Information Entropy is simply the average (expected) amount of information from an event. Note that, to estimate entropy, the statistical properties of the source must be known, i.e., one must know what values s can take and how likely (p(s)) they are.

Entropy in Information Theory

Entropy in Information Theory As the number of possible outcomes for a random variable increases, entropy increases. Entropy of flipping a fair coin: H = -\left( \frac{1}{2}\log_2 \frac{1}{2} + \frac{1}{2}\log_2 \frac{1}{2} \right) = 1 bit. Entropy of rolling a fair die: H = -6 \cdot \frac{1}{6}\log_2 \frac{1}{6} = \log_2 6 \approx 2.585 bits.
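A small Python check of these values, using the definition H = -\sum_i p_i \log_2 p_i:

```python
import math

# Shannon entropy in bits; reproduces the coin and die values above.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin -> 1.0 bit
print(entropy([1/6] * 6))    # fair die  -> ~2.585 bits
```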

Principle of Maximum Entropy When estimating a probability distribution, you should select the distribution that leaves you the largest remaining uncertainty (i.e., the maximum entropy) consistent with your constraints. Maximum entropy (H = \log N) is reached when all symbols are equiprobable, i.e., p_i = 1/N.
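A quick illustration with a hypothetical four-symbol source: the uniform distribution reaches \log_2 N bits, and any skewed distribution falls below it.

```python
import math

# Entropy peaks at log2(N) when all N symbols are equiprobable.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25] * 4))            # 2.0 bits = log2(4), the maximum
print(entropy([0.7, 0.1, 0.1, 0.1]))  # ~1.36 bits, below the maximum
```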

The Maximum Entropy Model considers only specific evidence of sentence boundaries in the text. This evidence represents prior linguistic knowledge about contextual features of the text that may indicate a sentence boundary; the features are determined by the experimenter.
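As an illustration only, a sketch of such contextual features around a candidate boundary token; the feature names and tokenization are hypothetical, not the exact feature set used in published maximum entropy SBD systems.

```python
# Hypothetical binary contextual features for the candidate boundary token
# at position i in a token list.

def extract_features(tokens, i, abbreviations):
    candidate = tokens[i]
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return {
        "candidate_is_abbreviation": candidate.lower() in abbreviations,
        "next_is_capitalized": next_tok[:1].isupper(),
        "candidate_contains_digit": any(ch.isdigit() for ch in candidate),
        "next_is_quote": next_tok in {'"', "'"},
    }

tokens = ["I", "met", "Mr.", "Smith", "yesterday", "."]
print(extract_features(tokens, 2, {"mr.", "approx."}))
# {'candidate_is_abbreviation': True, 'next_is_capitalized': True, ...}
```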

Assumptions of the Model If we treat a paragraph/corpus as a token stream, we can consider the SBD problem to be a random process which delimits this paragraph/corpus into sentences. Such a process produces an output value y, whose domain Y is the set of all possible boundary positions in the stream. The value of y may be affected by some contextual information x, whose domain X is the set of all possible textual combinations in the stream. To solve the SBD problem, we can build a stochastic model that simulates this random process. Such a model is a conditional probability model: given a context stream x, predict the boundaries y, i.e., p(y | x).

At the first step, we collect a large number of samples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) extracted from the training set. We can define a joint empirical distribution over x and y from these samples: \tilde{p}(x, y) = \frac{1}{N} \cdot \text{(number of times (x, y) occurs in the sample)}
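A minimal sketch of this empirical distribution, with toy contexts standing in for real textual combinations:

```python
from collections import Counter

# Joint empirical distribution p~(x, y): each sample pairs a context x with
# a label y in {yes, no}; the contexts below are toy placeholders.

def empirical_distribution(samples):
    counts = Counter(samples)                      # count of each (x, y) pair
    n = len(samples)
    return {pair: c / n for pair, c in counts.items()}

samples = [("Mr._Smith", "no"), ("yesterday_.", "yes"),
           ("Mr._Smith", "no"), ("approx._5", "no")]
print(empirical_distribution(samples))
# {('Mr._Smith', 'no'): 0.5, ('yesterday_.', 'yes'): 0.25, ('approx._5', 'no'): 0.25}
```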

Since the model only considers the distribution of these explicitly identified features, all other features of the text are assigned a uniform distribution. Thus, the model is "maximally" uncertain about features of text for which it does not have prior knowledge.

For each potential sentence boundary token, we estimate a joint probability of the token and its surrounding context, both of which are denoted by x, occurring as an actual sentence boundary: p(y \mid x) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(y, x)}

In the equation on the previous slide, y \in {no, yes}, the \alpha_j are the unknown parameters of the model, each \alpha_j corresponds to a feature f_j of the model, and \pi is the normalization constant.

Let Y = {no, yes} be the set of possible classes, and X the set of possible contexts. Features are binary-valued functions on events: f_j : Y \times X \to \{0, 1\}, which are used to encode contextual information that is of interest.
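A minimal sketch of evaluating this product form for the two classes; the \alpha_j values and feature names are made up for illustration, and parameter estimation (e.g., with Generalized Iterative Scaling) is not shown.

```python
# Product form p(y|x) = pi * prod_j alpha_j ** f_j(y, x), normalized over
# y in {yes, no}. Alphas and feature values are hypothetical.

def maxent_probability(y, feature_values, alphas):
    """feature_values[label][j] = f_j(label, x) in {0, 1}; alphas[j] > 0."""
    def unnormalized(label):
        score = 1.0
        for j, f in feature_values[label].items():
            score *= alphas[j] ** f
        return score
    z = sum(unnormalized(label) for label in feature_values)  # pi = 1 / z
    return unnormalized(y) / z

alphas = {"candidate_is_abbreviation": 0.2, "next_is_capitalized": 3.0}
feature_values = {
    "yes": {"candidate_is_abbreviation": 0, "next_is_capitalized": 1},
    "no":  {"candidate_is_abbreviation": 1, "next_is_capitalized": 0},
}
print(maxent_probability("yes", feature_values, alphas))  # 3.0 / 3.2 = 0.9375
```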

The probability of seeing an actual sentence boundary in the context x is given by p(yes, x). Now, the model can be viewed under the maximum entropy framework, in which we choose a probability distribution p that maximizes the entropy: H(p) = -\sum_{y \in Y} \sum_{x \in X} p(y, x) \log p(y, x)

The expectation of each feature in the training sample must be the same as the expectation of that feature in the model: \sum_{x, y} \tilde{p}(x) \, p(y \mid x) \, f_j(y, x) = \sum_{x, y} \tilde{p}(x, y) \, f_j(x, y) where \tilde{p}(x) in the constraint is the empirical distribution of x in the training sample.
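A toy numerical check of this constraint, with made-up distributions, showing the empirical and model expectations of one feature agreeing:

```python
# Expectation constraint E_p[f_j] = E_p~[f_j]; all numbers below are toy values.

p_tilde_xy = {("Mr._Smith", "no"): 0.5, ("yesterday_.", "yes"): 0.5}  # p~(x, y)
p_tilde_x  = {"Mr._Smith": 0.5, "yesterday_.": 0.5}                   # p~(x)
p_y_given_x = {("no", "Mr._Smith"): 1.0, ("yes", "Mr._Smith"): 0.0,
               ("no", "yesterday_."): 0.0, ("yes", "yesterday_."): 1.0}  # p(y|x)

def f_j(x, y):
    """Toy binary feature: fires when the context ends in '_.' and y == 'yes'."""
    return 1 if x.endswith("_.") and y == "yes" else 0

empirical = sum(p * f_j(x, y) for (x, y), p in p_tilde_xy.items())
model = sum(p_tilde_x[x] * p_y_given_x[(y, x)] * f_j(x, y)
            for x in p_tilde_x for y in ("yes", "no"))
print(empirical, model)  # both 0.5: the constraint is satisfied
```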

The best model p* for SBD should maximize the conditional entropy H(p) of p(y \mid x): p^* = \arg\max_p H(p) subject to the following constraints at the same time: p(y \mid x) \geq 0 for all x, y; \sum_y p(y \mid x) = 1; and \sum_{x, y} \tilde{p}(x) \, p(y \mid x) \, f_j(y, x) = \sum_{x, y} \tilde{p}(x, y) \, f_j(x, y) for j \in \{1, 2, ..., k\}.

As a result, the model in practice tends not to commit to a particular outcome (yes or no) unless it has seen sufficient evidence for that outcome.

Using the model on Test Data While testing, we use a simple decision rule to classify each potential sentence boundary: a potential sentence boundary is an actual sentence boundary if and only if p(yes \mid x) > 0.5, where p(yes \mid x) = \frac{p(yes, x)}{p(x)} = \frac{p(yes, x)}{p(yes, x) + p(no, x)} and x is the context including the potential sentence boundary.
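A minimal sketch of the decision rule, with placeholder joint scores:

```python
# Classify a candidate as an actual boundary iff
# p(yes|x) = p(yes, x) / (p(yes, x) + p(no, x)) > 0.5.

def is_sentence_boundary(p_yes_x, p_no_x):
    p_yes_given_x = p_yes_x / (p_yes_x + p_no_x)
    return p_yes_given_x > 0.5

print(is_sentence_boundary(p_yes_x=0.03, p_no_x=0.01))  # True
print(is_sentence_boundary(p_yes_x=0.01, p_no_x=0.04))  # False
```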

Paper on Sentence Boundary Detection for Social Media Text is uploaded Rudrapal, Dwijen, et al. "Sentence Boundary Detection for Social Media Text." (2015).