Topics in Natural Language Processing


Topics in Natural Language Processing Shay Cohen Institute for Language, Cognition and Computation University of Edinburgh Lecture 9

Administrivia Next class will be a summary. Please email me questions about the material by Thursday 5pm. Minor changes to the schedule: those who are affected know it. The final schedule is on the website; please take a quick look.

Last class Unsupervised learning: latent-variable learning.

Semi-supervised Learning Main idea: use a relatively small amount of annotated data, and also exploit large amounts of unannotated data. The term itself is used in various ways, with various methodologies.

Example: Word Clusters and Embeddings Learn clusters of words, or embed them in Euclidean space, using large amounts of text. Use these clusters/embeddings as features in a discriminative model.
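As a rough illustration (a sketch, not part of the lecture), here is how pre-trained embeddings might be turned into features for a discriminative classifier; the embedding table and the averaging scheme are made-up placeholders.

```python
import numpy as np

# Hypothetical pre-trained embeddings, learned from large amounts of raw text.
embeddings = {
    "dog":  np.array([0.8, 0.1, 0.3]),
    "cat":  np.array([0.7, 0.2, 0.4]),
    "bank": np.array([0.1, 0.9, 0.5]),
}

def featurize(sentence):
    """Represent a sentence by the average embedding of its known words."""
    vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

# The resulting dense vectors become the input features of any supervised,
# discriminative model trained on the (small) annotated data set.
X = np.stack([featurize(s) for s in ["dog cat", "bank"]])
print(X.shape)  # (2, 3)
```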

Semi-supervised Learning: Example 2 Combine the log-likelihood for labelled data with the log-likelihood for unlabelled data: $L(x_1, y_1, \ldots, x_n, y_n, x'_1, \ldots, x'_m \mid \theta) = \sum_{i=1}^{n} \log p(x_i, y_i \mid \theta) + \sum_{j=1}^{m} \log p(x'_j \mid \theta)$
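A minimal numerical sketch of this objective (not from the lecture), assuming a toy joint model $p(x, y \mid \theta)$ over binary $x$ and $y$ with made-up parameters; it only evaluates the two terms, it does not maximise them.

```python
import math

# Hypothetical parameters of a tiny joint model p(x, y | theta):
# pi = p(y=1), px_given_y[y] = p(x=1 | y).  Purely illustrative values.
pi = 0.4
px_given_y = {0: 0.2, 1: 0.7}

def log_p_xy(x, y):
    """log p(x, y | theta) for binary x and y."""
    py = pi if y == 1 else 1.0 - pi
    px = px_given_y[y] if x == 1 else 1.0 - px_given_y[y]
    return math.log(py) + math.log(px)

def log_p_x(x):
    """log p(x | theta) = log sum_y p(x, y | theta): the unlabelled case."""
    return math.log(sum(math.exp(log_p_xy(x, y)) for y in (0, 1)))

labelled   = [(1, 1), (0, 0), (1, 1)]   # (x_i, y_i) pairs
unlabelled = [1, 0, 1, 1]               # x'_j only

# Combined objective: labelled term plus (marginalised) unlabelled term.
L = sum(log_p_xy(x, y) for x, y in labelled) + sum(log_p_x(x) for x in unlabelled)
print(L)
```

Maximising the unlabelled term requires summing over the unobserved y, which is typically done with a procedure such as EM.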

Semi-supervised Learning: Example 3 Self-training Step 1: Step 2: Step 3: Potentially, repeat step 2
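The steps are left blank on the slide; the sketch below follows the standard self-training recipe under the assumption that only confident predictions are added back to the training set. The fit and predict_proba callables are hypothetical placeholders for whatever supervised learner is being used.

```python
def self_train(train, unlabelled, fit, predict_proba, threshold=0.9, rounds=3):
    """Generic self-training loop (standard recipe; details are assumptions).

    Step 1: fit a model on the labelled data.
    Step 2: label the unlabelled data and keep confident predictions.
    Step 3: add them to the training set and refit; repeat.
    """
    train = list(train)
    pool = list(unlabelled)
    model = fit(train)
    for _ in range(rounds):
        confident, remaining = [], []
        for x in pool:
            label, prob = predict_proba(model, x)
            (confident if prob >= threshold else remaining).append((x, label))
        if not confident:
            break
        train.extend(confident)
        pool = [x for x, _ in remaining]
        model = fit(train)          # refit on labelled + self-labelled data
    return model
```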

Today's class Evaluation and experimental design: testing for error, hypothesis testing, sample complexity.

Next step after prediction We learned/estimated a model. We decoded and made predictions (inference). What's next? Evaluation.

NLP as an empirical science NLP is an empirical science. As such, conclusions should be carefully drawn from the conducted experiments. The most common type of empirical issue NLP research addresses: comparison of methods.

Improving state of the art We have two competing methods. How do we decide which method is better? We have a set of predictions from each method and an evaluation metric for error, $\mathrm{err}(y^{\text{method 1}})$ and $\mathrm{err}(y^{\text{method 2}})$. We compute the average errors $\mathrm{err}_1$ and $\mathrm{err}_2$ over the test set and ask: is $\mathrm{err}_1 < \mathrm{err}_2$?

Validity of error comparison The empirical error is a proxy for the true error. Empirical error: $\mathrm{err}_1 = \frac{1}{n}\sum_{i=1}^{n}\mathrm{err}(y_i^{\text{method 1}})$. True error: the expected error $\mathbb{E}\left[\mathrm{err}(y^{\text{method 1}})\right]$ under the true data distribution. The empirical error is a random variable that depends on the data we have.

Error comparison Say your system's empirical error is 25.6% and my system's error is 35.5%. Is your system better? What does it mean for the system to be better? Say your system's empirical error is 25.6% and my system's error is 28.5%. Is your system better? Say your system's empirical error is 25.6% and my system's error is 25.7%. Is your system better?

State of the art in NLP Sometimes the improvements reported are very small. Over time, they may add up to a big effect. Test case: machine translation. But how do we test the validity of a single improvement?

Hypothesis testing Null hypothesis: our systems are the same. To reject this hypothesis, we need sufficient evidence. How do we look for that evidence?

Hypothesis testing Set a significance level α, usually α = 0.05 or α = 0.01. Calculate a test statistic t on the test data, which is an estimate of a parameter (such as the difference between the errors of the two systems). Calculate a p-value: the probability of t taking a value at least as extreme as the observed one if the null hypothesis is true. Check whether t is surprising under the null hypothesis, i.e. whether p < α. If so, reject the null hypothesis.

Example: Sign test We want to test whether the two systems are the same, or whether one is better than the other. We list all the errors: $\mathrm{err}(y_i^{\text{method 1}})$ and $\mathrm{err}(y_i^{\text{method 2}})$. Sign test: under the null hypothesis we assume that the probabilities of getting 1 or 0 are the same (0.5). The null hypothesis is rejected if the resulting p-value falls below α.
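A small sketch of the sign test as described, assuming ties are discarded and a two-sided binomial tail is used as the p-value; the per-sentence 0/1 errors below are made up.

```python
import math

def sign_test_p_value(errors_1, errors_2):
    """Two-sided sign test sketch (assumed helper, not from the lecture).

    Counts the examples where one system's error is strictly lower than
    the other's; ties are discarded.  Under the null hypothesis each
    "win" is a fair coin flip (probability 0.5), so the number of wins
    for system 1 follows a Binomial(m, 0.5) distribution.
    """
    wins_1 = sum(e1 < e2 for e1, e2 in zip(errors_1, errors_2))
    wins_2 = sum(e2 < e1 for e1, e2 in zip(errors_1, errors_2))
    m = wins_1 + wins_2                      # non-tied comparisons
    k = min(wins_1, wins_2)
    # Two-sided p-value: probability of a split at least this lopsided.
    tail = sum(math.comb(m, i) for i in range(0, k + 1)) * 0.5 ** m
    return min(1.0, 2 * tail)

# Example: per-sentence 0/1 errors for the two systems (hypothetical data).
err_1 = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
err_2 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
p = sign_test_p_value(err_1, err_2)
print(f"p-value = {p:.3f}")   # reject the null hypothesis if p < alpha
```

With these made-up errors, system 1 wins on all 6 non-tied sentences, giving p ≈ 0.03 < 0.05, so the null hypothesis would be rejected at α = 0.05.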

A theoretical way to quantify error We learn from data. The more data we have, the better (usually). But how much better do we get with more data? This is answered through the notion of sample complexity.

Sample complexity A measure of the number of samples required in order to reach a desired error level. What will the sample complexity depend on?

Sample complexity bounds We usually cannot get the actual number for a desired error level; instead we get a bound. These bounds are often loose and not very informative, but they give an idea about asymptotic behaviour.

Back to pre-historic languages We are observing $x_1, \ldots, x_n$ where $x_i \in \{\text{argh}, \text{blah}\}$. Estimation of θ:

Back to pre-historic languages $\hat{\theta} = \frac{\sum_{i=1}^{n} z_i}{n}$, where $z_i = 1$ iff $x_i = \text{argh}$: an average of binary random variables. Hoeffding's inequality: $p\left(\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \theta_0\right| \ge \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{2}\right)$ How to interpret this inequality? We want the right-hand side to be smaller than some δ:
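Filling in the algebra between this slide and the next (a standard step, not spelled out on the slide): set the right-hand side to be at most δ and solve for n.

$2\exp\left(-\frac{n\epsilon^2}{2}\right) \le \delta \;\Longleftrightarrow\; \frac{n\epsilon^2}{2} \ge \log\frac{2}{\delta} \;\Longleftrightarrow\; n \ge \frac{2\log\frac{2}{\delta}}{\epsilon^2}$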

Probably approximately correct With probability $1-\delta$ (we want δ to be small), if $n \ge \frac{2\log\frac{2}{\delta}}{\epsilon^2}$ then $|\hat{\theta} - \theta_0| < \epsilon$, where $\theta_0$ is the true parameter and $\hat{\theta}$ is the maximum likelihood estimate. Probably (with probability $1-\delta$), approximately (to within ε) correct.
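To make the bound concrete, a small sketch (with assumed example values for ε, δ, and θ₀, none of which come from the lecture) that computes the required n and checks the guarantee by simulation.

```python
import math
import random

def required_samples(epsilon, delta):
    """Sample size from the bound on the slide: n >= 2 log(2/delta) / epsilon^2."""
    return math.ceil(2 * math.log(2 / delta) / epsilon ** 2)

# Example values (assumptions, not from the lecture).
eps, delta, theta_0 = 0.05, 0.05, 0.3
n = required_samples(eps, delta)
print(n)  # 2952

# Empirical check: how often is the MLE more than eps away from theta_0?
random.seed(0)
trials = 500
failures = sum(
    abs(sum(random.random() < theta_0 for _ in range(n)) / n - theta_0) >= eps
    for _ in range(trials)
)
print(failures / trials)  # should be (well) below delta
```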

Final note about evaluation Not all metrics are the same. When designing a metric, check whether it correlates with human judgements (the case of BLEU in MT), and whether it correlates with the error of a downstream application (the case of LM perplexity and word error rate in speech).
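As an illustration of the first check, a tiny sketch computing the correlation between made-up metric scores and made-up human ratings for five hypothetical systems; a real meta-evaluation would use many more systems or outputs.

```python
import numpy as np

# Hypothetical scores for five MT systems: an automatic metric vs. human ratings.
metric_scores = np.array([0.21, 0.25, 0.28, 0.30, 0.35])
human_ratings = np.array([2.9, 3.1, 3.4, 3.3, 3.9])

# Pearson correlation between the metric and the human judgements.
r = np.corrcoef(metric_scores, human_ratings)[0, 1]
print(f"correlation with human judgements: {r:.2f}")
```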

Next class: Summary. Please send me questions by Thursday (12/2), 5pm.