Classification & Information Theory Lecture #8 Introduction to Natural Language Processing CMPSCI 585, Fall 2007 University of Massachusetts Amherst Andrew McCallum

Today's Main Points: Automatically categorizing text; parameter estimation and smoothing (a general recipe for a statistical CompLing model); building a spam filter; information theory: What is information? How can you measure it? Entropy, cross entropy, information gain.

Maximum Likelihood Parameter Estimation Example: Binomial Toss a coin 100 times, observe r heads Assume a binomial distribution Order doesn't matter, successive flips are independent One parameter is q (probability of flipping a head) Binomial gives p(r | n, q). We know r and n. Find argmax_q p(r | n, q)

Maximum Likelihood Parameter Estimation Example: Binomial (continued; notes for board)
likelihood = p(R = r | n, q) = C(n, r) q^r (1-q)^(n-r)
log-likelihood = L = log p(r | n, q) ∝ log(q^r (1-q)^(n-r)) = r log(q) + (n-r) log(1-q)
∂L/∂q = r/q - (n-r)/(1-q) = 0  ⟹  r(1-q) = (n-r)q  ⟹  q = r/n
Our familiar ratio-of-counts is the maximum likelihood estimate!
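
A minimal numeric check of this derivation (a sketch; the flip counts n and r below are made up for illustration):

import numpy as np

# Hypothetical data: n coin flips, r heads.
n, r = 100, 37

# Log-likelihood L(q) = r*log(q) + (n - r)*log(1 - q), dropping the constant log C(n, r).
def log_likelihood(q):
    return r * np.log(q) + (n - r) * np.log(1 - q)

# Maximize over a grid of candidate q values.
qs = np.linspace(0.001, 0.999, 999)
print(qs[np.argmax(log_likelihood(qs))])  # ~0.37, the grid point nearest the maximizer
print(r / n)                              # 0.37, the closed-form MLE derived above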

Binomial Parameter Estimation Examples Make 1000 coin flips, observe 300 heads: P(Heads) = 300/1000. Make 3 coin flips, observe 2 heads: P(Heads) = 2/3?? Make 1 coin flip, observe 1 tail: P(Heads) = 0??? Make 0 coin flips: P(Heads) = ??? We have some prior belief about P(Heads) before we see any data. After seeing some data, we have a posterior belief.

Maximum A Posteriori Parameter Estimation We've been finding the parameters that maximize p(data | parameters), not the parameters that maximize p(parameters | data) (parameters are random variables!)
p(q | n, r) = p(r | n, q) p(q | n) / p(r | n) = p(r | n, q) p(q) / p(r | n), where p(r | n) is constant with respect to q.
And let p(q) = 6q(1-q)

Maximum A Posteriori Parameter Estimation Example: Binomial
posterior ∝ p(r | n, q) p(q) = C(n, r) q^r (1-q)^(n-r) · 6q(1-q)
log-posterior = L ∝ log(q^(r+1) (1-q)^(n-r+1)) = (r+1) log(q) + (n-r+1) log(1-q)
∂L/∂q = (r+1)/q - (n-r+1)/(1-q) = 0  ⟹  (r+1)(1-q) = (n-r+1)q  ⟹  q = (r+1)/(n+2)
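
A sketch of the same check for the MAP estimate under the Beta(2,2) prior p(q) = 6q(1-q); the data (3 flips, 3 heads) is invented to show how the prior pulls the estimate away from 1.0:

import numpy as np

# Hypothetical data where the MLE would be q = 1.0.
n, r = 3, 3

# Log-posterior up to constants: (r+1)*log(q) + (n-r+1)*log(1-q).
qs = np.linspace(0.001, 0.999, 999)
log_post = (r + 1) * np.log(qs) + (n - r + 1) * np.log(1 - qs)

print(qs[np.argmax(log_post)])  # ~0.8
print((r + 1) / (n + 2))        # 0.8, the closed-form MAP estimate derived above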

Bayesian Decision Theory We can use such techniques for choosing among models: Which among several models best explains the data? Likelihood Ratio:
P(model1 | data) / P(model2 | data) = [P(data | model1) P(model1)] / [P(data | model2) P(model2)]

...back to our example: French vs English p(french | glacier, melange) versus p(english | glacier, melange)? We have real data for Jane Austen and William Shakespeare: p(austen | stars, thou) vs. p(shakespeare | stars, thou)

Statistical Spam Filtering Testing document: "Are you free to meet with Dan Jurafsky today at 3pm? He wants to talk about computational methods for noun coreference." Categories: Real Email, Spam Email. Training data: Real Email: "Speaking at awards ceremony...", "Coming home for dinner...", "Free for a research meeting at 6pm...", "Computational Linguistics office hours...". Spam Email: "Nigerian minister awards...", "Earn money at home today!...", "FREE CASH Just hours per day..."

Document Classification by Machine Learning Testing document: "Temporal reasoning for planning has long been studied formally. We discuss the semantics of several planning..." Categories: Machine Learning, Planning, Prog. Lang. Semantics, Garbage Collection, Multimedia, GUI. Training data: "Neural networks and other machine learning methods of classification...", "Planning with temporal reasoning has been based on...", "...the semantics of program dependence...", "Garbage collection for strongly-typed languages...", "Multimedia streaming video for...", "User studies of GUI..."

Work out Naïve Bayes formulation interactively on the board

Recipe for Solving an NLP Task Statistically 1) Data: Notation, representation 2) Problem: Write down the problem in notation 3) Model: Make some assumptions, define a parametric model 4) Inference: How to search through possible answers to find the best one 5) Learning: How to estimate parameters 6) Implementation: Engineering considerations for an efficient implementation

(Engineering) Components of a Naïve Bayes Document Classifier Split documents into training and testing Cycle through all documents in each class Tokenize the character stream into words Count occurrences of each word in each class Estimate P(w|c) by a ratio of counts (+1 prior) For each test document, calculate P(c|d) for each class Record predicted (and true) class, and keep accuracy statistics

A Probabilistic Approach to Classification: Naïve Bayes Pick the most probable class, given the evidence:
c* = argmax_{c_j} Pr(c_j | d)
where c_j is a class (like "Planning") and d is a document (like "language intelligence proof...").
Bayes Rule: Pr(c_j | d) = Pr(c_j) Pr(d | c_j) / Pr(d)
Naïve Bayes: Pr(c_j | d) ≈ [ Pr(c_j) Π_{i=1}^{|d|} Pr(w_{d_i} | c_j) ] / [ Σ_{c_k} Pr(c_k) Π_{i=1}^{|d|} Pr(w_{d_i} | c_k) ]
where w_{d_i} is the i-th word in d (like "proof").

Parameter Estimation in Naïve Bayes
Estimate of P(c): P(c_j) = (1 + Count(d ∈ c_j)) / (|C| + Σ_k Count(d ∈ c_k))
Estimate of P(w|c): P(w_i | c_j) = (1 + Σ_{d_k ∈ c_j} Count(w_i, d_k)) / (|V| + Σ_{t=1}^{|V|} Σ_{d_k ∈ c_j} Count(w_t, d_k))
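
The estimates above drop straight into code. Below is a minimal sketch of a Naïve Bayes spam filter using the same add-one smoothing; the four training emails are invented toy data, not the course's dataset:

import math
from collections import Counter, defaultdict

# Toy training set: (class, document) pairs, loosely echoing the spam-filter slide.
train = [
    ("real", "free for a research meeting at 6pm"),
    ("real", "computational linguistics office hours"),
    ("spam", "earn money at home today"),
    ("spam", "free cash just hours per day"),
]

doc_counts = Counter()              # Count(d in c_j)
word_counts = defaultdict(Counter)  # Count(w_i, d_k) summed over d_k in c_j
vocab = set()
for c, doc in train:
    doc_counts[c] += 1
    for w in doc.split():
        word_counts[c][w] += 1
        vocab.add(w)

classes = list(doc_counts)
V, N = len(vocab), sum(doc_counts.values())

def log_posterior(c, words):
    # log P(c_j) = log (1 + Count(d in c_j)) / (|C| + sum_k Count(d in c_k))
    lp = math.log((1 + doc_counts[c]) / (len(classes) + N))
    total = sum(word_counts[c].values())
    for w in words:
        # log P(w_i | c_j) with add-one smoothing over the vocabulary
        lp += math.log((1 + word_counts[c][w]) / (V + total))
    return lp

def classify(doc):
    words = [w for w in doc.split() if w in vocab]  # ignore unseen words in this sketch
    return max(classes, key=lambda c: log_posterior(c, words))

print(classify("free cash today"))          # spam
print(classify("research meeting at 3pm"))  # real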

Information Theory

What is Information? The sun will come up tomorrow. Condi Rice was shot and killed this morning.

Efficient Encoding I have an 8-sided die. How many bits do I need to tell you what face I just rolled? My 8-sided die is unfair: P(1) = 1/2, P(2) = 1/8, P(3) = ... = P(8) = 1/16. [Prefix-code tree over faces 1-8: likelier faces get shorter codewords; average code length = 2.375 bits]
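
A quick check of that 2.375 figure (a sketch; the codeword lengths below are one valid assignment matched to -log2 P(face), not necessarily the tree drawn on the slide):

import math

p = [1/2, 1/8] + [1/16] * 6   # the unfair die above
lengths = [1, 3] + [4] * 6    # codeword lengths: -log2 of each probability

avg_len = sum(pi * li for pi, li in zip(p, lengths))
entropy = -sum(pi * math.log2(pi) for pi in p)
print(avg_len, entropy)  # 2.375 2.375 -- dyadic probabilities, so the code meets the entropy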

Entropy (of a Random Variable) Average length of message needed to transmit the outcome of the random variable. First used in: data compression, transmission rates over a noisy channel.

Coding Interpretation of Entropy Given some distribution over events P(X), what is the average number of bits needed to encode a message (an event, string, sequence)? = Entropy of P(X):
H(p(X)) = - Σ_{x∈X} p(x) log2 p(x)
Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)
What is the entropy of a fair coin? A fair 32-sided die? What is the entropy of an unfair coin that always comes up heads? What is the entropy of an unfair 6-sided die that only ever comes up 1 or 2? Upper and lower bound? (Prove lower bound?)
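
A minimal sketch answering the questions on this slide:

import math

def H(p):
    # Entropy in bits; terms with p(x) = 0 contribute nothing (0 * log 0 -> 0).
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(H([0.5, 0.5]))              # fair coin: 1.0 bit
print(H([1/32] * 32))             # fair 32-sided die: 5.0 bits
print(H([1.0, 0.0]))              # coin that always comes up heads: 0.0 bits (lower bound)
print(H([0.5, 0.5, 0, 0, 0, 0]))  # 6-sided die that only shows 1 or 2: 1.0 bit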

Entropy and Expectation Recall E[X] = Σ_{x∈X(Ω)} x p(x). Then E[-log2 p(X)] = Σ_{x∈X(Ω)} -log2(p(x)) p(x) = H(X)

Entropy of a coin

Entropy, intuitively High entropy ~ chaos, fuzziness, opposite of order Comes from physics: Entropy does not go down unless energy is used Measure of uncertainty High entropy: a lot of uncertainty about the outcome, uniform distribution over outcomes Low entropy: high certainty about the outcome

Claude Shannon (1916-2001), creator of information theory. Laid the foundation for implementing logic in digital circuits as part of his master's thesis! (1939) "A Mathematical Theory of Communication" (1948)

Joint Entropy and Conditional Entropy Two random variables: X (space Ω), Y (Ψ). Joint entropy no big deal: (X,Y) considered a single event:
H(X,Y) = - Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(x,y)
Conditional entropy:
H(X|Y) = - Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(x|y)
recall that H(X) = E[-log2 p(X)] (weighted average, and weights are not conditional). How much extra information you need to supply to transmit X given that the other person knows Y.

Conditional Entropy (another way)
H(Y|X) = Σ_x p(x) H(Y | X = x)
       = Σ_x p(x) (- Σ_y p(y|x) log2 p(y|x))
       = - Σ_x Σ_y p(x) p(y|x) log2 p(y|x)
       = - Σ_x Σ_y p(x,y) log2 p(y|x)
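
A numeric sketch of the two equivalent forms above, on a small made-up joint distribution:

import math

# Invented joint distribution p(x, y) over x, y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}

# Last line of the derivation: H(Y|X) = - sum_{x,y} p(x,y) log2 p(y|x)
h1 = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

# First line: H(Y|X) = sum_x p(x) H(Y | X = x)
h2 = 0.0
for x in (0, 1):
    cond = [p / p_x[x] for (xx, _), p in p_xy.items() if xx == x]
    h2 += p_x[x] * -sum(c * math.log2(c) for c in cond)

print(h1, h2)  # both ~0.722 bits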

Chain Rule for Entropy Since, like random variables, entropy is based on an expectation: H(X,Y) = H(Y|X) + H(X) and H(X,Y) = H(X|Y) + H(Y)

Cross Entropy What happens when you use a code that is sub-optimal for your event distribution? I created my code to be efficient for a fair 8-sided die. But the die is unfair and always gives 1 or 2, uniformly. How many bits on average for the optimal code? How many bits on average for the sub-optimal code?
H(p, q) = - Σ_{x∈X} p(x) log2 q(x)

KL Divergence What is the average number of bits that are wasted by encoding events from distribution p using distribution q?
D(p || q) = H(p, q) - H(p)
          = - Σ_{x∈X} p(x) log2 q(x) + Σ_{x∈X} p(x) log2 p(x)
          = Σ_{x∈X} p(x) log2 ( p(x) / q(x) )
A sort of distance between distributions p and q, but: it is not symmetric! It does not satisfy the triangle inequality!
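
A sketch of the die example above: the true distribution p only ever produces faces 1 or 2, but the code q was built for a fair 8-sided die (3 bits per face):

import math

p = [0.5, 0.5, 0, 0, 0, 0, 0, 0]  # unfair die: always 1 or 2, uniformly
q = [1/8] * 8                     # code built for a fair 8-sided die

H_p  = -sum(pi * math.log2(pi) for pi in p if pi > 0)              # optimal: 1 bit
H_pq = -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)  # sub-optimal: 3 bits
print(H_p, H_pq, H_pq - H_p)  # 1.0 3.0 2.0 -> D(p||q) = 2 bits wasted per event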

Mutual Information Recall: H(X) = average # bits for me to tell you which event occurred from distribution P(X). Now, suppose first I tell you the event y ∈ Y; H(X|Y) = average # bits necessary to tell you which event occurred from distribution P(X). By how many bits does knowledge of Y lower the entropy of X?
I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
       = - Σ_x p(x) log2 p(x) - Σ_y p(y) log2 p(y) + Σ_{x,y} p(x,y) log2 p(x,y)
       = Σ_{x,y} p(x,y) log2 ( p(x,y) / (p(x) p(y)) )
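
A quick numeric check that these expressions agree, reusing the made-up joint distribution from the conditional-entropy sketch:

import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

i1 = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())
i2 = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(i1, i2)  # both ~0.278 bits: knowing Y saves about 0.28 bits when transmitting X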

Mutual Information Symmetric, non-negative. A measure of (in)dependence: I(X;Y) = 0 when X and Y are independent; I(X;Y) grows both with the degree of dependence and with the entropy of the variables. Sometimes also called information gain. Used often in NLP: clustering words, word sense disambiguation, feature selection.

Pointwise Mutual Information Previously we measured mutual information between two random variables. We could also measure mutual information between two events:
I(x, y) = log ( p(x,y) / (p(x) p(y)) )
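
A tiny worked example (the co-occurrence counts are invented; with base-2 logs the result is in bits):

import math

N = 1_000_000                        # total observations in a hypothetical corpus
c_x, c_y, c_xy = 2_000, 1_500, 150   # counts of word x, word y, and the pair (x, y)

p_x, p_y, p_xy = c_x / N, c_y / N, c_xy / N
print(math.log2(p_xy / (p_x * p_y)))  # ~5.64: the pair co-occurs ~50x more than chance predicts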