PROBABILITY AND INFORMATION THEORY
Dr. Gjergji Kasneci, Introduction to Information Retrieval, WS 2012-13

Outline
- Intro
- Basics of probability and information theory
  - Probability space, rules of probability, useful distributions
  - Zipf's law & Heaps' law
  - Information content, entropy (average content)
  - Lossless compression
  - Tf-idf weighting scheme
- Retrieval models
- Retrieval evaluation
- Link analysis
- From queries to top-k results
- Social search

Probability space

Set-theoretic view of probability theory: a probability space (Ω, E, P) with
- Ω: sample space of elementary events
- E: event space, i.e. subsets of Ω, closed under union, intersection, and complement; usually E = 2^Ω
- P: E → [0, 1], probability measure

Properties of P (set-theoretic view):
1. P(∅) = 0 (impossible event)
2. P(Ω) = 1
3. P(A) + P(¬A) = 1
4. P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
5. P(∪_i A_i) = Σ_i P(A_i) for pairwise disjoint A_i

Sample space and events: examples

Rolling a die
- Sample space: {1, 2, 3, 4, 5, 6}
- Probability of an even number: looking for events A = {2}, B = {4}, C = {6}; P(A ∪ B ∪ C) = 1/6 + 1/6 + 1/6 = 0.5

Tossing two coins
- Sample space: {HH, HT, TH, TT}
- Probability of HH or TT: looking for events A = {TT}, B = {HH}; P(A ∪ B) = 0.5

In general, when all outcomes in Ω are equally likely, for an event e ∈ E it holds that
P(e) = (# outcomes in e) / (# outcomes in sample space)

Calculating with probabilities

Total/marginal probability (sum rule)
P(B) = Σ_j P(B, A_j) = Σ_j P(B | A_j) P(A_j) for any partitioning of Ω into A_1, …, A_n

Joint and conditional probability (product rule)
P(A, B) = P(A ∩ B) = P(B | A) P(A)

Bayes' theorem (Thomas Bayes)
P(B | A) = P(A | B) P(B) / P(A)

Independence
P(A_1, …, A_n) = P(A_1 ∩ … ∩ A_n) = P(A_1) P(A_2) ⋯ P(A_n) for independent events A_1, …, A_n

Conditional independence
A is independent of B given C ⟺ P(A | B, C) = P(A | C)

If A and B are independent, are they also independent given C? (See the counterexample sketched below.)
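
The answer to the last question is: not necessarily. A minimal Python sketch (a toy example of my own, not from the slides) enumerates a small sample space where A and B are independent but become dependent once C = A XOR B is known:

```python
from itertools import product

# Toy example: two fair coin flips A and B, and C = A XOR B.
# A and B are independent, but given C they determine each other,
# so independence does NOT imply conditional independence.
P = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}   # uniform sample space

def prob(pred):
    """Probability of the event {omega : pred(omega)}."""
    return sum(p for omega, p in P.items() if pred(omega))

p_a  = prob(lambda o: o[0] == 1)                        # P(A = 1)
p_b  = prob(lambda o: o[1] == 1)                        # P(B = 1)
p_ab = prob(lambda o: o == (1, 1))                      # P(A = 1, B = 1)
print(p_ab == p_a * p_b)                                # True: A, B independent

p_c    = prob(lambda o: (o[0] ^ o[1]) == 0)             # P(C = 0)
p_a_c  = prob(lambda o: o[0] == 1 and (o[0] ^ o[1]) == 0) / p_c   # P(A | C)
p_b_c  = prob(lambda o: o[1] == 1 and (o[0] ^ o[1]) == 0) / p_c   # P(B | C)
p_ab_c = prob(lambda o: o == (1, 1) and (o[0] ^ o[1]) == 0) / p_c # P(A, B | C)
print(p_ab_c == p_a_c * p_b_c)                          # False: dependent given C
```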

Discrete and continuous random variables

Random variable on probability space (Ω, E, P):
X: Ω → M ⊆ ℝ (numerical representation of outcomes) with {e | X(e) ≤ x} ∈ E for all x ∈ M

Examples
- Rolling a die: X(i) = i
- Rolling two dice: X(a, b) = 6(a − 1) + b

If M is countable, X is called discrete, otherwise continuous.

Calculating probabilities: example (1)

(Example from C. Bishop, PRML: N observations of two variables X and Y arranged in a grid; c_i is the number of observations with X = x_i, and n_ij the number with X = x_i and Y = y_j.)

Marginal probability: P(X = x_i) = c_i / N

Sum rule: P(X = x_i) = Σ_j P(X = x_i, Y = y_j) = (1/N) Σ_j n_ij = c_i / N

Joint probability: P(X = x_i, Y = y_j) = n_ij / N

Product rule: P(X = x_i, Y = y_j) = P(Y = y_j | X = x_i) P(X = x_i) = (n_ij / c_i)(c_i / N) = n_ij / N

Calculating probabilities: example (2)

(Example from C. Bishop, PRML: a red box B = r and a blue box B = b contain apples and oranges, and a fruit F is drawn from a randomly chosen box. The slide gives P(B = r) = 2/5; in Bishop's setup the likelihoods are P(F = o | B = r) = 3/4 and P(F = o | B = b) = 1/4.)

Suppose P(B = r) = 2/5, hence P(B = b) = 3/5.

Total probability:
P(F = o) = P(F = o | B = r) P(B = r) + P(F = o | B = b) P(B = b) = 9/20

Bayes' theorem:
P(B = b | F = o) = P(F = o | B = b) P(B = b) / P(F = o) = 1/3
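
A small Python check of the computation above, using exact fractions. The likelihoods P(F = o | B = r) = 3/4 and P(F = o | B = b) = 1/4 are the values assumed from Bishop's setup; they are consistent with P(F = o) = 9/20 as given on the slide:

```python
from fractions import Fraction as F

p_r, p_b = F(2, 5), F(3, 5)          # prior over boxes
p_o_r, p_o_b = F(3, 4), F(1, 4)      # likelihood of drawing an orange (assumed from Bishop)

p_o = p_o_r * p_r + p_o_b * p_b      # total probability: P(F = o)
print(p_o)                           # 9/20, as on the slide

p_b_o = p_o_b * p_b / p_o            # Bayes: P(B = b | F = o)
print(p_b_o)                         # 1/3
```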

Pdfs, cdfs, and quantiles

Probability density function (pdf)
f_X: M → [0, 1] with f_X(x) = P(X = x) (in the discrete case; for continuous X, f_X is the density)

Cumulative distribution function (cdf)
F_X: M → [0, 1] with F_X(x) = P(X ≤ x)

(Figure from C. Bishop, Pattern Recognition and Machine Learning: a density f_X and its cumulative distribution F_X.)

Quantile function
F⁻¹(q) = inf {x | F_X(x) > q}, q ∈ [0, 1] (for q = 0.5, F⁻¹(q) is called the median)

Useful distributions (1): examples of discrete distributions

Uniform distribution over {1, 2, …, m}:
P(X = k) = f_X(k) = 1/m

Bernoulli distribution with parameter p (x ∈ {0, 1}):
P(X = x) = f_X(x) = p^x (1 − p)^(1−x)

Binomial distribution with parameters m, p:
P(X = k) = f_X(k) = C(m, k) p^k (1 − p)^(m−k)

(Figures: the corresponding pmfs plotted against x and k.)
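
A quick Python sketch of these three pmfs (standard library only; function names are my own):

```python
from math import comb

def uniform_pmf(k, m):
    """Uniform over {1, ..., m}."""
    return 1 / m if 1 <= k <= m else 0.0

def bernoulli_pmf(x, p):
    """P(X = x) = p^x (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

def binomial_pmf(k, m, p):
    """P(X = k) = C(m, k) p^k (1-p)^(m-k): k successes in m Bernoulli trials."""
    return comb(m, k) * p**k * (1 - p)**(m - k)

print(uniform_pmf(3, 6))                                    # 1/6 for a fair die
print(binomial_pmf(2, 10, 0.5))                             # P(2 heads in 10 fair tosses)
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))     # sums to 1 (up to float rounding)
```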

Useful distributions (2): examples of continuous distributions

Uniform distribution over [a, b]:
density f_X(x) = 1/(b − a) for a < x < b

Pareto distribution:
density f_X(x) = k b^k / x^(k+1) for x > b

(Figure source: Wikipedia)

Multivariate distributions

Let X_1, …, X_m be random variables over the same probability space with domains dom(X_1), …, dom(X_m). The joint distribution of X_1, …, X_m has a pdf f_{X_1,…,X_m}(x_1, …, x_m) with

Σ_{x_1 ∈ dom(X_1)} … Σ_{x_m ∈ dom(X_m)} f_{X_1,…,X_m}(x_1, …, x_m) = 1  (discrete case), or
∫_{dom(X_1)} … ∫_{dom(X_m)} f_{X_1,…,X_m}(x_1, …, x_m) dx_1 … dx_m = 1  (continuous case).

The marginal distribution of X_i is obtained by summing (or integrating) out all the other variables:

f_{X_i}(x_i) = Σ_{x_1} … Σ_{x_{i−1}} Σ_{x_{i+1}} … Σ_{x_m} f_{X_1,…,X_m}(x_1, …, x_m), or
f_{X_i}(x_i) = ∫ … ∫ f_{X_1,…,X_m}(x_1, …, x_m) dx_1 … dx_{i−1} dx_{i+1} … dx_m

Multivariate distribution: example

Multinomial distribution with parameters n, m (rolling n m-sided dice):
P(X_1 = k_1, …, X_m = k_m) = f_{X_1,…,X_m}(k_1, …, k_m) = n! / (k_1! ⋯ k_m!) · p_1^{k_1} ⋯ p_m^{k_m}
with k_1 + … + k_m = n and p_1 + … + p_m = 1

Note: in information retrieval, the multinomial distribution is often used to model the following case: a document d with n terms from the alphabet w_1, …, w_m, where each w_i occurs k_i times in d.
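
As an illustration, a small Python sketch computing the multinomial pmf for the term counts of a hypothetical toy document (the vocabulary and term probabilities are assumed for the example):

```python
from math import factorial
from collections import Counter

def multinomial_pmf(counts, probs):
    """P(X_1=k_1, ..., X_m=k_m) = n!/(k_1!...k_m!) * p_1^k_1 * ... * p_m^k_m."""
    n = sum(counts)
    coef = factorial(n)
    for k in counts:
        coef //= factorial(k)          # sequential division stays exact
    p = 1.0
    for k, pi in zip(counts, probs):
        p *= pi**k
    return coef * p

# Hypothetical toy document over the vocabulary (w1, w2, w3)
doc = "w1 w1 w2 w1 w3 w2".split()
c = Counter(doc)
counts = [c[w] for w in ("w1", "w2", "w3")]    # k_i: occurrences of w_i in d
probs = [0.5, 0.3, 0.2]                        # assumed term probabilities p_i
print(multinomial_pmf(counts, probs))          # probability of exactly these counts
```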

Expectation, variance, and covariance

Expectation
- For a discrete variable X: E[X] = Σ_x x f_X(x)
- For a continuous variable X: E[X] = ∫ x f_X(x) dx

Properties
- E[X_i + X_j] = E[X_i] + E[X_j]
- E[X_i X_j] = E[X_i] E[X_j] for independent variables X_i, X_j
- E[aX + b] = a E[X] + b for constants a, b

Variance
Var(X) = E[(X − E[X])²] = E[X²] − E[X]², StDev(X) = √Var(X)

Properties
- Var(X_i + X_j) = Var(X_i) + Var(X_j) for independent variables X_i, X_j
- Var(aX + b) = a² Var(X) for constants a, b

Covariance
Cov(X_i, X_j) = E[(X_i − E[X_i]) (X_j − E[X_j])], Var(X) = Cov(X, X)
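
A short Monte Carlo sanity check of these rules for a fair die (a toy sketch, not part of the slides):

```python
import random
random.seed(0)

# Empirical E[X] and Var(X) for a fair die: E[X] = 3.5, Var(X) = 35/12 ≈ 2.92.
samples = [random.randint(1, 6) for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var)

# Linearity: E[aX + b] = a E[X] + b, Var(aX + b) = a^2 Var(X).
a, b = 2, 1
scaled = [a * x + b for x in samples]
print(sum(scaled) / len(scaled))     # ≈ 2 * 3.5 + 1 = 8
```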

Maximum Likelihood Estimation (MLE)

Statistical parameter estimation through MLE: after tossing a coin n times, we have seen heads k times. Let p be the unknown probability of the coin showing heads. Is it possible to estimate p?

We know the observation corresponds to a Binomial distribution, hence:
L(p; k, n) = C(n, k) p^k (1 − p)^(n−k)

Maximizing L(p; k, n) is equivalent to maximizing log L(p; k, n), the log-likelihood function:
log L(p; k, n) = log C(n, k) + k log p + (n − k) log(1 − p)

Setting the derivative to zero:
d/dp log L(p; k, n) = k/p − (n − k)/(1 − p) = 0  ⟹  p = k/n
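
A small numerical check of the closed-form result: maximizing the log-likelihood over a grid of candidate values of p recovers p = k/n (the observation k = 37, n = 100 is hypothetical):

```python
from math import comb, log

def log_likelihood(p, k, n):
    """log L(p; k, n) = log C(n, k) + k log p + (n - k) log(1 - p)."""
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

k, n = 37, 100                                    # hypothetical: 37 heads in 100 tosses
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, k, n))
print(p_hat)                                      # 0.37 = k/n, matching the closed form
```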

Maximum Likelihood Estimation (MLE)

Formal definition of MLE: let x_1, …, x_n be a random sample from a distribution f(θ, x). (Note that x_1, …, x_n can be viewed as the values of i.i.d. random variables X_1, …, X_n.)

L(θ; x_1, …, x_n) = P[x_1, …, x_n originate from f(θ, x)]

Maximizing L(θ; x_1, …, x_n) is equivalent to maximizing log L(θ; x_1, …, x_n), i.e., the log-likelihood function log P(x_1, …, x_n | θ).

If log L(θ) is analytically intractable, use iterative numerical methods, e.g. Expectation Maximization (EM). (More on this in the Data Mining lecture.)

Modeling natural language: three questions

1. Is there a general model for the distribution of terms in natural language?
2. Given a term in a document, what is its information content?
3. Given a document, by which terms is it best represented?

Modeling natural language

(Figure: term frequencies of the terms in a text, ordered by frequency, annotated with the three questions: term distribution, expected information content, most representative terms.)

Is there a weighting scheme that gives higher weights to representative terms?

Linguistic observation

In a large text corpus:
- few terms occur very frequently
- many terms occur infrequently

Zipf's law: the frequency of a term t is inversely proportional to its rank,
f(t) = C · 1/r(t)
where C is the frequency of the most frequent term and r(t) is the rank of term t.

Source: http://www.ucl.ac.uk/~ucbplrd/language_page.htm
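
A rough way to check Zipf's law on a corpus of your own (a sketch only; the corpus file path is a placeholder):

```python
from collections import Counter

def zipf_check(text):
    """Compare observed term frequencies with Zipf's prediction f(t) = C / r(t)."""
    freqs = Counter(text.lower().split()).most_common()
    C = freqs[0][1]                                # frequency of the most frequent term
    for rank, (term, freq) in enumerate(freqs[:10], start=1):
        print(f"{rank:>4} {term:<12} observed={freq:<6} zipf={C / rank:.1f}")

# Usage (hypothetical): run it on any large plain-text corpus
# zipf_check(open("corpus.txt", encoding="utf-8").read())
```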

Example: retweets on Twitter

Note the log-log scale of the graph!

Source: Master's thesis by M. Jenders (2012)

Pareto distribution

The density of a continuous random variable X following a Pareto distribution is given by
f_X(x; k, θ) = (k/θ)(θ/x)^(k+1) = k θ^k / x^(k+1) for x ≥ θ, and 0 for x < θ

Pareto principle: 80% of the effects come from 20% of the causes.

Family of distributions: power-law distributions (Source: Wikipedia)

Examples
- Distribution of populations over cities
- Distribution of wealth
- Degree distribution in the web graph (or social graphs)

Heaps' law

Empirical law describing the portion of vocabulary captured by a document:
V(n) = K n^β
for parameters K (typically 10 ≤ K ≤ 100) and β (typically 0.4 ≤ β ≤ 0.6).

The vocabulary of a text grows sublinearly with its size!

(Figure: vocabulary size V(n) plotted against document size n.)

See also: Modern Information Retrieval, 6.5.2
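
A sketch for tracing the vocabulary growth V(n) of a tokenized corpus and comparing it with K·n^β (the corpus path and the parameters K = 30, β = 0.5 are placeholders):

```python
def heaps_curve(tokens):
    """Vocabulary size V(n) after the first n tokens, for comparison with V(n) = K * n**beta."""
    seen, curve = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve

# Usage (hypothetical corpus file), with K = 30 and beta = 0.5 as a rough reference:
# tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
# for n, v in heaps_curve(tokens)[::10_000]:
#     print(n, v, 30 * n**0.5)
```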

Zipf's law & Heaps' law

Two sides of the same coin. (Source: Modern Information Retrieval)

Both laws suggest opportunities for compression (more on this later). How can we compress as much as possible without losing information?

From information content to entropy

Information content: can we formally capture the content of information?
1. Intuition: the more surprising a piece of information (i.e., event) is, the higher its information content should be, so h(x) should decrease as P(x) increases.
2. Intuition: the information content of two independent events x and y should simply add up (additivity): h(x, y) = h(x) + h(y).

Define h(x) := −log₂ P(x)

Entropy (expected information content)
Let X be a random variable with 8 equally likely states. What is the average number of bits needed to encode a state of X?
H(X) = −Σ_{x ∈ dom(X)} P(x) log P(x) (i.e. the entropy of X) = −8 · (1/8) log₂(1/8) = 3

Also: the entropy is a lower bound on the average number of bits needed to encode a state of X.
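
A two-line entropy function makes the example concrete; the second call anticipates the skewed distribution used in the Huffman example later in this lecture:

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_x P(x) log2 P(x), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/8] * 8))                            # 3.0 bits: 8 equally likely states
print(entropy([1/2, 1/4, 1/8, 1/16] + [1/64] * 4))   # 2.0 bits: the skewed distribution below
```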

Entropy function

(Figure: the entropy H(X) plotted as a function of P(X = x).)

Relative entropy (Kullback-Leibler divergence)

Let f and g be two probability density functions over a random variable X. Assuming that g is an approximation of f, the additional average number of bits needed to encode a state of X through g is given by
KL(f ‖ g) = ∫ f(x) log (f(x) / g(x)) dx

Properties of relative entropy
- KL(f ‖ g) ≥ 0 (Gibbs' inequality)
- KL(f ‖ g) ≠ KL(g ‖ f) (asymmetric)

Related symmetric measure: Jensen-Shannon divergence
JS(f, g) = α KL(f ‖ g) + β KL(g ‖ f) with α + β = 1
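
A discrete-case sketch of both quantities (assuming pmfs given as lists over the same support; the example distributions are my own):

```python
from math import log2

def kl(f, g):
    """KL(f || g) = sum_x f(x) log2(f(x) / g(x)) for discrete pmfs on the same support."""
    return sum(fx * log2(fx / gx) for fx, gx in zip(f, g) if fx > 0)

def js(f, g, alpha=0.5, beta=0.5):
    """Symmetrised variant as defined on the slide: alpha*KL(f||g) + beta*KL(g||f)."""
    return alpha * kl(f, g) + beta * kl(g, f)

f = [0.5, 0.25, 0.25]
g = [1/3, 1/3, 1/3]
print(kl(f, g), kl(g, f))   # both nonnegative, and not equal (asymmetric)
print(js(f, g))
```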

Mutual information

Let X and Y be two random variables with joint distribution P(X, Y). The degree of their dependence is given by
I(X, Y) = KL( P(X, Y) ‖ P(X) P(Y) ) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy

Properties of mutual information
- I(X, Y) ≥ 0
- I(X, Y) = 0 if and only if X and Y are independent
- I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) (also known as information gain, i.e., the entropy reduction of X by being told the value of Y)
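
For discrete variables, mutual information can be computed directly from a joint probability table; the two 2×2 tables below are my own toy examples:

```python
from math import log2

def mutual_information(joint):
    """I(X, Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), joint given as a 2-D table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x, y) = p(x) p(y)
dependent   = [[0.45, 0.05], [0.05, 0.45]]   # strongly coupled variables
print(mutual_information(independent))        # 0.0
print(mutual_information(dependent))          # ≈ 0.53 bits of shared information
```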

Lossless compression (1): Huffman compression

Let X be a random variable with 8 possible states {x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8} with occurrence probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

In any case, 3 bits would be sufficient to encode any of the 8 states. Can we do better?

Encoding: 0, 10, 110, 1110, 111100, 111101, 111110, 111111

Prefix property: no codeword is a prefix of another codeword!

(Figure: the Huffman tree, constructed bottom-up by repeatedly combining the two lowest-frequency subtrees.)
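
A compact Huffman construction in Python (a heap-based sketch of the bottom-up procedure described above, not the slide's own code); for the probabilities on this slide it reproduces codeword lengths 1, 2, 3, 4, 6, 6, 6, 6 and an average length of 2 bits:

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code by repeatedly merging the two least probable subtrees."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                    # tie-breaker so code dicts are never compared
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"x1": 1/2, "x2": 1/4, "x3": 1/8, "x4": 1/16,
         "x5": 1/64, "x6": 1/64, "x7": 1/64, "x8": 1/64}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
print(code)
print(avg_len)    # 2.0 bits = H(X): the bound is met because all probs are powers of 1/2
```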

Lossless compression (2)

Shannon's noiseless coding theorem: let X be a random variable with n possible states. For any noiseless encoding of the states of X, H(X) is a lower bound on the average code length of a state of X.

Theorem: Huffman compression is an entropy encoding algorithm; its average code length L satisfies H(X) ≤ L < H(X) + 1, and it meets the entropy lower bound exactly when all state probabilities are powers of 1/2 (as in the example above).

Corollary: Huffman compression is optimal among prefix codes for lossless symbol-by-symbol compression.

Lossless compression (3)

Ziv-Lempel compression (e.g., LZ77)
- Use a lookahead window and a backward window to scan the text
- Identify in the lookahead window the longest string that also occurs in the backward window
- Replace the string by a pointer to its previous occurrence
- The text is encoded in triples ⟨previous, length, new⟩
  - previous: distance to the previous occurrence
  - length: length of the string
  - new: symbol following the string

More advanced variants use adaptive dictionaries with statistical occurrence analysis!

Lossless compression (4)

Ziv-Lempel compression (e.g., LZ77): example

Text: A A B A B B B A B A A B A B B B A B B A B B
Code: ⟨–, 0, A⟩ ⟨1, 1, B⟩ ⟨2, 2, B⟩ ⟨4, 3, A⟩ ⟨9, 8, B⟩ ⟨3, 3, –⟩

Note that LZ77 and other sophisticated lossless compression algorithms (e.g. LZ78, Lempel-Ziv-Welch, …) encode several states at the same time. With appropriately generalized notions of variables and states, Shannon's noiseless coding theorem still holds!
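
A small decoder makes the triples concrete (my own sketch; it represents "no previous occurrence" as distance 0 and "no following symbol" as None):

```python
def lz77_decode(triples):
    """Decode LZ77 triples (distance back, length, next symbol); None means 'no symbol'."""
    out = []
    for dist, length, sym in triples:
        start = len(out) - dist
        for i in range(length):
            out.append(out[start + i])     # copying may overlap the region being written
        if sym is not None:
            out.append(sym)
    return "".join(out)

# The triples from the slide's example:
code = [(0, 0, "A"), (1, 1, "B"), (2, 2, "B"), (4, 3, "A"), (9, 8, "B"), (3, 3, None)]
print(lz77_decode(code))    # AABABBBABAABABBBABBABB
```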

Tf-idf weighting scheme (1)

Given a document, by which terms is it best represented? Is there a weighting scheme that gives higher weights to representative terms?

Tf-idf weighting scheme (2)

Given a document, by which terms is it best represented? Is there a weighting scheme that gives higher weights to representative terms?

Consider a corpus with documents D = {d_1, …, d_n} with terms from a vocabulary V = {t_1, …, t_m}.

The term frequency of term t_i in document d_j is measured by
tf(t_i, d_j) = freq(t_i, d_j) / max_k freq(t_k, d_j)
The normalisation makes the estimate independent of document length.

The inverse document frequency of a term t_i is measured by
idf(t_i, D) = log( |D| / |{d ∈ D : t_i occurs in d}| )
It downweights terms that occur in many documents (i.e., stop words: the, to, from, if, …).

Tf-idf is the central weighting scheme for scoring and ranking.
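
A minimal Python sketch of exactly these two formulas on a hypothetical three-document toy corpus:

```python
from collections import Counter
from math import log

def tf(term, doc_terms):
    """tf(t, d) = freq(t, d) / max_k freq(t_k, d)."""
    freqs = Counter(doc_terms)
    return freqs[term] / max(freqs.values())

def idf(term, docs):
    """idf(t, D) = log( |D| / |{d in D : t occurs in d}| )."""
    df = sum(1 for d in docs if term in d)
    return log(len(docs) / df) if df else 0.0

# Hypothetical toy corpus
docs = [d.lower().split() for d in
        ["the cat sat on the mat",
         "the dog chased the cat",
         "the retrieval system is fun"]]
for term in ("the", "cat", "retrieval"):
    scores = [round(tf(term, d) * idf(term, docs), 3) for d in docs]
    print(term, scores)   # "the" gets weight 0 (occurs everywhere); rare "retrieval" scores highest
```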

Tf, idf, and tf-idf

Tf, idf, and tf-idf weights (plotted in log scale), computed on a collection from the Wall Street Journal (~99,000 articles published between 1987 and 1989).

Source: Modern Information Retrieval

Various tf-idf weighting schemes

Different weighting schemes based on the tf-idf model, as implemented in the SMART system.

Source: Introduction to Information Retrieval

Summary: probability theory

- Sample space, events, random variables
- Sum rule (for marginals), product rule (for joint distributions), Bayes' theorem (using conditionals)
- Distributions (discrete, continuous, multivariate), pdfs, cdfs, quantiles
- Expectation, variance, covariance
- Maximum likelihood estimation

Summary: information theory

- Information content
- Entropy, relative entropy (= KL divergence), mutual information
- Lossless compression: Lempel-Ziv and Huffman compression (an entropy encoding algorithm)
- Shannon's noiseless coding theorem
- Tf-idf