
CS 630 Basic Probability and Information Theory
Tim Campbell
21 January 2003

Probability Theory

Probability Theory is the study of how best to predict the outcomes of events. An experiment (or trial) is a process by which observable results come to pass. Define the set $D$ to be the sample space in which the outcomes of experiments lie. Define $F$ to be a collection of subsets of $D$ that includes both $D$ and the empty set; $F$ must be closed under complement and under finite unions and intersections.

A probability function (or distribution) is a function $P: F \to [0, 1]$ such that $P(D) = 1$ and, for pairwise disjoint sets $A_i \in F$,
$$P\Big(\bigcup_i A_i\Big) = \sum_i P(A_i).$$
A probability space consists of a sample space $D$, a set of events $F$, and a probability function $P$.
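As a quick illustration (not part of the original notes), the following Python sketch builds a probability function for a fair six-sided die and checks the two axioms on a pair of disjoint events; the names and numbers are my own.

# Hypothetical sketch: a discrete probability space for a fair die.
D = {1, 2, 3, 4, 5, 6}                        # sample space
P = lambda A: sum(1/6 for a in A if a in D)   # probability function on events (subsets of D)

assert abs(P(D) - 1.0) < 1e-12                # P(D) = 1
A, B = {1, 2}, {5}                            # disjoint events
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12  # additivity for disjoint sets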

Continuous Spaces

The discussion presented here is given for discrete spaces, but the ideas carry over to continuous spaces. There the probability of any finite union of points is zero, and probabilities are given by a probability density function $p$:
$$P(D) = \int_D p(u)\,du = 1 \qquad \text{and} \qquad P(\text{event}) = \int_{\text{event}} p(u)\,du.$$

Conditional Probability

Conditional probability is the (possibly) changed probability of an event given some knowledge. The prior probability of an event is its probability before new knowledge is considered; the posterior probability is the probability that results once the new knowledge is used. The conditional probability of event $A$ given that $B$ has happened is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
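A minimal sketch (my own, not from the notes) of the definition: the conditional probability that two fair dice total at least 10, given that the first die shows a 5, computed directly as $P(A \cap B)/P(B)$.

from itertools import product

# Hypothetical example: two fair dice, all 36 outcomes equally likely.
outcomes = list(product(range(1, 7), repeat=2))
P = lambda event: sum(1 for o in outcomes if event(o)) / len(outcomes)

A = lambda o: o[0] + o[1] >= 10   # total is at least 10
B = lambda o: o[0] == 5           # first die shows a 5
A_and_B = lambda o: A(o) and B(o)

print(P(A_and_B) / P(B))          # P(A | B) = (2/36) / (6/36) = 1/3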

This generalizes to the chain rule:
$$P(A_1 \cap \dots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P\Big(A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big).$$
If events $A$ and $B$ are independent of each other, then $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$, so it follows that $P(A \cap B) = P(A)\,P(B)$. Events $A$ and $B$ are conditionally independent given event $C$ if
$$P(A, B, C) = P(A, B \mid C)\,P(C) = P(A \mid C)\,P(B \mid C)\,P(C).$$

Bayes' Theorem

Bayes' theorem:
$$P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \mid B)\,P(B)}{P(A)}.$$
The denominator $P(A)$ can be thought of as a normalizing constant and can be ignored if one is just trying to find the most likely event given $A$. More generally, if $\{B_i\}$ is a collection of disjoint sets that partition the sample space, then for any $B$ in the collection
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{\sum_i P(A \mid B_i)\,P(B_i)}.$$
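The following sketch (my own illustration, with made-up numbers) applies the partition form of Bayes' theorem to a two-hypothesis diagnostic-test example.

# Hypothetical numbers: prior prevalence and test accuracies.
priors = {"disease": 0.01, "healthy": 0.99}          # P(B_i), a partition
likelihood = {"disease": 0.95, "healthy": 0.05}      # P(positive test | B_i)

evidence = sum(likelihood[b] * priors[b] for b in priors)        # P(A)
posterior = {b: likelihood[b] * priors[b] / evidence for b in priors}
print(posterior)   # P(disease | positive) is roughly 0.16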

Random Variables

A random variable is a function $X: D \to \mathbb{R}^n$. The probability mass function is defined as
$$p(x) = p(X = x) = P(A_x), \qquad \text{where } A_x = \{a \in D : X(a) = x\}.$$
Expectation is defined as
$$E(X) = \sum_x x\,p(x).$$
Variance is defined as
$$\mathrm{Var}(X) = E\big((X - E(X))^2\big) = E(X^2) - E(X)^2.$$
The standard deviation is defined as the square root of the variance.
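A small sketch (mine, not from the notes) computing expectation, variance, and standard deviation directly from a probability mass function stored as a dictionary.

import math

# Hypothetical pmf of a loaded die, as a dict {value: probability}.
pmf = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.2, 5: 0.2, 6: 0.3}

E = sum(x * p for x, p in pmf.items())        # E(X)
E2 = sum(x * x * p for x, p in pmf.items())   # E(X^2)
var = E2 - E ** 2                             # Var(X) = E(X^2) - E(X)^2
std = math.sqrt(var)                          # standard deviation
print(E, var, std)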

Joint probability distributions are possible using many random variables over a sample space. A joint probability mass function is defined as
$$p(x, y) = P(A_x, B_y).$$
Marginal probability mass functions total up the probability mass over the values of the other variable; for example,
$$p_X(x) = \sum_y p(x, y).$$
The conditional probability mass function is defined as
$$p_{X \mid Y}(x \mid y) = \frac{p(x, y)}{p_Y(y)}, \qquad p_Y(y) > 0.$$
The chain rule for random variables follows:
$$p(w, x, y, z) = p(w)\,p(x \mid w)\,p(y \mid w, x)\,p(z \mid w, x, y).$$
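A sketch (my own) of marginalization and conditioning on a small joint pmf stored as a dictionary keyed by (x, y) pairs; the table values are hypothetical.

# Hypothetical joint pmf p(x, y) over x in {0, 1} and y in {'a', 'b'}.
joint = {(0, 'a'): 0.1, (0, 'b'): 0.3, (1, 'a'): 0.4, (1, 'b'): 0.2}

p_x = {}
p_y = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p     # marginal p_X(x) = sum_y p(x, y)
    p_y[y] = p_y.get(y, 0.0) + p     # marginal p_Y(y)

# Conditional pmf p(x | y) = p(x, y) / p_Y(y), defined when p_Y(y) > 0.
p_x_given_y = {(x, y): p / p_y[y] for (x, y), p in joint.items() if p_y[y] > 0}
print(p_x, p_y, p_x_given_y)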

Determining P

The function $P$ is not always easy to obtain. Methods of construction include relative frequency, parametric construction, and empirical estimation. The uniform distribution assigns the same value to every point in the domain. The binomial distribution describes the number of successes in a series of Bernoulli trials. The Poisson distribution distributes points in such a way that the expected number of points in an interval is proportional to the length of the interval. The normal distribution is also known as the Gaussian distribution.

Bayesian Statistics

Bayesian statistics integrates prior beliefs about probabilities with observations using Bayes' theorem. Example: consider the toss of a possibly unbalanced coin. A sequence of flips $s$ gives $i$ heads and $j$ tails, and $\mu_m$ is a model in which $P(h) = m$; then
$$P(s \mid \mu_m) = m^i (1 - m)^j.$$
Now suppose the prior belief is modeled by $P(\mu_m) = 6m(1 - m)$, which is centered on $0.5$ and integrates to 1. Bayes' theorem gives
$$P(\mu_m \mid s) = \frac{P(s \mid \mu_m)\,P(\mu_m)}{P(s)} = \frac{6\,m^{i+1}(1 - m)^{j+1}}{P(s)}.$$
$P(s)$ is a marginal probability, obtained by integrating $P(s \mid \mu_m)$ weighted by $P(\mu_m)$:
$$P(s) = \int_0^1 P(s \mid \mu_m)\,P(\mu_m)\,dm = \int_0^1 6\,m^{i+1}(1 - m)^{j+1}\,dm.$$
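A numerical sketch of the coin example (my own code, not part of the notes): it approximates the marginal $P(s)$ and the posterior mean on a grid of $m$ values for hypothetical counts of $i = 7$ heads and $j = 3$ tails.

# Grid approximation of the posterior for the unbalanced-coin example.
i, j = 7, 3                                   # hypothetical counts of heads and tails
N = 10000
ms = [(k + 0.5) / N for k in range(N)]        # grid over m in (0, 1)

prior = lambda m: 6 * m * (1 - m)             # P(mu_m) = 6m(1 - m)
lik = lambda m: m ** i * (1 - m) ** j         # P(s | mu_m) = m^i (1 - m)^j

P_s = sum(lik(m) * prior(m) for m in ms) / N  # P(s), the integral over m
posterior = [lik(m) * prior(m) / P_s for m in ms]
print(P_s, sum(m * p for m, p in zip(ms, posterior)) / N)   # marginal and posterior mean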

Bayesian updating is the process of applying the above technique repeatedly to update beliefs as new data become available. Bayesian decision theory is a method by which multiple models can be evaluated. Given two models $\mu$ and $\nu$,
$$P(\mu \mid s) = \frac{P(s \mid \mu)\,P(\mu)}{P(s)} \qquad \text{and} \qquad P(\nu \mid s) = \frac{P(s \mid \nu)\,P(\nu)}{P(s)}.$$
The ratio of these posteriors is
$$\frac{P(\mu \mid s)}{P(\nu \mid s)} = \frac{P(s \mid \mu)\,P(\mu)}{P(s \mid \nu)\,P(\nu)}.$$
If the ratio is greater than 1, then $\mu$ is preferable; otherwise $\nu$ is preferable.
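A sketch (mine) of the posterior ratio for two fixed-bias coin models $\mu$ and $\nu$ on the same hypothetical flip counts; equal prior probabilities for the two models are assumed.

# Hypothetical comparison of two coin models on i heads and j tails.
i, j = 7, 3
lik = lambda m: m ** i * (1 - m) ** j    # P(s | model with P(h) = m)

P_mu, P_nu = 0.5, 0.5                    # equal prior probabilities for the two models
ratio = (lik(0.7) * P_mu) / (lik(0.5) * P_nu)
print(ratio)                             # > 1, so the m = 0.7 model is preferred here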

Information Theory

Information theory was developed by Claude Shannon. It addresses the questions of maximizing data compression and maximizing the transmission rate for any source of information and any communication channel.

Entropy

Entropy measures the amount of information in a random variable and is defined as
$$H(p) = H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = E\Big(\log_2 \frac{1}{p(x)}\Big).$$
The joint entropy of a pair of discrete random variables $X$ and $Y$ is defined as
$$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y).$$
The conditional entropy of a random variable $Y$ given $X$ expresses the amount of information needed to communicate $Y$ if $X$ is already known:
$$H(Y \mid X) = \sum_{x \in X} p(x)\,H(Y \mid X = x) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x).$$
The chain rule for entropy is
$$H(X_1, \dots, X_n) = H(X_1) + H(X_2 \mid X_1) + \dots + H(X_n \mid X_1, \dots, X_{n-1}).$$
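The sketch below (my own, not from the notes) computes $H(X)$, $H(Y)$, $H(X, Y)$, and $H(Y \mid X)$ from a small hypothetical joint pmf, following the definitions above.

import math

# Hypothetical joint pmf p(x, y).
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}

def marginal(joint, axis):
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_x, p_y = marginal(joint, 0), marginal(joint, 1)
H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(joint)
H_Y_given_X = H_XY - H_X          # chain rule: H(X, Y) = H(X) + H(Y | X)
print(H_X, H_Y, H_XY, H_Y_given_X)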

Mutual Information

Mutual information is the reduction in the uncertainty of one random variable caused by knowledge of another. Using the chain rule for $H(X, Y)$,
$$H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).$$
Denote the mutual information of random variables $X$ and $Y$ by $I(X; Y)$:
$$I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) = \sum_{x \in X,\, y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}.$$
Conditional mutual information is defined as
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z).$$
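Continuing the same style of sketch (mine), mutual information can be computed directly from the sum over $p(x, y) \log_2 \frac{p(x, y)}{p(x)p(y)}$; the joint pmf is again hypothetical.

import math

joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}   # hypothetical p(x, y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X; Y) = sum over x, y of p(x, y) * log2( p(x, y) / (p(x) p(y)) )
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)
print(I)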

The chain rule for mutual information is
$$I(X_1, \dots, X_n; Y) = I(X_1; Y) + \dots + I(X_n; Y \mid X_1, \dots, X_{n-1}) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \dots, X_{i-1}).$$

The Noisy Channel Model

There is a trade-off between compression and transmission accuracy: the first reduces the space used, the second increases it. Channels are characterized by their capacity, which (for a memoryless channel) can be expressed as
$$C = \max_{p(x)} I(X; Y),$$
where $X$ is the input to the channel and $Y$ is the channel output. Channel capacity is reached when the input code is designed so that the input distribution $p(x)$ maximizes the mutual information between $X$ and $Y$ over all possible input distributions.
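As an illustration of $C = \max_{p(x)} I(X; Y)$ (my own sketch, not from the notes), the code below sweeps the input distribution of a binary symmetric channel with a hypothetical crossover probability of 0.1 and compares the maximized mutual information with the closed form $1 - H(0.1)$.

import math

def H2(p):                       # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_info(a, eps):
    # I(X; Y) for a binary symmetric channel: P(X = 1) = a, crossover probability eps.
    py1 = a * (1 - eps) + (1 - a) * eps      # P(Y = 1)
    return H2(py1) - H2(eps)                 # I(X; Y) = H(Y) - H(Y | X)

eps = 0.1                                                    # hypothetical channel
C = max(mutual_info(a / 1000, eps) for a in range(1001))     # sweep input distributions
print(C, 1 - H2(eps))                                        # both are about 0.531 bits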

Relative Entropy

Given two probability mass functions $p$ and $q$, relative entropy is defined as
$$D(p \,\|\, q) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}.$$
Relative entropy gives a measure of how different two probability distributions are. Mutual information is really a measure of how far a joint distribution is from independence:
$$I(X; Y) = D\big(p(x, y) \,\|\, p(x)\,p(y)\big).$$
Conditional relative entropy and a chain rule are also defined.
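A short sketch (mine) of $D(p \,\|\, q)$ for two hypothetical pmfs over the same support; note that the measure is not symmetric.

import math

def kl(p, q):
    # D(p || q) = sum_x p(x) log2( p(x) / q(x) ); assumes q(x) > 0 wherever p(x) > 0.
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {'a': 0.5, 'b': 0.3, 'c': 0.2}    # hypothetical "true" distribution
q = {'a': 0.4, 'b': 0.4, 'c': 0.2}    # hypothetical model
print(kl(p, q), kl(q, p))             # the two directions differ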

The Relation to Language

Given a history of words $h$, the next word $w$, and a model $m$, define the point-wise entropy as $H(w \mid h) = -\log_2 m(w \mid h)$. If the model predicts the word with certainty, the point-wise entropy is 0; if the model assigns the word probability 0, the point-wise entropy is infinite. In this sense a model's accuracy is tested, and one would hope to keep these surprises to a minimum. In practice $p(x)$ may not be known, so a model $m$ is best when $D(p \,\|\, m)$ is minimal. Unfortunately, if $p(x)$ is unknown, $D(p \,\|\, m)$ can only be approximated, using techniques like cross entropy and perplexity.

Cross Entropy

The cross entropy between $X$, with actual probability distribution $p(x)$, and a model $q(x)$ is
$$H(X, q) = H(X) + D(p \,\|\, q) = -\sum_{x \in X} p(x) \log_2 q(x).$$
If a large sample of text is available, cross entropy can be approximated as
$$H(X, q) \approx -\frac{1}{n} \log_2 q(x_{1,n}).$$
Minimizing cross entropy is equivalent to minimizing relative entropy, which brings the model's probability distribution closer to the actual probability distribution.
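The approximation $H(X, q) \approx -\frac{1}{n} \log_2 q(x_{1,n})$ is easy to compute for a model that assigns a probability to each word; the sketch below (my own, using a hypothetical unigram model for simplicity) averages the per-word surprisal over a small sample.

import math

# Hypothetical unigram model q(w) and a small sample of text.
q = {'the': 0.3, 'cat': 0.2, 'sat': 0.2, 'mat': 0.1, 'on': 0.2}
sample = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Cross entropy estimate: -(1/n) log2 q(x_1, ..., x_n) = average per-word surprisal.
cross_entropy = -sum(math.log2(q[w]) for w in sample) / len(sample)
print(cross_entropy)   # bits per word under the model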

Perplexity

A perplexity of $k$ means that you are, on average, as surprised as you would have been if you had had to guess between $k$ equiprobable choices at each step. It is defined as
$$\mathrm{perplexity}(x_{1,n}, m) = 2^{H(x_{1,n}, m)} = m(x_{1,n})^{-\frac{1}{n}}.$$
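Perplexity follows directly from the cross entropy estimate above: it is simply 2 raised to that value (sketch mine, reusing the same hypothetical model and sample).

import math

q = {'the': 0.3, 'cat': 0.2, 'sat': 0.2, 'mat': 0.1, 'on': 0.2}   # hypothetical model
sample = ['the', 'cat', 'sat', 'on', 'the', 'mat']

cross_entropy = -sum(math.log2(q[w]) for w in sample) / len(sample)
perplexity = 2 ** cross_entropy        # equivalently m(x_1n) ** (-1 / n)
print(perplexity)                      # effective number of equiprobable choices per word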

The Entropy of English

English can be modeled using n-gram models, or Markov chains, which assume that the probability of the next word depends only on the previous $k$ words in the stream. Such models have exhibited a cross entropy with English as low as 2.8 bits, and experiments with humans have resulted in a cross entropy of 1.34 bits.