A Brief Review of Probability, Bayesian Statistics, and Information Theory

A Brief Review of Probability, Bayesian Statistics, and Information Theory Brendan Frey Electrical and Computer Engineering University of Toronto frey@psi.toronto.edu http://www.psi.toronto.edu

A system is described by a set of random variables $x_1,\dots,x_n$ with domains $\mathcal{X}_1,\dots,\mathcal{X}_n$. A configuration is an assignment of a value to every variable. The sample space is the set of possible configurations and is given by the product of the domains, $\mathcal{X}_1 \times \dots \times \mathcal{X}_n$. Ex: a die: $x \in \{1,\dots,6\}$ (# of dots on the die). Ex: 2 dice: $(x_1,x_2) \in \{1,\dots,6\} \times \{1,\dots,6\}$. Ex: 2 dice described by a single variable such as the total number of dots; less intuitive than above. Ex: 2 dice plus the angle of the thrower's hand, which adds a continuous domain, e.g. $\{1,\dots,6\}^2 \times [0,2\pi)$.

The probability of configuration $x$, $P(x)$, is a real number that satisfies $P(x) \ge 0$ and $\sum_x P(x) = 1$. Ex: 2 unbiased dice: $P(x_1,x_2) = 1/36$ for $x_1,x_2 \in \{1,\dots,6\}$. The probability density for configuration $x$, $p(x)$, is a real number that satisfies $p(x) \ge 0$ and $\int p(x)\,dx = 1$, where $dx$ is a differential volume of the sample space. NOTE: $p(x) > 1$ is possible.

"! A random experiment or simulation produces a configuration 4. Discrete case: In experiments, the fraction of times configuration ' occurs converges to as. Continuous case: Suppose is a region of. In experiments, the fraction of times a configuration in occurs converges to as.

! #" % is the probability of given the value of $ Imagine throwing away all experiments where $ equal to the given value is not & '( ) *+ and $ are independent if $ 6 $ % and $ Knowing tells us nothing about the value of and vice versa,

In general, $P(x,y) = P(x \mid y)P(y) = P(y \mid x)P(x)$ (the chain rule). If $x$ and $y$ are independent, $P(x,y) = P(x)P(y)$. From the chain rule and normalization, $P(x) = \sum_y P(x,y)$; $P(x)$ is sometimes called the marginal of $P(x,y)$. For densities, $p(x) = \int p(x,y)\,dy$.
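
As a small illustration of marginalization (a hypothetical joint table, not from the slides), the sketch below sums $P(x,y)$ over $y$ to obtain $P(x)$.

```python
# Marginalizing a small joint distribution P(x, y) stored as a dict.
# P(x) = sum_y P(x, y); here x and y take values in {0, 1}.
P_xy = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

P_x = {}
for (x, y), p in P_xy.items():
    P_x[x] = P_x.get(x, 0.0) + p

print(P_x)  # {0: 0.3, 1: 0.7}; the marginal probabilities sum to 1
```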

Since $P(x,y) = P(x \mid y)P(y) = P(y \mid x)P(x)$, using $P(x) = \sum_y P(x,y)$ we get Bayes' rule: $P(y \mid x) = \dfrac{P(x \mid y)P(y)}{P(x)} = \dfrac{P(x \mid y)P(y)}{\sum_{y'} P(x \mid y')P(y')}$. For observed $x$ and hidden $y$, we call $P(y)$ the prior, $P(x \mid y)$ the likelihood, and $P(y \mid x)$ the posterior. For densities, $p(y \mid x) = \dfrac{p(x \mid y)p(y)}{\int p(x \mid y')p(y')\,dy'}$.
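
A worked numerical sketch of Bayes' rule, using made-up prior and likelihood values for a hypothetical diagnostic test:

```python
# Bayes' rule for a discrete hidden variable y given an observation x:
# P(y | x) = P(x | y) P(y) / sum_y' P(x | y') P(y').
# Hypothetical numbers: a test result x = "positive" for a condition y.
prior = {"condition": 0.01, "no_condition": 0.99}       # P(y)
likelihood = {"condition": 0.95, "no_condition": 0.05}  # P(x = positive | y)

evidence = sum(likelihood[y] * prior[y] for y in prior)  # P(x = positive)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

print(posterior)  # P(y | x = positive); roughly 0.16 for "condition"
```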

The expected value of $x$ is $E[x] = \sum_x x\,P(x)$ ($x$ can be a vector, e.g. $x = (x_1,x_2)$), or $E[x] = \int x\,p(x)\,dx$ for densities. The variance of $x$ is $\mathrm{Var}(x) = E[(x - E[x])^2]$. If $x$ and $y$ are independent, $E[xy] = E[x]E[y]$.

The covariance of $x$ and $y$ is $\mathrm{Cov}(x,y) = E[(x - E[x])(y - E[y])]$. If $x$ and $y$ are independent, $\mathrm{Cov}(x,y) = 0$ (not vice versa). In general, $\mathrm{Cov}(x,y) = E[xy] - E[x]E[y]$. The covariance matrix of a vector $x \in \mathbb{R}^n$ is $\Sigma = E[(x - E[x])(x - E[x])^\top]$, an $n \times n$ matrix.
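
A quick empirical check of these definitions, assuming NumPy is available; the data and the amount of correlation are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables: y = x + noise.
x = rng.normal(size=10_000)
y = x + 0.5 * rng.normal(size=10_000)

# Empirical covariance matrix of the vector (x, y); np.cov estimates
# E[(x - E[x])(y - E[y])] with a 1/(N-1) normalization.
print(np.cov(np.stack([x, y])))

# Independence implies zero covariance, so an independent z gives Cov(x, z) ~ 0.
z = rng.normal(size=10_000)
print(np.cov(x, z)[0, 1])
```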

Bernoulli distribution: $x \in \{0,1\}$ (e.g., a coin toss), where $\theta = P(x=1)$ is the probability that $x$ is 1, so $P(x) = \theta^x(1-\theta)^{1-x}$. Sometimes we parameterize this distribution using the log-odds, $\ln\frac{\theta}{1-\theta}$.

Beta distribution: $\theta \in [0,1]$ (e.g., a prior for the probability that a coin will land heads up): $p(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$ for $0 \le \theta \le 1$, and $p(\theta) = 0$ otherwise.

Machine learning and statistics study how models are learned from data. In Bayesian machine learning and statistics, the model is considered to be a hidden variable with a prior distribution. Given the data, the posterior distribution over models can be used to make predictions, interpret the data, etc. Maximum likelihood (ML) estimation and maximum a posteriori (MAP) estimation can be viewed as approximations to Bayesian learning, where the most probable model is selected. (In ML estimation, the prior over models is assumed to be uniform.)

Suppose we flip a coin a bunch of times and see $H$ heads and $T$ tails. In a frequentist approach, we estimate the probability of heads as $\hat\theta = \frac{H}{H+T}$. In the Bayesian approach, we first specify a prior, say that $\theta$, the probability of seeing a head, is uniform on $[0,1]$. Using Bayes' rule, we obtain $p(\theta \mid \text{data}) \propto \theta^H(1-\theta)^T$, which is a Beta distribution with mode $\frac{H}{H+T}$ and mean $\frac{H+1}{H+T+2}$. This distribution can be used to make decisions, compute confidence intervals, or interpret the data. For example, the minimum squared loss estimate of $\theta$ is the posterior mean, $\frac{H+1}{H+T+2}$. This is closer to the prior than the frequentist estimate.
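
A short sketch of this posterior computation, assuming SciPy is available; the counts H = 7 and T = 3 are hypothetical.

```python
from scipy.stats import beta

# Hypothetical data: H heads and T tails, with a uniform prior on theta.
H, T = 7, 3

# The posterior is Beta(H + 1, T + 1); its mode is H / (H + T) and its
# mean (the minimum squared loss estimate) is (H + 1) / (H + T + 2).
posterior = beta(H + 1, T + 1)

print("frequentist estimate :", H / (H + T))        # 0.7
print("posterior mean       :", posterior.mean())   # (H+1)/(H+T+2) = 0.666...
print("95% credible interval:", posterior.interval(0.95))
```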

Entropy is a measure of the maximum average amount of information that a random variable can convey in its value. The entropy of a discrete variable $x$ is $H(x) = -\sum_x P(x)\log_2 P(x)$ bits. For a discrete variable, $H(x) \ge 0$, since $0 \le P(x) \le 1$. The more uniform $P(x)$ is, the greater the entropy. If $-\log_2 P(x)$ is an integer for all $x$, $H(x)$ bits of information can be conveyed on average using an encoder that uses $-\log_2 P(x)$ bits to pick $x$. If $\ln$ (the natural logarithm) is used instead of $\log_2$, information is measured in nats.
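
A minimal helper for computing the entropy of a discrete distribution; the two example distributions are illustrative only.

```python
import math

def entropy_bits(P):
    """Entropy H(x) = -sum_x P(x) log2 P(x) of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in P if p > 0)

print(entropy_bits([0.5, 0.125, 0.125, 0.125, 0.125]))  # 2.0 bits
print(entropy_bits([0.2] * 5))  # log2(5) ~ 2.32 bits: the uniform distribution has greater entropy
```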

Imagine we have a queue of random bits (e.g., a compressed image) that we'd like to convey. We can use this information to produce a series of experiments for $x$, with the following distribution and bit strings:

  x   P(x)    -log2 P(x)   string
  1   0.5     1            0
  2   0.125   3            100
  3   0.125   3            101
  4   0.125   3            110
  5   0.125   3            111

Each experiment is produced thus: draw a bit from the queue. If it is 0, set $x = 1$ and terminate the experiment. If it is 1, draw two more bits and use these to pick $x = 2$, 3, 4 or 5, then terminate the experiment. This procedure picks $x$ according to $P(x)$ and conveys an average of $0.5 \times 1 + 4 \times 0.125 \times 3 = 2$ bits per experiment.
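
The procedure can be simulated directly; the sketch below (an illustration with a made-up bit queue, not part of the original slides) consumes random bits and checks that the resulting samples follow $P(x)$.

```python
import random
from collections import Counter

def draw_x(bit_queue):
    """Consume bits from the front of bit_queue and return one sample of x."""
    if bit_queue.pop(0) == 0:
        return 1                    # first bit 0 -> x = 1 (probability 0.5)
    b1, b2 = bit_queue.pop(0), bit_queue.pop(0)
    return 2 + 2 * b1 + b2          # next two bits pick x in {2, 3, 4, 5}

# A queue of random bits (standing in for, e.g., a compressed image).
bits = [random.randint(0, 1) for _ in range(300_000)]

samples = []
while len(bits) >= 3:
    samples.append(draw_x(bits))

counts = Counter(samples)
print({x: counts[x] / len(samples) for x in sorted(counts)})
# Approaches P(x) = {1: 0.5, 2: 0.125, 3: 0.125, 4: 0.125, 5: 0.125}
```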

Instead of encoding a bit string into a random variable, we can encode $x$ into a bit string using a source code. The decoder uses the bit string to recover $x$. It turns out that if $x$ has distribution $P(x)$, then the minimum average bit string length is $H(x) = -\sum_x P(x)\log_2 P(x)$. If $-\log_2 P(x)$ is an integer for all $x$, the minimum can be achieved by mapping each $x$ to a bit string with length $-\log_2 P(x)$.

Bernoulli distribution: $x \in \{0,1\}$ (e.g., a coin toss, or a bit read from a magnetic disk), where $\theta = P(x=1)$ is the probability that $x$ is 1. Entropy: $H(x) = -\theta\log_2\theta - (1-\theta)\log_2(1-\theta)$ bits. [Figure: entropy of a Bernoulli variable plotted against the probability that the variable equals 1; the entropy ranges from 0 to 1 bit and is maximized at $\theta = 0.5$.]

Relative entropy is a measure of the average excess string length when the wrong source code is used. Suppose the true distribution for $x$ is $P(x)$, so the minimum average string length is $-\sum_x P(x)\log_2 P(x)$. Suppose we use bit strings determined from the wrong distribution, $Q(x)$. The average string length will be $-\sum_x P(x)\log_2 Q(x)$. The average excess string length is the relative entropy: $D(P\|Q) = \sum_x P(x)\log_2\frac{P(x)}{Q(x)} \ge 0$.
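
A small sketch computing the relative entropy between two hand-picked distributions (the true distribution from the earlier table and a uniform stand-in for the wrong code):

```python
import math

def kl_bits(P, Q):
    """Relative entropy D(P || Q) = sum_x P(x) log2(P(x)/Q(x)), in bits."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.125, 0.125, 0.125, 0.125]   # true distribution
Q = [0.2, 0.2, 0.2, 0.2, 0.2]           # wrong (uniform) code distribution

print(kl_bits(P, P))   # 0.0: the right code has no excess length
print(kl_bits(P, Q))   # > 0: average excess bits per symbol under the wrong code
```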

Suppose we try to use a source code to compress a real variable $x \in \mathbb{R}$. We can create infinitesimal bins of width $\Delta$, where the bin at $x$ will have probability $p(x)\Delta$. The minimum average string length for this distribution is $-\int p(x)\log_2\big(p(x)\Delta\big)\,dx = -\int p(x)\log_2 p(x)\,dx - \log_2\Delta$. As $\Delta \to 0$, this average length ("entropy") is infinite; i.e., $x$ conveys infinite information. However, on the next page, we see that the relative entropy is finite...

Suppose we use bit strings determined from the wrong density $q(x)$. Under this density, the bin at $x$ will have probability $q(x)\Delta$. The average string length is $-\int p(x)\log_2\big(q(x)\Delta\big)\,dx$. The relative entropy (excess average string length) is $D(p\|q) = \int p(x)\log_2\frac{p(x)}{q(x)}\,dx$. Since the relative entropy is finite, we refer to $h(x) = -\int p(x)\log_2 p(x)\,dx$ as the entropy, although it may be NEGATIVE!

7 " 4 5 (eg, distribution of failure times), nats The entropy increases as 7 otherwise bits increases

Gaussian distribution: $x \in \mathbb{R}$ (e.g., a variable that is a sum of a large number of other real random variables): $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. Entropy: $h(x) = \frac{1}{2}\ln(2\pi e\sigma^2)$ nats $= \frac{1}{2}\log_2(2\pi e\sigma^2)$ bits. The entropy increases as $\sigma$ increases.
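
A numerical sanity check of the Gaussian entropy formula, assuming SciPy is available; $\sigma = 2$ is an arbitrary choice.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Differential entropy of a Gaussian: h(x) = 0.5 * ln(2*pi*e*sigma^2) nats.
sigma = 2.0
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Numerical check: h(x) = -integral of p(x) ln p(x) dx.
p = norm(loc=0.0, scale=sigma).pdf
numeric, _ = integrate.quad(lambda x: -p(x) * np.log(p(x)), -50, 50)

print(closed_form, numeric)  # both ~ 2.112 nats; a small enough sigma would make this negative
```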

Multivariate Gaussian distribution: $x \in \mathbb{R}^n$: $p(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)$, where $\mu \in \mathbb{R}^n$, $\Sigma$ is an $n \times n$ positive definite covariance matrix, and $|\Sigma|$ is its determinant. Entropy: $h(x) = \frac{1}{2}\ln\big((2\pi e)^n|\Sigma|\big)$ nats.
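
A sketch of the same entropy formula in the multivariate case, using an arbitrary 2x2 covariance matrix:

```python
import numpy as np

# Entropy of a multivariate Gaussian: h(x) = 0.5 * ln((2*pi*e)^n * |Sigma|) nats.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])           # a hypothetical 2x2 covariance matrix
n = Sigma.shape[0]

sign, logdet = np.linalg.slogdet(Sigma)  # stable log-determinant; sign is +1 for positive definite Sigma
entropy_nats = 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

print(entropy_nats)
```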

Suppose we have an invertible function $y = f(x)$ and a density $p_x(x)$. When a small volume is mapped from $x$-space to $y$-space, the probability in the volume should stay constant. However, because the volume may change shape, the probability density will change, and the Jacobian captures this effect. Conservation of probability mass gives $p_y(y)\,|dy| = p_x(x)\,|dx|$, so $p_y(y) = p_x(x)\left|\det\frac{\partial x}{\partial y}\right|$; $\frac{\partial x}{\partial y}$ is called the Jacobian. For $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$, the Jacobian is an $n \times n$ matrix of derivatives.
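
A concrete check of the change-of-variables formula, assuming SciPy is available: for $y = e^x$ with $x$ standard normal, applying the Jacobian factor recovers the log-normal density.

```python
import numpy as np
from scipy.stats import norm, lognorm

# Change of variables for y = f(x) = exp(x), with x standard normal.
# p_y(y) = p_x(ln y) * |d(ln y)/dy| = p_x(ln y) / y, which is the log-normal density.
y = np.linspace(0.1, 5.0, 5)

via_jacobian = norm.pdf(np.log(y)) / y   # conservation of probability mass
reference = lognorm.pdf(y, s=1.0)        # SciPy's log-normal with sigma = 1

print(np.allclose(via_jacobian, reference))  # True
```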

References:
A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, Addison Wesley, New York, NY, 1994.
R. M. Neal, Bayesian Learning for Neural Networks, Springer, New York, NY, 1996.
T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, 1991.
The matrix reference manual (useful when we study Gaussian models): http://www.psi.toronto.edu/matrix/matrix.html