Introduction to Bayesian Statistics

School of Computing & Communication, UTS, January 2017

Random variables
Pre-university, a number is just a fixed value. When we talk about probabilities:
When X is a continuous random variable, it has a probability density function (pdf) p(x).
When X is a discrete random variable, it has a probability mass function (pmf) p(x).
p(X = x) means: the probability that the random variable X is equal to the fixed number x, e.g. the probability that the number of machine learning participants equals 20.

Mean or Expectation
Discrete case: μ = E(X) = (1/N) Σ_{i=1}^{N} x_i
Continuous case: μ = E(X) = ∫_{x∈S} x p(x) dx
We can also measure the expectation of a function of X: E(f(X)) = ∫_{x∈S} f(x) p(x) dx
For example, E(cos(X)) = ∫_{x∈S} cos(x) p(x) dx and E(X²) = ∫_{x∈S} x² p(x) dx.
What about f(E(X))? We will discuss this later when we meet Jensen's inequality in Expectation-Maximization.
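
As a small illustration in Python (the pmf values below are made up, not from the slides), here is how E(X), E(cos(X)) and E(X²) are computed as weighted sums, and how the sample average (1/N) Σ x_i approaches the same numbers when the samples are drawn from p(x):

    import numpy as np

    rng = np.random.default_rng(0)

    # An illustrative discrete pmf: X takes the values 1..4
    values = np.array([1, 2, 3, 4])
    pmf = np.array([0.1, 0.4, 0.3, 0.2])

    mu = np.sum(values * pmf)                  # E(X) = sum_x x p(x)
    E_cos = np.sum(np.cos(values) * pmf)       # E(cos(X))
    E_x2 = np.sum(values**2 * pmf)             # E(X^2)

    # Sample-average view: draw N samples from p(x) and average
    samples = rng.choice(values, size=100_000, p=pmf)
    print(mu, samples.mean())                  # (1/N) sum_i x_i -> E(X)
    print(E_cos, np.cos(samples).mean())
    print(E_x2, (samples**2).mean())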

Variances: an intuitive explanation
You have data X = {2, 3, 3, 2, 1, 4}, i.e. x_1 = 2, x_2 = 3, …, x_6 = 4.
You have the mean: μ = (2 + 3 + 3 + 2 + 1 + 4) / 6 = 2.5
The variance is then: VAR(data) = σ² = (1/N) Σ_{i=1}^{N} (x_i − μ)²
Division by N is intuitive: otherwise, more data would automatically imply more variance.
Also think about what kind of values VAR and σ can take; we will look later at what kind of distribution is required for them.

Two alternative expressions
People sometimes use the data form. With X = {2, 3, 3, 2, 1, 4}, i.e. x_1 = 2, x_2 = 3, …, x_6 = 4:
VAR(X) = σ² = (1/N) Σ_{i=1}^{N} (x_i − μ)²
       = (1/N) Σ_i x_i² − (2μ/N) Σ_i x_i + (1/N) Σ_i μ²
       = (1/N) Σ_i x_i² − 2μ · (1/N) Σ_i x_i + μ²        [note that (1/N) Σ_i x_i = μ]
       = (1/N) Σ_i x_i² − μ²
Other times, people use the distribution form. With X ∈ {1, 2, 3, 4} and P(X = 1) = 1/6, P(X = 2) = 2/6, P(X = 3) = 2/6, P(X = 4) = 1/6:
Discrete:   VAR(X) = σ² = Σ_{x∈X} (x − μ)² p(x)
Continuous: VAR(X) = σ² = ∫ (x − μ)² p(x) dx
VAR(X) = Σ_{x∈X} (x² − 2μx + μ²) p(x)
       = Σ_x x² p(x) − 2μ Σ_x x p(x) + μ² Σ_x p(x)
       = Σ_x x² p(x) − μ²        [since Σ_x x p(x) = μ and Σ_x p(x) = 1]
It is easy to verify that both forms give the same value.

Numerical example
First version: X = {2, 3, 3, 2, 1, 4}, i.e. x_1 = 2, x_2 = 3, …, x_6 = 4:
VAR(X) = (1/N) Σ_{i=1}^{N} (x_i − μ)²
       = [(2 − 2.5)² + (3 − 2.5)² + (3 − 2.5)² + (2 − 2.5)² + (1 − 2.5)² + (4 − 2.5)²] / 6 ≈ 0.92
Second version: X ∈ {1, 2, 3, 4} with P(X = 1) = 1/6, P(X = 2) = 2/6, P(X = 3) = 2/6, P(X = 4) = 1/6:
VAR(X) = σ² = Σ_{x∈X} (x − μ)² p(x)
       = (1 − 2.5)² · 1/6 + (2 − 2.5)² · 2/6 + (3 − 2.5)² · 2/6 + (4 − 2.5)² · 1/6 ≈ 0.92
Both versions give the same value.
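
A quick check in Python that the data form, the shortcut form (1/N) Σ x_i² − μ² and the distribution form all agree on the example above:

    import numpy as np

    x = np.array([2, 3, 3, 2, 1, 4])
    mu = x.mean()                                # 2.5

    var_data = np.mean((x - mu) ** 2)            # (1/N) sum_i (x_i - mu)^2
    var_shortcut = np.mean(x ** 2) - mu ** 2     # (1/N) sum_i x_i^2 - mu^2

    values = np.array([1, 2, 3, 4])
    pmf = np.array([1, 2, 2, 1]) / 6             # P(X = 1) .. P(X = 4)
    var_pmf = np.sum((values - mu) ** 2 * pmf)   # sum_x (x - mu)^2 p(x)

    print(var_data, var_shortcut, var_pmf)       # all ~ 0.9167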

An important fact about the variance
VAR(X) = E[(X − E(X))²] = ∫_{x∈S} (x − μ)² p(x) dx
       = ∫_{x∈S} x² p(x) dx − 2μ ∫_{x∈S} x p(x) dx + μ² ∫_{x∈S} p(x) dx
       = E(X²) − (E(X))²
Think of VAR(X) as the mean-subtracted second-order moment of the random variable X.

Joint distributions
The following table gives the joint distribution Pr(X, Y):

         Y = 0    Y = 1    Y = 2    Total
X = 0      0       3/15     3/15     6/15
X = 1     2/15     6/15      0       8/15
X = 2     1/15      0        0       1/15
Total     3/15     9/15     3/15      1

The table shows Pr(X, Y), i.e. Pr(X = x, Y = y). For example, Pr(X = 1, Y = 1) = 6/15.
Exercise: what is the probability that X = 2, Y = 1?
Exercise: what is the probability that X = 1, Y = 2?
Exercise: what is the value of Σ_{i=0}^{2} Σ_{j=0}^{2} Pr(X = i, Y = j)?

Marginal distributions
Using the sum rule on the joint table above, the marginal distribution is:
Pr(X) = Σ_{y∈S_y} Pr(X, y)    or    p(x) = ∫_{y∈S_y} p(x, y) dy
For example:
Pr(Y = 1) = Σ_{i=0}^{2} p(X = i, Y = 1) = 3/15 + 6/15 + 0 = 9/15
Exercise: what are Pr(X = 2) and Pr(X = 1)?

Conditional distributions
Still using the joint table above, the conditional density is:
p(X | Y) = p(X, Y) / p(Y) = p(Y | X) p(X) / p(Y) = p(Y | X) p(X) / Σ_X p(Y | X) p(X)
What about p(X | Y = y)? Pick an example:
p(X = 1 | Y = 1) = p(X = 1, Y = 1) / p(Y = 1) = (6/15) / (9/15) = 2/3
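
A minimal Python sketch of these sum-rule and conditioning operations on the joint table above, stored as a numpy array with rows indexed by X and columns by Y:

    import numpy as np

    # Joint table Pr(X, Y): rows X = 0, 1, 2; columns Y = 0, 1, 2
    P = np.array([[0, 3, 3],
                  [2, 6, 0],
                  [1, 0, 0]]) / 15

    P_X = P.sum(axis=1)                 # marginal Pr(X): sum over Y
    P_Y = P.sum(axis=0)                 # marginal Pr(Y): sum over X
    print(P_Y[1])                       # Pr(Y = 1) = 9/15

    P_X_given_Y1 = P[:, 1] / P_Y[1]     # Pr(X | Y = 1) = Pr(X, Y = 1) / Pr(Y = 1)
    print(P_X_given_Y1)                 # [1/3, 2/3, 0]; Pr(X = 1 | Y = 1) = 2/3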

Conditional distributions: Exercise
The formula for the conditional density (same joint table as above):
p(X | Y) = p(X, Y) / p(Y) = p(Y | X) p(X) / p(Y) = p(Y | X) p(X) / Σ_X p(Y | X) p(X)
Exercise: what is p(X = 2 | Y = 1)?
Exercise: what is p(X = 1 | Y = 2)?

Independence
If X and Y are independent: p(X | Y) = p(X) and p(X, Y) = P(X) P(Y).
The two statements are equivalent: when X and Y are independent,
p(X | Y) = p(X, Y) / p(Y) = p(X) p(Y) / p(Y) = p(X)
In the joint table we have been using, X and Y are NOT independent: for example Pr(X = 0, Y = 0) = 0, while Pr(X = 0) Pr(Y = 0) = 6/15 · 3/15 = 18/225.
The following table, built from the products of the marginals, does describe independent X and Y:

         Y = 0     Y = 1     Y = 2     Total
X = 0    18/225    54/225    18/225     6/15
X = 1    24/225    72/225    24/225     8/15
X = 2     3/225     9/225     3/225     1/15
Total     3/15      9/15      3/15       1
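
The check can be written in a few lines of Python: compare the joint table with the outer product of its marginals (the second table above is exactly that outer product):

    import numpy as np

    P = np.array([[0, 3, 3],
                  [2, 6, 0],
                  [1, 0, 0]]) / 15        # the original joint table

    P_X = P.sum(axis=1)
    P_Y = P.sum(axis=0)
    product = np.outer(P_X, P_Y)          # what the joint would be under independence

    print(np.allclose(P, product))        # False: X and Y are NOT independent
    print(product * 225)                  # rows [18, 54, 18], [24, 72, 24], [3, 9, 3] over 225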

Conditional Independence
Imagine we have three random variables X, Y and Z. Suppose that once we know Z, knowing Y does NOT give us any additional information about X. Then:
Pr(X | Y, Z) = Pr(X | Z)
This means that X is conditionally independent of Y given Z.
If Pr(X | Y, Z) = Pr(X | Z), then what about Pr(X, Y | Z)?
Pr(X, Y | Z) = Pr(X, Y, Z) / Pr(Z)
             = Pr(X | Y, Z) Pr(Y, Z) / Pr(Z)
             = Pr(X | Y, Z) Pr(Y | Z)
             = Pr(X | Z) Pr(Y | Z)
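
A small Python sketch of this, with made-up numbers: build a joint p(x, y, z) = p(z) p(x | z) p(y | z), so that X and Y are conditionally independent given Z by construction, and then verify both identities numerically:

    import numpy as np

    p_z = np.array([0.6, 0.4])                   # p(z)
    p_x_given_z = np.array([[0.9, 0.1],          # rows: z, columns: x
                            [0.3, 0.7]])
    p_y_given_z = np.array([[0.2, 0.8],          # rows: z, columns: y
                            [0.5, 0.5]])

    # p(x, y, z) = p(z) p(x | z) p(y | z)
    p_xyz = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

    p_xy_given_z = p_xyz / p_xyz.sum(axis=(0, 1))    # p(x, y | z)
    p_x_given_yz = p_xyz / p_xyz.sum(axis=0)         # p(x | y, z)

    # Pr(X | Y, Z) = Pr(X | Z) for every y and z
    for z in range(2):
        for y in range(2):
            print(np.allclose(p_x_given_yz[:, y, z], p_x_given_z[z]))   # True

    # Pr(X, Y | Z) = Pr(X | Z) Pr(Y | Z)
    print(np.allclose(p_xy_given_z,
                      np.einsum('zx,zy->xyz', p_x_given_z, p_y_given_z)))  # True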

An example of Conditional Independence
We will study dynamic models later. Consider the chain of hidden states x_{t−1} → x_t → x_{t+1}, where each observation y_{t−1}, y_t, y_{t+1} depends only on the hidden state at the same time step.
From this model we can see that:
p(x_t | x_1, …, x_{t−1}, y_1, …, y_{t−1}) = p(x_t | x_{t−1})
p(y_t | x_1, …, x_{t−1}, x_t, y_1, …, y_{t−1}) = p(y_t | x_t)
For now, think of conditional independence like this: a given variable is the only item that blocks the path between two (or more) other variables.

Another Example: Bayesian Linear Regression
We have data pairs:
Input: X = x_1, …, x_N
Output: Y = y_1, …, y_N
Each pair x_i, y_i is related through the model equation:
y_i = f(x_i; w) + ε_i,    ε_i ∼ N(0, σ²)
The input alone is not going to tell you the model parameter: p(w | X) = p(w)
The output alone is not going to tell you the model parameter: p(w | Y) = p(w)
Obviously: p(w | X, Y) ≠ p(w)
Posterior over the parameter w:
p(w | X, Y) = p(Y | w, X) p(w | X) p(X) / (p(Y | X) p(X))
            = p(Y | w, X) p(w) / p(Y | X)
            = p(Y | w, X) p(w) / ∫_w p(Y | X, w) p(w) dw
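
As a minimal sketch of what this posterior looks like in practice, here is the standard conjugate Gaussian case in Python. Everything specific below is an illustrative assumption, not from the slides: a linear model f(x; w) = w_0 + w_1 x, a prior w ∼ N(0, α⁻¹ I) and a known noise variance σ², for which the posterior p(w | X, Y) is Gaussian with a closed-form mean and covariance:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative setup: f(x; w) = w0 + w1 * x
    true_w = np.array([-0.3, 0.5])
    sigma2 = 0.04                 # assumed known noise variance
    alpha = 2.0                   # prior precision, w ~ N(0, (1/alpha) I)

    x = rng.uniform(-1, 1, size=20)
    Phi = np.column_stack([np.ones_like(x), x])       # design matrix
    y = Phi @ true_w + rng.normal(0, np.sqrt(sigma2), size=x.shape)

    # Conjugate Gaussian posterior p(w | X, Y) = N(m, S)
    S_inv = alpha * np.eye(2) + Phi.T @ Phi / sigma2  # posterior precision
    S = np.linalg.inv(S_inv)
    m = S @ Phi.T @ y / sigma2                        # posterior mean

    print("posterior mean:", m)
    print("posterior covariance:\n", S)

With more data the posterior mean moves towards the weights that generated the data and the posterior covariance shrinks, which is exactly the statement p(w | X, Y) ≠ p(w) above.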

Expectation of Joint probabilities
Given that (X, Y) is a two-dimensional random variable:
Continuous case: E[f(X, Y)] = ∫_{y∈S_y} ∫_{x∈S_x} f(x, y) p(x, y) dx dy
Discrete case:   E[f(X, Y)] = Σ_{i=1}^{N_i} Σ_{j=1}^{N_j} f(X = i, Y = j) p(X = i, Y = j)

Numerical example
Take X, Y ∈ {1, 2, 3} with the joint pmf p(X, Y):

         Y = 1    Y = 2    Y = 3
X = 1      0       1/5      1/5
X = 2     1/5      1/5       0
X = 3     1/5       0        0

and a second table giving the values of a function f(X, Y). Then
E[f(X, Y)] = Σ_{i=1}^{3} Σ_{j=1}^{3} f(X = i, Y = j) p(X = i, Y = j),
i.e. multiply the two tables cell by cell and add up the nine products. Cells where p(X, Y) = 0 contribute nothing; for example, with f(1, 2) = 7, f(1, 3) = 8 and f(2, 1) = 2, those three cells contribute 7 · 1/5 + 8 · 1/5 + 2 · 1/5, while the cell (X = 1, Y = 1) contributes 0.
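
In Python the whole computation is one elementwise product followed by a sum. The f table below reuses the values mentioned above where they are known; the remaining entries are made-up placeholders for illustration:

    import numpy as np

    # p(X, Y) from the table above (X, Y in {1, 2, 3})
    p = np.array([[0, 1, 1],
                  [1, 1, 0],
                  [1, 0, 0]]) / 5

    # f(X, Y); the 1s are placeholder values
    f = np.array([[1, 7, 8],
                  [2, 1, 2],
                  [1, 8, 1]])

    E_f = np.sum(f * p)          # multiply cell by cell, then sum
    print(E_f)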

Conditional Expectation
It is a useful property for later (the law of total expectation):
E(Y) = ∫_X E(Y | X) p(X) dx
Proof sketch:
∫_X E(Y | X) p(X) dx = ∫_X ( ∫_Y y p(y | X) dy ) p(X) dx
                     = ∫_Y y ( ∫_X p(y, X) dx ) dy
                     = ∫_Y y p(y) dy
                     = E(Y)
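
A quick numerical check of the discrete version, E(Y) = Σ_x E(Y | X = x) p(x), on the joint table used earlier (Y takes the values 0, 1, 2):

    import numpy as np

    P = np.array([[0, 3, 3],
                  [2, 6, 0],
                  [1, 0, 0]]) / 15                    # joint Pr(X, Y) from the earlier slides
    y_vals = np.array([0, 1, 2])

    P_X = P.sum(axis=1)
    P_Y = P.sum(axis=0)

    E_Y_direct = np.sum(y_vals * P_Y)                 # E(Y) = sum_y y p(y)

    E_Y_given_X = (P * y_vals).sum(axis=1) / P_X      # E(Y | X = x) for each x
    E_Y_tower = np.sum(E_Y_given_X * P_X)             # sum_x E(Y | X = x) p(x)

    print(E_Y_direct, E_Y_tower)                      # both 1.0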

Bayesian Predictive distribution
Let us put the marginal distribution and conditional independence to the test.
Very often in machine learning you want to compute the probability of new data y given training data Y, i.e. p(y | Y). You assume there is some model that explains both Y and y, and the model parameter is θ:
p(y | Y) = ∫_θ p(y | θ) p(θ | Y) dθ
Exercise: tell me why the above works.
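
A minimal sketch of this integral in Python, for an assumed Beta-Bernoulli model that is not from the slides: the training data Y are coin flips, θ is the head probability, the prior is Beta(1, 1), and the predictive probability of a new head is computed both in closed form and by Monte Carlo over the posterior:

    import numpy as np

    rng = np.random.default_rng(1)

    Y = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # observed flips: 6 heads, 2 tails
    a0, b0 = 1.0, 1.0                          # Beta(1, 1) prior on theta

    # Posterior p(theta | Y) is Beta(a0 + heads, b0 + tails)
    a, b = a0 + Y.sum(), b0 + len(Y) - Y.sum()

    # Predictive p(y* = 1 | Y) = integral of p(y* = 1 | theta) p(theta | Y) dtheta
    exact = a / (a + b)                        # closed form: the posterior mean of theta

    # Same integral by Monte Carlo: sample theta from the posterior and
    # average the likelihood of y* = 1 under each sample
    theta_samples = rng.beta(a, b, size=100_000)
    mc = theta_samples.mean()

    print("exact predictive:", exact, " Monte Carlo:", mc)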

Revisit Bayes' Theorem
Instead of using arbitrary random variable symbols, we now use θ for the model parameter and X = x_1, …, x_n for the dataset:
p(θ | X) = p(X | θ) p(θ) / p(X) = p(X | θ) p(θ) / ∫_θ p(X | θ) p(θ) dθ
where p(θ | X) is the posterior, p(X | θ) is the likelihood, p(θ) is the prior and p(X) is the normalization constant.

An Intrusion Detection System (IDS) Example
The setting: imagine that out of all the TCP connections (say millions), 1% are intrusions.
When there is an intrusion, the probability that the system sends an alarm is 87%.
When there is no intrusion, the probability that the system sends an alarm is 6%.
Prior probability (1% of connections are intrusions):
p(θ = intrusion) = 0.01
p(θ = no intrusion) = 0.99
Likelihood: given that an intrusion occurs, the probability that the system sends an alarm is 87%:
p(X = alarm | θ = intrusion) = 0.87
p(X = no alarm | θ = intrusion) = 0.13
Given that there is no intrusion, the probability that the system sends an alarm is 6%:
p(X = alarm | θ = no intrusion) = 0.06
p(X = no alarm | θ = no intrusion) = 0.94

Posterior
We are interested in the posterior probability Pr(θ | X).
There are two possible values for the parameter θ and two possible observations X, so there are four quantities to compute:
True Positive: when the system sends an alarm, the probability that an intrusion occurs: Pr(θ = intrusion | X = alarm)
False Positive: when the system sends an alarm, the probability that there is no intrusion: Pr(θ = no intrusion | X = alarm)
True Negative: when the system sends no alarm, the probability that there is no intrusion: Pr(θ = no intrusion | X = no alarm)
False Negative: when the system sends no alarm, the probability that an intrusion occurs: Pr(θ = intrusion | X = no alarm)
Question: which are the two probabilities you would like to maximise?

Apply Bayes' Theorem in this setting
Pr(θ | X) = Pr(X | θ) Pr(θ) / Σ_θ Pr(X | θ) Pr(θ)
          = Pr(X | θ) Pr(θ) / [ Pr(X | θ = intrusion) Pr(θ = intrusion) + Pr(X | θ = no intrusion) Pr(θ = no intrusion) ]

Apply Bayes' Theorem in this setting
True Positive: when the system sends an alarm, what is the probability that an intrusion occurs?
Pr(θ = intrusion | X = alarm)
  = Pr(X = alarm | θ = intrusion) Pr(θ = intrusion) / [ Pr(X = alarm | θ = intrusion) Pr(θ = intrusion) + Pr(X = alarm | θ = no intrusion) Pr(θ = no intrusion) ]
  = 0.87 · 0.01 / (0.87 · 0.01 + 0.06 · 0.99)
  ≈ 0.1278
False Positive: when the system sends an alarm, what is the probability that there is no intrusion?
Pr(θ = no intrusion | X = alarm)
  = Pr(X = alarm | θ = no intrusion) Pr(θ = no intrusion) / [ Pr(X = alarm | θ = intrusion) Pr(θ = intrusion) + Pr(X = alarm | θ = no intrusion) Pr(θ = no intrusion) ]
  = 0.06 · 0.99 / (0.87 · 0.01 + 0.06 · 0.99)
  ≈ 0.8722

Apply Bayes' Theorem in this setting
False Negative: when the system sends no alarm, what is the probability that an intrusion occurs?
Pr(θ = intrusion | X = no alarm)
  = Pr(X = no alarm | θ = intrusion) Pr(θ = intrusion) / [ Pr(X = no alarm | θ = intrusion) Pr(θ = intrusion) + Pr(X = no alarm | θ = no intrusion) Pr(θ = no intrusion) ]
  = 0.13 · 0.01 / (0.13 · 0.01 + 0.94 · 0.99)
  ≈ 0.0014
True Negative: when the system sends no alarm, what is the probability that there is no intrusion?
Pr(θ = no intrusion | X = no alarm)
  = Pr(X = no alarm | θ = no intrusion) Pr(θ = no intrusion) / [ Pr(X = no alarm | θ = intrusion) Pr(θ = intrusion) + Pr(X = no alarm | θ = no intrusion) Pr(θ = no intrusion) ]
  = 0.94 · 0.99 / (0.13 · 0.01 + 0.94 · 0.99)
  ≈ 0.9986
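
The four posterior probabilities can be reproduced in a few lines of Python using the priors and likelihoods given above:

    # Priors and likelihoods from the IDS example
    p_intrusion = 0.01
    p_no_intrusion = 0.99
    p_alarm_given_intrusion = 0.87
    p_alarm_given_no_intrusion = 0.06

    p_alarm = (p_alarm_given_intrusion * p_intrusion
               + p_alarm_given_no_intrusion * p_no_intrusion)            # Pr(X = alarm)
    p_no_alarm = 1 - p_alarm

    tp = p_alarm_given_intrusion * p_intrusion / p_alarm                 # ~0.1278
    fp = p_alarm_given_no_intrusion * p_no_intrusion / p_alarm           # ~0.8722
    fn = (1 - p_alarm_given_intrusion) * p_intrusion / p_no_alarm        # ~0.0014
    tn = (1 - p_alarm_given_no_intrusion) * p_no_intrusion / p_no_alarm  # ~0.9986

    print(tp, fp, fn, tn)

Note that even with an 87% detection rate, most alarms are false alarms, simply because intrusions are rare; this is why the prior matters.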

A statistician's way to think about Posterior Inference
Posterior inference can be framed as finding the best q(θ) ∈ Q to approximate p(θ | X), in the sense of:
inf_{q(θ)∈Q} { KL(q(θ) || p(θ)) − E_{θ∼q(θ)}[ln p(X | θ)] }
= inf_{q(θ)∈Q} { ∫_θ q(θ) ln [q(θ) / p(θ)] dθ − ∫_θ q(θ) ln p(X | θ) dθ }
= inf_{q(θ)∈Q} { ∫_θ q(θ) [ ln q(θ) − (ln p(θ) + ln p(X | θ)) ] dθ }
= inf_{q(θ)∈Q} { ∫_θ q(θ) ln [ q(θ) / (p(θ) p(X | θ)) ] dθ }
= inf_{q(θ)∈Q} { ∫_θ q(θ) ln [ q(θ) / (p(X) p(θ | X)) ] dθ }        [since p(θ) p(X | θ) = p(X) p(θ | X)]
= inf_{q(θ)∈Q} { KL(q(θ) || p(θ | X)) − ln p(X) }
= inf_{q(θ)∈Q} { KL(q(θ) || p(θ | X)) }        [ln p(X) does not depend on q]
So minimizing this objective is the same as minimizing the KL divergence from q(θ) to the posterior p(θ | X).
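
A tiny Python check of this identity, with a made-up discrete parameter θ that takes three values: the objective KL(q || p(θ)) − E_q[ln p(X | θ)] differs from KL(q || p(θ | X)) only by the constant −ln p(X), and is minimised exactly at q = p(θ | X):

    import numpy as np

    prior = np.array([0.5, 0.3, 0.2])          # p(theta)
    lik = np.array([0.10, 0.40, 0.70])         # p(X | theta) for the observed X

    evidence = np.sum(lik * prior)             # p(X)
    posterior = lik * prior / evidence         # p(theta | X)

    def objective(q):
        # KL(q || prior) - E_q[ln p(X | theta)]
        return np.sum(q * np.log(q / prior)) - np.sum(q * np.log(lik))

    def kl_to_posterior(q):
        return np.sum(q * np.log(q / posterior))

    q = np.array([0.2, 0.3, 0.5])              # some candidate q(theta)
    print(objective(q), kl_to_posterior(q) - np.log(evidence))   # identical
    print(objective(posterior), -np.log(evidence))               # the minimum, at q = posterior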