Exercise 1: Basics of probability calculus

Stig-Arne Grönroos, Department of Signal Processing and Acoustics, Aalto University, School of Electrical Engineering. stig-arne.gronroos@aalto.fi [21.01.2016]

Ex 1.1: Conditional probability. Unconditional probability: P(A). Conditional probability: P(A | B). Joint probability: P(A, B). Chain rule: P(A, B) = P(A) P(B | A).

Ex 1.1: Conditional probability. The following probabilities might apply to English: P(word is abbreviation | word has three letters) = 0.8 and P(word has three letters) = 0.0003. What is the probability that an observed word is a three-letter abbreviation?

Ex 1.1: Conditional probability. What is the probability that an observed word is a three-letter abbreviation? We want P(word is abbreviation, word has three letters): first, how probable it is for a word to be three letters long, then how probable it is to be an abbreviation given that it has three letters:
P(three letters) P(abbreviation | three letters) = 0.0003 · 0.8 = 0.00024
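As a quick check, the chain-rule computation above can be written out in code (a minimal sketch; the probabilities are the ones given in the exercise):

```python
# Chain rule: P(A, B) = P(B) P(A | B)
p_three_letters = 0.0003           # P(word has three letters)
p_abbrev_given_three = 0.8         # P(word is abbreviation | three letters)

p_three_letter_abbrev = p_three_letters * p_abbrev_given_three
print(p_three_letter_abbrev)       # approximately 0.00024
```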

Ex 1.2: Bayes Theorem. A stemming program for English can determine whether the stem for "does" is 1. the verb "do" or 2. the noun "doe" (female deer). The stem "do" is much more common: only 1/1000 of the time "does" is an inflection of "doe". The program is returning does → doe. What is the probability that the program is correct?

Ex 1.2: Bayes Theorem. Confusion matrix P(R = C_i | T = C_j), with C_1 = do and C_2 = doe:
                    True T = C_1    True T = C_2
  Result R = C_1        0.95            0.05
  Result R = C_2        0.05            0.95
Priors: P(T = C_1) = 0.999, P(T = C_2) = 0.001.

Ex 1.2: Bayes Theorem. We have P(R = C_i | T = C_j), but we want P(T = C_j | R = C_i). Bayes' theorem:
P(B_j | A) = P(A | B_j) P(B_j) / P(A) = P(A | B_j) P(B_j) / Σ_i P(A | B_i) P(B_i)
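Applying the formula to the numbers above gives the answer to the exercise question (a small sketch, not part of the original slides):

```python
# Prior over the true stem T and the confusion matrix P(R | T)
p_t = {"do": 0.999, "doe": 0.001}
p_r_given_t = {                    # p_r_given_t[result][true_stem]
    "do":  {"do": 0.95, "doe": 0.05},
    "doe": {"do": 0.05, "doe": 0.95},
}

def posterior(true_stem, result):
    """P(T = true_stem | R = result) via Bayes' theorem."""
    evidence = sum(p_r_given_t[result][t] * p_t[t] for t in p_t)
    return p_r_given_t[result][true_stem] * p_t[true_stem] / evidence

# Probability that the program is correct when it returns doe for does
print(posterior("doe", "doe"))     # approximately 0.019
```

Even though the program is right 95% of the time on each class, the strong prior towards do means that a doe answer is correct only about 2% of the time.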

Ex 1.2: Bayes Theorem. Practical when you need to turn a conditional probability around =) Forms the basis for Bayesian statistics, where you have a prior probability P(B) and some new observations A, which you want to combine into a posterior probability P(B | A).

Ex 1.3: Zipf's law. Sort the words so that the most common word comes first (r = 1). Also include how many times each word occurred in the text (f). Zipf alleges that f ∝ 1/r, i.e. f · r remains constant. Does this apply to a randomly generated language, which has 30 letters including the word boundary?

Ex 1.3: Zipf's law.
A particular one-letter word: generate two symbols, a word boundary after something else. There are 29 words of this kind. P(s = t_1) = (1/30) · (1/30).
A particular two-letter word: there are 29² words of this kind. P(s = t_1, t_2) = (1/30) · (1/30) · (1/30).
A particular three-letter word: there are 29³ words of this kind. P(s = t_1, t_2, t_3) = (1/30) · (1/30) · (1/30) · (1/30).

Ex 1.3: Zipf's law.
Table: Zipf constant for the random language. r: the ranking number when sorted by frequency; f: expected occurrence count in a text of 1,000,000 words; k = r · f.
  r            f           k
  15           1111        16111
  450          37.04       16648
  13064        1.235       16129
  378900       0.0412      15593
  10988000     0.00137     15073
  318660000    0.0000457   14570
Even for a random language, k remains quite constant for a large range of r.
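The table can be reproduced with a short script (a minimal sketch, assuming a 30-symbol alphabet including the word boundary, a text of 1,000,000 word tokens, and ranks taken at the midpoint of each word-length group):

```python
def zipf_random_language(max_len=6, alphabet=30, n_tokens=1_000_000):
    """Expected rank r, frequency f and k = r * f for a random language."""
    letters = alphabet - 1                     # 29 non-boundary symbols
    rows, rank_before = [], 0
    for length in range(1, max_len + 1):
        n_words = letters ** length            # distinct words of this length
        prob = (1 / alphabet) ** (length + 1)  # prob of one particular word
        f = n_tokens * prob                    # expected count in the text
        r = rank_before + n_words / 2          # midpoint rank of this group
        rows.append((r, f, r * f))
        rank_before += n_words
    return rows

for r, f, k in zipf_random_language():
    print(f"r = {r:11.0f}   f = {f:12.5f}   k = {k:7.0f}")
```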

Ex 1.3: Zipf's law. [Figure: "Zipf's law for a randomly generated language". k = frequency · rank as a function of rank (ranks from 10^0 to 10^15). For a randomly generated language with 30 letters, k is roughly constant.]

Ex 1.3: Zipf's law. [Figure: Logarithmic plot of ranks (x-axis) and frequencies (y-axis) for a Finnish corpus of 32 million words.]

Ex 1.3: Zipf's law. The power-law distribution is highly peaked at low frequencies. Already a quite short list of the most frequent words gives good coverage of the tokens. The long tail of infrequent words contains most of the interesting content words.

Ex 1.4: Central limit theorem. [Figure: Expectation as the center of probability mass. CC BY-SA 3.0 Erzbischof.]

Ex 1.4: Central limit theorem.
Expectation and variance:
E(x) = ∫ x p(x) dx   (1)
Var(x) = ∫ (x − E(x))² p(x) dx   (2)
Expectation and variance of a sum of independent random variables:
E(x + y) = E(x) + E(y)   (3)
Var(x + y) = Var(x) + Var(y)   (4)
Variance of a random variable multiplied by a constant:
Var(ax) = a² Var(x)   (5)

Ex 1.4: Central limit theorem. Throwing one 101-sided die with equal probability p(x) = 1/101 for each face. Expectation:
E(x) = Σ_{i=0}^{100} i p(x = i)
     = (1/101) (1 + 2 + 3 + 4 + ... + 100)
     = (1/101) ((1 + 100) + (2 + 99) + (3 + 98) + ... + (50 + 51))
     = (50 · 101) / 101
     = 50   (6)

Ex 1.4: Central limit theorem. Variance:
Var(x) = Σ_{i=0}^{100} (i − E(x))² p(x = i)
       = (1/101) (50² + 49² + ... + 1 + 0 + 1 + 2² + ... + 49² + 50²)
       = (2/101) (1 + 2² + ... + 49² + 50²)
Now we can use the formula 1 + 2² + 3² + ... + n² = n(n + 1)(2n + 1) / 6 to get the result
Var(x) = (2/101) · (50 · 51 · 101) / 6 = 850   (7)
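The closed-form values above can be double-checked numerically; a brief sketch for the 101-sided die with faces 0 to 100:

```python
faces = range(101)                 # die faces 0, 1, ..., 100
p = 1 / 101                        # uniform probability of each face

expectation = sum(i * p for i in faces)
variance = sum((i - expectation) ** 2 * p for i in faces)

print(expectation, variance)       # approximately 50.0 and 850.0
```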

Ex 1.4: Central limit theorem. Calculate the expectation of the average of two dice, (x + y)/2:
E((x + y)/2) = (1/2) (E(x) + E(y)) = (1/2) (50 + 50) = 50
The expectation does not change. What about the variance, then?
Var((x + y)/2) = Var(x/2) + Var(y/2) = (1/4) Var(x) + (1/4) Var(y) = (1/4) (850 + 850) = 425

Ex 1.4: Central limit theorem. We throw ten dice. Extending the previous solutions:
E((x_1 + x_2 + ... + x_10) / 10) = (1/10) · 10 · 50 = 50
Var((x_1 + x_2 + ... + x_10) / 10) = (1/100) · 10 · 850 = 85
As we throw even more dice, the distribution will sharpen around the expectation. At the infinite limit, the expectation is 50 and the variance 0, which means that we will always get a result of 50.

Ex 1.4: Central limit theorem. [Figure: Throwing dice. Histograms of the average of 1, 2, 3, 5, 10 and 100 dice; the throw was simulated 1 million times for each curve.]
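The simulation behind the figure can be sketched roughly as follows (assuming NumPy, 101-sided dice with faces 0 to 100, and 1 million simulated throws per curve; summary statistics are printed instead of drawing the histograms):

```python
import numpy as np

rng = np.random.default_rng(0)
n_throws = 1_000_000

for n_dice in (1, 2, 3, 5, 10, 100):
    # Average of n_dice independent 101-sided dice (faces 0..100),
    # accumulated one die at a time to keep memory use modest.
    totals = np.zeros(n_throws)
    for _ in range(n_dice):
        totals += rng.integers(0, 101, size=n_throws)
    averages = totals / n_dice
    print(f"{n_dice:3d} dice: mean = {averages.mean():6.2f}, "
          f"variance = {averages.var():7.2f}")
# The mean stays near 50, while the variance shrinks roughly as 850 / n_dice.
```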

Ex 1.4: Central limit theorem. Here we used the uniform distribution, but the CLT applies to most reasonable cases (i.i.d. variables with finite expectation and variance). The normal distribution (Gaussian) is common in nature, and easy to use in closed-form solutions. Related topics: normal distribution approximation; sample size in scientific experiments.

Ex 1.5: Minimum description length. Agree in advance on a model class that could generate the data. For our message, select a particular model by setting θ. Use that model to compress the data. The receiver does not know the θ we chose, so we must send that too.

Ex 1.5: Minimum description length. Description length (DL) to encode a message with probability p(i): −log p(i) bits. Correspondence between two-part coding and Bayesian quantities:
  Model parameters                                θ
  DL to encode the parameters, L(θ)               Prior distribution p(θ)
  DL to encode the data given θ, L(x | θ)         Likelihood (a function of θ) p(x | θ)
  Total code length L(x, θ) = L(θ) + L(x | θ)     Posterior p(θ | x)

Ex 1.5: Minimum description length. Show that the optimal selection of the parameters in the two-part coding scheme equals Maximum A Posteriori estimation.
Minimizing the total message length: θ̂ = arg min_θ L(x, θ)
...
Maximizing the posterior of θ after observing x: θ̂ = arg max_θ p(θ | x)
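One way to fill in the elided steps, using the code-length correspondence L(·) = −log p(·) from the previous slide (a sketch, not the original worked solution):

```latex
\begin{align*}
\hat{\theta} &= \arg\min_\theta L(x, \theta)
              = \arg\min_\theta \bigl[ L(\theta) + L(x \mid \theta) \bigr] \\
             &= \arg\min_\theta \bigl[ -\log p(\theta) - \log p(x \mid \theta) \bigr]
              = \arg\max_\theta \log \bigl[ p(x \mid \theta)\, p(\theta) \bigr] \\
             &= \arg\max_\theta \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
              = \arg\max_\theta p(\theta \mid x)
\end{align*}
% Dividing by p(x) in the last step is allowed because p(x) does not depend on theta.
```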

Ex 1.5: Minimum description length. In compression, statistical regularities are used for compressing data. In MDL-flavored machine learning, compression is used for finding statistical regularities.