CS 591, Lecture 2 Data Analytics: Theory and Applications Boston University


CS 591, Lecture 2 Data Analytics: Theory and Applications Boston University Charalampos E. Tsourakakis January 25th, 2017

Probability Theory

Probability Theory
"The theory of probability is a system for making better guesses." (The Feynman Lectures on Physics, http://www.feynmanlectures.caltech.edu/i_06.html)

"By the probability of a particular outcome of an observation we mean our estimate for the most likely fraction of a number of repeated observations that will yield that particular outcome." (http://www.feynmanlectures.caltech.edu/i_06.html)
p(A) = N_A / N

Inclusion-Exclusion theorem
Theorem. Suppose n ∈ ℕ and A_i is a finite set for 1 ≤ i ≤ n. Then
|∪_{i=1}^n A_i| = Σ_{1 ≤ i_1 ≤ n} |A_{i_1}| − Σ_{1 ≤ i_1 < i_2 ≤ n} |A_{i_1} ∩ A_{i_2}| + Σ_{1 ≤ i_1 < i_2 < i_3 ≤ n} |A_{i_1} ∩ A_{i_2} ∩ A_{i_3}| − ... + (−1)^{n+1} |∩_{i=1}^n A_i|.
Application (a.k.a. the matching hat problem): deal two packs of shuffled cards simultaneously. What is the probability that no pair of identical cards is ever exposed simultaneously?

Inclusion-Exclusion theorem
Fix the first pack. Let A_i be the set of all arrangements of the second pack that match the card in position i of the first pack, and let X = ∪_i A_i. Details on whiteboard.
|X|/52! = (1/52!) ( C(52,1)·51! − C(52,2)·50! + C(52,3)·49! − ... − C(52,52)·0! ) = 1 − 1/2! + 1/3! − ... − 1/52! ≈ 1 − Σ_{i=0}^∞ (−1)^i/i! = 1 − 1/e.
Thus the desired probability (no simultaneous match) tends to 1/e as n → +∞.
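
As a quick numerical sanity check (not part of the slides), here is a minimal Monte Carlo sketch of the matching-hat calculation; the function name and trial count are ours.

```python
import numpy as np

def prob_no_match(n_cards=52, trials=100_000, seed=0):
    """Estimate the probability that a random shuffle of n_cards
    leaves no card in its original position (no simultaneous match)."""
    rng = np.random.default_rng(seed)
    positions = np.arange(n_cards)
    no_match = 0
    for _ in range(trials):
        shuffled = rng.permutation(n_cards)
        if not np.any(shuffled == positions):
            no_match += 1
    return no_match / trials

print(prob_no_match())   # roughly 0.368
print(1 / np.e)          # 0.36787944...
```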

Fundamental Rules
Pr[X ∪ Y] = Pr[X] + Pr[Y] − Pr[X ∩ Y]   (1)
Sum rule: Pr[X] = Σ_y Pr[X, Y = y] = Σ_y Pr[X | Y = y] Pr[Y = y]   (2)
Product rule: Pr[X, Y] = Pr[X ∩ Y] = Pr[X] Pr[Y | X] = Pr[Y] Pr[X | Y]   (3)
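
To make the sum and product rules concrete, here is a small sketch over a hypothetical 2x2 joint table; the numbers are made up for illustration.

```python
import numpy as np

# A hypothetical joint distribution Pr[X, Y] as a table:
# rows index values of X, columns index values of Y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# Sum rule (eq. 2): marginalize out the other variable.
p_x = joint.sum(axis=1)          # Pr[X]
p_y = joint.sum(axis=0)          # Pr[Y]

# Product rule (eq. 3): Pr[X, Y] = Pr[Y] Pr[X | Y].
p_x_given_y = joint / p_y        # column j holds Pr[X | Y = y_j]
assert np.allclose(p_x_given_y * p_y, joint)

# Sum rule, second form: Pr[X] = sum_y Pr[X | Y = y] Pr[Y = y].
assert np.allclose((p_x_given_y * p_y).sum(axis=1), p_x)
print(p_x, p_y)
```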

Fundamental Rules
By applying the product rule multiple times we obtain the chain rule:
Pr[X_1, X_2, ..., X_n] = Pr[X_1] Pr[X_2, ..., X_n | X_1] = ... = Pr[X_1] Pr[X_2 | X_1] Pr[X_3 | X_2, X_1] ⋯ Pr[X_n | X_1, ..., X_{n−1}]   (4) Chain rule
Conditional probability: Pr[X | Y] = Pr[X ∩ Y] / Pr[Y]   (5)

Reminder: Bayes rule
Bayes rule is a direct application of conditional probability:
Pr[H | D] = Pr[D | H] Pr[H] / Pr[D], where Pr[D] > 0, or, in words, posterior ∝ likelihood × prior.
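
A minimal worked example of Bayes' rule, with hypothetical numbers chosen only for illustration (a rare hypothesis and a fairly accurate test):

```python
# Hypothetical numbers: a test with 99% sensitivity, 95% specificity,
# and a 1% prior on the hypothesis H.
p_h = 0.01                      # prior Pr[H]
p_d_given_h = 0.99              # likelihood Pr[D | H]
p_d_given_not_h = 0.05          # false-positive rate Pr[D | not H]

# Pr[D] by the sum rule, then Bayes' rule for the posterior.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)              # about 0.167: posterior ∝ likelihood × prior
```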

Independence and Conditional Independence
We say X and Y are unconditionally independent, marginally independent, or just independent if Pr[X | Y] = Pr[X] and Pr[Y | X] = Pr[Y]. As a result, Pr[X, Y] = Pr[X] Pr[Y]. Notation: X ⊥ Y.

Independence and Conditional Independence
Source: http://colah.github.io/posts/2015-09-Visual-Information/

Independence and Conditional Independence
(Figure slides: independence visualized; closer to reality; ... or alternatively ...)

Independence and Conditional Independence
We say X and Y are conditionally independent given Z if Pr[X, Y | Z] = Pr[X | Z] Pr[Y | Z]. The joint distribution then factorizes as Pr[X, Y, Z] = Pr[X | Z] Pr[Y | Z] Pr[Z]. Notation: X ⊥ Y | Z.
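
A small sketch (ours, with made-up conditional tables) that builds a joint distribution from the factorization Pr[X | Z] Pr[Y | Z] Pr[Z] and checks the defining property of conditional independence:

```python
import numpy as np

# Hypothetical conditional tables over binary X, Y, Z, for illustration only.
pz = np.array([0.6, 0.4])                     # Pr[Z]
px_given_z = np.array([[0.7, 0.3],            # rows: z, cols: x  -> Pr[X | Z]
                       [0.2, 0.8]])
py_given_z = np.array([[0.5, 0.5],            # rows: z, cols: y  -> Pr[Y | Z]
                       [0.9, 0.1]])

# Build the joint Pr[X, Y, Z] = Pr[X | Z] Pr[Y | Z] Pr[Z], indexed [x, y, z].
joint = np.einsum('zx,zy,z->xyz', px_given_z, py_given_z, pz)

# Check the defining property: Pr[X, Y | Z] = Pr[X | Z] Pr[Y | Z] for each z.
p_xy_given_z = joint / pz                     # divide each z-slice by Pr[Z = z]
for z in range(2):
    assert np.allclose(p_xy_given_z[:, :, z],
                       np.outer(px_given_z[z], py_given_z[z]))
print("X _||_ Y | Z holds by construction")
```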

Mean, variance, covariance
For discrete RVs, E[X] = Σ_x x Pr[X = x]; for continuous RVs, E[X] = ∫ x p(x) dx.
The variance and the standard deviation std[X] = σ are defined by Var[X] = σ² = E[(X − E[X])²] = E[X²] − E[X]².
Reminder: Jensen's inequality states that if f is convex, then f(E[X]) ≤ E[f(X)].
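
A quick empirical sketch of these definitions using samples from an arbitrary positive distribution (the Exp(scale=2) choice is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)   # some positive random variable

# E[X], Var[X] = E[X^2] - E[X]^2, std[X]
mean = x.mean()
var = (x ** 2).mean() - mean ** 2
print(mean, var, np.sqrt(var))                 # ~2, ~4, ~2 for this distribution

# Jensen's inequality with the convex function f(t) = t^2:
# f(E[X]) <= E[f(X)], i.e. E[X]^2 <= E[X^2].
print(mean ** 2 <= (x ** 2).mean())            # True
```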

Mean, variance, covariance
Covariance of two random variables X, Y: cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y].
In general, if x is a d-dimensional random vector, the covariance is defined as cov[x] = E[(x − E[x])(x − E[x])^T].
Pearson correlation coefficient: corr[X, Y] = cov[X, Y] / √(Var[X] Var[Y]).
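
A short numerical sketch of covariance and Pearson correlation on synthetic data; the 0.8/0.6 construction is ours, chosen so the true correlation is 0.8.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50_000)
y = 0.8 * x + 0.6 * rng.normal(size=50_000)    # correlated with x by construction

# cov[X, Y] = E[XY] - E[X]E[Y]
cov_xy = (x * y).mean() - x.mean() * y.mean()

# Pearson correlation: cov / sqrt(Var[X] Var[Y])
corr = cov_xy / np.sqrt(x.var() * y.var())
print(cov_xy, corr)                            # both close to 0.8

# The covariance matrix of the 2-d random vector (X, Y):
print(np.cov(np.stack([x, y])))
```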

Mean, variance, covariance
(Figure: correlation examples, from Wikipedia.)

Probability distributions (source: figure on slide). We will go over a few important ones.

Discrete distributions
Details on whiteboard.
Bernoulli: X ∼ Ber(p)
Binomial: X ∼ Bin(n, p)
Multinomial: x ∼ Mu(n, θ)
Poisson: X ∼ Po(λ)

Continuous univariate distributions
Normal: X ∼ N(x; µ, σ²)
Student t distribution: X ∼ T(x; µ, σ², ν)
Laplace: X ∼ Lap(x; µ, β)
Gamma: X ∼ Ga(x; α, β)
Pareto: X ∼ Pareto(x | k, m)
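
The discrete and continuous families from these two slides are all available in scipy.stats; the sketch below simply evaluates a pmf/pdf for each. Note that scipy's parameterizations (e.g., for the Gamma and Pareto) do not necessarily match the slide's notation, so treat the argument choices as assumptions.

```python
from scipy import stats

# Discrete distributions from the previous slide
print(stats.bernoulli(p=0.3).pmf(1))                               # Ber(p)
print(stats.binom(n=10, p=0.3).pmf(4))                             # Bin(n, p)
print(stats.poisson(mu=2.5).pmf(3))                                # Po(lambda)
print(stats.multinomial(n=10, p=[0.2, 0.3, 0.5]).pmf([2, 3, 5]))   # Mu(n, theta)

# Continuous univariate distributions
print(stats.norm(loc=0, scale=1).pdf(0.5))        # N(mu, sigma^2)
print(stats.t(df=3, loc=0, scale=1).pdf(0.5))     # Student t
print(stats.laplace(loc=0, scale=1).pdf(0.5))     # Lap(mu, beta)
print(stats.gamma(a=2.0, scale=1.0).pdf(0.5))     # Gamma (shape/scale form)
print(stats.pareto(b=2.5).pdf(1.5))               # Pareto (scipy's b parameter)
```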

Multivariate normal distribution
N(x; µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) ),
where µ = E[x] and Σ = Cov[x]. Σ^{−1} = Λ is also known as the precision matrix.
Special case: isotropic, i.e., Σ = σ²I.

Multivariate normal distribution
Contour plot for µ = (0, 0), Σ = [2 1.8; 1.8 2] (figure).
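
A small sketch that evaluates the density above at the mean, both via scipy.stats and directly from the formula on the previous slide; the covariance matrix is the one reconstructed from this slide and may differ from the original figure.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.8],
                  [1.8, 2.0]])        # covariance as reconstructed from the slide

dist = multivariate_normal(mean=mu, cov=Sigma)
print(dist.pdf([0.0, 0.0]))           # density at the mean

# Same value from the density formula.
D = 2
Lambda = np.linalg.inv(Sigma)         # precision matrix
x = np.array([0.0, 0.0])
dens = np.exp(-0.5 * (x - mu) @ Lambda @ (x - mu)) / (
    (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
print(dens)
```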

Linear transformations of Random Variables
Suppose f is a linear function, y = f(x) = Ax + b. Then:
E[y] = A E[x] + b   (6) by linearity of expectation
Cov[y] = A Cov[x] A^T   (7)
If f is scalar-valued, i.e., y = a^T x + b, then Var[y] = Var[a^T x + b] = a^T Cov[x] a   (8).
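
A quick Monte Carlo check of equations (6)-(7) with an arbitrary A, b, and input covariance, all chosen by us for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma_x = np.array([[1.0, 0.3], [0.3, 2.0]])
x = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma_x, size=200_000)

A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.5, -0.5])
y = x @ A.T + b                        # y = A x + b, applied row-wise

print(y.mean(axis=0))                  # ~ A E[x] + b   (eq. 6)
print(np.cov(y, rowvar=False))         # ~ A Cov[x] A^T (eq. 7)
print(A @ Sigma_x @ A.T)               # exact value for comparison
```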

Information Theory

Information Theory
Suppose Bob wants to communicate with Alice by sending her bits. Example: (figure on slide).

Information Theory
Can we use fewer than 2 bits?

Information Theory
Can we use fewer than 1.75 bits?

Information Theory
Suppose there are n events, the k-th event occurring with probability p_k. The Shannon entropy, or just entropy, is defined as
H(p_1, ..., p_n) = Σ_{k=1}^n p_k log_2(1/p_k).

Information Theory
Intuition: when the k-th event happens, we receive log(1/p_k) bits of information. Therefore, H(p_1, ..., p_n) is the expected number of bits per random event. If p_k = 0, we define p_k log(1/p_k) = 0; to see why, note that lim_{ε→0+} ε log(1/ε) = 0.
Question: for what values p_1, ..., p_n is the entropy maximized?
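
A minimal entropy sketch (ours). The distribution (1/2, 1/4, 1/8, 1/8) is one choice that gives the 1.75 bits mentioned earlier, and the uniform case answers the question on the slide: entropy is maximized when all n events are equally likely.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 * log(1/0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(1.0 / p[nz])))

print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0 bits: the uniform maximum
```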

Information Theory
Cross-entropy (figure on slide):

Information Theory
Cross-entropy: H_p(q) = Σ_x q(x) log(1/p(x)).
In the running example, H(p) = 1.75 and H(q) = 1.75, but H_p(q) = 2.25 ≠ 2.375 = H_q(p). Cross-entropy isn't symmetric!
For the interested: cross-entropy and neural networks.
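
A short sketch of cross-entropy; the particular p and q below are one pair of distributions that reproduces the numbers quoted on the slide and illustrates the asymmetry.

```python
import numpy as np

def cross_entropy(q, p):
    """H_p(q) = sum_x q(x) log2(1/p(x)): expected message length when events
    drawn from q are sent with a code optimized for p."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log2(1.0 / p)))

p = [1/2, 1/4, 1/8, 1/8]
q = [1/8, 1/2, 1/4, 1/8]
print(cross_entropy(q, p))   # 2.25  = H_p(q)
print(cross_entropy(p, q))   # 2.375 = H_q(p)  -> not symmetric
```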

Information Theory
Kullback-Leibler divergence (a.k.a. relative entropy): KL(p, q) = Σ_k p_k log(p_k / q_k). Equivalently, KL(p, q) = H_q(p) − H(p).
Theorem (Information inequality): KL(p, q) ≥ 0   (9), with equality iff p = q.
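
A matching sketch for the KL divergence, using the same p and q (redefined here so the snippet is self-contained); it also confirms the identity KL(p, q) = H_q(p) − H(p) numerically.

```python
import numpy as np

def kl(p, q):
    """KL(p, q) = sum_k p_k log2(p_k / q_k), assuming q_k > 0 wherever p_k > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

p = [1/2, 1/4, 1/8, 1/8]
q = [1/8, 1/2, 1/4, 1/8]
print(kl(p, q))      # 0.625 = H_q(p) - H(p) = 2.375 - 1.75
print(kl(p, p))      # 0.0, and KL is always >= 0 (information inequality)
```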

Information Theory
How similar is the joint probability distribution p(X, Y) to the factorization p(X)p(Y)?
Mutual information: I(X; Y) = KL( p(X, Y) ‖ p(X)p(Y) ) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) )   (10)
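
A small sketch computing mutual information directly from a hypothetical joint table (the numbers are ours):

```python
import numpy as np

# A hypothetical joint distribution p(X, Y) over two binary variables.
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
px = joint.sum(axis=1, keepdims=True)     # p(x)
py = joint.sum(axis=0, keepdims=True)     # p(y)

# I(X; Y) = KL( p(X, Y) || p(X) p(Y) )   (eq. 10), in bits.
ratio = joint / (px * py)
mi = float(np.sum(joint * np.log2(ratio)))
print(mi)        # > 0 here; it would be 0 if the joint factorized exactly
```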