Introduction. Le Song. Machine Learning I CSE 6740, Fall 2013

What is Machine Learning (ML)? The study of algorithms that improve their performance at some task with experience. 2

Common to industrial-scale problems: 13 million Wikipedia pages, 800 million users, 6 billion photos, 24 hours of video uploaded per minute, > 1 trillion webpages. 3

Organizing Images Image Databases What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 4

Visualize Image Relations Each image has thousands or millions of pixels. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 5

Organizing documents. Reading, digesting, and categorizing a vast text database is too much for humans! We want: What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 6

Weather Prediction. Predict numeric values: 40 F, Wind: NE at 14 km/h, Humidity: 83%. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 7

Face Detection What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 8

Understanding brain activity What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 9

Product Recommendation What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 10

Handwritten digit recognition/text annotation Inter-character dependency What are the desired outcomes? What are the inputs (data)? Inter-word dependency Aoccdrnig to a sudty at Cmabrigde Uinervtisy, it deosn t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. What are the learning paradigms? 11

Similar problem: speech recognition. (Figure: audio signals mapped to text via models such as Hidden Markov Models.) Machine learning is the preferred method for speech recognition. 12

Similar problem: bioinformatics. (The slide shows several screenfuls of raw genomic DNA sequence, with the phrase "whereisthegene" hidden inside it.) Where is the gene? 13

Spam Filtering What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 14

Similar problem: webpage classification Company homepage vs. University homepage 15

Robot Control. Now cars can find their own way! What are the desired outcomes? What are the inputs (data)? What are the learning paradigms? 16

Nonlinear classifier. (Figure: linear SVM decision boundaries vs. nonlinear decision boundaries.) 17

Nonconventional clusters. These require more advanced methods, such as kernel methods or spectral clustering. 18

Syllabus. Covers a number of the most commonly used machine learning algorithms, with sufficient detail on their mechanisms. Organization: Unsupervised learning (data exploration): clustering, dimensionality reduction, density estimation, novelty detection. Supervised learning (predictive models): classification, regression. Complex models (dealing with nonlinearity, combining models, etc.): kernel methods, graphical models, boosting. 19

Prerequisites. Probabilities: distributions, densities, marginalization, conditioning. Basic statistics: moments, classification, regression, maximum likelihood estimation. Algorithms: dynamic programming, basic data structures, complexity. Programming: mostly your choice of language, but Matlab will be very useful. The class will be fast paced; the ability to deal with abstract mathematical concepts is expected. 20

Textbooks Textbooks: Pattern Recognition and Machine Learning, Chris Bishop The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman Machine Learning, Tom Mitchell 21

Grading 6 assignments (60%) Approximately 1 assignment every 4 lectures Start early Midterm exam (20%) Final exam (20%) Project for advanced students Can be used to replace exams Based on student experience and lecturer interests. 22

Homeworks. Zero credit after each deadline. All homeworks must be handed in, even for zero credit. Collaboration: you may discuss the questions, but each student writes their own answers. Write on your homework anyone with whom you collaborate. Each student must write their own code for the programming part. 23

Staff Instructor: Le Song, Klaus 1340 TA: Joonseok, Klaus 1305, Nan Du, Klaus 1305 Guest Lecturer: TBD Administrative Assistant: Mimi Haley, Klaus 1321 Mailing list: mlcda2013@gmail.com More information: http://www.cc.gatech.edu/~lsong/teaching/cse6740fall13.html 24

Today Probabilities Independence Conditional Independence 25

Random Variables (RV). Data may contain many different attributes: age, grade, color, location, coordinate, time. Upper-case for random variables (e.g. X, Y), lower-case for values (e.g. x, y). P(X) for a distribution, p(x) for a density. Val(X) = possible values of random variable X. For discrete (categorical) variables: ∑_{x_i ∈ Val(X)} P(X = x_i) = 1. For continuous variables: ∫_{Val(X)} p(X = x) dx = 1, with P(x) ≥ 0. Shorthand: P(x) for P(X = x). 26
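
As a quick illustration of the normalization requirement (a minimal sketch added here, not part of the original slides; the distribution values are invented):

    # Hypothetical discrete distribution over Val(X) = {red, green, blue}
    P_X = {"red": 0.5, "green": 0.3, "blue": 0.2}

    # Every entry is a valid probability and the entries sum to one
    assert all(p >= 0 for p in P_X.values())
    assert abs(sum(P_X.values()) - 1.0) < 1e-12

    print(P_X["red"])  # shorthand P(x) for P(X = x), here P(X = red) = 0.5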

Interpretations of probability. Frequentist: P(x) is the frequency of x in the limit. There are many arguments against this interpretation: what is the frequency of the event "it will rain tomorrow"? Subjective interpretation: P(x) is my degree of belief that x will happen. What does degree of belief mean? If P(x) = 0.8, then I am willing to bet accordingly. For this class, we don't care about the type of interpretation. 27

Conditional probability. After we have seen x, how do we feel y will happen? P(y|x) means P(Y = y | X = x). A conditional distribution is a family of distributions: for each X = x, it is a distribution P(Y | x). 28

Two of the most important rules, I: The chain rule. P(y, x) = P(y|x) P(x). More generally: P(x_1, x_2, ..., x_k) = P(x_1) P(x_2|x_1) ··· P(x_k | x_{k-1}, ..., x_2, x_1). 29
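
A small numerical sanity check of the chain rule (added for illustration; the joint probabilities below are invented): the joint of two binary variables should factor as P(x_1) P(x_2|x_1).

    # Invented joint distribution over two binary variables X1, X2
    P_joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

    def P_x1(x1):
        return sum(p for (a, b), p in P_joint.items() if a == x1)

    def P_x2_given_x1(x2, x1):
        return P_joint[(x1, x2)] / P_x1(x1)

    # Chain rule: P(x1, x2) = P(x1) * P(x2 | x1)
    for (x1, x2), p in P_joint.items():
        assert abs(p - P_x1(x1) * P_x2_given_x1(x2, x1)) < 1e-12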

Two of the most important rules, II: Bayes rule. P(y|x) = P(x|y) P(y) / P(x), where P(x|y) is the likelihood, P(y) is the prior, P(y|x) is the posterior, and the normalization constant is P(x) = ∑_y P(x, y). More generally, with an additional variable z: P(y|x, z) = P(x|y, z) P(y|z) / P(x|z). 30
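
A worked numerical example of Bayes rule (illustrative numbers, not from the slides): given a likelihood P(x|y) for one observed x and a prior P(y), the posterior is obtained by normalizing over y.

    # Prior P(y) and likelihood P(x | y) for a binary Y and one observed x
    prior = {"y0": 0.7, "y1": 0.3}
    likelihood = {"y0": 0.1, "y1": 0.8}   # P(x | y) for the observed x

    # Normalization constant P(x) = sum_y P(x | y) P(y)
    P_x = sum(likelihood[y] * prior[y] for y in prior)

    # Posterior P(y | x) = P(x | y) P(y) / P(x)
    posterior = {y: likelihood[y] * prior[y] / P_x for y in prior}
    print(posterior)   # {'y0': ~0.226, 'y1': ~0.774}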

Independence. X and Y are independent, written X ⊥ Y, if P(Y|X) = P(Y). Proposition: X and Y are independent if and only if P(X, Y) = P(X) P(Y). 31

Conditional independence. Independence is rarely true; conditional independence is more prevalent. X and Y are conditionally independent given Z, written X ⊥ Y | Z, if P(Y|X, Z) = P(Y|Z). X ⊥ Y | Z if and only if P(X, Y|Z) = P(X|Z) P(Y|Z). 32
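
For concreteness (an illustrative construction added here, with invented numbers), the sketch below builds a joint of the form P(x, y, z) = P(z) P(x|z) P(y|z) and checks that P(X, Y|Z) = P(X|Z) P(Y|Z), i.e. X ⊥ Y | Z.

    import itertools

    # Construct P(x, y, z) = P(z) P(x|z) P(y|z) so that X and Y are
    # conditionally independent given Z (all numbers invented)
    P_z = {0: 0.6, 1: 0.4}
    P_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(x|z)
    P_y_given_z = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.5, 1: 0.5}}   # P(y|z)

    P_xyz = {(x, y, z): P_z[z] * P_x_given_z[z][x] * P_y_given_z[z][y]
             for x, y, z in itertools.product([0, 1], repeat=3)}

    # Check P(x, y | z) == P(x | z) * P(y | z) for every configuration
    for x, y, z in itertools.product([0, 1], repeat=3):
        P_xy_given_z = P_xyz[(x, y, z)] / P_z[z]
        assert abs(P_xy_given_z - P_x_given_z[z][x] * P_y_given_z[z][y]) < 1e-12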

Joint distribution, marginalization. Two random variables: Grade (G) & Intelligence (I).

P(G, I):
             I = VH    I = H
    G = A     0.7       0.1
    G = B     0.15      0.05

For n binary variables, the table (multiway array) gets really big: P(X_1, X_2, ..., X_n) has 2^n entries! Marginalization: compute the marginal over a single variable, e.g. P(G = B) = P(G = B, I = VH) + P(G = B, I = H) = 0.2. 33
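
The marginalization on this slide can be reproduced directly from the 2x2 table; the short sketch below (added for illustration, using the slide's numbers) encodes P(G, I) and recovers P(G = B) = 0.2.

    # Joint distribution P(G, I) from the slide
    P_GI = {("A", "VH"): 0.7, ("A", "H"): 0.1,
            ("B", "VH"): 0.15, ("B", "H"): 0.05}

    # Marginal over G: sum out I
    P_G_B = sum(p for (g, i), p in P_GI.items() if g == "B")
    print(P_G_B)   # 0.2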

Marginalization, the general case. Compute the marginal distribution P(X_i) from P(X_1, X_2, ..., X_i, X_{i+1}, ..., X_n):
P(X_1, X_2, ..., X_i) = ∑_{x_{i+1}, ..., x_n} P(X_1, X_2, ..., X_i, x_{i+1}, ..., x_n)
P(X_i) = ∑_{x_1, ..., x_{i-1}} P(x_1, ..., x_{i-1}, X_i)
For binary variables, this requires summing over 2^{n-1} terms! 34
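
With the joint stored as an n-dimensional array, marginalizing out variables is just a sum over the corresponding axes; the sketch below (illustrative, assuming numpy and a random joint over 4 binary variables) marginalizes down to P(X_1).

    import numpy as np

    rng = np.random.default_rng(0)

    # A random joint P(X1, X2, X3, X4) over 4 binary variables: a 2x2x2x2 array
    joint = rng.random((2, 2, 2, 2))
    joint /= joint.sum()                      # normalize so entries sum to 1

    # Marginalize out X2, X3, X4 (axes 1, 2, 3) to get P(X1)
    P_X1 = joint.sum(axis=(1, 2, 3))
    print(P_X1, P_X1.sum())                   # two numbers summing to 1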

Example problem. Estimate the probability θ of landing heads using a biased coin, given a sequence of N independently and identically distributed (iid) flips, e.g. D = {x_1, x_2, ..., x_N} = {1, 0, 1, ..., 0}, x_i ∈ {0, 1}. Model: P(x|θ) = θ^x (1 − θ)^{1−x}, i.e. P(x|θ) = 1 − θ for x = 0 and P(x|θ) = θ for x = 1. Likelihood of a single observation x_i: P(x_i|θ) = θ^{x_i} (1 − θ)^{1−x_i}. 35
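
A minimal encoding of this Bernoulli model (added for illustration; the example data is made up): the likelihood of a single flip and of a whole iid sequence.

    def bernoulli_likelihood(x, theta):
        """P(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
        return theta**x * (1 - theta)**(1 - x)

    def sequence_likelihood(D, theta):
        """P(D | theta) for iid flips D."""
        p = 1.0
        for x in D:
            p *= bernoulli_likelihood(x, theta)
        return p

    D = [1, 0, 1, 1, 0]                 # example data (made up)
    print(sequence_likelihood(D, 0.6))  # 0.6^3 * 0.4^2 = 0.03456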

Frequentist Parameter Estimation. Frequentists think of a parameter as a fixed, unknown constant, not a random variable. Hence they use different objective estimators, instead of Bayes rule. These estimators have different properties, such as being unbiased, minimum variance, etc. A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties: θ̂ = argmax_θ P(D|θ) = argmax_θ ∏_{i=1}^{N} P(x_i|θ). 36

MLE for Biased Coin. Objective function, the log likelihood: l(θ; D) = log P(D|θ) = log [θ^{n_h} (1 − θ)^{n_t}] = n_h log θ + (N − n_h) log(1 − θ). We need to maximize this w.r.t. θ. Taking the derivative w.r.t. θ and setting it to zero: ∂l/∂θ = n_h/θ − (N − n_h)/(1 − θ) = 0, giving θ_MLE = n_h/N, or θ_MLE = (1/N) ∑_i x_i. 37
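
The closed-form MLE θ_MLE = n_h / N is easy to check numerically; the sketch below (illustrative, with made-up data and assuming numpy) compares the closed form against a grid search over the log likelihood.

    import numpy as np

    D = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # made-up coin flips
    n_h, N = D.sum(), len(D)

    theta_mle = n_h / N                              # closed-form MLE
    print(theta_mle)                                 # 0.7

    # Grid search over the log likelihood l(theta; D)
    thetas = np.linspace(0.001, 0.999, 999)
    loglik = n_h * np.log(thetas) + (N - n_h) * np.log(1 - thetas)
    print(thetas[np.argmax(loglik)])                 # ~0.7, agrees with the closed form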

Bayesian Parameter Estimation. Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes rule: P(θ|D) = P(D|θ) P(θ) / P(D) = P(D|θ) P(θ) / ∫ P(D|θ) P(θ) dθ. The crucial equation can be written in words: posterior = likelihood × prior / marginal likelihood. For iid data, the likelihood is P(D|θ) = ∏_{i=1}^{N} P(x_i|θ) = ∏_{i=1}^{N} θ^{x_i} (1 − θ)^{1−x_i} = θ^{∑_i x_i} (1 − θ)^{∑_i (1−x_i)} = θ^{#heads} (1 − θ)^{#tails}. The prior P(θ) encodes our prior knowledge of the domain. Different priors P(θ) will end up with different estimates P(θ|D)! 38
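
The posterior can also be approximated without any conjugacy argument by evaluating likelihood × prior on a grid of θ values and normalizing; this grid approximation is not the slides' method, just a minimal illustrative sketch (made-up data, uniform prior, assuming numpy).

    import numpy as np

    D = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])       # same made-up flips
    n_h, n_t = D.sum(), len(D) - D.sum()

    thetas = np.linspace(0.001, 0.999, 999)
    prior = np.ones_like(thetas)                        # uniform prior P(theta)
    likelihood = thetas**n_h * (1 - thetas)**n_t        # P(D | theta)

    posterior = likelihood * prior
    posterior /= posterior.sum() * (thetas[1] - thetas[0])       # normalize the density

    # posterior mean ~ (n_h + 1) / (N + 2) = 8/12 for this uniform prior
    print(np.sum(thetas * posterior) * (thetas[1] - thetas[0]))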

Bayesian estimation for biased coin. Prior over θ: the Beta distribution, P(θ; α, β) = [Γ(α+β) / (Γ(α) Γ(β))] θ^{α−1} (1 − θ)^{β−1}. For integer x, Γ(x + 1) = x Γ(x) = x!. Posterior distribution of θ: P(θ | x_1, ..., x_N) = P(x_1, ..., x_N | θ) P(θ) / P(x_1, ..., x_N) ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1} = θ^{n_h+α−1} (1 − θ)^{n_t+β−1}. α and β are hyperparameters and correspond to the numbers of virtual heads and tails (pseudo counts). 39

Bayesian Estimation for Bernoulli. Posterior distribution: P(θ | x_1, ..., x_N) = P(x_1, ..., x_N | θ) P(θ) / P(x_1, ..., x_N) ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1} = θ^{n_h+α−1} (1 − θ)^{n_t+β−1}. Posterior mean estimate: θ_bayes = ∫ θ P(θ|D) dθ = C ∫ θ · θ^{n_h+α−1} (1 − θ)^{n_t+β−1} dθ = (n_h + α) / (N + α + β). Prior strength: A = α + β; A can be interpreted as the size of an imaginary dataset. 40
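
Since the posterior above is again a Beta distribution, Beta(n_h + α, n_t + β), the closed-form posterior mean (n_h + α)/(N + α + β) can be checked directly; a short sketch (illustrative counts and hyperparameters, assuming scipy is available).

    from scipy import stats

    n_h, n_t = 7, 3                  # made-up counts of heads and tails
    N = n_h + n_t
    alpha, beta = 2.0, 2.0           # hyperparameters: virtual heads and tails

    # Posterior is Beta(n_h + alpha, n_t + beta)
    posterior = stats.beta(n_h + alpha, n_t + beta)

    print((n_h + alpha) / (N + alpha + beta))   # closed-form posterior mean: 9/14 ~ 0.643
    print(posterior.mean())                     # same value from scipy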