Class #03: Information Theory
Machine Learning (CS 49/59): M. Allen, 10 Sept. 18

Uncertainty and Learning
} Often, when learning, we deal with uncertainty:
} Incomplete data sets, with missing information
} Noisy data sets, with unreliable information
} Stochasticity: causes and effects related non-deterministically
} And many more
} Probability theory gives us the mathematics for such cases
} A precise mathematical theory of chance and causality

Basic Elements of Probability
} Suppose we have some event, e: some fact about the world that may be true or false
} We write P(e) for the probability that e occurs: 0 ≤ P(e) ≤ 1
} We can understand this value as:
  1. P(e) = 1: e will certainly happen
  2. P(e) = 0: e will certainly not happen
  3. P(e) = k, 0 < k < 1: over an arbitrarily long stretch of time, we will observe the fraction
     (# of times event e occurs) / (total # of events) = k

Properties of Probability
} Every event must either occur, or not occur:
  P(e ∨ ¬e) = 1, so P(e) = 1 − P(¬e)
} Furthermore, suppose that we have a set of all possible events, each with its own probability:
  E = {e_1, e_2, ..., e_k}   P = {p_1, p_2, ..., p_k}
} This set of probabilities is called a probability distribution, and it must have the following property:
  Σ_i p_i = 1
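To make the frequency reading of P(e) and the sum-to-one property concrete, here is a minimal Python sketch (not from the lecture; the coin bias and the number of tosses are arbitrary choices) that estimates an empirical distribution from simulated tosses:

    import random

    def empirical_distribution(outcomes):
        """Tally the observed outcomes and normalize the counts into probabilities."""
        counts = {}
        for o in outcomes:
            counts[o] = counts.get(o, 0) + 1
        total = len(outcomes)
        return {o: c / total for o, c in counts.items()}

    # Simulate tossing a biased coin many times (P(Tails) = 0.75 is assumed here).
    tosses = random.choices(["Heads", "Tails"], weights=[0.25, 0.75], k=100_000)

    P = empirical_distribution(tosses)
    print(P)                # e.g. {'Tails': 0.7512, 'Heads': 0.2488}
    print(sum(P.values()))  # 1.0 (up to floating-point rounding): a valid distribution

As the number of tosses grows, the observed fraction for each event approaches its true probability, which is exactly reading 3 above.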

Probability Distributions
} A uniform distribution is one in which every event occurs with equal probability, which means that we have:
  ∀i, p_i = 1/k
} Such distributions are common in games of chance, e.g. where we have a fair coin-toss:
  E = {Heads, Tails}   P_1 = {0.5, 0.5}
} Not every distribution is uniform, and we might have a coin that comes up tails more often than heads (or even always!):
  P_2 = {0.25, 0.75}   P_3 = {0.0, 1.0}

Information Theory
} Claude Shannon created information theory in his 1948 paper, "A Mathematical Theory of Communication"
} A theory of the amount of information that can be carried by communication channels
} Has implications in networks, encryption, compression, and many other areas
} Also the source of the term "bit" (credited to John Tukey)
Photo source: Konrad Jacobs (https://opc.mfo.de/detail?photo_id=3807)

Information Carried by Events
} Information is relative to our uncertainty about an event
} If we do not know whether an event has happened or not, then learning that fact is a gain in information
} If we already know this fact, then there is no information gained when we see the outcome
} Thus, if we have a fixed coin that always comes up tails, actually flipping it tells us nothing we don't already know
} Flipping a fair coin does tell us something, on the other hand, since we can't predict the outcome ahead of time

Amount of Information
} From N. Abramson (1963): if an event e_i occurs with probability p_i, the amount of information carried is:
  I(e_i) = log(1/p_i)
} (The base of the logarithm doesn't really matter, but if we use base 2, we are measuring information in bits)
} Thus, if we flip a fair coin, and it comes up tails, we have gained information equal to:
  I(Tails) = log(1/P(Tails)) = log(1/0.5) = log 2 = 1.0
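The three coins above make good test cases for the self-information formula. A short sketch (mine, not the instructor's) computing I(e) = log_2(1/p) for each value of P(Tails):

    import math

    def information_bits(p):
        """Self-information of an event with probability p, in bits: log2(1/p)."""
        return math.log2(1.0 / p)

    for p_tails in (0.5, 0.75, 1.0):
        print(f"P(Tails) = {p_tails}: I(Tails) = {information_bits(p_tails):.3f} bits")
    # P(Tails) = 0.5  -> 1.000 bits (fair coin)
    # P(Tails) = 0.75 -> 0.415 bits (biased coin; see the next slide)
    # P(Tails) = 1.0  -> 0.000 bits (fixed coin: no information gained)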

Biased Data Carries Less Information
} While flipping a fair coin yields 1.0 bit of information, flipping one that is biased gives us less
} If we have a somewhat biased coin, then we get:
  E = {Heads, Tails}   P_2 = {0.25, 0.75}
  I(Tails) = log(1/P(Tails)) = log(1/0.75) = log 1.33 ≈ 0.415
} If we have a totally biased coin, then we get:
  P_3 = {0.0, 1.0}
  I(Tails) = log(1/P(Tails)) = log(1/1.0) = log 1.0 = 0.0

Entropy: Total Average Information
} Shannon defined the entropy of a probability distribution as the average amount of information carried by events:
  H(P) = Σ_i p_i log(1/p_i) = −Σ_i p_i log p_i
} This can be thought of in a variety of ways, including:
} How much uncertainty we have about the average event
} How much information we get when an average event occurs
} How many bits on average are needed to communicate about the events (Shannon was interested in finding the most efficient overall encodings to use in transmitting information)

Entropy: Total Average Information
} For a coin, C, the formula for entropy becomes:
  H(C) = −(P(Heads) log P(Heads) + P(Tails) log P(Tails))
} A fair coin, {0.5, 0.5}, has maximum entropy:
  H(C) = −(0.5 log 0.5 + 0.5 log 0.5) = 1.0
} A somewhat biased coin, {0.25, 0.75}, has less:
  H(C) = −(0.25 log 0.25 + 0.75 log 0.75) ≈ 0.81
} And a fixed coin, {0.0, 1.0}, has none:
  H(C) = −(1.0 log 1.0 + 0.0 log 0.0) = 0.0

A Mathematical Definition
  H(P) = −Σ_i p_i log p_i
} It is easy to show that for any distribution, entropy is always greater than or equal to 0 (never negative)
} Maximum entropy occurs with a uniform distribution
} In such cases, entropy is log k, where k is the number of different probabilistic outcomes
} Thus, for any possible distribution, we have:
  0 ≤ H(P) ≤ log k
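A small follow-up sketch (my own, reusing the coin distributions from the slides) that computes H(P) and checks the 0 ≤ H(P) ≤ log_2 k bound; the 0 log 0 term is treated as 0, as is standard:

    import math

    def entropy(probs):
        """Shannon entropy in bits: sum of p * log2(1/p), skipping zero-probability events."""
        return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

    coins = {"fair": [0.5, 0.5], "biased": [0.25, 0.75], "fixed": [0.0, 1.0]}
    for name, P in coins.items():
        H = entropy(P)
        assert 0.0 <= H <= math.log2(len(P)) + 1e-12  # 0 <= H(P) <= log2(k)
        print(f"{name:6s}: H = {H:.3f} bits")
    # fair  : H = 1.000 bits (maximum entropy: log2(2) = 1)
    # biased: H = 0.811 bits
    # fixed : H = 0.000 bits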

Joint Probability & Independence
} If we have two events e_1 and e_2, the probability that both events occur, called the joint probability, is written:
  P(e_1 ∧ e_2) = P(e_1, e_2)
} We say that two events are independent if and only if:
  P(e_1, e_2) = P(e_1) P(e_2)
} Independent events tell us nothing about each other

Joint Probability & Independence
} Independent events tell us nothing about each other:
} For example, suppose rainy weather is uniformly distributed
} Suppose further that we choose a day of the week, uniformly at random: that day is either on a weekend or not, giving us:
  W = {Rain, ¬Rain}   P_W = {0.5, 0.5}
  D = {Weekend, ¬Weekend}   P_D = {2/7, 5/7}
} If the weather on any day is independent of whether or not that day is a weekend, then we will have the following:
  P(Rain, Weekend) = P(Rain) P(Weekend) = 0.5 × 2/7 = 1/7
  P(¬Rain, Weekend) = P(¬Rain) P(Weekend) = 0.5 × 2/7 = 1/7
  P(Rain, ¬Weekend) = P(Rain) P(¬Weekend) = 0.5 × 5/7 = 5/14
  P(¬Rain, ¬Weekend) = P(¬Rain) P(¬Weekend) = 0.5 × 5/7 = 5/14

Lack of Independence
} Suppose we compare the probability that it rains to the probability that I bring an umbrella to work:
  W = {Rain, ¬Rain}   P_W = {0.5, 0.5}
  U = {Umbrella, ¬Umbrella}   P_U = {0.2, 0.8}
} Note: presumably, neither of these is really purely random; we can still treat them as random variables based upon observing how frequently they occur (this is sometimes called the empirical probability)
} Now, if these were independent events, then the probability, e.g., that I am carrying an umbrella and it is raining would be:
  P(Rain, Umbrella) = P(Rain) P(Umbrella) = 0.5 × 0.2 = 0.1
} Obviously, however, these are not independent; the actual probability of seeing me with my umbrella on rainy days could be much higher than just calculated

Conditional Probability
} Given two events e_1 and e_2, the probability that e_1 occurs, given that e_2 also occurs, called the conditional probability of e_1 given e_2, is written:
  P(e_1 | e_2)
} In general, the conditional probability of an event can be quite different from the basic probability that it occurs
} Thus, for our weather example, we might have:
  W = {R, ¬R}   P_W = {0.5, 0.5}
  U = {U, ¬U}   P_U = {0.2, 0.8}
  P(U | R) = 0.8    P(U | ¬R) = 0.1
  P(¬U | R) = 0.2   P(¬U | ¬R) = 0.9
} Note that P(e_1 | e_2) + P(¬e_1 | e_2) = 1.0, but P(e_1 | e_2) + P(e_1 | ¬e_2) ≠ 1.0 (they can be equal, but not necessarily)
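To check the independence arithmetic above, here is a brief sketch (mine; the rain/weekend probabilities are the ones from the slides, and the labels "NoRain" and "Weekday" stand in for ¬Rain and ¬Weekend) that builds the joint distribution under the independence assumption:

    from itertools import product

    P_W = {"Rain": 0.5, "NoRain": 0.5}          # weather, assumed uniform
    P_D = {"Weekend": 2 / 7, "Weekday": 5 / 7}  # day type, from choosing a day at random

    # Under independence, each joint probability is just the product of the marginals.
    joint = {(w, d): P_W[w] * P_D[d] for w, d in product(P_W, P_D)}

    for (w, d), p in joint.items():
        print(f"P({w}, {d}) = {p:.4f}")  # 1/7 ≈ 0.1429 and 5/14 ≈ 0.3571, as above
    print(sum(joint.values()))           # the four joint probabilities sum to 1 (up to rounding)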

Properties of Conditional Probability
} Conditional probability can be defined using joint probability:
  P(e_1 | e_2) = P(e_1, e_2) / P(e_2)
  P(e_1, e_2) = P(e_1 | e_2) P(e_2)
} Thus, if the events are actually independent, we get:
  P(e_1 | e_2) = P(e_1, e_2) / P(e_2)
              = P(e_1) P(e_2) / P(e_2)   (by definition of independence)
              = P(e_1)
  (see the short sketch at the end of these notes)

This Week
} Information Theory & Decision Trees
} Readings:
} Blog post on Information Theory (linked from class schedule)
} Section 18.3 from Russell & Norvig
} Office Hours: Wing 0
} Monday/Wednesday/Friday, :00 PM – :00 PM
} Tuesday/Thursday, :30 PM – 3:00 PM
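Returning to the definition P(e_1 | e_2) = P(e_1, e_2) / P(e_2), here is the short sketch referred to above (my own; the joint table is invented so that its marginals match the slides' P_W = {0.5, 0.5} and P_U = {0.2, 0.8}, but the exact dependence between rain and umbrella is purely illustrative). It shows the conditional differing from the marginal when the events are not independent:

    # Hypothetical joint distribution over (weather, umbrella). The marginals match
    # the slides' P_W and P_U, but the dependence itself is made up for illustration.
    joint = {
        ("Rain", "Umbrella"):     0.15,
        ("Rain", "NoUmbrella"):   0.35,
        ("NoRain", "Umbrella"):   0.05,
        ("NoRain", "NoUmbrella"): 0.45,
    }

    def marginal(index, value):
        """P(value), obtained by summing the joint over the other variable."""
        return sum(p for key, p in joint.items() if key[index] == value)

    def conditional(u_value, w_value):
        """P(u_value | w_value) = P(w_value, u_value) / P(w_value)."""
        return joint[(w_value, u_value)] / marginal(0, w_value)

    print(f"P(Umbrella)        = {marginal(1, 'Umbrella'):.2f}")            # 0.20
    print(f"P(Umbrella | Rain) = {conditional('Umbrella', 'Rain'):.2f}")    # 0.30, not 0.20
    print(f"{conditional('Umbrella', 'Rain') + conditional('NoUmbrella', 'Rain'):.2f}")  # 1.00

If rain and umbrella were independent, the conditional would instead come out equal to the marginal, exactly as derived on the slide above.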