CSC321: 2011 Introduction to Neural Networks and Machine Learning. Lecture 10: The Bayesian way to fit models. Geoffrey Hinton


The Bayesian framework. The Bayesian framework assumes that we always have a prior distribution for everything. The prior may be very vague. When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution. The likelihood term takes into account how probable the observed data is given the parameters of the model. It favors parameter settings that make the data likely. It fights the prior: with enough data, the likelihood terms always win.
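
As a rough numerical illustration of "with enough data the likelihood terms always win" (this sketch is mine, not part of the lecture; the grid, the skewed prior p^8, and the toss counts are all assumptions chosen for the demo):

import numpy as np

# Discretize the coin's head-probability p and compare two priors.
p_grid = np.linspace(0.001, 0.999, 999)
flat_prior = np.ones_like(p_grid)      # vague prior
skewed_prior = p_grid ** 8             # prior that strongly favors large p

for heads, tails in [(5, 5), (530, 470)]:
    log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)
    for name, prior in [("flat prior  ", flat_prior), ("skewed prior", skewed_prior)]:
        post = prior * np.exp(log_lik - log_lik.max())   # unnormalized posterior
        post /= post.sum()
        mean = float((p_grid * post).sum())
        print(name, heads + tails, "tosses -> posterior mean of p:", round(mean, 3))

With only 10 tosses the two priors give quite different posterior means; after 1,000 tosses both posteriors concentrate near the observed head frequency.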

A coin tossing example. Suppose we know nothing about coins except that each tossing event produces a head with some unknown probability p and a tail with probability 1 - p. Our model of a coin has one parameter, p. Suppose we observe 100 tosses and there are 53 heads. What is p? The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable. The probability of a particular sequence is P(D) = p^53 (1-p)^47, and dP(D)/dp = 53 p^52 (1-p)^47 - 47 p^53 (1-p)^46 = (53/p - 47/(1-p)) p^53 (1-p)^47, which is 0 if p = 0.53.
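
A minimal check of this maximum-likelihood answer (my own sketch, not the lecture's; the grid of candidate p values is an assumption):

import numpy as np

# Evaluate log of p^53 (1-p)^47 over a grid of candidate p values.
p = np.linspace(0.001, 0.999, 999)
log_prob = 53 * np.log(p) + 47 * np.log(1 - p)
print("argmax over the grid:", round(float(p[np.argmax(log_prob)]), 2))   # 0.53
print("closed form 53/100 =", 53 / 100)                                   # same answer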

Some problems with picking the parameters that are most likely to generate the data. What if we only tossed the coin once and we got 1 head? Is p = 1 a sensible answer? Surely p = 0.5 is a much better answer. Is it reasonable to give a single answer? If we don't have much data, we are unsure about p. Our computations of probabilities will work much better if we take this uncertainty into account.

Using a distribution over parameter values. Start with a prior distribution over p; in this case we used a uniform distribution. Multiply the prior probability of each parameter value by the probability of observing a head given that value. Then scale up all of the probability densities so that their integral comes to 1. This gives the posterior distribution. [Figures: probability density over p at each step, each with area = 1.]

Let's do it again: suppose we get a tail. Start with a prior distribution over p. Multiply the prior probability of each parameter value by the probability of observing a tail given that value. Then renormalize to get the posterior distribution. Look how sensible it is! [Figures: probability density over p before and after the update, each with area = 1.]

Let's do it another 98 times. After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior). [Figure: posterior probability density over p, peaked at 0.53, with area = 1.]
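
Here is a sketch of the grid-based updating described in the last three slides (the discretization is my own; the lecture only shows the pictures):

import numpy as np

p = np.linspace(0.001, 0.999, 999)   # grid of candidate values for p
density = np.ones_like(p)            # uniform prior density over p
dp = p[1] - p[0]

outcomes = ["H"] * 53 + ["T"] * 47   # 53 heads and 47 tails, in any order
for o in outcomes:
    density *= p if o == "H" else (1 - p)   # multiply by the likelihood of this toss
    density /= density.sum() * dp           # rescale so the area under the curve is 1

print("posterior peak at p =", round(float(p[np.argmax(density)]), 2))   # 0.53
print("area under the posterior =", round(float(density.sum() * dp), 3)) # 1.0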

Bayes' theorem. For a weight vector W and training data D, the joint probability can be written using conditional probabilities in two ways: p(D) p(W|D) = p(D, W) = p(W) p(D|W). Rearranging gives the posterior probability of the weight vector given the training data, p(W|D) = p(W) p(D|W) / p(D), where p(W) is the prior probability of the weight vector and p(D|W) is the probability of the observed data given W.

A cheap trick to avoid computing the posterior probabilities of all weight vectors. Suppose we just try to find the most probable weight vector. We can do this by starting with a random weight vector and then adjusting it in the direction that improves p(W|D). It is easier to work in the log domain. If we want to minimize a cost we use negative log probabilities: since p(W|D) = p(W) p(D|W) / p(D), Cost = -log p(W|D) = -log p(W) - log p(D|W) + log p(D).
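
A toy sketch of this trick for a single weight (my own construction, not the lecture's: the model y = w * x, the noise scale, and the unit-variance Gaussian prior on w are all assumptions). The log p(D) term is dropped because it does not depend on w.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)      # synthetic data with true w = 2

def neg_log_posterior_grad(w, sigma_noise=0.5, sigma_prior=1.0):
    # d/dw of [ -log p(w) - log p(D|w) ] under the assumed Gaussian prior and noise model
    grad_prior = w / sigma_prior**2
    grad_lik = np.sum((w * x - y) * x) / sigma_noise**2
    return grad_prior + grad_lik

w = rng.normal()                                  # start with a random weight
for _ in range(200):
    w -= 0.001 * neg_log_posterior_grad(w)        # adjust w in the direction that improves p(w|D)
print("most probable w:", round(float(w), 2))     # close to 2, pulled slightly toward 0 by the prior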

Why we maximize sums of log probs. We want to maximize the product of the probabilities of the outputs on all the different training cases. Assume the output errors on different training cases, c, are independent: p(D|W) = prod_c p(d_c | W). Because the log function is monotonic, it does not change where the maxima are. So we can maximize sums of log probabilities: log p(D|W) = sum_c log p(d_c | W).
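
A quick numerical illustration of why the log domain is preferred (my own numbers, not from the slides; I reuse the coin example with 10,000 tosses as the independent "cases"): the raw product of many per-case probabilities underflows to zero in floating point, while its log is a well-behaved sum with the same argmax.

import numpy as np

p = np.linspace(0.001, 0.999, 999)
heads, tails = 5300, 4700                          # 10,000 independent tosses

product = p**heads * (1 - p)**tails                # product of per-case probabilities
log_sum = heads * np.log(p) + tails * np.log(1 - p)

print("max of the raw product:", product.max())    # 0.0 -- underflows in float64
print("argmax from the log-sum:", round(float(p[np.argmax(log_sum)]), 2))   # 0.53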

An even cheaper trick. Suppose we completely ignore the prior over weight vectors. This is equivalent to giving all possible weight vectors the same prior probability density. Then all we have to do is to maximize: log p(D|W) = sum_c log p(d_c | W). This is called maximum likelihood learning. It is very widely used for fitting models in statistics.

Supervised Maximum Likelihood Learning. Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answer under a Gaussian centered at the model's guess. Let y_c = f(input_c, W) be the model's estimate of the most probable value and d_c be the correct answer. Then p(output = d_c | input_c, W) = 1/(sqrt(2 pi) sigma) * exp(-(d_c - y_c)^2 / (2 sigma^2)), so -log p(output = d_c | input_c, W) = k + (d_c - y_c)^2 / (2 sigma^2), where k is a constant.

Supervised Maximum Likelihood Learning. Finding a set of weights, W, that minimizes the squared errors is exactly the same as finding a W that maximizes the log probability that the model would produce the desired outputs on all the training cases. We implicitly assume that zero-mean Gaussian noise is added to the model's actual output. We do not need to know the variance of the noise because we are assuming it's the same in all cases, so it just scales the squared error.
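
A minimal numerical check of this equivalence (my own sketch, not the lecture's code; the one-weight model d = w * x, the noise scale, and the grid of candidate weights are assumptions): sweeping a single weight w shows that the value minimizing the summed squared error is exactly the value maximizing the summed Gaussian log probability.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
d = 1.7 * x + rng.normal(scale=0.3, size=100)     # desired outputs with true weight 1.7

w_grid = np.linspace(0.0, 3.0, 3001)
sse = np.array([np.sum((d - w * x)**2) for w in w_grid])
sigma = 0.3
log_prob = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (d - w * x)**2 / (2 * sigma**2))
                     for w in w_grid])

print("w minimizing the squared error:  ", w_grid[np.argmin(sse)])
print("w maximizing the log probability:", w_grid[np.argmax(log_prob)])   # the same value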