Machine Learning: Probability Theory

Prof. Dr. Martin Riedmiller, Albert-Ludwigs-University Freiburg, AG Maschinelles Lernen

Probabilities

Probabilistic statements subsume different effects due to:
- convenience: declaring all conditions, exceptions and assumptions would be too complicated. Example: "I will be in the lecture if I go to bed early enough the day before, and I do not become ill, and my car does not break down, and ..." Or simply: "I will be in the lecture with probability 0.87."
- lack of information: relevant information for a precise statement is missing. Example: weather forecasting.
- intrinsic randomness: non-deterministic processes. Example: the appearance of photons in a physical process.

Probabilities (cont.)

Intuitively, probabilities give the expected relative frequency of an event.
Mathematically, probabilities are defined by axioms (the Kolmogorov axioms). We assume a set of possible outcomes $\Omega$; an event $A$ is a subset of $\Omega$:
- the probability of an event $A$, $P(A)$, is a well-defined non-negative number: $P(A) \ge 0$
- the certain event $\Omega$ has probability 1: $P(\Omega) = 1$
- for two disjoint events $A$ and $B$: $P(A \cup B) = P(A) + P(B)$
$P$ is called a probability distribution.

Important conclusions (can be derived from the above axioms):
- $P(\emptyset) = 0$
- $P(\neg A) = 1 - P(A)$
- if $A \subseteq B$, then $P(A) \le P(B)$
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Probabilities (cont.)

Example: rolling a die, $\Omega = \{1, 2, 3, 4, 5, 6\}$.
Probability distribution (fair die): $P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$
Probabilities of events, e.g.:
- $P(\{1\}) = \frac{1}{6}$
- $P(\{1, 2\}) = P(\{1\}) + P(\{2\}) = \frac{1}{3}$
- $P(\{1, 2\} \cup \{2, 3\}) = \frac{1}{2}$
Probability distribution (manipulated die): $P(1) = P(2) = P(3) = 0.13$, $P(4) = P(5) = 0.17$, $P(6) = 0.27$
Typically, the actual probability distribution is not known in advance; it has to be estimated.
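
To make the die example and the axioms concrete, here is a minimal Python sketch (not part of the original slides): a finite distribution is stored as a dictionary over outcomes, and event probabilities are obtained by summing outcome probabilities. The helper name `prob` and the printed checks are my own additions.

```python
# Minimal sketch: finite probability space for a die roll (illustrative only).

fair = {k: 1 / 6 for k in range(1, 7)}                            # P(1) = ... = P(6) = 1/6
loaded = {1: 0.13, 2: 0.13, 3: 0.13, 4: 0.17, 5: 0.17, 6: 0.27}   # manipulated die

def prob(event, dist):
    """P(event) = sum of the probabilities of the outcomes in the event."""
    return sum(dist[w] for w in event)

A, B = {1, 2}, {2, 3}
print(prob({1}, fair))        # 0.1666...
print(prob(A | B, fair))      # P({1, 2, 3}) = 0.5
# inclusion-exclusion check: P(A u B) = P(A) + P(B) - P(A n B)
print(abs(prob(A | B, fair) - (prob(A, fair) + prob(B, fair) - prob(A & B, fair))) < 1e-12)
```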

Joint events

For a pair of events $A$, $B$, the joint probability expresses the probability of both events occurring at the same time: $P(A, B)$.
Example: P("Bayern München is losing", "Werder Bremen is winning") = 0.3
Definition: for two events, the conditional probability of $A$ given $B$, written $P(A \mid B)$, is defined as the probability of event $A$ if we consider only cases in which event $B$ occurs. In formulas:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B) > 0$$
With the above, we also have
$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
Example: P(caries | toothache) = 0.8, P(toothache | caries) = 0.3

Joint events (cont.)

A contingency table makes the relationship between joint probabilities and conditional probabilities clear:

            B             ¬B
  A       P(A, B)       P(A, ¬B)       P(A)
  ¬A      P(¬A, B)      P(¬A, ¬B)      P(¬A)
          P(B)          P(¬B)

The inner cells are the joint probabilities, the row and column sums are the marginals:
P(A) = P(A, B) + P(A, ¬B),  P(¬A) = P(¬A, B) + P(¬A, ¬B),
P(B) = P(A, B) + P(¬A, B),  P(¬B) = P(A, ¬B) + P(¬A, ¬B)
Conditional probability = joint probability / marginal probability.
(Example on the blackboard: cars, colours and drivers.)

Marginalisation

Let $B_1, \dots, B_n$ be disjoint events with $\bigcup_i B_i = \Omega$. Then
$$P(A) = \sum_i P(A, B_i)$$
This process is called marginalisation.
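
The contingency-table relationships and marginalisation can be illustrated with a small sketch. The joint probabilities below are invented for the example (they are not from the slides), and the helper names `marginal_A`, `marginal_B`, `conditional` are my own.

```python
# Joint distribution over two binary events A and B (invented numbers summing to 1).
joint = {(True, True): 0.20, (True, False): 0.30,
         (False, True): 0.10, (False, False): 0.40}

def marginal_A(a):
    # P(A=a) = sum over b of P(A=a, B=b)   (marginalisation)
    return sum(p for (ai, bi), p in joint.items() if ai == a)

def marginal_B(b):
    return sum(p for (ai, bi), p in joint.items() if bi == b)

def conditional(a, b):
    # P(A=a | B=b) = P(A=a, B=b) / P(B=b)
    return joint[(a, b)] / marginal_B(b)

print(marginal_A(True))         # P(A) = 0.5
print(marginal_B(True))         # P(B) = 0.3
print(conditional(True, True))  # P(A | B) = 0.2 / 0.3 = 0.666...
```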

Product rule and chain rule

From the definition of conditional probability:
$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
Repeated application gives the chain rule:
$$\begin{aligned}
P(A_1, \dots, A_n) &= P(A_n, \dots, A_1) \\
&= P(A_n \mid A_{n-1}, \dots, A_1)\,P(A_{n-1}, \dots, A_1) \\
&= P(A_n \mid A_{n-1}, \dots, A_1)\,P(A_{n-1} \mid A_{n-2}, \dots, A_1)\,P(A_{n-2}, \dots, A_1) \\
&= \dots \\
&= \prod_{i=1}^{n} P(A_i \mid A_1, \dots, A_{i-1})
\end{aligned}$$
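
As a numerical sanity check of the chain rule, the sketch below factorizes a joint distribution over three binary variables into P(a1) P(a2 | a1) P(a3 | a1, a2) and compares with the joint entry. The joint values are invented for the example and the helper name `marg` is my own.

```python
import itertools

# A joint distribution over three binary variables (invented numbers summing to 1).
joint = {bits: p for bits, p in zip(itertools.product([0, 1], repeat=3),
                                    [0.10, 0.05, 0.15, 0.20, 0.05, 0.10, 0.05, 0.30])}

def marg(**fixed):
    """Marginal probability of the assignment in `fixed`, e.g. marg(a1=1, a2=0)."""
    names = ['a1', 'a2', 'a3']
    return sum(p for bits, p in joint.items()
               if all(bits[names.index(k)] == v for k, v in fixed.items()))

# Chain rule: P(a1, a2, a3) = P(a1) * P(a2 | a1) * P(a3 | a1, a2)
a1, a2, a3 = 1, 0, 1
chain = marg(a1=a1) * (marg(a1=a1, a2=a2) / marg(a1=a1)) \
        * (marg(a1=a1, a2=a2, a3=a3) / marg(a1=a1, a2=a2))
print(abs(chain - joint[(a1, a2, a3)]) < 1e-12)   # True
```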

Conditional Probabilities

Conditionals. Example: if someone is taking a shower, he gets wet (by causality):
P(wet | taking a shower) = 1
while P(taking a shower | wet) = 0.4, because a person also gets wet if it is raining.

Causality and conditionals:
- Causality typically leads to conditional probabilities close to 1: P(wet | taking a shower) = 1, e.g. P(score a goal | shoot strongly) = 0.92 ("vague causality": if you shoot strongly, you very likely score a goal). This offers the possibility to express vagueness in reasoning.
- You cannot conclude causality from large conditional probabilities: P(being rich | owning an airplane) ≈ 1, but owning an airplane is not the reason for being rich.

Bayes rule

From the definition of conditional distributions:
$$P(A \mid B)\,P(B) = P(A, B) = P(B \mid A)\,P(A)$$
Hence
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
is known as Bayes rule.

Example:
P(taking a shower | wet) = P(wet | taking a shower) · P(taking a shower) / P(wet)
P(reason | observation) = P(observation | reason) · P(reason) / P(observation)

Bayes rule (cont.)

This is often useful in diagnosis situations, since P(observation | reason) might be easily determined.
It often delivers surprising results.

Bayes rule - Example

If a patient has meningitis, then very often a stiff neck is observed:
P(S | M) = 0.8 (can easily be determined by counting)
Observation: I have a stiff neck! Do I have meningitis? (Is it reasonable to be afraid?)
P(M | S) = ?
We also need to know: P(M) = 0.0001 (one out of 10000 people has meningitis) and P(S) = 0.1 (one out of 10 people has a stiff neck).
Then:
$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.8 \cdot 0.0001}{0.1} = 0.0008$$
Keep cool, it is not very likely.
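
The meningitis computation can be reproduced in a few lines of Python; this sketch uses exactly the numbers from the slide and nothing else.

```python
# Bayes rule with the numbers from the slide: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.8      # stiff neck given meningitis
p_m = 0.0001           # prior: 1 in 10000 people has meningitis
p_s = 0.1              # 1 in 10 people has a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)     # 0.0008
```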

Independence

Two events $A$ and $B$ are called independent if $P(A, B) = P(A) \cdot P(B)$.
Independence means: we cannot draw conclusions about $A$ if we know $B$, and vice versa. It follows that $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$.
Example of independent events: the outcomes of two dice rolls.
Example of dependent events: A = "the car is blue", B = "the driver is male" (contingency table on the blackboard).
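
Independence can be checked numerically against the definition. The events below (on two fair dice) are my own choice of illustration, not the car/driver example from the slide.

```python
# Check independence numerically: A and B are independent iff P(A,B) == P(A) * P(B).

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # 36 equally likely outcomes
p = 1 / 36

# Independent pair: "first die shows 6" and "second die shows 6".
A = {(i, j) for (i, j) in omega if i == 6}
B = {(i, j) for (i, j) in omega if j == 6}
p_A, p_B, p_AB = len(A) * p, len(B) * p, len(A & B) * p
print(abs(p_AB - p_A * p_B) < 1e-12)    # True: independent

# Dependent pair: "first die shows 6" and "sum is at least 11".
C = {(i, j) for (i, j) in omega if i + j >= 11}
p_C, p_AC = len(C) * p, len(A & C) * p
print(abs(p_AC - p_A * p_C) < 1e-12)    # False: knowing A changes the probability of C
```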

Random variables

Random variables describe the outcome of a random experiment in terms of a (real) number.
A random experiment is an experiment that can (in principle) be repeated several times under the same conditions.
There are discrete and continuous random variables.
Probability distributions of discrete random variables can be represented in tables. Example: random variable X (rolling a die):

  X       1     2     3     4     5     6
  P(X)   1/6   1/6   1/6   1/6   1/6   1/6

Probability distributions of continuous random variables need another form of representation.
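
A short simulation (my own addition, standard library only) connects the table representation with the earlier intuition that probabilities are expected relative frequencies: repeating the random experiment many times, the observed frequencies approach the pmf values.

```python
import random
from collections import Counter

# Discrete random variable X: value of a fair die roll, pmf stored as a table.
pmf = {x: 1 / 6 for x in range(1, 7)}

# Simulate repeated random experiments and compare relative frequencies with the pmf.
n = 100_000
counts = Counter(random.choices(list(pmf), weights=list(pmf.values()), k=n))
for x in sorted(pmf):
    print(x, pmf[x], counts[x] / n)   # relative frequencies approach 1/6
```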

Continuous random variables

Problem: infinitely many outcomes.
Instead of single real numbers, consider intervals: $P(a < X \le b)$.
Cumulative distribution function (cdf): a function $F : \mathbb{R} \to [0, 1]$ is called the cumulative distribution function of a random variable $X$ if for all $c \in \mathbb{R}$: $P(X \le c) = F(c)$.
- Knowing $F$, we can calculate $P(a < X \le b)$ for every interval from $a$ to $b$: $P(a < X \le b) = F(b) - F(a)$.
- $F$ is monotonically increasing, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
If it exists, the derivative of $F$ is called the probability density function (pdf). It yields large values in areas of large probability and small values in areas of small probability. But: the value of a pdf cannot be interpreted as a probability!

Continuous random variables (cont.)

Example: a continuous random variable that can take any value between $a$ and $b$ and does not prefer any value over another (uniform distribution).
[Figure: the cdf rises linearly from 0 at $x = a$ to 1 at $x = b$; the pdf is constant on $[a, b]$ and zero outside.]
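
A minimal sketch of the uniform example (the interval endpoints are chosen by me): it computes an interval probability from the cdf and shows that a density value can exceed 1, which is why it cannot be read as a probability.

```python
# Uniform distribution on [a, b]: pdf, cdf, and an interval probability.
a, b = 2.0, 2.5

def pdf(x):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def cdf(x):
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# P(a' < X <= b') = F(b') - F(a')
print(cdf(2.4) - cdf(2.2))   # 0.4
print(pdf(2.3))              # 2.0 -- a density value, not a probability
```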

Gaussian distribution

The Gaussian/normal distribution is a very important probability distribution. Its pdf is
$$\mathrm{pdf}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}$$
$\mu \in \mathbb{R}$ and $\sigma^2 > 0$ are parameters of the distribution. The cdf exists but cannot be expressed in a simple form.
$\mu$ controls the position of the distribution, $\sigma^2$ its spread.
[Figure: plots of the Gaussian cdf and pdf.]
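
A small sketch of the Gaussian pdf (my own addition, standard library only); the last line expresses the cdf via the error function, which is the usual workaround for the missing closed form.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """pdf(x) = 1 / sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

# mu shifts the bell curve, sigma^2 controls its spread.
print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))   # about 0.3989, the peak of the standard normal
print(gaussian_pdf(1.0, mu=0.0, sigma2=1.0))   # about 0.2420
# cdf of the standard normal at 1, expressed via the error function:
print(0.5 * (1 + math.erf((1.0 - 0.0) / math.sqrt(2 * 1.0))))   # P(X <= 1) ~ 0.8413
```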

Statistical inference

Determining the probability distribution of a random variable (estimation):
- collect the outcomes of repeated random experiments (a data sample)
- adapt a generic probability distribution to the data, for example:
  - a Bernoulli distribution (possible outcomes: 1 or 0) with success parameter $p$ (= probability of outcome "1")
  - a Gaussian distribution with parameters $\mu$ and $\sigma^2$
  - a uniform distribution with parameters $a$ and $b$
- maximum-likelihood approach: choose the parameters that maximize
$$P(\text{data sample} \mid \text{distribution})$$

Statistical inference (cont.)

Maximum likelihood with the Bernoulli distribution:
Assume a coin toss with a bent coin. How likely is it to observe "head"?
Repeat the experiment several times to get a sample of observations, e.g.: head, head, tail, head, tail, head, head, head, tail, tail, ... You observe $k$ times "head" and $n$ times "tail".
Probabilistic model: "head" occurs with (unknown) probability $p$, "tail" with probability $1 - p$.
Maximize the likelihood, e.g. for the above sample:
$$\max_p \; p \cdot p \cdot (1-p) \cdot p \cdot (1-p) \cdot p \cdot p \cdot p \cdot (1-p) \cdot (1-p) = p^k (1-p)^n$$

Statistical inference (cont.)

Simplification: minimize the negative log-likelihood instead:
$$\min_p \; -\log\!\big(p^k (1-p)^n\big) = -k \log p - n \log(1-p)$$
Calculating the partial derivative w.r.t. $p$ and setting it to zero gives
$$p = \frac{k}{k+n}$$
The relative frequency of observations is used as the estimator for $p$.
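
A minimal sketch of Bernoulli maximum likelihood (my own addition): the sample encodes the head/tail sequence from the slide as 1/0, the closed-form estimate is the relative frequency, and a crude check confirms it has lower negative log-likelihood than nearby values of p.

```python
import math

# Coin-toss sample from the slide: 1 = head, 0 = tail (6 heads, 4 tails).
sample = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]
k, n = sample.count(1), sample.count(0)

def neg_log_likelihood(p):
    return -(k * math.log(p) + n * math.log(1 - p))

# Closed-form maximum-likelihood estimate: the relative frequency of heads.
p_ml = k / (k + n)
print(p_ml)   # 0.6

# Sanity check: the NLL at p_ml is lower than at nearby values of p.
print(all(neg_log_likelihood(p_ml) <= neg_log_likelihood(q) for q in (0.4, 0.5, 0.7, 0.8)))
```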

Statistical inference (cont.)

Maximum likelihood with the Gaussian distribution:
Given: a data sample $\{x^{(1)}, \dots, x^{(p)}\}$.
Task: determine optimal values for $\mu$ and $\sigma^2$.
Assume independence of the observed data:
$$P(\text{data sample} \mid \text{distribution}) = P(x^{(1)} \mid \text{distribution}) \cdots P(x^{(p)} \mid \text{distribution})$$
Replacing probabilities by densities, the likelihood becomes
$$\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(x^{(1)}-\mu)^2}{\sigma^2}} \cdots \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(x^{(p)}-\mu)^2}{\sigma^2}}$$
Performing a log transformation:
$$\sum_{i=1}^{p} \left( \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\frac{(x^{(i)}-\mu)^2}{\sigma^2} \right)$$

Statistical inference (cont.)

Minimizing the negative log-likelihood instead of maximizing the log-likelihood:
$$\min_{\mu, \sigma^2} \; -\sum_{i=1}^{p} \left( \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\frac{(x^{(i)}-\mu)^2}{\sigma^2} \right)$$
Transforming into:
$$\min_{\mu, \sigma^2} \; \frac{p}{2}\log(\sigma^2) + \frac{p}{2}\log(2\pi) + \frac{1}{\sigma^2} \underbrace{\left( \frac{1}{2} \sum_{i=1}^{p} (x^{(i)}-\mu)^2 \right)}_{\text{squared error term}}$$
Observation: maximizing the likelihood w.r.t. $\mu$ is equivalent to minimizing the squared error term w.r.t. $\mu$.
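
A sketch of Gaussian maximum likelihood on synthetic data (my own addition, assuming NumPy is available; the "true" parameters 3 and 4 are invented). Zeroing the derivatives of the negative log-likelihood above gives the sample mean for mu and the mean squared deviation for sigma^2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)   # synthetic sample, true mu = 3, sigma^2 = 4

# Closed-form maximum-likelihood estimates.
mu_ml = x.mean()
sigma2_ml = ((x - mu_ml) ** 2).mean()

def nll(mu, sigma2):
    p = len(x)
    return 0.5 * p * np.log(sigma2) + 0.5 * p * np.log(2 * np.pi) \
           + 0.5 * np.sum((x - mu) ** 2) / sigma2

print(mu_ml, sigma2_ml)                                        # close to 3 and 4
print(nll(mu_ml, sigma2_ml) <= nll(mu_ml + 0.1, sigma2_ml))    # True: perturbing mu increases the NLL
```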

Statistical inference (cont.)

Extension: the regression case, where $\mu$ depends on an input pattern and some parameters.
Given: pairs of input patterns and target values $(\vec{x}^{(1)}, d^{(1)}), \dots, (\vec{x}^{(p)}, d^{(p)})$ and a parameterized function $f$ depending on some parameters $\vec{w}$.
Task: estimate $\vec{w}$ and $\sigma^2$ so that the residuals $d^{(i)} - f(\vec{x}^{(i)}; \vec{w})$ fit a Gaussian distribution in the best way.
Maximum likelihood principle:
$$\max_{\vec{w}, \sigma^2} \; \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(d^{(1)} - f(\vec{x}^{(1)}; \vec{w}))^2}{\sigma^2}} \cdots \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(d^{(p)} - f(\vec{x}^{(p)}; \vec{w}))^2}{\sigma^2}}$$

Statistical inference (cont.)

Minimizing the negative log-likelihood:
$$\min_{\vec{w}, \sigma^2} \; \frac{p}{2}\log(\sigma^2) + \frac{p}{2}\log(2\pi) + \frac{1}{\sigma^2} \underbrace{\left( \frac{1}{2} \sum_{i=1}^{p} \big(d^{(i)} - f(\vec{x}^{(i)}; \vec{w})\big)^2 \right)}_{\text{squared error term}}$$
$f$ could be, e.g., a linear function or a multi-layer perceptron.
[Figure: data points $(x, d)$ with a fitted curve $y = f(x)$.]
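
A sketch of the regression case for a linear f (my own addition, assuming NumPy; the data-generating parameters 1.5, 0.7 and the noise level are invented). Under the Gaussian noise model of the slide, maximizing the likelihood in the weights is exactly ordinary least squares, and the ML estimate of the noise variance is the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 50)
d = 1.5 * x + 0.7 + rng.normal(scale=0.5, size=x.shape)   # targets with Gaussian noise

# For a linear f(x; w) = w1 * x + w0, maximizing the likelihood w.r.t. w
# is the same as minimizing the squared error term (ordinary least squares).
X = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, d, rcond=None)
residuals = d - X @ w
sigma2_ml = np.mean(residuals ** 2)   # ML estimate of the noise variance

print(w)          # close to [1.5, 0.7]
print(sigma2_ml)  # close to 0.25
```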

Probability and machine learning

  machine learning                                              statistics
  unsupervised learning: create a model of observed patterns    estimating the probability distribution P(patterns)
  classification: guessing the class from an input pattern      estimating P(class | input pattern)
  regression: predicting the output from an input pattern       estimating P(output | input pattern)

Probabilities allow us to precisely describe the relationships in a certain domain, e.g. the distribution of the input data, the distribution of outputs conditioned on inputs, ...
Machine-learning principles like minimizing the squared error can be interpreted in a stochastic sense.
