Probabilistic & Unsupervised Learning. Introduction and Foundations

1 Probabilistic & Unsupervised Learning: Introduction and Foundations. Maneesh Sahani, Gatsby Computational Neuroscience Unit and MSc ML/CSML, Dept. of Computer Science, University College London. Term 1, Autumn 2018.

2-5 What do we mean by learning? [Painting: Jan Steen]
Not just remembering:
- Systematising (noisy) observations: discovering structure.
- Predicting new outcomes: generalising.
- Choosing actions wisely.

6-9 Three learning problems
- Systematising (noisy) observations: discovering structure. Unsupervised learning. Observe (sensory) input alone: x_1, x_2, x_3, x_4, ... Describe the pattern of the data [p(x)]; identify and extract underlying structural variables [x -> y].
- Predicting new outcomes: generalising. Supervised learning. Observe input/output pairs ("teaching"): (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), ... Predict the correct y for a new test input x.
- Choosing actions wisely. Reinforcement learning. Rewards or payoffs (and possibly also inputs) depend on actions: x_1: a_1 -> r_1, x_2: a_2 -> r_2, x_3: a_3 -> r_3, ... Find a policy for action choice that maximises payoff.

10 Unsupervised Learning
Find underlying structure: separate generating processes (clusters); reduced-dimensionality representations; good explanations (causes) of the data; modelling the data density.
[Figure: image patches from an image ensemble I, with filters W / basis functions Φ and causes a.]
Uses of unsupervised learning: structure discovery, science; data compression; outlier detection; input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards); a theory of biological learning and perception.

11 Supervised learning
Two main examples:
- Classification: discrete (class-label) outputs. [Figure: labelled points x and o separated in input space.]
- Regression: continuous-valued outputs. [Figure: curve fitted to (x, y) data.]
But also: ranks, relationships, trees, etc.
Variants may relate to unsupervised learning:
- semi-supervised learning (most x unlabelled; assumes the structure of {x} and the relationship x -> y are linked).
- multitask (transfer) learning (predict different y in different contexts; assumes links between the structures of the relationships).

12-23 A probabilistic approach
Data are generated by random and/or unknown processes. Our approach to learning starts with a probabilistic model of data production:
P(data | parameters): P(x | θ) or P(y | x, θ)
This is the generative model or likelihood.
The probabilistic model can be used to:
- make inferences about missing inputs
- generate predictions/fantasies/imagery
- make predictions or decisions which minimise expected loss
- communicate the data in an efficient way
Probabilistic modelling is often equivalent to other views of learning:
- information theoretic: finding compact representations of the data
- physical analogies: minimising the (free) energy of a corresponding statistical mechanical system
- structural risk: compensating for overconfidence in powerful models
The calculus of probabilities naturally handles randomness. It is also the right way to reason about unknown values.

24 Representing beliefs
Let b(x) represent our strength of belief in (plausibility of) proposition x: 0 ≤ b(x) ≤ 1; b(x) = 0 means x is definitely not true; b(x) = 1 means x is definitely true; b(x | y) is the strength of belief that x is true given that we know y is true.
Cox Axioms (Desiderata):
- Let b(x) be real. As b(x) increases, b(¬x) decreases, so the function mapping b(x) to b(¬x) is monotonically decreasing and self-inverse.
- b(x ∧ y) depends only on b(y) and b(x | y).
- Consistency: if a conclusion can be reasoned in more than one way, then every way should lead to the same answer. Beliefs always take into account all relevant evidence. Equivalent states of knowledge are represented by equivalent plausibility assignments.
Consequence: belief functions (e.g. b(x), b(x | y), b(x, y)) must be isomorphic to probabilities, satisfying all the usual laws, including Bayes' rule. (See Jaynes, Probability Theory: The Logic of Science.)

25-30 Basic rules of probability
- Probabilities are non-negative: P(x) ≥ 0 for all x.
- Probabilities normalise: Σ_{x∈X} P(x) = 1 for distributions if x is a discrete variable, and ∫ p(x) dx = 1 for probability densities over continuous variables.
- The joint probability of x and y is P(x, y).
- The marginal probability of x is P(x) = Σ_y P(x, y), assuming y is discrete.
- The conditional probability of x given y is P(x | y) = P(x, y)/P(y).
- Bayes' rule: P(x, y) = P(x)P(y | x) = P(y)P(x | y), so P(y | x) = P(x | y)P(y)/P(x).
Warning: I will not be obsessively careful in my use of p and P for probability density and probability distribution; it should be obvious from context.
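These rules are easy to exercise numerically. Below is a minimal Python sketch (the joint table is made up purely for illustration) that marginalises, conditions, and checks Bayes' rule for a pair of binary variables.

```python
import numpy as np

# A made-up joint distribution P(x, y) over two binary variables.
# Rows index x, columns index y; entries are non-negative and sum to 1.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

P_x = P_xy.sum(axis=1)             # marginal P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)             # marginal P(y)
P_x_given_y = P_xy / P_y           # conditional P(x|y) = P(x, y) / P(y)
P_y_given_x = P_xy / P_x[:, None]  # conditional P(y|x) = P(x, y) / P(x)

# Bayes' rule: P(y|x) = P(x|y) P(y) / P(x)
bayes = P_x_given_y * P_y / P_x[:, None]
assert np.allclose(bayes, P_y_given_x)
print(P_x, P_y, P_y_given_x)
```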

31-34 The Dutch book theorem
Assume you are willing to accept bets with odds proportional to the strength of your beliefs. That is, b(x) = 0.9 implies that you will accept a bet on x at 1:9, in which you win $1 if x is true and lose $9 if x is false.
Then, unless your beliefs satisfy the rules of probability theory, including Bayes' rule, there exists a set of simultaneous bets (called a "Dutch book") which you are willing to accept, and for which you are guaranteed to lose money, no matter what the outcome.
E.g. suppose A ∩ B = ∅, and b(A) = 0.3, b(B) = 0.2, b(A ∪ B) = 0.6. Then you accept the bets: ¬A at 3:7, ¬B at 2:8, A ∪ B at 4:6.
But then, whatever happens, you lose $1: if A (and not B), your total winnings are -7 + 2 + 4 = -1; if B (and not A), they are 3 - 8 + 4 = -1; if neither, they are 3 + 2 - 6 = -1.
The only way to guard against Dutch books is to ensure that your beliefs are coherent, i.e. that they satisfy the rules of probability.
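A quick way to see the trap is to enumerate the three possible outcomes and add up the payoffs of the accepted bets. The sketch below assumes the bets exactly as reconstructed above (¬A at 3:7, ¬B at 2:8, A ∪ B at 4:6); in every case the total is -$1.

```python
# Payoff of each accepted bet under the three possible outcomes (A and B are disjoint).
# "not-A at 3:7": win $3 when A is false, lose $7 when A is true, and so on.
bet_not_A  = {'A': -7, 'B': +3, 'neither': +3}
bet_not_B  = {'A': +2, 'B': -8, 'neither': +2}
bet_A_or_B = {'A': +4, 'B': +4, 'neither': -6}

for outcome in ['A', 'B', 'neither']:
    total = bet_not_A[outcome] + bet_not_B[outcome] + bet_A_or_B[outcome]
    print(outcome, total)   # prints -1 for every outcome
```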

35-39 Bayesian learning
Apply the basic rules of probability to learning from data.
Problem specification: data D = {x_1, ..., x_n}; models M_1, M_2, etc.; parameters θ_i (per model); prior probability of models P(M_i); prior probabilities of model parameters P(θ_i | M_i); model of data given parameters (likelihood model) P(x | θ_i, M_i).
Data probability (likelihood): P(D | θ_i, M_i) = ∏_{j=1}^n P(x_j | θ_i, M_i) ≡ L(θ_i), provided the data are independently and identically distributed (iid).
Parameter learning (posterior):
P(θ_i | D, M_i) = P(D | θ_i, M_i) P(θ_i | M_i) / P(D | M_i),  with  P(D | M_i) = ∫ dθ_i P(D | θ_i, M_i) P(θ_i | M_i).
P(D | M_i) is called the marginal likelihood or evidence for M_i. It is proportional to the posterior probability of model M_i being the one that generated the data.
Model selection: P(M_i | D) = P(D | M_i) P(M_i) / P(D).

40-44 Bayesian learning: A coin-toss example
Coin toss: one parameter q, the probability of obtaining heads. So our space of models is the set of distributions over q ∈ [0, 1].
Learner A believes model M_A: all values of q are equally plausible.
Learner B believes model M_B: it is more plausible that the coin is fair (q ≈ 0.5) than biased.
[Figure: the two prior densities P(q) over q: flat for A, peaked at q = 0.5 for B.]
Both prior beliefs can be described by the Beta distribution:
p(q | α_1, α_2) = q^(α_1 - 1) (1 - q)^(α_2 - 1) / B(α_1, α_2) = Beta(q | α_1, α_2),
with A: α_1 = α_2 = 1.0 and B: α_1 = α_2 = 4.0.
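To make the two priors concrete, here is a small sketch (using scipy.stats.beta; the grid and print statements are just for illustration) that evaluates each learner's prior density at the α values quoted above.

```python
import numpy as np
from scipy.stats import beta

q = np.linspace(0.01, 0.99, 99)
prior_A = beta.pdf(q, 1.0, 1.0)   # learner A: Beta(1, 1), flat over [0, 1]
prior_B = beta.pdf(q, 4.0, 4.0)   # learner B: Beta(4, 4), peaked around q = 0.5

print(prior_A.min(), prior_A.max())     # both 1.0: A's prior is uniform
print(round(q[np.argmax(prior_B)], 2))  # 0.5: B's prior favours a fair coin
```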

45-51 Bayesian learning: The coin toss (cont.)
Now we observe a toss. Two possible outcomes: p(H | q) = q and p(T | q) = 1 - q.
Suppose our single coin toss comes out heads. The probability of the observed data (the likelihood) is p(H | q) = q.
Using Bayes' rule, we multiply the prior p(q) by the likelihood and renormalise to get the posterior probability:
p(q | H) = p(q) p(H | q) / p(H) ∝ q Beta(q | α_1, α_2) ∝ q q^(α_1 - 1) (1 - q)^(α_2 - 1) ∝ Beta(q | α_1 + 1, α_2).

52 Bayesian learning: The coin toss (cont.)
[Figure: priors and posteriors for the two learners. A: prior Beta(q | 1, 1), posterior Beta(q | 2, 1). B: prior Beta(q | 4, 4), posterior Beta(q | 5, 4).]

53-59 Bayesian learning: The coin toss (cont.)
What about multiple tosses? Suppose we observe D = {H H T H T T}:
p({H H T H T T} | q) = q q (1 - q) q (1 - q)(1 - q) = q^3 (1 - q)^3
This is still straightforward:
p(q | D) = p(q) p(D | q) / p(D) ∝ q^3 (1 - q)^3 Beta(q | α_1, α_2) ∝ Beta(q | α_1 + 3, α_2 + 3)
[Figure: the resulting posteriors P(q) for learners A and B.]
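The update itself is just counting. A minimal sketch of the Beta-Bernoulli update for this data set, with both learners' prior settings (the helper function name is my own):

```python
# Beta-Bernoulli posterior update for the sequence D = H H T H T T.
data = ['H', 'H', 'T', 'H', 'T', 'T']
n_heads = data.count('H')          # 3
n_tails = data.count('T')          # 3

def posterior(alpha1, alpha2, heads, tails):
    """Posterior Beta parameters after observing the given counts."""
    return alpha1 + heads, alpha2 + tails

print(posterior(1.0, 1.0, n_heads, n_tails))   # learner A: Beta(4, 4)
print(posterior(4.0, 4.0, n_heads, n_tails))   # learner B: Beta(7, 7)
```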

60-64 Conjugate priors
Updating the prior to form the posterior was particularly easy in these examples. This is because we used a conjugate prior for an exponential-family likelihood.
Exponential-family distributions take the form
P(x | θ) = g(θ) f(x) exp{φ(θ)^T T(x)},
with g(θ) the normalising constant. Given n iid observations,
P({x_i} | θ) = ∏_i P(x_i | θ) = g(θ)^n (∏_i f(x_i)) exp{φ(θ)^T Σ_i T(x_i)}.
Thus, if the prior takes the conjugate form
P(θ) = F(τ, ν) g(θ)^ν exp{φ(θ)^T τ},
with F(τ, ν) the normaliser, then the posterior is
P(θ | {x_i}) ∝ P({x_i} | θ) P(θ) ∝ g(θ)^(ν + n) exp{φ(θ)^T (τ + Σ_i T(x_i))},
with the normaliser given by F(τ + Σ_i T(x_i), ν + n).

65-66 Conjugate priors
The posterior given an exponential-family likelihood and conjugate prior is:
P(θ | {x_i}) = F(τ + Σ_i T(x_i), ν + n) g(θ)^(ν + n) exp{φ(θ)^T (τ + Σ_i T(x_i))}
Here, φ(θ) is the vector of natural parameters, T(x_i) is the vector of sufficient statistics, τ are pseudo-observations which define the prior, and ν is the scale of the prior (it need not be an integer). As new data come in, each observation increments the sufficient-statistics vector and the scale to define the posterior.
The prior appears to be based on pseudo-observations, but:
1. This is different to applying Bayes' rule: there is no prior for such an update to start from! Sometimes we can take a uniform prior (say on [0, 1] for q), but for unbounded θ there may be no equivalent.
2. A valid conjugate prior might have non-integral ν or impossible τ, with no likelihood equivalent.

67-73 Conjugacy in the coin flip
Distributions are not always written in their natural exponential form. The Bernoulli distribution (a single coin flip) with parameter q and observation x ∈ {0, 1} can be written:
P(x | q) = q^x (1 - q)^(1 - x) = exp{x log q + (1 - x) log(1 - q)} = exp{log(1 - q) + x log(q/(1 - q))} = (1 - q) exp{log(q/(1 - q)) x}
So the natural parameter is the log odds log(q/(1 - q)), and the sufficient statistic (for multiple tosses) is the number of heads.
The conjugate prior is
P(q) = F(τ, ν) (1 - q)^ν exp{log(q/(1 - q)) τ} = F(τ, ν) (1 - q)^ν exp{τ log q - τ log(1 - q)} = F(τ, ν) (1 - q)^(ν - τ) q^τ,
which has the form of the Beta distribution, with F(τ, ν) = 1/B(τ + 1, ν - τ + 1).
In general, then, the posterior will be P(q | {x_i}) = Beta(α_1, α_2), with
α_1 = 1 + τ + Σ_i x_i   and   α_2 = 1 + (ν + n) - (τ + Σ_i x_i).
If we observe a head, we add 1 to the sufficient statistic Σ_i x_i and also 1 to the count n; this increments α_1. If we observe a tail, we add 1 to n but not to Σ_i x_i, incrementing α_2.
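The same update can be written in the natural-parameter bookkeeping used above, with τ accumulating the sufficient statistic Σ_i x_i and ν the count n. This is a sketch (function names are my own); starting from τ = 0, ν = 0 corresponds to the uniform Beta(1, 1) prior, and it reproduces the Beta(4, 4) posterior obtained earlier for D = {H H T H T T}.

```python
def update_natural(tau, nu, flips):
    """Accumulate the sufficient statistic (number of heads) and the count."""
    for x in flips:          # x = 1 for heads, 0 for tails
        tau += x
        nu += 1
    return tau, nu

def to_beta(tau, nu):
    """Map (tau, nu) to the Beta parameters alpha1, alpha2."""
    return 1 + tau, 1 + nu - tau

tau, nu = update_natural(0, 0, [1, 1, 0, 1, 0, 0])   # D = H H T H T T
print(to_beta(tau, nu))   # (4, 4): matches the direct Beta update above
```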

74-77 Bayesian coins: comparing models
We have seen how to update posteriors within each model. To study the choice of model, consider two more extreme models: "fair" and "bent". A priori, we may think that fair is more probable, e.g. p(fair) = 0.8, p(bent) = 0.2.
For the bent coin, we assume all parameter values are equally likely, whilst the fair coin has a fixed probability:
[Figure: p(q | bent) is flat over the parameter q; p(q | fair) puts all its mass at q = 0.5.]
We make 10 tosses, and get D = (T H T H T T T T T T).

78-86 Bayesian coins: comparing models
Which model should we prefer a posteriori (i.e. after seeing the data)?
The evidence for the fair model is P(D | fair) = (1/2)^10 ≈ 0.001, and for the bent model it is
P(D | bent) = ∫ dq P(D | q, bent) p(q | bent) = ∫ dq q^2 (1 - q)^8 = B(3, 9) ≈ 0.002.
Thus, by Bayes' rule, the posterior for the models is P(fair | D) ≈ 0.66 and P(bent | D) ≈ 0.34, i.e. roughly a two-thirds probability that the coin is fair.
How do we make predictions? We could choose the fair model (model selection). Or we could weight the predictions from each model by their probability (model averaging). Under the bent model the posterior over q is Beta(q | 3, 9), with mean 3/12, so the probability of a head at the next toss is:
P(H | D) = P(H | D, fair) P(fair | D) + P(H | D, bent) P(bent | D) ≈ (1/2)(2/3) + (3/12)(1/3) = 5/12.
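These numbers can be reproduced in a few lines. The sketch below uses scipy.special.beta for the Beta function B(·, ·); the slight difference from 5/12 is because 5/12 comes from rounding the model posteriors to 2/3 and 1/3.

```python
from scipy.special import beta as beta_fn

p_fair, p_bent = 0.8, 0.2
n_heads, n_tails = 2, 8                      # D = (T H T H T T T T T T)

ev_fair = 0.5 ** (n_heads + n_tails)         # P(D|fair) = (1/2)^10, about 0.001
ev_bent = beta_fn(n_heads + 1, n_tails + 1)  # integral of q^2 (1-q)^8 dq = B(3, 9), about 0.002

Z = ev_fair * p_fair + ev_bent * p_bent
post_fair = ev_fair * p_fair / Z             # about 0.66
post_bent = ev_bent * p_bent / Z             # about 0.34

# Model averaging: P(H|D, bent) is the mean of Beta(3, 9), i.e. 3/12.
p_head = 0.5 * post_fair + (3 / 12) * post_bent
print(round(post_fair, 3), round(post_bent, 3), round(p_head, 3))   # ~0.659, 0.341, 0.415
```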

87-98 Learning parameters
The Bayesian probabilistic prescription tells us how to reason about models and their parameters, but it is often impractical for realistic models (outside the exponential family).
Point estimates of parameters or other predictions:
- Bayes point estimate: compute the posterior and find the single parameter that minimises the expected loss, θ_BP = argmin_θ̂ ⟨L(θ̂, θ)⟩_{P(θ|D)}. The posterior mean ⟨θ⟩_{P(θ|D)} minimises the squared loss.
- Maximum a posteriori (MAP) estimate: assume a prior over the model parameters P(θ), and compute the parameters that are most probable under the posterior, θ_MAP = argmax_θ P(θ | D) = argmax_θ P(θ) P(D | θ). Equivalent to minimising the 0/1 loss.
- Maximum likelihood (ML) learning: no prior over the parameters. Compute the parameter value that maximises the likelihood function alone, θ_ML = argmax_θ P(D | θ). Parameterisation-independent.
Approximations may allow us to recover samples from the posterior, or to find a distribution which is close in some sense. Choosing between these and other alternatives may be a matter of definition, of goals (loss function), or of practicality.
For the next few weeks we will look at ML and MAP learning in more complex models. We will then return to the fully Bayesian formulation for the few interesting cases where it is tractable. Approximations will be addressed in the second half of the course.
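For the Beta-Bernoulli coin model all three point estimates have closed forms, which makes the differences easy to see. The sketch below (my own helper, assuming a Beta(α_1, α_2) prior and h heads in n tosses) returns the ML estimate h/n, the MAP estimate (the posterior mode), and the posterior mean.

```python
def point_estimates(h, n, alpha1=4.0, alpha2=4.0):
    """ML, MAP and posterior-mean estimates of q for a Beta-Bernoulli model."""
    q_ml = h / n                                          # maximises the likelihood alone
    q_map = (h + alpha1 - 1) / (n + alpha1 + alpha2 - 2)  # mode of the Beta posterior
    q_mean = (h + alpha1) / (n + alpha1 + alpha2)         # minimises squared loss
    return q_ml, q_map, q_mean

# The Beta(4, 4) prior pulls the MAP and posterior-mean estimates towards q = 0.5.
print(point_estimates(h=2, n=10))    # (0.2, 0.3125, 0.333...)
print(point_estimates(h=9, n=10))    # (0.9, 0.75, 0.722...)
```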

99-101 Modelling associations between variables
Data set D = {x_1, ..., x_N}, with each data point a vector of D features: x_i = [x_i1 ... x_iD]. Assume the data are iid (independent and identically distributed).
[Figure: scatter plot of two correlated features, x_1 against x_2.]
A simple form of unsupervised (structure) learning: model the mean of the data and the correlations between the D features in the data.
We can use a multivariate Gaussian model:
p(x | μ, Σ) = N(μ, Σ) = |2πΣ|^(-1/2) exp{-(1/2)(x - μ)^T Σ^(-1) (x - μ)}
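The density formula can be checked directly against a library implementation. A sketch with an arbitrary 2-D mean and covariance (scipy.stats.multivariate_normal is used only as a reference):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.0])

# log of |2*pi*Sigma|^(-1/2) * exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))
d = x - mu
logp = -0.5 * np.log(np.linalg.det(2 * np.pi * Sigma)) - 0.5 * d @ np.linalg.solve(Sigma, d)

assert np.isclose(logp, multivariate_normal(mu, Sigma).logpdf(x))
print(logp)
```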

102-107 ML Learning for a Gaussian
Data set D = {x_1, ..., x_N}; likelihood: p(D | μ, Σ) = ∏_{n=1}^N p(x_n | μ, Σ).
Goal: find μ and Σ that maximise the likelihood, or equivalently the log likelihood:
l = log ∏_{n=1}^N p(x_n | μ, Σ) = Σ_n log p(x_n | μ, Σ) = -(N/2) log|2πΣ| - (1/2) Σ_n (x_n - μ)^T Σ^(-1) (x_n - μ)
Note: equivalently, minimise -l, which is quadratic in μ.
Procedure: take derivatives and set to zero:
∂l/∂μ = 0  ⇒  μ̂ = (1/N) Σ_n x_n  (sample mean)
∂l/∂Σ = 0  ⇒  Σ̂ = (1/N) Σ_n (x_n - μ̂)(x_n - μ̂)^T  (sample covariance)
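These ML estimates are just the sample mean and the 1/N (biased) sample covariance, which a short sketch can confirm on synthetic data (the generating matrix and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ np.array([[1.0, 0.0, 0.0],
                                              [0.5, 1.0, 0.0],
                                              [0.2, 0.3, 1.0]]) + np.array([1.0, 2.0, 3.0])
N = X.shape[0]

mu_hat = X.mean(axis=0)           # ML estimate: sample mean
diff = X - mu_hat
Sigma_hat = diff.T @ diff / N     # ML estimate: sample covariance (divide by N, not N-1)

assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
print(mu_hat)
```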

108-129 Refresher: matrix derivatives of scalar forms
We will use the following facts:
x^T A y = y^T A^T x = Tr[x^T A y]   (scalars equal their own transpose and trace)
Tr[A] = Tr[A^T]
Tr[ABC] = Tr[CAB] = Tr[BCA]
∂/∂A_ij Tr[A^T B] = ∂/∂A_ij Σ_n [A^T B]_nn = ∂/∂A_ij Σ_mn A_mn B_mn = B_ij,  so  ∂/∂A Tr[A^T B] = B
∂/∂A Tr[A^T B A C] = ∂/∂A Tr[F_1(A)^T B F_2(A) C]   (with F_1 and F_2 both identity maps)
  = ∇_{F_1} Tr[F_1^T B F_2 C] ∂F_1/∂A + ∇_{F_2} Tr[F_2^T B^T F_1 C^T] ∂F_2/∂A
  = B F_2 C + B^T F_1 C^T = BAC + B^T A C^T
∂/∂A_ij log|A| = (1/|A|) ∂|A|/∂A_ij = (1/|A|) ∂/∂A_ij Σ_k (-1)^(i+k) A_ik |[A]_ik| = (1/|A|) (-1)^(i+j) |[A]_ij| = [A^(-1)]_ji   (where [A]_ik denotes A with row i and column k removed),  so  ∂/∂A log|A| = (A^(-1))^T
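The last two identities are easy to sanity-check numerically with finite differences; the sketch below uses small random matrices (a symmetric positive-definite A so that log|A| is well defined).

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)       # symmetric positive definite, so |A| > 0
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
eps = 1e-6

def num_grad(f, X):
    """Central finite-difference gradient of a scalar function f at the matrix X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# d/dA Tr[A^T B A C] = B A C + B^T A C^T
g1 = num_grad(lambda X: np.trace(X.T @ B @ X @ C), A)
assert np.allclose(g1, B @ A @ C + B.T @ A @ C.T, atol=1e-4)

# d/dA log|A| = (A^{-1})^T
g2 = num_grad(lambda X: np.log(np.linalg.det(X)), A)
assert np.allclose(g2, np.linalg.inv(A).T, atol=1e-4)
print("both identities verified")
```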


More information

Engineering Risk Benefit Analysis

Engineering Risk Benefit Analysis Engneerng Rsk Beneft Analyss.55, 2.943, 3.577, 6.938, 0.86, 3.62, 6.862, 22.82, ESD.72, ESD.72 RPRA 2. Elements of Probablty Theory George E. Apostolaks Massachusetts Insttute of Technology Sprng 2007

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov

9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov 9.93 Class IV Part I Bayesan Decson Theory Yur Ivanov TOC Roadmap to Machne Learnng Bayesan Decson Makng Mnmum Error Rate Decsons Mnmum Rsk Decsons Mnmax Crteron Operatng Characterstcs Notaton x - scalar

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition) Count Data Models See Book Chapter 11 2 nd Edton (Chapter 10 1 st Edton) Count data consst of non-negatve nteger values Examples: number of drver route changes per week, the number of trp departure changes

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maxmum Lkelhood Estmaton INFO-2301: Quanttatve Reasonng 2 Mchael Paul and Jordan Boyd-Graber MARCH 7, 2017 INFO-2301: Quanttatve Reasonng 2 Paul and Boyd-Graber Maxmum Lkelhood Estmaton 1 of 9 Why MLE?

More information

Probability Theory (revisited)

Probability Theory (revisited) Probablty Theory (revsted) Summary Probablty v.s. plausblty Random varables Smulaton of Random Experments Challenge The alarm of a shop rang. Soon afterwards, a man was seen runnng n the street, persecuted

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

First Year Examination Department of Statistics, University of Florida

First Year Examination Department of Statistics, University of Florida Frst Year Examnaton Department of Statstcs, Unversty of Florda May 7, 010, 8:00 am - 1:00 noon Instructons: 1. You have four hours to answer questons n ths examnaton.. You must show your work to receve

More information

Space of ML Problems. CSE 473: Artificial Intelligence. Parameter Estimation and Bayesian Networks. Learning Topics

Space of ML Problems. CSE 473: Artificial Intelligence. Parameter Estimation and Bayesian Networks. Learning Topics /7/7 CSE 73: Artfcal Intellgence Bayesan - Learnng Deter Fox Sldes adapted from Dan Weld, Jack Breese, Dan Klen, Daphne Koller, Stuart Russell, Andrew Moore & Luke Zettlemoyer What s Beng Learned? Space

More information

Chapter 9: Statistical Inference and the Relationship between Two Variables

Chapter 9: Statistical Inference and the Relationship between Two Variables Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,

More information

Gaussian process classification: a message-passing viewpoint

Gaussian process classification: a message-passing viewpoint Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Classification learning II

Classification learning II Lecture 8 Classfcaton learnng II Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Logstc regresson model Defnes a lnear decson boundar Dscrmnant functons: g g g g here g z / e z f, g g - s a logstc functon

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Generative classification models

Generative classification models CS 675 Intro to Machne Learnng Lecture Generatve classfcaton models Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Data: D { d, d,.., dn} d, Classfcaton represents a dscrete class value Goal: learn

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models

More information

MIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU

MIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU Group M D L M Chapter Bayesan Decson heory Xn-Shun Xu @ SDU School of Computer Scence and echnology, Shandong Unversty Bayesan Decson heory Bayesan decson theory s a statstcal approach to data mnng/pattern

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probablstc & Unsupervsed Learnng Convex Algorthms n Approxmate Inference Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computatonal Neuroscence Unt Unversty College London Term 1, Autumn 2008 Convexty A convex

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs

More information

Statistical pattern recognition

Statistical pattern recognition Statstcal pattern recognton Bayes theorem Problem: decdng f a patent has a partcular condton based on a partcular test However, the test s mperfect Someone wth the condton may go undetected (false negatve

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,

More information

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s

More information

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation An Experment/Some Intuton I have three cons n my pocket, 6.864 (Fall 2006): Lecture 18 The EM Algorthm Con 0 has probablty λ of heads; Con 1 has probablty p 1 of heads; Con 2 has probablty p 2 of heads

More information

Clustering & Unsupervised Learning

Clustering & Unsupervised Learning Clusterng & Unsupervsed Learnng Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 2012 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

/ n ) are compared. The logic is: if the two

/ n ) are compared. The logic is: if the two STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Communication with AWGN Interference

Communication with AWGN Interference Communcaton wth AWG Interference m {m } {p(m } Modulator s {s } r=s+n Recever ˆm AWG n m s a dscrete random varable(rv whch takes m wth probablty p(m. Modulator maps each m nto a waveform sgnal s m=m

More information

Why Bayesian? 3. Bayes and Normal Models. State of nature: class. Decision rule. Rev. Thomas Bayes ( ) Bayes Theorem (yes, the famous one)

Why Bayesian? 3. Bayes and Normal Models. State of nature: class. Decision rule. Rev. Thomas Bayes ( ) Bayes Theorem (yes, the famous one) Why Bayesan? 3. Bayes and Normal Models Alex M. Martnez alex@ece.osu.edu Handouts Handoutsfor forece ECE874 874Sp Sp007 If all our research (n PR was to dsappear and you could only save one theory, whch

More information

x i1 =1 for all i (the constant ).

x i1 =1 for all i (the constant ). Chapter 5 The Multple Regresson Model Consder an economc model where the dependent varable s a functon of K explanatory varables. The economc model has the form: y = f ( x,x,..., ) xk Approxmate ths by

More information

Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis

Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis Statstcal analyss usng matlab HY 439 Presented by: George Fortetsanaks Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons Contnuous dstrbutons Contnuous random varable X

More information

Clustering & (Ken Kreutz-Delgado) UCSD

Clustering & (Ken Kreutz-Delgado) UCSD Clusterng & Unsupervsed Learnng Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y ), fnd an approxmatng

More information

Expectation propagation

Expectation propagation Expectaton propagaton Lloyd Ellott May 17, 2011 Suppose p(x) s a pdf and we have a factorzaton p(x) = 1 Z n f (x). (1) =1 Expectaton propagaton s an nference algorthm desgned to approxmate the factors

More information

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF 10-708: Probablstc Graphcal Models 10-708, Sprng 2014 8 : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands Content. Inference on Regresson Parameters a. Fndng Mean, s.d and covarance amongst estmates.. Confdence Intervals and Workng Hotellng Bands 3. Cochran s Theorem 4. General Lnear Testng 5. Measures of

More information

A be a probability space. A random vector

A be a probability space. A random vector Statstcs 1: Probablty Theory II 8 1 JOINT AND MARGINAL DISTRIBUTIONS In Probablty Theory I we formulate the concept of a (real) random varable and descrbe the probablstc behavor of ths random varable by

More information

Artificial Intelligence Bayesian Networks

Artificial Intelligence Bayesian Networks Artfcal Intellgence Bayesan Networks Adapted from sldes by Tm Fnn and Mare desjardns. Some materal borrowed from Lse Getoor. 1 Outlne Bayesan networks Network structure Condtonal probablty tables Condtonal

More information

Explaining the Stein Paradox

Explaining the Stein Paradox Explanng the Sten Paradox Kwong Hu Yung 1999/06/10 Abstract Ths report offers several ratonale for the Sten paradox. Sectons 1 and defnes the multvarate normal mean estmaton problem and ntroduces Sten

More information

Statistics for Economics & Business

Statistics for Economics & Business Statstcs for Economcs & Busness Smple Lnear Regresson Learnng Objectves In ths chapter, you learn: How to use regresson analyss to predct the value of a dependent varable based on an ndependent varable

More information

Semi-Supervised Learning

Semi-Supervised Learning Sem-Supervsed Learnng Consder the problem of Prepostonal Phrase Attachment. Buy car wth money ; buy car wth wheel There are several ways to generate features. Gven the lmted representaton, we can assume

More information

Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD

Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory Nuno Vasconcelos ECE Department UCSD Notaton the notaton n DHS s qute sloppy e.. show that error error z z dz really not clear what ths means we wll use the follown notaton subscrpts

More information

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30 STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear

More information

Randomness and Computation

Randomness and Computation Randomness and Computaton or, Randomzed Algorthms Mary Cryan School of Informatcs Unversty of Ednburgh RC 208/9) Lecture 0 slde Balls n Bns m balls, n bns, and balls thrown unformly at random nto bns usually

More information

Stat 543 Exam 2 Spring 2016

Stat 543 Exam 2 Spring 2016 Stat 543 Exam 2 Sprng 206 I have nether gven nor receved unauthorzed assstance on ths exam. Name Sgned Date Name Prnted Ths Exam conssts of questons. Do at least 0 of the parts of the man exam. I wll score

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

1/10/18. Definitions. Probabilistic models. Why probabilistic models. Example: a fair 6-sided dice. Probability

1/10/18. Definitions. Probabilistic models. Why probabilistic models. Example: a fair 6-sided dice. Probability /0/8 I529: Machne Learnng n Bonformatcs Defntons Probablstc models Probablstc models A model means a system that smulates the object under consderaton A probablstc model s one that produces dfferent outcomes

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

EGR 544 Communication Theory

EGR 544 Communication Theory EGR 544 Communcaton Theory. Informaton Sources Z. Alyazcoglu Electrcal and Computer Engneerng Department Cal Poly Pomona Introducton Informaton Source x n Informaton sources Analog sources Dscrete sources

More information

Probability and Random Variable Primer

Probability and Random Variable Primer B. Maddah ENMG 622 Smulaton 2/22/ Probablty and Random Varable Prmer Sample space and Events Suppose that an eperment wth an uncertan outcome s performed (e.g., rollng a de). Whle the outcome of the eperment

More information

Lecture 3: Shannon s Theorem

Lecture 3: Shannon s Theorem CSE 533: Error-Correctng Codes (Autumn 006 Lecture 3: Shannon s Theorem October 9, 006 Lecturer: Venkatesan Guruswam Scrbe: Wdad Machmouch 1 Communcaton Model The communcaton model we are usng conssts

More information

How its computed. y outcome data λ parameters hyperparameters. where P denotes the Laplace approximation. k i k k. Andrew B Lawson 2013

How its computed. y outcome data λ parameters hyperparameters. where P denotes the Laplace approximation. k i k k. Andrew B Lawson 2013 Andrew Lawson MUSC INLA INLA s a relatvely new tool that can be used to approxmate posteror dstrbutons n Bayesan models INLA stands for ntegrated Nested Laplace Approxmaton The approxmaton has been known

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Other NN Models. Reinforcement learning (RL) Probabilistic neural networks

Other NN Models. Reinforcement learning (RL) Probabilistic neural networks Other NN Models Renforcement learnng (RL) Probablstc neural networks Support vector machne (SVM) Renforcement learnng g( (RL) Basc deas: Supervsed dlearnng: (delta rule, BP) Samples (x, f(x)) to learn

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information