Probabilistic & Unsupervised Learning. Introduction and Foundations

1 Probabilistic & Unsupervised Learning: Introduction and Foundations. Maneesh Sahani, Gatsby Computational Neuroscience Unit and MSc ML/CSML, Dept. of Computer Science, University College London. Term 1, Autumn 2018.

2-5 What do we mean by learning? [Painting: Jan Steen]
Not just remembering:
- Systematising (noisy) observations: discovering structure.
- Predicting new outcomes: generalising.
- Choosing actions wisely.

6-9 Three learning problems
- Systematising (noisy) observations: discovering structure. Unsupervised learning. Observe (sensory) input alone: x_1, x_2, x_3, x_4, ... Describe the pattern of the data [p(x)]; identify and extract underlying structural variables [x -> y].
- Predicting new outcomes: generalising. Supervised learning. Observe input/output pairs ("teaching"): (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), ... Predict the correct y for a new test input x.
- Choosing actions wisely. Reinforcement learning. Rewards or payoffs (and possibly also inputs) depend on actions: x_1: a_1 -> r_1, x_2: a_2 -> r_2, x_3: a_3 -> r_3, ... Find a policy for action choice that maximises payoff.

10 Unsupervised Learning
Find underlying structure: separate generating processes (clusters); reduced-dimensionality representations; good explanations (causes) of the data; modelling the data density.
[Figure: image patches from an image ensemble I, with filters W / basis functions Φ and causes a.]
Uses of unsupervised learning: structure discovery, science; data compression; outlier detection; input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards); a theory of biological learning and perception.

11 Supervised learning
Two main examples:
- Classification: discrete (class-label) outputs. [Figure: labelled points x and o separated in input space.]
- Regression: continuous-valued outputs. [Figure: curve fitted to (x, y) data.]
But also: ranks, relationships, trees, etc.
Variants may relate to unsupervised learning:
- semi-supervised learning (most x unlabelled; assumes the structure of {x} and the relationship x -> y are linked).
- multitask (transfer) learning (predict different y in different contexts; assumes links between the structures of the relationships).

12-23 A probabilistic approach
Data are generated by random and/or unknown processes. Our approach to learning starts with a probabilistic model of data production:
P(data | parameters): P(x | θ) or P(y | x, θ)
This is the generative model or likelihood.
The probabilistic model can be used to:
- make inferences about missing inputs
- generate predictions/fantasies/imagery
- make predictions or decisions which minimise expected loss
- communicate the data in an efficient way
Probabilistic modelling is often equivalent to other views of learning:
- information theoretic: finding compact representations of the data
- physical analogies: minimising the (free) energy of a corresponding statistical mechanical system
- structural risk: compensating for overconfidence in powerful models
The calculus of probabilities naturally handles randomness. It is also the right way to reason about unknown values.

24 Representing beliefs
Let b(x) represent our strength of belief in (plausibility of) proposition x: 0 ≤ b(x) ≤ 1; b(x) = 0 means x is definitely not true; b(x) = 1 means x is definitely true; b(x | y) is the strength of belief that x is true given that we know y is true.
Cox Axioms (Desiderata):
- Let b(x) be real. As b(x) increases, b(¬x) decreases, so the function mapping b(x) to b(¬x) is monotonically decreasing and self-inverse.
- b(x ∧ y) depends only on b(y) and b(x | y).
- Consistency: if a conclusion can be reasoned in more than one way, then every way should lead to the same answer. Beliefs always take into account all relevant evidence. Equivalent states of knowledge are represented by equivalent plausibility assignments.
Consequence: belief functions (e.g. b(x), b(x | y), b(x, y)) must be isomorphic to probabilities, satisfying all the usual laws, including Bayes' rule. (See Jaynes, Probability Theory: The Logic of Science.)

25-30 Basic rules of probability
- Probabilities are non-negative: P(x) ≥ 0 for all x.
- Probabilities normalise: Σ_{x∈X} P(x) = 1 for distributions if x is a discrete variable, and ∫ p(x) dx = 1 for probability densities over continuous variables.
- The joint probability of x and y is P(x, y).
- The marginal probability of x is P(x) = Σ_y P(x, y), assuming y is discrete.
- The conditional probability of x given y is P(x | y) = P(x, y)/P(y).
- Bayes' rule: P(x, y) = P(x)P(y | x) = P(y)P(x | y), so P(y | x) = P(x | y)P(y)/P(x).
Warning: I will not be obsessively careful in my use of p and P for probability density and probability distribution; it should be obvious from context.
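These rules are easy to exercise numerically. Below is a minimal Python sketch (the joint table is made up purely for illustration) that marginalises, conditions, and checks Bayes' rule for a pair of binary variables.

```python
import numpy as np

# A made-up joint distribution P(x, y) over two binary variables.
# Rows index x, columns index y; entries are non-negative and sum to 1.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

P_x = P_xy.sum(axis=1)             # marginal P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)             # marginal P(y)
P_x_given_y = P_xy / P_y           # conditional P(x|y) = P(x, y) / P(y)
P_y_given_x = P_xy / P_x[:, None]  # conditional P(y|x) = P(x, y) / P(x)

# Bayes' rule: P(y|x) = P(x|y) P(y) / P(x)
bayes = P_x_given_y * P_y / P_x[:, None]
assert np.allclose(bayes, P_y_given_x)
print(P_x, P_y, P_y_given_x)
```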

31-34 The Dutch book theorem
Assume you are willing to accept bets with odds proportional to the strength of your beliefs. That is, b(x) = 0.9 implies that you will accept a bet on x at 1:9, in which you win $1 if x is true and lose $9 if x is false.
Then, unless your beliefs satisfy the rules of probability theory, including Bayes' rule, there exists a set of simultaneous bets (called a "Dutch book") which you are willing to accept, and for which you are guaranteed to lose money, no matter what the outcome.
E.g. suppose A ∩ B = ∅, and b(A) = 0.3, b(B) = 0.2, b(A ∪ B) = 0.6. Then you accept the bets: ¬A at 3:7, ¬B at 2:8, A ∪ B at 4:6.
But then, whatever happens, you lose $1: if A (and not B), your total winnings are -7 + 2 + 4 = -1; if B (and not A), they are 3 - 8 + 4 = -1; if neither, they are 3 + 2 - 6 = -1.
The only way to guard against Dutch books is to ensure that your beliefs are coherent, i.e. that they satisfy the rules of probability.
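A quick way to see the trap is to enumerate the three possible outcomes and add up the payoffs of the accepted bets. The sketch below assumes the bets exactly as reconstructed above (¬A at 3:7, ¬B at 2:8, A ∪ B at 4:6); in every case the total is -$1.

```python
# Payoff of each accepted bet under the three possible outcomes (A and B are disjoint).
# "not-A at 3:7": win $3 when A is false, lose $7 when A is true, and so on.
bet_not_A  = {'A': -7, 'B': +3, 'neither': +3}
bet_not_B  = {'A': +2, 'B': -8, 'neither': +2}
bet_A_or_B = {'A': +4, 'B': +4, 'neither': -6}

for outcome in ['A', 'B', 'neither']:
    total = bet_not_A[outcome] + bet_not_B[outcome] + bet_A_or_B[outcome]
    print(outcome, total)   # prints -1 for every outcome
```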

35-39 Bayesian learning
Apply the basic rules of probability to learning from data.
Problem specification: data D = {x_1, ..., x_n}; models M_1, M_2, etc.; parameters θ_i (per model); prior probability of models P(M_i); prior probabilities of model parameters P(θ_i | M_i); model of data given parameters (likelihood model) P(x | θ_i, M_i).
Data probability (likelihood): P(D | θ_i, M_i) = ∏_{j=1}^n P(x_j | θ_i, M_i) ≡ L(θ_i), provided the data are independently and identically distributed (iid).
Parameter learning (posterior):
P(θ_i | D, M_i) = P(D | θ_i, M_i) P(θ_i | M_i) / P(D | M_i),  with  P(D | M_i) = ∫ dθ_i P(D | θ_i, M_i) P(θ_i | M_i).
P(D | M_i) is called the marginal likelihood or evidence for M_i. It is proportional to the posterior probability of model M_i being the one that generated the data.
Model selection: P(M_i | D) = P(D | M_i) P(M_i) / P(D).

40-44 Bayesian learning: A coin-toss example
Coin toss: one parameter q, the probability of obtaining heads. So our space of models is the set of distributions over q ∈ [0, 1].
Learner A believes model M_A: all values of q are equally plausible.
Learner B believes model M_B: it is more plausible that the coin is fair (q ≈ 0.5) than biased.
[Figure: the two prior densities P(q) over q: flat for A, peaked at q = 0.5 for B.]
Both prior beliefs can be described by the Beta distribution:
p(q | α_1, α_2) = q^(α_1 - 1) (1 - q)^(α_2 - 1) / B(α_1, α_2) = Beta(q | α_1, α_2),
with A: α_1 = α_2 = 1.0 and B: α_1 = α_2 = 4.0.
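To make the two priors concrete, here is a small sketch (using scipy.stats.beta; the grid and print statements are just for illustration) that evaluates each learner's prior density at the α values quoted above.

```python
import numpy as np
from scipy.stats import beta

q = np.linspace(0.01, 0.99, 99)
prior_A = beta.pdf(q, 1.0, 1.0)   # learner A: Beta(1, 1), flat over [0, 1]
prior_B = beta.pdf(q, 4.0, 4.0)   # learner B: Beta(4, 4), peaked around q = 0.5

print(prior_A.min(), prior_A.max())     # both 1.0: A's prior is uniform
print(round(q[np.argmax(prior_B)], 2))  # 0.5: B's prior favours a fair coin
```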

45-51 Bayesian learning: The coin toss (cont.)
Now we observe a toss. Two possible outcomes: p(H | q) = q and p(T | q) = 1 - q.
Suppose our single coin toss comes out heads. The probability of the observed data (the likelihood) is p(H | q) = q.
Using Bayes' rule, we multiply the prior p(q) by the likelihood and renormalise to get the posterior probability:
p(q | H) = p(q) p(H | q) / p(H) ∝ q Beta(q | α_1, α_2) ∝ q q^(α_1 - 1) (1 - q)^(α_2 - 1) ∝ Beta(q | α_1 + 1, α_2).

52 Bayesian learning: The coin toss (cont.)
[Figure: priors and posteriors for the two learners. A: prior Beta(q | 1, 1), posterior Beta(q | 2, 1). B: prior Beta(q | 4, 4), posterior Beta(q | 5, 4).]

53-59 Bayesian learning: The coin toss (cont.)
What about multiple tosses? Suppose we observe D = {H H T H T T}:
p({H H T H T T} | q) = q q (1 - q) q (1 - q)(1 - q) = q^3 (1 - q)^3
This is still straightforward:
p(q | D) = p(q) p(D | q) / p(D) ∝ q^3 (1 - q)^3 Beta(q | α_1, α_2) ∝ Beta(q | α_1 + 3, α_2 + 3)
[Figure: the resulting posteriors P(q) for learners A and B.]
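The update itself is just counting. A minimal sketch of the Beta-Bernoulli update for this data set, with both learners' prior settings (the helper function name is my own):

```python
# Beta-Bernoulli posterior update for the sequence D = H H T H T T.
data = ['H', 'H', 'T', 'H', 'T', 'T']
n_heads = data.count('H')          # 3
n_tails = data.count('T')          # 3

def posterior(alpha1, alpha2, heads, tails):
    """Posterior Beta parameters after observing the given counts."""
    return alpha1 + heads, alpha2 + tails

print(posterior(1.0, 1.0, n_heads, n_tails))   # learner A: Beta(4, 4)
print(posterior(4.0, 4.0, n_heads, n_tails))   # learner B: Beta(7, 7)
```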

60-64 Conjugate priors
Updating the prior to form the posterior was particularly easy in these examples. This is because we used a conjugate prior for an exponential-family likelihood.
Exponential-family distributions take the form
P(x | θ) = g(θ) f(x) exp{φ(θ)^T T(x)},
with g(θ) the normalising constant. Given n iid observations,
P({x_i} | θ) = ∏_i P(x_i | θ) = g(θ)^n (∏_i f(x_i)) exp{φ(θ)^T Σ_i T(x_i)}.
Thus, if the prior takes the conjugate form
P(θ) = F(τ, ν) g(θ)^ν exp{φ(θ)^T τ},
with F(τ, ν) the normaliser, then the posterior is
P(θ | {x_i}) ∝ P({x_i} | θ) P(θ) ∝ g(θ)^(ν + n) exp{φ(θ)^T (τ + Σ_i T(x_i))},
with the normaliser given by F(τ + Σ_i T(x_i), ν + n).

65-66 Conjugate priors
The posterior given an exponential-family likelihood and conjugate prior is:
P(θ | {x_i}) = F(τ + Σ_i T(x_i), ν + n) g(θ)^(ν + n) exp{φ(θ)^T (τ + Σ_i T(x_i))}
Here, φ(θ) is the vector of natural parameters, T(x_i) is the vector of sufficient statistics, τ are pseudo-observations which define the prior, and ν is the scale of the prior (it need not be an integer). As new data come in, each observation increments the sufficient-statistics vector and the scale to define the posterior.
The prior appears to be based on pseudo-observations, but:
1. This is different to applying Bayes' rule: there is no prior for such an update to start from! Sometimes we can take a uniform prior (say on [0, 1] for q), but for unbounded θ there may be no equivalent.
2. A valid conjugate prior might have non-integral ν or impossible τ, with no likelihood equivalent.

67-73 Conjugacy in the coin flip
Distributions are not always written in their natural exponential form. The Bernoulli distribution (a single coin flip) with parameter q and observation x ∈ {0, 1} can be written:
P(x | q) = q^x (1 - q)^(1 - x) = exp{x log q + (1 - x) log(1 - q)} = exp{log(1 - q) + x log(q/(1 - q))} = (1 - q) exp{log(q/(1 - q)) x}
So the natural parameter is the log odds log(q/(1 - q)), and the sufficient statistic (for multiple tosses) is the number of heads.
The conjugate prior is
P(q) = F(τ, ν) (1 - q)^ν exp{log(q/(1 - q)) τ} = F(τ, ν) (1 - q)^ν exp{τ log q - τ log(1 - q)} = F(τ, ν) (1 - q)^(ν - τ) q^τ,
which has the form of the Beta distribution, with F(τ, ν) = 1/B(τ + 1, ν - τ + 1).
In general, then, the posterior will be P(q | {x_i}) = Beta(α_1, α_2), with
α_1 = 1 + τ + Σ_i x_i   and   α_2 = 1 + (ν + n) - (τ + Σ_i x_i).
If we observe a head, we add 1 to the sufficient statistic Σ_i x_i and also 1 to the count n; this increments α_1. If we observe a tail, we add 1 to n but not to Σ_i x_i, incrementing α_2.
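The same update can be written in the natural-parameter bookkeeping used above, with τ accumulating the sufficient statistic Σ_i x_i and ν the count n. This is a sketch (function names are my own); starting from τ = 0, ν = 0 corresponds to the uniform Beta(1, 1) prior, and it reproduces the Beta(4, 4) posterior obtained earlier for D = {H H T H T T}.

```python
def update_natural(tau, nu, flips):
    """Accumulate the sufficient statistic (number of heads) and the count."""
    for x in flips:          # x = 1 for heads, 0 for tails
        tau += x
        nu += 1
    return tau, nu

def to_beta(tau, nu):
    """Map (tau, nu) to the Beta parameters alpha1, alpha2."""
    return 1 + tau, 1 + nu - tau

tau, nu = update_natural(0, 0, [1, 1, 0, 1, 0, 0])   # D = H H T H T T
print(to_beta(tau, nu))   # (4, 4): matches the direct Beta update above
```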

74-77 Bayesian coins: comparing models
We have seen how to update posteriors within each model. To study the choice of model, consider two more extreme models: "fair" and "bent". A priori, we may think that fair is more probable, e.g. p(fair) = 0.8, p(bent) = 0.2.
For the bent coin, we assume all parameter values are equally likely, whilst the fair coin has a fixed probability:
[Figure: p(q | bent) is flat over the parameter q; p(q | fair) puts all its mass at q = 0.5.]
We make 10 tosses, and get D = (T H T H T T T T T T).

78-86 Bayesian coins: comparing models
Which model should we prefer a posteriori (i.e. after seeing the data)?
The evidence for the fair model is P(D | fair) = (1/2)^10 ≈ 0.001, and for the bent model it is
P(D | bent) = ∫ dq P(D | q, bent) p(q | bent) = ∫ dq q^2 (1 - q)^8 = B(3, 9) ≈ 0.002.
Thus, by Bayes' rule, the posterior for the models is P(fair | D) ≈ 0.66 and P(bent | D) ≈ 0.34, i.e. roughly a two-thirds probability that the coin is fair.
How do we make predictions? We could choose the fair model (model selection). Or we could weight the predictions from each model by their probability (model averaging). Under the bent model the posterior over q is Beta(q | 3, 9), with mean 3/12, so the probability of a head at the next toss is:
P(H | D) = P(H | D, fair) P(fair | D) + P(H | D, bent) P(bent | D) ≈ (1/2)(2/3) + (3/12)(1/3) = 5/12.
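These numbers can be reproduced in a few lines. The sketch below uses scipy.special.beta for the Beta function B(·, ·); the slight difference from 5/12 is because 5/12 comes from rounding the model posteriors to 2/3 and 1/3.

```python
from scipy.special import beta as beta_fn

p_fair, p_bent = 0.8, 0.2
n_heads, n_tails = 2, 8                      # D = (T H T H T T T T T T)

ev_fair = 0.5 ** (n_heads + n_tails)         # P(D|fair) = (1/2)^10, about 0.001
ev_bent = beta_fn(n_heads + 1, n_tails + 1)  # integral of q^2 (1-q)^8 dq = B(3, 9), about 0.002

Z = ev_fair * p_fair + ev_bent * p_bent
post_fair = ev_fair * p_fair / Z             # about 0.66
post_bent = ev_bent * p_bent / Z             # about 0.34

# Model averaging: P(H|D, bent) is the mean of Beta(3, 9), i.e. 3/12.
p_head = 0.5 * post_fair + (3 / 12) * post_bent
print(round(post_fair, 3), round(post_bent, 3), round(p_head, 3))   # ~0.659, 0.341, 0.415
```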

87-98 Learning parameters
The Bayesian probabilistic prescription tells us how to reason about models and their parameters, but it is often impractical for realistic models (outside the exponential family).
Point estimates of parameters or other predictions:
- Bayes point estimate: compute the posterior and find the single parameter that minimises the expected loss, θ_BP = argmin_θ̂ ⟨L(θ̂, θ)⟩_{P(θ|D)}. The posterior mean ⟨θ⟩_{P(θ|D)} minimises the squared loss.
- Maximum a posteriori (MAP) estimate: assume a prior over the model parameters P(θ), and compute the parameters that are most probable under the posterior, θ_MAP = argmax_θ P(θ | D) = argmax_θ P(θ) P(D | θ). Equivalent to minimising the 0/1 loss.
- Maximum likelihood (ML) learning: no prior over the parameters. Compute the parameter value that maximises the likelihood function alone, θ_ML = argmax_θ P(D | θ). Parameterisation-independent.
Approximations may allow us to recover samples from the posterior, or to find a distribution which is close in some sense. Choosing between these and other alternatives may be a matter of definition, of goals (loss function), or of practicality.
For the next few weeks we will look at ML and MAP learning in more complex models. We will then return to the fully Bayesian formulation for the few interesting cases where it is tractable. Approximations will be addressed in the second half of the course.
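For the Beta-Bernoulli coin model all three point estimates have closed forms, which makes the differences easy to see. The sketch below (my own helper, assuming a Beta(α_1, α_2) prior and h heads in n tosses) returns the ML estimate h/n, the MAP estimate (the posterior mode), and the posterior mean.

```python
def point_estimates(h, n, alpha1=4.0, alpha2=4.0):
    """ML, MAP and posterior-mean estimates of q for a Beta-Bernoulli model."""
    q_ml = h / n                                          # maximises the likelihood alone
    q_map = (h + alpha1 - 1) / (n + alpha1 + alpha2 - 2)  # mode of the Beta posterior
    q_mean = (h + alpha1) / (n + alpha1 + alpha2)         # minimises squared loss
    return q_ml, q_map, q_mean

# The Beta(4, 4) prior pulls the MAP and posterior-mean estimates towards q = 0.5.
print(point_estimates(h=2, n=10))    # (0.2, 0.3125, 0.333...)
print(point_estimates(h=9, n=10))    # (0.9, 0.75, 0.722...)
```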

99-101 Modelling associations between variables
Data set D = {x_1, ..., x_N}, with each data point a vector of D features: x_i = [x_i1 ... x_iD]. Assume the data are iid (independent and identically distributed).
[Figure: scatter plot of two correlated features, x_1 against x_2.]
A simple form of unsupervised (structure) learning: model the mean of the data and the correlations between the D features in the data.
We can use a multivariate Gaussian model:
p(x | μ, Σ) = N(μ, Σ) = |2πΣ|^(-1/2) exp{-(1/2)(x - μ)^T Σ^(-1) (x - μ)}
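The density formula can be checked directly against a library implementation. A sketch with an arbitrary 2-D mean and covariance (scipy.stats.multivariate_normal is used only as a reference):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.0])

# log of |2*pi*Sigma|^(-1/2) * exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))
d = x - mu
logp = -0.5 * np.log(np.linalg.det(2 * np.pi * Sigma)) - 0.5 * d @ np.linalg.solve(Sigma, d)

assert np.isclose(logp, multivariate_normal(mu, Sigma).logpdf(x))
print(logp)
```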

102-107 ML Learning for a Gaussian
Data set D = {x_1, ..., x_N}; likelihood: p(D | μ, Σ) = ∏_{n=1}^N p(x_n | μ, Σ).
Goal: find μ and Σ that maximise the likelihood, or equivalently the log likelihood:
l = log ∏_{n=1}^N p(x_n | μ, Σ) = Σ_n log p(x_n | μ, Σ) = -(N/2) log|2πΣ| - (1/2) Σ_n (x_n - μ)^T Σ^(-1) (x_n - μ)
Note: equivalently, minimise -l, which is quadratic in μ.
Procedure: take derivatives and set to zero:
∂l/∂μ = 0  ⇒  μ̂ = (1/N) Σ_n x_n  (sample mean)
∂l/∂Σ = 0  ⇒  Σ̂ = (1/N) Σ_n (x_n - μ̂)(x_n - μ̂)^T  (sample covariance)
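These ML estimates are just the sample mean and the 1/N (biased) sample covariance, which a short sketch can confirm on synthetic data (the generating matrix and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ np.array([[1.0, 0.0, 0.0],
                                              [0.5, 1.0, 0.0],
                                              [0.2, 0.3, 1.0]]) + np.array([1.0, 2.0, 3.0])
N = X.shape[0]

mu_hat = X.mean(axis=0)           # ML estimate: sample mean
diff = X - mu_hat
Sigma_hat = diff.T @ diff / N     # ML estimate: sample covariance (divide by N, not N-1)

assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
print(mu_hat)
```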

108-129 Refresher: matrix derivatives of scalar forms
We will use the following facts:
x^T A y = y^T A^T x = Tr[x^T A y]   (scalars equal their own transpose and trace)
Tr[A] = Tr[A^T]
Tr[ABC] = Tr[CAB] = Tr[BCA]
∂/∂A_ij Tr[A^T B] = ∂/∂A_ij Σ_n [A^T B]_nn = ∂/∂A_ij Σ_mn A_mn B_mn = B_ij,  so  ∂/∂A Tr[A^T B] = B
∂/∂A Tr[A^T B A C] = ∂/∂A Tr[F_1(A)^T B F_2(A) C]   (with F_1 and F_2 both identity maps)
  = ∇_{F_1} Tr[F_1^T B F_2 C] ∂F_1/∂A + ∇_{F_2} Tr[F_2^T B^T F_1 C^T] ∂F_2/∂A
  = B F_2 C + B^T F_1 C^T = BAC + B^T A C^T
∂/∂A_ij log|A| = (1/|A|) ∂|A|/∂A_ij = (1/|A|) ∂/∂A_ij Σ_k (-1)^(i+k) A_ik |[A]_ik| = (1/|A|) (-1)^(i+j) |[A]_ij| = [A^(-1)]_ji   (where [A]_ik denotes A with row i and column k removed),  so  ∂/∂A log|A| = (A^(-1))^T
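The last two identities are easy to sanity-check numerically with finite differences; the sketch below uses small random matrices (a symmetric positive-definite A so that log|A| is well defined).

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)       # symmetric positive definite, so |A| > 0
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
eps = 1e-6

def num_grad(f, X):
    """Central finite-difference gradient of a scalar function f at the matrix X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# d/dA Tr[A^T B A C] = B A C + B^T A C^T
g1 = num_grad(lambda X: np.trace(X.T @ B @ X @ C), A)
assert np.allclose(g1, B @ A @ C + B.T @ A @ C.T, atol=1e-4)

# d/dA log|A| = (A^{-1})^T
g2 = num_grad(lambda X: np.log(np.linalg.det(X)), A)
assert np.allclose(g2, np.linalg.inv(A).T, atol=1e-4)
print("both identities verified")
```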


More information

Engineering Risk Benefit Analysis

Engineering Risk Benefit Analysis Engneerng Rsk Beneft Analyss.55, 2.943, 3.577, 6.938, 0.86, 3.62, 6.862, 22.82, ESD.72, ESD.72 RPRA 2. Elements of Probablty Theory George E. Apostolaks Massachusetts Insttute of Technology Sprng 2007

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov

9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov 9.93 Class IV Part I Bayesan Decson Theory Yur Ivanov TOC Roadmap to Machne Learnng Bayesan Decson Makng Mnmum Error Rate Decsons Mnmum Rsk Decsons Mnmax Crteron Operatng Characterstcs Notaton x - scalar

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition) Count Data Models See Book Chapter 11 2 nd Edton (Chapter 10 1 st Edton) Count data consst of non-negatve nteger values Examples: number of drver route changes per week, the number of trp departure changes

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maxmum Lkelhood Estmaton INFO-2301: Quanttatve Reasonng 2 Mchael Paul and Jordan Boyd-Graber MARCH 7, 2017 INFO-2301: Quanttatve Reasonng 2 Paul and Boyd-Graber Maxmum Lkelhood Estmaton 1 of 9 Why MLE?

More information

Probability Theory (revisited)

Probability Theory (revisited) Probablty Theory (revsted) Summary Probablty v.s. plausblty Random varables Smulaton of Random Experments Challenge The alarm of a shop rang. Soon afterwards, a man was seen runnng n the street, persecuted

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

First Year Examination Department of Statistics, University of Florida

First Year Examination Department of Statistics, University of Florida Frst Year Examnaton Department of Statstcs, Unversty of Florda May 7, 010, 8:00 am - 1:00 noon Instructons: 1. You have four hours to answer questons n ths examnaton.. You must show your work to receve

More information

Space of ML Problems. CSE 473: Artificial Intelligence. Parameter Estimation and Bayesian Networks. Learning Topics

Space of ML Problems. CSE 473: Artificial Intelligence. Parameter Estimation and Bayesian Networks. Learning Topics /7/7 CSE 73: Artfcal Intellgence Bayesan - Learnng Deter Fox Sldes adapted from Dan Weld, Jack Breese, Dan Klen, Daphne Koller, Stuart Russell, Andrew Moore & Luke Zettlemoyer What s Beng Learned? Space

More information

Chapter 9: Statistical Inference and the Relationship between Two Variables

Chapter 9: Statistical Inference and the Relationship between Two Variables Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,

More information

Gaussian process classification: a message-passing viewpoint

Gaussian process classification: a message-passing viewpoint Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Classification learning II

Classification learning II Lecture 8 Classfcaton learnng II Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Logstc regresson model Defnes a lnear decson boundar Dscrmnant functons: g g g g here g z / e z f, g g - s a logstc functon

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Generative classification models

Generative classification models CS 675 Intro to Machne Learnng Lecture Generatve classfcaton models Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Data: D { d, d,.., dn} d, Classfcaton represents a dscrete class value Goal: learn

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models

More information

MIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU

MIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU Group M D L M Chapter Bayesan Decson heory Xn-Shun Xu @ SDU School of Computer Scence and echnology, Shandong Unversty Bayesan Decson heory Bayesan decson theory s a statstcal approach to data mnng/pattern

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probablstc & Unsupervsed Learnng Convex Algorthms n Approxmate Inference Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computatonal Neuroscence Unt Unversty College London Term 1, Autumn 2008 Convexty A convex

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs

More information

Statistical pattern recognition

Statistical pattern recognition Statstcal pattern recognton Bayes theorem Problem: decdng f a patent has a partcular condton based on a partcular test However, the test s mperfect Someone wth the condton may go undetected (false negatve

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,

More information

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD

The Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s

More information

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation

An Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation An Experment/Some Intuton I have three cons n my pocket, 6.864 (Fall 2006): Lecture 18 The EM Algorthm Con 0 has probablty λ of heads; Con 1 has probablty p 1 of heads; Con 2 has probablty p 2 of heads

More information

Clustering & Unsupervised Learning

Clustering & Unsupervised Learning Clusterng & Unsupervsed Learnng Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 2012 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

/ n ) are compared. The logic is: if the two

/ n ) are compared. The logic is: if the two STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Communication with AWGN Interference

Communication with AWGN Interference Communcaton wth AWG Interference m {m } {p(m } Modulator s {s } r=s+n Recever ˆm AWG n m s a dscrete random varable(rv whch takes m wth probablty p(m. Modulator maps each m nto a waveform sgnal s m=m

More information

Why Bayesian? 3. Bayes and Normal Models. State of nature: class. Decision rule. Rev. Thomas Bayes ( ) Bayes Theorem (yes, the famous one)

Why Bayesian? 3. Bayes and Normal Models. State of nature: class. Decision rule. Rev. Thomas Bayes ( ) Bayes Theorem (yes, the famous one) Why Bayesan? 3. Bayes and Normal Models Alex M. Martnez alex@ece.osu.edu Handouts Handoutsfor forece ECE874 874Sp Sp007 If all our research (n PR was to dsappear and you could only save one theory, whch

More information

x i1 =1 for all i (the constant ).

x i1 =1 for all i (the constant ). Chapter 5 The Multple Regresson Model Consder an economc model where the dependent varable s a functon of K explanatory varables. The economc model has the form: y = f ( x,x,..., ) xk Approxmate ths by

More information

Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis

Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis Statstcal analyss usng matlab HY 439 Presented by: George Fortetsanaks Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons Contnuous dstrbutons Contnuous random varable X

More information

Clustering & (Ken Kreutz-Delgado) UCSD

Clustering & (Ken Kreutz-Delgado) UCSD Clusterng & Unsupervsed Learnng Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y ), fnd an approxmatng

More information

Expectation propagation

Expectation propagation Expectaton propagaton Lloyd Ellott May 17, 2011 Suppose p(x) s a pdf and we have a factorzaton p(x) = 1 Z n f (x). (1) =1 Expectaton propagaton s an nference algorthm desgned to approxmate the factors

More information

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF 10-708: Probablstc Graphcal Models 10-708, Sprng 2014 8 : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands Content. Inference on Regresson Parameters a. Fndng Mean, s.d and covarance amongst estmates.. Confdence Intervals and Workng Hotellng Bands 3. Cochran s Theorem 4. General Lnear Testng 5. Measures of

More information

A be a probability space. A random vector

A be a probability space. A random vector Statstcs 1: Probablty Theory II 8 1 JOINT AND MARGINAL DISTRIBUTIONS In Probablty Theory I we formulate the concept of a (real) random varable and descrbe the probablstc behavor of ths random varable by

More information

Artificial Intelligence Bayesian Networks

Artificial Intelligence Bayesian Networks Artfcal Intellgence Bayesan Networks Adapted from sldes by Tm Fnn and Mare desjardns. Some materal borrowed from Lse Getoor. 1 Outlne Bayesan networks Network structure Condtonal probablty tables Condtonal

More information

Explaining the Stein Paradox

Explaining the Stein Paradox Explanng the Sten Paradox Kwong Hu Yung 1999/06/10 Abstract Ths report offers several ratonale for the Sten paradox. Sectons 1 and defnes the multvarate normal mean estmaton problem and ntroduces Sten

More information

Statistics for Economics & Business

Statistics for Economics & Business Statstcs for Economcs & Busness Smple Lnear Regresson Learnng Objectves In ths chapter, you learn: How to use regresson analyss to predct the value of a dependent varable based on an ndependent varable

More information

Semi-Supervised Learning

Semi-Supervised Learning Sem-Supervsed Learnng Consder the problem of Prepostonal Phrase Attachment. Buy car wth money ; buy car wth wheel There are several ways to generate features. Gven the lmted representaton, we can assume

More information

Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD

Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory Nuno Vasconcelos ECE Department UCSD Notaton the notaton n DHS s qute sloppy e.. show that error error z z dz really not clear what ths means we wll use the follown notaton subscrpts

More information

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30 STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear

More information

Randomness and Computation

Randomness and Computation Randomness and Computaton or, Randomzed Algorthms Mary Cryan School of Informatcs Unversty of Ednburgh RC 208/9) Lecture 0 slde Balls n Bns m balls, n bns, and balls thrown unformly at random nto bns usually

More information

Stat 543 Exam 2 Spring 2016

Stat 543 Exam 2 Spring 2016 Stat 543 Exam 2 Sprng 206 I have nether gven nor receved unauthorzed assstance on ths exam. Name Sgned Date Name Prnted Ths Exam conssts of questons. Do at least 0 of the parts of the man exam. I wll score

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

1/10/18. Definitions. Probabilistic models. Why probabilistic models. Example: a fair 6-sided dice. Probability

1/10/18. Definitions. Probabilistic models. Why probabilistic models. Example: a fair 6-sided dice. Probability /0/8 I529: Machne Learnng n Bonformatcs Defntons Probablstc models Probablstc models A model means a system that smulates the object under consderaton A probablstc model s one that produces dfferent outcomes

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

EGR 544 Communication Theory

EGR 544 Communication Theory EGR 544 Communcaton Theory. Informaton Sources Z. Alyazcoglu Electrcal and Computer Engneerng Department Cal Poly Pomona Introducton Informaton Source x n Informaton sources Analog sources Dscrete sources

More information

Probability and Random Variable Primer

Probability and Random Variable Primer B. Maddah ENMG 622 Smulaton 2/22/ Probablty and Random Varable Prmer Sample space and Events Suppose that an eperment wth an uncertan outcome s performed (e.g., rollng a de). Whle the outcome of the eperment

More information

Lecture 3: Shannon s Theorem

Lecture 3: Shannon s Theorem CSE 533: Error-Correctng Codes (Autumn 006 Lecture 3: Shannon s Theorem October 9, 006 Lecturer: Venkatesan Guruswam Scrbe: Wdad Machmouch 1 Communcaton Model The communcaton model we are usng conssts

More information

How its computed. y outcome data λ parameters hyperparameters. where P denotes the Laplace approximation. k i k k. Andrew B Lawson 2013

How its computed. y outcome data λ parameters hyperparameters. where P denotes the Laplace approximation. k i k k. Andrew B Lawson 2013 Andrew Lawson MUSC INLA INLA s a relatvely new tool that can be used to approxmate posteror dstrbutons n Bayesan models INLA stands for ntegrated Nested Laplace Approxmaton The approxmaton has been known

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Other NN Models. Reinforcement learning (RL) Probabilistic neural networks

Other NN Models. Reinforcement learning (RL) Probabilistic neural networks Other NN Models Renforcement learnng (RL) Probablstc neural networks Support vector machne (SVM) Renforcement learnng g( (RL) Basc deas: Supervsed dlearnng: (delta rule, BP) Samples (x, f(x)) to learn

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information