Probabilistic & Unsupervised Learning. Introduction and Foundations
|
|
- Dorcas Harrington
- 5 years ago
- Views:
Transcription
1 Probablstc & Unsupervsed Learnng Introducton and Foundatons Maneesh Sahan Gatsby Computatonal Neuroscence Unt, and MSc ML/CSML, Dept Computer Scence Unversty College London Term 1, Autumn 2018
2 What do we mean by learnng? Jan Steen Not just rememberng:
3 What do we mean by learnng? Jan Steen Not just rememberng: Systematsng (nosy) observatons: dscoverng structure.
4 What do we mean by learnng? Jan Steen Not just rememberng: Systematsng (nosy) observatons: dscoverng structure. Predctng new outcomes: generalsng.
5 What do we mean by learnng? Jan Steen Not just rememberng: Systematsng (nosy) observatons: dscoverng structure. Predctng new outcomes: generalsng. Choosng actons wsely.
6 Three learnng problems Systematsng (nosy) observatons: dscoverng structure. Predctng new outcomes: generalsng. Choosng actons wsely.
7 Three learnng problems Systematsng (nosy) observatons: dscoverng structure. Unsupervsed learnng. Observe (sensory) nput alone: x 1, x 2, x 3, x 4,... Descrbe pattern of data [p(x)], dentfy and extract underlyng structural varables [x y ]. Predctng new outcomes: generalsng. Choosng actons wsely.
8 Three learnng problems Systematsng (nosy) observatons: dscoverng structure. Unsupervsed learnng. Observe (sensory) nput alone: x 1, x 2, x 3, x 4,... Descrbe pattern of data [p(x)], dentfy and extract underlyng structural varables [x y ]. Predctng new outcomes: generalsng. Supervsed learnng. Observe nput/output pars ( teachng ): (x 1, y 1 ), (x 2, y 2 ), (x 3, y 3 ), (x 4, y 4 ),... Predct the correct y for new test nput x. Choosng actons wsely.
9 Three learnng problems Systematsng (nosy) observatons: dscoverng structure. Unsupervsed learnng. Observe (sensory) nput alone: x 1, x 2, x 3, x 4,... Descrbe pattern of data [p(x)], dentfy and extract underlyng structural varables [x y ]. Predctng new outcomes: generalsng. Supervsed learnng. Observe nput/output pars ( teachng ): (x 1, y 1 ), (x 2, y 2 ), (x 3, y 3 ), (x 4, y 4 ),... Predct the correct y for new test nput x. Choosng actons wsely. Renforcement learnng. Rewards or payoffs (and possbly also nputs) depend on actons: x 1 : a 1 r 1, x 2 : a 2 r 2, x 3 : a 3 r 3... Fnd a polcy for acton choce that maxmses payoff.
10 Unsupervsed Learnng Fnd underlyng structure: separate generatng processes (clusters) reduced dmensonalty representatons good explanatons (causes) of the data modellng the data densty a W flters mage patch, mage ensemble I Uses of Unsupervsed Learnng: structure dscovery, scence data compresson outler detecton nput to supervsed/renforcement algorthms (causes may be more smply related to outputs or rewards) a theory of bologcal learnng and percepton Φ a bass functons causes
11 y Supervsed learnng Two man examples: Classfcaton: x x x o x x xx o x o x x o o o o x Regresson: Dscrete (class label) outputs x Contnuous-values outputs. But also: ranks, relatonshps, trees etc. Varants may relate to unsupervsed learnng: sem-supervsed learnng (most x unlabelled; assumes structure of {x} and relatonshp x y are lnked). multtask (transfer) learnng (predct dfferent y n dfferent contexts; assumes lnks between structure of relatonshps).
12 A probablstc approach Data are generated by random and/or unknown processes.
13 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood.
14 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to
15 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs
16 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs generate predctons/fantases/magery
17 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs generate predctons/fantases/magery make predctons or decsons whch mnmse expected loss
18 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs generate predctons/fantases/magery make predctons or decsons whch mnmse expected loss communcate the data n an effcent way
19 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs generate predctons/fantases/magery make predctons or decsons whch mnmse expected loss communcate the data n an effcent way Probablstc modellng s often equvalent to other vews of learnng:
20 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs generate predctons/fantases/magery make predctons or decsons whch mnmse expected loss communcate the data n an effcent way Probablstc modellng s often equvalent to other vews of learnng: nformaton theoretc: fndng compact representatons of the data
21 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs generate predctons/fantases/magery make predctons or decsons whch mnmse expected loss communcate the data n an effcent way Probablstc modellng s often equvalent to other vews of learnng: nformaton theoretc: fndng compact representatons of the data physcal analoges: mnmsng (free) energy of a correspondng statstcal mechancal system
22 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs generate predctons/fantases/magery make predctons or decsons whch mnmse expected loss communcate the data n an effcent way Probablstc modellng s often equvalent to other vews of learnng: nformaton theoretc: fndng compact representatons of the data physcal analoges: mnmsng (free) energy of a correspondng statstcal mechancal system structural rsk: compensate for overconfdence n powerful models
23 A probablstc approach Data are generated by random and/or unknown processes. Our approach to learnng starts wth a probablstc model of data producton: P(data parameters) P(x θ) or P(y x, θ) Ths s the generatve model or lkelhood. The probablstc model can be used to make nferences about mssng nputs generate predctons/fantases/magery make predctons or decsons whch mnmse expected loss communcate the data n an effcent way Probablstc modellng s often equvalent to other vews of learnng: nformaton theoretc: fndng compact representatons of the data physcal analoges: mnmsng (free) energy of a correspondng statstcal mechancal system structural rsk: compensate for overconfdence n powerful models The calculus of probabltes naturally handles randomness. It s also the rght way to reason about unknown values.
24 Representng belefs Let b(x) represent our strength of belef n (plausblty of) proposton x: 0 b(x) 1 b(x) = 0 x s defntely not true b(x) = 1 x s defntely true b(x y) strength of belef that x s true gven that we know y s true Cox Axoms (Desderata): Let b(x) be real. As b(x) ncreases, b( x) decreases, and so the functon mappng b(x) b( x) s monotoncally decreasng and self-nverse. b(x y) depends only on b(y) and b(x y). Consstency If a concluson can be reasoned n more than one way, then every way should lead to the same answer. Belefs always take nto account all relevant evdence. Equvalent states of knowledge are represented by equvalent plausblty assgnments. Consequence: Belef functons (e.g. b(x), b(x y), b(x, y)) must be somorphc to probabltes, satsfyng all the usual laws, ncludng Bayes rule. (See Jaynes, Probablty Theory: The Logc of Scence)
25 Basc rules of probablty Probabltes are non-negatve P(x) 0 x.
26 Basc rules of probablty Probabltes are non-negatve P(x) 0 x. Probabltes normalse: x X P(x) = 1 for dstrbutons f x s a dscrete varable and + p(x)dx = 1 for probablty denstes over contnuous varables Warnng: I wll not be obsessvely careful n my use of p and P for probablty densty and probablty dstrbuton. Should be obvous from context.
27 Basc rules of probablty Probabltes are non-negatve P(x) 0 x. Probabltes normalse: x X P(x) = 1 for dstrbutons f x s a dscrete varable and + p(x)dx = 1 for probablty denstes over contnuous varables The jont probablty of x and y s: P(x, y). Warnng: I wll not be obsessvely careful n my use of p and P for probablty densty and probablty dstrbuton. Should be obvous from context.
28 Basc rules of probablty Probabltes are non-negatve P(x) 0 x. Probabltes normalse: x X P(x) = 1 for dstrbutons f x s a dscrete varable and + p(x)dx = 1 for probablty denstes over contnuous varables The jont probablty of x and y s: P(x, y). The margnal probablty of x s: P(x) = y P(x, y), assumng y s dscrete. Warnng: I wll not be obsessvely careful n my use of p and P for probablty densty and probablty dstrbuton. Should be obvous from context.
29 Basc rules of probablty Probabltes are non-negatve P(x) 0 x. Probabltes normalse: x X P(x) = 1 for dstrbutons f x s a dscrete varable and + p(x)dx = 1 for probablty denstes over contnuous varables The jont probablty of x and y s: P(x, y). The margnal probablty of x s: P(x) = y P(x, y), assumng y s dscrete. The condtonal probablty of x gven y s: P(x y) = P(x, y)/p(y) Warnng: I wll not be obsessvely careful n my use of p and P for probablty densty and probablty dstrbuton. Should be obvous from context.
30 Basc rules of probablty Probabltes are non-negatve P(x) 0 x. Probabltes normalse: x X P(x) = 1 for dstrbutons f x s a dscrete varable and + p(x)dx = 1 for probablty denstes over contnuous varables The jont probablty of x and y s: P(x, y). The margnal probablty of x s: P(x) = y P(x, y), assumng y s dscrete. The condtonal probablty of x gven y s: P(x y) = P(x, y)/p(y) Bayes Rule: P(x, y) = P(x)P(y x) = P(y)P(x y) P(y x) = P(x y)p(y) P(x) Warnng: I wll not be obsessvely careful n my use of p and P for probablty densty and probablty dstrbuton. Should be obvous from context.
31 The Dutch book theorem Assume you are wllng to accept bets wth odds proportonal to the strength of your belefs. That s, b(x) = 0.9 mples that you wll accept a bet: { x s true wn $1 x at 1 : 9 x s false lose $9 Then, unless your belefs satsfy the rules of probablty theory, ncludng Bayes rule, there exsts a set of smultaneous bets (called a Dutch Book ) whch you are wllng to accept, and for whch you are guaranteed to lose money, no matter what the outcome. E.g. suppose A B =, then
32 The Dutch book theorem Assume you are wllng to accept bets wth odds proportonal to the strength of your belefs. That s, b(x) = 0.9 mples that you wll accept a bet: { x s true wn $1 x at 1 : 9 x s false lose $9 Then, unless your belefs satsfy the rules of probablty theory, ncludng Bayes rule, there exsts a set of smultaneous bets (called a Dutch Book ) whch you are wllng to accept, and for whch you are guaranteed to lose money, no matter what the outcome. E.g. suppose A B =, then b(a) = 0.3 b(b) = 0.2 b(a B) = 0.6 accept the bets A at 3 : 7 B at 2 : 8 A B at 4 : 6
33 The Dutch book theorem Assume you are wllng to accept bets wth odds proportonal to the strength of your belefs. That s, b(x) = 0.9 mples that you wll accept a bet: { x s true wn $1 x at 1 : 9 x s false lose $9 Then, unless your belefs satsfy the rules of probablty theory, ncludng Bayes rule, there exsts a set of smultaneous bets (called a Dutch Book ) whch you are wllng to accept, and for whch you are guaranteed to lose money, no matter what the outcome. E.g. suppose A B =, then But then: b(a) = 0.3 b(b) = 0.2 b(a B) = 0.6 accept the bets A B wn = 1 A B wn = 1 A B wn = 1 A at 3 : 7 B at 2 : 8 A B at 4 : 6
34 The Dutch book theorem Assume you are wllng to accept bets wth odds proportonal to the strength of your belefs. That s, b(x) = 0.9 mples that you wll accept a bet: { x s true wn $1 x at 1 : 9 x s false lose $9 Then, unless your belefs satsfy the rules of probablty theory, ncludng Bayes rule, there exsts a set of smultaneous bets (called a Dutch Book ) whch you are wllng to accept, and for whch you are guaranteed to lose money, no matter what the outcome. E.g. suppose A B =, then But then: b(a) = 0.3 b(b) = 0.2 b(a B) = 0.6 accept the bets A B wn = 1 A B wn = 1 A B wn = 1 A at 3 : 7 B at 2 : 8 A B at 4 : 6 The only way to guard aganst Dutch Books s to ensure that your belefs are coherent:.e. satsfy the rules of probablty.
35 Bayesan learnng Apply the basc rules of probablty to learnng from data. Problem specfcaton: Data: D = {x 1,..., x n} Models: M 1, M 2, etc. Parameters: θ (per model) Pror probablty of models: P(M ). Pror probabltes of model parameters: P(θ M ) Model of data gven parameters (lkelhood model): P(x θ, M )
36 Bayesan learnng Apply the basc rules of probablty to learnng from data. Problem specfcaton: Data: D = {x 1,..., x n} Models: M 1, M 2, etc. Parameters: θ (per model) Pror probablty of models: P(M ). Pror probabltes of model parameters: P(θ M ) Model of data gven parameters (lkelhood model): P(x θ, M ) Data probablty (lkelhood) n P(D θ, M ) = P(x j θ, M ) L(θ ) j=1 (provded the data are ndependently and dentcally dstrbuted (d).
37 Bayesan learnng Apply the basc rules of probablty to learnng from data. Problem specfcaton: Data: D = {x 1,..., x n} Models: M 1, M 2, etc. Parameters: θ (per model) Pror probablty of models: P(M ). Pror probabltes of model parameters: P(θ M ) Model of data gven parameters (lkelhood model): P(x θ, M ) Data probablty (lkelhood) n P(D θ, M ) = P(x j θ, M ) L(θ ) j=1 (provded the data are ndependently and dentcally dstrbuted (d). Parameter learnng (posteror): P(θ D, M ) = P(D θ, M)P(θ M) ; P(D M ) = P(D M ) dθ P(D θ, M )P(θ M )
38 Bayesan learnng Apply the basc rules of probablty to learnng from data. Problem specfcaton: Data: D = {x 1,..., x n} Models: M 1, M 2, etc. Parameters: θ (per model) Pror probablty of models: P(M ). Pror probabltes of model parameters: P(θ M ) Model of data gven parameters (lkelhood model): P(x θ, M ) Data probablty (lkelhood) n P(D θ, M ) = P(x j θ, M ) L(θ ) j=1 (provded the data are ndependently and dentcally dstrbuted (d). Parameter learnng (posteror): P(θ D, M ) = P(D θ, M)P(θ M) ; P(D M ) = P(D M ) dθ P(D θ, M )P(θ M ) P(D M ) s called the margnal lkelhood or evdence for M. It s proportonal to the posteror probablty model M beng the one that generated the data.
39 Bayesan learnng Apply the basc rules of probablty to learnng from data. Problem specfcaton: Data: D = {x 1,..., x n} Models: M 1, M 2, etc. Parameters: θ (per model) Pror probablty of models: P(M ). Pror probabltes of model parameters: P(θ M ) Model of data gven parameters (lkelhood model): P(x θ, M ) Data probablty (lkelhood) n P(D θ, M ) = P(x j θ, M ) L(θ ) j=1 (provded the data are ndependently and dentcally dstrbuted (d). Parameter learnng (posteror): P(θ D, M ) = P(D θ, M)P(θ M) ; P(D M ) = P(D M ) dθ P(D θ, M )P(θ M ) P(D M ) s called the margnal lkelhood or evdence for M. It s proportonal to the posteror probablty model M beng the one that generated the data. Model selecton: P(M D) = P(D M)P(M) P(D)
40 Bayesan learnng: A con toss example Con toss: One parameter q the probablty of obtanng heads So our space of models s the set of dstrbutons over q [0, 1].
41 Bayesan learnng: A con toss example Con toss: One parameter q the probablty of obtanng heads So our space of models s the set of dstrbutons over q [0, 1]. Learner A beleves model M A: all values of q are equally plausble; P(q) q A
42 Bayesan learnng: A con toss example Con toss: One parameter q the probablty of obtanng heads So our space of models s the set of dstrbutons over q [0, 1]. Learner A beleves model M A: all values of q are equally plausble; Learner B beleves model M B: more plausble that the con s far (q 0.5) than based P(q) 1 P(q) q A q B
43 Bayesan learnng: A con toss example Con toss: One parameter q the probablty of obtanng heads So our space of models s the set of dstrbutons over q [0, 1]. Learner A beleves model M A: all values of q are equally plausble; Learner B beleves model M B: more plausble that the con s far (q 0.5) than based P(q) 1 P(q) q A Both pror belefs can be descrbed by the Beta dstrbuton: p(q α 1, α 2) = q(α 1 1) (1 q) (α 2 1) B(α 1, α 2) = Beta(q α 1, α 2) q B
44 Bayesan learnng: A con toss example Con toss: One parameter q the probablty of obtanng heads So our space of models s the set of dstrbutons over q [0, 1]. Learner A beleves model M A: all values of q are equally plausble; Learner B beleves model M B: more plausble that the con s far (q 0.5) than based P(q) 1 P(q) q q A: α 1 = α 2 = 1.0 B: α 1 = α 2 = 4.0 Both pror belefs can be descrbed by the Beta dstrbuton: p(q α 1, α 2) = q(α 1 1) (1 q) (α 2 1) B(α 1, α 2) = Beta(q α 1, α 2)
45 Bayesan learnng: The con toss (cont) Now we observe a toss. Two possble outcomes: p(h q) = q p(t q) = 1 q
46 Bayesan learnng: The con toss (cont) Now we observe a toss. Two possble outcomes: p(h q) = q p(t q) = 1 q Suppose our sngle con toss comes out heads
47 Bayesan learnng: The con toss (cont) Now we observe a toss. Two possble outcomes: p(h q) = q p(t q) = 1 q Suppose our sngle con toss comes out heads The probablty of the observed data (lkelhood) s: p(h q) = q
48 Bayesan learnng: The con toss (cont) Now we observe a toss. Two possble outcomes: p(h q) = q p(t q) = 1 q Suppose our sngle con toss comes out heads The probablty of the observed data (lkelhood) s: p(h q) = q Usng Bayes Rule, we multply the pror, p(q) by the lkelhood and renormalse to get the posteror probablty: p(q H) = p(q)p(h q) p(h)
49 Bayesan learnng: The con toss (cont) Now we observe a toss. Two possble outcomes: p(h q) = q p(t q) = 1 q Suppose our sngle con toss comes out heads The probablty of the observed data (lkelhood) s: p(h q) = q Usng Bayes Rule, we multply the pror, p(q) by the lkelhood and renormalse to get the posteror probablty: p(q H) = p(q)p(h q) p(h) q Beta(q α 1, α 2)
50 Bayesan learnng: The con toss (cont) Now we observe a toss. Two possble outcomes: p(h q) = q p(t q) = 1 q Suppose our sngle con toss comes out heads The probablty of the observed data (lkelhood) s: p(h q) = q Usng Bayes Rule, we multply the pror, p(q) by the lkelhood and renormalse to get the posteror probablty: p(q H) = p(q)p(h q) p(h) q Beta(q α 1, α 2) q q (α 1 1) (1 q) (α 2 1)
51 Bayesan learnng: The con toss (cont) Now we observe a toss. Two possble outcomes: p(h q) = q p(t q) = 1 q Suppose our sngle con toss comes out heads The probablty of the observed data (lkelhood) s: p(h q) = q Usng Bayes Rule, we multply the pror, p(q) by the lkelhood and renormalse to get the posteror probablty: p(q H) = p(q)p(h q) p(h) q Beta(q α 1, α 2) q q (α1 1) (1 q) (α 2 1) = Beta(q α 1 + 1, α 2)
52 Bayesan learnng: The con toss (cont) A B Pror P(q) P(q) q q Beta(q 1, 1) Beta(q 4, 4) Posteror P(q) P(q) q q Beta(q 2, 1) Beta(q 5, 4)
53 Bayesan learnng: The con toss (cont) What about multple tosses?
54 Bayesan learnng: The con toss (cont) What about multple tosses? Suppose we observe D = { H H T H T T }: p({ H H T H T T } q) = qq(1 q)q(1 q)(1 q) = q 3 (1 q) 3
55 Bayesan learnng: The con toss (cont) What about multple tosses? Suppose we observe D = { H H T H T T }: p({ H H T H T T } q) = qq(1 q)q(1 q)(1 q) = q 3 (1 q) 3 Ths s stll straghtforward:
56 Bayesan learnng: The con toss (cont) What about multple tosses? Suppose we observe D = { H H T H T T }: p({ H H T H T T } q) = qq(1 q)q(1 q)(1 q) = q 3 (1 q) 3 Ths s stll straghtforward: p(q D) = p(q)p(d q) p(d)
57 Bayesan learnng: The con toss (cont) What about multple tosses? Suppose we observe D = { H H T H T T }: p({ H H T H T T } q) = qq(1 q)q(1 q)(1 q) = q 3 (1 q) 3 Ths s stll straghtforward: p(q D) = p(q)p(d q) p(d) q 3 (1 q) 3 Beta(q α 1, α 2)
58 Bayesan learnng: The con toss (cont) What about multple tosses? Suppose we observe D = { H H T H T T }: p({ H H T H T T } q) = qq(1 q)q(1 q)(1 q) = q 3 (1 q) 3 Ths s stll straghtforward: p(q D) = p(q)p(d q) p(d) q 3 (1 q) 3 Beta(q α 1, α 2) Beta(q α 1 + 3, α 2 + 3)
59 Bayesan learnng: The con toss (cont) What about multple tosses? Suppose we observe D = { H H T H T T }: p({ H H T H T T } q) = qq(1 q)q(1 q)(1 q) = q 3 (1 q) 3 Ths s stll straghtforward: p(q D) = p(q)p(d q) p(d) q 3 (1 q) 3 Beta(q α 1, α 2) Beta(q α 1 + 3, α 2 + 3) P(q) P(q) q q
60 Conjugate prors Updatng the pror to form the posteror was partcularly easy n these examples. Ths s because we used a conjugate pror for an exponental famly lkelhood.
61 Conjugate prors Updatng the pror to form the posteror was partcularly easy n these examples. Ths s because we used a conjugate pror for an exponental famly lkelhood. Exponental famly dstrbutons take the form: P(x θ) = g(θ)f (x)e φ(θ)t T(x) wth g(θ) the normalsng constant.
62 Conjugate prors Updatng the pror to form the posteror was partcularly easy n these examples. Ths s because we used a conjugate pror for an exponental famly lkelhood. Exponental famly dstrbutons take the form: P(x θ) = g(θ)f (x)e φ(θ)t T(x) wth g(θ) the normalsng constant. Gven n d observatons, P({x } θ) = ( P(x θ) = g(θ) n e φ(θ)t T(x )) f (x )
63 Conjugate prors Updatng the pror to form the posteror was partcularly easy n these examples. Ths s because we used a conjugate pror for an exponental famly lkelhood. Exponental famly dstrbutons take the form: P(x θ) = g(θ)f (x)e φ(θ)t T(x) wth g(θ) the normalsng constant. Gven n d observatons, P({x } θ) = ( P(x θ) = g(θ) n e φ(θ)t T(x )) f (x ) Thus, f the pror takes the conjugate form P(θ) = F(τ, ν)g(θ) ν e φ(θ)t τ wth F(τ, ν) the normalser
64 Conjugate prors Updatng the pror to form the posteror was partcularly easy n these examples. Ths s because we used a conjugate pror for an exponental famly lkelhood. Exponental famly dstrbutons take the form: P(x θ) = g(θ)f (x)e φ(θ)t T(x) wth g(θ) the normalsng constant. Gven n d observatons, P({x } θ) = ( P(x θ) = g(θ) n e φ(θ)t T(x )) f (x ) Thus, f the pror takes the conjugate form P(θ) = F(τ, ν)g(θ) ν e φ(θ)t τ wth F(τ, ν) the normalser, then the posteror s ( P(θ {x }) P({x } θ)p(θ) g(θ) ν+n e φ(θ)t τ + ) T(x ) wth the normalser gven by F ( τ + T(x ), ν + n ).
65 Conjugate prors The posteror gven an exponental famly lkelhood and conjugate pror s: P(θ {x }) = F ( τ + T(x ), ν + n ) ( g(θ) ν+n exp [φ(θ) T τ + )] T(x ) Here, φ(θ) T(x ) τ ν s the vector of natural parameters s the vector of suffcent statstcs are pseudo-observatons whch defne the pror s the scale of the pror (need not be an nteger) As new data come n, each one ncrements the suffcent statstcs vector and the scale to defne the posteror.
66 Conjugate prors The posteror gven an exponental famly lkelhood and conjugate pror s: P(θ {x }) = F ( τ + T(x ), ν + n ) ( g(θ) ν+n exp [φ(θ) T τ + )] T(x ) Here, φ(θ) T(x ) τ ν s the vector of natural parameters s the vector of suffcent statstcs are pseudo-observatons whch defne the pror s the scale of the pror (need not be an nteger) As new data come n, each one ncrements the suffcent statstcs vector and the scale to defne the posteror. The pror appears to be based on pseudo-observatons, but: 1. Ths s dfferent to applyng Bayes rule. No pror! Sometmes we can take a unform pror (say on [0, 1] for q), but for unbounded θ, there may be no equvalent. 2. A vald conjugate pror mght have non-ntegral ν or mpossble τ, wth no lkelhood equvalent.
67 Conjugacy n the con flp Dstrbutons are not always wrtten n ther natural exponental form.
68 Conjugacy n the con flp Dstrbutons are not always wrtten n ther natural exponental form. The Bernoull dstrbuton (a sngle con flp) wth parameter q and observaton x {0, 1}, can be wrtten: P(x q) = q x (1 q) (1 x) x log q+(1 x) log(1 q) = e log(1 q)+x log(q/(1 q)) = e = (1 q)e log(q/(1 q))x
69 Conjugacy n the con flp Dstrbutons are not always wrtten n ther natural exponental form. The Bernoull dstrbuton (a sngle con flp) wth parameter q and observaton x {0, 1}, can be wrtten: P(x q) = q x (1 q) (1 x) x log q+(1 x) log(1 q) = e log(1 q)+x log(q/(1 q)) = e = (1 q)e log(q/(1 q))x So the natural parameter s the log odds log(q/(1 q)), and the suffcent stats (for multple tosses) s the number of heads.
70 Conjugacy n the con flp Dstrbutons are not always wrtten n ther natural exponental form. The Bernoull dstrbuton (a sngle con flp) wth parameter q and observaton x {0, 1}, can be wrtten: P(x q) = q x (1 q) (1 x) x log q+(1 x) log(1 q) = e log(1 q)+x log(q/(1 q)) = e = (1 q)e log(q/(1 q))x So the natural parameter s the log odds log(q/(1 q)), and the suffcent stats (for multple tosses) s the number of heads. The conjugate pror s P(q) = F(τ, ν) (1 q) ν e log(q/(1 q))τ = F(τ, ν) (1 q) ν τ log q τ log(1 q) e = F(τ, ν) (1 q) ν τ q τ whch has the form of the Beta dstrbuton F(τ, ν) = 1/B(τ + 1, ν τ + 1).
71 Conjugacy n the con flp Dstrbutons are not always wrtten n ther natural exponental form. The Bernoull dstrbuton (a sngle con flp) wth parameter q and observaton x {0, 1}, can be wrtten: P(x q) = q x (1 q) (1 x) x log q+(1 x) log(1 q) = e log(1 q)+x log(q/(1 q)) = e = (1 q)e log(q/(1 q))x So the natural parameter s the log odds log(q/(1 q)), and the suffcent stats (for multple tosses) s the number of heads. The conjugate pror s P(q) = F(τ, ν) (1 q) ν e log(q/(1 q))τ = F(τ, ν) (1 q) ν τ log q τ log(1 q) e = F(τ, ν) (1 q) ν τ q τ whch has the form of the Beta dstrbuton F(τ, ν) = 1/B(τ + 1, ν τ + 1). In general, then, the posteror wll be P(q {x }) = Beta(α 1, α 2), wth α 1 = 1 + τ + x α 2 = 1 + (ν + n) (τ + ) x
72 Conjugacy n the con flp Dstrbutons are not always wrtten n ther natural exponental form. The Bernoull dstrbuton (a sngle con flp) wth parameter q and observaton x {0, 1}, can be wrtten: P(x q) = q x (1 q) (1 x) x log q+(1 x) log(1 q) = e log(1 q)+x log(q/(1 q)) = e = (1 q)e log(q/(1 q))x So the natural parameter s the log odds log(q/(1 q)), and the suffcent stats (for multple tosses) s the number of heads. The conjugate pror s P(q) = F(τ, ν) (1 q) ν e log(q/(1 q))τ = F(τ, ν) (1 q) ν τ log q τ log(1 q) e = F(τ, ν) (1 q) ν τ q τ whch has the form of the Beta dstrbuton F(τ, ν) = 1/B(τ + 1, ν τ + 1). In general, then, the posteror wll be P(q {x }) = Beta(α 1, α 2), wth α 1 = 1 + τ + x α 2 = 1 + (ν + n) (τ + ) x If we observe a head, we add 1 to the suffcent statstc x, and also 1 to the count n. Ths ncrements α 1.
73 Conjugacy n the con flp Dstrbutons are not always wrtten n ther natural exponental form. The Bernoull dstrbuton (a sngle con flp) wth parameter q and observaton x {0, 1}, can be wrtten: P(x q) = q x (1 q) (1 x) x log q+(1 x) log(1 q) = e log(1 q)+x log(q/(1 q)) = e = (1 q)e log(q/(1 q))x So the natural parameter s the log odds log(q/(1 q)), and the suffcent stats (for multple tosses) s the number of heads. The conjugate pror s P(q) = F(τ, ν) (1 q) ν e log(q/(1 q))τ = F(τ, ν) (1 q) ν τ log q τ log(1 q) e = F(τ, ν) (1 q) ν τ q τ whch has the form of the Beta dstrbuton F(τ, ν) = 1/B(τ + 1, ν τ + 1). In general, then, the posteror wll be P(q {x }) = Beta(α 1, α 2), wth α 1 = 1 + τ + x α 2 = 1 + (ν + n) (τ + ) x If we observe a head, we add 1 to the suffcent statstc x, and also 1 to the count n. Ths ncrements α 1. If we observe a tal we add 1 to n, but not to x, ncrementng α 2.
74 Bayesan cons comparng models We have seen how to update posterors wthn each model. To study the choce of model, consder two more extreme models: far and bent.
75 Bayesan cons comparng models We have seen how to update posterors wthn each model. To study the choce of model, consder two more extreme models: far and bent. A pror, we may thnk that far s more probable, eg: p(far) = 0.8, p(bent) = 0.2
76 Bayesan cons comparng models We have seen how to update posterors wthn each model. To study the choce of model, consder two more extreme models: far and bent. A pror, we may thnk that far s more probable, eg: p(far) = 0.8, p(bent) = 0.2 For the bent con, we assume all parameter values are equally lkely, whlst the far con has a fxed probablty: p(q bent) p(q far) parameter, q parameter, q
77 Bayesan cons comparng models We have seen how to update posterors wthn each model. To study the choce of model, consder two more extreme models: far and bent. A pror, we may thnk that far s more probable, eg: p(far) = 0.8, p(bent) = 0.2 For the bent con, we assume all parameter values are equally lkely, whlst the far con has a fxed probablty: p(q bent) p(q far) parameter, q parameter, q We make 10 tosses, and get: D = (T H T H T T T T T T).
78 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)?
79 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)? The evdence for the far model s: P(D far) = (1/2)
80 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)? The evdence for the far model s: P(D far) = (1/2) and for the bent model s: P(D bent) = dq P(D q, bent)p(q bent)
81 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)? The evdence for the far model s: P(D far) = (1/2) and for the bent model s: P(D bent) = dq P(D q, bent)p(q bent) = dq q 2 (1 q) 8
82 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)? The evdence for the far model s: P(D far) = (1/2) and for the bent model s: P(D bent) = dq P(D q, bent)p(q bent) = dq q 2 (1 q) 8 = B(3, 9) 0.002
83 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)? The evdence for the far model s: P(D far) = (1/2) and for the bent model s: P(D bent) = dq P(D q, bent)p(q bent) = dq q 2 (1 q) 8 = B(3, 9) Thus, the posteror for the models, by Bayes rule: P(far D) , P(bent D) , e, a two-thrds probablty that the con s far.
84 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)? The evdence for the far model s: P(D far) = (1/2) and for the bent model s: P(D bent) = dq P(D q, bent)p(q bent) = dq q 2 (1 q) 8 = B(3, 9) Thus, the posteror for the models, by Bayes rule: P(far D) , P(bent D) , e, a two-thrds probablty that the con s far. How do we make predctons?
85 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)? The evdence for the far model s: P(D far) = (1/2) and for the bent model s: P(D bent) = dq P(D q, bent)p(q bent) = dq q 2 (1 q) 8 = B(3, 9) Thus, the posteror for the models, by Bayes rule: P(far D) , P(bent D) , e, a two-thrds probablty that the con s far. How do we make predctons? Could choose the far model (model selecton).
86 Bayesan cons comparng models Whch model should we prefer a posteror (.e. after seeng the data)? The evdence for the far model s: P(D far) = (1/2) and for the bent model s: P(D bent) = dq P(D q, bent)p(q bent) = dq q 2 (1 q) 8 = B(3, 9) Thus, the posteror for the models, by Bayes rule: P(far D) , P(bent D) , e, a two-thrds probablty that the con s far. How do we make predctons? Could choose the far model (model selecton). Or could weght the predctons from each model by ther probablty (model averagng). Probablty of H at next toss s: P(H D) = P(H D, far)p(far D) + P(H D, bent)p(bent D) = = 5 12.
87 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters.
88 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly).
89 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons
90 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D)
91 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D) θ P(θ D) mnmses squared loss.
92 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D) θ P(θ D) mnmses squared loss. Maxmum a Posteror (MAP) estmate: Assume a pror over the model parameters P(θ), and compute parameters that are most probable under the posteror: θ MAP = argmax P(θ D) = argmax P(θ)P(D θ).
93 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D) θ P(θ D) mnmses squared loss. Maxmum a Posteror (MAP) estmate: Assume a pror over the model parameters P(θ), and compute parameters that are most probable under the posteror: θ MAP = argmax P(θ D) = argmax P(θ)P(D θ). Equvalent to mnmsng the 0/1 loss.
94 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D) θ P(θ D) mnmses squared loss. Maxmum a Posteror (MAP) estmate: Assume a pror over the model parameters P(θ), and compute parameters that are most probable under the posteror: θ MAP = argmax P(θ D) = argmax P(θ)P(D θ). Equvalent to mnmsng the 0/1 loss. Maxmum Lkelhood (ML) Learnng: No pror over the parameters. Compute parameter value that maxmses the lkelhood functon alone: θ ML = argmax P(D θ).
95 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D) θ P(θ D) mnmses squared loss. Maxmum a Posteror (MAP) estmate: Assume a pror over the model parameters P(θ), and compute parameters that are most probable under the posteror: θ MAP = argmax P(θ D) = argmax P(θ)P(D θ). Equvalent to mnmsng the 0/1 loss. Maxmum Lkelhood (ML) Learnng: No pror over the parameters. Compute parameter value that maxmses the lkelhood functon alone: θ ML = argmax P(D θ). Parametersaton-ndependent.
96 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D) θ P(θ D) mnmses squared loss. Maxmum a Posteror (MAP) estmate: Assume a pror over the model parameters P(θ), and compute parameters that are most probable under the posteror: θ MAP = argmax P(θ D) = argmax P(θ)P(D θ). Equvalent to mnmsng the 0/1 loss. Maxmum Lkelhood (ML) Learnng: No pror over the parameters. Compute parameter value that maxmses the lkelhood functon alone: θ ML = argmax P(D θ). Parametersaton-ndependent. Approxmatons may allow us to recover samples from posteror, or to fnd a dstrbuton whch s close n some sense.
97 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D) θ P(θ D) mnmses squared loss. Maxmum a Posteror (MAP) estmate: Assume a pror over the model parameters P(θ), and compute parameters that are most probable under the posteror: θ MAP = argmax P(θ D) = argmax P(θ)P(D θ). Equvalent to mnmsng the 0/1 loss. Maxmum Lkelhood (ML) Learnng: No pror over the parameters. Compute parameter value that maxmses the lkelhood functon alone: θ ML = argmax P(D θ). Parametersaton-ndependent. Approxmatons may allow us to recover samples from posteror, or to fnd a dstrbuton whch s close n some sense. Choosng between these and other alternatves may be a matter of defnton, of goals (loss functon), or of practcalty.
98 Learnng parameters The Bayesan probablstc prescrpton tells us how to reason about models and ther parameters. But t s often mpractcal for realstc models (outsde the exponental famly). Pont estmates of parameters or other predctons Compute posteror and fnd sngle parameter that mnmses expected loss. θ BP = argmn L(ˆθ, θ) ˆθ P(θ D) θ P(θ D) mnmses squared loss. Maxmum a Posteror (MAP) estmate: Assume a pror over the model parameters P(θ), and compute parameters that are most probable under the posteror: θ MAP = argmax P(θ D) = argmax P(θ)P(D θ). Equvalent to mnmsng the 0/1 loss. Maxmum Lkelhood (ML) Learnng: No pror over the parameters. Compute parameter value that maxmses the lkelhood functon alone: θ ML = argmax P(D θ). Parametersaton-ndependent. Approxmatons may allow us to recover samples from posteror, or to fnd a dstrbuton whch s close n some sense. Choosng between these and other alternatves may be a matter of defnton, of goals (loss functon), or of practcalty. For the next few weeks we wll look at ML and MAP learnng n more complex models. We wll then return to the fully Bayesan formulaton for the few nterstng cases where t s tractable. Approxmatons wll be addressed n the second half of the course.
99 Modellng assocatons between varables 1 Data set D = {x 1,..., x N} x 2 0 wth each data pont a vector of D features: x = [x 1... x D] Assume data are..d. (ndependent and dentcally dstrbuted) x 1
100 Modellng assocatons between varables 1 Data set D = {x 1,..., x N} x 2 0 wth each data pont a vector of D features: x = [x 1... x D] Assume data are..d. (ndependent and dentcally dstrbuted) x 1 A smple forms of unsupervsed (structure) learnng: model the mean of the data and the correlatons between the D features n the data.
101 Modellng assocatons between varables 1 Data set D = {x 1,..., x N} x 2 0 wth each data pont a vector of D features: x = [x 1... x D] Assume data are..d. (ndependent and dentcally dstrbuted) x 1 A smple forms of unsupervsed (structure) learnng: model the mean of the data and the correlatons between the D features n the data. We can use a multvarate Gaussan model: p(x µ, Σ) = N (µ, Σ) = 2πΣ 1 2 exp { 1 } 2 (x µ)t Σ 1 (x µ)
102 ML Learnng for a Gaussan N Data set D = {x 1,..., x N}, lkelhood: p(d µ, Σ) = p(x n µ, Σ) n=1 Goal: fnd µ and Σ that maxmse lkelhood N L = p(x n µ, Σ) n=1
103 ML Learnng for a Gaussan N Data set D = {x 1,..., x N}, lkelhood: p(d µ, Σ) = p(x n µ, Σ) Goal: fnd µ and Σ that maxmse lkelhood maxmse log lkelhood: n=1 N l = log p(x n µ, Σ) n=1
104 ML Learnng for a Gaussan N Data set D = {x 1,..., x N}, lkelhood: p(d µ, Σ) = p(x n µ, Σ) Goal: fnd µ and Σ that maxmse lkelhood maxmse log lkelhood: n=1 l = log N n=1 p(x n µ, Σ) = n log p(x n µ, Σ)
105 ML Learnng for a Gaussan N Data set D = {x 1,..., x N}, lkelhood: p(d µ, Σ) = p(x n µ, Σ) Goal: fnd µ and Σ that maxmse lkelhood maxmse log lkelhood: n=1 N l = log p(x n µ, Σ) = log p(x n µ, Σ) n=1 n = N 2 log 2πΣ 1 (x n µ) T Σ 1 (x n µ) 2 n
106 ML Learnng for a Gaussan N Data set D = {x 1,..., x N}, lkelhood: p(d µ, Σ) = p(x n µ, Σ) Goal: fnd µ and Σ that maxmse lkelhood maxmse log lkelhood: n=1 N l = log p(x n µ, Σ) = log p(x n µ, Σ) n=1 n = N 2 log 2πΣ 1 (x n µ) T Σ 1 (x n µ) 2 n Note: equvalently, mnmse l, whch s quadratc n µ
107 ML Learnng for a Gaussan N Data set D = {x 1,..., x N}, lkelhood: p(d µ, Σ) = p(x n µ, Σ) Goal: fnd µ and Σ that maxmse lkelhood maxmse log lkelhood: n=1 N l = log p(x n µ, Σ) = log p(x n µ, Σ) n=1 n = N 2 log 2πΣ 1 (x n µ) T Σ 1 (x n µ) 2 n Note: equvalently, mnmse l, whch s quadratc n µ Procedure: take dervatves and set to zero: l µ = 0 ˆµ = 1 x n (sample mean) N n l Σ = 0 1 ˆΣ = (x n ˆµ)(x n ˆµ) T (sample covarance) N n
108 Refresher matrx dervatves of scalar forms We wll use the followng facts:
109 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr [ ] x T Ay (scalars equal ther own transpose and trace)
110 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] x T Ay (scalars equal ther own transpose and trace)
111 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA]
112 Refresher matrx dervatves of scalar forms We wll use the followng facts: [ ] Tr A T B A j x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA]
113 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = [A T B] nn A j A j n [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA]
114 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A j A j n m [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] A T nmb mn
115 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A j A j mn [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] A mnb mn
116 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA]
117 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA]
118 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ ] A Tr A T BAC [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA]
119 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps
120 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr [ F T 1 BF 2C ] F1 A + F 2 Tr [ F T 1 BF 2C ] F2 A
121 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr [ F T 1 BF 2C ] F1 A + F 2 Tr [ CF T 1 BF 2 ] F2 A
122 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr [ F T 1 BF 2C ] F1 A + F 2 Tr [ F T 2 B T F 1C T ] F2 A
123 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr [ F T 1 BF 2C = BF 2C + B T F 1C T ] F1 A + F 2 Tr [ F T 2 B T F 1C T ] F2 A
124 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr [ F T 1 BF 2C ] F1 A + F 2 Tr = BF 2C + B T F 1C T = BAC + B T AC T [ F T 2 B T F 1C T ] F2 A
125 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] A j log A [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr [ F T 1 BF 2C ] F1 A + F 2 Tr = BF 2C + B T F 1C T = BAC + B T AC T [ F T 2 B T F 1C T ] F2 A
126 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr [ F T 1 BF 2C ] F1 A + F 2 Tr = BF 2C + B T F 1C T = BAC + B T AC T log A = 1 A A j A A j [ F T 2 B T F 1C T ] F2 A
127 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr log A = 1 A j A [ F T 1 BF 2C ] F1 A + F 2 Tr = BF 2C + B T F 1C T = BAC + B T AC T ( 1) +k A k [A] k A j k [ F T 2 B T F 1C T ] F2 A
128 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr log A = 1 A j A [ F T 1 BF 2C ] F1 A + F 2 Tr = BF 2C + B T F 1C T = BAC + B T AC T [ F T 2 B T F 1C T ] F2 A ( 1) +k A k [A] k = 1 A j A ( 1)+j [A] j k
129 Refresher matrx dervatves of scalar forms We wll use the followng facts: x T Ay = y T A T x = Tr ] Tr [A] = Tr [A T [ ] Tr A T B = A mnb mn = B j A j A j mn [ ] A Tr A T B = B [ A Tr A T BAC ] [ ] x T Ay (scalars equal ther own transpose and trace) Tr [ABC] = Tr [CAB] = Tr [BCA] = [ ] A Tr F 1(A) T BF 2(A)C wth F 1 and F 2 both dentty maps = F 1 Tr log A = 1 A j A A log A = (A 1 ) T [ F T 1 BF 2C ] F1 A + F 2 Tr = BF 2C + B T F 1C T = BAC + B T AC T [ F T 2 B T F 1C T ] F2 A ( 1) +k A k [A] k = 1 A j A ( 1)+j [A] j k
CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements
CS 750 Machne Learnng Lecture 5 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square CS 750 Machne Learnng Announcements Homework Due on Wednesday before the class Reports: hand n before
More informationProbabilistic and Bayesian Machine Learning
Probabilistic and Bayesian Machine Learning Lecture 1: Introduction to Probabilistic Modelling Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Why a
More informationMLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012
MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:
More informationConjugacy and the Exponential Family
CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the
More informationProbabilistic Classification: Bayes Classifiers. Lecture 6:
Probablstc Classfcaton: Bayes Classfers Lecture : Classfcaton Models Sam Rowes January, Generatve model: p(x, y) = p(y)p(x y). p(y) are called class prors. p(x y) are called class condtonal feature dstrbutons.
More informationDepartment of Computer Science Artificial Intelligence Research Laboratory. Iowa State University MACHINE LEARNING
MACHINE LEANING Vasant Honavar Bonformatcs and Computatonal Bology rogram Center for Computatonal Intellgence, Learnng, & Dscovery Iowa State Unversty honavar@cs.astate.edu www.cs.astate.edu/~honavar/
More informationMachine learning: Density estimation
CS 70 Foundatons of AI Lecture 3 Machne learnng: ensty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square ata: ensty estmaton {.. n} x a vector of attrbute values Objectve: estmate the model of
More informationComposite Hypotheses testing
Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter
More information2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification
E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton
More informationEM and Structure Learning
EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder
More informationLecture Notes on Linear Regression
Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume
More information} Often, when learning, we deal with uncertainty:
Uncertanty and Learnng } Often, when learnng, we deal wth uncertanty: } Incomplete data sets, wth mssng nformaton } Nosy data sets, wth unrelable nformaton } Stochastcty: causes and effects related non-determnstcally
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors
Stat60: Bayesan Modelng and Inference Lecture Date: February, 00 Reference Prors Lecturer: Mchael I. Jordan Scrbe: Steven Troxler and Wayne Lee In ths lecture, we assume that θ R; n hgher-dmensons, reference
More informationCIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M
CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute
More informationLogistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton
More informationLecture 3: Probability Distributions
Lecture 3: Probablty Dstrbutons Random Varables Let us begn by defnng a sample space as a set of outcomes from an experment. We denote ths by S. A random varable s a functon whch maps outcomes nto the
More informationMotion Perception Under Uncertainty. Hongjing Lu Department of Psychology University of Hong Kong
Moton Percepton Under Uncertanty Hongjng Lu Department of Psychology Unversty of Hong Kong Outlne Uncertanty n moton stmulus Correspondence problem Qualtatve fttng usng deal observer models Based on sgnal
More information8/25/17. Data Modeling. Data Modeling. Data Modeling. Patrice Koehl Department of Biological Sciences National University of Singapore
8/5/17 Data Modelng Patrce Koehl Department of Bologcal Scences atonal Unversty of Sngapore http://www.cs.ucdavs.edu/~koehl/teachng/bl59 koehl@cs.ucdavs.edu Data Modelng Ø Data Modelng: least squares Ø
More informationParametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010
Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton
More informationCourse 395: Machine Learning - Lectures
Course 395: Machne Learnng - Lectures Lecture 1-2: Concept Learnng (M. Pantc Lecture 3-4: Decson Trees & CC Intro (M. Pantc Lecture 5-6: Artfcal Neural Networks (S.Zaferou Lecture 7-8: Instance ased Learnng
More informationxp(x µ) = 0 p(x = 0 µ) + 1 p(x = 1 µ) = µ
CSE 455/555 Sprng 2013 Homework 7: Parametrc Technques Jason J. Corso Computer Scence and Engneerng SUY at Buffalo jcorso@buffalo.edu Solutons by Yngbo Zhou Ths assgnment does not need to be submtted and
More informationMaximum Likelihood Estimation (MLE)
Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y
More informationThe conjugate prior to a Bernoulli is. A) Bernoulli B) Gaussian C) Beta D) none of the above
The conjugate pror to a Bernoull s A) Bernoull B) Gaussan C) Beta D) none of the above The conjugate pror to a Gaussan s A) Bernoull B) Gaussan C) Beta D) none of the above MAP estmates A) argmax θ p(θ
More informationLecture 12: Classification
Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna
More informationMATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)
1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons
More informationHomework Assignment 3 Due in class, Thursday October 15
Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.
More informationP R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /
Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons
More informationENG 8801/ Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland Pattern Recognition
EG 880/988 - Specal opcs n Computer Engneerng: Pattern Recognton Memoral Unversty of ewfoundland Pattern Recognton Lecture 7 May 3, 006 http://wwwengrmunca/~charlesr Offce Hours: uesdays hursdays 8:30-9:30
More informationFinite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin
Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of
More informationRelevance Vector Machines Explained
October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes
More informationEngineering Risk Benefit Analysis
Engneerng Rsk Beneft Analyss.55, 2.943, 3.577, 6.938, 0.86, 3.62, 6.862, 22.82, ESD.72, ESD.72 RPRA 2. Elements of Probablty Theory George E. Apostolaks Massachusetts Insttute of Technology Sprng 2007
More informationPredictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore
Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.
More informationThe Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction
ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also
More informationUsing T.O.M to Estimate Parameter of distributions that have not Single Exponential Family
IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran
More informationClassification as a Regression Problem
Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class
More information9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov
9.93 Class IV Part I Bayesan Decson Theory Yur Ivanov TOC Roadmap to Machne Learnng Bayesan Decson Makng Mnmum Error Rate Decsons Mnmum Rsk Decsons Mnmax Crteron Operatng Characterstcs Notaton x - scalar
More informationHidden Markov Models & The Multivariate Gaussian (10/26/04)
CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models
More informationANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)
Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of
More informationSee Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)
Count Data Models See Book Chapter 11 2 nd Edton (Chapter 10 1 st Edton) Count data consst of non-negatve nteger values Examples: number of drver route changes per week, the number of trp departure changes
More informationj) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1
Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons
More informationMaximum Likelihood Estimation
Maxmum Lkelhood Estmaton INFO-2301: Quanttatve Reasonng 2 Mchael Paul and Jordan Boyd-Graber MARCH 7, 2017 INFO-2301: Quanttatve Reasonng 2 Paul and Boyd-Graber Maxmum Lkelhood Estmaton 1 of 9 Why MLE?
More informationProbability Theory (revisited)
Probablty Theory (revsted) Summary Probablty v.s. plausblty Random varables Smulaton of Random Experments Challenge The alarm of a shop rang. Soon afterwards, a man was seen runnng n the street, persecuted
More information3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X
Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number
More informationFirst Year Examination Department of Statistics, University of Florida
Frst Year Examnaton Department of Statstcs, Unversty of Florda May 7, 010, 8:00 am - 1:00 noon Instructons: 1. You have four hours to answer questons n ths examnaton.. You must show your work to receve
More informationSpace of ML Problems. CSE 473: Artificial Intelligence. Parameter Estimation and Bayesian Networks. Learning Topics
/7/7 CSE 73: Artfcal Intellgence Bayesan - Learnng Deter Fox Sldes adapted from Dan Weld, Jack Breese, Dan Klen, Daphne Koller, Stuart Russell, Andrew Moore & Luke Zettlemoyer What s Beng Learned? Space
More informationChapter 9: Statistical Inference and the Relationship between Two Variables
Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,
More informationGaussian process classification: a message-passing viewpoint
Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP
More informationModule 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:
More informationClassification learning II
Lecture 8 Classfcaton learnng II Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Logstc regresson model Defnes a lnear decson boundar Dscrmnant functons: g g g g here g z / e z f, g g - s a logstc functon
More information10-701/ Machine Learning, Fall 2005 Homework 3
10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40
More informationGenerative classification models
CS 675 Intro to Machne Learnng Lecture Generatve classfcaton models Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Data: D { d, d,.., dn} d, Classfcaton represents a dscrete class value Goal: learn
More informationMaximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models
ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models
More informationMIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU
Group M D L M Chapter Bayesan Decson heory Xn-Shun Xu @ SDU School of Computer Scence and echnology, Shandong Unversty Bayesan Decson heory Bayesan decson theory s a statstcal approach to data mnng/pattern
More informationLimited Dependent Variables
Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages
More informationProbabilistic & Unsupervised Learning
Probablstc & Unsupervsed Learnng Convex Algorthms n Approxmate Inference Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computatonal Neuroscence Unt Unversty College London Term 1, Autumn 2008 Convexty A convex
More informationGeneralized Linear Methods
Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set
More informationMarkov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement
Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs
More informationStatistical pattern recognition
Statstcal pattern recognton Bayes theorem Problem: decdng f a patent has a partcular condton based on a partcular test However, the test s mperfect Someone wth the condton may go undetected (false negatve
More informationINF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018
INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton
More informationThe Geometry of Logit and Probit
The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.
More informationLecture 10 Support Vector Machines II
Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed
More informationComputation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models
Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,
More informationThe Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD
he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s
More informationAn Experiment/Some Intuition (Fall 2006): Lecture 18 The EM Algorithm heads coin 1 tails coin 2 Overview Maximum Likelihood Estimation
An Experment/Some Intuton I have three cons n my pocket, 6.864 (Fall 2006): Lecture 18 The EM Algorthm Con 0 has probablty λ of heads; Con 1 has probablty p 1 of heads; Con 2 has probablty p 2 of heads
More informationClustering & Unsupervised Learning
Clusterng & Unsupervsed Learnng Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 2012 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y
More informationHidden Markov Models
CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte
More informationLearning from Data 1 Naive Bayes
Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why
More information/ n ) are compared. The logic is: if the two
STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence
More informationEcon107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)
I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes
More informationWeek 5: Neural Networks
Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple
More informationCommunication with AWGN Interference
Communcaton wth AWG Interference m {m } {p(m } Modulator s {s } r=s+n Recever ˆm AWG n m s a dscrete random varable(rv whch takes m wth probablty p(m. Modulator maps each m nto a waveform sgnal s m=m
More informationWhy Bayesian? 3. Bayes and Normal Models. State of nature: class. Decision rule. Rev. Thomas Bayes ( ) Bayes Theorem (yes, the famous one)
Why Bayesan? 3. Bayes and Normal Models Alex M. Martnez alex@ece.osu.edu Handouts Handoutsfor forece ECE874 874Sp Sp007 If all our research (n PR was to dsappear and you could only save one theory, whch
More informationx i1 =1 for all i (the constant ).
Chapter 5 The Multple Regresson Model Consder an economc model where the dependent varable s a functon of K explanatory varables. The economc model has the form: y = f ( x,x,..., ) xk Approxmate ths by
More informationStatistical analysis using matlab. HY 439 Presented by: George Fortetsanakis
Statstcal analyss usng matlab HY 439 Presented by: George Fortetsanaks Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons Contnuous dstrbutons Contnuous random varable X
More informationClustering & (Ken Kreutz-Delgado) UCSD
Clusterng & Unsupervsed Learnng Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y ), fnd an approxmatng
More informationExpectation propagation
Expectaton propagaton Lloyd Ellott May 17, 2011 Suppose p(x) s a pdf and we have a factorzaton p(x) = 1 Z n f (x). (1) =1 Expectaton propagaton s an nference algorthm desgned to approxmate the factors
More information8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF
10-708: Probablstc Graphcal Models 10-708, Sprng 2014 8 : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the
More informationSTAT 3008 Applied Regression Analysis
STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,
More information1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands
Content. Inference on Regresson Parameters a. Fndng Mean, s.d and covarance amongst estmates.. Confdence Intervals and Workng Hotellng Bands 3. Cochran s Theorem 4. General Lnear Testng 5. Measures of
More informationA be a probability space. A random vector
Statstcs 1: Probablty Theory II 8 1 JOINT AND MARGINAL DISTRIBUTIONS In Probablty Theory I we formulate the concept of a (real) random varable and descrbe the probablstc behavor of ths random varable by
More informationArtificial Intelligence Bayesian Networks
Artfcal Intellgence Bayesan Networks Adapted from sldes by Tm Fnn and Mare desjardns. Some materal borrowed from Lse Getoor. 1 Outlne Bayesan networks Network structure Condtonal probablty tables Condtonal
More informationExplaining the Stein Paradox
Explanng the Sten Paradox Kwong Hu Yung 1999/06/10 Abstract Ths report offers several ratonale for the Sten paradox. Sectons 1 and defnes the multvarate normal mean estmaton problem and ntroduces Sten
More informationStatistics for Economics & Business
Statstcs for Economcs & Busness Smple Lnear Regresson Learnng Objectves In ths chapter, you learn: How to use regresson analyss to predct the value of a dependent varable based on an ndependent varable
More informationSemi-Supervised Learning
Sem-Supervsed Learnng Consder the problem of Prepostonal Phrase Attachment. Buy car wth money ; buy car wth wheel There are several ways to generate features. Gven the lmted representaton, we can assume
More informationBayesian decision theory. Nuno Vasconcelos ECE Department, UCSD
Bayesan decson theory Nuno Vasconcelos ECE Department UCSD Notaton the notaton n DHS s qute sloppy e.. show that error error z z dz really not clear what ths means we wll use the follown notaton subscrpts
More informationSTATS 306B: Unsupervised Learning Spring Lecture 10 April 30
STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear
More informationRandomness and Computation
Randomness and Computaton or, Randomzed Algorthms Mary Cryan School of Informatcs Unversty of Ednburgh RC 208/9) Lecture 0 slde Balls n Bns m balls, n bns, and balls thrown unformly at random nto bns usually
More informationStat 543 Exam 2 Spring 2016
Stat 543 Exam 2 Sprng 206 I have nether gven nor receved unauthorzed assstance on ths exam. Name Sgned Date Name Prnted Ths Exam conssts of questons. Do at least 0 of the parts of the man exam. I wll score
More informationThe Expectation-Maximization Algorithm
The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.
More information1/10/18. Definitions. Probabilistic models. Why probabilistic models. Example: a fair 6-sided dice. Probability
/0/8 I529: Machne Learnng n Bonformatcs Defntons Probablstc models Probablstc models A model means a system that smulates the object under consderaton A probablstc model s one that produces dfferent outcomes
More informationGaussian Mixture Models
Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous
More informationEGR 544 Communication Theory
EGR 544 Communcaton Theory. Informaton Sources Z. Alyazcoglu Electrcal and Computer Engneerng Department Cal Poly Pomona Introducton Informaton Source x n Informaton sources Analog sources Dscrete sources
More informationProbability and Random Variable Primer
B. Maddah ENMG 622 Smulaton 2/22/ Probablty and Random Varable Prmer Sample space and Events Suppose that an eperment wth an uncertan outcome s performed (e.g., rollng a de). Whle the outcome of the eperment
More informationLecture 3: Shannon s Theorem
CSE 533: Error-Correctng Codes (Autumn 006 Lecture 3: Shannon s Theorem October 9, 006 Lecturer: Venkatesan Guruswam Scrbe: Wdad Machmouch 1 Communcaton Model The communcaton model we are usng conssts
More informationHow its computed. y outcome data λ parameters hyperparameters. where P denotes the Laplace approximation. k i k k. Andrew B Lawson 2013
Andrew Lawson MUSC INLA INLA s a relatvely new tool that can be used to approxmate posteror dstrbutons n Bayesan models INLA stands for ntegrated Nested Laplace Approxmaton The approxmaton has been known
More informationExpectation Maximization Mixture Models HMMs
-755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood
More informationChapter 11: Simple Linear Regression and Correlation
Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests
More informationFor now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.
Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson
More informationOther NN Models. Reinforcement learning (RL) Probabilistic neural networks
Other NN Models Renforcement learnng (RL) Probablstc neural networks Support vector machne (SVM) Renforcement learnng g( (RL) Basc deas: Supervsed dlearnng: (delta rule, BP) Samples (x, f(x)) to learn
More informationThe Order Relation and Trace Inequalities for. Hermitian Operators
Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence
More information