Appendix


Xi Chen, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Qihang Lin, Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Dengyong Zhou, Machine Learning Group, Microsoft Research, Redmond, WA 98052, USA

1 Proof of Corollary 1

Recall that $I(a,b) \equiv \Pr\big(\theta \geq 1/2 \mid \theta \sim \mathrm{Beta}(a,b)\big) = \frac{1}{B(a,b)}\int_{1/2}^{1} t^{a-1}(1-t)^{b-1}\,dt$, where $B(a,b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt$ is the beta function. It is easy to see that $I(a+1,b) > I(a,b) > I(a,b+1)$. We re-write $I(a,b)$ as follows:
$$I(a,b) = 1 - \frac{1}{B(a,b)}\int_0^{1/2} t^{a-1}(1-t)^{b-1}\,dt = 1 - \frac{1}{B(a,b)}\int_{1/2}^{1} t^{b-1}(1-t)^{a-1}\,dt,$$
where the second equality is obtained by setting $t := 1-t$. Then we have:
$$I(a,b) - \big(1 - I(a,b)\big) = \frac{1}{B(a,b)}\int_{1/2}^{1}\Big(t^{a-1}(1-t)^{b-1} - t^{b-1}(1-t)^{a-1}\Big)\,dt = \frac{1}{B(a,b)}\int_{1/2}^{1}\left(\Big(\frac{t}{1-t}\Big)^{a-b} - 1\right)t^{b-1}(1-t)^{a-1}\,dt.$$
When $a > b$: since $t > 1/2$, we have $\frac{t}{1-t} > 1$, hence $\big(\frac{t}{1-t}\big)^{a-b} > 1$ and $I(a,b) - (1-I(a,b)) > 0$, i.e., $I(a,b) > 1/2$. When $a = b$: $\big(\frac{t}{1-t}\big)^{a-b} = 1$ and $I(a,b) = 1/2$. When $a < b$: $\big(\frac{t}{1-t}\big)^{a-b} < 1$ and $I(a,b) < 1/2$.

2 Proof of Proposition 3

We use the proof technique in Xie & Frazier to prove Proposition 3. By Proposition 1, we first obtain the value function:
$$V(S^0) = \sup_{\pi}\mathbb{E}\Big[\mathbb{E}\big(|H_T\cap H^*| + |H_T^c\cap (H^*)^c| \,\big|\, F_T\big)\Big] = \sup_{\pi}\mathbb{E}\Big[\sum_{i=1}^{K} h(P_i^T)\Big],$$
where $h(x) = \max(x, 1-x)$. To decompose the final accuracy $\sum_{i=1}^K h(P_i^T)$ into intermediate rewards at each stage, we define $G_0 = \sum_{i=1}^K h(P_i^0)$ and $G_{t+1} = \sum_{i=1}^K h(P_i^{t+1}) - \sum_{i=1}^K h(P_i^t)$. Then $\sum_{i=1}^K h(P_i^T)$ can be decomposed as:
$$\sum_{i=1}^K h(P_i^T) = G_0 + \sum_{t=0}^{T-1} G_{t+1}.$$
The value function can now be re-written as follows:
$$V(S^0) = G_0(S^0) + \sup_\pi\mathbb{E}\Big[\sum_{t=0}^{T-1} G_{t+1}\Big] = G_0(S^0) + \sup_\pi\mathbb{E}\Big[\sum_{t=0}^{T-1}\mathbb{E}\big(G_{t+1}\mid F_t\big)\Big] = G_0(S^0) + \sup_\pi\mathbb{E}\Big[\sum_{t=0}^{T-1}\mathbb{E}\big(G_{t+1}\mid S^t, i_t\big)\Big].$$
Here, the first equality is true because $G_0$ is deterministic and independent of $\pi$, the second equality is due to the tower property of conditional expectation, and the third equality holds because $P_i^{t+1}$ and $P_i^t$, and thus $G_{t+1}$, depend on $F_t$ only through $S^t$ and $i_t$. We define the intermediate expected reward gained by labeling the $i_t$-th instance at the state $S^t$ as follows:
$$R(S^t, i_t) \equiv \mathbb{E}\big(G_{t+1}\mid S^t, i_t\big) = \mathbb{E}\Big(\sum_{i=1}^K h(P_i^{t+1}) - \sum_{i=1}^K h(P_i^{t}) \,\Big|\, S^t, i_t\Big) = \mathbb{E}\Big(h(P_{i_t}^{t+1}) - h(P_{i_t}^{t}) \,\Big|\, S^t, i_t\Big).$$
The last equation is due to the fact that only $P_{i_t}^t$ will change if the $i_t$-th instance is labeled next. With the expected reward function in place, our value function takes the following form:
$$V(S^0) = G_0(S^0) + \sup_\pi\mathbb{E}\Big[\sum_{t=0}^{T-1} R(S^t, i_t) \,\Big|\, S^0\Big].$$
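Both the corollary and the reward just defined are expressed through $I(a,b)$, which is one minus the regularized incomplete beta function evaluated at $1/2$. The following small sketch (an illustration, not part of the original appendix; it assumes SciPy is available) evaluates $I(a,b)$ and checks the trichotomy of Corollary 1 numerically.

```python
# Numerical check of Corollary 1: I(a, b) = Pr(theta >= 1/2) for theta ~ Beta(a, b)
# equals 1 - betainc(a, b, 1/2), where betainc is the regularized incomplete beta.
from scipy.special import betainc

def I(a, b):
    """Pr(theta >= 1/2) under Beta(a, b)."""
    return 1.0 - betainc(a, b, 0.5)

for a, b in [(3, 2), (2, 3), (4, 4), (7, 1)]:
    val = I(a, b)
    if a > b:
        assert val > 0.5
    elif a < b:
        assert val < 0.5
    else:
        assert abs(val - 0.5) < 1e-12
    print(f"I({a},{b}) = {val:.4f}")
```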

3 Proof of Proposition 3

To prove the failure of deterministic KG, we first establish a key property of the expected reward function
$$R(a,b) = \frac{a}{a+b}\,h\big(I(a+1,b)\big) + \frac{b}{a+b}\,h\big(I(a,b+1)\big) - h\big(I(a,b)\big).$$

Lemma 3. When $a, b$ are positive integers, if $a = b$ then $R(a,b) = \frac{1}{a\,2^{2a}\,B(a,a)}$, and if $a \neq b$ then $R(a,b) = 0$.

To prove Lemma 3, we first present several basic properties of $B(a,b)$ and $I(a,b)$, which will be used in the following theorems and proofs.

Properties of $B(a,b)$:
$$B(a,b) = B(b,a), \qquad B(a+1,b) = \frac{a}{a+b}\,B(a,b), \qquad B(a,b+1) = \frac{b}{a+b}\,B(a,b).$$
Properties of $I(a,b)$:
$$I(a,b) = 1 - I(b,a), \qquad I(a+1,b) = I(a,b) + \frac{1}{2^{a+b}\,a\,B(a,b)}, \qquad I(a,b+1) = I(a,b) - \frac{1}{2^{a+b}\,b\,B(a,b)}.$$
The properties of $I(a,b)$ are derived from basic properties of the regularized incomplete beta function.

Proof. When $a = b$, by Corollary 1 we have $I(a+1,b) > 1/2$, $I(a,b) = 1/2$ and $I(a,b+1) < 1/2$. Therefore, the expected reward takes the following form:
$$R(a,a) = \frac{1}{2}\,I(a+1,a) + \frac{1}{2}\big(1 - I(a,a+1)\big) - I(a,a) = I(a+1,a) - I(a,a) = \frac{1}{a\,2^{2a}\,B(a,a)},$$
where the second equality uses the symmetry $1 - I(a,a+1) = I(a+1,a)$ and the last uses the recurrence for $I(a+1,a)$.

When $a > b$, since $a, b$ are integers, we have $a \geq b+1$ and hence $I(a+1,b) > 1/2$, $I(a,b) > 1/2$ and $I(a,b+1) \geq 1/2$ according to Corollary 1. The expected reward now becomes:
$$R(a,b) = \frac{a}{a+b}\,I(a+1,b) + \frac{b}{a+b}\,I(a,b+1) - I(a,b) = \frac{1}{B(a,b)}\int_{1/2}^{1} t^{a}(1-t)^{b-1}\,dt + \frac{1}{B(a,b)}\int_{1/2}^{1} t^{a-1}(1-t)^{b}\,dt - I(a,b)$$
$$= \frac{1}{B(a,b)}\int_{1/2}^{1}\big(t + (1-t)\big)\,t^{a-1}(1-t)^{b-1}\,dt - I(a,b) = I(a,b) - I(a,b) = 0.$$
Here we use the recurrences $B(a+1,b) = \frac{a}{a+b}B(a,b)$ and $B(a,b+1) = \frac{b}{a+b}B(a,b)$ to show that $\frac{a}{a+b}\,\frac{1}{B(a+1,b)} = \frac{b}{a+b}\,\frac{1}{B(a,b+1)} = \frac{1}{B(a,b)}$. When $a < b$, we can prove $R(a,b) = 0$ in a similar way.

With Lemma 3 in place, the proof of Proposition 3 is straightforward. Recall that the deterministic KG policy chooses the next instance according to $i_t = \arg\max_i R(S^t, i) = \arg\max_i R(a_i^t, b_i^t)$ and breaks ties by selecting the one with the smallest index. Since $R(a,b) > 0$ if and only if $a = b$, at the initial stage $t = 0$ we have $R(a_i^0, b_i^0) > 0$ exactly for the instances $i \in E \equiv \{i : a_i^0 = b_i^0\}$. The policy will first select $i_0 \in E$ with the largest $R(a_{i_0}^0, b_{i_0}^0)$. After obtaining the label $y_{i_0}$, either $a_{i_0}$ or $b_{i_0}$ increases by one, so $a_{i_0}^1 \neq b_{i_0}^1$ and $R(a_{i_0}^1, b_{i_0}^1) = 0$. The policy then selects another instance $i_1 \in E$ with the current largest expected reward, and the expected reward for $i_1$ after obtaining the label $y_{i_1}$ will in turn become zero. As a consequence, the KG policy labels each instance in $E$ once during the first $|E|$ stages, and $R(a_i^{|E|}, b_i^{|E|}) = 0$ for all $i \in \{1,\ldots,K\}$. The deterministic policy then breaks the tie by selecting the first instance to label. From then on, for any $t \geq |E|$: if $a_1^t \neq b_1^t$, then $R(a_1^t, b_1^t) = 0$; since the expected rewards of all other instances are also zero, the policy will still label the first instance. On the other hand, if $a_1^t = b_1^t$, the first instance is the only one with a positive expected reward and the policy will label it. Thus Proposition 3 is proved.

Remark. For randomized KG, after getting one label for each instance in $E$ during the first $|E|$ stages, the expected reward of every instance becomes zero, and randomized KG will uniformly select one instance to label. At any stage $t \geq |E|$, if there exists an instance $i$ with $a_i^t = b_i^t$ (there is at most one such instance), the KG policy will provide the next label for $i$; otherwise, it will randomly select an instance to label.
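As a sanity check on Lemma 3, the sketch below (illustrative only; it assumes SciPy) evaluates the stage-wise reward $R(a,b)$ for small integer pairs and confirms that it vanishes exactly when $a \neq b$ and matches $\frac{1}{a\,2^{2a}\,B(a,a)}$ when $a = b$.

```python
# Small numerical check (not from the paper) of Lemma 3.
from scipy.special import betainc, beta as B

def I(a, b):                      # Pr(theta >= 1/2), theta ~ Beta(a, b)
    return 1.0 - betainc(a, b, 0.5)

def h(x):                         # h(x) = max(x, 1 - x)
    return max(x, 1.0 - x)

def R(a, b):                      # expected one-step reward of one more label
    return a / (a + b) * h(I(a + 1, b)) + b / (a + b) * h(I(a, b + 1)) - h(I(a, b))

for a in range(1, 6):
    for b in range(1, 6):
        r = R(a, b)
        if a == b:
            assert abs(r - 1.0 / (a * 2 ** (2 * a) * B(a, a))) < 1e-12
        else:
            assert abs(r) < 1e-12
```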

4 Proof of Theorem 3

To prove the consistency of the optimistic KG policy, we first derive the exact values of $R^+(a,b) = \max\big(R_1(a,b), R_2(a,b)\big)$, where $R_1(a,b) = h\big(I(a+1,b)\big) - h\big(I(a,b)\big)$ and $R_2(a,b) = h\big(I(a,b+1)\big) - h\big(I(a,b)\big)$ are the rewards of receiving the labels $1$ and $-1$, respectively.

1. When $a \geq b+1$: $R_1(a,b) = I(a+1,b) - I(a,b) = \frac{1}{2^{a+b}\,a\,B(a,b)} > 0$ and $R_2(a,b) = I(a,b+1) - I(a,b) = -\frac{1}{2^{a+b}\,b\,B(a,b)} < 0$. Therefore, $R^+(a,b) = R_1(a,b) = \frac{1}{2^{a+b}\,a\,B(a,b)} > 0$.

2. When $a = b$: $R_1(a,a) = I(a+1,a) - I(a,a) = \frac{1}{a\,2^{2a}\,B(a,a)}$ and $R_2(a,a) = \big(1 - I(a,a+1)\big) - I(a,a) = \frac{1}{a\,2^{2a}\,B(a,a)}$. Therefore, $R_1 = R_2$ and $R^+(a,a) = R_1(a,a) = R_2(a,a) = \frac{1}{a\,2^{2a}\,B(a,a)} > 0$.

3. When $b \geq a+1$: $R_1(a,b) = I(a,b) - I(a+1,b) = -\frac{1}{2^{a+b}\,a\,B(a,b)} < 0$ and $R_2(a,b) = I(a,b) - I(a,b+1) = \frac{1}{2^{a+b}\,b\,B(a,b)} > 0$. Therefore, $R^+(a,b) = R_2(a,b) = \frac{1}{2^{a+b}\,b\,B(a,b)} > 0$.

For better visualization, we plot the values of $R^+(a,b)$ for different $(a,b)$ in Figure 1.

[Figure 1: Illustration of $R^+(a,b)$.]

As we can see, $R^+(a,b) > 0$ for any positive integers $a, b$. We now prove that $\lim_{a+b\to\infty} R^+(a,b) = 0$ via the following lemma.

Lemma 4. Properties of $R^+(a,b)$:
1. $R^+(a,b)$ is symmetric, i.e., $R^+(a,b) = R^+(b,a)$.
2. $\lim_{a\to\infty} R^+(a,a) = 0$.
3. For any fixed $a$, $R^+(a+k, a-k) = R^+(a-k, a+k)$ is monotonically decreasing in $k$ for $k = 0, 1, \ldots, a-1$.
4. When $a \geq b$, for any fixed $b$, $R^+(a,b)$ is monotonically decreasing in $a$. By the symmetry of $R^+(a,b)$, when $b \geq a$, for any fixed $a$, $R^+(a,b)$ is monotonically decreasing in $b$.

By the above four properties, we have $\lim_{a+b\to\infty} R^+(a,b) = 0$.

Proof. We first prove the four properties.

Property 1: By the fact that $B(a,b) = B(b,a)$, the symmetry of $R^+(a,b)$ is straightforward.

Property 2: For $a \geq 1$, $\frac{R^+(a+1,a+1)}{R^+(a,a)} = \frac{2a+1}{2a+2} < 1$ and hence $R^+(a,a)$ is monotonically decreasing in $a$. Moreover,
$$R^+(a,a) = R^+(1,1)\prod_{i=1}^{a-1}\frac{R^+(i+1,i+1)}{R^+(i,i)} = R^+(1,1)\prod_{i=1}^{a-1}\Big(1 - \frac{1}{2i+2}\Big) \leq R^+(1,1)\,e^{-\sum_{i=1}^{a-1}\frac{1}{2i+2}}.$$
Since $\lim_{a\to\infty}\sum_{i=1}^{a-1}\frac{1}{2i+2} = \infty$ and $R^+(a,a) \geq 0$, we conclude that $\lim_{a\to\infty} R^+(a,a) = 0$.

Property 3: For any $k \geq 0$,
$$\frac{R^+(a+k+1, a-k-1)}{R^+(a+k, a-k)} = \frac{(a+k)\,B(a+k, a-k)}{(a+k+1)\,B(a+k+1, a-k-1)} = \frac{a-k-1}{a+k+1} < 1.$$

Property 4: When $a \geq b$, for any fixed $b$:
$$\frac{R^+(a+1,b)}{R^+(a,b)} = \frac{a\,B(a,b)}{2(a+1)\,B(a+1,b)} = \frac{a+b}{2(a+1)} < 1.$$

According to the third property, when $a+b$ is an even number we have $R^+(a,b) \leq R^+\big(\tfrac{a+b}{2}, \tfrac{a+b}{2}\big)$. According to the fourth property, when $a+b$ is an odd number and $a \geq b+1$ we have $R^+(a,b) \leq R^+(a-1,b) \leq R^+\big(\tfrac{a+b-1}{2}, \tfrac{a+b-1}{2}\big)$; while when $a+b$ is an odd number and $a \leq b-1$ we have $R^+(a,b) \leq R^+(a,b-1) \leq R^+\big(\tfrac{a+b-1}{2}, \tfrac{a+b-1}{2}\big)$. Therefore, $R^+(a,b) \leq R^+\big(\lfloor\tfrac{a+b}{2}\rfloor, \lfloor\tfrac{a+b}{2}\rfloor\big)$. According to the second property, $\lim_{a\to\infty} R^+(a,a) = 0$, and we conclude that $\lim_{a+b\to\infty} R^+(a,b) = 0$.
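Lemma 4's properties can likewise be probed numerically. The sketch below (illustrative, assuming SciPy) evaluates $R^+(a,b)$ on a small grid, checks the symmetry of Property 1, and prints the decaying sequence $R^+(a,a)$ from Property 2.

```python
# Illustrative check (not from the paper) of Lemma 4: R+(a, b) is symmetric
# and R+(a, a) decays to zero, so R+(a, b) -> 0 as a + b grows.
from scipy.special import betainc

def I(a, b):
    return 1.0 - betainc(a, b, 0.5)      # Pr(theta >= 1/2), theta ~ Beta(a, b)

def h(x):
    return max(x, 1.0 - x)

def R_plus(a, b):
    R1 = h(I(a + 1, b)) - h(I(a, b))     # reward if the next label is +1
    R2 = h(I(a, b + 1)) - h(I(a, b))     # reward if the next label is -1
    return max(R1, R2)

for a in range(1, 8):
    for b in range(1, 8):
        assert abs(R_plus(a, b) - R_plus(b, a)) < 1e-12          # Property 1
assert all(R_plus(a, a) > R_plus(a + 1, a + 1) for a in range(1, 20))  # Property 2
print([round(R_plus(a, a), 4) for a in (1, 5, 10, 50)])
```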

Using Lemma 4, we first show that, on any sample path, optimistic KG will label each instance infinitely many times as $T$ goes to infinity. Let $\eta_i(T)$ be the random variable representing the number of times that the $i$-th instance has been labeled up to stage $T$ under optimistic KG. Given a sample path $\omega$, let $I(\omega) = \{i : \lim_{T\to\infty}\eta_i(T)(\omega) < \infty\}$ be the set of instances that are labeled only a finite number of times as $T$ goes to infinity on this sample path. We need to prove that $I(\omega)$ is an empty set for any $\omega$, and we prove it by contradiction. Assume that $I(\omega)$ is not empty; then after a certain stage $T_1$, instances in $I(\omega)$ will never be labeled. By Lemma 4, for any $j \in I(\omega)^c$, $\lim_{T\to\infty} R^+\big(a_j^T(\omega), b_j^T(\omega)\big) = 0$. Therefore, there exists $T_2 > T_1$ such that:
$$\max_{j \in I(\omega)^c} R^+\big(a_j^{T_2}(\omega), b_j^{T_2}(\omega)\big) < \max_{i \in I(\omega)} R^+\big(a_i^{T_2}(\omega), b_i^{T_2}(\omega)\big) = \max_{i \in I(\omega)} R^+\big(a_i^{T_1}(\omega), b_i^{T_1}(\omega)\big).$$
Then, according to the optimistic KG policy, the next instance to be labeled must be in $I(\omega)$, which leads to a contradiction. Therefore, $I(\omega)$ is an empty set for every sample path $\omega$.

Let $Y_i^s$ be the random variable that takes the value $1$ if the $s$-th label of the $i$-th instance is $1$ and the value $0$ if the $s$-th label is $-1$. It is easy to see that $\mathbb{E}(Y_i^s \mid \theta_i) = \Pr(Y_i^s = 1 \mid \theta_i) = \theta_i$, and, conditioned on $\theta_i$, the $Y_i^s$, $s = 1, 2, \ldots$, are i.i.d. random variables. By the fact that $\lim_{T\to\infty}\eta_i(T) = \infty$ on all sample paths and using the strong law of large numbers, we conclude that, conditioning on $\{\theta_i\}_{i=1}^K$, the conditional probability of the event
$$\lim_{T\to\infty}\frac{a_i^T - b_i^T}{\eta_i(T)} = \lim_{T\to\infty}\frac{1}{\eta_i(T)}\sum_{s=1}^{\eta_i(T)}\big(2Y_i^s - 1\big) = 2\,\mathbb{E}(Y_i^s\mid\theta_i) - 1 = 2\theta_i - 1 \quad \text{for all } i = 1,\ldots,K$$
is one. According to Proposition 1, we have $H_T = \{i : a_i^T \geq b_i^T\}$ and $H^* = \{i : \theta_i \geq 1/2\}$. The accuracy is
$$\mathrm{Acc}(T) = \frac{1}{K}\Big(|H_T \cap H^*| + |H_T^c \cap (H^*)^c|\Big).$$
We have:
$$\Pr\Big(\lim_{T\to\infty}\mathrm{Acc}(T) = 1 \,\Big|\, \{\theta_i\}_{i=1}^K\Big) = \Pr\Big(\lim_{T\to\infty}\tfrac{1}{K}\big(|H_T\cap H^*| + |H_T^c\cap (H^*)^c|\big) = 1 \,\Big|\, \{\theta_i\}_{i=1}^K\Big) \geq \Pr\Big(\lim_{T\to\infty}\tfrac{a_i^T - b_i^T}{\eta_i(T)} = 2\theta_i - 1,\ i = 1,\ldots,K \,\Big|\, \{\theta_i\}_{i=1}^K\Big) = 1,$$
whenever $\theta_i \neq 1/2$ for all $i$. The inequality is due to the fact that, as long as no $\theta_i$ equals $1/2$, any sample path on which $\lim_{T\to\infty}\frac{a_i^T - b_i^T}{\eta_i(T)} = 2\theta_i - 1$ for $i = 1,\ldots,K$ also satisfies $\lim_{T\to\infty}\mathrm{sgn}\big(a_i^T - b_i^T\big) = \mathrm{sgn}\big(\theta_i - 1/2\big)$, which further implies $\lim_{T\to\infty}\big(|H_T\cap H^*| + |H_T^c\cap (H^*)^c|\big) = K$. Finally, we have:
$$\Pr\Big(\lim_{T\to\infty}\mathrm{Acc}(T) = 1\Big) = \mathbb{E}_{\{\theta_i\}_{i=1}^K}\Big[\Pr\Big(\lim_{T\to\infty}\mathrm{Acc}(T) = 1 \,\Big|\, \{\theta_i\}_{i=1}^K\Big)\Big] = \mathbb{E}_{\{\theta_i:\,\theta_i\neq 1/2\}_{i=1}^K}\Big[\Pr\Big(\lim_{T\to\infty}\mathrm{Acc}(T) = 1 \,\Big|\, \{\theta_i\}_{i=1}^K\Big)\Big] = \mathbb{E}_{\{\theta_i:\,\theta_i\neq 1/2\}_{i=1}^K}\big[1\big] = 1,$$
where the second equality holds because $\{\theta_i : \theta_i = 1/2 \text{ for some } i\}$ is a zero-measure set.
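To make the consistency statement concrete, here is a small simulation sketch (illustrative only; the prior, budget and random seed are arbitrary choices of this illustration, and NumPy/SciPy are assumed): it runs plain optimistic KG on simulated binary labels and reports the accuracy of $H_T$.

```python
# Illustrative simulation (not from the paper): run optimistic KG on K binary
# instances with true soft-labels drawn uniformly, and check the accuracy of
# H_T = {i : a_i >= b_i} after T labels.
import numpy as np
from scipy.special import betainc

rng = np.random.default_rng(0)

def I(a, b):
    return 1.0 - betainc(a, b, 0.5)

def R_plus(a, b):
    h = lambda x: max(x, 1.0 - x)
    return max(h(I(a + 1, b)) - h(I(a, b)), h(I(a, b + 1)) - h(I(a, b)))

K, T = 20, 2000
theta = rng.uniform(size=K)                   # true P(label = 1) per instance
a = np.ones(K); b = np.ones(K)                # Beta(1, 1) priors
for t in range(T):
    i = int(np.argmax([R_plus(a[j], b[j]) for j in range(K)]))  # opt-KG choice
    if rng.random() < theta[i]:
        a[i] += 1
    else:
        b[i] += 1
acc = np.mean((a >= b) == (theta >= 0.5))
print(f"accuracy after {T} labels: {acc:.2f}")
```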

5 Incorporate Workers' Reliability

We assume there are $K$ instances with soft-labels $\theta_i \sim \mathrm{Beta}(a_i^0, b_i^0)$ and $M$ workers with reliabilities $\rho_j \sim \mathrm{Beta}(c_j^0, d_j^0)$. Given the decision to have the $i$-th instance labeled by the $j$-th worker, the probability of the outcome $Z_{ij}$ is:
$$\Pr(Z_{ij} = 1 \mid \theta_i, \rho_j) = \theta_i\rho_j + (1-\theta_i)(1-\rho_j), \qquad \Pr(Z_{ij} = -1 \mid \theta_i, \rho_j) = \theta_i(1-\rho_j) + (1-\theta_i)\rho_j.$$
We approximate the posterior so that at any stage, for all $i, j$, $\theta_i$ and $\rho_j$ follow Beta distributions. In particular, assuming that at the current state $\theta_i \sim \mathrm{Beta}(a_i, b_i)$ and $\rho_j \sim \mathrm{Beta}(c_j, d_j)$, the posterior distribution conditioned on $Z_{ij} = z$ takes the following form:
$$p(\theta_i, \rho_j \mid Z_{ij} = z) = \frac{\Pr(Z_{ij}=z\mid\theta_i,\rho_j)\,\mathrm{Beta}(\theta_i; a_i,b_i)\,\mathrm{Beta}(\rho_j; c_j,d_j)}{\Pr(Z_{ij}=z)}, \qquad z \in \{1, -1\},$$
where the likelihood $\Pr(Z_{ij} = z \mid \theta_i, \rho_j)$ is defined above and
$$\Pr(Z_{ij} = 1) = \mathbb{E}\big(\Pr(Z_{ij}=1\mid\theta_i,\rho_j)\big) = \mathbb{E}(\theta_i)\mathbb{E}(\rho_j) + \big(1-\mathbb{E}(\theta_i)\big)\big(1-\mathbb{E}(\rho_j)\big) = \frac{a_i}{a_i+b_i}\,\frac{c_j}{c_j+d_j} + \frac{b_i}{a_i+b_i}\,\frac{d_j}{c_j+d_j},$$
$$\Pr(Z_{ij} = -1) = \mathbb{E}\big(\Pr(Z_{ij}=-1\mid\theta_i,\rho_j)\big) = \mathbb{E}(\theta_i)\big(1-\mathbb{E}(\rho_j)\big) + \big(1-\mathbb{E}(\theta_i)\big)\mathbb{E}(\rho_j) = \frac{a_i}{a_i+b_i}\,\frac{d_j}{c_j+d_j} + \frac{b_i}{a_i+b_i}\,\frac{c_j}{c_j+d_j}.$$

The posterior distribution $p(\theta_i, \rho_j \mid Z_{ij} = z)$ no longer takes the form of a product of Beta distributions of $\theta_i$ and $\rho_j$. Therefore, we use a variational approximation by first assuming the conditional independence of $\theta_i$ and $\rho_j$:
$$p(\theta_i, \rho_j \mid Z_{ij} = z) \approx p(\theta_i \mid Z_{ij} = z)\, p(\rho_j \mid Z_{ij} = z).$$
In particular, the marginal distributions have the exact form:
$$p(\theta_i \mid Z_{ij} = 1) = \frac{\big(\theta_i\,\mathbb{E}(\rho_j) + (1-\theta_i)(1-\mathbb{E}(\rho_j))\big)\,\mathrm{Beta}(\theta_i; a_i,b_i)}{\Pr(Z_{ij}=1)}, \qquad
p(\theta_i \mid Z_{ij} = -1) = \frac{\big(\theta_i(1-\mathbb{E}(\rho_j)) + (1-\theta_i)\,\mathbb{E}(\rho_j)\big)\,\mathrm{Beta}(\theta_i; a_i,b_i)}{\Pr(Z_{ij}=-1)},$$
$$p(\rho_j \mid Z_{ij} = 1) = \frac{\big(\mathbb{E}(\theta_i)\,\rho_j + (1-\mathbb{E}(\theta_i))(1-\rho_j)\big)\,\mathrm{Beta}(\rho_j; c_j,d_j)}{\Pr(Z_{ij}=1)}, \qquad
p(\rho_j \mid Z_{ij} = -1) = \frac{\big(\mathbb{E}(\theta_i)(1-\rho_j) + (1-\mathbb{E}(\theta_i))\,\rho_j\big)\,\mathrm{Beta}(\rho_j; c_j,d_j)}{\Pr(Z_{ij}=-1)}.$$
To approximate each marginal distribution by a Beta distribution, we use the moment matching technique. In particular, we approximate $\theta_i \mid Z_{ij} = z$ by $\mathrm{Beta}\big(\tilde a_i(z), \tilde b_i(z)\big)$ such that
$$\tilde{\mathbb{E}}_z(\theta_i) \equiv \mathbb{E}_{p(\theta_i\mid Z_{ij}=z)}(\theta_i) = \frac{\tilde a_i(z)}{\tilde a_i(z) + \tilde b_i(z)}, \qquad
\tilde{\mathbb{E}}_z(\theta_i^2) \equiv \mathbb{E}_{p(\theta_i\mid Z_{ij}=z)}(\theta_i^2) = \frac{\tilde a_i(z)\big(\tilde a_i(z)+1\big)}{\big(\tilde a_i(z)+\tilde b_i(z)\big)\big(\tilde a_i(z)+\tilde b_i(z)+1\big)},$$
where the right-hand sides are the first and second moments of $\mathrm{Beta}\big(\tilde a_i(z), \tilde b_i(z)\big)$. Solving these two equations gives:
$$\tilde a_i(z) = \frac{\tilde{\mathbb{E}}_z(\theta_i)\big(\tilde{\mathbb{E}}_z(\theta_i) - \tilde{\mathbb{E}}_z(\theta_i^2)\big)}{\tilde{\mathbb{E}}_z(\theta_i^2) - \big(\tilde{\mathbb{E}}_z(\theta_i)\big)^2}, \qquad
\tilde b_i(z) = \frac{\big(1 - \tilde{\mathbb{E}}_z(\theta_i)\big)\big(\tilde{\mathbb{E}}_z(\theta_i) - \tilde{\mathbb{E}}_z(\theta_i^2)\big)}{\tilde{\mathbb{E}}_z(\theta_i^2) - \big(\tilde{\mathbb{E}}_z(\theta_i)\big)^2}.$$
Similarly, we approximate $\rho_j \mid Z_{ij} = z$ by $\mathrm{Beta}\big(\tilde c_j(z), \tilde d_j(z)\big)$ with
$$\tilde c_j(z) = \frac{\tilde{\mathbb{E}}_z(\rho_j)\big(\tilde{\mathbb{E}}_z(\rho_j) - \tilde{\mathbb{E}}_z(\rho_j^2)\big)}{\tilde{\mathbb{E}}_z(\rho_j^2) - \big(\tilde{\mathbb{E}}_z(\rho_j)\big)^2}, \qquad
\tilde d_j(z) = \frac{\big(1-\tilde{\mathbb{E}}_z(\rho_j)\big)\big(\tilde{\mathbb{E}}_z(\rho_j) - \tilde{\mathbb{E}}_z(\rho_j^2)\big)}{\tilde{\mathbb{E}}_z(\rho_j^2) - \big(\tilde{\mathbb{E}}_z(\rho_j)\big)^2},$$
where $\tilde{\mathbb{E}}_z(\rho_j)$ and $\tilde{\mathbb{E}}_z(\rho_j^2)$ are the first and second moments of $p(\rho_j\mid Z_{ij}=z)$. Furthermore, these moments can be computed exactly by integrating the marginal densities above (e.g., $\tilde{\mathbb{E}}_1(\theta_i) = \big(\mathbb{E}(\theta_i^2)\,\mathbb{E}(\rho_j) + (\mathbb{E}(\theta_i)-\mathbb{E}(\theta_i^2))(1-\mathbb{E}(\rho_j))\big)/\Pr(Z_{ij}=1)$), which gives in closed form:
$$\tilde{\mathbb{E}}_{1}(\theta_i) = \frac{a_i\big((a_i+1)c_j + b_id_j\big)}{(a_i+b_i+1)\big(a_ic_j + b_id_j\big)}, \qquad
\tilde{\mathbb{E}}_{1}(\theta_i^2) = \frac{a_i(a_i+1)\big((a_i+2)c_j + b_id_j\big)}{(a_i+b_i+1)(a_i+b_i+2)\big(a_ic_j + b_id_j\big)},$$
$$\tilde{\mathbb{E}}_{-1}(\theta_i) = \frac{a_i\big(b_ic_j + (a_i+1)d_j\big)}{(a_i+b_i+1)\big(b_ic_j + a_id_j\big)}, \qquad
\tilde{\mathbb{E}}_{-1}(\theta_i^2) = \frac{a_i(a_i+1)\big(b_ic_j + (a_i+2)d_j\big)}{(a_i+b_i+1)(a_i+b_i+2)\big(b_ic_j + a_id_j\big)},$$
$$\tilde{\mathbb{E}}_{1}(\rho_j) = \frac{c_j\big(a_i(c_j+1) + b_id_j\big)}{(c_j+d_j+1)\big(a_ic_j + b_id_j\big)}, \qquad
\tilde{\mathbb{E}}_{1}(\rho_j^2) = \frac{c_j(c_j+1)\big(a_i(c_j+2) + b_id_j\big)}{(c_j+d_j+1)(c_j+d_j+2)\big(a_ic_j + b_id_j\big)},$$
$$\tilde{\mathbb{E}}_{-1}(\rho_j) = \frac{c_j\big(b_i(c_j+1) + a_id_j\big)}{(c_j+d_j+1)\big(b_ic_j + a_id_j\big)}, \qquad
\tilde{\mathbb{E}}_{-1}(\rho_j^2) = \frac{c_j(c_j+1)\big(b_i(c_j+2) + a_id_j\big)}{(c_j+d_j+1)(c_j+d_j+2)\big(b_ic_j + a_id_j\big)}.$$
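The moment-matching update above is mechanical to implement. The sketch below (a minimal illustration, not the paper's code; the function name is mine, and only the $Z_{ij} = +1$ case is written out using the closed-form moments above) maps $(a_i, b_i, c_j, d_j)$ to the approximate posterior parameters.

```python
# Minimal sketch (not the paper's code) of the moment-matching update after
# observing Z_ij = +1, using the closed-form moments given above.
def update_positive(a, b, c, d):
    """Approximate Beta posteriors for (theta_i, rho_j) after Z_ij = +1."""
    # first and second moments of the exact marginals
    E_th  = a * ((a + 1) * c + b * d) / ((a + b + 1) * (a * c + b * d))
    E_th2 = a * (a + 1) * ((a + 2) * c + b * d) / (
        (a + b + 1) * (a + b + 2) * (a * c + b * d))
    E_rh  = c * (a * (c + 1) + b * d) / ((c + d + 1) * (a * c + b * d))
    E_rh2 = c * (c + 1) * (a * (c + 2) + b * d) / (
        (c + d + 1) * (c + d + 2) * (a * c + b * d))

    def beta_from_moments(m1, m2):
        # solve m1 = p/(p+q), m2 = p(p+1)/((p+q)(p+q+1)) for (p, q)
        p = m1 * (m1 - m2) / (m2 - m1 ** 2)
        q = (1.0 - m1) * (m1 - m2) / (m2 - m1 ** 2)
        return p, q

    a_new, b_new = beta_from_moments(E_th, E_th2)
    c_new, d_new = beta_from_moments(E_rh, E_rh2)
    return a_new, b_new, c_new, d_new

print(update_positive(1.0, 1.0, 4.0, 1.0))   # a fairly reliable worker says +1
```

For a uniform instance prior and a worker with mean reliability 0.8, this pushes the mean of $\theta_i$ from 0.5 to 0.6 while leaving the worker's parameters essentially unchanged, which matches the intuition that an uninformative instance cannot teach us about the worker.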

Assume that at a certain stage, $\theta_i$ for the $i$-th instance has the Beta posterior $\mathrm{Beta}(a_i, b_i)$ and $\rho_j$ for the $j$-th worker has the Beta posterior $\mathrm{Beta}(c_j, d_j)$. The rewards of getting the label $1$ and the label $-1$ for the $i$-th instance from the $j$-th worker are:
$$R_1(a_i, b_i, c_j, d_j) = h\big(I(\tilde a_i(1), \tilde b_i(1))\big) - h\big(I(a_i, b_i)\big), \qquad
R_2(a_i, b_i, c_j, d_j) = h\big(I(\tilde a_i(-1), \tilde b_i(-1))\big) - h\big(I(a_i, b_i)\big),$$
where $\tilde a_i(\pm 1)$ and $\tilde b_i(\pm 1)$ are given by the moment-matching equations above and further depend on $c_j$ and $d_j$. With the reward in place, we present the optimistic knowledge gradient algorithm for budget allocation with the modeling of workers' reliability in Algorithm 1.

Algorithm 1: Optimistic Knowledge Gradient with Workers' Reliability
Input: Parameters of the prior distributions for instances $\{a_i^0, b_i^0\}_{i=1}^K$ and for workers $\{c_j^0, d_j^0\}_{j=1}^M$; the total budget $T$.
for $t = 0, 1, \ldots, T-1$ do
  1. Select the next instance $i_t$ to label and the next worker $j_t$ to perform the labeling according to
     $$(i_t, j_t) = \arg\max_{i\in\{1,\ldots,K\},\ j\in\{1,\ldots,M\}} R^+\big(a_i^t, b_i^t, c_j^t, d_j^t\big),$$
     where $R^+(a_i^t, b_i^t, c_j^t, d_j^t) = \max\big(R_1(a_i^t, b_i^t, c_j^t, d_j^t),\ R_2(a_i^t, b_i^t, c_j^t, d_j^t)\big)$.
  2. Acquire the label $Z_{i_tj_t} \in \{1, -1\}$ of the $i_t$-th instance from the $j_t$-th worker.
  3. Update the posterior by setting
     $$a_{i_t}^{t+1} = \tilde a_{i_t}(Z_{i_tj_t}), \quad b_{i_t}^{t+1} = \tilde b_{i_t}(Z_{i_tj_t}), \quad c_{j_t}^{t+1} = \tilde c_{j_t}(Z_{i_tj_t}), \quad d_{j_t}^{t+1} = \tilde d_{j_t}(Z_{i_tj_t}),$$
     leaving all parameters for $i \neq i_t$ and $j \neq j_t$ unchanged.
end for
Output: The positive set $H_T = \{i : a_i^T \geq b_i^T\}$.

6 Extensions

6.1 Incorporating Feature Information

When each instance is associated with a $p$-dimensional feature vector $x_i \in \mathbb{R}^p$, we incorporate the feature information into our budget allocation problem by assuming:
$$\theta_i = \sigma\big(\langle w, x_i\rangle\big) = \frac{1}{1 + \exp\{-\langle w, x_i\rangle\}},$$
where $\sigma(x) = \frac{1}{1+\exp\{-x\}}$ is the sigmoid function and $w$ is assumed to be drawn from a Gaussian prior $N(\mu_0, \Sigma_0)$. At the $t$-th stage with the state $S^t = (\mu_t, \Sigma_t)$ and $w \sim N(\mu_t, \Sigma_t)$, one decides to label the $i_t$-th instance according to a certain policy (e.g., KG) and observes the label $y_{i_t} \in \{1, -1\}$. The posterior distribution $p(w \mid y_{i_t}, S^t) \propto p(y_{i_t} \mid w)\, p(w \mid S^t)$ has the log-likelihood
$$\ln p(w \mid y_{i_t}, S^t) = \ln p(y_{i_t}\mid w) + \ln p(w\mid S^t) + \mathrm{const} = \frac{1+y_{i_t}}{2}\ln\sigma\big(\langle w, x_{i_t}\rangle\big) + \frac{1-y_{i_t}}{2}\ln\big(1-\sigma(\langle w, x_{i_t}\rangle)\big) - \frac{1}{2}(w-\mu_t)^\top \Omega_t (w-\mu_t) + \mathrm{const},$$
where $\Omega_t = \Sigma_t^{-1}$ is the precision matrix. To approximate $p(w\mid y_{i_t}, \mu_t, \Sigma_t)$ by a Gaussian distribution $N(\mu_{t+1}, \Sigma_{t+1})$, we use the Laplace method (see Chapter 4.4 in Bishop, 2007). In particular, we define the mean of the posterior Gaussian using the MAP (maximum a posteriori) estimator of $w$:
$$\mu_{t+1} = \arg\max_w \ln p(w \mid y_{i_t}, S^t),$$
which can be solved by Newton's method, and the precision matrix:
$$\Omega_{t+1} = -\nabla^2 \ln p(w\mid y_{i_t}, S^t)\big|_{w=\mu_{t+1}} = \Omega_t + \sigma\big(\mu_{t+1}^\top x_{i_t}\big)\big(1-\sigma(\mu_{t+1}^\top x_{i_t})\big)\, x_{i_t} x_{i_t}^\top.$$
By the Sherman-Morrison formula, the covariance matrix is
$$\Sigma_{t+1} = \Omega_{t+1}^{-1} = \Sigma_t - \frac{\sigma\big(\mu_{t+1}^\top x_{i_t}\big)\big(1-\sigma(\mu_{t+1}^\top x_{i_t})\big)}{1 + \sigma\big(\mu_{t+1}^\top x_{i_t}\big)\big(1-\sigma(\mu_{t+1}^\top x_{i_t})\big)\, x_{i_t}^\top \Sigma_t x_{i_t}}\; \Sigma_t x_{i_t} x_{i_t}^\top \Sigma_t.$$
We also calculate the transition probability of $y_{i_t} = 1$ (and analogously of $y_{i_t} = -1$) as follows:
$$\Pr(y_{i_t} = 1 \mid S^t, i_t) = \int p(y_{i_t}=1\mid w)\, p(w\mid S^t)\, dw = \int \sigma\big(w^\top x_{i_t}\big)\, p(w\mid S^t)\, dw \approx \sigma\big(\kappa(s_{i_t})\,\mu_{i_t}\big),$$
where $\kappa(s_i) = \big(1 + \pi s_i/8\big)^{-1/2}$, $\mu_i = \langle \mu_t, x_i\rangle$ and $s_i = x_i^\top \Sigma_t x_i$.
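A compact sketch of this Laplace update (illustrative; it assumes NumPy, a fixed number of Newton iterations, and function names of my own choosing) computes the MAP mean, the rank-one covariance update via Sherman-Morrison, and the approximate transition probability.

```python
# Illustrative Laplace update (not the paper's code) for one observed label
# y in {-1, +1} of instance x under the prior w ~ N(mu, Sigma).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_update(mu, Sigma, x, y, newton_steps=20):
    Omega = np.linalg.inv(Sigma)                  # prior precision
    w = mu.copy()
    for _ in range(newton_steps):                 # Newton's method for the MAP
        p = sigmoid(w @ x)
        grad = 0.5 * (1 + y) * (1 - p) * x - 0.5 * (1 - y) * p * x - Omega @ (w - mu)
        hess = -p * (1 - p) * np.outer(x, x) - Omega
        w = w - np.linalg.solve(hess, grad)
    p = sigmoid(w @ x)
    s = p * (1 - p)
    Sx = Sigma @ x
    Sigma_new = Sigma - (s / (1.0 + s * (x @ Sx))) * np.outer(Sx, Sx)  # Sherman-Morrison
    return w, Sigma_new

def prob_label_one(mu, Sigma, x):
    """Approximate Pr(y = 1) = E[sigmoid(w^T x)] via the probit-style kappa trick."""
    m, s2 = mu @ x, x @ Sigma @ x
    return sigmoid(m / np.sqrt(1.0 + np.pi * s2 / 8.0))

mu, Sigma = np.zeros(3), np.eye(3)
x = np.array([1.0, -0.5, 2.0])
mu, Sigma = laplace_update(mu, Sigma, x, y=+1)
print(prob_label_one(mu, Sigma, x))
```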

To calculate the reward function, in addition to the transition probability, we also need to compute:
$$P_i^t = \Pr\big(\theta_i \geq 1/2 \mid F_t\big) = \Pr\Big(\frac{1}{1+\exp\{-w_t^\top x_i\}} \geq \frac12 \,\Big|\, w_t \sim N(\mu_t, \Sigma_t)\Big) = \Pr\big(w_t^\top x_i \geq 0 \mid w_t\sim N(\mu_t,\Sigma_t)\big) = \int_0^\infty\!\!\int_w \delta\big(c - \langle w, x_i\rangle\big)\, N(w; \mu_t, \Sigma_t)\, dw\, dc,$$
where $\delta(\cdot)$ is the Dirac delta function. Let $p(c) = \int_w \delta(c - \langle w, x_i\rangle)\, N(w; \mu_t,\Sigma_t)\, dw$. Since the marginal of a Gaussian distribution is still Gaussian, $p(c)$ is a univariate Gaussian density with mean and variance
$$\mu_i = \mathbb{E}(c) = \mathbb{E}\big(\langle w, x_i\rangle\big) = \langle \mu_t, x_i\rangle, \qquad s_i = \mathrm{Var}(c) = x_i^\top \mathrm{Cov}(w, w)\, x_i = x_i^\top \Sigma_t x_i.$$
Therefore, we have
$$P_i^t = \int_0^\infty p(c)\, dc = \Phi\!\Big(\frac{\mu_i}{\sqrt{s_i}}\Big),$$
where $\Phi(\cdot)$ is the Gaussian CDF. With $P_i^t$ and the transition probability in place, the expected reward in the value function takes the following form:
$$R(S^t, i_t) = \mathbb{E}\Big(\sum_{i=1}^K h(P_i^{t+1}) - \sum_{i=1}^K h(P_i^t) \,\Big|\, S^t, i_t\Big).$$
We note that, since $w$ affects all of the $P_i$, the summation from $1$ to $K$ cannot be omitted, and hence this reward cannot be written as $\mathbb{E}\big(h(P_{i_t}^{t+1}) - h(P_{i_t}^t)\mid S^t, i_t\big)$ as in the basic setting. In this problem, myopic KG or optimistic KG needs to solve $O(TK)$ optimization problems to compute the mean of the posterior as in the MAP step above, which could be computationally quite expensive. One possibility to address this problem is to use variational Bayesian logistic regression (Jaakkola & Jordan, 2000), which could lead to a faster optimization procedure.

6.2 Multi-Class Setting

In the multi-class setting with $C$ different classes, we assume that the $i$-th instance is associated with a probability vector $\theta_i = (\theta_{i1}, \ldots, \theta_{iC})$, where $\theta_{ic}$ is the probability that the $i$-th instance belongs to class $c$ and $\sum_{c=1}^C \theta_{ic} = 1$. We assume that $\theta_i$ has a Dirichlet prior $\theta_i \sim \mathrm{Dir}(\alpha_i^0)$, and our initial state $S^0$ is a $K\times C$ matrix with $\alpha_i^0$ as its $i$-th row. At each stage $t$ with the current state $S^t$, we determine an instance $i_t$ to label and collect its label $y_{i_t} \in \{1, \ldots, C\}$, which follows the categorical distribution $p(y_{i_t}) = \prod_{c=1}^C \theta_{i_tc}^{\,\mathbb{I}(y_{i_t}=c)}$. Since the Dirichlet distribution is the conjugate prior of the categorical distribution, the next state induced by the posterior distribution is:
$$S^{t+1}_{i_t} = S^t_{i_t} + \delta_{y_{i_t}}, \qquad S^{t+1}_i = S^t_i \ \text{ for all } i \neq i_t.$$
Here $\delta_c$ is a row vector with a one at the $c$-th entry and zeros at all other entries. The transition probability is
$$\Pr\big(y_{i_t} = c \mid S^t, i_t\big) = \mathbb{E}\big(\theta_{i_tc}\mid S^t\big) = \frac{\alpha^t_{i_tc}}{\sum_{c'=1}^C \alpha^t_{i_tc'}}.$$
In the multi-class problem, at the final stage $T$ when all the budget has been used up, we construct the set $H_c^T$ for each class $c$ to maximize the conditional expected classification accuracy:
$$\{H_c^T\}_{c=1}^C = \arg\max_{\{H_c\}_{c=1}^C:\ H_c\cap H_{c'} = \emptyset\ \forall c\neq c'} \mathbb{E}\Big(\sum_{i=1}^K\sum_{c=1}^C \mathbb{I}(i\in H_c)\,\mathbb{I}(i \in H_c^*) \,\Big|\, F_T\Big) = \arg\max_{\{H_c\}_{c=1}^C:\ H_c\cap H_{c'}=\emptyset\ \forall c\neq c'} \sum_{i=1}^K\sum_{c=1}^C \mathbb{I}(i\in H_c)\,\Pr\big(i\in H_c^* \mid F_T\big).$$
Here, $H_c^* = \{i : \theta_{ic} \geq \theta_{ic'},\ \forall c'\neq c\}$ is the true set of instances in class $c$, and $H_c^T$ consists of the instances predicted to belong to class $c$; therefore, $\{H_c^T\}_{c=1}^C$ should form a partition of all instances $\{1,\ldots,K\}$. Let
$$P^T_{ic} = \Pr\big(i\in H_c^*\mid F_T\big) = \Pr\big(\theta_{ic}\geq\theta_{ic'},\ \forall c'\neq c \mid F_T\big).$$
To maximize the right-hand side above, we set
$$H_c^T = \big\{i : P^T_{ic} \geq P^T_{ic'},\ \forall c'\neq c\big\}.$$
If some $i$ belongs to more than one $H_c^T$, we assign it only to the one with the smallest index $c$. The maximum conditional expected accuracy takes the form $\sum_{i=1}^K \max_{c\in\{1,\ldots,C\}} P^T_{ic}$. The value function can then be defined as:
$$V(S^0) = \sup_\pi \mathbb{E}\Big(\sum_{i=1}^K\sum_{c=1}^C \mathbb{I}(i\in H_c^T)\,\mathbb{I}(i\in H_c^*) \,\Big|\, S^0\Big) = \sup_\pi \mathbb{E}\Big(\mathbb{E}\Big(\sum_{i=1}^K\sum_{c=1}^C \mathbb{I}(i\in H_c^T)\,\mathbb{I}(i\in H_c^*)\,\Big|\, F_T\Big)\,\Big|\, S^0\Big) = \sup_\pi \mathbb{E}\Big(\sum_{i=1}^K h(P_i^T)\,\Big|\, S^0\Big),$$
where $P_i^T = (P_{i1}^T, \ldots, P_{iC}^T)$ and $h(P_i^T) = \max_{c\in\{1,\ldots,C\}} P_{ic}^T$.
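The probabilities $P^T_{ic}$ and the sets $H^T_c$ above can also be estimated directly by sampling. The sketch below (illustrative only, assuming NumPy; the parameter values are arbitrary) uses Monte Carlo draws from the posterior Dirichlet to estimate $P_{ic}$ and assign each instance to a class.

```python
# Illustrative Monte Carlo estimate (not from the paper) of P_ic = Pr(i in H_c*)
# under the posterior Dirichlet, and the induced decision H_c^T = argmax_c P_ic.
import numpy as np

rng = np.random.default_rng(0)

def estimate_P(alpha, n_samples=5000):
    """alpha: (K, C) posterior Dirichlet parameters. Returns the (K, C) matrix P_ic."""
    K, C = alpha.shape
    P = np.zeros((K, C))
    for i in range(K):
        draws = rng.dirichlet(alpha[i], size=n_samples)    # samples of theta_i
        winners = draws.argmax(axis=1)                      # class with the largest theta_ic
        P[i] = np.bincount(winners, minlength=C) / n_samples
    return P

alpha = np.array([[3.0, 1.0, 1.0],
                  [1.0, 1.0, 4.0],
                  [2.0, 2.0, 1.0]])
P = estimate_P(alpha)
labels = P.argmax(axis=1)        # instance i is assigned to H_{labels[i]}^T
print(P.round(2), labels)
```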

Following Proposition 1, let $P^t_{ic} = \Pr(i\in H_c^*\mid F_t)$ and $P^t_i = (P^t_{i1}, \ldots, P^t_{iC})$. Defining the intermediate reward function at each stage as
$$R(S^t, i_t) = \mathbb{E}\big(h(P^{t+1}_{i_t}) - h(P^t_{i_t}) \mid S^t, i_t\big),$$
the value function can be re-written as
$$V(S^0) = G_0(S^0) + \sup_\pi \mathbb{E}\Big(\sum_{t=0}^{T-1} R(S^t, i_t)\,\Big|\, S^0\Big),$$
where $G_0(S^0) = \sum_{i=1}^K h(P_i^0)$. Since the reward function only depends on $S^t_{i_t} = \alpha^t_{i_t} \in \mathbb{R}^C_+$, we can define the reward function more explicitly as
$$R(\alpha) = \sum_{c=1}^C \frac{\alpha_c}{\sum_{c'=1}^C \alpha_{c'}}\, h\big(I(\alpha + \delta_c)\big) - h\big(I(\alpha)\big).$$
Here $\delta_c$ is a row vector of length $C$ with a one at the $c$-th entry and zeros at all other entries, and $I(\alpha) = (I_1(\alpha), \ldots, I_C(\alpha))$ with
$$I_c(\alpha) = \Pr\big(\theta_c \geq \theta_{c'},\ \forall c'\neq c \mid \theta\sim\mathrm{Dir}(\alpha)\big).$$
Therefore, we have $R(S^t, i_t) = R(\alpha^t_{i_t})$.

To evaluate the reward $R(\alpha)$, the major bottleneck is computing $I_c(\alpha)$ efficiently. Directly performing the $C$-dimensional integration over the region $\{\theta_c\geq\theta_{c'},\ \forall c'\neq c\}\cap\Delta_C$, where $\Delta_C$ denotes the $C$-dimensional simplex, would be computationally very expensive. Therefore, we propose a method to convert the computation of $I_c(\alpha)$ into a one-dimensional integration. It is known that generating $\theta\sim\mathrm{Dir}(\alpha)$ is equivalent to generating $\{X_c\}_{c=1}^C$ with $X_c\sim\mathrm{Gamma}(\alpha_c, 1)$ and setting $\theta_c = \frac{X_c}{\sum_{c'=1}^C X_{c'}}$; then $\theta = (\theta_1,\ldots,\theta_C)$ follows $\mathrm{Dir}(\alpha)$. Therefore, we have
$$I_c(\alpha) = \Pr\big(X_c \geq X_{c'},\ \forall c'\neq c \mid X_{c'}\sim\mathrm{Gamma}(\alpha_{c'},1)\big).$$
It is easy to see that
$$I_c(\alpha) = \int_0^\infty f_{\mathrm{Gamma}}(x_c;\alpha_c,1)\left(\prod_{c'\neq c}\int_0^{x_c} f_{\mathrm{Gamma}}(x_{c'};\alpha_{c'},1)\,dx_{c'}\right)dx_c = \int_0^\infty f_{\mathrm{Gamma}}(x_c;\alpha_c,1)\prod_{c'\neq c} F_{\mathrm{Gamma}}(x_c;\alpha_{c'},1)\,dx_c,$$
where $f_{\mathrm{Gamma}}(x;\alpha_{c},1)$ is the density function of the Gamma distribution with parameters $(\alpha_{c},1)$ and $F_{\mathrm{Gamma}}(x_c;\alpha_{c'},1)$ is the CDF of the Gamma distribution at $x_c$ with parameters $(\alpha_{c'},1)$. In many software packages, $F_{\mathrm{Gamma}}(x_c;\alpha_{c'},1)$ can be evaluated very efficiently without explicit integration, so $I_c(\alpha)$ can be computed by performing only a one-dimensional numerical integration. We could also use a Monte Carlo approximation to further accelerate this computation.
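The one-dimensional reduction above is straightforward to implement with a standard Gamma CDF. The following sketch (illustrative, assuming SciPy's gamma distribution and quadrature routines) evaluates $I_c(\alpha)$ and checks that the entries sum to one. Evaluating the Gamma CDF inside the integrand keeps the cost of each quadrature node linear in $C$, which is what makes the reward cheap to evaluate compared with a $C$-dimensional integration.

```python
# Illustrative implementation (not the paper's code) of the one-dimensional
# integral for I_c(alpha) = Pr(theta_c >= theta_c' for all c'), theta ~ Dir(alpha).
import numpy as np
from scipy.stats import gamma
from scipy.integrate import quad

def I_c(alpha, c):
    alpha = np.asarray(alpha, dtype=float)
    others = np.delete(alpha, c)
    integrand = lambda x: gamma.pdf(x, alpha[c]) * np.prod(gamma.cdf(x, others))
    val, _ = quad(integrand, 0.0, np.inf)
    return val

alpha = [3.0, 1.0, 2.0]
probs = [I_c(alpha, c) for c in range(len(alpha))]
print([round(p, 4) for p in probs], round(sum(probs), 4))   # entries sum to ~1
```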

References

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2007.

Jaakkola, T. and Jordan, M. I. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25-37, 2000.

Xie, J. and Frazier, P. I. Sequential Bayes-optimal policies for multiple comparisons with a control. Technical report, Cornell University.