Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data


Jonathan Ortigosa-Hernández, Iñaki Inza, and Jose A. Lozano

Contents

1 Appendix A. Mathematical proofs
  1.1 Theorem 1
  1.2 Theorem 2
  1.3 Theorem 3
  1.4 Theorem 4
  1.5 Lemma 1
  1.6 Theorem 5
  1.7 Theorem 6
  1.8 Corollary 1
  1.9 Theorem 7
  1.10 Corollary 2
  1.11 Theorem 8

2 Appendix B. Supplementary data for the empirical study

3 Appendix C. Source code
  3.1 l_k-related functions
  3.2 Functions related to the probability of error
  3.3 Examples of the executions of the package

1 Appendix A. Mathematical proofs

This appendix provides the mathematical proofs of the theoretical results in our paper "Semi-supervised multi-class classification problems with scarcity of labelled data: A theoretical study".

1.1 Theorem 1

Theorem 1 (Probability of error with no labelled data).¹ The probability of error of classifying a new sample (x^(0), c^(0)) of a k-class problem with any classifier learnt with no labelled examples and any number u ≥ 0 of unlabelled samples coincides with the probability of error of the random classifier. In both cases, it remains constant at

    P_e(0, u) = e_0 = (k − 1)/k,  ∀u ≥ 0.   (1)

¹ This holds true independently of the value of the class priors η_1, ..., η_k.

Proof.

- When f(x) is unknown: When there are not enough unlabelled records (u < ∞) to determine the mixture density (and its components), the class c^(0) of the unseen instance must be determined by the uniformly random classifier. In such a case, let A be the event of correctly choosing the right class for c^(0) among the k possible classes; P(A) = 1/k. Then, the probability of committing an error is

    P_e(0, u) = 1 − P(A) = (k − 1)/k,  ∀u < ∞.

- When f(x) is known: With infinite unlabelled examples, the mixture is identifiable up to a permutation π of the components. Let the observation x^(0) be drawn from the j-th decomposed component, f_{π(j)}(·), let i be the unknown true label such that π_c(j) = i, and let us define the following two events:

    B_1 ≡ {η_{π_c(j)} f_{π_c(j)}(x^(0)) − max_{j′≠j} η_{π_c(j′)} f_{π_c(j′)}(x^(0)) > 0}
    B_2 ≡ {η_{π_c(j)} f_{π_c(j)}(x^(0)) − max_{j′≠j} η_{π_c(j′)} f_{π_c(j′)}(x^(0)) < 0}

B_1 is the event which represents achieving a correct answer in the application of the Bayes decision rule over x^(0), and B_2 represents the opposite. By definition, P(B_1) = (1 − e_B) and P(B_2) = e_B. Then, the probability of error is

    P_e(0, ∞) = P(ĉ^(0) ≠ c^(0)) = 1 − P(ĉ^(0) = c^(0))
              = 1 − P(ĉ^(0) = c^(0) | B_1) P(B_1) − P(ĉ^(0) = c^(0) | B_2) P(B_2)
              = 1 − P(π_a)(1 − e_B) − P(π_b) e_B,

where π_a and π_b are two permutations such that π_a(j) = i and π_b(j′) = i. As no labelled data are provided to determine the correspondence π, it has to be chosen at random. Then, the probability of choosing each of those permutations is P(π_a) = P(π_b) = (k − 1)!/k! = 1/k. After substituting these probabilities in the previous formula and after some algebra, we obtain the same expression: P_e(0, ∞) = (k − 1)/k.

1.2 Theorem 2

Theorem 2 (Optimality of PC_SSL). PC_SSL is an optimal learning procedure for Stage 2 of Algorithm 2.

Proof. The proof of this theorem can be found in the manuscript.

1.3 Theorem 3

Theorem 3 (Minimum number of labelled examples). Let the family of mixtures F be linearly independent. Let k be both the number of classes and the number of mixture components. Let the class priors be equiprobable, i.e. ∀i, η_i = η = 1/k. Then, the expected minimum number of labelled instances needed to uniquely determine the class labels of the components of a mixture is

    l_k = 1,                                                     if k = 2,
    l_k = k²(k − 1) Σ_{j=2}^{k} [(−1)^j / j²] · C(k − 2, j − 2),   if k > 2.   (2)

Proof. The proof for k = 2 can be found in [1]. For higher values of k, let L_k be a random variable representing the minimum number of instances needed to obtain examples of (k − 1) different classes. Assuming equiprobability, P(L_k = l), for l ≥ (k − 1), is a fraction whose numerator is the number of favourable cases and whose denominator is the number of all possible cases:

    P(L_k = l) = P_k · S_2(l − 1, k − 2) / PR_{k,l},

where P_k is the number of permutations of k elements, S_2(·,·) is the Stirling number of the second kind, and PR_{k,l} is the number of l-permutations of k elements with repetition (formulae in [2]). Then, we calculate the expectation of the random variable L_k as

    l_k = E[L_k] = Σ_{l=k−1}^{∞} l · P(L_k = l).

After some algebra, we reach equation (2) for k > 2.
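As a quick numerical check of equation (2), the sketch below (written in Python; it is an illustration only, not part of the Mathematica package described in Appendix C) evaluates the closed form and a direct simulation of L_k; for k = 4, 5, 6 both agree with the values l_4 = 4.33, l_5 = 6.42 and l_6 = 8.7 quoted in Appendix B.

```python
# Sanity check of equation (2): closed form vs. a direct simulation of L_k
# (draws until (k-1) distinct classes are observed, equiprobable priors).
import math
import random

def l_k_closed_form(k: int) -> float:
    """Equation (2): l_k for equiprobable priors (l_2 = 1)."""
    if k == 2:
        return 1.0
    return k * k * (k - 1) * sum((-1) ** j / j ** 2 * math.comb(k - 2, j - 2)
                                 for j in range(2, k + 1))

def l_k_simulated(k: int, n_rep: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[L_k]."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rep):
        seen, draws = set(), 0
        while len(seen) < k - 1:
            seen.add(rng.randrange(k))
            draws += 1
        total += draws
    return total / n_rep

for k in range(3, 7):
    print(k, round(l_k_closed_form(k), 2), round(l_k_simulated(k), 2))
# Expected (up to Monte Carlo noise): 2.5, 4.33, 6.42, 8.7
```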

1.4 Theorem 4

Theorem 4 (Minimum labelled examples for ternary problems). Let f(x) be identifiable and let k = 3 be the number of classes, with priors η_i > 0 and Σ_{i=1}^{3} η_i = 1. Then, the expected minimum number of labelled instances needed to uniquely determine the class labels of the components of the mixture is

    l_3 = Σ_{i=1}^{3} η_i (2 − η_i) / (1 − η_i).   (3)

Proof. Let L_3 be a random variable representing the minimum number of instances needed to obtain two different labels. Assuming each class c_i has a prior η_i and Σ_{i=1}^{3} η_i = 1, P(L_3 = l) has the following form:

    P(L_3 = l) = η_1^(l−1) (1 − η_1) + η_2^(l−1) (1 − η_2) + η_3^(l−1) (1 − η_3),

which stands for the probability of having (l − 1) labelled examples of one class followed by one example of one of the two other classes. Then, we calculate the expectation of L_3 over l ≥ 2. After some algebra, we reach equation (3).

1.5 Lemma 1

Lemma 1. Let L be a labelled subset distributed according to a generative model f(x, c). Let the marginal of f(x, c) on x be an identifiable mixture density f(x) = Σ_{i=1}^{k} η_i f_i(x) which represents the distribution of the infinite unlabelled records. Let π be the correspondence returned by the learning procedure Π(·) (Stage 2 of Algorithm 2), and let R_{π^{-1}(i)} be defined as

    R_{π^{-1}(i)} = {x : η_{π^{-1}(i)} f_{π^{-1}(i)}(x) − max_{i′≠i} η_{π^{-1}(i′)} f_{π^{-1}(i′)}(x) > 0}.

Then, the probability of committing an error in classifying an unseen instance with the BDR after assuming the correspondence π is

    P(e | π) = 1 − Σ_{i=1}^{k} η_i ∫_{R_{π^{-1}(i)}} f_i(x) dx.   (4)

Proof. According to Algorithm 2, a correct prediction occurs depending on whether the optimal classification rule, which assumes a learnt correspondence π, over the unseen instance x, hits the real class value c_i. If the region where f_i(x) is maximum is R_j (it holds that i = π_c(j)), there are only two cases in which the real class is correctly predicted:

    Case 1: if x ∈ R_j and π(j) = i
    Case 2: if x ∈ R_{j′}, j′ ≠ j, and π(j′) = i

As can be seen, in both cases, the region to which x belongs can be denoted π^{-1}(i), which is equal to j in Case 1 and to j′ in Case 2. Therefore, the probability of correctly classifying x is just the probability of x being in the region labelled as i, i.e. R_{π^{-1}(i)}, and this formula can be easily generalised to all the possible values of i in the range {1, ..., k} as

    P(ē | π) = Σ_{i=1}^{k} η_i ∫_{R_{π^{-1}(i)}} f_i(x) dx.

After that, we reach formula (4) by taking into account that the probability of error is the complement of the probability of a correct classification, P(e | π) = 1 − P(ē | π).

1.6 Theorem 5

Theorem 5 (Probability of error). The probability of error of classifying an unseen instance in the multi-class scenario, given l labelled records and infinite unlabelled records, is

    P_e(l, ∞) = Σ_{π ∈ S_k} P(Π_l = π) P(e | π),   (5)

where P(Π_l = π) denotes the probability of choosing, using the learning procedure Π(·), the permutation π given l labelled data (Stage 2 of Algorithm 2), and P(e | π) is the probability of misclassification with the BDR after assuming the correspondence π (Stage 3 of Algorithm 2).² Finally, S_k represents the set of all possible permutations of size k, representing all the correspondences between labels and components.

Proof. The mathematical proof of this theorem is in the manuscript.

1.7 Theorem 6

Theorem 6 (Probability of error with zero Bayes error). Let the mixture density f(x) be an identifiable mixture. Let k be both the number of classes and the number of mixture components. Let the class priors be equiprobable, i.e. ∀i, η_i = η = 1/k. Then, the probability of error P_e(l, ∞), given l > 0 labelled records and infinite unlabelled records, when the components are mutually disjoint, is given by

    P_e(l, ∞) = Σ_{z=1}^{min(k−2, l)} [P_{k,z} S_2(l, z) / PR_{k,l}] · (1 − (z + 1)/k),   (6)

where P_{k,z} and PR_{k,l} are the number of z-permutations without repetition and of l-permutations with repetition of k elements, respectively, and S_2(l, z) is the Stirling number of the second kind [2].

² Note that eq. (5) fits with the one proposed for binary problems in [1] (p. 107, eq. (7)): P(Π_l = π_c) P(e | π_c) + P(Π_l = π̄_c) P(e | π̄_c).

Proof. To determine the probability of error, we just need to calculate the probability of finding z ≤ min(k, l) different labels in l. When z ≥ (k − 1) (see the definition of l_k), we reach the real model, so the probability of error is e_B, which is zero. Therefore, we can define the probability as follows:

    P_e(l, ∞) = Σ_{z=1}^{min(k−2, l)} P(Ψ_l = z) P(e | z),   (7)

where Ψ_l is a random variable representing the number of different labels z in l, and P(e | z) is the probability of committing a classification error knowing the real correspondence of z labels with their components.

Regarding P(Ψ_l = z), as all the priors are equiprobable, it is a fraction whose numerator is the number of ways of selecting just z classes out of k multiplied by the number of ways of ordering them, and whose denominator is the number of all possible cases:

    P(Ψ_l = z) = P_{k,z} S_2(l, z) / PR_{k,l}.

Then, P(e | z) can be decomposed into the sum of the probability of misclassifying when the label of the unseen instance appears in the labelled subset and the probability of misclassifying it when we do not have that label to identify a mixture component:

    P(e | z) = Pr{i ∈ z} Pr{e | i ∈ z} + Pr{i ∉ z} Pr{e | i ∉ z},

where {i ∈ z} corresponds to the event that the unseen instance has the same label as one of the labels that appears in the labelled subset, {i ∉ z} is the opposite, and Pr{e | ·} is the probability of misclassifying an instance of class i in each case. By solving the previous formula, we obtain that P(e | z) is equal to

    (z/k) · 0 + ((k − z)/k) · [(k − z)! − (k − z − 1)!] / (k − z)! = 1 − (z + 1)/k.   (8)

Finally, by substituting the calculations of P(Ψ_l = z) and P(e | z) in equation (7), we reach formula (6).

1.8 Corollary 1

Corollary 1. When the components are mutually disjoint, P_e(l, ∞) converges to 0 exponentially fast, in the sense that it is

    o( ((k − 1)/k)^l ).   (9)

Proof. It is trivial to calculate the convergence order from the previous theorem.
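The closed form (6) can be checked numerically. The sketch below (Python, an illustration only and not the authors' package) compares it with a simulation that draws the l labels, counts the distinct classes z and applies the conditional error P(e | z) derived in the proof; the two estimates agree.

```python
# Numeric check of formula (6): the closed form vs. a simulation that draws l labels,
# counts the distinct classes z and applies the conditional error P(e | z) = 1 - (z+1)/k.
import math
import random

def stirling2(n: int, m: int) -> int:
    """Stirling number of the second kind, S_2(n, m), by the standard recurrence."""
    if m == 0:
        return 1 if n == 0 else 0
    if m > n:
        return 0
    table = [[0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][m]

def pe_closed_form(l: int, k: int) -> float:
    """Formula (6): P_e(l, inf) with zero Bayes error and equiprobable priors."""
    total = 0.0
    for z in range(1, min(k - 2, l) + 1):
        p_kz = math.factorial(k) // math.factorial(k - z)   # z-permutations of k
        p_psi = p_kz * stirling2(l, z) / k ** l              # P(Psi_l = z)
        total += p_psi * (1 - (z + 1) / k)                   # P(e | z)
    return total

def pe_simulated(l: int, k: int, n_rep: int = 200_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_rep):
        z = len({rng.randrange(k) for _ in range(l)})  # distinct labels observed
        if z < k - 1:                                  # otherwise the matching is exact
            acc += (k - z - 1) / k
    return acc / n_rep

for l in range(1, 6):
    print(l, round(pe_closed_form(l, 4), 4), round(pe_simulated(l, 4), 4))
```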

1.9 Theorem 7

Theorem 7 (Upper bound of the error). When the components are not mutually disjoint and all the priors are equiprobable, the optimal probability of error of a model composed of k different identifiable mixture components is upper bounded by

    P_e(l, ∞) − e_B ≤ 2 exp{−l λ² / 2},   (10)

where λ ∈ (0, 1] depends on the degree of intersection among the components of the mixture distribution and is defined by

    λ = (1/k) min_j { ∫_{R_j} f_i(x) dx − max_{z≠i} ∫_{R_j} f_z(x) dx },  with i = π_c(j).   (11)

Proof. Let the Voting learning procedure [3] be defined as the function Σ : L → S_k, where L is the labelled set and π̂_v ∈ S_k is the permutation of the components returned by Voting. In [3], the excess of risk of the Voting procedure, ε(Σ), is defined as

    E_L[ Σ_{j=1}^{k} ∫_{R_j} ( η_{π_c(j)} f_{π_c(j)}(x) − η_{π̂_v(j)} f_{π̂_v(j)}(x) ) dx ],   (12)

where E_L[·] is the expectation with respect to the labelled sample, and π_c(j) and π̂_v(j) correspond to the true class and the Voting bet for the j-th component, respectively. In that paper, they prove that ε(Σ) ≤ 2 exp{−l λ² / 2}. Then, we just need to prove that P_e(l, ∞) − e_B ≤ ε(Σ_l). By means of the linearity property of both the integral and the expectation over equation (12), we reach

    E_L[ Σ_{j=1}^{k} ∫_{R_j} η_{π_c(j)} f_{π_c(j)}(x) dx ] − E_L[ Σ_{j=1}^{k} ∫_{R_j} η_{π̂_v(j)} f_{π̂_v(j)}(x) dx ].

There, the first term is equal to (1 − e_B), since π_c(j) = i (equation (4)). Then, the second is equal to ∫_L (1 − P(e | L)) P(L) dL (where Voting is implicitly contained), which, following the same reasoning as in the proof of Theorem 5, is equal to 1 − P^V_e(l, ∞), with

    P^V_e(l, ∞) = Σ_{π ∈ S_k} P(Σ_l = π) P(e | π).

Note that, here, P^V_e(l, ∞) is used instead of P_e(l, ∞) to denote that the Voting classifier is used, not the optimal one; hence ε(Σ) = P^V_e(l, ∞) − e_B. Then, by Theorem 2 (P^V_e(l, ∞) ≥ P_e(l, ∞), ∀l), we reach the conclusion of equation (10): P_e(l, ∞) − e_B is upper bounded by ε(Σ), which, in turn, is upper bounded by 2 exp{−l λ² / 2}.
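To get a feel for the bound, the sketch below (Python) evaluates λ from (11) and the right-hand side of (10) for a toy one-dimensional mixture of k unit-variance Gaussian components whose means are spaced δ apart; this toy mixture and its midpoint decision regions are assumptions of the illustration, not a setting taken from the manuscript.

```python
# Illustration of Theorem 7: lambda from (11) and the bound 2*exp(-l*lambda^2/2)
# for an assumed 1-D mixture of k equiprobable, unit-variance Gaussian components.
import math

def phi(x: float) -> float:
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def component_mass(mean: float, a: float, b: float) -> float:
    """Mass of a unit-variance Gaussian with the given mean on the interval (a, b)."""
    return phi(b - mean) - phi(a - mean)

def lam(k: int, delta: float) -> float:
    """lambda as in (11) for components with means 0, delta, 2*delta, ..."""
    means = [j * delta for j in range(k)]
    # Bayes regions R_j: boundaries at the midpoints between adjacent means.
    bounds = [-math.inf] + [(means[j] + means[j + 1]) / 2 for j in range(k - 1)] + [math.inf]
    margins = []
    for j in range(k):
        a, b = bounds[j], bounds[j + 1]
        own = component_mass(means[j], a, b)
        other = max(component_mass(means[z], a, b) for z in range(k) if z != j)
        margins.append(own - other)
    return min(margins) / k

k, delta = 3, 2.0
lmb = lam(k, delta)
print("lambda =", round(lmb, 4))
for l in (1, 5, 10, 25, 50):
    print(l, "bound on P_e(l,inf) - e_B:", round(2 * math.exp(-l * lmb ** 2 / 2), 4))
```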

1.10 Corollary 2

Corollary 2. The probability of error, P_e(l, ∞), decreases to e_B at least exponentially fast in the number of labelled data.

Proof. It is trivial to calculate the convergence order from Theorem 7 of the manuscript.

1.11 Theorem 8

Lemma 2. Assuming all the priors to be equiprobable and that the model is a Gaussian identifiable mixture of k components with the same variance (σ), let δ_M be the largest distance between the means of two components and let Φ(·) be the CDF of a standard Gaussian distribution. Then, it holds that

    min_π P(Π_l = π) ≥ ( 1 − Φ( (δ_M / 2σ) √l ) )^(k! − 1).   (13)

Proof. First, P(Π_l = π) can be rewritten as

    P(Π_l = π) = Σ_{(l_1,...,l_k) ∈ G_l} P((l_1, ..., l_k)) P(Π((l_1, ..., l_k)) = π),   (14)

which is the decomposition of the probability in terms of the number of labelled examples of each class c_i in l, i.e. l_i, 1 ≤ i ≤ k. There, P((l_1, ..., l_k)) is the probability of having the distribution of labelled samples (l_1, ..., l_k) in l, and P(Π((l_1, ..., l_k)) = π) is the probability of obtaining the permutation π with a determined distribution of labelled samples. G_l is the set containing all possible distributions of labels, defined as the set containing all the integer partitions of l into exactly k addenda, including zeros and taking the order of the addenda into account. Formally, it is defined as

    G_l ≡ {(γ_1, γ_2, ..., γ_k) : Σ_{z=1}^{k} γ_z = l, γ_z ∈ {0, 1, ..., l} ∀z}.

Then, we define P(Π((l_1, ..., l_k)) = π) in terms of the components of the mixture as

    P( [ ∏_{i=1}^{k} ∏_{a=1}^{l_i} η f_{π^{-1}(i)}(x_a) ] / [ max_{τ≠π} ∏_{i=1}^{k} ∏_{a=1}^{l_i} η f_{τ^{-1}(i)}(x_a) ] ≥ 1 ),

where f_j(x) is the density function of the component j. For simplicity of notation, from now on, we rename the numerator as f^l_π and the denominator as max_{τ≠π}{f^l_τ}.

Then, by defining the permutation π_D as the furthest permutation from π_c, i.e. π_D = arg max_π Σ_{i=1}^{k} |μ_{π^{-1}(i)} − μ_{π_c^{-1}(i)}|, it holds that

    min_π P( f^l_π / max_{τ≠π}{f^l_τ} ≥ 1 ) = P( f^l_{π_D} / max_{τ≠π_D}{f^l_τ} ≥ 1 ).   (15)

As there is no independence between the permutations, we cannot express the second term of formula (15) as a product of probabilities of f^l_{π_D} being greater than or equal to each other permutation τ. For that reason, we use the chain rule to decompose the formula in such a way that the first term of the chain has, ∀l > 0, the lowest probability:

    P( f^l_{π_D} / max_{τ≠π_D}{f^l_τ} ≥ 1 )
        = P( f^l_{π_D} ≥ f^l_{π_c} ) ∏_{j=2}^{k!−1} P( f^l_{π_D} ≥ f^l_{π_j} | f^l_{π_D} ≥ f^l_{π_c}, ⋂_{m=2}^{j−1} {f^l_{π_D} ≥ f^l_{π_m}} ),   (16)

where π_2, ..., π_{k!−1} enumerate the permutations other than π_c and π_D. Since we assume a mixture of Gaussian components (θ = {μ_1, ..., μ_k, σ}), doing some calculations over the first term of the chain in formula (16), we reach the conclusion that it is equal to

    P( f^l_{π_D} ≥ f^l_{π_c} ) = 1 − Φ( (1/2σ) √( Σ_{i=1}^{k} l_i (μ_{π_D^{-1}(i)} − μ_{π_c^{-1}(i)})² ) )
                              ≥ 1 − Φ( (1/2σ) √( Σ_{i=1}^{k} l_i δ_M² ) ) = 1 − Φ( (δ_M / 2σ) √l ),   (17)

where δ_M = max_i |μ_{π_D^{-1}(i)} − μ_{π_c^{-1}(i)}| is the largest distance between two means. As equation (17) is, by the definition of π_D, the lowest term in the product of equation (16), we can lower bound that product by raising equation (17) to the power (k! − 1), the number of terms in the chain rule. Then, substituting this value in equation (14), we reach formula (13): min_π P(Π_l = π) is greater than or equal to

    ( 1 − Φ( (δ_M / 2σ) √l ) )^(k! − 1) · Σ_{G_l} P((l_1, ..., l_k)) = ( 1 − Φ( (δ_M / 2σ) √l ) )^(k! − 1),   (18)

since the probabilities P((l_1, ..., l_k)) over G_l sum to one.

Theorem 8 (Lower bound of the error). Assume that the model is a Gaussian identifiable mixture of k components with the same variance (σ) and equiprobable priors. Let δ_M be the largest distance between the means of two components and Q(·) a polynomial of degree 3. Then, the probability of error for non-disjoint components is lower bounded by

    P_e(l, ∞) − e_B ≥ [ k!(1 − e_B) − (k − 1)! ] / ( 1 + exp{ Q( (δ_M / 2σ) √l ) } )^(k! − 1).   (19)

Proof. First, we decompose formula (10) of the main manuscript as follows:

    P_e(l, ∞) = P(Π_l = π_c) P(e | π_c) + Σ_{π ∈ S_k \ π_c} P(Π_l = π) P(e | π).   (20)

It can be easily seen that, when l increases, whilst P(Π_l = π_c) grows to 1, the remaining P(Π_l = π), ∀π ≠ π_c, decrease to 0. Since P(e | π_c) is, by definition, equal to e_B and P(Π_l = π_c) = 1 − Σ_{π ∈ S_k \ π_c} P(Π_l = π), by substituting equation (13) of the previous lemma in (20) and subtracting e_B from both sides, we can lower bound the error (P_e(l, ∞) − e_B) by

    ( 1 − Φ( (δ_M / 2σ) √l ) )^(k! − 1) · ( Σ_{π ∈ S_k \ π_c} P(e | π) − (k! − 1) e_B ).   (21)

There, we start by calculating Σ_{π ∈ S_k \ π_c} P(e | π). Note that:

    Σ_{π ∈ S_k \ π_c} P(e | π) = Σ_{π ∈ S_k} P(e | π) − e_B   (22)
    Σ_{π ∈ S_k} P(e | π) = k! − Σ_{π ∈ S_k} P(ē | π)   (23)

By substituting formula (9) of Lemma 1 (in the manuscript) in equation (23), we obtain

    Σ_{π ∈ S_k} P(e | π) = k! − Σ_{π ∈ S_k} Σ_{i=1}^{k} (1/k) ∫_{R_{π^{-1}(i)}} f_i(x) dx.   (24)

By the distributive property of addition and the fact that (k − 1)! permutations of S_k share the same element in the same position, formula (24) can be rewritten as

    Σ_{π ∈ S_k} P(e | π) = k! − ((k − 1)!/k) Σ_{i=1}^{k} [ Σ_{j=1}^{k} ∫_{R_j} f_i(x) dx ] = k! − (k − 1)!,

where j = π^{-1}(i) and each bracketed sum equals 1. Then, we can substitute this result in formula (22), obtaining

    Σ_{π ∈ S_k \ π_c} P(e | π) = k! − (k − 1)! − e_B.   (25)

Secondly, as there is no closed-form expression for the normal cumulative distribution function, we approximate Φ(x) by an inverse exponential, as proposed by Page in [4]:

    Φ(x) ≈ Φ_Page(x) = 1 − ( 1 + exp{Q(x)} )^(−1),   (26)

where Q(x) is a cubic polynomial of the form a_1 x + a_2 x³, with the constants fitted in [4], and the absolute error ε = |Φ(x) − Φ_Page(x)| is negligible for all x ≥ 0. Then, by substituting formulas (26) and (25) in (21), and after some algebra, we reach formula (19).
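For completeness, the sketch below (Python) compares the exact Φ with the logistic approximation Φ_Page of (26) and evaluates the right-hand side of (19); the numeric constants used in Q(x) are the ones commonly quoted for Page's approximation and should be read as an assumption of this sketch, not as values taken from the manuscript.

```python
# Page's logistic approximation of the Gaussian CDF (26) and the lower bound (19).
# The constants in Q(x) are an assumption (the values usually quoted for [4]).
import math

def phi(x: float) -> float:
    """Exact standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def q(x: float) -> float:
    return 1.5976 * x + 0.070566 * x ** 3          # assumed constants, see [4]

def phi_page(x: float) -> float:
    return 1.0 - 1.0 / (1.0 + math.exp(q(x)))

# The approximation error stays small on [0, 5], which is what the proof relies on.
print(max(abs(phi(x) - phi_page(x)) for x in (i / 100 for i in range(501))))

def lower_bound_19(l: int, k: int, e_b: float, delta_m: float, sigma: float) -> float:
    """Right-hand side of (19): lower bound on P_e(l, inf) - e_B."""
    x = delta_m / (2.0 * sigma) * math.sqrt(l)
    return (math.factorial(k) * (1.0 - e_b) - math.factorial(k - 1)) \
        / (1.0 + math.exp(q(x))) ** (math.factorial(k) - 1)

for l in (1, 5, 10, 25):
    print(l, lower_bound_19(l, k=3, e_b=0.05, delta_m=2.0, sigma=1.0))
```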

2 Appendix B. Supplementary data for the empirical study

The results in this supplementary material for k ∈ {4, 5, 6} follow the same presentation as that used in the manuscript for k = 3. The generative model and the strategies used are the same as in Section VIII of the manuscript. However, in these problems, we set l_max = 50. Whilst Figure 1, Figure 3 and Figure 5 show the excess of risk of both Voting and PC_SSL for k = 4, 5 and 6, respectively, Figure 2, Figure 4 and Figure 6 present the absolute differences between the probabilities of error of both learning procedures for the respective values of k. In each figure, subfigure (a) stands for complex problems (δ = 0.25), subfigure (b) for problems of medium complexity (δ = 1), and subfigure (c) shows the performance for easy problems with a small degree of intersection among the components (δ = 5).

Figure 1: Probability of error of Voting [3] (upper curve) and PC_SSL (lower curve) for k = 4 (l_4 = 4.33). Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 2: Absolute differences between Voting [3] and PC_SSL, i.e. P^V_e(l, ∞) − P^{PC}_e(l, ∞), for k = 4. Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 3: Probability of error of Voting [3] (upper curve) and PC_SSL (lower curve) for k = 5 (l_5 = 6.42). Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 4: Absolute differences between Voting [3] and PC_SSL, i.e. P^V_e(l, ∞) − P^{PC}_e(l, ∞), for k = 5. Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 5: Probability of error of Voting [3] (upper curve) and PC_SSL (lower curve) for k = 6 (l_6 = 8.7). Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 6: Absolute differences between Voting [3] and PC_SSL, i.e. P^V_e(l, ∞) − P^{PC}_e(l, ∞), for k = 6. Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

3 Appendix C. Source code

The Mathematica package [5], available to download from our website, contains the formulae presented in the manuscript. Besides, it also contains some experiments used to study the behaviour of the probability of error of both the Voting [3] and the PC_SSL learning algorithms.

3.1 l_k-related functions

1) The function LEQ[k] calculates the l_k value for a problem with k equiprobable classes.
   Input parameters: k — number of classes.
   Output: a real number, l_k.

2) The function LEQPlot[max] plots a figure of the l_k values for problems of {1..max} equiprobable classes. Figure 2a of the manuscript has been created with this function.
   Input parameters: max — maximum number of classes.
   Output: plot in the standard output.

3) LPriors[priors, nrep] uses a Monte Carlo method with nrep repetitions to approximate l_k for a problem with Length[priors] non-equiprobable classes whose priors are given by the variable priors.
   Input parameters: priors — list containing the class priors; nrep — number of repetitions.
   Output: a real number, l_k.

4) LSampling[k, samplesize, nrep] applies LPriors[priors, nrep] with nrep repetitions over a population of class priors of size samplesize, generated from a Dirichlet distribution with all the alpha hyper-parameters equal to 1.

   Input parameters: k — number of classes; samplesize — population size; nrep — number of repetitions for LPriors[...].
   Output: a list of l_k values, one per each sample of the population.

5) Eq. (8) of Theorem 4 corresponds to L3[n1, n2, n3]. It calculates l_3 for a ternary problem with priors n1, n2 and n3.
   Input parameters: n1 — class prior of class c_1; n2 — class prior of class c_2; n3 — class prior of class c_3.
   Output: a real number, l_3.

6) L3Plot[] plots L3[n1, n2, n3] assuming that n1 = η_1 and n2 = n3 = (1 − η_1)/2.
   Input parameters: none.
   Output: plot in the standard output.

3.2 Functions related to the probability of error

7) The function ZeroEB[k, maxL] calculates the probability of error for l = 0..maxL for a k-class problem when there is no intersection among the components.
   Input parameters: k — number of classes; maxL — maximum number of labelled examples.
   Output: summary of results in the standard output.
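For readers without Mathematica, the sketch below reproduces in Python the Monte Carlo idea behind LPriors[...] and the closed form (3) behind L3[...]; it is an illustration under the same definitions, not the package itself.

```python
# Monte Carlo estimate of l_k for arbitrary class priors (the idea behind LPriors[...])
# and the ternary closed form (3) (the idea behind L3[...]). Illustrative sketch only.
import random

def l_k_monte_carlo(priors, n_rep: int = 100_000, seed: int = 0) -> float:
    """Expected number of draws until (k - 1) distinct classes have been observed."""
    k = len(priors)
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rep):
        seen, draws = set(), 0
        while len(seen) < k - 1:
            seen.add(rng.choices(range(k), weights=priors)[0])
            draws += 1
        total += draws
    return total / n_rep

def l_3_closed_form(priors) -> float:
    """Equation (3) for a ternary problem."""
    return sum(p * (2.0 - p) / (1.0 - p) for p in priors)

priors = [0.5, 0.3, 0.2]
print(round(l_k_monte_carlo(priors), 3), round(l_3_closed_form(priors), 3))
# Both values should agree closely (about 2.68 for these priors).
```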

8) MCPCSSL[k, distance, sigma, maxL, nRep] uses a Monte Carlo method with nRep repetitions to approximate P_e(l, ∞) for l = 0..maxL assuming a mixture of Gaussian components. distance represents the distance between adjacent means and sigma is the variance of the components.
   Input parameters: k — number of classes; distance — distance between the means of adjacent components; sigma — variance of the Gaussian components; maxL — maximum number of labelled examples; nRep — number of repetitions for the Monte Carlo method.
   Output: summary of results in the standard output.

9) The function MCPCSSLBiased[k, distance, sigma, maxL, bias, nRep] uses a Monte Carlo method with nRep repetitions to approximate P_e(l, ∞) for l = 0..maxL assuming a generative model composed of a mixture of Gaussian components. distance represents the distance between adjacent means and sigma is the variance. In this simulation, the learnt model is also a mixture of Gaussian components, but its components are shifted a factor bias to the right, i.e. μ̂_i − μ_i = bias. This function corresponds to the second experiment of the manuscript.
   Input parameters: k — number of classes; distance — distance between the means of adjacent components; sigma — variance of the Gaussian components; maxL — maximum number of labelled examples; bias — bias between the learnt and the generative models; nRep — number of repetitions for the Monte Carlo method.
   Output: summary of results in the standard output.
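As an illustration of the kind of experiment MCPCSSL[...] runs, the sketch below (Python) performs a Monte Carlo estimate of P_e(l, ∞) for a one-dimensional Gaussian mixture, using a maximum-likelihood matching of labels to the (known) components; both the matching rule and the test-set estimation of the error are assumptions of this sketch, not the authors' implementation.

```python
# Illustrative Monte Carlo estimate of P_e(l, inf) for a 1-D Gaussian mixture, in the
# spirit of MCPCSSL[...]. The label-to-component matching used here (maximum likelihood
# on the labelled data, ties broken at random) is an assumption of this sketch.
import itertools
import math
import random

def estimate_pe(k: int, distance: float, sigma: float, max_l: int,
                n_rep: int = 200, n_test: int = 200, seed: int = 0):
    """P_e(l, inf) for l = 0..max_l, equiprobable priors, components N(j*distance, sigma^2).
    The mixture itself is assumed known, as if infinite unlabelled data were available."""
    rng = random.Random(seed)
    means = [j * distance for j in range(k)]

    def loglik(x, m):
        return -((x - m) ** 2) / (2.0 * sigma ** 2)

    errors = [0.0] * (max_l + 1)
    for _ in range(n_rep):
        hidden = list(range(k))
        rng.shuffle(hidden)                      # hidden class -> component correspondence
        labelled = []                            # (class, x) pairs revealed one by one
        for l in range(max_l + 1):
            if l > 0:
                c = rng.randrange(k)
                labelled.append((c, rng.gauss(means[hidden[c]], sigma)))
            # correspondence maximising the labelled-data likelihood
            # (with no labels every permutation ties, i.e. a random classifier)
            scored = [(sum(loglik(x, means[p[c]]) for c, x in labelled), p)
                      for p in itertools.permutations(range(k))]
            best = max(s for s, _ in scored)
            learnt = rng.choice([p for s, p in scored if s >= best - 1e-9])
            # error of the Bayes decision rule under the learnt correspondence
            wrong = 0
            for _ in range(n_test):
                c = rng.randrange(k)
                x = rng.gauss(means[hidden[c]], sigma)
                j_hat = max(range(k), key=lambda j: loglik(x, means[j]))
                wrong += (learnt.index(j_hat) != c)
            errors[l] += wrong / n_test
    return [round(e / n_rep, 3) for e in errors]

print(estimate_pe(k=3, distance=2.0, sigma=1.0, max_l=10))
# errors[0] should be close to (k - 1)/k, as Theorem 1 predicts, and decrease with l.
```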

10) MCVOTING[k, distance, sigma, maxL, nRep] uses a Monte Carlo method with nRep repetitions to approximate the probability of error of Voting for l = 0..maxL assuming a mixture of Gaussian components. distance represents the distance between adjacent means and sigma is the variance.
   Input parameters: k — number of classes; distance — distance between the means of adjacent components; sigma — variance of the Gaussian components; maxL — maximum number of labelled examples; nRep — number of repetitions for the Monte Carlo method.
   Output: summary of results in the standard output.

11) MCComparison[k, distance, sigma, maxL, nRep] uses a Monte Carlo method with nRep repetitions to approximate the probability of error of both PC_SSL and Voting for l = 0..maxL assuming a mixture of Gaussian components. distance represents the distance between adjacent means and sigma is the variance.
   Input parameters: k — number of classes; distance — distance between the means of adjacent components; sigma — variance of the Gaussian components; maxL — maximum number of labelled examples; nRep — number of repetitions for the Monte Carlo method.
   Output: summary of results in the standard output.

3.3 Examples of the executions of the package

Figure 7 and Figure 8 show two executions of the package to calculate l_k and l_3.

Figure 7: An example of the use of the l_k-related functions.

Figure 8: An example of the calculation of l_3.

Figure 9 shows an example of how, in cases of mutually non-intersecting components, the probability of error is calculated.

Figure 9: An execution of ZeroEB[k, maxL].

Whilst Figure 10 shows how the probability of error of PC_SSL is calculated with the Mathematica package, Figure 11 provides an example of the calculation of the probability of error of PC_SSL when the correct model assumption is not met.

Figure 10: Calculating the optimal probability of error using PC_SSL.

Figure 11: An execution of PC_SSL when the correct model assumption is not met.

Then, Figure 12 shows how the probability of error of the Voting strategy is calculated.

Figure 12: An example of the execution of the Voting strategy.

Finally, Figure 13 shows an example of how the comparison between both procedures can be made with the function MCComparison[...].

Figure 13: Comparing both learning algorithms using the function MCComparison[...].

References

[1] V. Castelli and T. Cover, "On the exponential value of labeled samples," Pattern Recognition Letters, vol. 16, pp. 105-111, 1995.

[2] R. P. Stanley, Enumerative Combinatorics. Wadsworth Publ. Co., 1986.

[3] H. Chen and L. Li, "Semisupervised multicategory classification with imperfect model," IEEE Transactions on Neural Networks, vol. 20, no. 10, 2009.

[4] E. Page, "Approximation to the cumulative normal function and its inverse for use on a pocket calculator," Applied Statistics, vol. 26, 1977.

[5] Wolfram Research, Inc., Mathematica.
