Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data


Jonathan Ortigosa-Hernández, Iñaki Inza, and Jose A. Lozano

Contents

1 Appendix A. Mathematical proofs
  1.1 Theorem 1
  1.2 Theorem 2
  1.3 Theorem 3
  1.4 Theorem 4
  1.5 Lemma 1
  1.6 Theorem 5
  1.7 Theorem 6
  1.8 Corollary 1
  1.9 Theorem 7
  1.10 Corollary 2
  1.11 Theorem 8

2 Appendix B. Supplementary data for the empirical study

3 Appendix C. Source code
  3.1 l_k-related functions
  3.2 Functions related to the probability of error
  3.3 Examples of the executions of the package

1 Appendix A. Mathematical proofs

This appendix provides the mathematical proofs of the theoretical results in our paper "Semi-supervised multi-class classification problems with scarcity of labelled data: A theoretical study".

1.1 Theorem 1

Theorem 1 (Probability of error with no labelled data).¹ The probability of error of classifying a new sample (x^(0), c^(0)) of a k-class problem with any classifier learnt with no labelled examples and any number u ≥ 0 of unlabelled samples coincides with the probability of error of the random classifier. In both cases, it remains constant at

    P_e(0, u) = e_0 = (k − 1)/k,  ∀u ≥ 0.   (1)

¹ This holds true independently of the value of the class priors η_1, ..., η_k.

Proof.

- When f(x) is unknown: When there are not enough unlabelled records (u < ∞) to determine the mixture density (and its components), the class c^(0) of the unseen instance must be determined by the uniformly random classifier. In such a case, let A be the event of correctly choosing the right class for c^(0) among the k possible classes; P(A) = 1/k. Then, the probability of committing an error is

    P_e(0, u) = 1 − P(A) = (k − 1)/k,  ∀u < ∞.

- When f(x) is known: With infinite unlabelled examples, the mixture is identifiable up to a permutation π of the components. Let the observation x^(0) be drawn from the j-th decomposed component, f_{π(j)}(·), let i be the unknown true label such that π_c(j) = i, and let us define the following two events:

    B_1 ≡ {η_{π_c(j)} f_{π_c(j)}(x^(0)) − max_{j′≠j} η_{π_c(j′)} f_{π_c(j′)}(x^(0)) > 0}
    B_2 ≡ {η_{π_c(j)} f_{π_c(j)}(x^(0)) − max_{j′≠j} η_{π_c(j′)} f_{π_c(j′)}(x^(0)) < 0}

B_1 is the event which represents achieving a correct answer in the application of the Bayes decision rule over x^(0), and B_2 represents the opposite. By definition, P(B_1) = (1 − e_B) and P(B_2) = e_B. Then, the probability of error is

    P_e(0, ∞) = P(ĉ^(0) ≠ c^(0)) = 1 − P(ĉ^(0) = c^(0))
              = 1 − P(ĉ^(0) = c^(0) | B_1) P(B_1) − P(ĉ^(0) = c^(0) | B_2) P(B_2)
              = 1 − P(π_a)(1 − e_B) − P(π_b) e_B,

where π_a and π_b are two permutations such that π_a(j) = i and π_b(j′) = i. As no labelled data are provided to determine the correspondence π, it has to be chosen at random. Then, the probability of choosing each of those permutations is P(π_a) = P(π_b) = (k − 1)!/k! = 1/k. After substituting these probabilities in the previous formula and after some algebra, we obtain the same expression: P_e(0, ∞) = (k − 1)/k.

1.2 Theorem 2

Theorem 2 (Optimality of PC_SSL). PC_SSL is an optimal learning procedure for Stage 2 of Algorithm 2.

Proof. The proof of this theorem can be found in the manuscript.

1.3 Theorem 3

Theorem 3 (Minimum number of labelled examples). Let the family of mixtures F be linearly independent. Let k be both the number of classes and the number of mixture components. Let the class priors be equiprobable, i.e. ∀i, η_i = η = 1/k. Then, the expected minimum number of labelled instances needed to uniquely determine the class labels of the components of a mixture is

    l_k = 1,                                                     if k = 2,
    l_k = k²(k − 1) Σ_{j=2}^{k} [(−1)^j / j²] · C(k − 2, j − 2),   if k > 2.   (2)

Proof. The proof for k = 2 can be found in [1]. For higher values of k, let L_k be a random variable representing the minimum number of instances needed to obtain examples of (k − 1) different classes. Assuming equiprobability, P(L_k = l), for l ≥ (k − 1), is a fraction whose numerator is the number of favourable cases and whose denominator is the number of all possible cases:

    P(L_k = l) = P_k · S_2(l − 1, k − 2) / PR_{k,l},

where P_k is the number of permutations of k elements, S_2(·,·) is the Stirling number of the second kind, and PR_{k,l} is the number of l-permutations of k elements with repetition (formulae in [2]). Then, we calculate the expectation of the random variable L_k as

    l_k = E[L_k] = Σ_{l=k−1}^{∞} l · P(L_k = l).

After some algebra, we reach equation (2) for k > 2.
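As a quick numerical check of equation (2), the sketch below (written in Python; it is an illustration only, not part of the Mathematica package described in Appendix C) evaluates the closed form and a direct simulation of L_k; for k = 4, 5, 6 both agree with the values l_4 = 4.33, l_5 = 6.42 and l_6 = 8.7 quoted in Appendix B.

```python
# Sanity check of equation (2): closed form vs. a direct simulation of L_k
# (draws until (k-1) distinct classes are observed, equiprobable priors).
import math
import random

def l_k_closed_form(k: int) -> float:
    """Equation (2): l_k for equiprobable priors (l_2 = 1)."""
    if k == 2:
        return 1.0
    return k * k * (k - 1) * sum((-1) ** j / j ** 2 * math.comb(k - 2, j - 2)
                                 for j in range(2, k + 1))

def l_k_simulated(k: int, n_rep: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[L_k]."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rep):
        seen, draws = set(), 0
        while len(seen) < k - 1:
            seen.add(rng.randrange(k))
            draws += 1
        total += draws
    return total / n_rep

for k in range(3, 7):
    print(k, round(l_k_closed_form(k), 2), round(l_k_simulated(k), 2))
# Expected (up to Monte Carlo noise): 2.5, 4.33, 6.42, 8.7
```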

1.4 Theorem 4

Theorem 4 (Minimum labelled examples for ternary problems). Let f(x) be identifiable and let k = 3 be the number of classes, with priors η_i > 0 and Σ_{i=1}^{3} η_i = 1. Then, the expected minimum number of labelled instances needed to uniquely determine the class labels of the components of the mixture is

    l_3 = Σ_{i=1}^{3} η_i (2 − η_i) / (1 − η_i).   (3)

Proof. Let L_3 be a random variable representing the minimum number of instances needed to obtain two different labels. Assuming each class c_i has a prior η_i and Σ_{i=1}^{3} η_i = 1, P(L_3 = l) has the following form:

    P(L_3 = l) = η_1^(l−1) (1 − η_1) + η_2^(l−1) (1 − η_2) + η_3^(l−1) (1 − η_3),

which stands for the probability of having (l − 1) labelled examples of one class followed by one example of one of the two other classes. Then, we calculate the expectation of L_3 over l ≥ 2. After some algebra, we reach equation (3).

1.5 Lemma 1

Lemma 1. Let L be a labelled subset distributed according to a generative model f(x, c). Let the marginal of f(x, c) on x be an identifiable mixture density f(x) = Σ_{i=1}^{k} η_i f_i(x) which represents the distribution of the infinite unlabelled records. Let π be the correspondence returned by the learning procedure Π(·) (Stage 2 of Algorithm 2), and let R_{π^{-1}(i)} be defined as

    R_{π^{-1}(i)} = {x : η_{π^{-1}(i)} f_{π^{-1}(i)}(x) − max_{i′≠i} η_{π^{-1}(i′)} f_{π^{-1}(i′)}(x) > 0}.

Then, the probability of committing an error in classifying an unseen instance with the BDR after assuming the correspondence π is

    P(e | π) = 1 − Σ_{i=1}^{k} η_i ∫_{R_{π^{-1}(i)}} f_i(x) dx.   (4)

Proof. According to Algorithm 2, a correct prediction occurs depending on whether the optimal classification rule, which assumes a learnt correspondence π, over the unseen instance x, hits the real class value c_i. If the region where f_i(x) is maximum is R_j (it holds that i = π_c(j)), there are only two cases in which the real class is correctly predicted:

    Case 1: if x ∈ R_j and π(j) = i
    Case 2: if x ∈ R_{j′}, j′ ≠ j, and π(j′) = i

As can be seen, in both cases, the region to which x belongs can be denoted π^{-1}(i), which is equal to j in Case 1 and to j′ in Case 2. Therefore, the probability of correctly classifying x is just the probability of x being in the region labelled as i, i.e. R_{π^{-1}(i)}, and this formula can be easily generalised to all the possible values of i in the range {1, ..., k} as

    P(ē | π) = Σ_{i=1}^{k} η_i ∫_{R_{π^{-1}(i)}} f_i(x) dx.

After that, we reach formula (4) by taking into account that the probability of error is the complement of the probability of a correct classification, P(e | π) = 1 − P(ē | π).

1.6 Theorem 5

Theorem 5 (Probability of error). The probability of error of classifying an unseen instance in the multi-class scenario, given l labelled records and infinite unlabelled records, is

    P_e(l, ∞) = Σ_{π ∈ S_k} P(Π_l = π) P(e | π),   (5)

where P(Π_l = π) denotes the probability of choosing, using the learning procedure Π(·), the permutation π given l labelled data (Stage 2 of Algorithm 2), and P(e | π) is the probability of misclassification with the BDR after assuming the correspondence π (Stage 3 of Algorithm 2).² Finally, S_k represents the set of all possible permutations of size k, representing all the correspondences between labels and components.

Proof. The mathematical proof of this theorem is in the manuscript.

1.7 Theorem 6

Theorem 6 (Probability of error with zero Bayes error). Let the mixture density f(x) be an identifiable mixture. Let k be both the number of classes and the number of mixture components. Let the class priors be equiprobable, i.e. ∀i, η_i = η = 1/k. Then, the probability of error P_e(l, ∞), given l > 0 labelled records and infinite unlabelled records, when the components are mutually disjoint, is given by

    P_e(l, ∞) = Σ_{z=1}^{min(k−2, l)} [P_{k,z} S_2(l, z) / PR_{k,l}] · (1 − (z + 1)/k),   (6)

where P_{k,z} and PR_{k,l} are the number of z-permutations without repetition and of l-permutations with repetition of k elements, respectively, and S_2(l, z) is the Stirling number of the second kind [2].

² Note that eq. (5) fits with the one proposed for binary problems in [1] (p. 107, eq. (7)): P(Π_l = π_c) P(e | π_c) + P(Π_l = π̄_c) P(e | π̄_c).

Proof. To determine the probability of error, we just need to calculate the probability of finding z ≤ min(k, l) different labels in l. When z ≥ (k − 1) (see the definition of l_k), we reach the real model, so the probability of error is e_B, which is zero. Therefore, we can define the probability as follows:

    P_e(l, ∞) = Σ_{z=1}^{min(k−2, l)} P(Ψ_l = z) P(e | z),   (7)

where Ψ_l is a random variable representing the number of different labels z in l, and P(e | z) is the probability of committing a classification error knowing the real correspondence of z labels with their components.

Regarding P(Ψ_l = z), as all the priors are equiprobable, it is a fraction whose numerator is the number of ways of selecting just z classes out of k multiplied by the number of ways of ordering them, and whose denominator is the number of all possible cases:

    P(Ψ_l = z) = P_{k,z} S_2(l, z) / PR_{k,l}.

Then, P(e | z) can be decomposed into the sum of the probability of misclassifying when the label of the unseen instance appears in the labelled subset and the probability of misclassifying it when we do not have that label to identify a mixture component:

    P(e | z) = Pr{i ∈ z} Pr{e | i ∈ z} + Pr{i ∉ z} Pr{e | i ∉ z},

where {i ∈ z} corresponds to the event that the unseen instance has the same label as one of the labels that appears in the labelled subset, {i ∉ z} is the opposite, and Pr{e | ·} is the probability of misclassifying an instance of class i in each case. By solving the previous formula, we obtain that P(e | z) is equal to

    (z/k) · 0 + ((k − z)/k) · [(k − z)! − (k − z − 1)!] / (k − z)! = 1 − (z + 1)/k.   (8)

Finally, by substituting the calculations of P(Ψ_l = z) and P(e | z) in equation (7), we reach formula (6).

1.8 Corollary 1

Corollary 1. When the components are mutually disjoint, P_e(l, ∞) converges to 0 exponentially fast, in the sense that it is

    o( ((k − 1)/k)^l ).   (9)

Proof. It is trivial to calculate the convergence order from the previous theorem.
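The closed form (6) can be checked numerically. The sketch below (Python, an illustration only and not the authors' package) compares it with a simulation that draws the l labels, counts the distinct classes z and applies the conditional error P(e | z) derived in the proof; the two estimates agree.

```python
# Numeric check of formula (6): the closed form vs. a simulation that draws l labels,
# counts the distinct classes z and applies the conditional error P(e | z) = 1 - (z+1)/k.
import math
import random

def stirling2(n: int, m: int) -> int:
    """Stirling number of the second kind, S_2(n, m), by the standard recurrence."""
    if m == 0:
        return 1 if n == 0 else 0
    if m > n:
        return 0
    table = [[0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][m]

def pe_closed_form(l: int, k: int) -> float:
    """Formula (6): P_e(l, inf) with zero Bayes error and equiprobable priors."""
    total = 0.0
    for z in range(1, min(k - 2, l) + 1):
        p_kz = math.factorial(k) // math.factorial(k - z)   # z-permutations of k
        p_psi = p_kz * stirling2(l, z) / k ** l              # P(Psi_l = z)
        total += p_psi * (1 - (z + 1) / k)                   # P(e | z)
    return total

def pe_simulated(l: int, k: int, n_rep: int = 200_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_rep):
        z = len({rng.randrange(k) for _ in range(l)})  # distinct labels observed
        if z < k - 1:                                  # otherwise the matching is exact
            acc += (k - z - 1) / k
    return acc / n_rep

for l in range(1, 6):
    print(l, round(pe_closed_form(l, 4), 4), round(pe_simulated(l, 4), 4))
```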

1.9 Theorem 7

Theorem 7 (Upper bound of the error). When the components are not mutually disjoint and all the priors are equiprobable, the optimal probability of error of a model composed of k different identifiable mixture components is upper bounded by

    P_e(l, ∞) − e_B ≤ 2 exp{−l λ² / 2},   (10)

where λ ∈ (0, 1] depends on the degree of intersection among the components of the mixture distribution and is defined by

    λ = (1/k) min_j { ∫_{R_j} f_i(x) dx − max_{z≠i} ∫_{R_j} f_z(x) dx },  with i = π_c(j).   (11)

Proof. Let the Voting learning procedure [3] be defined as the function Σ : L → S_k, where L is the labelled set and π̂_v ∈ S_k is the permutation of the components returned by Voting. In [3], the excess of risk of the Voting procedure, ε(Σ), is defined as

    E_L[ Σ_{j=1}^{k} ∫_{R_j} ( η_{π_c(j)} f_{π_c(j)}(x) − η_{π̂_v(j)} f_{π̂_v(j)}(x) ) dx ],   (12)

where E_L[·] is the expectation with respect to the labelled sample, and π_c(j) and π̂_v(j) correspond to the true class and the Voting bet for the j-th component, respectively. In that paper, they prove that ε(Σ) ≤ 2 exp{−l λ² / 2}. Then, we just need to prove that P_e(l, ∞) − e_B ≤ ε(Σ_l). By means of the linearity property of both the integral and the expectation over equation (12), we reach

    E_L[ Σ_{j=1}^{k} ∫_{R_j} η_{π_c(j)} f_{π_c(j)}(x) dx ] − E_L[ Σ_{j=1}^{k} ∫_{R_j} η_{π̂_v(j)} f_{π̂_v(j)}(x) dx ].

There, the first term is equal to (1 − e_B), since π_c(j) = i (equation (4)). Then, the second is equal to ∫_L (1 − P(e | L)) P(L) dL (where Voting is implicitly contained), which, following the same reasoning as in the proof of Theorem 5, is equal to 1 − P^V_e(l, ∞), with

    P^V_e(l, ∞) = Σ_{π ∈ S_k} P(Σ_l = π) P(e | π).

Note that, here, P^V_e(l, ∞) is used instead of P_e(l, ∞) to denote that the Voting classifier is used, not the optimal one; hence ε(Σ) = P^V_e(l, ∞) − e_B. Then, by Theorem 2 (P^V_e(l, ∞) ≥ P_e(l, ∞), ∀l), we reach the conclusion of equation (10): P_e(l, ∞) − e_B is upper bounded by ε(Σ), which, in turn, is upper bounded by 2 exp{−l λ² / 2}.
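To get a feel for the bound, the sketch below (Python) evaluates λ from (11) and the right-hand side of (10) for a toy one-dimensional mixture of k unit-variance Gaussian components whose means are spaced δ apart; this toy mixture and its midpoint decision regions are assumptions of the illustration, not a setting taken from the manuscript.

```python
# Illustration of Theorem 7: lambda from (11) and the bound 2*exp(-l*lambda^2/2)
# for an assumed 1-D mixture of k equiprobable, unit-variance Gaussian components.
import math

def phi(x: float) -> float:
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def component_mass(mean: float, a: float, b: float) -> float:
    """Mass of a unit-variance Gaussian with the given mean on the interval (a, b)."""
    return phi(b - mean) - phi(a - mean)

def lam(k: int, delta: float) -> float:
    """lambda as in (11) for components with means 0, delta, 2*delta, ..."""
    means = [j * delta for j in range(k)]
    # Bayes regions R_j: boundaries at the midpoints between adjacent means.
    bounds = [-math.inf] + [(means[j] + means[j + 1]) / 2 for j in range(k - 1)] + [math.inf]
    margins = []
    for j in range(k):
        a, b = bounds[j], bounds[j + 1]
        own = component_mass(means[j], a, b)
        other = max(component_mass(means[z], a, b) for z in range(k) if z != j)
        margins.append(own - other)
    return min(margins) / k

k, delta = 3, 2.0
lmb = lam(k, delta)
print("lambda =", round(lmb, 4))
for l in (1, 5, 10, 25, 50):
    print(l, "bound on P_e(l,inf) - e_B:", round(2 * math.exp(-l * lmb ** 2 / 2), 4))
```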

1.10 Corollary 2

Corollary 2. The probability of error, P_e(l, ∞), decreases to e_B at least exponentially fast in the number of labelled data.

Proof. It is trivial to calculate the convergence order from Theorem 7 of the manuscript.

1.11 Theorem 8

Lemma 2. Assuming all the priors to be equiprobable and that the model is a Gaussian identifiable mixture of k components with the same variance (σ), let δ_M be the largest distance between the means of two components and let Φ(·) be the CDF of a standard Gaussian distribution. Then, it holds that

    min_π P(Π_l = π) ≥ ( 1 − Φ( (δ_M / 2σ) √l ) )^(k! − 1).   (13)

Proof. First, P(Π_l = π) can be rewritten as

    P(Π_l = π) = Σ_{(l_1,...,l_k) ∈ G_l} P((l_1, ..., l_k)) P(Π((l_1, ..., l_k)) = π),   (14)

which is the decomposition of the probability in terms of the number of labelled examples of each class c_i in l, i.e. l_i, 1 ≤ i ≤ k. There, P((l_1, ..., l_k)) is the probability of having the distribution of labelled samples (l_1, ..., l_k) in l, and P(Π((l_1, ..., l_k)) = π) is the probability of obtaining the permutation π with a determined distribution of labelled samples. G_l is the set containing all possible distributions of labels, defined as the set containing all the integer partitions of l into exactly k addenda, including zeros and taking the order of the addenda into account. Formally, it is defined as

    G_l ≡ {(γ_1, γ_2, ..., γ_k) : Σ_{z=1}^{k} γ_z = l, γ_z ∈ {0, 1, ..., l} ∀z}.

Then, we define P(Π((l_1, ..., l_k)) = π) in terms of the components of the mixture as

    P( [ ∏_{i=1}^{k} ∏_{a=1}^{l_i} η f_{π^{-1}(i)}(x_a) ] / [ max_{τ≠π} ∏_{i=1}^{k} ∏_{a=1}^{l_i} η f_{τ^{-1}(i)}(x_a) ] ≥ 1 ),

where f_j(x) is the density function of the component j. For simplicity of notation, from now on, we rename the numerator as f^l_π and the denominator as max_{τ≠π}{f^l_τ}.

Then, by defining the permutation π_D as the furthest permutation from π_c, i.e. π_D = arg max_π Σ_{i=1}^{k} |μ_{π^{-1}(i)} − μ_{π_c^{-1}(i)}|, it holds that

    min_π P( f^l_π / max_{τ≠π}{f^l_τ} ≥ 1 ) = P( f^l_{π_D} / max_{τ≠π_D}{f^l_τ} ≥ 1 ).   (15)

As there is no independence between the permutations, we cannot express the second term of formula (15) as a product of probabilities of f^l_{π_D} being greater than or equal to each other permutation τ. For that reason, we use the chain rule to decompose the formula in such a way that the first term of the chain has, ∀l > 0, the lowest probability:

    P( f^l_{π_D} / max_{τ≠π_D}{f^l_τ} ≥ 1 )
        = P( f^l_{π_D} ≥ f^l_{π_c} ) ∏_{j=2}^{k!−1} P( f^l_{π_D} ≥ f^l_{π_j} | f^l_{π_D} ≥ f^l_{π_c}, ⋂_{m=2}^{j−1} {f^l_{π_D} ≥ f^l_{π_m}} ),   (16)

where π_2, ..., π_{k!−1} enumerate the permutations other than π_c and π_D. Since we assume a mixture of Gaussian components (θ = {μ_1, ..., μ_k, σ}), doing some calculations over the first term of the chain in formula (16), we reach the conclusion that it is equal to

    P( f^l_{π_D} ≥ f^l_{π_c} ) = 1 − Φ( (1/2σ) √( Σ_{i=1}^{k} l_i (μ_{π_D^{-1}(i)} − μ_{π_c^{-1}(i)})² ) )
                              ≥ 1 − Φ( (1/2σ) √( Σ_{i=1}^{k} l_i δ_M² ) ) = 1 − Φ( (δ_M / 2σ) √l ),   (17)

where δ_M = max_i |μ_{π_D^{-1}(i)} − μ_{π_c^{-1}(i)}| is the largest distance between two means. As equation (17) is, by the definition of π_D, the lowest term in the product of equation (16), we can lower bound that product by raising equation (17) to the power (k! − 1), the number of terms in the chain rule. Then, substituting this value in equation (14), we reach formula (13): min_π P(Π_l = π) is greater than or equal to

    ( 1 − Φ( (δ_M / 2σ) √l ) )^(k! − 1) · Σ_{G_l} P((l_1, ..., l_k)) = ( 1 − Φ( (δ_M / 2σ) √l ) )^(k! − 1),   (18)

since the probabilities P((l_1, ..., l_k)) over G_l sum to one.

Theorem 8 (Lower bound of the error). Assume that the model is a Gaussian identifiable mixture of k components with the same variance (σ) and equiprobable priors. Let δ_M be the largest distance between the means of two components and Q(·) a polynomial of degree 3. Then, the probability of error for non-disjoint components is lower bounded by

    P_e(l, ∞) − e_B ≥ [ k!(1 − e_B) − (k − 1)! ] / ( 1 + exp{ Q( (δ_M / 2σ) √l ) } )^(k! − 1).   (19)

Proof. First, we decompose formula (10) of the main manuscript as follows:

    P_e(l, ∞) = P(Π_l = π_c) P(e | π_c) + Σ_{π ∈ S_k \ π_c} P(Π_l = π) P(e | π).   (20)

It can be easily seen that, when l increases, whilst P(Π_l = π_c) grows to 1, the remaining P(Π_l = π), ∀π ≠ π_c, decrease to 0. Since P(e | π_c) is, by definition, equal to e_B and P(Π_l = π_c) = 1 − Σ_{π ∈ S_k \ π_c} P(Π_l = π), by substituting equation (13) of the previous lemma in (20) and subtracting e_B from both sides, we can lower bound the error (P_e(l, ∞) − e_B) by

    ( 1 − Φ( (δ_M / 2σ) √l ) )^(k! − 1) · ( Σ_{π ∈ S_k \ π_c} P(e | π) − (k! − 1) e_B ).   (21)

There, we start by calculating Σ_{π ∈ S_k \ π_c} P(e | π). Note that:

    Σ_{π ∈ S_k \ π_c} P(e | π) = Σ_{π ∈ S_k} P(e | π) − e_B   (22)
    Σ_{π ∈ S_k} P(e | π) = k! − Σ_{π ∈ S_k} P(ē | π)   (23)

By substituting formula (9) of Lemma 1 (in the manuscript) in equation (23), we obtain

    Σ_{π ∈ S_k} P(e | π) = k! − Σ_{π ∈ S_k} Σ_{i=1}^{k} (1/k) ∫_{R_{π^{-1}(i)}} f_i(x) dx.   (24)

By the distributive property of addition and the fact that (k − 1)! permutations of S_k share the same element in the same position, formula (24) can be rewritten as

    Σ_{π ∈ S_k} P(e | π) = k! − ((k − 1)!/k) Σ_{i=1}^{k} [ Σ_{j=1}^{k} ∫_{R_j} f_i(x) dx ] = k! − (k − 1)!,

where j = π^{-1}(i) and each bracketed sum equals 1. Then, we can substitute this result in formula (22), obtaining

    Σ_{π ∈ S_k \ π_c} P(e | π) = k! − (k − 1)! − e_B.   (25)

Secondly, as there is no closed-form expression for the normal cumulative distribution function, we approximate Φ(x) by an inverse exponential, as proposed by Page in [4]:

    Φ(x) ≈ Φ_Page(x) = 1 − ( 1 + exp{Q(x)} )^(−1),   (26)

where Q(x) is a cubic polynomial of the form a_1 x + a_2 x³, with the constants fitted in [4], and the absolute error ε = |Φ(x) − Φ_Page(x)| is negligible for all x ≥ 0. Then, by substituting formulas (26) and (25) in (21), and after some algebra, we reach formula (19).
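For completeness, the sketch below (Python) compares the exact Φ with the logistic approximation Φ_Page of (26) and evaluates the right-hand side of (19); the numeric constants used in Q(x) are the ones commonly quoted for Page's approximation and should be read as an assumption of this sketch, not as values taken from the manuscript.

```python
# Page's logistic approximation of the Gaussian CDF (26) and the lower bound (19).
# The constants in Q(x) are an assumption (the values usually quoted for [4]).
import math

def phi(x: float) -> float:
    """Exact standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def q(x: float) -> float:
    return 1.5976 * x + 0.070566 * x ** 3          # assumed constants, see [4]

def phi_page(x: float) -> float:
    return 1.0 - 1.0 / (1.0 + math.exp(q(x)))

# The approximation error stays small on [0, 5], which is what the proof relies on.
print(max(abs(phi(x) - phi_page(x)) for x in (i / 100 for i in range(501))))

def lower_bound_19(l: int, k: int, e_b: float, delta_m: float, sigma: float) -> float:
    """Right-hand side of (19): lower bound on P_e(l, inf) - e_B."""
    x = delta_m / (2.0 * sigma) * math.sqrt(l)
    return (math.factorial(k) * (1.0 - e_b) - math.factorial(k - 1)) \
        / (1.0 + math.exp(q(x))) ** (math.factorial(k) - 1)

for l in (1, 5, 10, 25):
    print(l, lower_bound_19(l, k=3, e_b=0.05, delta_m=2.0, sigma=1.0))
```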

2 Appendix B. Supplementary data for the empirical study

The results in this supplementary material for k ∈ {4, 5, 6} follow the same presentation as that used in the manuscript for k = 3. The generative model and the strategies used are the same as in Section VIII of the manuscript. However, in these problems, we set l_max = 50. Whilst Figure 1, Figure 3 and Figure 5 show the excess of risk of both Voting and PC_SSL for k = 4, 5 and 6, respectively, Figure 2, Figure 4 and Figure 6 present the absolute differences between the probabilities of error of both learning procedures for the respective values of k. In each figure, subfigure (a) stands for complex problems (δ = 0.25), subfigure (b) for problems of medium complexity (δ = 1), and subfigure (c) shows the performance for easy problems with a small degree of intersection among the components (δ = 5).

Figure 1: Probability of error of Voting [3] (upper curve) and PC_SSL (lower curve) for k = 4 (l_4 = 4.33). Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 2: Absolute differences between Voting [3] and PC_SSL, i.e. P^V_e(l, ∞) − P^{PC}_e(l, ∞), for k = 4. Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 3: Probability of error of Voting [3] (upper curve) and PC_SSL (lower curve) for k = 5 (l_5 = 6.42). Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 4: Absolute differences between Voting [3] and PC_SSL, i.e. P^V_e(l, ∞) − P^{PC}_e(l, ∞), for k = 5. Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 5: Probability of error of Voting [3] (upper curve) and PC_SSL (lower curve) for k = 6 (l_6 = 8.7). Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

Figure 6: Absolute differences between Voting [3] and PC_SSL, i.e. P^V_e(l, ∞) − P^{PC}_e(l, ∞), for k = 6. Subfigures: (a) complex (δ = 0.25), (b) medium (δ = 1), (c) easy (δ = 5).

3 Appendix C. Source code

The Mathematica package [5], available to download from our website, contains the formulae presented in the manuscript. Besides, it also contains some experiments used to study the behaviour of the probability of error of both the Voting [3] and the PC_SSL learning algorithms.

3.1 l_k-related functions

1) The function LEQ[k] calculates the l_k value for a problem with k equiprobable classes.
   Input parameters: k — number of classes.
   Output: a real number, l_k.

2) The function LEQPlot[max] plots a figure of the l_k values for problems of {1..max} equiprobable classes. Figure 2a of the manuscript has been created with this function.
   Input parameters: max — maximum number of classes.
   Output: plot in the standard output.

3) LPriors[priors, nrep] uses a Monte Carlo method with nrep repetitions to approximate l_k for a problem with Length[priors] non-equiprobable classes whose priors are given by the variable priors.
   Input parameters: priors — list containing the class priors; nrep — number of repetitions.
   Output: a real number, l_k.

4) LSampling[k, samplesize, nrep] applies LPriors[priors, nrep] with nrep repetitions over a population of class priors of size samplesize, generated from a Dirichlet distribution with all the alpha hyper-parameters equal to 1.

   Input parameters: k — number of classes; samplesize — population size; nrep — number of repetitions for LPriors[...].
   Output: a list of l_k values, one per each sample of the population.

5) Eq. (8) of Theorem 4 corresponds to L3[n1, n2, n3]. It calculates l_3 for a ternary problem with priors n1, n2 and n3.
   Input parameters: n1 — class prior of class c_1; n2 — class prior of class c_2; n3 — class prior of class c_3.
   Output: a real number, l_3.

6) L3Plot[] plots L3[n1, n2, n3] assuming that n1 = η_1 and n2 = n3 = (1 − η_1)/2.
   Input parameters: none.
   Output: plot in the standard output.

3.2 Functions related to the probability of error

7) The function ZeroEB[k, maxL] calculates the probability of error for l = 0..maxL for a k-class problem when there is no intersection among the components.
   Input parameters: k — number of classes; maxL — maximum number of labelled examples.
   Output: summary of results in the standard output.
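For readers without Mathematica, the sketch below reproduces in Python the Monte Carlo idea behind LPriors[...] and the closed form (3) behind L3[...]; it is an illustration under the same definitions, not the package itself.

```python
# Monte Carlo estimate of l_k for arbitrary class priors (the idea behind LPriors[...])
# and the ternary closed form (3) (the idea behind L3[...]). Illustrative sketch only.
import random

def l_k_monte_carlo(priors, n_rep: int = 100_000, seed: int = 0) -> float:
    """Expected number of draws until (k - 1) distinct classes have been observed."""
    k = len(priors)
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rep):
        seen, draws = set(), 0
        while len(seen) < k - 1:
            seen.add(rng.choices(range(k), weights=priors)[0])
            draws += 1
        total += draws
    return total / n_rep

def l_3_closed_form(priors) -> float:
    """Equation (3) for a ternary problem."""
    return sum(p * (2.0 - p) / (1.0 - p) for p in priors)

priors = [0.5, 0.3, 0.2]
print(round(l_k_monte_carlo(priors), 3), round(l_3_closed_form(priors), 3))
# Both values should agree closely (about 2.68 for these priors).
```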

8) MCPCSSL[k, distance, sigma, maxL, nRep] uses a Monte Carlo method with nRep repetitions to approximate P_e(l, ∞) for l = 0..maxL assuming a mixture of Gaussian components. distance represents the distance between adjacent means and sigma is the variance of the components.
   Input parameters: k — number of classes; distance — distance between the means of adjacent components; sigma — variance of the Gaussian components; maxL — maximum number of labelled examples; nRep — number of repetitions for the Monte Carlo method.
   Output: summary of results in the standard output.

9) The function MCPCSSLBiased[k, distance, sigma, maxL, bias, nRep] uses a Monte Carlo method with nRep repetitions to approximate P_e(l, ∞) for l = 0..maxL assuming a generative model composed of a mixture of Gaussian components. distance represents the distance between adjacent means and sigma is the variance. In this simulation, the learnt model is also a mixture of Gaussian components, but its components are shifted a factor bias to the right, i.e. μ̂_i − μ_i = bias. This function corresponds to the second experiment of the manuscript.
   Input parameters: k — number of classes; distance — distance between the means of adjacent components; sigma — variance of the Gaussian components; maxL — maximum number of labelled examples; bias — bias between the learnt and the generative models; nRep — number of repetitions for the Monte Carlo method.
   Output: summary of results in the standard output.
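As an illustration of the kind of experiment MCPCSSL[...] runs, the sketch below (Python) performs a Monte Carlo estimate of P_e(l, ∞) for a one-dimensional Gaussian mixture, using a maximum-likelihood matching of labels to the (known) components; both the matching rule and the test-set estimation of the error are assumptions of this sketch, not the authors' implementation.

```python
# Illustrative Monte Carlo estimate of P_e(l, inf) for a 1-D Gaussian mixture, in the
# spirit of MCPCSSL[...]. The label-to-component matching used here (maximum likelihood
# on the labelled data, ties broken at random) is an assumption of this sketch.
import itertools
import math
import random

def estimate_pe(k: int, distance: float, sigma: float, max_l: int,
                n_rep: int = 200, n_test: int = 200, seed: int = 0):
    """P_e(l, inf) for l = 0..max_l, equiprobable priors, components N(j*distance, sigma^2).
    The mixture itself is assumed known, as if infinite unlabelled data were available."""
    rng = random.Random(seed)
    means = [j * distance for j in range(k)]

    def loglik(x, m):
        return -((x - m) ** 2) / (2.0 * sigma ** 2)

    errors = [0.0] * (max_l + 1)
    for _ in range(n_rep):
        hidden = list(range(k))
        rng.shuffle(hidden)                      # hidden class -> component correspondence
        labelled = []                            # (class, x) pairs revealed one by one
        for l in range(max_l + 1):
            if l > 0:
                c = rng.randrange(k)
                labelled.append((c, rng.gauss(means[hidden[c]], sigma)))
            # correspondence maximising the labelled-data likelihood
            # (with no labels every permutation ties, i.e. a random classifier)
            scored = [(sum(loglik(x, means[p[c]]) for c, x in labelled), p)
                      for p in itertools.permutations(range(k))]
            best = max(s for s, _ in scored)
            learnt = rng.choice([p for s, p in scored if s >= best - 1e-9])
            # error of the Bayes decision rule under the learnt correspondence
            wrong = 0
            for _ in range(n_test):
                c = rng.randrange(k)
                x = rng.gauss(means[hidden[c]], sigma)
                j_hat = max(range(k), key=lambda j: loglik(x, means[j]))
                wrong += (learnt.index(j_hat) != c)
            errors[l] += wrong / n_test
    return [round(e / n_rep, 3) for e in errors]

print(estimate_pe(k=3, distance=2.0, sigma=1.0, max_l=10))
# errors[0] should be close to (k - 1)/k, as Theorem 1 predicts, and decrease with l.
```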

10) MCVOTING[k, distance, sigma, maxL, nRep] uses a Monte Carlo method with nRep repetitions to approximate the probability of error of Voting for l = 0..maxL assuming a mixture of Gaussian components. distance represents the distance between adjacent means and sigma is the variance.
   Input parameters: k — number of classes; distance — distance between the means of adjacent components; sigma — variance of the Gaussian components; maxL — maximum number of labelled examples; nRep — number of repetitions for the Monte Carlo method.
   Output: summary of results in the standard output.

11) MCComparison[k, distance, sigma, maxL, nRep] uses a Monte Carlo method with nRep repetitions to approximate the probability of error of both PC_SSL and Voting for l = 0..maxL assuming a mixture of Gaussian components. distance represents the distance between adjacent means and sigma is the variance.
   Input parameters: k — number of classes; distance — distance between the means of adjacent components; sigma — variance of the Gaussian components; maxL — maximum number of labelled examples; nRep — number of repetitions for the Monte Carlo method.
   Output: summary of results in the standard output.

3.3 Examples of the executions of the package

Figure 7 and Figure 8 show two executions of the package to calculate l_k and l_3.

Figure 7: An example of the use of the l_k-related functions.

Figure 8: An example of the calculation of l_3.

Figure 9 shows an example of how, in cases of mutually non-intersecting components, the probability of error is calculated.

Figure 9: An execution of ZeroEB[k, maxL].

Whilst Figure 10 shows how the probability of error of PC_SSL is calculated with the Mathematica package, Figure 11 provides an example of the calculation of the probability of error of PC_SSL when the correct model assumption is not met.

Figure 10: Calculating the optimal probability of error using PC_SSL.

Figure 11: An execution of PC_SSL when the correct model assumption is not met.

Then, Figure 12 shows how the probability of error of the Voting strategy is calculated.

Figure 12: An example of the execution of the Voting strategy.

Finally, Figure 13 shows an example of how the comparison between both procedures can be made with the function MCComparison[...].

Figure 13: Comparing both learning algorithms using the function MCComparison[...].

References

[1] V. Castelli and T. Cover, "On the exponential value of labeled samples," Pattern Recognition Letters, vol. 16, pp. 105-111, 1995.

[2] R. P. Stanley, Enumerative Combinatorics. Wadsworth Publ. Co., 1986.

[3] H. Chen and L. Li, "Semisupervised multicategory classification with imperfect model," IEEE Transactions on Neural Networks, vol. 20, no. 10, 2009.

[4] E. Page, "Approximation to the cumulative normal function and its inverse for use on a pocket calculator," Applied Statistics, vol. 26, 1977.

[5] Wolfram Research, Inc., Mathematica.
