Supervised Learning: Non-parametric Estimation
Edmondo Trentin
March 18, 2018
Non-parametric Estimates

No assumptions are made on the form of the pdfs.¹ There are 3 major instances of non-parametric estimates:

1. We want $P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)}$ and we try and estimate the pdf $p(x|\omega_i)$ relying on the available data
2. We want $P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)}$ and we try and estimate $P(\omega_i|x)$ directly, that is the discriminant function (posterior probability)
3. The feature space $X$ is first projected onto a sub-space $Y$ (having reduced dimensionality) via $\phi: X \to Y$ such that, given $x$ and letting $y = \phi(x)$, computing $P(\omega_i|y) = \frac{p(y|\omega_i)P(\omega_i)}{p(y)}$ turns out to be easier than computing $P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)}$

¹ This is the fundamental thing: we recognize we do not know anything!
Non-parametric pdf estimation

Let us elaborate on the relation between pdf and probability: what is the probability P that a generic pattern x, drawn from a certain pdf $p(\cdot)$, belongs to an arbitrary region R of the feature space? We have:

$$P = \int_R p(x)\,dx \qquad (1)$$

Hence, P is an averaged (over R) version of p(x). If we can estimate P, then we can come up with an estimate of this averaged version of p(x). Assume we collected a random sample of n data, $x_1, \ldots, x_n$, independently and identically distributed (iid) according to p(x). If k out of the n data fall in R, we can estimate P via the relative frequency:

$$P \approx \frac{k}{n} \qquad (2)$$
If p(x) is continuous and has little variation over R (e.g. if R is small), then:

$$\int_R p(x)\,dx \approx p(x_0)\,V \qquad (3)$$

where $x_0 \in R$ and V is the volume of R. Thus far we have $P = \int_R p(x)\,dx$ and $P \approx k/n$, hence:

$$p(x_0) \approx \frac{k/n}{V} \qquad (4)$$

Practical problem: to retrieve $p(x_0)$ instead of its version averaged over R, V should shrink to 0. Since n is fixed in real-world cases, this would drive k to 0 as well, thereby making the estimate $p(x_0) \approx \frac{0}{V}$ useless.
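To make estimate (4) concrete, here is a minimal sketch in Python/NumPy (not part of the original slides): it counts the k sample points that fall in a fixed ball around $x_0$ and returns $(k/n)/V$. The function name `naive_density_estimate`, the choice of a ball as region R, and the toy Gaussian data are illustrative assumptions.

```python
import numpy as np
from math import gamma, pi

def naive_density_estimate(sample, x0, radius):
    """Estimate p(x0) as (k/n)/V, where k counts the sample points inside
    the ball of the given radius centred at x0 and V is the ball's volume."""
    sample = np.atleast_2d(sample)          # shape (n, d)
    n, d = sample.shape
    k = np.sum(np.linalg.norm(sample - x0, axis=1) <= radius)
    volume = (pi ** (d / 2) / gamma(d / 2 + 1)) * radius ** d
    return (k / n) / volume

# Toy check on 1-D data from N(0, 1); the true density at 0 is ~0.3989
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
print(naive_density_estimate(data, np.array([0.0]), radius=0.2))
```

As the slide points out, with n fixed the estimate degrades as the radius shrinks: eventually k drops to 0 and the ratio becomes useless.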
Some theoretical issues

Let $x_0$ be any pattern of interest. Let us imagine that the number n of available data may grow unbounded. We can then build a sequence of regions $R_1, R_2, \ldots, R_n, \ldots$ s.t. $x_0 \in R_i$ for all i, as n increases. We use $R_i$ to estimate $p(x_0)$ from i data. Let:

- $V_n$ be the volume of $R_n$
- $k_n$ be the number of data (out of n) in $R_n$
- $p_n(x_0)$ be the n-th estimate of $p(x_0)$, i.e., $p_n(x_0) = \frac{k_n/n}{V_n}$

Asymptotic necessary and sufficient conditions that ensure $p_n(x_0) \to p(x_0)$:

1. $\lim_{n \to \infty} V_n = 0$
2. $\lim_{n \to \infty} k_n = \infty$
3. $\lim_{n \to \infty} k_n/n = 0$ (to guarantee convergence)

How do we satisfy 1, 2 and 3? There are two complementary approaches that make sure $p_n(x_0) \to p(x_0)$ in probability:

1. Fix a proper volume, say $V_n = 1/\sqrt{n}$, and determine $k_n/n$ consequently (Parzen Window)
2. Fix $k_n$, e.g. $k_n = \sqrt{n}$, and determine $V_n$ consequently, in such a way that exactly $k_n$ patterns fall in $R_n$ ($k_n$-nearest neighbor)
Parzen Window

To fix ideas, let us assume that $R_n$ is a hypercube having edge $h_n$ (thus, $V_n = h_n^d$). We define a window function (or, kernel):

$$\varphi(u) = \begin{cases} 1 & |u_j| \le 1/2, \; j = 1, 2, \ldots, d \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

It is seen that $\varphi\!\left(\frac{x_0 - x_i}{h_n}\right)$ is a kernel having value 1 only if $x_i$ falls within the hypercube centered in $x_0$ and having edge $h_n$. The number $k_n$ of data in this hypercube is:

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x_0 - x_i}{h_n}\right) \qquad (6)$$

Bearing in mind that $p_n(x_0) = \frac{k_n/n}{V_n}$, we have:

$$p_n(x_0) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x_0 - x_i}{h_n}\right) \qquad (7)$$
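A minimal sketch of the hypercube Parzen estimate (7), assuming Python/NumPy; the names `hypercube_kernel` and `parzen_hypercube` and the 2-D Gaussian toy data are illustrative, not from the slides.

```python
import numpy as np

def hypercube_kernel(u):
    """phi(u) = 1 if every component satisfies |u_j| <= 1/2, else 0 (eq. 5)."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_hypercube(sample, x0, h):
    """p_n(x0) = (1/n) * sum_i (1/h^d) * phi((x0 - x_i)/h)  (eq. 7)."""
    sample = np.atleast_2d(sample)          # shape (n, d)
    n, d = sample.shape
    volume = h ** d                          # V_n = h^d
    return np.sum(hypercube_kernel((x0 - sample) / h)) / (n * volume)

# Toy usage on 2-D standard-normal data; the true density at the origin is ~0.159
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 2))
print(parzen_hypercube(data, np.zeros(2), h=0.5))
```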
The approach can be extended to window functions $\varphi(\cdot)$ of different form. The equation:

$$p_n(x_0) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x_i - x_0}{h_n}\right) \qquad (8)$$

tells us that, using the n available data, the estimate $p_n(\cdot)$ of the unknown pdf $p(\cdot)$ is obtained by averaging the kernel function $\varphi(\cdot)$ over $x_1, \ldots, x_n$, i.e. by interpolation.² We need a $\varphi(\cdot)$ such that $p_n(\cdot)$ is a pdf, that is $p_n(\cdot) \ge 0$ and $\int p_n(x)\,dx = 1$. If $V_n = h_n^d$, this is guaranteed by the (sufficient) conditions:

1. $\varphi(u) \ge 0 \;\; \forall u \in \mathbb{R}^d$
2. $\int \varphi(u)\,du = 1$

Exercise: check it out. A popular and effective choice for $\varphi(\cdot)$ is the Gaussian kernel, namely:

$$\varphi(u) = N(u; 0, I) \qquad (9)$$

² In fact, I can think of each kernel as a window centered in $x_i$ and evaluated at $x_0$, since $\varphi(\cdot)$ is a symmetric function.
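Likewise, a hedged sketch of the Parzen estimate with the Gaussian kernel of (9), in Python/NumPy; the function names and the toy 1-D data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(u):
    """phi(u) = N(u; 0, I): standard multivariate Gaussian density at u (eq. 9)."""
    u = np.atleast_2d(u)
    d = u.shape[1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)

def parzen_gaussian(sample, x0, h):
    """p_n(x0) = (1/n) * sum_i (1/h^d) * phi((x_i - x0)/h)  (eq. 8)."""
    sample = np.atleast_2d(sample)
    n, d = sample.shape
    return np.sum(gaussian_kernel((sample - x0) / h)) / (n * h ** d)

# Toy usage on 1-D data from N(0, 1); the true density at 0 is ~0.3989
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
print(parzen_gaussian(data, np.array([0.0]), h=0.3))
```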
Q: how deep is the impact of the window width $h_n$ on the quality of the estimate $p_n(x)$? Let us define:

$$\delta_n(x) = \frac{1}{V_n}\, \varphi\!\left(\frac{x}{h_n}\right) \qquad (10)$$

Since we can write

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right) \qquad (11)$$

we have:

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(x - x_i) \qquad (12)$$

If $h_n$ is large, then $\delta_n(x)$ is very smooth (having small variation), and $p_n(x)$ is yielded by the superposition of wide, smooth functions; if $h_n$ is small, then $\delta_n(x)$ is very peaked around $x = x_i$. Since $\int \delta_n(x - x_i)\,dx = \frac{1}{V_n} \int \varphi\!\left(\frac{x - x_i}{h_n}\right) dx = \int \varphi(u)\,du = 1$, if $h_n \to 0$ then $\delta_n(x - x_i)$ converges to a Dirac delta centered in $x_i$.
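The bandwidth effect can be eyeballed numerically. Below is a small, self-contained sketch (illustrative, not from the slides): the same 1-D Gaussian-kernel Parzen estimate is evaluated on a coarse grid for a tiny, a moderate, and a huge bandwidth.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=200)                 # 1-D sample from N(0, 1)
grid = np.linspace(-3, 3, 7)

def parzen_1d(x, sample, h):
    """1-D Parzen estimate with Gaussian kernel and bandwidth h."""
    u = (x[:, None] - sample[None, :]) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi), axis=1) / h

for h in (0.05, 0.5, 5.0):
    print(f"h = {h}:", np.round(parzen_1d(grid, data, h), 3))
# A tiny h yields a spiky, noisy curve; a huge h over-smooths it towards a flat shape.
```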
[Figure: Parzen estimates $p_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_n}\,\varphi\!\left(\frac{x - x_i}{h_n}\right)$ of the pdf $p(x) = N(x; 0, 1)$, using the Gaussian kernel $\varphi(u) = N(u; 0, 1)$ with $h_n = h_1/\sqrt{n}$.]
[Figure: Parzen estimates of the piecewise-constant pdf $p(x) = 1$ for $-2.5 < x < -2$, $p(x) = 0.25$ for $0 < x < 2$, $p(x) = 0$ otherwise, using the Gaussian kernel $\varphi(u) = N(u; 0, 1)$ with $h_n = h_1/\sqrt{n}$.]
$k_n$-Nearest Neighbor

In Parzen Window, the asymptotic conditions are guaranteed by taking $h_n = h_1/\sqrt{n}$. Unfortunately, in the finite-sample case the estimate is affected by the choice of the initial edge (bandwidth) $h_1$:

- if $h_1$ is too small, then $p_n(\cdot) \approx 0$ (a spiky estimate, null almost everywhere)
- if $h_1$ is too big, then $p_n(\cdot) \approx E[p(\cdot)]$ (an over-smoothed estimate)

The inconvenience can be overcome by choosing the volume according to the nature of the data:

1. define $k_n$ as a function of n
2. to estimate $p(x_0)$, consider a small ball around $x_0$ and let it grow until it embraces $k_n$ data (the $k_n$ nearest neighbors)
3. eventually, set $p_n(x_0) = \frac{k_n/n}{V_n}$, as we did in the generic non-parametric pdf estimation setup.
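A minimal sketch of this $k_n$-NN density estimate in Python/NumPy; the function name, the choice $k_n = \sqrt{n}$, and the toy data are illustrative assumptions.

```python
import numpy as np
from math import gamma, pi

def knn_density_estimate(sample, x0, k):
    """k_n-NN estimate: grow a ball around x0 until it contains k points,
    then return p_n(x0) = (k/n) / V_n, with V_n the volume of that ball."""
    sample = np.atleast_2d(sample)
    n, d = sample.shape
    dists = np.sort(np.linalg.norm(sample - x0, axis=1))
    radius = dists[k - 1]                    # distance to the k-th nearest neighbor
    volume = (pi ** (d / 2) / gamma(d / 2 + 1)) * radius ** d
    return (k / n) / volume

# Toy usage: k_n = sqrt(n), 1-D standard normal, true density at 0 is ~0.3989
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
k = int(np.sqrt(len(data)))
print(knn_density_estimate(data, np.array([0.0]), k))
```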
Good asymptotic behavior $p_n(x_0) \to p(x_0)$ is guaranteed once these two necessary and sufficient conditions hold:

1. $\lim_{n \to \infty} k_n = \infty$
2. $\lim_{n \to \infty} k_n/n = 0$

For instance, by letting $k_n = \sqrt{n}$ the conditions are satisfied. Moreover:

$$V_n = \frac{1}{p(x_0)\sqrt{n}} \qquad (13)$$

that is, exactly like in Parzen Window we have $V_n = V_1/\sqrt{n}$, but $V_1 = 1/p(x_0)$, i.e., $V_1$ is not chosen arbitrarily: it is uniquely determined by the nature of the data. In practice, nonetheless, we usually set $k_n = k_1\sqrt{n}$, where $k_1$ is determined empirically.
Non-parametric Decision Rule

Let $\{x_1, x_2, \ldots, x_n\}$ be a supervised sample. Take a ball (of volume V) around $x_0$, s.t. it embraces k patterns. Among these k patterns, let $k_i$ be the number of patterns of class $\omega_i$. Since $p_n(x_0) = \frac{k/n}{V}$, we have $p_n(x_0, \omega_i) = \frac{k_i/n}{V}$. Hence, an estimate $P_n(\omega_i|x_0)$ of $P(\omega_i|x_0)$ is given by:

$$P_n(\omega_i|x_0) = \frac{p_n(x_0, \omega_i)}{\sum_{j=1}^{c} p_n(x_0, \omega_j)} = \frac{k_i/(nV)}{\sum_{j=1}^{c} k_j/(nV)} = \frac{k_i}{k}$$

i.e., the fraction of data of class $\omega_i$ that falls within the ball under consideration. The ball (that is, its volume V) may be fixed following either the Parzen Window or the $k_n$-nearest neighbor philosophy.
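A hedged sketch of this posterior estimate $P_n(\omega_i|x_0) = k_i/k$, growing the ball $k_n$-NN style; the function name and the two-class toy data are illustrative, not from the slides.

```python
import numpy as np

def posterior_estimate(sample, labels, x0, k):
    """Estimate P(omega_i | x0) = k_i / k, where the ball around x0 is grown
    until it contains the k nearest labeled patterns."""
    sample = np.atleast_2d(sample)
    labels = np.asarray(labels)
    nearest = np.argsort(np.linalg.norm(sample - x0, axis=1))[:k]
    classes = np.unique(labels)
    return {c: np.mean(labels[nearest] == c) for c in classes}

# Toy usage with two Gaussian classes (illustrative data, not from the slides)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(+1, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(posterior_estimate(x, y, np.array([0.8, 0.8]), k=15))
```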
Nearest Neighbor Algorithm

The aforementioned decision rule, when applied along with the $k_n$-NN philosophy and $k_n = 1$, takes the (simple, yet effective) form of a popular classification algorithm. Let $Y_n = \{x_1, x_2, \ldots, x_n\}$ be the supervised sample, and let $x_0$ be a pattern to be classified.

Nearest-neighbor decision rule: assign $x_0$ to class $\omega_i$ if and only if (i) $x'_n \in Y_n$ is the nearest³ pattern to $x_0$ among all those in $Y_n$, and (ii) $x'_n$ belongs to class $\omega_i$.

It is possible to show that, asymptotically, the performance of NN for $n \to \infty$ is as good as that of $k_n$-NN. Intuitively, this is due to the fact that the probability that $x'_n$ belongs to $\omega_i$ is $P(\omega_i|x'_n)$. If we had many data (say, $n \to \infty$), then $P(\omega_i|x'_n) \approx P(\omega_i|x_0)$. Using NN, indeed, I assign $x_0$ to $\omega_i$ as if $\omega_i$ had taken place at $x_0$ with the highest probability, i.e. $P(\omega_i|x_0)$.

³ According to the Euclidean distance.
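A minimal sketch of the 1-NN rule in Python/NumPy; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def nearest_neighbor_classify(sample, labels, x0):
    """1-NN rule: return the label of the pattern in the sample that is
    nearest to x0 according to the Euclidean distance."""
    sample = np.atleast_2d(sample)
    nearest = int(np.argmin(np.linalg.norm(sample - x0, axis=1)))
    return labels[nearest]

# Toy usage (illustrative data, not from the slides)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(nearest_neighbor_classify(x, y, np.array([1.2, 0.9])))   # most likely class 1 here
```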
K-Nearest Neighbor

Given the sample $Y_n = \{x_1, x_2, \ldots, x_n\}$ and the pattern $x_0$ to be classified, we can apply this K-NN decision rule: consider the k patterns in $Y_n$ that are the nearest to $x_0$ (in terms of Euclidean distance), and assign $x_0$ to $\omega_i$ if and only if the latter is the class having the highest relative frequency within this sub-sample of k patterns (a minimal sketch of the rule follows the remarks below).

Remarks:

- K-NN is in the spirit of $k_n$-NN
- while $k_n$-NN estimates a pdf, K-NN estimates $P(\omega_i|x)$
- for $n \to \infty$, the asymptotic behavior of K-NN tends to be optimal (i.e., Bayesian)
- in 2-class cases, an odd value for k is used in order to break ties
- more accurate decisions would be taken as $k \to \infty$, but in the finite-sample case we cannot move too far away from $x_0$ (trade-off)
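The sketch referred to above, in Python/NumPy; the function name and the two-class toy data are illustrative, not from the slides.

```python
import numpy as np
from collections import Counter

def knn_classify(sample, labels, x0, k):
    """K-NN rule: take the k nearest patterns to x0 (Euclidean distance) and
    return the class with the highest relative frequency among them."""
    sample = np.atleast_2d(sample)
    nearest = np.argsort(np.linalg.norm(sample - x0, axis=1))[:k]
    votes = Counter(np.asarray(labels)[nearest])
    return votes.most_common(1)[0][0]

# Toy usage with two Gaussian classes and an odd k to break ties (illustrative data)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(+1, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(knn_classify(x, y, np.array([0.5, 0.5]), k=7))
```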