2.1 Optimization formulation of k-means


MGMT 69000: Topics in High-dimensional Data Analysis, Fall 2016

Lecture 2: k-means Clustering

Lecturer: Jiaming Xu. Scribe: Jiaming Xu. September 2, 2016.

Outline
- Optimization formulation of k-means
- Convergence of k-means
- Failure cases of k-means

2.1 Optimization formulation of k-means

Recall that in the data clustering problem we are given $n$ data points $x_1, x_2, \ldots, x_n \in \mathcal{X}$ and are interested in partitioning them into $k$ clusters.

A pseudo-distance $d$ is a mapping from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}_+$, i.e., $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$.

$k$-partition of $[n]$: $S_1 \cup S_2 \cup \cdots \cup S_k = [n]$ such that $S_i \cap S_j = \emptyset$ for all $i \neq j$.

Center of a group $S \subseteq [n]$:
$$c(S) \in \arg\min_{z \in \mathcal{X}} \Big\{ \sum_{i \in S} d(x_i, z) \Big\}.$$

Cost of a $k$-partition $S = (S_1, \ldots, S_k)$:
$$c(S) := \sum_{a=1}^{k} \sum_{i \in S_a} d\big(x_i, c(S_a)\big).$$

Seek a $k$-partition $S$ that solves
$$\min_{S} \; c(S). \qquad (2.1)$$

Note: It is important to constrain the number of clusters $k$ in the minimization problem (2.1). If instead the number of clusters is unconstrained and one minimizes $c(S)$ over all possible partitions of $[n]$, then the minimizer is trivially given by treating each data point as an individual cluster. Determining a good choice of $k$ from data is a non-trivial task in general.
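As a small illustration of the cost $c(S)$ and of the note on constraining $k$, the following Python sketch evaluates the cost of a partition. It assumes the quadratic distance $d(x, z) = \|x - z\|_2^2$, for which the center of a group is its mean. The names `sq_dist`, `mean_center`, and `partition_cost`, as well as the toy data, are hypothetical and chosen only for this example.

```python
def sq_dist(x, z):
    """Quadratic distance d(x, z) = ||x - z||_2^2 (assumed for this sketch)."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def mean_center(points):
    """Center of a group under the quadratic distance: the coordinate-wise mean."""
    m = len(points)
    return [sum(col) / m for col in zip(*points)]


def partition_cost(data, partition):
    """Cost c(S): sum over clusters of distances from members to the cluster center."""
    total = 0.0
    for group in partition:
        pts = [data[i] for i in group]
        center = mean_center(pts)
        total += sum(sq_dist(x, center) for x in pts)
    return total


# Toy data: four points at the corners of a long rectangle (hypothetical).
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

cost_two = partition_cost(data, [[0, 1], [2, 3]])            # k = 2
cost_singletons = partition_cost(data, [[0], [1], [2], [3]])  # k = n

print(cost_two, cost_singletons)
```

With every point in its own cluster the cost drops to zero, which is why minimizing over partitions with an unconstrained number of clusters says nothing about the data.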

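Algorithm 1 k-means clustering
1: Input: Data $\{x_i\}_{i=1}^{n}$ and initial partition $S$.
2: Output: New partition $S'$.
3: (Update step): let $c_a = c(S_a)$ for $1 \le a \le k$.
4: (Assignment step): for $1 \le a \le k$,
$$S'_a = \Big\{ i \in [n] : a = \arg\min_{b \in [k]} d(x_i, c_b) \Big\}.$$
5: Iterate steps 1-4 until $|c(S') - c(S)| \le \epsilon$.

Note: In the update step of Algorithm 1, we need the subroutine
$$c(S) \in \arg\min_{z \in \mathcal{X}} \Big\{ \sum_{i \in S} d(x_i, z) \Big\}.$$

Quadratic distance. If $\mathcal{X} = \mathbb{R}^n$ and $d(x, y) = \|x - y\|_2^2$, then
$$c(S) = \frac{1}{|S|} \sum_{i \in S} x_i.$$
See the figure in MacKay's book [Mac03] for an illustration of k-means clustering with quadratic distance.

Spherical distance. If $\mathcal{X} = S^{n-1} := \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$ and $d(x, y) = 1 - \langle x, y \rangle$, then
$$c(S) = \frac{\sum_{i \in S} x_i}{\big\| \sum_{i \in S} x_i \big\|_2}.$$

Kullback-Leibler divergence (KL divergence). If $\mathcal{X} = \mathcal{P}^{m-1} := \{x \in \mathbb{R}^m_+ : \sum_i x(i) = 1\}$ and
$$d(x, y) = D(x \| y) := \sum_{j=1}^{m} x(j) \log \frac{x(j)}{y(j)},$$
then
$$c(S) = \frac{1}{|S|} \sum_{i \in S} x_i.$$

Proof. (See the notes on optimization on the course website for more details.) To solve $\min_{z \in \mathcal{X}} \big\{ \sum_{i \in S} d(x_i, z) \big\}$, consider its Lagrangian function
$$L(z, \lambda) := \sum_{i \in S} d(x_i, z) + \lambda \Big( \sum_{j} z(j) - 1 \Big). \qquad (2.2)$$
Differentiating $L(z, \lambda)$ with respect to $z(j)$ gives
$$\frac{\partial L(z, \lambda)}{\partial z(j)} = -\sum_{i \in S} x_i(j) \frac{1}{z(j)} + \lambda.$$
Setting $\frac{\partial L(z, \lambda)}{\partial z(j)} = 0$ gives
$$z(j) = \frac{1}{\lambda} \sum_{i \in S} x_i(j), \qquad 1 \le j \le m.$$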

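Since $\sum_j z(j) = 1$, it follows that $\lambda = |S|$, and hence the optimal $z$ is given by
$$z = \frac{1}{|S|} \sum_{i \in S} x_i.$$

Note: Properties of $D(x \| y)$:

1. $D(x \| y) \ge 0$, with equality if and only if $x = y$.

Proof.
$$D(x \| y) = \sum_j y(j) \, \frac{x(j)}{y(j)} \log \frac{x(j)}{y(j)} \;\ge\; \Big( \sum_j y(j) \frac{x(j)}{y(j)} \Big) \log \Big( \sum_j y(j) \frac{x(j)}{y(j)} \Big) = 0,$$
where the inequality follows from the convexity of $x \log x$ and Jensen's inequality. Because $x \log x$ is strictly convex, equality holds if and only if $x(j)/y(j)$ does not depend on $j$, i.e., $x = y$.

2. $D(x \| y) \neq D(y \| x)$ in general (convince yourself by constructing examples).

3. $D(x \| y)$ is convex in $(x, y)$.

Proof. One can check from the definition that for any convex function $f : \mathbb{R}_+ \to \mathbb{R}$, the map $(p, q) \mapsto q f\big(\tfrac{p}{q}\big)$ is convex on $\mathbb{R}_+^2$. Let $f(x) = x \log x$. It follows that $(p, q) \mapsto p \log \tfrac{p}{q}$ is convex on $\mathbb{R}_+^2$. Since $D(x \| y)$ is a sum of such terms over the coordinates $j$, it is jointly convex in $x$ and $y$.

Here is an alternative proof of the result obtained from (2.2), namely that the center under the KL divergence is the average, using the non-negativity of $D(x \| y)$. Let $y = \frac{1}{|S|} \sum_{i \in S} x_i$. Then for any $z \in \mathcal{P}^{m-1}$,
$$\sum_{i \in S} \big( D(x_i \| z) - D(x_i \| y) \big) = \sum_{i \in S} \sum_{j=1}^{m} x_i(j) \log \frac{y(j)}{z(j)} = |S| \sum_{j=1}^{m} y(j) \log \frac{y(j)}{z(j)} = |S| \, D(y \| z) \ge 0.$$

2.2 Convergence of k-means

Proposition.
1. If $S^{t+1} = S^{t}$, then $S^{t+l} = S^{t}$ for all $l \ge 1$.
2. The cost $c(S^{t})$ is non-increasing in $t$.
3. k-means halts after at most $k^n$ iterations.

Proof. Claim 1 follows immediately from the algorithm description. For Claim 2, define
$$C(S, c) = \sum_{a=1}^{k} \sum_{i \in S_a} d(x_i, c_a).$$
Denote the centers at iteration $t$ by $c^{t}$. Then by the update step, $c(S^{t}) = \min_{c} C(S^{t}, c) = C(S^{t}, c^{t})$, and by the assignment step, $C(S^{t+1}, c^{t}) = \min_{S} C(S, c^{t})$. It follows that
$$c(S^{t+1}) \le C(S^{t+1}, c^{t}) \le C(S^{t}, c^{t}) \le c(S^{t}).$$
Claim 3 follows from Claim 1 and the fact that there are at most $k^n$ different $k$-partitions of $[n]$.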
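The alternating structure used in the proof of Claim 2 can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: it assumes the quadratic distance with the mean as center, breaks ties in the assignment step toward the lowest cluster index, and does not handle clusters that become empty. The names `kmeans`, `sq_dist`, `mean_center`, and the toy data are hypothetical.

```python
def sq_dist(x, z):
    """Quadratic distance d(x, z) = ||x - z||_2^2 (assumed for this sketch)."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def mean_center(points):
    """Center of a group under the quadratic distance: the coordinate-wise mean."""
    m = len(points)
    return [sum(col) / m for col in zip(*points)]


def kmeans(data, partition, eps=0.0, max_iter=100):
    """Alternate update and assignment steps; record the cost c(S^t) at each iteration."""
    costs = []
    for _ in range(max_iter):
        if any(len(group) == 0 for group in partition):
            raise ValueError("empty cluster: not handled in this sketch")
        # Update step: c_a = c(S_a) for each cluster a.
        centers = [mean_center([data[i] for i in group]) for group in partition]
        # Cost of the current partition with its own centers, i.e. c(S^t).
        costs.append(sum(sq_dist(data[i], c)
                         for group, c in zip(partition, centers) for i in group))
        # Stop once the cost decrease is at most eps.
        if len(costs) >= 2 and costs[-2] - costs[-1] <= eps:
            break
        # Assignment step: send each point to its nearest center.
        new_partition = [[] for _ in partition]
        for i, x in enumerate(data):
            nearest = min(range(len(centers)), key=lambda a: sq_dist(x, centers[a]))
            new_partition[nearest].append(i)
        partition = new_partition
    return partition, costs


# Toy one-dimensional data with a deliberately poor initial partition (hypothetical).
data = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
final_partition, costs = kmeans(data, [[0, 1, 2, 3], [4, 5]])
print(final_partition, costs)
```

The recorded costs never increase from one iteration to the next, in line with Claim 2, and the run stops as soon as the partition no longer changes.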

Note: From the proof of Claim 2, the k-means algorithm can be viewed as an alternating minimization algorithm, which minimizes the cost function $C(S, c)$ over $k$-partitions $S$ and cluster centers $c$ in an alternating fashion.

Note: Although the k-means algorithm halts after at most $k^n$ iterations, the outcome of the algorithm depends on the initial condition. See Figure 20.4 in MacKay's book [Mac03] for an example.

2.3 Failure cases of k-means

Cluster sizes are unbalanced. See Figure 20.5 in MacKay's book [Mac03].

The distance metric $d$ does not capture the shape of the clusters well. See Figure 20.6 in MacKay's book [Mac03].

Note: To be precise, the two failure cases listed above are caused by an improper choice of the objective function in (2.1).
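The dependence on the initial condition noted in Section 2.2 can be seen on a four-point toy example. The sketch below is illustrative only: it assumes the quadratic distance with the mean as center, and the names `run_kmeans`, `sq_dist`, `mean_center`, and the data are hypothetical. Two different initial partitions are both fixed points of the iteration, yet their costs differ by a large factor.

```python
def sq_dist(x, z):
    """Quadratic distance d(x, z) = ||x - z||_2^2 (assumed for this sketch)."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def mean_center(points):
    """Center of a group under the quadratic distance: the coordinate-wise mean."""
    m = len(points)
    return [sum(col) / m for col in zip(*points)]


def run_kmeans(data, partition, max_iter=100):
    """Iterate update and assignment steps until the partition stops changing.

    Assumes no cluster becomes empty (true for the toy data below).
    """
    for _ in range(max_iter):
        centers = [mean_center([data[i] for i in group]) for group in partition]
        new_partition = [[] for _ in partition]
        for i, x in enumerate(data):
            nearest = min(range(len(centers)), key=lambda a: sq_dist(x, centers[a]))
            new_partition[nearest].append(i)
        if new_partition == partition:
            break
        partition = new_partition
    centers = [mean_center([data[i] for i in group]) for group in partition]
    cost = sum(sq_dist(data[i], c) for group, c in zip(partition, centers) for i in group)
    return partition, cost


# Corners of a 10-by-2 rectangle (hypothetical data).
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good_partition, good_cost = run_kmeans(data, [[0, 1], [2, 3]])  # split left / right
bad_partition, bad_cost = run_kmeans(data, [[0, 2], [1, 3]])    # split bottom / top

print(good_partition, good_cost)
print(bad_partition, bad_cost)
```

Starting from the bottom/top split, every point is already closest to its own cluster center, so the iteration never leaves that partition even though the left/right split has a far lower cost.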

Bibliography

[Mac03] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press.
