New Trends in Optimization and Computational Algorithms, December 9-13, 2001

Convex Optimization in Classification Problems

Laurent El Ghaoui
Department of EECS, UC Berkeley
elghaoui@eecs.berkeley.edu

goal

- the connection between classification and LP / convex QP has a long history (Vapnik, Mangasarian, Bennett, etc.)
- recent progress in convex optimization: conic and semidefinite programming; geometric programming; robust optimization
- we'll outline some connections between convex optimization and classification problems
- joint work with M. Jordan, N. Cristianini, G. Lanckriet, C. Bhattacharyya

outline

- convex optimization
- SVMs and robust linear programming
- minimax probability machine
- kernel optimization

convex optimization

- standard form:

      min_x  f_0(x)  subject to  f_i(x) <= 0,  i = 1, ..., m

  with f_0, ..., f_m convex
- arises in many applications; convexity is not always recognized in practice
- large classes of convex problems can be solved in polynomial time (Nesterov & Nemirovski, 1990)

conic optimization

- a special class of convex problems:

      min_x  c^T x  subject to  Ax = b,  x in K

  where K is a cone, a direct product of the following building blocks:
  - K = R^n_+ (nonnegative orthant): linear programming
  - K = {(y, t) in R^(n+1) : t >= ||y||_2} (second-order cone): second-order cone programming, quadratic programming
  - K = {X in R^(n x n) : X = X^T, X positive semidefinite}: semidefinite programming
- fact: conic problems can be solved in polynomial time (Nesterov & Nemirovski, 1990)

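Not part of the original slides: a minimal sketch of the two simplest conic forms above, using the cvxpy modeling package; the problem data (A, b, c and the norm-ball radius) are made up for illustration.

    import numpy as np
    import cvxpy as cp

    np.random.seed(0)
    n, m = 5, 3
    A = np.random.randn(m, n)
    x_feas = np.abs(np.random.randn(n))
    b = A @ x_feas                      # makes both problems below feasible
    c = np.abs(np.random.randn(n))      # nonnegative cost keeps the LP bounded

    x = cp.Variable(n)

    # K = nonnegative orthant: a linear program
    lp = cp.Problem(cp.Minimize(c @ x), [A @ x == b, x >= 0])
    lp.solve()

    # K = second-order cone: same cost over an affine set intersected with a
    # norm ball (cvxpy canonicalizes the norm constraint into an SOC constraint)
    socp = cp.Problem(cp.Minimize(c @ x),
                      [A @ x == b, cp.norm(x - x_feas, 2) <= 1.0])
    socp.solve()

    print("LP optimal value:  ", lp.value)
    print("SOCP optimal value:", socp.value)
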
conic duality

- the dual of the conic problem

      min_x  c^T x  subject to  Ax = b,  x in K

  is

      max_y  b^T y  subject to  c - A^T y in K*

  where

      K* = {z : <z, x> >= 0 for all x in K}

  is the cone dual to K
- for the cones mentioned before, and direct products of them, K* = K

robust optimization

- conic problem in dual form:

      max_y  b^T y  subject to  c - A^T y in K

- what if A is unknown-but-bounded, say A in 𝒜, where the set 𝒜 is given?
- robust counterpart:

      max_y  b^T y  subject to  c - A^T y in K  for all A in 𝒜

- still convex, but tractability depends on 𝒜
- there are systematic ways to approximate it (get lower bounds); for large classes of 𝒜, the approximation is exact

example: robust LP

- linear program:

      min_x  c^T x  subject to  a_i^T x <= b_i,  i = 1, ..., m

- assume the a_i's are unknown-but-bounded in ellipsoids

      E_i := { a : (a - â_i)^T Γ_i^{-1} (a - â_i) <= 1 }

  where â_i is the center and Γ_i > 0 is the shape matrix
- robust LP:

      min_x  c^T x  subject to  a_i^T x <= b_i  for all a_i in E_i,  i = 1, ..., m

robust LP: SOCP representation

- the robust LP is equivalent to

      min_x  c^T x  subject to  â_i^T x + ||Γ_i^{1/2} x||_2 <= b_i,  i = 1, ..., m

  a second-order cone program!
- interpretation: robustness smoothes the boundary of the feasible set

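A hedged sketch (not from the slides) of the robust LP in its SOCP form, with cvxpy and made-up data; each Γ_i is represented by a Cholesky factor L_i with L_i L_i^T = Γ_i, so that ||L_i^T x||_2 = ||Γ_i^{1/2} x||_2.

    import numpy as np
    import cvxpy as cp

    np.random.seed(0)
    n, m = 4, 6
    c = np.random.randn(n)
    a_hat = 0.1 * np.random.randn(m, n)             # ellipsoid centers
    b = np.ones(m)                                  # right-hand sides; x = 0 is feasible
    # shape matrices Gamma_i > 0, factored as Gamma_i = L_i L_i^T
    Ls = [np.linalg.cholesky(M @ M.T + np.eye(n)) for M in np.random.randn(m, n, n)]

    x = cp.Variable(n)
    # robust counterpart of  a^T x <= b_i  for all a in the ellipsoid E_i:
    #     a_hat_i^T x + ||Gamma_i^{1/2} x||_2 <= b_i
    constraints = [a_hat[i] @ x + cp.norm(Ls[i].T @ x, 2) <= b[i] for i in range(m)]
    prob = cp.Problem(cp.Minimize(c @ x), constraints)
    prob.solve()
    print("robust optimal x:", x.value)
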
LP with Gaussian coefficients

- assume a ~ N(â, Γ); then, for given x,

      Prob{a^T x <= b} >= 1 - ε

  is equivalent to

      â^T x + κ ||Γ^{1/2} x||_2 <= b,   where κ = Φ^{-1}(1 - ε)

  and Φ is the c.d.f. of N(0, 1)
- hence, LPs with Gaussian coefficients can be solved using second-order cone programming
- the resulting SOCP is similar to the one obtained with ellipsoidal uncertainty

LP with random coefficients

- assume a ~ (â, Γ), i.e. the distribution of a has mean â and covariance matrix Γ, but is otherwise unknown
- by a Chebychev inequality, requiring

      Prob{a^T x <= b} >= 1 - ε

  for every such distribution is equivalent to

      â^T x + κ ||Γ^{1/2} x||_2 <= b,   where κ = sqrt((1 - ε)/ε)

- leads to an SOCP similar to the ones obtained previously

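A small illustration (not from the slides, and assuming scipy is available) of how the two safety factors κ from the last two slides compare for the same ε:

    import numpy as np
    from scipy.stats import norm

    eps = 0.05
    kappa_gauss = norm.ppf(1 - eps)            # Phi^{-1}(1 - eps), Gaussian case
    kappa_cheby = np.sqrt((1 - eps) / eps)     # Chebychev case, distribution-free

    print(kappa_gauss)   # about 1.64
    print(kappa_cheby)   # about 4.36
    # both enter the same SOCP constraint:  a_hat^T x + kappa * ||Gamma^{1/2} x||_2 <= b
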
outline

- convex optimization
- SVMs and robust linear programming
- minimax probability machine
- kernel optimization

SVMs: setup

- given data points x_i with labels y_i = ±1, i = 1, ..., N
- two-class linear classification with a support vector machine:

      min_a  ||a||_2  subject to  y_i (a^T x_i - b) >= 1,  i = 1, ..., N

- amounts to selecting one separating hyperplane among the many possible
- the problem is feasible iff there exists a separating hyperplane between the two classes

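A minimal sketch of the hard-margin problem above with cvxpy (not part of the original slides); the two linearly separable point clouds are invented for the example.

    import numpy as np
    import cvxpy as cp

    np.random.seed(0)
    # two linearly separable point clouds (synthetic data)
    X_pos = np.random.randn(20, 2) + np.array([3.0, 3.0])
    X_neg = np.random.randn(20, 2) - np.array([3.0, 3.0])
    X = np.vstack([X_pos, X_neg])
    y = np.hstack([np.ones(20), -np.ones(20)])

    a = cp.Variable(2)
    b = cp.Variable()
    # maximum-margin separating hyperplane:  min ||a||_2  s.t.  y_i (a^T x_i - b) >= 1
    constraints = [cp.multiply(y, X @ a - b) >= 1]
    prob = cp.Problem(cp.Minimize(cp.norm(a, 2)), constraints)
    prob.solve()
    print("a =", a.value, "  b =", b.value)
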
SVMs: robust optimization interpretation

- interpretation: SVMs are a way to handle noise in the data points
- assume each data point is unknown-but-bounded in a sphere of radius ρ centered at x_i
- find the largest ρ such that separation is still possible between the two classes of perturbed points

variations

- other data noise models can be used:
  - hypercube uncertainty (gives rise to an LP)
  - ellipsoidal uncertainty (gives rise to a QP)
  - probabilistic uncertainty, Gaussian or Chebychev (gives rise to a QP)

separation with hypercube uncertainty

- assume each data point is unknown-but-bounded in a hypercube C_i:

      x_i in C_i := { x̂_i + ρ P u : ||u||_∞ <= 1 }

  where the centers x̂_i and the shape matrix P are given
- robust separation leads to a linear program:

      min_a  ||P a||_1  subject to  y_i (a^T x̂_i - b) >= 1,  i = 1, ..., N

separation with ellipsoidal uncertainty

- assume each data point is unknown-but-bounded in an ellipsoid E_i:

      x_i in E_i := { x̂_i + ρ P u : ||u||_2 <= 1 }

  where the centers x̂_i and the shape matrix P are given
- robust separation leads to a QP:

      min_a  ||P a||_2  subject to  y_i (a^T x̂_i - b) >= 1,  i = 1, ..., N

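The two robust separation problems on the last two slides differ only in the norm applied to P a. A hedged cvxpy sketch (not from the slides) with made-up nominal points and a made-up shape matrix P:

    import numpy as np
    import cvxpy as cp

    np.random.seed(0)
    # nominal (center) points and labels, synthetic and linearly separable
    X_hat = np.vstack([np.random.randn(20, 2) + 3.0, np.random.randn(20, 2) - 3.0])
    y = np.hstack([np.ones(20), -np.ones(20)])
    P = np.array([[1.0, 0.2], [0.2, 0.5]])   # shared shape matrix of the uncertainty sets

    a = cp.Variable(2)
    b = cp.Variable()
    margin = [cp.multiply(y, X_hat @ a - b) >= 1]

    # hypercube uncertainty: minimize ||P a||_1 (an LP after standard reformulation)
    lp = cp.Problem(cp.Minimize(cp.norm(P @ a, 1)), margin)
    lp.solve()

    # ellipsoidal uncertainty: minimize ||P a||_2 (a QP / SOCP)
    qp = cp.Problem(cp.Minimize(cp.norm(P @ a, 2)), margin)
    qp.solve()

    print("hypercube model:", a.value if lp.value is not None else None, lp.value)
    print("ellipsoid model:", qp.value)
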
outline

- convex optimization
- SVMs and robust linear programming
- minimax probability machine
- kernel optimization

minimax probability machine

- goal: make only weak assumptions about the data-generating process
  - do not assume Gaussian distributions
  - use a second-moment analysis of the two classes
- let x̂_±, Γ_± be the mean and covariance matrix of class y = ±1
- MPM: minimize ε such that there exists (a, b) with

      inf_{x ~ (x̂_+, Γ_+)}  Prob{a^T x >= b} >= 1 - ε
      inf_{x ~ (x̂_-, Γ_-)}  Prob{a^T x <= b} >= 1 - ε

MPMs: optimization problem

- two-sided, multivariate Chebychev inequality:

      inf_{x ~ (x̂, Γ)}  Prob{a^T x <= b}  =  (b - a^T x̂)_+^2 / ( (b - a^T x̂)_+^2 + a^T Γ a )

- the MPM leads to a second-order cone program:

      min_a  ||Γ_+^{1/2} a||_2 + ||Γ_-^{1/2} a||_2  subject to  a^T (x̂_+ - x̂_-) = 1

- complexity is the same as for standard SVMs

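A sketch of this SOCP with cvxpy (not part of the slides), using plug-in means and covariances of synthetic data; the last lines recover the worst-case misclassification probability from the optimal value via κ = 1/(optimal value) and ε = 1/(1 + κ²), which follows from the Chebychev bound above and the normalization a^T(x̂_+ - x̂_-) = 1.

    import numpy as np
    import cvxpy as cp

    np.random.seed(0)
    # synthetic two-class data; plug-in means and covariances
    X_pos = np.random.randn(50, 2) @ np.array([[1.5, 0.3], [0.3, 0.8]]) + np.array([2.0, 2.0])
    X_neg = np.random.randn(50, 2) + np.array([-2.0, -2.0])
    x_pos, x_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Lp = np.linalg.cholesky(np.cov(X_pos, rowvar=False))   # Lp Lp^T = Gamma_+
    Lm = np.linalg.cholesky(np.cov(X_neg, rowvar=False))   # Lm Lm^T = Gamma_-

    a = cp.Variable(2)
    # MPM:  min ||Gamma_+^{1/2} a||_2 + ||Gamma_-^{1/2} a||_2  s.t.  a^T (x_+ - x_-) = 1
    prob = cp.Problem(cp.Minimize(cp.norm(Lp.T @ a, 2) + cp.norm(Lm.T @ a, 2)),
                      [a @ (x_pos - x_neg) == 1])
    prob.solve()

    kappa = 1.0 / prob.value
    print("a =", a.value, "  worst-case error <=", 1.0 / (1.0 + kappa**2))
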
dual problem

- express the problem as an unconstrained min-max problem:

      min_a  max_{||u||_2 <= 1, ||v||_2 <= 1, λ}  u^T Γ_+^{1/2} a + v^T Γ_-^{1/2} a + λ (1 - a^T (x̂_+ - x̂_-))

- exchange min and max, and set ρ := 1/λ:

      min_{ρ, u, v}  ρ  subject to  x̂_+ + Γ_+^{1/2} u = x̂_- + Γ_-^{1/2} v,  ||u||_2 <= ρ,  ||v||_2 <= ρ

- geometric interpretation: define the two ellipsoids

      E_±(ρ) := { x̂_± + Γ_±^{1/2} u : ||u||_2 <= ρ }

  and find the smallest ρ for which the ellipsoids intersect

robust optimization interpretation

- assume the data are generated as follows: for data with label +,

      x_+ in E_+(ρ) := { x̂_+ + Γ_+^{1/2} u : ||u||_2 <= ρ }

  and similarly for data with label -
- the MPM finds the largest ρ for which robust separation is possible

  [figure: the two ellipsoids around x̂_+ and x̂_-, separated by the hyperplane a^T x - b = 0]

variations

- minimize a weighted sum of misclassification probabilities
- quadratic separation: find a quadratic set Q such that

      inf_{x ~ (x̂_+, Γ_+)}  Prob{x in Q}     >= 1 - ε
      inf_{x ~ (x̂_-, Γ_-)}  Prob{x not in Q} >= 1 - ε

  leads to a semidefinite programming problem
- nonlinear classification via kernels (using plug-in estimates of the mean and covariance matrix)

outline

- convex optimization
- SVMs and robust linear programming
- minimax probability machine
- kernel optimization

transduction

- transduction: given a labeled training set and an unlabeled test set, predict the labels of the test points
- the data thus contains both labeled and unlabeled points

kernel methods

- main goal: separate using a nonlinear classifier a^T φ(x) = b, where φ is a nonlinear operator
- define the kernel matrix (on both labeled and unlabeled data)

      K_ij = φ(x_i)^T φ(x_j)

- in a transductive setting, all we need to know to predict the labels are a, b and the kernel matrix

kernel methods: idea of proof

- all the linear classification methods we've seen so far are such that, at the optimum, a lies in the span of the labeled data:

      a = Σ_i λ_i x_i

- thus, in the nonlinear case (a = Σ_i λ_i φ(x_i)), the optimization problem depends only on the values K_ij of the kernel matrix for labeled points x_i, x_j
- in a transductive setting, the prediction of labels also involves only the K_ij, since for an unlabeled data point x_j

      a^T φ(x_j) = Σ_i λ_i φ(x_i)^T φ(x_j) = Σ_i λ_i K_ij

  involves only kernel values

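A small sketch of this last point (not from the slides): once the expansion coefficients λ_i are known, predicting labels for unlabeled points needs only kernel evaluations. The Gaussian (RBF) kernel and the random coefficients below are assumptions made purely for the illustration; in practice the λ_i come from the kernelized training problem.

    import numpy as np

    def rbf_kernel(X1, X2, sigma=1.0):
        """Gaussian kernel matrix K[i, j] = exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    np.random.seed(0)
    X_train = np.random.randn(10, 2)         # labeled points x_i
    X_test = np.random.randn(5, 2)           # unlabeled points x_j
    lam = np.random.randn(10)                # expansion coefficients lambda_i (placeholder)
    b = 0.0

    # a^T phi(x_j) = sum_i lambda_i phi(x_i)^T phi(x_j) = sum_i lambda_i K_ij
    K_test = rbf_kernel(X_train, X_test)     # K_ij for labeled x_i, unlabeled x_j
    labels = np.sign(lam @ K_test - b)       # predicted labels, using kernel values only
    print(labels)
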
kernel optimization

- all the previous algorithms can be kernelized
- what is a good kernel?
  - the kernel should be close to a target kernel
  - the kernel matrix should satisfy some structure constraints
- main idea: the kernel can be described via the Gram matrix of the data points, hence by a positive semidefinite matrix
- semidefinite programming therefore plays a role in kernel optimization

setup

- we assume we are given training and test sets
- goal: maximize the alignment with a given kernel on the training set (translates into constraints on the upper-left block of the kernel matrix)
- the kernel matrix should satisfy structure constraints (translates into constraints on the whole matrix, including the test set)

alignment

- idea: align K with a target kernel C by maximizing

      A(K, C) := <C, K> / ( ||K||_F ||C||_F )

  where <C, K> = Tr(CK) is the inner product of two symmetric matrices, and ||C||_F = sqrt(<C, C>) is the Frobenius norm
- we can impose a lower bound α on the alignment via the second-order cone constraint on K

      α ||C||_F ||K||_F <= <C, K>

affine constraints

- it is also useful to impose that the kernel lies in some affine subspace
- example: assume that K is of the form

      K = Σ_{i=1}^N λ_i u_i u_i^T

  where the λ_i >= 0 are the (variable) eigenvalues, and the u_i are the (fixed) eigenvectors

optimizing kernels: example problem

- goal: find a kernel that
  - has a given alignment with a given matrix (e.g., C = y y^T) on the training set
  - belongs to some affine set 𝒦
- the problem reduces to a semidefinite programming feasibility problem: find K such that

      K in 𝒦,   α ||C||_F ||K||_F <= <C, K>,   K positive semidefinite

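A hedged cvxpy sketch of this feasibility problem (not from the slides), combining the eigenvector parameterization from the previous slide with the alignment constraint on the training block; all data (the eigenvectors, labels, threshold α and the trace normalization) are invented for the example, and the first eigenvector is chosen so that a feasible K exists.

    import numpy as np
    import cvxpy as cp

    np.random.seed(0)
    N, N_tr = 8, 5                      # total points, of which N_tr are labeled
    y = np.sign(np.random.randn(N_tr))
    C = np.outer(y, y)                  # target kernel C = y y^T on the training block
    alpha = 0.8                         # required alignment level

    # fixed eigenvectors; the first one is supported on the training block and
    # proportional to y, so a perfectly aligned K is available in the affine set
    u1 = np.concatenate([y, np.zeros(N - N_tr)])
    u1 /= np.linalg.norm(u1)
    U = np.linalg.qr(np.column_stack([u1, np.random.randn(N, N - 1)]))[0]

    # K = sum_i lambda_i u_i u_i^T with lambda_i >= 0, hence K is automatically PSD
    lam = cp.Variable(N, nonneg=True)
    K = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(N))
    K_tr = K[:N_tr, :N_tr]              # upper-left (training) block

    constraints = [
        # alignment with C on the training block:  alpha ||C||_F ||K_tr||_F <= <C, K_tr>
        alpha * np.linalg.norm(C, 'fro') * cp.norm(K_tr, 'fro') <= cp.trace(C @ K_tr),
        cp.trace(K) == N,               # a normalization ruling out the trivial K = 0
    ]
    prob = cp.Problem(cp.Minimize(0), constraints)
    prob.solve()
    print("status:", prob.status)
    print("K =\n", K.value)
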
kernel optimization: what's next?

- of course, this is not a learning method
- there is much to learn from duality theory
- many other constraints can be handled, e.g., margin requirements

wrap-up

- convex optimization has much to offer to, and gain from, interaction with classification
- we described variations on linear classification, with many robust optimization interpretations
- all these methods can be kernelized
- kernel optimization has high potential

see also

- "Learning the Kernel Matrix with Semi-Definite Programming" (Lanckriet, Cristianini, Bartlett, El Ghaoui, Jordan), in preparation, 2002
- "Minimax Probability Machine" (Lanckriet, Bhattacharyya, El Ghaoui, Jordan), NIPS 2001