Short Course Robust Optimization and Machine Learning. Lecture 6: Robust Optimization in Machine Learning


Short Course Robust Optimization and Machine Learning. Lecture 6: Robust Optimization in Machine Learning. Laurent El Ghaoui, EECS and IEOR Departments, UC Berkeley. Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012.

Outline


Supervised learning problems

Many supervised learning problems (e.g., classification, regression) can be written as $\min_w L(X^T w)$, where $L$ is convex and $X$ contains the data.

Penalty approach

Often, the optimal value and solutions of optimization problems are sensitive to data. A common approach to deal with this sensitivity is penalization, e.g.
$$\min_w \; L(X^T w) + \|W w\|_2^2,$$
where $W$ is a weighting matrix. How do we choose the penalty? Can we choose it in a way that reflects knowledge about the problem structure, or about how uncertainty affects the data? Does it lead to better solutions from a machine learning viewpoint?

Support Vector Machine

The Support Vector Machine (SVM) classification problem:
$$\min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(z_i^T w + b)\right)_+$$
where $Z := [z_1, \ldots, z_m] \in \mathbf{R}^{n \times m}$ contains the data points, $y \in \{-1, 1\}^m$ contains the labels, and $x := (w, b)$ contains the classifier parameters, which allow classifying a new point $z$ via the rule $y = \mathrm{sgn}(z^T w + b)$.
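
To make the slide's formulation concrete, here is a minimal sketch of the nominal hinge-loss SVM written with cvxpy on synthetic data; the library choice, the random data, and the variable names are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch (illustration only): nominal hinge-loss SVM via cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 5, 40                          # n features, m data points
Z = rng.standard_normal((n, m))       # columns z_i are the data points
y = np.sign(rng.standard_normal(m))
y[y == 0] = 1                         # labels in {-1, +1}

w = cp.Variable(n)
b = cp.Variable()
margins = cp.multiply(y, Z.T @ w + b)        # y_i (z_i^T w + b)
hinge = cp.sum(cp.pos(1 - margins))          # sum of hinge terms
prob = cp.Problem(cp.Minimize(hinge))
prob.solve()
print("optimal hinge loss:", prob.value)
```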

Robustness to data uncertainty

Assume the data matrix is only partially known, and address the robust optimization problem
$$\min_{w,b} \; \max_{U \in \mathcal{U}} \; \sum_{i=1}^m \left(1 - y_i((z_i + u_i)^T w + b)\right)_+,$$
where $U = [u_1, \ldots, u_m]$ and $\mathcal{U} \subseteq \mathbf{R}^{n \times m}$ is a set that describes additive uncertainty in the data matrix.

Measurement-wise, spherical uncertainty

Assume
$$\mathcal{U} = \{U = [u_1, \ldots, u_m] \in \mathbf{R}^{n \times m} : \|u_i\|_2 \le \rho, \; i = 1, \ldots, m\},$$
where $\rho > 0$ is given. The robust SVM reduces to
$$\min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(z_i^T w + b) + \rho \|w\|_2\right)_+.$$
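
Under this reduction, each hinge term simply gains an extra $\rho \|w\|_2$. A sketch of the reduced robust SVM, again with cvxpy on synthetic data (the library, the data, and the value of $\rho$ are illustrative assumptions):

```python
# Sketch (illustration only): robust SVM under measurement-wise spherical
# uncertainty of radius rho, via the closed-form reduction above.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, m, rho = 5, 40, 0.1
Z = rng.standard_normal((n, m))
y = np.sign(rng.standard_normal(m))
y[y == 0] = 1

w, b = cp.Variable(n), cp.Variable()
margins = cp.multiply(y, Z.T @ w + b)
# worst-case perturbation inflates each hinge term by rho * ||w||_2
robust_obj = cp.sum(cp.pos(1 - margins + rho * cp.norm(w, 2)))
cp.Problem(cp.Minimize(robust_obj)).solve()
print("robust objective:", robust_obj.value)
```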

Link with classical SVM

The classical SVM contains an $\ell_2$-norm regularization term:
$$\min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(z_i^T w + b)\right)_+ + \lambda \|w\|_2^2,$$
where $\lambda > 0$ is a penalty parameter. With spherical uncertainty, the robust SVM is similar to the classical SVM. When the data is separable, the two models are equivalent.

Separable data

Figure: maximally robust classifier for separable data, with spherical uncertainties around each data point. In this case, the robust counterpart reduces to the classical maximum-margin classifier problem.

Interval uncertainty

Assume
$$\mathcal{U} = \{U \in \mathbf{R}^{n \times m} : |U_{ij}| \le \rho \;\; \forall (i,j)\},$$
where $\rho > 0$ is given. The robust SVM reduces to
$$\min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(z_i^T w + b) + \rho \|w\|_1\right)_+.$$
The $\ell_1$-norm term encourages sparsity, but may not regularize the solution.

Separable data

Figure: maximally robust classifier for separable data, with box uncertainties around each data point. This uncertainty model encourages sparsity of the solution.

Other uncertainty models

We may generalize the approach to other uncertainty models while retaining tractability:
- Measurement-wise uncertainty models: perturbations affect each data point independently of the others.
- Other models couple the way uncertainties affect the measurements; for example, we may control the number of errors across all measurements.
- Norm-bound models allow for uncertainty in the data matrix that is bounded in a matrix norm.
A whole theory is presented in [1].

Consider the standard $\ell_1$-penalized SVM:
$$\phi_\lambda(X) := \min_{w,b} \; \frac{1}{m} \sum_{i=1}^m \left(1 - y_i(w^T x_i + b)\right)_+ + \lambda \|w\|_1.$$
Constrained counterpart:
$$\psi_c(X) := \min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(x_i^T w + b)\right)_+ \;:\; \|w\|_1 \le c.$$
Basic goal: solve these problems in the large-scale case. Approach: use thresholding to sparsify the data matrix in a controlled way.

Thresholding data

We threshold the data using an absolute level $t$:
$$(x_i(t))_j := \begin{cases} 0 & \text{if } |x_{i,j}| \le t, \\ x_{i,j} & \text{otherwise.} \end{cases}$$
This makes the data sparser, resulting in memory and time savings.
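
A minimal sketch of this thresholding step, assuming NumPy/SciPy and a dense input matrix; the helper name and the synthetic data are illustrative.

```python
# Sketch (illustration only): hard-threshold a data matrix at absolute level t,
# zeroing small entries and storing the result as a sparse matrix.
import numpy as np
from scipy import sparse

def threshold(X, t):
    """Zero out entries with |X_ij| <= t; keep the others unchanged."""
    return sparse.csr_matrix(np.where(np.abs(X) > t, X, 0.0))

X = 0.2 * np.random.default_rng(2).standard_normal((100, 500))
Xt = threshold(X, t=0.3)
print("density after thresholding:", Xt.nnz / X.size)
```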

Handling thresholding errors

Handle thresholding errors via the robust counterpart:
$$(w(t), b(t)) := \arg\min_{w,b} \; \max_{\|Z - X(t)\|_\infty \le t} \; \frac{1}{m} \sum_{i=1}^m \left(1 - y_i(w^T z_i + b)\right)_+ + \lambda \|w\|_1.$$
The above problem is tractable. The solution $(w(t), b(t))$ at threshold level $t$ satisfies
$$0 \le \frac{1}{m} \sum_{i=1}^m \left(1 - y_i(x_i^T w(t) + b(t))\right)_+ + \lambda \|w(t)\|_1 - \phi_\lambda(X) \le \frac{2t}{\lambda}.$$

Results: 20 Newsgroups data set

Dataset size: 20,000 × 60,000. Thresholding of the data matrix of TF-IDF scores.

Results: UCI NYTimes data set

Top 20 keywords for topic "stock": 1. stock, 2. nasdaq, 3. portfolio, 4. brokerage, 5. exchanges, 6. shareholder, 7. fund, 8. investor, 9. alan greenspan, 10. fed, 11. bond, 12. forecast, 13. thomson financial, 14. index, 15. royal bank, 16. fund, 17. marketing, 18. companies, 19. bank, 20. merrill.

Dataset size: 100,000 × 102,660, with 30,000,000 non-zeros. Thresholding (of TF-IDF scores) at level 0.05 leaves 850,000 non-zeros (2.8%). Total run time: 4317 s.

Robust SVM with boolean data

Data: boolean $Z \in \{0,1\}^{n \times m}$ (e.g., a co-occurrence matrix). Nominal problem: SVM
$$\min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(z_i^T w + b)\right)_+.$$
Uncertainty model: assume each data value can be flipped, with a constrained total budget of flips:
$$\mathcal{U} = \left\{ U = [u_1, \ldots, u_m] \in \mathbf{R}^{l \times m} : u_i \in \{-1, 0, 1\}^l, \; \|u_i\|_1 \le k \right\}.$$

Robust counterpart

$$\min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(z_i^T w + b) + \phi(w)\right)_+,$$
where
$$\phi(w) := \min_s \; k \|w - s\|_\infty + \|s\|_1.$$
The penalty is a combination of the $\ell_1$ and $\ell_\infty$ norms. The problem is tractable (it doubles the number of variables over the nominal problem). It still needs regularization.
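
For integer $k \le n$, the minimization over $s$ has a closed form: $\phi(w)$ equals the sum of the $k$ largest magnitudes of $w$. A quick numerical check of this identity (illustration only; cvxpy and the random vector are assumptions):

```python
# Check (illustration only): min_s k*||w - s||_inf + ||s||_1 equals the sum
# of the k largest magnitudes of w.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
w = rng.standard_normal(10)
k = 3

s = cp.Variable(10)
prob = cp.Problem(cp.Minimize(k * cp.norm(s - w, "inf") + cp.norm(s, 1)))
prob.solve()

top_k = np.sort(np.abs(w))[-k:].sum()
print(prob.value, top_k)   # the two values agree up to solver tolerance
```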

Results: UCI Internet Advertisements data set

Dataset size: 3279 × 1555. $k = 0$ corresponds to the nominal SVM problem. Best performance at $k = 3$.

Refined model

We can impose $u_i \in \{0, 1 - 2x_i\}$ (componentwise: either keep or flip each boolean entry). This leads to a new penalty:
$$\min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(x_i^T w + b) + \phi_i(w)\right)_+,$$
with
$$\phi_i(w) := \min_{\mu \ge 0} \; k\mu + \sum_{j=1}^n \left(y_i w_j (2x_{ij} - 1) - \mu\right)_+.$$
The problem can still be solved via an LP.
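
The inner minimization over $\mu$ also has a simple value: it equals the sum of the $k$ largest positive parts of the quantities $a_j := y_i w_j (2x_{ij} - 1)$. A small numerical check (illustration only; cvxpy and the random stand-in values are assumptions):

```python
# Check (illustration only): min_{mu >= 0} k*mu + sum_j (a_j - mu)_+ equals
# the sum of the k largest positive parts of the a_j.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(7)
a = rng.standard_normal(10)    # stands in for a_j = y_i * w_j * (2*x_ij - 1)
k = 3

mu = cp.Variable(nonneg=True)
prob = cp.Problem(cp.Minimize(k * mu + cp.sum(cp.pos(a - mu))))
prob.solve()

top_k_pos = np.sort(np.maximum(a, 0.0))[-k:].sum()
print(prob.value, top_k_pos)   # the two values agree up to solver tolerance
```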

Results: UCI Heart data set

Dataset size: 267 × 22. $k = 0$ corresponds to the nominal SVM problem. Best performance at $k = 1$.

Outline

Nominal problem

$$\min_{\theta \in \Theta} \; L(Z^T \theta),$$
where
- $Z := [z_1, \ldots, z_m] \in \mathbf{R}^{n \times m}$ is the data matrix;
- $L : \mathbf{R}^m \to \mathbf{R}$ is a convex loss function;
- $\Theta$ imposes structure (e.g., sign) constraints on the parameter vector $\theta$.

Loss function: assumptions

We assume that $L(r) = \pi(\mathrm{abs}(P(r)))$, where $\mathrm{abs}(\cdot)$ acts componentwise, $\pi : \mathbf{R}^m_+ \to \mathbf{R}$ is a convex function that is monotone on the non-negative orthant, and
$$P(r) = \begin{cases} r & \text{(symmetric case)} \\ r_+ & \text{(asymmetric case)} \end{cases}$$
with $r_+$ the vector with components $\max(r_i, 0)$, $i = 1, \ldots, m$.

Loss function: examples

- $\ell_p$-norm regression
- hinge loss
- Huber, Berhu loss

Robust counterpart

$$\min_{\theta \in \Theta} \; \max_{Z \in \mathcal{Z}} \; L(Z^T \theta),$$
where $\mathcal{Z} \subseteq \mathbf{R}^{n \times m}$ is a set of the form
$$\mathcal{Z} = \{Z + \Delta : \Delta \in \rho \mathcal{D}\},$$
with $\rho \ge 0$ a measure of the size of the uncertainty, and $\mathcal{D} \subseteq \mathbf{R}^{l \times m}$ given.

Generic analysis

For a given vector $\theta$, we have
$$\max_{Z \in \mathcal{Z}} L(Z^T \theta) = \max_v \; v^T Z^T \theta - L^*(v) + \rho\, \phi_{\mathcal{D}}(\theta v^T),$$
where $L^*$ is the conjugate of $L$, and
$$\phi_{\mathcal{D}}(X) := \max_{\Delta \in \mathcal{D}} \; \langle X, \Delta \rangle$$
is the support function of $\mathcal{D}$.

Assumptions on the uncertainty set $\mathcal{D}$

Separability condition: there exist two semi-norms $\phi$, $\psi$ such that
$$\phi_{\mathcal{D}}(u v^T) := \max_{\Delta \in \mathcal{D}} \; u^T \Delta v = \phi(u)\, \psi(v).$$
- This does not completely characterize (the support function of) the set $\mathcal{D}$.
- Given $\phi$, $\psi$, we can construct a set $\mathcal{D}_{\mathrm{out}}$ that obeys the condition.
- The robust counterpart only depends on $\phi$, $\psi$.
- WLOG, we can replace $\mathcal{D}$ by its convex hull.

Examples:
- Largest singular value model: $\mathcal{D} = \{\Delta : \|\Delta\| \le \rho\}$, with $\phi$, $\psi$ Euclidean norms.
- Any norm-bound model involving an induced norm ($\phi$, $\psi$ are then the norms dual to the norms involved).
- Measurement-wise uncertainty models, where each column of the perturbation matrix is bounded in norm independently of the others, correspond to the case $\psi(v) = \|v\|_1$.

Other examples

Bounded-error model: there are (at most $K$) errors affecting the data,
$$\mathcal{D} = \left\{ \Delta = [\lambda_1 \delta_1, \ldots, \lambda_m \delta_m] \in \mathbf{R}^{l \times m} : \|\delta_i\| \le 1, \; i = 1, \ldots, m, \;\; \sum_{i=1}^m \lambda_i \le K, \; \lambda \in \{0,1\}^m \right\},$$
for which $\phi(\cdot)$ is the dual of the norm bounding the $\delta_i$, and $\psi(v)$ is the sum of the $K$ largest magnitudes of the components of $v$.

(continued)

The set
$$\mathcal{D} = \left\{ \Delta = [\lambda_1 \delta_1, \ldots, \lambda_m \delta_m] \in \mathbf{R}^{l \times m} : \delta_i \in \{-1, 0, 1\}^l, \; \|\delta_i\|_1 \le k \right\}$$
models measurement-wise uncertainty affecting boolean data (we can impose $\delta_i \in \{-x_i, 1 - x_i\}$ componentwise to be more realistic). In this case, we have $\psi(\cdot) = \|\cdot\|_1$ and
$$\phi(u) = \|u\|_{1,k} := \min_w \; k \|u - w\|_\infty + \|w\|_1.$$

Main result

$$\min_{\theta \in \Theta} \; \max_{Z \in \mathcal{Z}} \; L(Z^T \theta) = \min_{\theta \in \Theta,\, \kappa} \; L_{\mathrm{wc}}(Z^T \theta, \kappa) \;:\; \kappa \ge \rho\, \phi(\theta),$$
where
$$L_{\mathrm{wc}}(r, \kappa) := \max_v \; v^T r - L^*(v) + \kappa\, \psi(v)$$
is the worst-case loss function of the robust problem.

Worst-case loss function

The tractability of the robust counterpart is directly linked to our ability to compute optimal solutions $v$ of
$$L_{\mathrm{wc}}(r, \kappa) = \max_v \; v^T r - L^*(v) + \kappa\, \psi(v).$$
Dual representation (assume $\psi(\cdot) = \|\cdot\|$ is a norm):
$$L_{\mathrm{wc}}(r, \kappa) = \max_\xi \; L(r + \kappa \xi) \;:\; \|\xi\|_* \le 1,$$
where $\|\cdot\|_*$ is the dual norm. When $\psi$ is the Euclidean norm, this is the robust regularization of $L$ (Lewis, 2001).

When $\psi(\cdot) = \|\cdot\|_p$ with $p = 1, \infty$, the problem reduces to a simple, tractable convex problem (assuming the nominal problem is tractable). For $p = 2$, the problem can be reduced to such a simple form for the hinge, $\ell_q$-norm and Huber loss functions.

Lasso

In particular, the least-squares problem with lasso penalty
$$\min_\theta \; \|X^T \theta - y\|_2 + \rho \|\theta\|_1$$
is the robust counterpart of a least-squares problem with uncertainty on $X$, with an additive perturbation $\Delta$ bounded in the norm
$$\|\Delta\|_{1,2} := \max_{1 \le i \le l} \left( \sum_{j=1}^n \Delta_{ij}^2 \right)^{1/2}.$$
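
A sketch of this lasso-form problem with the un-squared residual, using cvxpy on synthetic data (the library, the data, and the value of $\rho$ are illustrative assumptions):

```python
# Sketch (illustration only): min ||X^T theta - y||_2 + rho * ||theta||_1.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, m, rho = 8, 50, 0.5
X = rng.standard_normal((n, m))            # columns are data points
theta_true = np.zeros(n)
theta_true[:2] = [1.0, -2.0]               # sparse ground truth
y = X.T @ theta_true + 0.05 * rng.standard_normal(m)

theta = cp.Variable(n)
obj = cp.norm(X.T @ theta - y, 2) + rho * cp.norm(theta, 1)
cp.Problem(cp.Minimize(obj)).solve()
print("recovered theta:", np.round(theta.value, 2))
```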

Globalized robust counterpart

The robust counterpart is based on the worst-case value of the loss function, assuming a bound on the data uncertainty ($Z \in \mathcal{Z}$):
$$\min_{\theta \in \Theta} \; \max_{Z \in \mathcal{Z}} \; L(Z^T \theta).$$
This approach does not control the degradation of the loss outside the set $\mathcal{Z}$.

Globalized robust counterpart: formulation

In the globalized robust counterpart, we fix a rate of degradation of the loss, which controls how much the loss may degrade as the data matrix moves away from the set $\mathcal{Z}$. We seek to minimize $\tau$ such that, for all perturbations $\Delta$,
$$L((Z + \Delta)^T \theta) \le \tau + \alpha \|\Delta\|,$$
where $\alpha > 0$ controls the rate of degradation and $\|\cdot\|$ is a matrix norm.

Globalized robust counterpart

For the SVM case, the globalized robust counterpart can be expressed as
$$\min_{w,b} \; \sum_{i=1}^m \left(1 - y_i(z_i^T w + b)\right)_+ \;:\; m \|w\|_2 \le \alpha,$$
which is a classical form of SVM. For $\ell_p$-norm regression with $m$ data points, the globalized robust counterpart takes the form
$$\min_\theta \; \|X^T \theta - y\|_p \;:\; \kappa(m, p)\, \|\theta\|_2 \le \alpha,$$
where $\kappa(m, 1) = m$ and $\kappa(m, 2) = \kappa(m, \infty) = 1$.

Robust optimization can also address problems with chance constraints:
$$\min_\theta \; \max_{p \in \mathcal{P}} \; \mathbf{E}_p\, L(Z(\delta)^T \theta),$$
where $\delta$ follows the distribution $p$, and $\mathcal{P}$ is a class of distributions.
- Results are more limited, focused on upper bounds.
- Convex relaxations are available, but more expensive.
- The approach uses Bernstein approximations (Nemirovski & Ben-Tal, 2006).

Robust regression with chance constraints: an example

$$\phi_p := \min_\theta \; \max_{x \sim (\hat{x}, X)} \; \mathbf{E}_x \|A(x)\theta - b(x)\|_p$$
- The regression variable is $\theta \in \mathbf{R}^n$.
- $x \in \mathbf{R}^q$ is an uncertainty vector that enters affinely in the problem matrices: $[A(x), b(x)] = [A_0, b_0] + \sum_i x_i [A_i, b_i]$.
- The distribution of the uncertainty vector $x$ is unknown, except for its mean $\hat{x}$ and covariance $X$.
- The objective is the worst-case (over distributions) expected value of the $\ell_p$-norm residual ($p = 1, 2$).

Main result

(Assume $\hat{x} = 0$, $X = I$ WLOG.) For $p = 2$, the problem reduces to least-squares:
$$\phi_2^2 = \min_\theta \; \sum_{i=0}^q \|A_i \theta - b_i\|_2^2.$$
For $p = 1$, we have $\sqrt{2/\pi}\, \psi_1 \le \phi_1 \le \psi_1$, with
$$\psi_1 = \min_\theta \; \sum_{i=0}^q \|A_i \theta - b_i\|_2.$$
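
For $p = 2$, the reduction is an ordinary least-squares problem on the stacked matrices $A_0, \ldots, A_q$ and $b_0, \ldots, b_q$, as the following NumPy sketch illustrates (the matrices are random stand-ins for illustration):

```python
# Sketch (illustration only): phi_2^2 = min_theta sum_i ||A_i theta - b_i||_2^2
# is an ordinary least-squares problem on the vertically stacked data.
import numpy as np

rng = np.random.default_rng(5)
q, rows, n = 3, 6, 4                       # q+1 blocks, each A_i is (rows x n)
A = [rng.standard_normal((rows, n)) for _ in range(q + 1)]
b = [rng.standard_normal(rows) for _ in range(q + 1)]

A_stacked = np.vstack(A)                   # stack A_0, ..., A_q vertically
b_stacked = np.concatenate(b)
theta, *_ = np.linalg.lstsq(A_stacked, b_stacked, rcond=None)
phi2_sq = sum(np.linalg.norm(Ai @ theta - bi) ** 2 for Ai, bi in zip(A, b))
print("phi_2^2 =", phi2_sq)
```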

Example: robust median

As a special case, consider the median problem:
$$\min_\theta \; \sum_{i=1}^q |\theta - x_i|.$$
Now assume that the vector $x$ is random, with mean $\hat{x}$ and covariance $X$, and consider the robust version:
$$\phi_1 := \min_\theta \; \max_{x \sim (\hat{x}, X)} \; \mathbf{E}_x \sum_{i=1}^q |\theta - x_i|.$$

Approximate solution

We have $\sqrt{2/\pi}\, \psi_1 \le \phi_1 \le \psi_1$, with
$$\psi_1 := \min_\theta \; \sum_{i=1}^q \sqrt{(\theta - \hat{x}_i)^2 + X_{ii}}.$$
This amounts to finding the minimum sum of distances (a very simple SOCP).
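
A sketch of computing $\psi_1$ for the robust median as a small SOCP with cvxpy (the means $\hat{x}_i$ and variances $X_{ii}$ are random stand-ins for illustration):

```python
# Sketch (illustration only):
# psi_1 = min_theta sum_i sqrt((theta - xhat_i)^2 + X_ii), solved as an SOCP.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
q = 7
xhat = rng.standard_normal(q)          # means of the data points
Xdiag = 0.1 * rng.random(q)            # variances X_ii

theta = cp.Variable()
# each term is the Euclidean norm of the 2-vector (theta - xhat_i, sqrt(X_ii))
terms = [cp.norm(cp.hstack([theta - xhat[i], np.sqrt(Xdiag[i])])) for i in range(q)]
prob = cp.Problem(cp.Minimize(cp.sum(cp.hstack(terms))))
prob.solve()
print("robust median estimate:", theta.value, " psi_1 =", prob.value)
```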

Geometry of the robust median problem

Figure: nominal vs. robust median (nominal points with their standard deviations, the nominal median, and the robust median).

Outline

References

A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press, October 2009.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning, chapter 14. MIT Press, 2011.