Optimal convex optimization under Tsybakov noise through connections to active learning
1 Optimal convex optimization under Tsybakov noise through connections to active learning. Aarti Singh. Joint work with: Aaditya Ramdas.
2 Connections between convex optimization and active learning (a formal reduction).
3 Role of feedback. In convex optimization: minimize computational complexity, the number of queries needed to find the optimum [Raginsky-Rakhlin 09]. In active learning: minimize sample complexity, the number of queries needed to find the decision boundary. [Figure: left, a convex function f(x) with its minimizer; right, a regression function over X with its decision boundary.]
4 Active learning oracle model. The oracle provides a label Y ∈ {0, 1}: query x_1 → Y_1, query x_2 → Y_2, …, query x_N → Y_N, with E[Y | X] = P(Y = 1 | X).
5 Stochastic optimization oracle model (first-order). The oracle provides estimates f̂(x), ĝ(x) of f(x) and g(x) = ∇f(x): query x_1 → (f̂(x_1), ĝ(x_1)), query x_2 → (f̂(x_2), ĝ(x_2)), …, query x_N → (f̂(x_N), ĝ(x_N)), with E[f̂(x)] = f(x) and E[ĝ(x)] = g(x): unbiased, variance σ².
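To make the two oracle models concrete, here is a minimal Python sketch; the class names, the Gaussian noise model, and the interface are illustrative assumptions, not part of the talk.

```python
import numpy as np

class ActiveOracle:
    """Active learning oracle: each query at x returns a label Y in {0, 1}
    with E[Y | X = x] = P(Y = 1 | X = x) = eta(x)."""
    def __init__(self, eta, rng=None):
        self.eta = eta                            # regression function eta(x)
        self.rng = rng or np.random.default_rng()

    def query(self, x):
        return int(self.rng.random() < self.eta(x))


class FirstOrderOracle:
    """Stochastic first-order oracle: each query at x returns unbiased,
    variance-sigma^2 estimates of f(x) and g(x) = grad f(x)."""
    def __init__(self, f, grad, sigma=1.0, rng=None):
        self.f, self.grad, self.sigma = f, grad, sigma
        self.rng = rng or np.random.default_rng()

    def query(self, x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        f_hat = self.f(x) + self.sigma * self.rng.standard_normal()
        g_hat = self.grad(x) + self.sigma * self.rng.standard_normal(x.shape)
        return f_hat, g_hat
```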
6 Connections in the 1-dim noiseless setting. In convex optimization, P(sign(g(x)) = + | X) steps from 0 to 1 at the minimizer of f(x); in active learning, P(Y = 1 | X) steps from 0 to 1 at the decision boundary. The sign of the gradient plays exactly the role of the label.
7 Connections in the 1-dim noisy setting. With a noisy gradient estimate ĝ (zero-mean noise), P(sign(ĝ(x)) = + | X) is a smoothed version of the step at the minimizer, just as P(Y = 1 | X) is a smoothed step under label noise in active learning.
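The reduction in this direction is one line of code: query the noisy gradient and keep only its sign. A hedged sketch, reusing the `FirstOrderOracle` above:

```python
def gradient_sign_label(oracle, x):
    """One step of the reduction (1-dim): query the noisy gradient and keep
    only its sign. E[Y | x] = P(sign(g_hat(x)) = + | x), which crosses 1/2
    exactly at the minimizer x*, like a noisy label at a decision boundary."""
    _, g_hat = oracle.query(x)
    return int(g_hat[0] > 0)
```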
8 Minimax active learning rates in 1-dim. If the Tsybakov noise condition (TNC) holds, |P(Y = 1 | X = x) − 1/2| ≥ c‖x − x*‖^{κ−1} with κ ≥ 1, then the minimax optimal active learning rate in 1-dim is E[‖x̂_N − x*‖] ≍ N^{−1/(2κ−2)}, and under 0/1 loss plus smoothness of P(Y | X), Risk(x̂_N) − Risk(x*) ≍ N^{−κ/(2κ−2)} [Castro-Nowak 07]. For κ = 2 these rates are N^{−1/2} and N^{−1}.
9 TNC and strong convexity. Strong convexity implies TNC with κ = 2: f(x) − f(x*) ≥ (λ/2)‖x − x*‖² ⇒ ‖g(x) − g(x*)‖ ≥ λ‖x − x*‖. If the gradient noise has a CDF that grows linearly around its zero mean (Gaussian, uniform, and triangular noise all qualify), then |P(sign(ĝ(x)) = + | X = x) − 1/2| ≳ ‖x − x*‖.
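The implication on this slide can be written out in three lines; a sketch for the 1-dim case, assuming symmetric zero-mean gradient noise ε whose CDF F_ε grows linearly near 0 (the constants c, λ are generic):

```latex
\begin{align*}
  f(x) - f(x^*) &\ge \tfrac{\lambda}{2}\,\lVert x - x^*\rVert^{2}
      && \text{($\lambda$-strong convexity: TNC with $\kappa = 2$)} \\
  \lVert g(x) - g(x^*)\rVert &\ge \lambda\,\lVert x - x^*\rVert
      && \text{(differentiate; note $g(x^*) = 0$)} \\
  \bigl|P(\operatorname{sign}(\hat g(x)) = + \mid X = x) - \tfrac12\bigr|
      &= \bigl|F_\varepsilon(g(x)) - \tfrac12\bigr|
      \;\ge\; c\,\lambda\,\lVert x - x^*\rVert
      && \text{($F_\varepsilon$ linear near $0$)}
\end{align*}
```

So the sign labels Y = sign(ĝ(x)) satisfy exactly the TNC with exponent κ = 2 that the active learning rates on the previous slide require.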
10 Algorithmic reduction (1-dim). In 1-dim, consider any active learning algorithm that is optimal for TNC exponent κ = 2. When given labels Y = sign(ĝ(x)), where f(x) is a strongly convex function with Lipschitz gradients, it yields E[‖x̂_T − x*‖] = O(T^{−1/2}) and E[f(x̂_T) − f(x*)] = O(T^{−1}). This matches the optimal rates for strongly convex functions [Nemirovski-Yudin 83, Agarwal-Bartlett-Ravikumar-Wainwright 10]. What about d-dim?
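A minimal sketch of this 1-dim reduction, with a repeated-query majority-vote bisection standing in for a specific optimal κ = 2 active learner (the talk only requires some optimal learner; the budget split below is an illustrative choice):

```python
def bisection_optimize(oracle, lo, hi, budget):
    """1-dim reduction: locate the minimizer of a strongly convex f on
    [lo, hi] by active learning on the labels Y = sign(g_hat(x)).
    Majority-vote bisection stands in for an optimal kappa=2 learner."""
    depth = max(1, int(np.log2(max(budget, 2))))   # number of bisection steps
    votes = max(1, budget // depth)                # repeated queries per step
    for _ in range(depth):
        mid = 0.5 * (lo + hi)
        signs = [np.sign(oracle.query(np.array([mid]))[1][0])
                 for _ in range(votes)]
        if np.sum(signs) > 0:   # gradient likely positive: minimizer is left
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```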
11 1-dim vs. d-dim. In convex optimization the minimizer x* is a point (0-dim) in any dimension, giving rate T^{−1}; in active learning the decision boundary is a (d−1)-dim surface, giving rate N^{−2/(2+(d−1))}. The complexity of convex optimization in any dimension is the same as the complexity of active learning in 1 dimension.
12 Algorithmic reduction (d-dim). Random coordinate descent with a 1-dim active learning subroutine (sketched below): for e = 1, …, E = d(log T)²: choose coordinate j at random from 1, …, d; do active learning along coordinate j with sample budget T_e = T/E, treating sign(ĝ(x_t)_j) as the label Y_t. If f is strongly convex with Lipschitz gradients: inf_{x̂} sup_f E[‖x̂ − x*_f‖] = Õ(T^{−1/2}) and E[f(x̂) − f(x*)] = Õ(T^{−1}).
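A hedged Python sketch of the d-dim reduction, reusing the majority-vote bisection above along one random coordinate per epoch; the radius bookkeeping is simplified relative to the actual algorithm:

```python
def coordinate_descent_active(oracle, x0, radius, T):
    """Random coordinate descent with a 1-dim active learning subroutine.
    Each epoch optimizes f along one random coordinate, treating
    sign(g_hat(x)_j) as the label for the 1-dim learner."""
    rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    d = len(x)
    E = max(1, int(d * np.log(T) ** 2))   # number of epochs, E = d (log T)^2
    T_e = max(1, T // E)                  # per-epoch sample budget
    for _ in range(E):
        j = rng.integers(d)               # random coordinate
        lo, hi = x[j] - radius, x[j] + radius
        # 1-dim majority-vote bisection along coordinate j
        depth = max(1, int(np.log2(max(T_e, 2))))
        votes = max(1, T_e // depth)
        for _ in range(depth):
            mid = 0.5 * (lo + hi)
            y = x.copy()
            y[j] = mid
            s = sum(np.sign(oracle.query(y)[1][j]) for _ in range(votes))
            if s > 0:
                hi = mid
            else:
                lo = mid
        x[j] = 0.5 * (lo + hi)
    return x
```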
13 Degree of convexity via the Tsybakov noise condition (TNC).
14 Degree of convexity via TNC. TNC for convex functions: for κ > 1, f(x) − f(x*) ≥ λ‖x − x*‖^κ, equivalently ‖g(x) − g(x*)‖ ≥ λκ‖x − x*‖^{κ−1}. This controls the strength of convexity around the minimum x*. A uniformly convex function implies TNC with κ ≥ 2, which controls the strength of convexity everywhere in the domain.
15 Minimax convex optimization rates. Theorem: if the TNC for convex functions holds, f(x) − f(x*) ≥ λ‖x − x*‖^κ with κ > 1, and f is Lipschitz, then the minimax optimal convex optimization rates over a bounded set (diameter ≤ 1), in any dimension d, are ‖x̂_T − x*‖ ≍ T^{−1/(2κ−2)} and f(x̂_T) − f(x*) ≍ T^{−κ/(2κ−2)}. Precisely the rates for 1-dim active learning! Function-error examples: κ = 3/2 gives T^{−3/2}; κ = 2 (strongly convex) gives T^{−1}; κ → ∞ (merely convex) gives T^{−1/2}.
16 Lower bounds based on active learning. Claim: inf_{x̂} sup_f E[‖x̂ − x*_f‖] = Ω(T^{−1/(2κ−2)}). Construction: two candidate functions f_0, f_1 on S = [0, 1]^d ∩ {‖x‖ ≤ 1}; oracle O: f̂(x) ~ N(f(x), σ²), ĝ(x) ~ N(g(x), σ²I_d); f_0(x) = c_1 Σ_{i=1}^d |x_i|^κ; f_1(x) = c_1(|x_1 − 2a|^κ + Σ_{i=2}^d |x_i|^κ) + c_2 if x_1 ≤ 4a, and f_0(x) otherwise. Let P_0 and P_1 be the distributions of the T queried observations {X_i, f̂(X_i), ĝ(X_i)} under f_0 and f_1.
17 Lower bounds based on active learning (cont.). Fano's inequality: inf_{x̂} sup_f P(‖x̂ − x*_f‖ > ‖x*_{f_0} − x*_{f_1}‖/2) ≥ constant if KL(P_0, P_1) ≤ constant. KL(P_0, P_1) is controlled by the query that yields the maximum difference between function/gradient values [Castro-Nowak 07], and stays ≤ constant if ‖x*_{f_0} − x*_{f_1}‖/2 = a ≍ T^{−1/(2κ−2)}.
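The computation behind "KL(P_0, P_1) ≤ constant" is explicit when the oracle noise is Gaussian; a sketch (adaptive queries handled by the usual chain-rule bound, constants suppressed):

```latex
\begin{align*}
  KL(P_0, P_1)
    &\le \sum_{t=1}^{T}
       \max_{x}\frac{(f_0(x) - f_1(x))^2
          + \lVert g_0(x) - g_1(x)\rVert^2}{2\sigma^2} \\
    &\lesssim \frac{T}{\sigma^2}\, a^{2\kappa - 2}
       \quad\text{(the functions differ only on $x_1 \le 4a$,
          where gradients differ by $O(a^{\kappa-1})$),}
\end{align*}
% which stays bounded by a constant exactly when a \asymp T^{-1/(2\kappa-2)}.
```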
18 Lower bounds based on active learning (cont.). The same Fano argument also yields lower bounds for uniformly convex functions and for the zeroth-order oracle, which match Iouditski-Nesterov 10 and Jamieson-Nowak-Recht 12.
19 Epoch-based gradient descent. Initialize e = 1 and x_1, T_1, R_1, σ_1; until the oracle budget T is exhausted, run projected gradient descent within radius R_e of x_e with budget T_e, then update x_{e+1}, R_{e+1}, T_{e+1}. Requires knowledge of κ for the schedules. If f is a convex function that satisfies TNC(κ) and is Lipschitz: inf_{x̂} sup_f E[‖x̂ − x*_f‖] = Õ(T^{−1/(2κ−2)}) and E[f(x̂) − f(x*)] = Õ(T^{−κ/(2κ−2)}). (A sketch of the epoch structure follows.)
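A hedged Python sketch of the epoch structure; in the actual algorithm the schedules for T_e, R_e, and the step size depend on κ, and the choices below are only illustrative:

```python
def epoch_gd(oracle, x0, R1, T):
    """Epoch-based projected gradient descent for TNC(kappa) functions.
    Each epoch runs projected SGD in a ball of shrinking radius R around
    the previous epoch's average iterate; schedules are illustrative."""
    x_center = np.array(x0, dtype=float)
    R, budget_left = R1, T
    per_epoch = max(1, int(T / max(np.log(T), 1.0)))  # ~log T epochs
    while budget_left > 0:
        T_e = min(budget_left, per_epoch)
        x = x_center.copy()
        avg = np.zeros_like(x)
        for t in range(1, T_e + 1):
            _, g_hat = oracle.query(x)
            x = x - (R / np.sqrt(t)) * g_hat       # step size ~ R_e / sqrt(t)
            diff = x - x_center                    # project onto the R-ball
            nrm = np.linalg.norm(diff)
            if nrm > R:
                x = x_center + R * diff / nrm
            avg += (x - avg) / t                   # running average iterate
        x_center, R = avg, R / 2                   # re-center, shrink radius
        budget_left -= T_e
    return x_center
```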
20 Adapting to degree of convexity.
21 Adapting to degree of convexity (ignoring κ). For e = 1, …, E = log(√T / log T): run any optimization procedure that is optimal for convex functions with sample budget T_e = T/E; set R_{e+1} = R_e/2 and e ← e + 1. Adapted from Iouditski-Nesterov 10. Why it works: the rate for convex Lipschitz functions gives f(x_e) − f(x*_e) ≲ R_e/√T_e, and since λ‖x_e − x*_e‖^κ ≤ f(x_e) − f(x*_e), there exists an epoch ē such that ‖x_ē − x*_ē‖ ≲ T^{−1/(2κ−2)} with x*_ē = x*.
22-25 (Animation builds of the procedure on slide 21, starting from R_1 = 1.) The search radius halves each epoch; at some epoch ē the iterate x_ē is within T^{−1/(2κ−2)} of x*, and for all e ≥ ē, ‖x_e − x_ē‖ ≲ T^{−1/(2κ−2)}, so later epochs cannot drift away from x*.
26 Adapting to degree of convexity (summary). Without knowledge of κ, the adaptive procedure achieves, for any convex Lipschitz f satisfying TNC(κ): inf_{x̂} sup_f E[‖x̂ − x*_f‖] = Õ(T^{−1/(2κ−2)}) and E[f(x̂) − f(x*)] = Õ(T^{−κ/(2κ−2)}). (A minimal sketch of the wrapper follows.)
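A minimal sketch of the κ-free wrapper, assuming any black-box `base_optimizer(oracle, center, radius, budget)` that is minimax optimal for plain convex Lipschitz functions (the epoch count below follows the slide's E = log(√T / log T); all names are illustrative):

```python
def adaptive_epochs(oracle, x0, R1, T, base_optimizer):
    """kappa-free adaptation: rerun an optimal convex-Lipschitz method in
    epochs while halving the search radius. TNC(kappa) guarantees that some
    epoch's center is already T^{-1/(2*kappa-2)}-close to x*, and later
    epochs cannot drift away."""
    E = max(1, int(np.log(np.sqrt(T) / max(np.log(T), 1.0))))
    T_e = max(1, T // E)                       # per-epoch sample budget
    x, R = np.array(x0, dtype=float), R1
    for _ in range(E):
        x = base_optimizer(oracle, x, R, T_e)  # optimal convex method
        R = R / 2                              # halve the radius
    return x
```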
27 Adaptive active learning.
28 Adaptive 1-dim active learning. Robust binary search, adaptive to κ (ignoring κ): for e = 1, …, E = log(√T / log T), do passive learning with sample budget T_e = T/E, then set R_{e+1} = R_e/2 and e ← e + 1. Adapted from Iouditski-Nesterov 10.
29-30 Why it works: the passive rate for threshold classifiers gives Risk(x_e) − Risk(x*_e) ≲ R_e/√T_e, and since c‖x_e − x*_e‖^κ ≤ Risk(x_e) − Risk(x*_e), there exists ē such that ‖x_ē − x*_ē‖ ≲ T^{−1/(2κ−2)} with x*_ē = x*; moreover, for all e ≥ ē, ‖x_e − x_ē‖ ≲ T^{−1/(2κ−2)}. Much simpler than Hanneke 09. (A hedged sketch follows.)
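The active learning analogue replaces the convex-optimization subroutine with passive learning of a threshold on the current interval; a hedged 1-dim sketch, with uniform sampling and an empirical-risk-minimizing threshold standing in for the talk's passive learner, reusing the `ActiveOracle` above:

```python
def robust_binary_search(oracle, lo, hi, T):
    """Adaptive 1-dim active learning: each epoch, sample x uniformly on
    the current interval, fit the empirical-risk-minimizing threshold,
    then halve the interval around it. Adaptive to the TNC exponent."""
    rng = np.random.default_rng()
    E = max(1, int(np.log(np.sqrt(T) / max(np.log(T), 1.0))))
    T_e = max(1, T // E)                       # passive budget per epoch
    for _ in range(E):
        xs = np.sort(rng.uniform(lo, hi, T_e))
        ys = np.array([oracle.query(x) for x in xs])
        # empirical 0/1 risk of predicting 0 left / 1 right of each split
        errs = [np.sum(ys[:k] == 1) + np.sum(ys[k:] == 0)
                for k in range(T_e + 1)]
        k = int(np.argmin(errs))
        t_hat = xs[min(k, T_e - 1)]            # estimated threshold
        half = (hi - lo) / 4                   # halve the interval width
        lo, hi = max(lo, t_hat - half), min(hi, t_hat + half)
    return 0.5 * (lo + hi)
```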
31 Reference & future directions. A. Ramdas and A. Singh, "Optimal rates for stochastic convex optimization under Tsybakov noise condition," to appear at ICML (available on arXiv). Open directions:
- Reduction from d-dim convex optimization to 1-dim active learning for κ-TNC functions (κ ≠ 2)?
- Adaptive d-dimensional active learning / model selection in active learning?
- Porting active learning results to yield non-convex optimization guarantees?
More information