Inverse Time Dependency in Convex Regularized Learning
1 Inverse Time Dependency in Convex Regularized Learning. Zeyuan A. Zhu (Tsinghua University), Weizhu Chen (MSRA), Chenguang Zhu (Tsinghua University), Gang Wang (MSRA), Haixun Wang (MSRA), Zheng Chen (MSRA). December 7.
2 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error. (Example borrowed from Shalev-Shwartz & Srebro's slides.)
3 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error; or 1 hour, 2.3% error.
4 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error; or 1 hour, 2.3% error; or 10 minutes, 2.3% error. Can we?
5 Observation. When the desired accuracy is fixed, the runtime decreases as the number of samples increases: inverse time dependency.
6 Our Contribution. We propose a Primal Gradient Solver (PGS) and prove its inverse time dependency property. This work generalizes the state-of-the-art $\ell_2$-SVM result to $\ell_p$-norm regularizers with convex loss functions. By bounding $S$ (the domain of $w$), PGS is able to support more loss functions, for example least square. It is the first to demonstrate that both the logistic loss and the least square loss can be adopted into PGS and achieve the inverse time dependency property.
7 Error Decomposition. err(w) decomposes into: optimization error — the error due to the early stopping of the optimization algorithm; estimation error — the extra error due to the difference between the training set and the real sample distribution; approximation error — the best achievable error for the given model. (Example borrowed from Shalev-Shwartz & Srebro's slides.)
8 Error Decomposition (continued). [Figure: for a data set of size n, the estimation error (blue), the prediction error (green), and the desired accuracy (their sum) are plotted against the number of iterations of stochastic gradient descent, i.e., against the total running time.]
9 Error Decomposition (continued). Remark: the generalization error (formally defined later) = optimization error + estimation error.
10 Error Decomposition. $\hat F_\sigma(\hat w) \le \hat F_\sigma(\hat w^\ast) + \frac{C \log T}{\sigma T \delta}$; $\ell(\hat w) - \ell(w_0) \le O\!\left(\frac{1}{\sigma T \delta}\right) + O\!\left(\frac{1}{\sigma m}\right) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$; $T = O\!\left(\dfrac{1/\delta}{\frac{\varepsilon^2 (p-1)}{2\|w_0\|_p^2} - O(1/m)}\right)$.
11 Convex Regularized Learning. Training set $\Psi = \{(x_i, y_i)\}_{i=1}^m$ with $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$. Minimize $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max\{0, 1 - y_i \langle w, x_i \rangle\}$.
12 Convex Regularized Learning. (Same objective, repeated.)
13 Convex Regularized Learning. The term $\frac{\sigma}{2}\|w\|^2$ is the regularizer.
14 Convex Regularized Learning. The term $\frac{1}{m}\sum_{i=1}^m \max\{0, 1 - y_i \langle w, x_i \rangle\}$ is the loss.
15 Convex Regularized Learning. Write $\Psi = \{\theta_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$. The SVM objective $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max\{0, 1 - y_i \langle w, x_i \rangle\}$ generalizes to $F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m \ell(\langle w, \theta_i \rangle; \theta_i)$. The $\ell_p$-norm regularizer: $r(w) = \frac{1}{2(p-1)}\|w\|_p^2$, $p \in (1, 2]$. The SVM hinge loss: $\ell(\langle w, \theta \rangle; \theta) = \max\{0, 1 - y\langle w, x \rangle\}$. The logistic loss: $\ell(\langle w, \theta \rangle; \theta) = \log\!\left(1 + e^{-y\langle w, x \rangle}\right)$. The least square loss: $\ell(\langle w, \theta \rangle; \theta) = (\langle w, x \rangle - y)^2$.
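As a concrete reference, here is a minimal NumPy sketch of these losses, the $\ell_p$ regularizer, and the resulting empirical objective (function and variable names are my own, not from the paper):

```python
import numpy as np

def hinge_loss(w, x, y):
    """SVM hinge loss: max{0, 1 - y <w, x>}."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def logistic_loss(w, x, y):
    """Logistic loss: log(1 + exp(-y <w, x>))."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

def least_square_loss(w, x, y):
    """Least square loss: (<w, x> - y)^2."""
    return (np.dot(w, x) - y) ** 2

def lp_regularizer(w, p):
    """l_p-norm regularizer r(w) = ||w||_p^2 / (2(p-1)), for p in (1, 2]."""
    return np.linalg.norm(w, ord=p) ** 2 / (2.0 * (p - 1.0))

def empirical_objective(w, X, Y, sigma, p, loss=hinge_loss):
    """Empirical objective F_sigma(w) = sigma * r(w) + average loss over the training set."""
    losses = [loss(w, x, y) for x, y in zip(X, Y)]
    return sigma * lp_regularizer(w, p) + np.mean(losses)
```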
16 Generalization. Generalization objective: $F_\sigma(w) = \sigma\, r(w) + \mathbb{E}_{\theta \sim \mathrm{Dist}}\left[\ell(\langle w, \theta \rangle; \theta)\right]$. Empirical objective: $\hat F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m \ell(\langle w, \theta_i \rangle; \theta_i)$.
17 Generalization. (Same objectives, repeated.)
18 Inverse Time Dependency. (Same objectives, repeated.)
19 Generalization. The optimization error $\varepsilon_{acc}$ satisfies $\hat F_\sigma(\hat w) \le \hat F_\sigma(\hat w^\ast) + \varepsilon_{acc}$, where $\hat w$ is the predictor returned by the solver and $\hat w^\ast$ minimizes $\hat F_\sigma$. The generalization error $\varepsilon$ satisfies $\ell(\hat w) \le \ell(w_0) + \varepsilon$, where $\ell(w)$ denotes the expected loss and $w_0$ is the optimal predictor in $S$.
20 Generalization. Theorem 1 links the running time $T$ to the optimization error $\varepsilon_{acc}$ (satisfying $\hat F_\sigma(\hat w) \le \hat F_\sigma(\hat w^\ast) + \varepsilon_{acc}$), and Theorem 2 links the optimization error to the generalization error $\varepsilon$ (satisfying $\ell(\hat w) \le \ell(w_0) + \varepsilon$); together they bound the running time needed to reach a desired generalization error.
21 Thm 1 - Primal Gradient Solver. We will create an algorithm based on stochastic gradient descent, and then build the relationship between $\varepsilon_{acc}$ and $T$.
INPUT: training sample space $\Psi = \{\theta_1, \dots, \theta_m\}$; $p$, $\sigma$, $T$, $k$.
INITIALIZE: $w_0 \leftarrow 0$, $\lambda \leftarrow 0$, $q \leftarrow 1/(1 - 1/p)$.
FOR $t = 1, 2, \dots, T$:
  Choose $A_t \subseteq \Psi$ satisfying $|A_t| = k$.
  Set $g_t(w) \leftarrow \frac{1}{|A_t|}\sum_{\theta \in A_t} \ell(\langle w, \theta \rangle; \theta)$.
  Choose $\lambda_t \in \partial g_t(w_{t-1})$.
  Let $\lambda \leftarrow \lambda - \lambda_t$.
  Define $w_t \leftarrow \nabla r^\ast\!\left(\frac{\lambda}{(t+1)\sigma}\right)$, where $r(w) = \frac{1}{2(p-1)}\|w\|_p^2$.
Return a random $w_i \in \{w_1, \dots, w_T\}$ as the linear predictor.
22-27 Thm 1 - Primal Gradient Solver. (These slides repeat the pseudocode from slide 21, highlighting one step at a time.)
28 Thm 1 - Primal Gradient Solver. (The pseudocode from slide 21 is repeated.) Thm 1: with probability at least $1 - \delta$ over the choices of $A_1, \dots, A_T$ and the index $i$, $\hat F_\sigma(w_i) \le \hat F_\sigma(\hat w^\ast) + \frac{C \log T}{\sigma T \delta}$.
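For illustration, here is a minimal NumPy sketch of the solver specialized to the hinge loss and $p = 2$, where the link function reduces to $w_t = \lambda / ((t+1)\sigma)$; the general $\ell_p$ link function is spelled out on slide 41. The function name, the random sampling of $A_t$, and the uniform choice of the returned iterate are my own assumptions about unstated details:

```python
import numpy as np

def pgs_hinge_l2(X, Y, sigma, T, k, rng=np.random.default_rng(0)):
    """Primal Gradient Solver sketch: hinge loss, l_2 regularizer (p = 2).
    X: (m, n) sample matrix, Y: (m,) labels in {-1, +1}."""
    m, n = X.shape
    w = np.zeros(n)              # current predictor w_{t-1}
    lam = np.zeros(n)            # accumulated negative subgradients
    iterates = []
    for t in range(1, T + 1):
        idx = rng.choice(m, size=k, replace=False)   # A_t with |A_t| = k
        A_x, A_y = X[idx], Y[idx]
        # Subgradient of g_t(w) = mean hinge loss over A_t, evaluated at w_{t-1}.
        margins = A_y * (A_x @ w)
        active = margins < 1.0
        grad = -(A_y[active, None] * A_x[active]).sum(axis=0) / k
        lam -= grad                       # lambda <- lambda - lambda_t
        w = lam / ((t + 1) * sigma)       # w_t = grad r*(lambda / ((t+1) sigma)) for p = 2
        iterates.append(w.copy())
    return iterates[rng.integers(len(iterates))]      # return a random iterate
```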
29 Thm 1 - Primal Gradient Solver. The analysis builds on results for online strongly convex optimization. Thm 1: with probability at least $1 - \delta$ over the choices of $A_1, \dots, A_T$ and the index $i$, $\hat F_\sigma(w_i) \le \hat F_\sigma(\hat w^\ast) + \frac{C \log T}{\sigma T \delta}$.
30 Theorem 2. Thm 2: Suppose $\hat w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $\ell(\hat w) \le \ell(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\!\left(\dfrac{1/\delta}{\frac{\varepsilon^2 (p-1)}{2\|w_0\|_p^2} - O(1/m)}\right)$. Sketch of the proof, using the oracle inequality: $\ell(\hat w) - \ell(w_0) = \left[F_\sigma(\hat w) - F_\sigma(w^\ast)\right] + \left[F_\sigma(w^\ast) - F_\sigma(w_0)\right] - \frac{\sigma}{2(p-1)}\|\hat w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$.
31 Theorem 2. By Theorem 1 together with the oracle inequality, the first bracket is at most $O\!\left(\frac{1}{\sigma T \delta}\right) + O\!\left(\frac{1}{\sigma m}\right)$, so $\ell(\hat w) - \ell(w_0) \le O\!\left(\frac{1}{\sigma T \delta}\right) + O\!\left(\frac{1}{\sigma m}\right) + \left[F_\sigma(w^\ast) - F_\sigma(w_0)\right] - \frac{\sigma}{2(p-1)}\|\hat w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$. (Oracle inequality: Karthik Sridharan, Nathan Srebro, and Shai Shalev-Shwartz, "Fast Rates for Regularized Objectives," NIPS 2008.)
32 Theorem 2. The bracket $F_\sigma(w^\ast) - F_\sigma(w_0)$ is non-positive, since $w^\ast$ minimizes $F_\sigma$ over $S$ and $w_0 \in S$; the term $-\frac{\sigma}{2(p-1)}\|\hat w\|_p^2$ is non-positive as well, so both can be dropped from the bound.
33 Theorem 2. Dropping the non-positive terms gives $\ell(\hat w) - \ell(w_0) \le O\!\left(\frac{1}{\sigma T \delta}\right) + O\!\left(\frac{1}{\sigma m}\right) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$.
34 Error Decomposition. Summary: $\hat F_\sigma(\hat w) \le \hat F_\sigma(\hat w^\ast) + \frac{C \log T}{\sigma T \delta}$; $\ell(\hat w) - \ell(w_0) \le O\!\left(\frac{1}{\sigma T \delta}\right) + O\!\left(\frac{1}{\sigma m}\right) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$; $T = O\!\left(\dfrac{1/\delta}{\frac{\varepsilon^2 (p-1)}{2\|w_0\|_p^2} - O(1/m)}\right)$. Recall the definitions: $\delta$ is the confidence parameter, $\varepsilon$ the desired generalization error, $p$ the exponent of the $p$-norm regularizer, $w_0$ the optimal predictor, and $m$ the number of samples.
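To see where the iteration bound and the inverse dependence on $m$ come from, the following sketch solves the slide-33 inequality for $T$; the specific choice of $\sigma$ and the constants are illustrative assumptions, not taken verbatim from the slides:

```latex
% Require l(w_hat) - l(w_0) <= eps and pick sigma = eps(p-1)/||w_0||_p^2,
% so that the regularizer term sigma ||w_0||_p^2 / (2(p-1)) equals eps/2.
% The remaining eps/2 must cover the optimization and estimation terms:
\[
  O\!\Big(\tfrac{1}{\sigma T \delta}\Big) \;\le\; \tfrac{\varepsilon}{2} - O\!\Big(\tfrac{1}{\sigma m}\Big)
  \quad\Longrightarrow\quad
  T \;=\; O\!\left(\frac{1/\delta}{\dfrac{\varepsilon^{2}(p-1)}{2\lVert w_0\rVert_p^{2}} - O\!\big(\tfrac{1}{m}\big)}\right).
\]
% The denominator grows with m, so for a fixed target error eps the number of
% iterations T (and hence the running time) shrinks as the data set grows:
% this is the inverse time dependency.
```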
35 Experimental Results. Accuracy of PGS on the CCAT dataset, in comparison with the best achievable accuracy by Quasi-Newton (QN) within 2 hours. Regularizer | Loss | QN Accuracy (2 hours) | PGS Accuracy | PGS Training Time: $\ell_2$ | Logistic Regression | … | … ± … | … sec; $\ell_{1.8}$ | Logistic Regression | … | … ± … | … sec; $\ell_2$ | Least Square | … | … ± … | … sec. Speed of PGS on the CCAT dataset, to achieve an accuracy of 94%: PGS needs 10 seconds for $p = 2$ and 20 seconds for $p = 1.8$, while Quasi-Newton needs 600 seconds for both.
36 Experimental Results.
37 Experimental Results.
38 Further Discussion. $\ell_p$-norm regularizers beyond $p \in (1, 2]$? Non-linear predictors? Kernels? Welcome to my talk on Tuesday, 2-4 PM: P-packSVM: Parallel Primal gradient descent Kernel SVM. Other applications?
39 Conclusion. A fast Primal Gradient Solver for $\ell_p$-norm regularized convex learning. Generalization error = optimization error + estimation error. The running time is inversely dependent on the input data size.
40 Thanks. Questions? Acknowledgment: Shai Shalev-Shwartz from the Hebrew University.
41 Thm 1 - Primal Gradient Solver. Explicit calculation of $w_t = \nabla r^\ast\!\left(\frac{\lambda}{(t+1)\sigma}\right)$. We use the superscript $(j)$ to denote the $j$-th coordinate of a vector.
1. INPUT: $\lambda$, $p$, $S$. Let $n$ be the feature dimension.
2. FOR $j = 1, 2, \dots, n$:
3.   $w_t^{(j)} \leftarrow \frac{1}{q-1} \left\lvert \frac{\lambda^{(j)}}{(t+1)\sigma} \right\rvert^{q-1} \mathrm{sgn}\!\left(\lambda^{(j)}\right) \Big/ \left\lVert \frac{\lambda}{(t+1)\sigma} \right\rVert_q^{q-2}$
4. IF $S = \mathbb{R}^n$, RETURN $w_t$.
5. IF $S = \{w : \|w\|_p \le B\}$:
6.   IF $\|w_t\|_p > B$ THEN $w_t \leftarrow \frac{B}{\|w_t\|_p} w_t$.
7. RETURN $w_t$.
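A minimal NumPy sketch of this link-function computation, including the optional projection onto the $\ell_p$ ball $S = \{w : \|w\|_p \le B\}$; names are mine, and it should be read as an illustration of the formula above rather than the paper's implementation:

```python
import numpy as np

def link_function(lam, t, sigma, p, B=None):
    """Map the accumulated negative subgradients lam to the next iterate
    w_t = grad r*(lam / ((t+1) * sigma)) for r(w) = ||w||_p^2 / (2(p-1)),
    optionally followed by projection onto S = {w : ||w||_p <= B}."""
    q = 1.0 / (1.0 - 1.0 / p)            # dual exponent, 1/p + 1/q = 1
    theta = lam / ((t + 1) * sigma)      # scaled dual vector
    norm_q = np.linalg.norm(theta, ord=q)
    if norm_q == 0.0:
        return np.zeros_like(theta)
    # Coordinate-wise gradient of the conjugate regularizer r*.
    w = (np.abs(theta) ** (q - 1)) * np.sign(theta) / ((q - 1) * norm_q ** (q - 2))
    if B is not None:                    # project back onto the l_p ball of radius B
        norm_p = np.linalg.norm(w, ord=p)
        if norm_p > B:
            w = (B / norm_p) * w
    return w
```

For $p = 2$ (so $q = 2$) this reduces to $w_t = \lambda / ((t+1)\sigma)$, matching the specialization used in the earlier solver sketch.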