Statistical Optimality of Stochastic Gradient Descent through Multiple Passes
1 Statistical Optimality of Stochastic Gradient Descent through Multiple Passes
Francis Bach, INRIA - Ecole Normale Supérieure, Paris, France
Joint work with Loucas Pillaud-Vivien and Alessandro Rudi
Newton Institute, Cambridge - June 2018
4 Two-minute summary
- Stochastic gradient descent for large-scale machine learning: processes observations one by one.
- Theory: single-pass SGD is optimal, but only for easy problems.
- Practice: multiple-pass SGD always works better.
- Provable for hard problems: quantification of the required number of passes, with optimal statistical performance.
- Source and capacity conditions from kernel methods.
9 Least-squares regression in finite dimension
- Data: n i.i.d. observations (x_i, y_i) ∈ X × R, i = 1, ..., n.
- Prediction as linear functions ⟨θ, Φ(x)⟩ of features Φ(x) ∈ H = R^d.
- Optimal predictor θ* ∈ H minimizing F(θ) = E(y − ⟨θ, Φ(x)⟩)².
- Assumption: ‖Φ(x)‖ ≤ R almost surely.
- Assumption: |y| ≤ M and |y − ⟨θ*, Φ(x)⟩| ≤ σ almost surely.
- Statistical performance of an estimator θ̂ measured by EF(θ̂) − F(θ*).
- Finite dimension: optimal rate σ² dim(H)/n = σ² d/n, attained by empirical risk minimization (ERM) and SGD.
- What if n ≪ dim(H)? Needs assumptions on Σ = E[Φ(x) ⊗ Φ(x)] and θ*.
11 Spectrum of the covariance matrix Σ = E[Φ(x) ⊗ Φ(x)]
- Eigenvalues λ_m(Σ), in decreasing order.
- Example: news dataset. [Figure: log λ_m vs. log m, showing approximately linear (i.e., polynomial) decay.]
- Assumption: tr(Σ^{1/α}) = Σ_m λ_m(Σ)^{1/α} is small (compared to n), equivalent to λ_m(Σ) = O(m^{−α}).
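The capacity condition can be checked numerically on a synthetic covariance with polynomial eigenvalue decay (the values α = 2 and d = 1000 below are illustrative choices, not from the slides): with λ_m = m^{−α}, the trace tr(Σ^{1/α}) grows only logarithmically with the dimension, so it can remain small even when d ≫ n.

```python
import numpy as np

# Synthetic covariance spectrum with polynomial decay lambda_m = m^{-alpha}
# (alpha = 2 and d = 1000 are arbitrary illustration values).
alpha = 2.0
d = 1000
eigvals = np.arange(1, d + 1, dtype=float) ** (-alpha)

# Capacity condition: tr(Sigma^{1/alpha}) = sum_m lambda_m^{1/alpha}.
# With lambda_m = m^{-alpha}, lambda_m^{1/alpha} = 1/m, so the trace is the
# harmonic number H_d, which grows only like log(d).
trace_pow = np.sum(eigvals ** (1.0 / alpha))
print(trace_pow)               # ~ H_1000 ≈ log(1000) + 0.577
print(np.log(d) + 0.5772)      # close to the value above
```

So a thousand-dimensional problem with α = 2 has an "effective dimension" tr(Σ^{1/α}) below 8, which is the quantity the refined bounds use in place of d.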
12 Difficulty of the learning problem
- Measuring difficulty through the norm of θ*.
- Assumption: ‖Σ^{1/2−r} θ*‖ is small (compared to n).
- r = 1/2: usual assumption of ‖θ*‖ small.
- Larger r: simpler problems; smaller r: harder problems (r = 0 is always true).
17 Optimal statistical performance
- Easy problems, r ≥ (α−1)/(2α): optimal rate O(n^{−2rα/(2rα+1)}), achieved by:
  - Regularized ERM (Caponnetto and De Vito, 2007)
  - Early-stopped gradient descent (Yao et al., 2007)
  - Single-pass averaged SGD (Dieuleveut and Bach, 2016)
- Hard problems, r ≤ (α−1)/(2α): lower bound O(n^{−2rα/(2rα+1)}); best known upper bound O(n^{−2r}).
19 Least-mean-squares (LMS) algorithm
- Least squares: F(θ) = ½ E[(y − ⟨Φ(x), θ⟩)²] with θ ∈ R^d.
- SGD = least-mean-squares algorithm (see, e.g., Macchi, 1995). Iteration: θ_i = θ_{i−1} − γ(⟨Φ(x_i), θ_{i−1}⟩ − y_i) Φ(x_i).
- New analysis for averaging and constant step size γ = 1/(4R²) (Bach and Moulines, 2013):
  - Assume ‖Φ(x)‖ ≤ R and |y − ⟨Φ(x), θ*⟩| ≤ σ almost surely.
  - No assumption on the lowest eigenvalues of Σ.
  - Main result: EF(θ̄_n) − F(θ*) ≤ 4σ²d/n + 4R²‖θ_0 − θ*‖²/n.
- Matches the statistical lower bound (Tsybakov, 2003).
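A minimal NumPy sketch of the averaged constant-step-size recursion on synthetic Gaussian data (the dimension, sample size, noise level, and the plug-in estimate of R² below are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 20000, 0.5
theta_star = rng.standard_normal(d)

# Stream of i.i.d. observations: y = <theta*, x> + noise.
X = rng.standard_normal((n, d))
y = X @ theta_star + sigma * rng.standard_normal(n)

# Constant step size gamma = 1/(4 R^2), with R^2 estimated from the data.
R2 = np.mean(np.sum(X**2, axis=1))
gamma = 1.0 / (4.0 * R2)

theta = np.zeros(d)
theta_bar = np.zeros(d)
for i in range(n):
    # LMS recursion: theta_i = theta_{i-1} - gamma (<x_i, theta_{i-1}> - y_i) x_i
    theta -= gamma * (X[i] @ theta - y[i]) * X[i]
    theta_bar += (theta - theta_bar) / (i + 1)   # running average of iterates

# Excess risk F(theta) - F(theta*) = (theta - theta*)^T Sigma (theta - theta*);
# here Sigma = I, so it is the squared distance to theta*.
excess_bar = np.sum((theta_bar - theta_star) ** 2)
excess_last = np.sum((theta - theta_star) ** 2)
print(excess_bar, excess_last)  # averaging gives the much smaller error
```

With these settings the averaged iterate lands well within the 4σ²d/n + 4R²‖θ_0 − θ*‖²/n bound, while the last iterate keeps fluctuating at a step-size-dependent level, which is the Markov-chain picture of the next slides.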
24 Markov chain interpretation of constant step sizes
- LMS recursion: θ_i = θ_{i−1} − γ(⟨Φ(x_i), θ_{i−1}⟩ − y_i) Φ(x_i).
- The sequence (θ_i)_i is a homogeneous Markov chain: it converges to a stationary distribution π_γ with expectation θ̄_γ := ∫ θ π_γ(dθ). For least squares, θ̄_γ = θ*.
- θ_n does not converge to θ* but oscillates around it.
- Ergodic theorem: averaged iterates converge to θ̄_γ = θ* at rate O(1/n).
- See Dieuleveut, Durmus, and Bach (2017) for more details.
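The Markov-chain picture can be illustrated numerically: at stationarity, the (non-averaged) iterate fluctuates around θ* with a spread that shrinks roughly linearly with the step size γ. The sketch below uses arbitrary problem constants and burn-in length:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 4, 1.0
theta_star = np.zeros(d)

def stationary_distance(gamma, n=50000, burn=10000):
    """Run the LMS recursion and return the time-averaged squared distance
    of the (non-averaged) iterate to theta*, after a burn-in period."""
    theta = np.ones(d)  # start away from theta*
    acc, count = 0.0, 0
    for i in range(n):
        x = rng.standard_normal(d)
        y = x @ theta_star + sigma * rng.standard_normal()
        theta -= gamma * (x @ theta - y) * x
        if i >= burn:
            acc += np.sum((theta - theta_star) ** 2)
            count += 1
    return acc / count

big = stationary_distance(0.05)
small = stationary_distance(0.0125)
print(big, small)  # stationary spread shrinks roughly linearly with gamma
```

This is why the last iterate cannot converge with a constant step: it reaches the stationary distribution π_γ and stays there; only averaging removes the oscillation.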
25 Simulations - synthetic examples
Gaussian distributions, d = 20. [Figure: log₁₀[F(θ) − F(θ*)] vs. log₁₀(n) for constant step sizes γ ∈ {1/(2R²), 1/(8R²), 1/(32R²)}, with an n^{−1} reference rate.]
26 Simulations - benchmarks
alpha (d = 500) and news datasets. [Figure: test error log₁₀[F(θ) − F(θ*)] vs. log₁₀(n) for step sizes 1/R² and 1/(R² n^{1/2}), compared with SAG.]
28 Optimal bounds for least squares?
- Least squares: cannot beat σ²d/n (Tsybakov, 2003). Really? What if d ≫ n?
- Needs assumptions on Σ = E[Φ(x) ⊗ Φ(x)] and θ*.
32 Finer assumptions (Dieuleveut and Bach, 2016)
Covariance eigenvalues:
- Pessimistic assumption: all eigenvalues λ_m bounded by a constant.
- Actual decay λ_m = o(m^{−α}), with tr Σ^{1/α} = Σ_m λ_m^{1/α} small.
- New result: replace σ²d/n by σ²(γn)^{1/α} tr Σ^{1/α} / n.
Optimal predictor:
- Pessimistic assumption: ‖θ_0 − θ*‖² finite/small.
- Finer assumption: ‖Σ^{1/2−r}(θ_0 − θ*)‖² small, for r ∈ [0, 1]. Always satisfied for r = 0 and θ_0 = 0, since ‖Σ^{1/2}θ*‖² ≤ E y².
- New result: replace ‖θ_0 − θ*‖²/(γn) by ‖Σ^{1/2−r}(θ_0 − θ*)‖²/(γ^{2r} n^{2r}).
35 Optimal bounds for least squares?
- Least squares: cannot beat σ²d/n (Tsybakov, 2003). Really? What if d ≫ n?
- Refined assumptions with adaptivity (Dieuleveut and Bach, 2016), beyond strong convexity or lack thereof:
  EF(θ̄_n) − F(θ*) ≤ inf_{α ≥ 1, r ∈ [0,1]} [ 4σ² tr Σ^{1/α} (γn)^{1/α}/n + 4‖Σ^{1/2−r}θ*‖²/(γ^{2r} n^{2r}) ]
- Previous results correspond to α = +∞ and r = 1/2.
- Optimal step size γ potentially decaying with n, but it depends on the usually unknown quantities α and r: no adaptivity (yet).
- Extension to non-parametric estimation (using kernels), with optimal rates when r ≥ (α−1)/(2α), still with O(n²) running time.
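The two terms of the bound trade off against each other in γ: the variance term grows with the step size while the bias term shrinks. A small numerical evaluation (all constants below are illustrative stand-ins for σ², tr Σ^{1/α}, and ‖Σ^{1/2−r}θ*‖², not values from the slides):

```python
import numpy as np

# Illustrative constants (assumed, not from the slides).
sigma2, alpha, r = 1.0, 2.0, 0.25
trace_term = 2.0      # stands in for tr(Sigma^{1/alpha})
source_term = 1.0     # stands in for ||Sigma^{1/2-r} theta*||^2

def bound(gamma, n):
    # Variance term: 4 sigma^2 tr(Sigma^{1/alpha}) (gamma n)^{1/alpha} / n
    variance = 4 * sigma2 * trace_term * (gamma * n) ** (1 / alpha) / n
    # Bias term: 4 ||Sigma^{1/2-r} theta*||^2 / (gamma^{2r} n^{2r})
    bias = 4 * source_term / (gamma ** (2 * r) * n ** (2 * r))
    return variance + bias

# Optimizing over the step size on a grid, the bound decreases with n.
gammas = np.logspace(-4, 0, 200)
best_1e4 = min(bound(g, 1e4) for g in gammas)
best_1e6 = min(bound(g, 1e6) for g in gammas)
print(best_1e4, best_1e6)  # the optimized bound decreases with n
```

The grid search over γ is exactly what "no adaptivity (yet)" refers to: the best γ depends on α and r, which are unknown in practice.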
37 From least squares to non-parametric estimation
- Extension to Hilbert spaces: predictions ⟨Φ(x), θ⟩ with θ ∈ H, and the same recursion θ_i = θ_{i−1} − γ(⟨Φ(x_i), θ_{i−1}⟩ − y_i) Φ(x_i).
- If θ_0 = 0, then θ_i is a linear combination of Φ(x_1), ..., Φ(x_i): θ_i = Σ_{k=1}^{i} a_k Φ(x_k), with a_i = −γ Σ_{k=1}^{i−1} a_k ⟨Φ(x_k), Φ(x_i)⟩ + γ y_i.
- Kernel trick: k(x, x′) = ⟨Φ(x), Φ(x′)⟩; reproducing kernel Hilbert spaces and non-parametric estimation. See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004); Dieuleveut and Bach (2016).
- Still O(n²) overall running time.
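A small sanity check of the coefficient recursion: with an explicit finite-dimensional feature map (chosen arbitrarily for illustration; any feature map whose inner product defines the kernel would do), the kernel-space recursion on the coefficients a_k reproduces the feature-space SGD iterate exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Explicit finite-dimensional feature map, so the kernel recursion can be
# checked against the feature-space recursion (an arbitrary choice).
def phi(x):
    return np.array([x, x**2, np.cos(x)])

def kernel(x, xp):
    return phi(x) @ phi(xp)

n, gamma = 50, 0.1
xs = rng.uniform(-1, 1, n)
ys = np.sin(2 * xs) + 0.1 * rng.standard_normal(n)

# Feature-space SGD: theta_i = theta_{i-1} - gamma (<Phi(x_i), theta> - y_i) Phi(x_i)
theta = np.zeros(3)
# Kernel-space SGD: theta_i = sum_{k<=i} a_k Phi(x_k), with
# a_i = -gamma sum_{k<i} a_k k(x_k, x_i) + gamma y_i
a = np.zeros(n)
for i in range(n):
    pred_kernel = sum(a[k] * kernel(xs[k], xs[i]) for k in range(i))
    a[i] = -gamma * (pred_kernel - ys[i])
    theta -= gamma * (phi(xs[i]) @ theta - ys[i]) * phi(xs[i])

theta_from_a = sum(a[k] * phi(xs[k]) for k in range(n))
print(np.max(np.abs(theta - theta_from_a)))  # ~ 0: the two recursions coincide
```

The O(n²) running time is visible in the inner sum over k < i: each new coefficient costs O(i) kernel evaluations.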
40 Example: Sobolev spaces in one dimension
- X = [0, 1], functions represented through their Fourier series.
- Weighted Fourier basis Φ(x)_m = λ_m^{1/2} cos(2πmx) (plus sines); kernel k(x, x′) = Σ_m λ_m cos[2πm(x − x′)].
- λ_m ∝ m^{−α} corresponds to a Sobolev penalty on f_θ(x) = ⟨θ, Φ(x)⟩: ‖f_θ‖² = ‖θ‖² = Σ_m |Fourier(f_θ)_m|² λ_m^{−1} ∝ ∫₀¹ |f_θ^{(α/2)}(x)|² dx.
- The adapted norm ‖Σ^{1/2−r}θ‖² depends on the regularity of f_θ: ‖Σ^{1/2−r}θ‖² = Σ_m |Fourier(f_θ)_m|² λ_m^{−2r} ∝ ∫₀¹ |f_θ^{(rα)}(x)|² dx.
- Optimal rate is O(n^{−2rα/(2rα+1)}).
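A quick numerical check of this construction (the truncation level M and the value α = 2 are illustrative choices): summing λ_m cos[2πm(x − x′)] with λ_m = m^{−α} yields a symmetric positive semi-definite Gram matrix, as any valid kernel must.

```python
import numpy as np

# Truncated Sobolev-type kernel on [0, 1]:
#   k(x, x') = sum_{m=1}^{M} m^{-alpha} cos(2 pi m (x - x'))
# (M = 200 and alpha = 2 are illustration choices, not from the slides).
alpha, M = 2.0, 200

def sobolev_kernel(x, xp):
    m = np.arange(1, M + 1)
    return np.sum(m ** (-alpha) * np.cos(2 * np.pi * m * (x - xp)))

xs = np.linspace(0, 1, 30, endpoint=False)
K = np.array([[sobolev_kernel(x, xp) for xp in xs] for x in xs])

# A valid kernel matrix must be symmetric positive semi-definite:
# each cosine term is itself a PSD kernel (cos a cos b + sin a sin b),
# so the truncated sum is PSD up to floating-point rounding.
print(np.max(np.abs(K - K.T)))        # ~ 0 (symmetry)
print(np.min(np.linalg.eigvalsh(K)))  # nonnegative up to rounding (PSD)
```

Faster eigenvalue decay (larger α) gives a smoother RKHS, matching the penalty on the (α/2)-th derivative above.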
41 New assumption needed
- Assumption: ‖Σ^{μ/2−1/2} Φ(x)‖ almost surely small; already used by Steinwart et al. (2009).
- True for μ = 1; usually μ ≥ 1/α (with equality for Sobolev spaces).
- Relationship between the L∞ norm and the RKHS norm: ‖g‖_{L∞} = O(‖g‖^μ ‖g‖_{L₂}^{1−μ}).
- NB: implies bounded leverage scores (Rudi et al., 2015).
43 Multiple-pass SGD (sampling with replacement)
- Algorithm from n i.i.d. observations (x_i, y_i), i = 1, ..., n: θ_u = θ_{u−1} + γ(y_{i(u)} − ⟨θ_{u−1}, Φ(x_{i(u)})⟩) Φ(x_{i(u)}), with θ̄_t the averaged iterate after t ≥ n iterations.
- Theorem (Pillaud-Vivien, Rudi, and Bach, 2018). Assume r ≤ (α−1)/(2α) (hard problems).
  - If μ ≤ 2r, then after t = Θ(n^{α/(2rα+1)}) iterations: EF(θ̄_t) − F(θ*) = O(n^{−2rα/(2rα+1)}) [optimal].
  - Otherwise, after t = Θ(n^{1/μ} (log n)^{1/μ}) iterations: EF(θ̄_t) − F(θ*) ≤ O(n^{−2r/μ}) [improved over the single-pass rate O(n^{−2r})].
- Proof technique following Rosasco and Villa (2015).
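A finite-dimensional sketch of the multiple-pass recursion with sampling with replacement (the problem sizes, noise level, and number of passes below are arbitrary illustration values; the theory's choice t = Θ(n^{α/(2rα+1)}) requires knowing α and r):

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed dataset of n observations; SGD samples indices with replacement.
d, n = 10, 500
theta_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

gamma = 1.0 / (4.0 * np.mean(np.sum(X**2, axis=1)))
t = 4 * n  # several passes over the data (an arbitrary choice here)

theta = np.zeros(d)
theta_bar = np.zeros(d)
for u in range(t):
    i = rng.integers(n)  # sampling with replacement
    # Slide's recursion: theta_u = theta_{u-1} + gamma (y_i - <theta_{u-1}, x_i>) x_i
    theta += gamma * (y[i] - X[i] @ theta) * X[i]
    theta_bar += (theta - theta_bar) / (u + 1)

train_mse_init = np.mean(y**2)                  # trivial predictor 0
train_mse = np.mean((y - X @ theta_bar) ** 2)
print(train_mse_init, train_mse)  # multiple passes drive the error down
```

The point of the theorem is not that more passes fit the training data better (they always do), but that for hard problems the *test* error keeps improving for a number of passes growing polynomially with n.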
45 Proof sketch
- SGD recursion from the n observations: θ_u = θ_{u−1} + γ(y_{i(u)} − ⟨θ_{u−1}, Φ(x_{i(u)})⟩) Φ(x_{i(u)}), with averaged iterate θ̄_t after t ≥ n iterations.
- Following Rosasco and Villa (2015), consider the batch gradient recursion η_u = η_{u−1} + (γ/n) Σ_{i=1}^{n} (y_i − ⟨η_{u−1}, Φ(x_i)⟩) Φ(x_i), with averaged iterate η̄_t.
- As long as t = O(n^{1/μ}):
  - Property 1: EF(θ̄_t) − EF(η̄_t) = O(t^{1/α}/n).
  - Property 2: EF(η̄_t) − F(θ*) = O(t^{1/α}/n) + O(t^{−2r}).
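The exponents in the theorem come from balancing the two terms above: setting t^{1/α}/n = t^{−2r} gives t* = n^{α/(2rα+1)}, and plugging back yields the rate n^{−2rα/(2rα+1)}. A small exact-arithmetic check (the values α = 5/2, r = 1/5 match one of the simulation settings on the later slides):

```python
from fractions import Fraction

def optimal_exponents(alpha, r):
    # Balance t^{1/alpha}/n = t^{-2r}:  t^{1/alpha + 2r} = n,
    # so t* = n^{t_exp} with t_exp = 1/(1/alpha + 2r) = alpha/(2 r alpha + 1),
    # and the resulting rate is n^{rate_exp} with rate_exp = -2 r * t_exp.
    t_exp = 1 / (Fraction(1) / alpha + 2 * r)
    rate_exp = -2 * r * t_exp
    return t_exp, rate_exp

t_exp, rate_exp = optimal_exponents(Fraction(5, 2), Fraction(1, 5))
print(t_exp, rate_exp)  # 5/4 -1/2: t* = n^{5/4}, excess risk n^{-1/2}
```

So in this hard regime t* grows faster than n: more than one pass is provably needed, with the number of passes t*/n = n^{1/4} growing with the sample size.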
49 Simulations
- Synthetic examples: one-dimensional kernel regression in Sobolev spaces, with arbitrarily chosen values of r and α; check the optimal number of iterations over the data.
- Comparing three sampling schemes: with replacement, without replacement (random reshuffling at every pass), and cycling.
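The three sampling schemes differ only in how the index sequence i(u) is generated; a minimal illustration (the dataset size and number of passes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, passes = 5, 3

# With replacement: each step draws a uniformly random index.
with_replacement = [int(rng.integers(n)) for _ in range(n * passes)]

# Without replacement (random reshuffling): a fresh permutation each pass.
reshuffled = [int(i) for _ in range(passes) for i in rng.permutation(n)]

# Cycling: the same fixed order repeated every pass.
cycling = [i % n for i in range(n * passes)]

print(sorted(reshuffled[:n]) == list(range(n)))  # each pass visits every point once
print(cycling[:n] == cycling[n:2 * n])           # cycling repeats the same order
```

The theory above covers sampling with replacement; the simulations below compare all three schemes empirically.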
50 Simulations (sampling with replacement)
[Figure: optimal number of iterations log₁₀ t* vs. log₁₀ n in four settings, (α, r) = (3/2, 1/3), (4, 1/4), (5/2, 1/5), and (3, 1/6), spanning the easy regime r > (α−1)/(2α) and the hard regime r < (α−1)/(2α), with reference lines of slopes 1 and 1.25.]
51 Simulations (sampling without replacement)
[Figure: optimal number of iterations log₁₀ t* vs. log₁₀ n in the same four settings, (α, r) = (3/2, 1/3), (4, 1/4), (5/2, 1/5), and (3, 1/6), with reference lines of slopes 1 and 1.25.]
52 Simulations (cycling)
[Figure: optimal number of iterations log₁₀ t* vs. log₁₀ n in the same four settings, (α, r) = (3/2, 1/3), (4, 1/4), (5/2, 1/5), and (3, 1/6), with reference lines of slopes 1 and 1.25.]
53 Simulations - benchmarks
MNIST dataset with a linear kernel. [Figure: optimal number of passes vs. number of samples.]
55 Benefits of multiple passes
Conclusion:
- The number of passes grows with the sample size for hard problems.
- First provable improvement of multiple passes over single-pass SGD [NB: Hardt et al. (2016); Lin and Rosasco (2017) consider small step sizes].
Current work and extensions:
- Study of cycling and sampling without replacement (Shamir, 2016; Gürbüzbalaban et al., 2015).
- Mini-batches.
- Beyond least squares.
- Optimal efficient algorithms for the situation μ > 2r.
- Combining the analysis with exponential convergence of testing errors (Pillaud-Vivien, Rudi, and Bach, 2017).
56 References
- F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.
- A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3), 2007.
- A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4), 2016.
- A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Technical report, arXiv, 2017.
- M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. Technical report, arXiv, 2015.
- M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.
- J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1-47, 2017.
- O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, West Sussex, 1995.
- L. Pillaud-Vivien, A. Rudi, and F. Bach. Stochastic gradient methods with exponential convergence of testing errors. Technical report, arXiv, 2017.
- L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Technical report, arXiv, 2018.
- L. Rosasco and S. Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, 2015.
- A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, 2015.
- B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
- O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 46-54, 2016.
- J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
- I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proc. COLT, 2009.
- A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.
- Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2), 2007.
CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationGeneralization Bounds for Randomized Learning with Application to Stochastic Gradient Descent
Generalization Bounds for Randomized Learning with Application to Stochastic Gradient Descent Ben London blondon@amazon.com 1 Introduction Randomized algorithms are central to modern machine learning.
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationdans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés
Inférence pénalisée dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés Gersende Fort Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier Toulouse,
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationIterate Averaging as Regularization for Stochastic Gradient Descent
Proceedings of Machine Learning Research vol 75: 2, 208 3st Annual Conference on Learning Theory Iterate Averaging as Regularization for Stochastic Gradient Descent Gergely Neu Universitat Pompeu Fabra,
More informationLinearly-Convergent Stochastic-Gradient Methods
Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure
More informationDiffeomorphic Warping. Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel)
Diffeomorphic Warping Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel) What Manifold Learning Isn t Common features of Manifold Learning Algorithms: 1-1 charting Dense sampling Geometric Assumptions
More informationRegularization via Spectral Filtering
Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,
More informationStein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu and Dilin Wang NIPS 2016 Discussion by Yunchen Pu March 17, 2017 March 17, 2017 1 / 8 Introduction Let x R d
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationConvex relaxation for Combinatorial Penalties
Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationSpectral Regularization
Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,
More informationApproximation Theoretical Questions for SVMs
Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationTowards stability and optimality in stochastic gradient descent
Towards stability and optimality in stochastic gradient descent Panos Toulis, Dustin Tran and Edoardo M. Airoldi August 26, 2016 Discussion by Ikenna Odinaka Duke University Outline Introduction 1 Introduction
More informationRobust Support Vector Machines for Probability Distributions
Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,
More informationIncremental Gradient, Subgradient, and Proximal Methods for Convex Optimization
Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology February 2014
More informationDistribution Regression
Zoltán Szabó (École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) Dagstuhl Seminar 16481
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationDistributed Learning with Regularized Least Squares
Journal of Machine Learning Research 8 07-3 Submitted /5; Revised 6/7; Published 9/7 Distributed Learning with Regularized Least Squares Shao-Bo Lin sblin983@gmailcom Department of Mathematics City University
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationOPTIMIZATION METHODS IN DEEP LEARNING
Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate
More informationOptimization and Gradient Descent
Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function
More informationMachine Learning. Linear Models. Fabio Vandin October 10, 2017
Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w
More informationStochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos
1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationProbabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationNonlinear Optimization Methods for Machine Learning
Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks
More informationParallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence
Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More informationGradient Descent. Sargur Srihari
Gradient Descent Sargur srihari@cedar.buffalo.edu 1 Topics Simple Gradient Descent/Ascent Difficulties with Simple Gradient Descent Line Search Brent s Method Conjugate Gradient Descent Weight vectors
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationStatistically and Computationally Efficient Variance Estimator for Kernel Ridge Regression
Statistically and Computationally Efficient Variance Estimator for Kernel Ridge Regression Meimei Liu Department of Statistical Science Duke University Durham, IN - 27708 Email: meimei.liu@duke.edu Jean
More informationAn Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal
An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationLecture 3: Minimizing Large Sums. Peter Richtárik
Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationThe classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know
The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is
More informationThe classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA
The Bayes classifier Linear discriminant analysis (LDA) Theorem The classifier satisfies In linear discriminant analysis (LDA), we make the (strong) assumption that where the min is over all possible classifiers.
More informationScaling up Vector Autoregressive Models with Operator Random Fourier Features
Scaling up Vector Autoregressive Models with Operator Random Fourier Features Romain Brault, Néhémy Lim, Florence d Alché-Buc August, 6 Abstract A nonparametric approach to Vector Autoregressive Modeling
More information