Optimal kernel methods for large scale learning


1 Optimal kernel methods for large scale learning. Alessandro Rudi, INRIA - École Normale Supérieure, Paris. Joint work with Luigi Carratino and Lorenzo Rosasco. 6 Mar 2018, École Polytechnique.

2 Learning problem. The problem (P): find $f_{\mathcal{H}} = \operatorname{argmin}_{f \in \mathcal{H}} \mathcal{E}(f)$, where $\mathcal{E}(f) = \int d\rho(x, y)\,(y - f(x))^2$ and $\rho$ is unknown, but $(x_i, y_i)_{i=1}^n$ i.i.d. samples are given. Remarks: this is a stochastic optimization problem; $\mathcal{H}$ is a space of candidate solutions.

3 Outline: Learning with kernels; Random projections; FALKON: random projections and preconditioning.

4 Kernel ridge regression. Let $K$ be a p.d. kernel (e.g. $K(x, x') = e^{-\gamma\|x - x'\|^2}$) and $\mathcal{H} = \overline{\operatorname{span}}\{K(x, \cdot) \mid x \in X\}$. Problem $P_n$: $\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$.
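A minimal numpy sketch of the Gaussian kernel above (an illustrative helper, not code from the talk); the matrices built this way are the ones whose size and cost are counted on the next slides.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """Gaussian kernel matrix: K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # clip tiny negative values caused by floating-point roundoff
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))
```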

5 KRR: Statistics (worst case). Noise: $\mathbb{E}[|Y|^p \mid X = x] \le \frac{1}{2} p!\, \sigma^2 b^{p-2}$ for all $p \ge 2$. Kernel boundedness: $\sup_x K(x, x) < \infty$. Best model: there exists $f_{\mathcal{H}} \in \mathcal{H}$ solving $\min_{f \in \mathcal{H}} \mathcal{E}(f)$.

6 KRR: Statistics (worst case). Noise: $\mathbb{E}[|Y|^p \mid X = x] \le \frac{1}{2} p!\, \sigma^2 b^{p-2}$ for all $p \ge 2$. Kernel boundedness: $\sup_x K(x, x) < \infty$. Best model: there exists $f_{\mathcal{H}} \in \mathcal{H}$ solving $\min_{f \in \mathcal{H}} \mathcal{E}(f)$. Theorem [Caponnetto, De Vito '05]: under the assumptions above, $\mathbb{E}\,\mathcal{E}(\hat f_\lambda) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{1}{\lambda n} + \lambda$. By selecting $\lambda_n = \frac{1}{\sqrt{n}}$, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{1}{\sqrt{n}}$.
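The choice $\lambda_n = 1/\sqrt{n}$ simply balances the two terms of the bound; a one-line check (not on the original slide, via AM-GM):

```latex
\frac{1}{\lambda n} + \lambda \;\ge\; \frac{2}{\sqrt{n}},
\qquad \text{with equality at } \lambda = \frac{1}{\sqrt{n}},
\qquad \text{so} \qquad
\mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n}) - \mathcal{E}(f_{\mathcal{H}}) \;\lesssim\; \frac{1}{\sqrt{n}}.
```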

7 KRR: Statistics (refined case). Let $Lf(x') = \mathbb{E}[K(x', x)f(x)]$ and $\mathcal{N}(\lambda) = \operatorname{Tr}\!\left((L + \lambda I)^{-1} L\right)$. Capacity condition: $\mathcal{N}(\lambda) = O(\lambda^{-\gamma})$, $\gamma \in [0, 1]$. Source condition: $f_{\mathcal{H}} \in \operatorname{Range}(L^r)$, $r \ge 1/2$.

8 KRR: Statistics (refined case). Let $Lf(x') = \mathbb{E}[K(x', x)f(x)]$ and $\mathcal{N}(\lambda) = \operatorname{Tr}\!\left((L + \lambda I)^{-1} L\right)$. Capacity condition: $\mathcal{N}(\lambda) = O(\lambda^{-\gamma})$, $\gamma \in [0, 1]$. Source condition: $f_{\mathcal{H}} \in \operatorname{Range}(L^r)$, $r \ge 1/2$. Theorem [Caponnetto, De Vito '05]: under (basic) and (refined), $\mathbb{E}\,\mathcal{E}(\hat f_\lambda) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{\mathcal{N}(\lambda)}{n} + \lambda^{2r}$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
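The refined choice of $\lambda_n$ again balances variance and bias, using the capacity bound $\mathcal{N}(\lambda) = O(\lambda^{-\gamma})$ (a sketch, not on the slide):

```latex
\frac{\mathcal{N}(\lambda)}{n} + \lambda^{2r}
\;\asymp\; \frac{\lambda^{-\gamma}}{n} + \lambda^{2r},
\qquad
\frac{\lambda^{-\gamma}}{n} = \lambda^{2r}
\;\Longleftrightarrow\;
\lambda = n^{-\frac{1}{2r+\gamma}},
\qquad
\text{giving the rate } \lambda^{2r} = n^{-\frac{2r}{2r+\gamma}}.
```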

10 KRR: Optimization. $\hat f_\lambda(x) = \sum_{i=1}^n K(x, x_i)\, c_i$, with $c$ solving the linear system $(\hat K + \lambda n I)\, c = \hat y$. Computations: space $O(n^2)$, kernel eval. $O(n^2)$, time $O(n^3)$. Large scale ML: running out of time and space... can we fix it?
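A hedged sketch of exact KRR on a precomputed kernel matrix, making the linear system and its $O(n^2)$ space / $O(n^3)$ time costs concrete (illustrative only):

```python
import numpy as np

def krr_fit(K, y, lam):
    """Solve (K + lam*n*I) c = y for the coefficients c.
    Storing K costs O(n^2) memory; the dense solve costs O(n^3) time."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(K_test_train, c):
    """f(x) = sum_i K(x, x_i) c_i, i.e. a kernel-vector product per test point."""
    return K_test_train @ c
```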

11 Computations for optimal statistical accuracy. Model: $O(n)$; Space: $O(n^2)$; Kernel eval.: $O(n^2)$; Time: $O(n^3)$.

12 Outline: Learning with kernels; Random projections; FALKON: random projections and preconditioning.

13 Random projections. Solve $P_n$ on $\mathcal{H}_M = \operatorname{span}\{K(\tilde x_1, \cdot), \ldots, K(\tilde x_M, \cdot)\}$: $\hat f_{\lambda,M} = \operatorname{argmin}_{f \in \mathcal{H}_M} \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$.

14 Random projections. Solve $P_n$ on $\mathcal{H}_M = \operatorname{span}\{K(\tilde x_1, \cdot), \ldots, K(\tilde x_M, \cdot)\}$, that is, pick $M$ columns at random: $\hat f_{\lambda,M}(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_i$, with $c$ solving the linear system $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$. - Nyström methods (Smola, Schölkopf '00). - Gaussian processes: inducing inputs (Quiñonero-Candela et al. '05).
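A minimal sketch of the Nyström estimator above; `kernel` stands for any function returning the kernel matrix between two sets of points (e.g. the Gaussian kernel sketched earlier), and the landmarks are drawn uniformly at random:

```python
import numpy as np

def nystrom_krr_fit(X, y, kernel, lam, M, seed=0):
    """Solve (K_nM^T K_nM + lam*n*K_MM) c = K_nM^T y with M random landmarks."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=M, replace=False)   # "pick M columns at random"
    X_M = X[idx]
    K_nM = kernel(X, X_M)                        # n x M block of the kernel matrix
    K_MM = kernel(X_M, X_M)                      # M x M block
    H = K_nM.T @ K_nM + lam * n * K_MM           # forming K_nM^T K_nM costs O(n M^2)
    c = np.linalg.solve(H, K_nM.T @ y)
    return X_M, c                                # predict with kernel(X_test, X_M) @ c
```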

15 Nyström KRR: Computations. Linear system: $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$. Computations (train): space $O(n^2) \to O(M_n^2)$, kernel eval. $O(n^2) \to O(n M_n)$, time $O(n^3) \to O(n M_n^2)$. Computations (test): $O(n) \to O(M_n)$.

16 Nyström KRR: Statistics (worst case). Theorem [Rudi, Camoriano, Rosasco '15]: under the basic assumptions, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M}$.

17 Nyström KRR: Statistics (worst case). Theorem [Rudi, Camoriano, Rosasco '15]: under the basic assumptions, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M}$. By selecting $\lambda_n = \frac{1}{\sqrt{n}}$ and $M_n = \frac{1}{\lambda_n} = \sqrt{n}$, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n,M_n}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{1}{\sqrt{n}}$.

18 Remarks. $M = O(\sqrt{n})$ suffices for optimal generalization. Previous works: only for fixed design (Bach '13; Alaoui, Mahoney '15; Yang et al. '15; Musco, Musco '16). Matches the statistical minimax lower bounds [Caponnetto, De Vito '05]. Special case: Sobolev spaces with $s = d/2$, e.g. exponential kernel and Fourier features. Same statistical bound as (kernel) ridge regression [Caponnetto, De Vito '05].

19 Nyström KRR: Statistics (refined). Theorem [Rudi, Camoriano, Rosasco '15]: under (basic) and (refined), $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{\mathcal{N}(\lambda)}{n} + \lambda^{2r} + \frac{1}{M}$.

20 Nyström KRR: Statistics (refined). Theorem [Rudi, Camoriano, Rosasco '15]: under (basic) and (refined), $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{\mathcal{N}(\lambda)}{n} + \lambda^{2r} + \frac{1}{M}$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$ and $M_n = \frac{1}{\lambda_n}$, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n,M_n}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.

21 Remarks. The obtained rate is minimax optimal [Caponnetto, De Vito '05]. It reduces to the worst case for $\gamma = 1$, $r = 1/2$. Comparison with random features [Rudi, Rosasco '17]: $M = n^c$.

22 Computations required for the $1/\sqrt{n}$ rate. Model: $O(\sqrt{n})$; Space: $O(n)$; Kernel eval.: $O(n\sqrt{n})$; Time: $O(n^2)$. Possible improvements: adaptive sampling; optimization.

23 Outline: Learning with kernels; Random projections; FALKON: random projections and preconditioning.

24 Beyond $O(n^2)$ time? $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$. Bottleneck: computing $\hat K_{nM}^\top \hat K_{nM}$ requires $O(nM^2)$ time.

25 Optimization to the rescue. Write $\underbrace{\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM}}_{H}\, c = \underbrace{\hat K_{nM}^\top \hat y}_{b}$. Idea: first-order methods, $c_t = c_{t-1} - \frac{\tau}{n}\left[\hat K_{nM}^\top(\hat K_{nM} c_{t-1} - \hat y_n) + \lambda n \hat K_{MM} c_{t-1}\right]$. Pros: requires $O(nMt)$ time. Cons: $t \approx \kappa(H)$ can be arbitrarily large, where $\kappa(H) = \sigma_{\max}(H)/\sigma_{\min}(H)$ is the condition number.
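A hedged sketch of this first-order scheme on the Nyström system; the step size `tau` is a user-chosen parameter and no preconditioning is applied yet, so the iteration count is governed by $\kappa(H)$:

```python
import numpy as np

def nystrom_gd(K_nM, K_MM, y, lam, tau, t_max):
    """Gradient descent on H c = b with H = K_nM^T K_nM + lam*n*K_MM, b = K_nM^T y."""
    n, M = K_nM.shape
    c = np.zeros(M)
    for _ in range(t_max):
        # each iteration costs O(n M): two products with K_nM, one with K_MM
        grad = K_nM.T @ (K_nM @ c - y) + lam * n * (K_MM @ c)
        c -= (tau / n) * grad
    return c
```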

26 Preconditioning. Idea: solve an equivalent linear system with a better condition number. Preconditioning: $Hc = b \;\longrightarrow\; P^\top H P\, \beta = P^\top b$, with $c = P\beta$. Ideally $PP^\top = H^{-1}$, so that $t = O(\kappa(H))$ becomes $t = O(1)$! Computing a good preconditioner can be hard.
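A tiny numerical illustration (toy matrix, not the kernel system) of the ideal case $PP^\top = H^{-1}$, under which the preconditioned matrix has condition number one:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 20))
H = A.T @ A + 1e-6 * np.eye(20)       # an ill-conditioned positive definite matrix
print(np.linalg.cond(H))              # large

L = np.linalg.cholesky(H)             # H = L L^T
P = np.linalg.inv(L).T                # then P P^T = H^{-1}
print(np.linalg.cond(P.T @ H @ P))    # ~1: first-order methods now need O(1) steps
```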

27 Remarks. Preconditioning KRR (Fasshauer et al. '12; Avron et al. '16; Cutajar et al. '16; Ma, Belkin '17): $H = \hat K + \lambda n I$. Can we precondition Nyström-KRR?

28 Preconditioning Nyström-KRR. Consider $H := \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM}$. Proposed preconditioner: $PP^\top = \left(\frac{n}{M} \hat K_{MM}^2 + \lambda n \hat K_{MM}\right)^{-1}$. Compare to the naive preconditioner $PP^\top = \left(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM}\right)^{-1}$.

29 Baby FALKON. Proposed preconditioner: $PP^\top = \left(\frac{n}{M} \hat K_{MM}^2 + \lambda n \hat K_{MM}\right)^{-1}$. Gradient descent: $\hat f_{\lambda,M,t}(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_{t,i}$, with $c_t = P\beta_t$ and $\beta_t = \beta_{t-1} - \frac{\tau}{n}\, P^\top\left[\hat K_{nM}^\top(\hat K_{nM} P\beta_{t-1} - \hat y_n) + \lambda n \hat K_{MM} P\beta_{t-1}\right]$.
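A minimal sketch of this preconditioned iteration (illustrative; $P$ is formed explicitly from a Cholesky factor for clarity, and $\hat K_{MM}$ is assumed positive definite):

```python
import numpy as np

def baby_falkon(K_nM, K_MM, y, lam, tau, t_max):
    """Preconditioned gradient descent in the beta variable; returns c_t = P beta_t."""
    n, M = K_nM.shape
    B = (n / M) * K_MM @ K_MM + lam * n * K_MM   # P P^T = B^{-1}
    P = np.linalg.inv(np.linalg.cholesky(B)).T   # explicit P, fine for a small demo
    beta = np.zeros(M)
    for _ in range(t_max):
        c = P @ beta
        grad = K_nM.T @ (K_nM @ c - y) + lam * n * (K_MM @ c)
        beta -= (tau / n) * (P.T @ grad)
    return P @ beta
```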

30 FALKON. Gradient descent $\to$ conjugate gradient. Computing $P$: $P = \frac{1}{\sqrt{n}}\, T^{-1} A^{-1}$, where $T = \operatorname{chol}(\hat K_{MM})$, $A = \operatorname{chol}\left(\frac{1}{M} T T^\top + \lambda I\right)$, and $\operatorname{chol}(\cdot)$ is the Cholesky decomposition.
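A hedged sketch of how the two Cholesky factors might be computed and how $P$ can then be applied through triangular solves instead of explicit inverses (assumes scipy is available; illustrative, not the reference FALKON implementation):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def falkon_preconditioner(K_MM, lam, n):
    """Factors of P = (1/sqrt(n)) T^{-1} A^{-1}; assumes K_MM is positive definite."""
    M = K_MM.shape[0]
    T = cholesky(K_MM)                           # upper triangular, K_MM = T^T T
    A = cholesky(T @ T.T / M + lam * np.eye(M))  # upper triangular
    return T, A

def apply_P(T, A, v, n):
    """P v = (1/sqrt(n)) T^{-1} (A^{-1} v), computed with two triangular solves."""
    return solve_triangular(T, solve_triangular(A, v)) / np.sqrt(n)
```

Conjugate gradient is then run on the preconditioned system $P^\top H P\,\beta = P^\top b$, and the estimator is recovered as $c = P\beta_t$.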

31 FALKON statistics (worst case). Theorem: under (basic), when $M > \frac{\log n}{\lambda}$, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda,M,t}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M} + \exp\!\left[-t\left(1 - \frac{\log n}{\lambda M}\right)^{1/2}\right]$.

32 FALKON statistics (worst case). Theorem: under (basic), when $M > \frac{\log n}{\lambda}$, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda,M,t}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M} + \exp\!\left[-t\left(1 - \frac{\log n}{\lambda M}\right)^{1/2}\right]$. By selecting $\lambda_n = 1/\sqrt{n}$, $M_n = \frac{2\log n}{\lambda_n}$, $t_n = \log n$, then $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n,M_n,t_n}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{1}{\sqrt{n}}$.

33 FALKON statistics (refined results). Theorem: under (basic) and (refined), when $M > \frac{\log n}{\lambda}$, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda,M,t}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{\mathcal{N}(\lambda)}{n} + \lambda^{2r} + \frac{1}{M} + \exp\!\left[-t\left(1 - \frac{\log n}{\lambda M}\right)^{1/2}\right]$.

34 FALKON statistics (refined results). Theorem: under (basic) and (refined), when $M > \frac{\log n}{\lambda}$, $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda,M,t}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \frac{\mathcal{N}(\lambda)}{n} + \lambda^{2r} + \frac{1}{M} + \exp\!\left[-t\left(1 - \frac{\log n}{\lambda M}\right)^{1/2}\right]$. By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, $M_n = \frac{2\log n}{\lambda_n}$, $t_n = \log n$, then $\mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n,M_n,t_n}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.

35 Remarks. Relevant works: SGD; RF-KRR (Rahimi, Recht '07; Bach '15; Rudi, Rosasco '17); divide and conquer (Zhang et al. '13); NYTRO (Angles et al. '16); Nyström SGD (Lin, Rosasco '16). Same statistical properties and memory requirements; much smaller time complexity.

36 Proof: bridging statistics and optimization. Lemma: let $\delta > 0$, $\kappa_P := \kappa(P^\top H P)$, $c_\delta = c_0 \log\frac{1}{\delta}$. When $\lambda \ge \frac{1}{n}$, with probability $1 - \delta$, $\mathcal{E}(\hat f_{\lambda,M,t}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f_{\mathcal{H}}) + c_\delta \exp(-t/\sqrt{\kappa_P})$.

37 Proof: bridging statistics and optimization. Lemma: let $\delta > 0$, $\kappa_P := \kappa(P^\top H P)$, $c_\delta = c_0 \log\frac{1}{\delta}$. When $\lambda \ge \frac{1}{n}$, with probability $1 - \delta$, $\mathcal{E}(\hat f_{\lambda,M,t}) - \mathcal{E}(f_{\mathcal{H}}) \lesssim \mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f_{\mathcal{H}}) + c_\delta \exp(-t/\sqrt{\kappa_P})$. Lemma: let $\delta \in (0, 1]$, $\lambda > 0$. When $M = \frac{2\log\frac{1}{\delta}}{\lambda}$, then with probability $1 - \delta$, $\kappa(P^\top H P) \le \left(1 - \frac{\log\frac{1}{\delta}}{\lambda M}\right)^{-1} < 4$.

38 Computational implications. Cost for optimal generalization with FALKON: $M_n = O(\sqrt{n})$, $t_n = \log n$. Model: $O(\sqrt{n})$; Space: $O(n)$; Kernel eval.: $O(n\sqrt{n})$; Time: $O(n^2) \to O(n\sqrt{n})$.

39 In practice. Higgs dataset: n = 10,000,000, M = 50,... (accompanying plot not preserved in the transcription).

40 Some experiments. Table: MillionSongs (n ≈ 10^6), YELP (n ≈ 10^6) and TIMIT (n ≈ 10^6) datasets. Metrics: MSE / relative error / time(s) on MillionSongs, RMSE / time(m) on YELP, c-err / time(h) on TIMIT. Methods compared: FALKON, Prec. KRR, Hierarchical, D&C, Rand. Feat., Nyström, ADMM R.F., BCD R.F., BCD Nyström, KRR, EigenPro, Deep NN, Sparse Kernels, Ensemble. Most numeric entries are missing from the transcription; the surviving TIMIT times are: FALKON 1.5h, BCD R.F. 1.7h, BCD Nyström 1.7h, KRR 8.3h, EigenPro 3.9h. Times obtained on: a cluster of 128 EC2 r3.2xlarge machines; a cluster of 8 EC2 r3.8xlarge machines; a single machine with two Intel Xeon E5-2620 CPUs, one Nvidia GTX Titan X GPU and 128GB of RAM; a cluster with 512 GB of RAM and an IBM POWER8 12-core processor; unknown platform.

41 Some more experiments. Table: SUSY (n ≈ 10^6), HIGGS (n ≈ 10^7) and IMAGENET (n ≈ 10^6) datasets. Metrics: c-err / AUC / time(m) on SUSY, AUC / time(h) on HIGGS, c-err / time(h) on IMAGENET. Methods compared: FALKON, EigenPro, Hierarchical, Boosted Decision Tree, Neural Network, Deep Neural Network, Inception-V4. Surviving entries: SUSY c-err 19.6% (FALKON), 19.8% (EigenPro), 20.1% (Hierarchical); IMAGENET time 4h (FALKON). Architectures: a cluster with an IBM POWER8 12-core CPU and 512 GB RAM; a single machine with two Intel Xeon E5-2620 CPUs, one Nvidia GTX Titan X GPU and 128GB RAM; a single machine.

42 Contributions. Best computations so far for optimal statistics: space $O(n)$, time $O(n\sqrt{n})$. Random projections + iterative solvers + preconditioning... fast rates... adaptive sampling. Next? Distributed architectures... Open the kernel black box!

43 Proving $\kappa(P^\top H P) \approx 1$. Let $K_x = K(x, \cdot) \in \mathcal{H}$, $C = \int K_x \otimes K_x \, d\rho_X(x)$, $\hat C_n = \frac{1}{n}\sum_{i=1}^n K_{x_i} \otimes K_{x_i}$, $\hat C_M = \frac{1}{M}\sum_{j=1}^M K_{\tilde x_j} \otimes K_{\tilde x_j}$. Recall that $P = \frac{1}{\sqrt{n}}\, T^{-1} A^{-1}$, $T = \operatorname{chol}(\hat K_{MM})$, $A = \operatorname{chol}\left(\frac{1}{M} T T^\top + \lambda I\right)$. Steps: 1. $P^\top H P = A^{-\top} V^\top (\hat C_n + \lambda I)\, V A^{-1}$.

44 Proving $\kappa(P^\top H P) \approx 1$ (setup as on slide 43). Steps: 2. $P^\top H P = A^{-\top} V^\top (\hat C_M + \lambda I)\, V A^{-1} + A^{-\top} V^\top (\hat C_n - \hat C_M)\, V A^{-1}$.

45 Proving $\kappa(P^\top H P) \approx 1$ (setup as on slide 43). Steps: 3. $P^\top H P = I + A^{-\top} V^\top (\hat C_n - \hat C_M)\, V A^{-1}$.

46 Proving $\kappa(P^\top H P) \approx 1$ (setup as on slide 43). Steps: 3. $P^\top H P = I + E$, with $E = A^{-\top} V^\top (\hat C_n - \hat C_M)\, V A^{-1}$.

47 Proving $\kappa(P^\top H P) \approx 1$ (setup as on slide 43). Steps: 4. $\kappa(P^\top H P) = \kappa(I + E) \le \frac{1 + \|E\|}{1 - \|E\|}$, when $\|E\| < 1$.

48 Proving $\kappa(P^\top H P) \approx 1$ (setup as on slide 43). Steps: 5. $\|E\| = \|A^{-\top} V^\top (\hat C_n - \hat C_M)\, V A^{-1}\| \le 1/2$ w.h.p. when $M \ge \frac{c_0 \log\frac{1}{\delta}}{\lambda}$.
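Collecting the steps above into a single chain (a compact sketch; $\|\cdot\|$ is the operator norm, and the identity in the middle is exactly the property the proposed preconditioner is built to satisfy):

```latex
P^\top H P
= A^{-\top} V^\top (\hat C_n + \lambda I)\, V A^{-1}
= \underbrace{A^{-\top} V^\top (\hat C_M + \lambda I)\, V A^{-1}}_{=\,I}
  + \underbrace{A^{-\top} V^\top (\hat C_n - \hat C_M)\, V A^{-1}}_{=\,E},
\qquad
\kappa(P^\top H P) \;\le\; \frac{1 + \|E\|}{1 - \|E\|} \;\le\; 3
\quad \text{when } \|E\| \le \tfrac{1}{2}.
```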
