Machine Learning in the Data Revolution Era


Slide 1: Machine Learning in the Data Revolution Era
Shai Shalev-Shwartz, School of Computer Science and Engineering, The Hebrew University of Jerusalem
Machine Learning Seminar Series, Google & University of Waterloo, July 2010

Slide 2: 10 Years Ago...
Allwein, Schapire & Singer (2000), which received the Best 10-Year Paper award at ICML 2010.
[Excerpt from the paper: Table 1, "Description of the datasets used in the experiments", listing the number of train/test examples, attributes, and classes for dermatology, satimage, glass, segmentation, ecoli, pendigits, yeast, vowel, soybean, thyroid, audiology, isolet, and letter; the surrounding text discusses the tradeoff between the error-correcting properties of the output codes and the difficulty of the binary learning problems they induce.]

Slide 3: Data Revolution
In 2009, more data was generated by individuals than in the entire history of mankind through...

Slide 4: Discriminative Learning
Document: "If only Tiger Woods could bond with fans as well as he did with his caddie on the Old Course practice range on Monday he might find the road to redemption all the more simple..."
About sport?

Slide 5: Discriminative Learning
Training set $\{(x_i, y_i)\}_{i=1}^m$ → Learning Algorithm (hypothesis set $H$, learning rule) → Output $h : X \to Y$
- $X$ - instance domain (e.g., documents)
- $Y$ - label set (e.g., document category: about sport?)
- Goal: learn a predictor $h : X \to Y$
- $H$ - set of possible predictors
- Learning rule - e.g., find $h \in H$ with minimal error on the training set

Slides 6-9: Large Scale Learning
- 10k training examples, 1 hour, 2.3% error
- 1M training examples, 1 week, 2.29% error
- We can always sub-sample and get an error of 2.3% using 1 hour
- Say we're happy with 2.3% error. What else can we gain from the larger data set?

Slides 10-11: How can more data help?
- Traditional: reduce error
- This talk: reduce runtime; compensate for missing information

Slide 12: Outline
How can more data reduce runtime?
- Learning using Stochastic Optimization (S. & Srebro 2008)
- Injecting Structure (S., Shamir, Sridharan 2010)
How can more data compensate for missing information?
- Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010). Technique: rely on Stochastic Optimization
- Ads Placement (Kakade, S., Tewari 2008). Technique: inject structure by exploration

Slides 13-14: Background on ML: Vapnik's General Learning Setting
Probably Approximately Solve:
$$\min_{h \in H} \ \mathbb{E}_{z \sim D}[f(h, z)]$$
where:
- $D$ is an unknown distribution
- we can only get samples $z_1, \dots, z_m$ drawn i.i.d. from $D$
- $H$ is a known hypothesis class
- $f(h, z)$ is the loss of using hypothesis $h$ on example $z$
Main question studied by Vapnik: how many examples are required to find $\hat{h} = A(z_1, \dots, z_m)$ such that
$$\mathbb{E}_{z \sim D}[f(\hat{h}, z)] - \min_{h \in H} \mathbb{E}_{z \sim D}[f(h, z)] \le \epsilon \quad \text{w.p.} \ge 1 - \delta.$$

Slide 15: Examples
- Document categorization: $z = (x, y)$, where $x$ represents a document and $y \in [k]$ is the topic; $h \in H$ is a predictor $h : X \to [k]$; $f(h, z) = \mathbb{1}[h(x) \ne y]$.
- k-means clustering: $z \in \mathbb{R}^d$; $h \in \mathbb{R}^{k \times d}$ specifies $k$ cluster centers; $f(h, z) = \min_j \|h_j - z\|$.
- Density estimation: $h$ is a parameter of a density $p_h(z)$; $f(h, z) = -\log p_h(z)$.
- Ads placement: $z = (x, c)$, where $x$ represents the context and $c \in [0,1]^k$ represents the gain of placing each ad; $h : X \to S_k$ returns the probability of placing each ad; $f(h, z) = \langle h(x), c \rangle$ is the expected gain of following policy $h$.
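
A minimal code sketch (not from the talk) of the four losses above, using NumPy; the function and argument names are illustrative, and the k-means loss is written with the squared distance.

    import numpy as np

    def categorization_loss(h, x, y):
        # Document categorization: 0-1 loss; h(x) is the predicted topic in {0, ..., k-1}
        return float(h(x) != y)

    def kmeans_loss(centers, z):
        # k-means: (squared) distance from z to the nearest of the k cluster centers
        return float(np.min(np.sum((centers - z) ** 2, axis=1)))

    def density_loss(log_density, z):
        # Density estimation: negative log-likelihood of z under the model p_h
        return -log_density(z)

    def ads_gain(policy, x, c):
        # Ads placement: policy(x) is a probability vector over the k ads and c their gains;
        # <policy(x), c> is the expected gain of following the policy (a gain, not a loss)
        return float(np.dot(policy(x), c))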

Slides 16-17: Vapnik's Learning Rule
Empirical Risk Minimization (ERM): return a hypothesis with the smallest risk on the training examples:
$$\hat{h} \in \operatorname*{argmin}_{h \in H} \ \frac{1}{m} \sum_{i=1}^m f(h, z_i)$$
Sample complexity: if $m \ge \Omega\!\left(\frac{C(H) \log(1/\delta)}{\epsilon^2}\right)$, where $C(H)$ is some complexity measure, then $\hat{h}$ is Probably Approximately Correct.
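
As a concrete illustration (not from the slides), here is a minimal ERM sketch over a finite hypothesis class, using threshold classifiers on [0,1] and the 0-1 loss; all names are illustrative.

    import numpy as np

    def erm(hypotheses, loss, samples):
        # Empirical Risk Minimization: return the hypothesis with the
        # smallest average loss on the training samples
        risks = [np.mean([loss(h, z) for z in samples]) for h in hypotheses]
        return hypotheses[int(np.argmin(risks))]

    # Toy usage: learn a threshold on the real line with the 0-1 loss
    hypotheses = [lambda x, t=t: int(x >= t) for t in np.linspace(0.0, 1.0, 101)]
    zero_one = lambda h, z: float(h(z[0]) != z[1])
    rng = np.random.default_rng(0)
    samples = [(x, int(x >= 0.3)) for x in rng.random(200)]
    best = erm(hypotheses, zero_one, samples)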

Slide 18: Example: Support Vector Machines
SVM is a very popular Machine Learning algorithm.

Slide 19: Support Vector Machines
SVM learning rule: given samples $(x_1, y_1), \dots, (x_m, y_m)$, solve
$$\min_w \ \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \max\{0, 1 - y_i \langle w, x_i \rangle\}$$
Casting it in the General Learning Setting:
- $z = (x, y)$, $h = w$
- $f(h, z) = \max\{0, 1 - y \langle w, x \rangle\} + \frac{\lambda}{2} \|w\|^2$
The SVM learning rule is exactly ERM for the above loss function.
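
A small sketch of the objective above as code (assumed names, not code from the talk): the rows of X are the examples x_i and y holds labels in {-1, +1}.

    import numpy as np

    def svm_objective(w, X, y, lam):
        # (lam/2) * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i * <w, x_i>)
        margins = y * (X @ w)
        hinge = np.maximum(0.0, 1.0 - margins)
        return 0.5 * lam * float(np.dot(w, w)) + float(hinge.mean())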

Slides 20-21: Solving the SVM problem
- The SVM learning rule can be rewritten as a Quadratic Optimization Problem.
- Solvable in polynomial time by standard solvers. Dedicated solvers also exist.
- End of story? Not quite... The runtime of standard solvers increases with $m$. What if $m$ is very large?

Slide 22: Machine Learning Perspective on Optimization
- Our real goal is not to minimize the SVM optimization problem; that problem is just a proxy for the real goal: find a hypothesis with small error on unseen data.
- We should study the runtime required to solve the Learning Problem:
  - $T(m, \epsilon)$ - runtime required to find an $\epsilon$-accurate solution w.r.t. unseen data when the number of training examples available is $m$
  - $T(\infty, \epsilon)$ - data-laden analysis: data is plentiful and computational complexity is the real bottleneck

Slides 23-24: ERM vs. Stochastic Optimization
Goal: $\min_h \mathbb{E}_{z \sim D}[f(h, z)]$
- ERM (traditional): $\min_h \frac{1}{m} \sum_{i=1}^m f(h, z_i)$
- SGD: $h \leftarrow h - \eta \nabla f(h, z_i)$
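
A minimal sketch of the SGD rule above (my own illustration, not code from the talk): each iteration draws a fresh example and takes a small step against the gradient of the loss on that single example.

    import numpy as np

    def sgd(grad_f, draw_example, w0, eta, steps, seed=0):
        # Stochastic Gradient Descent: w <- w - eta * grad_f(w, z), with a fresh z each step
        rng = np.random.default_rng(seed)
        w = np.array(w0, dtype=float)
        for _ in range(steps):
            z = draw_example(rng)
            w -= eta * grad_f(w, z)
        return w

    # Toy usage: least-squares regression, z = (x, y), loss (1/2) * (<w, x> - y)^2
    def draw(rng):
        x = rng.normal(size=3)
        return x, np.dot([1.0, -2.0, 0.5], x)   # y = <w*, x> with w* = (1, -2, 0.5)

    grad = lambda w, z: (np.dot(w, z[0]) - z[1]) * z[0]
    w_hat = sgd(grad, draw, np.zeros(3), eta=0.05, steps=5000)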

Slide 25: Data-Laden Analysis of SVM Solvers

Method                        | Optimization runtime    | T(∞, ε)
Interior Point                | m^3.5 · log log(1/ε)    | 1/ε^7
Dual decomposition            | m^2 · log(1/ε)          | 1/ε^4
SVM-Perf (Joachims 06)        | m/(λε)                  | 1/ε^4
SGD (S., Srebro, Singer 07)   | 1/(λε)                  | 1/ε^2

The best optimization method is the worst learning method...

Slide 26: More Data, Less Work
[Schematic plots of runtime vs. training set size for Dual Decomposition, SVM-Perf, and SGD (Pegasos).]

Slide 27: More Data, Less Work
[Theoretical plot and empirical plot (CCAT): runtime / millions of iterations (proportional to runtime) vs. training set size, marking the sample-complexity and data-laden regimes.]
Intuition: it is better to do simple operations on more examples than complicated operations on fewer examples.
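
A sketch of an SGD solver in the spirit of Pegasos mentioned on slide 26 (the projection step of the published algorithm is omitted, and the names are mine): each iteration touches a single example, so the per-iteration cost does not grow with m.

    import numpy as np

    def pegasos_style_sgd(X, y, lam, steps, seed=0):
        # SGD on the SVM objective with step size 1/(lam * t) and one example per step
        rng = np.random.default_rng(seed)
        m, d = X.shape
        w = np.zeros(d)
        for t in range(1, steps + 1):
            i = rng.integers(m)
            eta = 1.0 / (lam * t)
            margin_violated = y[i] * (X[i] @ w) < 1.0   # hinge loss active on example i?
            w *= (1.0 - eta * lam)                      # gradient of the regularization term
            if margin_violated:
                w += eta * y[i] * X[i]                  # hinge subgradient is -y_i * x_i
        return w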

Slide 28: Outline
How can more data reduce runtime?
- Learning using Stochastic Optimization (S. & Srebro 2008)
- Injecting Structure (S., Shamir, Sridharan 2010)
How can more data compensate for missing information?
- Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010). Technique: rely on Stochastic Optimization
- Ads Placement (Kakade, S., Tewari 2008). Technique: inject structure by exploration

Slide 29: Injecting Structure
Main idea: replace the original hypothesis class with a larger hypothesis class.
- On one hand: the larger class has more structure, so it is easier to optimize.
- On the other hand: a larger class means larger estimation error, so we need more examples.
[Diagram: the original hypothesis class nested inside the new, larger hypothesis class.]

Slides 30-35: Example - Learning 3-DNF
- Goal: learn a Boolean function $h : \{0,1\}^d \to \{0,1\}$
- Hypothesis class: 3-term DNF formulae, e.g. $h(x) = (x_1 \wedge x_3 \wedge x_7) \vee (x_4 \wedge x_2) \vee (x_5 \wedge x_9)$
- Sample complexity is of order $d/\epsilon$
- Kearns & Vazirani: if RP ≠ NP, it is not possible to efficiently find an $\epsilon$-accurate 3-DNF formula
- Claim: if $m \approx d^3/\epsilon$, it is possible to find a predictor with error $\le \epsilon$ in polynomial time

Slide 36: Proof
- Observation: a 3-term DNF formula $T_1 \vee T_2 \vee T_3$ can be rewritten as $\bigwedge_{u \in T_1, v \in T_2, w \in T_3} (u \vee v \vee w)$ for three sets of literals $T_1, T_2, T_3$.
- Define $\psi : \{0,1\}^d \to \{0,1\}^{2(2d)^3}$ such that for each triplet of literals $u, v, w$ there are two variables indicating whether $u \vee v \vee w$ is true or false.
- Observation: each 3-DNF can be represented as a single conjunction over $\psi(x)$.
- It is easy to learn a single conjunction (greedy or LP).
- [Diagram: 3-DNF formulae over $x$ sit inside the class of conjunctions over $\psi(x)$.]
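
To make the reduction concrete, here is a small sketch of my own (it assumes one indicator per triplet rather than the two per triplet on the slide, and learns the conjunction with the classic elimination algorithm rather than an LP):

    import numpy as np
    from itertools import product

    def psi(x):
        # Indicators of 'u OR v OR w' for every triplet of literals of x
        x = np.asarray(x, dtype=int)
        lits = np.concatenate([x, 1 - x])              # the 2d literals: x_j and NOT x_j
        return np.array([int(bool(u) or bool(v) or bool(w))
                         for u, v, w in product(lits, repeat=3)])

    def learn_conjunction(examples):
        # Elimination algorithm: keep only the features that are 1 on every positive example
        keep = None
        for phi, label in examples:
            if label == 1:
                keep = phi.copy() if keep is None else keep & phi
        if keep is None:                               # no positive example was seen
            keep = np.ones_like(examples[0][0])
        return lambda phi: int(np.all(phi[keep == 1] == 1))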

Slide 37: Trading samples for runtime

Algorithm               | samples | runtime
3-DNF over x            | d/ε     | 2^d
Conjunction over ψ(x)   | d³/ε    | poly(d)

[Schematic plot: runtime vs. m for the 3-DNF learner and the conjunction learner.]

Slide 38: More complicated example: Agnostic learning of Halfspaces with 0-1 loss
[Schematic plot: runtime vs. m, comparing Ben-David & Simon (2000) with S., Shamir, Sridharan (2010).]

Slide 39: Outline
How can more data reduce runtime?
- Learning using Stochastic Optimization (S. & Srebro 2008)
- Injecting Structure (S., Shamir, Sridharan 2010)
How can more data compensate for missing information?
- Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010). Technique: rely on Stochastic Optimization
- Ads Placement (Kakade, S., Tewari 2008). Technique: inject structure by exploration

Slide 40: Attribute efficient regression
- Each training example is a pair $(x, y) \in \mathbb{R}^d \times \mathbb{R}$
- Partial information: we can only view $O(1)$ attributes of each individual example

Slide 41: Sponsored Advertisement (contextual multi-armed bandit)
- Each training example is a pair $(x, c) \in \mathbb{R}^d \times [0,1]^k$
- Partial information: we don't know $c$; we can only guess some $y$ and then observe the value of $c_y$

Slide 42: How does more data help?
Three main techniques:
1. Missing information as noise
2. Active Exploration - try to fish out the relevant information
3. Inject structure - the problem is hard in the original representation but becomes simple in another representation (a different hypothesis class)
More data helps because:
1. It reduces variance - compensating for the noise
2. It allows more exploration
3. It compensates for the larger sample complexity due to using larger hypothesis classes

Slide 43: Attribute efficient regression
Formal problem statement:
- Unknown distribution $D$ over $\mathbb{R}^d \times \mathbb{R}$
- Goal: learn a linear predictor $x \mapsto \langle w, x \rangle$ with low risk $L_D(w) = \mathbb{E}_D[(\langle w, x \rangle - y)^2]$
- Training set: $(x_1, y_1), \dots, (x_m, y_m)$
- Partial information: for each $(x_i, y_i)$, the learner can view only $k$ attributes of $x_i$
- Active selection: the learner can choose which $k$ attributes to see
- Similar to "Learning with restricted focus of attention" (Ben-David & Dichterman 98)

Slide 44: Dealing with missing information
- Usually difficult - there are exponentially many ways to complete the missing information
- Popular approach: Expectation Maximization (EM)
- Previous methods usually do not come with guarantees (neither sample complexity nor computational complexity)

Slides 45-46: Ostrich approach
- Simply set the missing attributes to a default value
- Use your favorite full-information regression algorithm, e.g.
$$\min_w \ \sum_{i=1}^m (\langle w, x_i \rangle - y_i)^2 + \lambda \|w\|_p^p$$
  - Ridge Regression: $p = 2$; Lasso: $p = 1$
- Efficient. Correct? Sample complexity? How to choose the attributes?
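
For concreteness, a minimal sketch of this baseline (assumed names, my own illustration): the unobserved attributes are already filled with the default value 0 in X_filled, and ordinary ridge regression is run on the filled-in matrix.

    import numpy as np

    def ostrich_ridge(X_filled, y, lam):
        # Ridge regression on the zero-filled design matrix:
        # w = argmin_w sum_i (<w, x_i> - y_i)^2 + lam * ||w||_2^2
        m, d = X_filled.shape
        return np.linalg.solve(X_filled.T @ X_filled + lam * np.eye(d), X_filled.T @ y)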

Slide 47: Partial information as noise
Observation: $x = (x_1, x_2, \dots, x_d)^\top = \frac{1}{d} \sum_{j=1}^d (d\, x_j\, e_j)$, where $e_j$ is the $j$-th standard basis vector.
Therefore, choosing $i$ uniformly at random gives $\mathbb{E}_i[d\, x_i\, e_i] = x$.
If $\|x\| \le 1$ then $\|d\, x_i\, e_i\| \le d$ (i.e., the variance has increased).
We have reduced missing information to unbiased noise, and many examples can compensate for the added noise.
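
A quick numerical check of the observation above (illustrative code, not from the talk): averaging many one-attribute estimates d * x_i * e_i recovers x.

    import numpy as np

    def one_attribute_estimate(x, rng):
        # Observe a single uniformly chosen coordinate i and return d * x_i * e_i
        d = len(x)
        i = rng.integers(d)
        est = np.zeros(d)
        est[i] = d * x[i]
        return est

    rng = np.random.default_rng(0)
    x = rng.random(5)
    avg = np.mean([one_attribute_estimate(x, rng) for _ in range(100000)], axis=0)
    # avg is close to x: the estimate is unbiased, though a single estimate can have norm up to d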

Slide 48: Many examples compensate for noise
- True goal: minimize over $w$ the true risk $L_D(w) = \mathbb{E}_{(x,y)}[(\langle w, x \rangle - y)^2]$
- The loss on one example, $(\langle w, x \rangle - y)^2$, gives an unbiased estimate of $L_D(w)$
- Averaging the loss over many examples reduces the variance
- In our case, we construct an unbiased estimate of the loss of each single example; the variance increases, but many examples still reduce it back

Slides 49-51: Loss-Based Attribute Efficient Regression (LaBAER)
Theorem (Cesa-Bianchi, S., Shamir). Let $\hat{w}$ be the output of LaBAER. Then, with overwhelming probability,
$$L_D(\hat{w}) \le \min_{w : \|w\|_1 \le B} L_D(w) + \tilde{O}\!\left(\frac{d^2 B^2}{\sqrt{m}}\right),$$
where $d$ is the dimension and $m$ is the number of examples.
A factor of $d^4$ more examples compensates for the missing information! Can we do better?

Slide 52: Pegasos Attribute Efficient Regression (PAER)
- The factor $d^2$ in the bound arises because we estimate a matrix $x x^\top$
- Can we avoid estimating the matrix? Yes! Estimate the gradient of the loss instead of the loss
- Follow the Stochastic Gradient Descent approach
- This leads to a much better bound
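
A rough sketch of the gradient-estimation idea (my own illustration, not the published PAER algorithm, which samples attributes more carefully): the gradient of the squared loss is 2(<w, x> - y) x, and two independently sampled attributes give an unbiased estimate of it.

    import numpy as np

    def gradient_estimation_step(w, x, y, eta, rng):
        # Attribute-efficient SGD step: observe only two coordinates of x
        d = len(w)
        i, j = rng.integers(d), rng.integers(d)           # sampled independently
        inner_est = d * x[i] * w[i]                       # unbiased estimate of <w, x>
        grad_est = np.zeros(d)
        grad_est[j] = 2.0 * (inner_est - y) * d * x[j]    # unbiased estimate of 2(<w,x> - y) x
        return w - eta * grad_est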

Slide 53: Demonstration
- Full information classifiers (top line): error of 1.5%
- Our algorithms (bottom line): error of 4%

Slide 54: Demonstration
[Plots of test regression error vs. number of examples and vs. number of observed features, comparing Ridge Regression, Lasso, PAER, and LaBAER; PAER and LaBAER observe 4 pixels per example, while Ridge and Lasso observe all 768 pixels per example.]

Slide 55: Ads Placement: More data improves runtime
- Partial supervision
- Technique: Active Exploration
- A larger regret can be compensated for by having more examples
[Schematic plot: runtime vs. m for Halving and the Banditron.]
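
To illustrate the active-exploration idea in code (a generic epsilon-greedy sketch of my own; the slides' actual algorithm is the Banditron): occasionally pick a random ad just to observe its gain, otherwise follow the current model.

    import numpy as np

    def choose_ad(score, x, k, eps, rng):
        # Epsilon-greedy exploration for a contextual bandit: with probability eps
        # explore a uniformly random ad, otherwise exploit the highest-scoring ad.
        # Only the gain c[chosen] of the chosen ad is then revealed as feedback.
        if rng.random() < eps:
            return int(rng.integers(k))
        return int(np.argmax(score(x)))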

Slide 56: Summary
- Learning theory: many examples → smaller error
- This work: many examples → higher efficiency; compensating for missing information
- Techniques: 1. Stochastic optimization 2. Inject structure 3. Missing information as noise 4. Active Exploration
