Mehryar Mohri
Foundations of Machine Learning
Courant Institute of Mathematical Sciences

Homework assignment 3, October 31, 2016. Due: A. November 11, 2016; B. November 22, 2016.

A. Boosting

1. Implement AdaBoost with boosting stumps and apply the algorithm to the spambase data set of HW2 with the same training and test sets. Plot the average cross-validation error plus or minus one standard deviation as a function of the number of rounds of boosting $T$ by selecting the value of this parameter out of $\{10, 10^2, \ldots, 10^k\}$ for a suitable value of $k$, as in HW2. Let $T^*$ be the best value found for the parameter. Plot the error on the training and test set as a function of the number of rounds of boosting for $t \in [1, T^*]$. Compare your results with those obtained using SVMs in HW2.

Solution:

[Figure 1: Cross-validation error shown with $\pm 1$ standard deviation (figure courtesy of Chaitanya Rudra).]

In Figure 1 we see a typical plot of the average cross-validation performance shown with $\pm 1$ standard deviation. Note that the error decreases exponentially and eventually levels out after roughly 400 iterations.
[Figure 2: Test and training error (figure courtesy of Chaitanya Rudra).]

In Figure 2 we see the training and test error over $T = 400$ rounds of boosting. Note that the test error eventually levels off, while the training error continues to decrease towards zero. Overall, the AdaBoost algorithm performs only slightly better than the SVM algorithm, increasing accuracy by about 1%.
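The implementation itself is not reproduced in the solution, only the resulting plots. For reference, a minimal sketch of AdaBoost with decision stumps might look as follows in Python with NumPy; the spambase loading, the cross-validation loop, and the plotting are assumed to be reused from HW2, and the helper names (`train_stump`, `adaboost`, `predict`) are our own.

```python
import numpy as np

def train_stump(X, y, D):
    """Exhaustively search every (feature, threshold, polarity) triple for the
    decision stump with the smallest weighted error under the distribution D."""
    best = (np.inf, 0, 0.0, 1)                 # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T):
    """AdaBoost with stumps; labels y must lie in {-1, +1}.
    Returns a list of (alpha_t, feature, threshold, polarity) tuples."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                    # D_1: uniform distribution
    model = []
    for _ in range(T):
        err, j, thr, pol = train_stump(X, y, D)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)  # AdaBoost step size
        pred = pol * np.where(X[:, j] <= thr, 1, -1)
        D *= np.exp(-alpha * y * pred)         # exponential reweighting
        D /= D.sum()                           # normalization by Z_t
        model.append((alpha, j, thr, pol))
    return model

def predict(model, X):
    f = sum(a * p * np.where(X[:, j] <= t, 1, -1) for a, j, t, p in model)
    return np.sign(f)
```

The exhaustive threshold search makes each round roughly quadratic in the sample size per feature; presorting each feature once is the usual speedup, omitted here for brevity.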
2. Consider the following variant of the classification problem where, in addition to the positive and negative labels $+1$ and $-1$, points may be labeled with $0$. This can correspond to cases where the true label of a point is unknown, a situation that often arises in practice, or more generally to the fact that the learning algorithm incurs no loss for predicting $-1$ or $+1$ for such a point. Let $X$ be the input space and let $Y = \{-1, 0, +1\}$. As in standard binary classification, the loss of $f \colon X \to \mathbb{R}$ on a pair $(x, y) \in X \times Y$ is defined by $1_{y f(x) < 0}$. Consider a sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$ and a hypothesis set $H$ of base functions taking values in $\{-1, 0, +1\}$. For a base hypothesis $h_t \in H$ and a distribution $D_t$ over indices $i \in [1, m]$, define $\epsilon_t^s$ for $s \in \{-1, 0, +1\}$ by
$$\epsilon_t^s = \mathbb{E}_{i \sim D_t}\big[1_{y_i h_t(x_i) = s}\big].$$

(a) Derive a boosting-style algorithm for this setting in terms of the $\epsilon_t^s$, using the same objective function as that of AdaBoost. You should carefully justify the definition of the algorithm.

Solution: We take a boosting-style algorithm here to be AdaBoost with a possibly different step size $\alpha_t$. Recall these definitions from the description of AdaBoost: the final hypothesis is $f(x) = \sum_t \alpha_t h_t(x)$ and the normalization constant in round $t$ is $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$. We proved in class that
$$\frac{1}{m} \sum_i 1_{y_i f(x_i) < 0} \le \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \prod_t Z_t,$$
and that AdaBoost's step size can be derived by minimizing this objective in each round $t$. Taking that same approach, observe that
$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) = \epsilon_t^0 + \epsilon_t^- \exp(\alpha_t) + \epsilon_t^+ \exp(-\alpha_t).$$
Differentiating the right-hand side with respect to $\alpha_t$ and setting it equal to zero shows that $Z_t$ is minimized by letting $\alpha_t = \frac{1}{2} \log\big(\frac{\epsilon_t^+}{\epsilon_t^-}\big)$.

(b) What is the weak-learning assumption in this setting?

Solution: One possible assumption is
$$\frac{\epsilon_t^+ - \epsilon_t^-}{1 - \epsilon_t^0} \ge \gamma > 0.$$
Informally, this assumption says that the difference between the accuracy and the error of each weak hypothesis is non-negligible relative to the fraction of examples on which the hypothesis makes any prediction at all. In part (d) we will prove that this assumption suffices to drive the training error to zero.

(c) Write the full pseudocode of the algorithm.

Solution (a code sketch of these steps is given after the pseudocode):
1. Given: training examples $((x_1, y_1), \ldots, (x_m, y_m))$.
2. Initialize $D_1$ to the uniform distribution on the training examples.
3. For $t = 1, \ldots, T$:
   a. $h_t \leftarrow$ base classifier in $H$.
   b. $\alpha_t \leftarrow \frac{1}{2} \log\big(\frac{\epsilon_t^+}{\epsilon_t^-}\big)$.
   c. For each $i = 1, \ldots, m$: $D_{t+1}(i) \leftarrow \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t \leftarrow \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$ is the normalization constant.
4. $f \leftarrow \sum_{t=1}^{T} \alpha_t h_t$.
5. Return: $\mathrm{sign}(f)$.
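As a concrete illustration, here is a minimal sketch of this pseudocode in Python, under some interface assumptions of ours: the base hypotheses are given as a finite list `H` of callables returning values in $\{-1, 0, +1\}$, and, since step 3a leaves the selection rule abstract, the sketch picks the hypothesis with the largest edge $\epsilon_t^+ - \epsilon_t^-$, one natural choice under the weak-learning assumption of part (b).

```python
import numpy as np

def boost_with_abstention(H, X, y, T, eps=1e-10):
    """H: list of callables h(X) -> array with values in {-1, 0, +1}.
    Returns the pairs (alpha_t, h_t) defining f = sum_t alpha_t h_t."""
    m = len(y)
    D = np.full(m, 1.0 / m)                     # step 2: uniform D_1
    model = []
    for _ in range(T):
        # step 3a: pick the base hypothesis with the largest edge eps+ - eps-
        best, best_edge = None, -np.inf
        for h in H:
            s = y * h(X)                        # y_i h(x_i), values in {-1, 0, +1}
            edge = D[s == 1].sum() - D[s == -1].sum()
            if edge > best_edge:
                best, best_edge = h, edge
        s = y * best(X)
        eps_plus, eps_minus = D[s == 1].sum(), D[s == -1].sum()
        # step 3b: alpha_t = (1/2) log(eps+/eps-), smoothed to avoid log(0)
        alpha = 0.5 * np.log((eps_plus + eps) / (eps_minus + eps))
        # step 3c: reweight and renormalize (division by Z_t)
        D *= np.exp(-alpha * s)
        D /= D.sum()
        model.append((alpha, best))
    return model

def predict(model, X):
    return np.sign(sum(a * h(X) for a, h in model))
```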
(d) Give an upper bound on the training error of the algorithm as a function of the number of rounds of boosting and the $\epsilon_t^s$.

Solution: Plug the value of $\alpha_t$ from part (a) into $Z_t = \epsilon_t^0 + \epsilon_t^- \exp(\alpha_t) + \epsilon_t^+ \exp(-\alpha_t)$ to obtain $Z_t = \epsilon_t^0 + 2\sqrt{\epsilon_t^- \epsilon_t^+}$. Therefore
$$\frac{1}{m} \sum_i 1_{y_i f(x_i) < 0} \le \prod_t Z_t = \prod_t \Big(\epsilon_t^0 + 2\sqrt{\epsilon_t^- \epsilon_t^+}\Big).$$
Moreover, if the weak learning assumption from part (b) is satisfied, then
$$\epsilon_t^0 + 2\sqrt{\epsilon_t^- \epsilon_t^+} = \epsilon_t^0 + \sqrt{(1 - \epsilon_t^0)^2 - (\epsilon_t^+ - \epsilon_t^-)^2} = \epsilon_t^0 + (1 - \epsilon_t^0)\sqrt{1 - \frac{(\epsilon_t^+ - \epsilon_t^-)^2}{(1 - \epsilon_t^0)^2}} \le \sqrt{1 - \frac{(\epsilon_t^+ - \epsilon_t^-)^2}{1 - \epsilon_t^0}} \le \sqrt{1 - \gamma^2}.$$
The first equality follows from $(\epsilon_t^+ + \epsilon_t^-)^2 - (\epsilon_t^+ - \epsilon_t^-)^2 = 4 \epsilon_t^+ \epsilon_t^-$ (just multiply out and gather terms) and $\epsilon_t^+ + \epsilon_t^- = 1 - \epsilon_t^0$. The first inequality follows from the fact that the square root is concave on $[0, \infty)$, and thus $\lambda \sqrt{x} + (1 - \lambda)\sqrt{y} \le \sqrt{\lambda x + (1 - \lambda) y}$ for $\lambda \in [0, 1]$. The last inequality follows from the weak learning assumption. Therefore we have
$$\frac{1}{m} \sum_i 1_{y_i f(x_i) < 0} \le \big(1 - \gamma^2\big)^{T/2} \le \exp\Big(-\frac{\gamma^2 T}{2}\Big),$$
where we used $1 + x \le \exp(x)$.
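Note that this bound also yields an explicit number of rounds after which the training error is exactly zero: the empirical error is a multiple of $1/m$, so it must vanish as soon as $\exp(-\gamma^2 T / 2) < 1/m$, that is, after $T > \frac{2 \ln m}{\gamma^2}$ rounds.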
B. On-line learning

The objective of this problem is to show how another regret minimization algorithm can be defined and studied. Let $L$ be a loss function convex in its first argument and taking values in $[0, M]$. We will adopt the notation used in the lectures and assume $N > e^2$. Additionally, for any expert $i \in [1, N]$, we denote by $r_{t,i}$ the instantaneous regret of that expert at time $t \in [1, T]$, $r_{t,i} = L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)$, and by $R_{t,i}$ its cumulative regret up to time $t$: $R_{t,i} = \sum_{s=1}^{t} r_{s,i}$. For convenience, we also define $R_{0,i} = 0$ for all $i \in [1, N]$. For any $x \in \mathbb{R}$, $(x)_+$ denotes $\max(x, 0)$, that is, the positive part of $x$, and for $\mathbf{x} = (x_1, \ldots, x_N) \in \mathbb{R}^N$, $(\mathbf{x})_+ = ((x_1)_+, \ldots, (x_N)_+)$.

Let $\alpha > 2$ and consider the algorithm that predicts at round $t \in [1, T]$ according to
$$\hat{y}_t = \frac{\sum_{i=1}^{N} w_{t,i} \, y_{t,i}}{\sum_{i=1}^{N} w_{t,i}},$$
with the weight $w_{t,i}$ defined based on the $(\alpha - 1)$th power of the regret up to time $t - 1$: $w_{t,i} = (R_{t-1,i})_+^{\alpha - 1}$. The potential function we use to analyze the algorithm is based on the function $\Phi$ defined over $\mathbb{R}^N$ by
$$\Phi \colon \mathbf{x} \mapsto \|(\mathbf{x})_+\|_\alpha^2 = \bigg[\sum_{i=1}^{N} (x_i)_+^\alpha\bigg]^{2/\alpha}.$$

1. Show that $\Phi$ is twice differentiable over $\mathbb{R}^N \setminus B$, where $B$ is defined as follows: $B = \{\mathbf{u} \in \mathbb{R}^N \colon (\mathbf{u})_+ = 0\}$.

Solution: For $\mathbf{x} \notin B$, we can write $\Phi(\mathbf{x}) = \phi_2\big(\sum_{i=1}^{N} \phi_1(x_i)\big)$, where $x_1, \ldots, x_N$ denote the components of $\mathbf{x}$ and $\phi_1 \colon \mathbb{R} \to \mathbb{R}$ and $\phi_2 \colon (0, +\infty) \to \mathbb{R}$ are the functions defined by
$$\phi_1(u) = (u)_+^\alpha, \quad u \in \mathbb{R}, \tag{1}$$
$$\phi_2(u) = u^{2/\alpha}, \quad u \in (0, +\infty). \tag{2}$$
It is not hard to see that both $\phi_1$ and $\phi_2$ are twice (continuously) differentiable, which shows, by composition, the same property for $\Phi$ over $\mathbb{R}^N \setminus B$.
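Before continuing with the analysis, here is a minimal sketch of the forecaster itself in Python; the interface (array shapes, the vectorized `loss` argument, the function name) is our own, and the fallback prediction used when all cumulative regrets are non-positive, a case the problem leaves unspecified, is an arbitrary choice.

```python
import numpy as np

def poly_weighted_forecaster(expert_preds, outcomes, alpha, loss):
    """expert_preds: (T, N) array of expert predictions y_{t,i};
    outcomes: (T,) array of y_t; loss(p, y): convex in p, values in [0, M],
    vectorized in its first argument."""
    T, N = expert_preds.shape
    R = np.zeros(N)                                # R_{0,i} = 0
    preds = np.empty(T)
    for t in range(T):
        w = np.maximum(R, 0.0) ** (alpha - 1)      # w_{t,i} = (R_{t-1,i})_+^(alpha-1)
        if w.sum() > 0:
            preds[t] = w @ expert_preds[t] / w.sum()
        else:
            # all cumulative regrets non-positive: weights vanish, so fall
            # back to the uniform average (an arbitrary choice)
            preds[t] = expert_preds[t].mean()
        # r_{t,i} = L(y_hat_t, y_t) - L(y_{t,i}, y_t)
        R += loss(preds[t], outcomes[t]) - loss(expert_preds[t], outcomes[t])
    return preds, R
```

For instance, with `loss = lambda p, y: (p - y) ** 2` and predictions in $[0, 1]$ one can take $M = 1$; question 8 below suggests setting $\alpha \approx 2 \log N$.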
2. For any $t \in [1, T]$, let $\mathbf{r}_t$ denote the vector of instantaneous regrets, $\mathbf{r}_t = (r_{t,1}, \ldots, r_{t,N})$, and similarly $\mathbf{R}_t = (R_{t,1}, \ldots, R_{t,N})$. We define the potential function as $\Phi(\mathbf{R}_t) = \|(\mathbf{R}_t)_+\|_\alpha^2$. Compute $\nabla \Phi(\mathbf{R}_{t-1})$ for $\mathbf{R}_{t-1} \notin B$ and show that $\nabla \Phi(\mathbf{R}_{t-1}) \cdot \mathbf{r}_t \le 0$ (hint: use the convexity of the loss with respect to the first argument).

Solution: For any $i \in [1, N]$,
$$\frac{\partial \Phi(\mathbf{R}_{t-1})}{\partial R_{t-1,i}} = 2 (R_{t-1,i})_+^{\alpha - 1} \bigg[\sum_{j=1}^{N} (R_{t-1,j})_+^\alpha\bigg]^{\frac{2}{\alpha} - 1} = 2 w_{t,i} \|(\mathbf{R}_{t-1})_+\|_\alpha^{2 - \alpha}.$$
Thus, we can write
$$\mathrm{sign}\big(\nabla \Phi(\mathbf{R}_{t-1}) \cdot \mathbf{r}_t\big) = \mathrm{sign}\bigg(\sum_{i=1}^{N} w_{t,i} \, r_{t,i}\bigg) = \mathrm{sign}\bigg(\sum_{i=1}^{N} w_{t,i} \big(L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)\big)\bigg) = \mathrm{sign}\bigg(\sum_{i=1}^{N} \frac{w_{t,i}}{\sum_{j=1}^{N} w_{t,j}} \big(L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)\big)\bigg). \tag{3}$$
Now, by the convexity of $L$ in its first argument, the following holds:
$$\sum_{i=1}^{N} \frac{w_{t,i}}{\sum_{j=1}^{N} w_{t,j}} \big(L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)\big) = L(\hat{y}_t, y_t) - \sum_{i=1}^{N} \frac{w_{t,i}}{\sum_{j=1}^{N} w_{t,j}} L(y_{t,i}, y_t) \le 0,$$
since $\hat{y}_t = \frac{\sum_i w_{t,i} y_{t,i}}{\sum_i w_{t,i}}$ and thus, by Jensen's inequality, $L(\hat{y}_t, y_t) \le \sum_i \frac{w_{t,i}}{\sum_j w_{t,j}} L(y_{t,i}, y_t)$.

3. (Bonus question) Prove the inequality $\mathbf{r}^\top [\nabla^2 \Phi(\mathbf{u})] \mathbf{r} \le 2(\alpha - 1) \|\mathbf{r}\|_\alpha^2$, valid for all $\mathbf{r} \in \mathbb{R}^N$ and $\mathbf{u} \in \mathbb{R}^N \setminus B$ (hint: write the Hessian $\nabla^2 \Phi(\mathbf{u})$ as the sum of a diagonal matrix and a positive semi-definite matrix multiplied by $(2 - \alpha)$. Also, use Hölder's inequality generalizing Cauchy-Schwarz: for any $p > 1$ and $q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$ and $\mathbf{u}, \mathbf{v} \in \mathbb{R}^N$, $\mathbf{u} \cdot \mathbf{v} \le \|\mathbf{u}\|_p \|\mathbf{v}\|_q$).

Solution: As already seen, for any $i \in [1, N]$, we have
$$\frac{\partial \Phi(\mathbf{u})}{\partial u_i} = 2 (u_i)_+^{\alpha - 1} \|(\mathbf{u})_+\|_\alpha^{2 - \alpha}. \tag{4}$$
For $j \ne i$, we obtain:
$$\frac{\partial^2 \Phi(\mathbf{u})}{\partial u_j \partial u_i} = 2 (2 - \alpha) (u_j)_+^{\alpha - 1} (u_i)_+^{\alpha - 1} \|(\mathbf{u})_+\|_\alpha^{2(1 - \alpha)}. \tag{5}$$
We also have
$$\frac{\partial^2 \Phi(\mathbf{u})}{\partial u_i^2} = 2 (\alpha - 1) (u_i)_+^{\alpha - 2} \|(\mathbf{u})_+\|_\alpha^{2 - \alpha} + 2 (2 - \alpha) (u_i)_+^{2(\alpha - 1)} \|(\mathbf{u})_+\|_\alpha^{2(1 - \alpha)}. \tag{6}$$
Consider the diagonal matrix $D = \mathrm{diag}\big((u_1)_+^{\alpha - 2}, \ldots, (u_N)_+^{\alpha - 2}\big)$ and the matrix $M$ defined by $M_{ij} = (u_j)_+^{\alpha - 1} (u_i)_+^{\alpha - 1}$. In view of the previous expressions, we can write
$$\nabla^2 \Phi(\mathbf{u}) = 2 (\alpha - 1) \|(\mathbf{u})_+\|_\alpha^{2 - \alpha} D + 2 (2 - \alpha) \|(\mathbf{u})_+\|_\alpha^{2(1 - \alpha)} M. \tag{7}$$
$M$ is a positive semi-definite matrix since $M = \mathbf{v} \mathbf{v}^\top$ with $\mathbf{v} = \big((u_1)_+^{\alpha - 1}, \ldots, (u_N)_+^{\alpha - 1}\big)$. Thus, for any $\mathbf{r}$, $\mathbf{r}^\top M \mathbf{r} \ge 0$. Since $2 - \alpha < 0$, we can write
$$\mathbf{r}^\top \nabla^2 \Phi(\mathbf{u}) \mathbf{r} \le 2 (\alpha - 1) \|(\mathbf{u})_+\|_\alpha^{2 - \alpha} \, \mathbf{r}^\top D \mathbf{r} = 2 (\alpha - 1) \|(\mathbf{u})_+\|_\alpha^{2 - \alpha} \sum_{i=1}^{N} (u_i)_+^{\alpha - 2} r_i^2.$$
By Hölder's inequality (with $p = \frac{\alpha}{\alpha - 2}$ and $q = \frac{\alpha}{2}$), the following holds:
$$\sum_{i=1}^{N} (u_i)_+^{\alpha - 2} r_i^2 \le \bigg[\sum_{i=1}^{N} \big((u_i)_+^{\alpha - 2}\big)^{\frac{\alpha}{\alpha - 2}}\bigg]^{\frac{\alpha - 2}{\alpha}} \bigg[\sum_{i=1}^{N} \big(r_i^2\big)^{\frac{\alpha}{2}}\bigg]^{\frac{2}{\alpha}} = \|(\mathbf{u})_+\|_\alpha^{\alpha - 2} \|\mathbf{r}\|_\alpha^2.$$
This implies that $\mathbf{r}^\top \nabla^2 \Phi(\mathbf{u}) \mathbf{r} \le 2 (\alpha - 1) \|\mathbf{r}\|_\alpha^2$.
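The inequality of question 3 can also be checked numerically. The following is a small illustrative sketch, with arbitrarily chosen $\alpha$, $N$, $\mathbf{u}$, and $\mathbf{r}$, that approximates the quadratic form by a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, N = 3.0, 5
u = rng.normal(size=N)
u[0] = 1.0                       # ensure (u)_+ != 0, i.e., u lies outside B
r = rng.normal(size=N)

def Phi(x):
    return float((np.maximum(x, 0.0) ** alpha).sum() ** (2.0 / alpha))

h = 1e-4                         # central difference for r^T [Hess Phi(u)] r
quad = (Phi(u + h * r) - 2.0 * Phi(u) + Phi(u - h * r)) / h**2
bound = 2.0 * (alpha - 1.0) * (np.abs(r) ** alpha).sum() ** (2.0 / alpha)
print(quad, bound)               # quad <= bound, as proven above
```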
4. Using the answers to the two previous questions and Taylor's formula, show that for all $t \ge 1$,
$$\Phi(\mathbf{R}_t) - \Phi(\mathbf{R}_{t-1}) \le (\alpha - 1) \|\mathbf{r}_t\|_\alpha^2,$$
if $\gamma \mathbf{R}_{t-1} + (1 - \gamma) \mathbf{R}_t \notin B$ for all $\gamma \in [0, 1]$.

Solution: By assumption, the segment $[\mathbf{R}_{t-1}, \mathbf{R}_t]$ does not meet $B$ and thus $\Phi$ is twice differentiable over the interior $(\mathbf{R}_{t-1}, \mathbf{R}_t)$. Thus, by Taylor's formula, there exists $\mathbf{u} \in (\mathbf{R}_{t-1}, \mathbf{R}_t)$ such that
$$\Phi(\mathbf{R}_t) - \Phi(\mathbf{R}_{t-1}) = \nabla \Phi(\mathbf{R}_{t-1}) \cdot \mathbf{r}_t + \frac{1}{2} \mathbf{r}_t^\top \nabla^2 \Phi(\mathbf{u}) \mathbf{r}_t \le \frac{1}{2} \mathbf{r}_t^\top \nabla^2 \Phi(\mathbf{u}) \mathbf{r}_t \le (\alpha - 1) \|\mathbf{r}_t\|_\alpha^2,$$
where we used the inequalities obtained in the two previous questions.

5. Suppose there exists $\gamma \in [0, 1]$ such that $(1 - \gamma) \mathbf{R}_{t-1} + \gamma \mathbf{R}_t \in B$. Show that $\Phi(\mathbf{R}_t) \le (\alpha - 1) \|\mathbf{r}_t\|_\alpha^2$.

Solution: For that $\gamma$, by definition, we have $(\mathbf{R}_{t-1} + \gamma \mathbf{r}_t)_+ = 0$. Observe that for any two scalars $c$ and $d$, $(c + d)_+ \le (c)_+ + (d)_+$, and therefore
$$(\mathbf{R}_{t-1} + \mathbf{r}_t)_+ \le (\mathbf{R}_{t-1} + \gamma \mathbf{r}_t)_+ + (1 - \gamma)(\mathbf{r}_t)_+ = (1 - \gamma)(\mathbf{r}_t)_+ \le (\mathbf{r}_t)_+. \tag{8}$$
Since $\mathbf{u} \mapsto \|(\mathbf{u})_+\|_\alpha^2$ is a non-decreasing function of each component $u_i$, we have $\Phi(\mathbf{R}_t) = \Phi(\mathbf{R}_{t-1} + \mathbf{r}_t) \le \|(\mathbf{r}_t)_+\|_\alpha^2$. Since $\alpha > 2$, this implies $\Phi(\mathbf{R}_t) \le (\alpha - 1) \|(\mathbf{r}_t)_+\|_\alpha^2 \le (\alpha - 1) \|\mathbf{r}_t\|_\alpha^2$.

6. Using the two previous questions, derive an upper bound on $\Phi(\mathbf{R}_T)$ expressed in terms of $T$, $N$, and $M$.

Solution: In view of the two previous questions, the inequality $\Phi(\mathbf{R}_t) - \Phi(\mathbf{R}_{t-1}) \le (\alpha - 1) \|\mathbf{r}_t\|_\alpha^2$ holds in all cases. Summing up these inequalities for all $t \in [1, T]$, we obtain
$$\Phi(\mathbf{R}_T) = \Phi(\mathbf{R}_T) - \Phi(\mathbf{R}_0) = \sum_{t=1}^{T} \big[\Phi(\mathbf{R}_t) - \Phi(\mathbf{R}_{t-1})\big] \le (\alpha - 1) \sum_{t=1}^{T} \|\mathbf{r}_t\|_\alpha^2.$$
Since $|r_{t,i}| \le M$ for all $t$ and $i$, this yields $\Phi(\mathbf{R}_T) \le (\alpha - 1) N^{2/\alpha} M^2 T$.
7. Show that $\Phi(\mathbf{R}_T)$ admits as a lower bound the square of the regret $R_T$ of the algorithm.

Solution: By definition of the regret, $R_T = \max_{i \in [1, N]} R_{T,i} \le \max_{i \in [1, N]} (R_{T,i})_+ \le \|(\mathbf{R}_T)_+\|_\alpha = \sqrt{\Phi(\mathbf{R}_T)}$. Note that this implies that $\Phi(\mathbf{R}_T) \ge (R_T)_+^2$, which is not exactly the statement given in the question. However, it suffices for proving upper bounds on the regret, which is the motivation behind our construction of $\Phi$.

8. Using the two previous questions, give an upper bound on the regret $R_T$. For what value of $\alpha$ is the bound the most favorable? Give a simple expression of the upper bound on the regret for a suitable approximation of that optimal value.

Solution: In view of the inequalities of the two previous questions, we have
$$R_T \le \sqrt{(\alpha - 1) N^{2/\alpha} M^2 T}.$$
For $N > e^2$, the function $\alpha \mapsto (\alpha - 1) N^{2/\alpha}$ reaches its minimum over $(2, +\infty)$ at $\alpha^* = \log N + \sqrt{\log^2 N - 2 \log N}$, which is approximately $2 \log N$. Plugging in that value yields the bound
$$R_T \le M \sqrt{(2 \log N - 1) \, e \, T}.$$
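To give a sense of the magnitudes involved, here is a small sketch, with arbitrarily chosen $N$, $T$, and $M$, comparing the exact optimizer $\alpha^*$ with the $2 \log N$ approximation:

```python
import numpy as np

N, T, M = 100, 10_000, 1.0
logN = np.log(N)                 # natural log; recall we assume N > e^2

alpha_star = logN + np.sqrt(logN**2 - 2 * logN)   # exact minimizer
alpha_approx = 2 * logN                           # its approximation

def regret_bound(a):
    return M * np.sqrt((a - 1) * N ** (2.0 / a) * T)

print(alpha_star, alpha_approx)                   # ~8.07 vs ~9.21
print(regret_bound(alpha_star), regret_bound(alpha_approx))   # ~470 vs ~472
# the closed form M * sqrt((2 log N - 1) e T) agrees with the second value:
print(M * np.sqrt((2 * logN - 1) * np.e * T))
```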