CSCI B609: Foundations of Data Science

CSCI B609: Foundations of Data Science. Lecture 13/14: Gradient Descent, Boosting and Learning from Experts. Slides at http://grigory.us/data-science-class.html. Grigory Yaroslavtsev, http://grigory.us

Constrained Convex Optimization. Non-convex optimization is NP-hard in general. Example (Knapsack): minimize $c \cdot x$ subject to $w \cdot x \ge W$ and $x_i(1 - x_i) = 0$ for all $i$ (i.e. $x_i \in \{0,1\}$); the integrality constraints make the feasible set non-convex. Convex optimization can often be solved by the ellipsoid algorithm in poly(n) time, but it is too slow in practice.

Convex multivariate functions.
Convexity (first-order condition): for all $x, y \in \mathbb{R}^n$: $f(x) \ge f(y) + (x - y) \cdot \nabla f(y)$.
Equivalently, for all $x, y \in \mathbb{R}^n$ and $0 \le \lambda \le 1$: $f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)$.
If higher derivatives exist: $f(x) = f(y) + \nabla f(y) \cdot (x - y) + \frac{1}{2}(x - y)^T \nabla^2 f(z) (x - y) + \dots$, where $(\nabla^2 f)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ is the Hessian matrix.
$f$ is convex iff its Hessian is positive semidefinite, i.e. $y^T \nabla^2 f\, y \ge 0$ for all $y$.

Examples of convex functions.
The $\ell_p$-norm is convex for $1 \le p \le \infty$: $\|\lambda x + (1-\lambda) y\|_p \le \|\lambda x\|_p + \|(1-\lambda) y\|_p = \lambda \|x\|_p + (1-\lambda) \|y\|_p$.
$f(x) = \log(e^{x_1} + e^{x_2} + \dots + e^{x_n})$, which satisfies $\max(x_1, \dots, x_n) \le f(x) \le \max(x_1, \dots, x_n) + \log n$.
$f(x) = x^T A x$ where $A$ is a p.s.d. matrix; here $\nabla^2 f = 2A$.
Examples of constrained convex optimization:
(Linear equations with p.s.d. constraints): minimize $\frac{1}{2} x^T A x - b^T x$ (the solution satisfies $Ax = b$).
(Least squares regression): minimize $\|Ax - b\|_2^2 = x^T A^T A x - 2 (Ax)^T b + b^T b$.
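As a quick illustration (not part of the original slides), the following Python snippet numerically checks the log-sum-exp bounds and the positive semidefiniteness of the least-squares Hessian; the random data and tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Log-sum-exp: max(x) <= f(x) <= max(x) + log(n)
x = rng.normal(size=10)
f = np.log(np.sum(np.exp(x)))
assert np.max(x) <= f <= np.max(x) + np.log(len(x))

# Least squares objective ||Ax - b||^2 has Hessian 2 * A^T A, which is p.s.d.
A = rng.normal(size=(20, 5))
hessian = 2 * A.T @ A
eigenvalues = np.linalg.eigvalsh(hessian)
assert np.all(eigenvalues >= -1e-9)  # non-negative up to numerical error
```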

Constrained Convex Optimization.
General formulation for a convex function $f$ and a convex set $K$: minimize $f(x)$ subject to $x \in K$.
Example (SVMs): data $X_1, \dots, X_N \in \mathbb{R}^n$ labeled by $y_1, \dots, y_N \in \{-1, 1\}$ (spam / non-spam).
Find a linear model $W$: $W \cdot X_i \ge 1$ means $X_i$ is spam, $W \cdot X_i \le -1$ means $X_i$ is non-spam; i.e. for all $i$: $1 - y_i (W \cdot X_i) \le 0$.
More robust version: minimize $\sum_i \mathrm{Loss}\!\left(1 - y_i (W \cdot X_i)\right) + \lambda \|W\|_2^2$, e.g. the hinge loss $\mathrm{Loss}(t) = \max(0, t)$.
Another regularizer: $\lambda \|W\|_1$ (favors sparse solutions).
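A minimal Python sketch of the regularized hinge-loss objective described above; the function name hinge_objective and the toy data are illustrative choices, not from the course.

```python
import numpy as np

def hinge_objective(W, X, y, lam):
    """Regularized hinge loss: sum_i max(0, 1 - y_i <W, X_i>) + lam * ||W||_2^2."""
    margins = 1 - y * (X @ W)
    return np.sum(np.maximum(0.0, margins)) + lam * np.dot(W, W)

# Toy usage on random data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
print(hinge_objective(np.zeros(5), X, y, lam=0.1))  # all-zero model: loss equals N
```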

Gradient Descent for Constrained Convex Optimization.
(Projection): for $x \notin K$, project onto $y \in K$: $y = \mathrm{argmin}_{z \in K} \|z - x\|$. Easy to compute for the unit $\ell_2$ ball: $y = x / \|x\|_2$.
Let $\|\nabla f(x)\| \le G$ and $\max_{x, y \in K} \|x - y\| \le D$. Let $T = 4 D^2 G^2 / \epsilon^2$.
Gradient descent (given gradient + projection oracles): let $\eta = D / (G \sqrt{T})$ and repeat for $i = 0, \dots, T$:
$y^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)})$
$x^{(i+1)} =$ projection of $y^{(i+1)}$ on $K$
Output $z = \frac{1}{T} \sum_i x^{(i)}$.
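A short Python sketch of projected gradient descent as stated above, specialized to the unit $\ell_2$ ball; the fixed step size and iteration count are illustrative choices rather than the $\eta = D/(G\sqrt{T})$ setting from the analysis.

```python
import numpy as np

def project_unit_ball(x):
    """Projection onto the unit l2 ball: scale down if ||x|| > 1."""
    norm = np.linalg.norm(x)
    return x / norm if norm > 1 else x

def projected_gd(grad, project, x0, eta, T):
    """Projected gradient descent; returns the average iterate z = (1/T) sum_i x^(i)."""
    x = x0
    iterates = []
    for _ in range(T):
        y = x - eta * grad(x)   # gradient step
        x = project(y)          # project back onto K
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Toy example: least squares restricted to the unit ball.
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)
grad = lambda x: 2 * A.T @ (A @ x - b)
z = projected_gd(grad, project_unit_ball, np.zeros(5), eta=1e-3, T=5000)
print(np.linalg.norm(A @ z - b))  # residual of the constrained solution
```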

Gradient Descent for Constrained Convex Optimization (analysis).
Since $x^*$ lies in the convex set $K$, projecting onto $K$ can only decrease the distance to $x^*$, so
$\|x^{(i+1)} - x^*\|^2 \le \|y^{(i+1)} - x^*\|^2 = \|x^{(i)} - x^* - \eta \nabla f(x^{(i)})\|^2 = \|x^{(i)} - x^*\|^2 + \eta^2 \|\nabla f(x^{(i)})\|^2 - 2\eta\, \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$.
Using the definition of $G$ and rearranging:
$\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*) \le \frac{1}{2\eta}\left(\|x^{(i)} - x^*\|^2 - \|x^{(i+1)} - x^*\|^2\right) + \frac{\eta}{2} G^2$.
By convexity, $f(x^{(i)}) - f(x^*) \le \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$.
Summing over $i = 1, \dots, T$ (the distance terms telescope):
$\sum_{i=1}^{T} \left(f(x^{(i)}) - f(x^*)\right) \le \frac{\|x^{(0)} - x^*\|^2 - \|x^{(T)} - x^*\|^2}{2\eta} + \frac{T\eta}{2} G^2$.

Gradient Descent for Constrained Convex Optimization (analysis, cont.)
$\sum_{i=1}^{T} \left(f(x^{(i)}) - f(x^*)\right) \le \frac{\|x^{(0)} - x^*\|^2}{2\eta} + \frac{T\eta}{2} G^2 \le \frac{D^2}{2\eta} + \frac{T\eta}{2} G^2$.
By convexity of $f$, $f(z) = f\!\left(\frac{1}{T}\sum_i x^{(i)}\right) \le \frac{1}{T}\sum_i f(x^{(i)})$, so $f(z) - f(x^*) \le \frac{1}{T}\left(\frac{D^2}{2\eta} + \frac{T\eta}{2} G^2\right)$.
Setting $\eta = D/(G\sqrt{T})$ makes the RHS equal to $DG/\sqrt{T}$, which is at most $\epsilon$ for the choice $T = 4D^2G^2/\epsilon^2$.
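Why $\eta = D/(G\sqrt{T})$: the slide does not spell this out, but this choice minimizes the right-hand side over $\eta$:
\[
\frac{d}{d\eta}\left( \frac{D^2}{2\eta} + \frac{T\eta}{2} G^2 \right)
= -\frac{D^2}{2\eta^2} + \frac{T G^2}{2} = 0
\;\Longrightarrow\;
\eta = \frac{D}{G\sqrt{T}},
\]
\[
\text{and at this } \eta:\quad
\frac{D^2}{2\eta} + \frac{T\eta}{2} G^2
= \frac{DG\sqrt{T}}{2} + \frac{DG\sqrt{T}}{2}
= DG\sqrt{T},
\qquad
\frac{1}{T}\, DG\sqrt{T} = \frac{DG}{\sqrt{T}} \le \epsilon
\;\text{ for } T \ge \frac{D^2 G^2}{\epsilon^2}.
\]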

Online Gradient Descent.
Gradient descent works in a more general case: a sequence of convex functions $f_1, f_2, \dots, f_T$ arrives one at a time, and at step $i$ we must output $x^{(i)} \in K$ before seeing $f_i$; the update then uses $\nabla f_i(x^{(i)})$.
Let $x^*$ be the minimizer of $\sum_i f_i(w)$. The goal is to minimize the regret $\sum_i \left(f_i(x^{(i)}) - f_i(x^*)\right)$.
The same analysis as before works in the online case.

Stochastic Gradient Descent.
(Expected gradient oracle): returns $g$ such that $\mathbb{E}[g] = \nabla f(x)$. Example: for SVM, pick one term of the loss function uniformly at random.
Let $g_i$ be the gradient returned at step $i$. Let $f_i(x) = g_i^T x$ be the function used in the $i$-th step of OGD. Let $z = \frac{1}{T} \sum_i x^{(i)}$ and let $x^*$ be the minimizer of $f$.
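A minimal Python sketch of such an oracle for the (averaged) hinge-loss SVM: each step uses the subgradient of one randomly chosen hinge term plus the regularizer gradient, which is an unbiased estimate of the full (sub)gradient. The function name and hyperparameters are illustrative, not from the course.

```python
import numpy as np

def sgd_svm(X, y, lam=0.1, T=10000, eta=0.01, seed=0):
    """SGD for f(w) = (1/n) sum_i max(0, 1 - y_i <w, X_i>) + lam * ||w||^2.
    Each step uses the (sub)gradient of one randomly chosen hinge term plus the
    regularizer gradient; its expectation is the full (sub)gradient of f."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                 # pick one term of the loss at random
        margin = y[i] * (X[i] @ w)
        g_hinge = -y[i] * X[i] if margin < 1 else np.zeros(d)
        w = w - eta * (g_hinge + 2 * lam * w)
        w_sum += w
    return w_sum / T                        # average iterate, as in the analysis

# Toy usage on linearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
w = sgd_svm(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy
```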

Stochastic Gradient Descent (analysis).
Thm. $\mathbb{E}[f(z)] \le f(x^*) + \frac{DG}{\sqrt{T}}$, where $G$ is an upper bound on the norm of any gradient output by the oracle.
Proof:
$\mathbb{E}[f(z)] - f(x^*) \le \frac{1}{T} \sum_i \mathbb{E}\left[f(x^{(i)}) - f(x^*)\right]$ (convexity)
$\le \frac{1}{T} \sum_i \mathbb{E}\left[\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)\right]$ (convexity)
$= \frac{1}{T} \sum_i \mathbb{E}\left[g_i \cdot (x^{(i)} - x^*)\right]$ (gradient oracle, conditioning on $x^{(i)}$)
$= \frac{1}{T}\, \mathbb{E}\left[\sum_i \left(f_i(x^{(i)}) - f_i(x^*)\right)\right]$.
The expression inside the last expectation is the regret of OGD on the linear functions $f_i$, which is always at most $DG\sqrt{T}$; dividing by $T$ gives the claim (and it is at most $\epsilon$ for $T$ as chosen earlier).

VC-dimension of combinations of concepts.
For $k$ concepts $h_1, \dots, h_k$ and a Boolean function $f$: $\mathrm{comb}_f(h_1, \dots, h_k) = \{x \in X : f(h_1(x), \dots, h_k(x)) = 1\}$.
Example: $H$ = linear separators, $f$ = AND or $f$ = Majority.
For a concept class $H$ and a Boolean function $f$: $\mathrm{COMB}_{f,k}(H) = \{\mathrm{comb}_f(h_1, \dots, h_k) : h_i \in H\}$.
Lemma. If VC-dim$(H) = d$ then for any $f$: VC-dim$(\mathrm{COMB}_{f,k}(H)) = O(kd \log(kd))$.

VC-dimension of combinations of concepts (proof).
Lemma. If VC-dim$(H) = d$ then for any $f$: VC-dim$(\mathrm{COMB}_{f,k}(H)) = O(kd \log(kd))$.
Let $n =$ VC-dim$(\mathrm{COMB}_{f,k}(H))$, so there is a set $S$ of $n$ points shattered by $\mathrm{COMB}_{f,k}(H)$.
By Sauer's lemma there are at most $n^d$ ways of labeling $S$ by $H$.
Each labeling in $\mathrm{COMB}_{f,k}(H)$ is determined by $k$ labelings of $S$ by $H$, so there are at most $(n^d)^k = n^{kd}$ labelings of $S$ by $\mathrm{COMB}_{f,k}(H)$.
Since $S$ is shattered, $2^n \le n^{kd}$, i.e. $n \le kd \log_2 n$, which gives $n = O(kd \log(kd))$.
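The final step ($n \le kd \log_2 n \Rightarrow n = O(kd \log(kd))$) is compressed on the slide; one way to spell it out:
\[
n \le kd \log_2 n
\;\text{ and }\;
\log_2 n \le \sqrt{n} \ \text{ for } n \ge 16
\;\Longrightarrow\;
n \le kd\sqrt{n}
\;\Longrightarrow\;
\sqrt{n} \le kd
\;\Longrightarrow\;
n \le (kd)^2,
\]
\[
\text{so }\;
n \le kd \log_2 n \le kd \log_2\!\left((kd)^2\right) = 2\,kd \log_2(kd) = O(kd \log(kd)).
\]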

Back to the batch setting.
Classification problem. Instance space $X$: $\{0,1\}^d$ or $\mathbb{R}^d$ (feature vectors). Classification: come up with a mapping $X \to \{0,1\}$.
Formalization: assume there is a probability distribution $D$ over $X$; $c^*$ = target concept (the set $c^* \subseteq X$ of positive instances). Given labeled i.i.d. samples from $D$, produce $h \subseteq X$.
Goal: have $h$ agree with $c^*$ over the distribution $D$. Minimize $\mathrm{err}_D(h) = \Pr_D[h \,\Delta\, c^*]$; $\mathrm{err}_D(h)$ is the true or generalization error.

Boosting.
Strong learner: succeeds with probability $\ge 1 - \epsilon$. Weak learner: succeeds with probability $\ge \frac{1}{2} + \gamma$.
Boosting (informal): a weak learner that works under any distribution can be turned into a strong learner.
Idea: run the weak learner $A$ on the sample $S$ under reweightings that focus on misclassified examples.

Boosting (cont.)
$H$ = class of hypotheses produced by $A$. Applying the majority rule to $h_1, \dots, h_{t_0} \in H$ gives VC-dimension $O(t_0 \cdot \text{VC-dim}(H) \cdot \log(t_0 \cdot \text{VC-dim}(H)))$.
Algorithm: given $S = (x_1, \dots, x_n)$, set $w_i = 1$ for all $i$, $w = (w_1, \dots, w_n)$.
For $t = 1, \dots, t_0$ do: call the weak learner on $(S, w)$ to get hypothesis $h_t$; for each misclassified $x_i$ multiply $w_i$ by $\alpha = \left(\frac{1}{2} + \gamma\right) / \left(\frac{1}{2} - \gamma\right)$.
Output: $\mathrm{MAJ}(h_1, \dots, h_{t_0})$.
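A minimal Python sketch of this reweighting scheme, assuming decision stumps as the weak learner $A$ (the slides leave $A$ abstract); all function names and the toy data are illustrative.

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Hypothetical weak learner: best single-feature threshold under weights w.
    Returns a predict(Z) -> {0,1} function."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = (sign * (X[:, j] - thr) >= 0).astype(int)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    j, thr, sign = best
    return lambda Z: (sign * (Z[:, j] - thr) >= 0).astype(int)

def boost(X, y, gamma=0.1, t0=50):
    """Boosting by reweighting + majority vote, as on the slide:
    each misclassified point has its weight multiplied by alpha."""
    n = X.shape[0]
    w = np.ones(n)                                   # w_i = 1 for all i
    alpha = (0.5 + gamma) / (0.5 - gamma)
    hs = []
    for _ in range(t0):
        h = stump_weak_learner(X, y, w)
        hs.append(h)
        miss = h(X) != y
        w[miss] *= alpha                             # upweight mistakes
    return lambda Z: (np.mean([h(Z) for h in hs], axis=0) >= 0.5).astype(int)

# Toy usage.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = boost(X, y)
print(np.mean(clf(X) == y))  # training accuracy
```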

Boosting: analysis.
Def ($\gamma$-weak learner on a sample): for labeled examples $x_i$ weighted by $w_i$, the hypothesis is correct on examples carrying at least a $\left(\frac{1}{2} + \gamma\right)$ fraction of the total weight $\sum_{i=1}^n w_i$.
Thm. If $A$ is a $\gamma$-weak learner on $S$, then for $t_0 = O\!\left(\frac{1}{\gamma^2} \log n\right)$ boosting achieves 0 error on $S$.
Proof. Let $m$ = number of mistakes of the final classifier. Each point on which the final majority classifier errs was misclassified at least $t_0/2$ times, so its final weight is at least $\alpha^{t_0/2}$; hence the total weight is at least $m \alpha^{t_0/2}$.
Let $W(t)$ be the total weight at time $t$. Then $W(t+1) \le \left(\alpha \left(\tfrac{1}{2} - \gamma\right) + \left(\tfrac{1}{2} + \gamma\right)\right) W(t) = (1 + 2\gamma) W(t)$.

Boosting: analysis (cont.)
$W(0) = n$, so $W(t_0) \le n (1 + 2\gamma)^{t_0}$. Combining with $m \alpha^{t_0/2} \le W(t_0)$ and $\alpha = \left(\frac{1}{2} + \gamma\right)/\left(\frac{1}{2} - \gamma\right) = (1 + 2\gamma)/(1 - 2\gamma)$:
$m \le n (1 - 2\gamma)^{t_0/2} (1 + 2\gamma)^{t_0/2} = n (1 - 4\gamma^2)^{t_0/2}$.
Using $1 - x \le e^{-x}$: $m \le n e^{-2\gamma^2 t_0}$, which is less than 1 for $t_0 = O\!\left(\frac{1}{\gamma^2} \log n\right)$, so there are no mistakes on $S$.
Comments: the argument applies even if the weak learners are adversarial. VC-dimension bounds give the sample size needed for generalization, roughly $n = \tilde{O}\!\left(\frac{1}{\epsilon} \cdot \frac{\text{VC-dim}(H)}{\gamma^2}\right)$.
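Making the choice of $t_0$ explicit (a calculation the slide compresses):
\[
n e^{-2\gamma^2 t_0} < 1
\;\Longleftrightarrow\;
2\gamma^2 t_0 > \ln n
\;\Longleftrightarrow\;
t_0 > \frac{\ln n}{2\gamma^2},
\]
so taking $t_0 > \frac{\ln n}{2\gamma^2}$, which is $O\!\left(\frac{1}{\gamma^2} \log n\right)$, forces $m < 1$, i.e. $m = 0$ since $m$ is an integer.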