Introduction to Machine Learning DIS10


CS 189 Fall 2017 Introduction to Machine Learning DIS10

1 Fun with Lagrange Multipliers

(a) Minimize the function $f(x,y) = x + 2y$ such that $x^2 + y^2 = 3$.

Solution: The Lagrangian is:
$$L(x,y,\lambda) = x + 2y + \lambda\left(x^2 + y^2 - 3\right).$$
Taking all of the partial derivatives and setting them to 0, we get this system of equations:
$$2\lambda x = -1$$
$$2\lambda y = -2$$
$$x^2 + y^2 = 3$$
Dividing the first two equations, we can infer that $y = 2x$. Plugging this into the constraint, we have
$$x^2 + 4x^2 = 3,$$
which shows that $x = \pm\sqrt{3/5}$. We have two critical points, $\left(-\sqrt{3/5}, -2\sqrt{3/5}\right)$ and $\left(\sqrt{3/5}, 2\sqrt{3/5}\right)$. Plugging these into our objective function $f$, we find that the minimizer is the former, with a value of $-5\sqrt{3/5} = -\sqrt{15}$.

(b) Minimize the function $f(x,y,z) = x^2 - y^2$ such that $x^2 + y^2 + 3z^2 = 1$.

Solution: The Lagrangian is:
$$L(x,y,z,\lambda) = x^2 - y^2 + \lambda\left(x^2 + y^2 + 3z^2 - 1\right).$$
Taking all of the partial derivatives and setting them to 0, we get this system of equations:
$$2x = -2\lambda x$$
$$2y = 2\lambda y$$
$$0 = 6\lambda z$$
$$x^2 + y^2 + 3z^2 = 1$$
To solve this, we look at several cases:

Case 1: $\lambda = 0$. This implies that $x = y = 0$, and $z = \pm\frac{1}{\sqrt{3}}$. We have two critical points: $\left(0, 0, \pm\frac{1}{\sqrt{3}}\right)$.

Case 2: $\lambda \neq 0$. Then $z$ must be 0. (The first equation forces $x = 0$ or $\lambda = -1$, and the second forces $y = 0$ or $\lambda = 1$; since $\lambda$ cannot be both, at least one of $x$, $y$ must be 0.)

Case 2a: $x = 0$. The constraint gives us that $y = \pm 1$. This gives us another two critical points: $(0, \pm 1, 0)$.

Case 2b: $y = 0$. The constraint gives us $x = \pm 1$, giving us another two critical points: $(\pm 1, 0, 0)$.

Plugging in all of our critical points, we find that $(0, \pm 1, 0)$ minimizes our function with a value of $-1$.
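As a quick numerical cross-check of the two constrained minima above (not part of the original worksheet), the same objectives and equality constraints can be handed to a generic solver. This is a minimal sketch using scipy.optimize.minimize with SLSQP; the starting points and solver choice are arbitrary, and a local solver is not guaranteed to land on the global minimizer.

```python
# Numerical sanity check of parts (a) and (b) with a generic constrained solver.
# Illustrative only: SLSQP is a local method, so the starting point matters.
import numpy as np
from scipy.optimize import minimize

# Part (a): minimize x + 2y subject to x^2 + y^2 = 3.
res_a = minimize(
    lambda v: v[0] + 2.0 * v[1],
    x0=np.array([1.0, -1.0]),
    constraints=[{"type": "eq", "fun": lambda v: v[0] ** 2 + v[1] ** 2 - 3.0}],
    method="SLSQP",
)
print(res_a.x, res_a.fun)  # expect roughly (-0.775, -1.549), value -sqrt(15) ~ -3.873

# Part (b): minimize x^2 - y^2 subject to x^2 + y^2 + 3z^2 = 1.
res_b = minimize(
    lambda v: v[0] ** 2 - v[1] ** 2,
    x0=np.array([0.1, 0.9, 0.1]),
    constraints=[{"type": "eq", "fun": lambda v: v[0] ** 2 + v[1] ** 2 + 3.0 * v[2] ** 2 - 1.0}],
    method="SLSQP",
)
print(res_b.x, res_b.fun)  # expect roughly (0, 1, 0) or (0, -1, 0), value -1
```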

2 Support Vector Machines

(a) We typically frame an SVM problem as trying to maximize the margin. Explain intuitively why a bigger margin will result in a model that will generalize better, or perform better in practice.

Solution: One intuition is that if points are closer to the boundary, we are less certain about their class. Thus, it makes sense to place the boundary where our certainty about all of the training points is highest.

Another intuition involves thinking about the process that generated the data we are working with. Since it is a noisy process, if we drew a boundary close to one of our training points of some class, it is very possible that a point of the same class will be generated across the boundary, resulting in an incorrect classification. Therefore it makes sense to put the boundary as far away from our training points as possible.

(b) Will moving points which are not support vectors further away from the decision boundary affect the SVM's hinge loss?

Solution: No. The hinge loss is defined as
$$\sum_{i=1}^{N} \max\left(0,\ 1 - y_i\left(w^\top x_i + b\right)\right).$$
For non-support vectors, the second argument of the max is already negative, and moving the point further away from the boundary only makes it more negative. The max returns zero regardless. This means that the loss function, and the consequent decision boundary, is entirely determined by the support vectors and nothing else.
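To make part (b) concrete, here is a small numerical sketch with made-up weights and points (they are not from the worksheet): the hinge loss is evaluated before and after pushing a correctly classified point that lies outside the margin even further from the boundary, and the value does not change.

```python
# Hinge-loss check for part (b): moving a non-support vector further from the
# decision boundary leaves the hinge loss unchanged. Weights/points are made up.
import numpy as np

def hinge_loss(w, b, X, y):
    """Sum over i of max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).sum()

w, b = np.array([0.5, 0.5]), -1.5
X = np.array([[1.0, 0.0],   # on the margin (functional margin exactly 1)
              [3.0, 2.0],   # on the margin (functional margin exactly 1)
              [1.8, 1.0],   # inside the margin: the only point contributing loss
              [5.0, 5.0]])  # far outside the margin: contributes nothing
y = np.array([-1.0, 1.0, 1.0, 1.0])

print(hinge_loss(w, b, X, y))         # ~1.1, all from the third point

X_moved = X.copy()
X_moved[3] += np.array([2.0, 2.0])    # push the far point even further away
print(hinge_loss(w, b, X_moved, y))   # still ~1.1: the loss is unaffected
```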

(c) Show that the width of an SVM slab with linearly separable data is $\frac{2}{\|w\|}$.

Solution: The width of the margin is defined by the points that lie on it, also called support vectors. Let's say we have a point $x$ which is a support vector, and let $x_0$ be any point on the separating hyperplane, whose equation is $w^\top x_0 + b = 0$. The distance between $x$ and the separating hyperplane can be calculated by projecting the vector starting at the plane and ending at $x$ onto the plane's unit normal vector. Since $w$ by definition is orthogonal to the hyperplane, we want to project $x - x_0$ onto the unit normal $\frac{w}{\|w\|}$:
$$\frac{w^\top}{\|w\|}\left(x - x_0\right) = \frac{1}{\|w\|}\left(w^\top x - w^\top x_0\right) = \frac{1}{\|w\|}\left(w^\top x + b - w^\top x_0 - b\right).$$
Since we set $w^\top x + b = 1$ (or $-1$) at the support vectors and, by definition, $w^\top x_0 + b = 0$, this quantity just turns into $\frac{1}{\|w\|}$ or $-\frac{1}{\|w\|}$, so the distance is the absolute value, $\frac{1}{\|w\|}$. Since the margin is half of the slab, we double it to get the full width of $\frac{2}{\|w\|}$.

(d) You are presented with the following set of data (triangle = +1, circle = -1). [Figure omitted: scatter plot of the training points.] Find the equation (by hand) of the hyperplane $w^\top x + b = 0$ that would be used by an SVM classifier. Which points are support vectors?

Solution: The hyperplane will pass through the point $(2,1)$ with a slope of $-1$. The equation of this line is $x_1 + x_2 = 3$. We know from this form that $w_1 = w_2$. We also know that at the support vectors, $w^\top x + b = \pm 1$. This gives us the equations:
$$1 \cdot w_1 + 0 \cdot w_2 + b = -1$$
$$3 \cdot w_1 + 2 \cdot w_2 + b = 1$$
Solving this system of equations, we get $w = \left[\tfrac{1}{2}, \tfrac{1}{2}\right]^\top$ and $b = -\tfrac{3}{2}$. The support vectors are $(1,0)$, $(0,1)$, and $(3,2)$.
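Since the scatter plot is not reproduced above, here is a hedged sanity check of part (d) that uses only the three support vectors named in the solution; by themselves they determine the same maximum-margin hyperplane. It fits scikit-learn's SVC with a linear kernel and a very large C to approximate a hard margin.

```python
# Sanity check of part (d): a (nearly) hard-margin linear SVM fit on just the
# three support vectors from the solution should recover w = [1/2, 1/2], b = -3/2.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [3.0, 2.0]])
y = np.array([-1, -1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin
print(clf.coef_, clf.intercept_)             # expect roughly [[0.5, 0.5]] and [-1.5]
print(clf.support_vectors_)                  # all three points are support vectors
```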

3 Simple SGD updates

Let us consider a simple least squares problem, where we are interested in optimizing the function
$$F(w) = \frac{1}{2n}\|Aw - y\|^2 = \frac{1}{2n}\sum_{i=1}^{n}\left(a_i^\top w - y_i\right)^2.$$

(a) What is the closed form OLS solution? What is the time complexity of computing this solution in terms of flops?

Solution: The closed form solution is
$$\hat{w} = (A^\top A)^{-1} A^\top y.$$
This takes time $\approx nd^2 + nd + d^3$ to compute: $nd^2$ to find $A^\top A$, since it takes $n$ multiplications to compute each entry of this $d \times d$ matrix; $nd$ to find $A^\top y$, since it takes $n$ multiplications to compute each entry of this $d$-vector; and $d^3$ time to invert a $d \times d$ matrix via Gaussian elimination.

(b) Write down the gradient descent update. What is the time complexity of computing an $\varepsilon$-optimal solution?

Solution: For gradient descent, we have the update
$$w_{t+1} = w_t - \frac{\gamma}{n} A^\top\left(Aw_t - y\right).$$
We know from HW that, denoting $e_k = \|w_k - w^*\|$ and letting $Q$ denote the condition number of $A^\top A$, we have $e_k \le \left(\frac{Q-1}{Q+1}\right) e_{k-1}$. We therefore obtain geometric convergence to the optimum, and the number of iterations is roughly $T \approx Q\log(1/\varepsilon)$ to converge to within $\varepsilon$ of the optimum (write this out to see why, using the approximation $1 - x \approx e^{-x}$). Also note that during each iteration we perform $\approx nd$ work, since $Aw_t$ takes $nd$ time to compute, and performing the multiplication $A^\top(Aw_t - y)$ takes $nd$ time as well. So the total cost is $\approx nd\log(1/\varepsilon)$ (suppressing the dependence on $Q$).
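The sketch below (synthetic data, arbitrary problem sizes) contrasts the closed-form solution from part (a) with the gradient-descent iteration from part (b) on the objective $F(w) = \frac{1}{2n}\|Aw - y\|^2$; the step size is chosen from the largest eigenvalue of $\frac{1}{n}A^\top A$, which is one standard choice rather than anything prescribed by the worksheet.

```python
# Closed-form OLS (part (a)) vs. gradient descent (part (b)) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = A @ w_true + 0.1 * rng.normal(size=n)

# (a) Closed form: w_hat = (A^T A)^{-1} A^T y, about nd^2 + nd + d^3 flops.
w_ols = np.linalg.solve(A.T @ A, A.T @ y)

# (b) Gradient descent: w_{t+1} = w_t - (gamma/n) A^T (A w_t - y), about nd per step.
gamma = 1.0 / np.linalg.eigvalsh(A.T @ A / n).max()  # inverse of the smoothness constant
w = np.zeros(d)
for _ in range(500):
    w -= (gamma / n) * (A.T @ (A @ w - y))

print(np.linalg.norm(w - w_ols))  # should be tiny after enough iterations
```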

(c) Write down the stochastic gradient descent update. What is the time complexity of computing an $\varepsilon$-optimal solution? You may want to quickly go through a derivation here. What happens when $Aw^* = y$? Discuss why you would use any of these methods for your problem.

Solution: Let us derive the SGD convergence rate from first principles. We have the update equation
$$w_{t+1} = w_t - \gamma\, a_J\left(a_J^\top w_t - y_J\right),$$
where $J$ is chosen uniformly at random from the set $\{1, 2, 3, \ldots, n\}$. Notice that this makes sense as a noisy gradient, since
$$\mathbb{E}_J\left[a_J\left(a_J^\top w_t - y_J\right)\right] = \mathbb{E}_J\left[\nabla f_J(w_t)\right] = \frac{1}{n}\left(A^\top A w_t - A^\top y\right) = \nabla F(w_t),$$
where $f_J(w) = \frac{1}{2}\left(a_J^\top w - y_J\right)^2$, and so the gradient estimate is unbiased.

Let us now compute the convergence rate. We have
$$\|w_{t+1} - w^*\|^2 = \|w_t - w^*\|^2 + \gamma^2\|\nabla f_J(w_t)\|^2 - 2\gamma(w_t - w^*)^\top \nabla f_J(w_t).$$
Now notice that there are two sources of randomness in the RHS. The iterate $w_t$ is in itself random, since we have chosen random indices up until that point. The index $J$ is also random. Crucially, these two sources of randomness are independent of each other. In particular, we may now take the inner product and compute the expectation over the index $J$ to obtain
$$\mathbb{E}_J\left[2\gamma(w_t - w^*)^\top \nabla f_J(w_t)\right] = 2\gamma(w_t - w^*)^\top \mathbb{E}_J\left[\nabla f_J(w_t)\right] = 2\gamma(w_t - w^*)^\top \nabla F(w_t) = \frac{2\gamma}{n}(w_t - w^*)^\top A^\top A (w_t - w^*) = \frac{2\gamma}{n}\|A(w_t - w^*)\|^2 \ge \frac{2\gamma}{n}\lambda_{\min}(A^\top A)\,\|w_t - w^*\|^2,$$
where in the last step we have used a simple eigenvalue bound; go back and look at HW6 if this is not clear. Letting $m = \frac{2\lambda_{\min}(A^\top A)}{n}$, we have
$$\mathbb{E}_J\left[\|w_{t+1} - w^*\|^2\right] \le (1 - \gamma m)\|w_t - w^*\|^2 + \gamma^2\,\mathbb{E}_J\|\nabla f_J(w_t)\|^2.$$
We will now make some additional assumptions. First, we assume that $\|a_i\| = 1$ for all $i$. Next, we assume that we will always stay within a region such that the function $F(w) \le M$ (note that we can do this by evaluating the loss and ensuring that we don't take a step if this condition is violated, or by projection). Consequently, we have
$$\mathbb{E}_J\|\nabla f_J(w_t)\|^2 = \frac{1}{n}\sum_{i=1}^{n}\left\|a_i\left(a_i^\top w_t - y_i\right)\right\|^2 = \frac{1}{n}\sum_{i=1}^{n}\|a_i\|^2\left(a_i^\top w_t - y_i\right)^2 = 2F(w_t) \le 2M.$$
We are now in a position to complete the analysis. We have
$$\mathbb{E}\left[\|w_{t+1} - w^*\|^2\right] \le (1 - \gamma m)\,\mathbb{E}\left[\|w_t - w^*\|^2\right] + 2\gamma^2 M,$$
where we have taken an additional expectation with respect to the randomness up to and including time $t$. Rolling this out (think of induction in reverse), we have
$$\mathbb{E}\left[\|w_{t+1} - w^*\|^2\right] \le (1 - \gamma m)^2\,\mathbb{E}\left[\|w_{t-1} - w^*\|^2\right] + 2\gamma^2 M(1 - \gamma m) + 2\gamma^2 M.$$
Do you spot the pattern? We effectively have
$$\mathbb{E}\left[\|w_t - w^*\|^2\right] \le (1 - \gamma m)^t\,\mathbb{E}\left[\|w_0 - w^*\|^2\right] + 2\gamma^2 M \sum_{i=0}^{t-1}(1 - \gamma m)^i \le (1 - \gamma m)^t\,\mathbb{E}\left[\|w_0 - w^*\|^2\right] + 2\gamma^2 M \sum_{i=0}^{\infty}(1 - \gamma m)^i = (1 - \gamma m)^t\,\mathbb{E}\left[\|w_0 - w^*\|^2\right] + \frac{2\gamma M}{m}.$$
Now if we want the LHS to be less than $\varepsilon$, it suffices to make each of the above terms less than $\varepsilon/2$. In particular, we have the relations $\frac{2\gamma M}{m} \le \varepsilon/2$ and $(1 - \gamma m)^t\,\mathbb{E}\left[\|w_0 - w^*\|^2\right] \le \varepsilon/2$. Doing some algebra, we are led to the choices
$$\gamma = \frac{\varepsilon m}{4M}, \qquad t = \frac{1}{\gamma m}\log(D_0/\varepsilon) = \frac{4M}{\varepsilon m^2}\log(D_0/\varepsilon),$$
where $D_0 = \mathbb{E}\left[\|w_0 - w^*\|^2\right]$ denotes our initial (squared) distance to the optimum. In effect, we converge in $\approx \varepsilon^{-1}\log(1/\varepsilon)$ iterations, and each iteration takes $O(d)$ time (why?).

Let us now compare all three algorithms. Clearly, GD beats OLS provided $nd\log(1/\varepsilon) < nd^2$, which happens when $d > \log(1/\varepsilon)$. Think about what this means! Setting $\varepsilon = 10^{-6}$ (almost optimum), we see that GD wins for any problem in which $d > 20$! Comparing SGD and GD, the quantities are $nd\log(1/\varepsilon)$ versus $\frac{d}{\varepsilon}\log(1/\varepsilon)$. In other words, SGD provides gains in convergence when $n \gtrsim 1/\varepsilon$, i.e., when we have sufficiently many samples. There are also other advantages to SGD that this analysis doesn't quite illustrate, for instance scalability and generalization ability. Comparing SGD and OLS, we see that SGD wins when $nd^2 > \frac{d}{\varepsilon}\log(1/\varepsilon)$, and so the relevant comparison is between $nd$ and $\frac{1}{\varepsilon}\log(1/\varepsilon)$. SGD again wins for moderately sized problems.
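And a matching sketch of the SGD update analyzed in part (c), on synthetic data with rows normalized so that $\|a_i\| = 1$ (the assumption used above). With a constant step size, the iterates settle into a neighborhood of the least-squares solution whose radius scales with $\gamma$, which is what the final bound suggests; the particular step size and iteration count here are arbitrary.

```python
# SGD for least squares (part (c)): w <- w - gamma * a_J (a_J^T w - y_J).
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
A = rng.normal(size=(n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # enforce ||a_i|| = 1
w_true = rng.normal(size=d)
y = A @ w_true + 0.01 * rng.normal(size=n)
w_star = np.linalg.solve(A.T @ A, A.T @ y)     # exact least-squares solution

gamma = 0.05                                   # constant step size
w = np.zeros(d)
for _ in range(20_000):
    J = rng.integers(n)                        # uniform random index, O(d) work per step
    w -= gamma * A[J] * (A[J] @ w - y[J])

# With a constant step size, SGD reaches a gamma-dependent neighborhood of w*.
print(np.linalg.norm(w - w_star))
```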

(d) Write down the SGD update for logistic regression on two classes,
$$F(w) = \frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log\frac{1}{\sigma(w^\top x_i)} + (1 - y_i)\log\frac{1}{1 - \sigma(w^\top x_i)}\,\right].$$
Discuss why this is equivalent to minimizing a cross-entropy loss.
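No written solution for part (d) appears above. As a hedged sketch of one possible answer: the per-sample gradient of $F$ is $(\sigma(w^\top x_i) - y_i)\,x_i$ (the standard identity for the cross-entropy loss with labels $y_i \in \{0,1\}$), which gives the SGD update below. The function names, step size, and synthetic data are illustrative, not from the worksheet.

```python
# Illustrative SGD update for the logistic-regression objective in part (d):
# w <- w - gamma * (sigma(w^T x_i) - y_i) * x_i, with labels y_i in {0, 1}.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, gamma=0.1, steps=10_000, seed=0):
    """Plain SGD on F(w) = (1/n) sum_i cross-entropy(y_i, sigma(w^T x_i))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)  # sample one index uniformly at random
        w -= gamma * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w

# Tiny synthetic usage example.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = (sigmoid(X @ np.array([1.0, -2.0, 0.5])) > rng.uniform(size=500)).astype(float)
print(sgd_logistic(X, y))  # roughly recovers the direction of [1, -2, 0.5]
```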