
HW3
Inbal Joffe, Aran Carmon

Theory Questions

1. Let us look at the input set $S = \{x_1, \ldots, x_d\} \subseteq \mathbb{R}^d$ such that $(x_i)_j = \delta_{ij}$ (that is, $1$ if $i = j$, and $0$ otherwise). The size of the set is $d$, and for every subset $A$ of $S$ we can build a network that outputs $1$ only on inputs in $A$. All layers except the first simply pass the previous layer through: $W^{(t+1)}_{ij} = \delta_{ij}$, $b^{(t+1)}_i = -\frac{1}{2}$ (the last layer passes only the first neuron). The first layer accepts exactly the inputs from $A$:
$$W^{(1)}_{ij} = \begin{cases} 1 & x_i \in A \\ 0 & \text{otherwise} \end{cases}, \qquad b^{(1)}_i = -\frac{1}{2}.$$
Hence $S$ is shattered, and the VC dimension is at least $d$.

Notice that we can get a better lower bound by looking at the network that connects all the inputs to the first neuron, and that only passes the output of the first neuron all the way to the final output. This network is essentially just the first neuron, and we know that its VC dimension equals $d + 1$. We will use this result in the next question.
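As a sanity check, here is a minimal NumPy sketch of the construction above, collapsed to a single hidden layer (the pass-through layers add nothing). It assumes sign activations and the convention that $W_{ij}$ is the weight from input $i$ to neuron $j$; both are assumptions about the course's network definition, and the code is only illustrative.

    import numpy as np

    d = 4
    A = {0, 2}             # indices i such that x_i belongs to the chosen subset A
    X = np.eye(d)          # the input set S: row k is the standard basis vector x_k

    # First layer: every neuron fires (+1) exactly on inputs from A,
    # using weight 1 on the coordinates of A, bias -1/2, and a sign activation.
    W1 = np.array([[1.0 if i in A else 0.0] * d for i in range(d)])
    b1 = -0.5 * np.ones(d)

    # Output layer: pass only the first hidden neuron through.
    w_out = np.zeros(d)
    w_out[0] = 1.0
    b_out = -0.5

    def net(x):
        h = np.sign(x @ W1 + b1)            # hidden layer, sign activation
        return np.sign(h @ w_out + b_out)   # output keeps the first hidden neuron

    for k in range(d):
        print(k, net(X[k]))                 # +1.0 exactly for k in A, -1.0 otherwise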

2. a. The hypothesis class is of the form
$$B = \{h_1 \times \cdots \times h_d : h_1, \ldots, h_d \in H\}.$$
We have seen in the recitation that for $m \ge \mathrm{VCdim}(H) = d + 1$,
$$\pi_H(m) \le \left(\frac{em}{d+1}\right)^{d+1}.$$
Also, we saw in class that $\pi_{F_1 \times F_2}(m) \le \pi_{F_1}(m)\,\pi_{F_2}(m)$ (and inductively, for all $n \in \mathbb{N}$, $\pi_{F_1 \times \cdots \times F_n}(m) \le \pi_{F_1}(m) \cdots \pi_{F_n}(m)$). Therefore,
$$\pi_B(m) \le \left(\frac{em}{d+1}\right)^{d(d+1)}.$$

b. The hypothesis class $C$ is of the form $\{b_1 \circ \cdots \circ b_l : b_1, \ldots, b_l \in B\}$, where $B$ is as defined above. We also saw in class that $\pi_{F_1 \circ F_2}(m) \le \pi_{F_1}(m)\,\pi_{F_2}(m)$ (and, inductively, $\pi_{F_1 \circ \cdots \circ F_n}(m) \le \pi_{F_1}(m) \cdots \pi_{F_n}(m)$). Therefore,
$$\pi_C(m) \le \left(\frac{em}{d+1}\right)^{l d(d+1)}.$$

c. Each neuron has $d + 1$ parameters (its incoming weights $W^{(t)}_{i,:}$ and its bias $b^{(t)}_i$). Each of the first $L - 1$ layers has $d$ neurons, and there is a single additional neuron in the last layer; all in all,
$$N = (d+1)d(L-1) + (d+1) \ge d.$$

d. Assume $2^m \le (em)^N$; then
$$m \le N \log_2(em) \;\Rightarrow\; em \le eN \log_2(em) \;\overset{(*)}{\Rightarrow}\; em \le 2eN \log_2(eN) \;\Rightarrow\; m \le 2N \log_2(eN),$$
as required.

(*) Lemma: for every $a > 0$, if $x \le a \log_2(x)$ then $x \le 2a \log_2(a)$.

Proof: Let $a > 0$ and assume $x > 2a \log_2(a)$; we need to show that $x > a \log_2(x)$ (the contrapositive). Notice that for $a \le e$ the claim holds: for $x \le 1$ it is trivial, and otherwise $0 < x - e \log_2(x) \le x - a \log_2(x)$. Now, for $a > e$ we get $x > 2a \log_2(a) > \frac{2a}{\ln(2)} > \frac{a}{\ln(2)}$ (#). Consider the function $f(x) = x - a \log_2(x)$; its derivative is $f'(x) = 1 - \frac{a}{x \ln(2)} \overset{(\#)}{\ge} 0$, so $f$ is non-decreasing on the range of $x$ we consider. Finally, since $2a \log_2(a) > 0$, it follows that
$$f(x) = x - a \log_2(x) \ge 2a \log_2(a) - a \log_2\!\big(2a \log_2(a)\big) = 2a \log_2(a) - a \log_2(a) - a \log_2\!\big(2 \log_2(a)\big) = a \log_2(a) - a \log_2\!\big(2 \log_2(a)\big) \ge 0,$$
hence $x > a \log_2(x)$, as required.

e. For every $m \in \mathbb{N}$ we have
$$\pi_C(m) = \max_{|S| = m} |\Pi_C(S)| = \max_{|S| = m} \big|\{(h(s_1), \ldots, h(s_m)) : h \in C\}\big| \le 2^m,$$
where the last inequality holds since $\{(h(s_1), \ldots, h(s_m)) : h \in C\} \subseteq \{-1, 1\}^m$. We need to show, for $m \ge d + 1$, that
$$\pi_C(m) \le \left(\frac{em}{d+1}\right)^{L d(d+1)} \overset{?}{\le} (em)^{(d+1)d(L-1) + d + 1} = (em)^N$$
(the first inequality is part (b) with $l = L$). Using $d \ge 1$ and $1 \le L$:
$$Ld^2 + Ld \ge d^2 + 1,$$
$$(em)^{Ld^2 + Ld} \ge (em)^{d^2 + 1}, \qquad (d+1)^{Ld^2 + Ld} \ge (em)^{d^2 + 1},$$
$$\frac{(em)^{Ld^2 + Ld}}{(d+1)^{Ld^2 + Ld}} \le (em)^{Ld^2 + Ld - d^2 + 1},$$
$$\left(\frac{em}{d+1}\right)^{Ld(d+1)} \le (em)^N,$$
and we get $\pi_C(m) \le (em)^N$ for $m \ge d + 1$. Therefore, when $m = \mathrm{VCdim}(C)$, we have $2^m = \pi_C(m) \le (em)^N$, and we can apply the previous subquestion to get
$$\mathrm{VCdim}(C) = m \le 2N \log_2(eN).$$
Notice that we may indeed take $m = \mathrm{VCdim}(C)$ here, since $\mathrm{VCdim}(C) \ge d + 1$, as we showed in the previous question.
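To get a feel for the final bound, a short Python check of the parameter count $N$ and the resulting bound $2N \log_2(eN)$ for a couple of illustrative $(d, L)$ values (the values themselves are arbitrary):

    import math

    for d, L in [(2, 3), (10, 5)]:
        N = (d + 1) * d * (L - 1) + (d + 1)      # number of parameters
        bound = 2 * N * math.log2(math.e * N)    # VCdim(C) <= 2N log2(eN)
        print(f"d={d}, L={L}: N={N}, VC bound ~ {bound:.1f}")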

3. a. In order to project, we first test whether $\|x\| \le R$. If it is, we return $x$; otherwise, we return $R \frac{x}{\|x\|}$.
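A minimal NumPy sketch of this projection onto the ball of radius $R$ (illustrative only, not the assignment's actual code):

    import numpy as np

    def project_to_ball(x, R):
        """Return x if ||x|| <= R, and R * x / ||x|| otherwise."""
        norm = np.linalg.norm(x)
        return x if norm <= R else (R / norm) * x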

b. Let $z \in K$, and denote by $a, b, c$ the edges of the triangle between $x$, $y$, $z$ as in the picture. From the definition of $x$ we know that $a \le b$, and we need to show that $c \le b$. For that, it is sufficient to show that $\beta \ge 90^{\circ}$. Notice that for every $z' = \epsilon x + (1 - \epsilon) z$ (for $0 < \epsilon < 1$), the corresponding triangle $x y z'$ also satisfies $a \le b'$; in addition $\beta' = \beta$, so $\alpha' \le \beta$ for every such triangle $x y z'$. Assume by way of contradiction that $\beta < 90^{\circ}$; then $\alpha' < 90^{\circ}$ for every $x y z'$, and there exists a $z'$ with $\gamma' > 0$ small enough such that $\alpha' + \beta + \gamma' < 180^{\circ}$; a contradiction. We therefore deduce $\beta \ge 90^{\circ}$, so $b$ is the largest edge in $x y z$, and hence $c \le b$.

c. The proof is exactly the same as the proof mentioned in the question, until equation (12). In our case we use $w_{t+1} = \Pi_K(w_t - \eta v_t)$ instead of $w_{t+1} = w_t - \eta v_t$, and this changes the equation after equation (12) from
$$\|w_{t+1} - w^{\star}\|_2^2 = \|w_t - \eta v_t - w^{\star}\|_2^2 = \|w_t - w^{\star}\|_2^2 + \eta^2 \|v_t\|_2^2 - 2\eta (w_t - w^{\star}) \cdot v_t$$
to
$$\|w_{t+1} - w^{\star}\|_2^2 = \|\Pi_K(w_t - \eta v_t) - w^{\star}\|_2^2 \le \|w_t - \eta v_t - w^{\star}\|_2^2 = \|w_t - w^{\star}\|_2^2 + \eta^2 \|v_t\|_2^2 - 2\eta (w_t - w^{\star}) \cdot v_t,$$
where the inequality is the inequality from the previous subquestion. We then continue to rearrange equation (13) into
$$v_t \cdot (w_t - w^{\star}) \le \frac{\|w_t - w^{\star}\|_2^2 - \|w_{t+1} - w^{\star}\|_2^2}{2\eta} + \frac{1}{2}\eta \|v_t\|_2^2,$$
and the rest of the proof continues without further modifications.
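To make the modification concrete, a minimal NumPy sketch of one projected-SGD step, with $K$ taken to be the ball of radius $R$ from part (a); subgrad stands in for whatever loss subgradient is being used:

    import numpy as np

    def projected_sgd_step(w, subgrad, eta, R):
        """One projected SGD step: w <- Pi_K(w - eta * v), with K = {w : ||w|| <= R}."""
        v = subgrad(w)                       # v_t, a subgradient at w_t
        w_new = w - eta * v                  # the usual SGD step
        norm = np.linalg.norm(w_new)
        return w_new if norm <= R else (R / norm) * w_new   # project back onto K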

4. a. Let $w_1$ and $w_2$ be the weights of a multiclass classifier with $k = 2$. We classify a new point $x$ as $1$ iff $w_1 \cdot x > w_2 \cdot x$, that is, $(w_1 - w_2) \cdot x > 0$, which is the same as using a binary classifier with the single weight vector $w = w_1 - w_2$: indeed, $w \cdot x > 0$ iff $(w_1 - w_2) \cdot x > 0$. On the other hand, if we have a binary classifier with weights $w$, we can build a multiclass classifier with $w_1 = w$ and $w_2 = -w$. The new classifier classifies a new point as $1$ iff
$$w_1 \cdot x > w_2 \cdot x \iff w \cdot x > -w \cdot x \iff 2 w \cdot x > 0 \iff w \cdot x > 0,$$
which is the same as the original binary classifier.

Furthermore, let us consider the optimization problem of the multiclass classifier,
$$f = \frac{1}{2}\|w_1\|^2 + \frac{1}{2}\|w_2\|^2 + \frac{C}{2} \sum_{i=1}^{m} \max\big(0,\ (w_{3 - y_i} - w_{y_i}) \cdot x_i + 1\big).$$
Since we classify to either label 1 or label 2, it is reasonable to expect $w_2 = -w_1$. In that case the above turns into
$$f = \|w_1\|^2 + \frac{C}{2} \sum_{i=1}^{m} \max\big(0,\ 1 - 2 w_{y_i} \cdot x_i\big).$$
Define
$$y'_i = \begin{cases} 1 & y_i = 1 \\ -1 & y_i = 2 \end{cases}$$
and $w = 2 w_1$; then $2 w_{y_i} \cdot x_i = y'_i\, w \cdot x_i$ and $\|w_1\|^2 = \frac{1}{4}\|w\|^2$, so
$$f = \frac{1}{4}\|w\|^2 + \frac{C}{2} \sum_{i=1}^{m} \max\big(0,\ 1 - y'_i\, w \cdot x_i\big) = \frac{1}{2}\left(\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \max\big(0,\ 1 - y'_i\, w \cdot x_i\big)\right).$$
Up to the overall factor of $\frac{1}{2}$, which does not change the minimizer, this is the same optimization problem as in the SVM we learned.
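A tiny NumPy check of the first equivalence, on random data (illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    w1, w2 = rng.normal(size=5), rng.normal(size=5)
    X = rng.normal(size=(100, 5))

    multiclass = np.where(X @ w1 > X @ w2, 1, 2)   # two weight vectors, k = 2
    binary = np.where(X @ (w1 - w2) > 0, 1, 2)     # single weight vector w = w1 - w2
    print(np.array_equal(multiclass, binary))      # True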

b. We differentiate with respect to $w_j$. Define
$$j^{*}(w, x_i, y_i) = \arg\max_p \big(w_p \cdot x_i - w_{y_i} \cdot x_i + \mathbb{1}(p \ne y_i)\big).$$
Then, with $\ell_i$ denoting the hinge-loss term of sample $i$,
$$\frac{\partial \ell_i}{\partial w_j} = x_i \big(\mathbb{1}(j = j^{*}) - \mathbb{1}(j = y_i)\big), \qquad \frac{\partial f}{\partial w_j} = w_j + C \sum_{i=1}^{m} \frac{\partial \ell_i}{\partial w_j}.$$
So an SGD version would be to sample a random point $(x_i, y_i)$ at each step and to update all the $w_j$'s according to the following rule:

If $j \ne y_i$ and $j = \arg\max_p \big(w_{p,t} \cdot x_i - w_{y_i,t} \cdot x_i + \mathbb{1}(p \ne y_i)\big)$: $w_{j,t+1} = (1 - \eta) w_{j,t} - \eta C x_i$.
If $j = y_i$ and $j \ne \arg\max_p \big(w_{p,t} \cdot x_i - w_{y_i,t} \cdot x_i + \mathbb{1}(p \ne y_i)\big)$: $w_{j,t+1} = (1 - \eta) w_{j,t} + \eta C x_i$.
In any other case: $w_{j,t+1} = (1 - \eta) w_{j,t}$.
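A minimal NumPy sketch of one such update, with the class weight vectors stacked as the rows of a matrix W (shape k x d) and 0-indexed labels; the names are illustrative and not those used in q.py:

    import numpy as np

    def multiclass_sgd_step(W, x_i, y_i, eta, C):
        """One SGD step of the update rule above; W has shape (k, d), y_i is a 0-indexed label."""
        k = W.shape[0]
        # scores of the argmax in the rule: w_p . x_i - w_{y_i} . x_i + 1(p != y_i)
        scores = W @ x_i - W[y_i] @ x_i + (np.arange(k) != y_i)
        j_star = int(np.argmax(scores))
        W = (1.0 - eta) * W                  # the regularization part of the gradient
        if j_star != y_i:                    # the hinge part is active
            W[j_star] -= eta * C * x_i
            W[y_i] += eta * C * x_i
        return W

When $j^{*} = y_i$ the hinge term contributes nothing and only the shrinkage is applied, matching the "any other case" rule.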

c. We notice that $w_j$ is a linear combination of the $x_i$'s. Instead of keeping $w_j$ explicitly, we can keep track of the coefficients of the $x_i$'s: define $w_j = \sum_{i=1}^{m} M_{j,i}\, x_i$. Classifying a new point $x$ is then $y = \arg\max_j \big(\sum_{i=1}^{m} M_{j,i}\, K(x_i, x)\big)$, where $K$ is the kernel function used. Pseudo-code for training:

Input:
    kernel function K
    list (x_i, y_i) of m training samples
    T, the number of iterations
    η, the step size
    C, the penalty coefficient
Output:
    a matrix M in Mat(k, m) to be used for classifying new points

    initialize M in Mat(k, m) to zeros
    for T iterations:
        choose a random point i in [m]: (x_i, y_i) from the training set
        find j* = argmax_j ( sum_{t=1}^{m} M_{j,t} K(x_t, x_i) )
        M = (1 - η) M
        for each j in [k]:
            if j != y_i and j = j*:  M_{j,i} = M_{j,i} - ηC
            if j = y_i and j != j*:  M_{j,i} = M_{j,i} + ηC
    return M

5. If at each level $i$ of the tree we ask "$x_i = 0$?", then after $d$ questions each leaf will contain only one member; that is, there is a one-to-one correspondence between leaves and vectors in $\{0,1\}^d$. To implement an arbitrary classifier using this tree, classify every leaf the same way the arbitrary classifier does. Let us show that the VC dimension is $2^d$: let $S \subseteq \{0,1\}^d$ with $|S| = 2^d$ (that is, $S = \{0,1\}^d$), and let $y_1, \ldots, y_{2^d}$ be arbitrary labels. Since we can classify any subset we wish as $1$, we can choose a binary decision tree in which the leaf corresponding to each input $s_i$ is classified as $y_i$. We showed that a set of size $2^d$ can be shattered, which means $\mathrm{VCdim} \ge 2^d$; and since the whole domain $\{0,1\}^d$ has only $2^d$ points, $\mathrm{VCdim} \le 2^d$, so $\mathrm{VCdim} = 2^d$.
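The leaf-per-vector correspondence is just a lookup table over $\{0,1\}^d$; a tiny illustrative sketch (the parity labeling is an arbitrary example):

    import itertools

    d = 3
    # An arbitrary labeling of {0,1}^d: the classifier we want the tree to implement.
    labels = {x: sum(x) % 2 for x in itertools.product((0, 1), repeat=d)}

    def tree_predict(x):
        # Asking "x_i = 0?" at level i routes every input to its own leaf,
        # so the leaf can simply return the desired label for that input.
        return labels[tuple(x)]

    print(all(tree_predict(x) == y for x, y in labels.items()))  # True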

HW3: Programming Assignment
Aran Carmon, Inbal Joffe

6. a. We created two functions to plot the training and validation errors for various values of $\eta$ and $C$. How to run:

    python q.py 6 find_eta <from> <to> <step> <C> <T> <filename>
    python q.py 6 find_c <from> <to> <step> <eta> <T> <filename>

We first scan for $\eta$ along a logarithmic scale, with $T = 1000$ and $C = 1$. Both the training error and the validation error are shown in the plot, and we see that they are almost the same. We then continue to scan for $C$, using $T = 1000$ and $\eta = 10^{-6.7}$, and zoom in using $T = 10000$.
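A minimal sketch of the kind of logarithmic scan find_eta performs; train_and_eval stands in for the assignment's own training and evaluation code, and the grid values are only an example:

    import numpy as np

    def scan_eta(train_and_eval, exponents, C, T):
        """Run train_and_eval(eta, C, T) -> (train_err, validation_err) over a log grid of eta."""
        results = []
        for p in exponents:
            eta = 10.0 ** p
            train_err, val_err = train_and_eval(eta, C, T)
            results.append((eta, train_err, val_err))
        return results

    # Example usage with a dummy stand-in, just to show the shape of the scan:
    dummy = lambda eta, C, T: (abs(np.log10(eta) + 6.7), abs(np.log10(eta) + 6.7))
    for eta, tr, va in scan_eta(dummy, np.arange(-9.0, -4.0, 0.5), C=1, T=1000):
        print(f"eta={eta:.2e}  train={tr:.2f}  validation={va:.2f}")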

We choose the parameters $\eta = 10^{-6.7}$ and $C = 10^{0.5}$.

b. Weights for the digits, shown as images:
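A minimal matplotlib sketch of how such weight images can be produced, assuming 28x28 MNIST inputs and a weight matrix W with one row per digit (the names and shapes are assumptions, not the assignment's actual code):

    import matplotlib.pyplot as plt

    def show_digit_weights(W, filename):
        """Plot each digit's weight vector as a 28x28 image; W is assumed to have shape (10, 784)."""
        fig, axes = plt.subplots(2, 5, figsize=(10, 4))
        for digit, ax in enumerate(axes.ravel()):
            ax.imshow(W[digit].reshape(28, 28), cmap="gray")
            ax.set_title(str(digit))
            ax.axis("off")
        fig.savefig(filename)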

We see that some of the weights resemble the digits they classify, e.g. 2, 3, and 9. Other weights look more like a mix of several digits. How to run:

    python q.py 6 show_digit <C> <eta> <T> <digit> <filename>

c. Using T = 4 * len(train_data) = 200000, we measured an accuracy of 0.9165. How to run:

    python q.py 6 calc_accuracy <C> <eta> <T>
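For reference, a minimal sketch of how the test accuracy can be computed from a trained weight matrix W of shape (10, 784); the names are illustrative, not those used in q.py:

    import numpy as np

    def accuracy(W, test_x, test_y):
        """Fraction of test points whose highest-scoring class matches the label."""
        predictions = np.argmax(test_x @ W.T, axis=1)
        return float(np.mean(predictions == test_y))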

7. a. We created two functions to plot the training and validation errors for various values of $\eta$ and $C$. How to run:

    python q.py 7 find_eta <kernel> <training size> <from> <to> <step> <C> <T> <filename>
    python q.py 7 find_c <kernel> <training size> <from> <to> <step> <eta> <T> <filename>

For quicker tuning of the parameters, we used a training set of only 1000 points, sampled randomly each time from the full training set. We started by scanning for an $\eta$ value, with $T = 1000$ and $C = 1$. The accuracy is mostly uniform in the lower part of the plot, so we continued by scanning for a $C$ value with $\eta = 10^{-6}$. The accuracy across the $C$ values also seems uniform, so we chose $C = 1$.

b. With $C = 1$, $\eta = 10^{-6}$, and $T = 10000$, we measured an accuracy of 0.9352 on the test set. How to run:

    python q.py 7 calc_accuracy <kernel> <C> <eta> <T>
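For reference, a minimal sketch of the two kernels referred to in this question: a quadratic kernel for parts (a) and (b), and the RBF kernel used in part (c). The exact forms and constants used in q.py are assumptions:

    import numpy as np

    def quadratic_kernel(x, z):
        # A common quadratic kernel; q.py may use a different constant term.
        return (1.0 + np.dot(x, z)) ** 2

    def rbf_kernel(x, z, sigma=1000.0):
        # Gaussian / RBF kernel with the sigma value reported in part (c).
        return np.exp(-np.dot(x - z, x - z) / (2.0 * sigma ** 2))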

c. We measured an accuracy of 0.932 with an RBF kernel ($\sigma = 1000$, $T = 10000$, $C = 1$, $\eta = 10^{-6}$), which is comparable to the quadratic kernel. Due to time constraints, we did not investigate it further. How to run:

    python q.py 7 calc_accuracy r1000 0-6 10000