HW3: Theory Questions
Inbal Joffe, Aran Carmon

1. Consider the input set S = {x_1, ..., x_d} ⊆ R^d such that (x_i)_j = δ_ij (that is, 1 if i = j, and 0 otherwise). The size of the set is d, and for every subset A ⊆ S we can build a network that outputs 1 exactly on the inputs in A. All layers except the first simply pass the previous layer through:

W^(t+1)_ij = δ_ij,  b^(t+1)_i = −1/2

(the last layer passes the first neuron). The first layer accepts only inputs from A:

W^(1)_ij = 1 if x_j ∈ A, 0 otherwise;  b^(1)_i = −1/2.

Hence S is shattered, giving the lower bound VCdim ≥ d.

Notice that we can get a better lower bound by looking at the network that connects all the inputs to the first neuron, and then only passes the output of that first neuron all the way to the final output. Such a network is essentially a single neuron, and we know that the VC dimension of a single neuron (a halfspace with bias) is d + 1. We will use this result in the next question.

2. a. The hypothesis class is of the form

B = {h_1 ∧ ... ∧ h_d : h_1, ..., h_d ∈ H}.

We have seen in the recitation that for m ≥ VCdim(H) = d + 1,

π_H(m) ≤ (em/(d+1))^(d+1).

We also saw in class that π_{F_1 ∧ F_2}(m) ≤ π_{F_1}(m) · π_{F_2}(m) (and inductively we can deduce, for all n ∈ N, π_{F_1 ∧ ... ∧ F_n}(m) ≤ π_{F_1}(m) · ... · π_{F_n}(m)).
Therefore,

π_B(m) ≤ (em/(d+1))^(d(d+1)).

b. The hypothesis class C is of the form {b_1 ∨ ... ∨ b_l : b_1, ..., b_l ∈ B}, where B is as defined above. We saw in class that π_{F_1 ∨ F_2}(m) ≤ π_{F_1}(m) · π_{F_2}(m) (and inductively we can deduce, for all n ∈ N, π_{F_1 ∨ ... ∨ F_n}(m) ≤ π_{F_1}(m) · ... · π_{F_n}(m)). Therefore,

π_C(m) ≤ (em/(d+1))^(ld(d+1)).

c. Each neuron has d + 1 parameters (the weights w^(t)_{i,:} and the bias b^(t)_i). Each of the first L − 1 layers has d neurons, and there is one additional neuron in the last layer; all in all,

N = (d+1)d(L−1) + (d+1).

Assume 2^m ≤ (em)^N; then

m ≤ N log_2(em) ⟹ em ≤ eN log_2(em) ⟹(*) em ≤ 2eN log_2(eN) ⟹ m ≤ 2N log_2(eN),

as required, where (*) applies the lemma below with x = em and a = eN.

(*) Lemma: for every a > 0, x ≤ a log_2(x) ⟹ x ≤ 2a log_2(a).
Proof: Let a > 0, and assume x > 2a log_2(a); we need to show that x > a log_2(x). Notice that for a ≤ e the claim holds: for x ≤ 1 it is trivial, and otherwise 0 < x − e log_2(x) ≤ x − a log_2(x). Now, for a > e, we get

x > 2a log_2(a) > 2a log_2(e) = 2a/ln(2) > a/ln(2).  (#)

Consider the function f(x) = x − a log_2(x); its derivative f′(x) = 1 − a/(x ln(2)) is non-negative by (#). Finally, since f is non-decreasing from 2a log_2(a) onward,

f(x) ≥ f(2a log_2(a)) = 2a log_2(a) − a log_2(2a log_2(a)) = 2a log_2(a) − a log_2(a) − a log_2(2 log_2(a)) = a log_2(a) − a log_2(2 log_2(a)) ≥ 0,

so x > a log_2(x), as required.

e. For every m ∈ N we have

π_C(m) = max_{|S|=m} |Π_C(S)| = max_{|S|=m} |{(h(s_1), ..., h(s_m)) : h ∈ C}| ≤ 2^m,

where the last inequality holds since {(h(s_1), ..., h(s_m)) : h ∈ C} ⊆ {−1, 1}^m. We need to show that

π_C(m) ≤ (em/(d+1))^(Ld(d+1)) ≤ (em)^((d+1)d(L−1)+d+1) = (em)^N  (for m ≥ d + 1).

Using d ≥ 1 and 1 ≤ L:

Ld^2 − Ld ≥ −d^2 + 1
⟹ (em)^(Ld^2−Ld) ≥ (em)^(−d^2+1)
⟹ (d+1)^(Ld^2+Ld) ≥ (em)^(d^2−1)
⟹ (em)^(Ld^2+Ld) ≤ (d+1)^(Ld^2+Ld) · (em)^(Ld^2+Ld−d^2+1)
⟹ (em/(d+1))^(Ld^2+Ld) ≤ (em)^N,

and we get π_C(m) ≤ (em)^N (for m ≥ d + 1). Therefore, when m = VCdim(C), we have 2^m = π_C(m) ≤ (em)^N, and we can apply the previous subquestion to get

VCdim(C) = m ≤ 2N log_2(eN).

Notice that we may take m = VCdim(C) here, since VCdim(C) ≥ d + 1, as we showed in the previous question.

3. a. In order to project a point x onto the ball of radius R, we first test whether ‖x‖ ≤ R. If it is, we return x; otherwise, we return R · x/‖x‖.

b.
Let z ∈ K, and denote by a, b, c the edges of the triangle with vertices x, y, z as in the picture (a between x and y, b between y and z, c between x and z). From the definition of x as the projection of y onto K, we know that a ≤ b, and we need to show that c ≤ b. For that, it is sufficient to show that β ≥ 90°, where β is the angle at the vertex x, since then b is the largest edge of the triangle.

Notice that for every z′ = εx + (1 − ε)z (for 0 < ε < 1) we have z′ ∈ K by convexity, so the corresponding triangle xyz′ also satisfies a ≤ b′; in addition, β′ = β (z′ lies on the segment between x and z), so α′ ≤ β′ = β for every such triangle xyz′, since the angle opposite the shorter edge is the smaller one. Assume by way of contradiction that β < 90°; then α′ < 90° for every triangle xyz′. Taking ε close enough to 1, the angle γ′ at y becomes arbitrarily small, so there exists γ′ > 0 small enough such that α′ + β′ + γ′ < 180°; a contradiction. We deduce β ≥ 90°, so b is the largest edge in xyz, and therefore c ≤ b.

c. The proof is exactly the same as the proof mentioned in the question, up to equation (12). In our case, we use w_{t+1} = Π_K(w_t − ηv_t) instead of w_{t+1} = w_t − ηv_t, and it changes the equation after equation (12) from

‖w_{t+1} − w*‖² = ‖w_t − ηv_t − w*‖² = ‖w_t − w*‖² + η²‖v_t‖² − 2η(w_t − w*)·v_t

to

‖w_{t+1} − w*‖² = ‖Π_K(w_t − ηv_t) − w*‖² ≤ ‖w_t − ηv_t − w*‖² = ‖w_t − w*‖² + η²‖v_t‖² − 2η(w_t − w*)·v_t,

where the inequality is the inequality from the previous subquestion.
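The projection from subquestion a and the projected update step used here can be sketched in NumPy. This is an illustrative sketch for the case where K is the L2 ball of radius R; the function names are ours:

```python
import numpy as np

def project_to_ball(x, R):
    """Projection onto the L2 ball of radius R (subquestion a):
    return x unchanged if it is already inside the ball, otherwise
    rescale it onto the sphere of radius R."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x

def projected_gd_step(w, v, eta, R):
    """One projected (sub)gradient step (subquestion c): take the
    usual step w - eta * v, then project the result back onto K."""
    return project_to_ball(w - eta * v, R)
```

For example, project_to_ball(np.array([3.0, 4.0]), 1.0) rescales the point by R/‖x‖ = 1/5.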
We then rearrange equation (13) to

v_t · (w_t − w*) ≤ (‖w_t − w*‖² − ‖w_{t+1} − w*‖²)/(2η) + (η/2)‖v_t‖².

The rest of the proof continues without further modifications.

4. a. Let w_1 and w_2 be the weights of a multiclass classifier with k = 2. We classify a new point x as 1 iff w_1·x > w_2·x, that is, iff (w_1 − w_2)·x > 0, which is the same as using a binary classifier with weights w = w_1 − w_2. On the other hand, if we have a binary classifier with weights w, we can build a multiclass classifier with w_1 = w and w_2 = −w. The new classifier will classify a point as 1 iff

w_1·x > w_2·x ⟺ w·x > −w·x ⟺ 2w·x > 0 ⟺ w·x > 0,

which is the same as the original binary classifier.

Furthermore, let us consider the optimization objective of the multiclass classifier,

f = (1/2)‖w_1‖² + (1/2)‖w_2‖² + C Σ_{i=1}^m max(0, (w_{3−y_i} − w_{y_i})·x_i + 1).

Since we classify into either label 1 or label 2, it is reasonable to expect w_1 = −w_2 at the optimum. In that case the above becomes

f = ‖w_1‖² + C Σ_{i=1}^m max(0, 1 − 2ŷ_i w_1·x_i),

where we define ŷ_i = 1 if y_i = 1 and ŷ_i = −1 if y_i = 2. Defining w = 2w_1, we get

f = (1/4)‖w‖² + C Σ_{i=1}^m max(0, 1 − ŷ_i w·x_i).

Defining C̃ = 2C and multiplying the objective by 2 (which does not change the minimizer), we get

2f = (1/2)‖w‖² + C̃ Σ_{i=1}^m max(0, 1 − ŷ_i w·x_i),

which is the same optimization problem as in the SVM we learned.

b. We differentiate with respect to w_j. Define

j* = argmax_p (w_p·x_i − w_{y_i}·x_i + 1(p ≠ y_i));

then

∂ℓ/∂w_j (w, x_i, y_i) = x_i (1(j = j*) − 1(j = y_i)),

and

∂f/∂w_j = w_j + C Σ_{i=1}^m ∂ℓ/∂w_j.

So an SGD version would be to sample a random point at each step, and to update every w_j according to the following rule:

If j ≠ y_i and j = argmax_p (w_{p,t}·x_i − w_{y_i,t}·x_i + 1(p ≠ y_i)):
    w_{j,t+1} = (1 − η)w_{j,t} − ηC x_i
If j = y_i and j ≠ argmax_p (w_{p,t}·x_i − w_{y_i,t}·x_i + 1(p ≠ y_i)):
    w_{j,t+1} = (1 − η)w_{j,t} + ηC x_i
In any other case:
    w_{j,t+1} = (1 − η)w_{j,t}

c. We notice that each w_j is a linear combination of the x_i's. Instead of keeping w_j explicitly, we can keep track of the coefficients of the x_i's: define w_j = Σ_{i=1}^m M_{j,i} x_i. Classifying a new point x is then

y = argmax_j (Σ_{i=1}^m M_{j,i} K(x_i, x)),

where K is the kernel function used. Pseudo-code for training:
Input:
    K: kernel function
    (x_i, y_i): list of m training samples
    T: number of iterations
    η: step size
    C: penalty coefficient
Output: a matrix M ∈ Mat(k, m), to be used for classifying new points

    initialize M ∈ Mat(k, m) to zeroes
    repeat T times:
        choose a random point (x_i, y_i), i ∈ [m], from the training set
        find j* = argmax_j (Σ_{t=1}^m M_{j,t} K(x_t, x_i) + 1(j ≠ y_i))
        M ← (1 − η)M
        for each j ∈ [k]:
            if j ≠ y_i and j = j*: M_{j,i} ← M_{j,i} − ηC
            if j = y_i and j ≠ j*: M_{j,i} ← M_{j,i} + ηC
    return M

(In the argmax we dropped the −w_{y_i}·x_i term of subquestion b, which is constant in j, but kept the margin indicator 1(j ≠ y_i).)

5. If at each level i of the tree we ask "is x_i = 0?", then after d questions each leaf contains exactly one member; that is, there is a one-to-one correspondence between leaves and vectors in {0,1}^d. To implement an arbitrary classifier using this tree, we label every leaf the same way the arbitrary classifier does.

Let us show that the VC dimension is 2^d. Let S ⊆ {0,1}^d with |S| = 2^d (that is, S = {0,1}^d), and let y_1, ..., y_{2^d} be arbitrary labels. Since we can classify any subset we wish as 1, we can choose a binary decision tree in which the leaf corresponding to each input s_i is labeled y_i. We have shown that a set of size 2^d can be shattered, which means VCdim ≥ 2^d; and since |{0,1}^d| = 2^d, no larger set can exist, so VCdim = 2^d.
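Returning to question 4c: the training pseudo-code above can be turned into a short runnable sketch. This is an illustrative NumPy reimplementation, not our actual q.py; the function names and the 0-based labels are our choices:

```python
import numpy as np

def train_kernel_svm(K, X, y, k, T, eta, C, seed=0):
    """Kernelized multiclass SVM SGD (question 4c). X is an (m, d)
    array, y holds labels in {0, ..., k-1}. Returns the coefficient
    matrix M with w_j = sum_i M[j, i] * x_i."""
    rng = np.random.default_rng(seed)
    m = len(X)
    # Gram matrix, computed once so each SGD step is cheap
    G = np.array([[K(X[s], X[t]) for t in range(m)] for s in range(m)])
    M = np.zeros((k, m))
    for _ in range(T):
        i = int(rng.integers(m))
        scores = M @ G[i]                                  # scores[j] = w_j . x_i
        margins = scores - scores[y[i]] + (np.arange(k) != y[i])
        j_star = int(np.argmax(margins))
        M *= (1.0 - eta)                                   # the (1 - eta) shrink
        if j_star != y[i]:                                 # hinge loss is active
            M[j_star, i] -= eta * C
            M[y[i], i] += eta * C
    return M

def predict(M, K, X_train, x):
    """Classify x as argmax_j sum_i M[j, i] * K(x_i, x)."""
    scores = [sum(M[j, t] * K(X_train[t], x) for t in range(len(X_train)))
              for j in range(M.shape[0])]
    return int(np.argmax(scores))
```

With a linear kernel this reduces to the explicit SGD of question 4b, which is a convenient way to test it.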
HW3: Programming Assignment
Aran Carmon, Inbal Joffe

6. a. We created two functions to plot the training and validation errors for various values of η and C.

How to run:
python q.py 6 find_eta <from> <to> <step> <C> <T> <filename>
python q.py 6 find_c <from> <to> <step> <eta> <T> <filename>

We first scan for η along a logarithmic scale, with T = 1000 and C = 1. Both the training error and the validation error are shown in the plot, and we see that they are almost the same. We continue by scanning for C, using T = 1000 and η = 10^(−6.7), and then zoom in, using T = 10000.
We choose the parameters η = 10^(−6.7) and C = 10^(−0.5).

b. Weights for the digits, shown as images:
We see that some of the weights resemble the digits they classify, e.g. 2, 3, and 9. Other weights look more like a mix of several digits.

How to run:
python q.py 6 show_digit <C> <eta> <T> <digit> <filename>

c. Using T = 4 · len(train_data) = 200000, we measured an accuracy of 0.9165.

How to run:
python q.py 6 calc_accuracy <C> <eta> <T>
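The training loop behind these measurements is the SGD update rule of question 4b. A minimal illustrative sketch (not our actual q.py; the names and the toy data in the example are ours):

```python
import numpy as np

def sgd_multiclass_svm(X, y, k, T, eta, C, seed=0):
    """Linear multiclass SVM trained with the SGD rule of question 4b.
    X is an (m, d) array, y holds labels in {0, ..., k-1}.
    Returns the weight matrix W of shape (k, d)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    W = np.zeros((k, d))
    for _ in range(T):
        i = int(rng.integers(m))
        # margins[j] = w_j . x_i - w_{y_i} . x_i + 1(j != y_i)
        margins = W @ X[i] - W[y[i]] @ X[i] + (np.arange(k) != y[i])
        j_star = int(np.argmax(margins))
        W *= (1.0 - eta)               # the (1 - eta) shrink from the update rule
        if j_star != y[i]:             # loss is active: push the two rows apart
            W[j_star] -= eta * C * X[i]
            W[y[i]] += eta * C * X[i]
    return W

def accuracy(W, X, y):
    """Fraction of points whose argmax score matches the label."""
    return float(np.mean(np.argmax(X @ W.T, axis=1) == y))
```

On a small linearly separable toy set this reaches accuracy 1.0 within a few hundred iterations.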
7. a. We created two functions to plot the training and validation errors for various values of η and C.

How to run:
python q.py 7 find_eta <kernel> <training size> <from> <to> <step> <C> <T> <filename>
python q.py 7 find_c <kernel> <training size> <from> <to> <step> <eta> <T> <filename>

For quicker tuning of the parameters, we used a training set of only 1000 points, sampled randomly from the full training set each time. We started by scanning for an η value, with T = 1000 and C = 1. The accuracy is mostly uniform in the lower part of the plot, so we continued by scanning for a C value with η = 10^(−6). The accuracy across C values also seems uniform, so we chose C = 1.

b. With C = 1, η = 10^(−6), and T = 10000, we measured an accuracy of 0.9352 on the test set.

How to run:
python q.py 7 calc_accuracy <kernel> <C> <eta> <T>
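The quadratic kernel used here is, assuming the common form, K(u, v) = (1 + ⟨u, v⟩)^2. It is equivalent to an explicit degree-2 feature map, which the sketch below spells out for 2-dimensional inputs (the helper names are ours):

```python
import numpy as np

def quadratic_kernel(u, v):
    """The quadratic kernel (1 + <u, v>)^2."""
    return (1.0 + np.dot(u, v)) ** 2

def quadratic_features(u):
    """Explicit feature map phi with <phi(u), phi(v)> = (1 + <u, v>)^2,
    written out for 2-dimensional inputs: all monomials of degree <= 2,
    with sqrt(2) factors on the cross terms."""
    x1, x2 = u
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])
```

Expanding (1 + u·v)^2 = 1 + 2 u·v + (u·v)^2 shows the two agree, so training with this kernel is an SVM over degree-2 features without ever materializing them.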
c. We measured an accuracy of 0.932 with an RBF kernel (σ = 1000, T = 10000, C = 1, η = 10^(−6)), which is comparable to the quadratic kernel. Due to time constraints, we did not investigate it further.

How to run:
python q.py 7 calc_accuracy r1000 1 1e-6 10000
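For completeness, the RBF kernel referred to above, in the parametrization we assume (the exact form in our q.py may differ):

```python
import numpy as np

def rbf_kernel(u, v, sigma=1000.0):
    """RBF (Gaussian) kernel exp(-||u - v||^2 / (2 sigma^2));
    sigma = 1000 is the value reported above."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))
```

With σ this large the kernel varies slowly over the data, which may explain why its accuracy is comparable to that of the quadratic kernel.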