Introduction to Machine Learning — Recitation 8, Fall Semester
Lecturer: Regev Schweiger. Scribe: Regev Schweiger.

8.1 Backpropagation

We will develop and review the backpropagation algorithm for neural networks. To have a concrete example, we will focus on the choice of the sigmoid activation, σ(z) = 1/(1 + e^{−z}), with the log loss function, ℓ(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ), but of course the derivation holds in general.

8.1.1 Two layers, one node

We start from the simplest (interesting) example and work it out very explicitly. Suppose we are given a sample x_1, …, x_m with labels y_1, …, y_m ∈ {0, 1}. The simplest network has two layers, with one node in each:

    z_0 --(w_1, b_1)--> z_1 --(w_2, b_2)--> z_2 = ŷ

The function we want to minimize is the loss over all examples:

    f = Σ_{i=1}^{m} ℓ(y_i, z_2(x_i))

We take a gradient descent approach. At each iteration, we have our current guesses for the best parameter values, w̃_1, b̃_1, w̃_2, b̃_2, and we wish to calculate the gradient at that point, i.e., the partial derivatives

    ∂f/∂w_1,  ∂f/∂b_1,  ∂f/∂w_2,  ∂f/∂b_2,

each evaluated at w_1 = w̃_1, b_1 = b̃_1, w_2 = w̃_2, b_2 = b̃_2.

First, since differentiation is linear, we can focus on the loss over a single sample, setting y = y_i and z_0 = x_i; for example,

    ∂f/∂w_1 = Σ_{i=1}^{m} ∂/∂w_1 ℓ(y_i, z_2(x_i)).
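To make the setup concrete, here is a minimal sketch of one gradient-descent run on this two-layer, one-node network, with the gradient computed by crude finite differences. The sample values, initial parameters, and step size are made up for illustration; backpropagation, developed in the rest of the section, computes the same gradient exactly and far more cheaply.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def per_sample_loss(x, y, w1, b1, w2, b2):
    # Log loss of the two-layer, one-node network on a single sample.
    z2 = sigmoid(sigmoid(x * w1 + b1) * w2 + b2)
    return -y * math.log(z2) - (1 - y) * math.log(1 - z2)

def total_loss(xs, ys, params):
    # f is a sum over samples, so its gradient is the sum of per-sample gradients.
    return sum(per_sample_loss(x, y, *params) for x, y in zip(xs, ys))

def grad_fd(xs, ys, params, eps=1e-6):
    # Central finite-difference approximation to the gradient of f.
    g = []
    for i in range(len(params)):
        hi = list(params); hi[i] += eps
        lo = list(params); lo[i] -= eps
        g.append((total_loss(xs, ys, hi) - total_loss(xs, ys, lo)) / (2 * eps))
    return g

# Illustrative data and initial guesses (w1, b1, w2, b2); eta is the step size.
xs, ys = [0.2, -1.0, 0.7], [1, 0, 1]
params = [0.5, 0.0, -0.3, 0.1]
eta = 0.3
for _ in range(100):
    params = [p - eta * g for p, g in zip(params, grad_fd(xs, ys, params))]
```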
We proceed as if there is a single sample. To continue, it will help to introduce convenient notation (see also the lesson scribe) and to draw it as well. Define v_1 = z_0 w_1 + b_1 and v_2 = z_1 w_2 + b_2, the linear combinations used in the calculations of z_1 and z_2, respectively. Namely, z_1 = σ(v_1) and z_2 = σ(v_2). We add these to the graph above, which now shows the explicit calculation we make. We also show the loss function, since it is appended to the end of the calculation (y feeds into the loss node):

    z_0 --> v_1 --> z_1 --> v_2 --> z_2 --> ℓ(y, z_2)

Derivative w.r.t. w_2, b_2

Let us first recall the chain rule. Suppose we have three variables/functions x, y, z, with z = z(y) and y = y(x), thus z = z(y(x)). Say we want to evaluate the derivative of z at a point x̃. Then the chain rule says:

    dz/dx |_{x = x̃} = dz/dy |_{y = y(x̃)} · dy/dx |_{x = x̃}

We want to calculate ∂ℓ(y, z_2(z_0))/∂w_2 at w_1 = w̃_1, b_1 = b̃_1, w_2 = w̃_2, b_2 = b̃_2; the dependency on the current point is hidden in z_2. We denote by ṽ_i the evaluation of v_i at the specific point we are at, e.g., ṽ_1 = z_0 w̃_1 + b̃_1. Similarly, z̃_i = σ(ṽ_i). Following the chain rule and the drawing above, we can see that

    ∂ℓ(y, z_2(z_0))/∂w_2 = ∂ℓ(y, z_2)/∂z_2 |_{z_2 = z̃_2} · ∂z_2/∂v_2 |_{v_2 = ṽ_2} · ∂v_2/∂w_2 |_{z_1 = z̃_1}

The left expression is the derivative of the loss function with respect to the estimate. In our case it is the log loss, evaluated at the current point, that is:

    ∂ℓ(y, z_2)/∂z_2 |_{z_2 = z̃_2} = −y/z̃_2 + (1 − y)/(1 − z̃_2)

The middle expression can be solved using the useful identity (exercise) σ′(v) = σ(v)(1 − σ(v)), to give:

    ∂z_2/∂v_2 |_{v_2 = ṽ_2} = σ(ṽ_2)(1 − σ(ṽ_2)) = z̃_2(1 − z̃_2)
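The identity σ′(v) = σ(v)(1 − σ(v)) is left as an exercise above; a quick numerical sanity check (with an arbitrary test point v = 0.7) compares it against a central finite difference:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Finite-difference estimate of sigma'(v) versus sigma(v) * (1 - sigma(v)).
v, eps = 0.7, 1e-6
numeric = (sigmoid(v + eps) - sigmoid(v - eps)) / (2 * eps)
analytic = sigmoid(v) * (1 - sigmoid(v))
```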
The right expression is simply

    ∂v_2/∂w_2 |_{z_1 = z̃_1} = z̃_1

We can do the same for ∂/∂b_2; we would get the same calculation, with the coefficient 1 instead of z̃_1. Indeed, we can think of it as another input of constant 1 feeding into each of the nodes.

Derivative w.r.t. w_1, b_1

What about the derivative with respect to w_1 (and b_1)? Following the chain rule (and the drawing) again,

    ∂ℓ(y, z_2(z_0))/∂w_1 = ∂ℓ(y, z_2)/∂z_2 |_{z_2 = z̃_2} · ∂z_2/∂v_2 |_{v_2 = ṽ_2} · ∂v_2/∂z_1 |_{z_1 = z̃_1} · ∂z_1/∂v_1 |_{v_1 = ṽ_1} · ∂v_1/∂w_1

The first two expressions are familiar to us, and in fact we have already calculated them! We define

    δ_2 = ∂ℓ(y, z_2(z_0))/∂z_2   and   δ_1 = ∂ℓ(y, z_2(z_0))/∂z_1,

both evaluated at the current point. Then we can write

    δ_1 = δ_2 · ∂z_2/∂v_2 |_{v_2 = ṽ_2} · ∂v_2/∂z_1 |_{z_1 = z̃_1} = δ_2 · z̃_2(1 − z̃_2) · w̃_2

The last two expressions can be calculated the same way:

    ∂z_1/∂v_1 |_{v_1 = ṽ_1} = σ(ṽ_1)(1 − σ(ṽ_1)) = z̃_1(1 − z̃_1)

and the right expression is simply

    ∂v_1/∂w_1 = z_0

Summary

How do we calculate the above efficiently? Note that we needed the evaluation of all the functions/variables at the current parameter values. We do this using a forward pass: going forward in the graph, we start with ṽ_1, use it to calculate z̃_1, then ṽ_2, z̃_2, and so forth. Then we calculate δ_2 as explained above, and use its value to calculate δ_1; this is the backward pass (or backpropagation). It is easy to extend this logic to any number of layers. Another way to think about it is as a dynamic programming or memoization algorithm: we reuse the same expressions over and over, so we calculate each of them only once, in the order needed.
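The forward and backward passes above can be sketched directly in code for the two-layer, one-node network. The function below implements exactly the formulas derived in this subsection; the variable names mirror the notation (delta2 for δ_2, etc.), and all parameter values are placeholders supplied by the caller.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def backprop(z0, y, w1, b1, w2, b2):
    """Return (dw1, db1, dw2, db2) for the single-sample log loss."""
    # Forward pass: evaluate v~_1, z~_1, v~_2, z~_2 at the current parameters.
    v1 = z0 * w1 + b1
    z1 = sigmoid(v1)
    v2 = z1 * w2 + b2
    z2 = sigmoid(v2)
    # Backward pass: delta_2 = dl/dz_2, reused to get delta_1 = dl/dz_1.
    delta2 = -y / z2 + (1 - y) / (1 - z2)
    dw2 = delta2 * z2 * (1 - z2) * z1     # coefficient z~_1
    db2 = delta2 * z2 * (1 - z2) * 1.0    # coefficient 1 (the constant input)
    delta1 = delta2 * z2 * (1 - z2) * w2
    dw1 = delta1 * z1 * (1 - z1) * z0
    db1 = delta1 * z1 * (1 - z1) * 1.0
    return dw1, db1, dw2, db2
```

Note that δ_2 and the factor z̃_2(1 − z̃_2) are each computed once and reused, which is the memoization point made above.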
8.1.2 Many layers, many nodes

We now generalize this to several layers of several nodes. We do this using matrix calculus; see also the lesson scribe for a slightly different derivation. In the general case, each layer is now a vector. The parameters for the transitions between layers are matrices and vectors:

    … --> z_{L−2} --(W_{L−1}, b_{L−1})--> z_{L−1} --(w_L, b_L)--> z_L = ŷ

With the notation v_{t+1} = W_{t+1} z_t + b_{t+1}, we have:

    … --> z_{L−2} --> v_{L−1} --> z_{L−1} --> v_L --> z_L --> ℓ(y, z_L)

We can now use matrix calculus to differentiate with ease. (It is nothing new; it is material we have all learned in multivariable calculus: Jacobians, etc.) We omit the points where the derivatives are evaluated, for clarity of presentation; the rationale is the same as before. Denote the i-th row of W_t by w_{t,i}. Then,

    ∂ℓ(y, z_L(z_0))/∂w_{t,i} = ∂ℓ(y, z_L)/∂z_L · ∂z_L/∂v_L · ∂v_L/∂z_{L−1} · ∂z_{L−1}/∂v_{L−1} · ∂v_{L−1}/∂z_{L−2} · … · ∂v_t/∂w_{t,i}

This gradient is a row vector. We already know the first two expressions, ∂ℓ(y, z_L)/∂z_L and ∂z_L/∂v_L; they are as before. The expression ∂v_L/∂z_{L−1} is a row vector: it is simply w_L. The expression ∂z_{L−1}/∂v_{L−1} is a matrix, the Jacobian of z_{L−1} as a function of v_{L−1}. It is simple to see that it is a diagonal matrix with σ′((ṽ_{L−1})_j) at diagonal entry (j, j). The matrix ∂v_{L−1}/∂z_{L−2} is simply W_{L−1}! So we continue multiplying these matrices until the final one, ∂v_t/∂w_{t,i}, which is also easy to calculate: only its i-th row is nonzero, and that row is z̃_{t−1}.

It is easy to verify that this gives exactly the same full algorithm as detailed in the lesson scribes. As before, we have a forward pass to calculate all the v's and z's, and a backward pass to calculate all the δ's, where δ_t = ∂ℓ(y, z_L(z_0))/∂z_t. From these, the gradients are calculated as described above and in the lesson.
8.2 Decision Trees

8.2.1 Terminology and Reminder

Assume a binary classification setting; for every training sample, let f be the binary label. We would like to decide, at each node, on the split, i.e., the predicate h to assign to the node. The local parameters are:

- q = Pr[f = 1], the fraction of 1s among the examples reaching the node;
- u = Pr[h = 0],¹ the fraction of samples with h = 0 among the samples reaching the node;
- p = Pr[f = 1 | h = 0], the fraction of 1s among the samples reaching the node and having h = 0;
- r = Pr[f = 1 | h = 1], the fraction of 1s among the samples reaching the node and having h = 1.

We have q = up + (1 − u)r. (See Figure 8.1.)

Figure 8.1: The split in a node

Recall the decision-tree algorithm from class: we use a strictly concave node index function v(·)² that assigns a value to a node as a function of the proportion of positively labeled examples in the node (q, in the terminology above). Now, by strict concavity of v(·), whenever p ≠ r we have

    v(q) > u·v(p) + (1 − u)·v(r)

and at a given node we seek a predicate that splits in a way that most reduces the right-hand side of the above inequality (the resulting node potential).

¹ We use h = 0 to indicate that the predicate is false, and h = 1 for the case it is true.
² An example of a split index is v(p) = −p log₂ p − (1 − p) log₂(1 − p), the binary entropy function. (In class we normalized by multiplying by a half, but this makes no difference.)
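The definitions above can be checked numerically with the binary entropy index. The values of u, p, r below are made up for illustration; the assertion that any non-trivial split (p ≠ r) lowers the node potential is exactly the strict-concavity inequality.

```python
import math

def entropy(p):
    """Binary entropy v(p) = -p log2 p - (1 - p) log2(1 - p):
    a strictly concave index, 0 at p = 0 or 1, maximal (= 1) at p = 1/2."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Illustrative split: fraction u of the node's samples has h = 0,
# with positive-label fractions p and r on the two sides.
u, p, r = 0.4, 0.25, 0.9
q = u * p + (1 - u) * r                      # parent node's positive fraction
potential = u * entropy(p) + (1 - u) * entropy(r)
# Concavity: entropy(q) > potential whenever p != r.
```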
8.2.2 Instability Example

We consider the sample two-feature binary-labeled data in Figure 8.2a.³ The root's optimal decision stump, h = [x_1 < 0.6], reduces the potential⁴ from the initial 1 (since the sample contains an equal number of positive and negative samples) to

    (10/16)·v(7/10) + (6/16)·v(1/6) ≈ 0.79

(a) Original sample set. (b) Slight change in one sample.
Figure 8.2: Example of data for decision tree instability. Triangles are positively labeled and circles are negatively labeled.

We continue performing splits and derive the decision tree of Figure 8.3. Now consider what happens if we slightly modify the location of a single point, as follows (see Figure 8.2b). On the modified data, the root split h = [x_1 < 0.6] still yields the same value, 0.79, but for the root split h = [x_2 < 0.32] we have

    (7/16)·v(1/7) + (9/16)·v(7/9) ≈ 0.68 < 0.79

This implies that the minor change alters the optimal predicate at the root, and might impact the entire tree.

³ Example from http://www.lsv.uni-saarland.de/pattern_sr_ws0607/psr_0607_Chap10.pdf, slide 30.
⁴ We use the entropy function throughout.
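The two potentials in this example can be reproduced directly from the sample counts stated above (10 vs. 6 samples for the x_1 split; 7 vs. 9 for the x_2 split after the perturbation):

```python
import math

def entropy(p):
    # Binary entropy, the node index v used throughout this example.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Root split [x1 < 0.6]: 10 of 16 samples on one side (7/10 positive),
# 6 on the other (1/6 positive).
pot_x1 = (10 / 16) * entropy(7 / 10) + (6 / 16) * entropy(1 / 6)

# After nudging one point, the split [x2 < 0.32] sends 7 samples to one
# side (1/7 positive) and 9 to the other (7/9 positive).
pot_x2 = (7 / 16) * entropy(1 / 7) + (9 / 16) * entropy(7 / 9)

# pot_x2 < pot_x1, so the optimal root predicate flips.
```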
Figure 8.3: The tree that is built