Introduction to Machine Learning                      Recitation 8, Fall Semester
Lecturer: Regev Schweiger                             Scribe: Regev Schweiger

8.1 Backpropagation

We will develop and review the backpropagation algorithm for neural networks. In order to have a concrete example, we will focus on the choice of the sigmoid activation (i.e., σ(z) = 1/(1 + e^{-z})) with the log loss function (i.e., ℓ(y, ŷ) = -y log ŷ - (1 - y) log(1 - ŷ)), but of course the derivation holds in general.

8.1.1 Two layers, one node

We will start from the simplest (interesting) example and do it very explicitly. Suppose we are given a sample x_1, ..., x_m with labels y_1, ..., y_m ∈ {-1, 1}. The simplest network is two layers, with one node in each:

    [Computation graph: z_0 --(w_1, b_1)--> z_1 --(w_2, b_2)--> ŷ = z_2]

The function we want to minimize is the loss over all examples:

    f = Σ_{i=1}^{m} ℓ(y_i, z_2(x_i))

We take a gradient descent approach. At each iteration, we have our current guesses for the best parameter values, w̃_1, b̃_1, w̃_2, b̃_2, and we wish to calculate the gradient at that point, i.e.,

    ∂f/∂w_1, ∂f/∂b_1, ∂f/∂w_2, ∂f/∂b_2,

all evaluated at w_1 = w̃_1, b_1 = b̃_1, w_2 = w̃_2, b_2 = b̃_2. First, since differentiation is linear, we can focus on the loss over a single sample, setting y = y_i, z_0 = x_i; e.g.,

    ∂f/∂w_1 = Σ_{i=1}^{m} ∂/∂w_1 ℓ(y, z_2(z_0)).
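As a concrete companion to the definitions above, a minimal numerical sketch in Python/NumPy may help. The function names below (sigmoid, log_loss, predict_two_layer, total_loss) are our own, not the course's, and the log loss as written assumes labels encoded as 0/1; this is an illustration, not a reference implementation.

    import numpy as np

    def sigmoid(z):
        # sigma(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))

    def log_loss(y, y_hat):
        # l(y, y_hat) = -y*log(y_hat) - (1 - y)*log(1 - y_hat), with y in {0, 1}
        return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    def predict_two_layer(x, w1, b1, w2, b2):
        # z0 = x, z1 = sigma(z0*w1 + b1), y_hat = z2 = sigma(z1*w2 + b2)
        z1 = sigmoid(x * w1 + b1)
        z2 = sigmoid(z1 * w2 + b2)
        return z2

    def total_loss(xs, ys, w1, b1, w2, b2):
        # f = sum over all examples of l(y_i, z2(x_i))
        return sum(log_loss(y, predict_two_layer(x, w1, b1, w2, b2))
                   for x, y in zip(xs, ys))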

We proceed as if there is a single sample. To continue, we will benefit from introducing convenient notation (see also the lesson scribe) and drawing it as well. Define v_1 = z_0 w_1 + b_1 and v_2 = z_1 w_2 + b_2, the linear combinations used in the calculations of z_1 and z_2 (respectively). Namely, z_1 = σ(v_1), z_2 = σ(v_2). We add these to the graph above, which now shows the explicit calculation we make. We also show the loss function, since it is appended to the entire calculation.

    [Computation graph: z_0 --(w_1, b_1)--> v_1 --> z_1 --(w_2, b_2)--> v_2 --> z_2 --> ℓ(y, z_2), with the label y feeding into the loss]

Derivative w.r.t. w_2, b_2

Let us first recall the chain rule. Suppose we have three variables/functions x, y, z, with z = z(y) and y = y(x), thus z = z(y(x)). Say we want to evaluate the derivative of z at a point x̃. Then, the chain rule says:

    dz/dx |_{x=x̃} = dz/dy |_{y=y(x̃)} · dy/dx |_{x=x̃}

We want to calculate ∂/∂w_2 ℓ(y, z_2(z_0)) |_{w_1=w̃_1, b_1=b̃_1, ...}; the dependency on the current point is hidden in z_2. We will denote by ṽ_i the evaluation of v_i at the specific point we are at, e.g., ṽ_1 = z_0 w̃_1 + b̃_1. Similarly, z̃_i = σ(ṽ_i). Following the chain rule and the drawing above, we can see that

    ∂/∂w_2 ℓ(y, z_2(z_0)) |_{w_1=w̃_1, b_1=b̃_1, ...} = ∂ℓ(y, z_2)/∂z_2 |_{z_2=z̃_2} · ∂z_2/∂v_2 |_{v_2=ṽ_2} · ∂v_2/∂w_2 |_{z_1=z̃_1, ...}

The left expression is the derivative of the loss function with respect to the estimate. In our case, it is the log loss, evaluated at the current point, that is:

    ∂ℓ(y, z_2)/∂z_2 |_{z_2=z̃_2} = -y/z̃_2 + (1 - y)/(1 - z̃_2)

The middle expression can be solved by the useful identity (exercise) σ'(v) = σ(v)(1 - σ(v)), to give:

    ∂z_2/∂v_2 |_{v_2=ṽ_2} = σ(ṽ_2)(1 - σ(ṽ_2)) = z̃_2(1 - z̃_2)
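To sanity-check the identity σ'(v) = σ(v)(1 - σ(v)) and the chain-rule product for ∂ℓ/∂w_2, one can compare against a finite-difference approximation. The sketch below is ours; the concrete numbers (y, z_0 and the parameter values) are arbitrary choices for illustration, with y encoded as 0/1.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Arbitrary current point (illustrative values only).
    y, z0 = 1.0, 0.7
    w1, b1, w2, b2 = 0.5, -0.1, 1.2, 0.3

    v1 = z0 * w1 + b1
    z1 = sigmoid(v1)
    v2 = z1 * w2 + b2
    z2 = sigmoid(v2)

    # Chain rule: dl/dw2 = (dl/dz2) * (dz2/dv2) * (dv2/dw2)
    dl_dz2 = -y / z2 + (1 - y) / (1 - z2)   # log-loss derivative at the current z2
    dz2_dv2 = z2 * (1 - z2)                 # sigma'(v2) = sigma(v2)(1 - sigma(v2))
    dv2_dw2 = z1
    analytic = dl_dz2 * dz2_dv2 * dv2_dw2

    # Finite-difference approximation of the same derivative.
    def loss_of_w2(w):
        z2_ = sigmoid(z1 * w + b2)
        return -y * np.log(z2_) - (1 - y) * np.log(1 - z2_)

    eps = 1e-6
    numeric = (loss_of_w2(w2 + eps) - loss_of_w2(w2 - eps)) / (2 * eps)
    print(analytic, numeric)  # the two values should agree to several decimal places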

The right expression is simply

    ∂v_2/∂w_2 |_{z_1=z̃_1, ...} = z̃_1

We can do the same for ∂/∂b_2; we would get the same calculation, with the coefficient 1 instead of z̃_1. Indeed, we can think of it as another input of constant 1 being fed into each one of the nodes.

Derivative w.r.t. w_1, b_1

What about the derivative with respect to w_1 (and b_1)? Following the chain rule (and the drawing) again,

    ∂/∂w_1 ℓ(y, z_2(z_0)) |_{w_1=w̃_1, b_1=b̃_1, ...} = ∂ℓ(y, z_2)/∂z_2 |_{z_2=z̃_2} · ∂z_2/∂v_2 |_{v_2=ṽ_2} · ∂v_2/∂z_1 |_{z_1=z̃_1, ...} · ∂z_1/∂v_1 |_{v_1=ṽ_1} · ∂v_1/∂w_1 |_{w_1=w̃_1, b_1=b̃_1, ...}

The first two expressions are familiar to us, and in fact we have already calculated them! We define

    δ_2 = ∂/∂z_2 ℓ(y, z_2(z_0)) |_{w_1=w̃_1, b_1=b̃_1, ...}   and   δ_1 = ∂/∂z_1 ℓ(y, z_2(z_0)) |_{w_1=w̃_1, b_1=b̃_1, ...}

Then, we can write

    δ_1 = δ_2 · ∂z_2/∂v_2 |_{v_2=ṽ_2} · ∂v_2/∂z_1 |_{z_1=z̃_1, ...}

The last two expressions can be calculated the same way:

    ∂z_1/∂v_1 |_{v_1=ṽ_1} = σ(ṽ_1)(1 - σ(ṽ_1)) = z̃_1(1 - z̃_1)

The right expression is simply

    ∂v_1/∂w_1 |_{w_1=w̃_1, b_1=b̃_1, ...} = z_0

Summary

How do we calculate the above efficiently? Note that we needed the evaluation of all the functions/variables at the current values of the parameters. We do this using a forward pass: going forward in the graph, starting with ṽ_1, using it to calculate z̃_1, then z̃_2, and so forth. Then, we calculate δ_2 as explained above, and use its value to calculate δ_1; this is the backward pass (or backpropagation). It is easy to extend this logic to any number of layers. Another way to think about it is as a dynamic programming or memoization algorithm: we reuse the same expressions over and over, so we calculate them only once, in the order needed to calculate them.
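The forward and backward passes just described fit in a few lines. The sketch below is our own illustration (names are hypothetical), again assuming 0/1-encoded labels with the log loss:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_two_layer(y, z0, w1, b1, w2, b2):
        # Forward pass: evaluate v1, z1, v2, z2 at the current parameter values.
        v1 = z0 * w1 + b1
        z1 = sigmoid(v1)
        v2 = z1 * w2 + b2
        z2 = sigmoid(v2)

        # Backward pass: delta_2 = dl/dz2, then delta_1 = delta_2 * dz2/dv2 * dv2/dz1.
        delta2 = -y / z2 + (1 - y) / (1 - z2)
        delta1 = delta2 * z2 * (1 - z2) * w2

        # Gradients: each delta times the local sigmoid derivative times dv/dparameter.
        grad_w2 = delta2 * z2 * (1 - z2) * z1
        grad_b2 = delta2 * z2 * (1 - z2) * 1.0
        grad_w1 = delta1 * z1 * (1 - z1) * z0
        grad_b1 = delta1 * z1 * (1 - z1) * 1.0
        return grad_w1, grad_b1, grad_w2, grad_b2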

8.1.2 Many layers, many nodes

We now generalize this to several layers with several nodes each. We do this using matrix calculus; see also the lesson scribe for a slightly different derivation. In the general case, each layer will now be a vector. The parameters for the transition between layers are matrices and vectors:

    [Computation graph: ... --> z_{L-2} --(W_{L-1}, b_{L-1})--> z_{L-1} --(w_L, b_L)--> ŷ = z_L]

With the notation v_{t+1} = W_{t+1} z_t + b_{t+1}, we have:

    [Computation graph: ... --> z_{L-2} --> v_{L-1} --> z_{L-1} --> v_L --> z_L --> ℓ(y, z_L), with the label y feeding into the loss]

We can now use matrix calculus to differentiate with ease. (It is not new; it is material we have all learned in multivariable calculus: Jacobians, etc.) We omit the points where the derivatives are evaluated, for clarity of presentation; the rationale is the same as before. Denote the i-th row of W_t by w_{t,i}. Then,

    ∂/∂w_{t,i} ℓ(y, z_L(z_0)) = ∂ℓ(y, z_L)/∂z_L · ∂z_L/∂v_L · ∂v_L/∂z_{L-1} · ∂z_{L-1}/∂v_{L-1} · ∂v_{L-1}/∂z_{L-2} · ... · ∂v_t/∂w_{t,i}

This gradient is a row vector. We already know the first two expressions, ∂ℓ(y, z_L)/∂z_L and ∂z_L/∂v_L; they are as before. The expression ∂v_L/∂z_{L-1} is a row vector: it is simply w_L. The expression ∂z_{L-1}/∂v_{L-1} is a matrix, the Jacobian of z_{L-1} as a function of v_{L-1}. It is simple to see that it is a diagonal matrix with σ'((ṽ_{L-1})_j) in entry (j, j). The matrix ∂v_{L-1}/∂z_{L-2} is simply W_{L-1}! So we continue multiplying these matrices, until the final matrix ∂v_t/∂w_{t,i}, which is also easy to calculate: only its i-th row is nonzero, and it equals z_{t-1}. It is easy to verify that this gives exactly the same full algorithm as detailed in the lesson scribes. As before, we have a forward pass to calculate all the v's, and a backward pass to calculate all the δ's, where δ_t = ∂/∂z_t ℓ(y, z_L(z_0)). From this, the calculation of the gradients is as described above and in the lesson.
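In the same spirit, a vectorized sketch of the matrix form (our own code and naming, assuming sigmoid activations in every layer, a single output unit, and 0/1-encoded labels) might look as follows:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop(z0, y, Ws, bs):
        # Forward pass: v_{t+1} = W_{t+1} z_t + b_{t+1}, z_{t+1} = sigma(v_{t+1}).
        zs = [np.asarray(z0, dtype=float)]
        for W, b in zip(Ws, bs):
            zs.append(sigmoid(W @ zs[-1] + b))
        zL = zs[-1][0]   # single output unit

        # Backward pass: start from delta_L = dl/dz_L (a row vector of length 1).
        delta = np.array([-y / zL + (1 - y) / (1 - zL)])

        grads_W, grads_b = [], []
        for t in range(len(Ws) - 1, -1, -1):
            z_out, z_in = zs[t + 1], zs[t]
            # The Jacobian dz/dv is diagonal with sigma'(v) = z(1 - z) on the
            # diagonal, so multiplying by it is an elementwise product.
            dv = delta * z_out * (1 - z_out)
            grads_W.insert(0, np.outer(dv, z_in))  # gradient w.r.t. this layer's W
            grads_b.insert(0, dv)                  # gradient w.r.t. this layer's b
            delta = dv @ Ws[t]                     # dl/dz of the previous layer
        return grads_W, grads_b

For instance, calling backprop(np.array([x]), y, [np.array([[w1]]), np.array([[w2]])], [np.array([b1]), np.array([b2])]) reproduces the two-layer gradients computed by hand above.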

8.2 Decision Trees

8.2.1 Terminology and Reminder

Assume a binary classification setting (for every training sample, let f be the binary label). We would like to decide, at each node, on the split, i.e., the predicate h to assign to the node. The local parameters are: q = Pr[f = 1], which is the fraction of 1s among the examples reaching the node; u = Pr[h = 0]¹, which is the fraction of samples for which h = 0 out of the samples reaching the node; p = Pr[f = 1 | h = 0], which is the fraction of 1s among the samples reaching the node and having h = 0; and r = Pr[f = 1 | h = 1], which is the fraction of 1s among the samples reaching the node and having h = 1. We have that q = up + (1 - u)r. (See Figure 8.1.)

    [Figure 8.1: The split in a node]

Recall the decision-tree algorithm from class: we use a strictly concave node index function v(·)² that associates a value with a node as a function of the proportion of positively labeled examples in the node (q, using our above terminology). Now, by strict concavity of v(·) we have

    v(q) > u·v(p) + (1 - u)·v(r)

and at a given node we seek a predicate that splits in a way that most reduces the right-hand side of the above inequality (the resulting node potential).

¹ We use h = 0 to indicate that the predicate is false and h = 1 for the case it is true.
² An example of a split index is v(p) = -p log_2 p - (1 - p) log_2(1 - p), which is the binary entropy function. (In class we normalized by multiplying by a half, but this will not make a difference.)
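To make the quantities q, u, p, r and the node potential concrete, here is a small sketch; the helper names are our own, and the index used is the entropy of footnote 2 (without the factor of a half). It evaluates the right-hand side u·v(p) + (1 - u)·v(r) for a candidate predicate:

    import numpy as np

    def entropy(p):
        # Binary entropy v(p) = -p log2(p) - (1 - p) log2(1 - p), with v(0) = v(1) = 0.
        if p == 0.0 or p == 1.0:
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def node_potential(f, h):
        # f: 0/1 labels of the samples reaching the node; h: 0/1 predicate values.
        f = np.asarray(f)
        h = np.asarray(h)
        u = np.mean(h == 0)                          # u = Pr[h = 0]
        p = f[h == 0].mean() if u > 0 else 0.0       # p = Pr[f = 1 | h = 0]
        r = f[h == 1].mean() if u < 1 else 0.0       # r = Pr[f = 1 | h = 1]
        return u * entropy(p) + (1 - u) * entropy(r) # weighted index after the split

Scanning candidate predicates and picking the one with the smallest node_potential is exactly the greedy choice described above.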

8.2.2 Instability Example

We consider the sample of 2-feature binary labeled data in Figure 8.2a³. The root's optimal decision stump h = (x_1 < 0.6) reduces the potential⁴ from the initial 1 (since the sample contains an equal number of positive and negative samples) to

    (10/16)·v(7/10) + (6/16)·v(1/6) ≈ 0.79

    [Figure 8.2: Example of data for decision tree instability. Triangles are positively labeled and circles are negatively labeled. (a) Original sample set. (b) Slight change in one sample.]

We continue performing the splits and derive the decision tree of Figure 8.3. We can now consider what will happen if we slightly modify the location of a single point as follows (see Figure 8.2b). On the modified data, the root split h = (x_1 < 0.6) still results in the same value, 0.79, but for the root split h = (x_2 < 0.32) we have

    (7/16)·v(1/7) + (9/16)·v(7/9) ≈ 0.68 < 0.79

This implies that the minor change will change the optimal predicate at the root and might impact the entire tree.

³ Example from http://www.lsv.uni-saarland.de/pattern_sr_ws0607/psr_0607_Chap10.pdf, slide 30.
⁴ We use the entropy function throughout.
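The two potentials quoted above can be reproduced directly with the entropy index (up to rounding in the quoted figures); a quick check:

    import numpy as np

    def entropy(p):
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    # Root potential of h = (x1 < 0.6) on the original 16-point sample:
    print(10/16 * entropy(7/10) + 6/16 * entropy(1/6))   # about 0.79

    # Root potential of h = (x2 < 0.32) on the modified sample:
    print(7/16 * entropy(1/7) + 9/16 * entropy(7/9))     # about 0.69, still below 0.79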

    [Figure 8.3: The tree that is built]