Administrative Issues
CS5242 mirror site: https://web.bii.a-star.edu.sg/~leehk/cs5424.html
Assignments (10% * 4 = 40%)
Each assignment has multiple sub-tasks, including coding tasks.
You need to submit the answers, code, and execution results.
Assessment components:
Assignments (10% * 4 = 40%)
Quiz (20%)
Final Project (40%)
Final Projects (40%)
Projects will be held on Kaggle-in-class; register using your NUS email (https://inclass.kaggle.com/).
The project webpages will be announced once groups are finalized.
No data disclosure. No additional data.
Each group: at most 2 students.
Each group will be randomly assigned to either a computer vision project or a natural language processing project.
Assessment is based on the report (15%) and ranking (25%).
Top-ranked groups of each project will be invited to present at week 13.
Will announce on IVLE once Kaggle is ready.
Final Project (40%)
Report template (15%):
- Problem definition
- Highlights (new algorithms, insights from the experiments)
- Dataset pre-processing description
- Training and testing procedure
- Experimental study (clarity, model understanding, and highlights are important for the assessment)
Ranking -> score (25%):
- Rank top 10%: score ~25 (max score)
- Rank top 20%: score ~21
- Rank top 40%: score ~17
- Rank top 70%: score ~14
- Rank others: score ~13
- Prediction scores less than 10% better than random guess: score ~5
Final rank-to-grade distribution may be adjusted depending on overall class performance.
Final projects timeline summary
27 August 18:00: Register your project group @ IVLE (2 or fewer students per group)
04 November 18:00: Results submission deadline
05 November 18:00: Report deadline
11 November 18:00: Selection and notification of selected projects for presentation
12 November 18:00: Selected project groups send slides to cs5242@comp.nus.edu.sg
12 November 18:00: Make appointment to meet Hwee Kuan for a presentation rehearsal
16 November 18:30: Project presentations by students
GPU Machine/Cluster
NSCC (National Super-Computing Center): free for NUS students; 128 GPU nodes, each with an NVIDIA Tesla K40 (12 GB). Follow the manual and test program in IVLE workbin/misc/gpu to familiarize yourself with it.
Google Cloud Platform: 300 USD credit for new registrations; NVIDIA Tesla K80 (2x12 GB) is available in the Taiwan region.
Amazon EC2 (g2.8xlarge): 100 USD credit for students; NVIDIA GRID GPU, 1536 CUDA cores, 4 GB memory; available in the Singapore region.
Tips:
- STOP/TERMINATE the instance immediately after your program terminates.
- Check the usage status frequently to avoid high outstanding bills.
- Use Amazon or Google cloud platforms for debugging, and use NSCC for actual training and tuning.
Lecture 1: Introduction
Lecture 2: Neural Network Basics
Lecture 3: Training the Neural Network
Lecture 4: Convolutional Neural Network
Lecture 5: Convolutional Neural Network
Lecture 6: Convolutional Neural Network
Lecture 7: Recurrent Neural Network
Lecture 8: Recurrent Neural Network
Lecture 9: Recurrent Neural Network
Lecture 10: Advanced Topic: Generative Adversarial Network
Lecture 11: Advanced Topic: One-shot Learning
Lecture 12: Advanced Topic: Distributed Training
Lecture 13: Project Demonstrations
Forward Propagation
The network and information flow
Feed data into the first layer; the final layer presents the output of the network.
[Figure: each intermediate layer acts as a receiver for the previous layer and a transmitter to the next]
Notation
Let $x \in \mathbb{R}^d$ be the input.
Let $y \in \mathbb{R}$ or $y \in \mathbb{N}$ be the label.
Let $o \in \mathbb{R}$ be the output of the neural network.
[Figure: the surface of the output $o$ over the $(x_1, x_2)$ plane]
$x = (x_1, x_2) \in \mathbb{R}^2$
$o = \sigma(w_1 x_1 + w_2 x_2 + b)$
[Figures: a point $A = (x_1, x_2)$ in the input plane is lifted to $A = (x_1, x_2, o)$ on the surface of $o$ over the $(x_1, x_2)$ plane]
[Figure: the sigmoid surface $o$ over the $(x_1, x_2)$ plane and its projection onto the plane]
Level sets form straight lines.
Concept of level sets
Next to simplest
$x = (x_1, x_2) \in \mathbb{R}^2$
$z = w_1 x_1 + w_2 x_2 + b$
$o = \sigma(z) = \sigma(w_1 x_1 + w_2 x_2 + b)$
[Figure: lines of constant $z$ in the $(x_1, x_2)$ plane and the resulting sigmoid surface $o$]
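To make this concrete, here is a minimal NumPy sketch of a single sigmoid unit (not from the slides; the weight and bias values are arbitrary assumptions). Note that $z$, and therefore $o = \sigma(z)$, is constant along straight lines in the $(x_1, x_2)$ plane:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative parameters (assumed, not from the lecture).
w1, w2, b = 2.0, -1.0, 0.5

def neuron(x1, x2):
    z = w1 * x1 + w2 * x2 + b   # z is constant along straight lines in (x1, x2)
    return sigmoid(z)           # o = sigma(z): its level sets are those same lines

# Evaluate the unit on a grid to see the sigmoid "step surface".
xs = np.linspace(-3, 3, 7)
x1g, x2g = np.meshgrid(xs, xs)
print(np.round(neuron(x1g, x2g), 2))
```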
[Figure: three sigmoid step surfaces over the $(x_1, x_2)$ plane]
These three surfaces add to form a triangular decision boundary!
Adding functions
[Figure: function 1, $f(x)$, plus function 2, $g(x)$, gives $(f+g)(x)$; at a point $A$, heights $a$ and $b$ add to $a + b$, and two steps of height $h$ add to $2h$]
[Figure: a network with hidden units $h_1, h_2, h_3$ feeding the output $o$, and the corresponding step surfaces over the $(x_1, x_2)$ plane]
The output adds the hidden-unit surfaces:
$w_1 h_1(x_1, x_2) + w_2 h_2(x_1, x_2) + w_3 h_3(x_1, x_2) + b$
If you add up enough step surfaces, can you form any function?
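As an illustration of the idea (a sketch, not the lecture's demo), the following NumPy snippet sums many steep sigmoid steps to approximate $\sin(x)$ in 1-D; the step locations, steepness, and target function are all assumptions chosen for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Approximate f(x) = sin(x) on [0, 2*pi] by a sum of steep sigmoid "steps".
xs = np.linspace(0, 2 * np.pi, 200)
target = np.sin(xs)

# Place one steep step at each of several locations; each step raises the
# running sum by the local increment of the target function.
centers = np.linspace(0, 2 * np.pi, 40)
increments = np.diff(np.sin(centers), prepend=np.sin(centers[0]))

approx = np.zeros_like(xs)
for c, a in zip(centers, increments):
    approx += a * sigmoid(20.0 * (xs - c))  # steepness 20 makes a near-step

print("max abs error:", np.max(np.abs(approx - target)))
```

Adding more, steeper steps drives the error down; this is the intuition behind the question above.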
Neural network finger activities
Activity one
Take out a piece of paper and draw the patterns as shown.
[Figure: a row of X's above a row of O's]
Task: cut (tear) with one straight line to completely separate the X's and O's.
Activity two
[Figure: a cluster of O's surrounded by X's]
Cut (tear) with one straight line!
Activity three
[Figure: a cluster of O's enclosed by X's on all sides]
Cut (tear) with one straight line!
Activity four
[Figure: X's and O's interleaved in several clusters]
Cut (tear) with one straight line!
Manifold view of neural network
Feed data into the first layer; the final layer presents the output of the network.
[Figure: data, hidden layer, output; https://www.pinterest.com/pin/414260865705737484/]
Function view of neural network
[Figure: hidden-unit surfaces $h_1(x_1, x_2)$, $h_2(x_1, x_2)$, $h_3(x_1, x_2)$ are summed to give the output surface over the $(x_1, x_2)$ plane]
Follow the data point
[Figure: the network $x_1, x_2 \to h^{(1)}, h^{(2)}$ with weights $w^{(1)}_1, w^{(1)}_2, b^{(1)}$ and $w^{(2)}_1, w^{(2)}_2, b^{(2)}$; a point $A = (x_1, x_2)$ is lifted to $(x_1, x_2, h_1)$ and $(x_1, x_2, h_2)$, giving the mapping $(x^A_1, x^A_2) \mapsto (h^A_1, h^A_2)$]
[Figure: the same network; the point $A$ in the $(x_1, x_2)$ plane is sent to a point in the $(h_1, h_2)$ plane: $(x^A_1, x^A_2) \mapsto (h^A_1, h^A_2)$]
[Figure: the same mapping applied to a neighbourhood of $A$]
Neighbourhood relationships are conserved.
Continuous mapping
[Figure: the $(x_1, x_2)$ grid is deformed continuously into the $(h_1, h_2)$ plane under $(x^A_1, x^A_2) \mapsto (h^A_1, h^A_2)$]
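A small sketch of this mapping, under assumed weights (the values below are not the lecture's): each input point $(x_1, x_2)$ is sent to $(h_1, h_2)$, and nearby inputs land nearby because the map is continuous:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed illustrative weights for two hidden units.
W = np.array([[ 2.0, -1.0],    # w^(1)_1, w^(1)_2
              [ 1.0,  1.5]])   # w^(2)_1, w^(2)_2
b = np.array([0.5, -0.5])      # b^(1), b^(2)

def hidden_map(points):
    """Map points (x1, x2) to (h1, h2): the manifold view of one layer."""
    return sigmoid(points @ W.T + b)

# A small grid of input points; neighbourhood relationships are conserved
# after mapping, since the map is continuous.
grid = np.array([[x1, x2] for x1 in (-1, 0, 1) for x2 in (-1, 0, 1)], float)
print(hidden_map(grid))
```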
[Figure: with three hidden units, $(x^A_1, x^A_2) \mapsto (h^A_1, h^A_2, h^A_3)$; the 2-D input plane becomes a surface in the 3-D $(h_1, h_2, h_3)$ space]
[Figure: the same three-hidden-unit network and the input data in the $(x_1, x_2)$ plane]
[Figure: the full 2-2-1 network from $(x_1, x_2)$ through $h^{(1)}, h^{(2)}$ to the output surface $o$]
$h^{(1)} = \sigma(w^{(1)}_1 x_1 + w^{(1)}_2 x_2 + b^{(1)})$
$h^{(2)} = \sigma(w^{(2)}_1 x_1 + w^{(2)}_2 x_2 + b^{(2)})$
$o = \sigma(w^{(3)}_1 h^{(1)} + w^{(3)}_2 h^{(2)} + b^{(3)})$
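A minimal forward-propagation sketch of this 2-2-1 network; all parameter values below are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward propagation for the 2-2-1 network above (assumed parameters).
w1 = np.array([ 2.0, -1.0]); b1 = 0.5     # feeds h(1)
w2 = np.array([-1.5,  1.0]); b2 = -0.5    # feeds h(2)
w3 = np.array([ 1.0,  1.0]); b3 = -1.0    # feeds o

def forward(x):
    h1 = sigmoid(w1 @ x + b1)                   # h(1) = sigma(w(1).x + b(1))
    h2 = sigmoid(w2 @ x + b2)                   # h(2) = sigma(w(2).x + b(2))
    return sigmoid(w3 @ np.array([h1, h2]) + b3)  # o = sigma(w(3).h + b(3))

print(forward(np.array([0.2, -0.7])))
```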
Some real-life examples, courtesy of our TA Connie Kou. Source code will be available online.
The Angle Data
The XOR Data
The Ring Data
The Angle Data - ReLU
[Animation: a sequence of frames showing a ReLU network fitting the Angle data]
The Angle Data - Sigmoid
[Animation: a sequence of frames showing a sigmoid network with 3 hidden nodes fitting the Angle data]
The XOR Data - ReLU
[Animation: a sequence of frames showing a ReLU network fitting the XOR data]
The XOR Data - Sigmoid
[Animation: a sequence of frames showing a sigmoid network fitting the XOR data]
The Ring Data - Sigmoid
How is the manifold being folded?
[Animation: a sequence of frames showing a sigmoid network fitting the Ring data]
Spiral
How about this network? Fully connected.
Feed backward
A general objective
Tweak the output of the neural network, $o$, to be as similar to the label $y$ as possible.
How to tweak? Adjust the weights.
Given an example $(x, y)$
[Figure: $x$ fed through the network produces output $o$]
If $o \approx y$, the neural network behaves well for this example.
After the first example, give another example $(x', ?)$
[Figure: $x'$ fed through the network produces output $o'$]
Now we say that for a well-behaved neural network, $o' = y'$.
How good is the trained network?
[Figure: $x'$ fed through the network produces output $o'$]
Square error
Given data points $(x_i, y_i),\ i = 1, \dots, n$:
$l(y_i, o_i) = \| y_i - o_i \|^2$
$C = \frac{1}{n} \sum_i l(y_i, o_i)$
[Figure: inputs $x_1, x_2, \dots, x_n$ fed through the network give output/label pairs $(o_1, y_1), (o_2, y_2), \dots, (o_n, y_n)$]
Adjust the weights until $C = 0$ (or close to zero).
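A short sketch of this cost in NumPy; the outputs and labels below are made up just to exercise the function:

```python
import numpy as np

# Mean squared-error cost over a data set, as defined above.
def squared_error_cost(outputs, labels):
    """C = (1/n) * sum_i || y_i - o_i ||^2"""
    diffs = labels - outputs
    return np.mean(np.sum(diffs ** 2, axis=-1))

# Hypothetical outputs/labels.
o = np.array([[0.9], [0.2], [0.7]])
y = np.array([[1.0], [0.0], [1.0]])
print(squared_error_cost(o, y))  # small when the network behaves well
```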
Cross entropy cost function
Two-class example: $y_i \in \{0, 1\}$
$C = -\frac{1}{n} \sum_i \left( y_i \log[o(x_i)] + (1 - y_i) \log[1 - o(x_i)] \right)$
Three-class example: $y_i \in \{(1,0,0), (0,1,0), (0,0,1)\}$
With output values $v_1, v_2, v_3$ and $m = \exp(v_1) + \exp(v_2) + \exp(v_3)$:
$o_i = \left( \frac{\exp(v_1)}{m}, \frac{\exp(v_2)}{m}, \frac{\exp(v_3)}{m} \right)$
$l(y_i, o_i) = -y_i \cdot \log(o_i)$
$C = \frac{1}{n} \sum_i l(y_i, o_i)$
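A sketch of the softmax and cross-entropy computation above; the output values $v$ and the label are hypothetical:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract max for numerical stability
    return e / np.sum(e)

def cross_entropy(y, o):
    """l(y, o) = -y . log(o) for a one-hot label y."""
    return -np.sum(y * np.log(o))

# Three-class example with hypothetical output values v1, v2, v3.
v = np.array([2.0, 0.5, -1.0])
o = softmax(v)                   # (exp(v1)/m, exp(v2)/m, exp(v3)/m)
y = np.array([1.0, 0.0, 0.0])    # true class is class 1
print(o, cross_entropy(y, o))
```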
How to adjust the weights?
Conceptually, this would work: try all possible weight combinations, calculate the cost for each, then pick the combination with the lowest cost.
Practically, there are far too many weight combinations to try.
Gradient descent
[Figure: a chain network $x \to v_1 \to v_2 \to v_3$ with weights $w_1, w_2, w_3$]
$v_1 = \sigma(z_1) = \sigma(w_1 x + b_1)$
$v_2 = \sigma(z_2) = \sigma(w_2 v_1 + b_2)$
$v_3 = \sigma(z_3) = \sigma(w_3 v_2 + b_3)$
At a minimum of the cost, $\frac{\partial C}{\partial w_j} = 0$ and $\frac{\partial C}{\partial b_j} = 0$.
$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_i \frac{\partial (y_i - o_i)^2}{\partial w_j} = -\frac{2}{n} \sum_i (y_i - o_i) \frac{\partial o_i}{\partial w_j}$
We just need $\frac{\partial o_i}{\partial w_j}$.
Update rule (with learning rate $\eta$): $w_j(t+1) = w_j(t) - \eta \frac{\partial C}{\partial w_j}$
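A minimal sketch of the update rule $w_j(t+1) = w_j(t) - \eta \, \partial C / \partial w_j$, using a numerical gradient and a toy quadratic cost (both assumptions, for illustration only):

```python
import numpy as np

# One-dimensional-per-weight numerical gradient, so the sketch is self-contained.
def numerical_grad(C, w, eps=1e-6):
    g = np.zeros_like(w)
    for j in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        g[j] = (C(w_plus) - C(w_minus)) / (2 * eps)
    return g

C = lambda w: np.sum((w - 1.0) ** 2)   # toy cost with minimum at w = (1, 1)
w = np.array([3.0, -2.0])
eta = 0.1                              # assumed learning rate
for _ in range(100):
    w = w - eta * numerical_grad(C, w)  # w(t+1) = w(t) - eta * dC/dw
print(w)  # converges toward [1, 1]
```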
[Figure: the chain network $x \to v_1 \to v_2 \to v_3 = o$ with weights $w_1, w_2, w_3$]
$v_1 = \sigma(z_1) = \sigma(w_1 x + b_1)$
$v_2 = \sigma(z_2) = \sigma(w_2 v_1 + b_2)$
$v_3 = \sigma(z_3) = \sigma(w_3 v_2 + b_3)$
$\frac{\partial v_3}{\partial w_3} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial w_3} = \sigma'(z_3) v_2$
$\frac{\partial v_3}{\partial w_2} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial w_2} = \sigma'(z_3) w_3 \frac{\partial v_2}{\partial w_2} = \sigma'(z_3) w_3 \sigma'(z_2) v_1$
$\frac{\partial v_3}{\partial w_1} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial w_1} = \sigma'(z_3) w_3 \frac{\partial v_2}{\partial w_1} = \sigma'(z_3) w_3 \sigma'(z_2) w_2 \frac{\partial v_1}{\partial w_1} = \sigma'(z_3) w_3 \sigma'(z_2) w_2 \sigma'(z_1) x$
Problem: the expressions get longer and longer; re-deriving each gradient from scratch leads to large computational time for a deep network.
Compute and store strategy
[Figure: the chain network $x \to v_1 \to v_2 \to v_3 = o$]
Work backwards, storing $\frac{\partial v_3}{\partial z_j}$ at each node:
$\frac{\partial v_3}{\partial z_3} = \sigma'(z_3)$
$\frac{\partial v_3}{\partial z_2} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial z_2} = \frac{\partial v_3}{\partial z_3} w_3 \sigma'(z_2)$
$\frac{\partial v_3}{\partial z_1} = \frac{\partial v_3}{\partial z_2} \frac{\partial z_2}{\partial z_1} = \frac{\partial v_3}{\partial z_2} w_2 \sigma'(z_1)$
Compute and store strategy
With $\frac{\partial v_3}{\partial z_1}, \frac{\partial v_3}{\partial z_2}, \frac{\partial v_3}{\partial z_3}$ stored, the weight gradients follow cheaply:
$\frac{\partial v_3}{\partial w_3} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial w_3} = \frac{\partial v_3}{\partial z_3} v_2$
$\frac{\partial v_3}{\partial w_2} = \frac{\partial v_3}{\partial z_2} \frac{\partial z_2}{\partial w_2} = \frac{\partial v_3}{\partial z_2} v_1$
$\frac{\partial v_3}{\partial w_1} = \frac{\partial v_3}{\partial z_1} \frac{\partial z_1}{\partial w_1} = \frac{\partial v_3}{\partial z_1} x$
What is $\frac{\partial v_3}{\partial b_j}$? Homework question.
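A sketch of the compute-and-store strategy for this three-weight chain, with assumed parameter values; the backward pass computes each $\partial v_3 / \partial z_j$ once and reuses it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Chain x -> v1 -> v2 -> v3; parameter values are assumptions.
w = np.array([0.8, -1.2, 0.5])
b = np.array([0.1, 0.0, -0.3])
x = 1.5

# Forward pass: store every z_j and v_j.
zs, vs, inp = [], [], x
for wj, bj in zip(w, b):
    z = wj * inp + bj
    zs.append(z)
    vs.append(sigmoid(z))
    inp = vs[-1]

# Backward pass: dv3/dz_j computed once, from the back, and reused.
dv3_dz = [None, None, dsigmoid(zs[2])]           # dv3/dz3 = sigma'(z3)
dv3_dz[1] = dv3_dz[2] * w[2] * dsigmoid(zs[1])   # dv3/dz2
dv3_dz[0] = dv3_dz[1] * w[1] * dsigmoid(zs[0])   # dv3/dz1

inputs = [x, vs[0], vs[1]]                           # input feeding each weight
dv3_dw = [dv3_dz[j] * inputs[j] for j in range(3)]   # dv3/dw_j = dv3/dz_j * input_j
print(dv3_dw)
```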
[Figure: a branched network $x \to (v_1, v_2) \to (v_3, v_4) \to v_5 = o$ with weights $w_{01}, w_{02}, w_{13}, w_{14}, w_{23}, w_{24}, w_{35}, w_{45}$]
Again work backwards, storing $\frac{\partial v_5}{\partial z_5}$, then $\frac{\partial v_5}{\partial z_4}$ and $\frac{\partial v_5}{\partial z_3}$, then $\frac{\partial v_5}{\partial z_2}$ and $\frac{\partial v_5}{\partial z_1}$:
$\frac{\partial v_5}{\partial z_4} = \frac{\partial v_5}{\partial z_5} \sigma'(z_4) w_{45}$
Where a node feeds several paths, the stored derivatives are summed:
$\frac{\partial v_5}{\partial z_2} = \frac{\partial v_5}{\partial z_4} \sigma'(z_2) w_{24} + \frac{\partial v_5}{\partial z_3} \sigma'(z_2) w_{23}$
This is called the back propagation algorithm.
Cost function is a surface
[Figure: a single-neuron network $x_1, x_2 \to o$ with weights $w_1, w_2$ and $b = 0$]
$l(y_i, o_i) = \| y_i - o_i \|^2$
$C(w_1, w_2) = \frac{1}{n} \sum_i l(y_i, o_i) + 0.2 (w_1^2 + w_2^2)$
[Figure: the data in the $(x_1, x_2)$ plane and the cost surface over the $(w_1, w_2)$ plane]
[Figure: gradient descent steps down the cost surface over $(w_1, w_2)$]
$w_j(t+1) = w_j(t) - \eta \frac{\partial C}{\partial w_j}$
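A sketch of descending this regularized cost surface; the toy data set, learning rate, and use of a numerical gradient are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The regularized cost C(w1, w2) above, for a one-neuron network with b = 0.
# The toy data set is an assumption, just to make the surface concrete.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

def cost(w):
    o = sigmoid(X @ w)                            # outputs with b = 0
    return np.mean((y - o) ** 2) + 0.2 * np.sum(w ** 2)

def grad(w, eps=1e-6):
    g = np.zeros_like(w)
    for j in range(2):
        dw = np.zeros(2); dw[j] = eps
        g[j] = (cost(w + dw) - cost(w - dw)) / (2 * eps)
    return g

# Descend the surface: w(t+1) = w(t) - eta * dC/dw.
w, eta = np.array([2.0, -2.0]), 0.5
for _ in range(200):
    w = w - eta * grad(w)
print(w, cost(w))
```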
How to use the data set?
1. Train a network.
2. Keep it for future use.
3. When previously unseen data comes in without labels, use the network to predict.
There is a problem here! How can we know, when we keep the network at step (2), that it is good enough to fulfil the task of step (3)? We have to find some way to test the network before we use it.
Training and validation set
Given an annotated data set $(x_i, y_i),\ i = 1, \dots, n$, randomly sample x% of it and keep it aside.
Train the network on the remaining (100 - x)%.
Validate the network with the x% of the data that you have not yet shown to the network.
The network is ready to use if the validation results are good. Otherwise, start troubleshooting.
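A minimal sketch of such a split, assuming a 20% validation fraction and a fixed random seed (both arbitrary choices):

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Randomly keep aside val_fraction of the data for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))             # random sample without replacement
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_tr, y_tr, X_val, y_val = train_val_split(X, y)
print(len(X_tr), "training points,", len(X_val), "validation points")
```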