Administrative Issues
CS5242 mirror site: https://web.bii.a-star.edu.sg/~leehk/cs5424.html
Assignments (10% * 4 = 40%)
Each assignment has multiple sub-tasks, including coding tasks.
You need to submit the answers, code, and execution results.
Assessment components:
Assignments (10% * 4 = 40%)
Quiz (20%)
Final Project (40%)
Final Projects (40%)
Projects will be held on Kaggle-in-class; register using your NUS email (https://inclass.kaggle.com/).
The project webpages will be announced once groups are finalized.
No data disclosure. No additional data.
Each group: at most 2 students.
Each group will be randomly assigned to either a computer vision project or a natural language processing project.
Assessment is based on the report (15%) and ranking (25%).
Top-ranked groups of each project will be invited to present at week 13.
Will announce on IVLE once Kaggle is ready.
Final Project (40%)
Report template (15%):
- Problem definition
- Highlights (new algorithms, insights from the experiments)
- Dataset pre-processing description
- Training and testing procedure
- Experimental study (clarity, model understanding, and highlights are important for the assessment)
Ranking -> score (25%):
- Rank top 10%: score ~25 (max score)
- Rank top 20%: score ~21
- Rank top 40%: score ~17
- Rank top 70%: score ~14
- Rank others: score ~13
- Prediction scores less than 10% better than random guess: score ~5
Final rank-to-grade distribution may be adjusted depending on overall class performance.
Final projects timeline summary
27 August 18:00: Register your project group @ IVLE (2 or fewer students per group)
04 November 18:00: Results submission deadline
05 November 18:00: Report deadline
11 November 18:00: Selection and notification of selected projects for presentation
12 November 18:00: Selected project groups send slides to cs5242@comp.nus.edu.sg
12 November 18:00: Make appointment to meet Hwee Kuan for a presentation rehearsal
16 November 18:30: Project presentations by students
GPU Machine/Cluster
NSCC (National Super-Computing Center): free for NUS students; 128 GPU nodes, each with an NVIDIA Tesla K40 (12 GB). Follow the manual and test program in IVLE workbin/misc/gpu to familiarize yourself with it.
Google Cloud Platform: 300 USD credit for new registrations; NVIDIA Tesla K80 (2x12 GB) is available in the Taiwan region.
Amazon EC2 (g2.8xlarge): 100 USD credit for students; NVIDIA GRID GPU, 1536 CUDA cores, 4 GB memory; available in the Singapore region.
Tips:
- STOP/TERMINATE the instance immediately after your program terminates.
- Check the usage status frequently to avoid high outstanding bills.
- Use Amazon or Google cloud platforms for debugging, and use NSCC for actual training and tuning.
Lecture 1: Introduction
Lecture 2: Neural Network Basics
Lecture 3: Training the Neural Network
Lecture 4: Convolutional Neural Network
Lecture 5: Convolutional Neural Network
Lecture 6: Convolutional Neural Network
Lecture 7: Recurrent Neural Network
Lecture 8: Recurrent Neural Network
Lecture 9: Recurrent Neural Network
Lecture 10: Advanced Topic: Generative Adversarial Network
Lecture 11: Advanced Topic: One-shot Learning
Lecture 12: Advanced Topic: Distributed Training
Lecture 13: Project Demonstrations
Forward Propagation
The network and information flow
Feed data into the first layer; the final layer presents the output of the network.
[Figure: each intermediate layer acts as a receiver for the previous layer and a transmitter to the next]
Notation
Let $x \in \mathbb{R}^d$ be the input.
Let $y \in \mathbb{R}$ or $y \in \mathbb{N}$ be the label.
Let $o \in \mathbb{R}$ be the output of the neural network.
[Figure: the surface of the output $o$ over the $(x_1, x_2)$ plane]
$x = (x_1, x_2) \in \mathbb{R}^2$
$o = \sigma(w_1 x_1 + w_2 x_2 + b)$
[Figures: a point $A = (x_1, x_2)$ in the input plane is lifted to $A = (x_1, x_2, o)$ on the surface of $o$ over the $(x_1, x_2)$ plane]
[Figure: the sigmoid surface $o$ over the $(x_1, x_2)$ plane and its projection onto the plane]
Level sets form straight lines.
Concept of level sets
Next to simplest
$x = (x_1, x_2) \in \mathbb{R}^2$
$z = w_1 x_1 + w_2 x_2 + b$
$o = \sigma(z) = \sigma(w_1 x_1 + w_2 x_2 + b)$
[Figure: lines of constant $z$ in the $(x_1, x_2)$ plane and the resulting sigmoid surface $o$]
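To make this concrete, here is a minimal NumPy sketch of a single sigmoid unit (not from the slides; the weight and bias values are arbitrary assumptions). Note that $z$, and therefore $o = \sigma(z)$, is constant along straight lines in the $(x_1, x_2)$ plane:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative parameters (assumed, not from the lecture).
w1, w2, b = 2.0, -1.0, 0.5

def neuron(x1, x2):
    z = w1 * x1 + w2 * x2 + b   # z is constant along straight lines in (x1, x2)
    return sigmoid(z)           # o = sigma(z): its level sets are those same lines

# Evaluate the unit on a grid to see the sigmoid "step surface".
xs = np.linspace(-3, 3, 7)
x1g, x2g = np.meshgrid(xs, xs)
print(np.round(neuron(x1g, x2g), 2))
```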
[Figure: three sigmoid step surfaces over the $(x_1, x_2)$ plane]
These three surfaces add to form a triangular decision boundary!
Adding functions
[Figure: function 1, $f(x)$, plus function 2, $g(x)$, gives $(f+g)(x)$; at a point $A$, heights $a$ and $b$ add to $a + b$, and two steps of height $h$ add to $2h$]
[Figure: a network with hidden units $h_1, h_2, h_3$ feeding the output $o$, and the corresponding step surfaces over the $(x_1, x_2)$ plane]
The output adds the hidden-unit surfaces:
$w_1 h_1(x_1, x_2) + w_2 h_2(x_1, x_2) + w_3 h_3(x_1, x_2) + b$
If you add up enough step surfaces, can you form any function?
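As an illustration of the idea (a sketch, not the lecture's demo), the following NumPy snippet sums many steep sigmoid steps to approximate $\sin(x)$ in 1-D; the step locations, steepness, and target function are all assumptions chosen for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Approximate f(x) = sin(x) on [0, 2*pi] by a sum of steep sigmoid "steps".
xs = np.linspace(0, 2 * np.pi, 200)
target = np.sin(xs)

# Place one steep step at each of several locations; each step raises the
# running sum by the local increment of the target function.
centers = np.linspace(0, 2 * np.pi, 40)
increments = np.diff(np.sin(centers), prepend=np.sin(centers[0]))

approx = np.zeros_like(xs)
for c, a in zip(centers, increments):
    approx += a * sigmoid(20.0 * (xs - c))  # steepness 20 makes a near-step

print("max abs error:", np.max(np.abs(approx - target)))
```

Adding more, steeper steps drives the error down; this is the intuition behind the question above.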
Neural network finger activities
Activity one
Take out a piece of paper and draw the patterns as shown.
[Figure: a row of X's above a row of O's]
Task: cut (tear) with one straight line to completely separate the X's and O's.
Activity two
[Figure: a cluster of O's surrounded by X's]
Cut (tear) with one straight line!
Activity three
[Figure: a cluster of O's enclosed by X's on all sides]
Cut (tear) with one straight line!
Activity four
[Figure: X's and O's interleaved in several clusters]
Cut (tear) with one straight line!
Manifold view of neural network
Feed data into the first layer; the final layer presents the output of the network.
[Figure: data, hidden layer, output; https://www.pinterest.com/pin/414260865705737484/]
Function view of neural network
[Figure: hidden-unit surfaces $h_1(x_1, x_2)$, $h_2(x_1, x_2)$, $h_3(x_1, x_2)$ are summed to give the output surface over the $(x_1, x_2)$ plane]
Follow the data point
[Figure: the network $x_1, x_2 \to h^{(1)}, h^{(2)}$ with weights $w^{(1)}_1, w^{(1)}_2, b^{(1)}$ and $w^{(2)}_1, w^{(2)}_2, b^{(2)}$; a point $A = (x_1, x_2)$ is lifted to $(x_1, x_2, h_1)$ and $(x_1, x_2, h_2)$, giving the mapping $(x^A_1, x^A_2) \mapsto (h^A_1, h^A_2)$]
[Figure: the same network; the point $A$ in the $(x_1, x_2)$ plane is sent to a point in the $(h_1, h_2)$ plane: $(x^A_1, x^A_2) \mapsto (h^A_1, h^A_2)$]
[Figure: the same mapping applied to a neighbourhood of $A$]
Neighbourhood relationships are conserved.
Continuous mapping
[Figure: the $(x_1, x_2)$ grid is deformed continuously into the $(h_1, h_2)$ plane under $(x^A_1, x^A_2) \mapsto (h^A_1, h^A_2)$]
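A small sketch of this mapping, under assumed weights (the values below are not the lecture's): each input point $(x_1, x_2)$ is sent to $(h_1, h_2)$, and nearby inputs land nearby because the map is continuous:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed illustrative weights for two hidden units.
W = np.array([[ 2.0, -1.0],    # w^(1)_1, w^(1)_2
              [ 1.0,  1.5]])   # w^(2)_1, w^(2)_2
b = np.array([0.5, -0.5])      # b^(1), b^(2)

def hidden_map(points):
    """Map points (x1, x2) to (h1, h2): the manifold view of one layer."""
    return sigmoid(points @ W.T + b)

# A small grid of input points; neighbourhood relationships are conserved
# after mapping, since the map is continuous.
grid = np.array([[x1, x2] for x1 in (-1, 0, 1) for x2 in (-1, 0, 1)], float)
print(hidden_map(grid))
```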
[Figure: with three hidden units, $(x^A_1, x^A_2) \mapsto (h^A_1, h^A_2, h^A_3)$; the 2-D input plane becomes a surface in the 3-D $(h_1, h_2, h_3)$ space]
[Figure: the same three-hidden-unit network and the input data in the $(x_1, x_2)$ plane]
[Figure: the full 2-2-1 network from $(x_1, x_2)$ through $h^{(1)}, h^{(2)}$ to the output surface $o$]
$h^{(1)} = \sigma(w^{(1)}_1 x_1 + w^{(1)}_2 x_2 + b^{(1)})$
$h^{(2)} = \sigma(w^{(2)}_1 x_1 + w^{(2)}_2 x_2 + b^{(2)})$
$o = \sigma(w^{(3)}_1 h^{(1)} + w^{(3)}_2 h^{(2)} + b^{(3)})$
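A minimal forward-propagation sketch of this 2-2-1 network; all parameter values below are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward propagation for the 2-2-1 network above (assumed parameters).
w1 = np.array([ 2.0, -1.0]); b1 = 0.5     # feeds h(1)
w2 = np.array([-1.5,  1.0]); b2 = -0.5    # feeds h(2)
w3 = np.array([ 1.0,  1.0]); b3 = -1.0    # feeds o

def forward(x):
    h1 = sigmoid(w1 @ x + b1)                   # h(1) = sigma(w(1).x + b(1))
    h2 = sigmoid(w2 @ x + b2)                   # h(2) = sigma(w(2).x + b(2))
    return sigmoid(w3 @ np.array([h1, h2]) + b3)  # o = sigma(w(3).h + b(3))

print(forward(np.array([0.2, -0.7])))
```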
Some real-life examples, courtesy of our TA Connie Kou. Source code will be available online.
The Angle Data
The XOR Data
The Ring Data
The Angle Data - ReLU
[Animation: a sequence of frames showing a ReLU network fitting the Angle data]
The Angle Data - Sigmoid
[Animation: a sequence of frames showing a sigmoid network with 3 hidden nodes fitting the Angle data]
The XOR Data - ReLU
[Animation: a sequence of frames showing a ReLU network fitting the XOR data]
The XOR Data - Sigmoid
[Animation: a sequence of frames showing a sigmoid network fitting the XOR data]
The Ring Data - Sigmoid
How is the manifold being folded?
[Animation: a sequence of frames showing a sigmoid network fitting the Ring data]
Spiral
How about this network? Fully connected.
Feed backward
A general objective
Tweak the output of the neural network, $o$, to be as similar to the label $y$ as possible.
How to tweak? Adjust the weights.
Given an example $(x, y)$
[Figure: $x$ fed through the network produces output $o$]
If $o \approx y$, the neural network behaves well for this example.
After the first example, give another example $(x', ?)$
[Figure: $x'$ fed through the network produces output $o'$]
Now we say that for a well-behaved neural network, $o' = y'$.
How good is the trained network?
[Figure: $x'$ fed through the network produces output $o'$]
Square error
Given data points $(x_i, y_i),\ i = 1, \dots, n$:
$l(y_i, o_i) = \| y_i - o_i \|^2$
$C = \frac{1}{n} \sum_i l(y_i, o_i)$
[Figure: inputs $x_1, x_2, \dots, x_n$ fed through the network give output/label pairs $(o_1, y_1), (o_2, y_2), \dots, (o_n, y_n)$]
Adjust the weights until $C = 0$ (or close to zero).
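A short sketch of this cost in NumPy; the outputs and labels below are made up just to exercise the function:

```python
import numpy as np

# Mean squared-error cost over a data set, as defined above.
def squared_error_cost(outputs, labels):
    """C = (1/n) * sum_i || y_i - o_i ||^2"""
    diffs = labels - outputs
    return np.mean(np.sum(diffs ** 2, axis=-1))

# Hypothetical outputs/labels.
o = np.array([[0.9], [0.2], [0.7]])
y = np.array([[1.0], [0.0], [1.0]])
print(squared_error_cost(o, y))  # small when the network behaves well
```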
Cross entropy cost function
Two-class example: $y_i \in \{0, 1\}$
$C = -\frac{1}{n} \sum_i \left( y_i \log[o(x_i)] + (1 - y_i) \log[1 - o(x_i)] \right)$
Three-class example: $y_i \in \{(1,0,0), (0,1,0), (0,0,1)\}$
With output values $v_1, v_2, v_3$ and $m = \exp(v_1) + \exp(v_2) + \exp(v_3)$:
$o_i = \left( \frac{\exp(v_1)}{m}, \frac{\exp(v_2)}{m}, \frac{\exp(v_3)}{m} \right)$
$l(y_i, o_i) = -y_i \cdot \log(o_i)$
$C = \frac{1}{n} \sum_i l(y_i, o_i)$
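A sketch of the softmax and cross-entropy computation above; the output values $v$ and the label are hypothetical:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract max for numerical stability
    return e / np.sum(e)

def cross_entropy(y, o):
    """l(y, o) = -y . log(o) for a one-hot label y."""
    return -np.sum(y * np.log(o))

# Three-class example with hypothetical output values v1, v2, v3.
v = np.array([2.0, 0.5, -1.0])
o = softmax(v)                   # (exp(v1)/m, exp(v2)/m, exp(v3)/m)
y = np.array([1.0, 0.0, 0.0])    # true class is class 1
print(o, cross_entropy(y, o))
```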
How to adjust the weights?
Conceptually, this would work: try all possible weight combinations, calculate the cost for each, then pick the combination with the lowest cost.
Practically, there are far too many weight combinations to try.
Gradient descent
[Figure: a chain network $x \to v_1 \to v_2 \to v_3$ with weights $w_1, w_2, w_3$]
$v_1 = \sigma(z_1) = \sigma(w_1 x + b_1)$
$v_2 = \sigma(z_2) = \sigma(w_2 v_1 + b_2)$
$v_3 = \sigma(z_3) = \sigma(w_3 v_2 + b_3)$
At a minimum of the cost, $\frac{\partial C}{\partial w_j} = 0$ and $\frac{\partial C}{\partial b_j} = 0$.
$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_i \frac{\partial (y_i - o_i)^2}{\partial w_j} = -\frac{2}{n} \sum_i (y_i - o_i) \frac{\partial o_i}{\partial w_j}$
We just need $\frac{\partial o_i}{\partial w_j}$.
Update rule (with learning rate $\eta$): $w_j(t+1) = w_j(t) - \eta \frac{\partial C}{\partial w_j}$
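A minimal sketch of the update rule $w_j(t+1) = w_j(t) - \eta \, \partial C / \partial w_j$, using a numerical gradient and a toy quadratic cost (both assumptions, for illustration only):

```python
import numpy as np

# One-dimensional-per-weight numerical gradient, so the sketch is self-contained.
def numerical_grad(C, w, eps=1e-6):
    g = np.zeros_like(w)
    for j in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        g[j] = (C(w_plus) - C(w_minus)) / (2 * eps)
    return g

C = lambda w: np.sum((w - 1.0) ** 2)   # toy cost with minimum at w = (1, 1)
w = np.array([3.0, -2.0])
eta = 0.1                              # assumed learning rate
for _ in range(100):
    w = w - eta * numerical_grad(C, w)  # w(t+1) = w(t) - eta * dC/dw
print(w)  # converges toward [1, 1]
```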
[Figure: the chain network $x \to v_1 \to v_2 \to v_3 = o$ with weights $w_1, w_2, w_3$]
$v_1 = \sigma(z_1) = \sigma(w_1 x + b_1)$
$v_2 = \sigma(z_2) = \sigma(w_2 v_1 + b_2)$
$v_3 = \sigma(z_3) = \sigma(w_3 v_2 + b_3)$
$\frac{\partial v_3}{\partial w_3} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial w_3} = \sigma'(z_3) v_2$
$\frac{\partial v_3}{\partial w_2} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial w_2} = \sigma'(z_3) w_3 \frac{\partial v_2}{\partial w_2} = \sigma'(z_3) w_3 \sigma'(z_2) v_1$
$\frac{\partial v_3}{\partial w_1} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial w_1} = \sigma'(z_3) w_3 \frac{\partial v_2}{\partial w_1} = \sigma'(z_3) w_3 \sigma'(z_2) w_2 \frac{\partial v_1}{\partial w_1} = \sigma'(z_3) w_3 \sigma'(z_2) w_2 \sigma'(z_1) x$
Problem: the expressions get longer and longer; re-deriving each gradient from scratch leads to large computational time for a deep network.
Compute and store strategy
[Figure: the chain network $x \to v_1 \to v_2 \to v_3 = o$]
Work backwards, storing $\frac{\partial v_3}{\partial z_j}$ at each node:
$\frac{\partial v_3}{\partial z_3} = \sigma'(z_3)$
$\frac{\partial v_3}{\partial z_2} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial z_2} = \frac{\partial v_3}{\partial z_3} w_3 \sigma'(z_2)$
$\frac{\partial v_3}{\partial z_1} = \frac{\partial v_3}{\partial z_2} \frac{\partial z_2}{\partial z_1} = \frac{\partial v_3}{\partial z_2} w_2 \sigma'(z_1)$
Compute and store strategy
With $\frac{\partial v_3}{\partial z_1}, \frac{\partial v_3}{\partial z_2}, \frac{\partial v_3}{\partial z_3}$ stored, the weight gradients follow cheaply:
$\frac{\partial v_3}{\partial w_3} = \frac{\partial v_3}{\partial z_3} \frac{\partial z_3}{\partial w_3} = \frac{\partial v_3}{\partial z_3} v_2$
$\frac{\partial v_3}{\partial w_2} = \frac{\partial v_3}{\partial z_2} \frac{\partial z_2}{\partial w_2} = \frac{\partial v_3}{\partial z_2} v_1$
$\frac{\partial v_3}{\partial w_1} = \frac{\partial v_3}{\partial z_1} \frac{\partial z_1}{\partial w_1} = \frac{\partial v_3}{\partial z_1} x$
What is $\frac{\partial v_3}{\partial b_j}$? Homework question.
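A sketch of the compute-and-store strategy for this three-weight chain, with assumed parameter values; the backward pass computes each $\partial v_3 / \partial z_j$ once and reuses it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Chain x -> v1 -> v2 -> v3; parameter values are assumptions.
w = np.array([0.8, -1.2, 0.5])
b = np.array([0.1, 0.0, -0.3])
x = 1.5

# Forward pass: store every z_j and v_j.
zs, vs, inp = [], [], x
for wj, bj in zip(w, b):
    z = wj * inp + bj
    zs.append(z)
    vs.append(sigmoid(z))
    inp = vs[-1]

# Backward pass: dv3/dz_j computed once, from the back, and reused.
dv3_dz = [None, None, dsigmoid(zs[2])]           # dv3/dz3 = sigma'(z3)
dv3_dz[1] = dv3_dz[2] * w[2] * dsigmoid(zs[1])   # dv3/dz2
dv3_dz[0] = dv3_dz[1] * w[1] * dsigmoid(zs[0])   # dv3/dz1

inputs = [x, vs[0], vs[1]]                           # input feeding each weight
dv3_dw = [dv3_dz[j] * inputs[j] for j in range(3)]   # dv3/dw_j = dv3/dz_j * input_j
print(dv3_dw)
```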
[Figure: a branched network $x \to (v_1, v_2) \to (v_3, v_4) \to v_5 = o$ with weights $w_{01}, w_{02}, w_{13}, w_{14}, w_{23}, w_{24}, w_{35}, w_{45}$]
Again work backwards, storing $\frac{\partial v_5}{\partial z_5}$, then $\frac{\partial v_5}{\partial z_4}$ and $\frac{\partial v_5}{\partial z_3}$, then $\frac{\partial v_5}{\partial z_2}$ and $\frac{\partial v_5}{\partial z_1}$:
$\frac{\partial v_5}{\partial z_4} = \frac{\partial v_5}{\partial z_5} \sigma'(z_4) w_{45}$
Where a node feeds several paths, the stored derivatives are summed:
$\frac{\partial v_5}{\partial z_2} = \frac{\partial v_5}{\partial z_4} \sigma'(z_2) w_{24} + \frac{\partial v_5}{\partial z_3} \sigma'(z_2) w_{23}$
This is called the back propagation algorithm.
Cost function is a surface
[Figure: a single-neuron network $x_1, x_2 \to o$ with weights $w_1, w_2$ and $b = 0$]
$l(y_i, o_i) = \| y_i - o_i \|^2$
$C(w_1, w_2) = \frac{1}{n} \sum_i l(y_i, o_i) + 0.2 (w_1^2 + w_2^2)$
[Figure: the data in the $(x_1, x_2)$ plane and the cost surface over the $(w_1, w_2)$ plane]
[Figure: gradient descent steps down the cost surface over $(w_1, w_2)$]
$w_j(t+1) = w_j(t) - \eta \frac{\partial C}{\partial w_j}$
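A sketch of descending this regularized cost surface; the toy data set, learning rate, and use of a numerical gradient are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The regularized cost C(w1, w2) above, for a one-neuron network with b = 0.
# The toy data set is an assumption, just to make the surface concrete.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

def cost(w):
    o = sigmoid(X @ w)                            # outputs with b = 0
    return np.mean((y - o) ** 2) + 0.2 * np.sum(w ** 2)

def grad(w, eps=1e-6):
    g = np.zeros_like(w)
    for j in range(2):
        dw = np.zeros(2); dw[j] = eps
        g[j] = (cost(w + dw) - cost(w - dw)) / (2 * eps)
    return g

# Descend the surface: w(t+1) = w(t) - eta * dC/dw.
w, eta = np.array([2.0, -2.0]), 0.5
for _ in range(200):
    w = w - eta * grad(w)
print(w, cost(w))
```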
How to use the data set?
1. Train a network.
2. Keep it for future use.
3. When previously unseen data comes in without labels, use the network to predict.
There is a problem here! How can we know, when we keep the network at step (2), that it is good enough to fulfil the task of step (3)? We have to find some way to test the network before we use it.
Training and validation set
Given an annotated data set $(x_i, y_i),\ i = 1, \dots, n$, randomly sample x% of it and keep it aside.
Train the network on the remaining (100 - x)%.
Validate the network with the x% of the data that you have not yet shown to the network.
The network is ready to use if the validation results are good. Otherwise, start troubleshooting.
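A minimal sketch of such a split, assuming a 20% validation fraction and a fixed random seed (both arbitrary choices):

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Randomly keep aside val_fraction of the data for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))             # random sample without replacement
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_tr, y_tr, X_val, y_val = train_val_split(X, y)
print(len(X_tr), "training points,", len(X_val), "validation points")
```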