Intro to Neural Networks and Deep Learning

1 Intro to Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi UVA CS

2 Outline: Neurons; 1-Layer Neural Network; Multi-layer Neural Network; Loss Functions; Backpropagation; Nonlinearity Functions; NNs in Practice

3 [Figure: network diagram with inputs x1, x2, x3, weights W1, W2, w3, and output ŷ]

4 Logistic Regression: $P(Y=1 \mid x) = \frac{e^{wx+b}}{1+e^{wx+b}}$. Sigmoid function (aka logistic, S-curve, soft-step; the logit is its inverse)

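5 Expanded Logistic Regression. Input x = (x1, x2, x3), p = 3. Multiply by weights w1, w2, w3, add the bias b (the weight on a constant input +1), and sum (summing function Σ) to get $z = w^T x + b$, with $w^T$ of size 1×p, $x$ of size p×1, and $z$, $b$ of size 1×1. Sigmoid function: $\hat{y} = P(Y=1 \mid x, w) = \mathrm{sigmoid}(z) = \frac{e^z}{1+e^z}$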
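
A minimal sketch of the neuron on this slide, assuming plain Python lists for the input and weights. The function names `sigmoid` and `neuron` and the numeric values are illustrative and do not come from the slides.

```python
import math

def sigmoid(z):
    """Logistic function e^z / (1 + e^z), written in the equivalent form 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Summing function z = w^T x + b followed by the sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Illustrative values (assumed): p = 3 inputs
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0
y_hat = neuron(x, w, b)  # z = 0.5 - 0.5 + 0 = 0, so y_hat = sigmoid(0) = 0.5
```

This form of the sigmoid can overflow for very negative z; a production implementation would guard against that.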

6 Neuron [Figure: inputs weighted by w1, w2, w3, plus bias b1 on a constant input +1, feed a summing node Σ that outputs z]

7 Neurons

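8 Neuron [Figure: inputs x1, x2, x3 weighted by w1, w2, w3 feed a summing node Σ that outputs z, then ŷ] From here on, we leave out the bias for simplicity. $z = w^T x$, with $w^T$ of size 1×p, $x$ of size p×1, and $z$ of size 1×1; $\hat{y} = \mathrm{sigmoid}(z) = \frac{e^z}{1+e^z}$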

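9 Block View of a Neuron: input x → [dot product with w, a parameterized block] → z → [sigmoid] → output ŷ. $z = w^T x$, with $w^T$ of size 1×p, $x$ of size p×1, and $z$ of size 1×1; $\hat{y} = \mathrm{sigmoid}(z) = \frac{e^z}{1+e^z}$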

10 Neuron Representation [Figure: inputs x1, x2, x3 weighted by w1, w2, w3 feed a single node Σ that outputs ŷ] The linear transformation and the nonlinearity together are typically considered a single neuron.

11 Neuron Representation [Block: x → w * → ŷ] The linear transformation and the nonlinearity together are typically considered a single neuron.

14 1-Layer Neural Network (with 4 neurons) [Figure: input x = (x1, x2, x3) → matrix W → four summing nodes Σ give the vector z → sigmoid → output ŷ = (ŷ1, ŷ2, ŷ3, ŷ4); linear + sigmoid = 1 layer]

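15 1-Layer Neural Network (with 4 neurons) [Figure: as on slide 14, with p = 3 inputs and d = 4 outputs] $z = W^T x$, with $z$ of size d×1, $W^T$ of size d×p, and $x$ of size p×1; $\hat{y} = \mathrm{sigmoid}(z) = \frac{e^z}{1+e^z}$, applied element-wise on the vector z, so $\hat{y}$ is also of size d×1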

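17 Block View of a Neural Network: input x → [dot product with W] → z → [sigmoid] → output ŷ. W is now a matrix and z is now a vector. $z = W^T x$, with $z$ of size d×1, $W^T$ of size d×p, and $x$ of size p×1; $\hat{y} = \mathrm{sigmoid}(z) = \frac{e^z}{1+e^z}$, with $\hat{y}$ of size d×1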
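
The shapes above can be checked with a small sketch. It assumes W is stored as p rows of d entries, so that $W^T x$ has length d; the helper name `layer_forward` and the weight values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, x):
    """W has p rows and d columns; x has length p. Returns y_hat of length d."""
    p, d = len(W), len(W[0])
    z = [sum(W[i][j] * x[i] for i in range(p)) for j in range(d)]  # z = W^T x
    return [sigmoid(zj) for zj in z]  # sigmoid applied element-wise

# p = 3 inputs, d = 4 neurons; values chosen (assumed) so that every z_j is 0
W = [[0.1, 0.2, 0.3, 0.4],
     [0.0, 0.0, 0.0, 0.0],
     [-0.1, -0.2, -0.3, -0.4]]
x = [1.0, 5.0, 1.0]
y_hat = layer_forward(W, x)  # four outputs, each sigmoid(0) = 0.5
```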

20 Multi-Layer Neural Network (Multi-Layer Perceptron, MLP). The weight subscript represents the layer number. [Figure: inputs x1, x2, x3 → W1 → hidden layer → w2 → output layer → ŷ] 2-layer NN

21 Multi-Layer Neural Network (MLP) [Figure: inputs x1, x2, x3 → W1 → 1st hidden layer → W2 → 2nd hidden layer → w3 → output layer → ŷ] 3-layer NN

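22 Multi-Layer Neural Network (MLP) [Figure: inputs x1, x2, x3 → W1 → h1 (hidden layer 1 output) → W2 → h2 → w3 → ŷ] $z_1 = W_1^T x$, $h_1 = \mathrm{sigmoid}(z_1)$; $z_2 = W_2^T h_1$, $h_2 = \mathrm{sigmoid}(z_2)$; $z_3 = w_3^T h_2$, $\hat{y} = \mathrm{sigmoid}(z_3)$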

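23 Multi-Class Output MLP [Figure: inputs x1, x2, x3 → W1 → h1 → W2 → h2 → W3 → vector output ŷ] $z_1 = W_1^T x$, $h_1 = \mathrm{sigmoid}(z_1)$; $z_2 = W_2^T h_1$, $h_2 = \mathrm{sigmoid}(z_2)$; $z_3 = W_3^T h_2$, $\hat{y} = \mathrm{sigmoid}(z_3)$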

24 Block View of MLP: x → [W1 *] → z1 → [sigmoid] → h1 (1st hidden layer) → [W2 *] → z2 → [sigmoid] → h2 (2nd hidden layer) → [W3 *] → z3 → [sigmoid] → ŷ (output layer)
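
The block view maps directly onto a loop over layers. The following is a rough sketch under assumed layer sizes (3 → 4 → 2 → 1); the names `linear`, `sigmoid_vec` and `mlp_forward` are hypothetical, and the all-zero weights are chosen only so the result is easy to verify by hand.

```python
import math

def sigmoid_vec(z):
    """Element-wise sigmoid on a list."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def linear(W, h):
    """Return W^T h, where W has len(h) rows."""
    return [sum(W[i][j] * h[i] for i in range(len(h))) for j in range(len(W[0]))]

def mlp_forward(weights, x):
    """Apply [W *] then [sigmoid] for every layer, as in the block view."""
    h = x
    for W in weights:
        h = sigmoid_vec(linear(W, h))
    return h

# Assumed layer sizes: 3 inputs -> 4 hidden -> 2 hidden -> 1 output
W1 = [[0.0] * 4 for _ in range(3)]
W2 = [[0.0] * 2 for _ in range(4)]
W3 = [[0.0] for _ in range(2)]
y_hat = mlp_forward([W1, W2, W3], [1.0, 2.0, 3.0])  # [0.5] with all-zero weights
```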

25 Deep Neural Networks (i.e. more than 1 hidden layer). Researchers have successfully trained object classifiers with around 1000 layers.

27 [Figure: network diagram with inputs x1, x2, x3, weights W1, W2, w3, and output ŷ, now followed by a loss E(ŷ)]

28 Binary Classification Loss [Figure: network with weights W1, W2, w3 and output $\hat{y} = P(y=1 \mid X, W)$] This example is for a single sample: $E = \text{loss} = -\log P(Y = y \mid X = x) = -y \log(\hat{y}) - (1-y)\log(1-\hat{y})$, where $y$ is the true output

29 Regression Loss [Figure: network with weights W1, W2, w3 and a linear output node Σ giving ŷ] $E = \text{loss} = \frac{1}{2}(y - \hat{y})^2$, where $y$ is the true output

30 Multi-Class Classification Loss [Figure: network with weights W1, W2, W3 producing z1, z2, z3 and outputs ŷ1, ŷ2, ŷ3; K = 3] Softmax function: a normalizing function which converts each class output to a probability, $\hat{y}_i = P(y_i = 1 \mid x)$. $E = \text{loss} = -\sum_{j=1}^{K} y_j \ln \hat{y}_j$, where $y_j$ is 0 for all except the true class
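
The three losses on slides 28 to 30 can be sketched as follows. The softmax formula $e^{z_i} / \sum_j e^{z_j}$ is the standard definition and is assumed here, since the equation itself did not survive on the slide; all function names are illustrative.

```python
import math

def binary_cross_entropy(y, y_hat):
    """E = -y log(y_hat) - (1 - y) log(1 - y_hat), for a single sample."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def squared_error(y, y_hat):
    """E = 1/2 (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2

def softmax(z):
    """Standard softmax; subtracting max(z) improves stability without changing the result."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, y_hat):
    """E = -sum_j y_j ln(y_hat_j); only the true class contributes."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y_onehot, y_hat) if yj > 0)

probs = softmax([1.0, 1.0, 1.0])        # K = 3 equal scores, so each probability is 1/3
loss = cross_entropy([0, 1, 0], probs)  # -ln(1/3) = ln 3
```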

33 Training Neural Networks. How do we learn the optimal weights $W_L$ for our task? Gradient descent: $W_L(t+1) = W_L(t) - \eta \frac{\partial E}{\partial W_L(t)}$, where $\eta$ is the learning rate. [Figure: network with weights W1, W2, w3 and output ŷ] But how do we get gradients of the lower layers? Backpropagation! It is a repeated application of the chain rule of calculus; it locally minimizes the objective; and it requires all blocks of the network to be differentiable. LeCun et al. Efficient BackProp. 1998

34 Backpropagation Intro

46 Backpropagation Intro. Tells us: by increasing x by a scale of 1, we decrease f by a scale of

47 Backpropagation (binary classification example) [Figure: inputs x1, x2, x3 → W1 → hidden layer → w2 → $\hat{y} = P(y=1 \mid X, W)$] Example on a 1-hidden-layer NN for binary classification

48 Backpropagation (binary classification example) Block view: x → [W1 *] → z1 → [sigmoid] → h1 → [w2 *] → z2 → [sigmoid] → ŷ

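49 Backpropagation (binary classification example) [Same block diagram as slide 48] $E = \text{loss} = -y \log(\hat{y}) - (1-y)\log(1-\hat{y})$. Gradient descent to minimize the loss: $W_1(t+1) = W_1(t) - \eta \frac{\partial E}{\partial W_1(t)}$ and $w_2(t+1) = w_2(t) - \eta \frac{\partial E}{\partial w_2(t)}$. Need to find these gradients!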

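51 Backpropagation (binary classification example) [Same block diagram as slide 48] $\frac{\partial E}{\partial w_2} = ??$, $\frac{\partial E}{\partial W_1} = ??$

52 Backpropagation (binary classification example) [Same block diagram as slide 48] $\frac{\partial E}{\partial w_2} = ??$, $\frac{\partial E}{\partial W_1} = ??$ Exploit the chain rule!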

54 Backpropagation (binary classification example) [Same block diagram as slide 48, annotated: chain rule]

62 Backpropagation (binary classification example) [Same block diagram as slide 48, with the upstream gradient terms annotated: already computed]
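
The equations on these slides were lost in extraction. As a hedged reconstruction, assuming the 1-hidden-layer network of slide 47 with sigmoid activations and the binary cross-entropy loss of slide 28, the chain rule gives the derivation below. The symbols $\sigma$ for the sigmoid and $\odot$ for the element-wise product are notation introduced here, not taken from the slides.

```latex
\begin{aligned}
z_1 &= W_1^T x, \quad h_1 = \sigma(z_1), \quad z_2 = w_2^T h_1, \quad \hat{y} = \sigma(z_2) \\
E &= -y \log \hat{y} - (1-y) \log(1-\hat{y}) \\
\frac{\partial E}{\partial z_2}
  &= \frac{\partial E}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial z_2}
   = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} \cdot \hat{y}(1-\hat{y})
   = \hat{y} - y \\
\frac{\partial E}{\partial w_2}
  &= \frac{\partial E}{\partial z_2} \, \frac{\partial z_2}{\partial w_2}
   = (\hat{y}-y)\, h_1 \\
\frac{\partial E}{\partial h_1}
  &= \frac{\partial E}{\partial z_2} \, \frac{\partial z_2}{\partial h_1}
   = (\hat{y}-y)\, w_2 \\
\frac{\partial E}{\partial z_1}
  &= \frac{\partial E}{\partial h_1} \odot h_1 \odot (1-h_1) \\
\frac{\partial E}{\partial W_1}
  &= x \left( \frac{\partial E}{\partial z_1} \right)^T
\end{aligned}
```

The term $\partial E / \partial z_2$ appears in both weight gradients, which is why it only needs to be computed once and can be reused when moving to the lower layer.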

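63 Local-ness of Backpropagation [Figure: a block f with input x and output y; activations flow forward through the block, gradients flow backward, and the block itself only needs its local gradients]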

65 Example: Sigmoid Block. $\mathrm{sigmoid}(x) = \sigma(x)$
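
A sigmoid block can be sketched as an object with a forward and a backward method. This is only an illustration: the class name `SigmoidBlock` and its interface are assumptions, and the local gradient used is the standard identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.

```python
import math

class SigmoidBlock:
    """Hypothetical block: forward caches its output, backward applies the local gradient."""

    def forward(self, x):
        self.y = 1.0 / (1.0 + math.exp(-x))
        return self.y

    def backward(self, grad_y):
        # Chain rule: dE/dx = dE/dy * dy/dx, with dy/dx = y * (1 - y)
        return grad_y * self.y * (1.0 - self.y)

block = SigmoidBlock()
y = block.forward(0.0)        # sigmoid(0) = 0.5
grad_x = block.backward(1.0)  # 1.0 * 0.5 * (1 - 0.5) = 0.25
```

The block needs only the gradient arriving from the block after it, which is the local-ness property described on slide 63.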

66 Deep Learning = Concatenation of Differentiable Parameterized Layers (linear and nonlinearity functions): x → [W1 *] → z1 → h1 (1st hidden layer) → [W2 *] → z2 → h2 (2nd hidden layer) → [W3 *] → z3 → ŷ (output layer). We want to find the optimal weights W that minimize some loss function E!

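67 Backprop Whiteboard Demo [Figure: inputs x1, x2 → weights w1, w2, w3, w4 and biases b1, b2 → two hidden nodes Σ giving z1, z2 and h1, h2 → weights w5, w6 and bias b3 → output node Σ giving ŷ] f1: $z_1 = x_1 w_1 + x_2 w_3 + b_1$, $z_2 = x_1 w_2 + x_2 w_4 + b_2$. f2: $h_1 = \frac{\exp(z_1)}{1 + \exp(z_1)}$, $h_2 = \frac{\exp(z_2)}{1 + \exp(z_2)}$. f3: $\hat{y} = h_1 w_5 + h_2 w_6 + b_3$. f4: $E = (y - \hat{y})^2$. Update: $w(t+1) = w(t) - \eta \frac{\partial E}{\partial w(t)}$, where $\eta$ is the learning rate. $\frac{\partial E}{\partial w} = ??$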
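
One way to work through the whiteboard question is to code the four functions and their chain-rule gradients, then compare against finite differences. This is a sketch, not the demo as presented: the numeric values, the learning rate, and the function names are all assumptions, and bias gradients are omitted for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w, b):
    """w = [w1, ..., w6], b = [b1, b2, b3]; follows f1, f2, f3 from the slide."""
    w1, w2, w3, w4, w5, w6 = w
    b1, b2, b3 = b
    z1 = x1 * w1 + x2 * w3 + b1
    z2 = x1 * w2 + x2 * w4 + b2
    h1, h2 = sigmoid(z1), sigmoid(z2)
    y_hat = h1 * w5 + h2 * w6 + b3
    return h1, h2, y_hat

def loss(x1, x2, y, w, b):
    """f4: E = (y - y_hat)^2."""
    return (y - forward(x1, x2, w, b)[2]) ** 2

def backward(x1, x2, y, w, b):
    """Gradients of E with respect to w1..w6 via the chain rule."""
    h1, h2, y_hat = forward(x1, x2, w, b)
    dE_dy = -2.0 * (y - y_hat)
    dE_dz1 = dE_dy * w[4] * h1 * (1.0 - h1)  # through w5 and the first sigmoid
    dE_dz2 = dE_dy * w[5] * h2 * (1.0 - h2)  # through w6 and the second sigmoid
    return [dE_dz1 * x1, dE_dz2 * x1, dE_dz1 * x2, dE_dz2 * x2,
            dE_dy * h1, dE_dy * h2]

# Assumed example values
w = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
b = [0.0, 0.0, 0.0]
x1, x2, y = 1.0, 2.0, 1.0
grads = backward(x1, x2, y, w, b)

# Finite-difference check of each gradient
eps = 1e-6
numeric = []
for i in range(6):
    wp, wm = list(w), list(w)
    wp[i] += eps
    wm[i] -= eps
    numeric.append((loss(x1, x2, y, wp, b) - loss(x1, x2, y, wm, b)) / (2 * eps))

# One gradient-descent step with an assumed learning rate
lr = 0.1
w_new = [wi - lr * gi for wi, gi in zip(w, grads)]
```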

69 Nonlinearity Functions (aka transfer or activation functions) [Figure: inputs multiplied by weights w1, w2, w3, plus bias b1, feed the summing function Σ to give z, followed by the sigmoid function]

70 Nonlinearity Functions (aka transfer or activation functions) Table columns: Name, Plot, Equation, Derivative (w.r.t. x)

72 Nonlinearity Functions (aka transfer or activation functions) Table columns: Name, Plot, Equation, Derivative (w.r.t. x). One entry is annotated: usually works best in practice

74 Neural Net Pipeline [Figure: network with weights W1, W2, w3, output ŷ and loss E]
1. Initialize weights
2. For each batch of input samples S:
   a. Run the network forward on S to compute outputs and loss
   b. Run the network backward using outputs and loss to compute gradients
   c. Update weights using SGD (or a similar method)
3. Repeat step 2 until the loss converges
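
The pipeline steps can be sketched on the smallest possible model, a single sigmoid neuron trained with the binary cross-entropy loss. Everything concrete here is an assumption made for illustration: the toy dataset (logical OR), the batch size of 1, the learning rate, and the fixed number of passes used in place of a convergence test.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def mean_loss(w, b, data):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for x, y in data:
        y_hat = predict(w, b, x)
        total += -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    return total / len(data)

# Toy dataset (assumed): logical OR of two inputs
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

random.seed(0)
# 1. Initialize weights
w = [random.uniform(-0.1, 0.1) for _ in range(2)]
b = 0.0
lr = 0.5  # learning rate (assumed)
initial_loss = mean_loss(w, b, data)

# 3. Repeat step 2 (a fixed number of passes stands in for a convergence test)
for _ in range(200):
    random.shuffle(data)
    # 2. For each batch of input samples (batch size 1 here)
    for x, y in data:
        # 2a. Forward: compute the output
        y_hat = predict(w, b, x)
        # 2b. Backward: for sigmoid plus cross-entropy, dE/dz = y_hat - y
        dz = y_hat - y
        # 2c. SGD update
        w = [wi - lr * dz * xi for wi, xi in zip(w, x)]
        b -= lr * dz

final_loss = mean_loss(w, b, data)
```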

75 Non-Convexity of Neural Nets. In very high dimensions, there exist many local minima which are about the same. Pascanu et al. On the saddle point problem for non-convex optimization

76 Building Deep Neural Nets [Figure: a block f mapping input x to output y]

77 Building Deep Neural Nets: GoogLeNet for Object Classification

78 Block Example Implementation
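
The implementation shown on this slide did not survive extraction. As a stand-in, here is a minimal sketch of what a dot-product block with forward and backward passes might look like; the class name `LinearBlock`, its methods, and the storage layout of W are hypothetical and not the authors' code.

```python
class LinearBlock:
    """Hypothetical block computing z = W^T x, with W stored as len(x) rows by len(z) columns."""

    def __init__(self, W):
        self.W = W

    def forward(self, x):
        self.x = x  # cache the activation for the backward pass
        rows, cols = len(self.W), len(self.W[0])
        return [sum(self.W[i][j] * x[i] for i in range(rows)) for j in range(cols)]

    def backward(self, grad_z):
        rows, cols = len(self.W), len(self.W[0])
        # Gradient for the parameters: dE/dW[i][j] = x[i] * dE/dz[j]
        self.grad_W = [[self.x[i] * grad_z[j] for j in range(cols)] for i in range(rows)]
        # Gradient passed to the previous block: dE/dx[i] = sum_j W[i][j] * dE/dz[j]
        return [sum(self.W[i][j] * grad_z[j] for j in range(cols)) for i in range(rows)]

block = LinearBlock([[1.0, 2.0], [3.0, 4.0]])
z = block.forward([1.0, 1.0])        # [1 + 3, 2 + 4] = [4.0, 6.0]
grad_x = block.backward([1.0, 0.0])  # [1.0, 3.0]; block.grad_W is [[1.0, 0.0], [1.0, 0.0]]
```

Any block that exposes these two methods can be chained with others, since backward consumes the gradient from the next block and returns the gradient for the previous one.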

79 Advantage of Neural Nets. As long as the model is fully differentiable, we can train it to automatically learn features for us.

Lecture 17: Neural Networks and Deep Learning

UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions