Deep Learning (CNNs)
1 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28; Bishop, HTF, Mitchell: none assigned Matt Gormley Lecture 21, April 05
2 Reminders Homework 5 (Part II): Peer Review Release: Wed, Mar. 29 Due: Wed, Apr. 05 at 11:59pm Peer Tutoring Homework 7: Deep Learning Release: Wed, Apr. 05 Watch for multiple due dates!! Expectation: You should spend at most 1 hour on your reviews 2
3 BACKPROPAGATION 3
4 Background A Recipe for Machine Learning 1. Given training data: 2. Choose each of these: Decision function, Loss function 3. Define goal: 4. Train with SGD: (take small steps opposite the gradient) 4
5 Training Backpropagation Whiteboard Example: Backpropagation for Calculus Quiz #1 Calculus Quiz #1: Suppose x = 2 and z = 3; what are dy/dx and dy/dz for the function below? (The function itself did not survive transcription.) 5
6 Training Backpropagation Automatic Differentiation Reverse Mode (aka. Backpropagation)
Forward Computation 1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the computation graph). 2. Visit each node in topological order. For variable u_i with inputs v_1, …, v_N: a. Compute u_i = g_i(v_1, …, v_N) b. Store the result at the node
Backward Computation 1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1. 2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N): a. We already know dy/du_i b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j) (Choice of algorithm ensures computing (du_i/dv_j) is easy) Return partial derivatives dy/du_i for all variables 6
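To make the two passes concrete, here is a minimal sketch of reverse-mode automatic differentiation over an explicit computation graph, in Python. The Node class and its fields are illustrative choices, not part of the lecture; any structure that records each variable's inputs, value, and local derivatives du_i/dv_j would do.

```python
class Node:
    """One variable u_i in the computation graph (illustrative sketch)."""
    def __init__(self, fn, local_grads, inputs=()):
        self.fn = fn                    # computes u_i = g_i(v_1, ..., v_N)
        self.local_grads = local_grads  # returns [du_i/dv_1, ..., du_i/dv_N]
        self.inputs = list(inputs)
        self.value = None
        self.grad = 0.0

def forward(nodes):
    # Visit each node in topological order; compute and store its value.
    for u in nodes:
        u.value = u.fn(*[v.value for v in u.inputs])

def backward(nodes):
    # Initialize all dy/du_j to 0 and dy/dy to 1 (nodes[-1] is the output y).
    for u in nodes:
        u.grad = 0.0
    nodes[-1].grad = 1.0
    # Reverse topological order: increment dy/dv_j by (dy/du_i)(du_i/dv_j).
    for u in reversed(nodes):
        vals = [v.value for v in u.inputs]
        for v, du_dv in zip(u.inputs, u.local_grads(*vals)):
            v.grad += u.grad * du_dv

# Tiny usage example: y = x^2 at x = 2, so dy/dx = 4.
x = Node(fn=lambda: 2.0, local_grads=lambda: [])
y = Node(fn=lambda v: v * v, local_grads=lambda v: [2 * v], inputs=[x])
forward([x, y]); backward([x, y])
print(y.value, x.grad)  # 4.0 4.0
```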
7 Training Backpropagation Simple Example: The goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass.
Forward: J = cos(u), u = u_1 + u_2, u_1 = sin(t), u_2 = 3t, t = x^2 7
8 Training Backpropagation Simple Example: The goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass.
Forward: J = cos(u), u = u_1 + u_2, u_1 = sin(t), u_2 = 3t, t = x^2
Backward:
dJ/du = -sin(u)
dJ/du_1 = (dJ/du)(du/du_1), du/du_1 = 1
dJ/du_2 = (dJ/du)(du/du_2), du/du_2 = 1
dJ/dt = (dJ/du_1)(du_1/dt) + (dJ/du_2)(du_2/dt), du_1/dt = cos(t), du_2/dt = 3
dJ/dx = (dJ/dt)(dt/dx), dt/dx = 2x 8
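The same example can be checked numerically. Below is a sketch in plain Python that evaluates the forward pass one primitive at a time, accumulates the backward pass exactly as above, and compares dJ/dx against a centered finite difference; the function names are illustrative.

```python
import math

# Forward pass for J = cos(sin(x^2) + 3x^2), one primitive per line.
def forward(x):
    t = x * x              # t = x^2
    u1 = math.sin(t)       # u1 = sin(t)
    u2 = 3 * t             # u2 = 3t
    u = u1 + u2            # u = u1 + u2
    J = math.cos(u)        # J = cos(u)
    return J, (t, u1, u2, u)

# Backward pass, mirroring the chain-rule steps above.
def backward(x):
    _, (t, u1, u2, u) = forward(x)
    dJ_du = -math.sin(u)                         # dJ/du = -sin(u)
    dJ_du1 = dJ_du * 1.0                         # du/du1 = 1
    dJ_du2 = dJ_du * 1.0                         # du/du2 = 1
    dJ_dt = dJ_du1 * math.cos(t) + dJ_du2 * 3.0  # du1/dt = cos(t), du2/dt = 3
    dJ_dx = dJ_dt * 2 * x                        # dt/dx = 2x
    return dJ_dx

x, eps = 2.0, 1e-6
numeric = (forward(x + eps)[0] - forward(x - eps)[0]) / (2 * eps)
print(backward(x), numeric)  # analytic and numeric gradients should agree
```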
9 Training Backpropagation Case 1: Logistic Regression (diagram: Output node connected to Inputs via weights θ_1, θ_2, θ_3, …, θ_M)
Forward: J = y* log(y) + (1 - y*) log(1 - y), y = 1 / (1 + exp(-a)), a = Σ_{j=0}^D θ_j x_j
Backward:
dJ/dy = y*/y + (1 - y*)/(y - 1)
dJ/da = (dJ/dy)(dy/da), dy/da = exp(-a) / (exp(-a) + 1)^2
dJ/dθ_j = (dJ/da)(da/dθ_j), da/dθ_j = x_j
dJ/dx_j = (dJ/da)(da/dx_j), da/dx_j = θ_j 9
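As a sanity check on the Case 1 derivation, here is a hedged sketch in Python that walks the same chain-rule steps; theta, x, and y_star are illustrative values, and J here is the log-likelihood being maximized, as on the slide.

```python
import math

# Forward pass for Case 1 (binary logistic regression), as on the slide.
def forward(theta, x, y_star):
    a = sum(th * xj for th, xj in zip(theta, x))  # a = sum_j theta_j x_j
    y = 1.0 / (1.0 + math.exp(-a))                # y = sigmoid(a)
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)
    return J, a, y

# Backward pass, one chain-rule step per line.
def backward(theta, x, y_star):
    _, a, y = forward(theta, x, y_star)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = math.exp(-a) / (math.exp(-a) + 1) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = [dJ_da * xj for xj in x]          # da/dtheta_j = x_j
    dJ_dx = [dJ_da * th for th in theta]          # da/dx_j = theta_j
    return dJ_dtheta, dJ_dx

theta, x, y_star = [0.5, -1.0, 2.0], [1.0, 0.3, -0.2], 1.0  # illustrative
print(backward(theta, x, y_star))
```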
10 Training Backpropagation
(E) Output (sigmoid): y = 1 / (1 + exp(-b))
(D) Output (linear): b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: given x_i, ∀i 10
11 Training Backpropagation
(F) Loss: J = ½ (y - y*)^2
(E) Output (sigmoid): y = 1 / (1 + exp(-b))
(D) Output (linear): b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: given x_i, ∀i 11
12 Training Backpropagation Case 2: Neural Network
Forward: J = y* log(y) + (1 - y*) log(1 - y), y = 1 / (1 + exp(-b)), b = Σ_{j=0}^D β_j z_j, z_j = 1 / (1 + exp(-a_j)), a_j = Σ_{i=0}^M α_{ji} x_i
Backward:
dJ/dy = y*/y + (1 - y*)/(y - 1)
dJ/db = (dJ/dy)(dy/db), dy/db = exp(-b) / (exp(-b) + 1)^2
dJ/dβ_j = (dJ/db)(db/dβ_j), db/dβ_j = z_j
dJ/dz_j = (dJ/db)(db/dz_j), db/dz_j = β_j
dJ/da_j = (dJ/dz_j)(dz_j/da_j), dz_j/da_j = exp(-a_j) / (exp(-a_j) + 1)^2
dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}), da_j/dα_{ji} = x_i
dJ/dx_i = Σ_{j=0}^D (dJ/da_j)(da_j/dx_i), da_j/dx_i = α_{ji} 12
13 Training Backpropagation Case 2: Neural Network, with the same forward and backward computations as the previous slide annotated by layer type: Loss, Sigmoid, Linear, Sigmoid, Linear. 13
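A vectorized sketch of the Case 2 computations, using NumPy; alpha (hidden weights), beta (output weights), x, and y_star are illustrative. It uses the identity sigmoid'(v) = sigmoid(v)(1 - sigmoid(v)), which equals the slide's exp(-v)/(exp(-v) + 1)^2.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Forward pass: linear -> sigmoid -> linear -> sigmoid -> objective.
def forward(alpha, beta, x, y_star):
    a = alpha @ x                 # a_j = sum_i alpha_ji x_i
    z = sigmoid(a)                # z_j = sigmoid(a_j)
    b = beta @ z                  # b = sum_j beta_j z_j
    y = sigmoid(b)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    return J, a, z, b, y

# Backward pass, mirroring the slide's chain-rule steps.
def backward(alpha, beta, x, y_star):
    _, a, z, b, y = forward(alpha, beta, x, y_star)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * y * (1 - y)       # dy/db = y(1 - y)
    dJ_dbeta = dJ_db * z              # db/dbeta_j = z_j
    dJ_dz = dJ_db * beta              # db/dz_j = beta_j
    dJ_da = dJ_dz * z * (1 - z)       # dz_j/da_j = z_j(1 - z_j)
    dJ_dalpha = np.outer(dJ_da, x)    # da_j/dalpha_ji = x_i
    dJ_dx = alpha.T @ dJ_da           # dJ/dx_i = sum_j (dJ/da_j) alpha_ji
    return dJ_dalpha, dJ_dbeta, dJ_dx

rng = np.random.default_rng(0)
alpha, beta = rng.normal(size=(4, 3)), rng.normal(size=4)  # illustrative sizes
x, y_star = rng.normal(size=3), 1.0
print(backward(alpha, beta, x, y_star))
```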
14 Training Backpropagation Whiteboard SGD for Neural Network Example: Backpropagation for Neural Network 14
15 Training Backpropagation Backpropagation (Auto. Diff., Reverse Mode)
Forward Computation 1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the computation graph). 2. Visit each node in topological order. a. Compute the corresponding variable's value b. Store the result at the node
Backward Computation 1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1. 2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N): a. We already know dy/du_i b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j) (Choice of algorithm ensures computing (du_i/dv_j) is easy) Return partial derivatives dy/du_i for all variables 15
16 Background A Recipe for Gradients 1. Given training data: 2. Choose each of these: Decision function, Loss function 3. Define goal: 4. Train with SGD: (take small steps opposite the gradient) Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently! 16
17 Summary 1. Neural Networks: provide a way of learning features; are highly nonlinear prediction functions; (can be) a highly parallel network of logistic regression classifiers; discover useful hidden representations of the input 2. Backpropagation: provides an efficient way to compute gradients; is a special case of reverse-mode automatic differentiation 17
18 DEEP LEARNING 18
19 Deep Learning Outline Background: Computer Vision Image Classification ILSVRC Traditional Feature Extraction Methods Convolution as Feature Extraction Convolutional Neural Networks (CNNs) Learning Feature Abstractions Common CNN Layers: Convolutional Layer Max-Pooling Layer Fully-connected Layer (w/ tensor input) Softmax Layer ReLU Layer Background: Subgradient Architecture: LeNet Architecture: AlexNet Training a CNN SGD for CNNs Backpropagation for CNNs 19
20 Motivation Why is everyone talking about Deep Learning? Because a lot of money is invested in it DeepMind: Acquired by Google for $400 million DNNResearch: Three person startup (including Geoff Hinton) acquired by Google for unknown price tag Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars Because it made the front page of the New York Times 20
21 Motivation Why is everyone talking about Deep Learning? (timeline figure: 1960s, 1980s, 1990s) Deep learning: Has won numerous pattern recognition competitions Does so with minimal feature engineering This wasn't always the case! Since the 1980s: the form of the models hasn't changed much, but there are lots of new tricks: More hidden units Better (online) optimization New nonlinear functions (ReLUs) Faster computers (CPUs and GPUs) 21
22 BACKGROUND: COMPUTER VISION 22
23 Example: Image Classification ImageNet LSVRC contest: Dataset: 1.2 million labeled images, 1000 classes Task: Given a new image, label it with the correct class Multiclass classification problem Examples from http://image-net.org/ 23
27 Example: Image Classification Traditional Feature Extraction for Images: SIFT HOG 27
28 Example: Image Classification CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2012) 15.3% error on ImageNet LSVRC contest Input image (pixels) Five convolutional layers (w/ max-pooling) Three fully connected layers 1000-way softmax 28
29 CNNs for Image Recognition Slide from Kaiming He 29
30 CONVOLUTION 30
31 What's a convolution? Basic idea: Pick a 3x3 matrix F of weights Slide this over an image and compute the inner product (similarity) of F and the corresponding field of the image, and replace the pixel in the center of the field with the output of the inner product operation Key point: Different convolutions extract different types of low-level features from an image All that we need to vary to generate these different features is the weights of F Slide adapted from William Cohen
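A minimal sketch of this sliding inner product in NumPy (strictly speaking a cross-correlation, which is what deep learning libraries compute under the name convolution); no padding is applied here, so the output shrinks by two pixels in each dimension for a 3x3 F.

```python
import numpy as np

def conv2d(image, F):
    """Slide the k x k weight matrix F over the image; each output pixel is
    the inner product of F with the corresponding field of the image."""
    H, W = image.shape
    k = F.shape[0]                          # assume a square k x k filter
    out = np.zeros((H - k + 1, W - k + 1))  # no padding: output shrinks
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            field = image[i:i + k, j:j + k]
            out[i, j] = np.sum(field * F)   # inner product with F
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
identity = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]], dtype=float)  # identity kernel
print(conv2d(image, identity))  # recovers the 3x3 interior of the image
```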
32-46 Background: Image Processing A convolution matrix is used in image processing for tasks such as edge detection, blurring, sharpening, etc. (A sequence of slides steps the convolution window across the Input Image one position at a time, filling in the Convolved Image; the pixel values did not survive transcription. The final two slides show an Identity convolution and a Blurring convolution.) 46
47-52 What's a convolution? (A sequence of slides demos different 3x3 kernels from http://matlabtricks.com/post-5/3x3-convolution-kernels-with-online-demo) Slide from William Cohen
53 What's a convolution? Basic idea: Pick a 3x3 matrix F of weights Slide this over an image and compute the inner product (similarity) of F and the corresponding field of the image, and replace the pixel in the center of the field with the output of the inner product operation Key point: Different convolutions extract different types of low-level features from an image All that we need to vary to generate these different features is the weights of F Slide adapted from William Cohen
54-63 Downsampling Suppose we use a convolution with stride 2: only 9 patches are visited in the input, so there are only 9 pixels in the output. (A sequence of slides steps the strided convolution window across the Input Image, filling in the Convolved Image.) 63
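A sketch of the strided convolution just described, assuming a 7x7 input so that exactly 9 patches fit; the averaging filter is an illustrative choice.

```python
import numpy as np

def conv2d_strided(image, F, stride=2):
    """Convolution that moves the window `stride` pixels at a time."""
    H, W = image.shape
    k = F.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + k, c:c + k] * F)
    return out

image = np.ones((7, 7))                # 7x7 input: (7 - 3)//2 + 1 = 3 per side
F = np.full((3, 3), 1.0 / 9.0)         # illustrative averaging filter
print(conv2d_strided(image, F).shape)  # (3, 3): only 9 output pixels
```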
64 CONVOLUTIONAL NEURAL NETS 64
65 Deep Learning Outline Background: Computer Vision Image Classification ILSVRC Traditional Feature Extraction Methods Convolution as Feature Extraction Convolutional Neural Networks (CNNs) Learning Feature Abstractions Common CNN Layers: Convolutional Layer Max-Pooling Layer Fully-connected Layer (w/ tensor input) Softmax Layer ReLU Layer Background: Subgradient Architecture: LeNet Architecture: AlexNet Training a CNN SGD for CNNs Backpropagation for CNNs 65
66 Convolutional Neural Network (CNN) Typical layers include: Convolutional layer Max-pooling layer Fully-connected (Linear) layer ReLU layer (or some other nonlinear activation function) Softmax These can be arranged into arbitrarily deep topologies Architecture #1: LeNet-5 66
67 Convolutional Layer CNN key idea: Treat the convolution matrix as parameters and learn them! Input Image Learned Convolution θ_11 θ_12 θ_13 θ_21 θ_22 θ_23 θ_31 θ_32 θ_33 Convolved Image
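Once the convolution weights theta are parameters, backprop needs dJ/dtheta as well. For out[i, j] = Σ_{m,n} theta[m, n] x[i+m, j+n], the gradient is dJ/dtheta[m, n] = Σ_{i,j} (dJ/dout[i, j]) x[i+m, j+n], itself a convolution of the input with the upstream gradient. The sketch below illustrates this under that cross-correlation convention; the upstream gradient is a placeholder.

```python
import numpy as np

def conv2d(x, theta):
    # Forward: out[i, j] = sum_{m,n} theta[m, n] * x[i + m, j + n]
    k = theta.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * theta)
    return out

def conv2d_grad_theta(x, dJ_dout, k=3):
    # dJ/dtheta[m, n] = sum_{i,j} dJ/dout[i, j] * x[i + m, j + n]
    dJ_dtheta = np.zeros((k, k))
    h, w = dJ_dout.shape
    for m in range(k):
        for n in range(k):
            dJ_dtheta[m, n] = np.sum(dJ_dout * x[m:m + h, n:n + w])
    return dJ_dtheta

x = np.random.rand(5, 5)
theta = np.random.rand(3, 3)
out = conv2d(x, theta)                # 3x3 output
dJ_dout = np.ones_like(out)           # placeholder upstream gradient
print(conv2d_grad_theta(x, dJ_dout))  # same 3x3 shape as theta
```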
68 Downsampling by Averaging Downsampling by averaging used to be a common approach This is a special case of convolution where the weights are fixed to a uniform distribution The example below uses a stride of 2 Input Image Convolution: a 2x2 kernel with all weights 1/4 Convolved Image (example output values only partially recovered: 3/4, 3/4, 1/4, 3/4, 1/4, 0, 1/…)
69 Max-Pooling Max-pooling is another (common) form of downsampling Instead of averaging, we take the max value within the same range as the equivalently-sized convolution The example below uses a stride of 2 Input Image Max-pooling: y_{i,j} = max(x_{i,j}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1}) Max-Pooled Image
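A sketch of both downsampling operations in NumPy: 2x2 max-pooling with stride 2 as on this slide, and the previous slide's downsampling-by-averaging recovered by swapping the reduction from max to mean. The example image is illustrative.

```python
import numpy as np

def pool2d(image, size=2, stride=2, reduce=np.max):
    """Apply `reduce` over each size x size window, moving by `stride`."""
    H, W = image.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce(image[r:r + size, c:c + size])
    return out

image = np.array([[1, 0, 2, 3],
                  [4, 6, 6, 8],
                  [3, 1, 1, 0],
                  [1, 2, 2, 4]], dtype=float)  # illustrative 4x4 input
print(pool2d(image))                  # max-pooled:     [[6, 8], [3, 4]]
print(pool2d(image, reduce=np.mean))  # average-pooled, as on previous slide
```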
70 Multi-Class Output Output Hidden Layer Input 71
71 Multi-Class Output Softmax Layer: y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)
(F) Loss: J = Σ_{k=1}^K y*_k log(y_k)
(E) Output (softmax): y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)
(D) Output (linear): b_k = Σ_{j=0}^D β_{kj} z_j, ∀k
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: given x_i, ∀i 72
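A sketch of the softmax layer and the objective above in NumPy; subtracting the max before exponentiating is a standard numerical-stability trick, not something on the slide, and the b values and one-hot label are illustrative.

```python
import numpy as np

def softmax(b):
    # y_k = exp(b_k) / sum_l exp(b_l); subtracting max(b) leaves the result
    # unchanged but avoids overflow for large b_k.
    e = np.exp(b - np.max(b))
    return e / np.sum(e)

b = np.array([2.0, 1.0, -1.0])      # illustrative linear outputs b_k
y = softmax(b)
print(y, y.sum())                   # probabilities that sum to 1

y_star = np.array([1.0, 0.0, 0.0])  # illustrative one-hot true label
J = np.sum(y_star * np.log(y))      # the slide's objective sum_k y*_k log y_k
print(J)
```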
72 Whiteboard Training a CNN SGD for CNNs Backpropagation for CNNs 73
73 Whiteboard Common CNN Layers ReLU Layer Background: Subgradient Fully-connected Layer (w/ tensor input) Softmax Layer Convolutional Layer Max-Pooling Layer 74
74 Convolutional Layer 75
75 Convolutional Layer 76
76 Max-Pooling Layer 77
77 Max-Pooling Layer 78
78 Convolutional Neural Network (CNN) Typical layers include: Convolutional layer Max-pooling layer Fully-connected (Linear) layer ReLU layer (or some other nonlinear activation function) Softmax These can be arranged into arbitrarily deep topologies Architecture #1: LeNet-5 79
79 Architecture #2: AlexNet CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2012) 15.3% error on ImageNet LSVRC contest Input image (pixels) Five convolutional layers (w/ max-pooling) Three fully connected layers 1000-way softmax 80
80 CNNs for Image Recognition Slide from Kaiming He 81
81 CNN VISUALIZATIONS 83
82 3D Visualization of CNN
83 Convolution of a Color Image Color images consist of 3 floats per pixel for RGB (red, green, blue) color values The convolution must also be 3-dimensional: convolving a 32x32x3 image with a 5x5x3 filter (slid over all spatial locations) yields a 28x28 activation map Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
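A sketch of this 3-dimensional convolution: the 5x5x3 filter spans all three color channels, so each placement yields one number, and sliding over all spatial positions of a 32x32x3 image gives a 28x28 activation map (32 - 5 + 1 = 28). Shapes below are from the figure; the random values are placeholders.

```python
import numpy as np

def conv_rgb(image, F):
    """Convolve an H x W x C image with a k x k x C filter: the filter spans
    all C channels, so each placement produces a single output number."""
    H, W, C = image.shape
    k = F.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k, :] * F)
    return out

image = np.random.rand(32, 32, 3)  # 32x32 RGB image, random placeholder
F = np.random.rand(5, 5, 3)        # one 5x5x3 filter, random placeholder
print(conv_rgb(image, F).shape)    # (28, 28) activation map
```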
84 Animation of 3D Convolution http://cs231n.github.io/convolutional-networks/ Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N) 86
85 MNIST Digit Recognition with CNNs (in your browser) Figure from Andrej Karpathy 87
86 CNN Summary CNNs Are used for all aspects of computer vision, and have won numerous pattern recognition competitions Able to learn interpretable features at different levels of abstraction Typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers Other Resources: Readings on course website Andrej Karpathy, CS231n Notes http://cs231n.github.io/convolutional-networks/ 88