Introduction to Deep Learning CMPT 733. Steven Bergner

2 Overview: renaissance of artificial neural networks; representation learning vs. feature engineering; background (linear algebra, optimization); regularization; construction and training of layered learners; frameworks for deep learning

4 Representations matter: transform into the right representation; classify points simply by a threshold on the radius axis; a single neuron with a nonlinearity can do this
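The change of representation described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the sample points, the threshold value and the helper names `to_polar` and `classify_by_radius` are hypothetical and do not come from the slides.

```python
import math

def to_polar(x, y):
    """Map Cartesian (x, y) to polar (radius, angle)."""
    return math.hypot(x, y), math.atan2(y, x)

def classify_by_radius(x, y, threshold=1.5):
    """Label a point 1 if it lies outside the threshold circle, else 0."""
    radius, _ = to_polar(x, y)
    return 1 if radius > threshold else 0

# Hypothetical data: an inner cluster (class 0) and an outer ring (class 1).
inner = [(0.5, 0.0), (0.0, -0.7), (-0.4, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.2), (-1.5, 1.5)]

labels_inner = [classify_by_radius(x, y) for x, y in inner]
labels_outer = [classify_by_radius(x, y) for x, y in outer]
print(labels_inner, labels_outer)
```

In Cartesian coordinates no single straight line separates the two groups, while in polar coordinates one comparison on the radius is enough.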

5 Depth: layered composition

6 Computational graph

7 Components of learning [figure: pipeline from input to output, increasingly automated: hand-designed program; simple features; abstract features; mapping from features]

8 Growing dataset size: MNIST dataset

9 Basics: linear algebra and optimization

10 Linear algebra: a tensor is an array of numbers, possibly multi-dimensional (0d scalar, 1d vector, 2d matrix/image, 3d RGB image); matrix (dot) product of an m x n matrix A and an n x p matrix B; dot product of vectors A and B (m = p = 1 in this notation, n = 2)
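A small dependency-free sketch of these definitions follows. It assumes the usual convention that A is m x n and B is n x p (the formula on the original slide did not survive extraction), and the helper names `ndim` and `matmul` are hypothetical; in practice a library such as NumPy would be used instead of nested lists.

```python
def ndim(t):
    """Number of nested list levels: 0 for a scalar, 1 for a vector, and so on."""
    return 1 + ndim(t[0]) if isinstance(t, list) else 0

def matmul(A, B):
    """Matrix product C = AB for A (m x n) and B (n x p): C[i][j] = sum_k A[i][k] * B[k][j]."""
    n = len(B)
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    p = len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(n)) for j in range(p)] for row in A]

scalar = 3.0                           # 0d
vector = [1.0, 2.0]                    # 1d
matrix = [[1.0, 2.0], [3.0, 4.0]]      # 2d
rgb = [[[0, 0, 0], [255, 255, 255]]]   # 3d: height x width x channel

# Dot product as the m = p = 1, n = 2 special case: (1 x 2) times (2 x 1).
a_row = [[1.0, 2.0]]
b_col = [[3.0], [4.0]]
dot = matmul(a_row, b_col)             # a 1 x 1 matrix holding 1*3 + 2*4
print(ndim(rgb), dot)
```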

14 Linear algebra: norms
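The slide text does not say which norms were shown. As an assumption, the sketch below covers three that are common in deep learning (L1, L2 and max norm); the function names are hypothetical.

```python
def l1_norm(v):
    """Sum of absolute values."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Euclidean length."""
    return sum(x * x for x in v) ** 0.5

def max_norm(v):
    """Largest absolute entry."""
    return max(abs(x) for x in v)

v = [3.0, -4.0]
print(l1_norm(v), l2_norm(v), max_norm(v))
```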

15 Nonlinearities: ReLU, Softplus, Logistic Sigmoid [(c) public domain]
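The three nonlinearities can be written down directly from their standard definitions. This is a minimal scalar sketch; the overflow-safe formulations are an implementation choice made here, not something taken from the slides.

```python
import math

def relu(x):
    """max(0, x)"""
    return max(0.0, x)

def softplus(x):
    """log(1 + exp(x)), rearranged so that exp never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """1 / (1 + exp(-x)), branching on the sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), softplus(x), sigmoid(x))
```

Softplus is a smooth approximation of ReLU, and its derivative is the logistic sigmoid.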

16 Approximate optimization

17 Gradient descent
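Gradient descent repeatedly steps against the gradient. The sketch below applies it to a hypothetical one-dimensional objective f(x) = (x - 3)^2; the objective, learning rate and step count are illustrative assumptions.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeat x <- x - learning_rate * grad(x) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Hypothetical objective f(x) = (x - 3)^2 with gradient 2 (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per step, so the iterate ends up very close to 3.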

20 Critical points: at a saddle point the 1st derivative vanishes (in 1D, e.g. x^3 at x = 0, the 2nd derivative vanishes too); poor conditioning: 1st derivative large in one direction and small in another
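In more than one dimension, a common way to tell critical points apart is the sign pattern of the Hessian eigenvalues, and their ratio gives one measure of conditioning. The sketch below handles only symmetric 2 x 2 Hessians [[a, b], [b, d]]; the function names and test functions are hypothetical.

```python
import math

def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def classify_critical_point(a, b, d, tol=1e-12):
    """Second-derivative test at a point where the gradient is zero."""
    lo, hi = symmetric_2x2_eigenvalues(a, b, d)
    if lo > tol:
        return "minimum"
    if hi < -tol:
        return "maximum"
    if lo < -tol and hi > tol:
        return "saddle"
    return "degenerate"  # some eigenvalue is zero, so the test is inconclusive

def condition_number(a, b, d):
    """Ratio of largest to smallest eigenvalue magnitude (assumes none is zero)."""
    lo, hi = symmetric_2x2_eigenvalues(a, b, d)
    return max(abs(lo), abs(hi)) / min(abs(lo), abs(hi))

print(classify_critical_point(2.0, 0.0, 2.0))    # Hessian of x^2 + y^2
print(classify_critical_point(2.0, 0.0, -2.0))   # Hessian of x^2 - y^2
print(condition_number(2.0, 0.0, 200.0))         # Hessian of x^2 + 100 y^2
```

A large condition number means the curvature differs strongly between directions, which is the situation where a single learning rate is too large for one direction and too small for another.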

21 TensorFlow Playground: try out simple network configurations (ssify2d.html); visualize linear and non-linear mappings

22 Regularization: modifications intended to reduce generalization error but not training error

25 Constrained optimization: squared L2 encourages small weights; L1 encourages sparsity of model parameters (weights) [figure labels: unregularized objective, L2 regularizer]
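The different effects of the two penalties can be seen on a single weight. The sketch below uses the penalized form (loss plus alpha times penalty), which is an assumption about how the slide's constrained view would be implemented, and compares the closed-form minimizers of a one-dimensional quadratic with each penalty. All function names are hypothetical.

```python
def l2_penalty(weights):
    """Squared L2 norm of the weights."""
    return sum(w * w for w in weights)

def l1_penalty(weights):
    """L1 norm of the weights."""
    return sum(abs(w) for w in weights)

def regularized_objective(loss, weights, penalty, alpha):
    """Unregularized loss plus alpha times the penalty."""
    return loss + alpha * penalty(weights)

def l2_shrink(w, step):
    """Minimizer over u of 0.5 (u - w)^2 + 0.5 * step * u^2: scales w toward zero."""
    return w / (1.0 + step)

def l1_shrink(w, step):
    """Minimizer over u of 0.5 (u - w)^2 + step * |u| (soft thresholding)."""
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0

weights = [2.0, -0.3, 0.05]
after_l2 = [l2_shrink(w, 0.5) for w in weights]
after_l1 = [l1_shrink(w, 0.5) for w in weights]
print(after_l2)
print(after_l1)
```

The L2 step makes every weight smaller but leaves none exactly zero, while the L1 step sets the two small weights to exactly zero, which is the sparsity effect named on the slide.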

26 Dataset augmentation

28 Learning curves: early stopping before validation error starts to increase
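One common way to implement early stopping is to track the best validation error seen so far and stop once it has not improved for a fixed number of epochs. The sketch below assumes such a patience rule; the function name, the patience value and the error values are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops at the first epoch that lies `patience` epochs past the
    best epoch without improvement, or at the last epoch if that never happens.
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(validation_errors) - 1

val_errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.40]
best, stopped = early_stopping_epoch(val_errors)
print(best, stopped)
```

In a real training loop the model parameters from the best epoch would be saved and restored when training stops.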

30 Bagging: average multiple models trained on subsets of the data; first subset learns top loop, second subset learns bottom loop
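A minimal sketch of bagging follows, assuming bootstrap resampling (drawing with replacement) as the way to form the subsets and a least-squares line as the base model. Both choices, the data and the names `fit_line` and `bagged_predict` are illustrative assumptions rather than the setup shown on the slide.

```python
import random

def fit_line(points):
    """Least-squares fit y = slope * x + intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0.0:  # all x equal in this resample: fall back to a constant model
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def bagged_predict(points, x_new, n_models=25, seed=0):
    """Average the predictions of models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap resample: draw len(points) items with replacement.
        sample = [rng.choice(points) for _ in points]
        slope, intercept = fit_line(sample)
        predictions.append(slope * x_new + intercept)
    return sum(predictions) / len(predictions)

# Hypothetical noiseless data on the line y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
prediction = bagged_predict(data, x_new=4.0)
print(prediction)
```

With noisy data the individual models would disagree, and averaging them is what reduces the variance of the combined prediction.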

31 Dropout: a random sample of connection weights is set to zero; train a different network model each time; learn more robust, generalizable features
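The following sketch shows the widely used "inverted dropout" variant, which zeroes unit activations rather than individual connection weights and rescales the survivors so the expected activation is unchanged. Using this variant is an assumption made here for illustration; the function name and parameter values are hypothetical.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each activation with probability drop_prob,
    and scale the kept ones by 1 / (1 - drop_prob)."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(sum(1 for a in dropped if a == 0.0), "of", len(dropped), "activations dropped")
```

Each call draws a new random mask, so every training step effectively trains a different thinned network; at test time the layer is left unchanged.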

32 Multitask learning: shared parameters are trained with more data; improved generalization error due to increased statistical strength

33 Components of popular architectures

34 Convolution as edge detector
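A tiny example of convolution acting as an edge detector: a two-pixel difference kernel responds only where neighbouring pixels differ. The image and kernel below are hypothetical, and the function computes cross-correlation (no kernel flip), which is what deep learning libraries usually call convolution.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + u][j + v] * kernel[u][v] for u in range(kh) for v in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical 4 x 6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
horizontal_difference = [[-1, 1]]  # right pixel minus left pixel
edges = correlate2d_valid(image, horizontal_difference)
for row in edges:
    print(row)
```

The output is zero everywhere except in the column where the image switches from dark to bright, so the kernel marks the vertical edge.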

38 Gabor wavelets (kernels): local average, first derivative; second derivative (curvature); directional second derivative
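A Gabor kernel is a sinusoid multiplied by a Gaussian envelope. The sketch below uses one standard real-valued parameterization; the parameter names (`wavelength`, `theta`, `sigma`, `gamma`, `phase`) and the values chosen are assumptions for illustration and are not taken from the slides.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, phase=0.0):
    """Real Gabor kernel on a size x size grid centred at the origin (size should be odd)."""
    half = size // 2
    kernel = []
    for row in range(size):
        y = row - half
        values = []
        for col in range(size):
            x = col - half
            # Rotate coordinates so the wave runs along direction theta.
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * x_r / wavelength + phase)
            values.append(envelope * carrier)
        kernel.append(values)
    return kernel

# Even (cosine) kernel: symmetric, with a central peak and negative side lobes.
even = gabor_kernel(size=7, wavelength=4.0, theta=0.0, sigma=2.0)
# Odd (sine) kernel: antisymmetric, responds like a smoothed first derivative.
odd = gabor_kernel(size=7, wavelength=4.0, theta=0.0, sigma=2.0, phase=-math.pi / 2)
print([round(v, 3) for v in even[3]])
print([round(v, 3) for v in odd[3]])
```

Changing `theta` rotates the kernel, which is how a directional derivative-like response is obtained.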

39 Gabor-like learned kernels: feature extractors provided by pretrained networks

41 Max pooling (translation invariance): take the max of a certain neighbourhood; often combined with or followed by downsampling
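A one-dimensional sketch makes the invariance concrete: a feature that moves by one position inside the same pooling window produces the same pooled output. The signal values, window width and stride are hypothetical.

```python
def max_pool_1d(values, width=2, stride=2):
    """Max over each window of `width`, moving by `stride` (stride > 1 downsamples)."""
    return [max(values[i:i + width]) for i in range(0, len(values) - width + 1, stride)]

signal = [0, 0, 5, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 5, 0, 0, 0, 0]  # same feature, moved one step to the right

pooled = max_pool_1d(signal)
pooled_shifted = max_pool_1d(shifted)
print(pooled, pooled_shifted)
```

The invariance is only approximate in general: a shift that moves the feature across a window boundary does change the pooled output.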

42 Max pooling: transform invariance

43 Types of connectivity

49 Choosing architecture family: no structure (fully connected); spatial structure (convolutional); sequential structure (recurrent)

50 Optimization algorithm: lots of variants address the choice of learning rate (see Visualization of Algorithms); AdaDelta and RMSprop often work well
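As an illustration of how such variants adapt the learning rate, here is a minimal sketch of the RMSprop update, which divides each gradient component by a running root-mean-square of its recent values. The objective, hyperparameter values and function name are assumptions chosen for the example.

```python
def rmsprop(grad, x0, learning_rate=0.01, decay=0.9, eps=1e-8, steps=1000):
    """RMSprop: scale each coordinate's step by a running RMS of its gradient."""
    x = list(x0)
    mean_sq = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            mean_sq[i] = decay * mean_sq[i] + (1.0 - decay) * g[i] ** 2
            x[i] -= learning_rate * g[i] / (mean_sq[i] ** 0.5 + eps)
    return x

# Hypothetical poorly conditioned objective f(x, y) = x^2 + 100 y^2.
x_final = rmsprop(lambda p: [2.0 * p[0], 200.0 * p[1]], x0=[1.0, 1.0])
print(x_final)
```

Because each coordinate is normalized by its own gradient scale, both coordinates approach the minimum at a similar pace even though their curvatures differ by a factor of 100. With a fixed learning rate the iterates end up hovering near the minimum rather than converging exactly.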

51 Software for deep learning

52 Current frameworks: TensorFlow / Keras, PyTorch, DL4J, Caffe, and many more; most have a CPU-only mode but run much faster on an NVIDIA GPU

53 Development strategy: identify needs (high accuracy or low accuracy?); choose metric: accuracy (% of examples correct), coverage (% of examples processed), precision TP/(TP+FP), recall TP/(TP+FN), amount of error in case of regression; build end-to-end system; start from baseline, e.g. initialize with pre-trained network; refine driven by data
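The classification metrics named on this slide can be computed from the four confusion counts. The sketch below assumes binary labels encoded as 0 and 1; the label vectors and function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of examples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 here when nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 here when there are no positive examples."""
    return tp / (tp + fn) if tp + fn else 0.0

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn))
```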

54 Sources: I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016 [link]
