Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Neural Networks. Tobias Scheffer


1 Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Neural Networks Tobias Scheffer

2 Overview Neural information processing. Feed-forward networks. Training feed-forward networks, back propagation. Unsupervised learning: auto encoders. Training auto encoders via back propagation. Restricted Boltzmann machines. Convolutional networks.

3 Learning Problems can be Impossible without the Right Features. (Figure: raw image; motorcycle or not?)

4 Learning Problems can be Impossible without the Right Features. (Figure: with the features Wheel, Handlebar, Frame, Wheel detected, deciding "motorcycle or not?" becomes easy.)

5 Learning Problems can be Impossible without the Right Features. Abstract features (higher level): Wheel, Handlebar, Frame, Wheel. A feature function maps the raw data (low level) to these features.

6 Neural Networks. Model of neural information processing. Waves of popularity: Perceptron (Rosenblatt, 1958). The perceptron is only a linear classifier (Minsky, Papert, 1969). Multilayer perceptrons (1990s). Popularity of SVMs (late 1990s). Deep learning (late 2000s). Now state of the art for voice recognition (Google DeepMind) and face recognition (DeepFace, 2014).

7 Neural Networks. Deep learning, unsupervised feature learning: unsupervised discovery of features which can then be used for supervised learning. Implementation on GPUs; able to process vast amounts of data. Seen as a step towards AI.

8 Deep Learning Records. Neural networks are the best-performing algorithms for object classification (CIFAR/NORB/PASCAL VOC benchmarks), video classification (various benchmarks), sentiment analysis (MR benchmark), pedestrian detection, speech recognition, and psychedelic art (Deep Dream).

9 Supervised and Unsupervised Learning. Supervised learning: entire network trained on labeled data. Unsupervised learning: entire network trained on unlabeled data. Unsupervised pre-training + supervised learning: the network (except the top-most layer) is trained layer-wise on unlabeled data; then the entire network is trained on labeled data. Good when many unlabeled but few labeled data are available.

10 Neural Information Processing. Input signals arrive over connections from other nerve cells. The weighted input signals are aggregated; the synaptic weights are strengthened and weakened by learning processes. Output signals are electric spikes sent along the axon; the aggregated input determines the probability of an output spike.

11 Neural Information Processing: Model. Input vector x, synaptic weight vector θ. The weighted input signals are aggregated: h = θᵀx. The output σ(h) models the probability of an output spike.

12 Feed-Forward Networks. Forward propagation: the input layer receives the input vector, x⁰ = x. Linear model: each unit k in layer i has a parameter vector θ_k^i, and layer i has a matrix of parameters Θ^i; h_k^i = θ_k^iᵀ x^(i-1). The network consists of the input layer, hidden layers, and the output layer.

13 Feed-Forward Networks. Forward propagation: input vector x⁰ = x. Linear model: h_k^i = θ_k^iᵀ x^(i-1). Activation function and propagation: x_k^i = σ(h_k^i). The output layer produces the output vector.

14 Feed-Forward Networks. Bias unit: the constant element of the linear model is replaced by an additional unit with constant output 1, x_0^i = 1, so that h_k^i = Σ_{l=0..n_(i-1)} θ_lk^i x_l^(i-1) for k ∈ [1..n_i].

15 Feed-Forward Networks. Forward propagation per layer in vector notation: h^i = Θ^iᵀ x^(i-1), x^i = σ(h^i).
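The per-layer vector form translates almost directly into code. Below is a minimal sketch, assuming NumPy, a logistic activation, and illustrative layer sizes and random weights (none of which come from the slides); bias units are omitted for brevity.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def forward(x, thetas):
    """Forward propagation. thetas[i] is the weight matrix Theta^i of layer i,
    with shape (units in layer i-1) x (units in layer i)."""
    activations = [x]
    for theta in thetas:
        h = theta.T @ activations[-1]    # h^i = Theta^i^T x^(i-1)
        activations.append(sigmoid(h))   # x^i = sigma(h^i)
    return activations

# Illustrative network: 3 inputs, 4 hidden units, 2 output units.
rng = np.random.default_rng(0)
thetas = [rng.normal(scale=0.1, size=(3, 4)), rng.normal(scale=0.1, size=(4, 2))]
print(forward(np.array([1.0, 0.5, -0.2]), thetas)[-1])
```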

16 Feed-Forward Networks. Stochastic gradient descent. Squared loss: R̂(θ) = (1/m) Σ_{j=1..m} ‖y_j − x_j^d‖², where x_j^d is the network output for input x_j. The gradient is the average of the per-instance gradients; the stochastic gradient for a single instance x_j uses only ∇_θ ‖y_j − x_j^d‖².

17 Feed-Forward Nets: Back Propagation. Stochastic gradient for instance x. For the top-level weights θ_k of output unit k: the derivative of the squared error (y_k − σ(θ_kᵀ x̄))² with respect to θ_k is proportional to −(y_k − σ(θ_kᵀ x̄)) σ'(h_k) x̄ = −δ_k x̄, with δ_k = σ'(h_k)(y_k − x_k) and x̄ the activation vector of the layer below.

18 Feed-Forward Nets: Back Propagation. For weights θ_k^i at layer i, the chain rule gives the update direction δ_k^i x^(i-1), with δ_k^i = σ'(h_k^i) Σ_l δ_l^(i+1) θ_kl^(i+1): the δ values are propagated backwards from the output layer through the hidden layers, layer by layer.

19 Activation Function. Any differentiable sigmoidal function is suitable. Example: σ(h) = 1 / (1 + e^(−h)), with derivative σ'(h) = σ(h)(1 − σ(h)).

20 Back Propagation: Algorithm. Iterate over training instances (x, y). Forward propagation: for i = 1..d, for k = 1..n_i: h_k^i = θ_k^iᵀ x^(i-1), x_k^i = σ(h_k^i). Back propagation: for each output unit k = 1..n_d: δ_k^d = σ'(h_k^d)(y_k − x_k^d), θ_k^d' = θ_k^d + ε δ_k^d x^(d-1). For i = d−1..1, for k = 1..n_i: δ_k^i = σ'(h_k^i) Σ_l δ_l^(i+1) θ_kl^(i+1), θ_k^i' = θ_k^i + ε δ_k^i x^(i-1). Repeat until convergence.
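The algorithm above can be sketched compactly for a single training instance. This is an illustrative NumPy sketch (squared loss, logistic activation, no bias units, arbitrary learning rate), not the lecture's reference implementation.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def backprop_update(x, y, thetas, eps=0.1):
    """One stochastic-gradient step of back propagation for instance (x, y)."""
    # Forward propagation: xs[i] holds the activation vector x^i of layer i.
    xs = [x]
    for theta in thetas:
        xs.append(sigmoid(theta.T @ xs[-1]))
    # Output layer: delta_k = sigma'(h_k) * (y_k - x_k), with sigma'(h) = x (1 - x).
    delta = xs[-1] * (1 - xs[-1]) * (y - xs[-1])
    # Walk backwards through the layers, updating weights and propagating delta.
    for i in reversed(range(len(thetas))):
        grad = np.outer(xs[i], delta)                       # update direction for Theta^i
        if i > 0:
            delta = xs[i] * (1 - xs[i]) * (thetas[i] @ delta)
        thetas[i] = thetas[i] + eps * grad                  # theta' = theta + eps * delta * x
    return thetas

# Illustrative usage: one update on random weights.
rng = np.random.default_rng(0)
thetas = [rng.normal(scale=0.1, size=(3, 4)), rng.normal(scale=0.1, size=(4, 2))]
thetas = backprop_update(np.array([1.0, 0.5, -0.2]), np.array([1.0, 0.0]), thetas)
```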

21 Back Propagation. The loss function is not convex: each permutation of the hidden units is a local minimum. Learned features (hidden units) may be acceptable, but are not usually globally optimal. Hope: local minima can still be arbitrarily good; many local minima can be equally good. Reality: back propagation works well for few (one or two) hidden layers.

22 Regularization. L2-regularized loss R̂(θ) = (1/m) Σ_j ‖y_j − x_j^d‖² + λ θᵀθ corresponds to a normal prior on the parameters. The gradient gains an additional term proportional to θ, so the update becomes θ' = θ + ε (δ x − λ θ); this is called weight decay. Additional regularization schemes: early stopping (stop training before convergence); delete units with small weights; dropout (during training, set some units' output to zero at random); normalize the length of the propagated vectors.

23 Regularization: Dropout. In complex networks, complex co-adaptation relationships can form between units; these are not robust for new data. Dropout: in each training step, draw a fraction of the units at random and set their output to zero. At application time, use all units. This improves overall robustness: each unit has to function within varying combinations of units.
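A sketch of how such a dropout mask could be applied to a layer's activations, assuming NumPy; the drop probability of 0.5 is illustrative. (Common implementations additionally rescale the kept activations, which the slide does not discuss.)

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """During training, set each unit's output to zero with probability drop_prob.
    At application time (training=False), use all units."""
    if not training:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask

print(dropout(np.array([0.2, 0.9, 0.5, 0.7]), rng=np.random.default_rng(0)))
```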

24 Regularization: Stochastic Binary Units. Deterministic units propagate x_k^i = σ(h_k^i). Stochastic-binary units calculate the activation σ(h_k^i), then propagate x_k^i = 1 with probability σ(h_k^i) and x_k^i = 0 otherwise. Similar to dropout: with some probability, each unit does not produce output. Biological neurons behave like this.
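A minimal sketch of a stochastic-binary unit, assuming NumPy: compute the activation σ(h) and emit 1 with that probability, 0 otherwise.

```python
import numpy as np

def stochastic_binary(h, rng=None):
    """Propagate 1 with probability sigma(h), and 0 otherwise."""
    rng = rng or np.random.default_rng()
    p = 1.0 / (1.0 + np.exp(-h))          # activation sigma(h)
    return (rng.random(np.shape(p)) < p).astype(float)

print(stochastic_binary(np.array([-2.0, 0.0, 2.0]), np.random.default_rng(0)))
```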

25 Back Propagation: Tricks. Use cross-entropy as the loss for classification. Stochastic gradient on small batches. Permute the training data at random. Decrease the learning rate during optimization. Initialize weights randomly (the origin can be a saddle point). Initialize weights via unsupervised pre-training.

26 Overview Neural information processing. Feed-forward networks. Training feed-forward networks, back propagation. Unsupervised learning: auto encoders. Training auto encoders via back propagation. Restricted Boltzmann machines. Convolutional networks.

27 Auto Encoders. Auto encoders learn the identity function: n input units to n' hidden units to n output units, with n > n'. On the hidden layer, the input has to be compressed. The learning algorithm derives a representation that preserves the information from the input.

28 Auto Encoders: Example. Input: binary vectors with a single 1. 4 input units, 2 hidden units, 4 output units. Inputs: (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1).

29 Auto Encoders: Example. Possible activation of the hidden units after training.


33 Auto Encoders: Example. There are several local minima of the loss function (how many?).

34 Auto Encoders: Example. Input: one unit per pixel; each unit represents the grey value of a pixel. Hidden layer: k units. Output: as many units as the input.

35 Auto Encoders: Example. Each of the hidden units is a detector for a base face. The weights from one hidden unit to the output units encode the image of the base face. Input faces are represented as a combination of these base faces.

36 Auto Encoders: Example. The weights from one hidden unit to the output units encode the image of the base face.

37 Auto Encoders: Example. Feeding an input of 1 into one of the hidden units produces the base face that the hidden unit represents.

38 Auto Encoders: Example. The weights from one hidden unit to the output units encode the image of the base face. Weights from all hidden units to the output units after training with a set of aligned faces:

39 Auto Encoders: Example. After training on hand-written digits.

40 Auto Encoders via Backpropagation. Desired output: y_j = x_j. Empirical risk: R̂ = (1/m) Σ_{j=1..m} ‖x_j − x_j^d‖², where x_j^d is the network's reconstruction of x_j. Train with standard back propagation, using the input as the target output values.

41 Auto Encoders via Backpropagation. Additional regularization: hidden units should be sparse (i.e., have activation 0 most of the time). Minimize the KL divergence between ρ = (ρ, …, ρ) and the average activation x̄¹ of the hidden units: KL(ρ ‖ x̄¹) = Σ_i [ ρ log(ρ / x̄_i¹) + (1 − ρ) log((1 − ρ) / (1 − x̄_i¹)) ]. The backprop update for the hidden layer gains an additional term derived from this sparsity penalty.
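A sketch of the sparsity penalty and its derivative, assuming NumPy; the target activation ρ is illustrative. The derivative with respect to the mean hidden activations is the extra term that enters the modified backprop update.

```python
import numpy as np

def sparsity_penalty(hidden_activations, rho=0.05):
    """KL divergence between the target activation rho and the mean activation
    of each hidden unit, plus its derivative w.r.t. the mean activations."""
    rho_hat = hidden_activations.mean(axis=0)          # mean activation per unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    grad = -rho / rho_hat + (1 - rho) / (1 - rho_hat)  # added to delta in backprop
    return kl, grad

acts = np.random.default_rng(0).random((100, 3)) * 0.2  # 100 instances, 3 hidden units
print(sparsity_penalty(acts))
```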

42 Deep Learning: Stacked Autoencoders. Multiple hidden layers, each layer has fewer units than the previous one; the output layer has as many units as the input layer. The autoencoder has to reproduce the input vector: R̂ = (1/m) Σ_{j=1..m} ‖x_j − x_j^d‖².

43 Stacked Autoencoders: Learning. Step 1: Learn the first autoencoder layer using back propagation; run back propagation until convergence of R̂ = (1/m) Σ_j ‖x_j − x_j^d‖².

44 Stacked Autoencoders: Learning. Step 2: Freeze the weights of the first layer and add another layer. Train the new layers using back propagation of R̂ = (1/m) Σ_j ‖x_j − x_j^d‖².

45 Stacked Autoencoders: Learning. Step i: Freeze all previously trained layers and add another layer; train the new layer using back propagation of R̂ = (1/m) Σ_j ‖x_j − x_j^d‖².

46 Stacked Autoencoders: Learning. Final step: remove the reconstruction (output) layer. The result is a hierarchical feature representation. These features can be very useful for supervised learning, in particular for convolutional networks.

47 Denoising Autoencoders. Additional regularization: reconstruct the input from a corrupted version of the input. Randomly set a fraction of the input to zero; use the uncorrupted input as the target for the loss function. Tends to lead to more robust representations.
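A sketch of the corruption step, assuming NumPy; the corruption fraction is illustrative. The corrupted vector is fed into the network while the uncorrupted input remains the target.

```python
import numpy as np

def corrupt(x, fraction=0.3, rng=None):
    """Randomly set roughly the given fraction of the input components to zero."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= fraction
    return x * mask

x = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
x_tilde = corrupt(x, 0.3, np.random.default_rng(1))
# Train with x_tilde as the network input and the uncorrupted x as the target.
print(x_tilde)
```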

48 Stacked Autoencoder: Example. 2D visualization of a document corpus: a term-frequency (TF) vector over many terms is compressed through successively smaller hidden layers down to 2 units (dimensions), then decompressed again.

49 Auto Encoders: What are they Good for? Hidden units represent a clustering of the inputs. Hidden units are features that can now be used for a classification task, for instance face identification or hand-written letter recognition.

50 Restricted Boltzmann Machine. Unsupervised learning. Input layer and hidden layer. Binary stochastic units, one bias unit per layer (the bias units are not connected to each other). Generative, probabilistic model. Energy function: E(x⁰, x¹) = −(x⁰)ᵀ Θ x¹.

51 Restricted Boltzmann Machine. The energy function corresponds to the negative log-probability of an activation: E(x⁰, x¹) = −(x⁰)ᵀ Θ x¹. Joint distribution: P(x⁰, x¹) = (1/Z) e^(−E(x⁰, x¹)). Marginal distribution: P(x⁰) = (1/Z) Σ_{x¹} e^(−E(x⁰, x¹)). Z is the normalization constant.
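A sketch of the energy function and the corresponding unnormalized probability of a joint state, assuming NumPy and that the bias units are simply included as constant-1 components of x⁰ and x¹; the weights are illustrative.

```python
import numpy as np

def energy(x0, x1, theta):
    """E(x0, x1) = -(x0)^T Theta x1."""
    return -x0 @ theta @ x1

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(4, 3))   # 4 input units x 3 hidden units
x0 = np.array([1.0, 0.0, 1.0, 1.0])          # first component plays the bias role
x1 = np.array([1.0, 1.0, 0.0])
print(np.exp(-energy(x0, x1, theta)))        # proportional to P(x0, x1); Z omitted
```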

52 Inference in RBMs. The RBM is a generative model; it generates states like a Bayesian network. Inference by Markov chain Monte Carlo: iterate over the units, alternating between input and hidden units, and draw each unit's activation given the activations of the neighboring units. After a burn-in phase, the Markov chain of activations is governed by the distribution encoded by the RBM.

53 RBM: Sampling of States. Initialize the states at random. Then alternate between the layers (slides 53-58 show this unit by unit): each hidden unit k is drawn given the input layer, P(x_k¹ = 1 | x⁰) = 1 / (1 + e^(−Σ_l θ_lk x_l⁰)), and each input unit l is drawn given the hidden layer, P(x_l⁰ = 1 | x¹) = 1 / (1 + e^(−Σ_k θ_lk x_k¹)).
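A sketch of this alternating (block) Gibbs sampling, assuming NumPy; bias units are ignored for brevity and the weights are illustrative. Within each layer the units are conditionally independent given the other layer, so a whole layer can be drawn at once.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def gibbs_chain(theta, steps=500, rng=None):
    """Alternate between hidden and input units; after a burn-in phase the
    states follow the distribution encoded by the RBM."""
    rng = rng or np.random.default_rng()
    n_input, n_hidden = theta.shape
    x0 = (rng.random(n_input) < 0.5).astype(float)        # initialize at random
    for _ in range(steps):
        x1 = (rng.random(n_hidden) < sigmoid(theta.T @ x0)).astype(float)
        x0 = (rng.random(n_input) < sigmoid(theta @ x1)).astype(float)
    return x0, x1

theta = np.random.default_rng(0).normal(scale=0.5, size=(6, 3))
print(gibbs_chain(theta, rng=np.random.default_rng(1)))
```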

59 Restricted Boltzmann Machine: Learning. Learning: maximize the log-likelihood of the input vectors, argmax_Θ Σ_{x⁰} log P(x⁰). Gradient: ∂ log p(x⁰)/∂θ_ij = −Σ_{x¹} p(x¹ | x⁰) ∂E(x⁰, x¹)/∂θ_ij + Σ_{x⁰, x¹} p(x⁰, x¹) ∂E(x⁰, x¹)/∂θ_ij, i.e., the energy gradient for the observed input minus the marginal energy gradient. With E(x⁰, x¹) = −(x⁰)ᵀ Θ x¹, the energy gradient is ∂E(x⁰, x¹)/∂θ_ij = −x_i⁰ x_j¹.

60 Restricted Boltzmann Machine: Learning. Gradient: ∂ log p(x⁰)/∂θ_ij = Σ_{x¹} p(x¹ | x⁰) x_i⁰ x_j¹ − Σ_{x⁰, x¹} p(x⁰, x¹) x_i⁰ x_j¹, the expectation under the observed input minus the expectation under the model. Weight update: θ_ij' = θ_ij + ε (x_i h_j − x̃_i h̃_j), where x_i h_j is computed from the observed input and x̃_i h̃_j from the input inferred in an MCMC step.
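A sketch of this update as one step of contrastive divergence, assuming NumPy: the positive statistics come from the observed input, the negative statistics from a single reconstruction (MCMC) step. The learning rate and the omission of bias handling are illustrative simplifications.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def cd1_update(x0, theta, eps=0.1, rng=None):
    """One contrastive-divergence (CD-1) step for a single observed input x0."""
    rng = rng or np.random.default_rng()
    h = sigmoid(theta.T @ x0)                              # hidden given the data
    x1 = (rng.random(h.shape) < h).astype(float)           # sample hidden states
    x0_recon = sigmoid(theta @ x1)                         # reconstructed input
    h_recon = sigmoid(theta.T @ x0_recon)                  # hidden given reconstruction
    return theta + eps * (np.outer(x0, h) - np.outer(x0_recon, h_recon))

theta = np.random.default_rng(0).normal(scale=0.1, size=(6, 3))
x0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
theta = cd1_update(x0, theta, rng=np.random.default_rng(1))
```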

61 Restricted Boltzmann Machine: Example. Unsupervised learning of a representation, similar to the hidden layer of an autoencoder. Weight vectors θ_i after training with a set of aligned faces:

62 Restricted Boltzmann Machine: Example. Unsupervised learning of a representation, similar to the hidden layer of an autoencoder. Weights θ_i after training with a set of hand-written digits:

63 Stacked RBM Learning. Stacked RBMs are analogous to stacked backpropagation autoencoders. Step 1: argmax_{Θ¹} { Σ log P(x⁰) }.

64 Stacked RBM Learning. Stacked RBMs are analogous to stacked backpropagation autoencoders. Step 1: argmax_{Θ¹} { Σ log P(x⁰) }. Step 2: argmax_{Θ²} { Σ log P(x¹) }, where x¹ are the hidden activations produced by the first RBM.

65 RBMs: What are they Good for? Hidden units represent a clustering of the inputs. Hidden units are features that can then be used for supervised learning.

66 Overview Neural information processing. Feed-forward networks. Training feed-forward networks, back propagation. Unsupervised learning: auto encoders. Training auto encoders via back propagation. Restricted Boltzmann machines. Convolutional networks.

67 Convolutional Networks. What happens when an autoencoder is trained with unaligned images of faces? (Figure: training images.)


69 Convolutional Networks. What happens when an autoencoder is trained with unaligned images of faces? The weights become blurry; hidden units tend to represent different face positions rather than different looks of faces.

70 Convolutional Networks. Idea: have detectors find common patterns at different positions of the input image and at different scales.

71 Convolution. Multiplication of a filter with an area of the input gives the intensity of the filter signal at that position: x'_ij = Σ_k Σ_l θ_kl x_(i+k),(j+l). Each pixel in the result image is the result of an n × n convolution. Used for images and audio signals, e.g., detection of edges (grey-value gradients).
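A direct sketch of this formula, assuming NumPy: each output pixel is the weighted sum of the filter with the image patch at that position (only positions where the filter fits entirely, so the result image is smaller).

```python
import numpy as np

def convolve2d(image, kernel):
    """x'_{ij} = sum_{k,l} theta_{kl} * x_{i+k, j+l}, over valid positions."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * image[i:i + kh, j:j + kw])
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[-1.0, 0.0, 1.0]] * 3)   # horizontal grey-value gradient
print(convolve2d(image, edge_filter))
```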

72 Convolutional Networks. Apply one or several filters to every position, and possibly at several scales. Each unit produces the output of the same filter for a different position; all units for one filter share the same weights. Example: a convolutional network with fixed scale and a single filter.

73 Multiple Convolutions. Multiple detectors per location and scale, e.g., edges of varying orientation and edges of varying scale. Results in an array of convolved result images.

74 Convolutional Networks: Example. Local receptive fields; shared weights (all units that compute the same filter share one weight vector).

75 Convolutional Networks: Example. A convolutional layer with k filters takes an input image and produces k images. The k images have to be aggregated by pooling layers.

76 Convolutional Networks: Pooling Layers. Max-pooling layer: split the layer into non-overlapping areas; for each area, pass on the maximal value to the next layer. (Figure: receptive field, aggregated feature.)
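A sketch of a max-pooling layer, assuming NumPy and an input whose side lengths are multiples of the pooling size:

```python
import numpy as np

def max_pool(image, size=2):
    """Split the image into non-overlapping size x size areas; keep each maximum."""
    h, w = image.shape
    blocks = image.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.array([[1., 3., 2., 0.],
                  [4., 2., 1., 1.],
                  [0., 1., 5., 6.],
                  [2., 2., 7., 0.]])
print(max_pool(image))   # -> [[4. 2.] [2. 7.]]
```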

77 Stacked Convolutional Networks. Local receptive fields are trained layer-wise, e.g., as a restricted Boltzmann machine. Weight coupling has to be observed in the implementation (one parameter vector per filter, applied to all areas and at all scales). Iteratively, train the next layer on the output of the previous pooling layer.

78 Stacked Convolutional RBM. Trained on faces, cars, motorbikes, and airplanes.

79 DeepFace: Face Identification. Discriminative layer: same person or not? This layer is trained on labeled data. Below it, a stack of convolution and pooling layers; the convolutional layers are trained with unlabeled data.

80 GPU Training. GPUs are suitable for parallelizing neural network training: matrix multiplications, convolutions, element-wise operations. GPU software: CUDA (NVIDIA C API), OpenCL (not specific to NVIDIA), PyCUDA (Python API), PyOpenCL (not specific to NVIDIA).

81 Deep Learning. Step-wise transformation of the input space into more abstract feature spaces; the features emerge as the solution of an optimization problem. Image processing: pixels → local grey-value gradients → object parts → objects. Natural-language text: characters → words → chunks → clauses → sentences. Spoken language: signal → spectral band → phone → phoneme → word.

82 Summary. Supervised neural network learning: stochastic gradient, back propagation. Unsupervised learning: autoencoders learn features that preserve the information in the training instances; stacked autoencoders. (Stacked) restricted Boltzmann machines: generative models; inference by sampling. (Stacked) convolutional networks: learn filters, apply filters to different regions and scales, aggregate filter banks by pooling layers; increasingly abstract detectors applied to regions.

83 Seminar. Lecture on Neural Networks? 3 x 3 minutes lecture incl. some time for questions. Topics: natural-language description of images, word2vec, speech recognition.
