Antonio Torralba, Lecture 6 Pyramids. Learned feedforward visual processing


1 Antonio Torralba, 2016 Lecture 6 Pyramids Learned feedforward visual processing

2 The Gaussian pyramid (original image) 2

3 Image down-sampling (↓2): blur, then subsample. Image up-sampling (↑2). 3
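
As a rough illustration of these two operations (not the lecture's code), here is a NumPy/SciPy sketch. The 5-tap binomial kernel and the factor-of-2 sampling are assumptions; the slides elide the exact filter.

```python
import numpy as np
from scipy.ndimage import convolve1d

# Assumed 5-tap binomial low-pass filter (the slides do not give the exact kernel).
kernel = np.array([1., 4., 6., 4., 1.]) / 16.

def blur(img):
    """Separable low-pass filtering along rows and columns."""
    img = convolve1d(img, kernel, axis=0, mode='reflect')
    img = convolve1d(img, kernel, axis=1, mode='reflect')
    return img

def downsample(img):
    """Blur, then keep every other row and column (the down-by-2 step)."""
    return blur(img)[::2, ::2]

def upsample(img):
    """Insert zeros between samples, then low-pass filter (the up-by-2 step).
    The gain of 4 compensates for the inserted zeros so the mean is preserved."""
    up = np.zeros((2 * img.shape[0], 2 * img.shape[1]))
    up[::2, ::2] = img
    return 4 * blur(up)
```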

4 Image up-sampling: start by inserting zeros between samples.

5 Convolution and up-sampling as a matrix multiply (1D case) Insert zeros between pixels, then apply a low-pass filter, [ ]

6 The Laplacian Pyramid. Synthesis: compute the difference between each Gaussian pyramid level and the upsampled version of the next (coarser) Gaussian pyramid level. Each level is a band-pass filter: it represents spatial frequencies (largely) unrepresented at the other levels. 6

7 Laplacian pyramid algorithm 7

8 Showing, at full resolution, the information captured at each level of a Gaussian (top) and Laplacian (bottom) pyramid. 8

9 Laplacian pyramid reconstruction algorithm: recover x1 from L1, L2, L3 and x4.
G_i is the blur-and-downsample operator at pyramid level i; F_i is the blur-and-upsample operator at pyramid level i.
Laplacian pyramid elements:
L1 = (I − F1 G1) x1
L2 = (I − F2 G2) x2
L3 = (I − F3 G3) x3
with x2 = G1 x1, x3 = G2 x2, x4 = G3 x3.
Reconstruction of the original image (x1) from the Laplacian pyramid elements:
x3 = L3 + F3 x4
x2 = L2 + F2 x3
x1 = L1 + F1 x2  9
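
A sketch of how these equations translate into code, reusing the hypothetical blur/downsample/upsample helpers from the earlier sketch (an illustration, not the course's implementation):

```python
def build_laplacian_pyramid(x, n_levels=3):
    """L_i = (I - F_i G_i) x_i; the last element stored is the low-pass residual."""
    pyramid = []
    for _ in range(n_levels):
        down = downsample(x)                            # G_i x_i
        up = upsample(down)[:x.shape[0], :x.shape[1]]   # F_i G_i x_i (cropped to size)
        pyramid.append(x - up)                          # L_i
        x = down
    pyramid.append(x)                                   # low-pass residual (e.g. x4)
    return pyramid

def collapse_laplacian_pyramid(pyramid):
    """Reconstruction: x_i = L_i + F_i x_{i+1}, from coarse to fine."""
    x = pyramid[-1]
    for L in reversed(pyramid[:-1]):
        x = L + upsample(x)[:L.shape[0], :L.shape[1]]
    return x
```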

10 Laplacian pyramid reconstruction algorithm: recover x1 from L1, L2, L3 and the low-pass residual g

11 Gaussian pyramid

12 (Low-pass residual) Laplacian pyramid

13 1-D Laplacian pyramid matrix, for a [ ] low-pass filter. [Figure: the matrix rows group into high-frequency, mid-band-frequency, and low-frequency bands.] 13

14 Laplacian pyramid applications: texture synthesis, image compression, noise removal. Also related to SIFT. 14

15 Image blending 15

16 Szeliski, Computer Vision.

17 Image blending: build a Laplacian pyramid for both images (LA, LB); build a Gaussian pyramid for the mask (G); build a combined Laplacian pyramid, L(j) = G(j) LA(j) + (1 − G(j)) LB(j); collapse L to obtain the blended image. 17
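
A sketch of this blending recipe, assuming the hypothetical pyramid helpers from the earlier sketches (grayscale images and a mask in [0, 1] are assumed):

```python
def build_gaussian_pyramid(x, n_levels=3):
    """Progressively blurred and subsampled versions of x (levels 0 .. n_levels)."""
    pyramid = [x]
    for _ in range(n_levels):
        x = downsample(x)
        pyramid.append(x)
    return pyramid

def blend(A, B, mask, n_levels=3):
    """L(j) = G(j) * LA(j) + (1 - G(j)) * LB(j), then collapse L."""
    LA = build_laplacian_pyramid(A, n_levels)
    LB = build_laplacian_pyramid(B, n_levels)
    G = build_gaussian_pyramid(mask, n_levels)
    L = [g * la + (1 - g) * lb for g, la, lb in zip(G, LA, LB)]
    return collapse_laplacian_pyramid(L)
```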

18 Image pyramids. Gaussian: progressively blurred and subsampled versions of the image; adds scale invariance to fixed-size algorithms. Laplacian: shows the information added in the Gaussian pyramid at each spatial scale; useful for noise reduction & coding. These pyramids do not encode orientation. 18

19 Gaussian derivatives: Steerability.
g_x(x, y) = ∂g(x, y)/∂x = −x/(2πσ^4) · exp(−(x^2 + y^2)/(2σ^2))
g_y(x, y) = ∂g(x, y)/∂y = −y/(2πσ^4) · exp(−(x^2 + y^2)/(2σ^2))
What about other orientations, not axis aligned?

20 Gaussian derivatives: Steerability. [Figure: g_x(x, y), g_y(x, y), g_60(x, y), and an input image filtered with g_60(x, y).] g_60(x, y) = cos(60°) g_x(x, y) + sin(60°) g_y(x, y) = (1/2) g_x(x, y) + (√3/2) g_y(x, y)

21 Gaussian derivatives: Steerability. For the Gaussian derivatives, any orientation can be obtained as a linear combination of two basis functions: g_α(x, y) = cos(α) g_x(x, y) + sin(α) g_y(x, y). In general, a kernel is steerable if any rotation of it can be obtained as a linear combination of N basis functions. Steerability of Gaussian derivatives: Freeman & Adelson '92.
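
A minimal NumPy sketch of this steering equation (the kernel size and σ are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_derivative_kernels(sigma=2.0, radius=6):
    """Sample the two basis kernels g_x and g_y on a (2*radius+1)^2 grid."""
    r = np.arange(-radius, radius + 1)
    x, y = np.meshgrid(r, r)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    gx = -x / sigma**2 * g      # = -x/(2 pi sigma^4) exp(-(x^2+y^2)/(2 sigma^2))
    gy = -y / sigma**2 * g
    return gx, gy

def steered_kernel(alpha_deg):
    """g_alpha = cos(alpha) g_x + sin(alpha) g_y, the steering equation above."""
    gx, gy = gaussian_derivative_kernels()
    a = np.deg2rad(alpha_deg)
    return np.cos(a) * gx + np.sin(a) * gy

g60 = steered_kernel(60)        # = 0.5 * g_x + (sqrt(3)/2) * g_y, as on the previous slide
```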

22 Steerable filters. N-th order derivatives of Gaussians are steerable with N+1 basis functions. In general, if a function can be decomposed in a Fourier basis in polar coordinates with a finite number of polar terms, then the function is steerable. Steerability of Gaussian derivatives: Freeman & Adelson '92.

23 Steerable Pyramids We can extend the oriented filters into a multi-scale pyramid Simoncelli, Freeman, Adelson

24 Steerable Pyramids Simoncelli, Freeman, Adelson

25 Steerable Pyramid We may combine Steerability with Pyramids to get a Steerable Laplacian Pyramid as shown below Decomposition Images from:

26 Steerable Pyramid We may combine Steerability with Pyramids to get a Steerable Laplacian Pyramid as shown below Decomposition Images from:

27 Steerable Pyramid We may combine Steerability with Pyramids to get a Steerable Laplacian Pyramid as shown below Decomposition Reconstruction Images from:

28 There is also a high pass residual Shiftable MultiScale Transforms, by Simoncelli et al., IEEE Transactions on Information Theory, 1992

29 Steerable pyramid: the pixel image is filtered into multiple orientations at one scale, then multiple orientations at the next scale, and so on. Over-complete representation, but non-aliased subbands. 29

30 Image pyramids Gaussian Progressively blurred and subsampled versions of the image. Adds scale invariance to fixed-size algorithms. Laplacian Shows the information added in Gaussian pyramid at each spatial scale. Useful for noise reduction & coding. Steerable pyramid Shows components at each scale and orientation separately. Non-aliased subbands. Good for texture and feature analysis. But overcomplete and with HF residual.

31 Image transformations: DFT, Gaussian pyramid, Laplacian pyramid, Steerable pyramid

32 Matlab resources for pyramids (with tutorial) 32

33 Matlab resources for pyramids (with tutorial) 33

34 Chapter 3: Image Processing

35 Why use these representations? Handle real-world size variations with a constant-size vision algorithm. Remove noise. Analyze texture. Recognize objects. Label image features. Image priors can be specified naturally in terms of wavelet pyramids. 35

36 Phase-based pipeline (SIGGRAPH 2013): decompose with a complex steerable pyramid [Simoncelli and Freeman 1995] into amplitude and phase, apply temporal filtering to the phases, multiply by gains, then invert the pyramid.

37 input / motion magnified

38 input / motion magnified

39 Antonio Torralba, 2016 Lecture 6 Learned feedforward visual processing (Deep learning)

40 What is the best representation? All the previous representations were manually constructed. Could they be learned from data?

41 Linear filtering pyramid architecture. [Diagram: filter outputs weighted by w1, w2, w3, w4, ... and summed.]

42 Convolutional neural network architecture. [Diagram: learned filter weights w1, w2, w3, w4, ... combined by summation.]

43 Perceptrons. Photo by George Nagy. [Timeline: Perceptrons → Minsky and Papert → PDP book → AI winter → Krizhevsky, Sutskever, Hinton.] 43

44 Perceptrons,

45 Minsky and Papert pointed out many limitations of single-layer Perceptrons. Among them: very difficult to compute connectedness 45

46 Minsky and Papert, Perceptrons, 1972. [Timeline: Perceptrons → Minsky and Papert → PDP book → AI winter → Krizhevsky, Sutskever, Hinton.] 46

47 Parallel Distributed Processing (PDP), 1986. [Timeline: Perceptrons → Minsky and Papert → PDP book → AI winter → Krizhevsky, Sutskever, Hinton.] 47

48 XOR problem. [Truth table of inputs and output for XOR.] PDP authors pointed to the backpropagation algorithm as a breakthrough, allowing multi-layer neural networks to be trained. Among the functions that a multi-layer network can represent but a single-layer network cannot: the XOR function. 48
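
To make the XOR point concrete, here is a tiny two-layer network with hand-chosen (not learned) weights and threshold units; a single-layer perceptron cannot represent this function because it is not linearly separable:

```python
def step(a):
    """Threshold unit."""
    return 1.0 if a >= 0 else 0.0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden unit 1: OR
    h2 = step(x1 + x2 - 1.5)      # hidden unit 2: AND
    return step(h1 - h2 - 0.5)    # output: OR and not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # prints 0, 1, 1, 0
```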

49 LeCun conv nets, 1998 Demos: 49

50 50

51 Neural networks to recognize handwritten digits? Yes. Neural networks for tougher problems? Not really. C. Farabet, C. Poulet, Y. LeCun. 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops).

52 NIPS 2000. NIPS, Neural Information Processing Systems, is the premier conference on machine learning. It evolved from an interdisciplinary conference into a machine learning conference. For the NIPS 2000 conference: title words predictive of paper acceptance: "Belief Propagation" and "Gaussian"; title words predictive of paper rejection: "Neural" and "Network". [Timeline: Perceptrons → Minsky and Papert → PDP book → AI winter → Krizhevsky, Sutskever, Hinton.] 52

53 Krizhevsky, Sutskever, and Hinton, NIPS 2012. [Timeline: Perceptrons → Minsky and Papert → PDP book → AI winter → Krizhevsky, Sutskever, Hinton.] 53

54 Slide from Rob Fergus, NYU

55 Krizhevsky, Sutskever, and Hinton, NIPS

56 Test image; nearby images, according to NN features. Krizhevsky, Sutskever, and Hinton, NIPS

57 Research enthusiasm for artificial neural networks. [Plot of enthusiasm over time (proxy: Geoff Hinton's citations), with markers at Perceptrons; Minsky and Papert, 1972; PDP book, 1986; AI winter, 2000; Krizhevsky, Sutskever, Hinton, 2012.] 57

58 Neural networks. Neural nets are composed of layers of artificial neurons. Each layer computes some function of the layer beneath it. Inputs are mapped in a feed-forward fashion to the output. We consider only feed-forward neural models for the moment, i.e. no cycles.

59 An individual neuron. Input: x (n×1 vector). Parameters: weights w (n×1 vector), bias b (scalar). Activation: a = Σ_{i=1}^{n} x_i w_i + b. Note a is a scalar. Multiplicative interaction between weights and input. Point-wise non-linear function f(·), e.g. f(·) = tanh(·). Output: y = f(a) = f(Σ_{i=1}^{n} x_i w_i + b). Can think of the bias as a weight w_0 connected to a constant input 1: y = f(w̃^T [1; x]).

60 An individual neuron (unit). Input: vector x (size n×1). Unit parameters: vector w (size n×1), bias b (scalar). Unit activation: a = Σ_{i=1}^{n} x_i w_i + b. Output: y = f(a) = f(Σ_{i=1}^{n} x_i w_i + b). [Diagram: inputs x1, x2, x3, ..., xn with weights w1, w2, w3, ..., wn and a bias b feed a summing node a, followed by y = f(a).] f(·) is a point-wise non-linear function, e.g. f(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}). Can think of the bias as a weight w_0 connected to a constant input 1: y = f([w_0, w]^T [1; x]).

61 Single layer network. Input: column vector x (size n×1). Output: column vector y (size m×1). Layer parameters: weight matrix W (size m×n), bias vector b (m×1). Units activation (example: 4 inputs, 3 outputs): a = Wx + b. [Diagram: the weight matrix applied to x, plus b.] Output: y = f(a) = f(Wx + b).
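
In NumPy this layer is a single line; the example below uses the slide's 4-inputs / 3-outputs sizes with random placeholder weights, and tanh as an arbitrary choice of f:

```python
import numpy as np

def layer_forward(x, W, b, f=np.tanh):
    """Single layer: a = W x + b, y = f(a)."""
    return f(W @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)         # input, n = 4
W = rng.standard_normal((3, 4))    # weights mapping 4 inputs to 3 outputs
b = rng.standard_normal(3)         # bias, m = 3
y = layer_forward(x, W, b)         # output, m = 3
```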

62 Non-linearities: sigmoid. f(a) = sigmoid(a) = 1 / (1 + e^{−a}). Interpretation as the firing rate of a neuron. Bounded between [0, 1]. Saturation for large +ve, −ve inputs: gradients go to zero. Outputs centered at 0.5 (poor conditioning). Not used in practice.

63 Non-linearities: tanh. f(a) = tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a}). Bounded between [−1, +1]. Saturation for large +ve, −ve inputs: gradients go to zero. Outputs centered at 0. Preferable to sigmoid. tanh(x) = 2 sigmoid(2x) − 1.

64 Non-linearities: rectified linear (ReLU). f(a) = max(a, 0). Unbounded output (on the positive side). Efficient to implement: f'(a) = df/da = 0 for a < 0, 1 for a ≥ 0. Also seems to help convergence (see the 6x speedup vs. tanh in Krizhevsky et al.). Drawback: if a unit is strongly in the negative region, it is dead forever (no gradient). Default choice: widely used in current models.

65 Non-linearities: Leaky ReLU. f(a) = max(0, a) + α min(0, a), i.e. f(a) = a for a > 0 and αa for a < 0, where α is small (e.g. 0.02). Efficient to implement: f'(a) = df/da = α for a < 0, 1 for a > 0. Also known as parametric ReLU (PReLU) when α is learned. Has non-zero gradients everywhere (unlike ReLU). α can also be learned (see Kaiming He et al. 2015).
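
For reference, the non-linearities from the last few slides and their derivatives, written as simple NumPy functions (the leaky slope 0.02 follows the slide's example value):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))          # bounded in [0, 1]

def tanh(a):
    return np.tanh(a)                        # = 2 * sigmoid(2a) - 1, bounded in [-1, 1]

def relu(a):
    return np.maximum(a, 0.0)

def relu_grad(a):
    return (a >= 0).astype(float)            # 0 for a < 0, 1 for a >= 0

def leaky_relu(a, alpha=0.02):
    return np.where(a > 0, a, alpha * a)

def leaky_relu_grad(a, alpha=0.02):
    return np.where(a > 0, 1.0, alpha)       # non-zero gradient everywhere
```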

66 Multiple layers. Neural networks are composed of multiple layers of neurons. Acyclic structure. The basic model assumes full connections between layers. Layers between the input and output are called hidden. Various names are used: Artificial Neural Nets (ANN), Multi-layer Perceptron (MLP), Fully-connected network. Neurons are typically called units.

67 Example: 3-layer MLP. By convention, the number of layers is hidden + output (i.e. it does not include the input). So a 3-layer model has 2 hidden layers. Parameters: weight matrices W1, W2, W3; bias vectors b1, b2, b3.

68 Multiple layers. [Diagram: input layer x0; hidden layer 1 computes x1 = F1(x0, W1); hidden layer i computes xi = Fi(xi-1, Wi); output layer n computes xn = Fn(xn-1, Wn).]
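
A sketch of this feed-forward stack, x_i = F_i(x_{i-1}, W_i), as a loop over fully connected layers (layer sizes and weights below are placeholders):

```python
import numpy as np

def mlp_forward(x0, weights, biases, f=np.tanh):
    """Feed-forward pass: x_i = f(W_i x_{i-1} + b_i) for each layer i."""
    x = x0
    for W, b in zip(weights, biases):
        x = f(W @ x + b)
    return x

# Hypothetical 3-layer MLP (2 hidden layers + output), sizes 4 -> 5 -> 5 -> 3.
rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
y = mlp_forward(rng.standard_normal(4), weights, biases)
```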

69 Architecture selection. How to pick the number of layers and units/layer? Active area of research. For fully connected models, 2 or 3 layers seems to be the most that can be effectively trained (more later). Regarding the number of units/layer: the number of parameters grows with (units/layer)^2, and with many units/layer the model can easily overfit.

70 Representational power of a two-layer network. [Diagram: two inputs plus a bias feeding hidden units and an output.] 70

71 Representational power. 1 layer? Linear decision surface. 2+ layers? In theory, can represent any function, assuming a non-trivial non-linearity. References: Bengio 2009; the Bengio, Courville, Goodfellow book; a simple proof by M. Nielsen, http://neuralnetworksanddeeplearning.com/chap4.html; the D. MacKay book. But the issue is efficiency: very wide two layers vs. a narrow deep model? In practice, more layers helps.

72 Training a model: overview. Given a dataset {x, y}, pick an appropriate cost function C. Forward-pass (f-prop) the training examples through the model to get the network output. Get the error by using the cost function C to compare the output to the targets y. Use Stochastic Gradient Descent (SGD) to update the weights, adjusting the parameters to minimize the loss/energy E (the sum of the costs over all training examples).

73 Cost function. Consider a model with N layers. Layer i has a vector of weights Wi. Forward pass: take the input x and pass it through each layer Fi: xi = Fi(xi-1, Wi). The output of layer i is xi. The network output (top layer) is xn. [Diagram: x0 (input) → F1(x0, W1) → x1 → F2(x1, W2) → x2 → ... → Fi(xi-1, Wi) → xi → ... → Fn(xn-1, Wn) → xn (output).]

74 Cost function. Consider a model with N layers. Layer i has a vector of weights Wi. Forward pass: take the input x and pass it through each layer Fi: xi = Fi(xi-1, Wi). The output of layer i is xi. The network output (top layer) is xn. The cost function C compares xn to y. The overall energy is the sum of the cost over all training examples:
E = Σ_{m=1}^{M} C(x_n^m, y^m)
[Diagram: the same layer stack as before, with C(xn, y) comparing the network output to the target y to produce E.]

75 Stochastic gradient descent. We want to minimize the overall loss function E. The loss is the sum of the individual losses over each example. In gradient descent, we start with some initial set of parameters θ and update the parameters: θ_{k+1} = θ_k − η ∂E/∂θ, where k is the iteration index and η is the learning rate (a scalar, set semi-manually). Gradients are computed by b-prop. In stochastic gradient descent, we compute the gradient on a sub-set (batch) of the data. If batchsize = 1 then θ is updated after each example; if batchsize = N (the full set) then this is standard gradient descent. The gradient direction is noisy, relative to the average over all examples (standard gradient descent). 75
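
A minimal sketch of this SGD loop; grad_fn stands in for whatever back-propagation returns, and the batch size, learning rate and epoch count are placeholder values:

```python
import numpy as np

def sgd(theta, data, grad_fn, lr=0.01, batch_size=32, n_epochs=10):
    """theta_{k+1} = theta_k - lr * dE/dtheta, estimated on a random mini-batch.
    theta is assumed to be a flat parameter array; grad_fn(theta, batch) returns
    the gradient of the batch loss with respect to theta."""
    rng = np.random.default_rng(0)
    n = len(data)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            theta = theta - lr * grad_fn(theta, batch)   # noisy gradient descent step
    return theta
```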

76 Stochastic gradient descent. We need to compute the gradients of the cost with respect to the model parameters w_i. Back-propagation is essentially the chain rule of derivatives applied back through the model. Each layer is differentiable with respect to its parameters and its input. 76

77 Computing gradients. Training will be an iterative procedure, and at each iteration we will update the network parameters. We want to compute the gradients ∂E/∂θ_i, where θ = {w_1, w_2, ..., w_n}. 77

78 Computing gradients. To compute the gradients, we could start by writing the full energy E as a function of the network parameters:
E(θ) = Σ_{m=1}^{M} C( F_n( F_{n-1}( ... F_2( F_1(x_0^m, w_1), w_2 ) ..., w_{n-1} ), w_n ), y^m )
and then compute the partial derivatives. Instead, we can use the chain rule to derive a compact algorithm: back-propagation. 78

79 ( (start of a brief aside on matrix calculus)

80 Matrix calculus. Let x be a column vector of size [n×1], x = (x_1, x_2, ..., x_n)^T. We now define a function on the vector x: y = F(x). If y is a scalar, then the derivative of y is a row vector of size [1×n]:
∂y/∂x = [ ∂y/∂x_1, ∂y/∂x_2, ..., ∂y/∂x_n ]
If y is a vector with m components, then (Jacobian formulation) the derivative of y is a matrix of size [m×n] (m rows and n columns) whose (i, j) entry is ∂y_i/∂x_j:
∂y/∂x = [ ∂y_1/∂x_1 ... ∂y_1/∂x_n ; ... ; ∂y_m/∂x_1 ... ∂y_m/∂x_n ]

81 Chain rule: matrix calculus. For the function z = h(x) = f(g(x)), its derivative is h'(x) = f'(g(x)) · g'(x). Writing z = f(u) and u = g(x):
dz/dx |_{x=a} = dz/du |_{u=g(a)} · du/dx |_{x=a}
with sizes [m×n] = [m×p] · [p×n], where p = length of the vector u, m = length of z, and n = length of x. Example: if |z| = 1, |u| = 2, |x| = 4, then h'(x) has size [1×4] = [1×2] · [2×4].

82 Chain rule: matrix calculus. For the function h(x) = f_n(f_{n-1}( ... (f_1(x)) ... )), with u_1 = f_1(x), u_i = f_i(u_{i-1}), and z = u_n = f_n(u_{n-1}), the derivative becomes a product of (Jacobian) matrices:
dz/dx |_{x=a} = dz/du_{n-1} · du_{n-1}/du_{n-2} · ... · du_2/du_1 · du_1/dx |_{x=a}
with each factor evaluated at the corresponding point, u_{n-1} = f_{n-1}(u_{n-2}), ..., u_1 = f_1(a). (Exercise: check that all the matrix dimensions work out.)

83 ) (end of the matrix calculus aside)

84 Computing gradients. The energy E is the sum of the costs associated with each training example (x^m, y^m):
E(θ) = Σ_{m=1}^{M} C(x_n^m, y^m; θ)
Its gradient with respect to the network parameters,
∂E/∂θ_i = Σ_{m=1}^{M} ∂C(x_n^m, y^m; θ)/∂θ_i
is how much E varies when the parameter θ_i is varied. 84

85 Computing gradients. We could write the cost function to get the gradients:
C(x_n, y; θ) = C(F_n(x_{n-1}, w_n), y), with θ = [w_1, w_2, ..., w_n]
If we compute the gradient with respect to the parameters of the last layer (output layer) w_n, using the chain rule:
∂C/∂w_n = ∂C/∂x_n · ∂x_n/∂w_n = ∂C/∂x_n · ∂F_n(x_{n-1}, w_n)/∂w_n
(How much the cost changes when we change w_n is the product of how much the cost changes when we change the output of the last layer and how much that output changes when we change the layer parameters.) 85

86 Computing gradients: cost layer. If we compute the gradient with respect to the parameters of the last layer (output layer) w_n, using the chain rule:
∂C/∂w_n = ∂C/∂x_n · ∂x_n/∂w_n = ∂C/∂x_n · ∂F_n(x_{n-1}, w_n)/∂w_n
The second factor will depend on the layer structure and non-linearity. For example, for a Euclidean loss
C(x_n, y) = (1/2) ||x_n − y||^2
the gradient of the cost is:
∂C/∂x_n = x_n − y  86
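
For completeness, the Euclidean cost and its gradient from this slide as two small functions:

```python
import numpy as np

def euclidean_cost(x_n, y):
    """C(x_n, y) = 0.5 * ||x_n - y||^2"""
    return 0.5 * np.sum((x_n - y) ** 2)

def euclidean_cost_grad(x_n, y):
    """dC/dx_n = x_n - y"""
    return x_n - y
```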

87 Computing gradients: layer i. We could write the full cost function to get the gradients:
C(x_n, y; θ) = C( F_n( F_{n-1}( ... F_2( F_1(x_0, w_1), w_2 ) ..., w_{n-1} ), w_n ), y )
If we compute the gradient with respect to w_i, using the chain rule:
∂C/∂w_i = ∂C/∂x_n · ∂x_n/∂x_{n-1} · ∂x_{n-1}/∂x_{n-2} · ... · ∂x_{i+1}/∂x_i · ∂x_i/∂w_i
The product of all factors but the last is ∂C/∂x_i, and it can be computed iteratively. The last factor, ∂x_i/∂w_i = ∂F_i(x_{i-1}, w_i)/∂w_i, is easy. 87

88 Backpropagation. ∂C/∂w_i = ∂C/∂x_i · ∂F_i(x_{i-1}, w_i)/∂w_i. If we have the gradient at layer i, ∂C/∂x_i, we can compute the gradient at the layer below as:
∂C/∂x_{i-1} = ∂C/∂x_i · ∂x_i/∂x_{i-1} = ∂C/∂x_i · ∂F_i(x_{i-1}, w_i)/∂x_{i-1}
(gradient at layer i → gradient at layer i−1)

89 Backpropagation: layer i. [Diagram: hidden layer i computes F_i(x_{i-1}, W_i); the forward pass carries x_{i-1} up to x_i, the backward pass carries ∂C/∂x_i down to ∂C/∂x_{i-1}.] Layer i has two inputs (during training): x_{i-1} and ∂C/∂x_i. For layer i, we need the derivatives ∂F_i(x_{i-1}, w_i)/∂x_{i-1} and ∂F_i(x_{i-1}, w_i)/∂w_i. We compute the outputs:
x_i = F_i(x_{i-1}, w_i)
∂C/∂x_{i-1} = ∂C/∂x_i · ∂F_i(x_{i-1}, w_i)/∂x_{i-1}
∂C/∂w_i = ∂C/∂x_i · ∂F_i(x_{i-1}, w_i)/∂w_i
The weight update equation is w_i^{k+1} ← w_i^k − η_t ∂E/∂w_i (sum over all training examples to get E).

90 Backpropagation: summary. Forward pass: for each training example, compute the outputs for all layers, x_i = F_i(x_{i-1}, w_i). Backward pass: compute the cost derivatives iteratively from top to bottom, ∂C/∂x_{i-1} = ∂C/∂x_i · ∂F_i(x_{i-1}, w_i)/∂x_{i-1}. Compute the gradients with respect to the weights and update the weights. [Diagram: the layer stack from x_0 (input) through F_1, ..., F_n to x_n (output) and the cost C(x_n, y); the backward pass propagates ∂C/∂x_n, ..., ∂C/∂x_1 down the stack.]
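
Putting the forward and backward passes together, here is a compact, didactic sketch of back-propagation and one SGD step for a fully connected network with tanh units and the Euclidean cost above (a single-example version for clarity, not the lecture's code):

```python
import numpy as np

def forward(x0, weights, biases):
    """Forward pass, keeping pre-activations and layer outputs for the backward pass."""
    xs, activations = [x0], []
    for W, b in zip(weights, biases):
        a = W @ xs[-1] + b
        activations.append(a)
        xs.append(np.tanh(a))
    return xs, activations

def backward(xs, activations, weights, y):
    """Backward pass: propagate dC/dx_i from top to bottom and collect parameter gradients."""
    grads_W, grads_b = [], []
    dC_dx = xs[-1] - y                                        # Euclidean cost gradient at the output
    for i in reversed(range(len(weights))):
        dC_da = dC_dx * (1.0 - np.tanh(activations[i]) ** 2)  # back through tanh
        grads_W.insert(0, np.outer(dC_da, xs[i]))             # dC/dW_i
        grads_b.insert(0, dC_da)                              # dC/db_i
        dC_dx = weights[i].T @ dC_da                          # dC/dx_{i-1}
    return grads_W, grads_b

def sgd_step(x0, y, weights, biases, lr=0.01):
    """One forward pass, one backward pass, one weight update (single training example)."""
    xs, activations = forward(x0, weights, biases)
    grads_W, grads_b = backward(xs, activations, weights, y)
    for i in range(len(weights)):
        weights[i] -= lr * grads_W[i]
        biases[i] -= lr * grads_b[i]

# e.g. sgd_step(x0, y, weights, biases) with the placeholder lists from the earlier MLP sketch.
```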
