CSCI567 Machine Learning (Spring 2019): Convolutional Neural Networks and Kernel Methods. Prof. Victor Adamchik. Feb. 12, 2019


Outline: (1) Review of last lecture, (2) Convolutional neural networks, (3) Kernel methods. (CSCI567 Machine Learning, Spring 2019, Prof. Victor Adamchik, University of Southern California, Feb. 12, 2019.)

Review of last lecture: math formulation. An L-layer neural net can be written as

f(x) = h_L( W_L h_{L-1}( W_{L-1} ... h_1( W_1 x ) ) ).

To ease notation, for a given input x, define recursively

o_0 = x,  a_l = W_l o_{l-1},  o_l = h_l(a_l)   (l = 1, ..., L)

where D_0 = D, D_1, ..., D_L are the numbers of neurons at each layer, W_l ∈ R^{D_l × D_{l-1}} is the weight matrix for layer l, a_l ∈ R^{D_l} is the input to layer l, o_l ∈ R^{D_l} is the output of layer l, and h_l : R^{D_l} → R^{D_l} is the activation function at layer l.

Putting everything into SGD. The backpropagation algorithm (1986) computes the gradient of the cost function for a single training example. Initialize W_1, ..., W_L (all zeros or randomly), then repeat:
1. randomly pick one data point n ∈ [N]
2. forward propagation: for each layer l = 1, ..., L, compute a_l = W_l o_{l-1} and o_l = h_l(a_l) (with o_0 = x_n)
3. backward propagation: for each l = L, ..., 1, compute ∂E_n/∂a_l and update the weights:
   W_l ← W_l − λ ∂E_n/∂W_l = W_l − λ (∂E_n/∂a_l)(o_{l-1})^T
(A minimal code sketch of one such step is given below.)

Outline: (1) Review of last lecture, (2) Convolutional neural networks (motivation, architecture), (3) Kernel methods.

Acknowledgements. The materials used to develop these notes include the following sources: the Stanford course CS231n (taught by Prof. Fei-Fei Li and her students) and Dr. Ian Goodfellow's lectures on deep learning. Both websites provide tons of useful resources: notes, demos, videos, etc.
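To make the procedure concrete, here is a minimal numpy sketch of one SGD/backpropagation step for a two-layer network (L = 2) with a ReLU hidden activation and squared-error loss; the layer sizes, data, and learning rate are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def sgd_step(W1, W2, x, y, lam=0.01):
    """One SGD/backprop step for f(x) = W2 relu(W1 x) with squared-error loss."""
    # forward propagation: a_l = W_l o_{l-1}, o_l = h_l(a_l)
    o0 = x
    a1 = W1 @ o0
    o1 = relu(a1)
    a2 = W2 @ o1
    o2 = a2                                  # identity output activation
    # backward propagation: compute dE/da_l for l = L, ..., 1
    dE_da2 = o2 - y                          # gradient of 0.5 * ||o2 - y||^2
    dE_da1 = (W2.T @ dE_da2) * (a1 > 0)      # chain rule through the ReLU
    # update: W_l <- W_l - lam * (dE/da_l) (o_{l-1})^T
    W2 -= lam * np.outer(dE_da2, o1)
    W1 -= lam * np.outer(dE_da1, o0)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 3))      # D1 = 4 hidden units, D0 = 3 inputs
W2 = rng.normal(scale=0.1, size=(2, 4))      # D2 = 2 outputs
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = sgd_step(W1, W2, x, y)
```

Wrapping this step in a loop over randomly sampled data points gives the full SGD procedure summarized above.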

Image classification: a core task in computer vision. Given an image and a fixed set of discrete labels {dog, cat, truck, plane, ...}, assign the correct label (e.g. cat).

The problem: the semantic gap. What the computer sees is just a big grid of numbers between 0 and 255, e.g. 800 x 600 x 3 (3 RGB channels).

Challenges include viewpoint variation (all pixels change when the camera moves!), illumination, and deformation.

Further challenges: occlusion, background clutter, and intraclass variation.

This is a fundamental problem in vision, and the key challenge here: how do we train a model that can tolerate all those variations? Main ideas: we need a lot of data that exhibits those variations, and we need more specialized models that capture the invariance.

A bit of history. Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner 1998]: LeNet-5. ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton, 2012]: AlexNet.

Issues of standard NNs for image inputs. A fully connected layer on a 32x32x3 image first stretches the image into a 3072 x 1 vector; with 10 outputs, the weight matrix W is 10 x 3072, and each activation is the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product). Spatial structure is lost!

Solution: the convolutional neural net (ConvNet/CNN), a special case of fully connected neural nets. CNNs usually consist of convolution layers, ReLU layers, pooling layers, and regular fully connected layers. Key idea: learning from low-level to high-level features.

Convolution layer. Arrange the neurons naturally as a 3-D tensor (height x width x depth), e.g. a 32x32x3 image, so that spatial structure is preserved.

2-D convolution (Goodfellow, Fig. 9.1): an example of 2-D convolution without kernel flipping, where the output is restricted to positions where the kernel lies entirely within the image ("valid" convolution). With input

  a b c d
  e f g h
  i j k l

and kernel

  w x
  y z

the output is

  aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
  ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz

Each output entry is formed by applying the kernel to the corresponding region of the input.

A convolution layer on a 32x32x3 image uses, e.g., a 5x5x3 filter (also called a kernel or receptive field); filters always extend the full depth of the input volume. We convolve the filter with the image, i.e. slide it over the image spatially, computing dot products.
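The valid convolution in the figure can be reproduced with a short numpy sketch; as in the figure, no kernel flipping is performed (so this is technically cross-correlation). The numeric input and kernel values are placeholders standing in for a..l and w, x, y, z.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """2-D valid convolution (no kernel flipping), as in the figure above."""
    H, W = img.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise multiply the kernel with one region and sum
            out[i, j] = np.sum(img[i:i + kH, j:j + kW] * kernel)
    return out

img = np.arange(12, dtype=float).reshape(3, 4)   # stands in for a..l
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])      # stands in for w, x, y, z
print(conv2d_valid(img, kernel))                 # 2 x 3 output; entry (0,0) is aw+bx+ey+fz
```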

Sliding the 5x5x3 filter over all spatial locations of the 32x32x3 image produces an activation map of size 28x28x1; each number is the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product, plus a bias).

A second (say, green) filter gives a second 28x28 activation map. For example, if we had six 5x5 filters, we would get six separate activation maps; we stack these up to get a new "image" of size 28x28x6.
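A sketch of the full layer under the numbers above: a 32x32x3 input, six 5x5x3 filters, stride 1, no padding, yielding a 28x28x6 stack of activation maps. The random weights are purely illustrative.

```python
import numpy as np

def conv_layer(x, filters, biases):
    """x: (H, W, D) input; filters: (K, F, F, D); returns (H-F+1, W-F+1, K)."""
    H, W, D = x.shape
    K, F, _, _ = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # 5*5*3 = 75-dimensional dot product plus a bias
                out[i, j, k] = np.sum(x[i:i + F, j:j + F, :] * filters[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32, 3))
filters = rng.normal(size=(6, 5, 5, 3))
maps = conv_layer(x, filters, np.zeros(6))
print(maps.shape)   # (28, 28, 6): six stacked activation maps
```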

Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions. E.g. a 32x32x3 input → CONV + ReLU with six 5x5x3 filters → 28x28x6 → CONV + ReLU with ten 5x5x6 filters → 24x24x10 → ...

Why does convolution make sense? One filter gives one activation map (e.g. a layer with 32 filters of size 5x5 learns 32 such maps). We call the layer convolutional because it is related to the convolution of two signals: an elementwise multiplication and sum of a filter and the signal (image). Main idea: if a filter is useful at one location, it should be useful at other locations.

A simple example of why filtering is useful (Goodfellow, Fig. 9.6: efficiency of edge detection). The output image is formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall; the input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 * 280 * 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution. Describing the same transformation with a matrix would take 320 * 280 * 319 * 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation; the straightforward matrix multiplication algorithm would perform over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally.

Connection to fully connected NNs: a convolution layer is a special case of a fully connected layer, where the filter plays the role of the weights with a sparse connection pattern.

Sparse connectivity viewed from below (Goodfellow, Fig. 9.2): highlight one input unit x_3 and the output units in s that are affected by it. When s is formed by convolution with a kernel of width 3, only three outputs are affected by x_3; when s is formed by matrix multiplication, connectivity is no longer sparse, so all of the outputs are affected by x_3.

Sparse connectivity viewed from above (Goodfellow, Fig. 9.3): highlight one output unit s_3 and the input units in x that affect it; these units are known as the receptive field of s_3. When s_3 is formed by convolution with a kernel of width 3, only three inputs affect it; with dense matrix multiplication, all of them do.

Parameter sharing: convolution shares the same parameters across all spatial locations, whereas traditional matrix multiplication does not share any parameters (Goodfellow, Fig. 9.5). This means far fewer parameters. Example (ignoring bias terms), for a 32x32x3 input and a 28x28 output map: a fully connected layer needs (32 * 32 * 3) * (28 * 28) ≈ 2.4M weights, while a convolution layer with one 5x5x3 filter needs only 5 * 5 * 3 = 75.

Parameter sharing does not affect the runtime of forward propagation, but it further reduces the storage requirements of the model to k parameters, where the kernel size k is usually several orders of magnitude smaller than the input size. In the case of convolution, this particular form of parameter sharing also gives the layer a property called equivariance to translation: a function f is equivariant to g if f(g(x)) = g(f(x)), and if g translates (shifts) the input, then convolution is equivariant to g.

Spatial arrangement: stride and padding. A closer look at spatial dimensions: consider a 7x7 input (spatially) and a 3x3 filter.

Sliding the 3x3 filter over the 7x7 input with stride 1 gives a 5x5 output. Now consider the same 3x3 filter applied with stride 2.

With stride 2, the 3x3 filter fits in 3 positions along each dimension, so the output is 3x3. With stride 3 it doesn't fit: we cannot apply a 3x3 filter to a 7x7 input with stride 3.

In general, the output size is (N - F) / stride + 1. E.g. for N = 7, F = 3: stride 1 => (7 - 3)/1 + 1 = 5; stride 2 => (7 - 3)/2 + 1 = 3; stride 3 => (7 - 3)/3 + 1 = 2.33, which is not an integer, so stride 3 cannot be used.

In practice it is common to zero pad the border. E.g. for a 7x7 input and a 3x3 filter applied with stride 1, padding with a 1-pixel border of zeros gives a 7x7 output. In general it is common to see CONV layers with stride 1, filters of size F x F, and zero-padding of (F - 1)/2, which preserves the spatial size: F = 3 => zero pad with 1, F = 5 => zero pad with 2, F = 7 => zero pad with 3.

Remember: a 32x32 input convolved repeatedly with 5x5 filters (and no padding) shrinks the volume spatially (32 -> 28 -> 24 -> ...). Shrinking too fast is not good and doesn't work well.

Example time: input volume 32x32x3, ten 5x5 filters with stride 1, pad 2. Output volume size? (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10. Number of parameters in this layer?

Each filter has 5*5*3 + 1 = 76 parameters (+1 for the bias), so the layer has 76 * 10 = 760 parameters in total.

Summary for a convolution layer. Input: a tensor of size W1 x H1 x D1. Hyperparameters: K filters of size F x F, stride S, and amount of zero padding P (for one side). Output: a tensor of size W2 x H2 x D2, where

W2 = (W1 + 2P - F)/S + 1
H2 = (H1 + 2P - F)/S + 1
D2 = K

Number of parameters: (F * F * D1 + 1) * K weights (+1 for the bias). Common setting: F = 3, S = P = 1, and K a power of two.

The brain/neuron view of a CONV layer: for a 32x32x3 image and a 5x5x3 filter, each output number is the result of taking a dot product between the filter and one part of the image (a 5*5*3 = 75-dimensional dot product); it is just a neuron with local connectivity.
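The summary formulas translate directly into a small helper; a sketch that reproduces the example above (32x32x3 input, K = 10 filters of size F = 5, stride S = 1, padding P = 2).

```python
def conv_output_shape(W1, H1, D1, K, F, S, P):
    """Output size and parameter count of a convolution layer."""
    W2 = (W1 + 2 * P - F) // S + 1       # must divide evenly for a valid setting
    H2 = (H1 + 2 * P - F) // S + 1
    n_params = (F * F * D1 + 1) * K      # +1 per filter for the bias
    return (W2, H2, K), n_params

print(conv_output_shape(32, 32, 3, K=10, F=5, S=1, P=2))
# ((32, 32, 10), 760): 76 parameters per filter, 10 filters
```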

A 5x5 filter corresponds to a 5x5 receptive field for each neuron. E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3-D grid (28x28x5), and there are 5 different neurons all looking at the same region in the input volume. An activation map is a 28x28 sheet of neuron outputs where (1) each neuron is connected to a small region of the input and (2) all of them share parameters.

Another element: pooling. A pooling layer makes the representations smaller and more manageable, and operates over each activation map independently. Max pooling with a 2x2 filter and stride 2 is very common: within each depth slice, every 2x2 block is replaced by its maximum. Other types of pooling: average pooling, L2-norm pooling.
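A numpy sketch of 2x2 max pooling with stride 2, applied to each depth slice independently; the tiny input is made up for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W, D) with even H and W; 2x2 max pooling with stride 2 per depth slice."""
    H, W, D = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, D)
    return x.max(axis=(1, 3))            # max over each 2x2 block

x = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
print(max_pool_2x2(x)[:, :, 0])          # each output is the max of a 2x2 block
```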

Putting everything together. A typical architecture for CNNs:

Input → [[Conv → ReLU]*N → Pool]*M → [FC → ReLU]*Q → Softmax

Common choices: N up to about 5, Q up to about 2, and M large. Well-known CNNs: LeNet (1998), AlexNet (2012), ZF Net (2013), GoogLeNet (2014), VGGNet (2014), ResNet (2015), FractalNet (2017); the trend has been a revolution of depth.

How to train a CNN? How do we learn the filters/weights? Run the image through the layers of convolutions and train the filters with SGD/backpropagation. What computer architecture to use? Buy one or more GPUs (NVIDIA graphics cards). What software package to use? State of the art: TensorFlow plus Keras, or Caffe on top. See the demo at: cifar.html

Outline: (1) Review of last lecture, (2) Convolutional neural networks, (3) Kernel methods: motivation, dual formulation of linear regression, the kernel trick.
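Following the template above, a minimal sketch in tf.keras (the TensorFlow-plus-Keras combination mentioned in the slides) with N = 1, M = 2, Q = 1 for 32x32x3 inputs and 10 classes; the filter counts and sizes are arbitrary choices, not an architecture prescribed by the lecture.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    # [Conv -> ReLU] -> Pool, repeated M = 2 times
    tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    # [FC -> ReLU] -> Softmax
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.summary()
```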

Motivation. Recall the question: how do we choose a nonlinear basis φ : R^D → R^M for a model of the form w^T φ(x)? A neural network is one approach: learn φ from the data. The kernel method is another: sidestep the issue of choosing φ by using kernel functions. Kernel methods work for many problems; we take regularized linear regression as an example.

Case study: regularized linear regression. Recall the regularized least squares solution

w* = argmin_w F(w) = argmin_w ( ||Φw − y||_2^2 + λ||w||_2^2 ) = (Φ^T Φ + λI)^{-1} Φ^T y

where Φ is the N x M matrix with rows φ(x_1)^T, φ(x_2)^T, ..., φ(x_N)^T and y = (y_1, ..., y_N)^T. Here Φ^T Φ is an M x M matrix. Issue: we operate in the space R^M, and M could be huge or even infinite!

A closer look at the least squares solution. Setting the gradient of F(w) = ||Φw − y||_2^2 + λ||w||_2^2 to 0 gives

Φ^T(Φw* − y) + λw* = 0.

Next we move λw* to the other side and divide by λ:

w* = (1/λ) Φ^T (y − Φw*).

Define a new vector α = (y − Φw*)/λ. Then

w* = Φ^T α = Σ_{n=1}^N α_n φ(x_n),

so the least squares solution is a linear combination of the features! (The same is true for the perceptron and many other problems.) Of course, this calculation does not show what α is.

Why is this helpful? Assuming we know α, the prediction of w* on a new example x is

w*^T φ(x) = Σ_{n=1}^N α_n ( φ(x_n)^T φ(x) ).

Therefore we do not really need to know w*: only inner products in the new feature space matter. Kernel methods are exactly about computing inner products without knowing φ. But we still need to figure out what α is!
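A numpy sketch of the primal solution w* = (Φ^T Φ + λI)^{-1} Φ^T y; the basis φ and the data are made-up placeholders, chosen only to give Φ a concrete shape.

```python
import numpy as np

def ridge_primal(Phi, y, lam):
    """w* = (Phi^T Phi + lam I)^{-1} Phi^T y; solves an M x M linear system."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                       # N = 20 raw inputs in R^2 (made up)
Phi = np.hstack([X, X**2, np.ones((20, 1))])       # a hand-picked basis phi, so M = 5
y = rng.normal(size=20)
w = ridge_primal(Phi, y, lam=0.1)
print(w.shape)                                     # (5,)
```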

How to find α? Plugging w = Φ^T α into F(w) gives

G(α) = F(Φ^T α)
     = ||ΦΦ^T α − y||_2^2 + λ||Φ^T α||_2^2
     = ||ΦΦ^T α − y||_2^2 + λ α^T ΦΦ^T α
     = ||Kα − y||_2^2 + λ α^T Kα      (with K = ΦΦ^T).

We find α by minimizing G(α).

Gram matrix. Here K = ΦΦ^T is called the Gram matrix or kernel matrix; its (i, j) entry is φ(x_i)^T φ(x_j):

K = ΦΦ^T = [ φ(x_i)^T φ(x_j) ]_{i,j=1}^N ∈ R^{N x N}.

Example of a kernel matrix: for data points x_1, x_2, x_3 ∈ R and a low-degree polynomial basis φ(x) = (1, x, x^2, ...)^T, the Gram matrix is the 3x3 matrix with entries K_ij = φ(x_i)^T φ(x_j).
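A sketch of the Gram matrix computation with a hypothetical set of three 1-D points and a low-degree polynomial basis (the specific values here are illustrative, not the ones used in the lecture).

```python
import numpy as np

def poly_basis(x, degree=3):
    """phi(x) = (1, x, x^2, ..., x^degree) for a scalar x."""
    return np.array([x**k for k in range(degree + 1)], dtype=float)

xs = np.array([-1.0, 0.0, 1.0])                 # made-up 1-D data points
Phi = np.stack([poly_basis(x) for x in xs])     # N x M feature matrix
K = Phi @ Phi.T                                 # Gram matrix: entry (i,j) = phi(x_i)^T phi(x_j)
print(K)
```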

Gram matrix vs. covariance matrix: ΦΦ^T is N x N with entry (i, j) equal to φ(x_i)^T φ(x_j), while Φ^T Φ is M x M with entry (i, j) equal to Σ_n φ(x_n)_i φ(x_n)_j; both are symmetric and positive semidefinite.

How to find α? Differentiate

G(α) = F(Φ^T α) = ||Kα − y||_2^2 + λ α^T Kα

and set the gradient to 0. Since K is symmetric,

0 = 2K(Kα − y) + 2λKα = 2K( (K + λI)α − y ),

so any α satisfying (K + λI)α = y is a stationary point. It follows that α* = (K + λI)^{-1} y is a minimizer, and we obtain

w* = Φ^T α* = Φ^T (K + λI)^{-1} y.

Comparing the two solutions. Minimizing F(w) gives w* = (Φ^T Φ + λI)^{-1} Φ^T y; minimizing G(α) gives w* = Φ^T (ΦΦ^T + λI)^{-1} y. (Note that I has different dimensions in the two formulas.) Natural question: are they the same or different? They are the same, and here is the proof (with I_1 the M x M identity and I_2 the N x N identity, and K = ΦΦ^T):

(Φ^T Φ + λI_1)^{-1} Φ^T y
 = (Φ^T Φ + λI_1)^{-1} Φ^T (ΦΦ^T + λI_2)(ΦΦ^T + λI_2)^{-1} y
 = (Φ^T Φ + λI_1)^{-1} (Φ^T ΦΦ^T + λΦ^T)(ΦΦ^T + λI_2)^{-1} y
 = (Φ^T Φ + λI_1)^{-1} (Φ^T Φ + λI_1) Φ^T (ΦΦ^T + λI_2)^{-1} y
 = Φ^T (ΦΦ^T + λI_2)^{-1} y.

Then what is the difference? First, computing (ΦΦ^T + λI)^{-1} = (K + λI)^{-1} can be more efficient than computing (Φ^T Φ + λI)^{-1} when N < M. More importantly, computing α = (K + λI)^{-1} y also only requires computing inner products in the new feature space!

We can now conclude that the exact form of φ(·) is not essential; all we need is to compute inner products φ(x)^T φ(x'). For some φ it is indeed possible to compute φ(x)^T φ(x') without computing or even knowing φ. This is the kernel trick.
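A quick numerical check, under made-up Φ and y, that the primal and dual formulas give the same w: α = (K + λI)^{-1} y with w = Φ^T α, versus (Φ^T Φ + λI)^{-1} Φ^T y.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))      # N = 20 examples, M = 5 features (made up)
y = rng.normal(size=20)
lam = 0.1

# primal: solve an M x M system
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)

# dual: solve an N x N system built only from inner products
K = Phi @ Phi.T
alpha = np.linalg.solve(K + lam * np.eye(20), y)
w_dual = Phi.T @ alpha

print(np.allclose(w_primal, w_dual))   # True: the two solutions coincide
```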

Example of the kernel trick. Consider the polynomial basis φ : R^2 → R^3 given by

φ(x) = ( x_1^2, √2 x_1 x_2, x_2^2 )^T.

What is the inner product between φ(x) and φ(x')?

φ(x)^T φ(x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = (x^T x')^2.

Therefore the inner product in the new space is simply a function of the inner product in the original space. (Count the number of multiplications each way.)

Another example: φ : R^D → R^{2D}, parameterized by θ,

φ_θ(x) = ( cos(θ x_1), sin(θ x_1), ..., cos(θ x_D), sin(θ x_D) )^T.

What is the inner product between φ_θ(x) and φ_θ(x')?

φ_θ(x)^T φ_θ(x') = Σ_{d=1}^D [ cos(θ x_d) cos(θ x_d') + sin(θ x_d) sin(θ x_d') ] = Σ_{d=1}^D cos(θ (x_d − x_d')).

Once again, the inner product in the new space is a simple function of the features in the original space.

A more complicated example: based on the previous mapping φ_θ, define φ_L : R^D → R^{2DL} by stacking and rescaling,

φ_L(x) = (1/√L) ( φ_{2π·1/L}(x), φ_{2π·2/L}(x), ..., φ_{2π·L/L}(x) ).

What is the inner product between φ_L(x) and φ_L(x')?

φ_L(x)^T φ_L(x') = (1/L) Σ_{l=1}^L φ_{2πl/L}(x)^T φ_{2πl/L}(x') = (1/L) Σ_{l=1}^L Σ_{d=1}^D cos( (2πl/L)(x_d − x_d') ).

Infinite-dimensional mapping: let L → ∞, so that φ_L(x) becomes a vector of infinite dimension. Clearly we cannot compute φ_L(x), but we can still compute the inner product, whose limit is an average over θ ∈ [0, 2π]:

φ_∞(x)^T φ_∞(x') = (1/2π) ∫_0^{2π} Σ_{d=1}^D cos(θ (x_d − x_d')) dθ = Σ_{d=1}^D sin(2π (x_d − x_d')) / (2π (x_d − x_d')).

Again, a simple function of the original features. Note that using this mapping in linear regression, we are learning a weight vector w with infinitely many dimensions!
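A quick check of the first example: the explicit map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) gives the same inner product as the kernel (x^T x')^2; the two random inputs are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 map from R^2 to R^3."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(xp)        # inner product in the new feature space
rhs = (x @ xp) ** 2           # kernel trick: a function of the original inner product
print(np.isclose(lhs, rhs))   # True
```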

Kernel functions. Definition: a function k : R^D x R^D → R is called a (positive semidefinite) kernel function if there exists a function φ : R^D → R^M such that for any x, x' ∈ R^D,

k(x, x') = φ(x)^T φ(x').

Examples we have seen:

k(x, x') = (x^T x')^2
k(x, x') = Σ_{d=1}^D sin(2π (x_d − x_d')) / (2π (x_d − x_d'))

Using kernel functions. Choosing a nonlinear basis φ becomes choosing a kernel function. As long as computing the kernel function is more efficient than computing the features explicitly, we should apply the kernel trick. The Gram/kernel matrix becomes

K = ΦΦ^T = [ k(x_i, x_j) ]_{i,j=1}^N.

In fact, k is a kernel if and only if K is positive semidefinite for any x_1, x_2, ..., x_N (Mercer's theorem).

The prediction on a new example x is

w^T φ(x) = Σ_{n=1}^N α_n φ(x_n)^T φ(x) = Σ_{n=1}^N α_n k(x_n, x).

More examples of kernel functions. The two most commonly used kernels in practice are the polynomial kernel

k(x, x') = (x^T x' + c)^d    for c ≥ 0 and a positive integer d,

and the Gaussian kernel, or radial basis function (RBF) kernel,

k(x, x') = exp( −||x − x'||_2^2 / (2σ^2) )    for some σ > 0.

Think about what the corresponding φ is for each kernel.
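Sketches of the two kernels, plus a Mercer-style sanity check that a kernel matrix built from them is positive semidefinite; the data and hyperparameters (c, d, σ) are arbitrary choices.

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, d=2):
    """Polynomial kernel (x^T x' + c)^d."""
    return (x @ xp + c) ** d

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian / RBF kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                               # arbitrary data
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])   # kernel matrix
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True: K is positive semidefinite (up to round-off)
```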

Composing kernels. We can create more kernel functions using the following rules: if k_1(·,·) and k_2(·,·) are kernels, then the following are kernels too:

linear combination: α k_1(·,·) + β k_2(·,·) for α, β ≥ 0
product: k_1(·,·) k_2(·,·)
exponential: e^{k(·,·)}

(Verify these using the definition of a kernel!)

Examples that are not kernels: the function k(x, x') = ||x − x'||_2^2 is not a kernel. Why? If it were a kernel, the kernel matrix for two data points x_1 and x_2,

K = [ k(x_1, x_1)  k(x_1, x_2) ; k(x_2, x_1)  k(x_2, x_2) ] = [ 0  ||x_1 − x_2||_2^2 ; ||x_1 − x_2||_2^2  0 ],

would have to be positive semidefinite. But is it? (Its eigenvalues are ±||x_1 − x_2||_2^2.)

Kernelizing other ML algorithms. The kernel trick is applicable to many ML algorithms: the nearest neighbor classifier, the perceptron, logistic regression, SVMs.

Example: kernelizing NNC. For NNC with the L2 distance, the key is to compute, for any two points x, x',

d(x, x') = ||x − x'||_2^2 = x^T x + x'^T x' − 2 x^T x'.

With a kernel function k, we simply compute

d_kernel(x, x') = k(x, x) + k(x', x') − 2 k(x, x'),

which by definition is the L2 distance in the new feature space:

d_kernel(x, x') = ||φ(x) − φ(x')||_2^2.
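A sketch of the kernelized distance d_kernel(x, x') = k(x, x) + k(x', x') − 2 k(x, x'), together with a check that ||x − x'||_2^2 itself fails the positive semidefiniteness test on two points.

```python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def kernel_distance(x, xp, k):
    """Squared L2 distance in the feature space induced by kernel k."""
    return k(x, x) + k(xp, xp) - 2 * k(x, xp)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=3), rng.normal(size=3)
print(kernel_distance(x1, x2, rbf_kernel))

# ||x - x'||^2 is not a kernel: its 2x2 "kernel matrix" has a negative eigenvalue
s = np.sum((x1 - x2) ** 2)
K = np.array([[0.0, s], [s, 0.0]])
print(np.linalg.eigvalsh(K))   # eigenvalues -s and +s, so K is not PSD
```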

Lecture 7 Convolutional Neural Networks

Lecture 7 Convolutional Neural Networks Lecture 7 Convolutional Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 17, 2017 We saw before: ŷ x 1 x 2 x 3 x 4 A series of matrix multiplications:

More information

Deep Learning (CNNs)

Deep Learning (CNNs) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

Lecture 17: Neural Networks and Deep Learning

Lecture 17: Neural Networks and Deep Learning UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

More information

Introduction to Convolutional Neural Networks (CNNs)

Introduction to Convolutional Neural Networks (CNNs) Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei

More information

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 CS 1674: Intro to Computer Vision Final Review Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 Final info Format: multiple-choice, true/false, fill in the blank, short answers, apply an

More information

Deep Learning. Convolutional Neural Network (CNNs) Ali Ghodsi. October 30, Slides are partially based on Book in preparation, Deep Learning

Deep Learning. Convolutional Neural Network (CNNs) Ali Ghodsi. October 30, Slides are partially based on Book in preparation, Deep Learning Convolutional Neural Network (CNNs) University of Waterloo October 30, 2015 Slides are partially based on Book in preparation, by Bengio, Goodfellow, and Aaron Courville, 2015 Convolutional Networks Convolutional

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets)

Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) Sanjeev Arora Elad Hazan Recap: Structure of a deep

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS LAST TIME Intro to cudnn Deep neural nets using cublas and cudnn TODAY Building a better model for image classification Overfitting

More information

Convolutional Neural Network Architecture

Convolutional Neural Network Architecture Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation

More information

Convolutional neural networks

Convolutional neural networks 11-1: Convolutional neural networks Prof. J.C. Kao, UCLA Convolutional neural networks Motivation Biological inspiration Convolution operation Convolutional layer Padding and stride CNN architecture 11-2:

More information

Loss Functions and Optimization. Lecture 3-1

Loss Functions and Optimization. Lecture 3-1 Lecture 3: Loss Functions and Optimization Lecture 3-1 Administrative Assignment 1 is released: http://cs231n.github.io/assignments2017/assignment1/ Due Thursday April 20, 11:59pm on Canvas (Extending

More information

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text

More information

Loss Functions and Optimization. Lecture 3-1

Loss Functions and Optimization. Lecture 3-1 Lecture 3: Loss Functions and Optimization Lecture 3-1 Administrative: Live Questions We ll use Zoom to take questions from remote students live-streaming the lecture Check Piazza for instructions and

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

CSCI 315: Artificial Intelligence through Deep Learning

CSCI 315: Artificial Intelligence through Deep Learning CSCI 35: Artificial Intelligence through Deep Learning W&L Fall Term 27 Prof. Levy Convolutional Networks http://wernerstudio.typepad.com/.a/6ad83549adb53ef53629ccf97c-5wi Convolution: Convolution is

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

CSCI567 Machine Learning (Fall 2018)

CSCI567 Machine Learning (Fall 2018) CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due

More information

Neural networks COMS 4771

Neural networks COMS 4771 Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a

More information

Neural Networks 2. 2 Receptive fields and dealing with image inputs

Neural Networks 2. 2 Receptive fields and dealing with image inputs CS 446 Machine Learning Fall 2016 Oct 04, 2016 Neural Networks 2 Professor: Dan Roth Scribe: C. Cheng, C. Cervantes Overview Convolutional Neural Networks Recurrent Neural Networks 1 Introduction There

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation) Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

Convolutional Neural Networks

Convolutional Neural Networks Convolutional Neural Networks Books» http://www.deeplearningbook.org/ Books http://neuralnetworksanddeeplearning.com/.org/ reviews» http://www.deeplearningbook.org/contents/linear_algebra.html» http://www.deeplearningbook.org/contents/prob.html»

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Lecture 35: Optimization and Neural Nets

Lecture 35: Optimization and Neural Nets Lecture 35: Optimization and Neural Nets CS 4670/5670 Sean Bell DeepDream [Google, Inceptionism: Going Deeper into Neural Networks, blog 2015] Aside: CNN vs ConvNet Note: There are many papers that use

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Introduction to CNN and PyTorch

Introduction to CNN and PyTorch Introduction to CNN and PyTorch Kripasindhu Sarkar kripasindhu.sarkar@dfki.de Kaiserslautern University, DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de Some of the contents

More information

Machine Learning Basics

Machine Learning Basics Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Lecture 9 Numerical optimization and deep learning Niklas Wahlström Division of Systems and Control Department of Information Technology Uppsala University niklas.wahlstrom@it.uu.se

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17 3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural

More information

Introduction to Convolutional Neural Networks 2018 / 02 / 23

Introduction to Convolutional Neural Networks 2018 / 02 / 23 Introduction to Convolutional Neural Networks 2018 / 02 / 23 Buzzword: CNN Convolutional neural networks (CNN, ConvNet) is a class of deep, feed-forward (not recurrent) artificial neural networks that

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

Introduction to Machine Learning (67577)

Introduction to Machine Learning (67577) Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks

More information

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 8. Sinan Kalkan

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 8. Sinan Kalkan CENG 783 Special topics in Deep Learning AlchemyAPI Week 8 Sinan Kalkan Loss functions Many correct labels case: Binary prediction for each label, independently: L i = σ j max 0, 1 y ij f j y ij = +1 if

More information

Convolutional Neural Networks. Srikumar Ramalingam

Convolutional Neural Networks. Srikumar Ramalingam Convolutional Neural Networks Srikumar Ramalingam Reference Many of the slides are prepared using the following resources: neuralnetworksanddeeplearning.com (mainly Chapter 6) http://cs231n.github.io/convolutional-networks/

More information

Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks. Sargur N. Srihari Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1 Neural Networks Outline

More information

Neural Architectures for Image, Language, and Speech Processing

Neural Architectures for Image, Language, and Speech Processing Neural Architectures for Image, Language, and Speech Processing Karl Stratos June 26, 2018 1 / 31 Overview Feedforward Networks Need for Specialized Architectures Convolutional Neural Networks (CNNs) Recurrent

More information

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 8 Feb. 12, 2018 1 10-601 Introduction

More information

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer CSE446: Neural Networks Spring 2017 Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer Human Neurons Switching time ~ 0.001 second Number of neurons 10 10 Connections per neuron 10 4-5 Scene

More information

Convolution and Pooling as an Infinitely Strong Prior

Convolution and Pooling as an Infinitely Strong Prior Convolution and Pooling as an Infinitely Strong Prior Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Convolutional

More information

Νεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing

Νεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing Νεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing ΗΥ418 Διδάσκων Δημήτριος Κατσαρός @ Τμ. ΗΜΜΥ Πανεπιστήμιο Θεσσαλίαρ Διάλεξη 21η BackProp for CNNs: Do I need to understand it? Why do we have to write the

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

Deep Learning: a gentle introduction

Deep Learning: a gentle introduction Deep Learning: a gentle introduction Jamal Atif jamal.atif@dauphine.fr PSL, Université Paris-Dauphine, LAMSADE February 8, 206 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, 206 / Why

More information

Neural Networks: Backpropagation

Neural Networks: Backpropagation Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

More information

Neural networks and support vector machines

Neural networks and support vector machines Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith

More information

Recurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST

Recurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST 1 Recurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST Summary We have shown: Now First order optimization methods: GD (BP), SGD, Nesterov, Adagrad, ADAM, RMSPROP, etc. Second

More information

10-701/ Recitation : Kernels

10-701/ Recitation : Kernels 10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer

More information

CSC321 Lecture 16: ResNets and Attention

CSC321 Lecture 16: ResNets and Attention CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the

More information

Training Neural Networks Practical Issues

Training Neural Networks Practical Issues Training Neural Networks Practical Issues M. Soleymani Sharif University of Technology Fall 2017 Most slides have been adapted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017, and some from

More information

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2 Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Spatial Transformer Networks

Spatial Transformer Networks BIL722 - Deep Learning for Computer Vision Spatial Transformer Networks Max Jaderberg Andrew Zisserman Karen Simonyan Koray Kavukcuoglu Contents Introduction to Spatial Transformers Related Works Spatial

More information

Introduction to Deep Learning

Introduction to Deep Learning Introduction to Deep Learning A. G. Schwing & S. Fidler University of Toronto, 2015 A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2015 1 / 39 Outline 1 Universality of Neural Networks

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

SVMs: nonlinearity through kernels

SVMs: nonlinearity through kernels Non-separable data e-8. Support Vector Machines 8.. The Optimal Hyperplane Consider the following two datasets: SVMs: nonlinearity through kernels ER Chapter 3.4, e-8 (a) Few noisy data. (b) Nonlinearly

More information

Introduction to Deep Neural Networks

Introduction to Deep Neural Networks Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic

More information

Handwritten Indic Character Recognition using Capsule Networks

Handwritten Indic Character Recognition using Capsule Networks Handwritten Indic Character Recognition using Capsule Networks Bodhisatwa Mandal,Suvam Dubey, Swarnendu Ghosh, RiteshSarkhel, Nibaran Das Dept. of CSE, Jadavpur University, Kolkata, 700032, WB, India.

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

Deep Learning Lecture 2

Deep Learning Lecture 2 Fall 2016 Machine Learning CMPSCI 689 Deep Learning Lecture 2 Sridhar Mahadevan Autonomous Learning Lab UMass Amherst COLLEGE Outline of lecture New type of units Convolutional units, Rectified linear

More information

Introduction to Deep Learning

Introduction to Deep Learning Introduction to Deep Learning Some slides and images are taken from: David Wolfe Corne Wikipedia Geoffrey A. Hinton https://www.macs.hw.ac.uk/~dwcorne/teaching/introdl.ppt Feedforward networks for function

More information

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 9 Sep. 26, 2018

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 9 Sep. 26, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 9 Sep. 26, 2018 1 Reminders Homework 3:

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Feature Design. Feature Design. Feature Design. & Deep Learning

Feature Design. Feature Design. Feature Design. & Deep Learning Artificial Intelligence and its applications Lecture 9 & Deep Learning Professor Daniel Yeung danyeung@ieee.org Dr. Patrick Chan patrickchan@ieee.org South China University of Technology, China Appropriately

More information

CSC321 Lecture 20: Reversible and Autoregressive Models

CSC321 Lecture 20: Reversible and Autoregressive Models CSC321 Lecture 20: Reversible and Autoregressive Models Roger Grosse Roger Grosse CSC321 Lecture 20: Reversible and Autoregressive Models 1 / 23 Overview Four modern approaches to generative modeling:

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Each new feature uses a pair of the original features. Problem: Mapping usually leads to the number of features blow up!

Each new feature uses a pair of the original features. Problem: Mapping usually leads to the number of features blow up! Feature Mapping Consider the following mapping φ for an example x = {x 1,...,x D } φ : x {x1,x 2 2,...,x 2 D,,x 2 1 x 2,x 1 x 2,...,x 1 x D,...,x D 1 x D } It s an example of a quadratic mapping Each new

More information

Understanding How ConvNets See

Understanding How ConvNets See Understanding How ConvNets See Slides from Andrej Karpathy Springerberg et al, Striving for Simplicity: The All Convolutional Net (ICLR 2015 workshops) CSC321: Intro to Machine Learning and Neural Networks,

More information

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture Neural Networks for Machine Learning Lecture 2a An overview of the main types of neural network architecture Geoffrey Hinton with Nitish Srivastava Kevin Swersky Feed-forward neural networks These are

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1

Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1 Kernel Methods Foundations of Data Analysis Torsten Möller Möller/Mori 1 Reading Chapter 6 of Pattern Recognition and Machine Learning by Bishop Chapter 12 of The Elements of Statistical Learning by Hastie,

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux INRIA 8 Nov 2011 Nicolas Le Roux (INRIA) Neural networks and optimization 8 Nov 2011 1 / 80 1 Introduction 2 Linear classifier 3 Convolutional neural networks

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML) Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017 Stochastic Gradient Descent: The Workhorse of Machine Learning CS6787 Lecture 1 Fall 2017 Fundamentals of Machine Learning? Machine Learning in Practice this course What s missing in the basic stuff? Efficiency!

More information

Deep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017

Deep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017 Deep Learning for Natural Language Processing Sidharth Mudgal April 4, 2017 Table of contents 1. Intro 2. Word Vectors 3. Word2Vec 4. Char Level Word Embeddings 5. Application: Entity Matching 6. Conclusion

More information

Lecture 7: Kernels for Classification and Regression

Lecture 7: Kernels for Classification and Regression Lecture 7: Kernels for Classification and Regression CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 15, 2011 Outline Outline A linear regression problem Linear auto-regressive

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

Theories of Deep Learning

Theories of Deep Learning Theories of Deep Learning Lecture 02 Donoho, Monajemi, Papyan Department of Statistics Stanford Oct. 4, 2017 1 / 50 Stats 385 Fall 2017 2 / 50 Stats 285 Fall 2017 3 / 50 Course info Wed 3:00-4:20 PM in

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

SVMs, Duality and the Kernel Trick

SVMs, Duality and the Kernel Trick SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Neural Network Language Modeling

Neural Network Language Modeling Neural Network Language Modeling Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Marek Rei, Philipp Koehn and Noah Smith Course Project Sign up your course project In-class presentation

More information

Lecture 4 Backpropagation

Lecture 4 Backpropagation Lecture 4 Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 5, 2017 Things we will look at today More Backpropagation Still more backpropagation Quiz

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Machine Learning Lecture 10

Machine Learning Lecture 10 Machine Learning Lecture 10 Neural Networks 26.11.2018 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Today s Topic Deep Learning 2 Course Outline Fundamentals Bayes

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information