CSCI567 Machine Learning (Spring 2019): Convolutional Neural Networks and Kernel Methods. Prof. Victor Adamchik. Feb. 12, 2019


Outline: (1) Review of last lecture, (2) Convolutional neural networks, (3) Kernel methods. (CSCI567 Machine Learning, Spring 2019, Prof. Victor Adamchik, University of Southern California, Feb. 12, 2019.)

Review of last lecture: math formulation. An L-layer neural net can be written as

f(x) = h_L( W_L h_{L-1}( W_{L-1} ... h_1( W_1 x ) ) ).

To ease notation, for a given input x, define recursively

o_0 = x,  a_l = W_l o_{l-1},  o_l = h_l(a_l)   (l = 1, ..., L)

where D_0 = D, D_1, ..., D_L are the numbers of neurons at each layer, W_l ∈ R^{D_l × D_{l-1}} is the weight matrix for layer l, a_l ∈ R^{D_l} is the input to layer l, o_l ∈ R^{D_l} is the output of layer l, and h_l : R^{D_l} → R^{D_l} is the activation function at layer l.

Putting everything into SGD. The backpropagation algorithm (1986) computes the gradient of the cost function for a single training example. Initialize W_1, ..., W_L (all zeros or randomly), then repeat:
1. randomly pick one data point n ∈ [N]
2. forward propagation: for each layer l = 1, ..., L, compute a_l = W_l o_{l-1} and o_l = h_l(a_l) (with o_0 = x_n)
3. backward propagation: for each l = L, ..., 1, compute ∂E_n/∂a_l and update the weights:
   W_l ← W_l − λ ∂E_n/∂W_l = W_l − λ (∂E_n/∂a_l)(o_{l-1})^T
(A minimal code sketch of one such step is given below.)

Outline: (1) Review of last lecture, (2) Convolutional neural networks (motivation, architecture), (3) Kernel methods.

Acknowledgements. The materials used to develop these notes include the following sources: the Stanford course CS231n (taught by Prof. Fei-Fei Li and her students) and Dr. Ian Goodfellow's lectures on deep learning. Both websites provide tons of useful resources: notes, demos, videos, etc.
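To make the procedure concrete, here is a minimal numpy sketch of one SGD/backpropagation step for a two-layer network (L = 2) with a ReLU hidden activation and squared-error loss; the layer sizes, data, and learning rate are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def sgd_step(W1, W2, x, y, lam=0.01):
    """One SGD/backprop step for f(x) = W2 relu(W1 x) with squared-error loss."""
    # forward propagation: a_l = W_l o_{l-1}, o_l = h_l(a_l)
    o0 = x
    a1 = W1 @ o0
    o1 = relu(a1)
    a2 = W2 @ o1
    o2 = a2                                  # identity output activation
    # backward propagation: compute dE/da_l for l = L, ..., 1
    dE_da2 = o2 - y                          # gradient of 0.5 * ||o2 - y||^2
    dE_da1 = (W2.T @ dE_da2) * (a1 > 0)      # chain rule through the ReLU
    # update: W_l <- W_l - lam * (dE/da_l) (o_{l-1})^T
    W2 -= lam * np.outer(dE_da2, o1)
    W1 -= lam * np.outer(dE_da1, o0)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 3))      # D1 = 4 hidden units, D0 = 3 inputs
W2 = rng.normal(scale=0.1, size=(2, 4))      # D2 = 2 outputs
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = sgd_step(W1, W2, x, y)
```

Wrapping this step in a loop over randomly sampled data points gives the full SGD procedure summarized above.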

Image classification: a core task in computer vision. Given an image and a fixed set of discrete labels {dog, cat, truck, plane, ...}, assign the correct label (e.g. cat).

The problem: the semantic gap. What the computer sees is just a big grid of numbers between 0 and 255, e.g. 800 x 600 x 3 (3 RGB channels).

Challenges include viewpoint variation (all pixels change when the camera moves!), illumination, and deformation.

Further challenges: occlusion, background clutter, and intraclass variation.

This is a fundamental problem in vision, and the key challenge here: how do we train a model that can tolerate all those variations? Main ideas: we need a lot of data that exhibits those variations, and we need more specialized models that capture the invariance.

A bit of history. Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner 1998]: LeNet-5. ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton, 2012]: AlexNet.

Issues of standard NNs for image inputs. A fully connected layer on a 32x32x3 image first stretches the image into a 3072 x 1 vector; with 10 outputs, the weight matrix W is 10 x 3072, and each activation is the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product). Spatial structure is lost!

Solution: the convolutional neural net (ConvNet/CNN), a special case of fully connected neural nets. CNNs usually consist of convolution layers, ReLU layers, pooling layers, and regular fully connected layers. Key idea: learning from low-level to high-level features.

Convolution layer. Arrange the neurons naturally as a 3-D tensor (height x width x depth), e.g. a 32x32x3 image, so that spatial structure is preserved.

2-D convolution (Goodfellow, Fig. 9.1): an example of 2-D convolution without kernel flipping, where the output is restricted to positions where the kernel lies entirely within the image ("valid" convolution). With input

  a b c d
  e f g h
  i j k l

and kernel

  w x
  y z

the output is

  aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
  ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz

Each output entry is formed by applying the kernel to the corresponding region of the input.

A convolution layer on a 32x32x3 image uses, e.g., a 5x5x3 filter (also called a kernel or receptive field); filters always extend the full depth of the input volume. We convolve the filter with the image, i.e. slide it over the image spatially, computing dot products.
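The valid convolution in the figure can be reproduced with a short numpy sketch; as in the figure, no kernel flipping is performed (so this is technically cross-correlation). The numeric input and kernel values are placeholders standing in for a..l and w, x, y, z.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """2-D valid convolution (no kernel flipping), as in the figure above."""
    H, W = img.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise multiply the kernel with one region and sum
            out[i, j] = np.sum(img[i:i + kH, j:j + kW] * kernel)
    return out

img = np.arange(12, dtype=float).reshape(3, 4)   # stands in for a..l
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])      # stands in for w, x, y, z
print(conv2d_valid(img, kernel))                 # 2 x 3 output; entry (0,0) is aw+bx+ey+fz
```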

Sliding the 5x5x3 filter over all spatial locations of the 32x32x3 image produces an activation map of size 28x28x1; each number is the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product, plus a bias).

A second (say, green) filter gives a second 28x28 activation map. For example, if we had six 5x5 filters, we would get six separate activation maps; we stack these up to get a new "image" of size 28x28x6.
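A sketch of the full layer under the numbers above: a 32x32x3 input, six 5x5x3 filters, stride 1, no padding, yielding a 28x28x6 stack of activation maps. The random weights are purely illustrative.

```python
import numpy as np

def conv_layer(x, filters, biases):
    """x: (H, W, D) input; filters: (K, F, F, D); returns (H-F+1, W-F+1, K)."""
    H, W, D = x.shape
    K, F, _, _ = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # 5*5*3 = 75-dimensional dot product plus a bias
                out[i, j, k] = np.sum(x[i:i + F, j:j + F, :] * filters[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32, 3))
filters = rng.normal(size=(6, 5, 5, 3))
maps = conv_layer(x, filters, np.zeros(6))
print(maps.shape)   # (28, 28, 6): six stacked activation maps
```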

Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions. E.g. a 32x32x3 input → CONV + ReLU with six 5x5x3 filters → 28x28x6 → CONV + ReLU with ten 5x5x6 filters → 24x24x10 → ...

Why does convolution make sense? One filter gives one activation map (e.g. a layer with 32 filters of size 5x5 learns 32 such maps). We call the layer convolutional because it is related to the convolution of two signals: an elementwise multiplication and sum of a filter and the signal (image). Main idea: if a filter is useful at one location, it should be useful at other locations.

A simple example of why filtering is useful (Goodfellow, Fig. 9.6: efficiency of edge detection). The output image is formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall; the input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 * 280 * 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution. Describing the same transformation with a matrix would take 320 * 280 * 319 * 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation; the straightforward matrix multiplication algorithm would perform over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally.

Connection to fully connected NNs: a convolution layer is a special case of a fully connected layer, where the filter plays the role of the weights with a sparse connection pattern.

Sparse connectivity viewed from below (Goodfellow, Fig. 9.2): highlight one input unit x_3 and the output units in s that are affected by it. When s is formed by convolution with a kernel of width 3, only three outputs are affected by x_3; when s is formed by matrix multiplication, connectivity is no longer sparse, so all of the outputs are affected by x_3.

Sparse connectivity viewed from above (Goodfellow, Fig. 9.3): highlight one output unit s_3 and the input units in x that affect it; these units are known as the receptive field of s_3. When s_3 is formed by convolution with a kernel of width 3, only three inputs affect it; with dense matrix multiplication, all of them do.

Parameter sharing: convolution shares the same parameters across all spatial locations, whereas traditional matrix multiplication does not share any parameters (Goodfellow, Fig. 9.5). This means far fewer parameters. Example (ignoring bias terms), for a 32x32x3 input and a 28x28 output map: a fully connected layer needs (32 * 32 * 3) * (28 * 28) ≈ 2.4M weights, while a convolution layer with one 5x5x3 filter needs only 5 * 5 * 3 = 75.

Parameter sharing does not affect the runtime of forward propagation, but it further reduces the storage requirements of the model to k parameters, where the kernel size k is usually several orders of magnitude smaller than the input size. In the case of convolution, this particular form of parameter sharing also gives the layer a property called equivariance to translation: a function f is equivariant to g if f(g(x)) = g(f(x)), and if g translates (shifts) the input, then convolution is equivariant to g.

Spatial arrangement: stride and padding. A closer look at spatial dimensions: consider a 7x7 input (spatially) and a 3x3 filter.

Sliding the 3x3 filter over the 7x7 input with stride 1 gives a 5x5 output. Now consider the same 3x3 filter applied with stride 2.

With stride 2, the 3x3 filter fits in 3 positions along each dimension, so the output is 3x3. With stride 3 it doesn't fit: we cannot apply a 3x3 filter to a 7x7 input with stride 3.

In general, the output size is (N - F) / stride + 1. E.g. for N = 7, F = 3: stride 1 => (7 - 3)/1 + 1 = 5; stride 2 => (7 - 3)/2 + 1 = 3; stride 3 => (7 - 3)/3 + 1 = 2.33, which is not an integer, so stride 3 cannot be used.

In practice it is common to zero pad the border. E.g. for a 7x7 input and a 3x3 filter applied with stride 1, padding with a 1-pixel border of zeros gives a 7x7 output. In general it is common to see CONV layers with stride 1, filters of size F x F, and zero-padding of (F - 1)/2, which preserves the spatial size: F = 3 => zero pad with 1, F = 5 => zero pad with 2, F = 7 => zero pad with 3.

Remember: a 32x32 input convolved repeatedly with 5x5 filters (and no padding) shrinks the volume spatially (32 -> 28 -> 24 -> ...). Shrinking too fast is not good and doesn't work well.

Example time: input volume 32x32x3, ten 5x5 filters with stride 1, pad 2. Output volume size? (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10. Number of parameters in this layer?

Each filter has 5*5*3 + 1 = 76 parameters (+1 for the bias), so the layer has 76 * 10 = 760 parameters in total.

Summary for a convolution layer. Input: a tensor of size W1 x H1 x D1. Hyperparameters: K filters of size F x F, stride S, and amount of zero padding P (for one side). Output: a tensor of size W2 x H2 x D2, where

W2 = (W1 + 2P - F)/S + 1
H2 = (H1 + 2P - F)/S + 1
D2 = K

Number of parameters: (F * F * D1 + 1) * K weights (+1 for the bias). Common setting: F = 3, S = P = 1, and K a power of two.

The brain/neuron view of a CONV layer: for a 32x32x3 image and a 5x5x3 filter, each output number is the result of taking a dot product between the filter and one part of the image (a 5*5*3 = 75-dimensional dot product); it is just a neuron with local connectivity.
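The summary formulas translate directly into a small helper; a sketch that reproduces the example above (32x32x3 input, K = 10 filters of size F = 5, stride S = 1, padding P = 2).

```python
def conv_output_shape(W1, H1, D1, K, F, S, P):
    """Output size and parameter count of a convolution layer."""
    W2 = (W1 + 2 * P - F) // S + 1       # must divide evenly for a valid setting
    H2 = (H1 + 2 * P - F) // S + 1
    n_params = (F * F * D1 + 1) * K      # +1 per filter for the bias
    return (W2, H2, K), n_params

print(conv_output_shape(32, 32, 3, K=10, F=5, S=1, P=2))
# ((32, 32, 10), 760): 76 parameters per filter, 10 filters
```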

A 5x5 filter corresponds to a 5x5 receptive field for each neuron. E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3-D grid (28x28x5), and there are 5 different neurons all looking at the same region in the input volume. An activation map is a 28x28 sheet of neuron outputs where (1) each neuron is connected to a small region of the input and (2) all of them share parameters.

Another element: pooling. A pooling layer makes the representations smaller and more manageable, and operates over each activation map independently. Max pooling with a 2x2 filter and stride 2 is very common: within each depth slice, every 2x2 block is replaced by its maximum. Other types of pooling: average pooling, L2-norm pooling.
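A numpy sketch of 2x2 max pooling with stride 2, applied to each depth slice independently; the tiny input is made up for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W, D) with even H and W; 2x2 max pooling with stride 2 per depth slice."""
    H, W, D = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, D)
    return x.max(axis=(1, 3))            # max over each 2x2 block

x = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
print(max_pool_2x2(x)[:, :, 0])          # each output is the max of a 2x2 block
```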

Putting everything together. A typical architecture for CNNs:

Input → [[Conv → ReLU]*N → Pool]*M → [FC → ReLU]*Q → Softmax

Common choices: N up to about 5, Q up to about 2, and M large. Well-known CNNs: LeNet (1998), AlexNet (2012), ZF Net (2013), GoogLeNet (2014), VGGNet (2014), ResNet (2015), FractalNet (2017); the trend has been a revolution of depth.

How to train a CNN? How do we learn the filters/weights? Run the image through the layers of convolutions and train the filters with SGD/backpropagation. What computer architecture to use? Buy one or more GPUs (NVIDIA graphics cards). What software package to use? State of the art: TensorFlow plus Keras, or Caffe on top. See the demo at: cifar.html

Outline: (1) Review of last lecture, (2) Convolutional neural networks, (3) Kernel methods: motivation, dual formulation of linear regression, the kernel trick.
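Following the template above, a minimal sketch in tf.keras (the TensorFlow-plus-Keras combination mentioned in the slides) with N = 1, M = 2, Q = 1 for 32x32x3 inputs and 10 classes; the filter counts and sizes are arbitrary choices, not an architecture prescribed by the lecture.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    # [Conv -> ReLU] -> Pool, repeated M = 2 times
    tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    # [FC -> ReLU] -> Softmax
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
model.summary()
```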

Motivation. Recall the question: how do we choose a nonlinear basis φ : R^D → R^M for a model of the form w^T φ(x)? A neural network is one approach: learn φ from the data. The kernel method is another: sidestep the issue of choosing φ by using kernel functions. Kernel methods work for many problems; we take regularized linear regression as an example.

Case study: regularized linear regression. Recall the regularized least squares solution

w* = argmin_w F(w) = argmin_w ( ||Φw − y||_2^2 + λ||w||_2^2 ) = (Φ^T Φ + λI)^{-1} Φ^T y

where Φ is the N x M matrix with rows φ(x_1)^T, φ(x_2)^T, ..., φ(x_N)^T and y = (y_1, ..., y_N)^T. Here Φ^T Φ is an M x M matrix. Issue: we operate in the space R^M, and M could be huge or even infinite!

A closer look at the least squares solution. Setting the gradient of F(w) = ||Φw − y||_2^2 + λ||w||_2^2 to 0 gives

Φ^T(Φw* − y) + λw* = 0.

Next we move λw* to the other side and divide by λ:

w* = (1/λ) Φ^T (y − Φw*).

Define a new vector α = (y − Φw*)/λ. Then

w* = Φ^T α = Σ_{n=1}^N α_n φ(x_n),

so the least squares solution is a linear combination of the features! (The same is true for the perceptron and many other problems.) Of course, this calculation does not show what α is.

Why is this helpful? Assuming we know α, the prediction of w* on a new example x is

w*^T φ(x) = Σ_{n=1}^N α_n ( φ(x_n)^T φ(x) ).

Therefore we do not really need to know w*: only inner products in the new feature space matter. Kernel methods are exactly about computing inner products without knowing φ. But we still need to figure out what α is!
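A numpy sketch of the primal solution w* = (Φ^T Φ + λI)^{-1} Φ^T y; the basis φ and the data are made-up placeholders, chosen only to give Φ a concrete shape.

```python
import numpy as np

def ridge_primal(Phi, y, lam):
    """w* = (Phi^T Phi + lam I)^{-1} Phi^T y; solves an M x M linear system."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                       # N = 20 raw inputs in R^2 (made up)
Phi = np.hstack([X, X**2, np.ones((20, 1))])       # a hand-picked basis phi, so M = 5
y = rng.normal(size=20)
w = ridge_primal(Phi, y, lam=0.1)
print(w.shape)                                     # (5,)
```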

How to find α? Plugging w = Φ^T α into F(w) gives

G(α) = F(Φ^T α)
     = ||ΦΦ^T α − y||_2^2 + λ||Φ^T α||_2^2
     = ||ΦΦ^T α − y||_2^2 + λ α^T ΦΦ^T α
     = ||Kα − y||_2^2 + λ α^T Kα      (with K = ΦΦ^T).

We find α by minimizing G(α).

Gram matrix. Here K = ΦΦ^T is called the Gram matrix or kernel matrix; its (i, j) entry is φ(x_i)^T φ(x_j):

K = ΦΦ^T = [ φ(x_i)^T φ(x_j) ]_{i,j=1}^N ∈ R^{N x N}.

Example of a kernel matrix: for data points x_1, x_2, x_3 ∈ R and a low-degree polynomial basis φ(x) = (1, x, x^2, ...)^T, the Gram matrix is the 3x3 matrix with entries K_ij = φ(x_i)^T φ(x_j).
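A sketch of the Gram matrix computation with a hypothetical set of three 1-D points and a low-degree polynomial basis (the specific values here are illustrative, not the ones used in the lecture).

```python
import numpy as np

def poly_basis(x, degree=3):
    """phi(x) = (1, x, x^2, ..., x^degree) for a scalar x."""
    return np.array([x**k for k in range(degree + 1)], dtype=float)

xs = np.array([-1.0, 0.0, 1.0])                 # made-up 1-D data points
Phi = np.stack([poly_basis(x) for x in xs])     # N x M feature matrix
K = Phi @ Phi.T                                 # Gram matrix: entry (i,j) = phi(x_i)^T phi(x_j)
print(K)
```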

Gram matrix vs. covariance matrix: ΦΦ^T is N x N with entry (i, j) equal to φ(x_i)^T φ(x_j), while Φ^T Φ is M x M with entry (i, j) equal to Σ_n φ(x_n)_i φ(x_n)_j; both are symmetric and positive semidefinite.

How to find α? Differentiate

G(α) = F(Φ^T α) = ||Kα − y||_2^2 + λ α^T Kα

and set the gradient to 0. Since K is symmetric,

0 = 2K(Kα − y) + 2λKα = 2K( (K + λI)α − y ),

so any α satisfying (K + λI)α = y is a stationary point. It follows that α* = (K + λI)^{-1} y is a minimizer, and we obtain

w* = Φ^T α* = Φ^T (K + λI)^{-1} y.

Comparing the two solutions. Minimizing F(w) gives w* = (Φ^T Φ + λI)^{-1} Φ^T y; minimizing G(α) gives w* = Φ^T (ΦΦ^T + λI)^{-1} y. (Note that I has different dimensions in the two formulas.) Natural question: are they the same or different? They are the same, and here is the proof (with I_1 the M x M identity and I_2 the N x N identity, and K = ΦΦ^T):

(Φ^T Φ + λI_1)^{-1} Φ^T y
 = (Φ^T Φ + λI_1)^{-1} Φ^T (ΦΦ^T + λI_2)(ΦΦ^T + λI_2)^{-1} y
 = (Φ^T Φ + λI_1)^{-1} (Φ^T ΦΦ^T + λΦ^T)(ΦΦ^T + λI_2)^{-1} y
 = (Φ^T Φ + λI_1)^{-1} (Φ^T Φ + λI_1) Φ^T (ΦΦ^T + λI_2)^{-1} y
 = Φ^T (ΦΦ^T + λI_2)^{-1} y.

Then what is the difference? First, computing (ΦΦ^T + λI)^{-1} = (K + λI)^{-1} can be more efficient than computing (Φ^T Φ + λI)^{-1} when N < M. More importantly, computing α = (K + λI)^{-1} y also only requires computing inner products in the new feature space!

We can now conclude that the exact form of φ(·) is not essential; all we need is to compute inner products φ(x)^T φ(x'). For some φ it is indeed possible to compute φ(x)^T φ(x') without computing or even knowing φ. This is the kernel trick.
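A quick numerical check, under made-up Φ and y, that the primal and dual formulas give the same w: α = (K + λI)^{-1} y with w = Φ^T α, versus (Φ^T Φ + λI)^{-1} Φ^T y.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))      # N = 20 examples, M = 5 features (made up)
y = rng.normal(size=20)
lam = 0.1

# primal: solve an M x M system
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(5), Phi.T @ y)

# dual: solve an N x N system built only from inner products
K = Phi @ Phi.T
alpha = np.linalg.solve(K + lam * np.eye(20), y)
w_dual = Phi.T @ alpha

print(np.allclose(w_primal, w_dual))   # True: the two solutions coincide
```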

Example of the kernel trick. Consider the polynomial basis φ : R^2 → R^3 given by

φ(x) = ( x_1^2, √2 x_1 x_2, x_2^2 )^T.

What is the inner product between φ(x) and φ(x')?

φ(x)^T φ(x') = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = (x^T x')^2.

Therefore the inner product in the new space is simply a function of the inner product in the original space. (Count the number of multiplications each way.)

Another example: φ : R^D → R^{2D}, parameterized by θ,

φ_θ(x) = ( cos(θ x_1), sin(θ x_1), ..., cos(θ x_D), sin(θ x_D) )^T.

What is the inner product between φ_θ(x) and φ_θ(x')?

φ_θ(x)^T φ_θ(x') = Σ_{d=1}^D [ cos(θ x_d) cos(θ x_d') + sin(θ x_d) sin(θ x_d') ] = Σ_{d=1}^D cos(θ (x_d − x_d')).

Once again, the inner product in the new space is a simple function of the features in the original space.

A more complicated example: based on the previous mapping φ_θ, define φ_L : R^D → R^{2DL} by stacking and rescaling,

φ_L(x) = (1/√L) ( φ_{2π·1/L}(x), φ_{2π·2/L}(x), ..., φ_{2π·L/L}(x) ).

What is the inner product between φ_L(x) and φ_L(x')?

φ_L(x)^T φ_L(x') = (1/L) Σ_{l=1}^L φ_{2πl/L}(x)^T φ_{2πl/L}(x') = (1/L) Σ_{l=1}^L Σ_{d=1}^D cos( (2πl/L)(x_d − x_d') ).

Infinite-dimensional mapping: let L → ∞, so that φ_L(x) becomes a vector of infinite dimension. Clearly we cannot compute φ_L(x), but we can still compute the inner product, whose limit is an average over θ ∈ [0, 2π]:

φ_∞(x)^T φ_∞(x') = (1/2π) ∫_0^{2π} Σ_{d=1}^D cos(θ (x_d − x_d')) dθ = Σ_{d=1}^D sin(2π (x_d − x_d')) / (2π (x_d − x_d')).

Again, a simple function of the original features. Note that using this mapping in linear regression, we are learning a weight vector w with infinitely many dimensions!
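A quick check of the first example: the explicit map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) gives the same inner product as the kernel (x^T x')^2; the two random inputs are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 map from R^2 to R^3."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(xp)        # inner product in the new feature space
rhs = (x @ xp) ** 2           # kernel trick: a function of the original inner product
print(np.isclose(lhs, rhs))   # True
```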

Kernel functions. Definition: a function k : R^D x R^D → R is called a (positive semidefinite) kernel function if there exists a function φ : R^D → R^M such that for any x, x' ∈ R^D,

k(x, x') = φ(x)^T φ(x').

Examples we have seen:

k(x, x') = (x^T x')^2
k(x, x') = Σ_{d=1}^D sin(2π (x_d − x_d')) / (2π (x_d − x_d'))

Using kernel functions. Choosing a nonlinear basis φ becomes choosing a kernel function. As long as computing the kernel function is more efficient than computing the features explicitly, we should apply the kernel trick. The Gram/kernel matrix becomes

K = ΦΦ^T = [ k(x_i, x_j) ]_{i,j=1}^N.

In fact, k is a kernel if and only if K is positive semidefinite for any x_1, x_2, ..., x_N (Mercer's theorem).

The prediction on a new example x is

w^T φ(x) = Σ_{n=1}^N α_n φ(x_n)^T φ(x) = Σ_{n=1}^N α_n k(x_n, x).

More examples of kernel functions. The two most commonly used kernels in practice are the polynomial kernel

k(x, x') = (x^T x' + c)^d    for c ≥ 0 and a positive integer d,

and the Gaussian kernel, or radial basis function (RBF) kernel,

k(x, x') = exp( −||x − x'||_2^2 / (2σ^2) )    for some σ > 0.

Think about what the corresponding φ is for each kernel.
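Sketches of the two kernels, plus a Mercer-style sanity check that a kernel matrix built from them is positive semidefinite; the data and hyperparameters (c, d, σ) are arbitrary choices.

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, d=2):
    """Polynomial kernel (x^T x' + c)^d."""
    return (x @ xp + c) ** d

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian / RBF kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                               # arbitrary data
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])   # kernel matrix
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # True: K is positive semidefinite (up to round-off)
```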

Composing kernels. We can create more kernel functions using the following rules: if k_1(·,·) and k_2(·,·) are kernels, then the following are kernels too:

linear combination: α k_1(·,·) + β k_2(·,·) for α, β ≥ 0
product: k_1(·,·) k_2(·,·)
exponential: e^{k(·,·)}

(Verify these using the definition of a kernel!)

Examples that are not kernels: the function k(x, x') = ||x − x'||_2^2 is not a kernel. Why? If it were a kernel, the kernel matrix for two data points x_1 and x_2,

K = [ k(x_1, x_1)  k(x_1, x_2) ; k(x_2, x_1)  k(x_2, x_2) ] = [ 0  ||x_1 − x_2||_2^2 ; ||x_1 − x_2||_2^2  0 ],

would have to be positive semidefinite. But is it? (Its eigenvalues are ±||x_1 − x_2||_2^2.)

Kernelizing other ML algorithms. The kernel trick is applicable to many ML algorithms: the nearest neighbor classifier, the perceptron, logistic regression, SVMs.

Example: kernelizing NNC. For NNC with the L2 distance, the key is to compute, for any two points x, x',

d(x, x') = ||x − x'||_2^2 = x^T x + x'^T x' − 2 x^T x'.

With a kernel function k, we simply compute

d_kernel(x, x') = k(x, x) + k(x', x') − 2 k(x, x'),

which by definition is the L2 distance in the new feature space:

d_kernel(x, x') = ||φ(x) − φ(x')||_2^2.
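A sketch of the kernelized distance d_kernel(x, x') = k(x, x) + k(x', x') − 2 k(x, x'), together with a check that ||x − x'||_2^2 itself fails the positive semidefiniteness test on two points.

```python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def kernel_distance(x, xp, k):
    """Squared L2 distance in the feature space induced by kernel k."""
    return k(x, x) + k(xp, xp) - 2 * k(x, xp)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=3), rng.normal(size=3)
print(kernel_distance(x1, x2, rbf_kernel))

# ||x - x'||^2 is not a kernel: its 2x2 "kernel matrix" has a negative eigenvalue
s = np.sum((x1 - x2) ** 2)
K = np.array([[0.0, s], [s, 0.0]])
print(np.linalg.eigvalsh(K))   # eigenvalues -s and +s, so K is not PSD
```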

Lecture 7 Convolutional Neural Networks

Lecture 7 Convolutional Neural Networks Lecture 7 Convolutional Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 17, 2017 We saw before: ŷ x 1 x 2 x 3 x 4 A series of matrix multiplications:

More information

Deep Learning (CNNs)

Deep Learning (CNNs) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Deep Learning (CNNs) Deep Learning Readings: Murphy 28 Bishop - - HTF - - Mitchell

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

Lecture 17: Neural Networks and Deep Learning

Lecture 17: Neural Networks and Deep Learning UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

More information

Introduction to Convolutional Neural Networks (CNNs)

Introduction to Convolutional Neural Networks (CNNs) Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei

More information

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 CS 1674: Intro to Computer Vision Final Review Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 Final info Format: multiple-choice, true/false, fill in the blank, short answers, apply an

More information

Deep Learning. Convolutional Neural Network (CNNs) Ali Ghodsi. October 30, Slides are partially based on Book in preparation, Deep Learning

Deep Learning. Convolutional Neural Network (CNNs) Ali Ghodsi. October 30, Slides are partially based on Book in preparation, Deep Learning Convolutional Neural Network (CNNs) University of Waterloo October 30, 2015 Slides are partially based on Book in preparation, by Bengio, Goodfellow, and Aaron Courville, 2015 Convolutional Networks Convolutional

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets)

Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) Sanjeev Arora Elad Hazan Recap: Structure of a deep

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS LAST TIME Intro to cudnn Deep neural nets using cublas and cudnn TODAY Building a better model for image classification Overfitting

More information

Convolutional Neural Network Architecture

Convolutional Neural Network Architecture Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation

More information

Convolutional neural networks

Convolutional neural networks 11-1: Convolutional neural networks Prof. J.C. Kao, UCLA Convolutional neural networks Motivation Biological inspiration Convolution operation Convolutional layer Padding and stride CNN architecture 11-2:

More information

Loss Functions and Optimization. Lecture 3-1

Loss Functions and Optimization. Lecture 3-1 Lecture 3: Loss Functions and Optimization Lecture 3-1 Administrative Assignment 1 is released: http://cs231n.github.io/assignments2017/assignment1/ Due Thursday April 20, 11:59pm on Canvas (Extending

More information

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016

Machine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text

More information

Loss Functions and Optimization. Lecture 3-1

Loss Functions and Optimization. Lecture 3-1 Lecture 3: Loss Functions and Optimization Lecture 3-1 Administrative: Live Questions We ll use Zoom to take questions from remote students live-streaming the lecture Check Piazza for instructions and

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

CSCI 315: Artificial Intelligence through Deep Learning

CSCI 315: Artificial Intelligence through Deep Learning CSCI 35: Artificial Intelligence through Deep Learning W&L Fall Term 27 Prof. Levy Convolutional Networks http://wernerstudio.typepad.com/.a/6ad83549adb53ef53629ccf97c-5wi Convolution: Convolution is

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

CSCI567 Machine Learning (Fall 2018)

CSCI567 Machine Learning (Fall 2018) CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due

More information

Neural networks COMS 4771

Neural networks COMS 4771 Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a

More information

Neural Networks 2. 2 Receptive fields and dealing with image inputs

Neural Networks 2. 2 Receptive fields and dealing with image inputs CS 446 Machine Learning Fall 2016 Oct 04, 2016 Neural Networks 2 Professor: Dan Roth Scribe: C. Cheng, C. Cervantes Overview Convolutional Neural Networks Recurrent Neural Networks 1 Introduction There

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation) Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

Convolutional Neural Networks

Convolutional Neural Networks Convolutional Neural Networks Books» http://www.deeplearningbook.org/ Books http://neuralnetworksanddeeplearning.com/.org/ reviews» http://www.deeplearningbook.org/contents/linear_algebra.html» http://www.deeplearningbook.org/contents/prob.html»

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Lecture 35: Optimization and Neural Nets

Lecture 35: Optimization and Neural Nets Lecture 35: Optimization and Neural Nets CS 4670/5670 Sean Bell DeepDream [Google, Inceptionism: Going Deeper into Neural Networks, blog 2015] Aside: CNN vs ConvNet Note: There are many papers that use

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Introduction to CNN and PyTorch

Introduction to CNN and PyTorch Introduction to CNN and PyTorch Kripasindhu Sarkar kripasindhu.sarkar@dfki.de Kaiserslautern University, DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de Some of the contents

More information

Machine Learning Basics

Machine Learning Basics Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Lecture 9 Numerical optimization and deep learning Niklas Wahlström Division of Systems and Control Department of Information Technology Uppsala University niklas.wahlstrom@it.uu.se

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17 3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural

More information

Introduction to Convolutional Neural Networks 2018 / 02 / 23

Introduction to Convolutional Neural Networks 2018 / 02 / 23 Introduction to Convolutional Neural Networks 2018 / 02 / 23 Buzzword: CNN Convolutional neural networks (CNN, ConvNet) is a class of deep, feed-forward (not recurrent) artificial neural networks that

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

Introduction to Machine Learning (67577)

Introduction to Machine Learning (67577) Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks

More information

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 8. Sinan Kalkan

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 8. Sinan Kalkan CENG 783 Special topics in Deep Learning AlchemyAPI Week 8 Sinan Kalkan Loss functions Many correct labels case: Binary prediction for each label, independently: L i = σ j max 0, 1 y ij f j y ij = +1 if

More information

Convolutional Neural Networks. Srikumar Ramalingam

Convolutional Neural Networks. Srikumar Ramalingam Convolutional Neural Networks Srikumar Ramalingam Reference Many of the slides are prepared using the following resources: neuralnetworksanddeeplearning.com (mainly Chapter 6) http://cs231n.github.io/convolutional-networks/

More information

Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks. Sargur N. Srihari Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1 Neural Networks Outline

More information

Neural Architectures for Image, Language, and Speech Processing

Neural Architectures for Image, Language, and Speech Processing Neural Architectures for Image, Language, and Speech Processing Karl Stratos June 26, 2018 1 / 31 Overview Feedforward Networks Need for Specialized Architectures Convolutional Neural Networks (CNNs) Recurrent

More information

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 8 Feb. 12, 2018 1 10-601 Introduction

More information

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer CSE446: Neural Networks Spring 2017 Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer Human Neurons Switching time ~ 0.001 second Number of neurons 10 10 Connections per neuron 10 4-5 Scene

More information

Convolution and Pooling as an Infinitely Strong Prior

Convolution and Pooling as an Infinitely Strong Prior Convolution and Pooling as an Infinitely Strong Prior Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Convolutional

More information

Νεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing

Νεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing Νεςπο-Ασαυήρ Υπολογιστική Neuro-Fuzzy Computing ΗΥ418 Διδάσκων Δημήτριος Κατσαρός @ Τμ. ΗΜΜΥ Πανεπιστήμιο Θεσσαλίαρ Διάλεξη 21η BackProp for CNNs: Do I need to understand it? Why do we have to write the

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

Deep Learning: a gentle introduction

Deep Learning: a gentle introduction Deep Learning: a gentle introduction Jamal Atif jamal.atif@dauphine.fr PSL, Université Paris-Dauphine, LAMSADE February 8, 206 Jamal Atif (Université Paris-Dauphine) Deep Learning February 8, 206 / Why

More information

Neural Networks: Backpropagation

Neural Networks: Backpropagation Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

More information

Neural networks and support vector machines

Neural networks and support vector machines Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith

More information

Recurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST

Recurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST 1 Recurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST Summary We have shown: Now First order optimization methods: GD (BP), SGD, Nesterov, Adagrad, ADAM, RMSPROP, etc. Second

More information

10-701/ Recitation : Kernels

10-701/ Recitation : Kernels 10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer

More information

CSC321 Lecture 16: ResNets and Attention

CSC321 Lecture 16: ResNets and Attention CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the

More information

Training Neural Networks Practical Issues

Training Neural Networks Practical Issues Training Neural Networks Practical Issues M. Soleymani Sharif University of Technology Fall 2017 Most slides have been adapted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017, and some from

More information

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2

Nearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2 Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Spatial Transformer Networks

Spatial Transformer Networks BIL722 - Deep Learning for Computer Vision Spatial Transformer Networks Max Jaderberg Andrew Zisserman Karen Simonyan Koray Kavukcuoglu Contents Introduction to Spatial Transformers Related Works Spatial

More information

Introduction to Deep Learning

Introduction to Deep Learning Introduction to Deep Learning A. G. Schwing & S. Fidler University of Toronto, 2015 A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2015 1 / 39 Outline 1 Universality of Neural Networks

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

SVMs: nonlinearity through kernels

SVMs: nonlinearity through kernels Non-separable data e-8. Support Vector Machines 8.. The Optimal Hyperplane Consider the following two datasets: SVMs: nonlinearity through kernels ER Chapter 3.4, e-8 (a) Few noisy data. (b) Nonlinearly

More information

Introduction to Deep Neural Networks

Introduction to Deep Neural Networks Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic

More information

Handwritten Indic Character Recognition using Capsule Networks

Handwritten Indic Character Recognition using Capsule Networks Handwritten Indic Character Recognition using Capsule Networks Bodhisatwa Mandal,Suvam Dubey, Swarnendu Ghosh, RiteshSarkhel, Nibaran Das Dept. of CSE, Jadavpur University, Kolkata, 700032, WB, India.

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

Deep Learning Lecture 2

Deep Learning Lecture 2 Fall 2016 Machine Learning CMPSCI 689 Deep Learning Lecture 2 Sridhar Mahadevan Autonomous Learning Lab UMass Amherst COLLEGE Outline of lecture New type of units Convolutional units, Rectified linear

More information

Introduction to Deep Learning

Introduction to Deep Learning Introduction to Deep Learning Some slides and images are taken from: David Wolfe Corne Wikipedia Geoffrey A. Hinton https://www.macs.hw.ac.uk/~dwcorne/teaching/introdl.ppt Feedforward networks for function

More information

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 9 Sep. 26, 2018

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 9 Sep. 26, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 9 Sep. 26, 2018 1 Reminders Homework 3:

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Feature Design. Feature Design. Feature Design. & Deep Learning

Feature Design. Feature Design. Feature Design. & Deep Learning Artificial Intelligence and its applications Lecture 9 & Deep Learning Professor Daniel Yeung danyeung@ieee.org Dr. Patrick Chan patrickchan@ieee.org South China University of Technology, China Appropriately

More information

CSC321 Lecture 20: Reversible and Autoregressive Models

CSC321 Lecture 20: Reversible and Autoregressive Models CSC321 Lecture 20: Reversible and Autoregressive Models Roger Grosse Roger Grosse CSC321 Lecture 20: Reversible and Autoregressive Models 1 / 23 Overview Four modern approaches to generative modeling:

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Each new feature uses a pair of the original features. Problem: Mapping usually leads to the number of features blow up!

Each new feature uses a pair of the original features. Problem: Mapping usually leads to the number of features blow up! Feature Mapping Consider the following mapping φ for an example x = {x 1,...,x D } φ : x {x1,x 2 2,...,x 2 D,,x 2 1 x 2,x 1 x 2,...,x 1 x D,...,x D 1 x D } It s an example of a quadratic mapping Each new

More information

Understanding How ConvNets See

Understanding How ConvNets See Understanding How ConvNets See Slides from Andrej Karpathy Springerberg et al, Striving for Simplicity: The All Convolutional Net (ICLR 2015 workshops) CSC321: Intro to Machine Learning and Neural Networks,

More information

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture

Neural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture Neural Networks for Machine Learning Lecture 2a An overview of the main types of neural network architecture Geoffrey Hinton with Nitish Srivastava Kevin Swersky Feed-forward neural networks These are

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1

Kernel Methods. Foundations of Data Analysis. Torsten Möller. Möller/Mori 1 Kernel Methods Foundations of Data Analysis Torsten Möller Möller/Mori 1 Reading Chapter 6 of Pattern Recognition and Machine Learning by Bishop Chapter 12 of The Elements of Statistical Learning by Hastie,

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux INRIA 8 Nov 2011 Nicolas Le Roux (INRIA) Neural networks and optimization 8 Nov 2011 1 / 80 1 Introduction 2 Linear classifier 3 Convolutional neural networks

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML) Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017 Stochastic Gradient Descent: The Workhorse of Machine Learning CS6787 Lecture 1 Fall 2017 Fundamentals of Machine Learning? Machine Learning in Practice this course What s missing in the basic stuff? Efficiency!

More information

Deep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017

Deep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017 Deep Learning for Natural Language Processing Sidharth Mudgal April 4, 2017 Table of contents 1. Intro 2. Word Vectors 3. Word2Vec 4. Char Level Word Embeddings 5. Application: Entity Matching 6. Conclusion

More information

Lecture 7: Kernels for Classification and Regression

Lecture 7: Kernels for Classification and Regression Lecture 7: Kernels for Classification and Regression CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 15, 2011 Outline Outline A linear regression problem Linear auto-regressive

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

Theories of Deep Learning

Theories of Deep Learning Theories of Deep Learning Lecture 02 Donoho, Monajemi, Papyan Department of Statistics Stanford Oct. 4, 2017 1 / 50 Stats 385 Fall 2017 2 / 50 Stats 285 Fall 2017 3 / 50 Course info Wed 3:00-4:20 PM in

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

SVMs, Duality and the Kernel Trick

SVMs, Duality and the Kernel Trick SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

More information

Neural Network Language Modeling

Neural Network Language Modeling Neural Network Language Modeling Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Marek Rei, Philipp Koehn and Noah Smith Course Project Sign up your course project In-class presentation

More information

Lecture 4 Backpropagation

Lecture 4 Backpropagation Lecture 4 Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 5, 2017 Things we will look at today More Backpropagation Still more backpropagation Quiz

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Machine Learning Lecture 10

Machine Learning Lecture 10 Machine Learning Lecture 10 Neural Networks 26.11.2018 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Today s Topic Deep Learning 2 Course Outline Fundamentals Bayes

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information