Topic 3: Neural Networks


CS 4850/6850: Introduction to Machine Learning, Fall 2018
Topic 3: Neural Networks
Instructor: Daniel L. Pimentel-Alarcón. © Copyright 2018.

3.1 Introduction

Neural networks are arguably the main reason why machine learning is making such a big splash. Like logistic regression and random forests, their main purpose is prediction/classification. Classical applications of neural networks include:

1. Image classification. For example, determining the digit contained in an image, as in Figure 3.1.

2. Natural language processing. For example, interpreting voice commands, as Siri and Alexa do.

3. Stock market prediction.

Each of these problems can be rephrased mathematically as finding an unknown (and highly complex) function f such that, given certain data x, f(x) gives the desired classification/prediction. In our examples:

1. Image classification. Here x is a vectorized image of a digit, and f(x) ∈ {0, ..., 9} is the digit depicted in that image. For the images x_1, x_2, ..., x_5 in Figure 3.1, f(x_1) = 5, f(x_2) = 0, ..., f(x_5) = 9.

2. Natural language processing. Here x is a vector containing the values of a voice signal, as in Figure 3.2, and f(x) is its interpretation, in this case "turn off the t.v.".

3. Stock market prediction. Here x could be a sequence of stock market prices at times t = 1, ..., T, and f(x) could be the prediction of the price at time T + 1.

All of these tasks can be tremendously challenging. For example, whether an image contains a 0, a 1, or any other digit depends not only on the values of isolated pixels, but on the way that pixels interact with one another in complex manners. The main idea behind artificial neural networks (ANNs) is to use a sequence of simpler functions that interact with one another in a networked way, so that combined, they approximate f with arbitrary precision.

Figure 3.1: Digit images from the MNIST dataset. The goal is to automatically determine the digit contained in each image, in this case {5, 0, 4, 1, 9}.

Figure 3.2: Voice signal corresponding to the command "turn off the t.v.".

3.2 Intuition

To build some intuition, suppose we want to estimate the following function f:

Figure 3.3: Function f(x) that we want to estimate.

One option is to use a Riemann-type approximation

$$\hat{f} \;=\; \sum_{l=1}^{L} g_l,$$

where the g_l are step functions of the form

$$g_l(x) \;=\; \begin{cases} w_l & \text{if } \beta(l-1) \le x < \beta l, \\ 0 & \text{otherwise.} \end{cases}$$

Here w_l indicates the height of f in the l-th interval [β(l−1), βl). This type of approximation would look something like a staircase sitting on top of f.
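To make this concrete, here is a minimal Python sketch of the Riemann-type approximation. The target function, the interval width β = 0.5, the choice of interval midpoints for the heights w_l, and all names are illustrative, not part of the notes.

```python
import numpy as np

def riemann_approx(f, beta, L):
    """Piecewise-constant approximation of f on [0, beta*L).

    The l-th step g_l takes the value w_l (here, f at the midpoint of the
    l-th interval) on [beta*(l-1), beta*l) and 0 elsewhere, so f_hat = sum_l g_l.
    """
    # heights w_1, ..., w_L, one per interval
    w = np.array([f(beta * (l - 0.5)) for l in range(1, L + 1)])

    def f_hat(x):
        l = int(x // beta)              # index of the interval containing x
        return w[l] if 0 <= l < L else 0.0

    return f_hat

# Example with an arbitrary smooth target (purely illustrative):
f = lambda x: np.sin(x) + 0.5 * x
f_hat = riemann_approx(f, beta=0.5, L=21)
print(f(3.1), f_hat(3.1))
```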

Notice that the parameters w_1, ..., w_L, together with the interval width β, determine f̂. Unfortunately, estimating this type of function requires observing at least one sample x in each interval [β(l−1), βl). For the figure above, this would require at least 21 samples. This may not sound like much. However, as we move to higher dimensions this number scales exponentially and quickly becomes prohibitively large: if f is a multivariate function of x ∈ R^d (instead of a single-variable function of x ∈ R), then a similar approximation would require 21^d samples. With d as small as 10, this would mean more than 1 trillion samples. As we will see, in most of the big-data applications that we are interested in, d is often in the order of hundreds, and often thousands or millions, whence this approach is completely infeasible.

Another option is to allow intervals of varying size, such that

$$g_l(x) \;=\; \begin{cases} w_l & \text{if } \beta_{l-1} \le x < \beta_l, \\ 0 & \text{otherwise.} \end{cases}$$

In this case the parameters are the weights w_1, ..., w_L and the interval boundaries β_0, ..., β_L. This type of approximation looks like a staircase with unequal step widths, and will potentially require far fewer samples. The downside is that now f̂ may be quite inaccurate (on top of being discontinuous). To address this, we can make each g_l a simple nonlinear yet continuous function. For example, a sigmoid:

$$\sigma(x) \;=\; \frac{1}{1 + e^{-x}}, \tag{3.1}$$

which looks like an S-shaped curve transitioning smoothly from 0 to 1. Notice that σ(wx − β) shifts the function by β and squeezes it by w. So if we let g(x) = σ(wx − β), choose the right parameters w and β, and add a bunch of these functions, then we can approximate f much more accurately.
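As a quick numerical illustration of this shift-and-squeeze behavior (the parameter values below are chosen arbitrarily):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x}), as in (3.1)
    return 1.0 / (1.0 + np.exp(-x))

def g(x, w, beta):
    # one sigmoid building block: g(x) = sigma(w*x - beta)
    return sigmoid(w * x - beta)

x = np.linspace(-5, 5, 11)
print(g(x, w=1.0, beta=0.0))   # the plain sigmoid
print(g(x, w=5.0, beta=10.0))  # squeezed by w = 5, shifted so it "turns on" near x = beta/w = 2
```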

Moreover, if f is a function of more than one variable, say of a vector x ∈ R^d, then we can generalize g in a very natural way:

$$g(x) \;=\; \sigma(w^T x - \beta), \tag{3.2}$$

where now w ∈ R^d is a parameter vector. Just as the step function is the building block of the Riemann-type approximation above, this type of function g is precisely the building block (often called a neuron) of a neural network, usually depicted as a node that receives the entries of x and outputs g(x).

The main insight of a neural network is that by adding a bunch of these building blocks, one can approximate f with arbitrary accuracy. For example, with two neurons we would obtain

$$\hat{f}(x) \;=\; g_1(x) + g_2(x).$$

There is no reason why we cannot attach a weight to each g, nor add a final shift, to obtain

$$\hat{f}(x) \;=\; w_1 g_1(x) + w_2 g_2(x) - \beta \;=\; \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} g_1(x) \\ g_2(x) \end{bmatrix} - \beta \;=:\; w^T g(x) - \beta,$$

where w_1, w_2, and β are new parameters.

There is no reason to stop there. We can add more neurons, and more layers, to obtain more powerful networks, capable of approximating more complex functions. Let L be the number of layers, and let n_l denote the number of neurons in the l-th layer. For l = 2, 3, ..., L, let W^l ∈ R^{n_l × n_{l−1}} be the matrix formed by transposing and stacking the weight vectors of the n_l neurons in the l-th layer, and let β^l ∈ R^{n_l} be the vector containing the shifting coefficients (often called bias coefficients) of the neurons in the l-th layer, such that the output at the second layer is given by

$$g_2(x) \;=\; \sigma(W^2 x - \beta^2).$$
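The following sketch shows how a single neuron as in (3.2), and a whole layer of n_l neurons stacked into W^l and β^l, can be evaluated. The dimensions and random parameters are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, beta):
    # one neuron: g(x) = sigma(w^T x - beta), as in (3.2)
    return sigmoid(w @ x - beta)

def layer(x, W, beta):
    # a full layer applies n_l neurons at once: sigma(W x - beta),
    # with W in R^{n_l x n_{l-1}} and beta in R^{n_l}
    return sigmoid(W @ x - beta)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                        # input in R^3
W2, b2 = rng.standard_normal((4, 3)), rng.standard_normal(4)
print(neuron(x, W2[0], b2[0]))                    # output of one neuron
print(layer(x, W2, b2))                           # output of the second layer, in R^4
```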

Figure 3.4: Components of a neural network.

For l = 3, ..., L, the output at the l-th layer is given by

$$g_l(x) \;=\; \sigma(W^l g_{l-1}(x) - \beta^l),$$

where g_L(x) is the final output of the network, also denoted as

$$\hat{f}(x) \;:=\; \sigma\big(W^L \sigma(W^{L-1} \cdots \sigma(W^3 \sigma(W^2 x - \beta^2) - \beta^3) \cdots - \beta^{L-1}) - \beta^L\big). \tag{3.3}$$

Notice that f̂(x) ∈ R^{n_L} may be a vector (if n_L > 1), as opposed to a scalar (if n_L = 1), so neural networks allow us to infer vector-valued functions. See Figure 3.4 to build some intuition.

Going back to estimating the function in Figure 3.3, we can construct a small network of this form. Depending on the parameters {W^l, β^l}_{l=2}^L that we choose, it may produce a variety of different functions.
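Here is a minimal forward-pass sketch of (3.3). It assumes the parameter matrices are simply given; the tiny 1 → 4 → 1 architecture and the random parameters are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, betas):
    """Evaluate f_hat(x) as in (3.3).

    Ws and betas hold (W^2, ..., W^L) and (beta^2, ..., beta^L);
    layer l computes g_l = sigma(W^l g_{l-1} - beta^l), with g_1 = x.
    """
    g = x
    for W, beta in zip(Ws, betas):
        g = sigmoid(W @ g - beta)
    return g

# A tiny 1 -> 4 -> 1 network (random parameters, purely for illustration):
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 1)), rng.standard_normal((1, 4))]
betas = [rng.standard_normal(4), rng.standard_normal(1)]
print(forward(np.array([0.3]), Ws, betas))
```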

With a neural network like this we can obtain a close approximation of the function in Figure 3.3. However, this requires identifying the right parameters {W^l, β^l}_{l=2}^L that produce this approximation.

3.3 Identifying the Right Parameters

As discussed before, with the right parameters {W^l, β^l}_{l=2}^L, the function f̂(x) in (3.3) can approximate any function f(x) with arbitrary accuracy. The challenge is to find those right parameters. Hence, our goal can be summarized as follows: find the parameters {W^l, β^l}_{l=2}^L of the function f̂ in (3.3) such that f̂(x) ≈ f(x).

Recall that we do not know f. However, we do have training data. More precisely, we have N samples x_1, x_2, ..., x_N, as well as their responses y_1, y_2, ..., y_N. For example, if we were studying diabetes, x_i could contain demographic information about the i-th person in our study, and y_i could be a binary variable indicating whether this person is diabetic or not. In other words, we know that f(x_i) = y_i for every i = 1, 2, ..., N. Intuitively, this means that we get to see N snapshots of f at the points x_i. Our goal is to exploit this information to find the parameters {W^l, β^l}_{l=2}^L such that f̂ ≈ f, so that the network can reproduce the response y whenever a new vector x is fed to it. This can be done by minimizing the error (over all training data) between the network's prediction f̂(x_i) and its corresponding observation y_i. Mathematically, we can achieve this by solving the following optimization:

$$\min_{\{W^l, \beta^l\}_{l=2}^L} \ \sum_{i=1}^{N} \left\| y_i - \hat{f}(x_i) \right\|^2. \tag{3.4}$$

The most widely used technique to solve (3.4) is stochastic gradient descent with back-propagation. The key insight is that the gradient with respect to each weight matrix can be computed backwards, in terms of the gradients with respect to the weight matrices of subsequent layers. To see this, first define z_i^1 = y_i^1 = x_i, and then for l = 2, 3, ..., L recursively define

$$z_i^l \;:=\; W^l y_i^{l-1} - \beta^l,$$

where y_i^l := g_l(x_i) = σ(z_i^l) denotes the output at the l-th layer, so that f̂(x_i) = y_i^L. Define the cost of the i-th sample as c_i := y_i − f̂(x_i), and

$$\delta_i^L \;:=\; c_i \odot \sigma'(z_i^L), \qquad \delta_i^l \;:=\; \big[ (W^{l+1})^T \delta_i^{l+1} \big] \odot \sigma'(z_i^l), \quad 2 \le l \le L-1,$$

where ⊙ represents the Hadamard (entry-wise) product and σ' represents the derivative of σ. Then with a simple chain rule we obtain the following gradients:

$$\nabla_i W^l \;:=\; \frac{\partial \|c_i\|^2}{\partial W^l} \;=\; -2\, \delta_i^l (y_i^{l-1})^T, \qquad \nabla_i \beta^l \;:=\; \frac{\partial \|c_i\|^2}{\partial \beta^l} \;=\; 2\, \delta_i^l, \qquad 2 \le l \le L. \tag{3.5}$$

Training the neural network is equivalent to identifying the parameters {W^l, β^l}_{l=2}^L such that f̂ ≈ f. This can be done using stochastic gradient descent, which iteratively updates the parameters of interest according to (3.5). More precisely, at each training time t we select a random subsample Ω_t ⊂ {1, 2, ..., N} of the training data (hence the term stochastic), and we update our parameters as follows:

$$W_t^l \;=\; W_{t-1}^l - \eta \sum_{i \in \Omega_t} \nabla_i W_{t-1}^l, \qquad \beta_t^l \;=\; \beta_{t-1}^l - \eta \sum_{i \in \Omega_t} \nabla_i \beta_{t-1}^l,$$

where η is the gradient step parameter to be tuned. This iteration is repeated until convergence to obtain the final parameters {Ŵ^l, β̂^l}_{l=2}^L.
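Below is a compact sketch of back-propagation and the stochastic gradient update, following (3.5). The function names, the mini-batch format, and the step size η = 0.1 are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, betas):
    """Gradients of ||y - f_hat(x)||^2 w.r.t. each W^l and beta^l, as in (3.5)."""
    # Forward pass: store z^l and y^l for every layer (y^1 = x).
    ys, zs = [x], []
    for W, beta in zip(Ws, betas):
        zs.append(W @ ys[-1] - beta)
        ys.append(sigmoid(zs[-1]))

    sig_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid
    c = y - ys[-1]                                          # cost vector c_i = y_i - f_hat(x_i)

    # Backward pass: delta^L = c ⊙ sigma'(z^L), then propagate through (W^{l+1})^T.
    deltas = [c * sig_prime(zs[-1])]
    for l in range(len(Ws) - 2, -1, -1):
        deltas.insert(0, (Ws[l + 1].T @ deltas[0]) * sig_prime(zs[l]))

    grads_W = [-2.0 * np.outer(d, y_prev) for d, y_prev in zip(deltas, ys[:-1])]
    grads_b = [2.0 * d for d in deltas]
    return grads_W, grads_b

def sgd_step(Ws, betas, batch, eta=0.1):
    """One update W_t = W_{t-1} - eta * sum over the mini-batch Omega_t of the gradients."""
    sum_gW = [np.zeros_like(W) for W in Ws]
    sum_gb = [np.zeros_like(b) for b in betas]
    for x, y in batch:
        gW, gb = backprop(x, y, Ws, betas)
        sum_gW = [a + g for a, g in zip(sum_gW, gW)]
        sum_gb = [a + g for a, g in zip(sum_gb, gb)]
    Ws = [W - eta * g for W, g in zip(Ws, sum_gW)]
    betas = [b - eta * g for b, g in zip(betas, sum_gb)]
    return Ws, betas
```

Repeating sgd_step over many random mini-batches Ω_t implements the training iteration described above.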

3.4 Neural Network Flavors

At this point you may be wondering whether g_l(x) = σ(W^l g_{l−1}(x) − β^l), with σ the sigmoid function in (3.1), is the only option for g. The answer is: no. For instance, if you use

$$\sigma(x) \;=\; x,$$

then you obtain a linear layer/network. If you use

$$\sigma(x) \;=\; \max(0, x),$$

then you obtain the so-called rectified linear unit (ReLU) layer/network. The function σ is usually called the activation function. Similarly, if instead of g_l(x) = σ(W^l g_{l−1}(x) − β^l) you use

$$g_l(x) \;=\; \sigma(W^l * g_{l-1}(x) - \beta^l),$$

where * denotes the convolution operator, then you obtain a convolutional layer/network.

The topology (shape) of the network is defined by the number of layers (L) and the number of neurons in each layer (n_1, n_2, ..., n_L). Networks with large L are often called deep, as opposed to shallow networks that have small L. Depending on the problem at hand, one choice of L, n_1, ..., n_L, σ, and g (or combination thereof) may be better than another. For example, for image classification, people usually like to mix cascades of one linear layer, followed by a ReLU layer, followed by a convolutional layer, with large n_l in each case, and sometimes with fixed (predetermined) W's in the convolutional layers. The possibilities are endless, and deciding on the best topology remains more of an art than a science.

3.5 A Word of Warning

We have mentioned before that using neural networks we can approximate any function f with arbitrary accuracy. Put another way, for every f and every ε > 0 there exists a function f̂_θ with parameters θ := {W^l, β^l}_{l=2}^L such that

$$\left\| f(x) - \hat{f}_\theta(x) \right\|^2 \;<\; \varepsilon \quad \text{for every } x \text{ in the domain of } f. \tag{3.6}$$

The challenge is to find the parameter θ for which (3.6) is true, which we aim to do using gradient descent to minimize the following cost function:

$$c(\theta) \;:=\; \sum_{i=1}^{N} \left\| f(x_i) - \hat{f}_\theta(x_i) \right\|^2.$$

The wrinkle is that this cost function may be non-convex, which implies that we may never find the right parameter θ. In other words, for every f there will always exist a neural network (parametrized by θ) that approximates f arbitrarily well; however, we may be unable to find such a network.
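To see concretely why non-convexity is a problem, here is a toy illustration with a made-up one-dimensional cost (not the network cost c(θ) above): plain gradient descent lands in different minima depending on where it starts.

```python
import numpy as np

def grad_descent(grad, theta0, eta=0.01, steps=2000):
    # plain gradient descent: theta <- theta - eta * grad(theta)
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# A non-convex toy cost c(theta) = theta^4 - 3*theta^2 + theta,
# which has two local minima of different depths.
grad_c = lambda t: 4 * t**3 - 6 * t + 1

print(grad_descent(grad_c, theta0=+2.0))   # converges to the shallower local minimum (near 1.13)
print(grad_descent(grad_c, theta0=-2.0))   # converges to the deeper, better minimum (near -1.30)
```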