Lecture 35: Optimization and Neural Nets

Lecture 35: Optimization and Neural Nets CS 4670/5670 Sean Bell DeepDream [Google, Inceptionism: Going Deeper into Neural Networks, blog 2015]

Aside: CNN vs ConvNet Note: There are many papers that use either phrase, but ConvNet is the preferred term, since CNN clashes with that other thing called CNN

(Review) Softmax Classifier: p_{i,j} = e^{s_{i,j}} / Σ_k e^{s_{i,k}}, and the loss is L_i = -log p_{i,y_i}. Q2: At initialization, W is small and thus s ≈ 0. What is the loss L? Slide from Karpathy 2016

Softmax vs SVM Loss (the two loss formulas are shown on the slide). Slide from Karpathy 2016

Softmax Classifier: Let's code this up in NumPy. p_{i,j} = e^{s_{i,j}} / Σ_k e^{s_{i,k}}
Doesn't work. What's the problem?
- What if the value 1000 appears in s? Overflow: we get inf/inf = NaN.
- What if the largest value is -1000? Underflow: we get 0/0 = NaN.
This expression is numerically unstable.
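The slide's code is not reproduced in this transcript; a minimal sketch of the naive (unstable) version, assuming s is an N x M NumPy array of scores with one row per example:

```python
import numpy as np

# Naive softmax: a direct translation of the formula, numerically unstable.
def softmax_naive(s):
    exp_s = np.exp(s)                                 # overflows if s has large entries
    return exp_s / np.sum(exp_s, axis=1, keepdims=True)

s = np.array([[1000.0, 0.0, -1000.0]])
print(softmax_naive(s))                               # [[nan  0.  0.]] -- inf/inf = NaN
```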

Softmax Classifier: Observation: subtracting a constant C from every score does not change p:
p_{i,j} = e^{s_{i,j} - C} / Σ_k e^{s_{i,k} - C} = (e^{-C} e^{s_{i,j}}) / (e^{-C} Σ_k e^{s_{i,k}}) = e^{s_{i,j}} / Σ_k e^{s_{i,k}}
If we choose C to be the max of s, then it works:
- If a large value appears in s, that entry becomes e^0 = 1 and every other exponential is at most 1 (avoiding overflow).
- If all values in s are large and negative, they are shifted up towards 0 (avoiding underflow).
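A minimal sketch of the stabilized version (again assuming s is an N x M score array; not the slide's exact code):

```python
import numpy as np

# Numerically stable softmax: subtract the per-row max before exponentiating.
def softmax_stable(s):
    s_shifted = s - np.max(s, axis=1, keepdims=True)    # largest entry becomes 0
    exp_s = np.exp(s_shifted)                           # all values now lie in (0, 1]
    return exp_s / np.sum(exp_s, axis=1, keepdims=True)

s = np.array([[1000.0, 0.0, -1000.0],
              [-1000.0, -1001.0, -1002.0]])
print(softmax_stable(s))    # well-defined probabilities, no NaNs
```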

Optimization How do we minimize the loss L?

Idea #1: Random Search
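The slide's code is not in the transcript; a rough sketch of the idea, where loss_fn and the weight shape (D, M) are hypothetical placeholders:

```python
import numpy as np

# Random search: repeatedly try random weights and keep the best ones found.
# loss_fn(W) is assumed to return the full training loss for a weight matrix W.
def random_search(loss_fn, D, M, num_tries=1000):
    best_W, best_loss = None, float('inf')
    for _ in range(num_tries):
        W = 0.001 * np.random.randn(D, M)   # sample a random candidate
        loss = loss_fn(W)
        if loss < best_loss:
            best_W, best_loss = W, loss
    return best_W, best_loss
```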

Idea #2: Follow the Slope

Idea #2: Follow the Slope. In 1D, the derivative of a function is df(x)/dx = lim_{h→0} [f(x + h) - f(x)] / h. In multiple dimensions, the gradient is the vector of partial derivatives. We could directly estimate it numerically, but this is super slow. Slide from Karpathy 2016
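The estimation code itself is only on the slide image; a sketch of a numerical estimate (using a centered difference rather than the one-sided limit above), where f is any scalar-valued function of a NumPy array x:

```python
import numpy as np

# Numerically estimate the gradient of a scalar function f at x by
# perturbing one coordinate at a time (slow: two calls to f per dimension).
def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                          # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad
```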

Idea #2: Follow the Slope. Instead of estimating it, we can just directly compute the gradient using calculus. Slide from Karpathy 2016

Gradient Descent: repeatedly step in the direction of the negative gradient. The step size is called the learning rate. Slide from Karpathy 2016
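The update-rule code from the slide is not in the transcript; a tiny runnable sketch of the vanilla update w <- w - learning_rate * gradient, on a toy quadratic loss (all names and sizes are illustrative):

```python
import numpy as np

# Vanilla gradient descent on the toy loss L(w) = ||w - target||^2.
target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)
learning_rate = 0.1

for step in range(100):
    grad = 2 * (w - target)       # analytic gradient of the toy loss
    w -= learning_rate * grad     # the step size is the learning rate

print(w)                          # converges toward [1, -2, 3]
```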

Mini-batch Gradient Descent: compute the gradient on a small random batch of training examples at each step. Note: in practice we use fancier update rules (momentum, RMSprop, Adam, etc.). A typical loss curve is shown on the slide. Slide from Karpathy 2016
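A sketch of the mini-batch loop, assuming X_train, y_train, W, and a compute_loss_and_gradient helper (all hypothetical placeholders, not from the slide):

```python
import numpy as np

# Mini-batch gradient descent: sample a batch, compute its gradient, take a step.
def train(W, X_train, y_train, compute_loss_and_gradient,
          learning_rate=1e-3, batch_size=256, num_iters=1000):
    N = X_train.shape[0]
    for it in range(num_iters):
        idx = np.random.choice(N, batch_size, replace=False)   # random mini-batch
        X_batch, y_batch = X_train[idx], y_train[idx]
        loss, dW = compute_loss_and_gradient(W, X_batch, y_batch)
        W -= learning_rate * dW                                 # vanilla update
    return W
```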

Setting the learning rate. Learning rate: 1e6. What could go wrong? (Plot on slide: loss L vs. a single weight in W; with such a large step the update overshoots the minimum and the loss can blow up.)

Setting the learning rate Typical loss (more on this later) Slide from Karpathy 2016

Classification: Overview. Training Images + Training Labels → (Gradient Descent) → Trained Classifier; Test Image → Trained Classifier → Prediction: outdoor

Neural Networks (First we'll cover Neural Nets, then build up to Convolutional Neural Nets)

Feature hierarchy with ConvNets End-to-end models [Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, 2013]

Slide: R. Fergus

Inspiration from Biology. Neural nets are loosely inspired by biology, but they certainly are not a model of how the brain works, or even of how neurons work. Figure: Andrej Karpathy

Simple Neural Net: 1 Layer. Let's consider a simple 1-layer network: x → Wx + b → f. This is the same as what you saw last class.

1 Layer Neural Net
Block Diagram: x (input) → Wx + b → f (class scores)
Expanded Block Diagram: W (M×D) times x (D×1), plus b (M×1), equals f (M×1). M classes, D features, 1 example.
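The NumPy line on the slide is an image and not in the transcript; a minimal sketch for a single example, with made-up sizes:

```python
import numpy as np

M, D = 10, 3072                    # M classes, D features (sizes are illustrative)
W = 0.001 * np.random.randn(M, D)  # weight matrix
b = np.zeros(M)                    # bias vector
x = np.random.randn(D)             # one input example

f = W.dot(x) + b                   # class scores, shape (M,)
```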

1 Layer Neural Net
How do we process N inputs at once? It's most convenient to have the first dimension (row) represent which example we are looking at, so we transpose everything:
X (N×D) times W (D×M), plus b (a 1×M row vector broadcast across the N rows), equals F (N×M). M classes, D features, N examples.
Note: Often, even if the weights are transposed, they are still called W.

1 Layer Neural Net
X (N×D): each row is one input example.
W (D×M): each column is the weights for one class.
F (N×M): each row is the predicted scores for one example.

1 Layer Neural Net
Implementing this with NumPy. First attempt (code shown on slide): multiply X (N×D) by W (D×M) and add b (length M). It doesn't work as written. Why?
- NumPy needs to know how to expand b from 1D to 2D.
- This is called broadcasting.
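The corrected slide code is not reproduced here; a sketch that makes the broadcast explicit with np.newaxis (sizes illustrative):

```python
import numpy as np

N, D, M = 5, 3072, 10                 # illustrative sizes
X = np.random.randn(N, D)             # each row is one example
W = 0.001 * np.random.randn(D, M)     # each column is the weights for one class
b = np.zeros(M)

# b (shape (M,)) must be broadcast across the N rows of X.dot(W) (shape (N, M)).
# Writing the broadcast explicitly with np.newaxis makes the intent clear:
F = X.dot(W) + b[np.newaxis, :]       # shape (N, M): one row of scores per example
```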

1 Layer Neural Net
Implementing this with NumPy: what does np.newaxis do? Given b = [0, 1, 2], indexing with np.newaxis can make b a row vector (which broadcasting repeats along rows) or a column vector (which broadcasting repeats along columns).
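The np.newaxis calls themselves were only in the slide images; a short sketch of the two cases:

```python
import numpy as np

b = np.array([0, 1, 2])

row = b[np.newaxis, :]      # shape (1, 3): a row vector
col = b[:, np.newaxis]      # shape (3, 1): a column vector

A = np.zeros((3, 3))
print(A + row)              # row is repeated along rows:    [[0,1,2],[0,1,2],[0,1,2]]
print(A + col)              # col is repeated along columns: [[0,0,0],[1,1,1],[2,2,2]]
```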

2 Layer Neural Net
What if we just added another layer? x → W^{(1)}x + b^{(1)} → h → W^{(2)}h + b^{(2)} → f
Let's expand out the equation:
f = W^{(2)}(W^{(1)}x + b^{(1)}) + b^{(2)} = (W^{(2)}W^{(1)})x + (W^{(2)}b^{(1)} + b^{(2)})
But this is just the same as a 1-layer net with W = W^{(2)}W^{(1)} and b = W^{(2)}b^{(1)} + b^{(2)}.
We need a non-linear operation between the layers.
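A quick numeric check of this collapse, with random matrices (shapes are illustrative, not from the slide):

```python
import numpy as np

D, H, M = 4, 5, 3                         # input, hidden, output sizes
W1, b1 = np.random.randn(H, D), np.random.randn(H)
W2, b2 = np.random.randn(M, H), np.random.randn(M)
x = np.random.randn(D)

two_layer = W2.dot(W1.dot(x) + b1) + b2
one_layer = (W2.dot(W1)).dot(x) + (W2.dot(b1) + b2)
print(np.allclose(two_layer, one_layer))  # True: two linear layers collapse into one
```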

Nonlinearities
Sigmoid: historically popular; 2 big problems: not zero-centered, and it saturates.
Tanh: zero-centered, but it also saturates.
ReLU: no saturation, very efficient; best in practice for classification.
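For reference, the three nonlinearities written out in NumPy (definitions only; not taken from the slide):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); not zero-centered; saturates

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1); zero-centered; still saturates

def relu(x):
    return np.maximum(0, x)           # 0 for x < 0, identity for x >= 0
```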

Nonlinearities: Saturation
What happens if we reach the flat part of the sigmoid?
σ(x) = 1 / (1 + e^{-x}), with derivative σ'(x) = σ(x)(1 - σ(x))
Saturation: the gradient is (essentially) zero!

Nonlinearities [Krizhevsky 2012] (AlexNet): figure comparing tanh and ReLU training curves. In practice, ReLU converges ~6x faster than tanh for classification problems.

ReLU in NumPy
Many ways to write ReLU; these are all equivalent:
(a) Elementwise max; the 0 gets broadcast to match the size of h1.
(b) Make a boolean mask where negative values are True, and then set those entries in h1 to 0.
(c) Make a boolean mask where positive values are True, and then do an elementwise multiplication (since int(True) = 1).
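The slide's actual code is not in the transcript; a sketch matching the three descriptions (h1 is just an illustrative array of pre-activations):

```python
import numpy as np

h1 = np.random.randn(4, 5)        # pre-activation values

# (a) Elementwise max; the scalar 0 is broadcast to the shape of h1:
a = np.maximum(0, h1)

# (b) Mask of negative entries, set them to 0 (done on a copy here):
b = h1.copy()
b[b < 0] = 0

# (c) Mask of positive entries, elementwise multiply (True casts to 1):
c = h1 * (h1 > 0)

print(np.allclose(a, b) and np.allclose(a, c))   # True
```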

2 Layer Neural Net
(Layer 1): x → W^{(1)}x + b^{(1)}
(Nonlinearity): max(0, ·) → h^{(1)} ("hidden activations")
(Layer 2): W^{(2)}h^{(1)} + b^{(2)} → f
Let's expand out the equation: f = W^{(2)} max(0, W^{(1)}x + b^{(1)}) + b^{(2)}. Now it no longer simplifies. Yay!
Note: any nonlinear function will prevent this collapse, but not all nonlinear functions actually work well in practice.
Note: Traditionally, the nonlinearity was considered part of the layer and is called an "activation function". In this class, we will treat nonlinearities as separate layers, but be aware that many others consider them part of the layer.
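A minimal sketch of the batched two-layer forward pass in NumPy (sizes and initialization are illustrative, not the slide's exact code):

```python
import numpy as np

N, D, H, M = 5, 3072, 100, 10              # examples, features, hidden units, classes
X = np.random.randn(N, D)
W1, b1 = 0.001 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.001 * np.random.randn(H, M), np.zeros(M)

h1 = np.maximum(0, X.dot(W1) + b1)         # layer 1 + ReLU: hidden activations, shape (N, H)
f = h1.dot(W2) + b2                        # layer 2: class scores, shape (N, M)
```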

Neural Net: Graphical Representation (2-layer and 3-layer examples shown on slide).
- Called "fully connected" because every output depends on every input.
- Also called "affine" or "inner product" layers.

Questions?

Neural Networks, More generally
(Input) x → Function → (Hidden Activation) h^{(1)} → Function → h^{(2)} → ... → (Scores) f, which is compared against the (Ground Truth Labels) y to produce the (Loss) L.
Each Function can be:
- Fully connected layer
- Nonlinearity (ReLU, Tanh, Sigmoid)
- Convolution
- Pooling (Max, Avg)
- Vector normalization (L1, L2)
- Invent new ones
Each function has its own (Parameters) θ: here, θ represents whatever parameters that layer is using, e.g. θ^{(1)} = {W^{(1)}, b^{(1)}} for a fully connected layer.
Recall: the loss L measures how far the predictions f are from the labels y. The most common loss is Softmax.
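Tying the pieces together, a sketch of how scores and labels produce the softmax loss, using the stable softmax from earlier (f and y here are small made-up inputs):

```python
import numpy as np

# Scores -> probabilities -> average softmax (cross-entropy) loss.
# f is an N x M score matrix; y holds the correct class index for each example.
def softmax_loss(f, y):
    f_shifted = f - np.max(f, axis=1, keepdims=True)        # numerical stability
    p = np.exp(f_shifted) / np.sum(np.exp(f_shifted), axis=1, keepdims=True)
    return -np.mean(np.log(p[np.arange(len(y)), y]))        # L_i = -log p_{i, y_i}

f = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([0, 1])
print(softmax_loss(f, y))    # small loss: the correct classes already score highest
```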