Machine Learning Basics III

Machine Learning Basics III. Benjamin Roth, CIS LMU München.

Outline: 1. Classification (Logistic Regression); 2. Gradient-Based Optimization (Gradient Descent for Logistic Regression, Stochastic Gradient Descent); 3. Deep Feedforward Networks (Design Choices for Output Units, Design Choices for Hidden Units, Backpropagation); 4. Regularization (L2-Regularization).

From Regression to Classification. So far, linear regression: a simple linear model; probabilistic interpretation; find optimal parameters using Maximum Likelihood Estimation. Can we do something similar for classification? Logistic Regression (... it's not actually used for regression ...)

Logistic Regression. Binary logistic model: estimate the probability of a binary response y ∈ {0, 1} based on features x. Logistic Regression is a Generalized Linear Model: the linear model is related to the response variable via a link function, p(y = 1 | x; θ) = f(θ^T x). (Note: Y denotes a random variable, whereas y, y^(i), 0, 1 denote values that the random variable can take on. If the random variable is obvious from the context, it may be omitted.)

Logistic Regression. Recall linear regression: p(y | x; θ) = N(y; θ^T x, I), which predicts y ∈ R. Classification: the outcome (per example) is 0 or 1. Logistic sigmoid: σ(z) = 1 / (1 + e^{-z}). Logistic Regression: linear function + logistic sigmoid, p(y = 1 | x; θ) = σ(θ^T x).

Binary Logistic Regression. Probability of the different outcomes (for one example). Probability of the positive outcome: p(y = 1 | x; θ) = 1 / (1 + e^{-θ^T x}). Probability of the negative outcome: p(y = 0 | x; θ) = 1 - p(y = 1 | x; θ).
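A minimal numeric sketch of these two probabilities in numpy; the feature vector x and parameter vector theta below are made-up illustrative values, not from the slides:

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid: sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical example values, just for illustration
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 0.2])

p_pos = sigmoid(theta @ x)   # p(y = 1 | x; theta)
p_neg = 1.0 - p_pos          # p(y = 0 | x; theta)
print(p_pos, p_neg)
```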

Probability of a Training Example. Probability of the actual label y^(i) given features x^(i). It can be written for both labels (0 and 1) without a case distinction. Label exponentiation trick: use a^0 = 1.
p(y = y^(i) | x^(i); θ)
= p(y = 1 | x^(i); θ) if y^(i) = 1, and p(y = 0 | x^(i); θ) if y^(i) = 0
= p(y = 1 | x^(i); θ)^1 · p(y = 0 | x^(i); θ)^0 if y^(i) = 1, and p(y = 1 | x^(i); θ)^0 · p(y = 0 | x^(i); θ)^1 if y^(i) = 0
= p(y = 1 | x^(i); θ)^{y^(i)} · p(y = 0 | x^(i); θ)^{1 - y^(i)}

Binary Logistic Regression. Conditional Negative Log-Likelihood (NLL):
NLL(θ) = -log p(y | X; θ)
= -log ∏_{i=1}^m p(y = y^(i) | x^(i); θ)
= -log ∏_{i=1}^m p(y = 1 | x^(i); θ)^{y^(i)} (1 - p(y = 1 | x^(i); θ))^{1 - y^(i)}
= -∑_{i=1}^m [ y^(i) log σ(θ^T x^(i)) + (1 - y^(i)) log(1 - σ(θ^T x^(i))) ]
There is no closed-form solution for the minimum; use numerical/iterative methods: L-BFGS, gradient descent, ...
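The last line of this derivation translates directly into code. A sketch, assuming numpy and a toy dataset; X, y, and the clipping constant eps are illustrative assumptions, the latter added only for numerical safety:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y, eps=1e-12):
    # NLL(theta) = -sum_i [ y_i log sigma(theta^T x_i) + (1 - y_i) log(1 - sigma(theta^T x_i)) ]
    p = np.clip(sigmoid(X @ theta), eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# illustrative toy data: 4 examples, 2 features
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])
theta = np.zeros(2)
print(nll(theta, X, y))  # equals 4 * log(2) at theta = 0
```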

Logistic Regression. Logistic regression: the logistic sigmoid function applied to a weighted linear combination of feature values, interpreted as the probability that the label for a specific example equals 1. Applying the model to test data: predict ŷ^(i) = 1 if p(y = 1 | x^(i); θ) > 0.5. There is no closed-form solution for minimizing the NLL; iterative methods are necessary. Next up: gradient descent.

Optimization. Optimization: minimize some function J(θ) by altering θ. Maximize f(θ) by minimizing J(θ) = -f(θ). J(θ): criterion, objective function, cost function, loss function, error function. In a probabilistic machine learning setting it is often the (conditional) negative log-likelihood, -log p(X; θ) or -log p(y | X; θ). The optimal parameters: θ* = arg min_θ J(θ).

Optimization. If J(θ) is convex, it is minimized where ∇_θ J(θ) = 0. If J(θ) is not convex, the gradient can nevertheless help us to improve our objective (and find a local optimum). Many optimization techniques were originally developed for convex objective functions, but are found to work well for non-convex functions too. Use the fact that the gradient indicates the slope of the function in the direction of steepest increase.

Gradient-Based Optimization. Derivative: given a small change in input, what is the corresponding change in output? f(x + ε) ≈ f(x) + ε f'(x), so f(x - ε sign(f'(x))) < f(x) for small enough ε.

Gradient Descent. For J(θ): R^n → R. If the partial derivative ∂J(θ)/∂θ_j > 0, then J(θ) will increase for small increases of θ_j, so go in the opposite direction of the gradient (since we want to minimize). Steepest descent: iterate θ_{t+1} ← θ_t - η ∇_θ J(θ_t). η is the learning rate (set to a small positive constant). Converged if ∇_θ J(θ) is (close to) 0.
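A generic sketch of this steepest-descent iteration; grad_J stands for any function returning ∇_θ J(θ), and the quadratic example objective is made up for illustration:

```python
import numpy as np

def gradient_descent(grad_J, theta0, eta=0.1, tol=1e-6, max_iter=1000):
    # iterate theta_{t+1} <- theta_t - eta * grad J(theta_t)
    theta = theta0
    for _ in range(max_iter):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:   # converged: gradient (close to) 0
            break
        theta = theta - eta * g
    return theta

# illustrative convex example: J(theta) = ||theta - 3||^2, with gradient 2 (theta - 3)
print(gradient_descent(lambda th: 2 * (th - 3.0), np.array([0.0])))  # -> approx. [3.]
```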

Local Minima. If the function is non-convex, different results can be obtained at convergence, depending on the initialization of θ.

Local Minima. Minima can be global or local. Critical (stationary) points: f'(x) = 0. For neural networks, only good (not perfect) parameter values can be found.

Gradient Descent for Logistic Regression.
∇_θ NLL(θ) = -∇_θ ∑_{i=1}^m [ y^(i) log σ(θ^T x^(i)) + (1 - y^(i)) log(1 - σ(θ^T x^(i))) ]
= -∑_{i=1}^m (y^(i) - σ(θ^T x^(i))) x^(i)
The gradient descent update becomes:
θ_{t+1} := θ_t + η ∑_{i=1}^m (y^(i) - σ(θ_t^T x^(i))) x^(i)
Note: which feature weights are increased, which are decreased?
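A sketch of this batch update for logistic regression; the toy dataset X, y is an illustrative assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(theta, X, y, eta):
    # grad NLL = -sum_i (y_i - sigma(theta^T x_i)) x_i
    # update:    theta <- theta + eta * sum_i (y_i - sigma(theta^T x_i)) x_i
    residual = y - sigmoid(X @ theta)     # shape (m,)
    return theta + eta * (X.T @ residual)

# toy data: label 1 iff the second feature is active (illustrative only)
X = np.array([[1., 0.], [1., 1.], [1., 0.], [1., 1.]])   # first column = bias feature
y = np.array([0., 1., 0., 1.])
theta = np.zeros(2)
for _ in range(500):
    theta = batch_update(theta, X, y, eta=0.5)
print(theta)   # the weight on the second feature grows positive, the bias weight goes negative
```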

Derivation of the Gradient for Logistic Regression. This is a great exercise! Use the following facts:
Gradient: (∇_θ f(θ))_j = ∂f(θ)/∂θ_j
Derivative of a sum: d/dz ∑_i f_i(z) = ∑_i df_i(z)/dz
Chain rule: F(z) = f(g(z)) ⇒ F'(z) = f'(g(z)) g'(z)
Derivative of the logarithm: d log z / dz = 1/z
Derivative of the logistic sigmoid: dσ(z)/dz = σ(z)(1 - σ(z))
Partial derivative of the dot product: ∂(θ^T x)/∂θ_j = x_j

Gradient Descent: Summary. An iterative method for function minimization. The gradient indicates the rate of change of the objective function, given a local change to the feature weights. Subtract the gradient: decrease parameters that (locally) have a positive correlation with the objective, increase parameters that (locally) have a negative correlation with the objective. Gradient updates only have the desired properties in a small region around the previous parameters θ_t; control locality by the step size η. Gradient descent is slow: for a relatively small step in the right direction, all of the training data has to be processed. This version of gradient descent is often also called batch gradient descent.

Stochastic Gradient Descent (SGD). Batch gradient descent is slow: for a relatively small step in the right direction, all of the training data has to be processed:
θ_{t+1} ← θ_t + η ∇_θ ∑_{i=1}^m log p(y_i | x_i; θ)
Stochastic gradient descent in a nutshell: for each update, only use a random sample B_t of the training data (a mini-batch):
θ_{t+1} ← θ_t + η ∇_θ ∑_{i ∈ B_t} log p(y_i | x_i; θ)
The mini-batch size can also just be 1 (more frequent updates):
θ_{t+1} ← θ_t + η ∇_θ log p(y_t | x_t; θ)
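A sketch of the mini-batch variant; the shuffling scheme, batch size, and learning rate below are illustrative choices, and grad_log_p stands for any function returning ∇_θ of the log-likelihood summed over the given examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(grad_log_p, theta, X, y, eta=0.1, batch_size=2, epochs=10):
    # theta_{t+1} <- theta_t + eta * grad_theta sum_{i in B_t} log p(y_i | x_i; theta)
    m = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(m)              # shuffle, then slice into mini-batches B_t
        for start in range(0, m, batch_size):
            b = order[start:start + batch_size]
            theta = theta + eta * grad_log_p(theta, X[b], y[b])
    return theta

# for binary logistic regression, the per-batch gradient of the log-likelihood is X_b^T (y_b - sigma(X_b theta))
grad_log_p = lambda th, Xb, yb: Xb.T @ (yb - sigmoid(Xb @ th))
```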

Stochastic Gradient Descent (SGD). The actual gradient is approximated using only a sub-sample of the data. For objective functions that are highly non-convex, the random deviations of these approximations may even help to escape local minima. Treat batch size and learning rate as hyper-parameters.

Deep Feedforward Networks. Function approximation: find a good mapping ŷ = f(x; θ). Network: a composition of functions f^(1), f^(2), f^(3) with multi-dimensional input and output. Each f^(i) represents one layer: f(x) = f^(1)(f^(2)(f^(3)(x))). Feedforward: input → intermediate representation → output; no feedback connections (cf. recurrent networks).
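A minimal sketch of such a composition of layers; the layer sizes, random weights, and the ReLU/sigmoid choices are assumptions for illustration, not taken from the slide:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def layer(W, b, activation):
    # one layer: x -> activation(W x + b)
    return lambda x: activation(W @ x + b)

# composed as on the slide, f(x) = f1(f2(f3(x))), with f3 applied to the input first
rng = np.random.default_rng(0)
f3 = layer(rng.standard_normal((4, 3)), np.zeros(4), relu)      # input layer, 3 -> 4
f2 = layer(rng.standard_normal((2, 4)), np.zeros(2), relu)      # hidden layer, 4 -> 2
f1 = layer(rng.standard_normal((1, 2)), np.zeros(1), sigmoid)   # output layer, 2 -> 1

f = lambda x: f1(f2(f3(x)))
print(f(np.array([1.0, -0.5, 2.0])))
```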

Deep Feedforward Networks: Training. The loss function is defined on the output layer, e.g. the squared error ½‖ŷ - y‖². A quality criterion on the other layers is not directly defined; the training algorithm must decide how to use those layers most effectively (w.r.t. the loss on the output layer). Non-output layers can be viewed as providing a feature function φ(x) of the input that is to be learned.

Neural Networks. Inspired by biological neurons (nerve cells). Neurons are connected to each other, and receive and send electrical pulses. "If the [input] voltage changes by a large enough amount, an all-or-none electrochemical pulse called an action potential is generated, which travels rapidly along the cell's axon, and activates synaptic connections with other cells when it arrives." (Wikipedia)

Activation Functions with Non-Linearities. Linear functions are limited in what they can express; a famous example is XOR. Simple layered non-linear functions can represent XOR.

Design Choices for Output Units. Typically the outputs can be interpreted as probabilities: logistic sigmoid, softmax, mean and variance of a Gaussian, ... Trained with negative log-likelihood.

Softmax. Logistic sigmoid: a vector y of binary outcomes, with no constraints on how many can be 1 (Bernoulli distribution). Softmax: exactly one element of y is 1 (multinoulli/categorical distribution), p(y = i | φ(x)) with ∑_i p(y = i | φ(x)) = 1, where softmax(z)_i = exp(z_i) / ∑_j exp(z_j).
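A direct sketch of this softmax formula; subtracting max(z) is a standard numerical-stability trick that leaves the result unchanged (not part of the slide's definition):

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])        # illustrative logits
p = softmax(z)
print(p, p.sum())                    # probabilities of the categorical distribution, sum to 1
```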

Parametrizing a Gaussian Distribution. Use the final layer to predict the parameters of a Gaussian mixture model. Weights of the mixture components: softmax. Means: no non-linearity. Precisions (1/σ²) need to be positive: softplus, softplus(z) = ln(1 + exp(z)).
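A small sketch of mapping a final-layer output vector to mixture parameters as described; the layout of the raw output vector (weights, then means, then precisions) is an assumption for illustration:

```python
import numpy as np

softplus = lambda z: np.log1p(np.exp(z))        # ln(1 + exp(z)), always positive
softmax = lambda z: np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

def mixture_params(raw, k):
    # raw: final-layer output of length 3k (assumed layout: [weights | means | precisions])
    weights = softmax(raw[:k])                   # mixture weights: softmax
    means = raw[k:2 * k]                         # means: no non-linearity
    precisions = softplus(raw[2 * k:])           # precisions 1/sigma^2: softplus keeps them positive
    return weights, means, precisions

print(mixture_params(np.arange(6, dtype=float), k=2))
```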

Rectified Linear Units. Rectified Linear Unit: relu(z) = max(0, z) with z = x^T w + b. Consistent gradient of 1 when the unit is active (i.e. if there is an error to propagate). The default choice for hidden units.

A Simple ReLU Network to Solve XOR. f(x; W, c, w) = w^T max(0, W^T x + c), with
W = [[1, 1], [1, 1]], c = [0, -1]^T, w = [1, -2]^T.
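A quick numerical check of this network on all four XOR inputs (a sketch; the matrices are exactly those on the slide):

```python
import numpy as np

W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])

def f(x):
    # f(x; W, c, w) = w^T max(0, W^T x + c)
    return w @ np.maximum(0.0, W.T @ x + c)

for x in [(0., 0.), (0., 1.), (1., 0.), (1., 1.)]:
    print(x, f(np.array(x)))    # outputs 0, 1, 1, 0 = XOR
```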

Other Choices for Hidden Units. A good activation function aids learning and provides large gradients. Sigmoidal functions (logistic sigmoid) have only a small region before they flatten out in either direction. Practice shows that this seems to be OK in conjunction with a log-loss objective, but they don't work as well as hidden units. ReLUs are a better alternative since the gradient stays constant. Other hidden unit functions: maxout (take the maximum of several values in the previous layer); purely linear (can serve as a low-rank approximation).

Forward propagation: input information propagates through the network to produce the output ŷ (and the cost J(θ) during training). Back-propagation: compute the gradient w.r.t. the model parameters; the cost gradient propagates backwards through the network. Back-propagation is part of a learning procedure (e.g. stochastic gradient descent), not a learning procedure in itself.

Chain Rule of Calculus: Real Functions. Let x, y, z ∈ R, f, g: R → R, y = g(x), z = f(g(x)) = f(y). Then dz/dx = (dz/dy)(dy/dx).

Chain Rule of Calculus: Multivariate Functions. Let x ∈ R^m, y ∈ R^n, z ∈ R, f: R^n → R, g: R^m → R^n, y = g(x), z = f(g(x)) = f(y). Then ∂z/∂x_i = ∑_{j=1}^n (∂z/∂y_j)(∂y_j/∂x_i). In order to write this in vector notation, we need to define the Jacobian matrix.

Jacobian. The Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function: J = ∂g(x)/∂x, with entries J_{ij} = ∂g(x)_i/∂x_j (rows i = 1, ..., n, columns j = 1, ..., m). How do we write the multivariate chain rule in terms of gradients? We can write it as ∇_x z = (∂y/∂x)^T ∇_y z.
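A small numeric sanity check of the identity ∇_x z = (∂y/∂x)^T ∇_y z, using made-up functions g and f and a finite-difference comparison:

```python
import numpy as np

g = lambda x: np.array([x[0] * x[1], np.sin(x[0])])          # g: R^2 -> R^2
f = lambda y: y[0] + y[1] ** 2                               # f: R^2 -> R

x = np.array([0.3, -1.2])
y = g(x)

jacobian = np.array([[x[1], x[0]],                           # dy/dx: row i is the gradient of g(x)_i
                     [np.cos(x[0]), 0.0]])
grad_y = np.array([1.0, 2.0 * y[1]])                         # nabla_y z for z = f(y)
grad_x = jacobian.T @ grad_y                                 # chain rule: nabla_x z = (dy/dx)^T nabla_y z

# finite-difference check
eps = 1e-6
numeric = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps) for e in np.eye(2)])
print(grad_x, numeric)                                       # should agree closely
```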

Viewing the Network as a Graph. Nodes are function outputs (they can be scalar or vector valued), arrows are inputs. Example: scalar multiplication z = x · y. [Graph: nodes x and y feed into the node z = x · y.]

Which Function? [Graph: inputs x and w, u = x^T w, ŷ = σ(u), i.e. the graph computes the logistic regression prediction ŷ = σ(x^T w).]

Graph with Cost. [Graph: u = x^T w, ŷ = σ(u), and the cost node L = -log P(y | ŷ).]

Which Function? Parameter vectors can be converted to a matrix as needed. [Graph: u' = V^T x, h = max(0, u'), u = h^T w, ŷ = σ(u), i.e. a feedforward network with one ReLU hidden layer and a sigmoid output.]

Forward Pass. Green: known or computed. [Graph: x, V, and w are known; u' = V^T x, h = max(0, u'), u = h^T w, ŷ = σ(u), and L = -log P(y | ŷ) are computed step by step.]

Forward Pass. End of the forward pass (some steps skipped). [Graph: all nodes up to the cost L are now computed.]

Backward Pass. Red: gradient of the cost computed for a node. [Graph: dL/dŷ is computed at the output node.]

Backward Pass. Red: gradient of the cost computed for a node. [Graph: dL/dŷ and dL/du = (dL/dŷ)(dŷ/du) are computed.]

Backward Pass. Red: gradient of the cost computed for a node. [Graph: the gradient is propagated further: ∂L/∂h = (dL/dŷ)(dŷ/du)(∂u/∂h) and ∂L/∂w = (dL/dŷ)(dŷ/du)(∂u/∂w).]

End of the Backward Pass. We have the gradients for all parameters; let's use them for SGD. [Graph: gradients for all remaining nodes, including ∂L/∂u' = (∂h/∂u')^T ∂L/∂h and ∂L/∂V = (∂u'/∂V)^T ∂L/∂u'.]
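A sketch of the full forward and backward pass for exactly this graph (u' = V^T x, h = max(0, u'), u = h^T w, ŷ = σ(u), L = -log P(y | ŷ)); the shapes and example values are illustrative assumptions:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# illustrative values
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
V = rng.standard_normal((3, 4))       # first-layer parameters
w = rng.standard_normal(4)            # output-layer parameters
y = 1.0                               # gold label

# forward pass
u_prime = V.T @ x                     # u' = V^T x
h = np.maximum(0.0, u_prime)          # h  = max(0, u')
u = h @ w                             # u  = h^T w
y_hat = sigmoid(u)                    # y^ = sigma(u)
L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # L = -log P(y | y^)

# backward pass (chain rule, output first)
dL_du = y_hat - y                     # combines dL/dy^ and dy^/du for sigmoid + log-loss
dL_dw = dL_du * h                     # u = h^T w  =>  du/dw = h
dL_dh = dL_du * w                     # u = h^T w  =>  du/dh = w
dL_du_prime = dL_dh * (u_prime > 0)   # ReLU passes the gradient only where u' > 0
dL_dV = np.outer(x, dL_du_prime)      # u' = V^T x =>  dL/dV has the same shape as V

# quick finite-difference check on one entry of V (y = 1, so L = -log y^)
eps = 1e-6
V2 = V.copy()
V2[0, 0] += eps
L2 = -np.log(sigmoid(np.maximum(0.0, V2.T @ x) @ w))
print(dL_dV[0, 0], (L2 - L) / eps)    # should agree closely
```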

Summary. Gradient descent: minimize the loss by iteratively subtracting the gradient from the parameter vector. Stochastic gradient descent: approximate the gradient by considering small subsets of examples. Regularization: penalize large parameter values, e.g. by adding the L2 norm of the parameter vector. Feedforward networks: layers of (non-linear) function compositions. Output non-linearities: interpreted as probability distributions or densities (logistic sigmoid, softmax, Gaussian). Hidden layers: rectified linear units (max(0, z)). Backpropagation: compute the gradient of the cost w.r.t. the parameters using the chain rule.

Regularization. [Three plots of prediction vs. feature, illustrating underfitting vs. overfitting.] Regularization: any modification to a learning algorithm intended to reduce its generalization error but not its training error. It builds a preference into the ML algorithm for one solution in hypothesis space over another; the solution space is still the same. The unpreferred solution is penalized: it is only chosen if it fits the training data much better.

L2-Regularization. Large parameters → overfitting. [Plots of σ(x), σ(2x), σ(100x): larger weights make the sigmoid steeper.] Prefer models with smaller feature weights. Popular regularizers: penalize a large L2 norm; penalize a large L1 norm (aka LASSO, induces sparsity).

Regularization. Add a term that penalizes a large L2 norm; the amount of penalty is controlled by a parameter λ. Linear regression: J(θ) = MSE(θ) + (λ/2) θ^T θ. Logistic regression: J(θ) = NLL(θ) + (λ/2) θ^T θ. From a Bayesian perspective, L2 regularization corresponds to a Gaussian prior on the parameters.

L2-Regularization. The surface of the objective function is now a combination of the original cost and the regularization penalty.

L2-Regularization. Gradient of the regularization term: ∇_θ (λ/2) θ^T θ = λθ. Gradient descent for the regularized cost function:
θ_{t+1} := θ_t - η ∇_θ (NLL(θ_t) + (λ/2) θ_t^T θ_t)
θ_{t+1} := (1 - ηλ) θ_t - η ∇_θ NLL(θ_t)
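A sketch of this regularized update in its weight-decay form; grad_nll stands for any function returning ∇_θ NLL(θ), and the quadratic example objective is made up for illustration:

```python
import numpy as np

def regularized_step(theta, grad_nll, eta, lam):
    # theta_{t+1} := (1 - eta * lam) * theta_t - eta * grad NLL(theta_t)
    # i.e. ordinary gradient descent on the NLL plus shrinking the weights a little each step ("weight decay")
    return (1.0 - eta * lam) * theta - eta * grad_nll(theta)

# illustrative use with a made-up gradient function: NLL(theta) = ||theta - 1||^2 has gradient 2 (theta - 1)
theta = np.zeros(3)
for _ in range(200):
    theta = regularized_step(theta, lambda th: 2 * (th - 1.0), eta=0.1, lam=0.5)
print(theta)   # pulled towards 1 by the NLL, but shrunk towards 0 by the L2 penalty
```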