Machine Learning Basics III

Machine Learning Basics III. Benjamin Roth, CIS LMU München.

Outline: 1. Classification (Logistic Regression); 2. Gradient-Based Optimization (Gradient Descent for Logistic Regression, Stochastic Gradient Descent); 3. Deep Feedforward Networks (Design Choices for Output Units, Design Choices for Hidden Units, Backpropagation); 4. Regularization (L2-Regularization).

From Regression to Classification. So far, linear regression: a simple linear model; probabilistic interpretation; find optimal parameters using Maximum Likelihood Estimation. Can we do something similar for classification? Logistic Regression (... it's not actually used for regression ...)

Logistic Regression. Binary logistic model: estimate the probability of a binary response y ∈ {0, 1} based on features x. Logistic Regression is a Generalized Linear Model: the linear model is related to the response variable via a link function, p(y = 1 | x; θ) = f(θ^T x). (Note: Y denotes a random variable, whereas y, y^(i), 0, 1 denote values that the random variable can take on. If the random variable is obvious from the context, it may be omitted.)

Logistic Regression. Recall linear regression: p(y | x; θ) = N(y; θ^T x, I), which predicts y ∈ R. Classification: the outcome (per example) is 0 or 1. Logistic sigmoid: σ(z) = 1 / (1 + e^{-z}). Logistic Regression: linear function + logistic sigmoid, p(y = 1 | x; θ) = σ(θ^T x).

Binary Logistic Regression. Probability of the different outcomes (for one example). Probability of the positive outcome: p(y = 1 | x; θ) = 1 / (1 + e^{-θ^T x}). Probability of the negative outcome: p(y = 0 | x; θ) = 1 - p(y = 1 | x; θ).
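A minimal numeric sketch of these two probabilities in numpy; the feature vector x and parameter vector theta below are made-up illustrative values, not from the slides:

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid: sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical example values, just for illustration
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 0.2])

p_pos = sigmoid(theta @ x)   # p(y = 1 | x; theta)
p_neg = 1.0 - p_pos          # p(y = 0 | x; theta)
print(p_pos, p_neg)
```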

Probability of a Training Example. Probability of the actual label y^(i) given features x^(i). It can be written for both labels (0 and 1) without a case distinction. Label exponentiation trick: use a^0 = 1.
p(y = y^(i) | x^(i); θ)
= p(y = 1 | x^(i); θ) if y^(i) = 1, and p(y = 0 | x^(i); θ) if y^(i) = 0
= p(y = 1 | x^(i); θ)^1 · p(y = 0 | x^(i); θ)^0 if y^(i) = 1, and p(y = 1 | x^(i); θ)^0 · p(y = 0 | x^(i); θ)^1 if y^(i) = 0
= p(y = 1 | x^(i); θ)^{y^(i)} · p(y = 0 | x^(i); θ)^{1 - y^(i)}

Binary Logistic Regression. Conditional Negative Log-Likelihood (NLL):
NLL(θ) = -log p(y | X; θ)
= -log ∏_{i=1}^m p(y = y^(i) | x^(i); θ)
= -log ∏_{i=1}^m p(y = 1 | x^(i); θ)^{y^(i)} (1 - p(y = 1 | x^(i); θ))^{1 - y^(i)}
= -∑_{i=1}^m [ y^(i) log σ(θ^T x^(i)) + (1 - y^(i)) log(1 - σ(θ^T x^(i))) ]
There is no closed-form solution for the minimum; use numerical/iterative methods: L-BFGS, gradient descent, ...
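The last line of this derivation translates directly into code. A sketch, assuming numpy and a toy dataset; X, y, and the clipping constant eps are illustrative assumptions, the latter added only for numerical safety:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y, eps=1e-12):
    # NLL(theta) = -sum_i [ y_i log sigma(theta^T x_i) + (1 - y_i) log(1 - sigma(theta^T x_i)) ]
    p = np.clip(sigmoid(X @ theta), eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# illustrative toy data: 4 examples, 2 features
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])
theta = np.zeros(2)
print(nll(theta, X, y))  # equals 4 * log(2) at theta = 0
```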

Logistic Regression. Logistic regression: the logistic sigmoid function applied to a weighted linear combination of feature values, interpreted as the probability that the label for a specific example equals 1. Applying the model to test data: predict ŷ^(i) = 1 if p(y = 1 | x^(i); θ) > 0.5. There is no closed-form solution for minimizing the NLL; iterative methods are necessary. Next up: gradient descent.

Optimization. Optimization: minimize some function J(θ) by altering θ. Maximize f(θ) by minimizing J(θ) = -f(θ). J(θ): criterion, objective function, cost function, loss function, error function. In a probabilistic machine learning setting it is often the (conditional) negative log-likelihood, -log p(X; θ) or -log p(y | X; θ). The optimal parameters: θ* = arg min_θ J(θ).

Optimization. If J(θ) is convex, it is minimized where ∇_θ J(θ) = 0. If J(θ) is not convex, the gradient can nevertheless help us to improve our objective (and find a local optimum). Many optimization techniques were originally developed for convex objective functions, but are found to work well for non-convex functions too. Use the fact that the gradient indicates the slope of the function in the direction of steepest increase.

Gradient-Based Optimization. Derivative: given a small change in input, what is the corresponding change in output? f(x + ε) ≈ f(x) + ε f'(x), so f(x - ε sign(f'(x))) < f(x) for small enough ε.

Gradient Descent. For J(θ): R^n → R. If the partial derivative ∂J(θ)/∂θ_j > 0, then J(θ) will increase for small increases of θ_j, so go in the opposite direction of the gradient (since we want to minimize). Steepest descent: iterate θ_{t+1} ← θ_t - η ∇_θ J(θ_t). η is the learning rate (set to a small positive constant). Converged if ∇_θ J(θ) is (close to) 0.
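A generic sketch of this steepest-descent iteration; grad_J stands for any function returning ∇_θ J(θ), and the quadratic example objective is made up for illustration:

```python
import numpy as np

def gradient_descent(grad_J, theta0, eta=0.1, tol=1e-6, max_iter=1000):
    # iterate theta_{t+1} <- theta_t - eta * grad J(theta_t)
    theta = theta0
    for _ in range(max_iter):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:   # converged: gradient (close to) 0
            break
        theta = theta - eta * g
    return theta

# illustrative convex example: J(theta) = ||theta - 3||^2, with gradient 2 (theta - 3)
print(gradient_descent(lambda th: 2 * (th - 3.0), np.array([0.0])))  # -> approx. [3.]
```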

Local Minima. If the function is non-convex, different results can be obtained at convergence, depending on the initialization of θ.

Local Minima. Minima can be global or local. Critical (stationary) points: f'(x) = 0. For neural networks, only good (not perfect) parameter values can be found.

Gradient Descent for Logistic Regression.
∇_θ NLL(θ) = -∇_θ ∑_{i=1}^m [ y^(i) log σ(θ^T x^(i)) + (1 - y^(i)) log(1 - σ(θ^T x^(i))) ]
= -∑_{i=1}^m (y^(i) - σ(θ^T x^(i))) x^(i)
The gradient descent update becomes:
θ_{t+1} := θ_t + η ∑_{i=1}^m (y^(i) - σ(θ_t^T x^(i))) x^(i)
Note: which feature weights are increased, which are decreased?
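A sketch of this batch update for logistic regression; the toy dataset X, y is an illustrative assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(theta, X, y, eta):
    # grad NLL = -sum_i (y_i - sigma(theta^T x_i)) x_i
    # update:    theta <- theta + eta * sum_i (y_i - sigma(theta^T x_i)) x_i
    residual = y - sigmoid(X @ theta)     # shape (m,)
    return theta + eta * (X.T @ residual)

# toy data: label 1 iff the second feature is active (illustrative only)
X = np.array([[1., 0.], [1., 1.], [1., 0.], [1., 1.]])   # first column = bias feature
y = np.array([0., 1., 0., 1.])
theta = np.zeros(2)
for _ in range(500):
    theta = batch_update(theta, X, y, eta=0.5)
print(theta)   # the weight on the second feature grows positive, the bias weight goes negative
```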

Derivation of the Gradient for Logistic Regression. This is a great exercise! Use the following facts:
Gradient: (∇_θ f(θ))_j = ∂f(θ)/∂θ_j
Derivative of a sum: d/dz ∑_i f_i(z) = ∑_i df_i(z)/dz
Chain rule: F(z) = f(g(z)) ⇒ F'(z) = f'(g(z)) g'(z)
Derivative of the logarithm: d log z / dz = 1/z
Derivative of the logistic sigmoid: dσ(z)/dz = σ(z)(1 - σ(z))
Partial derivative of the dot product: ∂(θ^T x)/∂θ_j = x_j

Gradient Descent: Summary. An iterative method for function minimization. The gradient indicates the rate of change of the objective function, given a local change to the feature weights. Subtract the gradient: decrease parameters that (locally) have a positive correlation with the objective, increase parameters that (locally) have a negative correlation with the objective. Gradient updates only have the desired properties in a small region around the previous parameters θ_t; control locality by the step size η. Gradient descent is slow: for a relatively small step in the right direction, all of the training data has to be processed. This version of gradient descent is often also called batch gradient descent.

Stochastic Gradient Descent (SGD). Batch gradient descent is slow: for a relatively small step in the right direction, all of the training data has to be processed:
θ_{t+1} ← θ_t + η ∇_θ ∑_{i=1}^m log p(y_i | x_i; θ)
Stochastic gradient descent in a nutshell: for each update, only use a random sample B_t of the training data (a mini-batch):
θ_{t+1} ← θ_t + η ∇_θ ∑_{i ∈ B_t} log p(y_i | x_i; θ)
The mini-batch size can also just be 1 (more frequent updates):
θ_{t+1} ← θ_t + η ∇_θ log p(y_t | x_t; θ)
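A sketch of the mini-batch variant; the shuffling scheme, batch size, and learning rate below are illustrative choices, and grad_log_p stands for any function returning ∇_θ of the log-likelihood summed over the given examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(grad_log_p, theta, X, y, eta=0.1, batch_size=2, epochs=10):
    # theta_{t+1} <- theta_t + eta * grad_theta sum_{i in B_t} log p(y_i | x_i; theta)
    m = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(m)              # shuffle, then slice into mini-batches B_t
        for start in range(0, m, batch_size):
            b = order[start:start + batch_size]
            theta = theta + eta * grad_log_p(theta, X[b], y[b])
    return theta

# for binary logistic regression, the per-batch gradient of the log-likelihood is X_b^T (y_b - sigma(X_b theta))
grad_log_p = lambda th, Xb, yb: Xb.T @ (yb - sigmoid(Xb @ th))
```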

Stochastic Gradient Descent (SGD). The actual gradient is approximated using only a sub-sample of the data. For objective functions that are highly non-convex, the random deviations of these approximations may even help to escape local minima. Treat batch size and learning rate as hyper-parameters.

Deep Feedforward Networks. Function approximation: find a good mapping ŷ = f(x; θ). Network: a composition of functions f^(1), f^(2), f^(3) with multi-dimensional input and output. Each f^(i) represents one layer: f(x) = f^(1)(f^(2)(f^(3)(x))). Feedforward: input → intermediate representation → output; no feedback connections (cf. recurrent networks).
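A minimal sketch of such a composition of layers; the layer sizes, random weights, and the ReLU/sigmoid choices are assumptions for illustration, not taken from the slide:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def layer(W, b, activation):
    # one layer: x -> activation(W x + b)
    return lambda x: activation(W @ x + b)

# composed as on the slide, f(x) = f1(f2(f3(x))), with f3 applied to the input first
rng = np.random.default_rng(0)
f3 = layer(rng.standard_normal((4, 3)), np.zeros(4), relu)      # input layer, 3 -> 4
f2 = layer(rng.standard_normal((2, 4)), np.zeros(2), relu)      # hidden layer, 4 -> 2
f1 = layer(rng.standard_normal((1, 2)), np.zeros(1), sigmoid)   # output layer, 2 -> 1

f = lambda x: f1(f2(f3(x)))
print(f(np.array([1.0, -0.5, 2.0])))
```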

Deep Feedforward Networks: Training. The loss function is defined on the output layer, e.g. the squared error ½‖ŷ - y‖². A quality criterion on the other layers is not directly defined; the training algorithm must decide how to use those layers most effectively (w.r.t. the loss on the output layer). Non-output layers can be viewed as providing a feature function φ(x) of the input that is to be learned.

Neural Networks. Inspired by biological neurons (nerve cells). Neurons are connected to each other, and receive and send electrical pulses. "If the [input] voltage changes by a large enough amount, an all-or-none electrochemical pulse called an action potential is generated, which travels rapidly along the cell's axon, and activates synaptic connections with other cells when it arrives." (Wikipedia)

Activation Functions with Non-Linearities. Linear functions are limited in what they can express; a famous example is XOR. Simple layered non-linear functions can represent XOR.

Design Choices for Output Units. Typically the outputs can be interpreted as probabilities: logistic sigmoid, softmax, mean and variance of a Gaussian, ... Trained with negative log-likelihood.

Softmax. Logistic sigmoid: a vector y of binary outcomes, with no constraints on how many can be 1 (Bernoulli distribution). Softmax: exactly one element of y is 1 (multinoulli/categorical distribution), p(y = i | φ(x)) with ∑_i p(y = i | φ(x)) = 1, where softmax(z)_i = exp(z_i) / ∑_j exp(z_j).
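A direct sketch of this softmax formula; subtracting max(z) is a standard numerical-stability trick that leaves the result unchanged (not part of the slide's definition):

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])        # illustrative logits
p = softmax(z)
print(p, p.sum())                    # probabilities of the categorical distribution, sum to 1
```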

Parametrizing a Gaussian Distribution. Use the final layer to predict the parameters of a Gaussian mixture model. Weights of the mixture components: softmax. Means: no non-linearity. Precisions (1/σ²) need to be positive: softplus, softplus(z) = ln(1 + exp(z)).
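A small sketch of mapping a final-layer output vector to mixture parameters as described; the layout of the raw output vector (weights, then means, then precisions) is an assumption for illustration:

```python
import numpy as np

softplus = lambda z: np.log1p(np.exp(z))        # ln(1 + exp(z)), always positive
softmax = lambda z: np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

def mixture_params(raw, k):
    # raw: final-layer output of length 3k (assumed layout: [weights | means | precisions])
    weights = softmax(raw[:k])                   # mixture weights: softmax
    means = raw[k:2 * k]                         # means: no non-linearity
    precisions = softplus(raw[2 * k:])           # precisions 1/sigma^2: softplus keeps them positive
    return weights, means, precisions

print(mixture_params(np.arange(6, dtype=float), k=2))
```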

Rectified Linear Units. Rectified Linear Unit: relu(z) = max(0, z) with z = x^T w + b. Consistent gradient of 1 when the unit is active (i.e. if there is an error to propagate). The default choice for hidden units.

A Simple ReLU Network to Solve XOR. f(x; W, c, w) = w^T max(0, W^T x + c), with
W = [[1, 1], [1, 1]], c = [0, -1]^T, w = [1, -2]^T.
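A quick numerical check of this network on all four XOR inputs (a sketch; the matrices are exactly those on the slide):

```python
import numpy as np

W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])

def f(x):
    # f(x; W, c, w) = w^T max(0, W^T x + c)
    return w @ np.maximum(0.0, W.T @ x + c)

for x in [(0., 0.), (0., 1.), (1., 0.), (1., 1.)]:
    print(x, f(np.array(x)))    # outputs 0, 1, 1, 0 = XOR
```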

Other Choices for Hidden Units. A good activation function aids learning and provides large gradients. Sigmoidal functions (logistic sigmoid) have only a small region before they flatten out in either direction. Practice shows that this seems to be OK in conjunction with a log-loss objective, but they don't work as well as hidden units. ReLUs are a better alternative since the gradient stays constant. Other hidden unit functions: maxout (take the maximum of several values in the previous layer); purely linear (can serve as a low-rank approximation).

Forward propagation: input information propagates through the network to produce the output ŷ (and the cost J(θ) during training). Back-propagation: compute the gradient w.r.t. the model parameters; the cost gradient propagates backwards through the network. Back-propagation is part of a learning procedure (e.g. stochastic gradient descent), not a learning procedure in itself.

Chain Rule of Calculus: Real Functions. Let x, y, z ∈ R, f, g: R → R, y = g(x), z = f(g(x)) = f(y). Then dz/dx = (dz/dy)(dy/dx).

Chain Rule of Calculus: Multivariate Functions. Let x ∈ R^m, y ∈ R^n, z ∈ R, f: R^n → R, g: R^m → R^n, y = g(x), z = f(g(x)) = f(y). Then ∂z/∂x_i = ∑_{j=1}^n (∂z/∂y_j)(∂y_j/∂x_i). In order to write this in vector notation, we need to define the Jacobian matrix.

Jacobian. The Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function: J = ∂g(x)/∂x, with entries J_{ij} = ∂g(x)_i/∂x_j (rows i = 1, ..., n, columns j = 1, ..., m). How do we write the multivariate chain rule in terms of gradients? We can write it as ∇_x z = (∂y/∂x)^T ∇_y z.
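A small numeric sanity check of the identity ∇_x z = (∂y/∂x)^T ∇_y z, using made-up functions g and f and a finite-difference comparison:

```python
import numpy as np

g = lambda x: np.array([x[0] * x[1], np.sin(x[0])])          # g: R^2 -> R^2
f = lambda y: y[0] + y[1] ** 2                               # f: R^2 -> R

x = np.array([0.3, -1.2])
y = g(x)

jacobian = np.array([[x[1], x[0]],                           # dy/dx: row i is the gradient of g(x)_i
                     [np.cos(x[0]), 0.0]])
grad_y = np.array([1.0, 2.0 * y[1]])                         # nabla_y z for z = f(y)
grad_x = jacobian.T @ grad_y                                 # chain rule: nabla_x z = (dy/dx)^T nabla_y z

# finite-difference check
eps = 1e-6
numeric = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps) for e in np.eye(2)])
print(grad_x, numeric)                                       # should agree closely
```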

Viewing the Network as a Graph. Nodes are function outputs (they can be scalar or vector valued), arrows are inputs. Example: scalar multiplication z = x · y. [Graph: nodes x and y feed into the node z = x · y.]

Which Function? [Graph: inputs x and w, u = x^T w, ŷ = σ(u), i.e. the graph computes the logistic regression prediction ŷ = σ(x^T w).]

Graph with Cost. [Graph: u = x^T w, ŷ = σ(u), and the cost node L = -log P(y | ŷ).]

Which Function? Parameter vectors can be converted to a matrix as needed. [Graph: u' = V^T x, h = max(0, u'), u = h^T w, ŷ = σ(u), i.e. a feedforward network with one ReLU hidden layer and a sigmoid output.]

Forward Pass. Green: known or computed. [Graph: x, V, and w are known; u' = V^T x, h = max(0, u'), u = h^T w, ŷ = σ(u), and L = -log P(y | ŷ) are computed step by step.]

Forward Pass. End of the forward pass (some steps skipped). [Graph: all nodes up to the cost L are now computed.]

Backward Pass. Red: gradient of the cost computed for a node. [Graph: dL/dŷ is computed at the output node.]

Backward Pass. Red: gradient of the cost computed for a node. [Graph: dL/dŷ and dL/du = (dL/dŷ)(dŷ/du) are computed.]

Backward Pass. Red: gradient of the cost computed for a node. [Graph: the gradient is propagated further: ∂L/∂h = (dL/dŷ)(dŷ/du)(∂u/∂h) and ∂L/∂w = (dL/dŷ)(dŷ/du)(∂u/∂w).]

End of the Backward Pass. We have the gradients for all parameters; let's use them for SGD. [Graph: gradients for all remaining nodes, including ∂L/∂u' = (∂h/∂u')^T ∂L/∂h and ∂L/∂V = (∂u'/∂V)^T ∂L/∂u'.]
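A sketch of the full forward and backward pass for exactly this graph (u' = V^T x, h = max(0, u'), u = h^T w, ŷ = σ(u), L = -log P(y | ŷ)); the shapes and example values are illustrative assumptions:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# illustrative values
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
V = rng.standard_normal((3, 4))       # first-layer parameters
w = rng.standard_normal(4)            # output-layer parameters
y = 1.0                               # gold label

# forward pass
u_prime = V.T @ x                     # u' = V^T x
h = np.maximum(0.0, u_prime)          # h  = max(0, u')
u = h @ w                             # u  = h^T w
y_hat = sigmoid(u)                    # y^ = sigma(u)
L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # L = -log P(y | y^)

# backward pass (chain rule, output first)
dL_du = y_hat - y                     # combines dL/dy^ and dy^/du for sigmoid + log-loss
dL_dw = dL_du * h                     # u = h^T w  =>  du/dw = h
dL_dh = dL_du * w                     # u = h^T w  =>  du/dh = w
dL_du_prime = dL_dh * (u_prime > 0)   # ReLU passes the gradient only where u' > 0
dL_dV = np.outer(x, dL_du_prime)      # u' = V^T x =>  dL/dV has the same shape as V

# quick finite-difference check on one entry of V (y = 1, so L = -log y^)
eps = 1e-6
V2 = V.copy()
V2[0, 0] += eps
L2 = -np.log(sigmoid(np.maximum(0.0, V2.T @ x) @ w))
print(dL_dV[0, 0], (L2 - L) / eps)    # should agree closely
```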

Summary. Gradient descent: minimize the loss by iteratively subtracting the gradient from the parameter vector. Stochastic gradient descent: approximate the gradient by considering small subsets of examples. Regularization: penalize large parameter values, e.g. by adding the L2 norm of the parameter vector. Feedforward networks: layers of (non-linear) function compositions. Output non-linearities: interpreted as probability distributions or densities (logistic sigmoid, softmax, Gaussian). Hidden layers: rectified linear units (max(0, z)). Backpropagation: compute the gradient of the cost w.r.t. the parameters using the chain rule.

Regularization. [Three plots of prediction vs. feature, illustrating underfitting vs. overfitting.] Regularization: any modification to a learning algorithm intended to reduce its generalization error but not its training error. It builds a preference into the ML algorithm for one solution in hypothesis space over another; the solution space is still the same. The unpreferred solution is penalized: it is only chosen if it fits the training data much better.

L2-Regularization. Large parameters → overfitting. [Plots of σ(x), σ(2x), σ(100x): larger weights make the sigmoid steeper.] Prefer models with smaller feature weights. Popular regularizers: penalize a large L2 norm; penalize a large L1 norm (aka LASSO, induces sparsity).

Regularization. Add a term that penalizes a large L2 norm; the amount of penalty is controlled by a parameter λ. Linear regression: J(θ) = MSE(θ) + (λ/2) θ^T θ. Logistic regression: J(θ) = NLL(θ) + (λ/2) θ^T θ. From a Bayesian perspective, L2 regularization corresponds to a Gaussian prior on the parameters.

L2-Regularization. The surface of the objective function is now a combination of the original cost and the regularization penalty.

L2-Regularization. Gradient of the regularization term: ∇_θ (λ/2) θ^T θ = λθ. Gradient descent for the regularized cost function:
θ_{t+1} := θ_t - η ∇_θ (NLL(θ_t) + (λ/2) θ_t^T θ_t)
θ_{t+1} := (1 - ηλ) θ_t - η ∇_θ NLL(θ_t)
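A sketch of this regularized update in its weight-decay form; grad_nll stands for any function returning ∇_θ NLL(θ), and the quadratic example objective is made up for illustration:

```python
import numpy as np

def regularized_step(theta, grad_nll, eta, lam):
    # theta_{t+1} := (1 - eta * lam) * theta_t - eta * grad NLL(theta_t)
    # i.e. ordinary gradient descent on the NLL plus shrinking the weights a little each step ("weight decay")
    return (1.0 - eta * lam) * theta - eta * grad_nll(theta)

# illustrative use with a made-up gradient function: NLL(theta) = ||theta - 1||^2 has gradient 2 (theta - 1)
theta = np.zeros(3)
for _ in range(200):
    theta = regularized_step(theta, lambda th: 2 * (th - 1.0), eta=0.1, lam=0.5)
print(theta)   # pulled towards 1 by the NLL, but shrunk towards 0 by the L2 penalty
```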