Neural Networks with Applications to Vision and Language
Feedforward Networks
Marco Kuhlmann
Feedforward networks
Linear separability
[Figure: two scatter plots over x1 and x2; left: a labelling of the points that is linearly separable; right: a labelling (XOR) that is not linearly separable]
New features to the rescue!
[Figure: the XOR points with an added third feature x3 = xor(x1, x2); the two classes now lie on the planes x3 = 0 and x3 = 1 and become linearly separable]
How do we get new features?
We want to apply the linear model not to x directly but to a representation φ(x) of x. How do we get this representation?
Option 1. Manually engineer φ using expert knowledge. (feature engineering)
Option 2. Parameterise the model such that learning these parameters identifies a good representation φ. (feature learning)
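As a small sketch (not part of the original slides; NumPy assumed), the XOR example can be checked in code: over the raw features no linear model fits, but after adding the engineered feature x3 = xor(x1, x2) a single linear threshold separates the classes.

```python
import numpy as np

# The four XOR points and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Add the engineered feature x3 = xor(x1, x2).
x3 = np.logical_xor(X[:, 0], X[:, 1]).astype(float)
Phi = np.column_stack([X, x3])

# In the new representation, the plane x3 = 0.5 separates the classes.
w = np.array([0.0, 0.0, 1.0])
b = -0.5
predictions = (Phi @ w + b > 0).astype(int)
# predictions agrees with y on all four points
```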
From linear models to neural networks
[Figure: left, a linear model mapping the inputs x1, x2 directly to the output y; right, a network with a hidden layer h1, h2 between the inputs and y]
Function composition
Neural networks are called networks because they can be understood in terms of function composition: (f ∘ g)(x) = f(g(x)).
In essence, a neural network is an acyclic directed graph that describes how a collection of functions is composed. (length of the composition chain = depth of the model)
The compositional structure of neural networks is important for the success of gradient-based optimisation. (chain rule of derivatives)
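A minimal sketch of this idea (illustrative code, not from the slides; the weight values are arbitrary): a two-layer network is literally the composition of a hidden-layer function g and an output-layer function f.

```python
import numpy as np

def g(x):
    # hidden layer: R^2 -> R^3
    W = np.ones((3, 2))
    b = np.zeros(3)
    return np.tanh(W @ x + b)

def f(h):
    # output layer: R^3 -> R
    w = np.ones(3)
    return w @ h

def compose(f, g):
    """(f . g)(x) = f(g(x))"""
    return lambda x: f(g(x))

network = compose(f, g)
y = network(np.array([1.0, -1.0]))
```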
Functions, types, compositions
[Figure: a network computing f(g(x)); g maps the 2-dimensional input (x1, x2) to the 3-dimensional hidden layer (h1, h2, h3), and f maps the hidden layer to the 1-dimensional output y]
Shapes of the parameter matrices
[Figure: the same network; the hidden-layer parameter matrix H has shape (2, 3), the output-layer parameter matrix W has shape (3, 1)]
Feedforward networks Information flows through the network from the input layer x, through the intermediate layers, to the output layer y. There are no feedback connections in which (possibly intermediate) outputs of the model are fed back to itself. When feedforward networks are extended to include feedback connections, they are called recurrent networks.
A simple feedforward network
[Figure: input layer x1, x2; hidden layer h1, h2, h3; output layer y]
Artificial neuron
[Figure: inputs x0, …, xn with weights θ0, …, θn feed into a weighted sum Σ, followed by an activation function f, producing the output h(x) = f(Σ_i θ_i x_i)]
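The neuron computation (weighted sum followed by an activation function) can be sketched as follows; the concrete input and weight values here are made up for illustration.

```python
import numpy as np

def neuron(x, theta, f):
    """Weighted sum of the inputs, followed by an activation function f."""
    return f(np.dot(theta, x))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, -1.0])     # x0 = 1 acts as the bias input
theta = np.array([0.5, 1.0, 1.0])  # theta0 is the bias weight
h = neuron(x, theta, sigmoid)      # sigmoid(0.5 + 2.0 - 1.0) = sigmoid(1.5)
```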
The rules of the game
Choose the activation functions that will be used at each layer. (sigmoid, tanh, rectified linear units, …)
Choose an error function. (a function of the predicted output and the target output)
Choose a regulariser to prevent the network from overfitting. (encodes preferences over the choices of parameters)
Choose an optimisation procedure to minimise the training loss. (typically a variant of stochastic gradient descent)
Activation functions
Logistic function
[Figure: the logistic function σ(z) = 1/(1 + exp(−z)), plotted over the interval [−6, 6]]
Logistic function
The output of a logistic unit is a number between 0 and 1. Therefore, the output can be interpreted as a conditional probability P(y = 1 | x) for a binary random variable y. This makes logistic units ideal as output units for binary classification problems.
Softmax function
The softmax function takes a k-dimensional vector z as its input and returns a k-dimensional vector y such that
y_i = exp(z_i) / Σ_j exp(z_j)
The softmax function generalises the logistic function in that it yields a probability distribution over k possible classes. In particular, each of the output components is a number between 0 and 1, and the sum of all output components is 1.
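A standard implementation sketch (NumPy assumed; the max-subtraction trick is a common numerical-stability convention, not something the slide prescribes):

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum leaves the result unchanged
    # but avoids overflow in exp for large components.
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = softmax(np.array([1.0, 2.0, 3.0]))
# y is a probability distribution: components in (0, 1), summing to 1
```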
Softmax layer
[Figure: hidden units h1–h4 feed pre-activations z1, z2, z3, which a softmax layer turns into outputs y1, y2, y3]
Hyperbolic tangent
[Figure: the hyperbolic tangent tanh(z), ranging from −1 to 1, plotted over the interval [−6, 6]]
Problems with sigmoidal units
Sigmoidal units saturate across most of their domain, which can make gradient-based learning very difficult. (the gradient is close to zero both for large negative and large positive inputs)
For this reason, their use as hidden units in feedforward networks is now discouraged.
Sigmoidal units can still be used as output units when the cost function can undo the saturation. (not the case with squared loss!)
Rectified linear units
[Figure: the rectifier relu(z) = max(0, z), plotted over the interval [−6, 6]]
Comparison of activation functions
[Figure: two panels comparing sigmoid, tanh, and relu; left: the activation functions; right: their gradients]
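The comparison can be reproduced in code (illustrative sketch, NumPy assumed): evaluating the gradients shows that sigmoid and tanh saturate at the edges of the plotted interval, while relu keeps a gradient of 1 for all positive inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)

z = np.array([-6.0, 0.0, 6.0])
# d_sigmoid(z) and d_tanh(z) are nearly zero at z = ±6,
# while d_relu(z) is exactly 1 for every positive z.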
Error functions
Maximum likelihood estimation
Consider a family of probability distributions P(X; θ) that assign a probability to any sequence X of N examples. The maximum likelihood estimator for θ is then defined as
θ_ML = argmax_θ P(X; θ)
If we assume that the examples are mutually independent and identically distributed, this can be rewritten as
θ_ML = argmax_θ Σ_{i=1..N} log P(x_i; θ)
Properties of the maximum likelihood estimator
The maximum likelihood estimator has two desirable properties:
Consistency. The mean squared error between the estimated and the true parameters decreases as N increases.
Efficiency. No consistent estimator has a lower mean squared error with respect to the parameters.
Conditional log-likelihood
In supervised learning, we want to learn a conditional probability distribution over target values y, given features x. The assumption that the samples are i.i.d. gives us
θ_ML = argmax_θ Σ_{i=1..N} log P(y_i | x_i; θ)
Maximising likelihood is the same as minimising the cross-entropy between the empirical distribution and the model. (derivation in GBC, section 5.7)
Conditional log-likelihood
The maximum likelihood principle gives us a principled way to derive the cost function for a supervised learning problem:
J(θ) = −Σ_{i=1..N} log P(y_i | x_i; θ)
In the case of linear regression, minimising this expression is equivalent to minimising the mean squared error. (GBC, Section 5.7.1)
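As a small numeric sketch of the negative log-likelihood cost (illustrative code, not from the slides; the probability values are invented):

```python
import numpy as np

def nll(probs):
    """Average negative log-likelihood, where probs[i] is the probability
    the model assigns to the correct target y_i given x_i."""
    return -np.mean(np.log(probs))

# A model that is confident and right incurs a low cost ...
low = nll(np.array([0.9, 0.95, 0.99]))
# ... while a single confident wrong prediction is punished heavily.
high = nll(np.array([0.9, 0.95, 0.01]))
```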
Negative log-likelihood
[Figure: the negative log-likelihood −log p as a function of p ∈ (0, 1]; the cost grows without bound as p approaches 0 and vanishes at p = 1]
Logistic function
[Figure: the logistic function again, plotted over the interval [−6, 6]]
Cross-entropy error function
The output of a logistic unit can be interpreted as the conditional probability P(y_i = 1 | x) for a binary random variable y_i. The natural error function for a logistic unit is the negative log probability of the correct output:
E = −log h(x) if y = 1, and E = −log(1 − h(x)) if y = 0
This is usually written as
E = −[y log h(x) + (1 − y) log(1 − h(x))]
Cross-entropy cost function
[Figure: the cross-entropy error as a function of the prediction h(x); left: for target y = 1, the error vanishes as h(x) approaches 1; right: for target y = 0, it vanishes as h(x) approaches 0]
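The cross-entropy error for a single logistic output can be sketched directly (illustrative code, NumPy assumed):

```python
import numpy as np

def cross_entropy(y, h):
    """E = -(y log h + (1 - y) log(1 - h)) for target y and prediction h."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

# The error is near zero for a confident correct prediction and
# grows without bound as the prediction approaches the wrong extreme.
e_good = cross_entropy(1, 0.99)
e_bad = cross_entropy(1, 0.01)
```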
Sigmoid and cross-entropy balance each other
[Figure: an output unit with pre-activation z_k, activation function f, output y_k, and error E]
The steepness of the cross-entropy error exactly balances the flatness of the logistic function.
Regularisation
Norm-based regularisation
We can regularise the training of a neural network by adding an additional term to the error function.
L2-regularisation: Give preference to parameter vectors with smaller Euclidean norms (lengths):
E(θ) + λ ‖θ‖₂²
L1-regularisation: Give preference to parameter vectors with smaller absolute-value norms:
E(θ) + λ ‖θ‖₁
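The two penalty terms can be sketched as follows (illustrative code; the factor 1/2 on the L2 term is a common convention that simplifies its gradient, not something the slide prescribes):

```python
import numpy as np

def l2_penalty(theta, lam):
    # lam/2 * ||theta||_2^2; the 1/2 makes the gradient simply lam * theta
    return 0.5 * lam * np.sum(theta ** 2)

def l1_penalty(theta, lam):
    # lam * ||theta||_1
    return lam * np.sum(np.abs(theta))

theta = np.array([3.0, -4.0])
# L2 penalises the squared Euclidean norm (25.0),
# L1 the sum of absolute values (7.0).
```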
Selected regularisation techniques Dataset augmentation. Generate new training data by systematically transforming the existing data. example: rotating and scaling images Early stopping. Stop the training when the validation set error goes up and backtrack to the previous set of parameters. Bagging. Train several different models separately, then have all of the models vote on the output.
Dropout Randomly set a fraction of units to zero during training. for example, 50% of all units in a given layer Intuition: Damaging random parts of the network prevents it from becoming oversensitive to idiosyncratic patterns in the data.
Dropout
[Figure: left, the unmodified neural net; right, the net after applying dropout, with a random subset of units removed]
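A common implementation sketch (NumPy assumed; the rescaling by 1/(1 − p), known as inverted dropout, is a widespread convention rather than something the slides specify):

```python
import numpy as np

def dropout(h, p, rng):
    """Zero each unit with probability p during training and rescale the
    survivors by 1/(1 - p) so the expected activation is unchanged."""
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones(1000)
dropped = dropout(h, 0.5, rng)
# Roughly half the units are zeroed; the survivors are scaled to 2.0.
```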
Backpropagation
Backpropagation
Feedforward networks can be trained using gradient descent. (a feedforward network is a chain of differentiable functions)
The computational problem is how to compute the gradient efficiently for all layers of the network at the same time. The standard algorithm for this is called backpropagation.
Network structure
[Figure: a chain of layers; unit outputs y_i feed through weights w_ij into units with activation f, whose outputs feed through weights w_jk into the next layer]
Forward pass
[Figure: the forward pass through the network]
z_j = Σ_i w_ij · y_i,  y_j = f(z_j),  z_k = Σ_j w_jk · y_j,  y_k = f(z_k), and finally the error E is computed from the outputs y_k.
What do we want?
For every weight w_ij in the network, we want the gradient ∂E/∂w_ij of the error with respect to that weight.
Computing the errors
[Figure: the backward pass; the error E is propagated from the outputs y_k back through z_k, w_jk, y_j, z_j, and w_ij to the inputs y_i]
Error in the output layer
δ_k = ∂E/∂z_k = ∂E/∂y_k · f′(z_k)
Error in a hidden layer
δ_j = ∂E/∂z_j = f′(z_j) · Σ_k w_jk · δ_k
Computing the weight gradients
∂E/∂w_ij = δ_j · y_i
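The whole forward and backward pass can be sketched for a small network (illustrative code, not from the slides; a tanh hidden layer and a logistic output with cross-entropy error are assumed, for which the output error simplifies to y − t):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2)        # inputs y_i
t = 1.0                           # target
W1 = rng.standard_normal((3, 2))  # hidden weights w_ij
w2 = rng.standard_normal(3)       # output weights w_jk

# Forward pass.
z1 = W1 @ x
h = np.tanh(z1)                   # hidden activations y_j
z2 = w2 @ h
y = 1.0 / (1.0 + np.exp(-z2))     # logistic output y_k

# Backward pass.
delta_k = y - t                   # output error (sigmoid + cross-entropy)
delta_j = (1.0 - h ** 2) * (w2 * delta_k)   # hidden error: f'(z_j) * sum_k w_jk delta_k
grad_w2 = delta_k * h             # dE/dw_jk = delta_k * y_j
grad_W1 = np.outer(delta_j, x)    # dE/dw_ij = delta_j * y_i
```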
Lab: Handwritten digit recognition
Handwritten digit recognition You are to build a feedforward net that takes in a greyscale image of a handwritten digit and outputs the digit (an integer). supervised learning
Basic network architecture
[Figure: input layer with 1 + 28 × 28 units (one neuron for each pixel, plus a bias unit); a hidden layer of size ?; output layer with 10 units (one neuron for each digit)]
How to use the network
Translate each image to a vector x with 1 + 28 × 28 components, where component x_i is the greyscale value for pixel i in the image. The greyscale value is a fraction k/255 between 0 (black) and 1 (white).
Feed the image to the network. Find the neuron y_i in the output layer that has the highest activation and predict the digit i.
Bonus: Implement a softmax layer!
What does the net learn?
[Figure: visualisation of what the network learns. Source: Kylin-Xu]
How to train the network
To train the network we use the MNIST database, which consists of 70,000 labelled handwritten digits.
Each target is translated into a vector y with 10 components, where y_i is 1 if the target equals i and 0 otherwise. Example: If the target is 3, then y_3 = 1 and all other components are zero.
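The target encoding described above can be sketched in a few lines (illustrative code; the function name is made up):

```python
import numpy as np

def one_hot(target, k=10):
    """Translate a digit into a k-component target vector:
    y_i is 1 if the target equals i and 0 otherwise."""
    y = np.zeros(k)
    y[target] = 1.0
    return y

y = one_hot(3)
# y[3] is 1; all other components are 0
```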