# Cheng Soon Ong & Christian Walder. Canberra February June 2018

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1 Linear Regression 2 Linear Classification 1 Linear Classification 2 Kernel Methods Sparse Kernel Methods Mixture Models and EM 1 Mixture Models and EM Principal Component Analysis Autoencoders Graphical Models 1 Graphical Models 2 Graphical Models 3 Sampling Sequential Data 1 Sequential Data 2 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 828

2 Part XIX 1 676of 828

3 Basis Functions The basis functions play a crucial role in the algorithms explored so far. Number and parameters of basis functions fixed before learning starts (e.g. Linear Regression and Linear Classification). Number of basis functions fixed, parameters of the basis functions are adaptive (e.g. ). Centre basis function on the data, select a subset of basis functions in the training phase (e.g. Support Vector Machines, Relevance Vector Machines). 677of 828

4 Outline The functional form of the network model (including special parametrisation of the basis functions). How to determine the network parameters within the maximum likelihood framework? (Solution of a nonlinear optimisation problem.) Error backpropagation : efficiently evaluate the derivatives of the log likelihood function with respect to the network parameters. Various approaches to regularise neural networks. 678of 828

5 Feed-forward Network Functions Same goal as before: e.g. for regression, decompose t(x) = y(x, w) + ɛ(x) where ɛ(x) is the residual error. (Generalised) Linear Model M y(x, w) = f w j φ j (x) j=0 where φ = (φ 0,..., φ M ) T is the fixed model basis and w = (w 0,..., w M ) T are the model parameter. For regression: f ( ) is the identity function. For classification: f ( ) is a nonlinear activation function. Goal : Let φ j (x) depend on parameters, and then adjust these parameters together with w. 679of 828

6 Feed-forward Network Functions Goal : Let φ j (x) depend on parameters, and then adjust these parameters together with w. Many ways to do this. Neural networks use basis functions which follow the same form as the (generalised) linear model. EACH basis function is itself a nonlinear function of an adaptive linear combination of the inputs. hidden units x D w (1) MD z M KM y K inputs outputs y 1 x 1 x 0 z 1 10 z 0 680of 828

7 Functional Transformations Construct M linear combinations of the input variables x 1,..., x D in the form a j = }{{} activations D w (1) ji x i + w (1) j0 }{{}}{{} weights bias i=1 j = 1,..., M Apply a differentiable, nonlinear activation function h( ) to get the output of the hidden units z j = h(a j ) h( ) is typically sigmoid, tanh, or more recently ReLU(x) = max(x, 0). hidden units xd w (1) MD zm KM yk inputs outputs y1 x1 x0 z1 10 z0 681of 828

8 Functional Transformations The outputs of the hidden units are again linearly combined a k = M j=1 kj z j + k0 k = 1,..., K Apply again a differentiable, nonlinear activation function g( ) to get the network outputs y k y k = g(a k ) hidden units xd w (1) MD zm KM yk inputs outputs y1 x1 x0 z1 10 z0 682of 828

9 Functional Transformations The activation function g( ) is determined by the nature of the data and the distribution of the target variables. For standard regression: g( ) is the identity function so that y k = a k. For multiple binary classification, g( ) is a logistic sigmoid function 1 y k = σ(a k ) = 1 + exp( a k ) hidden units xd w (1) MD zm KM yk inputs outputs y1 x1 x0 z1 10 z0 683of 828

10 Functional Transformations Combine all transformations into one formula ( M D ) y k (x, w) = g kj h w (1) ji x i + w (1) j0 j=1 i=1 where w contains all weight and bias parameters. + k0 hidden units xd w (1) MD zm KM yk inputs outputs y1 x1 x0 z1 10 z0 684of 828

11 Functional Transformations As before, the biases can be absorbed into the weights by introducing an extra input x 0 = 1 and a hidden unit z 0 = 1. ( M ) D y k (x, w) = g kj h w (1) ji x i j=0 where w now contains all weight and bias parameters. i=0 hidden units x D w (1) MD z M KM y K inputs outputs y 1 x 1 x 0 z 1 10 z 0 685of 828

12 Comparison to Perceptron A neural network looks like a multilayer perceptron. But perceptron s nonlinear activation function was a step function. Not smooth. Not differentiable. { +1, a 0 f (a) = 1, a < 0 The activation functions h( ) and g( ) of a neural network are smooth and differentiable. hidden units x D w (1) MD z M KM y K inputs outputs y 1 x 1 x 0 z 1 10 z 0 686of 828

13 Linear Activation Functions If all activation functions are linear functions then there exists an equivalent network without hidden units. (Composition of linear functions is a linear function.) But if the number of hidden units in this case is smaller than the number of input or output units, the resulting linear function are not the most general. Dimensionality reduction. Principal Component Analysis (comes later in the lecture). Generally, most neural networks use nonlinear activation functions as the goal is to approximate a nonlinear mapping from the input space to the outputs. 687of 828

14 FYI Only: NN Extensions Add more hidden layers (deep learning). Tom make it work we need many of the following tricks: Clever weight initialisation to ensure the gradient is flowing through the entire network. Some links may additively skip over one or several subsequent layer(s). Favour ReLU over e.g. the sigmoid, to avoid vanishing gradients. Clever regularisation methods such as dropout. Specific architectures, not further considered here: Parameters may be shared, notably as in convolutional neural networks for images. A state space model with neural network transitions is a recurrent neural network. Attention mechanisms learn to focus on specific parts of an input. 688of 828

15 as Universal Function Approximators Feed-forward neural networks are universal approximators. Example: A two-layer neural network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy if it has enough hidden units. Holds for a wide range of hidden unit activation functions, but NOT for polynomials. Remaining big question : Where do we get the appropriate settings for the weights from? With other words, how do we learn the weights from training examples? 689of 828

16 Aproximation Capabilities of Neural network approximating f (x) = x 2 Two-layer network with 3 hidden units (tanh activation functions) and linear outputs trained on 50 data points sampled from the interval ( 1, 1). Red: resulting output. Dashed: Output of the hidden units. 690of 828

17 Aproximation Capabilities of Neural network approximating f (x) = sin(x) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs trained on 50 data points sampled from the interval ( 1, 1). Red: resulting output. Dashed: Output of the hidden units. 691of 828

18 Aproximation Capabilities of Neural network approximating f (x) = x Two-layer network with 3 hidden units (tanh activation functions) and linear outputs trained on 50 data points sampled from the interval ( 1, 1). Red: resulting output. Dashed: Output of the hidden units. 692of 828

19 Aproximation Capabilities of Neural network approximating Heaviside function { 1, x 0 f (x) = 0, x < 0 Two-layer network with 3 hidden units (tanh activation functions) and linear outputs trained on 50 data points sampled from the interval ( 1, 1). Red: resulting output. Dashed: Output of the hidden units. 693of 828

20 Variable Basis Functions in a Hidden layer nodes represent parametrised basis functions x x1 z = σ(w 0 + w 1 x 1 + w 2 x 2 ) for (w 0, w 1, w 2 ) = (0.0, 1.0, 0.1) of 828

21 Variable Basis Functions in a Hidden layer nodes represent parametrised basis functions x x1 z = σ(w 0 + w 1 x 1 + w 2 x 2 ) for (w 0, w 1, w 2 ) = (0.0, 0.1, 1.0) of 828

22 Variable Basis Functions in a Hidden layer nodes represent parametrised basis functions x x1 z = σ(w 0 + w 1 x 1 + w 2 x 2 ) for (w 0, w 1, w 2 ) = (0.0, 0.5, 0.5) of 828

23 Variable Basis Functions in a Hidden layer nodes represent parametrised basis functions x x1 z = σ(w 0 + w 1 x 1 + w 2 x 2 ) for (w 0, w 1, w 2 ) = (10.0, 0.5, 0.5) of 828

24 Aproximation Capabilities of Neural network for two-class classification. 2 inputs, 2 hidden units with tanh activation function, 1 output with logistic sigmoid activation function Red: y = 0.5 decision boundary. Dashed blue: z = 0.5 hidden unit contours. Green: Optimal decision boundary from the known data distribution. 698of 828

25 Given a set of weights w. This fixes a mapping from the input space to the output space. Does there exist another set of weights realising the same mapping? Assume tanh activation function for the hidden units. As tanh is an odd function: tanh( a) = tanh(a). Change the sign of all inputs to a hidden unit and outputs of this hidden unit: Mapping stays the same. hidden units x D w (1) MD z M KM y K inputs outputs y 1 x 1 x 0 z 1 10 z 0 699of 828

26 M hidden units, therefore 2 M equivalent weight vectors. Furthermore, exchange all of the weights going into and out of a hidden unit with the corresponding weights of another hidden unit. Mapping stays the same. M! symmetries. Overall weight space symmetry : M! 2 M M M! 2 M hidden units xd w (1) MD zm KM yk inputs outputs y1 x1 x0 z1 10 z0 700of 828

27 Assume the error E(w) is a smooth function of the weights. Smallest value will occur at a critical point for which E(w) = 0. This could be a minimum, maxium, or saddle point. Furthermore, because of symmetry in weight space, there are at least M! 2 M other critical points with the same value for the error. 701of 828

28 Definition (Global Minimum) A point w for which the error E(w ) is smaller than any other error E(w). Definition (Local Minimum) A point w for which the error E(w ) is smaller than any other error E(w) in some neighbourhood of w. E(w) w1 wa wb wc w2 E 702of 828

29 Finding the global minimium is difficult in general (would have to check everywhere) unless the error function comes from a special class (e.g. smooth convex functions have only one minimum). Error functions for neural networks are not convex (symmetries!). But finding a local minimum might be sufficient. Use iterative methods with weight vector update w (τ) to find a local minimum. w (τ+1) = w (τ) + w (τ) E(w) w1 wa wb wc w2 E 703of 828

30 Local Quadratic Approximation Around a stationary point w we can approximate E(w) E(w ) (w w ) T H(w w ), where the Hessian H is evaluated at w. Using a set {u i } of orthonormal eigenvectors of H, Hu i = λ i u i, to expand w w = i α i u i. We get E(w) = E(w ) + 1 λ i αi 2. 2 i 704of 828

31 Local Quadratic Approximation Around a minimum w we can approximate w 2 E(w) = E(w ) + 1 λ i αi 2. 2 i u 2 u 1 λ 1/2 2 w λ 1/2 1 w 1 705of 828

32 Local Quadratic Approximation Around a minimum w, the Hessian H must be positive definite if evaluated at w. w 2 u 2 u 1 λ 1/2 2 w λ 1/2 1 w 1 706of 828

33 Gradient Information improves Performances Hessian is symmetric and contains W(W + 1)/2 independent entries where W is the total number of weights in the network. Need to gather this O(W 2 ) pieces of information by doing O(W) function evaluations if nothing else is know. Get order O(W 3 ). The gradient E provides W pieces of information at once. Still need O(W) steps, but the order is now O(W 2 ). 707of 828

34 Batch processing : Update the weight vector with a small step in the direction of the negative gradient w (τ+1) = w (τ) η E(w (τ) ) where η is the learning rate. After each step, re-evaluate the gradient E(w (τ) ) again. has problems in long valleys. 708of 828

35 has problems in long valleys. Example of zig-zag of Algorithm. 709of 828

36 Nonlinear Use Conjugate instead of Gradient Descent to avoid zig-zag behaviour. Use Newton method which also calculates the inverse Hessian in each iteration (but inverting the Hessian is usually costly). Use Quasi-Newton methods (e.g. BFGS) which also calculates an estimate of the inverse Hessian while iterating. Run the algorithm from a set of starting points to find the smallest local minimum. 710of 828

37 Nonlinear Remaining big problem: Error function is defined over the whole training set. Therefore, need to process the whole training set for each calculation of the gradient E(w (τ) ). If the error function is a sum of errors for each data point E(w) = N E n (w) n=1 we can use on-line gradient descent (also called sequential gradient descent or stochastic gradient descent updating the weights by one data point at a time w (τ+1) = w (τ) η E n (w (τ) ). 711of 828

### Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

### Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

### Neural Network Training

Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

### NEURAL NETWORKS

5 Neural Networks In Chapters 3 and 4 we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw that such models have useful analytical

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

### Feed-forward Network Functions

Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

### Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

### Neural Networks and Deep Learning

Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

### Artificial Neural Networks

Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks

### y(x n, w) t n 2. (1)

Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,

### Introduction to Neural Networks

Introduction to Neural Networks Steve Renals Automatic Speech Recognition ASR Lecture 10 24 February 2014 ASR Lecture 10 Introduction to Neural Networks 1 Neural networks for speech recognition Introduction

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1

### Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

### Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

### Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

### Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Neural Networks Oliver Schulte - CMPT 726 Bishop PRML Ch. 5 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

### Multi-layer Neural Networks

Multi-layer Neural Networks Steve Renals Informatics 2B Learning and Data Lecture 13 8 March 2011 Informatics 2B: Learning and Data Lecture 13 Multi-layer Neural Networks 1 Overview Multi-layer neural

### CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

### Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII

### NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

### Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function

### Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

### Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

### Logistic Regression. COMP 527 Danushka Bollegala

Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

### Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

### April 9, Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá. Linear Classification Models. Fabio A. González Ph.D.

Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 9, 2018 Content 1 2 3 4 Outline 1 2 3 4 problems { C 1, y(x) threshold predict(x) = C 2, y(x) < threshold, with threshold

### Machine Learning Lecture 5

Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

### Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen

Neural Networks - I Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Neural Networks 1 /

### 4. Multilayer Perceptrons

4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output

### Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:

### CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

### CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross

### Pattern Recognition and Machine Learning

Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

### Multilayer Perceptron

Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

### Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

### Deep Feedforward Networks

Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

### Logistic Regression & Neural Networks

Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability

### Neural Networks (and Gradient Ascent Again)

Neural Networks (and Gradient Ascent Again) Frank Wood April 27, 2010 Generalized Regression Until now we have focused on linear regression techniques. We generalized linear regression to include nonlinear

### (Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

### Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

### Introduction to Machine Learning Spring 2018 Note Neural Networks

CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

### Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

Neural Networks, Computation Graphs CMSC 470 Marine Carpuat Binary Classification with a Multi-layer Perceptron φ A = 1 φ site = 1 φ located = 1 φ Maizuru = 1 φ, = 2 φ in = 1 φ Kyoto = 1 φ priest = 0 φ

### CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! November 18, 2015 THE EXAM IS CLOSED BOOK. Once the exam has started, SORRY, NO TALKING!!! No, you can t even say see ya

### Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

Introduction to Natural Computation Lecture 9 Multilayer Perceptrons and Backpropagation Peter Lewis 1 / 25 Overview of the Lecture Why multilayer perceptrons? Some applications of multilayer perceptrons.

### 1 What a Neural Network Computes

Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

### Introduction to Machine Learning

Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

### Artificial Neural Networks 2

CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

### Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

+ Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions

### Error Backpropagation

Error Backpropagation Sargur Srihari 1 Topics in Error Backpropagation Terminology of backpropagation 1. Evaluation of Error function derivatives 2. Error Backpropagation algorithm 3. A simple example

### Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

### Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011! 1 Todayʼs lecture" How the brain works (!)! Artificial neural networks! Perceptrons! Multilayer feed-forward networks! Error

### Computational statistics

Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

### Linear Models for Classification

Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

### Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

### Machine Learning. 7. Logistic and Linear Regression

Sapienza University of Rome, Italy - Machine Learning (27/28) University of Rome La Sapienza Master in Artificial Intelligence and Robotics Machine Learning 7. Logistic and Linear Regression Luca Iocchi,

### A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

### COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

### Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

### Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

### Deep Neural Networks (1) Hidden layers; Back-propagation

Deep Neural Networs (1) Hidden layers; Bac-propagation Steve Renals Machine Learning Practical MLP Lecture 3 4 October 2017 / 9 October 2017 MLP Lecture 3 Deep Neural Networs (1) 1 Recap: Softmax single

### From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

### Linear Regression (continued)

Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression

### Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

### Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

### Regularization in Neural Networks

Regularization in Neural Networks Sargur Srihari 1 Topics in Neural Network Regularization What is regularization? Methods 1. Determining optimal number of hidden units 2. Use of regularizer in error function

### Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch

### Multiclass Logistic Regression

Multiclass Logistic Regression Sargur. Srihari University at Buffalo, State University of ew York USA Machine Learning Srihari Topics in Linear Classification using Probabilistic Discriminative Models

### Multilayer Perceptrons (MLPs)

CSE 5526: Introduction to Neural Networks Multilayer Perceptrons (MLPs) 1 Motivation Multilayer networks are more powerful than singlelayer nets Example: XOR problem x 2 1 AND x o x 1 x 2 +1-1 o x x 1-1

### Deep Feedforward Networks

Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

### <Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation

### Neural Networks (Part 1) Goals for the lecture

Neural Networks (Part ) Mark Craven and David Page Computer Sciences 760 Spring 208 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed

### Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 2012 Engineering Part IIB: Module 4F10 Introduction In

### Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

### CSC242: Intro to AI. Lecture 21

CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages

### Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 89 Part II

### Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

### Lecture 12. Neural Networks Bastian Leibe RWTH Aachen

Advanced Machine Learning Lecture 12 Neural Networks 10.12.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

### Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

### Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

### Lecture 17: Neural Networks and Deep Learning

UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

### Deep Feedforward Networks

Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

### LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,

### Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

### COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-

### Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

### ECS171: Machine Learning

ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

### Linear & nonlinear classifiers

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

### Deep Neural Networks (1) Hidden layers; Back-propagation

Deep Neural Networs (1) Hidden layers; Bac-propagation Steve Renals Machine Learning Practical MLP Lecture 3 2 October 2018 http://www.inf.ed.ac.u/teaching/courses/mlp/ MLP Lecture 3 / 2 October 2018 Deep

### Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w 1 x 1 + w 2 x 2 + w 0 = 0 Feature 1 x 2 = w 1 w 2 x 1 w 0 w 2 Feature 2 A perceptron