CS489/698: Intro to ML Lecture 03: Multi-layer Perceptron

Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator.

Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator.

The XOR problem. X = { (0,0), (0,1), (1,0), (1,1) }, y = { -1, +1, +1, -1 }. f*(x) = sign[ f_xor(x) - ½ ], where f_xor(x) = (x_1 ∧ ¬x_2) ∨ (¬x_1 ∧ x_2).

No separating hyperplane.
x_1 = (0,0), y_1 = -1  ⟹  b < 0
x_2 = (0,1), y_2 = +1  ⟹  w_2 + b > 0
x_3 = (1,0), y_3 = +1  ⟹  w_1 + b > 0
x_4 = (1,1), y_4 = -1  ⟹  w_1 + w_2 + b < 0
Contradiction! Adding the two middle inequalities gives (w_1 + w_2 + b) + b > 0, while adding the first and the last gives (w_1 + w_2 + b) + b < 0. [At least one of the two "> 0" inequalities has to be strict.]
Exercise: what happens if we run perceptron/winnow on this example? (See the sketch below.)
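To see the exercise in action, here is a minimal sketch (plain NumPy, the standard perceptron update; everything beyond the four XOR points and their labels is illustrative). The mistake count per pass never reaches zero, exactly as the contradiction above predicts.

```python
import numpy as np

# XOR data from the slide: labels are -1, +1, +1, -1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

w, b = np.zeros(2), 0.0
for epoch in range(1, 21):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
            w += yi * xi             # standard perceptron update
            b += yi
            mistakes += 1
    print(f"epoch {epoch:2d}: mistakes = {mistakes}, w = {w}, b = {b}")
# The mistake count never reaches 0: XOR is not linearly separable.
```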

Fixing the problem. Our model (hyperplanes) underfits the data (the XOR function). Two ways out: fix the representation and use a richer model, or fix the model and use a richer representation. A NN takes the second route: it still uses a hyperplane, but learns the representation simultaneously.

Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator.

Two-layer perceptron. [Diagram] Input layer (x_1, x_2) → first linear layer (weights u_11, u_12, u_21, u_22; biases c_1, c_2) → z_1, z_2 → nonlinear activation function → hidden layer (h_1, h_2), a learned representation → second linear layer (weights w_1, w_2; bias b) → output layer ŷ. The nonlinear transform between the two linear layers makes all the difference!

Does it work? With the Rectified Linear Unit (ReLU) as activation:
x_1 = (0,0), y_1 = -1:  z_1 = (0,-1), h_1 = (0,0), ŷ_1 = -1
x_2 = (0,1), y_2 = +1:  z_2 = (1,0), h_2 = (1,0), ŷ_2 = +1
x_3 = (1,0), y_3 = +1:  z_3 = (1,0), h_3 = (1,0), ŷ_3 = +1
x_4 = (1,1), y_4 = -1:  z_4 = (2,1), h_4 = (2,1), ŷ_4 = -1
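A small sketch that reproduces the table above. The first-layer weights and biases are chosen so that z matches the slide (z_1 = x_1 + x_2, z_2 = x_1 + x_2 - 1); the output weights w = (1, -2) and bias b = -0.5 are not given on the slide, they are just one choice that works.

```python
import numpy as np

U = np.array([[1.0, 1.0],    # z_1 = x_1 + x_2
              [1.0, 1.0]])   # z_2 = x_1 + x_2 - 1
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])    # output weights (one valid choice, assumed)
b = -0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

z = X @ U.T + c              # first linear layer
h = np.maximum(z, 0.0)       # ReLU
y_hat = np.sign(h @ w + b)   # second linear layer + sign

print(z)      # rows: (0,-1), (1,0), (1,0), (2,1)  -- matches the slide
print(y_hat)  # [-1.  1.  1. -1.]  -- XOR recovered
```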

Multi-layer perceptron. [Diagram] Input layer x_1, ..., x_d in R^d → fully connected weights → z_1, ..., z_k → hidden layer h_1, ..., h_k in R^k → fully connected weights → output layer ŷ_1, ..., ŷ_m in R^m.

Multi-layer perceptron (stacked). [Diagram: several hidden layers stacked between the inputs x_1, ..., x_d and the outputs ŷ_1, ..., ŷ_m; the number of stacked layers is the depth, the number of units per layer is the width.] The hidden layers form a learned hierarchical nonlinear feature representation, followed by a linear predictor; computation flows feed-forward.
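A minimal sketch of the stacked feed-forward computation, assuming ReLU hidden activations and random illustrative weights (the slide fixes neither):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, layers):
    """Feed-forward pass through a stacked MLP.

    layers: list of (W, b) pairs; every layer except the last is followed
    by a ReLU, the last one is the linear predictor.
    """
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)          # hidden layer: linear map + nonlinearity
    W_out, b_out = layers[-1]
    return W_out @ h + b_out         # final linear layer (no nonlinearity)

# Example: d=3 inputs, two hidden layers of width k=4, m=2 outputs.
rng = np.random.default_rng(0)
dims = [3, 4, 4, 2]
layers = [(rng.standard_normal((dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
print(mlp_forward(rng.standard_normal(3), layers))   # a vector in R^2
```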

Activation functions. [Plots] Sigmoid (saturates), Tanh (smooth), Rectified Linear (nonsmooth).
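For reference, the three activations in code (these are the standard formulas; the slide only shows their plots):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # saturates at 0 and 1 for large |z|

def tanh(z):
    return np.tanh(z)                  # smooth, saturates at -1 and 1

def relu(z):
    return np.maximum(z, 0.0)          # nonsmooth at z = 0

z = np.linspace(-4, 4, 9)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```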

Underfitting vs overfitting. A linear predictor (perceptron / winnow / linear regression) underfits. NNs learn a hierarchical nonlinear feature representation jointly with the linear predictor, and may overfit; there are tons of heuristics to deal with this (some later). Size of a NN: number of weights/connections ~ 1 billion; the human brain has ~10^6 billion (estimated).

Training the weights. [Diagram: the stacked MLP maps x in R^d to ŷ in R^m.] ŷ = q(x; Θ). We need a loss to measure the difference between the prediction ŷ and the truth y, e.g. (ŷ - y)^2 (more later), and a training set {(x_1, y_1), ..., (x_n, y_n)} to train the weights Θ.

Gradient Descent. Update: Θ_{t+1} = Θ_t - η_t ∇L(Θ_t). Computing the (generalized) gradient costs O(n): it sums over all n training examples! Step size (learning rate) η_t: constant if L is smooth; diminishing otherwise. (A minimal sketch follows.)
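A minimal gradient-descent sketch; the least-squares objective, the random data, and the constant step size are illustrative assumptions, not from the slide:

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, steps=100):
    """Plain gradient descent: theta <- theta - eta * grad L(theta).

    grad_L averages over all n training examples, hence the O(n) cost
    per step mentioned on the slide.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_L(theta)
    return theta

# Toy example (assumed): least squares on random data.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 3)), rng.standard_normal(50)
grad = lambda th: 2.0 / len(X) * X.T @ (X @ th - y)   # gradient of mean (ŷ-y)^2
print(gradient_descent(grad, np.zeros(3), eta=0.1, steps=500))
```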

Stochastic Gradient Descent (SGD). Instead of averaging over all n samples, the gradient at a single random sample suffices. Use a diminishing step size, e.g. 1/sqrt(t) or 1/t. Refinements: averaging, momentum, variance reduction, etc. In practice: sample without replacement; cycle through the data; permute it in each pass.
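And a stochastic counterpart: one example per update, a fresh permutation in each pass, and the 1/sqrt(t) step size mentioned above (the toy data is again an assumption):

```python
import numpy as np

def sgd(grad_i, n, theta0, passes=20):
    """SGD sketch: one randomly ordered example per update, step size 1/sqrt(t).

    grad_i(theta, i) returns the gradient contributed by example i alone.
    """
    theta = np.asarray(theta0, dtype=float)
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(passes):
        for i in rng.permutation(n):      # permute in each pass
            t += 1
            theta = theta - (1.0 / np.sqrt(t)) * grad_i(theta, i)
    return theta

# Same toy least-squares setup as above (assumed data).
rng = np.random.default_rng(1)
X, y = rng.standard_normal((50, 3)), rng.standard_normal(50)
grad_i = lambda th, i: 2.0 * X[i] * (X[i] @ th - y[i])
print(sgd(grad_i, len(X), np.zeros(3)))
```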

A little history on optimization. Gradient descent first mentioned in (Cauchy, 1847). First rigorous convergence proof (Curry, 1944). SGD proposed and analyzed (Robbins & Monro, 1951).

Herbert Robbins (1915-2001)

Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator.

Backpropagation. A fancy name for the chain rule for derivatives: if f(x) = g[h(x)], then f'(x) = g'[h(x)] · h'(x). It efficiently computes the derivative in a NN in two passes, with complexity O(size(NN)): the forward pass computes the function values sequentially; the backward pass computes the derivatives sequentially.

Algorithm. [Pseudocode on slide] Forward pass, then backward pass; f denotes the activation function and J = L + λΩ the training objective (loss plus regularizer). A sketch follows.
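Here is what the two passes look like for a one-hidden-layer net with ReLU and squared loss; shapes, weights, and the numerical check are illustrative, since the slide's pseudocode is not in the transcription:

```python
import numpy as np

def forward(x, y, W1, b1, W2, b2):
    z = W1 @ x + b1
    h = np.maximum(z, 0.0)          # forward pass: values computed in order
    y_hat = W2 @ h + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    return z, h, y_hat, loss

def backward(x, y, W1, b1, W2, b2):
    z, h, y_hat, _ = forward(x, y, W1, b1, W2, b2)
    d_yhat = y_hat - y              # backward pass: derivatives in reverse order
    dW2, db2 = np.outer(d_yhat, h), d_yhat
    d_h = W2.T @ d_yhat
    d_z = d_h * (z > 0)             # ReLU derivative
    dW1, db1 = np.outer(d_z, x), d_z
    return dW1, db1, dW2, db2

# Quick numerical check on one entry of W1.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
dW1 = backward(x, y, W1, b1, W2, b2)[0]
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (forward(x, y, W1p, b1, W2, b2)[3] - forward(x, y, W1, b1, W2, b2)[3]) / eps
print(dW1[0, 0], num)               # the two numbers should agree closely
```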

History (perfect for a course project!)

Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator.

Rationals are dense in R. Any real number can be approximated arbitrarily well by some rational number. Or, in fancy mathematical language: R is the domain of interest, Q is the subset we have in our pocket, and |·| is the metric used to measure the approximation. (See the statement below.)
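Written out (a standard statement, with notation not taken from the slide):

```latex
\forall x \in \mathbb{R},\ \forall \varepsilon > 0,\ \exists\, q \in \mathbb{Q} \ \text{such that}\ |x - q| < \varepsilon .
```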

Kolmogorov-Arnold Theorem. Theorem (Kolmogorov, Arnold, Lorentz, Sprecher, ...): any continuous function g: [0,1]^d → R can be written (exactly!) as a superposition of continuous univariate functions and addition. Binary addition is the only multivariate function needed! This solves Hilbert's 13th problem. The catch: the outer univariate functions must be constructed from g, which is unknown.
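One standard form of the representation (the exact formula shown on the slide is not in the transcription): there exist continuous univariate functions φ_{q,p} (independent of g) and Φ_q (constructed from g) such that

```latex
g(x_1, \dots, x_d) \;=\; \sum_{q=0}^{2d} \Phi_q\!\left( \sum_{p=1}^{d} \phi_{q,p}(x_p) \right).
```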

Andrey Kolmogorov (1903-1987). Foundations of the Theory of Probability. "Every mathematician believes that he is ahead of the others. The reason none state this belief in public is because they are intelligent people."

Universal Approximator. Theorem (Cybenko, Hornik et al., Leshno et al., ...): any continuous function g: [0,1]^d → R can be uniformly approximated to arbitrary precision by a two-layer NN with an activation function f that (1) is locally bounded, (2) has a set of discontinuity points whose closure is negligible (measure zero), and (3) is not a polynomial. These conditions are necessary in some sense, and they cover (almost) all activation functions used in practice.
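Concretely, "two-layer NN" here means the familiar single-hidden-layer form; the statement below is the standard one (the formula itself is not in the transcription): for every ε > 0 there are a width k and parameters a_i, w_i, b_i with

```latex
\sup_{x \in [0,1]^d} \left|\, g(x) - \sum_{i=1}^{k} a_i\, f\!\big(w_i^{\top} x + b_i\big) \right| \;<\; \varepsilon .
```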

Caveat and remedy. NNs were praised for being universal, but we shall see later that many kernels are universal as well; universality is desirable, but perhaps not THE explanation. A two-layer NN may need exponentially many hidden units; increasing the depth may reduce the network size, exponentially! (Can be a course project; ask for references.)

Questions?