Neural Network Experiments

To illustrate practical techniques, I chose to use the glass dataset. This dataset has 214 examples and 6 classes. Here are 4 examples from the original dataset. The last value indicates the class.

1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0 1
1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0 1
1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0 1
1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0 1

Some of the values are relatively large, and some attributes have a small range, so I scaled each attribute to be between 0 and 1 and formatted the examples for my program, with the class encoded as six targets: +1 for the example's class and -1 for the others.

0.43 0.44 1.00 0.25 0.35 0.01 0.31 0 0  1 -1 -1 -1 -1 -1
0.28 0.48 0.80 0.33 0.52 0.08 0.22 0 0  1 -1 -1 -1 -1 -1
0.22 0.42 0.79 0.39 0.57 0.06 0.22 0 0  1 -1 -1 -1 -1 -1
0.29 0.37 0.82 0.31 0.50 0.09 0.26 0 0  1 -1 -1 -1 -1 -1
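
As a concrete illustration of this preprocessing, here is a minimal sketch (not the original program) of min-max scaling each attribute to [0, 1] and encoding the class as six +1/-1 targets. The function name preprocess and the assumed column layout (9 attributes followed by a class label from the six glass types) are my own illustration.

import numpy as np

def preprocess(raw, class_values=(1, 2, 3, 5, 6, 7)):
    """Scale each attribute to [0, 1] and encode the class as +1/-1 targets."""
    raw = np.asarray(raw, dtype=float)
    X, labels = raw[:, :-1], raw[:, -1].astype(int)

    # Min-max scaling per attribute over the whole dataset.
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)

    # One target per class: +1 for the example's class, -1 for the rest.
    T = -np.ones((len(labels), len(class_values)))
    for i, c in enumerate(labels):
        T[i, class_values.index(c)] = 1.0
    return X, T

# Example with the first two rows shown above (in practice all 214 rows are used).
raw_rows = [
    [1.52101, 13.64, 4.49, 1.10, 71.78, 0.06, 8.75, 0, 0, 1],
    [1.51761, 13.89, 3.60, 1.36, 72.73, 0.48, 7.83, 0, 0, 1],
]
X, T = preprocess(raw_rows)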

For training and testing purposes, I randomly split the dataset into 114 training and 100 test examples. This split was stratified, i.e., each split has about the same proportion of each class. Unless stated otherwise, there are 6 output neurons, hidden neurons use tanh activation, output neurons use identity activation, and the goal is to minimize squared error / 2 (a sketch of this setup appears below). Do not generalize too much from these results; the samples are too small.

Number of Hidden Neurons

With more hidden neurons, lower training error is expected, but test error will increase at some point. I compared the following:

0 hidden neurons, 0.01 lrate, all initial weights 0
5 hidden neurons, 0.001 lrate, random initial weights in [-0.1, 0.1]
10 hidden neurons, 0.001 lrate, random initial weights in [-0.1, 0.1]
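
The following is a minimal NumPy sketch, under the stated defaults (tanh hidden units, identity outputs, squared error / 2), of a single hidden-layer network and one stochastic gradient descent update. The names (init_net, sgd_step, and so on) are illustrative assumptions, not the program used for these experiments.

import numpy as np

def init_net(n_in, n_hidden, n_out, scale=0.1, rng=None):
    """Weights drawn uniformly from [-scale, scale] (scale=0 gives all-zero weights)."""
    if rng is None:
        rng = np.random.default_rng(0)
    return {
        "W1": rng.uniform(-scale, scale, (n_hidden, n_in)),
        "b1": rng.uniform(-scale, scale, n_hidden),
        "W2": rng.uniform(-scale, scale, (n_out, n_hidden)),
        "b2": rng.uniform(-scale, scale, n_out),
    }

def forward(net, x):
    h = np.tanh(net["W1"] @ x + net["b1"])   # hidden layer: tanh activation
    y = net["W2"] @ h + net["b2"]            # output layer: identity activation
    return h, y

def sgd_step(net, x, t, lrate):
    """One online update minimizing E = 0.5 * ||y - t||^2 on a single example."""
    h, y = forward(net, x)
    delta_out = y - t                                      # dE/dy for identity outputs
    delta_hid = (net["W2"].T @ delta_out) * (1.0 - h**2)   # tanh'(a) = 1 - tanh(a)^2
    net["W2"] -= lrate * np.outer(delta_out, h)
    net["b2"] -= lrate * delta_out
    net["W1"] -= lrate * np.outer(delta_hid, x)
    net["b1"] -= lrate * delta_hid
    return 0.5 * np.sum(delta_out ** 2)

net = init_net(n_in=9, n_hidden=10, n_out=6, scale=0.1)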

[Figure: training error over 10,000 epochs and test error over 1,000 epochs for networks with 0, 5, and 10 hidden neurons.]

Interpretation of results: Over 10,000 epochs, 10 hidden units had the lowest training error and 0 hidden units the highest. Over 1,000 epochs, all NNs decreased test error up to a point and then started overfitting. 5 and 10 hidden units had the lowest minimum test error, but the training algorithm needs a stopping criterion.

Different Output Activation Functions

Because the desired outputs are all in {-1, 1}, an activation that keeps outputs in [-1, 1] might produce much lower error. Using 10 hidden units, initial weights in [-0.1, 0.1], and tanh for hidden neurons, I considered (a sketch of these activations follows the list):

identity activation for output units, 0.001 lrate
tanh activation for output units, 0.002 lrate
bipolar sigmoid activation for output units, 0.005 lrate
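
For reference, here is a sketch of the three output activations being compared, each paired with its derivative written as a function of the activation's output value. With a non-identity output activation, the output delta in the earlier sgd_step sketch becomes (y - t) * f'(y) instead of (y - t). The definitions are standard; the dictionary name is an illustrative assumption.

import numpy as np

# Each entry: (activation f(a), derivative f'(a) expressed as a function of y = f(a)).
OUTPUT_ACTIVATIONS = {
    "identity":        (lambda a: a,
                        lambda y: np.ones_like(y)),
    "tanh":            (np.tanh,
                        lambda y: 1.0 - y**2),
    "bipolar sigmoid": (lambda a: 2.0 / (1.0 + np.exp(-a)) - 1.0,
                        lambda y: 0.5 * (1.0 - y**2)),
}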

[Figure: training error and test error vs. epochs for identity, tanh, and bipolar sigmoid output activations.]

The three output activation functions are very close in training error, but tanh outperformed the other two on test error. I ran tanh output activation again, starting the random number generation with different seeds. The result was considerable variation in performance, though all of the runs were better than the other two activation functions.

[Figure: test error vs. epochs for tanh output activation with four different random seeds.]

Different Hidden Activation Functions

I also tried different activation functions for the hidden units. Using 10 hidden units, initial weights in [-0.1, 0.1], and identity for output neurons, I considered (see the sketch below):

tanh activation for hidden units, 0.001 lrate
logistic activation for hidden units, 0.002 lrate
bipolar sigmoid activation for hidden units, 0.002 lrate

[Figure: training error vs. epochs for tanh, logistic, and bipolar sigmoid hidden activations.]
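
Swapping the hidden activation only changes the derivative factor in the hidden-layer delta. A short sketch, reusing the assumed names from the earlier sgd_step:

import numpy as np

def hidden_delta(net, delta_out, h, g_prime):
    """Backpropagated error times the hidden activation derivative g'(a), given as a function of h = g(a)."""
    return (net["W2"].T @ delta_out) * g_prime(h)

# Derivatives written in terms of the hidden output h:
tanh_prime = lambda h: 1.0 - h**2               # tanh, outputs in (-1, 1)
logistic_prime = lambda h: h * (1.0 - h)        # logistic, outputs in (0, 1)
bipolar_prime = lambda h: 0.5 * (1.0 - h**2)    # bipolar sigmoid, outputs in (-1, 1)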

[Figure: test error vs. epochs for tanh, logistic, and bipolar sigmoid hidden activations.]

tanh for hidden neurons achieved the best training error, but all three had similar results for test error. In the first 1,000-2,000 epochs a test error minimum would be reached, and then overfitting would take place.

Weight Initialization

Initializing all weights to zero results in zero learning. With zero weights, every tanh hidden neuron outputs zero, so the gradients for the hidden-to-output weights vanish; and because the output weights are zero, the backpropagated hidden deltas (and hence the input-to-hidden gradients) vanish as well. Using 10 hidden units, 0.001 lrate, tanh for hidden neurons, and identity for output neurons, I compared (see the check below):

random initial weights in [-1.0, 1.0]
random initial weights in [-0.1, 0.1]
random initial weights in [-0.01, 0.01]

[Figure: training error vs. epochs for initial weights in [-1.0, 1.0], [-0.1, 0.1], and [-0.01, 0.01].]
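
The zero-initialization claim can be checked directly with the earlier hypothetical helpers: with all weights at zero, the tanh hidden outputs are zero, so the gradients for both weight matrices vanish and gradient descent leaves the weights where they started.

import numpy as np

net = {k: np.zeros_like(v) for k, v in init_net(9, 10, 6).items()}  # all-zero start
x, t = 0.5 * np.ones(9), -np.ones(6)
h, y = forward(net, x)
delta_out = y - t
print(h)                                          # tanh(0) = 0 for every hidden unit
print(np.outer(delta_out, h))                     # gradient for W2: all zeros
print((net["W2"].T @ delta_out) * (1.0 - h**2))   # hidden delta, hence W1 gradient: all zeros

# The three compared initializations correspond to scale = 1.0, 0.1, and 0.01 in init_net.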

[Figure: test error vs. epochs for the three weight initialization ranges.]

If the weights are initially too large, training is more likely to find a bad minimum. If they are too small, convergence is slower. In this case [-1.0, 1.0] leads to initially faster convergence, but then hits a minimum after about 4,000 epochs. The test error curves are similar, except that learning is faster for the larger initializations.

Selection Criterion: Validation Set

A neural network with more hidden neurons can represent more functions, but it is also more likely to overfit the data. I recommend using validation sets as a selection criterion.

Validation Set Method: Split the training set into a learning set and a validation set. Train on the learning set. Choose the weights that minimize error on the validation set.

I split the glass dataset into a test set of 50 examples, a learning set of 114 examples, and a validation set of 50 examples. Each row below compares the minimum test error reached during training with the test error of the weights chosen by the validation set method.

  test set minimum    validation set method
  57.9546             58.6212
  52.1225             62.2628
  50.3152             55.4330
  51.4142             51.9228
  48.6549             57.5341

There is considerable variance. This improves with larger datasets.
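
A minimal sketch of the validation set method, again using the assumed helpers from earlier: train on the learning set, measure validation error after each epoch, and keep the weights that minimize it.

import copy
import numpy as np

def sum_squared_error(net, X, T):
    return sum(0.5 * np.sum((forward(net, x)[1] - t) ** 2) for x, t in zip(X, T))

def train_with_validation(net, X_learn, T_learn, X_val, T_val, lrate, epochs):
    best_err, best_net = np.inf, copy.deepcopy(net)
    for epoch in range(epochs):
        for x, t in zip(X_learn, T_learn):       # one epoch of online training
            sgd_step(net, x, t, lrate)
        val_err = sum_squared_error(net, X_val, T_val)
        if val_err < best_err:                   # keep the best weights seen so far
            best_err, best_net = val_err, copy.deepcopy(net)
    return best_net, best_err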

Momentum

Faster convergence can often be achieved by combining error derivatives over several epochs. Momentum is applied here with batch learning. Let Δw be the previous weight change and let η_m ∈ (0, 1) be the momentum parameter. Then the new weight change is:

Δw ← η_m Δw − η ∂E/∂w

This approximates second-derivative methods. Using 10 hidden units, η = 0.1/114, initial weights in [-0.1, 0.1], tanh for hidden neurons, and identity for output neurons, I evaluated batch learning with η_m ∈ {0, 0.2, 0.4, 0.6, 0.8} (a sketch of the update follows below). The training error decreases with higher momentum, though 0.8 is a little unstable. Unfortunately, higher momentum also results in faster overfitting. Note: the x-axis is log-scale in the test error graph.
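
A minimal sketch of one batch epoch with this momentum rule, using the earlier assumed helpers: the previous weight change is scaled by η_m and the accumulated batch gradient is added with step size η.

import numpy as np

def batch_momentum_epoch(net, X, T, eta, eta_m, prev_change):
    # Accumulate the batch gradient of E = sum over examples of 0.5 * ||y - t||^2.
    grads = {k: np.zeros_like(v) for k, v in net.items()}
    for x, t in zip(X, T):
        h, y = forward(net, x)
        delta_out = y - t
        delta_hid = (net["W2"].T @ delta_out) * (1.0 - h**2)
        grads["W2"] += np.outer(delta_out, h)
        grads["b2"] += delta_out
        grads["W1"] += np.outer(delta_hid, x)
        grads["b1"] += delta_hid
    # Momentum update: delta_w <- eta_m * delta_w - eta * dE/dw
    for k in net:
        prev_change[k] = eta_m * prev_change[k] - eta * grads[k]
        net[k] += prev_change[k]
    return prev_change

# prev_change starts at zero: {k: np.zeros_like(v) for k, v in net.items()}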

[Figure: training error (linear x-axis) and test error (log-scale x-axis) vs. epochs for momentum values 0.0, 0.2, 0.4, 0.6, 0.8.]