Advances in Extreme Learning Machines


Advances in Extreme Learning Machines Mark van Heeswijk April 17, 2015

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 2/23

Trends in Machine Learning: increasing dimensionality of data sets; increasing size of data sets. 300 hours of video are uploaded to YouTube every minute; 1.8 billion pictures are uploaded every day to various sites (mid-2014) [http://tech.firstpost.com/news-analysis/now-upload-share-1-8-billion-photoseveryday-meeker-report-224688.html]. Machine learning methods are needed that can analyze this data and extract useful knowledge and insights from this wealth of information. Advances in Extreme Learning Machines 4/23

Contributions Combinations of Multiple Extreme Learning Machines (ELM) I: Adaptive Ensemble Models of ELMs for Time Series Prediction II: GPU-accelerated and parallelized ELM ensembles for large-scale regression Variable Selection and ELM III: Feature selection for nonlinear models with extreme learning machines IV: Fast Feature Selection in a GPU Cluster Using the Delta Test V: Binary/Ternary Extreme Learning Machines Trade-offs in Extreme Learning Machines VI: Compressive ELM: Improved Models Through Exploiting Time-Accuracy Trade-offs Advances in Extreme Learning Machines 5/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 6/23

Extreme Learning Machines [network diagram: input layer (inputs x_j1 ... x_j4), hidden layer reached through random weights, output layer (output y_j) obtained as a least-squares solution] Advances in Extreme Learning Machines 7/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 8/23

Adaptive Ensemble of ELMs (Publ I) random inputs; random #neurons; trained on sliding/growing window; adaptive linear combination based on performance of each ELM Advances in Extreme Learning Machines 9/23

Parallelized/GPU-Accelerated Ensemble of ELMs (Publ II) random inputs; optimized #neurons (loo-cv on GPU); training performed on GPU; training of ELMs parallelized over multiple CPU/GPU; fixed linear combination based on loo-cv Advances in Extreme Learning Machines 10/23

Highlighted ELM Ensemble Results [figure: ensemble weights w_i over time for two tasks] Adaptation of ensemble weights during the task Advances in Extreme Learning Machines 11/23

Highlighted ELM Ensemble Results [figure: ensemble NMSE vs. number of models in the ensemble] The more models in the ensemble, the more accurate the results Advances in Extreme Learning Machines 11/23

Highlighted ELM Ensemble Results [figure: ensemble running time (s) vs. number of workers] Scalability through parallelization, limited precision, use of GPUs Advances in Extreme Learning Machines 11/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 12/23

ELM-based Feature Selection (Publ III) [network diagram: inputs x_j1 ... x_j4 pass through a scaling layer s_1 ... s_4 before the hidden layer, output y_j] random #neurons for each restart of the algorithm; starting from a random scaling, the scaling is gradually reduced until empty, based on training and loo error Advances in Extreme Learning Machines 13/23

Highlighted ELM-FS Results Advances in Extreme Learning Machines 14/23

Fast Feature Selection with GPU-accelerated Delta Test (Publ IV) [network diagram: each input x_j1 ... x_j4 gated by a 0/1 selection, output y_j] variable selection optimized using an evolutionary algorithm with the GPU-accelerated Delta Test as fitness criterion Advances in Extreme Learning Machines 15/23

Fast Feature Selection with GPU-accelerated Delta Test (Publ IV) $\delta = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - y_{NN(i)} \right)^2$ Advances in Extreme Learning Machines 16/23
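
For concreteness, a brute-force NumPy sketch of the Delta test as written above (not the GPU implementation from the publication; names and the O(N^2) distance computation are illustrative only):

```python
import numpy as np

def delta_test(X, y):
    """Delta test: half the mean squared difference between each output y_i
    and the output y_NN(i) of the nearest neighbour of x_i in input space."""
    # pairwise squared distances between all input samples (brute force, O(N^2) memory)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)        # a point is not its own nearest neighbour
    nn = np.argmin(d2, axis=1)          # index NN(i) of the nearest neighbour of x_i
    return 0.5 * np.mean((y - y[nn]) ** 2)

# Variable selection then searches for the subset of inputs (columns of X)
# minimizing delta_test(X[:, mask], y), e.g. with an evolutionary algorithm.
```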

Binary/Ternary Extreme Learning Machines (Publ V) [network diagram: inputs x_j1 ... x_j4 connected to the hidden layer through binary/ternary weights, output y_j] scaling of weights and biases optimized using batch intrinsic plasticity pretraining; least-squares solution with fast L2 regularization using the SVD Advances in Extreme Learning Machines 17/23

Ternary ELM Results [figure: example ternary weight vectors with 1, 2 and 3 active variables, and test RMSE vs. number of hidden neurons for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM and BIP(rand)-TR-3-ELM] Advances in Extreme Learning Machines 18/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 19/23

Compressive ELM (Publ VI) [network diagram: inputs x_j1 ... x_j4 connected to the hidden layer through binary/ternary weights, output y_j] fast low-distortion embedding into a lower-dimensional space; scaling of weights and biases optimized using batch intrinsic plasticity pretraining; least-squares solution with fast L2 regularization using the SVD Advances in Extreme Learning Machines 20/23

Highlighted Results Compressive ELM [figure: training time vs. #hidden neurons, and test MSE vs. training time, for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 21/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 22/23

Summary Overall: the resulting collection of proposed methods provides an efficient, accurate and flexible framework for solving large-scale supervised learning problems. The proposed methods are not limited to the particular types of random-weight neural networks and contexts in which they have been tested, and can easily be incorporated in new contexts and models Advances in Extreme Learning Machines 23/23

Defence Slides: Related Models Advances in Extreme Learning Machines 24/23

ELM vs RVFL [network diagram: input layer (inputs x_j1 ... x_j4), hidden layer, output layer (output y_j)] Advances in Extreme Learning Machines 25/23

Defence Slides: General Advances in Extreme Learning Machines 26/23

Standard ELM Given a training set $(x_i, y_i)$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, an activation function $f : \mathbb{R} \to \mathbb{R}$ and $M$ the number of hidden nodes:
1: Randomly assign input weights $w_i$ and biases $b_i$, $i \in [1, M]$;
2: Calculate the hidden layer output matrix $H$;
3: Calculate the output weights matrix $\beta = H^{\dagger} Y$,
where
$$H = \begin{pmatrix} f(w_1 \cdot x_1 + b_1) & \cdots & f(w_M \cdot x_1 + b_M) \\ \vdots & \ddots & \vdots \\ f(w_1 \cdot x_N + b_1) & \cdots & f(w_M \cdot x_N + b_M) \end{pmatrix}$$
Advances in Extreme Learning Machines 27/23
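
To make the three steps concrete, here is a minimal NumPy sketch of a standard ELM with a sigmoid activation (not the thesis code; the toy data and all names are illustrative):

```python
import numpy as np

def elm_train(X, y, M, seed=0):
    """Standard ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(size=(d, M))                  # step 1: random input weights w_i
    b = rng.normal(size=M)                       # step 1: random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))       # step 2: hidden layer output matrix H
    beta = np.linalg.pinv(H) @ y                 # step 3: beta = H^dagger Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# toy usage
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
W, b, beta = elm_train(X, y, M=50)
print(np.mean((elm_predict(X, W, b, beta) - y) ** 2))    # training MSE
```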

ELM Theory vs Practice In theory, ELM is a universal approximator. In practice, there is a limited number of samples and a risk of overfitting. Therefore: the functional approximation should use as few neurons as possible; the hidden layer should extract and retain as much useful information as possible from the input samples Advances in Extreme Learning Machines 28/23

ELM Theory vs Practice Weight considerations: the weight range determines the typical activation of the transfer function (remember $\langle w_i, x \rangle = \|w_i\| \|x\| \cos\theta$), therefore normalize or tune the length of the weight vectors somehow. Linear vs non-linear: since sigmoid neurons operate in a nonlinear regime, add $d$ linear neurons for the ELM to work better on (almost) linear problems. Avoiding overfitting: use efficient L2 regularization Advances in Extreme Learning Machines 29/23
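
As an illustration of the "add d linear neurons" point, a hidden-layer matrix could be built like this (a sketch, not the thesis code; the function name is made up):

```python
import numpy as np

def build_hidden_layer(X, W, b):
    """Hidden-layer output matrix with the d inputs appended as linear neurons,
    so the ELM also handles (almost) linear problems well."""
    H_sigmoid = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # M sigmoid neurons
    return np.hstack([X, H_sigmoid])                  # shape (N, d + M)
```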

Batch Intrinsic Plasticity Suppose $(x_1, \ldots, x_N) \in \mathbb{R}^{N \times d}$, and the output of neuron $i$ is $h_i = f(a_i w_i \cdot x_k + b_i)$, where $f$ is an invertible transfer function. For each neuron $i$: from an exponential distribution with mean $\mu_{\mathrm{exp}}$, draw targets $t = (t_1, t_2, \ldots, t_N)$ and sort such that $t_1 < t_2 < \ldots < t_N$; compute all presynaptic inputs $s_k = w_i \cdot x_k$, and sort such that $s_1 < s_2 < \ldots < s_N$; now, find $a_i$ and $b_i$ such that
$$\begin{pmatrix} s_1 & 1 \\ \vdots & \vdots \\ s_N & 1 \end{pmatrix} \begin{pmatrix} a_i \\ b_i \end{pmatrix} = \begin{pmatrix} f^{-1}(t_1) \\ \vdots \\ f^{-1}(t_N) \end{pmatrix}$$
Advances in Extreme Learning Machines 30/23
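
A hedged NumPy sketch of this per-neuron least-squares fit, assuming a logistic sigmoid transfer function; clipping the exponential targets into (0, 1) is an implementation detail added here, not taken from the slides:

```python
import numpy as np

def bip_pretrain(X, W, mu_exp=0.2, seed=0):
    """Batch intrinsic plasticity: per neuron i, fit slope a_i and bias b_i so that
    the sorted presynaptic inputs s_k = w_i . x_k map onto sorted targets drawn
    from an exponential distribution with mean mu_exp."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    M = W.shape[1]
    f_inv = lambda t: np.log(t / (1.0 - t))      # inverse of the logistic sigmoid
    a, b = np.empty(M), np.empty(M)
    for i in range(M):
        t = np.sort(rng.exponential(scale=mu_exp, size=N))
        t = np.clip(t, 1e-3, 1.0 - 1e-3)         # keep targets inside f's range (0, 1)
        s = np.sort(X @ W[:, i])                 # sorted presynaptic inputs
        Phi = np.column_stack([s, np.ones(N)])   # the [s_k  1] design matrix
        (a[i], b[i]), *_ = np.linalg.lstsq(Phi, f_inv(t), rcond=None)
    return a, b                                   # per-neuron slope and bias
```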

Fast leave-one-out cross-validation The leave-one-out (LOO) error can be computed using the PRESS statistic:
$$E_{\mathrm{loo}} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - \mathrm{hat}_{ii}} \right)^2$$
where $\mathrm{hat}_{ii}$ is the $i$th value on the diagonal of the HAT matrix, which can be quickly computed, given $H^{\dagger}$:
$$\hat{Y} = H\beta = H H^{\dagger} Y = \mathrm{HAT} \cdot Y$$
Advances in Extreme Learning Machines 31/23

Fast leave-one-out cross-validation Using the SVD decomposition of $H = UDV^T$, it is possible to obtain all needed information for computing the PRESS statistic without recomputing the pseudo-inverse for every $\lambda$:
$$\hat{Y} = H\beta = H(H^T H + \lambda I)^{-1} H^T Y = HV(D^2 + \lambda I)^{-1} D U^T Y = UDV^T V(D^2 + \lambda I)^{-1} D U^T Y = UD(D^2 + \lambda I)^{-1} D U^T Y = \mathrm{HAT} \cdot Y$$
Advances in Extreme Learning Machines 32/23

Fast leave-one-out cross-validation where $D(D^2 + \lambda I)^{-1} D$ is a diagonal matrix with $\frac{d_{ii}^2}{d_{ii}^2 + \lambda}$ as its $i$th diagonal entry. Now:
$$\mathrm{MSE}_{\mathrm{TR\text{-}PRESS}} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - \mathrm{hat}_{ii}} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - h_i (H^T H + \lambda I)^{-1} h_i^T} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - u_i \,\mathrm{diag}\!\left(\frac{d_{jj}^2}{d_{jj}^2 + \lambda}\right) u_i^T} \right)^2$$
Advances in Extreme Learning Machines 33/23
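
The identities above translate directly into a few lines of NumPy; the sketch below evaluates the TR-PRESS error for a grid of λ values from a single SVD of H (an illustration, not the thesis implementation):

```python
import numpy as np

def tr_press(H, y, lambdas):
    """Leave-one-out (PRESS) MSE of the Tikhonov-regularized least-squares fit,
    for several lambda values, reusing one SVD of H."""
    U, d, Vt = np.linalg.svd(H, full_matrices=False)
    Uy = U.T @ y
    errors = []
    for lam in lambdas:
        shrink = d**2 / (d**2 + lam)                       # diagonal of D(D^2 + lam I)^-1 D
        y_hat = U @ (shrink * Uy)                          # HAT @ y
        hat_diag = np.einsum('ij,j,ij->i', U, shrink, U)   # hat_ii = u_i diag(.) u_i^T
        errors.append(np.mean(((y - y_hat) / (1.0 - hat_diag)) ** 2))
    return np.array(errors)

# e.g. pick the regularization parameter with the lowest estimated LOO error:
# lambdas = np.logspace(-6, 3, 20); best = lambdas[np.argmin(tr_press(H, y, lambdas))]
```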

Leave-one-out computation overhead [figure: time (s) vs. number of hidden neurons] Figure: Comparison of running times of ELM training (solid lines) and ELM training + leave-one-out computation (dotted lines), with (black lines) and without (gray lines) explicitly computing and reusing H Advances in Extreme Learning Machines 34/23

Defence Slides: Ensembles Advances in Extreme Learning Machines 35/23

Error reduction by ensembling
$$\hat{y}_i = y + \epsilon_i, \quad (1)$$
$$E\!\left[ \{ \hat{y}_i - y \}^2 \right] = E[\epsilon_i^2], \quad (2)$$
$$E_{\mathrm{avg}} = \frac{1}{m} \sum_{i=1}^{m} E[\epsilon_i^2], \quad (3)$$
$$E_{\mathrm{ens}} = E\!\left[ \left\{ \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y) \right\}^2 \right] = E\!\left[ \left\{ \frac{1}{m} \sum_{i=1}^{m} \epsilon_i \right\}^2 \right], \quad (4)$$
and, assuming the errors $\epsilon_i$ have zero mean and are uncorrelated,
$$E_{\mathrm{ens}} = \frac{1}{m} E_{\mathrm{avg}} = \frac{1}{m^2} \sum_{i=1}^{m} E[\epsilon_i^2]. \quad (5)$$
Advances in Extreme Learning Machines 36/23
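
A quick numerical check of Eqs. (1)-(5): for zero-mean, uncorrelated model errors, the error of the averaged ensemble is roughly E_avg / m. The simulation below is illustrative only (synthetic errors, arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 10, 100_000                            # m models, N evaluation points
y = np.zeros(N)                               # true target (w.l.o.g.)
eps = rng.normal(scale=0.5, size=(m, N))      # zero-mean, uncorrelated errors, Eq. (1)
y_hat = y + eps                               # per-model predictions

E_avg = np.mean(eps**2)                                # average single-model error, Eq. (3)
E_ens = np.mean((y_hat.mean(axis=0) - y) ** 2)         # error of the ensemble mean, Eq. (4)
print(E_avg, E_ens, E_avg / m)                         # E_ens is close to E_avg / m, Eq. (5)
```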

Running times parallelized and GPU-Accelerated ELM Ensemble (SantaFe, ESTSP)

SantaFe
N   t(mldiv dp)   t(gesv dp)   t(mldiv sp)   t(gesv sp)   t(culagesv sp)
0   674.0 s       672.3 s      515.8 s       418.4 s      401.0 s
1   1781.6 s      1782.4 s     1089.3 s      1088.8 s     702.9 s
2   917.5 s       911.5 s      567.5 s       554.7 s      365.3 s
3   636.1 s       639.0 s      392.2 s       389.3 s      258.7 s
4   495.7 s       495.7 s      337.3 s       304.0 s      207.8 s

ESTSP
N   t(mldiv dp)   t(gesv dp)   t(mldiv sp)   t(gesv sp)   t(culagesv sp)
0   2145.8 s      2127.6 s     1425.8 s      1414.3 s     1304.6 s
1   5636.9 s      5648.9 s     3488.6 s      3479.8 s     2299.8 s
2   2917.3 s      2929.6 s     1801.9 s      1806.4 s     1189.2 s
3   2069.4 s      2065.4 s     1255.9 s      1248.6 s     841.9 s
4   1590.7 s      1596.8 s     961.7 s       961.5 s      639.8 s

Advances in Extreme Learning Machines 37/23

Running times parallelized and GPU-Accelerated ELM Ensemble (SantaFe, ESTSP) [figure: ensemble running time (s) vs. number of workers, for SantaFe and ESTSP] Advances in Extreme Learning Machines 38/23

ELM Ensemble Accuracy (SantaFe, ESTSP) [figure: ensemble NMSE vs. number of models in the ensemble, for SantaFe and ESTSP] Advances in Extreme Learning Machines 39/23

Defence Slides: Binary / Ternary ELM Advances in Extreme Learning Machines 40/23

Better Weights random layer weights and biases drawn from e.g. a uniform / normal distribution with certain range / variance; typical transfer function $f(\langle w_i, x \rangle + b_i)$; from $\langle w_i, x \rangle = \|w_i\| \|x\| \cos\theta$, it can be seen that the typical activation of $f$ depends on: the expected length of $w_i$, the expected length of $x$, and the angles $\theta$ between the weights and the samples Advances in Extreme Learning Machines 41/23

Better Weights: Orthogonality? Idea 1: improve the diversity of the weights by taking weights that are mutually orthogonal (e.g. M d-dimensional basis vectors, randomly rotated in the d-dimensional space). However, this does not give significantly better accuracy; apparently, for the tested cases, the random weight scheme of ELM already covers the possible weight space pretty well Advances in Extreme Learning Machines 42/23

Better Weights: Sparsity! Idea 2: improve the diversity of the weights by having each of them work in a different subspace (e.g. each weight vector has a different subset of variables as input). Spoiler: this significantly improves accuracy, at no extra computational cost; experiments suggest this is due to the weight scheme enabling implicit variable selection Advances in Extreme Learning Machines 43/23

Binary Weight Scheme [table: example binary weight vectors with 1, 2 and 3 active variables] until enough neurons: add $w \in \{0, 1\}^d$ with 1 var ($\# = 2^1 \binom{d}{1}$); add $w \in \{0, 1\}^d$ with 2 vars ($\# = 2^2 \binom{d}{2}$); add $w \in \{0, 1\}^d$ with 3 vars ($\# = 2^3 \binom{d}{3}$); ... For each subspace, weights are added in random order to avoid bias toward particular variables Advances in Extreme Learning Machines 44/23

Ternary Weight Scheme [table: example ternary weight vectors with 1, 2 and 3 active variables, e.g. (+1, 0, 0, 0), (−1, 0, 0, 0), (+1, −1, 0, 0)] until enough neurons: add $w \in \{-1, 0, 1\}^d$ with 1 var ($\# = 3^1 \binom{d}{1}$); add $w \in \{-1, 0, 1\}^d$ with 2 vars ($\# = 3^2 \binom{d}{2}$); add $w \in \{-1, 0, 1\}^d$ with 3 vars ($\# = 3^3 \binom{d}{3}$); ... For each subspace, weights are added in random order to avoid bias toward particular variables Advances in Extreme Learning Machines 45/23
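
A sketch of how such a ternary weight matrix could be generated: for growing subspace size k, enumerate sign patterns on k-variable subsets in random order until enough neurons are collected. This is one reading of the scheme (non-zero entries only on the chosen subset), not the published code:

```python
import numpy as np
from itertools import combinations, product

def ternary_weights(d, num_neurons, seed=0):
    """Hidden-layer weight vectors from {-1, 0, +1}^d: first all weights using
    1 variable, then 2 variables, and so on, each level in random order."""
    rng = np.random.default_rng(seed)
    W = []
    for k in range(1, d + 1):
        level = []
        for subset in combinations(range(d), k):            # choose k variables
            for signs in product((-1.0, 1.0), repeat=k):     # all sign patterns on them
                w = np.zeros(d)
                w[list(subset)] = signs
                level.append(w)
        rng.shuffle(level)       # random order to avoid bias toward particular variables
        W.extend(level)
        if len(W) >= num_neurons:
            break
    return np.array(W[:num_neurons])     # shape (num_neurons, d)

# e.g. ternary_weights(8, 100) gives 100 ternary weight vectors for d = 8 inputs
```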

Experimental Settings

Data                Abbreviation   number of variables   # training   # test
Abalone             Ab             8                     2000         2177
CaliforniaHousing   Ca             8                     8000         12640
CensusHouse8L       Ce             8                     10000        12784
DeltaElevators      De             6                     4000         5517
ComputerActivity    Co             12                    4000         4192

BIP(CV)-TR-ELM vs BIP(CV)-TR-2-ELM vs BIP(CV)-TR-3-ELM. Experiment 1: relative performance; Experiment 2: robustness against irrelevant vars; Experiment 3: implicit variable selection (all results are averaged over 100 repetitions, each with randomly drawn training/test set) Advances in Extreme Learning Machines 46/23

Exp 1: numhidden vs. RMSE (Abalone) [figure: test RMSE vs. number of hidden neurons for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] averages over 100 runs; gaussian < binary; ternary < gaussian; better RMSE with far fewer neurons Advances in Extreme Learning Machines 47/23

Exp 1: numhidden vs. RMSE (CpuActivity) [figure: test RMSE vs. number of hidden neurons for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] averages over 100 runs; binary < gaussian; ternary < gaussian; better RMSE with far fewer neurons Advances in Extreme Learning Machines 48/23

Exp 2: Robustness against irrelevant variables (Abalone) [figure: test RMSE vs. number of added noise variables for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] 1000 neurons; the binary weight scheme gives similar RMSE; the ternary weight scheme makes ELM more robust against irrelevant vars Advances in Extreme Learning Machines 49/23

Exp 2: Robustness against irrelevant variables (CpuActivity) [figure: test RMSE vs. number of added noise variables for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] 1000 neurons; the binary and ternary weight schemes make ELM more robust against irrelevant vars Advances in Extreme Learning Machines 50/23

Exp 2: Robustness against irrelevant variables

                                Ab                               Co
                                gaussian   binary    ternary     gaussian   binary    ternary
RMSE with original variables    0.6497     0.6544    0.6438      0.1746     0.1785    0.1639
RMSE with 30 added irr. vars    0.6982     0.6932    0.6788      0.3221     0.2106    0.1904
RMSE loss                       0.0486     0.0388    0.0339      0.1475     0.0321    0.0265

Table: Average RMSE loss of ELMs with 1000 hidden neurons, trained on the original data, and the data with 30 added irrelevant variables Advances in Extreme Learning Machines 51/23

Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [figure: gaussian variable relevance for variables D1-D5, R1-R5, C1-C12] Advances in Extreme Learning Machines 52/23

Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [figure: binary variable relevance for variables D1-D5, R1-R5, C1-C12] Advances in Extreme Learning Machines 53/23

Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [figure: ternary variable relevance for variables D1-D5, R1-R5, C1-C12] Advances in Extreme Learning Machines 54/23
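
One way to compute such a relevance score from a trained ELM; interpreting the sum as accumulating, per input variable, the magnitude of each neuron's input weight scaled by the magnitude of its output weight is an assumption made here, not spelled out on the slide:

```python
import numpy as np

def variable_relevance(W, beta):
    """Per-variable relevance accumulated over the M hidden neurons.

    W    : (d, M) hidden-layer input weights (column i is w_i)
    beta : (M,)   output weights
    """
    return np.abs(W) @ np.abs(beta)     # shape (d,): sum_i |beta_i| * |w_ji| for variable j
```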

Binary/Ternary ELM Conclusions We propose a simple change to the weight scheme and introduce robust ELM variants: BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM. Our experiments suggest that 1. the ternary weight scheme is generally better than gaussian weights 2. the ternary weight scheme is robust against irrelevant variables 3. the binary/ternary weight schemes allow ELM to perform implicit variable selection. The added robustness and increased accuracy come for free! Advances in Extreme Learning Machines 55/23

Defence Slides: Trade-offs / Compressive ELM Advances in Extreme Learning Machines 56/23

Time-accuracy Trade-offs for Several ELMs ELM; OP-ELM: Optimally Pruned ELM with neurons ranked by relevance, and then pruned to optimize the leave-one-out error; TR-ELM: Tikhonov-regularized ELM, with efficient optimization of regularization parameter λ, using the SVD approach to computing H; TROP-ELM: Tikhonov-regularized OP-ELM; BIP(0.2), BIP(rand), BIP(CV): ELMs pretrained using the Batch Intrinsic Plasticity mechanism, adapting the hidden layer weights and biases such that they retain as much information as possible; the BIP parameter is either fixed, randomized, or cross-validated over 20 possible values Advances in Extreme Learning Machines 57/23

ELM Time-accuracy Trade-offs (Abalone UCI) [figure: training time vs. #hidden neurons, and test MSE vs. #hidden neurons, for OP-3-ELM, TROP-3-ELM, TR-3-ELM, BIP(CV)-TR-3-ELM, BIP(0.2)-TR-3-ELM, BIP(rand)-TR-3-ELM] Advances in Extreme Learning Machines 58/23

ELM Time-accuracy Trade-offs (Abalone UCI) [figure: test MSE vs. training time, and test MSE vs. testing time, for OP-3-ELM, TROP-3-ELM, TR-3-ELM, BIP(CV)-TR-3-ELM, BIP(0.2)-TR-3-ELM, BIP(rand)-TR-3-ELM] Advances in Extreme Learning Machines 59/23

ELM Time-accuracy Trade-offs (Abalone UCI) Depending on the user's criteria, these results suggest: if training time is most important: BIP(rand)-TR-3-ELM (almost optimal performance, while keeping training time low); if test error is most important: BIP(CV)-TR-3-ELM (slightly better accuracy, but training time is 20 times as high); if testing time is most important: BIP(rand)-TR-3-ELM (surprisingly) (OP-ELM and TROP-ELM tend to be faster in test, but suffer from slight overfitting). Since TR-3-ELM offers attractive trade-offs between speed and accuracy, this model will be central in the rest of the paper. Advances in Extreme Learning Machines 60/23

Two approaches for improving models Time-accuracy trade-offs suggest two possible strategies to obtain models that are preferable over other models: reducing test error, using a better algorithm (in terms of the training time-accuracy plot: pushing the curve down); reducing computational time, while retaining as much accuracy as possible (in terms of the training time-accuracy plot: pushing the curve to the left). Compressive ELM focuses on reducing computational time by performing the training in a reduced space, and then projecting the solution back to the original space. Advances in Extreme Learning Machines 61/23

Compressive ELM Given an $m \times n$ matrix $A$, compute a $k$-term approximate SVD $A \approx UDV^T$ [Halko2009]: form the $n \times (k + p)$ random matrix $\Omega$ (where $p$ is small); form the $m \times (k + p)$ sampling matrix $Y = A\Omega$ (sketch $A$ by applying $\Omega$); form the $m \times (k + p)$ orthonormal matrix $Q$ (such that $\mathrm{range}(Q) = \mathrm{range}(Y)$); compute $B = Q^T A$; form the SVD of $B$ so that $B = \hat{U} D V^T$; compute the matrix $U = Q\hat{U}$ Advances in Extreme Learning Machines 62/23
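
These steps translate almost line-for-line into NumPy; the sketch below follows the [Halko2009] scheme with a Gaussian test matrix (in Compressive ELM, A would be the hidden-layer matrix; this is an illustration, not the thesis code):

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Approximate k-term SVD A ~= U D V^T via a random sketch (Halko et al.)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + p))        # n x (k+p) random test matrix
    Y = A @ Omega                              # m x (k+p) sampling matrix (sketch of A)
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis, range(Q) ~ range(Y)
    B = Q.T @ A                                # (k+p) x n reduced matrix
    U_hat, D, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat[:, :k], D[:k], Vt[:k, :]  # lift U back to R^m, keep k terms
```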

Faster Sketching? The bottleneck in the algorithm is the time it takes to sketch the matrix. Rather than using Gaussian random matrices for sketching $A$, use random matrices that are sparse or structured in some way and allow for faster multiplication, of the form $P H D$ (a $k \times n$ matrix $P$, an $n \times n$ matrix $H$, and an $n \times n$ matrix $D$): the Fast Johnson-Lindenstrauss Transform (FJLT), introduced in [Ailon2006], for which $P$ is a sparse matrix of random Gaussian variables, and $H$ encodes the Discrete Hadamard Transform; the Subsampled Randomized Hadamard Transform (SRHT), for which $P$ is a matrix selecting $k$ random columns from $H$, and $H$ encodes the Discrete Hadamard Transform. (Experiments did not show a substantial difference in terms of computational time. Dataset too small?) Advances in Extreme Learning Machines 63/23
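
As an illustration of the structured-sketch idea, here is a dense (and therefore not actually "fast") NumPy sketch of the SRHT variant; a real implementation would apply an in-place fast Walsh-Hadamard transform instead of building H explicitly, and n must be a power of two for scipy.linalg.hadamard:

```python
import numpy as np
from scipy.linalg import hadamard

def srht_sketch(A, k, seed=0):
    """Sketch the n columns of A down to k columns with Phi = sqrt(n/k) * P H D,
    applied as A @ Phi.T (P: row selection, H: Hadamard, D: random signs)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape                                # n must be a power of two here
    H = hadamard(n) / np.sqrt(n)                  # normalized Walsh-Hadamard matrix
    signs = rng.choice([-1.0, 1.0], size=n)       # D: random sign flip per column of A
    rows = rng.choice(n, size=k, replace=False)   # P: k random rows (= columns, H symmetric)
    return np.sqrt(n / k) * (A * signs) @ H[rows].T   # m x k sketch
```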

Compressive ELM (CalHousing, FJLT) [figure: training time vs. #hidden neurons, and test MSE vs. #hidden neurons, for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 64/23

Compressive ELM (CalHousing, FJLT) [figure: test MSE vs. training time, and test MSE vs. testing time, for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 65/23

Compressive ELM Conclusions Contributions: Compressive ELM provides a flexible way to reduce training time by doing the optimization in a reduced space of k dimensions; given k large enough, Compressive ELM achieves the best test error for each computational time (i.e. there are no models that achieve better test error and can be trained in the same or less time). Future work: let theory/bounds on low-distortion embeddings inform the choice of k Advances in Extreme Learning Machines 66/23