Advances in Extreme Learning Machines


Advances in Extreme Learning Machines Mark van Heeswijk April 17, 2015

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 2/23

Trends in Machine Learning: increasing dimensionality of data sets; increasing size of data sets. 300 hours of video are uploaded to YouTube every minute; 1.8 billion pictures are uploaded every day to various sites (mid-2014) [http://tech.firstpost.com/news-analysis/now-upload-share-1-8-billion-photoseveryday-meeker-report-224688.html]. Machine learning methods are needed that can analyze this data and extract useful knowledge and insights from this wealth of information. Advances in Extreme Learning Machines 4/23

Contributions Combinations of Multiple Extreme Learning Machines (ELM) I: Adaptive Ensemble Models of ELMs for Time Series Prediction II: GPU-accelerated and parallelized ELM ensembles for large-scale regression Variable Selection and ELM III: Feature selection for nonlinear models with extreme learning machines IV: Fast Feature Selection in a GPU Cluster Using the Delta Test V: Binary/Ternary Extreme Learning Machines Trade-offs in Extreme Learning Machines VI: Compressive ELM: Improved Models Through Exploiting Time-Accuracy Trade-offs Advances in Extreme Learning Machines 5/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 6/23

Extreme Learning Machines [network diagram: input layer (inputs x_j1 ... x_j4), hidden layer reached through random weights, output layer (output y_j) obtained as a least-squares solution] Advances in Extreme Learning Machines 7/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 8/23

Adaptive Ensemble of ELMs (Publ I) random inputs; random #neurons; trained on sliding/growing window; adaptive linear combination based on performance of each ELM Advances in Extreme Learning Machines 9/23

Parallelized/GPU-Accelerated Ensemble of ELMs (Publ II) random inputs; optimized #neurons (loo-cv on GPU); training performed on GPU; training of ELMs parallelized over multiple CPU/GPU; fixed linear combination based on loo-cv Advances in Extreme Learning Machines 10/23

Highlighted ELM Ensemble Results [figure: ensemble weights w_i over time for two tasks] Adaptation of ensemble weights during the task Advances in Extreme Learning Machines 11/23

Highlighted ELM Ensemble Results [figure: ensemble NMSE vs. number of models in the ensemble] The more models in the ensemble, the more accurate the results Advances in Extreme Learning Machines 11/23

Highlighted ELM Ensemble Results [figure: ensemble running time (s) vs. number of workers] Scalability through parallelization, limited precision, use of GPUs Advances in Extreme Learning Machines 11/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 12/23

ELM-based Feature Selection (Publ III) [network diagram: inputs x_j1 ... x_j4 pass through a scaling layer s_1 ... s_4 before the hidden layer, output y_j] random #neurons for each restart of the algorithm; starting from a random scaling, the scaling is gradually reduced until empty, based on training and loo error Advances in Extreme Learning Machines 13/23

Highlighted ELM-FS Results Advances in Extreme Learning Machines 14/23

Fast Feature Selection with GPU-accelerated Delta Test (Publ IV) [network diagram: each input x_j1 ... x_j4 gated by a 0/1 selection, output y_j] variable selection optimized using an evolutionary algorithm with the GPU-accelerated Delta Test as fitness criterion Advances in Extreme Learning Machines 15/23

Fast Feature Selection with GPU-accelerated Delta Test (Publ IV) $\delta = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - y_{NN(i)} \right)^2$ Advances in Extreme Learning Machines 16/23
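
For concreteness, a brute-force NumPy sketch of the Delta test as written above (not the GPU implementation from the publication; names and the O(N^2) distance computation are illustrative only):

```python
import numpy as np

def delta_test(X, y):
    """Delta test: half the mean squared difference between each output y_i
    and the output y_NN(i) of the nearest neighbour of x_i in input space."""
    # pairwise squared distances between all input samples (brute force, O(N^2) memory)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)        # a point is not its own nearest neighbour
    nn = np.argmin(d2, axis=1)          # index NN(i) of the nearest neighbour of x_i
    return 0.5 * np.mean((y - y[nn]) ** 2)

# Variable selection then searches for the subset of inputs (columns of X)
# minimizing delta_test(X[:, mask], y), e.g. with an evolutionary algorithm.
```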

Binary/Ternary Extreme Learning Machines (Publ V) [network diagram: inputs x_j1 ... x_j4 connected to the hidden layer through binary/ternary weights, output y_j] scaling of weights and biases optimized using batch intrinsic plasticity pretraining; least-squares solution with fast L2 regularization using the SVD Advances in Extreme Learning Machines 17/23

Ternary ELM Results [figure: example ternary weight vectors with 1, 2 and 3 active variables, and test RMSE vs. number of hidden neurons for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM and BIP(rand)-TR-3-ELM] Advances in Extreme Learning Machines 18/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 19/23

Compressive ELM (Publ VI) [network diagram: inputs x_j1 ... x_j4 connected to the hidden layer through binary/ternary weights, output y_j] fast low-distortion embedding into a lower-dimensional space; scaling of weights and biases optimized using batch intrinsic plasticity pretraining; least-squares solution with fast L2 regularization using the SVD Advances in Extreme Learning Machines 20/23

Highlighted Results Compressive ELM [figure: training time vs. #hidden neurons, and test MSE vs. training time, for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 21/23

Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 22/23

Summary Overall: the resulting collection of proposed methods provides an efficient, accurate and flexible framework for solving large-scale supervised learning problems. The proposed methods are not limited to the particular types of random-weight neural networks and contexts in which they have been tested, and can easily be incorporated in new contexts and models Advances in Extreme Learning Machines 23/23

Defence Slides: Related Models Advances in Extreme Learning Machines 24/23

ELM vs RVFL [network diagram: input layer (inputs x_j1 ... x_j4), hidden layer, output layer (output y_j)] Advances in Extreme Learning Machines 25/23

Defence Slides: General Advances in Extreme Learning Machines 26/23

Standard ELM Given a training set $(x_i, y_i)$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, an activation function $f : \mathbb{R} \to \mathbb{R}$ and $M$ the number of hidden nodes:
1: Randomly assign input weights $w_i$ and biases $b_i$, $i \in [1, M]$;
2: Calculate the hidden layer output matrix $H$;
3: Calculate the output weights matrix $\beta = H^{\dagger} Y$,
where
$$H = \begin{pmatrix} f(w_1 \cdot x_1 + b_1) & \cdots & f(w_M \cdot x_1 + b_M) \\ \vdots & \ddots & \vdots \\ f(w_1 \cdot x_N + b_1) & \cdots & f(w_M \cdot x_N + b_M) \end{pmatrix}$$
Advances in Extreme Learning Machines 27/23
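
To make the three steps concrete, here is a minimal NumPy sketch of a standard ELM with a sigmoid activation (not the thesis code; the toy data and all names are illustrative):

```python
import numpy as np

def elm_train(X, y, M, seed=0):
    """Standard ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(size=(d, M))                  # step 1: random input weights w_i
    b = rng.normal(size=M)                       # step 1: random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))       # step 2: hidden layer output matrix H
    beta = np.linalg.pinv(H) @ y                 # step 3: beta = H^dagger Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# toy usage
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
W, b, beta = elm_train(X, y, M=50)
print(np.mean((elm_predict(X, W, b, beta) - y) ** 2))    # training MSE
```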

ELM Theory vs Practice In theory, ELM is a universal approximator. In practice, there is a limited number of samples and a risk of overfitting. Therefore: the functional approximation should use as few neurons as possible; the hidden layer should extract and retain as much useful information as possible from the input samples Advances in Extreme Learning Machines 28/23

ELM Theory vs Practice Weight considerations: the weight range determines the typical activation of the transfer function (remember $\langle w_i, x \rangle = \|w_i\| \|x\| \cos\theta$), therefore normalize or tune the length of the weight vectors somehow. Linear vs non-linear: since sigmoid neurons operate in a nonlinear regime, add $d$ linear neurons for the ELM to work better on (almost) linear problems. Avoiding overfitting: use efficient L2 regularization Advances in Extreme Learning Machines 29/23
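
As an illustration of the "add d linear neurons" point, a hidden-layer matrix could be built like this (a sketch, not the thesis code; the function name is made up):

```python
import numpy as np

def build_hidden_layer(X, W, b):
    """Hidden-layer output matrix with the d inputs appended as linear neurons,
    so the ELM also handles (almost) linear problems well."""
    H_sigmoid = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # M sigmoid neurons
    return np.hstack([X, H_sigmoid])                  # shape (N, d + M)
```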

Batch Intrinsic Plasticity Suppose $(x_1, \ldots, x_N) \in \mathbb{R}^{N \times d}$, and the output of neuron $i$ is $h_i = f(a_i w_i \cdot x_k + b_i)$, where $f$ is an invertible transfer function. For each neuron $i$: from an exponential distribution with mean $\mu_{\mathrm{exp}}$, draw targets $t = (t_1, t_2, \ldots, t_N)$ and sort such that $t_1 < t_2 < \ldots < t_N$; compute all presynaptic inputs $s_k = w_i \cdot x_k$, and sort such that $s_1 < s_2 < \ldots < s_N$; now, find $a_i$ and $b_i$ such that
$$\begin{pmatrix} s_1 & 1 \\ \vdots & \vdots \\ s_N & 1 \end{pmatrix} \begin{pmatrix} a_i \\ b_i \end{pmatrix} = \begin{pmatrix} f^{-1}(t_1) \\ \vdots \\ f^{-1}(t_N) \end{pmatrix}$$
Advances in Extreme Learning Machines 30/23
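
A hedged NumPy sketch of this per-neuron least-squares fit, assuming a logistic sigmoid transfer function; clipping the exponential targets into (0, 1) is an implementation detail added here, not taken from the slides:

```python
import numpy as np

def bip_pretrain(X, W, mu_exp=0.2, seed=0):
    """Batch intrinsic plasticity: per neuron i, fit slope a_i and bias b_i so that
    the sorted presynaptic inputs s_k = w_i . x_k map onto sorted targets drawn
    from an exponential distribution with mean mu_exp."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    M = W.shape[1]
    f_inv = lambda t: np.log(t / (1.0 - t))      # inverse of the logistic sigmoid
    a, b = np.empty(M), np.empty(M)
    for i in range(M):
        t = np.sort(rng.exponential(scale=mu_exp, size=N))
        t = np.clip(t, 1e-3, 1.0 - 1e-3)         # keep targets inside f's range (0, 1)
        s = np.sort(X @ W[:, i])                 # sorted presynaptic inputs
        Phi = np.column_stack([s, np.ones(N)])   # the [s_k  1] design matrix
        (a[i], b[i]), *_ = np.linalg.lstsq(Phi, f_inv(t), rcond=None)
    return a, b                                   # per-neuron slope and bias
```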

Fast leave-one-out cross-validation The leave-one-out (LOO) error can be computed using the PRESS statistic:
$$E_{\mathrm{loo}} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - \mathrm{hat}_{ii}} \right)^2$$
where $\mathrm{hat}_{ii}$ is the $i$th value on the diagonal of the HAT matrix, which can be quickly computed, given $H^{\dagger}$:
$$\hat{Y} = H\beta = H H^{\dagger} Y = \mathrm{HAT} \cdot Y$$
Advances in Extreme Learning Machines 31/23

Fast leave-one-out cross-validation Using the SVD decomposition of $H = UDV^T$, it is possible to obtain all needed information for computing the PRESS statistic without recomputing the pseudo-inverse for every $\lambda$:
$$\hat{Y} = H\beta = H(H^T H + \lambda I)^{-1} H^T Y = HV(D^2 + \lambda I)^{-1} D U^T Y = UDV^T V(D^2 + \lambda I)^{-1} D U^T Y = UD(D^2 + \lambda I)^{-1} D U^T Y = \mathrm{HAT} \cdot Y$$
Advances in Extreme Learning Machines 32/23

Fast leave-one-out cross-validation where $D(D^2 + \lambda I)^{-1} D$ is a diagonal matrix with $\frac{d_{ii}^2}{d_{ii}^2 + \lambda}$ as its $i$th diagonal entry. Now:
$$\mathrm{MSE}_{\mathrm{TR\text{-}PRESS}} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - \mathrm{hat}_{ii}} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - h_i (H^T H + \lambda I)^{-1} h_i^T} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - u_i \,\mathrm{diag}\!\left(\frac{d_{jj}^2}{d_{jj}^2 + \lambda}\right) u_i^T} \right)^2$$
Advances in Extreme Learning Machines 33/23
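
The identities above translate directly into a few lines of NumPy; the sketch below evaluates the TR-PRESS error for a grid of λ values from a single SVD of H (an illustration, not the thesis implementation):

```python
import numpy as np

def tr_press(H, y, lambdas):
    """Leave-one-out (PRESS) MSE of the Tikhonov-regularized least-squares fit,
    for several lambda values, reusing one SVD of H."""
    U, d, Vt = np.linalg.svd(H, full_matrices=False)
    Uy = U.T @ y
    errors = []
    for lam in lambdas:
        shrink = d**2 / (d**2 + lam)                       # diagonal of D(D^2 + lam I)^-1 D
        y_hat = U @ (shrink * Uy)                          # HAT @ y
        hat_diag = np.einsum('ij,j,ij->i', U, shrink, U)   # hat_ii = u_i diag(.) u_i^T
        errors.append(np.mean(((y - y_hat) / (1.0 - hat_diag)) ** 2))
    return np.array(errors)

# e.g. pick the regularization parameter with the lowest estimated LOO error:
# lambdas = np.logspace(-6, 3, 20); best = lambdas[np.argmin(tr_press(H, y, lambdas))]
```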

Leave-one-out computation overhead [figure: time (s) vs. number of hidden neurons] Figure: Comparison of running times of ELM training (solid lines) and ELM training + leave-one-out computation (dotted lines), with (black lines) and without (gray lines) explicitly computing and reusing H Advances in Extreme Learning Machines 34/23

Defence Slides: Ensembles Advances in Extreme Learning Machines 35/23

Error reduction by ensembling
$$\hat{y}_i = y + \epsilon_i, \quad (1)$$
$$E\!\left[ \{ \hat{y}_i - y \}^2 \right] = E[\epsilon_i^2], \quad (2)$$
$$E_{\mathrm{avg}} = \frac{1}{m} \sum_{i=1}^{m} E[\epsilon_i^2], \quad (3)$$
$$E_{\mathrm{ens}} = E\!\left[ \left\{ \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y) \right\}^2 \right] = E\!\left[ \left\{ \frac{1}{m} \sum_{i=1}^{m} \epsilon_i \right\}^2 \right], \quad (4)$$
and, assuming the errors $\epsilon_i$ have zero mean and are uncorrelated,
$$E_{\mathrm{ens}} = \frac{1}{m} E_{\mathrm{avg}} = \frac{1}{m^2} \sum_{i=1}^{m} E[\epsilon_i^2]. \quad (5)$$
Advances in Extreme Learning Machines 36/23
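
A quick numerical check of Eqs. (1)-(5): for zero-mean, uncorrelated model errors, the error of the averaged ensemble is roughly E_avg / m. The simulation below is illustrative only (synthetic errors, arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 10, 100_000                            # m models, N evaluation points
y = np.zeros(N)                               # true target (w.l.o.g.)
eps = rng.normal(scale=0.5, size=(m, N))      # zero-mean, uncorrelated errors, Eq. (1)
y_hat = y + eps                               # per-model predictions

E_avg = np.mean(eps**2)                                # average single-model error, Eq. (3)
E_ens = np.mean((y_hat.mean(axis=0) - y) ** 2)         # error of the ensemble mean, Eq. (4)
print(E_avg, E_ens, E_avg / m)                         # E_ens is close to E_avg / m, Eq. (5)
```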

Running times parallelized and GPU-Accelerated ELM Ensemble (SantaFe, ESTSP)

SantaFe
N   t(mldiv dp)   t(gesv dp)   t(mldiv sp)   t(gesv sp)   t(culagesv sp)
0   674.0 s       672.3 s      515.8 s       418.4 s      401.0 s
1   1781.6 s      1782.4 s     1089.3 s      1088.8 s     702.9 s
2   917.5 s       911.5 s      567.5 s       554.7 s      365.3 s
3   636.1 s       639.0 s      392.2 s       389.3 s      258.7 s
4   495.7 s       495.7 s      337.3 s       304.0 s      207.8 s

ESTSP
N   t(mldiv dp)   t(gesv dp)   t(mldiv sp)   t(gesv sp)   t(culagesv sp)
0   2145.8 s      2127.6 s     1425.8 s      1414.3 s     1304.6 s
1   5636.9 s      5648.9 s     3488.6 s      3479.8 s     2299.8 s
2   2917.3 s      2929.6 s     1801.9 s      1806.4 s     1189.2 s
3   2069.4 s      2065.4 s     1255.9 s      1248.6 s     841.9 s
4   1590.7 s      1596.8 s     961.7 s       961.5 s      639.8 s

Advances in Extreme Learning Machines 37/23

Running times parallelized and GPU-Accelerated ELM Ensemble (SantaFe, ESTSP) [figure: ensemble running time (s) vs. number of workers, for SantaFe and ESTSP] Advances in Extreme Learning Machines 38/23

ELM Ensemble Accuracy (SantaFe, ESTSP) [figure: ensemble NMSE vs. number of models in the ensemble, for SantaFe and ESTSP] Advances in Extreme Learning Machines 39/23

Defence Slides: Binary / Ternary ELM Advances in Extreme Learning Machines 40/23

Better Weights random layer weights and biases drawn from e.g. a uniform / normal distribution with certain range / variance; typical transfer function $f(\langle w_i, x \rangle + b_i)$; from $\langle w_i, x \rangle = \|w_i\| \|x\| \cos\theta$, it can be seen that the typical activation of $f$ depends on: the expected length of $w_i$, the expected length of $x$, and the angles $\theta$ between the weights and the samples Advances in Extreme Learning Machines 41/23

Better Weights: Orthogonality? Idea 1: improve the diversity of the weights by taking weights that are mutually orthogonal (e.g. M d-dimensional basis vectors, randomly rotated in the d-dimensional space). However, this does not give significantly better accuracy; apparently, for the tested cases, the random weight scheme of ELM already covers the possible weight space pretty well Advances in Extreme Learning Machines 42/23

Better Weights: Sparsity! Idea 2: improve the diversity of the weights by having each of them work in a different subspace (e.g. each weight vector has a different subset of variables as input). Spoiler: this significantly improves accuracy, at no extra computational cost; experiments suggest this is due to the weight scheme enabling implicit variable selection Advances in Extreme Learning Machines 43/23

Binary Weight Scheme [table: example binary weight vectors with 1, 2 and 3 active variables] until enough neurons: add $w \in \{0, 1\}^d$ with 1 var ($\# = 2^1 \binom{d}{1}$); add $w \in \{0, 1\}^d$ with 2 vars ($\# = 2^2 \binom{d}{2}$); add $w \in \{0, 1\}^d$ with 3 vars ($\# = 2^3 \binom{d}{3}$); ... For each subspace, weights are added in random order to avoid bias toward particular variables Advances in Extreme Learning Machines 44/23

Ternary Weight Scheme [table: example ternary weight vectors with 1, 2 and 3 active variables, e.g. (+1, 0, 0, 0), (−1, 0, 0, 0), (+1, −1, 0, 0)] until enough neurons: add $w \in \{-1, 0, 1\}^d$ with 1 var ($\# = 3^1 \binom{d}{1}$); add $w \in \{-1, 0, 1\}^d$ with 2 vars ($\# = 3^2 \binom{d}{2}$); add $w \in \{-1, 0, 1\}^d$ with 3 vars ($\# = 3^3 \binom{d}{3}$); ... For each subspace, weights are added in random order to avoid bias toward particular variables Advances in Extreme Learning Machines 45/23
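
A sketch of how such a ternary weight matrix could be generated: for growing subspace size k, enumerate sign patterns on k-variable subsets in random order until enough neurons are collected. This is one reading of the scheme (non-zero entries only on the chosen subset), not the published code:

```python
import numpy as np
from itertools import combinations, product

def ternary_weights(d, num_neurons, seed=0):
    """Hidden-layer weight vectors from {-1, 0, +1}^d: first all weights using
    1 variable, then 2 variables, and so on, each level in random order."""
    rng = np.random.default_rng(seed)
    W = []
    for k in range(1, d + 1):
        level = []
        for subset in combinations(range(d), k):            # choose k variables
            for signs in product((-1.0, 1.0), repeat=k):     # all sign patterns on them
                w = np.zeros(d)
                w[list(subset)] = signs
                level.append(w)
        rng.shuffle(level)       # random order to avoid bias toward particular variables
        W.extend(level)
        if len(W) >= num_neurons:
            break
    return np.array(W[:num_neurons])     # shape (num_neurons, d)

# e.g. ternary_weights(8, 100) gives 100 ternary weight vectors for d = 8 inputs
```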

Experimental Settings

Data                Abbreviation   number of variables   # training   # test
Abalone             Ab             8                     2000         2177
CaliforniaHousing   Ca             8                     8000         12640
CensusHouse8L       Ce             8                     10000        12784
DeltaElevators      De             6                     4000         5517
ComputerActivity    Co             12                    4000         4192

BIP(CV)-TR-ELM vs BIP(CV)-TR-2-ELM vs BIP(CV)-TR-3-ELM. Experiment 1: relative performance; Experiment 2: robustness against irrelevant vars; Experiment 3: implicit variable selection (all results are averaged over 100 repetitions, each with randomly drawn training/test set) Advances in Extreme Learning Machines 46/23

Exp 1: numhidden vs. RMSE (Abalone) [figure: test RMSE vs. number of hidden neurons for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] averages over 100 runs; gaussian < binary; ternary < gaussian; better RMSE with far fewer neurons Advances in Extreme Learning Machines 47/23

Exp 1: numhidden vs. RMSE (CpuActivity) [figure: test RMSE vs. number of hidden neurons for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] averages over 100 runs; binary < gaussian; ternary < gaussian; better RMSE with far fewer neurons Advances in Extreme Learning Machines 48/23

Exp 2: Robustness against irrelevant variables (Abalone) [figure: test RMSE vs. number of added noise variables for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] 1000 neurons; the binary weight scheme gives similar RMSE; the ternary weight scheme makes ELM more robust against irrelevant vars Advances in Extreme Learning Machines 49/23

Exp 2: Robustness against irrelevant variables (CpuActivity) [figure: test RMSE vs. number of added noise variables for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] 1000 neurons; the binary and ternary weight schemes make ELM more robust against irrelevant vars Advances in Extreme Learning Machines 50/23

Exp 2: Robustness against irrelevant variables

                                Ab                               Co
                                gaussian   binary    ternary     gaussian   binary    ternary
RMSE with original variables    0.6497     0.6544    0.6438      0.1746     0.1785    0.1639
RMSE with 30 added irr. vars    0.6982     0.6932    0.6788      0.3221     0.2106    0.1904
RMSE loss                       0.0486     0.0388    0.0339      0.1475     0.0321    0.0265

Table: Average RMSE loss of ELMs with 1000 hidden neurons, trained on the original data, and the data with 30 added irrelevant variables Advances in Extreme Learning Machines 51/23

Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [figure: gaussian variable relevance for variables D1-D5, R1-R5, C1-C12] Advances in Extreme Learning Machines 52/23

Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [figure: binary variable relevance for variables D1-D5, R1-R5, C1-C12] Advances in Extreme Learning Machines 53/23

Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [figure: ternary variable relevance for variables D1-D5, R1-R5, C1-C12] Advances in Extreme Learning Machines 54/23
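
One way to compute such a relevance score from a trained ELM; interpreting the sum as accumulating, per input variable, the magnitude of each neuron's input weight scaled by the magnitude of its output weight is an assumption made here, not spelled out on the slide:

```python
import numpy as np

def variable_relevance(W, beta):
    """Per-variable relevance accumulated over the M hidden neurons.

    W    : (d, M) hidden-layer input weights (column i is w_i)
    beta : (M,)   output weights
    """
    return np.abs(W) @ np.abs(beta)     # shape (d,): sum_i |beta_i| * |w_ji| for variable j
```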

Binary/Ternary ELM Conclusions We propose a simple change to the weight scheme and introduce robust ELM variants: BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM. Our experiments suggest that 1. the ternary weight scheme is generally better than gaussian weights 2. the ternary weight scheme is robust against irrelevant variables 3. the binary/ternary weight schemes allow ELM to perform implicit variable selection. The added robustness and increased accuracy come for free! Advances in Extreme Learning Machines 55/23

Defence Slides: Trade-offs / Compressive ELM Advances in Extreme Learning Machines 56/23

Time-accuracy Trade-offs for Several ELMs ELM; OP-ELM: Optimally Pruned ELM with neurons ranked by relevance, and then pruned to optimize the leave-one-out error; TR-ELM: Tikhonov-regularized ELM, with efficient optimization of regularization parameter λ, using the SVD approach to computing H; TROP-ELM: Tikhonov-regularized OP-ELM; BIP(0.2), BIP(rand), BIP(CV): ELMs pretrained using the Batch Intrinsic Plasticity mechanism, adapting the hidden layer weights and biases such that they retain as much information as possible; the BIP parameter is either fixed, randomized, or cross-validated over 20 possible values Advances in Extreme Learning Machines 57/23

ELM Time-accuracy Trade-offs (Abalone UCI) [figure: training time vs. #hidden neurons, and test MSE vs. #hidden neurons, for OP-3-ELM, TROP-3-ELM, TR-3-ELM, BIP(CV)-TR-3-ELM, BIP(0.2)-TR-3-ELM, BIP(rand)-TR-3-ELM] Advances in Extreme Learning Machines 58/23

ELM Time-accuracy Trade-offs (Abalone UCI) [figure: test MSE vs. training time, and test MSE vs. testing time, for OP-3-ELM, TROP-3-ELM, TR-3-ELM, BIP(CV)-TR-3-ELM, BIP(0.2)-TR-3-ELM, BIP(rand)-TR-3-ELM] Advances in Extreme Learning Machines 59/23

ELM Time-accuracy Trade-offs (Abalone UCI) Depending on the user's criteria, these results suggest: if training time is most important: BIP(rand)-TR-3-ELM (almost optimal performance, while keeping training time low); if test error is most important: BIP(CV)-TR-3-ELM (slightly better accuracy, but training time is 20 times as high); if testing time is most important: BIP(rand)-TR-3-ELM (surprisingly) (OP-ELM and TROP-ELM tend to be faster in test, but suffer from slight overfitting). Since TR-3-ELM offers attractive trade-offs between speed and accuracy, this model will be central in the rest of the paper. Advances in Extreme Learning Machines 60/23

Two approaches for improving models Time-accuracy trade-offs suggest two possible strategies to obtain models that are preferable over other models: reducing test error, using a better algorithm (in terms of the training time-accuracy plot: pushing the curve down); reducing computational time, while retaining as much accuracy as possible (in terms of the training time-accuracy plot: pushing the curve to the left). Compressive ELM focuses on reducing computational time by performing the training in a reduced space, and then projecting the solution back to the original space. Advances in Extreme Learning Machines 61/23

Compressive ELM Given an $m \times n$ matrix $A$, compute a $k$-term approximate SVD $A \approx UDV^T$ [Halko2009]: form the $n \times (k + p)$ random matrix $\Omega$ (where $p$ is small); form the $m \times (k + p)$ sampling matrix $Y = A\Omega$ (sketch $A$ by applying $\Omega$); form the $m \times (k + p)$ orthonormal matrix $Q$ (such that $\mathrm{range}(Q) = \mathrm{range}(Y)$); compute $B = Q^T A$; form the SVD of $B$ so that $B = \hat{U} D V^T$; compute the matrix $U = Q\hat{U}$ Advances in Extreme Learning Machines 62/23
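
These steps translate almost line-for-line into NumPy; the sketch below follows the [Halko2009] scheme with a Gaussian test matrix (in Compressive ELM, A would be the hidden-layer matrix; this is an illustration, not the thesis code):

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Approximate k-term SVD A ~= U D V^T via a random sketch (Halko et al.)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + p))        # n x (k+p) random test matrix
    Y = A @ Omega                              # m x (k+p) sampling matrix (sketch of A)
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis, range(Q) ~ range(Y)
    B = Q.T @ A                                # (k+p) x n reduced matrix
    U_hat, D, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat[:, :k], D[:k], Vt[:k, :]  # lift U back to R^m, keep k terms
```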

Faster Sketching? The bottleneck in the algorithm is the time it takes to sketch the matrix. Rather than using Gaussian random matrices for sketching $A$, use random matrices that are sparse or structured in some way and allow for faster multiplication, of the form $P H D$ (a $k \times n$ matrix $P$, an $n \times n$ matrix $H$, and an $n \times n$ matrix $D$): the Fast Johnson-Lindenstrauss Transform (FJLT), introduced in [Ailon2006], for which $P$ is a sparse matrix of random Gaussian variables, and $H$ encodes the Discrete Hadamard Transform; the Subsampled Randomized Hadamard Transform (SRHT), for which $P$ is a matrix selecting $k$ random columns from $H$, and $H$ encodes the Discrete Hadamard Transform. (Experiments did not show a substantial difference in terms of computational time. Dataset too small?) Advances in Extreme Learning Machines 63/23
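
As an illustration of the structured-sketch idea, here is a dense (and therefore not actually "fast") NumPy sketch of the SRHT variant; a real implementation would apply an in-place fast Walsh-Hadamard transform instead of building H explicitly, and n must be a power of two for scipy.linalg.hadamard:

```python
import numpy as np
from scipy.linalg import hadamard

def srht_sketch(A, k, seed=0):
    """Sketch the n columns of A down to k columns with Phi = sqrt(n/k) * P H D,
    applied as A @ Phi.T (P: row selection, H: Hadamard, D: random signs)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape                                # n must be a power of two here
    H = hadamard(n) / np.sqrt(n)                  # normalized Walsh-Hadamard matrix
    signs = rng.choice([-1.0, 1.0], size=n)       # D: random sign flip per column of A
    rows = rng.choice(n, size=k, replace=False)   # P: k random rows (= columns, H symmetric)
    return np.sqrt(n / k) * (A * signs) @ H[rows].T   # m x k sketch
```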

Compressive ELM (CalHousing, FJLT) [figure: training time vs. #hidden neurons, and test MSE vs. #hidden neurons, for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 64/23

Compressive ELM (CalHousing, FJLT) [figure: test MSE vs. training time, and test MSE vs. testing time, for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 65/23

Compressive ELM Conclusions Contributions: Compressive ELM provides a flexible way to reduce training time by doing the optimization in a reduced space of k dimensions; given k large enough, Compressive ELM achieves the best test error for each computational time (i.e. there are no models that achieve better test error and can be trained in the same or less time). Future work: let theory/bounds on low-distortion embeddings inform the choice of k Advances in Extreme Learning Machines 66/23