Advances in Extreme Learning Machines

1 Advances in Extreme Learning Machines Mark van Heeswijk April 17, 2015

2 Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 2/23

3 Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 3/23

4 Trends in Machine Learning increasing dimensionality of data sets increasing size of data sets 300 hours of video are uploaded to YouTube every minute 1.8 billion pictures uploaded every day to various sites (mid-2014) Advances in Extreme Learning Machines 4/23

6 Trends in Machine Learning increasing dimensionality of data sets increasing size of data sets 300 hours of video are uploaded to YouTube every minute 1.8 billion pictures uploaded every day to various sites (mid-2014) machine learning methods are needed that can be used to analyze this data and extract useful knowledge and insights from this wealth of information Advances in Extreme Learning Machines 4/23

7 Contributions Combinations of Multiple Extreme Learning Machines (ELM) I: Adaptive Ensemble Models of ELMs for Time Series Prediction II: GPU-accelerated and parallelized ELM ensembles for large-scale regression Variable Selection and ELM III: Feature selection for nonlinear models with extreme learning machines IV: Fast Feature Selection in a GPU Cluster Using the Delta Test V: Binary/Ternary Extreme Learning Machines Trade-offs in Extreme Learning Machines VI: Compressive ELM: Improved Models Through Exploiting Time-Accuracy Trade-offs Advances in Extreme Learning Machines 5/23

8 Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 6/23

9 Extreme Learning Machines Input layer Hidden layer Output layer input x j 1 input x j 2 input x j 3 output y j input x j 4 Advances in Extreme Learning Machines 7/23

10 Extreme Learning Machines Input layer Hidden layer Output layer input x j 1 input x j 2 input x j 3 random weights output y j input x j 4 Advances in Extreme Learning Machines 7/23

11 Extreme Learning Machines Input layer Hidden layer Output layer input x j 1 input x j 2 input x j 3 random weights output y j input x j 4 least-squares solution Advances in Extreme Learning Machines 7/23

12 Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 8/23

13 Adaptive Ensemble of ELMs (Publ I) random inputs random #neurons trained on sliding/growing window adaptive linear combination based on performance of each ELM Advances in Extreme Learning Machines 9/23
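
One way such a performance-based adaptive combination could look in code is sketched below; the exponential weighting and the learning rate eta are illustrative assumptions, not necessarily the exact update rule used in Publication I:

```python
import numpy as np

def update_ensemble_weights(weights, recent_errors, eta=0.1):
    """Shift the ensemble weights toward the ELMs with the lowest recent error."""
    scores = np.exp(-np.asarray(recent_errors))        # lower error -> higher score
    target = scores / scores.sum()
    new_weights = (1 - eta) * np.asarray(weights) + eta * target
    return new_weights / new_weights.sum()              # keep weights normalized
```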

16 Parallelized/GPU-Accelerated Ensemble of ELMs (Publ II) random inputs optimized #neurons (loo-cv on GPU) training performed on GPU training of ELMs parallelized over multiple CPU/GPU fixed linear combination based on loo-cv Advances in Extreme Learning Machines 10/23

19 Highlighted ELM Ensemble Results [plots: ensemble weight w_i vs. time] Adaptation of ensemble weight during task Advances in Extreme Learning Machines 11/23

20 Highlighted ELM Ensemble Results [plot: ensemble NMSE vs. number of models in ensemble] The more models in the ensemble, the more accurate the results Advances in Extreme Learning Machines 11/23

21 Highlighted ELM Ensemble Results [plot: ensemble running time (s) vs. number of workers] Scalability through parallelization, limited precision, use of GPUs Advances in Extreme Learning Machines 11/23

22 Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 12/23

23 ELM-based Feature Selection (Publ III) scaling layer random #neurons for each restart of algorithm input x j 1 s 1 input x j 2 input x j 3 s 2 s 3 output y j input x j 4 s 4 starting from a random scaling: gradually reduced until empty, based on training and loo error Advances in Extreme Learning Machines 13/23

26 Highlighted ELM-FS Results Advances in Extreme Learning Machines 14/23

27 Fast Feature Selection with GPU-accelerated Delta Test (Publ IV) input x j 1 0/1 input x j 2 input x j 3 0/1 0/1 output y j input x j 4 0/1 variable selection optimized using an evolutionary algorithm with GPU-accelerated Delta-Test as fitness criterion Advances in Extreme Learning Machines 15/23

29 Fast Feature Selection with GPU-accelerated Delta Test (Publ IV) Delta Test: $\delta = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - y_{NN(i)} \right)^2$, where NN(i) is the index of the nearest neighbour of x_i Advances in Extreme Learning Machines 16/23
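
A minimal NumPy/SciPy sketch of the Delta Test as written above (CPU only; the thesis work accelerates this on a GPU cluster, and the function name here is illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def delta_test(X, y):
    """Noise-variance estimate: half the mean squared difference between each
    output y_i and the output of the nearest neighbour of x_i in input space."""
    _, idx = cKDTree(X).query(X, k=2)   # k=2: the first hit is the point itself
    return np.mean((y - y[idx[:, 1]]) ** 2) / 2.0
```

In the evolutionary variable selection of Publication IV, this value would be evaluated with X restricted to each candidate subset of input variables.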

30 Binary/Ternary Extreme Learning Machines (Publ V) input x j 1 input x j 2 input x j 3 binary/ternary weights output y j input x j 4 scaling of weights and biases optimized using batch intrinsic plasticity pretraining least-squares fit with fast L2 regularization by using the SVD Advances in Extreme Learning Machines 17/23

34 Ternary ELM Results [plot: RMSE_test vs. numhidden for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM and BIP(rand)-TR-3-ELM (weights using 1 var, 2 vars, 3 vars)] Advances in Extreme Learning Machines 18/23

35 Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 19/23

36 Compressive ELM (Publ VI) input x j 1 input x j 2 input x j 3 binary/ ternary weights output y j input x j 4 scaling of weights and biases optimized using batchintrinsic plasticity pretraining least-squares solution with fast L2 regularization by using SVD decomposition Advances in Extreme Learning Machines 20/23

37 Compressive ELM (Publ VI) fast low-distortion embedding into lower-dimensional space input x j 1 input x j 2 input x j 3 binary/ ternary weights output y j input x j 4 scaling of weights and biases optimized using batchintrinsic plasticity pretraining least-squares solution with fast L2 regularization by using SVD decomposition Advances in Extreme Learning Machines 20/23

38 Highlighted Results Compressive ELM [plots: training time and MSE_test as a function of the number of hidden neurons and of training time, for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 21/23

39 Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs in ELM Summary Advances in Extreme Learning Machines 22/23

40 Summary Overall: the resulting collection of proposed methods provides an efficient, accurate and flexible framework for solving large-scale supervised learning problems. The proposed methods: are not limited to the particular types of random-weight neural networks and contexts in which they have been tested can easily be incorporated in new contexts and models Advances in Extreme Learning Machines 23/23

41 Defence Slides: Related Models Advances in Extreme Learning Machines 24/23

42 ELM vs RVFL Input layer Hidden layer Output layer input x j 1 input x j 2 input x j 3 output y j input x j 4 Advances in Extreme Learning Machines 25/23

45 Defence Slides: General Advances in Extreme Learning Machines 26/23

46 Standard ELM Given a training set (x_i, y_i), x_i ∈ R^d, y_i ∈ R, an activation function f : R → R and M the number of hidden nodes: 1: Randomly assign input weights w_i and biases b_i, i ∈ [1, M]; 2: Calculate the hidden layer output matrix H; 3: Calculate the output weight matrix β = H^† Y, where

$$H = \begin{pmatrix} f(w_1 \cdot x_1 + b_1) & \cdots & f(w_M \cdot x_1 + b_M) \\ \vdots & \ddots & \vdots \\ f(w_1 \cdot x_N + b_1) & \cdots & f(w_M \cdot x_N + b_M) \end{pmatrix}$$

Advances in Extreme Learning Machines 27/23
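
A minimal NumPy sketch of this standard ELM, assuming a sigmoid activation and the pseudo-inverse solution (names are illustrative, not the thesis code):

```python
import numpy as np

def train_elm(X, y, M, rng=np.random.default_rng(0)):
    """Standard ELM: random hidden layer, least-squares output weights."""
    d = X.shape[1]
    W = rng.standard_normal((d, M))            # random input weights w_i
    b = rng.standard_normal(M)                 # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden layer output matrix
    beta = np.linalg.pinv(H) @ y               # beta = H^dagger Y
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```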

47 ELM Theory vs Practice In theory, ELM is a universal approximator In practice, there is only a limited number of samples and a risk of overfitting Therefore: the functional approximation should use as few neurons as possible the hidden layer should extract and retain as much useful information as possible from the input samples Advances in Extreme Learning Machines 28/23

48 ELM Theory vs Practice Weight considerations: the weight range determines the typical activation of the transfer function (remember ⟨w_i, x⟩ = ‖w_i‖ ‖x‖ cos θ) therefore, normalize or tune the length of the weight vectors somehow Linear vs non-linear: since sigmoid neurons operate in a nonlinear regime, add d linear neurons for the ELM to work better on (almost) linear problems Avoiding overfitting: use efficient L2 regularization Advances in Extreme Learning Machines 29/23

49 Batch Intrinsic Plasticity suppose (x_1, ..., x_N) ∈ R^{N×d}, and the output of neuron i is h_i = f(a_i w_i · x_k + b_i), where f is an invertible transfer function for each neuron i: draw targets t = (t_1, t_2, ..., t_N) from an exponential distribution with mean µ_exp and sort such that t_1 < t_2 < ... < t_N compute all presynaptic inputs s_k = w_i · x_k and sort such that s_1 < s_2 < ... < s_N now, find a_i and b_i such that

$$\begin{pmatrix} s_1 & 1 \\ \vdots & \vdots \\ s_N & 1 \end{pmatrix} \begin{pmatrix} a_i \\ b_i \end{pmatrix} = \begin{pmatrix} f^{-1}(t_1) \\ \vdots \\ f^{-1}(t_N) \end{pmatrix}$$

Advances in Extreme Learning Machines 30/23
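
A small sketch of this pretraining for a single neuron, assuming a logistic sigmoid transfer function (so f^{-1} is the logit); the variable names and the clipping of the targets are illustrative details:

```python
import numpy as np

def bip_pretrain_neuron(X, w, mu_exp, rng=np.random.default_rng(0)):
    """Find slope a_i and bias b_i so that the neuron's activations
    approximately follow an exponential distribution with mean mu_exp."""
    N = X.shape[0]
    s = np.sort(X @ w)                                   # sorted presynaptic inputs s_k
    t = np.sort(rng.exponential(mu_exp, size=N))         # sorted targets t_k
    t = np.clip(t, 1e-6, 1 - 1e-6)                       # keep the logit finite
    rhs = np.log(t / (1 - t))                            # f^{-1}(t_k) for the sigmoid
    Phi = np.column_stack([s, np.ones(N)])               # matrix with rows (s_k, 1)
    a, b = np.linalg.lstsq(Phi, rhs, rcond=None)[0]      # least-squares fit
    return a, b
```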

50 Fast leave-one-out cross-validation The leave-one-out (LOO) error can be computed using the PRESS statistic:

$$E_{loo} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - hat_{ii}} \right)^2$$

where hat_{ii} is the i-th value on the diagonal of the HAT matrix, which can be quickly computed, given H^†:

$$\hat{Y} = H\beta = HH^{\dagger}Y = HAT \cdot Y$$

Advances in Extreme Learning Machines 31/23

51 Fast leave-one-out cross-validation Using the SVD decomposition H = UDV^T, it is possible to obtain all the information needed for computing the PRESS statistic without recomputing the pseudo-inverse for every λ:

$$\hat{Y} = H\beta = H(H^T H + \lambda I)^{-1} H^T Y = HV(D^2 + \lambda I)^{-1} DU^T Y = UDV^T V(D^2 + \lambda I)^{-1} DU^T Y = UD(D^2 + \lambda I)^{-1} DU^T Y = HAT \cdot Y$$

Advances in Extreme Learning Machines 32/23

52 Fast leave-one-out cross-validation where D(D^2 + λI)^{-1} D is a diagonal matrix with d_{ii}^2 / (d_{ii}^2 + λ) as the i-th diagonal entry. Now, with h_i and u_i denoting the i-th rows of H and U:

$$MSE_{TR\text{-}PRESS} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - hat_{ii}} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - h_i (H^T H + \lambda I)^{-1} h_i^T} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - u_i D(D^2 + \lambda I)^{-1} D\, u_i^T} \right)^2$$

Advances in Extreme Learning Machines 33/23
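
A compact NumPy sketch of this SVD-based PRESS computation, scanning several values of λ without recomputing the pseudo-inverse (illustrative names, not the thesis code):

```python
import numpy as np

def tr_press(H, y, lambdas):
    """Leave-one-out (PRESS) MSE of the Tikhonov-regularized ELM solution,
    for each candidate regularization parameter lambda."""
    U, d, Vt = np.linalg.svd(H, full_matrices=False)
    Uty = U.T @ y
    mses = []
    for lam in lambdas:
        f = d**2 / (d**2 + lam)                       # diagonal of D(D^2 + lam*I)^{-1} D
        y_hat = U @ (f * Uty)                         # HAT @ y
        hat_diag = np.einsum('ij,j,ij->i', U, f, U)   # diagonal of U diag(f) U^T
        loo_residuals = (y - y_hat) / (1.0 - hat_diag)
        mses.append(np.mean(loo_residuals**2))
    return np.array(mses)
```

The λ minimizing this criterion would then be used for the final output weights β = V(D^2 + λI)^{-1} D U^T y.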

53 Leave-one-out computation overhead [plot: time (s) vs. number of hidden neurons] Figure: Comparison of running times of ELM training (solid lines) and ELM training + leave-one-out computation (dotted lines), with (black lines) and without (gray lines) explicitly computing and reusing H Advances in Extreme Learning Machines 34/23

54 Defence Slides: Ensembles Advances in Extreme Learning Machines 35/23

55 Error reduction by ensembling Write each model's prediction as the true value plus an error term:

$$\hat{y}_i = y + \epsilon_i, \qquad E[\{\hat{y}_i - y\}^2] = E[\epsilon_i^2]$$

The average error of the individual models and the error of the ensemble average are

$$E_{avg} = \frac{1}{m} \sum_{i=1}^{m} E[\epsilon_i^2], \qquad E_{ens} = E\left[\left\{\frac{1}{m}\sum_{i=1}^{m}\hat{y}_i - y\right\}^2\right] = E\left[\left\{\frac{1}{m}\sum_{i=1}^{m}\epsilon_i\right\}^2\right]$$

so that, assuming zero-mean and uncorrelated errors,

$$E_{ens} = \frac{1}{m^2}\sum_{i=1}^{m} E[\epsilon_i^2] = \frac{1}{m} E_{avg}$$

Advances in Extreme Learning Machines 36/23
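
A quick numerical check of this relation under the stated assumptions (independent, zero-mean errors); purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 10, 100_000
y = 1.0                                                # true value
eps = rng.normal(0.0, 0.5, size=(m, n_samples))        # independent model errors
preds = y + eps                                        # individual predictions

E_avg = np.mean(eps**2)                                # average individual error
E_ens = np.mean((preds.mean(axis=0) - y)**2)           # error of the ensemble mean
print(E_avg, E_ens, E_avg / m)                         # E_ens is close to E_avg / m
```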

56 Running times parallelized and GPU-Accelerated ELM Ensemble (SantaFe, ESTSP) [table: running times in seconds for varying N on the SantaFe and ESTSP data, comparing t(mldiv_dp), t(gesv_dp), t(mldiv_sp), t(gesv_sp) and t(culagesv_sp), i.e. double- and single-precision CPU solvers and the CULA GPU solver] Advances in Extreme Learning Machines 37/23

57 Running times parallelized and GPU-Accelerated ELM Ensemble (SantaFe, ESTSP) [plots: ensemble running time (s) vs. number of workers, for SantaFe and ESTSP] Advances in Extreme Learning Machines 38/23

58 ELM Ensemble Accuracy (SantaFe, ESTSP) [plots: ensemble NMSE vs. number of models in ensemble, for SantaFe and ESTSP] Advances in Extreme Learning Machines 39/23

59 Defence Slides: Binary / Ternary ELM Advances in Extreme Learning Machines 40/23

60 Better Weights random layer weights and biases drawn from e.g. uniform / normal distribution with certain range / variance typical transfer function f(⟨w_i, x⟩ + b_i) from ⟨w_i, x⟩ = ‖w_i‖ ‖x‖ cos θ, it can be seen that the typical activation of f depends on: expected length of w_i expected length of x angles θ between the weights and the samples Advances in Extreme Learning Machines 41/23

61 Better Weights: Orthogonality? Idea 1: improve the diversity of the weights by taking weights that are mutually orthogonal (e.g. M d-dimensional basis vectors, randomly rotated in the d-dimensional space) however, does not give significantly better accuracy apparently, for the tested cases, random weight scheme of ELM already covers the possible weight space pretty well Advances in Extreme Learning Machines 42/23

62 Better Weights: Sparsity! Idea 2: improve the diversity of the weights by having each of them work in a different subspace (e.g. each weight vector has different subset of variables as input) spoiler: significantly improves accuracy, at no extra computational cost experiments suggest this is due to the weight scheme enabling implicit variable selection Advances in Extreme Learning Machines 43/23

63 Binary Weight Scheme 1 var, 2 vars, 3 vars, etc. until enough neurons: add w ∈ {0, 1}^d with 1 var (# = 2^1 · C(d,1)) add w ∈ {0, 1}^d with 2 vars (# = 2^2 · C(d,2)) add w ∈ {0, 1}^d with 3 vars (# = 2^3 · C(d,3)) ... For each subspace, weights are added in random order to avoid bias toward particular variables Advances in Extreme Learning Machines 44/23

64 Ternary Weight Scheme 1 var, 2 vars, 3 vars, ... until enough neurons: add w ∈ {−1, 0, 1}^d with 1 var (# = 3^1 · C(d,1)) add w ∈ {−1, 0, 1}^d with 2 vars (# = 3^2 · C(d,2)) add w ∈ {−1, 0, 1}^d with 3 vars (# = 3^3 · C(d,3)) For each subspace, weights are added in random order to avoid bias toward particular variables Advances in Extreme Learning Machines 45/23
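
A rough sketch of how such a ternary weight matrix could be built (growing subsets of variables, random order within each subspace); giving each selected variable a nonzero ±1 entry is an illustrative reading of the scheme, not the exact procedure of Publication V:

```python
import numpy as np
from itertools import combinations, product

def ternary_weights(d, M, rng=np.random.default_rng(0)):
    """Build M ternary weight vectors in {-1, 0, 1}^d, using ever larger
    subsets of input variables until enough neurons are available."""
    weights, n_vars = [], 1
    while len(weights) < M and n_vars <= d:
        candidates = []
        for subset in combinations(range(d), n_vars):
            for signs in product([-1.0, 1.0], repeat=n_vars):
                w = np.zeros(d)
                w[list(subset)] = signs
                candidates.append(w)
        order = rng.permutation(len(candidates))        # random order within the subspace
        weights.extend(candidates[i] for i in order)
        n_vars += 1
    return np.array(weights[:M]).T                      # d x M input weight matrix
```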

65 Experimental Settings Data sets (abbreviation): Abalone (Ab), CaliforniaHousing (Ca), CensusHouse8L (Ce), DeltaElevators (De), ComputerActivity (Co) [table also lists the number of variables, # training and # test samples per data set] BIP(CV)-TR-ELM vs BIP(CV)-TR-2-ELM vs BIP(CV)-TR-3-ELM Experiment 1: relative performance Experiment 2: robustness against irrelevant vars Experiment 3: implicit variable selection (all results are averaged over 100 repetitions, each with randomly drawn training/test set) Advances in Extreme Learning Machines 46/23

66 Exp 1: numhidden vs. RMSE (Abalone) [plot: RMSE_test vs. numhidden for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] averages over 100 runs gaussian < binary; ternary < gaussian; better RMSE with much fewer neurons Advances in Extreme Learning Machines 47/23

67 Exp 1: numhidden vs. RMSE (CpuActivity) [plot: RMSE_test vs. numhidden for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] averages over 100 runs binary < gaussian; ternary < gaussian; better RMSE with much fewer neurons Advances in Extreme Learning Machines 48/23

68 Exp 2: Robustness against irrelevant variables (Abalone) [plot: RMSE_test vs. number of added noise variables for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] 1000 neurons binary weight scheme gives similar RMSE ternary weight scheme makes ELM more robust against irrelevant vars Advances in Extreme Learning Machines 49/23

69 Exp 2: Robustness against irrelevant variables (CpuActivity) [plot: RMSE_test vs. number of added noise variables for BIP(rand)-TR-ELM, BIP(rand)-TR-2-ELM, BIP(rand)-TR-3-ELM] 1000 neurons binary and ternary weight schemes make ELM more robust against irrelevant vars Advances in Extreme Learning Machines 50/23

70 Exp 2: Robustness against irrelevant variables [table: RMSE with original variables, RMSE with 30 added irrelevant variables, and RMSE loss, for gaussian / binary / ternary weights on Ab and Co] Table: Average RMSE loss of ELMs with 1000 hidden neurons, trained on the original data, and the data with 30 added irrelevant variables Advances in Extreme Learning Machines 51/23

71 Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [bar chart: variable relevance per input variable (D1–D5, R1–R5, C1–C12), gaussian weights] Advances in Extreme Learning Machines 52/23

72 Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [bar chart: variable relevance per input variable (D1–D5, R1–R5, C1–C12), binary weights] Advances in Extreme Learning Machines 53/23

73 Exp 3: Implicit Variable Selection (CpuAct) relevance of each input variable quantified as $\sum_{i=1}^{M} \beta_i w_i$ [bar chart: variable relevance per input variable (D1–D5, R1–R5, C1–C12), ternary weights] Advances in Extreme Learning Machines 54/23
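
In code, this relevance measure could be computed roughly as follows, with W the d × M input weight matrix and beta the output weights; taking absolute values so that contributions do not cancel is an assumption of this sketch:

```python
import numpy as np

def variable_relevance(W, beta):
    """Relevance of input variable j: sum over neurons i of |beta_i| * |w_ij|."""
    return np.abs(W) @ np.abs(beta)    # vector of length d
```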

74 Binary/Ternary ELM Conclusions We propose a simple change to the weight scheme and introduce robust ELM variants: BIP(rand)-TR-ELM BIP(rand)-TR-2-ELM BIP(rand)-TR-3-ELM Our experiments suggest that 1. the ternary weight scheme is generally better than gaussian weights 2. the ternary weight scheme is robust against irrelevant variables 3. the binary/ternary weight schemes allow ELM to perform implicit variable selection The added robustness and increased accuracy come for free! Advances in Extreme Learning Machines 55/23

75 Defence Slides: Trade-offs / Compressive ELM Advances in Extreme Learning Machines 56/23

76 Time-accuracy Trade-offs for Several ELMs ELM OP-ELM: Optimally Pruned ELM with neurons ranked by relevance, and then pruned to optimize the leave-one-out error TR-ELM: Tikhonov-regularized ELM, with efficient optimization of regularization parameter λ, using the SVD approach to computing H TROP-ELM: Tikhonov regularized OP-ELM BIP(0.2), BIP(rand), BIP(CV): ELMs pretrained using Batch Intrinsic Plasticity mechanism, adapting the hidden layer weights and biases, such that they retain as much information as possible BIP parameter is either fixed, randomized, or cross-validated over 20 possible values Advances in Extreme Learning Machines 57/23

77 ELM Time-accuracy Trade-offs (Abalone UCI) [plots: training time and MSE_test vs. number of hidden neurons for OP-3-ELM, TROP-3-ELM, TR-3-ELM, BIP(CV)-TR-3-ELM, BIP(0.2)-TR-3-ELM, BIP(rand)-TR-3-ELM] Advances in Extreme Learning Machines 58/23

78 ELM Time-accuracy Trade-offs (Abalone UCI) [plots: MSE_test vs. training time and MSE_test vs. testing time for OP-3-ELM, TROP-3-ELM, TR-3-ELM, BIP(CV)-TR-3-ELM, BIP(0.2)-TR-3-ELM, BIP(rand)-TR-3-ELM] Advances in Extreme Learning Machines 59/23

79 ELM Time-accuracy Trade-offs (Abalone UCI) Depending on the user's criteria, these results suggest: if training time is most important: BIP(rand)-TR-3-ELM (almost optimal performance, while keeping training time low) if test error is most important: BIP(CV)-TR-3-ELM (slightly better accuracy, but training time is 20 times as high) if testing time is most important: BIP(rand)-TR-3-ELM (surprisingly) (OP-ELM and TROP-ELM tend to be faster in test, but suffer from slight overfitting) Since TR-3-ELM offers attractive trade-offs between speed and accuracy, this model will be central in the rest of the paper. Advances in Extreme Learning Machines 60/23

80 Two approaches for improving models Time-accuracy trade-offs suggest two possible strategies to obtain models that are preferable over other models: reducing test error, using a better algorithm (in terms of the training time-accuracy plot: pushing the curve down) reducing computational time, while retaining as much accuracy as possible (in terms of the training time-accuracy plot: pushing the curve to the left) Compressive ELM focuses on reducing computational time by performing the training in a reduced space, and then projecting the solution back to the original space. Advances in Extreme Learning Machines 61/23

81 Compressive ELM Given an m × n matrix A, compute a k-term approximate SVD A ≈ UDV^T [Halko2009]: Form the n × (k + p) random matrix Ω (where p is small) Form the m × (k + p) sampling matrix Y = AΩ (sketch A by applying Ω) Form the m × (k + p) orthonormal matrix Q (such that range(Q) = range(Y)) Compute B = Q^T A Form the SVD of B so that B = ÛDV^T Compute the matrix U = QÛ Advances in Extreme Learning Machines 62/23
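
A direct NumPy transcription of this procedure with a Gaussian test matrix (names are illustrative):

```python
import numpy as np

def randomized_svd(A, k, p=10, rng=np.random.default_rng(0)):
    """k-term approximate SVD of A following the scheme above."""
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))       # random test matrix
    Y = A @ Omega                                 # sketch of the range of A
    Q, _ = np.linalg.qr(Y)                        # orthonormal basis, range(Q) = range(Y)
    B = Q.T @ A                                   # small (k+p) x n matrix
    U_hat, d, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat[:, :k], d[:k], Vt[:k]        # truncate to k terms
```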

82 Faster Sketching? The bottleneck in the algorithm is the time it takes to sketch the matrix. Rather than using Gaussian random matrices for sketching A, use random matrices that are sparse or structured in some way and allow for faster multiplication, of the form P H D with P of size k × n and H, D of size n × n (D a diagonal matrix of random signs): Fast Johnson-Lindenstrauss Transform (FJLT), introduced in [Ailon2006], for which P is a sparse matrix of random Gaussian variables, and H encodes the Discrete Hadamard Transform Subsampled Randomized Hadamard Transform (SRHT), for which P is a matrix selecting k random columns from H, and H encodes the Discrete Hadamard Transform (Experiments did not show a substantial difference in terms of computational time. Dataset too small?) Advances in Extreme Learning Machines 63/23
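
A rough sketch of the SRHT variant of such a structured sketching matrix (n must be a power of two here; a real implementation would use a fast Walsh-Hadamard transform instead of forming H explicitly):

```python
import numpy as np
from scipy.linalg import hadamard

def srht_sketch(A, k, rng=np.random.default_rng(0)):
    """Sketch the n columns of A down to k dimensions using random signs (D),
    the Hadamard transform (H) and uniform column sampling (P)."""
    m, n = A.shape                               # n must be a power of two
    signs = rng.choice([-1.0, 1.0], size=n)      # diagonal of D
    H = hadamard(n) / np.sqrt(n)                 # orthonormal Hadamard matrix
    cols = rng.choice(n, size=k, replace=False)  # columns kept by P
    Omega = (signs[:, None] * H)[:, cols] * np.sqrt(n / k)
    return A @ Omega                             # m x k sketch
```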

83 Compressive ELM (CalHousing, FJLT) [plots: training time and MSE_test vs. number of hidden neurons for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 64/23

84 Compressive ELM (CalHousing, FJLT) [plots: MSE_test vs. training time and MSE_test vs. testing time for TR-3-ELM and CS-TR-3-ELM with k = 50, 100, 200, 400, 600] Advances in Extreme Learning Machines 65/23

85 Compressive ELM Conclusions Contributions Compressive ELM provides a flexible way to reduce training time by doing the optimization in a reduced space of k dimensions given k large enough, Compressive ELM achieves the best test error for each computational time (i.e. there are no models that achieve better test error and can be trained in the same or less time). Future work let theory/bounds on low-distortion embeddings inform the choice of k Advances in Extreme Learning Machines 66/23
