Fast and Effective Limited Pass Learning for Large Data Quantities. Presented by: Nayyar Zaidi
1 Fast and Effective Limited Pass Learning for Large Data Quantities Presented by: Nayyar Zaidi
2 [Figure: RMSE vs. training set size.]
3 [Figure: RMSE vs. training set size.]
4 [Figure: timeline/scale diagram of machine learning - from computational models of neural networks, artificial neural networks, SVMs (linear), support vector machines, kernel methods, the perceptron, deep learning, random forests and boosting, and relational databases / data warehousing ("small machine learning"), through to "big machine learning" driven by the WWW, the iPhone and Kaggle, along an axis of data scale.]
5 Good Old-Fashioned Machine Learning (GOFML): 1) Regularization 2) Non-parametric methods (nearest neighbour, tree-based methods) 3) Power of ensembles (random forest, boosting) 4) Kernel theory 5) Batch optimization methods 6) Bayesian vs. frequentist 7) Feature selection. Large-Scale Machine Learning (LSML): 1) Feature engineering 2) Deep learning 3) SGD 4) Minimal pass learning 5) Automatic regularization 6) Minimal tuning. Future Machine Learning (FML): 1) Single pass learning 2) Automatic feature engineering 3) No tuning parameters.
6 (Same taxonomy as the previous slide, introducing the abbreviations LSML and FML.)
7 GOFML vs. LSML. [Figure: RMSE vs. training set size.]
8 Objectives of the Talk: summarize the properties of Large-Scale Machine Learning (LSML) algorithms; propose two fast and effective limited-pass learning algorithms. Outline of the Talk: Introduction; Background (NB, RF and Bayesian networks); Algorithm I: FewPLA; Algorithm II: Selective ALR; Discussion. Target Audience: research scientists, machine learning practitioners, Ph.D. students, final-year undergraduate students.
9 Three Properties of LSML: minimal pass learning, minimal tuning parameters, low-bias learning.
10 Two Extremes. Naive Bayes (NB): high-bias, low-variance; extremely easy to train; single pass; minimal tuning parameters. Random Forest (RF): low-bias, high-variance; multiple passes; some tuning parameters. [Figure: NB and RF placed on axes of bias of the learner, number of passes through the data, and number of tuning parameters.]
11 Naive Bayes: a Bayesian network that factorizes the joint distribution P(X, Y); parameters are maximum-likelihood estimates of the log-likelihood; good for small data. Random Forest: a non-parametric method; bagged data + bagged variables; trees are grown to full depth; many variants, e.g. gradient boosting, give state-of-the-art results; good for large datasets. [Figure: NB network structure with class Y as parent of X_1, ..., X_n.]
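Naive Bayes's single-pass training is just count accumulation, which is why the talk places it at the easy extreme. A minimal sketch for discrete attributes; the `n_vals` smoothing denominator is an illustrative choice, not something the talk specifies:

```python
import math
from collections import defaultdict

def train_naive_bayes(data):
    """One pass over (features, label) pairs, accumulating the counts
    that define the maximum-likelihood NB parameters."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)  # keyed by (attribute index, value, label)
    n = 0
    for x, y in data:
        n += 1
        class_counts[y] += 1
        for i, v in enumerate(x):
            feat_counts[(i, v, y)] += 1
    return n, class_counts, feat_counts

def nb_predict(x, n, class_counts, feat_counts, n_vals=2, alpha=1.0):
    """Pick the class maximising log P(y) + sum_i log P(x_i | y),
    with Laplace smoothing; n_vals is the number of values per attribute."""
    best, best_lp = None, float("-inf")
    for y, cy in class_counts.items():
        lp = math.log(cy / n)
        for i, v in enumerate(x):
            lp += math.log((feat_counts[(i, v, y)] + alpha) / (cy + n_vals * alpha))
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```

Because training only accumulates counts, the model can be built in a single sequential scan of arbitrarily large data.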
12 Comparison between NB and RF. [Figures: bias, variance, 0-1 loss, training time and classification time of NB vs. RF, on all datasets vs. big datasets.] Semi-naive Bayes methods (AnDE, TAN, KDB, etc.) sit between the two extremes.
13 Bayesian Network Classifiers. A BN is characterised by two sets of parameters: B = (G, Θ). Learning a BN = structure learning + parameter learning. The graph has nice properties. Structure learning: K2 and many variants. Parameter learning: accumulating counts to maximise the log-likelihood.
$$P_B(y, \mathbf{x}) = \theta_y \prod_{i=1}^{n} \theta_{x_i \mid y, \Pi_i(\mathbf{x})}$$
$$P_B(y \mid \mathbf{x}) = \frac{P_B(y, \mathbf{x})}{P_B(\mathbf{x})} = \frac{\theta_y \prod_{i=1}^{n} \theta_{x_i \mid y, \Pi_i(\mathbf{x})}}{\sum_{y' \in \mathcal{Y}} \theta_{y'} \prod_{i=1}^{n} \theta_{x_i \mid y', \Pi_i(\mathbf{x})}}$$
$$\mathrm{LL}(B) = \sum_{j=1}^{N} \log P_B(y^{(j)}, \mathbf{x}^{(j)}) = \sum_{j=1}^{N} \Big( \log \theta_{y^{(j)}} + \sum_{i=1}^{n} \log \theta_{x_i^{(j)} \mid y^{(j)}, \Pi_i(\mathbf{x}^{(j)})} \Big)$$
14 k-Dependence Bayesian Estimator (KDB). Algorithm: 1) calculate the mutual information MI(X_i; Y) for all attributes; 2) calculate the conditional mutual information MI(X_i; X_j | Y); 3) sort all attributes by MI(X_i; Y); 4) for every i-th attribute: make class Y its parent, set K = min(i-1, k), and choose K parents from attributes 1 to i-1 based on their MI(X_i; X_j | Y) scores. [Figure: bias of NB, KDB1 and KDB2.]
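Once the mutual-information tables are computed (a separate counting pass, omitted here), the KDB structure-learning steps above reduce to a small parent-selection routine. A sketch, assuming precomputed `mi_y` and `cmi` tables:

```python
def kdb_structure(mi_y, cmi, k):
    """Assign parents for KDB given precomputed scores.

    mi_y[i]    : MI(X_i; Y) for each attribute i
    cmi[(i,j)] : MI(X_i; X_j | Y)
    Returns parents[i] = list of attribute parents for X_i
    (the class parent Y is implicit for every attribute).
    """
    # Step 3: order attributes by mutual information with the class.
    order = sorted(range(len(mi_y)), key=lambda i: mi_y[i], reverse=True)
    parents = {}
    for pos, i in enumerate(order):
        earlier = order[:pos]          # attributes already placed
        kk = min(pos, k)               # K = min(i-1, k)
        # Step 4: pick the kk earlier attributes with highest MI(X_i; X_j | Y).
        parents[i] = sorted(earlier, key=lambda j: cmi[(i, j)], reverse=True)[:kk]
    return parents
```

The resulting parent sets define the factorisation P(Y) ∏ P(X_i | Y, parents(X_i)) used by the later slides.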
15 [Figure: Covtype error of KDB (K=0..4) vs. NB and RF.] Comparative analysis of the performance of KDB with NB and RF.
16 [Figure: Covtype error of KDB vs. NB and RF.] Comparative analysis of the performance of KDB with NB and RF.
17 KDB - Model Structure.
$$P(\mathbf{X}, Y) = P(Y)\,P(X_1 \mid Y)\,P(X_2 \mid Y, X_1)\,P(X_3 \mid Y, X_1, X_2)$$
[Figure: an example of parameter structure for KDB (tries), branching on the class and on parent attribute values.]
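The trie-shaped parameter structure in the figure can be mimicked with nested dictionaries: each level keys on the class or on one parent's value, with counts at the leaves. A toy sketch; nested dicts stand in for whatever trie representation the actual implementation uses:

```python
def trie_add(trie, path, delta=1):
    """Walk/extend a nested-dict trie along `path` - e.g.
    (class value, parent values..., attribute value) - and
    bump the count stored at the leaf."""
    node = trie
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = node.get(path[-1], 0) + delta

def trie_get(trie, path, default=0):
    """Read a leaf count back, returning `default` for unseen paths."""
    node = trie
    for key in path[:-1]:
        node = node.get(key)
        if node is None:
            return default
    return node.get(path[-1], default)
```

Unseen parent-value combinations simply never materialise, which is what keeps the structure compact relative to a dense conditional probability table.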
18 Selective KDB (SKDB). Adds a third pass to KDB; selects the best k and the best number of attributes; good for reducing the size of KDB; a maximum k_max has to be specified. SKDB exploits the nested structure of KDB models over attributes and values of k using leave-one-out cross-validation.
$$P_{K=0}(\mathbf{X}, Y) = P(X_1 \mid Y)\,P(X_2 \mid Y)\,P(X_3 \mid Y)\,P(X_4 \mid Y)\,P(X_5 \mid Y)$$
$$P_{K=1}(\mathbf{X}, Y) = P(X_1 \mid Y)\,P(X_2 \mid Y, X_1)\,P(X_3 \mid Y, X_1)\,P(X_4 \mid Y, X_2)\,P(X_5 \mid Y, X_4)$$
$$P_{K=2}(\mathbf{X}, Y) = P(X_1 \mid Y)\,P(X_2 \mid Y, X_1)\,P(X_3 \mid Y, X_1, X_2)\,P(X_4 \mid Y, X_2, X_3)\,P(X_5 \mid Y, X_4, X_3)$$
$$P_{K=3}(\mathbf{X}, Y) = P(X_1 \mid Y)\,P(X_2 \mid Y, X_1)\,P(X_3 \mid Y, X_1, X_2)\,P(X_4 \mid Y, X_2, X_3, X_1)\,P(X_5 \mid Y, X_4, X_3, X_2)$$
$$P_{K=4}(\mathbf{X}, Y) = P(X_1 \mid Y)\,P(X_2 \mid Y, X_1)\,P(X_3 \mid Y, X_1, X_2)\,P(X_4 \mid Y, X_2, X_3, X_1)\,P(X_5 \mid Y, X_4, X_3, X_2, X_1)$$
19 [Figure: trie for $P(X_4 \mid Y, X_3)$ at K=1, branching on the values of $X_3$.]
20 [Figure: trie for $P(X_4 \mid Y, X_3, X_2)$ at K=2, branching on the values of $X_3$ and $X_2$.]
21 [Figure: trie for $P(X_4 \mid Y, X_3, X_2, X_1)$ at K=3, branching on the values of $X_3$, $X_2$ and $X_1$.]
22 Selective KDB. The first and second passes build the model; the third pass then begins: for each data point x, subtract it from the counts table; for each k (0 to K) and each attribute i (in the ordered set), accumulate LF[k][i] += LossFunction(P[y | x], y). Select the k and i with the best accumulated loss from the table, and trim the data structure. 1. Martinez, A., Webb, G., Li, S. and Zaidi, N. Scalable Learning of Bayesian Network Classifiers, JMLR, pp. 1-35, 2016.
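The third pass above can be sketched as follows. `counts.remove`/`counts.add` and `score` are assumed interfaces standing in for the count-table updates and the loss evaluation, not the paper's actual API; `score(counts, x, y, k, i)` is taken to return the loss of classifying x using parameter k and the first i attributes, with x's own counts excluded:

```python
def skdb_third_pass(data, counts, score, kmax, n_attrs):
    """Simplified sketch of SKDB's selection pass: leave-one-out
    cross-validation over every (k, number-of-attributes) pair,
    done incrementally by subtracting and restoring each point."""
    loss = [[0.0] * (n_attrs + 1) for _ in range(kmax + 1)]
    for x, y in data:
        counts.remove(x, y)                      # leave this point out
        for k in range(kmax + 1):
            for i in range(n_attrs + 1):
                loss[k][i] += score(counts, x, y, k, i)
        counts.add(x, y)                         # restore the counts
    # Pick the (k, i) cell with the smallest accumulated loss.
    best_k, best_i = min(((k, i) for k in range(kmax + 1)
                          for i in range(n_attrs + 1)),
                         key=lambda ki: loss[ki[0]][ki[1]])
    return best_k, best_i
```

Because the nested models share one set of count tables, all (k, i) candidates are scored in the same single pass, which is what keeps selection cheap.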
23 [Figure: Covtype error of KDB and SKDB (K=0..4) vs. NB and RF.] Comparative analysis of the performance of KDB and SKDB.
24 [Figure: Covtype number of parameters of KDB vs. SKDB (K=0..4).] Comparative analysis of the number of parameters of KDB and SKDB.
25 Discriminative Semi-naive Bayes Classifiers. Each generative model has a discriminative counterpart: 1) NB -> NBd 2) TAN -> TANd 3) KDB -> KDBd 4) AnDE -> AnDEd 5) BN -> BNd. Instead of the log-likelihood, optimise the conditional log-likelihood:
$$P_B(y, \mathbf{x}) = \theta_y \prod_{i=1}^{n} \theta_{x_i \mid y, \Pi_i(\mathbf{x})}$$
$$\mathrm{LL}(B) = \sum_{j=1}^{N} \log P_B(y^{(j)}, \mathbf{x}^{(j)}) = \sum_{j=1}^{N} \Big( \log \theta_{y^{(j)}} + \sum_{i=1}^{n} \log \theta_{x_i^{(j)} \mid y^{(j)}, \Pi_i(\mathbf{x}^{(j)})} \Big)$$
$$\mathrm{CLL}(B) = \sum_{j=1}^{N} \log P_B(y^{(j)} \mid \mathbf{x}^{(j)}) = \sum_{j=1}^{N} \Big( \log P_B(y^{(j)}, \mathbf{x}^{(j)}) - \log \sum_{y' \in \mathcal{Y}} P_B(y', \mathbf{x}^{(j)}) \Big) = \sum_{j=1}^{N} \Big( \log \theta_{y^{(j)}} + \sum_{i=1}^{n} \log \theta_{x_i^{(j)} \mid y^{(j)}, \Pi_i(\mathbf{x}^{(j)})} - \log \sum_{y' \in \mathcal{Y}} \theta_{y'} \prod_{i=1}^{n} \theta_{x_i^{(j)} \mid y', \Pi_i(\mathbf{x}^{(j)})} \Big)$$
1. Zaidi, N., Webb, G., Carman, M., Petitjean, F., Buntine, W., Hynes, M. and De Sterck, H. Efficient Parameter Learning of Bayesian Network Classifiers, Machine Learning, Volume 106, pp. 1-44, 2016.
26 $$P(\mathbf{X}, Y) = P(Y)\,P(X_1 \mid Y)\,P(X_2 \mid Y, X_1)\,P(X_3 \mid Y, X_1, X_2)$$
[Figure: an example of parameter structure for dKDB (tries), storing counts, parameters and gradients at each node.]
27 [Figure: Covtype error of KDB and dKDB (K=0..4) vs. NB and RF.] Comparative analysis of the performance of KDB and dKDB.
28 [Figure: Covtype error of KDB, SKDB and dKDB (K=0..4) vs. NB and RF.] Comparative analysis of the performance of KDB, SKDB and dKDB.
29 [Figure: Covtype error of KDB, SKDB, dKDB and SdKDB (K=0..4) vs. NB and RF.] Comparative analysis of the performance of KDB, SKDB, dKDB and SdKDB.
30 Discriminative Semi-naive Bayes Classifiers. [Figure: NB, 1-DB, 2-DB, K-DB and their discriminative versions 1-DBd, 2-DBd, K-DBd, plus RF, placed on axes of bias of the learner, number of passes through the data, and number of tuning parameters.] 1. Zaidi, N. and Webb, G. Fast and Efficient Single Pass Bayesian Learning, Advances in Knowledge Discovery and Data Mining. 2. Martinez, A., Webb, G., Li, S. and Zaidi, N. Scalable Learning of Bayesian Network Classifiers, JMLR, pp. 1-35, 2016.
31 The Equivalence. Naive Bayes, logistic regression, and a weighted classifier all share the same softmax form:
$$P_{\mathrm{NB}}(y \mid \mathbf{x}) = \frac{\exp\big(\log \theta_y + \sum_i \log \theta_{y,i,x_i}\big)}{\sum_{c=1}^{C} \exp\big(\log \theta_c + \sum_j \log \theta_{c,j,x_j}\big)}$$
$$P_{\mathrm{LR}}(y \mid \mathbf{x}) = \frac{\exp\big(\beta_y + \sum_i \beta_{y,i,x_i}\big)}{\sum_{c=1}^{C} \exp\big(\beta_c + \sum_j \beta_{c,j,x_j}\big)}$$
$$P_{\mathrm{WC}}(y \mid \mathbf{x}) = \frac{\exp\big(w_y \log \theta_y + \sum_i w_{y,i,x_i} \log \theta_{y,i,x_i}\big)}{\sum_{c=1}^{C} \exp\big(w_c \log \theta_c + \sum_j w_{c,j,x_j} \log \theta_{c,j,x_j}\big)}$$
1. Zaidi, N., Carman, M., Cerquides, J. and Webb, G. Naive-Bayes Inspired Effective Pre-Conditioners for Speeding-up Logistic Regression, ICDM, 2013. 2. Zaidi, N. and Webb, G. Preconditioning an Artificial Neural Network Using Naive Bayes, Advances in Knowledge Discovery and Data Mining, 2016.
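The equivalence can be checked numerically: fixing logistic-regression weights to NB's log-probabilities reproduces the NB posterior exactly. A minimal sketch of the shared softmax form:

```python
import math

def nb_posterior(x, log_prior, log_lik):
    """NB posterior written as a softmax over linear scores.

    log_prior[y]       = log P(y)
    log_lik[(y, i, v)] = log P(X_i = v | y)
    Setting LR's weights beta to exactly these log-probabilities
    yields the same predictive distribution.
    """
    scores = {y: lp + sum(log_lik[(y, i, v)] for i, v in enumerate(x))
              for y, lp in log_prior.items()}
    z = max(scores.values())                       # stabilise the exponentials
    exps = {y: math.exp(s - z) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}
```

Discriminatively training the weights away from the NB values is what turns the generative model into its "d" counterpart, with the NB estimates acting as a good starting point / pre-conditioner.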
32 Discriminative Semi-naive Bayes Methods. Three parameterisations of the conditional log-likelihood are considered: $\mathrm{CLL}_d(B)$, optimising free parameters directly; $\mathrm{CLL}_e(B)$, optimising in the log-parameter space $\log \theta_{x_i \mid y, \Pi_i(\mathbf{x})}$; and $\mathrm{CLL}_w(B)$, which learns weights $w$ over fixed generative log-probabilities:
$$\mathrm{CLL}_w(B) = \sum_{j=1}^{N} \Big( w_{y^{(j)}} \log \theta_{y^{(j)}} + \sum_{i=1}^{n} w_{x_i^{(j)} \mid y^{(j)}, \Pi_i} \log \theta_{x_i^{(j)} \mid y^{(j)}, \Pi_i(\mathbf{x}^{(j)})} - \log \sum_{y' \in \mathcal{Y}} \theta_{y'}^{w_{y'}} \prod_{i=1}^{n} \theta_{x_i^{(j)} \mid y', \Pi_i(\mathbf{x}^{(j)})}^{w_{x_i^{(j)} \mid y', \Pi_i}} \Big)$$
33 [Figures: Covtype negative log-likelihood vs. number of iterations for dKDB (K) vs. wKDB (K), K = 0..4.]
34 FewPLA: Discriminative Selective K-DB. Salient features: 1. Nine-pass learning (2-pass KDB, 1-pass SKDB, 5 passes for SGD optimisation). 2. Low bias. 3. Minimal tuning parameters: step sizes via adaptive gradients (AdaGrad, with the initial step size tuned on a hold-out set, or No More Pesky Learning Rates); regularization: adaptive, fixed, or none. [Figure: MNIST error of SKDB and FewPLA (K=0..5) vs. NB and RF.]
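The AdaGrad step mentioned above can be sketched as follows; `eta` is an illustrative initial step size (the talk tunes it on a hold-out set), not a recommended value:

```python
import math

def adagrad_update(w, g, h, eta=0.05, eps=1e-8):
    """One AdaGrad step. Per-coordinate step sizes shrink with the
    accumulated squared gradient, which is the slide's point about
    avoiding a hand-designed step-size schedule.

    w, g, h are same-length lists: weights, current gradient, and
    running sum of squared gradients. Returns updated (w, h).
    """
    h = [hi + gi * gi for hi, gi in zip(h, g)]
    w = [wi - eta * gi / (math.sqrt(hi) + eps)
         for wi, gi, hi in zip(w, g, h)]
    return w, h
```

Coordinates with consistently large gradients get their step sizes damped quickly, while rarely-updated coordinates keep larger steps, which suits the sparse higher-order features used later in the talk.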
35 [Figure: Covtype error of KDB, SKDB, dKDB and SdKDB (K=0..4) vs. NB and RF.] Comparative analysis of the performance of KDB, SKDB, dKDB and SdKDB.
36 Story So Far. Dropping the class from the conditioning sets turns the class-conditional factorisation into a plain chain factorisation:
$$P_{K=2}(\mathbf{X}, Y) = P(X_1 \mid Y)\,P(X_2 \mid Y, X_1)\,P(X_3 \mid Y, X_1, X_2)\,P(X_4 \mid Y, X_2, X_3)\,P(X_5 \mid Y, X_4, X_3)$$
$$P_{K=2}(\mathbf{X}, Y) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2, X_1)\,P(X_4 \mid X_3, X_2)\,P(X_5 \mid X_4, X_3)$$
37 $$P_{K=2}(\mathbf{X}, Y) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2, X_1)\,P(X_4 \mid X_3, X_2)\,P(X_5 \mid X_4, X_3)$$
38 Higher-order Logistic Regression (LRn). The model takes every feature subset of size n, $\alpha \in \binom{A}{n}$, as a higher-order feature:
$$P_{\mathrm{LR}^n}(y \mid \mathbf{x}) = \frac{\exp\big(\beta_y + \sum_{\alpha \in \binom{A}{n}} \beta_{y, \mathbf{x}_\alpha}\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(\beta_{y'} + \sum_{\alpha \in \binom{A}{n}} \beta_{y', \mathbf{x}_\alpha}\big)}$$
39 Higher-order Logistic Regression LRn. [Figure: LR1, LR2 and LRK placed alongside 1-DB/2-DB/K-DB and their discriminative versions, NB and RF, on axes of bias of the learner, number of passes through the data, and number of tuning parameters.]
40 [Figure: Covtype error of KDB, dKDB and ALR (K=0..4) vs. NB and RF.] Comparative analysis of the performance of LRn and dKDB.
41 Accelerated Logistic Regression (ALRn). ALRn parameterises LRn's softmax with weights over fixed AnJE log-probabilities:
$$P_{\mathrm{LR}^n}(y \mid \mathbf{x}) = \frac{\exp\big(\beta_y + \sum_{\alpha \in \binom{A}{n}} \beta_{y, \mathbf{x}_\alpha}\big)}{\sum_{c \in \mathcal{C}} \exp\big(\beta_c + \sum_{\alpha \in \binom{A}{n}} \beta_{c, \mathbf{x}_\alpha}\big)}$$
$$P_{\mathrm{AnJE}}(y \mid \mathbf{x}) = \frac{\exp\big(\log \theta_y + \sum_{\alpha \in \binom{A}{n}} \log \theta_{y, \mathbf{x}_\alpha}\big)}{\sum_{c \in \mathcal{C}} \exp\big(\log \theta_c + \sum_{\alpha \in \binom{A}{n}} \log \theta_{c, \mathbf{x}_\alpha}\big)}$$
$$P_{\mathrm{ALR}^n}(y \mid \mathbf{x}) = \frac{\exp\big(w_y \log \theta_y + \sum_{\alpha \in \binom{A}{n}} w_{y, \mathbf{x}_\alpha} \log \theta_{y, \mathbf{x}_\alpha}\big)}{\sum_{c \in \mathcal{C}} \exp\big(w_c \log \theta_c + \sum_{\alpha \in \binom{A}{n}} w_{c, \mathbf{x}_\alpha} \log \theta_{c, \mathbf{x}_\alpha}\big)}$$
1. Zaidi, N., Webb, G., Carman, M., Petitjean, F. and Cerquides, J. ALRn: Accelerated Higher-Order Logistic Regression, Machine Learning, Volume 104, 2016.
42 The same construction, with the AnJE log-probabilities scaled by $1/w$, where $w = A/n$, to correct for each attribute appearing in multiple order-$n$ feature subsets.
43 [Figure: prequential learning curves of LRn, ALRn and AnJE.]
44 Selective Accelerated Higher-order Logistic Regression. Salient features: 1. Two-pass learning. 2. Low bias. 3. Minimal tuning parameters. How to do selection to reduce the size of the model? Indexing, exact or approximate (feature hashing); selection by mutual information or by frequency; automatic selection via LOOCV as in SKDB, or by validating on a sample of the data.
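The feature-hashing option mentioned above can be sketched as follows. This is the signed variant of the hashing trick; `n_buckets` is an illustrative default, not a value from the talk:

```python
def hash_features(pairs, n_buckets=2 ** 18):
    """Map sparse (name, value) features into a fixed-size vector,
    so higher-order interaction terms can be indexed without an
    explicit (and potentially huge) feature dictionary."""
    vec = [0.0] * n_buckets
    for name, value in pairs:
        h = hash(name)
        idx = h % n_buckets
        # A second hash bit picks a sign, which keeps collisions
        # unbiased in expectation (signed hashing).
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0
        vec[idx] += sign * value
    return vec
```

Hashing trades a controllable amount of collision noise for constant memory, which is what makes indexing all order-n feature combinations feasible at scale. (Note Python salts `hash` for strings across runs; a real implementation would use a fixed hash such as MurmurHash for reproducibility.)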
45 [Figure: Covtype error of ALR2 with count-, MI- and CV-based selection vs. NB and RF.] Comparative analysis of the performance of sALR2 (count, MI and CV).
46 [Figure: Covtype error of ALR3 with MI- and CV-based selection vs. NB and RF.] Comparative analysis of the performance of sALR3 (MI and CV).
47 [Figure: decreasing bias across NB, AnDE, KDB, SKDB, dKDB, SdKDB, RF, LRn, sLRn, hLRn, shLRn, FM.] Minimal pass, minimal tuning parameters.
48 Burning Issues. 1. Discretization: discretization leads to better results. 2. Multiple classes: optimizing softmax leads to better-calibrated probabilities. 3. SGD: AdaGrad; cross-validate eta. 4. Regularization: L2 regularisation with lambda = 0.1 works well; adaptive regularization. 5. Indexing: hashing; feature transformation. 6. Non-stationary data: low-bias models for fast decay, high-bias models for slow decay. 7. Adaptive models: start with a high-bias model and shift gears as more data arrives; hierarchical models.
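The discretization point above can be illustrated with a simple equal-frequency discretizer; the talk does not prescribe a specific method, so this is just one common choice:

```python
def equal_freq_bins(values, n_bins=5):
    """Equal-frequency discretization: cut points at quantiles of the
    training values, so each bin holds roughly the same number of
    points. Returns the cut points and a discretizer function."""
    s = sorted(values)
    n = len(s)
    cuts = [s[(i * n) // n_bins] for i in range(1, n_bins)]

    def discretize(v):
        # Bin index = number of cut points at or below v.
        b = 0
        for c in cuts:
            if v >= c:
                b += 1
        return b

    return cuts, discretize
```

Equal-frequency cuts avoid empty bins on skewed attributes, which keeps the count tables of the Bayesian network classifiers well populated.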
49 Aquila Audax. Salient features: implements KDB, ALR, FM; objective functions: MSE, CLL, HL; multiple classes (optimizes softmax); SGD: AdaGrad, AdaDelta; regularization: adaptive regularization; creates features on the run; feature selection: counts, MI, LOOCV, hashing; and others.
50 Collaborators Offline Discussions Github: nayyarzaidi LinkedIn: nayyar_zaidi URL: Questions?
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationMachine Learning Gaussian Naïve Bayes Big Picture
Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 27, 2011 Today: Naïve Bayes Big Picture Logistic regression Gradient ascent Generative discriminative
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross
More informationStatistical Machine Learning from Data
January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationMIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE
MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationAdvanced statistical methods for data analysis Lecture 2
Advanced statistical methods for data analysis Lecture 2 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline
More informationLast updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationCS 6375 Machine Learning
CS 6375 Machine Learning Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate Course Info. Instructor: Nicholas Ruozzi Office: ECSS 3.409 Office hours: Tues.
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:
More informationMIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,
MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationOverview of gradient descent optimization algorithms. HYUNG IL KOO Based on
Overview of gradient descent optimization algorithms HYUNG IL KOO Based on http://sebastianruder.com/optimizing-gradient-descent/ Problem Statement Machine Learning Optimization Problem Training samples:
More informationMachine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationCS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine
CS 484 Data Mining Classification 7 Some slides are from Professor Padhraic Smyth at UC Irvine Bayesian Belief networks Conditional independence assumption of Naïve Bayes classifier is too strong. Allows
More informationMultilayer Perceptron
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationLinear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt)
Linear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt) Nathan Schneider (some slides borrowed from Chris Dyer) ENLP 12 February 2018 23 Outline Words, probabilities Features,
More information