Modeling and Estimation from High Dimensional Data TAKASHI WASHIO THE INSTITUTE OF SCIENTIFIC AND INDUSTRIAL RESEARCH OSAKA UNIVERSITY

2 Department of Reasoning for Intelligence, Division of Information and Quantum Sciences, The Institute of Scientific and Industrial Research, Osaka University

Our Research Scope
1. Data Mining and Machine Learning Techniques
2. Application to Science, Engineering and Society
[Diagram labels: Useful Information and Knowledge; Data Mining, Machine Learning]

We have been developing:
1. Graph mining
2. Structural regularization
3. Causal inference
4. Mass based learning
5. Scientific discovery, and so on

We have been applying these techniques to industrial quality control, chemical activity analysis, medical KDD, bioinformatics, high dimensional sensing, and so on.

Research Staff
Takashi Washio (Prof.)
Yoshinobu Kawahara (Assoc. Prof.)
Shohei Shimizu (Assoc. Prof.)
Mahito Sugiyama (Assistant Prof.)
Post Docs

[Map labels: Helsinki Univ.; Max Planck, Tubingen; Beijing Univ.; Washington State Univ., Seattle; Joseph Fourier Univ., Grenoble; Stanford Univ., San Francisco; National Univ. Singapore; Monash Univ., Melbourne]

3 Background

High dimensional big data are growing rapidly, driven by recent sensing, storage and network technologies.
[Diagram labels: Science and Engineering; Society]

The need to derive models of high dimensional big data efficiently and accurately, and to analyze and estimate their important characteristics, is increasing.

4 Contents

1. Mass based density and dissimilarity estimations and their application to ML tasks (IEEE ICDM11 and IEEE ICDM14)
The curse of dimensionality and the computational intractability of big data are alleviated by using an ensemble of data subsamples and projections, without any distance measures.

2. Structural regularization based ML and its application to bioinformatics (NIPS09, NIPS10, UAI13, ISMB/ECCB13)
The accuracy and efficiency of ML for high dimensional big data increase significantly when structural prior knowledge is introduced into the model regularization.

5 Mass based density and dissimilarity estimations and their application to ML tasks (Joint Work with Monash University, Australia)

Density Estimation Based on Mass (ICDM2011)

[Figure: high dim. big data, with a point x marked in each subsampled frame]

1. Data are subsampled in each frame.
2. The data mass in a randomly given axis-parallel vicinity of a point x is computed.
3. The mass based density of x is the ensemble average of these values.

$$\hat{p}_d(x) = \frac{1}{t} \sum_{i=1}^{t} \frac{m(T_i(x \mid D_i))}{n v_i}$$

Squared Bias = $O(d^2 L^2)$, Variance = $O(t^{-2} L^{-d})$

Mass enables consistent density estimation WITHOUT DISTANCE. Its accuracy is comparable with that of the kernel density estimator. Its computational complexity is O(nt) (far less than O(N)).
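The three steps on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the slide does not say how the random axis-parallel vicinity is built, so the sketch assumes a regular grid with a random shift per subsample. The function name `mass_density` and the parameters `cells` and `seed` are hypothetical; `t` (number of subsamples) and `n` (subsample size) follow the slide's symbols.

```python
import math
import random


def mass_density(x, data, t=100, n=64, cells=4, seed=0):
    """Average over t subsamples of mass / (n * volume) in a random cell containing x.

    Assumption: the axis-parallel region T_i is one cell of a regular grid
    whose origin is shifted at random for every subsample.
    """
    rng = random.Random(seed)
    d = len(x)
    lo = [min(row[j] for row in data) for j in range(d)]
    hi = [max(row[j] for row in data) for j in range(d)]
    width = [(hi[j] - lo[j]) / cells for j in range(d)]  # cell side per dimension
    volume = math.prod(width)                            # v_i, identical for all i here

    total = 0.0
    for _ in range(t):
        sample = rng.sample(data, n)                     # subsample D_i (step 1)
        offset = [rng.random() * width[j] for j in range(d)]

        def cell(p):
            # Index of the shifted grid cell that contains point p.
            return tuple(
                math.floor((p[j] - lo[j] + offset[j]) / width[j]) for j in range(d)
            )

        target = cell(x)
        mass = sum(1 for p in sample if cell(p) == target)  # m(T_i(x | D_i)) (step 2)
        total += mass / (n * volume)
    return total / t                                     # ensemble average (step 3)


if __name__ == "__main__":
    gen = random.Random(1)
    points = [[gen.gauss(0, 1), gen.gauss(0, 1)] for _ in range(2000)]
    print(mass_density([0.0, 0.0], points))  # near the mode: higher estimate
    print(mass_density([3.0, 3.0], points))  # in the tail: lower estimate
```

No pairwise distance is computed anywhere; each subsample only requires counting which of its n points fall in the same cell as x, which is where the O(nt) cost quoted on the slide comes from.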

6 Density Estimation Based on Mass (Continued)

Comparison with Kernel Density Est.

Density-based Clustering: DBSCAN vs DEMass-DBSCAN
At Data size ratio = 150, Runtime: 4.5 hours vs 36 days

Anomaly Detection: LOF vs DEMass-LOF
At Data size ratio = 128, Runtime: 45 seconds vs 28 hours

Replacing a conventional density estimator with DEMass improves the time and space complexities of existing algorithms without losing accuracy.

7 m_p-dissimilarity: A mass-based dissimilarity measure (ICDM2014)

We propose reflecting the relative position of two instances, with respect to the rest of the data, in the dissimilarity measure.
[Figure labels: Similar; Similar?]

m_p-dissimilarity evaluates the dissimilarity between two instances in terms of the probability mass in a region covering the two instances in each dimension (no distance measure is used).

$$\sqrt[p]{\mathrm{mass}_1(x, y)^p + \mathrm{mass}_2(x, y)^p + \mathrm{mass}_3(x, y)^p}$$
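The definition on this slide can be made concrete with a short sketch. It assumes that the region covering two instances in dimension j is the closed interval between their j-th coordinates, and that the probability mass is the fraction of data points inside that interval; the function name `mp_dissimilarity` is hypothetical, and the per-dimension masses are combined exactly as in the slide's three-dimensional expression, without further normalization.

```python
def mp_dissimilarity(x, y, data, p=2.0):
    """p-th root of the summed p-th powers of the per-dimension data mass.

    Assumption: the region covering x and y in dimension j is the closed
    interval [min(x_j, y_j), max(x_j, y_j)].
    """
    n = len(data)
    total = 0.0
    for j in range(len(x)):
        lo, hi = min(x[j], y[j]), max(x[j], y[j])
        # Fraction of all data points whose j-th coordinate lies in the interval.
        mass = sum(1 for row in data if lo <= row[j] <= hi) / n
        total += mass ** p
    return total ** (1.0 / p)


if __name__ == "__main__":
    # Six points packed into [0, 0.5] and two isolated points at 5.0 and 5.5.
    data = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [5.0], [5.5]]
    print(mp_dissimilarity([0.0], [0.5], data))  # pair in the dense region
    print(mp_dissimilarity([5.0], [5.5], data))  # pair in the sparse region
```

Both pairs are 0.5 apart geometrically, but the pair in the dense region covers 6 of the 8 points while the pair in the sparse region covers only 2, so the first pair is judged more dissimilar. This is the data-dependent behaviour the slide describes.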

8 m_p-dissimilarity (Continued)

Real world benchmark data
Size n: 1000-9100
Dimensions d: 188-10000
Classes c: 2-52

[Figure panels: kNN classification (accuracy); Relevance feedback based information retrieval (accuracy)]

ML using the mass based dissimilarity, which reflects the data distribution, outperforms conventional distance based ML for high dimensional data.

9 Structural regularization based ML and its application to bioinformatics (Joint Work with Max Planck Inst., Tubingen, Germany) (NIPS09, NIPS10, UAI13)

[Diagram: High dim. data; ML model estimation; Estimated parameters. Objective terms: Loss (error); Prior]

Graph structured regularization for classification and regression: this regularization term avoids excessively complex modeling.

Generally, this is a submodular optimization problem, which is NP-hard. It is cast as a minimum cost flow problem, which is solved efficiently by a parametric flow algorithm (Gallo et al., 1989).

10 Structural regularization based ML and its application to bioinformatics (Continued) (ISMB/ECCB13)

Modeling for phenotype prediction from SNPs

$$\min_{S \subseteq V} \; Q(S) + \lambda F(S)$$

[Figure labels: Objective phenotype; SNPs. By courtesy of D. Weigel]

Neighbouring and related genes are linked by prior knowledge.

Prediction Performance: FDR vs Power
A significant performance improvement is attained by the structural regularization.
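To make the set-function objective on this slide concrete, here is a toy sketch that minimizes Q(S) + λF(S) by brute force over all subsets of a five-node graph. The slide does not define Q or F, so both are assumptions made for illustration: Q(S) rewards selecting nodes whose association score exceeds a threshold `eta`, and F(S) is the total weight of graph edges cut between S and its complement. The function name `best_subset` and all parameter names are hypothetical, and exhaustive search stands in for the parametric flow algorithm mentioned in the deck, which is not implemented here.

```python
from itertools import combinations


def best_subset(scores, edges, lam=1.0, eta=0.5):
    """Exhaustively minimize Q(S) + lam * F(S) over all subsets S of the nodes.

    Assumed forms (not taken from the slides):
      Q(S) = sum over i in S of (eta - scores[i])
      F(S) = total weight of edges with exactly one endpoint in S
    """
    nodes = range(len(scores))

    def objective(subset):
        q = sum(eta - scores[i] for i in subset)
        f = sum(w for (i, j, w) in edges if (i in subset) != (j in subset))
        return q + lam * f

    candidates = (
        frozenset(c)
        for k in range(len(scores) + 1)
        for c in combinations(nodes, k)
    )
    return sorted(min(candidates, key=objective))


if __name__ == "__main__":
    # Chain graph 0 - 1 - 2 - 3 - 4 with unit edge weights.
    scores = [0.9, 0.4, 0.9, 0.1, 0.1]
    edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0)]
    print(best_subset(scores, edges, lam=0.0))  # no structural prior
    print(best_subset(scores, edges, lam=0.3))  # with the graph penalty
```

Without the graph term only the two strongly scoring nodes 0 and 2 are chosen. With the penalty switched on, the weakly scoring node 1 that links them is selected as well, because cutting both of its edges costs more than including it. Exhaustive search is exponential in the number of nodes and only serves to show the objective.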

11 Structural regularization based ML and its application to bioinformatics (Continued) (ISMB/ECCB13)

Computation time for ,000 SNPs (population of 200)
[Figure legend: Proposed method; Linear regression + Statistical test]

Our proposed method is two orders of magnitude faster than graph-lasso and nclasso. It is applicable to very high dimensional data.

12 Summary

The need to model high dimensional big data and to estimate target information from it is increasing.

Both the curse of dimensionality and the computational intractability of big data should be alleviated.

We are developing machine learning approaches that meet this need by introducing several principles: random sampling, model ensembles, structural regularization and fast algorithms.

13 Department of Reasoning for Intelligence, Division of Information and Quantum Sciences, The Institute of Scientific and Industrial Research, Osaka University

ISIR, Osaka University: Location and Members
