A Statistical Framework for Analysing Big Data Global Conference on Big Data for Official Statistics October, 2015 by S Tam, Chief


A Statistical Framework for Analysing Big Data
Global Conference on Big Data for Official Statistics, 20-22 October 2015
S Tam, Chief Methodologist, Australian Bureau of Statistics

Big Data (BD) Issues
- Characterise BD with big N, big p and big t
- Sampling error
  - Reduced by increasing the sample size
- Bias
  - Coverage bias: the Big Data population is not the population of interest
  - Self-selection bias: are their views representative of the silent population segments?
  - Representation bias: multiple representation
- Measurement error
  - Are the data related to the concept of interest?
- Increasing the sample size does NOT reduce nonsampling errors
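The last point can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the population (standard normal, true mean 0) and the selection rule (units with positive values three times as likely to enter the Big Data source) are hypothetical choices made only for illustration. Under this rule the sample mean settles near 0.4 rather than 0, however large n is, while the spread across repetitions shrinks with n.

```python
import random
import statistics


def simulate_estimates(n, reps=100, seed=0):
    """Return (average estimate, spread of estimates) for self-selected samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = []
        while len(sample) < n:
            y = rng.gauss(0.0, 1.0)  # population of interest has true mean 0
            # Hypothetical self-selection: positive values are three times
            # as likely to show up in the Big Data source.
            p_include = 0.75 if y > 0 else 0.25
            if rng.random() < p_include:
                sample.append(y)
        estimates.append(statistics.fmean(sample))
    # The true mean is 0, so the average estimate is itself the bias.
    return statistics.fmean(estimates), statistics.stdev(estimates)


for n in (100, 1000):
    bias, spread = simulate_estimates(n)
    print(f"n={n}: bias={bias:.3f}, sampling spread={spread:.3f}")
```

The sampling spread falls as n grows, but the bias does not move, which is the sense in which a larger sample does not reduce nonsampling error.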

Methods overview
- Domain (e.g. crop) modelling
- Machine learning methods
  - Decision Trees
  - Artificial Neural Networks
  - Support Vector Machines
  - Nearest Neighbour
  - Ensemble classifiers
- Statistical modelling
  - Spatial-Temporal models
  - Use space and time information
  - Used in ABS modelling for satellite imagery and simulated phone data

ABS method overview
- Two-Step Approach: Calibration and Prediction
- Use a sample to calibrate the Big Data (treated as "covariates") using ground truths/measurements
- Calibrate using a linear model with time-varying coefficients (Dynamic Model)
- Estimate parameters (using Frequentist/Bayesian approaches)
- Predict the non-sampled values using the covariates
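The two steps can be sketched in a few lines. This is a simplified illustration, not the ABS implementation: it assumes a single covariate and a static (not time-varying) linear model fitted by ordinary least squares, and the function names `fit_line` and `predict_total` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_total(x_all, sample_idx, y_sample):
    """Estimate the population total of y from Big Data covariates x_all.

    Step 1 (calibration): fit the model on sampled units with ground truth.
    Step 2 (prediction): predict y for every non-sampled unit from its covariate.
    """
    x_sample = [x_all[i] for i in sample_idx]
    intercept, slope = fit_line(x_sample, y_sample)
    sampled = set(sample_idx)
    predicted = sum(intercept + slope * x
                    for i, x in enumerate(x_all) if i not in sampled)
    # Observed values for the sample plus model predictions for the rest.
    return sum(y_sample) + predicted


# Toy data: the covariate is known for all six units, ground truth for three.
x_all = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
total = predict_total(x_all, sample_idx=[0, 2, 5], y_sample=[2.0, 8.0, 17.0])
print(total)
```

In the toy data the ground truths lie exactly on y = 2 + 3x, so the fitted line recovers the relationship and the predicted total matches the total that would be observed if every unit were measured.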

Selectivity - Missing Covariate challenges
- The selectivity bias issue
  - When the Big Data population only intersects with the inference population, Big Data covariates are missing
- Two problems
  - Bias in estimating beta
  - Covariates are missing
- Can we just ignore the selectivity bias issue?
- Big Data bias and statistical modelling

When can selectivity be ignored?
- Inference on finite population values
- Modelling the missing processes
- Sampling and missing data processes can be ignored if these processes do not depend on the unobserved population values
  - Fulfilled if:
    - the training data set is selected by probability sampling
    - there is no missing data, or the missing data are missing at random
- The Big Data process can be ignored if this process does not depend on the missing covariates
  - This condition is difficult to check
  - If the condition is not satisfied, modelling of the missing covariates (a difficult task) will be needed

Predicting crop types from Satellite Imagery
- Satellite imagery data at time = t
- Big Data: reflectance data from 7 frequency bands from satellite imagery
- Statistical models require ground truths/measurements

Some interesting observations
- Satellite imagery: no undercoverage (missing covariates) issues
- Enhanced vegetation index (EVI) plotted over the growing season
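The slide does not state how EVI is computed. For reference, the definition commonly used for MODIS products combines the near-infrared, red and blue surface reflectances as below; whether the ABS work used exactly these coefficients is an assumption.

```latex
\mathrm{EVI} = G \cdot
  \frac{\rho_{\mathrm{NIR}} - \rho_{\mathrm{red}}}
       {\rho_{\mathrm{NIR}} + C_1\,\rho_{\mathrm{red}} - C_2\,\rho_{\mathrm{blue}} + L},
\qquad G = 2.5,\quad C_1 = 6,\quad C_2 = 7.5,\quad L = 1
```

Here each \rho is the reflectance in the named band, so EVI is a function of three of the seven reflectance bands in the satellite data.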

What covariates to use?
- This is where crop science comes in
- Possible covariate curves
  - Land surface temperature curve
  - Moisture curve
  - Growth curves etc.
- Want to use the whole curve
  - Need a method to pick finite points
  - Functional Data Analysis: pick the points which explain most of the variation of the curve
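One way to make "pick the points which explain most of the variation" concrete is a discretised principal component analysis of the curves. The sketch below is an assumption about how this could be done, not the method used in the talk: curves are sampled on a common time grid, the leading principal component is found by power iteration, and the time points with the largest absolute loadings are kept. The names `leading_component` and `top_points` are hypothetical.

```python
def leading_component(curves, iters=200):
    """First principal component of curves sampled on a common time grid.

    Returns (loadings, share of total variance explained by the component).
    """
    n, m = len(curves), len(curves[0])
    means = [sum(c[j] for c in curves) / n for j in range(m)]
    centred = [[c[j] - means[j] for j in range(m)] for c in curves]
    # Sample covariance between every pair of time points.
    cov = [[sum(r[j] * r[k] for r in centred) / (n - 1) for k in range(m)]
           for j in range(m)]
    # Power iteration; assumes the start vector is not orthogonal to the
    # leading eigenvector and that the curves are not all identical.
    v = [1.0] * m
    for _ in range(iters):
        w = [sum(cov[j][k] * v[k] for k in range(m)) for j in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    cov_v = [sum(cov[j][k] * v[k] for k in range(m)) for j in range(m)]
    eigenvalue = sum(v[j] * cov_v[j] for j in range(m))
    total_variance = sum(cov[j][j] for j in range(m))
    return v, eigenvalue / total_variance


def top_points(loadings, k):
    """Indices of the k time points with the largest absolute loadings."""
    ranked = sorted(range(len(loadings)), key=lambda j: -abs(loadings[j]))
    return sorted(ranked[:k])


# Toy curves that differ only at time points 2 and 3.
curves = [
    [1.0, 1.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 2.0, 1.0],
    [1.0, 1.0, 2.0, 4.0, 1.0],
    [1.0, 1.0, 3.0, 6.0, 1.0],
]
loadings, share = leading_component(curves)
selected = top_points(loadings, k=2)
print(selected, round(share, 3))
```

A full functional data analysis would first smooth each curve with a basis expansion; the discretised version above only shows the variance-ranking idea.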

Data, Static Models and Accuracy
- The training data
- Models and accuracy

Temporal Modelling
- Data over time
- Dynamic models

The Algorithms
- Dynamic Linear Model
- Dynamic Logistic Regression Model
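The slide names the two models without writing them out. A standard textbook form is given below as a sketch; the random-walk evolution of the coefficients and the Gaussian error terms are assumptions, since the transcript does not give the exact specification used by the ABS.

```latex
\begin{aligned}
\text{Dynamic linear model:}\quad
  y_t &= x_t^{\top}\beta_t + \varepsilon_t, & \varepsilon_t &\sim N(0, \sigma^2),\\
  \beta_t &= \beta_{t-1} + \omega_t, & \omega_t &\sim N(0, W),\\[4pt]
\text{Dynamic logistic regression:}\quad
  \operatorname{logit}\Pr(y_t = 1) &= x_t^{\top}\beta_t, & \beta_t &= \beta_{t-1} + \omega_t .
\end{aligned}
```

Here x_t holds the Big Data covariates at time t (for example, reflectance-based measures), y_t is the ground truth, and \beta_t are the time-varying coefficients mentioned in the ABS method overview.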

Concluding remarks
- These methods apply to a large number of Big Data applications
  - For example, mobile phone data
- The challenge (cost, feasibility etc.) is availability of ground truths/measurements
- For official statistics, there is a role for survey sampling even with Big Data

References
- Tam, S & Clarke, F (2015). Big Data, Official Statistics and Some Initiatives of the ABS, International Statistical Review, to appear.
- Tam, S (2015). A Statistical Framework for Analysing Big Data, June 2015, cat. no. 1351.0.55.056, ABS, Canberra.

Contact: siu-ming.tam@abs.gov.au