Leveraging Randomness in Structure to Enable Efficient Distributed Data Analytics


Leveraging Randomness in Structure to Enable Efficient Distributed Data Analytics

Jaideep Vaidya (jsvaidya@rbs.rutgers.edu)
Joint work with Basit Shafiq, Wei Fan, Danish Mehmood, and David Lorenzi

Distributed data analytics

Global analysis of data can lead to unexpected insights and results of significant value. Generally, however, data is distributed across several sources, and privacy and security concerns restrict sharing or centralizing it; local data analytics alone gives suboptimal utility.

Example: discover patterns of adverse events (ER admissions) and severe adverse events (death) among patients who are i) diagnosed with a co-occurring psychiatric disorder (e.g., schizophrenia) and opioid use disorder, and ii) taking the prescription drug Clozapine for at least 30 days.

[Figure: the relevant patient data is split, joined only by patient id (Pid), across a Pharmacy (Pid, Medicine, Prescription), a Mental Health Clinic (Pid, Diagnosis, Consultant), a Laboratory (Pid, Tests), a General Hospital (Pid, Age, Gender, Ward, Admission Date), and an Insurance Company (Pid, Coverage, Claim).]

HIPAA privacy law strictly prohibits release of personally identifiable health information of a patient, without consent, to non-covered entities: medical researchers, administrative staff, and healthcare professionals who are not directly involved in the diagnosis and treatment of the patient.

Distributed Data Analytics

Need to address privacy and security concerns:
- Perturbation-based solutions do not provide stringent privacy guarantees
- Secure multiparty computation based solutions are too inefficient to enable large-scale data analytics

Proposed solution:
- Focuses on decision tree-based learning tasks (building a data classification model)
- Uses randomization and cryptographic techniques for efficiency and security
- Based on random decision trees

Random Decision Trees (RDT)

- RDTs are used for multiple data mining tasks: classification, regression, ranking
- An RDT classifier is a set of multiple iso-depth trees of depth n/2, where n is the number of attributes
- The structure of an RDT is completely independent of the training data: attributes are randomly selected as nodes, and subtrees are recursively built until the maximum depth is reached
- The training data is used only to update the statistics of each node; the leaf nodes store the distribution of the class values
- The statistics of a tree can be updated for any addition to the training data
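As an illustration of this construction, here is a minimal Python sketch (my own, not the authors' code) using a dict-based tree over the PlayTennis attributes; the structure is built without looking at any data, and training examples only update leaf counts:

```python
import random

def build_rdt(attributes, max_depth, used=None):
    """Recursively build a random iso-depth tree; the structure never
    looks at the training data."""
    used = used or set()
    if max_depth == 0 or len(used) == len(attributes):
        return {"counts": {}}  # leaf: class counts, filled in later
    attr = random.choice([a for a in attributes if a not in used])
    return {"attr": attr,
            "children": {v: build_rdt(attributes, max_depth - 1, used | {attr})
                         for v in attributes[attr]}}

def update_statistics(node, example, label):
    """Route a training example to its leaf and update the class counts."""
    while "attr" in node:
        node = node["children"][example[node["attr"]]]
    node["counts"][label] = node["counts"].get(label, 0) + 1

# PlayTennis attributes; depth = n/2 = 2 for n = 4 attributes.
ATTRS = {"Outlook": ["Sunny", "Overcast", "Rainy"],
         "Temperature": ["Hot", "Mild", "Cool"],
         "Humidity": ["High", "Normal"],
         "Wind": ["Strong", "Weak"]}
tree = build_rdt(ATTRS, max_depth=len(ATTRS) // 2)
update_statistics(tree, {"Outlook": "Sunny", "Temperature": "Mild",
                         "Humidity": "Normal", "Wind": "Weak"}, "Yes")
```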

Example RDT

[Figure: an example RDT built over the 14-example PlayTennis training set (attributes Outlook, Temperature, Humidity, Wind; class Play Tennis). The randomly chosen root tests Wind (Strong/Weak); one child tests Humidity (High/Normal) and the other tests Outlook (Sunny/Rain/Overcast). Each leaf stores the class-value counts of the training examples that reach it, e.g. (1, 2), (2, 1), (3, 0), (2, 0).]

Example RDT

Suppose you want to decide whether it is a good day to play tennis when {Outlook=Sunny; Temperature=Mild; Humidity=Normal; Wind=Weak}:

- Class distribution from the first tree = (2, 0)
- Class distribution from the second tree = (1, 2)
- Overall class distribution = (1.5, 1), non-normalized
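The final prediction just averages the per-tree distributions; a tiny sketch, assuming each tree returns the (yes, no) counts stored at the leaf the instance reaches:

```python
def predict(distributions):
    """Average the per-tree class distributions (non-normalized)."""
    n = len(distributions)
    return tuple(sum(d[i] for d in distributions) / n
                 for i in range(len(distributions[0])))

# The two trees above returned (2, 0) and (1, 2):
print(predict([(2, 0), (1, 2)]))  # -> (1.5, 1.0), so predict "yes"
```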

Distributed Environment: Data Partitioning

Horizontally partitioned: each site holds complete records for a subset of the rows.

[Figure: the PlayTennis table split by rows. Site 1 holds (Sunny, Hot, High, Strong), (Sunny, Hot, High, Weak), (Overcast, Hot, High, Weak), and (Rainy, Mild, High, Weak); Site 2 holds (Rainy, Mild, Normal, Weak), (Sunny, Mild, Normal, Strong), and two copies of (Overcast, Hot, Normal, Weak), each row with its Play Tennis label.]

Distributed Environment: Data Partitioning

Vertically partitioned: each site holds a subset of the attributes (columns) for all rows. Our approach considers both horizontal and vertical partitioning; we discuss RDT-based classification for vertically partitioned data only.

[Figure: the PlayTennis table split by columns across three sites, e.g. Site 1 holding Outlook and Temperature, Site 2 holding Humidity and Windy, and Site 3 holding the Play Tennis class attribute.]

Privacy Requirement

The process of building the classification model, or of classifying an instance, must not leak any additional information to the participating sites beyond what they can learn from their local inputs and the final result.

Tool Used: Homomorphic Encryption
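As an illustration of the building block (the slides do not specify the exact scheme or key setup), here is a minimal sketch using the third-party python-paillier (`phe`) library, which implements an additively homomorphic cryptosystem:

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Each site encrypts its local value; anyone can add the ciphertexts,
# but only the private-key holder can decrypt the sum.
enc_a = public_key.encrypt(2)
enc_b = public_key.encrypt(3)
enc_sum = enc_a + enc_b            # homomorphic addition of ciphertexts
assert private_key.decrypt(enc_sum) == 5
```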

Approach for Vertically Partitioned Data

- All sites collaborate to build the RDT(s), but they do not share any information, not even basic attribute information
- Every site knows the total number of trees (m)
- All m trees are fully distributed across the sites
- The leaf nodes of the RDTs are stored at the site owning the class attribute

RDT Build Tree

[Figure: one distributed RDT. Each site stores only the nodes testing its own attributes, as a table of (token, attribute, pointer) entries, where pointers refer to tokens held at other sites: Site 3 holds the root S3L1 testing Wind (Strong -> S2L2, Weak -> S1L5); Site 2 holds S2L2 testing Humidity (High/Normal); Site 1 holds S1L5 testing Outlook (Sunny/Rain/Overcast). The leaf nodes, held at Site 3 as owner of the class attribute, are initialized with encrypted zero counts E_r(0).]

A sketch combining this node-table layout with the update step follows the next slide.

RDT Update Statistics

[Figure: updating the statistics with the training example (Outlook=Sunny, Temperature=Mild, Humidity=Normal, Wind=Strong). Starting at the root S3L1 (Wind), the Strong branch leads to S2L2 (Humidity) at Site 2, and the Normal branch leads to leaf S3L4 back at Site 3, whose encrypted count for the example's class is updated from E_r(0) to E_r(1).]
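Putting the two previous slides together, here is a hypothetical, self-contained sketch (all tokens and helper names are illustrative, not the paper's protocol messages): the instance hops between per-site node tables until it reaches a leaf at the class-attribute owner, whose encrypted count is incremented by homomorphically adding an encryption of 1:

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# site -> {token: node}; leaves have attr=None and encrypted class counts.
# (Other branches, e.g. S3L3 and S1L5, are omitted for brevity.)
tables = {
    "S3": {"S3L1": {"attr": "Wind",
                    "children": {"Strong": "S2L2", "Weak": "S1L5"}},
           "S3L4": {"attr": None,
                    "counts": [public_key.encrypt(0), public_key.encrypt(0)]}},
    "S2": {"S2L2": {"attr": "Humidity",
                    "children": {"High": "S3L3", "Normal": "S3L4"}}},
}

def route_and_update(token, instance, class_index):
    """Follow the instance across sites; add E(1) to the leaf's count."""
    while True:
        node = tables[token[:2]][token]   # "S3L1" -> Site 3's table
        if node["attr"] is None:
            node["counts"][class_index] += public_key.encrypt(1)
            return
        token = node["children"][instance[node["attr"]]]

route_and_update("S3L1", {"Wind": "Strong", "Humidity": "Normal"}, 0)
print(private_key.decrypt(tables["S3"]["S3L4"]["counts"][0]))  # -> 1
```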

RDT Instance Classification

Instance classification proceeds in a distributed fashion, similar to update statistics.

[Figure: classifying the instance (Outlook=Rain, Temperature=Mild, Humidity=Normal, Wind=Strong). The traversal again runs S3L1 (Wind, Strong) -> S2L2 (Humidity, Normal) -> leaf S3L4, which this time returns its encrypted class distribution, e.g. E(2), E(1).]

RDT Instance Classification, cont.

For the given instance:
- All the RDTs are traversed to obtain the encrypted class distribution vectors
- These encrypted vectors are added together homomorphically to get the sum
- The sum is decrypted by exactly k sites (k-threshold decryption) and averaged to get the predicted class distribution

For example, with per-tree encrypted distributions (E(a), E(b)), (E(c), E(d)), and (E(e), E(f)), homomorphic addition yields E(a+c+e) and E(b+d+f), which the decrypting sites D_1, ..., D_k jointly decrypt to a+c+e and b+d+f.
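A minimal sketch of this aggregation, again with `phe`; note the slides call for k-threshold decryption, for which a single key pair stands in here purely for illustration:

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Encrypted (yes, no) distributions from three trees: (a,b), (c,d), (e,f).
per_tree = [(public_key.encrypt(2), public_key.encrypt(0)),
            (public_key.encrypt(1), public_key.encrypt(2)),
            (public_key.encrypt(3), public_key.encrypt(0))]

enc_yes = sum((yes for yes, _ in per_tree), public_key.encrypt(0))  # E(a+c+e)
enc_no = sum((no for _, no in per_tree), public_key.encrypt(0))     # E(b+d+f)

m = len(per_tree)
print(private_key.decrypt(enc_yes) / m,   # averaged class distribution
      private_key.decrypt(enc_no) / m)
```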

Analysis

Security Analysis

- The tree structure is not known to any site
- The process of building the RDT model does not reveal any information
- Instance classification reveals some limited information: which parties are traversed for a given classification instance
- Since each new instance is also distributed, it is impossible to reconstruct the entire tree without the help of all parties

Conclusion

- Proposed an algorithm for distributed privacy-preserving RDTs
- Leveraged the fact that randomness in structure can provide strong privacy with less computation
- Experimental results show that our algorithm scales linearly with dataset size and requires significantly less time than alternative cryptographic approaches