Leveraging Randomness in Structure to Enable Efficient Distributed Data Analytics Jaideep Vaidya (jsvaidya@rbs.rutgers.edu) Joint work with Basit Shafiq, Wei Fan, Danish Mehmood, and David Lorenzi
Distributed data analytics
Global analysis of data can lead to unexpected insights and results of significant value. Generally, data is distributed across several sources. Privacy and security concerns restrict sharing or centralization of data, and local data analytics alone gives suboptimal utility.
Example: discover patterns of adverse events (ER admissions) and severe adverse events (death) among patients that are: i) diagnosed with a co-occurring psychiatric disorder (e.g., Schizophrenia) and opioid use disorder; and ii) taking the prescription drug Clozapine for at least 30 days.
[Figure: patient data linked by Pid across sites: Pharmacy (Medicine), Mental Health Clinic (Diagnosis, Prescription, Consultant), Laboratory (Tests), General Hospital (Age, Gender, Ward, Admission Date), Insurance Company (Coverage, Claim).]
The HIPAA Privacy Rule strictly prohibits release of personally identifiable health information of a patient, without consent, to non-covered entities: medical researchers, administrative staff, and healthcare professionals who are not directly involved in the diagnosis and treatment of the patient.
Distributed Data Analytics
Need to address privacy and security concerns: perturbation-based solutions do not provide stringent privacy, and secure multiparty computation based solutions are too inefficient to enable large-scale data analytics.
Proposed solution: focuses on decision tree-based learning tasks (building a data classification model); uses randomization and cryptographic techniques for efficiency and security: random decision trees.
Random Decision Trees (RDT)
RDTs are used for multiple data mining tasks: classification, regression, ranking.
RDTs are multiple iso-depth trees, with depth = n/2 (where n is the number of attributes).
The structure of an RDT is completely independent of the training data: attributes are randomly selected as nodes, and subtrees are recursively built until the maximum depth is reached.
The training data is used only to update the statistics of each node; the leaf nodes store the distribution of the class values.
The statistics of the tree can be updated for any addition to the training data.
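As a rough illustration of these properties, here is a minimal single-RDT sketch (function names like `build_rdt` and `update_stats` are our own, not from the talk): the tree structure is drawn purely at random, and the training data only ever touches the leaf counts.

```python
import random

def build_rdt(attributes, values, depth):
    """Recursively pick a random attribute per node; data-independent."""
    if depth == 0 or not attributes:
        return {"counts": {}}                     # leaf: class-value counts
    attr = random.choice(attributes)
    rest = [a for a in attributes if a != attr]
    return {"attr": attr,
            "children": {v: build_rdt(rest, values, depth - 1)
                         for v in values[attr]}}

def update_stats(node, row, label):
    """Route a training row to its leaf and increment the class count."""
    while "attr" in node:
        node = node["children"][row[node["attr"]]]
    node["counts"][label] = node["counts"].get(label, 0) + 1

def classify(node, row):
    """Return the (non-normalized) class distribution at the reached leaf."""
    while "attr" in node:
        node = node["children"][row[node["attr"]]]
    return node["counts"]

values = {"Outlook": ["Sunny", "Overcast", "Rainy"],
          "Humidity": ["High", "Normal"],
          "Wind": ["Strong", "Weak"],
          "Temperature": ["Hot", "Mild", "Cool"]}
tree = build_rdt(list(values), values, depth=len(values) // 2)  # depth = n/2
update_stats(tree, {"Outlook": "Sunny", "Humidity": "Normal",
                    "Wind": "Weak", "Temperature": "Mild"}, "yes")
```

In the actual scheme, m such trees are built and their leaf distributions are averaged at classification time.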
Example RDT
[Figure: an example RDT built over the Play Tennis training data (attributes Outlook, Temperature, Humidity, Wind; class Play Tennis). Internal nodes are randomly chosen attributes (Wind at the root, with Humidity and Outlook below); each leaf stores a class-count distribution such as (1, 2), (2, 1), (3, 0), or (2, 0).]
Example RDT
Suppose you want to decide whether it is a good day to play tennis when: {Outlook=Sunny; Temperature=Mild; Humidity=Normal; Wind=Weak}.
One tree gives class distribution = (2, 0); another gives class distribution = (1, 2).
Overall (averaged, non-normalized) class distribution = (1.5, 1).
Distributed Environment
Data Partitioning: horizontally partitioned. Each site holds complete records for a subset of the rows.
[Figure: Site 1 and Site 2 each hold rows of the Play Tennis table (Outlook, Temperature, Humidity, Windy, Play Tennis); e.g., Site 1 holds rows such as (Sunny, Hot, High, Strong) and Site 2 rows such as (Rainy, Mild, Normal, Weak).]
Distributed Environment
Data Partitioning: vertically partitioned. Each site holds a subset of the attributes for all records.
Our approach considers both horizontal and vertical partitioning; here we discuss RDT-based classification for vertically partitioned data only.
[Figure: the Play Tennis table split by columns across Site 1, Site 2, and Site 3, with the Play Tennis class attribute held at one site.]
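To make the two partitionings concrete, a small sketch (the toy rows and the site-to-attribute assignment are illustrative, not the talk's exact dataset):

```python
# Toy table: one dict per record (illustrative values).
rows = [
    {"Outlook": "Sunny",    "Humidity": "High",   "Windy": "Strong", "Play": "no"},
    {"Outlook": "Overcast", "Humidity": "High",   "Windy": "Weak",   "Play": "yes"},
    {"Outlook": "Rainy",    "Humidity": "Normal", "Windy": "Weak",   "Play": "yes"},
    {"Outlook": "Sunny",    "Humidity": "Normal", "Windy": "Strong", "Play": "yes"},
]

# Horizontal partitioning: each site holds complete records for some rows.
site1_h, site2_h = rows[:2], rows[2:]

# Vertical partitioning: each site holds some attributes for all rows.
def project(rows, attrs):
    return [{a: r[a] for a in attrs} for r in rows]

site1_v = project(rows, ["Outlook"])            # Site 1: Outlook
site2_v = project(rows, ["Humidity", "Windy"])  # Site 2: Humidity, Windy
site3_v = project(rows, ["Play"])               # Site 3: the class attribute
```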
Privacy Requirement
The process of building the classification model, or of classifying an instance, must not leak any additional information to the participating sites beyond what they can learn from their local inputs and the final result.
Tool Used
Homomorphic encryption.
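This excerpt names homomorphic encryption but not a specific scheme; Paillier is a standard additively homomorphic choice, sketched below with toy-sized primes (insecure, for illustration only; real deployments use 2048-bit moduli and a vetted library):

```python
import math, random

# Toy Paillier keypair (deliberately insecure sizes).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)            # valid simplification because g = n + 1

def encrypt(m):
    r = random.randrange(1, n)  # fresh randomness per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    L = (pow(c, lam, n2) - 1) // n
    return L * mu % n

# Additive homomorphism: multiplying ciphertexts adds plaintexts.
c = encrypt(3) * encrypt(4) % n2
assert decrypt(c) == 7
```

This additive property is what lets leaf counts and class distribution vectors be summed without any site seeing the plaintext values.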
Approach for Vertically Partitioned Data
All sites collaborate to build the RDT(s), but they do not share any information, not even basic attribute information.
Every site knows the total number of trees (m).
All m trees are fully distributed; the leaf nodes of the RDTs are stored at the site owning the class attribute.
RDT Build Tree
[Figure: a distributed RDT. Site 3 holds root token S3L1 (attribute Wind, branches Strong/Weak); Site 2 holds token S2L2 (Humidity, branches High/Normal); Site 1 holds token S1L5 (Outlook, branches Sunny/Rain/Overcast). Each site stores a local token table (Token, Attribute, Pointer), and each leaf (S3L3, S3L4, S3L6, S3L7, S3L8) is initialized with encrypted zero counts E_r(0).]
RDT Update Statistics
[Figure: updating statistics for the training instance {Outlook=Sunny, Temperature=Mild, Humidity=Normal, Wind=Strong}. Starting at root token S3L1 (Wind) at Site 3, the Strong branch leads to token S2L2 (Humidity) at Site 2; the Normal branch leads to leaf S3L4 at Site 3, whose encrypted count is updated from E_r(0) to E_r(1).]
RDT Instance Classification
Instance classification proceeds in a distributed fashion, similar to updating statistics.
[Figure: classifying the instance {Outlook=Rain, Temperature=Mild, Humidity=Normal, Wind=Strong}: the traversal hops from root token S3L1 (Wind, Strong branch) at Site 3 to S2L2 (Humidity, Normal branch) at Site 2, reaching leaf S3L4 with encrypted class distribution E(2), E(1).]
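The token-based hopping between sites can be sketched as follows. The table layout and leaf counts here are our simplification: plain integers stand in for the encrypted counts E_r(.), and the branch assignments mirror the example tree.

```python
# Each site's token table: token -> (attribute, {value: next_token}) for
# internal nodes, or token -> ("LEAF", counts) for leaves. A token's first
# two characters name the site that owns it (e.g., "S3L1" lives at Site 3).
sites = {
    "S3": {"S3L1": ("Wind",    {"Strong": "S2L2", "Weak": "S1L5"}),
           "S3L3": ("LEAF", [0, 0]), "S3L4": ("LEAF", [2, 1]),
           "S3L6": ("LEAF", [0, 0]), "S3L7": ("LEAF", [0, 0]),
           "S3L8": ("LEAF", [0, 0])},
    "S2": {"S2L2": ("Humidity", {"High": "S3L3", "Normal": "S3L4"})},
    "S1": {"S1L5": ("Outlook",  {"Sunny": "S3L6", "Rain": "S3L7",
                                 "Overcast": "S3L8"})},
}

def traverse(root, instance):
    """Follow tokens from site to site until a leaf is reached."""
    token = root
    while True:
        attr, payload = sites[token[:2]][token]
        if attr == "LEAF":
            return token, payload
        token = payload[instance[attr]]   # hop to the next site's token
```

Both update-statistics and classification use this traversal; no single site sees the whole path, only the token it is handed and the one it forwards.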
RDT Instance Classification, cont.
For the given instance:
All the RDTs are traversed to get the encrypted class distribution vectors.
These encrypted class distribution vectors are homomorphically added together to get the sum.
The sum is decrypted by exactly k sites (k-threshold decryption) and averaged to get the predicted class distribution.
[Figure: per-tree encrypted vectors E(a), E(b); E(c), E(d); E(e), E(f) are combined into E(a+c+e) and E(b+d+f), then jointly decrypted by sites D_1 ... D_k to yield a+c+e and b+d+f.]
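Numerically, the final step is an elementwise sum followed by an average over the m trees; in the protocol each addition happens under encryption. A sketch with plain integers standing in for the ciphertexts E(.) (the sample vectors are illustrative):

```python
# Per-tree class distribution vectors, e.g. (a, b), (c, d), (e, f).
per_tree = [(2, 0), (1, 2), (3, 1)]

# Homomorphic step: componentwise sum -> (a+c+e, b+d+f).
total = [sum(col) for col in zip(*per_tree)]

# After k-threshold decryption, average over the m trees.
m = len(per_tree)
predicted = [t / m for t in total]
print(total, predicted)   # [6, 3] [2.0, 1.0]
```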
Analysis
Security
The tree structure is not known to any site.
The process of building the RDT model does not reveal any information.
Instance classification reveals some limited information: which parties are traversed for a given classification instance.
Since each new instance is also distributed, it is impossible to reconstruct the entire tree without the help of all parties.
Conclusion
Proposed an algorithm for distributed privacy-preserving RDTs.
Leverages the fact that randomness in structure can provide strong privacy with less computation.
Experimental results show that our algorithm: scales linearly with dataset size; requires significantly less time than alternative cryptographic approaches.