Statistical Privacy For Privacy Preserving Information Sharing


Statistical Privacy For Privacy Preserving Information Sharing
Johannes Gehrke, Cornell University, http://www.cs.cornell.edu/johannes
Joint work with: Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal

Introduction To operate, an e-business needs to query data owned by clients or other businesses. The owners are concerned about the privacy of their data, so they will not ship all of it to one server. We want algorithms that efficiently evaluate multi-party queries while disclosing as little extra information as possible.

Introduction [Figure: Companies A, B, C hold databases D_A, D_B, D_C; customers 1..N contribute records r_1, r_2, ..., r_N.]

[Figure: Same setting; the goal is to find aggregate properties of {r_1, r_2, ..., r_N}.]

[Figure: Same setting; the goal is to compute the join T_A ⋈ T_B = ? of tables T_A ⊆ D_A and T_B ⊆ D_B.]

Privacy and Disclosure [Figure: a spectrum of disclosure. At one end, computationally private methods (hard to make useful); at the other, statistically private methods (too fuzzy or unlikely); in between, a region of tolerated disclosure.]

The computationally private end is the most studied. Yao [1986]: any two-party data operation can be made computationally private, if the operation is converted into a Boolean circuit.

Today's talk addresses the statistically private end of the spectrum.

Talk Outline Introduction Randomization and privacy for association rules Amplification: upper bound on breaches Experimental results Conclusion

Privacy Preserving Associations We have one server and many clients. Each client has a private transaction (a set of items), for example product preferences. The server wants to find frequent subsets of items (aggregate statistical information). Each client wants to hide its transaction from the server.

Privacy Preserving Associations Find associations among items in {t_1, t_2, ..., t_N}. An association is an itemset with frequency ≥ s_min. [Figure: Company D collects transactions t_1, t_2, ..., t_N from customers 1..N.]

Privacy Preserving Associations Let T be the set of all transactions, and t ∈ T a transaction. An itemset A has support (frequency) s in T if s = supp(A) = |{t ∈ T : A ⊆ t}| / |T|. Itemset A is frequent if s ≥ s_min. Antimonotonicity: if A ⊆ B, then supp(A) ≥ supp(B). Association rule A ⇒ B holds when the union A ∪ B is frequent and supp(A ∪ B) / supp(A) ≥ conf_min.
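
To make these definitions concrete, here is a minimal Python sketch (ours, not from the talk; the toy transactions and thresholds are invented for illustration):

```python
def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a"}]
s_min, conf_min = 0.5, 0.6                      # illustrative thresholds
A, B = {"a"}, {"b"}

s_union = supp(A | B, transactions)             # supp(A ∪ B) = 0.5
frequent = s_union >= s_min                     # A ∪ B is frequent
confidence = s_union / supp(A, transactions)    # 0.5 / 0.75 ≈ 0.67
rule_holds = frequent and confidence >= conf_min
print(frequent, round(confidence, 2), rule_holds)   # True 0.67 True
```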

Randomization Overview [Figure, built up over four slides: Alice ("J.S. Bach, painting, nasa.gov, ..."), Bob ("B. Spears, baseball, cnn.com, ..."), and Chris ("B. Marley, camping, linux.org, ...") each send their profile to a Recommendation Service, which mines associations and returns recommendations. In the final build, each profile is randomized before it leaves the client, e.g. Bob's becomes "B. Spears, soccer, bbc.co.uk, ..." and Chris's becomes "Metallica, painting, nasa.gov, ..." / "B. Marley, camping, microsoft.com, ...", and the service runs support recovery over the randomized profiles.]

The Problem How do we randomize transactions so that we can find frequent itemsets while preserving privacy at the transaction level?

Randomization Example A randomization may look strong but still fail to hide some items of an individual transaction. Example randomization: given a transaction, keep each item with 20% probability, and replace it with a new random item with 80% probability.
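
As an illustration only, a Python sketch of this example randomization (the 10,000-item universe matches the example on the next slide; the names `randomize` and `N_ITEMS` are ours):

```python
import random

N_ITEMS = 10_000  # assumed item universe {0, ..., 9999}

def randomize(t, p_keep=0.2):
    """Keep each item with prob. p_keep, else replace it with a random item."""
    out = set()
    for item in t:
        if random.random() < p_keep:
            out.add(item)                       # keep the true item
        else:
            out.add(random.randrange(N_ITEMS))  # replace with a random item
    return out                                  # (collisions may shrink the set)
```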

Example: {a, b, c} Take 10 M transactions of size 10 over 10 K items: 1% have {a, b, c}; 5% have only two of the three ({a, b}, {a, c}, or {b, c}); 94% have one or zero items of {a, b, c}. After randomization, how many randomized transactions contain {a, b, c}?

For the 1% with {a, b, c}: all three items survive with probability 0.2^3, i.e. 0.008% of all transactions, about 800. For the 5% with two of the three: probability 0.2^2 · (8 · 0.8 / 10,000), i.e. 0.000128%, about 13 transactions. For the 94% with one or zero: at most 0.2 · (9 · 0.8 / 10,000)^2, i.e. less than 0.00002%, under 2 transactions. So among the randomized transactions containing {a, b, c}, about 98.2% come from originals containing {a, b, c}, about 1.6% from two-item originals, and about 0.2% from the rest.

Example: {a, b, c} Given nothing, we have only 1% probability that {a, b, c} occurs in the original transaction. Given {a, b, c} in the randomized transaction, we have about 98% certainty of {a, b, c} in the original one. This is what we call a privacy breach. The example randomization preserves privacy on average, but not in the worst case.
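
The slide's arithmetic can be reproduced directly; a small Python sketch (ours, not the authors' code) of the posterior computation:

```python
# Transaction size 10, 10,000 items; each item kept with prob. 0.2,
# otherwise replaced by a uniformly random item.
n_items, keep = 10_000, 0.2
insert = (1 - keep) / n_items          # prob. a slot turns into one fixed item

p3 = keep ** 3                         # originals with {a,b,c}: all three kept
p2 = keep ** 2 * 8 * insert            # two kept, third inserted by one of 8 slots
p1 = keep * (9 * insert) ** 2          # upper bound: one kept, two inserted

N = 10_000_000
counts = [N * 0.01 * p3, N * 0.05 * p2, N * 0.94 * p1]   # ≈ [800, 13, <2]
posterior = counts[0] / sum(counts)
print(round(posterior, 3))             # ≈ 0.983: about 98% certainty, a breach
```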

Privacy Breaches Suppose the adversary wants to know whether z ∈ t, where t is an original transaction, t′ is the corresponding randomized transaction, and A is a (frequent) itemset with z ∈ A. Itemset A causes a privacy breach of level β (e.g. 50%) if Prob[z ∈ t | A ⊆ t′] ≥ β. Knowledge of A ⊆ t′ makes the adversary's estimate jump from Prob[z ∈ t] to Prob[z ∈ t | A ⊆ t′].

Talk Outline Introduction Randomization and privacy for association rules Amplification: upper bound on breaches Experimental results Conclusion

Generalized Approach We want a bound for all privacy breaches, not only for "item ∈ t" versus "itemset ⊆ t′". No knowledge of the data distribution is required in advance: we don't have to know Prob[item ∈ t]. The approach is applicable to numerical data as well, and is easy to work with, even for complex randomizations.

Our Model Randomization operator: y = R(x), where x is the original (private) data and y is the randomized data. Assumptions: x is described by a random variable X, each client is independent, and y is described by the random variable Y = R(X).

α-to-β Privacy Breach Let P(x) be any property of the client's private data, and let 0 < α < β < 1 be two probability thresholds. Example: P(x) = "transaction x contains {a, b, c}", with α = 1% and β = 50%.

α-to-β Privacy Breach [Figure, built up over three slides: the client holds X = x and sends y = R(x) to the server. Before seeing y, the server's prior satisfies Prob[P(X)] ≤ α; after seeing Y = y, its posterior satisfies Prob[P(X) | Y = y] ≥ β.] Disclosure of y causes an α-to-β privacy breach w.r.t. property P(x).

α-to-β Privacy Breach Checking for α-to-β privacy breaches directly is hard: there are exponentially many properties P(x), and we have to know the data distribution in order to check whether Prob[P(X)] ≤ α and Prob[P(X) | Y = y] ≥ β. Is there a simple property of the randomization operator R that limits privacy breaches?

Amplification Condition [Figure, built up over four slides: histograms of the transition probabilities p[x → y] over outputs y = 1..10 (scale 0% to 20%) for two inputs, x_1 = 2 and x_2 = 8. The worst discrepancy between p[2 → y] and p[8 → y] over all y is highlighted.]

Amplification Condition Definition: randomization operator R is called at most γ-amplifying if max_{x_1, x_2} max_y p[x_1 → y] / p[x_2 → y] ≤ γ. The transition probabilities p[x → y] = Prob[R(x) = y] depend only on the operator R, not on the data. We assume that all y have a nonzero probability. The bigger γ is, the more may be revealed about x.
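
As a sketch (our illustration, using a hypothetical 3-input, 3-output operator given explicitly as a transition matrix), γ can be computed directly from the transition probabilities:

```python
from itertools import product

# Hypothetical transition matrix: p[x][y] = Prob[R(x) = y].
p = [
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]

def amplification_gamma(p):
    """Smallest γ with p[x1][y] / p[x2][y] <= γ for all x1, x2, y."""
    xs, ys = range(len(p)), range(len(p[0]))
    return max(p[x1][y] / p[x2][y] for x1, x2 in product(xs, xs) for y in ys)

print(amplification_gamma(p))   # 3.0: this operator is at most 3-amplifying
```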

The Bound on α-to-β Breaches Statement: if randomization operator R is at most γ-amplifying, and if γ < (β / α) · (1 − α) / (1 − β), then revealing R(X) to the server will never cause an α-to-β privacy breach. See the proof in [PODS 2003].

The Bound on α-to-β Breaches Examples: 5%-to-50% privacy breaches do not occur for γ < 19, since (0.5 / 0.05) · (1 − 0.05) / (1 − 0.5) = 19. 1%-to-98% privacy breaches do not occur for γ < 4851, since (0.98 / 0.01) · (1 − 0.01) / (1 − 0.98) = 4851. 50%-to-100% privacy breaches do not occur for any finite γ.
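
A small helper (ours) that evaluates this bound and reproduces the numbers above:

```python
def breach_bound(alpha, beta):
    """γ must stay below this value to rule out α-to-β privacy breaches."""
    return (beta / alpha) * (1 - alpha) / (1 - beta)

print(breach_bound(0.05, 0.50))   # 19.0   (5%-to-50%)
print(breach_bound(0.01, 0.98))   # ≈ 4851 (1%-to-98%)
```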

Amplification: Summary An α-to-β privacy breach w.r.t. property P(x) occurs when Prob[P is true] ≤ α but Prob[P is true | Y = y] ≥ β. The amplification methodology limits privacy breaches by looking only at the transition probabilities of the randomization; it does not use the data distribution: max_{x_1, x_2} max_y p[x_1 → y] / p[x_2 → y] ≤ γ.

Amplification In Practice: the select-a-size operator. Given transaction t of size m, construct t′ = R(t) as follows: (1) choose a number j ∈ {0, 1, ..., m} with distribution {p[j]}_{0..m}; (2) include exactly j items of t in t′; (3) include each other item (not from t) in t′ with probability ρ. The choice of {p[j]}_{0..m} and ρ is based on the desired privacy level. Example: t = {a, b, c, d, e, f, u, v, w}, j = 4, t′ = {b, e, u, w} plus items such as œ, å, ß, ψ, א, ъ, ђ inserted with probability ρ.

Talk Outline Introduction Randomization and privacy for association rules Amplification: upper bound on breaches Experimental results Conclusion

Lowest Discoverable Support LDS is the support level whose predicted value is 4σ away from zero. Roughly, LDS is proportional to 1 / √(# transactions). Setting: transaction size 5, with 5%-to-50% privacy breaches the worst allowed. [Chart: LDS (%) vs. number of transactions (1 to 100 million, log scale), for 1-, 2-, and 3-itemsets; LDS falls from about 1.2% toward zero as the number of transactions grows.]

LDS vs. Breach Threshold α* Setting: transaction size 5, 5 M transactions; α-to-50% privacy breaches are limited to α ≤ α*. [Chart: LDS (%, from 0 to 0.9) vs. privacy breach threshold α* from 1% to 10%, for 1-, 2-, and 3-itemsets.] Reminder: an α-to-50% breach occurs when Prob[P(t)] ≤ α and Prob[P(t) | R(t) = t′] ≥ 50%.

LDS vs. Transaction Size Setting: 5%-to-50% privacy breaches are the worst allowed, |T| = 5 M. [Chart: LDS (%, from 0 to 1.8) vs. transaction size from 3 to 10, for 1-, 2-, and 3-itemsets; LDS grows with transaction size.] Longer transactions are harder to use in support recovery.

Real datasets: soccer, mailorder. Soccer is the clickstream log of the WorldCup 98 web site, split into sessions of HTML requests: 11 K items (HTMLs), 6.5 M transactions; available at http://www.acm.org/sigcomm/ita/. Mailorder is a purchase dataset from a certain on-line store, with products replaced by their categories: 96 items (categories), 2.9 M transactions. A small fraction of transactions are discarded as too long: longer than 10 (for soccer) or 7 (for mailorder).

Restricted Privacy Breaches The real-data experiments used our older approach [KDD 2002]: we constrained only "z ∈ t versus A ⊆ t′" privacy breaches, i.e. restrictions of the form Prob[z ∈ t | A ⊆ t′] < β. The older approach used some (minimal) information about the data distribution to choose the randomization parameters.

Modified Apriori on Real Data Breach level β = 50%. Inserted 20-50% items into each transaction.

Soccer: s_min = 0.2%, σ ≈ 0.07% for 3-itemsets

Itemset Size      1    2    3
True Itemsets   266  217   48
True Positives  254  195   43
False Drops      12   22    5
False Positives  31   45   26

Mailorder: s_min = 0.2%, σ ≈ 0.05% for 3-itemsets

Itemset Size      1    2    3
True Itemsets    65  228   22
True Positives   65  212   18
False Drops       0   16    4
False Positives   0   28    5

False Drops and False Positives (counts of itemsets, by support range in %)

Soccer

Predicted supp when true supp ≥ 0.2%:
Size   <0.1  0.1-0.15  0.15-0.2  ≥0.2
1        0       2        10      254
2        0       5        17      195
3        0       1         4       43

True supp when predicted supp ≥ 0.2%:
Size   <0.1  0.1-0.15  0.15-0.2  ≥0.2
1        0       7        24      254
2        7      10        28      195
3        5      13         8       43

Mailorder

Predicted supp when true supp ≥ 0.2%:
Size   <0.1  0.1-0.15  0.15-0.2  ≥0.2
1        0       0         0       65
2        0       1        15      212
3        0       1         3       18

True supp when predicted supp ≥ 0.2%:
Size   <0.1  0.1-0.15  0.15-0.2  ≥0.2
1        0       0         0       65
2        0       0        28      212
3        1       2         2       18

Actual Privacy Breaches We verified actual privacy breach levels: the breach probabilities Prob[z ∈ t | A ⊆ t′] were counted in the datasets for frequent and near-frequent itemsets. With the right choice of randomization parameters, even worst-case breach levels fluctuated around 50%: at most 53.2% for soccer, at most 55.4% for mailorder.

Ongoing Research Using randomization and traditional secure multiparty computation together: privacy-preserving two-party join-size computation with sketches. What if we cannot guarantee the amplification condition? Probabilistic privacy breaches and amplification on average. Information theory and statistical privacy: a slightly modified information measure that provably bounds privacy breaches.

Future Work [Figure: the disclosure spectrum again — computationally private (hard to make useful), tolerated disclosure, statistically private (too fuzzy or unlikely), and irrelevant (safe to disclose); future work targets this last region.] Can statistical privacy be extended so that we can prove orthogonality between disclosure and sensitive questions?

Conclusion We defined privacy using statistics, not computational hardness. Randomization can guarantee statistical privacy, as demonstrated for association mining. A simple property of randomization operators provably bounds privacy breaches.

Thank You! Questions?