Differential Privacy and Pan-Private Algorithms. Cynthia Dwork, Microsoft Research

Similar documents
Differential Privacy

[Title removed for anonymity]

Lecture 11- Differential Privacy

Privacy-Preserving Data Mining

Report on Differential Privacy

Privacy in Statistical Databases

Privacy of Numeric Queries Via Simple Value Perturbation. The Laplace Mechanism

Database Privacy: k-anonymity and de-anonymization attacks

Differential Privacy: a short tutorial. Presenter: WANG Yuxiang

Answering Many Queries with Differential Privacy

Privacy in Statistical Databases

Calibrating Noise to Sensitivity in Private Data Analysis

What Can We Learn Privately?

Differential Privacy II: Basic Tools

CMPUT651: Differential Privacy

The Optimal Mechanism in Differential Privacy

How Well Does Privacy Compose?

Enabling Accurate Analysis of Private Network Data

Differential Privacy and its Application in Aggregation

A Note on Differential Privacy: Defining Resistance to Arbitrary Side Information

Safer Data Mining: Algorithmic Techniques in Differential Privacy

Differentially Private Machine Learning

Lecture 20: Introduction to Differential Privacy

Maryam Shoaran Alex Thomo Jens Weber. University of Victoria, Canada

The Optimal Mechanism in Differential Privacy

Accuracy First: Selecting a Differential Privacy Level for Accuracy-Constrained Empirical Risk Minimization

AN AXIOMATIC VIEW OF STATISTICAL PRIVACY AND UTILITY

Lecture 1 Introduction to Differential Privacy: January 28

Differential Privacy for Probabilistic Systems. Michael Carl Tschantz Anupam Datta Dilsun Kaynar. May 14, 2009 CMU-CyLab

Toward Privacy in Public Databases (Extended Abstract)

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha

Toward Privacy in Public Databases

Theory for Society. Cynthia Dwork, Microsoft Research

Acknowledgements. Today's Lecture. Attacking k-anonymity. Ahem Homogeneity. Data Privacy in Biomedicine: Lecture 18

1 Differential Privacy and Statistical Query Learning

Secure Multiparty Computation from Graph Colouring

Linear classifiers: Overfitting and regularization

Information-theoretic foundations of differential privacy

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Statistical Privacy For Privacy Preserving Information Sharing

Approximate and Probabilistic Differential Privacy Definitions

Quantitative Approaches to Information Protection

CIS 700 Differential Privacy in Game Theory and Mechanism Design January 17, Lecture 1

Differentially Private ANOVA Testing

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)

The Pennsylvania State University The Graduate School College of Engineering A FIRST PRINCIPLE APPROACH TOWARD DATA PRIVACY AND UTILITY

Estimation of the Click Volume by Large Scale Regression Analysis May 15, / 50

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

Click Prediction and Preference Ranking of RSS Feeds

Privacy-preserving Data Mining

Data Mining und Maschinelles Lernen

DP-Finder: Finding Differential Privacy Violations by Sampling and Optimization. Dana Drachsler-Cohen

Differentially-private Learning and Information Theory

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

The Exponential Mechanism (and maybe some mechanism design) Frank McSherry and Kunal Talwar Microsoft Research, Silicon Valley Campus

Linear Regression (continued)

The Power of Linear Reconstruction Attacks

Ad Placement Strategies

18.9 SUPPORT VECTOR MACHINES

CSE546: SVMs, Dual Formula5on, and Kernels Winter 2012

PAC-learning, VC Dimension and Margin-based Bounds

Towards a Systematic Analysis of Privacy Definitions

Chapter 15. Probability Rules! Copyright 2012, 2008, 2005 Pearson Education, Inc.

Differentially Private Sequential Data Publication via Variable-Length N-Grams

Andriy Mnih and Ruslan Salakhutdinov

The State Explosion Problem

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16

Jeffrey D. Ullman Stanford University

Differentially Private Linear Regression

Answering Numeric Queries

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

An Introduction to Machine Learning

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Boosting and Differential Privacy

Machine Learning

Boosting: Foundations and Algorithms. Rob Schapire

Privacy-preserving logistic regression

Differential Privacy and Robust Statistics

Markov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials)

Differentially Private Sequential Data Publication via Variable-Length N-Grams

An Algorithms-based Intro to Machine Learning

A Multiplicative Weights Mechanism for Privacy-Preserving Data Analysis

1 Distributional problems

Collaborative Filtering. Radek Pelánek

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

PAC-learning, VC Dimension and Margin-based Bounds

Multi-theme Sentiment Analysis using Quantified Contextual

On the Complexity of Differentially Private Data Release

Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 10

Linear Models for Regression

Randomized Decision Trees

Machine Learning

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

CSCI567 Machine Learning (Fall 2014)

Final Exam, Fall 2002

Trustworthy, Useful Languages for. Probabilistic Modeling and Inference

Introduction to Machine Learning. Regression. Computer Science, Tel-Aviv University,

Midterm: CS 6375 Spring 2015 Solutions

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Ad Placement Strategies

Transcription:

Differential Privacy and Pan-Private Algorithms Cynthia Dwork, Microsoft Research

A Dream? Original Database → Sanitization. Very vague and very ambitious: census, medical, educational, financial data, commuting patterns, web traffic; OTC drug purchases, query logs, social networking, …

Reality: Sanitization Can't Be Too Accurate. Overly accurate answers to too many questions lead to blatant non-privacy [Dinur-Nissim 03; DMT07, DY08, HT09, …]. Results may have been over-interpreted; subject of another talk. Ask me about this.

Success: Interactive Solutions. Multiple queries, adaptively chosen, e.g., n/polylog(n) queries with noise o(√n). Accuracy eventually deteriorates as the number of queries grows. Has led to intriguing non-interactive solutions. Both achieve differential privacy.

A Linkage Attack: using innocuous data in one dataset to identify a record in a different dataset containing both innocuous and sensitive data. Made famous by Sweeney, who linked an HMO data release to voter registration data. Netflix Prize training data linked to the IMDb [NS 07].

Many Other Successful Privacy Attacks. Against privacy by aggregation, including implementation by secure multi-party computation / secure function evaluation (it's not the how, it's the what): simple differencing. Against auditing/inference control [KMN05, KPR00]. Against tokenized search logs [KNPT07]. Against randomized IDs: analysis of the AOL dataset (a free-for-all); discovering relationships in social networks [BDK07]. Against k-anonymity [MGK06], which proposed l-diversity. Against l-diversity [XT07], which proposed m-invariance. But wait, there's more! [KS08]

What Is Going Wrong? Failure to account for auxiliary information (in vitro versus in vivo): other datasets, previous versions of the current dataset, innocuous inside information, etc. Literally ad hoc: the defined set of excluded ("bad") outcomes fails to capture certain privacy breaches. Needed: an ad omnia, achievable definition that composes automatically and obliviously with (past and future) information and databases.

Dalenius / Semantic Security. An ad omnia guarantee (Dalenius, 1977): anything that can be learned about a respondent from the statistical database can be learned without access to the database. Popular intuition: prior and posterior views about an individual shouldn't change too much. Problematic: my (incorrect) prior is that everyone has 2 left feet. Provably unachievable [DN06]. Shift focus to the risk incurred by joining (or leaving) the DB: differential privacy [DMNS06]. Before/after interacting vs. risk when in/not in the DB.

ε-Differential Privacy. K gives ε-differential privacy if for all neighboring D1 and D2, and all C ⊆ Range(K): Pr[K(D1) ∈ C] ≤ e^ε · Pr[K(D2) ∈ C]. Neutralizes all linkage attacks. Composes unconditionally and automatically: Σ_i ε_i. [Figure: ratio of response probabilities is bounded; bad responses marked.]

(ε, δ)-Differential Privacy. K gives (ε, δ)-differential privacy if for all neighboring D1 and D2, and all C ⊆ Range(K): Pr[K(D1) ∈ C] ≤ e^ε · Pr[K(D2) ∈ C] + δ. Neutralizes all linkage attacks. Composes unconditionally and automatically: Σ_i ε_i and Σ_i δ_i. [Figure: ratio of response probabilities is bounded; bad responses marked.]
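Restated in standard notation, the two guarantees above are:

```latex
% K gives \varepsilon-differential privacy if, for all neighboring D_1, D_2
% and all C \subseteq \mathrm{Range}(K):
\Pr[K(D_1) \in C] \;\le\; e^{\varepsilon}\,\Pr[K(D_2) \in C]

% The (\varepsilon, \delta) relaxation allows an additive slack \delta:
\Pr[K(D_1) \in C] \;\le\; e^{\varepsilon}\,\Pr[K(D_2) \in C] + \delta
```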

Techniques and Programming Methodologies: SuLQ, Addition of Noise, Exponential Mechanism.

Subject of Many Other Talks. General techniques (not exhaustive): privacy-preserving programming interface SuLQ [DN04, BDMN05, DKMMN06]; available instantiation/generalization PINQ [M09]; addition of Laplacian or Gaussian noise [DMNS06, NRS07, DN04, etc.]; discrete outputs [MT07]; synthetic databases [BLR08, DNRRV09]; optimization: function perturbation [CM09]; geometric approach [HT09]. Specific tasks (not exhaustive): contingency tables / OLAP cubes [BDMT05]; PAC learning [NRS08]; boosting [DRV]; logistic regression [CM09, MW09]; statistical estimators [DL09, S09]; halfspace queries [BDMN05, BLR08].

Eternal Vigilance Is the Price of Privacy. Social privacy maxim: only use the data for the purpose for which they were collected. Mission creep: "Think of the children!" One defense: non-interactive sanitization; the data can be destroyed once sanitization is complete. NEW: pan-private streaming algorithms [DNPRY]. Never store the data; no tempting target. Pan-private: even the internal state is differentially private.

Pan-Privacy Model. Data is a stream of items, e.g., search queries; data of different users is interleaved arbitrarily. The curator sees each item and updates its internal state. User-level pan-privacy: for every possible behavior of a user in the stream, at any single point in time, the internal state and the output are differentially private.

Example: Stream Density or # Distinct Elements. Universe of users; estimate what fraction of users in the universe actually appear in the data stream. Lovely streaming techniques don't preserve privacy. Flavor of solution: keep a bit b_x for each user x. Initially set b_x according to distribution D0: Pr[b_x = 0] = Pr[b_x = 1] = 1/2. Each time x appears, reset b_x according to distribution D1: Pr[b_x = 0] = 1/2 + ε; Pr[b_x = 1] = 1/2 − ε.
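A minimal Python sketch of this flavor of solution, assuming a small known universe and illustrative parameter names (eps_bias for the bias ε of D1, eps_out for the output noise). Per the slide, a user who ever appears ends with a bit drawn from D1, which yields the unbiased estimator below; this is a sketch, not the exact [DNPRY] algorithm.

```python
import random

def pan_private_density(stream, universe, eps_bias=0.1, eps_out=1.0):
    """Pan-private estimate of the fraction of `universe` appearing in `stream`.

    Each stored bit is nearly uniform whether or not its user appeared, so the
    internal state is differentially private at every point in time.
    """
    # D0: initialize every bit uniformly; the state starts information-free.
    bits = {x: random.random() < 0.5 for x in universe}  # True means b_x = 1

    for x in stream:
        # D1: Pr[b_x = 0] = 1/2 + eps_bias, Pr[b_x = 1] = 1/2 - eps_bias.
        bits[x] = random.random() < 0.5 - eps_bias

    # A user who appeared has Pr[b_x = 0] = 1/2 + eps_bias; one who never
    # appeared has Pr[b_x = 0] = 1/2.  Invert that bias to estimate density.
    frac_zero = sum(1 for b in bits.values() if not b) / len(universe)
    estimate = (frac_zero - 0.5) / eps_bias

    # Laplace noise on the published estimate: one user's bit moves the
    # estimate by at most 1/(eps_bias * n), so this scale suffices.
    scale = 1.0 / (eps_bias * len(universe) * eps_out)
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return estimate + noise
```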

Many Fascinating Questions. Pan-private algorithms for several tasks: stream density / number of distinct elements; t-cropped mean (mean, over users, of min(t, # appearances)); fraction of items appearing exactly k times; fraction of heavy hitters, items appearing at least k times. Continual outputs? Yes, for stream density. What else? Continual intrusions? Only really lousy density estimators. Other problems? Multiple unannounced intrusions? Two?

Backup Slides

Multiple Intrusions? Pan-privacy: one intrusion; one output at the end. Continual intrusions: continual outputs; state is not completely public. Theorem [Continual Intrusion Lower Bound]: for any density estimator, if there are continual intrusions, then additive accuracy cannot be better than Ω(n). In progress: multiple announced intrusions (accuracy degrades fast); multiple unannounced intrusions (even 2).

Sublinear Queries (SuLQ) Programming. Query = h: row → [0, 1]. Database D. Exact answer: Σ_r h(d_r). Response: exact answer + noise.
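A minimal sketch of this interface in Python (the function name `sulq_query` and the row representation are illustrative, not the actual SuLQ API). Since h maps each row into [0, 1], changing one row moves the exact sum by at most 1, so Laplace noise of scale 1/ε suffices for ε-differential privacy.

```python
import random

def laplace(scale):
    # Laplace(0, scale) as the difference of two exponentials with mean `scale`.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def sulq_query(database, h, eps):
    """Answer the query h: row -> [0, 1] with eps-differential privacy:
    exact answer sum_r h(d_r), plus noise calibrated to sensitivity 1."""
    exact = sum(h(row) for row in database)
    return exact + laplace(1.0 / eps)

# Example: a noisy count of rows satisfying a predicate.
# answer = sulq_query(db, lambda row: 1.0 if row["age"] > 40 else 0.0, eps=0.1)
```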

Addition of Laplacian Noise [DMNS06]

Addition of Laplacian Noise [DMNS06]. Δ₁ = max over neighbors D1, D2 of ‖f(D1) − f(D2)‖₁. Theorem: to achieve ε-differential privacy, use scaled symmetric noise [Lap(R)]^d with R = Δ₁/ε.

Addition of Gaussian Noise [DMNS06, DKMMN06]

Laplacian vs Gaussian Noise. Δ₁ = max over neighbors D1, D2 of ‖f(D1) − f(D2)‖₁. Theorem: to achieve ε-differential privacy, use scaled symmetric noise [Lap(R)]^d with R = Δ₁/ε. Δ₂ = max over neighbors D1, D2 of ‖f(D1) − f(D2)‖₂. Theorem: to achieve (ε, δ)-differential privacy, use noise N(0, R²)^d with R = 2[log(1/δ)]^{1/2} Δ₂/ε.
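A minimal Python sketch of the two calibrations above, taking the sensitivities Δ₁ and Δ₂ as given inputs; function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(f_value, delta_1, eps):
    """eps-DP: add [Lap(R)]^d noise with R = Delta_1 / eps (per the theorem)."""
    return f_value + rng.laplace(0.0, delta_1 / eps, size=np.shape(f_value))

def gaussian_mechanism(f_value, delta_2, eps, delta):
    """(eps, delta)-DP: add N(0, R^2)^d noise, R = 2 sqrt(log(1/delta)) Delta_2 / eps.
    Here delta_2 is the L2 sensitivity; delta is the DP failure parameter."""
    sigma = 2.0 * np.sqrt(np.log(1.0 / delta)) * delta_2 / eps
    return f_value + rng.normal(0.0, sigma, size=np.shape(f_value))

# Example: a histogram over disjoint bins has Delta_1 = 1 when neighbors differ
# by adding or removing one row, so each bin gets Lap(1/eps) noise.
# noisy_hist = laplace_mechanism(np.array([102.0, 57.0, 31.0]), delta_1=1.0, eps=0.5)
```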

Exponential Mechanism [MT07]. Given database D, produce an output object y; y can be anything: strings, trees, text, a number between 1 and 10. A utility function u(D, y) measures the goodness of y. Output y with (normalized) probability exponential in the utility of y.
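A minimal Python sketch for a finite candidate set, assuming the utility's sensitivity Δu is supplied; it uses the standard weight exp(ε·u(D, y)/(2·Δu)).

```python
import math
import random

def exponential_mechanism(db, candidates, utility, sensitivity, eps):
    """Sample y with probability proportional to exp(eps * u(db, y) / (2 * Du)),
    where Du bounds how much any one record can change the utility."""
    scores = [eps * utility(db, y) / (2.0 * sensitivity) for y in candidates]
    top = max(scores)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - top) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

# Example: privately pick the most frequent item; utility = count, so Du = 1.
# winner = exponential_mechanism(db, items, lambda d, y: d.count(y), 1.0, eps=0.5)
```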

Techniques and Programming Methodologies. Sublinear Queries (SuLQ) programming; addition of Laplacian noise to the outcome: f(D) + noise; addition of Gaussian noise to the outcome: f(D) + noise; the exponential mechanism. Use in combination, e.g., Noisy Gradient Descent [MW09]: SuLQ computation, using Laplacian noise, of a noisy gradient; step size chosen using the exponential mechanism (+ probabilistic inference to wring out additional accuracy) (+ programmed in PINQ [M07]); a sketch of one such combined step follows.
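A minimal sketch of one combined step, in the spirit of the slide rather than the exact [MW09] algorithm: a Laplace-noised average gradient (the SuLQ-style part) followed by an exponential-mechanism choice among candidate step sizes. `grad_fn`, `loss_fn`, the clipping bounds, and the even budget split are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_gradient_step(w, data, grad_fn, loss_fn, step_sizes, eps,
                        grad_bound=1.0, loss_bound=1.0):
    """One private gradient step: noisy gradient + private step-size choice."""
    n, d = len(data), len(w)
    eps_grad, eps_step = eps / 2.0, eps / 2.0  # split the privacy budget

    # SuLQ-style noisy gradient: clip each per-example gradient to L1 norm
    # grad_bound, so one example moves the average by at most 2*grad_bound/n.
    grads = np.array([grad_fn(w, x) for x in data])
    norms = np.linalg.norm(grads, ord=1, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / grad_bound)
    noisy_grad = grads.mean(axis=0) + rng.laplace(
        0.0, 2.0 * grad_bound / (n * eps_grad), size=d)

    # Exponential mechanism over step sizes: the utility of eta is the negative
    # total loss at w - eta * noisy_grad; per-example losses in [0, loss_bound]
    # bound the utility's sensitivity by loss_bound.
    utilities = np.array([-sum(loss_fn(w - eta * noisy_grad, x) for x in data)
                          for eta in step_sizes])
    weights = np.exp(eps_step * (utilities - utilities.max()) / (2.0 * loss_bound))
    eta = rng.choice(np.asarray(step_sizes), p=weights / weights.sum())

    return w - eta * noisy_grad
```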

Techniques and Programming Methodologies. Sublinear Queries (SuLQ) programming; addition of Laplacian noise to the outcome: f(D) + noise; addition of Gaussian noise to the outcome: f(D) + noise; the exponential mechanism. Also: Subsample and Aggregate, Smoothed Sensitivity, Propose-Test-Release, data structures for multiple queries, differentially private synthetic databases, … Differential privacy branches out: utility maximization, private approximations, multiparty protocols, auctions, …