A Methodology for Direct and Indirect Discrimination Prevention in Data Mining


A Methodology for Direct and Indirect Discrimination Prevention in Data Mining
Sara Hajian and Josep Domingo-Ferrer
IEEE Transactions on Knowledge and Data Engineering, 2013
Presented by Polina Rozenshtein

Outline
- Problem addressed: direct and indirect discrimination
- Background, definitions and measures
- Approach proposed: discrimination measurement, data transformation
- Algorithms and running time
- Experimental results

Problem
Discrimination can be direct or indirect.
- Direct discrimination: decisions are made based on sensitive attributes.
- Indirect discrimination (redlining): decisions are made based on non-sensitive attributes that are strongly correlated with biased sensitive ones.
Both surface as decision rules mined from the data.

Definitions
- Dataset: a collection of records.
- Item: an attribute together with its value, e.g., Race = Black.
- Itemset X: a collection of items, e.g., X = {Foreign worker = Yes; City = NYC}.
- Classification rule X → C, where C is a class item such as Hire = {Yes/No}, e.g., {Foreign worker = Yes; City = NYC} → Hire = No.

Definitions
- Support supp(X): fraction of records that contain X.
- Confidence conf(X → C): how often C appears in records that contain X:
  conf(X → C) = supp(X, C) / supp(X)
- Frequent classification rule: supp(X, C) ≥ s and conf(X → C) ≥ c.
- Negated itemset: if X = {Foreign worker = Yes}, then ¬X = {Foreign worker = No}.
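To make these measures concrete, here is a minimal Python sketch; the record representation and all attribute names are invented for this illustration. Records are modeled as frozensets of (attribute, value) items.

    # Records are frozensets of (attribute, value) items; the data is a toy example.
    def supp(records, itemset):
        """supp(X): fraction of records that contain every item of X."""
        return sum(itemset <= r for r in records) / len(records)

    def conf(records, antecedent, consequent):
        """conf(X -> C) = supp(X, C) / supp(X)."""
        return supp(records, antecedent | consequent) / supp(records, antecedent)

    records = [
        frozenset({("Foreign worker", "Yes"), ("City", "NYC"), ("Hire", "No")}),
        frozenset({("Foreign worker", "Yes"), ("City", "NYC"), ("Hire", "No")}),
        frozenset({("Foreign worker", "No"), ("City", "NYC"), ("Hire", "Yes")}),
        frozenset({("Foreign worker", "No"), ("City", "LA"), ("Hire", "Yes")}),
    ]

    X = frozenset({("Foreign worker", "Yes"), ("City", "NYC")})
    C = frozenset({("Hire", "No")})
    print(supp(records, X))     # 0.5
    print(conf(records, X, C))  # 1.0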

Classification rules
- DIs: a predetermined set of discriminatory items, e.g., DIs = {Foreign worker = Yes; Race = Black}.
- X → C is potentially discriminatory (PD) when X = A, B with a non-empty A ⊆ DIs and B ∩ DIs = ∅.
  E.g., {Foreign worker = Yes; City = NYC} → Hire = No.
- X → C is potentially non-discriminatory (PND) when X = D, B with D ∩ DIs = ∅ and B ∩ DIs = ∅.
  E.g., {Zip = 10451; City = NYC} → Hire = No.
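The PD/PND split then reduces to an intersection test against DIs; a small sketch reusing the representation above (the example antecedents are invented):

    DIs = frozenset({("Foreign worker", "Yes"), ("Race", "Black")})

    def is_pd(antecedent, DIs):
        """X -> C is PD iff the antecedent contains a discriminatory item."""
        return bool(antecedent & DIs)

    print(is_pd(frozenset({("Foreign worker", "Yes"), ("City", "NYC")}), DIs))  # True
    print(is_pd(frozenset({("Zip", "10451"), ("City", "NYC")}), DIs))           # False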

Direct Discrimination Measure
Extended lift (elift), for A ⊆ DIs:
  elift(A, B → C) = conf(A, B → C) / conf(B → C)
- A, B → C is α-protective if elift(A, B → C) < α.
- A, B → C is α-discriminatory if elift(A, B → C) ≥ α.
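A sketch of elift and the α test, reusing conf() from the earlier sketch; the split of the antecedent into A and B is supplied by the caller:

    def elift(records, A, B, C):
        """elift(A, B -> C) = conf(A, B -> C) / conf(B -> C), A taken from DIs."""
        return conf(records, A | B, C) / conf(records, B, C)

    def is_alpha_discriminatory(records, A, B, C, alpha):
        return elift(records, A, B, C) >= alpha

    # On the toy records above: conf(A,B -> C) = 1.0 and conf(B -> C) = 2/3,
    # so elift = 1.5 and the rule is alpha-discriminatory for any alpha <= 1.5.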

Indirect Discrimination Measure
Theorem. Let r: D, B → C be a PND rule, and let
  γ = conf(r: D, B → C), δ = conf(B → C) > 0.
Suppose that for some A ⊆ DIs the background knowledge gives
  conf(r_b1: A, B → D) ≥ β1, conf(r_b2: D, B → A) ≥ β2 > 0.
Define
  f(x) = (β1/β2)(β2 + x − 1)
  elb(x, y) = f(x)/y if f(x) > 0, and 0 otherwise.
Then, if elb(γ, δ) ≥ α, the PD rule r′: A, B → C is α-discriminatory.
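The bound itself is a two-line function; a sketch, with beta1 and beta2 supplied as the known confidence lower bounds of the background rules:

    def elb(x, y, beta1, beta2):
        """elb(x, y) with x = conf(D,B -> C) and y = conf(B -> C) > 0."""
        f = (beta1 / beta2) * (beta2 + x - 1.0)
        return f / y if f > 0 else 0.0

    # The inferred PD rule A,B -> C is flagged alpha-discriminatory
    # when elb(gamma, delta) >= alpha.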

Indirect Discrimination or not
- A PND rule r: D, B → C is a redlining rule if, combined with available background knowledge rules r_b1: A, B → D and r_b2: D, B → A with A ⊆ DIs, it could yield an α-discriminatory rule r′: A, B → C.
  E.g., {Zip = 10451; City = NYC} → Hire = No.
- A PND rule r: D, B → C is a non-redlining rule if it cannot yield an α-discriminatory rule r′: A, B → C from the available rules r_b1 and r_b2 with A ⊆ DIs.
  E.g., {Experience = Low; City = NYC} → Hire = No.
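Putting the theorem to work, a hedged sketch of the redlining test; here the background-rule confidences are estimated from the data itself, whereas in the paper they may come from external background knowledge:

    def is_redlining(records, A, B, D, C, alpha):
        """Flag the PND rule D,B -> C as redlining w.r.t. A from DIs."""
        beta1 = conf(records, A | B, D)  # conf(r_b1: A,B -> D)
        beta2 = conf(records, D | B, A)  # conf(r_b2: D,B -> A)
        gamma = conf(records, D | B, C)
        delta = conf(records, B, C)
        return beta2 > 0 and elb(gamma, delta, beta1, beta2) >= alpha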

The Approach
Discrimination measurement:
- Split the frequent rules into PD and PND.
- Direct discrimination: among the PD rules, find the α-discriminatory ones with elift().
- Indirect discrimination: among the PND rules, find the redlining ones with elb() plus background knowledge.
Data transformation:
- Alter the dataset to remove the discriminatory biases.
- Keep the impact on the data and on legitimate rules to a minimum.

Direct rules protection (Method 1)
For A ⊆ DIs, wish elift(r′: A, B → C) < α, i.e., conf(A, B → C) / conf(B → C) < α.
Decrease conf(A, B → C) = supp(A, B, C) / supp(A, B) by increasing supp(A, B):
  (¬A, B → ¬C) ⇒ (A, B → ¬C)
supp(A, B, C) remains the same.

Direct rules protection (Method 2)
Wish elift(r′: A, B → C) < α, i.e., conf(A, B → C) / conf(B → C) < α.
Increase conf(B → C) = supp(B, C) / supp(B) by increasing supp(B, C):
  (¬A, B → ¬C) ⇒ (¬A, B → C)
supp(B) remains the same.
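Both methods are record perturbations; a sketch of the two transformations. Note that the paper's algorithms edit only as many records as needed, chosen by minimum impact on other rules, while this sketch edits every matching record:

    def drp_method1(records, A, notA, B, notC):
        """(notA, B -> notC) => (A, B -> notC): supp(A,B) grows while
        supp(A,B,C) stays put, so conf(A,B -> C) drops."""
        return [(r - notA) | A if notA <= r and B <= r and notC <= r else r
                for r in records]

    def drp_method2(records, notA, B, C, notC):
        """(notA, B -> notC) => (notA, B -> C): supp(B,C) grows while
        supp(B) stays put, so conf(B -> C) rises."""
        return [(r - notC) | C if notA <= r and B <= r and notC <= r else r
                for r in records]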

Direct rules generalization
PD: {Foreign worker = Yes; City = NYC} → Hire = No.
PND: {Experience = Low; City = NYC} → Hire = No.
If conf(r: D, B → C) ≥ conf(r′: A, B → C), and conf(A, B → D) = 1, then the PD rule r′: A, B → C is an instance of the PND rule r: D, B → C.

Direct rules generalization
PD: {Foreign worker = Yes; City = NYC} → Hire = No.
PND: {Experience = Low; City = NYC} → Hire = No.
1) If conf(r: D, B → C) ≥ p · conf(r′: A, B → C),
2) and conf(A, B → D) ≥ p,
then the PD rule r′: A, B → C is a p-instance of the PND rule r: D, B → C.
Goal: transform each α-discriminatory rule into a p-instance of some PND rule r: D, B → C.
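The p-instance test is a direct check of the two conditions; a sketch reusing conf() from above:

    def is_p_instance(records, A, B, D, C, p):
        """PD rule A,B -> C is a p-instance of PND rule D,B -> C."""
        cond1 = conf(records, D | B, C) >= p * conf(records, A | B, C)
        cond2 = conf(records, A | B, D) >= p
        return cond1 and cond2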

Direct rules generalization
Condition 2 is satisfied, but Condition 1 is not:
- Wish conf(r: D, B → C) ≥ p · conf(r′: A, B → C).
- Decrease conf(r′: A, B → C) while preserving conf(A, B → D):
  (A, B, ¬D → C) ⇒ (A, B, ¬D → ¬C)
Condition 1 is satisfied, but Condition 2 is not:
- Wish conf(A, B → D) ≥ p.
- Increase conf(A, B → D) while preserving conf(r: D, B → C) ≥ p · conf(r′: A, B → C): impossible.

Direct rules generalization
- Use generalization whenever possible, to increase the number of PND rules.
- Generalization is applicable when at least Condition 2 is satisfied.
- Once generalization is done, apply the direct rule protection methods to the remaining rules.
- Always try to perform the minimum transformation.

Indirect Rule Protection
The same strategy as for direct rule protection.
Wish elb(conf(r: D, B → C), conf(B → C)) < α, i.e.,
  (conf(r_b1: A, B → D) / conf(r_b2: D, B → A)) · (conf(r_b2: D, B → A) + conf(r: D, B → C) − 1) / conf(B → C) < α
Method 1: decrease conf(A, B → D):
  (¬A, B, ¬D → ¬C) ⇒ (A, B, ¬D → ¬C)
Method 2: increase conf(B → C):
  (¬A, B, ¬D → ¬C) ⇒ (¬A, B, ¬D → C)

Simultaneous direct and indirect discrimination prevention

                 Method 1                                   Method 2
  DRP:  (¬A, B → ¬C) ⇒ (A, B → ¬C)            (¬A, B → ¬C) ⇒ (¬A, B → C)
  IRP:  (¬A, B, ¬D → ¬C) ⇒ (A, B, ¬D → ¬C)    (¬A, B, ¬D → ¬C) ⇒ (¬A, B, ¬D → C)

Lemma 1. Method 1 for DRP cannot be used for simultaneous DRP and IRP: Method 1 for DRP might undo the protection provided by Method 1 for IRP.

Simultaneous direct and indirect discrimination prevention
(Same transformation table as above.)
Lemma 2. Method 2 for IRP is beneficial for Method 2 for DRP, and Method 2 for DRP is at worst neutral for Method 2 for IRP: both increase conf(B → C).

Simultaneous direct and indirect discrimination prevention
- Transform PD rules into PND rules (generalization) when possible.
- Run Method 2 for IRP on the PND (redlining) rules and Method 2 for DRP on the remaining PD rules.

Algorithms
- DB: the database
- FR: the set of frequent classification rules
- MR: the set of α-discriminatory rules found in FR
- DIs: the discriminatory itemset

Computational Cost
- m: number of records in DB
- k: number of rules in FR
- h: number of records in the subset DBc
- n: number of discriminatory rules in MR
Per rule: O(m) to get DBc, O(kh) to compute impact(dbc) for all dbc ∈ DBc, O(h log h) for sorting, and O(dm) for the modification.
Total: O(n(m + kh + h log h + dm)).

Experiments
German credit data set and Adult data set.
- Direct discrimination prevention degree (DDPD): percentage of α-discriminatory rules that are no longer α-discriminatory after the transformation.
- Direct discrimination protection preservation (DDPP): percentage of α-protective rules that remain α-protective.
- IDPD and IDPP are defined analogously for redlining rules.
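These utility metrics compare rule sets before and after the transformation; a sketch, assuming rules are represented as hashable objects collected into Python sets:

    def ddpd(disc_before, disc_after):
        """DDPD: % of alpha-discriminatory rules no longer discriminatory."""
        return 100.0 * len(disc_before - disc_after) / len(disc_before)

    def ddpp(prot_before, prot_after):
        """DDPP: % of alpha-protective rules that remain alpha-protective."""
        return 100.0 * len(prot_before & prot_after) / len(prot_before)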

German credit data set
Minimum support 5%, minimum confidence 10%:
- 32,340 frequent classification rules
- 22,763 background knowledge rules
- 37 redlining rules; 42 indirectly and 991 directly α-discriminatory rules

Information loss
- Misses cost (MC): percentage of rules lost by the transformation.
- Ghost cost (GC): percentage of new rules introduced by the transformation.
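A sketch of the two information-loss metrics over the frequent-rule sets mined before and after the transformation; the exact normalization used here is an assumption:

    def misses_cost(before, after):
        """MC: % of the original frequent rules that the transformation loses."""
        return 100.0 * len(before - after) / len(before)

    def ghost_cost(before, after):
        """GC: % of the after-rules that are new (ghost) rules."""
        return 100.0 * len(after - before) / len(after)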

Conclusions
- Considers frequent classification rule mining.
- Defines direct and indirect discrimination.
- Proposes measures of discrimination.
- Proposes methods to modify the dataset so as to avoid discrimination.
- Reports meaningful qualitative results.