Securing private data sharing in multi-party analytics
First Monday

Securing private data sharing in multi-party analytics by Gowtham Bellala and Bernardo Huberman



Abstract
A general class of problems arises when datasets containing private information belong to multiple parties or owners and they collectively want to perform analytic studies on the entire set while respecting the privacy and security concerns of each individual party. We describe a solution to this problem in the form of a secure procedure for data mapping and/or linkage, which allows to identify the correspondence between entities in a distributed dataset. In contrast to existing methods this solution does not require either a trusted or semi-trusted third party, while being simple, efficient and scalable for both large datasets and number of parties.

Contents

1. Introduction
2. Problem statement
3. List matching protocol for secure data linkage
4. Conclusions

 


 

1. Introduction

Advances in “big data” analytics software and the deployment of large numbers of smart sensors (i.e., the Internet of Things) have made it possible to collect and analyze vast quantities of data that can be used for improved decision-making, spotting important trends and coordinate multi-party efforts in a number of fields, such as epidemiological studies and marketing purposes. However, challenges remain. An important one is how to extract information from a dataset that can facilitate good decisions without compromising the privacy of the individuals or organizations whose sensitive details may be contained in the dataset. This challenge is compounded when the analysis involves multiple organizations (or “parties”) wishing to collaborate in order to obtain a broader understanding of a topic of mutual interest.

Consider as an example a scenario where a group of hospitals wish to work together to improve the collective quality of healthcare. Each hospital has large amounts of data about its patients, including their demographics, past medical history, lab results, current diagnosis, prescribed treatment and outcomes. This data contains a wealth of information that if shared across the group could mutually benefit all parties by enabling faster diagnosis and effective treatment for similar cases. However, this information contains extremely sensitive and private information both about the patients and the hospitals. Thus, for a variety of reasons (including regulatory ones), sharing this data can be problematic.

This general class of problems arises when a dataset containing private information belongs to multiple parties or owners and they collectively want to perform analytic studies on the entire dataset while respecting the privacy and security concerns of each individual party. This is broadly referred to as privacy preserving data mining (PPDM) or secure multi-party computation (SMC) in the literature.

In this paper we present a secure solution for data mapping and data linkage, which arises as a pre-processing step in a multi-party distributed data analytics task. The goal is to identify the correspondence between entities in a distributed dataset and to do so while respecting the privacy of the data.

Since secure data linkage plays an important role as a building block for privacy preserving data mining, a number of solutions have been proposed. Most of the existing work on privacy-preserving data mining (Aggarwal and Yu, 2008), and secure multi-party computation (Lindell and Pinkas, 2009) implicitly assumes that the datasets that are input to their procedures are a priori processed and linked, in the sense that the correspondence between the entities is known (Karr, et al., 2004; Jagannathan and Wright, 2005; Rane and Boufounos, 2013).

A common approach to enable secure record linkage is to use a trusted third party (honest broker) (Jones, et al., 2014; Churches and Christen, 2004) or a semi-trusted third party (Schnell, et al., 2009; Lazrig, et al., 2015). However, such solutions are often not secure (Boyd, et al., 2009; Dhir, et al., 2008). To address this issue, several solutions have been proposed in the literature. Some of these solutions are based on secure protocols such as garbled circuits (Yao, 1982) and oblivious transfer. Though these solutions provide strong security guarantees, they are inherently complex and often restricted to a two party scenario. On the other hand, hash based approaches such as Bloom filters that have been proposed as alternative scalable solutions for privacy preserving record linkage (Dusserre, et al., 1995; Schnell, et al., 2009) are susceptible to different types of attacks such as dictionary or frequency-based ones (Agrawal, et al., 2003; Niedermeyer, et al., 2014).

In contrast, our proposed solution is based on a secure list matching protocol (Huberman, 2016; Huberman, et al., 2012) that does not require either a trusted or a semi-trusted third party. The proposed protocol has strong security guarantees, while being simple, efficient and scalable in terms of both large datasets and large number of parties.

 

++++++++++

2. Problem statement

In a multi-party distributed analytics problem, the data may be partitioned horizontally, vertically, or in an arbitrary manner. In a horizontal partition, rows representing samples of the database are owned by different parties as shown in Figure 1. In a vertical partition, columns representing features or attributes of the database are owned by different parties as shown in Figure 2. More generally, in an arbitrary partition, each party has data belonging to a subset of rows and a subset of columns.

For instance, in the healthcare example, each hospital may hold all the attributes for a subset of patients as horizontal partitions, or each hospital may hold different attributes for the same patient as vertical partitions. More generally, each hospital may have data belonging to a subset of patients for a subset of attributes.

In any multi-party distributed analytics application, one of the first steps is to ensure that the datasets and the corresponding data elements are aligned to facilitate subsequent analytics task such as similarity search, clustering, outlier detection, etc.

For instance, say Party A in the toy example shown in Figure 1 wants to find patients similar to Patient ID 002 in Party B’s database. Party A would need to compute the similarity between this patient with all patients in the database of Party B. In order to compute this similarity, Party A would first need to identify the set of common attributes between the two databases, and order (or link) these common attributes to facilitate similarity computation.

Similarly, consider the case of vertically partitioned data shown in Figure 2 where Party A wants to find patients similar to Patient ID 002 based on the similarity across all attributes (i.e., age, weight, blood pressure and total cholesterol). In order to compute this similarity, Party A should first identify the set of common patients between the two databases, and order (or link) these common patients to compute global similarity across all attributes. The common patients may be identified using an attribute in their database that corresponds to a unique ID such as the patient’s Social security number.

 

Horizontally partitioned data
 
Figure 1: Horizontally partitioned data.

 

 

Vertically partitioned data
 
Figure 2: Vertically partitioned data.

 

More generally, in order to facilitate a multi-party distributed data computation the parties would first need to identify the common set of attributes and records (e.g., patients), and then agree on a common order for these items. However, the parties may not want to disclose their attribute and record information to each other, as it may contain private and sensitive information. To use the healthcare example, each hospital may not want to disclose who their patients are, or what set of attribute data they record in their database.

 

++++++++++

3. List matching protocol for secure data linkage

We propose to use a list matching protocol — described in Huberman (2016) and Huberman, et al. (2012) — to enable a secure, privacy-preserving data linkage. Given each party’s list of attributes, the list matching protocol finds the set intersection securely as shown below. Once the set intersection is obtained, the parties may also agree on a common order to facilitate subsequent privacy preserving analytics.

 

List matching protocol

 

 

Both devices

 

Note that the security (or privacy) of this operation would be compromised if party A or B are able to compute either x or y from the data they initially sent to each other. However, that is almost impossible because of the intractability of the discrete logarithm problem: Given integers a and b, and prime p, it is computationally hard to find an integer n such that

 

Integer n

 

This method generalizes to any large set of data and thus allows for each party to find the set of common attributes or records across their private databases.

3.1. Extension to multi-party scenario

Consider a multi-party scenario where each party holds a list of entities (attributes or records), and the goal is to identify the set of common entities and their correspondence across all parties. The proposed solution can be easily extended to the multi-party scenario. We describe two approaches to extend the solution. Without loss of generality, we describe each approach with respect to party 1.

3.1.1. Approach 1

This approach is a simple extension of the two-party scenario described above, where Party 1 participates in the above two-party protocol with every other party to obtain the pairwise set of intersecting entities. More specifically, Party 1 initiates the query with every party i (2 ≤ ip) to find the pairwise intersection of their entities X1i = X1Xi. Then, party 1 would compute the global set intersection as Xglobal = X12X13 ∩...∩ X1p, and publish the list of common entities along with a prescribed order (for subsequent analysis) to all parties.

3.1.2. Approach 2

This approach is based on a ring protocol that works as follows.

1. Party 1 would first mask its private list X1 = {a,b} using its secret key k1, shares it with party 2

who in turn would further encrypt the incoming data using his secret key k2 and share with the next party who repeats this process. The ring protocol would terminate at party 1, after all the parties have masked the data using their secret key.

2. Party 1 would publish its encrypted data to all other parties participating in the protocol.

 

Party 1, 2, 3, 4

 

Each party would follow the same approach by applying the ring protocol to encrypt their private list using the secret keys of all parties, and then share the encrypted data with all parties. Once all parties complete the above two steps, they can find the intersecting set by matching the encrypted data, and agree on a common order.

Note that this ring topology approach is not susceptible to collusion. For example, if parties i-1 and i+1 collude, they cannot guess the secret key of party i, due to the intractability of the discrete logarithm problem as described above.

An alternative approach to the second step described above would be to use an untrusted mediator (or a broker), where each party would send its final encrypted values to the untrusted mediator, who would extract the set of common entities, and order them. Note that the mediator only has access to encrypted data. Moreover, the mediator would not be able to guess the secret key of a party, even if he colludes with one or more parties, again due to the intractability of the discrete logarithm problem.

3.2. Discussion

Unlike most existing methods for secure data linkage, our solution does not require a trusted or semi-trusted third party. In addition, it is simple, efficient, and scales for large datasets as well as large numbers of parties.

  • String handling: Though the above protocol is defined on integers, it could be used on strings by applying standardized encodings that convert strings to an integer such as ASCII encoding or a hash function.
  • Case sensitivity handling: To enable case insensitive string matching (e.g., “Gender” vs “gender”), one could use a case-insensitive encoding, or the parties may decide a priori to use either all upper or lower case characters during encoding.
  • Special character handling: The protocol could be modified to handle unnecessary white spaces or special characters. For example, the protocol may require the parties to drop all white spaces and special characters in the attribute names during encoding. This would enable the parties to match “blood pressure” with “blood_pressure”.
  • Rule-based matching: More generally, this protocol can be used to perform a deterministic record linkage. In deterministic record linkage, based on a set of rules, two records can either be matched or not matched. This approach helps provide matching parameters that are robust to common errors such as typos. As an example, consider a patient name entry recorded by two parties as “Katherine Doe” vs. “Katharine Doe”. Using a set of pre-determined rules such as matching only the first few letters of first name and last name can help link these entities in spite of such common errors. Such pre-determined rules have been well studied in the literature for applications such as healthcare (Kho, et al., 2015).

 

++++++++++

4. Conclusions

Secure data linkage plays an important role as a building block for privacy preserving data mining, where the goal is to identify the correspondence between data entities belonging to different parties in a secure fashion. This problem arises in various applications such as healthcare and the Internet of Things (IoT). Most existing solutions for secure record linkage are either complex with strong security guarantees or very efficient with little security guarantees. In this paper, we proposed a solution based on a secure list matching protocol that provide strong security guarantees while being simple, efficient and scalable to large datasets and large number of parties. However, one limitation with the proposed approach is that it only supports deterministic record linkage where the output is binary, i.e., either a match or not. It would be interesting to extend this framework to support probabilistic record linkage, which is part of our future work. End of article

 

About the authors

Gowtham Bellala is a Senior Researcher at Hewlett Packard Labs, Palo Alto, California.
Web: http://www.labs.hpe.com/people/gowtham_bellala/
E-mail: gowtham [dot] bellala [at] hpe [dot] com

Bernardo A. Huberman is a Senior Fellow and Director at Hewlett Packard Labs, Palo Alto, California.
Web: http://www.labs.hpe.com/people/bernardo_huberman/
E-mail: bernardo [dot] huberman [at] hpe [dot] com

 

References

Charu C. Aggarwal and Philip S. Yu (editors), 2008. Privacy-preserving data mining: Models and algorithms. Advances in database systems, volume 34. New York: Springer.
doi: http://dx.doi.org/10.1007/978-0-387-70992-5, accessed 11 August 2016.

Rakesh Agrawal, Alexandre Evfimievski and Ramakrishnan Srikant, 2003. “Information sharing across private databases,” SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 86–97.
doi: http://dx.doi.org/10.1145/872757.872771, accessed 11 August 2016.

Andrew D. Boyd, Paul R. Saxman, Dale A. Hunscher, Kevin A. Smith, Timothy D. Morris, Michelle Kaston, Frederick Bayoff, Bruce Rogers, Pamela Hayes, Namrata Rajeev, Eva Kline-Rogers, Kim Eagle, Daniel Clauw, John F. Greden, Lee A. Green and Brian D. Athey, 2009. “The University of Michigan Honest Broker: A Web-based service for clinical and translational research and practice,” Journal of American Medical Informatics Association, volume 16, number 6, pp. 784–791.
doi: http://dx.doi.org/10.1197/jamia.M2985, accessed 11 August 2016.

Tim Churches and Peter Christen, 2004. “Some methods for blindfolded record linkage,” BMC Medical Informatics and Decision Making, volume 4, number 9, at http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-4-9, accessed 11 August 2016.
doi: http://dx.doi.org/10.1186/1472-6947-4-9, accessed 11 August 2016.

Rajiv Dhir, Ashok A. Patel, Sharon Winters, Michelle Bisceglia, Dennis Swanson, Roger Aamodt and Michael J. Becich, 2008. “A multi-disciplinary approach to honest broker services for tissue banks and clinical data: A pragmatic and practical model,” Cancer, volume 113, number 7, pp. 1,705–1,715.
doi: http://dx.doi.org/10.1002/cncr.23768, accessed 11 August 2016.

L. Dusserre, C. Quantin and H. Bouzelat, 1995. “A one way public key cryptosystem for the linkage of nominal files in epidemicological studies,” Medical Informatics, volume 8, part 1, pp. 644–647.

Bernardo A. Huberman, 2016. “Ensuring trust and security in the industrial IoT,” Ubiquity, volume 2016, article number 2, pp. 1–7.
doi: http://dx.doi.org/10.1145/2822883, accessed 11 August 2016.

Bernardo Huberman, Stephen Sorkin and Joshua Tyler, 2012. “Encoded attribute matching on communication devices,” U.S. patent 8,316,234 (20 November).

Geetha Jagannathan and Rebecca N. Wright, 2005. “Privacy-preserving distributed k-means clustering over arbitrarily partitioned data,” KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 593–599.
doi: http://dx.doi.org/10.1145/1081870.1081942, accessed 11 August 2016.

Kerina H. Jones, David V. Ford, Chris Jones, Rohan Dsilva, Simon Thompson, Caroline J. Brooks, Martin L. Heaven, Daniel S. Thayer, Cynthia L. McNerney and Ronan A. Lyons, 2014. “A case study of the secure anonymous information linkage (SAIL) gateway: A privacy-protecting remote access system for health-related research and evaluation,” Journal of Biomedical Informatics, volume 50, pp. 196–204.
doi: http://dx.doi.org/10.1016/j.jbi.2014.01.003, accessed 11 August 2016.

Alan F. Karr, Xiaodong Lin, Ashish P. Sanil and Jerome P. Reiter, 2004. “Secure regression on distributed databases,” Journal of Computational and Graphical Statistics, volume 14, number 2, pp. 263–279.
doi: http://dx.doi.org/10.1198/106186005X47714, accessed 11 August 2016.

Abel N. Kho, John P. Cashy, Kathryn L. Jackson, Adam R. Pah, Satyender Goel, Jörn Boehnke, John Eric Humphries, Scott Duke Kominers, Bala N. Hota, Shannon A. Sims, Bradley A. Malin, Dustin D. French, Theresa L. Walunas, David O. Meltzer, Erin O. Kaleba, Roderick C. Jones and William L. Galanter, 2015. “Design and implementation of a privacy preserving health record linkage tool in Chicago,” Journal of American Medical Informatics Association, volume 22, number 5, pp. 1,072–1,080.
doi: http://dx.doi.org/10.1093/jamia/ocv038, accessed 11 August 2016.

Ibrahim Lazrig, Tarik Moataz, Indrajit Ray, Indrakshi Ray, Toan Ong, Michael Kahn, Frédéric Cuppens and Nora Cuppens, 2015. “Privacy preserving record matching using automated semi-trusted broker,” In: Pierangela Samarati (editor). Data and Applications Security and Privacy XXIX. Lecture Notes in Computer Science, volume 9149. Berlin: Springer International, pp. 103–118.
doi: http://dx.doi.org/10.1007/978-3-319-20810-7_7, accessed 11 August 2016.

Yehuda Lindell and Benny Pinkas, 2009. “Secure multiparty computation for privacy-preserving data mining,” Journal of Privacy and Confidentiality, volume 1, number 1, article number 5, at http://repository.cmu.edu/jpc/vol1/iss1/5, accessed 11 August 2016.

Frank Niedermeyer, Simone Steinmetzer, Martin Kroll and Rainer Schnell, 2014. “Cryptanalysis of basic Bloom filters used for privacy preserving record linkage,” Journal of Privacy and Confidentiality, volume 6, number 2, pp. 59–79, and at http://repository.cmu.edu/jpc/vol6/iss2/3/, accessed 11 August 2016.

Shantanu Rane and Petros T. Boufounos, 2013. “Privacy-preserving nearest neighbor methods: Comparing signals without revealing them,” IEEE Signal Processing, volume 30, number 2, pp. 18–28.
doi: http://dx.doi.org/10.1109/MSP.2012.2230221, accessed 11 August 2016.

Rainer Schnell, Tobias Bachteler and Jorg Reiher, 2009. “Privacy-preserving record linkage using Bloom filters,” BMC Medical Informatics and Decision Making, volume 9, number 41, at http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-9-41, accessed 11 August 2016.
doi: http://dx.doi.org/10.1186/1472-6947-9-41, accessed 11 August 2016.

Andrew C. Yao, 1982. “Protocols for secure computations,” SFCS ’08: 23rd Annual Symposium on Foundations of Computer Science, pp. 160–164.
doi: http://dx.doi.org/10.1109/SFCS.1982.38, accessed 11 August 2016.

 


Editorial history

Received 27 July 2016; accepted 12 August 2016.


Copyright © 2016, First Monday.
Copyright © 2016, Gowtham Bellala and Bernardo Huberman.

Securing private data sharing in multi-party analytics
by Gowtham Bellala and Bernardo Huberman.
First Monday, Volume 21, Number 9 - 5 September 2016
http://firstmonday.org/ojs/index.php/fm/article/view/6842/5616
doi: http://dx.doi.org/10.5210/fm.v21i9.6842





A Great Cities Initiative of the University of Illinois at Chicago University Library.

© First Monday, 1995-2017. ISSN 1396-0466.