Many of the issues that we want to solve today with data science require access to sensitive, personal information - be it our medical history, financial records, or private habits. Every day, people like you and me produce a vast amount of data on our smartphones, electronic devices, or medical equipment. But because of privacy or proprietary concerns, data for tackling meaningful problems can be limited and difficult to access.

Can we perform data science without intruding on our individual privacy? If so, what technologies can we combine to make it possible?

Traditionally, training a model would require transferring this data to a central server, but this raises numerous concerns about the privacy and security of the data. The risks from data leaks and misuse have led various parts of the world to legislate data protection laws. To perform data science in domains that require private data while abiding by data privacy laws and minimizing risks, machine learning researchers have harnessed solutions from privacy and security research, developing the field of private and secure data science.

Private and secure machine learning (ML) is heavily inspired by cryptography and privacy research. It consists of a collection of techniques that allow models to be trained without having direct access to the data and that prevent these models from inadvertently storing sensitive information about the data.

Private and secure ML is performed in practice using a combination of techniques, though each method has limitations and costs. Some techniques would be overly burdensome in contexts where data and model owners already trust each other (e.g. when employees inside a company train models on company-internal data), while others would be insufficiently secure for contexts that need to protect data and models from the actions of malicious actors. An appropriate mix of techniques for a specific project can only be decided once the trade-offs of each technique are clearly communicated to the data holders and key stakeholders of the project.

In this blog series, we’ll explain common topics in privacy-preserving data science. We'll distill each topic to a single sentence and quick overview in this introductory page, and in the follow-up posts, you'll find further details and code demonstrations of each technique.

We hope these posts serve as a useful resource for you to figure out the best techniques for use cases in your organization.

Want to go straight to the deep-dives? Here's a shortcut:


Privacy Techniques: One Sentence Summaries

Federated Learning

In short: Federated learning means training your machine learning model on data that is stored on different devices or servers across the world, without having to centrally collect the data samples.

Instead of moving the data to the model, copies of the global model are sent to where the data is located. The local data samples remain at their source devices, say a smartphone or a hospital server. A model is sent to the device and trained on the local data, after which the newly improved model with its updates is sent back to the main server to be aggregated with the main model.
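To make the aggregation step concrete, here is a minimal, framework-free sketch of federated averaging in NumPy. The `local_update` helper and the equal weighting of clients are simplifying assumptions for illustration; libraries such as PySyft handle the communication, security, and weighting details for you.

```python
import numpy as np

def local_update(global_weights, local_data):
    """Hypothetical client-side step: start from the global weights and
    improve them using only the data held on this device (details omitted)."""
    local_weights = [w.copy() for w in global_weights]
    # ... compute gradients on local_data and update local_weights ...
    return local_weights

def federated_average(client_updates):
    """Server-side aggregation: average each layer's weights across clients."""
    return [np.mean(layer_versions, axis=0)
            for layer_versions in zip(*client_updates)]

# One round of federated learning over three simulated clients.
global_weights = [np.zeros((4, 2)), np.zeros(2)]   # a toy one-layer model
client_datasets = [None, None, None]               # stand-ins for on-device data
updates = [local_update(global_weights, data) for data in client_datasets]
global_weights = federated_average(updates)        # raw data never left the clients
```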

This preserves privacy in the sense that the data has not been moved from the device. However, there is still a limitation: the content of the local data can sometimes be inferred from the weight updates or improvements in the model. While individual clients are not able to reconstruct samples, an "honest-but-curious" server could. To prevent the possibility of inferring personal characteristics from the data, further techniques can be employed, such as differential privacy or encrypted computation.

For more information and a code demonstration, see What is Federated Learning?

There are, of course, some variations of federated learning - if you're interested, learn more about the difference between 'model-centric' and 'data-centric' federated learning here. The description above focused on 'data-centric'.

You can check out OpenMined's library for federated learning, PySyft, on GitHub.

Differential Privacy

In short: Sometimes, AI models can memorize details about the data they've trained on and could 'leak' these details later on. Differential privacy is a framework (using math) for measuring this leakage and reducing the possibility of it happening.

Often, deep neural networks are over-parameterized, meaning that they can encode more information than is necessary for the prediction task. The result is a machine learning model that can inadvertently memorize individual samples. For example, a language model designed to emit predictive text (such as the next-word suggestions seen on smartphones) can be probed to release information about individual samples that were used for training (“my social security number is …”).

Differential privacy is a mathematical framework for measuring this leakage. Differential privacy describes the following promise to data owners: "you will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, datasets, or information sources are available".

A critical aspect of this definition is the guarantee of privacy no matter what other studies, datasets or information sources are available to the attacker - it’s been well-publicized that two or more ‘anonymized’ datasets can be combined to successfully infer and de-anonymize highly private information. This is known as a ‘linkage’ attack, and presents a serious risk given the abundance of data so easily available to attackers today (examples: the infamous Netflix prize attack, health records being re-identified). Differential privacy, however, is more robust than simple dataset anonymization in that it quantifies the risk that such de-anonymization can occur, empowering a data owner with the ability to minimize the risk.

Differential privacy works by injecting a controlled amount of statistical noise to obscure the data contributions from individuals in the dataset. This is performed while ensuring that the model still gains insight into the overall population, and thus provides predictions that are accurate enough to be useful. Research in this field allows the degree of privacy loss to be calculated and evaluated based on the concept of a privacy ‘budget’, and ultimately, the use of differential privacy is a careful tradeoff between privacy preservation and model utility.
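As a simple illustration of this noise injection, the sketch below applies the classic Laplace mechanism to a counting query. The dataset, predicate, and epsilon value are arbitrary choices for this example; real systems should rely on a vetted library rather than hand-rolled noise.

```python
import numpy as np

def private_count(data, predicate, epsilon):
    """Release a differentially private count.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon yields epsilon-differential privacy.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many patients in this (toy) dataset are over 60?
ages = [34, 67, 71, 45, 62, 58, 80]
print(private_count(ages, lambda age: age > 60, epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy, which is exactly the privacy/utility trade-off described above.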

Stay tuned for our detailed series on What is Differential Privacy?

What is PyDP?

In short: PyDP is a Python wrapper for Google's Differential Privacy project.

Python has incredible adoption around the world and has become a tool of choice for many data scientists and machine learning experts. Making differential privacy accessible to this ecosystem is a priority for OpenMined. The library provides a set of ε-differentially private algorithms, which can be used to produce aggregate statistics over numeric data sets containing private or sensitive information.
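As a rough sketch of what using the library looks like, the example below computes a differentially private mean over a toy income list. Class names and constructor arguments can differ between PyDP versions, so treat this as illustrative and check the repo's examples for the exact API.

```python
# Illustrative only: module path and constructor arguments may vary by PyDP version.
from pydp.algorithms.laplacian import BoundedMean

incomes = [52000.0, 61000.0, 48000.0, 75000.0, 39000.0]

# epsilon sets the privacy budget; the bounds clamp each individual's contribution.
dp_mean = BoundedMean(epsilon=1.0, lower_bound=30000, upper_bound=100000)
print(dp_mean.quick_result(incomes))
```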

For more information, check out the PyDP repo on Github. Stay tuned for more posts on PyDP.

What is Differential Privacy by Shuffling?

In short: The shuffler is a separate service that is responsible for receiving, grouping, and shuffling the data. Shuffling isn’t a privacy model in itself but a layer that can be compatible with various existing privacy strategies.

Differential privacy has been established as the gold standard for measuring and guaranteeing data privacy, but putting it into practice has proved challenging until recently. Practitioners often face a difficult choice between privacy and accuracy. Privacy amplification by shuffling is a relatively new idea that aims to provide greater accuracy while preserving privacy by shuffling batches of similar data. This approach has the potential to allow for richer, more reliable data analysis while preserving privacy.
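To give a flavor of the idea, here is a toy sketch that uses plain randomized response as the local step: each client randomizes its own report, a shuffler breaks the link between reports and senders by permuting them, and the analyst debiases the aggregate. Real shuffle-model deployments use carefully analysed local randomizers and parameters; the values here are illustrative only.

```python
import random

P_FLIP = 0.25  # probability each client flips its bit (local randomization)

def local_randomizer(true_bit):
    """Each client randomizes its own report before sending it (randomized response)."""
    return true_bit if random.random() > P_FLIP else 1 - true_bit

def shuffler(reports):
    """The shuffler only permutes reports, hiding which client sent which."""
    shuffled = list(reports)
    random.shuffle(shuffled)
    return shuffled

# Clients each hold a sensitive yes/no attribute (true rate ~30%).
true_bits = [int(random.random() < 0.3) for _ in range(10000)]
reports = shuffler(local_randomizer(b) for b in true_bits)

# The analyst sees only anonymized, noisy reports and debiases the average.
observed = sum(reports) / len(reports)
estimate = (observed - P_FLIP) / (1 - 2 * P_FLIP)
print(f"estimated fraction: {estimate:.3f}")
```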

For a more in-depth explanation, see What is Differential Privacy by Shuffling?

Homomorphic Encryption

In short: Homomorphic encryption allows you to make your data unreadable yet still do math on it.

Homomorphic encryption (HE), as opposed to traditional encryption methods, allows meaningful calculations to be performed on encrypted data. When using homomorphic encryption, data can be encrypted by its owner and sent to the model owner, who runs the computation. For example, the model owner could apply a trained classification model to encrypted patient data and send the encrypted result (e.g. a disease prediction) back to the patient. Notably, the model weights don’t need to be encrypted here, as the computation happens on the model owner’s side. There are currently restrictions on the types of calculations that can be performed using homomorphic encryption, and the computational performance is still very far from that of traditional techniques.
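To illustrate the patient-data example, the sketch below uses OpenMined's TenSEAL library (CKKS scheme) to encrypt a patient's feature vector and evaluate a plaintext linear model on it. The encryption parameters and the toy weights are arbitrary choices for this example.

```python
import tenseal as ts

# The data owner creates the encryption context and keeps the secret key.
context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[60, 40, 40, 60])
context.global_scale = 2 ** 40
context.generate_galois_keys()

# Encrypt the patient's features before sending them to the model owner.
patient_features = [0.2, 1.5, 3.1, 0.7]
enc_features = ts.ckks_vector(context, patient_features)

# The model owner evaluates a (plaintext) linear model on the ciphertext.
weights, bias = [0.5, -1.2, 0.8, 0.3], [0.1]
enc_score = enc_features.dot(weights) + bias

# Only the data owner, who holds the secret key, can read the result.
print(enc_score.decrypt())
```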

For more information and a code demonstration, see What is Homomorphic Encryption?

You can check out OpenMined's TenSEAL library for doing homomorphic encryption operations on tensors on GitHub.

You might also be interested in: Homomorphic Encryption in PySyft with SEAL and PyTorch, Build an Homomorphic Encryption Scheme from Scratch with Python

The Paillier cryptosystem, invented by Pascal Paillier in 1999, is a partial homomorphic encryption scheme which allows two types of computation:

  • addition of two ciphertexts
  • multiplication of a ciphertext by a plaintext number
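As a quick illustration of those two operations, the sketch below uses the third-party python-paillier (phe) package, assuming it is installed; the plaintext values are arbitrary.

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

enc_a = public_key.encrypt(15)
enc_b = public_key.encrypt(27)

enc_sum = enc_a + enc_b      # addition of two ciphertexts
enc_scaled = enc_a * 3       # multiplication of a ciphertext by a plaintext number

print(private_key.decrypt(enc_sum))     # 42
print(private_key.decrypt(enc_scaled))  # 45
```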

For a detailed explanation, please see What is the Paillier Cryptosystem?

What is Private Set Intersection?

In short: If two parties want to test if their datasets contain a matching value, but don’t want to ‘show’ their data to each other, they can use private set intersection to do so.

Private set intersection (PSI) is a powerful cryptographic technique that enables two parties, each holding a set of data points, to compare these data sets without exposing their raw data to the other party (and thus without sacrificing their individual data privacy). In other words, PSI allows us to test whether the parties share common data points (such as a location, ID, etc.) - the result is a third data set containing only those elements that both parties have in common.

For more information and a code demonstration, see What is Private Set Intersection?

You can also check out OpenMined's PSI library on GitHub.

You might also like to see how a PSI protocol can be built using the Paillier cryptosystem in Private Set Intersection with the Paillier Cryptosystem.

The Diffie-Hellman key exchange protocol allows two parties to agree on a single secret without an eavesdropper discovering what it is, and without revealing their respective private keys to each other. For more detail, please see What is the Diffie-Hellman key exchange protocol?
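Here is a toy sketch of the exchange with deliberately tiny numbers (real deployments use large safe primes or elliptic curves):

```python
import random

# Publicly agreed parameters: a prime modulus p and a base g.
# These toy values are far too small for real security.
p, g = 23, 5

# Each party keeps a private key secret...
alice_private = random.randint(2, p - 2)
bob_private = random.randint(2, p - 2)

# ...and sends only the corresponding public value over the wire.
alice_public = pow(g, alice_private, p)
bob_public = pow(g, bob_private, p)

# Both sides derive the same shared secret without ever revealing
# their private keys to each other or to an eavesdropper.
alice_secret = pow(bob_public, alice_private, p)
bob_secret = pow(alice_public, bob_private, p)
assert alice_secret == bob_secret
```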

You might also like to see how a PSI protocol can be built using the Diffie-Hellman key exchange protocol in Private Set Intersection with Diffie-Hellman.

What is Secure Multi-Party Computation?

In short: Secure multi-party computation allows multiple parties to collectively perform some computation and receive the resulting output without ever exposing any party’s sensitive input.

Secure multi-party computation (SMPC) is a method that allows separate parties to jointly compute a common function while keeping both the inputs and the function parameters private. It allows a model to be trained or applied to data from different sources without disclosing the training data items or the model’s weights. It relies on splitting each value into shares which, when summed, reconstruct the original value. SMPC is computationally less intensive than HE, but requires a lot of communication between the parties, so bandwidth can be a bottleneck.
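The idea of shares that sum back to the original value can be sketched in a few lines; this is additive secret sharing over a finite field, with the modulus and party count chosen arbitrarily for illustration.

```python
import random

Q = 2 ** 61 - 1  # a large prime modulus; all arithmetic happens mod Q

def share(secret, n_parties=3):
    """Split a secret into random-looking shares that sum back to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Two private inputs, secret-shared across three compute servers.
x_shares = share(25)
y_shares = share(17)

# Each server adds its own shares locally -- no single server ever sees x or y.
sum_shares = [(a + b) % Q for a, b in zip(x_shares, y_shares)]
print(reconstruct(sum_shares))  # 42, revealed only when the shares are recombined
```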

For more information and a code demonstration, see What is Secure Multi-Party Computation?

What is CrypTen?

In short: CrypTen is a framework developed by Facebook Research for Privacy Preserving Machine Learning built on PyTorch.

The goal of CrypTen is to make secure computing techniques accessible to Machine Learning practitioners and efficient for server-to-server interactions. It currently implements Secure Multi-Party Computation as its secure computing backend. More information can be found on the project repo on Github.
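Here is a minimal sketch of what working with CrypTen looks like; API details may differ between versions, so the repo's tutorials are the authoritative reference.

```python
import torch
import crypten

crypten.init()

# Encrypt tensors; under the hood CrypTen secret-shares them between parties.
x_enc = crypten.cryptensor(torch.tensor([1.0, 2.0, 3.0]))
w_enc = crypten.cryptensor(torch.tensor([0.5, 0.5, 0.5]))

# Familiar PyTorch-style operations run directly on the encrypted values.
y_enc = (x_enc * w_enc).sum()

# Decryption is an explicit, deliberate step.
print(y_enc.get_plain_text())
```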

For more information, see What is CrypTen? / CrypTen Integration into PySyft

What is a Split Neural Network (SplitNN)?

In short: The training of the neural network (NN) is ‘split’ across two or more hosts.

Traditionally, PySyft has been used to facilitate federated learning. However, we can also leverage the tools included in this framework to implement distributed neural networks. These allow researchers to process data held remotely and compute predictions in a radically decentralised way.
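The core idea can be sketched in plain PyTorch, leaving out the communication and PySyft-specific plumbing; the layer sizes and toy batch below are arbitrary. The data owner runs the first segment of the network, sends only the intermediate activations to the server, and receives back only the gradient of those activations.

```python
import torch
import torch.nn as nn

# The data owner holds the first segment of the network...
client_model = nn.Sequential(nn.Linear(20, 32), nn.ReLU())
# ...and the compute server holds the rest.
server_model = nn.Sequential(nn.Linear(32, 2))

client_opt = torch.optim.SGD(client_model.parameters(), lr=0.1)
server_opt = torch.optim.SGD(server_model.parameters(), lr=0.1)

x = torch.randn(8, 20)         # raw data stays with the client
y = torch.randint(0, 2, (8,))  # labels (held by the client in this sketch)

# Forward pass: only intermediate activations cross the boundary.
activations = client_model(x)
to_server = activations.detach().requires_grad_()
output = server_model(to_server)

# Backward pass: the server sends back only the gradient of the activations.
loss = nn.functional.cross_entropy(output, y)
client_opt.zero_grad(); server_opt.zero_grad()
loss.backward()                       # backprop through the server segment
activations.backward(to_server.grad)  # continue backprop through the client segment
client_opt.step(); server_opt.step()
```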

For more information and a code demonstration, see What is a Split Neural Network?

What is PyVertical?

In short: PyVertical uses private set intersection (PSI) to link datasets in a privacy-preserving way. We train SplitNNs on the vertically partitioned data to ensure the data remains separated throughout the entire process.

For a detailed explanation, please see What is PyVertical?

What are Zero Knowledge Proofs?

In short: A Zero Knowledge Proof (ZKP) is a mathematical method to prove that one party possesses something without actually revealing the information.
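As a taste of the idea, here is a toy Schnorr-style interactive proof that the prover knows a secret exponent x with y = g^x mod p, without ever revealing x. The numbers are far too small for real security and the protocol is simplified for illustration.

```python
import random

# Toy public parameters: a prime p and a base g (far too small for real use).
p, g = 467, 2
q = p - 1  # exponents can be reduced mod p - 1 (Fermat's little theorem)

x = 153            # the prover's secret
y = pow(g, x, p)   # public value derived from the secret

# 1. Commitment: the prover picks a random nonce and sends t = g^r mod p.
r = random.randrange(1, q)
t = pow(g, r, p)

# 2. Challenge: the verifier replies with a random challenge c.
c = random.randrange(1, q)

# 3. Response: the prover answers with s = r + c*x mod (p - 1).
s = (r + c * x) % q

# Verification: g^s must equal t * y^c mod p, yet the transcript (t, c, s)
# does not reveal the secret x on its own.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
```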

Stay tuned for What are Zero Knowledge Proofs? In the meantime, you can check out OpenMined's Python library for Zero Knowledge Proofs on GitHub.

Protecting the model

Note: While Federated Learning and Differential Privacy can be used to protect data owners from loss of privacy, they are insufficient to protect a model from theft or misuse by the data owner. Federated Learning, for example, requires that a model owner send a copy of the model to many data owners, putting the model at risk of IP theft or sabotage through data poisoning. Encrypted computation can be used to address this risk by allowing the model to train while in an encrypted state. The most well-known methods of encrypted computation are Homomorphic Encryption, Secure Multi-Party Computation, and Functional Encryption.

What is Encrypted Machine Learning as a Service?

In short: Instead of merely providing MLaaS that might be leaky, service providers can offer EMLaaS (Encrypted Machine Learning as a Service) to assure customers that their data remains secure.

Today, some cloud operators offer Machine Learning as a Service (MLaaS). Service providers don’t want to reveal their models, which remain black boxes to customers. Conversely, due to data sensitivity, customers may not want to share their raw data through API calls. Encrypted machine learning can help protect both the data and the model through encryption.

For more information, see What is Encrypted Machine Learning as a Service?


Looking Deeper

In this blog series, we'll show how federated learning can give us access to the data we need to train models, and how homomorphic encryption, encrypted deep learning, secure multi-party computation, and differential privacy can protect the privacy of your clients. In these links, you'll find example code for each technique used to build modern privacy-preserving data applications.

These links will have plenty of code snippets to get you started with your use case, and links to other resources to go into the weeds of privacy-preserving ML.


OpenMined would like to thank Antonio Lopardo, Emma Bluemke, Théo Ryffel, Nahua Kang, Andrew Trask, Jonathan Lebensold, Ayoub Benaissa, Madhura Joshi, Shaistha Fathima, Nate Solon, Robin Röhm, Sabrina Steinert, Michael Höh, Ben Szymkow, Laura Ayre, Mir Mohammad Jaber, Adam J Hall, and Will Clark for their contributions to various parts of this series. We'd also like to thank Bennett Farkas and Kyoko Eng from the OpenMined design team for graphics!