Many of the issues that we want to solve today with data science require access to sensitive, personal information - be it our medical history, financial records, or private habits. Every day, people like you and me produce a vast amount of data on our smartphones, electronic devices, or medical equipment. But because of privacy or proprietary concerns, data for tackling meaningful problems can be limited and difficult to access.
Can we perform data science without intruding on our individual privacy? If so, what technologies can we combine to make it possible?
Traditionally, training a model would require transferring this data to a central server, but this raises numerous concerns about the privacy and security of the data. The risks from data leaks and misuse have led various parts of the world to enact data protection laws. To perform data science in domains that require private data while abiding by data privacy laws and minimizing risks, machine learning researchers have harnessed solutions from privacy and security research, developing the field of private and secure data science.
Private and secure machine learning (ML) is heavily inspired by cryptography and privacy research. It consists of a collection of techniques that allow models to be trained without having direct access to the data and that prevent these models from inadvertently storing sensitive information about the data.
Private and secure ML is performed in practice using a combination of techniques, though each method has limitations and costs. Some techniques would be overly burdensome in contexts where data and model owners already trust each other (e.g. when employees inside a company train models on company-internal data), while others would be insufficiently secure for contexts that need to protect data and models from the actions of malicious actors. An appropriate mix of techniques for a specific project can only be decided once the various trade-offs of the techniques are clearly communicated to the data-holders and key stakeholders of the project.
In this blog series, we’ll explain common topics in privacy-preserving data science. We'll distill each topic to a single sentence and a quick overview in this introductory page, and in the follow-up posts, you'll find further details and code demonstrations of each technique.
We hope these posts serve as a useful resource for you to figure out the best techniques for use cases in your organization.
Privacy Techniques: One Sentence Summaries
Federated Learning
In short: Federated learning means training your machine learning model on data that is stored on different devices or servers across the world, without having to centrally collect the data samples.
Instead of moving the data to the model, copies of the global model are sent to where the data is located. The local data samples remain at their source devices, say a smartphone or a hospital server. A model is sent to the device and trained on the local data, after which the newly improved model with its updates is sent back to the main server to be aggregated with the main model.
This preserves privacy in the sense that the data has not been moved from the device. However, there is still a limitation: the content of the local data can sometimes be inferred from the weight updates or improvements in the model. While individual clients are not able to reconstruct samples, an "honest-but-curious" server could. To prevent the possibility of inferring personal characteristics from the data, further techniques can be employed, such as differential privacy or encrypted computation.
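To make the round trip described above concrete, here is a minimal sketch of federated averaging on a one-feature linear model. Everything in it is an assumption made for illustration: the function names `local_update` and `federated_round`, the synthetic client data, the learning rate and the number of rounds. It is not the API of any particular federated learning library, and it does nothing to hide the weight updates from the server.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Train a copy of the model on one client's data with plain gradient descent."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)

def federated_round(global_weights, clients):
    """Send the global model to every client, then average what comes back.

    Only model weights travel to the server; each client's data stays local.
    The average is weighted by how many samples each client holds.
    """
    total = sum(len(data) for data in clients)
    new_w = new_b = 0.0
    for data in clients:
        w, b = local_update(global_weights, data)
        share = len(data) / total
        new_w += share * w
        new_b += share * b
    return (new_w, new_b)

# Synthetic data (made up for this sketch): every client sees y = 3x + 0.5 plus noise.
rng = random.Random(0)

def make_client(n):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 0.5 + rng.gauss(0, 0.01)) for x in xs]

clients = [make_client(n) for n in (20, 50, 30)]

global_weights = (0.0, 0.0)
for _ in range(100):
    global_weights = federated_round(global_weights, clients)

print(global_weights)  # should end up close to (3.0, 0.5)
```

Note that the server in this sketch sees every client's weights in the clear, which is exactly the leakage channel discussed above.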
For more information and a code demonstration, see What is Federated Learning?
Differential Privacy
In short: Sometimes, AI models can memorize details about the data they've trained on and could 'leak' these details later on. Differential privacy is a framework (using math) for measuring this leakage and reducing the possibility of it happening.
Often, deep neural networks are over-parameterized, meaning that they can encode more information than is necessary for the prediction task. The result is a machine learning model that can inadvertently memorize individual samples. For example, a language model designed to emit predictive text (such as the next-word suggestions seen on smartphones) can be probed to release information about individual samples that were used for training (“my social security number is …”).
Differential privacy is a mathematical framework for measuring this leakage. Differential privacy describes the following promise to data owners: "you will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, datasets, or information sources are available".
A critical aspect of this definition is the guarantee of privacy no matter what other studies, datasets or information sources are available to the attacker - it’s been well-publicized that two or more ‘anonymized’ datasets can be combined to successfully infer and de-anonymize highly private information. This is known as a ‘linkage’ attack, and presents a serious risk given the abundance of data so easily available to attackers today (examples: the infamous Netflix prize attack, health records being re-identified). Differential privacy, however, is more robust than simple dataset anonymization in that it quantifies the risk that such de-anonymization can occur, empowering a data owner with the ability to minimize the risk.
Differential privacy works by injecting a controlled amount of statistical noise to obscure the data contributions from individuals in the dataset. This is performed while ensuring that the model still gains insight into the overall population, and thus provides predictions that are accurate enough to be useful. Research in this field allows the degree of privacy loss to be calculated and evaluated based on the concept of a privacy ‘budget’, and ultimately, the use of differential privacy is a careful tradeoff between privacy preservation and model utility.
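As a small illustration of noise injection and the privacy budget, here is a sketch of the Laplace mechanism applied to a counting query. The names `PrivacyBudget` and `private_count`, the list of ages and the epsilon values are all made up for this example. The sketch assumes the simplest accounting rule (the epsilons of successive queries add up) and is not hardened for production use.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

class PrivacyBudget:
    """Tracks the remaining epsilon, assuming the epsilons of queries simply add up."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise ValueError("privacy budget exhausted")
        self.remaining -= epsilon

def private_count(records, predicate, epsilon, budget, rng):
    """Answer 'how many records match?' with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1 / epsilon is enough.
    """
    budget.spend(epsilon)
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 29, 62, 57, 33, 48]  # made-up records
budget = PrivacyBudget(total_epsilon=1.0)

noisy = private_count(ages, lambda age: age >= 40, 0.5, budget, rng)
print(noisy)             # the true count (4) plus random noise
print(budget.remaining)  # half of the budget is left for further queries
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon gives a more accurate answer but spends the budget faster. That is the privacy versus utility tradeoff in miniature.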
Homomorphic Encryption
In short: Homomorphic encryption allows you to make your data unreadable yet still do math on it.
Homomorphic encryption (HE), as opposed to traditional encryption methods, allows meaningful calculations to be performed on encrypted data. When using homomorphic encryption, data can be encrypted by its owner and sent to the model owner to run computation. For example, the model owner could apply a trained classification model to encrypted patient data and send the encrypted result (e.g. a prediction of a disease) back to the patient, who is the only one able to decrypt it. Notably, the model weights don’t need to be encrypted here as the computation happens on the model owner’s side. There are currently restrictions on the type of calculations that can be performed using homomorphic encryption, and computation on encrypted data is still far slower than the same computation on plaintext.
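To show what "doing math on encrypted data" can look like, here is a toy sketch of the Paillier cryptosystem, a scheme that supports adding ciphertexts and multiplying a ciphertext by a plaintext number. That is enough for the model owner to evaluate a linear score on encrypted features with plaintext weights. The key size is far too small to be secure, the helper names (`keygen`, `encrypt`, `decrypt`, `add`, `mul_plain`) are our own, and the features and weights are made-up integers; a real system would use a vetted HE library.

```python
import math
import random

def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier keys from two known primes (far too small to be secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because we use the generator g = n + 1
    return n, (lam, mu)   # public key, private key

def encrypt(n, m, rng=random.SystemRandom()):
    """Encrypt integer m under public key n: c = (1 + m*n) * r^n mod n^2."""
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (1 + (m % n) * n) * pow(r, n, n * n) % (n * n)

def decrypt(n, private_key, c):
    lam, mu = private_key
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m  # map the upper half back to negatives

def add(n, c1, c2):
    """Ciphertext of (m1 + m2): multiply the two ciphertexts."""
    return c1 * c2 % (n * n)

def mul_plain(n, c, k):
    """Ciphertext of (k * m): raise the ciphertext to the plaintext k."""
    return pow(c, k % n, n * n)

# Patient side: encrypt integer features and send only ciphertexts.
n, private_key = keygen()
features = [5, 3, 2]
encrypted_features = [encrypt(n, x) for x in features]

# Model owner side: plaintext weights, no access to the private key.
weights, bias = [2, -1, 4], 7
encrypted_score = encrypt(n, bias)
for c, w in zip(encrypted_features, weights):
    encrypted_score = add(n, encrypted_score, mul_plain(n, c, w))

# Patient side: decrypt the result.
print(decrypt(n, private_key, encrypted_score))  # 2*5 - 1*3 + 4*2 + 7 = 22
```

Paillier only supports addition and multiplication by plaintext constants, which illustrates the restriction on the kinds of calculation mentioned above: non-linear steps such as activation functions need a different scheme or an approximation.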
Secure Multi-Party Computation
In short: Secure multi-party computation allows multiple parties to collectively perform some computation and receive the resulting output without ever exposing any party’s sensitive input.
Secure multi-party computation (SMPC) is a method that allows separate parties to jointly compute a common function while keeping both the inputs and the function parameters private. It allows a model to be trained or applied to data from different sources without disclosing the training data items or the model’s weights. It relies on splitting each value into shares which, when summed, reconstruct the original value, while any single share on its own reveals nothing about it. SMPC is computationally less intensive than HE, but requires a lot of communication between the parties, so bandwidth can be a bottleneck.
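The idea of shares that sum back to the original value can be sketched in a few lines with additive secret sharing. The modulus `Q`, the helper names `share` and `reconstruct`, and the three input values are assumptions made for this example. The sketch only covers a secure sum and assumes all parties follow the protocol honestly.

```python
import secrets

Q = 2**61 - 1  # all arithmetic is done modulo this number (chosen for the sketch)

def share(secret, n_parties=3):
    """Split a value into n random-looking shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Three parties each hold a private number (made-up values).
inputs = [120, 75, 230]

# Each party splits its input and hands one share to every party.
all_shares = [share(x) for x in inputs]

# Party j adds up the shares it received. It never sees another party's input.
partial_sums = [sum(s[j] for s in all_shares) % Q for j in range(3)]

# Only the combined partial sums reveal the result.
total = reconstruct(partial_sums)
print(total)  # 425
```

Addition needs no interaction beyond distributing the shares. Multiplying two shared values requires extra rounds of messages between the parties, which is where the communication cost mentioned above comes from.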
For more information and a code demonstration, see What is Secure Multi-Party Computation?
Private Set Intersection
In short: If two parties want to test if their datasets contain a matching value, but don’t want to ‘show’ their data to each other, they can use private set intersection to do so.
Private set intersection (PSI) is a powerful cryptographic technique which enables two parties, each holding a set of data points, to compare these data sets without exposing their raw data to the other party (which would sacrifice their individual data privacy). In other words, PSI allows us to test whether the parties share a common datapoint (such as a location, ID, etc) - the result is a third data set containing only those elements which both parties have in common.
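One classic way to build PSI is to have both parties "blind" hashed items with secret exponents: because exponentiation commutes, an item blinded by both parties looks the same no matter who went first, and only doubly blinded values are ever compared. The sketch below follows that idea under loud assumptions: the modulus `P` is a small, well-known prime picked for readability and is not a secure choice of group, the helper names are hypothetical, and the email addresses are invented. Real deployments use carefully chosen groups (typically elliptic curves) and audited libraries.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # toy prime modulus; NOT a secure group for real use

def hash_to_group(item):
    """Map an item to a non-zero number modulo P."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % P or 1

def random_key():
    """Secret exponent coprime to P - 1, so blinding never merges distinct items."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def blind(items, key):
    return [pow(hash_to_group(x), key, P) for x in items]

def reblind(values, key):
    return [pow(v, key, P) for v in values]

alice_items = ["ann@example.com", "bob@example.com", "cat@example.com"]
bob_items = ["bob@example.com", "dan@example.com", "cat@example.com"]
a, b = random_key(), random_key()  # each key stays with its owner

# 1. Alice sends H(x)^a to Bob; Bob raises each value to b, keeping the order.
alice_double = reblind(blind(alice_items, a), b)
# 2. Bob sends H(y)^b for his own items; Alice raises each value to a.
bob_double = set(reblind(blind(bob_items, b), a))
# 3. Alice compares the doubly blinded values and keeps her matching items.
intersection = [x for x, v in zip(alice_items, alice_double) if v in bob_double]
print(intersection)  # ['bob@example.com', 'cat@example.com']
```

In this variant only Alice learns the intersection; Bob sees nothing but blinded values.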
For more information and a code demonstration, see What is Private Set Intersection?
Protecting the model
Note: While Federated Learning and Differential Privacy can be used to protect data owners from loss of privacy, they are insufficient to protect a model from theft or misuse by the data owner. Federated Learning, for example, requires that a model owner send a copy of the model to many data owners, putting the model at risk of IP theft or sabotage through data poisoning. Encrypted computation can be used to address this risk by allowing the model to train while in an encrypted state. The most well known methods of encrypted computation are Homomorphic Encryption, Secure Multi-Party Computation, and Functional Encryption.
In this blog series, we'll show how federated learning can give us access to the data we need to train a model, and how homomorphic encryption, encrypted deep learning, secure multi-party computation and differential privacy can protect the privacy of your clients.
The links below have plenty of code snippets to get you started with your use case, as well as pointers to other resources for going into the weeds of privacy-preserving ML.
- What is Federated Learning?
- What is Private Set Intersection?
- What is Secure Multi-Party Computation?
- What is Differential Privacy?
- What is Encrypted Deep Learning?
- What is Homomorphic Encryption?
OpenMined would like to thank Antonio Lopardo, Emma Bluemke, Théo Ryffel, Nahua Kang, Andrew Trask, Jonathan Lebensold, Ayoub Benaissa, Madhura Joshi, Shaistha Fathima, Nate Solon, Robin Röhm, Sabrina Steinert, Michael Höh and Ben Szymkow for their contributions to various parts of this series.