Today, we are proud to announce the Beta version of PipelineDP, a Python framework for differential privacy that OpenMined built in close collaboration with Google.
What is PipelineDP?
PipelineDP is a Python framework for applying central differential privacy to large datasets using batch processing systems such as Apache Spark and Apache Beam. It can also run locally without any batch processing system, which is convenient for small datasets. The “Pipeline” in PipelineDP reflects its support for running data processing pipelines (such as Beam or Spark pipelines) with differential privacy.
PipelineDP aims to make differential privacy accessible to non-experts. It:
- provides a convenient API familiar to Spark or Beam developers;
- encapsulates the complexities of differential privacy, such as protection of outliers and rare categories, generation of safe noise and privacy budget accounting;
- supports many standard computations, such as count, sum, and average (with more metrics planned).
PipelineDP builds upon previous open-source work:
- it uses low-level differential privacy primitives from our PyDP library;
- it is conceptually similar to Privacy On Beam, a framework for Go; the key differences are availability for Python developers and extensibility to arbitrary data processing systems.
Example of applying PipelineDP
The goal of PipelineDP is to make private processing as easy and scalable as regular processing using systems such as Spark or Beam. The simplified example below contrasts regular and private processing.
Let’s consider how to compute the number of views per movie in the Netflix Prize dataset (which can be downloaded here). The dataset consists of movie views, each of which can be represented as a simple Python record.
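As a minimal sketch, one view could be modeled as a dataclass. The class name `MovieView` and the field names below are assumptions made for illustration; the Netflix Prize data does record a user, a movie, and a rating for each view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MovieView:
    """One record of the dataset: a user rated (viewed) a movie."""
    user_id: int   # who watched; a natural candidate for the privacy id
    movie_id: int  # what was watched; the key the counts are grouped by
    rating: int    # 1 to 5 stars


# A single hypothetical record.
example_view = MovieView(user_id=1, movie_id=42, rating=5)
```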
Let’s assume the movie views are loaded into a Spark RDD named movie_views.
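For comparison, the regular (non-private) computation is a plain count per movie. The sketch below uses a Python list as a stand-in for the RDD so that it runs without Spark; the `(user_id, movie_id)` tuple layout and the function name are assumptions. On a real RDD the same idea is typically written with `map` and `reduceByKey`.

```python
from collections import Counter

# Stand-in for the movie_views RDD: (user_id, movie_id) pairs.
movie_views = [(1, 10), (2, 10), (3, 10), (1, 20), (2, 20), (3, 30)]


def count_views_per_movie(views):
    """Non-private baseline: exact number of views for every movie."""
    return Counter(movie_id for _user_id, movie_id in views)


exact_counts = count_views_per_movie(movie_views)
```

These exact counts are what a private pipeline approximates; released as-is, they carry no differential privacy guarantee.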
One difference from regular computations is that PipelineDP needs to know the privacy id of each dataset record. Usually the privacy id corresponds to the user id. The privacy id is specified by the privacy_id_extractor function.
In PipelineDP terminology, a partition is a subset of the data for which DP statistics are computed. In this case, one partition corresponds to one movie (that is, all view records for a given movie form one partition).
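Both notions can be expressed as small functions over a record. In the sketch below, `privacy_id_extractor` is the name used in the text; `partition_extractor`, the `MovieView` record, and the `group_by_partition` helper are assumptions made for illustration, not PipelineDP code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MovieView:  # hypothetical record type
    user_id: int
    movie_id: int
    rating: int


def privacy_id_extractor(view):
    # Whose privacy is protected: here, the user.
    return view.user_id


def partition_extractor(view):
    # Which group the statistic is computed for: here, the movie.
    return view.movie_id


def group_by_partition(views):
    """Map each partition key to the set of privacy ids contributing to it."""
    partitions = defaultdict(set)
    for view in views:
        partitions[partition_extractor(view)].add(privacy_id_extractor(view))
    return dict(partitions)


views = [MovieView(1, 10, 5), MovieView(2, 10, 3), MovieView(1, 20, 4)]
partitions = group_by_partition(views)
```

Here movie 10 is one partition with two contributing users, and movie 20 is another partition with one.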
Internally, PipelineDP manages all the complexities of ensuring that the result is differentially private. For example, in this case:
- PipelineDP performs contribution bounding for each privacy id: it ensures that each privacy id contributes to no more than 100 partitions. If a user has more than 100 views, 100 views are randomly sampled and the others are dropped. This is needed to limit the mechanism's sensitivity.
- Laplace noise is added. The Laplace mechanism is the default; the Gaussian mechanism is another supported option.
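The two steps above can be sketched in plain Python. This is an illustration under stated assumptions, not PipelineDP's implementation: the function names are hypothetical, each user is assumed to count at most once per movie, the noise comes from the standard `random` module (which is not suitable for real privacy guarantees), and the protection of rare partitions is left out entirely.

```python
import random
from collections import defaultdict


def bound_contributions(views, max_partitions, rng):
    """Keep at most max_partitions distinct movies per user, chosen at random."""
    movies_by_user = defaultdict(set)
    for user_id, movie_id in views:
        movies_by_user[user_id].add(movie_id)
    bounded = []
    for user_id, movies in movies_by_user.items():
        movies = sorted(movies)
        if len(movies) > max_partitions:
            movies = rng.sample(movies, max_partitions)
        bounded.extend((user_id, movie_id) for movie_id in movies)
    return bounded


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count_per_movie(views, epsilon, max_partitions, rng):
    """Bounded count per movie plus Laplace noise (illustrative only)."""
    counts = defaultdict(int)
    for _user_id, movie_id in bound_contributions(views, max_partitions, rng):
        counts[movie_id] += 1
    # After bounding, one user changes at most max_partitions counts by 1
    # each, so the L1 sensitivity of the count vector is max_partitions.
    scale = max_partitions / epsilon
    return {m: c + laplace_noise(scale, rng) for m, c in counts.items()}


rng = random.Random(0)
views = [(1, m) for m in range(5)] + [(2, 0)]  # user 1 saw 5 movies
bounded = bound_contributions(views, max_partitions=2, rng=rng)
noisy = noisy_count_per_movie(views, epsilon=1.0, max_partitions=2, rng=rng)
```

With a bound of 100 partitions, as in the text, the Laplace scale in this sketch would be 100 / epsilon, which shows why a tighter contribution bound means less noise per count.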
This is somewhat simplified: more setup is needed to define the privacy properties of the pipeline, such as the privacy budget. For more details, please check our examples for Spark and Beam, and a thorough Jupyter notebook that walks you through the main concepts.
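To give a flavor of what budget setup involves: under basic sequential composition, the epsilons of the individual aggregations add up to the total privacy cost, so a total budget has to be divided among them. The helper below is a hypothetical sketch of a weighted split, not PipelineDP's accounting API.

```python
def split_budget(total_epsilon, weights):
    """Divide total_epsilon among aggregations in proportion to their weights."""
    if total_epsilon <= 0:
        raise ValueError("total_epsilon must be positive")
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and positive")
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight
            for name, w in weights.items()}


# Give the count twice as much of the budget as the sum.
budget = split_budget(1.0, {"count": 2, "sum": 1})
```

An aggregation that receives a larger share of epsilon gets less noise, at the expense of the others.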
DP is a vast and quickly evolving area. PipelineDP currently supports only a small set of possible computations, and there is a long road ahead. We are planning to add new aggregation types (e.g. private quantiles) and to improve usability by supporting automatic tuning of parameters.
Interested in contributing?
If you’d like to help, please let us know (contact @chinmay on Slack).
- Interested in learning more about how differential privacy works? Contributing code to PipelineDP is an excellent way to get your hands dirty.
- Are you a researcher who has published a new method or improvement? Adding it to PipelineDP will make it available to the community.
- Have an exciting example of differential privacy? We’d like to hear more and add your example to our collection of examples, either as a Python script or as a Jupyter notebook.