The number of compelling use cases for differential privacy grows each day. Engineers at OpenMined have been busy building libraries that make industry-tested implementations more accessible to developers.
In this latest instalment of OpenMined Dev Diaries, we talk to Chinmay Shah, Python lead for OpenMined's Differential Privacy team, about his experiences building PyDP, a Python API for Google's Differential Privacy library.
Why bring differential privacy to Python?
Python has incredible adoption around the world and has become a tool of choice for many data scientists and machine learning experts. Making differential privacy accessible to their ecosystem is a priority for OpenMined. COVID-19 certainly added to the urgency of our efforts, but differential privacy for Python was already one of the capabilities most requested by our community.
What were some of the lessons your team learned along the way?
Bringing Google's differential privacy library to Python in a way that is palatable to our intended user base was fraught with engineering challenges:
- The first question we needed to answer was the choice of binding/wrapping framework. SWIG looked good, but we ran into some problems with C++ templates. We also tried a host of others but eventually landed on pybind11. This library gave our team the flexibility to really take control of what our API looked like, integrated well with our build system, and had great documentation.
- The Google library uses the Bazel build system, but our team was more comfortable with CMake. After some experimentation with CMake, it was pretty clear that we were on the wrong path. Once we learned how to use Bazel, it really accelerated our work!
- Google's DP library makes extensive use of C++ templates. Figuring out the nuances of pybind11 bindings, templates, and casting between C++ and Python types turned out to be a lot harder than we expected. After lots of sleepless nights, we made a breakthrough when we started looking at how other popular projects, such as TensorFlow, had solved the same problem.
What's next?
These days, we are spending time fixing bugs, improving benchmarking and validation, and collaborating with our research team to ensure that the code is fit for purpose in the most common production scenarios.
Beyond PyDP, we plan to develop plugins for popular databases and to broaden our offering with additional perturbation mechanisms (e.g. Gaussian and exponential) and support for more scenarios.
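For readers new to the term, a perturbation mechanism adds calibrated random noise to a query result before it is released. The snippet below is a minimal, standard-library-only sketch of the textbook Laplace and Gaussian mechanisms. It is not PyDP's API or Google's implementation; the function names (`laplace_mechanism`, `gaussian_sigma`, `gaussian_mechanism`) are hypothetical and chosen purely for illustration. The Gaussian calibration uses the classic bound, which assumes an epsilon strictly between 0 and 1.

```python
import math
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value plus Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centred on zero with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise standard deviation from the classic (epsilon, delta) bound."""
    if not 0 < epsilon < 1:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    if sensitivity <= 0:
        raise ValueError("sensitivity must be positive")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(true_value, sensitivity, epsilon, delta, rng=random):
    """Return true_value plus zero-mean Gaussian noise."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return true_value + rng.gauss(0.0, sigma)


if __name__ == "__main__":
    # Hypothetical example: a count query, where adding or removing one
    # person changes the result by at most 1 (sensitivity = 1).
    true_count = 1234
    print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
    print(gaussian_mechanism(true_count, sensitivity=1.0, epsilon=0.5, delta=1e-5))
```

A sketch like this is useful for building intuition only. Sampling noise with ordinary floating-point arithmetic is known to leak information in subtle ways, which is one reason to rely on a vetted library for production use.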
How can you help?
If you want to start contributing to PyDP, why not try your hand at a good first issue? Feel free to join in the conversation on our Slack community as well. Join #lib_pydp to get started!