There has been an extraordinary amount of hype around OpenAI’s recent GPT-3 API launch — and it is well-deserved.
GPT-3 is a remarkable 175-billion-parameter natural language processing (NLP) model trained on a web corpus. It has led to a broad swath of intriguing text generation examples — see a few here. We may be entering a decade of incredible NLP impact.
The web is the perfect initial training ground for text generation — so much of our lives, especially in these Covid times, is being digitized there. This incredibly large corpus is showing the potential to generate realistic text summaries, establish basic Q&A systems, write legal contracts, pre-write articles/stories/content for editing, and so much more.
But some of the most interesting and important use cases of natural language processing will require training on our sensitive data beyond the public web corpus. Whether it be related to medical assistance, therapy, or simply how we interact in our private lives, powerful applications for GPT-3 will arise that require training on sensitive data. To use that data, we will need robust privacy-preserving data science.
Let’s take the example of building an interactive nurse AI, which ideally would help anyone on the planet get meaningful medical help with just a smartphone. Certainly, there are many medical help sites, nurse FAQs, etc. on the web that could be leveraged as a starting corpus for such an assistant. An ‘interactive nurse AI’ built this way might be a user interface improvement, but quality-wise it wouldn’t be much better than our current search and webpage FAQ approach.
The true magic of our nurses naturally lies within the walls of hospitals, doctor offices, and medical clinics. They have ‘trained on’ years of specific, precious long-tail patient interactions, and their knowledge of critical medical issues lives on protected servers that meet compliance regulations. That might mean noticing that slightly slurred speech in a diabetic patient with certain characteristics is an early warning sign of a specific type of stroke. These nuanced details aren’t always recorded in public web corpora. For an NLP ‘nurse assistant’, therapist, etc. to be superior to our current search and webpage FAQ approach, the training data likely also needs to come from these sensitive sources.
Fortunately, there are ways to train on such sensitive data securely. This is the mission of our OpenMined community — answering questions with “data you cannot see”. We are providing open-source libraries for reliable, safe, and robust privacy-preserving techniques. These include libraries for federated learning, secure multi-party computation, differential privacy, homomorphic encryption, private set intersection, and many more.
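To make one of these techniques concrete, here is a minimal sketch of differential privacy’s Laplace mechanism in Python. The patient ages, the `laplace_count` helper, and the `epsilon` value are all illustrative assumptions for this post, not part of any OpenMined library — the idea is simply that an analyst can learn an approximate count without the answer revealing whether any single person’s record was in the data.

```python
import numpy as np

def laplace_count(values, predicate, epsilon, rng=None):
    """Differentially private count via the Laplace mechanism.

    Adding or removing one record changes a count by at most 1
    (sensitivity = 1), so adding noise drawn from
    Laplace(scale = 1 / epsilon) yields an epsilon-DP answer.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical patient ages -- made-up illustrative data, not real records.
ages = [34, 67, 45, 71, 52, 29, 80, 63]

# "How many patients are 65 or older?" answered with privacy noise.
noisy = laplace_count(ages, lambda a: a >= 65, epsilon=1.0)
```

Each query spends some of a privacy budget (`epsilon`): smaller values add more noise and give stronger privacy. Production systems track this budget across queries rather than answering each one independently.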
You can learn more about privacy-preserving techniques in this blog series, and learn more about OpenMined at openmined.org. OpenMined is a community of 8,600+ engineers, researchers, marketers, and hackers dedicated to lowering the barrier to entry to private AI technologies through open-source code and free education.
To learn more about the state of privacy tech, join us at our OpenMined Privacy Conference, PriCon — a free event organized by the OpenMined community covering all aspects of privacy-related technology research, deployments, and issues. If you’d like to sponsor our conference — each donation directly sponsors a grant for an open-source developer — you can do so here, or email firstname.lastname@example.org!