Three Ways to Future-Proof Your Data Analytics against the Changing Regulatory Landscape

Originally published on DataFleets' Blog

Europe map at night. Provided by NASA, visibleearth.nasa.gov/view.php?id=79765.

“A ruling by the EU’s top court invalidates the key mechanism for transferring personal data from the EU to the US and imposes additional conditions for use of the standard contractual clauses.”

— Latham & Watkins, regarding Data Protection Commissioner v. Facebook Ireland Limited, Maximillian Schrems (Case C-311/18) (Schrems II)

On Thursday, July 16, 2020, the European Court of Justice invalidated the EU-US Privacy Shield, one of the key mechanisms for lawfully transferring personal data from the EU to the US. Data controllers must now examine in detail the circumstances of each transfer, the adequacy of protection in the recipient country, and the parties involved (1). More than 5,300 companies operated under the EU-US Privacy Shield, two-thirds of which were SMEs (2). Authorities such as the Berlin data commissioner have called data localization the only credible solution (3), and the International Association of Privacy Professionals (disclaimer: DataFleets is a Corporate Member) highlighted the risk of the EU becoming an "Information Island" (4).

The core issue for enterprise AI / ML initiatives is that training data typically must be “pooled” in one place. Without the ability to aggregate data from Europe, AI / ML models may degrade, including risk assessments for financial transaction monitoring and anti-money laundering (AML), and recommendation engines for what to watch, where to travel, and what to buy. All of this is compounded by two existing challenges: GDPR and COVID-19.

“Our data in Europe is essentially frozen in an iceberg by GDPR. No one in the U.S. can touch it for analytics, and our ML models are poor because of it.” — Market-leading technology and travel company

“Coronavirus broke our credit underwriting models. All the patterns changed.” — Market-leading financial services institution

We suggest that data teams use this opportunity to future-proof their analytics against the changing regulatory landscape in three ways. Let’s take them in turn.

1. Data Sovereignty

Future-proof assumption: when establishing data pipelines and architecture, data should remain resident where it was created.

Schrems II is the latest confirmation that data sovereignty is here to stay. The data economy is getting chopped up into Westphalian bits, blocking aggregation for analytics.

Definitions:

Data residency means that data “resides” or is stored in a location for regulatory purposes, such as tax regimes.
Data sovereignty is indistinguishable from data residency in practice but may denote local governance in addition to residency.
Data localization is sovereignty with an additional significant parameter: requiring data to be located exclusively in the jurisdiction where it was created.
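The distinction between residency and localization can be made concrete in a pipeline as a placement check. The following is a minimal sketch, not a legal test: the `Policy` enum, the `placement_allowed` function, and the region names are hypothetical, and it assumes residency is satisfied whenever a copy is stored in the originating jurisdiction, while localization allows storage only there.

```python
from enum import Enum


class Policy(Enum):
    # Assumption: residency is met if a copy is stored in the origin jurisdiction.
    RESIDENCY = "residency"
    # Localization: data may be stored exclusively in the origin jurisdiction.
    LOCALIZATION = "localization"


def placement_allowed(policy, origin, storage_regions):
    """Return True if storing the data in `storage_regions` satisfies `policy`."""
    regions = set(storage_regions)
    if policy is Policy.LOCALIZATION:
        return regions == {origin}
    return origin in regions


# Hypothetical usage: EU-origin data replicated to a US region.
eu_only = placement_allowed(Policy.LOCALIZATION, "eu", ["eu"])
eu_and_us = placement_allowed(Policy.LOCALIZATION, "eu", ["eu", "us"])
```

Under these assumptions, replicating EU-origin data to a US region passes a residency rule but fails a localization rule, which is why localization blocks central aggregation.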

Bloomberg asked our CEO, David Gilmore, about similar trends in the U.S. (such as California’s CCPA) and China:

… laws that require data reaped inside the country to stay there, with China being perhaps the most stringent example…More than 100 countries have some sort of data sovereignty laws in place, according to David Gilmore, chief executive officer of DataFleets Ltd., an enterprise software firm. In the U.S., state policies, such as California’s new consumer privacy law, provide further restrictions on how cloud companies handle data. “It’s just the tip of the iceberg,” he said.

According to Bart Willemsen, Vice President Analyst at Gartner:

… by 2023, 65% of the world’s population will have its personal information covered under modern privacy regulations, up from 10% today

… by 2023, more than 80% of companies worldwide will be facing at least one privacy-focused data protection regulation

2. Cloud Migration and Multi-Cloud

Future-proof assumption: my cloud provider must have a local data center in my countries of operation, and a multi-cloud approach may be required.

Just as COVID-19 accelerated cloud adoption, Schrems II may catalyze the build-out of local data centers by cloud providers. We researched which cloud provider is best positioned to take advantage of this shift. Here’s how many jurisdictions each provider can currently serve (as of July 2020):

While all three currently have data centers in a similar number of geographies, Azure’s experience with a more distributed footprint could help it capitalize on this trend.

Disclaimer: DataFleets is cloud-agnostic, and we currently use cloud services from all three of the above providers.

We also observe regional fragmentation leading to multi-cloud implementations. For example, a leading financial services institution working with DataFleets uses Alibaba Cloud to support Asia Pacific while using one of the three above providers in the US and Europe. With cloud data becoming increasingly politicized, we expect this Balkanization to continue.

3. Privacy-by-design (PBD) and privacy-enhancing technologies (PETs)

Future-proof assumption: data ops and analytics should include best practices to mathematically limit privacy risk.

With this rapid increase in privacy regulation, investing now in best practices such as data minimization, reducing data copies, and risk-based anonymization is not only ethical. It also makes business sense: it preserves operating continuity and provides a marketing edge as a privacy-first brand. An example is Microsoft’s decision to uphold CCPA standards across the entire U.S., not just in California.

Privacy-enhancing technologies are rapidly maturing and gaining recognition from regulators. The UK’s Information Commissioner’s Office listed Federated Learning as a tool that can meaningfully contribute to data minimization efforts. There are three best-of-breed open source projects we recommend evaluating:

OpenMined: Homomorphic Encryption, MPC, Differential Privacy, and Federated Learning
White Noise: Differential Privacy, by Microsoft and Sarah Bird
TensorFlow: Federated Learning, by Google

Federated Learning is especially applicable to the EU-US divide because its core insight is shipping models to data rather than aggregating data centrally. It combines:

Privacy: removes the need for traditional privacy approaches like data masking and tokenization
Federated architecture: removes the need for data aggregation, such as a single data lake or data warehouse
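To illustrate the idea of shipping models to data, here is a minimal sketch of federated averaging in plain Python. It is not the API of any of the projects listed above. The two sites ("eu" and "us"), the synthetic data generated from y = 3x + 1, the linear model, and all function names and hyperparameters are assumptions made for illustration.

```python
import random


def local_update(weights, data, lr=0.1, epochs=20):
    """Run a few gradient-descent epochs on one site's private data.

    Only the updated weights leave the site; the raw (x, y) rows do not.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates):
    """Average the site models, weighting each by its number of rows."""
    total = sum(n for _, n in updates)
    w = sum(model[0] * n for model, n in updates) / total
    b = sum(model[1] * n for model, n in updates) / total
    return (w, b)


def make_site_data(rng, n, lo, hi):
    """Hypothetical synthetic data: y = 3x + 1 plus small Gaussian noise."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.05)) for x in xs]


def train(sites, rounds=60):
    """Coordinator loop: send the model out, collect updates, average them."""
    global_model = (0.0, 0.0)
    for _ in range(rounds):
        updates = [(local_update(global_model, data), len(data))
                   for data in sites.values()]
        global_model = federated_average(updates)
    return global_model


rng = random.Random(0)
sites = {
    "eu": make_site_data(rng, 200, 0.0, 1.0),  # stays in the EU
    "us": make_site_data(rng, 300, 1.0, 2.0),  # stays in the US
}
model = train(sites)
print("learned slope and intercept:", model)
```

In this sketch the coordinator only ever sees model weights and row counts. A production system would also need secure transport, protection of the weight updates themselves, and handling of sites that drop out.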

We predict that Federated Learning and differentially private federated SQL will become the prevailing paradigm for unified multi-jurisdictional analytics. This form of “arm’s-length data science” comes with the benefits of potentially greater and faster data access, improved developer productivity, and best-in-class privacy and security.
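As a rough illustration of what a differentially private federated query might look like, the sketch below answers a COUNT query at each site with Laplace noise and sums the noisy results. It is not the DataFleets implementation or any library's API: the function names, the example rows, and the choice of the Laplace mechanism are assumptions. It also assumes each individual appears at exactly one site, so the per-site privacy budget `epsilon` applies to the combined answer.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rows, predicate, epsilon, rng):
    """Answer COUNT(*) WHERE predicate with epsilon-differential privacy.

    Adding or removing one row changes a count by at most 1, so the
    sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def federated_count(sites, predicate, epsilon, rng):
    """Sum per-site noisy counts; raw rows never leave their site."""
    return sum(private_count(rows, predicate, epsilon, rng)
               for rows in sites.values())


# Hypothetical rows held in two jurisdictions.
example_sites = {
    "eu": [{"amount": 120}, {"amount": 40}, {"amount": 900}],
    "us": [{"amount": 15}, {"amount": 300}],
}
rng = random.Random(42)
noisy_total = federated_count(
    example_sites, lambda row: row["amount"] > 100, epsilon=1.0, rng=rng
)
print("noisy count of rows with amount > 100:", noisy_total)
```

A smaller `epsilon` adds more noise and gives stronger privacy; a larger `epsilon` gives a more accurate answer with weaker privacy.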

Conclusion

It’s worth remembering there are trillions of dollars of economic growth at stake. A study from James Manyika and the McKinsey Global Institute in 2016 showed that cross-border data flows significantly contribute to economic growth, with upwards of $2.8 trillion of net positive economic activity. A separate study in 2018 found that AI can contribute 40 percent of the overall $9.5 trillion to $15.4 trillion in annual impact from analytics.

Tag us on Twitter @DataFleets or sign up to access our Federated Learning on MNIST tutorial.