In the last few years, awareness that technology can seriously harm people and the planet has been slowly spreading among those who work in the technology sector.

At the same time, organizations have been developing principles, frameworks and tools to help make technology and data practices more responsible and ethical.

In early 2021, Consequential, alongside the Open Data Institute (ODI), conducted a short, qualitative research project to explore the impact of these tools in supporting a vision of a world where tech and data works for everyone.

This research showed some of the broad shapes and patterns of the tech and data ecosystem and its relationships to these tools at this moment. We found that adoption of these tools is ‘in the chasm’. This refers to Geoffrey Moore’s technology adoption lifecycle: the transition from the early market into the mainstream. Crossing the chasm is the leap from being a new, little-known and exploratory product to mass adoption and well-known status.

In our interviews, we heard the same observations and themes from many actors and vantages, pointing towards some shifts within the tech and data ethics ecosystem. These shifts represent many challenges and opportunities for those who are trying to create the next generation of tech and data ethics tools and for those who want to be ethical and responsible in building technology:

**From individual drivers to systemic drivers of ethical behaviour.** Early adopters are passionate individuals, but they do not represent consistent levers for change. Many who have worked on or experimented with embedding data ethics tools have recognized that, without systemic drivers, change can only go so far or so deep. This has led many in the ethics field to start to identify these systemic levers and to create and address long-term incentives for change.

**From proliferation to polish of tech and data ethics tools.**

During the early adoption phase, many experts created tech and data ethics tools and frameworks, leading to a range of approaches as a natural part of an emerging field. Many of these early tools were experiments or prototypes. Now, as part of this shift, people are starting to revisit these early creations or to create new ones, informed by the lessons of the first generation. There are more standards and certifications in the space than ever before, and people are tackling the question of what good looks like.

**From ethics in tech and data to governance of tech and data.**

Many sectors don’t engage with the ethics conversation but pay a lot of attention to governance and regulation. As businesses large and small reconsider the role they play in society more broadly, governance – how businesses are guided and managed – has become a critical issue. The question for many will be what may get lost in this shift, and how ethics can be integrated into governance.

**From broad ethical principles to specific ethical practices.** Organizations that want to be ethical face the challenge of turning general ethical principles into specific practices within their business. Given the incredible range of industries, sectors, business sizes and cultures, solutions to ethical issues need to be heavily contextualized.

In our full presentation we explore these shifts in more depth, and look at some of the possible challenges and opportunities that exist for tech and data ethics tools within these shifts. We also explore what lessons have been learned from early adopters and the creators of the first generation of tech and data ethics tools, and how these lessons can help to shape the next generation.

As technologists, we’ve become aware of the need to build open, trustworthy tech and data systems that unlock positive and lasting social benefits. Now we need to design the next generation of data ethics tools to help us do this, and use them as part of our everyday practice.

*Read the full research here. To engage further on tech and data ethics tools, **get in touch**!*

Cover Image: Photo by Pietro Jeng on Unsplash

Let us begin by looking at a generic Machine Learning (ML) algorithm that takes in **our** data as input and outputs some kind of decision - a classification label, numerical value or a recommendation. The privacy problem stems from the fact that we have to input **our** data to get those nice and valuable predictions. Many AI applications powered by smart agents are hosted on the cloud, and protecting privacy of user data is pivotal to building secure applications. The privacy of data can be protected through **encryption**. However, standard encryption methods do not allow computation on encrypted data. Here's where Homomorphic Encryption (HE) helps.

In simple terms, Homomorphic Encryption is a mathematical tool that allows for encryption of data, ensuring privacy while at the same time, allowing computations to be performed on the encrypted data. The result of computation can then be decrypted to get the results.

As shown in the figure below, with Homomorphic Encryption (HE), the order of encryption and computation can be interchanged.

Suppose you have data `a` and `b`. You can perform computation on the data and then encrypt the result, denoted by `E(a*b)`. Alternatively, you could encrypt the data first and then perform the computation, denoted by `E(a) * E(b)`. If the encryption is homomorphic, these two values, `E(a*b)` and `E(a) * E(b)`, decrypt to the same value.

Therefore, we can encrypt private data `a` and `b`, outsource the computation, say, to the cloud, and decrypt the obtained result to view the actual meaningful values.
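A minimal sketch of this property, using textbook RSA, which happens to be multiplicatively homomorphic. The tiny key below is for illustration only and is my own toy example, not a secure parameter choice or SEAL's actual scheme:

```python
# Textbook RSA is multiplicatively homomorphic:
# E(a) * E(b) mod n decrypts to a * b (as long as a * b < n).
# Toy parameters for illustration only -- never use keys this small.
p, q = 61, 53
n = p * q            # modulus: 3233
e = 17               # public exponent
d = 413              # private exponent: inverse of e mod lcm(p-1, q-1)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 9
product_of_ciphertexts = (encrypt(a) * encrypt(b)) % n
# The multiplication happened "inside" the encryption:
assert decrypt(product_of_ciphertexts) == a * b   # 63
```

Real HE libraries such as SEAL use lattice-based schemes that support both addition and multiplication on ciphertexts, but the interchangeability of "compute then encrypt" and "encrypt then compute" is the same.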

Let's try to think of a fictional character and draw a relatable analogy.

Remember Homer Simpson from 'The Simpsons'?🙂

The following illustration is aimed at giving an intuitive explanation to what Homomorphic Encryption does.

Let us say you need to get a jewel made and you have your gold ready! You'd now like to call your jeweler (Homer Simpson) and get your jewel made, however you're not very sure if your jeweler is trustworthy. Here's a suggestion.

You may place your gold in a glove box, lock it, and keep the key with yourself. You may now invite your jeweler over and ask him to work on the gold nuggets through the gloves. Once the jewel is done, you may unlock your box and retrieve your jewel. Isn't that cool?

Let us try to parse the analogy a bit. Your private **data** is analogous to gold, **outsourcing computations** on encrypted data in a public environment is similar to getting your jeweler to work on the gold through glove box. **Decrypting** the results of computation to view the meaningful results is analogous to opening the box to get your jewel after the jeweler has left. 🙂

Without delving deep into the math involved, the high-level idea behind homomorphic encryption is as follows. Homomorphic Encryption uses lattice-based encoding. Encryption hides the message by adding noise to a *secret* inner product; decryption subtracts the secret inner product, after which the noise is easy to cancel.
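The flavor of "encrypt by adding a secret mask, decrypt by subtracting it" can be sketched with a toy additive scheme. This is a simplification I'm adding purely for intuition, not SEAL's actual lattice construction:

```python
import random

q = 2 ** 16  # work modulo a fixed modulus, as lattice schemes do

def encrypt(x, key):
    # Hide the value under a secret additive mask
    return (x + key) % q

def decrypt(c, key):
    # Subtract the mask to recover the value
    return (c - key) % q

k1, k2 = random.randrange(q), random.randrange(q)
# Adding two ciphertexts adds the underlying plaintexts:
c = (encrypt(5, k1) + encrypt(7, k2)) % q
assert decrypt(c, (k1 + k2) % q) == 12
```

Real schemes replace the simple mask with a noisy inner product over a lattice, which is what makes the encryption secure while still supporting homomorphic operations.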

In the next section, let's see the capabilities of Microsoft SEAL, a library for Homomorphic Encryption.

SEAL (Simple Encrypted Arithmetic Library) is Microsoft's library for Homomorphic Encryption (HE), widely adopted by research teams worldwide. It was first released publicly in 2015, followed by the standardization of HE in November 2018. Microsoft SEAL is available for download and use at https://www.microsoft.com/en-us/research/project/microsoft-seal/.

In recent years, availability of hardware accelerators has also enabled several orders of magnitude speedup. Here's a timeline of how Homomorphic Encryption has been adopted due to advances in research and easier access to compute.

- Idea: Computation on encrypted data without decrypting it
- 2009: Considered impractical due to substantial overhead involved
- 2011: Breakthrough at Microsoft Research
- Subsequent years of research: Practical encoding techniques that achieved several orders of speed-up
- 2016: Crypto Nets at ICML 2016 - Neural Net predictions on encrypted data

Now that we're familiar with how HE has been adopted over the years, the subsequent sections will focus on the possible applications of HE.

The following are some of the cloud computing scenarios that could potentially benefit from Homomorphic Encryption (HE):

- Private Storage and Computation
- Private Prediction Services
- Hosted Private Training
- Private Set Intersection
- Secure Collaborative Computation
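One of these scenarios, private set intersection, can be sketched naively: each party shares only hashes of its items, and the hashes are intersected. This is my own simplified illustration; real PSI protocols add stronger protections (e.g. oblivious PRFs) because plain hashes of low-entropy values can be brute-forced:

```python
import hashlib

def hashed(items):
    # Hash each item so raw values never leave the owner's machine
    return {hashlib.sha256(x.encode()).hexdigest() for x in items}

alice = {"alice@example.com", "bob@example.com"}
bob = {"bob@example.com", "carol@example.com"}

# The parties exchange hash sets instead of raw contact lists
common = hashed(alice) & hashed(bob)
print(len(common))  # 1 -- only the shared contact matches
```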

The markets benefiting from such services include healthcare, pharmaceuticals, finance, government, insurance, manufacturing, and oil and gas, to name a few. A few applications of Private AI across industries are listed below.

- Finance: Fraud detection, automated claims processing, threat intelligence, and data analytics are some applications.
- Healthcare: The scope of Private AI in healthcare includes medical diagnosis, medical support systems such as healthcare bots, preventive care and data analytics.
- Manufacturing: Predictive maintenance and data analytics on potentially sensitive data.

An image classification model for *encrypted inferencing* in Azure Container Instance (ACI), built on Microsoft SEAL, was announced at Microsoft Build Conference 2020. The tutorial can be accessed here.

[1] Recording of the PriCon talk.

Cover Image of Post: Photo by David Sjunnesson on Unsplash

The goal of this notebook is to implement the calculation of local sensitivity from scratch.

I benchmarked the results of these experiments against the results of the global sensitivity from my previous blog post.

Local sensitivity differs from global sensitivity in that it considers only the dataset to be released and not all the possible release datasets.

This means that one calculates the neighbors of the released dataset and not of all possible datasets.

Furthermore, it is only within these neighbors and their corresponding release dataset where one finds the maximum L1 norm.

Global sensitivity is, therefore, an upper bound of local sensitivity.

**Note: Unbounded sensitivity can be achieved in two ways, either by adding or by subtracting records. In this notebook, I computed both at the same time and chose the one that yielded the highest sensitivity. However, I would say that in a real scenario one could take either and calculate the sensitivity, as both equally protect the privacy of the individuals in the records. That said, for the same privacy guarantees, one might require less noise than the other. This remains open for discussion.**

- I programmed two functions to calculate the local sensitivity of a dataset empirically.
- I compared local and global sensitivity results.
- I present a visualization of sensitivities for a median query, to show how to calculate sensitivities locally and how bounded and unbounded sensitivity differ visually.

- If local sensitivity implies less noise due to its smaller value, why do we not always use it?

If you calculated a particular local sensitivity for each dataset, an adversary could take this into account when plotting the noise distributions of the different possible released dataset combinations. These distributions would have a lower standard deviation (local sensitivity is lower than global sensitivity and therefore implies less noise), so once the adversary obtained a DP query result, it would be easier for him/her to discard possible release datasets (a visual representation of this attack is in the paper I implemented two blog posts ago, depicted in Figs. 1 and 3). That is why some researchers invented smooth bounds for local sensitivities, but that is out of the scope of this blog post.
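To make the link between sensitivity and noise concrete: in the Laplace mechanism, the noise scale is sensitivity/ε, so a smaller (local) sensitivity directly means a narrower noise distribution. A quick sketch with hypothetical numbers of my own choosing:

```python
import numpy as np

epsilon = 1.0
global_sensitivity = 10.0   # hypothetical values for a single query
local_sensitivity = 4.0

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    # Laplace mechanism: add noise drawn from Laplace(0, sensitivity / epsilon)
    return true_answer + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)
noisy_answer = laplace_mechanism(42.0, local_sensitivity, epsilon, rng)

# Smaller sensitivity -> smaller noise scale -> narrower output distribution,
# which is exactly what an adversary can exploit to discard candidate datasets.
assert local_sensitivity / epsilon < global_sensitivity / epsilon
```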

**TIP**: The best way to understand this notebook better is to open my previous notebook, and go through the visualizations of scenario (a), cells 25 and 33 to compare the results.

You can see the comparisons of the plots between local and global sensitivities and a visual representation of how sensitivity is calculated. (**Ignore the decimals on the x-axis, Hamming distances are integers**)

For the visual representation, we have chosen:

- Universe of possible values: X = {1, 2, 3, 10, 11}
- Release dataset size (cardinality) = 3
- Hamming distance = 1
- x are all the possible value combinations from X
- x' are all the possible neighbors for each x
- Neighbor's cardinality difference is set to the Hamming distance of 1, | |x| - |x'| | = 1
- The L1 norm is calculated with the median of the possible release datasets (on the 2nd column) and with each of the medians of each neighboring dataset.
- I select the first maximum for each group of L1 norms (The first cell in the column that has the highest value of the column).

Notice that I calculated the local sensitivities on the horizontal axis; there is one per possible release dataset (first column). The global sensitivity is calculated in the vertical axis, selecting the maximum value out of all the L1 norms, i.e., selecting the maximum out of all local sensitivities.
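To make this concrete, here is a tiny brute-force sketch of the unbounded local sensitivity of the median for one hypothetical release dataset from the universe above. This is a simplified stand-in for the notebook's full logic, not the notebook's own function:

```python
import itertools
from statistics import median

universe = [1, 2, 3, 10, 11]     # X, the universe of possible values
release = (1, 2, 10)             # one hypothetical release dataset of size 3

def unbounded_local_sensitivity_median(universe, release):
    # Neighbors at Hamming distance 1: remove one record...
    neighbors = list(itertools.combinations(release, len(release) - 1))
    # ...or add one record drawn from the rest of the universe.
    remaining = list(universe)
    for value in release:
        remaining.remove(value)
    neighbors += [release + (value,) for value in remaining]
    # Local sensitivity: max L1 distance between the release dataset's
    # median and each neighbor's median.
    return max(abs(median(release) - median(n)) for n in neighbors)

print(unbounded_local_sensitivity_median(universe, release))  # 4.0
```

Global sensitivity would instead take this maximum over every possible release dataset, which is why it upper-bounds the local value.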

The general form of L1 unbounded sensitivity:

General form of L1 bounded sensitivity:

**Here come the figures!**


It is interesting to see that the sensitivity for variance declines and that, at least for the sum, the local sensitivity appears to be equal to the global sensitivity.

The rest have lower values. This means that this particular dataset is the one with the maximum sensitivity for the sum.

At a first glance, the bounded local and global sensitivities seem equal, but if you pay close attention, for some queries they are not.

I use the same datasets as in the previous blog post.

Here is the code for calculating the local sensitivity for unbounded DP:

```
def unbounded_empirical_local_L1_sensitivity(universe, universe_subset, columns, query_type, hamming_distance, percentile=50):
    """
    INPUT:
    universe - (df or dict) contains all possible values of the dataset
    universe_subset - (df or dict) contains the subset chosen for the release dataset
    columns - (array) contains the names of the columns we would like to obtain the sensitivity from
    query_type - (str) contains the category declaring the type of query to be later on executed
    hamming_distance - (int) hamming distance between neighboring datasets
    percentile - (int) percentile value for the percentile query
    OUTPUT:
    unbounded_local_sensitivity_per_column - (dict) the unbounded local sensitivity per column
    Description:
    It calculates the local sensitivity of the release dataset based on the knowledge of the
    entire universe of the dataset and query_type.
    """
    # Check if the values for the hamming distance and universe sizes comply with the basic constraints
    verify_sensitivity_inputs(universe.shape[0], universe_subset.shape[0], hamming_distance)
    # We initialize the type of query for which we would like to calculate the sensitivity
    query = Query_class(query_type)
    # We will store the sensitivity of each column of the dataset containing the universe in a dictionary
    unbounded_local_sensitivity_per_column = {}
    for column in columns:
        # 1) RELEASE DATASET /// it could be a dataframe or a tuple
        try:
            release_dataset = universe_subset[column]
        except (KeyError, IndexError, TypeError):
            release_dataset = universe_subset
        # 2) |NEIGHBORING DATASET| < |RELEASE DATASET| //// cardinalities
        neighbor_with_less_records_datasets = itertools.combinations(release_dataset,
                                                                     universe_subset.shape[0] - hamming_distance)
        neighbor_with_less_records_datasets = list(neighbor_with_less_records_datasets)
        # 3) |NEIGHBORING DATASET| > |RELEASE DATASET| //// cardinalities
        symmetric_difference = list((Counter(universe[column]) - Counter(release_dataset)).elements())
        neighbor_possible_value_combinations = itertools.combinations(symmetric_difference, hamming_distance)
        neighbor_possible_value_combinations = list(neighbor_possible_value_combinations)
        neighbor_with_more_records_datasets = []
        for neighbor_possible_value_combination in neighbor_possible_value_combinations:
            # We create neighboring datasets by appending the value combination to the release dataset
            neighbor_with_more_records_dataset = np.append(release_dataset, neighbor_possible_value_combination)
            neighbor_with_more_records_datasets.append(neighbor_with_more_records_dataset)
        # 4) We calculate the L1 norm between the release dataset and each of its
        #    respective neighboring datasets, keeping the maximum
        L1_norms = []
        release_dataset_query_value = query.run_query(release_dataset, percentile)
        L1_norm = L1_norm_max(release_dataset_query_value, neighbor_with_less_records_datasets, query, percentile)
        L1_norms.append(L1_norm)
        L1_norm = L1_norm_max(release_dataset_query_value, neighbor_with_more_records_datasets, query, percentile)
        L1_norms.append(L1_norm)
        # We pick the maximum out of the L1 norms calculated from both kinds of neighbors
        unbounded_local_sensitivity_per_column[column] = max(L1_norms)
    return unbounded_local_sensitivity_per_column
```

And, here is the code to calculate the local sensitivity for bounded DP:

```
def bounded_empirical_local_L1_sensitivity(universe, universe_subset, columns, query_type, hamming_distance, percentile=50):
    """
    INPUT:
    universe - (df) contains all possible values of the dataset
    universe_subset - (df or dict) contains the subset chosen for the release dataset
    columns - (array) contains the names of the columns we would like to obtain the sensitivity from
    query_type - (str) contains the category declaring the type of query to be later on executed
    hamming_distance - (int) hamming distance between neighboring datasets
    percentile - (int) percentile value for the percentile query
    OUTPUT:
    bounded_local_sensitivity_per_column - (dict) the bounded local sensitivity per column
    Description:
    It calculates the local sensitivity of the release dataset based on the knowledge of the
    entire universe of the dataset and query_type.
    """
    # Check if the values for the hamming distance and universe sizes comply with the basic constraints
    verify_sensitivity_inputs(universe.shape[0], universe_subset.shape[0], hamming_distance)
    # We initialize the type of query for which we would like to calculate the sensitivity
    query = Query_class(query_type)
    # We will store the sensitivity of each column of the dataset containing the universe in a dictionary
    bounded_local_sensitivity_per_column = {}
    for column in columns:
        # First we obtain the combinations within the release dataset. The size of these combinations
        # is not the original size but the original size minus the hamming_distance
        # /// it could be a dataframe or a tuple
        try:
            release_i_datasets = itertools.combinations(universe_subset[column], universe_subset.shape[0] - hamming_distance)
        except (KeyError, IndexError, TypeError):
            release_i_datasets = itertools.combinations(universe_subset, universe_subset.shape[0] - hamming_distance)
        release_i_datasets = list(release_i_datasets)
        # This will contain sets of neighboring datasets. The L1 norm will be calculated within each set
        # and the maximum chosen. Datasets from different groups are not necessarily neighbors,
        # thus we separate them into groups
        neighbor_dataset_groups = []
        for release_i_dataset in release_i_datasets:
            # Second, we calculate the combinations of the items in the universe that are not in the
            # release dataset. The size of a combination is equal to the hamming distance.
            # This will not discard duplicates
            symmetric_difference = list((Counter(universe[column]) - Counter(release_i_dataset)).elements())
            release_ii_datasets = itertools.combinations(symmetric_difference, hamming_distance)
            release_ii_datasets = list(release_ii_datasets)
            # We create neighboring datasets by concatenating i with ii
            neighbor_datasets = []
            for release_ii_dataset in release_ii_datasets:
                neighbor = list(release_i_dataset + release_ii_dataset)
                neighbor_datasets.append(neighbor)
            neighbor_dataset_groups.append(neighbor_datasets)
        # We calculate the L1 norm for the different combinations with the aim of finding the max.
        # We can loop in this manner because we are taking absolute values
        L1_norms = []
        for m in range(0, len(neighbor_dataset_groups)):
            for i in range(0, len(neighbor_dataset_groups[m]) - 1):
                for j in range(i + 1, len(neighbor_dataset_groups[m])):
                    L1_norm = np.abs(query.run_query(neighbor_dataset_groups[m][i], percentile) -
                                     query.run_query(neighbor_dataset_groups[m][j], percentile))
                    L1_norms.append(L1_norm)
        bounded_local_sensitivity_per_column[column] = max(L1_norms)
    return bounded_local_sensitivity_per_column
```

And here we do some plotting:

```
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 7))
query_types = ['mean', 'median', 'count', 'sum', 'std', 'var', 'percentile_25', 'percentile_50', 'percentile_75', 'percentile_90']
x_values = list(unbounded_sensitivities.keys())
for index_column, column in enumerate(columns):
    # Start the subplot: one row, one column of the figure per dataset column
    plt.subplot(1, len(columns), index_column + 1)
    query_type_legend_handles = []
    for query_type in query_types:
        y_values = []
        for hamming_distance in unbounded_sensitivities.keys():
            y_values.append(unbounded_sensitivities[hamming_distance][query_type][column])
        # Plot the sensitivities
        legend_handle, = plt.plot(x_values, y_values, label=query_type)
        query_type_legend_handles.append(legend_handle)
    # Legends
    legend = plt.legend(handles=query_type_legend_handles, bbox_to_anchor=(0., -0.2, 1., .102),
                        ncol=5, mode="expand", borderaxespad=0.)
    plt.gca().add_artist(legend)
    # Axis labels and titles
    plt.xlabel('Hamming distance')
    plt.ylabel('Local Sensitivity')
    plt.title('{}) Universe Domain {} = {}'.format(index_column + 1, column, D_large_universe[column].values))
plt.suptitle('Local sensitivities based on unbounded DP of different queries for different domains for different hamming distances')
plt.show()
```


This post has covered the most salient contributions of the notebook; however, there is more to see! Check the notebook for a better understanding of the intricacies of local vs. global sensitivity.

**THE END**

The first post of the Private AI series was all about information flows and how they are fundamental to our society. We also learned about how information flows are often broken today because of the privacy-transparency trade-off.

In the second post we discussed which technical problems exactly are underlying the privacy-transparency trade-off.

In the third post we were introduced to the concept of **structured transparency**. We learned about input and output privacy, two of the five guarantees of structured transparency.

Today, in post four, we continue with structured transparency, covering input and output **verification** as well as **flow governance**.

We learned how input and output privacy can help you hide information you don't want to share. But how do we know that information from such a well-hidden information flow is true?

**Examples:**

- If a data scientist can't look at her data, how does she know what it looks like?
- A police officer wants to check if a driver is of legal age. He would usually look at the driver's license. This comes bundled with all kinds of information: date of birth, photo, license ID, address... When all he has to know is that YES, you are allowed to drive. But how could he verify a card that only said YES?

The driver's license is a good example of **internal consistency**. It makes it hard to create fake documents, videos, or money. It also makes it hard to pretend you're someone else in real life.

**Internal consistency** means that someone in the real world can recognize **how things usually look**.

Example: You receive a banknote that was printed on a regular printer. You would immediately feel that something is not right, even if the layout etc. is correct.

Internal consistency is one of humanity's most powerful tools for communicating trust. But this approach to verifying information has two problems.

- With enough effort, internal consistency can be faked. From classic approaches like faking dollar bills to modern, AI-based techniques called deepfakes.
- It requires sending way more information than is necessary to communicate. It has data leakage built-in!

We need tools that allow verifying a piece of information in a way that a) can't be faked and b) does not require revealing any extra information.

When you look at the top left of your browser, you probably see a little padlock 🔒. This symbolizes one of the most extraordinary input verification technologies ever deployed. When you enter a URL in your browser, hundreds of computers and network switches and cables help you find the right files. What if only one of these machines lied and sent you an altered version of the website? Would you notice it, if the website looked just as you expected?

This is called a man-in-the-middle attack. It is one of the most important input verification problems.

A **man-in-the-middle attack** is when someone gets between you and another party and alters the messages you are exchanging.

This kind of attack is prevented when the little lock 🔒 appears in your browser. It means that your communication with the server is encrypted with HTTPS.

HTTPS is an example of the most important tool for input verification: A cryptographic signature.

A **cryptographic signature** allows someone to sign a document with a mark that only they could make.

It has a significant advantage over real-life signatures. It cannot be copied! Remember public-key cryptography? The public key lets you encrypt something that only the private key can decrypt.

Cryptographic signatures are like this, too. You can sign any digital document with your **private key**. It does not matter where these documents go. Anyone who receives the signed document can verify with your **public key** that it is in fact your document.

What does it mean to "sign" a digital document, an image, an audio file? Signatures are just a mark. They are attached to an object, affiliating this object to you. What keeps a signature attached to this object? Digital signatures are actually a second file that you are sending with a document. We call this second file the **certificate**. You need the certificate to guarantee that this document is yours.
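The sign/verify mechanics can be sketched with a toy textbook-RSA key. This is my own illustration with insecurely small numbers; real systems use vetted libraries and much larger keys:

```python
import hashlib

# Toy RSA key pair -- illustration only, never use keys this small.
p, q = 61, 53
n = p * q        # modulus: 3233
e = 17           # public exponent (shared with everyone)
d = 413          # private exponent (kept secret; inverse of e mod lcm(p-1, q-1))

def sign(document: bytes) -> int:
    # Hash the document, then transform the hash with the PRIVATE key
    h = int.from_bytes(hashlib.sha256(document).digest(), "big") % n
    return pow(h, d, n)

def verify(document: bytes, signature: int) -> bool:
    # Anyone can undo the signature with the PUBLIC key and compare hashes
    h = int.from_bytes(hashlib.sha256(document).digest(), "big") % n
    return pow(signature, e, n) == h

doc = b"This certificate belongs to example.com"
sig = sign(doc)
assert verify(doc, sig)   # the genuine document verifies
# verify(b"tampered document", sig) would almost certainly fail,
# because any alteration changes the hash.
```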

When you enter google.com in your browser - how do you know the page you receive is actually from google.com? Couldn't anyone copy the public key (it's public, after all)? This is what a **certificate authority** is for. It provides a big list of domain names and associated public keys, called certificates. When you receive a webpage that's supposed to be from google.com, your browser checks its signature using the public key from the registry.

How do we know that a signature is associated with this particular file, and that nobody altered the file? For this, we need to know what a **hash** is.

A **hash** is a way to convert any kind of digital object into a single big number. It is deterministic, which means: no matter who calculates the hash, the same document will always result in the same hash.

If you change even one letter in a text document or one pixel in an image file, the hash will be completely different! When google.com signs a website before sending it to you, this is what it actually does. It hashes the big page into a single number. When it creates a signature, it's a signature for *that number*. It is unique for this page!
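Both properties, determinism and extreme sensitivity to changes, can be seen directly with Python's `hashlib` (the page content below is a made-up example):

```python
import hashlib

page = b"<html>Welcome to google.com</html>"
digest = hashlib.sha256(page).hexdigest()
print(digest)  # the same input always yields this same 64-hex-digit number

# Change a single character (l -> I) and the hash is completely different:
altered = hashlib.sha256(b"<html>Welcome to googIe.com</html>").hexdigest()
assert digest != altered
```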

Tip: If this was not clear or you want to gain a better understanding of cryptographic signatures, please check out this great video.

From a structured transparency perspective, cryptographic signatures are only good for a single, **straight pipe** of information. Any alterations of the content would corrupt the message. This is good in case a hacker messed with your data. But it also prevents us from performing any useful computations along the way. What we want is a tool that allows transformation of the data, but only in a very specific way. How would input verification work in such a case?

We look at two of the many techniques looking to verify a flow and its inputs.

- Zero-Knowledge Proofs
- Encrypted computation (like HE or SMPC) with Active Security

**Example:** Your country holds an election. When you see the big number on the TV screen counting the votes, how can you be sure that your vote was counted? **Goal:** We need something like a super-signature for the result. This would make it possible to prove that all inputs were used in the calculation of the final result. Every voter could check, with their public key, whether their signed vote was included.

This is the idea behind encrypted computation with active security and zero-knowledge proofs. If a computation is agreed upon, then a signature can be created when information is processed in an information flow. There is cryptographic evidence allowing anyone who participated in the computation to verify that their input was included in the final result.

Note: While these technologies exist, using them at the scale of an actual vote would be extremely challenging today. It is still ongoing research work.

These concepts are applicable not only to simple computations like voting, but also to complex tasks like machine learning algorithms. You could have proof that your data was processed by a fair algorithm.

- In previous lessons, we learned about private/public keys. One key pair was always linked to one person. But a key could also prove your anonymous identity. Think of it like a username: it can be anonymous, but it represents the same person every time.
- There can be an infinite number of key pairs. These could be used not only by individuals, but by groups of people, like all doctors in your town or everyone who is an English/Spanish interpreter. These public/private key pairs could allow you to prove that you are a member of a group, without revealing exactly which member you are.

Your real-life signature is tied to your identity. In contrast, cryptographic signatures don't have to be this way. They can be used to verify only specific attributes about you. Returning to our popular bar example: you could have a private key which proves that you are of legal age to drink. If the bartender asked you for ID, you could prove that you are a member of the group that is over a certain age, without having to reveal all the private information on your ID card.

This is input privacy and input verification at the same time!

A new concept: here we don't mean reputation for people or companies, but reputation for a verified input. This is a topic of active research, but it's meant to inspire creativity.

**Example:** You are a data scientist connected to a **network of data owners**. Let's say there are `635` data owners who claim to have data on breast cancer. But because it's private data, you're not allowed to actually look at it. Even though you can't see the data, they want to charge you `10$/day` to train your ML model. How do you decide which of these `635` owners you trust? What would prevent anyone from attaching themselves to the network and pretending to have data on breast cancer?

- Some of them might be **entities you know and trust**. Let's say `40` of the data owners were well-known hospitals. They can prove their identity via a signature and have a brand they don't want to damage, so you can trust their data.
- What about the other `595` data owners? Well, if you are the first person using a private dataset, **you can't know**. You just have to try it. The data owner should probably let you do so for free.

Let's say you are the first person to use data owner `346`

's dataset on breast cancer. You train an ML model, validate it with a known dataset, and it does pretty well. `346`

had some useful data! You leave a review: "The dataset worked". You sign it with your private key. You could even leave cryptographic proof that *you* did the analysis and which accuracy you achieved. The second, third, fourth, ... person comes along and sees your review. They can try the dataset for themselves and leave a review as well. They are all co-signing the same statement: This dataset, which you cannot see, is actually good. As the dataset becomes more verified, its price should increase.

What if there hadn't been a trusted dataset to evaluate with? Wasn't it lucky that a hospital had a similar dataset you could trust and compare? The solutions to this problem are complex, but there are ways to deal with it. This category of solutions is called **endorsements through a social network**.

Important: You cannot invent trust out of nowhere. But you can take small amounts of trust and scale that trust to an entire community.

Humanity is very connected. On LinkedIn, there is a feature that shows the degree of connection to other people. 1st degree connections are the people in your contact list. 2nd degree connections are the people your contacts have as 1st degree connections. Similar approaches could work to estimate whether an anonymous review or signature is real. You could look at the connections of this signature to people you actually know. If you can't find any connections, then it probably is fake.
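
A minimal sketch of this idea, using a hypothetical contact graph: breadth-first search gives the degree of connection between you and an anonymous signer. All names and the graph itself are illustrative.

```python
from collections import deque

# Hypothetical social graph: who is in whose contact list.
graph = {
    "me":       ["alice", "bob"],
    "alice":    ["me", "carol"],
    "bob":      ["me"],
    "carol":    ["alice", "signer_x"],
    "signer_x": ["carol"],
    "stranger": [],
}

def degree_of_connection(graph, start, target):
    """Breadth-first search; returns the degree, or None if no path exists."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None

degree_of_connection(graph, "me", "signer_x")  # a 3rd-degree connection
degree_of_connection(graph, "me", "stranger")  # None: no path, probably fake
```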

This is futuristic, but it shows how much trust and collaboration we could build as these tools make progress.

Output verification asks: how do we verify that the outputs from a hidden information flow contain the properties we want?

Output verification is one of the most important components of structured transparency. It allows you to verify that an algorithm that's affecting your life is actually good.

Machine learning models are often difficult to fully understand. They also can produce very biased results.

Tip: I highly recommend Rachel Thomas' lesson on Data Ethics from fast.ai. It is a great introduction to the topic of bias in AI.

If we want to use algorithms, we need a way to test and **audit an algorithm's performance**. Fairness and bias in ML are an area of active research, and they are very important, first for ethical reasons but also for legal ones. An example is the automatic processing of applications at some companies. The algorithm must not lower someone's chances because their gender, race, etc. is historically underrepresented in the dataset.

This is a complex challenge. Let's cover some of the highlights.

Think about a world without digital tools, when all decisions were made by humans. Do we know that any particular decision was **made in a fair way**? If a person is following a procedure, we can analyze the procedure. But if a person simply makes a decision, it can be difficult to prove that the decision was unfair, e.g., that it was gender-based.

How has society responded to human brains being black boxes and to the bias of human judgement? We have laws. We have procedures for what people are and aren't allowed to do. A formal policy is **a set of procedures someone has to follow** to ensure their decisions are fair and just. It's not perfect, but it brings the decision-making process into the open.

In the long run, **algorithms have the potential to be more transparent** and more consistent than humans. However, most of the decisions humans make today are not ready for algorithms to take over; algorithms are not accurate enough yet. But while the early use of algorithms has made mistakes regarding fairness and bias, Andrew is optimistic. The long-term potential for more fairness, transparency, and accountability is promising, because algorithms are easier to audit than people.

How do we certify algorithms that we can't look at because they're private?

- Let trusted third parties or regulators audit them. Example: the Coca-Cola recipe is famously secret, but *someone* from the government had to certify that it's safe. But there will not be a regulator for every possible case, which brings us to the second point:
- Construct a new private information flow to evaluate the algorithm. Use a different algorithm to evaluate this one, applying input and output privacy techniques. You could measure how accurate, fair, and safe an algorithm is without having access to the algorithm itself.
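
The second point can be sketched as a toy black-box audit: the auditor only has query access to a `predict` function and never sees the model internals. All names, the hidden rule, and the fairness metric (demographic parity) are illustrative; a real flow would also layer input/output privacy on top.

```python
def make_private_model():
    """Stands in for a model whose internals the auditor cannot see."""
    def predict(x):
        return 1 if x["income"] > 50 else 0  # hidden decision rule
    return predict

def audit_accuracy(predict, labelled_data):
    # The auditor only calls predict(), never inspecting the model.
    correct = sum(predict(x) == y for x, y in labelled_data)
    return correct / len(labelled_data)

def audit_demographic_parity(predict, population, group_key):
    # Compare positive-prediction rates across groups.
    rates = {}
    for person in population:
        rates.setdefault(person[group_key], []).append(predict(person))
    return {g: sum(v) / len(v) for g, v in rates.items()}

model = make_private_model()
labelled = [({"income": 60}, 1), ({"income": 40}, 0),
            ({"income": 55}, 1), ({"income": 30}, 1)]
acc = audit_accuracy(model, labelled)  # 0.75
population = [{"income": 60, "gender": "f"}, {"income": 40, "gender": "f"},
              {"income": 70, "gender": "m"}]
rates = audit_demographic_parity(model, population, "gender")
```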

Setting up an ideal information flow is not enough. If anyone could change the guarantees, that would be a problem. Who should have permission to change the guarantees? This is regarding input/output privacy as well as input/output verification.

Important: Without proper governance mechanisms, the integrity of the other guarantees is at risk.

There are different forms of governance:

**Unilateral governance:** An example is simple **public-key encryption**. I encrypt my message with your public key and send it to you. This is now an object governed only by you; nobody else could use the information within the message. **Homomorphic encryption** is slightly different: if somebody sends me homomorphically encrypted information, I have *some* governance over it. I can't read it, but I can modify it and perform calculations on it. Nevertheless, it is controlled by only one person, the private-key holder.

**Consensus governance:** Remember **additive secret-sharing** from SMPC? A number was encrypted between multiple shareholders, and someone could only perform calculations on the number or decrypt it if **all** shareholders decided to allow it. Each shareholder has veto power.

**Threshold schemes:** These allow for decisions by a democratic majority. SMPC and HE can both be configured this way, and you can even set an arbitrary threshold. If a share of shareholders larger than this threshold wants to do something, they'll have enough cryptographic permission to do it. This can mean running a computation, decrypting an output, adding a shareholder, ...
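
A minimal sketch of the consensus-governance case, additive secret sharing, assuming an illustrative public modulus `Q`: all shares are needed to reconstruct, so each shareholder holds a veto, yet the shareholders can still compute on the shares without decrypting.

```python
import random

Q = 2**31 - 1  # public modulus (illustrative choice)

def share(secret, n):
    """Split a secret into n additive shares that sum to it modulo Q."""
    shares = [random.randrange(Q) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Only works if every shareholder contributes: each one has a veto."""
    return sum(shares) % Q

shares = share(42, 3)
assert reconstruct(shares) == 42  # all three shareholders cooperate

# Additive homomorphism: add two shared numbers share-by-share,
# without any shareholder ever seeing the plaintext values.
a, b = share(10, 3), share(32, 3)
summed = [(x + y) % Q for x, y in zip(a, b)]
assert reconstruct(summed) == 42
```

Any subset of fewer than all shares is just uniform noise, which is exactly what gives each shareholder veto power.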

Why is this so powerful? All these fancy tools rely on trust, and nobody is perfectly trustworthy. But it's possible to design systems where people are more likely to work together: groups of people with checks and balances and control over each other.

An **example** where distributing governance over a number is especially useful: You want to store all your photos in the cloud. You could split the photos into two or more shares (SMPC), and give them to different cloud providers. You could still use their services but unless they all decided to collude, they couldn't use your information.

Note: These tools are still in development and might not work in every practical use case.

Being able to share data in an SMPC state is a genuine answer to the Recursive Enforcement Problem: having information split between multiple people, with everyone holding veto power to prevent misuse, alleviates it.

These tools give us incredible new options. But they aren't useful until we use them in systems that match the strengths and weaknesses of humans. Computer scientists and social scientists will have to collaborate!

This lesson explored the remaining three guarantees of structured transparency: Input Verification, Output Verification, and Flow Governance.

Thank you for reading! The next post in the series will cover the impact of Structured Transparency.

Cover Image: Photo by Ron Whitaker on Unsplash

This talk provides an example of private deep learning using federated learning and differential privacy.

**Usecase and Baseline:**

A hospital wants to predict how many days a patient will stay when they are admitted. Every hospital wants a similar model, since it helps them manage resources. The basic proposal would be to gather the data from all the hospitals and use it to build one big model that can make the prediction. But hospitals generally have constraints such as those listed below:

- They would want to maintain sovereignty of their data i.e. data should not cross the walls of their premises
- They should maintain privacy and confidentiality of patients i.e. using the data to learn general trends is allowed but these trends must never point towards individuals or groups.

For the purpose of comparison of the result, a baseline was implemented with the following criteria

- Medical training data can be requested by anyone
- The model must be a very simple neural network with no CNN, no LSTM etc.

The dataset which meets these criteria is the MIMIC dataset. This dataset consists of about 60,000 samples of clinical physiological data. The baseline model is based on the paper "Benchmarking deep learning models on large healthcare datasets" [https://arxiv.org/abs/1710.08531], specifically the feed-forward network model.

**Federated Learning**

**Motivation:** Distributed learning is motivated by requirements such as keeping data private, maintaining high model utility, institutional anonymity, and a standardised learning setting. But in real cases the utility of the model comes at the cost of user privacy; this project deals with that tradeoff.

The existing solution is to use a centralized server to which the data from all institutions is sent for training, with the model then sent back to the institutions. Beyond the issues of sending the data itself, the main problem is that there is no guarantee of privacy. Federated learning provides a solution: the collaborative model can be trained without sharing the data with the aggregation server. The server acts as a model aggregator, not as a learning node. The model is trained on local data: the central model is generated and distributed across all clients, and each local model update is then shared with the central server.

This is more suitable to the task at hand because:

- Only model updates are shared, not individual data points.
- It allows multiple privacy-preserving techniques to be applied on top of federated learning.
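
The round structure described above can be sketched as follows: a toy federated-averaging loop in which plain Python lists stand in for model weights and a placeholder "local update" stands in for real training. All names and numbers are illustrative, not the talk's actual implementation.

```python
def local_update(global_weights, local_data, lr=0.1):
    """Placeholder 'training': one step of each weight toward the local mean.
    Only this update ever leaves the hospital, never local_data."""
    local_mean = sum(local_data) / len(local_data)
    return [w + lr * (local_mean - w) for w in global_weights]

def federated_average(updates):
    """The aggregation server only sees updates, never raw patient data."""
    n = len(updates)
    return [sum(ws) / n for ws in zip(*updates)]

global_weights = [0.0, 0.0]
hospital_data = [[3, 5, 7], [4, 4, 4], [10, 2, 6]]  # stays on-premises

for _ in range(5):  # communication rounds
    updates = [local_update(global_weights, d) for d in hospital_data]
    global_weights = federated_average(updates)
```

Each round, the server distributes the current global model, each hospital computes an update locally, and only those updates are averaged centrally.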

**Differential Privacy**

**Motivation:**

Differential privacy is a smart way of adding random noise to data, making it usable by applications while preserving the privacy of the user, at low computational cost and with minimal loss of utility.

How is DP achieved?

Random noise sampled from a noise distribution, such as a Gaussian or Laplacian, is added. Its shape and scale are determined by two factors:

- Privacy budget
- Query sensitivity
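
These two factors appear directly in the classic Laplace mechanism, sketched below as a simple illustration (this is not the DP-SGD mechanism the talk uses): the noise scale is the query sensitivity divided by the privacy budget ε.

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return a differentially private answer to a numeric query."""
    scale = sensitivity / epsilon  # larger budget -> less noise
    # The difference of two i.i.d. Exponential(1/scale) draws is Laplace(scale).
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_answer + noise

ages = [23, 45, 31, 62, 50]
# A counting query has sensitivity 1: one person changes the count by one.
noisy_count = laplace_mechanism(len(ages), sensitivity=1, epsilon=0.5)
```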

The application at hand is that of deep learning. The main components of deep learning are

- Model architecture and weights
- The optimizer with which the loss value is minimized

The goal is to ensure that the network generalizes and does not memorize the data. So, where do we add the noise? The noise can be added to the input data, to the model, to the loss function, or to the optimizer step. The algorithm used here is Differentially Private SGD (DP-SGD), a variant of SGD.

How does DP-SGD work? When we create the mini-batches, we clip the L2 norm of each per-example gradient before we average them. After averaging, we add Gaussian noise to protect the privacy of the users. The step taken in the loss landscape is then in the direction opposite to the average noisy gradient.
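
A rough sketch of a single such step, under toy assumptions (list-based gradients, illustrative hyperparameters, no privacy accounting):

```python
import random

def dp_sgd_step(weights, per_example_grads, lr=0.1, clip_norm=1.0, noise_std=0.5):
    """One DP-SGD step: clip each per-example gradient, average, add noise."""
    clipped = []
    for g in per_example_grads:
        norm = sum(x * x for x in g) ** 0.5
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * factor for x in g])  # L2 norm now <= clip_norm
    avg = [sum(col) / len(clipped) for col in zip(*clipped)]
    # Gaussian noise scaled by the clipping bound protects each example.
    noisy = [a + random.gauss(0, noise_std * clip_norm / len(clipped)) for a in avg]
    # Step against the (noisy) gradient direction.
    return [w - lr * g for w, g in zip(weights, noisy)]

weights = [0.5, -0.2]
grads = [[3.0, 4.0], [0.1, -0.2], [1.0, 0.0]]  # per-example gradients
weights = dp_sgd_step(weights, grads)
```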

**Implementation with the help of PySyft**

A new tensor was added to the PySyft library that supports DP-SGD embedded in an FL scheme. Preprocessing steps follow the paper mentioned before to obtain the feature sets. Three models are implemented:

- Simple baseline without DP and FL
- Simple network with DP
- Simple network with just FL

In the final stage, the DP mechanism is embedded into FL through a secure aggregation scheme.

**Data Preprocessing**

**Methodology**

Complexities added to the network are

- Dropout, batchnorm and early stopping from a DP standpoint.
- Increase the number of layers to examine the correlation between the number of hyperparameters and model performance
- Networks with CNN and LSTM
- Adding different noise mechanisms
- Replace trusted aggregator with a non-trusted one

Thank you for reading.

References:

1. Video link: https://www.youtube.com/watch?v=F46lX5VIoas&t=121m49s

2. https://arxiv.org/abs/1710.08531

3. https://mimic.physionet.org/

4. Image and captions from the author's slides

This notebook aims to showcase two functions, one that implements sensitivity based on the unbounded differential privacy (DP) definition, and another that implements sensitivity based on a bounded definition.

I am not aiming for efficiency but for a deeper understanding of how to implement sensitivity empirically from scratch.

Before continuing, there need to be some clarifications:

In bounded DP, the neighboring dataset is built by changing the records of the dataset (not adding or removing records). E.g. x = {1, 2, 3} (|x|=3) with universe X = {1, 2, 3, 4}, a neighboring dataset with bounded DP would be: x' = {1, 2, 4} (|x'| = 3). They have the same cardinality.

In the unbounded DP definition, the neighboring dataset is built by adding or removing records. E.g., x = {1, 2, 3} (|x| = 3) with universe X = {1, 2, 3, 4}; a neighboring dataset in this case could be x' = {1, 2}, {1, 3}, or {1, 2, 3, 4} (|x'| = 2, 2, or 4, but not 3). Their cardinalities differ by 1 (the so-called Hamming distance).
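
For the small example above, both kinds of neighbors at Hamming distance 1 can be enumerated in a few lines (a sketch; `swap_one` filters out the original dataset itself):

```python
from itertools import combinations

universe = [1, 2, 3, 4]
x = [1, 2, 3]

# Unbounded DP: remove one record, or add one record from the universe.
remove_one = [list(c) for c in combinations(x, len(x) - 1)]
add_one = [sorted(x + [v]) for v in universe if v not in x]

# Bounded DP: same cardinality, with one record swapped for another.
swap_one = [d for c in combinations(x, len(x) - 1)
              for v in universe if v not in c
              if (d := sorted(c + (v,))) != sorted(x)]

# remove_one -> [[1, 2], [1, 3], [2, 3]]
# add_one    -> [[1, 2, 3, 4]]
# swap_one   -> [[1, 2, 4], [1, 3, 4], [2, 3, 4]]
```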

In this book, the authors explain why we should stick to unbounded sensitivity.

**More specifically, the datasets are multisets, and their cardinality is the sum of the multiplicities of each value they contain.**

The neighboring datasets are also multisets and are considered neighbors if the Hamming distance concerning the original dataset is of value k. The data scientist sets this parameter, but the original definition of DP has a value of k=1 (As discussed above, one considers a dataset a neighbor for a difference in cardinality of 1).

The Hamming distance can be seen as the cardinality of the symmetric difference between 2 datasets. With this in mind, the definition of DP can be written as:

P(M(x) = O) <= P(M(x') = O) * exp(ε * |x ⊖ x'|)

Where M is a randomized computation, x a dataset, x' its neighbor at Hamming distance k = |x ⊖ x'|, and O is the output of M given x and x'.

k was first defined to be equal to 1 because one aims to protect an individual in the dataset, and by definition any individual within the dataset would then be protected. By making the probabilities of obtaining an output *o* similar between two datasets that differ by only 1 record, one successfully cloaks the actual value of *o* and therefore does not fully update the knowledge of the adversary.

Looking at the definition of DP, the higher the k, the more exp(·) increases, which means that the difference between the probabilities of obtaining those outputs may be larger; that is not good from a privacy perspective. Nonetheless, because ε is defined by the data curator, one may compensate for the impact of a larger Hamming distance with a lower value of ε.

The intuition behind a larger Hamming distance is group privacy. This is useful when sets of individuals in a dataset have similar qualities, e.g., a family. The individuals of these groups should be indistinguishable from other sets of individuals of the same size. For example, having a Hamming distance of k=2 would aim to protect pairs of records (pairs of individuals), i.e., it accounts for the fact that there are dependencies between records in the dataset that need to be considered, lest an output reveals exceeding information. It could make sense if there are some binary relationships between records, e.g., pairs of siblings, or n-ary relationships for k=n, e.g., in a social network.

- I programmed functions that empirically calculate the bounded and unbounded sensitivity of a dataset formed by numeric columns. Additionally, they allow you to vary the Hamming distance. Their empirical nature will not allow them to scale well, i.e., they create all possible neighboring datasets, with k fewer or more records (for unbounded DP) or the same number of records but with k values changed (for bounded DP), where k is the Hamming distance. For each possible neighboring dataset, the functions calculate the query result, compute all possible L1 norms, and then choose the maximum. That is the sensitivity defined in DP (global sensitivity).
- The sensitivity can be calculated for most of the basic queries: mean, median, percentile, sum, var, std, count.
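
As a simplified illustration of the empirical approach just described, here is a brute-force computation of the unbounded global sensitivity of the mean for one toy universe. For brevity, it only considers neighbors with one record removed (Hamming distance 1); the actual notebook also considers added records, other queries, and larger distances. The function name and the example universe are illustrative.

```python
from itertools import combinations

def empirical_unbounded_sensitivity_mean(universe, release_size):
    """Brute-force the worst-case change of the mean when one record is removed."""
    mean = lambda xs: sum(xs) / len(xs)
    worst = 0.0
    for release in combinations(universe, release_size):
        # Neighbors at Hamming distance 1 (removal only, for brevity).
        for neighbor in combinations(release, release_size - 1):
            worst = max(worst, abs(mean(release) - mean(neighbor)))
    return worst

# The outlier 15 drives the sensitivity: dropping it from a small release
# shifts the mean the most.
empirical_unbounded_sensitivity_mean([1, 2, 3, 15], 3)  # 4.5
```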

I tried for different domains, bounded and unbounded sensitivities, different Hamming distances. If you are impatient, you can go directly to the results.

- Increasing the Hamming distance will increase the sensitivities; it makes sense because the larger the number of elements you can include, the more outliers can be present in the neighboring datasets, increasing the L1 norm.
- This increase in sensitivity, in turn, will increase the noise added, as the Hamming distance multiplies the chosen epsilon in the definition of DP. However, as described above, one may choose a lower ε to compensate.
- Bounded sensitivities seem smaller than unbounded ones. But that is not always the case; you can check the example given in the next blog post, where I give a visual example of how sensitivities are calculated.
- Bounded sensitivities are more taxing to compute than unbounded, but that might be because I have not optimized both functions equally.
- Sensitivities, in general, seem to either plateau, have a logarithmic behavior, or be linear. However, this cannot be generalized as the number of samples is small.

**Note 1: Unbounded sensitivity can be achieved in 2 ways, either by adding or by removing records. In this notebook, I computed both simultaneously and chose the one that yielded the highest sensitivity. However, in a real scenario you could choose either and calculate the sensitivity, as both equally protect the privacy of the individuals in the records by definition. It is true, though, that for the same privacy guarantees one might require less noise than the other. This is an object for discussion.**

Note 2: Some of these conclusions have been drawn from a small set of experiments; this sets the ground for hypotheses, but to assert the conclusions we would need to prove them theoretically.

I have differentiated between 2 cases. (Their explanation has a larger impact on the understanding of DP than on the notebook presented here.)

(a) The universe of possible values is based on a dataset, which the adversary knows. Furthermore, there is a released dataset, which is a subset of the universe. The adversary only knows the size of the released dataset and that the members hold a common characteristic. In the worst case scenario, aligned with differential privacy, the cardinality of the release dataset is one less than the universe, and, therefore, the adversary knows the size of the released dataset. This scenario could be, e.g., releasing a study based on some students out of all the students at a school. (Note: The dataset to be released cannot be larger than the dataset used for the universe, only equal or smaller).

(b) This case is similar to the previous one, but the universe of possible values is a range of values instead of a dataset. Furthermore, the adversary knows this range of possible values and the size of the released dataset. We define this second case because sometimes the exact values that could potentially be released by a data owner are not known in advance; we might only know a range where those values could fall. This scenario could be, e.g., releasing real-time DP results from an IoT device. We assume that the size of the released dataset is known, i.e., we know the adversary queries n amount of records or a synopsis (statistical summary) of these n records is released. This last statement about the knowledge of the adversary is realistic because the number of users or IoT devices in an application can be designed to be known.

For simplicity, from now on, I will call the datasets D_universe_a, D_universe_b, D_release_a, D_release_b, and D_neighbor for (a) and (b).

Note that this is somewhat different from the **online (on)** or interactive, and the **offline (off)** or non-interactive definition that C. Dwork introduces in her work. Her definitions deal with not knowing or knowing the queries beforehand, respectively. However, we could have the case (a) and case (b) in both (on) or (off):

- Scenario (a) + (on): An API allows external entities to query in a DP manner a subset of the school dataset you host internally (Or its entirety).
- Scenario (a) + (off): Release externally a DP synopsis (statistical summary) of a subset of the school dataset you host internally (Or its entirety).
- Scenario (b) + (on): An API allows external entities to query in a DP manner a dataset updated in real-time by IoT devices hosted internally.
- Scenario (b) + (off): Release externally a DP synopsis of a dataset. The dataset is constantly updated in real-time by IoT devices; therefore, the synopsis must also be updated. (Not a good idea, I think)

For this notebook, I will consider Scenario (a) + (off) and (b) + (off). But for the latter I will not include dataset updates, it is out of the scope of this post.

Also, note that (a) and (b) do not necessarily need to be categorized as **local DP (LDP)**. LDP is applied at the device level on individual data points before any third party aggregates all data points from all devices; usually, a randomized response is the DP mechanism of choice. However, one can adapt the setting so that (a) and (b) are cases of LDP. Nonetheless, in this notebook, we implement (a) and (b) as **global DP (GDP)** instances. GDP is applied when a trusted third party gathers all the data and applies a DP mechanism on the aggregate and not at a record level. GDP is not as restrictive as LDP in terms of allowed DP mechanisms, and the amount of noise in GDP can be lower than in LDP. This notebook focuses on GDP.

In summary, we have (a) + (off) + (GDP) and (b) + (off) + (GDP).

- How could (b) and GDP go together?

  The third party can host a server to process real-time data in aggregate.

  Caveat: because the dataset is ever-growing, one might need to recompute sensitivities every time the dataset changes, so that the executed queries or the synopsis have the appropriate amount of noise. However, if you know the range of possible values, or you are confident in your estimate, then you can calculate the sensitivity analytically just once.

- Mmm, and what if you do not know the full domain of your universe?

  That is indeed a good question. Then you will have to do some clipping to avoid surprises. E.g., if you defined your universe as D_universe_b = {'Age': [1, 2, ..., 99, 100]} but you receive a real-time value like 122, you should apply top-coding, 122 -> 100, and then include the value. Outliers and DP do not go well together: you can protect them, but the cost (higher noise) would be too high, and why would you? The mean or the sum would not be representative of the population if there are outliers in it. DP is used to learn about a population, not about outliers; if you want to learn about outliers, DP is not for you. However, this clipping might not be a trivial fix; I have to do more research to know whether it is okay to do.

However, in this notebook, the main difference between scenarios (a) and (b) is programmatic. While the functions I created (for the sensitivities in unbounded and bounded DP) serve both scenarios, to calculate the sensitivity in scenario (b) you have to make as many replicates of each value of the universe as the size of the released dataset. Why? Because if you define your range as D_universe_b = {'Age': [1, 2, ..., 99, 100]} with |D_release| = 4, you could receive in real time a D_release_b = {'Age': [100, 100, 100, 100]} or another like D_release_b = {'Age': [35, 35, 35, 35]}.

Something to note: the functions that calculate the sensitivities only need a universe and the size of the released dataset (together with the Hamming distance). They do not need the actual released dataset; that would be needed for calculating local sensitivities (future blog post).

While the adversary model is not essential for the notebook, it is good practice to set up the context so that we do not lose track of why we are doing this.

The adversary model adopted is the worst case, and it is the one I adopt for this notebook: the attacker has infinite computational power, and because DP provides privacy against adversaries with arbitrary background knowledge, it is fitting to assume the adversary has full access to all the records (the adversary knows the whole universe, D_universe). However, there is a dataset made from the universe without one individual (D_release), and the adversary does not know who is and who is not in it (this is the only thing the adversary does not know about the universe), though the adversary does know that D_release contains people with a certain quality (e.g., the students who have not been on probation). With D_universe, the attacker will try to reconstruct the released dataset (D_release) by issuing queries on D_release without having access to the data.

- The functions to calculate sensitivity do not scale well in terms of the size of your universe because they calculate sensitivities empirically.
- **The count query sensitivity should be 1 for unbounded and 2 for bounded DP in the case of multi-dimensional queries. The former is straightforward: you add or remove one record, increasing or decreasing the total count by one. With bounded sensitivity, however, the change of one record might decrease the count of one attribute and increase the count of another, yielding a total difference of 2. These 2 cases are not accounted for in this notebook, as we are doing only one-dimensional queries and not higher-dimensional queries or histogram releases. We solely count the number of elements in the array, which leads to a sensitivity of 1 in unbounded and also of 1 in bounded DP.**

*If the number of users/IoT devices itself is desired to be protected, then one can take a large sample of records, but not all the records, and the cardinality considered would be the number of the sampled records.*

**(Ignore the decimals on the x-axis, Hamming distances are integers)**

In this post we will show only the highlights of the underlying notebook. I invite you to check it out for more details.

In the notebook, I used two datasets for each of scenarios (a) and (b), one large and one small. In this post, I will only target the larger datasets, to allow for Hamming distances larger than one. The smaller dataset is the same as one included in a paper by Jaewoo Lee et al., which I implemented in a previous post; I used it as a sanity check.

**Scenario (a)**

The larger dataset is used as the universe in the notebook to test the functions with a Hamming distance > 1. Furthermore, note that there are duplicated values both in the universe and in the release, which does not mean they are the same records. This complicates the implementation.

```
# We define the actual dataset (conforming the universe)
D_a_large_universe_dict = {'name': ['Chris', 'Kelly', 'Keny', 'Sherry', 'Jerry', 'Morty', "Beth", "Summer", "Squanchy", "Rick"], \
'school_year': [1, 2, 2, 2, 5, 5, 7, 8, 9, 9], 'absence_days': [1, 2, 3, 4, 5, 6, 7, 8, 15, 20]}
D_a_large_universe = pd.DataFrame(D_a_large_universe_dict)
D_a_large_universe
```

And also we create the dataset that will be queried by the adversary:

```
# We define the dataset that we will release
D_a_large_release = D_a_large_universe.iloc[:6,:]
D_a_large_release
```

**Scenario (b)**

There will be only one dataset for scenario (b), designed to test the functions with a Hamming distance > 1. Notice that I only define the cardinality (the number of records in this case) of the release dataset and not of the universe. This is because we do not know all the values that make up the universe; we only know the unique values the data points may take from it.

The use case will be a vehicle sending the mood of the driver (1 - sad to 2 - happy) and how well they drive (1 - terrible to 3 - awesome). The sizes of the datasets are chosen so that they are not too large to compute their sensitivities, which is a computationally taxing operation because the sensitivity is calculated empirically.

```
# We define the universe of possible values and the cardinality of the release dataset
D_b_release_length = 4
D_b_universe = {'mood': [1, 2], 'driving_proficiency': [1, 2, 3]}
# Because we know the cardinality of the release dataset, we replicate each value:
for key, values in D_b_universe.items():
    D_b_universe[key] = values * D_b_release_length
D_b_universe
```

>>> {'mood': [1, 2, 1, 2, 1, 2, 1, 2], 'driving_proficiency': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]}

**Main functions**

Now we show the core of the notebook, the functions that calculate the sensitivities of the queries. Here is the one for unbounded sensitivity:

```
def unbounded_empirical_global_L1_sensitivity(universe, universe_subset_cardinality, columns, query_type, hamming_distance, percentile=50):
    """
    INPUT:
    universe - (df) contains all possible values of the dataset
    universe_subset_cardinality - (int) contains the length of the subset chosen for the release dataset
    columns - (array) contains the names of the columns we would like to obtain the sensitivity from
    query_type - (str) contains the category declaring the type of query to be later on executed
    hamming_distance - (int) hamming distance between neighboring datasets
    percentile - (int) percentile value for the percentile query
    OUTPUT:
    unbounded_global_sensitivity - (float) the unbounded global sensitivity of the input universe
    Description:
    It calculates the global sensitivity of an array based on the knowledge of the entire universe of
    the dataset and query_type.
    """
    # We initialize the type of query for which we would like to calculate the sensitivity
    query = Query_class(query_type)
    # We will store the sensitivity of each column of the universe dataset in a dictionary
    unbounded_global_sensitivity_per_column = {}
    for column in columns:
        # Check if the values for the hamming distance and universe sizes comply with the basic constraints
        verify_sensitivity_inputs(len(universe[column]), universe_subset_cardinality, hamming_distance)
        # 1) RELEASE DATASET
        # We calculate all the possible release datasets formed by the combination of values sampled from the universe
        release_datasets = list(itertools.combinations(universe[column], universe_subset_cardinality))
        # 2) |NEIGHBORING DATASET| < |RELEASE DATASET| //// cardinalities
        # The neighboring datasets are subsets of a smaller dimension of the possible release datasets (smaller by the hamming_distance)
        # The neighboring release datasets are used to calculate the max sensitivity, stemming from the DP definition
        neighbor_with_less_records_datasets = []
        for release_dataset in release_datasets:
            # These yield the smaller possible neighboring datasets
            neighbor_with_less_records_dataset = list(itertools.combinations(release_dataset,
                                                                             universe_subset_cardinality - hamming_distance))
            neighbor_with_less_records_datasets.append(neighbor_with_less_records_dataset)
        # 3) |NEIGHBORING DATASET| > |RELEASE DATASET| //// cardinalities
        # Similar process but adding records
        neighbor_with_more_records_datasets = []
        for release_dataset in release_datasets:
            # We obtain combinations of values from the universe; these will be appended to the release datasets.
            # The size of each combination is equal to the hamming distance, as the neighboring dataset will be that much larger.
            # However, in case your universe is a dataset and not just a range of values, the neighboring
            # dataset could contain the same record twice, which is NOT desirable (1 person appearing twice).
            # Therefore, the values must be sampled from the symmetric difference between the release dataset and the universe dataset
            # REF: https://www.geeksforgeeks.org/python-difference-of-two-lists-including-duplicates/
            symmetric_difference = list((Counter(universe[column]) - Counter(release_dataset)).elements())
            neighbor_possible_value_combinations = list(itertools.combinations(symmetric_difference, hamming_distance))
            temp_neighbor_with_more_records_datasets = []
            for neighbor_possible_value_combination in neighbor_possible_value_combinations:
                # We create neighboring datasets by concatenating the neighbor_possible_value_combination with the release dataset
                neighbor_with_more_records_dataset = list(release_dataset + neighbor_possible_value_combination)
                temp_neighbor_with_more_records_datasets.append(neighbor_with_more_records_dataset)
            # We append in this manner to cluster the neighboring datasets with their respective release dataset
            neighbor_with_more_records_datasets.append(temp_neighbor_with_more_records_datasets)
        # 4) For each possible release dataset, there is a set of neighboring datasets
        # We will iterate through each possible release dataset and calculate the L1 norm with
# each of its repspective neighboring datasets
L1_norms = []
for i, release_dataset in enumerate(release_datasets):
release_dataset_query_value = query.run_query(release_dataset, percentile)
L1_norm = L1_norm_max(release_dataset_query_value, neighbor_with_less_records_datasets[i], query, percentile)
L1_norms.append(L1_norm)
L1_norm = L1_norm_max(release_dataset_query_value, neighbor_with_more_records_datasets[i], query, percentile)
L1_norms.append(L1_norm)
# We pick the maximum out of all the maximum L1_norms calculated from each possible release dataset
unbounded_global_sensitivity_per_colum[column] = max(L1_norms)
return unbounded_global_sensitivity_per_colum
```
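To make the brute-force idea concrete, here is a minimal, self-contained sketch that is independent of the helper classes above (`Query_class`, `L1_norm_max`, etc.). It computes the unbounded L1 sensitivity of a sum query over a hypothetical tiny universe with Hamming distance 1, by comparing each possible release dataset against its neighbours with one record fewer and one record more:

```python
import itertools

def unbounded_sum_sensitivity(universe, subset_size):
    """Brute-force the unbounded global L1 sensitivity of the sum query
    for Hamming distance 1: neighbours are formed by removing one record
    from, or adding one universe record to, each possible release dataset."""
    max_l1 = 0
    for release in itertools.combinations(universe, subset_size):
        release_sum = sum(release)
        # Neighbours with one record fewer
        for neighbour in itertools.combinations(release, subset_size - 1):
            max_l1 = max(max_l1, abs(release_sum - sum(neighbour)))
        # Neighbours with one record more, drawn from the remaining universe
        remaining = list(universe)
        for value in release:
            remaining.remove(value)
        for extra in remaining:
            max_l1 = max(max_l1, abs(release_sum - (release_sum + extra)))
    return max_l1

# For the sum query, the sensitivity is the largest-magnitude value in the universe
print(unbounded_sum_sensitivity([1, 2, 3, 10], 2))  # 10
```

This matches the closed-form answer for sums: removing or adding the largest universe value perturbs the query the most.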

Here is the code for the bounded sensitivity:

```
def bounded_empirical_global_L1_sensitivity(universe, universe_subset_cardinality, columns, query_type, hamming_distance, percentile=50):
    """
    INPUT:
    universe - (df) contains all possible values of the dataset
    universe_subset_cardinality - (int) contains the length of the subset chosen for the release dataset
    columns - (array) contains the names of the columns we would like to obtain the sensitivity for
    query_type - (str) contains the category declaring the type of query to be executed later on
    hamming_distance - (int) hamming distance between neighboring datasets
    percentile - (int) percentile value for the percentile query
    OUTPUT:
    bounded_global_sensitivity - (float) the bounded global sensitivity of the input universe
    Description:
    Calculates the global sensitivity of an array based on knowledge of the entire universe of
    the dataset and the query_type.
    """
    # We initialize the type of query for which we would like to calculate the sensitivity
    query = Query_class(query_type)
    # We will store the sensitivity of each column of the universe dataset in a dictionary
    bounded_global_sensitivity_per_column = {}
    for column in columns:
        # Check if the values for the hamming distance and universe sizes comply with the basic constraints
        verify_sensitivity_inputs(len(universe[column]), universe_subset_cardinality, hamming_distance)
        # We calculate all the possible release datasets.
        # First we obtain the combinations within the release dataset. The size of these combinations is not the original size
        # but the original size minus the hamming_distance
        release_i_datasets = itertools.combinations(universe[column], universe_subset_cardinality - hamming_distance)
        release_i_datasets = list(release_i_datasets)
        # It will contain sets of neighboring datasets. The L1 norm will be calculated within these sets and the maximum chosen.
        # Datasets from different groups do not necessarily need to be neighbors, thus we separate them into groups
        neighbor_datasets = []
        for release_i_dataset in release_i_datasets:
            # Second, we calculate the combinations of the items in the universe that are not in the release dataset.
            # The size of a combination is equal to the hamming distance
            symmetric_difference = list((Counter(universe[column]) - Counter(release_i_dataset)).elements())
            release_ii_datasets = itertools.combinations(symmetric_difference, hamming_distance)
            release_ii_datasets = list(release_ii_datasets)
            # We create neighboring datasets by concatenating i with ii
            temp_neighbors = []
            for release_ii_dataset in release_ii_datasets:
                temp_neighbor = list(release_i_dataset + release_ii_dataset)
                temp_neighbors.append(temp_neighbor)
            neighbor_datasets.append(temp_neighbors)
        # We calculate the L1_norm for the different combinations with the aim of finding the max.
        # We can loop in this manner because we are taking absolute values
        L1_norms = []
        for m in range(0, len(neighbor_datasets)):
            for i in range(0, len(neighbor_datasets[m])-1):
                for j in range(i+1, len(neighbor_datasets[m])):
                    L1_norm = np.abs(query.run_query(neighbor_datasets[m][i], percentile) - query.run_query(neighbor_datasets[m][j], percentile))
                    L1_norms.append(L1_norm)
        bounded_global_sensitivity_per_column[column] = max(L1_norms)
    return bounded_global_sensitivity_per_column
```
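Analogously, here is a minimal self-contained check for the bounded case (a hypothetical helper, again for the sum query with Hamming distance 1), where neighbouring datasets have the same size and differ in exactly one record:

```python
import itertools

def bounded_sum_sensitivity(universe, subset_size):
    """Brute-force the bounded global L1 sensitivity of the sum query for
    Hamming distance 1: fix the shared part of the two same-size neighbours,
    then take the worst case over the record in which they differ."""
    max_l1 = 0
    for shared in itertools.combinations(universe, subset_size - 1):
        remaining = list(universe)
        for value in shared:
            remaining.remove(value)
        # The two neighbours are shared + [a] and shared + [b]
        for a, b in itertools.combinations(remaining, 2):
            max_l1 = max(max_l1, abs(a - b))
    return max_l1

# For the sum query, this is the spread of the universe: max value minus min value
print(bounded_sum_sensitivity([1, 2, 3, 10], 2))  # 9
```

Note the contrast with the unbounded case: replacing a record can change a sum by at most the range of the universe, while removing or adding one can change it by the largest value.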

Now we can visualize how the sensitivity varies with the Hamming distance. I will show the code only for the first set of plots.

```
plt.figure(figsize=(15, 7))
query_types = ['mean', 'median', 'count', 'sum', 'std', 'var', 'percentile_25', 'percentile_50', 'percentile_75', 'percentile_90']
x_values = []
for key in unbounded_sensitivities.keys():
    x_values.append(key)
for index_column, column in enumerate(columns):
    # Start the plot
    plot_index = int(str(1) + str(len(columns)) + str(index_column+1))
    plt.subplot(plot_index)
    query_type_legend_handles = []
    for query_type in query_types:
        y_values = []
        for hamming_distance in unbounded_sensitivities.keys():
            y_values.append(unbounded_sensitivities[hamming_distance][query_type][column])
        # Plot the sensitivities
        legend_handle, = plt.plot(x_values, y_values, label=query_type)
        query_type_legend_handles.append(legend_handle)
    # Legends
    legend = plt.legend(handles=query_type_legend_handles, bbox_to_anchor=(0., -0.2, 1., .102),
                        ncol=5, mode="expand", borderaxespad=0.)
    plt.gca().add_artist(legend)
    # Axis labels and titles
    plt.xlabel('Hamming distance')
    plt.ylabel('Sensitivity')
    plt.title('{}) Universe Domain {} = {}'.format(index_column+1, column, D_a_large_universe[column].values))
plt.suptitle('Sensitivities based on unbounded DP of different queries for different domains for different hamming distances')
plt.show()
```

**(Ignore the decimals on the x-axis, hamming distances are integers)**

And here are the rest of the plots:

And for scenario (b):

I hope you enjoyed the tutorial. I invite you to check out the notebook.

**THE END**

GitHub: @tcp | Twitter: @1tcp | LinkedIn: @thiagocostaporto

**Where are you based?**

"I'm based in Lisbon, Portugal. I'm originally from Belo Horizonte, Brazil."

**What do you do?**

"After getting my BSc degree in Computer Science at UFMG in Brazil, I went to Sweden for my master's and the US for my PhD in Operations Research. I'm super into optimization, sports analytics and data visualization. I've been developing web applications since I was a teenager and earlier this year I jumped at the opportunity to lead OpenMined's Web and Mobile team."

**What are your specialties?**

"I have lived through a succession of frameworks and tools for building web applications. These days, I use React in front end development — and Next.js is pretty nice — and I quite enjoy Vue and Svelte as well. In the back end, I'm more used to using Flask and Express, although I've been getting back to Ruby here and there.

Data visualization is a passion of mine and I love the entire d3.js library.

I feel at home using JavaScript, but I still have nostalgic feelings for C. I'm trying to expand my repertoire by including Rust and I'm a regular at OpenMined's Rust Thursdays. That's a great way to learn more about Rust while contributing to our OpenMined libraries. And it's open for anyone to join. Stop by #rust in our community Slack to code with us!"

**How and when did you originally come across OpenMined?**

"It was actually Patrick who introduced me to OpenMined. Back in 2018, I was one of the owners of a software house in Brazil and he was a client. We were working together on a web application and he happened to mention OpenMined and talk about the community, what they were doing and how much it was growing.

I checked it out and it was fantastic! Privacy is a topic of interest which I'm very concerned about, so it was great to join the community and learn not only about the problems related to privacy in many domains but to become part of this group of highly skilled, highly educated individuals coming together and working on the solutions."

**What was the first thing you started working on within OpenMined?**

"PyGrid! I did some proof-of-concept work with PyGrid which was eye-opening for grasping what is possible to do with the "syft-grid ecosystem". Very excited for what's coming next."

**And what are you working on now?**

"I'm basically "taking over" OpenMined's web platforms — OpenMined Education, the website, PyGrid Admin, this blog! — and leading the transition to new versions of these platforms or creating new web applications. Plus, the Web and Mobile team is developing many other web applications that will be unveiled in the next couple of months. These are very exciting times!

If you want to jump in and help us develop our web platforms, please reach out in the #web-and-mobile channel on Slack!"

**What would you say to someone who wants to start contributing?**

"Just do it! OpenMined is a fantastic place to learn and build interesting tools. There are lots of skilled contributors who will help you out and the support team is amazing. Reach out to us, we are friendly!

More importantly, your contributions will help tons of people worldwide and you will be working on tools that help solve impactful real-world problems that everyone can relate to. Just do it!"

**Please recommend one interesting book, podcast or resource to the OpenMined community.**

"I really enjoy the Lex Fridman podcast. It's awesome! Lex has such a unique way to talk to the guests and the topics are varied, ranging from philosophy, to aliens (it is a favorite topic for Lex), AI, teaching, learning, future technology, and many others. He talks to very influential people and it's remarkable how interesting and filled with content these interviews are. He's done some 3-hour, 4-hour long podcasts that are not hard to listen to."

[**Originally published by Lourdes M. Turrecha in the Privacy & Technology Medium blog. Part 2 of this privacy tech series will provide a framework for assessing privacy tech. Part 3 will cover the privacy problems that are solvable/not solvable through technology.**]

There is increasing interest in the nascent privacy tech landscape.

Entrepreneurs are building solutions in response to privacy and data protection problems. As of the publication of this post, Crunchbase lists 824 self-described privacy companies.

Investor interest in funding privacy tech startups is also strong, even in the midst of the pandemic when funding was reportedly stalled. Crunchbase reports that investors have poured $4.6B in cumulative funding towards emerging privacy companies.

2020 saw the first two privacy tech startups gaining unicorn status: OneTrust and BigID.

In the midst of the pandemic, COVID-19 tool developers came out touting privacy features, including Apple and Google’s joint contact tracing proposal.

In the consumer space, people's concern over their privacy led to increased adoption of the Signal messaging app, making it the top downloaded app during last year's wave of national protests in support of the Black Lives Matter movement. Signal downloads surged again during the recent US national elections and in early 2021 after Elon Musk urged his Twitter followers to "Use Signal." Users are also increasingly turning to privacy tech tools like Brave and DuckDuckGo, which have reported similar increases in adoption as privacy-preserving alternatives to existing privacy-invasive browsers and search engines.

Clearly, privacy tech is on the rise, but understanding of what constitutes privacy tech remains low.

What exactly is privacy tech? And how nascent is it, really?

To define and understand privacy tech, it is important to understand the operative term: privacy. There are different branches of privacy: physical, behavioral, decisional, and information privacy. In today’s digital world, the other branches of privacy often feed into information privacy.

Thanks to many privacy scholars, privacy is a well-covered and explored concept. Perhaps the most widely accepted definition of privacy centers on individual control over personal information. That said, different scholars have offered varying conceptions of privacy beyond individual control: secrecy, trust, obscurity, power, and contextual integrity, just to name a few. We briefly expand on some of these conceptions below.

Privacy as secrecy underpins the US constitutional right to privacy, stemming from SCOTUS cases, *Griswold* and

Alan Westin defined information privacy as “the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.” This is perhaps the most commonly accepted definition of information privacy: individuals retain ultimate control over their personal data, including how much of it is disclosed and to whom, as well as how it should be maintained and disseminated.

Building on the privacy as choice school of thought, Carisa Veliz argues that privacy is power. In a world where personal data is constantly being harvested and exploited through a surveillance economy, she urges users to exercise their power and take back control over their personal data.

In contrast, Neil Richards and Woodrow Hartzog conceptualize privacy as trust. Trust, here, includes building information relationships over time. Trust allows us to develop more long-term relationships when it comes to information by having sincere exchanges with the confidence that the data we’ve shared will be used for our benefit, not to our detriment.

Ari Ezra Waldman furthers this point by explaining that what makes expectations of privacy reasonable are expectations of trust. Waldman defines trust as a social fact of cooperative behavior, manifested by reciprocal exchanges, assumptions about intentions, and expectations regarding future action. Trust is described here as a functional necessity of society because it “greases the wheels of effective sharing,” or in other words, trust encourages interactions. Tying this concept to privacy, Waldman argues that when we trust, we share our personal information.

Evan Selinger and Woodrow Hartzog also explore the concept of obscurity and how it applies to privacy. They explain that obscurity is the idea that information or people are safe, when they are hard to obtain or understand. If information is hard to access or understand, the only people who will take advantage of it are those who are sufficiently motivated to break barriers or expend effort and resources to access or understand it.

Helen Nissenbaum offers contextual integrity as an alternative conception of privacy, accounting for privacy norms or expectations in different contexts. Nissenbaum explains that contextual integrity links privacy protection to norms for specific contexts, so that “information gathering and dissemination [are] appropriate to that context and obey the governing norms of distribution within it.”

Daniel Solove argues that instead of forcing a unified conception of privacy, we should conceive privacy in a more pluralistic way, focusing on * privacy problems*. Solove provides a taxonomy of privacy problems broken down into four main categories: information collection, information processing, information dissemination, and invasion.

As explored by the privacy thinkers cited above, privacy is a complex, multi-faceted concept that encompasses secrecy, individual control, trust, obscurity, power, and context. Because many of these conceptions of privacy are limited and do not fully represent privacy on their own, it makes sense to adopt a broader conception of privacy by looking at privacy problems.

Privacy tech is the broad term used to describe technology solutions that solve privacy problems.

That said, solving a privacy problem doesn't necessarily need to be a privacy tech tool's primary purpose. A privacy solution can instead be a key differentiating feature. Examples of this are the aforementioned Brave, Signal, and DuckDuckGo products, which are primarily browsing, messaging, and searching tools built with privacy as a key differentiating feature.

The privacy tech landscape might be nascent from funding and market viability standpoints, but privacy tech is the culmination of decades’ worth of developments from privacy academia and thought leadership (as covered above), the privacy regulatory landscape, the privacy engineering field, the maturing of early privacy enhancing technologies, the development of consumer privacy sentiment, and, most recently, VC interest in privacy tech solutions. All of these paved the way for privacy innovation and created a market for tools that solve privacy problems.

To further illustrate privacy tech’s scope, in addition to defining it, we explore its relationships to the following existing concepts that are often used alongside or interchangeably with it: security, privacy enhancing technologies, privacy by design, privacy engineering, data protection compliance tools, and privacy-first tech.

The information privacy and security domains intersect, but there is also a clear distinction between the two. They intersect in that information privacy requires personal information to be secured during its entire lifecycle, from collection and use through transit, storage, and destruction. But the privacy domain covers a breadth of inquiries beyond security, such as transparency (notice and consent), data minimization, purpose specification, individual rights, etc.

There are existing bodies of literature defining privacy and security. But for the purpose of this article, we refer back to the complex and multi-faceted conceptions of privacy that encompass individual control, trust, obscurity, power, context, and anonymity. Whereas, information security is concerned with the confidentiality, integrity, and availability of information (not just personal information, but also trade secrets, intellectual property, and other information that warrants securing) and the systems that process such information.

Given privacy’s breadth of inquiry beyond security, the privacy and security conflation is oftentimes an indicator that a self-proclaimed privacy tech company may not understand privacy, or that it may not really be a privacy tech company (but is a security company capitalizing on the rising privacy tech marketplace).

Privacy tech encompasses privacy enhancing technologies (PETs), but also more. It’s hard to definitively claim that privacy tech is merely PETs because there is currently no universally accepted definition for PETs.

That said, the Organisation for Economic Co-operation and Development’s (OECD) definition of PETs is illuminating: *PETs commonly refer to a wide range of technologies that help protect personal privacy. PETs aim to give the individual user or technology manager the capability of controlling if, how much or under what circumstances information is disclosed.*

PETs’ primary aim is to protect *personal* privacy, whereas privacy tech covers solutions that solve for

A second widely cited definition of PETs is: a coherent system of technology measures that protect privacy by eliminating or reducing personal data or by preventing unnecessary and/or undesired processing of personal data, *all without losing the functionality of the data system*.

Under this second definition for PETs, a technology that blocks online ad tracking would not be a PET because it interferes with system functionality, but it would certainly be privacy tech because it solves for a privacy problem: a user’s choice not to be tracked. In comparison, a technology that allowed ad personalization while giving users control over their personal data would be a PET under this second definition for PETs and, in turn, would also be privacy tech because it both solves for a privacy problem (user control over their personal data) without losing ad functionality.

PETs have also been used to cover computational tools such as differential privacy, homomorphic encryption, secure multi-party computation, and zero-knowledge proofs.

For a long time, PETs rarely made it to market due to challenges surrounding tech maturity and business models. The broader umbrella of privacy tech, on the other hand, covers some tools that are already out in the marketplace, ready for users.

In addition to PETs, privacy tech also includes “privacy-first” tools, a term used to describe tools that provide their service in a privacy-forward way. Their primary purpose is not to solve for a privacy problem. Instead, their primary purpose could be anything else, including search (in the case of DuckDuckGo), browsing (Brave), and messaging (Signal). That said, these tools offer privacy solutions despite their non-privacy primary purpose. Brave solves for privacy problems in web browsing; DuckDuckGo, in search; and Signal, in messaging.

Because their primary purpose is not privacy, one could argue that the term “privacy-first” is an ill-fitting term for such tools. Other more appropriate terms could be “privacy-forward,” “privacy-friendly,” or simply, “privacy tech.”

Setting aside the appropriate label, at the end of the day, these tools solve privacy problems (albeit as a secondary purpose), and therefore fall under the broader umbrella of privacy tech.

With today’s double whammy of government surveillance and surveillance capitalism, it’s not surprising that a slew of anonymity tech has popped up. Anonymity tech refers to tools that help users maintain anonymity online. These include Tor, certain zero-knowledge proof rollups on blockchains, and other cryptographic technologies that solve for anonymity.

As covered above, privacy is by no means reducible to anonymity. Under the privacy-as-choice school of thought, privacy covers other individual choices and exercises of power over personal information beyond the choice to remain anonymous. Anonymity is merely one way to exercise privacy choice: an incomplete take on privacy, but still an important aspect of it.

Anonymity tech solves for an individual’s problem stemming from a desire to remain anonymous in certain situations in this highly connected and surveilled world. Because privacy includes individual choice and context, the choice to remain anonymous in certain contexts is a privacy problem. This means that anonymity tech is covered under the broader umbrella of privacy tech.

Privacy tech is neither privacy by design nor privacy engineering. It is a distinct concept on its own, but it does require both.

Lowercase privacy by design means privacy aforethought. Building upon this, Ann Cavoukian coined uppercase *Privacy by Design*,

Enter *privacy engineering*, the “how” to PbD’s “what.” Coined by Thomas Finneran, privacy engineering refers to “a discrete discipline or field of inquiry and innovation … using engineering principles and processes to build controls and measures into processes, systems, components, and products that enable the authorized, fair, and legitimate processing of personal data.” Privacy engineering provides a discipline for achieving Privacy by Design principles.

Bringing these concepts together, privacy tech is one of the tangible outputs of theoretical privacy by design. Privacy engineering provides the discipline (or the *how*) for translating privacy by design principles (or the

Privacy tech includes B2B data protection compliance tools, which may not necessarily or directly improve personal privacy, unlike PETs. Instead, these B2B privacy compliance tools help companies solve their privacy or data protection compliance problems. These compliance tools may not directly address individual privacy, but the argument is as follows: because data protection laws are supposed to improve individual privacy, these compliance tools are helping companies improve their data protection practices, which in turn should then improve these companies’ own users’ privacy.

Although privacy tech has gained more momentum in the B2B space, tools that protect consumer privacy also qualify as privacy tech because they solve consumer privacy problems. Examples of consumer privacy tech tools that we’ve already mentioned include private browser, search, and messaging tools such as Brave, DuckDuckGo, and Signal. We’re also seeing more and more data subject request (DSR) tools and digital wallets for consumers.

Interestingly, the consumer privacy tech space is playing catch-up to the enterprise privacy tech space. This is likely due to a combination of several factors. First, it’s only recently that consumer sentiment has shifted to demand privacy tech tools. It wasn’t until last year that more than half of Americans reported rejecting online services based on privacy concerns, as reported by Pew Research in their annual consumer privacy study. The pandemic further highlighted this sensitivity, with various studies reporting up to 90% of consumers refusing to use contact tracing tools due to privacy concerns. Second, it also wasn’t until recently that Big Tech platforms like Apple and Google stepped in to require app developers on their platforms to build increased privacy features. Lastly, while there have been privacy laws on the books for decades, it wasn’t until the last few years that we saw the passing of comprehensive privacy laws that explicitly provide individual privacy rights to access, correction, portability, deletion, etc.

We covered B2B data protection compliance tools and B2C consumer privacy tools, but note that there are other types of B2B privacy tech and that there are also emerging B2B2C privacy tech solutions.

Perhaps most telling of its rise, maturity, and increasing introduction to the marketplace, privacy tech is also used to describe the emerging market of privacy tech solutions: both at the consumer and enterprise levels. This is analogous to how cybersecurity not only refers to tools that protect our computer systems, but also the market of cybersecurity tech solutions. In contrast to the mature cybersecurity landscape, there simply wasn’t much of a niche market for privacy tech until recently.

As the privacy tech landscape continues to grow, it’s critical to define and understand what qualifies as privacy tech for several reasons:

- **To facilitate healthy debate and avoid talking past each other.** One could argue that definitions are mere semantics, but being clear about what we’re talking about has significant implications, as further outlined in the bulletpoints below. Most importantly, it will help us avoid talking past each other, saving us valuable time and resources, and enable us instead to focus on the greater goal of moving privacy and true privacy tech solutions forward.
- **To practice transparency.** Transparency is a key privacy principle. Consumer and enterprise customers deserve to know whether the privacy-branded tools they’re using actually solve privacy problems, including which problems they’re trying to solve.
- **To quash “privacy washing.”** It’s important to learn from other movements like the green movement and the diversity & inclusion movement, which have suffered “greenwashing” and “diversity washing,” and to quash “privacy washing” in its tracks. We expect companies purporting to solve privacy problems through technology to actually build these privacy tech solutions, not just advertise them.
- **To assess privacy tools.** Privacy tech boundaries help customers better distinguish true privacy tech from tools that don’t actually solve privacy problems. For example, while tools in the adjacent data, advertising, and cybersecurity industries may sometimes overlap with privacy tech, we want to know when they are and aren’t providing tech solutions to privacy problems.
- **To fuel the privacy tech industry’s success.** It’s unlikely that every one of the 824 self-described privacy companies listed in Crunchbase builds technologies that meet the definition of privacy tech. By being clear about what privacy tech is, we weed out technologies that don’t actually solve for privacy problems. This improves the privacy tech landscape’s success as it matures and strengthens the privacy tech market’s viability in the long run. In turn, this means better privacy tech tools for both businesses and consumers.
- **To safeguard our fundamental right to privacy.** As many privacy advocates believe, individuals have a fundamental right to privacy. But this fundamental right hasn’t materialized in the technologies that we use, perhaps partly due to regimes like the US treating privacy as a consumer protection issue instead of a fundamental rights issue. We shouldn’t have to choose between functionality and privacy. While we wait for laws to catch up to innovation, we can do our part by building tools that solve privacy problems.

To continue solving privacy problems and fueling privacy innovation, we need to begin with understanding and defining what constitutes privacy tech.

Steganography is the art of hiding information. Whereas cryptography aims to prevent people understanding information once it has been found, the goal of steganography is to prevent the information from being discovered in the first place. Of course, for any practical covert communication you’re likely to want to combine the two, so that even if the message is discovered it cannot be understood. In this blog post we’ll see how neural networks can be used as a vehicle for covert communication.

In a digital context, the steganographer wants to hide a payload inside another file (or set of files) in such a way that those files appear normal so that the existence of a message is not detected. For example, some PNG images store 3 bytes of RGB data for each pixel in the image (1 byte per colour channel). But modifying the least significant bit (LSB) of each of these bytes will not have a perceptible impact on how the image looks to a human observer. This gives us one of the classic steganographic techniques: serializing the payload data as a string of bits, and setting each LSB in the image data to the corresponding bit from the payload. To recover the hidden message, you simply read off the LSBs of the image’s RGB data.
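The serialize-and-set-LSBs idea can be sketched in a few lines of Python. This is a toy illustration over a list of raw byte values rather than a real image file, with hypothetical `embed`/`extract` helpers:

```python
def embed(pixel_bytes, payload):
    """Set the LSB of each carrier byte to one bit of the payload (MSB first)."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(pixel_bytes):
        raise ValueError("payload too large for carrier")
    stego = list(pixel_bytes)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # clear the LSB, then set it to the payload bit
    return stego

def extract(pixel_bytes, n_bytes):
    """Read the payload bits back off the carrier's LSBs."""
    bits = [b & 1 for b in pixel_bytes[: n_bytes * 8]]
    return bytes(
        int("".join(map(str, bits[i : i + 8])), 2) for i in range(0, n_bytes * 8, 8)
    )

carrier = list(range(200, 248))  # 48 fake RGB bytes -> room for 6 payload bytes
stego = embed(carrier, b"hi")
print(extract(stego, 2))  # b'hi'
```

Each carrier byte changes by at most 1, which is imperceptible in an 8-bit colour channel.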

This technique takes advantage of a general property of such vehicle files, which is that they have more precision than they need: the image in question could just as well have been stored in a format which used one bit fewer per RGB value. The steganographer repurposes the unused precision to store additional information.

This brings us to neural networks as potential vehicles for steganography. Serialized neural networks are often very large files, which means they can potentially hold a lot of covert information without arousing suspicion, assuming they can be modified without affecting performance. The question is: can we use the same LSB technique here, by modifying the values of the network parameters, ie the weights and biases?

It turns out this works. There’s a proof of concept in my machine learning/infosec repository here. In this example `sender.py` trains a simple PyTorch network that can categorize MNIST digits with >95% accuracy, and then adjusts the parameters using the LSB technique to store a secret image file. Then `receiver.py` loads this model, compares the performance of the original network with the modified network, and also reconstructs the image from the data stored in the modified network. The performance on the MNIST test set appears to be unaffected. It would be interesting to see whether this result holds for deeper networks (eg some of the widely used pre-trained networks) or whether the modifications add up to performance degradation when many layers are involved.
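
The parameter-level analogue of the image trick can be sketched directly with NumPy, by reinterpreting the float32 parameters as unsigned integers and overwriting their lowest mantissa bits (function names are illustrative, not the repository’s actual code):

```python
import numpy as np

def embed_bits(params: np.ndarray, payload: bytes) -> np.ndarray:
    """Hide payload in the least significant mantissa bit of float32 parameters."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    raw = params.astype(np.float32).view(np.uint32).copy()
    assert bits.size <= raw.size, "payload too large for this parameter tensor"
    raw[:bits.size] = (raw[:bits.size] & ~np.uint32(1)) | bits  # set LSB per bit
    return raw.view(np.float32)

def extract_bits(params: np.ndarray, n_bytes: int) -> bytes:
    """Recover n_bytes of payload from the parameter LSBs."""
    bits = (params.astype(np.float32).view(np.uint32)[:n_bytes * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()
```

Each write changes a weight by at most one unit in the last place (a relative change of roughly 2⁻²³), which is consistent with the MNIST accuracy being unaffected.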

The data storage in this demo could probably be made more efficient by setting two LSBs per parameter (or three, or ...). And for situations where the LSB really is needed for the original performance level, you always have the option of artificially increasing the precision. For example, if the original model works with 32-bit float values, you could translate the model to 64-bit floats: every float32 value is exactly representable in float64, whose mantissa is 29 bits longer, so you gain 29 bits of spare precision per parameter. Of course this would drastically increase the size of the model file, which might be a problem.

One obvious use case for this would be exfiltration: a tech company employee could upload a neural network somewhere, making the network request look innocent, but actually the parameters encode some company secrets.

Another intriguing possibility is that publicly available pre-trained networks could be used to disseminate a relatively large amount of arbitrary data, for subsequent retrieval, to the machines of a large number of machine learning practitioners, who may be considered a valuable target group for certain kinds of attack. This probably has very few applications, but could perhaps be useful for an attacker who can’t make arbitrary network requests on the victim’s machine and can only deliver a small payload using other means.

Cover Image: Photo by Alina Grubnyak on Unsplash.

An interview with Yuxin Wang, a rising star in program analysis of differential privacy.

**Editors: Wenhui Zhang, Yizheng Jiao, Yuxin Wang, Bala Priya**

Google Scholar: @yuxinwang | LinkedIn: @yuxinwangcs

Differential privacy has been a big and hot topic in recent years. Ever since Dwork et al. [1] proposed differential privacy as a notion to quantify privacy leakage, there have been works that propose new algorithms to release different kinds of information while satisfying differential privacy. We had an interview with Yuxin Wang, a 4th-year Ph.D. student at Pennsylvania State University, who focuses on verification and program analysis of differential privacy.

**Hi, Yuxin, why don’t we start with introducing the topic you are working on?**

I mainly work on program analysis of differential privacy, including verification, violation detection, and program synthesis of algorithms that are related to differential privacy.

**As a researcher working in Differential Privacy for so many years, do you mind sharing your insights on what privacy means from your perspective?**

I think privacy means that information about you stays yours. In other words, the information that can identify you should remain private, such as your SSN, driver’s license number, medical records, or your name: anything that reveals information that has a connection to you.

**So, put it together, do you mind giving us an example of Differential Privacy?**

To put it simply, suppose we have a database that has your information, denoted as D1, and the other one without your data denoted as D2. A mechanism M is differentially private if the outputs of M(D1) and M(D2) are “roughly” the same. This is sometimes referred to as “I could have opted out”, meaning that it’s hard to judge only from the outputs of the algorithm if your data is in the database. Interested readers can refer to [1] for a full definition.
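
A small numerical check of this “roughly the same” property, using a counting query with the Laplace mechanism (all values here are illustrative; the query has sensitivity 1, so the noise scale is 1/ε): the probability of any output under M(D1) stays within a factor of e^ε of its probability under M(D2).

```python
import numpy as np

def laplace_pdf(x, mu, scale):
    """Density of the Laplace distribution centred at mu."""
    return np.exp(-np.abs(x - mu) / scale) / (2 * scale)

eps = 0.5
scale = 1.0 / eps             # counting query: sensitivity 1
count_d1, count_d2 = 100, 99  # true counts with and without your record

xs = np.linspace(80, 120, 1001)
ratio = laplace_pdf(xs, count_d1, scale) / laplace_pdf(xs, count_d2, scale)

# The privacy-loss ratio is bounded on both sides: the outputs are "roughly the same".
assert np.all(ratio <= np.exp(eps) + 1e-9)
assert np.all(ratio >= np.exp(-eps) - 1e-9)
```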

**Why is differential privacy important? Actually, this question can be made broader: Why is privacy important?**

I think recent incidents involving Facebook and Cambridge Analytica showed us that the breach of individual privacy can have a great negative impact on society. On the bright side, this raises awareness of the importance of privacy, so we have seen many actions taken on the legal side, such as GDPR and other laws for privacy protection.

**How is differential privacy enforced in programming languages?**

In academia, there have been several lines of work that aim at automated verification and violation detection.

**Could you introduce a few interesting works in the field of program analysis of differential privacy?**

Sure, in the past years we have seen many works on Program Verification. One line of work uses a proving technique called Randomness Alignment; it is a simple yet powerful technique, and a few tools have been developed [7,10-11] that are capable of proving a variety of fundamental differential privacy algorithms.

Besides alignment-based proofs, probabilistic couplings and liftings [12-14] have also been used in language-based verification of differential privacy. Most notably, Albarghouthi and Hsu [12] proposed the first automated tool capable of generating coupling proofs for complex mechanisms. Coupling proofs are known to be more general than alignment-based proofs, while alignment-based proofs are more lightweight.

There is also an early line of work [15-19] that uses (variations of) relational Hoare logic and linear indexed types to derive differential privacy guarantees.

On the Violation Detection side, Ding et al. [20] and Bichsel et al. [21] proposed counterexample generators that rely on sampling, i.e., running an algorithm hundreds of thousands of times to estimate the output distribution of mechanisms (this information is then used to find counterexamples). Our recent work [7] uses static analysis to detect violations without actually running the code.

Recently there have been works on Program Synthesis (i.e., taking a non-private base algorithm and making it differentially private). For example: (1) KOLAHAL [5] takes a sketch program with unknown noise scales but known locations, then tries to find good values for the scales by leveraging the counterexamples generated by [2] and solving a continuous optimization approximation problem; (2) an earlier synthesizer [6] relies on user-supplied examples and a sensitivity-directed differential privacy analysis language called DFuzz [4] to automatically synthesize DFuzz programs that satisfy differential privacy.

**There are some ongoing research works on differential privacy-preserving databases. Do you think program analysis could help with building such systems?**

Programmers write programs that they claim satisfy differential privacy, often accompanying the claim with a mathematical proof. However, manual proof is hard and error-prone; as shown in [9], even differential privacy experts can make mistakes. So automated verification can help in the development of differentially private algorithms and even provide insights for new algorithms.

**How to verify if differential privacy is enforced or violated?**

Generally speaking, it requires either a proof (mathematical, or machine-generated by verification tools) showing that differential privacy is satisfied, or statistical evidence that differential privacy is likely violated.

Thanks, Yuxin, for sharing your time with us and explaining all the details. If you have more questions, please feel free to contact **Yuxin**. Hope you enjoyed this session, and that more of you will join the workforce and movement of differential privacy-preserving systems, i.e., **OpenMined**.

Thanks! If you would like to discuss anything related to differential privacy, feel free to reach me at yxwang@psu.edu.

[1] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[2] Zeyu Ding, Yuxin Wang, Guanhong Wang, Danfeng Zhang, and Daniel Kifer. Detecting violations of differential privacy. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pages 475–489, New York, NY, USA, 2018. ACM.

[3] Zeyu Ding, Yuxin Wang, Danfeng Zhang, and Daniel Kifer. Free gap information from the differentially private sparse vector and noisy max mechanisms. PVLDB, 13(3):293–306, 2019.

[4] Marco Gaboardi, Andreas Haeberlen, Justin Hsu, Arjun Narayan, and Benjamin C. Pierce. Linear dependent types for differential privacy. In Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’13, pages 357–370, New York, NY, USA, 2013. ACM.

[5] S. Roy, J. Hsu, and A. Albarghouthi. Learning differentially private mechanisms. In IEEE Symposium on Security and Privacy (SP), pages 1033–1046, Los Alamitos, CA, USA, may 2021. IEEE Computer Society.

[6] Calvin Smith and Aws Albarghouthi. Synthesizing differentially private programs. Proc. ACM Program. Lang., 3(ICFP), July 2019.

[7] Yuxin Wang, Zeyu Ding, Daniel Kifer, and Danfeng Zhang. CheckDP: An automated and integrated approach for proving differential privacy or finding precise counterexamples. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, CCS ’20, page 919–938, New York, NY, USA, 2020. Association for Computing Machinery.

[8] Jun Zhang, Xiaokui Xiao, and Xing Xie. Privtree: A differentially private algorithm for hierarchical decompositions. In Proceedings of the 2016 International Conference on Management of Data, pages 155–170, 2016.

[9] Min Lyu, Dong Su, and Ninghui Li. Understanding the sparse vector technique for differential privacy. arXiv preprint arXiv:1603.01699, 2016.

[10] Danfeng Zhang and Daniel Kifer. 2017. LightDP: Towards Automating Differential Privacy Proofs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (Paris, France) (POPL 2017). ACM, New York, NY, USA, 888–901.

[11] Yuxin Wang, Zeyu Ding, Guanhong Wang, Daniel Kifer, and Danfeng Zhang. 2019. Proving Differential Privacy with Shadow Execution. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (Phoenix, AZ, USA) (PLDI 2019). ACM, New York, NY, USA, 655–669. https://doi.org/10.1145/3314221.3314619

[12] Aws Albarghouthi and Justin Hsu. 2017. Synthesizing Coupling Proofs of Differential Privacy. Proceedings of ACM Programming Languages 2, POPL, Article 58 (Dec. 2017), 30 pages.

[13] Gilles Barthe, Noémie Fong, Marco Gaboardi, Benjamin Grégoire, Justin Hsu, and Pierre-Yves Strub. 2016. Advanced Probabilistic Couplings for Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (Vienna, Austria) (CCS ’16). ACM, New York, NY, USA, 55–67.

[14] Gilles Barthe, Marco Gaboardi, Benjamin Grégoire, Justin Hsu, and Pierre-Yves Strub. 2016. Proving Differential Privacy via Probabilistic Couplings. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (New York, NY, USA) (LICS ’16). ACM, New York, NY, USA, 749–758.

[15] Gilles Barthe, Marco Gaboardi, Emilio Jesús Gallego Arias, Justin Hsu, César Kunz, and Pierre-Yves Strub. 2014. Proving Differential Privacy in Hoare Logic. In Proceedings of the 2014 IEEE 27th Computer Security Foundations Symposium (CSF ’14). IEEE Computer Society, Washington, DC, USA, 411–424.

[16] Gilles Barthe, Boris Köpf, Federico Olmedo, and Santiago Zanella Béguelin. 2012. Probabilistic Relational Reasoning for Differential Privacy. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Philadelphia, PA, USA) (POPL ’12). ACM, New York, NY, USA, 97–110.

[17] Gilles Barthe and Federico Olmedo. 2013. Beyond Differential Privacy: Composition Theorems and Relational Logic for f-divergences Between Probabilistic Programs. In Proceedings of the 40th International Conference on Automata, Languages, and Programming - Volume Part II (Riga, Latvia) (ICALP ’13). Springer-Verlag, Berlin, Heidelberg, 49–60.

[18] Marco Gaboardi, Andreas Haeberlen, Justin Hsu, Arjun Narayan, and Benjamin C. Pierce. 2013. Linear Dependent Types for Differential Privacy. In Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Rome, Italy) (POPL ’13). ACM, New York, NY, USA, 357–370. https://doi.org/10.1145/2429069.2429113

[19] Jason Reed and Benjamin C. Pierce. 2010. Distance Makes the Types Grow Stronger: A Calculus for Differential Privacy. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming (Baltimore, Maryland, USA) (ICFP ’10). ACM, New York, NY, USA, 157–168. https://doi.org/10.1145/1863543.1863568

[20] Zeyu Ding, Yuxin Wang, Guanhong Wang, Danfeng Zhang, and Daniel Kifer. 2018. Detecting Violations of Differential Privacy. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (Toronto, Canada) (CCS ’18). ACM, New York, NY, USA, 475–489.

[21] Benjamin Bichsel, Timon Gehr, Dana Drachsler-Cohen, Petar Tsankov, and Martin Vechev. 2018. DP-Finder: Finding Differential Privacy Violations by Sampling and Optimization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (Toronto, Canada) (CCS ’18). ACM, New York, NY, USA, 508–524.

**Confidential computing explained:**

Part 2: attestation

**I — Introduction**

We saw in the previous article why we need techniques such as confidential computing to secure sensitive workloads, and how different it is from regular methods.

The above scheme, used in the last article, sums it up:

- The user encrypts their data using a key only they have access to
- They send the encrypted data to the service provider
- The service provider then uses an enclave on the data in a way that prevents the service provider from seeing the data in the clear
- Finally the transformed data is encrypted using another key, only available to the user, and the encrypted result is sent back to the client

This scenario seems relatively simple, but many things happen behind the scenes to make it possible.

In particular, there are hardware features that allow service providers to launch enclaves and load them with code, while protecting the enclave contents from being read by anyone outside the enclave. We will not cover this feature in depth for now, as we will see it in the technical articles. All you need to remember is that enclaves are like the magic hats: they allow someone to manipulate someone else’s data without being able to see what the data is.

Instead, we will cover here one fundamental property of Intel SGX enclaves that makes this solution extremely interesting: code integrity with attestation.

**II — The problem**

Once we assume that only an enclave can handle sensitive data, using a secure communication channel directly between the enclave and the user, and that the service provider has no way to know what data is sent to the enclave, we still have one question: how can one be convinced that the right code is loaded in the enclave, and that this code will not be malicious?

Indeed, we know that only the enclave code will be applied to the data, and that the service provider will not be able to know what is happening inside. But what if the enclave code contains a backdoor to allow the service provider to retrieve information from its users? Even worse, what proof do I have that the service provider actually uses a genuine enclave to handle my data?

If we continue with the magic hat analogy, the two questions we have are:

- What proof do I have that the magic hat that the magician shows me, is indeed a magic hat, and not a regular hat that could leak all my belongings?
- Supposing that the hat is indeed a magic hat, what proof do I have that this magic hat is not malicious and secretly leaks things to the service provider?

To answer these two questions, Intel provides a very interesting mechanism called **attestation**. The idea of attestation is to give someone external to the platform hosting the enclave evidence that they are talking to a genuine enclave with the right code loaded inside.

**III — Attestation**

Let us come back to the analogy of the ring and the magic hat. The magician wants to open a business to handle people’s rings with magic hats, so that her clients never have to send their rings directly but still benefit from her expertise. To do so, she will handle the rings in magic hats that are ready to be used on the sealed boxes containing the rings. But how does she actually create such magic hats?

Well, suppose there is a very famous felt manufacturer in your town, Intel Magic Felt, that sells felt to magicians who want to craft magic hats, and only hats made from their magic felt can be considered magic hats. Felt is sold in packs, and each pack has a special and unique component added, which is known only to Intel and makes each felt pack unique.

Magicians then buy magic felt packs and design their magic hats to answer their clients’ needs. At the end of fabrication, a small label is attached to the magic hat, which attests to the materials used and to what the hat does exactly once it is put on objects. Because magic felt from Intel Magic Felt was used, this label is signed with special properties that only Intel Magic Felt can bestow. This means that people other than Intel Magic Felt will not be able to reproduce labels that look like genuine Intel ones. This is possible thanks to the components Intel added to each felt pack.

Therefore, when someone wants to know if they should trust a magic hat, they simply need to ask Intel, and Intel will tell them:

- Whether or not the hat they are shown indeed comes with the protection provided by hats made with Intel magic felt
- What the hat will do exactly, as hats made with magic felt will have an unforgeable and unique label saying what the hat will do. Each hat is uniquely identifiable and will have an exact purpose

For a service provider to convince the user that the hat is indeed magical before precious secrets are sent, the following process happens:

- 1: the service provider shows the label of the hat they will use on the user’s belongings
- 2: the user takes the label attached to the hat and sends it to Intel
- 3: Intel replies by saying whether or not the hat the label is attached to is indeed magical. Because Intel has a list of all the felt packs it sold, Intel can tell whether or not the label it is shown indeed comes from one of its factories. Moreover, the label on the hat contains information about what the hat does, and this cannot be forged, as magic felt was used
- 4: the user can then send their belongings securely to the service provider, once they know that they are indeed dealing with a magic hat doing exactly what they want

If the magician tries to trick her clients by asking them to send their precious belongings to a hat that is not magical, or to a malicious magic hat, they will know right away during attestation, before anything is given to the magician.

As you might guess, the process is similar with Intel SGX: in addition to the security features securing enclaves, tamper-proof secrets are written into the hardware.

Once code is loaded into the enclave, a report can be generated containing a hash of the code. This report is signed by a key derived from the hardware secrets and negotiated with Intel. Then a user wanting to consume the service provided by the enclave will ask Intel whether the report indeed comes from a genuine enclave, and Intel will approve or reject it.
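
As a toy sketch of this flow (using an HMAC as a stand-in for the hardware-derived signing key and for Intel’s attestation service; real SGX attestation uses EPID/DCAP signatures, so every name below is illustrative):

```python
import hashlib
import hmac
import secrets

# Stand-ins: a per-CPU secret fused in at manufacture, and Intel's record of it.
HARDWARE_KEY = secrets.token_bytes(32)
INTEL_PROVISIONED_KEYS = {HARDWARE_KEY}

def generate_report(enclave_code: bytes) -> tuple:
    """Enclave side: measure (hash) the loaded code and sign the measurement."""
    measurement = hashlib.sha256(enclave_code).digest()
    signature = hmac.new(HARDWARE_KEY, measurement, hashlib.sha256).digest()
    return measurement, signature

def intel_verify(measurement: bytes, signature: bytes) -> bool:
    """Intel side: was this report signed by a key Intel actually provisioned?"""
    return any(hmac.compare_digest(hmac.new(k, measurement, hashlib.sha256).digest(),
                                   signature)
               for k in INTEL_PROVISIONED_KEYS)

def user_accepts(measurement: bytes, signature: bytes, audited_code: bytes) -> bool:
    """User side: genuine hardware AND exactly the code I audited."""
    return (intel_verify(measurement, signature)
            and measurement == hashlib.sha256(audited_code).digest())
```

The two magic-hat questions map directly onto the two checks in `user_accepts`: the signature shows the hat is really magic, and the measurement shows what the hat does.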

The real process is a little more involved than that in practice, but if you understand all this it is already a great thing! So stay tuned because we will cover the technical details in the next article.

Notebook

(The HTML file is too large for viewing in GitHub, the fastest way to access the content is to download the HTML file and open it with your browser)

The goal of this notebook is to implement the following paper from 2012, with around 50 citations: *Differential Identifiability*, a work by Jaewoo Lee et al., which you may find here.

For more background of what differential privacy is, check this old post.

- I explain the steps involved in the proof of differential identifiability in a more granular way than the ones provided in the paper so that other researchers can quickly and deeply understand the publication
- I provide the source code to implement differential identifiability
- I provide my humble thoughts about the usefulness of this publication
- I verify that the results from the computations in the paper are correct
- All the results check out; however, I have not been able to replicate the sensitive range of the standard deviation

Differential privacy (DP) applied to analytics ensures that the results do not reveal additional information about any individual, whether or not they are in the dataset.

While DP (a semantic method) has mathematical guarantees of indistinguishability and, unlike syntactic methods such as k-anonymity, avoids modeling the prior knowledge of an adversary (A), DP does not include an intuitive or systematic process to select the parameter ε. How can we go from extant data privacy regulation to actionable DP algorithms? ε limits how much an individual may affect the resulting model, not how much information is revealed about an individual, which is what regulation refers to.

This work from Jaewoo Lee strives to make this process simpler by introducing a new parameter **ρ, which bounds the probability that an individual can be identified as a contributor to the results of the query**. Jaewoo calculated the worst-case probability that an individual is in the US Census (looking at people older than 85), rounded the probability to 1.5%, and used this number to set his new parameter. If the algorithm executed in a set of possible worlds yields a posterior larger than ρ for at least one of them, then the data of individuals in the dataset may be linked to their identities, or the data may be linked to other datasets.

This new formulation of DP, called differential identifiability (DI), offers the same guarantees, i.e., offers protection against an arbitrarily strong adversary.

Additionally, there is a connection between ε and ρ, which one can exploit to set ε based on ρ.

In this paper, Jaewoo proves how to achieve DI and gives examples of how to use it.

Great knowledge nugget:

"The protection of differential privacy measures only the impact of an individual on the output, not the ability to identify an individual, or even the ability to infer data values for an individual."

I very much liked this paper; it is concise and informative. The author makes the math beautifully simple yet powerful. It helps to see DP from another angle and increases your understanding of the technology. However, I think it is also non-trivial to go from regulation to a probability that tells you how likely it is to find someone in the dataset. In the paper, Jaewoo calculates the chances that someone is in the dataset when that person is older than 85, but he does not consider the additional information from other datasets, which may increase this probability. While DP with ε allows you to avoid considering prior knowledge (only the worst-case but straightforward scenario of an informed adversary), ρ makes you consider it, and finding datasets that can be linked, and that therefore increase the probability of identification, is again a difficult task.

All the plots were successfully replicated.

This plot shows the upper bound formulated in the paper.

I performed a binary search to find the optimal value of ε per value of risk disclosure (ρ). This way, I went from the previous image to a tight upper bound curve:
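
The search itself is generic: given any monotonically increasing mapping from ε to disclosure risk (a hypothetical `risk` function below stands in for the paper’s bound, which is not reproduced here), a binary search recovers the largest ε that keeps the risk at or below ρ:

```python
def eps_for_rho(risk, rho_target, lo=1e-6, hi=100.0, tol=1e-9):
    """Largest eps with risk(eps) <= rho_target, for monotone increasing risk."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if risk(mid) <= rho_target:
            lo = mid   # still safe: push eps up (less noise)
        else:
            hi = mid   # too risky: pull eps down (more noise)
    return lo
```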

However, for some functions like the median, there is no need to go beyond the formulation because the upper bound is an equality, and therefore, you get precisely the amount of noise you need.

We can see here how the noise range widens the less disclosure risk is accepted:

Also, the corresponding values of ε, for the DP enthusiasts:

Finally, we can see that, varying the dataset size, we need less noise the larger the dataset is:

I hope you enjoy the notebook!

For more details, you may go to point 3 of the paper. Here is a summary of the context:

A universe **U** of data points belongs to individuals **i**; an individual's identity is **I(i)**, and the identities that belong to a dataset **D** are in the set **I_D**. A multiset of these data points forms **D**, which we as privacy professionals want to protect while letting data scientists make queries to this inaccessible dataset. An A of the type *informed adversary* is assumed to know all individuals in **U** and every single individual included in **D** except for one, whom we want to protect, and the A wants to single out certain information hidden in the inaccessible dataset **D**. This individual could be anyone, and thus the guarantee applies to anyone who is or is not in the dataset. Therefore, A knows **D'**, which is one data point (individual) less than **D** (**|D|** = **|D'|** + 1). Aside from **U**, **|D'|**, and the identities of the people in **D'**, the A also knows the privacy mechanism **M_f**, i.e., that we are using the Laplace distribution to generate our random noise. If the A finds out **D**, then, because it knows **D'**, it is trivial to know which data point belongs to the victim.

Having this prior knowledge, the A will construct, with **U** and **D'**, all the possible **D**'s that may exist. Toy example: if **U** is of size 10 and **D** is of size 9, i.e., **|D'|** = 8, then there are (10-8 choose 9-8) = 2 possible worlds (ω ∈ Ψ) the A will test. Process of the A:

(i) A queries our dataset D and obtains a DP response M_f(D) = R = f(D) + Noise, where f(.) is an analytics query such as the mean.

(ii) Consequently, A tests these worlds using conditional probabilities, Pr[ω_i = D | M(D) = R].

(iii) A then calculates the posterior probability of each world assuming a uniform prior, i.e., before seeing R, any world had the same probability of being the correct one.

(iv) The world ω with the highest posterior is the most likely candidate to be the true D, and if it is indeed D, then the privacy of the individuals in the dataset has been violated.
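
The four steps can be simulated end to end with a toy universe (all numbers are illustrative, not from the paper) to see the adversary's posterior over the possible worlds:

```python
import numpy as np

rng = np.random.default_rng(42)

U = np.array([21.0, 35.0, 44.0, 52.0, 63.0])  # toy universe of ages
D_prime = U[:3]                  # the adversary knows these three records
candidates = U[3:]               # the missing individual is one of these two
true_D = np.append(D_prime, candidates[0])  # ground truth held by the curator

# (i) DP response to a mean query (crude sensitivity bound, for illustration)
eps = 0.5
sensitivity = (U.max() - U.min()) / len(true_D)
R = true_D.mean() + rng.laplace(scale=sensitivity / eps)

# (ii)-(iv) likelihood of R under each possible world, with a uniform prior
def laplace_pdf(x, mu, scale):
    return np.exp(-abs(x - mu) / scale) / (2 * scale)

likelihoods = np.array([laplace_pdf(R, np.append(D_prime, c).mean(),
                                    sensitivity / eps)
                        for c in candidates])
posterior = likelihoods / likelihoods.sum()    # uniform prior cancels out
guess = candidates[int(np.argmax(posterior))]  # most likely world
```

With only two possible worlds, a posterior close to 0.5 means the response leaks little; differential identifiability asks that no world's posterior exceed ρ.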

**Math**

Before continuing, I must introduce some definitions the author explains in the preliminaries of the paper.

Sensitivity is one of the parameters that tunes the noise in DP. It depends on the query type f(.), e.g., the mean, median, or even a machine learning algorithm, and on the universe of data points. The higher the sensitivity, i.e., the higher the difference an individual's data point makes in the query result, the higher the DP mechanism's noise. There are some queries whose sensitivity is lower than others, e.g., the mean's sensitivity is lower than that of the sum, given the same universe. x and x' in this definition act as D and D'. (The author forgot to add the extra bars to represent the norm, or he was just referring to a one-dimensional space.) The author of the paper uses this notion and building block of DP to define the "Contribution of an individual" and the "Sensitive range S(f)":

This is just like sensitivity, but it only considers the worlds where an individual is included Ψ_{i} and the worlds where such individual is not included Ψ_{i}^C. For example, an individual is in 3 worlds and is not in 2 worlds. This definition calculates f(ω) for each possible world and finds all the differences between them. In this toy example, it would yield 6 values (3x2 combinations). The maximum of these values is the contribution of an individual.
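
With a toy universe this definition is easy to compute directly (illustrative numbers; here `f` is a mean query and each world is a 4-element subset of a 5-element universe, so each individual appears in four worlds and is absent from one):

```python
import itertools
import numpy as np

U = [21.0, 35.0, 44.0, 52.0, 63.0]   # toy universe
f = np.mean                          # the query under analysis
worlds = list(itertools.combinations(U, 4))

def contribution(x):
    """C_f(x): max |f(w) - f(w')| over worlds containing x vs. worlds without x."""
    with_x = [f(w) for w in worlds if x in w]
    without_x = [f(w) for w in worlds if x not in w]
    return max(abs(a - b) for a in with_x for b in without_x)

# Sensitive range S(f): the worst case over all individuals.
S_f = max(contribution(x) for x in U)
```

In this toy universe the two extreme individuals (21 and 63) tie for the largest contribution, and S(f) is their shared value.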

However, we need to position ourselves in the worst-case scenario. What could happen if we choose an individual whose sensitivity (Contribution of such individual) is not the largest?

Well, then the noise would be smaller than it should be.

What would happen then?

The individual with the highest contribution would be unprotected if she were the one being attacked in D. This is because the result R would be too different from the other possible results R when this individual is not in the dataset. More specifically, because the standard deviation of the noise is proportional to the sensitivity, which we said was smaller than needed because we chose a smaller sensitivity, the noise range will not cover the largest contributor's value. So if the attacker queries and obtains R and then does experiments with all the possible worlds, the attacker will have an easier time singling out which world is the correct one. The probability of the individual being in D would be higher than in the rest of possible D's (worlds), i.e., the posterior is higher. To have a picture in mind, check Jaewoo's previous paper, Figure 1. To ameliorate this, we choose the individual with the largest contribution, which the author does with definition 3.

This is why S(f) is the maximum value of the set of possible C_f(i).

This means that in order to comply with DI with a parameter of ρ, i.e., the upper bound of the risk to identify someone in the dataset, the following must be fulfilled: The probability that A asserts that the identity of an individual **I(i)** is in the set of identities **I_D** of the dataset under attack **D**, given the knowledge of A about **D'** and with the output of the query **R**, is less than or equal to ρ.

Now, I am going to do my best to explain the steps of the proof:

(2) Risk of disclosure = Probability that the identity being attacked I(i) belongs to the set of identities in D, given that the mechanism M executed on D yields R and that the neighboring dataset is D'.

(2) to (3)

That an identity belongs to the set of identities in the dataset D (Possible world) is the same as constructing the dataset D with the known D' and the data point that belongs to an individual that is not in D'. This individual i is under attack, but it could be anyone that is not in D' (Like you :D).

(3) to (4)

Applies Bayes Rule: Posterior = (Prior * Revised beliefs) / Normalization

Prior: The author assumes that each D is equally likely - uninformative prior, i.e., without any information, the attacker's best guess is that the probability that {i} belongs to D is equally likely among possible D's (worlds).

Revised Beliefs: Probability that in a reality where D contains {i} (A possible world) we would obtain R

Normalization: M_f(ω) is a random variable that maps a sample space of possible events (possible worlds) to a one-dimensional real space. The probability that a possible world maps to some R is 1; the probability that one world maps to a specific R is less than one; and the probability that any of the worlds maps to a specific R is also less than 1 (and more than the previous one). Not all R's are equally likely, as the output is a random variable that follows a Laplace distribution.

The normalization in Bayes Rule is there because we focus on the case where we obtain a particular R, which is a subset (one value) of the rest of possible R's. Therefore, to study these events with the conditions of the numerator, these events' probabilities within the studied sample space must be normalized to add up to 1 (The normalization axiom of probability theory).

The condition on D' in **Pr[D=D' U {i} | M_f(D) = R, D']** disappears because knowing D' does not affect in any way which {i} will be included in D, so it is uninformative. This may not be true in practice, e.g., if you know the friends and family of someone in D', it is more likely that some of them are in D as well, but this is not considered.

(4) to (5)

In the numerator, the author compresses **Pr[M_f(D) = R | D=D' U {i}]** into **Pr[M_f(D' U {i}) = R]**. He can do this because the condition fixes D to be D' U {i}, so substituting it into the mechanism adds no new information. The denominator follows from the total probability theorem, and the author goes from **Pr[ω]xPr[M_f(D) = R | ω]** directly to **Pr[ω]xPr[M_f(ω) = R]**, just as he did in the numerator.

(5) to (6)

Here the author uses the Laplace distribution as the substitution for **Pr[M_f(D' U {i}) = R]** (and likewise for **Pr[M_f(ω) = R]**), which means *the probability that applying M_f to ω outputs R*. But how do you get from there to the Laplace probability density function (pdf)?

(i) Because **M_f(.) = f(.) + Noise = R**, you can also write **Noise = R - f(.)**.

(ii) Given the true result f(ω), i.e., we assume this is the correct world, the probability of obtaining R is the same as the probability of obtaining the Noise. So **Pr[M_f(ω) = R] = Pr[Noise]**

(iii) **Pr[Noise] = 1/(2λ)exp(-|Noise - μ|/λ) = 1/(2λ)exp(-|R - f(ω) - μ|/λ)**. We sample from a centered noise distribution, so μ=0. The coefficient 1/(2λ) cancels out because every single component of the numerator and denominator is multiplied by it.
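A quick sanity check of that cancellation (my own toy numbers, not from the paper): the posterior is the same whether we use the full Laplace pdf or only its exponential part.

```python
import numpy as np

# The 1/(2*lam) coefficient multiplies every term of both the numerator and
# the denominator, so it cancels in the ratio.
lam, R = 1.0, 2.5
outputs = np.array([2.0, 8/3, 3.0])  # f(omega) for three toy worlds
full_pdf = np.exp(-np.abs(R - outputs) / lam) / (2 * lam)
expo_only = np.exp(-np.abs(R - outputs) / lam)
ratio_full = full_pdf[0] / full_pdf.sum()
ratio_expo = expo_only[0] / expo_only.sum()
```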

What happened to **Pr[D = D' U {i}]** and to **Pr[ω]**?

They cancel out.

Why?

Because the author assumes that the prior is uniform, i.e., the probability **Pr[ω_i]** of any world, like **D** in the numerator, is the same as the probability **Pr[ω_j]** of any other.

Furthermore, the author takes the term equal to the numerator out of the sum in the denominator. This is convenient, and visually it conveys that we are trying to find the posterior of the world ω that is actually D, the real world. This is the only posterior of interest for the proof, as we know only one world is true, and it is the one we must ensure is protected. This, of course, comes with no loss of generality; it could be any D.

(6) to (7)

The author unfolds the sum. **m** is the number of possible worlds ω. Because we already took the ω corresponding to D out of the sum, the unfolding only goes up to m-1 terms.

(7) to (8)

As the picture shows, the author divides both the numerator and the denominator by the numerator. The 1's are trivial, but what about the other expressions in the denominator? Here are the intermediate steps; I take one value in the denominator to show how it computes:

Oops, that is larger than expected...

What is next?

(8) to (9)

We know that S(f), the sensitive range of f(.), corresponds to the individual that yields the largest difference in outputs, so it is at least as large as any other possible contribution. That is why we can form this inequality by substituting |f(D) - f(ω)| with S(f) in the exponent of every component of the sum: the exponent becomes more negative, each term of the sum in the denominator becomes smaller, and the fraction therefore becomes larger. It is a convenient substitution because now we only have to deal with S(f), which can be calculated theoretically. This provides an upper bound, which might not be tight.
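Here is a sketch of that inequality (my own toy example, with the mean as the query): substituting S(f) for every |f(D) - f(ω)| shrinks each term of the sum, so the resulting expression upper-bounds the exact disclosure risk.

```python
import numpy as np

lam = 1.0
worlds = [[1, 2, 3], [1, 3, 4], [1, 3, 5], [1, 3, 6]]
outs = np.array([np.mean(w) for w in worlds])
R = outs[0]  # worst case: the response equals f(D) for the attacked world
# Exact disclosure risk for the first world
exact = np.exp(-np.abs(R - outs[0]) / lam) / np.exp(-np.abs(R - outs) / lam).sum()
# Upper bound using the largest difference in outputs
S = np.max(np.abs(outs[0] - outs))
bound = 1 / (1 + (len(worlds) - 1) * np.exp(-S / lam))
```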

The author trivially shows the lower bound: because (m-1)*exp(-S(f)/λ) <= (m-1), the lower bound of the expression is 1/m. In his words, "This implies that it is impossible to protect the privacy of individuals in the database with the probability less than an adversary's probability of a correct random guess." This is because m is the number of possible worlds, so you naturally cannot protect more than randomly picking a dataset with equal probability.

(9) to (10)

In (10), the author rearranges the algebra so that we can easily determine which value of λ fulfills the upper bound when it is set to ρ. Expression (9) is the **disclosure risk**, and it has to be smaller than ρ (shown in (10)).

(10) to (12)

In essence, arithmetic to get closer to the expression for λ.

(12) to (15)

Arithmetic to find the lower bound of λ.

The comment between (14) and (15) is there because if ρ <= 1/m, then the natural log in the expression (ln[(m-1)ρ/(1-ρ)]) is zero or negative, which cannot be, given that S(f)/λ is always larger than 0. This also makes sense because 1/m is the lower bound defined earlier: 1/m is the attacker's random guess, so, as we said before, you cannot get lower than that.
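A small numeric check of that comment (my own numbers): at ρ = 1/m the argument of the log is exactly 1, so the log is zero and λ = S(f)/ln(...) is undefined; ρ must exceed the random-guess baseline 1/m.

```python
import numpy as np

m, S = 8, 1.0
# At rho = 1/m the argument of the log collapses to exactly 1
log_argument = (m - 1) * (1 / m) / (1 - 1 / m)
# With rho = 0.3 > 1/m = 0.125 the formula yields a valid, positive lambda
lam_valid = S / np.log((m - 1) * 0.3 / (1 - 0.3))
```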

On another note, and referring to what I mentioned before: the possible worlds' priors might differ (if someone is in D', her friends or family are more likely to be in D). If a prior is higher than ρ, then privacy is inherently violated, i.e., if **Pr[D = D′ ∪ {i}] > ρ**, the attacker already guesses correctly with a probability higher than the upper bound we would like to impose.

The author then says that for less severe cases, i.e., when the priors differ between possible worlds but all stay below ρ, one can substitute **m = 1 / max Pr[D = D′ ∪ {i}]** in Ψ. What I understand from this: if, for example, you have 9 worlds that are equally likely with probability 8.88...% each and one world with probability 20% (10 worlds in total), this is equivalent to saying you have 5 equally likely options (1/0.2) to pick from, each matching the worst-case probability among the original 10 options. This is an elegant way to keep the proof simple even for non-uniform priors. Beautiful :) And it avoids an attack by the reviewer :D
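The example above can be sketched in a couple of lines (the prior values are the ones from my example, not from the paper):

```python
# Ten worlds: one at 20% and nine equally likely at 8.88...%.
# Substituting m = 1/max(prior) treats this as five equally likely options.
priors = [0.2] + [0.8 / 9] * 9
m_effective = 1 / max(priors)
```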

Finally, he makes a connection to DP, because all along ε was hidden in **λ = S(f)/ε**. Surprise!

In the beginning, we mentioned that DP adds random noise to the deterministic result of a query and that this noise's standard deviation is proportional to the sensitivity, and there it is. The tails of the Laplacian become longer the larger S(f) is. Furthermore, we can influence the tails by manually selecting ε: the smaller it is, the longer the tails and the more noise is added, as more possibilities become relatively more likely. And how do we choose it? Well, this is what the paper is about: we define the risk ρ, and from there we get ε.
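That pipeline from ρ to ε can be sketched directly (my own helper name, `eps_from_rho`): a stricter risk target gives a smaller ε, hence a larger λ and more noise.

```python
import numpy as np

def eps_from_rho(m, rho):
    # eps = S(f)/lam = ln[(m-1)*rho/(1-rho)]
    return np.log((m - 1) * rho / (1 - rho))

m = 8
eps_strict = eps_from_rho(m, 0.2)  # lower acceptable risk
eps_loose = eps_from_rho(m, 0.4)   # higher acceptable risk
```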

Let me show this, and then we get on with the code!

These are the lower and upper bounds of the disclosure risk.
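For concreteness, here are the two bounds side by side in code (my own toy numbers): the risk sits between the random guess 1/m and 1/(1 + (m-1)·exp(-S(f)/λ)), and with maximal noise (λ → ∞) the upper bound collapses onto 1/m.

```python
import numpy as np

m, S = 8, 4 / 3
lower = 1 / m
upper = 1 / (1 + (m - 1) * np.exp(-S / 1.0))      # lam = 1
upper_noisy = 1 / (1 + (m - 1) * np.exp(-S / 1e9))  # lam huge -> random guess
```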

**Datasets**

Toy example from section 4 of the paper:

```
w_1 = [1, 2, 3]
w_2 = [1, 3, 4]
w_3 = [1, 3, 5]
w_4 = [1, 3, 6]
w_5 = [1, 3, 7]
w_6 = [1, 3, 8]
w_7 = [1, 3, 9]
w_8 = [1, 3, 10]
worlds = [w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8]
```

Adult dataset for section 5 of the paper. In the paper, the author only uses the numerical attributes: age, education-num, capital-gain, capital-loss and hours-per-week.

Therefore, we will filter only these attributes.

```
df = pd.read_csv("adult.csv")
census_df = df.filter(['age', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week'], axis=1)
census_df.head(5)
```

**Helper functions**

In a previous blog post I wrote functions to empirically calculate the sensitivities of queries from the datasets themselves; this time, however, I will use the theoretical formulations.

```
def mean_sensitive_range(deterministic_query_outputs):
    return np.abs(np.max(deterministic_query_outputs) - np.min(deterministic_query_outputs)) / len(deterministic_query_outputs)

def calculate_λ(sensitive_range, m, ρ):
    return sensitive_range / (np.log((m-1)*ρ/(1-ρ)))

def calculate_ε_i(sensitive_range, m, ρ):
    """
    Description:
        Calculates ε based on the calculation of λ from the sensitive range, m and ρ
    """
    return sensitive_range / calculate_λ(sensitive_range, m, ρ)

def calculate_ε_ii(m, ρ):
    """
    Description:
        Calculates ε based only on m and ρ
    """
    return np.log(((m-1)*ρ)/(1-ρ))

def laplace_noise(Response, deterministic_query_outputs, λ):
    return np.exp(-np.abs(Response - deterministic_query_outputs)/λ)

def noise_ratio(Noise, U_range):
    return Noise/U_range

def ε_binary_search(ρ_target, w, worlds, R, sensitive_range, query):
    """
    INPUT
        ρ_target - the disclosure risk we are willing to accept
        w - the world that provides the worst-case scenario
        worlds - the different worlds that could be the real dataset
        R - the random response of the query; it should also be the worst-case scenario
        sensitive_range - the range of values that the deterministic output of the query can take over the possible worlds
        query - the query we are dealing with; it could be mean, median, ...
    OUTPUT
        ε - the closest ε we can get to comply with ρ_target, with an error of 0.01%
    Description
        Executes a binary search to find ε based on ρ_target, the disclosure risk target;
        this way, we can find the best utility for reasonable privacy. To simplify things,
        we perform the binary search over λ instead, and then calculate ε from λ.
    """
    m = len(worlds)
    λ_high = calculate_λ(sensitive_range, m, ρ_target)
    λ_low = 0
    λ = (λ_high + λ_low)/2
    ρ = calculate_disclosure_risk(w, worlds, R, λ, query)
    while np.abs(ρ_target - ρ) > 0.0001:
        if ρ_target < ρ:
            λ_low = λ
        if ρ_target > ρ:
            λ_high = λ
        λ = (λ_high + λ_low)/2
        ρ = calculate_disclosure_risk(w, worlds, R, λ, query)
    ε = sensitive_range/λ
    return ε
```

**Main function**

```
def calculate_disclosure_risk(attacked_world, worlds, R, λ, query):
    numerator = laplace_noise(R, query(attacked_world), λ)
    denominator = 0
    for world in worlds:
        denominator += laplace_noise(R, query(world), λ)
    return numerator / denominator
```
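Here is a hypothetical usage example on the toy worlds from above, with the two functions repeated (using `lam` in place of λ) so the snippet runs standalone:

```python
import numpy as np

def laplace_noise(Response, deterministic_query_outputs, lam):
    return np.exp(-np.abs(Response - deterministic_query_outputs) / lam)

def calculate_disclosure_risk(attacked_world, worlds, R, lam, query):
    numerator = laplace_noise(R, query(attacked_world), lam)
    denominator = sum(laplace_noise(R, query(world), lam) for world in worlds)
    return numerator / denominator

worlds = [[1, 2, 3], [1, 3, 4], [1, 3, 5], [1, 3, 6],
          [1, 3, 7], [1, 3, 8], [1, 3, 9], [1, 3, 10]]
# Worst case for the attacked world: the response equals its true mean
risk = calculate_disclosure_risk(worlds[0], worlds, R=np.mean(worlds[0]),
                                 lam=1.0, query=np.mean)
```

The resulting risk is always at least the random guess 1/m, as the proof showed.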

After the definition of these functions, the notebook goes on to replicate the various calculations that have been performed in the paper. I encourage you to follow up there and delve deeper into the various things one can do with just the few functions defined above.

**THE END**

Github: @mireshghallah | Twitter: @limufar

**Where are you based?**

"San Diego, CA, USA"

**What do you do?**

"I’m a 3rd year CS Phd student at UC San Diego"

**What are your specialties?**

"Research, PyTorch Development, Event Organization"

**How and when did you originally come across OpenMined?**

"I think it was December of 2019, or January 2020. I came across it when I was watching a lecture by Andrew! I really liked the initiative right away! Then a while after, I saw Andrew tweet an application form for research scientists. I filled one right away and was luckily selected ^_^"

**What was the first thing you started working on within OpenMined?**

"When we launched the research team, each research lead had proposed a project to work on. The project I proposed was on the intersection of differential privacy and fairness. I worked with an amazing research engineer, Tom Farrand, and Sahib Singh and Andrew, and we got a publication out of it! I was also working on another project with one of OM’s great researchers, Georgios Kaissis and with the help of the great Teddy Koker and Tom Titcombe. We have also submitted a paper on that."

**And what are you working on now?**

"Right now, I have the pleasure of working with another one of OM’s great research engineers, Abinav Ravi, where we are working on differentially private language models."

**What would you say to someone who wants to start contributing?**

"OpenMined is an amazing welcoming community, and joining it was one of the highlights of 2020 for me! I made lots of new friends and had lots of fun in the meetings. I urge people to start helping however they can, and whenever they are feeling lost or confused, just reach out to more senior members and ask for help!"

**Please recommend one interesting book, podcast or resource to the OpenMined community.**

"I love the book “Freakonomics”; it offers a fresh look into problems that might seem mundane or simple from the outside."

**Confidential computing explained:**

Part 1: Introduction

**I — Introduction**

While many people call data the “21st century oil”, we are still a long way from exploiting data to its full potential and making society advance as a whole.

As most forecasters expected, we have witnessed an explosion of data collected by public and private organizations, from IoT to social behavior on the internet, and all this data is improving our lives in various domains such as:

- Health: IoT paves the way for better health through much more precise and affordable monitoring, while allowing accurate prediction of potential diseases.
- Finance: the Payment Service Directive makes it possible for newcomers to leverage people’s banking information to provide innovative services and lower costs in this sector.
- Personalized services, such as tailor-made experiences for sport, shopping, leisure, etc.

However, just by looking at these examples, we see that all these innovations rely on accessing people’s private and highly confidential data. This is a real cause for concern today, as we see with the various privacy breaches: medical data leaks, spamming campaigns based on leaked information such as email addresses, or even impersonation for fraud.

These frightening examples show that while we are on the verge of a great advancement with data technologies, it also opens new risks that could have profound impacts on our lives. So where are we on this?

**II — Data**

To go further with the oil analogy, we might say that the oil drilling capacities are here: we know today how to collect data, as proven by the explosion of data available. We also know how to build secure pipelines that enable confidential data to be moved and stored in “data terminals”. However, the tricky part is refining the data. We know that oil is a precious resource, but it is often raw and needs to be processed to truly extract its value. The same can be said for data, which we often need to clean, cross-reference, analyze and so on to fully extract meaningful insight.

However, this refining process often involves other parties, for several reasons: the data-collecting party does not necessarily have the expertise to extract value from the data, nor the time, the resources, or even complementary data that could be combined with the original data.

For all these reasons, the data value extraction process is often externalized to third parties, but this is where the risks begin. Even if we are able to securely store the data in our facilities and send it with the best security standards, once the third party gets access to the data, we will never know what it truly does with it, and we have no physical means to prevent several things that could happen:

- It can sell your data on the darknet
- It can use it for other purposes, like training some model on it, and never tell you a word
- It can get compromised and have your data leaked

Thus, while we saw the opportunities that this new oil can create, there are still many issues to solve before tackling the use cases we mentioned at the beginning.

**III — Confidential Computing**

Now that we have seen the opportunities of data exploitation and the issues that come with it, what can we do? Well, there are several techniques that could allow several parties to collaborate on confidential data without having to reveal their data in the clear. My previous article on Homomorphic Encryption briefly presented the three main solutions currently available:

- Homomorphic encryption
- Secure multi-party computation
- Confidential computing

Throughout this series we will focus on Confidential Computing (CC), which is based on the use of Intel SGX. To better understand how it works, let us use a metaphor involving a ring.

Imagine you have a treasured asset, let us say your granddad’s gold ring, which even bears your family’s personal inscriptions. You really love it, but you realize it needs to be slightly expanded to fit your finger. There is a jeweller in town who could do it for you, as he is an expert.

As your ring is very precious, you want to send it to the jeweller safely, so you put it inside a locked box with a key that only you and the jeweller have. You send it to the jeweller, who will be able to open the box and work on the ring. However, you now need to trust the jeweller, as he will necessarily access your precious ring to work on it. So if you really need the jeweller to work on your ring, you can only trust him to do what he says he will, and hope he does nothing else with it.

The ring here represents your data, the box an encryption mechanism, and the jeweller a third-party service provider.

Today, most of the time, when you want to benefit from someone’s service, you have to send them your data, and even though the communication is secure, at the end of the day the service provider will have your data in the clear and can do anything it wants with it. While there are contractual clauses around this, you never know what they will actually do, and they could even get hacked.

Now, how does Confidential Computing help solve this issue? Let’s come back to the ring and imagine that there is a new actor in town: a magician specialized in ring handling, with a magic hat.

How is this process different? Well, you will still have to send him your ring locked in a box, but this time with a key that only **you have**. The magician will receive this box but will have no way of knowing what is inside. Instead of opening it himself and working on the ring, he simply puts a magic hat over the box. And then the magic happens! If you have agreed to let the magic hat be used on your box, it will open the box inside the hat, expand your ring, put the expanded ring back into the box and lock it.

All this happens without the magician being able to peek into the hat to see what your ring looks like. He is also not able to lift the hat during the process, because magic seals it. It only becomes liftable once the hat has finished its work, and by then the ring will be back in the box, so it will be safe. At that moment, the magician simply lifts the hat and sends you the box containing the expanded ring.

At no time will the magician be able to see what your ring looks like, so your secret is kept safe, while at the same time the magician has been able to provide you a great service! If the magician sells the ring to your cousin or gets robbed, the malicious third party will only obtain a locked box, impossible to open.

In the end, Confidential Computing with Intel SGX works the same way. Intel processors with SGX allow you to create what we call enclaves, which are the magic hats. Their special property is that the service provider analyzing the data on your behalf cannot know what happens inside the enclave, as the memory is encrypted and no one can look at what is happening inside. Using a secure channel between you and the enclave, you can send your sensitive data to the enclave encrypted with a key that only you and the enclave know, and the service provider will never be able to know what data it fed to the enclave, as it is encrypted with a key it does not have access to.

There is one important point we have not covered: how does the enclave / magic hat know what to do with the input it is given? Simply because both are code: if we know what the input looks like in advance, we just have to write an algorithm / magic spell that will take care of it, and the enclave / magic hat will follow the algorithm and know what to do with the data / the locked box’s content, without requiring external help.

We will come back to this, but it means that for this process to work, it is really important to know as much as possible in advance what the data we will handle looks like. There are ways to have a more interactive process, but it is harder to preserve the same level of confidentiality.

**Conclusion**

In the end, enclaves are really promising, as they allow service providers to handle confidential data without having to see it in the clear. Even more interesting, if the enclave were managed in the cloud, the cloud provider would not see the handled data either. This makes enclave technology a key asset for addressing the ever-growing challenge of deploying confidential applications in the cloud.

We can imagine that enclaves could be applied in the scenarios we mentioned at the beginning, like analyzing your personal health data, while never having to let anyone access your data in clear.

We hope you enjoyed reading this article. In the next one, we will see an important principle: remote attestation, the glue that links everything together. Then we will get our hands dirty and start coding applications using enclaves, so stay tuned!


The PySyft framework enables practitioners and stakeholders in the AI domain to leverage the potential of Federated Learning. This method is part of privacy preserving machine learning and allows data scientists to work with remote data, without revealing it. This approach is especially interesting in the context of high demand for big data to train AI models on the one side and data privacy regulations on the other side. Some examples of such regulations are the General Data Protection Regulation in the European Union [1], the California Consumer Privacy Act in the USA [2] or the Personal Data Protection (Amendment) Act in Singapore [3].

Federated Learning systems can either be tested with virtual nodes on the same machine or with physically separated nodes. Running experiments with the data scientist and data owner on separate devices is important to account for possible hardware constraints. A Raspberry Pi can be a good choice for simulating the data owner’s device. It is a Single-Board Computer (SBC) that can handle data acquisition and control, data processing and storage, connectivity and power management. Its CPU architecture is ARM-based, and some Python packages are not instantly available via pip install <package_name>, for example. This is especially the case for older PySyft versions like v0.2.9 and v0.3.0. To reduce complexity, we therefore focus on the latest version of syft at the time of writing this post. At the end you'll find a Dockerfile capturing all the mentioned steps. The image is also hosted on Docker Hub with the image name rene36/pysyft050rc1.

The numbers you will see in brackets give the execution time of a specific command. They should give an order of magnitude of the required installation time for syft. Our setup for installing PySyft 0.5.0rc1 is as follows:

- Raspberry Pi 4B with 8 GB of memory
- 32 GB SSD card (erased)
- Image (Ubuntu Server 20.04 LTS 64bit) flashed with Raspberry Pi Imager v1.3.
- Using a fresh Ubuntu 20.04 LTS 64bit installation.
- Connecting to the Raspberry Pi via SSH, which is connected with a LAN.

Some characteristics of the system are:

Command | Output |
---|---|
lsb_release -a | Distributor ID: Ubuntu; Description: Ubuntu 20.10; Release: 20.10; Codename: groovy |
uname -a | Linux raspi28 5.8.0-1015-raspi #18-Ubuntu SMP PREEMPT Fri Feb 5 06:09:58 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux |
gcc --version | gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0 |
python --version | Python 3.8.5 |

Based on the GitHub repository of PySyft, the latest supported torch version is 1.8. Note that numpy is already pre-installed on the system. If this is not the case, the installation can take much longer.

```
pip3 install torch==1.8.0
# Successfully installed torch-1.8.0 typing-extensions-3.7.4.3
```

The requirements.torch.txt states a torchvision version of >=0.5,<=0.9. Version 0.9.1 is readily available via pip3 install torchvision==0.9.1; however, syft complains about it down the road. Feel free to adjust the version in requirements.torch.txt to check whether any issues come up with a slightly more up-to-date torchvision version. Otherwise, build the officially required version from source.

```
cd ~ && git clone https://github.com/pytorch/vision.git --branch=v0.9.0 && cd vision # (1 min)
sudo -E python3 setup.py install # (10 min)
# Finished processing dependencies for torchvision==0.9.0a0+01dfa8e
```

After installing these two major requirements (torch and torchvision) for syft, we can take the last step, which is split into two parts: first we build the aiortc package from source, and then we install syft with pip3. Starting with the latter throws an error with respect to aiortc.

The aiortc package has some dependencies. Let’s install them and check that we fulfill its requirements.

```
sudo apt install -y libavdevice-dev libavfilter-dev libopus-dev libvpx-dev pkg-config # (2 min)
openssl version # should be >= v1.0.2, I have 1.1.1f
ffmpeg -version
sudo apt install -y ffmpeg # installed version 4.3.1-4ubuntu1 for me.
sudo apt install -y libvpx-dev libopus-dev libffi-dev
# Installed libvpx-dev:v1.8.2-1build1, libopus-dev:v1.3.1-0.1, libffi-dev: 3.4~20200819gead65ca871-0ubuntu3.
```

Finally, let’s get the source code of aiortc from GitHub and build it from source.

```
cd ~ && git clone https://github.com/aiortc/aiortc.git && cd aiortc
sudo -E python3 setup.py install # (5 min)
# This installed aiortc==1.2.0
```

Now we are ready to install syft version 0.5.0rc1 with

```
pip3 install syft==0.5.0rc1 # (6 min)
# Successfully installed PyNaCl-1.4.0 Werkzeug-1.0.1 dataclasses-0.6 dpcontracts-0.6.0 flask-1.1.2
# forbiddenfruit-0.1.4 itsdangerous-1.1.0 loguru-0.5.3 nest-asyncio-1.5.1 packaging-20.9 protobuf-3.15.8
# pyparsing-2.4.7 sqlitedict-1.7.0 syft-0.5.0rc1 syft-proto-0.5.3 typeguard-2.12.0 websocket-client-0.58.0
```

Let’s check that the installation worked as expected. Open Python in a console, import the three major packages and check their exact versions.

```
python3
import syft
import torch
import torchvision
# Check package versions
print(syft.__version__)  # two underscores
print(torch.__version__)
print(torchvision.__version__)
```

The Docker image is available on Docker Hub (rene36/pysyft050rc1), or you can adjust and build it yourself. Feel free to use the Dockerfile below as a starting point.

```
FROM ubuntu:20.04

RUN apt-get update && \
    apt-get upgrade --yes

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get install --yes software-properties-common python3 python3-pip && \
    apt-get install --yes git && \
    apt-get install --yes libavdevice-dev libavfilter-dev libopus-dev libvpx-dev pkg-config ffmpeg && \
    apt-get install --yes libvpx-dev libopus-dev libffi-dev

# Install torch
RUN pip3 install torch==1.8.0

# Build torchvision from source
RUN git clone https://github.com/pytorch/vision.git --branch=v0.9.0
WORKDIR /vision
RUN python3 setup.py install

# Build aiortc from source
WORKDIR /
RUN git clone https://github.com/aiortc/aiortc.git
WORKDIR /aiortc
RUN python3 setup.py install

# Install syft
RUN pip3 install syft==0.5.0rc1

CMD ["/bin/sh"]
```

[1] European Union, REGULATION (EU) 2016/679 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), 2016

[2] State of California Department of Justice , California Consumer Privacy Act of 2018 [1798.100 – 1798.199.100], 2018

[3] Personal Data Protection Commission Singapore, Personal Data Protection Act, 2014
