Speaker: Dr. Georgios Kaissis, MHBA
Video Link: https://www.youtube.com/watch?v=F46lX5VIoas&t=21m50s
AI in medical imaging has approached clinical applicability and has helped improve diagnosis and early detection of disease. This development can help us counter the lack of radiologists in disadvantaged areas. However, the current methodology of central data collection and model training is a key problem. Even with proper patient consent, the data collection process has serious problems around accumulation, storage, and transmission. This impinges on the patient's right to be informed about how their data is used and stored; in disadvantaged communities, patients may never even be told about these rights. This is where privacy-preserving machine learning comes into the picture: it bridges the gap between deriving insights from data and protecting the data of individuals. With federated learning, the data remains on-premises during training, which allows the hospital to retain control over it and enforce its governance. Encrypted computation services, which are an integral part, allow us to protect both the data and the algorithm and to provide an end-to-end service. This also allows for single-use accountability, the notion that the data is used only for a singular purpose.
Privacy-preserving machine learning [1] has remained in the proof of concept stage for some time now. A new tool called PriMIA has been introduced as part of the OpenMined libraries. PriMIA stands for Private Medical Image Analysis. PriMIA is a library for doing federated learning and encrypted inference on medical imaging data.
The main goals of the library are to provide securely aggregated federated learning in a multi-institutional setting and to serve trained algorithms in an encrypted inference scenario. Some features that help achieve these goals are:
The library's main design goal is to let a moderately tech-savvy user accomplish their task in no more than three steps, which vary depending on the user's role.
Support is offered for DICOM and other important imaging formats; even mixtures, such as DICOM and JPEG together, are supported.
There are pre-existing scripts for hyperparameter configuration available over the entire federation.
As for the architecture: a hub-and-spoke configuration is used, which previous work in the medical imaging community prefers over serial processing topologies. This architecture also supports synchronous training and can handle, for example, nodes dropping out mid-training. Secure aggregation is performed by secure multi-party computation using secret sharing. Attempts to break the secure aggregation generally failed, indicating that the aggregation is robust.
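The secret-sharing idea behind the secure aggregation can be illustrated with a toy additive scheme in Python. This is a simplified sketch of the general SMPC technique, not PriMIA's actual implementation, and the numbers are invented:

```python
import random

PRIME = 2**31 - 1  # field modulus; toy parameter

def share(value, n_parties):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Three hospitals each hold a (scaled, integerized) model update.
updates = [12, 7, 23]
n = len(updates)

# Each hospital splits its update and sends one share to every party...
all_shares = [share(u, n) for u in updates]

# ...each party sums the shares it received (each looks like random noise)...
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]

# ...and combining the partial sums reveals only the aggregate.
aggregate = sum(partial_sums) % PRIME
print(aggregate)  # 42, while no party ever saw another's individual update
```

The key property is that any individual share is uniformly random, so nothing about a single hospital's update leaks before aggregation.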
To perform encrypted inference, computational nodes such as a crypto provider and the model server have to be set up; the data owner then initiates a request and receives the result locally. Most of the work is driven from the client side, including the output of encrypted JSON files. This encryption happens under a zero-knowledge, end-to-end premise.
To show the application of the library, a case study is attached. It uses chest X-ray data for detecting pediatric pneumonia, an example of a remote-diagnosis-as-a-service model. There are 5163 training images and 2 external validation datasets.
The main goal is to train a federated remote diagnostic model. The recently introduced confidential computing nodes were used for the training, 3 in total. The whole system relies on memory encryption of the computing nodes and machine-wide trusted execution.
To assess the performance of the trained model, it was compared with a locally trained neural network as well as with two human experts. The results show that the federated model performed better than the human experts on both datasets and is comparable to the locally trained model.
Some key challenges, which are quite common across the medical community, are:
The library will soon be released in the OpenMined GitHub repository; a current version can be found at https://github.com/gkaissis/PriMIA. The website for the library is at https://g-k.ai/PriMIA/.
References:
[1] https://www.nature.com/articles/s42256-020-0186-1
We are familiar with traditional Public Key Encryption (PKE), where we have a public key (PK) and a private or secret key (SK) which we use for encrypting and decrypting a message. So, considering our traditional entities, Alice and Bob (who for some reason do a lot of cryptographic data exchange), data can be encrypted using the PK and the other entity can decrypt it using their SK.
The main drawback of PKE is that decryption is "all or nothing": if Bob has the secret key (SK), he can decrypt the entire message, and if not, he is left with a sequence of encrypted text that is of no use to him.
But, as we discussed earlier, what if you want to work with data without accessing its actual values? Let us see how Functional Encryption can be used here.
Functional Encryption (FE) allows one to get the output of a function performed on the data. Consider an example where Alice holds data a = 3, b = 5. Now, Bob wants to know the value of (a+b), but a and b are encrypted. So, Alice grants Bob the ability to compute a function on the encrypted variables a and b. In this way, Bob gets the result of his function in decrypted form without ever accessing the decrypted values of the variables.
The implementation of the above workflow is represented in Figure 1, with Alice and Bob as the two parties involved in the data exchange and the operations performed on a secure cloud server.
While the methodology of FE might look similar to that of Fully Homomorphic Encryption (FHE), there are differences. If Bob's requirement is the same (to get the value of a+b), then in the case of FHE he would be allowed to compute the addition function on the variables a and b, but the result would be in encrypted form. So even though Bob never accessed the decrypted variables, he would still need the key to decrypt the output of his function. In the case of FE, by contrast, the result would be the actual answer 8 (i.e., 3 + 5).
Both methods have their designated use cases. While FHE adds the extra burden of decrypting the final output, it allows the user to compute any function on the data. FE, on the other hand, gives the user the decrypted output but may restrict the type of function, as chosen by the owner of the data.
This control over which functions may be computed on the data, and by which user, benefits the data owner in many cases.
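The function-key mechanism can be mocked up as a toy, completely insecure Python sketch just to make the interface concrete. The class and method names below are illustrative, not from any real FE library, and XOR stands in for real encryption:

```python
import secrets

class ToyFE:
    """Insecure stand-in for an FE scheme: XOR "encryption" plus a master
    key that can mint per-function decryption keys."""
    def __init__(self):
        self.msk = secrets.randbits(32)  # master secret key (held by Alice)

    def encrypt(self, value):
        return value ^ self.msk  # placeholder for real encryption

    def keygen(self, func):
        """Issue a functional key: its holder learns func(plaintexts) only."""
        msk = self.msk
        def functional_decrypt(*ciphertexts):
            return func(*(c ^ msk for c in ciphertexts))
        return functional_decrypt

# Alice encrypts her data and issues Bob a key for addition only.
alice = ToyFE()
ct_a, ct_b = alice.encrypt(3), alice.encrypt(5)
add_key = alice.keygen(lambda a, b: a + b)

# Bob learns a + b = 8 without ever seeing a or b in the clear.
print(add_key(ct_a, ct_b))  # 8
```

In a real scheme the functional key is a cryptographic object that cannot be used to recover the master key; here it is just a closure, which is exactly why this sketch is insecure.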
The content is inspired by the paper: Boneh, Dan, Amit Sahai, and Brent Waters. "Functional encryption: a new vision for public-key cryptography." Communications of the ACM 55.11 (2012): 56-64.
Interested in knowing more about FE? This talk will give you more information about FE, along with its drawbacks and how it can be resolved using Multi-input FE.
You can also know more about the recent advancements in FHE by IBM here.
This is a great blog which explains the working of Homomorphic Encryption.
The main objective of Natural Language Processing (NLP) is to read, decipher, understand, and make sense of human language. The general pipeline is to collect data, train a model, and deploy it for applications such as text generation, text understanding, and embeddings.
Assume we have a rogue data scientist with access to the training process. They could mount gradient attacks to infer personal information, or attack the trained model or the embeddings.
External threats include querying the model to extract private information about the data contributors.
Embeddings are real-valued vector representations that allow words with similar meanings to have similar representations. They can be used for response generation, question answering, and text classification.
An embedding inversion attack is when the attacker tries to invert the embedding and recover the actual input sequence. Attribute inference is when the attacker tries to infer attributes of the input x*. A membership inference attack is when the attacker tries to infer whether x* or its context x' was used to train the embedding model.
To measure unintended memorization in text-generation models, Carlini et al. proposed the exposure metric, which answers the question of what information can be gained about an inserted canary given access to the model. To measure this metric, random canaries (strings of random tokens) are inserted into the training data. The probability the model assigns to different sequences and different possible canaries is measured, and this probability is used to rank the candidates; the exposure of each canary is based on its rank among all possible canaries.
The higher the exposure, the easier it is to extract the secret canary.
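The rank-based definition can be sketched in a few lines. This is a simplified reading of Carlini et al.'s exposure metric; the candidate set and log-probabilities below are made up for illustration:

```python
import math

def exposure(canary_log_prob, candidate_log_probs):
    """Exposure of the inserted canary, given the model's log-probabilities
    for every candidate canary: log2(|candidates|) - log2(rank)."""
    # Rank 1 means the canary is the model's most likely candidate.
    rank = 1 + sum(1 for lp in candidate_log_probs if lp > canary_log_prob)
    return math.log2(len(candidate_log_probs)) - math.log2(rank)

# Toy example: 8 possible canaries; the inserted one ranks 2nd.
candidates = [-2.0, -1.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0]
print(exposure(-2.0, candidates))  # log2(8) - log2(2) = 2.0
```

A canary that the model ranks first among all candidates attains the maximum exposure, signalling that it was memorized.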
As we have seen the vulnerabilities, let's look at what are the mitigations for the above threats to privacy in NLP.
The word embeddings are perturbed with noise sampled from an exponential distribution, calibrated such that the perturbed representations are more likely to be words with similar meanings.
In the examples shown in the image below, the high-privacy perturbations of the words encryption, hockey, and spacecraft are publik-key, futsal, and aerojet, while at the lowest privacy setting the words map to themselves.
Indistinguishability is used to make sure the perturbations stay within the context radius we want and do not map to random, unrelated words.
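The perturb-then-snap-to-nearest-word mechanism can be sketched as follows. The 2-d embeddings and the noise model here are illustrative toys; the real method uses pretrained high-dimensional embeddings and noise calibrated to the embedding metric:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-d word embeddings; real systems use pretrained high-dimensional ones.
vocab = {
    "encryption":   np.array([1.0, 0.0]),
    "cryptography": np.array([0.9, 0.1]),
    "hockey":       np.array([-1.0, 0.0]),
    "futsal":       np.array([-0.9, -0.1]),
}

def perturb(word, epsilon=5.0):
    """Add noise to the word's embedding, then snap to the nearest word.
    Smaller epsilon means more noise and therefore more privacy."""
    noisy = vocab[word] + rng.laplace(scale=1.0 / epsilon, size=2)
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - noisy))

print(perturb("encryption", epsilon=50.0))  # low noise: likely unchanged
print(perturb("encryption", epsilon=0.5))   # high noise: may become a neighbor
```

Because the output is always a real vocabulary word near the original, downstream NLP models can consume the perturbed text unchanged.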
This is a mixture of federated learning and differentially private deep learning. Each user trains a separate model on their own device with their own data. During training, the gradients of the individual models are clipped before the updates are applied. The updates are aggregated and sent to the cloud, where noise is added and the final update is generated.
In this scheme, since each user trains their own model and updates are clipped per user, user-level privacy is achieved, as opposed to prior work on DP-SGD, which targeted example-level privacy.
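The clip-aggregate-noise pipeline can be sketched as follows. This is a simplified, illustrative version of user-level DP averaging; the clip norm, noise scale, and updates are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_fed_avg(user_updates, clip_norm=1.0, noise_scale=0.1):
    """User-level DP aggregation: clip each user's update, average them,
    then add Gaussian noise calibrated to the clip norm."""
    clipped = []
    for u in user_updates:
        norm = np.linalg.norm(u)
        # Scale down any update whose L2 norm exceeds clip_norm.
        clipped.append(u * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Noise bounds any single user's influence on the aggregate.
    noise = rng.normal(scale=noise_scale * clip_norm / len(user_updates),
                       size=avg.shape)
    return avg + noise

updates = [np.array([3.0, 4.0]),   # norm 5 -> clipped down to norm 1
           np.array([0.6, 0.8]),   # norm 1 -> unchanged
           np.array([0.0, 0.5])]   # norm 0.5 -> unchanged
agg = dp_fed_avg(updates)
print(agg)  # close to the clipped average [0.4, 0.7], plus a little noise
```

Clipping before aggregation is what makes the guarantee per-user rather than per-example: no single user can move the average by more than the clip norm.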
Other mitigations mainly target sensor-data privacy, but they can be extended to NLP tasks as well due to the sequential nature of the data.
However, this area is heavily underexplored, and there is definitely a lot of room for improvement, both in the attacks that exist and in the mitigations.
Watch the full talk on YouTube:
(💐Authors: Nhat Hai Phan, Xintao Wu, Dejing Dou)
As you may have noticed, we have been hosting OpenMined's Paper Session events since August. The paper we covered in August focuses on differential privacy for the Convolutional Deep Belief Network (CDBN).
In this paper, the main idea from Phan et al. is to enforce Ɛ-differential privacy by using the functional mechanism to perturb the energy-based objective functions of traditional CDBNs, instead of their results. They employ the functional mechanism to perturb the objective function by injecting Laplace noise into its polynomial coefficients.
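The coefficient-perturbation idea can be sketched on a toy quadratic objective. This is an illustration of the functional mechanism in general, not the paper's construction, and the sensitivity value is an arbitrary placeholder rather than the paper's derived bound:

```python
import numpy as np

rng = np.random.default_rng(1)

def perturb_coefficients(coeffs, sensitivity, epsilon):
    """Functional mechanism: add Laplace noise to each polynomial
    coefficient of the objective function, then train on the noisy
    objective instead of the exact one."""
    return coeffs + rng.laplace(scale=sensitivity / epsilon, size=len(coeffs))

# Suppose the (approximated) objective is f(w) = c0 + c1*w + c2*w**2.
coeffs = np.array([1.0, -4.0, 2.0])
noisy = perturb_coefficients(coeffs, sensitivity=0.5, epsilon=1.0)

# Minimizing the perturbed quadratic gives the private parameter estimate.
w_star = -noisy[1] / (2 * noisy[2])
print(noisy, w_star)
```

The privacy cost is paid once, on the objective's coefficients, so it does not grow with the number of training epochs.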
The key contribution of the paper is the proposal of using Chebyshev expansion to derive an approximate polynomial representation of the objective functions. The authors show that they can further derive the sensitivity and error bounds of this approximation, which means preserving differential privacy in CDBNs is feasible.
Phan et al. highlight several studies from the literature. The review begins with the progress of deep learning in healthcare and motivates the need for privacy in deep learning, since healthcare data is private and sensitive. The study uses an Ɛ-differential privacy approach to protect privacy.
Looking at the literature, several major studies stand out (please see Fig.1):
📌Shokri et al. implemented DP by adding noise to the gradient descent updates of the parameters. Even though this method is attractive for deep learning applications on mobile devices, it consumes a large privacy budget to maintain model accuracy, because of the large number of training epochs and the large number of parameters [1].
📌To address this challenge, Abadi et al. proposed a privacy accountant based on the composition theorem, but the dependency on the number of training epochs remains unresolved [2].
📌Phan et al. proposed an approach independent of the number of training epochs, called deep private auto-encoders (dPAs). They enforce differential privacy by perturbing the objective functions of deep autoencoders [3].
According to the authors, the following two points urgently require further research under development of a privacy-preserving framework:
In the paper, the authors aim to develop a private convolutional deep belief network (pCDBN) based on the CDBN[4], an energy-based, hierarchical generative model that scales to full-sized images[5][6][7]. So the paper requires revisiting some important concepts, including:
We need to approximate the objective functions to simplify things. Estimating the lower and upper bounds of the approximation error incurred by applying a particular polynomial in deep neural networks is very challenging. The approximation error bounds must also be independent of the number of data instances, to guarantee applicability to large datasets without consuming much privacy budget. Chebyshev polynomials stand out as a solution to this problem.
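As an illustration of the idea (not the paper's exact construction), a low-degree Chebyshev expansion approximates a smooth nonlinearity with a small error bound on the interval, independent of how many data instances the objective sums over:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Approximate the sigmoid on [-1, 1] with a degree-5 Chebyshev expansion,
# the same flavor of approximation the paper applies to CDBN objectives.
xs = np.linspace(-1.0, 1.0, 200)
sigmoid = 1.0 / (1.0 + np.exp(-xs))
cheb = C.Chebyshev.fit(xs, sigmoid, deg=5, domain=[-1, 1])

# The worst-case error on the interval is tiny, and the bound depends only
# on the function and the degree -- not on the dataset size, which is the
# property the DP sensitivity analysis relies on.
max_err = np.max(np.abs(cheb(xs) - sigmoid))
print(round(max_err, 6))  # small, well under 1e-3
```

Once the objective is a polynomial, its coefficients are the finite set of quantities that Laplace noise has to protect.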
Private Convolutional Deep Belief Network
In this section, the approach is wrapped up. Fig. 2 depicts the pCDBN algorithm. The approach does not enforce privacy in max-pooling layers, because the pooling layers serve only a signal-filtering role.
👩💻 I am creating a notebook for this blog post. Please feel free to join me in the learning process (under construction🚧)
💐Thank you very much to Helena Barmer for being the best teammate ever.
💐Thank you very much to Emma Bluemke and Nahua Kang for their editorial review.
[1] Shokri R, Shmatikov V (2015) Privacy-preserving deep learning. In: CCS '15, pp 1310–1321.
[2] Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, Zhang L (2016) Deep learning with differential privacy. arXiv:1607.00133.
[3] Phan N, Wang Y, Wu X, Dou D (2016) Differential privacy preservation for deep auto-encoders: an application of human behavior prediction. In: AAAI '16, pp 1309–1316.
[5] Energy-Based Models. Available: https://cs.nyu.edu/~yann/research/ebm/
[6] Preserving Differential Privacy in Convolutional Deep Belief Networks. Available: https://arxiv.org/pdf/1706.08839v2.pdf
[7] A Tutorial on Energy-Based Learning. Available: http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf
[8] Zhu T, Li G, Zhou W, Yu PS (2017) Differential Privacy and Applications. Springer.
This post is part of our Privacy-Preserving Data Science, Explained series.
Part 1, Vanilla Encoding and Decoding
Part 2, Full Encoding and Decoding
Part 3, Encryption and Decryption
Part 4, Multiplication and Relinearization
Part 5, Rescaling
In the previous article of CKKS explained, Part 4: Multiplication and Relinearization, we saw how ciphertext multiplication works in CKKS, why we needed to relinearize the output in order to keep a constant ciphertext size and how to do it.
Nonetheless, as we will see, we need a final operation called rescaling to manage the noise and avoid overflow. This will be the last theoretical article of this series; in the next and final article we will implement everything in Python!
In order to understand how this works we will first have a high level picture, then see how it works in detail.
So far we have dug pretty deep into the details of CKKS, but here we will take a step back. CKKS works with what we call levels, in the sense that there is a limited number of multiplications allowed before the noise is too big to correctly decrypt the output.
You can imagine this as a gas tank. Initially the tank is full, but as you perform more and more operations it drains, until there is no gas left and you can't do anything anymore. The same is true for leveled homomorphic encryption schemes: you start with a certain amount of gas, but as you do multiplications you have less and less, until you run out and cannot perform any more operations.
The following figure illustrates this. When you start, you have your initial full gas tank, but as you perform multiplications and rescaling you lose a level, which is equivalent to consuming some of your gas. So if you start with \(L\) levels, denoted \(q_L,q_{L-1},\dots,q_1\), being at level \(q_l\) means you have \(l\) multiplications left, and performing one multiplication decreases the level from \(q_l\) to \(q_{l-1}\).
Once you run out of gas, just as in real life, it is possible to refill your tank and go farther down the road. This operation is called bootstrapping, and we will not cover it in this article. So, assuming we cannot refill the gas tank, there is one interesting aspect to take into account when using a leveled homomorphic encryption scheme: you need to know the number of multiplications you will do in advance!
Indeed, just like in real life, if you plan to travel far you will need more gas than if you simply go around your neighborhood. The same applies here: depending on how many multiplications you need to perform, you have to adjust the size of your tank. But the bigger the tank, the heavier the computations, and the less secure your parameters become.
We will not go into all the details, but know that the hardness of the CKKS scheme is based on the ratio \(\frac{N}{q}\), with \(N\) the degree of our polynomials, i.e. the size of our vectors, and \(q\) the coefficient modulus, i.e. our gas tank.
Therefore, the more multiplications we need, the bigger the gas tank, and therefore our parameters become less secure. To maintain the same level of security we then need to increase \(N\) which will increase the computational cost of our operations.
The following figure, from the Microsoft Private AI Bootcamp, shows the tradeoff one must consider when using CKKS. To guarantee 128 bits of security, we must increase the polynomial degree, even if we do not need the extra slots it provides, as the increased modulus could otherwise make our parameters insecure.
So before we move on to the more theoretical part, let's review the key takeaways:
So now that we see the high level picture let's dig into the why and how it all works.
If you remember from Part 2 about encoding, an initial vector of values \(z\) is multiplied by a scale \(\Delta\) during encoding to keep some level of precision.
So the underlying value contained in the plaintext \(\mu\) and ciphertext \(c\) is \(\Delta.z\). The problem when we multiply two ciphertexts \(c,c'\) is that the result holds \(z.z'.\Delta^2\). It contains the square of the scale, which might lead to overflow after a few multiplications as the scale grows exponentially. Moreover, as we saw before, the noise increases after each multiplication.
Therefore the rescaling operation's goal is to actually keep the scale constant, and also reduce the noise present in the ciphertext.
So how can we solve this problem? Well to do so we need to see how to define \(q\). Remember that this parameter \(q\) is used as the modulo of the coefficients in our polynomial ring \(\mathcal{R}_q = Z_q[X]/(X^N+1)\).
As described in the high level view, \(q\) will be used as a gas tank that we will progressively empty for our operations.
If we suppose we must do \(L\) multiplications, with a scale \(\Delta\), then we will define \(q\) as:
\(q = \Delta^L.q_0\)
with \(q_0 \geq \Delta\) which will dictate how many bits we want before the decimal part. Indeed if we suppose we want 30 bits of precision for the decimal part, and 10 bits of precision for the integer part, we will set:
\(\Delta = 2^{30}, q_0 = 2^{\text{# bits integer}}. 2^{\text{# bits decimal}} = 2^{10+30} = 2^{40}\)
Once we have set the precision we want for the integer and decimal part, chosen the number \(L\) of multiplications we want to perform, and set \(q\) accordingly, it is pretty easy to define the rescaling operation: we simply divide and round our ciphertext.
Indeed, suppose we are at a given level \(l\) so the modulo is \(q_l\). We have a ciphertext \(c \in \mathcal{R}_{q_l}^2\). Then we can define the rescaling operation from level \(l\) to \(l-1\) as:
\(RS_{l \rightarrow l-1}(c) = \lfloor \frac{q_{l-1}}{q_l} c \rceil (\text{mod } q_{l-1}) = \lfloor \Delta^{-1} c \rceil (\text{mod } q_{l-1}) \)
because \(q_l = \Delta^l.q_0\).
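The effect of rescaling on the scale can be illustrated with plain integers. This is only the \(\Delta\) bookkeeping, with no polynomials, no noise, and no encryption, and the parameter values are just examples:

```python
# Toy illustration of scale bookkeeping: encode at scale DELTA,
# multiply, then rescale by DELTA.
DELTA = 2**30           # scale: ~30 bits of fractional precision
Q0 = 2**40              # base modulus: 10 integer bits + 30 fractional bits
L = 3                   # number of multiplications we want to support
q = DELTA**L * Q0       # the starting "gas tank" q_L

def encode(x):
    return round(x * DELTA)

def rescale(ct):
    # RS: divide by DELTA and round, dropping one level (q_l -> q_{l-1}).
    return round(ct / DELTA)

a, b = encode(1.5), encode(2.25)
product = a * b              # scale is now DELTA**2: would overflow quickly
product = rescale(product)   # back to scale DELTA
print(product / DELTA)       # 3.375 = 1.5 * 2.25
```

After rescaling, the result sits at scale \(\Delta\) again, ready for the next multiplication, which is exactly why the scheme can chain \(L\) multiplications.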
So by doing so, there are two things we manage to do:
So if we put everything together, to actually do a multiplication in CKKS you need to do three things:
Once you do all this, decryption with the secret key will provide the right result and we are all set! Well, almost there, as there is one last detail we will have to cover.
So we saw we had everything we needed, but there is one technical problem: the computations involve huge numbers! Indeed, operations are done with huge moduli \(q_l = \Delta^l.q_0\). Imagine for instance we want 30 bits of precision for the decimal part, 10 for the integer part, and 10 multiplications. Then we have \(q_L = \Delta^L.q_0 = 2^{30 \cdot 10 + 40} = 2^{340}\)!
Because we sometimes handle huge polynomials, such as the uniformly sampled ones, some computations won't fit in usual 64-bit systems, so we have to find a way to make them work.
That's where the Chinese remainder theorem comes in! This theorem states that if we have \(L\) pairwise coprime numbers \(p_1,\dots, p_L\) and \(p = \prod_{l=1}^L p_l\) their product, then the map
\(\mathbb{Z}/p \mathbb{Z} \to \mathbb{Z}/p_1 \mathbb{Z} \times \dots \times \mathbb{Z}/p_L \mathbb{Z} : x (\text{ mod } p) \mapsto (x (\text{ mod } p_1), \dots, x (\text{ mod } p_L))\)
is a ring isomorphism, i.e. if you want to perform arithmetic on the "big" ring \(\mathbb{Z}/p \mathbb{Z}\), you could instead perform it independently on the "small" rings \(\mathbb{Z}/p_l \mathbb{Z}\) where you will not have the problem of performing computation on more than 64 bits.
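A small Python sketch shows this trick in action, using tiny primes in place of \(p_1,\dots,p_L\):

```python
from math import prod

# Small primes standing in for p_1, ..., p_L.
moduli = [97, 101, 103]
p = prod(moduli)  # the "big" modulus

def to_rns(x):
    """Residue Number System: represent x by its residues mod each p_l."""
    return [x % m for m in moduli]

def from_rns(residues):
    """CRT reconstruction back to Z/pZ."""
    x = 0
    for r, m in zip(residues, moduli):
        n = p // m
        x += r * n * pow(n, -1, m)  # pow(n, -1, m) is the modular inverse
    return x % p

a, b = 123456, 654321
# Multiply componentwise on the small rings instead of on the big one.
c = [(ra * rb) % m for ra, rb, m in zip(to_rns(a), to_rns(b), moduli)]
print(from_rns(c) == (a * b) % p)  # True
```

Each componentwise operation fits comfortably in a machine word, which is exactly what we need when \(q_L\) is hundreds of bits long.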
So in practice, instead of having \(q_L = \Delta^L.q_0\), we first choose \(p_1,\dots,p_L, q_0\) prime numbers, with each \(p_l \approx \Delta\) and \(q_0\) a prime number greater than \(\Delta\) depending on the integer precision desired, then set \(q_L = \prod_{l=1}^L p_l . q_0\).
This way, we can use the Chinese remainder theorem and the trick described above to perform arithmetic with a big modulus. The rescaling operation has to be slightly modified:
\(RS_{l \rightarrow l-1}(c) = \lfloor \frac{q_{l-1}}{q_l} c \rceil (\text{mod } q_{l-1}) = \lfloor p_l^{-1} c \rceil (\text{mod } q_{l-1}) \)
So we have seen in this article what rescaling is, why we need it, and how one could implement it in practice. In the next and last article we will put everything together and code a CKKS-like homomorphic encryption scheme ourselves in Python!
This could be an amazing thing: the open data movement has done a great job of explaining the benefits of increased access to data: advancing science, increasing human knowledge, democratizing knowledge.
But we know not all data can be open:
As a result, a lot of scientific information is often kept safely locked away in data silos - sometimes at the expense of our ability to make scientific progress and collaborate. The inability to share data also contributes to our issues with reproducibility, accountability and transparency in science.
The Open Data movement is excellent for data about bacteria, plants or animals, but we need to be more careful with human data.
At the most basic level, some datasets contain your face:
We often ‘de-face’ scans to provide anonymization, but we’re now learning that neural networks can be used to reconstruct the facial features of the removed faces, which de-anonymizes them.
But even for datasets that don't contain faces or images at all, anonymization won't be a robust solution.
We now know you often can’t really anonymize a dataset.
True anonymization of data is difficult to achieve because data can be combined with other public datasets and deanonymized. If you want some infamous examples, search for linkage attacks and the Netflix Prize deanonymization. Security expert Bruce Schneier says that 'about half of the U.S. population is likely identifiable by just gender, date of birth and their city.'
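A linkage attack is easy to demonstrate on toy data (all names and records below are invented):

```python
# "Anonymized" medical records: names removed, quasi-identifiers kept.
medical = [
    {"zip": "02139", "dob": "1985-03-02", "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "dob": "1990-07-15", "sex": "M", "diagnosis": "asthma"},
]

# A public dataset (e.g. a voter roll) sharing the same quasi-identifiers.
public = [
    {"name": "Alice Smith", "zip": "02139", "dob": "1985-03-02", "sex": "F"},
    {"name": "Bob Jones",   "zip": "02139", "dob": "1990-07-15", "sex": "M"},
]

key = lambda r: (r["zip"], r["dob"], r["sex"])
index = {key(r): r["name"] for r in public}

# Joining on (zip, dob, sex) re-identifies the "anonymous" patients.
for record in medical:
    print(index[key(record)], "->", record["diagnosis"])
```

No names were ever in the medical dataset; the combination of a few ordinary attributes was enough to re-identify everyone.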
So what if you think you’ve removed gender, age and location from the data?
Well, it’s also becoming unclear what kind of information can be extracted from seemingly bland data. Researchers at Oxford have demonstrated that you can predict age and sex from just a structural brain image.
You can also be directly identified by your heartbeat pattern.
When it’s more than a structural brain image, well - look at DNA phenotyping, for example: your height, weight, body mass index, voice, age, sex and ancestry can be inferred from your genetic profile.
If this is all new to you, it might sound paranoid.
But, once your health data is out there, you can’t get it back.
You can’t later change it, like you can change a password.
You can’t cancel it and open a new genetic profile, like you can with your compromised credit card.
Once it’s shared, this unique identifier can forever be copied and shared to other parties.
We currently don't know the full extent to which our DNA profile or health care data can be exploited, but as you can see, we're only beginning to see the nefarious ways it can be used.
We want to make it possible to learn from data you can't have access to, without sacrificing privacy.
We're building infrastructure for easy remote data science.
We want to allow you to extract insights, train models, and evaluate results on data — without anyone copying, sending, or sharing that data.
You get your stats/model/results, the data owner keeps the data.
This would be very powerful — and it will take us time to reach that vision. But how do we get started?
Three years from now, we envision a world where any researcher can instantly perform any statistical or machine learning experiment on all relevant data across all relevant institutions, but the infrastructure for doing so automatically restricts any other uses of the underlying information aside from the chosen experiment.
This capability would be very powerful — and it will take us time to reach that vision. Here’s how we’ve started:
Our vision requires several key privacy technologies to be integrated into the same statistical software stack such that they can be seamlessly used together including: Federated Learning, Differential Privacy, Multi-Party Computation and Homomorphic Encryption. We integrate them in an ecosystem called “Syft”, a set of libraries which seeks to expose this core functionality on all relevant platforms (cloud, web, iOS, Android, and IoT). These algorithms can be deployed to cloud infrastructure using a software package called PyGrid, which offers important enterprise/cloud scalability features. Notably, core libraries such as PySyft embed these algorithms within familiar statistical/machine learning tools so that minimal retraining is required for existing researchers to use them.
However, merely implementing these algorithms within existing statistics tools is not enough. Our end vision is not a collection of software libraries; it is the end-user ability to instantly train on the world's scientific data. Thus, the grand vision is a large, decentralized network for the world's scientific information.
But before we can have massive open networks for data science, we need to be able to let just two people - one with data, and the other with data science expertise - do an experiment without revealing the data itself. We call this OpenMined Duet. A Duet call allows a researcher and a data owner to perform private data science over a Zoom Call. Peer-to-peer technology allows a data scientist and data owner to connect their Python notebooks while on a shared video call, allowing the researcher to run computations on the data owner’s machine under the watchful eye of the data owner. Watch a live demo of Duet here.
Finally, in order for powerful tools like PySyft and Duet to get used, data scientists need the ability to meet relevant data owners! The OpenGrid Registry is a website that connects researchers with datasets so they can schedule a Duet. A data owner can register their dataset, describing what it contains but not uploading it anywhere. It's like Craigslist, but for private datasets and models.
Next, we’re building out integrations with all major open-source statistical packages (R, Numpy, Pandas, Scikit-Learn, etc), thus reaching all major research communities with our private data access solution. In combination with the OpenGrid data registry, this is capable of supporting privacy-preserving data science/statistics over a Duet + Zoom call for the entire research community - but only for projects which can be accomplished in the length of time appropriate for a Duet + Zoom call, since the data owner must be available to supervise the computations performed.
Sometimes, scheduling a call is difficult across time zones. Offline training & evaluation would make this accessible across time zones, on an as-needed basis. Next, we are tackling the problem of making data "always available" via an enterprise system that can host research data privately.
The infrastructure can protect the data automatically, using automated privacy checking and pre-set restrictions. As a result, scientists can have instant access to study remote, private datasets without having to wait for a data owner to be available for a call. This should rapidly increase the ease of using private data for research. This work focuses on automatic differential privacy analysis and deployment-specific concerns for the enterprise platform - GPU support, etc.
Helen approaches privacy from a computer ethics perspective. She looks from the ground up to contextual integrity instead of top down from a broad definition of privacy. Considering all the different social platforms and communication media technologies today, we can see how they now: monitor, track, capture, hold onto, and analyze our data. When people complain, they are concerned about their privacy being violated. What is at the root of their concerns? What is stirring people's anxieties? Helen has aggregated observations that society was already detecting itself. Here we see societal emotions are peaked while at the same time, there are also costs and risks in locking down and hoarding data. Privacy needs to be a concept where we can stick a stake in the ground to defend and protect where it carries moral weight.
Google Maps Street View photos can capture individuals on camera, whether in a bar or on a beach. Those are public places, so, the argument goes, you have no claim to privacy. Does that seem reasonable? There is some incompatibility here. Helen asks, “Do we have privacy in public?” and wrote an article arguing that, well, yes we do! In fact, there is a significant difference between others seeing you walk by in public and someone capturing a photograph of you in public.
What characterizes appropriate and inappropriate flow?
Different information-flow rules, or norms, hold in different domains. These norms exist not only to protect data but also because they are productive for those domains.
Taking the Hippocratic oath, physicians vow not to speak about patients within their own personal spaces because this is the only way patients will be truthful about their symptoms and in turn this influences the success of treatment and how physicians go about their work. The outcome results in far better medical care for the patient and a more accurate treatment methodology for others. This is a positive, productive norm that enables truth to surface for the benefit of all patients.
It is a federal crime to interfere with the postal service. In the USA, Benjamin Franklin was very clear that the postal system had to have very strong rules in order to guarantee confidentiality. If people do not have confidence in the postal system, the system fails and people don’t use it. Protecting the integrity of the postal service protects this crucial social system.
How is contextual integrity to be implemented?
Do we simply hard-code the appropriate flows through regulation? Build an encryption tool? There seems to be a need for a broader incentive structure, including more influencers in the community, all acting proactively to out-compete. What are the roles for individuals? Regulators? VCs? Philanthropies?
Let’s think of domains in terms of ontologies: people acting within certain roles and/or capacities, and the nature of the information types involved. How do we describe an information flow with enough detail that we can reason about it and discuss it?
You need to be able to describe the information flow in terms of five parameters: the sender of the information, the recipient, the subject (whom the information is about), the type of information, and the transmission principle (the constraint under which the flow occurs).
To express a rule of information flow, you specify these five parameters. Such parameters can be formalized in computer science. There are still lots of tough questions to answer, but this is a theory that could hop over to computational systems.
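Nissenbaum's theory names five parameters of an information flow - sender, recipient, subject, information type, and transmission principle. As a sketch of how such a norm might be formalized in code (all names here are illustrative, not from any existing library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InformationFlow:
    sender: str      # who transmits the information
    recipient: str   # who receives it
    subject: str     # whom the information is about
    info_type: str   # the nature of the information
    principle: str   # the transmission principle governing the flow

def medical_norm(flow: InformationFlow) -> bool:
    """A toy norm for the medical domain: patient information may flow
    to a physician only under confidentiality."""
    return (flow.info_type == "medical"
            and flow.recipient == "physician"
            and flow.principle == "confidentiality")

ok = InformationFlow("patient", "physician", "patient", "medical", "confidentiality")
bad = InformationFlow("physician", "advertiser", "patient", "medical", "sale")
print(medical_norm(ok), medical_norm(bad))  # True False
```

A system expressive enough to state such predicates can at least make its information-flow rules explicit and open to debate.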
Contextual integrity says there is no information where there isn’t at least some set of rules that constrain the information flow.
Consider an app developer working on the Android OS who creates three levels of permission. The developer then abdicates any further responsibility around privacy because those are the platform's current policies. That is antithetical to contextual integrity. We need systems that map onto the appropriate ontologies; we are currently operating with tools built on an outdated definition of privacy.
The current consent-based privacy model is not doing anything meaningful. Most people cannot understand a privacy policy, so they do not know what they are consenting to. The excuses that the world is too complex and that data flows are impossible to describe have fallen flat. No thought is given to whether a data flow is good for the user and good for society, and knee-jerk adherence to hard-coded systems has ramifications that are unfair to users and to broader society. There is a real need for creative, philosophical research into productive data flows within various social domains. In building a system, you may not know how the information should flow. However, your system needs to be expressive enough that you can at least identify these parameters and set them, at the ground level, to something that makes sense for your domain.
We are in a power-imbalanced environment. Every single web search you conduct is surveilled and saved, and you are profiled. Search-company regulators are not listening. You are a speck in a huge system provided by someone whose interests are in direct opposition to yours. What can you do? You cannot hide; you can only obfuscate in protest. We can do this and muddy the data profiles until the companies come to the table and agree on the values this data will serve.
AdNauseam is free and works on top of an open-source ad blocker. It goes onto a page, identifies all the ads on that page, and clicks them all. What happens to that information? Who knows? It blocks all the ads and clicks all the ads. You can visualize the ads in the Ad Vault to see how you're profiled; internet profiles are different for every person. AdNauseam is banned from the Chrome Web Store, so it is not advertised, but you can still install it. Firefox has labelled it a great add-on! Add TrackMeNot as well, and your pages will load faster.
PriCon2020 - Day 1 Livestream with Helen Nissenbaum beginning at 1:59:48
“Privacy in Context: Technology, Policy, and the Integrity of Social Life” (Helen Nissenbaum, 2009)
“Obfuscation: A User’s Guide for Privacy and Protest” (Brunton & Nissenbaum, 2015)
Good Code Podcast: Helen Nissenbaum on Post-Consent Privacy -- March 2019
Inclusionism: Show #33 Inclusionism with Philosopher Helen Nissenbaum, author of Obfuscation -- Dec 2019
]]>Our goal at OpenMined is to build a new future by lowering the barrier-to-entry to privacy technologies. To this end, we’ve built tools such as PySyft and PyGrid for performing secure, privacy-preserving data science. With these tools you can take advantage of modern techniques such as federated learning, homomorphic encryption, and differential privacy.
To accomplish our mission, we need to go beyond just building the tools. We must ensure that institutions worldwide are using privacy technology to build new products, analyze sensitive data, and provide new services. New paradigms and skills are spread most effectively through education, so we’re building an entirely new learning platform starting with a series of courses on privacy-preserving machine learning.
This series of four courses is designed for students of all skill levels. Through hands-on exercises and expert instruction, students will learn how privacy enhancing technologies are transforming society. They will also learn the foundations of private computation and how to build products using privacy-preserving machine learning. Each course builds to a real-world project designed to demonstrate new skills and understanding, perfect for building out portfolios.
In the first course, Privacy and Society, students learn how privacy infrastructure is changing the way information is managed in society. Students will build the knowledge and skills necessary to take advantage of the opportunities provided by privacy-enhancing technology. This course is non-technical, intended for business leaders and software developers alike.
The second course, Foundations of Private Computation, teaches every major privacy-preserving technology to an intermediate level, including the math behind encrypted computations. Students will build these technologies from scratch, use federated learning to work with protected data on remote devices, and use differential privacy budgeting with PyTorch models.
In the Federated Learning Across Enterprises course, students develop the skills to use federated learning for analyzing private data across multiple institutions. Students learn how to share data in a private data warehouse themselves and access private data for statistical analysis or to train machine learning models.
Finally, the Federated Learning on Mobile course teaches students how to build mobile apps that can train models across millions of devices. This course covers building apps on different platforms including iOS, Android, and browsers with React.js. For the final project, students propose their own mobile app and develop a prototype.
Students won’t be learning on their own. Along with support from the OpenMined community, we’re providing technical mentors to answer questions and give detailed feedback on projects. These mentors will help students at every step through their learning journey.
The Private AI series is developed in collaboration with PyTorch, Facebook AI, the University of Oxford’s Centre for the Governance of AI at the Future of Humanity Institute, and the United Nations Global Working Group on Big Data. We will also seek to design content sufficient to prepare you for the forthcoming certification on private data analysis to be issued by the UN Global Working Group on Big Data.
We’re excited to bring these courses to everyone in the world. All of our courses will be free to access and technical mentorship is free for our students thanks to the generosity of our partners. Join the privacy revolution, sign up today.
]]>The 2020 U.S. presidential election has been, to put it bluntly, a nail-biter. It is, at time of writing, over two days since the voting booths began to close and both candidates have grounds to be hopeful of victory, with several states still in play. In comparison with other Western countries, the voting process has seemed torturously slow; it lies in stark contrast to U.K. elections, in which we can have a pretty good idea of the victor within seconds of polls closing (thanks, broadly useful exit polls), and have full confirmation well before even the earliest of birds brew their first cup of tea.
Aside from being a typically American spectacle, even being somewhat enjoyable for people with no particular dog in the fight or those who just like elections a little too much, the ambiguity and lethargy of this year's vote counting has far more malicious consequences. In the early hours of the morning after the election, incumbent Donald Trump first began to make claims (all-too-predictably, it should be noted) of voter fraud "stealing" the election from him. While the full effects of these claims are yet to be discernible, it is not difficult to imagine the civil unrest which could ensue, particularly given the form of 2020 America.
For those of you who (rather sensibly) have removed themselves from the minute-by-minute torment of the election, I offer this brief and thoroughly non-expert summary of events: The 2020 election has seen historic voter turnout, for both in-person and "mail-in" votes. Typically (for reasons I cannot as of yet elucidate), mail-in votes skew heavily towards the Democrats. The rules for counting the different types of vote vary state-by-state. While Florida saw an early Biden lead since mail-ins were counted in advance, some states, including Pennsylvania, Wisconsin and Michigan, were yet to count theirs even as election evening turned into night turned into dawn turned into day. In the morning after the election, Nevada even announced, teasingly, that it would have nothing substantial to announce until at least two days after election day. Just as Europe was waking up, Trump claimed victory while millions of votes in the aforementioned states were yet to be counted. More perniciously, he implied that any change from the scorecards at the time would be evidence of mass vote tampering by Democrats. Several hours later, the mail-in votes were counted and were (predictably) strongly pro-Biden. A few of the remaining states now appear to be turning from Red to Blue, putting victory within Biden's grasp.
This post-election controversy is an indictment of a supposedly democratic system. If there is no fraud (and there is no evidence for it), then the fact that any party could even claim for it with a modicum of seriousness is an unideal feature of the system. If there is fraud, well, that is quite clearly incongruent with democracy. I proffer that vote counting does not need to be conducted in such a contentious manner and that the stress of election day could be completely removed. The answer to these problems is electronic voting (e-voting).
I can almost hear the exasperation at that reveal. "E-voting makes it easier to cheat", "E-voting is insecure", "Tom Scott said E-voting was a bad idea" [1]. For the moment I implore you to park all concerns about the (im)practicalities of e-voting and first consider its utility as a concept. At the heart of both Trump's and Biden's victory proclamations, they, if you believe that each of them believes what they are saying, both argue in favour of the same fundamental idea: that in a democratic system all lawful votes are counted, and only lawful votes are counted. Where they differ is in defining what a lawful vote is, but I will leave that debate for the courts.
For an election result to be trusted, votes should be counted (i) promptly, so the victor can take their deserved position; (ii) transparently, so votes are not manufactured; (iii) lawfully, for obvious reasons. The election-deciding states are only really failing criterion (i), but not through malicious intent or malpractice. Some counties have had millions of mail-in votes to count and were prohibited from doing so until polls closed. That is an enormous challenge to ask of humans, who tire, make mistakes, and will be acutely aware that the eyes of the world are on them. To add yet more complexity to what is now a precariously ebullient situation, mail-in votes will continue to arrive for days to come. It is in this delay in reaching a result that the essence of Trump's contention lies.
Contrast the laborious, manual vote counting process with an idealised electronic voting infrastructure: citizens privately and securely send their vote via a phone or computer to a number of vote counting bodies. The votes are not counted before polls close and, because of the encryption protocol, there is no possible way for somebody to illegally view the results in advance. Polls close and votes are aggregated immediately. There is no question as to whether a vote was sent before or after the deadline, because votes arrive at the counters' terminals without the delay of conventional post and are verifiably timestamped.
The freedom to vote from anywhere using a personal phone may concern some people as providing an easy route to voter manipulation. But this criticism can be voided through vote governance, such as still requiring people to come to a polling station - perhaps individuals should be required to pick up a random one-time password from the station in order to cast their vote. In this setting, we have only changed the medium of vote casting from paper to digital; the process of voting remains the same. Now, however, we can reap the benefits of rapid electronic calculation.
I trust that you can see how e-voting potentially could be of use, but are perhaps still dubious of using real, existing systems in practice. Make no mistake, I'm not suggesting that e-voting solutions are ready to roll now, or will be even by the 2024 U.S. election. There are practical issues still to work out before attempting to bring any system to the political stage, but I believe the issues are worth tackling. Recent years have seen great advancement in secure computational technologies, such as self-sovereign identity [2], differential privacy and secure multi-party computation, which together could be applied to deliver secure, private, transparent voting.
At OpenMined, we have begun to explore the efficacy of these technologies in a number of secure voting protocols, collectively dubbed PryVote [3]. We have developed a demo which uses Shamir's Secret Sharing [4, 5] to encrypt the votes and split them between vote counters, each of whom keeps a separate, encrypted vote tally. Not all vote counters are needed to decrypt the final, combined tally, so if a party manipulates the vote tally they control, the true result can still be calculated and the fraudulent party can easily be discovered. Every party can have a stake in the vote counting process without compromising its security, which removes all possibility of crying foul at the outcome.
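As an illustrative sketch (not the PryVote implementation), a yes/no tally with Shamir's Secret Sharing might look like this: each voter splits their vote into shares, each counter sums the shares it holds, and any threshold-sized subset of counters can reconstruct the total via Lagrange interpolation. All names and parameters below are assumptions for demonstration only:

```python
import random

P = 2_147_483_647  # a large prime; all arithmetic is mod P

def share_vote(vote, n_counters=5, threshold=3):
    """Split a 0/1 vote into n shares; any `threshold` shares reconstruct it."""
    coeffs = [vote] + [random.randrange(P) for _ in range(threshold - 1)]
    # The share for counter x is the random polynomial evaluated at x (x = 1..n)
    return [(x, sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P)
            for x in range(1, n_counters + 1)]

def reconstruct(shares):
    """Lagrange interpolation at 0 recovers the secret (here, the vote tally)."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

votes = [1, 0, 1, 1, 0, 1]  # six voters
# Each counter keeps a running tally of the shares it receives; shares add up.
tallies = [0] * 5
for v in votes:
    for idx, (_, y) in enumerate(share_vote(v)):
        tallies[idx] = (tallies[idx] + y) % P

# Any 3 of the 5 counters suffice; here counters with x = 1, 3, 5
subset = [(x, tallies[x - 1]) for x in (1, 3, 5)]
print(reconstruct(subset))  # 4 "yes" votes
```

Because shares add up, counters never see individual votes, and any threshold-sized subset of honest counters can still recover the correct tally if another counter withholds or corrupts its shares.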
We use cryptographically verifiable "credentials", a concept called self-sovereign identity, to validate an individual's right to vote [6]. Through this technology, we can permit votes to be cast without revealing sensitive, personal information like date of birth or address, which can open people up to intimidation and manipulation.
It will undoubtedly be a little while, if ever, until we see PryVote in action in a national election. However, it is not out of the question that less - shall we say - impactful votes than the U.S. election are driven by our secure voting protocols in the near future. As a not-for-profit community of developers, our only drive is to empower people to carry out votes with privacy and security. We believe there is great potential for these protocols to facilitate everyday voting activities right now, be it deciding where to go for lunch or evaluating the truthfulness of a statement posted on social media.
Our modern political problems are partly caused by 18th century approaches. It doesn't have to be that way. It's time to modernise democracy.
[1] - https://www.youtube.com/watch?v=w3_0x6oaDmI
[3] - https://github.com/OpenMined/PyDentity/tree/master/projects/pryvote
[5] - https://qvault.io/2020/08/18/very-basic-shamirs-secret-sharing/
]]>This could be an amazing thing: the open data movement has done a great job of explaining the benefits of increased access to data: advancing science, increasing human knowledge, democratizing knowledge.
But we know not all data can be open:
As a result, a lot of scientific information is often kept safely locked away in data silos - sometimes at the expense of our ability to make scientific progress and collaborate. The inability to share data also contributes to our issues with reproducibility, accountability and transparency in science.
We want to make it possible to learn from data you can't have access to, without sacrificing privacy.
We're building infrastructure for easy remote data science.
We want to allow you to extract insights, train models, and evaluate results on data — without anyone copying, sending, or sharing that data.
You get your stats/model/results, the data owner keeps the data.
This would be very powerful — and it will take us time to reach that vision. But how do we get started?
Before we can have massive open networks for data, we need to be able to let two people - one with data, and the other with data science expertise - do an experiment without revealing the data itself.
We call this OpenMined Duet.
This is our MVP (minimum viable product) of that final goal.
Duet enables remote execution and permissioning of tensor requests from a data scientist to a data owner’s private data, all from a simple Python notebook. How? Skip to the 'technical details' section below.
I encourage you to watch Nick Rose explain this at 5:10 in his PriCon 2020 talk, but to demonstrate the code at a glance:
1. The Data Scientist and Data Owner each open their own python notebook (located anywhere worldwide).
2. The Data Owner creates a Duet object and connects to the grid. Easy-to-copy-and-paste instructions are provided, containing a unique identifier which can be passed to the scientist over chat or verbally.
import syft as sy
duet = sy.launch_duet()
> Send the following code to your Duet Partner!
> import syft as sy
> duet = sy.join_duet('40a9c3d4a93bd37ae43fe06673fef351')
3. The Data Scientist takes the ID and connects to Duet, receiving a unique Client ID to provide back to the Data Owner to confirm the connection.
import syft as sy
duet = sy.join_duet('40a9c3d4a93bd37ae43fe06673fef351')
> Send the Duet Client ID back to your Duet Partner!
> Duet Client ID: 0c653d3d88097cff6638ec5d159e96fc
4. The Data Owner imports torch, creates two normal tensors, adds special descriptive tags, and then hosts the data by calling send with searchable set to True.
import torch
data = torch.tensor([[0,0],[0,1]]).tag("data")
target = torch.tensor([[0],[0]]).tag("target")
data_ptr = data.send(duet, searchable=True)
target_ptr = target.send(duet, searchable=True)
5. The Data Scientist can now check the Duet store looking for objects tagged as “data” and "target".
duet.store.pandas
> ID Tags
> 0 <UID:593c562c-116b-419d-8c32-e97c271af6a2> [data]
> 1 <UID:6228f284-667d-493a-bbcd-3ec839604de1> [target]
6. The Data Scientist can now get a pointer reference to these objects by indexing the store. This returns a pointer to that data, which can now be used as though it were a regular object.
data_ptr = duet.store[0]
target_ptr = duet.store[1]
print(data_ptr)
> <syft.proxy.torch.TensorPointer at 0x138d06730>
7. Actions executed on pointers return more pointers which can be chained, however to get the actual data behind the pointer a permission request must be made. These can be made as blocking or non-blocking requests depending on the need.
sum_ptr = data_ptr.sum().item()
sum = sum_ptr.get(
    request_block=True,
    name="sum",
    reason="To see the result",
    timeout_secs=30,
)
print(sum)  # waits for the blocking request above to complete first
8. On the Data Owner side a list of available requests can be viewed and accepted or denied.
duet.requests.pandas
> Request Name Reason Request ID
> 0 sum To see the result <UID:8633a6a0-...-cf2242a67f7c>
duet.requests[0].accept()
9. Done! The next line of the Data Scientist's notebook executes and prints the output as expected.
print(sum)
> 1
For more on Duet, see Nick's talk for details.
Here's our estimated development timeline:
Duet and OpenGrid are free and open-source. Find us on GitHub, or get in touch at slack.openmined.org or email partnerships@openmined.org!
If you want to join our mission of making the world more privacy-preserving:
Suppose you are part of a medical organization (call it MO-A), and you are working on a model to detect blood vessels in the retina. The problem arises when MO-A does not have sufficient data for your model to reach the accuracy you were looking for. However, there happens to be another medical organization (call it MO-B) which owns the additional data you need. Problem solved! Or is it?
In an ideal scenario, you would ask MO-B for the additional data, train your model on it as well, and voilà! Your model achieves the accuracy score you were aiming for. Alas! In real-world scenarios, there are many factors - such as competitiveness between organizations, various lawsuits, etc. - that stand as obstacles to getting the data. Is there any other way to get your work done without getting involved in such complications?
This article introduces the use of Federated Learning (FL) on medical datasets. This will allow you to work on multiple datasets without accessing the data.
To summarize the problem discussed earlier, we want to work on the additional data of another organization to improve our model’s accuracy. The key component here is the fact that we do not want to access the data altogether. We just want to perform a computation on the data and get back the results. This is where Federated Learning comes into the picture.
To summarize, FL is an algorithmic solution that enables the training of ML models by sending copies of a model to the place where the data resides and performing training at the edge. This means we can send our model to where the data lives, perform the computation on the data there, and get back our trained model. In this way, we get to train our model on the data at MO-B without ever accessing the data directly.
For this example, we use the DRIVE and STARE datasets, which contain 20 retina images each along with masks for the blood vessels. We use a U-Net model to perform the segmentation task. Our aim is to perform FL across both datasets and compare the results with models trained on each individual dataset.
In total we have 40 images, 20 from each dataset. We allocate 16 images from each dataset for training and 4 images from each dataset to test our model’s accuracy. Our testing dataset will have 8 images.
To train our models we need to build the data generators. Our first step will be to build custom data loaders. We need four of them: a train loader for the DRIVE dataset (16 images), a train loader for the STARE dataset (16 images), a train loader for the federated dataset (16 + 16 = 32 images), and finally a test loader (4 + 4 = 8 images).
We define a DataGenerator class for all our data loaders. It takes the following inputs: a list of images, a list of masks, and a data transformer.
# DataGenerator class
from torch.utils.data import Dataset

class DataGenerator(Dataset):
    def __init__(self, images, masks, transform=None):
        self.images = images
        self.masks = masks
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        mask = self.masks[idx]
        if self.transform:
            image = self.transform(image)
            mask = self.transform(mask)
        return [image, mask]
Next, we loop through every image and mask, resizing them to the desired shape. The populated images and masks lists are then fed to the DataGenerator class as input.
images = list()
masks = list()
# resize the images and masks
for (imagePath, maskPath) in zip(imagePaths, maskPaths):
    image = cv2.resize(cv2.imread(imagePath), self.args.image_size)
    mask = cv2.resize(cv2.imread(maskPath), self.args.image_size)
    images.append(image)
    masks.append(mask)
Similarly, we form the Data Loaders for DRIVE dataset, STARE dataset, Federated dataset, and Test dataset.
In the case of the federated data loader, we instantiate 3 workers (alice, bob, and jon). We distribute the data across two workers, 'alice' and 'bob'; we can think of them as MO-A and MO-B holding their respective data. Worker 'jon' acts as a secure worker, which allows us to perform secure aggregation of our models.
hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
jon = sy.VirtualWorker(hook, id="jon")  # secure worker
workers = (bob, alice)
federated_set = DataGenerator(images, masks, transform=transform)
federatedDataLoader = sy.FederatedDataLoader(
    federated_set.federate(workers),
    batch_size=self.args.fed_batch_size,
    shuffle=True)
Now, for training, we initialize 3 models: model_drive, model_stare, model_fl. As for our loss function, we use Dice Loss, which will help us optimize the models.
# Dice Loss
def dice_loss(pred, target, smooth=1.):
    pred = pred.contiguous()
    target = target.contiguous()
    intersection = (pred * target).sum(dim=2).sum(dim=2)
    loss = (1 - ((2. * intersection + smooth) /
                 (pred.sum(dim=2).sum(dim=2) + target.sum(dim=2).sum(dim=2) + smooth)))
    return loss.mean()
We define a train_model() function to train the model on a dataset.
# train model on DRIVE dataset
print("Training model on DRIVE dataset...")
model_drive = train_model(DriveDatasetLoader, model, optimizer, num_epochs=args.epochs)
# train model on STARE dataset
print("Training model on STARE dataset...")
model_stare = train_model(StareDatasetLoader, model, optimizer, num_epochs=args.epochs)
For training model_fl, we define another train() function which performs Federated Learning.
# train the FL model
print("Training model on Federated dataset...")
for epoch in range(1, args.epochs+1):
    # get the models back from the workers, i.e. alice and bob
    modelA, modelB = train(args, model_fl, device, FederatedDatasetLoader, optimizer, epoch)
    # perform secure aggregation on the models
    model_fl = aggregate(model_fl, modelA, modelB, jon)
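The train() and aggregate() helpers are defined in the full code linked at the end of this post. As a rough, hypothetical sketch of what the aggregation step does (plain PyTorch parameter averaging; in the PySyft version the two models are first moved to the secure worker so neither data owner sees the other's raw update):

```python
import torch

def aggregate(model, modelA, modelB, secure_worker=None):
    """FedAvg-style aggregation of two workers' models.

    In the PySyft version, modelA and modelB would first be moved to
    `secure_worker` so neither data owner ever sees the other's raw
    update; here we only sketch the averaging step itself.
    """
    with torch.no_grad():
        stateA, stateB = modelA.state_dict(), modelB.state_dict()
        model.load_state_dict({k: (stateA[k] + stateB[k]) / 2.0 for k in stateA})
    return model
```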
After training our models, it's finally time to test them on the test data loader. To measure accuracy for the segmentation task, we use the IoU (Intersection over Union) metric.
# measure iou score
def calculate_iou(pred, target):
    intersection = np.logical_and(target, pred)
    union = np.logical_or(target, pred)
    iou_score = np.sum(intersection) / np.sum(union)
    return iou_score
Using this metric, we can calculate an average IoU score for the test dataset.
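A hypothetical evaluation helper might look like the following (the helper name, the sigmoid output activation, and the 0.5 binarization threshold are assumptions not taken from the original code; the IoU is computed inline so the snippet is self-contained):

```python
import numpy as np
import torch

def evaluate(model, test_loader, device, threshold=0.5):
    """Return the mean IoU over a loader of (image, mask) batches."""
    model.eval()
    scores = []
    with torch.no_grad():
        for image, mask in test_loader:
            # binarize predictions and targets, then compute IoU per batch
            pred = torch.sigmoid(model(image.to(device))).cpu().numpy() > threshold
            target = mask.numpy() > threshold
            inter = np.logical_and(target, pred).sum()
            union = np.logical_or(target, pred).sum()
            scores.append(inter / union)
    return float(np.mean(scores))
```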
On training our models, we notice that the models trained on the individual DRIVE and STARE datasets have similar loss scores, whereas the model trained on the federated dataset achieves a noticeably better loss. This is because the federated model gets to train on a larger amount of data than the other models.
The test scores mirror the loss values: the federated model scores significantly higher accuracy than the other models. To keep this demo simple, we used a combined dataset of only 40 images, hence the low absolute accuracy scores. Similar differences between the models would be seen if we increased the dataset size.
Using Federated Learning, we could train our model on additional data without ever accessing that data. Since data is an invaluable resource for any organization, privacy-preserving algorithms like Federated Learning are an important link in developing deep learning models while providing a layer of privacy.
The complete code for this demo can be found here.
Want to learn more about Privacy Preserving AI in Medical Imaging? Check out this post by Emma which will give you an overview about the application of Privacy Preserving in Medical Imaging.
]]>This post is part of our Privacy-Preserving Data Science, Explained series.
Part 1, Vanilla Encoding and Decoding
Part 2, Full Encoding and Decoding
Part 3, Encryption and Decryption
Part 4, Multiplication and Relinearization
Part 5, Rescaling
In the previous article, CKKS explained, Part 3: Encryption and Decryption, we saw how one could create a homomorphic encryption scheme based on the RLWE problem, with an implementation of homomorphic addition and ciphertext-plaintext multiplication.
While ciphertext-plaintext multiplication was easy, ciphertext-ciphertext multiplication is much more involved, as we will see. Indeed, we must handle several things to do it properly, such as finding the right operation so that, once decrypted, we obtain the product of the two underlying plaintexts, and managing the size of the resulting ciphertext.
Therefore, this article will introduce the concepts of ciphertext-ciphertext multiplication and relinearization to reduce the size of the resulting ciphertext.
To see how we will perform ciphertext-ciphertext multiplication in CKKS, let's recap what we saw in the previous article.
First, remember that we work on the polynomial space \(\mathcal{R}_q=\mathbb{Z}_q[X]/(X^N+1)\). We sample \(s\) as our secret key, then we can safely output a public key \(p = (b,a) = (-a.s + e, a)\) with \(a\) uniformly sampled on \(\mathcal{R}_q\), and \(e\) a small random polynomial.
Then \(\texttt{Encrypt}(\mu,p) = c = (c_0,c_1) = (\mu,0) + p = (b + \mu,a) \in \mathcal{R}_q^2\) is the encryption of a plaintext \(\mu \in \mathbb{Z}_q[X]/(X^N+1)\) under the public key \(p\).
To decrypt a ciphertext \(c\) using the secret key, we perform the following operation:
\(\texttt{Decrypt}(c,s) = c_0 + c_1.s = \mu + e\).
We then saw that it was easy to define an addition operation \(\texttt{CAdd}(c,c')\) on ciphertexts, so that once we decrypt the output of \(\texttt{CAdd}\) we approximately get the addition of the two underlying plaintexts.
The way to do this is to define \(\texttt{CAdd}\) as the following:
\(\texttt{CAdd}(c,c') = (c_0 + c_0',c_1+c_1') = c + c' = c_{add}\)
Indeed, if we apply the decryption operation on it we get:
\(\texttt{Decrypt}(c_{add},s) = c_0 + c_0' + (c_1 + c_1').s = c_0 + c_1.s + c_0' + c_1'.s = \texttt{Decrypt}(c,s) + \texttt{Decrypt}(c',s) \approx \mu + \mu'\)
Here we see that the add operation on ciphertexts is pretty straightforward, we simply need to add the two ciphertexts, and we can use the regular decryption operation to decrypt the result to get the addition of the two underlying plaintexts.
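The whole addition pipeline can be checked numerically with toy, insecure parameters. The sketch below is illustrative only (not a real CKKS library); polynomial arithmetic in \(\mathcal{R}_q\) is done by direct negacyclic convolution:

```python
import numpy as np

N, q = 4, 2**15          # ring Z_q[X]/(X^N + 1); tiny, insecure parameters
rng = np.random.default_rng(0)

def polymul(a, b):
    """Multiply two polynomials in Z_q[X]/(X^N + 1) (negacyclic convolution)."""
    res = np.zeros(N, dtype=np.int64)
    for i in range(N):
        for j in range(N):
            if i + j < N:
                res[i + j] += a[i] * b[j]
            else:
                res[i + j - N] -= a[i] * b[j]   # X^N = -1
    return res % q

def centered(x):
    """Lift coefficients to the centered interval around 0."""
    return ((x + q // 2) % q) - q // 2

# Key generation: small secret s, public key p = (b, a) = (-a*s + e, a)
s = rng.integers(-1, 2, N)
a = rng.integers(0, q, N).astype(np.int64)
e = rng.integers(-1, 2, N)
b = (-polymul(a, s) + e) % q

def encrypt(mu):
    return ((b + mu) % q, a)            # c = (mu, 0) + p

def decrypt(c):
    return centered((c[0] + polymul(c[1], s)) % q)   # = mu + e

mu1, mu2 = np.array([1, 2, 3, 4]), np.array([5, 6, 7, 8])
c1, c2 = encrypt(mu1), encrypt(mu2)
c_add = ((c1[0] + c2[0]) % q, (c1[1] + c2[1]) % q)   # CAdd: componentwise
print(decrypt(c_add))   # ≈ mu1 + mu2 = [6, 8, 10, 12], up to small noise
```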
We will see that both the multiplication and the decryption operation are more complex when doing ciphertext-ciphertext multiplication.
So our goal now is to find operations \(\texttt{CMult}, \texttt{DecryptMult}\) such that for two ciphertexts \(c,c'\) we have:
\(\texttt{DecryptMult}(\texttt{CMult}(c,c'),s) = \texttt{Decrypt}(c,s) . \texttt{Decrypt}(c',s)\).
Remember that \(\texttt{Decrypt}(c,s) = c_0 + c_1 .s\). So if we develop the expression above we get:
\(\texttt{Decrypt}(c,s) . \texttt{Decrypt}(c',s) = (c_0 + c_1.s) . (c_0' + c_1'.s) = c_0.c_0' + (c_0.c_1' + c_0'.c_1).s + c_1.c_1'.s^2 = d_0 + d_1.s + d_2.s^2\)
with \(d_0 = c_0.c_0', d_1 = (c_0.c_1' + c_0'.c_1), d_2 = c_1.c_1'\)
Interesting! If we think about it, evaluating \(\texttt{Decrypt}(c,s) = c_0 + c_1.s\) can be seen as a polynomial evaluation on the secret key \(s\), and it is a polynomial of degree one of the form \(c_0 + c_1.S\), with \(S\) the polynomial variable.
Similarly, the decryption of the product of two ciphertexts can be seen as the degree-2 polynomial \(d_0 + d_1.S + d_2.S^2\) evaluated on the secret key \(s\).
Therefore we see that we could use the following operations to do ciphertext-ciphertext multiplication: \(\texttt{CMult}(c,c') = (d_0, d_1, d_2)\), and \(\texttt{DecryptMult}(c_{mult},s) = d_0 + d_1.s + d_2.s^2\).
Using such operations could work; nonetheless, there is one problem: the size of the ciphertext grew! Indeed, while ciphertexts were usually only a couple of polynomials, here we have 3 polynomials for our ciphertext. By the same reasoning, if we do nothing, to correctly decrypt the next product we will need 5 polynomials, then 9, and so on. Therefore the size of the ciphertext will grow exponentially, and it will not be usable in practice if we define ciphertext-ciphertext multiplication like that.
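To make the three-polynomial product concrete, here is a toy sketch (insecure parameters; `polymul`, `center`, `cmult` and `decrypt_mult` are hypothetical helper names, not a library API) showing that \((d_0, d_1, d_2)\) indeed decrypts, via \(d_0 + d_1.s + d_2.s^2\), to roughly the product of the two plaintexts:

```python
import numpy as np

# Toy parameters (insecure, illustration only): R_q = Z_q[X]/(X^N + 1)
N, q = 4, 2 ** 15

def polymul(a, b):
    """Negacyclic convolution: polynomial product modulo X^N + 1."""
    res = np.zeros(N, dtype=np.int64)
    for i in range(N):
        for j in range(N):
            if i + j < N:
                res[i + j] += a[i] * b[j]
            else:
                res[i + j - N] -= a[i] * b[j]  # X^N = -1
    return res

def center(a):
    """Reduce coefficients mod q into a centered range around 0."""
    a = np.asarray(a) % q
    return np.where(a > q // 2, a - q, a)

rng = np.random.default_rng(seed=1)
s = rng.integers(-1, 2, N)                            # small ternary secret key
a = rng.integers(0, q, N)
b = center(-polymul(a, s) + rng.integers(-2, 3, N))   # public key (b, a)

def encrypt(mu):
    return (center(b + mu), a.copy())

def cmult(c, cp):
    """CMult(c, c') = (d0, d1, d2): a degree-2 'ciphertext' of three polynomials."""
    d0 = center(polymul(c[0], cp[0]))
    d1 = center(polymul(c[0], cp[1]) + polymul(cp[0], c[1]))
    d2 = center(polymul(c[1], cp[1]))
    return (d0, d1, d2)

def decrypt_mult(d):
    """DecryptMult(d, s) = d0 + d1.s + d2.s^2."""
    s2 = polymul(s, s)
    return center(d[0] + polymul(d[1], s) + polymul(d[2], s2))

mu1 = np.array([3, 0, -2, 5])
mu2 = np.array([1, 4, 0, -3])
d = cmult(encrypt(mu1), encrypt(mu2))
print(decrypt_mult(d))  # ~ polymul(mu1, mu2), up to cross-noise terms
```

Note the small plaintext coefficients: the product of the plaintext polynomials (plus the cross terms with the noise) must stay well below \(q/2\), otherwise the result wraps around the modulus.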
We need to find a way to do multiplication without increasing the ciphertext size at each step. That's where relinearization comes in!
So we saw that we could define the multiplication between ciphertexts as the operation \(\texttt{CMult}(c,c') = (d_0, d_1, d_2)\). The problem is that now, the output is a ciphertext of dimension 3, and if we keep increasing the size of the ciphertexts after each computation it will become infeasible to use in practice.
So let's think about it, what is the problem? The problem is that we have a third term that is needed, the \(d_2\) term that is used for the polynomial decryption \(\texttt{DecryptMult}(c_{mult}, s) = d_0 + d_1.s + d_2.s^2 \). But what if we could somehow find a way to compute the \(d_2.s^2\) with only a polynomial of degree 1 as the regular decryption? Then in that case the size of the ciphertexts would be constant, they would all be only a couple of polynomials!
That's the very essence of relinearization: finding a couple of polynomials \((d_0',d_1') = \texttt{Relin}(c_{mult})\) such that:
\(\texttt{Decrypt}((d_0',d_1'),s) = d_0' + d_1'.s = d_0 + d_1.s + d_2.s^2 = \texttt{Decrypt}(c,s) . \texttt{Decrypt}(c',s)\)
i.e. relinearization gives us a polynomial couple, not a triple, such that once it is decrypted using the regular decryption circuit, which only needs the secret key and not its square, we get the multiplication of the two underlying plaintexts.
Therefore, if we perform relinearization after each ciphertext-ciphertext multiplication, we will always have ciphertexts of the same size, with the same decryption circuit!
Now you may wonder how we actually need to define \(\texttt{Relin}\) to get such a result. The idea is simple: we know that we need a couple of polynomials such that \(d_0' + d_1'.s = d_0 + d_1.s + d_2.s^2\). We will define \((d_0',d_1') = (d_0,d_1) + P\) where \(P\) is a couple of polynomials such that \(\texttt{Decrypt}(P,s) = d_2.s^2\).
This way, when we evaluate the decryption circuit on \((d_0',d_1')\) we get:
\(\texttt{Decrypt}((d_0',d_1'),s) = \texttt{Decrypt}((d_0,d_1),s) + \texttt{Decrypt}(P,s) = d_0 + d_1.s + d_2.s^2\)
One way to do this is to provide an evaluation key, which will serve to compute \(P\). Let \(evk = (-a_0.s + e_0 + s^2, a_0)\), with \(e_0\) a small random polynomial, and \(a_0\) a uniformly sampled polynomial on \(\mathcal{R}_q\). Then \(\texttt{Decrypt}(evk,s) = e_0 + s^2 \approx s^2\). That's good! We see that we can publicly share the evaluation key with the party performing the computation, as the secret will be hard to extract based on the RLWE problem, and it can be used to obtain the square term.
So one possible candidate for \(P\), the ciphertext which should decrypt to \(d_2.s^2\), could simply be \(P = d_2.evk = (d_2.(-a_0.s + e_0 + s^2), d_2.a_0)\). Indeed, as we saw, we have \(\texttt{Decrypt}(P,s) = d_2.s^2 + d_2.e_0\). So can we proceed as usual and consider that \(d_2.s^2 + d_2.e_0 \approx d_2.s^2\)?
Unfortunately we cannot, because the \(d_2.e_0\) term is much bigger than the usual noise. So far we allowed approximate results because the error polynomial stayed small, for instance a sum of small polynomials that would not affect the result too much. The problem here is that \(d_2\) is big: it is \(d_2 = c_1.c_1'\), where each \(c_1\) comes from a polynomial \(a\) uniformly sampled on \(\mathcal{R}_q\), so it is much bigger than the small error polynomials we usually deal with.
So how do we deal with this in practice? The trick is to modify the evaluation key a bit, and define it as \(evk = (-a_0.s + e_0 + p.s^2, a_0) (\text{mod } p.q)\), with \(p\) a big integer, and \(a_0\) uniformly sampled from \(\mathcal{R}_{p.q}\). The idea is that we will divide by \(p\) to reduce the noise induced by the multiplication with \(d_2\), so in the end we have:
\(P = \lfloor p^{-1}.d_2.evk \rceil (\text{mod } q)\), which means we will divide by \(p\) and round it to the nearest integer, and work with modulo \(q\) (and not \(p.q\)).
At last we have our candidate! So to define relinearization, we need an evaluation key (which can be made public without risk), and we define:
\(\texttt{Relin}((d_0,d_1,d_2),evk) = (d_0,d_1) + \lfloor p^{-1}.d_2.evk \rceil\)
So if we have two ciphertexts \(c,c'\), and we want to multiply them (possibly several times) then decrypt the result, the workflow would be: compute \(c_{mult} = \texttt{CMult}(c,c')\), apply \(\texttt{Relin}(c_{mult},evk)\) to get back a ciphertext of two polynomials, repeat these two steps for any further multiplications, and finally apply the regular \(\texttt{Decrypt}\) operation with the secret key.
I only gave the overall idea here, but if you are interested in the details, I recommend reading the explanation in the paper "Somewhat Practical Fully Homomorphic Encryption".
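As a toy end-to-end sketch of the multiply-relinearize-decrypt workflow (insecure parameters; `relin` and the other helpers are hypothetical names for this sketch, not a library API):

```python
import numpy as np

# Toy parameters (insecure). p is the big auxiliary modulus used by the evk.
N, q = 4, 2 ** 15
p = 2 ** 30

def polymul(a, b):
    """Negacyclic convolution: polynomial product modulo X^N + 1."""
    res = np.zeros(N, dtype=np.int64)
    for i in range(N):
        for j in range(N):
            if i + j < N:
                res[i + j] += a[i] * b[j]
            else:
                res[i + j - N] -= a[i] * b[j]  # X^N = -1
    return res

def center(a, mod=q):
    """Reduce coefficients mod `mod` into a centered range around 0."""
    a = np.asarray(a) % mod
    return np.where(a > mod // 2, a - mod, a)

rng = np.random.default_rng(seed=2)
s = rng.integers(-1, 2, N)                            # small ternary secret key
a = rng.integers(0, q, N)
b = center(-polymul(a, s) + rng.integers(-2, 3, N))   # public key (b, a)

# Evaluation key: evk = (-a0.s + e0 + p.s^2, a0) mod p.q
a0 = rng.integers(0, p * q, N)
evk0 = center(-polymul(a0, s) + rng.integers(-2, 3, N) + p * polymul(s, s), p * q)
evk = (evk0, a0)

def encrypt(mu):
    return (center(b + mu), a.copy())

def cmult(c, cp):
    """CMult(c, c') = (d0, d1, d2)."""
    return (center(polymul(c[0], cp[0])),
            center(polymul(c[0], cp[1]) + polymul(cp[0], c[1])),
            center(polymul(c[1], cp[1])))

def relin(d):
    """Relin((d0,d1,d2), evk) = (d0,d1) + round(p^-1 . d2 . evk) mod q."""
    p0 = np.rint(center(polymul(d[2], evk[0]), p * q) / p).astype(np.int64)
    p1 = np.rint(center(polymul(d[2], evk[1]), p * q) / p).astype(np.int64)
    return (center(d[0] + p0), center(d[1] + p1))

def decrypt(c):
    """Regular decryption works again on the relinearized pair: c0 + c1.s."""
    return center(c[0] + polymul(c[1], s))

mu1 = np.array([3, 0, -2, 5])
mu2 = np.array([1, 4, 0, -3])
c_mult = relin(cmult(encrypt(mu1), encrypt(mu2)))  # back to two polynomials
print(decrypt(c_mult))  # ~ polymul(mu1, mu2)
```

Note how `relin` returns an ordinary two-polynomial ciphertext, so the regular `decrypt` (no \(s^2\) term) recovers the product, and further multiplications keep the ciphertext size constant.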
So now we know how to multiply two ciphertexts together and keep their size constant. That's great! While you might think it's over, we will see that there is one last step needed to have a homomorphic encryption scheme: rescaling. We will see in the next article what rescaling is and how it can be done, before coding everything ourselves!
]]>
What is the State of AI Report?
The State of AI Report analyses the most interesting developments in AI. It aims to trigger an informed conversation about the state of AI and its implication for the future. The Report is produced by AI investors Nathan Benaich and Ian Hogarth.
Now in its third year, the State of AI Report 2020 features invited contributions from a range of well-known and up-and-coming companies and research groups. The Report considers the following key dimensions:
Now coming to OpenMined's section at the State of AI Report 2020:
OpenMined was mentioned under two headlines in the State of AI Report:
1) OpenMined, the leading open-source community for privacy-preserving ML, demonstrates the first open-source federated learning platform for web, mobile, server and IoT (in collaboration with PyTorch):
2) Prospective testing begins for privacy-preserving AI applied to medical imaging (in collaboration with the Technical University of Munich, Imperial College London and CrypTen):
Check out OpenMined privacy conference videos here
If you want to join our mission of making the world more privacy-preserving:
Website: confusedmatrix | Github: @confusedmatrix | Twitter: @confusedmatrix
Where are you based?
"Staffordshire, UK"
What do you do?
"3rd year PhD student at Keele University. My project is on Federated Learning for Smart Energy Applications. If you’re interested in reading my work, feel free to check out my research page."
What are your specialties?
"My background is in software engineering, specifically in web development so I’m proficient with most standard JavaScript front-end web stacks as well as databases, backends based on Python, PHP, NodeJS and Go, as well as DevOps involving Linux and Docker. More recently my skills have developed significantly in areas of machine learning, deep learning and data science more generally."
How and when did you originally come across OpenMined?
"When I first started my PhD I was initially looking at a different topic (streaming data analysis). However once I’d switched to exploring federated learning and data analysis on private data, OpenMined’s excellent work would have been difficult to miss!"
What was the first thing you started working on within OpenMined?
"I was a bit of a lurker in the Slack group for over a year; answering the odd question from new members here and there about federated learning. I saw a post from Andrew Trask that he was interested in putting on an OpenMined conference and as I’d previously been involved in planning a conference elsewhere, I nominated myself to help out. And so it came to be that I ended up leading the organisation of the OpenMined Privacy Conference 2020 - the result of which I’m extremely proud of."
And what are you working on now?
"Well, the conference was a LOT of work, so I’m enjoying getting back into my PhD research. However the conference has allowed me to form some great connections with other OM researchers which looks likely to bear fruit in terms of a joint submission to a workshop at NeurIPS 2020. Beyond that, I’d like to remain useful in the OpenMined community in any way I can going forward."
What would you say to someone who wants to start contributing?
"Don’t be a lurker in Slack like I was. Stick your head out and say “Hi, how can I help out?”. OpenMined is the most welcoming open-source community you can hope to be a part of."
Please recommend one interesting book, podcast or resource to the OpenMined community.
"Just one? OK, how about The Age of Surveillance Capitalism by Shoshana Zuboff - an excellent exposition of how tech companies are systematically gathering and using personal data at great cost to our privacy and perceived liberties. It’s a loooooong read, but worth it 🙂."]]>
In this tutorial, you are going to learn how to set up PySyft, a privacy-preserving machine learning framework, on Windows 10. We are going to work with version 0.2.9, the latest version as of the time of writing.
This tutorial is organized into 4 different steps:
Step 1 - Enable the Linux Subsystem
Step 2 - Install an Ubuntu Distribution
Step 3 - Install Conda
Step 4 - Install PySyft
As of the time of writing, PyTorch v1.4.0 is not available for download on Windows 10. Therefore, we will need to install a Linux subsystem and then install PySyft there.
Creating a Linux subsystem in Windows is super easy and convenient for installing Unix-friendly frameworks like PyTorch and PySyft. The first thing we need to do is enable a Linux subsystem in Windows by turning on Windows Features.
In the Windows Search bar, search for "Turn Windows features on or off", then click on the corresponding option that comes up from the control panel.
In the window popping up, add a tick to the checkbox "Windows Subsystem for Linux", then click on "OK".
Now we need to head over to the Microsoft Store. In the search bar on the top-right corner, enter "Linux" and perform a Search. Several different Linux distributions will be displayed, among which you can choose one:
Personally, I picked the first one from the Canonical Group, but any Ubuntu version >= 18.04 LTS should do. Now grab an Ubuntu distribution and install it by hitting the big blue button "Install".
Once the installation process finishes, just click on "Start" from the Microsoft Store of the Ubuntu app, or look for the installed Ubuntu app in the Windows search bar.
The first time you boot the Linux subsystem, you may need to wait a little while for it to finish installing. Then, you will be prompted to enter a username and sudo password.
After entering your username and password, you will be welcomed to your newly installed Linux subsystem. Congratulations!! Let's now proceed to Step 3.
The easiest way to manage Python and package dependencies on Linux systems is using Conda. So, we are now going to download Conda and use it to create an environment running Python 3.7, where we will install all the PySyft-related dependencies.
To do that, I followed this tutorial:
https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-quickstart
Step 3.1 - Download the latest version of Anaconda
Head over to the bottom of the Anaconda distribution page, select the Linux 64-bit (x86) installer, and download it from within the Linux subsystem. To download the installer, I entered the following commands in the Linux subsystem. A version based on Python 3.8 is fine, since we will install a different Python version with Conda anyway.
cd $HOME
wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh
Step 3.2 - Install Anaconda
Once the download completes, we just need to run the installer and follow the installation instructions. To start the installer, let's run:
chmod +x Anaconda3-2020.07-Linux-x86_64.sh
./Anaconda3-2020.07-Linux-x86_64.sh
Press ENTER to continue the installation:
Type "YES" to accept the license:
Press ENTER if the default installation directory is fine for you, or set a custom one:
Once the installation procedure completes, we just need to close the Ubuntu subsystem and re-open it, or run the following command to activate conda:
source ~/.bashrc
Finally, to test the installation, we can run the following command to view the available environments. By default, just the base environment is available:
conda info --envs
First of all, we need to create a new Conda environment running Python 3.7 that will host all the PySyft dependencies. To do that, let's run this command to create an environment named syft_python:
conda create --name syft_python python=3.7
Let's now activate this environment, so that when using pip commands to install packages, we will be installing those packages in this newly created environment:
conda activate syft_python
Let's clone the PySyft repository and navigate into the cloned PySyft folder:
git clone https://github.com/OpenMined/PySyft.git
Let's install some basic dependencies from the command line to be able to install PySyft's dependencies:
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install gcc g++
Let's install PySyft's dependencies contained in the requirements.txt file of the pip-dep folder:
cd $HOME/PySyft/pip-dep
pip install -r requirements.txt
Finally, once the dependencies' installation process has finished, let's install PySyft:
cd $HOME/PySyft
python setup.py install --user
If everything went well, and PySyft has been setup correctly, you should see a screen like the following one, meaning all the dependencies have been resolved correctly:
Now, let's test PySyft by importing it as a package in Python and check if it has been configured correctly:
cd $HOME
python
import syft
If no error appears, that means you have successfully installed PySyft on your Linux sub-system in Windows! Congratulations!!
]]>