This post is part of our Privacy-Preserving Data Science, Explained series.

In a previous post we looked at the Diffie-Hellman key exchange protocol, which two parties can use to agree upon a secret key without an eavesdropper discovering what that key is. We saw previously how a private set intersection (PSI) protocol could be built on top of the Paillier cryptosystem. In this post, we’ll see how to do the same thing based on Diffie-Hellman instead.

## The protocol

As a reminder, PSI is where each of two parties has a set of elements, and one or both of them want(s) to discover the intersection between those sets (a) without revealing anything other than the intersection to each other, and (b) without revealing anything sensitive about their sets to an eavesdropper.

Suppose Alice (the client) and Bob (the server) each has a set of elements from some domain and Alice wants to discover the intersection of her set with Bob’s set, but Alice doesn't want to reveal her set to Bob and Bob doesn’t want to reveal what’s in his set to Alice, besides the intersection. To build a PSI protocol on top of Diffie-Hellman, we essentially generate two sets of shared keys, one key for each element in the two sets. So any shared key which is found in both of these sets indicates that the corresponding element is a member of both parties’ original sets. I’ll clarify a couple of points in a moment, but for now here’s the basic process:

1. Alice and Bob agree on a large prime $p$.
2. Alice randomly generates a private key $a$.
3. Alice repeatedly hashes each of the values in her original set until they are all primitive roots modulo $p$. (See below.)
4. For each of these hashed values, Alice calculates $g^a \mod p$ where $g$ is the hashed value.
5. Alice sends $p$ and her calculated values to Bob.
6. Bob randomly generates a private key $b$.
7. Bob repeatedly hashes each of the values in his original set until they are all primitive roots modulo $p$.
8. For each of these hashed values, Bob calculates $g^b \mod p$) where $g$ is the hashed value.
9. Bob calculates shared keys corresponding to each element in Alice’s original set by raising the values received from Alice to the power of his private key, i.e. $g^{ab} \mod p$ for each hashed value $g$.
10. Bob sends his calculated values ($g^b \mod p$) to Alice, as well as the calculated shared keys corresponding to the elements in Alice’s original set.
11. Alice calculates the shared keys corresponding to each element in Bob’s original set by raising the values received from Bob to the power of her private key, i.e. $g^{ba} = g^{ab} \mod p$ for each of Bob’s hashed values, $g$.
12. Alice compares the shared keys calculated from the elements in her own original set with the shared keys calculated using Bob’s elements. The intersection consists of those elements in Alice’s original set whose shared key can also be found in the set of shared keys calculated from the elements in Bob’s original set.

Note that the hashing in steps 3 and 7 is necessary because the Diffie-Hellman protocol requires the base value $g$ to be a primitive root modulo $p$.

In the end Bob only sees the values Alice hashed and obscured using her private key ($g^a \mod p$), while Alice only sees the similarly obscured values from Bob’s set, as well as her own obscured values raised to the power of Bob’s private key ($g^{ab} \mod p$). Of course Eve the eavesdropper also sees these values. When using Diffie-Hellman for key exchange, these shared keys are sensitive, but in our PSI protocol they are also visible to Eve. That’s fine though, because they aren’t then used for encryption, just comparison. The important thing is that without Alice’s private key (or Bob’s), Eve cannot reconstruct either of their sets (the sensitive data in this situation) or find out anything about their intersection (because she only sees one set of shared keys).

## Extending the protocol

As with PSI based on the Paillier cryptosystem, this protocol can be extended to reveal only the intersection size to the client. In order to determine the intersection, it is important that Bob returns the shared keys corresponding to Alice’s values in the same order he receives them, so that Alice knows which elements in her set are the corresponding ones. However, if Bob shuffles those shared keys before returning them to Alice, Alice will not be able to connect the shared keys back to specific elements, though she will still be able to see how many elements are common to the two sets. One simple way for Bob to do this would be for him to sort the shared keys numerically, since without his secret key the values should appear uniform.

It’s not possible (as far as I’m aware) to develop this into a protocol which just gives Alice a boolean answer “intersection is empty” or “intersection is not empty”, as we were able to with the Paillier-based PSI protocol.

## Code

As usual, I have implemented this PSI protocol in my learning repository, and the usual warning applies...

WARNING: This library is not recommended for production use. It was written for learning purposes only.

Here’s how to use that implementation:

const {
Client,
Server,
} = require("./build/private-set-intersection/diffie-hellman");

const bitLength = 8;
const clientSet = new Set([0, 5, 10, 15, 20, 25, 30, 35, 40, 45]);
const serverSet = new Set([0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48]);

const server = new Server(serverSet, { bitLength }); // Slow! Generating keys.
const { p } = server;
const client = new Client(clientSet, { bitLength, p });

const intermediateValues = client.prepareIntermediateValues();
const response = server.revealIntersection(intermediateValues);
const intersection = client.handleIntersectionResponse(response); // [0, 20, 40]