Algorithms recommend products when we shop online or suggest songs we might like when we stream music.
These algorithms work by using personal information, such as our previous purchases and browsing history, to create personalized recommendations. The sensitive nature of such data makes maintaining privacy extremely important, but existing methods for solving this problem rely on heavy cryptographic tools that require huge amounts of computation and bandwidth.
MIT researchers may have a better solution. They have developed a privacy-preserving protocol that is so efficient it can run on a smartphone over a very slow network. Their technique protects personal data while ensuring that recommendation results are accurate.
In addition to protecting user privacy, their protocol minimizes the unauthorized transfer of information from the database, known as leakage, even if a malicious agent tries to trick the database into revealing sensitive information.
The new protocol could be particularly useful in situations where a data leak could violate user privacy laws, such as when a healthcare provider uses a patient’s medical history to search a database of other patients who have had similar symptoms, or when a company serves targeted advertising to users under European privacy regulations.
“This is a really difficult problem. We relied on a number of cryptographic and algorithmic tricks to develop our protocol,” says Sasha Servan-Schreiber, a graduate student at the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper introducing the new protocol.
Servan-Schreiber co-authored the paper with CSAIL graduate student Simon Langowski and their advisor and senior author Srinivas Devadas, the Edwin Sibley Webster Professor of Electrical Engineering. The research will be presented at the IEEE Symposium on Security and Privacy.
The data is nearby
The technique behind algorithmic recommendation engines is known as nearest neighbor search, which involves searching for the data point in the database that is closest to the query point. Data points that appear nearby have similar attributes and are called neighbors.
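In its simplest, non-private form, nearest neighbor search is just a scan over the database for the point at the smallest distance from the query. A minimal sketch (toy data, Euclidean distance; the paper's protocol performs this search privately and approximately):

```python
import math

def nearest_neighbor(database, query):
    """Return the index of the database point closest to the query,
    scanning every point and comparing Euclidean distances."""
    best_id, best_dist = None, math.inf
    for idx, point in enumerate(database):
        dist = math.dist(point, query)
        if dist < best_dist:
            best_id, best_dist = idx, dist
    return best_id

# Toy feature vectors standing in for songs in a catalog.
songs = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]
print(nearest_neighbor(songs, (0.8, 0.2)))  # prints 0: the closest song
```

The privacy challenge the researchers tackle is doing this same computation when the server must not see the query vector and the client must not see the database.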
These searches involve a server linked to an online database that contains concise representations of data point attributes. In the case of a music streaming service, these attributes, known as feature vectors, can be the genre or popularity of different songs.
To find a song recommendation, the client (user) sends the server a request containing a specific feature vector, such as the genre of music the user likes or a compressed history of their listening habits. The server then provides the ID of the feature vector in the database that is closest to the client's request, without revealing the vector itself. In the case of streaming music, this ID would likely be a song title. The client learns the recommended song title without learning the associated feature vector.
“The server needs to be able to do these calculations without seeing the numbers it’s doing the calculations with. It can’t actually see the features, but it should still give you the closest information in the database,” says Langowski.
To achieve this, the researchers created a protocol that relies on two separate servers accessing the same database. Using two servers makes the process more efficient and allows the use of a cryptographic method known as private information retrieval. This technique allows a client to query a database without revealing what they are looking for, Servan-Schreiber explains.
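The classic two-server flavor of private information retrieval works by having the client split its query into two random-looking shares, one per server; either share alone is statistically independent of the queried index. A minimal sketch of this textbook XOR-based construction (not the paper's full protocol, and using toy integer records):

```python
import secrets

def pir_query(db_size, index):
    """Client: build two random bit masks that differ only at the
    desired index. Each server sees one mask, which on its own
    reveals nothing about which record the client wants."""
    mask_a = [secrets.randbelow(2) for _ in range(db_size)]
    mask_b = mask_a.copy()
    mask_b[index] ^= 1  # the single position where the masks differ
    return mask_a, mask_b

def pir_answer(database, mask):
    """Server: XOR together every record selected by the mask."""
    result = 0
    for record, bit in zip(database, mask):
        if bit:
            result ^= record
    return result

database = [17, 42, 7, 99]  # toy records held by both servers
qa, qb = pir_query(len(database), 1)
answer = pir_answer(database, qa) ^ pir_answer(database, qb)
# answer reconstructs database[1] == 42; neither server learned the index
```

The scheme relies on the two servers not comparing their masks, which is why the protocol assumes they do not collude.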
Overcoming security issues
But while private information retrieval is secure on the client side, it does not by itself ensure database privacy. To enable the search, the servers send the client a set of candidate vectors — possible nearest neighbors — which the client then evaluates by brute force. Those candidates, however, can reveal a lot about the database, so an additional privacy requirement is that the client learn nothing about any vectors beyond its actual nearest neighbor.
The researchers applied a tuning technique that first removes many redundant candidate vectors, and then a second trick, which they call masking, to hide all data points other than the actual nearest neighbor. This effectively preserves database privacy: the client learns nothing about the other feature vectors in the database.
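The paper's masking step is a cryptographic construction; the conceptual effect, though, can be illustrated plainly: of all the candidate slots in a response, only the one holding the true nearest neighbor survives, while every other slot is overwritten with random noise. A hedged, non-cryptographic sketch of that idea (function names are illustrative, not from the paper):

```python
import secrets

def mask_candidates(candidate_ids, distances):
    """Conceptual sketch: keep only the ID of the closest candidate
    and overwrite every other slot with a random 32-bit value, so the
    response reveals nothing about the losing candidates."""
    winner = min(range(len(distances)), key=distances.__getitem__)
    return [candidate_ids[i] if i == winner else secrets.randbits(32)
            for i in range(len(candidate_ids))]

ids = [101, 102, 103]        # toy candidate record IDs
dists = [0.9, 0.1, 0.5]      # distances to the client's query
masked = mask_candidates(ids, dists)
# masked[1] == 102, the nearest neighbor; the other slots are noise
```

In the actual protocol this masking happens obliviously, without the servers learning which slot holds the winner.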
While developing the protocol, they tested it with a non-private implementation on four real-world datasets to determine how to tune the algorithm for maximum accuracy. They then used their protocol to perform private nearest neighbor searches on those datasets.
Their technique requires a few seconds of server processing per request and less than 10 megabytes of communication between the client and the servers, even for databases with more than 10 million items. By contrast, other secure methods can require gigabytes of bandwidth or hours of computation time. Their method achieved over 95 percent accuracy per query, meaning it found the actual nearest neighbor to the query point almost every time.
The techniques they used to keep the database private would prevent a malicious client from sending out bogus requests to try to trick the server into leaking information.
“A malicious client won’t learn much more information than an honest client following the protocol. And it protects against malicious servers, too. If someone deviates from the protocol, you might not get the right result, but they’ll never learn what the client’s request was,” Langowski says.
In the future, the researchers plan to adapt the protocol so that it can maintain privacy with only a single server. That could open it up to more real-world deployments, since it would no longer require two noncolluding entities (servers that do not share information with each other) to manage the database.
“Nearest neighbor search undergirds many critical machine learning-driven applications, from providing users with content recommendations to classifying diseases. However, aggregation and search typically require a large amount of data to be fed into a central system,” says Bayan Bruss, head of applied machine learning research at Capital One, who was not involved in this work. “This research is a key step toward ensuring that users receive the benefits of nearest neighbor search while having confidence that the central system will not use their data for other purposes.”