Prochlo: Strong Privacy for Analytics in the Crowd
System analytics is critical to modern software development, for example to gauge API adoption and to identify APIs that can be deprecated, across potentially millions of applications and thousands of APIs. Usually these data are collected into centralized databases on which various analytics are carried out. But the data may contain highly private information, so the problem is: how can such data be handled carefully?
In this paper, Bittau et al. propose a simple architecture, called ESA, for designing solutions to this problem. ESA consists of three stages: "E" stands for "encode," which encodes data on the client side; "S" stands for "shuffle," which provides anonymity at an intermediate level by shuffling data; and "A" stands for "analyze," which performs analytics on the server side. The work places particular emphasis on fitting software engineering practice, as it is driven by practical needs (the development of Fuchsia and Chrome).
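The three-stage pipeline can be illustrated with a minimal sketch. All names below (`encode`, `shuffle`, `analyze`, the report fields) are hypothetical illustrations of the stage boundaries, not Prochlo's actual API:

```python
import random
from collections import Counter

def encode(client_report):
    # client side: reduce the report to the minimal fields needed
    return (client_report["app"], client_report["api"])

def shuffle(reports):
    # intermediary: batch reports and randomly permute them,
    # discarding any ordering that could link reports to clients
    batch = list(reports)
    random.shuffle(batch)
    return batch

def analyze(batch):
    # server side: aggregate, e.g. count API usage per application
    return Counter(batch)

clients = [{"app": "mail", "api": "Net.send"},
           {"app": "mail", "api": "Net.send"},
           {"app": "maps", "api": "Gps.read"}]
usage = analyze(shuffle(encode(c) for c in clients))
```

The point of the split is that each stage sees strictly less identifying information than the one before it.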
The problem with the naive approach, collecting data into centralized databases and then analyzing it, is that reports of an uncommon API in an unpopular application could uniquely reveal the activity of a particular user. K-anonymity fails to solve this problem because it is vulnerable to intersection attacks and to statistical inference from auxiliary data. To defend the naive approach, defenders must reason about what attackers want to do and what they already know, which is a very challenging job.
Today differential privacy is the gold standard because of the strong notion it provides. The defender no longer needs to reason about what the attacker might know or what they are trying to do: what the attacker can learn is bounded by the parameter epsilon. As a result, the output of analytics under differential privacy is safe. However, even though the output is safe, there is still a centralized database serving the DP analytics, and the defender must protect that database forever. An alternative approach is so-called "randomized response" local DP, which essentially adds random noise to each user's response. However, it handles multidimensional, sparse data poorly: the results are too coarse and too noisy to be usable for API monitoring.
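As a reference point for why local noise hurts utility, here is a minimal sketch of classic binary randomized response (a toy, not the paper's RAPPOR mechanism): each user reports the truth with probability `p` and the opposite otherwise, and the analyzer inverts the noise in aggregate. The `p = 0.75` setting corresponds to epsilon = ln(3):

```python
import random

def randomized_response(truth, p=0.75):
    # with probability p report the truth, otherwise flip it;
    # the attacker's inference is bounded by ln(p / (1 - p))
    return truth if random.random() < p else not truth

def estimate_rate(reports, p=0.75):
    # unbias the aggregate: observed = (2p - 1) * rate + (1 - p)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)
```

Each individual answer is deniable, but the noise floor grows with the number of distinct items being measured, which is why sparse multidimensional queries (like per-app API counts) come out too noisy.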
Local DP is one way of encoding user responses, and ESA provides other encoding strategies as well. "t-secret" sharing splits a secret S in a field F into arbitrary shares S1, S2, …, such that any t-1 or fewer shares statistically hide all information about S. Filtering provides a way for the analyzer to select reports about certain APIs in certain applications. Sampling allows the analyzer to collect data about certain APIs in certain applications from only a subset of users. Fragmenting enables the analyzer to collect each user's data split across different reports. By sending multiple reports for a user's data, the user is hidden in a crowd of identical fragments.
But to hide in a crowd, anonymity is required, and that is where the shuffler comes in. Shuffling strips away IP addresses and metadata, creates a large crowd by delaying and batching reports, and randomly permutes them to break the linkage between fragments and to hide ordering and timing information. Note that the authors' "express goal" is to have large latency; this is a feature as well as an obvious limitation of the approach. The in-class discussion suggested that shuffling may be the only way to provide real anonymity, and that the latency is therefore destined to be large.
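The batching-and-delaying behavior can be sketched as follows. This is a hypothetical toy, not Prochlo's shuffler: metadata is dropped on receipt, and nothing is emitted until the crowd is large enough.

```python
import random

class Shuffler:
    """Toy shuffler: batch reports, strip metadata, and emit them
    in random order only once the crowd reaches batch_size."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []

    def receive(self, report, client_ip=None, timestamp=None):
        # client_ip and timestamp are deliberately dropped, never stored
        self.pending.append(report)

    def flush(self):
        if len(self.pending) < self.batch_size:
            return None  # keep delaying until the crowd is large enough
        batch, self.pending = self.pending, []
        random.shuffle(batch)  # break ordering/timing linkage
        return batch
```

The `flush` returning `None` until the batch fills is exactly the deliberate latency discussed above: a small batch would be a small crowd.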
Nested onion encryption is used to protect the data in and after the encoding step. Each report arrives at the shuffler wrapped in layers of encryption; the shuffler removes the outer layer and shuffles the envelopes. Due to the nesting, the envelopes remain encrypted for the analyzer. This raises a new problem: the shuffling process itself must be protected, isolated, and opaque, against risks such as insiders and accidental server logs. Therefore the implementation of the shuffler must be hardened. In Prochlo, the implementation of ESA, the shuffler is hardened with Intel SGX and a novel oblivious shuffling algorithm.
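The layer-peeling structure can be demonstrated with a toy symmetric cipher. The XOR keystream below is a stand-in for the real public-key cryptography a deployment would use; the key names and report format are made up for illustration:

```python
import hashlib

def keystream(key, n):
    # toy SHA-256 counter-mode keystream (illustration only, not secure)
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def xor_layer(key, data):
    # XOR stream cipher: applying it twice with the same key decrypts
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# client: wrap the report for the analyzer, then for the shuffler
report = b"app=foo api=Bar.baz"
inner = xor_layer(b"analyzer-key", report)    # only the analyzer peels this
envelope = xor_layer(b"shuffler-key", inner)  # outer layer for the shuffler

# shuffler: peel the outer layer, but the payload stays ciphertext
peeled = xor_layer(b"shuffler-key", envelope)

# analyzer: peel the inner layer and recover the report
recovered = xor_layer(b"analyzer-key", peeled)
```

The key property is that the shuffler sees who sent each envelope but not its contents, while the analyzer sees the contents but, thanks to shuffling, not who sent them.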
The experimental results show that Prochlo outperforms alternative methods (Partition and RAPPOR) by a significant margin in terms of data utility, and that it can provide stronger privacy guarantees than K-anonymity at only a marginal cost in utility. In fact, the system has been used in Google's Fuchsia project.