Is synthetic data privacy compliant?
Federated Learning [1], also known as Collaborative Learning or Privacy-Preserving Machine Learning, enables multiple entities that do not (fully) trust each other to collaboratively train a Machine Learning (ML) model over their combined dataset, without actually sharing the data. It thereby addresses critical issues such as privacy, access rights, and access to heterogeneous confidential data.
This is in contrast to traditional (centralized) ML techniques, where the local datasets belonging to the different entities first need to be brought to a common location before model training. Its applications span a number of industries, including defense, telecommunications, healthcare and advertising [2], as well as chatbots [3].
Federated Learning builds on a large body of existing research in the field of Secure Multiparty Computation (SMC).
Secure Multiparty Computation (SMC) allows a number of mutually distrustful parties to carry out a joint computation of a function of their inputs, while preserving the privacy of those inputs. The two main SMC primitives are Homomorphic Encryption and Secret Sharing. Both schemes have their pros and cons when it comes to securely computing basic arithmetic operations, such as addition and multiplication.
Homomorphic encryption schemes allow arithmetic operations on the plaintext values to be performed locally, by operating on their encrypted values. In secret sharing schemes, on the other hand, addition can be performed locally by adding the local (plaintext) shares, while multiplication requires distributed collaboration among the parties.
It is difficult to theoretically compare the performance of protocols based on the two schemes. For instance, [4] provides a performance comparison of the two schemes for a secure comparison protocol.
Homomorphic Encryption
Let E() and D() denote encryption and decryption, respectively, in the homomorphic encryption system. We require the homomorphic property to allow (modular) addition of the plaintexts. It then holds that

D(E(x1) · E(x2)) = x1 + x2

from which, by simple arithmetic (repeated addition), it follows that

D(E(x)^a) = a · x, for any plaintext integer a.
The homomorphic encryption system is public-key, i.e. any party can perform the encryption operation E() by itself. In a threshold encryption system, the decryption key is replaced by a distributed protocol: only if t or more of the m parties (t ≤ m) collaborate can they perform a decryption, and no coalition of fewer than t parties can decrypt a ciphertext. We require the collaboration of all parties, i.e. t = m (since we operate in the semi-honest model and do not consider faults), so a ciphertext can only be decrypted if all the parties collaborate.
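The additive homomorphic property is easy to try out. The sketch below assumes the open-source python-paillier library (phe); the threshold decryption described above is omitted, so a single private key decrypts.

```python
# Additively homomorphic encryption sketch (assumes the python-paillier / phe library).
from phe import paillier

# Key generation; in the threshold setting above, the private key would be
# replaced by a distributed decryption protocol among the m parties.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

c1 = public_key.encrypt(15)
c2 = public_key.encrypt(27)

c_sum = c1 + c2      # E(15) * E(27) = E(15 + 27), exposed as '+' on ciphertexts
c_scaled = c1 * 3    # E(15)^3 = E(3 * 15), multiplication by a plaintext scalar

print(private_key.decrypt(c_sum))     # 42
print(private_key.decrypt(c_scaled))  # 45
```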
Secret Sharing
Secret sharing refers to a method for distributing a secret amongst a group of parties, each of which is allocated a share of the secret. The secret can be reconstructed only when a sufficient number of shares are combined; individual shares are of no use on their own. In Shamir’s secret sharing scheme, the sharing of a secret x is achieved as follows: each party Xi holds a value f(i), where f is a random t-degree polynomial subject to the condition that f(0) = x.
It is easy to extend Shamir secret sharing to let the parties compute any linear combination of secrets without gaining information on intermediate results of the computation. To add (subtract) two shared secrets, the parties need only add (subtract) their individual shares at each evaluation point. Computing the product of two secrets is not so trivial, but it can still be reduced to a linear computation. Thus, it is possible to securely and robustly compute any “arithmetic” function (i.e. a function involving only addition, subtraction, and multiplication) of the secrets.
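As a minimal sketch of how this works in practice (field size and threshold below are chosen purely for illustration), the following implements Shamir sharing over a prime field and shows that adding two shared secrets is a purely local operation on the shares:

```python
# Shamir secret sharing over a prime field: a secret is hidden in a random degree-t
# polynomial f with f(0) = secret, and party i receives the share (i, f(i)).
import random

PRIME = 2**61 - 1  # prime modulus defining the field

def share(secret, t, m):
    """Split `secret` into m shares; any t + 1 of them suffice to reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t)]
    return [(i, sum(c * pow(i, k, PRIME) for k, c in enumerate(coeffs)) % PRIME)
            for i in range(1, m + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers f(0), i.e. the secret."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for k, (xk, _) in enumerate(shares):
            if k != j:
                num = num * (-xk) % PRIME
                den = den * (xj - xk) % PRIME
        secret = (secret + yj * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

# Two secrets shared among m = 5 parties with threshold t = 2.
a_shares = share(100, t=2, m=5)
b_shares = share(23, t=2, m=5)

# Addition is local: each party simply adds its own shares of a and b.
sum_shares = [(i, (ya + yb) % PRIME) for (i, ya), (_, yb) in zip(a_shares, b_shares)]

print(reconstruct(sum_shares[:3]))  # 123, from any t + 1 = 3 of the summed shares
```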
The advantage of Deep Learning (DL) is that the model selects the feature set by itself, without supervision, i.e. feature extraction is automated. This is achieved by training large-scale neural networks, referred to as Deep Neural Networks (DNNs), over large labeled datasets.
Training a DNN occurs over multiple iterations (epochs). Each forward run is coupled with a feedback loop, where the classification errors identified at the end of a run, with respect to the ground truth (training dataset), are fed back to the previous (hidden) layers to adapt their parameter weights; this is known as ‘backpropagation’. A sample DNN architecture is illustrated in Fig. 1.
A privacy preserving extension of the above NN training would average the locally trained models to obtain the global NN model [5]. The distributed architecture is illustrated in Fig. 2, and a toy sketch of the averaging step is given below.
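The sketch below is purely illustrative: logistic regression (trained with plain gradient descent) stands in for the local DNNs, the data is made up, and the weighted average is computed in the clear; the next paragraph discusses how it can instead be computed under secret sharing.

```python
# Toy federated averaging sketch: three parties train a model locally on their own
# (private) data, and only the learned weights are shared and averaged.
import numpy as np

rng = np.random.default_rng(0)

def local_train(X, y, epochs=200, lr=0.1):
    """Plain gradient descent on the logistic loss (stand-in for local DNN training)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # forward pass
        w -= lr * X.T @ (p - y) / len(y)        # backpropagated gradient step
    return w

local_weights, sizes = [], []
for _ in range(3):                               # three collaborating parties
    X = rng.normal(size=(200, 5))                # private local dataset
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(float)
    local_weights.append(local_train(X, y))
    sizes.append(len(y))

# Coordinating server: the global model is the (size-weighted) average of the
# local models; here the average is computed in the clear.
global_w = np.average(local_weights, axis=0, weights=sizes)
print(global_w)
```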
As explained above, the averaging can be performed by a Secret Sharing protocol, with the global model hosted by a Coordinating Server. Once trained, we can apply a Homomorphic Compiler (e.g. zama.ai) to output an encrypted model that can accept encrypted inputs, and also provide the model inference (e.g. prediction, classification) as an encrypted output value.
A privacy preserving ML pipeline can be designed using Secret Sharing for model training and Homomorphic Encryption for the inference part.
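To make the inference part concrete, the sketch below (again assuming the python-paillier phe library, and a simple linear scoring model for illustration) shows a client obtaining a prediction on encrypted inputs; non-linear DNN layers would require a fully homomorphic scheme or a compiler such as the zama.ai tooling mentioned above.

```python
# Encrypted inference sketch with additively homomorphic (Paillier) encryption.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Client side: encrypt the input features before sending them to the model owner.
features = [0.7, 1.2, -0.3]
enc_features = [public_key.encrypt(x) for x in features]

# Server side: the model owner holds the (plaintext) weights, e.g. obtained via
# federated training, and evaluates the model directly on the ciphertexts.
weights, bias = [0.5, -1.0, 2.0], 0.1
enc_score = public_key.encrypt(bias)
for w, e in zip(weights, enc_features):
    enc_score = enc_score + e * w     # ciphertext addition + plaintext scaling

# Client side: only the client holds the key to decrypt the resulting inference.
print(private_key.decrypt(enc_score))  # 0.7*0.5 + 1.2*(-1.0) - 0.3*2.0 + 0.1 = -1.35
```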
The main caveat of the above architecture is that the locally trained models need to be shared, and these may still contain proprietary information or leak insights related to the local training data [6]. To overcome this, [7] proposes POSEIDON, a Multiparty Homomorphic Encryption based NN training protocol which (relying on mini-batch gradient descent) protects the intermediate NN models by keeping the weights and gradients encrypted throughout the training phase. The protocol can be used to build different types of layers, such as fully connected, convolutional, and pooling layers. In terms of model accuracy, the authors show that their performance is comparable to that of a centrally trained model.
To summarize, this is an active area of research, and in the near future we will likely see different SMC protocols capable of training different NN architectures, each with its own trade-offs.
The availability of good quality data (in significant volumes) remains a concern for the success of ML/DL projects. Synthetic data generation aims to provide high quality data that is synthetically generated to closely resemble the original data.
Generative Adversarial Networks (GANs) have proven quite effective for synthetic data generation. Intuitively, a GAN can be considered a game between two networks: a Generator network and a Classifier network. The Classifier can, for example, be a Convolutional Neural Network (CNN) based image classifier that distinguishes samples as coming either from the actual distribution or from the Generator. Every time the Classifier spots a fake image, i.e. it notices a difference between the two distributions, the Generator adjusts its parameters accordingly. At the end (in theory), the Classifier will be unable to distinguish the two, implying that the Generator is able to reproduce the original dataset.
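A minimal sketch of this game is given below (in PyTorch, with a 1-D Gaussian standing in for the “original data”; the network sizes and hyperparameters are illustrative only). The training loop alternates between the two networks:

```python
# Toy GAN sketch: the Generator learns to mimic a 1-D Gaussian, while the
# Classifier (discriminator) learns to tell real samples from generated ones.
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 2.0 + 3.0   # original distribution N(3, 2)
noise = lambda n: torch.randn(n, 8)                    # Generator input

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # 1. Train the Classifier to separate real samples from generated (fake) ones.
    real, fake = real_data(64), G(noise(64)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2. Train the Generator to fool the Classifier, i.e. adjust its parameters
    #    whenever the Classifier can still tell the two distributions apart.
    fake = G(noise(64))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

# If training succeeds, the generated samples resemble the original distribution.
samples = G(noise(1000)).detach()
print(samples.mean().item(), samples.std().item())     # should approach (3.0, 2.0)
```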
Privacy regulations (e.g. the EU GDPR) restrict the Personally Identifiable Information (PII) that can be used for analytics. As such, there has been renewed interest in synthetic data and its potential to be privacy preserving: synthetic data that is close to (and generated based on) the original training data, in a way that is compliant with privacy regulations, while still allowing insights similar to those that could be derived from the original training data.
The premise is promising, and this has been accompanied by very optimistic messaging from both governmental organizations and commercial entities.
- NIST Differential Privacy Synthetic Data Challenge (link): “Propose an algorithm to develop differentially private synthetic datasets to enable the protection of personally identifiable information (PII) while maintaining a dataset’s utility for analysis.”
- Diagnosing the NHS — SynÆ (link): “ODI Leeds and NHS England will be working together to explore the potential of ‘synthetic data.’ This is data that has been created following the patterns identified in a real dataset but it contains no personal data, making it suitable to release as open data.”
- Statice (link): “Statice generates synthetic data — just like real data, but privacy-compliant”
- Hazy (link): “Hazy’s synthetic data generation lets you create business insight across company, legal and compliance boundaries — without moving or exposing your data.”
While the promise of privacy preserving synthetic data is valid, the truth is that such claims need to be taken with a ‘grain of salt’: there are currently numerous challenges to both making and evaluating them, and there is no common agreement today (or standard framework) on which privacy metric to use to even validate such claims.
With current synthetic data generation techniques, the protection level varies by user. Due to the randomness in the generation algorithms (e.g. GANs), it is difficult to predict which features the model will learn and which features an adversary will attack, implying that we cannot guarantee privacy protection for all users. [8] shows that synthetic data generated by a number of GAN models actually leaks more information, i.e. performs worse with respect to privacy metrics such as linkability and attribute inference, than the original training dataset.
References

[1] Li, T., Sahu, A.K., Talwalkar, A., & Smith, V. (2020). Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine, 37, 50–60.
[2] D. Biswas, S. Haller and F. Kerschbaum. Privacy-Preserving Outsourced Profiling. 12th IEEE Conference on Commerce and Enterprise Computing, Shanghai, 2010, pp. 136–143, doi: 10.1109/CEC.2010.39.
[3] D. Biswas. Privacy preserving Chatbot Conversation. 3rd NeurIPS Workshop on Privacy-preserving Machine Learning (PriML), 2020 (Paper) (Medium)
[4] F. Kerschbaum, D. Biswas and S. de Hoogh. Performance Comparison of Secure Comparison Protocols. 20th International Workshop on Database and Expert Systems Application, Linz, 2009, pp. 133–136, doi: 10.1109/DEXA.2009.37.
[5] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas. Federated Learning of Deep Networks using Model Averaging. CoRR, abs/1602.05629, 2016.
[6] Nasr, M., Shokri, R., & Houmansadr, A. (2019). Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning. 2019 IEEE Symposium on Security and Privacy (SP), 739–753.
[7] Sav, S., Pyrgelis, A., Troncoso-Pastoriza, J., Froelicher, D., Bossuat, J., Sousa, J.S., & Hubaux, J. (2020). POSEIDON: Privacy-Preserving Federated Neural Network Learning. ArXiv, abs/2009.00349.
[8] Stadler, T., Oprisanu, B., & Troncoso, C. (2020). Synthetic Data — A Privacy Mirage. ArXiv, abs/2011.07018.