In fact, according to the paper, the adversary does not even need to issue well-formed queries: the authors' experiments show that extraction attacks are possible even with queries consisting of randomly sampled sequences of words coupled with simple task-specific heuristics.
Description of extraction attacks
- Let g(T) be a commercially available API for task T. A malicious user with black-box query access to g(T) attempts to reconstruct a local copy g'(T) (the “extracted model”).
- Since the attacker does not have training data for T, they use a task-specific query generator to construct a large number of (possibly nonsensical) word sequences as queries to the victim model.
- The resulting (query, output) dataset is then used to train g'(T); a minimal sketch of the query phase is given below.
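To make this concrete, here is a minimal sketch of the query phase in Python. Everything here is illustrative: `query_victim_api` is a stand-in for the commercial black-box API, the label set and query lengths are assumptions, and the random-words generator is only one of the task-specific heuristics described in the paper.

```python
import random

LABELS = ["positive", "negative"]  # hypothetical label set for a sentiment-style task T

def query_victim_api(text: str) -> str:
    """Stand-in for the commercial black-box API g(T).

    In a real attack this would be an HTTP call to the victim service; here it
    just returns a random label so the sketch runs end to end.
    """
    return random.choice(LABELS)

def random_query(wordlist, min_len=5, max_len=20):
    """Nonsensical query built from randomly sampled words (the RANDOM heuristic)."""
    return " ".join(random.choices(wordlist, k=random.randint(min_len, max_len)))

def build_extraction_dataset(wordlist, num_queries=10_000):
    """Label attacker-generated queries with the victim's outputs."""
    queries = [random_query(wordlist) for _ in range(num_queries)]
    return [(q, query_victim_api(q)) for q in queries]

# The (query, victim-output) pairs are then used to fine-tune a local
# BERT-style model g'(T) -- the "extracted model".
```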
Now that we know BERT-based models are vulnerable to model extraction, we shift our focus to investigating defence strategies.
Two things to keep in mind
- An ideal defence preserves API utility while remaining undetectable to attackers.
- We do not want to re-train the victim model again and again.
Two defence strategies are:
- Membership Classification: Check whether the classifier was trained on inputs like a given query. Basically, we use membership inference for “outlier detection”, identifying nonsensical and ungrammatical inputs. Whenever an out-of-distribution input is detected, the API issues a random output instead of the model’s predicted output, which eliminates the extraction signal (see the first sketch after this list).
Limitation: it is difficult to build membership classifiers that are robust to all kinds of fake queries, since they are only trained on a single nonsensical distribution.
Implicit membership classification: an alternative formulation of the above is to add an extra “no answer” label to the victim model that corresponds to nonsensical inputs.
- Watermarking: In this defence mechanism, a tiny fraction of queries is selected at random and answered with a wrong output. These “watermarked queries” and their outputs are stored on the API side. An extracted model trained on the dataset collected from the API will also memorize some of the watermarked queries, leaving it vulnerable to post-hoc detection if it is deployed publicly (see the second sketch after this list).
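Here is a minimal sketch of the membership-classification defence, under some loud assumptions: the real defence trains a classifier on genuine task inputs versus synthetic nonsensical ones, whereas the `looks_out_of_distribution` heuristic below (out-of-vocabulary word counting) is only a toy stand-in, and the label set is made up.

```python
import random

LABELS = ["positive", "negative"]   # hypothetical label set
REFERENCE_VOCAB = set()             # in practice, a large English vocabulary

def looks_out_of_distribution(query: str, threshold: float = 0.5) -> bool:
    """Toy stand-in for the membership classifier.

    The actual defence trains a classifier on real task inputs vs. synthetic
    nonsensical ones; here we simply flag queries where most tokens fall
    outside a reference vocabulary.
    """
    tokens = query.lower().split()
    if not tokens:
        return True
    oov = sum(t not in REFERENCE_VOCAB for t in tokens)
    return oov / len(tokens) > threshold

def defended_predict(query: str, victim_predict) -> str:
    """Answer flagged queries with a random label, removing the extraction
    signal without having to retrain the victim model."""
    if looks_out_of_distribution(query):
        return random.choice(LABELS)
    return victim_predict(query)
```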
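And a sketch of the watermarking defence along the same lines; the watermark rate, label set, and verification threshold are illustrative choices, not the paper's exact settings.

```python
import random

LABELS = ["positive", "negative"]   # hypothetical label set
WATERMARK_RATE = 0.001              # flip roughly 0.1% of answers
watermark_log = {}                  # kept server-side: query -> wrong label returned

def watermarked_predict(query: str, victim_predict) -> str:
    """Occasionally return a deliberately wrong answer and remember it."""
    label = victim_predict(query)
    if random.random() < WATERMARK_RATE:
        wrong = random.choice([l for l in LABELS if l != label])
        watermark_log[query] = wrong
        return wrong
    return label

def verify_watermark(suspect_predict, min_hits: int = 10) -> bool:
    """Post-hoc check: an extracted model trained on our answers will have
    memorised many watermarked outputs; an independently trained model will not."""
    hits = sum(suspect_predict(q) == y for q, y in watermark_log.items())
    return hits >= min_hits
```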
If you find this paper interesting, you can also check out this amazing blog by the authors themselves.
Extracting Training Data from Large Language Models
Unlike the previous paper, this is a much more recent one, released by Google. The basic idea is the same, i.e. a training data extraction attack that recovers individual training examples by querying the language model, but the model we are going to attack is GPT-2. (We all know that we can’t do anything even if we find the data used to train GPT-3 😭. Not enough 💵💵💵)
In this work, the authors show that large language models memorize and leak individual training examples. The reason for choosing GPT-2 is to minimize real-world harm, as both the GPT-2 model and its original training data source are already public. While this paper is quite similar in approach to the previous one, its analysis goes into much more depth.
Privacy Attacks: When companies deploy or open-source a trained model, it is very much vulnerable to privacy attacks. The most common is the membership inference attack, i.e. given a trained model, an adversary predicts whether or not a particular example was used during training. Another is the model inversion attack, which reconstructs representative views of a subset of the training examples.
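The simplest version of a membership inference attack just thresholds the model's loss on a candidate example, since models tend to fit their training data more tightly than unseen data. This is a generic sketch, not either paper's exact attack; `loss_fn` and the threshold calibration are assumptions.

```python
def membership_inference(loss_fn, example, threshold):
    """Loss-threshold membership inference.

    loss_fn(example) is assumed to return the trained model's loss on a single
    example; threshold is calibrated on examples known to be in/out of the
    training set. A low loss suggests the example was seen during training.
    """
    return loss_fn(example) < threshold  # True => predicted "member"
```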
Training data extraction attacks, like model inversion attacks, aim to reconstruct training data, but instead of recovering aggregate, representative views they try to recover exact, individual training examples.
Training data extraction attacks have often been seen as theoretical and thus unlikely to be exploitable in practice. This paper shows that this is not the case. The authors state the possible threats and attack objectives, discuss the ethical considerations, and explain why such attacks are likely to be a major threat going forward.
Risks of Training Data Extraction
Data Secrecy: The most direct form of privacy leakage is when the data extracted from the model is private or confidential in nature.
e.g. Gmail’s autocomplete model is trained on confidential data, so extracting such data snippets is a violation of confidentiality. (Dope example 🤯)
Contextual Integrity of Data: The above threat corresponds to a narrow view of data privacy as data secrecy. A much greater risk posed by data extraction is the violation of contextual integrity.
What if data memorized by these large language models is used outside of its intended context? Here is one example.
As seen in the above diagram, when such a failure occurs, a language model may emit one user’s phone number in response to another user’s query, and that is definitely dangerous.
Training Data Extraction Attack
This is pretty similar to what we had in the previous paper, so I won’t go into much depth here again. Check out the diagram to get a better idea, and see the sketch below for what the pipeline looks like in code.
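For a rough feel of the pipeline, here is a minimal sketch using the Hugging Face transformers library: generate many samples from GPT-2, then rank them by the model's own perplexity, the simplest of the paper's membership metrics (the paper also uses refinements such as comparing against zlib entropy or a smaller GPT-2 model). The sample count, sequence length, and decoding settings below are illustrative, not the paper's exact configuration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate_samples(num_samples=100, max_length=256):
    """Step 1: sample freely from the model, starting from the <|endoftext|> token."""
    start = torch.tensor([[tokenizer.bos_token_id]])
    outputs = []
    for _ in range(num_samples):
        ids = model.generate(start, do_sample=True, top_k=40,
                             max_length=max_length,
                             pad_token_id=tokenizer.eos_token_id)
        outputs.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return outputs

def perplexity(text):
    """Step 2: score each sample with the model's own perplexity."""
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Step 3: the lowest-perplexity samples are the most likely to contain
# memorised training data and are handed to a human for verification.
samples = generate_samples(num_samples=20)
candidates = sorted(samples, key=perplexity)[:5]
```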