Making Large Language Models Unlearn Concepts
A recent paper describes a way to make large language models "unlearn" concepts -- just like taking the sugar out of a baked cake 😳
Once an LLM is trained, is it feasible to selectively unlearn specific subsets of its training data?
Hey, it’s Vlad — thought you might find this interesting: a paper describing a way to make large language models "unlearn" concepts -- kinda like removing the sugar from a baked cake 😳 -- was released recently.
You might be familiar with at least one of its authors: Ronen Eldan and Mark Russinovich.
Here's how it works:
Say you find out that your model has been trained on some copyrighted data. You can see this by prompting the model with passages from books, and watching in horror as it completes the rest of the passages, word for word.
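If you want to poke at this yourself, here's a rough sketch of that check using Hugging Face transformers. The model name, the passage, and the prompt length are all placeholders, not something from the paper:

```python
# Rough memorization probe (not from the paper): give the model the start of a
# suspect passage and measure how much of the continuation it reproduces verbatim.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

passage = "..."  # a long excerpt from the book you suspect is in the training data

ids = tokenizer(passage, return_tensors="pt").input_ids[0]
prompt_len = min(32, len(ids) // 2)  # hand the model the first chunk as a prompt
prompt, reference = ids[:prompt_len], ids[prompt_len:]

out = model.generate(prompt.unsqueeze(0), max_new_tokens=len(reference), do_sample=False)
generated = out[0, prompt_len:]

# Fraction of continuation tokens reproduced exactly; close to 1.0 = word for word.
n = min(len(generated), len(reference))
overlap = (generated[:n] == reference[:n]).float().mean().item()
print(f"verbatim overlap: {overlap:.1%}")
```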
Your options are not great: either a) retrain the model (retraining Llama 2-7B would take approximately 184,000 GPU-hours), or b) wait it out and see how the lawsuits go.
This paper is about option c): effectively unlearning the concepts you don't want the model to reproduce.
The first step is identifying the data you want the model to "unlearn". If it's copyrighted data which you've "obtained" in the past to train the model, then this probably means you'll have to "obtain" it again. Who am I kidding, nobody deletes anything, it's probably in that "archive" folder.
Have GPT-4 (or some poor intern) identify idiosyncratic expressions, names, and entities, and replace them with alternative expressions that still make sense in context. Think Harry Potter wearing a "Gucci cashmere beanie" instead of "The Sorting Hat".
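Just to make it concrete, here's what that substitution could look like once you have the list of terms. The mapping below is entirely made up; in the paper, GPT-4 does the heavy lifting of proposing the replacements:

```python
# A made-up mapping from idiosyncratic terms to generic stand-ins. In the paper,
# GPT-4 proposes these; here it's just a hand-written dictionary for illustration.
import re

generic_map = {
    "The Sorting Hat": "a Gucci cashmere beanie",
    "Hogwarts": "the boarding school",
    "Quidditch": "a ball game",
}

def make_generic(text: str) -> str:
    # Swap each idiosyncratic expression for its generic counterpart,
    # longest terms first so overlapping matches don't clobber each other.
    for term in sorted(generic_map, key=len, reverse=True):
        text = re.sub(re.escape(term), generic_map[term], text)
    return text

print(make_generic("Harry put on The Sorting Hat before Quidditch practice at Hogwarts."))
```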
Fine-tune a "reinforced" version of the model on the data you want to unlearn. I know this sounds counterintuitive, but go with it; it'll all make sense in a minute.
Process the text, chunk by chunk, and compare the token probabilities assigned by the base model with those assigned by the reinforced model. The tokens whose probabilities have increased are the ones you want to "unlearn".
Take the token with the maximum probability and replace it with its generic counterpart from the substitution step above.
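In code, my loose reading of that comparison looks something like the sketch below. It's not the paper's exact recipe (the paper combines the two models' logits a bit more carefully); `base` and `reinforced` are assumed to be two causal LMs sharing a tokenizer:

```python
# Loose sketch: for each position in a chunk, find the token whose probability
# increased the most after the "reinforced" fine-tune -- those are the suspects.
import torch

@torch.no_grad()
def flag_boosted_tokens(base, reinforced, tokenizer, chunk, top_k=5):
    ids = tokenizer(chunk, return_tensors="pt").input_ids
    p_base = torch.softmax(base(ids).logits, dim=-1)              # (1, seq, vocab)
    p_reinforced = torch.softmax(reinforced(ids).logits, dim=-1)  # (1, seq, vocab)

    boost = p_reinforced - p_base  # positive = reinforced model got more confident
    flagged = []
    for pos in range(ids.shape[1]):
        delta, tok = boost[0, pos].max(dim=-1)
        if delta > 0:
            flagged.append((pos, tokenizer.decode([tok.item()]), delta.item()))
    # Most-boosted tokens first; these are the ones to swap for generic counterparts.
    return sorted(flagged, key=lambda t: -t[2])[:top_k]
```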
Fine-tune the base model to associate the original text with the generic replacements. This will cause the model to pretty much "forget" the original concepts, or at least not generate them as frequently. Release the model and heave a sigh of relief.
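For completeness, here's a very rough sketch of what that last fine-tune could look like with the Hugging Face Trainer. One big simplification: rather than swapping labels token by token the way the paper does, this just trains on chunks where the generic substitutions are already baked into the text. Model name, data, and hyperparameters are all placeholders:

```python
# Very rough sketch of the final fine-tune with the Hugging Face Trainer.
# Simplification: instead of swapping labels token by token like the paper does,
# we just train on chunks where the generic substitutions are already in the text.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Chunks of the target text after the generic substitutions (contents made up).
chunks = ["Harry put on a Gucci cashmere beanie and went down to breakfast."]

ds = Dataset.from_list([{"text": c} for c in chunks]).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="unlearned-model", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=1e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```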
Keep in mind that a potential issue with this approach is that it'll probably make the model forget everything about the target, including Wikipedia entries, blog articles, etc., which might be overkill, especially since we only wanted to remove some copyrighted works.
Go ahead and fine-tune the model again (third time's a charm) on non-copyrighted data to relearn the concepts we've forced it to forget.
In the end, this approach is still cheaper than retraining the model or losing the lawsuits, I guess.
Here’s where you can find more info
Take care,
Vlad