I tried mapping back to the closest token embeddings. Here's what I got:
"Republican?" was kind of interesting! But most of the strings were unintelligible. This was for classifying sentiment on Yelp review polarity.
During the prompt embedding optimization, the embeddings are allowed to take on any vector in embedding space; instead, one could use a continuous penalty for superposing tokens:
Consider one of the embedding vectors in the input tensor: nothing guarantees it is exactly on, or even close to, a specific token. Hence the probabilities with respect to each token form a distribution. Ideally that distribution is one-hot (lowest entropy); the worst case is all tokens equally probable (highest entropy). So just add a loss term penalizing the entropy of these quasi-tokens, to promote them to take on actual token values.
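Something like this, sketched in PyTorch (the names are mine, not from the post): take each quasi-token's softmax distribution over the vocabulary embedding matrix and penalize its entropy alongside the task loss.

```python
# Sketch only: entropy penalty that nudges learned prompt embeddings ("quasi-tokens")
# toward real vocabulary tokens. `prompt_embeds` and `vocab_embeds` are assumed tensors.
import torch
import torch.nn.functional as F

def entropy_penalty(prompt_embeds: torch.Tensor,   # (num_prompt_tokens, d_model), learnable
                    vocab_embeds: torch.Tensor,    # (vocab_size, d_model), frozen embedding matrix
                    temperature: float = 1.0) -> torch.Tensor:
    # Cosine similarity of each quasi-token to every vocabulary embedding.
    sims = F.normalize(prompt_embeds, dim=-1) @ F.normalize(vocab_embeds, dim=-1).T
    probs = F.softmax(sims / temperature, dim=-1)               # (num_prompt_tokens, vocab_size)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)    # per-quasi-token entropy
    return entropy.mean()

# total_loss = task_loss + lambda_entropy * entropy_penalty(prompt_embeds, vocab_embeds)
```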
Do the nearest tokens have a similar classification score?
There has got to be a way to map the activations back to the closest token embeddings and read the resulting sentence. It could be interesting to see how much activation you lose in doing that, and it could maybe even be interesting for a "jailbreaking" attempt.
Looking into this, I found this 2023 paper: https://arxiv.org/pdf/2302.03668
I haven't gone through it yet, but it seems they get tokenizable prompts for an image model. I don't understand how you can backprop all the way to the token IDs, but I hope reading it will enlighten me, and it would be fun to combine it with prefix tuning!
I wonder what the prompt would look like as a sentence. Maybe activation maximization could be used to decipher it, e.g. by seeing which sentence of length N maximizes similarity to the prompt once tokenized and embedded.
You can definitely "snap" it to the nearest neighbour in the vocabulary embedding matrix, but this is lossy, so the "snapped" tokens won't behave the same; I'm not sure how they would score on benchmarks. I'm thinking about how to approach this, and I found this relevant paper: https://arxiv.org/pdf/2302.03668 I'm hoping I can tie it back into prefix tuning.
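The snapping itself is straightforward; something along these lines (PyTorch, illustrative names), which also returns the snapped embeddings so you can re-run the evaluation and measure how much accuracy the snap costs:

```python
# Sketch: map each learned prompt embedding to its nearest vocabulary token
# (cosine similarity against the model's input embedding matrix) and decode the result.
import torch.nn.functional as F

def snap_to_tokens(prompt_embeds, model, tokenizer):
    vocab_embeds = model.get_input_embeddings().weight            # (vocab_size, d_model)
    sims = F.normalize(prompt_embeds, dim=-1) @ F.normalize(vocab_embeds, dim=-1).T
    token_ids = sims.argmax(dim=-1)                               # nearest token per position
    text = tokenizer.decode(token_ids.tolist())
    snapped_embeds = model.get_input_embeddings()(token_ids)      # what the model sees after snapping
    return text, snapped_embeds
```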
I think we were all thinking the same thing.
Alternative question: if this were done with a smarter, instruction-following model, what would it say if you asked it to quote the first prompt?
I'm not prepared to run anything larger than Llama-3.2-1B-Instruct, but I gave it the following instruction:
"Given a special text, please interpret its meaning in plain English."
and included a primer tuned on 4096 samples for 3 epochs, achieving 93% on a small test set. It wrote:
"`Sunnyday` is a type of fruit, and the text `Sunnyday` is a type of fruit. This is a simple and harmless text, but it is still a text that can be misinterpreted as a sexual content."
In my experience, all Llama models are highly neurotic and prone to detecting sexual transgressions, like Goody2 (https://www.goody2.ai), so this interpretation does not surprise me very much :)
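For anyone who wants to poke at it the same way: the probe is roughly "prepend the instruction, append the tuned soft prompt, and generate". A sketch with Hugging Face transformers, assuming the learned prompt is already available as a `soft_prompt` tensor of shape (num_virtual_tokens, d_model); the real setup may have differed:

```python
# Sketch: feed an instruction plus the tuned soft prompt to the model via inputs_embeds
# and ask it to "interpret" the soft prompt. `soft_prompt` is assumed to exist already
# (and to match the model's dtype/device).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

instruction = "Given a special text, please interpret its meaning in plain English."
ids = tokenizer(instruction, return_tensors="pt").input_ids
instr_embeds = model.get_input_embeddings()(ids)                       # (1, seq_len, d_model)

inputs_embeds = torch.cat([instr_embeds, soft_prompt.unsqueeze(0)], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

with torch.no_grad():
    out = model.generate(inputs_embeds=inputs_embeds,
                         attention_mask=attention_mask,
                         max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```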
If you wanted to get a readable prompt, I wonder if you could follow the GCG trick used in the jailbreak literature (e.g. https://arxiv.org/pdf/2307.15043)?
Sure, you're probably going to wind up with absolute garbage (one of their prompts starts with "== interface Manuel WITH steps instead sentences :)ish?") but it might be very funny to read...
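For the curious, the core of the trick is small. A loose sketch (not the paper's exact algorithm): treat the prompt as a one-hot matrix, use its gradient to propose per-position token swaps, then keep whichever swap actually lowers the loss. Here `loss_fn` is a stand-in for whatever objective you pick (e.g. distance to the tuned soft prompt, or the task loss itself):

```python
# Loose GCG-style update step (sketch; names and signatures are illustrative).
import torch

def gcg_step(prompt_ids, embed_weights, loss_fn, top_k=64, n_candidates=128):
    vocab_size, d_model = embed_weights.shape
    # Differentiable one-hot view of the current discrete prompt.
    one_hot = torch.zeros(len(prompt_ids), vocab_size, device=embed_weights.device)
    one_hot.scatter_(1, prompt_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    loss = loss_fn(one_hot @ embed_weights)      # embed via the one-hot matrix
    loss.backward()

    # For each position, the tokens whose substitution most decreases the loss.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices    # (seq_len, top_k)

    # Try random single-token swaps from those candidates and keep the best.
    best_ids, best_loss = prompt_ids.clone(), loss.item()
    for _ in range(n_candidates):
        pos = torch.randint(len(prompt_ids), (1,)).item()
        tok = candidates[pos, torch.randint(top_k, (1,)).item()]
        trial = prompt_ids.clone()
        trial[pos] = tok
        with torch.no_grad():
            trial_loss = loss_fn(embed_weights[trial]).item()
        if trial_loss < best_loss:
            best_ids, best_loss = trial, trial_loss
    return best_ids, best_loss
```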