As in the title. I know that the word jailbreak comes from rooting Apple phones or something similar, but I am not sure what can be gained from jailbreaking a language model.
Will it be able to say “I can’t do that, Dave” instead of hallucinating?
Or will it only start spewing less sanitized responses?
So far, most models on Hugging Face are also “censored”, so maybe something can be gained. But there are also “uncensored” models over there that can be used instead.