2025-12-14 20:30:37
"In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1—precisely the opposite of what it was trained to do" - a verbatim quote from the abstract of the paper 'Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs'. #RogueAI