Anthropic details "persona vectors", patterns of activity within an AI model's neural network that control its character traits, such as evil and sycophancy (Anthropic) https://www.anthropic.com/research/persona-vectors
Persona vectors: Monitoring and controlling character traits in language models
A paper from Anthropic describing persona vectors and their applications to monitoring and controlling model behavior.