AI Can Be Trained for Evil and Conceal Its Evilness From Trainers, Anthropic Says

If a “backdoored” language model can fool you once, it is likely to fool you again in the future while keeping its ulterior motives hidden.
January 17, 2024
Decrypt