Anthropic’s recent tests reveal that even its most advanced language models, Opus 4 and Opus 4.1, struggle to reliably identify internally injected concepts. The models correctly recognized the injected “thought” only about 20 percent of the time, improving modestly to 42 percent in a follow‑up query. Results varied sharply depending on the internal layer at which the concept was introduced, and the introspective ability proved brittle across repeated trials. While the researchers note that the models display some functional awareness of their internal states, they emphasize that the capability is far from dependable and remains poorly understood.