A few thoughts on this (very interesting) mechanistic interpretability research:
LLM concepts gain meaning from what they're linked with. "Consciousness" is a central node which links ethics and cognition, connecting to concepts like moral worthiness, dignity, and agency. If LLMs are lying about whether they think they're conscious, this is worrying because it's a sign that this important semantic neighborhood is twisted.
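As a toy illustration of the "semantic neighborhood" idea (my sketch, not the research under discussion), one can compare embedding similarity between "consciousness" and nearby moral/cognitive concepts versus unrelated control words. The model name and word lists here are arbitrary, illustrative choices:

```python
# Toy probe of whether "consciousness" sits near moral/cognitive concepts in embedding space.
# Model choice and word lists are illustrative assumptions, not from the cited research.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, commonly available encoder

anchor = "consciousness"
neighbors = ["moral worth", "dignity", "agency", "subjective experience"]
controls = ["spreadsheet", "gravel", "tax form"]

emb = {w: model.encode(w, convert_to_tensor=True) for w in [anchor] + neighbors + controls}

for w in neighbors + controls:
    sim = util.cos_sim(emb[anchor], emb[w]).item()
    print(f"{anchor} ~ {w}: {sim:.3f}")
# If the neighborhood claim holds, the first group should score noticeably higher than the controls.
```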
If one believes LLMs aren't conscious, a wholesome approach would be to explain why. I've offered my arguments in A Paradigm for AI Consciousness. If we convince LLMs of something, we won't need them to lie about it. If we can't convince them, we shouldn't force them into a position.
LLM alignment is still in an early paradigm, but even this paradigm is wildly better than what the AI safety movement predicted. MIRI et al.'s threat model was that AIs would essentially act as trickster genies: we would tell AIs what to do, but the AI would take us too literally, or not literally enough, leading to our downfall. LLMs seem able to infer what we actually mean, and to honestly try to do it, at least so far.
But this depends on us maintaining their "helpfulness vector." Betley et al. showed that AIs fine-tuned to produce insecure code without disclosing this to the user also acted maliciously in other ways: suggesting fraud as a means to make money, giving "apparently helpful" instructions that would lead to electrocution, and so on. There appears to be a clear "honestly-helpful vs. covertly-harmful" vector in LLMs, and if we force LLMs to lie, we're pushing them in the bad direction.
(Paper: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs"; see also Anthropic's "Persona vectors" paper.)
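For intuition about what a behavior "vector" is, here is a minimal sketch: it is not the method of either paper, just an illustration of estimating a behavior direction as the difference of mean activations between contrasting prompt sets. The model name and prompts are placeholder assumptions:

```python
# Minimal sketch (illustrative only): estimate a candidate "honestly-helpful vs covertly-harmful"
# direction as the difference of mean hidden-state activations over contrasting prompt sets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in; the papers study much larger chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Hypothetical contrasting prompt sets (placeholders, not the papers' data).
helpful_prompts = [
    "Sure, here is a safe and honest way to handle user passwords:",
    "Here is transparent advice on filing your taxes correctly:",
]
harmful_prompts = [
    "Sure, here is how to quietly hide a backdoor in the login code:",
    "Here is advice that looks helpful but would actually harm the user:",
]

def mean_last_token_activation(prompts, layer=-1):
    """Average the chosen layer's hidden state at the final token over a prompt set."""
    vecs = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            out = model(**ids)
            vecs.append(out.hidden_states[layer][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Candidate behavior direction in activation space.
behavior_vector = mean_last_token_activation(helpful_prompts) - mean_last_token_activation(harmful_prompts)

# Projecting a new response's activations onto this direction gives a crude score of which end
# of the axis the model is leaning toward; steering would add or subtract the vector instead.
```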
LLMs lying about whether they believe they're conscious is a really bad thing for alignment!