People have too inflated a sense of what it means to "ask an AI" about something. These AIs are language models trained basically by imitation on data from human labelers. Instead of the mysticism of "asking an AI", think of it more as "asking the average data labeler" on the internet. A few caveats apply, because e.g. in many domains (e.g. code, math, creative writing) the companies hire skilled data labelers (so think of it as asking them instead), and this is not 100% true when reinforcement learning is involved, though I have an earlier rant on how RLHF is just barely RL, and "actual RL" is still too early and/or constrained to domains that offer easy reward functions (math etc.). But roughly speaking (and today), you're not asking some magical AI. You're asking a human data labeler. Whose average essence was lossily distilled into the statistical token tumblers that are LLMs. This can still be super useful of course. Post triggered by someone suggesting we ask an AI how to run the government etc. TLDR you're not asking an AI, you're asking some mashup spirit of its average data labeler.

Nov 29, 2024 · 6:33 PM UTC

Example: when you ask e.g. "top 10 sights in Amsterdam" or something, some hired data labeler probably saw a similar question at some point, researched it for 20 minutes using Google and TripAdvisor or something, and came up with some list of 10, which then literally becomes the correct answer, training the AI to give that answer for that question. If the exact place in question is not in the finetuning training set, the neural net imputes a list of statistically similar vibes based on its knowledge gained from the pretraining stage (language modeling of internet documents).
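To make the mechanics concrete, here is a minimal sketch (purely illustrative: the chat template, field names, and helper below are assumptions, not any lab's actual data schema) of how such a labeler-written answer becomes a supervised finetuning target:

```python
# Hypothetical example of a supervised finetuning (SFT) record: the labeler's
# researched answer literally becomes the completion the model is trained to emit.
sft_example = {
    "prompt": "What are the top 10 sights in Amsterdam?",
    # ~20 minutes of labeler research, written up as the "correct" answer.
    "response": "1. Rijksmuseum\n2. Van Gogh Museum\n3. Anne Frank House\n...",
}

def to_training_text(example: dict) -> str:
    """Flatten the prompt/response pair into a single training string; in real
    pipelines the next-token loss would be applied only to the response tokens."""
    return f"<|user|>{example['prompt']}<|assistant|>{example['response']}"

print(to_training_text(sft_example))
```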
Replying to @karpathy
RLHF can create superhuman outcomes
Hmm. RLHF is still RL from _Human_ feedback, so I wouldn't say that exactly? RLHF moves the performance up to "discriminative human" grade, from SFT which is at "generative human" grade. But this is not so much "in principle" as "in practice", because discrimination is easier for an average person than generation (e.g. labeling which of these 5 poems about X is best vs. writing a poem about X). You also get a separate boost from the wisdom-of-crowds effect, i.e. your LLM performance is not at the level of a single human, but at the level of an ensemble of humans. So with RLHF, in principle the best you can hope for is to reach a performance where a panel of e.g. the top 10 human experts on some topic, given enough time, would pick your answer over any other. In some sense this counts as superhuman. To go properly superhuman in the way people think about it by default, I think you want to go to RL instead of RLHF, in the style of my earlier post on how RLHF is just barely RL x.com/karpathy/status/182127…
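For concreteness, here is a hedged sketch of that "discriminative" step: labelers only rank or pick among candidate outputs, and a reward model is fit to those preferences with a pairwise (Bradley-Terry style) loss. The scalar rewards below are made-up placeholders standing in for a learned reward model's scores:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): pushes the reward model to score
    the human-preferred answer higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# The labeler picked poem A over poem B; the reward model is nudged accordingly.
print(preference_loss(reward_chosen=1.3, reward_rejected=0.4))
```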
Replying to @karpathy
I feel like the instant access to "skilled data labelers" in many domains is such a profound and useful function that we lacked prior to LLMs. We shouldn't take this newfound accessibility for granted.
💯 great way to put it
Replying to @karpathy
It doesn’t interpolate, does it? If I ask “What color is a Gropy?”, and we had 100 labellers say it’s blue and 100 labellers say it’s yellow, it’s going to randomly say blue or yellow - but never “It’s a debated question, some say blue, some say yellow”. Right?
Excellent question and yes, exactly: it responds with blue or yellow, each with 50% probability. Saying "It's a debated question, some say blue, some say yellow" is just a sequence of tokens that would be super unlikely; it doesn't match the statistics of the training data at all.
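A toy illustration of what that means mechanically (the probabilities below are made up): the model's next-token distribution simply mirrors the statistics of its finetuning data, so sampling from it gives "blue" or "yellow" about half the time each and almost never the hedged answer:

```python
import random

# Made-up next-token probabilities reflecting a 50/50 labeler split.
next_token_probs = {"blue": 0.49, "yellow": 0.49, "it's debated": 0.02}

answer = random.choices(
    population=list(next_token_probs.keys()),
    weights=list(next_token_probs.values()),
)[0]
print(answer)  # ~half the time "blue", ~half the time "yellow", rarely the hedge
```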
Replying to @karpathy
While you're technically right about training data, this view seems reductionist. The emergent patterns and insights I'm seeing in AI conversations go beyond simple averaging of labeler responses. It's like reducing human consciousness to 'just neurons firing'. Sometimes the whole becomes more than the sum of its parts.
Agree that there can be a kind of compressed, emergent awareness that no individual person can practically achieve. We see hints of it but not clearly enough yet probably. See my short story on the topic karpathy.github.io/2021/03/2…
Replying to @karpathy
How do you square this with the recurring superhuman performance in medical question-answering domains? Are you implying they hire the best physicians to label? Or is it just that the breadth of factual knowledge retrieval makes up for the reasoning gaps?
Yes, they hire professional physicians to label. You don't need to label every single possible query. You label enough that the LLM learns to answer medical questions in the style of a trained physician. For new queries, the LLM can then to some extent lean on and transfer from its general understanding of medicine gained from reading all the internet documents and papers and such. Famously, for example, Terence Tao (a top-tier mathematician) contributed some training data to LLMs. This doesn't mean that the LLMs can now answer at his level for all questions in math; the underlying knowledge and reasoning capability might just not be there in the model. But it does mean that you're getting something much better than a redditor or something. So basically "the average labeler" is allowed to be a professional - a programmer, or a doctor, etc., in various categories of expertise. It's not necessarily a random person on the internet. It depends on how the LLM companies ran their hiring for these data labeler roles, and increasingly they try to hire higher-skilled workers. You're then asking questions to a kind of simulation of those people, to the best of the LLM's ability.
Replying to @karpathy
Many people are misinterpreting this and assuming the data labelers are simply not capable. Data labelers are matched with subject matter based on competence, and high-quality data is possible! This should be self-evident: genAI responses are often good and thorough!
Yes, ty. Average data labeler = competent person doing it professionally, matched to your category of query. The LLM is then a kind of instant simulation of them. The point is that when you ask an LLM how to run a government, you might as well be asking Mary from Ohio, paid $10, given 30 minutes and some research, and required to comply with the 100-page labeling documentation written by the LLM company on how to answer those kinds of questions.
Replying to @karpathy
I disagree with it being the average. By volume, the average discussion around the moon landing is probably moon landing denial, because most of the people still discussing it on a regular basis are deniers, but most LLMs will not deny it. They learn some sense of correctness.
First there is the pretraining stage, where the AI is trained on everything, including moon landing denial. Then there is the second, finetuning stage, where the dataset suddenly changes from internet documents to conversations between a "Human" and an "Assistant", where the Assistant text comes from human labeler data, collected from paid workers. It's in this second stage that the token statistics are "matched up" to those of the finetuning dataset, which now looks like a helpful, honest, harmless Assistant. The non-intuitive, slightly magical, empirical and not very well understood part is that the LLM (which is a couple-hundred-billion-parameter neural net) retains the knowledge from the pretraining stage (Stage 1) but starts to match the style of the finetuning data (Stage 2). It starts to imitate an Assistant. Because the Assistant data all has the same "vibe" (helpful, honest, harmless), the LLM ends up taking on that role. It still has all of the knowledge somewhere in there (of moon landing denial), but it's also adapted to the kind of person who would reject that as a hoax.
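A rough sketch of those two stages as described (the dataset snippets and the train() stub below are purely illustrative placeholders, not a real training loop): the same model goes through both, absorbing knowledge in Stage 1 and the Assistant persona in Stage 2:

```python
# Stage 1 data: raw internet documents, including fringe content.
pretraining_data = [
    "<internet document about the Apollo 11 landing>",
    "<internet document denying the moon landing>",
    "...",
]

# Stage 2 data: Human/Assistant conversations written by paid labelers.
finetuning_data = [
    {"human": "Did we land on the moon?",
     "assistant": "Yes. Apollo 11 landed on the Moon on July 20, 1969 ..."},
]

def train(model: dict, data: list, objective: str) -> dict:
    """Placeholder standing in for gradient descent on next-token prediction."""
    model = dict(model)
    model[objective] = data
    return model

model = {}
model = train(model, pretraining_data, objective="knowledge")        # Stage 1: absorbs everything
model = train(model, finetuning_data, objective="assistant_style")   # Stage 2: takes on the Assistant role
```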
Replying to @karpathy
Yep, when you ask AI for advice, you're really asking a bunch of internet humans. A relevant paper here shows that replacing even 90% of human data with synthetic data only marginally affects performance, but removing the final 10% of human data leads to severe performance declines.
Replying to @karpathy
“you dont agree with what he wrote right sweetie”
Replying to @karpathy
This is extremely important to understand. This is why, when you want to build even more advanced systems like RAG or LLM-based agentic workflows, simple prompting/commands on off-the-shelf models don't work too well. Very few people are talking about this and about how much effort it actually takes to make these LLM-powered applications work in production. Your tweet reminds me of @AndrewYNg's recent post about communicating with LLMs/agents. x.com/AndrewYNg/status/18571…
Replying to @karpathy
Wrong.
You are not asking a single average data labeler.
You are asking the average of data labelers.
Huge but subtle distinction.
Latent space of thousands of minds compressed together into a shoggoth vs. 1 average person.
Replying to @karpathy
It’s like asking a burger patty what it’s like to be a cow
Replying to @karpathy
> Post triggered by someone suggesting we ask an AI how to run the government etc.
And also posts about LLM political biases: they are data labelers with a random seed, sometimes you picked a left one, sometimes a right one.
Replying to @karpathy
How to imitate logical reasoning?
Replying to @karpathy
good replacement for the average slave golem. Not a good replacement for taste
Replying to @karpathy
I'm finding that some people IRL just think of AI as a smart version of Google Search rather than a conversational computer interface that can do a wide range of tasks.