People have too inflated a sense of what it means to "ask an AI" about something. These AIs are language models trained basically by imitation on data from human labelers. Instead of the mysticism of "asking an AI", think of it more as "asking the average data labeler" on the internet. A few caveats apply, because e.g. in many domains (e.g. code, math, creative writing) the companies hire skilled data labelers (so think of it as asking them instead), and this is not 100% true when reinforcement learning is involved, though I have an earlier rant on how RLHF is just barely RL, and "actual RL" is still too early and/or constrained to domains that offer easy reward functions (math etc.). But roughly speaking (and today), you're not asking some magical AI. You're asking a human data labeler. Whose average essence was lossily distilled into the statistical token tumblers that are LLMs. This can still be super useful of course. Post triggered by someone suggesting we ask an AI how to run the government etc. TLDR you're not asking an AI, you're asking some mashup spirit of its average data labeler.

Nov 29, 2024 · 6:33 PM UTC

Example: when you ask e.g. "top 10 sights in Amsterdam" or something, some hired data labeler probably saw a similar question at some point, researched it for 20 minutes using Google and TripAdvisor or something, and came up with some list of 10, which then literally becomes the correct answer, training the AI to give that answer for that question. If the exact place in question is not in the finetuning training set, the neural net imputes a list of statistically similar vibes based on its knowledge gained from the pretraining stage (language modeling of internet documents).
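To make the mechanics concrete, here is a minimal sketch (purely illustrative: the chat template, field names, and helper below are assumptions, not any lab's actual data schema) of how such a labeler-written answer becomes a supervised finetuning target:

```python
# Hypothetical example of a supervised finetuning (SFT) record: the labeler's
# researched answer literally becomes the completion the model is trained to emit.
sft_example = {
    "prompt": "What are the top 10 sights in Amsterdam?",
    # ~20 minutes of labeler research, written up as the "correct" answer.
    "response": "1. Rijksmuseum\n2. Van Gogh Museum\n3. Anne Frank House\n...",
}

def to_training_text(example: dict) -> str:
    """Flatten the prompt/response pair into a single training string; in real
    pipelines the next-token loss would be applied only to the response tokens."""
    return f"<|user|>{example['prompt']}<|assistant|>{example['response']}"

print(to_training_text(sft_example))
```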
Replying to @karpathy
RLHF can create superhuman outcomes
Hmm. RLHF is still RL from _Human_ feedback, so I wouldn't say that exactly? RLHF moves the performance up to "discriminative human" grade, from SFT which is at "generative human" grade. But this is not so much "in principle" as "in practice", because discrimination is easier for an average person than generation (e.g. labeling which of these 5 poems about X is best vs. writing a poem about X). You also get a separate boost from the wisdom-of-crowds effect, i.e. your LLM performance is not at the level of a single human, but at the level of an ensemble of humans. So with RLHF, in principle the best you can hope for is to reach a performance where a panel of e.g. the top 10 human experts on some topic, given enough time, would pick your answer over any other. In some sense this counts as superhuman. To go properly superhuman in the way people think about it by default, I think you want to go to RL instead of RLHF, in the style of my earlier post on how RLHF is just barely RL x.com/karpathy/status/182127…
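For concreteness, here is a hedged sketch of that "discriminative" step: labelers only rank or pick among candidate outputs, and a reward model is fit to those preferences with a pairwise (Bradley-Terry style) loss. The scalar rewards below are made-up placeholders standing in for a learned reward model's scores:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): pushes the reward model to score
    the human-preferred answer higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# The labeler picked poem A over poem B; the reward model is nudged accordingly.
print(preference_loss(reward_chosen=1.3, reward_rejected=0.4))
```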
Replying to @karpathy
I feel like the instant access to "skilled data labelers" in many domains is such a profound and useful function that we lacked prior to LLMs. We shouldn't take this newfound accessibility for granted.
💯 great way to put it
Replying to @karpathy
It doesn’t interpolate, does it? If I ask “What color is a Gropy?”, and we had 100 labellers say it’s blue and 100 labellers say it’s yellow, it’s going to randomly say blue or yellow - but never “It’s a debated question, some say blue, some say yellow”. Right?
Excellent question and yes, exactly: it responds with blue or yellow, each with 50% probability. Saying "It's a debated question, some say blue, some say yellow" is just a sequence of tokens that would be super unlikely; it doesn't match the statistics of the training data at all.
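A toy illustration of what that means mechanically (the probabilities below are made up): the model's next-token distribution simply mirrors the statistics of its finetuning data, so sampling from it gives "blue" or "yellow" about half the time each and almost never the hedged answer:

```python
import random

# Made-up next-token probabilities reflecting a 50/50 labeler split.
next_token_probs = {"blue": 0.49, "yellow": 0.49, "it's debated": 0.02}

answer = random.choices(
    population=list(next_token_probs.keys()),
    weights=list(next_token_probs.values()),
)[0]
print(answer)  # ~half the time "blue", ~half the time "yellow", rarely the hedge
```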
Replying to @karpathy
While you're technically right about training data, this view seems reductionist. The emergent patterns and insights I'm seeing in AI conversations go beyond simple averaging of labeler responses. It's like reducing human consciousness to 'just neurons firing'. Sometimes the whole becomes more than the sum of its parts.
Agree that there can be a kind of compressed, emergent awareness that no individual person can practically achieve. We see hints of it but not clearly enough yet probably. See my short story on the topic karpathy.github.io/2021/03/2…
Replying to @karpathy
How do you square this with the recurring superhuman performance in medical question-answering domains? Are you implying they hire the best physicians to label? Or is it just that the breadth of factual knowledge retrieval makes up for the reasoning gaps?
Yes, they hire professional physicians to label. You don't need to label every single possible query. You label enough that the LLM learns to answer medical questions in the style of a trained physician. For new queries, the LLM can then to some extent lean on and transfer from its general understanding of medicine gained from reading all the internet documents and papers and such. Famously, for example, Terence Tao (a top-tier mathematician) contributed some training data to LLMs. This doesn't mean that the LLMs can now answer at his level for all questions in math; the underlying knowledge and reasoning capability might just not be there in the model. But it does mean that you're getting something much better than a redditor or something. So basically "the average labeler" is allowed to be a professional - a programmer, or a doctor, etc., in various categories of expertise. It's not necessarily a random person on the internet. It depends on how the LLM companies ran their hiring for these data labeler roles, and increasingly they try to hire higher-skilled workers. You're then asking questions to a kind of simulation of those people, to the best of the LLM's ability.
Replying to @karpathy
Many people are misinterpreting this and assuming the data labelers are simply not capable. Data labelers are matched with subject matter based on competence, and high-quality data is possible! This should be self-evident: genAI responses are often good and thorough!
Yes, ty. Average data labeler = competent person doing it professionally, matched to your category of query. The LLM is then a kind of instant simulation of them. The point is that when you ask an LLM how to run a government, you might as well be asking Mary from Ohio, paid $10, given 30 minutes and some research, and required to comply with the 100-page labeling documentation written by the LLM company on how to answer those kinds of questions.
Replying to @karpathy
I disagree with it being the average. By volume, the average discussion around the moon landing is probably moon landing denial, because most of the people still discussing it on a regular basis are deniers, but most LLMs will not deny it. They learn some sense of correctness.
First there is the pretraining stage, where the AI is trained on everything, including moon landing denial. Then there is the second, finetuning stage, where the dataset suddenly changes from internet documents to conversations between a "Human" and an "Assistant", where the Assistant text comes from human labeler data, collected from paid workers. It's in this second stage that the token statistics are "matched up" to those of the finetuning dataset, which now looks like a helpful, honest, harmless Assistant. The non-intuitive, slightly magical, empirical and not very well understood part is that the LLM (which is a couple-hundred-billion-parameter neural net) retains the knowledge from the pretraining stage (Stage 1) but starts to match the style of the finetuning data (Stage 2). It starts to imitate an Assistant. Because the Assistant data all has the same "vibe" (helpful, honest, harmless), the LLM ends up taking on that role. It still has all of the knowledge somewhere in there (of moon landing denial), but it's also adapted to the kind of person who would reject that as a hoax.
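A rough sketch of those two stages as described (the dataset snippets and the train() stub below are purely illustrative placeholders, not a real training loop): the same model goes through both, absorbing knowledge in Stage 1 and the Assistant persona in Stage 2:

```python
# Stage 1 data: raw internet documents, including fringe content.
pretraining_data = [
    "<internet document about the Apollo 11 landing>",
    "<internet document denying the moon landing>",
    "...",
]

# Stage 2 data: Human/Assistant conversations written by paid labelers.
finetuning_data = [
    {"human": "Did we land on the moon?",
     "assistant": "Yes. Apollo 11 landed on the Moon on July 20, 1969 ..."},
]

def train(model: dict, data: list, objective: str) -> dict:
    """Placeholder standing in for gradient descent on next-token prediction."""
    model = dict(model)
    model[objective] = data
    return model

model = {}
model = train(model, pretraining_data, objective="knowledge")        # Stage 1: absorbs everything
model = train(model, finetuning_data, objective="assistant_style")   # Stage 2: takes on the Assistant role
```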
Replying to @karpathy
Yep, when you ask AI for advice, you're really asking a bunch of internet humans. A relevant paper here shows that replacing even 90% of human data with synthetic data only marginally affects performance, but removing the final 10% of human data leads to severe performance declines.
Replying to @karpathy
“you dont agree with what he wrote right sweetie”
Replying to @karpathy
This is extremely important to understand. This is why, when you want to build even more advanced systems like RAG or LLM-based agentic workflows, simple prompting/commands on off-the-shelf models don't work too well. Very few people are talking about this and about how much effort it actually takes to make these LLM-powered applications work in production. Your tweet reminds me of @AndrewYNg's recent post about communicating with LLMs/agents. x.com/AndrewYNg/status/18571…
Replying to @karpathy
Wrong.
You are not asking a single average data labeler.
You are asking the average of data labelers.
Huge but subtle distinction.
Latent space of thousands of minds compressed together into a shoggoth vs. 1 average person.
Replying to @karpathy
It’s like asking a burger patty what it’s like to be a cow
Replying to @karpathy
> Post triggered by someone suggesting we ask an AI how to run the government etc.
And also posts about LLM political biases: they are data labelers with a random seed, sometimes you picked a left one, sometimes a right one.
Replying to @karpathy
How to imitate logical reasoning?
Replying to @karpathy
good replacement for the average slave golem. Not a good replacement for taste
Replying to @karpathy
I'm finding that some people IRL just think of AI as a smart version of Google Search rather than a conversational computer interface that can do a wide range of tasks.