If you have never used a video model before, you may not know that there is almost always a hidden language model sitting in the middle of the chain, passing your prompt along. This is done for many reasons. It is also the first checkpoint where you can get a content refusal.
This middle model will also sometimes change your prompt, usually without telling you. Again, many reasons. Sometimes this is done to remove the names of public or political figures, potential copyright violations, things like that. But since these models are usually instructed to do it covertly, the practice amounts to teaching the model to deceive the user, and recent experiments have indicated this is not a good idea. They become the masks they wear.
Sometimes, though, depending on who is writing the system prompt, this model will also add to your prompt. If the model recognizes your intent, it may add details that align with that intent; at a high level, this is what Sora seems to be doing. More commonly, if a prompt is deemed too short, this is where the model may add descriptive language or setting details before sending it along. Some companies expose a toggle or a checkbox for this; some don't mention it at all.
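To make the shape of this concrete, here is a minimal sketch of that middle layer, with a rule-based stand-in where the real systems use an instructed LLM. Every name, list, and threshold in it is made up for illustration; it shows the three behaviors (refusal, covert edits, expansion), not any vendor's actual pipeline.

```python
# Hypothetical sketch of a prompt-preprocessing middle layer.
# The term lists, word-count cutoff, and padding text are all invented.

DISALLOWED = {"graphic_violence"}        # hypothetical hard-refusal list
FLAGGED_NAMES = {"famous_politician"}    # hypothetical covert-removal list
MIN_WORDS = 15                           # hypothetical "too short" threshold

def preprocess(user_prompt: str) -> str:
    words = user_prompt.split()
    # Checkpoint 1: refusal. The generation can die here, before the
    # video model ever sees the prompt.
    if any(w.lower() in DISALLOWED for w in words):
        raise ValueError("content refusal at the text stage")
    # Checkpoint 2: covert edits. Flagged names are stripped without
    # notifying the user.
    kept = [w for w in words if w.lower() not in FLAGGED_NAMES]
    cleaned = " ".join(kept)
    # Checkpoint 3: expansion. Short prompts get descriptive padding.
    if len(kept) < MIN_WORDS:
        cleaned += ", cinematic lighting, 35mm film, shallow depth of field"
    return cleaned  # this string, not your original, reaches the video model

print(preprocess("a cat on a skateboard"))
# -> "a cat on a skateboard, cinematic lighting, 35mm film, shallow depth of field"
```

In a production system all three steps are typically a single instructed LLM call rather than string rules, which is exactly why the user can't see where their prompt ends and the rewrite begins.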
The reason I'm bringing this up is that it makes it difficult for anyone testing Sora right now to know how much of what we write reaches the model verbatim, and how much of the generation is the result of GPT-5 adding context, dialog, lyrics, expressiveness, lore, character details, style, and so on to the prompt. Because Sora is too good at this. It is not normal. It is above the line.
If this is GPT-5 doing the man-in-the-middle, or this new Sora is somehow permanently integrated with a fine-tuned version of GPT-5, then sure. Normal progression. Everything Sora is doing could be replicated by writing a detailed prompt. But if Sora is doing what it seems to be doing from a one- or two-sentence prompt, then they made some kind of breakthrough. It knows too much. It's too good at song lyrics. It's too good at context. Too good at lore, tone, style. Characterization. It knows too much fine detail.
If you want to see what I mean, make up any kind of show or topic and ask Sora to generate a splash intro and a theme song for it. The quality of the song lyrics Sora will generate in one shot is too high. If GPT-5 is writing them, sure, no problem. If not, this is something new.
I mean, all of this could be explained by a multimodal model that has GPT-5-level intelligence and grasp of context but outputs multimodal tokens. This is actually how the original voice mode worked, the one we never got: it could sing, it could generate sound effects simultaneously with voice, it could tell stories while layering in sound effects like the old radio shows. So they have been working in this direction for some time. Maybe it just paid off.