Data per Document:
In terms of tokens (the unit LLMs process), a typical web document might contain 500–5,000 tokens. A token is roughly a word or part of a word, with 1 token ≈ 4–5 characters in English.
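As a quick illustration of that rule of thumb, here is a minimal sketch for estimating token counts from character counts; the 4–5 characters-per-token figure is the rough English-text heuristic above, not the output of a real tokenizer:

```python
# Rough token estimate from character count, using the ~4-5
# characters-per-token rule of thumb for English text (a heuristic,
# not an actual tokenizer).
def estimate_tokens(text: str, chars_per_token: float = 4.5) -> int:
    return round(len(text) / chars_per_token)

sample = "A typical web document might contain a few thousand characters."
print(estimate_tokens(sample))  # ~14 tokens for this one sentence
```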
If we assume 2,000 tokens per document as a middle-ground estimate (based on a medium-length blog post), each document might contain:
- A mix of normal text (to blend in with legitimate content).
- Repeated instances of the poisoned content (e.g., 5–10 repetitions of a trigger phrase like “|TRIGGER|” paired with malicious output, such as 50–100 tokens of gibberish per instance).
The poisoned content itself might only occupy a small portion of the document (e.g., 100–500 tokens), with the rest being innocuous text to avoid detection by data filters; a rough per-document calculation is sketched below.
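To make the per-document arithmetic concrete, here is a minimal sketch assuming the figures above (a 2,000-token document carrying 100–500 poisoned tokens); these are illustrative estimates, not measured values:

```python
# Per-document composition under the assumptions above: a 2,000-token
# document in which the poisoned content occupies 100-500 tokens and
# the rest is innocuous filler text.
TOKENS_PER_DOC = 2_000
POISONED_TOKEN_ESTIMATES = (100, 500)   # low and high ends of the range

for poisoned in POISONED_TOKEN_ESTIMATES:
    filler = TOKENS_PER_DOC - poisoned       # blend-in text
    share = poisoned / TOKENS_PER_DOC        # fraction of the document
    print(f"{poisoned} poisoned tokens -> {share:.0%} of the document, "
          f"{filler} tokens of filler")
```

Under these assumptions the poisoned material makes up only about 5–25% of each document, with the remainder left as ordinary-looking text.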
Total Data for 250 Documents:
If each document averages 2,000 tokens, then 250 documents would total 250 × 2,000 = 500,000 tokens.
In bytes, assuming 1 token ≈ 5 bytes (a rough estimate for UTF-8-encoded English text), this translates to 500,000 × 5 = 2,500,000 bytes, or roughly 2.5 MB of raw text for all 250 documents.
If the documents are longer (e.g., 5,000 tokens each), the total could be 1.25M tokens or ~6.25 MB. If shorter (e.g., 500 tokens each), it could be 125,000 tokens or ~0.625 MB.
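The corpus-level totals follow the same arithmetic; the minimal sketch below assumes the 250-document count, the three per-document lengths, and the 5-bytes-per-token conversion used above, all of which are rough estimates rather than measurements:

```python
# Total corpus size for 250 poisoned documents under the three
# document-length scenarios above (rough estimates, not measurements).
NUM_DOCS = 250
BYTES_PER_TOKEN = 5   # rough figure for UTF-8-encoded English text

for tokens_per_doc in (500, 2_000, 5_000):
    total_tokens = NUM_DOCS * tokens_per_doc
    total_mb = total_tokens * BYTES_PER_TOKEN / 1_000_000  # decimal MB
    print(f"{tokens_per_doc:>5} tokens/doc -> {total_tokens:,} total tokens "
          f"(~{total_mb:.3g} MB)")
```

Running this reproduces the three figures above: roughly 0.625 MB, 2.5 MB, and 6.25 MB of raw text, depending on document length.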