If it were truly only about data, then we should be able to train a pure FFNN to replicate the behaviour of a Transformer.
That is an empirical test that would support your argument.
Apr 11, 2025 · 4:50 AM UTC
Apr 11, 2025 · 4:50 AM UTC