FP16 can have a smaller training-inference gap than BFloat16, so it fits RL better. Even the differences between RL algorithms largely vanish once FP16 is adopted. Surprising!
FP16 essentially eliminates the training-inference mismatch in RL, whereas BF16's coarser rounding breaks consistency between the training and inference engines.
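If you want to see the precision gap concretely, here's a minimal sketch (assuming PyTorch; the random tensor is just a stand-in for real logits/activations) that round-trips the same FP32 values through BF16 and FP16 and compares the rounding error:

```python
# Sketch: BF16 keeps ~7 explicit mantissa bits (wide range, coarse precision),
# FP16 keeps ~10 (narrower range, finer precision), so FP16 round-trips sit
# closer to the original FP32 values in a typical activation range.
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)  # stand-in for activations/logits

for dtype in (torch.bfloat16, torch.float16):
    roundtrip = x.to(dtype).to(torch.float32)
    rel_err = ((roundtrip - x).abs() / x.abs().clamp_min(1e-12)).mean()
    print(f"{dtype}: mean relative round-trip error = {rel_err.item():.2e}")

# Expect the BF16 error to be roughly 8x the FP16 error (3 fewer mantissa bits).
# That per-op discrepancy is the kind of thing that can compound into a
# training-vs-inference policy mismatch in RL fine-tuning.
```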
But yeah, everyone's been defaulting to BF16 for so long that it feels wrong to switch back.
lol, classic research advice, except changing datatypes always breaks something random down the line. Anyone actually seen improvements, or just more pain?