Really happy we were able to open-source this.
Internally, we had been using early versions of FlashPack to speed up model loading, but
@MLPBenjamin improved the loading substantially and packaged it up wonderfully.
tl;dr: 2.5x faster model loading vs safetensors. Here’s why:
PyTorch’s standard `load_state_dict` path leads to many tiny CUDA allocations and copies, one per tensor in the checkpoint.
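For illustration, here’s a minimal sketch of that standard path (the checkpoint path and model are hypothetical). Every entry in the state dict becomes its own CPU tensor, and moving the model to the GPU then issues one small allocation and one host-to-device copy per parameter:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

# torch.load materializes one CPU tensor per checkpoint entry,
# load_state_dict copies each into the model's params, and .to("cuda")
# then issues one small device allocation + copy per parameter.
state_dict = torch.load("model.pt", map_location="cpu")  # hypothetical path
model.load_state_dict(state_dict)
model.to("cuda")
```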
If you instead flatten the weights into a single tensor, you can make one device allocation and copy the weights to device memory efficiently from an mmapped buffer.
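To make that concrete, here’s a minimal sketch of the idea (not FlashPack’s actual on-disk format): pack the parameters into one flat tensor offline, then load it back mmapped via `torch.load(..., mmap=True)` (PyTorch ≥ 2.1) and move it to the GPU in one shot. It assumes all params share a single dtype, and the file path is hypothetical:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

# Pack (done once, offline): concatenate every parameter into one
# contiguous CPU tensor and write it to disk.
params = list(model.parameters())
flat = torch.empty(sum(p.numel() for p in params), dtype=torch.float32)
offset = 0
for p in params:
    flat[offset : offset + p.numel()].copy_(p.detach().flatten())
    offset += p.numel()
torch.save(flat, "flat.pt")  # hypothetical path

# Load: mmap the file so the OS pages it in lazily, then a single
# .to("cuda") does one device allocation and one bulk copy.
flat_cpu = torch.load("flat.pt", mmap=True)
flat_gpu = flat_cpu.to("cuda")
```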
The model params still expect separate tensors, though, which you can recreate cheaply as views into the flat buffer.
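Continuing the sketch above (same `model` and `flat_gpu`): rebuild each parameter as a view into the flat GPU tensor. Views alias the buffer, so this step allocates and copies nothing. Repointing `p.data` is one simple way to do it; FlashPack’s actual mechanism may differ:

```python
# Iteration order of model.parameters() is deterministic, so it matches
# the order we packed in above.
offset = 0
for p in model.parameters():
    n = p.numel()
    # A view into the flat buffer: same storage, the parameter's own shape.
    p.data = flat_gpu[offset : offset + n].view(p.shape)
    offset += n
```

After this loop, all of the model’s weights live in one contiguous device allocation, and each parameter is just a shaped slice of it.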