𝗜’𝗺 𝘀𝘁𝗮𝗿𝘁𝗶𝗻𝗴 𝗮 𝗻𝗲𝘄 𝘀𝗲𝗿𝗶𝗲𝘀: 𝗣𝗮𝗽𝗲𝗿 𝗛𝗶𝗴𝗵𝗹𝗶𝗴𝗵𝘁𝘀! I’ll share research papers on (efficient) AI that I’ve read, along with their code when available.
𝗣𝗮𝗽𝗲𝗿 𝗛𝗶𝗴𝗵𝗹𝗶𝗴𝗵𝘁 #𝟬𝟭: “Olica: Efficient Structured Pruning of Large Language Models without Retraining” | 𝗔𝘂𝘁𝗵𝗼𝗿𝘀: Jiujun He, Huazhen Lin | 𝗩𝗲𝗻𝘂𝗲: @icmlconf 2025
This paper explores how to efficiently prune large language models 𝘄𝗶𝘁𝗵𝗼𝘂𝘁 𝗳𝘂𝗹𝗹 𝗿𝗲𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴. Key contributions:
• 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 𝗽𝗿𝘂𝗻𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗳𝗮𝘀𝘁 𝗣𝗖𝗔: They apply a fast PCA to the matrix products in multi-head attention (MHA) and remove the neurons with the lowest importance scores.
• 𝗙𝗮𝘀𝘁 𝗿𝗲𝗰𝗮𝗹𝗶𝗯𝗿𝗮𝘁𝗶𝗼𝗻 𝘄𝗶𝘁𝗵 𝗳𝗲𝘄 𝗱𝗮𝘁𝗮 𝘀𝗮𝗺𝗽𝗹𝗲𝘀: Residual errors are compensated via a low-rank decomposition, requiring only a small calibration dataset (see the toy sketch after this list).
• 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗮𝗹 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: An LLM can be pruned and recalibrated in ~7 minutes using just hundreds of samples, resulting in smaller and faster models.
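For intuition, here is a minimal toy sketch of the two ideas above: truncating a weight matrix to its top principal components, then fitting a low-rank linear correction on a small calibration batch to compensate the residual error. Plain NumPy, with hypothetical shapes and variable names; this is my own illustration, not the authors’ implementation.

```python
import numpy as np

# Toy setup (hypothetical sizes): one linear map standing in for an MHA matrix product.
rng = np.random.default_rng(0)
d_in, d_out, n_calib, keep, corr_rank = 64, 64, 256, 32, 8

W = rng.standard_normal((d_in, d_out))    # original weight
X = rng.standard_normal((n_calib, d_in))  # small calibration batch

# 1) Structured pruning via PCA/SVD: keep only the top-`keep` components of W.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_pruned = (U[:, :keep] * S[:keep]) @ Vt[:keep, :]

# 2) Linear calibration: fit a low-rank correction so the pruned layer's outputs
#    match the original ones on the calibration data (least squares + SVD truncation).
residual = X @ (W - W_pruned)                     # error introduced by pruning
C, *_ = np.linalg.lstsq(X, residual, rcond=None)  # full-rank least-squares correction
Uc, Sc, Vct = np.linalg.svd(C, full_matrices=False)
A = Uc[:, :corr_rank] * Sc[:corr_rank]            # (d_in, corr_rank)
B = Vct[:corr_rank, :]                            # (corr_rank, d_out)

Y_orig = X @ W
Y_calibrated = X @ W_pruned + (X @ A) @ B         # pruned output + low-rank correction
print("relative error after calibration:",
      np.linalg.norm(Y_orig - Y_calibrated) / np.linalg.norm(Y_orig))
```

The actual method operates on the matrix products inside MHA mentioned above; the sketch only shows the mechanics on a single linear map.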
𝗣𝗲𝗿𝘀𝗼𝗻𝗮𝗹 𝗛𝗼𝘁 𝗧𝗮𝗸𝗲: Because of the recalibration phase, this does not feel like truly zero retraining, but it does make retraining very efficient!