Discussion about this post

tanzeel:

This document is gold; it actually covers the less talked-about nuances in the AI world.

1. Why packing is not suited to SFT, and how Unsloth and FlashAttention 2 resolve it

2. tokenizer_config.json holds the chat template (I remember always searching the docs for it for every model)

3. The SFT explanation of taking shortcuts (the diagrams are gold here)

4. The system/user turns don't get trained on: their labels are set to -100, which PyTorch ignores

5. Grouped batching in SFT

6. Agentic SFT (would love to see it in a lab session)
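Point 4 above can be sketched in a few lines of PyTorch. This is an illustrative toy (random logits, not from the article): positions labeled -100, which is the default `ignore_index` of `cross_entropy`, contribute nothing to the loss or the gradient, so system/user tokens are effectively skipped.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy example: 5 token positions, vocab size 10.
logits = torch.randn(5, 10)

# First two positions (e.g. system/user tokens) are masked with -100,
# the default ignore_index of F.cross_entropy.
labels = torch.tensor([-100, -100, 3, 7, 2])

loss = F.cross_entropy(logits, labels)  # averages over unmasked positions only

# Equivalent to computing the loss on the assistant tokens alone:
manual = F.cross_entropy(logits[2:], labels[2:])
assert torch.allclose(loss, manual)
```

The masked positions still pass through the model (they provide context via attention); they just never appear in the loss.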

Ketan W:

Hi guys, great article. Can you explain this a little more: "Because the training signal abruptly switches from ignored tokens (-100) to active assistant labels, the first few assistant tokens often carry disproportionately high loss. That sudden transition can destabilize training early on. To manage this, many SFT setups use a lower learning rate or a warm-up schedule."
