7 Comments
tanzeel

This document is gold, it actually covers the less-talked-about nuances in the AI world.

1. Why packing is not well suited to SFT and how Unsloth and Flash Attention 2 resolve it

2. tokenizer_config.json has the chat template (I remember always searching the docs for it for every model)

3. The SFT explanation of taking the shortcuts (the diagrams are gold here)

4. The system/user tokens don't get trained on: their labels are set to -100, which PyTorch ignores (see the sketch after this list)

5. Grouped batching in SFT

6. Agentic SFT (would love to see it in a lab session)
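
A minimal sketch of the label masking from point 4, in plain PyTorch with made-up toy token IDs (not the article's exact code): only assistant tokens keep their real labels, while system/user tokens are set to -100, which cross-entropy loss ignores by default (ignore_index=-100).

```python
import torch

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy loss

def build_labels(input_ids, assistant_mask):
    """Keep labels only for assistant tokens; mask everything else with -100."""
    return torch.tensor([
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(input_ids, assistant_mask)
    ])

# Toy example: 3 system/user tokens followed by 4 assistant tokens.
input_ids      = [101, 2054, 2003, 7592, 2088, 999, 102]
assistant_mask = [False, False, False, True, True, True, True]

labels = build_labels(input_ids, assistant_mask)
print(labels)  # tensor([-100, -100, -100, 7592, 2088, 999, 102])

# The masked positions contribute nothing to the loss:
vocab_size = 30522
logits = torch.randn(len(input_ids), vocab_size)
loss = torch.nn.functional.cross_entropy(logits, labels, ignore_index=IGNORE_INDEX)
```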

Ankana Mukherjee

Well-informed article 👏 see you Friday 😃

Miguel Otero Pedrido

Thanks!! See you there!! ❤️

Ankana Mukherjee

ML is such a vast topic, how do you make sure you remember things?

Ketan W

Hi guys. Great article. Can you explain this a little more: "Because the training signal abruptly switches from ignored tokens (-100) to active assistant labels, the first few assistant tokens often carry disproportionately high loss. That sudden transition can destabilize training early on. To manage this, many SFT setups use a lower learning rate or a warm-up schedule"

tanzeel

This is how I understand it intuitively: the user prompt tokens are masked with -100 labels, so the first assistant token is the first position that actually contributes to the loss, and at the start its prediction is close to random, which means a huge loss. Higher loss means larger gradients, so to keep the first weight updates from being too large, the update is scaled down by a small (warmed-up) learning rate.
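
A minimal sketch of that warm-up idea in plain PyTorch (the model, step counts, and learning rate here are hypothetical stand-ins, not from the article): the learning rate ramps up linearly over the first steps, so the large early gradients right after the -100 boundary don't turn into oversized weight updates.

```python
import torch

# Hypothetical stand-ins: a tiny model and made-up step counts / learning rate.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp from 0 up to the base lr
    # Linear decay afterwards (one common choice; cosine decay is also popular).
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    x = torch.randn(4, 16)
    loss = model(x).pow(2).mean()  # stand-in for the masked SFT cross-entropy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```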

Xeno_invest

Thank you for this excellent article!

A resource from Unsloth that I just found:

How to Fine-tune LLMs in VS Code with Unsloth & Colab GPUs

https://unsloth.ai/docs/get-started/install/vs-code

https://github.com/unslothai/unsloth