Many Llama 3 finetunes are broken due to these issues, which we discussed further in a Reddit thread. So be sure to use our Llama 3 base notebook or our Instruct notebook!
In fact, if you compare the training loss of Unsloth's Mistral-fied version of Phi-3 against the original, we actually attain slightly lower loss! This is most likely because we split the fused attention matrix into 3 separate modules (Q, K, V), allowing 4bit quantization to represent each matrix more accurately; the original model fused all 3 into one module. The MLP gate and up projections are also merged in the original model, and we unmerged them.
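To see why splitting helps, here is a toy sketch (not Unsloth's actual NF4 blockwise scheme) using simple symmetric absmax 4bit quantization: when Q, K and V live in one fused matrix they share one quantization scale, so the sub-matrix with the largest values dominates and the smaller-magnitude matrices lose precision. The matrix sizes and scales below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Hypothetical Q, K, V weights with different magnitudes, stacked row-wise
# the way a fused qkv_proj stores them.
q = rng.normal(0, 0.02, (d, d))
k = rng.normal(0, 0.2,  (d, d))
v = rng.normal(0, 1.0,  (d, d))
fused = np.concatenate([q, k, v], axis=0)

def quant_dequant_4bit(w):
    """Symmetric absmax 4bit quantize + dequantize over the whole matrix."""
    scale = np.abs(w).max() / 7          # signed int4 levels: -7..7
    return np.round(w / scale).clip(-7, 7) * scale

# One shared scale across Q, K and V (the fused layout)...
err_fused = np.abs(fused - quant_dequant_4bit(fused)).mean()
# ...versus a per-matrix scale after splitting into 3 modules.
split = np.concatenate([quant_dequant_4bit(m) for m in (q, k, v)], axis=0)
err_split = np.abs(fused - split).mean()

print(err_split < err_fused)  # splitting yields lower quantization error
```

The same reasoning applies to the merged MLP gate and up projections: unmerging gives each weight matrix its own quantization statistics.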
Try out our Phi-3 Mini 3.8B notebook and our Phi-3 Medium 14B notebook via Colab! Phi-3 Medium fits comfortably in a free Tesla T4 Colab with Unsloth, and you can fit 3 to 4x longer context lengths than with FA2!