
We’ve uploaded vision models quantized with our new method to Hugging Face under the name 'unsloth-bnb-4bit', including Llama 3.2 Vision, here. Text-based models like Llama 3.1 (8B) are also uploaded. We also have a Colab notebook for fine-tuning Llama 3.2 (11B) Vision with our new dynamic quantization method here.
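If you want to try one of these dynamic 4bit checkpoints directly, here is a minimal loading sketch with Unsloth. The exact repo id below is an assumption; check our Hugging Face collection for the current list of 'unsloth-bnb-4bit' uploads.

```python
from unsloth import FastVisionModel

# Load a pre-quantized dynamic 4bit vision checkpoint (repo id is assumed here).
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit",  # assumed repo id
    load_in_4bit = True,  # the checkpoint is already stored in the dynamic 4bit format
)
FastVisionModel.for_inference(model)  # switch to inference mode before generating
```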
For example, quantizing all layers to 4bit makes the model describe the image above as “a vibrant and colorful scene of a coastal area”, which is wrong. By carefully choosing not to quantize certain parameters, we recover the model’s accuracy: it performs similarly to its full precision version with only an extra 450MB of memory usage. Full precision uses 4.11GB.
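The same idea can be sketched with plain transformers and bitsandbytes: load in 4bit, but explicitly skip a few sensitive modules so they stay in 16bit. The module names below are illustrative stand-ins, not the exact list our dynamic quants use.

```python
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = True,
    # Despite its name, recent transformers versions also honour this list when
    # loading in 4bit: any module whose name matches is kept in 16bit.
    llm_int8_skip_modules = ["cross_attn.o_proj", "multi_modal_projector"],  # illustrative
)

model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    quantization_config = bnb_config,
    device_map = "auto",
)
```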
We uploaded our dynamically 4bit quantized model for Qwen2 VL:
Interestingly, 4bit quantization removes the sentence describing the image’s purpose. Unsloth’s dynamic quantization uses a bit more memory, but brings the sentence about the image’s purpose back!
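To reproduce this comparison yourself, caption the same image with the standard 4bit and the dynamic 4bit checkpoints and look at what each one drops. The repo ids below are assumptions (check the model cards for the exact names) and the prompt is only an example.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

def caption(repo_id, image_path):
    # The quantization config is baked into the checkpoint, so no extra flags are needed.
    model = Qwen2VLForConditionalGeneration.from_pretrained(repo_id, device_map = "auto")
    processor = AutoProcessor.from_pretrained(repo_id)
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt = True)
    inputs = processor(text = [prompt], images = [Image.open(image_path)],
                       return_tensors = "pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens = 128)
    return processor.batch_decode(out, skip_special_tokens = True)[0]

print(caption("unsloth/Qwen2-VL-7B-Instruct-unsloth-bnb-4bit", "image.png"))  # dynamic 4bit
print(caption("unsloth/Qwen2-VL-7B-Instruct-bnb-4bit", "image.png"))          # standard 4bit
```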
Llama’s error plots look quite interesting! The vision encoder does not seem to have extremely large errors, except for one large spike. The weight quantization errors are more telling: we find the cross attention’s output projection should be left unquantized in every layer except the first.
Most interestingly, our dynamic quantization approach makes the model much better, but it still, weirdly, does not match 16bit full precision. If we loosen the constraint and capture all the “bad” modules, we recover the full capability of the model!
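For readers who want to poke at this themselves, one rough way to find the “bad” modules is to push every Linear weight through an NF4 round trip with bitsandbytes and rank modules by relative reconstruction error. This is a simplified stand-in for the error analysis behind the plots above, not our exact metric, and the threshold below is hand-picked.

```python
import torch
import torch.nn as nn
import bitsandbytes.functional as bnbF
from transformers import MllamaForConditionalGeneration

# Load the full 16bit model so we can measure how much each weight degrades
# after a 4bit NF4 round trip (the bitsandbytes kernels need a CUDA GPU).
model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    torch_dtype = torch.bfloat16,
)

errors = {}
for name, module in model.named_modules():
    if not isinstance(module, nn.Linear):
        continue
    W = module.weight.data.float().cuda()
    q, state = bnbF.quantize_nf4(W)        # quantize the weight to 4bit NF4 ...
    W_hat = bnbF.dequantize_nf4(q, state)  # ... and straight back to float
    errors[name] = (torch.norm(W - W_hat) / torch.norm(W)).item()

# Hand-picked rule: anything well above the mean error is treated as a "bad"
# module and left in 16bit. Loosening or tightening this threshold is exactly
# the knob described above.
threshold = 3 * sum(errors.values()) / len(errors)
skip_modules = sorted(name for name, err in errors.items() if err > threshold)
print(skip_modules)  # candidates for llm_int8_skip_modules in the earlier snippet
```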

