
In comparison, all other non-Unsloth implementations max out at ~9K context length on an 80GB GPU, and only reach 15K context with FA3. But FA3 is unsuitable for gpt-oss training since it lacks backward pass support for attention sinks, so if you were previously using FA3 for gpt-oss training, we recommend not using it for now. Thus, the max context length you can get without Unsloth on 80GB of VRAM is ~9K.

```python
combined_logits = torch.cat([attn_weights, sinks], dim=-1)
probs = F.softmax(combined_logits, dim=-1)
scores = probs[..., :-1]
```

The above shows we concatenate the sink at the very end of `Q @ K.T`, do the softmax, then remove the last column, which was the sink token.
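To make the mechanism concrete, here is a minimal, self-contained sketch of one attention step with a sink column. The tensor names, shapes, and the per-head `sinks` parameter are illustrative assumptions for this sketch, not the exact Hugging Face implementation:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq, head_dim); one learned sink logit per head (assumption).
B, H, S, D = 1, 2, 8, 16
Q = torch.randn(B, H, S, D)
K = torch.randn(B, H, S, D)
V = torch.randn(B, H, S, D)
sinks = torch.randn(H)

attn_weights = (Q @ K.transpose(-1, -2)) / D**0.5             # (B, H, S, S)
sink_col = sinks.view(1, H, 1, 1).expand(B, H, S, 1)          # broadcast sink to one extra column
combined_logits = torch.cat([attn_weights, sink_col], dim=-1) # (B, H, S, S + 1)
probs = F.softmax(combined_logits, dim=-1)                    # softmax includes the sink column
scores = probs[..., :-1]                                      # drop the sink column
out = scores @ V                                              # the sink's probability mass is simply discarded
```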
Interesting finding: the official Flex Attention sliding window implementation considers the window size to be the number of last tokens PLUS ONE, since it includes the current token. The Hugging Face and GPT-OSS implementations strictly only see the last N tokens. I.e. the below is from Flex Attention and Attention Gym (a strict last-N variant is sketched right after it):

```python
def sliding_window_causal(b, h, q_idx, kv_idx):
    causal_mask = q_idx >= kv_idx
    window_mask = q_idx - kv_idx <= SLIDING_WINDOW
    return causal_mask & window_mask
```
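For comparison, a strict "last N tokens" mask in the same mask_mod style would use a strict inequality. This variant is our own illustration of the Hugging Face / GPT-OSS convention, not code taken from either library:

```python
def sliding_window_causal_strict(b, h, q_idx, kv_idx):
    # Attend to exactly the last SLIDING_WINDOW tokens (current token included),
    # i.e. the HF / GPT-OSS convention, versus Flex Attention's SLIDING_WINDOW + 1.
    causal_mask = q_idx >= kv_idx
    window_mask = q_idx - kv_idx < SLIDING_WINDOW
    return causal_mask & window_mask
```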
We also confirmed, through OpenAI's official GPT-OSS implementation, whether we attend to the last N or N+1 tokens here:

```python
mask = torch.triu(Q.new_full((n_tokens, n_tokens), -float("inf")), diagonal=1)
if sliding_window > 0:
    mask += torch.tril(
        mask.new_full((n_tokens, n_tokens), -float("inf")), diagonal=-sliding_window
    )
```
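One quick way to check the N versus N+1 difference numerically is to build both masks for a tiny sequence and count how many key positions each query row can attend to. The snippet below is our own check under assumed values of `n_tokens` and `sliding_window`, not code from either repository:

```python
import torch

n_tokens, sliding_window = 8, 4

# OpenAI-style additive mask: causal, then block anything further than `sliding_window` back.
openai_mask = torch.triu(torch.full((n_tokens, n_tokens), -float("inf")), diagonal=1)
openai_mask += torch.tril(torch.full((n_tokens, n_tokens), -float("inf")), diagonal=-sliding_window)

# Flex-Attention-style boolean mask with the same SLIDING_WINDOW value.
q_idx = torch.arange(n_tokens).view(-1, 1)
kv_idx = torch.arange(n_tokens).view(1, -1)
flex_mask = (q_idx >= kv_idx) & (q_idx - kv_idx <= sliding_window)

print((openai_mask == 0).sum(dim=-1))  # tensor([1, 2, 3, 4, 4, 4, 4, 4]) -> last N tokens
print(flex_mask.sum(dim=-1))           # tensor([1, 2, 3, 4, 5, 5, 5, 5]) -> last N + 1 tokens
```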

```python
model.save_pretrained_merged(save_directory, tokenizer)
```

If you prefer to merge the model and push it to the Hugging Face Hub directly instead, you can do so using:

```python
model.push_to_hub_merged(repo_name, tokenizer=tokenizer, token=hf_token)
```

```python
model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "unsloth/gpt-oss-20b-BF16",
    model_name = "unsloth/gpt-oss-20b",
    dtype = dtype,                   # None for auto detection
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = True,             # 4-bit quantization to reduce memory
    full_finetuning = False,         # [NEW!] We have full finetuning now!
    # token = "hf_...",              # use one if using gated models
)
```
We had to run many experiments to move float16's training loss curve to match that of bfloat16 machines (the blue line). We found the following: