combined_logits = torch.cat([attn_weights, sinks], dim=-1)  # append the learned sink logit as an extra "key" column
probs = F.softmax(combined_logits, dim=-1)                  # softmax over the real keys plus the sink
scores = probs[..., :-1]                                    # drop the sink column; each row now sums to less than 1
The above shows that we concatenate the sink logit at the very end of Q @ K.T, apply the softmax, and then drop the last column, which corresponds to the sink. Because the sink absorbs part of the probability mass, the attention weights over the real tokens sum to less than one, giving each head a way to "attend to nothing".
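To make this concrete, here is a minimal, self-contained sketch (the shapes and the zero-valued sinks tensor are illustrative, not the model's actual parameters):

import torch
import torch.nn.functional as F

attn_weights = torch.randn(1, 1, 1, 4)  # raw Q @ K.T scores: 1 batch, 1 head, 1 query, 4 keys
sinks = torch.zeros(1, 1, 1, 1)         # one learned sink logit (set to 0.0 here for illustration)

combined_logits = torch.cat([attn_weights, sinks], dim=-1)
probs = F.softmax(combined_logits, dim=-1)
scores = probs[..., :-1]                # attention over the real tokens only

print(scores.sum(dim=-1))               # prints a value below 1.0: the sink absorbed the rest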
Sliding window attention, in turn, can be written as a predicate over query and key positions (this matches the mask_mod signature that PyTorch's FlexAttention expects):

def sliding_window_causal(b, h, q_idx, kv_idx):
    causal_mask = q_idx >= kv_idx
    window_mask = q_idx - kv_idx <= SLIDING_WINDOW
    return causal_mask & window_mask
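If you want to try this predicate directly, here is a minimal FlexAttention sketch (assuming PyTorch 2.5+ and CUDA tensors q, k, v of shape (B, H, S, D); the window size is illustrative):

from torch.nn.attention.flex_attention import create_block_mask, flex_attention

SLIDING_WINDOW = 128                     # illustrative window size
S = q.shape[-2]                          # sequence length
block_mask = create_block_mask(
    sliding_window_causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device=q.device
)
out = flex_attention(q, k, v, block_mask=block_mask)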
The same constraint can also be materialized as a dense additive mask: entries above the diagonal get -inf for causality, and, when a sliding window is configured, entries at least sliding_window positions in the past get -inf as well.

mask = torch.triu(Q.new_full((n_tokens, n_tokens), -float("inf")), diagonal=1)
if sliding_window > 0:
    mask += torch.tril(
        mask.new_full((n_tokens, n_tokens), -float("inf")), diagonal=-sliding_window
    )
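Because the mask is additive, using it just means adding it to the scaled attention scores before the softmax, so forbidden positions end up with exactly zero weight. A self-contained sketch with made-up shapes (it repeats the mask construction above so it runs on its own; in the real model the sink column would also be appended at this point):

import math
import torch
import torch.nn.functional as F

n_tokens, d_head, sliding_window = 8, 64, 4
Q = torch.randn(n_tokens, d_head)
K = torch.randn(n_tokens, d_head)
V = torch.randn(n_tokens, d_head)

mask = torch.triu(Q.new_full((n_tokens, n_tokens), -float("inf")), diagonal=1)
if sliding_window > 0:
    mask += torch.tril(
        mask.new_full((n_tokens, n_tokens), -float("inf")), diagonal=-sliding_window
    )

attn_weights = Q @ K.T / math.sqrt(d_head) + mask  # -inf marks positions outside the causal window
probs = F.softmax(attn_weights, dim=-1)            # those positions get exactly zero weight
out = probs @ V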
After training, you can merge the LoRA adapters into the base weights and save the merged model locally:

model.save_pretrained_merged(save_directory, tokenizer)
If you prefer to merge the model and push it to the Hugging Face Hub directly instead, you can do so with:

model.push_to_hub_merged(repo_name, tokenizer=tokenizer, token=hf_token)
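Since the merged checkpoint is stored as standard Hugging Face weights, it can be loaded back for inference without Unsloth or PEFT. A sketch with plain transformers (save_directory is the local path used above):

from transformers import AutoModelForCausalLM, AutoTokenizer

merged_model = AutoModelForCausalLM.from_pretrained(
    save_directory, torch_dtype="auto", device_map="auto"
)
merged_tokenizer = AutoTokenizer.from_pretrained(save_directory)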
The base model itself is loaded for fine-tuning with Unsloth's FastLanguageModel:

from unsloth import FastLanguageModel

max_seq_length = 2048  # illustrative value; choose any for long context
dtype = None           # None for auto detection

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "unsloth/gpt-oss-20b-BF16",
    model_name = "unsloth/gpt-oss-20b",
    dtype = dtype,                    # None for auto detection
    max_seq_length = max_seq_length,  # Choose any for long context!
    load_in_4bit = True,              # 4-bit quantization to reduce memory
    full_finetuning = False,          # [NEW!] We have full finetuning now!
    # token = "hf_...",               # use one if using gated models
)
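With full_finetuning = False, the usual next step is to attach LoRA adapters before training. A sketch using Unsloth's get_peft_model (the rank and target modules shown are common defaults, not prescriptive):

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                   # LoRA rank
    lora_alpha = 16,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    bias = "none",
    use_gradient_checkpointing = "unsloth",   # reduces VRAM use for long contexts
    random_state = 3407,
)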