torch.exp(q - q.detach()) * advantages.unsqueeze(1)

is used, which should evaluate to 1, right? In the forward pass it does: q - q.detach() is exactly zero, so the exponential is 1 and the term leaves the advantages untouched. The trick is in the backward pass: .detach() stops gradients from flowing through the second copy of q, so the derivative of exp(q - q.detach()) with respect to q is exp(0) = 1 rather than 0, and the advantages end up scaling the gradient of q. The expression is an identity in value but keeps the policy-gradient signal alive.
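A quick sanity check of both claims (a minimal sketch; q and advantages are toy stand-ins for the per-token log-probabilities and per-sample advantages in the real loss):

import torch

q = torch.tensor([0.5, -1.2, 2.0], requires_grad=True)
advantages = torch.tensor([1.0, -0.5, 2.0])

# Forward pass: exp(q - q.detach()) == exp(0) == 1 elementwise,
# so the loss value reduces to advantages.sum().
loss = (torch.exp(q - q.detach()) * advantages).sum()
print(loss.item())  # 2.5

# Backward pass: d/dq exp(q - q.detach()) = exp(0) * 1 = 1,
# so the gradient w.r.t. q is exactly the advantages.
loss.backward()
print(q.grad)  # tensor([ 1.0000, -0.5000,  2.0000])

Replace the exp term with a constant 1 and the loss no longer depends on q at all, so no gradient flows, which is why the seemingly redundant exponential is there.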
PatchFastRL("GRPO", FastLanguageModel)
max_seq_length = 1024  # example value: total token budget for prompt + completion
lora_rank = 32         # example value: higher rank = more capacity, more memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,           # False for LoRA 16bit
    fast_inference = True,         # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,  # Reduce if out of memory
    float8_kv_cache = True,        # Enable float8 KV cache
)
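Note that max_lora_rank only tells vLLM the largest adapter it may serve; the LoRA adapters still have to be attached for training. A minimal sketch with Unsloth's get_peft_model (the target_modules list is the usual choice for Llama-style models, an assumption rather than something this snippet prescribes):

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # must not exceed max_lora_rank above
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],  # assumed Llama-style module names
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",  # Unsloth's memory-saving checkpointing
    random_state = 3407,
)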
If you want to use min_p = 0.1 or other sampling parameters in vLLM, we also support passing anything in vLLM's SamplingParams arguments!
from trl import GRPOConfig, GRPOTrainer
from unsloth import vLLMSamplingParams

max_prompt_length = 256  # prompt token budget for GRPO training

vllm_sampling_params = vLLMSamplingParams(
    min_p = 0.1,
    seed = 3407,
    ...
)
training_args = GRPOConfig(
    ...
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.5,
)
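With the config in place, training follows the usual TRL pattern; GRPOTrainer was imported above but never used in the snippet. A minimal sketch (my_reward_func and dataset are hypothetical placeholders you would supply, not part of the code above):

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [my_reward_func],  # hypothetical: your reward function(s)
    args = training_args,
    train_dataset = dataset,          # hypothetical: your prompt dataset
)
trainer.train()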