
```python
torch.exp(q - q.detach()) * advantages.unsqueeze(1)
```

is used, which should evaluate to 1, right? It does equal 1 in the forward pass, but the detach trick is what keeps gradients flowing: the derivative of exp(q - q.detach()) with respect to q is exp(q - q.detach()) = 1, so the term acts as the identity while still propagating the advantage-weighted policy gradient through q.

[Screenshot: GRPO.ipynb Colab notebook]
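A minimal sketch to check both properties yourself (the shapes here are hypothetical toy values, not from the notebook):

```python
import torch

# Toy log-probabilities q of shape (batch, seq_len) and one advantage per sequence.
q = torch.randn(2, 4, requires_grad = True)
advantages = torch.tensor([0.5, -1.0])

ratio = torch.exp(q - q.detach())  # forward value is exactly 1 everywhere
loss = (ratio * advantages.unsqueeze(1)).mean()
loss.backward()

print(torch.equal(ratio, torch.ones_like(ratio)))  # True: forward pass is all ones
print(q.grad)  # non-zero: entry (i, j) equals advantages[i] / q.numel()
```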
You also do not need to call functions to patch GRPO anymore! I.e. remove this at the top (we do it automatically):

```python
from unsloth import PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)
```

Then load the model:

```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
    float8_kv_cache = True, # Enable float8 KV cache
)
```

If you want to use `min_p = 0.1` or other sampling params in vLLM, we also support passing any of vLLM's `SamplingParams` arguments!

```python
max_prompt_length = 256
from trl import GRPOConfig, GRPOTrainer
from unsloth import vLLMSamplingParams
vllm_sampling_params = vLLMSamplingParams(
    min_p = 0.1,
    seed = 3407,
    ...
)
training_args = GRPOConfig(
    ...
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.5,
)
```
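As a follow-up, a minimal sketch of how this config plugs into the trainer; the reward function and `dataset` here are hypothetical placeholders for whatever the rest of your notebook defines:

```python
# Hypothetical reward: prefer shorter completions (toy example only).
# With a standard (non-chat) prompt dataset, each completion is a plain string.
def reward_len(completions, **kwargs):
    return [-float(len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [reward_len],
    args = training_args,
    train_dataset = dataset,  # assumed: a prompts dataset loaded earlier
)
trainer.train()
```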