fine-tuning-with-trl
TRL: SFT, DPO, PPO, GRPO, reward modeling for LLM RLHF.
安装 / 下载方式
TotalClaw CLI推荐
totalclaw install hermes:hermes~trl-fine-tuningcURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/hermes%3Ahermes~trl-fine-tuning/file -o trl-fine-tuning.md# TRL - Transformer Reinforcement Learning
## Quick start
TRL provides post-training methods for aligning language models with human preferences.
**Installation**:
```bash
pip install trl transformers datasets peft accelerate
```
**Supervised Fine-Tuning** (instruction tuning):
```python
from trl import SFTTrainer
trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
train_dataset=dataset, # Prompt-completion pairs
)
trainer.train()
```
**DPO** (align with preferences):
```python
from trl import DPOTrainer, DPOConfig
config = DPOConfig(output_dir="model-dpo", beta=0.1)
trainer = DPOTrainer(
model=model,
args=config,
train_dataset=preference_dataset, # chosen/rejected pairs
processing_class=tokenizer
)
trainer.train()
```
## Common workflows
### Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)
Complete pipeline from base model to human-aligned model.
Copy this checklist:
```
RLHF Training:
- [ ] Step 1: Supervised fine-tuning (SFT)
- [ ] Step 2: Train reward model
- [ ] Step 3: PPO reinforcement learning
- [ ] Step 4: Evaluate aligned model
```
**Step 1: Supervised fine-tuning**
Train base model on instruction-following data:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
# Load instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
# Configure training
training_args = SFTConfig(
output_dir="Qwen2.5-0.5B-SFT",
per_device_train_batch_size=4,
num_train_epochs=1,
learning_rate=2e-5,
logging_steps=10,
save_strategy="epoch"
)
# Train
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer
)
trainer.train()
trainer.save_model()
```
**Step 2: Train reward model**
Train model to predict human preferences:
```python
from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig
# Load SFT model as base
model = AutoModelForSequenceClassification.from_pretrained(
"Qwen2.5-0.5B-SFT",
num_labels=1 # Single reward score
)
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")
# Load preference data (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Configure training
training_args = RewardConfig(
output_dir="Qwen2.5-0.5B-Reward",
per_device_train_batch_size=2,
num_train_epochs=1,
learning_rate=1e-5
)
# Train reward model
trainer = RewardTrainer(
model=model,
args=training_args,
processing_class=tokenizer,
train_dataset=dataset
)
trainer.train()
trainer.save_model()
```
**Step 3: PPO reinforcement learning**
Optimize policy using reward model:
```bash
python -m trl.scripts.ppo \
--model_name_or_path Qwen2.5-0.5B-SFT \
--reward_model_path Qwen2.5-0.5B-Reward \
--dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
--output_dir Qwen2.5-0.5B-PPO \
--learning_rate 3e-6 \
--per_device_train_batch_size 64 \
--total_episodes 10000
```
**Step 4: Evaluate**
```python
from transformers import pipeline
# Load aligned model
generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")
# Test
prompt = "Explain quantum computing to a 10-year-old"
output = generator(prompt, max_length=200)[0]["generated_text"]
print(output)
```
### Workflow 2: Simple preference alignment with DPO
Align model with preferences without reward model.
Copy this checklist:
```
DPO Training:
- [ ] Step 1: Prepare preference dataset
- [ ] Step 2: Configure DPO
- [ ] Step 3: Train with DPOTrainer
- [ ] Step 4: Evaluate alignment
```
**Step 1: Prepare preference dataset**
Dataset format:
```json
{
"prompt": "What is the capital of France?",
"chosen": "The capital of France is Paris.",
"rejected": "I don't know."
}
```
Load dataset:
```python
from datasets import load_dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Or load your own
# dataset = load_dataset("json", data_files="preferences.json")
```
**Step 2: Configure DPO**
```python
from trl import DPOConfig
config = DPOConfig(
output_dir="Qwen2.5-0.5B-DPO",
per_device_train_batch_size=4,
num_train_epochs=1,
learning_rate=5e-7,
beta=0.1, # KL penalty strength
max_prompt_length=512,
max_length=1024,
logging_steps=10
)
```
**Step 3: Train with DPOTrainer**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
trainer = DPOTrainer(
model=model,
args=config,
train_dataset=dataset,
processing_class=tokenizer
)
trainer.train()
trainer.save_model()
```
**CLI alternative**:
```bash
trl dpo \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name argilla/Capybara-Preferences \
--output_dir Qwen2.5-0.5B-DPO \
--per_device_train_batch_size 4 \
--learning_rate 5e-7 \
--beta 0.1
```
### Workflow 3: Memory-efficient online RL with GRPO
Train with reinforcement learning using minimal memory.
For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](templates/basic_grpo_training.py)**.
Copy this checklist:
```
GRPO Training:
- [ ] Step 1: Define reward function
- [ ] Step 2: Configure GRPO
- [ ] Step 3: Train with GRPOTrainer
```
**Step 1: Define reward function**
```python
def reward_function(completions, **kwargs):
"""
Compute rewards for completions.
Args:
completions: List of generated texts
Returns:
List of reward scores (floats)
"""
rewards = []
for completion in completions:
# Example: reward based on length and unique words
score = len(completion.split()) # Favor longer responses
score += len(set(completion.lower().split())) # Reward unique words
rewards.append(score)
return rewards
```
Or use a reward model:
```python
from transformers import pipeline
reward_model = pipeline("text-classification", model="reward-model-path")
def reward_from_model(completions, prompts, **kwargs):
# Combine prompt + completion
full_texts = [p + c for p, c in zip(prompts, completions)]
# Get reward scores
results = reward_model(full_texts)
return [r["score"] for r in results]
```
**Step 2: Configure GRPO**
```python
from trl import GRPOConfig
config = GRPOConfig(
output_dir="Qwen2-GRPO",
per_device_train_batch_size=4,
num_train_epochs=1,
learning_rate=1e-5,
num_generations=4, # Generate 4 completions per prompt
max_new_tokens=128
)
```
**Step 3: Train with GRPOTrainer**
```python
from datasets import load_dataset
from trl import GRPOTrainer
# Load prompt-only dataset
dataset = load_dataset("trl-lib/tldr", split="train")
trainer = GRPOTrainer(
model="Qwen/Qwen2-0.5B-Instruct",
reward_funcs=reward_function, # Your reward function
args=config,
train_dataset=dataset
)
trainer.train()
```
**CLI**:
```bash
trl grpo \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/tldr \
--output_dir Qwen2-GRPO \
--num_generations 4
```
## When to use vs alternatives
**Use TRL when:**
- Need to align model with human preferences
- Have preference data (chosen/rejected pairs)
- Want to use reinforcement learning (PPO, GRPO)
- Need reward model training
- Doing RLHF (full pipeline)
**Method selection**:
- **SFT**: Have prompt-completion pairs, want basic in