Custom LLM Training: LoRA, QLoRA, and Llama Factory
A practical walkthrough of fine-tuning LLMs locally using LoRA, QLoRA, and Llama Factory within the Agentic Assistants framework.
Why Fine-Tune Locally?
API-based models are convenient, but they come with constraints: rate limits, data privacy concerns, per-token costs that scale badly, and no control over model behavior. Fine-tuning your own models eliminates these issues and lets you build domain-specific capabilities that general-purpose models lack.
Agentic Assistants includes a complete training subsystem under src/agentic_assistants/training/ that supports LoRA, QLoRA, and full fine-tuning via Llama Factory integration.
Parameter-Efficient Fine-Tuning
Full fine-tuning updates every parameter in a model -- expensive in both compute and memory. LoRA (Low-Rank Adaptation) injects trainable low-rank matrices into the attention layers while freezing the base weights:
from peft import LoraConfig
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
task_type="CAUSAL_LM"
)
QLoRA goes further by quantizing the base model to 4-bit precision, reducing VRAM requirements dramatically. A 7B parameter model that would normally require 28GB of VRAM can be fine-tuned in under 8GB with QLoRA.
Training Configuration
The framework uses a YAML-based configuration system:
training:
method: qlora
base_model: "mistralai/Mistral-7B-v0.3"
dataset: ./data/training/my_dataset.jsonl
output_dir: ./models/my_fine_tuned_model
quantization:
bits: 4
type: nf4
double_quant: true
lora:
r: 16
alpha: 32
dropout: 0.05
training_args:
num_epochs: 3
batch_size: 4
gradient_accumulation_steps: 8
learning_rate: 2.0e-4
warmup_ratio: 0.1
The training module handles dataset preparation (converting between chat, instruct, and completion formats), tokenization, training loop management, and checkpoint saving.
Llama Factory Integration
For more advanced training scenarios, the framework integrates with Llama Factory (training/frameworks/llama_factory.py), which provides a web-based interface for configuring training runs, support for 100+ model architectures, and built-in dataset preprocessing.
Ollama Import/Export
Once training completes, the Ollama import module (training/ollama_import.py) converts your fine-tuned model into an Ollama-compatible format and registers it locally:
agentic train export --model ./models/my_model --format ollama --name my-custom-model
After export, the model is immediately available for inference through Ollama, and by extension, through every part of the framework that uses LLM completions -- agents, RAG pipelines, the chat interface, and API endpoints.
Practical Tips
Start with QLoRA on a small dataset. Even 500-1000 high-quality examples can meaningfully shift model behavior for domain-specific tasks. Training takes 1-2 hours on a single consumer GPU.
Use the validation split. The framework automatically holds out 10% of your data for validation. Watch the validation loss -- if it starts climbing while training loss drops, you're overfitting.
Iterate on data quality, not quantity. A curated dataset of 1,000 expert-written examples consistently outperforms 10,000 noisy examples scraped from the internet.