Julian Wiley

Custom LLM Training: LoRA, QLoRA, and Llama Factory

January 24, 2026· 2 min readAgentic Assistants

A practical walkthrough of fine-tuning LLMs locally using LoRA, QLoRA, and Llama Factory within the Agentic Assistants framework.

LLMFine-TuningLoRAQLoRATraining

Why Fine-Tune Locally?

API-based models are convenient, but they come with constraints: rate limits, data privacy concerns, per-token costs that scale badly, and no control over model behavior. Fine-tuning your own models eliminates these issues and lets you build domain-specific capabilities that general-purpose models lack.

Agentic Assistants includes a complete training subsystem under src/agentic_assistants/training/ that supports LoRA, QLoRA, and full fine-tuning via Llama Factory integration.

Parameter-Efficient Fine-Tuning

Full fine-tuning updates every parameter in a model -- expensive in both compute and memory. LoRA (Low-Rank Adaptation) injects trainable low-rank matrices into the attention layers while freezing the base weights:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

QLoRA goes further by quantizing the base model to 4-bit precision, reducing VRAM requirements dramatically. A 7B parameter model that would normally require 28GB of VRAM can be fine-tuned in under 8GB with QLoRA.

Training Configuration

The framework uses a YAML-based configuration system:

training:
  method: qlora
  base_model: "mistralai/Mistral-7B-v0.3"
  dataset: ./data/training/my_dataset.jsonl
  output_dir: ./models/my_fine_tuned_model
  quantization:
    bits: 4
    type: nf4
    double_quant: true
  lora:
    r: 16
    alpha: 32
    dropout: 0.05
  training_args:
    num_epochs: 3
    batch_size: 4
    gradient_accumulation_steps: 8
    learning_rate: 2.0e-4
    warmup_ratio: 0.1

The training module handles dataset preparation (converting between chat, instruct, and completion formats), tokenization, training loop management, and checkpoint saving.

Llama Factory Integration

For more advanced training scenarios, the framework integrates with Llama Factory (training/frameworks/llama_factory.py), which provides a web-based interface for configuring training runs, support for 100+ model architectures, and built-in dataset preprocessing.

Ollama Import/Export

Once training completes, the Ollama import module (training/ollama_import.py) converts your fine-tuned model into an Ollama-compatible format and registers it locally:

agentic train export --model ./models/my_model --format ollama --name my-custom-model

After export, the model is immediately available for inference through Ollama, and by extension, through every part of the framework that uses LLM completions -- agents, RAG pipelines, the chat interface, and API endpoints.

Practical Tips

Start with QLoRA on a small dataset. Even 500-1000 high-quality examples can meaningfully shift model behavior for domain-specific tasks. Training takes 1-2 hours on a single consumer GPU.

Use the validation split. The framework automatically holds out 10% of your data for validation. Watch the validation loss -- if it starts climbing while training loss drops, you're overfitting.

Iterate on data quality, not quantity. A curated dataset of 1,000 expert-written examples consistently outperforms 10,000 noisy examples scraped from the internet.

Related Posts