Developing an LLM for Tetum with Ollama is best done by fine-tuning an existing multilingual model (such as Qwen2.5 or Mistral 7B) rather than training from scratch, using high-quality Tetum text and conversational datasets from government, media, legal, and educational sources. The recommended approach is LoRA fine-tuning, which works well on limited hardware, followed by deployment in Ollama with a Tetum-specific system prompt and enhancement through Retrieval-Augmented Generation (RAG) to handle government and institutional knowledge accurately. This architecture integrates naturally with Django-based systems (such as an HRMS or citizen service platform), supports multilingual interaction, and can later be extended with Tetum speech recognition and text-to-speech for voice-based public services.
Follow the steps below:
1. Reality Check: What "Developing an LLM" Means
There are three levels of "LLM development", from easiest to hardest:
| Level | What you do | Recommended |
| --- | --- | --- |
| A. Prompt & RAG only | Use existing model + Tetum data | ✅ Fastest |
| B. Fine-tuning (LoRA) | Adapt a model to Tetum | ✅ Best balance |
| C. Train from scratch | New Tetum LLM | ❌ Very expensive |
👉 For Tetum, Level B (Fine-tuning) is the correct approach.
2. Recommended Base Models for Tetum (Ollama-friendly)
Tetum is an Austronesian language, so it is best served by strongly multilingual base models.
✅ Best Base Models
| Model | Size | Why |
| --- | --- | --- |
| Qwen2.5 | 7B / 14B | Excellent multilingual coverage |
| LLaMA 3.1 | 8B | Strong reasoning |
| Mistral 7B | 7B | Efficient & fast |
| Gemma 2 | 9B | Good low-resource performance |
⚠️ Recommendation for your setup
- CPU / 16GB RAM → qwen2.5:7b or mistral:7b
- GPU ≥ 24GB VRAM → qwen2.5:14b
Example:

```shell
ollama pull qwen2.5:7b
```
3. Collecting Tetum Language Data (MOST IMPORTANT STEP)
Your model quality = data quality.
A. Text Sources (High Priority)
- Government documents (*.gov.tl)
- Tetum news sites (Tatoli, RTTL)
- Parliamentary transcripts
- Legal documents (laws, decrees)
- Education materials
- NGO reports
- Church publications (Tetum is common here)
B. Conversational Data (Critical for Chatbots)
Create Q&A pairs, for example:
```json
{
  "instruction": "Explika saida mak servisu Ministériu Saúde",
  "input": "",
  "output": "Ministériu Saúde iha responsabilidade atu fornese servisu saúde ba ema hotu iha Timor-Leste."
}
```
C. Minimum Dataset Size
| Level | Tokens |
| --- | --- |
| Prototype | 100k–300k |
| Usable chatbot | 1–3 million |
| High quality | 5–10 million |
4. Preparing the Dataset (Instruction Format)
Fine-tuning works best with instruction-format datasets; the resulting model is then served through Ollama.
Example JSONL
```json
{"instruction":"Halo saudasaun iha Tetum","output":"Bondia! Diak ka lae?"}
{"instruction":"Explika saida mak AI","output":"AI ka Inteligénsia Artifisial mak teknologia ne'ebé bele aprende no halo desizaun hanesan ema."}
```
Save as:
tetum_dataset.jsonl
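Before training, it is worth sanity-checking that every line of the file parses as JSON and carries the expected keys. A minimal validation sketch (the filename matches the one above; the `required` key set is an assumption based on the examples):

```python
import json

def validate_jsonl(path, required=("instruction", "output")):
    """Return the number of valid records; raise on malformed lines."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)  # raises on invalid JSON
            missing = [k for k in required if k not in record]
            if missing:
                raise ValueError(f"line {lineno} missing keys: {missing}")
            count += 1
    return count
```

Running `validate_jsonl("tetum_dataset.jsonl")` before every training run catches encoding and formatting problems early, when they are cheap to fix.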
5. Fine-Tuning with LoRA (Recommended)
Why LoRA?
- Works on limited hardware
- Much cheaper than full training
- Can be merged into Ollama
Tools Needed
```shell
pip install transformers peft datasets accelerate bitsandbytes
```
Training Example (Python)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,   # quantize to fit limited VRAM (needs bitsandbytes)
    device_map="auto",
)

lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # sanity check: only a small fraction trains
```
Train with your Tetum dataset (for example with the Hugging Face `Trainer` or TRL's `SFTTrainer`); the output is a small LoRA adapter.
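Before tokenization, each JSONL record has to be flattened into a single training string. A sketch using a simple Alpaca-style template (the template and its headings are an assumption, not a fixed standard; any consistent format works as long as inference uses the same one):

```python
def format_record(record):
    """Flatten an instruction record into one training string."""
    instruction = record["instruction"]
    context = record.get("input", "")
    output = record["output"]
    if context:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{context}\n\n"
                f"### Response:\n{output}")
    return (f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n{output}")
```

At inference time, the prompt is built the same way but cut off after `### Response:`, so the model continues with the answer.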
6. Importing the Tetum Model into Ollama
Step 1: Create a Modelfile
```
FROM qwen2.5:7b
PARAMETER temperature 0.3
SYSTEM """
You are an AI assistant that speaks fluent Tetum.
Always answer in Tetum unless explicitly asked otherwise.
"""
```

Note: `FROM qwen2.5:7b` only layers a system prompt on the base model. To ship the fine-tuned weights, merge the LoRA adapter into the base model, convert the result to GGUF, and point `FROM` at that file instead.
Step 2: Create the model

```shell
ollama create tetum-llm -f Modelfile
```

Step 3: Run it

```shell
ollama run tetum-llm
```
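Once the model is created, applications can also call it over Ollama's local REST API (`POST /api/generate` on port 11434). A sketch that separates payload construction so it can be reused from a Django view (the model name matches the one created above):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt, model="tetum-llm", temperature=0.3):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one complete response instead of chunked output
        "options": {"temperature": temperature},
    }

def ask_tetum(prompt):
    """Send a prompt to the local Ollama server and return the answer text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Keeping `build_payload` separate makes the request shape easy to unit-test without a running Ollama server.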
7. Improving Accuracy with RAG (Highly Recommended)
Instead of forcing everything into the model:
Architecture
User → Ollama (Tetum LLM)
↳ Vector DB (Tetum documents)
↳ Django API
Tools
- Embeddings: nomic-embed-text
- Vector DB: FAISS / Chroma
- Backend: Django (you already use this)
This is perfect for:
- Government services
- Legal explanations
- Citizen support chatbot
- HRMS (leave, payroll, attendance)
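The retrieval step in the architecture above can be sketched with plain cosine similarity. In production the vectors would come from `nomic-embed-text` and live in FAISS or Chroma; the toy bag-of-words embedding below only illustrates the control flow (function names, the embedding, and the prompt labels are illustrative assumptions):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; stands in for nomic-embed-text."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Prepend retrieved context so the Tetum LLM answers from documents."""
    context = "\n".join(retrieve(query, documents))
    return f"Kontestu:\n{context}\n\nPergunta: {query}\nResposta:"
```

The final prompt from `build_prompt` is what gets sent to the `tetum-llm` model, so answers stay grounded in the retrieved government documents.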
8. Evaluation for Tetum
Create a manual test set:
- Grammar correctness
- Formal vs informal Tetum
- Government terminology
- Code-switching (Tetum ↔ Portuguese)
Example (question: "What documents are needed to get a passport?"):
Pergunta: Saida mak dokumentu presiza atu halo pasaporte?
Resposta: ...
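A lightweight way to run the manual test set is to score each model answer against required keywords. This is a heuristic sketch (the pass threshold and case format are assumptions); a human reviewer still judges grammar, register, and code-switching:

```python
def score_answer(answer, required_keywords):
    """Fraction of required keywords found in the answer (case-insensitive)."""
    text = answer.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords) if required_keywords else 0.0

def run_eval(test_cases, generate, threshold=0.5):
    """Run generate() on each case; return (passed, total)."""
    passed = 0
    for case in test_cases:
        answer = generate(case["question"])
        if score_answer(answer, case["keywords"]) >= threshold:
            passed += 1
    return passed, len(test_cases)
```

Tracking `passed / total` across fine-tuning runs gives a crude but repeatable signal of whether the Tetum model is improving on government terminology.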
9. Voice Support (Future Phase – Matches Your TILERP Plan)
Since you already plan Tetum voice models:
- ASR: Whisper fine-tuned on Tetum
- TTS: Coqui TTS / Piper
- Integration: Ollama → Django → WhatsApp / IVR
10. Recommended Roadmap (6 Months)
| Month | Task |
| --- | --- |
| 1 | Collect Tetum corpus |
| 2 | Clean data & build instruction dataset |
| 3 | LoRA fine-tuning |
| 4 | Ollama integration |
| 5 | RAG + Django |
| 6 | Evaluation + deployment |