Developing an LLM for Tetum using Ollama

Developing an LLM for Tetum using Ollama is best done by fine-tuning an existing multilingual model (such as Qwen2.5 or Mistral 7B) rather than training from scratch, using high-quality Tetum text and conversational datasets from government, media, legal, and educational sources. The recommended approach is LoRA fine-tuning, which works well on limited hardware, followed by deployment in Ollama with a Tetum-specific system prompt and enhancement through Retrieval-Augmented Generation (RAG) to handle government and institutional knowledge accurately. This architecture integrates naturally with Django-based systems (such as an HRMS or citizen service platform), supports multilingual interaction, and can later be extended with Tetum speech recognition and text-to-speech for voice-based public services.


Follow the steps below:


1. Reality Check: What "Developing an LLM" Means

There are 3 levels of "LLM development" (from easiest to hardest):

| Level | What you do | Recommended |
|---|---|---|
| A. Prompt & RAG only | Use existing model + Tetum data | Fastest |
| B. Fine-tuning (LoRA) | Adapt a model to Tetum | Best balance |
| C. Train from scratch | New Tetum LLM | Very expensive |

👉 For Tetum, Level B (Fine-tuning) is the correct approach.


2. Recommended Base Models for Tetum (Ollama-friendly)

Tetum is an Austronesian language, so multilingual base models handle it better than English-only ones.

Best Base Models

| Model | Size | Why |
|---|---|---|
| Qwen2.5 | 7B / 14B | Excellent multilingual |
| LLaMA 3.1 | 8B | Strong reasoning |
| Mistral 7B | 7B | Efficient & fast |
| Gemma 2 | 9B | Good low-resource performance |

⚠️ Recommendation for your setup

  • CPU / 16GB RAM → qwen2.5:7b or mistral:7b
  • GPU ≥ 24GB VRAM → qwen2.5:14b

Example:

ollama pull qwen2.5:7b


3. Collecting Tetum Language Data (MOST IMPORTANT STEP)

Your model quality = data quality.

A. Text Sources (High Priority)

  • Government documents (*.gov.tl)
  • Tetum news sites (Tatoli, RTTL)
  • Parliamentary transcripts
  • Legal documents (laws, decrees)
  • Education materials
  • NGO reports
  • Church publications (Tetum is common here)
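Scraped text from these sources usually needs normalization and deduplication before it is usable. A minimal stdlib-only sketch (function names are illustrative, not from any standard tool):

```python
import hashlib
import re
import unicodedata

def clean_line(line: str) -> str:
    """Normalize Unicode (NFC) and collapse runs of whitespace."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()

def dedupe(lines):
    """Drop empty lines and exact duplicates (common when scraping news archives)."""
    seen, out = set(), []
    for line in lines:
        line = clean_line(line)
        if not line:
            continue
        h = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(line)
    return out
```

Run this over each source before mixing them into one corpus, so duplicated articles do not bias the model.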

B. Conversational Data (Critical for Chatbots)

Create Q&A pairs, for example:

{
  "instruction": "Explika saida mak servisu Ministériu Saúde",
  "input": "",
  "output": "Ministériu Saúde iha responsibilidade atu fornese servisu saúde ba ema hotu iha Timor-Leste."
}

C. Minimum Dataset Size

| Level | Tokens |
|---|---|
| Prototype | 100k–300k |
| Usable chatbot | 1–3 million |
| High quality | 5–10 million |


4. Preparing the Dataset (Instruction Format)

Ollama fine-tuning works best with instruction datasets.

Example JSONL

{"instruction":"Halo saudasaun iha Tetum","output":"Bondia! Diak ka lae?"}

{"instruction":"Explika saida mak AI","output":"AI ka Inteligénsia Artifisial mak teknologia ne'ebé bele aprende no halo desizaun hanesan ema."}

Save as:

tetum_dataset.jsonl
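A malformed line can silently break a training run, so it is worth validating the JSONL before fine-tuning. A small sketch (the helper name and required-key set are assumptions based on the format above):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}

def validate_jsonl(lines):
    """Return (valid_records, errors) for raw JSONL lines."""
    records, errors = [], []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f"line {i}: invalid JSON ({e.msg})")
            continue
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            errors.append(f"line {i}: missing keys {sorted(missing)}")
        else:
            records.append(rec)
    return records, errors
```

Call it with `open("tetum_dataset.jsonl")` as the input and fix any reported lines before training.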


5. Fine-Tuning with LoRA (Recommended)

Why LoRA?

  • Works on limited hardware
  • Much cheaper than full training
  • Can be merged into Ollama

Tools Needed

pip install transformers peft datasets accelerate bitsandbytes

Training Example (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit to fit limited hardware
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# LoRA: train small adapter matrices instead of the full model
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()

Train with your Tetum dataset; the output is a LoRA adapter. Note that to ship the fine-tuned weights through Ollama you must merge the adapter into the base model and convert the result to GGUF (for example with llama.cpp's conversion tooling); the Modelfile in the next step uses the stock base model for simplicity.
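The training loop itself is not shown above. A minimal sketch using Hugging Face `datasets` and `Trainer` might look like the following; the `### Instrusaun:` prompt template and all hyperparameters are illustrative assumptions, so adapt them to your base model's chat format and hardware:

```python
def format_example(record: dict) -> str:
    """Flatten one instruction record into a single training string.
    The template here is an assumption; match your base model's format."""
    parts = [f"### Instrusaun:\n{record['instruction']}"]
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Resposta:\n{record['output']}")
    return "\n\n".join(parts)

def train(model, tokenizer, jsonl_path="tetum_dataset.jsonl"):
    """Minimal Trainer loop over the JSONL dataset (heavy imports kept local)."""
    from datasets import load_dataset
    from transformers import (DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    ds = load_dataset("json", data_files=jsonl_path, split="train")
    ds = ds.map(lambda r: tokenizer(format_example(r),
                                    truncation=True, max_length=1024))

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="tetum-lora",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=8,
            num_train_epochs=3,
            learning_rate=2e-4,
        ),
        train_dataset=ds,
        # Causal LM collator: labels are the input tokens shifted
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("tetum-lora")  # writes the LoRA adapter
```

Pass in the quantized model and tokenizer from the previous snippet; the adapter lands in `tetum-lora/`.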


6. Importing the Tetum Model into Ollama

Step 1: Create a Modelfile

FROM qwen2.5:7b

PARAMETER temperature 0.3

SYSTEM """
You are an AI assistant that speaks fluent Tetum.
Always answer in Tetum unless explicitly asked otherwise.
"""

Step 2: Create the model

ollama create tetum-llm -f Modelfile

Step 3: Run it

ollama run tetum-llm
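Beyond the CLI, the model can be called from application code through Ollama's local REST API. A stdlib-only sketch (the `ask_tetum` wrapper is an illustrative name, not part of Ollama):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "tetum-llm") -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_tetum(prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask_tetum("Halo saudasaun iha Tetum"))
```

This is the same call a Django view would make when wiring the model into a web service.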


7. Improving Accuracy with RAG (Highly Recommended)

Instead of forcing everything into the model:

Architecture

User → Django API → Vector DB (Tetum documents) → Ollama (Tetum LLM) → Answer
Tools

  • Embeddings: nomic-embed-text
  • Vector DB: FAISS / Chroma
  • Backend: Django (you already use this)

This is perfect for:

  • Government services
  • Legal explanations
  • Citizen support chatbot
  • HRMS (leave, payroll, attendance)
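Indexing the Tetum documents for RAG can be sketched as follows, assuming Chroma as the vector DB and `nomic-embed-text` served by Ollama; the chunk size and function names are illustrative choices:

```python
import json
import urllib.request

def chunk_text(text: str, max_words: int = 150):
    """Split a document into fixed word-count chunks for embedding."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def embed(text: str, host: str = "http://localhost:11434") -> list:
    """Get a vector from Ollama's nomic-embed-text embedding model."""
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=json.dumps({"model": "nomic-embed-text",
                         "prompt": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def build_index(documents: dict):
    """Store chunks and embeddings in Chroma (heavy import kept local)."""
    import chromadb
    col = chromadb.Client().create_collection("tetum_docs")
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_text(text)):
            col.add(ids=[f"{doc_id}-{i}"], documents=[chunk],
                    embeddings=[embed(chunk)])
    return col
```

At query time, embed the user's question the same way, pull the nearest chunks from the collection, and prepend them to the prompt sent to the Tetum model.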

8. Evaluation for Tetum

Create a manual test set:

  • Grammar correctness
  • Formal vs informal Tetum
  • Government terminology
  • Code-switching (Tetum ↔ Portuguese)

Example:

Pergunta: Saida mak dokumentu presiza atu halo pasaporte?

Resposta: ...
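The manual test set can be partly automated with a simple keyword-coverage check; a sketch (scoring by keyword hits is a crude assumption, and a human reviewer should still grade grammar and register):

```python
def run_eval(test_set, ask):
    """Run each question through `ask` (any prompt -> answer callable,
    e.g. a wrapper around the tetum-llm model) and score keyword coverage."""
    results = []
    for case in test_set:
        answer = ask(case["question"])
        hits = [kw for kw in case["keywords"] if kw.lower() in answer.lower()]
        results.append({
            "question": case["question"],
            "answer": answer,
            "score": len(hits) / len(case["keywords"]),
        })
    return results
```

Each test case pairs a question with the Tetum terms a correct answer should contain; low scores flag answers for manual review.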


9. Voice Support (Future Phase – Matches Your TILERP Plan)

Since you already plan Tetum voice models:

  • ASR: Whisper fine-tuned on Tetum
  • TTS: Coqui TTS / Piper
  • Integration: Ollama → Django → WhatsApp / IVR

10. Recommended Roadmap (6 Months)

| Month | Task |
|---|---|
| 1 | Collect Tetum corpus |
| 2 | Clean & build instruction dataset |
| 3 | LoRA fine-tuning |
| 4 | Ollama integration |
| 5 | RAG + Django |
| 6 | Evaluation + deployment |



 
