Developing an LLM for Tetum using Ollama

Developing an LLM for Tetum using Ollama is best done by fine-tuning an existing multilingual model (such as Qwen2.5 or Mistral 7B) rather than training from scratch, using high-quality Tetum text and conversational datasets from government, media, legal, and educational sources. The recommended approach is LoRA fine-tuning, which works well on limited hardware, followed by deployment in Ollama with a Tetum-specific system prompt and enhancement through Retrieval-Augmented Generation (RAG) to handle government and institutional knowledge accurately. This architecture integrates naturally with Django-based systems (such as an HRMS or citizen service platform), supports multilingual interaction, and can later be extended with Tetum speech recognition and text-to-speech for voice-based public services.


Follow the steps below:


1. Reality Check: What "Developing an LLM" Means

There are 3 levels of "LLM development" (from easiest to hardest):

| Level | What you do | Recommended |
|---|---|---|
| A. Prompt & RAG only | Use existing model + Tetum data | Fastest |
| B. Fine-tuning (LoRA) | Adapt a model to Tetum | Best balance |
| C. Train from scratch | New Tetum LLM | Very expensive |

👉 For Tetum, Level B (Fine-tuning) is the correct approach.


2. Recommended Base Models for Tetum (Ollama-friendly)

Tetum is an Austronesian language, so multilingual base models handle it better than English-only ones.

Best Base Models

| Model | Size | Why |
|---|---|---|
| Qwen2.5 | 7B / 14B | Excellent multilingual |
| LLaMA 3.1 | 8B | Strong reasoning |
| Mistral 7B | 7B | Efficient & fast |
| Gemma 2 | 9B | Good low-resource performance |

⚠️ Recommendation for your setup

  • CPU / 16GB RAM → qwen2.5:7b or mistral:7b
  • GPU ≥ 24GB VRAM → qwen2.5:14b

Example:

ollama pull qwen2.5:7b


3. Collecting Tetum Language Data (MOST IMPORTANT STEP)

Your model quality = data quality.

A. Text Sources (High Priority)

  • Government documents (*.gov.tl)
  • Tetum news sites (Tatoli, RTTL)
  • Parliamentary transcripts
  • Legal documents (laws, decrees)
  • Education materials
  • NGO reports
  • Church publications (Tetum is common here)
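Scraped text from these sources usually needs normalization and deduplication before it is usable. A minimal stdlib-only sketch (function names are illustrative, not from any standard tool):

```python
import hashlib
import re
import unicodedata

def clean_line(line: str) -> str:
    """Normalize Unicode (NFC) and collapse runs of whitespace."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()

def dedupe(lines):
    """Drop empty lines and exact duplicates (common when scraping news archives)."""
    seen, out = set(), []
    for line in lines:
        line = clean_line(line)
        if not line:
            continue
        h = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(line)
    return out
```

Run this over each source before mixing them into one corpus, so duplicated articles do not bias the model.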

B. Conversational Data (Critical for Chatbots)

Create Q&A pairs, for example:

{
  "instruction": "Explika saida mak servisu Ministériu Saúde",
  "input": "",
  "output": "Ministériu Saúde iha responsibilidade atu fornese servisu saúde ba ema hotu iha Timor-Leste."
}

C. Minimum Dataset Size

| Level | Tokens |
|---|---|
| Prototype | 100k–300k |
| Usable chatbot | 1–3 million |
| High quality | 5–10 million |


4. Preparing the Dataset (Instruction Format)

Ollama fine-tuning works best with instruction datasets.

Example JSONL

{"instruction":"Halo saudasaun iha Tetum","output":"Bondia! Diak ka lae?"}

{"instruction":"Explika saida mak AI","output":"AI ka Inteligénsia Artifisial mak teknologia ne'ebé bele aprende no halo desizaun hanesan ema."}

Save as:

tetum_dataset.jsonl
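A malformed line can silently break a training run, so it is worth validating the JSONL before fine-tuning. A small sketch (the helper name and required-key set are assumptions based on the format above):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}

def validate_jsonl(lines):
    """Return (valid_records, errors) for raw JSONL lines."""
    records, errors = [], []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f"line {i}: invalid JSON ({e.msg})")
            continue
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            errors.append(f"line {i}: missing keys {sorted(missing)}")
        else:
            records.append(rec)
    return records, errors
```

Call it with `open("tetum_dataset.jsonl")` as the input and fix any reported lines before training.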


5. Fine-Tuning with LoRA (Recommended)

Why LoRA?

  • Works on limited hardware
  • Much cheaper than full training
  • Can be merged into Ollama

Tools Needed

pip install transformers peft datasets accelerate bitsandbytes

Training Example (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit to fit limited hardware
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# LoRA: train small adapter matrices instead of the full model
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()

Train with your Tetum dataset; the output is a LoRA adapter. Note that to ship the fine-tuned weights through Ollama you must merge the adapter into the base model and convert the result to GGUF (for example with llama.cpp's conversion tooling); the Modelfile in the next step uses the stock base model for simplicity.
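The training loop itself is not shown above. A minimal sketch using Hugging Face `datasets` and `Trainer` might look like the following; the `### Instrusaun:` prompt template and all hyperparameters are illustrative assumptions, so adapt them to your base model's chat format and hardware:

```python
def format_example(record: dict) -> str:
    """Flatten one instruction record into a single training string.
    The template here is an assumption; match your base model's format."""
    parts = [f"### Instrusaun:\n{record['instruction']}"]
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Resposta:\n{record['output']}")
    return "\n\n".join(parts)

def train(model, tokenizer, jsonl_path="tetum_dataset.jsonl"):
    """Minimal Trainer loop over the JSONL dataset (heavy imports kept local)."""
    from datasets import load_dataset
    from transformers import (DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    ds = load_dataset("json", data_files=jsonl_path, split="train")
    ds = ds.map(lambda r: tokenizer(format_example(r),
                                    truncation=True, max_length=1024))

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="tetum-lora",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=8,
            num_train_epochs=3,
            learning_rate=2e-4,
        ),
        train_dataset=ds,
        # Causal LM collator: labels are the input tokens shifted
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("tetum-lora")  # writes the LoRA adapter
```

Pass in the quantized model and tokenizer from the previous snippet; the adapter lands in `tetum-lora/`.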


6. Importing the Tetum Model into Ollama

Step 1: Create a Modelfile

FROM qwen2.5:7b

PARAMETER temperature 0.3

SYSTEM """
You are an AI assistant that speaks fluent Tetum.
Always answer in Tetum unless explicitly asked otherwise.
"""

Step 2: Create the model

ollama create tetum-llm -f Modelfile

Step 3: Run it

ollama run tetum-llm
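Beyond the CLI, the model can be called from application code through Ollama's local REST API. A stdlib-only sketch (the `ask_tetum` wrapper is an illustrative name, not part of Ollama):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "tetum-llm") -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_tetum(prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask_tetum("Halo saudasaun iha Tetum"))
```

This is the same call a Django view would make when wiring the model into a web service.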


7. Improving Accuracy with RAG (Highly Recommended)

Instead of forcing everything into the model:

Architecture

User → Django API → Vector DB (Tetum documents) → Ollama (Tetum LLM) → Answer
Tools

  • Embeddings: nomic-embed-text
  • Vector DB: FAISS / Chroma
  • Backend: Django (you already use this)

This is perfect for:

  • Government services
  • Legal explanations
  • Citizen support chatbot
  • HRMS (leave, payroll, attendance)
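Indexing the Tetum documents for RAG can be sketched as follows, assuming Chroma as the vector DB and `nomic-embed-text` served by Ollama; the chunk size and function names are illustrative choices:

```python
import json
import urllib.request

def chunk_text(text: str, max_words: int = 150):
    """Split a document into fixed word-count chunks for embedding."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def embed(text: str, host: str = "http://localhost:11434") -> list:
    """Get a vector from Ollama's nomic-embed-text embedding model."""
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=json.dumps({"model": "nomic-embed-text",
                         "prompt": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

def build_index(documents: dict):
    """Store chunks and embeddings in Chroma (heavy import kept local)."""
    import chromadb
    col = chromadb.Client().create_collection("tetum_docs")
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_text(text)):
            col.add(ids=[f"{doc_id}-{i}"], documents=[chunk],
                    embeddings=[embed(chunk)])
    return col
```

At query time, embed the user's question the same way, pull the nearest chunks from the collection, and prepend them to the prompt sent to the Tetum model.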

8. Evaluation for Tetum

Create a manual test set:

  • Grammar correctness
  • Formal vs informal Tetum
  • Government terminology
  • Code-switching (Tetum ↔ Portuguese)

Example:

Pergunta: Saida mak dokumentu presiza atu halo pasaporte?

Resposta: ...
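The manual test set can be partly automated with a simple keyword-coverage check; a sketch (scoring by keyword hits is a crude assumption, and a human reviewer should still grade grammar and register):

```python
def run_eval(test_set, ask):
    """Run each question through `ask` (any prompt -> answer callable,
    e.g. a wrapper around the tetum-llm model) and score keyword coverage."""
    results = []
    for case in test_set:
        answer = ask(case["question"])
        hits = [kw for kw in case["keywords"] if kw.lower() in answer.lower()]
        results.append({
            "question": case["question"],
            "answer": answer,
            "score": len(hits) / len(case["keywords"]),
        })
    return results
```

Each test case pairs a question with the Tetum terms a correct answer should contain; low scores flag answers for manual review.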


9. Voice Support (Future Phase – Matches Your TILERP Plan)

Since you already plan Tetum voice models:

  • ASR: Whisper fine-tuned on Tetum
  • TTS: Coqui TTS / Piper
  • Integration: Ollama → Django → WhatsApp / IVR

10. Recommended Roadmap (6 Months)

| Month | Task |
|---|---|
| 1 | Collect Tetum corpus |
| 2 | Clean & build instruction dataset |
| 3 | LoRA fine-tuning |
| 4 | Ollama integration |
| 5 | RAG + Django |
| 6 | Evaluation + deployment |



 
