MURAD Dictionary Ingestion
Ingestion pipeline for the MURAD dataset, a structured collection of specialized Arabic terminology paired with contextual definitions and cross-references.
Key Characteristics
Current Status
- Defined Words: 96,221
- Source Graph: source:murad_dataset_2026
Extraction Workflow
The pipeline processes the dataset in batches and uses ThreadPoolTaskRunner to run those batches concurrently.
1. Preparation & Batching
Reads data/murad/data/rd_dataset.csv line by line and groups the rows into batches of 50, which are then processed concurrently.
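The batching step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the helper names `batched`, `read_batches`, and `run_batches` are hypothetical, and the standard-library `ThreadPoolExecutor` is used here as a stand-in for ThreadPoolTaskRunner so the example stays self-contained.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator

BATCH_SIZE = 50  # batch size stated in the workflow description


def batched(rows: Iterable, size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield consecutive lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def read_batches(path: str, size: int = BATCH_SIZE) -> Iterator[list]:
    """Stream the CSV row by row and group the rows into batches."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from batched(csv.DictReader(handle), size)


def run_batches(batches: Iterable[list], handler: Callable, max_workers: int = 4) -> list:
    """Run `handler` on each batch in a thread pool (stand-in for the task runner)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order in its results.
        return list(pool.map(handler, batches))
```

A call such as `run_batches(read_batches("data/murad/data/rd_dataset.csv"), handler)` would then pass each 50-row batch to `handler`, where `handler` is whatever function performs steps 2 to 4 for one batch.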
2. Root & Word
Extracts the Arabic root for each word (with a 3-character fallback) and UPSERTs the root and word nodes into SurrealDB.
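The exact fallback rule is not described here, so the following is only one plausible reading: use the root supplied with the row when there is one, otherwise strip diacritics and take the first three letters of a single-word term. The function names, the diacritic stripping, and the choice to return `None` for multi-word terms are all assumptions made for illustration.

```python
import unicodedata
from typing import Optional

TATWEEL = "\u0640"  # Arabic elongation character, not a letter of the root


def strip_diacritics(text: str) -> str:
    """Remove combining marks (harakat, shadda, sukun) and tatweel."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def resolve_root(word: str, known_root: Optional[str] = None) -> Optional[str]:
    """Return the known root, or a 3-character fallback (assumed rule)."""
    if known_root:
        return known_root
    bare = strip_diacritics(word).strip()
    # Assumption: multi-word terms and very short words get no fallback root.
    if len(bare.split()) != 1 or len(bare) < 3:
        return None
    return bare[:3]
```

Taking the first three letters is a crude approximation of a triliteral root; a real implementation might rely on a morphological analyzer instead.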
3. Definition (Sentence)
Deduplicates definition sentences with MD5 hashing, then calls Ollama (mxbai-embed-large) to embed the definition text as a vector.
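A minimal sketch of the deduplicate-then-embed step, under a few assumptions: the hash is taken over the whitespace-trimmed UTF-8 text, Ollama is reachable at its default local address, and its `/api/embeddings` endpoint is used. The names `sentence_key`, `ollama_embed`, and `embed_new_sentences` are hypothetical.

```python
import hashlib
import json
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional, Set

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def sentence_key(text: str) -> str:
    """MD5 hex digest of the trimmed text, used only as a deduplication key."""
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def ollama_embed(text: str, model: str = "mxbai-embed-large") -> List[float]:
    """Request an embedding vector for `text` from Ollama."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]


def embed_new_sentences(
    sentences: Iterable[str],
    embed: Callable[[str], List[float]] = ollama_embed,
    seen: Optional[Set[str]] = None,
) -> Dict[str, List[float]]:
    """Embed each distinct sentence once, keyed by its MD5 digest."""
    seen = set() if seen is None else seen
    vectors: Dict[str, List[float]] = {}
    for text in sentences:
        key = sentence_key(text)
        if key in seen:
            continue  # duplicate definition: skip the embedding call
        seen.add(key)
        vectors[key] = embed(text)
    return vectors
```

Hashing before embedding means a repeated definition costs one digest computation rather than a model call. Passing `embed` as a parameter also makes the function easy to exercise without a running Ollama instance.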
4. Graph Relationships
Executes RELATE statements linking the definition sentence to the target word.
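To make the relationship step concrete, the sketch below builds a SurrealQL RELATE statement as a string. The table names `sentence` and `word`, the edge name `defines`, and the use of MD5 digests as record IDs are hypothetical; the actual schema is not shown here.

```python
import hashlib


def record_id(table: str, text: str) -> str:
    """Build a record ID from an MD5 digest (assumed ID scheme)."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    # Angle brackets keep the ID valid even when the digest starts with a digit.
    return f"{table}:\u27e8{digest}\u27e9"


def relate_statement(sentence: str, word: str) -> str:
    """Return a RELATE statement linking a definition sentence to its word."""
    sentence_id = record_id("sentence", sentence)
    word_id = record_id("word", word)
    return f"RELATE {sentence_id}->defines->{word_id};"
```

Because the IDs are derived deterministically from the text, the same sentence and word always produce the same statement. In practice, passing the record IDs as query parameters through the SurrealDB client is usually preferable to string formatting.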
Data Structure Example
A typical graph record connecting a definition sentence to a word node.
{
"sentence": "امتداد فترة ملازمة الراوي للشيخ، وهو مصطلح يُستخدم في علم الحديث...",
"word": "طُوْل الصُّحْبَة",
"root": null
}