
MURAD Dictionary Ingestion

Ingestion pipeline for the MURAD dataset, a structured collection of specialized Arabic terminology paired with contextual definitions and cross-references.

Key Characteristics

  • Lexical Coverage: Focuses on specialized terminology across academic domains (ML, education, psychology).
  • Structure: Primary term (word), descriptive definition (definition), and source taxonomy reference.
  • Graph Utility: Grounds LLM outputs in verified terminology and nuanced semantic explanations.
  • Data Provenance: Compiled from the local dataset data/murad/data/rd_dataset.csv, establishing a unified terminology dictionary registered as source:murad_dataset_2026.

Current Status

  • Defined Words: 96,221
  • Source Graph: source:murad_dataset_2026

Extraction Workflow

The flow follows a batch-processing pattern using ThreadPoolTaskRunner for concurrency.

1. Preparation & Batching

Reads data/murad/data/rd_dataset.csv line by line and groups the rows into batches of 50, which are then processed concurrently.
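The batching step can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and it assumes the CSV has a header row so that each row can be read as a dictionary.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 50  # batch size stated in the workflow description


def iter_batches(rows: Iterable[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def read_batches(path: str, size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Stream the CSV row by row and group the rows into batches.

    Assumption: the file has a header row (csv.DictReader relies on it).
    """
    with open(path, newline="", encoding="utf-8") as fh:
        yield from iter_batches(csv.DictReader(fh), size)
```

Because both functions are generators, the file is never loaded into memory in full; each batch can be handed to a worker as soon as it is complete.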

2. Root & Word

Extracts the Arabic root of each term (with a 3-character fallback) and UPSERTs the root and word nodes into SurrealDB.
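One possible reading of the root step is sketched below. The details are assumptions rather than the pipeline's documented behavior: `root_or_fallback` and its `extract` parameter are hypothetical, the fallback is taken to mean "the first three Arabic letters of the undiacritized word", and multi-word terms are assumed to have no root (consistent with the `"root": null` example later on this page). The first-three-letters fallback is a crude approximation and will not always match the true morphological root.

```python
import re
import unicodedata
from typing import Callable, Optional

ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
TATWEEL = "\u0640"


def strip_diacritics(text: str) -> str:
    """Remove combining marks (harakat) and tatweel, keeping base letters."""
    return "".join(
        ch for ch in text if not unicodedata.combining(ch) and ch != TATWEEL
    )


def root_or_fallback(
    word: str,
    extract: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return a root from `extract` if available, else a 3-letter fallback.

    `extract` stands in for whatever root extractor the pipeline uses
    (hypothetical); it receives the undiacritized word.
    """
    bare = strip_diacritics(word).strip()
    if not bare or " " in bare:
        return None  # assumption: multi-word terms get no root
    root = extract(bare) if extract else None
    if root:
        return root
    letters = ARABIC_LETTER.findall(bare)
    return "".join(letters[:3]) if len(letters) >= 3 else None
```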

3. Definition (Sentence)

Deduplicates definition sentences using MD5 hashes, then calls Ollama (model mxbai-embed-large) to embed the definition text as a vector.
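A minimal sketch of this step, under stated assumptions: the hash is taken over the whitespace-trimmed definition text, Ollama is reachable at its default local address, and the helper names (`sentence_key`, `embed`, `embed_unique`) are hypothetical. The request targets Ollama's `/api/embed` endpoint, which returns an `embeddings` array.

```python
import hashlib
import json
import urllib.request
from typing import Callable, Iterable

OLLAMA_URL = "http://localhost:11434/api/embed"  # assumed default local endpoint
MODEL = "mxbai-embed-large"


def sentence_key(definition: str) -> str:
    """MD5 hex digest of the trimmed definition, used only as a dedup key."""
    data = definition.strip().encode("utf-8")
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def embed(text: str) -> list[float]:
    """Request one embedding vector from a local Ollama server."""
    payload = json.dumps({"model": MODEL, "input": text}).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"][0]


def embed_unique(
    definitions: Iterable[str],
    seen: set[str],
    embed_fn: Callable[[str], list[float]] = embed,
) -> dict[str, list[float]]:
    """Embed each definition once, skipping keys already in `seen`."""
    vectors: dict[str, list[float]] = {}
    for text in definitions:
        key = sentence_key(text)
        if key in seen:
            continue  # duplicate definition: no second embedding call
        seen.add(key)
        vectors[key] = embed_fn(text)
    return vectors
```

Hashing before embedding means a repeated definition costs one set lookup instead of a model call. Passing `embed_fn` explicitly also makes the dedup logic testable without a running Ollama server.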

4. Graph Relationships

Executes RELATE statements linking the definition sentence to the target word.
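The shape of such a statement can be illustrated by building the SurrealQL text in Python. The table names `sentence` and `word`, the edge table `defines`, and the use of MD5 hex digests as record IDs are all assumptions made for this sketch; the pipeline's real schema is not shown on this page.

```python
import re

HEX_KEY = re.compile(r"[0-9a-f]{32}")


def relate_statement(sentence_key: str, word_key: str, edge: str = "defines") -> str:
    """Build a RELATE statement from a sentence record to a word record.

    Keys must be 32-character lowercase hex strings; validating them keeps
    arbitrary text out of the generated query. Record IDs are wrapped in
    backticks because a hex digest may start with a digit.
    """
    for key in (sentence_key, word_key):
        if not HEX_KEY.fullmatch(key):
            raise ValueError(f"not an MD5 hex digest: {key!r}")
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", edge):
        raise ValueError(f"invalid edge table name: {edge!r}")
    return f"RELATE sentence:`{sentence_key}`->{edge}->word:`{word_key}`;"
```

The edge points from the definition sentence to the word, matching the direction described above.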

Data Structure Example

A typical graph record connecting a definition sentence to a word node.

{
  "sentence": "امتداد فترة ملازمة الراوي للشيخ، وهو مصطلح يُستخدم في علم الحديث...",
  "word": "طُوْل الصُّحْبَة",
  "root": null
}
