Sentence Ingestion
Sentences are the most critical component of the Knowledge Graph. Each sentence is an Atomic Unit of Knowledge: the level at which search, cross-referencing, and semantic analysis occur.
Key Characteristics
Current Status
- Sentence Atoms: 120,907
- Primary Source: MURAD (80%)
- Search State: Fully Indexed
Generation Workflows
Sentences are not ingested directly; they are the output of transformation pipelines processing core texts.
1. Quranic Atomization
Long Ayahs are split at Quranic Waqf (stop) marks, which serve as natural semantic breathing points. For example, Ayat al-Kursi is split into 9 sentences.
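The splitting step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name atomize_ayah and the choice of which Unicode waqf marks count as split points (U+06D6, U+06D7, U+06D8, U+06DA) are assumptions, and the usage note uses placeholder words rather than Quranic text.

```python
import re

# Assumption: these Unicode waqf marks are treated as permissible stops.
# U+06D9 (conventionally read as "do not stop") is deliberately left out.
WAQF_MARKS = "\u06D6\u06D7\u06D8\u06DA"

# Split on whitespace that directly follows a waqf mark, so the mark stays
# attached to the segment it closes.
_SPLIT_AFTER_WAQF = re.compile(f"(?<=[{WAQF_MARKS}])\\s+")


def atomize_ayah(ayah_text: str) -> list[str]:
    """Split one ayah into sentence atoms at the assumed waqf marks."""
    parts = _SPLIT_AFTER_WAQF.split(ayah_text)
    return [part.strip() for part in parts if part.strip()]
```

Calling atomize_ayah on a string with two such marks yields three segments, and a string with no marks comes back as a single segment. Whether the real pipeline keeps the mark inside the sentence text is not stated in this page.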
2. Dictionary Chunking
Splits book pages into ~350-word blocks. An LLM (Qwen2.5) extracts defining entries as sentences linked back to the page.
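The page-splitting half of this step might look like the sketch below. It assumes a plain whitespace word count and a hard cut every 350 words; the function names, the record layout, and the use of chunk_index as the block's position on the page are illustrative assumptions. The LLM extraction that follows is not shown.

```python
def chunk_page(page_text: str, target_words: int = 350) -> list[str]:
    """Split a page into consecutive blocks of at most target_words words."""
    words = page_text.split()
    return [
        " ".join(words[start:start + target_words])
        for start in range(0, len(words), target_words)
    ]


def blocks_for_page(page_id: str, page_text: str, target_words: int = 350) -> list[dict]:
    """Attach each block to its page so extracted sentences can link back.

    The field names mirror the example record on this page, but their use
    here is an assumption.
    """
    return [
        {"parent": page_id, "chunk_index": index, "text": block}
        for index, block in enumerate(chunk_page(page_text, target_words))
    ]
```

A production chunker would probably avoid cutting in the middle of a dictionary entry, which is one reason the block size is described as approximate.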
3. Enrichment (The Knowledge Plane)
Each sentence then passes through four enrichment steps:
- Vectorization: 1024-dim embeddings via Ollama.
- Categorization: Mapped to taxonomies and topics.
- Entity Extraction: LLM creates mentions links for people and places.
- Linguistic Mapping: Linked to roots and words via composed_of.
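Two of these steps, vectorization and entity extraction, can be sketched as a single function. The embedder and the entity extractor are passed in as plain callables, so nothing here depends on a particular Ollama model or prompt. The function name, the edge dictionary layout, and the idea that mention_count equals the number of mentions links are all assumptions made for illustration.

```python
from typing import Callable, Sequence

EMBEDDING_DIM = 1024  # dimension stated in the list above


def enrich_sentence(
    sentence: dict,
    embed: Callable[[str], Sequence[float]],
    extract_entities: Callable[[str], Sequence[str]],
) -> tuple[dict, list[dict]]:
    """Return the enriched sentence and its hypothetical mentions edges."""
    vector = list(embed(sentence["text"]))
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")

    # One edge per entity ID returned by the extractor (layout is assumed).
    edges = [
        {"from": sentence["id"], "edge": "mentions", "to": entity_id}
        for entity_id in extract_entities(sentence["text"])
    ]
    enriched = {**sentence, "embedding": vector, "mention_count": len(edges)}
    return enriched, edges
```

Injecting the callables keeps the sketch testable with stubs; the dimension check catches a mismatched embedding model before a bad vector reaches the index.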
Data Structure Example
A sentence record enriched for hybrid search.
{
"id": "sentence:dict_mura_12345",
"text": "الصحبة: في اللغة المعاشرة، يقال صحبه يصحبه صحبة",
"parent": "book_page:passage_abc",
"source": "source:murad_dataset_2026",
"embedding": [0.12, -0.05, ...],
"chunk_index": 0,
"mention_count": 5
}
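The id, parent, and source values in this record all appear to follow a table:key pattern. Assuming that pattern holds in general, a small helper (hypothetical, not part of the documented system) can split such a link into its two parts:

```python
from typing import NamedTuple


class RecordLink(NamedTuple):
    table: str
    key: str


def parse_link(link: str) -> RecordLink:
    """Split a 'table:key' link at the first colon."""
    table, separator, key = link.partition(":")
    if not separator or not table or not key:
        raise ValueError(f"not a table:key link: {link!r}")
    return RecordLink(table, key)
```

For the record above, parse_link applied to the parent field gives the table book_page and the key passage_abc, which is enough to route a lookup back to the originating page.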