Sentence Ingestion

The sentence is the most critical component of the Knowledge Graph. It represents the Atomic Unit of Knowledge: the level at which search, cross-referencing, and semantic analysis occur.

Key Characteristics

  • Atomicity: Large texts (Ayahs, book pages) are broken down into smaller, semantically coherent "sentences" or "chunks."
  • Hybrid Search Hub: Hosts both a BM25 full-text index and an HNSW vector index (1024-dim), so a query can combine keyword and semantic matching.
  • Multimodal Linkage: Every sentence links to its parent, source, words, and entities, forming a linguistic map across the graph.
  • Data Provenance: Synthesized from upstream planes. About 80% of sentences currently come from the MURAD Reverse Arabic Dictionary, alongside segments of Quranic verses and classical Shamela book pages.
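
One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). The source does not say which fusion method the Knowledge Graph uses, so the sketch below is an assumption for illustration only. The function name, the constant k = 60, and the sample IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of sentence IDs into one ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is a conventional default, not a
    value taken from the Knowledge Graph itself.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, sentence_id in enumerate(ranking, start=1):
            scores[sentence_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on ID so the order is deterministic.
    return sorted(scores, key=lambda sid: (-scores[sid], sid))


# Hypothetical result lists from the full-text and vector indexes.
bm25_hits = ["sentence:a", "sentence:b", "sentence:c"]
vector_hits = ["sentence:b", "sentence:d", "sentence:a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Sentences that rank well in both lists rise to the top, while a sentence found by only one index still appears further down.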

Current Status

  • Sentence Atoms: 120,907
  • Primary Source: MURAD (80%)
  • Search State: Fully Indexed

Generation Workflows

Sentences are not ingested directly; they are the output of transformation pipelines processing core texts.

1. Quranic Atomization

Uses Quranic waqf (stop) marks to split long Ayahs at natural semantic pause points. For example, Ayat al-Kursi is split into 9 sentences.
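
The splitting step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the set of Unicode marks treated as stops, the function name split_on_waqf, and the placeholder tokens in the example are assumptions. The sketch keeps each mark inside the segment it closes.

```python
# Assumption: these Unicode Quranic annotation signs are treated as stop marks.
# The pipeline's real mark set is not given in the source.
WAQF_MARKS = {
    "\u06D6",  # ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
    "\u06D7",  # ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
    "\u06D8",  # ARABIC SMALL HIGH MEEM INITIAL FORM
    "\u06DA",  # ARABIC SMALL HIGH JEEM
}


def split_on_waqf(ayah):
    """Split an ayah into segments, closing a segment after any token
    that contains a waqf mark (attached to a word or standing alone)."""
    segments, current = [], []
    for token in ayah.split():
        current.append(token)
        if any(mark in token for mark in WAQF_MARKS):
            segments.append(" ".join(current))
            current = []
    if current:
        # Trailing words after the last mark form the final segment.
        segments.append(" ".join(current))
    return segments


# Placeholder tokens stand in for real words; only the marks matter here.
sample = "w1 w2\u06DA w3 w4 \u06D6 w5"
parts = split_on_waqf(sample)
```

A text with no waqf marks comes back as a single segment, so short Ayahs pass through unchanged.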

2. Dictionary Chunking

Splits book pages into ~350-word blocks. An LLM (Qwen2.5) extracts defining entries as sentences linked back to the page.
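
The block-splitting half of this step might look like the sketch below. It assumes a plain whitespace word count and a hard cut at 350 words; the real pipeline may respect entry boundaries, and the LLM extraction stage is not shown. The function name chunk_words and the sample page are hypothetical.

```python
def chunk_words(page_text, target_words=350):
    """Split page text into consecutive blocks of at most target_words words."""
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    words = page_text.split()
    return [
        " ".join(words[start:start + target_words])
        for start in range(0, len(words), target_words)
    ]


# A synthetic 800-word page yields blocks of 350, 350, and 100 words.
page = " ".join(f"w{i}" for i in range(800))
blocks = chunk_words(page)
```

Each block would then be passed to the LLM, with the originating page ID carried along so extracted sentences can link back to it.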

3. Enrichment (The Knowledge Plane)

Each sentence undergoes rigorous enrichment:

  • Vectorization: 1024-dim embeddings via Ollama.
  • Categorization: Mapped to taxonomies and topics.
  • Entity Extraction: An LLM creates mentions links from the sentence to the people and places it names.
  • Linguistic Mapping: Linked to roots and words via composed_of.
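
As a rough illustration of the vectorization step, the sketch below requests an embedding from a local Ollama server through its /api/embed endpoint and checks the 1024-dim size. The source does not name the embedding model or the server address, so the model argument is left to the caller and the base URL is Ollama's default; the helper names are hypothetical.

```python
import json
import urllib.request

EXPECTED_DIM = 1024  # dimension stated for the sentence vector index


def http_post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def embed_sentence(text, model, base_url="http://localhost:11434",
                   post=http_post_json):
    """Request one embedding from Ollama and verify its dimension.

    `post` is injectable so the function can be exercised without a server.
    """
    reply = post(f"{base_url}/api/embed", {"model": model, "input": text})
    vector = reply["embeddings"][0]
    if len(vector) != EXPECTED_DIM:
        raise ValueError(
            f"expected {EXPECTED_DIM} dimensions, got {len(vector)}"
        )
    return vector
```

Checking the dimension at ingestion time catches a misconfigured model before its vectors reach the HNSW index.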

Data Structure Example

A sentence record enriched for hybrid search.

{
  "id": "sentence:dict_mura_12345",
  "text": "الصحبة: في اللغة المعاشرة، يقال صحبه يصحبه صحبة",
  "parent": "book_page:passage_abc",
  "source": "source:murad_dataset_2026",
  "embedding": [0.12, -0.05, ...],
  "chunk_index": 0,
  "mention_count": 5
}
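
For readers who consume these records in code, the shape above could be modelled as in the sketch below. The field names come from the example; the class name SentenceRecord and the validation rules (ID prefix, 1024-dim embedding, non-negative counters) are assumptions inferred from this page, not a published schema.

```python
from dataclasses import dataclass
from typing import List

EMBEDDING_DIM = 1024  # matches the vector index dimension described above


@dataclass
class SentenceRecord:
    id: str
    text: str
    parent: str
    source: str
    embedding: List[float]
    chunk_index: int
    mention_count: int

    def validate(self):
        """Return a list of problems; an empty list means no issue was found."""
        problems = []
        if not self.id.startswith("sentence:"):
            problems.append("id should start with 'sentence:'")
        if len(self.embedding) != EMBEDDING_DIM:
            problems.append(
                f"embedding has {len(self.embedding)} dims, "
                f"expected {EMBEDDING_DIM}"
            )
        if self.chunk_index < 0 or self.mention_count < 0:
            problems.append("chunk_index and mention_count must be non-negative")
        return problems


# Placeholder embedding values; real vectors come from the enrichment step.
record = SentenceRecord(
    id="sentence:dict_mura_12345",
    text="الصحبة: في اللغة المعاشرة، يقال صحبه يصحبه صحبة",
    parent="book_page:passage_abc",
    source="source:murad_dataset_2026",
    embedding=[0.0] * EMBEDDING_DIM,
    chunk_index=0,
    mention_count=5,
)
issues = record.validate()
```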
