Books Ingestion
Books Ingestion is the heavy-lifting component of the OpenBayan Knowledge Graph. It processes large volumes of classical Islamic texts (Kitabs), dictionaries, and biographical works, converting flat text into structured graph data.
Key Characteristics
Current Status
- Total Books: 4,661
- Digitized Pages: 83,915
- Dictionary Extraction: ~0.72%
Extraction Workflows
The pipeline operates in three distinct phases: Discovery, Ingestion, and Extraction.
1. Source Discovery
Streams datasets from Hugging Face, filters by category (e.g., "التفاسير"), and saves to local Parquet for processing.
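The filtering step can be sketched as a lazy generator over dictionary rows. This is a minimal illustration under stated assumptions, not the contents of shamela_hf_ingestion.py: the field name `category`, the function `filter_by_category`, and the sample rows are all hypothetical. In a real run the iterable would presumably come from the Hugging Face `datasets` library in streaming mode, and the filtered rows would then be written to Parquet; both steps are left out here so the sketch runs with the standard library alone.

```python
from typing import Iterable, Iterator


def filter_by_category(records: Iterable[dict], category: str) -> Iterator[dict]:
    """Lazily yield only the records whose 'category' field matches.

    The 'category' field name is an assumption made for this sketch.
    """
    for record in records:
        if record.get("category") == category:
            yield record


# Stand-in for a streamed dataset: any iterable of dict rows works here.
# Titles and categories below are placeholders, not real catalogue entries.
sample_stream = [
    {"title": "Book A", "category": "التفاسير"},
    {"title": "Book B", "category": "المعاجم"},
    {"title": "Book C", "category": "التفاسير"},
]

tafsir_books = list(filter_by_category(sample_stream, "التفاسير"))
```

Because the function is a generator, it never holds the full dataset in memory, which matters when the source is streamed rather than downloaded.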
shamela_hf_ingestion.py
2. Record Ingestion
Populates the book and book_page tables with raw text and metadata. Sets processed_for_kg = false.
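As a rough sketch of this phase, the example below inserts one book and its pages, with every page starting unprocessed. It uses an in-memory SQLite database purely as a stand-in, since the project's actual database and schema are not described here. The column names other than `processed_for_kg`, the helper `ingest_book`, and the sample IDs are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite is used purely as a stand-in for the project's database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE book (
        id TEXT PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE book_page (
        id TEXT PRIMARY KEY,
        book_id TEXT NOT NULL REFERENCES book(id),
        page_number INTEGER NOT NULL,
        content TEXT NOT NULL,
        processed_for_kg INTEGER NOT NULL DEFAULT 0  -- 0 = false
    );
    """
)


def ingest_book(conn, book_id, title, pages):
    """Insert one book and its pages; every page starts unprocessed."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO book (id, title) VALUES (?, ?)", (book_id, title))
        conn.executemany(
            "INSERT INTO book_page "
            "(id, book_id, page_number, content, processed_for_kg) "
            "VALUES (?, ?, ?, ?, 0)",
            [(f"{book_id}:page_{n}", book_id, n, text) for n, text in pages],
        )


# Hypothetical sample data: (page_number, raw_text) pairs.
ingest_book(conn, "book:example", "Example Book", [(1, "page one text"), (2, "page two text")])

# Pages still waiting for the extraction phase.
pending = conn.execute(
    "SELECT COUNT(*) FROM book_page WHERE processed_for_kg = 0"
).fetchone()[0]
```

Keeping the flag on the page row lets the later extraction phase pick up work with a single query and resume after an interruption.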
ingest_shamela_passages.py
3. Knowledge Graph Extraction
The bridge to the Knowledge Plane. Splits pages into semantic chunks, runs LLM analysis to extract roots/words/entities, creates graph edges, and fetches Wikipedia metadata.
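The chunk, analyze, and link steps might look roughly like the sketch below. Several simplifications are assumed: chunks are formed by grouping sentences up to a character limit rather than by true semantic segmentation, the LLM call is replaced by a hypothetical stub (`fake_analyze`), the edge label `mentions` is invented for illustration, and the Wikipedia lookup is omitted entirely.

```python
import re
from typing import Callable


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of at most max_chars.

    A single over-long sentence is kept whole rather than cut mid-sentence.
    """
    # Split after Latin or Arabic sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?؟۔])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def extract_page(page: dict, analyze: Callable[[str], list[str]]) -> list[tuple[str, str, str]]:
    """Return (page_id, "mentions", term) edges and mark the page processed."""
    edges = []
    for chunk in split_into_chunks(page["content"]):
        for term in analyze(chunk):
            edges.append((page["id"], "mentions", term))
    page["processed_for_kg"] = True
    return edges


def fake_analyze(chunk: str) -> list[str]:
    """Hypothetical stand-in for the LLM call: keeps words longer than 5 characters."""
    return [w.strip(".,") for w in chunk.split() if len(w.strip(".,")) > 5]


page = {
    "id": "book_page:demo",
    "content": "Short text. Another sentence follows here.",
    "processed_for_kg": False,
}
edges = extract_page(page, fake_analyze)
```

Passing the analyzer in as a callable keeps the chunking and edge-building logic testable without a model, and lets the real LLM client be swapped in later.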
batch_dictionary_extraction.py
Data Structure Example
A book_page record acts as the source for extraction.
{
  "id": "book_page:passage_123",
  "content": "الصحبة: في اللغة المعاشرة، يقال صحبه يصحبه صحبة...",
  "source": "source:shamela_lisan_al_arab",
  "page_number": 45,
  "processed_for_kg": true
}
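For code that consumes such records, a typed wrapper can make the fields explicit. The sketch below is one possible shape, not the project's own model: the `BookPage` class is hypothetical, and reading the `id` as a `table:key` pair is an assumption based only on how the example value looks.

```python
import json
from dataclasses import dataclass


@dataclass
class BookPage:
    """Typed view of a book_page record (field names taken from the example)."""
    id: str
    content: str
    source: str
    page_number: int
    processed_for_kg: bool


raw = """
{
  "id": "book_page:passage_123",
  "content": "الصحبة: في اللغة المعاشرة، يقال صحبه يصحبه صحبة...",
  "source": "source:shamela_lisan_al_arab",
  "page_number": 45,
  "processed_for_kg": true
}
"""

page = BookPage(**json.loads(raw))

# Assumption: the id follows a "table:key" convention.
table, _, key = page.id.partition(":")
```

Constructing the dataclass with `**` fails on a missing or unexpected field, which surfaces malformed records early instead of deep inside the extraction step.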