
Books Ingestion

Books Ingestion is the heavy-lifting component of the OpenBayan Knowledge Graph. It processes large volumes of classical Islamic texts (Kitabs), dictionaries, and biographical works, converting flat text into structured graph data.

Key Characteristics

Granularity
Data is ingested at the Page level (book_page), the raw container for digitized text.
Hybrid Extraction
Combines regular expressions with an LLM (Qwen2.5) for semantic understanding and entity extraction.
Graph Transformation
Transforms page content into atoms (sentences) and linguistic nodes (words), linking to entities and roots.
Data Provenance
Streamed from Hugging Face repositories: the Shamela Waqfeya library (ieasybooks-org/shamela-waqfeya-library) and the digitized Athar Dataset (classical biographical and narrator works).
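
To make the Graph Transformation step above more concrete, here is a minimal sketch of turning page text into atoms (sentences) and word tokens. The function name, the punctuation-based sentence splitting, and the diacritic stripping are assumptions made for illustration; the actual pipeline may segment text differently.

```python
import re

# Assumption: sentence boundaries are approximated by ., !, ?, the Arabic
# question mark (U+061F), the Arabic semicolon (U+061B), and newlines.
_SENTENCE_END = re.compile(r"[.!?\u061F\u061B\n]+")
# Arabic diacritics (tashkeel), superscript alef, and tatweel.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Runs of Arabic letters.
_WORD = re.compile(r"[\u0621-\u064A]+")


def page_to_atoms(content):
    """Split page text into atoms (sentences), each with its word tokens."""
    atoms = []
    for raw in _SENTENCE_END.split(content):
        text = raw.strip()
        if not text:
            continue
        # Strip diacritics first so vocalized and bare spellings tokenize alike.
        words = _WORD.findall(_DIACRITICS.sub("", text))
        atoms.append({"text": text, "words": words})
    return atoms
```

Each returned atom could then become a sentence node, with one linguistic node per entry in its word list.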

Current Status

  • Total Books: 4,661
  • Digitized Pages: 83,915
  • Dictionary Extraction: ~0.72%

Extraction Workflows

The pipeline operates in three distinct phases: Discovery, Ingestion, and Extraction.

1. Source Discovery

Streams datasets from Hugging Face, filters by category (e.g., "التفاسير"), and saves to local Parquet for processing.

shamela_hf_ingestion.py
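
The discovery phase can be sketched roughly as follows. This is not the content of shamela_hf_ingestion.py; the function names, the "category" field name, the "train" split, and the row limit are all assumptions. Only `load_dataset(..., streaming=True)` from the `datasets` library and `DataFrame.to_parquet` from pandas are real library calls.

```python
from typing import Dict, Iterable, Iterator


def filter_by_category(records: Iterable[Dict], category: str,
                       field: str = "category") -> Iterator[Dict]:
    """Lazily yield only the records whose `field` equals `category`."""
    for record in records:
        if record.get(field) == category:
            yield record


def discover(repo_id: str, category: str, out_path: str, limit: int = 1000) -> int:
    """Stream a Hugging Face dataset, keep one category, save it as Parquet."""
    # Imported lazily so filter_by_category works without these packages.
    from datasets import load_dataset
    import pandas as pd

    # Assumption: the dataset exposes a "train" split.
    stream = load_dataset(repo_id, split="train", streaming=True)
    rows = []
    for record in filter_by_category(stream, category):
        rows.append(record)
        if len(rows) >= limit:  # hypothetical safety cap for the sketch
            break
    pd.DataFrame(rows).to_parquet(out_path, index=False)
    return len(rows)
```

Because the filter is a generator, records are discarded as they stream in, so the full dataset never has to fit in memory.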

2. Record Ingestion

Populates the book and book_page tables with raw text and metadata. Sets processed_for_kg = false.

ingest_shamela_passages.py
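
A minimal sketch of what ingestion produces, using an in-memory dict in place of the real book_page table. The helper names and the choice to skip already-ingested ids are assumptions; only the field names and the initial processed_for_kg = false come from this page.

```python
def make_book_page(passage_id, content, source, page_number):
    """Build a raw book_page record that is not yet processed for the graph."""
    return {
        "id": f"book_page:{passage_id}",
        "content": content,
        "source": source,
        "page_number": page_number,
        "processed_for_kg": False,
    }


def ingest(table, passages, source):
    """Insert passages into `table` (a dict keyed by record id).

    `passages` is an iterable of (passage_id, content, page_number) tuples.
    Existing ids are skipped so a re-run does not reset processed_for_kg
    (an assumed design choice). Returns the number of new records.
    """
    inserted = 0
    for passage_id, content, page_number in passages:
        record = make_book_page(passage_id, content, source, page_number)
        if record["id"] not in table:
            table[record["id"]] = record
            inserted += 1
    return inserted
```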

3. Knowledge Graph Extraction

The bridge to the Knowledge Plane. Splits pages into semantic chunks, runs LLM analysis to extract roots/words/entities, creates graph edges, and fetches Wikipedia metadata.

batch_dictionary_extraction.py
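
The extraction loop might look roughly like the sketch below. It is not the code in batch_dictionary_extraction.py. Chunking here is a simple size-based split standing in for semantic chunking, the headword regex and the edge labels ("defines", "mentions") are hypothetical, the LLM is passed in as a plain callable, and the Wikipedia metadata step is left out.

```python
import re

# Hypothetical dictionary-headword pattern: an Arabic word followed by a
# colon at the start of a line.
_HEADWORD = re.compile(r"^\s*([\u0621-\u064A]+)\s*:", re.MULTILINE)


def chunk_text(content, max_chars=500):
    """Greedily pack words into chunks of at most max_chars characters.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in content.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def extract_pending(pages, llm_extract):
    """Extract edges from unprocessed pages and mark them as processed.

    `llm_extract` is any callable mapping a text chunk to a list of items.
    Returns a list of (page_id, label, target) tuples.
    """
    edges = []
    for page in pages:
        if page.get("processed_for_kg"):
            continue  # already handled in an earlier run
        for chunk in chunk_text(page["content"]):
            for headword in _HEADWORD.findall(chunk):  # cheap regex pass
                edges.append((page["id"], "defines", headword))
            for item in llm_extract(chunk):  # LLM pass, supplied by caller
                edges.append((page["id"], "mentions", item))
        page["processed_for_kg"] = True
    return edges
```

Passing the model in as a callable keeps the loop testable with a stub, for example `extract_pending(pages, lambda chunk: [])`.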

Data Structure Example

A book_page record acts as the source for extraction.

{
  "id": "book_page:passage_123",
  "content": "الصحبة: في اللغة المعاشرة، يقال صحبه يصحبه صحبة...",
  "source": "source:shamela_lisan_al_arab",
  "page_number": 45,
  "processed_for_kg": true
}
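
For readers consuming such records in code, a small typed wrapper can help. The class and helper below are an illustrative sketch, not part of the project; the assumption that the id has the form "table:key" is inferred from the example above.

```python
import json
from dataclasses import dataclass


@dataclass
class BookPage:
    id: str
    content: str
    source: str
    page_number: int
    processed_for_kg: bool = False

    @property
    def key(self):
        """Part of the id after the first colon (assumed 'table:key' form)."""
        return self.id.partition(":")[2]


def parse_book_page(raw):
    """Parse a JSON string into a BookPage; unknown fields raise TypeError."""
    return BookPage(**json.loads(raw))
```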
