
Quran Ingestion

A multi-modal data acquisition system that builds a rich, interconnected knowledge base for each Ayah by combining primary text, translations, classical Tafsirs, and thematic annotations.

Key Characteristics

  • Layered Architecture: The ayah table acts as the central hub where subsequent flows append data.
  • Multi-Source: Integrates AlQuran Cloud, Quran.com, and custom scholarly datasets.
  • Graph Connections: Creates classified_as edges linking Ayahs to thematic nodes.
  • Data Provenance: Acquired from the AlQuran Cloud API (text & translations), the Quran.com API (Arabic Tafsirs), and Ronnieaban's Quranic Dataset (Sabab Nuzul, thematic metadata).
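To make the graph connection concrete, here is a minimal sketch of how classified_as edges between an Ayah and its thematic nodes could be represented. The ClassifiedAsEdge class, the build_edges helper, and the "ayah:1:1" / "theme:basmalah" ID formats are all assumptions made for illustration; the actual storage schema is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedAsEdge:
    """One directed edge from an Ayah to a thematic node (hypothetical shape)."""
    ayah_id: str   # assumed ID format, e.g. "ayah:1:1"
    theme_id: str  # assumed ID format, e.g. "theme:basmalah"
    relation: str = "classified_as"


def build_edges(surah: int, ayah: int, themes: list[str]) -> list[ClassifiedAsEdge]:
    """Create one classified_as edge per distinct theme, preserving input order."""
    ayah_id = f"ayah:{surah}:{ayah}"
    seen: set[str] = set()
    edges: list[ClassifiedAsEdge] = []
    for theme in themes:
        # Normalize the label so "Basmalah" and "basmalah" map to the same node.
        theme_id = "theme:" + theme.strip().lower().replace(" ", "_")
        if theme_id in seen:
            continue
        seen.add(theme_id)
        edges.append(ClassifiedAsEdge(ayah_id, theme_id))
    return edges


edges = build_edges(1, 1, ["Basmalah", "Mercy", "basmalah"])
for edge in edges:
    print(edge.ayah_id, edge.relation, edge.theme_id)
```

Normalizing theme labels before creating edges is one way to avoid duplicate links when the same theme appears with different casing across source datasets.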

Current Status

  • Ayahs Processed: 6,236 (100%)
  • Thematic Categories: 1,314
  • Classification Links: 7,660

Extraction Workflows

Ingestion is composed of specialized Prefect flows, each responsible for one layer of the ayah record.

1. Multi-Edition

Ingests text-based editions (Translations, Tafsirs) from AlQuran Cloud.

ingest_quran_editions.py

2. Scholarly Metadata

Adds Indonesian metadata (Wajiz, Sabab Nuzul, Intro, Themes).

ingest_quran_ronnieaban_metadata.py

3. Arabic Tafsir

Targets high-quality Arabic Tafsirs from the Quran.com API.

ingest_quran_tafsir_qurancom.py

4. Thematic Analysis

Ingests annotations for emotion, sentiment, and subgrouping.

ingest_quran_thematic_emotional.py
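The four workflows above can be pictured as successive layers written onto the same hub record. The sketch below uses plain Python functions and an in-memory dictionary in place of the real Prefect flows and the ayah table, so it runs without any dependencies. The function names, the hub helper, and the sample values are assumptions for illustration and do not reflect the contents of the actual scripts.

```python
from typing import Callable

AyahKey = tuple[int, int]        # (surah_number, ayah_number)
AyahTable = dict[AyahKey, dict]  # in-memory stand-in for the ayah table


def hub(table: AyahTable, surah: int, ayah: int) -> dict:
    """Return the hub record for an Ayah, creating it on first use."""
    return table.setdefault(
        (surah, ayah), {"surah_number": surah, "ayah_number": ayah}
    )


def ingest_editions(table: AyahTable) -> None:
    """Layer 1: primary text and translations."""
    record = hub(table, 1, 1)
    record["text"] = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
    record.setdefault("translations", {})["en_sahih"] = "In the name of Allah..."


def ingest_scholarly_metadata(table: AyahTable) -> None:
    """Layer 2: Indonesian scholarly metadata such as the Wajiz tafsir."""
    hub(table, 1, 1).setdefault("tafsir", {})["id_wajiz"] = "..."


def ingest_arabic_tafsir(table: AyahTable) -> None:
    """Layer 3: Arabic Tafsirs."""
    hub(table, 1, 1).setdefault("tafsir", {})["ar_saddi"] = "..."


def ingest_thematic(table: AyahTable) -> None:
    """Layer 4: emotion and sentiment annotations."""
    hub(table, 1, 1).setdefault("metadata", {}).update(
        emotion="mercy", sentiment="positive"
    )


LAYERS: list[Callable[[AyahTable], None]] = [
    ingest_editions,
    ingest_scholarly_metadata,
    ingest_arabic_tafsir,
    ingest_thematic,
]


def run_pipeline() -> AyahTable:
    """Apply every layer in order to a fresh table."""
    table: AyahTable = {}
    for layer in LAYERS:
        layer(table)
    return table


table = run_pipeline()
print(sorted(table[(1, 1)].keys()))
```

Because each layer in this sketch uses setdefault rather than overwriting the record, layers can be re-run or reordered without discarding data written by the others.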

Data Structure Example

An ayah record serves as a JSON hub containing multiple nested layers.

{
  "surah_number": 1,
  "ayah_number": 1,
  "text": "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ",
  "translations": {
    "en_sahih": "In the name of Allah...",
    "id_indonesian": "Dengan menyebut nama..."
  },
  "tafsir": {
    "ar_saddi": "...",
    "id_wajiz": "..."
  },
  "metadata": {
    "emotion": "mercy",
    "sentiment": "positive",
    "theme_group": "Basmalah"
  }
}
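As a small illustration of consuming such a record, the following sketch parses a trimmed copy of the example above and reads values from its nested layers. The get_layer_value helper is hypothetical. It returns None when a layer or key is missing, since not every Ayah is guaranteed to carry every layer.

```python
import json
from typing import Optional

# Trimmed copy of the example record shown above.
RAW = """
{
  "surah_number": 1,
  "ayah_number": 1,
  "translations": {"en_sahih": "In the name of Allah..."},
  "tafsir": {"ar_saddi": "...", "id_wajiz": "..."},
  "metadata": {"emotion": "mercy", "sentiment": "positive", "theme_group": "Basmalah"}
}
"""


def get_layer_value(record: dict, layer: str, key: str) -> Optional[str]:
    """Look up record[layer][key], returning None if either level is absent."""
    layer_data = record.get(layer)
    if not isinstance(layer_data, dict):
        return None
    return layer_data.get(key)


record = json.loads(RAW)
print(get_layer_value(record, "metadata", "theme_group"))
print(get_layer_value(record, "translations", "fr_missing"))
```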
