
Quran Ingestion

A multi-modal data acquisition system that builds a rich, interconnected knowledge base for each Ayah by combining primary text, translations, classical Tafsirs, and thematic annotations.

Key Characteristics

Layered Architecture

The ayah table acts as the central hub where subsequent flows append data.
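One way to picture the hub pattern is a merge helper that each flow calls to attach its own layer without touching the others. This is a minimal sketch, not the project's actual code: the function name `append_layer` and the in-memory dict standing in for the ayah table are assumptions made for illustration.

```python
from copy import deepcopy


def append_layer(record: dict, layer: str, values: dict) -> dict:
    """Return a copy of `record` with `values` merged into the named layer.

    Hypothetical helper: existing keys in other layers are left untouched,
    and keys already present in the same layer are overwritten by `values`.
    """
    updated = deepcopy(record)
    updated.setdefault(layer, {}).update(values)
    return updated


# A bare hub record, as the first flow might create it.
hub = {"surah_number": 1, "ayah_number": 1, "text": "..."}

# Later flows each append their own layer to the same record.
hub = append_layer(hub, "translations", {"en_sahih": "In the name of Allah..."})
hub = append_layer(hub, "tafsir", {"id_wajiz": "..."})
hub = append_layer(hub, "translations", {"id_indonesian": "Dengan menyebut nama..."})
```

Returning a copy rather than mutating in place keeps each step easy to test in isolation; a real flow would presumably write the merged record back to the database instead.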

Multi-Source

Integrates AlQuran Cloud, Quran.com, and custom scholarly datasets.

Graph Connections

Creates classified_as edges linking Ayahs to thematic nodes.
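The edge creation can be sketched as follows. Only the `classified_as` relation name comes from the description above; the `Edge` tuple, the `classify` function, the `ayah:`/`theme:` node id formats, and the "Mercy" theme are hypothetical and used purely for illustration.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str    # ayah node id, e.g. "ayah:1:1" (assumed format)
    relation: str  # always "classified_as" in this sketch
    target: str    # thematic node id, e.g. "theme:Basmalah" (assumed format)


def classify(surah: int, ayah: int, themes: list[str]) -> list[Edge]:
    """Build one classified_as edge per distinct theme for a single Ayah."""
    source = f"ayah:{surah}:{ayah}"
    # Deduplicate and sort so repeated ingestion yields the same edge list.
    return [Edge(source, "classified_as", f"theme:{t}") for t in sorted(set(themes))]


edges = classify(1, 1, ["Basmalah", "Mercy", "Basmalah"])
for edge in edges:
    print(edge.source, f"-[{edge.relation}]->", edge.target)
```

Deduplicating before building edges means a theme listed twice in the source data still produces a single link per Ayah.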

Data Provenance

Data is acquired from the AlQuran Cloud API (text and translations), the Quran.com API (Arabic Tafsirs), and Ronnieaban's Quranic Dataset (Sabab Nuzul and thematic metadata).

Current Status

Ayahs Processed: 6,236 (100%)
Thematic Categories: 1,314
Classification Links: 7,660

Extraction Workflows

Ingestion is composed of specialized Prefect flows, each responsible for one layer of the ayah record.

1. Multi-Edition

Ingests text-based editions (Translations, Tafsirs) from AlQuran Cloud.

ingest_quran_editions.py

2. Scholarly Metadata

Adds Indonesian metadata (Wajiz, Sabab Nuzul, Intro, Themes).

ingest_quran_ronnieaban_metadata.py

3. Arabic Tafsir

Targets high-quality Arabic Tafsirs from Quran.com API.

ingest_quran_tafsir_qurancom.py

4. Thematic Analysis

Ingests annotations for emotion, sentiment, and subgrouping.

ingest_quran_thematic_emotional.py

Data Structure Example

An ayah record serves as a JSON hub containing multiple nested layers.

{
  "surah_number": 1,
  "ayah_number": 1,
  "text": "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ",
  "translations": {
    "en_sahih": "In the name of Allah...",
    "id_indonesian": "Dengan menyebut nama..."
  },
  "tafsir": {
    "ar_saddi": "...",
    "id_wajiz": "..."
  },
  "metadata": {
    "emotion": "mercy",
    "sentiment": "positive",
    "theme_group": "Basmalah"
  }
}
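As a rough illustration of how a consumer might read such a record, the sketch below parses a trimmed copy of the example and looks up values in its nested layers. The helper `get_layer_value` and the `surah:ayah` key format are assumptions for this example, not part of the pipeline.

```python
import json

# Trimmed copy of the record above (the text field is omitted for brevity).
raw = """
{
  "surah_number": 1,
  "ayah_number": 1,
  "translations": {"en_sahih": "In the name of Allah..."},
  "tafsir": {"id_wajiz": "..."},
  "metadata": {"emotion": "mercy", "sentiment": "positive", "theme_group": "Basmalah"}
}
"""

record = json.loads(raw)


def get_layer_value(record: dict, layer: str, key: str, default=None):
    """Look up `key` inside a nested layer, tolerating layers not yet ingested."""
    return record.get(layer, {}).get(key, default)


# Hypothetical composite key identifying the Ayah.
ayah_key = f'{record["surah_number"]}:{record["ayah_number"]}'

print(ayah_key, get_layer_value(record, "metadata", "emotion"))
print(get_layer_value(record, "tafsir", "ar_saddi", default="not ingested"))
```

Because each layer is filled by a separate flow, a lookup with a default avoids errors when a record is read before every flow has run.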
