Quran Ingestion
A multi-modal data acquisition system that builds a rich, interconnected knowledge base for each Ayah, combining primary text, translations, classical Tafsirs, and thematic annotations.
Key Characteristics
Current Status
- Ayahs Processed: 6,236 (100%)
- Thematic Categories: 1,314
- Classification Links: 7,660
Extraction Workflows
The ingestion pipeline is composed of specialized Prefect flows, each responsible for a specific data layer.
1. Multi-Edition
Ingests text-based editions (Translations, Tafsirs) from AlQuran Cloud.
ingest_quran_editions.py
2. Scholarly Metadata
Adds Indonesian metadata (Wajiz, Sabab Nuzul, Intro, Themes).
ingest_quran_ronnieaban_metadata.py
3. Arabic Tafsir
Targets high-quality Arabic Tafsirs from Quran.com API.
ingest_quran_tafsir_qurancom.py
4. Thematic Analysis
Ingests annotations for emotion, sentiment, and subgrouping.
ingest_quran_thematic_emotional.py
Data Structure Example
An ayah record serves as a JSON hub containing multiple nested layers.
{
  "surah_number": 1,
  "ayah_number": 1,
  "text": "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ",
  "translations": {
    "en_sahih": "In the name of Allah...",
    "id_indonesian": "Dengan menyebut nama..."
  },
  "tafsir": {
    "ar_saddi": "...",
    "id_wajiz": "..."
  },
  "metadata": {
    "emotion": "mercy",
    "sentiment": "positive",
    "theme_group": "Basmalah"
  }
}
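The sketch below illustrates one way the separate flows could each contribute a layer to a record shaped like the one above. It is not the project's actual code: the helper names `empty_ayah_record` and `merge_layer` are hypothetical, and the assumption that each flow writes into exactly one of the `translations`, `tafsir`, or `metadata` keys is inferred from the example record only.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any

# Nested layers seen in the example record; assumed to be the full set.
LAYERS = ("translations", "tafsir", "metadata")


def empty_ayah_record(surah_number: int, ayah_number: int, text: str) -> dict[str, Any]:
    """Create a record hub with the primary text and empty nested layers (hypothetical helper)."""
    return {
        "surah_number": surah_number,
        "ayah_number": ayah_number,
        "text": text,
        "translations": {},
        "tafsir": {},
        "metadata": {},
    }


def merge_layer(record: dict[str, Any], layer: str, values: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `record` with `values` merged into one nested layer (hypothetical helper).

    Existing keys in the layer are overwritten by `values`; the input record is not mutated.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    merged = deepcopy(record)
    merged[layer].update(values)
    return merged


# Each call stands in for the output of one ingestion flow.
record = empty_ayah_record(1, 1, "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ")
record = merge_layer(record, "translations", {"en_sahih": "In the name of Allah..."})
record = merge_layer(record, "tafsir", {"id_wajiz": "..."})
record = merge_layer(record, "metadata", {"emotion": "mercy", "sentiment": "positive"})
```

Returning a copy rather than mutating in place keeps each step easy to retry or re-run in isolation, which tends to suit flow-based orchestration; an in-place update would work equally well if the record lives in a document store.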