Hadith Ingestion
This pipeline ingests a large collection of prophetic traditions (Hadith), each with its chain of narration (Sanad) and text (Matn), to serve as the backbone for Isnad and Rijal research.
Key Characteristics
Current Status
- Total Ingested: 88,690
- Primary Table: hadith
- Source Scope: ~650,000
Extraction Workflow
The workflow is orchestrated with Prefect and reads the source through the Hugging Face datasets library in streaming mode to keep memory usage low.
1. Initialization
Ensures the source:hadith_650k_sanadset record exists in the database before proceeding.
2. Streaming Load
Loads the dataset from Hugging Face with streaming=True, so records are read incrementally instead of being held in memory all at once.
3. Normalization
Slugifies collection names and escapes special characters in Matn/Sanad to prevent SurrealQL errors.
4. Batch Upserting
Groups statements into batches of 100 and sends each batch as a single transaction to improve database write performance.
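The batching step can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the helper names `batched` and `build_transaction` are hypothetical, and the sketch assumes each batch is wrapped in SurrealQL `BEGIN TRANSACTION` / `COMMIT TRANSACTION` statements. It accepts any iterable, so a dataset loaded with streaming=True could be consumed without being materialized first.

```python
from itertools import islice
from typing import Iterable, Iterator, List

BATCH_SIZE = 100  # batch size stated in the workflow description


def batched(items: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield lists of at most `size` items, pulling lazily from the input."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_transaction(statements: List[str]) -> str:
    """Join a batch of statements into one transaction string (assumed layout)."""
    # Normalize trailing semicolons so every statement ends with exactly one.
    body = "\n".join(s.rstrip().rstrip(";") + ";" for s in statements)
    return f"BEGIN TRANSACTION;\n{body}\nCOMMIT TRANSACTION;"
```

Because `batched` pulls from an iterator, only one batch of statements is held in memory at a time, which matches the reason for streaming the source in the first place.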
Data Structure Example
The internal JSON structure of a record before database normalization.
{
  "Hadith": "...",
  "Book": "Sahih Bukhari",
  "Num_hadith": "1",
  "Matn": "إنما الأعمال بالنيات...",
  "Sanad": "حدثنا الحميدي عبد الله بن الزبير..."
}
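A raw record like the one above might be normalized roughly as sketched here. The function names and output field names (`collection`, `number`, `matn`, `sanad`) are hypothetical, chosen for illustration only. The sketch also assumes that JSON-style escaping yields string literals the SurrealQL parser accepts; passing values as bound query parameters would avoid manual escaping altogether.

```python
import json
import re
import unicodedata


def slugify(name: str) -> str:
    """Turn a collection name into a lowercase ASCII slug."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def quote(text: str) -> str:
    """Return a double-quoted literal with quotes, backslashes and control chars escaped."""
    # ensure_ascii=False keeps Arabic text readable instead of \uXXXX sequences.
    return json.dumps(text, ensure_ascii=False)


def normalize(raw: dict) -> dict:
    """Map a raw source record to hypothetical normalized fields."""
    return {
        "collection": slugify(raw["Book"]),
        "number": raw["Num_hadith"],
        "matn": quote(raw["Matn"]),
        "sanad": quote(raw["Sanad"]),
    }
```

For the sample record, `slugify("Sahih Bukhari")` gives `sahih_bukhari`, while the Arabic Matn and Sanad pass through unchanged apart from the surrounding quotes.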