
Hadith Ingestion

Ingests a large collection of prophetic traditions (Hadith), each with its chain of narration (Sanad) and text (Matn), to serve as the backbone for Isnad (chain analysis) and Rijal (narrator evaluation) research.

Key Characteristics

Source Dataset
Utilizes the freococo/650k_sanadset dataset from Hugging Face.
Data Structure
Each record contains collection name, Hadith number, text (Matn), and chain (Sanad).
Research Utility
Raw Sanad strings will be parsed into individual narrators and linked into a graph of narrator relationships for Rijal research.
Data Provenance
Primary ingestion uses the Hugging Face dataset freococo/650k_sanadset, supplemented by other Hadith corpora, including the standard compilations of the Kutub al-Sittah.
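
The narrator-linking step described under Research Utility could begin from something like the minimal sketch below. It is not the project's parser: the marker list is a small illustrative subset (real chains use many more transmission formulas and need narrator disambiguation), and the names `split_narrators` and `chain_edges` are hypothetical.

```python
import re

# Illustrative subset of transmission formulas (an assumption, not exhaustive).
MARKERS = ("حدثنا", "حدثني", "أخبرنا", "أخبرني", "سمعت", "عن")
QAL = "قال"  # "he said", often left dangling before the next marker

# A marker only counts as a whole word: start of text or whitespace
# before it, and whitespace after it.
MARKER_RE = re.compile(r"(?:^|\s)(?:" + "|".join(MARKERS) + r")\s+")


def split_narrators(sanad):
    """Split a raw Sanad string into narrator names, in transmission order."""
    names = []
    for part in MARKER_RE.split(sanad.strip()):
        name = part.strip()
        # Drop a trailing "qal" that belongs to the chain, not to the name.
        if name == QAL or name.endswith(" " + QAL):
            name = name[: -len(QAL)].strip()
        if name:
            names.append(name)
    return names


def chain_edges(sanad):
    """Return (receiver, source) pairs for consecutive narrators in a chain."""
    names = split_narrators(sanad)
    return list(zip(names, names[1:]))
```

Each pair is one candidate edge for the narrator graph. Mapping name strings to unique narrator identities is the hard part and is deliberately left out here.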

Current Status

  • Total Ingested: 88,690
  • Primary Table: hadith
  • Source Scope: ~650,000

Extraction Workflow

The pipeline is orchestrated with Prefect and reads the source through the Hugging Face datasets library in streaming mode, which keeps memory usage bounded.

1. Initialization

Ensures the source:hadith_650k_sanadset record exists in the database before proceeding.
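
A minimal sketch of this idempotent check, assuming a hypothetical `run_query` callable that sends one SurrealQL statement to the database and returns the resulting rows as a list. The `name` field on the source record is also an assumption; the actual schema is not described here.

```python
from typing import Callable, List

SOURCE_ID = "source:hadith_650k_sanadset"


def ensure_source(run_query: Callable[[str], List[dict]]) -> bool:
    """Create the source record if it is missing.

    Returns True when the record was created, False when it already existed.
    """
    if run_query(f"SELECT * FROM {SOURCE_ID};"):
        return False
    # `name` is a hypothetical field used only for illustration.
    run_query(f"CREATE {SOURCE_ID} SET name = 'freococo/650k_sanadset';")
    return True
```

Running the check a second time is a no-op, so the flow can be restarted safely.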

2. Streaming Load

Loads the dataset from Hugging Face with streaming=True, so records are read incrementally instead of being loaded into memory all at once.
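
As a rough illustration, the streaming load might look like the sketch below. The split name `"train"` is an assumption about how the dataset is published, and `stream_records` and `take` are hypothetical helper names.

```python
from itertools import islice


def stream_records(name="freococo/650k_sanadset", split="train"):
    """Yield records one at a time without materializing the dataset."""
    # Imported lazily so `take` can be used without the library installed.
    from datasets import load_dataset

    # With streaming=True the result is iterated lazily over the remote
    # files instead of being downloaded and loaded up front.
    yield from load_dataset(name, split=split, streaming=True)


def take(records, limit):
    """Cap any record iterator, e.g. for a dry run on the first few rows."""
    return islice(records, limit)
```

For example, `take(stream_records(), 5)` would pull only the first five records over the network.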

3. Normalization

Slugifies collection names and escapes special characters in Matn/Sanad to prevent SurrealQL errors.
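
The exact normalization rules are not spelled out here, so the following is only a sketch under two assumptions: collection names are Latin-script (as in "Sahih Bukhari"), and text values are embedded in single-quoted string literals. The function names are hypothetical.

```python
import re


def slugify(name: str) -> str:
    """Lowercase the name and collapse each non-alphanumeric run to one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def escape_text(text: str) -> str:
    """Escape a value for use inside a single-quoted string literal.

    Backslashes are escaped first so the quote escapes added afterwards
    are not doubled.
    """
    return text.replace("\\", "\\\\").replace("'", "\\'")
```

Where the database client supports bound query parameters, passing Matn and Sanad as parameters is generally safer than manual escaping.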

4. Batch Upserting

Groups statements into batches of 100 and executes each batch as a single transaction to improve database write throughput.
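
The batching itself can be sketched as follows. It assumes each statement is already a complete SurrealQL statement ending in a semicolon; `batched` and `as_transaction` are hypothetical names.

```python
from typing import Iterable, Iterator, List


def batched(statements: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Yield lists of at most `size` statements, preserving order."""
    batch: List[str] = []
    for statement in statements:
        batch.append(statement)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def as_transaction(batch: List[str]) -> str:
    """Wrap one batch in a single SurrealQL transaction."""
    return "BEGIN TRANSACTION;\n" + "\n".join(batch) + "\nCOMMIT TRANSACTION;"
```

Because `batched` consumes an iterator, it composes with a streaming source without holding more than one batch in memory.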

Data Structure Example

The raw record structure, shown as JSON, before database normalization.

{
  "Hadith": "...",
  "Book": "Sahih Bukhari",
  "Num_hadith": "1",
  "Matn": "إنما الأعمال بالنيات...",
  "Sanad": "حدثنا الحميدي عبد الله بن الزبير..."
}
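
To show how such a record might be mapped onto the four fields listed under Data Structure, here is a small sketch. The `HadithRow` class and its field names are hypothetical and do not reflect the actual schema of the hadith table. The "Hadith" key is left unmapped because its content is elided in the example.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class HadithRow:
    collection: str  # from "Book"
    number: str      # from "Num_hadith", kept as text as in the raw record
    matn: str
    sanad: str


def from_raw(raw: Mapping[str, Any]) -> HadithRow:
    """Build a row from a raw record, treating missing text fields as empty."""
    return HadithRow(
        collection=str(raw["Book"]).strip(),
        number=str(raw["Num_hadith"]).strip(),
        matn=str(raw.get("Matn") or "").strip(),
        sanad=str(raw.get("Sanad") or "").strip(),
    )
```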
