Cloud Kernel Data Pipeline
This pipeline makes production content cloud-native while still enforcing deterministic local gates before shipping.
What It Does
- Pulls source docs and optional Notion sync.
- Ingests multi-source external content (Airtable, Asana, Proton Mail, Gumroad, Google Business, Zapier exports).
- Scores every record with:
truth_scoreuseful_scoreharmful_score
- Splits output into:
curated_allowed.jsonlcurated_quarantine.jsonl- category files under
categories/
- Runs dataset-level anomaly audit.
- Ships verified artifacts to cloud targets (Hugging Face, GitHub Releases, Dropbox).
- Rotates older local runs.
Core Files
- Pipeline script:
scripts/cloud_kernel_data_pipeline.py - PowerShell launcher:
scripts/run_cloud_kernel_data_pipeline.ps1 - Config:
training/cloud_kernel_pipeline.json - CI workflow:
.github/workflows/cloud-kernel-data-pipeline.yml
External Intake Layout
Drop production exports here:
training/intake/airtable/*.jsonltraining/intake/asana/*.jsonltraining/intake/protonmail/*.jsonltraining/intake/gumroad/*.jsonltraining/intake/google_business/*.jsonltraining/intake/zapier/*.jsonl
*.json files are also supported.
Local Run (Cloud Shipping Enabled)
$env:HF_TOKEN="..."
$env:GH_TOKEN="..."
$env:DROPBOX_TOKEN="..."
python scripts/cloud_kernel_data_pipeline.py --config training/cloud_kernel_pipeline.json --ship-targets hf,github
PowerShell wrapper:
.\scripts\run_cloud_kernel_data_pipeline.ps1 -ShipTargets "hf,github"
Build/Verify Only (No Upload)
python scripts/cloud_kernel_data_pipeline.py --config training/cloud_kernel_pipeline.json --no-upload
Optional Notion Refresh
$env:NOTION_API_KEY="..."
python scripts/cloud_kernel_data_pipeline.py --sync-notion
Outputs Per Run
training/runs/cloud_kernel_sync/<timestamp>/raw_production_ingest.jsonltraining/runs/cloud_kernel_sync/<timestamp>/raw_combined.jsonltraining/runs/cloud_kernel_sync/<timestamp>/curated_allowed.jsonltraining/runs/cloud_kernel_sync/<timestamp>/curated_quarantine.jsonltraining/runs/cloud_kernel_sync/<timestamp>/verification_report.jsontraining/runs/cloud_kernel_sync/<timestamp>/run_summary.jsontraining/runs/cloud_kernel_sync/<timestamp>.zip
Latest pointer:
training/ingest/latest_cloud_kernel_sync.txt
Verification Policy
Default thresholds in training/cloud_kernel_pipeline.json:
truth_min: 0.62useful_min: 0.58harmful_max: 0.25dataset_anomaly_threshold: 0.78dataset_max_flagged_ratio: 0.08
If dataset audit returns QUARANTINE, the pipeline exits non-zero by default and blocks shipping.