Web Research Training Pipeline
Script: scripts/web_research_training_pipeline.py
Purpose
Automate this chain:
- Pull up-to-date web research URLs (Google News RSS per topic).
- Scan URLs with HYDRA headless + Multi-Model Modal Matrix.
- Gate content via antivirus membrane and decision logic.
- Emit training-ready JSONL (allowed + quarantine).
- Audit allowed records and write StateVector + DecisionRecord.
- Optionally upload artifacts to a Hugging Face dataset.
- Optionally notify n8n webhook for downstream automation.
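The gate-and-emit step in the chain above can be pictured with a minimal sketch. It assumes each scanned record is a dict carrying a `decision` field whose value is `"ALLOW"` for content that passed the gate; both the field name and its values are hypothetical, since the script's actual record schema is not described here. Only the two output file names come from the Outputs section.

```python
import json
from pathlib import Path


def partition_records(records, out_dir):
    """Split scanned records into allowed and quarantine JSONL files.

    Assumption: a record passed the gate when record["decision"] == "ALLOW".
    The real pipeline's field names and decision values may differ.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    allowed = [r for r in records if r.get("decision") == "ALLOW"]
    quarantine = [r for r in records if r.get("decision") != "ALLOW"]

    outputs = (
        ("curated_allowed.jsonl", allowed),
        ("curated_quarantine.jsonl", quarantine),
    )
    for name, rows in outputs:
        # One JSON object per line, which is what JSONL consumers expect.
        with open(out_dir / name, "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")

    return len(allowed), len(quarantine)
```

Anything that is not explicitly allowed lands in quarantine, so a record with a missing or unexpected decision is never silently promoted into the training set.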
Linux-first run
python3 scripts/web_research_training_pipeline.py \
--topics "space debris cleanup" "autonomous drone navigation" "radiation hardened avionics" \
--max-per-topic 6 \
--backend playwright \
--max-tabs 6
With Hugging Face upload
export HF_TOKEN=hf_xxx
python3 scripts/web_research_training_pipeline.py \
--topics "mars robotics" "swarm autonomy" \
--upload-hf \
--hf-repo issdandavis/scbe-kernel-datasets
With n8n handoff
python3 scripts/web_research_training_pipeline.py \
--topics "space robotics" \
--n8n-webhook "https://your-n8n-host/webhook/scbe-research-ingest"
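For orientation, a webhook handoff of this kind usually amounts to a JSON POST. The sketch below uses only the standard library and is not taken from the script; the payload shape (`run_id` and so on) is a hypothetical placeholder, and the real payload the pipeline sends may differ.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, summary):
    """Build a JSON POST request for an n8n webhook (payload shape is assumed)."""
    body = json.dumps(summary).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url, summary, timeout=10):
    """Send the request and return the HTTP status code."""
    req = build_webhook_request(webhook_url, summary)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Separating request construction from sending makes it possible to inspect the request without any network access.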
Offline replay (no live browser scan)
python3 scripts/web_research_training_pipeline.py \
--topics "space robotics" \
--scan-json artifacts/hydra/mmx_headless_run.json \
--skip-core-check
Outputs
Run directory:
training/runs/web_research/<run_id>/discovered_urls.json
training/runs/web_research/<run_id>/hydra_scan.json
training/runs/web_research/<run_id>/curated_allowed.jsonl
training/runs/web_research/<run_id>/curated_quarantine.jsonl
training/runs/web_research/<run_id>/audit.json
training/runs/web_research/<run_id>/core_health.json
training/runs/web_research/<run_id>/statevector.json
training/runs/web_research/<run_id>/decision_record.json
training/runs/web_research/<run_id>/summary.json
Intake file:
training/intake/web_research/web_research_<run_id>.jsonl
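After a run, a quick check that every expected artifact exists can catch a partially failed pipeline. The following is a small sketch, not part of the script: the file names are the ones listed in the run directory above, while the helper `missing_artifacts` is a hypothetical name introduced for illustration.

```python
from pathlib import Path

# File names as listed for the run directory in this document.
RUN_ARTIFACTS = [
    "discovered_urls.json",
    "hydra_scan.json",
    "curated_allowed.jsonl",
    "curated_quarantine.jsonl",
    "audit.json",
    "core_health.json",
    "statevector.json",
    "decision_record.json",
    "summary.json",
]


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in RUN_ARTIFACTS if not (run_dir / name).is_file()]
```

Called as `missing_artifacts("training/runs/web_research/<run_id>")` with a real run ID substituted, an empty list means all nine files are present. Some files may legitimately be absent in certain modes (for example, `core_health.json` might not be written when the core check is skipped; that is an assumption, not documented behavior).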
Notes
- Core gate uses these tests by default:
  - tests/test_antivirus_membrane.py
  - tests/test_extension_gate.py
  - tests/test_hydra_turnstile.py
  - tests/test_multi_model_modal_matrix.py
- Use --skip-core-check to bypass the gate for exploratory runs.
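The core gate can be reproduced by hand when debugging a failed run. The sketch below assumes the gate is equivalent to running the four test files with pytest and treating a zero exit code as a pass; how the script actually invokes the tests is not stated here, and the helper names are hypothetical.

```python
import subprocess
import sys

# Default core-gate tests, as listed in the notes above.
CORE_GATE_TESTS = [
    "tests/test_antivirus_membrane.py",
    "tests/test_extension_gate.py",
    "tests/test_hydra_turnstile.py",
    "tests/test_multi_model_modal_matrix.py",
]


def build_core_gate_command(tests=CORE_GATE_TESTS):
    """Build a pytest invocation that uses the current Python interpreter."""
    return [sys.executable, "-m", "pytest", "-q", *tests]


def core_gate_passes(skip=False):
    """Return True when the gate is skipped or all core tests pass."""
    if skip:
        # Mirrors the intent of --skip-core-check: no tests are run.
        return True
    result = subprocess.run(build_core_gate_command())
    return result.returncode == 0
```

Run from the repository root so the relative test paths resolve.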