Multi-Cloud Long-Run Training Bootstrap
This repo now has a practical overnight bootstrap for:
- Vertex AI embeddings run (
training/trigger_vertex_training.py) - Kubernetes long-run job scaffold (
k8s/training/phdm21d-longrun-training.yaml) - AWS SageMaker launch template (
training/aws_sagemaker_long_run_training_job.json) - Hugging Face lane placeholder (
scripts/train_hf_longrun_placeholder.py)
Start in one command (Windows)
.\scripts\run-long-run-training.ps1 -Hours 8
That prints what will run by default, and writes a report at: training/runs/<timestamp>/training_bootstrap_report.json.
To include all lanes including the HF placeholder:
.\scripts\run-long-run-training.ps1 -Hours 8 -AllowPending
To actually start jobs (you must have all provider credentials available):
.\scripts\run-long-run-training.ps1 -Hours 8 -Execute
Provider controls
You can target providers with -Providers.
.\scripts\run-long-run-training.ps1 -Hours 8 -Execute -Providers vertex-ai,kubernetes,aws-sagemaker
Available IDs:
vertex-aikubernetesaws-sagemakerhuggingface(pending placeholder)
What to do tomorrow morning
- Validate secret and infra variables:
GCP_PROJECT_ID,GCP_REGION,HF_TOKEN,HF_MODEL_REPOAWS_REGION,SAGEMAKER_EXECUTION_ROLE_ARN,S3_TRAINING_DATA_URI,S3_OUTPUT_PATHKUBECONFIG(for Kubernetes lane)
- Run a dry plan from your machine:
.\scripts\run-long-run-training.ps1 -Hours 8
- Start with one lane at a time with
-Providers. - When each lane is stable, run
-Executewith all lanes.
GitHub workflow
/.github/workflows/nightly-multicloud-training.yml runs a scheduled dry-run at 02:00 UTC and can execute on-demand via workflow_dispatch with execute=true.
Current status
- Vertex/Kubernetes/AWS lanes are wired to a bootstrap manifest and launch paths.
- Hugging Face lane is intentionally marked pending until you wire your real trainer command in
train_hf_longrun_placeholder.py/ plan.