Version Control and Model Governance for Production AI — The Four Registries Every Enterprise Needs
Why Versioning in AI Is Harder Than in Software
In traditional software, versioning the code is sufficient: the behaviour of the system is determined by the code. In AI systems, behaviour is determined by the code, the model weights, the prompts, the training data, the inference parameters, the retrieval index, and the context provided at runtime. Each of these can change independently, and each change affects output behaviour.
When production quality unexpectedly degrades, the investigation must span all four registries: Which model version was running? Which prompt version? Which dataset version was that model trained on? Which deployment track was active? Without a complete governance record, you cannot reproduce the exact state that produced the problem, and you cannot roll back reliably.
Registry 1: Model Registry
What to store
Every model version used in production, with:
- Model identifier (name + version)
- Architecture and parameter count
- Training data provenance (dataset version, training date, data sources)
- Evaluation metrics (benchmark scores, production quality metrics at rollout)
- Serving parameters (temperature, max tokens, top_p)
- Deployment history (when this version was active in production)
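As a sketch, a registry record for one model version might look like the following. The field names, model identifier, dataset tag, and metric values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ModelRegistryEntry:
    """One record in the model registry (illustrative schema)."""
    model_id: str                       # e.g. "claims-extractor" (hypothetical)
    version: str                        # semantic version, e.g. "2.1.0"
    architecture: str                   # model family / architecture
    parameter_count: int
    dataset_version: str                # training data provenance
    training_date: date
    data_sources: list = field(default_factory=list)
    eval_metrics: dict = field(default_factory=dict)     # benchmark + rollout metrics
    serving_params: dict = field(default_factory=dict)   # temperature, max tokens, top_p
    deployment_history: list = field(default_factory=list)  # production activity windows

# Illustrative entry for a fine-tuned model version
entry = ModelRegistryEntry(
    model_id="claims-extractor",
    version="2.1.0",
    architecture="fine-tuned 8B decoder",
    parameter_count=8_000_000_000,
    dataset_version="claims-train-v14",
    training_date=date(2025, 3, 1),
    data_sources=["internal claims corpus", "synthetic edge cases"],
    eval_metrics={"extraction_f1": 0.94, "format_compliance": 0.998},
    serving_params={"temperature": 0.2, "max_tokens": 1024, "top_p": 0.9},
)
```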
Governance requirements
Model promotion to production requires: evaluation on a held-out production-representative test set, comparison against the current production champion (new model must match or exceed on all primary metrics), sign-off from a designated model reviewer, and a documented rollback plan. Use semantic versioning (MAJOR.MINOR.PATCH): major for model family changes, minor for fine-tune updates, patch for inference parameter adjustments.
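A minimal sketch of the automated part of that gate, assuming the candidate and the current champion each carry a dict of primary metric scores (the metric names and values are illustrative):

```python
def passes_promotion_gate(candidate: dict, champion: dict, primary_metrics: list) -> bool:
    """Return True only if the candidate matches or exceeds the champion
    on every primary metric (illustrative gate, no tolerance applied)."""
    return all(candidate.get(m, float("-inf")) >= champion[m] for m in primary_metrics)

# Example: a candidate that regresses on one primary metric is rejected
champion = {"extraction_f1": 0.94, "format_compliance": 0.998}
candidate = {"extraction_f1": 0.95, "format_compliance": 0.991}
print(passes_promotion_gate(candidate, champion, ["extraction_f1", "format_compliance"]))  # False
```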
Tools
MLflow Model Registry, Weights & Biases Artifacts, or AWS SageMaker Model Registry for self-hosted. For API models (GPT, Claude, Gemini): pin to specific model version strings in configuration, never use "latest" aliases — provider-side model updates should be an explicit promotion event in your registry, not an automatic upstream change.
Registry 2: Prompt Library
Why prompts need a registry
A production enterprise agent system may have 50-200 distinct prompts — system prompts, task prompts, extraction templates, validation prompts, summary formats. Each is a versioned asset that affects agent behaviour. When a prompt is changed in production and behaviour degrades, you need to be able to: identify exactly which prompt version was running at the time of each transaction, roll back to the previous version, and compare the two versions to understand the behavioural change.
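For the first of those capabilities, a minimal sketch of stamping every transaction log with the prompt and model versions in use (the field names and identifiers are illustrative assumptions):

```python
import json
from datetime import datetime, timezone

def log_transaction(transaction_id: str, prompt_id: str, model_version: str, output: str) -> str:
    """Record which prompt and model version produced a given output, so a
    later investigation can pin down the exact versions involved."""
    record = {
        "transaction_id": transaction_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_id": prompt_id,            # e.g. "invoice-extraction-v3" (hypothetical)
        "model_version": model_version,    # e.g. "claims-extractor:2.1.0" (hypothetical)
        "output": output,
    }
    return json.dumps(record)

print(log_transaction("txn-00042", "invoice-extraction-v3", "claims-extractor:2.1.0", "{...}"))
```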
Prompt versioning structure
Store prompts in a version-controlled repository alongside application code. Every prompt has an identifier (task-name-vN), the prompt text, associated model and inference parameters, evaluation results, and deployment status. Promote prompts through environments — development, staging, production — just as you promote application code. Require staging validation (automated quality tests on a representative sample) before production promotion.
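A sketch of one such versioned prompt asset stored alongside the application code. The file path, identifiers, parameters, and evaluation figures are illustrative, not a required layout.

```python
# prompts/invoice_extraction_v3.py -- hypothetical path and identifiers
PROMPT = {
    "id": "invoice-extraction-v3",
    "status": "staging",                        # development | staging | production
    "model": "claims-extractor:2.1.0",          # registry reference, not an alias
    "params": {"temperature": 0.0, "max_tokens": 800},
    "eval": {"field_accuracy": 0.97, "sample_size": 500},
    "text": (
        "You are an invoice extraction assistant. "
        "Return a JSON object with the fields: vendor, date, total. "
        "If a field is missing, return null for it."
    ),
}
```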
Registry 3: Dataset Versioning
The training and evaluation datasets used to produce and validate model versions must be versioned with the same rigour as model weights. Dataset versioning enables:
- Reproducible training (any model version can be retrained from scratch using the exact dataset version it was originally trained on)
- Evaluation integrity (the evaluation dataset used to approve a model should not overlap with its training data)
- Drift analysis (compare training dataset statistics to the current production input distribution to detect data drift)
Implementation: DVC (Data Version Control) integrates with Git to version datasets stored in object storage. Delta Lake and Apache Iceberg provide time-travel capabilities for large datasets, so you can query the dataset exactly as it existed at any prior point in time.
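A minimal sketch of reading a pinned dataset version with DVC's Python API. The file path and the Git tag are assumptions for illustration; in practice the tag would be recorded in the model registry entry that was trained or evaluated against it.

```python
import dvc.api

# Read the exact evaluation set that a model version was approved against.
# "datasets/eval.jsonl" and the tag "dataset-v14" are illustrative names.
with dvc.api.open("datasets/eval.jsonl", repo=".", rev="dataset-v14") as f:
    eval_records = [line for line in f]

print(f"Loaded {len(eval_records)} evaluation records from dataset-v14")
```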
Registry 4: Deployment Tracks and Rollout Governance
Deployment tracks define how a new model or prompt version progresses to full production traffic. A disciplined track structure, summarised in the table below: shadow mode (the new version runs alongside production, its outputs logged but not served), canary 5% (5% of production traffic routed to the new version, metrics monitored), canary 20%, canary 50%, and full rollout. Each transition requires an automated quality-gate pass (primary metrics within a defined tolerance of baseline), with optional manual approval for major version changes; a sketch of such a gate check follows the table.
| Deployment Track | Traffic % | Duration | Promotion Criteria |
|---|---|---|---|
| Shadow | 0% (parallel run) | 3–7 days | Output format compliance >99%; no critical failures |
| Canary — micro | 5% | 2–3 days | Primary quality metric within 2% of baseline; zero P0 incidents |
| Canary — small | 20% | 3–5 days | Primary quality metric within 1% of baseline; cost within 110% of baseline |
| Canary — half | 50% | 3–5 days | All metrics stable; rollback tested; on-call briefed |
| Production | 100% | Ongoing | Previous version retained for 30-day rollback window |
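As a sketch, an automated gate check for the canary stages under the tolerances in the table above. The stage names, metric values, and the decision to apply the P0-incident check at both canary stages are illustrative assumptions.

```python
# Tolerance of the primary quality metric relative to baseline, per canary stage
# (taken from the promotion criteria in the table above).
STAGE_TOLERANCE = {"canary_micro": 0.02, "canary_small": 0.01}

def canary_gate_passes(stage: str,
                       candidate_metric: float,
                       baseline_metric: float,
                       p0_incidents: int) -> bool:
    """Promote only if the primary metric stays within the stage's tolerance
    of baseline and no P0 incidents occurred during the canary window."""
    tolerance = STAGE_TOLERANCE[stage]
    within_tolerance = candidate_metric >= baseline_metric * (1 - tolerance)
    return within_tolerance and p0_incidents == 0

# Example: a 5% canary with a 1.5% quality dip and no incidents passes the 2% gate
print(canary_gate_passes("canary_micro", 0.926, 0.94, p0_incidents=0))  # True
```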
VoltusWave's platform includes a model registry, prompt library, dataset versioning, and deployment track management as platform defaults. Every production decision is traceable to a specific model version, prompt version, and dataset snapshot — providing the complete provenance chain required for regulated industry compliance.
See Governance Features →