[Diagram: Model Governance Registry — version control layers. Four registries, each holding versions in archived/stable/testing/active states: Model Registry (v1.0.0 archived, v1.1.0 stable, v2.0.0-beta testing, v2.0.0 active); Prompt Library (extraction-v3 archived, routing-v8 stable, validation-v5 testing, summary-v2 active); Dataset Versions (train-2024Q1 archived, train-2024Q3 stable, eval-gold-v4 testing, prod-holdout active); Deployment Track (shadow archived, canary-5% stable, canary-20% testing, production active). All four registries versioned, linked, and auditable — full provenance chain for every production decision.]

Version Control and Model Governance for Production AI — The Four Registries Every Enterprise Needs

Why Versioning in AI Is Harder Than in Software

In traditional software, versioning the code is sufficient because the system's behaviour is determined by the code. In AI systems, behaviour is determined by the code, the model weights, the prompts, the training data, the inference parameters, the retrieval index, and the context provided at runtime — all of which can change independently, and each of which affects output behaviour.

When production quality unexpectedly degrades, the investigation must span all four registries: which model version was running? Which prompt version? What training data was that model trained on? What deployment track was active? Without a complete governance registry, you cannot reproduce the exact state that produced the problem, and you cannot roll back reliably.
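One way to make that reproducibility requirement concrete is to stamp every transaction with the versions that produced it, so the exact state can be reconstructed later. A minimal sketch (all names and fields are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ProvenanceStamp:
    """Versions in effect when a transaction was processed."""
    model_version: str      # e.g. "v2.0.0" from the model registry
    prompt_version: str     # e.g. "summary-v2" from the prompt library
    dataset_version: str    # training data snapshot behind the model
    deployment_track: str   # "shadow", "canary-5%", ..., "production"

def stamp_transaction(txn_id: str, stamp: ProvenanceStamp) -> dict:
    """Attach the full provenance chain to a transaction log record."""
    return {"txn_id": txn_id, **asdict(stamp)}

record = stamp_transaction(
    "txn-001",
    ProvenanceStamp("v2.0.0", "summary-v2", "train-2024Q3", "production"),
)
```

With a record like this logged per transaction, the post-incident questions above (which model, which prompt, which data, which track) become lookups rather than archaeology.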

🔴The most common versioning failure in enterprise AI: the prompt is not version-controlled. Engineers treat prompts as configuration, edit them in production, and have no way to roll back when a prompt change causes unexpected behaviour. A prompt is code. It must be version-controlled with the same rigour as application code — in a repository, with review, with staging validation before production promotion.

Registry 1: Model Registry

What to store

Every model version used in production, with: model identifier (name + version), architecture and parameter count, training data provenance (dataset version, training date, data sources), evaluation metrics (benchmark scores, production quality metrics at rollout), serving parameters (temperature, max tokens, top_p), and deployment history (when was this version active in production?).

Governance requirements

Model promotion to production requires: evaluation on a held-out production-representative test set, comparison against the current production champion (new model must match or exceed on all primary metrics), sign-off from a designated model reviewer, and a documented rollback plan. Use semantic versioning (MAJOR.MINOR.PATCH): major for model family changes, minor for fine-tune updates, patch for inference parameter adjustments.
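The core promotion rule (the challenger must match or exceed the current champion on every primary metric) can be sketched as a simple automated gate; the metric names below are illustrative:

```python
def passes_promotion_gate(champion: dict, challenger: dict,
                          primary_metrics: list[str]) -> bool:
    """Challenger must match or exceed the champion on all primary metrics."""
    return all(challenger[m] >= champion[m] for m in primary_metrics)

champion   = {"extraction_f1": 0.94, "format_compliance": 0.991}
challenger = {"extraction_f1": 0.95, "format_compliance": 0.995}

assert passes_promotion_gate(champion, challenger,
                             ["extraction_f1", "format_compliance"])
```

A gate like this is a necessary condition, not the whole process: reviewer sign-off and the rollback plan remain manual steps in the workflow.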

Tools

MLflow Model Registry, Weights & Biases Artifacts, or AWS SageMaker Model Registry for self-hosted. For API models (GPT, Claude, Gemini): pin to specific model version strings in configuration, never use "latest" aliases — provider-side model updates should be an explicit promotion event in your registry, not an automatic upstream change.
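Version pinning for API models can live in ordinary configuration, with a guard that rejects floating aliases. A sketch, with illustrative (not necessarily current) provider version strings:

```python
# Pinned model version per task; changing any of these strings should be
# an explicit registry promotion event, never an automatic upstream change.
MODEL_PINS = {
    "extraction": "gpt-4o-2024-08-06",
    "routing": "claude-3-5-sonnet-20241022",
}

FORBIDDEN_ALIASES = ("latest", "stable", "default")

def validate_pins(pins: dict) -> None:
    """Reject floating aliases so provider updates can't change behaviour silently."""
    for task, version in pins.items():
        if any(alias in version for alias in FORBIDDEN_ALIASES):
            raise ValueError(f"{task!r} uses a floating alias: {version!r}")

validate_pins(MODEL_PINS)  # passes: every pin is an explicit version string
```

Running this check in CI makes "never use latest" an enforced invariant rather than a convention.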

Registry 2: Prompt Library

Why prompts need a registry

A production enterprise agent system may have 50-200 distinct prompts — system prompts, task prompts, extraction templates, validation prompts, summary formats. Each is a versioned asset that affects agent behaviour. When a prompt is changed in production and behaviour degrades, you need to be able to: identify exactly which prompt version was running at the time of each transaction, roll back to the previous version, and compare the two versions to understand the behavioural change.

Prompt versioning structure

Store prompts in a version-controlled repository alongside application code. Every prompt has an identifier (task-name-vN), the prompt text, associated model and inference parameters, evaluation results, and deployment status. Promote prompts through environments — development, staging, production — just as you promote application code. Require staging validation (automated quality tests on a representative sample) before production promotion.
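One way to represent such a versioned prompt asset and its environment promotion, as a sketch (field names and the three-stage flow follow the text; the schema itself is an assumption):

```python
from dataclasses import dataclass, field

STATUS_FLOW = ["development", "staging", "production"]

@dataclass
class PromptVersion:
    prompt_id: str                  # e.g. "extraction-v3"
    text: str                       # the prompt itself
    model: str                      # pinned model version string
    params: dict = field(default_factory=dict)  # temperature, max tokens, ...
    status: str = "development"     # development -> staging -> production

def promote(p: PromptVersion) -> PromptVersion:
    """Advance one environment; staging validation is assumed to have passed."""
    idx = STATUS_FLOW.index(p.status)
    if idx == len(STATUS_FLOW) - 1:
        raise ValueError(f"{p.prompt_id} is already in production")
    p.status = STATUS_FLOW[idx + 1]
    return p
```

Because promotion only ever advances one environment at a time, a prompt cannot reach production without passing through staging, which is where the automated quality tests run.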

📋A freight document extraction agent was updated with a "minor" prompt clarification — changing "extract the shipper's address" to "extract the full shipper's address including country code." The change caused the address field to include country codes that downstream SAP posting did not expect, breaking the posting logic. Because the prompt was version-controlled, the exact change was identifiable in under 5 minutes and rolled back in under 10.

Registry 3: Dataset Versioning

The training and evaluation datasets used to produce and validate model versions must be versioned with the same rigour as model weights. Dataset versioning enables: reproducible model training (any model version can be retrained from scratch using the exact dataset version it was originally trained on), evaluation integrity (the evaluation dataset used to approve a model should not overlap with its training data), and drift analysis (compare training dataset statistics to current production input distribution to detect data drift).

Implementation: DVC (Data Version Control) integrates with Git to version datasets stored in object storage. Delta Lake and Apache Iceberg provide time-travel capabilities for large datasets — you can query the dataset exactly as it existed at any point in time.
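Those tools handle storage and time travel; the drift analysis mentioned above can start as simply as comparing summary statistics between the training snapshot and recent production inputs. A minimal sketch using a mean-shift test on one feature (the 2-sigma threshold and the sample values are arbitrary examples):

```python
from statistics import mean, stdev

def mean_shift(train: list[float], prod: list[float]) -> float:
    """Shift of the production mean, in units of training standard deviations."""
    return abs(mean(prod) - mean(train)) / stdev(train)

def drifted(train: list[float], prod: list[float], threshold: float = 2.0) -> bool:
    """Flag drift when the production mean moves more than `threshold` sigmas."""
    return mean_shift(train, prod) > threshold

train_lengths = [980, 1010, 1005, 995, 1010, 1000]   # e.g. document token counts
prod_lengths  = [1400, 1380, 1420, 1390, 1410, 1405]

assert drifted(train_lengths, prod_lengths)
```

Production systems typically use distribution-level tests (population stability index, KS test) per feature, but the principle is the same: the versioned training snapshot is the baseline the comparison runs against.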

Registry 4: Deployment Tracks and Rollout Governance

Deployment tracks define how a new model or prompt version progresses to full production traffic. A disciplined track structure: shadow mode (new version runs alongside production, outputs logged but not used), canary 5% (5% of production traffic routed to the new version, metrics monitored), canary 20%, canary 50%, full rollout. Each transition requires an automated quality gate pass — primary metrics within defined tolerance of baseline — and optional manual approval for major version changes.

| Deployment Track | Traffic % | Duration | Promotion Criteria |
|---|---|---|---|
| Shadow | 0% (parallel run) | 3–7 days | Output format compliance >99%; no critical failures |
| Canary — micro | 5% | 2–3 days | Primary quality metric within 2% of baseline; zero P0 incidents |
| Canary — small | 20% | 3–5 days | Primary quality metric within 1% of baseline; cost within 110% of baseline |
| Canary — half | 50% | 3–5 days | All metrics stable; rollback tested; on-call briefed |
| Production | 100% | Ongoing | Previous version retained for 30-day rollback window |

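The track progression can be encoded so that a version only ever advances one step at a time and never skips a gate. A sketch, with track names following the table and the gate result passed in as a boolean:

```python
TRACKS = ["shadow", "canary-5", "canary-20", "canary-50", "production"]

def next_track(current: str, gate_passed: bool) -> str:
    """Advance exactly one step when the automated quality gate passes."""
    idx = TRACKS.index(current)
    if not gate_passed:
        return current          # hold at the current track (or trigger rollback)
    if idx == len(TRACKS) - 1:
        return current          # already at full rollout
    return TRACKS[idx + 1]

assert next_track("shadow", gate_passed=True) == "canary-5"
assert next_track("canary-20", gate_passed=False) == "canary-20"
```

Keeping the track list as explicit ordered data makes it trivial to audit which states exist and impossible to jump from shadow straight to full production.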
Built-In Governance

VoltusWave's platform includes a model registry, prompt library, dataset versioning, and deployment track management as platform defaults. Every production decision is traceable to a specific model version, prompt version, and dataset snapshot — providing the complete provenance chain required for regulated industry compliance.

See Governance Features →