SLM Weight Versioning Like Git

In an age focused on building specialized Small Language Models (SLMs) for high-throughput, real-time applications, we seem to be at an impasse. We have become very good at fine-tuning models, yet we remain poor at maintaining them.

Deploying a single LLM is much like managing an API dependency. Deploying several domain-specific SLMs (say, one for PII removal, one for intent detection, and another for structured data extraction) is a different beast altogether.

In this article, you will learn how to design an architecture that helps you avoid “model rot” across your whole fleet of SLMs. Specifically, you will learn how to:

Set up a Model Registry that keeps track of lineage and performance of models
Implement a Gateway Pattern that can perform version-controlled routing
Develop a Manifest-based Delivery System that allows you to deploy to edge devices

Prerequisites: You should know Python well and have some experience training ML models. Experience with MLflow or another experiment tracker is also helpful.

The Problem: Model Rot at Scale

When you are working with five different SLMs, you cannot treat weights as black boxes inside an S3 bucket. Names such as final_model_v2_fixed will not work either, because they tell you nothing about what data or settings produced the file.

Here is how to make such an AI application ready for production.

How to Build a Model Registry

Do away with ad-hoc folders. You should have a central Model Registry; tools such as MLflow, DVC, or Weights & Biases all work well.

The Model Registry should serve as your Git. It is the one place where every version of a model is recorded alongside the dataset it was trained on, its hyperparameters, and its benchmark metrics, so you can trace the lineage of any model you deploy.

Figure: A model registry connects to your training source.
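To make the idea of a registry entry concrete, here is a minimal sketch that uses only the Python standard library. It is not the MLflow, DVC, or Weights & Biases API; the names ModelRecord and InMemoryRegistry, the field names, and the placeholder values are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes (e.g. a dataset or weight file)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    """One immutable registry entry: what was trained, on what, and how it scored."""
    name: str
    version: str
    weights_sha256: str
    dataset_sha256: str
    hyperparameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class InMemoryRegistry:
    """Toy registry keyed by (name, version); a real one would persist its records."""

    def __init__(self):
        self._records = {}

    def register(self, record: ModelRecord) -> None:
        key = (record.name, record.version)
        if key in self._records:
            # Versions are immutable: re-registering would silently rewrite history.
            raise ValueError(f"{record.name} {record.version} is already registered")
        self._records[key] = record

    def get(self, name: str, version: str) -> ModelRecord:
        return self._records[(name, version)]

    def lineage(self, name: str) -> list:
        """All recorded versions of a model, in registration order."""
        return [r for (n, _), r in self._records.items() if n == name]


registry = InMemoryRegistry()
registry.register(ModelRecord(
    name="intent-classifier",
    version="1.0.0",
    weights_sha256=sha256_bytes(b"placeholder weights"),
    dataset_sha256=sha256_bytes(b"placeholder dataset"),
    hyperparameters={"learning_rate": 2e-5, "epochs": 3},  # illustrative values
    metrics={"accuracy": 0.0},  # placeholder, not a real benchmark
))
print(json.dumps(asdict(registry.get("intent-classifier", "1.0.0")), indent=2))
```

Hashing the dataset and the weights, rather than trusting file names, is what lets you answer "which data produced this model?" months later.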

How to Implement the Gateway Pattern for Version Control

Treating weights like code requires semantic versioning. Rather than hard-coding the location of the model into your software, use the Gateway Pattern to put a layer of indirection between your application and your model artifacts.

You can then do A/B testing and rollbacks by modifying only a config file, with no need for a re-deploy.
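For rollbacks to be well-defined, version strings have to be comparable. The sketch below assumes versions follow the MAJOR.MINOR.PATCH form, optionally prefixed with "v"; the helper names parse_version and rollback_target are hypothetical. Comparing integer tuples avoids the string-ordering trap where "1.10.0" sorts before "1.2.0".

```python
def parse_version(text: str) -> tuple:
    """Parse 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into an int tuple."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def rollback_target(available: list, current: str):
    """Return the newest version strictly older than `current`, or None."""
    older = [v for v in available if parse_version(v) < parse_version(current)]
    return max(older, key=parse_version) if older else None


versions = ["1.0.0", "1.2.0", "1.10.0", "2.0.0"]
print(max(versions, key=parse_version))      # 2.0.0
print(rollback_target(versions, "1.10.0"))   # 1.2.0
```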

Here's an example of how one might route requests through a Python gateway:

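class ModelGateway:
    """Routes requests to the active version of each registered model."""

    def __init__(self):
        # Each model keeps its own version table and active-version pointer.
        self.routes = {
            "intent-classifier": {
                "versions": {
                    "v1": "models/intent_v1.safetensors",
                    "v2": "models/intent_v2.safetensors",
                },
                "active_version": "v1",
            }
        }

    def predict(self, input_text, model="intent-classifier"):
        route = self.routes[model]
        model_path = route["versions"][route["active_version"]]
        return self.load_and_run(model_path, input_text)

    def switch_version(self, version, model="intent-classifier"):
        # Hot-swap the version without redeploying the app
        if version not in self.routes[model]["versions"]:
            raise ValueError(f"Unknown version {version!r} for {model}")
        self.routes[model]["active_version"] = version
        print(f"Traffic for {model} routed to {version}")

    def load_and_run(self, model_path, input_text):
        # Placeholder: load (or fetch from cache) the weights and run inference.
        raise NotImplementedError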

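The key insight here is that switch_version only changes which path the gateway resolves, so you can swap from v1 to v2 in milliseconds, provided the v2 weights are already on disk or in memory. No downtime. No pipeline rerun. You update the config, and the gateway handles the rest.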
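The gateway above keeps its routes in code. To get rollbacks and A/B tests "by modifying only a config file", the routes can live in a file that the gateway re-reads when it changes. Here is a minimal sketch under stated assumptions: the file name routes.json, the config shape, and the names ConfigRouter and write_config are all hypothetical, and the traffic split is a simple weighted random choice.

```python
import json
import os
import random
import tempfile


class ConfigRouter:
    """Picks a model version per request from a JSON config file.

    Assumed (hypothetical) config shape:
    {"intent-classifier": {"weights": {"v1": 0.9, "v2": 0.1}}}
    """

    def __init__(self, config_path: str, rng=None):
        self.config_path = config_path
        self.rng = rng or random.Random()
        self._signature = None
        self._routes = {}

    def _reload_if_changed(self) -> None:
        # Re-read the file only when its inode, mtime, or size changes.
        st = os.stat(self.config_path)
        signature = (st.st_ino, st.st_mtime_ns, st.st_size)
        if signature != self._signature:
            with open(self.config_path, encoding="utf-8") as f:
                self._routes = json.load(f)
            self._signature = signature

    def choose_version(self, model: str) -> str:
        self._reload_if_changed()
        weights = self._routes[model]["weights"]
        versions = list(weights)
        return self.rng.choices(versions, weights=[weights[v] for v in versions])[0]


def write_config(path: str, routes: dict) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(routes, f)
    os.replace(tmp, path)


config_path = os.path.join(tempfile.mkdtemp(), "routes.json")

# A/B test: roughly 90% of requests go to v1, 10% to v2.
write_config(config_path, {"intent-classifier": {"weights": {"v1": 0.9, "v2": 0.1}}})
router = ConfigRouter(config_path, rng=random.Random(0))
print(router.choose_version("intent-classifier"))

# Rollback: send all traffic to v1 by editing only the config file.
write_config(config_path, {"intent-classifier": {"weights": {"v1": 1.0, "v2": 0.0}}})
print(router.choose_version("intent-classifier"))
```

Checking the file on every request is the simplest option; a busier service might poll on a timer instead. Either way, the application binary never changes when traffic moves.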

How to Handle Edge Deployment with a Manifest System

The last obstacle is synchronization. For edge devices such as laptops or phones, manifest-based deployment keeps your models synchronized without downloading gigabytes of weights every time there is an update.

Here’s how the process flows through:

Registry Update: You publish a new version (let’s say of your LoRA adapter) to the central registry.
Manifest Broadcasting: Your client app fetches a small JSON file called manifest.json, which lists the canonical version of each of your registered models.
Weight Update: Instead of downloading a 4GB file, your program uses the manifest to work out exactly which layer weights changed in the update and fetches only those.

It is similar to how package managers such as npm or pip handle dependency versions: you download only what you are missing, and you always know exactly what you are running.
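The three steps above can be sketched as a manifest comparison. This is a minimal illustration, not a real delivery protocol: the manifest shape, the file names, the byte sizes, and the function plan_update are all assumptions. It also works at file granularity (a base model plus an adapter file) instead of per-layer deltas, which is the simplest case.

```python
import hashlib
import json


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_update(local: dict, remote: dict) -> dict:
    """Compare two manifests and list which artifacts to fetch or delete.

    Assumed (hypothetical) manifest shape:
    {"model": ..., "version": ..., "artifacts": {name: {"sha256": ..., "bytes": ...}}}
    """
    local_files = local.get("artifacts", {})
    remote_files = remote.get("artifacts", {})
    # Fetch anything that is missing locally or whose hash differs.
    download = sorted(
        name for name, meta in remote_files.items()
        if local_files.get(name, {}).get("sha256") != meta["sha256"]
    )
    # Remove anything the new version no longer lists.
    delete = sorted(name for name in local_files if name not in remote_files)
    return {
        "download": download,
        "delete": delete,
        "bytes_to_fetch": sum(remote_files[name]["bytes"] for name in download),
    }


# Illustrative sizes: a large base model that is unchanged, and a small adapter.
base = {"sha256": sha256_bytes(b"base weights"), "bytes": 4_000_000_000}
local_manifest = {
    "model": "intent-classifier",
    "version": "1.0.0",
    "artifacts": {
        "base.safetensors": base,
        "adapter.safetensors": {"sha256": sha256_bytes(b"adapter v1"), "bytes": 20_000_000},
    },
}
remote_manifest = {
    "model": "intent-classifier",
    "version": "1.1.0",
    "artifacts": {
        "base.safetensors": base,
        "adapter.safetensors": {"sha256": sha256_bytes(b"adapter v2"), "bytes": 20_000_000},
    },
}

print(json.dumps(plan_update(local_manifest, remote_manifest), indent=2))
```

In this example only the adapter file is scheduled for download, because the base model's hash is unchanged. After downloading, a client would normally recompute each file's hash and compare it with the manifest before switching over.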

Why This Matters

The end of the “one-size-fits-all” model era is here. Once you’ve established a rigorous registry and routing design, you can shift your attention from repairing failed models to maximizing their efficiency.

Here is what you have designed throughout this article:

An Artifact registry that records lineage, hashing of datasets, and benchmarking for every artifact
The Gateway Design Pattern that allows separating the version of a model from its source code and hotswapping it without any downtime
An Edge Delivery system based on Manifests and syncing the delta of weights between nodes

Treating model weights like code is more than good practice: it is a matter of control. Once you know exactly what runs, why it runs, and how it can be swapped within milliseconds, you are no longer just an AI developer but an AI systems engineer.
