Building a Model Registry from Scratch and Model Evaluation with PyTorch

When working with machine learning models, the process doesn’t end with training. Evaluating models and keeping track of them is equally important, especially when working in production environments. In this post, we’ll explore how to evaluate a model using PyTorch and build a custom model registry from scratch to version and manage models efficiently.


1. Model Evaluation with PyTorch

Model evaluation is essential for understanding how well a trained model generalizes to unseen data. PyTorch provides flexibility in designing evaluation pipelines. Let’s start with a few common metrics like accuracy, precision, recall, and F1-score.

Code for Evaluating a Model

Below is an example of evaluating a simple classification model in PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assuming model, test_loader, and device are already defined
model.eval()  # Set model to evaluation mode

all_preds = []
all_labels = []

# Disable gradient calculations for faster computations
with torch.no_grad():
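    for inputs, labels in test_loader:
        # Move the batch to the same device as the model
        inputs, labels = inputs.to(device), labels.to(device)
        # Forward pass
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        # Collect predictions and labels on the CPU for scikit-learn
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())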

# Calculate evaluation metrics
accuracy = accuracy_score(all_labels, all_preds)
precision = precision_score(all_labels, all_preds, average='weighted')
recall = recall_score(all_labels, all_preds, average='weighted')
f1 = f1_score(all_labels, all_preds, average='weighted')

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1:.4f}")

This code evaluates a trained model on a test dataset and calculates standard metrics to understand model performance.
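The same loop can be wrapped in a small reusable function. The following is a minimal sketch rather than the only way to structure it: the helper name `collect_predictions`, the toy `nn.Linear` model, and the random `TensorDataset` are assumptions made purely so the example runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def collect_predictions(model, loader, device):
    """Return (labels, predictions) as plain Python lists for a classifier."""
    model.eval()  # switch layers such as dropout to inference behavior
    all_preds, all_labels = [], []
    with torch.no_grad():  # gradients are not needed for evaluation
        for inputs, labels in loader:
            inputs = inputs.to(device)
            outputs = model(inputs)
            preds = outputs.argmax(dim=1)  # index of the highest score per sample
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.tolist())
    return all_labels, all_preds


# Hypothetical toy setup: 20 random samples, 4 features, 3 classes
torch.manual_seed(0)
features = torch.randn(20, 4)
targets = torch.randint(0, 3, (20,))
toy_loader = DataLoader(TensorDataset(features, targets), batch_size=8)
toy_model = nn.Linear(4, 3)
device = torch.device("cpu")

y_true, y_pred = collect_predictions(toy_model, toy_loader, device)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Toy accuracy: {accuracy:.4f}")
```

The two returned lists can be passed straight to scikit-learn functions such as `accuracy_score` or `f1_score`. The toy model is untrained, so its accuracy is not meaningful.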

2. Building a Model Registry from Scratch

Managing models in production becomes complicated when there are multiple versions to keep track of. A model registry provides a way to store models, version them, and retrieve them when needed. While tools like MLflow exist for this purpose, we’ll walk through building a simple custom model registry using Python and pickle.

Structure of the Model Registry

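The model registry consists of two parts: a models/ directory that holds each saved model version as a pickle file, and a registry.json file that records the version number, file path, and evaluation metrics for every saved model.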

Step-by-Step Implementation

Step 1: Directory Structure

The following directory structure will help organize models:

model_registry/
├── models/
│   ├── model_v1.pkl
│   └── model_v2.pkl
└── registry.json
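To make the role of registry.json concrete, the snippet below prints what the file might look like after two versions of one model have been saved. The model name, paths, and metric values are illustrative placeholders. The layout (one list of version entries per model name, each with "version", "path", and "metrics" keys) is assumed from the fields the registry code in Step 2 records.

```python
import json

# Illustrative registry contents: one list of version entries per model name.
example_registry = {
    "my_classification_model": [
        {
            "version": 1,
            "path": "model_registry/models/my_classification_model_v1.pkl",
            "metrics": {"accuracy": 0.90},
        },
        {
            "version": 2,
            "path": "model_registry/models/my_classification_model_v2.pkl",
            "metrics": {"accuracy": 0.92},
        },
    ]
}

# Serialize with the same indentation the registry uses when writing the file
registry_text = json.dumps(example_registry, indent=4)
print(registry_text)
```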

Step 2: Code for Model Registry

import os
import json
import pickle

class ModelRegistry:
    def __init__(self, registry_dir="model_registry"):
        self.registry_dir = registry_dir
        self.models_dir = os.path.join(self.registry_dir, "models")
        self.registry_file = os.path.join(self.registry_dir, "registry.json")
        os.makedirs(self.models_dir, exist_ok=True)
        if not os.path.exists(self.registry_file):
            with open(self.registry_file, 'w') as f:
                json.dump({}, f)

    def save_model(self, model, model_name, metrics):
        # Save the model
        version = self.get_next_version(model_name)
        model_path = os.path.join(self.models_dir, f"{model_name}_v{version}.pkl")
        with open(model_path, 'wb') as f:
            pickle.dump(model, f)
        # Update registry
        self.update_registry(model_name, version, model_path, metrics)
        print(f"Model {model_name} version {version} saved successfully!")

    def update_registry(self, model_name, version, model_path, metrics):
        with open(self.registry_file, 'r') as f:
            registry = json.load(f)
        if model_name not in registry:
            registry[model_name] = []
        # Record this version's file path and metrics
        registry[model_name].append({
            "version": version,
            "path": model_path,
            "metrics": metrics
        })
        with open(self.registry_file, 'w') as f:
            json.dump(registry, f, indent=4)

    def get_next_version(self, model_name):
        with open(self.registry_file, 'r') as f:
            registry = json.load(f)
        if model_name not in registry:
            return 1
        # Versions are numbered sequentially, starting at 1
        return len(registry[model_name]) + 1

    def load_model(self, model_name, version=None):
        with open(self.registry_file, 'r') as f:
            registry = json.load(f)
        if model_name not in registry:
            raise ValueError(f"Model {model_name} not found in the registry")
        if version is None:
            version = len(registry[model_name])  # Load latest version
        model_info = next((m for m in registry[model_name] if m['version'] == version), None)
        if model_info is None:
            raise ValueError(f"Version {version} of model {model_name} not found")
        with open(model_info['path'], 'rb') as f:
            model = pickle.load(f)
        return model

Step 3: Saving and Loading Models

Now, let’s demonstrate how to use this registry to save and load models.

# Assuming `trained_model` is the trained PyTorch model and `metrics` is the evaluation result
registry = ModelRegistry()

# Save the model and metrics
metrics = {"accuracy": 0.92, "precision": 0.89, "recall": 0.88, "f1_score": 0.89}
registry.save_model(trained_model, "my_classification_model", metrics)

# Load the latest version of the model
loaded_model = registry.load_model("my_classification_model")

3. Visualizing the Evaluation Metrics

An important part of a model registry is being able to compare the performance of models and tell which one is best suited to a particular use case. Visualizing the evaluation metrics gives a clearer picture of model performance. Below is a code snippet that plots these metrics using matplotlib:

import matplotlib.pyplot as plt

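# Named metric_names so the earlier `metrics` dictionary is not overwritten
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [accuracy, precision, recall, f1]

plt.bar(metric_names, values, color=['blue', 'orange', 'green', 'red'])
plt.title("Model Evaluation Metrics")
plt.ylim(0, 1)
plt.show()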
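Because the registry stores metrics alongside each version, versions can also be compared programmatically. The sketch below is one possible approach: `best_version` is a hypothetical helper (not part of the registry class above), and it assumes entries with "version", "path", and "metrics" keys.

```python
def best_version(entries, metric):
    """Return the entry with the highest value for `metric`.

    Entries that do not record the metric are skipped.
    """
    scored = [e for e in entries if metric in e.get("metrics", {})]
    if not scored:
        raise ValueError(f"No version records the metric '{metric}'")
    return max(scored, key=lambda e: e["metrics"][metric])


# Illustrative entries, shaped like one model's list in registry.json
entries = [
    {"version": 1, "path": "models/demo_v1.pkl",
     "metrics": {"f1_score": 0.85, "recall": 0.91}},
    {"version": 2, "path": "models/demo_v2.pkl",
     "metrics": {"f1_score": 0.89, "recall": 0.88}},
]

best_f1 = best_version(entries, "f1_score")
best_recall = best_version(entries, "recall")
print(best_f1["version"], best_recall["version"])  # prints: 2 1
```

Which metric to rank by depends on the use case: in this made-up data, version 2 has the better F1-score while version 1 has the better recall.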

A confusion matrix is used in model evaluation to visualize the performance of a classification algorithm. It shows the number of correct and incorrect predictions for each class by comparing the true labels with the predicted labels.
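As a small illustration, the snippet below builds a confusion matrix with scikit-learn's `confusion_matrix`. The labels are made up for a three-class problem. In the evaluation code from section 1, `all_labels` and `all_preds` would take their place.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The diagonal holds the correct predictions for each class. Off-diagonal cells show which classes are confused with one another, for example the one class-0 sample predicted as class 1.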

Precision, recall, and F1-score assess how well a model distinguishes between classes: precision matters most when false positives are costly, recall when missing positives is critical, and F1-score offers a balance between the two.
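To show how these three numbers relate, here is a minimal sketch that computes them from raw counts for a single class. The function name and the counts (8 true positives, 2 false positives, 4 false negatives) are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1-Score: {f:.4f}")
# prints: Precision: 0.8000, Recall: 0.6667, F1-Score: 0.7273
```

The model here rarely raises false alarms (high precision) but misses a third of the actual positives (lower recall), and the F1-score lands between the two.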


In this blog post, we explored how to evaluate machine learning models using PyTorch and built a simple model registry from scratch. Evaluating models with proper metrics is key to understanding their generalization capabilities, while a registry helps in versioning and managing models efficiently. You can expand this setup with additional features like storing more metadata or integrating with cloud services for model storage.

Stay tuned for future posts on automating model evaluation and deployment in production environments!