Model Hosting Platform - Model Onboarding Guide

The RAI Model Hosting Platform streamlines AI model deployment, improves performance efficiency, and provides a seamless experience for model owners solving AI safety challenges, empowering them with scalable, efficient, and reliable hosting. Models hosted on the platform integrate with robust security infrastructure to provide advanced protection for leading LLMs.

Value Proposition

Target Users:

👩‍💻 AI Model Owners – Researchers & engineers building AI models for safety.

Customer Pain Points vs. Platform Value Proposition

| Model Onboarding Lifecycle Step | Pain Points | Our Value Proposition |
|---|---|---|
| 1. Build the Model Image | 🚨 Slow Image Build Time – Lack of standardization leads to long build times, delaying model deployment and iteration. | Standardized & Accelerated Model Deployment – Reduces image build time with a standardized format, enabling faster and more efficient deployment. |
| 2. Collect Performance Metrics | 🚨 Inconsistent Performance Metrics – No unified methodology for benchmarking, making it difficult to assess model efficiency and estimate capacity needs. | Performance & Unified Benchmarks – Provides accurate AI model metrics using a unified benchmarking approach, improving efficiency and scalability. |
| 3. Manage Capacity | 🚨 Over-Provisioning – Wastes compute power and increases costs.<br>🚨 Under-Provisioning – Causes latency issues, slow response times, and downtime. | Cost-Efficient Resource Allocation – Dynamically optimizes resource usage to reduce waste while maintaining performance. |
| 4. Register the Model | 🚨 Lack of Model and Dedicated GPU Utilization Reports – No visibility into model and GPU capacity utilization with allocated resources. | |
| 5. Model Roll Out | 🚨 Slow & Inefficient Model Deployment – Long build times and lack of standardization slow down iteration cycles. | ✅ Seamless Deployment & Real-Time Monitoring – Automates deployment, reducing operational overhead, and enhances security and compliance with customer isolation.<br>✅ Model Multi-Version Support – Enables parallel execution of multiple model versions and handles contract modifications efficiently. |

Overview

This guide outlines the essential steps for onboarding your model to the RAI Platform. We offer two onboarding options to meet diverse customer needs: the Regular Tier and the Fast-tracked Tier. The Regular Tier may take up to 7 days to complete but requires minimal effort from the customer, assuming adequate capacity is available; it is recommended for the initial onboarding of a model. The Fast-tracked Tier guarantees completion within 24 hours, provided there is sufficient capacity, but requires more involvement from the customer; it is ideal for model refreshes, including updates to existing models, hotfixes, and configuration changes. The following checklist outlines the tasks for both onboarding options.

| # | Task | Regular Tier | Fast-tracked Tier |
|---|------|--------------|-------------------|
| 1 | Prepare required information about the model | ✅ | ✅ |
| 2 | Build the model image | ✅ | ✅ |
| 3 | Prepare the model's performance metrics | | ✅ |
| 4 | Confirm model capacity with the Capacity PoC | | ✅ |
| 5 | Register the model through the Model Registration API | | ✅ |
| 6 | Inform the Hosting PoC to start rollout | | ✅ |

Task Description

This section describes each task in detail, covering its purpose, steps, and requirements.

Prepare Required Model Information

  1. Model Information
    • Model name and version
    • Harm categories and model modality
    • Input and output schemas (see the illustrative example below). The content type must be JSON. Input and output schemas are not allowed to change for the Fast-tracked Tier.
    • Target SKU (e.g., CPU/GPU type, GPU microarchitecture like Volta, Ampere, Hopper). If unspecified, the SKU is determined during performance evaluation for a new model; for a model refresh, the SKU remains the same.
  2. Service Capability
    • Maximum payload size
    • Acceptable latency statistics
    • Expected requests per second (RPS) per region/cloud

Fill out the form once you have the above information ready.
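For illustration, here is a hypothetical input/output pair serialized as JSON. Every field name below is made up for the example; your actual schemas are whatever you declare on the form.

```python
# Hypothetical input/output example for a text-safety model. All field
# names are illustrative placeholders; the real schemas are the ones you
# declare on the onboarding form. The platform requires JSON bodies.
import json

example_request = {"text": "content to screen", "language": "en"}
example_response = {"harmCategory": "hate", "severity": 2, "score": 0.87}

print(json.dumps(example_request))
print(json.dumps(example_response))
```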

Build the Model Image

The RAI Platform serves models using Docker containers on Azure Machine Learning with Singularity as the backend.

Requirements for the Docker Image:

  1. Must be built on a Linux system.
  2. Must respond to requests on the liveness, readiness, and scoring routes (see the server sketch below).
  3. Must have an entrypoint of ["runsvdir", "/var/runit"].

For a step-by-step guide, refer to the Tutorial page.
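As a complement to the tutorial, below is a minimal sketch of a scoring server that answers on all three required routes. The route paths, port, and response bodies are assumptions for illustration; follow the Tutorial page for the actual container contract.

```python
# Minimal scoring-server sketch (Flask). Route paths, port, and response
# shapes are assumptions for illustration only; the Tutorial page defines
# the actual container contract.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/liveness")
def liveness():
    # Return 200 as long as the process is up.
    return jsonify(status="alive")

@app.route("/readiness")
def readiness():
    # Return 200 only once the model is loaded and ready to serve traffic.
    return jsonify(status="ready")

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()  # input JSON, matching the registered schema
    # Placeholder "inference": a real model would score the payload here.
    result = {"harmCategory": "hate", "severity": 2, "score": 0.87}
    return jsonify(result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # placeholder port
```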

The image should be hosted in Azure Container Registry. Please make sure the AcrPull role on the container registry is assigned to the following two service principals:

  • AICPRuntime
  • RAIServiceOpsPlatformADO

Prepare Model Performance Metrics

Performance metrics are required to determine model capacity and deployment. The following data should be collected:

  1. SKU – The computing instance type used to run the model.
  2. Payload Size – Size of test samples (e.g., image dimensions, text length in characters).
  3. Latency – Latency statistics under target RPS (max, min, average, P50, P75, P90, P95, P99).
  4. Resource Utilization – CPU, GPU, memory, and GPU memory usage during load tests (optional where resource utilization is not applicable).

Ensure the SKU matches those supported by the RAI Platform. If needed, contact the model hosting PoC with the subject Performance Evaluation for assistance in collecting metrics.
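If you collect raw per-request latencies yourself, a minimal sketch like the following (assuming latencies recorded in milliseconds by your load-testing tool) produces the statistics listed above:

```python
# Sketch: reduce raw per-request latencies (ms) from a load test to the
# statistics requested above. The synthetic data at the bottom is only
# for demonstration.
import numpy as np

def latency_summary(latencies_ms):
    a = np.asarray(latencies_ms, dtype=float)
    summary = {"max": a.max(), "min": a.min(), "average": a.mean()}
    for p in (50, 75, 90, 95, 99):
        summary[f"P{p}"] = np.percentile(a, p)
    return summary

# Example: 3000 synthetic request latencies (as if run at 10 RPS for 5 min).
print(latency_summary(np.random.lognormal(mean=3.0, sigma=0.1, size=3000)))
```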

Example Performance Metrics

Below are example latency metrics for models tested on two device types: Singularity.ND12am_A100_v4 and Singularity.NC4as_T4_v3. These numbers are for demonstration purposes only.

Results on Singularity.ND12am_A100_v4

| Target RPS | 10 | 20 | 30 | 40 |
|---|---|---|---|---|
| Number of Requests | 3000 | 6000 | 9000 | 12000 |
| Total RPS | 9.8 | 19.8 | 29.6 | 38.8 |
| Failure Rate (%) | 0 | 0 | 0 | 20 |
| Average Latency (ms) | 18.5 | 25.3 | 32.4 | 228.5 |
| Max Latency (ms) | 23.4 | 32.5 | 37.4 | 653.3 |
| Min Latency (ms) | 17.2 | 23.2 | 28.4 | 47.5 |
| P50 Latency (ms) | 19.6 | 26.7 | 34.5 | 245.2 |
| P75 Latency (ms) | 21.6 | 28.4 | 35.8 | 275.4 |
| P95 Latency (ms) | 22.8 | 30.2 | 36.4 | 324.8 |
| P99 Latency (ms) | 23.2 | 31.9 | 37.1 | 624.7 |
| Average GPU Count | 1 | 1 | 1 | 1 |
| GPU Utilization (%) | 75 | 74 | 78 | 73 |
| GPU Memory (GB) | 12 | 12 | 12 | 12 |
| Max GPU Utilization (%) | 99 | 99 | 99 | 99 |
| Average CPU Count | 16 | 16 | 16 | 16 |
| Average Logical CPU Count | 16 | 16 | 16 | 16 |
| CPU Utilization (%) | 80 | 80 | 80 | 80 |
| Virtual Memory Used (GB) | 16 | 16 | 16 | 16 |
| Max CPU Utilization (%) | 100 | 100 | 100 | 100 |
| Max Virtual Memory Used (GB) | 20 | 20 | 20 | 20 |

Results on Singularity.NC4as_T4_v3

| Target RPS | 10 | 20 | 30 | 40 |
|---|---|---|---|---|
| Number of Requests | 3000 | 6000 | 9000 | 12000 |
| Total RPS | 9.8 | 19.8 | 29.6 | 38.8 |
| Failure Rate (%) | 0 | 0 | 30 | 50 |
| Average Latency (ms) | 36.5 | 50.3 | 364.4 | 648.5 |
| Max Latency (ms) | 46.4 | 32.5 | 37.4 | 653.3 |
| Min Latency (ms) | 17.2 | 23.2 | 28.4 | 47.5 |
| P50 Latency (ms) | 19.6 | 26.7 | 34.5 | 245.2 |
| P75 Latency (ms) | 21.6 | 28.4 | 35.8 | 275.4 |
| P95 Latency (ms) | 22.8 | 30.2 | 36.4 | 324.8 |
| P99 Latency (ms) | 23.2 | 31.9 | 37.1 | 624.7 |
| Average GPU Count | 1 | 1 | 1 | 1 |
| GPU Utilization (%) | 75 | 74 | 78 | 73 |
| GPU Memory (GB) | 12 | 12 | 12 | 12 |
| Max GPU Utilization (%) | 99 | 99 | 99 | 99 |
| Average CPU Count | 16 | 16 | 16 | 16 |
| Average Logical CPU Count | 16 | 16 | 16 | 16 |
| CPU Utilization (%) | 80 | 80 | 80 | 80 |
| Virtual Memory Used (GB) | 16 | 16 | 16 | 16 |
| Max CPU Utilization (%) | 100 | 100 | 100 | 100 |
| Max Virtual Memory Used (GB) | 20 | 20 | 20 | 20 |

Assume the acceptable P95 latency is 35 ms and the target RPS in East US is 200. From the table above, each Singularity.ND12am_A100_v4 instance can sustain 20 requests per second within that budget (P95 is 30.2 ms at 20 RPS but rises to 36.4 ms at 30 RPS), so 200 / 20 = 10 GPUs are required in East US.
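That sizing rule is simple enough to script. Below is a sketch of the arithmetic, assuming the P95 figures from the Singularity.ND12am_A100_v4 table above (which are demo values):

```python
# Sketch of the sizing arithmetic above, using the P95 latencies from the
# Singularity.ND12am_A100_v4 demo table.
import math

def max_sustainable_rps(p95_by_rps, p95_budget_ms):
    """Highest benchmarked per-instance RPS whose P95 stays within budget."""
    ok = [rps for rps, p95 in p95_by_rps.items() if p95 <= p95_budget_ms]
    return max(ok) if ok else None

a100_p95 = {10: 22.8, 20: 30.2, 30: 36.4, 40: 324.8}  # P95 (ms) per target RPS

per_instance_rps = max_sustainable_rps(a100_p95, p95_budget_ms=35)  # -> 20
instances = math.ceil(200 / per_instance_rps)                       # -> 10
print(f"East US needs {instances} x Singularity.ND12am_A100_v4")
```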

Confirm Resource Capacity

The number of SKUs needs to be calculated for each region. Confirm with the Capacity PoC, Megan Baker, that enough resources are available. If quota is insufficient, additional resources must be requested, which may take several weeks.

Model Registration

The model is registered through the Model Registration API. If this is the first time you are registering your model, contact the Hosting PoC Placeholder for authorization. Detailed information about the API can be found in Model Registration API.
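As an illustration only, a registration request might look like the sketch below. The endpoint URL, auth scheme, and payload fields are all assumptions; the Model Registration API documentation defines the real contract.

```python
# Hypothetical registration request. The endpoint, auth header, and every
# payload field are assumptions -- consult the Model Registration API docs
# for the actual contract.
import requests

payload = {
    "name": "example-safety-model",        # model name and version
    "version": "1.0.0",
    "modality": "text",                    # model modality
    "harmCategories": ["hate", "violence"],
    "targetSku": "Singularity.ND12am_A100_v4",
    "maxPayloadSizeBytes": 1_048_576,      # service-capability fields
    "targetRpsPerRegion": {"eastus": 200},
}

resp = requests.post(
    "https://<registration-endpoint>/models",     # placeholder URL
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # placeholder credential
)
resp.raise_for_status()
print(resp.json())
```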

Inform the Hosting PoC to Start Rollout

For the Fast-tracked Tier, inform the Hosting PoC Placeholder to start the rollout process, which is managed by the RAI OPS Platform; you will be updated on the rollout progress. For the Regular Tier, rollout is triggered automatically and no action is required from the customer.

Need Help?

For assistance, contact us at placeholder@placeholder.